Our repository for the code, literature review, and results on SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning
- Run the setup.py file for imports to work :
python setup.py install
- Approaches:
- Plain GAReader (Baseline)
- Common-sense Reasoning
- Transformers-Based Model for Cloze-Style Question Answering
- Incorporating The Imperceptibility Rating from the Concreteness Rating Dataset
- The dataset is present in
data/Imperceptibility/Concreteness Ratings
.
- Approaches:
- Use WordNet's hypernyms, augment the dataset (replace nouns with their hypernyms), pretrain BERT on this augmented dataset.
Word/Char Python Split-Based
|
Task_1_train |
Task_1_dev |
Task_2_train |
Task_2_dev |
Samples |
3227 |
837 |
3318 |
851 |
Duplicated Examples |
0 |
0 |
0 |
0 |
Duplicated Articles |
57 |
4 |
55 |
2 |
Mean Char Count (Articles) |
1548.47 |
1582.79 |
2448.93 |
2501.41 |
Std Char Count (Articles) |
905.301 |
992.333 |
1788.42 |
1809.55 |
Max Char Count (Articles) |
10177 |
9563 |
10370 |
9628 |
Min Char Count (Articles) |
162 |
255 |
169 |
247 |
Mean Word Count (Articles) |
262.44 |
268.608 |
417.727 |
427.048 |
Std Word Count (Articles) |
154.534 |
171.253 |
307.248 |
310.994 |
Max Word Count (Articles) |
1754 |
1641 |
1700 |
1651 |
Min Word Count (Articles) |
31 |
41 |
31 |
45 |
Duplicated Questions |
0 |
0 |
1 |
0 |
Mean Char Count (Questions) |
137.826 |
137.455 |
148.234 |
149.385 |
Std Char Count (Questions) |
31.1171 |
30.6809 |
47.7494 |
46.9035 |
Max Char Count (Questions) |
389 |
387 |
443 |
413 |
Min Char Count (Questions) |
32 |
48 |
31 |
37 |
Mean Word Count (Questions) |
24.6771 |
24.4695 |
26.8852 |
27.2127 |
Std Word Count (Questions) |
6.18453 |
6.04612 |
9.1822 |
9.16159 |
Max Word Count (Questions) |
73 |
73 |
83 |
83 |
Min Word Count (Questions) |
6 |
9 |
7 |
6 |
|
Task_1_train |
Task_1_dev |
Task_2_train |
Task_2_dev |
Mean Token Count (Articles) |
332.603 |
341.452 |
528.58 |
539.303 |
Std Token Count (Articles) |
193.729 |
217.673 |
387.756 |
396.286 |
Max Token Count (Articles) |
2206 |
2043 |
2188 |
2272 |
Min Token Count (Articles) |
38 |
52 |
46 |
52 |
Mean Token Count (Questions) |
28.6319 |
28.3441 |
30.9629 |
31.1657 |
Std Token Count (Questions) |
6.83765 |
6.75112 |
10.0225 |
9.98314 |
Max Token Count (Questions) |
84 |
84 |
91 |
89 |
Min Token Count (Questions) |
9 |
11 |
9 |
8 |
Note :
- Word is based on .split() on Python strings, including all the words (stopwords).
We can also use torch/nltk tokenizers to get tokens and check their lengths.
- The spaces are also included in character counting.
- No preprocessing was done for either of the statistics.
- For questions, we do not remove @placeholder.
Note :
- Tokens were made using NLTK Tokenizer, and only alphanumeric characters, after removing stop words were used.
|
Task_1_train |
Task_1_dev |
Task_2_train |
Task_2_dev |
Unique Token Count |
41087 |
20447 |
53285 |
25779 |
Total Count |
482713 |
128258 |
775342 |
203877 |
|
Task_1_train |
|
|
Task_1_dev |
|
|
Task_2_train |
|
|
Task_2_dev |
|
|
|
Word |
Count |
Frequency |
Word |
Count |
Frequency |
Word |
Count |
Frequency |
Word |
Count |
Frequency |
0 |
said |
7978 |
0.0165274 |
said |
1940 |
0.0151258 |
said |
9594 |
0.0123739 |
said |
2524 |
0.01238 |
1 |
mr |
2465 |
0.00510655 |
mr |
701 |
0.00546555 |
mr |
3549 |
0.00457733 |
mr |
993 |
0.00487058 |
2 |
would |
2194 |
0.00454514 |
would |
593 |
0.00462349 |
would |
3539 |
0.00456444 |
would |
954 |
0.00467929 |
3 |
also |
2019 |
0.00418261 |
also |
506 |
0.00394517 |
people |
3335 |
0.00430133 |
people |
872 |
0.00427709 |
4 |
people |
1897 |
0.00392987 |
one |
486 |
0.00378924 |
one |
3182 |
0.004104 |
one |
814 |
0.0039926 |
5 |
one |
1705 |
0.00353212 |
people |
481 |
0.00375025 |
also |
2784 |
0.00359067 |
also |
675 |
0.00331082 |
6 |
last |
1461 |
0.00302664 |
first |
395 |
0.00307973 |
says |
2201 |
0.00283875 |
two |
562 |
0.00275656 |
7 |
two |
1349 |
0.00279462 |
last |
391 |
0.00304854 |
two |
2112 |
0.00272396 |
says |
547 |
0.00268299 |
8 |
first |
1344 |
0.00278426 |
new |
372 |
0.0029004 |
years |
2084 |
0.00268785 |
new |
545 |
0.00267318 |
9 |
new |
1341 |
0.00277805 |
told |
349 |
0.00272108 |
new |
2081 |
0.00268398 |
time |
532 |
0.00260942 |
10 |
year |
1262 |
0.00261439 |
two |
346 |
0.00269769 |
time |
2064 |
0.00266205 |
us |
531 |
0.00260451 |
11 |
years |
1248 |
0.00258539 |
us |
342 |
0.0026665 |
could |
2002 |
0.00258209 |
could |
517 |
0.00253584 |
12 |
could |
1222 |
0.00253152 |
years |
340 |
0.00265091 |
last |
1957 |
0.00252405 |
years |
516 |
0.00253094 |
13 |
told |
1203 |
0.00249216 |
year |
329 |
0.00256514 |
first |
1944 |
0.00250728 |
first |
509 |
0.0024966 |
14 |
us |
1187 |
0.00245902 |
time |
307 |
0.00239361 |
us |
1820 |
0.00234735 |
told |
493 |
0.00241812 |
15 |
time |
1159 |
0.00240101 |
could |
291 |
0.00226886 |
year |
1768 |
0.00228028 |
year |
462 |
0.00226607 |
16 |
government |
1036 |
0.0021462 |
bbc |
256 |
0.00199598 |
like |
1727 |
0.0022274 |
last |
455 |
0.00223174 |
17 |
made |
940 |
0.00194733 |
added |
246 |
0.00191801 |
told |
1659 |
0.0021397 |
bbc |
418 |
0.00205026 |
18 |
bbc |
894 |
0.00185203 |
made |
236 |
0.00184004 |
police |
1504 |
0.00193979 |
like |
408 |
0.00200121 |
19 |
police |
892 |
0.00184789 |
get |
227 |
0.00176987 |
get |
1476 |
0.00190368 |
government |
404 |
0.00198159 |
|
Task_1_train |
Task_1_dev |
Task_2_train |
Task_2_dev |
Unique Token Count |
10032 |
4602 |
11098 |
4966 |
Total Count |
43683 |
11289 |
47297 |
12255 |
|
Task_1_train |
|
|
Task_1_dev |
|
|
Task_2_train |
|
|
Task_2_dev |
|
|
|
Word |
Count |
Frequency |
Word |
Count |
Frequency |
Word |
Count |
Frequency |
Word |
Count |
Frequency |
0 |
placeholder |
3227 |
0.0738731 |
placeholder |
837 |
0.074143 |
placeholder |
3318 |
0.0701524 |
placeholder |
851 |
0.069441 |
1 |
said |
268 |
0.00613511 |
new |
69 |
0.00611214 |
new |
229 |
0.00484174 |
new |
71 |
0.00579355 |
2 |
year |
230 |
0.00526521 |
says |
62 |
0.00549207 |
year |
218 |
0.00460917 |
two |
60 |
0.00489596 |
3 |
new |
222 |
0.00508207 |
year |
56 |
0.00496058 |
two |
209 |
0.00441888 |
says |
57 |
0.00465116 |
4 |
says |
197 |
0.00450976 |
us |
48 |
0.00425193 |
one |
201 |
0.00424974 |
year |
54 |
0.00440636 |
5 |
two |
172 |
0.00393746 |
said |
46 |
0.00407476 |
said |
200 |
0.0042286 |
said |
54 |
0.00440636 |
6 |
first |
159 |
0.00363986 |
first |
43 |
0.00380902 |
man |
191 |
0.00403831 |
us |
49 |
0.00399837 |
7 |
world |
158 |
0.00361697 |
two |
43 |
0.00380902 |
police |
185 |
0.00391145 |
people |
48 |
0.00391677 |
8 |
one |
148 |
0.00338805 |
world |
38 |
0.00336611 |
people |
177 |
0.00374231 |
one |
47 |
0.00383517 |
9 |
man |
143 |
0.00327358 |
uk |
35 |
0.00310036 |
first |
174 |
0.00367888 |
man |
46 |
0.00375357 |
10 |
people |
141 |
0.0032278 |
one |
34 |
0.00301178 |
world |
168 |
0.00355202 |
first |
44 |
0.00359037 |
11 |
us |
129 |
0.00295309 |
man |
32 |
0.00283462 |
years |
154 |
0.00325602 |
years |
37 |
0.00301918 |
12 |
police |
127 |
0.00290731 |
people |
32 |
0.00283462 |
us |
147 |
0.00310802 |
police |
36 |
0.00293758 |
13 |
uk |
127 |
0.00290731 |
police |
31 |
0.00274604 |
uk |
133 |
0.00281202 |
uk |
32 |
0.00261118 |
14 |
wales |
118 |
0.00270128 |
former |
31 |
0.00274604 |
says |
130 |
0.00274859 |
world |
31 |
0.00252958 |
15 |
years |
115 |
0.0026326 |
england |
31 |
0.00274604 |
three |
113 |
0.00238916 |
could |
31 |
0.00252958 |
16 |
former |
112 |
0.00256393 |
wales |
29 |
0.00256887 |
england |
111 |
0.00234687 |
city |
28 |
0.00228478 |
17 |
government |
110 |
0.00251814 |
league |
29 |
0.00256887 |
old |
105 |
0.00222001 |
found |
27 |
0.00220318 |
18 |
could |
109 |
0.00249525 |
years |
28 |
0.00248029 |
could |
102 |
0.00215658 |
may |
27 |
0.00220318 |
19 |
city |
106 |
0.00242657 |
three |
25 |
0.00221455 |
last |
100 |
0.0021143 |
bbc |
27 |
0.00220318 |
Note:
- BERT Tokenizer based splits
- Process - Take passages, convert to lowercase, remove all non-alphanumeric characters, tokenize, then remove stop words.
|
Task_1_train |
Task_1_dev |
Task_2_train |
Task_2_dev |
Unique Token Count |
21952 |
15858 |
23842 |
18157 |
Total Count |
559481 |
148611 |
898160 |
235177 |
|
Task_1_train |
|
|
Task_1_dev |
|
|
Task_2_train |
|
|
Task_2_dev |
|
|
|
Word |
Count |
Frequency |
Word |
Count |
Frequency |
Word |
Count |
Frequency |
Word |
Count |
Frequency |
0 |
##s |
8094 |
0.014467 |
##s |
2393 |
0.0161024 |
##s |
13511 |
0.015043 |
##s |
3477 |
0.0147846 |
1 |
said |
7994 |
0.0142882 |
said |
1941 |
0.0130609 |
said |
9597 |
0.0106852 |
said |
2525 |
0.0107366 |
2 |
mr |
2466 |
0.00440766 |
mr |
702 |
0.00472374 |
##t |
4282 |
0.00476752 |
##t |
1068 |
0.00454126 |
3 |
##t |
2266 |
0.00405018 |
##t |
674 |
0.00453533 |
mr |
3550 |
0.00395253 |
mr |
993 |
0.00422235 |
4 |
would |
2204 |
0.00393937 |
would |
595 |
0.00400374 |
would |
3549 |
0.00395141 |
would |
964 |
0.00409904 |
5 |
also |
2020 |
0.00361049 |
one |
513 |
0.00345197 |
one |
3348 |
0.00372762 |
people |
873 |
0.0037121 |
6 |
people |
1897 |
0.00339064 |
also |
506 |
0.00340486 |
people |
3335 |
0.00371315 |
one |
855 |
0.00363556 |
7 |
one |
1822 |
0.00325659 |
people |
481 |
0.00323664 |
also |
2786 |
0.0031019 |
also |
675 |
0.00287018 |
8 |
two |
1479 |
0.00264352 |
first |
408 |
0.00274542 |
two |
2277 |
0.00253518 |
two |
611 |
0.00259804 |
9 |
last |
1477 |
0.00263995 |
last |
399 |
0.00268486 |
says |
2201 |
0.00245057 |
new |
557 |
0.00236843 |
10 |
first |
1417 |
0.0025327 |
new |
380 |
0.00255701 |
new |
2129 |
0.0023704 |
says |
547 |
0.00232591 |
11 |
new |
1372 |
0.00245227 |
two |
379 |
0.00255028 |
years |
2084 |
0.0023203 |
us |
546 |
0.00232166 |
12 |
year |
1281 |
0.00228962 |
told |
349 |
0.00234841 |
time |
2076 |
0.00231139 |
time |
540 |
0.00229614 |
13 |
years |
1248 |
0.00223064 |
us |
345 |
0.0023215 |
first |
2056 |
0.00228912 |
first |
525 |
0.00223236 |
14 |
could |
1224 |
0.00218774 |
years |
340 |
0.00228785 |
could |
2007 |
0.00223457 |
could |
518 |
0.0022026 |
15 |
##m |
1219 |
0.0021788 |
##e |
339 |
0.00228112 |
last |
1975 |
0.00219894 |
years |
516 |
0.00219409 |
16 |
us |
1206 |
0.00215557 |
year |
330 |
0.00222056 |
us |
1844 |
0.00205309 |
told |
493 |
0.00209629 |
17 |
told |
1203 |
0.00215021 |
bbc |
328 |
0.0022071 |
year |
1793 |
0.0019963 |
bbc |
475 |
0.00201976 |
18 |
time |
1173 |
0.00209659 |
time |
311 |
0.00209271 |
##ing |
1785 |
0.0019874 |
##e |
472 |
0.002007 |
19 |
bbc |
1112 |
0.00198756 |
could |
291 |
0.00195813 |
##e |
1761 |
0.00196068 |
year |
468 |
0.00198999 |
|
Task_1_train |
Task_1_dev |
Task_2_train |
Task_2_dev |
Unique Token Count |
9771 |
4876 |
10702 |
5251 |
Total Count |
51629 |
13284 |
55858 |
14331 |
|
Task_1_train |
|
|
Task_1_dev |
|
|
Task_2_train |
|
|
Task_2_dev |
|
|
|
Word |
Count |
Frequency |
Word |
Count |
Frequency |
Word |
Count |
Frequency |
Word |
Count |
Frequency |
0 |
place |
3256 |
0.0630653 |
place |
844 |
0.0635351 |
place |
3349 |
0.0599556 |
place |
859 |
0.05994 |
1 |
##holder |
3227 |
0.0625036 |
##holder |
837 |
0.0630081 |
##holder |
3318 |
0.0594006 |
##holder |
851 |
0.0593818 |
2 |
said |
272 |
0.00526836 |
new |
69 |
0.00519422 |
##s |
333 |
0.00596155 |
new |
72 |
0.00502407 |
3 |
year |
230 |
0.00445486 |
##s |
66 |
0.00496838 |
new |
230 |
0.00411758 |
##s |
72 |
0.00502407 |
4 |
##s |
223 |
0.00431928 |
says |
62 |
0.00466727 |
year |
218 |
0.00390275 |
two |
60 |
0.00418673 |
5 |
new |
222 |
0.00429991 |
year |
56 |
0.0042156 |
two |
209 |
0.00374163 |
says |
57 |
0.00397739 |
6 |
says |
197 |
0.00381568 |
us |
48 |
0.00361337 |
one |
201 |
0.00359841 |
year |
54 |
0.00376806 |
7 |
two |
172 |
0.00333146 |
said |
46 |
0.00346281 |
said |
200 |
0.00358051 |
said |
54 |
0.00376806 |
8 |
first |
160 |
0.00309903 |
first |
43 |
0.00323698 |
man |
196 |
0.0035089 |
man |
50 |
0.00348894 |
9 |
world |
158 |
0.0030603 |
two |
43 |
0.00323698 |
police |
185 |
0.00331197 |
us |
49 |
0.00341916 |
10 |
man |
150 |
0.00290534 |
world |
38 |
0.00286058 |
people |
177 |
0.00316875 |
people |
48 |
0.00334938 |
11 |
one |
148 |
0.00286661 |
uk |
36 |
0.00271003 |
first |
174 |
0.00311504 |
one |
47 |
0.0032796 |
12 |
people |
141 |
0.00273102 |
man |
35 |
0.00263475 |
world |
168 |
0.00300763 |
first |
44 |
0.00307027 |
13 |
uk |
135 |
0.00261481 |
one |
34 |
0.00255947 |
years |
154 |
0.00275699 |
years |
37 |
0.00258182 |
14 |
us |
130 |
0.00251796 |
people |
32 |
0.00240891 |
us |
147 |
0.00263167 |
police |
36 |
0.00251204 |
15 |
police |
129 |
0.0024986 |
police |
31 |
0.00233363 |
uk |
142 |
0.00254216 |
uk |
33 |
0.0023027 |
16 |
wales |
118 |
0.00228554 |
former |
31 |
0.00233363 |
says |
130 |
0.00232733 |
world |
32 |
0.00223292 |
17 |
years |
115 |
0.00222743 |
england |
31 |
0.00233363 |
three |
113 |
0.00202299 |
could |
31 |
0.00216314 |
18 |
former |
112 |
0.00216932 |
wales |
29 |
0.00218308 |
england |
111 |
0.00198718 |
city |
28 |
0.00195381 |
19 |
government |
110 |
0.00213059 |
league |
29 |
0.00218308 |
old |
105 |
0.00187977 |
found |
27 |
0.00188403 |