
Commit eb34d44

add emnlp news and wiki preprint
1 parent 3e9c14f commit eb34d44

2 files changed: +32 −0 lines changed

_bibliography/preprints.bib

Lines changed: 18 additions & 0 deletions
@@ -15,3 +15,21 @@ @misc{ploeger2024principled
   publisher = {arXiv},
   bibtex_show = true
 }
+
+@misc{tatariya2024how,
+  title = {How {{Good}} Is {{Your Wikipedia}}?},
+  author = {Tatariya*, Kushal and Kulmizev*, Artur and Poelman, Wessel and Ploeger, Esther and Bollmann, Marcel and Bjerva, Johannes and Luo, Jiaming and Lent, Heather and de Lhoneux, Miryam},
+  year = {2024},
+  month = nov,
+  number = {arXiv:2411.05527},
+  eprint = {2411.05527},
+  publisher = {arXiv},
+  url = {http://arxiv.org/abs/2411.05527},
+  urldate = {2024-11-13},
+  abstract = {Wikipedia's perceived high quality and broad language coverage have established it as a fundamental resource in multilingual NLP. In the context of low-resource languages, however, these quality assumptions are increasingly being scrutinised. This paper critically examines the data quality of Wikipedia in a non-English setting by subjecting it to various quality filtering techniques, revealing widespread issues such as a high percentage of one-line articles and duplicate articles. We evaluate the downstream impact of quality filtering on Wikipedia and find that data quality pruning is an effective means for resource-efficient training without hurting performance, especially for low-resource languages. Moreover, we advocate for a shift in perspective from seeking a general definition of data quality towards a more language- and task-specific one. Ultimately, we aim for this study to serve as a guide to using Wikipedia for pretraining in a multilingual setting.},
+  keywords = {Computer Science - Computation and Language},
+  archiveprefix = {arxiv},
+  abbr = {arXiv},
+  primaryclass = {cs},
+  bibtex_show = true
+}
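As an aside, the internal consistency of an entry like the one added above can be checked mechanically: the `eprint` field should match the arXiv ID at the end of the `url` field. A minimal sketch, using only the standard library; the `field` helper and the abridged entry string are illustrative, not part of the commit:

```python
import re

# Abridged copy of the new BibTeX entry, keeping only the fields we check.
entry = r"""
@misc{tatariya2024how,
  eprint = {2411.05527},
  url = {http://arxiv.org/abs/2411.05527},
}
"""

def field(name, src):
    # Extract the value of a single `name = {value}` BibTeX field
    # (hypothetical helper; assumes no nested braces in the value).
    m = re.search(name + r"\s*=\s*\{([^}]*)\}", src)
    return m.group(1) if m else None

eprint = field("eprint", entry)
url = field("url", entry)
# The arXiv abstract URL should end in the eprint identifier.
assert url.endswith("/" + eprint)
print(eprint)  # -> 2411.05527
```

The same check could be run over every `@misc` arXiv entry in `_bibliography/preprints.bib` to catch copy-paste mismatches between `eprint`, `number`, and `url`.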

_news/2024-11-14-emnlp.md

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+---
+layout: page
+
+title: EMNLP 2024
+date: November 14, 2024
+---
+
+Our group presented three papers at EMNLP: two oral presentations at the main conference and one at the Multilingual Representation Learning Workshop!
+
+* Kushal Tatariya presented our work on Pixology, where we attempt to interpret what Pixel learns about language and vision. Paper: https://aclanthology.org/2024.emnlp-main.194/
+
+* Wessel Poelman presented our work on typological diversity, where we investigate what the community means by this expression and whether its use makes sense (spoiler alert: it doesn't). Paper: https://aclanthology.org/2024.emnlp-main.326/
+
+* Zeno Vandenbulcke and Lukas Vermeire investigated zero-shot POS tagging, looking at the impact of language relatedness and treebank quality on the success of cross-lingual transfer, and asking whether it even makes sense to investigate zero-shot POS tagging (virtually presented by Miryam de Lhoneux). Paper: https://aclanthology.org/2024.mrl-1.9/
