
Commit eb34d44

add emnlp news and wiki preprint
1 parent 3e9c14f commit eb34d44

2 files changed: +32 −0 lines changed

_bibliography/preprints.bib

Lines changed: 18 additions & 0 deletions
@@ -15,3 +15,21 @@ @misc{ploeger2024principled
   publisher = {arXiv},
   bibtex_show = true
 }
+
+@misc{tatariya2024how,
+  title = {How {{Good}} Is {{Your Wikipedia}}?},
+  author = {Tatariya*, Kushal and Kulmizev*, Artur and Poelman, Wessel and Ploeger, Esther and Bollmann, Marcel and Bjerva, Johannes and Luo, Jiaming and Lent, Heather and de Lhoneux, Miryam},
+  year = {2024},
+  month = nov,
+  number = {arXiv:2411.05527},
+  eprint = {2411.05527},
+  publisher = {arXiv},
+  url = {http://arxiv.org/abs/2411.05527},
+  urldate = {2024-11-13},
+  abstract = {Wikipedia's perceived high quality and broad language coverage have established it as a fundamental resource in multilingual NLP. In the context of low-resource languages, however, these quality assumptions are increasingly being scrutinised. This paper critically examines the data quality of Wikipedia in a non-English setting by subjecting it to various quality filtering techniques, revealing widespread issues such as a high percentage of one-line articles and duplicate articles. We evaluate the downstream impact of quality filtering on Wikipedia and find that data quality pruning is an effective means for resource-efficient training without hurting performance, especially for low-resource languages. Moreover, we advocate for a shift in perspective from seeking a general definition of data quality towards a more language- and task-specific one. Ultimately, we aim for this study to serve as a guide to using Wikipedia for pretraining in a multilingual setting.},
+  keywords = {Computer Science - Computation and Language},
+  archiveprefix = {arxiv},
+  abbr = {arXiv},
+  primaryclass = {cs},
+  bibtex_show = true
+}
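As an aside, the internal consistency of an entry like the one added above can be checked mechanically: the `eprint` field should match the arXiv ID at the end of the `url` field. A minimal sketch, using only the standard library; the `field` helper and the abridged entry string are illustrative, not part of the commit:

```python
import re

# Abridged copy of the new BibTeX entry, keeping only the fields we check.
entry = r"""
@misc{tatariya2024how,
  eprint = {2411.05527},
  url = {http://arxiv.org/abs/2411.05527},
}
"""

def field(name, src):
    # Extract the value of a single `name = {value}` BibTeX field
    # (hypothetical helper; assumes no nested braces in the value).
    m = re.search(name + r"\s*=\s*\{([^}]*)\}", src)
    return m.group(1) if m else None

eprint = field("eprint", entry)
url = field("url", entry)
# The arXiv abstract URL should end in the eprint identifier.
assert url.endswith("/" + eprint)
print(eprint)  # -> 2411.05527
```

The same check could be run over every `@misc` arXiv entry in `_bibliography/preprints.bib` to catch copy-paste mismatches between `eprint`, `number`, and `url`.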

_news/2024-11-14-emnlp.md

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+---
+layout: page
+
+title: EMNLP 2024
+date: November 14, 2024
+---
+
+Our group presented three papers at EMNLP: two oral presentations at the main conference and one at the Multilingual Representation Learning Workshop!
+
+* Kushal Tatariya presented our work on Pixology, where we attempt to interpret what Pixel learns about language and vision. Paper: https://aclanthology.org/2024.emnlp-main.194/
+
+* Wessel Poelman presented our work on typological diversity, where we investigate what the community means by this expression and whether its use makes sense (spoiler alert: it doesn't). Paper: https://aclanthology.org/2024.emnlp-main.326/
+
+* Zeno Vandenbulcke and Lukas Vermeire investigated zero-shot POS tagging, looking at the impact of language relatedness and treebank quality on the success of cross-lingual transfer, and asking whether it even makes sense to investigate zero-shot POS tagging (virtually presented by Miryam de Lhoneux). Paper: https://aclanthology.org/2024.mrl-1.9/
