Qwanqwa (`qq`): Language Metadata

This is a language metadata toolkit to make it easier to work with a large variety of metadata from a single interface. Currently this metadata includes:

Official identifiers: iso-639-3, bcp-47, Glottocode, Wikidata, Wikipedia
Non-official identifiers: NLLB-style codes (zho_Hans, zho_Hant, etc.)¹
Geographic information
Speaker information
Writing systems
Names

Planned:

Phylogenetic information
Linking with CLLD datasets
Typological features: grambank, wals, uriel
A graph-based view to traverse between languoids (family trees, geo areas, shared features etc.)
Historical languages

Name: Qwanqwa is a phonetic spelling of 'ቋንቋ', which means language in Amharic; qq is nice and short to type.

Languoids

In qq, language-like entities as referred to as Languoids, since it includes dialects, macro-languages and language areas. Not all languoids have coverage for all features.

Number of languoids: 7511

Usage

pip install git+https://github.com/WPoelman/qwanqwa

Important: qq makes a strict distinction between None (don't know) and False (it is not the case). Make sure to keep this in mind when checking boolean values for truthiness, so if you're interested in missing values for example, avoid if not script.is_canonical:, but instead explicitly check if script.is_canonical is None:.

from qq import LanguageData, TagType

# Load from the pre-compiled database
ld = LanguageData.from_db()

# Access Languoid info using whatever official tag you have
nl1 = ld.get('nl', tag_type=TagType.BCP_47_CODE)
nl2 = ld.get('nld', tag_type=TagType.ISO_639_3_CODE)
# The `guess` method tries all known official tag types,
# be careful though since this can give unexpected resutls.
nl3 = ld.guess('dut')  # happens to be TagType.ISO_639_2_B

# In this case, these will give the same Languoid
assert nl1 == nl2 == nl3
> True

am = ld.get('am') # Default tag_type is BCP_47

# Language identifiers
am.iso_639_3_code
> 'amh'
am.glottocode
> 'amha1245'
am.wikidata_id
> 'Q28244',
am.wikipedia_id
> 'am'

# Also some non-standard identifiers, often used in NLP research
am.nllb_style_codes_iso_639_3
> ['amh_Ethi']
am.nllb_style_codes_bcp_47
> ['am_Ethi']

# Names in different languages
am.endonym
> 'አማርኛ'
am.name_data['fr'].name
> 'amharique'

# English description
am.language_description.description
> 'Semitic language of Ethiopia'

# Endangerment status
am.endangerment_status
> <Endangerment.SAFE: 'SAFE'>

# Scripts
am.canonical_scripts
> [
    Script(
        iso_15924_code='ethi',
        is_canonical=True,
        is_historical=None,
        is_religious=None,
        is_for_transliteration=None,
        is_for_accessibility=None,
        is_in_widespread_use=None,
        has_official_status=None,
        has_symbolic_value=None,
        source='GOOGLE_RESEARCH',
    )
]

# Mapping between codes
dir(ld.tag_conversion)
> [
    'bcp_47_code2glottocode',
    'bcp_47_code2iso_639_2b_code',
    'bcp_47_code2iso_639_3_code',
    'bcp_47_code2wikidata_id',
    'bcp_47_code2wikipedia_id',
    'glottocode2bcp_47_code',
    'glottocode2iso_639_2b_code',
    'glottocode2iso_639_3_code',
    'glottocode2wikidata_id',
    'glottocode2wikipedia_id',
    'iso_639_2b_code2bcp_47_code',
    'iso_639_2b_code2glottocode',
    'iso_639_2b_code2iso_639_3_code',
    'iso_639_2b_code2wikidata_id',
    'iso_639_2b_code2wikipedia_id',
    'iso_639_3_code2bcp_47_code',
    'iso_639_3_code2glottocode',
    'iso_639_3_code2iso_639_2b_code',
    ...
]

... and more, here are some full examples.

Sources

Please have a look at the sources, as this project would not have been possible without the hard work of a lot of people who created the data we use. Currently, all sources are available under Creative Commons BY or BY-SA licenses.

License

CC BY-SA 4.0

This is a combination of an iso-693-3 or bcp-47 language tag and iso-15924 script tag. This is similar to the first parts of an IETF Tag, which, confusingly, can also be referred to as a bcp-47 tag on its own. This is done in NLLB for instance. This is not wrong, but because data in qq is based on LinguaMeta, who interpret just the first part of a IETF tag to be a bcp-47 tag, we're sticking to LinguaMeta's interpretation of bcp-47 and refer to the combined tag as nllb_style. The iso-15924 part of the nllb_style tags are based on Glotscript, excluding Braille. ↩

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
docs		docs
scripts		scripts
src/qq		src/qq
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Qwanqwa (`qq`): Language Metadata

Languoids

Usage

Sources

License

About

Uh oh!

Uh oh!

Languages

License

WPoelman/qwanqwa

Folders and files

Latest commit

History

Repository files navigation

Qwanqwa (qq): Language Metadata

Languoids

Usage

Sources

License

Footnotes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Uh oh!

Languages

Qwanqwa (`qq`): Language Metadata