Client Project Focus:
SADiLaR
About SADiLaR
The South African Centre for Digital Language Resources (SADiLaR) is a national centre supported by the Department of Science and Innovation (DSI) as part of the new South African Research Infrastructure Roadmap (SARIR). SADiLaR has an enabling function, focusing on all official languages of South Africa, supporting research and development in the domains of language technologies and language-related studies in the humanities and social sciences.
The Centre supports the creation, management and distribution of digital language resources and applicable software, which are freely available for research purposes through the Language Resource Catalogue.
CURRENT
Linguistic Corpus Enrichment for South African Languages
PURPOSE
With this project, we aim to update the morphology protocols for all five disjunctive languages, with full descriptions of the annotation procedure and of the tag sets or standards used at each level. The five languages are Sesotho, Sesotho sa Leboa, Setswana, Xitsonga and Tshivenḓa.
OUTCOMES
- 45,000 tokens in four conjunctive languages converted to the current tag sets for morphology;
- 75,000 tokens in five disjunctive languages converted to the current tag set for morphology;
- 25,000 tokens in two languages annotated for part of speech for five different text types.
DOWNLOAD THE RESOURCES
Coming soon
CURRENT
Dictionary App
PURPOSE
Collaborating with SADiLaR, we developed South Africa’s first dictionary application and web portal, which functions in all 11 of South Africa’s official languages. It is a hybrid mobile application built with NativeScript, an open-source framework for building truly native mobile applications, and Angular, a platform for building mobile and desktop web applications. This combination lets us create and maintain a single code base for multiple mobile operating systems, so the app will be available on both Android and iOS. An offline version stores the dictionary content locally, and users can update their offline database from the online version.
OUTCOMES
A generic Android and iOS app with customized settings will be available to download.
The dictionary manager is written in Angular and connects to the Translation Management System (TMS), the amalgamated databases, and the dictionary API to facilitate the management of all the dictionaries accessible via the API.
DOWNLOAD THE RESOURCES
Coming soon
CURRENT
Autshumato 6
PURPOSE
This project aims to maintain and update the Autshumato Machine Translation (MT) systems and comprises three subprojects. Subproject 1 focuses on maintaining and updating the underlying software systems: the Integrated Translation Environment (ITE), the Terminology Management System (TMS), and the MT Web service (MTWS). Subproject 2 maintains the MT systems themselves using updated parallel corpora. The six language pairs covered are English-Afrikaans, English-isiZulu, English-Sepedi, English-Sesotho, English-Setswana and English-Xitsonga. Subproject 3 addresses the development of reverse MT systems and will provide eight systems for automatic translation from Afrikaans, isiNdebele, isiZulu, Sepedi, Sesotho, Setswana, Tshivenḓa and Xitsonga into English.
OUTCOMES
- Updated versions of ITE, TMS and MTWS;
- Six updated MT systems for translation from English into Afrikaans, isiZulu, Sepedi, Sesotho, Setswana, and Xitsonga;
- Eight MT systems for automatic translation from Afrikaans, isiNdebele, isiZulu, Sepedi, Sesotho, Setswana, Tshivenḓa and Xitsonga into English;
- Updated parallel and monolingual corpora for relevant languages.
DOWNLOAD THE RESOURCES
Coming soon
CURRENT
Python and Neural NLP Resources for South African Languages
PURPOSE
There are two main goals for this project. Firstly, making existing NCHLT core technologies available as open-source Python libraries will allow developers to access individual technologies for individual languages in the Python programming language. Secondly, the project aims to build the resources needed to create deep neural implementations of various natural language processing (NLP) technologies and release neural sequence labelling technologies for ten South African languages across three different Python NLP packages.
OUTCOMES / RESULTS
- Open-source Python package for existing NCHLT core technologies for ten South African languages (tokenisers, sentence separators, part-of-speech taggers, named entity recognisers, phrase chunkers, optical character recognisers, and a language identifier);
- Open-source Python implementation of previously released SADiLaR morphological taggers, part of speech taggers, and lemmatisers for four conjunctively written South African languages, plus integration into existing tools, CTexTools and NCHLT HLT WebAPI, as well as the newly developed Python package;
- Neural word embedding models for ten South African languages in six different embedding architectures;
- New deep neural sequence labelling core technologies, viz. part of speech taggers, named entity recognisers, and phrase chunkers, for ten South African languages, implemented in three different open-source Python NLP packages.
DOWNLOAD THE RESOURCES
Coming soon
DELIVERED
Spelling Checkers for SA Languages
PURPOSE
SADiLaR funded the repackaging and updating of the CTexT Spelling Checkers for South African Languages, making the installer available to any person or business for download from the SADiLaR website, free of charge.
The package bundles the spelling checkers and hyphenators for all 10 indigenous South African languages into one installer, which now also includes Afrikaans.
OUTCOMES
All spelling checkers achieved a minimum of 90% lexical recall and 95% error recall when evaluated on a 20 000-word test text.
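The two figures above can be illustrated with a minimal sketch. It assumes the common definitions of these metrics, which this page does not spell out: lexical recall as the share of correctly spelled tokens that the checker accepts, and error recall as the share of misspelled tokens that the checker flags. The function name, the toy lexicon and the sample tokens are hypothetical and serve only as illustration; they are not the evaluation code or test text used in the project.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RecallScores:
    lexical_recall: float  # share of correctly spelled tokens the checker accepts
    error_recall: float    # share of misspelled tokens the checker flags


def evaluate_checker(
    tokens: Sequence[str],
    gold_is_valid: Sequence[bool],
    accepts: Callable[[str], bool],
) -> RecallScores:
    """Score a spelling checker against a gold-annotated test text.

    `accepts` is a hypothetical stand-in for the checker: it returns True
    when the checker treats the token as correctly spelled.
    """
    if len(tokens) != len(gold_is_valid):
        raise ValueError("tokens and gold labels must have the same length")

    valid_total = valid_accepted = 0
    error_total = error_flagged = 0
    for token, is_valid in zip(tokens, gold_is_valid):
        accepted = accepts(token)
        if is_valid:
            valid_total += 1
            valid_accepted += int(accepted)
        else:
            error_total += 1
            error_flagged += int(not accepted)

    if valid_total == 0 or error_total == 0:
        raise ValueError("test text needs both valid and misspelled tokens")

    return RecallScores(
        lexical_recall=valid_accepted / valid_total,
        error_recall=error_flagged / error_total,
    )


# Toy example: a three-word lexicon standing in for a real checker.
toy_lexicon = {"ntlo", "metsi", "batho"}
sample_tokens = ["ntlo", "metsi", "batho", "pula", "ntlu", "mtsi"]
sample_gold = [True, True, True, True, False, False]

scores = evaluate_checker(sample_tokens, sample_gold, toy_lexicon.__contains__)
print(f"lexical recall: {scores.lexical_recall:.2f}")
print(f"error recall:   {scores.error_recall:.2f}")
```

Under these assumed definitions, a checker meets the reported thresholds when `lexical_recall` is at least 0.90 and `error_recall` is at least 0.95 on the test text.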
DOWNLOADABLE CONTENT
Spelling Checkers for South African Languages
DELIVERED
July 2019-March 2022
Parallel corpora for English-Siswati
PURPOSE
This project entailed collecting and processing bilingual data to develop a 2 million word English-Siswati parallel corpus. A monolingual Siswati corpus was also created and packaged with the parallel corpus as an additional deliverable.
OUTCOMES
- 2 million word parallel corpus for English-Siswati
- 1,5 million word monolingual corpus for Siswati
DOWNLOAD THE RESOURCES
DELIVERED
July 2019-December 2019
Parallel corpora for English-isiXhosa
PURPOSE
In this project, a 1,85 million word parallel corpus for English-isiXhosa was developed. In addition, a monolingual isiXhosa corpus has been made available.
OUTCOMES / RESULTS
- 1,85 million word parallel corpus for English-isiXhosa
- 2,5 million word monolingual corpus for isiXhosa
DOWNLOAD THE RESOURCES