Project SADiLaR

Client Project Focus:

SADiLaR

 

About SADiLaR


The South African Centre for Digital Language Resources (SADiLaR) is a national centre supported by the Department of Science and Innovation (DSI) as part of the new South African Research Infrastructure Roadmap (SARIR). SADiLaR has an enabling function, focusing on all official languages of South Africa, supporting research and development in the domains of language technologies and language-related studies in the humanities and social sciences.

The Centre supports the creation, management and distribution of digital language resources and applicable software, which are freely available for research purposes through the Language Resource Catalogue.


 

CURRENT

Linguistic Corpus Enrichment for South African Languages

 

PURPOSE

With this project, we aim to update the morphology protocols for all five disjunctive languages, with full descriptions of the annotation procedure and the tag sets or standards used on each level. These languages are Sesotho, Sesotho sa Leboa, Setswana, Xitsonga and Tshivenḓa.

 

OUTCOMES

  • 45,000 tokens in four conjunctive languages converted to the current tag sets for morphology;
  • 75,000 tokens in five disjunctive languages converted to the current tag sets for morphology;
  • 25,000 tokens in two languages annotated for part of speech across five different text types.

 

DOWNLOAD THE RESOURCES

Coming soon


CURRENT

Dictionary App

 

PURPOSE

Collaborating with SADiLaR, we developed South Africa’s first dictionary application and web portal, which functions in all 11 of South Africa’s official languages. The hybrid mobile application is built with NativeScript, an open-source framework for building truly native mobile applications, and Angular, a platform for building mobile and desktop web applications. This combination lets us create and maintain a single code base for multiple mobile operating systems, so the app will be available on both Android and iOS. An offline version stores the dictionary content locally, and users can update their offline database from the online version.

 

OUTCOMES

A generic Android and iOS app with customized settings will be available to download.

The dictionary manager is written in Angular and connects to the Translation Management System (TMS), amalgamated databases, and the dictionary API to facilitate the management of all the dictionaries accessible via the API.

 

DOWNLOAD THE RESOURCES

Coming soon


CURRENT

Autshumato 6

 

PURPOSE

This project aims to maintain and update the Autshumato Machine Translation (MT) systems and comprises three subprojects. Subproject 1 focuses on maintaining and updating the underlying software systems: the Integrated Translation Environment (ITE), the Terminology Management System (TMS) and the MT Web Service (MTWS). Subproject 2 maintains the MT systems themselves using updated parallel corpora, covering six language pairs: English-Afrikaans, English-isiZulu, English-Sepedi, English-Sesotho, English-Setswana and English-Xitsonga. Subproject 3 addresses the development of reverse MT systems and will provide eight systems for automatic translation from Afrikaans, isiNdebele, isiZulu, Sepedi, Sesotho, Setswana, Tshivenḓa and Xitsonga into English.

 

OUTCOMES

  • Updated versions of ITE, TMS and MTWS;
  • Six updated MT systems for translation from English into Afrikaans, isiZulu, Sepedi, Sesotho, Setswana and Xitsonga;
  • Eight MT systems for automatic translation from Afrikaans, isiNdebele, isiZulu, Sepedi, Sesotho, Setswana, Tshivenḓa and Xitsonga into English;
  • Updated parallel and monolingual corpora for relevant languages.

 

 

DOWNLOAD THE RESOURCES

Coming soon


CURRENT

Python and Neural NLP Resources for South African Languages

 

PURPOSE

This project has two main goals. The first is to make existing NCHLT core technologies available as open-source Python libraries, giving developers access to individual technologies for individual languages in the Python programming language. The second is to build the resources needed to create deep neural implementations of various natural language processing (NLP) technologies and to release neural sequence labelling technologies for ten South African languages across three different Python NLP packages.

 

OUTCOMES / RESULTS

  • Open-source Python package for existing NCHLT core technologies for ten South African languages (tokenisers, sentence separators, part-of-speech taggers, named entity recognisers, phrase chunkers, optical character recognisers, and a language identifier);
  • Open-source Python implementation of previously released SADiLaR morphological taggers, part-of-speech taggers, and lemmatisers for four conjunctively written South African languages, plus integration into existing tools, CTexTools and NCHLT HLT WebAPI, as well as the newly developed Python package;
  • Neural word embedding models for ten South African languages in six different embedding architectures;
  • New deep neural sequence labelling core technologies, viz. part-of-speech taggers, named entity recognisers, and phrase chunkers, for ten South African languages, implemented in three different open-source Python NLP packages.

 

DOWNLOAD THE RESOURCES

Coming soon

 


DELIVERED

Spelling Checkers for SA Languages

 

PURPOSE

SADiLaR funded the repackaging and updating of the CTexT Spelling Checkers for South African Languages, making the installer available to any person or business for download from the SADiLaR website, free of charge.

 

The package bundles the spelling checkers and hyphenators for all 10 indigenous South African languages into one installer, with Afrikaans now also included.

 

OUTCOMES

All spelling checkers achieved a minimum of 90% lexical recall and 95% error recall when evaluated on a 20 000-word test text.
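
As a rough illustration of how two figures like these can be computed, the sketch below assumes the commonly used definitions: lexical recall is the share of valid words that the checker accepts, and error recall is the share of invalid words that the checker flags. The function name and the toy data are hypothetical and are not taken from the actual evaluation.

```python
def spelling_checker_recall(judgements):
    """Compute (lexical_recall, error_recall) for a spelling checker.

    judgements: iterable of (is_valid_word, was_flagged) pairs, one per
    token of the test text. Both definitions are assumptions, see above.
    """
    valid_total = valid_accepted = 0
    invalid_total = invalid_flagged = 0
    for is_valid_word, was_flagged in judgements:
        if is_valid_word:
            valid_total += 1
            if not was_flagged:
                valid_accepted += 1  # valid word correctly accepted
        else:
            invalid_total += 1
            if was_flagged:
                invalid_flagged += 1  # misspelling correctly flagged
    lexical_recall = valid_accepted / valid_total if valid_total else 0.0
    error_recall = invalid_flagged / invalid_total if invalid_total else 0.0
    return lexical_recall, error_recall


# Made-up toy data: 10 valid words (1 wrongly flagged) and
# 20 misspellings (1 missed by the checker).
sample = (
    [(True, False)] * 9 + [(True, True)]
    + [(False, True)] * 19 + [(False, False)]
)
lexical, error = spelling_checker_recall(sample)
print(f"lexical recall: {lexical:.0%}, error recall: {error:.0%}")
```

On this toy sample the checker would just meet both thresholds quoted above; the real evaluation used a 20 000-word test text per language.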

 

DOWNLOADABLE CONTENT

Spelling Checkers for South African Languages

 


DELIVERED

July 2019-March 2022

Parallel corpora for English-Siswati

 

PURPOSE

This project entailed collecting and processing bilingual data to develop a 2 million word English-Siswati parallel corpus. A monolingual corpus for Siswati was also created and packaged with the parallel corpus as an additional value-added deliverable.

 

OUTCOMES

  • 2 million word parallel corpus for English-Siswati
  • 1,5 million word monolingual corpus for Siswati

 

DOWNLOAD THE RESOURCES

English Siswati Corpora

 


DELIVERED

July 2019-December 2019

Parallel corpora for English-isiXhosa

 

PURPOSE

In this project, a 1,85 million word parallel corpus for English-isiXhosa was developed. In addition, a monolingual isiXhosa corpus has been made available.

 

OUTCOMES / RESULTS

  • 1,85 million word parallel corpus for English-isiXhosa
  • 2,5 million word monolingual corpus for isiXhosa

 

DOWNLOAD THE RESOURCES

Download resources