Linguistic Open Database Being Prepared by the Language Commission in Collaboration with UNESCO (with the working papers of Nepal and UNESCO)

 


In the context of the International Decade of Indigenous Languages (2022-2032), the Language Commission and UNESCO are to collaborate on the conservation and development of Nepal's languages through documentation, digital archiving, and intergenerational transmission. UNESCO has prepared open databases of various endangered languages on the African continent and, through artificial intelligence, has brought technology to those languages and increased access to them. In the past, too, a study of the interrelationships among nine languages of Sudurpashchim Province was completed through collaboration between the Language Commission and UNESCO Nepal.

The interaction programme, chaired by Language Commission Chair Dr. Lava Deo Awasthi, was attended by Commission member Usha Hamal, Acting Secretary Laxmi Prasad Bhattarai, and other staff. On UNESCO's side, Dr. Bhanu Neupane of UNESCO Headquarters, UNESCO Country Director Dr. Balaram Timalsina, and other UNESCO staff took part. Dr. Balaram Prasain of the Central Department of Linguistics, Kirtipur, and Dr. Bal Krishna Bal, head of the Department of Science and Technology at Kathmandu University, were present as experts.

Dr. Lok Bahadur Lopchan, head of the Mother Tongue Education Section of the Language Commission, presented a working paper on behalf of the Commission on revitalizing endangered languages through paper and digital documentation of Nepal's languages together with intergenerational language transmission. Expert Dr. Balaram Prasain then said that the languages spoken in Nepal should be brought into technology and a single database prepared, and suggested that small and large languages be taken into machine translation together. Likewise, the other expert, Dr. Bal Krishna Bal, said that alongside documenting Nepal's languages, emphasis should be placed on their use by building language-related digital apps.

At the programme, where the Commission's chair, members, and staff presented various ideas for the conservation and development of Nepal's languages, UNESCO's headquarters representative Dr. Bhanu Neupane said that an integrated open database of language-related work in Nepal should be prepared first. He also expressed a commitment to increase access to language use by bringing technology to small and large languages according to need and preparing open data. The working papers presented by the Language Commission and UNESCO at the programme are given below:

Language Commission: Dr. Lok Bahadur Lopchan

Collaboration between the Language Commission and UNESCO Nepal for Revitalizing Five Endangered Languages of Nepal in the Context of the International Decade of Indigenous Languages (2022-2032)

(Documentation, Digitalization and Intergenerational Transfer)

Introduction

Nepal is diverse not only in geographical landscape but also in ethnicity, religion, language, culture, and more. According to the 2011 population census, Nepal has 125 ethnic groups and more than 123 languages. The Language Commission has recognized eight languages in collaboration with the Central Department of Linguistics, Tribhuvan University, after conducting sociolinguistic studies of those languages.

Statistically, 37 languages are endangered because each has fewer than one thousand speakers. The Dura language has no speaker left within the Dura community itself; its one speaker, Muktinath Ghimire, comes from another community and is a former teacher at a school in the Dura community. Kusunda is likewise close to extinction: Kamala Sen Khatri is its only speaker. Tilung has only two competent speakers, Lungkhim has no more than 6 speakers, and Baram has fewer than a dozen. All of these speakers belong to the first generation and speak the languages only a little, while the second generation cannot speak them, so the third generation has lost the chance to learn and speak them. These five languages are therefore severely endangered. To revitalize them, intensive-care programs should be launched to keep the languages alive: documenting and digitalizing them, and transferring them to all community members by running community language classes and teaching them as a mother language to all students from the language communities concerned.

The Language Commission has been running language classes in three of those five languages. In addition, manual and digital documentation began three years ago in all five languages. But the Commission's endeavour alone is not sufficient to revitalize languages in this condition.

The International Decade of Indigenous Languages is an impetus for indigenous communities to revitalize, conserve, preserve, develop, and use those languages, by them and for them, with the support of UNESCO, the Language Commission, and all concerned development organizations. The crux of this program is that no child is left behind in learning and using these mother languages. Those five endangered languages will then be revitalized and come into use in daily life.

The status of those five languages is as follows:

| S.N. | Province | District | Local level | Language | Population | Speakers (2011 data) | Speakers (vitality: field survey) | Remarks |
|---|---|---|---|---|---|---|---|---|
| 1. | Province-1 | Khotang | Halesi Tuwaching Municipality | Tilung | 1,424 | 1403 | 2 | Language class: Lc: 2 times |
| | | Illam | Suryodaya Municipality | Lungkhim | 129 | 121 | 6 | Language class: proposed, 2078 |
| 2. | Gandaki | Gorkha | Barpak Sulikot Rural Municipality | Baram | 8,140 | 63 | 12 | Language class: Lc: 2 times |
| | | Lamjung | Sundarbajar Municipality | Dura | 5,394 | 1 | 0 | Language class: proposed, 2078 |
| 3. | Lumbini | Dang | Ghorahi Sub-Metropolitan City | Kusunda | 273 | 5 | 1 | Language class: Lc: 2 times |

 

Objectives of the Program

The objectives of this program are as follows:

  1. Conduct a baseline survey and community language mapping, and set program priorities for those languages.
  2. Document those languages manually and archive them digitally.
  3. Transfer the language to all members of the concerned community by running community language classes and teaching it as a subject in schools from class 1 to class 8.
  4. Set up an inter-language learning climate to expand the use of those languages.
  5. Enhance the capacity of members of community-related organizations and conduct training for language facilitators and resource persons of those languages. An intensive teacher training or counselling program should also be organized for teachers who will teach the mother language as a school subject.
  6. Monitoring, sharing, feedback, and evaluation: each trimester, each semester, and annually.

 Major programs:

The following strategies and major programs have been set to achieve the aforementioned objectives:

S.N., program title, and programs in detail:

  1. Baseline survey, community language mapping, program priority setting
     a. Baseline survey: field observation, consultation or interaction program, and field data collection
     b. Community language mapping
     c. Program priority setting
     d. Study report preparation
  2. Manual and digital documentation
     a. Manual documentation: orthography, word collection, dictionary making, grammar writing, language history, corpus development, preparation of text materials for language classes, language subject textbooks for classes 1-8, calendar
     b. Digital documentation: orthography, multilingual dictionary, digital corpora (stories, songs, essays, conversations, vignettes), audio-video recordings of 30 lessons of teaching-learning activities, video clips of speakers (micro film of cultural programs), archive of programs in the native language (cultural events, formal assemblies, celebrations, rituals)
  3. Intergenerational language transfer
     a. Sensitize, or require, every member to transfer the native language to children
     b. Community language learning class
     c. Teach the language as a subject in classes 1-8 in school to every child of the concerned community
  4. Inter-community, inter-language learning climate setting
     a. Multicultural contests or programs in the community, at the local level, and in schools
     b. Motivate non-native teachers and children in schools to learn the endangered language, as Muktinath Ghimire did
  5. Capacity enhancement
     a. Energize community-based organizations and make them responsible for cooperation, coordination, and playing a leading role
     b. Community consultation
     c. Consultation or workshop for school teachers
  6. Monitoring and evaluation
     a. Field observation and technical backstopping: each trimester
     b. Monitoring and assessment of achievement: each semester
     c. Report and program review: annually

 

Expected achievements of the program:

The following are the expected achievements of this program:

  1. The baseline survey report, community language mapping, and program priority setting will be completed.
  2. Those languages will be documented manually and digitally.
  3. Every language will be transferred to every family member, and they will communicate in their mother tongue within the family and in formal, ritual, or religious functions.
  4. No child from the concerned communities will be left behind in learning their language as a subject at school in classes 1-8.
  5. An inter-language learning climate will be set up so that teachers and children of other communities get the chance to learn and use those languages.
  6. All of those languages will be revitalized, and every member of those communities will use them in daily life.

Concept Paper of UNESCO Nepal: Dr. Bhanu Neupane

Data to Innovation: Developing Datasets and Strengthening Capacities Innovation for Low Resource African Languages

Geographical scope/benefitting country(ies): Africa

Duration (in months): 11 months

Name, unit and contact details of project officer(s): Communication and Information Sector

  • Mr. Bhanu Neupane, Universal Access to Innovation Section (b.neupane@unesco.org)
  • Mr. Prateek Sibal, Digital Innovation and Transformation Section (p.sibal@unesco.org)

Partner institution(s):

  1. AI for Development Network – Africa
  2. Data Science for Social Impact – University of Pretoria Research Group
  3. Masakhane – Machine Translation for African Languages
  4. Deep Learning Indaba
  5. UNESCO Chair in Data Science and Analytics, University of Essex, United Kingdom
  6. UNESCO Chair in Artificial Intelligence, University College London, UK
  7. UNESCO Category 2 Centre – International Research Centre on Artificial Intelligence (IRCAI), Slovenia

Tentative budget inclusive of Programme Support costs: ??

 

Rationale and overall purpose

Background

Humans communicate, think, pass along information and knowledge through language. Thus, the ability to deal with human language is an essential attribute in all information and communication technologies. Although there are more than 6000 languages spoken on our planet today, only a dozen or two are flourishing in the digital world with advanced language understanding and spoken language communication technologies. Limitations in the multilingual skills of computers and mobile devices widen the digital gap which excludes hundreds of millions of people from accessing the full benefits of the Internet and digital technologies.


Artificial Intelligence (AI) has the potential to strengthen access to information and knowledge for people when the information is not available in their own language. Machine Learning enabled techniques in Natural Language Processing (NLP) are enabling applications across language translation systems, speech interfaces, dialogue systems, educational applications, emergency response applications, and the monitoring of democratic processes, among others. For instance, Figures 1 and 2 show how automated language translation can help government authorities and communities communicate in emergency situations to ensure rapid response.


In the context of disease outbreaks, text analysis methods can be used to forewarn health authorities. Figure 3 shows an example of how social media posts can be analyzed for a flu outbreak. Such language technology capabilities in multiple languages would be instrumental in building the capacities of governments to monitor outbreaks like COVID-19 as more and more people participate in the digital public sphere. However, as shown in Figure 4, the information can be lost in the absence of capabilities for analyzing low resource languages.

Even as state-of-the-art language technologies are available and applied for several languages, their application to low resource languages faces several barriers. Low resource languages are languages that lack large monolingual or parallel corpora and/or manually crafted linguistic resources sufficient for building statistical NLP applications.

Needs Assessment

The UNESCO publication “Steering AI and Advanced ICTs for Knowledge Societies” identified “strengthening cooperation between civil society and research institutes for solving problems facing local communities, for novel data collection models based on citizen science that can create data sets for AI that respect international norms for privacy and data protection” in Africa as an option for action to address the gaps in the availability of data for the development and use of AI. This project focuses on developing datasets, and the capacities for creating datasets, for low resource languages in Africa.

Of the 7111 living languages today, 2144 (30.15%) are African languages (Eberhard et al., 2019). But only a small portion of the linguistic resources for NLP research are built for African languages. This is further demonstrated by the low number of NLP publications coming out of Africa at some of the most important Machine Learning conferences in the world. For instance, across all ACL conferences in 2019, only 5 out of 2695 (0.19%) author affiliations were based in Africa (Caines, 2019).
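The two percentages quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the counts given in the text; the helper name `share_percent` is illustrative and not part of the concept paper.

```python
def share_percent(part, total):
    """Return part/total as a percentage rounded to two decimals."""
    return round(100 * part / total, 2)

# Counts as quoted in the text (Eberhard et al., 2019; Caines, 2019).
african_language_share = share_percent(2144, 7111)
acl_african_affiliation_share = share_percent(5, 2695)

print(african_language_share)         # share of living languages that are African
print(acl_african_affiliation_share)  # share of 2019 ACL author affiliations based in Africa
```

The gap between the two figures is the point of the comparison: roughly three in ten living languages are African, against about two in a thousand author affiliations.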

Some of the challenges for the development of NLP for African languages identified by researchers in Africa include:

  • Low availability of resources (input data) for African languages, which hinders the ability of researchers to do machine translation. [1]
  • Discoverability: The resources for African languages that do exist are hard to find. Often these resources are not available under open access licenses, which reduces the ability of research institutions to work together and share knowledge on language datasets to strengthen innovation.
  • Reproducibility: The data and code of existing research are rarely shared, which means researchers cannot reproduce the results properly.
  • Lack of benchmarks: Due to the low discoverability and the lack of research in the field, there are no publicly available benchmarks or leaderboards against which to compare new machine translation techniques.

Furthermore, African languages are of high linguistic complexity and variety, with diverse morphologies and phonologies, including lexical and grammatical tonal patterns, and many are spoken within multilingual societies with frequent code switching (Ndubuisi-Obi et al., 2019; Bird, 1999; Gibbon et al., 2006). Because of this complexity, cross-lingual generalization from success in languages like English is not guaranteed.

Project Scope

The project aims at addressing some of the challenges identified above through:

  1. Development of datasets in five African languages from five different countries that can be used for strengthening access to information and spur innovation based on NLP technologies
  2. Enhancement of capacities among young researchers for the development of open languages datasets and language tech applications through development of guidelines and training through open educational resources
  3. Development of a multi stakeholder network for strengthening research on language technology based on AI techniques for African languages

The project would support long term development and preservation of languages in Africa and strengthen access to information and innovation based on African languages. Model cases developed through the project would be included in UNESCO’s AI Decision Maker’s Essential to encourage policy support for further development of the project.

The project relates to UNESCO’s work on SDG 16, SDG 17 and has direct consequences for Disaster Risk Reduction, especially in the context of COVID-19 pandemic.

Why UNESCO?

UNESCO facilitates six action lines of the World Summit on the Information Society (WSIS). This project directly contributes to four of the WSIS action lines facilitated by UNESCO:

  1. Access to information and knowledge (C3) through development of language datasets in five African languages that will be used for different language tech enabled applications.
  2. E-Science (C7) through supporting researchers in developing countries in developing open access resources and through the mainstreaming of open science approaches in the development of language datasets and guidelines for the same.
  3. Cultural diversity and identity, linguistic diversity and local content (C8) through supporting capacities for development of digital applications in multiple African languages.
  4. Ethical dimensions of the Information Society (C10) through development of guidelines for open language datasets sourced online (news publications, social media and content platforms) that may contain biased sentiments (sexist, racist) and offensive material (hateful), and through guidelines for personal data protection for language data sourced from social media platforms.
  5. In addition, UNESCO coordinates global action on the International Decade of Indigenous Languages.

The project is part of 40 C/5 Major Programme V on Communication and Information Main Line of Action concerning Knowledge Societies with a focus on Open and Inclusive Solutions, Digital Technologies and Multilingualism.

Innovation, Sustainability and Replication

  1. The project would be used as a model case to inform evidence-based policy making concerning Artificial Intelligence and would be included in UNESCO’s AI Decision maker’s Essential to inform policy makers.
  2. The project would be showcased at the Internet Governance Forum as a workshop to facilitate knowledge exchange and network development through North-South and South-South cooperation.
  3. It would be included under UNESCO platform on Atlas for Languages.
  4. The project can be replicated in Asia-Pacific and Latin America to address similar challenges.

Links with 2030 Agenda

Goals 4, 10, 11, 16: the workshop will show how targeted policy measures and practical activities make it possible to equip all language communities with digital tools enabling access to information and full participation in the Knowledge Society. Looking beyond 2025, we should finally put an end to language-based confusion, exclusion, and discrimination.

Goal 5: digital language technologies empower women and girls, particularly those most distant from socio-economic melting pots: those residing in sparsely inhabited rural areas who often lack access to advanced foreign language training.

Goal 8: numerous studies show that many SMEs suffer from impeded digital market access because customers are less likely to buy online goods or services offered in languages other than those in which they are fluent.

Goal 9: the workshop will present how the research community, in cooperation with the private sector, is working on novel technologies that expand the range of technologically fit languages.

UNESCO’s Forum on AI in Africa

In 2018, the Outcome Statement of UNESCO’s Forum on Artificial Intelligence in Africa recognized “the expeditious growth of Africa’s population, as well as the opportunities and challenges this poses in terms of education, training and the employability of African youth” and the potential that AI offers for “sustainable and inclusive development on the continent.” Participants expressed concern regarding “enduring inequalities and significant disparities in the availability of the resources, capacities and infrastructures required for giving access to, and fully benefiting from, the results of scientific innovation.” This project contributes directly to achieving these objectives with respect to capacity building and access to open data for the development and use of AI in Africa.


Summary of outcomes, outputs and activities

Outcome N°1: Strengthened availability of datasets in low resource African languages for AI enabled innovation and access to information through downstream AI applications
  Output N°1: Development and enrichment of datasets in five African languages from five different countries
    Activity 1: Development and publication of open access datasets as per best practices and standards used in the Natural Language Processing community.
    Activity 2: Development of model use cases based on downstream applications (Machine Translation, Classification, Sentiment Analysis, Speech to Text etc.) of the language datasets in strengthening emergency response (COVID-19) and access to information.
Outcome N°2: Strengthened capacities for development of open data sets through guidelines on dataset development and trainings through open educational resources
  Output N°2: Development of guidelines for low resource language dataset development
    Activity 1: Develop guidelines for identifying and ascertaining whether data obtained from online sources (news publications, social media and content platforms) contains biased sentiments (sexist, racist) and offensive material (hateful).
    Activity 2: Develop guidelines outlining techniques for protecting the identities and privacy of users, in instances where data is obtained from social media/content platforms like Twitter, Facebook and YouTube.
    Activity 3: Develop guidelines for best and legally aligned practices for obtaining textual, visual and audio data from a variety of sources.
    Activity 4: Development of an online open educational resource on language data set development.
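Activity 2 under Output N°2 concerns protecting the identities of social media users. As a rough illustration only, one possible technique is to replace user handles with salted hashes before text enters a dataset. The handle pattern, the salt value, and the function name `pseudonymize` below are assumptions made for this sketch, not guidance from the project, and real guidelines would need to cover far more than handles.

```python
import hashlib
import re

# Assumed handle format: "@" followed by letters, digits, or underscores.
HANDLE_PATTERN = re.compile(r"@\w+")

def pseudonymize(text, salt):
    """Replace each @handle with a stable token derived from a salted hash."""
    def replace(match):
        handle = match.group(0).lower()  # treat @Amina and @amina as the same user
        digest = hashlib.sha256((salt + handle).encode("utf-8")).hexdigest()
        return "@user_" + digest[:8]
    return HANDLE_PATTERN.sub(replace, text)

# Hypothetical sample post; the salt should be kept secret in practice.
sample = "Thanks @Amina and @amina for the update"
cleaned = pseudonymize(sample, salt="example-salt")
print(cleaned)
```

Because the same handle always maps to the same token under a given salt, conversations stay linkable inside the dataset while the original handle is not stored. A pattern this simple will also match the domain part of an email address, so it is a starting point rather than a complete filter.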

Implementation Strategy

The project would be implemented over a time period of 11 months that would involve selection of datasets that can be developed further based on the results of the AI4D-African Language Dataset Challenge.

The guidelines and recommendations would be developed through consultations with a network of AI and Machine Learning researchers across 17 countries in Africa and would be presented at international Machine Learning conferences focusing on low resource languages. The datasets and guidelines would be published on community owned open access platforms and would be used by UNESCO and its partners for upstream policy advocacy and by the machine learning community for development of downstream applications.

The project would be guided by an expert review committee drawn from partners at the University of Pretoria, UNESCO Chairs in Big Data and Analytics, UNESCO’s International Research Centre on AI in Slovenia, the private sector and partners from government agencies.

Gender balance would be an important consideration in the composition of the teams working on dataset development, guidelines development, model case formulation and the review committee.

The project will be implemented in a partnership mode, and the UNESCO field network will be involved to the extent possible, including as an input to country-specific One-UN mechanisms.

Stakeholders, beneficiaries and partners

Direct beneficiaries of the project include AI and Machine Learning researchers, universities, students in data science, and speakers of low resource languages, through strengthened access to information in their languages.

Key partners

  1. AI for Development Network – Africa
  2. Data Science for Social Impact – University of Pretoria Research Group
  3. Masakhane – Machine Translation for African Languages
  4. Deep Learning Indaba
  5. UNESCO Chair in Data Science and Analytics, University of Essex, United Kingdom
  6. UNESCO Chair in Artificial Intelligence, University College London, UK
  7. UNESCO Category 2 Centre – International Research Centre on Artificial Intelligence (IRCAI), Slovenia

Key partners would be involved in project implementation, network development and as part of review committee for the project.

[1] Among leading architectures for pre-training models for transfer learning in NLP, pre-trained models for African languages in particular are barely represented, mainly due to a lack of data. While these architectures are freely available for use, most are data-hungry. The GPT-2 model, for instance, was trained on millions, possibly billions, of pieces of text. (ref)

This gap exists due to a lack of availability of data for African languages on the Internet. The languages selected for BERT pre-training “were chosen because they are the top languages with the largest Wikipedias”. (ref) Similarly, the 157 pre-trained language models made available by fastText were trained on Wikipedia and Common Crawl. (ref)

 

 

 

युनेस्कोको सहकार्यमा भाषा आयोगबाट भाषिक ओपन डाटाबेस तयार हुँदै

 

अन्तर्राष्ट्रिय आदिवासी भाषा दशक (२०२२०-२०३२) को सन्दर्भमा भाषा आयोग र युनेस्कोका बिच नेपालका भाषाहरूको दस्तावेजीकरण, विद्युतीय अभिलेखन तथा पुस्तान्तरण गरी भाषाहरूको संरक्षण र विकासको लागि सहकार्य हुने भएको छ। युनेस्को अफ्रिका महादेशमा विभिन्न लोपोन्मुख भाषाहरूको ओपन डाटाबेस तयार गरी आर्टिफिसियल इन्टेलिजेन्समार्फत उक्त भाषाहरूलाई प्रविधीकरण गरी पहुँच अभिवृद्धि गरेको छ । विगतमा पनि भाषा आयोग र युनेस्को नेपालको सहकार्यमा सुदुर पश्चिम प्रदेशको ९ भाषाहरूको अन्तरसम्बन्ध अध्ययन कार्य सम्पन्न गरिएको छ।

भाषा आयोगको अध्यक्ष डा.लवदेव अवस्थीको अध्यक्षतामा सम्पन्न अन्तर्क्रियामा कार्यक्रममा भाषा आयोगका सदस्य उषा हमाल, का.मु.सचिव लक्ष्मीप्रसाद भट्टराईसहित अन्य कर्मचारीहरूको सहभागिता रहेको थियो । युनेस्कोको तर्फबाट युनेस्को केन्द्रीय कार्यालयका डा.भानु न्यौपाने, युनेस्को कन्ट्री डाइरेक्टर डा.बलराम तिमल्सिना र युनेस्काका अन्य कर्मचारीहरू सहभागी रहेको थियो । उक्त कार्यक्रममा विज्ञका रूपमा भाषाविज्ञान केन्द्रीय विभाग कीर्तिपुरका डा.बलराम प्रसाइँ र काठमाडौँ विश्वविद्यालयका विज्ञान प्रविधि विभाग प्रमुख डा. बालकृष्ण बलको उपस्थिति रहेको थियो ।

भाषा आयोग मातृभाषा शिक्षा शाखा प्रमुख डा.लोकबहादुर लोप्चनले भाषा आयोगको तर्फबाट नेपालका भाषाहरूको कागजी र विद्युतीय अभिलेखीकरणका साथै भाषिक पुस्तान्तरण गरी लोपोन्मुख भाषाहरूको पुनर्जीवितीकरणसम्बन्धी कार्यपत्र प्रस्तुत गर्नुभएको थियो । त्यसपछि विज्ञ डा.बलराम प्रसाईले नेपालमा बोलिने भाषाहरूको प्रविधीकरण गरी एकल डाटाबेस तयार गर्नुपर्ने बताउनु भयो । जसमा साना र ठूला भाषाहरूलाई सँगसँगै मेसिन ट्रान्सलेसनमा लैजानुपर्ने सुझाउनु भयो । त्यसैगरी अर्का विज्ञ डा.बालकृष्ण बलले नेपालका भाषाहरूको अभिलेखीकरणका साथै त्यसलाई प्रयोगमा ल्याउन भाषासम्बन्धी डिजिटल एप बनाई प्रयोगमा जोड दिनुपर्ने बताउनुभयो ।

भाषा आयोगका अध्यक्ष, सदस्यसहित कर्मचारीबाट नेपालका भाषाहरूका संरक्षण र विकासका लागि विभिन्न विचारहरू प्रस्तुत गरिएको उक्त कार्यक्रममा युनेस्कोका केन्द्रीय प्रतिनिधि डा. भीम न्यौपाने सर्वप्रथम नेपालका भाषासम्बन्धी कार्यहरूको एकीकृत ओपन डाटाबेस तयार गर्नुपर्ने उल्लेख गर्नुभयो । साथै साना र ठूला भाषाहरूलाई आवश्यकताको आधारमा प्रविधीकरण गर्दै ओपन डाटा तयार गरी भाषाको प्रयोगमा पहुँच वृद्धि गर्न प्रतिबद्धता व्यक्त गर्नुभयो । उक्त कार्यक्रममा भाषा आयोग र युनेस्कोबाट प्रस्तुत कार्यपत्रहरू तल प्रस्तुत गरिएको छः

भाषा आयोगः डा.लोकबहादुर लोप्चन

Collaboration With Language Commission and UNESCO  Nepal for Revitalizing Five Endangered Languages of Nepal regarding to International Decade of Indingenous Language (2022-2032)

(Documentation, Digitalization and Transfer Intergeneration)

Introduction

Nepal is not only diverse in geographical landscape, it is diverse in ethnicity, religion, language, culture and others. According to population census of 2011, there are 125 ethnicities and more than 123 languages are available in Nepal. Language commission has recognized eight languages collaboration with Central Department of Linguistic,Tribhuvan University after conducting sociolinguistic studies in those languages.

Statistically, 37 language are endangered by the cause of less than one thousand speakers in those languages. Dura language has no single speaker from self Dura community, but one speaker is from another Community. He is Muktinath Ghimire, former teacher in school of Dura community. Another one Kusunda language is nearly to extinct. Kamala Sen Khatri is only one speaker of this language. Likewise, Tilung language has only two compentent speakers, no more than 6 speakers in Lungkhim language and less than one dorzan speakers are in Baram language. All of those speakers are of first generation who speak a little those in languages but second generation can’t speak those languages. So, third generation have lost the chance to learn and speak those languages. Thus, those five languages are severely endangered. For revitalizing those languages, intensive care programs should be launched and ventilate those languages by documenting, digitalization and transfer to all community members via running community language classes and teach as  a mother language all of students from concerned language Communities.

Language Commission have been running the language classes in three languages among those five languagues . Otherwise, manual and digital documentating task embarked since three years ago in all of those five languages . But the endevour of language commission is not sufficient to revitalize those kinds of languages.

International Decade of Indigeneous Language is impetus for indigeneous community to revitalize, conserve, preserve, develop and use those langages by them for them with the support of UNESCO, Language commission and all concerned development organization. The crux of this program will be no left behind any child to learn and use those mother languages. Then, Those five endargered languages will be revitalized and embarking use in their daily life.

Some status of those five languages herewith:

S.N. province District Local level Langauge Populatin Data of speaker (2011) Speaker ( vitality:fied survey) Remarks
1. Province-1 Khotang Halesi  Tuwaching Municipalty Tilung 1,424 1403 2  

Language class: Lc: 2 times

Illam Suryodaya Municiaplity Lungkhim 129 121 6 Language class: proposed, 2078
2. Gandaki Gorkha Barpak sulikot Rural municipality Baram 8,140 63 12 Language class: Lc: 2 times
Lamjung Sundarbajar Municipality Dura 5,394 1 0 Language class: proposed,2078
3. Lumbini Dang Ghorahi sub metropolitan city Kusunda 273 5 1 Language class: Lc: 2 times

 

Objectives of the Program

The objective of this programs are as follows:

  1. Conduct baseline survey, Community language mapping and setting the priority for program for those languages.
  2. Documentating manually and archieving digitally in those languages.
  3. Transfering the language to all members of the concerned community by running community language classes and teaching as a subject in schools from class 1 to class 8.
  4. Setting interlanguage learning climate for expansion of using those languages.
  5. Enhancing the capacity the member of community related organization and conduct training for language facilitator and resourse person of those languages. And intensive teacher traning or couselling program should be organize for teachers who will be joined for teaching mother language as subject in schools.
  6. Monitoring, sharing, feedbacking and Evaluation: trimester, semester and annual.

 Major programs:

Given strategies and major program have been set for achieving aforementioned objectives:

 

S.N. Program Title: Programs in detail

  1. Baseline survey, community language mapping, program priority setting:
     a. Baseline survey: field observation, consultation or interaction program, and field data collection
     b. Community language mapping
     c. Program priority setting
     d. Study report preparation

  2. Manual and digital documentation:
     a. Manual documentation: orthography, word collection, dictionary making, grammar writing, language history, corpus development, language class text material preparation, language subject textbooks for classes 1-8, calendar
     b. Digital documentation: orthography, multilingual dictionary, digital corpora (stories, songs, essays, conversations, vignettes), audio-video of teaching-learning activities for 30 lessons, video clips of speakers (micro films of cultural programs), archive of programs in the native language (cultural events, formal assemblies, celebrations, rituals)

  3. Intergenerational language transfer:
     a. Sensitize and urge every community member to pass the native language on to children
     b. Community language learning classes
     c. Teaching as a subject in classes 1-8 in school for every child of the concerned community

  4. Inter-community, inter-language learning climate setting:
     a. Multicultural contests or programs in the community, at the local level, and in schools
     b. Motivate non-native teachers and children in schools to learn the endangered languages, following the example of Muktinath Ghimire

  5. Capacity enhancement:
     a. Energize community-based organizations and make them responsible for cooperation, coordination, and playing a leading role
     b. Community consultation
     c. Consultation or workshop for school teachers

  6. Monitoring and evaluation:
     a. Field observation and technical backstopping: every trimester
     b. Monitoring and assessment of achievement: every semester
     c. Report and program review: annually

 

Expected achievements of the program:

The following are the expected achievements of this program:

  1. The baseline survey report, community language mapping, and program priority setting will be completed.
  2. Those languages will be documented manually and digitally.
  3. Every language will be transferred to every family member; families will communicate in their mother tongue at home and in formal, ritual, or religious functions.
  4. No child from the concerned communities will be left behind in learning their language as a subject at school in classes 1-8.
  5. The inter-language learning climate will be set up so that teachers and children of other communities get the chance to learn and use those languages.
  6. All of those languages will be revitalized, and every member of those communities will use them in daily life.

Concept Paper of UNESCO Nepal

Data to Innovation: Developing Datasets and Strengthening Capacities Innovation for Low Resource African Languages

Geographical scope/benefitting country(ies): Africa
Duration (in months): 11 months
Name, unit and contact details of Project Officer(s): Communication and Information Sector

Mr. Bhanu Neupane, Universal Access to Innovation Section (b.neupane@unesco.org)

Mr. Prateek Sibal, Digital Innovation and Transformation Section (p.sibal@unesco.org)

Partner institution(s):

  1. AI for Development Network – Africa
  2. Data Science for Social Impact – University of Pretoria Research Group
  3. Masakhane – Machine Translation for African Languages
  4. Deep Learning Indaba
  5. UNESCO Chair in Data Science and Analytics, University of Essex, United Kingdom
  6. UNESCO Chair in Artificial Intelligence, University College London, UK
  7. UNESCO Category 2 Centre – International Research Centre on Artificial Intelligence (IRCAI), Slovenia

 

Tentative budget inclusive of Programme Support costs: ??

 

Rationale and overall purpose

Background

Humans communicate, think, pass along information and knowledge through language. Thus, the ability to deal with human language is an essential attribute in all information and communication technologies. Although there are more than 6000 languages spoken on our planet today, only a dozen or two are flourishing in the digital world with advanced language understanding and spoken language communication technologies. Limitations in the multilingual skills of computers and mobile devices widen the digital gap which excludes hundreds of millions of people from accessing the full benefits of the Internet and digital technologies.

Reference on our work on knowledge societies and international decade of indigenous languages

Artificial Intelligence (AI) has the potential to strengthen access to information and knowledge for people when the information is not available in their own language. Machine Learning enabled techniques in Natural Language Processing (NLP) are enabling applications across language translation systems, speech interfaces, dialogue systems, educational applications, emergency response applications and monitoring of democratic processes, among others. For instance, Figures 1 and 2 show how automated language translation can help government authorities and communities communicate in emergency situations to ensure rapid response.

We need to add something here, as the jump between the two paragraphs is abrupt, and give this as an example.

In the context of disease outbreaks, text analysis methods can be used to forewarn health authorities. Figure 3 shows an example of how social media posts can be analyzed to detect a flu outbreak. Such language technology capabilities in multiple languages would be instrumental in building the capacities of governments to monitor outbreaks like COVID-19 as more and more people participate in the digital public sphere. However, as shown in Figure 4, the information can be lost in the absence of capabilities for analysis of low resource languages.

Even as state-of-the-art language technologies are available and applied for several languages, their application to low resource languages faces several barriers. Low resource languages are languages that lack large monolingual or parallel corpora and/or manually crafted linguistic resources sufficient for building statistical NLP applications.

Needs Assessment

The UNESCO publication “Steering AI and Advanced ICTs for Knowledge Societies” identified “strengthening cooperation between civil society and research institutes for solving problems facing local communities, for novel data collection models based on citizen science that can create data sets for AI that respect international norms for privacy and data protection” in Africa as an option for action to address the gaps in the availability of data for development and use of AI. This project focuses on developing datasets, and the capacities to create datasets, for low resource languages in Africa.

Of the 7,111 living languages today, 2,144 (30.15%) are African languages (Eberhard et al., 2019). But only a small portion of the linguistic resources for NLP research are built for African languages. This is further demonstrated by the low number of NLP publications coming out of Africa at some of the most important Machine Learning conferences in the world. For instance, in all ACL conferences in 2019, only 5 out of 2,695 (0.19%) author affiliations were based in Africa (Caines, 2019).
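The two shares quoted above can be rechecked with a few lines of arithmetic. The following is a minimal sketch; the helper name `share_percent` is hypothetical and is introduced only for illustration, while the counts are taken directly from the paragraph above.

```python
def share_percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Counts as quoted in the text (Eberhard et al., 2019; Caines, 2019).
african_language_share = share_percent(2144, 7111)
african_affiliation_share = share_percent(5, 2695)

print(f"African languages among living languages: {african_language_share}%")
print(f"Africa-based ACL 2019 author affiliations: {african_affiliation_share}%")
```

Set side by side, the two figures show the gap the paragraph describes: roughly three in ten living languages are African, while fewer than two in a thousand author affiliations were.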

Some of the challenges for the development of NLP for African languages identified by researchers in Africa include:

  • Low availability of resources (input data) for African languages, which hinders researchers' ability to do machine translation. [1]
  • Discoverability: The resources for African languages that do exist are hard to find. Often these resources are not available under open access licenses thus reducing the ability of research institutions to work together and share knowledge on language datasets to strengthen innovation.
  • Reproducibility: The data and code of existing research are rarely shared, which means researchers cannot reproduce the results properly.
  • Lack of benchmarks: Due to the low discoverability and the lack of research in the field, there are no publicly available benchmarks or leaderboards against which to compare new machine translation techniques.

Furthermore, African languages are of high linguistic complexity and variety, with diverse morphologies and phonologies, including lexical and grammatical tonal patterns, and many are practiced within multilingual societies with frequent code switching (Ndubuisi-Obi et al., 2019; Bird, 1999; Gibbon et al., 2006). Because of this complexity, cross-lingual generalization from success in languages like English is not guaranteed.

Project Scope

The project aims at addressing some of the challenges identified above through:

  1. Development of datasets in five African languages from five different countries that can be used for strengthening access to information and spur innovation based on NLP technologies
  2. Enhancement of capacities among young researchers for the development of open language datasets and language tech applications through development of guidelines and training through open educational resources
  3. Development of a multi stakeholder network for strengthening research on language technology based on AI techniques for African languages

The project would support long term development and preservation of languages in Africa and strengthen access to information and innovation based on African languages. Model cases developed through the project would be included in UNESCO’s AI Decision Maker’s Essential to encourage policy support for further development of the project.

The project relates to UNESCO’s work on SDG 16 and SDG 17 and has direct consequences for Disaster Risk Reduction, especially in the context of the COVID-19 pandemic.

Why UNESCO?

UNESCO facilitates six action lines of the World Summit on the Information Society (WSIS). This project directly contributes to four WSIS action lines facilitated by UNESCO:

  1. Access to information and knowledge (C3) through development of language datasets in five African languages that will be used for different language tech enabled applications.
  2. E-Science (C7) through supporting researchers in developing countries in developing open access resources and through the mainstreaming of open science approaches in the development of language datasets and guidelines for the same.
  3. Cultural diversity and identity, linguistic diversity and local content (C8) through supporting capacities for development of digital applications in multiple African languages.
  4. Ethical dimensions of the Information Society (C10) through development of guidelines for open language datasets sourced online (news publications, social media and content platforms) that may contain biased sentiments (sexist, racist) and offensive material (hateful) and through guidelines for personal data protection for language data sourced from social media platforms.
  5. UNESCO coordinates global action on Decade on Indigenous languages {there has to be a channel on digitalization of languages as one of the outcomes in the resolution or in the UN DR}

The project is part of 40 C/5 Major Programme V on Communication and Information Main Line of Action concerning Knowledge Societies with a focus on Open and Inclusive Solutions, Digital Technologies and Multilingualism.

Innovation, Sustainability and Replication

  1. The project would be used as a model case to inform evidence-based policy making concerning Artificial Intelligence and would be included in UNESCO’s AI Decision maker’s Essential to inform policy makers.
  2. The project would be showcased at the Internet Governance Forum as a workshop to facilitate knowledge exchange and network development through North-South and South-South cooperation.
  3. It would be included under UNESCO platform on Atlas for Languages.
  4. The project can be replicated in Asia-pacific and Latin America to address similar challenges.

Links with 2030 Agenda

Goals 4, 10, 11, 16: the workshop will show how targeted policy measures and practical activities make it possible to equip all language communities with digital tools enabling access to information and full participation in the Knowledge Society. Looking beyond 2025, we should finally put an end to language-based confusion, exclusion, and discrimination.

Goal 5: digital language technologies enable women and girls, particularly empowering the ones most distant from socio-economic melting pots – those residing in scarcely inhabited rural areas and often lacking access to advanced foreign language training.

Goal 8: numerous studies show that many SMEs suffer from impeded digital market access because customers are less likely to buy online goods or services offered in languages other than those in which they are fluent.

Goal 9: the workshop will present how the research community, in cooperation with the private sector, is working on novel technologies that expand the range of technologically fit languages.

UNESCO’s Forum on AI in Africa

In 2018, the Outcome Statement of UNESCO’s Forum on Artificial Intelligence in Africa recognized “the expeditious growth of Africa’s population, as well as the opportunities and challenges this poses in terms of education, training and the employability of African youth” and the potential that AI offers for “sustainable and inclusive development on the continent.” Participants expressed concern regarding “enduring inequalities and significant disparities in the availability of the resources, capacities and infrastructures required for giving access to, and fully benefiting from, the results of scientific innovation.” This project contributes directly to achieving these objectives with respect to capacity building and access to open data for development and use of AI in Africa.

Something on  ADG 2063?

A few lines on UNSDF?

Summary of outcomes, outputs and activities

Outcome N°1: Strengthened availability of datasets in low resource African languages for AI enabled innovation and access to information through downstream AI applications
  Output N°1: Development and enrichment of datasets in five African languages from five different countries
    Activity 1: Development and publication of open access datasets as per best practices and standards used in the Natural Language Processing community.
    Activity 2: Development of model use cases based on downstream applications (Machine Translation, Classification, Sentiment Analysis, Speech to Text etc.) of the language datasets in strengthening emergency response (COVID-19) and access to information.
Outcome N°2: Strengthened capacities for development of open data sets through guidelines on dataset development and trainings through open educational resources
  Output N°2: Development of guidelines for low resource language dataset development
    Activity 1: Develop guidelines for identifying and ascertaining whether data obtained from online sources (news publications, social media and content platforms) contains biased sentiments (sexist, racist) and offensive material (hateful).
    Activity 2: Develop guidelines outlining techniques for protecting the identities and privacy of users, in instances where data is obtained from social media/content platforms like Twitter, Facebook and YouTube.
    Activity 3: Develop guidelines for best and legally aligned practices for obtaining textual, visual and audio data from a variety of sources.
    Activity 4: Development of an online open educational resource on language data set development.

 

Implementation Strategy

The project would be implemented over a time period of 11 months that would involve selection of datasets that can be developed further based on the results of the AI4D-African Language Dataset Challenge.

The guidelines and recommendations would be developed through consultations with a network of AI and machine learning researchers across 17 countries in Africa and would be presented at international machine learning conferences focusing on low resource languages. The datasets and guidelines would be published on community owned open access platforms and would be used by UNESCO and its partners for upstream policy advocacy and by the machine learning community for development of downstream applications.

The project would be guided by an expert review committee drawn from partners at the University of Pretoria, UNESCO Chairs in Big Data and Analytics, UNESCO’s International Research Centre on AI in Slovenia, the private sector and partners from government agencies.

Gender balance would be an important consideration in the composition of the teams working on dataset development, guidelines development, model case formulation and the review committee.

The project will be implemented in a partnership mode, and the UNESCO field network will be involved to the extent possible, including as an input to country-specific One-UN mechanisms.

Stakeholders, beneficiaries and partners

Direct beneficiaries of the project include AI and Machine Learning researchers, universities, students in data science and speakers of low resource languages through strengthened access to information in their languages.

Key partners

  1. AI for Development Network – Africa
  2. Data Science for Social Impact – University of Pretoria Research Group
  3. Masakhane – Machine Translation for African Languages
  4. Deep Learning Indaba
  5. UNESCO Chair in Data Science and Analytics, University of Essex, United Kingdom
  6. UNESCO Chair in Artificial Intelligence, University College London, UK
  7. UNESCO Category 2 Centre – International Research Centre on Artificial Intelligence (IRCAI), Slovenia

Key partners would be involved in project implementation and network development, and as part of the review committee for the project.

[1] Among leading architectures for pre-training models for transfer learning in NLP, pre-trained models for African languages in particular are barely represented, mainly due to a lack of data. While these architectures are freely available for use, most are data-hungry. The GPT-2 model, for instance, was trained on millions, possibly billions, of pieces of text. (ref)

This gap exists due to a lack of availability of data for African languages on the Internet. The languages selected for BERT pre-training “were chosen because they are the top languages with the largest Wikipedias”. (ref) Similarly, the 157 pre-trained language models made available by fastText were trained on Wikipedia and Common Crawl. (ref)

 

 

 

 

 

 

 

 

 

 
