Endangered Languages
Resources for conservation, development, and documentation of endangered, minority, and low or under-resourced human languages.
There is no centralized list of open-source code that would be useful for documenting, conserving, developing, preserving, or working with endangered languages. According to some estimates, half of the roughly 7,000 currently spoken languages are expected to become extinct this century (Wikipedia). However, academics, independent scholars, organizations, communities, and individuals are doing a great deal of work to stop or slow this trend. This list is intended to provide a central location to document those efforts.
Publication
A white paper describing this repository was published at the LREC 2016 CCURL Workshop (Collaboration and Computing for Under-Resourced Languages). The paper is in this repository, in the papers folder. Download the raw paper here: Open Source Code Serving Endangered Languages.
Contribute
To edit this list, simply click here. If you would like to discuss anything at all related to this, please open an issue. Please edit the list, either using the link above or by submitting pull requests, if you know of any resource available that is not on this list.
In general, please link directly to the resource or to the page describing the resource. The blurb after the link should be something short - the GitHub description generally works well, although the blurb may have to be written manually for non-GitHub links or for GitHub links which lack descriptions. Please make sure each link is on one line, to help with automatic alphabetization.
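Keeping each entry on one line is what makes alphabetization easy to automate. The following is a minimal sketch of the idea, not the tooling this repository actually uses: the function name `sort_entries` is hypothetical, and it assumes every entry is a single line starting with `- `.

```python
def sort_entries(lines):
    """Alphabetize each consecutive run of '- ' bullet lines.

    Headings and other non-bullet lines stay where they are, so every
    section is sorted independently. (Hypothetical helper, for illustration.)
    """
    out, run = [], []

    def flush():
        # Sort case-insensitively, ignoring a leading '[' from link syntax.
        out.extend(sorted(run, key=lambda s: s[2:].lstrip("[ ").casefold()))
        run.clear()

    for line in lines:
        if line.startswith("- "):
            run.append(line)
        else:
            flush()
            out.append(line)
    flush()
    return out


if __name__ == "__main__":
    sample = [
        "Software",
        "- hunspell Spell checker",
        "- [Apertium](x) MT toolbox",
        "- brain Neural networks",
    ]
    print("\n".join(sort_entries(sample)))
```

An entry that wraps onto a second line would be treated as a section break by this sketch, which is why the one-line rule matters.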
Definitions
Endangered languages are human languages that are in danger of extinction. This list also encompasses minority languages, which are spoken by a stable but small population (for example, Maltese or Hawaiian), and low- or under-resourced languages, which are spoken by a significant population but are under-represented on the web (for instance, Quechua). These languages share certain characteristics; the most pertinent are sparse data and a lack of resources, ranging from spell-checkers to grammars to machine translation corpora. Other under-resourced languages that fall outside the scope of this list include constructed languages (for instance, Klingon or Na'vi), computer languages (for instance, JavaScript or Lua), and extinct languages whose surviving data is so sparse as to be computationally irrelevant for most purposes (for instance, Tocharian).
Open Source "promotes a universal access via a free license to a product's design or blueprint, and universal redistribution of that design or blueprint, including subsequent improvements to it by anyone." (Wiki). This is important because money and resources allocated to a language or project that is not open source are spent at the expense of possible extensibility elsewhere.
Regarding the name, Endangered Languages may not be the best term, as many low-resource languages are not necessarily endangered. However, it is the term most accessible to the widest range of people. Low Resource Languages would also suit this list.
Looking for resources for code languages? Take a look at the awesome lists collection.
Table of Contents
Table of Contents generated with DocToc
- Generic Repositories
- Annotation
- Format Specifications
- i18n-related Repositories
- Audio automation
- Text automation
- Experimentation
- Flashcards
- Natural language generation
- Computing systems
- Android Applications
- Chrome Extensions
- FieldDB
- Academic Research Paper-Specific Repositories
- Example Repositories
- Language & Code Interfaces
- Fonts
- Corpora
- Organizations
- Language Specific Projects
- License
Generic Repositories
Massive Dictionary and Lexicography projects
- ABVD Austronesian Basic Vocabulary Database
- CBOLD Comparative Bantu OnLine Dictionary
- IE Indo-European comparative lexical resource
- REFLEX a comparative dictionary project for Africa based out of CNRS in France.
- Southeast Asian lexicography Several Southeast Asian lexicons hosted.
- STEDT Tibeto-Burman focused project where dictionaries from several languages are comparable.
- Tibeto-Burman lexicography
Single language lexicography projects and utilities
Utilities
- Project for Free Electronic Dictionaries A project for a Java MIDlet for mobile phones, intended for indigenous language dictionaries.
- Webonary Site which hosts digital dictionaries for single languages.
- WeSay Allows language communities to build their own dictionaries. http://software.sil.org/wesay/ (by SIL International)
Interactions and presentations of data
- Dict.cc An exemplar model of a successful bilingual (German-English) dictionary as it has grown from a hobby to a business employing 22 people.
- Koasati Digital Dictionary The Coushatta Tribe of Louisiana
- Ojibwe People's Dictionary
- Talking dictionary of Khinina-ang Bontok: The language spoken in Guina-ang, Bontoc, Mountain Province, the Philippines. Note that this dictionary is best viewed with Firefox 3.0 on Windows XP, which raises the question: what is the lifespan of the works we create, and how do we create a sustainable infrastructure? This has been the bane of the digital age, and many academics are not able to overcome the challenge.
- Template for Multilayered Language Learning Resources (https://github.com/eddersko/web-template) A web-based template that may be used to present language learning resources to aid language revitalization efforts. It includes a talking dictionary and a phrasicon containing sentences and phrases.
- The Yurok Language Project
- Yami Dictionary
Software
- accentuate.us a.k.a. "charlifter". Statistical Unicodification of plain text for many languages
- alignment-with-openfst - This is an implementation of the CRF autoencoder framework for four tasks: bitext word alignment, part-of-speech tagging, code switching, dependency parsing.
- ANNIS Search and Visualization in Multilayer Linguistic Corpora
- Apertium Apertium is a toolbox to build open-source shallow-transfer machine translation systems, especially suitable for related language pairs: it includes the engine, maintenance tools, and open linguistic data for several language pairs.
- ark-tweet-nlp CMU ARK Twitter Part-of-Speech Tagger (Fork)
- ArtOfReading Index and processing scripts related to the Art Of Reading illustration collection
- bayesline A Multinomial Bayesian Classification for Language Identification
- bible-corpus-tools A collection of tools for reading/processing the multilingual Bible corpus.
- BloomDesktop Bloom Desktop is a hybrid c#/javascript/html/css Windows application that dramatically "lowers the bar" for language communities who want books in their own languages. Bloom delivers a low-training, high-output system where mother tongue speakers and their advocates work together to foster both community authorship and access to external materia… http://bloomlibrary.org/
- BloomLibrary - Bloom Library Single Page App, using AngularJS & Bootstrap, Parse.com backend. http://www.bloomlibrary.org
- brain Neural networks in JavaScript
- Bristol Uni MT Morphology tools This repo is a mirror of scripts available on http://www.cs.bris.ac.uk/Research/MachineLearning/Morphology/resources.jsp#corpus. Included: Ukwabelana - An open-source morphological Zulu corpus and EMMA: A Novel Evaluation Metric for Morphological Analysis.
- brown-cluster C++ implementation of the Brown word clustering algorithm.
- CasualConc CasualConc is a concordance program that runs natively on Mac OS X 10.5 Leopard or later. It was originally designed for casual use (preliminary analysis or non-research purposes), though [the maintainer] has been using it for his own research (and many others have). It can generate KWIC concordance lines, word clusters, collocation analysis, and word count.
- cdec - Decoder, aligner, and model optimizer for statistical machine translation and other structured prediction models based on (mostly) context-free formalisms
- charlint Charlint is a character normalization/checking tool written in Perl. Among other things, it implements Normalization Form C of Unicode TR 15, as a test platform for Early Uniform Normalization in the W3C Character Model.
- chorus A version control system designed to enable workflows appropriate for typical language development teams who are geographically distributed.
- clam Computational Linguistics Application Mediator -- Quickly turn NLP applications into RESTful webservices with a web-application front-end. You provide a specification of your command line application, its input, output and parameters, and CLAM wraps around your application to form a fully fledged RESTful webservice.
- CMU Sphinx CMUSphinx is a speaker-independent large vocabulary continuous speech recognizer released under BSD style license. It is also a collection of open source tools and resources that allows researchers and developers to build speech recognition systems.
- cnminlangwebcollect Chinese minorities website languages detection and websites collection
- Cog Cog is a tool for comparing languages using lexicostatistics and comparative linguistics techniques. It can be used to automate much of the process of comparing word lists from different language varieties. http://sillsdev.github.io/cog/
- convertextract Convert Excel, Word and PowerPoint files with non-Unicode text (like text requiring SIL fonts) into Unicode, while preserving original file's formatting.
- CorpusTools Phonological CorpusTools http://phonologicalcorpustools.github.io/CorpusTools/
- CTK Built around LDC's champollion sentence aligner kernel, Champollion Tool Kit (CTK) aims to provide ready-to-use parallel text sentence alignment tools for as many language pairs as possible. (Original project is on SourceForge: http://champollion.sourceforge.net)
- CuPED CuPED ('Customizable Presentation of ELAN Documents') is a tool for transforming time-aligned transcripts, such as those produced by ELAN, into a variety of presentation formats.
- DataTags A system to assess the sensitivity and privacy risk of a dataset, and assign a tag to describe how the dataset must be transferred, stored and accessed. (Fork)
- dataverse A data repository framework to share and publish research data.
- dative A single-page application that interacts with multiple linguistic fieldwork web service databases. Website.
- DeepLearnToolbox Matlab/Octave toolbox for deep learning. Includes Deep Belief Nets, Stacked Autoencoders, Convolutional Neural Nets, Convolutional Autoencoders and vanilla Neural Nets. Each method has examples to get you started.
- Desmeme Database and tools for exploring linguistic templates
- dictdb dictionary database for language translation
- discoursegraphs Python-based tool to convert and merge multilayer annotated linguistic data
- DLTK Deutsch Language Tool Kit. More
- ELDER: Endangered Language Data Electronic Repository A web-based ontologically-compliant collaborative linguistic data cataloguing tool.
- EMMA A Novel Evaluation Metric for Morphological Analysis
- enchant enchant spellchecking library https://abiword.github.io/enchant
- fast_align Simple, fast unsupervised word aligner.
- fastText - Library for fast text representation and classification.
- FieldWorks FieldWorks is a suite of software tools for language and cultural data, with support for complex scripts. http://software.sil.org/fieldworks/ FieldWorks Language Explorer (or FLEx, for short) is designed to help field linguists perform many common language documentation and analysis tasks. It can help you elicit and record lexical information, create dictionaries, interlinearize texts, analyze discourse features, and study morphology.
- Franc Natural language detection http://wooorm.com/franc/
- FwDocumentation Developer documentation for FieldWorks (software tools for language and cultural data, with support for complex scripts).
- FwLocalizations Localizations for FieldWorks
- FwSupportTools Additional tools for FieldWorks development
- Gaia Gaia is an HTML5-based Phone UI for the Boot 2 Gecko Project. NOTE: For details of what branches are used for what releases, see the wiki. If you're interested in setting up a keyboard in a new language, see this.
- giza-pp GIZA++ is a statistical machine translation toolkit that is used to train IBM Models 1-5 and an HMM word alignment model. This package also contains the source for the mkcls tool which generates the word classes necessary for training some of the alignment models.
- gv-crawl - Global Voices bitext crawler for creating parallel corpora.
- Glottolog data Glottolog provides comprehensive reference information for the world's languages.
- Gramadóir Grammar checking engine that is designed for the rapid development of grammar checkers for minority languages and other languages with limited computational resources.
- grind An InDesign 5.5 plug-in designed to allow Graphite-enabled smart fonts to be used in Adobe InDesign. This project integrates SIL's Graphite 2 smart font technology with our own implementation of a paragraph composer plugin.
- hermitcrab HermitCrab.NET is a flexible morphological/phonological parser that takes an item-and-process approach.
- Hfst Helsinki Finite-State Technology - a set of command-line tools to work with finite state transducers. It is heavily used by the Giella infrastructure (see Divvun and Giellatekno under Sami further down).
- hunspell Spell checker and morphological analyzer library and program designed for languages with rich morphology and complex word compounding or character encoding
- icu-dotnet C# wrapper for ICU4C
- icu4c Mirror of svn project at http://source.icu-project.org/repos/icu/icu/. The FieldWorks branch has some FieldWorks specific enhancements.
- iLanguage A semi-unsupervised language independent morphological analyzer useful for stemming unknown language text, or getting a rough estimate of possible parses for morphemes in a word. Input: a corpus. Uses compression, maximum entropy and fieldlinguistics.
- ipa-help IPA Helps
- itweets-geodata Geodata from Indigenous Tweets
- jQuery.ime jQuery based input methods library
- koreksyon Tools for developing and implementing spell-checking and grammar-checking capabilities in low-resource languages
- l20n.js L20n reinvents software localization. Users should be able to benefit from the entire expressive power of natural languages. L20n keeps simple things simple, and at the same time makes complex things possible. This is the JavaScript implementation of L20n. http://l20n.org
- langid.py Stand-alone language identification system.
- langtech A host of resources provided in SVN by the University of Tromsø. Details are here and in English here.
- leebock/languages Application files for the Smithsonian endangered languages story map.
- LEGO Unified Concepticon Material relating to the LEGO Unified Concepticon
- Lex4All pronunciation LEXicons for Any Low-resource Language http://lex4all.github.io/lex4all/
- LfMerge Send/Receive for languageforge.org
- liblevenshtein - A library for generating Finite State Transducers based on Levenshtein Automata.
- libpalaso Palaso Library: A set of .Net libraries useful for developers of Language Software.
- LinGO Grammar Matrix The LinGO Grammar Matrix is a framework for the development of broad-coverage, precision, implemented grammars for diverse languages.
- Lingpy LingPy: Python library for quantitative tasks in historical linguistics http://lingpy.org
- Linguistica Linguistica is a program designed to explore the unsupervised learning of natural language, with primary focus on morphology (word-structure). It runs under Windows, Mac OS X and Linux, and is written in C++ within the Qt development framework. Its demands on memory depend on the size of the corpus analyzed.
- long-press jQuery plugin to ease the writing of accented or rare characters. http://toki-woki.net/lab/long-press
- low-resource-pos-tagging-2014 Low-Resource POS-Tagging: 2014
- lrl For work concerning low resource languages.
- Machine Machine is a natural language processing library for .NET that is focused on providing tools for processing resource-poor languages (used by FLEx)
- Make-extensions Scripts for generating hunspell spellchecking extensions
- MARY TTS MARY TTS -- an open-source, multilingual text-to-speech synthesis system written in pure java http://mary.dfki.de
- maxent Maximum Entropy Modeling Toolkit for Python and C++ http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html
- mgiza A word alignment tool based on the famous GIZA++, extended to support multi-threading, resume training and incremental training.
- Minority Translate Minority Translate is a simple program for helping content generation on smaller Wikipedias (actually Wikipedias of any size) by giving pointers to existing articles in other language Wikipedias, so that the user can easily translate or adapt existing texts and thus increase the size and usability of their Wikipedia editions.
- morfessor Morfessor is a tool for unsupervised and semi-supervised morphological segmentation
- morpholm Morphology-aware language models.
- mosesdecoder Moses, the machine translation system
- moz-l10n-tiers Creates a pseudo-locale to evaluate string prioritization for l10n
- myWorkSafe Smart & Simple Backup for Language Development Workers http://myWorkSafe.palaso.org
- Natural Javascript general natural language facilities for node
- NIST 2008 Open Machine Translation Evaluation
- NLTK Python Natural Language Tool Kit. NLTK Source http://www.nltk.org/
- node-panlex node.js client for PanLex
- norma A tool for automatic spelling normalization
- nplm Fork of https://nlg.isi.edu/software/nplm/ with some efficiency tweaks and adaptation for use in mosesdecoder.
- octothorpe CouchDB-powered wiki thing
- OdtXslt Perform XSLT transform on contents of a package (such as ODT, Docx, etc.)
- old-webapp Online Linguistic Database --- software for creating web applications to collaboratively document languages. http://www.onlinelinguisticdatabase.org
- old-pyramid Online Linguistic Database migrated to the Pyramid framework.
- OpenDataKit Open Data Kit (ODK) is an open-source suite of tools that helps organizations author, field, and manage mobile data collection solutions
- OpenNLP The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. Website.
- ops-devbox Ansible playbook for a (linux) developer machine
- panlex-tools This package contains scripts to transform lexical resources into a format suitable for importing into PanLex. Documentation may be found at https://dev.panlex.org
- paradigm PARADIGM is a .Net (C#) implementation of Joseph E. Grimes' 1983 work entitled "Affix Positions and Cooccurrences: The PARADIGM Program".
- pathway Preparing language data for publication
- pdfdroplet Library and GUI for imposition of PDF pages (e.g. 2-up) http://pdfdroplet.palaso.org
- pepper Pepper is a pluggable, Java-based, open source converter framework for linguistic data.
- phonology-assistant Phonology Assistant is a discovery tool. Provided with a corpus of phonetic data, it automatically charts the sounds and through its searching capabilities, helps a user discover and test the rules of sound in a language.
- pressagio Pressagio is a library that predicts text based on n-gram models. For example, you can send a string and the library will return the most likely word completions for the last token in the string.
- PrimerPro The purpose of PrimerPro is to assist the literacy worker in the development of primers for a given language.
- pyDelphin Python libraries for DELPH-IN (Friendly Fork)
- RBGParser Graph-based Dependency Parser.
- Rosetta Pangloss The Rosetta Project's Pangloss system
- salm SALM: Suffix Array and its Applications in Empirical Language Processing by Joy
- Salt A graph-based model to store and manipulate linguistic data.
- saymore - A tool that simplifies common language documentation tasks, such as keeping all the resulting files and metadata organized, converting files to archive formats, and transcription.
- Secwepemc-Facebook Translate Facebook into unsupported languages
- SegParser Randomized Greedy algorithm for joint segmentation, POS tagging and dependency parsing
- SeedLing Building and Using A Seed Corpus for the Human Language Project
- Skype in your language Translate Skype into unsupported languages
- solid Solid is a software tool that can be used to check, clean up, and convert Standard Format (e.g. Toolbox) lexicon data.
- SPHERE Conversion Tools Many LDC corpora contain speech files in NIST SPHERE format. The programs below convert SPHERE files to other formats.
- StandardFormatLib Standard Format Library
- Stanford CoreNLP Stanford CoreNLP: A Java suite of core NLP tools. https://stanfordnlp.github.io/CoreNLP/
- Stanford CoreNLP Python Python wrapper for Stanford CoreNLP tools
- stanza Stanford NLP group's shared Python tools.
- str2ipa Pronunciation dictionaries for languages with close-to-phonetic writing systems
- sugali Legacy repository of a language identification project covering many (many) languages, created for the software project course "NLP projects for low-resource languages".
- SuGarLike Language Identification for Low Resource Languages (by Susanne, Guy and Liling)
- teny Tools for low-resource machine translation.
- TeraDict Translate English words into hundreds of languages!
- Tesseract.js Pure Javascript OCR for 62 Languages 📖🎉🖥 http://tesseract.projectnaptha.com/
- TexNLP TexNLP: Texas Natural Language Processing tools
- TiMBL TiMBL is an open source software package implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification with feature weighting suitable for symbolic feature spaces, and IGTree, a decision-tree approximation of IB1-IG. All implemented algorithms have in common that they store some representation of the training set explicitly in memory. During testing, new cases are classified by extrapolation from the most similar stored cases.
- Toney Tone Classification Software
- Toolbox Scripts for ELAN Mirror of Alexander Koenig's Toolbox Scripts https://tla.mpi.nl/tools/tla-tools/elan/thirdparty/
- ToolsForFieldLinguistics A collection of scripts and recipes for linguistics
- translitit-engine A transliteration engine written in JavaScript
- Tsammalex data Tsammalex is a multilingual lexical database on plants and animals.
- tweet2learn An app to make it easier to use your native language on Twitter
- twitter_langid A hierarchical character-word neural network for language identification
- UniversalDependencies docs Universal Dependencies online documentation http://universaldependencies.org/docs/
- UniversalDependencies tools Various utilities for processing the data.
- VocBench VocBench is a web-based, multilingual, editing and workflow tool that manages thesauri, authority lists and glossaries using SKOS-XL.
- wavesurfer.js Navigable waveform built on Web Audio and Canvas https://wavesurfer-js.org/ (Also has an ELAN plugin)
- web-scriptureforge platform for Scripture-related web apps
- Word Generator WordGenerator generates hypothetical words from specifications of their syllable structure.
- WordBoundary An experiment in the detection and segmentation of word boundaries
- wordbyword WordByWord is a free, open source, easy-to-use multimedia vocabulary trainer developed by Vera Ferreira, Peter Bouda, and Ricardo Filipe at CIDLeS with the support of the Foundation for Endangered Languages.
- WSI4URLang Word Sense Induction (WSI) for Under-resourced Languages (URLang)
- XDXF_Makedict XDXF dictionary format and "makedict" dictionary converting software (official repository)
Annotation
- AGTK AGTK is a suite of software components for building tools for annotating linguistic signals, time-series data which documents any kind of linguistic behavior (e.g. audio, video). The internal data structures are based on annotation graphs. (Original project is on SourceForge: https://sourceforge.net/projects/agtk/)
- brendano - Graph Fragment Language for Easy Syntactic Annotation https://www.cs.cmu.edu/~ark/FUDG/
- ELAN ELAN is a professional tool for the creation of complex annotations on video and audio resources.
- eopas ETHNOER Online Presentation and Annotation System
- FLAT - FoLia Linguistic Annotation Tool FLAT is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.github.io/folia/), a rich XML-based format for linguistic annotation. FLAT allows users to view annotated FoLiA documents and enrich these documents with new annotations; a wide variety of linguistic annotation types is supported through the FoLiA paradigm. It is a document-centric tool that fully preserves and visualises document structure.
- gfl_syntax Graph Fragment Language for Easy Syntactic Annotation http://www.ark.cs.cmu.edu/FUDG
- graf-python The library graf-python is an open source Python implementation to parse and write GrAF/XML files as described in ISO 24612. The parser of the library creates an annotation graph from the files. The user may then query the annotation graph via the API of graf-python.
- LDC Word Aligner LDC Word Aligner is a software tool used for manual annotation of word alignment developed to support Arabic-English and Chinese-English word alignment tasks. It has a clean, easy-to-use interface. Since its development in 2009, LDC has used LDC Word Aligner to generate over 1,000,000 tokens of annotated word alignment data from a variety of genres including broadcast, newswire and web-based sources.
- poio-analyzer Poio is a collection of software tools for linguists working in language documentation, descriptive linguistics and/or language typology. It allows linguists to manage and analyze their data. The Poio Interlinear Editor allows users to add morpho-syntactic annotations to transcriptions. It supports various file formats for input, but will only output standardized XML defined by the Corpus Encoding Standard and the Text Encoding Initiative. Several tools for analyzing linguistic data will be made available to further process annotated data. Poio tools are written in Python and are based on PyQt.
- poio-api Poio API is a free and open source Python library to access and search data from language documentation in your linguistic analysis workflow. It converts file formats like Elan’s EAF, Toolbox files, Typecraft XML and others into annotation graphs as defined in ISO 24612. Those graphs, for which we use an implementation called “Graph Annotation F… http://www.poio.eu/
- poio-doc Documentation of the Poio project. http://www.poio.eu
- pyannotation PyAnnotation is a Python Library to access and manipulate linguistically annotated corpus files.
- XTrans XTrans is a next generation multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. The XTrans toolkit provides new and efficient solutions to common transcription challenges and addresses critical gaps in existing tools. Designed with input from experienced human transcribers working with real world data, XTrans provides a flexible and intuitive graphical user interface for a multitude of speech annotation tasks including (virtual) segmentation of audio into smaller units like turns and sentences; speaker identification; orthographic transcription in any language; and labeling of structural elements of the transcript like topics.
Format Specifications
- dlx-spec The official specification for the DLx linguistic data format. http://developer.digitallinguistics.io/spec/
- FoLiA FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. http://proycon.github.io/folia/
i18n-related Repositories
- Express-Lingua An i18n middleware for the Express.js framework.
- Polyglot.js Give your JavaScript the ability to speak many languages.
- Transifex - System for providing a nice, user-friendly, project-oriented approach to translating .po files. Great for non-technical users, free for open-source projects, decent for minority languages; however, it can take a while to get a new language added to the Transifex system because its ticketing system sometimes loses tickets. Provides translation memory, the ability to appoint reviewers, etc. Transifex used to have an open source system that you could host on your own, but that seems to have disappeared.
Audio automation
- arctic-prompts Generate prompts PDF for CMU ARCTIC dataset
- AudioWebService a simple nodejs server which accepts upload of audio and runs it through praat
- AuToBI Automatic prosodic annotation tool written in Java.
- BashScriptsForPhonetics (Fork of a dormant project)
- esv-text-audio-aligner ESV Text/Audio Aligner to programmatically obtain the timings for each word in the corresponding audio
- et-pocketsphinx-tutorial Tutorial of Estonian speech recognition using PocketSphinx
- html5-audio-read-along HTML5 Audio Read-Along
- ipa-chart International Phonetic Alphabet (IPA) Unicode Chart and Character Picker
- kaldi-svn-archive A read-only archive of the original Kaldi SVN repository (mainly to keep sandboxes available)
- kaldi This is now the official location of the Kaldi project.
- lex4all pronunciation LEXicons for Any Low-resource Language (Fork of a student project)
- node-pocketsphinx
- opensauce GNU Octave-compatible version of VoiceSauce
- pocketsphinx PocketSphinx is a lightweight speech recognition engine, specifically tuned for handheld and mobile devices, though it works equally well on the desktop
- pocketsphinx-ios-demo Simple demo for iOS
- pocketsphinx-python Python module installed with setup.py
- pocketsphinx-ruby Ruby speech recognition with Pocketsphinx
- pocketsphinx-wp-demo Demo to run pocketsphinx on WP8 platform
- pocketsphinx.js Speech recognition in JavaScript
- praat-py From my PhD days: Praat-Py is a custom build of Praat, the computer program used by linguists for doing phonetic analysis on sound files, to allow for scripts to be written in the Python programming language, rather than in Praat's built-in language. (Fork of a dormant project)
- Praat-Scripts Mietta's Scripts
- PraatTextGridJS A small library which can parse TextGrid into json and json into TextGrid
- PraatontheWeb - Web implementation of Praat. Source code, running demo scripts on web, samples and documentation
- prosodicParsing different kinds of HMMs to use for incorporating prosody into basic parsing
- Prosodylab-Aligner Python interface for forced audio alignment using HTK and SoX
- prosodylab.alignertools
- Recordmp3js Record MP3 files directly from the browser using JS and HTML
- sphinx4 Pure Java speech recognition library
- sphinxbase
- sphinxtrain
- TLSphinx Swift wrapper around Pocketsphinx
Text automation
- clld Cross Linguistic Linked Data python library
- LaTeX2HTML5 LaTeX web components
- MultilingualCorporaExtractor Node io Spider for extracting multilingual corpora (Fork of a student project)
- SeedLing Building and Using A Seed Corpus for the Human Language Project (Fork of a student project)
Experimentation
- experigen A framework for creating linguistic experiments
- GamifyPsycholinguisticsExperiments A simple node server to gamify linguistics experiments; it runs offline on a laptop for small-scale experiments and online on a server for large-scale experiments. Data is sent to a Google spreadsheet. (Fork of a dormant project)
- OpenSesame Graphical experiment builder for the social sciences
- OPrime Open Source Experimentation Libraries - Online and Offline for Android and HTML5
- psychopyMegProsody Runs MegProsody using PsychoPy.
- PsychScript A HTML5/Javascript library for running behavioural experiments online.
Flashcards
- Anki Anki is a program to make and share flashcard decks (including audio) for any language or writing system. https://apps.ankiweb.net/
- flashfork An Anki addon for copying decks of flashcards, with or without also copying their note types.
- flashgrab An Anki addon for pulling flashcard data (one-way sync) from XML. Optimized for LIFT XML (from WeSay or FLEx). [This is now the official repo. -pconstrictor]
- flashgrid An Anki addon for drilling flashcards by selecting the correct card from a grid layout of several cards. See the Anki website's list of provided addons.
- VocabLift Language-learning tool that uses vocabulary from LIFT-format dictionaries produced by programs such as Fieldworks Language Explorer and WeSay.
Natural language generation
- hailo A conversation bot using Markov chains
- ngram-natural-language-generator Takes in a text file and generates random sentences that sound like they could have been in the file
- OpenCCG OpenCCG library for parsing and realization with CCG. Includes mini-grammars for Inuit, Nezperce, Basque and others.
- SimpleNLG SimpleNLG is a simple Java API designed to facilitate the generation of Natural Language. It was originally developed at the University of Aberdeen's Department of Computing Science. It currently supports English, but forks exist for French and German.
- See more at Downloadable NLG systems at the ACL Wiki. Of particular interest there might be the List of resources by language at the wiki.
Computing systems
- Common Language Resources and Technology Infrastructure Norway / Clarino - One of their projects (not clearly listed here) is about providing an online system for language analysis, so users can connect resources visually, dump in text, and get a result. Kind of like Yahoo! Pipes, but for language processing. Uses the ABEL cluster.
Android Applications
- Aikuma Android software for recording and translation
- Android Speech Recognition Trainer Speech recognition training app for low resource languages which interfaces with FieldDB corpora
- AndroidFieldDB An Android app which lets the user build a custom visual and auditory vocabulary, useful for guided anomia treatment and self designed language lessons by heritage speakers.
- AndroidFieldDBElicitationRecorder A general purpose video recording tool
- AndroidLanguageLessons Lets heritage speakers create self designed language lessons. https://play.google.com/store/apps/details?id=com.github.opensourcefieldlinguistics.fielddb.lessons.georgian
- AndroidProductionExperiment Android App to run perception experiments
- Bevara Android Phone Application designed for Linguistic Fieldwork to help preserve, maintain, and save endangered languages
- ojoVoz A mobile app for sending georeferenced image and voice recordings from an Android phone to an email address. For more information, please go to http://sautiyawakulima.net/ojovoz/
- pocketsphinx-android-demo
- pocketsphinx-android pocketsphinx build for Android
- Template for Word-Learning App (https://github.com/eddersko/android-template) This is