The latter method outperformed the former, achieving a Nearest Neighbor accuracy of 0.1282, compared to 0.0853 for the former and a baseline of 0.009, which is the probability of selecting the right nearest neighbor from the validation set of 108 pairs. Given a word, below are some examples of its three English nearest neighbor words after alignment and their cosine similarity:

Unsupervised NMT

For this, we used the Transformer with 10 attention heads. There are 4 encoder and 4 decoder layers, with 3 encoder and 3 decoder layers shared across both languages. For a decoder to work well, its inputs should be produced by the encoder it was trained with, or they should come from a similar distribution as that encoder. Hence, we make sure the encoder encodes sentences from the source and target languages to the same latent space. This ensures the decoder can translate regardless of the input source language. We enforce this by adversarial training following (Lample et al., 2018a), where we constrain the encoder to map the two languages to the same feature space. We do this by training a discriminator to classify encodings of source and target sentences. The encoder is trained to fool the discriminator such that latent representations of both source and target are indistinguishable. We also make sure the same latent space is used for both language modelling and translation so that the language model can be transferred nicely to the translation task. At each training step, we perform the following:
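The adversarial constraint can be sketched with a toy logistic-regression discriminator standing in for the discriminator network of (Lample et al., 2018a). The latent vectors, dimensions, and learning rate below are all illustrative, not the post's actual setup; the point is the two opposing objectives: the discriminator classifies source vs. target latents, while the encoder would be trained against flipped labels.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Stand-ins for encoder outputs: latent vectors of source (Pidgin) and
# target (English) sentences, drawn from slightly different distributions.
z_src = rng.normal(loc=0.5, size=(64, dim))
z_tgt = rng.normal(loc=-0.5, size=(64, dim))

# Logistic-regression discriminator: predicts p(latent came from source).
w, b = np.zeros(dim), 0.0

def predict(z):
    s = np.clip(z @ w + b, -30, 30)  # clip logits for numerical safety
    return 1.0 / (1.0 + np.exp(-s))

# Discriminator step: classify source vs. target latents (labels 1 vs. 0)
# by gradient descent on the binary cross-entropy loss.
z = np.vstack([z_src, z_tgt])
y = np.concatenate([np.ones(len(z_src)), np.zeros(len(z_tgt))])
for _ in range(200):
    p = predict(z)
    w -= 0.5 * (z.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

acc = np.mean((predict(z) > 0.5) == y)

# Adversarial (encoder) objective: cross-entropy against *flipped* labels,
# so the encoder is pushed to make source/target latents indistinguishable.
# Here we only evaluate it; in training it would update the encoder.
p = predict(z)
adv_loss = -np.mean((1 - y) * np.log(p + 1e-9) + y * np.log(1 - p + 1e-9))
```

When the latents are as separable as in this toy data, the discriminator reaches high accuracy and the adversarial loss is large; training the encoder on `adv_loss` is what drives the two latent distributions together.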
Given the absence of parallel data, we performed Unsupervised Translation, which relies on cross-lingual embeddings. A dictionary consisting of 1097 word pairs was scraped and manually edited for supervised alignment. Alignment of the word vectors was performed with the Procrustes method, in which an orthogonal weight matrix is learned to map source to target word vectors (Conneau et al., 2018), and with the Retrieval Criterion following (Joulin et al., 2018). Evaluation of methods was done on a validation set of 108 pairs.
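The Procrustes step has a closed-form solution: given matrices of source and target embeddings for the dictionary pairs, the optimal orthogonal map comes from an SVD. A minimal NumPy sketch, with a toy rotation standing in for the real Pidgin/English embeddings:

```python
import numpy as np

def procrustes(X, Y):
    """Learn an orthogonal W mapping source vectors X to target vectors Y.

    X, Y: (n_pairs, dim) arrays of embeddings for dictionary word pairs.
    The closed-form solution is W = U @ Vt, where U, S, Vt = SVD(Y^T X).
    """
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

# Toy example: Y is an exact rotation of X, so Procrustes should recover it.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
Y = X @ Q.T                                   # y_i = Q x_i

W = procrustes(X, Y)
print(np.allclose(W @ X[0], Y[0]))  # True: the learned map reproduces Q
```

Constraining `W` to be orthogonal (rather than an arbitrary linear map) preserves distances and cosine similarities, which is why nearest-neighbor retrieval in the shared space remains meaningful after mapping.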
Obtaining Corpus and Training Word Vectors

In total, we obtained a corpus consisting of 56048 sentences and 32925 unique words by scraping pidgin news websites. Below are examples of sentences in the corpus:

- dis one na one of di first songs wey commot dis year for nigeria but as dem release am, yawa dey.
- dem say na serious gbege if dem catch anybody with biabia for inside di campus

We initialized word vectors with GloVe and fine-tuned them on the corpus with a CBOW model trained with 8 negative samples, a window size of 5, a dimension of 300, and a batch size of 3000 for 5 epochs.
Translation is an important area of research in Artificial Intelligence and, most of all, communication. Many Machine Translation works have focused on popular languages like English, French, German and Chinese. However, little work has been done on African languages.
TLDR: We trained a model that can translate sentences from West African Pidgin (Creole) to English - and vice versa - without showing it a single parallel sentence (a Pidgin sentence and its English equivalent) to learn from. You can skip to the Results section at the end of the article to see some example translations by our model and the link to the code on GitHub.