EDIT (2016-09-16): following a great suggestion from hisham, I reused some symbols from Russian Cyrillic (Э э for /ɛ/, X x for /x/) instead of using Latin alphabet or IPA replacements. I still kept S s for /z/ with readability in mind, as З з looks very similar to Э э to unseasoned Cyrillic readers like me.
I recently attended BalCCon2k16 in Novi Sad, Serbia with my good friend @daltojr and, apart from the amazing people and delicious food, something that we both enjoyed was reading Serbian Cyrillic. The fact that every sound is represented directly by one letter made it very easy for us to (badly) read Serbian very quickly. Due to the similarities between the sounds of Serbian and Portuguese, we soon started joking around, writing Portuguese words in that script. After hearing from one of the cool staff members at the hostel where we were staying that Brazilians and Serbs were very much alike -- the so-called Serbilians -- I decided to roll up my sleeves and write some software to transliterate pt-BR into Cyrillic.
Before writing any line of code, I had to define how each sound in Brazilian Portuguese would be written. I was sure that some of the sounds would not have a corresponding Cyrillic letter, so that also meant creating a new script: Serbilian (Alfabeto Serbileiro/Auфабεтu Сербилeрu). My process was simple: I started off with the Portuguese phonemes in IPA, then matched those to the Serbian Cyrillic characters and finally added the missing symbols. At that point, I followed some guidelines to fill the gaps:
- reuse symbols for similar Serbian sounds that are not present in pt-BR: /v/ is written as "В" or "в", which originally represents /ʋ/.
- group sounds with negligible pronunciation differences: /r/ and /ɾ/ are both written as "Р" or "р".
- use Latin letters not used in the Serbian Cyrillic script: /w/, /ʊ/ and /ʊ̃/ are represented as "U" or "u".
- grab Russian alphabet letters that represent sounds that are not present in Serbian: /ɛ/ becomes "Э" or "э".
- use the IPA symbol rather than adding an accent over a letter: /ɐ/ is written as "∀" or "ɐ", rather than "â".
One downside of the result is that the stress mark, normally written with acute accents in pt-BR, is totally lost. This could be easily solved by adding stress marks like a dot under the letter (I didn't implement this in my code).
| Phoneme(s) (IPA) | Example words | Serbilian letter |
|---|---|---|
| /a/ /æ/ | dá, Jaime | А а |
| /ã/ /ɐ/ /ɐ̃/ | andaime, itapuã, pão | ∀ ɐ (not in Serbian) |
| /b/ | beiço, cabeça | Б б |
| /k/ | cor, quente, kiwi | К к |
| /d/ | dedo, idade | Д д |
| /dʒ/ | digo, idade | Џ џ |
| /e/ | prêmio, medo | Е е |
| /ɛ/ | meta, sé, Émerson, cafezinho | Э э (not in Serbian) |
| /f/ | fado, café | Ф ф |
| /ɡ/ | gato, guerra | Г г |
| /j/ /ɪ/ | saia, pais | Ј ј |
| /i/ /y/ | dia, rainha, pais | И и |
| /l/ | lua, alô | Л л |
| /ʒ/ | já, gente | Ж ж |
| /ʎ/ | lhe, velho | Љ љ |
| /w/ /ʊ/ /ʊ̃/ /y/ | o, mal, mau, frequente, quão, Cauã, vejam | У y |
| /m/ | mês, somo | М м |
| /n/ | não, sono | Н н |
| /ɲ/ | nhoque, sonho | Њ њ |
| /ɔ/ | avó, famosa | Ɔ ɔ (not in Serbian) |
| /o/ | avô, famoso | О о |
| /õ/ | põe | Õ õ (not in Serbian) |
| /p/ | pó, sopa | П п |
| /ʁ/ /χ/ /x/ | rio, carro, por favor | X x (not in Serbian) |
| /r/ /ɾ/ | frio, caro, por acaso | Р р |
| /s/ | saco, isso, braço, máximo, escola | С с |
| /z/ | casa, os, doze, existir | S s (not in Serbian) |
| /ʃ/ | chave, baixo, sushi | Ш ш |
| /tʃ/ | tchau, ritmo, ponte | Ч ч |
| /t/ | tempo, átomo | Т т |
| /u/ | rua, lúcido, saúde | У у |
| /v/ | vela, livro | В в (reused from /ʋ/, which is not in pt-BR) |
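A chart like this maps naturally onto a lookup structure. The snippet below is a minimal sketch, not the code from this post: `SERBILIAN` and `to_serbilian` are hypothetical names, only a handful of rows are included (lowercase forms only), and the input is assumed to be already split into individual IPA phonemes.

```ruby
# Partial phoneme-to-letter table, copied from a few rows of the chart.
# SERBILIAN and to_serbilian are illustrative names, not a library API.
SERBILIAN = {
  "a"  => "а",
  "b"  => "б",
  "k"  => "к",
  "d"  => "д",
  "dʒ" => "џ",
  "ɛ"  => "э",
  "f"  => "ф",
  "ʃ"  => "ш",
  "tʃ" => "ч",
  "t"  => "т",
  "ɾ"  => "р",
  "r"  => "р",   # /r/ and /ɾ/ share one letter
}.freeze

# Convert an array of IPA phonemes into a Serbilian string.
# Unknown phonemes pass through unchanged so gaps in the table stay visible.
def to_serbilian(phonemes)
  phonemes.map { |phoneme| SERBILIAN.fetch(phoneme, phoneme) }.join
end

puts to_serbilian(%w[k a f ɛ])   # phonemes assumed for "café"
```

Passing unknown phonemes through, rather than raising, makes it easy to spot which rows are still missing while the table is being filled in.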
Some example transliterations:
- Foi o cão que botou pra nós beber: Фoи у кɐo кi ботоу пра нɔс бебер.
- Acho a velocidade um prazer de cretinos. Ainda conservo o deleite dos bondes que não chegam nunca.: Ашу а велосидаџi ун праsер џi кречинус. Аинда консэрву у делeчi дус бонџiс кi нɐo шегɐo нунка.
- Ao vencedor, as batatas: Aу вeнседор, ас бататас.
- Nada separa as classes como a língua. Fora a renda, claro: Нада сепара ас класiс кому а лингуа. Фɔра а xeнда, клару.
Now that I had a clear specification of the script, I could put together some code to transliterate pt-BR from Latin to Serbilian Cyrillic. My weapon of choice for writing quick and dirty code for this was Ruby: loads of libraries and nice Unicode support. The algorithm is pretty simple: grab the text in Latin script, tokenize it, extract the IPA pt-BR pronunciation, convert the phonemes to Serbilian, and put the text together again.
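The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this post: all method and constant names are hypothetical, a plain regular expression stands in for the tokenizer, and a small stub table stands in for the speech synthesizer's IPA output, so the example runs without external dependencies.

```ruby
# Hypothetical stand-in for the IPA step: lowercase word => phonemes.
# In a real pipeline these would come from the speech synthesizer.
STUB_IPA = {
  "nada"   => %w[n a d a],
  "separa" => %w[s e p a ɾ a],
}.freeze

# Small excerpt of the phoneme chart, enough for the stub words.
LETTERS = {
  "n" => "н", "a" => "а", "d" => "д", "s" => "с",
  "e" => "е", "p" => "п", "ɾ" => "р",
}.freeze

# Split text into runs of letters and runs of everything else,
# so spaces and punctuation survive reassembly.
def tokenize(text)
  text.scan(/\p{L}+|[^\p{L}]+/)
end

# Transliterate one token; tokens without a known pronunciation
# (punctuation, spaces, unknown words) are returned unchanged.
def transliterate_token(token, ipa_lookup)
  phonemes = ipa_lookup.call(token.downcase)
  return token if phonemes.nil?
  phonemes.map { |phoneme| LETTERS.fetch(phoneme, phoneme) }.join
end

# Full pipeline: tokenize, look up IPA, map phonemes, reassemble.
def transliterate(text, ipa_lookup = ->(word) { STUB_IPA[word] })
  tokenize(text).map { |token| transliterate_token(token, ipa_lookup) }.join
end

puts transliterate("nada separa")
```

Keeping the IPA lookup as an injectable lambda means the stub can later be swapped for a real call without touching the rest of the pipeline. Note that this sketch drops capitalization.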
I used pragmatic_tokenizer for separating tokens and eSpeak for extracting the IPA pronunciation. One interesting thing I found during the implementation is that the case of a character depends on locale. Ruby handles this by simply not handling it natively at all, i.e. 'é'.upcase == 'é'. For this reason, I also used the unicode_utils gem, which does the locale-aware conversion.
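As a rough illustration of why case conversion matters here, the sketch below carries the capitalization of the original Latin word over to its transliteration. The `capitalize_like` helper is hypothetical. It assumes Ruby 2.4 or later, where the built-in `String#upcase` and `String#downcase` apply Unicode case mapping; on older interpreters the unicode_utils gem mentioned above would fill that role.

```ruby
# Hypothetical helper: copy the capitalization of the original Latin word
# onto its transliteration. Assumes Ruby 2.4+, where String#upcase and
# String#downcase handle non-ASCII letters such as Cyrillic.
def capitalize_like(original, transliterated)
  first = original[0]
  return transliterated if first.nil? || transliterated.empty?
  # A letter counts as uppercase when lowercasing it changes it.
  return transliterated if first == first.downcase
  transliterated[0].upcase + transliterated[1..-1]
end

puts capitalize_like("Nada", "нада")     # capital carried over to Cyrillic
puts capitalize_like("claro", "клару")   # left unchanged
```

Letters that Serbilian borrows from outside Cyrillic, such as "ɐ" with its custom capital "∀", would still need an explicit table, since the standard Unicode mapping does not know about that pairing.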
One shortcoming of this approach is that sounds in pt-BR that depend on context ("por favor" vs "por acaso", where the "r" changes its sound when the next word starts with a vowel) get the wrong phonetic transliteration. This is a tough problem to tackle: either we give up on having a one-to-one phonetic transcription, or we get words that have multiple forms, depending on the next word in the phrase. I won't go there for my little fun project -- if you have an easy solution please let me know!
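The second option, letting a word change form depending on what follows it, could be sketched as a pass over the phonemes before they are mapped to letters. This is only an illustration: the `link_final_r` helper is hypothetical, the vowel list is deliberately incomplete, the phoneme sequences for the sample words are assumed, and none of it has been validated against real text.

```ruby
# Deliberately incomplete list of vowel phonemes, for illustration only.
VOWELS = %w[a ɐ e ɛ i o ɔ u].freeze

# words: array of phoneme arrays, one per word.
# When a word ends in /x/ and the next word starts with a vowel, the
# final /x/ is replaced by /ɾ/ ("por acaso"); otherwise it is kept
# ("por favor").
def link_final_r(words)
  words.each_with_index.map do |phonemes, index|
    following = words[index + 1]
    if phonemes.last == "x" && following && VOWELS.include?(following.first)
      phonemes[0...-1] + ["ɾ"]
    else
      phonemes
    end
  end
end

# Assumed phoneme sequences for the sample words.
por   = %w[p o x]
favor = %w[f a v o x]
acaso = %w[a k a z u]

p link_final_r([por, favor])   # final /x/ of "por" is kept
p link_final_r([por, acaso])   # final /x/ of "por" becomes /ɾ/
```

The cost is exactly the one described above: "por" ends up with two written forms, and punctuation or pauses between words would need extra handling.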
Finally, the code: