Showing posts with label apertium. Show all posts
Showing posts with label apertium. Show all posts

Tuesday, February 12, 2008

Apertium ... three lines to localize ...

Hi Apertium is a MT (machine translation) tool which is particularly strong in translating similar languages. It is also used for pre-translation of Wikipedia articles for certain language pairs (Spanish-Catalan, Spanish-Occitan etc.) and shall be included in the Ubuntu distribution. Therefore it would be nice to have the three needed sentences localized in as many languages as possible.

To see which languages are actually there please have a look at

http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-tolk/po/

From there you can create the .po file for your language. Once you translated the file please send it to: apertium-stuff@lists.sourceforge.net maybe sending also a copy to me s.cretella (at) voxhumanitatis.org (just to make sure the file goes through - you never know).

Of course, I understand if you don't know how to work with .po files.
In that case please indicate the language code (and language) you are translating in in the subject of your e-mail and send the lines below translated to the apertium-stuff list above and a copy to me. Please note that only the lines starting with msgstr need translation, the rest remains untouched.

If you work with particular languages that could create utf-8 coding problems in e-mails, pleas just copy and paste the text into a file and send us the file with the translated lines.

#: main_window.glade.h:2
msgid "Mark unknown words"
msgstr "Mark unknown words"

#: main_window.glade.h:3
msgid "This program is licensed under the GPL."
msgstr "This program is licensed under the GPL."

msgid "Translator"
msgstr "Translator"

An example of the screenshot can be seen here.

Thank you in advance for your help :-)

Friday, December 07, 2007

Translating Wikipedia articles (2)

Like I already said yesterday, I would come back to this argument today.

Apertium is already used in some projects, one of which is the Occitan Wikipedia. For those who are not familiar with Wikis: there you have the possibility to compare the not proofread version with the proofread version and that is something you will see by clicking here.

What you see on the left hand side is the text as it was after the machine translation and on the right hand side the proofread version of the text. The changes are highlighted in green on the left and in blue on the right hand side. There are even some parts of the text that were not changed at all.

The work on the glossary and the grammar rules (well I am not using the specific terminology here to make things understandable for all) has been going on for approximately one year now.

At a certain stage the problems arise from vocabulary that is missing and not so much from the rules. Of course these translations will probably never be a 100% perfect, but the quality depends very much on us and our adding terminology and classifying it.

Comparing the above result to what you would see for Spanish-Catalan, well the last one having been under development for years is much better.

You can find further reading about co-operation between Wikipedia and Apertium on the Apertium Wiki.

Language pairs that are right now available are:

  • Spanish←→Catalan
  • Spanish←→Galician
  • Spanish←→Portuguese (pt and pt_BR)
  • Catalan←→English
  • Catalan←→French
  • Catalan←→Occitan (oc and oc@aran)
  • Romanian→Spanish

Many other language pairs are under development. Of course: you may start on any language combination that is comfortable for you. Please keep in mind: the more similar two languages are the easier it is to program the rules, the faster the translation engine will produce good translations.

If you want to start to work on wordlists, please write me at: s.cretella (at) voxhumanitatis.org and tell me which language pair you are interested in. You can also reach me by skype at: sabinecretella

I will upload a wordlist to google docs and give you access. Please let me know if you have difficulties to work online (that is if you work with a dial-in connection).

The Apertium Chat is on Freenode.

One more thing I just received criticism since machine translation would flatten the language: well any translated text, in particular when it comes to literature translations, is post edited by a second person. The translation is never published directly since during translation - and you can be the best translator of the world - there are always some bits and pieces that sound a little strange or that do not really transport the scene into the other culture. And please allow me to introduce the concept of cultural localization here that will be explained in one of the future posts here and that was coined by Dr. Martin Benjamin who is part of the advisory board of Vox Humanitatis. The concept of cultural localization became then immediately part of the scope of the association.

And since I am adding notes here: please remember that the Fundraiser of the Wikimedia Foundation is still running and that you can help by donating and telling others that the fundraiser is on. For more information and to donate please click here.

Thursday, December 06, 2007

Translating Wikipedia articles ...

... into less resourced languages. Well, time has come that we can start to think about how to go about a faster creation of contents for the many small Wikipedias. As you all know, often we have just a handful of people creating and translating and then adapting articles. Well ... combining various Open Source and Open Content projects we can now go a further step into the direction of fast contents creation, but that does not mean: stub upload. This is a completely different way of doing things.

Apertium is a machine translation tool that works really great with similar languages. Approx. a year ago I had a translation from Spanish to Catalan done by Apertium through the online interface (http://xixona.dlsi.ua.es/apertium/) and asked some people of the Catalan Wikipedia to have a look at it. They told me that of course it was not perfect, but that it would be easy to proofread it and much faster than actually translating it. In March I made a similar test during a masters for translation studies in Pisa. I asked one of the students who was bilingual Spanish and Catalan to have a look at the outcome of the machine translation of a general text. The grammar was almost perfect and and also the terminology. There were just 5 corrections in a bit more than half a page (A4).

Now what does this mean to us: if we have a bilingual wordlist for two similar languages under a free license, we can pass it on to the Apertium people. From there we are a step closer of getting machine translation for that specific language combinations on their way.

One note inbetween for the Apertium people who might read this: please don't mind me not using specific terminology to describe what needs to be done. It could become to techy.

So the next step is to identify what a term is and how it needs to be handled. That is for example a verb needs to be declared as such, then one needs to give it a tag that indicates which conjugation scheme needs to be applied. This needs doing for all word types, that is verbs, nouns, adjectives etc. After that grammar rules need to be considered. Step by step the correctness level will be improved and the time invested to complete wordlists which will be available as google doc spreadsheet and to add all the additional information will help to save a lot of time. That is: now it will take longer, once the engine "learnt" how to deal with the terminology and grammar for that specific language combination creating contents will become much faster. This will help the small projects in such a way that the few editors can concentrate on proof reading and adapting and will result in a faster contents growth that has quite high quality.

This project that is going to care about less resourced languages will be one of the first lead through Vox Humanitatis. Should you be interested in helping with the wordlists, please let us know which language combination you would like to work on (that is starting from English right now and step by step from others since most of the Terminology is there in English). We will get you the access to the online document. If you need to work offline, please let us know. You can contact me by e-mail: s.cretella (at) voxhumanitatis.org

I just received a list of the supported language combinations as well as an example for Catalan-Occitan and some notes on evaluation of machine translation co-operating with a Wikipedia community. This means I have quite some further stuff to tell you. I'll post that info tomorrow, otherwise this blog would become too long.

Please also note that the documents will be released under CC-BY license and therefore they can be integrated into any wiktionary.

Khalil Gibran über die Musik

Die Musik wirkt wie die Sonne, die alle Blumen des Feldes mit ihrem Strahlen zum Leben erweckt. ( Khalil Gibran ) Image by Pete Linforth fr...