WordInfo - Useful linguistic information of headwords

qhwa · April 23, 2020, 1:17am

Hi fellows,

word_info is a small library providing some useful information for headwords.

So far it provides:

Syllables
Frequency of use
Pronunciations, in IPA and ARPABET formats

I developed this library for my dictionary project. I hope someone else finds it useful too!

Any feedback is welcome.

Please have fun!

qhwa · April 24, 2020, 10:20am

FYI, word_info v0.2.0 has been published. Changelog:

Performance improvements.
In the first release, the dictionary data was compiled directly into BEAM code. This could freeze the application for about 3 seconds on startup, which may be a problem for an application that needs to boot as fast as possible.

In v0.2.0, the data is converted into an ETS table dump at compile time. This keeps the code tidy and makes booting fast.
Fixed a typo in a return value.

PS. I want to thank everyone involved in this thread for the valuable discussion, which led me to a better approach:

Phillipp · April 24, 2020, 10:32am

Seems like it only supports English. Is there a way to support all languages? Or is there no data available for that?

qhwa · April 24, 2020, 2:53pm

Oh, actually I didn’t know this could work for other languages too. Which language do you need? And does it have rules like those of English?

Phillipp · April 27, 2020, 12:40pm

I do not actually need this right now, but I am sure that if someone has a need for such a thing, it might also be needed for languages other than English.

Cochonours · April 27, 2020, 1:26pm

That’s what I was going to say as well. I would like to be able to use such a library, but I would consider it only if it works for the different languages supported on my website.

It would be super nice to get the IPA pronunciation of words in many different languages as well. ARPABET seems to be restricted to English, so that function would only work for that language, I guess.

I think both pronunciation and syllables can be derived programmatically for quite a few languages (easy mode for Spanish/Italian/Korean), if no extensive resource exists on the net.
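
As a rough illustration of what "derived programmatically" could look like for Spanish, here is a minimal Python sketch that counts syllables without a dictionary. It is not part of word_info; the function name `count_spanish_syllables` is hypothetical. It assumes the usual orthographic rule that two adjacent strong vowels (a, e, o, or an accented í/ú) form a hiatus, while any other vowel sequence is one diphthong or triphthong. It ignores edge cases such as a word-final vocalic "y" standing alone.

```python
import re
import unicodedata

# Vowels that start a new syllable when they follow another strong vowel.
# Accented í/ú break a diphthong, so they are treated as strong here.
STRONG = set("aeoáéóíú")


def count_spanish_syllables(word: str) -> int:
    """Approximate the syllable count of a Spanish word from its spelling."""
    word = unicodedata.normalize("NFC", word.lower())
    total = 0
    # Each maximal run of vowels holds at least one syllable nucleus.
    for run in re.findall(r"[aeiouáéíóúü]+", word):
        # Every strong-strong pair inside the run is a hiatus: one more syllable.
        hiatuses = sum(
            1 for a, b in zip(run, run[1:]) if a in STRONG and b in STRONG
        )
        total += 1 + hiatuses
    return total


for w in ["casa", "poeta", "día", "ciudad"]:
    print(w, count_spanish_syllables(w))
```

This only counts syllables; finding the actual syllable boundaries also needs the consonant-cluster rules, which are left out of this sketch.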

qhwa · April 28, 2020, 1:26am

Thank you for your suggestions, Phillipp and Cochonours, and I agree with you.

IMO, it will be easy as long as the target language has similar rules to English, but the hardest part is the data source.

Can anyone in the community give some feedback or guidance on what we would need in order to cover your language? Thanks in advance.

Cochonours · April 28, 2020, 7:30am

The French Wiktionary, fr.wiktionary.org (and probably others), usually has the IPA info.

What rules do you need aside from IPA and syllables? Most Latin languages have automatic syllabic rules (not random as in English) so you wouldn’t even need a dictionary to derive them. Korean letters are grouped by syllables as well so it’s even more straightforward (한글 : 2 syllables, 조선글 : 3 syllables).
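
To make the Korean case concrete, here is a minimal Python sketch (not part of word_info; the function name is hypothetical). It assumes that counting syllables amounts to counting precomposed Hangul syllable blocks, which occupy the Unicode range U+AC00 to U+D7A3. The text is normalized to NFC first so that decomposed jamo sequences are composed into blocks.

```python
import unicodedata


def count_hangul_syllables(word: str) -> int:
    """Count precomposed Hangul syllable blocks (U+AC00..U+D7A3) in `word`."""
    # NFC composes conjoining jamo into single syllable-block code points.
    composed = unicodedata.normalize("NFC", word)
    return sum(1 for ch in composed if "\uac00" <= ch <= "\ud7a3")


print(count_hangul_syllables("한글"))    # 2
print(count_hangul_syllables("조선글"))  # 3
```

Characters outside that range (Latin letters, digits, punctuation) are simply ignored by this sketch.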

As for frequency of use, it’s more difficult for languages whose words change depending on their grammatical function, but an idea would be to use a selection of movies with good dialogues and derive the word frequencies per translation.
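
A minimal Python sketch of that idea, assuming the dialogue is already available as plain lines of text (the sample lines and function names below are made up for illustration). It counts surface forms only, so inflected variants of one word are counted separately, which is exactly the difficulty mentioned above; grouping them would need a lemmatizer, which this sketch does not attempt.

```python
import re
from collections import Counter


def word_counts(lines):
    """Count lowercase word tokens across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # [^\W\d_]+ matches runs of letters in any script (no digits or "_").
        counts.update(re.findall(r"[^\W\d_]+", line.lower()))
    return counts


def relative_frequencies(counts):
    """Turn raw counts into shares of all tokens (0.0 to 1.0)."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


# Hypothetical sample dialogue; a real corpus would be subtitle files.
sample = ["The cat sat.", "The dog sat too."]
counts = word_counts(sample)
print(counts.most_common(2))
print(relative_frequencies(counts)["the"])
```

Note that the tokenizer splits on apostrophes and hyphens, so a form like "don't" becomes two tokens; a real corpus would need language-specific tokenization.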

qhwa · May 6, 2020, 6:24am

I read the instructions on French syllables. Unfortunately, I found that I lack the necessary knowledge to support other languages at the moment, so I will leave that to others who are better suited for the job. Thank you for the heads-up about potential uses of this library. I’ll keep it in mind for future updates.