Sharing with the community: text transcoding libraries

sasajuric · November 16, 2018, 6:18pm

I still don’t understand. With the library, we distribute the source code, not the binaries. So binary compatibility is not the problem of the library author.

dimitarvp · November 16, 2018, 6:24pm

Oh, I see. I was addressing what I said earlier: I wonder if I can modify the library to ship pre-packaged .beam files that represent all code pages, compiled on my machine.

But then I stumble upon the problem of striking a balance between a minimum Erlang version that (1) ensures maximum backwards compatibility but (2) ensures fast code.

(Michal mentioned missing out on improved APIs or improved compiler.)

sasajuric · November 16, 2018, 6:28pm

Why would you do that? As you can see there are clearly some issues with this approach, so I wonder what is the gain?

dimitarvp · November 16, 2018, 6:29pm

Mostly to dodge the original problem that sparked this discussion, so people don’t have to find out about mix deps.compile codepagex --force after they have already installed the library and now want to add code pages.

Also trying to figure out a good solution that doesn’t choke CI/CD machines.

sasajuric · November 16, 2018, 6:36pm

There are simpler and more reliable ways of achieving this, for example by having the library accept the parameters via macro or function args, or by generating and exposing all the modules for all supported encodings (but doing it during compilation, not offering them as external .beam files).

Given that binaries are usually cached on such machines, I wouldn’t expect this to be a problem.

dimitarvp · November 16, 2018, 6:41pm

Okay, I am beaten.

Do you stand by your initial suggestion on how to improve Codepagex or do you now think you’d do it another way?

dimitarvp · November 16, 2018, 8:59pm

btw, @tallakt merged the documentation PR into Codepagex.

sasajuric · November 16, 2018, 10:34pm

I didn’t really look deeper into the library so I can’t make stronger recommendations. That said, based on this discussion, I see the following options:

1. Generate all the code for all encodings at compile time

In this approach, we’d be able to invoke something like Codepagex.CP1251.from_string(...)

I believe this is the most straightforward solution, and it’s the one I’d try first. A possible blocker is that this might increase the first compilation time and cause significant disk/memory overhead. I think it should be tried out, and the overhead should be measured.

2. The use solution I originally proposed

In this approach, the client can cherry-pick the encodings it needs, so we generate the minimal amount of code. The interface is a bit clumsier, but it resolves the possible blockers from #1.

There are some other options that could be considered, such as deferring the decision to runtime, or splitting the library into multiple libraries, but for a few reasons I’m not inclined to try those out (again, I didn’t study the lib code in detail, so my assumptions might be wrong).
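The first option, one generated module per encoding, could look roughly like the sketch below. This is not Codepagex code: the module name CodepageSketch and the two tiny tables are made up for illustration, and a real implementation would load full tables from the mapping files.

```elixir
defmodule CodepageSketch do
  # Hypothetical, heavily truncated tables: {byte, Unicode code point} pairs.
  @tables %{
    CP1251: [{0xC0, 0x0410}, {0xC1, 0x0411}],
    ISO8859_1: [{0xE9, 0x00E9}]
  }

  # One nested module per encoding, defined while this module compiles.
  for {name, pairs} <- @tables do
    defmodule Module.concat(__MODULE__, name) do
      @moduledoc false

      def to_string(binary), do: decode(binary, [])

      defp decode(<<>>, acc),
        do: {:ok, acc |> Enum.reverse() |> IO.iodata_to_binary()}

      # One function clause per table entry.
      for {byte, code_point} <- pairs do
        defp decode(<<unquote(byte), rest::binary>>, acc),
          do: decode(rest, [<<unquote(code_point)::utf8>> | acc])
      end

      # ASCII passes through unchanged in this sketch.
      defp decode(<<byte, rest::binary>>, acc) when byte < 0x80,
        do: decode(rest, [byte | acc])

      defp decode(_other, _acc), do: {:error, :unmapped_byte}
    end
  end
end
```

With this shape, CodepageSketch.CP1251.to_string(<<0xC0, 0xC1>>) would decode two Cyrillic letters. The cost is that every table is compiled whether or not it is used, which is the overhead that would need measuring.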

tallakt · November 17, 2018, 7:20am

Hi. I’d be happy to improve the interface for Codepagex. Some of the encodings are quite large, and this is why Codepagex is configurable in the first place. Option 2 seems fine, except it seems a bit clumsy. An option where codepages are dynamically compiled on use seems to be the best one; I’d have to think about how this could be implemented.

Whatever I/we decide, compatibility will also be an issue, as Codepagex does compile a few codepages by default, and changing the selection algorithm could break code.

Speaking of Codepagex, there is still one important feature missing, which is support for iolists. These would boost performance, as for many conversions string parts could be reused…
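To illustrate the iolist point, here is a minimal sketch (not Codepagex code; IodataSketch and its one-entry table are hypothetical). Runs of ASCII bytes are returned as slices of the input binary, and only the bytes that actually change get a new binary.

```elixir
defmodule IodataSketch do
  # Hypothetical one-entry table: byte 0xE9 maps to U+00E9.
  defp convert(0xE9), do: <<0x00E9::utf8>>

  def to_iodata(binary), do: scan(binary, binary, 0, [])

  # `run` is the binary starting at the current ASCII run, `len` its length.
  defp scan(<<byte, rest::binary>>, run, len, acc) when byte < 0x80,
    do: scan(rest, run, len + 1, acc)

  # Non-ASCII byte: emit the pending run as a slice, then the converted byte.
  defp scan(<<byte, rest::binary>>, _run = run, len, acc),
    do: scan(rest, rest, 0, [acc, binary_part(run, 0, len), convert(byte)])

  defp scan(<<>>, run, len, acc), do: [acc, binary_part(run, 0, len)]
end
```

The result can be passed directly to functions that accept iodata, or flattened with IO.iodata_to_binary/1 when a plain string is needed.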

LostKobrakai · November 17, 2018, 11:01am

For encoding big bunches of behavior into Elixir modules one can look into ex_cldr. It’s taking the officially provided JSON files and generating modules based on them.

If the external files -> code translation should not happen on the library user’s machine, it could also be done the way e.g. nimble_parsec does it, by generating an already processed Elixir module, which would then be published.

dimitarvp · November 17, 2018, 11:56am

I might be ignorant here – but does ex_cldr do text transcoding as described in my OP?

josevalim · November 17, 2018, 1:52pm

How many codepages are there? If a dozen, then you can generate empty modules for all of them that, when invoked, tell the user exactly how they can enable those modules. So you would print something like:

You are using CP1234 but we have not compiled it. Please add this to your config and then run mix deps....

Another option is for you to dynamically generate something whenever it is missing but print a warning. That’s what we do for mime: https://github.com/elixir-plug/mime/blob/master/lib/mime/application.ex
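The stub-module idea could be sketched like this. The names StubSketch, @all and @enabled are hypothetical, and the message wording is only a placeholder for whatever instructions the library would actually print.

```elixir
defmodule StubSketch do
  # Hypothetical: every known encoding, and the subset selected in config.
  @all [:CP1251, :CP1252, :CP932]
  @enabled [:CP1252]

  # For each encoding that was NOT selected, define a stub module whose
  # functions raise with instructions instead of an undefined-module error.
  for name <- @all -- @enabled do
    defmodule Module.concat(__MODULE__, name) do
      @moduledoc false
      @message "You are using #{name} but it has not been compiled. " <>
                 "Add it to your config and recompile the dependency."

      def to_string(_binary), do: raise(ArgumentError, @message)
    end
  end
end
```

Calling StubSketch.CP1251.to_string(...) then fails with a message that names the missing encoding, while enabled encodings would get their real modules elsewhere.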

dimitarvp · November 17, 2018, 1:56pm

iex> Codepagex.encoding_list(:all) |> length
63

tallakt · November 17, 2018, 2:31pm

The problem is basically that you select codepages at compile time. All the small ones are included by default but a few are left out, e.g. the Japanese ones, which are really big, and the UTF-16 stuff that I don’t much like anyway. But it’s there for anyone who wants it.

The string conversion functions are normal functions where the code page is an atom parameter, evaluated at runtime. So it’s not easy with the current structure to make sure that any necessary codepages are automatically compiled.

If we made each codepage a module, you would need to implement your own «codepage name» to codepage module mapping. This is maybe the best solution, but I can’t see a way to make that compatible with the existing library.

I thought about the empty module functionality, but this is in reality not much different from the current functionality, with an additional runtime description added. This would be an improvement over the current code.

The solution we have is basically ok, except mix will not compile if you change the config (this is my understanding though I have not had time to look into this in detail).

Perhaps we should search for solutions in mix? Perhaps a sort of callback function in a module that triggers recompilation at compile time? One solution could be that any library could generate a hash value based on the mix.exs configuration and any other external assets; mix would then recompile every time the hash value changes.

josevalim · November 17, 2018, 3:44pm

We have been thinking about introducing a specific storage for compile time configuration. But it will take a while for this to become a reality, so I would recommend at least doing what plug-mime is doing, which is to “store in a module” the config you used to compile and compare it with the config value at app boot. Basically what you suggested, but done by hand.
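A minimal sketch of the “store in a module” technique, assuming a hypothetical app name :my_transcoder and config key :encodings. It reads the config with plain Application.get_env/3 in the module body, which freezes the value at compile time; newer Elixir versions may warn about this and suggest Application.compile_env/3 instead.

```elixir
defmodule ConfigCheckSketch do
  # Evaluated once, when this module is compiled, and stored in the .beam.
  @compiled_encodings Application.get_env(:my_transcoder, :encodings, [])

  def compiled_encodings, do: @compiled_encodings

  # Meant to be called once at boot, e.g. from the application's start/2.
  def check_config do
    current = Application.get_env(:my_transcoder, :encodings, [])

    if current == @compiled_encodings do
      :ok
    else
      IO.puts(
        :stderr,
        "encodings config changed since compilation: " <>
          "compiled with #{inspect(@compiled_encodings)}, now #{inspect(current)}"
      )

      {:stale, current}
    end
  end
end
```

This version only warns. Redefining the module at boot, as described for mime, would replace the warning branch with a recompilation step.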

tallakt · November 17, 2018, 5:44pm

Thanks - I’ll look at that project for guidance. Are they able to trigger a recompilation or just a warning to the console with this technique?

josevalim · November 17, 2018, 9:32pm

In that particular example we redefine the module. But you don’t need to do something so drastic, so yeah, you could warn for example.

sasajuric · November 18, 2018, 7:14am

A straightforward solution to that problem is to provide a macro which accepts codepages as parameters and generates the corresponding code.

Another option is to provide the small ones in one library and the big ones in separate libraries. That’s more hassle for library maintainers, but it’s straightforward for clients to use.

The approach of regenerating at boot (as mime does) has a couple of issues:

We still need to manually remove the cached binaries on the CI/CD machine
Boot time is unexpectedly increased.
It won’t help if the library is used at compile time (e.g. @foo Codepagex.bar(...))

Which existing library? Are you worried about backwards compatibility?

tallakt · November 18, 2018, 9:22am

Yes I was thinking of backwards compatibility. I like your suggestions @sasajuric, as we would bring the configuration into the client code in a straightforward way. I might even consider making this a breaking change. So we’d end up with what you suggested:

defmodule MyCodepagex do
  use Codepagex, encodings: ~w(iso_8859_1 UTF16_LE)

  # ...
end


# ...
MyCodepagex.to_string(:iso_8859_1, my_binary)
#...

If we decide on this, the __using__ macro could be added now, and direct use of Codepagex functions may be deprecated.
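As a rough idea of what such a __using__ macro might do, here is a self-contained sketch. It is not the Codepagex implementation: TranscoderSketch, its tiny tables, and the decode/2 name are assumptions, and it takes a literal list of atoms rather than the ~w(...) sigil to keep the macro short.

```elixir
defmodule TranscoderSketch do
  # Hypothetical, heavily truncated tables: {byte, Unicode code point} pairs.
  @tables %{
    iso_8859_1: [{0xE9, 0x00E9}],
    cp1251: [{0xC0, 0x0410}]
  }

  def table(name), do: Map.fetch!(@tables, name)

  defmacro __using__(opts) do
    # bind_quoted evaluates opts in the caller and keeps unquote/1 available
    # for generating one function clause per table entry.
    quote bind_quoted: [opts: opts] do
      for name <- Keyword.fetch!(opts, :encodings),
          {byte, code_point} <- TranscoderSketch.table(name) do
        defp lookup(unquote(name), unquote(byte)), do: unquote(code_point)
      end

      # ASCII passes through; anything else raises FunctionClauseError.
      defp lookup(_encoding, byte) when byte < 0x80, do: byte

      def decode(encoding, binary) do
        for <<byte <- binary>>, into: "", do: <<lookup(encoding, byte)::utf8>>
      end
    end
  end
end

defmodule MyTranscoder do
  # Only the requested tables are compiled into this module.
  use TranscoderSketch, encodings: [:iso_8859_1]
end
```

Here MyTranscoder.decode(:iso_8859_1, <<0xE9>>) works, while the cp1251 table is never compiled into MyTranscoder, so the client pays only for what it lists.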

I think the downside of this solution is that it’s less nice for the casual user who just wants to convert a binary using one of the conversions included by default. Without knowing about the compilation issues, the structure above does not make much sense to me.

After thinking about the other proposal, which is adding a warning when the config changes, I don’t think it’s a good idea, as we have no way to perform this check at startup. It would have to be run every time the Codepagex.to_string function is used, causing a lot of overhead.

@sasajuric your proposal seems to be the only option with no really big downsides…

dimitarvp · November 18, 2018, 1:41pm

It’s the same with the current implementation for myself as well – I just wanted a library that supported cp1251 out of the box and definitely wouldn’t have ever guessed I have to run mix deps.compile codepagex --force after changing the config.