Data_url: Parse data: URLs and access the data contained therein

I needed to parse some data: URLs to get the binary data, and the built-in URI module didn’t do it. So here’s a small library for parsing them. If someone wants to add %DataUrl{} -> to_string(), I’d gladly accept it.

From the README:

data_url

Hex.pm Documentation

Parse data: URLs and access the data contained therein

Usage

iex(1)> DataUrl.parse("data:text/plain;charset=ISO-8859-1;base64,SGVsbG8sIFdvcmxkIQ==")
{:ok,
 %DataUrl{data: "Hello, World!", mime_type: "text/plain;charset=ISO-8859-1"}}
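
If someone does pick up the %DataUrl{} -> to_string() idea, something along these lines is roughly what I have in mind - just a rough, untested sketch against the two fields shown above (the DataUrl.Sketch name and the always-base64 output are placeholder choices, not part of the library):

defmodule DataUrl.Sketch do
  # Hypothetical serializer for the struct above; always emits the
  # base64 form, which is safe for arbitrary binary data.
  def to_string(%DataUrl{data: data, mime_type: mime_type}) do
    "data:" <> mime_type <> ";base64," <> Base.encode64(data)
  end
end

iex> DataUrl.Sketch.to_string(%DataUrl{data: "Hello, World!", mime_type: "text/plain;charset=ISO-8859-1"})
"data:text/plain;charset=ISO-8859-1;base64,SGVsbG8sIFdvcmxkIQ=="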

ex_url also parses data URIs (as well as geo, tel, mailto and uuid URIs). It also includes a to_string/1 function 🙂

iex> URL.parse "data:text/plain;charset=ISO-8859-1;base64,SGVsbG8sIFdvcmxkIQ=="                   
%URL{
  authority: nil,
  fragment: nil,
  host: nil,
  parsed_path: %URL.Data{
    data: "Hello, World!",
    mediatype: "text/plain",
    params: %{"charset" => "ISO-8859-1", "encoding" => "base64"}
  },
  path: "text/plain;charset=ISO-8859-1;base64,SGVsbG8sIFdvcmxkIQ==",
  port: nil,
  query: nil,
  scheme: "data",
  userinfo: nil
}

iex> URL.parse("data:text/plain;charset=ISO-8859-1;base64,SGVsbG8sIFdvcmxkIQ==") |> URL.to_string
"data:text/plain;charset=ISO-8859-1;base64,SGVsbG8sIFdvcmxkIQ=="

(I’m the author)

Cool! I did try searching Hex and elsewhere for a library to parse data URLs before making this. I did eventually find ex_url, but I saw a few minor issues with it, so I decided to publish data_url anyway. Maybe I should just make a PR to ex_url.

  1. the charset should default to US-ASCII if the mediatype is omitted
  2. the charset is part of the media type, not a separate thing. If it were to be parsed, I’d prefer a separate MimeType struct.
  3. URL.parse("data:") explodes, but I would prefer an :error
  4. URL.parse("data:;base64,=") explodes, but I would prefer an :error (some rough sketches of what I mean by 2-4 follow this list)
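
On 2, what I picture is something along these lines (a hypothetical module, not something that exists in either library):

defmodule MimeType do
  # e.g. %MimeType{type: "text", subtype: "plain", params: %{"charset" => "ISO-8859-1"}}
  defstruct type: "text", subtype: "plain", params: %{}
end

And to be concrete about 3 and 4, here is a rough, untested sketch of the error-tuple behaviour I have in mind (a standalone illustration - the DataUrlSketch name and :malformed_data_url reason are placeholders, and this is not how data_url or ex_url actually work):

defmodule DataUrlSketch do
  # Malformed input yields {:error, _} instead of raising.
  def parse("data:" <> rest) do
    with [mediatype, payload] <- String.split(rest, ",", parts: 2),
         {:ok, data} <- decode(mediatype, payload) do
      {:ok, %{mediatype: mediatype, data: data}}
    else
      # "data:" has no comma at all, so the split pattern above fails
      _ -> {:error, :malformed_data_url}
    end
  end

  def parse(_other), do: {:error, :malformed_data_url}

  # "data:;base64,=" fails here: "=" is not valid base64, so
  # Base.decode64/1 returns :error and the `with` falls through.
  defp decode(mediatype, payload) do
    if String.ends_with?(mediatype, ";base64") do
      Base.decode64(payload)
    else
      {:ok, URI.decode(payload)}
    end
  end
end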

Wow, appreciate the bug reports - thank you! I’m on it right away.

Also, hope I didn’t seem like I was hijacking your announcement - great to see new libraries getting built. Well done and much appreciated.

Now back to some bug eradication 🙂

On reviewing RFC 6838, it seems like UTF-8 should now be considered the default charset parameter for a mediatype? Your thoughts?

From Section 4.2.1:

If a “charset” parameter is specified, it SHOULD be a required parameter, eliminating the options of specifying a default value. If there is a strong reason for the parameter to be optional despite this advice, each subtype MAY specify its own default value, or alternatively, it MAY specify that there is no default value. Finally, the “UTF-8” charset [RFC3629] SHOULD be selected as the default. See [RFC6657] for additional information on the use of “charset” parameters in conjunction with subtypes of text.

Regardless of what approach is chosen, all new text/* registrations MUST clearly specify how the charset is determined; relying on the US-ASCII default defined in Section 4.1.2 of [RFC2046] is no longer permitted. If explanatory text is needed, this SHOULD be placed in the additional information section of the registration.

I was going mostly off this MDN page, which refers to RFC 2397, which says:

If <mediatype> is omitted, it defaults to text/plain;charset=US-ASCII. As a shorthand, “text/plain” can be omitted but the charset parameter supplied.
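
For reference, that defaulting rule is simple enough to encode. A rough sketch, independent of either library, applied to the <mediatype> part between "data:" and the first comma (assuming any trailing ;base64 marker has already been stripped off):

defmodule MediatypeSketch do
  # Omitted entirely -> the RFC 2397 default
  def default(""), do: "text/plain;charset=US-ASCII"
  # Shorthand: "text/plain" omitted but parameters (e.g. charset) supplied
  def default(";" <> params), do: "text/plain;" <> params
  # Anything else is taken as given
  def default(mediatype), do: mediatype
end

iex> MediatypeSketch.default("")
"text/plain;charset=US-ASCII"
iex> MediatypeSketch.default(";charset=ISO-8859-1")
"text/plain;charset=ISO-8859-1"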

Good call. I suspect it's a case of RFCs falling out of sync, but I agree, US-ASCII it is.