How to read id3v2 tags from mp3 files?

Hi, I’m new to Elixir and, as a learning exercise, I’m trying to read ID3v2 tags from MP3 files. The standard library’s binary pattern matching seems great for this.
I have a question about how to parse some of the data (title, artist, etc.).
According to the spec, there seems to be a flag carrying encoding information, but I don’t know how to read it. I think the title (TIT2) is in UTF-16 because some of the bytes are 0.

My current code:

file = File.stream!("./lib/file.mp3", [:read, :binary], 128)
id3tag_bytes = file |> Enum.take(1) |> hd

<<header::binary-size(10), rest::binary>> = id3tag_bytes

<<"ID3", major::binary-size(1), revision::binary-size(1), flags::binary-size(1),
  size::binary-size(4)>> =
  header

case flags do
  <<0>> -> IO.puts("No extended header")
  _ -> IO.puts("Extended header")
end

<<frame_overview::binary-size(10), frame::binary>> = rest

<<frame_id::binary-size(4), frame_size::binary-size(4), frame_flags::binary-size(2)>> =
  frame_overview

<<frame_size_int::size(4)-unit(8)>> = frame_size
IO.inspect(frame_size_int, label: "Frame Size")
<<title::binary-size(frame_size_int - 10), _rest::binary>> = frame

# id3 = <<header::binary-size(3), _v::binary-size(1), _flags::binary-size(1), _size::binary-size(4)>>

IO.inspect(header, label: "Header")
IO.inspect(:binary.decode_unsigned(major), label: "Version")
IO.inspect(:binary.decode_unsigned(revision), label: "Revision")
IO.inspect(:binary.decode_unsigned(flags), label: "Flags")
IO.inspect(frame_id, label: "Frame ID")
IO.inspect(:binary.decode_unsigned(frame_size), label: "Frame Size")
IO.inspect(size, label: "Size")
IO.inspect(frame_overview, label: "Frame Overview")
IO.inspect(title, label: "Title")
IO.inspect(title |> :unicode.characters_to_binary(:utf16, :utf8), label: "Title")

Here are the first 100 bytes of the file (id3tag_bytes):

<<73, 68, 51, 3, 0, 0, 0, 4, 109, 110, 84, 73, 84, 50, 0, 0, 0, 19, 0, 0, 1,
  255, 254, 76, 0, 97, 0, 32, 0, 81, 0, 117, 0, 234, 0, 116, 0, 101, 0, 84, 80,
  69, 49, 0, 0, 0, 17, 0, 0, 1, 255, 254, 79, 0, 114, 0, 101, 0, 108, 0, 115, 0,
  97, 0, 110, 0, 84, 65, 76, 66, 0, 0, 0, 57, 0, 0, 1, 255, 254, 67, 0, 105, 0,
  118, 0, 105, 0, 108, 0, 105, 0, 115, 0, 97, 0, 116, 0, 105, 0, 111>>

The title in the TIT2 frame should be “La Quête”, which can be found here: <<76, 0, 97, 0, 32, 0, 81, 0, 117, 0, 234, 0, 116, 0, 101>>. As I said above, some of the bytes are 0, so I guess it’s encoded in UTF-16; the tag is readable without the 0 bytes:
<<76, 97, 32, 81, 117, 234::utf8, 116, 101>>

The frame size in the header is 19, but I think we need to subtract the header size, so 19 - 10 = 9. That’s why I think I need to convert UTF-16 to UTF-8 before reading the title; otherwise the size of my title doesn’t match.

Currently, the last IO.inspect prints {:incomplete, "ǿ﹌a ", <<0>>}

Thx :wink:


Welcome to the Elixir community!

If we look at the encoding of the expected title, we can see it’s probably UTF-16LE, since it’s a true subset of the frame payload:

iex> :iconv.convert("utf8", "utf16le", title)
<<76, 0, 97, 0, 32, 0, 81, 0, 117, 0, 234, 0, 116, 0, 101, 0>>

Now that we know what we are looking for, we can examine the actual frame content, which appears to be:

iex> <<header::binary-size(10), rest::binary>> = id3tag_bytes
iex> <<"ID3", major::binary-size(1), revision::binary-size(1), flags::binary-size(1), size::binary-size(4)>> = header
iex> <<frame_overview::binary-size(10), frame::binary>> = rest
iex> <<frame_id::binary-size(4), frame_payload_size::unsigned-integer-32, frame_flags::binary-size(2)>> = frame_overview
iex> frame_payload = <<frame::binary-size(frame_payload_size)>>
iex> <<encoding::unsigned-integer-8, payload::binary>> = frame_payload
  
iex> IO.inspect(frame_payload_size, label: "Frame Payload Size")
Frame Payload Size: 19
iex> IO.inspect encoding, label: "Payload encoding (0 == ISO8859, 1 == UTF16)"
Payload encoding (0 == ISO8859, 1 == UTF16): 1
iex> IO.inspect payload, label: "Frame payload"
Frame payload: <<255, 254, 76, 0, 97, 0, 32, 0, 81, 0, 117, 0, 234, 0, 116, 0, 101, 0>>

Observations of the frame payload

  1. <<255, 254>> is the BOM indicating UTF16 little endian.
  2. The rest of the data is the actual string which we can decode with the following. Note that we don’t include the BOM, but we do use it to determine the endianness for conversion:
iex> title = <<76, 0, 97, 0, 32, 0, 81, 0, 117, 0, 234, 0, 116, 0, 101, 0>>
iex> :unicode.characters_to_binary(title, {:utf16, :little}, :utf8)
"La Quête"

Note that :unicode.characters_to_binary/3 does not detect endianness from a BOM, so the BOM needs to be excluded. :iconv.convert/3 has the same consideration.
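
As a rough sketch of the point above, the BOM can be detected and stripped with :unicode.bom_to_encoding/1 from the Erlang standard library before converting. The module name BomSketch and the function decode_utf16/1 are hypothetical names used only for illustration, and the sketch assumes the payload has already had its leading encoding byte removed:

```elixir
defmodule BomSketch do
  # Hypothetical helper: decodes a UTF-16 payload that starts with a BOM.
  def decode_utf16(payload) do
    case :unicode.bom_to_encoding(payload) do
      {{:utf16, _endian} = encoding, bom_length} ->
        # Drop the BOM bytes, then convert using the endianness it announced.
        <<_bom::binary-size(bom_length), text::binary>> = payload
        :unicode.characters_to_binary(text, encoding, :utf8)

      _other ->
        # No UTF-16 BOM found; this sketch does not guess an encoding.
        {:error, :no_utf16_bom}
    end
  end
end

payload = <<255, 254, 76, 0, 97, 0, 32, 0, 81, 0, 117, 0, 234, 0, 116, 0, 101, 0>>
IO.inspect(BomSketch.decode_utf16(payload), label: "Title")
```

With the TIT2 payload from this thread, this should print "La Quête".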

Possible next steps

  1. For ISO-8859-1 conversion you may need to include :iconv as a dependency.
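
As an alternative worth trying before adding a dependency: the Erlang :unicode module accepts :latin1 (ISO-8859-1) as an input encoding, so the conversion may be possible with the standard library alone. A minimal sketch, where Latin1Sketch and to_utf8/1 are hypothetical names:

```elixir
defmodule Latin1Sketch do
  # Hypothetical helper: converts ISO-8859-1 (Latin-1) bytes to a UTF-8 binary.
  def to_utf8(bytes) do
    :unicode.characters_to_binary(bytes, :latin1, :utf8)
  end
end

# "La Quête" as Latin-1 bytes: ê is the single byte 234.
latin1_title = <<76, 97, 32, 81, 117, 234, 116, 101>>
IO.inspect(Latin1Sketch.to_utf8(latin1_title), label: "Title")
```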

The following script is a proof of concept you might find useful:

<<header::binary-size(10), rest::binary>> = id3tag_bytes
<<"ID3", major::binary-size(1), revision::binary-size(1), flags::binary-size(1), size::integer-size(32)>> = header
<<frame_overview::binary-size(10), frame::binary>> = rest
<<frame_id::binary-size(4), frame_size::unsigned-integer-32, frame_flags::binary-size(2)>> = frame_overview

frame_payload_size = frame_size
frame_payload = <<frame::binary-size(frame_payload_size)>>

{title, encoding} =
  case frame_payload do
    <<1, 0xff, 0xfe, title::binary>> -> {title, {:utf16, :little}}
    <<1, 0xfe, 0xff, title::binary>> -> {title, {:utf16, :big}}
    <<0, title::binary>> -> {title, :iso8859}  
  end

case encoding do
  :iso8859 -> :iconv.convert("iso8859-1", "utf8", title)
  encoding -> :unicode.characters_to_binary(title, encoding, :utf8)
end

Thank you for your answer and the great explanation :slight_smile:

But with your case frame_payload expression I get an error:

(CaseClauseError) no case clause matching: <<1, 195, 191, 195, 190, 76, 0, 97, 0, 32, 0, 81, 0, 117, 0, 195, 170, 0, 116>> Elixir  [127, 3]

It’s because I now use this code to read the file instead of a stream, so the bytes are not the same :thinking:

{:ok, pid} = File.open("./lib/file.mp3", [:read, :binary])
mp3_bytes = IO.read(pid, :eof)
File.close(pid)

With File.stream:

<<73, 68, 51, 3, 0, 0, 0, 4, 109, 110, 84, 73, 84, 50, 0, 0, 0, 19, 0, 0, 1,
  255, 254, 76, 0, 97, 0, 32, 0, 81, 0, 117, 0, 234, 0, 116, 0, 101, 0, 84, 80,
  69, 49, 0, 0, 0, 17, 0, 0, 1, ...>>

With File.open:

<<73, 68, 51, 3, 0, 0, 0, 4, 109, 110, 84, 73, 84, 50, 0, 0, 0, 19, 0, 0, 1,
  195, 191, 195, 190, 76, 0, 97, 0, 32, 0, 81, 0, 117, 0, 195, 170, 0, 116, 0,
  101, 0, 84, 80, 69, 49, 0, 0, 0, 17, ...>>

The second line is different. I need to figure out why, because File.open makes more sense to me for reading the ID3 tag. I think it’s because the function opens the file with a different encoding, but I’m not sure yet.

At least your solution works :slight_smile:

IO.read and File.read return different types. chardata() is not the same as binary()

IO.read/2
@spec read(device(), :eof | :line | non_neg_integer()) :: chardata() | nodata()

File.read/1
@spec read(Path.t()) :: {:ok, binary()} | {:error, posix()}

In your case, I think I would just be doing:

case File.read(path) do
  {:ok, binary} -> extract_tags(binary)
  error -> error
end

Also note that I edited the case clause for :iso8859 to correctly binary pattern match.
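
If you would rather keep File.open, a sketch along these lines should return the untouched bytes. It relies on IO.binread/2, which reads raw bytes, whereas IO.read/2 treats the data as characters (which would explain 255 turning into 195, 191 above). RawReadSketch and read_raw/1 are hypothetical names, and the :eof argument assumes a reasonably recent Elixir version:

```elixir
defmodule RawReadSketch do
  # Hypothetical helper: returns the file's bytes exactly as stored on disk.
  def read_raw(path) do
    {:ok, device} = File.open(path, [:read, :binary])
    # IO.binread/2 reads raw bytes; IO.read/2 would transcode them as characters.
    bytes = IO.binread(device, :eof)
    File.close(device)
    bytes
  end
end

# Usage (path taken from the earlier post):
# mp3_bytes = RawReadSketch.read_raw("./lib/file.mp3")
```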


doesn’t this solve it?


Here’s my final experiment in case it’s useful to you:

defmodule MP3 do
  def extract_tags(<<>>) do
    []
  end

  def extract_tags(<<0, _rest::binary>>) do
    []
  end

  def extract_tags(<<"ID3", _major::binary-size(1), _revision::binary-size(1), _flags::binary-size(1), _size::integer-size(32), rest::binary>>) do
    extract_tags(rest)
  end
  
  def extract_tags(data) do
    <<frame_overview::binary-size(10), frame::binary>> = data
    <<frame_id::binary-size(4), frame_payload_size::unsigned-integer-32, _frame_flags::binary-size(2)>> = frame_overview
    <<frame_payload::binary-size(frame_payload_size), rest::binary>> = frame

    {text, encoding} =
      case frame_payload do
        <<1, 0xff, 0xfe, text::binary>> -> {text, {:utf16, :little}}
        <<1, 0xfe, 0xff, text::binary>> -> {text, {:utf16, :big}}
        <<0, text::binary>> -> {text, :iso8859}  
      end
    
    frame_text =
      case encoding do
        :iso8859 -> :iconv.convert("iso8859-1", "utf8", text)
        encoding -> :unicode.characters_to_binary(text, encoding, :utf8)
      end

    [{frame_id, frame_text} | extract_tags(rest)]
  end
end

Example

iex> id3tag_bytes = File.read!("../Hells Bells.mp3")
iex> MP3.extract_tags(id3tag_bytes)
[
  {"TPE1", "AC/DC"},
  {"TYER", "1980"},
  {"TALB", "Back In Black"},
  {"TRCK", "01"},
  {"TIT2", "Hells Bells"}
]

Awesome. Thank you. I need to practice recursion more to loop over things. I have too many habits from OOP and imperative languages in general.

It seems like an amazing language :). It’s a pity Elixir isn’t aimed at CLI applications; it would be nice to have only one language for everything :slight_smile:

I need to continue with the ID3 tag: sometimes it can contain an extended header, flags for compression, encoding, etc. Still a lot of work to do.

And why don’t you need to get info from the header flags? id3v2.3.0 - ID3.org
Is it only useful for writing and not for reading?
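
For reference, here is a rough sketch of how the header flags and tag size could be read. It is based on my reading of the ID3v2.3 spec: the flags byte carries unsynchronisation, extended header and experimental bits (in that order, from the most significant bit), and the tag size is stored as four bytes with the top bit of each byte set to zero, giving 28 significant bits. If that reading is right, the flags do matter for reading too, since an extended header shifts where the first frame starts. HeaderSketch and parse/1 are hypothetical names:

```elixir
defmodule HeaderSketch do
  # Hypothetical helper: splits an ID3v2.3 tag header into its fields.
  def parse(<<"ID3", major, revision,
              unsync::1, extended::1, experimental::1, _reserved::5,
              0::1, s1::7, 0::1, s2::7, 0::1, s3::7, 0::1, s4::7,
              rest::binary>>) do
    # Reassemble the four 7-bit groups into one 28-bit integer.
    <<tag_size::28>> = <<s1::7, s2::7, s3::7, s4::7>>

    header = %{
      version: {major, revision},
      unsynchronisation: unsync == 1,
      extended_header: extended == 1,
      experimental: experimental == 1,
      tag_size: tag_size
    }

    {header, rest}
  end
end

# First bytes of the file posted at the top of this thread.
{header, _rest} = HeaderSketch.parse(<<73, 68, 51, 3, 0, 0, 0, 4, 109, 110, 84, 73, 84, 50>>)
IO.inspect(header, label: "Header")
```

For the bytes posted above, this gives version {3, 0}, all three flags false, and a tag_size of 79598 bytes (not counting the 10-byte header).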