Difference between Kernel.byte_size and String.codepoints |> Enum.count

I have been working on downloads. One of the things that I have to do is calculate the size of the streamed download. I thought I was doing this fine, but then found that if I compared the size I calculated and the size of the file when it was downloaded, my calculation was always smaller.
Originally, I was using String.length, which counts graphemes. OK, got it: a grapheme may be made up of multiple bytes. So we switched to Kernel.byte_size. Now our calculation was a few bytes larger than what macOS showed with ls -l. Closer, but still not there. Then we switched to String.codepoints |> Enum.count and it was spot on. Looking at the docs for byte_size, I noticed this note:

Returns the number of bytes needed to contain bitstring.
That is, if the number of bits in bitstring is not divisible by 8, the resulting number of bytes will be rounded up (by excess). This operation happens in constant time.

That seems to explain the problem, but I don’t understand the value of rounding up. Is that an optimization, sacrificing some accuracy for speed?
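
For what it's worth, the rounding in that note only matters for bitstrings whose length is not a whole number of bytes. An Elixir string is a UTF-8 binary, so its bit count is always a multiple of 8 and byte_size gives an exact count for it. A minimal sketch, with made-up values rather than anything from the download code:

```elixir
# A 3-bit bitstring does not fill a byte, so byte_size/1 rounds up to 1.
three_bits = <<1::size(3)>>
IO.inspect(bit_size(three_bits))   # 3
IO.inspect(byte_size(three_bits))  # 1

# 9 bits need two bytes of storage.
nine_bits = <<0::size(9)>>
IO.inspect(byte_size(nine_bits))   # 2

# A string is a binary (whole bytes only), so nothing gets rounded.
# \u00E9 is a precomposed "é", which takes 2 bytes in UTF-8.
text = "h\u00E9llo"
IO.inspect(is_binary(text))        # true
IO.inspect(bit_size(text))         # 48
IO.inspect(byte_size(text))        # 6
```

So the rounding is about how many bytes it takes to hold a partial byte, not a trade of accuracy for speed.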

Code points and grapheme cluster

As per the standard, a code point is a single Unicode Character, which may be represented by one or more bytes.
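
To make the three ways of counting concrete, here is a small sketch. The strings are illustrative choices, not from the original question: one is "e" plus a combining accent, the other is a single emoji.

```elixir
# "e" followed by a combining acute accent (U+0301):
# one visible character built from two code points.
accented = "e\u0301"
IO.inspect(String.length(accented))                      # 1 grapheme
IO.inspect(accented |> String.codepoints() |> length())  # 2 code points
IO.inspect(byte_size(accented))                          # 3 bytes (1 + 2)

# A single code point outside the Basic Multilingual Plane
# takes 4 bytes in UTF-8.
emoji = "\u{1F600}"
IO.inspect(String.length(emoji))                         # 1 grapheme
IO.inspect(emoji |> String.codepoints() |> length())     # 1 code point
IO.inspect(byte_size(emoji))                             # 4 bytes
```

Only the last number in each group says anything about how much space the string takes on disk.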


iex(1)> temp = "f𝌆"
"f𝌆"
iex(2)> String.codepoints(temp) |> Enum.count()
2
iex(3)> Kernel.byte_size(temp)
5
iex(4)> {:ok, file} = File.open("temp.txt", [:write])
{:ok, #PID<0.109.0>}
iex(5)> IO.binwrite(file, temp)
:ok
iex(6)> File.close(file)
:ok
iex(7)> File.stat("temp.txt")
{:ok,
 %File.Stat{
   access: :read_write,
   atime: {{2019, 11, 20}, {21, 29, 22}},
   ctime: {{2019, 11, 20}, {21, 29, 21}},
   gid: 20,
   inode: 24398376,
   links: 1,
   major_device: 16777222,
   minor_device: 0,
   mode: 33188,
   mtime: {{2019, 11, 20}, {21, 29, 21}},
   size: 5,
   type: :regular,
   uid: 501
 }}
iex(8)> 
BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
       (v)ersion (k)ill (D)b-tables (d)istribution
a
$ ls -l temp.txt
-rw-r--r--  1 ___  staff  5 20 Nov 16:29 temp.txt
$ 
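
Applied to the streamed download from the question, the total size could be accumulated by summing byte_size over the chunks as they go by. This is only a sketch, assuming the chunks arrive as binaries; `chunks` and `total_bytes` are hypothetical names, and the list stands in for whatever stream the download code actually produces.

```elixir
# Stand-in for streamed chunks. The second and third chunks are the two
# halves of one 4-byte UTF-8 character, split across a chunk boundary.
chunks = ["f", <<0xF0, 0x9D>>, <<0x8C, 0x86>>]

# Counting bytes per chunk works even when a character is split,
# because byte_size/1 never looks at character boundaries.
total_bytes =
  chunks
  |> Stream.map(&byte_size/1)
  |> Enum.sum()

IO.inspect(total_bytes)  # 5
```

Counting graphemes or code points per chunk would not be safe here, since a chunk boundary can fall in the middle of a character.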

That is very interesting. Thank you.