File.stream vs File.read

Hello everyone

I am creating a CLI that implements the same functions as the wc CLI on Unix systems. I was able to achieve the expected result using File.read!, but it loads everything into memory, so I decided to use a stream to solve the problem. Somehow the results are different. Does anyone know what I am missing here? Thanks

Could you please provide an example and the results for both approaches?

test.txt

This is the file that I'm using.

The values returned by File.read! and File.stream! are, respectively:

339292
332147

The file is 7145 lines long. The difference between the results is 7145.
Explanation: File.stream! streams the file line by line. Lines in files are separated by the \n symbol. So File.read returns this symbol, while File.stream! doesn’t, and that is where the difference comes from.

Anyway, in your solution, you should stream the file by bytes, not by lines.

Nice! Thanks for the tip! So basically I should do something like this?

    file_path
    |> File.stream!([], 1024)
    |> Stream.map(fn line ->
      String.codepoints(line)
    end)
    |> Enum.reduce(0, fn code_points, acc ->
      Enum.count(code_points) + acc
    end)

It gives a closer result but it is still wrong, and it also changes the final
value depending on the size of chunk_bytes I pass in…
I thought I should read byte by byte, but that did not work either, so
I tried random numbers, 1024 and 2048, and they were all wrong and not consistent.

Why do you count codepoints? UTF-8 requires alignment, while wc -m counts chars (which are more like graphemes, but I am not sure that wc -m works with UTF-8).

To be honest, I don't know. I was just trying things out: when I used File.read! and then String.codepoints it worked as expected, so I assumed it was correct.

String.codepoints returns a list of UTF-8 codepoints. That is not the same as ASCII chars (which, I assume, is what wc -m is counting).
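To make the distinction concrete, here is a small sketch comparing bytes, codepoints, and graphemes for the same string. The sample strings are illustrative picks of mine ("Tzŭ" is written with an explicit escape so the precomposed form is unambiguous); nothing here is taken from the original poster's code.

```elixir
# "Tzŭ" with a precomposed ŭ (U+016D), which takes two bytes in UTF-8.
name = "Tz\u016D"

IO.inspect(byte_size(name))                 # 4 bytes
IO.inspect(length(String.codepoints(name))) # 3 codepoints
IO.inspect(String.length(name))             # 3 graphemes

# "e" followed by a combining acute accent (U+0301): one visible character.
accented = "e\u0301"

IO.inspect(byte_size(accented))                 # 3 bytes
IO.inspect(length(String.codepoints(accented))) # 2 codepoints
IO.inspect(String.length(accented))             # 1 grapheme
```

So the three counts only agree for plain ASCII text, which is why it matters which of them a given wc flag is supposed to report.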

Inspecting the results of those two reads will help make the cause clearer:

iex(7)> File.read!(file_path)
"\uFEFFThe Project Gutenberg eBook of The Art of War\r\n    \r\nThis ebook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this ebook or online\r\nat www.gutenberg.org. If you are not located in the United States,\r\nyou will have to check the laws of the country where you are located\r\nbefore using this eBook.\r\n\r\nTitle: The Art of War\r\n\r\n\r\nAuthor: active 6th century B.C. Sunzi\r\n\r\nTranslator: Lionel Giles\r\n\r\nRelease date: May 1, 1994 [eBook #132]\r\n                Most recently updated: October 16, 2021\r\n\r\nLanguage: English\r\n\r\nOriginal publication: , 1910\r\n\r\n\r\n*** START OF THE PROJECT GUTENBERG EBOOK THE ART OF WAR ***\r\n\r\n\r\n\r\nSun Tzŭ\r\non\r\nThe Art of War\r\n\r\nTHE OLDEST MILITARY TREATISE IN THE WORLD\r\nTranslated from the Chinese with Introduction and Critical Notes\r\n\r\nBY\r\nLIONEL GILES, M.A.\r\n\r\nAssistant in the Department of Oriental Printed Books and MSS.\r\nin the British Museum\r\n\r\n\r\n\r\n\r\n1910\r\n\r\n\r\n\r\nTo my brother\r\nCaptain Valentine Giles, R.G.\r\nin the hope that\r\na work 2400 years old\r\nmay yet contain lessons worth consideration\r\nby the soldier of today\r\nthis translation\r\nis affectionately dedicated.\r\n\r\n\r\n\r\nContents\r\n\r\n\r\n Preface to the Project Gutenberg Etext\r\n Preface by Lionel Giles\r\n INTRODUCTION\r\n Sun Wu and his Book\r\n The Text of Sun Tzŭ\r\n The Commentators\r\n Appreciations of Sun Tzŭ\r\n Apologies for War\r\n Bibliography\r\n Chapter I. Laying plans\r\n Chapter II. Waging War\r\n Chapter III. Attack by Stratagem\r\n Chapter IV. Tactical Dispositions\r\n Chapter V. Energy\r\n Chapter VI. Weak Points and Strong\r\n Chapter VII Manœuvring\r\n Chapter VIII. Variation of Tactics\r\n Chapter IX. The Army on the March\r\n Chapter X. 
Terrain\r\n Chapter XI. The Nine Situations\r\n Chapter XII. The Attack by Fire\r\n Chapter XIII. The Use of Spies\r\n\r\n\r\n\r\nPreface to the Project Gutenberg Etext\r\n\r\nWhen Lionel Giles began his translation of Sun Tzŭ’s _Art of War_, the\r\nwork was virtually unknown in Europe. Its introduction to Europe began\r\nin 1782 when a French Jesuit Father living in China, Joseph Amiot,\r\nacquired a copy of it, and translated it into French. It was not a good\r\ntranslation because, according to Dr. Giles, \"[I]t contains a great\r\ndeal that Sun Tzŭ did not write, and very little indeed of what he\r\ndid.\"\r\n\r\nThe first translation into English was published in 1905 in Tokyo by\r\nCapt. E. F. Calthrop, R.F.A. However, this translation is, in the words\r\nof Dr. Giles, \"excessively bad.\" He goes further in this criticism: \"It\r\nis not merely a question of downright blunders, from which none can\r\nhope to be wholly exempt. Omissions were frequent; hard passages were\r\nwillfully distorted or slurred over. Such offenses are less pardonable.\r\nThey would not be tolerated in any edition of a Latin or Greek classic,\r\nand a similar standard of honesty ought to be insisted upon in\r\ntranslations from Chinese.\" In 1908 a new edition of Capt. Calthrop’s\r\ntranslation was published in London. It was an improvement on the\r\nfirst—omissions filled up and numerous mistakes corrected—but new\r\nerrors were created in the process. Dr. Giles, in justifying his\r\ntranslation, wrote: \"It was not undertaken out of any inflated estimate\r\nof my own powers; but I could not help feeling that Sun Tzŭ deserved a\r\nbetter fate than had befallen him, and I knew that, at any rate, I\r\ncould hardly fail to improve on the work of my predecessors.\"\r\n\r\nClearly, Dr. Giles’ work established much of the groundwork for the\r\nwork of later translators who published their own editions. 
Of the\r\nlater editions of the _Art of War_ I have examined; two feature Giles’\r\nedited translation and notes, the other two present the same basic\r\ninformation from the ancient Chinese commentators found in the Giles\r\nedition. Of these four, Giles’ 1910 edition is the most scholarly and\r\npresents the reader an incredible amount of information concerning Sun\r\nTzŭ’s text, much more than any other translation.\r\n\r\nThe Giles’ edition of the _Art" <> ...
iex(8)> File.stream!(file_path) |> Enum.join("")                                                                             
"\uFEFFThe Project Gutenberg eBook of The Art of War\n    \nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org. If you are not located in the United States,\nyou will have to check the laws of the country where you are located\nbefore using this eBook.\n\nTitle: The Art of War\n\n\nAuthor: active 6th century B.C. Sunzi\n\nTranslator: Lionel Giles\n\nRelease date: May 1, 1994 [eBook #132]\n                Most recently updated: October 16, 2021\n\nLanguage: English\n\nOriginal publication: , 1910\n\n\n*** START OF THE PROJECT GUTENBERG EBOOK THE ART OF WAR ***\n\n\n\nSun Tzŭ\non\nThe Art of War\n\nTHE OLDEST MILITARY TREATISE IN THE WORLD\nTranslated from the Chinese with Introduction and Critical Notes\n\nBY\nLIONEL GILES, M.A.\n\nAssistant in the Department of Oriental Printed Books and MSS.\nin the British Museum\n\n\n\n\n1910\n\n\n\nTo my brother\nCaptain Valentine Giles, R.G.\nin the hope that\na work 2400 years old\nmay yet contain lessons worth consideration\nby the soldier of today\nthis translation\nis affectionately dedicated.\n\n\n\nContents\n\n\n Preface to the Project Gutenberg Etext\n Preface by Lionel Giles\n INTRODUCTION\n Sun Wu and his Book\n The Text of Sun Tzŭ\n The Commentators\n Appreciations of Sun Tzŭ\n Apologies for War\n Bibliography\n Chapter I. Laying plans\n Chapter II. Waging War\n Chapter III. Attack by Stratagem\n Chapter IV. Tactical Dispositions\n Chapter V. Energy\n Chapter VI. Weak Points and Strong\n Chapter VII Manœuvring\n Chapter VIII. Variation of Tactics\n Chapter IX. The Army on the March\n Chapter X. Terrain\n Chapter XI. The Nine Situations\n Chapter XII. The Attack by Fire\n Chapter XIII. 
The Use of Spies\n\n\n\nPreface to the Project Gutenberg Etext\n\nWhen Lionel Giles began his translation of Sun Tzŭ’s _Art of War_, the\nwork was virtually unknown in Europe. Its introduction to Europe began\nin 1782 when a French Jesuit Father living in China, Joseph Amiot,\nacquired a copy of it, and translated it into French. It was not a good\ntranslation because, according to Dr. Giles, \"[I]t contains a great\ndeal that Sun Tzŭ did not write, and very little indeed of what he\ndid.\"\n\nThe first translation into English was published in 1905 in Tokyo by\nCapt. E. F. Calthrop, R.F.A. However, this translation is, in the words\nof Dr. Giles, \"excessively bad.\" He goes further in this criticism: \"It\nis not merely a question of downright blunders, from which none can\nhope to be wholly exempt. Omissions were frequent; hard passages were\nwillfully distorted or slurred over. Such offenses are less pardonable.\nThey would not be tolerated in any edition of a Latin or Greek classic,\nand a similar standard of honesty ought to be insisted upon in\ntranslations from Chinese.\" In 1908 a new edition of Capt. Calthrop’s\ntranslation was published in London. It was an improvement on the\nfirst—omissions filled up and numerous mistakes corrected—but new\nerrors were created in the process. Dr. Giles, in justifying his\ntranslation, wrote: \"It was not undertaken out of any inflated estimate\nof my own powers; but I could not help feeling that Sun Tzŭ deserved a\nbetter fate than had befallen him, and I knew that, at any rate, I\ncould hardly fail to improve on the work of my predecessors.\"\n\nClearly, Dr. Giles’ work established much of the groundwork for the\nwork of later translators who published their own editions. Of the\nlater editions of the _Art of War_ I have examined; two feature Giles’\nedited translation and notes, the other two present the same basic\ninformation from the ancient Chinese commentators found in the Giles\nedition. 
Of these four, Giles’ 1910 edition is the most scholarly and\npresents the reader an incredible amount of information concerning Sun\nTzŭ’s text, much more than any other translation.\n\nThe Giles’ edition of the _Art of War_, as stated above, was a\nscholarly work. Dr. Giles was a leading sinologue at the time and an\nassistant in the Depar" <> ...

In the default :line mode of File.stream!, "\r\n" sequences are normalized to "\n". That drops one "\r" per line, which accounts for the difference of 7145 on a file with 7145 lines.
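The normalization can be reproduced on a tiny file. This is only a sketch: the file name is made up, and it assumes your Elixir/OTP version behaves like the iex session above. If it does, the last line should print 2, one dropped "\r" per line.

```elixir
# Write a two-line CRLF file to a temporary location (hypothetical file name).
path = Path.join(System.tmp_dir!(), "crlf_demo.txt")
File.write!(path, "one\r\ntwo\r\n")

# Read the whole file at once, then read it again line by line.
whole = File.read!(path)
by_line = path |> File.stream!() |> Enum.join("")

IO.inspect(whole)
IO.inspect(by_line)

# Number of bytes that went missing in :line mode.
IO.inspect(byte_size(whole) - byte_size(by_line))

File.rm!(path)
```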

Codepoints can span multiple bytes. If you chunk by a fixed number of bytes you might split things within the boundary of a single codepoint leaving parts of it in both chunks of the file. To count correctly you’d need to be able to detect that happening.
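One way to sidestep the detection entirely, sketched here under the assumption that the file is valid UTF-8: every codepoint has exactly one leading byte, and all of its remaining bytes are continuation bytes of the form 0b10xxxxxx. Counting only the bytes that are not continuation bytes therefore gives the codepoint count no matter where a chunk boundary falls. The module name and temp file are hypothetical, and the argument order of File.stream! follows the snippet earlier in the thread (newer Elixir versions prefer `File.stream!(path, chunk_bytes)`).

```elixir
defmodule CodepointCount do
  import Bitwise

  # Counts codepoints by streaming fixed-size byte chunks.
  # Assumes the file contains valid UTF-8.
  def count(path, chunk_bytes \\ 1024) do
    path
    |> File.stream!([], chunk_bytes)
    |> Enum.reduce(0, fn chunk, acc -> acc + count_leading_bytes(chunk) end)
  end

  # A byte starts a codepoint unless its top two bits are 10.
  defp count_leading_bytes(chunk) do
    for <<byte <- chunk>>, (byte &&& 0xC0) != 0x80, reduce: 0 do
      n -> n + 1
    end
  end
end

path = Path.join(System.tmp_dir!(), "codepoints_demo.txt")
File.write!(path, "Sun Tz\u016D\r\n")

# A chunk size of 1 splits the two-byte ŭ across chunks on purpose.
split_count = CodepointCount.count(path, 1)
big_count = CodepointCount.count(path, 1024)
reference = length(String.codepoints(File.read!(path)))

IO.inspect({split_count, big_count, reference})

File.rm!(path)
```

All three numbers should agree regardless of the chunk size, which is the property the String.codepoints version above lacks.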

So basically it is not possible to stream without this behaviour? I’ll have to somehow check first how the lines end in this file and then do something to get the correct count?

The behaviour is specific to chunking by lines. You can chunk by a fixed number of bytes to get around that issue. That’s the better approach when you don’t know your inputs very well anyway, because you cannot depend on the input actually being broken up into lines in the first place.
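For the byte count specifically, a minimal sketch of the fixed-size approach could look like this. The temp file and the chunk size of 2048 are arbitrary choices of mine, and the argument order matches the snippet earlier in the thread.

```elixir
# Build a CRLF test file: 1000 lines of 14 bytes each (hypothetical file name).
path = Path.join(System.tmp_dir!(), "bytes_demo.txt")
File.write!(path, String.duplicate("line of text\r\n", 1000))

# Sum the size of each fixed-size chunk; no line handling is involved,
# so "\r\n" is left untouched.
streamed =
  path
  |> File.stream!([], 2048)
  |> Enum.reduce(0, fn chunk, acc -> acc + byte_size(chunk) end)

IO.inspect(streamed)
IO.inspect(streamed == byte_size(File.read!(path)))

File.rm!(path)
```

Only one chunk is held in memory at a time, and the total matches what File.read! would report for the whole file.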

Not sure if it's the best approach, but I ended up with something like this