Emulate File.stream! for a string variable

Hello there!

I have a mix task which grabs some data from a remote API, collects it into a file (a biggish XML), then reads it back in as a stream and processes it with a bunch of text transformations.
Now I’m trying to sketch an integration test for the processing part (skipping the getting-the-data part) and I wonder… is it possible to use a string variable instead of a file for that purpose?

This is what I have in my mix task code:

    File.stream!("#{@download_dir}/#{category}.xml", [:read])
    |> ... processing part I want to test follows

And this is how I’m trying to simulate (unsuccessfully so far) the File.stream! part:

    input_xml = """
     ... xml fragment ...
    ...
    """
    input_xml
    |> Stream.unfold( &(String.split(&1, "~n")) )
    |> ... processing part ...

Of course what I’m getting after this Stream.unfold is different from File.stream! – the Stream.unfold results in all the newlines being removed.
And then my processing part breaks as it relies on newlines in certain places (yeah, it sounds crazy, but inside that xml I have wiki markup-formatted fragments where newlines do matter).

So my question is: is it possible to split a string by newlines in such a way that I can preserve those "\n"s? That is, can I emulate File.stream! without an actual file?

Would appreciate any hints. Thank you!

You’re in luck: you can simply use StringIO.open/1 and IO.binstream/2 to stream any string:

{:ok, stream} =
  "abc\ndef\nghi\n"
  |> StringIO.open()

stream
|> IO.binstream(:line)
# |> ... your own stream processing goes here
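
To check that the newlines survive, here is a minimal sketch using the same three-line string. The variable names are only illustrative:

```elixir
# Open an in-memory IO device backed by the string.
{:ok, device} = StringIO.open("abc\ndef\nghi\n")

# Read the device line by line; each element keeps its trailing "\n".
lines = device |> IO.binstream(:line) |> Enum.to_list()

IO.inspect(lines)
# prints ["abc\n", "def\n", "ghi\n"]

# The device is a process, so close it once you are done with it.
StringIO.close(device)
```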

Thank you.

I’ve rewritten my spec to use the suggested approach; however, now I cannot get my tests to finish running.
When I try running it in IEx (create a stream, start splitting it as a binstream and then just print every line using Enum.each), my system is brought to its knees. Output stops after ~20 iterations and my Activity Monitor app shows beam and kernel_task at the top of the list for both CPU and memory consumption. The laptop fans are roaring and the window UI in general becomes sluggish.

I’m running macOS Sierra on a MacBook Air (1.6 GHz i5, 4 GB RAM); Elixir and Erlang are installed with Homebrew (1.4.2 and 19.3 respectively).

Ah, and the data size is actually tiny for this test: 107 lines.

Could you post the code with which you’re trying this?

str = """
  <page>
    <title>Acura CSX</title>
    <ns>0</ns>
    <id>3161370</id>
    <revision>
      <id>772698868</id>
      <parentid>770005822</parentid>
      <timestamp>2017-03-28T20:19:55Z</timestamp>
      <contributor>
        <username>GreenC bot</username>
        <id>27823944</id>
      </contributor>
      <minor/>
      <comment>Rescued 1 archive link; reformat 1 link. [[User:Green Cardamom/WaybackMedic_2.1|Wayback Medic 2.1]]</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve" bytes="10135">{{Infobox automobile
|name           = Acura CSX 
|image          = '07 Acura CSX.JPG
|caption=2007 Acura CSX
|manufacturer   = [[Honda]]
|successor      = [[Acura ILX]]
|layout         = [[Front-engine, front-wheel drive layout|FF layout]]
|aka     = Honda L30 (for sedan)&lt;br /&gt;Honda L40 (for coupe)
|production     = 2005–2011
|model_years     = 2006–2011
|predecessor    = [[Acura EL]]
|related        = [[Honda Civic (eighth generation)]]&lt;br&gt;[[Acura RSX]]&lt;br&gt;[[Honda CR-V]]&lt;br&gt;[[Honda Element]]
|class          = [[Entry-level luxury car]]
|body_style     = 4-door [[sedan (car)|sedan]]
|engine         = 2.0&amp;nbsp;L [[Honda K engine#K20Z2|K20Z2]] [[Straight-4|I4]]&lt;br&gt;2.0&amp;nbsp;L [[Honda K engine#K20Z3|K20Z3]] [[Straight-4|I4]] (Type-S)
|transmission   = 5-speed [[manual transmission|manual]]&lt;br&gt;5-speed [[automatic transmission|automatic]]&lt;br&gt;6-speed manual (Type-S)
|wheelbase      = {{convert|2700|mm|in|1|abbr=on}}
|length         = {{convert|4544|mm|in|1|abbr=on}}
|width          = {{convert|1752|mm|in|1|abbr=on}}
|height         = {{convert|1435|mm|in|1|abbr=on}}
|assembly       = [[Alliston, Ontario]], [[Canada]]
|weight         = {{convert|1313|kg|lb st|abbr=on}}&lt;br&gt;{{convert|1343|kg|lb|abbr=on}} (AT)
|designer       = Motoaki Minowa (2003)
}}

The '''Acura CSX''' (Compact Sportscar eXperimental), or [[Honda Civic]] for the [[Japanese domestic market]] (JDM), was [[Acura]]'s [[entry-level luxury car]] exclusively designed for the Canadian market. The CSX is the first Acura model with two predecessors, the [[Acura Integra|Integra sedan]] (1986–1996) and the [[Acura EL|EL]] (1997–2005). Like the EL, it was only available in [[Canada]] and built in [[Alliston, Ontario]], Canada. In 2012, the [[Acura ILX|ILX]] was introduced as the CSX's replacement, now available in both the [[United States]] and Canada.

==Design==

The CSX is not a rebadged [[Japanese Domestic Market|JDM]] Honda Civic, rather Honda Japan chose the Canadian-designed CSX as the template for the JDM Civic.&lt;ref&gt;{{cite web|url=http://www.thetruthaboutcars.com/2008-acura-csx-navi-premium-review/ |title=Acura CSX Review |publisher=The Truth About Cars |date=2008-09-03 |accessdate=2010-11-11}}&lt;/ref&gt;&lt;ref&gt;{{cite web|url=http://www.auto123.com/en/info/news/roadtest,view,Acura.spy?artid=56355&amp;pg=1 |title=2006 Acura CSX Road Test - Auto123.com - Canadian automotive network |publisher=Auto123.com |date=2006-02-10 |accessdate=2010-11-11}}&lt;/ref&gt;&lt;ref&gt;{{cite web|url=http://www.hondatuningmagazine.com/features/htup_0909_2008_acura_csx_type_s/index.html |title=2008 Acura CSX - Type-S |publisher=Honda Tuning Magazine |date= |accessdate=2010-11-11}}&lt;/ref&gt;  Externally, the CSX shares its [[cab forward]] architecture with the American-market Civic.&lt;ref&gt;{{cite web|url=http://www.wheels.ca/article/25775 |title=2006 Acura CSX |publisher=Wheels.ca |date= |accessdate=2010-11-11}}&lt;/ref&gt;  Differentiating the Acura from its mainstream North American counterpart include a slightly longer nose with shaped headlamp clusters, a full-width lower air intake and a slight crease up the hood's centreline. At the rear, jewelled taillamps and the shaping of the trunk's sheet metal contrast the upscale-marketed CSX from the Civic.

The CSX shares some features with the [[Japan domestic market|JDM]] Civic, most notably the 2.0&amp;nbsp;L [[DOHC]] [[i-VTEC]] [[Internal combustion engine|engine]] rated at {{convert|155|hp|abbr=on}} at 6000 [[revolutions per minute|rpm]] and {{convert|139|lb·ft|N.m|abbr=on}} at 4500&amp;nbsp;rpm. Also shared with the JDM Civic are the front and rear fascias; the steering wheel is used in Japanese, European, and American-market Civic models.

==Debut==

The CSX went on sale on November 2005 as a 2006 model. The 2006 CSX was introduced in 3 trims: Touring (the base model), Premium, and Premium + Navi.  Standard features on the touring model include 16-inch alloy wheels, anti-lock brakes, side and curtain airbags, leather-wrapped steering wheel with audio controls, [[Semi-automatic transmission|paddle shifters]] for automatic transmission models, heated door mirrors with integrated turn signals, 6-speaker audio system with CD/MP3/WMA capability, automatic climate control, cruise control, chrome door handles, and 60/40 split folding rear seats.  Key additions in the Premium model included [[High-intensity discharge lamp|high-intensity discharge (HID)]] headlights, leather upholstery, heated front seats, power moonroof, and an in-dash 6 disc CD changer.  The Navi model was only available as an upgrade to the Premium trim, adding a bilingual voice-activated navigation system, illuminated steering wheel controls, and a digital audio card reader.  The navigation system and HID headlamps are among the features not available for Honda Civics sold in Canada.

The resulting car is {{convert|62|kg|lb|abbr=on}} to {{convert|88|kg|lb|abbr=on}} heavier than the Civic EX sedan, with fuel consumption raised to {{convert|8.7|l/100 km }} city, {{convert|6.4|l/100 km }} highway for manual model; and {{convert|9.5|l/100 km }} city, {{convert|6.5|l/100 km }} highway  for [[automatic transmission|automatic]] model. The CSX uses regular unleaded gasoline (Min 91 [[Octane rating|RON]]), while the Type S model uses premium gasoline (Min 95 [[Octane rating|RON]]).

==Type-S==
[[File:'07 Acura CSX Type-S.JPG|250px|thumb|right|2007 Acura CSX Type-S.]]

The Type-S variant debuted as a 2007 model and uses the identical drivetrain found in the US and Canadian market 2006+ [[Honda Civic Si]] which consists of a 2.0L [[Straight-4|I4]], {{convert|197|hp}} i-VTEC engine, 6-speed [[manual transmission]] and helical [[limited-slip differential]].  The &quot;sport-tuned&quot; suspension is identical to the US-only 2006+ Honda Civic Si sedan with stiffer springs, firmer damping and thicker stabilizer bars compared to the regular CSX and is supported on 215/45R17 all-season tires and 17-inch alloy wheels.  Unlike the Canadian market Civic Si coupe, the CSX Type-S employs Honda's version of [[traction control system|traction control]] (Vehicle Stability Assist, or VSA) and brake assist.  Other amenities include 17-inch aluminium-alloy wheels, rear wing spoiler with integrated [[LED]] brake light, fog lights, bilingual navigation system, a 350-[[watt]] 7-speaker audio system, digital audio card reader, Type-S badging and illuminated foot wells.

Fuel consumption is {{convert|10.2|L/100 km|abbr=on}} city, {{convert|6.8|L/100 km|abbr=on}} highway with a recommendation of Premium (91+ [[Octane rating|octane]]) fuel.

The 2007 CSX Type-S went on sale on November 6, 2006 with an [[suggested retail price|MSRP]] of $33,400 [[Canadian dollars|Canadian]] at the same time as the [[Acura TL|Acura TL Type-S]].

==Trim &amp; Mid-model changes==
[[File:CSX Interior.jpg|thumb|Interior of 2008 CSX Technology]]
[[File:Acura-CSX.JPG|thumb|right|Acura CSX]]

For 2007 models, an auxiliary input jack for the audio system was added for all CSX models.

For 2008 models, leather upholstery, a tire pressure monitoring system, illuminated vanity mirrors, and vehicle stability assist (VSA) was made standard equipment on all CSX models.  The CSX Premium was renamed CSX Technology, which added high-intensity discharge headlights, fog lamps, XM Satellite Radio with roof-mounted antenna, premium stereo and bilingual voice-activated navigation system.

Like the Honda Civic, the CSX received a mid-model change in 2009, most notably giving it Acura's trademark Power Plenum grille.  Other exterior changes include black-housing headlights, octagonal tail-lamps, and revised front bumper and fog lights.  While not new to the line-up, the 17-inch alloy wheels once exclusive to the Type S are now standard on all models.  New features for 2009 include USB audio connectivity for all models, and Bluetooth handsfree wireless link for Technology and Type-S models.
For 2010, the base model has been discontinued and the now-entry level Technology model has been renamed &quot;iTech&quot;.

For the car's final model year, Acura has simplified the CSX line for 2011 offering 2 trim levels, Base and iTech, both have significantly reduced MSRP from the previous 2010 models. The Type-S trim has been discontinued for 2011. Only four colours are available: Crystal Black Pearl, Alabaster Silver Metallic, Polished Metal Metallic, and Taffeta White.

==Discontinuation==

Despite being the best-selling vehicle in Acura Canada's lineup from 2006–07 and late 2009/early 2010, Honda announced the discontinuation of the Acura CSX after the 2011 model year.&lt;ref&gt;[http://www.theglobeandmail.com/globe-drive/acura-aims-to-revive-its-image/article2287374/?utm_medium=Feeds%3A%20RSS%2FAtom&amp;utm_source=Home&amp;utm_content=2287374 Acura aims to revive its image ] -- [[The Globe and Mail]] (Retrieved 2012-01-23)&lt;/ref&gt; The Civic-based [[Acura ILX|ILX]] was confirmed as the car's successor for the 2013 model year.&lt;ref&gt;[http://www.autoweek.com/article/20111212/CARNEWS/111219999 Acura retools lineup to put the focus on mpg] -- [[Autoweek]] (Retrieved 2011-12-12)&lt;/ref&gt;  As such, the Acura CSX becomes the fourth Acura to have only sold one model generation, after the [[Acura Vigor|Vigor]] (1992–94), the U.S.-exclusive [[Acura SLX|SLX]] (1996–99), and the [[Acura RSX|RSX]] (2002–06).{{Citation needed|date=April 2012}}

{{Clear}}

==References==
{{Reflist}}

==External links==
{{Commons category|Acura CSX}}
* [http://www.auto123.com/en/info/news/roadtest,view,Acura.spy?artid=56355&amp;pg=1 Auto123 review and design discussion]
* [http://www.guideautoweb.com/en/specifications/acura/csx/2011 Acura CSX specifications]
* [https://archive.is/20130108163623/http://www.autonet.ca/Spotlight/NewModels/story.cfm?story=/Spotlight/NewModels/2006/10/20/2078786.html News release of Acura CSX Type-S]
{{Acura}}

{{DEFAULTSORT:Acura Csx}}
[[Category:2000s automobiles]]
[[Category:2010s automobiles]]
[[Category:Acura vehicles|CSX]]
[[Category:Cars of Canada]]
[[Category:Compact cars]]
[[Category:Compact executive cars]]
[[Category:Front-wheel-drive vehicles]]
[[Category:Goods manufactured in Canada]]
[[Category:Sedans]]
[[Category:Cars introduced in 2005]]</text>
      <sha1>6rrq7eeum01r95valkdtchsdw0dghf0</sha1>
    </revision>
  </page>
"""
{:ok, stream} = str |> StringIO.open
stream |> IO.binstream(:line) |> Enum.each( fn(line) -> IO.puts("--- #{line} ---") end )

Why not use one of the string functions? I understand you want to get an enumerable of lines from the string. This can be achieved eagerly with String.split(str, "\n") or lazily with String.splitter(str, "\n").

Because String.split/2 and String.splitter/2 will remove the split-points, but the OP said he needs them intact.
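
If you would rather stay with plain string functions, one way to keep the newlines is to split off one line at a time and put the "\n" back. This is only a sketch: the module name LineStream and its lines/1 function are hypothetical, not part of Elixir.

```elixir
defmodule LineStream do
  # Hypothetical helper, not part of the standard library.
  # Lazily yields the lines of `string`, each keeping its trailing "\n".
  def lines(string) when is_binary(string) do
    Stream.unfold(string, fn
      "" ->
        nil

      rest ->
        case String.split(rest, "\n", parts: 2) do
          # A newline was found: re-append it to the line we emit.
          [line, tail] -> {line <> "\n", tail}
          # No newline left: emit the final, unterminated line.
          [last] -> {last, ""}
        end
    end)
  end
end

# String.split/2 drops the split points...
IO.inspect(String.split("abc\ndef\nghi\n", "\n"))
# prints ["abc", "def", "ghi", ""]

# ...while the sketch above keeps them.
IO.inspect(Enum.to_list(LineStream.lines("abc\ndef\nghi\n")))
# prints ["abc\n", "def\n", "ghi\n"]
```

Since this is built on Stream.unfold with the string itself as the initial state, enumerating the stream again starts from the first line.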

This is really odd though: running the code @vivus-ignis posted in IEx completely froze my Ubuntu 16.10 machine with an i7 after just a couple of lines, and I had to do a hard reset.
Even stranger, this happens even if the last line is the following:

stream |> IO.binstream(:line) |> Enum.each(fn(line) -> :noop end)

Could this be a bug in the standard library?

Should I report that to elixir-lang github issues?

I did the experiment as well in a VBox, 6th gen i5 here. The box had 4 GiB of RAM available, plus another 4 GiB of swap.

I did not reset the box, but let it run.

The box took all the CPU I had assigned (4 cores at 75%), and I could watch memory consumption rise in htop, which I had open in another terminal on that box. After about 2 minutes iex died: eheap_alloc: Cannot allocate 3280272216 bytes of memory (of type "old_heap").

So this does in fact seem to be something we need to report. Since this was only quick testing, I used the Elixir version available in the Ubuntu repositories, 1.1.0. I’d be glad if someone could verify whether newer versions also eventually get killed by OOM, or at least whether memory consumption rises before the system stalls.

I should also mention what I observed about the speed of the output: the first lines came really quickly, and then each line took longer and longer until iex died.

I can also reproduce this with Elixir 1.4.2 / OTP 19. For a string with a few lines it works fine, but with more lines it takes more and more wall-clock time per line. Using a progressively longer string, this behavior becomes quite obvious, and it looks very much like O(n^2) type behavior, with n being the number of lines in the string opened with StringIO.open.
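
For anyone who wants to repeat this kind of measurement, here is a rough sketch. The module name, the line content and the sizes are arbitrary choices of mine, and the numbers depend entirely on your machine and Elixir version. If total time roughly quadruples each time n doubles, that points to quadratic behavior; on a version without the problem it should grow roughly linearly.

```elixir
defmodule StringIOTiming do
  # Hypothetical helper: builds a string of `n` short lines and times how long
  # it takes to read them all back through StringIO + IO.binstream/2.
  # Returns {number_of_lines_read, elapsed_microseconds}.
  def time_lines(n) do
    string = String.duplicate("some line of text\n", n)
    {:ok, device} = StringIO.open(string)

    {micros, count} =
      :timer.tc(fn -> device |> IO.binstream(:line) |> Enum.count() end)

    StringIO.close(device)
    {count, micros}
  end
end

# Double the number of lines each round and compare the timings.
for n <- [250, 500, 1_000, 2_000] do
  {count, micros} = StringIOTiming.time_lines(n)
  IO.puts("#{n} lines (#{count} read): #{micros} microseconds")
end
```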

Ok … I think I have found the issue …

In every call to get a line, the process returned by StringIO.open does this:

defp io_request({:get_line, encoding, prompt}, s) do
  get_line(encoding, prompt, s)
end

get_line, in turn, calls Erlang’s :unicode.characters_to_list, which converts the whole bitstring to a list with the proper encoding.

If this succeeds, StringIO.do_get_line is called, which iterates over the items in the list until it finds a terminator (newline or no more data) and returns that line and the rest of the string. It then goes back to Erlang, calling :unicode.characters_to_binary on both the line just retrieved and the remainder of the string.

This means that the longer the string, the bigger the lists and resulting binaries that are generated on every request for a line. I expect this does some unhappy things to memory management. A potential fix would be to do the conversion to a list once, keep that list in the state of the StringIO process, and then iterate over it one line at a time.
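
To make the difference concrete, here is a simplified model of the two strategies. This is not the actual StringIO source; the module and function names are hypothetical, and only the two :unicode calls are real Erlang functions. The "naive" version converts the whole remainder on every call, as described above, while the "cached" version converts once and keeps a charlist as its state.

```elixir
defmodule GetLineModel do
  # Hypothetical model for illustration only, not the real StringIO code.

  # Naive: every call converts the whole remaining binary to a charlist,
  # takes one line, and converts both the line and the rest back to binaries.
  def naive_next(""), do: :eof

  def naive_next(rest) when is_binary(rest) do
    chars = :unicode.characters_to_list(rest)
    {line, tail} = take_line(chars, [])
    {:unicode.characters_to_binary(line), :unicode.characters_to_binary(tail)}
  end

  # Convert once: the state is a charlist, so each call only converts the
  # line it returns and never touches the rest.
  def cached_init(string) when is_binary(string) do
    :unicode.characters_to_list(string)
  end

  def cached_next([]), do: :eof

  def cached_next(chars) when is_list(chars) do
    {line, tail} = take_line(chars, [])
    {:unicode.characters_to_binary(line), tail}
  end

  # Collect characters up to and including the first newline (or to the end).
  defp take_line([?\n | tail], acc), do: {Enum.reverse([?\n | acc]), tail}
  defp take_line([char | tail], acc), do: take_line(tail, [char | acc])
  defp take_line([], acc), do: {Enum.reverse(acc), []}
end

IO.inspect(GetLineModel.naive_next("ab\ncd"))
# prints {"ab\n", "cd"}

{line, _rest_chars} = GetLineModel.cached_next(GetLineModel.cached_init("ab\ncd"))
IO.inspect(line)
# prints "ab\n"
```

With the naive version, reading n lines touches the remaining data n times, which is where the quadratic cost comes from. The cached version trades that for a state that is no longer a binary, which is the bookkeeping concern raised in the p.s. below.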

p.s. I do see why it was probably written this way… there is no guarantee at all that after one call to get_line, another will follow … it’s just as valid for the next call to ask for one char or 5 chars or whatever … since it cannot assume it will all be a bunch of get_line calls, StringIO is doing its best to return the string to a state where whatever the next call happens to be, it will work.

Still, probably should be keeping some bookkeeping information in its state so it knows where it can pick up from …

The issue is already known to the Elixir team.

There is also a PR addressing this issue.

That PR looks very nice … hopefully it gets merged soon! As usual, great to see the pace of development around Elixir in action :)

An alternative to using StringIO + IO.binstream would be to use Stream.unfold with binary pattern matching:

  def binary_stream(b, chunk_size \\ 50_000) when is_binary(b) do
    Stream.unfold(0, fn skip ->
      case b do
        <<_skipped::binary-size(skip), chunk::binary-size(chunk_size), _rest::binary>> ->
          {chunk, skip + chunk_size}

        <<_skipped::binary-size(skip)>> ->
          nil

        <<_skipped::binary-size(skip), chunk::binary>> ->
          {chunk, skip + byte_size(chunk)}
      end
    end)
  end

I tested this on pretty big XML docs (hundreds of megabytes) and it seems to be performant and doesn’t require much memory due to not copying/cloning any part of the binary data.

PS: for this particular case (XML parsing), it’s not necessary to read by line, and in fact some XML documents (or SOAP API responses) return the whole XML doc as a single long line without line breaks.
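
A quick usage sketch for the function above. The function body is copied from the post; the wrapping module name BinaryChunks is hypothetical and only there so the function can be called. Note that chunks are cut by byte count, not by line or by character, so a multi-byte UTF-8 character can end up split across two chunks.

```elixir
defmodule BinaryChunks do
  # Same function as in the post above, wrapped in a hypothetical module.
  def binary_stream(b, chunk_size \\ 50_000) when is_binary(b) do
    Stream.unfold(0, fn skip ->
      case b do
        # At least one full chunk is left after the bytes already consumed.
        <<_skipped::binary-size(skip), chunk::binary-size(chunk_size), _rest::binary>> ->
          {chunk, skip + chunk_size}

        # Everything has been consumed: stop the stream.
        <<_skipped::binary-size(skip)>> ->
          nil

        # Fewer than chunk_size bytes are left: emit them as the last chunk.
        <<_skipped::binary-size(skip), chunk::binary>> ->
          {chunk, skip + byte_size(chunk)}
      end
    end)
  end
end

IO.inspect(Enum.to_list(BinaryChunks.binary_stream("abcdefg", 3)))
# prints ["abc", "def", "g"]

# "\u00E9" (e with acute accent) is two bytes in UTF-8; with a chunk size
# of 1 each chunk holds half a character and is not valid UTF-8 on its own.
IO.inspect(Enum.map(BinaryChunks.binary_stream("\u00E9", 1), &String.valid?/1))
# prints [false, false]
```

So this fits a consumer that accepts arbitrary binary chunks; it is not a drop-in replacement where the processing relies on line boundaries.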

Thanks for this post, it was very useful!

But I think the post marked as the solution is not the most appropriate one.
The solution using String.splitter(str, "\n") seems to behave more as expected.

Explanation:
When using this:

str = "abc\ndef\nghi\n"

{:ok, stream} = str |> StringIO.open()

s = stream |> IO.binstream(:line)

it does not work as expected:

iex> s |> Enum.take(1)
["abc\n"]
iex> s |> Enum.take(1)
["def\n"]

=> The result should always be the same as it’s the same operation.

Like we have here:

iex> s = File.stream!("/tmp/foo.csv")
iex> s |> Enum.take(1)
["hey\n"]
iex> s |> Enum.take(1)
["hey\n"]

Instead, String.splitter(str, "\n") does the job as expected:

iex> str = "abc\ndef\nghi\n"
iex> s = String.splitter(str, "\n")

iex> s |> Enum.take(1)
["abc"]
iex> s |> Enum.take(1)
["abc"]
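
The difference comes from StringIO being a stateful process: every enumeration of the IO.binstream continues from wherever the device currently is. If you want both behaviors at once (newlines kept, and the same result on every enumeration), one option is to open a fresh device per enumeration with Stream.resource/3. This is only a sketch, and the module name StringLines is hypothetical:

```elixir
defmodule StringLines do
  # Hypothetical helper, not part of the standard library.
  # Builds a stream that opens its own StringIO device each time it is
  # enumerated, so every enumeration starts again at the first line.
  def stream(string) when is_binary(string) do
    Stream.resource(
      # Start: open a new in-memory device for this enumeration.
      fn ->
        {:ok, device} = StringIO.open(string)
        device
      end,
      # Next: read one line (trailing "\n" included) or stop at end of input.
      fn device ->
        case IO.binread(device, :line) do
          :eof -> {:halt, device}
          {:error, reason} -> raise "read failed: #{inspect(reason)}"
          line -> {[line], device}
        end
      end,
      # After: close the device, even if the consumer stopped early.
      fn device -> StringIO.close(device) end
    )
  end
end

s = StringLines.stream("abc\ndef\nghi\n")

IO.inspect(Enum.take(s, 1))
# prints ["abc\n"]
IO.inspect(Enum.take(s, 1))
# prints ["abc\n"]
```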