Removing a list of strings from a file

Okay, I find myself regularly struggling with the following kind of problem so I thought it would be a good case for a first question!

I am doing some processing on a file, let’s call it README.md. What I want to do is remove some lines from the file. What I have is a list of strings:

redact = ["string1", "string2", "string3", ...]

Every time one of those strings appears in a line of the file, I want to remove that line (let’s call that “redacting”).

So my general approach is to do something like:

a. Enum.reject(File.stream!("README.md"), fn x -> String.contains?(x, "string1") end)

This makes sense to me – if a line of the file (coming from the stream) contains “string1” that we want redacted, reject that line. At the end of it I can put the file back together from the lines left over.

The problem I have is now running my list of various strings through it. If I do something like:

b. Enum.map(redact, fn x -> Enum.reject(File.stream!("README.md"), fn y -> String.contains?(y, x) end) end)

I get a list of lists, where each sublist is the original file minus the lines that match one of the strings in the redact list. So the first entry is the whole file minus any lines with “string1”, the second entry is the whole file minus any lines with “string2” (but including those with “string1”), and so on. For example:

[["string2", "string3", "string4"], ["string1", "string3", "string4"], ["string1", "string2", "string4"]]

From which I would like:

["string4"]

This is where I get stuck. I’ve implemented this a different way, applying each update to the file as I go along (i.e. storing the state in the file rather than in memory), but I’d really like to crack this, as I come across this kind of problem all the time.

So, my questions are:

  1. If I have a list of strings (i.e. the streamed file) and a list of things I want to redact from those strings, is there an “easy” way to compare them and throw away the entries in the list where a redacted word appears?

  2. If b. above is a reasonable way of doing this, how do I find the intersection of a list of lists? I.e. let’s say I have n lists [[1,2,3,4], [1,3,4,5], [2,3,5,6] etc], how do I find the intersection of those n lists?

  3. Is there a better way of doing b. ?

Sorry for the long message, first time I’ve posted so I wanted to get it all out there!


If I understand you correctly, you want to reject every line that contains any given substring?

This is from memory without being able to test:

bad_words = ["string1", "string2", ...]

"README.md"
|> File.stream!()
|> Enum.reject(fn line -> Enum.any?(bad_words, &String.contains?(line, &1)) end)
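
For anyone who wants to try this end to end, here is a minimal sketch of the same idea that also writes the remaining lines back to the file. The module name `Redact`, the function `reject_lines/2`, and the temp-file name are made up for illustration and are not from the thread. `Enum.reject/2` is eager, so the whole file has been read before it is overwritten.

```elixir
defmodule Redact do
  # Hypothetical helper: drops every line that contains at least one bad word.
  def reject_lines(lines, bad_words) do
    Enum.reject(lines, fn line ->
      Enum.any?(bad_words, &String.contains?(line, &1))
    end)
  end
end

# Assumed sample file in the temp directory, standing in for README.md.
path = Path.join(System.tmp_dir!(), "redact_example.md")
File.write!(path, "keep me\nhas string1 here\nstring2 too\nalso fine\n")

kept =
  path
  |> File.stream!()
  |> Redact.reject_lines(["string1", "string2", "string3"])

# The kept lines still carry their trailing newlines, so they can be
# written back as-is (a list of binaries is valid iodata).
File.write!(path, kept)
IO.inspect(File.read!(path))
```

As far as I know, `String.contains?/2` also accepts a list of patterns as its second argument, so `String.contains?(line, bad_words)` should behave the same as the `Enum.any?` call here.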

Thanks @NobbZ – that does do the trick (with the addition of an end).

I hadn’t come across using Enum.any? before, very useful. I got tied up with the scope as well!
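
Question 2 from the original post (the intersection of n lists) is no longer needed once `Enum.any?` is used, but for completeness here is one possible sketch using `MapSet`. The module and function names are hypothetical, and it assumes at least one list is passed in. The order of the first list is preserved in the result.

```elixir
defmodule ListIntersection do
  # Hypothetical helper: returns the elements of the first list that
  # appear in every other list. Requires a non-empty list of lists.
  def intersect_all([first | rest]) do
    common =
      rest
      |> Enum.map(&MapSet.new/1)
      |> Enum.reduce(MapSet.new(first), &MapSet.intersection/2)

    # Filter the first list so the original ordering is kept.
    Enum.filter(first, &MapSet.member?(common, &1))
  end
end

lists = [[1, 2, 3, 4], [1, 3, 4, 5], [2, 3, 5, 6]]
IO.inspect(ListIntersection.intersect_all(lists))
# prints [3]
```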
