Given a large body of text (such as one you may find in a database), how would you intelligently separate it into two paragraphs? Maybe not based on content, but at least split down the middle at the closest end of a sentence.
I’d just use a regex. For example, if you define a paragraph break as two newlines, split on the first instance of that; otherwise split on the first instance of [a-zA-Z]\.\s after a certain length, or whatever rule fits your use case. It is very use-case dependent.
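A rough Python sketch of that idea, in case it helps. The helper name split_into_two is made up for illustration, and the sentence-boundary pattern is just the [a-zA-Z]\.\s suggestion from this thread, so it will misfire on abbreviations like "e.g. " and ignores "?" and "!". Instead of a fixed length it picks the sentence end closest to the midpoint, as the original question asked:

```python
import re

# Assumed sentence boundary: a letter, a period, then whitespace,
# mirroring the [a-zA-Z]\.\s pattern suggested in the thread.
SENTENCE_END = re.compile(r"[a-zA-Z]\.\s")

def split_into_two(text):
    """Split text into two paragraphs (hypothetical helper)."""
    # Prefer an existing paragraph break (two newlines) if there is one.
    head, sep, tail = text.partition("\n\n")
    if sep and head.strip() and tail.strip():
        return head.strip(), tail.strip()

    # Otherwise cut at the sentence end closest to the midpoint.
    mid = len(text) / 2
    cuts = [m.end() for m in SENTENCE_END.finditer(text)]
    if not cuts:
        # No sentence boundary found: leave everything in one paragraph.
        return text.strip(), ""
    cut = min(cuts, key=lambda pos: abs(pos - mid))
    return text[:cut].strip(), text[cut:].strip()
```

For example, split_into_two("One two. Three four five. Six seven eight nine. Ten.") cuts after "five." because that sentence end sits nearest the middle of the string.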
Yeah, I just thought there’d be a library for this, ha. Shouldn’t be too hard to do myself.
If you end up with a good set of reusable tools for this, definitely publish them as a library!
I’m writing an NLP library called Essence that does exactly what you are asking for. Look at the Essence.Chunker module.
Awesome project!
You could just treat the text as one long string and split it in the middle.
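A minimal Python sketch of that, assuming you at least want to avoid cutting a word in half. The function name split_in_middle is hypothetical, and it only looks for plain spaces, not other whitespace:

```python
def split_in_middle(text):
    """Split a string near its midpoint, at the nearest space if any."""
    mid = len(text) // 2
    # Nearest space at or before the midpoint, and at or after it.
    left = text.rfind(" ", 0, mid + 1)
    right = text.find(" ", mid)
    candidates = [pos for pos in (left, right) if pos != -1]
    if not candidates:
        # No spaces at all: fall back to a hard cut in the middle.
        return text[:mid], text[mid:]
    cut = min(candidates, key=lambda pos: abs(pos - mid))
    # Drop the space itself so neither half starts or ends with it.
    return text[:cut], text[cut + 1:]
```

This ignores sentence boundaries entirely, so the second half will usually start mid-sentence.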