How do I split text into paragraphs?

Given a large body of text (such as one you may find in a database), how would you intelligently separate it into two paragraphs? Maybe not based on content, but at least split down the middle at the closest end of a sentence.

I'd just use some regex. If you define a paragraph break as two newlines, split on the first instance of that. Otherwise, split on the first instance of [a-zA-Z]\.\s after a certain length, or whatever works well for your use case. It is very use-case dependent.
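
To make that concrete, here is a minimal Python sketch of both rules. The function name `split_into_two` and the `min_length` default are made up for illustration, and the pattern treats any letter followed by a period and whitespace as a sentence end, so abbreviations like "e.g. " will fool it.

```python
import re

# Hypothetical helper: the name and the default threshold are assumptions.
SENTENCE_END = re.compile(r"[a-zA-Z]\.\s")

def split_into_two(text, min_length=200):
    """Split text into two paragraphs using the two rules described above."""
    # Rule 1: an explicit paragraph break (two newlines) wins if present.
    head, sep, tail = text.partition("\n\n")
    if sep and head.strip() and tail.strip():
        return head.strip(), tail.strip()
    # Rule 2: first sentence end found at or after min_length characters.
    match = SENTENCE_END.search(text, min_length)
    if match is None:
        # No usable split point: keep everything in the first paragraph.
        return text.strip(), ""
    cut = match.end()
    return text[:cut].strip(), text[cut:].strip()
```

Calling `split_into_two(body, min_length=500)` would then return a pair of strings; tune the threshold to your data.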

Yeah, I just thought there'd be a library for this, ha. It shouldn't be too hard to do myself.

If you put together a good set of reusable pieces for this, then definitely publish a library for it!

I’m writing an NLP library called Essence that does exactly what you are asking for.

Look at the Essence.Chunker module.

Awesome project!

You could just treat the text as one long string and split it in the middle.
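
A plain midpoint cut will usually land inside a word or sentence, so a small refinement is to snap to the sentence end nearest the middle. This is only a rough Python sketch: `split_near_middle` is a made-up name, and the sentence-end pattern is the same naive one suggested earlier in the thread.

```python
import re

def split_near_middle(text):
    """Split text in two at the sentence end closest to the midpoint."""
    mid = len(text) // 2
    # Candidate cut points: just after a letter, a period, and one whitespace.
    cuts = [m.end() for m in re.finditer(r"[a-zA-Z]\.\s", text)]
    if not cuts:
        # No sentence boundary found: fall back to the raw midpoint.
        return text[:mid], text[mid:]
    cut = min(cuts, key=lambda c: abs(c - mid))
    return text[:cut].rstrip(), text[cut:]
```

If there is no sentence boundary at all, it falls back to the raw midpoint, which is the behavior described above.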