How can we define our own segmentation rules so that spaCy splits a Doc into sentences the way we want?
By default, spaCy's built-in sentence segmentation relies on the dependency parse and end-of-sentence punctuation to decide where sentences begin. We can add rules of our own, but they have to be added to the pipeline before the Doc object is created (and ahead of the parser), because that is where the sentence-start tokens are determined.
Let's add a semicolon to our existing segmentation rules. That is, whenever the sentencizer encounters a semicolon, the next token should start a new segment.
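The semicolon rule described above can be sketched as a small pipeline component. This is a minimal sketch, assuming spaCy v3. The component name semicolon_boundaries is made up for this illustration. To avoid downloading a model, it uses a blank English pipeline with the rule-based sentencizer instead of the parser.

```python
import spacy
from spacy.language import Language


# "semicolon_boundaries" is a hypothetical name chosen for this sketch.
@Language.component("semicolon_boundaries")
def semicolon_boundaries(doc):
    # Mark the token that follows each semicolon as the start of a new sentence.
    for token in doc[:-1]:
        if token.text == ";":
            doc[token.i + 1].is_sent_start = True
    return doc


# Assumption: a blank pipeline (tokenizer only, no parser) keeps the example self-contained.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")           # rule-based split on sentence-final punctuation
nlp.add_pipe("semicolon_boundaries")  # adds the semicolon rule on top

doc = nlp("This is a sentence; this is another one. And a third.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

With a trained pipeline that includes a parser, the same component would presumably be registered with nlp.add_pipe("semicolon_boundaries", before="parser"), so that the rule is applied before the parse.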
In some cases, we want to replace spaCy's default sentence segmentation with our own set of rules. In this section we'll see how the default sentencizer breaks on periods. We'll then replace this behavior with a segmenter that breaks only on line breaks.
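One way to sketch this replacement, assuming spaCy v3 and a blank English pipeline so that no model download is needed, is shown here. The component name newline_boundaries and the sample text are made up for this illustration. The component sets the sentence-start flag on every token, so periods no longer trigger a split.

```python
import spacy
from spacy.language import Language

text = "This is a sentence. This is another.\nThis is a third sentence."

# Default rule-based behavior: the sentencizer splits after each period.
default_nlp = spacy.blank("en")
default_nlp.add_pipe("sentencizer")
default_sentences = [s.text.strip() for s in default_nlp(text).sents]


# "newline_boundaries" is a hypothetical name chosen for this sketch.
@Language.component("newline_boundaries")
def newline_boundaries(doc):
    # A token starts a sentence only if it is first or directly follows a newline token.
    for token in doc:
        if token.i == 0:
            token.is_sent_start = True
        else:
            token.is_sent_start = doc[token.i - 1].text.startswith("\n")
    return doc


# Custom behavior: only line breaks separate sentences.
custom_nlp = spacy.blank("en")
custom_nlp.add_pipe("newline_boundaries")
custom_sentences = [s.text.strip() for s in custom_nlp(text).sents]

print(default_sentences)
print(custom_sentences)
```

The strip() call removes the newline token that otherwise ends up at the edge of a sentence span.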