sbd-splitter

v0.1.1

Sentence boundary detection document splitter for langchain

SbdSplitter

The SbdSplitter is a custom text splitter class for LangChain.js that extends the RecursiveCharacterTextSplitter class. It utilizes the sbd library for sentence boundary detection and provides additional options for customizing the text splitting process.

Because sentence boundaries do not always fall where a chunk needs to end, a soft maximum ("softmax") target chunk size option is included. A chunk that is still under the soft maximum may take one additional sentence, even if that sentence pushes it past the target. The chunkSize option remains the strict maximum chunk length.
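
The soft maximum rule can be illustrated with a small stand-alone sketch. This is not the package's implementation: packSentences is a hypothetical helper, and it assumes sentences are rejoined with a single space. It keeps adding sentences while the current chunk is under the soft maximum and the result still fits the hard maximum.

```javascript
// Hypothetical helper, for illustration only (not part of sbd-splitter).
// Packs sentences greedily: a chunk under softMax may take one more
// sentence, as long as the result does not exceed hardMax.
function packSentences(sentences, softMax, hardMax) {
  const chunks = [];
  let current = '';
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    const softMaxReached = current.length >= softMax;
    const wouldExceedHardMax = candidate.length > hardMax;
    if (current && (softMaxReached || wouldExceedHardMax)) {
      chunks.push(current);
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  // Note: a single sentence longer than hardMax is not split here; a real
  // splitter would fall back to finer separators for that case.
  return chunks;
}

const demoSentences = [
  'First sentence goes here.',
  'Second sentence goes here.',
  'Third sentence goes here.',
  'Fourth sentence goes here.',
  'Fifth sentence goes here.',
];
console.log(packSentences(demoSentences, 55, 90));
// [
//   'First sentence goes here. Second sentence goes here. Third sentence goes here.',
//   'Fourth sentence goes here. Fifth sentence goes here.'
// ]
```

The first chunk is 78 characters long: it was at 52 (under the soft maximum of 55) when the third sentence was added, and the result still fits within the hard maximum of 90.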

Installation

To use the SbdSplitter in your LangChain.js project, follow these steps:

  1. Install the required dependencies:
npm install langchain sbd
  2. Import the SbdSplitter class in your code:
import { SbdSplitter } from 'sbdsplitter';

Usage

To create an instance of the SbdSplitter, you can provide various options to customize its behavior:

const splitter = new SbdSplitter({
  chunkSize: 1000,
  keepSeparator: true,
  separators: ['\n\n', '\n', '&#&#&#', ' ', ''],
  sbd_marker: '&#&#&#',
  softMaxChunkSize: 800,
  sbd_options: {
    newline_boundaries: false,
    html_boundaries: false,
    sanitize: false,
    allowed_tags: false,
    preserve_whitespace: true,
    abbreviations: null,
  },
});
  • chunkSize: The absolute maximum chunk size (default: 1000). Recommended to be about 20% higher than softMaxChunkSize.
  • keepSeparator: Whether to keep the separator in the split chunks (default: true).
  • sbd_marker: The marker used to join sentences after splitting (default: '&#&#&#'). It allows sentence boundaries to be transformed once the sentences are recombined into text for chunking. The marker must not occur anywhere in the document itself, and it is stripped from the output.
  • softMaxChunkSize: The soft maximum (target) chunk size (default: 800). A chunk still under this size may take one more sentence, provided it stays within chunkSize.
  • sbd_options: Additional options for the sbd library (see the sbd documentation for available options).
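
The role of sbd_marker can be sketched as a round trip: sentences are joined with the marker, and the marker is removed again from the finished chunks. The helpers below (joinSentences, stripMarker) are hypothetical names used for illustration, not functions exported by this package, and the sketch assumes whitespace is preserved inside the sentences themselves.

```javascript
// Illustration only: these helpers are assumptions, not the package's API.
const SBD_MARKER = '&#&#&#';

// Join detected sentences with the marker. Input that already contains the
// marker is rejected, because the boundaries would become ambiguous.
function joinSentences(sentences, marker = SBD_MARKER) {
  const clash = sentences.find((s) => s.includes(marker));
  if (clash !== undefined) {
    throw new Error(`Marker "${marker}" already occurs in the text`);
  }
  return sentences.join(marker);
}

// Remove every occurrence of the marker from a finished chunk.
function stripMarker(chunk, marker = SBD_MARKER) {
  return chunk.split(marker).join('');
}

const joined = joinSentences(['One sentence. ', 'Another sentence.']);
console.log(joined);              // One sentence. &#&#&#Another sentence.
console.log(stripMarker(joined)); // One sentence. Another sentence.
```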

To split a text using the SbdSplitter, you can call the splitText method:

const text = 'Your input text goes here...';
const chunks = await splitter.splitText(text);

The splitText method returns an array of split text chunks.
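
When tuning chunkSize and softMaxChunkSize, it can help to check how the returned chunks compare against both limits. The following is a minimal sketch; summarizeChunks is a hypothetical helper, not part of the package, and it works on any array of strings.

```javascript
// Hypothetical helper for inspecting the array returned by splitText.
function summarizeChunks(chunks, softMax, hardMax) {
  const lengths = chunks.map((chunk) => chunk.length);
  return {
    count: chunks.length,
    longest: lengths.length > 0 ? Math.max(...lengths) : 0,
    // Chunks that used the one-sentence overrun allowance
    overSoftMax: lengths.filter((n) => n > softMax).length,
    // Should stay at 0 if the hard limit is respected
    overHardMax: lengths.filter((n) => n > hardMax).length,
  };
}

const sampleChunks = ['abcd', 'abcdefgh', 'ab'];
console.log(summarizeChunks(sampleChunks, 3, 6));
// { count: 3, longest: 8, overSoftMax: 2, overHardMax: 1 }
```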

Additional Options

The SbdSplitter allows you to customize the behavior of the sbd library by providing additional options through the sbd_options parameter. These options include:

  • newline_boundaries: Whether to treat newlines as sentence boundaries (default: false).
  • html_boundaries: Whether to treat HTML tags as sentence boundaries (default: false).
  • sanitize: Whether to sanitize the input text (default: false).
  • allowed_tags: An array of allowed HTML tags (default: false).
  • preserve_whitespace: Whether to preserve whitespace in the split chunks (default: true).
  • abbreviations: An array of abbreviations to consider during sentence boundary detection (default: null).

Please refer to the sbd library documentation for more details on these options.
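
As a sketch of how partial overrides could be combined with the defaults listed above, the snippet below merges user-supplied keys over a defaults object. resolveSbdOptions and DEFAULT_SBD_OPTIONS are hypothetical names, and whether the package merges options this way is an assumption; the default values are the ones documented in this README.

```javascript
// Defaults as documented above. The merge behavior is an assumption.
const DEFAULT_SBD_OPTIONS = Object.freeze({
  newline_boundaries: false,
  html_boundaries: false,
  sanitize: false,
  allowed_tags: false,
  preserve_whitespace: true,
  abbreviations: null,
});

// Hypothetical helper: reject misspelled keys, then apply overrides.
function resolveSbdOptions(overrides = {}) {
  const unknown = Object.keys(overrides).filter(
    (key) => !(key in DEFAULT_SBD_OPTIONS)
  );
  if (unknown.length > 0) {
    throw new Error(`Unknown sbd option(s): ${unknown.join(', ')}`);
  }
  return { ...DEFAULT_SBD_OPTIONS, ...overrides };
}

const sbdOptions = resolveSbdOptions({ newline_boundaries: true });
console.log(sbdOptions.newline_boundaries);  // true
console.log(sbdOptions.preserve_whitespace); // true
```

Rejecting unknown keys catches typos such as newline_boundary, which would otherwise be ignored silently.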

Combining with Markdown

The default delimiters are basic line breaks plus the sentence boundary marker: ['\n\n','\n','&#&#&#', ' ','']. This list can be extended to work with Markdown as well.

    const delimiters = [
        "\n# ",
        "\n## ",
        "\n### ",
        "\n#### ",
        "\n##### ",
        "\n###### ",
        "```\n\n",
        "\n\n***\n\n",
        "\n\n---\n\n",
        "\n\n___\n\n",
        "\n\n",
        "\n",
        "&#&#&#", //sentence delimiter goes here typically.
        " ",
        "",
    ];
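
The order of this list matters. A recursive character splitter tries the separators from first to last and splits on the first one that actually occurs in the text, so coarse boundaries (headings) are preferred over fine ones (sentences, spaces). The sketch below illustrates that selection step only; pickSeparator is a hypothetical helper, and the shortened list is for demonstration.

```javascript
// Illustration of the selection step in a recursive character splitter.
// pickSeparator is a hypothetical name, not an export of any library.
function pickSeparator(text, separators) {
  for (const separator of separators) {
    // The empty string always matches and means "split per character".
    if (separator === '' || text.includes(separator)) {
      return separator;
    }
  }
  // Nothing matched: fall back to the last (finest) separator.
  return separators[separators.length - 1];
}

const demoSeparators = ['\n# ', '\n## ', '\n\n', '\n', '&#&#&#', ' ', ''];

console.log(JSON.stringify(pickSeparator('intro\n## Title\nbody', demoSeparators)));
// "\n## "
console.log(JSON.stringify(pickSeparator('plain words', demoSeparators)));
// " "
console.log(JSON.stringify(pickSeparator('word', demoSeparators)));
// ""
```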

With the above settings, the following code would break down Markdown as shown:

    const splitterConfig = { chunkSize: 90, separators: delimiters, softMaxChunkSize: 55 }
    const text = `# Header\n` + `a`.repeat(95) + `. First sentence goes here. Second sentence goes here. Third sentence goes here. Fourth sentence goes here. Fifth sentence goes here. Now for a really long sentences that will break on the hard max length so we can see that happen. \n## Second Section Title is also really long and goes past the soft max. \n### Third Section Title is also really long and goes past the soft max and past the chunk size by quite a bit, in fact it should stop here. \n## Fourth Section\nThis is the fourth section.  It has two short sentences.  And one lonnnnnnnnnnnnnnnnnnng run on sentence that seems to go on forever but it can't be stopped, oh no it just keeps on going and going and going.`
    const splitter = new SbdSplitter(splitterConfig);
    const chunks = await splitter.splitText(text);
    console.log(chunks)

This would be the result.

let result = [
    // breaks on \n at the end of the header
    "# Header",
    // breaks on hard max length
    'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
    'aaaaaa.',
    // breaks on softmax
    'First sentence goes here. Second sentence goes here. Third sentence goes here.',
    'Fourth sentence goes here. Fifth sentence goes here.',
    // breaks on hard max with spaces
    'Now for a really long sentences that will break on the hard max length so we can see that',
    'happen.',
    // breaks on softmax
    '## Second Section Title is also really long and goes past the soft max.',
    // breaks on hardmax
    '### Third Section Title is also really long and goes past the soft max and past the chunk',
    'size by quite a bit, in fact it should stop here.',
    // breaks on \n 
    '## Fourth Section',
    // breaks on soft max
    'This is the fourth section.  It has two short sentences.',
    // breaks on hard max with spaces
    "And one lonnnnnnnnnnnnnnnnnnng run on sentence that seems to go on forever but it can't be",
    'stopped, oh no it just keeps on going and going and going.'
]

License

The SbdSplitter class is released under the MIT License.

Contributing

Contributions to the SbdSplitter class are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.

Acknowledgements

The SbdSplitter class is built on top of the RecursiveCharacterTextSplitter class from the LangChain.js library and utilizes the sbd library for sentence boundary detection.