npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2024 – Pkg Stats / Ryan Hefner

fast-dotproduct

v1.3.0

Published

Fast dot product calculation for the web platform.

Downloads

13

Readme

fast-dotproduct

Fast dot product calculations for the web platform. Speeds up your 🐌 dot product calculations by up to 103457% ⚡.

🔬 My experiments have shown that the fastest algorithms to calculate the dot product between n-dimensional vectors currently are WebAssembly (SIMD) optimized and JIT-optimized JavaScript (unrolled loops for 16 byte aligned instructions to be vectorized by the optimizer) for workloads < 1 million vectors (aka. "a typical local vector store").

⚡ It's fast!

For reference: The following results were produced on typical a consumer machine: Macbook Air, 13", M3, 2024, 8 core, 16 GiB RAM, model: MXCV3D

🐌 Slow: baseline/naive JS:

  • Runs: 100k single dot product calculations on 2 n-dimensional vectors, loop-inlined
  • Took:
    • 35756 ms for 4 dimensions,
    • 36012 ms ms for 384 dimensions,
    • 36525 ms ms for 1024 dimensions

🦆 Faster: pure JavaScript, but JIT-optimized: (468,27x faster, or 46827% faster)

  • Runs: 100k single dot product calculations on 2 n-dimensional vectors, loop-inlined
  • Took:
    • 1 ms for 4 dimensions,
    • 30 ms for 384 dimensions,
    • 78 ms for 1024 dimensions

🐇 Fastest: WebAssembly, SIMD-optimized (1034,57x faster, or 103457%):

  • Runs: 100k single dot product calculations on 2 n-dimensional vectors, loop-inlined
  • Took:
    • 3 ms for 4 dimensions (suffering from inital invocation overhead),
    • 13 ms for 384 dimensions,
    • 35 ms for 1024 dimensions

Do you see any potential for further improvements? Please contribute to this project! Let's build the fastest dotproduct library for the web!

📚 Usage

1. Install fast-dotproduct:

npm/yarn/bun install fast-dotproduct

2. Use it

2.1 Tensor API

The Tensor API variant is the default API, allowing for passing many vectors to be calculated in one bulk operation. This ensures JIT optimizations taking place in the JS runtime/VM (V8, JavaScriptCore).

For every vector passed in the Array of the first argument, the dot product is caclulated with the same index vector in the second argument Array (vectorA[n] ⋅ vectorB[n]).

import { dotProduct, initWasm } from "fast-dotproduct"
// make sure the WebAssembly Module is loaded
await initWasm(); // or: await initWasm(await getWasmModule()); for custom WASM runtime - see ./src/index.spec.ts

const vectorsA = [
  new Float32Array([
    0.1785489022731781,
    0.6047865748405457,
    -0.29714474081993103,
    0.8343878388404846
  ])
]

const vectorsB = [
  new Float32Array([
    -0.12137561291456223,
    0.4885213375091553,
    0.3105606138706207,
    -0.23960202932357788
  ])
]

const result = dotProduct(vectorsA, vectorsB)
// Float32Array(-0.01842280849814415)

2.2 Atomic Vector API

There are cases, when the Tensor API is not what you're looking for and creating/unwrapping Array's would be unnecessary overhead. You can use the atomic vector operation API in those cases.

import { singeDotProduct, initWasm } from "fast-dotproduct"
// make sure the WebAssembly Module is loaded
await initWasm(); // or: await initWasm(await getWasmModule()); for custom WASM runtime - see ./src/index.spec.ts

const vectorA = new Float32Array([
  0.1785489022731781,
  0.6047865748405457,
  -0.29714474081993103,
  0.8343878388404846
])

const vectorB = new Float32Array([
  -0.12137561291456223,
  0.4885213375091553,
  0.3105606138706207,
  -0.23960202932357788
])

const result = singeDotProduct(vectorA, vectorB)
// -0.01842280849814415

2.3 Low-level API access

If you want to spare on safety and call a runtime function directly, you may do so as well:

Tensor API: dotProductJS(vectorsA: Array<Float32Array>, vectorsB: Array<Float32Array>) dotProductWASM(vectorsA: Array<Float32Array>, vectorsB: Array<Float32Array>)

Atomic Vector API: singleDotProductJS(vectorA: Float32Array, vectorB: Float32Array) singleDotProductWASM(vectorA: Float32Array, vectorB: Float32Array)

For examples on how to use the WASM or JS-implementation specifically, please refer to the tests.

Why shouldn't I use a simple, naive dot product implementation in JS?

"Wouldn't a simple implementation not do it? I read that the V8 and JavaScriptCore optimizers can do a great job, optimizing!"

vectorA.reduce((sum, ai, j) => sum + ai * vectorB[j], 0)

Well.. unfortunately, sometimes they do, sometimes they don't. If we narrow the scope of what the optimizer need to speculate about, we get a performance boost. This is, what happens in this repo's JIT optimized implementation. But it's still a JS/JIT based implementation. Using a WebAssembly runtime, and explicitly using the v128 instruction set, as well as explicitly choosing the instructions to use, and on top of that, calculating 4 float32 dot product calculations per instruction, we get the greatest boost.

You can run the performance tests on your machine. Checkout the repo and run: npm run test.

Err, cool, but for what do I need this anyway?

Vector search, for example. Dot products are an essential part of the math behind cosine similarity.

But wouldn't it be faster when we use the GPU??

Unfortunately, not necessarily. The time it takes to finish a program execution not only depends on the computation speed. There are factors like memory allocation and memory alignment overhead. And when you're done with memory management, there is still shader compilation. After you're done with the computation, you need to unpack/read the results back. Using the GPU can be a great boost for heavy workloads, but isn't always benefitial for small ticket size workloads.

Can't we use pthread in WebAssembly and many Workers for parallel execution?

Well, you can, but it isn't necessarily faster either. Raming up a Worker can be pre-computed, but you still have to use the message loop to pass data from A to B and back, map memory and organize workloads. I couldn't find a scenario under the given workfload ticket sizes, where pthread and Worker-based calculation wouldn't show a drastically bad impact on performance, unfortunately.

Help improve this project!

Setup

Clone this repo, install the dependencies (bun is recommended for speed), install emscripten and run npm run test to verify the installation was successful. You may want to play with the experiments.

Optimize/introduce new algos

Because of the current limits with the WebGL, WebGL2 and WebGPU API's, including the FP16 extension and dot4U8packed feature, I couldn't find any implementation yet, that would be faster than simply passing a pointer of the JavaScript heap memory 1:1 to the WebAssebly module that would use the Intel, AMD and ARM vector instruction sets to do the computation, passing back the number. Not even passing chunks of memory or using consecutive or interleaved flat memory layouts yielded better results. JIT-optimized vector calculation led to the V8 and JavaScriptCore optimizers to pick on the aligned memory/instructions and seemingly, to vectorize them automatically. The WebAssembly implementation did not benefit from pthread and the use of Worker in parallel.

Can you find any faster algorithm? Please share it via a Pull Request!