weighted-random-item-sampler
The WeightedRandomItemSampler class implements a random sampler where the probability of selecting an item is proportional to its weight, with replacement allowed between samples. In other words, an item can be sampled more than once.
For example, given items [A, B] with respective weights [5, 12], the probability of sampling item B is 12/5 times the probability of sampling item A.
Weights must be positive numbers, but they are not restricted to natural numbers; floating-point weights such as 0.95, 5.4, and 119.83 are also supported.
Use case examples include:
- Distributed Systems: The sampler can assist in distributing workloads among servers based on their capacities or current load, ensuring that more capable servers handle a greater number of tasks.
- Surveys and Polls: The sampler can be used to select participants based on demographic weights, ensuring a representative sample.
- Attack Simulation: Randomly select attack vectors for penetration testing based on their likelihood or impact.
- ML Model Training: Select training samples with weights based on their importance or difficulty to ensure diverse and balanced training data.
If your use case requires sampling each item exactly once without replacement, consider using non-replacement-weighted-random-item-sampler instead.
Key Features :sparkles:
- Weighted Random Sampling :weight_lifting_woman:: Items are sampled with probability proportional to their weight.
- With Replacement: Items can be sampled multiple times.
- Efficiency :gear:: O(log(n)) time and O(1) space per sample, making this class suitable for performance-critical applications where the set of items is large and the sampling frequency is high.
- Comprehensive documentation :books:: The class is thoroughly documented, enabling IDEs to provide helpful tooltips that enhance the coding experience.
- Tests :test_tube:: Fully covered by unit tests.
- TypeScript support.
- No external runtime dependencies: Only development dependencies are used.
- ES2020 Compatibility: The tsconfig target is set to ES2020, ensuring compatibility with ES2020 environments.
API :globe_with_meridians:
The WeightedRandomItemSampler class provides the following method:
- sample: Randomly samples an item, with the probability of selecting a given item being proportional to its weight.
If needed, refer to the code documentation for a more comprehensive description.
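A minimal usage sketch, using the same constructor signature shown in the use case example below (an items array followed by a matching weights array):
```ts
import { WeightedRandomItemSampler } from 'weighted-random-item-sampler';

// 'B' is 12/5 times as likely to be sampled as 'A'.
const sampler = new WeightedRandomItemSampler(['A', 'B'], [5, 12]);
const item = sampler.sample(); // Either 'A' or 'B'; sampling is with replacement.
```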
Use Case Example: Training Samples for an ML Model :man_technologist:
Consider a component responsible for selecting training samples for an ML model. By assigning weights based on the importance or difficulty of each sample, we ensure a diverse and balanced training dataset.
```ts
import { WeightedRandomItemSampler } from 'weighted-random-item-sampler';

interface TrainingSampleData {
  // ...
}

interface TrainingSampleMetadata {
  importance: number; // Weight for sampling.
  // ...
}

interface TrainingSample {
  data: TrainingSampleData;
  metadata: TrainingSampleMetadata;
}

class ModelTrainer {
  private readonly _trainingSampler: WeightedRandomItemSampler<TrainingSample>;

  constructor(samples: ReadonlyArray<TrainingSample>) {
    this._trainingSampler = new WeightedRandomItemSampler(
      samples, // Items array.
      samples.map(sample => sample.metadata.importance) // Respective weights array.
    );
  }

  public selectTrainingSample(): TrainingSample {
    return this._trainingSampler.sample();
  }
}
```
Algorithm :gear:
This section introduces a foundational algorithm, which will later be optimized. For simplicity, we assume all weights are natural numbers (1, 2, 3, ...). A plausible and efficient solution with O(1) time complexity and O(weights sum) space complexity involves allocating an array with a size equal to the sum of the weights. Each item is assigned to its corresponding number of cells based on its weight. For example, given items A and B with respective weights of 1 and 2, we would allocate one cell for item A and two cells for item B. This approach is valid when the number of items and their weights are relatively small. However, challenges arise when weights can be non-natural (e.g., 5.4, 0.23) or when the total weight sum is substantial, leading to significant memory overhead.
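As a rough illustration of this foundational approach (not the library's implementation, and assuming natural-number weights), the expanded-table idea might look like this:
```ts
// Naive approach: expand each item into `weight` cells of a lookup table.
// Sampling is O(1), but memory grows with the sum of the weights.
function buildNaiveTable<T>(items: readonly T[], weights: readonly number[]): T[] {
  const table: T[] = [];
  for (let i = 0; i < items.length; ++i) {
    for (let j = 0; j < weights[i]; ++j) {
      table.push(items[i]);
    }
  }
  return table;
}

function sampleNaive<T>(table: readonly T[]): T {
  // Uniformly pick one cell; each item occupies cells in proportion to its weight.
  return table[Math.floor(Math.random() * table.length)];
}
```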
Next, we introduce an optimization over this basic idea. We calculate a prefix sum of the weights, treating each cell in the prefix sum array as denoting an imaginary half-open range. Using the previous example with items A and B (weights 1 and 2), the first range is denoted as [0, 1), while the second range is [1, 3). We can then randomly sample a number (not necessarily a natural number) within the total range [0, 3) and match it to its corresponding range index, which corresponds to a specific item. This random-to-interval matching can be performed in O(log n) time using a left-biased binary search to find the leftmost index i such that randomPoint < prefix_sum[i]. A key observation that enables this binary search is the monotonic ascending nature of the prefix sum array, as weights are necessarily positive.
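A sketch of this prefix-sum approach under the assumptions above (illustrative only; the class's internal structure may differ):
```ts
// O(n) preprocessing; O(log n) time and O(1) extra space per sample.
// Works for any positive weights, including floating-point values.
function buildPrefixSums(weights: readonly number[]): number[] {
  const prefixSums: number[] = new Array(weights.length);
  let runningSum = 0;
  for (let i = 0; i < weights.length; ++i) {
    runningSum += weights[i];
    prefixSums[i] = runningSum; // Item i owns the half-open range [previous sum, prefixSums[i]).
  }
  return prefixSums;
}

function sampleIndex(prefixSums: readonly number[]): number {
  const totalWeight = prefixSums[prefixSums.length - 1];
  const randomPoint = Math.random() * totalWeight; // Uniform point in [0, totalWeight).

  // Left-biased binary search: leftmost index i such that randomPoint < prefixSums[i].
  let left = 0;
  let right = prefixSums.length - 1;
  while (left < right) {
    const mid = (left + right) >> 1;
    if (randomPoint < prefixSums[mid]) {
      right = mid;
    } else {
      left = mid + 1;
    }
  }
  return left;
}
```
For weights [1, 2], the prefix sums are [1, 3]; a random point of 0.5 maps to index 0 (item A), while 2.0 maps to index 1 (item B), matching the ranges described above.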