npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2024 – Pkg Stats / Ryan Hefner

crawlee-one

v2.0.4

Published

CrawleeOne is a framework built on top of Crawlee and Apify for writing robust and highly configurable web scrapers

Downloads

109

Readme

CrawleeOne

The web scraping framework you can't refuse.

CrawleeOne is a feature-rich and highly configurable web scraping framework that empowers both scraper developers and their users.

It is built on top of Crawlee and Apify*. Read here for the recap of how Crawlee and Apify work.

The appeal of CrawleeOne is that it works seamlessly with Apify platforn, but can also be easily re-purposed to work with other web scraping platforms or your custom services.

When deployed to Apify, or otherwise made available to be used by others, the users of your scraper will have the freedom to transform, filter, limit, or otherwise modify both the scraped data and the requests to scrape.

CrawleeOne should be especially your choice if:

  • You're developing a long-lasting integrations.
  • Or your scraper will be part of a data pipeline.
  • Or you wish to make your scrapers available to others in your team / org, whether it's programmatically or via Apify UI.

NOTE: crawleeOne allows you to easily switch between different implementations - Playwright, Cheerio, Puppeteer, ...

However, you still need to write data extraction logic that's specific to the implementation.

To make the transition between different implementations seamless, you can use portadom, which offers a single interface across all these implementations.

Pre-requirements

To make the most of CrawleeOne, you should be familiar with:

  • Crawlee (AKA how to scrape data).
  • Apify platform (AKA how to manage a scraped dataset and request queue).

Table of contents

Minimal example

Following example defines a CheerioCrawler scraper with 2 routes (mainPage and otherPage) that process the incoming URLs either based on the URL, or on the page HTML.

pushData is used to save the scraped data, while pushRequests enqueues more URLs to be scraped.

import { crawleeOne } from 'crawlee-one';

await crawleeOne({
  type: 'cheerio',
  routes: {
    mainPage: {
      match: /example\.com\/home/i,
      handler: async (ctx) => {
        const { $, request, pushData, pushRequests } = ctx;
        // Scrape data from the page
        const data = [
          /* ... */
        ];

        // Save the scraped data.
        await pushData(data, {
          privacyMask: { author: true },
        });

        // If there's more URLs to scrape, enqueue them.
        const reqs = ['https://...'].map((url) => ({ url }));
        await pushRequests(reqs);
      },
    },
    otherPage: {
      match: (url, ctx) => url.startsWith('/') && ctx.$('.author').length,
      handler: async (ctx) => {
        // ...
      },
    },
  },
  hooks: {
    onReady: async (inst) => {
      await inst.runCrawler(['https://...']);
    },
  },
});

If you're familiar with Crawlee, the minimal example above is roughly equivalent to:

import { Actor } from 'apify';
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

await Actor.main(async () => {
  const rawInput = await Actor.getInput();
  const input = {
    ...rawInput,
    ...(await fetchInput(rawInput.inputFromUrl)),
    ...(await runFunc(rawInput.inputFromFunc)),
  };

  const router = createCheerioRouter();

  router.addHandler('mainPage', async (ctx) => {
    await onBeforeHandler(ctx);

    // Scrape data from the page
    const data = [
      /* ... */
    ];
    const finalData = await transformAndFilterDataWithUserInput(data, ctx, input);

    // Save the scraped data
    const dataset = await Actor.openDataset(input.datasetId);
    await dataset.pushData(data);

    // If there's more URLs to scrape, enqueue them.
    const reqs = ['https://...'].map((url) => ({ url }));
    const finalReqs = await transformAndFilterReqsWithUserInput(reqs, ctx, input);
    const queue = await Actor.openRequestQueue(input.requestQueueId);
    await queue.addRequests(finalReqs);

    await onAfterHandler(ctx);
  });

  router.addDefaultHandler(async (ctx) => {
    await onBeforeHandler(ctx);

    const url = ctx.request.loadedUrl | ctx.request.url;

    if (url.match(/example\.com\/home/i)) {
      const req = { url, userData: { label: 'mainPage' } };
      const finalReqs = await transformAndFilterReqsWithUserInput([req], ctx, input);
      const queue = await Actor.openRequestQueue(input.requestQueueId);
      await queue.addRequests(finalReqs);
    }

    await onAfterHandler(ctx);
  });

  const crawler = new CheerioCrawler({
    ...input,
    requestHandler: router,
  });

  if (onReadyFn) await onReadyFn({ crawler, router, input });
  else crawler.run(['https://...']);
});

As you can see, there's a lot going on behind the scenes, and that's far from everything.

* Apify can be replaced with your own implementation, so the data can be sent elsewhere, not just to Apify. This is set by the io options.

What can CrawleeOne do?

Beside the main crawleeOne function for running crawlers, CrawleeOne also includes helpers and types for:

  • Actor boilterplating
  • Code generation
  • Configuring logging and error handling
    • E.g. Save errors to separate dataset or send to telemetry
  • Data and request filtering and post-processing
    • E.g. Enrich data with metadata
  • Routing
  • Testing actors
  • Actor migration (conceptually similar to database migration)
    • CLI utility for updating actors via apify-client
  • Privacy compliance
  • Metamorphing

CrawleeOne supports many common and advanced web scraping use cases. See the Use cases for the overview of the use cases.

See the section Usage (for end users) for how CrawleeOne looks from user's perspective.

Playbook & Use cases

Web crawlers written with CrawleeOne can be configured via their input field to handle following advanced use cases:

Actor input reference

See here the full list of all possible input options that a CrawleeOne crawler can have.

CrawleeOne allows you to configure the following via the input:

Usage (for developers)

Let's revisit the previous example this time with more options and explanations:

import { Actor } from 'apify';
import { crawleeOne, apifyIO, createSentryTelemetry } from 'crawlee-one';

// When you call `crawleeOne`, following happens:
// 1. Crawler initializes.
// 2. Crawler processes the URLs in the RequestQueue.
// 3. For each request, it decides which handler to give
//    the request to, based on the `match` fields.
// 4. The handler receives the request.
// 5. The handler may scrape data and save them with `pushData`.
// 6. The handler may find more URLs to scrape, and enqueue
//    them with `pushRequests`.
// 7. If crawler runs out of Requests and `keepAlive` is not set,
//    the crawler ends.
await crawleeOne({
  // Specify Crawlee Crawler class. Changing this has 2 consequences:
  // 1. It changes the crawler strategy - E.g. whether it uses Playwright,
  //    or makes plain HTTP requests.
  // 2. It decides what data and methods will be available on the `context`
  //    object in the route `handler`.
  //
  // Available types are: 'basic', 'http', 'jsdom', 'cheerio', 'playwright', 'puppeteer'
  type: 'cheerio',

  // (Optional) Input allows you to configure everything:
  // - The crawler class and configuration like max/min concurrency
  //   and timeouts.
  // - To load URLs to scrape from remote URL or database.
  // - Data and Request transformations, filtering, and more.
  // - Run requests in batches.
  // - Downstream crawler with Apify's "metamorph".
  //
  // See the Actor input reference for all input fields.
  // https://github.com/JuroOravec/crawlee-one/blob/main/docs/reference-input.md
  //
  // Specify input if you plan to use the crawler yourself,
  // otherwise use `inputDefaults` or set `mergeInput`.
  //
  // If you deploy the scraper on platforms like Apify,
  // then your users will populate this field via `Actor.getInput()`.
  input: {
    outputTransform: (item) => { ... },
  },

  // Input fields that can be overriden by you or the crawler users
  // See the Actor input reference for all input fields.
  inputDefaults: {
    // ...
  },

  // By default, if you specify the `input`, then user's input
  // (from `Actor.getInput()`) will be ignored. Set this to `true`
  // or to a custom merge function to include both your and user's `input` data.
  mergeInput: true,
  // is same as:
  mergeInput: ({ defaults, overrides, env }) => ({ ...defaults, ...env, ...overrides, });,

  // These settings are passed to Crawler class initialization.
  // E.g. if `type: 'playwright'`, then this config is used as:
  // `new PlaywrightCrawler(crawlerConfig);`
  //
  // Use `crawlerConfig` for config that cannot be configured via `input`,
  // or when you need the crawler to use specific settings and you don't
  // want users to override.
  crawlerConfig: {
    maxRequestsPerMinute: 120,
    requestHandlerTimeoutSecs: 180,
    headless: true,
    // ...
  },

  // Same as `crawlerConfig`, but these values can be overriden by the user's
  // config from `input`, and by values in `crawlerConfig`.
  crawlerConfigDefaults: {
    // ...
  },

  // Each URL passed to the crawler will be compared against the routes,
  // and passed to the handler of the first route that matches.
  routes: {
    mainPage: {
      // Regex that a URL must match, or a predicate function, or array of the two.
      match: /example\.com\/home/i,
      // The handler function is equivalent to Crawlee's `Router.addHandler`,
      // but with extra fields.
      handler: async (ctx) => {
        const { $, request, pushData, pushRequests } = ctx;
        // Scrape data from the page
        const data = [ ... ];

        // Save the scraped data. When you save data with `ctx.pushData`,
        // then you can filter, transform, limit, redact, and more.
        //
        // NOTE: `privacyMask` defines which fields are considered
        // personal information. To ensure we think about privacy
        // compliance when scraping, this field is required.
        await pushData(data, {
          privacyMask: { author: true },
        });

        // If there's more URLs to scrape, enqueue them.
        // When you enqueue data with `ctx.pushRequests`,
        // you can filter, transform, limit, and more.
        const reqs = ['https://...'].map((url) => ({ url }));
        await pushRequests(reqs);
      },
    },
  },

  hooks: {
    // By default, once ready, CrawleeOne calls `actor.runCrawler` (which calls `Crawler.run()`)
    // If you supply your own `onReady` callback, you have to call `actor.runCrawler` yourself.
    onReady: async (inst) => {
      // E.g. in this example, user can select to scrape all entries
      // or a certain kind by setting a custom `datasetType` input field.
      const startUrls: string[] = [];
      if (!actor.startUrls.length && actor.input?.datasetType) {
        startUrls.push(datasetTypeToUrl[actor.input?.datasetType]);
      }
      await actor.runCrawler(startUrls);
    },

    // You can add hooks before and after each `handler` call:
    // These hooks receive the same "context" as the handler functions,
    // which is equivalent to Crawlee's `Router.addHandler`, but with extra fields.
    onBeforeHandler: (ctx) => { /* ... */ },
    onAfterHandler: (ctx) => { /* ... */ },

    // If you run the crawler on Apify, or otherwise provide the crawler to others,
    // then it's a good practice to validate their input.
    validateInput: (input) => {
      const schema = Joi.object({ ... });
      Joi.assert(input, schema);
    },
  },

  // Configure the Crawlee proxy. See Crawlee's `ProxyConfiguration`
  // By default, no proxy is used.
  //
  // NOTE: DO NOT set proxy here if you are deploying the crawler to Apify
  // and you want the user to specify the proxy!
  proxy: Actor.createProxyConfiguration({ ... }),

  // Provide a telemetry instance that is used for tracking errors.
  // By default, no telemetry is used.
  //
  // See the docs for `CrawleeOneTelemetry`.
  //
  // In this example we track errors to Sentry.
  telemetry: createSentryTelemetry({
    dsn: 'https://[email protected]/zzzzzzzzzzzzzzzzzzzzz',
    tracesSampleRate: 1.0,
    serverName: 'myCrawler',
  }),

  // Provide an instance that is responsible for state management:
  // - Adding scraped data to datasets
  // - Adding and removing requests to/from queues
  // - Cache storage
  //
  // This is an API based on Apify's `Actor` utility class, which is also
  // the default.
  //
  // You don't need to override this in most of the cases.
  //
  // By default, Apify saves the data locally in `./storage` directory, for
  // as long as the crawler is not running from within the Apify's platform.
  // And if the crawler runs in Apify's platform then it uses
  // Apify's cloud storage.
  //
  // See the docs for `CrawleeOneIO`.
  io: apifyIO,

  // Provide a custom router instance. By default, uses Router from Crawlee.
  // See the docs for Crawlee's Router.
  router: myCustomRouter(),
});

You can find the full type definition of crawleeOne and its arguments here:

To learn more about pushData and pushRequests, see:

  • pushData
    • NOTE: When you use pushData from within a handler, you omit the first argument (ctx).
  • pushRequests

Route handler context

Each route handler receives a context object, as defined by Crawlee Router.

CrawleeOne extends this context object with extra properties:

await crawleeOne({
  // ...
  routes: {
    mainPage: {
      match: /example\.com\/page/i,
      // The `ctx` is the Crawlee Router context + extra properties
      handler: (ctx) => {
        // Crawlee properties
        ctx.log('bla bla...')
        const url = ctx.request.loadedUrl || ctx.request.url;
        ctx.response
        const $ = ctx.parseWithCheerio();
        // And more...

        // Extra props

        // 1. CrawleeOne instance (type: CrawleeOneActorInst):
        // - Save scraped items
        await ctx.actor.pushData(scrapedItems);

        // - Enqueue more URLs to scrape
        const id = Math.floor(Math.random() * 100);
        const url = `https://example.com/resource/${id}`;
        await ctx.actor.pushRequests([{ url }]);

        // - Access datasets and request queues
        const dataset = await ctx.actor.io.openDataset();
        const reqQueue = await ctx.actor.io.openRequestQueue();
        const keyValStore = await ctx.actor.io.openKeyValueStore();

        // - Access resolved `input`
        if (ctx.actor.input.myCustomInput) {
          // do something
        }

        // - Access resolved start URLs
        if (ctx.actor.startUrls.length) {
          // do something
        }

        // - Interact with state that can be access in hooks like `outputTransform` or `outputFilter`
        ctx.actor.state.myVar = 1;

        // 2. `pushData`, `pushRequests`, and `metamorph` can be accessed also
        // directly from the `ctx` object
        await ctx.pushData(scrapedItems) // Same as `ctx.actor.pushData`
        await ctx.pushRequests(urlsToScrape) // Same as `ctx.actor.pushRequests`
        await ctx.metamorph('nextCrawlerId', ...) // Same as `ctx.actor.metamorph`
      },
    }
  },
});

The actor object is integral to CrawleeOne. See here the full list of properties.

Deploying to Apify

See either of the two projects as examples:

1. Write the crawler with CrawleeOne

Either use the example projects above or use your own boilerplate project, but remember that Apify requires you to Dockerize the project in order to be deployed on their platform.

Remember to install crawlee-one package.

2. Define the crawler's input

You need to tell Apify what kind of input can be passed to your crawler. This is done by defining the actor.json file. You need to set this if you want to support the described use cases.

For that, you will need to:

  1. Install apify-actor-config as a dev dependency:

    npm i -D apify-actor-config

    apify-actor-config is a sister package focused solely on working with and generating Apify's actor.json config files.

  2. Write a JS/TS file where you will only define your config and export it as the default export.

    See here the example config file from Profesia.sk Scraper.

    Note that to make use of the CrawleeOne inputs, we need to import allActorInputs and pass it to properties field of createActorInputSchema.

    import { allActorInputs } from 'crawlee-one';
    import { createActorConfig, createActorInputSchema } from 'apify-actor-config';
    
    const inputSchema = createActorInputSchema({
      schemaVersion: 1,
      // ...
      properties: {
        ...customActorInput,
        ...allActorInputs,
      },
    });
    
    const config = createActorConfig({
      actorSpecification: 1,
      // ...
      input: inputSchema,
    });
    
    export default config;

    Also note that we are able to override the defaults set in allActorInputs by directly modifying the object:

    allActorInputs.requestHandlerTimeoutSecs.prefill = 60 * 3;
  3. Build / transpile the config to vanilla JS if necessary.

    In Profesia.sk Scraper, the config is defined as a TypeScript file, but apify-actor-config currently supports only JS files.

    So if you are also using anything other than plain JavaScript, then you will need to build / transpile your project. Do so only once you're happy with the input fields and their defaults.

  4. Generate actor.json file

    Run the npx apify-actor-config gen command and point it to the config JS file:

    npx apify-actor-config gen -c ./path/to/dist/config.js

    Optionally, set this as a script in package.json.

    The command should generate a config file in ./actor/actor.json, with all the inputs from crawlee-one. 🚀

  5. Deploy the project to Apify.

    Now head over to Apify to deploy the crawler there. See their docs on deployment.

  6. Verify that the crawler offers all the inputs.

    When you now go to see your crawler on Apify, you should see that you can configure all kinds of various inputs. Congrats, you've got it working! 🚀

    See the screenshot in the next section (Usage (for end users)) to see how the input looks like in the Apify UI.

Usage (for end users)

As a user of a crawler that was written with CrawleeOne, you have the option to configure the crawler, and transform, filter & limit the scraped data and the "requests" (URLs to scrape).

CrawleeOne crawlers allow you to do literally anything with the scraped data. See the common use cases here.

See here for how to use a CrawleeOne web scrapers through Apify platform.

Apify actor input page

Codegen & Config file

With CrawleeOne, you can generate TypeScript types and helper functions to create new instances of CrawleeOne with full type support.

With these types:

  • You get fully-typed scraper definition.
  • You can easily split the project across multiple files, as the corresponding types can be imported.

The final result can look like this:

// ./routes.ts
import { profesiaRoute } from './__generated__/crawler';

const otherPageRoute: profesiaRoute = {
  match: (url) => url.match(/example\.com\/home/i),
  handler: async (ctx) => {
    // ...
    await ctx.pushData(...);
  },
};
// ./main.ts
import { profesiaCrawler, profesiaRoute } from './__generated__/crawler';
import { otherPageRoute } from './routes';

await profesiaCrawler({
  hooks: {
    validateInput,
  },
  routes: {
    mainPage: {
      match: /example\.com\/home/i,
      handler: (ctx) => {
        // NOTE: The type of `ctx` is inferred
        ctx.parseWithCheerio();
        // ...
      },
    },
    otherPage: otherPageRoute,
  },
});

1. Define the crawler schema in a config

To get started, you need to define the scraper schema.

Config may look like this:

module.exports = {
  version: 1, // Currently always 1
  schema: {
    crawlers: {
      // Crawler name
      mainCrawler: {
        // `type` is one of 'basic', 'http', 'jsdom', 'cheerio', 'playwright', 'puppeteer'
        type: 'playwright',
        routes: ['listingPage', 'detailPage'],
      },
    },
  },
};

Here is an example if we wrote the config in YAML and defined multiple crawlers:

version: 1
schema:
  crawlers:
    main:
      type: 'playwright'
      routes: ['listingPage', 'detailPage']
    other:
      type: 'cheerio'
      routes: ['someNoJSPage']

CrawleeOne uses cosmiconfig to import the config. This means that you can define the config as any of the following:

  • crawlee-one property in package.json
  • .crawlee-onerc file in JSON or YAML format
  • .crawlee-onerc.json, .crawlee-onerc.yaml, .crawlee-onerc.yml, .crawlee-onerc.js, .crawlee-onerc.ts, .crawlee-onerc.mjs, or .crawlee-onerc.cjs file
  • crawlee-onerc, crawlee-onerc.json, crawlee-onerc.yaml, crawlee-onerc.yml, crawlee-onerc.js, crawlee-onerc.ts or crawlee-onerc.cjs file inside a .config subdirectory
  • crawlee-one.config.js, crawlee-one.config.ts, crawlee-one.config.mjs, or crawlee-one.config.cjs file

2. Generate types

To generate the types from the config, run the generate command:

npx crawlee-one generate -o ./path/to/__generated__/file.ts

3. Use generated types

Once generated, we can use the types right away:

import { mainCrawler } from './__generated__/file.ts';

await mainCrawler({
  routes: {
    listingPage: {
      match: /example\.com\/home/i,
      handler: (ctx) => {
        // NOTE: The type of `ctx` is inferred
        ctx.parseWithCheerio();
        // ...
      },
    },
    detailPage: {
      // ...
    },
  },
});

Or we can even run multiple crawlers simultaneously. This can be useful in cases where for some pages you need browser automation like Playwright, whereas for other you don't.

import { mainCrawler, otherCrawler } from './__generated__/file.ts';

// Run the crawlers simulataneously
const mainPromise = mainCrawler({
  routes: {
    listingPage: {
      match: /example\.com\/home/i,
      handler: (ctx) => {
        ctx.page.locator('...');
        // ...

        // Send URLs to scrape to the Cheerio crawler
        await ctx.pushRequests([{ url: ... }], { requestQueueId: 'crawleeQueue' });
      },
    },
    detailPage: {
      // ...
    },
  },
});

const otherPromise = otherCrawler({
  input: {
    requestQueueId: 'crawleeQueue',
  },
  routes: {
    someNoJSPage: {
      match: /example\.com\/home/i,
      handler: (ctx) => {
        // NOTE: The type of `ctx` is inferred
        ctx.parseWithCheerio();
        // ...
        await ctx.pushData(...)
      },
    },
  },
});

await Promise.all([mainPromise, otherPromise]);

Custom telemetry integration (CrawleeOneTelemetry)

You may want to track errors to a custom service. In that case, you can define and pass a custom telemetry instance to the telemetry argument of crawleeOne.

The instance needs to implement the CrawleeOneTelemetry interface:

interface CrawleeOneTelemetry {
  setup: (actor: CrawleeOneActorInst) => Promise<void> | void;
  onSendErrorToTelemetry: (
    error: Error,
    report: object, // Object with data on the error
    options: {
      io?: CrawleeOneIO;
      allowScreenshot?: boolean;
      reportingDatasetId?: string;
    },
    ctx: CrawleeOneCtx;
  ) => Promise<void> | void;
}

See existing integrations for inspiration:

Based on the above, here's an example of a custom telemetry implementation that saves the errors to the local file system:

import fs from 'fs';
import type { CrawleeOneCtx, CrawleeOneTelemetry } from 'crawlee-one';

export const createFsTelemetry = <T extends CrawleeOneTelemetry<CrawleeOneCtx>>() => {
  const timestamp = new Date().getTime();
  let errors = 0;

  return {
    setup: async (actor) => {
      await fs.promises.mkdir('./temp/error');
    },
    onSendErrorToTelemetry: async (error, report, options, ctx) => {
      // E.g. '1694344665557_00001.json'
      const filename = timestamp + '_' + (errors++).toString().padStart(5, '0') + '.json';
      const data = JSON.stringify({ error, report });
      await fs.promises.writeFile(filename, data, 'utf-8');
    },
  } as T;
};

await crawleeOne({
  telemetry: createFsTelemetry(),
  // ...
});

Custom platform and storage integration (CrawleeOneIO)

By default, CrawleeOne uses Apify to manage datasets, request queue, and other platform-specific features.

In most of the cases, this should be fine, because Apify uses local file system when the crawler is not running inside Apify's cloud platform.

Sometimes, you may want to send the data to a custom dataset, or use a shared service for accessing requests or cache storage, or otherwise override the default behaviour. In those cases, you can define and pass a custom CrawleeOneIO instance to the io argument of crawleeOne.

The instance needs to implement the CrawleeOneIO interface:

interface CrawleeOneIO {
  openDataset: (id?: string | null) => MaybePromise<CrawleeOneDataset>;
  openRequestQueue: (id?: string | null) => MaybePromise<CrawleeOneRequestQueue>;
  openKeyValueStore: (id?: string | null) => MaybePromise<CrawleeOneKeyValueStore>;
  getInput: () => Promise<Input | null>;
  triggerDownstreamCrawler: (
    targetActorId: string,
    input?: TInput,
    options?: {
      build?: string;
    }
  ) => Promise<void>;
  runInContext: (userFunc: () => MaybePromise<unknown>, options?: ExitOptions) => Promise<void>;
  createDefaultProxyConfiguration: (
    input?: T | Readonly<T>
  ) => MaybePromise<ProxyConfiguration | undefined>;
  isTelemetryEnabled: () => MaybePromise<boolean>;
  generateErrorReport: (
    input: CrawleeOneErrorHandlerInput,
    options: PickRequired<CrawleeOneErrorHandlerOptions, 'io'>
  ) => MaybePromise<object>;
  generateEntryMetadata: (ctx: Ctx) => MaybePromise<TMetadata>;
}

See existing integrations for inspiration:

Based on the above, here's an example of a custom CrawleeOneIO implementation that overrides the datasets to send them to a custom HTTP endpoint.

import type { CrawleeOneIO, apifyIO } from 'crawlee-one';

export const createCustomIO = (baseUrl: string) => {
  const createDatasetIO = (id?: string) => {
    const fetchAllItems = () => {
      const endpoint = `${baseUrl}/dataset/${id ?? 'default'}/all`;
      return fetch(endpoint).then((d) => d.json());
    };

    const postItems = (items: any[]) => {
      const endpoint = `${baseUrl}/dataset/${id ?? 'default'}`;
      return fetch(endpoint, {
        method: 'POST',
        body: JSON.stringify(items),
      }).then((d) => d.json());
    };

    return {
      pushData: postItems,
      getItems: fetchAllItems,
      getItemsCount: () => fetchAllItems().then((d) => d.length),
    };
  };

  return {
    ...apifyIO,
    openDataset: createDatasetIO,
  } as CrawleeOneIO;
};

await crawleeOne({
  io: createCustomIO(),
  // ...
});

Example projects

Contributing

Found a bug or hav a feature request? Please open a new issue.

When contributing with your code, please follow the standard best practices:

  • Make a fork with your changes, then make a Merge Request to merge it
  • Be polite

Supporting CrawleeOne

CrawleeOne is a labour of love. If you like what I do, you can support me on BuyMeACoffee.