
@opendesign/illustrator-parser-pdfcpu

v1.1.2

Browser-compatible parser backwards compatible with https://gitlab.avcd.cz/backend/illustrator-parser-poppler

Illustrator Parser - pdfcpu

Browser-compatible parser, backwards compatible with the previous iteration and with https://gitlab.avcd.cz/backend/octopus-illustrator.

Usage

Node.js

import {
  PrivateData,
  ArtBoardRefs,
  ArtBoard,
} from '@opendesign/illustrator-parser-pdfcpu/dist/index'
import { FSContext } from '@opendesign/illustrator-parser-pdfcpu/dist/fs_context'

// Uses the embedded binary to extract information onto disk
const ctx = await FSContext({ file: '/path/to/illustrator/file' })

// Parses the Illustrator file embedded within the PDF
const privateData = await PrivateData(ctx)

// Returns a list of all artboards
const artboards = await Promise.all(
  ArtBoardRefs(ctx).map((ref) => ArtBoard(ctx, ref))
)

WASM

import { ArtBoard, ArtBoardRefs, PrivateData } from '@opendesign/illustrator-parser-pdfcpu/dist/index'
import { WASMContext } from '@opendesign/illustrator-parser-pdfcpu/dist/wasm_context'

// file can be obtained from <input type=file>
const contents = new Uint8Array(await file.arrayBuffer())

// Will try to run WASM via standard browser/Node.js APIs
const ctx = await WASMContext(contents)

// Parses the Illustrator file embedded within the PDF
const privateData = await PrivateData(ctx)

// Returns a list of all artboards
const artboards = await Promise.all(
  ArtBoardRefs(ctx).map((ref) => ArtBoard(ctx, ref))
)

Development

Code structure

wasm

Contains Go code to parse the Illustrator file, dump the PDF structure and extract private data.

It currently has two commands, used for extracting data from an .ai file in different contexts:

  • dump-serialized - targeting native code - creates a new folder in TMPDIR and dumps the extracted information there. Folder structure:

      /tmp/996617_f752c559434a4109863b6fda349bd304_LaneWebsite2.0_Resources_Blog.ai_214544471
      ├── _contents/
      ├── _private.ai
      ├── bitmaps/
      ├── fonts/
      └── source.json
  • wasm - targeting the browser. When run via WebAssembly.instantiateStreaming, it allows extracting information from a file without a server (see the sketch below).
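
For reference, WASMContext (used in the WASM example above) presumably wraps something like the following. This is a minimal sketch assuming Go's standard wasm_exec.js glue (which defines a global Go class), not the package's exact internals:

// Minimal sketch of instantiating a Go-built WASM module in the browser,
// assuming the wasm_exec.js runtime from the Go toolchain is already loaded.
declare const Go: new () => {
  importObject: WebAssembly.Imports
  run(instance: WebAssembly.Instance): Promise<void>
}

async function instantiateParser(url: string): Promise<void> {
  const go = new Go()
  // Stream, compile and instantiate the module in one step.
  const { instance } = await WebAssembly.instantiateStreaming(
    fetch(url),
    go.importObject
  )
  // Start the Go program's main(); the returned promise resolves when the
  // program exits. From here on, extraction happens entirely client-side.
  await go.run(instance)
}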

Environment variables

  • GOGC - controls how much extra memory the Go garbage collector allocates. The default is 100, meaning the heap may grow 2x before each collection. That default works well for most programs, but not for dump-serialized, which allocates lots of small chunks; to combat that, it runs the GC manually every so often during the dumping process. A value of 20 works best here.
  • TMPDIR - dictates where files are written. Be advised to move it off a RAM-backed filesystem when running a batch over all test data - tens of GBs of files are created in that process.

src

Contains TypeScript code to parse the outputs of the Go code into an Octopus-compatible format.

Uses jest for unit tests, contained in __test__.

Public API consists of 3 functions:

  • ArtBoardRefs,
  • ArtBoard,
  • PrivateData,

All of them expect a Context as an argument. The Context has to be obtained by parsing the .ai file with either the binary or WASM - see the examples above.

For ArtBoardRefs and ArtBoard, the implementation roughly follows these steps (sketched after the list):

  • obtain refs from PDF XRefTable from Context,
  • walk through refs to create a tree,
  • traverse the tree parsing nodes containing raw data,
  • modify the output to resemble poppler results.
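
A minimal sketch of that flow, with hypothetical names (TreeNode, buildTree, parseTree and the childRefs callback are illustrative stand-ins, not the package's actual internals):

// Hypothetical sketch: refs -> tree -> parsed nodes.
type Ref = number
type TreeNode = { ref: Ref; raw?: Uint8Array; children: TreeNode[] }

// Walk through refs to create a tree.
function buildTree(ref: Ref, childRefs: (ref: Ref) => Ref[]): TreeNode {
  return { ref, children: childRefs(ref).map((r) => buildTree(r, childRefs)) }
}

// Traverse the tree, parsing only the nodes that carry raw data; the final
// output would then be massaged to resemble the poppler results.
function parseTree(node: TreeNode, parseRaw: (raw: Uint8Array) => unknown): unknown {
  return {
    ref: node.ref,
    data: node.raw ? parseRaw(node.raw) : undefined,
    children: node.children.map((child) => parseTree(child, parseRaw)),
  }
}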

For PrivateData, the Context already has the raw bytes representing the data, unpacked and ready to scan. In this case, we read them once for two purposes (see the sketch after this list):

  • extract artboard names (by checking each line for a known pattern),
  • parse text layer data - this requires buffering all lines containing that data, then unpacking and parsing the buffer.
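
A minimal sketch of that single pass (the pattern and marker here are hypothetical stand-ins for the real ones):

// Hypothetical sketch: one pass over the private data lines, collecting
// artboard names immediately and buffering text layer lines for later.
const ARTBOARD_NAME = /hypothetical-artboard-name-pattern/
const TEXT_LAYER_MARKER = '%hypothetical-text-marker'

function scanPrivateData(lines: string[]) {
  const artboardNames: string[] = []
  const textLayerLines: string[] = []
  for (const line of lines) {
    const match = ARTBOARD_NAME.exec(line)
    if (match) artboardNames.push(match[0])
    if (line.startsWith(TEXT_LAYER_MARKER)) textLayerLines.push(line)
  }
  // The buffered text layer data gets unpacked and parsed only after the
  // whole input has been read.
  return { artboardNames, textLayerBuffer: textLayerLines.join('\n') }
}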

In total, there are three parsers for raw data:

  • src/contents - parses "XObject" data, described in Section 7.8.2 (Content Streams) of the PDF spec,
  • src/cmap - parses font descriptions, from Section 9.7.5 (CMaps),
  • src/private-data/text-document - parses the document inside private data that describes text layers.

The first two parsers roughly follow the same pattern: lexer -> operator stacking -> reducer. This works because we don't need a tree structure - the contents stream is just a list of operators and operands, whilst in CMaps we only need to extract a dictionary. Private data implements an entirely different lexer and parser; its past development can be traced in https://gitlab.avcd.cz/backend/ai-private-data-parser/.
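
A minimal sketch of that shared pattern (the token shapes are illustrative, not the package's actual grammar):

// Hypothetical sketch of lexer -> operator stacking -> reducer: operands are
// stacked until an operator token arrives and consumes them.
type Token = { kind: 'operand'; value: number } | { kind: 'operator'; name: string }

function reduce(tokens: Token[]): Array<{ op: string; args: number[] }> {
  const out: Array<{ op: string; args: number[] }> = []
  let stack: number[] = []
  for (const token of tokens) {
    if (token.kind === 'operand') {
      stack.push(token.value)
    } else {
      out.push({ op: token.name, args: stack })
      stack = []
    }
  }
  // No tree is needed: the result is just a flat list of operators with
  // their operands.
  return out
}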

gulpfile.ts

Contains some automation for the overall package. It can download test data, compile the Go code and run it on the aforementioned test data.

Dependencies

You might need:

  • Node.js - for TypeScript and Gulp. Run npm install to download deps.
  • Go - for wasm. npm run build will take care of everything.

If you have nix, just run nix-shell and you'll have a development shell ready :)

Possible optimizations

  • [x] Use symbols instead of type: number markers on types - will improve the readability of the IR,
  • [ ] Create a symbol for each operator - currently each comparison in index.ts does a linear equality check against each string (sketched below),
  • [x] Avoid duplicate parsing in the decoder (i.e. parseFloat on a literal string - maybe some intermediate parsing step?)
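
For the operator-symbol item, a minimal sketch of the idea (the operator names here are hypothetical, not the package's actual IR):

// Hypothetical sketch: intern operator strings as symbols once, so dispatch
// becomes a single table lookup plus reference comparisons instead of a
// chain of string equality checks.
const MoveTo = Symbol('m')
const LineTo = Symbol('l')

const OPERATORS: Record<string, symbol> = { m: MoveTo, l: LineTo }

function intern(op: string): symbol | undefined {
  // One lookup instead of op === 'm' || op === 'l' || ...
  return OPERATORS[op]
}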

Release process

  • ensure a new description is added to CHANGELOG.md in the Unreleased section - kacl will check this in the next step :)

  • bump version using npm version,

  • publish new version with npm publish.

Profiling

Node.js

  • Transpile all files to JS:

      tsc -p .
  • Use 0x to create a flamegraph:

      0x ./scripts/parse.js ./test-data/996617_f752c559434a4109863b6fda349bd304_LaneWebsite2.0_Resources_Blog.ai

Go

  • point AICPU_DUMP_PPROF to where the dump should be created:

      export AICPU_DUMP_PPROF=./prof/
  • run dump:

      ./wasm/cmd/dump-serialized/dump-serialized ./test-data/996617_f752c559434a4109863b6fda349bd304_LaneWebsite2.0_Resources_Blog.ai
  • analyze with:

      go tool pprof -http=":8081" ./wasm/cmd/dump-serialized/dump-serialized ./pprof/cpu.pprof