@racsodev/cv-pdf-to-json
v2.2.1
Published
Node.js library that extracts CV/resume data from PDF files and converts it to structured JSON using Claude's native PDF support.
Downloads
218
Maintainers
Readme
CV PDF to JSON
Extract and process CV data from PDF files with Claude AI's native PDF support. This library provides a robust pipeline for converting PDF resumes into structured JSON data.
Features
- Direct PDF processing using Claude AI's native PDF support
- Structured JSON output of CV data
- Support for processing single PDF file or directory of PDF files
- Support for saving structured JSON outputs
- Debug mode for detailed processing insights
Installation
npm install @racsodev/cv-pdf-to-json
Basic Usage
import { createPdfExtractor } from '@racsodev/cv-pdf-to-json'
import path from 'path'
// Initialize the PDF extractor
const extractor = createPdfExtractor({
anthropicApiKey: process.env.ANTHROPIC_API_KEY || '',
outputJsonPath: './outputs/json',
})
// Process a single file
const result = await extractor.process('path/to/cv.pdf')
console.log('Processing Result:', result)
// Process a directory
const results = await extractor.processDirectory('path/to/cvs')
console.log('Processing Results:', results)
Advanced Usage
For more control over the extraction process, you can use individual components:
import {
DocumentProcessor,
ClaudeProcessor,
type CvData,
type Experience,
type Education,
type Language,
ContractType,
LanguageLevel,
} from '@racsodev/cv-pdf-to-json'
// Initialize Claude AI processor with native PDF support
const processor = new ClaudeProcessor({
apiKey: process.env.ANTHROPIC_API_KEY || '',
})
// Create document processor
const documentProcessor = new DocumentProcessor({
processor,
outputJsonPath: './outputs/json',
debug: true,
})
// Process CV
async function processCV(pdfPath: string) {
const result = await documentProcessor.process(pdfPath)
if (result.success && result.data) {
const cvData: CvData = result.data
console.log('Extracted CV Data:', cvData)
}
return result
}
// Use the processor
const result = await processCV('path/to/cv.pdf')
Output Format
The processor returns data in the following format:
interface CvData {
lastName: string
firstName: string
address: string
email: string
phone: string
linkedin: string
github: string
personalWebsite: string
professionalSummary: string
school: string
schoolLowerCase: string
promotionYear: number
professionalExperiences: Experience[]
otherExperiences: Experience[]
educations: Education[]
hardSkills: string[]
softSkills: string[]
languages: Language[]
publications: string[]
distinctions: string[]
hobbies: string[]
references: string[]
}
interface Experience {
companyName?: string
title?: string
location: string
type: ContractType
startDate: number
endDate: number
duration: number // in months
ongoing: boolean
description: string
associatedSkills: string[]
}
interface Education {
degree: string
institution: string
location: string
startDate: number
endDate: number
duration: number // in months
ongoing: boolean
description: string
associatedSkills: string[]
}
interface Language {
language: string
level: LanguageLevel
}
enum LanguageLevel {
BASIC_KNOWLEDGE = 'BASIC_KNOWLEDGE',
LIMITED_PROFESSIONAL = 'LIMITED_PROFESSIONAL',
PROFESSIONAL = 'PROFESSIONAL',
FULL_PROFESSIONAL = 'FULL_PROFESSIONAL',
NATIVE_BILINGUAL = 'NATIVE_BILINGUAL',
}
enum ContractType {
PERMANENT_CONTRACT = 'PERMANENT_CONTRACT',
SELF_EMPLOYED = 'SELF_EMPLOYED',
FREELANCE = 'FREELANCE',
FIXED_TERM_CONTRACT = 'FIXED_TERM_CONTRACT',
INTERNSHIP = 'INTERNSHIP',
APPRENTICESHIP = 'APPRENTICESHIP',
PERFORMING_ARTS_INTERMITTENT = 'PERFORMING_ARTS_INTERMITTENT',
PART_TIME_PERMANENT = 'PART_TIME_PERMANENT',
CIVIC_SERVICE = 'CIVIC_SERVICE',
PART_TIME_FIXED_TERM = 'PART_TIME_FIXED_TERM',
SUPPORTED_EMPLOYMENT = 'SUPPORTED_EMPLOYMENT',
CIVIL_SERVANT = 'CIVIL_SERVANT',
TEMPORARY_WORKER = 'TEMPORARY_WORKER',
ASSOCIATIVE = 'ASSOCIATIVE',
}
Development Setup
- Install dependencies:
npm install
- Set up environment variables:
# Copy the example env file
cp .env.example .env
# Edit .env and add your Anthropic API key
ANTHROPIC_API_KEY=your_api_key_here
- Process documents:
npm run process <file-path>
This will process the specified PDF file or directory and generate JSON outputs in the configured directory.
Project Structure
src/
- Source code directoryprocessors/
- Document processing pipelineProcessor.ts
- Base processor classDocumentProcessor.ts
- Main document processing logicClaudeProcessor.ts
- Claude AI integration with native PDF support
types/
- TypeScript type definitionsutils/
- Utility functions for data processing and file handling
Requirements
- Node.js >= 18.4.2
- Anthropic API key for Claude AI integration
License
Apache-2.0