git-repo-parser

v2.0.7

Published

7 months ago

A tool to scrape all files from a GitHub repository and turn it into a JSON file

Downloads

0High
0Medium
0Low

arnab0321

github scraper json

git-repo-parser

A powerful tool to scrape all files from a GitHub repository and convert them into JSON or plain text format.

Installation

Install the package globally using npm:

npm install -g git-repo-parser

Or add it to your project as a dependency:

npm install git-repo-parser

Usage

Command Line Interface (CLI)

This package provides two CLI commands:

git-repo-to-json: Scrapes a GitHub repository and saves the result as a JSON file.
git-repo-to-text: Scrapes a GitHub repository and saves the result as a plain text file.

Example usage:

git-repo-to-json https://github.com/username/repo-name.git
git-repo-to-text https://github.com/username/repo-name.git

The scraped data will be saved as files.json or files.txt in your current directory.

Programmatic Usage

You can also use the package in your Node.js projects:

import { scrapeRepositoryToJson, scrapeRepositoryToPlainText } from 'git-repo-parser';

// To get JSON output
const jsonResult = await scrapeRepositoryToJson('https://github.com/username/repo-name.git');

// To get plain text output
const textResult = await scrapeRepositoryToPlainText('https://github.com/username/repo-name.git');

API

`scrapeRepositoryToJson(repoUrl: string): Promise<FileData[]>`

Scrapes the given GitHub repository and returns a promise that resolves to an array of FileData objects.

`scrapeRepositoryToPlainText(repoUrl: string): Promise<string>`

Scrapes the given GitHub repository and returns a promise that resolves to a string containing the repository contents in a structured plain text format.

FileData Interface

The FileData interface represents the structure of files and directories in the JSON output:

interface FileData {
    name: string;
    path: string;
    type: 'file' | 'directory';
    children?: FileData[];
    content?: string;
}

Features

Clones the repository locally (temporary)
Ignores binary files and common non-source files
Supports nested directory structures
Provides both JSON and plain text output formats
Cleans up cloned repository after scraping

Ignored Files

The following file types and patterns are ignored during scraping:

package-lock.json
Binary files (pdf, png, jpg, jpeg, gif, ico, svg, woff, woff2, eot, ttf, otf)
Media files (mp4, avi, webm, mov, mp3, wav, flac, ogg, webp)
Debug and error logs (npm-debug, yarn-debug, yarn-error)
Configuration files (tsconfig, jest.config)
The .git directory

License

This project is licensed under the MIT License.

Author

arnab2001

Contributing

Contributions, issues, and feature requests are welcome. Feel free to check [issues page] if you want to contribute.

Show your support

Give a ⭐️ if this project helped you!

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

git-repo-parser

Installation

Usage

Command Line Interface (CLI)

Example usage:

Programmatic Usage

API

scrapeRepositoryToJson(repoUrl: string): Promise<FileData[]>

scrapeRepositoryToPlainText(repoUrl: string): Promise<string>

FileData Interface

Features

Ignored Files

License

Author

Contributing

Show your support

`scrapeRepositoryToJson(repoUrl: string): Promise<FileData[]>`

`scrapeRepositoryToPlainText(repoUrl: string): Promise<string>`