EnjinScraper
Scrapes an Enjin site via the Enjin API.
For support, please join the support Discord: https://discord.gg/2SfGAMskWt.
Usage
Warning: If you have two-factor authentication (2FA) enabled on your Enjin account, you must either disable it or make a temporary account without 2FA to run this tool!
To scrape all data the tool can scrape, the account must be a sitewide admin or owner account.
EnjinScraper will now do its best even if you can't provide an API key or a site admin account; at minimum, a regular site account is still needed. There are of course still some limits to this, but I've done my best to include as much as I can. A few minor things remain to be added in this mode: it doesn't seem to get forum images properly, and it needs a way to get application questions. These things are normally handled by the admin panel, which users trying to archive sites they don't own obviously don't have access to. Best of luck to everyone!
To use this mode, you will have to manually provide the module IDs of the forum, news, and wiki modules you wish to scrape. The module ID is the first number found in the URL on any page of one of these modules. For example, in https://www.megacrafting.com/forum/m/4627724/viewthread/9364148-mejinxx-application, the forum module ID is 4627724. These should be specified as arrays of strings under the corresponding manual keys (e.g. manualForumModuleIDs).
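For instance, given the URL above, the relevant entry in config.json (see Configuration below) would be:

```json
{
    "manualForumModuleIDs": ["4627724"]
}
```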
Note: if you've already scraped with a site admin account and API key, this update does not provide any additional data.
Quick Run With NPX
Windows
Run the following in PowerShell:

```powershell
mkdir EnjinScraper
cd EnjinScraper
winget install -e --id OpenJS.NodeJS
npx enjinscraper
```
Note that if rerunning later, you may need to use npx enjinscraper@latest to force use of the latest version.
Configuration
Obtaining an API key
Per Enjin's instructions:
To enable your API, visit your admin panel / settings / API area. The content on this page includes your base API URL, your secret API key, and the API mode. Ensure that the API mode is set to "Public".
Configuring the config.json
Optionally, create a config.json file in the root directory of the project; otherwise, you will be prompted for the required values on first run. The file should look like this, but with comments omitted:
```jsonc
{
    "apiKey": "someapiKey", // Required
    "domain": "www.example.com", // Required
    "email": "someemail@example.com", // Required
    "password": "somepassword", // Required
    "adminMode": true,
    "excludeHTMLModuleIDs": [
        "1000001",
        "1000002"
    ],
    "excludeForumModuleIDs": [],
    "excludeNewsModuleIDs": [],
    "excludeTicketModuleIDs": [],
    "excludedWikiModuleIDs": [],
    "manualForumModuleIDs": [],
    "manualNewsModuleIDs": [],
    "manualTicketModuleIDs": [],
    "manualWikiModuleIDs": [],
    "manualUserIDs": [],
    "disabledModules": {
        "html": false,
        "forums": {
            "postIPs": true
        },
        "galleries": false,
        "news": false,
        "wikis": false,
        "tickets": false,
        "applications": false,
        "comments": false,
        "users": {
            "ips": false,
            "tags": false,
            "fullinfo": true,
            "characters": true,
            "games": true,
            "photos": true,
            "wall": true,
            "yourFriends": true
        },
        "files": {
            "s3": false,
            "wiki": false,
            "avatars": true,
            "profileCovers": true,
            "gameBoxes": true,
            "userAlbums": true
        }
    },
    "retrySeconds": 5, // Setting to 0 will retry instantly
    "retryTimes": 5, // Setting to 0 will disable retries; setting to -1 will retry indefinitely
    "debug": true,
    "disableSSL": false,
    "overrideScrapeProgress": false // Setting this to true will ignore the scraper's internal progress tracking and scrape all non-disabled modules
}
```
Note that data stemming from user profiles is disabled by default, as it can greatly extend the time needed to scrape sites with large member counts. You can of course change this in disabledModules.users.
You should use an account with the greatest possible permissions, as that will increase the amount of content that can be scraped. Given that, the practical use of this tool is unfortunately limited to those with backend access to the site being scraped. When running with sufficient permissions, there is no need to enter module IDs, as the scraper will automatically gather info about all modules on the site.
Running Manually
```bash
git clone https://github.com/Kas-tle/EnjinScraper.git
cd EnjinScraper
yarn
npx ts-node index.ts
```
Outputs
The scraper will output an SQLite file at target/site.sqlite in the root directory of the project. For a more detailed database schema, see OUTPUTS.md. The database will contain the following tables:
- scrapers: Contains information about what steps have been completed to gracefully resume scraping if needed
- module_categories: Enumerates the different categories modules can fall into
- modules: Contains information about modules
- presets: Contains information about presets, essentially a list of individual modules
- pages: Contains information about modules in the context of the page they reside on
- site_data: A table that stores various information about a website
- html_modules: Contains the HTML, JavaScript, and CSS of HTML modules
- forum_modules: Contains information about the forum modules that were scraped
- forums: Contains information about the forums scraped from the forum modules
- threads: Contains information about the threads scraped from the forums
- posts: Contains information about the posts scraped from the forums
- gallery_albums: Contains information about albums in a gallery, including their titles, descriptions, and images
- gallery_images: Contains information about images in a gallery, including their titles, descriptions, and associated albums
- gallery_tags: Contains information about tags in a gallery, including their locations and associated images and albums
- wiki_pages: Contains information about pages in a wiki, including their content, access control settings, and metadata
- wiki_revisions: Contains information about revisions to pages in a wiki, including their content, access control settings, and metadata
- wiki_likes: Contains information about users who have liked pages in a wiki
- wiki_categories: Contains information about categories in a wiki, including their titles and thumbnails
- wiki_uploads: Contains information about uploaded files in a wiki
- news_articles: Contains information about news articles scraped from the news modules
- ticket_modules: Contains information about ticket modules
- tickets: Contains information about tickets scraped from the ticket modules
- ticket_replies: Contains information about replies made to support tickets
- applications: Contains basic information about applications
- application_sections: Contains sections from applications
- application_questions: Contains questions from applications
- application_responses: Contains individual responses for applications
- comments: Contains information about comments on applications, news articles, wiki pages, and gallery images
- users: Contains information about users
- user_profiles: Contains information about user profiles, including their personal information, gaming IDs, and social media handles
- user_games: Contains information about the games that a user has added to their profile
- user_characters: Contains information about the characters that a user has added to their profile
- user_albums: Contains information about the albums that a user has created
- user_images: Contains information about the images that a user has uploaded
- user_wall_posts: Contains information about wall posts made by users
- user_wall_comments: Contains information about comments made on wall posts by users
- user_wall_comment_likes: Contains information about users who have liked comments on wall posts
- user_wall_post_likes: Contains information about users who have liked wall posts
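Once a scrape has finished, the database can be inspected with any SQLite client. As a quick sanity check, here is a minimal sketch (assuming the better-sqlite3 package, which is not part of this project) that lists every table in target/site.sqlite along with its row count:

```ts
// Sketch: enumerate tables and row counts in the scraper's output database.
// Assumes `yarn add better-sqlite3`; any other SQLite client works the same way.
import Database from 'better-sqlite3';

const db = new Database('target/site.sqlite', { readonly: true });

// sqlite_master holds one row per table, so no schema knowledge is needed here.
const tables = db
    .prepare("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    .all() as { name: string }[];

for (const { name } of tables) {
    const row = db.prepare(`SELECT COUNT(*) AS count FROM "${name}"`).get() as { count: number };
    console.log(`${name}: ${row.count} rows`);
}

db.close();
```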
All files scraped will be stored in the target/files directory, in the same directory as the config.json file. The directory structure simply follows the URL with the https:// prefix removed. For example, if a file's URL is https://www.example.com/somdir/file.png, it will be stored at target/files/www.example.com/somdir/file.png.
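To illustrate the mapping, here is a hypothetical helper (not the scraper's actual code) that reproduces it:

```ts
import path from 'path';

// Hypothetical helper mirroring the URL-to-path mapping described above:
// strip the scheme, then nest host and path under target/files.
function localPathFor(fileUrl: string): string {
    const { host, pathname } = new URL(fileUrl);
    return path.join('target', 'files', host, pathname);
}

console.log(localPathFor('https://www.example.com/somdir/file.png'));
// -> target/files/www.example.com/somdir/file.png
```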
Files that are stored in Enjin's Amazon S3 instance for your site will be automatically downloaded and stored in the target/files directory, using the same directory structure as on the S3 instance. All information about these files will be stored in the s3_files table in the database. Examples of modules that store files here include galleries, forums, applications, tickets, and news posts.
Files from wiki pages will generally be found under target/files/s3.amazonaws.com/files.enjin.com/${siteID}/modules/wiki/${wikiPresetID}/file.png.
User avatars are also scraped, combining the URLs found in user_profiles.avatar, user_wall_comments.avatar, and user_wall_post_likes.avatar. These will generally be found under assets-cloud.enjin.com/users/${userID}/avatar/full.${fileID}.png. Note that these URLs are generally stored in the database at the medium size, but only the full size is downloaded.
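As a sketch, the combined set of avatar URLs can be reproduced from the three columns named above (again assuming better-sqlite3; the scraper's internal logic may differ):

```ts
import Database from 'better-sqlite3';

const db = new Database('target/site.sqlite', { readonly: true });

// Union the three avatar columns mentioned above into one de-duplicated list.
const avatars = db.prepare(`
    SELECT avatar FROM user_profiles
    UNION
    SELECT avatar FROM user_wall_comments
    UNION
    SELECT avatar FROM user_wall_post_likes
`).all() as { avatar: string }[];

console.log(`${avatars.length} distinct avatar URLs`);
db.close();
```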
Profile cover images come from user_profiles.cover_image and are found at either https://assets-cloud.enjin.com/users/${userID}/cover/${fileID}.png if the user has uploaded their own cover image, or resources.enjin.com/${resourceLocator}/themes/${version}/image/profile/cover/${category}/${fileName}.jpg if the user is using an Enjin-provided cover image.
Game boxes are the images displayed for games a user has on their profile. They are found at assets-cloud.enjin.com/gameboxes/${gameID}/boxmedium.jpg.
Lastly, user album images from user_images.url_original can be found at either s3.amazonaws.com/assets.enjin.com/users/${userID}/pics/original/${fileName} or assets.enjin.com/wall_embed_images/${fileName}.