dps-extractor
v1.0.0
Published
Builds a docker container for executing data extraction tasks
DPS-EXTRACTOR
A containerized application that executes extraction tasks. This repository focuses on collecting third-party data and storing it in S3.
Requirements
- Node.js 18.5 https://nodejs.org/en/blog/release/v18.5.0/
- npm 8.12.1 https://www.npmjs.com/ (comes with Node.js, but must be updated to 8.12.1)
- TypeScript >= 4.0.5 https://www.typescriptlang.org/
- Docker https://www.docker.com/
Internal Dependencies
- dps-utilities-typescript (https://github.com/cafemedia/dps-utilities-typescript)
Setup
- Clone this repository
git clone git@github.com:cafemedia/dps-extractor.git
- Enter the directory
cd dps-extractor
- Install dependencies
npm install
Infrastructure
Running Terraform: a wrapper script is provided for your convenience. Use ./terraform.sh -h for more information.
# -e environment
# -p aws credentials profile name
# -v pass terraform variables, can be invoked multiple times
./terraform.sh \
-e development \
-p aws-profile \
.tf
Building
This project is designed to be imported as a library, and must first be compiled into JavaScript.
NOTE: the memory requirements for building are increasing...
npm run build
Linting
This project is configured to use tslint to keep our code styling in line.
npm run lint
Formatting
Please be sure to format your code before committing!
npm run format
Testing
A full test suite has been integrated into the project using:
- mocha - test framework - (https://mochajs.org/)
- chai - assertion library - (https://www.chaijs.com/)
- sinon - mocking and faking support - (https://sinonjs.org/)
- nyc - coverage reporting - (https://github.com/istanbuljs/nyc)
npm run test:unit:coverage
Git Commit Hooks
To avoid pushing messy code that would likely fail the linting or test phases in Drone, we use husky (https://github.com/typicode/husky), which automatically builds, lints, and tests our code whenever we attempt to commit.
Working with Private Github Packages
This project depends on dps-utilities-typescript, which is installed via npm but requires authentication with GitHub Packages.
Building Locally with Docker
This project is configured to automatically build and deploy an image to ECR on the Adthrive AWS Account with a repository of the same name. In order to test that builds work locally:
docker build -t dps-extractor --build-arg GITHUB_TOKEN=<YOUR GITHUB PAT> .
docker run -t dps-extractor Hello
jowens@JOWENS-MAC dps-extractor % docker run -t dps-extractor Hello
2021-02-11T18:05:22.912Z - info: [Hello] Starting - 512c0d70-a72d-4cdc-8013-4673527dd0b9 - {}
2021-02-11T18:05:22.923Z - info: [Hello] Done duration=3ms
jowens@JOWENS-MAC dps-extractor %
TODO: This could use some optimization.
Running in Airflow
Example airflow task to be incorporated into a DAG:
hello_extractor = KubernetesPodOperator(
    namespace="dps",
    image=f"312505582686.dkr.ecr.us-east-1.amazonaws.com/dps-extractor:<IMAGE TAG>",
    arguments=[
        "Hello",
        "-s", "{{ task_instance.xcom_pull(task_ids='get_state', key='return_value').date }}",
        "-e", "{{ task_instance.xcom_pull(task_ids='get_state', key='return_value').date }}",
        "-x",
    ],
    name="hello-extractor",
    task_id="hello-extractor",
    get_logs=True,
    dag=dag,
    is_delete_operator_pod=True,
    in_cluster=True,
    log_events_on_failure=True,
    run_as_user="airflow",
    annotations={"datadog-service": "sample-k8s-dag", "datadog-source": "airflow"},
    do_xcom_push=True,
)
Note that if do_xcom_push is set to True, we must also pass the -x argument to the container.
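One way to keep do_xcom_push and the -x argument from drifting apart is to derive both from a single flag. The sketch below is only an illustration: build_extractor_arguments is a hypothetical helper, not part of dps-extractor or Airflow, and it assumes the -s, -e and -x flags behave as in the example task above.

```python
from typing import List, Optional


def build_extractor_arguments(
    task_name: str,
    start: Optional[str] = None,
    end: Optional[str] = None,
    do_xcom_push: bool = False,
) -> List[str]:
    """Build the container argument list for an extractor task.

    Hypothetical helper: the flag names mirror the example task above.
    """
    arguments = [task_name]
    if start is not None:
        arguments += ["-s", start]
    if end is not None:
        arguments += ["-e", end]
    if do_xcom_push:
        # Per the note above, -x must accompany do_xcom_push=True.
        arguments.append("-x")
    return arguments


# Use the same flag for the container arguments and the operator setting.
push_xcom = True
hello_arguments = build_extractor_arguments(
    "Hello", start="2021-02-10", end="2021-02-10", do_xcom_push=push_xcom
)
```

The resulting list can be passed as arguments to KubernetesPodOperator, with the same push_xcom value supplied as do_xcom_push.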
Example Log Output:
[2021-02-11 18:14:17,749] {taskinstance.py:901} INFO - Executing <Task(KubernetesPodOperator): hello-extractor> on 2021-02-11T18:13:52.710304+00:00
[2021-02-11 18:14:17,750] {base_task_runner.py:131} INFO - Running on host: gamearningshelloextractor-f82ef6ed61744772b86229379fd9ba1b
[2021-02-11 18:14:17,750] {base_task_runner.py:132} INFO - Running: ['airflow', 'run', 'gam_earnings', 'hello-extractor', '2021-02-11T18:13:52.710304+00:00', '--job_id', '188', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/dags/gam_earnings.py', '--cfg_path', '/tmp/tmp77e3vqz_']
[2021-02-11 18:14:19,072] {base_task_runner.py:111} INFO - Job 188: Subtask hello-extractor [2021-02-11 18:14:19,072] {__init__.py:50} INFO - Using executor LocalExecutor
[2021-02-11 18:14:19,072] {base_task_runner.py:111} INFO - Job 188: Subtask hello-extractor [2021-02-11 18:14:19,072] {dagbag.py:417} INFO - Filling up the DagBag from /opt/airflow/dags/dags/gam_earnings.py
[2021-02-11 18:14:19,396] {base_task_runner.py:111} INFO - Job 188: Subtask hello-extractor Running <TaskInstance: gam_earnings.hello-extractor 2021-02-11T18:13:52.710304+00:00 [running]> on host gamearningshelloextractor-f82ef6ed61744772b86229379fd9ba1b
[2021-02-11 18:14:19,548] {logging_mixin.py:112} WARNING - /home/airflow/.local/lib/python3.8/site-packages/airflow/kubernetes/pod_launcher.py:309: DeprecationWarning: Using `airflow.contrib.kubernetes.pod.Pod` is deprecated. Please use `k8s.V1Pod`.
dummy_pod = Pod(
[2021-02-11 18:14:19,548] {logging_mixin.py:112} WARNING - /home/airflow/.local/lib/python3.8/site-packages/airflow/kubernetes/pod_launcher.py:77: DeprecationWarning: Using `airflow.contrib.kubernetes.pod.Pod` is deprecated. Please use `k8s.V1Pod` instead.
pod = self._mutate_pod_backcompat(pod)
[2021-02-11 18:14:19,606] {pod_launcher.py:171} INFO - Event: hello-extractor-0d3841b79ce44594bd421def1f168461 had an event of type Pending
[2021-02-11 18:14:19,606] {pod_launcher.py:139} WARNING - Pod not yet started: hello-extractor-0d3841b79ce44594bd421def1f168461
[2021-02-11 18:14:20,614] {pod_launcher.py:171} INFO - Event: hello-extractor-0d3841b79ce44594bd421def1f168461 had an event of type Pending
[2021-02-11 18:14:20,614] {pod_launcher.py:139} WARNING - Pod not yet started: hello-extractor-0d3841b79ce44594bd421def1f168461
[2021-02-11 18:14:21,625] {pod_launcher.py:171} INFO - Event: hello-extractor-0d3841b79ce44594bd421def1f168461 had an event of type Running
[2021-02-11 18:14:21,660] {pod_launcher.py:156} INFO - b'2021-02-11T18:14:20.820Z - \x1b[32minfo\x1b[39m: [Hello] Starting - 9e4e5a0a-44c0-444a-9940-80f2966f4366 - {"start":"2021-02-10T00:00:00.000+00:00","end":"2021-02-10T00:00:00.000+00:00","writeXcom":true}\n'
[2021-02-11 18:14:21,660] {pod_launcher.py:156} INFO - b'2021-02-11T18:14:20.822Z - \x1b[32minfo\x1b[39m: [Hello] Done duration=1ms\n'
[2021-02-11 18:14:21,718] {pod_launcher.py:267} INFO - Running command... cat /airflow/xcom/return.json
[2021-02-11 18:14:21,761] {pod_launcher.py:267} INFO - Running command... kill -s SIGINT 1
foo
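In the output above, the container's own log lines show up as byte strings that still carry ANSI color codes (\x1b[32m ... \x1b[39m). If those lines need to be inspected programmatically, a minimal sketch such as the following may help. The function name and the line pattern are assumptions inferred from the two sample lines, not a format documented by dps-extractor.

```python
import re

# Matches ANSI color sequences such as \x1b[32m and \x1b[39m.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")

# Assumed layout, inferred from the sample output:
#   <timestamp> - <level>: [<task>] <message>
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+) - (?P<level>\w+): \[(?P<task>[^\]]+)\] (?P<message>.*)$"
)


def parse_extractor_line(raw: bytes) -> dict:
    """Decode one container log line and split it into its parts (hypothetical helper)."""
    text = ANSI_ESCAPE.sub("", raw.decode("utf-8")).rstrip("\n")
    match = LOG_LINE.match(text)
    if match is None:
        raise ValueError(f"Unrecognized log line: {text!r}")
    return match.groupdict()


sample = b"2021-02-11T18:14:20.822Z - \x1b[32minfo\x1b[39m: [Hello] Done duration=1ms\n"
parsed = parse_extractor_line(sample)
print(parsed["level"], parsed["task"], parsed["message"])
```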