Getting started
To deploy a new data orchestration project with dorc, follow the guide below.
Prerequisites
Before you get started with dorc you will need to ensure the following are setup before
- AWS - AWS is used to create the cloud infrastructure, you therefore require an AWS environment with the relevant access and permissions. We recommend using aws-vault as a way of securely handling AWS credentials.
- pyenv - pyenv is used to install and manage the Python.
- Pulumi - Pulumi is used to deploy all the infrastructure programmatically. See the documentation on how to install Pulumi.
Folder Structure
The folder structure for setting up dorc is shown below. Within a directory of your choice, you will need to clone the dorc repo, alongside your pipeline config repository.
/<directory>
/dorc
/pipeline-config
/universal
/infra
/src
Dockerfile
requirements.txt
Pipeline Config Repository
The pipeline config repository is where your unique pipeline definition and configuration sits.
There is however a requirement on the folder structure of this private repository. A typical structure is seen above with the three different folders.
universal- Used to handle and run the dorc universal infrastructure.infra- Used to handle and run the dorc infra infrastructure. This infrastructure can be deployed to different environments.src- Where the core pipeline source code lives.
Typically once a project is setup you would expect the bulk of the code changes to be happening within the src folder.
Environments
Just like any standard software we recommend running your project in multiple environments. dorc allows for all of your data orchestration to be replicated across different environments of your choosing.
The only deployment in dorc that is environment agnostic is the universal infrastructure.
Layers
Layers facilitate enable the organization of data pipelines at different levels of abstraction, promoting a structured and manageable approach to data management.
Note: You do not have to use layers, you can just have one folder called
defaultthat contains all of your pipelines.
A typical data architecture might contain the following layers:
-
raw-to-processed: This layer consists of pipelines responsible for transforming datasets from their raw form, performing tasks such as validation, cleaning, and wrangling, and then moving them to the processed layer. The resulting data could then be used by Data Scientists for analysis or developmet.
-
processed-to-curated: This layer consists of pipelines that further refine the processed data into curated data products. Activities in this layer may involve joining multiple datasets and incorporating business logic. The resulting data could then be surfaced in visualisations.
For a dorc project we would reference this structure within the src folder like the following, note that example is typically the high level name of the specific pipeline you are willing to create.
/pipeline-config
/src
/raw-to-processed
/example
/processed-to-curated
/example
Dockerfile
requirements.txt
See further reading if unsure about typical data engineering structures.
Environment Variables
The following environment variables are required. We recommend creating a .env within the dorc folder but if this is not possible, e.g. running dorc through automated CI/CD actions exporting the variables is required.
AWS_ACCOUNT=
AWS_REGION=
CONFIG_REPO_PATH=
INFRA_BUCKET=
UNIVERSAL_STACK_NAME=
INFRA_STACK_NAME=
PULUMI_CONFIG_PASSPHRASE=
See the description and a example for each of the variables below
- AWS_ACCOUNT - AWS Account ID, for more details see here
- AWS_REGION - The AWS specific region in which to deploy all the infrastructure too
- CONFIG_REPO_PATH - The relative path to your pipeline-config repo from your local dorc repo. For the recommended structure above this value would be
../pipeline-config - UNIVERSAL_STACK_NAME - The folder location & name of the infrastructure stack that the universal infrastructure will be deployed too. Default value is: universal
- INFRA_STACK_NAME - The folder location & name of the infrastructure stack that the infra infrastructure will be deployed too. Default value is: infra
- PULUMI_CONFIG_PASSPHRASE -
Dockerfile
Docker is used to store and the code as images which are then deployed to serverless lambda functions. The Dockerfile can be set to the following:
FROM amazon/aws-lambda-python:3.9
ARG CODE_PATH
COPY ${CODE_PATH}/* ${LAMBDA_TASK_ROOT}/
COPY requirements.tx[t] ./global-requirements.txt
RUN test -f global-requirements.txt && pip install -r global-requirements.txt || echo "No global requirements"
COPY ${CODE_PATH}/../requirements.tx[t] ${LAMBDA_TASK_ROOT}/
RUN test -f ${LAMBDA_TASK_ROOT}/requirements.txt && pip install -r ${LAMBDA_TASK_ROOT}/requirements.txt || echo "No local requirements"
CMD [ "lambda.handler" ]
This allows for a project wide requirements.txt for universal wide Python packages to be installed but also for a pipeline specific requirements.txt.
Setup
All of the commands used within dorc are exposed as Make commands. Perform the following sequence to get started
make python-setup- Installs the relevant Python version using pyenvmake venv- Creates the Python virtual environment and installs all relevant packages. .venv/bin/activate- Activate the virtual environmentmake infra/init- Initialise the Pulumi infrastructure. You will need to ensure you have the relevant aws access available within the shell
Universal Infrastructure
Using the same folder structure as previously mentioned and with the UNIVERSAL_STACK_NAME=universal environment variables set we can use dorc to handle all the universal project infrastructure. dorc exports a Universal creator Python class.
Create a __main__.py within the universal folder (or the name you specified in the UNIVERSAL_STACK_NAME environment variable). Then create a Pulumi.universal.yaml in the same directory.
/pipeline-config
/universal
__main__.py
Pulumi.universal.yaml
Within the __main__.py you want to create an instance of the dorc universal creator and pass it with an instance of the dorc universal configuration model.
# These are imported from dorc
from utils.config import UniversalConfig
from infrastructure.universal import CreateUniversal
config = UniversalConfig(
region="eu-west-2",
project="ExampleDorc",
tags={"project": "example-dorc"},
)
universal_infrastructure_creator = CreateUniversal(config)
universal_infrastructure_creator.apply()
In the dorc repository and in the virtual environment created by dorc, we apply the universal infrastructure using the common dorc infra/apply command.
make infra/apply instance=universal
Infra Infrastructure
Building on the universal setup we now want to apply the environment specific environment.
Create the folder that you specified for the INFRA_STACK_NAME and the required __main__.py file. For every environment you wish to create you will need to create a relevant Pulumi.{env}.yaml file.
/pipeline-config
/universal
.
.
.
/infra
__main__.py
Pulumi.{env}.yaml
Within the Pulumi.{env}.yaml we need to set the environment configuration variable
config:
environment: env
Now we can create and apply the dorc python code within the __main__.py
from utils.config import Config
from infrastructure.infra import CreateInfra
config = Config(
universal=YOUR_BASE_UNIVERSAL_CONFIG,
vpc_id="aws_vpc_id",
private_subnet_ids=["aws_private_subnet_id"],
)
infra_infrastructure_creator = CreateInfra(config)
infra_infrastructure_creator.apply()
Now apply the infrastructure with the relevant dorc command
make infra/apply instance=infra env=prod
Now we have the universal and infra infrastructure setup we can create our pipelines.