Introducing CF.Cumulus.Ingest

Matt Collins
May 24, 2024

We're pleased to reveal the next component of our open-source cloud data platform accelerator, CF.Cumulus. We've been hard at work developing new functionality to handle the common data ingestion tasks and problems that we've seen out in the wild.

What is CF.Cumulus.Ingest?

With orchestration managed by Cumulus Control, Cumulus Ingest provides the functionality to move data from your various sources into the Lakehouse. Metadata describing data source connections, datasets and dataset attributes is passed to pre-built data pipelines, which query your source systems and move data through our raw and cleansed Lakehouse layers. The Ingest framework takes advantage of Change Data Capture, Spark Compute and Delta Table storage for efficient processing, storage and data lineage tracking.

How can CF.Cumulus.Ingest help you?

Improve time-to-insights:

Implementing best practices and optimising resource configurations lets you get to your data quickly, so you see value sooner.

More time to spend on the things that matter:

Let Cumulus Ingest take the brunt of the work when ingesting data. Pre-built components reduce the engineering effort needed to onboard new data sources and automate ingestion into your Lakehouse.

Put trust in your data:

Data observability and lineage are possible thanks to Cumulus’s framework and the data formats chosen. This gives users transparency and confidence, establishing your data as a "source of truth."

What are the features of CF.Cumulus.Ingest?

Pre-Configured Connections

What? Pre-configured connections to various data sources, including SQL Server (both on premises and in Azure), Oracle Databases, and file systems.
Why? Simplifies data source integration and reduces setup time.

Authentication Flexibility

What? Support for a variety of authentication methods, such as Managed Identities, with security baked in via Azure Key Vault.
Why? Ensures secure access to data sources without compromising on authentication options.

Medallion Architecture for Data Storage

What? Utilises a Medallion Architecture for data storage, including a hierarchical folder structure in the Raw container and merged Delta Tables in the Cleansed container.
Why? Enables efficient storage and retrieval of data across different layers.
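
To make the idea of a hierarchical Raw folder structure concrete, here is a minimal Python sketch. The layout (source, then dataset, then date and hour partitions) and the function name raw_landing_path are assumptions for illustration only, not the actual Cumulus folder convention.

```python
from datetime import datetime
from pathlib import PurePosixPath


def raw_landing_path(source: str, dataset: str, load_time: datetime) -> str:
    """Build a hypothetical hierarchical folder path for one raw load.

    Assumed layout: <source>/<dataset>/year=YYYY/month=MM/day=DD/hour=HH
    """
    return str(
        PurePosixPath(
            source,
            dataset,
            f"year={load_time:%Y}",
            f"month={load_time:%m}",
            f"day={load_time:%d}",
            f"hour={load_time:%H}",
        )
    )


# Example: where a load taken on 24 May 2024 at 09:30 would land.
print(raw_landing_path("AdventureWorks", "SalesOrderHeader", datetime(2024, 5, 24, 9, 30)))
```

Partitioning by load time in this way keeps each load separate in Raw, so earlier loads stay available for reprocessing into the Cleansed layer.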

Pre-built Metadata Assets

What? Includes pre-built metadata assets that facilitate data ingestion - populating SQL tables is all that is needed to ingest your first datasets!
Why? Simplifies data loading tasks and reduces manual effort.

Out-of-the-Box Data Pipelines

What? Provides ready-to-use data pipelines for pulling data from source systems into a Medallion layer Data Lake.
Why? Reduces development time and ensures consistent data movement.

Intelligent Data Loading

What? Uses Change Data Capture (CDC) conditions and historical load status for intelligent data loading.
Why? Optimises data ingestion by loading only relevant changes.

User-Defined CDC Logic

What? Allows users to define and configure CDC logic for incremental data loading.
Why? Saves compute costs and minimises time-to-load.
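
The two CDC features above can be sketched roughly as follows. This is a simplified Python illustration of watermark-based incremental loading, not the Cumulus pipeline itself; the DatasetMetadata fields and the build_extract_query function are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DatasetMetadata:
    """Hypothetical metadata for one dataset (field names are assumptions)."""
    source_table: str
    cdc_column: Optional[str]         # user-defined CDC column, e.g. a modified date
    last_loaded_value: Optional[str]  # watermark recorded by the last successful load


def build_extract_query(meta: DatasetMetadata) -> Tuple[str, str]:
    """Return (load_type, query) based on the CDC settings and load history."""
    base = f"SELECT * FROM {meta.source_table}"
    # No CDC column configured, or no previous load recorded: take everything.
    if meta.cdc_column is None or meta.last_loaded_value is None:
        return "full", base
    # Otherwise only pull rows that changed after the stored watermark.
    # Illustration only: real code should use parameterised queries.
    return "incremental", f"{base} WHERE {meta.cdc_column} > '{meta.last_loaded_value}'"


first_run = DatasetMetadata("dbo.Orders", "ModifiedDate", None)
later_run = DatasetMetadata("dbo.Orders", "ModifiedDate", "2024-05-23T00:00:00")
print(build_extract_query(first_run))
print(build_extract_query(later_run))
```

The first run falls back to a full load because there is no load history yet; later runs read only the rows past the watermark, which is where the compute and time savings come from.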

Data Type Enforcing

What? Data types defined in the Attributes metadata are enforced when data is stored in Delta format in the Cleansed storage layer.
Why? Provides control over dataset schemas upstream of serving layers and functionality.
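
As a rough illustration of metadata-driven type enforcement, the plain-Python sketch below casts raw string values to the types declared for each attribute. In the framework this happens on Spark when writing to Delta; the type names, the CASTERS mapping and the enforce_types function here are assumptions made for the example.

```python
from datetime import date
from decimal import Decimal

# Hypothetical mapping from metadata type names to Python casting functions.
CASTERS = {
    "int": int,
    "string": str,
    "decimal": Decimal,
    "date": date.fromisoformat,
}


def enforce_types(row: dict, attributes: dict) -> dict:
    """Cast each attribute in `row` to the type declared in `attributes`.

    Missing or None values are kept as None; an unparseable value raises,
    so bad data is caught before it reaches the cleansed layer.
    """
    typed = {}
    for name, type_name in attributes.items():
        value = row.get(name)
        typed[name] = None if value is None else CASTERS[type_name](value)
    return typed


attributes = {"OrderId": "int", "Amount": "decimal", "OrderDate": "date"}
raw_row = {"OrderId": "42", "Amount": "19.99", "OrderDate": "2024-05-24"}
print(enforce_types(raw_row, attributes))
```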

Source Control Support

What? Each component of Cumulus Ingest supports source control.
Why? Facilitates code deployment across environments in a repeatable manner.

Isolated or Combined Deployment

What? CF.Cumulus.Ingest is decoupled from CF.Cumulus.Control, meaning that it can be deployed independently or in a combined deployment.
Why? Offers flexibility based on project needs and unlocks additional features when combined.

Scalability

What? Dynamically assigns Spark Compute of differing sizes, giving you the power you need for each dataset's size and complexity when merging to your Cleansed layer.
Why? Ensures efficient data loading for both large and small datasets.
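
One simple way to picture this is a lookup from dataset size to a compute tier. The sketch below is an assumption-laden illustration: the tier names, the thresholds and the choose_compute_size function are hypothetical, and a real configuration could also weigh dataset complexity.

```python
def choose_compute_size(dataset_size_gb: float) -> str:
    """Pick a hypothetical compute tier from the dataset size in GB."""
    if dataset_size_gb < 0:
        raise ValueError("dataset_size_gb must not be negative")
    if dataset_size_gb < 1:    # assumed threshold for small datasets
        return "small"
    if dataset_size_gb < 50:   # assumed threshold for medium datasets
        return "medium"
    return "large"


for size in (0.2, 12, 300):
    print(size, "GB ->", choose_compute_size(size))
```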

What's next?

If you want to take a look at all that CF.Cumulus has to offer, then please head over to our landing page where you can see everything in action for yourself: Cumulus | Cloud Formations

We're also available to run you through some of the finer points and answer any questions by booking time with the team here: Speak to an expert | Cloud Formations

Thanks!
