
Using Metadata to Build a Robust Analytics Platform: From Introductory to Advanced Concepts

  • Writer: Matt Collins
  • Nov 6
  • 5 min read

Determine the granularity your metadata needs

Metadata-driven processing is something of a gold standard when it comes to analytics platforms. By leveraging a metadata store of some kind, organisations can create a dynamic analytics framework that is both robust and scalable. With some well-designed content injection, behind-the-scenes configurations improve the reusability of data pipelines, notebooks and infrastructure. As a result, there are many great implementations out in the wild, with a variety of complexities and configurations depending on the tools used, the use case in question and the level of control desired for the implementation.


With such a range of designs to choose from, though, what is the “correct” degree of granularity to capture when building your metadata-driven solution?


We’ll walk through five examples of the different levels of control and functionality we can implement through metadata, taking a tool-agnostic approach.


Why use metadata?

Metadata is essentially data that contains information about other data. A simple example is looking at the properties of a file on your local PC. Here you will see the file name, the path it is saved to, its size, and so on. These are all data points describing the data you care about.


It is likely you’ve used some of these metadata attributes in data engineering tasks before. You may have specified a file’s name (and path) to find it and read it in a programming language of your choice so that you can display its contents. You might also have inspected the file size, or the structure of the data itself, such as the columns of a tabular dataset.
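As a quick illustration, here is a minimal sketch using Python’s standard library; the file path is a made-up placeholder, not a real dataset:

```python
from pathlib import Path

# Hypothetical file path, used purely for illustration
file_path = Path("data/sales_2024.csv")

# Metadata attributes: name, location and size of the file
print(file_path.name)            # file name
print(file_path.parent)          # folder it lives in
print(file_path.stat().st_size)  # size in bytes

# Using that metadata to read and display the contents
with open(file_path, "r") as f:
    print(f.read())
```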


Level 1: Concepts

From a beginner’s perspective, you may have written multiple scripts, or drag-and-drop pipelines, for reading different files from a server. Naturally, you want to improve the efficiency of your process and look to parameterise your workflow with the metadata for file names and paths, so the same operation can be repeated for files across your local filesystem or cloud storage. Now, when things change, such as a new file being uploaded or a file moving folder, you have fewer hard-coded references to update, and you’ve improved maintainability at the same time.


This metadata could be stored as an object in your code, or as a parameter in an orchestration tool.


As an intuitive representation, we can model this as a tabular structure holding the metadata of the files we wish to interact with, such as below:

 

Screenshot of table metadata

Using this metadata table, you can add additional files to interact with simply by adding new rows and letting the existing code process the new information as desired!
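To make the idea concrete, here is a minimal sketch in Python; the table contents, column names and the `process_file` helper are illustrative assumptions, not the exact metadata shown in the screenshot:

```python
# Illustrative metadata table: one row per file we want to process.
file_metadata = [
    {"file_name": "customers.csv", "file_path": "raw/crm/"},
    {"file_name": "orders.json",   "file_path": "raw/sales/"},
]

def process_file(path: str, name: str) -> None:
    """Placeholder for whatever your pipeline does with each file."""
    print(f"Processing {path}{name}")

# The same code handles every row; adding a file is just adding a row.
for row in file_metadata:
    process_file(row["file_path"], row["file_name"])
```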


Implementing these basic coding practices of reusability (and automation) helps reduce the size of your operational codebase when working with enterprise-level workloads.


Level 2: Basics

We’re off to a good start, but how do we take it to the next step? Reviewing our process, we can look for repeated components. Have you got pipelines that perform the same (or similar enough) operations for copying CSV files and JSON files? Do you have code that performs “if-else” logic that could be captured in something more reusable?

 

The example below extends the metadata to accommodate SQL Server data sources as well.

Screenshot of table metadata

As the complexity increases, it might make sense to break your metadata out into separate tables to be concise and reusable.

 

Connections:

Screenshot of table metadata

 

Datasets:

Screenshot of table metadata

 

Some behind-the-scenes logic can then tie the metadata together into an organised and intuitive backend that gives your data pipelines the information they need to read the source data.
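A hedged sketch of what that behind-the-scenes logic might look like, with connections and datasets held separately and joined at runtime; the table contents, key names and connection strings are illustrative assumptions:

```python
# Illustrative connection metadata: one row per source system.
connections = {
    1: {"type": "AzureBlob", "connection_string": "<blob-connection>"},
    2: {"type": "SqlServer", "connection_string": "<sql-connection>"},
}

# Illustrative dataset metadata: each dataset references a connection by id.
datasets = [
    {"dataset_name": "customers.csv", "connection_id": 1, "format": "csv"},
    {"dataset_name": "dbo.Orders",    "connection_id": 2, "format": "table"},
]

# Tie the two together so each pipeline run knows where and how to read.
for ds in datasets:
    conn = connections[ds["connection_id"]]
    print(f"Read {ds['dataset_name']} ({ds['format']}) "
          f"from {conn['type']} using {conn['connection_string']}")
```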

 

Level 3: Extend


Depending on the tooling being used, you might even have the opportunity to parameterise the tooling itself. A specific example I’ve blogged about before is using a metadata-driven approach to handle Spark compute allocation for different tasks. Without repeating myself too much, parameterising the compute used to perform different tasks can give you greater efficiency and save costs from a computational perspective.
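For example, compute allocation can be captured as just another metadata attribute. The sketch below is a minimal, tool-agnostic illustration; the pool names, task names and `submit_task` helper are assumptions, not a specific orchestrator’s API:

```python
# Illustrative metadata mapping tasks to the compute they should run on.
task_compute = [
    {"task": "ingest_small_lookup",  "spark_pool": "small", "node_count": 2},
    {"task": "transform_fact_sales", "spark_pool": "large", "node_count": 8},
]

def submit_task(task: str, spark_pool: str, node_count: int) -> None:
    """Placeholder for submitting a job via your orchestrator of choice."""
    print(f"Submitting {task} to pool '{spark_pool}' with {node_count} nodes")

for t in task_compute:
    submit_task(t["task"], t["spark_pool"], t["node_count"])
```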


Taking this a step further, you could even parameterise the platform that performs a set of tasks in different regions, such as using metadata to allocate certain implementations of your data platform across regions (e.g. orchestrating tasks on certain triggers for North America vs Central Europe).

 

Level 4: Anticipate

Thinking big, how about using metadata to configure failover zones in your data engineering processes? Perhaps there is an Azure region outage and you’re able to move your workloads to the redundant region by toggling a Boolean flag (manually or automatically!) in your metadata.
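A minimal sketch of that idea, assuming a simple Boolean flag and a pair of regions held in metadata; the workload and region names are illustrative:

```python
# Illustrative failover metadata for a single workload.
workload_metadata = {
    "workload": "daily_ingestion",
    "primary_region": "northeurope",
    "secondary_region": "westeurope",
    "failover_enabled": False,  # toggled manually or by an automated health check
}

def resolve_region(meta: dict) -> str:
    """Pick the region the workload should run in based on the failover flag."""
    if meta["failover_enabled"]:
        return meta["secondary_region"]
    return meta["primary_region"]

print(f"Running {workload_metadata['workload']} in {resolve_region(workload_metadata)}")
```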

 

Thinking small, what about metadata to configure date and timestamp formatting? This is a common challenge for data scientists working with spreadsheet data, where dates may be formatted according to a variety of spur-of-the-moment, user-defined preferences (e.g. 2024-12-31, 31 December 2024, 1 Dec 24, 12312024, etc.). By encoding the timestamp format on a column-by-column basis in your metadata, you can add generic logic to your data engineering activities that uses the metadata format string to ensure your timestamp attributes are always handled correctly up front, rather than relying on cowboy code appended to solve the problem as it comes up.
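A hedged sketch of column-level format metadata being applied generically; the column names, format strings and sample row are made up for illustration:

```python
from datetime import datetime

# Illustrative metadata: a format string per source column.
column_formats = {
    "order_date":    "%Y-%m-%d",  # e.g. 2024-12-31
    "delivery_date": "%d %B %Y",  # e.g. 31 December 2024
    "invoice_date":  "%m%d%Y",    # e.g. 12312024
}

# Illustrative incoming row with inconsistently formatted dates.
row = {
    "order_date": "2024-12-31",
    "delivery_date": "31 December 2024",
    "invoice_date": "12312024",
}

# Generic logic: every date column is parsed using its metadata format string.
parsed = {col: datetime.strptime(value, column_formats[col])
          for col, value in row.items()}

print(parsed)
```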

 

Level 5: Parameterise

Opening this one up to the reader, what else could you encode in metadata? What business problems do you find yourself having to solve time and time again as part of your data engineering solutions? 

Or better still, what content can you inject that will improve interoperability and increase flexibility and scalability of your platform?


If you’re able to parameterise something that is currently hard-coded, you improve the portability of your design, reducing the tech debt carried into the next iteration and wrapping in documentation of data sources, configurations and challenges as part of the package.

 

What is the right level for you?

In most cases, you will be able to build metadata-driven data platforms in an incremental and agile way, improving as you go. Starting small, you’ll quickly see the benefits, but a bit of planning goes a long way. Understanding your long-term data strategy when building your data platform will greatly help with both the choice of tools you use and the understanding of which tools you may wish to swap in and out as time goes on. This high-level understanding then allows you to focus on the metadata of these components and build them in with the aim of swapping components in and out in a modular fashion.


Conclusion and Considerations

In this article we’ve tried to showcase that metadata isn’t something to be afraid of, and that your Data Engineering team is probably already using it to some degree! 


Depending on your organisation’s data maturity and data strategy, you may be considering higher granularity in your metadata configuration to increase the longevity of your platform. There is an inherent trade-off between control and the learning curve (for both the technical and non-technical teams involved), which can be accommodated through supportive tools for explainability and observability, such as good governance applications and practices, and generative tools, like interfaces for creating metadata integrated with source systems.

 

Regardless of the approach taken, increasing your metadata-driven infrastructure can be tackled incrementally, giving you the chance to solve business challenges with available capacity, while setting yourself up for the bigger picture of a metadata-driven platform.

 

Try it for yourself!

If you want to see this in action, why not reach out to the team to see how we use a metadata-driven framework in CF.Cumulus.


Our open-source framework allows you to quickly deploy the resources you need to start ingesting and transforming data from various data sources with pre-built connectors using Data Engineering best practices and Azure resources.

 
