# The Evolution of Data Infrastructure: From Basics to Advanced
## Understanding the Modern Data Landscape
In recent years, significant investments have flowed into companies that embody the “Modern Data Stack” concept. For instance, Fivetran has garnered nearly a billion dollars in funding, while dbt Labs has raised 150 million dollars and is seeking even more capital. Starburst has also attracted 100 million dollars, and there are many others.
Despite this influx of capital, it’s a misconception that every organization has a fully developed data stack. In reality, most businesses cannot and do not launch with a comprehensive data infrastructure. Instead, companies typically build their data stacks incrementally, which is arguably the most effective approach.
Achieving a single source of truth, with reliable data monitored through observability tools, is a gradual process. Teams need time to establish processes, build trust in the data, and develop the ability to manage it at scale.
In this article, I will delineate the stages that organizations often navigate while constructing their foundational data stack. Naturally, these stages are influenced by each company's unique priorities, objectives, and available funding.
## The Five-Person Start-Up Data Framework
Initially, a start-up's data stack often relies on Excel.
To clarify, in the early phases of a company, it’s common to have just a developer and, if you're fortunate, an analyst. The analyst typically requests data from the developer, who may extract it directly from the production database. That is somewhat risky, since heavy reporting queries compete with the application for the same resources. Often there simply hasn't been time to set up a separate database for reporting, something that could be addressed in the next development cycle.
Consequently, the analyst or business team member performs some basic data manipulation to generate a report. For a small team of 5-6 individuals, this method is a viable solution. Investing substantial resources into a data warehouse or complex data transformations would likely be imprudent unless the company has exceptional profit margins per employee.
If your transactions are minimal, data sources are few, and team members are pressed for time, Excel can be a practical choice. It allows for data sharing, basic analytics, and ad hoc analysis as new reporting needs come up. That said, if the company begins to thrive, a transition away from Excel will be necessary.
## Constructing Your Initial Baseline Data Stack
Eventually, relying on Excel becomes impractical. The manual processes are cumbersome and prone to errors, particularly as your organization expands in size, transaction volume, and data sources. This growth necessitates a centralized reporting solution.
At this juncture, your team may be ready to embark on building their first baseline data stack. While this task can appear overwhelming, especially with the plethora of articles available on constructing a modern data stack, I advise against attempting to implement everything simultaneously.
Initially, as you start gathering support from management and stakeholders, it’s crucial to demonstrate value quickly or, at the very least, show that you can produce a fundamental report. Trying to adopt everything that is usually labeled the modern data stack too early works against that goal, because the team spends its time on tooling before it has delivered anything.
Instead, I recommend focusing on three critical components:
- Ingestion → Data Pipelines
- Storage → Data Warehouses, Lakes, Lakehouses, etc.
- Reporting and Data Visualization → Attractive Dashboards, Notebooks, and, inevitably, Excel Reports
By concentrating on these three areas, your team can begin to develop a systematic approach for transferring data from various sources to your data warehouse. Establishing reliable processes is essential, as it ensures you can scale effectively as your company continues to grow.
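To make the three components concrete, here is a minimal sketch of ingestion, storage, and reporting in one script. It is an illustration under stated assumptions, not a recommended architecture: SQLite stands in for a real warehouse, and the `orders` table, the sample CSV, and the function names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source export; in practice this would come from an API,
# a SaaS tool, or a replica of the application database.
RAW_ORDERS_CSV = """order_id,customer,amount
1,acme,120.50
2,globex,80.00
3,acme,42.25
"""


def ingest(raw_csv: str) -> list[tuple[int, str, float]]:
    """Ingestion: parse the raw export into typed rows."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    return [(int(r["order_id"]), r["customer"], float(r["amount"])) for r in reader]


def store(conn: sqlite3.Connection, rows: list[tuple[int, str, float]]) -> None:
    """Storage: load rows into a warehouse table (SQLite is a stand-in)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    # INSERT OR REPLACE keeps reruns idempotent for the same order_id.
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def report(conn: sqlite3.Connection) -> list[tuple[str, float]]:
    """Reporting: revenue per customer, highest first."""
    cur = conn.execute(
        "SELECT customer, ROUND(SUM(amount), 2) AS revenue "
        "FROM orders GROUP BY customer ORDER BY revenue DESC"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
store(conn, ingest(RAW_ORDERS_CSV))
revenue_by_customer = report(conn)
print(revenue_by_customer)
```

Keeping each stage as a separate function is the main point of the sketch: any one of them can later be swapped for a managed tool without touching the other two.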
## Emphasizing the Human Element in Data Management
A crucial aspect that often gets overlooked, as highlighted by Danny from WhyLabs, is the human element in data management. While the technical aspects of storage, ingestion, and reporting are vital, there’s another layer that encompasses the human needs associated with data.
The main goal of creating a data warehouse is to provide a workspace for analysts and data scientists—who, after all, are human. This means that the data produced must be understandable, user-friendly, easily traceable, and trustworthy. If users cannot rely on your data, they may resort to pulling information from alternative sources, rendering your data warehouse ineffective.
If users struggle to locate the necessary data or are unaware of its existence, they may pull it manually. Furthermore, if they are not cognizant of existing data transformations, they might end up duplicating them unnecessarily.
At this early stage, it’s acceptable to adopt a more manual approach to some best practices. For instance, you could implement semi-automated queries to validate your data or invest in a straightforward tool like Bigeye for monitoring purposes. For tracking lineage and maintaining a data dictionary, a combination of Notion, Google Sheets, and Lucidchart can suffice. However, as your operations scale, this approach may become unmanageable. It’s crucial to find a balance that maximizes benefits while minimizing complexity.
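As one way to picture the "semi-automated queries" mentioned above, the sketch below runs a handful of SQL checks and reports how many rows violate each one. The table, the columns, and the specific checks are assumptions chosen for illustration, and SQLite again stands in for whatever warehouse you use.

```python
import sqlite3


def run_checks(conn: sqlite3.Connection) -> dict[str, int]:
    """Return the number of offending rows (or keys) per check; 0 means it passed."""
    # Hypothetical checks against a hypothetical `orders` table.
    checks = {
        "null_customer": "SELECT COUNT(*) FROM orders WHERE customer IS NULL",
        "negative_amount": "SELECT COUNT(*) FROM orders WHERE amount < 0",
        "duplicate_order_id": (
            "SELECT COUNT(*) FROM ("
            "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1)"
        ),
    }
    return {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}


# Small sample with deliberately bad rows so every check has something to find.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 120.5), (2, None, 80.0), (2, "globex", -5.0)],
)

results = run_checks(conn)
failed = {name: count for name, count in results.items() if count > 0}
print(failed)
```

A script like this can be run by hand after each load, or on a schedule, until the number of tables and checks justifies a dedicated monitoring tool.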
Ultimately, it’s essential to evaluate your organization’s current state, needs, and priorities, and implement appropriate solutions accordingly.
## The Future of Your Data Stack
With the foundational aspects of ingestion, storage, and reporting established, your next step is to enhance data observability and integrate tools that make it easier to trace how data flows through each layer of the stack, as illustrated in the diagram below.
However, the diagram does not encompass everything. For instance, I have omitted reverse ETL, MLOps, and a variety of other tools that Matt Turck tracks.
I plan to continue refining this diagram and elaborating on it in future articles, all while keeping pace with the emergence of new tools.
The first video, titled "What Is The Modern Data Stack - Intro To Data Infrastructure Part 1," provides an overview of contemporary data infrastructures and their significance.
The second video, "The Hitchhikers Guide to Building a Modern Data Stack - Coalesce 2023," offers insights into constructing an effective modern data stack.