From Excel to Lakehouse: A Modern Upgrade Path Using Parquet + DuckDB + Python

There is a version of the Excel modernization conversation that goes badly almost every time it is attempted, and it goes badly in a very specific way. An IT team, a data engineering consultant, or an ambitious internal technology initiative decides that the organization's reliance on Excel is a problem to be eliminated rather than an architecture to be evolved. A new platform is selected — a cloud data warehouse, a BI tool, an ERP system — and the migration is framed as a replacement: Excel out, new system in. Finance resists, because finance always resists platform migrations that strip away the analytical flexibility they depend on. The project drags. Adoption is partial. A year later, the new system handles some workflows while shadow Excel models have quietly proliferated to cover everything the new platform could not accommodate. The organization is now maintaining two parallel data environments instead of one, and nobody is happy with either. The root cause of this failure pattern is architectural overreach — the attempt to solve a data infrastructure problem by eliminating the tool that finance is most productive in, rather than by addressing the actual bottlenecks underneath it.

The modern upgrade path that actually works keeps Excel exactly where it is: in the hands of the analyst, as the front-end interface for financial modeling, reporting, and ad hoc analysis. What it changes is the layer underneath — replacing the Excel-native data storage and query mechanisms that break down at scale with a columnar storage and in-process query engine architecture that handles volumes and query complexity that Excel simply cannot. The three technologies that make this approach practical today are Parquet for storage, DuckDB for querying, and Python for the integration layer that connects them to Excel. Together they form a lightweight, remarkably fast, and operationally simple data backend that can be introduced incrementally, without a platform migration project, without replacing anyone's workflow, and without a six-month implementation timeline.

Parquet is a columnar file format originally developed in the Apache Hadoop ecosystem and now the de facto standard for analytical data storage across the modern data stack. The reason columnar storage matters for financial data is rooted in how analytical queries actually work. When an analyst wants to calculate the average revenue across ten thousand invoice records, a row-oriented format like a CSV or an Excel workbook has to read every field of every row — customer name, invoice date, currency, line items — even though the query only touches one column. Parquet stores each column contiguously, meaning the query engine reads only the revenue column, skipping everything else. For wide financial tables with many fields and millions of rows, the I/O reduction is dramatic. Parquet also applies column-level compression, so a file that would be 800 megabytes as a CSV is commonly 80 to 120 megabytes as Parquet — a compression ratio that compounds in significance when you are storing multiple years of transaction history or granular portfolio-level data across many entities. Files are written and read with the pandas or pyarrow Python libraries in a single line of code, requiring no server infrastructure, no database administrator, and no deployment process beyond saving a file to a directory.

DuckDB is where the architecture becomes genuinely transformative. DuckDB is an in-process analytical database — meaning it runs entirely within the Python process that calls it, with no server to configure, no connection string to manage, no license to purchase, and no network latency in the query path. It can query Parquet files directly, without loading them into memory first, and it executes SQL with full analytical function support: window functions, CTEs, multi-table joins, aggregations across hundreds of millions of rows. In benchmarks against traditional approaches, DuckDB regularly queries data that would take Excel minutes to process in under a second, and data that would crash Excel entirely in a few seconds. For a finance team dealing with granular transaction data, multi-year position histories, or portfolio-level aggregations across dozens of entities, the performance difference is not incremental — it is categorical. The query that used to require a database server, a data engineering team, and a formal data request now runs locally on a laptop in the Python script that feeds the Excel model.

The Python integration layer is what makes this architecture transparent to the Excel user. The pattern is straightforward: a Python script queries the Parquet data store using DuckDB, produces the aggregated or filtered result set the model needs, and writes it into the Excel workbook using openpyxl or xlwings — populating the exact cells, sheets, and table ranges the model expects, in exactly the format the model is designed to consume. From the analyst's perspective, they click a refresh button or run a script, and the model updates with fresh data. The source of that data — whether it came from a CSV, a Parquet file, an API call, or a database query — is abstracted away behind the Python layer. The model itself does not change. The formulas, the presentation logic, the analytical structure the analyst has invested months building — all of it is preserved. What changes is how fast the data arrives, how much of it can be processed, and how reliably it can be refreshed without manual intervention.

The migration strategy that minimizes disruption follows a deliberate sequence. The starting point is identifying the data sources that are currently causing the most pain — the large CSV exports that take ten minutes to open in Excel, the consolidation workbooks that crash when updated, the VLOOKUP chains running across hundreds of thousands of rows that make the file unusable. These are converted to Parquet first, with Python scripts replacing the manual export-and-paste workflows that currently populate them. The queries that were happening inside Excel — the pivot tables, the SUMIFS aggregations, the multi-table lookups — are rewritten as DuckDB SQL queries in the Python layer, producing pre-aggregated result tables that Excel receives rather than computes. The Excel workbook is simplified: its job is no longer to store and query raw data but to display, format, and interpret the results that Python delivers to it. The model becomes faster, smaller, and more maintainable in a single migration step, without changing the analyst's experience of working with it in any meaningful way.

As the architecture matures, the Parquet data store can evolve into a proper lakehouse structure — partitioned by date and entity, catalogued with schema metadata, versioned for point-in-time querying, and connected to upstream data pipelines that refresh it automatically from source systems. DuckDB can be supplemented or replaced with distributed engines like Apache Spark or cloud warehouses like BigQuery if data volumes grow to the scale where local processing becomes a constraint. But crucially, none of these evolutions require touching the Excel front-end. The analyst continues to work in the environment they know, while the infrastructure underneath scales to meet whatever the data demands.

This is the modernization philosophy that Cell Fusion Solutions brings to every engagement: upgrade the infrastructure, preserve the workflow, and deliver performance gains that the business feels immediately rather than after a lengthy platform transition. We design and implement Parquet, DuckDB, and Python integration architectures that connect seamlessly to existing Excel models, transforming data bottlenecks that have become accepted facts of life into solved problems. If your organization is running analytical workflows that Excel can barely contain — or has already watched them collapse under their own weight — Cell Fusion Solutions can build the modern data layer underneath them, without asking your finance team to change how they work.
