PRQL: A Modern Data Transformation Language for Fintech and Beyond

Introduction

PRQL (Pipelined Relational Query Language) is a modern data transformation language designed to bridge the gap between SQL's power and the step-by-step readability of Pandas method chains. Born from a proposal posted on Hacker News in January 2022, PRQL aims to provide a consistent, composable, and user-friendly alternative to traditional SQL. It builds on relational algebra and expresses a query as a pipeline of transformations, which simplifies complex data workflows. PRQL compiles to SQL, so it can target any database the compiler has a dialect for; DuckDB and ClickHouse can additionally accept PRQL directly, and a browser-based Playground supports real-time testing. This article explores its design philosophy, key features, and practical applications in fintech data engineering.

SQL's Limitations and PRQL's Design Philosophy

SQL's Challenges

SQL, while powerful, suffers from several limitations:

Inconsistent syntax: The SELECT clause often becomes overloaded with aggregation, window functions, and subqueries, leading to unclear logical flow.
Poor composability: Nested subqueries are hard to read, and there is no standardized pattern for modular development.
Dialect fragmentation: Different databases (e.g., Snowflake, BigQuery) use varying SQL dialects, increasing development and maintenance costs.

PRQL's Core Design

PRQL addresses these issues through:

Relational algebra foundation: Maintains SQL's relational model while enforcing consistent semantics and syntax.
Declarative syntax: Uses pipes (|) or newlines (⏎) to chain transformations, so a query reads top to bottom in the order its steps apply and each step stays modular.
Orthogonality: Each transformation (select, filter, group) does one thing and operates independently of the others, so a query can be decomposed into reusable pieces.
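
The pipeline idea can be modelled outside PRQL. The following is a minimal Python sketch, not PRQL itself and not how the PRQL compiler works: the helper names (filter_rows, select, pipeline) and the sample rows are assumptions made for illustration. Each step is an independent function, and the pipeline applies them top to bottom.

```python
from functools import reduce

# Hypothetical in-memory "table": a list of row dictionaries.
customers = [
    {"name": "Ada", "age": 36, "region": "EU"},
    {"name": "Bo", "age": 28, "region": "EU"},
    {"name": "Cy", "age": 41, "region": "US"},
]


def filter_rows(predicate):
    """Return a step that keeps only the rows matching the predicate."""
    return lambda rows: [row for row in rows if predicate(row)]


def select(*columns):
    """Return a step that keeps only the named columns of every row."""
    return lambda rows: [{c: row[c] for c in columns} for row in rows]


def pipeline(rows, *steps):
    """Apply the steps in order, feeding each result into the next step."""
    return reduce(lambda acc, step: step(acc), steps, rows)


result = pipeline(
    customers,
    filter_rows(lambda row: row["age"] > 30),
    select("name", "region"),
)
print(result)
```

Because each step only takes rows in and hands rows out, steps can be reordered, removed, or reused in another pipeline without touching the rest.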

Syntax and Functional Features

Syntax Design

PRQL's syntax emphasizes readability and composability:

Pipes and newlines: Define data flows with each step as a separate line. For example:
```
# Count customers over 30 in each region
from customers
filter age > 30
group region (
  aggregate {customer_count = count this}
)
```
Date literals and f-strings: Dates are written as literals such as @2024-01-31, and Python-style f-strings such as f"{first_name} {last_name}" compile to the string concatenation of the target database.
Null handling: Uses ?? to replace NULL values, e.g., name ?? 'Unknown'.
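
To make the last two features concrete, here is a hedged Python sketch that runs the kind of SQL these constructs correspond to, using the standard-library sqlite3 module. The table, column names, and data are invented for illustration, and the SQL is hand-written rather than actual compiler output; the exact SQL PRQL emits depends on the target dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (first_name TEXT, last_name TEXT, nickname TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ada", "Lovelace", "Countess"), ("Bo", "Chen", None)],
)

# Hand-written SQL counterpart of a PRQL step along the lines of:
#   derive {full_name = f"{first_name} {last_name}", label = nickname ?? 'Unknown'}
# The f-string becomes string concatenation; ?? becomes COALESCE.
rows = conn.execute(
    """
    SELECT first_name || ' ' || last_name AS full_name,
           COALESCE(nickname, 'Unknown')  AS label
    FROM customers
    ORDER BY first_name
    """
).fetchall()
print(rows)
```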

Transformation Operations

PRQL provides orthogonal transformations:

select: Picks columns without altering the row count.
filter: Reduces rows by applying conditions.
group: Partitions rows so that the transformations nested inside it run once per group.
aggregate: Collapses the rows it receives (the whole table, or each group) into a single row.
window: Applies window functions without changing the row count.
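
The row-count behaviour described above can be checked with plain SQL. The sketch below uses Python's sqlite3 module with an invented payments table; it illustrates the SQL operations these transformations map onto, not PRQL syntax, and it assumes an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [("EU", 10), ("EU", 30), ("US", 20), ("US", 5)],
)


def run(sql):
    """Execute a query and return all result rows."""
    return conn.execute(sql).fetchall()


# Selecting columns: same number of rows, fewer columns.
selected = run("SELECT amount FROM payments")

# Filtering: fewer rows.
filtered = run("SELECT * FROM payments WHERE amount > 8")

# Aggregating the whole table: exactly one row.
aggregated = run("SELECT SUM(amount) FROM payments")

# Grouping, then aggregating: one row per group.
grouped = run(
    "SELECT region, SUM(amount) FROM payments GROUP BY region ORDER BY region"
)

# Windowing: every input row survives and gains a per-region total.
windowed = run(
    "SELECT region, amount, SUM(amount) OVER (PARTITION BY region) FROM payments"
)

print(len(selected), len(filtered), len(aggregated), len(grouped), len(windowed))
```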

Custom functions support functional programming paradigms, such as:

let take_smallest = n tbl -> (
  tbl
  sort size
  take n
)

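Functions can be curried: supplying only the first argument, as in take_smallest 3, yields a new function that still expects the table, which is exactly the shape a pipeline step needs.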
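
Currying is easy to demonstrate in a general-purpose language. The following Python sketch mirrors take_smallest with functools.partial; the files list and the size field are made up for the example, and this is an analogy for the concept rather than a description of PRQL's implementation.

```python
from functools import partial


def take_smallest(n, table):
    """Sort rows by their "size" field and keep the first n."""
    return sorted(table, key=lambda row: row["size"])[:n]


# Hypothetical sample data.
files = [
    {"name": "a.csv", "size": 300},
    {"name": "b.csv", "size": 20},
    {"name": "c.csv", "size": 150},
]

# Fix the first argument now; the table is supplied later.
take_two_smallest = partial(take_smallest, 2)

smallest = take_two_smallest(files)
print([row["name"] for row in smallest])
```

The partially applied function can be passed around and reused on any table with a size column, which is the flexibility the curried form provides.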

Composability Examples

PRQL simplifies complex logic that is cumbersome in SQL:

DISTINCT implementation: Combines group with take 1 to keep one row per group, which yields distinct values.
Hierarchical data processing: Uses loop to traverse tree structures level by level, generating paths like parent.path || '/' || account.
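
Both patterns have well-known SQL counterparts, sketched below with Python's sqlite3 module. The trades and accounts tables are invented for illustration, and the SQL is hand-written: one row per group is expressed with ROW_NUMBER(), and the tree traversal with a recursive common table expression. Whether a PRQL compiler emits exactly this SQL is an assumption, not something the example verifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE trades (account TEXT, traded_at TEXT, amount INTEGER);
    INSERT INTO trades VALUES
        ('acme', '2024-01-01', 100),
        ('acme', '2024-01-02', 250),
        ('zeta', '2024-01-01', 75);

    CREATE TABLE accounts (account TEXT, parent TEXT);
    INSERT INTO accounts VALUES
        ('assets', NULL),
        ('cash', 'assets'),
        ('checking', 'cash');
    """
)

# "Group, then keep one row": the earliest trade of each account.
first_trades = conn.execute(
    """
    SELECT account, traded_at, amount
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY account ORDER BY traded_at) AS rn
        FROM trades
    )
    WHERE rn = 1
    ORDER BY account
    """
).fetchall()

# Tree traversal: start at the root, then repeatedly attach children,
# extending the path with '/' and the child's name at each level.
paths = conn.execute(
    """
    WITH RECURSIVE tree(account, path) AS (
        SELECT account, account FROM accounts WHERE parent IS NULL
        UNION ALL
        SELECT a.account, tree.path || '/' || a.account
        FROM accounts AS a
        JOIN tree ON a.parent = tree.account
    )
    SELECT path FROM tree ORDER BY path
    """
).fetchall()

print(first_trades)
print(paths)
```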

Type System and Interactivity

PRQL's type system enhances developer productivity:

Type inference: The compiler infers types from the query and its inputs, so many errors surface at compile time rather than when the query runs.
Schema integration: Future plans include leveraging database schemas for more accurate type checking.

Interactive development is supported through tools like the JavaScript Playground, which compiles PRQL to SQL and displays results instantly, enabling testing without database connections.

Current Applications and Future Directions

Supported Tools

JavaScript Playground: An online editor for real-time SQL compilation and testing.
DuckDB and ClickHouse: Both can accept PRQL directly, so users do not have to compile queries to SQL themselves before running them.

Future Plans

PRQL aims to:

Enhance schema integration for improved type accuracy.
Expand database compatibility and feature sets.
Develop bindings for R, Python, and other languages to broaden adoption.

Technical Advantages

PRQL offers several advantages over traditional SQL:

Simplified complex queries: Reusable functions such as take_smallest replace logic that would otherwise require nested subqueries.
Functional programming style: Features like derive and window align with functional paradigms.
Cross-database compatibility: Aims to unify data transformation across diverse databases, reducing dialect-specific code.
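
To illustrate what dialect-specific code means in practice, the toy Python renderer below turns one logical request, "the n smallest rows", into SQL for different dialects. The function render_take and its dialect names are hypothetical and bear no relation to the PRQL compiler's internals; the only facts relied on are that SQL Server uses TOP while PostgreSQL, SQLite, and DuckDB use LIMIT.

```python
def render_take(table, order_by, n, dialect):
    """Render "the n smallest rows of table" for a given SQL dialect."""
    if dialect == "mssql":
        # SQL Server limits rows with TOP rather than LIMIT.
        return f"SELECT TOP {n} * FROM {table} ORDER BY {order_by}"
    if dialect in {"postgres", "sqlite", "duckdb"}:
        return f"SELECT * FROM {table} ORDER BY {order_by} LIMIT {n}"
    raise ValueError(f"unsupported dialect: {dialect}")


for dialect in ("postgres", "mssql"):
    print(dialect, "->", render_take("files", "size", 3, dialect))
```

A single source query with a compiler that handles these differences removes the need to maintain one hand-written variant per database.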

Conclusion

PRQL represents a significant evolution in data transformation languages, combining strengths of SQL and Pandas while addressing several of their limitations. Its pipeline syntax, orthogonal transformations, and interactive development environment make it well suited to fintech data engineering. By supporting modern databases and functional programming concepts, PRQL helps developers build robust, maintainable data pipelines. As the open-source project, released under the Apache 2.0 license, continues to evolve, its potential to improve data workflows in finance and beyond remains promising.