Introducing FugueSQL — SQL for Pandas, Spark, and Dask DataFrames
An End-To-End SQL Interface for Data Science and Analytics
As a data scientist, you might be familiar with both Pandas and SQL. However, there might be some queries or transformations that you feel more comfortable writing in SQL than in Python.
Wouldn’t it be nice if you could query a pandas DataFrame using SQL, like below?
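For example, something along these lines (a sketch; the DataFrame df and its columns are placeholders used throughout this article):

```
%%fsql
SELECT col1, col2
FROM df
WHERE col1 > 2
PRINT
```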
Or use a Python function within a SQL query?
That is where FugueSQL comes in handy.
What is FugueSQL?
FugueSQL is a Python library that allows users to combine Python code and SQL commands. This gives users the flexibility to switch between Python and SQL within a Jupyter Notebook or a Python script.
To install FugueSQL, type:
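```
pip install fugue[sql]
```

(With newer versions of Fugue, a plain pip install fugue may already include SQL support.)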
To run on Spark or Dask execution engines, type:
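```
pip install fugue[spark]
pip install fugue[dask]
```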
In this article, we will explore some of FugueSQL’s features and compare it with other tools such as pandasql.
FugueSQL in a Notebook
FugueSQL comes with a Jupyter notebook extension that allows users to interactively query DataFrames with syntax highlighting.
To use it, import the setup function from fugue_notebook to register the %%fsql cell magic. This is only available on classic notebooks for now (not available on JupyterLab).
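A minimal setup cell looks like this:

```python
from fugue_notebook import setup

setup()
```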
To understand how the %%fsql cell magic works, let’s start by creating a pandas DataFrame:
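The exact data doesn’t matter; here is a small stand-in (the columns col1 and col2 are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4, 5],
                   "col2": ["A", "A", "B", "B", "B"]})
```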
Now, you can query the DataFrame as you normally would in SQL by adding %%fsql at the beginning of the cell.
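For example (a sketch using the placeholder DataFrame above):

```
%%fsql
SELECT col2, SUM(col1) AS total
FROM df
GROUP BY col2
PRINT
```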
In the code above, only PRINT does not follow standard SQL. It is similar to the pandas head() and Spark show() operations, which display a number of rows. Operations such as GROUP BY follow standard SQL syntax.
An Enhanced SQL Interface
For SQL users, nothing shown above is out of the ordinary except for the PRINT statement. However, Fugue also adds some enhancements to standard SQL, allowing it to handle end-to-end data workflows gracefully.
Deal With Temp Tables
SQL users often have to use temp tables or common table expressions (CTE) to hold intermediate transformations. Luckily, FugueSQL supports the creation of intermediate tables through a variable assignment.
For example, after transforming df, we can assign the result to another variable called df2 and save df2 to a file using the syntax SAVE variable OVERWRITE file_name.
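A sketch of what that looks like (the file path is a placeholder):

```
%%fsql
df2 = SELECT col1, col2
      FROM df
      WHERE col1 > 1

SAVE df2 OVERWRITE "/tmp/df2.parquet"
```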
Now, if we want to apply more transformations to df2, we can simply load it from the file we saved previously.
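Something like this:

```
%%fsql
df2 = LOAD "/tmp/df2.parquet"

SELECT col2, SUM(col1) AS total
FROM df2
GROUP BY col2
PRINT
```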
Pretty cool, isn’t it?
Added Keywords
SQL’s grammar is meant for querying, which means that it lacks keywords to manipulate data. FugueSQL adds some keywords for common DataFrame operations. For example:
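Here is one of them, TAKE, which retrieves a number of rows with an optional PRESORT (a sketch of the syntax as I understand it; see the operator docs linked below for the authoritative reference):

```
%%fsql
TAKE 3 ROWS FROM df PRESORT col1 DESC
PRINT
```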
For a full list of operators, check the FugueSQL operator docs.
Integrate With Python
FugueSQL also allows you to use Python functions within a SQL query via the TRANSFORM keyword.
For example, say we have a function called str_concat that we want to use in a SQL query.
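Suppose it looks something like this (the body is illustrative; the original article’s function may differ):

```python
import pandas as pd

def str_concat(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical implementation: combine two columns into a new string column
    return df.assign(col3=df["col2"] + "_" + df["col1"].astype(str))
```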
To make it usable in a SQL query, we simply add a schema hint to the function:
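The schema hint is a comment directly above the function that tells Fugue the output schema (here, all input columns plus a new string column col3, matching the illustrative function above):

```python
import pandas as pd

# schema: *, col3:str
def str_concat(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(col3=df["col2"] + "_" + df["col1"].astype(str))
```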
Cool! Now we are ready to add it to a SQL query:
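For instance, using the anonymous form of TRANSFORM, which operates on the result of the previous statement:

```
%%fsql
SELECT col1, col2 FROM df
TRANSFORM USING str_concat
PRINT
```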
Scaling to Big Data
One of the beautiful properties of SQL is that it is agnostic to the size of the data. The logic is expressed in a scale-agnostic manner and remains the same whether it runs on Pandas, Spark, or Dask.
With FugueSQL, we can apply the same logic on the Spark execution engine just by specifying %%fsql spark. We don’t even need to edit the str_concat function to bring it to Spark, as Fugue takes care of porting it.
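All it takes is changing the cell magic (same sketch as before):

```
%%fsql spark
SELECT col1, col2 FROM df
TRANSFORM USING str_concat
PRINT
```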
PREPARTITION BY
One of the important parts of distributed computing is partitioning. For example, to get the median value in each logical group, the data needs to be partitioned such that each logical group lives on the same worker.
To describe this, FugueSQL has the PREPARTITION BY keyword. Fugue’s prepartition-transform semantics are equivalent to the pandas groupby-apply. The only difference is that prepartition-transform scales to the distributed setting, as it dictates the location of the data.
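Here is a sketch of how this looks (the get_median function below is illustrative, computing the median of col1 for each group in col2):

```python
import pandas as pd

# schema: col2:str, median:double
def get_median(df: pd.DataFrame) -> pd.DataFrame:
    # Each partition holds exactly one value of col2,
    # so the first row identifies the group
    return pd.DataFrame({"col2": [df["col2"].iloc[0]],
                         "median": [df["col1"].median()]})
```

```
%%fsql
TRANSFORM df PREPARTITION BY col2 USING get_median
PRINT
```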
Note that the get_median function above gets called once for each distinct value in the column col2. Because the data is partitioned beforehand, we can just pull the first value of col2 to know which group we are working with.
FugueSQL in Production
To bring FugueSQL out of Jupyter notebooks and into Python scripts, all we need to do is wrap the FugueSQL query in fsql. We can then call the .run() method and choose an execution engine.
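A minimal sketch (assuming the fugue_sql entry point; fsql can pick up DataFrames like df from the surrounding scope by variable name):

```python
import pandas as pd
from fugue_sql import fsql

df = pd.DataFrame({"col1": [1, 2, 3, 4, 5],
                   "col2": ["A", "A", "B", "B", "B"]})

query = """
SELECT col2, SUM(col1) AS total
FROM df
GROUP BY col2
PRINT
"""

fsql(query).run()            # default (pandas-based) engine
# fsql(query).run("spark")   # the same query on Spark
```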
What Is the Difference Between FugueSQL and pandasql?
If you know pandasql, you might wonder: Why should you use FugueSQL if pandasql already allows you to run SQL with pandas?
pandasql has a single backend, SQLite, and it introduces a large overhead to transfer data between pandas and SQLite. FugueSQL, on the other hand, supports multiple local backends: pandas, DuckDB, and SQLite.
When using the pandas backend, Fugue directly translates SQL into pandas operations, so there is no data transfer at all. DuckDB has superb pandas support, so the overhead of data transfer is also negligible. Both pandas and DuckDB are therefore the preferred FugueSQL backends for local data processing.
Fugue also has support for Spark, Dask, and cuDF (through blazingSQL) as backends.
Congratulations! You have just learned how to use FugueSQL as a SQL interface for operating on Python DataFrames. With FugueSQL, you can now use SQL syntax to express end-to-end data workflows and scale to distributed computing seamlessly!
This article does not exhaustively cover FugueSQL features. For more information about Fugue or FugueSQL, check the resources below.
Feel free to play with and fork the source code of this article here: