Beyond the Borrow Checker

2023/02/22

#data #rust #ETL

Rust for ETL

Abstract

In this blog post, we explore why Rust is an excellent choice for Extract, Transform, Load (ETL) jobs. I have been writing ETL jobs for small and big companies in Java, Scala, and Python for over a decade. Python is the obvious winner in this segment, and its popularity shows up in the Stack Overflow surveys. Because of that popularity, there is a huge amount of content online on how to do almost any task in Python, and it is easy to get help on the usual platforms. However, Python has some pretty big downsides.

First of all, being a dynamic, interpreted language makes it pretty hard to trust your code. The chance that it works on your machine and fails in production is not negligible. Docker and similar tools help with this, but why should you need another tool just to have reliable deployments? I have wasted a lot of hours on Python deployment issues, and a reliable setup takes a level of effort that smaller teams usually cannot afford.

This is where Rust comes into the picture.

Rust’s features, such as cross-compilation to statically linked binaries, a single way to configure projects, simple dependency management, memory safety, and first-class serialization and deserialization through the serde crate, make it an attractive option for data processing tasks. The borrow checker ensures that memory is managed correctly, making it almost impossible to write memory errors in safe code, which results in reliable and safe ETL jobs.

We usually run ETL jobs on execution platforms like k8s with Airflow or AWS Glue with Spark. These platforms are usually pretty slow and inefficient for smaller tasks and do not yield great performance out of the box. Companies end up hiring experts who can fine-tune these systems to specific workloads.

If you’re looking for a modern and efficient alternative for your ETL jobs, Rust is worth considering.

Why on Earth would anybody use Rust for ETL?

  • Static cross-platform building

Rust’s cross-compilation support lets developers build binary executables for different operating systems and CPU architectures from a single machine. Each target gets its own binary, and that binary can be statically linked (for example with a musl target on Linux), so it runs without a language runtime or extra system packages on the host. This makes it easy to write ETL jobs that can be deployed to different systems.

  • Single way to configure your project

Rust has a great approach to configuring projects. Instead of relying on a variety of configuration files, it uses a single file, Cargo.toml, for the project’s metadata, dependencies, and build settings. You can, of course, add more configuration files if you want, but the official way is Cargo.toml. Python has at least three different ways (setup.py, requirements.txt, and pyproject.toml, to name a few). Having one file makes it easy to manage and maintain the project’s dependencies and configuration.

  • Handling dependency versions is simple

One of the most significant challenges in developing software is managing dependencies. Rust addresses this problem with the Cargo package manager. Cargo allows developers to declare dependencies and their versions in the Cargo.toml file. Cargo then resolves a compatible set of versions, records it in Cargo.lock, and downloads and builds the dependencies. While Python has many issues with different operating system dependencies and Python versions, Rust does not. I have seen a problem once where a dependency had some issues, but it was trivial to fix and the workaround was obvious. I cannot say the same about Python, where I regularly run into libraries that do not support a certain combination of Python version and operating system.
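As a rough illustration, a Cargo.toml for a job like the one discussed later in this post might look like the sketch below. The package name is hypothetical, and the dependency list is an assumption based on the types that appear in the code samples (serde for the derives, time for the timestamp, log for the info! macro); the real project's file is not shown in this post.

```toml
[package]
name = "eid-cache-job"   # hypothetical name, not from the original project
version = "0.1.0"
edition = "2021"

[dependencies]
# Assumed dependencies, inferred from the code samples in this post.
serde = { version = "1", features = ["derive"] }
serde_json = "1"
time = { version = "0.3", features = ["formatting"] }
log = "0.4"
```

The version requirements above are ranges, not pins; the exact versions that were resolved end up in Cargo.lock, which is what makes builds repeatable across machines.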

  • It is almost impossible to write broken libraries

Rust has features that push developers toward correct code. The borrow checker ensures that memory is managed correctly, making it almost impossible to write memory errors in safe code. On top of that, the strict type system catches type errors across the whole codebase at compile time, which makes it easy to write reliable and safe ETL jobs.
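To make the type-safety point a bit more concrete, here is a minimal sketch of parsing one input line into a typed record. The RowCount struct, the ParseError enum, and the "table,rows" line format are all made up for illustration and are not part of the project described in this post. The point is that a malformed line surfaces as a Result the caller has to handle, instead of a runtime surprise further down the pipeline.

```rust
use std::num::ParseIntError;

/// Hypothetical record type: a table name and its row count.
#[derive(Debug, PartialEq)]
struct RowCount {
    table: String,
    rows: u64,
}

/// Hypothetical error type covering the two ways a line can be malformed.
#[derive(Debug, PartialEq)]
enum ParseError {
    MissingField,
    BadCount(ParseIntError),
}

/// Parses a line of the assumed form "table,rows".
fn parse_row(line: &str) -> Result<RowCount, ParseError> {
    // `split_once` returns None when there is no comma at all.
    let (table, rows) = line.split_once(',').ok_or(ParseError::MissingField)?;
    // A non-numeric count becomes a typed error instead of a panic.
    let rows = rows.trim().parse::<u64>().map_err(ParseError::BadCount)?;
    Ok(RowCount {
        table: table.trim().to_string(),
        rows,
    })
}

fn main() {
    // The compiler forces the caller to deal with both outcomes.
    match parse_row("orders, 42") {
        Ok(row) => println!("{} has {} rows", row.table, row.rows),
        Err(err) => println!("skipping bad line: {:?}", err),
    }
}
```

Forgetting the Err arm in the match is a compile error, which is the kind of check that only shows up at run time in a dynamic language.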

  • Serialization and deserialization are simple

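Rust has excellent support for the serialization and deserialization of data through the serde crate, which is the de facto standard even though it is not part of the standard library. Deriving Serialize and Deserialize on a type makes it easy to read and write data in various formats, which makes Rust an excellent choice for ETL jobs.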

  • Async makes parallel and concurrent workflows easy to write

Python also has some async capabilities, but I like Rust’s a bit more. I think the people who designed Rust async did a great job from a practical point of view. I know it could be better, and I am aware of the criticism, which is largely theoretical.

Ok, so let us have a look at a side-by-side comparison of writing a medium-sized ETL job in both languages.

Types

First, let’s create a type in each language to represent the problem at hand. We would like a structure that supports two-way lookup and stores strings and their hashes.

  • Python

To have something easy to work with in Python, you need a library (pydantic here). It offers class-based, higher-level types, which I find a bit harder to use in practice than the Rust equivalent.

from typing import Dict, Optional

from pydantic import BaseModel

class EidCache(BaseModel):
    forward_dict: Dict[str, str]
    reverse_dict: Dict[str, str]
    created_at: Optional[str]
  • Rust

The Rust version is roughly the same.

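use serde::{Deserialize, Serialize};
use std::collections::BTreeMap;

#[derive(Serialize, Deserialize, Debug)]
pub struct EidCache {
    pub forward_dict: BTreeMap<String, String>,
    pub reverse_dict: BTreeMap<String, String>,
    created_at: String,
    // Set by create_eid_cache below; the integer width is an assumption.
    pub request_count: u64,
}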
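
To show how the two maps are meant to be used, here is a small sketch of the two-way lookup. It is a stand-in for the struct above that drops the serde derives and the timestamp so it compiles with the standard library alone. The insert, eid_for, and name_for helpers are hypothetical names that do not appear in the original project.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for `EidCache` (no serde derives, no timestamp).
#[derive(Debug, Default)]
pub struct EidCache {
    pub forward_dict: BTreeMap<String, String>,
    pub reverse_dict: BTreeMap<String, String>,
}

impl EidCache {
    /// Hypothetical helper: stores the pair in both directions.
    pub fn insert(&mut self, name: &str, eid: &str) {
        self.forward_dict.insert(name.to_string(), eid.to_string());
        self.reverse_dict.insert(eid.to_string(), name.to_string());
    }

    /// Hypothetical helper: looks up the eid for a name.
    pub fn eid_for(&self, name: &str) -> Option<&str> {
        self.forward_dict.get(name).map(String::as_str)
    }

    /// Hypothetical helper: looks up the name for an eid.
    pub fn name_for(&self, eid: &str) -> Option<&str> {
        self.reverse_dict.get(eid).map(String::as_str)
    }
}

fn main() {
    let mut cache = EidCache::default();
    cache.insert("orders", "eid-001");

    // Both directions resolve, and a missing key is an explicit None.
    println!("{:?}", cache.eid_for("orders"));
    println!("{:?}", cache.name_for("eid-001"));
    println!("{:?}", cache.eid_for("unknown"));
}
```

Keeping two owned maps doubles the stored strings, but it keeps the struct trivially serializable, which is presumably why both the Python and the Rust version take this shape.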

Code

This is my first Rust code, so go easy on me. :) I am not sure how to reduce the memory allocation, maybe use references instead of creating a new string every time.

  • Python

Python is a bit easier to write if you are already familiar with dictionary comprehension.

def get_reverse_dictionary(d: Dict[str, str]):
    return {v: k for k, v in d.items()}


def create_db_eid_cache():
    database_names = get_all_database_names()
    forward_dict = {name: get_db_eid(name) for name in database_names}
    reverse_dict = get_reverse_dictionary(forward_dict)

    return EidCache(
        forward_dict=forward_dict, reverse_dict=reverse_dict, created_at=iso_utc_now()
    )
  • Rust
use std::collections::BTreeMap;

use time::{format_description::well_known::Iso8601, OffsetDateTime};

// `info!` is assumed to come from a logging crate such as `log`.
fn create_eid_cache<F>(names: Option<Vec<String>>, eid_fun: F) -> Option<EidCache>
where
    F: Fn(&str) -> String,
{
    match names {
        Some(xs) => {
            // Pair every name with its eid.
            let xs_with_eid: Vec<(String, String)> =
                xs.iter().map(|x| (x.to_string(), eid_fun(x))).collect();

            let mut forward_dict: BTreeMap<String, String> = BTreeMap::new();
            let mut reverse_dict: BTreeMap<String, String> = BTreeMap::new();

            // Fill both lookup directions from the same pairs.
            for (name, eid) in &xs_with_eid {
                forward_dict.insert(name.clone(), eid.clone());
                reverse_dict.insert(eid.clone(), name.clone());
            }

            // Formatting needs the `formatting` feature of the `time` crate.
            let created_at = OffsetDateTime::now_utc().format(&Iso8601::DEFAULT).unwrap();

            let request_count = 0;

            Some(EidCache {
                forward_dict,
                reverse_dict,
                created_at,
                request_count,
            })
        }
        None => {
            info!("There are no name entries");
            None
        }
    }
}
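
On the memory allocation question raised above, one possible direction is to consume the input vector instead of iterating over references to it. The sketch below is only an illustration under that assumption: build_lookups and the Lookup alias are hypothetical names, and the function returns the two maps directly so that it runs with the standard library alone. Each pair then costs two clones, rather than the five copies (one to_string plus four clone calls) in the version above.

```rust
use std::collections::BTreeMap;

type Lookup = BTreeMap<String, String>;

/// Builds the forward (name -> eid) and reverse (eid -> name) maps in one pass.
/// Takes the names by value so each `String` can be moved into a map.
fn build_lookups<F>(names: Vec<String>, eid_fun: F) -> (Lookup, Lookup)
where
    F: Fn(&str) -> String,
{
    let mut forward = Lookup::new();
    let mut reverse = Lookup::new();

    for name in names {
        let eid = eid_fun(&name);
        // Each map owns its keys and values, so one clone of each string
        // is still needed; the originals are moved into the reverse map.
        forward.insert(name.clone(), eid.clone());
        reverse.insert(eid, name);
    }

    (forward, reverse)
}

fn main() {
    let names = vec!["orders".to_string(), "users".to_string()];
    // Stand-in for the real eid function: derives an id from the name length.
    let (forward, reverse) = build_lookups(names, |n| format!("eid-{}", n.len()));

    println!("{:?}", forward.get("orders"));
    println!("{:?}", reverse.get("eid-5"));
}
```

Storing references in the maps would avoid the remaining clones, but the struct would then need lifetime parameters and could no longer be deserialized into owned data as easily, so owned strings are probably the pragmatic choice here.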

As you can see, the two versions are comparable in length, with the Rust version being a bit longer.

The whole project is 1,150 lines of Rust and 687 lines of Python. It involves talking to different AWS APIs, calculating checksums, and writing summary data to S3. Because it is easy to use multiple files in Rust, we split the code into smaller, more manageable chunks and compile it to a single binary before deploying the function to Lambda. With Python, we need to zip up the files and jump through all sorts of hoops because AWS supports only Python 3.7 or 3.10, depending on which Glue version is available. Dependency management is much easier with Rust: we simply declare everything in Cargo.toml, while in the Python case you have to provide your dependencies, which get installed at run time, adding more time to the already slow execution.

Performance

Our Python code runs for around 2 minutes and 30 seconds (p50) because it has to start up the Spark cluster and do lots of things that we do not need. I wanted to understand what it would look like if we moved this workload out of Glue into Lambda and used Rust instead of Python. I could have used Python on Lambda as well, and I am going to write an article about that experience next. Anyway, our first attempt with Rust was extremely smooth thanks to cargo-lambda and the language tooling. The p50 value for the AWS Lambda function with Rust is 1.3 seconds. Yes, this is an apples-to-oranges comparison: the platform and the language changed at the same time.

I think that is a pretty big improvement that we cannot ignore. As far as developer productivity goes: yes, you can write Python faster than Rust, but you pay the penalty later at deployment time and when you run in production.

Summary

I am pretty happy with Rust and we are going to use it more and more. I am fine with a bit more development time if it significantly reduces the operational effort we have to put in to maintain the system.