Bodo – high-performance compute engine for Python data processing

Docs · Slack · Benchmarks

Bodo: High-Performance Python Compute Engine for Data and AI

Bodo is a cutting-edge compute engine for large-scale Python data processing. Powered by an innovative auto-parallelizing just-in-time (JIT) compiler, Bodo transforms Python programs into highly optimized, parallel binaries without requiring code rewrites. The project reports that this makes Bodo 20x to 240x faster than alternatives.

Unlike traditional distributed computing frameworks, Bodo:

Seamlessly supports native Python APIs like Pandas and NumPy.
Eliminates runtime overheads common in driver-executor models by using Message Passing Interface (MPI) technology for true distributed execution.
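
To make the contrast with driver-executor models more concrete, here is a minimal conceptual sketch of the SPMD (single program, multiple data) style that MPI programs typically follow: every process runs the same function on its own slice of the data, and partial results are combined with a reduction. The ranks are simulated in a plain loop using only the standard library. The helper names `chunk_bounds` and `local_sum` are hypothetical, and this is an illustration of the general idea, not Bodo's actual implementation.

```python
def chunk_bounds(n_items, rank, n_ranks):
    """Return the [start, stop) slice of n_items owned by this rank."""
    base, extra = divmod(n_items, n_ranks)
    # The first `extra` ranks each take one additional item.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def local_sum(data, rank, n_ranks):
    """Work done independently by one rank: sum only its own slice."""
    start, stop = chunk_bounds(len(data), rank, n_ranks)
    return sum(data[start:stop])


data = list(range(10))
n_ranks = 4

# In a real MPI program each rank is a separate process;
# here the ranks are simulated one after another in a loop.
partials = [local_sum(data, rank, n_ranks) for rank in range(n_ranks)]

# Stand-in for a sum reduction across ranks.
total = sum(partials)
print(partials, total)
```

There is no central driver handing out tasks in this style: each rank derives its own share of the work from its rank number, which is the property the bullet above refers to.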

Goals

Bodo makes Python run much (much!) faster than it normally does!

Exceptional Performance: Deliver HPC-grade performance and scalability for Python data workloads as if the code were written in C++/MPI, whether running on a laptop or across large cloud clusters.
Easy to Use: Easily integrate into Python workflows with a simple decorator, and support native Pandas and NumPy APIs.
Interoperable: Stay compatible with the regular Python ecosystem, and selectively speed up only the functions that Bodo supports.
Integration with Modern Data Infrastructure: Provide robust support for industry-leading data platforms like Apache Iceberg and Snowflake, enabling smooth interoperability with existing ecosystems.
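
One way the "simple decorator" and "selectively speed up" goals could look in practice is sketched below. It assumes only the `bodo.jit` decorator shown in the example code of this README. The fallback `jit` function, `sum_of_squares`, and `format_report` are hypothetical names introduced for illustration, and the fallback exists only so the sketch still runs where Bodo is not installed.

```python
try:
    import bodo  # optional dependency
    jit = bodo.jit
except ImportError:
    def jit(func=None, **options):
        """Fallback: return the function unchanged when Bodo is unavailable."""
        if func is None:
            # Used with options, e.g. @jit(cache=True)
            return lambda f: f
        return func


@jit
def sum_of_squares(n):
    # Numeric hot loop: the part worth compiling.
    total = 0
    for i in range(n):
        total += i * i
    return total


def format_report(n):
    # Ordinary Python left undecorated; it calls the decorated function.
    return f"sum of squares below {n}: {sum_of_squares(n)}"


print(format_report(5))
```

Only the decorated function is handed to the compiler; the rest of the program remains regular Python and can use any library.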

Non-goals

Full Python Language Support: We are currently focused on a targeted subset of Python used for data-intensive and computationally heavy workloads, rather than supporting the entire Python syntax and all library APIs.
Non-Data Workloads: Bodo prioritizes applications in data engineering, data science, and AI/ML. It is not designed for general-purpose use cases that are not data-centric.
Real-time Compilation: While compilation time is improving, Bodo is not yet optimized for scenarios requiring very short compilation times (e.g., workloads with execution times of only a few seconds).
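
The compilation-time trade-off can be reasoned about with simple arithmetic: a JIT-compiled function pays off only when compile time plus compiled run time is less than the uncompiled run time. The sketch below illustrates that break-even check. All timings are hypothetical numbers chosen for illustration, not measurements of Bodo, and the function names are made up for this example.

```python
def total_time(run_seconds, calls, compile_seconds=0.0):
    """Total wall time for `calls` runs plus a one-time compile cost."""
    return compile_seconds + run_seconds * calls


def jit_pays_off(plain_run, jit_run, compile_seconds, calls=1):
    """True if compiling first is faster overall than running uncompiled."""
    return total_time(jit_run, calls, compile_seconds) < total_time(plain_run, calls)


# Hypothetical short job: 3 s uncompiled, 0.1 s compiled, 5 s to compile.
print(jit_pays_off(plain_run=3.0, jit_run=0.1, compile_seconds=5.0))

# Hypothetical long job: 600 s uncompiled, 10 s compiled, 5 s to compile.
print(jit_pays_off(plain_run=600.0, jit_run=10.0, compile_seconds=5.0))
```

With these made-up numbers the short job is slower overall once compilation is included, while the long job benefits substantially, which matches the kind of workload this non-goal describes.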

Key Features

Automatic optimization & parallelization of Python programs using Pandas and NumPy.
Linear scalability from laptops to large-scale clusters and supercomputers.
Advanced scalable I/O support for Iceberg, Snowflake, Parquet, CSV, and JSON with automatic filter pushdown and column pruning for optimized data access.
High-performance SQL engine that is natively integrated into Python.
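
For readers unfamiliar with the I/O terms in the list above, the following toy sketch illustrates what filter pushdown and column pruning mean in general. It models a file as row groups with per-column min/max statistics, loosely in the spirit of Parquet. The data, the `scan` function, and its parameters are all hypothetical; this is not how Bodo implements these optimizations.

```python
# Toy "file": row groups carrying (min, max) statistics for the "year" column.
ROW_GROUPS = [
    {"stats": {"year": (2019, 2020)}, "rows": [
        {"year": 2019, "city": "Oslo", "sales": 10},
        {"year": 2020, "city": "Rome", "sales": 20},
    ]},
    {"stats": {"year": (2021, 2022)}, "rows": [
        {"year": 2021, "city": "Oslo", "sales": 30},
        {"year": 2022, "city": "Lima", "sales": 40},
    ]},
]


def scan(row_groups, columns, column, min_value):
    """Return `columns` of rows where row[column] >= min_value."""
    groups_read = 0
    result = []
    for group in row_groups:
        _, group_max = group["stats"][column]
        if group_max < min_value:
            # Filter pushdown: statistics prove no row can match, so skip the group.
            continue
        groups_read += 1
        for row in group["rows"]:
            if row[column] >= min_value:
                # Column pruning: keep only the requested columns.
                result.append({name: row[name] for name in columns})
    return result, groups_read


rows, groups_read = scan(ROW_GROUPS, ["city", "sales"], "year", 2021)
print(rows, groups_read)
```

The first row group is never read because its statistics rule out a match, and the `year` column is dropped from the output because the caller did not ask for it.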

See Bodo documentation to learn more: https://docs.bodo.ai/

Installation

Bodo can be installed using pip or Conda:

pip install -U bodo

conda create -n Bodo python=3.12 -c conda-forge
conda activate Bodo
conda install bodo -c bodo.ai -c conda-forge

Bodo currently supports Linux x86 and macOS on both x86 and ARM. Windows support (and more) is coming soon!

Example Code

Here is an example of Pandas code that generates a sample Parquet dataset, then reads and processes it with Bodo.

import pandas as pd
import numpy as np
import bodo
import time

# Generate sample data
NUM_GROUPS = 30
NUM_ROWS = 20_000_000

df = pd.DataFrame({
    "A": np.arange(NUM_ROWS) % NUM_GROUPS,
    "B": np.arange(NUM_ROWS)
})
df.to_parquet("my_data.pq")

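# Compile the function with Bodo; cache=True asks Bodo to cache the compiled code for later runs
@bodo.jit(cache=True)
def computation():
    t1 = time.time()
    df = pd.read_parquet("my_data.pq")
    # Row-wise integer division of B by A, returning 0 where A is 0
    df2 = pd.DataFrame({"A": df.apply(lambda r: 0 if r.A == 0 else (r.B // r.A), axis=1)})
    df2.to_parquet("out.pq")
    print("Execution time:", time.time() - t1)

computation()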
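
To check what the example writes to out.pq, the per-row logic can be reproduced in plain Python on a handful of rows. The sketch below mirrors the lambda from the example and relies on the fact that row i of the generated data has A = i % NUM_GROUPS and B = i. The helper names `row_result` and `expected_value` are hypothetical and used only for this illustration.

```python
NUM_GROUPS = 30


def row_result(a, b):
    """Mirror of the lambda in the example: 0 when A is 0, else B // A."""
    return 0 if a == 0 else b // a


def expected_value(i):
    # Row i of the generated data has A = i % NUM_GROUPS and B = i.
    return row_result(i % NUM_GROUPS, i)


# Expected output values for a few row indices.
sample = {i: expected_value(i) for i in (0, 1, 2, 30, 31, 59)}
print(sample)
```

Comparing a few such values against the written Parquet file is a cheap sanity check that the compiled function produces the same results as plain Python.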

How to Contribute

Please read our latest project contribution guide.

Getting involved

Join our Slack channel to meet the community and collaborate with other contributors. We’re excited to hear your ideas and help you get started!