Data intelligence relies on a strong, functional data pipeline. However, the workflows that feed those pipelines can be arbitrarily complex.
Building, connecting, and maintaining complex workflows add unnecessary work for data engineers.
It's not an arrangement that works in the fast-paced world of enterprise software.
Fortunately, developers can lower their workload with tools like Luigi.
What Is Luigi?
Spotify created and maintains Luigi, a workflow engine whose philosophy and concepts were inspired by GNU Make.
It’s a Python module that provides a framework for building and running complex pipelines of batch jobs.
What problem does Luigi solve?
Luigi’s main function is to take care of workflow management so developers can focus on other concerns.
It helps developers build data pipeline tasks by letting them declare the dependencies between tasks and define the inputs and outputs of each one.
On top of creating data pipeline tasks, Luigi helps run them. It resolves dependencies, provides visualization tools, and handles and reports failures.
When used with a central scheduler, it can also enable distributed execution.
Benefits of Luigi
- Smoothly resume data workflow after a failure.
- Parametrize and re-run tasks on a schedule (daily, hourly, or as needed) with the help of an external trigger.
- Organize code with shared patterns.
- Command line integration.
- Small overhead for a task (about 4 lines: class, def requires, def output, def run).
- Everything is done by inheriting Python classes.
- Can be extended with other tasks such as Spark jobs, Hive queries, and more.
Strengths of Luigi
Modular code makes software more reliable and easier to maintain and update.
With Luigi, writing modular code is simple. Developers can easily create complicated dependencies between tasks.
Better yet, managing those dependencies is equally straightforward.
Luigi’s simple API lets users build a highly complex tree of dependencies without making it too difficult to understand.
Other team members or outside maintainers can easily interpret the code.
Luigi is highly flexible. It relies on Python, which allows developers the freedom to create tasks that do anything needed.
Connecting components is easy and intuitive.
There’s no external or static configuration for the pipelines, only Python scripts, so everything is dynamic.
Last, but not least, is idempotency. Completed tasks are not run twice, so a failed workflow can be restarted from the middle.
It picks up right where it left off and produces the same output every time.
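The idea behind this behavior can be sketched in a few lines of plain Python. This is not Luigi's implementation, only an illustration of the principle that a task counts as complete once its output exists; the `run_once` helper and the file names are hypothetical.

```python
import os
import tempfile


def run_once(output_path, produce):
    """Run `produce` only if `output_path` does not exist yet.

    Returns True if the work ran, False if it was skipped.
    """
    if os.path.exists(output_path):
        return False  # output already there, so the step counts as done

    # Write to a temporary file and rename it afterwards, so a crash
    # mid-write never leaves a partial file that looks like finished output.
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(produce())
    os.replace(tmp_path, output_path)
    return True


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    target = os.path.join(workdir, "report.txt")
    print(run_once(target, lambda: "done\n"))  # first call does the work
    print(run_once(target, lambda: "done\n"))  # second call is skipped
```

Under this model, restarting a failed workflow only repeats the steps whose outputs are missing, and forcing a step to run again means removing its output first.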
Weaknesses of Luigi
One of Luigi’s main weaknesses is the flip side of one of its biggest strengths.
Specifically, because it picks up where it left off, it can’t re-run partial or old pipelines.
It also has no native support for distributed execution.
Developers need to use a central scheduler to gain that functionality.
Some have found Luigi’s user interface to be hard to navigate.
This is one of the biggest reasons users move to Airflow, though with some practice the UI issue becomes less noticeable.
The biggest complaints of developers who’ve worked with Luigi revolve around issues with scaling.
There are two reasons for the tool’s scalability issues:
- The number of Luigi worker processes is limited by the number of cron worker processes currently assigned to the job.
- The web UI and scheduler run on a single-threaded process. If the scheduler is busy or someone else is using the UI, the web UI suffers from frustratingly slow performance.
Airflow (Airbnb)
Airbnb uses a lot of data-heavy features: price optimization for hosts, property recommendations for guests, and internal tracing features to guide business decisions. They created Airflow to meet their specific data needs, then decided to open source it in 2015. It’s flexible and scalable, but users have experienced some problems with time zones, managing the scheduler, and unexpected backfills.
Pinball (Pinterest)
Pinterest created Pinball when they found none of the existing workflow management solutions met their requirements for customizability. It has a lot of features and scales horizontally very well. The community is small, though, and it doesn’t have good documentation.
In practice, Luigi is used for ETL (extract, transform, load) operations that feed data intelligence operations.
Luigi handles batch jobs, not streaming or continuous processes.
It’s not data integration software, but it can be used to orchestrate custom data integration tasks.
Right now, Airflow is a more popular tool for workflow management.
Luigi still has its supporters and there are areas where it has the edge over Airflow and Pinball, but unless it can address its scalability issue it may not be able to maintain its user base going forward.