I Tried to Schedule My ETL Pipeline. Here’s What I Didn’t Expect.


In my last article, I mentioned that scheduling was the next wall I'd be walking toward.

So here I am, walking toward it.

But before I get into what happened, let me give some context for anyone stumbling on this for the first time.

I am a systems analyst who decided to transition into data engineering. Instead of just taking courses and collecting certificates, I decided to learn by building and writing about it publicly. Every article in this series documents something I actually built, the decisions I made, the things that broke, and what I learned from it.

The first article was my 12-month self-study roadmap, where I laid out the plan for how I was going to approach this transition. The second was me building my first ETL pipeline from scratch using the GitHub API, as a complete beginner. In the third, I took that same pipeline and made it more production-ready by adding SQLite storage, idempotency handling, and Google Drive persistence, all inside Google Colab.

This article is the fourth. And it picks up exactly where the last one ended.

I expected to spend most of my time picking a scheduling tool and configuring it. What I didn’t expect was that before I could even think about scheduling, I had to deal with something more fundamental. My pipeline couldn’t run outside of Google Colab. And until that changed, no scheduler in the world could help me.

This is the story of what actually happened.

The First Wall: My Pipeline Lived in Colab

Before I even got to scheduling, I wanted to understand what it would actually take to run my pipeline automatically. So I looked at my code properly for the first time with that question in mind.

Here’s what the load section looked like:

conn = sqlite3.connect('/content/drive/MyDrive/github_repos.db')

That path, /content/drive/MyDrive/, only exists inside Google Colab. It’s the mounted Google Drive path that Colab gives you when you connect your Drive to a notebook. Outside Colab, that path doesn’t exist. If any scheduler tried to run this script, it would crash right there.
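To make that crash concrete, here is a minimal sketch of what happens when SQLite is pointed at a folder that doesn't exist. The try_connect helper and the temporary "not_mounted" folder are my own stand-ins for the Colab path on a machine without Colab; they are not part of the pipeline.

```python
import os
import sqlite3
import tempfile


def try_connect(db_path):
    """Return None if SQLite can open db_path, otherwise the error message."""
    try:
        conn = sqlite3.connect(db_path)
    except sqlite3.OperationalError as exc:
        return str(exc)
    conn.close()
    return None


# Hypothetical stand-in for /content/drive/MyDrive/ outside Colab:
# the parent folder of the database file is simply not there.
base = tempfile.mkdtemp()
missing_path = os.path.join(base, "not_mounted", "github_repos.db")

error = try_connect(missing_path)
print(error)  # an error message rather than None
```

SQLite will create a missing database file, but it won't create missing folders, so the connection fails before any pipeline logic runs.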

The interesting thing is that my code had no google.colab imports. No Colab-specific libraries. Just one hardcoded path that I had been typing without really thinking about it. That path was the dependency, not the code.

This was the first thing I didn’t expect. I thought the challenge would be learning a scheduling tool. Instead, the first lesson was that my environment was part of my pipeline, and I hadn’t noticed.

The fix was simple. Instead of hardcoding the Colab path, I made the database path configurable through an environment variable:

import os
import sqlite3

# Read the database location from the environment; fall back to a local file.
DB_PATH = os.environ.get('DB_PATH', 'github_repos.db')
conn = sqlite3.connect(DB_PATH)

Now the script uses whatever path is set in the DB_PATH environment variable. If nothing is set, it falls back to a local github_repos.db file in the directory the script is run from. One change, and the pipeline was no longer tied to Colab.
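Here is a small sketch of how that lookup behaves with and without the variable set. The resolve_db_path function and the /data/ path are hypothetical, added only for illustration; the pipeline itself just uses the single os.environ.get line.

```python
import os


def resolve_db_path(env=None, default="github_repos.db"):
    """Return DB_PATH from the given mapping, or the default if it is unset."""
    env = os.environ if env is None else env
    return env.get("DB_PATH", default)


# No variable set: fall back to the default file name.
print(resolve_db_path({}))  # github_repos.db

# Variable set (hypothetical path): the environment wins.
print(resolve_db_path({"DB_PATH": "/data/github_repos.db"}))  # /data/github_repos.db

# A relative default resolves against the current working directory.
print(os.path.abspath(resolve_db_path({})))
```

Passing the environment in as a plain dictionary makes the behavior easy to check without touching the real environment.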

Running It Outside Colab for the First Time

Before setting up any scheduler, I wanted to confirm the script actually worked on its own. So I saved it as pipeline.py, created a requirements.txt with the two libraries it needs:

requests
pandas

And ran it from my terminal with python pipeline.py.

It printed: Pipeline complete. Duplicates handled.

And a file called github_repos.db appeared in my folder. The same pipeline I had been running in Colab was now running as a plain Python script, anywhere.
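A file appearing is a good sign, but it's worth confirming there is something inside it. Below is a minimal sketch that lists the tables in a SQLite file. The list_tables helper is my own, and the "repos" table in the demo is a hypothetical name used so the example is self-contained; the real database may use a different one.

```python
import os
import sqlite3
import tempfile


def list_tables(db_path):
    """Return the table names in an existing SQLite file, sorted by name."""
    if not os.path.exists(db_path):
        # sqlite3.connect would silently create an empty file, so stop here.
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Demo database with a hypothetical table, standing in for github_repos.db.
demo_path = os.path.join(tempfile.mkdtemp(), "github_repos.db")
conn = sqlite3.connect(demo_path)
conn.execute("CREATE TABLE repos (name TEXT)")
conn.commit()
conn.close()

print(list_tables(demo_path))  # ['repos']
```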

That felt like a bigger deal than I expected. Not because the change was complex (it wasn't), but because I realized I had been thinking of my pipeline as a notebook, when what I actually had was a script that happened to live inside one.

Choosing a Scheduling Tool

At this point I had a standalone script. Now I needed something to run it on a schedule.

I looked at a few options. APScheduler lets you define schedules inside your Python code, which works while the process is running but stops the moment you close your terminal. That's not really scheduling; it's just a loop. Airflow is the industry standard for orchestrating pipelines, but it requires running a server, a metadata database, and a web interface. That's a lot of infrastructure for where I am right now.
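To show what I mean by "just a loop", here is a rough sketch of in-process scheduling using only the standard library. This is not APScheduler's API; run_every and fake_pipeline are hypothetical names, and the short interval and run limit exist only so the example finishes quickly.

```python
import time


def run_every(interval_seconds, job, max_runs):
    """Call job() max_runs times, sleeping interval_seconds between calls."""
    results = []
    for run_number in range(max_runs):
        results.append(job())
        if run_number < max_runs - 1:
            time.sleep(interval_seconds)
    return results


calls = []


def fake_pipeline():
    """Hypothetical stand-in for the real pipeline; records each call."""
    calls.append(len(calls) + 1)
    return "Pipeline complete."


print(run_every(0.01, fake_pipeline, max_runs=3))
```

The schedule lives inside the Python process. When the process ends, whether because the terminal closed or the machine went to sleep, the schedule ends with it.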

GitHub Actions sat in the middle. It’s free, it runs on GitHub’s servers, the schedule is defined in code, and it doesn’t require me to maintain any infrastructure. The tradeoff is that it’s designed for CI/CD workflows, not pipeline orchestration, so it has limits around complex dependencies and monitoring. But for a pipeline at my stage, it’s a practical choice.

I also want to be honest: tools like Airflow exist for a reason. When a pipeline grows, when you have dependencies between tasks, when you need visibility into what ran and what failed, you need proper orchestration. GitHub Actions is not that. But it’s a good first step, and understanding why it’s limited is part of learning what those more serious tools are actually solving.

Setting Up GitHub Actions

GitHub Actions works through workflow files, which are YAML files you place in a specific folder in your repository. The folder structure looks like this:

github-etl/
├── .github/
│   └── workflows/
│       └── schedule.yml
├── pipeline.py
└── requirements.txt

Here’s the full workflow file I created:

name: Run ETL Pipeline

on:
  schedule:
    - cron: '0 9 * * *'
  workflow_dispatch:

jobs:
  run-pipeline:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run pipeline
        run: python pipeline.py

Let me walk through what each part is doing.

  • cron: '0 9 * * *' is the actual schedule. Cron is a time-based job scheduling format that’s been around in Unix systems for decades. The five values represent minute, hour, day of month, month, and day of week. So 0 9 * * * means: at minute 0 of hour 9, every day, every month, every day of the week. In other words, 9am UTC every day.
  • workflow_dispatch adds a manual trigger. This means you can also run the workflow by clicking a button in GitHub, without waiting for the scheduled time. This is useful for testing.
  • runs-on: ubuntu-latest tells GitHub to spin up a fresh Linux machine for each run. Every time the workflow triggers, GitHub creates a clean environment, installs your dependencies, runs your script, and then shuts everything down. There’s no persistent machine sitting somewhere running your code. It’s ephemeral.
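For anyone who finds cron syntax easier to read as code, here is a small sketch that labels the five fields and works out the next run for a simple daily expression. It only handles the 'minute hour * * *' shape used here, not full cron syntax, and the function names and example date are my own.

```python
from datetime import datetime, timedelta, timezone

CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expr):
    """Map each of the five cron fields to its name."""
    parts = expr.split()
    if len(parts) != 5:
        raise ValueError("expected 5 fields, got %d" % len(parts))
    return dict(zip(CRON_FIELDS, parts))


def next_daily_run(expr, now):
    """Next run time for a daily 'M H * * *' expression, in now's timezone."""
    fields = describe_cron(expr)
    if any(fields[key] != "*" for key in ("day_of_month", "month", "day_of_week")):
        raise ValueError("this sketch only supports daily schedules")
    candidate = now.replace(
        hour=int(fields["hour"]), minute=int(fields["minute"]),
        second=0, microsecond=0,
    )
    if candidate <= now:
        # Today's slot has already passed, so the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate


now = datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc)  # arbitrary example time
print(describe_cron("0 9 * * *"))
print(next_daily_run("0 9 * * *", now))  # 2024-01-02 09:00:00+00:00
```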

The steps are straightforward. Checkout pulls your code from the repository into the runner. Setup Python installs the version you specify. Install dependencies runs pip install -r requirements.txt. And then Run pipeline executes your script.

What Happened When I Ran It

After pushing the workflow file to GitHub, I went to the Actions tab in my repository and triggered it manually using the workflow_dispatch button.

It ran. Twenty-seven seconds from start to finish. The pipeline pulled data from the GitHub API, transformed it, and loaded it into SQLite, all on a GitHub server, without me doing anything after clicking the button.

I did get one warning on the first run:

Node.js 20 actions are deprecated...

This was because I had used older versions of the checkout and setup-python actions. The fix was updating actions/checkout@v3 to actions/checkout@v4 and actions/setup-python@v4 to actions/setup-python@v5, which are the versions shown in the workflow file above. After that, the workflow ran clean.

What I Actually Learned

Going into this, I thought scheduling was about picking the right tool. What I found was that scheduling forced me to think about something I hadn’t thought carefully about before: portability.

A pipeline that only runs in one specific environment isn’t really a pipeline. It’s a script tied to a platform. Making it schedulable meant making it portable first, and making it portable meant understanding what it actually depended on.

The hardcoded path was a small thing. But catching it changed how I think about writing pipeline code going forward. Every time I write a path or a credential or an environment-specific value, I now ask whether that thing will exist outside the context I’m building in.
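That question can be partly automated. Here is a rough sketch that scans Python source for string literals that look like machine-specific absolute paths. The prefix list and the find_hardcoded_paths name are my own assumptions, and a check like this will miss plenty of cases; it's a prompt to look, not a guarantee.

```python
import ast

# Assumed prefixes that suggest a path tied to one machine or platform.
SUSPECT_PREFIXES = ("/content/", "/Users/", "/home/", "C:\\")


def find_hardcoded_paths(source):
    """Return (line number, literal) pairs for suspicious string constants."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value.startswith(SUSPECT_PREFIXES):
                findings.append((node.lineno, node.value))
    return sorted(findings)


sample = (
    "import sqlite3\n"
    "conn = sqlite3.connect('/content/drive/MyDrive/github_repos.db')\n"
)
print(find_hardcoded_paths(sample))
# [(2, '/content/drive/MyDrive/github_repos.db')]
```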

The other thing I learned is that scheduling and orchestration are different problems. GitHub Actions handles scheduling well. It doesn’t handle things like retrying failed runs with backoff, alerting when something goes wrong, visualizing pipeline dependencies, or managing multiple pipelines that depend on each other. Those are orchestration problems, and they’re what tools like Airflow are built to solve.
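To give a sense of what "retrying with backoff" means, here is a minimal sketch of the idea in plain Python. It is not how Airflow implements retries; run_with_backoff and flaky_task are hypothetical, and the sleep function is injectable so the demo can record delays instead of actually waiting.

```python
import time


def run_with_backoff(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run task(), retrying on failure with delays of base_delay * 1, 2, 4, ..."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


attempts = []
delays = []


def flaky_task():
    """Hypothetical task that fails twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("temporary failure")
    return "Pipeline complete."


# Record the delays instead of sleeping, so the example runs instantly.
result = run_with_backoff(flaky_task, sleep=delays.append)
print(result)  # Pipeline complete.
print(delays)  # [1.0, 2.0]
```

Even this small version raises questions an orchestrator answers for you: which errors are worth retrying, how long to wait, and who gets told when the last attempt fails.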

I’m not there yet. But I understand now why those tools exist in a way I didn’t before.

What’s Next

The pipeline is now running every day at 9am UTC. Data is being collected. And I’m starting to notice something: when you have a pipeline running daily, you start caring about the data it produces in a different way.

Are all the records clean? Are there repos slipping through with missing fields? Is the viral flag actually meaningful, or did I define it in a way that makes almost everything “No”?

Those are data quality questions. And they’re the next wall I’m walking toward.

This is part of my ongoing series documenting my transition from systems analyst to data engineer. If you’ve been following along, thank you. If this is your first article in the series, the earlier ones are linked below.

From Data Analyst to Data Engineer: My 12-Month Self-Study Roadmap

I Built My First ETL Pipeline as a Complete Beginner. Here’s How.

I Thought Data Engineering Was Just Writing Scripts. I Was Wrong.

Connect with me on LinkedIn, YouTube, and Twitter.


