The Job Search Pipeline

To get the most out of this tool, or even just to trust what it's doing, it helps to understand how it works under the hood. This page gives you a conceptual overview of the full job search pipeline, from your initial query to a sorted, scored list of jobs.

Whether you're just curious or planning to customize, here's how the pieces fit together.


πŸ—ΊοΈ Overview of the Process​

The system is organized as a modular pipeline, which means you can run everything at once or step through each phase manually. Here's the core sequence:

  1. Input a Job Search Query

    • You define keywords, location, filters, and a date range (see the illustrative query sketch after this list).
    • (Handled via the UI's Query tab or in config/ settings.)
  2. Search & Scraping

    • Uses Google, Remotive, or other sources to fetch search engine results pages (SERPs) for your query.
    • Saves raw HTML and metadata.
    • Code: core/00_fetch_remotive_jobs.py or 01_serp_scraper.py
  3. Intermediate Outputs

    • Results saved in data/01_fetch_serps/run_*/ folders.
    • Includes done_tracker.csv, scraped/, and metadata folders.
  4. Page Classification

    • Not all pages are actual job listings. We run an LLM-based classifier to detect valid job ads (a way to test this flow on its own is sketched after this list).
    • Code: flow_pagecateg/flow.dag.yaml
  5. Job Scoring

    • For valid jobs, we analyze content (title, description, benefits) using a prompt-based scoring engine.
    • Code: flow_jobposting/llm_wrapper.py, 09_run_promptflow.py
  6. Final Ranking

    • Jobs are sorted and ranked based on prompt responses and optional heuristics.
  7. Export & Review

    • Outputs saved as .jsonl and .csv files for further use (a short loading example follows this list).
    • Code: core/03_export_results_to_jsonl.py, views/results_tab.py
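
To make step 1 concrete, here is a minimal sketch of the kind of query a run starts from. The field names below are purely illustrative, not the project's actual schema; the real settings live in config/ and the Query tab.

# Purely illustrative query definition; check config/ for the real schema.
query = {
    "keywords": ["data engineer", "remote"],     # search terms
    "location": "Berlin",                        # target location
    "filters": {"employment_type": "full-time"}, # optional filters
    "date_range_days": 14,                       # only keep recent postings
}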
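
Steps 4 and 5 are defined as Prompt flow DAGs (that is what flow.dag.yaml is). If you have the promptflow CLI installed, you can usually exercise the page classifier on a single input without running the rest of the pipeline. The input name url below is a guess; check flow_pagecateg/flow.dag.yaml for the flow's actual inputs.

pf flow test --flow flow_pagecateg --inputs url="https://example.com/some-job-posting"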
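
And to see what step 7 hands back, here is a minimal sketch that loads an exported .jsonl file with the standard library. The file path and the title/score field names are assumptions; inspect your own export for the actual schema.

import json
from pathlib import Path

# Hypothetical path; point this at a real export from your run.
export_path = Path("data/exports/results.jsonl")

jobs = []
with export_path.open(encoding="utf-8") as fh:
    for line in fh:                      # .jsonl = one JSON object per line
        if line.strip():
            jobs.append(json.loads(line))

# Sort by an assumed "score" field, highest first, and preview the top 10.
jobs.sort(key=lambda job: job.get("score", 0), reverse=True)
for job in jobs[:10]:
    print(job.get("title"), job.get("score"))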

🚀 How to Run the Pipeline

You can launch the pipeline in two main ways:

🔘 From the Streamlit App

  • Use the Control tab in the UI to run a full search session.
  • All steps will be executed in sequence.

🖥️ From the Command Line

python jobserp_explorer/core/10_run_full_pipeline.py

This executes the same flow programmatically. Use this if you're scheduling runs or want more control.
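
If you want scheduled runs, a plain cron entry wrapped around that command is enough. The schedule and paths below are only an example; substitute your own repository location and Python environment.

# Run the full pipeline every day at 06:00 and append output to a log file.
# (Repository path and log location are illustrative.)
0 6 * * * cd /path/to/jobserp-explorer && python jobserp_explorer/core/10_run_full_pipeline.py >> pipeline.log 2>&1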


🧩 Why This Matters

This modular pipeline architecture ensures the system is:

  • Customizable: you can plug in new scrapers, scoring models, or filters.
  • Reproducible: each run is timestamped and saved, so you can trace or repeat it (see the sketch below).
  • Debuggable: outputs at each step let you diagnose what's working and what isn't.
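
Because every run lands in its own timestamped run_* folder under data/01_fetch_serps/, tracing past sessions needs nothing beyond the standard library. The sketch below relies only on the folder and file names mentioned above; the assumption that done_tracker.csv has a header row is exactly that, an assumption.

import csv
from pathlib import Path

# List each run folder and count the rows in its done_tracker.csv (if present).
for run_dir in sorted(Path("data/01_fetch_serps").glob("run_*")):
    tracker = run_dir / "done_tracker.csv"
    if tracker.exists():
        with tracker.open(newline="", encoding="utf-8") as fh:
            rows = list(csv.DictReader(fh))   # assumes a header row
        print(f"{run_dir.name}: {len(rows)} tracked entries")
    else:
        print(f"{run_dir.name}: no done_tracker.csv yet")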


Got feedback or ideas for improving the pipeline? Head to Contributing to get involved.