# The Job Search Pipeline
To get the most out of this tool (or even just to trust what it's doing), it helps to understand how it works under the hood. This page gives you a conceptual overview of the full job search pipeline, from your initial query to a sorted, scored list of jobs.
Whether you're just curious or planning to customize, here's how the pieces fit together.
## 🗺️ Overview of the Process
The system is organized as a modular pipeline, which means you can run everything at once or step through each phase manually. Here's the core sequence:
1. **Input a Job Search Query**
   - You define keywords, location, filters, and date range (a hypothetical query shape is sketched after this list).
   - (Handled via the UI's Query tab or in `config/` settings)

2. **Search & Scraping**
   - Uses Google, Remotive, or others to fetch job search result pages (SERPs); a minimal fetch is sketched after this list.
   - Saves raw HTML and metadata.
   - Code: `core/00_fetch_remotive_jobs.py` or `01_serp_scraper.py`

3. **Intermediate Outputs**
   - Results saved in `data/01_fetch_serps/run_*/` folders (see the inspection sketch after this list).
   - Includes `done_tracker.csv`, `scraped/`, and metadata folders.

4. **Page Classification**
   - Not all pages are actual job listings. We run an LLM-based classifier to detect valid job ads (schematic after this list).
   - Code: `flow_pagecateg/flow.dag.yaml`

5. **Job Scoring**
   - For valid jobs, we analyze content (title, description, benefits) using a prompt-based scoring engine.
   - Code: `flow_jobposting/llm_wrapper.py`, `09_run_promptflow.py`

6. **Final Ranking**
   - Jobs are sorted and ranked based on prompt responses and optional heuristics.

7. **Export & Review**
   - Outputs saved as `.jsonl` and `.csv` files for further use.
   - Code: `core/03_export_results_to_jsonl.py`, `views/results_tab.py`
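To make step 1 concrete, here is a hypothetical sketch of what a query captures. The key names are illustrative assumptions, not the tool's actual schema; the real fields live in the UI's Query tab and the `config/` settings.

```python
# Hypothetical query shape -- every key name below is an assumption made
# for illustration; consult config/ for the fields the tool actually reads.
query = {
    "keywords": ["data engineer", "python"],
    "location": "remote",
    "filters": {"source": "remotive"},  # e.g. restrict scraping to one board
    "date_range_days": 14,              # only keep postings from the last two weeks
}
```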
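For step 2, Remotive publishes a public jobs API, so a minimal fetch can look like the sketch below. This is an assumption-laden stand-in for `core/00_fetch_remotive_jobs.py`, which may use different parameters or endpoints.

```python
import requests

# Assumed endpoint: Remotive's public remote-jobs API. The repo's own fetch
# script may query it differently (or scrape SERPs via Google instead).
resp = requests.get(
    "https://remotive.com/api/remote-jobs",
    params={"search": "python developer"},
    timeout=30,
)
resp.raise_for_status()
jobs = resp.json().get("jobs", [])
print(f"fetched {len(jobs)} postings")
```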
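Step 3's artifacts are plain files, so they are easy to inspect between phases. Here is a small sketch, assuming the folder layout named above; the columns of `done_tracker.csv` are whatever your version of the tool writes, so the code avoids assuming specific names.

```python
import csv
from pathlib import Path

# Each run writes a timestamped folder under data/01_fetch_serps/.
runs = sorted(Path("data/01_fetch_serps").glob("run_*"))
if not runs:
    raise SystemExit("no runs found -- execute a search first")

latest = runs[-1]
print(f"inspecting {latest}")

# done_tracker.csv records per-page progress; print each row as a dict
# rather than hard-coding column names.
with open(latest / "done_tracker.csv", newline="") as fh:
    for row in csv.DictReader(fh):
        print(row)
```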
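And for steps 4 and 5, the real logic lives in a Promptflow flow plus prompt templates, but the pattern is simple: send page text to a model, parse its judgment. The sketch below is purely schematic; `llm_complete` and the prompt wording are invented stand-ins, not the repo's actual code.

```python
# Schematic only: the real classifier is flow_pagecateg/flow.dag.yaml, and
# step 5's scorer follows the same prompt-in, judgment-out pattern.
def llm_complete(prompt: str) -> str:
    raise NotImplementedError("stand-in for the flow's model call")

PROMPT = "Does the following page contain a real job posting? Answer YES or NO.\n\n{page_text}"

def is_job_posting(page_text: str) -> bool:
    """True when the model judges the page to be a job ad."""
    answer = llm_complete(PROMPT.format(page_text=page_text[:4000]))
    return answer.strip().upper().startswith("YES")
```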
## 🚀 How to Run the Pipeline
You can launch the pipeline in two main ways:
### 🎈 From the Streamlit App
- Use the Control tab in the UI to run a full search session.
- All steps will be executed in sequence.
### 🖥️ From the Command Line
```bash
python jobserp_explorer/core/10_run_full_pipeline.py
```
This executes the same flow programmatically. Use this if you're scheduling runs or want more control.
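For scheduled runs, a thin wrapper is often handy. Something like the sketch below (an illustration, not part of the codebase) invokes the same entry point and logs each attempt; a cron entry or CI job can then call it on whatever cadence you need.

```python
import subprocess
import sys
from datetime import datetime

# Illustrative wrapper: shell out to the entry point shown above and log
# the outcome so unattended runs leave a trace.
stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
result = subprocess.run(
    [sys.executable, "jobserp_explorer/core/10_run_full_pipeline.py"],
    capture_output=True,
    text=True,
)
print(f"[{stamp}] pipeline exit code: {result.returncode}")
if result.returncode != 0:
    print(result.stderr)
```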
## 🧩 Why This Matters
This modular pipeline architecture ensures the system is:
- **Customizable**: you can plug in new scrapers, scoring models, or filters.
- **Reproducible**: each run is timestamped and saved, so you can trace or repeat it.
- **Debuggable**: outputs at each step let you diagnose what is and isn't working.
## 🔗 Related Pages
Got feedback or ideas for improving the pipeline? Head to Contributing to get involved.