The Job Search Pipeline

To get the most out of this tool, or even just to trust what it's doing, it helps to understand how it works under the hood. This page gives you a conceptual overview of the full job search pipeline, from your initial query to a sorted, scored list of jobs.

Whether you're just curious or planning to customize, here's how the pieces fit together.


πŸ—ΊοΈ Overview of the Process​

The system is organized as a modular pipeline, which means you can run everything at once or step through each phase manually. Here's the core sequence:

  1. Input a Job Search Query

    • You define keywords, location, filters, and a date range (see the illustrative query sketch after this list).
    • (Handled via the UI's Query tab or in config/ settings.)
  2. Search & Scraping

    • Uses Google, Remotive, or other sources to fetch search engine results pages (SERPs) for your query.
    • Saves raw HTML and metadata.
    • Code: core/00_fetch_remotive_jobs.py or 01_serp_scraper.py
  3. Intermediate Outputs

    • Results saved in data/01_fetch_serps/run_*/ folders.
    • Includes done_tracker.csv, scraped/, and metadata folders.
  4. Page Classification

    • Not all pages are actual job listings. We run an LLM-based classifier to detect valid job ads (a way to test this flow on its own is sketched after this list).
    • Code: flow_pagecateg/flow.dag.yaml
  5. Job Scoring

    • For valid jobs, we analyze content (title, description, benefits) using a prompt-based scoring engine.
    • Code: flow_jobposting/llm_wrapper.py, 09_run_promptflow.py
  6. Final Ranking

    • Jobs are sorted and ranked based on prompt responses and optional heuristics.
  7. Export & Review

    • Outputs saved as .jsonl and .csv files for further use (a short loading example follows this list).
    • Code: core/03_export_results_to_jsonl.py, views/results_tab.py
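
To make step 1 concrete, here is a minimal sketch of the kind of query a run starts from. The field names below are purely illustrative, not the project's actual schema; the real settings live in config/ and the Query tab.

# Purely illustrative query definition; check config/ for the real schema.
query = {
    "keywords": ["data engineer", "remote"],     # search terms
    "location": "Berlin",                        # target location
    "filters": {"employment_type": "full-time"}, # optional filters
    "date_range_days": 14,                       # only keep recent postings
}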
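
Steps 4 and 5 are defined as Prompt flow DAGs (that is what flow.dag.yaml is). If you have the promptflow CLI installed, you can usually exercise the page classifier on a single input without running the rest of the pipeline. The input name url below is a guess; check flow_pagecateg/flow.dag.yaml for the flow's actual inputs.

pf flow test --flow flow_pagecateg --inputs url="https://example.com/some-job-posting"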
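
And to see what step 7 hands back, here is a minimal sketch that loads an exported .jsonl file with the standard library. The file path and the title/score field names are assumptions; inspect your own export for the actual schema.

import json
from pathlib import Path

# Hypothetical path; point this at a real export from your run.
export_path = Path("data/exports/results.jsonl")

jobs = []
with export_path.open(encoding="utf-8") as fh:
    for line in fh:                      # .jsonl = one JSON object per line
        if line.strip():
            jobs.append(json.loads(line))

# Sort by an assumed "score" field, highest first, and preview the top 10.
jobs.sort(key=lambda job: job.get("score", 0), reverse=True)
for job in jobs[:10]:
    print(job.get("title"), job.get("score"))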

🚀 How to Run the Pipeline

You can launch the pipeline in two main ways:

🔘 From the Streamlit App

  • Use the Control tab in the UI to run a full search session.
  • All steps will be executed in sequence.

🖥️ From the Command Line

python jobserp_explorer/core/10_run_full_pipeline.py

This executes the same flow programmatically. Use this if you're scheduling runs or want more control.
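
If you want scheduled runs, a plain cron entry wrapped around that command is enough. The schedule and paths below are only an example; substitute your own repository location and Python environment.

# Run the full pipeline every day at 06:00 and append output to a log file.
# (Repository path and log location are illustrative.)
0 6 * * * cd /path/to/jobserp-explorer && python jobserp_explorer/core/10_run_full_pipeline.py >> pipeline.log 2>&1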


🧩 Why This Matters

This modular pipeline architecture ensures the system is:

  • Customizable: you can plug in new scrapers, scoring models, or filters.
  • Reproducible: each run is timestamped and saved, so you can trace or repeat it (see the sketch below).
  • Debuggable: outputs at each step let you diagnose what's working and what isn't.
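
Because every run lands in its own timestamped run_* folder under data/01_fetch_serps/, tracing past sessions needs nothing beyond the standard library. The sketch below relies only on the folder and file names mentioned above; the assumption that done_tracker.csv has a header row is exactly that, an assumption.

import csv
from pathlib import Path

# List each run folder and count the rows in its done_tracker.csv (if present).
for run_dir in sorted(Path("data/01_fetch_serps").glob("run_*")):
    tracker = run_dir / "done_tracker.csv"
    if tracker.exists():
        with tracker.open(newline="", encoding="utf-8") as fh:
            rows = list(csv.DictReader(fh))   # assumes a header row
        print(f"{run_dir.name}: {len(rows)} tracked entries")
    else:
        print(f"{run_dir.name}: no done_tracker.csv yet")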


Got feedback or ideas for improving the pipeline? Head to Contributing to get involved.