Python/LLM Developer Needed: AI Web Scraping & Prompt Engineering for 1,000+ Domains
Job Description
NDA will be required to move forward.
- ROLE
Python / AI Data Engineer — Web Crawling + LLM Structured Extraction (1,000+ domains, 383-column Excel output)
- ABOUT
We're an early-stage startup building an internal "company intelligence" dataset that powers a comparison product.
We have 1,000+ company domains and a 383-column Excel template (~70+ key attributes, each with a 5-part fact packet: Value, Source URL, Verbatim Quote, Date Pulled, Confidence).
- The Problem
We already have a detailed, rule-based extraction prompt. Manual execution via AI web interfaces isn't scalable. We need an automated pipeline that crawls sites, feeds content to an LLM, and outputs schema-perfect structured data.
- What You'll Build
Deep crawl + extract text:
domain + subdomains + key linked pages + PDFs; handle JS-rendered pages (Playwright/Scrapy/Firecrawl acceptable).
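As an illustration of the crawl scoping described above, a minimal link classifier (stdlib only, names assumed; the real crawler would be Playwright/Scrapy/Firecrawl) might look like:

```python
from urllib.parse import urljoin, urlparse

def classify_link(base_url: str, href: str, root_domain: str):
    """Classify a discovered link for the crawl frontier.

    Returns 'pdf' for PDF documents, 'page' for pages on the target
    domain or any of its subdomains, and None for off-site links.
    root_domain is the registrable domain, e.g. 'example.com'.
    """
    url = urljoin(base_url, href)          # resolve relative hrefs
    host = urlparse(url).netloc.lower().split(":")[0]
    if host != root_domain and not host.endswith("." + root_domain):
        return None                        # off-site: skip
    if urlparse(url).path.lower().endswith(".pdf"):
        return "pdf"
    return "page"
```

Subdomain matching via the `.endswith` check keeps `docs.example.com` in scope while excluding lookalikes such as `notexample.com`.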
LLM execution:
run our master prompt against crawled content using Gemini or OpenAI API.
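A sketch of the request-assembly side of that runner (function name, column convention, and character budget are assumptions; the actual master prompt and model choice come from the client):

```python
def build_extraction_request(master_prompt: str, pages: list,
                             model: str = "MODEL_NAME",   # placeholder
                             char_budget: int = 400_000) -> dict:
    """Assemble a chat-style payload: the master prompt as the system
    message, crawled page text tagged with its source URL as the user
    message. Pages past the character budget are dropped in order."""
    parts, used = [], 0
    for page in pages:
        chunk = f"SOURCE URL: {page['url']}\n{page['text']}\n---\n"
        if used + len(chunk) > char_budget:
            break                      # stay under the context budget
        parts.append(chunk)
        used += len(chunk)
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": master_prompt},
            {"role": "user", "content": "".join(parts)},
        ],
    }
```

Tagging each chunk with its source URL lets the model populate the Source URL slot of every fact packet from the content it actually read.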
Strict structured output:
force LLM output to match our 383-column schema using JSON Schema structured outputs / strict mode (enums for canonical lists).
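One way to express the 5-part fact packet as a strict JSON Schema fragment, a sketch assuming the packet keys and confidence levels shown here (the real template defines the canonical names and enum lists):

```python
def fact_packet_schema(field_name: str, allowed_values=None) -> dict:
    """JSON Schema for one fact packet (Value, Source URL, Verbatim
    Quote, Date Pulled, Confidence). When allowed_values is given,
    Value becomes an enum so the model cannot emit off-list tokens."""
    value_schema = {"type": "string"}
    if allowed_values:
        value_schema["enum"] = list(allowed_values)
    return {
        "title": field_name,
        "type": "object",
        "properties": {
            "value": value_schema,
            "source_url": {"type": "string"},
            "verbatim_quote": {"type": "string"},
            "date_pulled": {"type": "string", "format": "date"},
            "confidence": {"type": "string",
                           "enum": ["high", "medium", "low"]},
        },
        "required": ["value", "source_url", "verbatim_quote",
                     "date_pulled", "confidence"],
        "additionalProperties": False,   # required by strict modes
    }
```

Both Gemini and OpenAI structured-output modes reject off-schema keys, so composing one such fragment per attribute yields enum enforcement across all fields from a single generator.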
ETL + Excel write-back + QC: map results into the Excel template (openpyxl/pandas), add logging/caching, and a small regression harness.
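The mapping half of that write-back step could be sketched as below; the `"<Field> - <Part>"` column-naming convention is an assumption, and the real writer would push these rows into the template with openpyxl or pandas:

```python
PACKET_PARTS = ["Value", "Source URL", "Verbatim Quote",
                "Date Pulled", "Confidence"]
PACKET_KEYS = ["value", "source_url", "verbatim_quote",
               "date_pulled", "confidence"]

def packet_to_row(field: str, packet: dict) -> dict:
    """Flatten one extracted fact packet into template column names
    (assumed convention: '<Field> - <Part>'); missing parts map to ''."""
    return {f"{field} - {part}": str(packet.get(key, ""))
            for part, key in zip(PACKET_PARTS, PACKET_KEYS)}
```

Keeping the mapping pure (dict in, dict out) makes it easy to cover with the small regression harness before any workbook I/O is involved.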
- Required Skills
Strong Python + production data pipelines
Web crawling/scraping:
Playwright and/or Scrapy (dynamic sites + pagination)
LLM API integration:
Gemini Structured Outputs and/or OpenAI Structured Outputs
Data mapping to rigid schemas; handling enums, dates, numeric formatting
- Deliverables
Well-documented Python repo + "one-command run"
Crawler + content store + caching
LLM runner with strict JSON Schema outputs (enum-enforced)
Excel writer that fills the 383-column template
QC + regression test on a small sample set
Screening Questions (answer briefly)
What would you use to crawl subdomains + PDFs + JS pages, and how do you handle basic blocking?
How would you enforce "Value must be one of these tokens" across hundreds of fields?
Share a similar project (scraping + structured extraction + schema mapping).
Budget
Open to fixed-price or hourly. Please propose milestones and a ballpark estimate.
Contract duration: 3 to 6 months, 40 hours per week.
Mandatory skills:
Python, Data Scraping, JavaScript, API, Scrapy, Automation, Data Extraction, Web Scraping, Web Crawling