
Python/LLM Developer Needed: AI Web Scraping & Prompt Engineering for 1,000+ Domains
Job Description
NDA will be required to move forward.
- ROLE
Python / AI Data Engineer — Web Crawling + LLM Structured Extraction (1,000+ domains, 383-column Excel output)
- ABOUT
We're an early-stage startup building an internal "company intelligence" dataset that powers a comparison product.
We have 1,000+ company domains and a 383-column Excel template (~70+ key attributes, each with a 5-part fact packet: Value, Source URL, Verbatim Quote, Date Pulled, Confidence).
- The Problem
We already have a detailed, rule-based extraction prompt. Manual execution via AI web interfaces isn't scalable. We need an automated pipeline that crawls sites, feeds content to an LLM, and outputs schema-perfect structured data.
- What You'll Build
Deep crawl + extract text:
domain + subdomains + key linked pages + PDFs; handle JS-rendered pages (Playwright/Scrapy/Firecrawl acceptable).
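A core decision in this step is crawl scoping: staying on the target domain and its subdomains while discarding off-site links. A minimal stdlib sketch of that scoping check (the hostnames and URLs below are hypothetical; a real crawler would fetch pages with Playwright, Scrapy, or Firecrawl and apply this filter to discovered links):

```python
from urllib.parse import urlparse

def in_crawl_scope(url: str, root_domain: str) -> bool:
    """Keep URLs on the root domain or any of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == root_domain or host.endswith("." + root_domain)

# Hypothetical discovered links for a root domain of "example.com":
links = [
    "https://example.com/about",
    "https://docs.example.com/whitepaper.pdf",   # subdomain + PDF: in scope
    "https://other-site.com/example.com",        # off-site: filtered out
]
scoped = [u for u in links if in_crawl_scope(u, "example.com")]
```

The same predicate works for PDFs and JS-rendered pages, since scoping is purely a URL-level check applied before fetching.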
LLM execution:
run our master prompt against crawled content using Gemini or OpenAI API.
Strict structured output:
force LLM output to match our 383-column schema using JSON Schema structured outputs / strict mode (enums for canonical lists).
ETL + Excel write-back + QC: map results into the Excel template (openpyxl/pandas), add logging/caching, and a small regression harness.
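The mapping step above amounts to flattening per-attribute fact packets into the template's column order. A minimal sketch, assuming hypothetical attribute and key names (the real template has ~70+ attributes and 383 columns); the resulting row would then be appended with openpyxl or written via pandas:

```python
def flatten_to_row(record: dict, attributes: list[str]) -> list:
    """Flatten {attribute: fact_packet} into one row matching the template's
    column order: 5 columns per attribute (Value, Source URL, Verbatim Quote,
    Date Pulled, Confidence), with blanks for missing data."""
    parts = ("value", "source_url", "verbatim_quote", "date_pulled", "confidence")
    row = []
    for attr in attributes:
        packet = record.get(attr, {})
        row.extend(packet.get(p, "") for p in parts)
    return row

# Hypothetical two-attribute template:
attrs = ["hq_country", "employee_count"]
record = {
    "hq_country": {
        "value": "Canada",
        "source_url": "https://example.com/about",
        "verbatim_quote": "Headquartered in Toronto, Canada",
        "date_pulled": "2024-05-01",
        "confidence": "high",
    },
    # employee_count missing -> five blank cells
}
row = flatten_to_row(record, attrs)
```

Keeping the column order in one place (the `attributes` list derived from the template header) is also what the regression harness can assert against.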
- Required Skills
Strong Python + production data pipelines
Web crawling/scraping:
Playwright and/or Scrapy (dynamic sites + pagination)
LLM API integration:
Gemini Structured Outputs and/or OpenAI Structured Outputs
Data mapping to rigid schemas; handling enums, dates, numeric formatting
- Deliverables
Well-documented Python repo + "one-command run"
Crawler + content store + caching
LLM runner with strict JSON Schema outputs (enum-enforced)
Excel writer that fills the 383-column template
QC + regression test on a small sample set
Screening Questions (answer briefly)
What would you use to crawl subdomains + PDFs + JS pages, and how do you handle basic blocking?
How would you enforce "Value must be one of these tokens" across hundreds of fields?
Share a similar project (scraping + structured extraction + schema mapping).
Budget
Open to fixed-price or hourly. Please propose milestones and a ballpark estimate.
Contract duration: 3 to 6 months, 40 hours per week.
Mandatory skills:
Python, Data Scraping, JavaScript, API, Scrapy, Automation, Data Extraction, Web Scraping, Web Crawling