Python/LLM Developer Needed: AI Web Scraping & Prompt Engineering for 1,000+ Domains

FreelanceJobs
CA · Posted February 23, 2026

Job Description

NDA will be required to move forward.

  • Role

Python / AI Data Engineer — Web Crawling + LLM Structured Extraction (1,000+ domains, 383-column Excel output)

  • About

We're an early-stage startup building an internal "company intelligence" dataset that powers a comparison product.

We have 1,000+ company domains and a 383-column Excel template (70+ key attributes, each with a 5-part fact packet: Value, Source URL, Verbatim Quote, Date Pulled, Confidence).

  • The Problem

We already have a detailed, rule-based extraction prompt, but running it by hand through AI web interfaces doesn't scale to 1,000+ domains. We need an automated pipeline that crawls sites, feeds content to an LLM, and outputs schema-perfect structured data.

  • What You'll Build

Deep crawl + extract text:

Crawl each domain, its subdomains, key linked pages, and PDFs, and handle JS-rendered pages (Playwright, Scrapy, or Firecrawl are acceptable).
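To make the crawl scope concrete, here is a minimal sketch of a link filter that keeps a crawl on the root domain and its subdomains and flags PDF links. The function names and the `example.com` URLs are illustrative assumptions, not part of our codebase, and fetching itself (including JS rendering via a browser tool) is not shown.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def in_scope(url: str, root_domain: str) -> bool:
    """True if url is http(s) and its host is root_domain or one of its subdomains."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    root = root_domain.lower()
    same_site = host == root or host.endswith("." + root)
    return parts.scheme in ("http", "https") and same_site


def is_pdf(url: str) -> bool:
    """Cheap check on the URL path; a real crawler should also check Content-Type."""
    return urlparse(url).path.lower().endswith(".pdf")


def next_urls(page_url: str, hrefs: list, root_domain: str, seen: set) -> list:
    """Resolve hrefs against page_url, drop #fragments, keep unseen in-scope URLs."""
    out = []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if in_scope(absolute, root_domain) and absolute not in seen:
            seen.add(absolute)
            out.append(absolute)
    return out
```

The `"." + root` suffix check is deliberate: a plain `endswith(root)` would wrongly treat a lookalike host such as `notexample.com` as in scope.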

LLM execution:

Run our master prompt against the crawled content using the Gemini or OpenAI API.
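As a rough sketch of this step, the runner below assembles the prompt plus crawled pages into one input, and caches responses by content hash so unchanged sites are not re-billed. `call_llm` is a hypothetical callable standing in for whichever Gemini or OpenAI client wrapper is chosen; the `### SOURCE:` tag format and the `max_chars` limit are assumptions for illustration only.

```python
import hashlib
import json
from typing import Callable, Dict, List


def build_input(master_prompt: str, pages: List[dict], max_chars: int = 200_000) -> str:
    """Put the prompt first, then each page tagged with its URL, up to max_chars."""
    parts = [master_prompt, ""]
    used = 0
    for page in pages:
        block = f"### SOURCE: {page['url']}\n{page['text']}\n"
        if used + len(block) > max_chars:
            break  # assumed policy: stop at the budget rather than truncate mid-page
        parts.append(block)
        used += len(block)
    return "\n".join(parts)


def run_extraction(
    master_prompt: str,
    pages: List[dict],
    call_llm: Callable[[str], str],  # hypothetical wrapper around the chosen API
    cache: Dict[str, str],
) -> dict:
    """Call the LLM once per distinct input; identical inputs are served from cache."""
    payload = build_input(master_prompt, pages)
    key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = call_llm(payload)
    return json.loads(cache[key])
```

Tagging each page with its URL is what lets the model fill the Source URL and Verbatim Quote parts of each fact packet. In practice the in-memory `cache` dict would likely be replaced by something persistent, such as files on disk or SQLite.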

Strict structured output:

Force LLM output to match our 383-column schema using JSON Schema structured outputs in strict mode, with enums for canonical lists.
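One way this could look: rather than hand-writing hundreds of field definitions, generate the JSON Schema from a table of attributes and their allowed tokens. The sketch below assumes snake_case key names for the five fact-packet parts and a `high`/`medium`/`low` confidence scale; both are illustrative guesses, as are the attribute names in the usage example.

```python
from typing import Dict, List, Optional

PACKET_KEYS = ["value", "source_url", "verbatim_quote", "date_pulled", "confidence"]


def fact_packet_schema(allowed_values: Optional[List[str]] = None) -> dict:
    """Schema for one 5-part fact packet; value is enum-restricted when tokens are given."""
    value = {"type": "string"}
    if allowed_values:
        value = {"type": "string", "enum": list(allowed_values)}
    return {
        "type": "object",
        "properties": {
            "value": value,
            "source_url": {"type": "string"},
            "verbatim_quote": {"type": "string"},
            "date_pulled": {"type": "string"},
            # Assumed scale; replace with the template's real confidence tokens.
            "confidence": {"type": "string", "enum": ["high", "medium", "low"]},
        },
        "required": list(PACKET_KEYS),
        "additionalProperties": False,
    }


def build_schema(attributes: Dict[str, Optional[List[str]]]) -> dict:
    """Top-level schema: one required fact packet per attribute, no extra keys."""
    return {
        "type": "object",
        "properties": {name: fact_packet_schema(enum) for name, enum in attributes.items()},
        "required": list(attributes),
        "additionalProperties": False,
    }


# Hypothetical attributes: one with a canonical token list, one free-text.
schema = build_schema({
    "pricing_model": ["subscription", "usage_based", "NOT_FOUND"],
    "hq_city": None,
})
```

Marking every property as required and setting `additionalProperties` to false matches what OpenAI's strict mode expects. Each provider supports only a subset of JSON Schema and has its own limits on schema size, so a schema this large would need to be checked against the chosen API and may have to be split across several calls.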

ETL + Excel write-back + QC:

Map results into the Excel template (openpyxl/pandas), add logging and caching, and build a small regression harness.
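For the mapping step, a minimal sketch of flattening nested fact packets into template columns might look like this. The `"<attribute> - <part>"` column naming is an assumption; the real template headers would replace it. The Excel write itself is left out so the example stays dependency-free.

```python
from typing import Dict, List

# (key in the LLM output, label in the column header); header labels are assumed.
PACKET_FIELDS = [
    ("value", "Value"),
    ("source_url", "Source URL"),
    ("verbatim_quote", "Verbatim Quote"),
    ("date_pulled", "Date Pulled"),
    ("confidence", "Confidence"),
]


def flatten_record(record: Dict[str, dict], attribute_order: List[str]) -> Dict[str, str]:
    """Turn {attribute: fact_packet} into a flat {column_name: cell_value} dict."""
    row = {}
    for attr in attribute_order:
        packet = record.get(attr, {})
        for key, label in PACKET_FIELDS:
            row[f"{attr} - {label}"] = packet.get(key, "")
    return row


def row_for_template(flat: Dict[str, str], header: List[str]) -> List[str]:
    """Order cells by the template header; fail loudly on columns the template lacks."""
    unknown = sorted(set(flat) - set(header))
    if unknown:
        raise KeyError(f"columns not in template: {unknown}")
    return [flat.get(col, "") for col in header]
```

The resulting list can be appended to a worksheet with openpyxl's `Worksheet.append`. Raising on unknown columns, instead of silently dropping them, is a design choice that surfaces schema drift early.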

  • Required Skills

Strong Python + production data pipelines

Web crawling/scraping:

Playwright and/or Scrapy (dynamic sites + pagination)

LLM API integration:

Gemini Structured Outputs and/or OpenAI Structured Outputs

Data mapping to rigid schemas; handling enums, dates, numeric formatting

  • Deliverables

Well-documented Python repo + "one-command run"

Crawler + content store + caching

LLM runner with strict JSON Schema outputs (enum-enforced)

Excel writer that fills the 383-column template

QC + regression test on a small sample set
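As an illustration of the regression deliverable, the sketch below compares a fresh run against hand-verified "golden" rows for a small set of domains and reports every cell that changed. The data shapes (domain mapped to a flat column dict) and all names are assumptions.

```python
from typing import Dict, List, Optional, Tuple

Diff = Tuple[str, Optional[str], Optional[str]]  # (column, expected, actual)


def diff_rows(expected: Dict[str, str], actual: Dict[str, str]) -> List[Diff]:
    """List every column whose value differs, including columns missing on one side."""
    diffs = []
    for col in sorted(set(expected) | set(actual)):
        exp, act = expected.get(col), actual.get(col)
        if exp != act:
            diffs.append((col, exp, act))
    return diffs


def regression_report(
    golden: Dict[str, Dict[str, str]],
    current: Dict[str, Dict[str, str]],
) -> Dict[str, List[Diff]]:
    """Map each golden domain that changed to its list of differing cells."""
    report = {}
    for domain in sorted(golden):
        diffs = diff_rows(golden[domain], current.get(domain, {}))
        if diffs:
            report[domain] = diffs
    return report
```

An empty report means the sample set is unchanged, so a prompt, model, or crawler change can be gated on it before a full 1,000+ domain run.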

  • Screening Questions (answer briefly)

What would you use to crawl subdomains + PDFs + JS pages, and how do you handle basic blocking?

How would you enforce "Value must be one of these tokens" across hundreds of fields?

Share a similar project (scraping + structured extraction + schema mapping).

  • Budget

Open to fixed-price or hourly. Please propose milestones and a ballpark estimate.

Contract duration: 3 to 6 months, at 40 hours per week.

Mandatory skills:

Python, Data Scraping, JavaScript, API, Scrapy, Automation, Data Extraction, Web Scraping, Web Crawling
