Python/LLM Developer Needed: AI Web Scraping & Prompt Engineering for 1,000+ Domains
Job Description
NDA will be required to move forward.
- ROLE
Python / AI Data Engineer — Web Crawling + LLM Structured Extraction (1,000+ domains, 383-column Excel output)
- ABOUT
We're an early-stage startup building an internal "company intelligence" dataset that powers a comparison product.
We have 1,000+ company domains and a 383-column Excel template (~70+ key attributes, each with a 5-part fact packet: Value, Source URL, Verbatim Quote, Date Pulled, Confidence).
- The Problem
We already have a detailed, rule-based extraction prompt. Manual execution via AI web interfaces isn't scalable. We need an automated pipeline that crawls sites, feeds content to an LLM, and outputs schema-perfect structured data.
- What You'll Build
Deep crawl + extract text:
domain + subdomains + key linked pages + PDFs; handle JS-rendered pages (Playwright/Scrapy/Firecrawl acceptable).
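As an illustration of the crawl scoping described above, a minimal link classifier (stdlib only, names assumed; the real crawler would be Playwright/Scrapy/Firecrawl) might look like:

```python
from urllib.parse import urljoin, urlparse

def classify_link(base_url: str, href: str, root_domain: str):
    """Classify a discovered link for the crawl frontier.

    Returns 'pdf' for PDF documents, 'page' for pages on the target
    domain or any of its subdomains, and None for off-site links.
    root_domain is the registrable domain, e.g. 'example.com'.
    """
    url = urljoin(base_url, href)          # resolve relative hrefs
    host = urlparse(url).netloc.lower().split(":")[0]
    if host != root_domain and not host.endswith("." + root_domain):
        return None                        # off-site: skip
    if urlparse(url).path.lower().endswith(".pdf"):
        return "pdf"
    return "page"
```

Subdomain matching via the `.endswith` check keeps `docs.example.com` in scope while excluding lookalikes such as `notexample.com`.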
LLM execution:
run our master prompt against crawled content using Gemini or OpenAI API.
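A sketch of the request-assembly side of that runner (function name, column convention, and character budget are assumptions; the actual master prompt and model choice come from the client):

```python
def build_extraction_request(master_prompt: str, pages: list,
                             model: str = "MODEL_NAME",   # placeholder
                             char_budget: int = 400_000) -> dict:
    """Assemble a chat-style payload: the master prompt as the system
    message, crawled page text tagged with its source URL as the user
    message. Pages past the character budget are dropped in order."""
    parts, used = [], 0
    for page in pages:
        chunk = f"SOURCE URL: {page['url']}\n{page['text']}\n---\n"
        if used + len(chunk) > char_budget:
            break                      # stay under the context budget
        parts.append(chunk)
        used += len(chunk)
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": master_prompt},
            {"role": "user", "content": "".join(parts)},
        ],
    }
```

Tagging each chunk with its source URL lets the model populate the Source URL slot of every fact packet from the content it actually read.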
Strict structured output:
force LLM output to match our 383-column schema using JSON Schema structured outputs / strict mode (enums for canonical lists).
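One way to express the 5-part fact packet as a strict JSON Schema fragment, a sketch assuming the packet keys and confidence levels shown here (the real template defines the canonical names and enum lists):

```python
def fact_packet_schema(field_name: str, allowed_values=None) -> dict:
    """JSON Schema for one fact packet (Value, Source URL, Verbatim
    Quote, Date Pulled, Confidence). When allowed_values is given,
    Value becomes an enum so the model cannot emit off-list tokens."""
    value_schema = {"type": "string"}
    if allowed_values:
        value_schema["enum"] = list(allowed_values)
    return {
        "title": field_name,
        "type": "object",
        "properties": {
            "value": value_schema,
            "source_url": {"type": "string"},
            "verbatim_quote": {"type": "string"},
            "date_pulled": {"type": "string", "format": "date"},
            "confidence": {"type": "string",
                           "enum": ["high", "medium", "low"]},
        },
        "required": ["value", "source_url", "verbatim_quote",
                     "date_pulled", "confidence"],
        "additionalProperties": False,   # required by strict modes
    }
```

Both Gemini and OpenAI structured-output modes reject off-schema keys, so composing one such fragment per attribute yields enum enforcement across all fields from a single generator.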
ETL + Excel write-back + QC: map results into the Excel template (openpyxl/pandas), add logging/caching, and a small regression harness.
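The mapping half of that write-back step could be sketched as below; the `"<Field> - <Part>"` column-naming convention is an assumption, and the real writer would push these rows into the template with openpyxl or pandas:

```python
PACKET_PARTS = ["Value", "Source URL", "Verbatim Quote",
                "Date Pulled", "Confidence"]
PACKET_KEYS = ["value", "source_url", "verbatim_quote",
               "date_pulled", "confidence"]

def packet_to_row(field: str, packet: dict) -> dict:
    """Flatten one extracted fact packet into template column names
    (assumed convention: '<Field> - <Part>'); missing parts map to ''."""
    return {f"{field} - {part}": str(packet.get(key, ""))
            for part, key in zip(PACKET_PARTS, PACKET_KEYS)}
```

Keeping the mapping pure (dict in, dict out) makes it easy to cover with the small regression harness before any workbook I/O is involved.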
- Required Skills
Strong Python + production data pipelines
Web crawling/scraping:
Playwright and/or Scrapy (dynamic sites + pagination)
LLM API integration:
Gemini Structured Outputs and/or OpenAI Structured Outputs
Data mapping to rigid schemas; handling enums, dates, numeric formatting
- Deliverables
Well-documented Python repo + "one-command run"
Crawler + content store + caching
LLM runner with strict JSON Schema outputs (enum-enforced)
Excel writer that fills the 383-column template
QC + regression test on a small sample set
Screening Questions (answer briefly)
What would you use to crawl subdomains + PDFs + JS pages, and how do you handle basic blocking?
How would you enforce "Value must be one of these tokens" across hundreds of fields?
Share a similar project (scraping + structured extraction + schema mapping).
Budget
Open to fixed-price or hourly. Please propose milestones and a ballpark estimate.
Contract duration: 3 to 6 months, 40 hours per week.
Mandatory skills:
Python, Data Scraping, JavaScript, API, Scrapy, Automation, Data Extraction, Web Scraping, Web Crawling