From URL to Structured Data: How Pricium Transforms E-commerce Scraping

A technical deep-dive into how Pricium converts a single product URL into complete, variation-aware, geo-accurate structured JSON - and why that transformation is harder than it looks.

The Deceptively Simple Interface

The Pricium API has the simplest possible interface:

POST /scrape
{
  "url": "https://amazon.com/dp/B0EXAMPLE",
  "location": "US"
}

One URL in. Complete structured product data out.
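As a rough illustration, a client might assemble that request as follows. This is a minimal sketch, not an official SDK: `buildScrapeRequest` is a hypothetical helper, and authentication is left out because this post does not describe it.

```typescript
// Request body as shown above; only `url` and `location` are documented here.
interface ScrapeRequestBody {
  url: string;
  location: string;
}

// Hypothetical helper: returns fetch-style options for POST /scrape.
// Auth headers are deliberately omitted (not described in this post).
function buildScrapeRequest(
  url: string,
  location: string
): { method: "POST"; headers: Record<string, string>; body: string } {
  // Reject anything that is not an absolute http(s) URL before sending it.
  if (!/^https?:\/\//i.test(url)) {
    throw new Error(`Expected an absolute http(s) URL, got: ${url}`);
  }
  const payload: ScrapeRequestBody = { url, location };
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}
```

The returned object can be passed as the second argument to `fetch`, with the `/scrape` path appended to whatever base URL your account uses.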

Under that simple interface is a significant amount of engineering. This post walks through what actually happens between input and output - and why getting it right is harder than it appears.

Step 1: Request Routing and Geo-Context Initialization

When a request arrives with "location": "UK", Pricium doesn't just change a header. It:

Routes the outbound request through a residential proxy pool in the UK
Sets up a browser session with UK locale, currency preferences, and a UK shipping address in session storage
Selects an appropriate browser fingerprint matching a common UK user profile (OS, browser version, screen resolution, timezone)

This multi-layer geo-context setup is what ensures the retrieved price is genuinely the UK price - not the US price with a UK flag attached.
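One way to picture the geo-context is as a single profile object resolved from the location code and then applied to the proxy, the browser session, and the fingerprint together. The sketch below is illustrative only: the `GeoContext` fields, the profile table, and the specific locale and timezone values are assumptions, not Pricium's internal configuration.

```typescript
// Hypothetical per-request geo context; field names are illustrative.
interface GeoContext {
  proxyCountry: string; // ISO 3166 country code for the residential proxy pool
  locale: string;       // BCP 47 locale for the browser session
  timezoneId: string;   // IANA timezone used in the fingerprint
  currency: string;     // ISO 4217 currency preference
}

// Assumed profile table; a real deployment would cover many more locations.
const GEO_PROFILES: Record<string, GeoContext> = {
  US: { proxyCountry: "US", locale: "en-US", timezoneId: "America/New_York", currency: "USD" },
  UK: { proxyCountry: "GB", locale: "en-GB", timezoneId: "Europe/London", currency: "GBP" },
};

// Resolve every layer from one lookup so the layers cannot disagree with each other.
function buildGeoContext(location: string): GeoContext {
  const profile = GEO_PROFILES[location.toUpperCase()];
  if (!profile) {
    throw new Error(`Unsupported location: ${location}`);
  }
  return { ...profile }; // copy so callers cannot mutate the shared table
}
```

Resolving all layers from one source is the point of the design: a UK proxy paired with a US locale or timezone is exactly the kind of mismatch that produces the wrong price or flags the session.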

Step 2: Anti-Bot Evasion

Major e-commerce platforms invest heavily in detecting non-human traffic. Before even loading the product page, Pricium's browser environment:

Configures a realistic TLS fingerprint (a ClientHello consistent with the browser the session claims to be, not whatever the automation stack sends by default)
Masks automation-related JavaScript properties (navigator.webdriver, etc.)
Loads the kinds of browser extensions and plugins a typical user would have
Sets up realistic browser history and cookie patterns

The goal: be indistinguishable from a real user with a real browser in a real location.

Step 3: Page Load and JavaScript Execution

The product URL is loaded in the configured browser context. Unlike an HTTP-only scraper, a full browser session means:

All JavaScript is executed (including framework rendering, lazy loading, and dynamic content)
The scraper waits for meaningful page events - not just DOMContentLoaded, but actual price elements rendering
Network requests are monitored for XHR/fetch calls that load pricing data asynchronously

Step 4: Variation Data Extraction

This is the most technically challenging step. Pricium uses two complementary strategies, preferring the cheaper one whenever the platform allows it:

Strategy A: Embedded Data Parsing

Many platforms store variation data in the page's embedded JavaScript - Amazon's "Twister" JSON is a canonical example. This data structure contains a complete map of all variations and their attributes. Pricium parses these embedded objects directly:

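// Simplified sketch of Twister extraction: locate the key, then scan to the matching brace.
// (A non-greedy regex would stop at the first nested "}". Naive: assumes no stray braces in strings.)
const pageContent = await page.content();
const keyMatch = /"twisterData":\s*\{/.exec(pageContent);
if (keyMatch) {
  const open = keyMatch.index + keyMatch[0].length - 1; // index of the opening "{"
  let depth = 0, end = open;
  do {
    const ch = pageContent[end++];
    depth += ch === "{" ? 1 : ch === "}" ? -1 : 0;
  } while (depth > 0 && end < pageContent.length);
  const variationMap = JSON.parse(pageContent.slice(open, end));
  // Extract size/color/price mappings from variationMap
}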

When this succeeds, all variation data comes from a single page load, with no further interaction needed.

Strategy B: Swatch Interaction

For platforms that don't expose variation data in embedded scripts, Pricium falls back to programmatic variation enumeration: identifying all interactive variation swatches, clicking each one, waiting for the DOM to update, and capturing the resulting price.

This is slower (handled with parallelism where possible) but comprehensive.
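The enumeration half of that fallback can be sketched independently of any browser: given the swatch groups found on the page, compute every combination that needs to be clicked. The function and type names below are hypothetical, and the sketch assumes every combination is selectable, which is not always true on real product pages.

```typescript
// Swatch groups as discovered on the page, e.g. { size: [...], color: [...] }.
type OptionGroups = Record<string, string[]>;
// One concrete selection, e.g. { size: "30x30", color: "Dark Stonewash" }.
type Combination = Record<string, string>;

// Builds the Cartesian product of all option groups, one attribute at a time.
function enumerateCombinations(groups: OptionGroups): Combination[] {
  let combos: Combination[] = [{}];
  for (const attribute of Object.keys(groups)) {
    const values = groups[attribute];
    const next: Combination[] = [];
    for (const combo of combos) {
      for (const value of values) {
        // Extend each partial selection with one value of the current attribute.
        next.push({ ...combo, [attribute]: value });
      }
    }
    combos = next;
  }
  return combos;
}
```

Each returned combination maps to one sequence of swatch clicks followed by a price capture. The list grows multiplicatively with the number of groups, which is why this path is slower and benefits from parallelism.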

Step 5: Data Normalization

Raw extracted data varies in structure across retailers. Pricium normalizes everything into a consistent schema:

interface ProductData {
  product_title: string;
  source_url: string;
  currency: string;
  variations: Array<{
    size?: string;
    color?: string;
    config?: string;         // For electronics: storage, RAM, etc.
    price: number;
    original_price?: number; // Pre-discount price if on sale
    available: boolean;
    rating?: number;
    review_count?: number;
  }>;
  geo_pricing?: Record<string, {
    price: number;
    currency: string;
    tax_included: boolean;
  }>;
  scraped_at: string; // ISO 8601
}

The same schema regardless of whether the source was Amazon, Flipkart, Nike.com, or a Shopify store.
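To make the normalization step concrete, here is a minimal sketch of mapping one retailer-specific record onto the variation shape above. The `RawVariation` input shape, the attribute aliases, and the price-parsing heuristic are all assumptions for illustration. In particular, the heuristic treats a final separator followed by one or two digits as the decimal mark, so it would misread currencies that use three decimal places.

```typescript
// Hypothetical extractor output before normalization.
interface RawVariation {
  attributes: Record<string, string>; // retailer labels, e.g. { Colour: "Black" }
  priceText: string;                  // e.g. "$1,299.00" or "1.299,00 €"
  inStock: boolean;
}

// Subset of the normalized variation schema.
interface NormalizedVariation {
  size?: string;
  color?: string;
  config?: string;
  price: number;
  available: boolean;
}

// Heuristic: the last "." or "," is a decimal mark only if 1-2 digits follow it.
function parsePrice(raw: string): number {
  const cleaned = raw.replace(/[^\d.,]/g, "").replace(/^[.,]+|[.,]+$/g, "");
  const sepIndex = Math.max(cleaned.lastIndexOf("."), cleaned.lastIndexOf(","));
  const trailingDigits = sepIndex === -1 ? 0 : cleaned.length - sepIndex - 1;
  const hasDecimals = trailingDigits >= 1 && trailingDigits <= 2;
  const integerPart = (hasDecimals ? cleaned.slice(0, sepIndex) : cleaned).replace(/[.,]/g, "");
  const fractionPart = hasDecimals ? cleaned.slice(sepIndex + 1) : "";
  const value = parseFloat(fractionPart ? `${integerPart}.${fractionPart}` : integerPart);
  if (isNaN(value)) {
    throw new Error(`Unparseable price: ${raw}`);
  }
  return value;
}

// Maps retailer-specific attribute labels onto schema field names (assumed aliases).
function canonicalAttribute(label: string): "size" | "color" | "config" | undefined {
  switch (label.trim().toLowerCase()) {
    case "size": return "size";
    case "color": case "colour": return "color";
    case "storage": case "configuration": return "config";
    default: return undefined;
  }
}

function normalizeVariation(raw: RawVariation): NormalizedVariation {
  const variation: NormalizedVariation = {
    price: parsePrice(raw.priceText),
    available: raw.inStock,
  };
  for (const label of Object.keys(raw.attributes)) {
    const field = canonicalAttribute(label);
    if (field) {
      variation[field] = raw.attributes[label];
    }
  }
  return variation;
}
```

Unrecognized attribute labels are dropped silently in this sketch. A production normalizer would more likely log them so new retailer vocabularies can be added to the alias table.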

Step 6: Quality Validation

Before returning data, Pricium runs validation checks:

Are prices within reasonable bounds for the category? (Detects decoy prices served as a bot countermeasure)
Does the number of variations match known product structure? (Detect incomplete captures)
Is the price fresh? (Timestamp validation)
Is availability consistent with known stock patterns?

If validation fails, the request is retried with a different proxy and browser configuration.
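The first three checks can be sketched as a pure function that returns a list of problems, where an empty list means the result is safe to return. The rule names and thresholds here are hypothetical, and the stock-pattern check is omitted because it depends on historical data this post does not describe.

```typescript
interface ScrapedVariation {
  price: number;
  available: boolean;
}

interface ScrapeResult {
  variations: ScrapedVariation[];
  scraped_at: string; // ISO 8601
}

// Hypothetical per-category thresholds.
interface ValidationRules {
  minPrice: number;
  maxPrice: number;
  expectedVariations?: number; // known variation count, if any
  maxAgeMs: number;            // how old scraped_at may be
}

// Returns human-readable issues; an empty array means every check passed.
function validateScrape(result: ScrapeResult, rules: ValidationRules, now: Date): string[] {
  const issues: string[] = [];

  // Price bounds: out-of-range values suggest decoy or mis-parsed data.
  result.variations.forEach((v, i) => {
    if (!isFinite(v.price) || v.price < rules.minPrice || v.price > rules.maxPrice) {
      issues.push(`variation ${i}: price ${v.price} outside [${rules.minPrice}, ${rules.maxPrice}]`);
    }
  });

  // Variation count: a mismatch suggests an incomplete capture.
  if (rules.expectedVariations !== undefined &&
      result.variations.length !== rules.expectedVariations) {
    issues.push(`expected ${rules.expectedVariations} variations, got ${result.variations.length}`);
  }

  // Freshness: reject unparseable, future, or stale timestamps.
  const ageMs = now.getTime() - Date.parse(result.scraped_at);
  if (isNaN(ageMs) || ageMs < 0 || ageMs > rules.maxAgeMs) {
    issues.push(`scraped_at ${result.scraped_at} is invalid or stale`);
  }

  return issues;
}
```

A caller would treat any non-empty result as the trigger for the retry described above, switching proxy and browser configuration before the next attempt.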

What Comes Out: An Example Response

{
  "product_title": "Levi's Men's 514 Straight Fit Jeans",
  "source_url": "https://amazon.com/dp/B00EXAMPLE",
  "currency": "USD",
  "scraped_at": "2026-03-20T14:32:01Z",
  "variations": [
    { "size": "30x30", "color": "Dark Stonewash", "price": 44.99, "available": true, "rating": 4.3 },
    { "size": "32x30", "color": "Dark Stonewash", "price": 44.99, "available": true, "rating": 4.3 },
    { "size": "34x32", "color": "Dark Stonewash", "price": 49.99, "available": false, "rating": 4.2 },
    { "size": "30x30", "color": "Medium Stonewash", "price": 39.99, "available": true, "rating": 4.5 }
  ]
}

Start to finish: typically under 8 seconds, including full variation enumeration.
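Because every variation shares one shape, downstream code stays short. As a small illustration, the hypothetical helper below picks the cheapest in-stock variation from a response like the one above. It is a sketch of consumer code, not part of the Pricium API.

```typescript
// Subset of the variation fields needed for this example.
interface VariationSummary {
  size?: string;
  color?: string;
  price: number;
  available: boolean;
}

// Returns the lowest-priced available variation, or undefined if none are in stock.
function cheapestAvailable(variations: VariationSummary[]): VariationSummary | undefined {
  return variations
    .filter((v) => v.available)
    .reduce<VariationSummary | undefined>(
      (best, v) => (best === undefined || v.price < best.price ? v : best),
      undefined
    );
}
```

Applied to the example response, this selects the 30x30 Medium Stonewash variation at 39.99, since the 34x32 option is unavailable and the remaining two cost 44.99.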

Why This Matters

The transformation from "URL" to "complete structured product data" sounds simple but requires solving anti-bot, geo-routing, JavaScript rendering, variation enumeration, and data normalization problems simultaneously. Pricium does all of this so you don't have to - and exposes it through the simplest possible interface.
