Crawlstack provides a built-in mechanism to capture file downloads during your crawl and save them to your configured storage.

How it works

When scraping modern websites, files are typically downloaded in one of two ways:
  1. Network Downloads: The file is hosted on a server (e.g., clicking a link to https://example.com/report.pdf).
  2. Client-Side Downloads: The file is generated in the browser’s memory by JavaScript (e.g., exporting a CSV from a data table via a blob: URI).
Crawlstack’s download manager implements a Two-Pronged Trap to catch both. It intercepts network traffic for standard downloads and safely patches the browser’s URL.createObjectURL API to extract data directly from memory for client-side files, preventing the browser’s default “Save As” popup entirely.
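The client-side half of this trap can be illustrated with a minimal sketch: wrapping URL.createObjectURL so that every Blob a page converts into a blob: URI is also recorded in memory. This is only a conceptual illustration of the technique, not Crawlstack's actual patch, and the `captured` array is an assumed name:

```javascript
// Minimal sketch of the client-side "trap": wrap URL.createObjectURL so every
// Blob a page turns into a blob: URI is also captured in memory.
// Illustrative only — Crawlstack's real patch lives inside the browser runtime.
const captured = [];
const originalCreateObjectURL = URL.createObjectURL.bind(URL);

URL.createObjectURL = (obj) => {
  const url = originalCreateObjectURL(obj);
  if (obj instanceof Blob) {
    // Record the blob alongside the URI the page will "download".
    captured.push({ url, blob: obj, type: obj.type, size: obj.size });
  }
  return url;
};

// A page exporting a CSV would now be intercepted transparently:
const csv = new Blob(["id,name\n1,Ada\n"], { type: "text/csv" });
const blobUrl = URL.createObjectURL(csv);
console.log(captured[0].type, captured[0].size); // → text/csv 14
```

Because the wrapper still returns a valid blob: URI, the page behaves normally while the interceptor keeps a handle to the underlying data.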

Enabling Downloads

To capture files during a run, call runner.enableDownloads() in your script before triggering the download action.
// 1. Enable interception
await runner.enableDownloads();

// 2. Trigger the download (e.g., clicking an export button)
const exportBtn = document.querySelector("#export-csv");
exportBtn.click();

// 3. Wait for the download to finish and get the URLs
const { local_url, public_url, mimeType, size } = await runner.waitFor(async () => {
    const downloads = await runner.getDownloads();
    const file = downloads.files[0];

    if (file && file.status === 'done') {
        return file;
    } else if (file && file.status === 'failed') {
        throw new Error(`Download failed: ${file.error}`);
    }
});

console.log(`File: ${mimeType}, Size: ${size} bytes`);

// 4. Use local_url with runner.fetch() for high-performance direct access
// runner.fetch() uses a stealthy background proxy to bypass CORS and CSP.
const res = await runner.fetch(local_url);
const textData = await res.text();
console.log("File content length:", textData.length);

// 5. You can now publish the public_url for external use!
// The public_url is accessible via the Relay Server.
await runner.publishItems([{
    id: 'my-file',
    data: { downloadLink: public_url }
}]);

// 6. (Optional) Clear the queue if you are downloading multiple files sequentially
await runner.clearDownloads();
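The callback passed to runner.waitFor() above returns the file once it is ready and returns undefined otherwise. A polling helper with that contract might look like the following sketch; the name, options, and defaults are assumptions for illustration, not the documented Crawlstack implementation:

```javascript
// Hypothetical polling helper with the same contract as the example above:
// retry the callback until it returns a non-undefined value, let a thrown
// error propagate as a hard failure, and time out if nothing becomes ready.
async function waitFor(check, { timeoutMs = 30_000, intervalMs = 250 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const result = await check(); // undefined → not ready yet, keep polling
    if (result !== undefined) return result;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`waitFor timed out after ${timeoutMs} ms`);
}
```

Returning undefined to mean "not ready" keeps the callback free to return any truthy or falsy payload (including the file object itself) once the download settles.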

URL Strategy

Crawlstack provides two types of URLs for every intercepted file:
  • local_url: Points to https://opfs-local.internal/. This is the fastest way to access the file content within your extraction script using runner.fetch(), since it bypasses the network entirely.
  • public_url: Points to your configured Relay Server. Use this when you need to share the file link with an external system (e.g., via a webhook).
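The rule of thumb can be captured in a tiny helper. Everything here is illustrative: pickUrl is not a Crawlstack API, and the sample URLs are made up:

```javascript
// Hypothetical helper (names are illustrative, not part of the Crawlstack API):
// choose local_url for in-script reads via runner.fetch(), and public_url for
// anything that leaves the crawl, such as webhooks or published items.
function pickUrl(file, audience) {
  return audience === "script" ? file.local_url : file.public_url;
}

const file = {
  local_url: "https://opfs-local.internal/report.csv",
  public_url: "https://relay.example.com/files/report.csv",
};
console.log(pickUrl(file, "script"));  // → https://opfs-local.internal/report.csv
console.log(pickUrl(file, "webhook")); // → https://relay.example.com/files/report.csv
```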

Storage Location

Depending on your configuration, captured files will be saved in one of two places. The file paths will always follow this format: [crawlerId]/[runId]/[filename].
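The path convention is simple enough to sketch directly. Note that the encodeURIComponent sanitization step below is an assumption for the sketch, not documented Crawlstack behavior:

```javascript
// Sketch of the documented path convention: [crawlerId]/[runId]/[filename].
// encodeURIComponent guards against filenames that contain path separators;
// that sanitization is an assumption, not documented Crawlstack behavior.
function storagePath(crawlerId, runId, filename) {
  return [crawlerId, runId, encodeURIComponent(filename)].join("/");
}

console.log(storagePath("news-crawler", "run-42", "report.pdf"));
// → news-crawler/run-42/report.pdf
```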

OPFS (Local Storage)

By default, files are saved locally inside the browser using the Origin Private File System (OPFS). They are served to your scripts via a high-performance, stealthy streaming bridge.

S3 / R2 / MinIO (Cloud Storage)

If you have configured S3 credentials in the Settings dashboard, Crawlstack will automatically stream the captured files directly to your cloud bucket. In this case, both local_url and public_url will point to the direct S3 link.

Logging

Whenever a file is successfully intercepted and saved, you will see a log entry in your Run’s backend logs:
[Downloader] Client file saved to https://my-bucket.s3.amazonaws.com/...