Implementing Resumable Uploads for Large Datasets: Strategies and SDK Examples

uploadfile
2026-02-04 12:00:00

Deep, practical guide to resumable uploads for AI datasets: chunk tuning, retries, checksums, and JS/Python SDKs for large-file reliability in 2026.

Hook: Stop losing hours to failed dataset uploads

If you've ever watched a terabyte-scale dataset fail 90% through an upload and wished for a simple, reliable resume mechanism, this article is for you. In 2026, AI training pipelines push larger, more frequent dataset transfers and projects can't afford data loss, duplicate costs, or brittle retry logic. Below you'll find a deep, practical dive into resumable uploads: the algorithms, chunking strategies, retry patterns, and complete SDK examples in JavaScript and Python tuned for large AI training datasets.

Quick summary and actionable takeaways

  • Design for checkpoints: persist chunk state and checksums to survive client crashes.
  • Tune chunk size: use bandwidth-delay product (BDP) to pick chunk sizes instead of arbitrary defaults.
  • Use idempotent part APIs: make chunk uploads repeatable with part indices and checksums.
  • Retry with jitter: exponential backoff + jitter and server-aware handling of 429/503 are essential.
  • Parallelize safely: upload parts in parallel but finalize on the server with a manifest and full-file checksum.
  • Security: short-lived pre-signed URLs, TLS, and optional client-side encryption for regulated datasets.

Why resumable uploads matter in 2026

The scale of training datasets has kept accelerating into 2025 and 2026. Public moves like strategic acquisitions in AI data marketplaces and increased commercialization of training data have made dataset transfer robustness a first-class concern. Teams now routinely move multi-GB to multi-TB files from edge collectors, annotators, and marketplaces to central training lakes. A single failed transfer can mean wasted compute and delayed experiments.

Resumable upload systems reduce bandwidth waste, lower storage egress costs, and improve developer and operator productivity. They are also a compliance and security surface: resumable schemes that leak credentials, skip integrity checks, or allow duplicate parts can create vulnerabilities.

Resumable upload algorithms: state, protocols, and integrity

Core state machine

At a high level, a resumable upload is a deterministic state machine with these states:

  • initialized: server created an upload ID and optional pre-signed URLs for parts.
  • uploading: client sends numbered chunks/parts and records their status.
  • finalizing: client asks server to assemble parts and validate final checksum.
  • completed: object is available for downstream processing.

The server maintains a manifest of uploaded parts (part index, byte ranges, checksum, size). The client owns the retry logic and checkpoint persistence.
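The four states above can be sketched as a small transition table. This is a minimal illustration, not a reference implementation: the names UploadState, TRANSITIONS, and advance are hypothetical, and the fallback from finalizing to uploading (for a part the server rejects at assembly time) is an assumption the article does not spell out.

```python
from enum import Enum


class UploadState(Enum):
    INITIALIZED = "initialized"
    UPLOADING = "uploading"
    FINALIZING = "finalizing"
    COMPLETED = "completed"


# Allowed transitions. Assumption: finalizing may fall back to uploading
# when the server reports a missing or mismatched part.
TRANSITIONS = {
    UploadState.INITIALIZED: {UploadState.UPLOADING},
    UploadState.UPLOADING: {UploadState.FINALIZING},
    UploadState.FINALIZING: {UploadState.UPLOADING, UploadState.COMPLETED},
    UploadState.COMPLETED: set(),
}


def advance(current, target):
    """Return target if the transition is legal, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Rejecting illegal transitions early makes a corrupted or stale checkpoint fail loudly instead of silently finalizing an incomplete upload.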

Protocols and patterns

Common options in modern stacks include:

  • S3 multipart: de facto for large-object uploads to S3-compatible stores; every part except the last must be at least 5 MiB, and the server provides an upload ID and per-part ETags.
  • tus: a standardized resumable upload protocol with server-implemented offsets and patch semantics, often used in offline-first clients.
  • HTTP Range / Byte-Range PATCH: APIs accept byte ranges and respond with current offset; useful for custom servers.
  • Pre-signed part URLs: combine pre-signed URLs with S3 or object store parts to allow client-direct uploads to storage.

Checksums and integrity

Per-part checksums are non-negotiable for reliable resumes. Use fast checksums like CRC32C for quick verification, and compute a final cryptographic hash (SHA-256) for the entire object. For huge datasets or deduplication needs, a Merkle-tree approach provides efficient partial verification and parallel integrity proofs.

Strong tip: always send the part checksum with the upload and have the server verify before marking the part complete. Never trust part completion without a checksum match.
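A rough sketch of that flow, computing per-part checksums and the full-object SHA-256 in a single pass. Python's standard library has no CRC32C, so this example uses zlib.crc32 (plain CRC32) as a stand-in; a real client would swap in a CRC32C implementation. The function names checksum_parts and verify_part are hypothetical.

```python
import hashlib
import io
import zlib


def checksum_parts(stream, chunk_size):
    """Read the stream once; return (per-part records, full SHA-256 hex)."""
    full = hashlib.sha256()
    parts = []
    index = 0
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        full.update(data)
        parts.append({
            "part": index,
            "size": len(data),
            # Stand-in: zlib.crc32 is CRC32, not CRC32C.
            "crc32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
        })
        index += 1
    return parts, full.hexdigest()


def verify_part(data, expected_crc32):
    """Server-side check to run before a part is marked complete."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x") == expected_crc32
```

Hashing while reading avoids a second pass over a multi-terabyte file just to produce the final digest.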

Chunking strategies: picking the right part size

Chunk size selection directly impacts latency, throughput, memory, and cost. Too small, and you spend CPU, API calls, and manifest metadata on per-part overhead; too large, and every failed chunk becomes expensive to retry. The optimal chunk size depends on network RTT, available bandwidth, client memory, and the storage API's constraints.

Bandwidth-delay product (BDP) approach

Calculate a chunk size based on the estimated Bandwidth-Delay Product:


// Pseudocode
// RTT measured in seconds, bandwidth in bytes/sec
bdp = rtt * bandwidth
chunk = clamp(bdp, minChunk, maxChunk)
  

Practically, measure small test uploads to estimate the bandwidth and RTT, then choose chunk sizes in the window [minChunk, maxChunk]. For example, for desktop clients on modern cloud links, a chunk of 16MiB-64MiB is often close to optimal for multi-GB files. For mobile with higher RTTs and variable throughput, 1MiB-8MiB is safer.
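The pseudocode above translates directly to Python. This is a minimal sketch: the function name bdp_chunk_size and the default bounds of 1 MiB and 64 MiB are illustrative assumptions taken from the ranges mentioned in this section, not fixed recommendations.

```python
MIB = 1024 * 1024


def bdp_chunk_size(rtt_seconds, bandwidth_bytes_per_sec,
                   min_chunk=1 * MIB, max_chunk=64 * MIB):
    """Clamp the bandwidth-delay product into [min_chunk, max_chunk]."""
    bdp = round(rtt_seconds * bandwidth_bytes_per_sec)
    return max(min_chunk, min(bdp, max_chunk))


# 1 Gbit/s link (125,000,000 bytes/s) with an 80 ms RTT:
# BDP = 0.08 * 125,000,000 = 10,000,000 bytes, roughly 9.5 MiB
print(bdp_chunk_size(0.080, 125_000_000))  # 10000000
```

On a slow or very low-latency link the raw BDP can fall below the storage API's minimum part size, which is why the clamp matters as much as the product itself.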

Guidelines and rules of thumb

  • S3-compatible stores: respect the 5 MiB minimum per part (the last part may be smaller) and the 10,000-part limit per multipart upload; consider 10-50 MiB for large datasets.
  • Very large files (100GB+): choose 32-128 MiB parts and parallelize uploads.
  • Mobile and high-latency: favor 1-8 MiB to reduce re-upload cost on failures.
  • Minimize metadata: fewer parts mean fewer API calls, which cuts cost and server overhead.
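For S3 multipart specifically, the part size also has to keep the part count within the 10,000-part limit, which starts to dominate for multi-terabyte files. A small sketch of that calculation follows; the helper name part_size_for and the 16 MiB preferred size are assumptions for illustration.

```python
import math

MIB = 1024 * 1024
S3_MIN_PART = 5 * MIB   # minimum size for every part except the last
S3_MAX_PARTS = 10_000   # maximum number of parts per multipart upload


def part_size_for(file_size, preferred=16 * MIB):
    """Smallest MiB-aligned part size >= preferred that fits the limits."""
    needed = math.ceil(file_size / S3_MAX_PARTS)
    size = max(preferred, S3_MIN_PART, needed)
    return math.ceil(size / MIB) * MIB
```

For a 2 TiB file this yields 210 MiB parts, because anything smaller would need more than 10,000 parts.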

Adaptive chunk sizing

Implement adaptive strategies: start with a conservative chunk, measure throughput, and enlarge chunks on sustained good throughput, or shrink when error rates increase. Persist the tuned size per device to speed future uploads.
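One way to sketch this is a multiplicative grow/shrink rule. The class below is hypothetical and uses a streak of successful parts as a simple proxy for "sustained good throughput"; a fuller version would look at measured bytes per second.

```python
MIB = 1024 * 1024


class AdaptiveChunkSizer:
    """Double the chunk after a streak of successes, halve it on failure."""

    def __init__(self, start=4 * MIB, min_chunk=1 * MIB,
                 max_chunk=64 * MIB, grow_after=3):
        self.size = start
        self.min_chunk = min_chunk
        self.max_chunk = max_chunk
        self.grow_after = grow_after
        self._streak = 0

    def record_success(self):
        self._streak += 1
        if self._streak >= self.grow_after:
            self.size = min(self.size * 2, self.max_chunk)
            self._streak = 0

    def record_failure(self):
        self._streak = 0
        self.size = max(self.size // 2, self.min_chunk)
```

If parts are addressed by index, changing the size mid-upload shifts every later part boundary, so it is usually simpler to apply the tuned size to the next upload or to track explicit byte ranges per part.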

Retry strategies: backoff, idempotency, and checkpointing

Effective retry behavior keeps throughput high while avoiding thundering herds and duplicate work.

Exponential backoff with jitter

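Use exponential backoff with jitter to avoid synchronized retries. The parameters below describe the "full jitter" variant, in which each delay is drawn uniformly between zero and the current exponential ceiling. Example parameters: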

  • initial delay: 200ms
  • factor: 2
  • jitter: uniform random between 0 and current delay
  • max retries: 8-12
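These parameters can be expressed as a small generator. This is a sketch of the full-jitter variant; the 30-second cap and the injectable rng argument are assumptions added for illustration and testing.

```python
import random


def backoff_delays(initial=0.2, factor=2.0, max_retries=8, cap=30.0, rng=random):
    """Yield one sleep duration in seconds per retry, using full jitter."""
    for attempt in range(max_retries):
        ceiling = min(cap, initial * factor ** attempt)
        yield rng.uniform(0, ceiling)
```

A caller would typically loop over the generator, sleep for each yielded delay, retry the part, and give up once the generator is exhausted.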

Classify errors

Different error classes require different responses:

  • 4xx client errors (400, 403): usually fatal for that request; inspect and re-authenticate or abort the upload. The exception is 429, which is a throttling signal rather than a malformed request.
  • 429 / 503: retry with backoff and consider reducing concurrency.
  • Network timeouts / connection resets: safe to retry the same part if idempotent; ensure server-side checks protect against duplicate writes.
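The classification above might be encoded as a single decision function. The action names and the exact set of retryable status codes below are illustrative assumptions; adjust them to what your storage API actually returns.

```python
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def classify(status=None, network_error=False):
    """Map an upload outcome to 'done', 'retry', 'reauth', or 'abort'."""
    if network_error:
        return "retry"    # timeout or reset; safe when parts are idempotent
    if status is None:
        raise ValueError("status is required when network_error is False")
    if 200 <= status < 300:
        return "done"
    if status in (401, 403):
        return "reauth"   # refresh credentials or the pre-signed URL
    if status in RETRYABLE_STATUS:
        return "retry"    # back off; for 429/503 also lower concurrency
    return "abort"        # remaining errors: the request itself is wrong
```

Keeping the decision in one place makes it easy to log why a part was retried or abandoned.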

Idempotency and deduplication

Upload parts with part index and checksum. If the server detects an existing identical part, return success without reapplying the data. This makes retries safe and efficient.
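On the server side, the idempotent behavior could look like the in-memory sketch below. PartStore is a hypothetical stand-in for a real part manifest, and SHA-256 is used as the part checksum for simplicity.

```python
import hashlib


class PartStore:
    """In-memory stand-in for the server's part manifest."""

    def __init__(self):
        self._parts = {}   # (upload_id, part_index) -> checksum
        self.writes = 0    # how many times data was actually stored

    def put_part(self, upload_id, index, checksum, data):
        key = (upload_id, index)
        if self._parts.get(key) == checksum:
            return "already-stored"   # identical retry: nothing is rewritten
        if hashlib.sha256(data).hexdigest() != checksum:
            raise ValueError("checksum mismatch, part rejected")
        self._parts[key] = checksum
        self.writes += 1
        return "stored"
```

A retried PUT for a part that already landed costs one lookup instead of a second write.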

Checkpointing and persistence

Persist upload state locally: upload ID, part indices completed, per-part checksums, and current chunk size. For production clients, use durable local storage (IndexedDB in the browser, SQLite for desktop apps or a Python CLI). Recovering from a crash means reading the checkpoint and uploading every part that has no confirmed entry. With parallel uploads, parts complete out of order, so resuming from the highest confirmed index would skip any gaps below it.
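For a simple CLI, a JSON file written atomically can be enough. The sketch below assumes a checkpoint shape of upload_id, chunk_size, and a parts mapping keyed by stringified part index; those field names are illustrative.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # a crash leaves either the old file or the new one


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"upload_id": None, "chunk_size": None, "parts": {}}


def pending_parts(state, total_parts):
    """Every index without a confirmed entry, regardless of upload order."""
    return [i for i in range(total_parts) if str(i) not in state["parts"]]
```

Writing to a temporary file and renaming it means a crash mid-write cannot leave a half-written checkpoint behind.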

Parallel uploads and ordering

Parallelizing part uploads maximizes throughput but introduces complexity.

  • Assign each worker a non-overlapping part index.
  • Upload parts out of order; the manifest keeps index->ETag/checksum mapping.
  • Finalize by sending the ordered manifest to the server, which assembles parts atomically.

For AI datasets, ensure the final object checksum matches an authoritative record before downstream tasks begin; training on corrupted data is costly.
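Before finalizing, the client can check that the manifest has no gaps and emit it in index order. A minimal sketch, assuming each recorded part is a dict holding whatever the server returned for it (for example an ETag and a checksum); build_manifest is a hypothetical helper.

```python
def build_manifest(parts, total_parts):
    """parts maps index -> dict of part metadata; returns an ordered list."""
    missing = [i for i in range(total_parts) if i not in parts]
    if missing:
        raise ValueError(f"cannot finalize, missing parts: {missing}")
    return [{"part": i, **parts[i]} for i in range(total_parts)]
```

Because the output is ordered by index, it does not matter in which order the workers finished their parts.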

Security and compliance

Short-lived pre-signed URLs, TLS 1.3, and strong server-side logging are baseline controls in 2026. For regulated datasets (HIPAA, GDPR), add encryption-at-rest with customer-managed keys or client-side encryption, strict access logs, and data residency controls.
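To illustrate what "short-lived pre-signed" means in practice, here is a generic HMAC-based sketch. It is not the S3 signing scheme (SigV4); the URL layout, the query parameter names, and both function names are assumptions made for this example.

```python
import hashlib
import hmac
import time


def sign_part_url(base_url, upload_id, part, secret, ttl=300, now=None):
    """Return a part URL that carries an expiry time and an HMAC signature."""
    expires = int(now if now is not None else time.time()) + ttl
    message = f"{upload_id}:{part}:{expires}".encode()
    sig = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{base_url}/uploads/{upload_id}/parts/{part}?expires={expires}&sig={sig}"


def verify_part_url(upload_id, part, expires, sig, secret, now=None):
    """Reject expired URLs and URLs whose signature does not match."""
    current = now if now is not None else time.time()
    if current > int(expires):
        return False
    message = f"{upload_id}:{part}:{expires}".encode()
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

Binding the signature to the upload ID and part index means a leaked URL cannot be reused for a different part, and the expiry limits how long it is useful at all.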

JavaScript SDK example

The following is a compact resumable upload client that demonstrates chunking, checkpointing (localStorage for demo), parallel uploads, checksums, and retries. As written it targets the browser, because it relies on localStorage and File.slice; running it under Node.js requires substitutes for both. For production, replace localStorage with IndexedDB and add robust error handling.


// SimpleResumableUploader.js
// Uses single-quote strings to simplify embedding
const DEFAULT_CHUNK = 8 * 1024 * 1024 // 8 MiB
const CONCURRENCY = 4

function sleep(ms){ return new Promise(r => setTimeout(r, ms)) }

async function crc32c(buffer){
  // placeholder: in prod use a WASM or native CRC32C implementation
  // here we return a hex stub for demo
  return 'crc32c-' + buffer.byteLength
}

class SimpleResumableUploader{
  constructor({createUploadUrl}){
    this.createUploadUrl = createUploadUrl // function(fileMeta) -> {uploadId, partUrlTemplate}
  }

  async upload(file, meta){
    const {uploadId, partUrlTemplate} = await this.createUploadUrl(meta)
    const stateKey = 'upload:' + uploadId
    let state = JSON.parse(localStorage.getItem(stateKey) || null) || {parts: {}, chunkSize: DEFAULT_CHUNK}

    const totalParts = Math.ceil(file.size / state.chunkSize)

    const workers = new Array(CONCURRENCY).fill(null).map(() => this._worker(file, state, uploadId, partUrlTemplate, stateKey))
    await Promise.all(workers)

    // after all parts uploaded, call finalize
    const manifest = Object.keys(state.parts).sort((a,b)=>a-b).map(i => ({part: Number(i), checksum: state.parts[i].checksum}))
    await fetch(`/uploads/${uploadId}/complete`, {
      method: 'POST',
      headers: {'Content-Type': 'application/json'},
      body: JSON.stringify({manifest})
    })
    localStorage.removeItem(stateKey)
  }

  async _worker(file, state, uploadId, partUrlTemplate, stateKey){
    while(true){
      const nextIndex = this._nextPendingPart(state, file.size, state.chunkSize)
      if(nextIndex === null) return
      const start = nextIndex * state.chunkSize
      const end = Math.min(file.size, start + state.chunkSize)
      const blob = file.slice(start, end)
      const arr = await blob.arrayBuffer()
      const checksum = await crc32c(arr)

      // skip if already uploaded with same checksum
      if(state.parts[nextIndex] && state.parts[nextIndex].checksum === checksum){
        continue
      }

      const url = partUrlTemplate.replace('{part}', String(nextIndex))
      let tries = 0
      while(true){
        let retryable = true // network errors thrown by fetch stay retryable
        try{
          const res = await fetch(url, {method: 'PUT', body: arr, headers: {'X-Chunk-Checksum': checksum}})
          if(res.ok){
            state.parts[nextIndex] = {checksum, size: arr.byteLength}
            localStorage.setItem(stateKey, JSON.stringify(state))
            break
          }
          // 5xx, 408 and 429 are worth retrying; other 4xx will not succeed on retry
          retryable = res.status >= 500 || res.status === 408 || res.status === 429
          throw new Error('Part ' + nextIndex + ' failed with HTTP ' + res.status)
        }catch(err){
          tries++
          if(!retryable || tries > 10) throw err
          // exponential backoff with full jitter: 200 ms base, 30 s cap
          const ceiling = Math.min(200 * Math.pow(2, tries), 30000)
          await sleep(Math.random() * ceiling)
        }
      }
    }
  }

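  _nextPendingPart(state, fileSize, chunkSize){
    const total = Math.ceil(fileSize / chunkSize)
    // Track parts already handed to a worker so parallel workers never pick the same index
    this._claimed = this._claimed || new WeakMap()
    if(!this._claimed.has(state)) this._claimed.set(state, new Set())
    const claimed = this._claimed.get(state)
    for(let i = 0; i < total; i++){
      if(!state.parts[i] && !claimed.has(i)){
        claimed.add(i)
        return i
      }
    }
    return null // every part is either uploaded or in flight
  }
}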

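Notes: this demo assumes a server endpoint that returns pre-signed part URLs using a {part} placeholder and a completion endpoint that accepts an ordered manifest. Because the checkpoint is keyed by upload ID, createUploadUrl must return the same upload ID when the same file is retried; otherwise every attempt starts from an empty state. In production, also include part ETags returned by storage APIs.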

Python SDK example

The Python example below uploads large files with parallel part uploads using threads, persists checkpoints in a simple SQLite database, and computes a SHA-256 checksum per part. It does not compute the full-file checksum; add that before finalizing if your server expects one.


# resumable_uploader.py
import os
import math
import time
import random
import hashlib
import sqlite3
import threading
import requests
from queue import Queue, Empty

CHUNK_SIZE = 16 * 1024 * 1024  # 16 MiB
CONCURRENCY = 6
MAX_RETRIES = 8
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}

class SqliteCheckpoint:
    def __init__(self, path='uploads.db'):
        # One shared connection, guarded by a lock so worker threads can write safely
        self.conn = sqlite3.connect(path, check_same_thread=False)
        self.conn.execute('''CREATE TABLE IF NOT EXISTS parts (upload_id TEXT, part_idx INTEGER, checksum TEXT, size INTEGER, PRIMARY KEY(upload_id, part_idx))''')
        self.lock = threading.Lock()

    def mark_part(self, upload_id, idx, checksum, size):
        with self.lock:
            self.conn.execute('INSERT OR REPLACE INTO parts (upload_id, part_idx, checksum, size) VALUES (?,?,?,?)', (upload_id, idx, checksum, size))
            self.conn.commit()

    def get_parts(self, upload_id):
        with self.lock:
            cur = self.conn.execute('SELECT part_idx, checksum FROM parts WHERE upload_id=?', (upload_id,))
            return {r[0]: r[1] for r in cur.fetchall()}

def sha256_bytes(b):
    return hashlib.sha256(b).hexdigest()

class ResumableUploader:
    def __init__(self, create_upload_url_fn, checkpoint):
        self.create_upload_url_fn = create_upload_url_fn
        self.checkpoint = checkpoint

    def upload(self, path, meta):
        size = os.path.getsize(path)
        # To resume, create_upload_url_fn must return the same upload_id for the same file
        upload = self.create_upload_url_fn(meta)
        upload_id = upload['upload_id']
        part_template = upload['part_url_template']

        done_parts = self.checkpoint.get_parts(upload_id)
        total_parts = math.ceil(size / CHUNK_SIZE)

        q = Queue()
        for i in range(total_parts):
            if i not in done_parts:
                q.put(i)

        errors = []  # failures collected from worker threads

        def upload_part(idx):
            with open(path, 'rb') as f:
                f.seek(idx * CHUNK_SIZE)
                data = f.read(CHUNK_SIZE)
            checksum = sha256_bytes(data)
            url = part_template.replace('{part}', str(idx))
            for attempt in range(MAX_RETRIES + 1):
                if attempt:
                    # exponential backoff with full jitter: 200 ms base, 30 s cap
                    time.sleep(random.uniform(0, min(0.2 * 2 ** attempt, 30)))
                try:
                    res = requests.put(url, data=data, headers={'X-Chunk-Checksum': checksum}, timeout=60)
                    status = res.status_code
                except requests.RequestException:
                    status = None  # timeout or connection reset: retryable
                if status is not None and 200 <= status < 300:
                    self.checkpoint.mark_part(upload_id, idx, checksum, len(data))
                    return
                if status is not None and status not in RETRYABLE_STATUS:
                    raise RuntimeError('Part %d failed permanently with HTTP %s' % (idx, status))
            raise RuntimeError('Part %d failed after %d retries' % (idx, MAX_RETRIES))

        def worker():
            # Stop pulling work as soon as any worker has recorded a failure
            while not errors:
                try:
                    idx = q.get_nowait()  # never blocks, so idle workers exit cleanly
                except Empty:
                    return
                try:
                    upload_part(idx)
                except Exception as e:
                    errors.append(e)

        threads = [threading.Thread(target=worker) for _ in range(CONCURRENCY)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        if errors:
            raise errors[0]  # checkpointed parts are skipped on the next run

        # finalize
        parts = self.checkpoint.get_parts(upload_id)
        ordered = [{'part': i, 'checksum': parts[i]} for i in sorted(parts.keys())]
        res = requests.post(f'https://api.example.com/uploads/{upload_id}/complete', json={'manifest': ordered}, timeout=60)
        res.raise_for_status()

# Usage: provide a function that creates pre-signed part URLs and upload id
  

Production considerations and optimizations

There are several engineering tradeoffs to tune beyond the core algorithm:

  • CDN edge ingest: for globally distributed clients, accept uploads at edge PoPs and forward to central object storage to reduce RTT and improve throughput; architectures that focus on edge ingress reduce tail latency (see edge-focused patterns at Edge-Oriented Oracle Architectures).
  • Serverless finalization: assemble or validate manifests using serverless functions for elastic scaling.
  • Storage lifecycle: use storage classes to automatically tier infrequently used dataset snapshots to reduce cost.
  • Pre-compressed chunks: compressible datasets can reduce transfer size dramatically, but only if you ensure consistent compression across retries and part boundaries.
  • Deduplication: content-addressable storage with per-chunk hashes avoids re-storing identical content across datasets or versions.

Monitoring and SLOs

Track metrics: upload success rate, average time-to-complete per GB, retry rate per client, and percent of resumed uploads vs fresh. Use these to set SLOs and spot regressions after SDK or server changes. For instrumentation patterns and cost-focused metrics, see operational case studies on query and instrumentation savings (query-spend case study).
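As a rough illustration, these metrics can be derived from per-upload records. The record fields (ok, bytes, seconds, retries, resumed) are assumptions for this sketch, and the retry rate is computed per upload rather than per client for brevity.

```python
def upload_metrics(records):
    """records: list of dicts with ok, bytes, seconds, retries, resumed."""
    total = len(records)
    ok = [r for r in records if r["ok"]]
    gb = sum(r["bytes"] for r in ok) / 1e9
    return {
        "success_rate": len(ok) / total if total else 0.0,
        "seconds_per_gb": sum(r["seconds"] for r in ok) / gb if gb else None,
        "retries_per_upload": sum(r["retries"] for r in records) / total if total else 0.0,
        "resumed_fraction": sum(1 for r in records if r["resumed"]) / total if total else 0.0,
    }
```

Grouping the same records by client or SDK version before calling the function gives the per-client view needed to spot regressions.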

Costs and multipart trade-offs

Multipart uploads reduce the data re-sent after a failure, but they increase API call counts and metadata storage. On many cloud object stores each part is a billed PUT request, so fewer, larger parts lower request charges. Always balance part size against retry cost: sometimes paying for an extra PUT is cheaper than re-uploading a 64 MiB chunk repeatedly on flaky mobile links.
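The trade-off can be made concrete with a toy model. It assumes failures arrive at a constant rate per byte transferred and that each failure wastes the whole part, which is pessimistic; the function name and the failures_per_gib parameter are hypothetical, and real links will behave less uniformly.

```python
import math

MIB = 1024 * 1024
GIB = 1024 * MIB


def part_tradeoff(file_size, part_size, failures_per_gib):
    """Return (expected PUT requests, expected bytes re-sent) for one upload."""
    parts = math.ceil(file_size / part_size)
    lam = failures_per_gib / GIB                   # failures per byte
    attempts_per_part = math.exp(lam * part_size)  # 1 / P(part succeeds)
    put_requests = parts * attempts_per_part
    resent_bytes = parts * (attempts_per_part - 1) * part_size
    return put_requests, resent_bytes
```

Under this model, larger parts always mean fewer PUT requests but more re-sent data once the link starts failing, which is the balance described above.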

As of 2026, several trends shape resumable uploads for AI datasets:

  • Edge-first ingestion: major CDN and edge providers are offering direct resumable upload endpoints to ingest training data closer to collectors, reducing RTT and egress cost.
  • Data marketplaces and provenance: acquisitions and integrations in AI data marketplaces mean dataset provenance and immutable manifests are increasingly required. Expect server APIs that record dataset lineage alongside part manifests.
  • WASM checksum offload: clients increasingly run checksum and compression in WASM for consistent cross-platform performance; see WASM adoption discussions in tooling rundowns (WASM checksum patterns).
  • Content-addressable and Merkle-based manifests: for very large datasets, Merkle trees make partial verification and efficient delta updates feasible (Merkle & manifest architectures).
  • P2P and federation: experimental tooling uses peer-assisted uploads for high-volume edge collections, reducing central bandwidth load.
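To make the Merkle-based manifest idea concrete, here is a minimal sketch that folds per-chunk SHA-256 digests into a single root. Duplicating the last node on odd-sized levels is one common convention among several, chosen here only for simplicity; merkle_root is a hypothetical helper.

```python
import hashlib


def merkle_root(leaf_hashes):
    """Combine per-chunk SHA-256 digests (bytes) pairwise into one root."""
    if not leaf_hashes:
        return hashlib.sha256(b"").digest()
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]
```

Changing a single chunk changes only the hashes on its path to the root, which is what makes partial verification and delta updates cheap.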

Checklist: ship a resilient resumable upload flow

  1. Choose a protocol: tus, S3 multipart, or custom byte-range API.
  2. Implement per-part checksums and final SHA-256 verification.
  3. Persist checkpoints durably so they survive crashes and reboots.
  4. Use exponential backoff with jitter and classify errors for retries.
  5. Tune chunk size using BDP and adapt dynamically.
  6. Support parallel parts with ordered manifest finalization.
  7. Protect data: TLS, short-lived pre-signed URLs, and optional client-side encryption.
  8. Instrument metrics and alerts for upload health and costs.

Final notes

Resumable uploads are both an engineering challenge and an operational multiplier. For teams building AI pipelines in 2026, investing in robust resumable flows reduces wasted compute, speeds iteration, and enables new business models around dataset marketplaces and provenance. Small decisions — chunk size, where to store state, how you classify retries — compound when you handle terabytes at scale.

Call to action

Ready to implement resilient resumable uploads for your training pipelines? Start with a small prototype using the JS or Python SDK examples above, measure your baseline RTT and bandwidth, then iterate chunk sizing and retry parameters. If you want a turnkey library or audited architecture review for your system, contact our team to get a tailored implementation and performance tuning roadmap.
