Implementing Resumable Uploads for Large Datasets: Strategies and SDK Examples
Deep, practical guide to resumable uploads for AI datasets: chunk tuning, retries, checksums, and JS/Python SDKs for large-file reliability in 2026.
Stop losing hours to failed dataset uploads
If you've ever watched a terabyte-scale dataset fail 90% through an upload and wished for a simple, reliable resume mechanism, this article is for you. In 2026, AI training pipelines push larger, more frequent dataset transfers and projects can't afford data loss, duplicate costs, or brittle retry logic. Below you'll find a deep, practical dive into resumable uploads: the algorithms, chunking strategies, retry patterns, and complete SDK examples in JavaScript and Python tuned for large AI training datasets.
Quick summary and actionable takeaways
- Design for checkpoints: persist chunk state and checksums to survive client crashes.
- Tune chunk size: use bandwidth-delay product (BDP) to pick chunk sizes instead of arbitrary defaults.
- Use idempotent part APIs: make chunk uploads repeatable with part indices and checksums.
- Retry with jitter: exponential backoff + jitter and server-aware handling of 429/503 is essential.
- Parallelize safely: upload parts in parallel but finalize on the server with a manifest and full-file checksum.
- Security: short-lived pre-signed URLs, TLS, and optional client-side encryption for regulated datasets.
Why resumable uploads matter in 2026
The scale of training datasets has kept accelerating into 2025 and 2026. Public moves like strategic acquisitions in AI data marketplaces and increased commercialization of training data have made dataset transfer robustness a first-class concern. Teams now routinely move multi-GB to multi-TB files from edge collectors, annotators, and marketplaces to central training lakes. A single failed transfer can mean wasted compute and delayed experiments.
Resumable upload systems reduce bandwidth waste, lower storage egress costs, and improve developer and operator productivity. They are also a compliance and security surface: resumable schemes that leak credentials, skip integrity checks, or allow duplicate parts can create vulnerabilities.
Resumable upload algorithms: state, protocols, and integrity
Core state machine
At a high level, a resumable upload is a deterministic state machine with these states:
- initialized: server created an upload ID and optional pre-signed URLs for parts.
- uploading: client sends numbered chunks/parts and records their status.
- finalizing: client asks server to assemble parts and validate final checksum.
- completed: object is available for downstream processing.
The server maintains a manifest of uploaded parts (part index, byte ranges, checksum, size). The client owns the retry logic and checkpoint persistence.
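As a rough sketch, the manifest and state machine can be modeled like this (field names are illustrative, not any particular vendor's API):
# Sketch: upload states and the per-part manifest the server maintains.
# Field names are illustrative, not a specific storage API.
from dataclasses import dataclass, field
from enum import Enum

class UploadState(Enum):
    INITIALIZED = 'initialized'
    UPLOADING = 'uploading'
    FINALIZING = 'finalizing'
    COMPLETED = 'completed'

@dataclass
class PartRecord:
    index: int      # zero-based part number
    offset: int     # first byte of this part in the final object
    size: int       # byte length of the part
    checksum: str   # per-part checksum (CRC32C or SHA-256 hex)

@dataclass
class UploadManifest:
    upload_id: str
    state: UploadState = UploadState.INITIALIZED
    parts: dict = field(default_factory=dict)  # part index -> PartRecord

    def missing_parts(self, total_parts):
        # A resuming client asks which indices are not yet confirmed.
        return [i for i in range(total_parts) if i not in self.parts]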
Protocols and patterns
Common options in modern stacks include:
- S3 multipart: the de facto standard for large-object uploads to S3-compatible stores; parts have a 5 MiB minimum size (except the last part), and the server provides an upload ID and per-part ETags.
- tus: a standardized resumable upload protocol with server-implemented offsets and patch semantics, often used in offline-first clients.
- HTTP Range / Byte-Range PATCH: APIs accept byte ranges and respond with current offset; useful for custom servers.
- Pre-signed part URLs: combine pre-signed URLs with S3 or object store parts to allow client-direct uploads to storage.
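For the pre-signed part URL pattern, a server-side sketch with boto3 against an S3-compatible store could look like the following (bucket and key are placeholders; error handling omitted):
# Sketch: initialize an S3 multipart upload and mint short-lived pre-signed
# URLs for each part. Bucket/key are placeholders; tune ExpiresIn to policy.
import boto3

def init_multipart(bucket, key, total_parts, expires_seconds=900):
    s3 = boto3.client('s3')
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)['UploadId']
    part_urls = {}
    for part_number in range(1, total_parts + 1):  # S3 part numbers start at 1
        part_urls[part_number] = s3.generate_presigned_url(
            ClientMethod='upload_part',
            Params={'Bucket': bucket, 'Key': key,
                    'UploadId': upload_id, 'PartNumber': part_number},
            ExpiresIn=expires_seconds,
        )
    return upload_id, part_urls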
Checksums and integrity
Per-part checksums are non-negotiable for reliable resumes. Use fast checksums like CRC32C for quick verification, and compute a final cryptographic hash (SHA-256) for the entire object. For huge datasets or deduplication needs, a Merkle-tree approach provides efficient partial verification and parallel integrity proofs.
Strong tip: always send the part checksum with the upload and have the server verify before marking the part complete. Never trust part completion without a checksum match.
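A single-pass sketch of that layering in Python: a fast per-chunk digest plus a whole-file SHA-256 (zlib's CRC32 stands in for CRC32C here, since CRC32C needs a native or WASM library):
# Sketch: per-chunk checksums and a whole-file SHA-256 computed in one pass.
# zlib.crc32 is a stand-in for CRC32C; use a real CRC32C implementation to
# match what your storage backend verifies.
import hashlib
import zlib

def chunk_checksums(path, chunk_size=16 * 1024 * 1024):
    full = hashlib.sha256()
    per_chunk = []
    with open(path, 'rb') as f:
        index = 0
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            full.update(data)
            per_chunk.append((index, format(zlib.crc32(data) & 0xFFFFFFFF, '08x')))
            index += 1
    return per_chunk, full.hexdigest()
For a Merkle-style manifest, hash the per-chunk digests pairwise up to a root instead of (or alongside) the flat whole-file hash; that lets the server verify individual subranges without rereading the entire object.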
Chunking strategies: picking the right part size
Chunk size selection directly impacts latency, throughput, memory, and cost. Too small and you waste CPU and additional metadata; too large and retries become expensive when a single chunk fails. Optimal chunk size depends on network RTT, available bandwidth, client memory, and the storage API constraints.
Bandwidth-delay product (BDP) approach
Calculate a chunk size based on the estimated Bandwidth-Delay Product:
# Chunk sizing from the bandwidth-delay product.
# rtt measured in seconds, bandwidth in bytes/sec
def pick_chunk_size(rtt, bandwidth, min_chunk, max_chunk):
    bdp = rtt * bandwidth
    return int(min(max(bdp, min_chunk), max_chunk))
Practically, measure small test uploads to estimate bandwidth and RTT, then choose a chunk size within the [min_chunk, max_chunk] window. For example, for desktop clients on modern cloud links, a 16-64 MiB chunk is often close to optimal for multi-GB files; for mobile clients with higher RTTs and variable throughput, 1-8 MiB is safer.
Guidelines and rules of thumb
- S3-compatible stores: respect the 5 MiB minimum per part and consider 10-50 MiB for large datasets.
- Very large files (100GB+): choose 32-128 MiB parts and parallelize uploads.
- Mobile and high-latency: favor 1-8 MiB to reduce re-upload cost on failures.
- Minimize metadata: fewer parts mean fewer API calls, which cuts cost and server overhead.
Adaptive chunk sizing
Implement adaptive strategies: start with a conservative chunk, measure throughput, and enlarge chunks on sustained good throughput, or shrink when error rates increase. Persist the tuned size per device to speed future uploads.
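A sketch of that feedback loop (the bounds and thresholds are illustrative defaults, not tuned values):
# Sketch: adapt chunk size between uploads from observed throughput and error
# rate. Bounds and thresholds are illustrative, not tuned recommendations.
MIN_CHUNK = 1 * 1024 * 1024     # 1 MiB
MAX_CHUNK = 128 * 1024 * 1024   # 128 MiB

def adapt_chunk_size(current, throughput_bps, error_rate):
    if error_rate > 0.05:
        # Failures are expensive per chunk: shrink so retries re-send less data.
        return max(current // 2, MIN_CHUNK)
    seconds_per_chunk = current / max(throughput_bps, 1.0)
    if seconds_per_chunk < 1.0:
        # Sustained good throughput: grow to amortize per-request overhead.
        return min(current * 2, MAX_CHUNK)
    return current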
Retry strategies: backoff, idempotency, and checkpointing
Effective retry behavior keeps throughput high while avoiding thundering herds and duplicate work.
Exponential backoff with jitter
Use exponential backoff with jitter to avoid synchronized retries; full jitter (shown in the sketch after this list) and decorrelated jitter are both common choices. Example parameters:
- initial delay: 200ms
- factor: 2
- jitter: uniform random between 0 and current delay
- max retries: 8-12
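A minimal helper matching those parameters (a sketch; adjust the cap and attempt budget to your SLOs):
# Sketch: exponential backoff with full jitter using the example parameters
# above (200 ms initial delay, factor 2, capped delay, bounded attempts).
import random
import time

def backoff_sleep(attempt, base=0.2, factor=2.0, cap=30.0):
    delay = min(base * (factor ** attempt), cap)
    time.sleep(random.uniform(0, delay))  # jitter: uniform in [0, current delay]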
Classify errors
Different error classes require different responses:
- 4xx client errors (400, 403): usually fatal for that request; inspect and re-authenticate or abort the upload.
- 429 / 503: retry with backoff and consider reducing concurrency.
- Network timeouts / connection resets: safe to retry the same part if idempotent; ensure server-side checks protect against duplicate writes.
Idempotency and deduplication
Upload parts with part index and checksum. If the server detects an existing identical part, return success without reapplying the data. This makes retries safe and efficient.
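Server-side, the dedup check can be as small as the following sketch (the store interface is hypothetical; swap in your metadata DB and object store):
# Sketch: idempotent part acceptance. `store` is a hypothetical interface with
# get_part_checksum / write_part backed by your metadata DB and object store.
import hashlib

def accept_part(store, upload_id, part_idx, data, client_checksum):
    actual = hashlib.sha256(data).hexdigest()
    if actual != client_checksum:
        # Reject and let the client retry; never mark a part complete unverified.
        raise ValueError('checksum mismatch on part %d' % part_idx)
    if store.get_part_checksum(upload_id, part_idx) == actual:
        return True  # identical part already stored: succeed without rewriting
    store.write_part(upload_id, part_idx, data, actual)
    return True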
Checkpointing and persistence
Persist upload state locally: the upload ID, completed part indices, per-part checksums, and the current chunk size. For production clients, use durable local storage (IndexedDB in the browser, SQLite for desktop or CLI tools). Recovering from a crash means reading the checkpoint and uploading only the parts not yet confirmed.
Parallel uploads and ordering
Parallelizing part uploads maximizes throughput but introduces complexity.
- Assign each worker a non-overlapping part index.
- Upload parts out of order; the manifest keeps index->ETag/checksum mapping.
- Finalize by sending the ordered manifest to the server, which assembles parts atomically.
For AI datasets, ensure the final object checksum matches an authoritative record before downstream tasks begin; training on corrupted data is costly.
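With S3 multipart, for example, finalization is a single call that takes the ordered part list; a sketch (the ETags are the values each part PUT returned):
# Sketch: atomic finalization of an S3 multipart upload from the recorded
# manifest. `parts` maps part number -> ETag captured when each part uploaded.
import boto3

def finalize_multipart(bucket, key, upload_id, parts):
    s3 = boto3.client('s3')
    ordered = [{'PartNumber': n, 'ETag': etag} for n, etag in sorted(parts.items())]
    s3.complete_multipart_upload(
        Bucket=bucket,
        Key=key,
        UploadId=upload_id,
        MultipartUpload={'Parts': ordered},
    )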
Security and compliance
Short-lived pre-signed URLs, TLS 1.3, and strong server-side logging are baseline controls in 2026. For regulated datasets (HIPAA, GDPR), add encryption-at-rest with customer-managed keys or client-side encryption, strict access logs, and data residency controls.
JavaScript SDK example
The following is a compact resumable upload client for browser or Node.js that demonstrates chunking, checkpointing (localStorage for demo), parallel uploads, checksums, and retries. For production, replace localStorage with IndexedDB and add robust error handling.
// SimpleResumableUploader.js
// Uses single-quote strings to simplify embedding
const DEFAULT_CHUNK = 8 * 1024 * 1024 // 8 MiB
const CONCURRENCY = 4
function sleep(ms){ return new Promise(r => setTimeout(r, ms)) }
async function crc32c(buffer){
// placeholder: in prod use a WASM or native CRC32C implementation
// here we return a hex stub for demo
return 'crc32c-' + buffer.byteLength
}
class SimpleResumableUploader{
constructor({createUploadUrl}){
this.createUploadUrl = createUploadUrl // function(fileMeta) -> {uploadId, partUrlTemplate}
}
async upload(file, meta){
const {uploadId, partUrlTemplate} = await this.createUploadUrl(meta)
const stateKey = 'upload:' + uploadId
let state = JSON.parse(localStorage.getItem(stateKey) || null) || {parts: {}, chunkSize: DEFAULT_CHUNK}
    this._claimed = new Set() // in-memory claim set so parallel workers pick distinct parts
    const workers = new Array(CONCURRENCY).fill(null).map(() => this._worker(file, state, uploadId, partUrlTemplate, stateKey))
await Promise.all(workers)
// after all parts uploaded, call finalize
const manifest = Object.keys(state.parts).sort((a,b)=>a-b).map(i => ({part: Number(i), checksum: state.parts[i].checksum}))
await fetch(`/uploads/${uploadId}/complete`, {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({manifest})
})
localStorage.removeItem(stateKey)
}
async _worker(file, state, uploadId, partUrlTemplate, stateKey){
while(true){
const nextIndex = this._nextPendingPart(state, file.size, state.chunkSize)
if(nextIndex === null) return
const start = nextIndex * state.chunkSize
const end = Math.min(file.size, start + state.chunkSize)
const blob = file.slice(start, end)
const arr = await blob.arrayBuffer()
const checksum = await crc32c(arr)
// skip if already uploaded with same checksum
if(state.parts[nextIndex] && state.parts[nextIndex].checksum === checksum){
continue
}
const url = partUrlTemplate.replace('{part}', String(nextIndex))
let tries = 0
while(true){
try{
const res = await fetch(url, {method: 'PUT', body: arr, headers: {'X-Chunk-Checksum': checksum}})
if(res.ok){
state.parts[nextIndex] = {checksum, size: arr.byteLength}
localStorage.setItem(stateKey, JSON.stringify(state))
break
}
          if(res.status >= 400 && res.status < 500 && res.status !== 429){
            // permanent client error: do not retry this part
            throw new Error('Permanent client error ' + res.status)
          }
          // 429 or 5xx -> fall through and retry with backoff
        }catch(err){
          if(String(err.message || '').indexOf('Permanent client error') === 0) throw err
          tries++
          if(tries > 10) throw err
          // exponential backoff (200ms base, capped at 30s) plus full jitter
          const backoff = Math.min(200 * Math.pow(2, tries), 30000)
          await sleep(backoff + Math.random() * backoff)
        }
}
}
}
  _nextPendingPart(state, fileSize, chunkSize){
    const total = Math.ceil(fileSize / chunkSize)
    for(let i = 0; i < total; i++){
      if(state.parts[i]) continue        // already uploaded and checkpointed
      if(this._claimed.has(i)) continue  // another worker owns this part
      this._claimed.add(i)
      return i
    }
    return null
  }
}
Notes: this demo uses a server endpoint that returns pre-signed part URLs using a {part} placeholder and a completion endpoint that accepts an ordered manifest. In production, also include part ETags returned by storage APIs.
Python SDK example
The Python example below uploads large files in parallel parts using threads, persists checkpoints in a small SQLite database, and computes a SHA-256 checksum per part; pair it with a whole-file SHA-256 (as discussed above) before handing the object to downstream jobs.
# resumable_uploader.py
import os
import math
import hashlib
import sqlite3
import random
import threading
import time
import requests
from queue import Empty, Queue
CHUNK_SIZE = 16 * 1024 * 1024
CONCURRENCY = 6
class SqliteCheckpoint:
def __init__(self, path='uploads.db'):
self.conn = sqlite3.connect(path, check_same_thread=False)
self.conn.execute('''CREATE TABLE IF NOT EXISTS parts (upload_id TEXT, part_idx INTEGER, checksum TEXT, size INTEGER, PRIMARY KEY(upload_id, part_idx))''')
self.lock = threading.Lock()
def mark_part(self, upload_id, idx, checksum, size):
with self.lock:
self.conn.execute('INSERT OR REPLACE INTO parts (upload_id, part_idx, checksum, size) VALUES (?,?,?,?)', (upload_id, idx, checksum, size))
self.conn.commit()
def get_parts(self, upload_id):
cur = self.conn.execute('SELECT part_idx, checksum FROM parts WHERE upload_id=?', (upload_id,))
return {r[0]: r[1] for r in cur.fetchall()}
def sha256_bytes(b):
h = hashlib.sha256()
h.update(b)
return h.hexdigest()
class ResumableUploader:
def __init__(self, create_upload_url_fn, checkpoint):
self.create_upload_url_fn = create_upload_url_fn
self.checkpoint = checkpoint
def upload(self, path, meta):
size = os.path.getsize(path)
upload = self.create_upload_url_fn(meta)
upload_id = upload['upload_id']
part_template = upload['part_url_template']
done_parts = self.checkpoint.get_parts(upload_id)
total_parts = math.ceil(size / CHUNK_SIZE)
q = Queue()
for i in range(total_parts):
if i in done_parts:
continue
q.put(i)
        errors = []
        def worker():
            try:
                while True:
                    try:
                        idx = q.get_nowait()
                    except Empty:
                        return
                    start = idx * CHUNK_SIZE
                    with open(path, 'rb') as f:
                        f.seek(start)
                        data = f.read(CHUNK_SIZE)
                    checksum = sha256_bytes(data)
                    url = part_template.replace('{part}', str(idx))
                    tries = 0
                    while True:
                        try:
                            res = requests.put(url, data=data, headers={'X-Chunk-Checksum': checksum}, timeout=60)
                            if res.ok:
                                self.checkpoint.mark_part(upload_id, idx, checksum, len(data))
                                break
                            if 400 <= res.status_code < 500 and res.status_code != 429:
                                # permanent client error: do not retry this part
                                raise RuntimeError('Client error %s on part %s' % (res.status_code, idx))
                            # 429 / 5xx: fall through and retry with backoff
                        except RuntimeError:
                            raise
                        except requests.RequestException:
                            pass  # timeout / connection reset: retry below
                        tries += 1
                        if tries > 8:
                            raise RuntimeError('Part %s failed after %s attempts' % (idx, tries))
                        # exponential backoff with jitter, capped at 30 seconds
                        time.sleep(random.uniform(0, min(0.2 * (2 ** tries), 30.0)))
            except Exception as exc:
                errors.append(exc)
                return
        threads = [threading.Thread(target=worker) for _ in range(CONCURRENCY)]
        for t in threads: t.start()
        for t in threads: t.join()
        if errors:
            raise errors[0]
# finalize
parts = self.checkpoint.get_parts(upload_id)
ordered = [{'part': i, 'checksum': parts[i]} for i in sorted(parts.keys())]
requests.post(f'https://api.example.com/uploads/{upload_id}/complete', json={'manifest': ordered})
# Usage: provide a function that creates pre-signed part URLs and upload id
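# A usage sketch (the initiate endpoint and file name below are hypothetical):
def create_upload(meta):
    res = requests.post('https://api.example.com/uploads/initiate', json=meta, timeout=30)
    res.raise_for_status()
    return res.json()  # expected shape: {'upload_id': ..., 'part_url_template': '.../{part}'}

uploader = ResumableUploader(create_upload, SqliteCheckpoint())
uploader.upload('train_shard_000.tar', {'filename': 'train_shard_000.tar', 'dataset': 'demo'})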
Production considerations and optimizations
There are several engineering tradeoffs to tune beyond the core algorithm:
- CDN edge ingest: for globally distributed clients, accept uploads at edge PoPs and forward to central object storage to reduce RTT and improve throughput; architectures that focus on edge ingress reduce tail latency (see edge-focused patterns at Edge-Oriented Oracle Architectures).
- Serverless finalization: assemble or validate manifests using serverless functions for elastic scaling.
- Storage lifecycle: use storage classes to automatically tier infrequently used dataset snapshots to reduce cost.
- Pre-compressed chunks: compressible datasets can reduce transfer size dramatically, but only if you ensure consistent compression across retries and part boundaries.
- Deduplication: content-addressable storage with per-chunk hashes avoids re-storing identical content across datasets or versions.
Monitoring and SLOs
Track metrics: upload success rate, average time-to-complete per GB, retry rate per client, and percent of resumed uploads vs fresh. Use these to set SLOs and spot regressions after SDK or server changes. For instrumentation patterns and cost-focused metrics, see operational case studies on query and instrumentation savings (query-spend case study).
Costs and multipart trade-offs
Multipart uploads lower egress and retries but increase API call counts and metadata storage. For many cloud object stores, reducing part count lowers the number of PUT requests and API charges. Always balance part size against retry cost — sometimes paying for an extra PUT is cheaper than re-uploading a 64 MiB chunk repeatedly on flaky mobile links.
Latest trends and 2026 predictions
As of 2026, several trends shape resumable uploads for AI datasets:
- Edge-first ingestion: major CDN and edge providers are offering direct resumable upload endpoints to ingest training data closer to collectors, reducing RTT and egress cost.
- Data marketplaces and provenance: acquisitions and integrations in AI data marketplaces mean dataset provenance and immutable manifests are increasingly required. Expect server APIs that record dataset lineage alongside part manifests.
- WASM checksum offload: clients increasingly run checksum and compression in WASM for consistent cross-platform performance; see WASM adoption discussions in tooling rundowns (WASM checksum patterns).
- Content-addressable and Merkle-based manifests: for very large datasets, Merkle trees make partial verification and efficient delta updates feasible (Merkle & manifest architectures).
- P2P and federation: experimental tooling uses peer-assisted uploads for high-volume edge collections, reducing central bandwidth load.
Checklist: ship a resilient resumable upload flow
- Choose a protocol: tus, S3 multipart, or custom byte-range API.
- Implement per-part checksums and final SHA-256 verification.
- Persist checkpoints durably across crashes and reboots.
- Use exponential backoff with jitter and classify errors for retries.
- Tune chunk size using BDP and adapt dynamically.
- Support parallel parts with ordered manifest finalization.
- Protect data: TLS, short-lived pre-signed URLs, and optional client-side encryption.
- Instrument metrics and alerts for upload health and costs.
Final notes
Resumable uploads are both an engineering challenge and an operational multiplier. For teams building AI pipelines in 2026, investing in robust resumable flows reduces wasted compute, speeds iteration, and enables new business models around dataset marketplaces and provenance. Small decisions — chunk size, where to store state, how you classify retries — compound when you handle terabytes at scale.
Call to action
Ready to implement resilient resumable uploads for your training pipelines? Start with a small prototype using the JS or Python SDK examples above, measure your baseline RTT and bandwidth, then iterate chunk sizing and retry parameters. If you want a turnkey library or audited architecture review for your system, contact our team to get a tailored implementation and performance tuning roadmap.
Related Reading
- AWS European Sovereign Cloud: Technical Controls, Isolation Patterns and What They Mean for Architects
- Edge-Oriented Oracle Architectures: Reducing Tail Latency and Improving Trust in 2026
- Tool Roundup: Offline‑First Document Backup and Diagram Tools for Distributed Teams (2026)
- Perceptual AI and the Future of Image Storage on the Web (2026)
- Case Study: How We Reduced Query Spend on whites.cloud by 37% — Instrumentation to Guardrails
- When Metal Meets Pop: What Gwar’s Cover of 'Pink Pony Club' Says About Genre Fluidity and Nasheed Remixing
- Citing Social Media Finance Conversations: Using Bluesky’s Cashtags in Academic Work
- How to Market Luxury Properties to Remote Buyers: Lessons from Montpellier and Sète Listings
- Parental Guide to Emerging AI Platforms in Education: Separating Hype From Helpful Tools
- Checklist: Preflight Email Tests to Beat Gmail’s AI Filters