Implementing Resumable Uploads for Multi-hour Recordings with Unreliable Networks
Engineering guide for resilient resumable uploads of multi-hour recordings with checkpointing, parallel chunk validation and recovery strategies.
Your multi-hour recording must survive flaky networks; here's how to build for that.
Uploading a four-hour livestream or a multi-gigabyte podcast episode over Wi-Fi or mobile networks is a common failure point. Connections drop, clients crash, and naive retry logic turns hours of recording into lost time and user frustration. In 2026, with HTTP/3/QUIC and edge ingest now mainstream, engineering a resilient resumable upload flow is table stakes for professional media apps. This guide walks you through a battle-tested server and client implementation that handles checkpointing, parallel chunk validation, and reliable upload recovery for multi-hour recordings.
Overview: Goals and constraints
Start by aligning on concrete goals and constraints for the upload system.
- Goal: Allow clients to upload files of arbitrary length (hours of media) reliably across intermittent networks.
- Constraints: Minimize server bandwidth, scale to thousands of concurrent uploads, ensure integrity and compliance, and retry without user re-recording.
- Key features: resumable sessions, per-chunk checksums, parallel upload, checkpoint metadata, final assembly and integrity verification, presigned URLs for direct object storage uploads, and GC of stale sessions.
2026 context and why this matters now
As of late 2025 and early 2026, several trends make advanced resumable uploads both more powerful and more expected:
- Widespread adoption of HTTP/3 and QUIC has reduced round-trip latency and improved performance under packet loss, but client and network variability still necessitates robust retry and resume logic.
- Edge compute and CDN-integrated ingest let you push uploads closer to clients, reducing latency and egress costs — but orchestration and final consistency remain your responsibility.
- Serverless and ephemeral function limits mean long-lived assembly should be offloaded to durable services or object storage.
High-level architecture
Design with these components:
- Client SDK (browser, mobile) that slices data, computes checksums, and manages a resumable session.
- Session manager API that issues session IDs, chunk manifests, and presigned targets.
- Object storage with multipart upload support (S3, R2, GCS) or an edge-backed ingest endpoint.
- Metadata store for checkpoints and GC (Redis, DynamoDB, Postgres).
- Finalizer service that assembles parts, validates full-file integrity, and moves to long-term storage or CDN origin.
Defining the resumable session and checkpoint format
Store compact but sufficient metadata server-side so clients can resume quickly. A session record should include:
- sessionId: UUID v4
- userId, fileName, fileSize
- chunkSize (e.g., 8 MiB recommended), and total parts = ceil(fileSize / chunkSize)
- parts: array or bitmap of uploaded part ids and their checksums or object ETags
- lastAlive timestamp and ttl for GC
- finalHash: optional expected SHA-256 provided by client for end-to-end verification
Persist this in a low-latency store like Redis or DynamoDB for quick reads and updates. Use JSON documents and an expiry TTL to allow cleanup of abandoned sessions.
Choosing chunk size and concurrency
Chunk size impacts resilience, overhead, and memory usage:
- Small chunks (1-4 MiB): cheap to resend after a failure, but many requests and higher per-part overhead for large files.
- Medium chunks (8-16 MiB): a good balance for multi-hour media, with fewer parts and a moderate resume cost.
- Large chunks (32+ MiB): fewest requests, but the most wasted transfer when a single chunk fails mid-upload.
For multi-hour recordings, start with 8 MiB and tune based on client bandwidth distribution. Limit parallelism to 4-12 concurrent uploads per device to avoid saturating mobile networks and incurring connection errors. Use exponential backoff with jitter for retries.
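The retry policy above can be sketched in a few lines using "full jitter": the delay is drawn uniformly from zero up to a capped exponential ceiling. The base and cap values here are illustrative starting points, not recommendations for every network:

```javascript
// Capped exponential backoff with full jitter: delay is drawn uniformly
// from [0, min(cap, base * 2^attempt)]. Tune base and cap per platform.
function retryDelayMs(attempt, baseMs = 500, capMs = 30000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt)
  return Math.random() * ceiling
}
```

Full jitter spreads reconnecting clients across the whole window, which matters when a cell tower handoff drops thousands of uploads at once.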
Chunk validation strategy: per-chunk hashing and Merkle-like finalization
Integrity is critical. Use a two-step verification:
- Per-chunk checksum: compute SHA-256 (or BLAKE3 for speed) on the client for each chunk and send it alongside the upload request. Store it in the session record. This enables the server or object store to validate the upload immediately.
- Final file hash: client computes a rolling file-level SHA-256 or a Merkle root over chunk hashes. The finalizer verifies the assembled object matches the expected final hash before marking the upload complete.
Merkle-style manifests are useful when you want partial deduplication or integrity proofs, especially across distributed assembly pipelines.
Direct-to-storage with presigned parts: scalable and cost-effective
To avoid proxying large uploads through your application servers, use presigned URLs or presigned multipart parts. Workflow:
- Client requests a resume session or a new session from the session manager.
- Server returns a sessionId and a list of presigned upload URLs for specific part indices. Include expected chunk checksums in the response.
- Client uploads each chunk directly to object storage using the presigned URL and reports the resulting ETag or success status back to the session manager.
This reduces server egress costs and scales with object storage speed. In 2026, many CDNs provide edge-signed upload endpoints to accept multipart parts directly at the edge.
Server example: Node/Express session manager (presigned parts for S3)
const express = require('express')
const { S3Client, CreateMultipartUploadCommand } = require('@aws-sdk/client-s3')
const { v4: uuidv4 } = require('uuid')

const app = express()
app.use(express.json())
const s3 = new S3Client({})
const Bucket = process.env.UPLOAD_BUCKET
// pseudo-code: sessionStore is Redis or DynamoDB

app.post('/upload/session', async (req, res) => {
  const { fileName, fileSize, chunkSize = 8 * 1024 * 1024 } = req.body
  const sessionId = uuidv4()
  const parts = Math.ceil(fileSize / chunkSize)
  // Create the multipart upload in S3; individual parts are presigned later
  // via getSignedUrl from @aws-sdk/s3-request-presigner with UploadPartCommand
  const create = await s3.send(new CreateMultipartUploadCommand({ Bucket, Key: `${sessionId}/${fileName}` }))
  // store session: sessionId, uploadId, parts, chunkSize
  await sessionStore.save(sessionId, { uploadId: create.UploadId, parts: {}, chunkSize, fileSize })
  res.json({ sessionId, chunkSize, parts })
})
This returns session metadata. Next, issue presigned URLs for each part and update session parts as clients upload.
Client pattern: slice, checksum, upload, checkpoint
Client responsibilities:
- Slice file into parts and compute per-part checksum before uploading.
- Upload parts in parallel to presigned URLs and on success write a checkpoint to the session manager with part index, checksum, and server ETag.
- On resume, fetch session state and upload only missing or invalidated parts.
async function uploadFile(file, session) {
  const chunkSize = session.chunkSize
  const totalParts = Math.ceil(file.size / chunkSize)
  const concurrency = 6
  const queue = createPartQueue(totalParts)
  while (queue.hasNext()) {
    const batch = []
    const promises = []
    for (let i = 0; i < concurrency && queue.hasNext(); i++) {
      const partIndex = queue.next()
      batch.push(partIndex)
      promises.push(uploadPart(file, partIndex, chunkSize, session))
    }
    const results = await Promise.allSettled(promises)
    // Re-enqueue failed parts so they are retried instead of silently dropped
    results.forEach((r, i) => { if (r.status === 'rejected') queue.requeue(batch[i]) })
  }
}

async function uploadPart(file, index, size, session) {
  const start = index * size
  const blob = file.slice(start, start + size)  // slice clamps past end of file
  const checksum = await sha256(blob)
  const presigned = await getPresignedUrl(session.sessionId, index, checksum)
  const res = await fetch(presigned.url, { method: 'PUT', body: blob })
  if (!res.ok) throw new Error(`Upload of part ${index} failed: ${res.status}`)
  await checkpointPart(session.sessionId, index, checksum, res.headers.get('etag'))
}
Parallel chunk validation and optimistic concurrency
When uploading multiple parts concurrently, race conditions can occur if the session metadata is updated by multiple clients or device restarts. Mitigate risks with the following:
- Use atomic updates in the session store. For Redis use HSET with version checks or Lua scripts; for DynamoDB use conditional writes on a version attribute.
- Store both the client checksum and the storage ETag. On resume, compare them and re-upload if mismatch.
- If parts are uploaded out of order, assembly is still possible; rely on the multipart API which accepts parts by index.
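The version-check pattern can be illustrated with an in-memory stand-in for the session store; a real deployment would express the same check as a DynamoDB conditional write or a Redis Lua script, but the shape is identical: write only if the version you read is still current.

```javascript
// In-memory stand-in for a versioned session store, illustrating
// optimistic concurrency: a checkpoint only lands if no other writer
// has bumped the version since this client read it.
class SessionStore {
  constructor() { this.records = new Map() }
  get(id) { return this.records.get(id) }
  put(id, record) { this.records.set(id, { ...record, version: 0 }) }
  // Returns true on success, false if another writer got there first
  checkpointPart(id, expectedVersion, partIndex, meta) {
    const rec = this.records.get(id)
    if (!rec || rec.version !== expectedVersion) return false
    rec.parts[partIndex] = meta
    rec.version += 1
    return true
  }
}

const store = new SessionStore()
store.put('s1', { parts: {} })
const v = store.get('s1').version
const first = store.checkpointPart('s1', v, 0, { checksum: 'abc', etag: '"e1"' })
const stale = store.checkpointPart('s1', v, 1, { checksum: 'def', etag: '"e2"' })  // lost the race
```

A loser of the race simply re-reads the record and retries its write, so no checkpoint is ever overwritten blind.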
Recovery strategies for unreliable networks
Design for intermittent connectivity with these tactics:
- Checkpoint after each successful part: store part metadata (index, checksum, storage ETag) immediately.
- Local durable checkpointing: also mirror session metadata to local storage (IndexedDB, SQLite on mobile) to allow resume after app crash before server-side checkpoint is updated.
- Client-side retry with backoff and jitter: use capped exponential backoff and randomized jitter to reduce thundering herd.
- Partial playback and transcode streaming: for very large recordings, consider periodic partial finalization and transcoding of uploaded ranges so partial content is usable while the rest uploads.
- Conflict resolution: on detecting different checksums for the same index, prefer the latest timestamped client upload or attempt re-upload after verification.
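The conflict rule in the last bullet can be sketched as a merge of the local and server checkpoint maps; the record shapes here are illustrative:

```javascript
// Merge a local (IndexedDB/SQLite) checkpoint map with the server's
// authoritative map. On a checksum conflict for the same part index,
// prefer the entry with the newer timestamp, per the rule above.
function mergeCheckpoints(localParts, serverParts) {
  const merged = { ...serverParts }
  for (const [idx, local] of Object.entries(localParts)) {
    const server = merged[idx]
    if (!server || (local.checksum !== server.checksum && local.ts > server.ts)) {
      merged[idx] = local
    }
  }
  return merged
}

const merged = mergeCheckpoints(
  { 0: { checksum: 'aa', ts: 200 }, 2: { checksum: 'cc', ts: 50 } },
  { 0: { checksum: 'xx', ts: 100 }, 1: { checksum: 'bb', ts: 150 }, 2: { checksum: 'zz', ts: 300 } }
)
```

Any part whose merged entry came from the local side should still be re-verified against storage before it is trusted, since the server never confirmed it.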
Finalization and integrity verification
Once all parts are uploaded and recorded in the session, run a finalizer process that:
- Calls CompleteMultipartUpload on the storage provider or otherwise assembles parts.
- Streams the assembled object to compute its checksum and verifies it against the client-supplied final hash, or derives the expected hash from the stored chunk checksums.
- Moves the object to long-term storage or marks it public via CDN, depending on policy.
Fail-safe: if final verification fails, mark session as failed and keep parts for a grace period so the client can re-upload problematic parts without starting over.
Security, compliance and cost controls
- TLS everywhere: serve all endpoints over TLS, and keep presigned URLs short-lived and, where possible, single-use.
- Per-part authorization: tie presigned URLs to session identity and rotate credentials frequently.
- Encryption: use server-side encryption in storage and consider client-side encryption for HIPAA/GDPR-sensitive media.
- Audit logs: record session events, part uploads and finalization for compliance and debugging.
- Cost control: use CDN edge ingest to reduce egress, reuse existing multipart parts to dedupe, and schedule lifecycle rules to remove abandoned partial uploads.
Monitoring, retries and observability
Instrument these metrics and alerts:
- Session creation rate, completion rate, and abandonment rate
- Per-part upload latency and error rates
- Finalization success/failure and checksum mismatch counts
- Storage costs per GB and egress by region
Provide detailed client-side error codes in the session API so uploads can surface actionable messages instead of generic failures.
SDK design patterns and API contract
An SDK simplifies adoption and ensures consistent resume logic across platforms. Recommended methods:
- initSession(fileMeta) => sessionId, chunkSize, partsCount
- getSessionStatus(sessionId) => uploadedParts, checksums
- uploadPart(sessionId, index, blob, progressCb)
- checkpointPart(sessionId, index, checksum, etag)
- finalize(sessionId, finalHash)
Expose events for progress, pause, resume, and error so app UIs can show clear states. Include automatic local persistence of session state for app crashes.
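A skeletal version of that event surface might look like this; the names mirror the contract above, and the emitter is deliberately minimal:

```javascript
// Minimal sketch of the SDK state surface: progress, pause, and resume
// events over a tiny emitter. Names mirror the API contract above.
class UploadSession {
  constructor(partsCount) {
    this.partsCount = partsCount
    this.uploaded = new Set()
    this.state = 'running'
    this.listeners = {}
  }
  on(evt, cb) { (this.listeners[evt] = this.listeners[evt] || []).push(cb) }
  emit(evt, data) { (this.listeners[evt] || []).forEach(cb => cb(data)) }
  pause() { this.state = 'paused'; this.emit('pause') }
  resume() { this.state = 'running'; this.emit('resume') }
  checkpointPart(index) {
    this.uploaded.add(index)
    this.emit('progress', { done: this.uploaded.size, total: this.partsCount })
  }
  get progress() { return this.uploaded.size / this.partsCount }
}

const s = new UploadSession(4)
const events = []
s.on('progress', e => events.push(e.done))
s.checkpointPart(0); s.checkpointPart(1)
s.pause()
```

Driving the UI from checkpoint events rather than raw bytes sent keeps the progress bar honest: it only advances when a part is durably recorded.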
Edge cases and hard lessons from production
- Clients that switch networks mid-upload can change IP addresses and NAT paths, which can invalidate some CDN edge presigned contexts. Use short-lived presigned URLs issued per IP where necessary, or reissue on network change.
- Mobile OS backgrounding can suspend uploads. Use platform-specific background upload APIs that hand off to the OS for long-running transfers.
- Too much parallelism on low-memory devices can OOM when buffering multiple chunks. Keep memory bounded by streaming chunks and avoiding full-buffer accumulation.
- Multipart uploads that are never completed keep accruing storage costs for their parts. Implement aggressive GC and notify users before expiring sessions.
Example: recovery flow summary
- App starts and either creates or fetches sessionId.
- App loads local checkpoint and fetches authoritative session state from server.
- App computes missing parts list by comparing local and server checkpoints.
- App resumes parallel uploads on missing parts, updating checkpoints after each success.
- After all parts succeed, app calls finalize. Server completes multipart upload and verifies final hash.
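Computing the missing-parts list in step three is a straightforward diff against the authoritative server state:

```javascript
// Compute the missing-parts list on resume: anything not confirmed in the
// server's session state still needs uploading. Parts that exist only in
// the local checkpoint count as missing too, since the upload may never
// have reached storage before the crash.
function missingParts(totalParts, serverParts) {
  const missing = []
  for (let i = 0; i < totalParts; i++) {
    if (!serverParts[i]) missing.push(i)
  }
  return missing
}

const server = { 0: { etag: '"e0"' }, 2: { etag: '"e2"' }, 3: { etag: '"e3"' } }
const todo = missingParts(5, server)
```

Feeding this list back into the parallel upload queue turns a resume into the same code path as a fresh upload, just with fewer parts.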
Advanced: dedupe, delta uploads and chunk-addressing
If your service accepts many similar recordings or guests upload the same media, consider content-addressed chunks where chunk hashes are used to deduplicate storage. This reduces egress and storage costs but requires a chunk-index service and careful TTL handling for partial uploads.
Wrap-up: trade-offs and tuning checklist
Key trade-offs:
- Chunk size vs retry cost
- Concurrency vs client resource usage
- Latency vs server-side validation depth
Tuning checklist before launch:
- Validate uploads end-to-end with large QA files across flaky networks.
- Run load tests to validate session store and object storage scaling.
- Configure GC for stale sessions and abort incomplete multipart uploads periodically.
- Provide clear client UI with resume guidance and background upload notifications.
Actionable takeaways
- Use presigned multipart uploads and server-side session checkpointing to scale and reduce egress.
- Compute and persist per-chunk checksums, and verify a final file hash before accepting the upload as complete.
- Limit concurrency to a safe level, checkpoint after each part, and support local durable checkpoints to recover from crashes.
- Monitor session health and costs, and leverage edge ingest where low latency and lower egress are critical.
Call to action
Ready to implement reliable resumable uploads for multi-hour media? Download our reference SDK and example server implementations to get started, or contact our engineering team for a design review tailored to your platform and compliance requirements.