Implementing Resumable Uploads for Multi-hour Recordings with Unreliable Networks
Engineering guide for resilient resumable uploads of multi-hour recordings with checkpointing, parallel chunk validation and recovery strategies.
Your multi-hour recording must survive flaky networks; here's how to build for that.
Uploading a four-hour livestream or a multi-gigabyte podcast episode over Wi-Fi or mobile networks is a common failure point. Connections drop, clients crash, and naive retry logic turns hours of recording into lost time and user frustration. In 2026, with HTTP/3/QUIC and edge ingest now mainstream, engineering a resilient resumable upload flow is table stakes for professional media apps. This guide walks you through a battle-tested server and client implementation that handles checkpointing, parallel chunk validation, and reliable upload recovery for multi-hour recordings.
Overview: Goals and constraints
Start by aligning on concrete goals and constraints for the upload system.
- Goal: Allow clients to upload files of arbitrary length (hours of media) reliably across intermittent networks.
- Constraints: Minimize server bandwidth, scale to thousands of concurrent uploads, ensure integrity and compliance, and retry without user re-recording.
- Key features: resumable sessions, per-chunk checksums, parallel upload, checkpoint metadata, final assembly and integrity verification, presigned URLs for direct object storage uploads, and GC of stale sessions.
2026 context and why this matters now
As of late 2025 and early 2026, several trends make advanced resumable uploads both more powerful and more expected:
- Widespread adoption of HTTP/3 and QUIC has reduced round-trip latency and improved performance under packet loss, but client and network variability still necessitates robust retry and resume logic.
- Edge compute and CDN-integrated ingest let you push uploads closer to clients, reducing latency and egress costs — but orchestration and final consistency remain your responsibility.
- Serverless and ephemeral function limits mean long-lived assembly should be offloaded to durable services or object storage.
High-level architecture
Design with these components:
- Client SDK (browser, mobile) that slices data, computes checksums, and manages a resumable session.
- Session manager API that issues session IDs, chunk manifests, and presigned targets.
- Object storage with multipart upload support (S3, R2, GCS) or an edge-backed ingest endpoint.
- Metadata store for checkpoints and GC (Redis, DynamoDB, Postgres).
- Finalizer service that assembles parts, validates full-file integrity, and moves to long-term storage or CDN origin.
Defining the resumable session and checkpoint format
Store compact but sufficient metadata server-side so clients can resume quickly. A session record should include:
- sessionId: UUID v4
- userId, fileName, fileSize
- chunkSize (e.g., 8 MiB recommended), and total parts = ceil(fileSize / chunkSize)
- parts: array or bitmap of uploaded part ids and their checksums or object ETags
- lastAlive timestamp and ttl for GC
- finalHash: optional expected SHA-256 provided by client for end-to-end verification
Persist this in a low-latency store like Redis or DynamoDB for quick reads and updates. Use JSON documents and an expiry TTL to allow cleanup of abandoned sessions.
Choosing chunk size and concurrency
Chunk size impacts resilience, overhead, and memory usage:
- Small chunks (1-4 MiB): cheap to resend after a failure, but many requests and higher per-part overhead for large files.
- Medium chunks (8-16 MiB): a good balance for multi-hour media, with fewer parts and a moderate resume cost.
- Large chunks (32+ MiB): fewest requests, but the most wasted transfer when a single chunk fails mid-upload.
For multi-hour recordings, start with 8 MiB and tune based on client bandwidth distribution. Limit parallelism to 4-12 concurrent uploads per device to avoid saturating mobile networks and incurring connection errors. Use exponential backoff with jitter for retries.
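The retry policy above can be sketched in a few lines using "full jitter": the delay is drawn uniformly from zero up to a capped exponential ceiling. The base and cap values here are illustrative starting points, not recommendations for every network:

```javascript
// Capped exponential backoff with full jitter: delay is drawn uniformly
// from [0, min(cap, base * 2^attempt)]. Tune base and cap per platform.
function retryDelayMs(attempt, baseMs = 500, capMs = 30000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt)
  return Math.random() * ceiling
}
```

Full jitter spreads reconnecting clients across the whole window, which matters when a cell tower handoff drops thousands of uploads at once.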
Chunk validation strategy: per-chunk hashing and Merkle-like finalization
Integrity is critical. Use a two-step verification:
- Per-chunk checksum: compute SHA-256 (or BLAKE3 for speed) on the client for each chunk and send it alongside the upload request. Store it in the session record. This enables the server or object store to validate the upload immediately.
- Final file hash: client computes a rolling file-level SHA-256 or a Merkle root over chunk hashes. The finalizer verifies the assembled object matches the expected final hash before marking the upload complete.
Merkle-style manifests are useful when you want partial deduplication or integrity proofs, especially across distributed assembly pipelines.
Direct-to-storage with presigned parts: scalable and cost-effective
To avoid proxying large uploads through your application servers, use presigned URLs or presigned multipart parts. Workflow:
- Client requests a resume session or a new session from the session manager.
- Server returns a sessionId and a list of presigned upload URLs for specific part indices. Include expected chunk checksums in the response.
- Client uploads each chunk directly to object storage using the presigned URL and reports the resulting ETag or success status back to the session manager.
This reduces server egress costs and scales with object storage speed. In 2026, many CDNs provide edge-signed upload endpoints to accept multipart parts directly at the edge.
Server example: Node/Express session manager (presigned parts for S3)
const express = require('express')
const { S3Client, CreateMultipartUploadCommand } = require('@aws-sdk/client-s3')
const { v4: uuidv4 } = require('uuid')

const app = express()
app.use(express.json())
const s3 = new S3Client({})
const Bucket = process.env.UPLOAD_BUCKET
// pseudo-code: sessionStore is Redis or DynamoDB

app.post('/upload/session', async (req, res) => {
  const { fileName, fileSize, chunkSize = 8 * 1024 * 1024 } = req.body
  const sessionId = uuidv4()
  const parts = Math.ceil(fileSize / chunkSize)
  // Create the multipart upload in S3; individual parts are presigned later
  // via getSignedUrl from @aws-sdk/s3-request-presigner with UploadPartCommand
  const create = await s3.send(new CreateMultipartUploadCommand({ Bucket, Key: `${sessionId}/${fileName}` }))
  // store session: sessionId, uploadId, parts, chunkSize
  await sessionStore.save(sessionId, { uploadId: create.UploadId, parts: {}, chunkSize, fileSize })
  res.json({ sessionId, chunkSize, parts })
})
This returns session metadata. Next, issue presigned URLs for each part and update session parts as clients upload.
Client pattern: slice, checksum, upload, checkpoint
Client responsibilities:
- Slice file into parts and compute per-part checksum before uploading.
- Upload parts in parallel to presigned URLs and on success write a checkpoint to the session manager with part index, checksum, and server ETag.
- On resume, fetch session state and upload only missing or invalidated parts.
async function uploadFile(file, session) {
  const chunkSize = session.chunkSize
  const totalParts = Math.ceil(file.size / chunkSize)
  const concurrency = 6
  const queue = createPartQueue(totalParts)
  while (queue.hasNext()) {
    const batch = []
    const promises = []
    for (let i = 0; i < concurrency && queue.hasNext(); i++) {
      const partIndex = queue.next()
      batch.push(partIndex)
      promises.push(uploadPart(file, partIndex, chunkSize, session))
    }
    const results = await Promise.allSettled(promises)
    // Re-enqueue failed parts so they are retried instead of silently dropped
    results.forEach((r, i) => { if (r.status === 'rejected') queue.requeue(batch[i]) })
  }
}

async function uploadPart(file, index, size, session) {
  const start = index * size
  const blob = file.slice(start, start + size)  // slice clamps past end of file
  const checksum = await sha256(blob)
  const presigned = await getPresignedUrl(session.sessionId, index, checksum)
  const res = await fetch(presigned.url, { method: 'PUT', body: blob })
  if (!res.ok) throw new Error(`Upload of part ${index} failed: ${res.status}`)
  await checkpointPart(session.sessionId, index, checksum, res.headers.get('etag'))
}
Parallel chunk validation and optimistic concurrency
When uploading multiple parts concurrently, race conditions can occur if the session metadata is updated by multiple clients or device restarts. Mitigate risks with the following:
- Use atomic updates in the session store. For Redis use HSET with version checks or Lua scripts; for DynamoDB use conditional writes on a version attribute.
- Store both the client checksum and the storage ETag. On resume, compare them and re-upload if mismatch.
- If parts are uploaded out of order, assembly is still possible; rely on the multipart API which accepts parts by index.
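The version-check pattern can be illustrated with an in-memory stand-in for the session store; a real deployment would express the same check as a DynamoDB conditional write or a Redis Lua script, but the shape is identical: write only if the version you read is still current.

```javascript
// In-memory stand-in for a versioned session store, illustrating
// optimistic concurrency: a checkpoint only lands if no other writer
// has bumped the version since this client read it.
class SessionStore {
  constructor() { this.records = new Map() }
  get(id) { return this.records.get(id) }
  put(id, record) { this.records.set(id, { ...record, version: 0 }) }
  // Returns true on success, false if another writer got there first
  checkpointPart(id, expectedVersion, partIndex, meta) {
    const rec = this.records.get(id)
    if (!rec || rec.version !== expectedVersion) return false
    rec.parts[partIndex] = meta
    rec.version += 1
    return true
  }
}

const store = new SessionStore()
store.put('s1', { parts: {} })
const v = store.get('s1').version
const first = store.checkpointPart('s1', v, 0, { checksum: 'abc', etag: '"e1"' })
const stale = store.checkpointPart('s1', v, 1, { checksum: 'def', etag: '"e2"' })  // lost the race
```

A loser of the race simply re-reads the record and retries its write, so no checkpoint is ever overwritten blind.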
Recovery strategies for unreliable networks
Design for intermittent connectivity with these tactics:
- Checkpoint after each successful part: store part metadata (index, checksum, storage ETag) immediately.
- Local durable checkpointing: also mirror session metadata to local storage (IndexedDB, SQLite on mobile) to allow resume after app crash before server-side checkpoint is updated.
- Client-side retry with backoff and jitter: use capped exponential backoff and randomized jitter to reduce thundering herd.
- Partial playback and transcode streaming: for very large recordings, consider periodic partial finalization and transcoding of uploaded ranges so partial content is usable while the rest uploads.
- Conflict resolution: on detecting different checksums for the same index, prefer the latest timestamped client upload or attempt re-upload after verification.
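The conflict rule in the last bullet can be sketched as a merge of the local and server checkpoint maps; the record shapes here are illustrative:

```javascript
// Merge a local (IndexedDB/SQLite) checkpoint map with the server's
// authoritative map. On a checksum conflict for the same part index,
// prefer the entry with the newer timestamp, per the rule above.
function mergeCheckpoints(localParts, serverParts) {
  const merged = { ...serverParts }
  for (const [idx, local] of Object.entries(localParts)) {
    const server = merged[idx]
    if (!server || (local.checksum !== server.checksum && local.ts > server.ts)) {
      merged[idx] = local
    }
  }
  return merged
}

const merged = mergeCheckpoints(
  { 0: { checksum: 'aa', ts: 200 }, 2: { checksum: 'cc', ts: 50 } },
  { 0: { checksum: 'xx', ts: 100 }, 1: { checksum: 'bb', ts: 150 }, 2: { checksum: 'zz', ts: 300 } }
)
```

Any part whose merged entry came from the local side should still be re-verified against storage before it is trusted, since the server never confirmed it.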
Finalization and integrity verification
Once all parts are uploaded and recorded in the session, run a finalizer process that:
- Calls CompleteMultipartUpload on the storage provider or otherwise assembles parts.
- Streams the assembled object to compute its checksum and verifies it against the client-supplied final hash, or derives the expected hash from the stored chunk checksums.
- Moves the object to long-term storage or marks it public via CDN, depending on policy.
Fail-safe: if final verification fails, mark session as failed and keep parts for a grace period so the client can re-upload problematic parts without starting over.
Security, compliance and cost controls
- TLS everywhere: serve all endpoints over TLS, and keep presigned URLs short-lived and, where possible, single-use.
- Per-part authorization: tie presigned URLs to session identity and rotate credentials frequently.
- Encryption: use server-side encryption in storage and consider client-side encryption for HIPAA/GDPR-sensitive media.
- Audit logs: record session events, part uploads and finalization for compliance and debugging.
- Cost control: use CDN edge ingest to reduce egress, reuse existing multipart parts to dedupe, and schedule lifecycle rules to remove abandoned partial uploads.
Monitoring, retries and observability
Instrument these metrics and alerts:
- Session creation rate, completion rate, and abandonment rate
- Per-part upload latency and error rates
- Finalization success/failure and checksum mismatch counts
- Storage costs per GB and egress by region
Provide detailed client-side error codes in the session API so uploads can surface actionable messages instead of generic failures.
SDK design patterns and API contract
An SDK simplifies adoption and ensures consistent resume logic across platforms. Recommended methods:
- initSession(fileMeta) => sessionId, chunkSize, partsCount
- getSessionStatus(sessionId) => uploadedParts, checksums
- uploadPart(sessionId, index, blob, progressCb)
- checkpointPart(sessionId, index, checksum, etag)
- finalize(sessionId, finalHash)
Expose events for progress, pause, resume, and error so app UIs can show clear states. Include automatic local persistence of session state for app crashes.
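A skeletal version of that event surface might look like this; the names mirror the contract above, and the emitter is deliberately minimal:

```javascript
// Minimal sketch of the SDK state surface: progress, pause, and resume
// events over a tiny emitter. Names mirror the API contract above.
class UploadSession {
  constructor(partsCount) {
    this.partsCount = partsCount
    this.uploaded = new Set()
    this.state = 'running'
    this.listeners = {}
  }
  on(evt, cb) { (this.listeners[evt] = this.listeners[evt] || []).push(cb) }
  emit(evt, data) { (this.listeners[evt] || []).forEach(cb => cb(data)) }
  pause() { this.state = 'paused'; this.emit('pause') }
  resume() { this.state = 'running'; this.emit('resume') }
  checkpointPart(index) {
    this.uploaded.add(index)
    this.emit('progress', { done: this.uploaded.size, total: this.partsCount })
  }
  get progress() { return this.uploaded.size / this.partsCount }
}

const s = new UploadSession(4)
const events = []
s.on('progress', e => events.push(e.done))
s.checkpointPart(0); s.checkpointPart(1)
s.pause()
```

Driving the UI from checkpoint events rather than raw bytes sent keeps the progress bar honest: it only advances when a part is durably recorded.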
Edge cases and hard lessons from production
- Clients that switch networks mid-upload can change IP addresses and NAT paths, which can invalidate some CDN edge presigned contexts. Use short-lived presigned URLs issued per IP where necessary, or reissue on network change.
- Mobile OS backgrounding can suspend uploads. Use platform-specific background upload APIs that hand off to the OS for long-running transfers.
- Too much parallelism on low-memory devices can OOM when buffering multiple chunks. Keep memory bounded by streaming chunks and avoiding full-buffer accumulation.
- Multipart uploads that are never completed keep accruing storage costs for their parts. Implement aggressive GC and notify users before expiring sessions.
Example: recovery flow summary
- App starts and either creates or fetches sessionId.
- App loads local checkpoint and fetches authoritative session state from server.
- App computes missing parts list by comparing local and server checkpoints.
- App resumes parallel uploads on missing parts, updating checkpoints after each success.
- After all parts succeed, app calls finalize. Server completes multipart upload and verifies final hash.
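Computing the missing-parts list in step three is a straightforward diff against the authoritative server state:

```javascript
// Compute the missing-parts list on resume: anything not confirmed in the
// server's session state still needs uploading. Parts that exist only in
// the local checkpoint count as missing too, since the upload may never
// have reached storage before the crash.
function missingParts(totalParts, serverParts) {
  const missing = []
  for (let i = 0; i < totalParts; i++) {
    if (!serverParts[i]) missing.push(i)
  }
  return missing
}

const server = { 0: { etag: '"e0"' }, 2: { etag: '"e2"' }, 3: { etag: '"e3"' } }
const todo = missingParts(5, server)
```

Feeding this list back into the parallel upload queue turns a resume into the same code path as a fresh upload, just with fewer parts.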
Advanced: dedupe, delta uploads and chunk-addressing
If your service accepts many similar recordings or guests upload the same media, consider content-addressed chunks where chunk hashes are used to deduplicate storage. This reduces egress and storage costs but requires a chunk-index service and careful TTL handling for partial uploads.
Wrap-up: trade-offs and tuning checklist
Key trade-offs:
- Chunk size vs retry cost
- Concurrency vs client resource usage
- Latency vs server-side validation depth
Tuning checklist before launch:
- Validate uploads end-to-end with large QA files across flaky networks.
- Run load tests to validate session store and object storage scaling.
- Configure GC for stale sessions and abort incomplete multipart uploads periodically.
- Provide clear client UI with resume guidance and background upload notifications.
Actionable takeaways
- Use presigned multipart uploads and server-side session checkpointing to scale and reduce egress.
- Compute and persist per-chunk checksums, and verify a final file hash before accepting the upload as complete.
- Limit concurrency to a safe level, checkpoint after each part, and support local durable checkpoints to recover from crashes.
- Monitor session health and costs, and leverage edge ingest where low latency and lower egress are critical.
Call to action
Ready to implement reliable resumable uploads for multi-hour media? Download our reference SDK and example server implementations to get started, or contact our engineering team for a design review tailored to your platform and compliance requirements.