How to Build a File Upload System That Handles Large Files and Resumes After Network Failures

Almost every web application that lets users upload anything will eventually need to handle files larger than the original demo accounted for. A document tool that started with 1MB PDFs ends up needing to accept 200MB scanned archives. A video platform that demoed with 30-second clips finds that the real users are uploading 4GB raw camera files. A team collaboration tool discovers that the most common upload pattern is dragging a 600MB Sketch file from an office over a hotel WiFi connection that drops every 90 seconds.

The straightforward upload code from the framework tutorial does not survive any of these. A single fetch with the entire file as the request body times out at the proxy layer, runs out of memory in the browser, or fails completely the first time the WiFi blips. The user gets a generic error message after watching a progress bar fill for 20 minutes and tries again, then again, and eventually gives up.

Building a file upload system that actually holds up requires accepting four constraints up front: files can be large, networks can be flaky, browsers have memory limits, and users will not patiently restart from zero. Once those constraints are designed for instead of patched around, the pattern that emerges is consistent across most stacks. This article walks through how to think about the system, what each layer is responsible for, and where the surprising failure modes are buried.

The Four Constraints That Drive the Whole Design

The simplest file upload sends the whole file in one HTTP request. The browser reads the file into memory, the request body contains the file bytes, the server receives the request and writes the file to storage, and the response confirms success. This works for files under 50MB or so over a stable connection. It fails for everything else, and the failures are not graceful.

The first constraint is file size. Browsers can typically read files up to a few hundred MB into memory before they slow down significantly. Past 1GB, the browser tab can become unresponsive or crash on lower-end devices. The server side has its own limits: most proxies and gateways have request body size caps in the 100MB to 4GB range, and many of those caps cannot be raised without infrastructure work.

The second constraint is network reliability. A 4GB upload at 10 Mbps takes around 53 minutes. The probability of a network blip in any given 53-minute window on a real-world residential or mobile connection is high; a rough working estimate is somewhere between 30 and 80 percent. A system that requires the upload to complete in one continuous request will fail for a large share of big uploads even on functional networks.
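
The 53-minute figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes 4GB means 4 × 10^9 bytes and that the link sustains its full rate with no protocol overhead; the function name is illustrative.

```python
def transfer_minutes(size_bytes: int, link_mbps: float) -> float:
    """Minutes needed to send size_bytes over a link of link_mbps megabits/s."""
    bits = size_bytes * 8
    seconds = bits / (link_mbps * 1_000_000)
    return seconds / 60


# 4 GB taken as 4 * 10^9 bytes, on a 10 Mbps uplink with no overhead.
print(round(transfer_minutes(4 * 1000**3, 10), 1))  # 53.3
```

Real uploads will take longer than this lower bound once headers, TLS, and retransmissions are counted.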

The third constraint is request timeout. Most proxy layers (load balancers, reverse proxies, CDNs) have request timeout caps in the 60 to 300 second range. A single request streaming a 4GB upload at 10 Mbps will exceed those limits. Raising the timeouts is possible but introduces other problems, including connection pool exhaustion and zombie requests that consume memory long after the user gave up.

The fourth constraint is the user experience. A user who has uploaded 80 percent of a 4GB file and lost their connection wants to resume from 80 percent, not restart. Restarting from zero is the failure mode that produces the worst user reviews and the most support tickets.

Chunking: The Pattern That Solves Most of It

The standard answer to all four constraints is chunked upload. Split the file into small pieces (typically 5MB to 10MB each) on the client side, upload each chunk in a separate HTTP request, and reassemble the file on the server once all chunks have arrived.

The chunked pattern gives you three things at once. Each request is small enough to fit comfortably within proxy timeouts and request body limits. A failed chunk can be retried independently without restarting the whole upload. And the upload progress can be reported at chunk granularity, so the UI can show meaningful resume points.

The implementation has three layers worth thinking about separately.

The client layer reads the file in fixed-size chunks (Blob.slice(), which File objects inherit, is the standard tool here) and uploads each one as a separate request. The order of chunks does not strictly need to be sequential, but most implementations send them in order so that resumption is straightforward.
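
The slicing logic itself is language-independent. As a rough sketch (written in Python rather than browser code, with a hypothetical function name), the byte ranges a client would pass to slice(start, end) can be computed like this:

```python
from typing import Iterator, Tuple


def chunk_ranges(total_size: int, chunk_size: int) -> Iterator[Tuple[int, int, int]]:
    """Yield (index, start, end) for each chunk; end is exclusive, as in slice()."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for index, start in enumerate(range(0, total_size, chunk_size)):
        # The last chunk is usually shorter than chunk_size.
        yield index, start, min(start + chunk_size, total_size)


# A 25-byte "file" cut into 10-byte chunks.
for index, start, end in chunk_ranges(25, 10):
    print(index, start, end)
```

Only the current range needs to be read into memory at any moment, which is what keeps large files from exhausting the browser.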

The transport layer handles retries. If a chunk fails, it gets retried with exponential backoff. After some number of failures, the upload pauses and surfaces a "connection lost, will resume when network is back" state to the user rather than failing immediately.
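
One possible shape for that retry loop is sketched below. The names (send_with_retry, UploadPaused), the default limits, and the random jitter are assumptions for illustration, not part of any particular library; send_chunk stands in for whatever function performs the HTTP request.

```python
import random
import time
from typing import Callable


class UploadPaused(Exception):
    """Retries are exhausted; the UI should show a 'connection lost' state."""


def send_with_retry(
    send_chunk: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send_chunk until it succeeds; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            send_chunk()
            return attempt
        except OSError as exc:  # network-level failures in this sketch
            if attempt == max_attempts:
                raise UploadPaused("connection lost, will resume later") from exc
            # The delay ceiling doubles each attempt: 1s, 2s, 4s, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

The caller catches UploadPaused and switches the UI to the paused state instead of reporting a hard failure.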

The server layer accepts chunks, stores them in a temporary location keyed by an upload ID, and assembles the final file once all chunks have been received. This is where the protocol design lives: how chunks are identified, how completion is signaled, and how partial uploads are cleaned up.
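
For the application-server case, a minimal sketch of chunk storage and assembly might look like the following. The directory layout, file naming, and function names are hypothetical, and the sketch assumes the upload ID is generated by the server so it is safe to use as a directory name.

```python
import shutil
from pathlib import Path


def save_chunk(root: Path, upload_id: str, index: int, data: bytes) -> Path:
    """Store one chunk under a directory named after the upload ID."""
    chunk_dir = root / upload_id
    chunk_dir.mkdir(parents=True, exist_ok=True)
    path = chunk_dir / f"{index:08d}.part"
    path.write_bytes(data)  # re-sending a chunk simply overwrites it
    return path


def assemble(root: Path, upload_id: str, chunk_count: int, dest: Path) -> None:
    """Concatenate chunks 0..chunk_count-1 into dest, then drop the temp files."""
    chunk_dir = root / upload_id
    paths = [chunk_dir / f"{i:08d}.part" for i in range(chunk_count)]
    missing = [p.name for p in paths if not p.exists()]
    if missing:
        raise FileNotFoundError(f"missing chunks: {missing}")
    with dest.open("wb") as out:
        for path in paths:
            out.write(path.read_bytes())
    shutil.rmtree(chunk_dir)
```

Because chunks are keyed by index, they can arrive out of order or more than once without affecting the assembled result.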

The Tus protocol is an open standard for resumable file upload that handles the protocol details. Several implementations exist for major languages and frameworks. Using Tus saves you from re-implementing the wire format yourself and gives you compatibility with existing client libraries.

When we ship resumable upload features, the first system test we run is plugging the upload mid-stream into a tool that randomly drops 30 percent of requests. If the upload does not recover cleanly, we have not built it yet, no matter what the happy path looks like. - Dennis Traina, founder of 137Foundry

The Upload ID and the Server-Side State

The thing that makes resumption work is the upload ID. When a client starts an upload, it asks the server to create an upload session and gets back an ID. Every subsequent chunk references that ID. If the upload is interrupted, the client can later ask the server "where did we leave off for upload X?" and resume from the next missing chunk.

This means the server needs to track upload state. For each open upload session, the server tracks the upload ID, the expected total size, the chunk size, the chunks that have been received, and a TTL after which abandoned uploads get cleaned up. This state can live in a database (Postgres, MySQL) or a key-value store (Redis), depending on volume and access patterns.

The cleanup TTL matters more than it sounds. Without it, abandoned uploads accumulate in temporary storage forever and eventually fill up disk. A reasonable default is 7 days. After 7 days with no activity, the partial chunks are deleted and the upload ID is invalidated. The user trying to resume an old abandoned upload gets a clean error rather than a corrupted retry.
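
The session record described above could be modeled roughly as follows. This is an in-memory sketch with illustrative field and method names; a real system would persist the same fields in Postgres, Redis, or similar.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import List, Optional, Set

SEVEN_DAYS = 7 * 24 * 60 * 60  # default TTL in seconds


@dataclass
class UploadSession:
    total_size: int
    chunk_size: int
    upload_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    received: Set[int] = field(default_factory=set)
    last_activity: float = field(default_factory=time.time)
    ttl_seconds: int = SEVEN_DAYS

    @property
    def chunk_count(self) -> int:
        # Integer ceiling division, so the short final chunk is counted.
        return (self.total_size + self.chunk_size - 1) // self.chunk_size

    def mark_received(self, index: int, now: Optional[float] = None) -> None:
        if not 0 <= index < self.chunk_count:
            raise IndexError(f"chunk {index} out of range")
        self.received.add(index)
        self.last_activity = time.time() if now is None else now

    def missing_chunks(self) -> List[int]:
        """Answer 'where did we leave off?' for a resuming client."""
        return [i for i in range(self.chunk_count) if i not in self.received]

    def is_expired(self, now: Optional[float] = None) -> bool:
        current = time.time() if now is None else now
        return current - self.last_activity > self.ttl_seconds
```

A periodic cleanup job would scan for sessions where is_expired() is true, delete their partial chunks, and invalidate the ID.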

The Storage Layer

Where do the chunks actually go while the upload is in progress? Two patterns dominate.

The first pattern is direct-to-storage uploads. The client uploads each chunk directly to an object storage service like S3 using a pre-signed URL that the application server generates. S3 has native support for multipart uploads, which is essentially a chunked upload protocol with its own wire format. The application server never touches the file bytes, which makes scaling significantly easier and offloads the bandwidth cost from your application servers.

The second pattern is application-server uploads. The client uploads chunks to the application server, which stores them in temporary local or shared storage, then assembles the final file and moves it to long-term storage when the upload completes. This pattern gives you more control over the upload (you can scan chunks for malware, validate content types, apply rate limits per user) but consumes more bandwidth and compute on the application servers.

Most production systems use the direct-to-storage pattern for the chunk uploads and add metadata tracking on the application server. The application server handles upload session creation, authorization, and final assembly notification, but the bytes flow client-to-storage without passing through application servers.

S3-compatible storage exists from most major cloud providers, and several open-source object stores (MinIO, Garage) implement the same protocol. Standardizing on the S3 multipart upload protocol means your client code works against essentially any modern object store.
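
Targeting S3 multipart uploads puts constraints on chunk size: every part except the last must be at least 5 MiB, and an upload can have at most 10,000 parts. A small sketch of choosing a part size under those limits follows; the function names and the 8 MiB default are assumptions, and the sketch does not check S3's upper bound on part size.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB  # S3 minimum for every part except the last
MAX_PARTS = 10_000       # S3 maximum number of parts per multipart upload


def choose_part_size(total_size: int, preferred: int = 8 * MIB) -> int:
    """Return a part size that keeps the upload within the limits above."""
    part_size = max(preferred, MIN_PART_SIZE)
    # Smallest part size that fits total_size into MAX_PARTS parts.
    needed = -(-total_size // MAX_PARTS)  # ceiling division
    if needed > part_size:
        part_size = -(-needed // MIB) * MIB  # round up to a whole MiB
    return part_size


def part_count(total_size: int, part_size: int) -> int:
    return -(-total_size // part_size)


size = choose_part_size(200 * 1024**3)  # a 200 GiB file
print(size // MIB, part_count(200 * 1024**3, size))
```

For typical files the preferred size is used unchanged; only very large files force bigger parts.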

Authorization and Pre-Signed URLs

Letting the client upload directly to storage means you cannot use the storage credentials directly in the client. Anyone could then upload arbitrary content to your bucket. The standard solution is pre-signed URLs: the application server generates a time-limited URL that grants permission to upload to a specific path, and the client uses that URL for the upload.

The pre-signed URL pattern lets you keep storage credentials server-side while still allowing direct client-to-storage uploads. With S3 multipart uploads, each part request typically gets its own pre-signed URL tied to the upload ID and part number, which the application server can issue in a batch or on demand. Some other storage services instead issue a single session URL that grants permission to upload chunks within that session for a fixed window.

For the multi-tenant SaaS case, pre-signed URLs are also where per-user authorization lives. The application server checks that the requesting user is allowed to upload to the destination path before generating the URL. Storage permissions stay simple at the bucket level; user-level permissions live in the application.
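
To make the idea concrete, here is a generic sketch of a time-limited signed URL built with HMAC. It is not S3's actual signing scheme, which real deployments would get from the storage provider's SDK; the URL layout, the /uploads/{user_id}/ path convention, and all function names are assumptions for illustration.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import urlencode


def _signature(secret: bytes, path: str, expires_at: int) -> str:
    message = f"PUT\n{path}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def issue_upload_url(secret: bytes, base_url: str, user_id: str,
                     path: str, expires_at: int) -> str:
    """Application server: authorize the user, then sign a time-limited URL."""
    # Hypothetical per-user rule: users may only write under their own prefix.
    if not path.startswith(f"/uploads/{user_id}/"):
        raise PermissionError(f"{user_id} may not upload to {path}")
    query = urlencode({"expires": expires_at,
                       "signature": _signature(secret, path, expires_at)})
    return f"{base_url}{path}?{query}"


def verify_upload(secret: bytes, path: str, expires_at: int,
                  signature: str, now: Optional[float] = None) -> bool:
    """Storage side: accept the request only if unexpired and untampered."""
    current = time.time() if now is None else now
    if current > expires_at:
        return False
    expected = _signature(secret, path, expires_at)
    return hmac.compare_digest(expected, signature)
```

The secret never leaves the server side, and changing the path or the expiry invalidates the signature.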

The Progress Reporting Surface

Reporting progress accurately is a UX problem that gets overlooked until users complain. The naive approach reports progress based on how many chunks have been uploaded, which produces a jittery progress bar that jumps from 60 percent to 75 percent in one step.

The smoother approach reports progress within each chunk using the progress events on XMLHttpRequest's upload object. The fetch API does not currently provide an equivalent upload progress event, which is one reason upload code often still uses XHR for the chunk requests. The client tracks total bytes uploaded as the running sum across all completed chunks plus the bytes in flight on the current chunk. The result is a continuously updating progress percentage that matches what users expect from desktop file upload UIs.

For uploads that take more than a couple of minutes, the progress display benefits from an estimated time remaining calculation. The simple version divides remaining bytes by current bytes-per-second. The robust version uses an exponentially weighted moving average to smooth out short-term bandwidth fluctuations. The Wikipedia article on exponential smoothing covers the math, which is a few lines of code.
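
Both calculations fit in a short sketch. The class and function names here are hypothetical, and the smoothing factor of 0.2 is an arbitrary starting point rather than a recommended value.

```python
from typing import Iterable, Optional


def uploaded_bytes(finished_chunk_sizes: Iterable[int], in_flight_bytes: int) -> int:
    """Bytes from completed chunks plus bytes sent so far on the current chunk."""
    return sum(finished_chunk_sizes) + in_flight_bytes


class ThroughputEstimator:
    """Exponentially weighted moving average of upload speed in bytes/second."""

    def __init__(self, alpha: float = 0.2) -> None:
        if not 0 < alpha <= 1:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.rate: Optional[float] = None

    def update(self, bytes_sent: int, seconds: float) -> None:
        if seconds <= 0:
            return  # ignore degenerate samples
        sample = bytes_sent / seconds
        if self.rate is None:
            self.rate = sample  # the first sample seeds the average
        else:
            self.rate = self.alpha * sample + (1 - self.alpha) * self.rate

    def eta_seconds(self, remaining_bytes: int) -> Optional[float]:
        if not self.rate:
            return None  # no usable estimate yet
        return remaining_bytes / self.rate
```

A higher alpha reacts faster to bandwidth changes but makes the time-remaining display jumpier; a lower alpha does the opposite.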

Resuming After Interruption

The resume flow is the part of the system that justifies all the complexity. When the network drops mid-upload, the client should pause, surface a "connection lost" state to the user, and resume as soon as the network returns.

Detecting connection loss can be done several ways. The fetch failures themselves are the most reliable signal: when chunks start failing repeatedly, treat that as a connection loss. The browser's online and offline events (together with the navigator.onLine property) provide a second signal that the network has gone away. Some implementations use a heartbeat request to a small server endpoint as an active check.

When the connection returns, the client queries the server for the upload session's current state ("which chunks do you have?") and resumes from the next missing chunk. In the Tus protocol this is a HEAD request: the client sends HEAD to the upload URL and the server responds with the current offset in the Upload-Offset header. The client then resumes the upload from that offset.
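
Once the client knows the server's offset, planning the rest of the upload is simple arithmetic. The sketch below assumes an offset-based protocol in which the server reports how many contiguous bytes it already holds; the function name is illustrative.

```python
from typing import List, Tuple


def remaining_ranges(offset: int, total_size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Byte ranges (start, end-exclusive) still to send, starting at offset."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= offset <= total_size:
        raise ValueError("offset must be between 0 and total_size")
    return [
        (start, min(start + chunk_size, total_size))
        for start in range(offset, total_size, chunk_size)
    ]


# The server already holds the first 15 bytes of a 40-byte upload.
print(remaining_ranges(15, 40, 10))  # [(15, 25), (25, 35), (35, 40)]
```

An empty list means the server already has every byte and the client can move straight to the completion step.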

The user-facing experience is that the progress bar pauses with a small "reconnecting" indicator, the connection returns, and the upload continues from where it left off. This is what users expect from desktop file managers and what every modern web upload should do.

The Validation and Scanning Layer

Files uploaded to a web application often need validation before they can be considered safe to expose to other users. Content type validation, size validation, virus scanning, and content-specific validation (image dimensions, video codec, document structure) all happen after the upload completes.

Running these checks on a complete file is straightforward. Running them on partial chunks is not. The simplest pattern is to assemble the full file after all chunks arrive, run validation on the assembled file, and only then move it to its final location. If validation fails, the assembled file is deleted and the user is notified.

This adds latency between "upload finished" and "file is available." For large files, validation can take several seconds to several minutes (a 4GB video file passing through ClamAV takes a few minutes). The UX needs to account for this: an upload-complete state followed by a processing state that resolves to either ready or rejected.
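
One way to keep those UX states honest is to model them as an explicit state machine. The state names and transition table below are a sketch of the flow described in this article, not a standard; real systems may need more states, such as cancelled or expired.

```python
from enum import Enum


class UploadState(Enum):
    UPLOADING = "uploading"
    PAUSED = "paused"          # connection lost, waiting to resume
    PROCESSING = "processing"  # upload complete, validation running
    READY = "ready"
    REJECTED = "rejected"


ALLOWED = {
    UploadState.UPLOADING: {UploadState.PAUSED, UploadState.PROCESSING},
    UploadState.PAUSED: {UploadState.UPLOADING},
    UploadState.PROCESSING: {UploadState.READY, UploadState.REJECTED},
    UploadState.READY: set(),
    UploadState.REJECTED: set(),
}


def transition(current: UploadState, target: UploadState) -> UploadState:
    """Return target if the move is legal, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target
```

Making illegal moves raise an error prevents the UI from, for example, showing a file as ready before scanning has finished.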

For sites that handle untrusted user content at scale, ClamAV is the standard open-source virus scanner and integrates cleanly into post-upload workflows. Commercial scanners exist with broader detection coverage for production deployments.

Where 137Foundry Helps

Building a production-grade file upload system involves more decisions than it looks like from the outside: chunking strategy, storage architecture, retry and resume logic, progress reporting, validation pipelines, and the UX states for every failure mode. Most teams get the happy path working in a sprint and discover the hard edges over the following 6 months in production.

The 137Foundry web development services team has shipped upload systems for SaaS platforms, document tools, video applications, and creative tools across a range of file size and throughput profiles. The 137Foundry services hub covers the broader application engineering work that surrounds upload pipelines: storage strategy, processing pipelines, and the operational tooling for tracking upload health in production.

For background on the broader engineering approach, the 137Foundry homepage describes the work we do across data integration, web development, and AI-assisted automation.

A Short Checklist Before Shipping

Before declaring a file upload feature ready for production, the system needs to pass a few specific tests beyond the happy path.

The slow connection test: throttle the connection to 1 Mbps in browser dev tools and upload a 1GB file. The upload should complete cleanly, the progress bar should remain accurate, and the time-remaining estimate should converge to something reasonable within the first 30 seconds.

The connection drop test: start a large upload, disable the network mid-upload, wait a minute, re-enable the network. The upload should pause, the UI should show a clear reconnecting state, and the upload should resume from where it stopped, not from zero.

The browser tab close test: start a large upload, close the browser tab, reopen the application within a few hours. The application should offer to resume the in-progress upload if you return to the same context, or at minimum should not silently lose the partial data.

The corrupt chunk test: simulate a chunk being corrupted in transit (use a proxy that mangles one chunk's bytes). The server should detect the corruption via checksums or hash comparison and request a retry of that specific chunk rather than accepting the bad data.
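
A minimal sketch of the per-chunk integrity check: the client sends a digest alongside each chunk and the server recomputes it before storing the bytes. SHA-256 and the function names are assumptions here; some protocols use other algorithms or a storage-provided checksum header.

```python
import hashlib


def chunk_digest(data: bytes) -> str:
    """Hex SHA-256 digest the client computes before sending a chunk."""
    return hashlib.sha256(data).hexdigest()


def accept_chunk(data: bytes, expected_digest: str) -> bool:
    """Server side: True if the received bytes match the client's digest."""
    return chunk_digest(data) == expected_digest.lower()


original = b"chunk payload"
digest = chunk_digest(original)
corrupted = b"chunk pay1oad"  # one byte mangled in transit

print(accept_chunk(original, digest))   # True
print(accept_chunk(corrupted, digest))  # False
```

When accept_chunk returns False, the server rejects only that chunk, and the client's normal retry path sends it again.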

Passing those four tests covers most of the production failure modes for upload systems. A system that does not pass them will produce support tickets within the first month of real-world use.

Closing on the Architecture

The file upload system that most web applications need looks like this in summary: a client-side chunker that reads the file in 5MB to 10MB pieces, a chunked upload protocol that lets each chunk retry independently and supports resume, direct-to-storage uploads via pre-signed URLs for the chunk data, an application-server-managed upload session that tracks chunk state and authorizes the operation, post-upload validation and scanning before the file becomes accessible, and a UI that handles the connection states honestly rather than failing on the first network blip.

That is more components than the framework tutorial showed, but the resulting system is the one users do not complain about. The infrastructure to support it exists in commodity form (S3, Tus, common HTTP clients) and the patterns are well-documented. The work is the engineering judgment about which components fit your throughput, file size, and validation needs, not the wire-format implementation.
