We have deployed and operated over one million VDI and remote desktop sessions across K-12, higher ed, healthcare, government, and general business customers. In that time we have tuned protocols, diagnosed pixel jitter over satellite links, hunted down input latency in operating rooms, and explained to more than one executive why their Citrix session "feels slow" when the ping is fine.

Most explanations of how remote desktop works stop at "it sends screen updates over the network." That is true the way "cars move when you press the pedal" is true. Here is what actually happens between your click and the pixels on your screen, in five steps, with the details that matter when you are troubleshooting real deployments.

Step 1: Input Capture and Serialization

When you click the mouse or press a key on your local device, the remote desktop client intercepts the input before the local OS processes it for anything other than the client window itself. The client captures the event — key code, modifier state, mouse coordinates, button state, timestamp — and serializes it into the remote desktop protocol's input message format.

The protocol matters here. RDP (Microsoft), ICA and the HDX stack built on it (Citrix), PCoIP (developed by Teradici and used by VMware Horizon, among others), Blast Extreme (also Horizon), and the newer NICE DCV and Parsec all have different wire formats. Most modern protocols send input as small messages, typically 20 to 60 bytes per event, over either the main protocol channel or a dedicated input channel. Input is small. Input is almost never the bottleneck. If your session feels laggy on input, it is almost certainly the return trip that is slow, not the input trip.
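To make the size claim concrete, here is a minimal sketch of serializing one mouse click. The wire layout, the `INPUT_EVENT` struct, and the function names are hypothetical, invented for illustration; they are not the format of RDP, ICA, or any other real protocol.

```python
import struct
import time

# Hypothetical wire layout (NOT any real protocol's format), little-endian:
#   B  event type (1 = key, 2 = mouse)
#   B  modifier bitmask (shift, ctrl, alt, ...)
#   H  key code or mouse button mask
#   i  x coordinate (0 for key events)
#   i  y coordinate (0 for key events)
#   Q  client timestamp in microseconds
INPUT_EVENT = struct.Struct("<BBHiiQ")

EVENT_KEY = 1
EVENT_MOUSE = 2


def pack_mouse_click(x, y, button_mask, modifiers=0, timestamp_us=None):
    """Serialize a mouse event into the hypothetical fixed-size message."""
    if timestamp_us is None:
        timestamp_us = time.monotonic_ns() // 1000
    return INPUT_EVENT.pack(EVENT_MOUSE, modifiers, button_mask, x, y, timestamp_us)


def unpack_event(payload):
    """Deserialize a message back into its six fields (what the host side would do)."""
    return INPUT_EVENT.unpack(payload)


message = pack_mouse_click(x=640, y=360, button_mask=0x01, timestamp_us=1_000_000)
print(len(message), "bytes")  # 20 bytes
```

Even with a 64-bit timestamp the whole event fits in 20 bytes, which is why input traffic is negligible next to the frames coming back.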

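The client also does one thing that matters for remote work: it handles keyboard layout. Depending on the protocol and client settings, it either forwards physical key positions (scancodes), which the remote session then interprets with its own layout, or translates keystrokes into characters before sending them. If your local keyboard is US and the remote session is French AZERTY, this is where the two have to be reconciled. Getting this wrong is the source of about 30 percent of "remote desktop is broken" tickets from international customers.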
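A small sketch shows why a layout mismatch produces the wrong characters. It assumes the protocol forwards physical key positions (PC "set 1" scancodes) and the remote session maps them through its own layout. The two tables cover only four keys and the function name is ours, for illustration.

```python
# Scancodes identify physical key positions, not characters.
# Only the four keys that differ most visibly between the layouts are listed.
US_QWERTY = {0x10: "q", 0x11: "w", 0x1E: "a", 0x2C: "z"}
FR_AZERTY = {0x10: "a", 0x11: "z", 0x1E: "q", 0x2C: "w"}


def type_scancodes(scancodes, layout):
    """Return the characters a session with the given layout would produce."""
    return "".join(layout[code] for code in scancodes)


# The user presses the keys labeled Q, A, Z on a US keyboard.
pressed = [0x10, 0x1E, 0x2C]
print(type_scancodes(pressed, US_QWERTY))  # qaz  (what the user expects)
print(type_scancodes(pressed, FR_AZERTY))  # aqw  (what a French-layout session types)
```

The same key presses yield different text on each side, which is what those tickets describe.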

Step 2: Transport Over a Secure Channel

The serialized input travels over a TLS-encrypted TCP connection, or in some modern protocols a UDP-based transport with its own retransmit logic. The choice between TCP and UDP is one of the most important architectural decisions in a remote desktop deployment and it is worth understanding.

TCP is reliable, ordered, and polite: it backs off when it sees packet loss. Both traits hurt interactive sessions. Because delivery is strictly ordered, a single lost packet holds back everything queued behind it until the retransmission arrives (head-of-line blocking), and the congestion backoff cuts throughput at the same moment. UDP-based transports (PCoIP, Blast Extreme UDP, RDP UDP, EDT in Citrix) are built to tolerate packet loss gracefully: they retransmit only what must be retransmitted, and they treat the session as a stream of frames rather than a stream of bytes.
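The stall caused by ordered delivery can be sketched with a toy model. The timings below are made up, loss is modeled as a single fixed retransmission delay, and the function names are ours; real TCP stacks are far more elaborate.

```python
def arrival_times(count, interval_ms, one_way_ms, retransmit_ms, lost):
    """Time each packet reaches the receiver; lost ones arrive after one retransmit."""
    return [
        i * interval_ms + one_way_ms + (retransmit_ms if i in lost else 0)
        for i in range(count)
    ]


def in_order_delivery(arrivals):
    """Byte-stream semantics: a packet is released only after all earlier ones."""
    delivered, latest = [], 0
    for t in arrivals:
        latest = max(latest, t)
        delivered.append(latest)
    return delivered


arrivals = arrival_times(count=6, interval_ms=10, one_way_ms=30, retransmit_ms=200, lost={1})
ordered = in_order_delivery(arrivals)
stalled = sum(1 for a, o in zip(arrivals, ordered) if o > a)

print(arrivals)  # [30, 240, 50, 60, 70, 80]
print(ordered)   # [30, 240, 240, 240, 240, 240]
print(stalled)   # 4 packets that arrived on time but were held back
```

One lost packet delays four healthy ones. A frame-oriented transport could hand packets 2 through 5 to the decoder as they arrive, or drop the stale frame entirely.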

On a clean LAN you will not notice the difference. On a WAN with 2 percent packet loss, TCP will make a session nearly unusable while UDP-based transports will keep it acceptable. This is why we default to UDP-capable protocols for anything that goes over the public internet or a WAN with unknown quality. Any remote desktop deployment that forces TCP over a lossy link is going to feel slow regardless of bandwidth.
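For a rough feel of what loss does to a single TCP flow, the well-known Mathis approximation bounds steady-state throughput by MSS / RTT times C / sqrt(p). This is our illustration, not something the vendors publish: it models classic Reno-style congestion control, so treat the output as an order-of-magnitude sketch. Stacks using CUBIC or BBR behave differently.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(3 / 2)):
    """Approximate upper bound on steady-state TCP throughput (Reno-style model)."""
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


# Assumed link: 1460-byte segments, 60 ms round trip.
lossy = mathis_throughput_bps(1460, 0.060, 0.02)     # 2 percent loss
clean = mathis_throughput_bps(1460, 0.060, 0.0001)   # 0.01 percent loss

print(f"2% loss:    {lossy / 1e6:.1f} Mbps")
print(f"0.01% loss: {clean / 1e6:.1f} Mbps")
```

Under these assumptions the lossy link caps a single flow at under 2 Mbps no matter how much bandwidth was purchased, which matches what we see in the field.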

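The encryption layer adds latency too, but not much: a modern CPU with hardware AES support can encrypt at line rate. The TLS handshake costs one or two extra network round trips at session startup (one with TLS 1.3, two with TLS 1.2), which is tens to a few hundred milliseconds on most links, and then disappears from the hot path.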

Step 3: The Remote Host Processes the Input

On the remote host — a VDI desktop, a terminal server, a Windows Server RDS host, or a single workstation being remoted into — the remote desktop server receives the input message, deserializes it, and injects it into the OS input queue. On Windows this is SendInput() or the equivalent kernel-level path. On Linux it is usually an X or Wayland event.

From the OS's perspective, the remote input is indistinguishable from a local keyboard or mouse. That is important: it means the OS processes the input with the same latency it would process a local input, and any application that runs on it does not need to know it is being driven remotely.

The host now does whatever the application asks it to do — move a window, start a process, paint a new frame. This happens at the speed the host can do it, which depends on CPU, GPU, memory, and storage just like a physical workstation. If the host is under-resourced, this is where the latency comes from. A lot of "remote desktop feels slow" complaints are actually "the VDI host is under-sized for the workload" complaints.

Step 4: Frame Capture and Encoding

After the OS and application have updated the display, the remote desktop server captures the new pixels and decides what to send back. This is where the engineering gets interesting, because sending every frame at full resolution is prohibitively expensive: a 1920x1080 desktop at 30 FPS and 24 bits per pixel is roughly 1.5 Gbps uncompressed.
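The arithmetic behind that figure is simple enough to check. The function name is ours, and 24 bits per pixel is an assumption; a 32-bit framebuffer comes out about a third higher.

```python
def raw_video_bps(width, height, fps, bits_per_pixel=24):
    """Uncompressed bitrate of a video stream in bits per second."""
    return width * height * bits_per_pixel * fps


bps = raw_video_bps(1920, 1080, 30)
print(f"{bps / 1e9:.2f} Gbps")  # 1.49 Gbps

bps_32 = raw_video_bps(1920, 1080, 30, bits_per_pixel=32)
print(f"{bps_32 / 1e9:.2f} Gbps")  # 1.99 Gbps
```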

Every modern protocol does some combination of:

Region-based updates. Only the parts of the screen that changed are captured and encoded. If you are reading a document that is not scrolling, almost nothing is sent.
Codec-based encoding. H.264, H.265, or AV1 for video-like content, with GPU acceleration where available. JPEG or custom lossless codecs for static regions. Text-aware codecs for crisp font rendering.
Adaptive quality. The encoder measures the return-path bandwidth and adjusts quality dynamically. This is why your session quality drops when your Wi-Fi gets worse — the protocol is trading sharpness for frame rate.
Frame skipping. Under congestion, the protocol drops frames rather than falling behind. An interactive session that is a bit blurry is better than one that is stuttering.
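Region-based updates are the easiest of these to sketch. The example below compares two frames tile by tile and reports which tiles changed. It is a toy: frames are plain lists of pixel values and the tile size is tiny. Real servers typically get damage information from the OS or GPU rather than diffing pixels.

```python
def dirty_tiles(prev, curr, tile=2):
    """Return (top, left) origins of tiles whose pixels differ between two frames.

    Frames are lists of rows; each row is a list of pixel values.
    """
    height, width = len(curr), len(curr[0])
    changed = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            rows = range(top, min(top + tile, height))
            if any(prev[r][left:left + tile] != curr[r][left:left + tile] for r in rows):
                changed.append((top, left))
    return changed


prev = [[0] * 4 for _ in range(4)]
curr = [row[:] for row in prev]
curr[3][2] = 255  # one pixel changes in the bottom-right tile

print(dirty_tiles(prev, curr))  # [(2, 2)]
print(dirty_tiles(prev, prev))  # []
```

Only one of the four tiles needs to be encoded and sent. An unchanged screen produces nothing at all.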

The quality of this encoding step is most of what separates "feels great" from "feels terrible" in remote desktop. Citrix has spent more than 20 years tuning this in ICA and HDX, and VMware has been tuning Blast Extreme since its release in 2016. RDP has improved a lot in the last five years. Parsec took the game-streaming approach and reached sub-frame latency on low-latency links. The choice of protocol is essentially a choice of encoder tuning.
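Adaptive quality and frame skipping, as described in the list above, boil down to fitting each frame into a per-frame bit budget. Here is a minimal sketch of that decision. The quality ladder, its bit counts, and the function name are hypothetical. Real encoders use rate control that is considerably more sophisticated.

```python
# Hypothetical quality ladder: (label, estimated encoded bits per frame),
# ordered from sharpest to blurriest.
QUALITY_LADDER = [
    ("high", 400_000),
    ("medium", 150_000),
    ("low", 50_000),
]


def choose_quality(bandwidth_bps, target_fps, ladder=QUALITY_LADDER):
    """Pick the sharpest level that fits the per-frame budget, or None to skip the frame."""
    budget = bandwidth_bps / target_fps
    for label, bits_per_frame in ladder:
        if bits_per_frame <= budget:
            return label
    return None


print(choose_quality(20_000_000, 30))  # high
print(choose_quality(5_000_000, 30))   # medium
print(choose_quality(2_000_000, 30))   # low
print(choose_quality(1_000_000, 30))   # None: skip this frame rather than fall behind
```

As measured bandwidth falls, the session gets blurrier first and only then starts dropping frames.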

Step 5: Decoding and Display on the Client

The encoded frame arrives at the client, is decrypted, is decoded back to pixels (on GPU where available), and is drawn in the client window. The client then waits for the next frame or input event and the whole loop repeats.

The client decode path matters for thin clients and low-power devices. An old Chromebook trying to decode H.265 in software will struggle. A modern laptop with GPU decode will handle 4K H.264 at 60 FPS without breaking a sweat. When we deploy thin clients for customers we match the decode capability to the protocol — it is not enough to say "any device will work," because the decoder is the bottleneck on cheap hardware.
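Matching decode capability to the protocol can be thought of as a small negotiation: the server offers codecs in order of preference and the session uses the first one the client can decode in hardware. The capability table, device names, and function below are hypothetical placeholders. Check the real hardware decode support of each device model before relying on anything like this.

```python
# Hypothetical capability table: codecs each client can decode in hardware.
HW_DECODE = {
    "old-chromebook": {"h264"},
    "modern-laptop": {"h264", "h265", "av1"},
}


def pick_codec(device, server_preference=("av1", "h265", "h264"), table=HW_DECODE):
    """Return the most preferred codec the device decodes in hardware, or None."""
    supported = table.get(device, set())
    for codec in server_preference:
        if codec in supported:
            return codec
    return None  # no hardware path: expect software decode and plan accordingly


print(pick_codec("modern-laptop"))   # av1
print(pick_codec("old-chromebook"))  # h264
print(pick_codec("unknown-device"))  # None
```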

The Full Round Trip

Start to finish, the click-to-pixel round trip on a well-tuned session over a LAN is about 20 to 40 milliseconds. Over a WAN with 30 ms of one-way network latency, add roughly 60 ms for the network round trip, plus somewhat more encoder and decoder time: typically 90 to 120 ms total. Over a geostationary satellite link with 600 ms of round-trip latency, you are in "barely acceptable for typing, not acceptable for drawing" territory no matter what protocol you pick.

Humans start to perceive input lag above about 50 ms and find it annoying above about 100 ms. Knowing where your round trip lands on that scale is the difference between a remote desktop deployment that gets love and one that gets tickets.
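A latency budget makes this concrete. The sketch below sums the stages of the round trip and places the total on the perception scale just described. The per-stage numbers are illustrative assumptions, not measurements, and the function names and labels are ours.

```python
def click_to_pixel_ms(one_way_network_ms, host_ms, encode_ms, decode_ms, client_ms=0):
    """Sum the round trip: network out and back, plus host, encode, decode, client work."""
    return 2 * one_way_network_ms + host_ms + encode_ms + decode_ms + client_ms


def perceived(total_ms):
    """Classify a total against the rough 50 ms and 100 ms thresholds."""
    if total_ms <= 50:
        return "imperceptible"
    if total_ms <= 100:
        return "noticeable"
    return "annoying"


# Assumed stage costs: 10 ms host, 10 ms encode, 5 ms decode, 5 ms client.
stages = dict(host_ms=10, encode_ms=10, decode_ms=5, client_ms=5)

for name, one_way in [("LAN", 1), ("WAN", 30), ("satellite", 300)]:
    total = click_to_pixel_ms(one_way, **stages)
    print(name, total, perceived(total))
# LAN 32 imperceptible
# WAN 90 noticeable
# satellite 630 annoying
```

Plugging in measured values for a real deployment shows quickly which stage is eating the budget.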

When someone says "my remote desktop is slow," we walk through these five steps in order. Input capture is almost never the problem. Transport is often the problem. Host processing is often the problem. Encoding is occasionally the problem. Decoding is rarely the problem. Knowing where to look saves hours of guessing.
