Network Messages vs. Deltas

Ladna:

I've been working on a netcode implementation in a fork of PrBoom+ called Doom2K that uses game saves and delta compression instead of network messages, and I've reached a point where I'm not only convinced it's viable, I believe that deltas are superior to network messages. I'm making this post in hopes that it will be useful to the greater Doom source port development community.

First, this is not at all a novel idea. To my knowledge, Doomsday also uses deltas, and the original idea came from Carmack in Quake 3. So, credit where credit is due.

Second, there are significant but manageable costs. A singleplayer client running MAP01 uses < 24% CPU on my machine (Core 2 Duo P8600 - 2.4 GHz); a delta client uses < 30%. The increase will be larger for more complicated maps. Memory usage increases as well, though arguably insignificantly: servers have to keep between 1 and N save files in memory (where N = # of clients), so perhaps the worst case scenario is a 32 client game on a Sunder map. Saves on MAP06 are roughly 165KB, which works out to around 5.3MB for a 32 client game; in the grand scheme of things, an extra ~6MB serverside is insignificant. Bandwidth is comparable to (and often less than, because the nature of deltas is to only send new information) that of traditional message-based solutions.

Third, there are downsides. The main downside is that the client always loads the latest state from the server, so it's possible, for example, for a cyberdemon's rocket to spawn far out in front of it instead of right next to it if your local client has high latency or packet loss. There are ways to mitigate this. In this specific example, projectile movement can be implemented using vectors + start time instead of current position + momentum, and then some kind of non-linear speed curve can be applied. Regardless, each case is special (player spawns, etc.) and likely requires a specific solution.
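
To make the vector + start time idea concrete, here's a minimal sketch in C (with invented names -- this isn't D2K's actual API): the projectile's position becomes a pure function of its spawn parameters and the current TIC, so a late or corrected state still puts it on the same path.

    #include <stdint.h>

    #define TICRATE 35

    typedef struct {
      float start_x, start_y, start_z; /* position at spawn */
      float dir_x, dir_y, dir_z;       /* unit direction at spawn */
      float speed;                     /* map units per second */
      uint32_t start_tic;              /* world index at spawn */
    } proj_t;

    /* Derive the projectile's position at world index `tic` from its
     * spawn parameters alone; no current position/momentum is stored. */
    static void proj_position_at(const proj_t *p, uint32_t tic,
                                 float *x, float *y, float *z) {
      float t = (tic - p->start_tic) / (float)TICRATE; /* seconds in flight */
      /* A non-linear speed curve could be applied to t here to soften
       * late spawns under latency, as suggested above. */
      float d = p->speed * t;

      *x = p->start_x + p->dir_x * d;
      *y = p->start_y + p->dir_y * d;
      *z = p->start_z + p->dir_z * d;
    }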

There are significant benefits as well. Desyncs are resolved (depending upon your latency and packet loss) nearly immediately; ghost monsters, moving player corpses, hanging rockets, etc. are gone for good. Local prediction is immediate and seamless. There is no lag when awakening monsters or activating lifts; rockets and plasma spawn immediately; opponents die immediately. There is no local jitter or jerking, even on moving sectors. Full prediction means that even clients with high (> 100ms) pings will feel just as good as clients with LAN pings (even though other problems resulting from high latency, i.e. having to aim ahead of opponents, remain). In testing Doom2K, I routinely use 300ms latency with >= 10% packet loss.

I contend that the delta model is superior to the message model in many ways (except perhaps resource usage), but primarily because deltas are more consistent. With network messages, the client will display a world built from information belonging to different indices unless the server stamps every message with a world index (TIC), the server provides a reliable way of telling the client that it is finished sending updates for a given world index, and the client buffers messages until it receives that "updates for world index N finished" message. It's possible and even likely that the client displays an entirely inconsistent world, which results in things like jitter while on moving sectors, or desyncs. Furthermore, once a port implements the stamps, the "updates finished" message, and the clientside buffer, it has essentially implemented deltas, albeit poorly. I realized this while implementing EE's C/S netcode and trying to ensure a consistent world.
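
For illustration, here's roughly what that stamping-and-buffering machinery looks like (a hedged sketch in C; all names are invented, not taken from any port): nothing for world index N is applied until the server has marked N finished.

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_UPDATES 256

    typedef struct {
      uint32_t tic; /* world index stamped by the server */
      /* ... payload describing one entity/sector update ... */
    } update_t;

    typedef struct {
      update_t updates[MAX_UPDATES];
      size_t count;
      uint32_t finished_tic; /* highest index the server marked complete */
    } update_buffer_t;

    static void apply_update(const update_t *u) {
      (void)u; /* port-specific: poke the update into the game state */
    }

    /* Apply only updates belonging to world indices the server has
     * finished sending; hold everything else back. */
    static void apply_complete_tics(update_buffer_t *b) {
      size_t i, kept = 0;

      for (i = 0; i < b->count; i++) {
        if (b->updates[i].tic <= b->finished_tic)
          apply_update(&b->updates[i]);
        else
          b->updates[kept++] = b->updates[i]; /* keep incomplete tics */
      }
      b->count = kept;
    }

Once a port is doing this, the set of buffered updates for a finished index is effectively a hand-assembled delta.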

Modifying a port to support deltas is far less intrusive than modifying it to support network messages. Supporting asynchronous networking (i.e., clientside prediction) requires some serious modification of TryRunTics and P_PlayerThink either way, but aside from that, I essentially just pulled out the savegame building logic, made savegames portable, reworked file handling (PWADs, DeH/BEX patches) and disabled sounds when re-running TICs. Network messages tend to be scattered throughout the codebase, and client behavior differs from server behavior to the point where Odamex separated client and server in the name of clarity. Deltas do not require this. In addition, deltas require only a small amount of code: Doom2K's netcode is < 5k LOC, with the help of libraries (ENet, libXDiff, cmp) and some simple data structures.
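
In sketch form, the serverside loop is short. This is a minimal illustration, not D2K's actual code: save_game_state(), delta_compute() and net_send_unreliable() are hypothetical stand-ins for the savegame writer, the libXDiff-based differ and the ENet send.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint8_t *data; size_t len; } buf_t;

    /* Hypothetical stand-ins: */
    void save_game_state(buf_t *out);
    void delta_compute(const buf_t *from, const buf_t *to, buf_t *out);
    void net_send_unreliable(int client, const buf_t *msg);

    typedef struct {
      buf_t last_acked_state; /* newest state this client confirmed */
    } client_t;

    void sv_send_deltas(client_t *clients, int num_clients) {
      buf_t current, delta;
      int i;

      save_game_state(&current); /* serialize the world once per TIC */

      for (i = 0; i < num_clients; i++) {
        /* Each client gets the diff against the last state it ACK'd,
         * which is why the server holds at most one save per client. */
        delta_compute(&clients[i].last_acked_state, &current, &delta);
        net_send_unreliable(i, &delta);
      }
    }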

The mental model of deltas is also much simpler than that of network messages. Network messages require the programmer to constantly consider sync, and clients must run only parts of the game simulation, and only at certain times. This, as ZDaemon, Zandronum and Odamex have discovered, is tiresome and error prone. In contrast, deltas allow the client to run the game simulation (almost) entirely as normal; the programmer does not have to worry about sync, timing, world indices, network IDs, etc., and everything remains synchronized and consistent. This allows developers to focus on adding features instead of continually squashing yet another complicated desync bug. Odamex has > 100 network messages; Doom2K plans 9 (perhaps 10) and currently uses only 3. While this comparison is unfair -- Odamex offers lots of functionality that Doom2K doesn't, and Doom2K uses libraries to implement some of Odamex's functionality -- it should be noted that Doom2K achieves full sync right now with only 3 messages. Odamex would require dozens of messages to achieve the same thing.

Because the mental model is so much simpler, questions such as "how do you implement ACS in a C/S port?" are answered simply: serialize the game state just like you would in a save game, and send the delta to your clients.

You can find the code at https://github.com/camgunz/d2k, in the netcode branch. Doom2K is still highly experimental and not at all ready for general consumption; it is useful now only as a guide for source port developers. Diffs against PrBoom+ will be minimally useful at best, as I've performed tons of reformatting to improve the readability of the code. The juicy stuff is in the n_*, m_delta, p_user, p_tick and p_saveg modules.

Dr. Sean:

Sending changes in an entity's state is indeed a far better scheme than periodically sending an entity's full state. It allows for superior decoupling of the game simulation from the network serialization, as entity states are simply serialized at the end of the game loop.

I employ a similar scheme in a new branch I am developing based on the TRIBES engine architecture. In my approach, the gameplay simulation sets a flag whenever it changes the state of an entity. For example, if P_ZMovement() changes an entity's Z position, P_ZMovement calls a function to set a flag recording that the Z position was changed during this gametic. Under the hood, each entity has a circular buffer of bitfields, with a different bitfield for each of the last N gametics, and each bit of the bitfield represents whether a specific field of an entity's state was modified in that gametic (e.g., Z position). Thus when sending a client the delta state of an entity between gametics 50 and 55, that entity's bitfields for gametics 50 through 55 are bitwise AND'D together to get a list of all fields that were modified and need to be serialized.
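
A compact sketch of that bookkeeping (in C, with illustrative names only -- this isn't the actual branch's code; note the combining step, per the correction downthread, should be an OR):

    #include <stdint.h>

    #define HISTORY 64 /* bitfields kept for the last N gametics */

    enum {
      FIELD_X     = 1 << 0,
      FIELD_Y     = 1 << 1,
      FIELD_Z     = 1 << 2,
      FIELD_ANGLE = 1 << 3
      /* ... one bit per replicated field ... */
    };

    typedef struct {
      uint32_t dirty[HISTORY]; /* ring buffer indexed by gametic */
    } entity_dirty_t;

    /* Clear the slot being reused when a new gametic begins. */
    static void dirty_begin_tic(entity_dirty_t *e, uint32_t gametic) {
      e->dirty[gametic % HISTORY] = 0;
    }

    /* Called from the simulation, e.g. when P_ZMovement changes Z. */
    static void dirty_mark(entity_dirty_t *e, uint32_t gametic,
                           uint32_t field) {
      e->dirty[gametic % HISTORY] |= field;
    }

    /* Combine the bitfields for gametics (from, to] to find every field
     * that needs serializing for a client last updated at `from`. */
    static uint32_t dirty_between(const entity_dirty_t *e,
                                  uint32_t from, uint32_t to) {
      uint32_t mask = 0;
      uint32_t t;

      for (t = from + 1; t <= to; t++)
        mask |= e->dirty[t % HISTORY];
      return mask;
    }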


This paper describes the architecture of the TRIBES engine and subsequent TNL library.
http://www.pingz.com/wordpress/wp-content/uploads/2009/11/tribes_networking_model.pdf
Video and slides for similar information: http://coderhump.com/austin08/

This paper describes Doom 3's architecture, which is an advancement over Quake 3 Arena's.
http://mrelusive.com/publications/papers/The-DOOM-III-Network-Architecture.pdf

This video describes Halo's architecture, which is based heavily on the TRIBES architecture.
http://www.gdcvault.com/play/1014345/I-Shot-You-First-Networking

Dr. Sean said:

Thus when sending a client the delta state of an entity between gametics 50 and 55, that entity's bitfields for gametics 50 through 55 are bitwise AND'D together to get a list of all fields that were modified and need to be serialized.

You mean OR'D together, surely?

Ladna:

Yeah sorry, I didn't mean to bang on Odamex, it's just the c/s port I know the most about :)

The Tribes architecture strikes me as a way to have very fine-tuned control over bandwidth and priority, but I think it's ultimately a network message system that's just more powerful, and thus more complex. There's no way I would implement all that myself, and I assume you're gonna build on TNL (so old though... does it even work?) or TNL2 (kind of old...?).

I'm not sure how you would implement any kind of scripting with TNL. I guess there's some RPC functionality, but the latency on that would worry me.

I'm also not sure how TNL maintains a consistent world. I might be overvaluing this, but I think it's critical for unlagged.

Finally, the Tribes netcode architecture pretty much requires C++. For a lot of Doom ports that's no problem, but for projects looking to extend Chocolate Doom or PrBoom, that's a big change, and kind of a non-starter for non-C/C++ ports (like Mocha Doom, for example).

===

There are a couple major differences between Doom2K's delta implementation and Doom III's. First, Doom2K doesn't use a PVS; the goal is to present a full, consistent world, not just the part you can see. I'm not opposed to the idea if bandwidth or resource usage become problematic, but fast stream compression is probably the first solution I'd try.

Second, Doom2K's delta compression uses VCDIFF, which is essentially a handful of commands that tell a receiver how to mix bytes around to make the new file. This means that compression is usually far better than 32:1.

Third, I think bit packing is a waste of effort. Just use fast stream compression. It will (probably?) do a better job anyway.

This is kind of outside the discussion, but ENet has support for multiple channels. Doom2K only specifies 2, one for reliable packets and one for unreliable packets, and it currently only uses the unreliable one. Eventually the reliable channel will be used for things like auth, server/client messages, votes, RCON, etc. ENet doesn't make you do this; you can send unreliable and reliable packets in the same channel; I just do it this way for conceptual clarity. Reliability is nice because, unlike the state deltas, I don't have to keep track of receipt myself. I can just have ENet resend things, and for non-latency-critical messages (i.e., stuff I don't send as fast and as often as possible until the other side confirms receipt manually), that's fine.

Which is all to say that I guess I was a little misleading; Doom2K uses messages (of course), it just uses a big delta message to synchronize state between all the clients & the server. I don't think there's any debate about the best way to implement voting or RCON or whatever though, haha. No one is like, "Ugh the lag on this RCON is sooooo bad". I'm only arguing that state deltas are the best way to synchronize game state.

Dr. Sean:
Ladna said:

...I assume you're gonna build on TNL (so old though... does it even work?) or TNL2 (kind of old...?).


I'm writing something from scratch but using the TRIBES white-paper as a basis for the message reliability system. There are some things I want to do that just looked too difficult to shoehorn into the TNL libraries.

I am looking to have the network message format described in a data-definition language in such a way that the client and server can compose message writing and parsing objects from the format definition. When a client connects to a server, the server sends its message format definition to the client and the client then builds objects to parse each type of message. This would allow the client to connect to a server that uses a different message protocol version. More importantly, it would allow clients to play network demos recorded with older versions automatically.
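
As a purely hypothetical illustration (the definition syntax and every name below are invented), the client might compose a parse table like this from the server's shipped definition:

    #include <stddef.h>
    #include <stdint.h>

    /* What a shipped definition might say, as data rather than code:
     *
     *   message 7 "mobj_pos" { fixed x; fixed y; fixed z; angle ang; }
     */

    typedef enum { FT_FIXED, FT_ANGLE, FT_U8, FT_U16 } field_type_t;

    typedef struct {
      const char *name;
      field_type_t type;
      size_t offset; /* where the decoded value is stored */
    } field_def_t;

    typedef struct {
      uint8_t id;
      const char *name;
      const field_def_t *fields;
      size_t num_fields;
    } message_def_t;

    /* The client composes one message_def_t per message type from the
     * server's definition, then parses any packet by walking the table
     * instead of running version-specific code. */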

Ladna said:

I'm also not sure how TNL maintains a consistent world. I might be overvaluing this, but I think it's critical for unlagged.


TNL achieves world consistency by always sending the changes in entity state since the last acknowledged state - the same as Q3A.

Ladna said:

There are a couple major differences between Doom2K's delta implementation and Doom III's. First, Doom2K doesn't use a PVS; the goal is to present a full, consistent world, not just the part you can see. I'm not opposed to the idea if bandwidth or resource usage become problematic, but fast stream compression is probably the first solution I'd try.


PVS does become an issue with cooperative play, where you have 300 monsters and an assload of projectiles flying everywhere. I plan on using PVS even for DM or CTF in the interest of reducing network latency by transmitting smaller packets.

What I've read while researching the topic is that routers do not subject small, fixed-size packets transmitted at regular intervals to as much buffering (read: additional latency) as irregularly sized packets when the router comes under load. So my goal is to send 35 packets per second at 64kbps with a fixed packet length of 228 bytes (padding if a packet is smaller). When the connection between the client and server appears to become congested, the packet length and the packet frequency are temporarily reduced to avoid those unseemly 2-3 second latency spikes that occur when the router's incoming packet buffer gets filled.
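
The arithmetic checks out: 228 bytes x 8 bits x 35 packets/s = 63,840 bps, just under the 64kbps target. The padding step itself is trivial; a sketch with invented names:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define PACKET_LEN 228 /* fixed on-the-wire size */

    /* Zero-fill every outgoing payload up to the fixed length so the
     * router sees a steady stream of identically sized packets. */
    static size_t pad_packet(uint8_t buf[PACKET_LEN], size_t payload_len) {
      if (payload_len < PACKET_LEN)
        memset(buf + payload_len, 0, PACKET_LEN - payload_len);
      return PACKET_LEN;
    }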

Ladna said:

Which is all to say that I guess I was a little misleading; Doom2K uses messages (of course), it just uses a big delta message to synchronize state between all the clients & the server. ... I'm only arguing that state deltas are the best way to synchronize game state.


Agreed. It looks like you're sending delta states for the entire game world to keep things conceptually simple and I'm sending them on a per-entity basis for finer-grained control. Both methods will get you there. The only important thing is to move away from manually sending network messages for entities like it's 1997.

andrewj:
Dr. Sean said:

I am looking to have the network message format described in a data-definition language in such a way that the client and server can compose message writing and parsing objects from the format definition. When a client connects to a server, the server sends its message format definition to the client and the client then builds objects to parse each type of message.

That will no doubt be an interesting project for you to work on.

But look at the bigger picture for a minute: is it really worth doing a system like that? Do users really care about network demos that much?

You can get a bit of future-proofing by adding some reserved fields to the entity structure (etc.) with support in the protocol for sending them. Since they are never used (for now), they will not cause any significant increase in network traffic. But it is about 50x easier to implement than the system you propose. And if you do need to break the protocol at some point, supporting one or two protocol variants in the code is not overly burdensome.
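
Concretely, the reserved-field idea looks something like this (struct and field names invented for illustration):

    #include <stdint.h>

    typedef struct {
      int32_t  x, y, z;     /* fixed-point position */
      uint32_t angle;
      int32_t  momx, momy, momz;
      uint32_t reserved[4]; /* always sent as zero today; a future
                               version can assign meaning to these
                               without breaking the wire format */
    } net_mobj_state_t;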

My experience is with EDGE, where I implemented a quite sophisticated save-game system in which each structure (all the fields) was described in the save file. In principle it allowed old save games to be loaded by newer engines (with modified/extended structures like mobj_t). In practice I found a few things: (1) users didn't really care much about being able to load old save games, (2) loading old save games often didn't work that well, since the missing fields often carried important information, and (3) I ended up changing the compression codec, which broke all compatibility anyway.

Ladna:

That packet stuff is interesting; you're probably right but I'm not sure I care enough to try and do that haha.

andrewj said:

Do users really care about network demos that much?


Yeah, at least the competitive community does. I agree about save games, plus it's probably not too hard to write a little utility that upgrades old saves.

I do think there's a lot of value in DDL'ing saves and demos though. A lot of those tools provide methods to ensure compatibility, versioning, optional fields, default values, etc. Implementing this stuff by hand is tedious and error-prone, and I'd imagine most source port developers will just say "fuck it" and drop compatibility... OR have 7 slightly different functions called depending upon save version (ugh). It also makes it easier to support cross-platform saves & demos, something that's pretty important in competitive gameplay.

Demos in particular are supposed to be preserved for posterity, so the more we can maintain compatibility the better.

Of course, with deltas, DDL'ing the save is the same as DDL'ing the demo (more or less). Without deltas, things get more complicated.

Dr. Sean said:

TNL achieves world consistency by always sending the changes in entity state since the last acknowledged state - the same as Q3A.


Yeah, but can't the client still receive messages for some entities and update them while missing messages for other entities? Or does it wait until it gets a full TIC's worth of updates -- and then how does it know?

Dr. Sean said:

The only important thing is to move away from manually sending network messages for entities like it's 1997.


LOL yes.

kb1:

I am massively interested in what you are describing, but I must admit that your description flew over my head.

Is the idea that a client connects to the server, downloads the latest save game, and then downloads and applies a set of deltas from the server, and is therefore in sync? And, from that moment on, transceives deltas of all mobj/plat property changes, frame by frame, instead of transceiving player movement actions?

If that's true, then, if a packet is lost, or there is lag, the monsters/plats will move using normal Doom logic, and then be "corrected" when the actual delta packet arrives, thereby allowing each client to run top speed, without waiting for packets.

Is that what you are describing? If so, it sounds quite interesting!

Now, how does this handle inconsistency? Each client runs its own simulation, which can desync. How does this system correct that exactly?

(This is mind-boggling stuff :)

Ladna:

Yeah you've got it exactly.

If the client "mis-predicts", i.e. runs the game simulation differently than the server does, it is corrected whenever a new delta is received. This actually happens quite often, with either network messages or deltas, so any networked port has to have a robust method for correcting desyncs.

Oda/ST/ZD's method is to send a "full update" every so often, but that full update isn't even full (because it would be gigantic), so while it can solve some problems, it isn't a complete fix.

In D2K, there are two main indices: gametic and command index. Whenever a client receives a delta from the server, it also receives the gametics the delta moves between (i.e., "this delta is the difference between gametic M and gametic N") and the index of the latest command the server has received from it. This way, the client can generate a new state using the previous game state and the delta, load that new state, and re-run any commands the server hasn't yet received. D2K calls this "catching up", but it's just re-running the clientside prediction. Afterwards, the client can generate a new command, run it, send it to the server, rinse and repeat.
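
In outline, the catch-up step looks something like the following; the function names are stand-ins, not D2K's actual routines.

    #include <stdint.h>

    typedef struct { uint32_t index; /* ... player input ... */ } ticcmd_s;

    /* Hypothetical stand-ins: */
    void load_state(uint32_t gametic);           /* previous state + delta */
    void run_command(const ticcmd_s *cmd);       /* one TIC of prediction */
    const ticcmd_s *local_command(uint32_t idx); /* saved local commands */

    void cl_catch_up(uint32_t delta_to_tic,
                     uint32_t last_cmd_server_ran,
                     uint32_t latest_local_cmd) {
      uint32_t i;

      /* The delta moved us to the server's state at gametic N, which
       * already reflects our commands up to last_cmd_server_ran. */
      load_state(delta_to_tic);

      /* Re-run every local command the server hasn't run yet; this is
       * just clientside prediction happening again. */
      for (i = last_cmd_server_ran + 1; i <= latest_local_cmd; i++)
        run_command(local_command(i));

      /* The caller then generates a new command, runs it, sends it. */
    }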

===

If you're talking about maintaining consistency between clients, you have to synchronize their command generation and 'TIC running' (game simulation running). This is what Doom's original netcode does, and essentially why it's so inflexible (in-game joining -- the other huge shortcoming of the original netcode -- isn't at odds with synchronous netcode).

kb1:

I cannot claim to be very experienced, but it seems to me that the simulation could be given last-agreed-upon (LAU) player positions and firing states, including the local player's. The local player would see their screen move fluently, but the underlying simulation would only "see" the player in the last-agreed-upon position. That would keep all of the monster simulation in sync. However, this would cause local lag in weapon firing/door opening. But that may be acceptable to some degree (200ms to see your weapon fire, or a door open).

The player would "feel" like they were moving fluently, and the screen would render fluent movement, but the underlying simulation would see the player move in chunks. Player movement prediction would show a fluid player on other clients, but their simulation would use the agreed-upon position.

It's sort of a hybrid between the vanilla "peer-to-peer" type approach, and a pseudo-prediction, client fake-out. It makes no difference if a monster sees you at tic 1200, or at tic 1204. But, the player will notice if their movement is choppy.

I think it is vital for the simulations to stay in sync at all costs, especially for coop 100+ monster slugfests. You can fake the client view pretty well, but you can't run the simulation backwards, unless you store a massive history. I don't think you have to.

In most cases, you can hide latency:
Player movement: Fake local player view, use LAU mobj position for local and remote simulation.

Weapons: Each non-hitscan weapon has a spin-up time - could hide some latency there.

Doors: Some lag in door opening can be reasonable (< 200 ms ok?)

Hitscan weapons are the biggest issue, but, again, some lag is acceptable. Force hitscans to be agreed upon, and the simulation stays in sync. It's regrettable to be affected by lag here, but the gun flash could hide a small amount of latency.

On a LAN, there's no reason for the simulation to ever go out of sync. Those issues should be corrected 100% before attempting to hide lag. And, I believe that, at reasonable latency, that can be hidden without having to correct clients.

But, yes, this is very hard to achieve, and your approach is slick, in that it can correct desyncs in a less-than-100% synchronized environment.

What could be interesting is to implement both approaches: First, strive to keep all clients in sync at all times. But, if a desync does occur, use your method to maintain recent saves + applied deltas, to get the client up and running quickly. And, the save+delta approach allows new clients to join midgame, with a minimum distraction.

Wow, maybe I do know something about it?? Hell, it's an interesting thing to try anyway. I'll have to write some code to fake latency on my LAN to try it out!

Hope that was helpful. I'd love to hear some comments on my approach, from coders more experienced than I am. Let us know how it turns out. You've already got some of this working, right? Is it completed, or in an alpha stage?

Maes:
kb1 said:

If that's true, then, if a packet is lost, or there is lag, the monsters/plats will move using normal Doom logic, and then be "corrected" when the actual delta packet arrives, thereby allowing each client to run top speed, without waiting for packets.


Actually, due to the way Doom works, the only packets that are really needed are the players', because their movements are the only truly unpredictable factor; everything else (monsters, plats, etc.) is 100% predictable, provided the client and server use the exact same engine (or 100% equipotent ones) and the players' actions are followed correctly by all clients.

That doesn't stop pure MP/DM games on e.g. ZDaemon to go out of sync, however.

Maes said:

That doesn't stop pure MP/DM games on e.g. ZDaemon to go out of sync, however.

This tends to be because the client is already imperfect by the time it connects (lack of proper RNG table info, missing mobj info, improper gametic timing), coupled with the corners cut to send all the information about active mobjs, plats, fixed-point coordinate culling (only sending the top 16 bits), etc. The clients don't really have a chance to be in sync in the first place.

Ladna:
kb1 said:

I cannot claim to be very experienced, but it seems to me that the simulation could be given last-agreed-upon (LAU) player positions and firing states, including the local player's. The local player would see their screen move fluently, but the underlying simulation would only "see" the player in the last-agreed-upon position. That would keep all of the monster simulation in sync. However, this would cause local lag in weapon firing/door opening. But that may be acceptable to some degree (200ms to see your weapon fire, or a door open).


The acceptability of that kind of lag depends upon your point of view. In coop, it might be OK for lifts to be lagged, but in a fast-paced 1v1, the disadvantage is significant enough to be completely unfair. Ditto for stuff like rockets/plasma/weapon pickups, etc.

FWIW, this is how Oda/ST/ZD work right now. The server sends "LAU" positions for everything once in a while, and then the client re-predicts its local position from there (Server: Your position at TIC 1240 was x/y/z; Client: OK, I am at TIC 1245 so set my position to x/y/z and run 5 more commands until I catch back up).

kb1 said:

The player would "feel" like they were moving fluently, and the screen would render fluent movement, but the underlying simulation would see the player move in chunks. Player movement prediction would show a fluid player on other clients, but their simulation would use the agreed-upon position.


This is also kind of how Oda/ST/ZD work as well. Clients send their commands to the server, but they don't necessarily arrive at the proper timing; i.e., the client may send commands 1241, 1242, 1243 and 1244 28.5ms apart, but the server may receive them all at TIC 1246. You have a choice at that point: buffer the commands and run 1 per TIC, run them all, or run them at some other rate (2 per TIC or whatever).

The decision you make doesn't make a huge difference to the sending client because of clientside prediction, but if you run 4 commands for a player, every other player is going to see that player move 4x as fast... unless you implement some kind of smoothing... which is by definition inaccurate.
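
That policy knob reduces to a few lines (sketch with invented names): cap how many queued commands run per TIC, trading added latency against how distorted the sender's movement looks to everyone else.

    #include <stddef.h>

    #define QUEUE_SIZE 64
    #define MAX_CMDS_PER_TIC 2 /* 1 = strict buffering; big = run them all */

    typedef struct { /* ... one TIC of player input ... */ int buttons; } ticcmd_s;

    typedef struct {
      ticcmd_s cmds[QUEUE_SIZE];
      size_t head, tail; /* ring buffer of not-yet-run commands */
    } cmd_queue_t;

    void sv_run_player_cmd(int player, const ticcmd_s *cmd); /* stand-in */

    void sv_run_queued_cmds(int player, cmd_queue_t *q) {
      size_t ran = 0;

      while (q->head != q->tail && ran < MAX_CMDS_PER_TIC) {
        sv_run_player_cmd(player, &q->cmds[q->head]);
        q->head = (q->head + 1) % QUEUE_SIZE;
        ran++;
      }
    }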

kb1 said:

It makes no difference if a monster sees you at tic 1200, or at tic 1204.


Ohhhh but it does. For example, a lot of the simulation depends on the value of "leveltime"; if I run TIC 1241 with a different leveltime value than the server, desyncs are highly likely, and without something to reconcile those desyncs (you can get lucky with state/position updates... but at some point a "full" update will be required) the client will be in an irrecoverable state. There are very few things that don't impact the game state, which is why I'm skeptical of PVS filtering. Like Dr. Sean said, it's most useful for coop, but I would have to think pretty hard about it... which I haven't done.

kb1 said:

I think it is vital for the simulations to stay in sync at all costs, especially for coop 100+ monster slugfests. You can fake the client view pretty well, but you can't run the simulation backwards, unless you store a massive history. I don't think you have to.


You don't necessarily have to, so long as you perfectly exempt the game simulation during clientside prediction and perfectly update the simulation based on server messages. In practice, this is very hard to ensure.

For example, let's say the server sends the message that an Imp fired a fireball and a Sergeant moved into its path at TIC 1242.

First of all, it's not a given that clients & servers receive messages, because UDP doesn't guarantee delivery. I can make those messages "reliable" by waiting for the client/server to acknowledge receipt, and resending those messages if I don't get that... but how long do I wait? Your average ping (round-trip) in North America is something like 60-70ms, almost 3 TICs. So if the client doesn't ACK the fireball message by TIC 1245, I have to resend it, and then wait until TIC 1248, and so on. You can see how latency can pile up, especially for connections that are already bad.
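
The pile-up is easy to see in code (sketch, invented names): every retry costs at least one more round trip before the receiver can possibly have the data.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
      uint32_t sent_at_tic;
      uint32_t rtt_tics; /* measured round trip in TICs (~2-3 in NA) */
      bool acked;
    } reliable_msg_t;

    void resend(reliable_msg_t *m); /* stand-in */

    void check_resend(reliable_msg_t *m, uint32_t now_tic) {
      /* No ACK after a full round trip: assume the message (or its ACK)
       * was lost and try again; each retry stacks on another RTT. */
      if (!m->acked && now_tic - m->sent_at_tic >= m->rtt_tics) {
        resend(m);
        m->sent_at_tic = now_tic;
      }
    }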

Back to our example. Let's say for some reason I drop the "Sergeant moved" message and the fireball continues to hurtle towards me. Desync! The question now is how the netcode resolves the problem. There are usually messages to damage/kill/remove actors and set their targets, so (unless I drop those too), this desync probably won't be too bad: I'll get a message that the Sergeant was damaged, that its target changed to the Imp (infighting!!!) and that the fireball has been removed from the game. But if I miss even one of those, the game simulation will start to spiral out of control. This desync resolution strategy is clearly not ideal.

You may argue that dropping messages is rare and netcode shouldn't be expected to handle something as ridiculous as data just disappearing into the Internet. The reality is that packets drop even on wired LANs, and netcode has to deal with it, or suffer the consequences of desyncs. Furthermore, Doom has become international, and players are often connecting to each other overseas or through cellular and satellite connections. Will their experience be degraded? Probably. But as a netcode developer, you ignore them at the peril of your port's popularity.

kb1 said:

In most cases, you can hide latency:
Player movement: Fake local player view, use LAU mobj position for local and remote simulation.

Weapons: Each non-hitscan weapon has a spin-up time - could hide some latency there.

Doors: Some lag in door opening can be reasonable (< 200 ms ok?)

Hitscan weapons are the biggest issue, but, again, some lag is acceptable. Force hitscans to be agreed upon, and the simulation stays in sync. It's regrettable to be affected by lag here, but the gun flash could hide a small amount of latency.


Latency hiding is an interesting topic, and there is some stuff on the Internet about it. Doom can't really use it though, and I'll illustrate. When you fire the SSG, your local client (assuming it implements clientside prediction) starts going through the motions of firing: it moves your player sprite through the firing animation, makes the firing sound, maybe even spawns the bullet puffs or blood spots on its own (without waiting for the server). It then waits for the server for the reaction of anything hit by your pellets, damage calculation, line/monster activation, etc.

Latency hiding here would be the server sending everyone the message that you fired before you actually fire. Doom can't do this because the server also has to move your player through the firing frames. It can't just send everyone a message [Player 2 fired the SSG at TIC 1243] based on your command at TIC 1241, because you might move or die between now and then.

kb1 said:

On a LAN, there's no reason for the simulation to ever go out of sync. Those issues should be corrected 100% before attempting to hide lag. And, I believe that, at reasonable latency, that can be hidden without having to correct clients.


Yeah a LAN is an ideal situation, but things can still go wrong. Old machines can experience CPU lag on complex maps, etc. In fairness, there's not a lot you can do about that.

kb1 said:

What could be interesting is to implement both approaches: First, strive to keep all clients in sync at all times. But, if a desync does occur, use your method to maintain recent saves + applied deltas, to get the client up and running quickly. And, the save+delta approach allows new clients to join midgame, with a minimum distraction.


Well, clients have to know they've desync'd, which is pretty hard (Server: [Remove fireball]; Client: Shit, what fireball... OK now what...?). You also have to take latency into account. Assuming clients can detect a desync (which again, is very hard), they have to request a full update, and using deltas, they have to know the "last good sync point" so they can request an applicable delta, and then they have to wait for the server to send it, then acknowledge they received it, and before you know it, 10 TICs have elapsed and you're dead anyway.

Furthermore, deltas grow in size the longer you go without sending one; i.e., the delta between TICs 1241 and 1242 might be 80 bytes, and the delta between TICs 1241 and 1243 might be 120 bytes. If you do the research (as crazy old Dr. Sean did), you find out that you can't just send shitloads of bytes whenever you want; even if your server has something like 100mbit/s upload, routers will freak out at you and start dropping your packets on purpose (yes, this is how the Internet is actually designed to work, believe it or not). So you have to do some complicated stuff to keep your deltas under a certain size and then "flush the buffer", as it were, when they approach that limit. At that point, you may as well just send the delta every TIC anyway.
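
That size-budget logic reduces to something like the sketch below (invented names and numbers), and its logical endpoint is exactly the "send a delta every TIC" conclusion above:

    #include <stddef.h>
    #include <stdint.h>

    #define DELTA_BUDGET 1200 /* illustrative per-packet byte budget */

    size_t delta_size_between(uint32_t from, uint32_t to); /* stand-in */
    void   send_delta(uint32_t from, uint32_t to);         /* stand-in */

    /* Flush before the accumulated delta outgrows what the routers in
     * the path will pass without punishing us. */
    uint32_t maybe_flush(uint32_t last_sent_tic, uint32_t current_tic) {
      if (delta_size_between(last_sent_tic, current_tic) >= DELTA_BUDGET) {
        send_delta(last_sent_tic, current_tic);
        return current_tic; /* new "last sent" point */
      }
      return last_sent_tic;
    }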

kb1 said:

Wow, maybe I do know something about it?? Hell, it's an interesting thing to try anyway. I'll have to write some code to fake latency on my LAN to try it out!


Yeah! Be careful though, netcode can kind of be a rabbit hole. We have yet to really get into things like unlagged, player movement smoothing, clientside prediction, collision smoothing, that kind of thing. You may find out it's way more work than you bargained for ;) I know I feel that way sometimes haha.

Dr. Sean linked some good stuff; you can also google around for Yahn Bernier, the guy in charge of netcode for Valve games.

I know Dr. Sean has a lag tool he wrote, but I use Zalewa's baller Gamer's Proxy. I thought I might write one in Go, but why?

kb1 said:

You've already got some of this working, right? Is it completed, or in an alpha stage?


I just merged D2K's netcode back into the master branch. Again, it's not at all ready for public consumption; it needs a lot more testing and I've undoubtedly broken some things, but I'll be around to brag once I get it presentable ;) I will say that it performs amazingly in testing, but I'm biased.

EE's C/S netcode is based on network messages. I tried to ameliorate the problems with that architecture in various ways, and I ended up implementing crappy deltas. At this point, the C/S branch is extremely outdated. My current plan is to get D2K's netcode production-ready and cleaned up, port it to the latest EE, and send Quasar a pull request on GitHub. There is some stuff in the C/S branch that is still useful (WAD downloading, master advertising, banlists, etc.), but the network message architecture is just "doomed" to failure (heh).

Maes said:

Actually, due to the way Doom works, the only packets that are really needed are the players', because their movements are the only truly unpredictable factor; everything else (monsters, plats, etc.) is 100% predictable, provided the client and server use the exact same engine (or 100% equipotent ones) and the players' actions are followed correctly by all clients.


If you synchronize every TIC then yes, this is the case. That's how the original netcode worked. If you allow players to run the game sim, or their local commands, independently (asynchronously), then you enter this rat's nest.

kb1:

Thanks for the very detailed response. I think I now understand what you're trying to accomplish (I'll reiterate further down this post).
First, some definitions:
simulation - The behind-the-scenes game state update of all mobjs, lifts, etc (TryRunTics)

presentation - The rendering of those mobjs, doors, lifts, etc, to the display, from the player's point-of-view.

Normally, these two use the same dataset: The screen is rendered by directly using the current mobj state, and the current player position and angle.

What I was trying to describe (which may not work well) is this:
1. Decouple the simulation data from the presentation data.
2. Run the simulation using the "normal" dataset, ensuring that each client runs the exact same simulation (by sending player positions).
3. "Fake" the presentation dataset in key areas:
- let the client use local controls to adjust "local dataset" player position and view angle
- begin weapon frames on the client immediately after pressing fire
- paint the screen using this "presentation dataset".

When the local player fires, make it look like it is occurring, and send it across the network, but do not update the simulation until all clients (including local) actually receive the fire command.

With this approach, the simulation is always in sync, but the lag is hidden. The local player's view is updated at full speed, and there is immediate feedback when a weapon is fired. The local client starts showing firing frames, but the bullet is actually only fired after all clients agree that a weapon was fired.
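
A sketch of that split (all names hypothetical): the presentation fires instantly, while the simulation applies the shot only at an agreed-upon future TIC, the same TIC on every client.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
      bool pending;
      uint32_t fire_tic; /* TIC at which every client applies the shot */
    } pending_fire_t;

    /* Hypothetical stand-ins: */
    void start_weapon_animation(int player); /* presentation dataset only */
    void send_fire_command(int player, uint32_t tic);
    void simulate_shot(int player);          /* simulation dataset */

    void on_fire_pressed(int player, uint32_t now_tic, uint32_t delay_tics,
                         pending_fire_t *pf) {
      start_weapon_animation(player);      /* immediate local feedback */
      pf->pending = true;
      pf->fire_tic = now_tic + delay_tics; /* covers the round trip */
      send_fire_command(player, pf->fire_tic);
    }

    void on_tic(int player, uint32_t now_tic, pending_fire_t *pf) {
      /* All clients, local included, fire the shot at the same TIC, so
       * their simulations never disagree about when it happened. */
      if (pf->pending && now_tic >= pf->fire_tic) {
        simulate_shot(player);
        pf->pending = false;
      }
    }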

I would guess that 90% of the lag "feeling" is when players turn or run - that's when the user has the feeling that lag is occurring. The other 10% is in weapon firing, and maybe plats.

This system is actually fair: Regardless of lag, every weapon firing must make a round-trip before being registered.

Unfortunately, the only way to know how a scheme will perform is to build it, and test it out. Things of this nature tend to be long, complicated coding sessions, and "how does it feel?" is subjective.

I do not know if what I suggest is feasible or not.

Now, the beauty of Ladna's solution is that, largely, he can concentrate his efforts on the delta system, which will work, even if the simulation has desync bugs, and even if packets are lost, and even if a client is cheating. Regardless, his system will put things back where they should be.

It's kind of like a guitar player practicing vs. playing a gig: When you are practicing, you try to hit every note, and, if you hit a sour note, you start over, and perfect your sound. But, when it's time to play a gig, even if you miss a note, you had just better keep playing, and get back in time with your band members. During the gig, it is acceptable to miss a note or two, just as long as you keep playing the song, and get back in time.

Likewise, with Doom, when writing the code, you must strive to detect and eliminate all possible desyncs, so, when a desync is detected, an error message is appropriate, and helps with debugging. But, when playing a multiplayer game, a desync/crash is completely unacceptable - you just want to get back to the fun.

Ladna's solution does just that - it gets the game going again. But it goes a step further, by elevating desyncs not only to a normal, expected phenomenon, but to a required function of an unsophisticated prediction scheme.

If I understand correctly, new client prediction, lag-hiding, or smoothing algorithms could be flat-out wrong, and the save+delta system will allow the game to continue, by correcting the simulation dataset, possibly every frame (which I imagine could cause performance problems).

Another benefit is that, during development, you could toggle new prediction/smoothing schemes on and off, seeing their effects, without having to reload the exe, reload the wads, idclev to the level, clip to a certain troublesome area of a map, etc. Hell, the save+delta system could possibly allow a client with bad RAM to limp along in a game!

Good stuff, but, man, it's got to be a dog, doesn't it? That's a lot of data to save/load. Even with disk caching.

I hope I can get some time to try a few things on my own. Please report your progress occasionally, and I'll do the same. I'll be watching this thread for sure!

kb1:

Sorry about 2x post.

Just wanted to mention that I'll be checking out Dr. Sean's links, and also Ladna's source changes.

Good luck!

Ladna:

:)

A couple things:

If I were doing all the save/load stuff on the disk, then yeah it would be slow. I do everything in RAM though, so (while it's not blazing or anything) a whole state load takes < 1ms. It eats CPU, sure, but meh.

I'll reiterate that the best thing about deltas is it means you can stop devoting insane amounts of time to fixing game-breaking bugs. You can have the slickest scripting system in the world, the best renderer, portals, 3D floors, blah blah blah, but if you constantly have ghost rockets hanging in the air, or opponents just flicker around the map, or things just blatantly disappear, your game is unplayable.

I implemented deltas in D2K in around 3 months, and the majority of that time was spent writing my own MessagePack library, making PrBoom+'s game saves portable, and getting clientside prediction to sync up properly. Now that there's an example, it shouldn't take that long to implement in other ports.

I say all this because, frankly, I think netcode is boring. It's something that you don't notice at all when it's good, and can't ignore when it's bad. My hope with deltas was that I could get great netcode with minimal investment, and move on to bigger and better things without having to constantly augment the netcode to keep up. It looks like it's a success, so I posted here to let other source port devs who might be interested in adding/upgrading netcode know that deltas seem viable. I need to do more testing on huge maps; it's possible that bandwidth or CPU usage is just too high when we're talking about 1000s of actors. But we'll see. All I'm saying is that the model is superior to network messages, and that I've got an implementation that seems practical.


I must concur with Ladna (and pretty much everyone else developing a modern game netcode) that network deltas are a far superior design.

Fortunately for me, Doomsday was already there by the time I joined the project way back in 2005. I cannot even begin to appreciate how messy the client/server implementation must be in ports that do this using messages and step-synced game state...

As Ladna explained, another big win with deltas is the nice separation of concerns -- so much so that one can split the application into a "thick" client and a "thin" server. Doomsday took that step a few years ago, in fact, and now we're onto the next phase: pursuing a unified play simulation where even local games connect to a local server.

While Doomsday lacks many of the nice user-facing features of ports like Zandronum, it is a pretty nice setup on a technical level (IMHO). Anyone interested in networking in the context of a Doom port may find the Doomsday codebase an interesting read, as many of the related mechanisms (delta merge pools, prediction smoothing, "player action requests" (using doors, firing weapons, etc.)) are present there and may help give further context to some of the theory mentioned here.

kb1:

Very well, then. It is helpful to me to have this technology "pre-vetted", so to speak, as I have been contemplating going from vanilla peer-to-peer, to client/server with my home port. I put an extreme effort into achieving perfect sync, and I can run complex coop games on my LAN for hours without a desync.

But, as Murphy's Law dictates, every now and then: "poof! sync error! crash!", which really sucks after an hour-long fight. The fact of the matter is that desyncs WILL occur occasionally.

The save+delta technology can act as the "ace in the hole", which is great.

You know, with a little effort, you could have some nice side benefits of this technology:

1. "Universal" demo format - Other source ports could play your demos, even if your port has features not supported by those other ports. That's exciting by itself!

2. By storing the saves, the deltas, and some player movement tics, you could have the ability to rewind demos. Basically, when the user rewinds, you revert back to a previously-saved state, apply deltas as needed, then apply some player move tics. If you do this quickly enough, you'd get a crude rewind effect. Massively expensive internally, but, who cares? (See the sketch after this list.)

3. Connect to servers hosting different source ports!! Holy crap, could it work??? :)
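
For item 2, the rewind reduces to a short seek loop (sketch, invented names), assuming periodic keyframe saves plus the per-TIC deltas between them:

    #include <stdint.h>

    #define KEYFRAME_INTERVAL 35 /* one full save per second, say */

    void load_keyframe(uint32_t tic);             /* stand-in */
    void apply_delta(uint32_t from, uint32_t to); /* stand-in */

    /* Seek anywhere in a demo: load the nearest earlier full save, then
     * roll per-TIC deltas forward to the target. */
    void demo_seek(uint32_t target_tic) {
      uint32_t key = target_tic - (target_tic % KEYFRAME_INTERVAL);
      uint32_t t;

      load_keyframe(key);
      for (t = key; t < target_tic; t++)
        apply_delta(t, t + 1);
    }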

