AlexMax

Reasonable subset of demo compatibility checking?

Chocolate Doom aims to be 100% vanilla-compatible, and it accomplishes this by playing a gigantic corpus of COMPET-N demos and ensuring the demo output is identical to doom2.exe. However, that always seemed to me like complete overkill - not to mention kind of difficult to set up, since you have to gather all of the necessary PWADs just to run the tests.

It seems to me that if you're aiming to be reasonably vanilla-compatible (i.e. not necessarily 100%), or if you just want to make sure that the changes you just made didn't break some not-obvious corner case, there ought to be some reasonable subset of demos that you should test against. My question is, what demos would that consist of?

Link to post

I wonder how long it takes and how much space is occupied by the PWADs/demos. It might not be so terrible, especially once you have the stuff scripted.

Link to post

Fraggle has a script somewhere online, and it just uses a local mirror of the COMPET-N file stuff and copies and unzips the file as needed and then runs the game. The only flaw is the output file format: it just does a diff of the output against a specified expected output, and it would be better if it were CSV or similar.

As for demos, I would probably just pick the longest ones that run the entire game on UV and NM, preferably max and speed demos.

Link to post

I must say that I don't really understand the question. Usually the overhead in any Test driven development model is the upfront cost in building the necessary test framework, writing the tests, etc. Once built, you can run the tests as often as you can afford time wise.

So, I must conclude that the issue here is the time it takes to run the full suite of tests? If so then the demos you want to prioritise are those from the largest and most complex maps, so as to cover as much of the demo functionality as possible. (Which demos will depend entirely on the coverage of your tests. Without digging into those and the architecture of the port, I really can't be any more specific).

Link to post
GhostlyDeath said:

Fraggle has a script somewhere online, and it just uses a local mirror of the COMPET-N file stuff and copies and unzips the file as needed and then runs the game. The only flaw is the output file format: it just does a diff of the output against a specified expected output, and it would be better if it were CSV or similar.

The Compet-N mirror I use for the tests is on archive.org (I uploaded it). But it's not really the output file format that's the limitation, rather that's a side effect of the mechanism that's being used for testing. statcheck compares the statistical data produced by Doom's -statcopy parameter and checks that Chocolate Doom's result matches Vanilla (I've generated expected statdump outputs for the entire Compet-N corpus). It's admittedly a rather crude measure, but the fact that it can be run over a huge number of demos to do a literal Vanilla-Chocolate comparison is pretty nice.

AlexMax said:

My question is, what demos would that consist of?

http://www.doomworld.com/vb/doom-speed-demos/35214-spechits-reject-and-intercepts-overflow-lists/

I think PrBoom+ actually has automated tests to test these demos. Not sure though.

I'd like to set up tests like these for Chocolate Doom as well, though I would like to do something more in-depth and precise than what statdump comparisons do (for the reasons I've listed above). Something like a hash of all the positions of all items in the level for every tic would probably be a good expected result.
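A minimal sketch of the "hash of all positions per tic" idea, not anything from Chocolate Doom itself. It assumes the map objects have been gathered into an array of a hypothetical `sketch_mobj_t` (a real port would walk its thinker list instead) and uses 32-bit FNV-1a, which is not a cryptographic hash but is easy to reproduce across ports. Bytes are fed in a fixed order so the result does not depend on host endianness.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the position fields of mobj_t
 * (Doom stores these as 16.16 fixed-point integers). */
typedef struct {
    int32_t x, y, z;
} sketch_mobj_t;

/* Mix one 32-bit value into an FNV-1a hash, low byte first. */
static uint32_t fnv1a_i32(uint32_t h, int32_t v)
{
    uint32_t u = (uint32_t)v;
    int i;

    for (i = 0; i < 4; i++) {
        h ^= (u >> (8 * i)) & 0xffu;
        h *= 16777619u;              /* FNV prime */
    }
    return h;
}

/* Hash the positions of all objects for the current tic. */
uint32_t hash_positions(const sketch_mobj_t *mobjs, size_t count)
{
    uint32_t h = 2166136261u;        /* FNV offset basis */
    size_t i;

    for (i = 0; i < count; i++) {
        h = fnv1a_i32(h, mobjs[i].x);
        h = fnv1a_i32(h, mobjs[i].y);
        h = fnv1a_i32(h, mobjs[i].z);
    }
    return h;
}
```

Calling this once per tic in both the reference and the port under test gives one 32-bit value per tic to compare; the order in which objects are visited has to be identical in both for the hashes to mean anything.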

DaniJ said:

So, I must conclude that the issue here is the time it takes to run the full suite of tests?

They do admittedly take a long time to run through, though the statcheck tool supports running tests in parallel which is nice on modern multicore machines.

However, I do think AlexMax has a really good point. A lot of the things needed for the modern standard of "Vanilla compatibility" are obscure: they're not things commonly triggered during normal play. Many only happen in particular levels that trigger particular overflows, so even though statcheck runs through probably hundreds of hours of demos, it's almost certainly not testing all the cases it could do. As AlexMax says, they're corner cases, and to do proper regression testing, we ought to be specifically testing those corner cases.

Link to post
fraggle said:

However, I do think AlexMax has a really good point. A lot of the things needed for the modern standard of "Vanilla compatibility" are obscure: they're not things commonly triggered during normal play. Many only happen in particular levels that trigger particular overflows, so even though statcheck runs through probably hundreds of hours of demos, it's almost certainly not testing all the cases it could do. As AlexMax says, they're corner cases, and to do proper regression testing, we ought to be specifically testing those corner cases.

In that case, you're arguably running tests at the wrong level and one should instead test specific cases through explicit tests of functionality.

Although I'm unfamiliar with your test setup it sounds to me like the coverage of your tests is fairly low, if one can run hundreds of demos and still not hit all the cases you need to test for. I would also guess that these tests are inspecting the same areas of code repeatedly, so simply throwing more demos at the problem doesn't actually help in the way one would expect.

Most likely a combination of explicit functionality tests and demo-sync tests will be necessary to feel confident enough that "yeah, I didn't break [something]".

Link to post

Yeah I think a smaller bundle of demos that are "contrived" to test specifically for compatibility is a great idea. I don't know how representative they are, but maybe we can add Odamex' test demos to PrBoom+'s test demos.

fraggle said:

Something like a hash of all the positions of all items in the level for every tic would probably be a good expected result.


I definitely think this is the way to go too. Dunno about hosting though, at first I thought hosting it as a Git repository would be the best way to go, but it may be big binary files, and Git kind of hates those from what I've heard.

Link to post

You know, this thread would've almost made more sense in Speed Demos. There are threads for overflow demos and scholars who have pretty much encyclopedic knowledge of weird occurrences, heh. Even the CN archive as a whole might be unnecessary overkill.

Link to post
fraggle said:

Something like a hash of all the positions of all items in the level for every tic would probably be a good expected result.


I tested something like that when I was adding demo compatibility to ReMooD and I found that certain cases of sub-tic desyncs were not detected. Using position alone, that would not check cases where a specific order needs to be performed or a specific way an action is performed and may delay the desync detection until a later tic (if at all). Though that was fixing up Doom Legacy code so that I could actually play Doom 1.9 deathmatch and a few solo demos rather successfully.

What I mean is that there would be a desync but it would occur much later (even in another map) because a demo can desync but still have valid positions for quite a while. There is the PRNG, line activation, hitscans, teleport fogs, lights, zombie pain and death sounds, P_CheckRadius, etc. So you would best opt in for some more checks.

The only way I could fix some of Doom Legacy's broken demo playback was to add more checks, such as when line triggers were hit and other general actions, since there were cases where in Vanilla the order is A then B, but Legacy had it doing B then A. Some demos can even desync without actually looking like they desynced. And hence, this is how ReMooD ended up with 96 compatibility variables to play demos down to Vanilla with all the Legacy versions in between. You get to learn stuff like Legacy versions 1.32 and up have a slightly different SSG spread.

If this is to be some kind of thing where it is hoped that multiple ports could benefit from it then it would need more vanilla details since Doom is rather precise when it comes to ordering.
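One way to make a check sensitive to ordering, sketched under assumptions: instead of hashing positions only, every interesting event (a PRNG call, a line trigger, a spawn) is folded into a running hash as it happens. The event kinds and function names here are hypothetical and not taken from ReMooD or any other port; the mixing step is 32-bit FNV-1a.

```c
#include <stdint.h>

/* Hypothetical event kinds; a real port would choose its own set. */
enum {
    SYNC_EV_RANDOM       = 1,
    SYNC_EV_LINE_TRIGGER = 2,
    SYNC_EV_SPAWN        = 3
};

#define SYNC_SEED 2166136261u   /* FNV offset basis */

/* Fold a single byte into the running hash (FNV-1a step). */
uint32_t sync_mix_byte(uint32_t h, uint8_t byte)
{
    return (h ^ byte) * 16777619u;
}

/* Fold one event (kind plus a 32-bit argument such as a line or
 * thing number) into the running hash, argument low byte first. */
uint32_t sync_record(uint32_t h, uint8_t kind, uint32_t arg)
{
    int i;

    h = sync_mix_byte(h, kind);
    for (i = 0; i < 4; i++)
        h = sync_mix_byte(h, (uint8_t)(arg >> (8 * i)));
    return h;
}
```

Because each step depends on the previous hash value, "A then B" and "B then A" generally end up with different results, which a position-only hash cannot distinguish when the positions happen to agree.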

Might be a good thing to go engineer some levels that are essentially just "fun-houses" where you do stress testing on as many parts of the engine as possible. Though with Vanilla limits you would have to split it into multiple levels but then that way you could check intermission screen and story screen demo compatibility too. Make 4 very long (4-6 hour) demos of players playing through all 32 test levels to make sure things operate. One demo for each player in the game (1, 1+2, 1+2+3, 1+2+3+4). Would be boring to watch but a fast system could probably -timedemo that rather quickly.

Link to post
DaniJ said:

In that case, you're arguably running tests at the wrong level and one should instead test specific cases through explicit tests of functionality.

Yeah, I guess it's a good argument. Hundreds of demos that exercise the same code paths are arguably a waste of time. Running it over lots of demos gives a kind of "feel good" confidence but arguably a more reasoned approach (examining things like coverage) would make more sense. I'd quite like to add the demos from the list I linked to earlier in the thread to statcheck's tests - I think that makes the most sense.

Most likely a combination of explicit functionality tests and demo-sync tests will be necessary to feel confident enough that "yeah, I didn't break [something]".

I should clarify that it's demo sync I'm primarily testing here rather than other functionality.

Ladna said:

Yeah I think a smaller bundle of demos that are "contrived" to test specifically for compatibility is a great idea.

I actually had the idea of making a minimalist IWAD for testing - like a massively cut down version of Freedoom with all blank textures, silent sounds/music etc. - that could be included with Chocolate Doom for standalone testing without needing an IWAD to run. It should then be possible to include some constructed levels that test various odd vanilla behavior, along with demos to run the tests.

Ideally it would be easiest to just include the levels and demos that exhibit these behaviors (the ones from the list). But they're large files and I don't want to bloat the source distribution unnecessarily; I'd also need to get permission to license them under the GPL, which for most levels/demos is probably never going to happen.

I definitely think this is the way to go too. Dunno about hosting though, at first I thought hosting it as a Git repository would be the best way to go, but it may be big binary files, and Git kind of hates those from what I've heard.

No big binary files should even be necessary: the end result should just be a small cryptographic hash. If everything checks out then the hash should match.

Link to post

This has been an interesting thread to read through. My personal goals are far less lofty than perfect Vanilla compatibility - I was looking for something more along the lines of some assurance that hypothetical linedef action 2993 or whatever behaves the same both before and after working in that code. Or heck, something obvious like "Do mancubi fire 90 degrees to the right now?" I could go through and playtest after every change, but why playtest when you could run a script that makes a bunch of green checkmarks light up in 30 seconds or so.

You could perhaps have levels of compatibility - "These demos need to play or else you broke something pretty basic.", then "These demos need to play or else there's some wonky physics thing (say, plasma bumping, wallrunning, or gliding) that won't work anymore." Then there could be "These demos need to play or else you can't really call yourself Vanilla Compatible" for those who want perfect compatibility.
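The tiers described above could be expressed roughly as follows. This is only a sketch: the tier names, the `demo_result_t` struct and the idea of reporting "how many tiers pass, counting up from the most basic" are assumptions made for illustration, and any demo file names used with it would be placeholders.

```c
#include <stddef.h>

/* Hypothetical tiers, ordered from most basic to strictest. */
typedef enum {
    TIER_BASIC   = 0,  /* "you broke something pretty basic" */
    TIER_PHYSICS = 1,  /* plasma bumping, wallrunning, gliding... */
    TIER_VANILLA = 2,  /* needed to claim vanilla compatibility */
    NUM_TIERS    = 3
} tier_t;

typedef struct {
    const char *demo;  /* demo lump or file name (placeholder) */
    tier_t tier;
    int in_sync;       /* 1 if playback matched the expected result */
} demo_result_t;

/* Returns how many consecutive tiers, starting at TIER_BASIC, had
 * every demo stay in sync: 0 means even the basics are broken,
 * NUM_TIERS means everything passed. */
int tiers_passed(const demo_result_t *results, size_t count)
{
    int failed[NUM_TIERS] = {0};
    size_t i;
    int t;

    for (i = 0; i < count; i++)
        if (!results[i].in_sync)
            failed[results[i].tier] = 1;

    for (t = 0; t < NUM_TIERS; t++)
        if (failed[t])
            break;
    return t;
}
```

A port that only cares about the first tier can stop there; one chasing full compatibility would want the function to return NUM_TIERS.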

fraggle said:

I actually had the idea of making a minimalist IWAD for testing - like a massively cut down version of Freedoom with all blank textures, silent sounds/music etc. - that could be included with Chocolate Doom for standalone testing without needing an IWAD to run. It should then be possible to include some constructed levels that test various odd vanilla behavior, along with demos to run the tests.


Perhaps the IWAD would start out completely blank aside from test levels, but could be filled in with game resources from an IWAD by running a script so you could then record demos with it.

Link to post
AlexMax said:

This has been an interesting thread to read through. My personal goals are far less lofty than perfect Vanilla compatibility - I was looking for something more along the lines of some assurance that hypothetical linedef action 2993 or whatever behaves the same both before and after working in that code. Or heck, something obvious like "Do mancubi fire 90 degrees to the right now?" I could go through and playtest after every change, but why playtest when you could run a script that makes a bunch of green checkmarks light up in 30 seconds or so.

This could be a pretty good argument for the statcheck approach. Doom has a lot of complex behavior - monster states etc. Testing with a huge number of demos at least gives some confidence that those states have been adequately tested.

Actually, Doom's internal state engine (the tables in info.c) presents something of a challenge for coverage analysis. In Doom's case, testing isn't just about confirming that every code path has been exercised - it's also about checking that every state_t has been exercised, every monster and thing type spawned as well.

Perhaps the IWAD would start out completely blank aside from test levels, but could be filled in with game resources from an IWAD by running a script so you could then record demos with it.

Yeah, realistically it wouldn't even need to be properly playable by a human, as long as demos play back okay. All textures plain black, all sprites invisible, etc. As long as it's compatible with normal IWADs so that demos can be recorded, it doesn't matter so much.

Link to post

At one point kb1 and I had whipped up a system which generated log files with identically formatted output from both my source port (MochaDoom) and his own (kbDoom, a branch of prBoom if I recall correctly), and then compared those with an external comparator. Common checkpoints were the calls to P_Random (there is a finite number of spots in the code where it can be called) and thing spawn/damage/death events.

The purpose was to determine the reason for Mocha's desync in demos, by using kbDoom as a reference, and it really helped: it can now play at least the IWAD demos back, after a number of mistakes and differences were identified. However, the comparison procedure was painstaking, and at some point we discovered that kbDoom itself diverged from the intended behavior, and was not 100% vanilla compatible itself. Then with time we pretty much forgot about it, but IMO it was a good system.

I should probably try to port it in a branch of ChocoDoom to use as a "universal reference".

Link to post

If you have a 2-hour demo test, and it returns FAIL, now what?
You cannot repeat such huge tests in debugging the problem.
If it does not have a way to point out something very specific it will not have provided any information. If you are testing an engine with such a thing you probably already know it is going to fail. You would have to test after every patch of a few lines or more.

A single statcheck at the end of demo is not very precise in identifying some area of code to look at.

Smaller test increments would provide more resolution. Three second demos with a statcheck on each would be coarse but better than a huge test file.

Explicit detection of sync problems, like looking at Random calls vs. sync points in the demo, would be much better for debugging.

I wrote an automated test and verify system for my last project.
The test cases were written to provoke output from the system into a log file, which was then checked against the correct output.

It required several hooks in the program being tested.
- A switch to put output to a log. For Doom it could put something like the number of Random calls since the last test number, and the current Random number.
- A switch to set a TESTING flag, to block some test adverse behaviors like waiting for input.
- Some way to force set state that the user normally controls. For Doom every comp flag comes to mind. There are probably others.
- Would be aided by Test numbering added to the demo format. Each log output would include the current Test number gotten from the playing demo. This would delimit the portion of the demo that triggers the error. About 10 test numbers per second of demo is about right.
- You could save the tics since demo started instead, but you could not look at the demo file and see what it was doing at that time. Don't want to hand count those. Count wrong and spend an hour looking for the bug in the wrong bit of demo file. Would need a tool to dump the n-th tic demo commands from a demo file to do actual debugging.
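The first hook in the list above (log the number of Random calls since the last test number, plus the current Random index) might look something like this. It is a sketch with hypothetical names (`rng_probe_t`, `probe_on_random`, `probe_mark`); the only thing borrowed from Doom is that the random index is a byte that wraps at 256. The log line format is made up.

```c
#include <stdio.h>

/* Hypothetical probe state kept alongside the engine's PRNG. */
typedef struct {
    unsigned calls_since_mark;  /* Random calls since last test number */
    unsigned char rndindex;     /* mirrors the engine's table index */
} rng_probe_t;

/* Call from inside the engine's random function. */
void probe_on_random(rng_probe_t *p)
{
    p->calls_since_mark++;
    p->rndindex = (unsigned char)((p->rndindex + 1) & 0xff);
}

/* Call whenever the demo reaches a new test number. Writes one log
 * line into 'out' and resets the per-interval call counter.
 * Returns what snprintf returns (length of the line, or negative). */
int probe_mark(rng_probe_t *p, unsigned test_number,
               char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "test=%u calls=%u rndindex=%u\n",
                     test_number, p->calls_since_mark,
                     (unsigned)p->rndindex);

    p->calls_since_mark = 0;
    return n;
}
```

Two logs produced this way can be compared line by line; the first line that differs names the test number, and therefore the stretch of demo, in which the call counts diverged.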

Then you need a test program to subject the engine to the test and to compare the log against a known good log.

It was very effective at finding coding bugs in the program.
It takes about as much time to maintain the test verify program and its test files as it does to maintain the tested program.
I spent a huge amount of time writing tests. However, for every test written it usually found some problem in the code, that I could fix immediately. For that to work the correct test log must be generated by hand. Some tools were created to help generate those correct test logs, being careful to not copy any code from the tested program.

Link to post
GhostlyDeath said:

There is the PRNG, line activation, hitscans, teleport fogs, lights, zombie pain and death sounds, P_CheckRadius, etc. So you would best opt in for some more checks.


I don't think anyone's arguing that mobj positions are a perfect measure of sync. They're just the easiest, and most general way of checking it. It might be worth it to explore this more though, like what's a good cross-section of engine behavior? Does RNG sync have a place? What about mobj states?

dew said:

You know, this thread would've almost made more sense in Speed Demos. There are threads for overflow demos and scholars who have pretty much encyclopedic knowledge of weird occurrences, heh. Even the CN archive as a whole might be unnecessary overkill.


I do think something like that might be the way to go. I think maybe you could TAS a demo together (based on Freedoom or a stripped Freedoom) that tested all the weird-ass behavior. That works around the licensing problems that Fraggle brought up, and you can avoid having fifty 5-second demos (plus their PWADs?) in your repo.

On the other hand though, having 1 demo per weird behavior can be really helpful for testing.

fraggle said:

Yeah, realistically it wouldn't even need to be properly playable by a human, as long as demos play back okay. All textures plain black, all sprites invisible, etc. As long as it's compatible with normal IWADs so that demos can be recorded, it doesn't matter so much.


I'd rather have a "real" demo though, one that you can watch. Troubleshooting a "blank" demo replay has to be hyper-aggravating.

fraggle said:

No big binary files should even be necessary: the end result should just be a small cryptographic hash. If everything checks out then the hash should match.


Yeah you're right. I guess I was just thinking of keeping the hash for each tic, but you can just hash those together at the end, or whatever. I do kind of agree with wesleyjohnson (and Ghostly, in a way) though. Seeing that the demo failed is useful as long as you start out with really good compatibility. If you're starting a new physics implementation (or a major port), it's not very useful.

Link to post

My port is based on the original ID source code with the pre-broken demos, so it has never worked. I've found documents saying that the demos are known to be broken in that version, but I cannot find exactly what broke it. Is there something somewhere that would give me a fighting chance to fix it?

Link to post
Maes said:

At one point kb1 and I had whipped up a system which generated log files with identically formatted output from both my source port (MochaDoom) and his own (kbDoom, a branch of prBoom if I recall correctly), and then compared those with an external comparator. Common checkpoints were the calls to P_Random (there is a finite number of spots in the code where it can be called) and thing spawn/damage/death events.


I always thought that having a standardized form of a verbose dump file like you describe for demo compatibility checking was a great idea. Perhaps it would be less useful for Chocolate Doom vs doom2.exe, since you can't exactly easily modify doom2.exe to dump this data, but if Chocolate implemented the ability to create such a dump-file, with directions to put the same logging in your own port, you could test your port against chocolate which is the next best thing to .exe compatibility.

Link to post
jeff-d said:

My port is based on the original ID source code with the pre-broken demos, so it has never worked. I've found documents saying that the demos are known to be broken in that version, but I cannot find exactly what broke it. Is there something somewhere that would give me a fighting chance to fix it?


I'd grep for "demo" through the PrBoom+ source code.

Link to post
Maes said:

At one point kb1 and I had whipped up a system which generated log files with identically formatted output from both my source port (MochaDoom) and his own (kbDoom, a branch of prBoom if I recall correctly), and then compared those with an external comparator. Common checkpoints were the calls to P_Random (there is a finite number of spots in the code where it can be called) and thing spawn/damage/death events.

The purpose was to determine the reason for Mocha's desync in demos, by using kbDoom as a reference, and it really helped: it can now play at least the IWAD demos back, after a number of mistakes and differences were identified. However, the comparison procedure was painstaking, and at some point we discovered that kbDoom itself diverged from the intended behavior, and was not 100% vanilla compatible itself. Then with time we pretty much forgot about it, but IMO it was a good system.

I should probably try to port it in a branch of ChocoDoom to use as a "universal reference".

Since then, I created a much more advanced version which creates humanly-readable, yet extremely compressed output files which show the state of each thing on a tic-per-tic basis. With command-line switches, you can choose a specific tic range to examine in detail, and/or get CRC hashes per tic, and/or per level. It also logs movement changes, thing flags, damage, PRNG state, action function calls, and more. Again, the files created are easy-to-read, yet highly compressed, and can be condensed into a single comparable hash per tic, or just one hash per level. The idea is that you use the one hash per level approach, until a mismatch occurs. At that point, you can re-run the demo, requesting increasing level of detail, as it helps focus on where the problem occurs. Typically it will tell you which function is incompatible, and, many times, why it is incompatible. It also works equally well for diagnosing network sync issues.

I call it the SyncDbg toolkit, and it can be implemented fairly easily into C ports, or ported to other languages without too much difficulty. I sent this new toolkit to the Odamex guys. They seemed to be interested, but I haven't heard from them since.

I will upload it if someone provides me: #1. some interest in actually implementing it, and #2. A dedicated place to host it.

Link to post

In retrospect, if we were to reimplement this from scratch, I'd go for a human readable, yet easily tokenizable format, like a form of JSON or XML. One big token/bracketed JSON object per tick, and then it would be easy to select specific fields and types of information during comparison (sub-tokens or sub-objects) regardless of how much extra info or different (visual) formatting there is.

E.g. you want to see just calls to P_Random? No problem, filter out everything else. You want to see calls to P_Random AND death events, regardless of how much extra stuff the two ports spit out in their output? No problem, once again, just filter for those two categories. Do you need to debug complex things like e.g. linedef crossings or spechit lists, but the other port doesn't? No problem, just ignore this kind of tokens. Does the other port spit out things you don't understand? Again, no problem: ignore, ignore, and ignore.
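A rough sketch of the "one tokenizable record per event, filter by category" idea. It emits one JSON object per line and filters with a plain substring match, which only works because the emitter controls the exact formatting; a real comparison tool would use a proper JSON parser (and, as noted in this post, probably not C). The field names `tic`, `cat` and `detail` and both function names are assumptions, and the detail string is assumed to contain no quotes or backslashes.

```c
#include <stdio.h>
#include <string.h>

/* Write one event as a single-line JSON object into 'out'.
 * Returns what snprintf returns. */
int emit_event(char *out, size_t out_size, int tic,
               const char *cat, const char *detail)
{
    return snprintf(out, out_size,
                    "{\"tic\":%d,\"cat\":\"%s\",\"detail\":\"%s\"}",
                    tic, cat, detail);
}

/* Crude category filter: returns 1 if 'line' carries the given
 * category. Relies on the exact "cat":"..." spelling used above. */
int event_in_category(const char *line, const char *cat)
{
    char needle[64];
    int n = snprintf(needle, sizeof needle, "\"cat\":\"%s\"", cat);

    if (n < 0 || (size_t)n >= sizeof needle)
        return 0;   /* category name too long for this sketch */
    return strstr(line, needle) != NULL;
}
```

With records like these, comparing only P_Random calls and death events means keeping the lines for which `event_in_category` matches one of those two categories in both logs and ignoring everything else, whatever extra output either port produces.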

The biggest problem we faced, I recall, was that log output had to be limited to exactly one token of information per line, the two Doom ports had to output exactly the same number and type of tokens in the same order, and it was impossible to filter them out by category. There were other problems too like number format (decimal vs hexadecimal) which were enough to send the comparator titties up :-p

With a parseable, tokenizable format, these aspects could easily be taken care of in a visualizer tool (TBQH, I wouldn't use C or C++ for making such a tool though, not when languages like Java, C# or even JavaScript can handle XML and JSON much more easily with built-in tools).

Link to post

Maybe this is too harsh (if so sorry, these are cool ideas!), but to be honest, I wouldn't use any of this stuff. I would just DDL Choco/PrBoom(+)'s savegames with MessagePack, delta compress them out to a file, and dump the same data in my port. Then I'd write a tool to examine both files, and print out the first desync it finds. This would be something like 20 lines of C. If it found a desync, I'd set a conditional breakpoint in GDB on gametic, and step through the tic. I could do all this in an evening, and in fact, this was my plan. I really don't think there's a need for a Frankenstein Doom that's instrumented to hell (with all the attendant maintainability problems).
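The comparison tool described here could be as small as the following sketch. It assumes both ports have written one fixed-size state record per tic and that the two dumps have already been read into memory; MessagePack and delta compression are left out entirely, so this shows only the "find the first desync" step. The function name and return codes are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Compare two dumps made of fixed-size per-tic records.
 * Returns the index of the first record that differs (or at which
 * one dump ends before the other), -1 if the dumps are identical,
 * and -2 if record_size is zero. */
long first_desync(const unsigned char *a, size_t a_len,
                  const unsigned char *b, size_t b_len,
                  size_t record_size)
{
    size_t common = a_len < b_len ? a_len : b_len;
    size_t off;

    if (record_size == 0)
        return -2;

    /* Walk the whole records both dumps have in common. */
    for (off = 0; off + record_size <= common; off += record_size)
        if (memcmp(a + off, b + off, record_size) != 0)
            return (long)(off / record_size);

    /* Different lengths, or a differing trailing partial record. */
    if (a_len != b_len || memcmp(a + off, b + off, common - off) != 0)
        return (long)(off / record_size);

    return -1;
}
```

The returned index is the tic to put in the conditional breakpoint on gametic.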

It actually makes me think that it might be possible to do the same with doom2.exe, for even better compatibility checking. I don't know if this works, but I think you could modify the ticcmds in demos to include BT_SPECIAL, BTS_SAVEGAME, and BTS_SAVEMASK in the .buttons field. Wouldn't the game save every tic? I guess it would get overwritten all the time though, hmm. Maybe you can do a neat trick with a FUSE filesystem to rename files instead of overwriting them. Dunno.
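For reference, a sketch of how such a buttons byte is put together. The constant values are the ones from buttoncode_t in the released Doom source (d_event.h) as best I recall them, so double-check them against your copy; the two helper functions are hypothetical. Whether doom2.exe actually saves on every tic when a demo is patched this way is untested, as said above.

```c
#include <stdint.h>

/* Values as recalled from buttoncode_t in d_event.h. */
enum {
    BT_SPECIAL     = 128,         /* "special event" flag */
    BT_SPECIALMASK = 3,
    BTS_SAVEGAME   = 2,           /* special event: save the game */
    BTS_SAVEMASK   = 4 + 8 + 16,  /* bits holding the save slot */
    BTS_SAVESHIFT  = 2
};

/* Build a buttons byte asking for a save into the given slot. */
uint8_t make_save_buttons(unsigned slot)
{
    return (uint8_t)(BT_SPECIAL | BTS_SAVEGAME |
                     ((slot << BTS_SAVESHIFT) & BTS_SAVEMASK));
}

/* Decode a buttons byte: returns the save slot, or -1 if the byte
 * is not a save request. */
int save_slot_from_buttons(uint8_t buttons)
{
    if (!(buttons & BT_SPECIAL))
        return -1;
    if ((buttons & BT_SPECIALMASK) != BTS_SAVEGAME)
        return -1;
    return (buttons & BTS_SAVEMASK) >> BTS_SAVESHIFT;
}
```

Note that OR-ing in the whole of BTS_SAVEMASK selects slot 7; the mask is the field that carries the slot number, so a patched demo would normally set a specific slot instead.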

===

Anyway, I'm a lot more interested in a set of demos that approaches a comprehensive test suite than how to test for and diagnose compatibility issues. To me, testing for compat is the easy part; knowing what's compatible and what's not is expertise I don't have.

===

jeff-d said:

My port is based on the original ID source code with the pre-broken demos, so it has never worked. I've found documents saying that the demos are known to be broken in that version, but I cannot find exactly what broke it. Is there something somewhere that would give me a fighting chance to fix it?


Depending on what your port supports (Boom, etc.) you can look through PrBoom(+), Chocolate Doom, and Eternity for the same functions and see what's different. There aren't THAT many functions to look through, physics wise (Boom linedefs though, that's a different story...). If it were me, I'd just vimdiff through the different files (p_user.c, p_map.c, etc.).

Link to post

Speaking from experience here (mine and kb1's).

The trouble with Ladna's approach (I had to try hard to convince kb1 that it would be insufficient, BTW :-p) is that you just identify when the desync occurs, but not why it occurred, nor does it give you any hint of what to look for. Certainly, you do need to know on which tic it occurs, but alone that's not enough, and certainly restoring the game on that tic (with all the problems that Doom savegames can have in the first place....) is not the answer.

The gist of it was:

There may be errors in the code/behavior that silently accumulate and are asymptomatic, UNTIL you reach that particular desync tic. Their cause lies in previous tics, but if you have no "history" to find out what went wrong (a history of P_Random calls is useful, but also certain types of linedef intercept calculations), it's unlikely you'll manage to guess exactly what to fix.

In the case of source ports there's always the chance of programmer error, especially if some things are rewritten/ported or substituted by "equivalent" functions (e.g. I found out that my direct translation of linux doom's linedef side code was wrong, while a direct port of Boom's code worked as intended).

But as they say, hic Rhodus, hic salta. ;-)

Link to post
Maes said:

Speaking from experience here (mine and kb1's).

The trouble with Ladna's approach (I had to try hard to convince kb1 that it would be insufficient, BTW :-p) is that you just identify when the desync occurs, but not why it occurred, nor does it give you any hint of what to look for.


Ehhhhhh, you're missing the debugging part. Graf Zahl infamously hates GDB, but the ability to save a GDB configuration is baller. You can break on every call to P_Random, or conditionally to each of the pr_* calls (I actually did this a couple months ago to find PRNG desyncs in the sound code... seriously), and log anything you want to a file. This goes for anything, PIT_CheckLine, T_MovePlane, whatever.

To me, this approach is far preferable to having instrumentation switches or #ifdef's everywhere. It's one thing to have Frankenstein Doom as a tool, it's another for the port you're developing to have to become Frankenstein Doom to achieve compat.

I really just need demos that demonstrate engine behavior, and a way to narrow it to the point of desync.

Maes said:

There may be errors in the code/behavior that silently accumulate and are asymptomatic, UNTIL you reach that particular desync tic. Their cause lies in previous tics, but if you have no "history" to find out what went wrong (a history of P_Random calls is useful, but also certain types of linedef intercept calculations), it's unlikely you'll manage to guess exactly what to fix.


Where could this stuff accumulate that I can't serialize it?

*See edit below

Maes said:

But as they say, hic Rhodus, hic salta. ;-)


Aha (after looking it up), we'll see. My dog's been sick lately, but maybe I can do it tonight. It'll probably have to wait until the weekend though.

===

* EDIT:

P_Random is a good example. So you don't need to track P_Random calls in your output, you just need to serialize rng.rndindex and rng.prndindex. The comparison tool can then print out the discrepancy and its tic. Then, set a conditional breakpoint in GDB (or whatever) for that TIC and have it log a trace whenever P_Random or M_Random is called. This is the kind of workflow I've actually used before, to great success, and it required very little modification (and again, the comparison tool is laughably small).

Link to post
Ladna said:

Where could this stuff accumulate that I can't serialize it?


Wasn't your point exactly that you don't need to serialize stuff with the magic of gdb and save states? And that doing so would make your code a "FrankenDoom"?

Unless you are so cool and awesome that you can have all the serialization, without the frankenbits ;-)

However, again from experience, all the desync errors we managed to fix with this method were not due to misplaced P_Random calls, but more due to things like geometry calculations (one bit off...and BOOM!) or stupid mistakes (e.g. using an actual null reference value instead of the S_NULL state in melee range checking).

In general, the logging allowed us to see on which tic a desync occurred, and at that point, monitoring a series of other actions in greater detail (e.g. "Shotgun guy #33 should attack Imp #34 now. Why didn't he?") allowed us to pinpoint the cause.

Sure, you can do it all with a debugger or with minimal "serialization" or printf() "debugging" if you want, but there are so many calls to follow, that strategically placing logging statements in your code may give you a better insight. The most complex conditions to debug, however, are geometry-based calculations. There, the depth of recursions that you may have to follow is staggering, even for very simple actions like a single Imp "looking" at a player O_o

The only "shortcut" there is to be very intimate with the code, to the point of knowing "Ah, this bit can only go wrong here, so I'll check JUST there".

Share this post


Link to post
Maes said:

Wasn't your point exactly that you don't need to serialize stuff with the magic of gdb and save states? And that doing so would make your code a "FrankenDoom"?

Unless you are so cool and awesome that you can have all the serialization, without the frankenbits ;-)


No I mean, right before gametic++, you add a little:

if (dumpstateparm) // or whatever
  P_DumpState();   // or whatever

P_DumpState works very similarly to the code in p_saveg, but I would use MessagePack instead of the ad hoc serialization in most source ports. The point isn't that there's no code; the point is that your instrumentation isn't scattered throughout your codebase. Here, it's centralized.
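
A rough sketch of what a centralized dump function could look like. This is not d2k's code: it swaps MessagePack for plain little-endian 32-bit fields to stay dependency-free, and `dump_mobj_t`, `put_u32`, and this `P_DumpState` signature are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the mobj fields a port would choose to dump. */
typedef struct {
    int32_t x, y, z;   /* fixed-point position */
    int32_t health;
} dump_mobj_t;

/* Stores one 32-bit value in little-endian order; returns the new offset. */
static size_t put_u32(unsigned char *buf, size_t off, uint32_t v)
{
    buf[off + 0] = (unsigned char)(v & 0xff);
    buf[off + 1] = (unsigned char)((v >> 8) & 0xff);
    buf[off + 2] = (unsigned char)((v >> 16) & 0xff);
    buf[off + 3] = (unsigned char)((v >> 24) & 0xff);
    return off + 4;
}

/* Serializes one tic's state into buf. Returns the number of bytes
 * written, or 0 if the buffer is too small. All dumping lives here,
 * so the rest of the codebase needs only the single call site. */
size_t P_DumpState(unsigned char *buf, size_t cap, uint32_t gametic,
                   const dump_mobj_t *mobjs, uint32_t count)
{
    size_t need = 8 + (size_t)count * 16;
    size_t off = 0;

    assert(buf != NULL);
    if (need > cap)
        return 0;
    off = put_u32(buf, off, gametic);
    off = put_u32(buf, off, count);
    for (uint32_t i = 0; i < count; i++) {
        off = put_u32(buf, off, (uint32_t)mobjs[i].x);
        off = put_u32(buf, off, (uint32_t)mobjs[i].y);
        off = put_u32(buf, off, (uint32_t)mobjs[i].z);
        off = put_u32(buf, off, (uint32_t)mobjs[i].health);
    }
    return off;
}
```

The caller would append the returned bytes to the dump file right before gametic++. Writing fixed byte order by hand keeps dumps comparable across machines.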

Share this post


Link to post

OK, now I get it. The three issues I see with such an approach:

  • Ensuring that you serialize absolutely everything, in a much more comprehensive way than p_saveg does. This might also mean saving (and restoring) absolute pointer values, or finding a way to abstract them away, especially if you need to compare two different ports, the same port on different hardware, or be able to do an "absolute restore". At this point, I think that a total save state in an actual emulator would be preferable.
  • You will still need to code quite a comprehensive comparator/"Doom debugger" to see what changed between two consecutive tics. I admit, this could be used to do some quite elaborate things like tracing the paths of every map object, plotting graphs etc.
  • However, elaborate as it is, it will still only allow you to make "before" and "after" comparisons, and tell you little about the "during", and the Devil is in the details, as they say. Within each tic, there's practically a whole new world of "subtics" or events. E.g. think of the sequence of actions that occur each tic in NUTS.WAD ;-) You'll definitely know if something is wrong, but not why it went wrong, if all you save is the final state after a tic. There's no substitute for painstakingly looking "between the lines", or for "scattering instrumentation", as you say.
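
On the first point, one common way to abstract pointers away is to dump an index in place of each pointer. A minimal sketch, assuming the mobjs sit in a contiguous array; a real port keeps them in a thinker list and would number them in list order instead. The names `sketch_mobj_t`, `mobj_index`, and `NO_TARGET` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_TARGET UINT32_MAX   /* sentinel written in place of a NULL pointer */

/* Hypothetical stand-in for mobj_t with one pointer field. */
typedef struct sketch_mobj_s {
    struct sketch_mobj_s *target;
    int32_t health;
} sketch_mobj_t;

/* Maps a pointer to its position in the mobjs array, so that two ports
 * (or two runs) with different heap addresses produce the same dump.
 * Returns NO_TARGET for NULL or for a pointer not found in the array. */
uint32_t mobj_index(const sketch_mobj_t *mobjs, size_t count,
                    const sketch_mobj_t *p)
{
    assert(mobjs != NULL);
    if (p == NULL)
        return NO_TARGET;
    for (size_t i = 0; i < count; i++) {
        if (&mobjs[i] == p)
            return (uint32_t)i;
    }
    return NO_TARGET;
}
```

The linear search is fine for a sketch; a port dumping every tic would more likely cache the index on the mobj itself.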

Share this post


Link to post

1. I've done almost all of this for d2k. Some stuff, like global collision engine variables, still isn't serialized, but it was really straightforward to do.

2. You can put stuff like section headers or mobj indices in the dump, so the comparator stays real simple.

3. That's what a debugger is for. You can't possibly anticipate everything you'll need. Or you can, and it becomes impractical. The odds of getting it right without a huge file are slim. Alternatively, you can take kb1's approach, but to me, that looks an awful lot like a (sweet) debugger.

Share this post


Link to post
Maes said:

Speaking from experience here (mine and kb1's).

The trouble with Ladna's approach (which, BTW, I had to try hard to convince kb1 would be insufficient :-p) is that you just identify when the desync occurs, but not why it occurred, nor does it give you any hint of what to look for. Certainly, you do need to know on which tic it occurs, but alone that's not enough, and certainly restoring the game on that tic (with all the problems that Doom savegames can have in the first place....) is not the answer.

The gist of it was:

There may be errors in the code/behavior that silently accumulate and are asymptomatic, UNTIL you reach that particular desync tic. Their cause lies in previous tics, but if you have no "history" to find out what went wrong (a history of P_Random calls is useful, but also certain types of linedef intercept calculations), it's unlikely you'll manage to guess exactly what to fix.

In the case of source ports there's always the chance of programmer error, especially if some things are rewritten/ported or substituted by "equivalent" functions (e.g. I found out that my direct translation of linux doom's linedef side code was wrong, while a direct port of Boom's code worked as intended).

But as they say, hic Rodi, hic salta. ;-)

Most of the issues Maes and I encountered have been worked out in this new version. My view is that if you had a program that would tell you why, you wouldn't need a programmer! Stated differently, when the desync occurs IS why the desync occurred, if you've captured enough info.

Ladna's approach is not specific enough for me. The changes required to get my toolkit working are minimal, and, yes, they require ifdefs, but, that's a good thing - it keeps my code insulated from the port code, and makes it easy to find.

You don't want to just track every variable. Each port should have the freedom to code its functions as the port author sees fit, as long as the result matches vanilla. If one port accomplishes, say, firing the pistol using 5 variables, and I code it to work with 4, as long as my code fires the pistol at the proper angle and time, it does not matter that I have arrived at that result using different variables and/or code.

And, that's the point - what you want to capture is the results. The new toolkit only outputs differences, so in the tics in between a monster firing, I might log that mobj's movement, but not its flags. Based on how well the old toolkit worked to debug and correct desyncs in Mocha Doom, I stand by this new version as being a pretty damn sharp tool for finding desyncs. Sure, you need to know your Doom codebase a bit, and you need to do a bit of attaching to get the toolkit to work in your port, but it's really not a lot of work.
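
To illustrate the "only output differences" idea, here is a small sketch. It is not kb1's toolkit, whose code isn't shown in this thread; `mobj_snapshot_t`, `append_field`, and `log_mobj_diff` are hypothetical names, and the output format is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical snapshot of the observable fields of one mobj. */
typedef struct {
    int32_t x, y;    /* fixed-point position */
    int32_t flags;
} mobj_snapshot_t;

/* Appends " name=value" to out while there is room left. */
static void append_field(char *out, size_t cap, size_t *len,
                         const char *name, long value)
{
    if (*len < cap) {
        int n = snprintf(out + *len, cap - *len, " %s=%ld", name, value);
        if (n > 0)
            *len += (size_t)n;
    }
}

/* Writes a line naming only the fields that changed since the previous
 * tic. Returns the number of changed fields; when nothing changed, the
 * line is left empty so the caller logs nothing for this mobj. */
int log_mobj_diff(char *out, size_t cap, int id,
                  const mobj_snapshot_t *prev, const mobj_snapshot_t *cur)
{
    size_t len = 0;
    int changed = 0;
    int n;

    assert(out != NULL && cap > 0 && prev != NULL && cur != NULL);
    n = snprintf(out, cap, "mobj %d:", id);
    if (n > 0)
        len = (size_t)n;
    if (prev->x != cur->x) {
        append_field(out, cap, &len, "x", (long)cur->x);
        changed++;
    }
    if (prev->y != cur->y) {
        append_field(out, cap, &len, "y", (long)cur->y);
        changed++;
    }
    if (prev->flags != cur->flags) {
        append_field(out, cap, &len, "flags", (long)cur->flags);
        changed++;
    }
    if (changed == 0)
        out[0] = '\0';
    return changed;
}
```

A mobj that moved but kept its flags produces a line with its position only, which matches the movement-but-not-flags case described above.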

There are many ways to attack the issue of desyncs. I designed this new toolkit to be the easiest, quickest way that I could think of.

Again, if someone would host my files, and, maybe a web page, I will post the toolkit, and an explanation of exactly what it does, and how to implement it. Basically, you dump my .c and .h files into your port, then add various "#ifdef SYNCDBG Call a syncdbg routine #endif" lines into certain function calls. I have it implemented in my code base, where it stays, and just change a switch and recompile when I need it. It slows down the port just a bit, but it's actually rather efficient.
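
For readers who haven't seen the pattern, this is roughly what an `#ifdef SYNCDBG` hook can look like. It is only a guess at the shape: the toolkit's real API isn't posted in this thread, so `SYNCDBG_EVENT`, `syncdbg_record`, and `demo_fire_pistol` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Build with -DSYNCDBG to enable the hooks; without it they compile away. */
#ifdef SYNCDBG
#define SYNCDBG_EVENT(kind, value) syncdbg_record((kind), (value))
#else
#define SYNCDBG_EVENT(kind, value) ((void)0)
#endif

uint32_t syncdbg_count;       /* number of events recorded so far */
uint32_t syncdbg_last_kind;   /* kind of the most recent event */
uint32_t syncdbg_last_value;  /* payload of the most recent event */

/* Hypothetical sink: a real one would hash and/or log the event. */
void syncdbg_record(uint32_t kind, uint32_t value)
{
    syncdbg_count++;
    syncdbg_last_kind = kind;
    syncdbg_last_value = value;
}

/* Stand-in for a port function with one hook added. The hook records
 * the result (the firing angle), not the variables used to compute it. */
int demo_fire_pistol(int angle)
{
    assert(angle >= 0);
    SYNCDBG_EVENT(1u, (uint32_t)angle);
    return angle;
}
```

Wrapping the `#ifdef` in a macro is a design choice that keeps each call site to one line; spelling out `#ifdef SYNCDBG ... #endif` at every site, as described above, works the same way.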

As far as "scattering instrumentation" goes, it's pretty clean. And, that's what it takes, if you want perfect sync the easiest way possible. I know this is a good approach because I used to add quick and dirty debugging to find and fix single specific desyncs, and it was a pain in the ass. With the syncdbg toolkit, however, I dedicated my time to making a powerful desync detection system.

Here are the goals I tried to achieve:
1. Easily human-readable output
2. Per-tic 32-bit CRC hashes of the entire tic, that remain consistent, regardless of the syncdbg settings used.
3. Per-level 32-bit CRC hash of the entire level's action.
4. Ability to hide all detail, and output just hash codes.
5. The ability to specify a tic range, where full detail is logged.

The secret to quick debugging is to combine goal 2 with goal 5. Run the demo once, outputting just hashes. Stop when a desync occurs. Then, run the demo again, but, this time, request to see full details for a small range of tics around the desync.
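
A sketch of how the hash-plus-detail-window combination could be wired up, assuming a standard CRC-32 (reflected polynomial 0xEDB88320). It is not the toolkit's actual code; `syncdbg_state_t` and `syncdbg_hash_event` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32; pass 0 as crc to start, or a previous result to
 * continue hashing more data. */
uint32_t crc32_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical hashing state; the caller resets tic_crc to 0 each tic. */
typedef struct {
    uint32_t tic_crc;       /* hash of the current tic's events */
    uint32_t level_crc;     /* running hash of the whole level */
    uint32_t detail_first;  /* first tic to log in full detail */
    uint32_t detail_last;   /* last tic to log in full detail */
} syncdbg_state_t;

/* Folds one event into both hashes. The hashes are updated whether or
 * not detail is requested, so they stay the same across settings.
 * Returns 1 if the caller should also print full detail for this tic. */
int syncdbg_hash_event(syncdbg_state_t *s, uint32_t tic,
                       const void *data, size_t len)
{
    assert(s != NULL && data != NULL);
    s->tic_crc = crc32_update(s->tic_crc, data, len);
    s->level_crc = crc32_update(s->level_crc, data, len);
    return tic >= s->detail_first && tic <= s->detail_last;
}
```

On the first run the detail window would be left empty; on the second it is set to a few tics around the first mismatching hash.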

Again, there are many ways to do it, but I confidently stand my system up against anyone's, and claim that it works well, and fast. And, it's already written.

It's actually kinda neat to be able to see the action of a Doom demo, in a text file.

Let me know. In any case, good luck!

Share this post


Link to post

So I think AlexMax's original idea of a "reasonable subset" is pretty promising. Should we try and work with the Doom Speed Demos community to get some together? Is 1 demo or multiple demos better? Should the project be based on Freedoom (what about ongoing Freedoom changes though...)? I feel like the thread got sidetracked a little, but I'm still pretty interested in the idea.

Share this post


Link to post

From a debugging perspective I'd prefer a set of demos, each focused on a particular aspect of compatibility. A set of "conformance tests", if you will. Isolating the conformance test outside of a traditional demo with lots of other stuff going on will make it easier to focus on the test itself.

Share this post


Link to post
