
Idea: METADATA lump for Slade and Doom Builder


Thanks for the input kb1! I agree with most of what you said.

kb1 said:

Some notes:
. Non-compliant editors will alter lumps without maintaining METADATA, so a compliant editor must properly "reconnect" the METADATA on WAD load

Yeah, the "header" data in the metadata outline I posted can be used to ensure the metadata is intact; if not, a process of "finding" lumps based on name and hash could be implemented, but this may be difficult. Maybe ignoring 0-size lumps with duplicate names would be enough.
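
To illustrate, here is a minimal sketch of what that "finding" pass might look like, assuming each METADATA entry records a name and a hash. All lump/entry structures and field names here are hypothetical, not part of any agreed spec:

    # Sketch: reconnect orphaned METADATA entries to lumps by name + hash.
    # Lump and entry structures are hypothetical placeholders.
    def reconnect(lumps, entries):
        unmatched = []
        for entry in entries:
            candidates = [l for l in lumps
                          if l.name == entry.name and l.hash == entry.hash]
            if len(candidates) == 1:
                candidates[0].metadata = entry
            else:
                # no candidate, or several (e.g. duplicate-named 0-size
                # lumps): defer to user prompting or a fallback rule
                unmatched.append(entry)
        return unmatched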

. METADATA will get invalidated by non-compliant editors, so METADATA-aware editors must be resilient against this.

there definitely needs to be some thought put into this

. METADATA could be used for support of long lump names/texture names. Ports could be modified to support long lump/texture names as well. This adds a new importance to the proper handling of METADATA.

I'm still unsure about letting ports at the metadata, even if just to read it. I'd be happy to hear more port authors' thoughts on this.

. Ports could be modified to show some metadata about the maps as they are loaded, such as map author, description, long map name, etc.

This is what MAPINFO is for. I don't think it makes sense to make it a METADATA thing too. A good way to look at the distinction is that METADATA is not wad content, authors aren't intended to ever really interact with it, it's for lump information for editors mainly. MAPINFO is content for creators to make. This distinction is part of why I think ports shouldn't need to read the data at all.

esselfortium said:

I'm skeptical of having sourceports read METADATA. We already have MAPINFO formats for that stuff, for one thing, and a lot of standard Doom content (like a map!) isn't contained within a single lump, anyway, which makes METADATA very unsuited to displaying information about a map. Long map names already exist ingame: they're just the display name of the map :)

This stuff is just not METADATA's purpose, and complicates the premise to accomplish things that are already accomplished elsewhere.

I don't disagree - it gets into new, unknown territory. Maybe the editors should be implemented first, and we can see just how well they maintain the extra data. Maybe at that point, we won't be comfortable moving forward with any source port usage of the lump. Or, maybe the majority of authors will use compliant tools, and it will seem like a reasonable thing. Time will tell.

jmickle66666666 said:

Thanks for the input kb1! I agree with most of what you said.
Yeah, the "header" data in the metadata outline I posted can be used to ensure the metadata is intact; if not, a process of "finding" lumps based on name and hash could be implemented, but this may be difficult. Maybe ignoring 0-size lumps with duplicate names would be enough.

You're welcome! About 0-sized lumps: It's not that 0-sized lumps have no hash; they just all have the same hash. Inside the main functions of the editor, you have to assume that METADATA is proper and intact - in fact, you must enforce it by fixing the METADATA at load time. But, you also have to assume that a non-compliant editor got hold of the WAD first, so you have to reconnect the lumps to whatever METADATA you have available.

And, unfortunately, this can lead to some rare ambiguous situations. But, generally, how many 0-sized, identically-named lumps will there be? The absolute worst-case scenario is that METADATA "jumps" from one lump to another. And, for this to occur, you would have already detected that METADATA is out of sync, because you detect that at least 1 lump didn't match, right?

So, maybe you prompt the user. Remember that you're already in a situation where the user edited the lump using a non-compliant editor. And, the beauty of how METADATA works is that, even in this situation, you recover almost all of the data, automatically. Prompting the user for where to apply the METADATA seems like a reasonable compromise. And, maybe this could even be avoided, if there was a simple way to move METADATA in the editor. In that case, a general warning dialog could inform the user of an ambiguous situation with METADATA, with a description of where to go to fix the problem.

jmickle66666666 said:

there definitely needs to be some thought put into this. I'm still unsure about letting ports at the metadata, even if just to read it. I'd be happy to hear more port authors' thoughts on this.
This is what MAPINFO is for. I don't think it makes sense to make it a METADATA thing too. A good way to look at the distinction is that METADATA is not wad content, authors aren't intended to ever really interact with it, it's for lump information for editors mainly. MAPINFO is content for creators to make. This distinction is part of why I think ports shouldn't need to read the data at all.

You're probably right.

Now, on the long lump name issue, it might be cool to show the long name in the lump list in the editor. Maybe ports would use that information if a MAPINFO didn't exist - maybe that would be the limit of how a port might use the data.

All I'm saying is that it's a cool feature, and down the road, who knows how useful it may turn out to be?

Final notes:
I can't stress enough that every possible lump manipulation, from load to save, and everything in between, needs to be carefully mapped out, like on paper. If coded perfectly, it will be awesome. But, a single mistake will be disastrous. Some interesting situations can arise:

. How do you handle multiple METADATA lumps in a WAD? It can occur when merging 2 WADs.

. The METADATA lump grows as lumps are added, so, maybe it should be stored at the end of the WAD, during save (otherwise, you must determine its size before writing it, which means you must calculate it before the save, which can be done, but could be tricky). See the sketch after these notes.

. Do you allow the user to delete a METADATA lump? Should be ok, since you've already loaded METADATA info into memory, and you'll write a fresh version during save.

. Do you allow the user to edit the METADATA lump manually? Probably pointless, since you'll rebuild it upon save. A dedicated METADATA-editing function makes more sense.

. Please note that you cannot track METADATA on the METADATA lump (Think about it :)

And, just throwing it out there: A new WAD format would mitigate all of these issues. Ports with ZIP packaging already enjoy timestamps, long file names, and folders. A new WAD format could handle all of this and more. Just saying...
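
On the save-order note above (storing METADATA at the end of the WAD), a minimal sketch of a save routine in that style: every ordinary lump is written first, and METADATA last, so its size is known by the time the directory is emitted. Every function and structure name here is a hypothetical placeholder:

    import struct

    # Sketch: save a PWAD with METADATA written last. `lumps` is assumed
    # to exclude the METADATA lump itself; structures are hypothetical.
    def save_wad(path, lumps, build_metadata):
        with open(path, 'wb') as f:
            f.seek(12)                       # reserve the 12-byte header
            directory = []
            for lump in lumps:
                directory.append((f.tell(), len(lump.data), lump.name))
                f.write(lump.data)
            meta = build_metadata(lumps)     # size now fully determined
            directory.append((f.tell(), len(meta), 'METADATA'))
            f.write(meta)
            dir_offset = f.tell()
            for offset, size, name in directory:
                f.write(struct.pack('<ii8s', offset, size,
                                    name.encode('ascii')))
            f.seek(0)                        # go back and fill the header
            f.write(struct.pack('<4sii', b'PWAD', len(directory), dir_offset))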


In practical terms, the only 0-byte lumps that have some metadata people would care about are map headers. Though it's true nothing prevents a wad from containing several MAP01s...

kb1 said:

And, just throwing it out there: A new WAD format would mitigate all of these issues. Ports with ZIP packaging already enjoy timestamps, long file names, and folders. A new WAD format could handle all of this and more. Just saying...


The ports that could possibly implement a new WAD format if it existed for Doom mods* have also already implemented ZIP archive support. So where's the use case?


(* There are WAD2 and WAD3 formats already, but they don't concern us.)

kb1 said:

Final notes:
I can't stress enough that every possible lump manipulation, from load to save, and everything in between, needs to be carefully mapped out, like on paper. If coded perfectly, it will be awesome. But, a single mistake will be disastrous. Some interesting situations can arise:

. How do you handle multiple METADATA lumps in a WAD? It can occur when merging 2 WADs.

There should only ever be one METADATA. The process of merging means adding all the lumps from one wad to the end or start of another, in which case the position information can just be modified and the METADATA information merged straightforwardly.
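
A sketch of that merge, for the append-at-the-end case, assuming each METADATA entry carries an index/position field (the entry structure is a hypothetical placeholder):

    # Sketch: merge wad B's METADATA into wad A's when B's lumps are
    # appended after A's.
    def merge_metadata(entries_a, entries_b, num_lumps_a):
        merged = list(entries_a)
        for entry in entries_b:
            entry.index += num_lumps_a   # shift past wad A's lumps
            merged.append(entry)
        return merged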

kb1 said:

. The METADATA lump grows as lumps are added, so, maybe it should be stored at the end of the WAD, during save (otherwise, you must determine its size before writing it, which means you must calculate it before the save, which can be done, but could be tricky).

I can see this being sensible, but having seen how the editors I've looked at work, I fear it's an optimisation that wouldn't be taken full advantage of :)

kb1 said:

. Do you allow the user to delete a METADATA lump? Should be ok, since you've already loaded METADATA info into memory, and you'll write a fresh version during save.

No. Ideally I wouldn't let the user even *see* a metadata lump.

kb1 said:

. Do you allow the user to edit the METADATA lump manually? Probably pointless, since you'll rebuild it upon save. A dedicated METADATA-editing function makes more sense.

No.

kb1 said:

. Please note that you cannot track METADATA on the METADATA lump (Think about it :)

Of course :) The METADATA lump is a special lump, meant as an extension to the WAD format within the limits of the WAD format; it should never be treated like a normal lump aside from the areas where that is necessary (wad writing).

kb1 said:

And, just throwing it out there: A new WAD format would mitigate all of these issues. Ports with ZIP packaging already enjoy timestamps, long file names, and folders. A new WAD format could handle all of this and more. Just saying...

True, but this proposal is about extending the WAD format. As Gez pointed out, any port that can support more than just WAD files already has a zip-based format, which makes METADATA entirely pointless for them. The entire point is to ease manipulation of WAD files specifically.

An interesting counterpoint: instead of extending the WAD format with METADATA, maybe there could be a standard for a zip format that compiles into WADs just for distribution. The tradeoff here is that there is a difference between the WAD you play and the file you work on, which is a new paradigm for Doom modification and not ideal. However, it would be the simplest way to solve all the issues METADATA is aiming to solve.

jmickle66666666 said:

There should only ever be one METADATA. The process of merging means adding all the lumps from one wad to the end or start of another, in which case the position information can just be modified, and the METADATA information merged simply.

Right. My point is that this and other types of situations must be considered. I am listing this as a possible situation that needs consideration.

jmickle66666666 said:

Of course :) The METADATA lump is a special lump, meant as an extension to the WAD format within the limits of the WAD format; it should never be treated like a normal lump aside from the areas where that is necessary (wad writing).

I guess you understood that I was describing the paradox that taking the hash of the METADATA lump, and writing that hash into the METADATA lump, changes the METADATA lump :)

jmickle66666666 said:

True, but this proposal is about extending the WAD format. As Gez pointed out, any port that can support more than just WAD files already has a zip-based format, which makes METADATA entirely pointless for them.

Any port/editor can add support for any format it sees fit. And zip support does not provide all the data that METADATA does: author, support for duplicate files in the same namespace, description, change notification, etc.

jmickle66666666 said:

An interesting counterpoint: instead of extending the WAD format with METADATA, maybe there could be a standard for a zip format that compiles into WADs just for distribution. The tradeoff here is that there is a difference between the WAD you play and the file you work on, which is a new paradigm for Doom modification and not ideal. However, it would be the simplest way to solve all the issues METADATA is aiming to solve.

Without baking automated support into the latest editors, I don't think you'll find many people complying with a standard that causes them to do extra work. The beauty of METADATA is that it would be maintained automatically, and that it would provide a dedicated place for this type of data.

On another note, I think name, size and a 32-bit (or larger) hash is sufficient to reasonably determine that the METADATA matches the file it points to. Having 2 hashes is probably overkill.

On the multiple "MAP01" 0-size lump thing, it's only an issue when dealing with WADs with missing/outdated METADATA. So the problem becomes one of resolving ambiguity. Maybe when you encounter a 0-byte lump, you use the hash of the following lump. If the only time this is a problem is when you have multiple MAP01's, this would work, because you almost always include all of a map's lumps when manipulating a map, right? Are there any other typical use cases where duplicate-named 0-byte lumps typically occur?

So, in reasonable use, say you encounter 2 identically-named 0-size lumps, from WADs with mismatched METADATA reliability. This is already a somewhat rare occurrence, and you have a 50/50 chance of reconnecting the METADATA correctly. But if the hash from 0-size lump #50 was generated using lump #51's data, and that was always a METADATA rule, you could most likely improve your chances closer to 100% (as long as the 0-sized lump was not last in the WAD, and one of the METADATAs was correct...)

Again, the worst possible scenario is that METADATA moves to a different 0-sized lump. Then again, you could make a rule that disallows 0-sized lumps from having METADATA. Maybe a map's METADATA lives in the THINGS lump, for example.
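
As a sketch of that hashing rule, using zlib's CRC32 as a stand-in for whatever hash ends up being chosen (the lump structure is hypothetical):

    import zlib

    # Sketch: a 0-byte lump borrows the hash of the lump that follows it,
    # so a MAP01 marker is effectively identified by its THINGS data.
    def lump_hash(lumps, i):
        if len(lumps[i].data) == 0 and i + 1 < len(lumps):
            return zlib.crc32(lumps[i + 1].data)
        return zlib.crc32(lumps[i].data)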

There's lots of options, but as long as you're consistent, and every scenario is managed with some care, even this issue can be mitigated.


Oh hey, this again :P

I did actually start on this a while back when it was first discussed (branch here); however, I quickly found that calculating the MD5 hash for each lump dramatically increased load/save times, which definitely isn't desirable. I'm not sure I had the most optimised MD5 code though, so maybe it could be improved.

As for the newly proposed spec, I'm not too keen on using JSON for the formatting, since I'd rather not have to write yet another parser or use yet another library just to read/write these things.

The alias thing I can see being pretty difficult to implement too, as useful as it would be.


Just encode the lump's name into the hash (but make sure to always make it 8 bytes wide; otherwise lump M with contents AP01 would hash the same as empty lump MAP01).
Or simply provide no hash for the markers, since it's not like it's needed there anyway.
Having two markers with the same name — who the hell would do that and why?
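
A sketch of that padding rule, with CRC32 as a stand-in hash:

    import zlib

    # Sketch: pad the name to a fixed 8 bytes before hashing, so
    # ("M", b"AP01") and ("MAP01", b"") no longer produce the same input.
    def named_hash(name, data):
        return zlib.crc32(name.encode('ascii').ljust(8, b'\0') + data)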

sirjuddington said:

I'm not sure I had the most optimised MD5 code though, so maybe it could be improved.

If I were in charge, I would just use a lightweight CRC32 checksum and leave it at that.
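
In Python terms, that's essentially a one-liner (a sketch; lump_data here is just a placeholder for a lump's raw bytes):

    import zlib

    checksum = zlib.crc32(lump_data)   # one 32-bit value per lump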


Using CRC32 would certainly be handy since that's already implemented in SLADE (PNGs require them).


I picked JSON since it's very flexible and parsers are available for (pretty much) every language. However, if there is another format that is similarly lightweight and extensible, and that would fit better into SLADE, then that's also good.

If MD5 is too slow, and CRC32 would be smaller and faster, then that is clearly a better option.

The alias feature would indeed be very handy, and possibly has more practical, direct uses than the rest of the spec, but nothing has to come together all at once. For instance, the first version can ignore everything but the hash/index etc. and leave everything else up to custom fields that are for the most part ignored; extra features can be added in successive versions (an important reason to have an easily extensible format).

The proposed spec is mainly to switch discussion from vague ideas to distinct things we can throw out or solidify. While technically sound, it's definitely still intended to be just a basis, and I'm glad discussion has been productive for it so far!

jmickle66666666 said:

I picked JSON since it's very flexible and parsers are available for (pretty much) every language. However, if there is another format that is similarly lightweight and extensible, and that would fit better into SLADE, then that's also good.

If MD5 is too slow, and CRC32 would be smaller and faster, then that is clearly a better option.

Maybe start with sirjuddington's code, since he's done the work already. JSON is a good format - quite readable, but most suited for humans. I would have probably done tab-delimited myself, but that's just me. But, if sirjuddington has some code written, it makes sense to run with it.

Name+Size+CRC32+Index is more than sufficient for these purposes. You're matching against a lump that you expect to match, so it's even less critical than, say, trying to find a matching file on your hard drive.
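
A sketch of that identity tuple and the match test an editor might run at load time (entry/lump field names are hypothetical):

    import zlib

    # Sketch: the Name+Size+CRC32+Index identity described above.
    def identity(lump, index):
        return (lump.name, len(lump.data), zlib.crc32(lump.data), index)

    def entry_matches(entry, lump, index):
        return identity(lump, index) == (entry.name, entry.size,
                                         entry.crc, entry.index)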

And, yeah, maybe dropping the 0-byte lumps is the easiest approach. It's not like anyone wants to author a 0-sized lump anyway!

jmickle66666666 said:

I picked JSON since it's very flexible and parsers are available for (pretty much) every language. However, if there is another format that is similarly lightweight and extensible, and that would fit better into SLADE, then that's also good.

Ideally it'd be something both SLADE and GZDB can already parse, such as a UDMF-style format. The only problem with that is fields that can have multiple values like Authors. SLADE's parser for UDMF has some extensions (for reading the SLADE config files) allowing values to have comma separated lists for multiple values, and nested blocks.
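
Purely as a hypothetical illustration of that direction (none of these field names are part of any agreed spec), a METADATA entry in such a UDMF-style syntax might look like:

    lump
    {
        name = "MAP01";
        index = 12;
        size = 0;
        crc = "A1B2C3D4";
        alias = "maps/city/downtown";
        authors = "essel, jmickle"; // comma-separated list extension
    }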

jmickle66666666 said:

If MD5 is too slow, and CRC32 would be smaller and faster, then that is clearly a better option.

I think CRC32 should be sufficient for this purpose, yes. It doesn't have to be 100% clash proof.

jmickle66666666 said:

The alias feature would indeed be very handy and possibly has the most practical, direct uses than the rest of the spec, but nothing has to come together all at once. For instance the first version can ignore everything but the hash/index etc, and leave everything else up to custom fields that are for the most part ignored, extra features can be added in successive version (an important reason to have an easily extensible format)

Yeah that's fine, it's definitely not impossible to support; just compared to everything else it would take a lot more effort. The main thing is the virtual trees created by using / in aliases; just aliases on their own wouldn't be too difficult to add, I don't think.

sirjuddington said:

The main thing is the virtual trees created by using / in aliases; just aliases on their own wouldn't be too difficult to add, I don't think.

What is meant by "aliases"? Is this essentially long lump name support, or more like virtual folders within the WAD? (Both of which are fascinating ideas!)


kb1: Aliases would essentially be both long-name support *and* virtual folder support, the idea being that "/" characters in the alias could be interpreted as folders/namespaces.
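
A sketch of how an editor might expand those aliases into a virtual tree (the alias strings are invented examples):

    # Sketch: interpret "/" in aliases as virtual folders/namespaces.
    def build_tree(aliases):
        tree = {}
        for alias in aliases:
            node = tree
            *folders, leaf = alias.split('/')
            for folder in folders:
                node = node.setdefault(folder, {})
            node[leaf] = None            # leaf = the lump itself
        return tree

    build_tree(["maps/city/downtown", "textures/brick_wall_large"])
    # -> {'maps': {'city': {'downtown': None}},
    #     'textures': {'brick_wall_large': None}}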

sirjuddington:
If the SLADE UDMF parser can already support comma-delimited lists, then that should be more than enough for any initial implementation of metadata. Nested blocks would only be necessary for increasing the potential application of custom fields, but they're nice nonetheless.

I'm definitely up for using the existing UDMF parser for it in this case. While in ideal circumstances I would still prefer JSON, being a standard extensible format, UDMF fits the spec fine and in practical terms is by far the simplest.


For 0-sized lumps, I still think they should be indexed. Adding exceptions increases the complexity. The issue of having multiple 0-sized lumps with the same name can easily be safeguarded against by storing the position of the lump too.

jmickle66666666 said:

For 0-sized lumps, I still think they should be indexed. Adding exceptions increases the complexity. The issue of having multiple 0-sized lumps with the same name can easily be safeguarded against by storing the position of the lump too.

All I would ask is that you prepare to change your view, based on how well things work. There are ambiguous cases that can occur, and Murphy's Law states that they will. The proper approach is to minimize the possibility, or minimize the damage it causes, even if it requires a couple more lines of code.


Since it seems unavoidable that a non-compliant editor could potentially invalidate the metadata for some lumps in a way that couldn't be 100% reliably corrected by a compliant editor afterwards, why not design the metadata around the idea that any change to the wad file by a non-compliant editor invalidates the metadata, and that this should only be reliably detected afterwards, instead of unreliably corrected? In other words, accept that the metadata lump only has credibility as long as the file is edited by compliant editors, and loses all credibility once edited by a non-compliant one. Would it really be that bad to let this happen after every non-compliant change, instead of only some of them, and without being certain where, when, and if at all?

Basic idea (probably flawed, but there might be other ways, too): Just identify each lump by its index only, and also save the wad file's last-modified time into the metadata lump. If it later turned out not to exactly match the file's actual date modified, it'd mean that a change (that is, an invalidation) by a non-compliant editor happened to the file.
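
A sketch of that check (timestamp granularity and tools that preserve mtimes are obvious caveats; names here are hypothetical):

    import os

    # Sketch: METADATA records the wad's last-modified time; any mismatch
    # at load time means a non-compliant editor has touched the file.
    def metadata_credible(wad_path, recorded_mtime):
        return os.path.getmtime(wad_path) == recorded_mtime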


But then it invalidates the metadata for all lumps. With a basic hashing/checksumming method, it will only invalidate metadata for actually modified lumps.

Gez said:

With a basic hashing/checksumming method, it will only invalidate metadata for actually modified lumps.

Most of the time, with the highest probability, yes. But sometimes, it could invalidate metadata for unmodified or all lumps as well.

scifista42 said:

Most of the time, with the highest probability, yes. But sometimes, it could invalidate metadata for unmodified or all lumps as well.

If, in your metadata lump, you individually record the name, size, and checksum of every other lump you expect to be in the file (with suitable grouping to account for marker tags), then the probability of any non-malicious change to the file going undetected is extremely small.

Added lump? Detected with probability 1, because you find a lump you weren't expecting.

Existing lump changes size? Detected with probability 1, because an expected lump now has an unexpected size.

Existing lump changes content without changing size? Detected with probability 0.9999999997 or higher (assuming a 32-bit or larger checkvalue with good mathematical properties) unless a malicious actor is making a deliberate effort to break the checksum algorithm, because an expected lump of expected size has an unexpected checksum.

Deleted lump? Detected with probability 1, because you don't find a lump you were expecting.

Renamed lump? Detected as a deletion + addition.

Erased metadata lump because it didn't know what to do with it? Detected with probability 1, because you don't have any metadata any more.

Scribbled garbage in the metadata lump because its handling of lumps it doesn't know about is defective? Detected with probability 1, because your metadata is garbage.

What accidental (rather than malicious) scenario are you realistically expecting that isn't covered by the above?
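
A sketch of that detection pass, keyed on lump name for simplicity (it assumes unique names, which real code would need to relax, and uses zlib's CRC32 as the checkvalue; structures are hypothetical):

    import zlib

    # Sketch: classify differences between the wad's actual lumps and
    # the recorded METADATA entries.
    def detect_changes(lumps, entries):
        found    = {l.name: l for l in lumps}
        expected = {e.name: e for e in entries}
        added    = [n for n in found if n not in expected]
        deleted  = [n for n in expected if n not in found]
        changed  = [n for n in found if n in expected and
                    (len(found[n].data) != expected[n].size or
                     zlib.crc32(found[n].data) != expected[n].crc)]
        return added, deleted, changed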

kb1 said:

All I would ask is that you prepare to change your view, based on how well things work. There are ambiguous cases that can occur, and Murphy's Law states that they will. The proper approach is to minimize the possibility, or minimize the damage it causes, even if it requires a couple more lines of code.

Oh absolutely, if in practice it makes more sense then of course I'm willing to change my view. For me it's currently just a weigh-up of complexity versus "security" of the format. I don't think the danger is enough to warrant adding the complication of treating 0-sized lumps differently for now.

grommile said:

(lots of good stuff)

I couldn't have stated it better. Proper checksums, like CRC, are designed to produce a totally different number with only one bit changed in the source data. Sure, you could plug random numbers into a lump repeatedly, and, theoretically, after billions of tests, you could find a collision. But this lump also has to be usable by Doom. The chances are so small that it's as safe as it ever needs to be.

jmickle66666666 said:

Oh absolutely, if in practice it makes more sense then of course I'm willing to change my view. For me it's currently just a weigh-up of complexity versus "security" of the format. I don't think the danger is enough to warrant adding the complication of treating 0-sized lumps differently for now.

You're right, of course. The thing that ups the danger a bit is the fact that there is a specific scenario which will occur in WAD editing: moving a second MAP01, created in Joe's Map Editor, into a compliant WAD that was edited in that same non-compliant editor (and therefore has an invalid MAP01 itself). Of course, it will be caught in the other lumps. But people will probably instinctively place their METADATA in the map marker, not the other lumps.

If there did need to be some special handling, it would be for map lumps. Again, the optimal solution is to minimize the impact of a mismatch, or minimize what the user has to do to fix it. If it is built with those goals in mind, any issue should be reasonably salvageable. If that scenario occurred, you would luckily still have some METADATA to show the user, which may just be enough, if you cannot reconnect to the lumps properly. However conflicts are handled, as long as all compliant editors do the exact same steps, it will work out fine.


Just saying that implementing a parser of anything non-esoteric in GZDB is quite simple and straightforward, so you can decide on the format based on SLADE.


Hmm, well in that case, while I'd still prefer to use a UDMF-like format, recently I've been investigating using something like duktape for scripting in SLADE, which from what I can tell has a JSON parser (would be kind of weird if it didn't :P).

Gez said:

What's happening with Lua BTW?

Nothing really. I added it a while ago when thinking of starting on some SLADE scripting stuff, but then decided other things were more important and left it there.

At the moment I'm kind of tossing up between Lua, Duktape (JavaScript) and AngelScript.


Apropos of nothing, I was testing hashing in PHP and generated the following table of various hash methods. I imagine PHP just uses a wrapper for boring old C implementations. (Each hash algo was tested as outputting both raw data and converted to hex, so I removed the hex ones, which is why the numbers aren't all there.)

Building random data ...
128000 bytes of random data built !

Results: 
   1.  adler32                 121.116 microseconds
   3.  md4                     181.913 microseconds
   4.  fnv1a32                 183.82 microseconds
   5.  fnv132                  184.059 microseconds
   6.  fnv164                  184.059 microseconds
   8.  fnv1a64                 185.012 microseconds
  13.  joaat                   229.835 microseconds
  15.  md5                     247.955 microseconds
  17.  tiger192,3              312.805 microseconds
  18.  tiger128,3              313.043 microseconds
  21.  tiger160,3              321.865 microseconds
  23.  sha1                    353.813 microseconds
  26.  crc32b                  387.907 microseconds
  27.  tiger192,4              403.881 microseconds
  28.  tiger128,4              404.119 microseconds
  30.  tiger160,4              412.94 microseconds
  33.  crc32                   468.015 microseconds
  35.  ripemd128               548.124 microseconds
  36.  ripemd256               556.945 microseconds
  40.  sha512                  646.114 microseconds
  44.  sha384                  701.904 microseconds
  45.  haval224,3              707.149 microseconds
  46.  haval256,3              709.056 microseconds
  47.  haval192,3              711.202 microseconds
  50.  haval128,3              766.038 microseconds
  51.  ripemd320               839.948 microseconds
  53.  haval160,3              843.048 microseconds
  55.  ripemd160               854.015 microseconds
  57.  sha224                  963.211 microseconds
  58.  haval224,4              967.025 microseconds
  62.  haval192,4              980.854 microseconds
  66.  haval160,4              1002.073 microseconds
  68.  sha256                  1024.961 microseconds
  69.  haval256,4              1113.891 microseconds
  70.  haval128,4              1173.019 microseconds
  72.  haval160,5              1233.1 microseconds
  73.  haval256,5              1235.008 microseconds
  74.  haval128,5              1235.961 microseconds
  77.  haval224,5              1260.042 microseconds
  80.  haval192,5              1477.956 microseconds
  82.  whirlpool               1827.001 microseconds
  83.  gost-crypto             2771.854 microseconds
  85.  gost                    2887.01 microseconds
  87.  snefru                  5300.045 microseconds
  88.  snefru256               5447.149 microseconds
  91.  md2                     20593.881 microseconds


Are all of those built into a PHP library? I'd love to see the implementations. Do you know of something I can download to see the source of all those hash functions?


So basically MD5 takes twice as much time as CRC-32 and SHA-1 takes thrice.
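
For anyone who wants to reproduce a similar comparison, a quick Python sketch (timings will of course vary by machine and implementation, so treat it as a sanity check rather than a benchmark):

    import hashlib, os, timeit, zlib

    data = os.urandom(128000)   # same size as the PHP test above
    for name, fn in [('crc32', lambda: zlib.crc32(data)),
                     ('md5',   lambda: hashlib.md5(data).digest()),
                     ('sha1',  lambda: hashlib.sha1(data).digest())]:
        print(name, timeit.timeit(fn, number=1000), 'sec / 1000 runs')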

