Pending UDMF Revisions

As you may (or may not) remember, there was a pending revision to be made to the UDMF standard which I never got around to formally adding. Just to make sure everyone is still on the same page, here's a review of what was previously agreed upon:

\n, \r, \t, and possibly \x## were to be formalized as escape character sequences inside quoted string tokens.
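
For illustration, here is a minimal sketch of how a reader might decode those escapes in C. The function name `udmf_unescape` is hypothetical, and the handling of `\\` and `\"` is an assumption of this sketch rather than something stated above; `\x##` is taken to mean exactly two hex digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of one hexadecimal digit; the caller guarantees isxdigit(c). */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower(c) - 'a' + 10;
}

/*
 * Decode the body of a quoted string token (without the surrounding quotes)
 * into out. Returns the decoded length, or -1 on a malformed escape or if
 * out is too small. Handling of \\ and \" is an assumption of this sketch.
 */
int udmf_unescape(const char *in, char *out, size_t out_size)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (out_size == 0)
        return -1;
    while (*in != '\0') {
        char c = *in++;
        if (c == '\\') {
            switch (*in++) {
            case 'n':  c = '\n'; break;
            case 'r':  c = '\r'; break;
            case 't':  c = '\t'; break;
            case '\\': c = '\\'; break;
            case '"':  c = '"';  break;
            case 'x':  /* exactly two hex digits, as in \x## */
                if (!isxdigit((unsigned char)in[0]) ||
                    !isxdigit((unsigned char)in[1]))
                    return -1;
                c = (char)(hex_value((unsigned char)in[0]) * 16 +
                           hex_value((unsigned char)in[1]));
                in += 2;
                break;
            default:
                return -1;  /* unknown escape, or backslash at end of input */
            }
        }
        if (n + 1 >= out_size)
            return -1;      /* no room left for this byte plus the NUL */
        out[n++] = c;
    }
    out[n] = '\0';
    return (int)n;
}
```

Calling `udmf_unescape("a\\tb", buf, sizeof buf)` would leave an `a`, a tab and a `b` in `buf`. Rejecting unknown escapes outright is a design choice of the sketch; a lenient reader could pass them through instead.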

However, four additional issues have been brought to my attention by other people since then, so I want to discuss them:

  • There is a "hole" in the UDMF formal grammar which does not technically allow the number 0. This should probably be fixed.
  • The grammar also misspecifies the numerals which are allowed in octal numeric constants (8 and 9 are not valid octal digits).
  • The grammar for floating-point literal tokens differs significantly from the behavior of the C function strtod and this is a pain to various would-be implementors (myself included on this one). Is this necessary? Can it be changed at this point without serious consequences? How compliant are existing implementations with the letter of this part of the specification to begin with?
  • The final one, which I expect to be controversial, is that storage of floating-point vertices in base ten textual format may incur round-off error on write, and potentially also on read, causing potential data decay if a map is processed repeatedly by different tools. Should an alternate, accurate form of expression for floating point literals (say, hex quad word) be considered?
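
To make the first two points concrete, here is a small sketch in C. It assumes the integer production in the published text reads `[+-]?[1-9]+[0-9]* | 0[0-9]+ | 0x[0-9A-Fa-f]+` (quoted from memory, so check it against the spec), and it compares that with one possible correction, `[+-]?[1-9][0-9]* | 0[0-7]* | 0x[0-9A-Fa-f]+`. The correction is only a candidate for discussion, not something that has been agreed on, and both function names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DEC_DIGITS "0123456789"
#define OCT_DIGITS "01234567"
#define HEX_DIGITS "0123456789ABCDEFabcdef"

/* True if s is non-empty and consists only of characters from allowed. */
static bool all_in(const char *s, const char *allowed)
{
    return *s != '\0' && strspn(s, allowed) == strlen(s);
}

/* Production as recalled from the published text (an assumption):
 *   [+-]?[1-9]+[0-9]* | 0[0-9]+ | 0x[0-9A-Fa-f]+                    */
bool matches_published(const char *s)
{
    assert(s != NULL);
    if (s[0] == '0' && s[1] == 'x')
        return all_in(s + 2, HEX_DIGITS);
    if (s[0] == '0')
        return all_in(s + 1, DEC_DIGITS);   /* needs a digit after the 0 */
    if (*s == '+' || *s == '-')
        s++;
    return *s >= '1' && *s <= '9' && all_in(s, DEC_DIGITS);
}

/* Candidate correction (hypothetical, not an agreed revision):
 *   [+-]?[1-9][0-9]* | 0[0-7]* | 0x[0-9A-Fa-f]+                     */
bool matches_candidate(const char *s)
{
    assert(s != NULL);
    if (s[0] == '0' && s[1] == 'x')
        return all_in(s + 2, HEX_DIGITS);
    if (s[0] == '0')
        return s[1] == '\0' || all_in(s + 1, OCT_DIGITS);
    if (*s == '+' || *s == '-')
        s++;
    return *s >= '1' && *s <= '9' && all_in(s, DEC_DIGITS);
}
```

Under those assumptions `matches_published("0")` is false and `matches_published("089")` is true, which are the two defects described above; the candidate production reverses both results.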


Quasar said:

The grammar for floating-point literal tokens differs significantly from the behavior of the C function strtod and this is a pain to various would-be implementors (myself included on this one). Is this necessary? Can it be changed at this point without serious consequences? How compliant are existing implementations with the letter of this part of the specification to begin with?


What precisely is the problem here? Of course the grammar should be compatible with strtod but before making any decisions we need to know precisely what needs to change.
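
One way to pin the difference down is to test strings against both. The sketch below assumes the float production reads `[+-]?[0-9]+'.'[0-9]*([eE][+-]?[0-9]+)?` (again quoted from memory, so verify it against the spec) and compares it with what `strtod` consumes in the default C locale. The function names are invented for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

/* Skip a run of decimal digits and return how many were skipped. */
static size_t skip_digits(const char **p)
{
    size_t n = 0;
    while (isdigit((unsigned char)**p)) {
        (*p)++;
        n++;
    }
    return n;
}

/* Float production as recalled from the spec (an assumption):
 *   [+-]?[0-9]+'.'[0-9]*([eE][+-]?[0-9]+)?                     */
bool matches_udmf_float(const char *s)
{
    assert(s != NULL);
    if (*s == '+' || *s == '-')
        s++;
    if (skip_digits(&s) == 0)
        return false;               /* digits are required before the point */
    if (*s++ != '.')
        return false;               /* the point itself is required */
    skip_digits(&s);
    if (*s == 'e' || *s == 'E') {
        s++;
        if (*s == '+' || *s == '-')
            s++;
        if (skip_digits(&s) == 0)
            return false;
    }
    return *s == '\0';
}

/* True if strtod consumes the whole string as a number. */
bool accepted_by_strtod(const char *s)
{
    char *end;

    assert(s != NULL);
    (void)strtod(s, &end);
    return end != s && *end == '\0';
}
```

With these two functions, `"1.5"` passes both checks, while `"1e3"`, `".5"`, `"inf"` and C99 hex floats such as `"0x1p3"` are taken by `strtod` but rejected by the production as written here. A parser that simply hands the token to `strtod` is therefore more permissive than the grammar.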

The final one, which I expect to be controversial, is that storage of floating-point vertices in base ten textual format may incur round-off error on write, and potentially also on read, causing potential data decay if a map is processed repeatedly by different tools. Should an alternate, accurate form of expression for floating point literals (say, hex quad word) be considered?


Indeed. This was brought up during the initial discussions but completely forgotten. However, if we change it now all existing UDMF tools will break and need an update - which would be a nightmare.

However, the precision of a double value is far higher than that of a Doom fixed point number, so maybe it'd be enough to demand that any tool which writes out a UDMF map it has read keep the required precision (i.e. the value written out must convert to the same fixed point number as the value initially read in) so that values do not decay. In ZDBSP I circumvented this problem by storing the values in text form and writing them out 1:1 unaltered.


Allowing a type of data to be represented in two different ways would break backward compatibility with applications that need to read UDMF data. There is one for which it'd be problematic: Doom Builder 2, which is currently unmaintained (last revision was on May 8; last codeimp commit was in January). ZDBSP and the various source ports would quickly pick it up.

However, there would still be the issue of making implementation more complicated, by having to test the format of the data and interpret it in two different ways.

Applications that merely output UDMF data without having to read it would not need to be updated.

If the new format replaces the old one entirely, instead of being an alternative, then backward compatibility is entirely lost. The existence of many UDMF maps would require backward compatibility anyway, so this is not an option.


I think the requirement that fixed_t -> UDMF -> fixed_t must always yield the same value as output when writing UDMF should be enough for anything - and for that a different representation of the number is not necessary.

At the moment, I think DB2 is the only tool that even writes UDMF output where this may be an issue.


This is why textual transfer formats for inherently numerical data equals fail. Numerous people (myself included) tried to dissuade you from this path long before a formal grammar was even a consideration.


Blah blah!

And binary formats fail because they are inherently non-extensible.

UDMF works great and any half-decent programmer can ensure that this never becomes a problem.


And your point is what exactly? That a flawed design in a supposed universal format is made better by more competent programmers? Who would have thunk it. Add that to your syntax definition :P

Graf Zahl said:

And binary formats fail because they are inherently non-extensible.

Why?

lump1 addr1 [len1]
lump2 addr2 [len2]

addr1:
iDataType1 [iLen1] data1
iDataType2 [iLen2] data2
0xffffffff ...

The textual format gave nothing except the necessity of having a parser for it. Similar to the "problem" of ZDoom's compressed nodes, which is "resolved" now.

entryway said:

Why?

lump1 addr1 [len1]
lump2 addr2 [len2]

addr1:
iDataType1 [len1] data1
iDataType2 [len2] data2
0xffffffff ...



What do you want to say?

And provide me with a flexible, extensible binary format that'll do the trick 5 years down the line without choking. That won't be easy.

As for the UDMF parser, big deal! Aside from handling the different fields, which would be necessary with any binary format as well, all it needs to do is break a line into 3 tokens and convert one into a usable value. Don't tell me that's hard. Even the standard C runtime library has code to do it in less than 10 lines of code.
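
As a rough illustration of that claim, here is a minimal sketch using `sscanf` and `strtod`. It handles only a single `key = number;` line; quoted strings, keywords, comments and blocks are out of scope, and the function name `parse_assignment` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Split one "key = value;" line into its parts and convert the value.
 * Returns 1 on success, 0 if the line does not have that shape.
 * Only numeric values are handled; quoted strings and keywords are not.
 */
int parse_assignment(const char *line, char key[64], double *value)
{
    char text[64];
    char semi;
    char *end;

    assert(line != NULL && key != NULL && value != NULL);
    /* identifier, '=', value text up to the ';' or a space, then the ';' */
    if (sscanf(line, " %63[A-Za-z0-9_] = %63[^; ] %c", key, text, &semi) != 3
        || semi != ';')
        return 0;
    *value = strtod(text, &end);
    /* succeed only if strtod consumed the whole value text */
    return end != text && *end == '\0';
}
```

For the line `x = 64.5;` this fills `key` with `x` and `*value` with 64.5. Note that it inherits the permissiveness of `strtod`, so it accepts some number forms a strict reading of the grammar might not.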

Why do you think that XML is so popular? Certainly not because binary formats are so great...

Graf Zahl said:

What do you want to say?

Probably I do not understand what you mean by "non-extensible".

I meant something like that:

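typedef unsigned char byte;

typedef struct vertex_s
{
  int x, y;
} vertex_t;

typedef enum
{
  vertex_int_x = 0x00001000,
  vertex_int_y = 0x00001001,
} vertex_e;

binary data for vertex1 {0x10, 0x20}:
0x00001000 0x00000010
0x00001001 0x00000020

/* Helpers assumed to exist elsewhere: GetType maps a field tag to its
   storage type; ReadData copies one value (or just skips it when the
   destination is NULL) and advances *data past it. */
int GetType(unsigned int data_type);
void ReadData(byte **data, int udmf_type, void *value);

int ReadVertex(byte **data, vertex_t *v)
{
  int all_data_is_known = 1;

  /* a zero tag ends the field list */
  while (*(unsigned int*)(*data))
  {
    unsigned int data_type = *(unsigned int*)(*data);
    int udmf_type = GetType(data_type);
    *data = *data + sizeof(data_type);
    switch (data_type)
    {
      case vertex_int_y:
        ReadData(data, udmf_type, &v->y);
        break;
      case vertex_int_x:
        ReadData(data, udmf_type, &v->x);
        break;
      default:
        /* unknown field: skip its payload so the next tag lines up */
        ReadData(data, udmf_type, NULL);
        all_data_is_known = 0;
        break;
    }
  }

  return all_data_is_known;
}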
You want to add z now. Then:
typedef struct vertex_s
{
  int x;
  float z;
  int y;
} vertex_t;

typedef enum
{
  vertex_int_x = 0x00001000,
  vertex_int_y = 0x00001001,
  vertex_float_z = 0x00001002,
} vertex_e;

and
      case vertex_float_z:
        ReadData(data, udmf_type, &v->z);
        break;
Is it extensible?

Graf Zahl said:

As for the UDMF parser, big deal! Aside from handling the different fields, which would be necessary with any binary format as well, all it needs to do is break a line into 3 tokens and convert one into a usable value. Don't tell me that's hard. Even the standard C runtime library has code to do it in less than 10 lines of code.

I did not say it is hard. I do not understand why a binary format is non-extensible. For me both are equal in this respect. There is no difference between
if (!strcmp(name, "length")) seg.length = GetIntValue(value);
and
if (*(int*)buf == seg_length) seg.length = GetIntValue(buf+4);


XML is popular for representing data that lends itself to such a format. When it comes to basically tables of numerics like geometry then textual formats like XML hardly get a look in. The main notable exception is perhaps COLLADA (duh) but I wouldn't want to use that for DOOM maps either.


There are plenty of text lumps used for numerical data in Doom ports. All the DeHackEd stuff, for example.


That's not the point. I don't dislike textual formats because of their form, I dislike them when used inappropriately. To date, from what I have seen of UDMF, it would have been better modeled with a combination of BOTH. To me this is square-peg > round-hole territory.

Edit: For the record I'd just like to say that in my day job I spend the majority of my time coding with formats like XML.


So why did the Doom3 engine use text based level lumps then? Even id's programmers saw some sense in it, apparently.


@entryway: Ok, that'd be doable but still, having integer tags for the various fields sounds far too messy to me. Who controls the values? It'd make port specific extensions far too much of a hassle.

Graf Zahl said:

Even id's programmers

Heh. Even they!!

Graf Zahl said:

Who controls the values?

The port author, for port (namespace)-specific data. The UDMF spec, for vanilla Doom data.


There's a little bit of difference between Doom3 and DOOM source ports, Graf. DOOM map data is far from complicated. Heck, in DOOM there are only five fixed-length text strings in all the data structures that comprise a map. How is UDMF a benefit here? Not to mention that this is all packed into a ZIP where you can already combine different file formats to create aggregate formats.

We've been around this community for a long time Graf. I remember the ancient conversations we used to have about Doomsday's virtual filesystem. Back then you didn't even like the idea of ZIP support let alone native files or this kind of stuff.

So don't presume to think you can now champion "modern" practice to me without backing up your arguments. What you are suggesting to do with UDMF now is what its detractors said would happen sooner or later.

@Quasar I wholeheartedly apologize for the derail, this doesn't help fix the format.


Quasar said:
The final one, which I expect to be controversial, is that storage of floating-point vertices in base ten textual format may incur round-off error on write, and potentially also on read, causing potential data decay if a map is processed repeatedly by different tools. Should an alternate, accurate form of expression for floating point literals (say, hex quad word) be considered?

9 decimal digits are sufficient to represent single-precision floating-point values in a binary->decimal->binary conversion, and 17 are sufficient for double-precision. See http://docs.sun.com/source/806-3568/ncg_goldberg.html, especially this paragraph and the section on Binary to Decimal Conversion.
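
A quick way to see the 17-digit claim in practice is the sketch below. It assumes IEEE 754 doubles and a C library whose `printf` and `strtod` round correctly (glibc does, as far as I know); the function name `roundtrips` is made up for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Format d with the given number of significant decimal digits, parse the
 * text back, and report whether the result is bit-for-bit identical.
 */
int roundtrips(double d, int digits)
{
    char text[64];
    double back;
    int n = snprintf(text, sizeof text, "%.*g", digits, d);

    assert(n > 0 && (size_t)n < sizeof text);
    back = strtod(text, NULL);
    return memcmp(&back, &d, sizeof d) == 0;
}
```

Under those assumptions `roundtrips(1.0 / 3.0, 17)` holds while `roundtrips(1.0 / 3.0, 6)` does not, so decimal text only loses data when the writer emits too few digits.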

(I'm not sure how many digits are needed for Doom's 16.16 format because the math gives me too much of a headache right now, but I think the calculations would be similar...)


A binary solution would be a "strongly typed" system like, for example, the PNG format, or to give a gaming example, the ESP/ESM format used by Bethesda Softworks in their last three games (ref). But that is not a perfect solution either. Trying to make it as flexible as text can be can result in a rather complicated format.

An advantage of text is the large array of specialized tools that already exist. I've seen a UDMF map generator written in PHP, for example.

The main drawback is its size. Especially given its nature as a "limit-removing" map format, the filesize can reach tremendous heights. This nearly makes the ability to read compressed files an additional requirement for ports.

DaniJ said:

There's a little bit of difference between Doom3 and DOOM source ports, Graf. DOOM map data is far from complicated. Heck, in DOOM there are only five fixed-length text strings in all the data structures that comprise a map. How is UDMF a benefit here?

Theoretically, said text strings could be no longer fixed-length in UDMF. A sidedef's texture could become (in a port allowing that) something like "Castle\Stonewalls\GraniteSlabs03" instead of "CTSWGRS3". For the modder, it is a lot more practical to be able to give meaningful names to textures and to sort them in folders and subfolders; so this would not just be a useless gadget. Not necessarily the most important feature ever, but it would have its use, and it's an example of something impossible in the existing binary formats.

CODOR said:

(I'm not sure how many digits are needed for Doom's 16.16 format because the math gives me too much of a headache right now, but I think the calculations would be similar...)


Well, the minimal non-null value in 16.16 fixed point is 0.0000152587890625. That's 16 decimal digits.
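
Sixteen fractional digits is what it takes to write that value out exactly. For a fixed_t -> text -> fixed_t round trip far fewer are needed: adjacent 16.16 values are 1/65536 (about 0.0000153) apart, so rounding to 5 fractional digits moves a value by at most 0.000005, which is less than half a step, and rounding back to the nearest 1/65536 recovers the original. A minimal sketch, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef int32_t fixed_t;            /* 16.16 fixed point, as in Doom */
#define FRACUNIT 65536

/* Write a fixed_t as decimal text with 5 fractional digits. */
void fixed_to_text(fixed_t f, char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "%.5f", (double)f / FRACUNIT);

    assert(n > 0 && (size_t)n < out_size);
}

/* Read decimal text back, rounding to the nearest multiple of 1/65536. */
fixed_t text_to_fixed(const char *text)
{
    double scaled = strtod(text, NULL) * FRACUNIT;

    /* round half away from zero; the cast then truncates toward zero */
    return (fixed_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}
```

This only covers values that started life as fixed_t. A port that keeps vertices as doubles internally would want the 17 significant digits discussed above instead.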


That is a fair point, Gez, but compared to the list of drawbacks and caveats it's not much of a benefit. Especially when it's not intended to be human readable/written. I could very easily do the same in the original DOOM format with hashing, plus a companion text lump/file containing the full "pretty" paths, that is supposed to be human-editable. Doomsday implemented something very similar long before I joined the community and before ZIP support: http://dengine.net/dew/index.php?title=DD_DIREC

Gez said:

Theoretically, said text strings could be no longer fixed-length in UDMF. A sidedef's texture could become (in a port allowing that) something like "Castle\Stonewalls\GraniteSlabs03" instead of "CTSWGRS3". For the modder, it is a lot more practical to be able to give meaningful names to textures and to sort them in folders and subfolders; so this would not just be a useless gadget. Not necessarily the most important feature ever, but it would have its use, and it's an example of something impossible in the existing binary formats

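I do not see any reason why a binary format would limit string size. Pseudo-code:

case vertex_t_name:
  ReadData(data, udmf_type, &v->name);

void ReadData(byte **data, int udmf_type, void *value)
{
  switch(udmf_type)
  {
  case type_str:
    /* value points at a char* field; store a copy of the NUL-terminated string */
    *(char**)value = strdup((char*)(*data));
    /* advance past the string and its terminating NUL */
    *data = *data + strlen((char*)(*data)) + 1;
    break;
  ...
  }
}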

DaniJ said:

That is a fair point Gez but compared to the list of drawbacks and caveats...




I consider the drawbacks and caveats of a binary format much more problematic than UDMF as it is.

To each his own, I guess, but I strongly disagree with what you say.

Graf Zahl said:

I consider the drawbacks and caveats of a binary format much more problematic than UDMF as it is.

Enumerate these problems. The only problem I have heard is "binary format is non-extensible", but this argument is wrong. A binary format is extensible in the same way as a textual one.

For me, if you do not plan to edit data with text editors, textual format has only one advantage: it looks nicer. On the other hand that beauty is pointless, you have *10 in size, /2 in speed of parsing and you need to write this parser.

Btw, I agree, that textual format for UDMF is the best choice. Only because it looks nicer.

Graf Zahl said:

I consider the drawbacks and caveats of a binary format much more problematic than UDMF as it is.

To each his own, I guess, but I strongly disagree with what you say.

The reason we disagree here is most likely due to the differing needs of the ports we are involved with. You see, Doomsday has a considerable amount of additional lighting and geometric data which other ports don't, and which we'd like to be stored alongside the map data. It would be horrendously slow to parse this kind of stuff from a UDMF format map (not to mention the sheer size the maps would become).

Or to put it more succinctly; UDMF is an interchange format and what we want is something more along the lines of a cache/upload-friendly format.

entryway said:

For me, if you do not plan to edit data with text editors, textual format has only one advantage: it looks nicer. On the other hand that beauty is pointless, you have *10 in size, /2 in speed of parsing and you need to write this parser.



That's not an issue. The code to read a tagged binary format is hardly less complex than ZDoom's UDMF parser.

As for speed, yes, of course it's slower but what does it matter if loading a map takes 0.2 seconds longer. A considerably larger amount of time is needed to set up all the internal data, mostly the texture precaching.

Concerning size, I don't care about uncompressed file size. Zipped there'd be no significant difference.

In the end I still consider any binary format a dead end. The entire computing world seems to move away from them. Even modern word processors and spreadsheets store their data in text based formats, mostly XML - and these are a lot more bloated than any Doom map could ever be.


Right. Comparing a 3D game engine to a spreadsheet...? Excel doesn't have to upload and transform that data on the GPU every other frame.

XML isn't suitable for everything Graf. Why do you think modern game engines allow importing COLLADA during development but bake everything into binary for production?

DaniJ said:

Right. Comparing a 3D game engine to a spreadsheet...? Excel doesn't have to upload and transform that data on the GPU every other frame.


Huh? Do you see any place where the text is actually stored to be repeatedly parsed? Of course it'd be read once when a level starts and stored in internal data structures.

This statement does not make sense at all...

XML isn't suitable for everything Graf. Why do you think modern game engines allow importing COLLADA during development but bake everything into binary for production?


Size issues? If there is one thing XML has a problem with it's size. And of course the relative complexity of a parser. Guess why UDMF does *NOT* use XML.

With UDMF the largest map I have seen is 9 MB of text. That same map is 1.8 MB in Hexen map format. Parsing such a map takes far less than a second, and to fully set it up (precaching all graphics and such) ZDoom will create more than 20 MB of internal data (GZDoom even 50 due to the larger textures). So why bother about the memory requirements for the map text? We'll need the RAM anyway shortly afterward.


Does base-10 necessarily lose data (for any arbitrary data)? Or is this purely an implementation concern? If the latter, then I think it would be most appropriate to leave it alone. Otherwise, perhaps a base 16 decimal format (0x[0-F]*.[0-F]*) would work (a parsing function for that would be quite trivial, I think)? The concern I see with using a hex qword (assuming you mean to store a double) is that it kind of goes against the point of using a text format to begin with: not locking ourselves into a particular binary representation.
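
For what it's worth, C99 already has a textual base-16 notation along those lines: the `%a` conversion of `printf` writes a hexadecimal mantissa with a power-of-two exponent (something like `0x1.8p+1` for 3.0), and `strtod` reads it back. A minimal sketch follows, assuming a C99 library; the helper names are made up, and nothing here is part of UDMF.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Write d as a C99 hexadecimal floating-point literal (printf "%a"). */
void double_to_hex_text(double d, char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "%a", d);

    assert(n > 0 && (size_t)n < out_size);
}

/* Read a hexadecimal (or decimal) floating-point literal back. */
double hex_text_to_double(const char *text)
{
    return strtod(text, NULL);
}
```

Since every hex digit maps to exactly four bits, no rounding happens in either direction for binary floating-point values. The exact spelling `%a` produces can differ between C libraries, but each of them should read back the others' output.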


Sorry Graf, but you are completely missing the point. We do not want to keep a copy of that data in local memory all the time just so we don't have to parse it again. It's too big and furthermore it's a complete waste of system resources. 70MB of data is nothing compared to what I'm talking about :P


Why are octal and hexadecimal allowed? Do they make any sense for map data?

DavidPH said:

Does base-10 necessarily lose data (for any arbitrary data)?



It depends on how the program works. If done badly it can cause sequential roundoff errors which would add up. Of course, if one assumes that the values will always be processed as fixed point it's something that's easily solved with a little tinkering.

For an unrelated program I have written a floating point parser/writer which does not damage the data beyond the required precision.

I don't think that any changes to the map format are needed to address this. What is needed, though, is some instructions on how to read/write floating point values from/to UDMF.

DaniJ said:

Sorry Graf, but you are completely missing the point. We do not want to keep a copy of that data in local memory all the time just so we don't have to parse it again. It's too big and furthermore it's a complete waste of system resources. 70MB of data is nothing compared to what I'm talking about :P


Out of curiosity, what kind of data would require that much memory that this would become an issue?
