printz

UDMF TEXTMAP format questions

I'm trying to implement UDMF support in one of my projects, and I have some questions about the base UDMF specifications:

1. Is 1.1 the latest specs version for vanilla/Boom? http://www.doomworld.com/eternity/engine/stuff/udmf11.txt

2. Can strings include newline characters (carriage return and/or linefeed)?

3. Are backslash escape sequences specified? I'm asking because I see that the sequence \" is already supported in the quoted_string regex "([^"\\]*(\\.[^"\\]*)*)"
When reading a string containing \", should that sequence be decoded into " ?
What other escape sequences are expected?
How should lone or invalid backslashes be handled?

4. Can I use literal UTF-8 characters in strings, or do I have to encode them into escape sequences?

5. Can you recommend some reusable code or library that parses a TEXTMAP into a data structure? Without validating the semantics, just the syntax.

printz said:

I'm trying to implement UDMF support in one of my projects, and I have some questions about the base UDMF specifications:

1. Is 1.1 the latest specs version for vanilla/Boom? http://www.doomworld.com/eternity/engine/stuff/udmf11.txt


Yes, it is.

2. Can strings include newline characters (carriage return and/or linefeed)?


That's not part of the spec. ZDoom supports anything it supports in HUD messages because it uses the same function to convert escape sequences.

3. Are backslash escape sequences specified? I'm asking because I see that the sequence \" is already supported in the quoted_string regex "([^"\\]*(\\.[^"\\]*)*)"
When reading a string containing \", should that sequence be decoded into " ?
What other escape sequences are expected?
How should lone or invalid backslashes be handled?


Again not part of the spec. The only requirement is that the string must be written back exactly the same as it was read when developing a tool.
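A minimal Python sketch of that round-trip requirement: match the token with the spec's quoted_string regex and keep the raw text between the quotes untouched, so a tool can write it back exactly as read. The function names are made up for illustration, and decoding \" and \\ for display is an assumption, since the spec defines no escape sequences.

```python
import re

# quoted_string regex from the UDMF 1.1 grammar.
QUOTED_STRING = re.compile(r'"([^"\\]*(\\.[^"\\]*)*)"')

def read_string(token):
    """Return (raw, display) for a quoted string token, or None if it does not match."""
    m = QUOTED_STRING.fullmatch(token)
    if m is None:
        return None
    raw = m.group(1)
    # Assumption: only \" and \\ are decoded for display; other backslashes stay as-is.
    display = re.sub(r'\\(["\\])', r'\1', raw)
    return raw, display

def write_string(raw):
    """Write the raw text back exactly as it was read."""
    return '"' + raw + '"'
```

Keeping the raw text as the stored value and treating the decoded form as a view means lone or unknown backslash sequences never need to be "fixed" on save.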

4. Can I use literal UTF-8 characters in strings, or do I have to encode them into escape sequences?


Neither. ZDoom, as the only port currently supporting UDMF, does not have any Unicode support, so if you output UTF-8 or any other form of Unicode it'll result in garbage.

5. Can you recommend some reusable code or library that parses a TEXTMAP into a data structure? Without validating the semantics, just the syntax.


No library but if you want something that just reads and writes back, ZDBSP may be your best option to start.
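For a syntax-only reader, something along these lines may be enough to start with. This is a rough Python sketch, not code from ZDBSP: the token patterns are adapted from the UDMF 1.1 grammar, comment handling is simplified, "=" is excluded from bare values, and every value is kept as raw text without being checked against the integer/float rules. The names tokenize and parse are made up for the example.

```python
import re

# Token patterns adapted from the UDMF 1.1 grammar (simplified for this sketch).
TOKEN = re.compile(r'''
    (?P<ws>\s+|//[^\n]*|/\*[\s\S]*?\*/)
  | (?P<string>"[^"\\]*(?:\\.[^"\\]*)*")
  | (?P<punct>[{}=;])
  | (?P<word>[^{}();"'\s=]+)
''', re.VERBOSE)

def tokenize(text):
    """Yield (kind, text) pairs, skipping whitespace and comments."""
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        pos = m.end()
        if m.lastgroup != "ws":
            yield m.lastgroup, m.group()

def parse(text):
    """Parse TEXTMAP text into (global_fields, blocks). Syntax only."""
    toks = list(tokenize(text))
    i = 0
    top, blocks = {}, []

    def expect(kind, value=None):
        nonlocal i
        if i >= len(toks) or toks[i][0] != kind or (value is not None and toks[i][1] != value):
            raise SyntaxError(f"expected {value or kind} at token {i}")
        i += 1
        return toks[i - 1][1]

    def assignment(target):
        # identifier '=' value ';'
        nonlocal i
        key = expect("word")
        expect("punct", "=")
        if i >= len(toks) or toks[i][0] not in ("word", "string"):
            raise SyntaxError(f"expected a value at token {i}")
        target[key.lower()] = toks[i][1]  # values are kept as raw text
        i += 1
        expect("punct", ";")

    while i < len(toks):
        if i + 1 < len(toks) and toks[i + 1] == ("punct", "{"):
            # identifier '{' assignments '}'
            name = expect("word")
            expect("punct", "{")
            fields = {}
            while i < len(toks) and toks[i] != ("punct", "}"):
                assignment(fields)
            expect("punct", "}")
            blocks.append((name.lower(), fields))
        else:
            assignment(top)
    return top, blocks
```

Blocks come back as a list of (name, fields) pairs in file order, which preserves the implicit indices that linedefs and sidedefs use to refer to vertices and sectors.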


Cool, thanks, good to know I have free rein in some of these situations. Since I'm making a textmap reader, not a writer, I can support reading UTF-8 and other encodings that don't violate the regex.


I've noticed an oddity in the integer regular expression:

integer := [+-]?[1-9]+[0-9]* | 0[0-9]+ | 0x[0-9A-Fa-f]+
Value 0 is not matched.
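This is easy to check by transcribing the rule into a Python regex (whitespace around the "|" removed) and requiring a full match. The helper name is only for illustration.

```python
import re

# integer rule from the UDMF 1.1 grammar, transcribed as written.
INTEGER = re.compile(r'[+-]?[1-9]+[0-9]*|0[0-9]+|0x[0-9A-Fa-f]+')

def is_integer(text):
    """True if the whole text matches the spec's integer rule."""
    return INTEGER.fullmatch(text) is not None

for sample in ("0", "00", "0x0", "10", "-5"):
    print(sample, is_integer(sample))
```

Only the plain "0" fails; "00", "0x0", "10" and "-5" all match.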

printz said:

4. Can I use literal UTF-8 characters in strings, or do I have to encode them into escape sequences?


In ReMooD I plan for UDMF to read as UTF-8 since there is no defined character format for UDMF.

printz said:

I've noticed an oddity in the integer regular expression:

integer := [+-]?[1-9]+[0-9]* | 0[0-9]+ | 0x[0-9A-Fa-f]+
Value 0 is not matched.


Well, most values default to zero, so why would you need to write it explicitly?

printz said:

I've noticed an oddity in the integer regular expression:

integer := [+-]?[1-9]+[0-9]* | 0[0-9]+ | 0x[0-9A-Fa-f]+
Value 0 is not matched.


You can have 00 or 0x0, though.

I think the first should be:
[+-]?[1-9]*[0-9]+
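A quick Python check of that suggestion, with the first alternative swapped for the proposed one and the other two left as in the spec. This is just a test of the regex as posted here, not an official correction.

```python
import re

# Spec rule with the first alternative replaced by the proposal above.
PROPOSED = re.compile(r'[+-]?[1-9]*[0-9]+|0[0-9]+|0x[0-9A-Fa-f]+')

def is_integer_proposed(text):
    """True if the whole text matches the proposed integer rule."""
    return PROPOSED.fullmatch(text) is not None
```

With this change "0", "-0" and "100" all match, while "1.5" and the empty string are still rejected.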

GhostlyDeath said:

In ReMooD I plan for UDMF to read as UTF-8 since there is no defined character format for UDMF.



Then you need your own namespace. It is true that the base spec does not define this but you cannot just impose UTF-8 on it without consensus.

Graf Zahl said:

Then you need your own namespace. It is true that the base spec does not define this but you cannot just impose UTF-8 on it without consensus.


Sadly, ZDoom is also imposing ASCII on the specification without consensus.

The only alternative would be to force ASCII, except in these cases:

  • There is a UTF-8 BOM.
  • The file is UTF-16.
  • The file is UTF-32.
I would also be highly uncomfortable relying on the system's character encoding: what happens if the user uses EBCDIC for some reason?

ReMooD has been Unicode capable for years. It was for a while backed by UTF-8 and UTF-16, then transitioned to UTF-8, and now it is back to UTF-8 and UTF-16 (due to Java Strings).
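The "ASCII unless a BOM says otherwise" rule from the list above could look roughly like this in Python. It is a sketch only: the function names are invented, and UTF-16 or UTF-32 files without a BOM are not detected at all.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)

def guess_encoding(data):
    """Pick a codec from the BOM, falling back to strict ASCII."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return "ascii"

def decode_textmap(data):
    """Decode TEXTMAP bytes; the chosen codecs strip the BOM themselves."""
    return data.decode(guess_encoding(data))
```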

GhostlyDeath said:

Sadly, ZDoom is also imposing ASCII on the specification without consensus.

The only alternative would be to force ASCII, except in these cases:

  • There is a UTF-8 BOM.
  • The file is UTF-16.
  • The file is UTF-32.
I would also be highly uncomfortable relying on the system's character encoding: what happens if the user uses EBCDIC for some reason?

ReMooD has been Unicode capable for years. It was for a while backed by UTF-8 and UTF-16, then transitioned to UTF-8, and now it is back to UTF-8 and UTF-16 (due to Java Strings).


No, that won't work. External tools will choke on these. The only solution I see is that unless some spec explicitly defines an encoding, one must be specified.

For the base spec the only string fields are user variables anyway, so right now it's not an issue yet, and besides - no editing support for these exists if I'm not mistaken (all UDMF-capable editors use Hexen-format specials exclusively at the moment)

But it needs to be defined EXPLICITLY, you cannot impose your own agenda here just because it suits you.
For the ZDoom namespaces I just made a change to explicitly specify ISO 8859-1 as the required encoding. IF ZDoom ever gets Unicode support it will have to specify new namespaces for this in order to prevent ambiguities.

This definitely needs some input from other source port maintainers who are interested in UDMF.

But here are the limitations that need to be obeyed so that all existing tools don't start to choke:

1. No BOM allowed
2. No 16 or 32 bit encoding allowed
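A loader that enforces these two limits for a ZDoom-namespace map might be sketched like this in Python. The function name is hypothetical, and treating any NUL byte as a sign of a 16- or 32-bit encoding is a heuristic assumed for the example, not something the spec states.

```python
import codecs

_BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE,
         codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)

def load_zdoom_textmap(data):
    """Decode TEXTMAP bytes as ISO 8859-1, rejecting BOMs and wide encodings."""
    if data.startswith(_BOMS):
        raise ValueError("TEXTMAP must not start with a byte order mark")
    # Heuristic (assumption): NUL bytes suggest a 16- or 32-bit encoding.
    if b"\x00" in data:
        raise ValueError("TEXTMAP looks like a 16- or 32-bit encoding")
    # ISO 8859-1 maps every byte to one character, so decoding cannot fail.
    return data.decode("iso-8859-1")
```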

Graf Zahl said:

No, that won't work. External tools will choke on these. The only solution I see is that unless some spec explicitly defines an encoding, one must be specified.

For the base spec the only string fields are user variables anyway, so right now it's not an issue yet, and besides - no editing support for these exists if I'm not mistaken (all UDMF-capable editors use Hexen-format specials exclusively at the moment)

But it needs to be defined EXPLICITLY, you cannot impose your own agenda here just because it suits you.
For the ZDoom namespaces I just made a change to explicitly specify ISO 8859-1 as the required encoding. IF ZDoom ever gets Unicode support it will have to specify new namespaces for this in order to prevent ambiguities.

This definitely needs some input from other source port maintainers who are interested in UDMF.

But here are the limitations that need to be obeyed so that all existing tools don't start to choke:

1. No BOM allowed
2. No 16 or 32 bit encoding allowed


Fair enough. For the ReMooD namespaces (one for each of Doom, Heretic, and Hexen) I shall just force UTF-8. This will complicate decoding, since the namespace has to be read before the decoder can switch encodings. However, that is not much of a problem in Java.

But yes, a UDMF 1.2 would have to specify these things. The 1.2 spec should also permit a "0" value legally based on the regex.

Another alternative is to have a global key such as:

character_encoding = "utf-8";
Seeing that UDMF does support user supplied fields, we could actually come to a non-standard addition agreement.

ReMooD itself for ReMooD specific fields will have prefixed items "remood_", but they would only be activated if under the ReMooD namespace.
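Reading the namespace first and switching encodings afterwards could be sketched like this (in Python rather than Java, to keep it short). The sketch assumes the namespace statement is the very first thing in the lump with no comments before it; the name "remooddoom" and the table itself are purely hypothetical, apart from the ISO 8859-1 choice for ZDoom mentioned earlier in the thread.

```python
import re

# Illustrative table only; namespace names are compared in lowercase.
NAMESPACE_ENCODINGS = {
    "zdoom": "iso-8859-1",   # encoding stated for the ZDoom namespaces in this thread
    "remooddoom": "utf-8",   # hypothetical ReMooD namespace name
}
DEFAULT_ENCODING = "ascii"   # assumption for namespaces that define nothing

# Works on bytes, because the encoding is not known yet.
NAMESPACE_RE = re.compile(rb'\s*namespace\s*=\s*"([^"\\]*)"\s*;', re.IGNORECASE)

def decode_by_namespace(data):
    """Return (namespace, decoded_text) using the encoding tied to the namespace."""
    m = NAMESPACE_RE.match(data)
    if m is None:
        raise ValueError("TEXTMAP does not start with a namespace statement")
    namespace = m.group(1).decode("ascii").lower()
    encoding = NAMESPACE_ENCODINGS.get(namespace, DEFAULT_ENCODING)
    return namespace, data.decode(encoding)
```

Because the namespace statement itself is plain ASCII in every encoding listed, it can be matched on the raw bytes before any decoding happens.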

Graf Zahl said:

For the base spec the only string fields are user variables anyway, so right now it's not an issue yet, and besides - no editing support for these exists if I'm not mistaken (all UDMF-capable editors use Hexen-format specials exclusively at the moment)

I'm not sure what you mean by that, but GZDB supports named scripts for the various ACS_Execute specials.


The regexp errors in the spec were noted years ago, and I stated that, where they were in conflict, they should be taken to represent the inputs that are accepted by the C library strtol/atoi/strtod functions. Nobody could agree to the publishing of a new spec at the time, so no formal correction was ever posted.

Eternity will be adding UDMF support soon, it will be strictly ASCII. Character values of >= 127 have internally defined purposes in EE, we will not be changing that or making exceptions to it for UDMF. UTF-8 in strings outside port specific namespaces should be taken as having undefined behavior.

Quasar said:

Eternity will be adding UDMF support soon, it will be strictly ASCII. Character values of >= 127 have internally defined purposes in EE, we will not be changing that or making exceptions to it for UDMF. UTF-8 in strings outside port specific namespaces should be taken as having undefined behavior.


Are those character values that are 127 and up only effective in the EE namespace?

Quasar said:

Eternity will be adding UDMF support soon, it will be strictly ASCII.




Out of curiosity, what character codes do have special meaning and what are they for?
To be honest, I do not understand why this is so important to keep that it blocks international language support, but well, it's your decision.


So, in any case, this definitely means we need to strictly specify, for every namespace, which encoding it needs to use so that parsers can act accordingly.

For the ZDoom namespaces I already took preemptive action yesterday and made ISO 8859-1 the mandatory encoding. Of course, Unicode support will eventually be added to ZDoom, but if that happens we'll simply define a new ZDoomUTF8 namespace, and the problem is solved (outside of editors, of course.)

But if ASCII is specified for a specific namespace, it should be strictly enforced, either by stripping out everything that does not belong or by entirely rejecting bogus strings. I do not like 'undefined behavior'. That's an open invitation to modders to do it wrong.

Concerning the Regexp, feel free to change it as you deem proper. The spec should be corrected if it's broken now.

Graf Zahl said:

Out of curiosity, what character codes do have special meaning and what are they for?

I think EE's text printing routines interpret a byte 0x80 + CR_* as a formatting code. Text colour changes etc.

RjY said:

I think EE's text printing routines interpret a byte 0x80 + CR_* as a formatting code. Text colour changes etc.



Thanks. Yes, that's it.

But still, considering the low number of EE projects out there, is there really any reason to set this in stone, instead of using just one of these values as an escape character and then encoding the actual color as a second one in the string? It not only would free up the entire >= 160 range of printable international characters, but would also allow far more flexibility with expanding the color feature later.

If we ever decide to define a common Eternity/ZDoom namespace with shared advanced features this will need to get resolved anyway so that both ports can read the texts and convert the control character sequence. Why not do it properly right away?

Believe it or not, there are people out there who create Doom mods in languages other than English. In ZDoom all they need to do is to provide a font with the required characters - here they are stuck.

(Oh, and Unicode is definitely on the table for ZDoom - sooner or later - maybe if Randy becomes active again...)

Graf Zahl said:

But still, considering the low number of EE projects out there, is there really any reason to set this in stone, instead of using just one of these values as an escape character and then encoding the actual color as a second one in the string? It not only would free up the entire >= 160 range of printable international characters, but would also allow far more flexibility with expanding the color feature later.

If we ever decide to define a common Eternity/ZDoom namespace with shared advanced features this will need to get resolved anyway so that both ports can read the texts and convert the control character sequence. Why not do it properly right away?

RjY said:

I think EE's text printing routines interpret a byte 0x80 + CR_* as a formatting code. Text colour changes etc.


ReMooD supports text colors for example. Instead of using ASCII characters I instead use Unicode characters. Why Unicode? Because it has private ranges where I can use the stuff however I want to. ReMooD itself uses the 0xF100-0xF1FF range for its special formatting and colors. If you wanted to type the stuff in game, ReMooD translates '{?' into these special sequences. Basically if the character is not 'x' it is treated as a single character which acts as a single base36 character addition to 0xF100. Otherwise if the character is 'x', two more hex digits are read which allow one to use the entire sequence. Currently most of the table is not allocated but I would not be limiting it to just color choices, now with Java I might be able to get actual font rendering if a bitmap character does not exist and instead fallback to a font installed on the system. This of course would be able to be used with bold, italics, and underline and such.
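As a rough illustration of the translation described above (not ReMooD's actual code), here is a Python sketch: '{' followed by one base36 character maps to 0xF100 plus that digit, and '{x' followed by two hex digits reaches the whole 0xF100-0xF1FF range. Leaving malformed sequences as literal text is an assumption made for the example.

```python
import string

PRIVATE_BASE = 0xF100  # start of the private-use range mentioned above

def translate_markup(text):
    """Replace '{c' and '{xHH' sequences with private-use code points."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "{" and i + 1 < len(text):
            code = text[i + 1]
            if code == "x":
                # '{x' is followed by exactly two hex digits.
                digits = text[i + 2:i + 4]
                if len(digits) == 2 and all(c in string.hexdigits for c in digits):
                    out.append(chr(PRIVATE_BASE + int(digits, 16)))
                    i += 4
                    continue
            elif code.isascii() and code.isalnum():
                # A single base36 digit (0-9, a-z) added to the base.
                out.append(chr(PRIVATE_BASE + int(code, 36)))
                i += 2
                continue
        # Anything else, including malformed sequences, is copied literally.
        out.append(ch)
        i += 1
    return "".join(out)
```

For instance, translate_markup("{1hi") yields the code point U+F101 followed by "hi".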

The current color table is: http://remood.org/images/remood_20120620_textcolors.png.

Rather simplistic choices but ReMooD only has 16 player colors to choose from and these escape sequences permit me to colorize players by their skin color in messages (such as when they die).


That'd be a good option if you have Unicode (which neither ZDoom nor Eternity use at the moment).

But it shows even more need to properly define how to use colors in UDMF strings instead of letting each port cook up its own incompatible method.

Sadly this is an issue that was grossly overlooked when UDMF was defined.


Other grammar regular expression errors and oversights noticed during a discussion on IRC:

integer := [+-]?[1-9]+[0-9]* | 0[0-9]+ | 0x[0-9A-Fa-f]+
Besides my mildly pedantic notice above about 0 not being matched, there's also the problem that assumed octal values (as from the second variant) will accept digits 8 and 9.
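The octal problem can be shown in a few lines of Python, assuming the second variant really is meant to be octal: the regex accepts "089", but reading it in base 8 fails.

```python
import re

# Second alternative of the integer rule, the presumed octal form.
OCTAL_VARIANT = re.compile(r'0[0-9]+')

def read_octal(text):
    """Read text as base 8; raises ValueError on digits 8 or 9."""
    return int(text, 8)

print(OCTAL_VARIANT.fullmatch("089") is not None)  # True: the grammar accepts it
print(read_octal("017"))                           # 15
```

Calling read_octal("089") raises ValueError, so a token the grammar accepts has no valid octal reading. Restricting the digit class to [0-7] would remove the mismatch, though that is a suggestion and not part of the spec.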
keyword := [^{}();"'\n\t ]+
This will allow keywords with characters that appear as tokens elsewhere in the syntax, such as equal (=). This makes tokenization difficult because expressions like a=b=c; will become valid UDMF, where the correct components would be a, =, b=c, making the parsing context sensitive.
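The keyword problem shows up as soon as the rule is run greedily over such input. A small Python check follows; the second pattern, which also excludes "=", is only a suggested tightening and not spec text.

```python
import re

# keyword rule from the UDMF 1.1 grammar, transcribed as written.
KEYWORD = re.compile(r'''[^{}();"'\n\t ]+''')
# Assumed fix for illustration: also exclude '=' from keywords.
KEYWORD_NO_EQUALS = re.compile(r'''[^{}();"'\n\t =]+''')

def first_token(pattern, text):
    """Return the greedy match of pattern at the start of text."""
    return pattern.match(text).group()

print(first_token(KEYWORD, "a=b=c;"))            # a=b=c
print(first_token(KEYWORD_NO_EQUALS, "a=b=c;"))  # a
```

With the spec's rule the whole "a=b=c" is swallowed as a single token, so the tokenizer has to know whether it is currently expecting an identifier or a value. With "=" excluded, tokenization no longer depends on the parser state.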


I intend to offer a 1.2 draft specification soon which will fix all this stuff, and hopefully doesn't exact actual requirements on any existing implementors...

fraggle said:


Well actually changing everything to wchar_t was a huge pain and I eventually just got tired of doing it so I had this hybrid UTF-16/32 and UTF-8 thing going on until I removed all of the UTF-16/32 wchar_t code and had a straight UTF-8 codebase. So yes, you were correct. When I decided to just use UTF-8 I did remember this conversation. It was faster to go to UTF-8 (removal of L and change some types) than it was to add L and types.

However, today I know better and am a far superior programmer than I was 6 years ago.

Reading that and correlating with recent events, that log makes me feel like scifista.

Ironically, my decisive incorrect decision then for UTF-16 is correct now for Java (since char is 16-bits unsigned, and Strings are made of chars; adding a "byte"-string would break everything and make everything slow because it would have to be converted to String before it is used with anything requiring String).

