UDMF 1.2 RFC: Heredoc / Text Embedding

Quasar · June 3, 2009

Some of us would like to see a way to extend UDMF with the ability to provide information that does not perfectly adhere to its strict grammar, such as scripting.

We've had some discussion on IRC and on the ZDoom forums in the last couple of days about what the best solution would be. I personally favor a heredoc implementation, like we have in EE for EDF, because it allows the text to be inserted without any translation performed on it, and without the tools involved having to be aware that this translation needs to take place.

CodeImp prefers simply using the string literal mechanism that is in place, and requiring the input to be escaped. Since UDMF is not, after all, primarily designed to be readable, this is acceptable as far as that goes, but does have the problem of requiring the transformation step to be performed while writing and reading the string value. This solution would involve the addition of more character escape values to the standard (ex: \n and \t).

We want to keep this process as democratic as possible, so we are soliciting further opinions. As far as heredocs go, here is a good article on what they are and what they can be used for:
http://en.wikipedia.org/wiki/Heredoc

Let me stress that although I have a personal preference, it isn't going to get in the way of this process, and I'll be happy as long as we come up with a working solution that is technologically clean.

Gez · June 3, 2009

Could the problem of accidentally including the end delimiter within the heredoc be alleviated by making a heredoc part of a sub-block, using "heredoc" as a keyword to identify it? I.e., something like this:

someblock
{
   fieldname = heredoc
{@"
       lorem{ipsum}blah blah blah;
"@}
}

If the heredoc opening and closing tokens aren't immediately next to the block opening and closing tokens, then there's a problem.

someblock
{
   fieldname = heredoc
{@"
       lorem{ipsum}blah blah blah;
       for some reason there's "@ here,
       and then we have some random },
       but hopefully the parser will ignore them
       'cuz they're not right next to each other.
"@}
}

Or would it be too complicated for the syntax?

Quasar · June 3, 2009

UDMF grammar doesn't currently include any idea of a subblock, so that would actually be adding an entirely new concept to the language.

In addition, it doesn't really matter how many characters you add to the heredoc delimiters -- unless they have the property of the Linux-style heredocs shown in the wiki article, where you specify your own closing delimiter, then there is always at least one character sequence which you cannot include inside the heredoc itself.

In your case, you've just changed that sequence from "@ to "@}. You can see how adding characters further won't solve anything. Now you CAN settle on use of a character sequence that is highly unlikely to ever be used for anything. But CodeImp didn't seem impressed with that idea on IRC. :)

One of my suggestions was a hybrid idea where you would have @[char] [char]@, where char is selected to not conflict with any given two-character sequence in the contained string. So for example you could have this:

foobar
{
   somefield = @H "@"@"@"@ H@;
}

This one uses H so that it can contain "@ sequences. The objection offered to this was that it bars use of a string containing @ preceded by one each of every one of the available ASCII characters. Is that sacrifice really too big, is the question? Or is it a sacrifice which should be made because the utility vastly outweighs the shortcoming?

Parsing this hybrid idea still only requires one character of look-ahead, making it much easier to deal with (IMO) than the Linux-style ones where you have an arbitrary-length identifier used as the closing delimiter.
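As a rough sketch of the hybrid scheme described above, assuming the opener is @ plus one chosen character and the closer is that character plus @: the function names read_hybrid_heredoc and pick_delimiter are made up for illustration and are not part of any UDMF spec or tool.

```python
def read_hybrid_heredoc(text, pos):
    """Read an @X ... X@ heredoc starting at text[pos].

    Returns (body, next_pos). Hypothetical helper, not from any spec.
    """
    if pos + 1 >= len(text) or text[pos] != "@":
        raise ValueError("expected '@' followed by a delimiter character")
    delim = text[pos + 1]  # the author-chosen character, e.g. 'H'
    i = pos + 2
    while i + 1 < len(text):
        # One character of look-ahead: the body ends at delim followed by '@'.
        if text[i] == delim and text[i + 1] == "@":
            return text[pos + 2:i], i + 2
        i += 1
    raise ValueError("unterminated heredoc")


def pick_delimiter(body, candidates="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    """Pick a character X such that the sequence 'X@' never occurs in body."""
    for ch in candidates:
        if ch + "@" not in body:
            return ch
    raise ValueError("no usable delimiter among the candidates")


text = 'somefield = @H "@"@"@"@ H@;'
body, end = read_hybrid_heredoc(text, text.index("@"))
print(repr(body))
```

The writer side only has to call something like pick_delimiter once per field; a body that contains X@ for every candidate X is the case the objection is about.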

Graf Zahl · June 3, 2009

Quasar said:
CodeImp prefers simply using the string literal mechanism that is in place, and requiring the input to be escaped. Since UDMF is not, after all, primarily designed to be readable, this is acceptable as far as that goes, but does have the problem of requiring the transformation step to be performed while writing and reading the string value. This solution would involve the addition of more character escape values to the standard (ex: \n and \t).

I agree with CodeImp. This method has the significant advantage that the low level code of the parser does not need to be changed. I'd consider such a requirement a major problem, depending on what code is being used. The current grammar uses only a very small number of tokens so most text parsers are well equipped to handle it. As soon as any kind of complexity is added here some parsers may have to bail out because they can't handle the constructs (e.g. string literals spanning multiple lines of text.)

In comparison, writing an escape character translator is a trivial matter and can be done in place with the parsed string - and it could be implemented with any parser without having to go to the low level.

So any implementation that merely expands the syntax without having to alter the basic grammar is definitely appreciated.

Let's not forget that some tools need to retain the data in as unmodified a form as possible (e.g. node builders) because all they need to do with it is to write it back out.

For example, the UDMF parser in ZDBSP only reads string literals from quotation mark to quotation mark without any kind of interpretation. It does not convert escape characters. The only special treatment it has is to skip over \" sequences in strings.

Sub-blocks are already a major obstacle here and significantly increase the complexity of required maintenance and for more involved constructs it gets even worse. So let's be careful. I don't want to see that there's no UDMF-capable tools because they have problems with data maintenance. The less structure the better.

Quasar · June 3, 2009

Graf Zahl said:
I agree with CodeImp. This method has the significant advantage that the low level code of the parser does not need to be changed. I'd consider such a requirement a major problem, depending on what code is being used. The current grammar uses only a very small number of tokens so most text parsers are well equipped to handle it. As soon as any kind of complexity is added here some parsers may have to bail out because they can't handle the constructs (e.g. string literals spanning multiple lines of text.)

An argument from complexity fails in my opinion, because the option for heredocs that I suggested originally is simpler to parse from a file than a string literal is. You yourself mentioned the need for ZDBSP to skip \" sequences when reading past strings. That style of heredoc doesn't even require that. It only requires scanning for the end token by using one character of look-ahead. Everything inside is purely literal; no translation required.

Graf Zahl said:
In comparison, writing an escape character translator is a trivial matter and can be done in place with the parsed string - and it could be implemented with any parser without having to go to the low level.

But it still requires the translation to be implemented into everything intent on reading the data, and it still renders the data in cases exceptionally ugly. Let's take a look at what a reasonable example might appear like:

thing
{
   user_scriptfield = "var i;\nfor(i = 0; i < 10; i++)\n{\n   ACS.PrintBold(\"I just wanted to say \\\"Hello World!\\\"\");\n}";
}

It will work from a technical point of view, sure.

Graf Zahl said:
Sub-blocks are already a major obstacle here and significantly increase the complexity of required maintenance and for more involved constructs it gets even worse. So let's be careful. I don't want to see that there's no UDMF-capable tools because they have problems with data maintenance. The less structure the better.

I wouldn't worry about those. I'm really not prepared to entertain suggestions that involve subblocks. They're not needed for this extension no matter how we implement it.

SoM · June 3, 2009

I really don't see why this is such a big issue. Any simple state-based text parser can handle this. It only requires one character of read-ahead which is entirely reasonable. The UDMF parser doesn't have to worry about what's between the @" and "@ at all, it can just read it as a block of text.

...so the problem is...?
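For illustration, the "one extra state" claim can be sketched as a tiny scanner that knows three states: default, quoted string, and a heredoc delimited by @" and "@. This is only a sketch under that assumed delimiter pair; scan_values is a hypothetical name, and the scanner ignores everything except the two kinds of value.

```python
def scan_values(text):
    """Yield ('string', raw) and ('heredoc', raw) tokens; skip everything else.

    Hypothetical sketch: three states, one character of look-ahead.
    """
    i, n = 0, len(text)
    state, start = "default", 0
    while i < n:
        ch = text[i]
        nxt = text[i + 1] if i + 1 < n else ""  # the single look-ahead character
        if state == "default":
            if ch == "@" and nxt == '"':
                state, start = "heredoc", i + 2
                i += 2
                continue
            if ch == '"':
                state, start = "string", i + 1
        elif state == "string":
            if ch == "\\" and nxt:
                i += 2  # skip the escaped character without interpreting it
                continue
            if ch == '"':
                yield ("string", text[start:i])
                state = "default"
        elif state == "heredoc":
            if ch == '"' and nxt == "@":
                yield ("heredoc", text[start:i])
                state = "default"
                i += 2
                continue
        i += 1
    if state != "default":
        raise ValueError("unterminated " + state)


sample = 'a = "x\\"y"; b = @"say "hi" {now};"@;'
for kind, raw in scan_values(sample):
    print(kind, repr(raw))
```

Both token kinds come back raw; nothing between the delimiters is translated.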

Quasar · June 3, 2009

BTW, there has never been anything stopping anybody from interpreting \n and \t in strings as linebreaks or tabs already.

The UDMF grammar uses a regular expression for strings which specifies, in effect, the following:

everything between a " and the next " unless that quote is preceded by a \, in which case it is an escaped quote mark, and unless that slash is itself preceded by a \, in which case it is a literal \.

So this allows \\ to mean \, and \" to mean ". But since anything else is allowed, and UDMF enforces no particular interpretation upon \n and \t, those are technically already allowed by the spec.

That basically means that if heredocs are rejected, updating the spec is not technically necessary. We may still wish to do it, though, to *enforce* interpretation of those sequences.
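A pattern in the spirit of that description might look like the following sketch. The exact expression is an assumption written from the prose above rather than copied from the spec, and QUOTED_STRING is just an illustrative name. It shows that \n passes through the tokenizer untouched, so interpreting it is a separate decision.

```python
import re

# Assumed pattern: a quote, then any run of non-quote/non-backslash characters
# or backslash-plus-any-character pairs, then the closing quote.
# re.DOTALL lets a backslash pair include a newline as its second character.
QUOTED_STRING = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"', re.DOTALL)

match = QUOTED_STRING.search(r'x = "a\"b\\";')
print(match.group(1))  # the raw contents, escapes still in place

match = QUOTED_STRING.search(r'msg = "line1\nline2";')
print(match.group(1))  # \n is still a backslash followed by 'n' here
```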

Graf Zahl · June 3, 2009

Quasar said:
An argument from complexity fails in my opinion, because the option for heredocs that I suggested originally is simpler to parse from a file than a string literal is.

No, it is not. This is only true if you work with a dedicated parser or a library that happens to be capable of it. Many parsing tools are not. As long as it is guaranteed that string literals can't span more than one line of text you could even use fgets to parse UDMF.
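To make the fgets point concrete, here is a minimal line-at-a-time reader. It assumes much more than the grammar guarantees: one assignment per line, braces on their own lines, and only // comments. parse_lines is a hypothetical name. Values are kept as raw text, quotes and escapes included.

```python
def parse_lines(lines):
    """Return (globals, blocks) from UDMF-like text laid out one item per line.

    Sketch only: assumes one assignment per line and braces on their own lines.
    """
    top, blocks = {}, []
    current = None   # (block name, fields) while inside a block
    pending = None   # last bare identifier seen, i.e. the upcoming block's name
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("//"):
            continue
        if line == "{":
            current = (pending, {})
        elif line == "}":
            blocks.append(current)
            current = None
        elif "=" in line:
            key, _, value = line.partition("=")
            # Keep the value raw; only the trailing semicolon is dropped.
            value = value.strip().rstrip(";").rstrip()
            target = top if current is None else current[1]
            target[key.strip()] = value
        else:
            pending = line
    return top, blocks


src = 'namespace = "ZDoom";\nthing\n{\n   x = 32.0;\n   user_script = "a = 1;\\nb = 2;";\n}\n'
print(parse_lines(src.splitlines()))
```

A string literal that spans several physical lines would break this kind of reader, which is the concern being raised.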

But it still requires the translation to be implemented into everything intent on reading the data, and it still renders the data in cases exceptionally ugly.

Sorry, I disagree here. 90% of all tools will only read the data to eventually write it back unchanged. And I consider making this an easy task far more important than some minor perceived convenience on your part. Example: Why should something like ZDBSP ever have to bother with some involved syntax? It doesn't do anything with string literals. It reads them literally, including opening and closing quotation marks, and doesn't bother with anything further. Its parser is not geared towards reading unquoted whitespace as valid data (in fact it just skips over it).

Let's take a look at what a reasonable example might appear like:
thing
{
   user_scriptfield = "var i;\nfor(i = 0; i < 10; i++)\n{\n ...
}

So where's the problem? Yes, sure, inside the UDMF text lump it looks 'exceptionally ugly' but when escape characters are replaced during parsing you get exactly what you want.
If we start designing the grammar so that the machine readable data looks nice we are doomed. I don't care about how this looks in the map lump. Under normal circumstances that shouldn't be visible to the user anyway.

And your argument against escape characters doesn't really hold. At least '\n' should be recognized no matter what so that strings with included newlines can be specified in UDMF (e.g. for printing multiline texts.)

So to sum it up:

Your heredocs suggestion: Requires adjustment of the actual parser and may limit the usability of certain libraries. Any tool that wants to read UDMF needs to handle an additional syntax construct.

Escape character support: Requires a postprocessing step of string literals but it doesn't affect the actual parser. If the parser already handles escape sequences it's great but if not it can be done with minimal overhead. Tools that don't need this data won't have anything additional to do. They just read the raw string data and keep it that way for writing back later.

Sorry, but I can't see any advantage in your format at all. It only complicates matters for absolutely no gain beyond some questionable 'beautification' of the code.

Quasar said:
BTW, there has never been anything stopping anybody from interpreting \n and \t in strings as linebreaks or tabs already.

Of course not. But it could be a source of confusion and inconsistency if it is interpreted on a case by case basis.

Quasar · June 3, 2009

So regardless of the rest of it, we should go ahead and add the escape characters. They're useful in any event.

\n and \t are obvious. What else should be supported? EE uses \a to signal a noise in the console, but support for that through UDMF isn't critical obviously.

Graf Zahl · June 3, 2009

Quasar said:
EE uses \a to signal a noise in the console,

You still use that...?

Regardless, the ones that should be supported are \r, \n, \t and \xhh. I don't think the rest is needed.
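A post-processing step for that escape set could be sketched like this. It handles \r, \n, \t and \xhh plus the existing \\ and \", and leaves any other backslash sequence untouched; that last choice is an assumption, since the thread does not say what should happen to unknown escapes. The name unescape is illustrative.

```python
import re

_SIMPLE = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\", '"': '"'}
# A backslash followed by either x plus two hex digits, or any single character.
_ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{2}|.)", re.DOTALL)


def unescape(raw):
    """Translate escape sequences in the raw contents of a string literal."""
    def repl(match):
        body = match.group(1)
        if len(body) == 3:  # the xhh form
            return chr(int(body[1:], 16))
        # Unknown escapes are left exactly as written (an assumption).
        return _SIMPLE.get(body, match.group(0))
    return _ESCAPE.sub(repl, raw)


print(unescape(r"var i;\nfor(i = 0; i < 10; i++)"))
```

The function works on the already-parsed string, so the tokenizer itself does not change.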

andrewj · June 4, 2009

Quasar said:
Since UDMF is not, after all, primarily designed to be readable, this is acceptable as far as that goes, but does have the problem of requiring the transformation step to be performed while writing and reading the string value.

Isn't there already special handling for \" ?

The real question seems to be: just how editable the UDMF data should be.

I don't think users are ever expected to edit the UDMF directly, the primary value of the text format is for programmers to debug it, hence it is OK to have lines which contain a 50000 character string (for example).

So I agree with CodeImp.

Graf Zahl · June 4, 2009

andrewj said:
Isn't there already special handling for \" ?

Yes, there is - and this really is the only special treatment a parser needs if the program is not interested in the string's contents but only in getting them parsed.

The real question seems to be: just how editable the UDMF data should be.

Seriously, not at all. It's just a means to store map data in a form that's open to extensions.

I don't think users are ever expected to edit the UDMF directly, the primary value of the text format is for programmers to debug it, hence it is OK to have lines which contain a 50000 character string (for example).

So I agree with CodeImp.

Of course, with a 50000 character string one would risk buffer overflows in a parser that uses a fixed size array to parse the string into.

ZDoom doesn't do that but I may have to fix this in ZDBSP eventually. Currently the string buffer there is only 4096 bytes. Of course, parsing a 50000 character heredoc field would run into the same problems so it's a non-issue because it needs to be addressed no matter what.

CodeImp · June 4, 2009

I've said it before, but since I asked for a more open discussion I'll have to say it again now that it's here;

I think this is a moot point. You do not write UDMF in a plain text editor so there is absolutely no need to write strings without escaping certain characters.

A second type of string notation in the format is just going to make a parser needlessly more complex. We already have a string notation that works well, can't be broken and the characters " and \ are already escaped, all it needs is just the addition of \n and \t to make it suitable for scripts inside UDMF. We could optionally also add \xhh but that is not even needed.
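On the writing side, an editor would need the inverse step. A minimal sketch, assuming only \\, \", \n and \t are emitted and every other character is written as-is; escape is a hypothetical helper name.

```python
def escape(text):
    """Return text as a quoted string literal with \\, ", newline and tab escaped."""
    out = []
    for ch in text:
        if ch == "\\":
            out.append("\\\\")
        elif ch == '"':
            out.append('\\"')
        elif ch == "\n":
            out.append("\\n")
        elif ch == "\t":
            out.append("\\t")
        else:
            out.append(ch)
    return '"' + "".join(out) + '"'


script = 'Print("a\\b");\n\tdone'
print(escape(script))
```

The result always fits on one physical line, which keeps line-oriented readers working.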

Quasar · June 4, 2009

While readability was never a primary goal in UDMF, to say that we intended it to be completely unreadable is a false assertion.

If that had been true from the get-go, we could have specified a format where everything is stored without any kind of whitespace allowed, we certainly didn't need to bother adding comments, values for many things could have been reduced to single letters (T or F for true or false, for example), braces and equals signs were completely unnecessary, etc.

I designed the basic language that became UDMF with the idea that it should be human-readable. It is based on the same syntax supported by libConfuse, which is meant for human-editable configuration files (like EDF). Not that readability should be a primary concern, but that it would at least be on the scales somewhere. So that's just my personal stance on the matter.

The suggestion of storing scripts in string literals IS the easiest thing to do, I cannot disagree with that; it is self-evident. To me it still doesn't seem like the RIGHT thing to do, but again, just my opinion.

Graf Zahl · June 4, 2009

Quasar said:
The suggestion of storing scripts in string literals IS the easiest thing to do, I cannot disagree with that; it is self-evident. To me it still doesn't seem like the RIGHT thing to do, but again, just my opinion.

Well, to CodeImp and me it doesn't sound right to complicate the parser for no real benefit. Let's face it: Your problem is solvable without redoing the basic syntax so that's what you should do.

If we start tinkering with UDMF each time something is not 'perfect' we will end up with something unusable in a short amount of time. And don't forget that you are forcing anyone working with UDMF to handle your addition - even if they have no use for it.

The alternative is tempting: Refuse loading maps with a namespace that allows constructs that were retroactively added to the grammar and have no use for the engine/tool in question.

SoM · June 4, 2009

Graf Zahl said:
Well, to CodeImp and me it doesn't sound right to complicate the parser for no real benefit.

so one additional state and one byte of read-ahead is complicated? .... really? Like... seriously?

Quasar · June 4, 2009

The alternative is non-compliance. Non-compliance is death of the standard. It should not even be discussed.

Just because you don't like my idea and I don't like yours doesn't mean there's no agreement. I've already conceded that the escape characters will be added.

Since nobody else even gives a shit enough to weigh in here, we'll just consider the discussion on heredocs to be closed. Expect 1.2 spec whenever I feel like getting it written up.

I don't suspect that EE will allow arbitrary scripts in UDMF under this system, because due to the nature of the Aeon API, scripts for EE will be covered under the GPL license, and GPL requires that source code be available in a human-readable format.

Probably what we'll do is encourage only single-line statements that can be sent straight to an eval call. That should keep everyone happy in the end.

Does that sound good enough, or is there still some reason we need to get up in arms over things that are only ideas and suggestions?

CodeImp · June 4, 2009

SoM said:
so one additional state and one byte of read-ahead is complicated? .... really? Like... seriously?

There is more to it than just that, if you want to do it right. But it is not a matter of work, it is a matter of mess.

Quasar said:
GPL requires that source code be available in a human-readable format.

I don't think anyone would blame you if escape characters are used for some of the whitespace. It is certainly not unreadable.

Graf Zahl · June 4, 2009

CodeImp said:
There is more to it than just that, if you want to do it right. But it is not a matter of work, it is a matter of mess.

I think that statement sums it up perfectly.

Plus, I repeat myself: Not all parser code can be changed easily to add support for the new syntax.

andrewj · June 5, 2009

Quasar said:
I don't suspect that EE will allow arbitrary scripts in UDMF under this system, because due to the nature of the Aeon API, scripts for EE will be covered under the GPL license, and GPL requires that source code be available in a human-readable format.

O_o

I don't see the logic there. The script _is_ human readable, but has merely been encoded as a long string and poked into a WAD file.

Quasar · June 5, 2009

Well I should have probably said we will not encourage it, rather than not allow it. It should be able to slide in, since there is an editor that will support the editing. The GPL says that "source code" is defined as the "preferred form" of a work for making modifications to it. That's a bit vague as far as legal wording goes if you ask me. What one might prefer is obviously not what everyone prefers :)

Graf Zahl · June 5, 2009

Quasar said:
The GPL says that "source code" is defined as the "preferred form" of a work for making modifications to it. That's a bit vague as far as legal wording goes if you ask me. What one might prefer is obviously not what everyone prefers :)

I hate legalese.

If that is taken literally it should even be a problem to embed script code into a format that's mainly supposed to be machine readable to begin with. If you wanted to follow this to the letter the script should be a separate lump.

One thing though: Could you please explain the licensing conditions that apply here? I don't understand how any scripting tool can impose a license on the scripts being written - and frankly if someone did I'd steer clear of such tools because it's an attitude I don't like at all.

sirjuddington · June 5, 2009

I'd have to agree with CodeImp etc on this one too. I don't see a good reason to add extra complexity to UDMF parsers for something that's already simple enough to handle with how things currently work.

Quasar · June 5, 2009

Graf Zahl said:
One thing though: Could you please explain the licensing conditions that apply here? I don't understand how any scripting tool can impose a license on the scripts being written - and frankly if someone did I'd steer clear of such tools because it's an attitude I don't like at all.

It's not the license of the scripting engines etc. that will cause this; it's that the API we intend to expose, which we have codenamed "Aeon," will very nearly be as powerful and expressive as linking native code with EE.

This means that making a modification using Aeon will be, in effect, making a new program out of it. With that being the case, we feel that scripts written to use it will automatically fall under the GPL. To say otherwise would effectively punch a hole in the license.

Since our goal has always been to compile scripts at runtime rather than using anything that requires bytecode, this makes very little practical difference to the end user. The scripts already have to be open-source just by the nature of how they work, so this is only really changing one thing - permission for reuse. But you already know my stance on sharing - I think it's good for the community :)

Graf Zahl · June 5, 2009

Ok, since it's all your own code you could easily clarify that portion.

It's not that encoding the script as an escaped string literal makes it any less reusable.

BTW, how far are you with that stuff? It'd be interesting to see how quickly other programmers can handle such a task, seeing that ZDoom's scripting is still going nowhere after all these years (not my fault. I could have had it running 1.5 years ago if there had been a chance of it being included...)

Quasar · June 5, 2009

It's just getting started. I've been working with the SpiderMonkey engine, and we're about to start designing the API to use through it. SpiderMonkey is the ECMAScript engine from Mozilla, and satisfies all but a couple of the criteria I laid out for a good scripting engine previously.

The ones it misses on are save/restore capability for the virtual machine (what data gets saved will have to be determined by the user, though hopefully we can provide some automation for this), and no innate ability to support fixed-point (this should still be doable with methods, but we intend to play down the need for it anyway).

Our previous need to move native code out into it is no longer considered an issue, so I put aside the reasons which previously led to ECMAScript's rejection as a candidate on those grounds. We were about to try to implement a language on our own, but as our requirements kept building, we found that what we were about to do was reinventing the wheel.

Mainly, SoM decided we needed iterators, ala foreach loops on mobj reference arrays. I decided that it would be a cold day in hell before I was ever able to code support for that in a language of my own making :)

CodeImp · June 5, 2009

Foreach loops are nice, they are a bit easier than the usual loops, which is nice for script programmers. They are also less flexible, but most of the time you make a loop, you run through the entire collection from start to end anyway. But yea, I can see it is a bunch of extra work (on top of the lot of work on your own language).

andrewj · June 6, 2009

Quasar said:
Mainly, SoM decided we needed iterators, ala foreach loops on mobj reference arrays. I decided that it would be a cold day in hell before I was ever able to code support for that in a language of my own making :)

Heh.

Quake C has a simple idiom for this. You call an engine function to do some search (e.g. all objects in a certain area), and it returns the first object, and all the rest have been linked through a standard 'next' field.

The downside is that you can only handle one search at a time.
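The idiom and its downside can be mimicked in a few lines. This is a loose Python imitation, not Quake C: the Entity class, the chain attribute standing in for the 'next' field, and find_in_range are all invented for the example.

```python
class Entity:
    def __init__(self, name, x):
        self.name = name
        self.x = x
        self.chain = None  # the single shared 'next' link, reused by every search


def find_in_range(entities, lo, hi):
    """Link all matching entities through .chain and return the first one."""
    head = None
    for ent in reversed(entities):
        if lo <= ent.x <= hi:
            ent.chain = head
            head = ent
    return head


def names(head):
    """Walk a result chain and collect the entity names."""
    out = []
    while head is not None:
        out.append(head.name)
        head = head.chain
    return out


ents = [Entity("a", 1), Entity("b", 5), Entity("c", 9)]
first = find_in_range(ents, 0, 10)
print(names(first))
second = find_in_range(ents, 0, 2)  # reuses the same links
print(names(first))                 # the first result has been overwritten
```

Because every search writes into the same link field, starting a second search while still walking the first one corrupts the first result.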
