
AI research and Doom multiplayer


27 minutes ago, sun_stealer said:

@kb1 I apologize for not being active here; I was quite busy with work and just life in general. :) I will share my comments on what you said once I clear things up.


As a project update: I hit a major roadblock, being unable to train any good policies with memory (LSTMs), and I suspect bugs in the research framework I was using. I ended up rewriting the algorithm in PyTorch from scratch, and now it looks like it's working better. But it took a lot of effort.

No apologies necessary. Please, take your time. For a few posts there, we were going back and forth frequently, so I was a bit thrown when you didn't reply - my bad.


I don't envy you, chasing bugs. It's the type of system where it's tricky to even know that you have bugs, let alone track them down :)

Posted (edited)


Please find my comments below:


> In a reward system, the worst possible action can be bounded at 0, yet the best possible action cannot be bounded (because it's infinite).


In reinforcement learning, the standard approach is to use what is called a "discount factor". When we evaluate whether the action at timestep t was good or bad, we multiply the future reward/penalty received at timestep t+i by a coefficient close to 1 (e.g. 0.999) raised to the power i, i.e. 0.999^i.

The further we look into the future, the less attention we pay to those events. This gives a nice numerical guarantee: the sum of discounted future rewards is always bounded.

This is also a parameter of the algorithm that lets us interpolate between a shortsighted agent (e.g. 0.9) and an agent that cares a lot about the far future (e.g. 0.99999). Although a high discount factor seems attractive, relatively shortsighted agents are much easier to train because the variance of their returns is much lower.

Bottom line: we try to avoid infinities. Whether rewards are positive or negative, the agent's objective is always bounded, even in the infinite-horizon case.
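To make the boundedness concrete, here's a minimal sketch (my own illustration, not the project's code) of a discounted return. With gamma = 0.999 and per-step rewards of at most 1, the discounted sum can never exceed 1 / (1 - 0.999) = 1000, no matter how long the episode runs:

```python
def discounted_return(rewards, gamma=0.999):
    """Sum of gamma**i * r_i over a reward sequence."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

# Even 100,000 timesteps of maximal reward stay under the
# infinite-horizon bound r_max / (1 - gamma) = 1 / 0.001 = 1000.
g = discounted_return([1.0] * 100_000)
assert g < 1000.0
```

With a smaller gamma (say 0.9), rewards more than a few dozen steps away contribute almost nothing, which is exactly the "shortsighted agent" end of the interpolation.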


> If your reward system is mainly only using number of frags, here's the problem: Shots that kill are rewarded, but shots that do damage, but don't kill are not rewarded at all. This makes it very difficult to learn how to use the lesser weapons like fist, pistol, and chainsaw: If a single use doesn't cause a frag, there's no reward for using them, thus no knowledge to learn. Sure, eventually a single use will yield a kill, but this will appear as noise, along with the rare death caused by stepping into a nuke pool.


Currently I am, in fact, using a small reward for dealing damage, for exactly the reason you described. This is called the "credit assignment problem": it is hard for the agent to learn a behavior (dealing damage without killing) whose benefit only appears, say, 300 timesteps in the future.

Still, this is not impossible. There's always a balance between making the task easier by adding more reward shaping and keeping the agent unbiased, so that it cares about nothing except the final objective.

One interesting approach to this is meta-optimization. You can have an inner learning loop that maximizes the shaped reward and an outer loop that tunes the coefficients for rewarding certain events so as to maximize only one final goal: the number of frags. One such algorithm is called Population-Based Training, but there are many.
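As a sketch of that setup (event names and coefficient values here are hypothetical, not the project's actual ones): the inner loop maximizes a weighted sum of per-event rewards, while an outer loop such as Population-Based Training would mutate the coefficients and select by frag count alone:

```python
# Hypothetical event coefficients; an outer loop (e.g. Population-Based
# Training) would tune these, while the inner RL loop maximizes the
# shaped sum. Only "frag" is the true objective.
SHAPING_COEFFS = {
    "frag": 1.0,
    "damage": 0.01,   # small bonus per point of damage dealt
    "death": -0.5,
}

def shaped_reward(events, coeffs=SHAPING_COEFFS):
    """events: dict mapping event name -> amount for this timestep."""
    return sum(coeffs.get(name, 0.0) * amount
               for name, amount in events.items())

# One timestep where the agent dealt 40 damage and scored a frag:
r = shaped_reward({"damage": 40, "frag": 1})  # 0.01*40 + 1.0, about 1.4
```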


> You can actually combine both systems (add up and scale penalties, then scale and subtract rewards...or vice-versa).


Yeah, this is what we're currently doing. There is also a neural network called the critic that predicts the future rewards in any given situation. This prediction is then subtracted from the actual return to produce a quantity called the advantage: if the advantage is positive, we did better than we expected; if it's negative, we did worse.

Advantages are normalized (the mean is subtracted and the result is divided by the standard deviation), so the sign and magnitude of the rewards don't actually matter that much. Only relative quantities matter.
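The normalization step looks something like this (a minimal NumPy sketch of the standard trick, not the project's actual code). Rescaling every reward by a constant leaves the normalized advantages essentially unchanged, which is exactly why only relative magnitudes matter:

```python
import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    """Subtract the batch mean, divide by the standard deviation."""
    adv = np.asarray(advantages, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)

# Scaling every advantage by 10 gives (almost) identical output:
a = normalize_advantages([1.0, 2.0, 3.0, 4.0])
b = normalize_advantages([10.0, 20.0, 30.0, 40.0])
assert np.allclose(a, b)
```

The small `eps` term just guards against division by zero when all advantages in a batch happen to be identical.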

> Gathering data can be challenging with a screen-pixels-only approach. In the case of causing damage, I suppose you could detect blood sprays. You could also read weapon counters, health, armor, and keys by "seeing" the on-screen numbers. Personally, I would dig these stats from within game memory. The stats are by no means hidden from a human...therefore not cheating.


We are feeding some info into the neural network directly: ammo, health, selected weapon, etc. But only information that's available to the player through the GUI.


> The scoring can also be a combination of accumulative stats, as well as stats altered during this tic.

A very truncated example score system:
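A minimal sketch of what such a system could look like (the stat names and weights below are hypothetical, mixing accumulated stats with per-tic deltas as described):

```python
# Hypothetical weights: "frags" accumulates over the match, "health"
# and "armor" are current values, "damage_this_tic" is a per-tic delta.
WEIGHTS = {
    "frags": 100.0,
    "health": 1.0,
    "armor": 0.5,
    "damage_this_tic": 2.0,
}

def score(stats):
    """stats: dict mapping stat name -> value for the current tic."""
    return sum(WEIGHTS[name] * stats.get(name, 0) for name in WEIGHTS)

s = score({"frags": 3, "health": 80, "damage_this_tic": 15})
# 100*3 + 1*80 + 0.5*0 + 2*15 = 410.0
```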


Your code actually looks rather similar to mine :)


> In closing, the concept I'm trying to advise on is that the scoring function is how you instill desire onto the AI. The more stats you can feed it, the better. Each of these stats should have weights assigned to them, that let the AI intelligently choose the best action, within a list of bad choices. The magnitude of such weights is not that important - what is important is the relative magnitude of one weight vs. the other. Some empirical testing will help you adjust those weights - it's not very difficult once you get into it.


Yes, this is not far from my approach.

Although my philosophy is that by shaping the reward too finely we can instill undesirable human bias, e.g. discourage behaviors that would otherwise help maximize the number of frags. What if shooting at a wall in a particular situation scares away the opponent and lets us get to a health pickup? Just an example.


There's an interesting take on this in the recently released DeepMind podcast, where they talk about their bots for the Capture the Flag game. The author of the paper says the bots aren't necessarily "smarter" than human opponents at all times, but they are completely ruthless and relentless. E.g. if a bot is holding the flag and running past an opponent, it won't even turn to look if it knows it can score. It does not care about absolutely anything else except the objective, and this allows the bots to be highly efficient and actually extremely hard to beat.



Sadly, there's no project update at this time. We decided to focus on some theoretical questions we encountered while working on the project, namely how agents learn to use memory with recurrent neural networks, etc.


I am planning to go back to this project later. :)


I was thinking again about porting the VizDoom functionality into a client-server version of the game. Having the ability to actually run a 24/7 server with AI bots would make the project so much cooler, and it gives me motivation to work on it.


I looked at the VizDoom codebase, and the difference between ZDoom and VizDoom is actually not that scary; quite manageable.

Also, I compared Zandronum and Odamex. Zandronum is based on a much newer version of ZDoom, around 2.8.1, just like VizDoom. Therefore certain files have a 1-to-1 correspondence, which makes the porting much easier.

On the other hand, it looks like Odamex is based on an older version, and the differences are larger.


Do I actually gain anything by using Odamex over Zandronum? I can emulate "classic" Doom gameplay in Zandronum by configuring my server to restrict vertical mouse movement, jumps, etc., right?

What would you guys recommend using? Are there any other things I didn't consider?


@Fonze @Maes @Doomkid @Decay


BTW, I trained some policies based on recurrent neural nets, and this version of the bot is pretty insane: https://www.youtube.com/watch?v=Lk8OWLVGpVM



There are substantial differences between Odamex and Zandronum.


Zand is far more advanced and has many new features that Odamex lacks, as Odamex was based on ZDoom 1.22 from a few centuries ago. One strong advantage Odamex has over Zandronum is that it can record demo files that are compatible with the vanilla Doom executables; Zandronum can't do that. If you want to mimic vanilla gameplay, Odamex is going to be better overall, but Zandronum has all sorts of crazy stuff like skins, new game modes, OpenGL rendering, 3D floors and polyobjects, etc. Odamex does have some other random cool features that Zand lacks, such as multi-WAD rotations for servers and a better RCON system, but these are limited-use features for most users (apparently).


Long story short, if using Zandronum is easier, do that. They're both very widely cross-compatible, so that shouldn't really be an issue either!


The obvious answer is Zandronum, because it's easier for you to port and it's the most widely played client/server port. However, I personally would love to see this in Odamex, because it currently has no bot support at all and desperately needs it.


Both :)

16 hours ago, Hekksy said:

The obvious answer is Zandronum, because it's easier for you to port and it's the most widely played client/server port. However, I personally would love to see this in Odamex, because it currently has no bot support at all and desperately needs it.


Both :)

Ackshually, I went back to working on bots for Odamex. They are the ZCajun bots, so expect them to be garbage at first, though they really aren't that hard to improve. It just takes a little bit of patience and a couple of hours.


In Odamex, it would be rather hard to use the bots, since you would have to open multiple instances; I imagine having them all visible on-screen at the same time, which is obviously very impractical and not really comfortable at all.

On 9/10/2019 at 9:45 AM, Avoozl said:

I am surprised to see that this thread isn't by GoatLord.


It doesn't involve rectally insertable computers (AFAIK), so why would it be?


Bear with me when I bring this up..

But the old Doom Legacy version with "ACBOT" was actually a pretty good bot for its time, for deathmatch anyway. ~ Co-op, not so good.

One of them was supposedly based on my movements, node movement somehow; I'm not sure how TonyD did that. Anyway, the bot's name was "YOMOMMA". ~ As he asked what I wanted to call the bot.



