Closing Wad Archive

16 hours ago, peido said:

Hi.

 

I have had an idea that could maybe be done in the future to improve the archive. I'm not saying it is easy, nor am I saying it is possible - I don't know; this is just an idea I had right before falling asleep in bed.

The idea is to have an AI complete every wad (it doesn't have to be a speedrun) and save the demo. Then, if the archive had a demo player, people could more easily preview obscure wads before downloading them.

 

Now that I have written the idea down, it really seems kind of a dumb idea, but maybe it will become feasible in the future as AI improves.

Don't forget, the WAD Archive - er - archive is just that: It is a backup and data dump of the original wad-archive.com site on archive.org. See first post by @WadArchive for details.

 

I took it upon myself to see what I could do to partially resurrect it - details in my posts in this thread and here - but essentially, the archived data is 255 (I think) archives of 3 to 4 GIGABYTES each, which contain many thousands of zipped and compressed WADs/PK3s and images, with the associated texts stored in the JSON MongoDB dumps.

 

In order to do what you suggest, you would need to sequentially extract the WAD or PK3 files (stored under UUID-based filenames), get your AI (which one, BTW?) to play and record the demo, and then automatically save the demo back into the archive files under a corresponding UUID filename. Now, this would be quite difficult, as the archive files are HUGE anyway, and inserting more data may well exceed the maximum file size. Note however that the Python library I used to access the individual files within the large archives does not need to extract the contents in order to work with them. Additionally, you would need to modify the schema of the metadata (as provided in the JSON files) so you could find the demos after saving them. I would strongly suggest importing all the metadata JSON files into a Mongo database, as that will greatly assist when playing around with the data.
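
For what it's worth, that non-extracting access looks something like this - an untested sketch assuming the big dumps open as standard ZIPs (the archive and member names below are made up):

import zipfile

OUTER = "wadarchive-000.zip"                  # hypothetical dump filename
MEMBER = "00/0123456789abcdef01234567.zip"    # hypothetical UUID-keyed inner zip

with zipfile.ZipFile(OUTER) as outer:
    # Read just this member; the rest of the multi-gigabyte archive stays on disk.
    with outer.open(MEMBER) as inner:
        data = inner.read()

print(f"{MEMBER}: {len(data)} bytes")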

 

I suspect that the original wad-archive site served this archived data from a database (possibly MongoDB as well), which would make this more straightforward. Now, I never even attempted to create a database with a terabyte of data, but that is certainly possible if you have the hardware.

 

I would suggest you start by using the AI to access a single WAD file on your filesystem, play and record a demo, and get it to save back to the filesystem. You would of course need an API key for the AI so you can access it programmatically.
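
The very first step - a single programmatic request to an AI service - might look something like this. It's only a sketch using the OpenAI Python client; any API you have a key for would do, and the model name is just an example:

import os
from openai import OpenAI

# Assumes OPENAI_API_KEY is set in the environment.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Name one classic Doom WAD."}],
)
print(response.choices[0].message.content)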

 

It does sound like an interesting project, go for it!

1 hour ago, peido said:

I understand, the amount of work would be huge :S

Indeed it would.

 

If you are interested in doing something programmatic with this, I do suggest starting simple, and seeing if you can get your AI to respond to a simple programmatic request (not necessarily to play a WAD of course). 

 

I did find these though, and they might get you going:

 

>>> EDIT 2 <<<

ViZDoom definitely looks like the solution here. It hooks into OpenAI Gym for reinforcement learning (rather than using an LLM). There is a lot to read and learn, but I think it has the potential to achieve what you want to do:

 

https://vizdoom.cs.put.edu.pl/tutorial

 

 

>>>EDIT<<<

A Python doom-playing lib:

 

https://github.com/Farama-Foundation/ViZDoom

 

This looks really cool actually.

 

 

This is for OG Doom (well, FreeDoom). It's very in-depth, Python-based:

 

May need an account for this one:

https://medium.com/@james.liangyy/playing-doom-with-deep-reinforcement-learning-e55ce84e2930

 

A summary, may have useful links too.

https://www.cmu.edu/news/stories/archives/2016/september/AI-agent-survives-doom.html

Edited by smeghammer


Right, I got the ViZDoom package installed and running in a venv. I had a quick play and I think this package is most certainly a good place to start.

 

I did this:

The file `basic.py` starts a basic game run, without any learning built in, and the code is documented as it goes.
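
Condensed, that example boils down to something like the loop below - a rough sketch rather than the actual file, using ViZDoom's documented API (point load_config at wherever the bundled scenario configs live on your system):

import random
import vizdoom as vzd

game = vzd.DoomGame()
game.load_config("basic.cfg")   # the 'basic' scenario config that ships with ViZDoom
game.init()

# The basic scenario exposes three buttons: move left, move right, attack.
actions = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

game.new_episode()
while not game.is_episode_finished():
    state = game.get_state()                          # screen buffer, game variables, etc.
    reward = game.make_action(random.choice(actions))

print("Total reward:", game.get_total_reward())
game.close()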

 

I then ran `learning_pytorch.py` - which DOES need PyTorch (torch), and that pulls in packages totalling about 1 GB. This script ran (about 10 mins for me) calculating moves, and then replayed the games. The test WAD is just a square room with one monster, but the point is to train the AI player to kill monsters, move around, hit switches, exit the level etc. based on some input rules and iterative feedback. I do think, based on my quick tests at least, that this or one of the other scripts will be the place to start pulling out the archived WADs and playing them.

 

HOWEVER, the point of this code is to train the model to actually play a map or maps. I suspect working out how to train it, rather than getting it to load and play a map, will be the time-consuming bit.

 

Full list and description of these scripts: https://github.com/Farama-Foundation/ViZDoom/blob/master/examples/python/README.md

 

>>>  EDIT  <<<

The place to start with PyTorch is probably the official docs. Very involved and ML-focused. The full end-to-end starter tutorial uses a fashion image dataset! (https://github.com/zalandoresearch/fashion-mnist) and ends up making a prediction.
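
The opening steps of that quickstart look roughly like this (just a sketch; it downloads the FashionMNIST data via torchvision):

from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# Downloads the dataset used by the official PyTorch quickstart tutorial.
training_data = datasets.FashionMNIST(root="data", train=True, download=True, transform=ToTensor())
loader = DataLoader(training_data, batch_size=64, shuffle=True)

images, labels = next(iter(loader))
print(images.shape)   # torch.Size([64, 1, 28, 28])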

Edited by smeghammer


I created a GitHub Pages clone of the archive last year. It pulls most of its data straight from the archive.org data dump and the rest of it is hosted by GitHub; there's no database and it costs me nothing to run, so it shouldn't go down.

 

It's similar to smeghammer's web app (which I wasn't aware existed at the time), but is implemented completely differently.

 

https://wadarchive-browser.github.io/

(Note: you need a relatively recent browser to access the mirror.)

 

Screenshots:


 

 


How do I actually download the torrent? The torrent on the Internet Archive site only produces about 500 MB of data, not a single wad to be found. I have a bunch of spare HDDs out of old NAS devices and would very much like to download the entire archive (mostly because I'm a data hoarder), instead of hammering the idgames FTP mirrors (which is also pretty slow).

 

This is what is inside the wadarchive_archive.torrent file, opened with QBittorrent:

[screenshot of the torrent's file listing in qBittorrent]

 

And no, before anyone asks, the torrent linked in the comment is the identical file, with the same contents. How do I go about doing this? A problem with my torrent client, perhaps? Or am I going to have to manually download every individual 4 GB chunk or so? What's the best approach here, assuming unlimited bandwidth and storage?


@ObserverOfTime yes, each 4 GB chunk is a compressed archive of encoded WAD data, essentially indexed by UUIDs. I got them all and made this:

You will need the big files locally - they are not in the repo as they are too big. The above python app will give you a front end for local archive files once you have them.

 

And I also made a JavaScript-based front end to get at the archive content directly, from my website at:

 

https://www.smeghammer.co.uk/wad-archive/

 

Basically, if you want more background on all this, check my posts here and in the linked thread. Happy to answer any questions.

 

 

 


@smeghammer Thank you for a solution, but I ended up just generating a list of download URLs with a little batch script (yuck) that I can feed to my download manager. Here's the list in case anyone else is wondering; I am not sure whether every link is valid, but I will know by the end of today.
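
For reference, roughly the same thing can be done in Python with the internetarchive package - sketch only, and the item identifier below is a guess, so substitute whatever the real wad-archive item on archive.org is actually called:

from internetarchive import get_item

ITEM_ID = "wadarchive"   # hypothetical identifier - check the actual archive.org item

item = get_item(ITEM_ID)
with open("urls.txt", "w") as out:
    for f in item.files:
        out.write(f"https://archive.org/download/{ITEM_ID}/{f['name']}\n")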

 

14 hours ago, smeghammer said:

@ObserverOfTime yes, each 4 GB chunk is a compressed archive of encoded WAD data, essentially indexed by UUIDs. I got them all and made this:

You will need the big files locally - they are not in the repo as they are too big. The above python app will give you a front end for local archive files once you have them. 

 

Final edit: I just figured out that I need to set up the database myself. I have to admit I am not sure which file to populate the database with, or how to go about it. A quick rundown would be appreciated (essentially, what file to import once I have the database up and running).

 

Absolute last edit: Yes, just create a new database called "wadarchive" using MongoDB Compass, then import the .json files as new collections under the same name as the filename and you're good to go. I finally figured it out! Now to finish downloading the last 1/2TB or so of data...
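
For anyone who would rather script that import instead of clicking through Compass, a rough pymongo equivalent - this assumes the dumps are newline-delimited JSON as produced by mongoexport (adjust if they turn out to be plain JSON arrays), and that the .json files sit in a local "metadata" directory:

from pathlib import Path
from bson.json_util import loads   # handles mongoexport's extended JSON ($oid etc.)
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["wadarchive"]

# One collection per .json file, named after the file (filenames.json -> filenames).
for path in Path("metadata").glob("*.json"):
    docs = [loads(line) for line in path.open() if line.strip()]
    if docs:
        db[path.stem].insert_many(docs)
        print(f"{path.name}: {len(docs)} documents")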

Edited by ObserverOfTime : formating, details, own stupidity, can't read, etc. etc.

1 hour ago, ObserverOfTime said:

 

Absolute last edit: Yes, just create a new database called "wadarchive" using MongoDB Compass, then import the .json files as new collections under the same name as the filename and you're good to go. I finally figured it out! Now to finish downloading the last 1/2TB or so of data...

 

That's it :-)

 

It is just shy of 4TB of archives IIRC.

 

Each JSON file in the download actually relates to a separate mongo collection - there's a load of additional metadata in there, if you feel like reverse-engineering it. I got as far as three collections I think, though my python thing just uses filenames.json. 

 

Just to confirm - you don't need to use GridFS to hold the big data file content. As far as I could work out, it was all designed to be accessed from the filesystem and the outer archives by file key: the first two hex characters (00 to ff) give the directory, and a roughly 24-character UUID is the zip filename inside the top-level big archive, so all the Mongo collection documents are keyed on UUIDs. I may have made a crude entity relationship diagram - I'll post it here if I find it...
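
If I've remembered that layout right, mapping a UUID from the Mongo documents to its location inside a dump is a one-liner - a sketch, with the exact directory split and suffix to be confirmed against the real data:

def archive_path(uuid: str) -> str:
    # First two hex chars pick the directory; the UUID names the inner zip.
    return f"{uuid[:2]}/{uuid}.zip"

print(archive_path("9a3d7932dc8efa478f9a5b30fa5de83a"))   # -> 9a/9a3d7932dc8efa478f9a5b30fa5de83a.zip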

 

Have fun!

26 minutes ago, smeghammer said:

 

That's it :-)

 

It is just shy of 4TB of archives IIRC.

 

Each JSON file in the download actually relates to a separate mongo collection - there's a load of additional metadata in there, if you feel like reverse-engineering it. I got as far as three collections I think, though my python thing just uses filenames.json. 

 

Just to confirm - you don't need to use GridFS to hold the big data file content. As far as I could work out, it was all designed to be accessed from the filesystem and the outer archives by file key: the first two hex characters (00 to ff) give the directory, and a roughly 24-character UUID is the zip filename inside the top-level big archive, so all the Mongo collection documents are keyed on UUIDs. I may have made a crude entity relationship diagram - I'll post it here if I find it...

 

Have fun!

 

4TB? Is that the unzipped size? The whole archives, zipped up, appear to be just shy of 1TB, and I have zero ambitions to unzip all of that. If the download turns out to be 4TB I need a different HDD. It works well enough (screenshots, readmes, automap pictures, etc.) the way it is for what it is, so I don't think I will do much reverse-engineering haha. The only thing I might look into "adjusting" is the file listing:

 

[screenshot of the web app's paginated file listing]

 

I have a lot more screen real estate to use and would like about twice as many files listed per page. Maybe I also want to be able to search by GUID in addition to file name, but it's easy enough to just open up the JSON and Ctrl+F for now. Other than that, she's working 👍.

 

Edit: I'm a dummy, figured out the page size.

Edited by ObserverOfTime


Nice! Really pleased someone else has got it working!

 

Yeah, I have some work to do to sort by filename rather than by UUID (it's currently sorting on the UUID instead of the metadata filename). There needs to be some intermediate logic to map UUID to filename via the metadata Mongo collection, and then sort in the front end. RL has taken over though...

 

Have you checked out the other front end I made on my site?

 

https://www.smeghammer.co.uk/wad-archive/

 

It's all client-side (so no mapped filenames) but I think it has a better UI.

 

>>> EDIT <<<

Turns out it is trivial to sort by filename. IIRC, I originally planned to sort by filename but was impatient, so I switched to _id so I could get it working with just a few big archives.

 

The Mongo collection field to sort by is the first element of the filenames array, so in database.py, line 20, do this:

# 'page_data' : list(self.db['filenames'].find(filter).sort('_id',1).skip(page_size * page_num).limit(page_size))
'page_data' : list(self.db['filenames'].find(filter).sort('filenames.0',1).skip(page_size * page_num).limit(page_size)),

Note that filenames[0] is undefined in about 40 cases (the filenames field is a zero-length array, and there doesn't seem to be a simple lookup by UUID available), so I plan to parse the readme text when filenames[0] is undefined, because it is reasonably consistent - there is usually a title entry like so:

Title : H2H Czech Series 2 (H2HCZEK2.WAD)

and a bit of regex parsing should get at the mapname.

 

It is worth doing, because the sort order is currently pulling back the entries where filenames.0 is undefined first...
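
Something like this should pull the title out - an untested sketch (the regex is mine, and the spacing around the colon varies between readmes):

import re

def title_from_readme(readme_text):
    # Pull the "Title :" line out of an idgames-style text file; returns None if absent.
    match = re.search(r"^\s*Title\s*:\s*(.+)$", readme_text, re.MULTILINE | re.IGNORECASE)
    return match.group(1).strip() if match else None

print(title_from_readme("Title                   : H2H Czech Series 2 (H2HCZEK2.WAD)"))
# -> H2H Czech Series 2 (H2HCZEK2.WAD)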

 

Edited by smeghammer

15 hours ago, ObserverOfTime said:

4TB? Is that the unzipped size? The whole archives, zipped up, appear to be just shy of 1TB

My bad. Yes, it is about 1TB in total.

8 hours ago, smeghammer said:

My bad. Yes, it is about 1TB in total.

 

That's good news.

 

Since I have you here, how do I expose the web server to other computers on my network? Currently the web server is only reachable from the machine it runs on (no firewalls involved). I reckon this is a default setting of the web server for security reasons, and I am not the least bit familiar with configuring a Flask instance. I'm looking to make the server reachable from my LAN's subnet only, if that helps at all. I figure I may as well ask, since you are more familiar with the code than I am.

 

Either way, I'm happy to report that it's up and running without problems. Love the map previews, screenshots, readmes, etc. That's some pretty cool stuff, and I'm amazed you took the time and effort to create something dedicated to this exact purpose. The Doom community never ceases to amaze.


The Flask app is running in dev mode ATM, though you should be able to configure it to listen on an external interface if you need to:

https://stackoverflow.com/questions/7023052/configure-flask-dev-server-to-be-visible-across-the-network
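
tl;dr - if the app is started with app.run(), pass it a host (the CLI equivalent is flask run --host=0.0.0.0). A minimal, self-contained example of the idea; it's still the dev server, so only expose it on a network you trust:

from flask import Flask

app = Flask(__name__)

if __name__ == "__main__":
    # host="0.0.0.0" makes the dev server listen on all interfaces,
    # so other machines on the LAN can reach it.
    app.run(host="0.0.0.0", port=5000)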

 

I tend to have the dev application local to my dev machine, but have the Mongo database sitting on a headless server somewhere. You will need to configure Mongo to accept remote connections if you do that, though:

https://www.techrepublic.com/article/how-to-enable-mongodb-remote/

 

tldr; - amend mongod.conf to have

 

net:
  port: 27017
  bindIp: 0.0.0.0

 


@ObserverOfTime - I updated my app to sort by filename (plus a few other QoL fixes):

 

https://github.com/smeghammer/wad-archive

 

On 3/3/2024 at 8:39 PM, ObserverOfTime said:

Love the map previews, screenshots, readmes, etc.

 

Thanks man! 

 

Though I'm piggy-backing off @WadArchive's original work of course.

All that stuff is in the database and/or filesystem of the archive dump, along with a whole bunch of other data about the internal structure of each map/WAD, which I haven't really looked into. IRL and all that...

7 hours ago, smeghammer said:

@ObserverOfTime - I updated my app to sort by filename (plus a few other QoL fixes):

 

Very nice, I love the little spinner thing that shows up when you search for something, it's a nice touch. I also dig the new font for the header, the old font was a tad on the large side.

 

While looking at entries with no pictures works as well as it did before, anything that has a lot of pictures (like valiant.wad, for example) just seems to vomit an endless base64-encoded string to the console, and it takes ages to load the associated entries as a result. It is almost as if it prints the entire images for the entry, base64-encoded, to the console. Here is just a very, very small sample of such an event ("Electric Nightmare" here being the name of one of the images for the maps in valiant.wad):

 


{'file': 'MAP21.png', 'b64': 'iVBORw0KGgoAAAANSUhEUgAABAAAAAMACAMAAACN.......[many tens of thousands of characters omitted for brevity].......Mz8x8jepTBfwicLLnQNLG64kdP4gb8IBWafFK7FUAQjhHAAhhCoAmQWo6ROnbtQwri7govH3TZfKYvb6bcb56aVp4kGxfQClvjZtzwBV581YDQ+yoqJxao/R9LtHpApACOcACCGzyP8PbXvVgVPWTWQAAAAUdEVYdFNvZnR3YXJlAFpEb29tIDIuOC4x/7QofAAAAABJRU5ErkJggg==', 'nicename': 'Electric Nightmare'}]}}
127.0.0.1 - - [10/Mar/2024 12:09:44] "GET /app/file/details/9a3d7932dc8efa478f9a5b30fa5de83a4fecb4bf HTTP/1.1" 200 -

 

How can I decrease the verbosity of the console output? It seems to me that the speed at which the application can print to my console is now a bottleneck. Either way I'm glad and thankful for your continued development of the wadarchive. Thanks man, you rock!

8 minutes ago, ObserverOfTime said:

While looking at entries with no pictures works as well as it did before, anything that has a lot of pictures (like valiant.wad, for example) just seems to vomit an endless base64-encoded string to the console, and it takes ages to load the associated entries as a result. It is almost as if it prints the entire images for the entry, base64-encoded, to the console.

 

Good call re the console output - I'll get rid of that. The recent changes shouldn't have touched the image loading logic though.

 

I just tried loading the valiant page - yeah, it does take a while, but I don't see any broken images. I'll keep an eye out for it. Feel free to raise a bug on the GitHub repo though.

 

The way I have done it is to get the binary data from the source filesystem, and use the converted B64 data as the image source:

 

<img 
     title="MAP09.png" 
     src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABAAAAAMACAIAAAA12IJaAAAABGdBTUEAALGOfPtRkwAAgA..." 
     id="currentimage_SCREENSHOTS">

And I forgot to remove the logging.  

 

See line 94 of middleware.py. I did it like this because the image itself is inside the big archive file, so I could not link to it directly with an ordinary image URL.
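
The conversion itself is tiny - a sketch, assuming the raw PNG bytes have already been read out of the big archive (names are illustrative, not the actual middleware code):

import base64

def png_to_data_uri(png_bytes: bytes) -> str:
    # Convert raw PNG bytes into a data: URI usable directly as an <img> src.
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{b64}"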

 

I'm exclusively using Chrome, if that makes a difference?


Oh no sorry for the miscommunication, nothing is broken per se, all of the images load eventually. It's just that clicking on an entry with many images takes ~30s to load, presumably because of the maximum console printing speed, and that clicking on a few entries with images one after the other causes unduly long load times, because everything gets printed to console.

 

If you could let me know what I need to comment out or change to prevent the base64 string for images to be printed to console I'm sure it would be as fast as it has ever been, and everything will be alright. I will take any further actual issues over to github though, it's probably a better place to discuss it. Cheers mate.

4 minutes ago, ObserverOfTime said:

Oh no sorry for the miscommunication, nothing is broken per se, all of the images load eventually. It's just that clicking on an entry with many images takes ~30s to load, presumably because of the maximum console printing speed, and that clicking on a few entries with images one after the other causes unduly long load times, because everything gets printed to console.

 

If you could let me know what I need to comment out or change to prevent the base64 string for images to be printed to console I'm sure it would be as fast as it has ever been, and everything will be alright. I will take any further actual issues over to github though, it's probably a better place to discuss it. Cheers mate.

 

No probs :-)

 

Look at middleware.py:details() on line 42. That prints the base 64 string to the terminal. There are other print statements too that you can get rid of. I'll update the repo later on anyway. Gotta put the kids to bed now :-)

1 hour ago, smeghammer said:

Look at middleware.py:details() on line 42. That prints the base 64 string to the terminal.

 

Bingo - commenting that print statement out did the trick; entries with images now load, for all intents and purposes, instantly. Thanks for working it out for me.

