Revisiting the T1 crash issue

Everything about ToME 2.x.x. No spoilers, please

Moderator: Moderator

Message
Author
Lord Estraven
Uruivellas
Posts: 718
Joined: Tue Dec 13, 2005 12:35 am

Revisiting the T1 crash issue

#1 Post by Lord Estraven »

Approaching this again with a bit more programming experience, I figured out the issue... get_spell_level is returning a huge number, > 10^8. get_spell_level also returns a byte... Thus, a buffer overflow, which any modern OS will observe and punish.

Now I just have to figure out how to fix this.

I can see a few problems here:

1. For some reason, spell_level[realm][spell] is always something huge. This makes little sense, as spell levels are wiped at birth. I've got the code right here:

Code: Select all

        for (i = 0; i < MAX_REALM; i++)
	{
                for (j = 0; j < 64; j++)
                {
                        spell_level[i][j] = 0;
                }
	}
}
But it's returning something humongous anyway. No idea why but that's the main issue.

2. get_spell_level returns a byte. Why? That strikes me as inviting overflow issues.

3. Now that I notice, bytes and 16-bit integers are being used all over the place. Isn't this a little bit reckless? Granted that I come from the wonderful land of Java and 32-bit integers, I don't see the point in using those.

AnonymousHero
Spiderkin
Posts: 482
Joined: Sat Mar 18, 2006 12:48 pm

Re: Revisiting the T1 crash issue

#2 Post by AnonymousHero »

Are the spell_level values always wrong immediately after starting? Or only after saving and reloading?

Do you mean "spell_level[realm][spell]" literally, and if so, do both the "realm" and "spell" variables start at 0? Is even spell_level[0][0] corrupted?

Is the spell_level array declared with the appropriate dimensions?

There are several general techniques you can try to find memory corruption problems:

1) Try compiling with "-Wall" and see where warnings pop up. They sometimes point to severe problems, especially if you're looking at .
2) Try compiling with Clang + "-Wall" and check warnings. Clang sometimes flags things GCC doesn't catch. (Plus it generally has much better error messages.) If you're using Ubuntu (e.g.) installing "Clang" should be as easy as "aptitude install clang".
3) You can try one of the graphical debuggers out there and set a breakpoint somewhere appropriate to inspect values interactively; this is sometimes very helpful for understanding what's going on.
4) Try using Valgrind. I haven't really even used it myself, so I'm not really sure how user-friendly it is; YMMV.

As to the bit about "byte/int16/etc.". These days I think you're probably right and most of that should just be plain "int" or "long"; it's hardly worth saving a byte here and there to risk introducing implicit conversion errors.

Lord Estraven
Uruivellas
Posts: 718
Joined: Tue Dec 13, 2005 12:35 am

Re: Revisiting the T1 crash issue

#3 Post by Lord Estraven »

AnonymousHero wrote:Are the spell_level values always wrong immediately after starting? Or only after saving and reloading?
Immediately after starting.
Do you mean "spell_level[realm][spell]" literally, and if so, do both the "realm" and "spell" variables start at 0? Is even spell_level[0][0] corrupted?
Yes, yes, and apparently yes.
Is the spell_level array declared with the appropriate dimensions?
Hmm, looks like it is...
There are several general techniques you can try to find memory corruption problems:

1) Try compiling with "-Wall" and see where warnings pop up. They sometimes point to severe problems, especially if you're looking at .
Okay, though keep in mind T1 is already producing huge numbers of warnings.
2) Try compiling with Clang + "-Wall" and check warnings. Clang sometimes flags things GCC doesn't catch. (Plus it generally has much better error messages.) If you're using Ubuntu (e.g.) installing "Clang" should be as easy as "aptitude install clang".
No easy access to Clang unfortunately.
3) You can try one of the graphical debuggers out there and set a breakpoint somewhere appropriate to inspect values interactively; this is sometimes very helpful for understanding what's going on.
Okay...
4) Try using Valgrind. I haven't really even used it myself, so I'm not really sure how user-friendly it is; YMMV.
I'll check that out too then.
As to the bit about "byte/int16/etc.". These days I think you're probably right and most of that should just be plain "int" or "long"; it's hardly worth saving a byte here and there to risk introducing implicit conversion errors.
Like I said, bytes and 16-bit ints are used all over the place... How, if at all, do you think I should go about replacing them?

Lord Estraven
Uruivellas
Posts: 718
Joined: Tue Dec 13, 2005 12:35 am

Re: Revisiting the T1 crash issue

#4 Post by Lord Estraven »

Okay Valgrind's output is most interesting... There seem to be a lot bad stuff happening, lots of "Conditional jump or move depends on uninitialized value(s)" memory errors.

The crash problem is here though. (vstrnfmt in z-form.c.)

Code: Select all

==3638== Invalid read of size 1
==3638==    at 0x41DEA6D: vfprintf (in /lib/libc-2.11.1.so)
==3638==    by 0x41FDFFB: vsprintf (in /lib/libc-2.11.1.so)
==3638==    by 0x41E5A4A: sprintf (in /lib/libc-2.11.1.so)
==3638==    by 0x814E754: vstrnfmt (z-form.c:567)
==3638==    by 0x814E8F3: strnfmt (z-form.c:695)
==3638==    by 0x80A6D41: get_spell (cmd5.c:268)
==3638==    by 0x80B84DE: do_cmd_cast (cmd5.c:8037)
==3638==    by 0x805C7FA: process_command (dungeon.c:3741)
==3638==    by 0x806200A: dungeon (dungeon.c:4503)
==3638==    by 0x8062A58: play_game (dungeon.c:5418)
==3638==    by 0x804AEC3: main (main.c:783)
==3638==  Address 0x1 is not stack'd, malloc'd or (recently) free'd
tome: software bug 99 33
And the summary:

Code: Select all

==3638== 
==3638== ERROR SUMMARY: 154 errors from 11 contexts (suppressed: 0 from 0)
==3638== malloc/free: in use at exit: 3,243,752 bytes in 5,697 blocks.
==3638== malloc/free: 7,629 allocs, 1,932 frees, 3,601,812 bytes allocated.
==3638== For counts of detected errors, rerun with: -v
==3638== Use --track-origins=yes to see where uninitialised values come from
==3638== searching for pointers to 5,697 not-freed blocks.
==3638== checked 3,582,892 bytes.
==3638== 
==3638== LEAK SUMMARY:
==3638==    definitely lost: 241 bytes in 14 blocks.
==3638==      possibly lost: 0 bytes in 0 blocks.
==3638==    still reachable: 3,243,511 bytes in 5,683 blocks.
==3638==         suppressed: 0 bytes in 0 blocks.
==3638== Rerun with --leak-check=full to see details of leaked memory.

AnonymousHero
Spiderkin
Posts: 482
Joined: Sat Mar 18, 2006 12:48 pm

Re: Revisiting the T1 crash issue

#5 Post by AnonymousHero »

Are you compiling on a 32-bit or a 64-bit machine?

EDIT: The line from cmd5.c:

Code: Select all

                        strnfmt(tmp_val, 78, "Level %d %s %s (%d mana, %d%% fail)? ",
                                prompt, get_spell_level(realm, spell), spell_names[realm][spell%64][0],
                                s_ptr->smana, spell_chance(spell,realm));
There's some very weird addressing going on with the spell_names array, so if you change its dimensions, you're probably in trouble.

Anyway, what's probably going wrong is that the format string is using "%d" to print, and get_spell_level() is returning a "byte". Try changing the above to

Code: Select all

                        strnfmt(tmp_val, 78, "Level %d %s %s (%d mana, %d%% fail)? ",
                                prompt, (int) get_spell_level(realm, spell), spell_names[realm][spell%64][0],
                                s_ptr->smana, (int) spell_chance(spell,realm));
(I've inserted two "(int)" )

My advice: Run (don't walk) away from that codebase. Very far away :).

Lord Estraven
Uruivellas
Posts: 718
Joined: Tue Dec 13, 2005 12:35 am

Re: Revisiting the T1 crash issue

#6 Post by Lord Estraven »

Hmm. Maybe I will run away. :lol:

I have a sudden insane desire to rewrite the whole thing in Perl...

No, that won't work at all. Gods, what a tangled mess.

AnonymousHero
Spiderkin
Posts: 482
Joined: Sat Mar 18, 2006 12:48 pm

Re: Revisiting the T1 crash issue

#7 Post by AnonymousHero »

Just for the heck of it, I did some stats on the ToME 2 code. It turns out that the C source is "only" 298K lines... of which 60K lines are main-*.c (whitespace and comments included). There's also 9K lines of Lua. Given a typical $HLPL-to-C line count ratio of a factor of 3 to 5, I don't think a conversion to $HLPL would be entirely out of the question, especially if HLPL were something C-like where you could "paste the C code and fix" code as you went along.

(HLPL: Higher-Level Programming Language)

A lot of the annoying parts of the C code (string formatting, info file handling, etc.) could be translated reasonably easily and made much more sane, but it'd be incredibly tedious to not have a working "intermediate" product until you'd ported all of the code.

I am certainly NOT going do it... but I am (extremely slowly) writing a roguelike in Haskell (yeah!) at the moment. I'm not sure how much it'll resemble ToME, but I'm using ToME as the main inspiration.

Oh well, one can dream :).

EDIT: Oh, and there's 33K lines of generated C-Lua interfacing code. That can also be discounted. So the total "lines-to-port" would be roughly 200K lines of C.
Last edited by AnonymousHero on Thu Sep 30, 2010 6:28 pm, edited 1 time in total.

darkgod
Master of Eyal
Posts: 10750
Joined: Wed Jul 24, 2002 9:26 pm
Location: Angolwen
Contact:

Re: Revisiting the T1 crash issue

#8 Post by darkgod »

Use TE4 !
You have many things already implemented this way ;)
[tome] joylove: You can't just release an expansion like one would release a Kraken XD
--
[tome] phantomfrettchen: your ability not to tease anyone is simply stunning ;)

AnonymousHero
Spiderkin
Posts: 482
Joined: Sat Mar 18, 2006 12:48 pm

Re: Revisiting the T1 crash issue

#9 Post by AnonymousHero »

Isn't TE4 based on Lua? If so, no thanks. I prefer my errors at compile time :).

Lord Estraven
Uruivellas
Posts: 718
Joined: Tue Dec 13, 2005 12:35 am

Re: Revisiting the T1 crash issue

#10 Post by Lord Estraven »

AnonymousHero wrote:Just for the heck of it, I did some stats on the ToME 2 code. It turns out that the C source is "only" 298K lines... of which 60K lines are main-*.c (whitespace and comments included). There's also 9K lines of Lua. Given a typical $HLPL-to-C line count ratio of a factor of 3 to 5, I don't think a conversion to $HLPL would be entirely out of the question, especially if HLPL were something C-like where you could "paste the C code and fix" code as you went along.
<snip>
I tried porting part of one file to Perl. Gave up pretty quickly. Tons of functions in there and no idea where they're located; grepping gets tiresome...

(But I'm not a programmer. Anyone can write code, but it takes more than that to write - or port - a large program, IMO.)

Lord Estraven
Uruivellas
Posts: 718
Joined: Tue Dec 13, 2005 12:35 am

Re: Revisiting the T1 crash issue

#11 Post by Lord Estraven »

Okay I tried the (int) fix... No impact whatsoever. Exact same error in Valgrind.

I think I need to figure out where the heck the memory is getting corrupted...

Lord Estraven
Uruivellas
Posts: 718
Joined: Tue Dec 13, 2005 12:35 am

Re: Revisiting the T1 crash issue

#12 Post by Lord Estraven »

Still going at it for some reason...

Tried to change sprintf to asprintf to see what would happen. What I got was tome complaining that it couldn't find a file in the libdir with weird Unicode characters in its name ("P?E ??F" was how it was displayed in the console). No idea how that happened, as far as I know asprintf should behave identically to sprintf.

(Though gut instinct tells me using asprintf here would be a bad idea anyway. No real idea why.)

Lord Estraven
Uruivellas
Posts: 718
Joined: Tue Dec 13, 2005 12:35 am

Re: Revisiting the T1 crash issue

#13 Post by Lord Estraven »

Ha, got it! It turns out the problem wasn't strnfmt but in the way it was invoked.

Unfortunately I couldn't figure out exactly what was being done wrong... But fortunately, it was very easy to remove the spell info feature. The shift key no longer does anything when casting with 'm', but 'b'rowsing works perfectly fine.

So yeah, the crash is now fixed, though the solution was pretty lazy on my part. :P Expect the fixed version of T1 to appear on Gitorious in short order.

P.S. Cursory examination of do_cmd_browse revealed that it prints spell info using completely different code. I might reestablish the shift key feature at some point using code borrowed from do_cmd_browse, which would hopefully not cause a crash.

Lord Estraven
Uruivellas
Posts: 718
Joined: Tue Dec 13, 2005 12:35 am

Re: Revisiting the T1 crash issue

#14 Post by Lord Estraven »

Briefly went up, then went down again. This code base is just too gnarly. :(

AnonymousHero
Spiderkin
Posts: 482
Joined: Sat Mar 18, 2006 12:48 pm

Re: Revisiting the T1 crash issue

#15 Post by AnonymousHero »

Told you. (Tongue in cheek.)

It's a little sad, actually... I find ToME 2.x (+ Theme) and Entroband to be the most enjoyable of all roguelikes. Unfortunately both of those code bases are absurdly hackish and needlessly complicated.

Post Reply