Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hardware WDT Stack Dump Tool #7010

Merged
merged 76 commits into from
Apr 7, 2021
Merged

Conversation

mhightower83
Copy link
Contributor

@mhightower83 mhightower83 commented Jan 11, 2020

A Hardware WDT Reset stack dump tool by Michael Hightower

This Sketch demonstrates the use of a tool that prints a stack dump after a Hardware WDT reset. After a Hardware Watchdog Timer reset, the module hwdt_app_entry.cpp writes a stack dump to the serial interface.

The goal was to have a file (hwdt_app_entry.cpp) that could be dropped into a sketch directory, then open the sketch project, build, upload, and debug.

When the ESP8266 restarts because of a Hardware WDT reset, the serial port speed defaults to 115200 bps. For support of other speeds, review the comments on option HWDT_UART_SPEED in wdt_app_entry.cpp.

See hwdt_app_entry.cpp for more options and details.

This Sketch demonstrates the use of a tool to print a stack dump
at reboot after a Hardware WDT event.

The module hwdt_app_entry.cpp writes a stack dump to the serial interface
after a Hardware Watchdog Timer has struck and a new boot cycle has begun.
The sketch must properly initialized the Serial port before the crash.

hwdt_app_entry.cpp is the core file that does the work.
Improved comments.
Added option to match the UART speed used by the sketch.
Added option to print greeting at the start to indicate the HWDT stack dump code is active.
Isolated logic for handling strings: one assuming they are not inited at
the time the code is running and one that does. The later appears to be the case.
Fix issue with HWDT reason detection when sketch crashes too fast.
Added sample sketch menu option for crashing with a function defined
with a weak attribute via prototype, but never actually defined in
full function form. eg. `void trouble(void){return;}`
Adjusted configuration option order for most likely to be used to the top.
Tried to improve comments.
Replace numbers with enum values.
Regular global strings have worked reliably.
Tweaked reset reason detection.
Reordered elements in global structure.
Improve #if test for debug option
Always improving comments.
@mhightower83 mhightower83 changed the title Hardware WDT Stack Dump Hardware WDT Stack Dump Tool Jan 15, 2020
data problem that occurs when printing after a flash upload using esptool.
Curiously, esptool stub also uses similar delays.
Word choices and description improvements.
no mixed up stack frame was created. Also removed no longer needed
extra level of function calling.
Use existing macros from uart_register,h to handle getting current
UART speed. Added some missing `const`.
@d-a-v
Copy link
Collaborator

d-a-v commented Jan 17, 2020

I haven't tried yet but I read the comments.
Did you try Serial.flush() instead of delay() ?
also eboot.elf has grown in size I believe, that should be reflected in the comments

@mhightower83
Copy link
Contributor Author

@d-a-v, Sorry for the confusion, In order to capture and print the stack, I am running as an app_entry_redefinable replacement, before the SDK and Core has started. (I really miss the Core APIs.) Fortunately, a UART speed change is not needed, if you are running at the default speed of 115200 and your printing the stack dump for a Hardware WDT reset. It is when I tried to print at a different time or speed that creates the issue. It works for now. I want to go back and try again polling the FIFO for empty then wait 1 character time for the character to leave the shift register, before changing UART speed. That should work; however, I have found that in the same sentence the words "should" and "work" just don't get along very often. 😄

eboot.elf's stack usage, with just a flash boot, (w/o running the uncompress OTA logic) appears to be down around 720 bytes. At this time I have a conservative 1024 reserved for ROM and eboot. I think that can come down. Just waiting to see if anything grows.

@d-a-v
Copy link
Collaborator

d-a-v commented Jan 17, 2020

Serial speed is generally 74880 at boot time (115200*26/40).
Does your board have a 40MHz crystal ? Should it be mentioned in the comments (115200 vs 74880 according to crystal) ?
By Serial.flush() I indeed intended to wait for a character (10/baud secs).

@mhightower83
Copy link
Contributor Author

@d-a-v, ahh, I tend to over detail things. I think I went too far the other way. I have a 26MHz crystal. I understand that the boot ROM has initially set the UART clock divisor to support 115200 bps with the assumption that it has a 40MHz crystal. For a 26MHz crystal, the resulting error gives us a real rate of 74880 bps and prints jibberish at first until the CPU clock frequency is recalibrated. While CH_PD and EST_RST bring the CPU back to this state of underclocking with a 26MHz crystal, the Hardware WDT reset does not appear to do the same. The UART appears to continues to be clocked at a rate to support a device connected at 115200 bps. And, this allows us to see the famous and cryptic WDT message.

I'll come up with a clearer description so people that are aware of the crystal issue are not distracted by my too brief description.

On the delay for FIFO to clear, as I was thinking this through again, I remembered another factor in just using the delay was to save on IRAM. It only runs at startup, so I opted for keeping code smaller vs faster in this area.

@mhightower83
Copy link
Contributor Author

Findings from exploring the Extra 4K heap option crash further:

For the possible SYS stack memory range 0x3FFFE000 up to 0x40000000, there is hole at 0x3FFFEA80 up to 0x3FFFEB30
This region turns to zeros after a WiFi connection is made. It looks like it is a buffer used by aes_decrypt_init. It also appears the SDK has a wrapper around the function.

Looking at one of my projects I see the SYS stack high watermark at 0x3FFFE720. This places the hole within the active stack range and yet it does not crash. I assume when the WiFi connection is being done, the actively used stack space is low at that moment. However, when HWDT (with ROM Stack set to 1024) is used with Extra 4K Heap I do crash. My assumption is that the "actively used stack space" is overlapping the hole. In contrast to when it works with reserved ROM Stack set to 752.

Improves the SYS stack space available when using the extra 4K Heap
option in conjunction with HWDT. Replaces the ROM AES buffer at
0x3FFFEA80 with one provided by malloc().
Check if cont stack is yielding to SYS, use g_pcont->sp_yield to limit the
amount of the cont stack dumped.

Generalized dev logic path to create a generalized debug function hwdt_pre_sdk_init_icache.
@mhightower83
Copy link
Contributor Author

mhightower83 commented Nov 17, 2020

Some HWDT crashes are improved by #7704 (comment). That PR moves them to crashes postmortem can report with much-improved context.

@earlephilhower earlephilhower added this to the 3.0.0 milestone Feb 16, 2021
@devyte
Copy link
Collaborator

devyte commented Mar 28, 2021

@mhightower83 there are conflicts in this one. Could you please resolve them?

@d-a-v d-a-v added the alpha included in alpha release label Mar 31, 2021
Copy link
Collaborator

@earlephilhower earlephilhower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Our debug menu is enormous already, adding a few more entries is no problem. The replacement startup looks good (if massively over-engineered and over-configurable, but it's mostly hidden if you don't care about it), and only comes into effect in specific debug modes.

Thx again, @mhightower83 !

@mhightower83
Copy link
Contributor Author

@earlephilhower

The replacement startup looks good (if massively over-engineered and over-configurable

Yea, I either under-plan or over-plan ahead. Also, with the discovery of how to turn on Flash execution, this PR transitioned from running mostly in IRAM to running mostly from Flash. Assuming we have no need for the IRAM build path, I should go back and trim that out. I don't think features in that build path have kept up with the Flash side. Assuming yes, to avoid introducing new bugs, it might be best to do in a new PR.

On the over-plan side, I do have a generalization in the startup, that I think will allow for a possible user-supplied function (hwdt_pre_sdk_init()) that could be used to do post-crash printing of stuff that could aid in analyzing a sketch's state at the time of the HWDT. Also (might need more changes) I was thinking that it would be nice to run a heap integrity check after a previous crash and maybe a full poison check as well. This PR is too big, these are things I have been thinking about. It is something I can look at if you see a need or value.

Anytime you think I have gone too far, feel free to encourage me to come back 😄

@d-a-v d-a-v merged commit da6ec83 into esp8266:master Apr 7, 2021
@mhightower83 mhightower83 deleted the pr-hwdt-stack-dump branch May 1, 2021 19:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
alpha included in alpha release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants