Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Heap init code improvements" doesn't boot on ESP-01 with PUYA flash #8657

Closed
6 tasks done
avian2 opened this issue Aug 14, 2022 · 18 comments · Fixed by #8658
Closed
6 tasks done

"Heap init code improvements" doesn't boot on ESP-01 with PUYA flash #8657

avian2 opened this issue Aug 14, 2022 · 18 comments · Fixed by #8658

Comments

@avian2
Copy link

avian2 commented Aug 14, 2022

Basic Infos

  • This issue complies with the issue POLICY doc.
  • I have read the documentation at readthedocs and the issue is not addressed there.
  • I have tested that the issue is present in current master branch (aka latest git).
  • I have searched the issue tracker for a similar issue.
  • If there is a stack dump, I have decoded it.
  • I have filled out all fields below.

Platform

  • Hardware: [ESP-01]
  • Core Version: [313b3c0]
  • Development Env: [Arduino IDE]
  • Operating System: [Ubuntu]

Settings in IDE

  • Module: [Generic ESP8266 Module]
  • Flash Mode: [dio]
  • Flash Size: [1MB]
  • lwip Variant: [v2 Lower Memory]
  • Reset Method: [nodemcu]
  • Flash Frequency: [40Mhz]
  • CPU Frequency: [80Mhz]
  • Upload Using: [SERIAL]
  • Upload Speed: [115200]

Problem Description

Any program compiled and uploaded using the latest git master does not run at all. After a successful compile and upload Arduino serial monitor only shows garbage (using a scope on the serial line shows data with a non-standard baudrate that I can't decode). The LED is constantly on.

Simplest sketch that reproduces this is the "Blink" example. However the result is the same regardless of what program is uploaded.

Using git bisect from a known-good release (2.4.2) shows that the commit that introduced this issue is:

7356cd1
Heap init code improvements and updates (#8458)

Reverting just this commit on the latest git master fixes the problem.

MCVE Sketch

void setup() {
  pinMode(LED_BUILTIN, OUTPUT);
}

void loop() {
  digitalWrite(LED_BUILTIN, HIGH);
  delay(1000);
  digitalWrite(LED_BUILTIN, LOW);
  delay(1000);
}

Debug Messages

Picked up JAVA_TOOL_OPTIONS: 
. Variables and constants in RAM (global, static), used 27984 / 80192 bytes (34%)
║   SEGMENT  BYTES    DESCRIPTION
╠══ DATA     1496     initialized variables
╠══ RODATA   856      constants       
╚══ BSS      25632    zeroed variables
. Instruction RAM (IRAM_ATTR, ICACHE_RAM_ATTR), used 59601 / 65536 bytes (90%)
║   SEGMENT  BYTES    DESCRIPTION
╠══ ICACHE   32768    reserved space for flash instruction cache
╚══ IRAM     26833    code in IRAM    
. Code in flash (default, ICACHE_FLASH_ATTR), used 232952 / 1048576 bytes (22%)
║   SEGMENT  BYTES    DESCRIPTION
╚══ IROM     232952   code in flash   
esptool.py v3.0
Serial port /dev/ttyUSB0
Connecting....
Chip is ESP8266EX
Features: WiFi
Crystal is 26MHz
MAC: [...]
Uploading stub...
Running stub...
Stub running...
Configuring flash size...
Auto-detected Flash size: 1MB
Erasing flash (this may take a while)...
Chip erase completed successfully in 0.0s
Flash params set to 0x0220
Compressed 266288 bytes to 196214...
Writing at 0x00000000... (8 %)
Writing at 0x00004000... (16 %)
Writing at 0x00008000... (25 %)
Writing at 0x0000c000... (33 %)
Writing at 0x00010000... (41 %)
Writing at 0x00014000... (50 %)
Writing at 0x00018000... (58 %)
Writing at 0x0001c000... (66 %)
Writing at 0x00020000... (75 %)
Writing at 0x00024000... (83 %)
Writing at 0x00028000... (91 %)
Writing at 0x0002c000... (100 %)
Wrote 266288 bytes (196214 compressed) at 0x00000000 in 20.1 seconds (effective 105.9 kbit/s)...
Hash of data verified.

Leaving...
Hard resetting via RTS pin...
@mcspr
Copy link
Collaborator

mcspr commented Aug 14, 2022

Default baudrate is 74880 btw, you will see boot messages if you configure your monitor to it
Is this commit affecting all of your devices or just one?

@avian2
Copy link
Author

avian2 commented Aug 14, 2022

Thanks for the quick response!

Default baudrate is 74880 btw, you will see boot messages if you configure your monitor to it

Decoding at 74880 bps gives the following (repeating in an infinite loop):

�4R2RvVKE�R|�4R2RvVKE�R|�4R2RvVKE�R|�4R2RvVKE�R|�4R2RvVKE�R|�4R2RvVKE�R|�4R2RvVKE�R

Is this commit affecting all of your devices or just one?

I've now tried a few other ESP-01 modules and found one that works with the commit!

good_bad_esp01

Both should have 1 MB Flash (25Q80).

One interesting difference I see between these two modules is time to erase the flash:

Good module:

Auto-detected Flash size: 1MB
Erasing flash (this may take a while)...
Chip erase completed successfully in 6.0s

Bad module:

Auto-detected Flash size: 1MB
Erasing flash (this may take a while)...
Chip erase completed successfully in 0.0s

In both cases Arduino was set to "Erase flash: All Flash Contents"

@mcspr
Copy link
Collaborator

mcspr commented Aug 14, 2022

Usually, device prints something like this on boot

 ets Jan  8 2013,rst cause:2, boot mode:(3,6)

load 0x4010f000, len 3424, room 16
tail 0
chksum 0x2e
load 0x3fff20b8, len 40, room 8
tail 0
chksum 0x2b
csum 0x2b
v00090700
~ld
...sketch starts, baudrate might change...

Not actually erasing flash seems suspect, and ours as well as esptool issue tracker mentions PUYA chips as source of all kinds of trouble.
https://github.com/esp8266/Arduino/search?q=puya&type=issues
https://github.com/espressif/esptool/search?q=puya&type=issues

While we have PUYA-specific workarounds for an already working sketch (edit: enabled by default), not really sure this is the expected behaviour when flashing or whether it actually writes any data to the flash. Looks like a hw issue, really.

@avian2
Copy link
Author

avian2 commented Aug 14, 2022

Usually, device prints something like this on boot

For some reason all my instruments have problems decoding this serial output. I've finally managed to get something halfway readable using pulseview on an CSV export from a logic analyzer.

Serial output from the bad module:

[91][91][C9][F5]0x00000000, depc=0x00000000
Fatal exception (0): 
epCLO[C2][D1]0201ec8, epc2=0x0L[82][82][82][82][82]b[02]*[C1][8D][CD]=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exc[A5][D1][A9][CB][0B]B[82]): 
[A5][8D][B1]O[C2][D1]0[92][8A]*[8D][E1], epcZO[C2][C1][B0][82]0000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (0): 
epc1=0x4020[B1]Y,b[02]*[C1][8D][C9]=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (0): 
epc1=0x40201ec8, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (0): 
epc1=0x40201ec8, epc2=0x00000000, epc3=0x0000[10][96][F5]0x00000000, depc=0x00000000
Fatal exception (0): 
epc1=0x40201ec8, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000

Unfortunately the exception decoder doesn't seem to do anything with this, even if I just paste the part that looks correctly decoded.

EDIT: 0x40201ec8 is handle_hwdt

Not actually erasing flash seems suspect, and ours as well as esptool issue tracker mentions PUYA chips as source of all kinds of trouble.

The PUYA issues all seem to be related to writing to flash (to filesystem during run time, or with esptool during programming). I always get "Hash of data verified.", on both the good and the bad module. If this is an issue writing the flash during programming, shouldn't the verification fail at this point? I don't see how the heap code changes anything related to flash writing.

For the record, flash_id of chips on my modules:

Good module:

Manufacturer: c8
Device: 4014
Detected flash size: 1MB

Bad module:

Manufacturer: 85
Device: 6014
Detected flash size: 1MB

One idea I had is that maybe the PUYA work-around code itself is a problem, since it might get run on the bad module (due to its flash_id), but not the good one. I see that the workaround malloc's an additional buffer so that might be the connection to the heap code. On the other hand, I don't see why the boot code would want to write to flash early in the boot process.

@mhightower83
Copy link
Contributor

@avian2
For an experiment to try, add a <sketch name>.ino.globals.h file to you build with lines:

/*@create-file:build.opt@
-DUMM_INIT_USE_IRAM=1
*/

Also, remove Tools->Debug Level HWDT or HWDT_NOEXTRA4K options from your build.

For confirmation that the new .h file was processed, you should see something like:

Extracting embedded compiler command-line options from <the real sketch name>.ino.globals.h:1
   2, Add command-line option: -DUMM_INIT_USE_IRAM=1
  Created compiler command-line options file /tmp/arduino_build_592869/core/build.opt

In the Arduino IDE build console. To see a detailed build report, you may need to turn on "Show verbose output during compilation"

These changes will remove early (pre SDK init) ICACHE read operations.

@avian2
Copy link
Author

avian2 commented Aug 14, 2022

@mhightower83 Thanks for the suggestion. Adding this define makes the Blink sketch work on the bad modules!

Bad module, unmodified git master (313b3c0), with -DUMM_INIT_USE_IRAM=1: works.

Bad module, unmodified git master (313b3c0), without -DUMM_INIT_USE_IRAM=1: does not work.

I tried alternating the setting three times to make sure it's not random.

@mcspr Some additional info regarding flash writing:

I manually dumped the flash content from the modules with esptool.py read_flash. Dumps from the good modules, bad modules and the .bin file produced by the compiler are identical.

Also, the unused portion of the flash looks correctly erased (all 0xff) on the bad modules.

As far as I can see, this is not a problem with programming the flash with the esptool.

@mcspr
Copy link
Collaborator

mcspr commented Aug 14, 2022

Hm. But what's the difference between flash cache being enabled later down the line vs. manual call in that wrapper?

@mhightower83
Copy link
Contributor

@mcspr
My guess is the initial Boot ROM setup for PUYA is not compatible with the requirements for the ICACHE/SPI hardware-Flash interface to work. The exception 0 with handle_hwdt hints toward reading zero from the ICACHE. I haven't looked at the RTOS SDK in a while; however, I think it has unique setups for some list of flash chips and a default setup for those not listed. Maybe the NONOS SDK does something similar.

@avian2
Copy link
Author

avian2 commented Aug 15, 2022

In case it helps in understanding this issue, here's another partly decoded serial capture I have. This was triggered from the RST line, so it shows serial output immediately after reset:

 )[D7][81][AA]X[0B][02][C2][02]J[B0]L[16]I[ED]t [A3]X][AE]JJ, boot mode:(3,7)

[AC]+[91][81]0x4010f000, len 3424, roo[AD][8A][B2][02]j[0A]tail 0
[A3]kk[D5][B5] 0x2e
loa[A4][82][C2]Mf[A6][96][82][12][E1], len 40, room 8 
tail 0
chksum 0x2b
csum 0x2b
v00041170
~ld
Fatal exception (0): 
epc1=0x40223968, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (0): 
epc1=0x40223968, epc2=0x00000000, [82][82]b[02]*[E1][8D][D9][85][91][91][C9][F5]0x00000000, depc=0x00000000
Fatal exception (0): 
epc1=0x40223968, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (0): 
epc1=0x40223968, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00p[82]000, DY[8D][F5]0x00000000

0x40223968 is the address of the umm_init function in the binary:

$ xtensa-lx106-elf-objdump -d Blink.ino.elf |grep -C3 40223968
40223966:	22                      	.byte 0x22
40223967:	40                      	.byte 0x40

40223968 <umm_init>:
40223968:	f0c112               	addi	a1, a1, -16
4022396b:	ffd521               	l32r	a2, 402238c0 <wpa_sm_disassociate+0x4>
4022396e:	21c9                	s32i.n	a12, a1, 8
40223970:	3109                	s32i.n	a0, a1, 12

@mhightower83
Copy link
Contributor

Sorry, I looked around and I don't have any devices with the PUYA flash chip. If you don't mind here is another experiment to try. Add this as a new .cpp file to your sketch folder. It will change the setting for flashchip->chip_size to 1M before the umm_init() call. The thinking is this will prevent any possible out-of-range request from being made to the Flash device.

#include <Arduino.h>
#include <umm_malloc/umm_malloc.h>
extern "C" {
#include "user_interface.h"
#include "cont.h"
}

extern "C" {
/////////////////////////////////////////////////////////////////////////////////
// starter code copied from core_esp8266_main.cpp
extern "C" void call_user_start();

IRAM_ATTR void app_entry_redefinable(void)
{
    /* Allocate continuation context on this SYS stack,
       and save pointer to it. */
    cont_t s_cont __attribute__((aligned(16)));
    g_pcont = &s_cont;

    flashchip->chip_size = 1024 * 1024; // Fixup chip_size for PUYA

    /* Doing umm_init just once before starting the SDK, allowed us to remove
       test and init calls at each malloc API entry point, saving IRAM. */
#ifdef UMM_INIT_USE_IRAM
    umm_init();
#else
    // umm_init() is in IROM
    mmu_wrap_irom_fn(umm_init);
#endif
    /* Call the entry point of the SDK code. */
    call_user_start();
}
};

@avian2 avian2 changed the title "Heap init code improvements" prevents programs from starting on ESP-01 "Heap init code improvements" doesn't boot on ESP-01 with PUYA flash Aug 16, 2022
@avian2
Copy link
Author

avian2 commented Aug 16, 2022

@mhightower83 I tried the chip_size fixup and there is no difference. I verified that the code gets compiled in by dumping the elf, but it still crashes on boot.

@mhightower83
Copy link
Contributor

mhightower83 commented Aug 19, 2022

@avian2 Thanks for trying that.

I have been looking around a bit and see this equivalent used before starting Cache_Read_Enable:
SPI0U |= SPIUCSSETUP; // SPI_CS_SETUP. It looks promising. In the RTOS SDK and other places, SPI_CS_SETUP was camouflaged as BIT5. It took me a while to recongnize it as relavent. (At least for the example that looked relevant. Some other SPI bus setups had SPI_CS_SETUP.)

If this should fail, I may need to wait till I have hardware to work with to resolve it. I would also look at a roll-back PR.
-DUMM_INIT_USE_IRAM=1 is a good workaround for now. However, for now Debug HWDT will not work with PUYA. It requires early Cache_Read_Enable to work.

#include <Arduino.h>
#include <umm_malloc/umm_malloc.h>
extern "C" {
#include <user_interface.h>
#include <esp8266_peri.h>            // **** new line ****
#include <cont.h>
}

extern "C" {
/////////////////////////////////////////////////////////////////////////////////
// starter code copied from core_esp8266_main.cpp
extern "C" void call_user_start();

IRAM_ATTR void app_entry_redefinable(void)
{
    /* Allocate continuation context on this SYS stack,
       and save pointer to it. */
    cont_t s_cont __attribute__((aligned(16)));
    g_pcont = &s_cont;

    SPI0U |= SPIUCSSETUP;  // SPI_CS_SETUP  - **** New line ****

    // flashchip->chip_size = 1024 * 1024; // Fixup chip_size for PUYA

    /* Doing umm_init just once before starting the SDK, allowed us to remove
       test and init calls at each malloc API entry point, saving IRAM. */
#ifdef UMM_INIT_USE_IRAM
    umm_init();
#else
    // umm_init() is in IROM
    // I am not sure what this does. It appears to be the key function called
    // from `fix_cache_bug` in the NONOS SDK - Will this help PUYA Flash work?
    // It appears to do some, low-level undocumented register manipulation and
    // device specific init based on OTP CHIPID model bits.
    extern uint32_t phy_get_bb_evm(void); // undocumented
    phy_get_bb_evm();

    mmu_wrap_irom_fn(umm_init);
#endif
    /* Call the entry point of the SDK code. */
    call_user_start();
}
};

Edit: I think the flashchip->chip_size = 1024 * 1024; can be removed.
Edit: 08/30/22, Added call to phy_get_bb_evm();

@avian2
Copy link
Author

avian2 commented Aug 22, 2022

@mhightower83 Thanks for looking into this.

I've tried the SPI_CS_SETUP fix and unfortunately Blink still does not boot with PUYA flash. Same as before I get an invalid instruction exception in umm_init().

I've also tried ESP-01 modules with BergMicro and Kynix flash chips. Those work fine.

@mhightower83
Copy link
Contributor

@avian2 I will not be able to test this on a device with PUYA flash until after October. The hardware devices I received so far, that were supposed to have PUYA flash, have Tenx. ☹️ I updated the previous sample code comment with my latest observations.

@TD-er
Copy link
Contributor

TD-er commented Aug 30, 2022

@avian2 I will not be able to test this on a device with PUYA flash until after October. The hardware devices I received so far, that were supposed to have PUYA flash, have Tenx. ☹️ I updated the previous sample code comment with my latest observations.

Pfft, you can't rely on getting garbage chips when you purposely order garbage...
The irony ;)

@avian2
Copy link
Author

avian2 commented Sep 1, 2022

@mhightower83 Your latest code from 08/30/22 with phy_get_bb_evm(); fixes the issue with PUYA flash! The Blink sketch as well as some others I tried boot fine on my previously problematic PUYA flash modules.

I also tried it with modules with BergMicro and Kynix flash that I have at hand and they work fine.

Thank you again for taking the time to investigate this problem!

@d-a-v d-a-v added this to the 3.1 milestone Sep 13, 2022
@d-a-v d-a-v added alpha included in alpha release and removed alpha included in alpha release labels Sep 13, 2022
@mhightower83
Copy link
Contributor

mhightower83 commented Sep 18, 2022

@avian2 It looks like there is still a missing piece to recreating this problem. I have now swapped in the PUYA P25Q80H flash chip on an ESP-01S board and the failure does not occur with the blink sketch with core version ((313b3c0...)) or the current master. So I am failing to recreate the original problem.

I think I have gone as far as I can with this. It is working for you. The logic came from a function named fix_cache_bug which sounds like something worth incorporating into early icache-enabling logic.

If you have time to try the fix in PR form #8658 (comment), that would be great.

Otherwise @d-a-v, I think I am done with this PR.

Edited for clairity.

@avian2
Copy link
Author

avian2 commented Nov 14, 2022

Sorry, I keep meaning to follow up on this but other things kept interfering.

@mhightower83 I'm not sure what's special about PUYA P25Q80H on my modules. Perhaps I have some other die revision? Or maybe the ESP8266 itself is different? Unfortunately I only have one of them left for testing. Others are running some sensors for me (without problems so far - I used then-current esp8266 core with the heap init code commit reverted).

In any case, the current git master (80bf716) works for me on the PUYA module I still have, so I can confirm that your last PR fixed it.

I also checked again just to be sure it's not something else: the issue is indeed still perfectly reproducible with the old core version I initially reported (313b3c0).

Thanks again for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants