Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LoadProhibited in FreeRTOS Event List processing (IDFGH-2300) #242

Closed
nathanjel opened this issue Jan 15, 2017 · 24 comments
Closed

LoadProhibited in FreeRTOS Event List processing (IDFGH-2300) #242

nathanjel opened this issue Jan 15, 2017 · 24 comments
Labels
Type: Bug bugs in IDF

Comments

@nathanjel
Copy link
Contributor

In a complex UART/MQTT/WebSockets/SSL scenario, under heavy load (UART data in/out processing every 10ms, several os Tasks, bombarded with MQTT messages and sends few MQTT messages per second, using both CPU cores) i randomly run into this repeatable error, which seems to occur as invalid pointer is dereferenced in FreeRTOS queue processing code. The 0x1d could hardly be a valid pointer given ESP32 memory map.

Down to sys_sem_signal it's only the semafor queue object being referenced, and error occurs later down the stream as the semafor owning task is extracted and system-wide event list adjusted, I would suspect the error is occuring somewhere in FreeRTOS internals.

No idea how to fix this so far, no workarounds.

Guru Meditation Error of type LoadProhibited occurred on core  1. Exception was unhandled.
Register dump:
PC      : 0x4008683f  
PS      : 0x00060133  
A0      : 0x80085698  
A1      : 0x3ffbea50  
A2      : 0x0000001d  
A3      : 0x00060f23  
A4      : 0x00060f23  
A5      : 0x0000003c  
A6      : 0x00000000  
A7      : 0x00000000  
A8      : 0x800842d8  
A9      : 0x3ffbea40  
A10     : 0x00000003  
A11     : 0x00060123  
A12     : 0x00060123  
A13     : 0x00000006  
A14     : 0x3ffbff3e  
A15     : 0x00000014  
SAR     : 0x00000019  
EXCCAUSE: 0x0000001c  
EXCVADDR: 0x0000002d  
LBEG    : 0x4000c2e0  
LEND    : 0x4000c2f6  
LCOUNT  : 0xffffffff  

Backtrace: 0x4008683f:0x3ffbea50 0x40085698:0x3ffbea70 0x40083eb4:0x3ffbea90 0x400eff21:0x3ffbeab0 0x40135c93:0x3ffbead0 0x4010



Remote debugging using /dev/ttyUSB1
uxListRemove (pxItemToRemove=0x1d) at /home/nathan/xnode/esp-idf/components/freertos/./list.c:212
212	List_t * const pxList = ( List_t * ) pxItemToRemove->pvContainer;
(gdb) bt
#0  uxListRemove (pxItemToRemove=0x1d) at /home/nathan/xnode/esp-idf/components/freertos/./list.c:212
#1  0x40085698 in xTaskRemoveFromEventList (pxEventList=<optimized out>)
    at /home/nathan/xnode/esp-idf/components/freertos/./tasks.c:3028
#2  0x40083eb4 in xQueueGiveFromISR (xQueue=0x3ffce388, 
    pxHigherPriorityTaskWoken=0x3ffb6524 <g_lwip_critical_section_needs_taskswitch>)
    at /home/nathan/xnode/esp-idf/components/freertos/./queue.c:1369
#3  0x400eff21 in sys_sem_signal (sem=0x3ffce37c) at /home/nathan/xnode/esp-idf/components/lwip/port/freertos/sys_arch.c:136
#4  0x40135c93 in event_callback (conn=<optimized out>, evt=<optimized out>, len=<optimized out>)
    at /home/nathan/xnode/esp-idf/components/lwip/api/sockets.c:1920
#5  0x40137e80 in err_tcp (arg=0x3ffd7530, err=-12 '\364') at /home/nathan/xnode/esp-idf/components/lwip/api/api_msg.c:397
#6  0x400eb239 in tcp_input (p=<optimized out>, inp=<optimized out>)
    at /home/nathan/xnode/esp-idf/components/lwip/core/tcp_in.c:366
#7  0x400e55ed in ip4_input (p=0x3ffd7ca4, inp=0x3ffbcd2c) at /home/nathan/xnode/esp-idf/components/lwip/core/ipv4/ip4.c:678
#8  0x400e52d4 in ethernet_input (p=0x3ffd7ca4, netif=0x3ffbcd2c)
    at /home/nathan/xnode/esp-idf/components/lwip/netif/ethernet.c:176
#9  0x400f02dc in tcpip_thread (arg=<optimized out>) at /home/nathan/xnode/esp-idf/components/lwip/api/tcpip.c:142
(gdb) 
@nathanjel
Copy link
Contributor Author

Thanks. I had a look and #233 does not tell me much. I compile in Debug, I have similar issues (different backtrace, but landing in same line in list.c) since some time, not only on recent IDF.

@projectgus
Copy link
Contributor

Hi @nathanjel ,

Thanks for reporting this. It looks like the same as an issue that we're also tracking internally, we've seen the same crash but only under high network load conditions.

We're continuing to investigate and will update you as soon as we have some more information.

Angus

@nathanjel
Copy link
Contributor Author

Exactly - this happens when the device is "bombarded" with messages during load testing. (a few hundred bytes MQTT message every 1ms). The software itself is generallly a port from our original implementation on TI's CC3200, where load testing was never an issue. Results improved after we implemented rate limiting on ESP32 port (rejecting messages when messages/sec limit exceeded) but never reached the stability of our original implementation - see all my other bug reports.

Did You consider upgrading FreeRTOS from 8.2 to a newer version?

@nathanjel
Copy link
Contributor Author

@me-no-dev
Copy link
Member

me-no-dev commented Jan 19, 2017

Seem to happen easier if code is running on core 1 and it happens both on incoming and outgoing packets. Queues are used throughout the system so I doubt that it has anything to do with the problem.
I can also tell you that using raw api and queues does not manifest the problem, so issue is probably somewhere in netconn and it's event logic.

@igrr igrr added the Type: Bug bugs in IDF label Jan 20, 2017
@nathanjel
Copy link
Contributor Author

nathanjel commented Jan 27, 2017

It seems to remain a theme in my app now, if under load.
I get an issue in a similar place, few lines later in xTaskRemoveFromEventList, but this time out of normal Queue code, while using a compiled state machine

PC      : 0x4008622b  PS      : 0x00060d33  A0      : 0x80085266  A1      : 0x3ffcf570  
A2      : 0x3ffcf778  A3      : 0x3ffcf5f8  A4      : 0x00000008  A5      : 0x3ffcf920  
A6      : 0x00000009  A7      : 0x00000000  A8      : 0x00000000  A9      : 0x3ffcf778  
A10     : 0x3ffcf778  A11     : 0x00060d23  A12     : 0x00060d23  A13     : 0xffff00ff  
A14     : 0xb33f0000  A15     : 0x00060723  SAR     : 0x0000001d  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x00000004  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff  

Remote debugging using /dev/ttyUSB1
uxListRemove (pxItemToRemove=0x3ffcf778) at /home/nathan/xnode/esp-idf/components/freertos/./list.c:218
218		if( pxList->pxIndex == pxItemToRemove )
(gdb) bt
#0  uxListRemove (pxItemToRemove=0x3ffcf778) at /home/nathan/xnode/esp-idf/components/freertos/./list.c:218
#1  0x40085266 in xTaskRemoveFromEventList (pxEventList=<optimized out>)
    at /home/nathan/xnode/esp-idf/components/freertos/./tasks.c:3017
#2  0x40083a9a in xQueueGenericSend (xQueue=0x3ffcf8c4, pvItemToQueue=<optimized out>, xTicksToWait=0, xCopyPosition=0)
    at /home/nathan/xnode/esp-idf/components/freertos/./queue.c:769
#3  0x400f74bd in InternalSend (queue_addr=0x3ffcf8c4, message=STATUS_MESSAGE_CONFIGURATION_SAVE_REQUEST, supplement=0x0)
    at /home/nathan/xnode/SNESP32/main/./status.c:346
#4  0x401072d4 in NetFSM_Msg (this=0x3ffcf680, queue_addr=0x3ffcf8c4, message=STATUS_MESSAGE_CONFIGURATION_SAVE_REQUEST)
    at /home/nathan/xnode/SNESP32/components/xnode/./netfsm.c:72
#5  0x40106f6c in NetMap_RemoteBrokerLive_EVENT_TIMER_TICK (fsm=0x3ffcf69c)
    at /home/nathan/xnode/SNESP32/components/xnode/./stripnode_net_sm.c:948
#6  0x40128ae0 in stripnode_netContext_EVENT_TIMER_TICK (fsm=0x3ffcf69c)
    at /home/nathan/xnode/SNESP32/components/xnode/./stripnode_net_sm.c:1708
#7  0x400f7969 in StatusTask (pvParameters=<optimized out>) at /home/nathan/xnode/SNESP32/main/./status.c:506
(gdb) 

@nathanjel
Copy link
Contributor Author

nathanjel commented Jan 28, 2017

Two interesting observations

  1. There seems to be corelation between SPI Flash settings and the stability. This could mean a possible hardware issue on my end. Coming down from 80MHz QIO to 26MHz QIO did the trick and reliefed me of most of the occurences of the issue and improved the observed stability of network connection under load. DIO at 80 and 40Mhz did not work well. We have a custom PCB, so multiple things could go wrong.

  2. There seems to be some locking issue - using GDB stub one can notice that there must have been a change in the FreeRTOS Tasks list while the block of code was ran.

The code says the list was not empty (queue.c from FreeRTOS)

				else
				{
					if( listLIST_IS_EMPTY( &( pxQueue->xTasksWaitingToReceive ) ) == pdFALSE )
					{
						if( xTaskRemoveFromEventList( &( pxQueue->xTasksWaitingToReceive ) ) != pdFALSE )
						{
							/* The task waiting has a higher priority so

And GDB shows that at the time of dump, the list was empty

#0  uxListRemove (pxItemToRemove=0x1d) at /home/nathan/xnode/esp-idf/components/freertos/./list.c:214
#1  0x40084cb8 in xTaskRemoveFromEventList (pxEventList=<optimized out>)
    at /home/nathan/xnode/esp-idf/components/freertos/./tasks.c:3017
#2  0x400838f4 in xQueueGiveFromISR (xQueue=0x3ffd29c0, 
    pxHigherPriorityTaskWoken=0x3ffb77c4 <g_lwip_critical_section_needs_taskswitch>)
    at /home/nathan/xnode/esp-idf/components/freertos/./queue.c:1369
#3  0x400edf35 in sys_sem_signal (sem=0x3ffd29b4) at /home/nathan/xnode/esp-idf/components/lwip/port/freertos/sys_arch.c:136
#4  0x40109230 in event_callback (conn=<optimized out>, evt=<optimized out>, len=<optimized out>)
    at /home/nathan/xnode/esp-idf/components/lwip/api/sockets.c:1920
#5  0x4010b2ba in recv_tcp (arg=0x3ffe8a28, pcb=0x3ffe8c38, p=<optimized out>, err=0 '\000')
    at /home/nathan/xnode/esp-idf/components/lwip/api/api_msg.c:281
#6  0x400e9f34 in tcp_input (p=<optimized out>, inp=<optimized out>)
    at /home/nathan/xnode/esp-idf/components/lwip/core/tcp_in.c:429
#7  0x400e5679 in ip4_input (p=0x3ffe9970, inp=0x3ffbdb50) at /home/nathan/xnode/esp-idf/components/lwip/core/ipv4/ip4.c:678
#8  0x400e53b9 in ethernet_input (p=0x3ffe9970, netif=0x3ffbdb50)
    at /home/nathan/xnode/esp-idf/components/lwip/netif/ethernet.c:176
#9  0x400ee2b8 in tcpip_thread (arg=<optimized out>) at /home/nathan/xnode/esp-idf/components/lwip/api/tcpip.c:142
(gdb) up
#1  0x40084cb8 in xTaskRemoveFromEventList (pxEventList=<optimized out>)
    at /home/nathan/xnode/esp-idf/components/freertos/./tasks.c:3017
3017		( void ) uxListRemove( &( pxUnblockedTCB->xEventListItem ) );
(gdb) print pxUnblockedTCB 
$1 = (TCB_t *) 0x1
(gdb) print px
pxCurrentTCB               pxEventList                pxOverflowTimerList        pxStack
pxCurrentTimerList         pxIndex                    pxPortInitialiseStack      pxTopOfStack
pxDelayedTaskList          pxNext                     pxPrevious                 pxUnblockedTCB
pxEnd                      pxOverflowDelayedTaskList  pxReadyTasksLists          
(gdb) print px
pxCurrentTCB               pxEventList                pxOverflowTimerList        pxStack
pxCurrentTimerList         pxIndex                    pxPortInitialiseStack      pxTopOfStack
pxDelayedTaskList          pxNext                     pxPrevious                 pxUnblockedTCB
pxEnd                      pxOverflowDelayedTaskList  pxReadyTasksLists          
(gdb) print pxCurrentT
pxCurrentTCB        pxCurrentTimerList  
(gdb) print pxCurrentTCB 
$2 = {0x3ffbe0b8, 0x3ffbfa60}
(gdb) print pxE
pxEnd        pxEventList  
(gdb) print pxEventList 
$3 = <optimized out>
(gdb) up
#2  0x400838f4 in xQueueGiveFromISR (xQueue=0x3ffd29c0, 
    pxHigherPriorityTaskWoken=0x3ffb77c4 <g_lwip_critical_section_needs_taskswitch>)
    at /home/nathan/xnode/esp-idf/components/freertos/./queue.c:1369
1369							if( xTaskRemoveFromEventList( &( pxQueue->xTasksWaitingToReceive ) ) != pdFALSE )
(gdb) print px
pxCurrentTCB               pxHigherPriorityTaskWoken  pxOverflowTimerList        pxQueueSetContainer
pxCurrentTimerList         pxIndex                    pxPortInitialiseStack      pxReadyTasksLists
pxDelayedTaskList          pxNext                     pxPrevious                 
pxEnd                      pxOverflowDelayedTaskList  pxQueue                    
(gdb) print pxQueue
$4 = (Queue_t * const) 0x3ffd29c0
(gdb) print pxQueue->xTasksWaitingToReceive 
$5 = {uxNumberOfItems = 0, pxIndex = 0x3ffd29ec, xListEnd = {xItemValue = 4294967295, pxNext = 0x3ffd29ec, 
    pxPrevious = 0x3ffd29ec}}
(gdb) 

This makes absolute sense, if one realizes that locking on the &xTaskQueueMutex happens only in the call to xTaskRemoveFromEventList, while it would make sense to make the lock before we check if the list is empty at all. I guess the idea behind could be that &pxQueue->mux is locked on anyhow out there, but then again, if a task picks up the event on the other core, it can get out from the list in between the list len macro call and the critical section start in xTaskRemoveFromEventList. This would also explain why most of the errors like this I see are on the Remove step.

@nathanjel
Copy link
Contributor Author

nathanjel commented Feb 14, 2017

This one is definitely a race condition, related to locking issue in a multi-processor environment.

In FreeRTOS/tasks.c tick handler xTaskIncrementTick(), a task's xEventListItem can be removed from the containing list, and the mutex is only on xTaskQueueMutex

Now, if the xEventListItem is on a specific queue's event list, e.g. pxQueue->xTasksWaitingToReceive, then what can happen is that in the queue manipulating functions, e.g. xQueueGenericSend we have a check if the queue's waiting to receive list is empty (e.g. listLIST_IS_EMPTY( &( pxQueue->xTasksWaitingToReceive ) ) == pdFALSE ) protected only by queue-specific mutex, but not by xTaskQueueMutex, so that between that check and the subsequent call xTaskRemoveFromEventList( &( pxQueue->xTasksWaitingToReceive ) ) == pdTRUE the above mentioned tick handler could actually get called on the other core and remove the item from queue's list and we will get the reported result.

@nathanjel
Copy link
Contributor Author

Seems my pull request came in at the same time as this
01c4a98

Anyhow, now the issue does not occur. Thanks.

@igrr
Copy link
Member

igrr commented Feb 15, 2017

Thanks for your PR, we have been discussing ways to fix this issue and your looks like a solid one. We will need to have a more detailed look, but I do see a chance this will be merged as a more reliable solution.

@negativekelvin
Copy link
Contributor

Thumbs up for more reliable. "reducing the window where the race condition can happen" does not make me sleep like a baby at night.

@nathanjel
Copy link
Contributor Author

Thank You. My initial solution that I applied internally to be able to continue work was very similar to the one in 01c4a98, but the code was more messy and thus I was not offering it to the wider community. The one proposed is elegant, and as the additional list check is being done with the protection of xTaskQueueMutex it should be solid and reliable enough - as far as I see it the window of opportunity is closed - but happy to learn where Your concerns come from. I also tested this with my heavy-duty code and the problem has not re-occured.

Adding the xTaskMutex around all the invocations indeed seems more robust, but it also is heavyweight as the overhead counted in instructions executed is much larger than the workaround, and it messes more with FreeRTOS internals. I think both are just patches needed to work around fundamental single-core driven FreeRTOS design concepts.

In my head I picture a solution that drops the queue-tasklist coupling completly and replaces it with something - maybe a lock-free ringbuffer that would store queue "events", for consideration by the scheduler...

@kglowacki
Copy link

this issue still exists (sdk v2.1). no heavy load on the network is required - for me it happens when I send a single SSL request just after wifi reconnects details

@nacpem
Copy link

nacpem commented Dec 4, 2019

I am getting this issue as well
This started when i create a separate task to send MQTT messages in a for loop.
I also have a parallel task which is continuously reading from the CAN bus and a two event loops:

  1. for Wifi
  2. MQTT

@github-actions github-actions bot changed the title LoadProhibited in FreeRTOS Event List processing LoadProhibited in FreeRTOS Event List processing (IDFGH-2300) Dec 4, 2019
@david-cermak
Copy link
Collaborator

@rebeliousconformist
Can you please share the version of IDF you use? The use-case you're describing sounds like a common scenario so should be easily reproducible.

@nacpem
Copy link

nacpem commented Dec 6, 2019

Hi david,
im using v4.1-dev-1086-g93a8603c5

Also, with the CAN bus task off, the MQTT connects to the rabbitMQ server perfectly.
When both the tasks are running in parallel (MQTT and CAN tx/rx) the MQTT send at most 22 or 23 messages to the server irrespective of the number of topics.

With the CAN bus readout task switched off. the MQTT event loop works fine.

When the MQTT is called from a Task, then it sends only 22-23 messages. I presume it is getting preempted by either the Wifi event loop or the MQTT event loop.
I have to cleanup my code before i post anything else.

Also when the semaphores are removed, then the errors go away.

Ill post my code soon

@david-cermak
Copy link
Collaborator

Hi @rebeliousconformist

Any luck with the test code? Could you please post some bits of your test code?
Thanks!

@nacpem
Copy link

nacpem commented Jan 17, 2020

Hi David,
As there was no reply on the same post i froze the development.
i shall post the code today End of Day 1800 CET. I will have to shore up the code.

Regards,
Nachiket Bapat

@david-cermak
Copy link
Collaborator

Hi Nachiket,

Thanks for your help and sorry for not responding earlier. Will try to reproduce the issue with your code!
Thanks again,
David

@nacpem
Copy link

nacpem commented Mar 27, 2020

Hi @david-cermak the event loop crash error is resolved in the latest developments.

@david-cermak
Copy link
Collaborator

Hi @rebeliousconformist

Thanks for the update! Closing the underlying internal tracker then.

@SpComb
Copy link

SpComb commented Jul 2, 2022

#8336 looks like a similar issue with event groups in esp-idf v4.4, does vTaskRemoveFromUnorderedEventList() also need a similar fix?

@guirespi
Copy link

Hi, this stills happens in v4.4.2 release. To trigger it you can use led_strip examlple (esp_idf/examples/peripherals/rmt) combined with the connection example. You have to make a freertos task that constantly turn on and turn off the ws2812 rgb leds in intervals of 100 ms. Meanwhile the rgb task is loop, you have to try to connect to wi-fi. This error causes two behaviours, the first is the Loadprohibited guru handler, the second one is that the program freezes at rmt_write_sample

@guirespi
Copy link

Also, when config the RMT driver I use a clock diviser of 4 instead of 2 as the API suggests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Bug bugs in IDF
Projects
None yet
Development

No branches or pull requests

10 participants