
M1 Mac "EOF" during http access to docker-hosted webserver "Connection reset by peer" #5407

Closed
3 tasks done
rfay opened this issue Mar 1, 2021 · 28 comments

rfay commented Mar 1, 2021

  • I have tried with the latest version of Docker Desktop
  • I have tried disabling enabled experimental features
  • I have uploaded Diagnostics
  • Diagnostics ID: 165EF212-DB99-4FE1-90A4-E516A309B23F/20210609215723

Expected behavior

Predictable behavior from Docker hosting an nginx webserver.

Actual behavior

Intermittent EOF when hitting a webserver.

I now have Mac M1 CI going for ddev, currently using build 3.1.0 (60984)

Every day I see a few tests where an HTTP request gets an EOF. This is different from many Docker failures I've experienced before, and seems new to Docker on M1.

I know it's not much help, but here's a sample failure:
testcommon.go:396:
  | Error Trace:	testcommon.go:396
  | config_test.go:789
  | Error:      	Received unexpected error:
  | Get "http://127.0.0.1/phpinfo.php": EOF
  | Test:       	TestPHPOverrides
  | Messages:   	GetLocalHTTPResponse returned err on rawurl http://testpkgdrupal8.ddev.site/phpinfo.php: Get "http://127.0.0.1/phpinfo.php": EOF
 

The thing is that the EOF on M1 has happened intermittently on different tests, and it's the most likely failure on Mac M1. It usually succeeds on a retry (a run is an hour of testing, with hundreds or thousands of HTTP requests). There is only one machine in the pool at this point, so it's not about different computers with different state.

I just thought you should know about this one because it's reasonably common.

Information

macOS Big Sur 11.2.1
Docker Desktop for Mac 3.1.0 (60984)

Steps to reproduce the behavior

  1. ...
  2. ...
@stephen-turner
Contributor

Thanks for the report, @rfay. If you have a spare Big Sur Intel machine to test it on, I'd be interested to know whether it fails on that using the new virtualization framework.

@stephen-turner stephen-turner added the area/m1 M1 preview builds label Mar 2, 2021

rfay commented Mar 2, 2021

I went ahead and switched a Big Sur amd64 test runner to new virtualization, we'll see what happens. Note that my initial experience wasn't good - it just went off into spinning-circle land forever. However, when I closed the window after many minutes and came back it seemed to be set. Not satisfied, I reset to factory defaults and then changed it to new virt again. It didn't do the infinite spinning the second time.


rfay commented Mar 2, 2021

@stephen-turner Networking to container (host.docker.internal) is nonfunctional with new virtualization enabled, so that's a no-go for NFS or xdebug. Opened #5410


rfay commented Mar 30, 2021

@stephen-turner @djs55 The EOF issue may have gone away in RC2, I'm not sure.

However, and I assume related, I still get "Connection reset by peer" quite a lot. "read tcp 127.0.0.1:62795->127.0.0.1:80: read: connection reset by peer". It's rare that I don't have to restart an M1 test run due to this one. Doesn't happen on the same test each time.


rfay commented Apr 18, 2021

I still see this related thing quite a lot: "Get "http://127.0.0.1//README.txt": read tcp 127.0.0.1:60801->127.0.0.1:80: read: connection reset by peer"

Sometimes I have to restart tests many times to get a clean run. Of course, while I'm doing that, every other platform (Docker for Mac amd64, Docker for Windows, WSL2, Linux) completes without any of these errors on the same code.


rfay commented Apr 22, 2021

I do still see the EOF, it hasn't gone away:

testcommon_test.go:194:
  | Error Trace:	testcommon_test.go:194
  | Error:      	Received unexpected error:
  | Get "http://127.0.0.1/readme.html": EOF
  | Test:       	TestGetLocalHTTPResponse


LeZuse commented Apr 28, 2021

We just started having the same issue on Apple Silicon macs with Docker 3.3.1. Reverting to Preview 7 solves the issue. Happy to provide more details.


rfay commented May 5, 2021

This problem is not solved in Docker Desktop 3.3.2. I had hopes that it might have been related to this in the 3.3.2 release notes:

Fixed a bug with an Apple chip where the last byte in a network transfer was occasionally lost


rfay commented May 5, 2021

Not sure how you're going to chase this @djs55, but it's a really significant problem. Casual users probably just click through it and try again, but I haven't had a successful M1 test run of the ddev test suite in more than a month, and I'm starting to ignore the failures. And it's always the EOF.


djs55 commented May 6, 2021

@rfay thanks for trying with 3.3.2.

The code path which handles docker run -p forwarded ports should be the same on both Intel and Apple Silicon so I suspect the bug might actually be present on both platforms, even if it is only visible on Apple Silicon. I'll take a look in more detail to see if I can spot something.

A bit of a long shot but: Is the EOF from the first request to a container or does it happen after successful requests? I ask because running docker -p 80:80 -d nginx could possibly return before the nginx process has called listen, leading to a transient EOF on the first request. This would probably be obvious if it was happening ... unless the container is silently crashing and auto-restarting? It's probably worth double-checking that the container isn't accidentally running through qemu emulation. We're still chasing down and building multiarch images in a few places ourselves. Qemu works just well enough to make simple tests pass but then fails during more stressful tests.


rfay commented May 6, 2021

Thanks for the thinking and attention on this.

  • This seems to happen randomly in no particular test, so I doubt that it has to do with the container crashing or that sort of thing.
  • These happen after the container is fully up and registered healthy; the healthcheck itself does an HTTP request, and the failures don't occur there. It's happening on traffic after the container has come up.
  • Images are absolutely arm64. We use only native images on all platforms. I know how crazy those qemu-wrong-arch situations can be, so qemu should not be in play here. Also note that we run the same tests on linux/arm64 with the same arm64 images, with no problems there.

Again, thanks. And I know this is a hard one. I'll try to re-investigate a few of these and keep some notes to see if there's any kind of pattern. But it seems like... ddev start, wait until it's up, curl something, EOF failure. That's a common pattern in all the tests, but there doesn't seem to be any particular pattern in what tests fail.

LeZuse commented May 7, 2021

We had a problem with HTTP clients complaining about mismatched content length (the response got cut off), and now it seems to work in 3.3.2, so either this is not the same issue or there are more variables at play.



rfay commented May 7, 2021

@LeZuse I'm betting that your fix was recorded in the 3.3.2 release notes:

Fixed a bug with an Apple chip where the last byte in a network transfer was occasionally lost

Sadly, this one seems to be different.


rfay commented Jun 4, 2021

This problem remains. I'll work on a recreation scenario. It's only one time in 10 that the ddev test suite completes without this problem on M1 (problem is ONLY on M1, same tests everywhere).


rfay commented Jun 9, 2021

I think I have a recreation scenario, and here's a diagnostic: 165EF212-DB99-4FE1-90A4-E516A309B23F/20210609215723 (Edit: Got another one: 165EF212-DB99-4FE1-90A4-E516A309B23F/20210609221754 )

Most of the time this is probably "Connection reset by peer"

It appears that in this situation both ddev-router and ddev-webserver are completely ready (and have already served something, but internally in the container)


rfay commented Jun 24, 2021

Upgraded all test runners to 3.5.0, but this remains a failure somewhere in almost every test run.

hmaesta commented Jul 7, 2021

Oh, man... After 24 hours of extreme frustration, here I am.

We have some piece of code that Just Don't Work™: no errors, no exceptions, no log. Nothing. The request just has a sudden stop coming from nowhere, like being hit by lightning on a beautiful summer day.

And then I realized that I was the only one suffering from this rare phenomenon. My colleagues, followers of Linus Torvalds, didn't even notice that something was odd. Just me.

"We use Docker so everyone can have the same environment. It's our code. It's not possible that it's just me."

Well, it was. After accepting that the M1 could be the cause, I started looking for someone as unlucky as me and found this issue.

I uninstalled every piece of Docker on my computer and downgraded from 3.5.1 to 3.3.1, the first version to support Apple Silicon, and everything is back to normal. Except me: even now I haven't accepted that I spent 8+ hours of working time looking for a bug in my own code.


rfay commented Jul 28, 2021

This remains an issue on 3.5.2.


rfay commented Aug 12, 2021

Same on 3.6.0

Although I can usually get a full ddev test suite to pass on Mac amd64 and Windows amd64, it's very rare that I can get through a full suite on Mac M1. Sometimes I retry several times. It's always some random test that hits an EOF or connection reset by peer.

@rfay rfay changed the title M1 Mac "EOF" during http access to docker-hosted webserver M1 Mac "EOF" during http access to docker-hosted webserver "Connection reset by peer" Aug 18, 2021
@docker-robott
Collaborator

Issues go stale after 90 days of inactivity.
Mark the issue as fresh with /remove-lifecycle stale comment.
Stale issues will be closed after an additional 30 days of inactivity.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows.
/lifecycle stale


rfay commented Nov 16, 2021

/remove-lifecycle stale
/lifecycle frozen

Just because issues don't get any attention doesn't mean they're stale. This is a consistent problem.


dossy commented Dec 16, 2021

Possibly related? #3448


rfay commented May 5, 2022

This remains a consistent problem. As speculated here, it may be related to the use of localhost.

@icemanmelting

I have the same issue with a test. I have a Mac mini and a Mac Studio running two different Java apps that need to communicate with each other through TCP sockets, and I get intermittent socket disconnections quite often. This is a stress test, and I have never been able to run it across the two computers for more than 1 or 1.5 hours. If I run everything on the Mac Studio using a Docker internal network, then everything runs OK, and I am able to keep the communication going for days on end.

@ColeoCofer

The only solution to this problem that I found was to reduce the allocated resources in Docker Desktop down to 1 CPU (inspired by this post). My containers are now consistently passing, whereas with 4 CPUs I would get the EOF error about 4 out of 5 times.

Docker-Desktop Version: 4.18.0 (104112)
Engine: 20.10.24


rfay commented Sep 23, 2023

This still happens regularly, but it's not going to be addressed here. Closing.

@rfay rfay closed this as completed Sep 23, 2023

tisba commented Sep 24, 2023

Hey @rfay, could you point us to the issue where this is going to be addressed so we can keep track of the issue? Thanks! 🙏


rfay commented Sep 24, 2023

Unfortunately, if it hasn't gotten any attention in 2 1/2 years, I don't think it will get any.

However, it's a poor issue and doesn't have a repro case, as it's intermittent. If you have a good repro case, something that the Docker team can easily reproduce, perhaps they'll pay attention if you open a new issue with uploaded diagnostics and a repro case. Consider making a GitHub test repo that demonstrates it.

10 participants