
🐛 BUG: "[ERROR] Error in ProxyController: Error inside ProxyWorker" #4562

Closed
ekojsalim opened this issue Dec 7, 2023 · 42 comments · Fixed by #4867
Labels
bug Something that isn't working regression Break in existing functionality as a result of a recent change start-dev-worker Relating to the startDevWorker API

Comments

@ekojsalim

Which Cloudflare product(s) does this pertain to?

Wrangler core, Miniflare

What version(s) of the tool(s) are you using?

3.19.0 [Wrangler]

What version of Node are you using?

20.10.0

What operating system are you using?

Linux

Describe the Bug

Repeatedly calling an endpoint that calls a Durable Object results in this error on every other request:

✘ [ERROR] Error in ProxyController: Error inside ProxyWorker

   {
    name: 'Error',
    message: 'Network connection lost.',
    stack: 'Error: Network connection lost.'
  }

I'm not sure whether the cause is actually the repeated DO calls. In DevTools, the requests to the DO all appear to be successful.

Downgrading to 3.18.0 fixes this issue, so this is possibly a regression involving the startDevWorker refactor.
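
For context, the setup is roughly the following shape (a minimal sketch; the class and binding names here are illustrative, not our actual code):

// Illustrative sketch only; the binding and class names are hypothetical.
export class Counter {
  private value = 0;

  async fetch(request: Request): Promise<Response> {
    // Every request hits the same object instance via idFromName below.
    this.value++;
    return new Response(String(this.value));
  }
}

export default {
  async fetch(request: Request, env: { COUNTER: DurableObjectNamespace }): Promise<Response> {
    const id = env.COUNTER.idFromName("singleton");
    return env.COUNTER.get(id).fetch(request);
  },
};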

Please provide a link to a minimal reproduction

No response

Please provide any relevant error logs

No response

@ekojsalim ekojsalim added the bug Something that isn't working label Dec 7, 2023
@KrisBraun

KrisBraun commented Dec 10, 2023

Also seeing this error on ~50% of reloads using Remix dev.

@KrisBraun

FYI @mrbbot (potentially related to startDevWorker)

@beanow-at-crabnebula

I've started seeing test flakes in tests that perform lots of unstable_dev starts.
No DO bindings (though there are one KV and one R2 binding).

Reverting to 3.18.0 indeed solves the issue.

Node 18, Linux, if that matters.

@mrbbot mrbbot added the waal label Dec 12, 2023
@ShravanSunder

I'm also getting this error often; it started recently with the new version.

@mrbbot mrbbot added start-dev-worker Relating to the startDevWorker API and removed waal labels Dec 18, 2023
@pvarsh

pvarsh commented Dec 19, 2023

Getting this error ('Network connection lost') and { errno: -32, code: 'EPIPE', syscall: 'write' } on a test that used to work. About half the time one or the other, but consecutive runs tend to have the same error. Started happening over the last few days.

@irekjozwiak

I'm getting this on ~70% of reloads on certain workers on 3.22.1.

@matthewjosephtaylor

matthewjosephtaylor commented Dec 28, 2023

I too am seeing this. I'm using Durable Objects with WebSockets.

I suspect it might be related to something like the DO going into hibernation and losing the WebSocket connection.

For me it isn't 'every other request' but rather if I leave the WebSocket open and connected without traffic for a period (at least that is what it 'feels' like).

Will revert to 3.18.0 to see if that fixes it.

For the record, I am on 3.22.1 currently.

FYI, reverting to 3.18 came with other bugs and issues that were even worse. I note that this morning 3.22.2 dropped and I'm excited to see if this helps... 🤞

Sadly, still seeing the same issue after about 30s of inactivity for the WebSocket in 3.22.2 :(

I'm considering pushing some 'ping' messages across just to keep the DO from sleeping, which I still suspect is the cause, but in a way it is acting as a nice chaos monkey while developing...

@bryanlatten

+1 seeing this semi-regularly with very small local load, 3-4 parallel requests

@CameronWhiteside

+1

@beanow-at-crabnebula

Just adding a note that we now have critical CVEs against the only workaround: downgrading to wrangler 3.18.
https://github.com/cloudflare/workers-sdk/security

Upgrading to the latest wrangler also caused Jest to find leaking request handles.
This makes unstable_dev for integration tests, as documented, very brittle.

@admah
Contributor

admah commented Jan 10, 2024

@ekojsalim (and others) thanks for reporting this. Could you help provide us with a way to reproduce this error message?

This will help us better debug what may actually be going on here. There are a couple of changes that were made within Wrangler recently that could be the culprit here. Thanks!

@bryanlatten

bryanlatten commented Jan 10, 2024

At the risk of being the wrong signal here: I see more failures when a CORS (OPTIONS) request is happening. The app has access to Queues and KV, not DO like others are using above.

[screenshot of the error attached]

@ethan-tqa

@admah even the Counter example in the DO docs has this issue, with any wrangler version above 3.18. It does not require any concurrent requests. I can reliably reproduce the issue using hand-triggered HTTP requests if they are less than, say, 3 seconds apart. And the behavior is very consistent: always 1 working request followed by 1 broken request, repeating.

@matthewjosephtaylor

Work-around:

I'm using Durable Objects with WebSockets.

As long as I have the client continually send a 'ping' message across the socket every 5 seconds, I'm no longer experiencing the issue.

I STRONGLY suspect the issue is with wrangler hibernating the worker. If I never let it rest, I don't get the error.
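
For anyone who wants to copy the workaround, the client side is roughly this (a minimal sketch; the URL and the 5-second interval are placeholders for whatever your setup uses):

// Minimal sketch of the keep-alive workaround; URL and interval are placeholders.
const ws = new WebSocket("ws://localhost:8787/ws");

ws.addEventListener("open", () => {
  // Ping every 5 seconds so the local DO never sits idle long enough to hibernate.
  const ping = setInterval(() => ws.send("ping"), 5_000);
  ws.addEventListener("close", () => clearInterval(ping));
});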

@aroman

aroman commented Jan 19, 2024

Work-around:

I'm using Durable Objects with WebSockets.

As long as I have the client continually send a 'ping' message across the socket every 5 seconds, I'm no longer experiencing the issue.

I STRONGLY suspect the issue is with wrangler hibernating the worker. If I never let it rest, I don't get the error.

If you're talking about the WebSocket hibernation API, we don't use those APIs in our project (we use WebSockets + DOs but not the hibernation APIs), and we are still hit hard by this issue when developing locally and running tests.

Separately, for CF folks (cc: @lrapoport-cf), I'd like to underscore the point @beanow-at-crabnebula made above. At my company, we've been unable to upgrade wrangler since early November, due to one regression after another (3 by my count: this one, #4496, and one caused by #4535 I can't seem to find). Of these, only the latter was documented in release notes.

I appreciate how frequently wrangler releases new versions, but my feeling is that something has got to change process-wise to mitigate the frequency and severity of breaking regressions — or at least document them in release notes so folks are aware of the tradeoffs when upgrading. I appreciate how active and interactive with customers your team is on the CF github repos — that level of communication goes a long way.

@admah
Contributor

admah commented Jan 19, 2024

@aroman that is a good callout. We are constantly reviewing our processes to see what we can do to mitigate these types of incidents, because we do understand how disruptive they are.

For this issue (and any others related to previous startDevWorker work) we have prioritized them and are working to have them resolved ASAP.

@lrapoport-cf lrapoport-cf added the regression Break in existing functionality as a result of a recent change label Jan 22, 2024
@lrapoport-cf
Contributor

hi @aroman and all, apologies for the delayed action on this issue. our team got pulled into high priority internal work over the last several weeks and we fell behind on our regular workers-sdk maintenance. i appreciate you calling out our engagement with the community as a positive -- we strive to keep you all informed as much as possible. this is a good reminder for us to continue to communicate out any internal discussion we have about particular issues so you all are always up to date on their status.

in terms of concrete next steps, we have prioritized this issue for this week and assigned someone to address it. and while we're also taking strides to reduce the number of regressions by increasing test coverage, going forward we'll be prioritizing fixing any regressions that do slip through as quickly as possible -- we have also just added a regression label so that items such as these get highlighted, please feel free to use it :)

thanks for raising this feedback!

@filipStoklasa23

We are experiencing this issue heavily in our Azure CI during e2e (parallel Cypress) test runs.

Downgrading to 3.18.0 does not help.

[screenshot of the CI failure attached]

@RamIdeas
Contributor

Hi @ekojsalim! Could you please describe the specific reproduction steps? Also, since you mentioned that your "requests to the DO all appear to be successful", could you clarify whether this is indeed an error causing your local development environment to behave incorrectly or whether it is a noisy log message in your terminal?

I tried this week to reproduce the issue on Ubuntu but couldn't. I used the Counter example from the Cloudflare docs, ran npx wrangler dev, and in another terminal ran curl http://localhost:8787/increment?name=A repeatedly. The requests were all successful and the output from wrangler had no extraneous logs – just the request logs with 200 status codes. I also waited 10+ seconds between some requests to test whether Durable Object hibernation was triggering the issue.

For others in the comments reporting similar but not exactly the same issues, could you please open new issues so that we can gather the specifics of your situation – our issue template includes fields for your wrangler and OS versions. Please try to provide reproduction steps to aid us in fixing the issue for you. I am committed to improving the local developer experience, so please provide as much detail as possible.

@ekojsalim
Author

@RamIdeas We actually migrated off Cloudflare Workers not long after this issue, so my memory of it may not be the best.

could you clarify whether this is indeed an error causing your local development environment to behave incorrectly or whether it is a noisy log message in your terminal?

It's not just a noisy log message. Every other HTTP request to the endpoint errors out with that error message.

I tried to reproduce this issue now using a stripped-down version of our old code (repeatedly calling a DO), but I couldn't reproduce it either. It's possible that the root issue lies in other interactions, but I can't recall what exactly.

@JamesDelfini

At the risk of being the wrong signal here: I see more failures when a CORS (OPTIONS) request is happening. The app has access to Queues and KV, not DO like others are using above.

[screenshot of the error attached]

Confirmed in my development environment when using CORS, but in the deployed environment it works fine.

@matthewjosephtaylor

@RamIdeas Here is my env. I'm using DOs with WebSockets in tandem with normal REST calls, all handled via the DO.

Here is my system info (apologies for the franken-template):

What version(s) of the tool(s) are you using?
"wrangler": "^3.24.0"

What version of Node are you using?
v20.9.0

What operating system are you using?
MacBook Pro, Darwin Kernel Version 23.0.0

You mentioned waiting 10s between requests. I suggest opening that up to a minute. I never timed it, but 10s feels a bit too short.

On the wild chance I'm 'doing it wrong', here is how I'm initializing the DO from my main index.ts file, since the docs around this are quite confusing/inconsistent.

// In order for the workers runtime to find the class that implements
// our Durable Object namespace, we must export it from the root module.
export { AppStateDurableObject } from "./state/durable/AppStateDurableObject";

export default {
  fetch(req: Request, env: Env, ctx: ExecutionContext) {
    const id = env.APP_STATE.idFromName("app-state");
    const stub = env.APP_STATE.get(id);
    return stub.fetch(req);
  },
};
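
For completeness, the Env type referenced above is roughly the following (a sketch; the actual bindings interface may differ):

// Rough shape of the Env bindings used above (sketch; the real interface may differ).
interface Env {
  APP_STATE: DurableObjectNamespace;
}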

@RamIdeas
Contributor

@matthewjosephtaylor that worker code looks totally fine, but I would need to see the Durable Object code as it essentially handles the full request/response.

I used 10s because that is the timeout workerd enforces before hibernating an idle Durable Object – in reality I waited about 15s to be sure. If you're experiencing issues specifically related to waiting a minute between requests, can you please open a separate issue with repro steps so we can track it there? Specifically, can you please include:

  • the specific version of your OS and wrangler (not the semver range)
  • a minimal working subset of your worker code
  • a minimal working subset of your wrangler.toml
  • commands used to start your local dev server, including custom env vars and cli args, and
  • steps to be performed in your browser, curl commands or a test we can run which reliably fails (at least a certain percentage of the runs)

A git repo I can clone and run a test suite on would be optimal, but even a git repo with a README of step-by-step instructions would be great. This should help us reproduce the bug and diagnose and address the issue faster. Thanks!

@matthewjosephtaylor

@RamIdeas I note you have a potential fix drafted 🤞 that it works :)

Will open a new bug if your fix doesn't work.

FYI I'd also be happy enough with some flag to disable hibernating the DO in dev. I like that wrangler dev does it as it exercises the DO well during development, but since it is effectively broken for me, it hurts more than it helps at the moment.

@Cherry
Contributor

Cherry commented Jan 31, 2024

We hit this today with one of our projects. It's a relatively simple Worker, no bindings other than the Workers Sites KV binding. Our wrangler.toml is as simple as this:

name = "xxx"
account_id = "xxx"
workers_dev = true
compatibility_date = "2023-06-29"
main = "./src/worker.ts"
compatibility_flags = ["nodejs_compat"]

rules = [
  { type = "Text", globs = ["**/*.svg"], fallthrough = true }
]

[site]
bucket = "./public-dist"

It wasn't happening for everyone on the team, but at least one person was experiencing this error when hitting an endpoint multiple times in some environments. It didn't happen consistently, so creating a full repro isn't viable, unfortunately, but every so often they'd see it, and it'd break their implementation, leaving the request without any response.

On reverting to 3.18.0, the issues subsided and they were able to continue working without issue.

@RamIdeas
Contributor

RamIdeas commented Feb 2, 2024

Hi @matthewjosephtaylor and @Cherry! I think I managed to narrow this down to requests being fired off to the previous UserWorker just before a new UserWorker is started (after a reload caused by config or source file changes).

I have a potential fix in #4867 – please try out the prerelease build and let me know if you still experience issues.

@matthewjosephtaylor thanks for confirming the initial fix on that branch solved your issue. I've since made more changes to further address the issue. Would appreciate you trying again too.

@RamIdeas
Contributor

RamIdeas commented Feb 2, 2024

FYI I'd also be happy enough with some flag to disable hibernating the DO in dev. I like that wrangler dev does it as it exercises the DO well during development, but since it is effectively broken for me, it hurts more than it helps at the moment.

Thanks for the feature request. We'll consider exposing it via wrangler in the current and upcoming work on improving the local dev servers.

Notably, hibernation occurs after 10s and, if this issue wasn't being seen until after about a minute as other comments imply, then I don't think hibernation was relevant to the issue in question. Also, the Durable Object is going to reload when the UserWorker reloads anyway so you will definitely want to implement logic to recover from Durable Object reloads even for local dev.
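
As an illustration (just a sketch, not an official recommendation), client-side recovery can be as simple as reconnecting with a capped backoff whenever the socket closes:

// Sketch: reconnect with a capped backoff so the client survives Durable Object reloads.
function connect(url: string, attempt = 0): void {
  const ws = new WebSocket(url);
  ws.addEventListener("message", () => (attempt = 0)); // reset the backoff once traffic flows again
  ws.addEventListener("close", () => {
    const delay = Math.min(10_000, 500 * 2 ** attempt); // 500ms, 1s, 2s, ... capped at 10s
    setTimeout(() => connect(url, attempt + 1), delay);
  });
}

connect("ws://localhost:8787/ws");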

@nora-soderlund
Contributor

@RamIdeas I just tested your PR; it does not fix it for me, unfortunately. I have the same issue as everyone else here (fetches failing during reloads, etc.), and downgrading to 3.18.0 fixes it for me.

▲ [WARNING] Failed to get worker definitions TypeError: fetch failed

      at fetch (<redacted>/node_modules/.pnpm/@prerelease-registry.devprod.cloudflare.dev+workers-sdk+runs+7766885793@npm-package-wrangler-4867/node_modules/wrangler/wrangler-dist/cli.js:17032:19)
      at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
      at async getRegisteredWorkers (<redacted>/node_modules/.pnpm/@prerelease-registry.devprod.cloudflare.dev+workers-sdk+runs+7766885793@npm-package-wrangler-4867/node_modules/wrangler/wrangler-dist/cli.js:127052:22)
      at async getBoundRegisteredWorkers (<redacted>/node_modules/.pnpm/@prerelease-registry.devprod.cloudflare.dev+workers-sdk+runs+7766885793@npm-package-wrangler-4867/node_modules/wrangler/wrangler-dist/cli.js:127071:29) {
    cause: ConnectTimeoutError: Connect Timeout Error
        at onConnectTimeout (<redacted>/node_modules/.pnpm/@prerelease-registry.devprod.cloudflare.dev+workers-sdk+runs+7766885793@npm-package-wrangler-4867/node_modules/wrangler/wrangler-dist/cli.js:5865:29)
        at <redacted>/node_modules/.pnpm/@prerelease-registry.devprod.cloudflare.dev+workers-sdk+runs+7766885793@npm-package-wrangler-4867/node_modules/wrangler/wrangler-dist/cli.js:5821:51
        at Immediate._onImmediate (<redacted>/node_modules/.pnpm/@prerelease-registry.devprod.cloudflare.dev+workers-sdk+runs+7766885793@npm-package-wrangler-4867/node_modules/wrangler/wrangler-dist/cli.js:5853:13)
        at process.processImmediate (node:internal/timers:476:21) {
      code: 'UND_ERR_CONNECT_TIMEOUT'
    }
  }

I don't know how much info I can give that will be helpful, other than that this project is running Remix. I will try from my end to narrow down the issue in workers-sdk.

I did some digging the other day but didn't get very far; I'll see if the changes in your PR guide me a bit better this time.

@matthewjosephtaylor

@RamIdeas your fix #4867 improved the situation quite a bit, but now I eventually get:

✘ [ERROR] *** Received signal #11: Segmentation fault: 11

  stack:

(no stack trace in console)

After I get the segfault wrangler is effectively dead, and I have to restart it.

This happens after several minutes, which is a remarkable improvement on what it was before, where it would basically die as soon as hibernation set in.

For the record this is what I tested with:

npx https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/runs/7766885793/npm-package-wrangler-4867 dev --port 8787 ./src/index.ts 

@RamIdeas
Contributor

RamIdeas commented Feb 8, 2024

Hi @nora-soderlund, that looks like an issue with the dev registry (used for local service bindings). Although it's probably related to worker reloads, can you open another issue so that we can track and fix it independently? (We may already have a PR up that fixes it.)

@matthewjosephtaylor thanks for confirming the PR makes a marked improvement. That segfault is going to be a tough one to debug. Can you open another issue so that we can debug it in isolation? I may need to provide you a debug build of workerd and instructions to use it to get any useful leads, and it would be easier to follow if we had a separate issue. Please feel free to assign me or just tag me :)

@Cherry
Contributor

Cherry commented Feb 12, 2024

Sadly #4867 did not solve it for one of my team members.

It's a long-running worker, run with wrangler dev without any edits being made, but now instead of seeing the ProxyController error, they just receive a 503. It's not consistent, and it's not reliably reproducible (it seems to happen mostly when hitting the same endpoint a lot), but reverting them to 3.18.0 is still our best resolution.

@matthewjosephtaylor

matthewjosephtaylor commented Feb 12, 2024

@Cherry I opened #4995 in relation to the fix for #4867 helping but not solving my issues (new segfault error exposed). Perhaps the issues you are continuing to face are related as well, especially if the 503s become consistent once they start happening.

@Cherry
Contributor

Cherry commented Feb 12, 2024

Thanks @matthewjosephtaylor. They unfortunately aren't consistent for one of my team members. They can send 100 requests: 98 are fine, and then 2 of them throw the new 503 introduced in #4867 😔

@RamIdeas
Contributor

@matthewjosephtaylor thanks I'll follow up in that issue!

@Cherry it's helpful to know that they're now receiving these new 503s. That implies they're non-GET/HEAD requests. Can you tell me more about the requests which are failing? Are they long-running requests? Do they have large bodies? Something is making them flaky to proxy to the UserWorker and, since they're workerd -> workers requests, that shouldn't be the case. Can you provide more info on their OS version/Node version, or any more info to go on?

@Cherry
Contributor

Cherry commented Feb 12, 2024

@RamIdeas thanks for the followup. While I can't find a reliable repro, here's about as much info as I've been able to gather:

  • Windows 11, Node 20.11.0
  • They're POST requests with an empty body, to something like /notification/<id>/dismiss
  • They run a few async operations (queries), but respond within about 2-3 seconds locally (testing with remote DB)
  • Sometimes 30 of them can be made without any issues. But other times, the second or third one fails. Sometimes they'll succeed, and other times they'll just 503, and I've not found any specific cause or correlation. Retrying ones that fail almost always succeeds just fine, on a second or third retry.

@aroman

aroman commented Feb 12, 2024

We've been getting the occasional 503 error (very annoying — causes our tests to be flaky) for a long time, even on version 3.11.0. It seems to only occur under load and/or with a high degree of concurrency to the dev server, which is what our playwright tests do. Reducing the total # of concurrent requests does reduce the frequency of this issue, but doesn't totally solve it.

@StrandedKitty

StrandedKitty commented Feb 13, 2024

I am getting this issue when doing several consecutive POST requests to something as simple as:

export async function onRequest({request}: EventContext<unknown, string, unknown>): Promise<Response> {
	return new Response("test", {status: 200});
}

Win 11, Node v19.9.0.

It seems to be gone after switching to Wrangler 3.18.0.

@RamIdeas
Contributor

@aroman the changes causing this issue were not in v3.11.0 – can you upgrade to the latest version and try again? 503s returned before 3.28.0 would be a completely different issue, and if you're still experiencing it on recent versions, please open a new issue with your repro steps.

@Cherry thanks for the additional info! I missed that in your previous comment you mentioned there are no reloads and you're receiving a 503. It shouldn't be possible to hit the code path to the 503 unless a request to a previous version of your worker failed and, upon catching that failed request, we notice a new version of your worker. Can you provide me the log file for that wrangler session? That would have more details about the state of the ProxyWorker before/during/after these failed requests.

@Cherry + @StrandedKitty I tried POSTing many concurrent requests with the following but couldn't reproduce the error. Could you try this repro on your setups and let me know if you experience the issue?

worker:

export default {
	async fetch(request, env, ctx) {
		await new Promise((r) => setTimeout(r, 5000));
		return new Response('HELLO');
	},
};

In one terminal, run:

$ npx wrangler@latest dev

In another terminal:

$ for i in {1..200}
do
  curl -X POST http://localhost:8787 -F 'fileX=@./path/to/any/file' &
done

@StrandedKitty

StrandedKitty commented Feb 22, 2024

@RamIdeas Hi, I've debugged it a little, and it seems like in order to reproduce it you need to:

  • Send a lot of consecutive POST requests with a non-empty body (content type doesn't matter) with a small delay. Sending a file like in your example with curl also leads to 503 Service Unavailable.
  • Use Wrangler >=3.19.0. I can't reproduce it with 3.18.0, but it seems to be present on all newer versions.

On 3.19.0 the error looks like this:

✘ [ERROR] Error in ProxyController: Error inside ProxyWorker

   {
    name: 'Error',
    message: 'Network connection lost.',
    stack: 'Error: Network connection lost.'
  }

And on 3.28.4 it looks like this:

✘ [ERROR] Your worker created multiple branches of a single stream (for instance, by calling `response.clone()` or `request.clone()`) but did not read the body of both branches. This is wasteful, as it forces the system to buffer the entire stream of data in memory, rather than streaming it through. This may cause your worker to be unexpectedly terminated for going over the memory limit. If you only meant to copy the request or response headers and metadata (e.g. in order to be able to modify them), use the appropriate constructors instead (for instance, `new Response(response.body, response)`, `new Request(request)`, etc).

So it seems like this issue is related to #3259 in some way.
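
For context, that warning on 3.28.4 is what workerd emits when a request or response is cloned but only one branch of the body is read. A minimal sketch of the pattern that triggers it (illustrative only, not Wrangler's actual internals):

// Illustrative only: cloning creates a second branch of the body stream...
export default {
  async fetch(request: Request): Promise<Response> {
    const copy = request.clone();
    // ...but only one branch is ever read, so workerd warns about the unread branch.
    const body = await request.text();
    return new Response(body);
  },
};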

Let me know if you need further help.

@Cherry
Contributor

Cherry commented Feb 26, 2024

I found a way to reliably reproduce this on Windows, @RamIdeas (and had some success on macOS). I filed #5095 to track it separately.

@amiranvarov

This issue still persists. Are any CF members planning to take a look at it?

@RamIdeas
Contributor

This issue still persists. Are any CF members planning to take a look at it?

@amiranvarov if you're experiencing a similar error in your wrangler dev session, please open a new issue with steps to reproduce.
