
JdkConnector: JerseyInvocation stuck after DestinationConnectionPool threw an IllegalStateException #4810

Open
cykl opened this issue Jun 17, 2021 · 14 comments

Comments

@cykl

cykl commented Jun 17, 2021

One of our applications got a thread stuck waiting for a future that will never complete, after what seems to be an internal state corruption in jdk-connector. The exception occurred at 3 AM and the thread was still stuck at 9 AM.

Exception in thread "jdk-connector-9566" java.lang.IllegalStateException: Request not found
        at org.glassfish.jersey.jdk.connector.internal.DestinationConnectionPool.getRequest(DestinationConnectionPool.java:156)
        at org.glassfish.jersey.jdk.connector.internal.DestinationConnectionPool.access$600(DestinationConnectionPool.java:34)
        at org.glassfish.jersey.jdk.connector.internal.DestinationConnectionPool$ConnectionStateListener.onStateChanged(DestinationConnectionPool.java:250)
        at org.glassfish.jersey.jdk.connector.internal.HttpConnection.changeState(HttpConnection.java:231)
        at org.glassfish.jersey.jdk.connector.internal.HttpConnection.access$200(HttpConnection.java:39)
        at org.glassfish.jersey.jdk.connector.internal.HttpConnection$ConnectionFilter.processRead(HttpConnection.java:376)
        at org.glassfish.jersey.jdk.connector.internal.HttpConnection$ConnectionFilter.processRead(HttpConnection.java:344)
        at org.glassfish.jersey.jdk.connector.internal.Filter.onRead(Filter.java:116)
        at org.glassfish.jersey.jdk.connector.internal.HttpFilter.processRead(HttpFilter.java:106)
        at org.glassfish.jersey.jdk.connector.internal.HttpFilter.processRead(HttpFilter.java:27)
        at org.glassfish.jersey.jdk.connector.internal.Filter.onRead(Filter.java:116)
        at org.glassfish.jersey.jdk.connector.internal.SslFilter.handleRead(SslFilter.java:359)
        at org.glassfish.jersey.jdk.connector.internal.SslFilter.processRead(SslFilter.java:321)
        at org.glassfish.jersey.jdk.connector.internal.SslFilter.processRead(SslFilter.java:35)
        at org.glassfish.jersey.jdk.connector.internal.Filter.onRead(Filter.java:116)
        at org.glassfish.jersey.jdk.connector.internal.Filter.onRead(Filter.java:124)
        at org.glassfish.jersey.jdk.connector.internal.TransportFilter$3.completed(TransportFilter.java:262)
        at org.glassfish.jersey.jdk.connector.internal.TransportFilter$3.completed(TransportFilter.java:249)
        at java.base/sun.nio.ch.Invoker.invokeUnchecked(Invoker.java:127)
        at java.base/sun.nio.ch.Invoker$2.run(Invoker.java:219)
        at java.base/sun.nio.ch.AsynchronousChannelGroupImpl$1.run(AsynchronousChannelGroupImpl.java:112)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
"[redacted]-835" #4850639 prio=5 os_prio=0 cpu=0.66ms elapsed=21736.91s tid=0x00007fef6817e800 nid=0x6178 waiting on condition  [0x00007fef224e4000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	- parking to wait for  <0x00000000e98ddce0> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
	at java.util.concurrent.CompletableFuture$Signaller.block([email protected]/CompletableFuture.java:1796)
	at java.util.concurrent.ForkJoinPool.managedBlock([email protected]/ForkJoinPool.java:3128)
	at java.util.concurrent.CompletableFuture.waitingGet([email protected]/CompletableFuture.java:1823)
	at java.util.concurrent.CompletableFuture.get([email protected]/CompletableFuture.java:1998)
	at org.glassfish.jersey.jdk.connector.internal.JdkConnector.apply(JdkConnector.java:72)
	at com.greencomnetworks.jersey.client.RectifyOpentracingSpanProvider$RectifyOpentracingConnector.apply(RectifyOpentracingSpanProvider.java:52)
	at org.glassfish.jersey.client.ClientRuntime.invoke(ClientRuntime.java:297)
	at org.glassfish.jersey.client.JerseyInvocation.lambda$invoke$2(JerseyInvocation.java:644)
	at org.glassfish.jersey.client.JerseyInvocation$$Lambda$801/0x000000010060bc40.call(Unknown Source)
	at org.glassfish.jersey.client.JerseyInvocation.call(JerseyInvocation.java:654)
	at org.glassfish.jersey.client.JerseyInvocation.lambda$runInScope$3(JerseyInvocation.java:648)
	at org.glassfish.jersey.client.JerseyInvocation$$Lambda$802/0x000000010060b040.call(Unknown Source)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:205)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:390)
	at org.glassfish.jersey.client.JerseyInvocation.runInScope(JerseyInvocation.java:648)
	at org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:643)
	at org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:416)
	at org.glassfish.jersey.client.JerseyInvocation$Builder.get(JerseyInvocation.java:307)
        [REDACTED]
	at com.pivovarit.collectors.AsyncParallelCollector.lambda$null$1(AsyncParallelCollector.java:62)
	at com.pivovarit.collectors.AsyncParallelCollector$$Lambda$709/0x000000010059d440.get(Unknown Source)
	at com.pivovarit.collectors.Dispatcher.lambda$completionTask$2(Dispatcher.java:90)
	at com.pivovarit.collectors.Dispatcher$$Lambda$710/0x000000010059d840.run(Unknown Source)
	at java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:515)
	at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
	at com.pivovarit.collectors.Dispatcher.lambda$withFinally$4(Dispatcher.java:117)
	at com.pivovarit.collectors.Dispatcher$$Lambda$714/0x000000010059ec40.run(Unknown Source)
	at com.greencomnetworks.concurrent.TracingAwareRunnable.run(TracingAwareRunnable.java:26)
	at io.micrometer.core.instrument.internal.TimedRunnable.run(TimedRunnable.java:44)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)

The client is configured with the jdk-connector, MAX_CONNECTIONS_PER_DESTINATION = 32 and CONNECTION_IDLE_TIMEOUT = 60_000. I don't have proper metrics, but the client was fairly busy when it happened.

This application was using 2.32. I just skimmed through git log -p and didn't see any change that would suggest that a newer version could fix the issue.

@jbescos
Member

jbescos commented Jun 18, 2021

In the stack trace there is:
com.greencomnetworks.jersey.client.RectifyOpentracingSpanProvider$RectifyOpentracingConnector.apply(RectifyOpentracingSpanProvider.java:52)
Could you provide some information about it?

@cykl
Author

cykl commented Jun 18, 2021

Sure, it's a thin wrapper around a real provider that ensures the client OpenTracing span is closed even when the response filter is not executed (e.g. failed DNS resolution). I never found a better way to do that with JAX-RS.

I doubt it could be the cause of this failure, but I will re-review it. I haven't had time to dig into this issue yet, but my gut feeling is that something could be wrong in the locking, or the expected happens-before relationships, in DestinationConnectionPool. However, I have no reproducer yet and didn't have time to deep-dive into the implementation.

/**
 * Workaround JAX-RS 2.1 client filter limitations regarding OpenTracing instrumentation.
 *
 * <p>
 * JAX-RS filters suffer from a design issue: in the error case, response filters are not invoked, see
 * https://github.com/eclipse-ee4j/jaxrs-api/issues/684. This prevents OpenTracing instrumentation from closing the
 * span. This provider works around that issue: it ensures that spans are finished and errors are attached to them.
 */
final class RectifyOpentracingSpanProvider implements ConnectorProvider {

    private final ConnectorProvider delegate;

    RectifyOpentracingSpanProvider( ConnectorProvider delegate ) {
        this.delegate = delegate;
    }

    @Override
    public Connector getConnector( Client client, Configuration runtimeConfig ) {
        var realConnector = delegate.getConnector( client, runtimeConfig );
        return new RectifyOpentracingConnector( realConnector );
    }

    static class RectifyOpentracingConnector implements Connector {

        private final Connector delegate;

        RectifyOpentracingConnector( Connector delegate ) {
            this.delegate = delegate;
        }

        @Override
        public ClientResponse apply( ClientRequest request ) {
            try {
                return delegate.apply( request );
            } catch( ProcessingException pe ) {
                var cause = pe.getCause( );

                SpanWrapper spanWrapper = CastUtils.cast( request.getProperty( PROPERTY_NAME ), SpanWrapper.class );
                if( spanWrapper != null && !spanWrapper.isFinished( ) ) {
                    var span = spanWrapper.get( );

                    Tags.ERROR.set( span, true );
                    if( cause != null ) {
                        span.log( Map.of(
                                "event", Tags.ERROR.getKey( ),
                                "error.object", cause ) );
                    }

                    spanWrapper.finish( );
                }

                throw pe;
            }
        }

        @Override
        public Future<?> apply( ClientRequest request, AsyncConnectorCallback delegateCallback ) {
            return delegate.apply( request, new AsyncConnectorCallback( ) {
                @Override
                public void response( ClientResponse response ) {
                    delegateCallback.response( response );

                }

                @Override
                public void failure( Throwable cause ) {
                    SpanWrapper spanWrapper = CastUtils.cast( request.getProperty( PROPERTY_NAME ), SpanWrapper.class );
                    if( spanWrapper != null && !spanWrapper.isFinished( ) ) {
                        var span = spanWrapper.get( );

                        Tags.ERROR.set( span, true );
                        span.log( Map.of(
                                "event", Tags.ERROR.getKey( ),
                                "error.object", cause ) );

                        spanWrapper.finish( );
                    }

                    delegateCallback.failure( cause );
                }
            } );
        }

        @Override
        public String getName( ) {
            return delegate.getName( );
        }

        @Override
        public void close( ) {
            delegate.close( );
        }
    }
}

@jbescos
Copy link
Member

jbescos commented Jun 28, 2021

This issue is very tricky and requires some discussion.

Based on the stack traces, it is using the synchronous method of the JdkConnector. The problem is that internally it handles the response asynchronously and then waits for it to complete. This is where the thread is stuck in your stack trace.

I have tried to make it truly synchronous by removing the Future inside JdkConnector#apply, but this does not help, because the request is later handled asynchronously by HttpConnectionPool anyway.

This HttpConnectionPool takes care of the running connections. They are identified by URI, so two different requests to the same URI can reuse the same connection, which improves performance.
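The pooling-by-destination idea can be sketched with plain JDK types. This is a hypothetical illustration (class and method names are made up, not the actual Jersey internals) of keying shared state by scheme/host/port so that two requests to the same destination land in the same pool entry:

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not the Jersey implementation: connections (here a
// simple counter stands in for a pool) are shared per destination.
public class PoolSketch {

    static final Map<String, AtomicInteger> POOLS = new ConcurrentHashMap<>();

    // Two requests to the same scheme://host:port resolve to the same entry.
    static AtomicInteger poolFor(URI uri) {
        String key = uri.getScheme() + "://" + uri.getHost() + ":" + uri.getPort();
        return POOLS.computeIfAbsent(key, k -> new AtomicInteger());
    }

    public static void main(String[] args) {
        AtomicInteger a = poolFor(URI.create("https://example.org:443/a"));
        AtomicInteger b = poolFor(URI.create("https://example.org:443/b"));
        System.out.println(a == b); // prints true: same destination, same pool
    }
}
```

The path component is deliberately excluded from the key: it is the destination, not the resource, that determines connection reuse.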

I think there is a race condition when processing more than one request to the same URI.

There are 3 possible "solutions":
1 - Fix the race condition. This is a very complex solution, difficult to spot and test, and it is easy to introduce new issues.
2 - Make the synchronous JdkConnector#apply use a new connection every time instead of the HttpConnectionPool. But there should be some limit to prevent spawning hundreds of connections.
3 - Set a timeout on the Future inside JdkConnector#apply. The issue will still exist, but it will not hang threads forever.
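Option 3 can be sketched with a bounded CompletableFuture.get using plain JDK types. The method name awaitResponse and the exception choice are made up for this sketch; it only illustrates bounding the wait instead of parking forever:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutSketch {

    // Waits a bounded amount of time for the response instead of forever.
    static String awaitResponse(CompletableFuture<String> response, long millis) throws Exception {
        try {
            return response.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            response.cancel(true); // best effort: release work tied to the lost request
            throw new IllegalStateException("No response within " + millis + " ms", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // A future that is never completed, like a request lost by the pool.
        CompletableFuture<String> lost = new CompletableFuture<>();
        try {
            awaitResponse(lost, 50);
        } catch (IllegalStateException expected) {
            System.out.println(expected.getMessage()); // "No response within 50 ms"
        }
    }
}
```

As noted above, this only turns a silent hang into a visible failure; picking an adequate timeout value is a separate problem.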

@cykl
Author

cykl commented Jun 28, 2021

Thanks for the feedback, it matches my intuition.

2 - Make the synchronous JdkConnector#apply use a new connection every time instead of the HttpConnectionPool. But there should be some limit to prevent spawning hundreds of connections.

Connection reuse is a must-have for me. Creating a new connection with a TLS handshake for each call is way too expensive.

As a fun fact, we are migrating away from HttpUrlConnectorProvider because we discovered that TLS streams get terminated by TLS alerts under heavy load. The effective behavior was closer to Connection: randomly-close than to a persistent connection. I still have to report this issue.

3 - Set a timeout on the Future inside JdkConnector#apply. The issue will still exist, but it will not hang threads forever.

While it makes sense and would likely be better than being stuck forever, I have to evaluate whether it enables the application to react better to such an issue. Figuring out an adequate timeout could also be a problem.

More generally, what production-grade connector would the Jersey team recommend? I'd like to avoid "heavy" dependencies, especially if the connector doesn't support HTTP/2.

@cykl
Author

cykl commented Jul 2, 2021

We just deployed a few more applications with the jdk-connector enabled, and this race condition seems much more common than I expected. It took down more than 5 pods, under small load, in less than 12h. This seems to be a showstopper, and the jdk-connector should not be used until it gets fixed.

@jbescos
Member

jbescos commented Jul 2, 2021

@cykl I am trying to reproduce the issue with the test below, but I have had no 'bad luck' so far. Could you check with a thread dump whether you have more hung threads? This is the test: one variant hangs all the requests and the other is more random.

public class StressTest extends JerseyTest {

    private static final int SSL_PORT = 29999;
    private static final int ITERATIONS = 10000;
    // Must be less than the server thread pool. Is there any way to get that value?
    private static final int N_REQUESTS = 10;
    private static final SSLContext SSL_CONTEXT;
    private static final ExecutorService executor = Executors.newFixedThreadPool(N_REQUESTS);
    private static CountDownLatch requests;
    private static CountDownLatch latch;

    static {
        System.setProperty("javax.net.ssl.keyStore", SslFilterTest.class.getResource("/keystore_server").getPath());
        System.setProperty("javax.net.ssl.keyStorePassword", "asdfgh");
        System.setProperty("javax.net.ssl.trustStore", SslFilterTest.class.getResource("/truststore_server").getPath());
        System.setProperty("javax.net.ssl.trustStorePassword", "asdfgh");
        try {
            SSL_CONTEXT = SslConfigurator.newInstance()
                    .trustStoreFile(SslFilterTest.class.getResource("/truststore_server").getPath())
                    .trustStorePassword("asdfgh")
                    .keyStoreFile(SslFilterTest.class.getResource("/keystore_server").getPath())
                    .keyStorePassword("asdfgh").createSSLContext();
        } catch (Exception e) {
            e.printStackTrace();
            throw new IllegalStateException(e);
        }
    }

    @Path("/test")
    public static class TestResource {
        @GET
        @Path("/1")
        public String test1() throws InterruptedException {
            requests.countDown();
            if (latch.await(100, TimeUnit.SECONDS)) {
                return "test1";
            } else {
                throw new IllegalStateException("Timeout");
            }
        }
        @GET
        @Path("/2")
        public String test2() {
            return "test2";
        }
    }

    @Override
    protected Application configure() {
        enable(TestProperties.LOG_TRAFFIC);
        enable(TestProperties.DUMP_ENTITY);
        return new ResourceConfig(TestResource.class);
    }

    @Override
    protected void configureClient(ClientConfig config) {
        config.connectorProvider(new JdkConnectorProvider());
    }

    @Override
    protected Optional<SSLContext> getSslContext() {
        return Optional.of(SSL_CONTEXT);
    }

    @Override
    protected Optional<SSLParameters> getSslParameters() {
        return Optional.of(SSL_CONTEXT.createSSLEngine("localhost", SSL_PORT).getSSLParameters());
    }

    @Override
    protected URI getBaseUri() {
        return UriBuilder.fromUri("https://localhost/").port(SSL_PORT).build();
    }

    @Test
    public void hangAllRequestsStatus200() throws InterruptedException, ExecutionException {
        assertEquals("https", getBaseUri().getScheme());
        for (int i = 0; i < ITERATIONS; i++) {
            requests = new CountDownLatch(N_REQUESTS);
            latch = new CountDownLatch(1);
            List<Future<Response>> responses = new ArrayList<>();
            for (int j = 0; j < N_REQUESTS; j++) {
                Future<Response> future = executor.submit(() -> target("/test/1").request().get());
                responses.add(future);
            }
            assertTrue(requests.await(100, TimeUnit.SECONDS));
            latch.countDown();
            for (Future<Response> response : responses) {
                assertEquals(200, response.get().getStatus());
            }
        }
    }

    @Test
    public void randomnessStatus200() throws InterruptedException, ExecutionException {
        assertEquals("https", getBaseUri().getScheme());
        for (int i = 0; i < ITERATIONS; i++) {
            List<Future<Response>> responses = new ArrayList<>();
            for (int j = 0; j < N_REQUESTS; j++) {
                Future<Response> future = executor.submit(() -> target("/test/2").request().get());
                responses.add(future);
            }
            for (Future<Response> response : responses) {
                assertEquals(200, response.get().getStatus());
            }
        }
    }

}

@cykl
Author

cykl commented Jul 2, 2021

All production occurrences I have investigated had only one blocked thread. But this can easily be explained by the architecture of the applications. They are batch jobs parallelizing HTTP calls using https://github.com/pivovarit/parallel-collectors. When a call blocks, the current batch times out within seconds or minutes and aborts or gets stuck, giving almost no chance for another call to block.

@jbescos
Member

jbescos commented Jul 2, 2021

When I enabled SSL in the test, I found a deadlock that is relatively easy to reproduce. Maybe it is related to your issue, maybe not, but at least I have something to work on.

I attach the stacks of the blocked threads.
thread_dump.txt

It is happening when the test is shutting down:

	at org.glassfish.jersey.jdk.connector.internal.JdkConnector.close(JdkConnector.java:246)
	at org.glassfish.jersey.client.ClientRuntime.close(ClientRuntime.java:368)
	at org.glassfish.jersey.client.ClientRuntime.onShutdown(ClientRuntime.java:353)
	at org.glassfish.jersey.client.JerseyClient.release(JerseyClient.java:190)
	at org.glassfish.jersey.client.JerseyClient.close(JerseyClient.java:180)
	at org.glassfish.jersey.test.JerseyTest.closeIfNotNull(JerseyTest.java:953)
	at org.glassfish.jersey.test.JerseyTest.tearDown(JerseyTest.java:648)

@cykl
Author

cykl commented Jul 2, 2021

👍 I can confirm that all our calls issued through the jdk-connector use TLS.

@cykl
Author

cykl commented Jul 5, 2021

I was able to reproduce the issue, without TLS, using the following terrible code. As the consistency issue is likely related to a connection being closed or replaced, I set the connection idle timeout to a very low value to ensure high connection churn. We use a 60s timeout in our applications.

The good news is that it doesn't seem that hard to reproduce (it usually occurs after ~60s).

The not so good news is that I observed other dubious behaviors:

  1. It seems frequent that an invocation tries to use a closed connection. While this is not a showstopper, as the client caller must deal with ProcessingException anyway, it seems weird, as I think the jdk-connector itself closed the connection and then allowed an invocation to use it (I have not verified this and might be wrong).
  2. Some client threads got stuck because of an issue other than DestinationConnectionPool complaining about a non-existent request. It seems to occur more frequently than the issue I reported here, and I observed no exception being logged.

Because of 2, I don't have a good predicate to detect that the issue we are looking for occurred; you must manually check the error output for the presence of "Request not found" or set a breakpoint.

Note: using OkHttp's MockWebServer & ParallelCollectors was the quickest way for me to write such code; my apologies for the headache it might cause. I didn't spend time porting it into the Jersey codebase as I am currently unable to write a real unit test.

package com.greencomnetworks.jersey.client;

import com.google.common.util.concurrent.ThreadFactoryBuilder;
import com.pivovarit.collectors.ParallelCollectors;
import java.io.IOException;
import java.time.Duration;
import java.util.Arrays;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;
import java.util.stream.IntStream;
import javax.ws.rs.ProcessingException;
import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import okhttp3.mockwebserver.Dispatcher;
import okhttp3.mockwebserver.MockResponse;
import okhttp3.mockwebserver.MockWebServer;
import okhttp3.mockwebserver.RecordedRequest;
import org.glassfish.jersey.client.ClientConfig;
import org.glassfish.jersey.jdk.connector.JdkConnectorProperties;
import org.glassfish.jersey.jdk.connector.JdkConnectorProvider;
import org.junit.jupiter.api.Test;

public class Reproducer {

    @Test
    void reproducer( ) throws IOException, ExecutionException, InterruptedException, TimeoutException {
        var parallelism = 16;
        var cnxTimeout = Duration.ofMillis( 50 );

        try (var server = new MockWebServer( )) {
            server.setDispatcher( new Dispatcher( ) {
                public MockResponse dispatch( RecordedRequest recordedRequest ) {
                    return new MockResponse( );
                }
            } );

            Supplier<Client> clientFactory = ( ) -> {
                return ClientBuilder.newBuilder( )
                        .withConfig( new ClientConfig( )
                                .property( JdkConnectorProperties.MAX_CONNECTIONS_PER_DESTINATION, parallelism )
                                .property( JdkConnectorProperties.CONNECTION_IDLE_TIMEOUT,
                                        ( int ) cnxTimeout.toMillis( ) )
                                .connectorProvider( new JdkConnectorProvider( ) ) )
                        .build( );
            };

            var executor = Executors.newFixedThreadPool( parallelism, new ThreadFactoryBuilder( )
                    .setNameFormat( "client-%d" )
                    .build( ) );

            Client client = null;

            for( int iteration = 0; iteration < Integer.MAX_VALUE; iteration++ ) {
                if( client == null ) {
                    client = clientFactory.get( );
                }

                var target = client.target( server.url( "/" ).toString( ) );

                try {
                    IntStream.range( 0, 512 )
                            .mapToObj( Integer::toString )
                            .collect( ParallelCollectors.parallel(
                                    i -> {
                                        try {
                                            target.request( ).get( );
                                        } catch( ProcessingException e ) {
                                            if( e.getMessage( ).contains( "Current state: CLOSED" ) ) {
                                                System.out.println( e.getMessage( ) ); // Not what we are looking for
                                            } else {
                                                throw e;
                                            }
                                        }
                                        return i;
                                    },
                                    executor,
                                    parallelism ) )
                            .get( 15, TimeUnit.SECONDS );

                    // Might help to trigger the race condition as it likely happen on cnx close
                    var sleepDuration =
                            ( long ) ( ThreadLocalRandom.current( ).nextLong( cnxTimeout.toMillis( ) ) * 1.1 );
                    Thread.sleep( sleepDuration );
                } catch( TimeoutException e ) {
                    System.err.printf(
                            "Maybe reproduced #4810 at iteration %d, check logs for \"Request not found\"\n",
                            iteration );

                    Thread.getAllStackTraces( ).forEach( ( thread, elements ) -> {
                        if( thread.getName( ).startsWith( "client-" ) && elements.length > 10 ) {
                            System.out.println( thread );
                            Arrays.stream( elements )
                                    .forEach( element -> System.out.printf( "\t%s\n", element ) );
                        }
                    } );

                    client.close( );
                    client = null;
                }
            }
        }
    }
}

@jbescos
Member

jbescos commented Jul 8, 2021

I am reproducing it too. I have to trigger more requests than there are threads in the fixed thread pool.

    @Test
    public void randomnessStatus200() throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newFixedThreadPool(PARALLELISM,
                new ThreadFactoryBuilder().setNameFormat("client-%d").build());
        for (int i = 0; i < ITERATIONS; i++) {
            System.out.println("Iteration " + i);
            List<Future<Response>> responses = new ArrayList<>();
            for (int j = 0; j < 100; j++) {
                Future<Response> future = executor.submit(() -> target("/test/2").request().get());
                responses.add(future);
            }
            for (Future<Response> response : responses) {
                assertEquals(200, response.get().getStatus());
            }
        }
        executor.shutdown();
        assertTrue(executor.awaitTermination(10, TimeUnit.SECONDS));
    }

In iteration 762 I hit the issue.

@jbescos
Member

jbescos commented Jul 8, 2021

I found that there are items left in DestinationConnectionPool#pendingRequests when the threads were blocked forever. This matches what you pointed out in the issue.

That field was a HashMap and it is accessed concurrently. After changing it to a ConcurrentHashMap, my tests pass.
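For illustration, a plain HashMap written from two threads can silently drop entries (or worse, loop or throw), while ConcurrentHashMap stays consistent. A minimal stress sketch with made-up names, not the Jersey code itself:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MapRaceSketch {

    // Hammers the map from two threads writing the same 10_000 keys.
    // With a plain HashMap, puts can be lost, so size() may come back < 10_000;
    // with a ConcurrentHashMap, the result is always exactly 10_000.
    static int stress(Map<Integer, Integer> map) throws InterruptedException {
        Runnable writer = () -> {
            for (int i = 0; i < 10_000; i++) {
                map.put(i, i);
            }
        };
        Thread t1 = new Thread(writer);
        Thread t2 = new Thread(writer);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        return map.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("ConcurrentHashMap size: " + stress(new ConcurrentHashMap<>())); // always 10000
        // stress(new java.util.HashMap<>()) is deliberately not run here: an
        // unsynchronized HashMap under concurrent writes may lose entries,
        // throw, or corrupt its internal structure, which is the kind of
        // inconsistency that left stale pendingRequests entries behind.
    }
}
```

A leftover entry in pendingRequests means nobody ever completes the corresponding response future, which is exactly the forever-parked thread in the stack traces above.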

@cykl would you help me test my changes with your reproducer, please? You need to do the following:

  1. Check out the branch with the fix: https://github.com/jbescos/jersey/tree/issue4810
  2. Compile the whole project: mvn clean install -DskipTests
  3. Add the Jersey 2.35-SNAPSHOT dependency to your reproducer.
  4. Run your reproducer.

@jbescos jbescos mentioned this issue Jul 8, 2021
@cykl
Author

cykl commented Jul 8, 2021

I just ran my reproducer a few times against 2.34 and against your branch. I can confirm that your fix seems to make the "Request not found" issue vanish.

However, I still observe many blocked client threads. No stack trace or error message is printed; client threads are just waiting for the future to complete. There is likely another synchronization issue somewhere in the connector. I was observing this with my reproducer on 2.34 but never saw it in the wild.

Note that I added .connectTimeout( 30, TimeUnit.SECONDS ).readTimeout( 30, TimeUnit.SECONDS ) just to make sure the client isn't simply still waiting for something from the server.

Thread[client-12,5,main]
	java.base@11.0.10/jdk.internal.misc.Unsafe.park(Native Method)
	java.base@11.0.10/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
	java.base@11.0.10/java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1796)
	java.base@11.0.10/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128)
	java.base@11.0.10/java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1823)
	java.base@11.0.10/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1998)
	app//org.glassfish.jersey.jdk.connector.internal.JdkConnector.apply(JdkConnector.java:72)
	app//org.glassfish.jersey.client.ClientRuntime.invoke(ClientRuntime.java:297)
	app//org.glassfish.jersey.client.JerseyInvocation.lambda$invoke$0(JerseyInvocation.java:662)
	app//org.glassfish.jersey.client.JerseyInvocation$$Lambda$507/0x0000000800313040.call(Unknown Source)
	app//org.glassfish.jersey.client.JerseyInvocation.call(JerseyInvocation.java:697)
	app//org.glassfish.jersey.client.JerseyInvocation.lambda$runInScope$3(JerseyInvocation.java:691)
	app//org.glassfish.jersey.client.JerseyInvocation$$Lambda$510/0x0000000800340440.call(Unknown Source)
	app//org.glassfish.jersey.internal.Errors.process(Errors.java:292)
	app//org.glassfish.jersey.internal.Errors.process(Errors.java:274)
	app//org.glassfish.jersey.internal.Errors.process(Errors.java:205)
	app//org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:390)
	app//org.glassfish.jersey.client.JerseyInvocation.runInScope(JerseyInvocation.java:691)
	app//org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:661)
	app//org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:413)
	app//org.glassfish.jersey.client.JerseyInvocation$Builder.get(JerseyInvocation.java:313)
	app//com.greencomnetworks.jersey.client.Reproducer2.lambda$reproducer$1(Reproducer2.java:70)
	app//com.greencomnetworks.jersey.client.Reproducer2$$Lambda$340/0x000000080019e040.apply(Unknown Source)
	app//com.pivovarit.collectors.AsyncParallelCollector.lambda$null$1(AsyncParallelCollector.java:62)
	app//com.pivovarit.collectors.AsyncParallelCollector$$Lambda$348/0x000000080019c440.get(Unknown Source)
	app//com.pivovarit.collectors.Dispatcher.lambda$completionTask$2(Dispatcher.java:90)
	app//com.pivovarit.collectors.Dispatcher$$Lambda$349/0x000000080019b840.run(Unknown Source)
	java.base@11.0.10/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	java.base@11.0.10/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	app//com.pivovarit.collectors.Dispatcher.lambda$withFinally$4(Dispatcher.java:117)
	app//com.pivovarit.collectors.Dispatcher$$Lambda$352/0x000000080019a840.run(Unknown Source)
	java.base@11.0.10/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	java.base@11.0.10/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	java.base@11.0.10/java.lang.Thread.run(Thread.java:834)

@jbescos
Member

jbescos commented Jul 8, 2021

Thank you for your tests. Yes, it seems there is something else to fix.

I'm going to leave the test running for hours; let's see if it gets stuck at some point.
