
response.iter_lines() did not print the last line of stream log. #3577

Closed
haocs opened this issue Sep 16, 2016 · 17 comments


haocs commented Sep 16, 2016

python: 3.4
requests: 2.9.1

I implemented the following function to continuously fetch streaming logs from a server.

import requests

def stream_trace(streaming_url, user_name, password):
    r = requests.get(streaming_url, auth=(user_name, password), stream=True)
    for line in r.iter_lines(decode_unicode=True):
        if line:
            print(line)

The above code fetches and prints the log successfully, but its behavior differs from what I expected.
Basically, it holds back the last line of the current content/chunk and prints it together with the next chunk of logs.
For example, say there are two chunks of logs from the server; the expected output is:

# first chunk
a
#  second chunk
b
c

What the stream_trace function actually printed ('a' was printed with the 2nd chunk and 'c' was missing):

# first chunk
# second chunk
a
b

Could you help me figure out what may have gone wrong? Thanks.


Lukasa commented Sep 16, 2016

So iter_lines has a somewhat unexpected implementation. Naïvely, we would expect that iter_lines would receive data as it arrives and look for newlines. In practice, this is not what it does. Instead it waits to read an entire chunk_size, and only then searches for newlines. This is a consequence of the underlying httplib implementation, which only allows for file-like reading semantics, rather than the early-return semantics usually associated with a socket.

You can get the effect you want by setting the chunk size to 1. However, this will drastically reduce performance. If you can tolerate late log delivery, then it is probably enough to leave the implementation as it is: when the connection is eventually closed, all of the lines should safely be delivered and no data will be lost. The only caveat here is that if the connection is closed uncleanly, then we will probably throw an exception rather than return the buffered data.
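The buffering behavior described above can be sketched with a small, standalone model of iter_lines' pending-line logic. This is an illustrative simplification, not the actual requests source; the `chunks` argument stands in for whatever `iter_content` happens to yield:

```python
def iter_lines_sketch(chunks):
    """Simplified model of line iteration over arbitrary reads:
    a trailing partial line is buffered until the next read (or EOF)."""
    pending = None
    for chunk in chunks:  # stands in for iter_content(chunk_size=...)
        if pending is not None:
            chunk = pending + chunk
        lines = chunk.splitlines()
        # If the chunk does not end with a newline, hold the last
        # piece back; it may be continued by the next chunk.
        if lines and lines[-1] and chunk and lines[-1][-1] == chunk[-1]:
            pending = lines.pop()
        else:
            pending = None
        yield from lines
    if pending is not None:
        yield pending

# Reads that split mid-line delay delivery: with ["a", "\r\nb\r\nc"],
# "a" is only emitted once the second read arrives, and the final
# unterminated "c" only appears at end of stream.
print(list(iter_lines_sketch(["a", "\r\nb\r\nc"])))  # ['a', 'b', 'c']
print(list(iter_lines_sketch(["a\r\n", "b\r\nc\r\n"])))  # ['a', 'b', 'c']
```

This shows why no data is lost overall, yet the moment a given line is printed depends entirely on where the reads happen to fall.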


haocs commented Sep 20, 2016

Thanks @Lukasa
However, setting chunk_size to 1 or None did not change the results in my case. It seems my issue is related to https://github.com/kennethreitz/requests/issues/2020 . Requests somehow handles chunked encoding differently than curl does.
The following example shows different results GET from my log-server using curl and requests.

# curl --raw -v 'https://some-url.com'
< Transfer-Encoding: chunked
< Content-Type: text/plain; charset=utf-8

4f
2016-09-20T09:56:36  Welcome, you are now connected to log-streaming service.

37
PORT is:\\\\.\\pipe\\xxxxxx-xxxx-xxxx-xxxx-xxxxxxxx

11
Something wrong

12
Something normal
# r.iter_content(chunk_size=None, decode_unicode=False)

b'2016-09-20T10:12:09  Welcome, you are now connected to log-streaming service.'
b'\r\nPORT is:\\\\.\\pipe\\xxxxxx-xxxx-xxxx-xxxx-xxxxxxxx\r'
b'\nSomething normal'
b'\r\nSomething wrong\r'
b'\n'
b'PORT is:\\\\.\\pipe\\xxxxxx-xxxx-xxxx-xxxx-xxxxxxxx\r\n'
b'Something wrong\r\nSomething '
b'normal\r\n'
b'PORT is:\\\\.\\pipe\\xxxxxx-xxxx-xxxx-xxxx-xxxxxxxx\r\nSomething '
b'wrong\r\nSomething '

We can see that iter_content gets the correct data, including the CRLFs, but groups it into chunks differently. Since iter_lines internally calls iter_content, the lines are split differently as a result.


Lukasa commented Sep 20, 2016

I didn't realise you were getting chunked content. In that case, can you try the latest Requests with iter_content(None)? That should actually give you chunks.


haocs commented Sep 20, 2016

I tried with v2.11 but saw the same issue. Does iter_content chunk the data based on the chunk_size provided by the server?


Lukasa commented Sep 20, 2016

@haocs Only when using chunk_size=None.


haocs commented Sep 20, 2016

Even with chunk_size=None, the length of the content generated by iter_content differs from the chunk size sent by the server.

# r.iter_content(chunk_size=None)
[length of chunk] 0x4d
b'2016-09-20T21:41:52  Welcome, you are now connected to log-streaming service.'
[length of chunk] 0x38
b'\r\nPORT is:\\\\.\\pipe\\xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx\r'

######
#curl --raw
4f
2016-09-20T22:00:30  Welcome, you are now connected to log-streaming service.

37
PORT is:\\.\pipe\xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

The above snippet shows the two chunks fetched by requests and by curl from the server. It seems that requests did not handle the trailing CRLF (which is part of the chunk) properly. For example, the chunk size of the first chunk indicates 0x4F, but iter_content only received 0x4D bytes and added the \r\n to the beginning of the next chunk.


Lukasa commented Sep 21, 2016

I don't observe this problem when using https://mkcert.org/generate/, where Requests generates exactly the same chunk boundaries as curl. Can you please confirm that your server really is generating the exact same chunk boundaries in each case?

Can you also confirm for me that you ran your test on v2.11?


haocs commented Sep 21, 2016

Yes, I tested against v2.11.1. Requests works fine with https://mkcert.org/generate/.
One difference I noticed is that chunks from my testing server contain an explicit \r\n at the end of each line (and the length of that \r\n is included in the chunk length), whereas the mkcert.org ones don't. Will this cause any trouble for Requests when processing chunks?
BTW, my testing is running against an Azure Kudu server. If necessary, I can provide a testing account as well as repro steps.


Lukasa commented Sep 22, 2016

mkcert.org provides a \r\n at the end of each chunk too, because it's required to by RFC 7230 Section 4.1. That section says that a chunked body looks like this:

<chunk size in hex>\r\n
<chunk data, whose total length is == chunk size>\r\n

Note that the \r\n at the end is excluded from the chunk size. Does your output end each chunk with two \r\n sequences, one counted in the body and one that isn't? Because it's supposed to. The raw body above seems to be overcounting its chunk sizes by including the terminating CRLF in the chunk size, when it should not.
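The framing from RFC 7230 can be sanity-checked in a few lines of Python. Notably, the welcome line from the capture above is 79 bytes (0x4f) once its own trailing CRLF is counted as data, which matches the server's advertised size (a sketch, assuming the data itself ends in \r\n):

```python
def encode_chunk(data: bytes) -> bytes:
    # <size-in-hex>\r\n <data> \r\n -- the terminating CRLF is NOT
    # counted in the size; any CRLF embedded in the data IS.
    return b"%x\r\n" % len(data) + data + b"\r\n"

# The welcome line, with its own CRLF as part of the chunk data.
line = b"2016-09-20T09:56:36  Welcome, you are now connected to log-streaming service.\r\n"
print(hex(len(line)))                # 0x4f, matching the curl capture
print(encode_chunk(line)[:6])        # b'4f\r\n20' -- size line, then data
```

Under this reading, a chunk whose data ends in \r\n is legitimately followed by a second \r\n (the chunk terminator), which is the "two \r\n" pattern asked about above.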


haocs commented Sep 23, 2016

I understand that the \r\n terminating each chunk should not be counted in the chunk size. But the other \r\n, the one inside the data, should be, right? However, per my testing, requests ignored both \r\n sequences, if I understand correctly.
I implemented another request function using urllib3 and it performed the same as curl did, which makes me believe that requests skips the \r\n when iterating over content.
Please see the following results from urllib3 and requests.

# python req_urllib3.py
[len of chunk] 0x4f
b'2016-09-23T19:27:27  Welcome, you are now connected to log-streaming service.\r\n'
[len of chunk] 0x39
b'2016-09-23T19:28:27  No new trace in the past 1 min(s).\r\n'


# python req_requests.py
[len of chunk] 0x4d
b'2016-09-23T19:25:09  Welcome, you are now connected to log-streaming service.'
[len of chunk] 0x39
b'\r\n2016-09-23T19:26:09  No new trace in the past 1 min(s).'


Lukasa commented Sep 23, 2016

Requests uses urllib3 directly and performs no additional post processing in this case. iter_content(None) is identical to stream(None).


haocs commented Sep 24, 2016

What's the urllib3 version shipped with requests v2.11?


haocs commented Sep 24, 2016

OK, I could repro this "issue" with urllib3. If I use urllib3 and set accept_encoding=True, it gives me exactly the same output as Requests. It seems Requests sets the Accept-Encoding header by default when called via requests.get().


Lukasa commented Sep 24, 2016

So do you see the problem go away if you set headers={'Accept-Encoding': 'identity'}?


haocs commented Sep 26, 2016

Yes. After I set headers={'Accept-Encoding': 'identity'}, iter_content(chunk_size=None, decode_unicode=False) worked as expected.


Lukasa commented Sep 26, 2016

OK. This strongly suggests that the problem is in the way the server handles gzipping a chunked body. It would be very interesting to see the raw data stream, if possible.
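Why gzip changes the observed boundaries can be illustrated with the standard library alone: a streaming decompressor emits bytes at boundaries unrelated to how the compressed stream was cut up on the wire, so after transparent gzip decoding the pieces a client sees no longer line up with the server's chunk boundaries. A minimal simulation (the body text is illustrative):

```python
import gzip
import zlib

# The full body as it would appear after de-chunking with identity encoding.
body = (b"2016-09-23T19:27:27  Welcome, you are now connected to log-streaming service.\r\n"
        b"2016-09-23T19:28:27  No new trace in the past 1 min(s).\r\n")
compressed = gzip.compress(body)

# Feed the compressed stream to a streaming decompressor in two arbitrary
# pieces, as a client would while reading chunks off the wire.
d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS selects gzip framing
mid = len(compressed) // 2
out1 = d.decompress(compressed[:mid])
out2 = d.decompress(compressed[mid:]) + d.flush()

# No bytes are lost, but the split between out1 and out2 bears no
# relation to the original line or chunk boundaries.
assert out1 + out2 == body
```

This is consistent with the observation above that requesting `Accept-Encoding: identity` restores the expected one-chunk-per-read behavior.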


haocs commented Sep 30, 2016

Since I can observe the same problem using curl or urllib3 with gzip enabled, this is evidently not an issue with requests. Thank you very much for the help; closing the issue.

@haocs haocs closed this as completed Sep 30, 2016
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 8, 2021