Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large cassettes cause memory errors #187

Open
kratsg opened this issue Jul 1, 2020 · 7 comments
Open

Large cassettes cause memory errors #187

kratsg opened this issue Jul 1, 2020 · 7 comments

Comments

@kratsg
Copy link

kratsg commented Jul 1, 2020

 _____________________________ test_get_image_model _____________________________
 auth_client = <itkdb.client.Client object at 0x7fc9cba70970>
 tmpdir = local('/tmp/pytest-of-root/pytest-0/test_get_image_model0')
     def test_get_image_model(auth_client, tmpdir):
         with betamax.Betamax(
             auth_client, cassette_library_dir=itkdb.settings.ITKDB_CASSETTE_LIBRARY_DIR
         ) as recorder:
             recorder.use_cassette('test_images.test_get_image', record='none')
             image = auth_client.get(
                 'uu-app-binarystore/getBinaryData',
                 json={
                     'code': 'bc2eccc58366655352582970d3f81bf46f15a48cf0cb98d74e21463f1dc4dcb9'
                 },
             )
             assert isinstance(image, itkdb.models.Image)
             assert image.filename == 'PB6.CR2'
             assert image.format == 'cr2'
             temp = tmpdir.join("saved_image.cr2")
             nbytes = image.save(filename=temp.strpath)
 >           assert nbytes == 64603006
 tests/integration/test_image.py:34: 
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
 /usr/local/lib/python3.8/site-packages/betamax/recorder.py:72: in __exit__
     self.stop()
 /usr/local/lib/python3.8/site-packages/betamax/recorder.py:131: in stop
     self.betamax_adapter.eject_cassette()
 /usr/local/lib/python3.8/site-packages/betamax/adapter.py:50: in eject_cassette
     self.cassette.eject()
 /usr/local/lib/python3.8/site-packages/betamax/cassette/cassette.py:110: in eject
     self._save_cassette()
 /usr/local/lib/python3.8/site-packages/betamax/cassette/cassette.py:205: in _save_cassette
     self.sanitize_interactions()
 /usr/local/lib/python3.8/site-packages/betamax/cassette/cassette.py:179: in sanitize_interactions
     i.replace_all(self.placeholders, True)
 /usr/local/lib/python3.8/site-packages/betamax/cassette/interaction.py:69: in replace_all
     self.replace(*placeholder.unpack(serializing))
 /usr/local/lib/python3.8/site-packages/betamax/cassette/interaction.py:63: in replace
     self.replace_in_body(text_to_replace, placeholder)
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
 self = <betamax.cassette.interaction.Interaction object at 0x7fc9cbc0dee0>
 text_to_replace = '', placeholder = '<ACCESSCODE2>'
     def replace_in_body(self, text_to_replace, placeholder):
         for obj in ('request', 'response'):
             body = self.data[obj]['body']
             old_style = hasattr(body, 'replace')
             if not old_style:
                 body = body.get('string', '')
     
             if text_to_replace in body:
 >               body = body.replace(text_to_replace, placeholder)
 E               MemoryError
 /usr/local/lib/python3.8/site-packages/betamax/cassette/interaction.py:89: MemoryError

When a previously recorded cassette is very large, trying to replace causes an error. Any ideas how to get around this?

@kratsg
Copy link
Author

kratsg commented Jul 1, 2020

Personally, it would be helpful that cassettes that get ejected don't need to be saved. I think that's what is happening, even if it's meant to be read-only.

@sigmavirus24
Copy link
Collaborator

             if text_to_replace in body:
 >               body = body.replace(text_to_replace, placeholder)

This is treating the body as a string so we're doing str.replace here because that's worked for what has seemed to be every other use-case. We'd need a way to not have a giant string in memory (which we need to do to build the cassette unfortunately) and we'd need a way to in-place replace the string (which python doesn't allow because strings are immutable).

I'd be open to discussing how we can fix this, but I haven't spent much time on this library in a long time and I don't know of a quick solution here aside from loading the data into something like a StringIO and manipulating that (and then figuring out how to manipulate that with a JSONEncoder/Decoder)

@sigmavirus24
Copy link
Collaborator

Personally, it would be helpful that cassettes that get ejected don't need to be saved. I think that's what is happening, even if it's meant to be read-only.

In this case we're trying to look for and replace the placeholders you've defined. We do it every time because one most likely wants to have tests that appear to work and so a placeholder being substituted back in works well for pretty much everyone.

@kratsg
Copy link
Author

kratsg commented Jul 1, 2020

In this case we're trying to look for and replace the placeholders you've defined. We do it every time because one most likely wants to have tests that appear to work and so a placeholder being substituted back in works well for pretty much everyone.

So the only replacement I do here for placeholders is for bearer token. I would be open to a way to disable bearer token substitution when running these tests on a previously recorded cassette (assuming I do not want to create it again). [Since it's already replaced in the header as part of recording in the first place].

@kratsg
Copy link
Author

kratsg commented Jul 1, 2020

So there are two short-term options on the table here:

  1. edit the cassette to reduce the size (truncate it). This might work in the case where you don't actually care about the body contents, but just need to check workflow/integration tests.
  2. allow betamax to be configured to skip replacing placeholders manually

Here's how option 2 would work for pytest using pytest --no-placeholders from command line:

def pytest_addoption(parser):
    parser.addoption(
        "--no-placeholders", action="store_true", help="skip making placeholders"
    )   


def configure_betamax(no_placeholders):
    placeholders = { 
        'accessCode1': itkdb.settings.ITKDB_ACCESS_CODE1,
        'accessCode2': itkdb.settings.ITKDB_ACCESS_CODE2,
    }   

    def filter_bearer_token(interaction, current_cassette):
        # Exit early if the request did not return 200 OK because that's the
        # only time we want to look for Authorization-Token headers
        if interaction.data['response']['status']['code'] != 200:
            return

        headers = interaction.data['request']['headers']
        token = headers.get('Authorization')
        # If there was no token header in the request, exit
        if token is None:
            return

        # Otherwise, create a new placeholder so that when cassette is saved,
        # Betamax will replace the token with our placeholder.
        current_cassette.placeholders.append(
            betamax.cassette.cassette.Placeholder(
                placeholder='Bearer <ACCESS_TOKEN>', replace=token[0]
            )   
        )   

    betamax.Betamax.register_serializer(pretty_json.PrettyJSONSerializer)
    with betamax.Betamax.configure() as config:
        config.default_cassette_options['serialize_with'] = 'prettyjson'
        if not no_placeholders:
            config.before_record(callback=filter_bearer_token)
            for key, value in placeholders.items():
                config.define_cassette_placeholder(
                    '<{}>'.format(key.upper()), replace=value
                )   


def pytest_configure(config):
    configure_betamax(config.option.no_placeholders)

The long-term option would be to work with StringIO/BytesIO and iterate over line-by-line (or skip parsing the body if possible).

@kratsg
Copy link
Author

kratsg commented Jul 1, 2020

I have a happy workaround for now. In this case, I know when I'm getting binary data from the API I use, so i can do the following:

    # In cases where we get a large amount of binary data from the server, we'll truncate this.
    if 'uu-app-binarystore/getBinaryData' in interaction.data['request']['uri']:
        interaction.data['response']['body']['string'] = interaction.data['response'][
            'body'
        ]['string'][:1000]

to truncate the cassette automatically. This works pretty well -- since I don't care about the actual binary data content.

@kratsg kratsg closed this as completed Jul 1, 2020
@sigmavirus24 sigmavirus24 reopened this Jul 1, 2020
@sigmavirus24
Copy link
Collaborator

I'm re-opening this since I think it's a legitimate problem betamax might want to solve for.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants