safe_load with implicit_resolver slows down #439

Closed
AlexeyTrekin opened this issue Sep 17, 2020 · 1 comment · Fixed by #441


AlexeyTrekin commented Sep 17, 2020

When an implicit resolver is registered, every call to safe_load is slower than the previous one.
Without implicit resolvers, safe_load() takes the same time on every call.

pyyaml versions: '5.1.2', '5.3.1'
python 3.6.5, ubuntu 18.04

My output (code below, test data attached):

Run 1. 100 tries: 0.18349623680114746
Run 2. 100 tries: 0.2797386646270752
Run 3. 100 tries: 0.3783113956451416
Run 4. 100 tries: 0.47731661796569824
Run 5. 100 tries: 0.5768036842346191
Run 6. 100 tries: 0.6758952140808105
Run 7. 100 tries: 0.7751777172088623
Run 8. 100 tries: 0.8736646175384521
Run 9. 100 tries: 0.9729092121124268
Run 10. 100 tries: 1.0713348388671875

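from time import time
import re
import yaml

# Empty tag, as in the original report.
tag = ""
# Raw string so that \$ and \w reach the regex engine unchanged.
pattern = re.compile(r'.*?\${(\w+)}.*?')

# first=None registers the resolver for every possible first character.
yaml.SafeLoader.add_implicit_resolver(tag, pattern, None)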

path = './sample.txt'

for run in range(10):
    t0 = time()
    for _ in range(100):
        with open(path) as conf_data:
            res = yaml.safe_load(conf_data)

    t1 = time()
    print(f'Run {run+1}. 100 tries: {t1-t0}')

sample.txt

@psphicas
Contributor

That's a neat bug.

It seems to be triggered by the combination of using None in the add_implicit_resolver call, along with your specific data, which has potential first-letter-matches with existing implicit resolvers (in this case the leading f in foo* that looks like it could potentially be a boolean).

Not a fix, obviously, but s/foo/poo/ makes the problem go away. The same bad behavior would be expected for any scalar value that starts with one of these characters:

>>> "".join(yaml.Loader.yaml_implicit_resolvers.keys())
'yYnNtTfFoO-+0123456789.<~=!&*'

The problem seems to be here:

            if value == '':
                resolvers = self.yaml_implicit_resolvers.get('', [])
            else:
                resolvers = self.yaml_implicit_resolvers.get(value[0], [])
            resolvers += self.yaml_implicit_resolvers.get(None, [])
            for tag, regexp in resolvers:
                if regexp.match(value):
                    return tag

            # value = "foo"
            else:
                # value[0] = "f"
                resolvers = self.yaml_implicit_resolvers.get(value[0], [])
                # resolvers = [('tag:yaml.org,2002:bool', re.compile('^(?:yes|Yes|YES|no|No|NO\n                    |true|True|TRUE|false|False|FALSE\n                    |on|On|ON|off|Off|OFF)$', re.VERBOSE))]
            # XXX This is mutating yaml.SafeLoader.yaml_implicit_resolvers['f']
            resolvers += self.yaml_implicit_resolvers.get(None, [])
            # self.yaml_implicit_resolvers.get(None) returns the custom implicit resolver
            # The longer the list gets, the slower it gets
            for tag, regexp in resolvers:
                if regexp.match(value):
                    return tag

Every time safe_load is called, yaml.SafeLoader.yaml_implicit_resolvers['f'] grows by 6 elements, one for each of the six scalars beginning with f in the sample data. Because none of the regexes ever match, each lookup has to traverse the entire list.
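The aliasing behind this can be reproduced without PyYAML. The sketch below is a simplified model, not the library's code: `RESOLVERS` and `resolve_buggy` are hypothetical stand-ins for the class-level `yaml_implicit_resolvers` dict and the resolver lookup quoted above. It only shows that `+=` on a list fetched with `dict.get` extends the stored list in place.

```python
import re

# Hypothetical stand-in for the class-level resolver table (not PyYAML's real data).
RESOLVERS = {
    'f': [('bool', re.compile(r'^(?:false|False|FALSE)$'))],
    None: [('custom', re.compile(r'.*?\$\{(\w+)\}.*?'))],  # "wildcard" entry
}

def resolve_buggy(value):
    # dict.get returns the stored list object itself when the key exists.
    resolvers = RESOLVERS.get(value[0], [])
    # For lists, += extends in place, so RESOLVERS['f'] grows on every call.
    resolvers += RESOLVERS.get(None, [])
    for tag, regexp in resolvers:
        if regexp.match(value):
            return tag
    return 'str'

lengths = []
for _ in range(3):
    resolve_buggy('foo')
    lengths.append(len(RESOLVERS['f']))

print(lengths)  # [2, 3, 4]
```

A value such as `bar` does not trigger the growth in this model, because `'b'` has no entry and `dict.get` falls back to a fresh empty list each time.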

config:
    foo: bar
    bars:
    -   foo1: bar1
        foo2: bar2
        baz:
            foo3: bar3
            foo6:
                foo7: bar7

psphicas added a commit to psphicas/pyyaml that referenced this issue Sep 18, 2020
Repeated calls to `resolve` can experience performance degradation, if
`add_implicit_resolver` has been called with `first=None` (to add an
implicit resolver with an unspecified first character).

For example, every time `foo` is encountered, the "wildcard implicit
resolvers" (with `first=None`) will be appended to the list of implicit
resolvers for strings starting with `f`, which will normally be the
resolver for booleans. The list `yaml_implicit_resolvers['f']` will keep
getting longer. The same behavior applies for any first-letter matches
with existing implicit resolvers.

This change avoids unintentionally mutating the lists in the class-level
dict `yaml_implicit_resolvers` by looping through a temporary copy.

Fixes: yaml#439
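
One way to avoid the mutation described in this commit message is to build a new list for the loop instead of extending the stored one. The following is a minimal sketch under that assumption, using a hypothetical `RESOLVERS` dict and `resolve_fixed` function rather than the actual PyYAML patch:

```python
import re

# Hypothetical stand-in for the class-level resolver table (not PyYAML's real data).
RESOLVERS = {
    'f': [('bool', re.compile(r'^(?:false|False|FALSE)$'))],
    None: [('custom', re.compile(r'.*?\$\{(\w+)\}.*?'))],  # "wildcard" entry
}

def resolve_fixed(value):
    if value == '':
        resolvers = RESOLVERS.get('', [])
    else:
        resolvers = RESOLVERS.get(value[0], [])
    wildcard = RESOLVERS.get(None, [])
    # list + list creates a temporary list; the stored lists are left untouched.
    for tag, regexp in resolvers + wildcard:
        if regexp.match(value):
            return tag
    return 'str'

for _ in range(5):
    resolve_fixed('foo')

print(len(RESOLVERS['f']))       # 1
print(resolve_fixed('${HOME}'))  # custom
```

The temporary list costs one small allocation per lookup, but the table keeps a constant size no matter how many times the function is called.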
ingydotnet pushed a commit that referenced this issue Jan 13, 2021