Issuing more than 500 performance queries on 500 image dataset fails #10

psyhtest opened this issue Apr 17, 2020 · 0 comments

I've been using a dataset with 500 images for testing:

$ ck install package --tags=dataset,imagenet,2012,val,min --no_tags=resized

and observed that setting the target QPS parameter to 8 or above, e.g.:

$ export NPROCS=`grep -c processor /proc/cpuinfo`
$ ck run program:mlperf-inference-v0.5 --skip_print_timers --dep_add_tags.dataset=min \
--cmd_key=image-classification --env.CK_LOADGEN_TASK=image-classification \
--env.CK_LOADGEN_SCENARIO=Offline --env.CK_LOADGEN_MODE=Performance  \
--env.CK_OPENVINO_NTHREADS=$NPROCS --env.CK_OPENVINO_NSTREAMS=$NPROCS \
--env.CK_OPENVINO_NIREQ=$NPROCS --env.CK_LOADGEN_TARGET_QPS=8

results in a segmentation fault:

./tmp-EEKtXG.sh: line 38: 123041 Segmentation fault      (core dumped) ./Release/ov_mlperf
  --scenario ${CK_LOADGEN_SCENARIO} --mode ${CK_LOADGEN_MODE}
  --mlperf_conf_filename ${CK_LOADGEN_MLPERF_CONF} --user_conf_filename ${CK_LOADGEN_USER_CONF}
  --total_sample_count ${CK_LOADGEN_DATASET_SIZE} --data_path ${CK_ENV_DATASET_IMAGENET_VAL}
  --dataset imagenet --device ${CK_OPENVINO_DEVICE}
  --model_path ${CK_ENV_OPENVINO_MODEL_XML} --model_name ${CK_OPENVINO_MODEL_NAME}
  --nireq ${CK_OPENVINO_NIREQ} --nstreams ${CK_OPENVINO_NSTREAMS} --nthreads ${CK_OPENVINO_NTHREADS}
  --nwarmup_iters ${CK_OPENVINO_NWARMUP_ITERS} --batch_size ${CK_BATCH_SIZE} > stdout.log 2> stderr.log

and subsequently in a Python exception due to corrupted log files (in particular, tmp/mlperf_log_accuracy.json contains only [):

--------------------------------
Traceback (most recent call last):
  File "/home/anton/CK/ck/kernel.py", line 10820, in <module>
    r=access(sys.argv[1:])
  File "/home/anton/CK/ck/kernel.py", line 10776, in access
    rr=perform_action(i)
  File "/home/anton/CK/ck/kernel.py", line 4126, in perform_action
    return a(i)
  File "/home/anton/CK_REPOS/ck-autotuning/module/program/module.py", line 3571, in run
    run_output_dict = process(i)
  File "/home/anton/CK_REPOS/ck-autotuning/module/program/module.py", line 182, in process
    r=process_in_dir(ii)
  File "/home/anton/CK_REPOS/ck-autotuning/module/program/module.py", line 3042, in process_in_dir
    rxx=cs.ck_postprocess(ii)
  File "/home/anton/CK_REPOS/ck-mlperf/script/image-classification/loadgen_postprocess.py", line 34, in ck_postprocess
    mlperf_log_dict['accuracy'] = json.load(accuracy_file)
  File "/usr/lib/python2.7/json/__init__.py", line 291, in load
    **kw)
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 1 column 1 (char 0)
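
Incidentally, the postprocess step could fail more gracefully here. A minimal defensive sketch (hypothetical; the real loadgen_postprocess.py may be structured differently) that reports a truncated accuracy log instead of raising a bare ValueError:

import json
import os

def load_accuracy_log(path):
    # Parse tmp/mlperf_log_accuracy.json, reporting truncation clearly.
    if not os.path.exists(path) or os.path.getsize(path) <= 1:
        raise RuntimeError('%s is missing or truncated; '
                           'the SUT probably crashed mid-run' % path)
    with open(path) as accuracy_file:
        try:
            return json.load(accuracy_file)
        except ValueError:  # json.JSONDecodeError on Python 3
            raise RuntimeError('%s contains invalid JSON; '
                               'the SUT probably crashed mid-run' % path)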

With the target QPS set to 7, mlperf_log_summary.txt contains, e.g.:

================================================
MLPerf Results Summary
================================================
SUT name : SUT^@
Scenario : Offline
Mode     : Performance
Samples per second: 77.7004
Result is : INVALID
  Min duration satisfied : NO
  Min queries satisfied : Yes
Recommendations:
 * Increase expected QPS so the loadgen pre-generates a larger (coalesced) query.

================================================
Additional Stats
================================================
Min latency (ns)                : 5945914094
Max latency (ns)                : 5945914094
Mean latency (ns)               : 5945914094
50.00 percentile latency (ns)   : 5945914094
90.00 percentile latency (ns)   : 5945914094
95.00 percentile latency (ns)   : 5945914094
97.00 percentile latency (ns)   : 5945914094
99.00 percentile latency (ns)   : 5945914094
99.90 percentile latency (ns)   : 5945914094

================================================
Test Parameters Used
================================================
samples_per_query : 462
target_qps : 7
target_latency (ns): 0
max_async_queries : 1
min_duration (ms): 60000
max_duration (ms): 0
min_query_count : 1
max_query_count : 0
qsl_rng_seed : 3133965575612453542
sample_index_rng_seed : 665484352860916858
schedule_rng_seed : 3622009729038561421
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
print_timestamps : false
performance_issue_unique : false
performance_issue_same : false
performance_issue_same_index : 0
performance_sample_count : 1024

samples_per_query gets calculated as target_qps * 60 * 1.1 (the 60-second minimum duration plus a 10% margin). When target_qps=7, samples_per_query=462 as above, which still fits within the 500-image dataset. When target_qps=8, samples_per_query=528, which exceeds the 500 loaded samples and explains the segmentation fault.
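
A quick check of the arithmetic (a sketch; loadgen's exact rounding may differ):

MIN_DURATION_S = 60   # min_duration (ms): 60000
OVERSHOOT = 1.1       # pre-generation margin
DATASET_SIZE = 500    # images in the min ImageNet validation set

for target_qps in (7, 8):
    samples_per_query = int(round(target_qps * MIN_DURATION_S * OVERSHOOT))
    print(target_qps, samples_per_query, samples_per_query <= DATASET_SIZE)
# 7 -> 462, fits; 8 -> 528, exceeds the 500 loaded samples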

However, rather than segfaulting, a better approach would be to load the 500 images and process some of them more than once, as sketched below.
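
One way to achieve that (an illustrative sketch, not the harness's actual code) is to wrap sample indices modulo the dataset size when building the coalesced Offline query:

def build_offline_query(samples_per_query, dataset_size):
    # When samples_per_query exceeds dataset_size, wrap around so that
    # some samples are issued more than once instead of indexing past
    # the end of the loaded dataset (which currently segfaults).
    return [i % dataset_size for i in range(samples_per_query)]

indices = build_offline_query(528, 500)
assert max(indices) == 499               # never beyond the 500 images
assert indices[500:] == list(range(28))  # first 28 samples reused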
