.. _pypy_warmups: ++++++++++++++++++++++++++++++++++++++++++++++++ Analyze of PyPy warmup in performance benchmarks ++++++++++++++++++++++++++++++++++++++++++++++++ .. contents:: :depth: 2 Goal ==== The purpose of this analysis if to choose the number of warmups and samples for `performance benchmarks `_ for PyPy 5.7 on the speed-python server. * small enough to reduce total benchmark runtime * get a mean-stdev of the sample close enough to mean-stdev of all values * if performance has a cycle (usually with a spike at the start or end): compute 5 cycles to use the average of a cycle General notes ============= The analysis is based on a single data set: `pypy2_571_warmups.json.gz `_. * PyPy2 5.7.1 (revision 1aa2d8e03cdf): 64-bit, static binary * perf 1.2 (dev), performance 0.5.5 (dev) * performance hacked to run exactly 10 worker processes, each computes 250 values with 0 warmup * Date: 2017-04-13 22:19 - 2017-04-14 09:01 (10h45) * 67 benchmarks After computing these data, the following 4 microbenchmarks have been removed from performance benchmark suite and so will be ignored here. * call_method * call_method_slots * call_method_unknown * call_simple speed-python server: * cpu: 2 HP DL380 G7 Intel® Xeon® X5680 (3.33GHz/6-core/130W/12MB) FIO Processor Kit * memory: 4x 4GB (1x4GB) Dual Rank x4 PC3-10600 (DDR3-1333) Registered CAS-9 Memory Kit * OS: Ubuntu 16.04.1 LTS * Kernel: GNU/Linux 4.4.0-47-generic Script used to get the number of loops computed by perf:: import perf suite = perf.BenchmarkSuite.load('pypy.json.gz') for bench in suite: loops = [run._get_loops() for run in bench.get_runs() if run.values] if len(set(loops)) != 1: raise Exception(bench) loops = loops[0] print("%s: loops=%s" % (bench.get_name(), loops)) Must read paper: `Virtual Machine Warmup Blows Hot and Cold `_ (december 2016). See also the `R changepoint package `_. Glossary ======== * Loops: number of outer loop iteration, number of times the benchmark function is called to compute one "value". perf calibrates the number of iterations to get a loop taking at least 100 ms. * Value: Average wall-clock time elapsed to call the number *loops* times. Pseudo-code:: t0 = perf_counter() for _ in range(loops): func() dt = perf_counter() value = float(dt) / loops * Warmups: number of ignored values at the beginning of a benchmark. The number is chosen to ignore first values were are usually much higher than the mean of the "steady state" * Run: execution of one worker processes which computed values. In this analysis, each worker process computed exactly 250 values. Example: a result with 10 runs, 8 values, 0 warmups and 250 values means that the benchmark function was called 10x8x250 = 20,000 times in total, and 8x250 = 2,000 times per process, but the result only contains 10x250 = 2,500 values in total (or 250 values per process) since values are average time per loop iteration. 2to3: loops=1, warmups=0, values=30 =================================== * many spikes up to +16% of the mean * no cycle * moving average(50) increases from 1.816 sec to 1.824 sec Overall: .. image:: pypy_warmups/2to3.png Moving average (50 values) of runs: .. image:: pypy_warmups/2to3_moving_avg50.png chameleon: loops=32, warmups=6, values=10 ========================================= Overall: .. image:: pypy_warmups/chameleon.png Short cycle of 2 values (skip first 6 values, limit to 30 values): .. image:: pypy_warmups/chameleon_short_cycle.png Large cycle of 42 values (Moving average 1 value, skip first 6 values): .. image:: pypy_warmups/chameleon_large_cycle.png Mean: * loops=32, warmups=6, values=10: 7.05 ms +- 0.23 ms * loops=32, warmups=6, values=42: 7.02 ms +- 0.20 ms * LIMIT: loops=32, warmups=6: 7.01 ms +- 0.20 ms chaos: loops=16, warmups=20, values=25 ======================================= Overall: .. image:: pypy_warmups/chaos.png Cycle of 5 values (average of runs, skip 20, limit to 50): .. image:: pypy_warmups/chaos_cycle.png Glitch at values 35..38. Mean: * loops=16, warmups=20, values=25: 6.50 ms +- 0.22 ms * LIMIT: warmups=20: 6.48 ms +- 0.22 ms crypto_pyaes: loops=16, warmups=46, values=45 ============================================= Overall: .. image:: pypy_warmups/crypto_pyaes.png Cycle of 9 values (average of runs, skip 46, limit to 45): .. image:: pypy_warmups/crypto_pyaes_cycle.png Mean: * loops=16, warmups=46, values=45: 9.81 ms +- 0.24 ms * LIMIT: loops=16, warmups=46: 9.81 ms +- 0.24 ms deltablue: loops=256, warmups=14, values=55 =========================================== Overall: .. image:: pypy_warmups/deltablue.png Cycle of 11 values (average of runs, skip 14, limit to 55): .. image:: pypy_warmups/deltablue_cycle.png Mean: * loops=256, warmups=14, values=55: 441 us +- 20 us * LIMIT: loops=256, warmups=14: 440 us +- 19 us django_template: loops=4, warmups=12, values=36 =============================================== Overall: .. image:: pypy_warmups/django_template.png Cycle of 7.3 values (average of runs, skip 12, limit to 36): .. image:: pypy_warmups/django_template_cycle.png Mean: * loops=4, warmups=12, values=36: 36.9 ms +- 1.6 ms * loops=4, warmups=12, values=73: 36.9 ms +- 1.6 ms * LIMIT: loops=4, warmups=12: 36.9 ms +- 1.6 ms dulwich_log: loops=2, warmups=21, values=18 =========================================== Overall: .. image:: pypy_warmups/dulwich_log.png Cycle of 3.6 values (average of runs, skip 21, limit to 18): .. image:: pypy_warmups/dulwich_log_cycle.png Mean: * loops=2, warmups=21, values=18: 98.6 ms +- 4.7 ms * LIMIT: loops=2, warmups=21: 96.8 ms +- 4.4 ms fannkuch: loops=1, warmups=59, values=40 ======================================== Overall: .. image:: pypy_warmups/fannkuch.png Moving average of 25 values (skip 59), very small absolute variation (see the Y scale): .. image:: pypy_warmups/fannkuch_moving_avg25.png Long cycle of 40 values. Not easy to see using moving average, spikes depend on the width of the moving window. Mean: * loops=1, warmups=59, values=40: 171 ms +- 1 ms * LIMIT: loops=1, warmups=59: 171 ms +- 1 ms float: loops=4, warmups=25, values=40 ===================================== Use the suboptimal step 2 to reduce total benchmark runtime, even if the step 3 is a little bit faster. Overall: .. image:: pypy_warmups/float.png Step 2, after warmup (average of runs, skip 25, limit 40): .. image:: pypy_warmups/float_step2.png Step 3: cycle of 16 values (average of runs, skip 119, limit 80): .. image:: pypy_warmups/float_step3.png Mean: * Step 2: loops=4, warmups=25, values=40: 41.0 ms +- 0.4 ms * Step 3: loops=4, warmups=119: 39.9 ms +- 1.6 ms genshi_text: loops=8, BUG! ========================== BUG! 19 ms at value 0 => 92 ms at value 250, steady slowdown! Overall: .. image:: pypy_warmups/genshi_text.png genshi_xml: loops=2, BUG! ========================= BUG! 70 ms at value 0 => 200 ms at value 250, steady slowdown! Overall: .. image:: pypy_warmups/genshi_xml.png go: loops=2, warmups=87, values=80 ================================== Overall: .. image:: pypy_warmups/go.png Cycle of 5 values (average of runs, skip 87, limit 80): .. image:: pypy_warmups/go_cycle.png Step 2 after warmup (skip 87): .. image:: pypy_warmups/go_warmup.png Contiguous optimization (moving average of 50 values, skip 87), but only minor optimization (look at the Y scale): .. image:: pypy_warmups/go_moving_avg50.png Mean: * loops=2, warmups=87, values=32: 87.4 ms +- 4.9 ms * loops=2, warmups=87, values=80: 87.3 ms +- 5.0 ms * LIMIT: loops=2, warmups=87: 87.2 ms +- 5.0 ms hexiom: loops=64, warmups=36, values=50 ======================================= Only compute 2 cycles instead of 5 to limit the benchmark total runtime, since the cycle of long (25 values). Overall: .. image:: pypy_warmups/hexiom.png Cycle of 25 values (average of runs, skip 36 , limit 127): .. image:: pypy_warmups/hexiom_cycle.png Mean: * loops=64, warmups=36, values=50: 2.32 ms +- 0.04 ms * LIMIT: loops=64, warmups=36: 2.33 ms +- 0.05 ms hg_startup: loops=1, warmups=4, values=10 ========================================= Overall: .. image:: pypy_warmups/hg_startup.png Step 2 (skip 4, limit 10): .. image:: pypy_warmups/hg_startup_step2.png Mean: * loops=1, warmups=4, values=10: 243 ms +- 1 ms * LIMIT: loops=1, warmups=4: 243 ms +- 1 ms html5lib: loops=2, warmups=50, values=50 ======================================== On 250 values, it seems like PyPy optimizes the code multiple times. Tere are at least 3 steps: * Warmup: 0..56 * Step 1: 57..158 * Step 2: 159..249 Use the step 1 which is suboptimal to reduce the benchmark total runtime. Overall: .. image:: pypy_warmups/html5lib.png Moving average of 25 values: .. image:: pypy_warmups/html5lib_moving_avg25.png Cycle (runs average, skip 57, limit 50): .. image:: pypy_warmups/html5lib_cycle.png Mean: * loops=2, warmups=50, values=50: **63.8 ms +- 3.6 ms** * LIMIT: loops=2, warmups=50: **63.3 ms +- 3.5 ms** json_dumps: loops=16, warmups=38, values=25 =========================================== Overall: .. image:: pypy_warmups/json_dumps.png Cycle of 5 values (runs average, skip 38, limit 50): .. image:: pypy_warmups/json_dumps_cycle.png Mean: * Sample (5 cycles): loops=16, warmups=38, values=25: **7.38 ms +- 0.15 ms** * Sample (10 cycles): loops=16, warmups=38, values=50: **7.39 ms +- 0.16 ms** * LIMIT: loops=16, warmups=38: **7.39 ms +- 0.16 ms** json_loads: loops=256, warmups=27, values=50 ============================================ Overall, +16% spike at value 90, +10% spike at value 190: .. image:: pypy_warmups/json_loads.png Step 2 (skip 27, limit 50) with a pike at value 51: .. image:: pypy_warmups/json_loads_step2.png Mean: * Sample: loops=256, warmups=27, values=50: **28.4 us +- 0.2 us** * Limit: loops=256, warmups=27: **28.4 us +- 0.4 us** TODO ==== * logging_format: loops=2048 * logging_silent: loops=134217728 * logging_simple: loops=4096 * mako: loops=8 * meteor_contest: loops=2 * nbody: loops=4 * nqueens: loops=4 * pathlib: loops=8 * pickle: loops=64 * pickle_dict: loops=64 * pickle_list: loops=256 * pickle_pure_python: loops=128 * pidigits: loops=1 * pyflate: loops=1 * python_startup: loops=8 * python_startup_no_site: loops=8 * raytrace: loops=8 * regex_compile: loops=2 * regex_dna: loops=1 * regex_effbot: loops=4 * regex_v8: loops=1 * richards: loops=64 * scimark_fft: loops=16 * scimark_lu: loops=64 * scimark_monte_carlo: loops=32 * scimark_sor: loops=128 * scimark_sparse_mat_mult: loops=1024 * spambayes: loops=4 * spectral_norm: loops=16 * sqlalchemy_declarative: loops=1 * sqlalchemy_imperative: loops=8 * sqlite_synth: loops=32768 * sympy_expand: loops=1 * sympy_integrate: loops=4 * sympy_sum: loops=2 * sympy_str: loops=1 * telco: loops=8 * tornado_http: loops=2 * unpack_sequence: loops=2048 * unpickle: loops=128 * unpickle_list: loops=1024 * unpickle_pure_python: loops=128 * xml_etree_parse: loops=1 * xml_etree_iterparse: loops=1 * xml_etree_generate: loops=2 * xml_etree_process: loops=2