Tuning High Performance Convolution on NVIDIA GPUs

Author: Lianmin Zheng

This is an advanced tutorial for writing high performance tunable template for NVIDIA GPU. By running auto-tuner on this template, we can outperform the vendor provided library CuDNN in many cases.

Install dependencies

To use autotvm package in tvm, we need to install some extra dependencies. (change “3” to “2” if you use python2):

pip3 install --user psutil xgboost tornado

To make TVM run faster in tuning, it is recommended to use cython as FFI of tvm. In the root directory of tvm, execute

pip3 install --user cython
sudo make cython3

Now return to python code. Import packages.

import logging
import sys
import numpy as np

import tvm
import topi
from topi.testing import conv2d_nchw_python

from tvm import autotvm

Step 1: Define the search space

There are plenty of useful schedule primitives in tvm. You can also find some tutorials that describe them in more details, such as (1). How to optimize convolution on GPU (2). Optimizing DepthwiseConv on NVIDIA GPU

However, their implementations are manually tuned for some special input shapes. In this section, we build a large enough space to cover the techniques used in these tutorials. Then we rely on the efficient auto-tuner to search through this space and pick some good configurations.

If you are familiar with writing cuda schedule, you can find the following template is very general. Actually this template can be easily modified to tune other operators such as depthwise convolution and gemm. In order to fully understand this template, you should be familiar with the schedule primitives and auto tuning API. You can refer to the above tutorials and autotvm tutorial

It is worth noting that the search space for a conv2d operator can be very large (at the level of 10^9 for some input shapes)

@autotvm.template
def conv2d_no_batching(N, H, W, CO, CI, KH, KW, stride, padding):
    assert N == 1, "Only consider batch_size = 1 in this template"

    data = tvm.placeholder((N, CI, H, W), name='data')
    kernel = tvm.placeholder((CO, CI, KH, KW), name='kernel')
    conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1, out_dtype='float32')
    s = tvm.create_schedule([conv.op])

    ##### space definition begin #####
    n, f, y, x = s[conv].op.axis
    rc, ry, rx = s[conv].op.reduce_axis

    cfg = autotvm.get_config()
    cfg.define_split("tile_f", f, num_outputs=4)
    cfg.define_split("tile_y", y, num_outputs=4)
    cfg.define_split("tile_x", x, num_outputs=4)
    cfg.define_split("tile_rc", rc, num_outputs=3)
    cfg.define_split("tile_ry", ry, num_outputs=3)
    cfg.define_split("tile_rx", rx, num_outputs=3)
    cfg.define_knob("auto_unroll_max_step", [0, 512, 1500])
    cfg.define_knob("unroll_explicit", [0, 1])
    ##### space definition end #####

    # inline padding
    pad_data = s[conv].op.input_tensors[0]
    s[pad_data].compute_inline()
    data, raw_data = pad_data, data

    output = conv
    OL = s.cache_write(conv, 'local')

    # create cache stage
    AA = s.cache_read(data, 'shared', [OL])
    WW = s.cache_read(kernel, 'shared', [OL])
    AL = s.cache_read(AA, 'local', [OL])
    WL = s.cache_read(WW, 'local', [OL])

    # tile and bind spatial axes
    n, f, y, x = s[output].op.axis
    bf, vf, tf, fi = cfg["tile_f"].apply(s, output, f)
    by, vy, ty, yi = cfg["tile_y"].apply(s, output, y)
    bx, vx, tx, xi = cfg["tile_x"].apply(s, output, x)
    kernel_scope = n  # this is the scope to attach global config inside this kernel

    s[output].bind(bf, tvm.thread_axis("blockIdx.z"))
    s[output].bind(by, tvm.thread_axis("blockIdx.y"))
    s[output].bind(bx, tvm.thread_axis("blockIdx.x"))
    s[output].bind(vf, tvm.thread_axis("vthread"))
    s[output].bind(vy, tvm.thread_axis("vthread"))
    s[output].bind(vx, tvm.thread_axis("vthread"))
    s[output].bind(tf, tvm.thread_axis("threadIdx.z"))
    s[output].bind(ty, tvm.thread_axis("threadIdx.y"))
    s[output].bind(tx, tvm.thread_axis("threadIdx.x"))
    s[output].reorder(n, bf, by, bx, vf, vy, vx, tf, ty, tx, fi, yi, xi)
    s[OL].compute_at(s[output], tx)

    # tile reduction axes
    n, f, y, x = s[OL].op.axis
    rc, ry, rx = s[OL].op.reduce_axis
    rco, rcm, rci = cfg['tile_rc'].apply(s, OL, rc)
    ryo, rym, ryi = cfg['tile_rx'].apply(s, OL, ry)
    rxo, rxm, rxi = cfg['tile_ry'].apply(s, OL, rx)
    s[OL].reorder(rco, ryo, rxo, rcm, rym, rxm, rci, ryi, rxi, n, f, y, x)

    s[AA].compute_at(s[OL], rxo)
    s[WW].compute_at(s[OL], rxo)
    s[AL].compute_at(s[OL], rxm)
    s[WL].compute_at(s[OL], rxm)

    # cooperative fetching
    for load in [AA, WW]:
        n, f, y, x = s[load].op.axis
        fused = s[load].fuse(n, f, y, x)
        tz, fused = s[load].split(fused, nparts=cfg["tile_f"].size[2])
        ty, fused = s[load].split(fused, nparts=cfg["tile_y"].size[2])
        tx, fused = s[load].split(fused, nparts=cfg["tile_x"].size[2])
        s[load].bind(tz, tvm.thread_axis("threadIdx.z"))
        s[load].bind(ty, tvm.thread_axis("threadIdx.y"))
        s[load].bind(tx, tvm.thread_axis("threadIdx.x"))

    # tune unroll
    s[output].pragma(kernel_scope, 'auto_unroll_max_step', cfg['auto_unroll_max_step'].val)
    s[output].pragma(kernel_scope, 'unroll_explicit', cfg['unroll_explicit'].val)

    return s, [raw_data, kernel, conv]

Step 2: Search through the space

We pick the last layer on resnet as test case. Since our space is very large, XGBoostTuner is most suitable for our case. Here we only do 20 trials for demonstration. In practice, making 1000 trials usually can find some good kernels for this template

# logging config (for printing tuning log to screen)
logging.getLogger('autotvm').setLevel(logging.DEBUG)
logging.getLogger('autotvm').addHandler(logging.StreamHandler(sys.stdout))

# the last layer in resnet
N, H, W, CO, CI, KH, KW, strides, padding = 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)
task = autotvm.task.create(conv2d_no_batching,
                           args=(N, H, W, CO, CI, KH, KW, strides, padding),
                           target='cuda')
print(task.config_space)

# Use local gpu, measure 10 times for every config to reduce variance
# The timeout of compiling a program is 10 seconds, the timeout for running is 4 seconds
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(repeat=3, min_repeat_ms=100, timeout=4)
)

# Begin tuning, log records to file `conv2d.log`
# During tuning we will also try many invalid configs, so you are expected to
# see many error reports. As long as you can see non-zero GFLOPS, it is okay.
tuner = autotvm.tuner.XGBTuner(task)
tuner.tune(n_trial=20,
           measure_option=measure_option,
           callbacks=[autotvm.callback.log_to_file('conv2d.log')])

Out:

ConfigSpace (len=10454400, space_map=
   0 tile_f: Split(policy=factors, product=512, num_outputs=4) len=220
   1 tile_y: Split(policy=factors, product=7, num_outputs=4) len=4
   2 tile_x: Split(policy=factors, product=7, num_outputs=4) len=4
   3 tile_rc: Split(policy=factors, product=512, num_outputs=3) len=55
   4 tile_ry: Split(policy=factors, product=3, num_outputs=3) len=3
   5 tile_rx: Split(policy=factors, product=3, num_outputs=3) len=3
   6 auto_unroll_max_step: OtherOption([0, 512, 1500]) len=3
   7 unroll_explicit: OtherOption([0, 1]) len=2
)
Get devices for measurement successfully!
No: 1   GFLOPS: 0.00/0.00       result: MeasureResult(costs=(RuntimeError('Traceback (most recent call last):\n  [bt] (3) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fa0d3a0a521]\n  [bt] (2) /workspace/build/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::RPCModuleNode::WrapRemote(void*)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0x3b) [0x7fa0d3a348db]\n  [bt] (1) /workspace/build/libtvm.so(tvm::runtime::RPCSession::CallFunc(void*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, tvm::runtime::PackedFunc const*)+0x154) [0x7fa0d3a4f064]\n  [bt] (0) /workspace/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7fa0d3214112]\n  File "/workspace/src/runtime/rpc/rpc_session.cc", line 964\nTVMError: Check failed: code == RPCCode: :kReturn: code=4',),), error_no=4, all_cost=5.684202432632446, timestamp=1574315729.2855356)        [('tile_f', [-1, 4, 2, 64]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 1, 2]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],,None,425466
No: 2   GFLOPS: 141.45/141.45   result: MeasureResult(costs=(0.0016366552755102042,), error_no=0, all_cost=1.9480071067810059, timestamp=1574315730.3972821)    [('tile_f', [-1, 8, 4, 4]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],,None,7045398
No: 3   GFLOPS: 332.20/332.20   result: MeasureResult(costs=(0.0006968667336244541,), error_no=0, all_cost=1.7456409931182861, timestamp=1574315731.552171)     [('tile_f', [-1, 1, 16, 2]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 2, 8]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],,None,9198285
No: 4   GFLOPS: 0.00/332.20     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fa0d3a0a521]\n  [bt] (0) /workspace/build/libtvm.so(+0xb9ac6b) [0x7fa0d3a05c6b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 72, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 615, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.10788488388061523, timestamp=1574315725.013589)  [('tile_f', [-1, 8, 64, 1]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 8]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],,None,6880548
No: 5   GFLOPS: 0.00/332.20     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fa0d3a0a521]\n  [bt] (0) /workspace/build/libtvm.so(+0xb9ac6b) [0x7fa0d3a05c6b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 72, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 615, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.04163837432861328, timestamp=1574315732.2843256) [('tile_f', [-1, 1, 4, 8]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 1, 512]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],,None,4258689
No: 6   GFLOPS: 0.00/332.20     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fa0d3a0a521]\n  [bt] (0) /workspace/build/libtvm.so(+0xb9ac6b) [0x7fa0d3a05c6b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 72, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 615, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.0452728271484375, timestamp=1574315732.3798873)  [('tile_f', [-1, 4, 64, 2]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 2, 256]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],,None,2509856
No: 7   GFLOPS: 0.00/332.20     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fa0d3a0a521]\n  [bt] (0) /workspace/build/libtvm.so(+0xb9ac6b) [0x7fa0d3a05c6b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 72, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 615, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.04688143730163574, timestamp=1574315732.444488)  [('tile_f', [-1, 1, 256, 1]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 16, 1]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],,None,5437792
No: 8   GFLOPS: 0.00/332.20     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fa0d3a0a521]\n  [bt] (0) /workspace/build/libtvm.so(+0xb9ac6b) [0x7fa0d3a05c6b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 72, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 615, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.035955190658569336, timestamp=1574315732.5040557)        [('tile_f', [-1, 2, 2, 128]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 2, 64]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],,None,6746074
No: 9   GFLOPS: 0.00/332.20     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fa0d3a0a521]\n  [bt] (0) /workspace/build/libtvm.so(+0xb9ac6b) [0x7fa0d3a05c6b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 72, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 615, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.03940987586975098, timestamp=1574315733.6918416) [('tile_f', [-1, 2, 256, 1]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 128, 1]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],,None,7381933
No: 10  GFLOPS: 0.00/332.20     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fa0d3a0a521]\n  [bt] (0) /workspace/build/libtvm.so(+0xb9ac6b) [0x7fa0d3a05c6b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 72, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 615, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.07619357109069824, timestamp=1574315733.7865407) [('tile_f', [-1, 128, 1, 4]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],,None,9754467
No: 11  GFLOPS: 0.00/332.20     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fa0d3a0a521]\n  [bt] (0) /workspace/build/libtvm.so(+0xb9ac6b) [0x7fa0d3a05c6b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 72, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 615, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.04891777038574219, timestamp=1574315733.8552227) [('tile_f', [-1, 4, 2, 16]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 16, 2]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],,None,2375292
No: 12  GFLOPS: 0.00/332.20     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fa0d3a0a521]\n  [bt] (0) /workspace/build/libtvm.so(+0xb9ac6b) [0x7fa0d3a05c6b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 72, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 615, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.04194974899291992, timestamp=1574315733.9224305) [('tile_f', [-1, 4, 64, 1]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 4, 64]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],,None,4814747
No: 13  GFLOPS: 0.00/332.20     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fa0d3a0a521]\n  [bt] (0) /workspace/build/libtvm.so(+0xb9ac6b) [0x7fa0d3a05c6b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 72, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 615, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.0469365119934082, timestamp=1574315735.1085696)  [('tile_f', [-1, 2, 1, 128]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 256, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],,None,4677191
No: 14  GFLOPS: 0.00/332.20     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fa0d3a0a521]\n  [bt] (0) /workspace/build/libtvm.so(+0xb9ac6b) [0x7fa0d3a05c6b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 72, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 615, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.04914736747741699, timestamp=1574315735.1891172) [('tile_f', [-1, 2, 4, 16]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 1, 32]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],,None,5176116
No: 15  GFLOPS: 22.84/332.20    result: MeasureResult(costs=(0.0101360996,), error_no=0, all_cost=1.8690853118896484, timestamp=1574315737.5522187)     [('tile_f', [-1, 1, 1, 1]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 16, 1]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],,None,8533140
No: 16  GFLOPS: 0.00/332.20     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fa0d3a0a521]\n  [bt] (0) /workspace/build/libtvm.so(+0xb9ac6b) [0x7fa0d3a05c6b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 72, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 615, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.05137896537780762, timestamp=1574315736.1610243) [('tile_f', [-1, 4, 2, 32]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 32, 4]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],,None,6476332
No: 17  GFLOPS: 0.00/332.20     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fa0d3a0a521]\n  [bt] (0) /workspace/build/libtvm.so(+0xb9ac6b) [0x7fa0d3a05c6b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 72, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 615, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.0796654224395752, timestamp=1574315738.3326652)  [('tile_f', [-1, 4, 8, 16]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],,None,9754321
No: 18  GFLOPS: 0.00/332.20     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fa0d3a0a521]\n  [bt] (0) /workspace/build/libtvm.so(+0xb9ac6b) [0x7fa0d3a05c6b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 72, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 615, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.04342246055603027, timestamp=1574315738.3759074) [('tile_f', [-1, 1, 256, 1]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 1, 4]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],,None,5488392
No: 19  GFLOPS: 0.00/332.20     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fa0d3a0a521]\n  [bt] (0) /workspace/build/libtvm.so(+0xb9ac6b) [0x7fa0d3a05c6b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 72, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 615, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.08000826835632324, timestamp=1574315738.4491892) [('tile_f', [-1, 4, 8, 4]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],,None,10141683
No: 20  GFLOPS: 0.00/332.20     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (1) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fa0d3a0a521]\n  [bt] (0) /workspace/build/libtvm.so(+0xb9ac6b) [0x7fa0d3a05c6b]\n  File "/workspace/docs/../python/tvm/_ffi/_ctypes/function.py", line 72, in cfun\n    rv = local_pyfunc(*pyargs)\n  File "/workspace/docs/../python/tvm/autotvm/measure/measure_methods.py", line 615, in verify_pass\n    raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.04678916931152344, timestamp=1574315738.5154924) [('tile_f', [-1, 8, 2, 32]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 128]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],,None,5989253

Finally we can inspect the best config from log file, check correctness, and measure running time.

# inspect the best config
dispatch_context = autotvm.apply_history_best("conv2d.log")
best_config = dispatch_context.query(task.target, task.workload)
print("\nBest config:")
print(best_config)

# apply history best from log file
with autotvm.apply_history_best('conv2d.log'):
    with tvm.target.create("cuda"):
        s, arg_bufs = conv2d_no_batching(N, H, W, CO, CI, KH, KW, strides, padding)
        func = tvm.build(s, arg_bufs)

# check correctness
a_np = np.random.uniform(size=(N, CI, H, W)).astype(np.float32)
w_np = np.random.uniform(size=(CO, CI, KH, KW)).astype(np.float32)
c_np = conv2d_nchw_python(a_np, w_np, strides, padding)

ctx = tvm.gpu()
a_tvm = tvm.nd.array(a_np, ctx=ctx)
w_tvm = tvm.nd.array(w_np, ctx=ctx)
c_tvm = tvm.nd.empty(c_np.shape, ctx=ctx)
func(a_tvm, w_tvm, c_tvm)

tvm.testing.assert_allclose(c_np, c_tvm.asnumpy(), rtol=1e-2)

# Evaluate running time. Here we choose a large repeat number (400) to reduce the noise
# and the overhead of kernel launch. You can also use nvprof to validate the result.
evaluator = func.time_evaluator(func.entry_name, ctx, number=400)
print('Time cost of this operator: %f' % evaluator(a_tvm, w_tvm, c_tvm).mean)

Out:

Best config:
[('tile_f', [-1, 1, 16, 2]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 2, 8]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],,None,9198285
Time cost of this operator: 0.001152

Total running time of the script: ( 0 minutes 26.642 seconds)

Gallery generated by Sphinx-Gallery