Introduction
This article is translated from the English document Introduction to TOPI, written by Ehsan M. Kermani. More TVM documentation is available on the TVM Chinese site.
This is an introductory tutorial on the TVM Operator Inventory (TOPI). TOPI provides numpy-style generic operations and schedules at a higher level of abstraction than TVM's tensor expression API, and this tutorial shows how TOPI reduces boilerplate code in TVM.
import tvm
import tvm.testing
from tvm import te
from tvm import topi
import numpy as np
Basic example
Let's revisit the row-sum operation (equivalent to B = numpy.sum(A, axis=1)). To compute the row sum of a two-dimensional TVM tensor A, you have to specify both the symbolic computation and a schedule, as follows:
n = ("n") m = ("m") A = ((n, m), name="A") k = te.reduce_axis((0, m), "k") B = ((n,), lambda i: (A[i, k], axis=k), name="B") s = te.create_schedule()
To view a human-readable IR, run:
print(tvm.lower(s, [A], simple_mode=True))
Output:
@main = primfn(A_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {A: Buffer(A_2: Pointer(float32), float32, [(stride: int32*n: int32)], [], type="auto")}
  buffer_map = {A_1: A}
  preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [n, m: int32], [stride, stride_1: int32], type="auto")}
{
  allocate(B: Pointer(global float32), float32, [n]), storage_scope = global;
  for (i: int32, 0, n) {
    B_1: Buffer(B, float32, [n], [])[i] = 0f32
    for (k: int32, 0, m) {
      B_1[i] = (B_1[i] + A[((i*stride) + (k*stride_1))])
    }
  }
}
However, for such a common operation we had to define the reduce axis ourselves, as well as write the explicit computation with te.compute. Fortunately, these two lines can be replaced with a single call to topi.sum (similar to numpy.sum):
C = topi.sum(A, axis=1)
ts = te.create_schedule(C.op)
print(tvm.lower(ts, [A], simple_mode=True))
Output:
@main = primfn(A_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {A: Buffer(A_2: Pointer(float32), float32, [(stride: int32*n: int32)], [], type="auto")}
  buffer_map = {A_1: A}
  preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [n, m: int32], [stride, stride_1: int32], type="auto")}
{
  allocate(A_red: Pointer(global float32), float32, [n]), storage_scope = global;
  for (ax0: int32, 0, n) {
    A_red_1: Buffer(A_red, float32, [n], [])[ax0] = 0f32
    for (k1: int32, 0, m) {
      A_red_1[ax0] = (A_red_1[ax0] + A[((ax0*stride) + (k1*stride_1))])
    }
  }
}
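Other numpy-style TOPI reductions follow the same one-line pattern. As a minimal hedged sketch (these extra calls are not part of the original example):

D = topi.max(A, axis=1)  # row-wise maximum, analogous to numpy.max(A, axis=1)
E = topi.sum(A)          # reduce over all axes, analogous to numpy.sum(A)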
Numpy-style operator overloading
Two tensors with broadcastable shapes can be added with topi.broadcast_add. TOPI provides operator overloading for such common operations to make the code shorter. For example:
x, y = 100, 10
a = te.placeholder((x, y, y), name="a")
b = te.placeholder((y, y), name="b")
c = a + b  # equivalent to topi.broadcast_add
d = a * b  # equivalent to topi.broadcast_mul
With the same overloaded syntax, TOPI also broadcasts primitives (int, float) against tensors, as in d - 3.14.
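For instance, a minimal sketch of scalar broadcasting (this line is illustrative, not from the original code):

e = d - 3.14  # the Python float is broadcast against every element of d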
Generic schedules and operator fusion
We've seen how TOPI saves us from writing explicit computations with lower-level APIs, but the scheduling process so far is unchanged. TOPI also provides higher-level scheduling functions that depend on the given context. For example, for CUDA, a series of operations ending with topi.sum can be scheduled using only topi.cuda.schedule_reduce:
e = topi.elemwise_sum([c, d])
f = e / 2.0
g = topi.sum(f)
with tvm.target.cuda():
    sg = topi.cuda.schedule_reduce(g)
    print(tvm.lower(sg, [a, b], simple_mode=True))
Output:
/workspace/python/tvm/target/target.py:377: UserWarning: Try specifying cuda arch by adding 'arch=sm_xx' to your target.
  warnings.warn("Try specifying cuda arch by adding 'arch=sm_xx' to your target.")
@main = primfn(a_1: handle, b_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {a: Buffer(a_2: Pointer(float32), float32, [10000], []),
             b: Buffer(b_2: Pointer(float32), float32, [100], [])}
  buffer_map = {a_1: a, b_1: b}
  preflattened_buffer_map = {a_1: a_3: Buffer(a_2, float32, [100, 10, 10], []), b_1: b_3: Buffer(b_2, float32, [10, 10], [])}
{
  allocate(T_divide_red: Pointer(global float32), float32, [1]), storage_scope = global;
  attr [IterVar(threadIdx.x: int32, [0:1024], "ThreadIndex", "threadIdx.x")] "thread_extent" = 1024;
  allocate(T_divide_red.rf: Pointer(local float32), float32, [1]), storage_scope = local;
  allocate(reduce_temp0: Pointer(local float32), float32, [1]), storage_scope = local {
    T_divide_red.rf_1: Buffer(T_divide_red.rf, float32, [1], [], scope="local", align=4)[0] = 0f32
    for (k0.k1.fused.k2.fused.outer: int32, 0, 10) {
      if @tir.likely((((((k0.k1.fused.k2.fused.outer*64) + floordiv(threadIdx.x, 16)) < 625) && (((k0.k1.fused.k2.fused.outer*64) + floordiv(threadIdx.x, 16)) < 625)) && (((k0.k1.fused.k2.fused.outer*64) + floordiv(threadIdx.x, 16)) < 625)), dtype=bool) {
        T_divide_red.rf_1[0] = (T_divide_red.rf_1[0] + (((a[((k0.k1.fused.k2.fused.outer*1024) + threadIdx.x)] + b[((floordiv(floormod(((k0.k1.fused.k2.fused.outer*12) + floordiv(threadIdx.x, 2)), 50), 5)*10) + floormod(((k0.k1.fused.k2.fused.outer*4) + threadIdx.x), 10))]) + (a[((k0.k1.fused.k2.fused.outer*1024) + threadIdx.x)]*b[((floordiv(floormod(((k0.k1.fused.k2.fused.outer*12) + floordiv(threadIdx.x, 2)), 50), 5)*10) + floormod(((k0.k1.fused.k2.fused.outer*4) + threadIdx.x), 10))]))*0.5f32))
      }
    }
    attr [meta[tir.CommReducer][0]] "reduce_scope" = @tir.reinterpret(0u64, dtype=handle);
    @tir.tvm_thread_allreduce(1u32, T_divide_red.rf_1[0], True, reduce_temp0_1: Buffer(reduce_temp0, float32, [1], [], scope="local")[0], threadIdx.x, dtype=handle)
    if (threadIdx.x == 0) {
      T_divide_red_1: Buffer(T_divide_red, float32, [1], [], align=4)[0] = reduce_temp0_1[0]
    }
  }
}
As shown above, the scheduled stages of the computation accumulate; they can be inspected with:
print(sg.stages)
Output:
[stage(a, placeholder(a, 0x228afb00)),
 stage(b, placeholder(b, 0x22097c90)),
 stage(T_add, compute(T_add, body=[(a[ax0, ax1, ax2] + b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=broadcast, attrs={})),
 stage(T_multiply, compute(T_multiply, body=[(a[ax0, ax1, ax2]*b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=broadcast, attrs={})),
 stage(T_elemwise_sum, compute(T_elemwise_sum, body=[(T_add[ax0, ax1, ax2] + T_multiply[ax0, ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=elemwise, attrs={})),
 stage(T_divide, compute(T_divide, body=[(T_elemwise_sum[ax0, ax1, ax2]/2f)], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=elemwise, attrs={})),
 stage(T_divide_red.rf, compute(T_divide_red.rf, body=[reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[0f]), source=[T_divide[floordiv(floordiv((k0.k1.fused.k2.fused.inner + (k0.k1.fused.k2.fused.outer*1024)), 10), 10), floormod(floordiv((k0.k1.fused.k2.fused.inner + (k0.k1.fused.k2.fused.outer*1024)), 10), 10), floormod((k0.k1.fused.k2.fused.inner + (k0.k1.fused.k2.fused.outer*1024)), 10)]], init=[], axis=[iter_var(k0.k1.fused.k2.fused.outer, range(min=0, ext=10))], where=((((floordiv(floordiv((k0.k1.fused.k2.fused.inner + (k0.k1.fused.k2.fused.outer*1024)), 10), 10) < 100) && (floordiv((k0.k1.fused.k2.fused.inner + (k0.k1.fused.k2.fused.outer*1024)), 10) < 1000)) && ((k0.k1.fused.k2.fused.inner + (k0.k1.fused.k2.fused.outer*1024)) < 10000))), value_index=0)], axis=[iter_var(k0.k1.fused.k2.fused.inner, range(min=0, ext=1024))], reduce_axis=[iter_var(k0.k1.fused.k2.fused.outer, range(min=0, ext=10))], tag=, attrs={})),
 stage(T_divide_red, compute(T_divide_red.repl, body=[reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[0f]), source=[T_divide_red.rf[k0.k1.fused.k2.fused.inner.v]], init=[], axis=[iter_var(k0.k1.fused.k2.fused.inner.v, range(min=0, ext=1024))], where=(bool)1, value_index=0)], axis=[], reduce_axis=[iter_var(k0.k1.fused.k2.fused.inner.v, range(min=0, ext=1024))], tag=, attrs={}))]
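Each entry above is a te.Stage. As a hedged sketch (not part of the original tutorial), the stage list can also be walked programmatically, for example to print only the operation names:

for stage in sg.stages:
    print(stage.op.name)  # a, b, T_add, T_multiply, T_elemwise_sum, T_divide, ...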
The result can be verified by comparing it with the numpy computation, as follows:
func = tvm.build(sg, [a, b, g], "cuda")
dev = tvm.cuda(0)
a_np = np.random.uniform(size=(x, y, y)).astype(a.dtype)
b_np = np.random.uniform(size=(y, y)).astype(b.dtype)
g_np = np.sum(np.add(a_np + b_np, a_np * b_np) / 2.0)
a_nd = tvm.nd.array(a_np, dev)
b_nd = tvm.nd.array(b_np, dev)
g_nd = tvm.nd.array(np.zeros(g_np.shape, dtype=g_np.dtype), dev)
func(a_nd, b_nd, g_nd)
tvm.testing.assert_allclose(g_nd.numpy(), g_np, rtol=1e-5)
TOPI also provides common neural network operations, such as softmax, with optimized schedules:
tarray = te.placeholder((512, 512), name="tarray")
softmax_topi = topi.nn.softmax(tarray)
with tvm.target.Target("cuda"):
    sst = topi.cuda.schedule_softmax(softmax_topi)
    print(tvm.lower(sst, [tarray], simple_mode=True))
Output:
@main = primfn(tarray_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {tarray: Buffer(tarray_2: Pointer(float32), float32, [262144], [])}
  buffer_map = {tarray_1: tarray}
  preflattened_buffer_map = {tarray_1: tarray_3: Buffer(tarray_2, float32, [512, 512], [])}
{
  allocate(T_softmax_norm: Pointer(global float32x4), float32x4, [65536]), storage_scope = global;
  attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 512;
  allocate(normal_reduce_temp0: Pointer(local float32), float32, [1]), storage_scope = local;
  allocate(reduce_temp0: Pointer(local float32), float32, [1]), storage_scope = local;
  allocate(T_softmax_exp: Pointer(warp float32), float32, [512]), storage_scope = warp;
  allocate(normal_reduce_temp0_1: Pointer(local float32), float32, [1]), storage_scope = local;
  allocate(reduce_temp0_1: Pointer(local float32), float32, [1]), storage_scope = local {
    attr [IterVar(threadIdx.x: int32, [0:32], "ThreadIndex", "threadIdx.x")] "thread_extent" = 32 {
      normal_reduce_temp0_2: Buffer(normal_reduce_temp0, float32, [1], [], scope="local")[0] = -3.40282e+38f32
      for (k.inner: int32, 0, 16) {
        normal_reduce_temp0_2[0] = max(normal_reduce_temp0_2[0], tarray[(((blockIdx.x*512) + (threadIdx.x*16)) + k.inner)])
      }
      attr [meta[tir.CommReducer][0]] "reduce_scope" = @tir.reinterpret(0u64, dtype=handle);
      @tir.tvm_thread_allreduce(1u32, normal_reduce_temp0_2[0], True, reduce_temp0_2: Buffer(reduce_temp0, float32, [1], [], scope="local")[0], threadIdx.x, dtype=handle)
      for (i1.inner.outer: int32, 0, 4) {
        let cse_var_1: int32 = (i1.inner.outer*4)
        T_softmax_exp_1: Buffer(T_softmax_exp, float32, [512], [], scope="warp")[ramp(((threadIdx.x*16) + cse_var_1), 1, 4)] = @tir.exp((tarray[ramp((((blockIdx.x*512) + (threadIdx.x*16)) + cse_var_1), 1, 4)] - broadcast(reduce_temp0_3: Buffer(reduce_temp0, float32, [1], [], scope="local", align=4)[0], 4)), dtype=float32x4)
      }
    }
    attr [IterVar(threadIdx.x, [0:32], "ThreadIndex", "threadIdx.x")] "thread_extent" = 32 {
      normal_reduce_temp0_3: Buffer(normal_reduce_temp0_1, float32, [1], [], scope="local")[0] = 0f32
      for (k.inner_1: int32, 0, 16) {
        normal_reduce_temp0_3[0] = (normal_reduce_temp0_3[0] + T_softmax_exp_1[((threadIdx.x*16) + k.inner_1)])
      }
      attr [meta[tir.CommReducer][1]] "reduce_scope" = @tir.reinterpret(0u64, dtype=handle);
      @tir.tvm_thread_allreduce(1u32, normal_reduce_temp0_3[0], True, reduce_temp0_4: Buffer(reduce_temp0_1, float32, [1], [], scope="local")[0], threadIdx.x, dtype=handle)
      for (i1.inner.outer_1: int32, 0, 4) {
        T_softmax_norm_1: Buffer(T_softmax_norm, float32x4, [65536], [])[(((blockIdx.x*128) + (threadIdx.x*4)) + i1.inner.outer_1)] = (T_softmax_exp_1[ramp(((threadIdx.x*16) + (i1.inner.outer_1*4)), 1, 4)] / broadcast(reduce_temp0_5: Buffer(reduce_temp0_1, float32, [1], [], scope="local", align=4)[0], 4))
      }
    }
  }
}
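As with the reduction above, the softmax schedule can be checked against numpy. The following hedged sketch (not part of the original tutorial) reuses dev from earlier and assumes a CUDA device is available:

fsoftmax = tvm.build(sst, [tarray, softmax_topi], "cuda")
t_np = np.random.uniform(size=(512, 512)).astype("float32")
# numpy reference: row-wise softmax with max subtraction for numerical stability
ref = np.exp(t_np - t_np.max(axis=1, keepdims=True))
ref = ref / ref.sum(axis=1, keepdims=True)
t_nd = tvm.nd.array(t_np, dev)
out_nd = tvm.nd.array(np.zeros_like(t_np), dev)
fsoftmax(t_nd, out_nd)
tvm.testing.assert_allclose(out_nd.numpy(), ref, rtol=1e-5)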
Fusing convolutions
We can fuse topi.nn.conv2d and topi.nn.relu together.
Note:
TOPI functions are all generic functions, and different backends implement them in different ways to optimize for performance. For each backend, they must be called under a target scope for both the compute declaration and the schedule. TVM chooses the right function to call based on the target information.
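As a hedged illustration (not from the original tutorial), the target information that drives this dispatch can be inspected inside the scope:

with tvm.target.Target("cuda"):
    print(tvm.target.Target.current().kind.name)  # prints "cuda"

Returning to the fused convolution: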
data = te.placeholder((1, 3, 224, 224))
kernel = te.placeholder((10, 3, 5, 5))

with tvm.target.Target("cuda"):
    conv = topi.cuda.conv2d_nchw(data, kernel, 1, 2, 1)
    out = topi.nn.relu(conv)
    sconv = topi.cuda.schedule_conv2d_nchw([out])
    print(tvm.lower(sconv, [data, kernel], simple_mode=True))
Output:
@main = primfn(placeholder_2: handle, placeholder_3: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {placeholder: Buffer(placeholder_4: Pointer(float32), float32, [150528], []),
             placeholder_1: Buffer(placeholder_5: Pointer(float32), float32, [750], [])}
  buffer_map = {placeholder_2: placeholder, placeholder_3: placeholder_1}
  preflattened_buffer_map = {placeholder_2: placeholder_6: Buffer(placeholder_4, float32, [1, 3, 224, 224], []), placeholder_3: placeholder_7: Buffer(placeholder_5, float32, [10, 3, 5, 5], [])}
{
  allocate(compute: Pointer(global float32), float32, [501760]), storage_scope = global;
  attr [IterVar(blockIdx.z: int32, (nullptr), "ThreadIndex", "blockIdx.z")] "thread_extent" = 5;
  allocate(conv2d_nchw: Pointer(local float32), float32, [14]), storage_scope = local;
  allocate(pad_temp.shared: Pointer(shared float32), float32, [112]), storage_scope = shared;
  allocate(placeholder.shared: Pointer(shared float32), float32, [2]), storage_scope = shared;
  attr [IterVar(blockIdx.y: int32, (nullptr), "ThreadIndex", "blockIdx.y")] "thread_extent" = 224;
  attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 2;
  attr [IterVar(threadIdx.z: int32, (nullptr), "ThreadIndex", "threadIdx.z")] "thread_extent" = 1;
  attr [IterVar(threadIdx.y: int32, (nullptr), "ThreadIndex", "threadIdx.y")] "thread_extent" = 1;
  attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 16 {
    conv2d_nchw_1: Buffer(conv2d_nchw, float32, [4], [], scope="local", align=8)[0] = 0f32
    ...
    [long kernel body elided: for each of the 3 input channels and each of the 5 kernel rows,
     a padded input tile is staged into pad_temp.shared and two filter weights are staged into
     placeholder.shared, then the 14 local accumulators conv2d_nchw_1[0..13] are updated once
     per (unrolled) kernel column; the final stores into compute apply max(conv2d_nchw_1[i], 0f32),
     i.e. the ReLU is fused into the convolution kernel.]
}
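As before, the fused kernel can be validated numerically. This is a hedged sketch (not part of the original tutorial); it assumes a CUDA device and uses tvm.topi.testing.conv2d_nchw_python, the numpy reference implementation shipped with TVM's tests:

import tvm.topi.testing

fconv = tvm.build(sconv, [data, kernel, out], "cuda")
data_np = np.random.uniform(size=(1, 3, 224, 224)).astype("float32")
kernel_np = np.random.uniform(size=(10, 3, 5, 5)).astype("float32")
ref = tvm.topi.testing.conv2d_nchw_python(data_np, kernel_np, 1, 2)  # stride=1, padding=2
ref = np.maximum(ref, 0.0)  # apply the fused relu to the reference
data_nd = tvm.nd.array(data_np, dev)
kernel_nd = tvm.nd.array(kernel_np, dev)
out_nd = tvm.nd.array(np.zeros(ref.shape, dtype="float32"), dev)
fconv(data_nd, kernel_nd, out_nd)
tvm.testing.assert_allclose(out_nd.numpy(), ref, rtol=1e-4)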
Summary
This tutorial has shown the following:
- How to use the TOPI API for common operations with numpy-style operators.
- How TOPI facilitates generic schedules and operator fusion for a given context, to generate optimized kernel code.
Download Python source code: intro_topi.py
Download Jupyter Notebook: intro_topi.ipynb
That concludes this look at how TOPI reduces boilerplate in TVM code. For more on TOPI and TVM, please see my other related articles.