Introduction
This article is translated from the English document Introduction to TOPI, written by Ehsan M. Kermani. More TVM documentation is available on the TVM Chinese site.
This is an introductory tutorial on the TVM Operator Inventory (TOPI). TOPI provides numpy-style generic operations and schedules at a higher level of abstraction than TVM's tensor expression API, and this tutorial shows how TOPI reduces boilerplate code in TVM.
import tvm
import tvm.testing
from tvm import te
from tvm import topi
import numpy as np
Basic example
Let's revisit the row-sum operation (equivalent to B = numpy.sum(A, axis=1)). To compute the row sum of a two-dimensional TVM tensor A, you have to specify both the symbolic computation and a schedule, as follows:
n = ("n") m = ("m") A = ((n, m), name="A") k = te.reduce_axis((0, m), "k") B = ((n,), lambda i: (A[i, k], axis=k), name="B") s = te.create_schedule()
To view a human-readable IR, run:
print(tvm.lower(s, [A], simple_mode=True))
Output:
@main = primfn(A_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {A: Buffer(A_2: Pointer(float32), float32, [(stride: int32*n: int32)], [], type="auto")}
  buffer_map = {A_1: A}
  preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [n, m: int32], [stride, stride_1: int32], type="auto")}
{
  allocate(B: Pointer(global float32), float32, [n]), storage_scope = global;
  for (i: int32, 0, n) {
    B_1: Buffer(B, float32, [n], [])[i] = 0f32
    for (k: int32, 0, m) {
      B_1[i] = (B_1[i] + A[((i*stride) + (k*stride_1))])
    }
  }
}
However, for such a common operation we had to define the reduce axis ourselves, as well as write the explicit computation with te.compute. Fortunately, these two lines can be replaced with a single call to topi.sum (similar to numpy.sum):
C = topi.sum(A, axis=1)
ts = te.create_schedule(C.op)
print(tvm.lower(ts, [A], simple_mode=True))
Output:
@main = primfn(A_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {A: Buffer(A_2: Pointer(float32), float32, [(stride: int32*n: int32)], [], type="auto")}
  buffer_map = {A_1: A}
  preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [n, m: int32], [stride, stride_1: int32], type="auto")}
{
  allocate(A_red: Pointer(global float32), float32, [n]), storage_scope = global;
  for (ax0: int32, 0, n) {
    A_red_1: Buffer(A_red, float32, [n], [])[ax0] = 0f32
    for (k1: int32, 0, m) {
      A_red_1[ax0] = (A_red_1[ax0] + A[((ax0*stride) + (k1*stride_1))])
    }
  }
}
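Other numpy-style TOPI reductions follow the same one-line pattern. As a minimal hedged sketch (these extra calls are not part of the original example):

D = topi.max(A, axis=1)  # row-wise maximum, analogous to numpy.max(A, axis=1)
E = topi.sum(A)          # reduce over all axes, analogous to numpy.sum(A)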
Numpy-style operator overloading
Two tensors with broadcastable shapes can be added with topi.broadcast_add. TOPI provides operator overloading for such common operations to make the code shorter. For example:
x, y = 100, 10
a = te.placeholder((x, y, y), name="a")
b = te.placeholder((y, y), name="b")
c = a + b  # equivalent to topi.broadcast_add
d = a * b  # equivalent to topi.broadcast_mul
With the same overloaded syntax, TOPI also broadcasts primitives (int, float) against tensors, as in d - 3.14.
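For instance, a minimal sketch of scalar broadcasting (this line is illustrative, not from the original code):

e = d - 3.14  # the Python float is broadcast against every element of d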
Generic schedules and operator fusion
We've seen how TOPI saves us from writing explicit computations with lower-level APIs, but the scheduling process so far is unchanged. TOPI also provides higher-level scheduling functions that depend on the given context. For example, for CUDA, a series of operations ending with topi.sum can be scheduled using only topi.cuda.schedule_reduce:
e = topi.elemwise_sum([c, d])
f = e / 2.0
g = topi.sum(f)
with tvm.target.cuda():
    sg = topi.cuda.schedule_reduce(g)
    print(tvm.lower(sg, [a, b], simple_mode=True))
Output:
/workspace/python/tvm/target/target.py:377: UserWarning: Try specifying cuda arch by adding 'arch=sm_xx' to your target.
  warnings.warn("Try specifying cuda arch by adding 'arch=sm_xx' to your target.")
@main = primfn(a_1: handle, b_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {a: Buffer(a_2: Pointer(float32), float32, [10000], []),
             b: Buffer(b_2: Pointer(float32), float32, [100], [])}
  buffer_map = {a_1: a, b_1: b}
  preflattened_buffer_map = {a_1: a_3: Buffer(a_2, float32, [100, 10, 10], []), b_1: b_3: Buffer(b_2, float32, [10, 10], [])}
{
  allocate(T_divide_red: Pointer(global float32), float32, [1]), storage_scope = global;
  attr [IterVar(threadIdx.x: int32, [0:1024], "ThreadIndex", "threadIdx.x")] "thread_extent" = 1024;
  allocate(T_divide_red.rf: Pointer(local float32), float32, [1]), storage_scope = local;
  allocate(reduce_temp0: Pointer(local float32), float32, [1]), storage_scope = local {
    T_divide_red.rf_1: Buffer(T_divide_red.rf, float32, [1], [], scope="local", align=4)[0] = 0f32
    for (k0.k1.fused.k2.fused.outer: int32, 0, 10) {
      if @tir.likely((((((k0.k1.fused.k2.fused.outer*64) + floordiv(threadIdx.x, 16)) < 625) && (((k0.k1.fused.k2.fused.outer*64) + floordiv(threadIdx.x, 16)) < 625)) && (((k0.k1.fused.k2.fused.outer*64) + floordiv(threadIdx.x, 16)) < 625)), dtype=bool) {
        T_divide_red.rf_1[0] = (T_divide_red.rf_1[0] + (((a[((k0.k1.fused.k2.fused.outer*1024) + threadIdx.x)] + b[((floordiv(floormod(((k0.k1.fused.k2.fused.outer*12) + floordiv(threadIdx.x, 2)), 50), 5)*10) + floormod(((k0.k1.fused.k2.fused.outer*4) + threadIdx.x), 10))]) + (a[((k0.k1.fused.k2.fused.outer*1024) + threadIdx.x)]*b[((floordiv(floormod(((k0.k1.fused.k2.fused.outer*12) + floordiv(threadIdx.x, 2)), 50), 5)*10) + floormod(((k0.k1.fused.k2.fused.outer*4) + threadIdx.x), 10))]))*0.5f32))
      }
    }
    attr [meta[tir.CommReducer][0]] "reduce_scope" = @tir.reinterpret(0u64, dtype=handle);
    @tir.tvm_thread_allreduce(1u32, T_divide_red.rf_1[0], True, reduce_temp0_1: Buffer(reduce_temp0, float32, [1], [], scope="local")[0], threadIdx.x, dtype=handle)
    if (threadIdx.x == 0) {
      T_divide_red_1: Buffer(T_divide_red, float32, [1], [], align=4)[0] = reduce_temp0_1[0]
    }
  }
}
As shown above, the scheduled stages of the computation accumulate; they can be inspected with:
print(sg.stages)
Output:
[stage(a, placeholder(a, 0x228afb00)),
 stage(b, placeholder(b, 0x22097c90)),
 stage(T_add, compute(T_add, body=[(a[ax0, ax1, ax2] + b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=broadcast, attrs={})),
 stage(T_multiply, compute(T_multiply, body=[(a[ax0, ax1, ax2]*b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=broadcast, attrs={})),
 stage(T_elemwise_sum, compute(T_elemwise_sum, body=[(T_add[ax0, ax1, ax2] + T_multiply[ax0, ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=elemwise, attrs={})),
 stage(T_divide, compute(T_divide, body=[(T_elemwise_sum[ax0, ax1, ax2]/2f)], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=elemwise, attrs={})),
 stage(T_divide_red.rf, compute(T_divide_red.rf, body=[reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[0f]), source=[T_divide[floordiv(floordiv((k0.k1.fused.k2.fused.inner + (k0.k1.fused.k2.fused.outer*1024)), 10), 10), floormod(floordiv((k0.k1.fused.k2.fused.inner + (k0.k1.fused.k2.fused.outer*1024)), 10), 10), floormod((k0.k1.fused.k2.fused.inner + (k0.k1.fused.k2.fused.outer*1024)), 10)]], init=[], axis=[iter_var(k0.k1.fused.k2.fused.outer, range(min=0, ext=10))], where=((((floordiv(floordiv((k0.k1.fused.k2.fused.inner + (k0.k1.fused.k2.fused.outer*1024)), 10), 10) < 100) && (floordiv((k0.k1.fused.k2.fused.inner + (k0.k1.fused.k2.fused.outer*1024)), 10) < 1000)) && ((k0.k1.fused.k2.fused.inner + (k0.k1.fused.k2.fused.outer*1024)) < 10000))), value_index=0)], axis=[iter_var(k0.k1.fused.k2.fused.inner, range(min=0, ext=1024))], reduce_axis=[iter_var(k0.k1.fused.k2.fused.outer, range(min=0, ext=10))], tag=, attrs={})),
 stage(T_divide_red, compute(T_divide_red.repl, body=[reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[0f]), source=[T_divide_red.rf[k0.k1.fused.k2.fused.inner.v]], init=[], axis=[iter_var(k0.k1.fused.k2.fused.inner.v, range(min=0, ext=1024))], where=(bool)1, value_index=0)], axis=[], reduce_axis=[iter_var(k0.k1.fused.k2.fused.inner.v, range(min=0, ext=1024))], tag=, attrs={}))]
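Each entry above is a te.Stage. As a hedged sketch (not part of the original tutorial), the stage list can also be walked programmatically, for example to print only the operation names:

for stage in sg.stages:
    print(stage.op.name)  # a, b, T_add, T_multiply, T_elemwise_sum, T_divide, ...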
The result can be verified by comparing it with the numpy computation, as follows:
func = tvm.build(sg, [a, b, g], "cuda")
dev = tvm.cuda(0)
a_np = np.random.uniform(size=(x, y, y)).astype(a.dtype)
b_np = np.random.uniform(size=(y, y)).astype(b.dtype)
g_np = np.sum(np.add(a_np + b_np, a_np * b_np) / 2.0)
a_nd = tvm.nd.array(a_np, dev)
b_nd = tvm.nd.array(b_np, dev)
g_nd = tvm.nd.array(np.zeros(g_np.shape, dtype=g_np.dtype), dev)
func(a_nd, b_nd, g_nd)
tvm.testing.assert_allclose(g_nd.numpy(), g_np, rtol=1e-5)
TOPI also provides common neural network operations, such as softmax, with optimized schedules:
tarray = te.placeholder((512, 512), name="tarray")
softmax_topi = topi.nn.softmax(tarray)
with tvm.target.Target("cuda"):
    sst = topi.cuda.schedule_softmax(softmax_topi)
    print(tvm.lower(sst, [tarray], simple_mode=True))
Output:
@main = primfn(tarray_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {tarray: Buffer(tarray_2: Pointer(float32), float32, [262144], [])}
  buffer_map = {tarray_1: tarray}
  preflattened_buffer_map = {tarray_1: tarray_3: Buffer(tarray_2, float32, [512, 512], [])}
{
  allocate(T_softmax_norm: Pointer(global float32x4), float32x4, [65536]), storage_scope = global;
  attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 512;
  allocate(normal_reduce_temp0: Pointer(local float32), float32, [1]), storage_scope = local;
  allocate(reduce_temp0: Pointer(local float32), float32, [1]), storage_scope = local;
  allocate(T_softmax_exp: Pointer(warp float32), float32, [512]), storage_scope = warp;
  allocate(normal_reduce_temp0_1: Pointer(local float32), float32, [1]), storage_scope = local;
  allocate(reduce_temp0_1: Pointer(local float32), float32, [1]), storage_scope = local {
    attr [IterVar(threadIdx.x: int32, [0:32], "ThreadIndex", "threadIdx.x")] "thread_extent" = 32 {
      normal_reduce_temp0_2: Buffer(normal_reduce_temp0, float32, [1], [], scope="local")[0] = -3.40282e+38f32
      for (k.inner: int32, 0, 16) {
        normal_reduce_temp0_2[0] = max(normal_reduce_temp0_2[0], tarray[(((blockIdx.x*512) + (threadIdx.x*16)) + k.inner)])
      }
      attr [meta[tir.CommReducer][0]] "reduce_scope" = @tir.reinterpret(0u64, dtype=handle);
      @tir.tvm_thread_allreduce(1u32, normal_reduce_temp0_2[0], True, reduce_temp0_2: Buffer(reduce_temp0, float32, [1], [], scope="local")[0], threadIdx.x, dtype=handle)
      for (i1.inner.outer: int32, 0, 4) {
        let cse_var_1: int32 = (i1.inner.outer*4)
        T_softmax_exp_1: Buffer(T_softmax_exp, float32, [512], [], scope="warp")[ramp(((threadIdx.x*16) + cse_var_1), 1, 4)] = @tir.exp((tarray[ramp((((blockIdx.x*512) + (threadIdx.x*16)) + cse_var_1), 1, 4)] - broadcast(reduce_temp0_3: Buffer(reduce_temp0, float32, [1], [], scope="local", align=4)[0], 4)), dtype=float32x4)
      }
    }
    attr [IterVar(threadIdx.x, [0:32], "ThreadIndex", "threadIdx.x")] "thread_extent" = 32 {
      normal_reduce_temp0_3: Buffer(normal_reduce_temp0_1, float32, [1], [], scope="local")[0] = 0f32
      for (k.inner_1: int32, 0, 16) {
        normal_reduce_temp0_3[0] = (normal_reduce_temp0_3[0] + T_softmax_exp_1[((threadIdx.x*16) + k.inner_1)])
      }
      attr [meta[tir.CommReducer][1]] "reduce_scope" = @tir.reinterpret(0u64, dtype=handle);
      @tir.tvm_thread_allreduce(1u32, normal_reduce_temp0_3[0], True, reduce_temp0_4: Buffer(reduce_temp0_1, float32, [1], [], scope="local")[0], threadIdx.x, dtype=handle)
      for (i1.inner.outer_1: int32, 0, 4) {
        T_softmax_norm_1: Buffer(T_softmax_norm, float32x4, [65536], [])[(((blockIdx.x*128) + (threadIdx.x*4)) + i1.inner.outer_1)] = (T_softmax_exp_1[ramp(((threadIdx.x*16) + (i1.inner.outer_1*4)), 1, 4)] / broadcast(reduce_temp0_5: Buffer(reduce_temp0_1, float32, [1], [], scope="local", align=4)[0], 4))
      }
    }
  }
}
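As with the reduction above, the softmax schedule can be checked against numpy. The following hedged sketch (not part of the original tutorial) reuses dev from earlier and assumes a CUDA device is available:

fsoftmax = tvm.build(sst, [tarray, softmax_topi], "cuda")
t_np = np.random.uniform(size=(512, 512)).astype("float32")
# numpy reference: row-wise softmax with max subtraction for numerical stability
ref = np.exp(t_np - t_np.max(axis=1, keepdims=True))
ref = ref / ref.sum(axis=1, keepdims=True)
t_nd = tvm.nd.array(t_np, dev)
out_nd = tvm.nd.array(np.zeros_like(t_np), dev)
fsoftmax(t_nd, out_nd)
tvm.testing.assert_allclose(out_nd.numpy(), ref, rtol=1e-5)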
Fusing convolutions
We can fuse topi.nn.conv2d and topi.nn.relu together.
Note:
TOPI functions are all generic functions, and different backends implement them in different ways to optimize for performance. For each backend, they must be called under a target scope for both the compute declaration and the schedule. TVM chooses the right function to call based on the target information.
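As a hedged illustration (not from the original tutorial), the target information that drives this dispatch can be inspected inside the scope:

with tvm.target.Target("cuda"):
    print(tvm.target.Target.current().kind.name)  # prints "cuda"

Returning to the fused convolution: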
data = te.placeholder((1, 3, 224, 224))
kernel = te.placeholder((10, 3, 5, 5))

with tvm.target.Target("cuda"):
    conv = topi.cuda.conv2d_nchw(data, kernel, 1, 2, 1)
    out = topi.nn.relu(conv)
    sconv = topi.cuda.schedule_conv2d_nchw([out])
    print(tvm.lower(sconv, [data, kernel], simple_mode=True))
Output:
@main = primfn(placeholder_2: handle, placeholder_3: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {placeholder: Buffer(placeholder_4: Pointer(float32), float32, [150528], []),
             placeholder_1: Buffer(placeholder_5: Pointer(float32), float32, [750], [])}
  buffer_map = {placeholder_2: placeholder, placeholder_3: placeholder_1}
  preflattened_buffer_map = {placeholder_2: placeholder_6: Buffer(placeholder_4, float32, [1, 3, 224, 224], []), placeholder_3: placeholder_7: Buffer(placeholder_5, float32, [10, 3, 5, 5], [])}
{
  allocate(compute: Pointer(global float32), float32, [501760]), storage_scope = global;
  attr [IterVar(blockIdx.z: int32, (nullptr), "ThreadIndex", "blockIdx.z")] "thread_extent" = 5;
  allocate(conv2d_nchw: Pointer(local float32), float32, [14]), storage_scope = local;
  allocate(pad_temp.shared: Pointer(shared float32), float32, [112]), storage_scope = shared;
  allocate(placeholder.shared: Pointer(shared float32), float32, [2]), storage_scope = shared;
  attr [IterVar(blockIdx.y: int32, (nullptr), "ThreadIndex", "blockIdx.y")] "thread_extent" = 224;
  attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 2;
  attr [IterVar(threadIdx.z: int32, (nullptr), "ThreadIndex", "threadIdx.z")] "thread_extent" = 1;
  attr [IterVar(threadIdx.y: int32, (nullptr), "ThreadIndex", "threadIdx.y")] "thread_extent" = 1;
  attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 16 {
    conv2d_nchw_1: Buffer(conv2d_nchw, float32, [4], [], scope="local", align=8)[0] = 0f32
    ...
    [long kernel body elided: for each of the 3 input channels and each of the 5 kernel rows,
     a padded input tile is staged into pad_temp.shared and two filter weights are staged into
     placeholder.shared, then the 14 local accumulators conv2d_nchw_1[0..13] are updated once
     per (unrolled) kernel column; the final stores into compute apply max(conv2d_nchw_1[i], 0f32),
     i.e. the ReLU is fused into the convolution kernel.]
}
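As before, the fused kernel can be validated numerically. This is a hedged sketch (not part of the original tutorial); it assumes a CUDA device and uses tvm.topi.testing.conv2d_nchw_python, the numpy reference implementation shipped with TVM's tests:

import tvm.topi.testing

fconv = tvm.build(sconv, [data, kernel, out], "cuda")
data_np = np.random.uniform(size=(1, 3, 224, 224)).astype("float32")
kernel_np = np.random.uniform(size=(10, 3, 5, 5)).astype("float32")
ref = tvm.topi.testing.conv2d_nchw_python(data_np, kernel_np, 1, 2)  # stride=1, padding=2
ref = np.maximum(ref, 0.0)  # apply the fused relu to the reference
data_nd = tvm.nd.array(data_np, dev)
kernel_nd = tvm.nd.array(kernel_np, dev)
out_nd = tvm.nd.array(np.zeros(ref.shape, dtype="float32"), dev)
fconv(data_nd, kernel_nd, out_nd)
tvm.testing.assert_allclose(out_nd.numpy(), ref, rtol=1e-4)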
Summary
This tutorial has shown the following:
- How to use the TOPI API for common operations with numpy-style operators.
- How TOPI facilitates generic schedules and operator fusion for a given context, to generate optimized kernel code.
Download Python source code: intro_topi.py
Download Jupyter Notebook: intro_topi.ipynb
That concludes this look at how TOPI reduces boilerplate in TVM code. For more on TOPI and TVM, please see my other related articles.