This tutorial shares a method for adding layer-by-layer operators to pytorch-mlu on Cambricon devices.
In the pytorch-mlu layer-by-layer mode, the basic unit of data transfer and storage between operators is the tensor. pytorch-mlu dispatches an operator to a device according to the device attribute of its input tensor. Taking abs() as an example, in the dispatch phase the operator call is routed to a specific device according to the device attribute of input_tensor; the logic is shown in the following figure:
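To make this concrete, here is a minimal usage sketch of that dispatch behaviour. It assumes pytorch-mlu (Catch) is installed; the import path torch_mlu.core.mlu_model for the mlu_device() helper is an assumption chosen only to match the xm.mlu_device() call used in the test code later in this article.

import torch
import torch_mlu.core.mlu_model as xm   # assumed import path providing mlu_device()

x = torch.randn(2, 3)                    # device attribute is "cpu", so abs() dispatches to the CPU kernel
y_cpu = torch.abs(x)

x_mlu = x.to(xm.mlu_device())            # device attribute is now the MLU, so abs() dispatches to the MLU kernel
y_mlu = torch.abs(x_mlu)

print(torch.allclose(y_cpu, y_mlu.cpu()))  # copy the MLU result back to the CPU for comparison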
Catch adds MLU operators through registration and is thereby decoupled from the PyTorch source code. The following describes the steps for adding an MLU operator to Catch.
1. Register the operator
In catch/torch_mlu/csrc/generated/aten_mlu_type_default.cpp, register the operator:
.op(torch::RegisterOperators::options().schema("aten::add(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor")  // NOLINT
  .impl_unboxedOnlyKernel<at::Tensor(const at::Tensor &, const at::Tensor &, at::Scalar), &AtenMluType::add>(at::TensorTypeId::MLUTensorId)
  .aliasAnalysis(c10::AliasAnalysisKind::FROM_SCHEMA))
2. Operator dispatch
AtenMluType and AtenMluCustomType are the entry points for operators in the Catch module. The AtenMluType class mainly contains the standard operators of the framework, while the AtenMluCustomType class contains customized operators. Depending on the properties of the operator, add the corresponding operator declaration and implementation to either AtenMluType or AtenMluCustomType.
Standard operator dispatch
Add the operator declaration and implementation in catch/torch_mlu/csrc/aten/aten_mlu_type.h and catch/torch_mlu/csrc/aten/aten_mlu_type.cpp:
aten_mlu_type.h
static at::Tensor add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha);

aten_mlu_type.cpp
at::Tensor AtenMluType::add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha){
  return OP_DISPATCH(add, self, other, alpha);
}
Customized operator dispatch
For MLU-specific operators, add the operator declaration and implementation in catch/torch_mlu/csrc/aten/aten_mlu_type.h and catch/torch_mlu/csrc/aten/aten_mlu_custom_type.cpp:
aten_mlu_type.h
static at::Tensor linear(const at::Tensor& input,
                         const at::Tensor& weight,
                         const at::Tensor& bias,
                         const at::Tensor& q_scale,
                         const at::Tensor& q_mode);

aten_mlu_custom_type.cpp
at::Tensor AtenMluCustomType::linear(const at::Tensor& input,
                                     const at::Tensor& weight,
                                     const at::Tensor& bias,
                                     const at::Tensor& q_scale,
                                     const at::Tensor& q_mode){
  return OP_DISPATCH(linear, input, weight, bias, q_scale, q_mode);
}
3. Modify the OpMethods base class.
Both AtenMluType and AtenMluCustomType are further dispatched to inference or training operators through OpMethods. Add the operator declaration and implementation in catch/torch_mlu/csrc/aten/operators/op_methods.h and catch/torch_mlu/csrc/aten/operators/op_methods.cpp. The implementation in OpMethods is the CPU implementation of the operator.
op_methods.h
virtual at::Tensor add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha);

op_methods.cpp
at::Tensor OpMethods::add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha){
  auto input_cpu = self.cpu();
  auto other_cpu = other.cpu();
  auto output = at::add(input_cpu, other_cpu, alpha);
  return output.to(at::Device(at::Device::Type::MLU));
}
4. Deliver the operator
Add the inference operator declaration and implementation in catch/torch_mlu/csrc/aten/operators/cnml_ops.h and catch/torch_mlu/csrc/aten/operators/cnml_ops.cpp:
cnml_ops.h
at::Tensor add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha);

cnml_ops.cpp
at::Tensor CnmlOps::add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha){
  CNML_DISPATCH(add, cnml_add, self, other, alpha);  // The first argument of CNML_DISPATCH is the interface name, the second is the wrapper name, and the rest are the arguments of the interface.
}
5. Add wrapper
A wrapper encapsulates the kernel of an operator, with one wrapper per operator. Taking the add operator as an example, add the wrapper as shown below:
cnml_kernel.h
at::Tensor cnml_add(const at::Tensor& input, const at::Tensor& other, at::Scalar alpha);

at::Tensor cnml_add(const at::Tensor& input, const at::Tensor& other, at::Scalar alpha_scalar){
  TORCH_CHECK(input.dim() >= 0 || other.dim() >= 0, "dimension not support");
  at::Tensor input_ = input;
  at::Tensor other_ = other;
  auto alpha_data = alpha_scalar.to<scalar_t>();
  if(alpha_data != 1){
    // scale_t
    other_ = cnml::ops::cnml_scale(other_, alpha_data, 0);
  }
  if(other_.dim() < 1 && other_.device().type() == c10::DeviceType::CPU){
    auto other_scalar = other_.item();
    return cnml_add_internal(input_, other_scalar);    // call the kernel
  }
  if(input_.dim() < 1 && input_.device().type() == c10::DeviceType::CPU){
    auto input_scalar = input_.item();
    return cnml_add_internal(other_, input_scalar);    // call the kernel
  }
  bool broadcast = input_.sizes() != other_.sizes();
  if(broadcast){
    auto broadcast_size = at::infer_size(input_.sizes(), other_.sizes());
    at::Tensor broadcast1 = cnml::ops::cnml_expand(input_, broadcast_size, false);
    at::Tensor broadcast2 = cnml::ops::cnml_expand(other_, broadcast_size, false);
    return cnml_add_internal(broadcast1, broadcast2);  // call the kernel
  }else{
    return cnml_add_internal(input_, other_);          // call the kernel
  }
}
6. Add kernel
The wrapper performs the operator's computation by calling a kernel; in this example the call is cnml_add_internal(). The operator is implemented mainly by calling the interfaces of the CNML library; the CNML programming logic is as follows:
Following the programming logic above, the kernel is implemented by calling the CNML library interfaces. Add the kernel function declaration and implementation in catch/torch_mlu/csrc/aten/operators/cnml/internal/cnml_internal.h and catch/torch_mlu/csrc/aten/operators/cnml/internal/add_internal.cpp:
cnml_internal.h
at::Tensor cnml_add_internal(const at::Tensor& input1, const at::Tensor& input2);

add_internal.cpp
at::Tensor cnml_add_internal(const at::Tensor& input1, const at::Tensor& input2){
  auto output = at::native::empty_like(input1);

  // prepare input cnml tensor
  auto* input1_impl = getMluTensorImpl(input1);  // get the MluTensorImpl
  auto input1_cnml = input1_impl->CreateCnmlTensor(
      CNML_TENSOR, toCnmlDataType(input1.dtype()));  // type adaptation: toCnmlDataType()

  auto* input2_impl = getMluTensorImpl(input2);
  auto input2_cnml = input2_impl->CreateCnmlTensor(
      CNML_TENSOR, toCnmlDataType(input2.dtype()));

  // prepare output cnml tensor
  auto* output_impl = getMluTensorImpl(output);
  auto output_cnml = output_impl->CreateCnmlTensor(
      CNML_TENSOR, toCnmlDataType(output.dtype()));

  // end the execution flow if not an MLU device
  CHECK_MLU_DEVICE(output);

  // set up the operator
  cnmlBaseOp_t add_op;
  TORCH_CNML_CHECK(cnmlCreateAddOp(&add_op, input1_cnml, input2_cnml, output_cnml));

  // return to JIT if the running mode is fuse
  CHECK_RETURN_TO_FUSE(add_op, output);

  // compile the op
  TORCH_CNML_CHECK(cnmlCompileBaseOp(add_op, GET_CORE_VERSION, GET_CORE_NUMBER));

  // run the op on the current queue
  auto queue = getCurQueue();
  TORCH_CNML_CHECK(cnmlComputeAddOpForward_V4(add_op,
                                              NULL, input1_impl->raw_mutable_data(),
                                              NULL, input2_impl->raw_mutable_data(),
                                              NULL, output_impl->raw_mutable_data(),
                                              queue, NULL));
  syncQueue(queue);

  // release the op
  TORCH_CNML_CHECK(cnmlDestroyBaseOp(&add_op));

  return output;
}
Handling operators not supported by the MLU
For operators that the MLU does not support, the input data is copied to the CPU, the corresponding CPU operation is called so that the operator runs on the CPU, and the output is finally copied back to the MLU. For the concrete implementation, see op_methods.cpp in the catch/torch_mlu/csrc/aten/operators/ directory.
op_methods.cpp
at::Tensor OpMethods::add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha){
  auto input_cpu = self.cpu();
  auto other_cpu = other.cpu();
  auto output = at::add(input_cpu, other_cpu, alpha);
  return output.to(at::Device(at::Device::Type::MLU));
}
If an exception is thrown while a newly added operator is executing and there is no corresponding operator implementation on the CPU, the operation cannot be switched to run on the CPU.
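For illustration, the same fallback pattern can also be written in user code when an operator has no MLU implementation: copy the inputs to the CPU, run the CPU operator, and copy the result back to the MLU. This is only a sketch mirroring the OpMethods::add fallback above; the xm.mlu_device() helper and its import path follow the convention used elsewhere in this article and are assumptions.

import torch
import torch_mlu.core.mlu_model as xm   # assumed import path, as in the test code below

def add_via_cpu(input_mlu, other_mlu, alpha=1):
    # Mirror of the OpMethods::add fallback: run the operator on the CPU, then copy the result back to the MLU.
    input_cpu = input_mlu.cpu()
    other_cpu = other_mlu.cpu()
    output = torch.add(input_cpu, other_cpu, alpha=alpha)
    return output.to(xm.mlu_device())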
Wrappers are generally named cnml_ followed by the operator name, and kernels are generally named cnml_ followed by the operator name and the suffix _internal (for example, cnml_add and cnml_add_internal).
7. Operator testing
Write unit tests for the operator using Python's unittest module. During testing, provide the same parameters and input data, execute the operator on the MLU and on the CPU respectively, and compare the two outputs. The MLU and CPU results may differ, but a relative error within 2% is generally considered acceptable.
def test_add(self):
    # "Tensor + Tensor" mode testing
    for shape1, shape2 in [((1,3,224,224),(1,3,224,224)),((2,30,80),(2,30,80)),((3,20),(3,20)),((10),(10))]:
        input1_cpu = torch.rand(shape1, dtype=torch.float)
        input2_cpu = torch.rand(shape2, dtype=torch.float)
        input1_mlu = input1_cpu.to(xm.mlu_device())
        input2_mlu = input2_cpu.to(xm.mlu_device())
        # compute on the CPU
        output_cpu = input1_cpu + input2_cpu
        # compute on the MLU
        output_mlu = input1_mlu + input2_mlu
        # compare the MLU result with the CPU result and require the relative error to be within 2%
        self.assertTensorsEqual(output_cpu, output_mlu.cpu(), 0.02, use_MSE=True)
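The comparison helper used above is provided by the test framework and is not shown here. As a rough illustration of the 2% relative-error criterion, a stand-alone check could look like the following sketch; the MSE-style formula is an assumption, not the actual helper.

import torch

def assert_tensors_close(cpu_out, mlu_out, prec=0.02, use_MSE=True):
    # Illustrative relative-error check: compare the MLU result (copied back to the CPU)
    # against the CPU reference and require the error to stay below prec.
    a, b = cpu_out.float(), mlu_out.float()
    if use_MSE:
        err = ((a - b).pow(2).sum() / (a.pow(2).sum() + 1e-12)).sqrt()
    else:
        err = (a - b).abs().max()
    assert err.item() <= prec, "relative error {:.4f} exceeds {}".format(err.item(), prec)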
The above shared the method for adding a layer-by-layer operator to pytorch-mlu on Cambricon devices, using the add() operator as a worked example. I hope this sharing is of some help to your learning.