Many recent works focus on modeling and evaluating systolic-array-like DNN accelerators. Among the various approaches, those based on loop transformations usually admit an efficient and clean DSE, but their design space tends to be limited. Others focus on modeling rather than optimization; these works usually design an IR that is expressive enough and argue how their IR enlarges the design space. However, such IRs are usually not suitable for deriving a straightforward and efficient DSE, and the common workaround is to further restrict the search space. This is awkward, and it makes the motivation for designing expressive IRs questionable. Another problem is that even when an IR can represent diverse designs, it only works well / is "user friendly" enough when applied to simple architectures such as systolic arrays.

Pure Modeling:

Maestro

Maestro designs three primitives (named "data-centric") to map a GEMM/Conv2D workload onto spatial hardware. The three primitives are:

  • SpatialMap(size, offset),
  • TemporalMap(size, offset),
  • Cluster(size)

In this language, "size" specifies how many elements are mapped to a single PE simultaneously, and "offset" is the shift of the starting point between consecutive PEs (for SpatialMap) or time steps (for TemporalMap). SpatialMap and TemporalMap can be seen as forcing or specifying spatial and temporal reuse for a loop index. This representation is a bit counterintuitive to me.
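To make the semantics concrete, here is a toy Python sketch of how I read these primitives (my own illustration, not Maestro's actual implementation): each mapping carves a loop dimension into chunks of `size`, advancing by `offset` per PE (spatial) or per time step (temporal).

```python
# Toy sketch of Maestro-style data-centric mappings (not Maestro's real code).
# SpatialMap(size, offset): PE p covers indices [p*offset, p*offset + size).
# TemporalMap(size, offset): every PE covers [t*offset, t*offset + size) at step t.

def spatial_map(size, offset, dim, num_pes):
    """Indices of a dimension of extent `dim` assigned to each PE simultaneously."""
    return {pe: [i for i in range(pe * offset, pe * offset + size) if i < dim]
            for pe in range(num_pes)}

def temporal_map(size, offset, dim):
    """Indices of a dimension of extent `dim` visited at each time step."""
    steps, t = [], 0
    while t * offset < dim:
        steps.append([i for i in range(t * offset, t * offset + size) if i < dim])
        t += 1
    return steps

print(spatial_map(2, 2, 8, 4))  # each of 4 PEs gets a disjoint pair of indices
print(temporal_map(3, 1, 5))    # overlapping windows across steps -> temporal reuse
```

With `size == offset` the mapping tiles the dimension; `size > offset` makes consecutive PEs/steps overlap, which is exactly the forced reuse mentioned above.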

Maestro thoroughly evaluates DNN dataflows with an analytical model. This is indeed useful in the early design phase. Many follow-up works propose improvements based on similar stories.

Link to the paper: https://arxiv.org/abs/1805.02566

Tenet

Tenet proposes a more accurate way to model dataflow (via the polyhedral model), and it also enlarges the design space of DNN dataflows. The backend of Tenet is the Integer Set Library (ISL), a library that answers the question "how many integer points are there inside a polyhedron?" via ILP and other sophisticated algorithms (I haven't gotten to that part yet). Many problems can be solved via polyhedral models (like calculating cache misses), and they are especially convenient for problems related to DNN workloads.

The Integer Set Library represents integer sets and many-to-many maps via linear constraints. A set is simply something like { [i, j] : 0 <= i < N and 0 <= j < M } (this is not the exact form), and a map is like { [i] -> [j] : j = 2i } (not the exact form). Map and set operations are then equivalent to solving/generating new ILPs. For example, to intersect two sets one only needs to append the constraints of one set to the other and do a reduction.
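The "intersection = conjunction of constraints" idea can be shown with a hand-rolled sketch (plain Python over a bounded grid, not ISL, which works symbolically):

```python
# Minimal sketch of "sets as linear constraints" (hand-rolled, not ISL).
# A set { [i, j] : constraints } is a list of predicates over integer points;
# intersecting two sets is just concatenating their constraint lists.

def points(constraints, bound):
    """Enumerate integer points (i, j) in [0, bound)^2 satisfying all constraints."""
    return [(i, j) for i in range(bound) for j in range(bound)
            if all(c(i, j) for c in constraints)]

# S1 = { [i,j] : i + j <= 4 },  S2 = { [i,j] : i >= j }
s1 = [lambda i, j: i + j <= 4]
s2 = [lambda i, j: i >= j]

# Intersection: append the constraints of one set to the other.
s1_and_s2 = s1 + s2
print(len(points(s1_and_s2, 8)))  # prints 9, the cardinality of S1 ∩ S2
```

ISL does the same thing symbolically, so the result stays parametric in the loop bounds instead of requiring enumeration.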

Some operations that the Integer Set Library supports:

  • Map_Apply_Domain(Map1, Map2)
  • Map_Apply_Range(Map1, Map2)
  • Map_Intersect/Map_Union
  • Map_Product
  • Map_Card

The sophisticated operation is "Map_Card", which maps an element to the cardinality of its range. The underlying problem is "how many solutions are there for a set of linear constraints?"
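As a toy illustration of what Map_Card computes (not ISL's algorithm, which counts symbolically rather than by enumeration): for each domain element, count how many range elements it relates to.

```python
# Toy Map_Card: map each domain element of an integer map to the number of
# range elements it is related to. ISL returns this as a symbolic expression;
# here we just enumerate a finite instantiation.

def map_card(pairs):
    """pairs: iterable of (domain, range) tuples of an integer map."""
    images = {}
    for d, r in pairs:
        images.setdefault(d, set()).add(r)
    return {d: len(rs) for d, rs in images.items()}

# M = { [i] -> [j] : 0 <= j < i }, instantiated for i in 0..4
m = [(i, j) for i in range(5) for j in range(i)]
print(map_card(m))  # prints {1: 1, 2: 2, 3: 3, 4: 4}
```

ISL would express the same answer symbolically as a quasi-polynomial in i (here simply i), which is why the operation is so much harder than the set operations above.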

It seems that one could build an ILP problem directly from such an analytical model, by replacing fixed points with variables and "inlining" all the code implemented in the Integer Set Library, but the most critical operation, "isl_union_map_card", is actually not an ILP: "how many solutions are there for a set of linear constraints?" is in fact a very hard counting problem.

Link to the paper: https://arxiv.org/pdf/2105.01892.pdf
Link to ISL: https://libisl.sourceforge.io/
Wiki: integer points in convex polyhedra: https://en.wikipedia.org/wiki/Integer_points_in_convex_polyhedra

Modeling and DSE:

Interstellar

Interstellar bases its analysis on loop nests, and it extends Halide's primitives to support systolic arrays and broadcasting. The Halide IR considers the transformation of a single loop nest and the interleaving of multiple correlated loop nests. It specifies loop transformations via "split" and "reorder", and it specifies the pipeline by defining when each input should be calculated. The two extreme cases are "breadth first" and "fuse": the former has no pipelining, and the latter computes every input "just in time", in the innermost loop. The general case is a sliding window, which reuses values across subsequent iterations.
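A quick sketch of what "split" and "reorder" do to a loop nest (a toy model of the schedule as a visit order, not Halide's actual IR): split j into (jo, ji) with j = jo*tile + ji, then reorder to (jo, i, ji), which is classic tiling for reuse.

```python
# Toy model of Halide-style split + reorder: a schedule is just the order in
# which the iteration-space points (i, j) are visited.

def naive(ni, nj):
    """Original nest: for i, for j."""
    return [(i, j) for i in range(ni) for j in range(nj)]

def tiled(ni, nj, tj):
    """split j -> (jo, ji) with j = jo*tj + ji, then reorder to (jo, i, ji)."""
    order = []
    for jo in range(nj // tj):       # outer tile loop
        for i in range(ni):          # full i loop inside each tile
            for ji in range(tj):     # intra-tile loop
                order.append((i, jo * tj + ji))
    return order

# Same set of points, different visit order -> different reuse behavior.
print(naive(2, 4))
print(tiled(2, 4, 2))
```

Both schedules cover the same iteration space; only the order (and hence the reuse distance of each value) changes, which is exactly what Interstellar's analysis exploits.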

Link to the paper: https://arxiv.org/abs/1809.04070

CoSA

CoSA models the problem as a MIP. It first factorizes each loop bound into its prime factors. Then it specifies, for each factor, 1) the memory level it is mapped to, 2) the permutation order, and 3) the spatial mapping. It only targets CNN workloads, and it has a coarse-grained definition of dataflows (only spatial/temporal mapping) and a relatively simple model of the interconnect, though it studies the multicast and unicast traffic patterns on the NoC carefully.
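The factorization step is easy to sketch (my own code, not CoSA's): each loop bound is broken into prime factors, and the MIP then assigns each factor independently to a memory level / spatial dimension.

```python
# Sketch of CoSA's first step: factor each loop bound into primes so that the
# MIP can place every prime factor (memory level, order, spatial/temporal)
# independently. Trial division is enough for typical CNN loop bounds.

def prime_factors(n):
    """Return the prime factorization of n as a list with multiplicity."""
    factors, p = [], 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors

# Example conv bounds: output channels K = 64, output width W = 56.
print(prime_factors(64))  # prints [2, 2, 2, 2, 2, 2]
print(prime_factors(56))  # prints [2, 2, 2, 7]
```

Working at the granularity of prime factors is what makes the tiling choices expressible as discrete MIP assignment variables instead of a search over all divisors.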

Link to the paper: https://arxiv.org/pdf/2105.01898.pdf