Cross-Stage Prediction #
congestion #
background #
Routing congestion can overwhelm routing resources and lead to low cell utilization and routing detours
congestion is not known accurately until late in the design cycle, after placement and routing.
Many modern placement and synthesis tools leverage congestion estimation in their cost analysis in order to minimize the effects of congestion in the final physical design
It is known that the total net length can be a good proxy for congestion
A simple approximation for congestion prediction is to use the size of the local neighborhood, which is strongly correlated with fan-in and fan-out.
Precise congestion prediction from a placement solution plays a crucial role in circuit placement
Multiple previous works have attempted to predict detailed routing congestion in the placement step in an effort to optimize routability of the placement solution: RUDY, POLAR 2.0. All these techniques are implemented in the placement step and need the position information of cells .
To avoid the high computation cost of placement, it is more useful to be able to predict congestion in the logic synthesis phase.
The congestion prediction problem can be framed as a node regression problem.
with the growth of circuit scale and complexity, time consumption tends to be unacceptable when utilizing a global router in the placement cycle to obtain the congestion map.
Current machine learning models commonly follow a two-phase workflow. First, based on domain knowledge, human experts generate various local features on the circuit using predefined functions on netlist. Then, based on the generated features, a specific model, e.g. convolution neural network (CNN) model is designed to predict either the routing demand map or the congestion map
the emergence of Graph Neural Network (GNN) triggered applications of undirected homogeneous graphs models on routing congestion prediction, since a VLSI circuit can be naturally represented by a graph
RouteNet-DRC Hotspot Prediction-ICCAD-2018-CNN #
background #
Every chip design project must complete routing without design rule violation before tapeout. However, this basic requirement is often difficult to be satisfied especially when routability is not adequately considered in early design stages.
In light of this fact, routability prediction has received serious attention in both academic research and industrial tool development. Moreover, routability is widely recognized as a main objective for cell placement
CNN and Transfer Learning
- CNN learns more abstract patterns from images
- Our RouteNet transfers such state-of-the-art ability in image pattern recognition to circuits for capturing the patterns about routability. RouteNet predicts routability based on a pretrained ResNet architecture
- Fully Convolutional Network (FCN): outputs an image with size equal to or smaller than input. many FCNs have both deep and shallow paths in one network.
RUDY(Rectangular Uniform wire DensitY)
- It is used as an input feature of RouteNet because it partially correlates with routing congestion, is fast to obtain, and can be directly represented as an image that matches RouteNet's input.
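Since RUDY comes up repeatedly as an input feature, here is a minimal sketch of the standard RUDY formulation, assuming nets are given as lists of normalized pin coordinates; the grid size and the net format are illustrative assumptions, not RouteNet's exact preprocessing.

```python
# Minimal RUDY sketch (assumed standard formulation): each net spreads its
# HPWL uniformly over its bounding box, i.e. adds (w + h) / (w * h) to every
# covered grid cell. Grid size and net format are illustrative assumptions.
import numpy as np

def rudy_map(nets, grid_w=224, grid_h=224):
    """nets: list of [(x, y), ...] pin coordinates, normalized to [0, 1)."""
    rudy = np.zeros((grid_h, grid_w), dtype=np.float32)
    for pins in nets:
        xs = [p[0] for p in pins]
        ys = [p[1] for p in pins]
        x0 = int(min(xs) * grid_w)
        x1 = min(int(max(xs) * grid_w), grid_w - 1)
        y0 = int(min(ys) * grid_h)
        y1 = min(int(max(ys) * grid_h), grid_h - 1)
        w = max(x1 - x0 + 1, 1)           # bbox width in grid cells
        h = max(y1 - y0 + 1, 1)           # bbox height in grid cells
        density = (w + h) / float(w * h)  # HPWL spread over bbox area
        rudy[y0:y1 + 1, x0:x1 + 1] += density
    return rudy
```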
challenge of macros
- The orange circles in Figure 3 indicate a strong tendency for hotspots to aggregate at the small gap between neighboring macros
- Blue dashed circles indicate the remaining sparsely distributed hotspots
- with macros present, the relationship is much less linear
task #
- predict overall routability (DRC count): a classification-style task predicting the total #DRV
- predict DRC hotspot locations. DRC hotspots are specific locations with a high density of DRVs; this is like an end-to-end object detection task, which is harder to solve. A GCell whose #DRV exceeds a set threshold is a DRC hotspot.
contribution: #
- mixed-size macros
- first systematic study on CNN-based routability prediction
- high accuracy and high speed
flow #
model #
#DRV prediction
ResNet18-based
preprocess
ResNet is a fixed-input (224×224) model; to reuse the transferred knowledge, the input has to be adapted to that shape. (How exactly is the input interpolated?)
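Since the note leaves open how the inputs are adapted to ResNet's 224×224 shape, here is one plausible hedged sketch: stack the placement feature maps as channels, bilinearly resize, map them to 3 channels, and replace the final FC layer with a single-output regression head. The resize choice, the 1×1 conv, and all names are assumptions, not RouteNet's exact recipe.

```python
# Hedged sketch: adapt multi-channel placement features to a pretrained
# ResNet18 for #DRV regression. Bilinear resize to 224x224 and a 1x1 conv
# mapping C feature channels to 3 are assumptions, not the paper's recipe.
import torch
import torch.nn as nn
import torchvision.models as models

class DrvRegressor(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        self.to_rgb = nn.Conv2d(in_channels, 3, kernel_size=1)   # C -> 3 channels
        self.backbone = models.resnet18(weights="IMAGENET1K_V1")
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 1)  # regression head

    def forward(self, x):                        # x: (B, C, H, W), any H/W
        x = nn.functional.interpolate(x, size=(224, 224), mode="bilinear",
                                      align_corners=False)
        x = self.to_rgb(x)
        return self.backbone(x).squeeze(-1)      # predicted #DRV per design

features = torch.rand(2, 4, 300, 280)            # e.g. RUDY, pin density, macro mask, ...
print(DrvRegressor(in_channels=4)(features).shape)  # torch.Size([2])
```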
hotspot prediction
data #
dataset:
ISPD 2015 benchmarks
different placement made by “obstacle-aware macro placement" algorithm [5].
each floorplan is placed and routed by Cadence Encounter v14.20 [2]
experiment #
we compare the TPR of all methods under the same FPR (error under 1%)
CongestionNet-predict congestion hotspots-IFIP-2019-GNN(GAT)-nvidia #
a graph-based deep learning method for predicting routing congestion hotspots from a netlist before placement. Predict the detail routed lower metal layer congestion values
why the lower layers? Because congestion on the lower metal layers is mainly driven by local logic structure rather than by the longer interconnects between unrelated logic clusters, which tend to run on the higher metal layers. Predicting lower metal layer congestion is not only more important for the underlying task of identifying congested logic structures, but also simplifies the task for our graph based network.
contribution #
- early stage: uses only the netlist
- since the model predicts based only on the logical structure of the netlist rather than any specific cell placement, it removes the artifacts of suboptimal placements that placement-based methods suffer from
- can be done without any physical information
- GNN-based, fast
- the first work exploring the use of graph based deep learning for physical design problems
data:
roughly 5000 distinct cell types
we project our per cell predictions back onto their respective 2D grid (using the final ground truth physical placement) and average all cells within each grid cell to come up with a predicted value that can be compared to the original ground truth grid value.
model parameters:
an 8 layer Graph Attention Network (GAT) with size 16 intermediate (or hidden) state
undirected graph; each node corresponds to a cell
node features: an embedding of length 50 for each cell type and each cell's logic description, plus the pin count and cell size of that cell
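A minimal PyTorch Geometric sketch of the architecture described above (8 GAT layers, hidden size 16, per-node regression), with a learned cell-type embedding and the two scalar features concatenated as node input; the layer count and sizes follow the note, everything else (activations, head) is an assumption.

```python
# Hedged sketch of a CongestionNet-like model: 8 GAT layers, hidden size 16,
# cell-type embedding (dim 50) + pin count + cell size as node features.
# Activation and output-head choices are assumptions.
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class CongestionGAT(nn.Module):
    def __init__(self, num_cell_types, emb_dim=50, hidden=16, layers=8):
        super().__init__()
        self.type_emb = nn.Embedding(num_cell_types, emb_dim)
        in_dim = emb_dim + 2                      # + pin count, cell size
        self.convs = nn.ModuleList(
            [GATConv(in_dim if i == 0 else hidden, hidden) for i in range(layers)]
        )
        self.head = nn.Linear(hidden, 1)          # per-cell congestion value

    def forward(self, cell_type, scalar_feats, edge_index):
        x = torch.cat([self.type_emb(cell_type), scalar_feats], dim=-1)
        for conv in self.convs:
            x = torch.relu(conv(x, edge_index))
        return self.head(x).squeeze(-1)
```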
experiments:
report correlation values using the Kendall ranking coefficient
qualitative visualization of results
comparison experiments
ablation study
cell type or function is an essential part of our predictions.
(but didn't the ablation suggest that cell type contributed little?)
limitations:
- model needs to be retrained for every new process technology, since the embeddings are over cell types specific to a process technology.
- it occasionally over predicts congestion in areas of low to moderate congestion, such as in most failing parts of Partition A
- due to the graph based nature of the model, it sometimes makes overly soft decision boundaries
- the CongestionNet uses informative cell attributes (cell size and pin count) alone as the input to the GAT and does not use any embedding encoding the netlist structure
possible improvements:
-Congestion prediction + embedding + matrix factorization + partition-arXiv-2021-GNN(Sage)- #
background #
- predicting cell congestion due to improper logic combination can reduce the burden of subsequent physical implementations.
- previous work: require informative cell features
- Although the global routing result provides a good estimation of routing congestion [6], [19], an awareness of high congestion areas at an early design stage is of great importance to provide fast feedback and shorten design cycles
- Multiple works have attempted to predict detailed routing congestion in the placement step in an effort to optimize routability of the placement solution
task #
during the logic synthesis stage
Which congestion data exactly? The ground truth after routing, or a RUDY estimate after placement? It should be after global routing: the paper stresses that congestion value = wiring demand / routing capacity.
contribution #
data #
DAC2012 contest benchmark
http://archive.sigda.org/dac2012/contest/dac2012_contest.html
OpenROAD dataset
place via DREAMPLACE
Macros and terminals are removed from the graph
Nets with degree more than 10 are excluded from the final graph as they introduce cliques too large to work with efficiently.
node features (pin number, cell size) , This follows the flow of CongestionNet
flow #
congestion value for each grid cell is computed as the wiring demand divided by the routing capacity; the output along the z-axis is reduced by a max function (sketched below)
Our focus is on predicting congestion due to local logic structure, which manifests itself on lower metal layers. Therefore, we use congestion labels from the lower half of the metal layers to train and evaluate the model
At inference, the predictions of all cells within a grid cell are averaged.
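A small numpy sketch of the label construction described above: per-layer congestion = wiring demand / routing capacity, reduced along the layer axis by max, restricted to the lower half of the metal layers as the note states. Array names and shapes are illustrative assumptions.

```python
# Hedged sketch of the label pipeline described above: congestion = demand /
# capacity per (layer, y, x) grid cell, reduced along the layer axis by max.
# Using only the lower half of the metal layers follows the note.
import numpy as np

def congestion_label(demand, capacity, lower_half_only=True):
    """demand, capacity: arrays of shape (num_layers, H, W)."""
    cong = demand / np.maximum(capacity, 1e-6)   # avoid divide-by-zero
    if lower_half_only:
        cong = cong[: cong.shape[0] // 2]        # lower metal layers only
    return cong.max(axis=0)                      # reduce z-axis with max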
principle
Proposes that the more closely two nodes are connected, the more similar they should be
Proposes structural node similarity
Sub-graph partition ? METIS? ClusterGCN?
Matrix Factorization ?
model #
The key difference between this approach and CongestionNet lies in the embedding pipeline.
The graph is undirected; the complete circuit is too large for direct matrix factorization and must be partitioned into clusters, using the METIS partitioning tool as in ClusterGCN.
Sub-graph partition: clusters of ≈ 5000 nodes each
Matrix Factorization ?
experiment #
three metrics of correlation to measure performance: Pearson, Spearman, Kendall
Before evaluation, both the prediction and the label have some (very low) noise added to them.
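A minimal scipy sketch of the three reported correlation metrics; adding a small amount of noise to both prediction and label follows the note (it breaks ties in otherwise discrete values), and the noise scale here is an assumption.

```python
# Sketch of the three correlation metrics used for evaluation. The tiny noise
# added to prediction and label follows the note; the 1e-6 scale is assumed.
import numpy as np
from scipy import stats

def report_correlations(pred, label, noise=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    pred = pred + rng.normal(0, noise, size=pred.shape)
    label = label + rng.normal(0, noise, size=label.shape)
    return {
        "pearson": stats.pearsonr(pred, label)[0],
        "spearman": stats.spearmanr(pred, label)[0],
        "kendall": stats.kendalltau(pred, label)[0],
    }
```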
PGNN-DRVs prediction+Pin Proximity Graph-ICCAD-2022-GNN+UNet(CNN)-Korea #
background #
(1) pin accessibility and (2) routing congestion are two major causes of DRVs (design rule violations)
Particularly, the complex design rules put a heavy burden on physical design, demanding many iterations of the time-consuming cell placement and net routing process to clean up all DRVs (design rule violations) before tapeout. Thus, if we could identify with high confidence, at the placement stage, the DRC (design rule check) hotspots that are likely to occur at the routing stage, we could pay more attention to optimizing the placement before routing.
shortcomings of image-based approaches:
local pin accessibility cannot be accurately modeled by pin pattern images alone
using high-resolution pin pattern images incurs significant additional run-time and memory overhead for the prediction models
task #
a novel ML based DRC hotspot prediction technique,
- GNN is used to embed pin accessibility information, U-net is used to extract routing congestion information from grid-based features
- the placement is divided into a grid whose cell width/height equals one G-Cell
- DRVs are extracted as the ground-truth after detailed routing
contribution #
- GNN model, base pin proximity graph
model #
PGNN can adopt pin proximity graph as well as grid-based feature map as input feature
Pin Proximity Graph :
- undirected, homogeneous graph
U-Net:
features:
overall model:
dataset:
(a reusable trick: the same benchmark with different config parameters yields different data samples)
experiment #
Nangate 15nm library
9 groups are used for training and the remaining 1 group for test (K-fold cross-validation)
What do positive and negative mean here?
visualization:
ablation study:
(feature ablation could be reused this way in future work?)
comparison experiments (F1-score):
note: no global routing is required!
GR-Cong is obtained from ICC2 after the global routing stage, and grids with high routing congestion are classified as DRC hotspots. (commercial flow)
RouteNet and J-Net are the related academic baselines.
runtime comparison:
LHNN-CongestionPrediction-DAC-2022-GNN-CUHK+Huawei+YiboLin #
background #
- the way graph nodes are defined is novel
- with the growth of circuit scale and complexity, time consumption tends to be unacceptable when utilizing a global router in the placement cycle to obtain the congestion map.
- due to the need for the “shift-left” in circuit design, researchers begin to seek alternative solutions in machine learning [4] [5] to achieve accurate and fast congestion map prediction
task #
- two related tasks, routing demand regression and congestion classification
data #
regard each G-cell as a node and add an edge between two nodes if the respective two G-cells are adjacent.
hypergraph / heterogeneous graph with two node types: G-cell and G-net
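A hedged DGL sketch of the heterogeneous graph described above: G-cell nodes with adjacency edges between neighboring grid cells, plus G-net nodes connected to the G-cells they cover. The edge-type names and the 4-neighborhood adjacency are assumptions, not LHNN's exact construction.

```python
# Hedged sketch of an LHNN-style heterogeneous graph: "gcell" nodes linked by
# grid adjacency, and "gnet" nodes linked to the G-cells they cover.
# Edge-type names and 4-neighborhood adjacency are assumptions.
import dgl
import torch

def build_lattice_hypergraph(grid_w, grid_h, net_to_cells):
    """net_to_cells: entry i is the list of G-cell ids that net i covers."""
    src, dst = [], []
    for y in range(grid_h):                       # 4-neighborhood adjacency
        for x in range(grid_w):
            u = y * grid_w + x
            if x + 1 < grid_w:
                src += [u, u + 1]; dst += [u + 1, u]
            if y + 1 < grid_h:
                src += [u, u + grid_w]; dst += [u + grid_w, u]
    net_src, cell_dst = [], []
    for net_id, cells in enumerate(net_to_cells):
        for c in cells:
            net_src.append(net_id); cell_dst.append(c)
    return dgl.heterograph({
        ("gcell", "adjacent", "gcell"): (torch.tensor(src), torch.tensor(dst)),
        ("gnet", "covers", "gcell"): (torch.tensor(net_src), torch.tensor(cell_dst)),
        ("gcell", "covered_by", "gnet"): (torch.tensor(cell_dst), torch.tensor(net_src)),
    })
```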
feature:
ISPD 2011 [16] and DAC 2012 [17] contest benchmarks ,
model #
The congestion map here is a binarized (0/1) label, so it is a classification task; but to exploit the data and avoid losing the routing-demand information, an auxiliary routing-demand regression task is also set up.
experiment #
15benchmarks: 10 for training and 5 for testing
run DREAMPlace [18] on each of the designs to generate placement solutions
NCTU-GR 2.0 [2] to attain horizontal/vertical routing demand maps , and set the congestion maps as a binary indicator according to whether the horizontal/vertical routing demand of the G-cell exceeds the circuit’s capacity
comparison experiments:
visualization:
ablation study:
-NN Robustness improve-arXiv-2024- -UC- #
background #
- Recent work has shown that neural networks are often vulnerable to small, carefully chosen perturbations of their input.
- Our definition of imperceptibility is characterized by a guarantee that a perturbation to a layout will not alter its global routing
- recent work [10, 18] has demonstrated that image classifiers can be fooled by small, carefully chosen perturbations of their input
task #
- design two efficient methods for finding perturbations that demonstrate brittleness of recently proposed congestion predictors
- one potential approach to address the issues by modifying the training procedure to promote robustness
contribution #
Painting on Placement-predict the routing congestion-ACM-2019-GAN-
-DRC Hotspot Prediction-ISCAS-2021-CNN
-Routing Congestion Prediction-ASPDAC-2020-GAN
-predict #DRV, a macro placer-DATE-2019-CNN
Timing #
background #
TimingGCN-STA prediction-DAC-2022-GNN #
- the first work!
- opensource
- still relies on local net/cell delay prediction as auxiliary tasks
- no optimization, not fit the real-world scenario where timing optimization is taken into account
PreRoutGNN-STA prediction-AAAI-2024-GNN
- opensource
[Multimodal Fusion-Restructure tolerant+CNN+Endpoint-wise Masking4Layout -DAC-2023-GNN+CNN-7nm RISCV](D:\MyNotes\EDA\Timing\Multimodal Fusion-Pre Route Timing Prediction-DAC-2023-GNN-7nm RISCV.pdf) #
Restructure: the model predicts endpoint arrival times, but timing optimization changes the netlist structure (the endpoints stay the same). For a pre-routing task, the input netlist therefore differs from the final netlist.
netlist restructuring causes a mismatch between local input features and ground-truth features in the restructured sub-regions
As a result, prior local-view models can only be trained on the unchanged regions in a semi-supervised manner.
In other words, the better the models fit on labeled (unreplaced) net/cell delays, the worse they fit on replaced regions and eventually on endpoint arrival time
dataset: basic statistics plus the netlist changes caused by timing optimization
- average 40% nets and 21% cells are replaced during timing optimization
- timing optimization brings an average change of 59.6% to net delays and 33.3% to cell delays
Why layout information is used: since most timing optimization techniques include gate insertion or gate sizing, placement should reserve space for subsequent timing optimization. In other words, the timing optimizer's efficacy is tied closely to global layout information; layout information plays a dominant role in determining the optimizer's impact since most optimization techniques need space to be applied.
overall model
components: GNN + CNN + Endpoint-wise Masking
- Netlist (GNN):
very similar to TimingGCN-STA prediction-DAC-2022-GNN (no obvious differences found)
Layout (CNN + Endpoint-wise Masking)
three features: cell density, rectangular uniform wire density (RUDY), and the macro-cell region
Endpoint-wise Masking
comparison experiments:
runtime experiments
other #
Ahead RC network-STA prediction-DAC-2022-?
Doomed Run Prediction-TNS prediction-ACM-2021-GNN+RNN
not DL #
The two-stage approaches [2], [3] first predict local net/cell delays and then apply PERT traversals [5] to evaluate the global timing metrics, i.e., endpoint arrival time.
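A minimal sketch of the PERT-style traversal these two-stage approaches rely on: once per-arc delays are available, the endpoint arrival time is the longest-path accumulation over a topologically ordered timing graph. The graph representation here is an assumption.

```python
# Hedged sketch of a PERT-style traversal: given arc delays on a timing DAG,
# propagate arrival times in topological order; each endpoint's arrival time
# is the max over its incoming arcs. Graph format is an assumption.
from collections import defaultdict, deque

def pert_arrival(num_nodes, arcs, primary_inputs):
    """arcs: list of (src, dst, delay); primary_inputs: nodes with arrival 0."""
    fanout, indeg = defaultdict(list), [0] * num_nodes
    for u, v, d in arcs:
        fanout[u].append((v, d))
        indeg[v] += 1
    arrival = [float("-inf")] * num_nodes
    for pi in primary_inputs:
        arrival[pi] = 0.0
    queue = deque(n for n in range(num_nodes) if indeg[n] == 0)
    while queue:
        u = queue.popleft()
        for v, d in fanout[u]:
            arrival[v] = max(arrival[v], arrival[u] + d)   # longest-path update
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return arrival
```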
Optimization #
Timing #
TSteiner - Steiner Points Opt-DAC-2023-GNN-CUHK #
background #
For multi-pin nets, a Steiner tree must be constructed for routing, so the Steiner points in the tree also affect routing.
FLUTE [3] is a commonly used Steiner-tree construction algorithm. After the Steiner tree is generated, timing can be further optimized by refining the Steiner points.
Previous early-stage timing optimization works only focus on improving early timing metrics; strategies such as net weighting and differentiable timing objectives optimize pre-routing timing metrics, which may have a considerable gap to sign-off timing performance. Steiner points are closer to the routing stage (more correlated with routing).
all the aforementioned works are not directly targeted at sign-off timing performance due to its high acquisition cost
task:
In this paper, we focus on explicit sign-off timing optimization at the pre-routing stage to reduce the turnaround time
optimization framework is built to adjust Steiner point positions for better sign-off timing performance iteratively
The most popular Steiner minimum tree construction algorithms aim to minimize wirelength. Moreover, the Steiner point refinement is introduced to update the generated Steiner point positions for specific objectives, e.g., sign-off timing performance, while maintaining the two-pin net connections
observation:
we surprisingly find that the sign-off timing performance can be significantly affected even by a random disturbance of Steiner point positions, as shown in Fig. 2.
Nevertheless, the impact of random moving is quite unstable, and its average improvement is slight (ratio close to 1.0), which motivates finding a principled way to update Steiner points to reduce TNS.
In the most widely used technology nodes, net delay (the timing metric most correlated with path length) does not explain most of the overall timing performance; yet the Steiner-tree initialization methods used here all optimize for minimum path length.
contribution: #
- first earlystage timing optimization framework via Steiner point refinement
- GNN
- TSteiner framework is fully automated with an adaptive stepsize scheme and the auto-convergence scheme
- improves 11.2% and 7.1% on average (up to 45.8% and 43.9%) for WNS and TNS
model:
Steiner tree construction decomposes each multi-pin net into a set of two-pin nets via additional Steiner points before global routing to reduce the problem complexity
The proposed framework can be divided into two stages, sign-off timing gradient generation (Section III-A) and concurrent Steiner point refinement (Section III-B)
Compared with TimingGCN, the only addition is the Steiner nodes; the node-embedding part of stage one is extended to include them.
In practice:
the optimization objective is a weighted combination of WNS and TNS
gradient descent is applied to the Steiner point coordinates with respect to this objective (see the sketch below)
compared with plain gradient descent, the framework mainly removes the need for manual per-benchmark learning-rate tuning
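A minimal sketch of the refinement loop described above: treat the Steiner point coordinates as trainable parameters and descend on a weighted WNS/TNS proxy produced by a differentiable timing model. The `timing_model` interface, the weights, the optimizer, and the sign convention are all assumptions, not TSteiner's actual implementation.

```python
# Hedged sketch of TSteiner-style refinement: Steiner point coordinates are a
# trainable tensor; a differentiable timing proxy (placeholder) supplies
# gradients of a weighted WNS/TNS objective. Everything here is an assumption.
import torch

def refine_steiner_points(steiner_xy, timing_model, steps=50, lr=1.0,
                          w_wns=0.5, w_tns=0.5):
    """steiner_xy: (N, 2) tensor of Steiner point coordinates."""
    pos = steiner_xy.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([pos], lr=lr)
    for _ in range(steps):
        wns, tns = timing_model(pos)          # assumed differentiable proxy
        loss = -(w_wns * wns + w_tns * tns)   # WNS/TNS are slacks: maximize them
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pos.detach()
```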
data
experiments
Placement #
-Pin Accessibility+DRV prediction-DAC-2019-CNN-NTU #
background #
Standard cells on the lower metal layers severely suffer from low routability due to high pin density, low pin accessibility, and limited routing resources.
It can be observed that the access points of pin B are blocked by the metal 2 (M2) routing segments routed from Pin A and Pin C, so an M2 short design rule violation (DRV) will be induced when dropping a via12 on Pin B. pin accessibility is not only determined by cell layout design but also strongly affected by adjacent cells
Two drawbacks of traditional approaches:
- Cell libraries provided by foundries should not be considerably redesigned because the optimized cell performance and manufacturability may be highly sensitive to cell layouts
- Deterministic approaches based on human knowledge have been shown to be less effective in advanced nodes for optimization problems such as DRV prediction and minimization because of the extremely high complexity through the overall design flow
It can be observed that most of the congested regions in the layout do not have DRVs, while some regions with DRVs are not that congested. (I still feel there is some correlation; the point is that congestion does not necessarily cause DRVs, while uncongested regions can still have DRVs due to poor pin accessibility.)
- i.e., congested regions do not necessarily have DRVs, and uncongested regions may still have DRVs because of poor pin accessibility
- the two M2 shorts occur at the locations having the same pin pattern in the top cell-row and mid cell-row
task #
DRV prediction: binary classification
pin accessibility optimization: given a legalized placement, a detailed-placement algorithm reduces bad pin accessibility
so effectively one prediction model plus one optimization model
contribution #
- first work to apply pin patterns as the input features of DRV prediction models.
flow #
model:
PPR&DFPPR:
Model-guided Detailed Placement :
Dynamic Programming-based Placement Blockage Insertion
- (does it also change cell orientation?)
Cell Displacement Refinement
data #
Both the width and height of each pixel are set as the minimum spacing of the M1 layer in order to prevent a pixel from being occupied by two different pins.
No description of the benchmarks found.
experiment #
shortcoming:
- flow need routed designs to train, time
- The trained model is not necessarily applicable to other designs using different cells or different reference cell libraries
- For VLSI-scale designs, processing row by row and pair by pair may be slow?
-Pin Accessibility+activ-ISPD-2020- -NTU+Synopsys #
background #
With the development of advanced semiconductor process nodes, the pin access problem has become one of the major factors impacting the occurrence of design rule violations (DRVs), due to complex design rules and limited routing resources.
Supervised learning approaches extract the labels of training data by generating a great number of routed designs in advance, giving rise to large effort on training data preparation; moreover, the pre-trained model can hardly predict unseen data.
Unlike most existing studies that aim at design-specific training, this work proposes a library-based model which can be applied to all designs referencing the same standard cell library set.
Due to the shrinking of modern semiconductor process nodes, the pin access problem of standard cells has become much harder to cope with, especially on the lower metal layers.
In this placement, the Metal1 pins A/B each have Metal2 pins on their left and right and can only be reached horizontally along the yellow track (Metal1 cannot be used for routing?), so connecting pins A/B through via12 will inevitably cause a short.
Two drawbacks of the 2019 work [5]:
- flow need routed designs to train, time
- The trained model is not necessarily applicable to other designs using different cells or different reference cell libraries
contribution #
- first work on cell library-based pin accessibility prediction (PAP), which can be applied to predict other designs referencing the same cell library set
- applies active learning to train the PAP model
- the proposed cell library-based PAP model can be trained at an earlier stage of a process development flow: as soon as the cell libraries are provided.
Placement Optimization with Deep Reinforcement Learning- -ISPD-2020-RL+GNN-Google #
PL GNN-Affinity Aware for ICC2- ISPD-2021-GNN-Atlanta #
background: #
Placement is one of the most crucial problems: it directly impacts the final quality of a full-chip design.
VLSI designs require multiple placement iterations to optimize key metrics (WL, timing), which is time-consuming and computationally inefficient.
the logical affinity among design instances dominates the quality of the placement
(does the term "logical affinity" originate from this paper?)
performing placement guidance requires in-depth design-specific knowledge, which is only achievable by experienced designers who know the underlying data flows at the Register-Transfer Level (RTL) well
K-means basics:
task #
- based on the netlist and the floorplan result (macros already placed)
- provides placement guidance (grouping information) for commercial placers (ICC2), by generating cell clusters based on logical affinity and manually defined attributes of design instances
- the framework determines the cell clusters in an unsupervised manner; they serve as placement guidance so that commercial placers optimize key metrics such as wirelength, power, and timing by placing cells of a common cluster together
flow #
Two stages:
GNN performs unsupervised node representation learning (generalizable to any design)
a weighted K-means clustering algorithm [3] groups instances into different clusters. To find the optimal number of clusters, the Silhouette score [19] is introduced and sweeping experiments find the sweet spot (see the sketch below).
The basic idea of K-means: iteratively partition the data into K clusters so that the total distance between each point and the centroid of its cluster is minimized.
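A hedged scikit-learn sketch of this clustering stage: sweep the number of clusters over the learned GNN embeddings and keep the value with the highest Silhouette score. Plain KMeans with sample weights is a stand-in for the paper's weighted K-means variant [3]; the K range is an assumption.

```python
# Hedged sketch of the clustering stage: sweep K over the GNN embeddings and
# pick the K with the best Silhouette score. sklearn KMeans with sample
# weights stands in for the paper's weighted K-means [3].
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_instances(embeddings, weights=None, k_range=range(2, 16)):
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=0)
        labels = km.fit_predict(embeddings, sample_weight=weights)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```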
data #
two multi-core CPU designs:
nf
design hierarchy: derived from the netlist hierarchy, e.g. top/inst1/sky130_INV/A (with zero-padding)
logical affinity of memory macros: logical levels to memory macros M as features, because the logic-to-memory paths are often the critical timing paths
ef:
model #
GraphSAGE-based, two layers
Loss Function:
Silhouette score
used to evaluate the clustering result; sweep the number of clusters and pick the highest-scoring one
experiment #
env:
- 2.40 GHz CPU
- NVIDIA RTX 2070
- 16 GB memory
- PyTorch Geometric
setting:
- the placement of memory macros is achieved manually based on design manuals provided by the design-house
- Adam
result
Louvain: the baseline model in the comparison experiment
Questions:
few benchmarks
does the swept cluster count generalize to all designs?
open-loop (no feedback from placement results)?
-Innovus PPA placement optimize-Neurips-2021-RL #
contribution: #
-GP Routability Opt-DAC-2021-FCN-CUHK(SitingLiu BeiYu)+Yibo Lin #
background #
flow #
- three input features are extracted from the cell placement solution
- Through the inference of the pre-trained routability prediction model, we get the predicted congestion map.
- take the mean squared Frobenius norm of this congestion map as the congestion penalty (sketched below)
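A hedged sketch of how such a congestion penalty could enter a differentiable placement objective: the mean squared Frobenius norm of the predicted congestion map is added to the wirelength term with a weight. The wirelength model, the predictor, and the weight are placeholders, not the paper's exact formulation.

```python
# Hedged sketch: add a congestion penalty (mean squared Frobenius norm of the
# predicted congestion map) to a differentiable placement objective.
# `wirelength_fn`, `congestion_predictor`, and lambda_cong are placeholders.
import torch

def placement_loss(cell_pos, wirelength_fn, congestion_predictor, lambda_cong=1.0):
    wl = wirelength_fn(cell_pos)                  # differentiable WL model
    cong_map = congestion_predictor(cell_pos)     # (H, W) predicted congestion
    cong_penalty = (cong_map ** 2).mean()         # mean squared Frobenius norm
    return wl + lambda_cong * cong_penalty
```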
data #
model #
GR #
PROS-Routability Optimization-ICCAD-2020-FCN-CNHK+Cadence #
background #
task #
- congestion predictor and parameter optimizer
- only the data from the placement
- it can optimize the cost parameters before the first routing iteration of GR and thus can give a better GR solution with less congestion.
contribution #
- with negligible runtime overhead
- plug-in
- can be embedded into the state-of-the-art commercial EDA tool (Cadence Innovus v20.1)
model #
data #
19 different industrial designs
With different placement parameters and rotations/flips (exploiting CNN symmetry for augmentation), there are 1664 design cases in total.
Feature Extraction
Horizontal/Vertical track capacity map
Cell density map
Flip-flop cell density map
Fixed cell density map
Cell pin density map
Pin accessibility map
Horizontal/Vertical net density map
Small/Large-net RUDY map
Pin RUDY map
a combination of cell pin density map and large-net RUDY map
Label Generation
PROS does not need very detailed congestion map
two-step smoothening process to convert raw data to desirable congestion labels
help to make the prediction task easier
if there are at least six congested G-cells out of the eight in the surrounding of a center G-cell д, д will be labeled as congested
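A small numpy sketch of the neighborhood rule just described: a center G-cell is labeled congested when at least six of its eight surrounding G-cells are congested. The preceding raw thresholding step is assumed to have produced the binary map; this is an illustration, not PROS's exact code.

```python
# Hedged sketch of the 6-of-8 neighborhood labeling rule: a center G-cell is
# marked congested when at least six of its eight surrounding G-cells are
# congested in the raw binary map (assumed to be computed beforehand).
import numpy as np
from scipy.ndimage import convolve

def smooth_congestion_labels(raw_binary_map, min_neighbors=6):
    kernel = np.ones((3, 3))
    kernel[1, 1] = 0                               # count the 8 neighbors only
    neighbor_count = convolve(raw_binary_map.astype(int), kernel,
                              mode="constant", cval=0)
    return (neighbor_count >= min_neighbors).astype(int)
```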
optimization principle
How are these two cost parameters modified inside Cadence? Presumably via internal interfaces (this is a Cadence paper)?
model #
experiment #
PROS 2.0 - Routability Opt+Route WL estimation-Trans-2023-CNN-CNHK+Cadence #
background #
- the amount of routing resources on a design is limited.
- The quality of a GR solution has a great impact on that of the resulted DR routing solution
- Congestion in a GR solution is one of the major causes of DRC violations in the DR solution since most of DRC violations are due to overcrowded wires and vias [1], [2]
- a better GR solution with less congestion is needed to lower the probability of getting DRC violations in advance.
- if the initial GR solution is not good and has a lot of congestion, the GR tool can hardly tackle the problem by rip-up and reroute.
- placement engines [3]–[5] which take routing congestion into consideration are applied
- FCN: FCNs are commonly used for per-pixel classification on images; they accept an arbitrary input size and produce an output of exactly the same size. GR congestion prediction can likewise be viewed as per-pixel binary classification (congested or not) on a chip of arbitrary size, so an FCN-based predictor applies naturally to PROS.
task #
- stage: post-placement, pre-route
- FCN-based GR congestion predictor: use the predicted GR congestion to optimize the cost parameters of GR.
- predictor-based parameter optimizer to generate a better GR solution. GR tools are driven by the cost parameters stored in each G-cell. When arriving at a G-cell g, the tool computes the cost, called the moving cost, to move to each of its neighboring G-cells and pushes these costs into a heap. With optimized cost parameters in the G-cells, the GR tool can find better paths and allocate the routing resources to each net more smartly. PROS optimizes two types of cost parameters based on the prediction result, the overflow cost and the wire/via cost, and adjusts them in the projected congestion regions on all layers.
  - overflow cost
  - wire/via cost: nets are divided into two groups (small/large) according to their BBox sizes.
    - Increasing the wire/via cost for small nets may be useless for congestion reduction and may even increase the wirelength or create new congestion due to detours out of the potential congestion region.
    - In contrast, increasing the wire/via cost for large nets can be helpful since they can select another route within their BBox to completely avoid the potential congestion region.
- CNN-based wirelength estimator: multiply the predicted wirelength ratio by the precomputed FLUTE wirelength (i.e., a scaling coefficient is learned). The lack of consideration of routing congestion in traditional methods is due to the difficulty of quickly obtaining accurate congestion estimation at the placement stage.
contribution #
- plug-in for Innovus: it can avoid extra runtime overhead of feature preparation
- industrial design suite
- advanced technology node
- SOTA
- high accuracy
- first work that utilizes GR congestion information to estimate routed wirelength at the placement stage
- PROS changes little in the original EDA flow
Overall Flow :
classification and regression
- F is the feature number.
- X_WL has two features, which are resized to 128 × 128 before prediction:
  - the predicted congestion map
  - the cell pin density map
data #
feature F
Horizontal/Vertical Track Capacity Map
Cell Density Map
Flip-Flop Cell Density Map
Fixed Cell Density Map
Cell Pin Density Map
Pin Accessibility Map
Horizontal/Vertical Net Density Map
Small/Large-Net RUDY Map
Pin RUDY Map ?
label
congestion label pre-process
PROS does not need a very detailed congestion map
ultimately all of this serves the optimization
model #
- DC: get more local information, but more GPU usage(acceptable)
- SUB: w*h*4c –>2w*2h*c.
- Compared with bilinear upsampling which is not trainable, subpixel upsampling can learn to recover the local information.
- Compared with deconvolution, subpixel upsampling is parameter free, so it will not significantly increase the training difficulty.
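A minimal PyTorch sketch of the SUB block described above: a subpixel (pixel-shuffle) upsampling step turning a (4c, h, w) tensor into (c, 2h, 2w). The surrounding convolution and the channel counts are illustrative assumptions, not the paper's exact layer configuration.

```python
# Hedged sketch of the subpixel ("SUB") upsampling described above:
# (B, 4c, h, w) -> (B, c, 2h, 2w) via pixel shuffle, which is parameter-free
# (unlike deconvolution) yet keeps learnable local detail in the preceding
# convolution. Channel counts are illustrative.
import torch
import torch.nn as nn

sub_block = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),   # produces 4c = 64 channels (c = 16)
    nn.ReLU(inplace=True),
    nn.PixelShuffle(upscale_factor=2),             # 64 x h x w -> 16 x 2h x 2w
)

x = torch.rand(1, 64, 32, 32)
print(sub_block(x).shape)                          # torch.Size([1, 16, 64, 64])
```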
dataset
industrial benchmark suite and the DAC-2012 benchmark suite (19 benchmarks)
the industrial suite is augmented with 11 different placement parameter settings plus flips and rotations, giving 1664 cases in total (≈ 19 × 11 × 8)
DAC-2012: 20 different placements
(4, 4, 4, 4, 3) split for 5-fold cross-validation
experiment #
env
- Tensorflow
- Intel Xeon CPUs at 2.2 GHz
- 256 GB memory
- NVIDIA TITAN V GPU
setting
Adam
One entire training process of the congestion predictor takes only 25 training epochs! (surprisingly fast convergence)
congestion classification prediction
compare with PROBABILISTIC METHODS
DR optimization results
wirelength estimation
Runtime
DR #
-Detailed Router-DATE-2021-RL #
[DPRouter-Detail Routing(package design) Opt+net order decision-ASPADC-2023-RL(MARL)-diagonally route](“D:\MyNotes\EDA\Routing\DPRouter-Detail Routing(package design) Opt+net order decision-ASPADC-2023-RL(MARL)-diagonally route.pdf”) #
Background
- most time-consuming stages in the package design flow
- package designs have fewer layers; thus, we need to prevent net crashing cautiously
contribution:
- redefine the routing area and shrink the routing problem by dividing the entire design into non-overlapping boxes
- use DRL, not heuristic
- improve the number of design rule violations (DRVs), wirelength, and layout pattern.
task
- 2-pin nets
Initial routing: ignores the number of bends and allows design rule violations
Model
multi-agent deep reinforcement learning (MARL) task [15] for asynchronous routing planning between nets. We regard each net as an agent, which needs to consider the actions of other agents while making pathing decisions to avoid routing conflict
route and slide the window repeatedly. advantage of box:process every box independently
sequential routing
the repulsion point will be moved from the inner ring to the outer one until the box is successfully routed.
detailed algorithm:
sequential routing
Refinement
-Detail routing+match+Opt-ISPD-2023-RL+GNN-FinFET #
background: #
custom circuits: a custom detailed router cannot adopt specialized layout strategies for specific circuit classes the way human layout experts do
The paper repeatedly emphasizes the matching problem:
contribution #
- optimizes routing, FinFET, sign-off solution
- heterogeneous graph
- A rip-up and re-routing scheme
- can easily adapt to future design constraints
three categories of routing methodologies
- Template-based methods
- manual design
- suffers from scalability issues
- Simulation-based techniques
- provide accurate performance feedback and can be generalized to consider various performance metrics (e.g., phase margin, power dissipation) across circuit classes
- long execution time and resource-hungry computations
- Constraint-based approaches
- widely adopted in existing custom routing studies
PR Tools #
Placer #
Chip Placement with Deep Reinforcement Learning-marcro-arXiv-2020-RL #
- first explores the application of artificial intelligence in solving placement with the attempt to ease the difficulties of manual effort, which may indicate a new development stage for physical design
Differentiable-Timing-Driven Global Placement-global placement-DAC-2022-GNN- #
Polar 2.0 #
An effective routability-driven placer
cells that are estimated to have high congestion are spread out and inflated to distribute routing demand more evenly.
NTUPlace3 #
DeepPlace #
flow #
RePlAce–TCAD-2018- #
DREAMPlace-GP-DAC+TCAD+ICCAD+DATE-2019~2023 #
introduction
Over 30X speedup over the CPU implementation (RePlAce) is achieved in global placement and legalization on ISPD 2005 contest benchmarks.
DREAMPlace runs on both CPU and GPU. If it is installed on a machine without GPU, only CPU support will be enabled with multi-threading.
DREAMPlace also integrates a GPU-accelerated detailed placer, ABCDPlace, which can achieve around 16X speedup on million-size benchmarks over the widely-adopted sequential placer NTUPlace3 on CPU.
Publications
- Yibo Lin, Shounak Dhar, Wuxi Li, Haoxing Ren, Brucek Khailany and David Z. Pan, “DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement”, ACM/IEEE Design Automation Conference (DAC), Las Vegas, NV, Jun 2-6, 2019 ( preprint) ( slides)
- Yibo Lin, Zixuan Jiang, Jiaqi Gu, Wuxi Li, Shounak Dhar, Haoxing Ren, Brucek Khailany and David Z. Pan, “DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2020
- Yibo Lin, Wuxi Li, Jiaqi Gu, Haoxing Ren, Brucek Khailany and David Z. Pan, “ABCDPlace: Accelerated Batch-based Concurrent Detailed Placement on Multi-threaded CPUs and GPUs”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2020 ( preprint)
- Yibo Lin, David Z. Pan, Haoxing Ren and Brucek Khailany, “DREAMPlace 2.0: Open-Source GPU-Accelerated Global and Detailed Placement for Large-Scale VLSI Designs”, China Semiconductor Technology International Conference (CSTIC), Shanghai, China, Jun, 2020 ( preprint)(Invited Paper)
- Jiaqi Gu, Zixuan Jiang, Yibo Lin and David Z. Pan, “DREAMPlace 3.0: Multi-Electrostatics Based Robust VLSI Placement with Region Constraints”, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov 2-5, 2020 ( preprint)
- Peiyu Liao, Siting Liu, Zhitang Chen, Wenlong Lv, Yibo Lin and Bei Yu, “DREAMPlace 4.0: Timing-driven Global Placement with Momentum-based Net Weighting”, IEEE/ACM Proceedings Design, Automation and Test in Europe (DATE), Antwerp, Belgium, Mar 14-23, 2022 ( preprint)
- Yifan Chen, Zaiwen Wen, Yun Liang, Yibo Lin, “Stronger Mixed-Size Placement Backbone Considering Second-Order Information”, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Francisco, CA, Oct, 2023 ( preprint)
Architecture
flow #
GR_outdated #
FastRoute1.0—2006 #
- proposed a simple way to construct a congestion-driven Steiner tree and an edge-shifting technique to further refine it
fastroute 2.0-Monotonic–2007 #
- monotonic routing to explore all shortest routing paths for two-pin connections.
task #
flow #
fastroute 3.0-virtual capacity-ICCAD-2008- #
fastroute 4.0-via min tree+3 bending-ASPDAC-2009- #
layer assignment (?)
MaizeRouter- #
- 2nd place of ISPD 2007 contest 2D GR
- 1st place of ISPD 2007 contest 3D GR
BoxRouter 1.0 #
- 3rd place of ISPD 2007 contest 2D GR
- 2nd place of ISPD 2007 contest 3D GR
- integer linear programming (ILP) based
FGR-3d-TCAD-2008- #
- 1st place of ISPD 2007 contest 2D GR
- 3rd place of ISPD 2007 contest 3D GR
-Layer assignment+Via minization-Trans-2008-DP-NTHU #
- Congestion-Constrained Layer Assignment for Via Minimization in Global Routing
- CUGR’s rely work
- ISPD07 contest后的一个跟进工作
- 也没提到maze routing
- 没定义wire cost, 在每一对GCell之间layer assignment, 慢?
- 第一次用DP?
background #
there are two main approaches
3D
: route all nets directly on the multilayer solution space. Because this approach directly generates a multilayer global routing result, it can take the via cost into account during construction. However, this method may cost too much CPU time with a large problem size. (现在都用GPU做并行了,这种方法就变多了)such as
2D + layer assigment
: The other approach is to first compress a multilayer grid graph into a one-layer grid graph, then use a one-layer router to solve the one-layer global routing problem, and finally perform layer assignment to assign each wire in the multilayer grid graphThe edges corresponding to vias disappear in the one-layer grid graph. The capacity of each edge in the one-layer grid graph is obtained by accumulating the corresponding edge capacities in the three-layer grid graph
This approach can take advantage of many current full-fledged one-layer routers, e.g., [2]–[4], and use an affordable run time to generate an initial one-layer routing result. 本文主要针对layer assignment. 注意layer assignment 是对二维的所有边进行层分配。
vias not only degrade the reliability and the performance of a design but also increase the manufacturing cost.
previous work’s layer assignment use greedy heuristics [8] or time-consuming integer linear programming methods [9] to minimize the via cost.
For such sequential approaches, net order still matters: nets routed earlier see less congestion, so net order is important.
task and contribution:
- this paper does not consider preferred directions ("To simplify the presentation of our algorithm, we do not make any assumption about the preferred routing direction for each layer in the layer assignment problem."), though it notes the work extends easily to the preferred-direction case
- follows the ISPD07 contest and assumes via capacity is unlimited (CUGR explicitly does not make this assumption)
- based on a one-layer routing result
- minimize via cost, WL, and congestion overflow
- propose a polynomial-time algorithm: first generate a net order, then solve the layer assignment problem
- can improve the results of the three winners of the ISPD07 contest
model #
COngestion-constrained Layer Assignment (COLA)’s submodule
Net order generation
The net order has a direct influence on the utilization of routing resources, so it is one of the key parts of COLA.
Nets are scored to decide the order.
Note: the shorter the wirelength, the higher the score, and the earlier the net is routed. Explanation:
Remove Cycles
Arbitrarily remove.
(Why do cycles appear after projecting onto one layer? How are the initial connections built? Not stated. FLUTE only came out in 2008, so it may not have been used at the time.)
Single-net layer assignment (SOLA+APEC)
SOLA(Singlenet Optimal Layer Assignment)
determines an optimal layer assignment result without considering congestion constraints for a given net
dynamic programming technique
Without considering congestion, this method obtains the best (minimum via count) quality.
step:
Step 1: for the tree on layer 1, randomly select a pin as the root, then use BFS or DFS to obtain a queue and hence the edge order; the tree becomes a DAG.
Step 2: in Fig. 5(c), the parent of a is p2; define mvc(v, r) (minimum via cost).
Step 3:
for pins that have no child, mvc:
for pins that have a child and are not the root:
this recurrence essentially fixes the layer of each vertex's downstream edge; e.g., if the minimum is attained at mvc(v, r), then edge e(v, ch(v)) is assigned to layer r
for the root:
the difference is excluding r in ∆
because mvc(v, r) does not depend on the value of r when v is the root, we have mvc(v, 1) = mvc(v, 2) = · · · = mvc(v, k)
APEC(Accurate and Predictable Examination for Congestion constraints)
can detect and prevent any congestion constraint violation in advance
prevention condition:
If any edge compressed onto layer 1 violates either of these two conditions, then the layer assignment (SOLA) result of that edge cannot satisfy the congestion constraint.
SOLA+APEC always finds a layer assignment result satisfying both prevention conditions for each net
COLA
data #
six-layer benchmarks from ISPD’07
GRIP-3d+IP-DAC-2009 #
integer programming based
3d: solve the 3D problem directly on the 3D routing grids,
slow: Although theoretically the direct 3D technique should produce better solutions, in practice it is less successful in both solution quality and runtime than 2D routing with layer assignment –cite–> [Fastroute4.1]
slow: Although we see solutions with shorter wirelength generated by full-3D concurrent approach like GRIP [21], that solution quality is achieved by impractically long runtime –cite–> [Fastroute4.1]
multi-level (coarsened and fine-grained)
FastRoute4.1-an efficient and high-quality global router-2012 #
https://dl.acm.org/doi/abs/10.1155/2012/608362
background #
FastRoute is a global routing tool for VLSI back-end design. It is based on sequential rip-up and re-route (RRR) and a lot of novel techniques. FastRoute 1.0 first uses FLUTE to construct congestion-driven Steiner trees, which will later undergo the edge shifting process to optimize tree structure to reduce congestion. It then uses pattern routing and maze routing with logistic function based cost function to solve the congestion problem. FastRoute 2.0 proposed monotonic routing and multi-source multi-sink maze routing techniques to enhance the capability to reduce congestion. FastRoute 3.0 introduced the virtual capacity technique to adaptively change the capacity associated with each global edge to divert wire usage from highly congested regions to less congested regions. FastRoute 4.0 proposed via-aware Steiner tree, 3-bend routing and a delicate layer assignment algorithm to effectively reduce via count while maintaining outstanding congestion reduction capability. FastRoute 4.1 simplifies the way the virtual capacities are updated and applies a single set of tuning parameters to all benchmark circuits.
model #
flow #
NTHU Route 1.0- -TVLSI-2010- #
NTHU Route 2.0- -TCAD-2013 #
2D
NCTU GR 1.0-3D-congestion relaxed layer assignment- 2011- #
- it improved the scheme to estimate the realtime congestion more accurately by using a history term that will gradually wear off as the number of iterations increases if the overflow disappears.
NCTU GR 2.0-Multithreaded Collision Aware- CAD-2013- #
people.cs.nycu.edu.tw/~whliu/NCTU-GR.htm
PengjuY/NCTU-GR2: This is a binary file of NCTUgr2, which is a global router
- net-level parallel method
- 2D
BoxRouter 2.0 #
background #
task #
a 2D router
integer linear programming based
OGRE- new cost function- -2019- - #
- Open source!
- LEF/DEF-based
- 3D
- uses older methods, but explains them quite clearly
- developed by a group of undergraduate students as a course project.
SPRoute 1.0: A Scalable Parallel Negotiation-based Global Router-ICCAD-2019 #
net-level multi-threaded parallel acceleration of maze routing
negotiation-based rip-up and reroute, two-phase maze routing; resolves the livelock issue (on CPU)
open source
introduced a concept called soft capacity to reserve routing space for detailed routing, and explored several parallelization strategies to speed up global routing. The parallelism is on CPU; the paper spends a lot of time on locking issues that I did not fully follow, so see SPRoute 2.0.
background #
overall
In many global routers, maze routing is the most time-consuming stage.
challenge
Because of this phenomenon, multi-threading can actually become slower.
principle
Galois system
Net-level Parallelism
Fine-grain Parallelism
data #
ISPD 2008 contest
GR_Adv_RL #
-DRL method-2019-DRL- #
- first DRL related work
- proves its overall performance is better than the sequential A∗ algorithm.
task #
- DRL(DQN) for global route
- does not use real-world designs
example:
from A to B
read means over flow
pipeline
model #
state:
- (pos_x/y/z, distance_x/y/z, surrounding capacities): this encoding can be viewed as a mix of current state, navigation, and local capacity information
action: the six directions (up/down/left/right/forward/backward)
reward:
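A hedged sketch of the state/action interface described above: a 12-dimensional state (position, distance to the target, capacities in six directions) fed to a small Q-network over six movement actions. The layer sizes are assumptions, and the reward scheme is intentionally left out since the note does not record it.

```python
# Hedged sketch of the DQN interface described above: a 12-d state
# (pos x/y/z, distance x/y/z, capacity in 6 directions) and a Q-network over
# the 6 movement actions. Layer sizes are assumptions, not the paper's design.
import torch
import torch.nn as nn

class RoutingQNet(nn.Module):
    def __init__(self, state_dim=12, num_actions=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),        # Q-value per direction
        )

    def forward(self, state):
        return self.net(state)

state = torch.tensor([[1., 2., 0.,  3., -1., 1.,  4., 4., 0., 2., 3., 1.]])
q_values = RoutingQNet()(state)
action = q_values.argmax(dim=-1)                   # greedy direction choice
```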
contribution #
first deep learning model for global routing
-DRL-arxiv-2021-JP #
- -WCMC-2023-DRL-FuZhouU-Genggeng Liu #
- most of the existing methods are heuristic algorithms, which cannot conjointly optimize the subproblems of global routing, resulting in congestion and overflow
- Difference between DRL and RL: RL often faces the problem of an excessive number of states in high-dimensional spaces. With the development of deep learning, Deep Reinforcement Learning (DRL) combines artificial neural networks with RL [10], making it possible for RL to solve policy decisions in high-dimensional spaces.
background #
- this paper takes overflow as the main design goal and optimizes wirelength and congestion subject to zero overflow.
- Serial routing usually sorts nets in a specific order and routes them one by one; ==this method is fast== (relative to parallel combinatorial-optimization methods?). However, it is unfair: the earlier nets enjoy ample routing resources (large edge capacities in the routing region), while most later nets face tight resources, so serial routing usually has to rip up and reroute part of the nets.
- The parallel method routes multiple nets at the same time [21], solving the resource unfairness of the serial method, but it is ==often very time-consuming and even impossible to solve==, and is mainly based on the commodity flow model [22] and the ==integer linear programming== model [23].
contribution #
- use DDQN instead of DQN
- an action reduction method
- a concurrent training method
- solve the unfair resource allocation problem
- a new reward function
model #
input state: a 15-bit code; the starting point, the ending point, and the agent's position are each represented by a 3-bit code, and a 6-bit code represents the edge capacities in the six directions
output action: action-values for the 6 directions. Since each layer has a preferred direction, at most 4 actions are actually feasible; infeasible actions are masked out before the agent selects an action, to prevent it from executing redundant actions, storing redundant experience, and then learning redundant information.
reward:
If ed is higher than ec/2, a reward r < 0 is given; otherwise, a reward r ≥ 0 is given. (Is the formula in the paper wrong?)
uses a heuristic algorithm to search for paths in advance and seed them into the experience replay buffer (similar to pretraining) to speed up convergence
-DRL+segment based-ISEDA-2023-DRL+GNN-PEK #
- DRL(GAT)
- segment-based feature extraction
- pattern routing enhance
enhance:
- 3D?
- add connections between GCells?
- split horizontal/vertical layers as InstantGR does?
- put the capacity on the edges
background #
- many traditional global routing methods lack learning ability.
- more and more problems in physical design are searching for automated solutions based on machine learning. One popular application is to adopt machine learning to help early prediction
contribution #
- congestion-aware reinforcement learning model
- Integrating pattern routing with reinforcement learning
- Proposing a net segment mode
flow #
model:
GNN feature
- Node embedding.
- Pin number.
- Fly line number.
- Capacity value
- Bounding box number.
- Position correlation.
DRL(A3C)
- We set the policy network as a fully connected layer with 200 neurons and the value network as a fully connected layer with 100 neurons.
- Feature of net segments
- Net density value
- Congestion prediction value
- Capacity ratio value
data #
ISPD18 benchmark
experiment #
questions:
- the paper does not say what the prediction model's label is
- how is the RL parallelized exactly? the details are unclear to me
- -APCCAS-2024-DRL(DDQN)-CYCU #
- DRL-based A* search algorithm
- no pattern-routing stage
- essentially the 2019 work with DQN replaced by DDQN
- a rather unremarkable paper
background #
- aim to find better solutions to minimize total wire length (WL) and edge overflow (OF)
- current solutions mainly rely on heuristic-driven greedy methods, which primarily address situations with strict constraints on the problems to be solved, such as sequential network routing after network sorting [2].
- The A* algorithm is based on heuristic search, using a heuristic function to estimate the minimum cost from the current node to the target node. It can be used to find the shortest path from the starting point to the target pin.
GR_Adv_Gen #
-generative-arXiv-2019-CNN- #
- first CNN
-only CNN-DAC-2020-CNN(VAE)- #
- no experiment!
- using only a CNN, the classification results probably won't be good
- unclear what kind of paper this is; it is only two pages long
- evaluates its router on parts of the nets from a public benchmark layout and achieves 96.8% of routability
- it seems that the router can only route two- and three-pin nets, which may have some limitations for application.
background #
- this approach treats the global routing problem as an image processing problem and solves it with a deep learning system
data #
ISPD’98 ibm01 64x64 circuit
model #
PRNet- -NeurIPS-2022- -SJTU+Noah’s Ark #
- PRNet can generate each route in one shot but cannot guarantee connectivity, which requires considerable post-processing for failed routes
- HubRouter is a two-phase framework, whereas PRNet is end-to-end.
HubRouter-generative model-NeurIPS-2023-GAN+RL-SJTU #
- open source!
- a Chinese-language interpretation
- a global routing solver that includes a two-phase learning framework
- HubRouter is a two-phase framework, whereas PRNet is end-to-end.
- Compared as a generative model: PRNet injects the connectivity constraint into the training objective via a bidirectional mapping in its CGAN, improving accuracy by 10%, but this is almost ineffective in complex cases.
background #
Global routing (GR) is one of the most complex and time-consuming combinatorial problems in VLSI design. The GR objective is to minimize total wirelength while avoiding congestion; it is an NP-hard problem.
Traditional approaches are heuristics; growing design diversity and scale challenge them. Machine learning (ML) has already been used across chip design, from logic synthesis to placement.
Deep reinforcement learning (DRL) and generative models have been applied to global routing. However, DRL methods suffer from a large state space and often need to spend enormous time generating routes as the grid scale of the test instance (the netlist) increases, which is practically intimidating for real-world global routing.
In contrast, generative models have one-shot generation ability and are computationally more tractable.
Generative methods consider connectivity constraints during training to ensure routes satisfy circuit connectivity. The problem is that when the initially generated routes do not satisfy connectivity, the post-processing stage degenerates into an exhaustive search.
In Fig. 1, the top image shows the original routing and the bottom shows the generated routing; the generated routing fails to connect all pins that should be connected. In such cases the average connectivity rate is very low (below 20%), meaning over 80% of generated routes need time-consuming post-processing: a significant drawback, essentially the same issue as the [CNN-based](# -only CNN-DAC-2020-CNN(VAE)-) work.
contribution #
To solve the above problems, a new concept called the hub is defined, turning the pin–pin problem into a hub–pin problem, and a new two-phase global routing method, HubRouter, is proposed.
generation phase: hubs, routes, and stripe masks are generated together under a multi-task framework by generative models. Several backbones can be used, e.g., GAN (Generative Adversarial Nets), VAE (Variational Auto-Encoder), or DPM (Diffusion Probabilistic Models). Although hubs are the main output of this phase, generating auxiliary information turns out to be very useful for quality and accuracy; for example, local perception and stripe masks can remove noise points. Multi-task learning generates routes and masks together, improving hub generation quality.
pin-hub-connection phase: the connection is treated as a rectilinear Steiner minimum tree (RSMT) problem and solved with an actor-critic network policy. If the hubs are generated correctly, the reconstruction time complexity can be reduced to O(n log n).
SOTA generative global routing models
model:
Hub
- (virtual) key point in the route
- transferring the pin-pin connection problem to the hub-pin connection problem
- Rectilinear Steiner Points (RSPs) are found by searching for the globally minimum total distance, whereas hubs are used to determine the path; RSPs are special cases of hubs
- An RSP is a special case of a hub: hubs can form paths of arbitrary shapes (not only the shortest ones)
- Here c and x denote the conditional image and the input image, respectively. The conditional image may include pin positions, already-extracted hubs, and the stripe mask. The stripe mask indicates the routing region and helps the model understand which areas can be used for routing.
flow #
hub generation phase
Hub generation can be expressed as an image-to-image multi-task learning task; the impact of sensitive noise points is addressed with stripe-mask learning.
Appendix B describes how GAN, VAE, and DPM are incorporated into the generation framework.
In this phase, the model aims to approximate the conditional distribution $p_\theta(x \mid z, c)$ so that it is close to the prior distribution $p(x \mid c)$. Given the condition $c$ and a latent variable $z$ sampled from a prior distribution $p_z(z)$ (usually assumed Gaussian), the model generates the hubs. Here $c$ and $x$ denote the conditional image and the input image, respectively. The main objective of hub generation is to minimize the difference between the probability distributions $p(x \mid c)$ and $p_\theta(x \mid z, c)$.
A noise hub, especially the outermost one, can largely harm the wirelength of routing; the stripe mask is used to focus on such bad cases during hub generation.
pin-hub connection phase
- The model connects the hubs generated in the first phase to obtain the final routes; this can be seen as part of constructing a Rectilinear Steiner Minimum Tree (RSMT). To complete the routing, the model follows an RL-based algorithm, REST.
- Across the two phases, the authors also propose a multi-task learning framework to improve the quality of the generated hubs; in particular, a novel stripe-mask learning method mitigates the negative impact of noisy cases. Algorithm details are given in Appendix B.
GR_Adv_Parallel #
han-GPU+netlevel parallelism-ICCAD-2011- - #
VFGR-congestion modeling-ASPDAC-2014–THU #
- net-level and region-level parallelization
CUGR-3D pattern+Multi level maze routing+patching-DAC-2020-CUHK #
ICCAD 2019 Contest First Place
3D + multi-threaded
This paper does not discuss preferred directions.
Where exactly does the multi-threading come in?
Note: GR output in this format can be fed to Innovus.
time-complexity of 3D pattern routing is $\mathcal{O}(L^4|V|)$
compare with [Trans-2008](# -Layer assignment+Via minization-Trans-2008-DP-NTHU), CUGR reduces the complexity to $\mathcal{O}(L^4|V|)$ by selecting the root carefully so that each vertex will have at most three preceding vertices instead of four.
Note: it claims that, compared with the $\mathcal{O}(L^5|V|)$ of [Trans-2008](# -Layer assignment+Via minization-Trans-2008-DP-NTHU), the complexity here is $\mathcal{O}(L^4|V|)$; the comparison seems to assume [Trans-2008]'s non-bending DP-based layer assignment. Working through this paper's own description it is indeed $L^4$: CUGR computes one L pattern at a time and splits each L pattern into two parts when computing mvc, each part costing $\mathcal{O}(2L)$, for $\mathcal{O}(2L \cdot L)$ per pattern.
background #
- A common strategy of doing 3D global routing, as adopted by NCTU-GR 2.0 [5], NTHU-Route 2.0 [6], NTUgr [7] and FastRoute 4.0 [8], is to first compress the 3D grid graph into a 2D grid graph and perform 2D global routing.
- directly route the nets in a 3D grid graph:FGR [10] , GRIP [11] , MGR [12]
- Traditional pattern routing generates 2D topologies only, while our proposed 3D pattern routing directly generates 3D topologies without the need of an extra layer assignment stage
- evaluation uses DR results with metrics from multiple angles:
task #
- detailed-routability-driven directly-3d multi thread GR
contribution #
- probability-based cost scheme
- minimizing the possibility of overflow after detailed routing
3D pattern routing technique (2D pattern routing + layer assignment) (but earlier it says directly in the 3D space?)
- without overflow even with only L-shape pattern routing
- the prior work [15] first performs pattern routing in 2D and then layer assignment; here pattern routing is done directly in 3D. 3D pattern routing avoids the loss of accuracy caused by compressing the 3D grid graph to 2D.
multi-level maze routing
:- coarsened level –> searches for a region with the best routability. first narrows the search space to a smaller region
- fine-grained level –> searches for a lowest cost solution within the region
- patching mechanism
- further improve the detailed routability
flow #
In 3D pattern routing
(inital routing
), the nets are broken down into two-pin nets, and a dynamic programming
based algorithm will route the two pin nets sequentially using Lshape patterns and stacking vias
at the turns.
In the multi-level 3D maze routing
phase, the grid graph is coarsened
to shrink the routing space, and maze routing is first performed in the coarsened space with an objective to find a routing region with the highest routability. A fine-grained maze routing
will then search for a lowest cost path within the region. use its patching
mechanism here.
model #
The capacity between GCells equals the number of tracks; GR usually models via capacity as unlimited, but not in this paper.
three base definition:
- resource = capacity - demand
- these three quantities are defined on both GCells and wire edges, i.e., six values in total
- resource directly reflects the congestion level
cost scheme
The cost is mainly split into a wire part and a via part:
wire cost:
- *wl* is the wirelength cost
- *eo* is the expected overflow cost, where *uoc* is a hyperparameter. The larger d(u, v) is, the more likely the edge is to be congested; d(u, v) is accurate if the DR adopts the simplest strategy of picking a track randomly to route, but most well designed detailed routers will do much better than random selection. lg(u,v) is a variable to refine d(u, v); the "+1" keeps the value in (0, 1) so it can be read as a probability. *slope* is a hyperparameter: when the resources are abundant, there is almost no congestion cost, but the cost increases rapidly as the resources are being used up and keeps increasing almost linearly after all the resources are used (see the sketch below).
via cost:
- thanks to the 3D pattern routing strategy, a via cost scheme can be embedded to reflect the impact; *uvc* is a hyperparameter.
- why does formula (5a) need the "+1"?
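A hedged sketch of the qualitative cost behavior described above, not CUGR's exact equations: a unit wirelength term plus a congestion term that stays near zero while resources are abundant, rises sharply as they run out, and keeps growing once demand exceeds capacity. The parameter names `uoc` and `slope` mirror the note; how they are combined here is an assumption.

```python
# Hedged sketch of the qualitative edge-cost behavior described above; NOT
# CUGR's exact formula. `uoc` and `slope` mirror the note's hyperparameter
# names; their combination here is an assumption.
import math

def edge_cost(demand, capacity, wl=1.0, uoc=4.0, slope=1.0):
    resource = capacity - demand
    overflow_prob = 1.0 / (1.0 + math.exp(slope * resource))  # in (0, 1)
    linear_tail = max(0.0, -resource)                         # keeps growing past capacity
    return wl + uoc * (overflow_prob + linear_tail)

for d in range(0, 8):
    print(d, round(edge_cost(demand=d, capacity=4), 3))       # cost ramps up near/after capacity
```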
Initial Routing / 3D Pattern Routing
use FLUTE first (not congestion-aware)
use edge shifting (described in FastRoute) to alleviate congestion
randomly choose one node in the net, use DFS to get a queue and then get a DAG
Similar to [15], dynamic programming selects the 3D L pattern with minimum cost; each L pattern has (2 × L × L) possibilities.
The final result is obtained at the root.
Multi-level 3D Maze Routing
maze route planing
aims at finding a smaller but highly routable search space
compress a block of G-cells (5x5 in our implementation), use avg to descripe
capacity, demand, resource
cost function:
this yields the coarse gray grid:
the computation is then performed separately within each of these BBoxes, using the cost scheme, which yields the solid black lines in the figure above
fine-grained maze routing within guides
Postprocessing / Guide Patching
we can add new guides to improve detailed routability. adding new stand-alone guides to alleviate routing hot spots.
three kind of patching:
Pin Region Patching
most effective
the ideal way of improving pin accessibility is to identify those hard-to-access pins and assign more resources to them
Our global router will check the upper (or lower) two layers of a pin, which are vital for accessing the pin. use 3 × 3 patching guides.
The concrete method for identifying hard-to-access pins is not described.
Long Segment Patching:
- a longer routing segment often means more wrong way wires and causing more congestion.
- If a guide is longer than a specified length I, we’ll consider long segment patching.
- if a G-cell with resource below a threshold T is encountered, a single G-cell route guide will be patched above or below it, depending on which of them has sufficient resource
Violation Patching:
For G-cell with inevitable violations, patching will be used again to enable the detailed router to search with more flexibility.
data #
iccad 2019 dataset
experiment #
The authors further improved it after the contest.
our algorithm’s peak memory is close to the first place and is 1.83 times of that of the second place on average (ours is 8.22 GB on average and is 19.8 GB for
the biggest design)
SPRoute 2.0- detailed routability driven-ASP DAC-2022- #
- OpenSource!
- 2D
- can the guide file be fed into Innovus?
- soft capacity: the soft capacity is downsized from the hard capacity (number of available tracks), using the pin density and RUDY value of the region.
- batch: a ==deterministic== net-level parallelization strategy that bulk-synchronously maze-routes
- baselines: FLUTE, [FastRoute 4.0](# fastroute 4.0-via min tree+3 bending-ASPDAC-2009-) for pattern routing, [CUGR](# CUGR-3D pattern+Multi level maze routing+patching-DAC-2020-CUHK)
background #
- In terms of parallelization, maze routing is widely used in global routing and ==is the most time-consuming stage== on hardto-route benchmarks.
contribution #
- soft capacity to reserve space for detailed routability
- parallelize maze routing in a deterministic bulk synchronous approach
- design a scheduler for the deterministic parallel execution model
flow #
model #
soft capacity #
Different layers have different parameters for the ratio function since they are influenced by the congestion in different scales
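A hedged sketch of the soft-capacity idea, not SPRoute 2.0's actual formula: downsize the hard capacity (track count) by a ratio that shrinks as local pin density and RUDY grow, with per-layer knobs. The ratio function, `alpha`/`beta`, and the clamping are placeholders.

```python
# Hedged sketch of the soft-capacity idea; NOT SPRoute 2.0's actual formula.
# The ratio function and its per-layer knobs alpha/beta are placeholders.
import numpy as np

def soft_capacity(hard_cap, pin_density, rudy, alpha=0.5, beta=0.5):
    """All inputs are (H, W) maps for one layer; alpha/beta are per-layer knobs."""
    ratio = 1.0 - alpha * pin_density - beta * rudy
    ratio = np.clip(ratio, 0.2, 1.0)          # never reserve everything
    return hard_cap * ratio
```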
bulk synchronous deterministic approach #
Essentially batching: all threads execute one batch of nets at a time.
At the start of a batch, each thread takes one net from the batch, reads the usage of the global graph, and performs rip-up and reroute on its thread-local graph.
Still not fully clear to me.
data #
ICCAD19 contest
experiment #
FastGR-GPU pattern routing+ multi thread maze–DATE-2022-PKU+CUHK+HNAL #
- GPU-accelerated
- accelerated the 3D pattern routing algorithm of [CUGR](# CUGR-3D pattern+Multi level maze routing+patching-DAC-2020-CUHK) for initial routing by both net-level and path-level parallelization on GPU
Gamer- -Trans-2022- - #
- GPU-accelerated
- accelerated the two-level maze routing of [CUGR](# CUGR-3D pattern+Multi level maze routing+patching-DAC-2020-CUHK) for rip-up and reroute by updating vertical and horizontal routing costs alternatively on GPU
GGR-super fast gpu accelerate-ICCAD-2022- #
open source! Xplace/cpp_to_py/gpugr at main · cuhk-eda/Xplace
background #
Performance depends on the detailed routing result.
Modern global routing problem, which was introduced at the 2019 CAD contest at ICCAD, targets at closing the gap between global routing and detailed routing. The LEF/DEF files for detailed routing are directly used as the input for global routing.
The global routing quality is evaluated using an academic detailed router Dr. CU[8]
2019 ICCAD contest on global routing did not directly evaluate global routing results based on overflows and total wirelength. The new evaluation uses the global routing results as route guides for a detailed router, and the metrics are all detailed routing related
2D & 3D
NCTU-GR 2.0[13], SPRoute[7] and FastRoute 4.0[14] are 2D GRs
However, compressed 2D grid graphs are less accurate than 3D grid graphs in terms of routing resources, which could limit the global routing quality.
CUGR[11]. It has both 3D pattern routing and 3D maze routing
multi-thread vs GPU
Among LEF/DEF-based academic global routers, SPRoute 2.0[6] is the only 2D GR
GAMER[10] is a novel parallel maze routing algorithm integrated in CUGR.
FastGR[12] introduced GPU parallelization of L-shape pattern routing
contribution #
flow #
data #
model #
CUGR 2.0-DAG-based-DAC-2023- -CUHK #
background #
- A limitation of many of the aforementioned global routers is that they rely heavily on time-consuming path search algorithms like maze routing to resolve overflows. These approaches are not efficient enough even with parallelization and may cause many unnecessary detours.
contribution #
- a DAG-based generalized pattern routing algorithm
- a new dynamic programming-based algorithm to calculate the routing cost, reducing the time complexity from $\mathcal{O}(L^4|V|)$ to $\mathcal{O}(L^2|V|)$
- a DAG augmentation algorithm that enables the creation of alternative paths in a routing DAG; it can even shift or create Steiner points. Over 99% of nets can be successfully routed without the need for maze routing.
- a new sparse graph maze routing algorithm
flow #
- RSMT
- DFS and routing DAG with L patterns; note the additional nodes g, f, i, h, so every edge is now a straight line (see the sketch after this list)
- Routing DAG with other patterns; however, this is not used for initial routing (initial routing only uses L-shapes). The paper only mentions it here and never uses it afterwards; the source code needs a careful look.
- Dynamic programming-based DAG routing (L-shape + layer assignment); it is not explained how candidates are discarded.
- DAG-based pattern routing with augmentation
- sparse graph maze routing algorithm
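The "DFS and routing DAG" step can be illustrated with a small sketch (assumed data layout, not CUGR 2.0's code): pick a root of the RSMT, DFS the undirected tree, and orient every edge from child toward parent; the DFS finish order then gives a leaves-first order for the pattern-routing DP.

```python
def build_routing_dag(adj, root):
    """adj: undirected adjacency dict; returns child -> parent edges and a
    leaves-first order obtained from the DFS finish order."""
    parent, order, visited = {}, [], {root}
    stack = [(root, iter(adj[root]))]
    while stack:
        node, it = stack[-1]
        nxt = next((v for v in it if v not in visited), None)
        if nxt is None:
            order.append(node)       # node finished: all children already emitted
            stack.pop()
        else:
            visited.add(nxt)
            parent[nxt] = node       # DAG edge: nxt -> node (child toward parent)
            stack.append((nxt, iter(adj[nxt])))
    return parent, order

adj = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
parent, order = build_routing_dag(adj, "a")
print(parent)   # {'b': 'a', 'c': 'a', 'd': 'c'}
print(order)    # leaves first, root last
```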
model #
- cost: dynamic programming-based
- DAG augmentation for congestion
  - create alternative paths
  - Steiner point movement; the paper does not explain exactly how the Steiner points are moved.
experiment #
compare with CUGR [12] and SPRoute 2.0 [13]
runtime comparison uses only one thread
effectiveness of Steiner point augmentation
run time compare with GPU-accelerated GR
compare with FastGR [14] and GAMER [15]
The GPU model also matters; this experiment used an RTX 3090.
slightly faster than FastGR for initial routing
around 5.2× as fast as GAMER
InstantGR-Scalable GPU Parallelization-ICCAD-2024-CUHK #
- open source!
- second place of ISPD25 contest
- GPU Parallelization
- the parallel algorithm is mainly based on the DAG-based global routing algorithm in [CUGR2](# CUGR2.0 EDGE- -DAC-2023-), presumably the 3D pattern routing DP part and the maze routing part
- parallel in both initial routing and RRR
- parallelism is increased, but some parts are still serial
- also uses FLUTE
- does the unit of work have to be a net? It seems so, in order to use DP
background #
- GPU memory is limited
- This requires memory-efficient solutions that can minimize CPU-GPU communication while maximizing GPU utilization
- large designs have more nets with bigger routing graphs, providing many new parallelization opportunities that are not yet explored
- nets in a batch can be routed in parallel
task #
- parallelism for large-scale designs
- partitioned design
contribution #
- a new method for net-level batch generation, based on 3D fine-grained overlap checking; it explores more parallelism by increasing the number of nets per batch
- a node-level parallel routing approach that achieves much higher parallelism compared to traditional net-level parallel routing
flow #
- In initial routing, we construct a basic routing DAG to perform L-shape pattern routing.
key points
the detailed explanation is shown in routing2
NET-LEVEL PARALLELISM
Net-level parallelism means simultaneous routing of a batch of nets that do not "overlap" [2, 3, 14, 19, 20, 22, 26]; this line of work started around 2019, and both CUGR2 and FastGR use it.
Typical batch generation algorithm: used in [2, 3, 14, 19, 20]. R-trees are the common way to implement line 4 of the algorithm (in the paper); this pessimistically approximates overlap and significantly lowers the degree of parallelism.
Definition and graph model: the unit is a segment, with the horizontal and vertical parts handled separately; all routes are assumed to be L-shapes, and for two nodes not on the same line there are two possible Ls. These four nets will be divided into just one batch based on our exact representation of routing graphs for overlap checking, while into four batches by the traditional bounding box-based pessimistic approximation.
Via model: a via is represented by a cross.
Overlap Checking Algorithms
Illustrated with the horizontal subgraph; the vertical case is analogous.
Checking is done per horizontal segment.
First group the segments with the same 𝑦 coordinate.
Traditional algorithm: this is a classical computational geometry problem that can be efficiently solved by segment trees [1] in 𝑂(log 𝑛) time for both operations.
New algorithm motivation:
segments are very short
New algorithm: Point Exhaustion (see the sketch after this block)
Simply use a Boolean array to record whether each point in [1, 𝑛] is covered by some segment 𝑠 ∈ 𝑆. We mark every point 𝑥 ∈ [𝑙, 𝑟] when a segment [𝑙, 𝑟] is inserted, and check every point 𝑥 ∈ [𝑙𝑞, 𝑟𝑞] for an overlap query of a segment [𝑙𝑞, 𝑟𝑞].
The efficiency of this point exhaustion is further improved by using bit arrays.
another improvement:
representative point exhaustion
- allows a little bit of overlap.
- it only checks whether the two endpoints of a query segment are covered, instead of every point in between.
- covers most overlap scenarios in practice.
- The only scenario in which this algorithm fails to find the overlap of two overlapping segments is when the query segment [𝑙𝑞, 𝑟𝑞] contains the overlapping segment [𝑙, 𝑟], i.e., [𝑙, 𝑟] ⊂ [𝑙𝑞, 𝑟𝑞].
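A minimal sketch of the point-exhaustion check with a bit array (interfaces assumed; InstantGR's GPU implementation is not reproduced here): per y-row, inserting a segment sets the bits [l, r], and an overlap query is a single AND against the query mask.

```python
class RowOverlapChecker:
    """Point exhaustion for one y-row; a Python int serves as the bit array."""

    def __init__(self):
        self.bits = 0                        # bit i set <=> point i is covered

    @staticmethod
    def _mask(l, r):
        return ((1 << (r - l + 1)) - 1) << l

    def overlaps(self, l, r):
        """Does the query segment [l, r] touch any inserted segment?"""
        return (self.bits & self._mask(l, r)) != 0

    def insert(self, l, r):
        self.bits |= self._mask(l, r)

row = RowOverlapChecker()
row.insert(3, 7)
print(row.overlaps(6, 10))   # True  (shares points 6 and 7)
print(row.overlaps(8, 10))   # False
row.insert(8, 10)
```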
NODE-LEVEL PARALLELISM
Are nets still assigned to batches at net granularity?
routing nodes of the same depth are processed in parallel (see the sketch below)
Suppose we have 4 nets, Net A, B, C and D in our grid graph. Since nets with overlap cannot be routed together, Net A and B are distributed to batch 0, as shown in Figure 7a, and nets C and D are distributed to batch 1.
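A minimal sketch of the node-level idea (assumed DAG representation, not InstantGR's GPU kernels): group routing-DAG nodes by depth and process each depth level as one parallel step, since nodes at the same depth do not depend on each other.

```python
from collections import defaultdict

def levels_by_depth(parent):
    """parent: child -> parent map of a routing DAG; returns levels, deepest first."""
    depth = {}
    def d(n):
        if n not in depth:
            depth[n] = 0 if n not in parent else d(parent[n]) + 1
        return depth[n]
    levels = defaultdict(list)
    for node in set(parent) | set(parent.values()):
        levels[d(node)].append(node)
    return [levels[k] for k in sorted(levels, reverse=True)]

parent = {"b": "a", "c": "a", "d": "c"}
for step, nodes in enumerate(levels_by_depth(parent)):
    # in the GPU router, all nodes in `nodes` would be processed in one parallel step
    print(step, nodes)
```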
experiment #
4 NVIDIA A800 GPUs and 8 CPU threads.
compare different overlap checking methods
The number of nets per batch is limited to 1000
compare on the 2 largest benchmarks
compare with Top-3 Global Routers of ISPD2024 Contest
Runtime (s) of DAG-Based Augmented Routing with and without Node-Level Parallelism
The "acceleration" row appears to actually be the speedup ratio.
HeLEM-GR-Heterogeneous+Linearized Exponential Multiplier Method-ICCAD-2024- -PEK #
- first place of ISPD25 contest
- not open source (as of 2025/2/6)
- 2D routing algorithm
background #
contribution: #
- LEM (linearized exponential multiplier) method for the ==2D routing problem== to minimize wirelength and overflow. This LEM framework is ==general to integrate any routing kernels.==
- batched routing kernels including ==L shape and 3-bend routing== for GPU parallelization.
- sweep operations for GPU-accelerated layer assignment.
flow #
- preparation
  - run on CPU
  - use FLUTE
  - use SPRoute 2.0 to compact the 3D graph into a 2D graph
- 2D routing
  - run on GPU
- layer assignment
  - run on GPU
RSMT #
FLUTE- - -2008- - #
FLUTE is an RSMT construction algorithm adopting a look-up table approach, which is both fast and optimal for low-degree nets. However, FLUTE is unaware of routing congestion.
Below is a series of FLUTE variants and FLUTE-based improvements.
-Obstacle avoiding-ISCAS-2024–SYSU #
-GPU-Accelerated-ICCAD-2022–PEK #
- first GPU-accelerated RSMT generation algorithm
background #
Rectilinear Steiner minimum tree (RSMT) generation is a fundamental component in the VLSI design automation flow. Due to its extensive usage in circuit design iterations at early design stages like ==synthesis, placement, and routing==, the performance of RSMT generation is critical for a reasonable design turnaround time.
previous works are CPU-based
Accelerating RSMT generation on GPU is an important but highly challenging task, mainly due to its complex, non-trivial divide-and-conquer computation pattern and recursive operations.
NP-completeness of RSMT generation –cite–>[1]
the current most efficient and widely-adopted heuristic is FLUTE [9],
Although most of the nets in a typical circuit design have only a small degree (≤ 9), larger nets are exponentially harder to solve
RSMT algorithms, such as FLUTE, are based on a ==divide-and-conquer== strategy with deep recursions, which cannot be executed on GPU threads, whose stack memory is very limited.
The sizes of nets in a circuit netlist are ==highly uneven==, from 2-pin nets to nets with 40 pins or more, which leads to an extremely ==imbalanced workload== and harms the parallelism.
contribution #
- propose a levelized task decomposition strategy
  - ensures a balanced workload and enables high-performance data parallelism (see the sketch after this list)
- algorithmic transforms
  - eliminate the recursion patterns of FLUTE
- GPU-efficient kernels
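One way to picture the levelized idea is sketched below (an assumption for illustration, not necessarily the paper's exact decomposition): nets are bucketed by pin count so that each batch contains nets of equal degree, keeping the per-thread workload uniform when a batch is dispatched as one GPU kernel launch.

```python
from collections import defaultdict

def levelize_by_degree(nets):
    """nets: list of pin lists; returns {degree: [net indices]}, smallest degree first."""
    buckets = defaultdict(list)
    for i, pins in enumerate(nets):
        buckets[len(pins)].append(i)
    return dict(sorted(buckets.items()))

nets = [[(0, 0), (1, 2)], [(0, 0), (3, 1), (2, 2)], [(1, 1), (4, 0)]]
print(levelize_by_degree(nets))   # {2: [0, 2], 3: [1]}
```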
DR outdated #
DRCU #
- academic DR
DR adv #
toread #
DGR- -DAC-2024–CMU+NVIDIA #
A_Simple_Fast_and_GPU-friendly_Steiner-Tree_Heuristic #
Timing-Driven Routing-ICCAD-2023-USTC #
TIMING-ICCAD-2024_Guo #
GPU-Accelerated_Static_Timing_Analysis #
Surveys #
ML4PR #
Towards Machine Learning for Placement and Routing in Chip Design: a Methodological Overview
Placement and routing are two indispensable and challenging NP-hard problems.
Machine learning shows great promise thanks to its data-driven nature, which reduces the reliance on domain knowledge and hand-crafted priors, and its advanced computational paradigms (e.g., GPU-accelerated deep networks) give it better scalability.
Challenges:
placement:
- Placement objectives such as routability cannot be evaluated until routing is finished, so a single piece of feedback in the optimization loop may take hours, which is unaffordable for thousands of queries.
- Modern placers need to handle tens of thousands of macros and millions of standard cells within a few hours. This scalability requirement is still beyond the capability of existing ML methods.
routing:
- Under a fair comparison, existing learning-based techniques struggle to systematically outperform classical routing algorithms in both efficiency and solution quality.
- Most learning-based techniques work well on small circuits with thousands of nets, whereas practical routing engines need to efficiently handle millions of nets on extremely large 3D grid graphs (> 1000 × 1000 × 10) and produce high-quality solutions.
related work
placement
routing
Survey of VLSI Routing Algorithms (超大规模集成电路布线算法综述) #
background #
For routing details see routing2.md; detailed routing and manufacturability-aware (DFM) routing algorithms are not recorded yet.
EDA+GNN #
See A Comprehensive Survey on Electronic Design Automation and Graph Neural Networks for details.
References #
- [Opportunities and Challenges AI Brings to EDA](AI技术带给EDA的机遇和挑战-Yibo Lin.pdf)
- [Towards Machine Learning for Placement and Routing in Chip Design: a Methodological Overview]([paper reading] Towards Machine Learning for Placement and Routing in Chip Design: a Methodological Overview_toward machine learning….lake-CSDN blog)
- [Reading] A Comprehensive Survey on Electronic Design Automation and Graph Neural Networks, EDA+GNN survey translation (ppaml, CSDN blog)
bak #
CongestionNet-Congestion Prediction-IFIP-2019-GNN
-placement Congestion prediction-arXiv-2021-GNN
Input: netlist
Output: congestion at placement stage
EDA-ML: Graph Representation Learning Framework for Digital IC Design Automation
Pratik Shrestha and Ioannis Savidis, Department of Electrical and Computer Engineering, Drexel University
background #
VLSI: traditional methodologies → ML; graph representation learning is able to capture complex relationships in graph-structured data.
GNN:
task #
flow #
data #
model #
experiment #