Dynamic Power Optimization with IC Compiler and PrimeTime PX


Prance Zhang
Prance.Zhang@analog.com

Abstract

Power consumption is becoming more and more critical, not only from an ecological viewpoint but also because of application-specific requirements, in particular for certain modes of operation in these applications. One of our projects has strict dynamic power requirements from the customer. Analysis of first silicon, sampled to the customer, found that the dynamic power consumption of the digital design exceeded the requirement. To meet the customer's requirement, we needed to reduce the dynamic power by 70%, which at first glance looked impossible. In this paper, I share our experience of how we reduced the dynamic power to meet that requirement. Some of the key aspects include:
A low power implementation flow is introduced, including register clustering, switching activity annotation, an inverter-based clock network, and some new features in the latest IC Compiler version, 2010.12. The power saved by each of these methods is evaluated separately. By switching from a buffer-based clock network to an inverter-based one, clock buffer area, insertion delay, skew, and power are all reduced simultaneously. Together, these methods reduce the active digital power by almost 20%.
Analysis showed that the clock network accounted for a large part of the total power, so the flow needed particular focus on building a clean and efficient clock network. The original CTS flow was found to insert many buffers to balance all the sinks within synchronous clock groups. For register groups that have no critical timing paths to other registers, setting exclude pins on their clock pins prevents ICC from inserting clock buffers for these pins. Optimizing the CTS strategy decreased the dynamic power by more than 35%.
PrimeTime PX was used to report the power, as it correlates well with our initial silicon measurements. It was also used to analyze power density in order to quickly find blocks with power issues. A trial of the power savings predicted by PTPX was possible as a metal mask option. This was executed and showed that PTPX was accurate to within 5%.

1. Introduction

Power consumption is becoming more and more critical, not only from an ecological viewpoint but also because of application-specific requirements, in particular for certain modes of operation in these applications. This article discusses several methods for optimizing dynamic power, based on the experience of our current project. A good clock structure is the essential starting point for power optimization, especially when the structure is complicated, for example with delayed clock groups or many synchronous clock groups. Synopsys provides a number of effective ways to constrain clock tree synthesis and optimize dynamic power. The low power techniques used in our project include register clustering, gate-level power optimization, SAIF (Switching Activity Interchange Format) guided optimization, an inverter-based clock network, and the new ICC version 2010.12. All the power results in this article are from a 200,000-gate design in 0.18um technology running at a 20MHz clock frequency. Other designs may see different optimization results. As we use a 0.18um ULL technology, the leakage power is quite small, so the article focuses on dynamic power optimization only. The power-oriented flow used in our project is shown in Figure 1.

Figure 1 – Low power implementation and analysis flow

2. Low power implementation

The low power implementation flow combines several techniques and reduces the dynamic power by nearly 20% in our project (results may vary in other designs), which is far more than we expected the tools to save. Figure 2 shows the low power implementation flow with the specific low power techniques used.

Figure 2 – Low power implementation flow

2.1 Switch activity annotation

First, run RTL simulations to dump a VCD file, then convert the VCD file to a SAIF file. In our flow, DC reads the SAIF converted from the RTL simulation VCD, and ICC reads the SAIF written out by DC. For better switching activity annotation in ICC, you can instead run a gate-level simulation with the synthesized netlist, dump a VCD file, and convert that VCD file to SAIF. The solid line in Figure 3 shows the SAIF file generation procedure in our current flow, and the dotted line shows the alternative flow, which generates a SAIF with a 100% annotation rate but needs another round of simulation.
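
The solid-line flow described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the file names, the testbench path tb/dut and the design name top are hypothetical, and the instance name to pass to ICC depends on how the SAIF was written by DC, so check it against the header of the generated file.

```tcl
# Sketch only: all file names, tb/dut and "top" are hypothetical.

# Step 1 (OS shell, not a Synopsys shell): convert the RTL simulation VCD
# to SAIF with the Synopsys vcd2saif utility:
#   vcd2saif -input rtl_sim.vcd -output rtl_sim.saif

# Step 2 (dc_shell): annotate the RTL switching activity on the design.
# -instance_name is the hierarchical path of the design inside the testbench.
read_saif -input rtl_sim.saif -instance_name tb/dut

# ... synthesis with power optimization runs here ...

# Write the activity held by DC so that ICC can read it back.
write_saif -output dc_out.saif

# Step 3 (icc_shell): annotate the activity written by DC.
# The instance name here is an assumption; match it to the SAIF contents.
read_saif -input dc_out.saif -instance_name top
```

For the dotted-line flow, only the input of step 3 changes: the SAIF would come from a gate-level simulation of the synthesized netlist instead of from DC.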

Figure 3 – SAIF file generation for DC/ICC use

2.2 Pre CTS power optimization with register clustering

Pre CTS power optimization enables you to perform power optimization that includes power-aware placement and clock-gating optimization before the clock tree synthesis (CTS) stage [1].

Figure 4 – Power-aware placement

Figure 4 shows an example of power-aware placement optimization. The left circuit diagram is without power-aware placement, and the right circuit diagram is with it. Net switching power is linearly proportional to the product of switching frequency and capacitance. As the switching frequency of each net is fixed, ICC tries to reduce the capacitance of nets with high switching activity. As Figure 4 shows, ICC moves the registers closer together to obtain smaller clusters, shorten the high-activity net (the clock net), and lower the total net capacitance. Although the green net becomes longer, its activity is low, so the power increase on the green net is much smaller than the power saving on the red net.
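
The trade-off between the red and green nets can be written out as a short worked relation. It uses the standard per-net switching power expression, with TR as the toggle rate in transitions per second; the symbol names and the data-net toggle rate in the note below are illustrative assumptions, not values from our design.

```latex
% Switching power of one net
P_{\mathrm{net}} = \tfrac{1}{2}\, C_{\mathrm{net}}\, V_{DD}^{2}\, TR_{\mathrm{net}}

% Moving registers removes \Delta C_{\mathrm{red}} from the clock (red) net
% and adds \Delta C_{\mathrm{grn}} to the data (green) net
\Delta P = \tfrac{1}{2}\, V_{DD}^{2}
  \left( TR_{\mathrm{grn}}\, \Delta C_{\mathrm{grn}}
       - TR_{\mathrm{red}}\, \Delta C_{\mathrm{red}} \right)

% Power is saved (\Delta P < 0) whenever
TR_{\mathrm{red}}\, \Delta C_{\mathrm{red}} > TR_{\mathrm{grn}}\, \Delta C_{\mathrm{grn}}
```

As an illustration, a 20MHz clock net makes 4 x 10^7 transitions per second. If a data net is assumed to make 2 x 10^6 transitions per second (a hypothetical figure), each femtofarad removed from the clock net pays for up to 20 femtofarads added to the data net.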
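
# SCRIPT: Pre CTS power optimization with register clustering
# Optimize dynamic power only; leakage is negligible in this ULL process
set_power_options -dynamic true
set_power_options -leakage false
set_power_options -dynamic_effort high
# Enable power-aware placement (shortens high-activity nets)
set_power_options -low_power_placement true
# Cap the local cell density the placer may reach while clustering
set placer_max_cell_density_threshold 0.7
# Use the default pre CTS power options, then run the optimization
set_optimize_pre_cts_power_options -default
optimize_pre_cts_power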

Pre CTS power optimization should be used with switching activity annotated; otherwise its effect is not significant. The results in Table 1 show a power saving of 9% from pre CTS power optimization with switching activity annotation.

Table 1 – Pre CTS power optimization with SAIF

2.3 New ICC version 2010.12

The new ICC version 2010.12 brings many new features that can greatly reduce the clock tree power. See Figure 5 for the technologies used in this version [2].

Figure 5 – ICC 2010.12 new features in clock tree power optimization

The power reduction achieved with the new ICC version is 9.5%, far beyond our expectation. A detailed comparison is given in Table 2.

Table 2 – Power saving with new ICC version


More complete low power implementation results are given in Table 3 and Figure 6. Low power implementation with the new ICC version reduces the total power by 17.63%. The area is also reduced marginally.

Table 3 – Low power implementation result


Figure 6 – Low power implementation result

2.4 Inverter based clock network

In our previous CTS flow, we used both buffers and inverters to build the clock tree by default. Based on our project experience, we found the following benefits of an inverter-based clock network:
1. Lower power consumption
2. Better insertion delay
3. Better clock skew
4. Better clock duty cycle ratio

Table 4 – Inverter based clock network comparison


Table 4 shows that the buffer area, insertion delay, skew, and power are all reduced with the inverter-based clock network. According to Synopsys documentation, the number of buffer levels may increase in other designs.
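
One way to steer ICC toward an inverter-based clock network is to restrict the clock tree reference cells to inverters before CTS. The sketch below shows that idea and is not necessarily the exact setting used in our project. The cell names are hypothetical placeholders for the clock inverters of your own library.

```tcl
# Sketch only: CLKINVX4/X8/X16 are hypothetical library cell names.
# Listing only inverters as references asks CTS to build the tree
# from inverters instead of a mix of buffers and inverters.
set_clock_tree_references -references {CLKINVX4 CLKINVX8 CLKINVX16}
```

The command should be issued before clock tree synthesis so that the restriction applies when the tree is built.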

3. Clock tree exceptions

3.1 Understand the clock structure

Before CTS, it is extremely important to understand the clock structure. As the design becomes more complicated, the clock tree structure becomes harder to handle, and certain clock tree constraints are needed. ICC has some handy features for reporting the clock tree. Figure 7 shows a clock tree fanout schematic, from which the clock hierarchy and the clock gating cell locations can easily be seen. Figure 8 gives a clear levelized graph of the clock tree.
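
Besides the GUI views, the clock tree can also be inspected with text reports. The commands below are a minimal sketch; the option names are given as I recall them for ICC and should be checked against the man page of your tool version.

```tcl
# Sketch only: verify option names with "man report_clock_tree" in icc_shell.
# One-line summary per clock tree (sinks, levels, buffers, skew).
report_clock_tree -summary
# Hierarchical listing of the cells and pins on each clock tree.
report_clock_tree -structure
```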

Figure 7 – Clock tree fanout schematic

Figure 8 – Clock tree levelized graph

3.2 Constrain the clock tree synthesis

Our design has special requirements and clock structures, so the clock tree synthesis needs to be constrained accordingly.

Figure 9 – Set exclude pins

Take Figure 9 as an example. Block A is used to select one clock from multiple clocks. The only output of block A is a clock signal, and all its input data signals are treated as asynchronous, so the registers in block A do not need to be balanced with registers outside block A. Because the output clock of block A has a large fanout, reg_C sees an insertion delay of several nanoseconds. Without a special constraint on clock tree synthesis, the tool tries to balance reg_A with reg_C, so many buffers are inserted ahead of reg_A. These unnecessary buffers consume a lot of power. The goal is therefore to avoid balancing the insertion delay of reg_A against reg_C, so that the unnecessary buffers are not inserted.
Setting exclude pins on the registers in block A is one way to achieve this.

# SCRIPT: clock tree exceptions
set_clock_tree_exceptions -exclude_pins [get_pins block_A/*/CK]

In our project, due to a lack of communication between the design engineer and the implementation engineer, the ICC script for the first tapeout had no such setting. More than 30% of the digital power consumption can be saved with this kind of setting.

4. Power analysis with PrimeTime PX

4.1 Power analysis with PrimeTime PX

PrimeTime PX can generate a wide range of reports that provide information about power consumption [3]. These reports show the power dissipation of hierarchical modules, the clock network power, the clock gate savings, and so on. The hierarchical report, for example, shows the power consumption of each hierarchy block.
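
Reports of this kind could be produced with a PrimeTime PX script along the following lines. This is a sketch under stated assumptions, not our project script: the library, netlist, SDC, SPEF and SAIF file names, the design name top and the testbench path tb/dut are all hypothetical.

```tcl
# Sketch only: every file name, "top" and tb/dut are hypothetical.
# Power analysis must be enabled before the design is linked.
set power_enable_analysis true

# Load the library and the routed netlist, then link the design.
set link_path "* my_lib.db"
read_verilog top_routed.v
current_design top
link_design

# Timing constraints and extracted parasitics.
read_sdc top.sdc
read_parasitics top.spef

# Annotate switching activity from a gate-level simulation.
read_saif gate_sim.saif -strip_path tb/dut

# Compute power and report it per hierarchy block.
update_power
report_power -hierarchy

# Report how much power the clock gates save.
report_clock_gate_savings
```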

Besides text reports, PrimeTime PX also provides a GUI for analyzing power density. In Figure 10, for example, the red blocks are those with high power density. You can descend to different hierarchy levels to check the power density there. The red blocks should be analyzed first: some of them are expected to have high power density, but others could be optimized by clock gating or other methods.

Figure 10 – Power density graph in PrimeTime PX

4.2 PrimeTime PX report vs. silicon measurement

In our experience, PrimeTime PX correlates very well with silicon data. A FIB (Focused Ion Beam) edit and a metal edit tapeout were done, and the results are shown in Table 5. The table shows that the power saving reported by PrimeTime PX matches the silicon measurement well. Based on these results, we have more confidence in predicting the power consumption of the next tapeout from the PrimeTime PX report.

Table 5 – PrimeTime PX correlation with silicon

5. Conclusion

This article introduces the low power optimization methods used in our project. A good design and clock structure is the basis for reaching a low power target. With an intelligent clocking strategy and DC/ICC power optimization features such as clock gate insertion and merging, power-aware placement, and power-driven sizing, a very power-efficient design can be implemented to meet the end requirements. After power optimization, the precise power analysis in PrimeTime PX gives a good power prediction.

6. Acknowledgements

I would like to thank ADI CAD and Synopsys technical support for their help!

7. References

[1] IC Compiler User Guide Version E-2010.12, February 2011
[2] IC Compiler 2010.12 Update Training, February 2011
[3] PrimeTime PX User Guide Version E-2010.12, March 2011