Friday, 12 April 2013

Reducing crosstalk effects



SI (signal integrity) and crosstalk are largely wire-dominated effects. Some of the things the designer should keep in mind while resolving crosstalk and SI issues are:
1)      Reduce congestion: The more the routing congestion, the worse the SI effects. Cells need to be either downsized or upsized depending on the case. If a weak driver is driving a fanout of cells at long distances, its nets can become crosstalk victims, so such drivers should be upsized. Low-frequency signals like scan, reset and other static signals are common victims of this behaviour. Conversely, if a high-drive-strength driver is driving a fanout of cells sitting close by, it can become a potential aggressor and should be downsized.
2)      Spacing: The coupling capacitance is an inverse function of spacing. The larger the spacing between the metal routes, the smaller the capacitance, so the amount of voltage coupled from the aggressor onto the victim reduces, resulting in lesser SI effects.
3)      Moving the routes: Since crosstalk is a coupling effect, having more aggressors is a bigger problem than having one (assuming the aggressors' timing windows overlap with the victim's). Hence the nets can be moved so that the victim sees just one aggressor instead of two, reducing the total crosstalk capacitance.
4)      Avoiding long parallel routes: Since capacitances in parallel add up, long parallel routes only increase the coupling capacitance. Hence it is better to switch the net route between different metal layers.
5)      Shielding: The capacitance to ground is assumed to be ideal (non-switching). So, if the transition on the victim is acceptable, shielding with a ground net ensures that the crosstalk between the victim and the original aggressors is nullified.
6)      Repeater stitching/buffering: This is often the best fix if there is enough slack on the timing path. It avoids long routes, so the coupling capacitance is reduced. Staging also brings other benefits: improved transition time, better IR drop and EM behaviour, and less noise.
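A back-of-the-envelope model makes the spacing point above concrete. The sketch below (Python; the supply and capacitance values are assumed, purely for illustration) approximates the peak coupled noise by simple charge sharing, with the coupling capacitance taken as inversely proportional to spacing:

```python
# Illustrative lumped model of crosstalk noise on a victim net (a rough
# sketch, not an extraction-accurate model). The coupled noise peak is
# approximated by capacitive charge sharing:
#     V_noise ~= Vdd * Cc / (Cc + Cg)
# where Cc is the aggressor-victim coupling cap and Cg the victim's
# cap to ground. Cc is taken as inversely proportional to spacing.

def coupling_cap(cc_ref, s_ref, spacing):
    """Coupling cap scaled from a reference value at reference spacing."""
    return cc_ref * s_ref / spacing

def victim_noise(vdd, cc, cg):
    """Peak coupled noise from simple charge sharing."""
    return vdd * cc / (cc + cg)

vdd = 0.9       # supply, V (assumed)
cg = 10e-15     # victim cap to ground, F (assumed)
cc_ref = 5e-15  # coupling cap at minimum spacing, F (assumed)
s_min = 1.0     # minimum spacing, in arbitrary pitch units

for spacing in (1.0, 2.0, 3.0):
    cc = coupling_cap(cc_ref, s_min, spacing)
    print(f"spacing {spacing:.0f}x: noise = {victim_noise(vdd, cc, cg)*1e3:.0f} mV")
```

Doubling the spacing halves Cc in this model, which visibly shrinks the coupled noise; shielding effectively adds to Cg, which has the same calming effect on the ratio.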

Solving hold time violations



In most scenarios, the preferred fix for hold failures is to increase the datapath delay without affecting the setup timing. If this is not possible, the negative skew component is increased (or the positive skew component reduced) while keeping setup timing in mind, but this is not normally practiced.

Also, note that unlike setup fixes, hold failures have nothing to do with the RTL, and hence the resolution has to happen at the PD stage of the design.

Increasing the datapath delay:
1)      This is achieved by adding buffers on the timing arcs of cells that are part of paths failing hold but not setup. Normally, the designer queries the setup timing on each of the input pins of the cells (not just the arcs on the failing hold path), and if enough slack is seen, hold buffers can be added at those points. The exact number of hold buffers depends on the magnitude of the failure. The library team provides buffers of appropriate sizes, characterized so that their delay does not break setup at the slow corners.
2)      For marginal hold failures that don't need a buffer/cell addition, route detours through the routing tool (for example, ICC's router) can be used. But the designer needs to be cautious of the SI components acting on the nets: at times a detour can have a positive secondary crosstalk effect, making the signal transition faster. This reduces the net delay and hence worsens hold. So, for SI hold violations, the detour should be planned with the aggressor information in hand. For base (non-SI) hold violations, a route detour simply adds delay and fixes the violation.
3)      Managing the AOCV derate values: The derate values applied should have gone through a lot of experimentation. Derates that are too optimistic can lead to hold violations. The derates should ideally be PVT-corner dependent.
4)      Cell swaps: If the cells in the path are not failing setup, they can be swapped to HVT. HVT cells have a higher threshold voltage and hence lower drive capability, so they offer more delay.
Clock skewing:
This is not preferred, as the CTS will have been built keeping setup timing in mind. But if there is enough setup slack on the path, additional CTS inverters can be added on the launch clock branch, inducing negative skew and fixing hold.
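The arithmetic behind the fixes above can be sketched in a few lines (Python; a simplified same-edge hold model, all delay values in ns and purely illustrative):

```python
# Simplified same-edge hold check: the new data must arrive at the capture
# flop no earlier than the capture clock edge plus the hold requirement.
def hold_slack(launch_latency, clk2q, data_delay, capture_latency, t_hold):
    arrival = launch_latency + clk2q + data_delay
    required = capture_latency + t_hold
    return arrival - required   # negative => hold violation

# A violating path: the capture clock arrives late (positive skew hurts hold).
slack = hold_slack(launch_latency=1.00, clk2q=0.05, data_delay=0.08,
                   capture_latency=1.15, t_hold=0.03)
print(f"before fix: {slack:+.2f} ns")   # violation

# Fix: pad the datapath with hold buffers worth 0.10 ns of delay
# (after checking, as in 1) above, that setup slack can absorb it).
slack_fixed = hold_slack(1.00, 0.05, 0.08 + 0.10, 1.15, 0.03)
print(f"after fix:  {slack_fixed:+.2f} ns")
```

The same model shows why clock skewing works: shrinking `capture_latency` (or growing `launch_latency`) improves the hold slack at the cost of setup on the same path.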

Solving Setup time violations



Somehow the datapath delay needs to be reduced to meet the setup timing requirement. If this is not possible, the other option the designer can try is to skew the clock so that the capture clock edge arrives later, giving the data more time.
An elaborate listing of the possible fixes is:
1)      Better placement: The timing path should be as linear as possible; there shouldn't be any zig-zag placement of cells. Such haphazard placement means long routes and hence more delay.
2)      Better transitions/less capacitance: The timing path must be well structured, meaning there shouldn't be cells driving loads seated at long distances; that leads to bad transitions/increased net capacitance and hence increased delay. At appropriate stages, buffers/repeaters have to be added to ensure better transition values. Note that in the latest process technologies (28nm and below), cell delay is almost comparable to net delay, so adding an extra buffer/repeater stage splits the long net and substantially reduces the effect of the bad transition.
3)      Better optimizations: The tools themselves are intelligent enough to optimize the timing path, but in some cases the tool may not honour the timing weights, so some paths may not be optimized well. The chosen cells may be of very high or very poor drive strength. If a cell is of high drive strength and the preceding stage cannot drive its load, a cap violation results and delay worsens. On the other hand, if a cell is of low drive strength, the transition on the driven net worsens and delay increases. Hence the choice of cells should ideally depend on timing-path criticality and cell placement.
4)      Logic re-structuring: Tools may perform good physical synthesis, but at times the designer has to restructure the logic manually. For example, an AND gate may have a delay of 20ps while a NAND followed by an inverter may give just 10ps. Redundant logic added/inferred by the tool can also be eliminated: unnecessary buffers and complex gate combinations (like AND-OR trees for a simple mux) can be removed.
5)      Logic replication: Gates/flops can be duplicated to cater to a large fanout. For example, if a flopped version of a reset signal goes to 100 AND gates, the flop can be replicated (assuming the D side has enough slack) so that each copy drives only a share of the load. The fanout seen by each flop drops, and the slack margins improve significantly.
6)      Proper constraints: Constraints have a direct impact on the placement of cells in a timing path, the choice of cells during physical synthesis, and timing path-group optimization. Interface paths need proper constraints to ensure the IO paths are neither over- nor under-optimized.
Path groups: The critical range and the weight set on each of the major timing path groups (like reg2reg, io_to_io, input_to_reg, reg_to_output) should be chosen carefully. If internal timing closure is the priority, then a high weight can be set on the reg2reg group.
7)      Cell swaps: If better optimization is difficult to achieve, the cells can be swapped to LVT. With their lower threshold voltage, the channel in these gates forms sooner, so they switch faster and improve the delays.
8)      Addition of latches: Inserting a negative-level-sensitive latch in a failing reg2reg path (both flops posedge), or a positive-level-sensitive latch (both flops negedge), allows us to borrow up to half a cycle and meet setup.
What is also followed in some places is to swap the flop failing setup with a positive/negative-level-sensitive latch to borrow time. But care should be taken so that data is not missed and no X is propagated.

Clock skewing:
Positive skew relaxes setup: if the paths launched from the capture flop have enough slack, the capture clock path can be delayed, relaxing setup on the failing path. This is normally done after CTS is optimized and one round of hold fixing is complete.

AOCV derates:
The derates applied should be PVT-corner specific and chosen with care.
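The setup check that all of the above fixes target, including the skew and derate knobs, can be sketched as follows (Python; a simplified single-cycle model with assumed, illustrative numbers):

```python
# Simplified single-cycle setup check with a late derate on the datapath
# and explicit launch/capture clock latencies.
def setup_slack(period, launch_latency, capture_latency,
                clk2q, data_delay, t_setup, late_derate=1.0):
    arrival = launch_latency + late_derate * (clk2q + data_delay)
    required = period + capture_latency - t_setup
    return required - arrival   # negative => setup violation

# Failing path at a 1 GHz clock with a pessimistic 8% late derate:
slack = setup_slack(period=1.0, launch_latency=0.50, capture_latency=0.50,
                    clk2q=0.06, data_delay=0.92, t_setup=0.04, late_derate=1.08)
print(f"before skewing: {slack:+.4f} ns")

# Useful skew: delaying the capture clock by 0.1 ns relaxes this check
# (provided the downstream paths and hold can absorb it).
slack_skewed = setup_slack(1.0, 0.50, 0.60, 0.06, 0.92, 0.04, 1.08)
print(f"after skewing:  {slack_skewed:+.4f} ns")
```

The same function also makes the AOCV point visible: an overly pessimistic `late_derate` can push an otherwise-passing path negative, which is why the derates deserve experimentation per corner.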

Wednesday, 24 October 2012

Decaps

Decaps are physical-only cells that are added for PG (power/ground) network stability. Proper PG distribution is important for these reasons:
1) The IR drop is critical in the PG network. Abnormal IR drops can lead to cells being powered up/switched off at undesired intervals, which might impact functionality. A delay in the switching of cells can also lead to timing issues. Just to recollect, the voltage applied at the drain of a MOS transistor governs the amount of current that flows through it.
2) Reliability: Hot spots in the chip might result from excessive voltage drops. These are observed mostly near the clock rows/clock paths and the PG network.

IR drops in the PG network are kept minimal by routing the PG nets in higher metal layers (which have lower resistance), by using advanced PG grid structures (like alternating VDD and VSS nets with every repetition), and by having more VDD and VSS pins on the standard cells.

To stabilize the PG network and counter the inductive component of the supply drop (IR + L·di/dt), decaps are used.

The decaps are placed all over the design. They store charge and supply it locally during switching. Power and ground bounce is reduced because the transient switching currents are bypassed through the decaps instead of being drawn through the full PG network.



To decide on the amount of decap ideally needed for a given design, the supply swing and noise budgets first need to be finalized. For a 10% noise budget, Cdecap = 9·Cload.
So, roughly for a 100k gate count (in NAND-gate equivalents), we need decap worth 900k gate loads, assuming fully loaded switching activity.
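The 9x figure follows from simple charge sharing between the decap and the switching load, as the short Python check below shows (the per-gate load normalization is an assumption for illustration):

```python
# Derivation of the Cdecap = 9*Cload rule of thumb: if the switching load
# draws its charge from the local decap, the supply droop from charge
# sharing is dV/V = Cload / (Cload + Cdecap). Solving for the decap that
# keeps the droop within the noise budget:
def decap_for_budget(c_load, noise_budget):
    """Decap needed so the charge-sharing droop stays within the budget."""
    return c_load * (1.0 / noise_budget - 1.0)

# 10% budget -> 9x the load capacitance:
print(decap_for_budget(1.0, 0.10))      # in units of Cload

# 100k NAND-equivalent gate loads -> 900k gate-equivalents of decap:
print(decap_for_budget(100e3, 0.10))
```

Tightening the budget to 5% would demand 19x the load capacitance, which is why noise budgets need to be finalized before the decap count.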

Saturday, 11 February 2012

PLL overview

 

The PLL (phase-locked loop) is often the most important analog component
used in SoCs and other clocked circuits. When somebody says that his/her
device works fast, what we might wonder is how fast that can be, and
whether the claim is relative. The answer lies in the performance of this entity, the PLL.


The design and the verification of its specs is indeed a big
challenge. The performance of the PLL is critical because it generates the
clocking signal required to sequence the execution of the entire chip.


I would like to present a summary of the working and the composition of
the PLL in this section.

The PLL works like a closed-loop control system.
It comprises a phase detector (which in the simplest case can be an
XOR gate comparing the phases of the input signal and the feedback
signal), a loop filter (normally a low-pass filter) and a voltage-controlled
oscillator (a VCO can, in the simplest view, be built from a Wien bridge
oscillator that amplifies the voltage and produces an output
dependent on the feedback factor determined by the two resistors).

If the output frequency is to be the same as that of the input, then
there is no need for a frequency divider in the feedback
path; otherwise a divider can be used that divides the frequency in suitable steps.

The input signal can come from any reference oscillator, such as a quartz crystal.

The design is critical in terms of the choice of the feedback resistors
and the loop filter.
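The loop described above can be captured in a minimal discrete-time model (a behavioural sketch, not a circuit simulation; all gains, frequencies and the PI loop-filter coefficients are assumed values chosen for a stable, overdamped loop):

```python
# Behavioural PLL model: phase detector -> PI loop filter -> VCO, with a
# divide-by-1 feedback. The phase error is kept unwrapped (small-signal
# linear model), so the loop pulls the VCO onto the reference frequency.
import math

f_ref  = 1.0e6      # reference (e.g. crystal) frequency, Hz (assumed)
f_free = 0.8e6      # VCO free-running frequency, Hz (assumed)
k_vco  = 2.0e5      # VCO gain, Hz per volt of control (assumed)
kp, ki = 0.5, 1e-3  # PI loop-filter gains, V/rad (ki applied per step)
dt     = 1.0e-7     # simulation time step, s

phase_err = 0.0     # unwrapped phase error, rad
v_int = 0.0         # integrator state of the loop filter
f_vco = f_free
for _ in range(20000):
    v_ctrl = v_int + kp * phase_err                   # PI loop filter
    f_vco = f_free + k_vco * v_ctrl                   # VCO
    phase_err += 2 * math.pi * (f_ref - f_vco) * dt   # phase detector
    v_int += ki * phase_err                           # integrate the error

print(f"locked VCO frequency: {f_vco/1e6:.3f} MHz")
```

With the integrator in the filter, the steady-state control voltage settles at (f_ref - f_free)/k_vco, so the VCO locks to the reference with the phase error driven toward zero.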


Tuesday, 15 November 2011

LSSD vs MUXD cell. Which is better?

The scan cells normally used in the DFT stitching process are the MUX-D cell and the
LSSD cell. In newer technologies (45nm and below), the MUX-D cell is avoided because of the combinational elements and
the inherent impact on controllability, which is a major concern in semiconductor testing.

The LSSD cell, though it offers less usable slack, is inherently a master-slave flip-flop/latch combination and is free from the effects of
combinational logic, hence better controllability and observability. Even in terms of area and power, the LSSD cell fares better.

In making the choice of which is better, it depends on the technology. MUX-D is better if we can offer just one scan clock: no phase relationships need to be maintained if the select signal is strobed. But with the MUX (again, combinational logic), there is inherent uncertainty.

The LSSD cell needs a proper non-overlapping phase relationship to be maintained between the scan clocks and the free-running clock, and between the scan clocks themselves; and of course there is a timing overhead. Apart from that, in terms of performance, it is the better choice.
//muxdcell.v
//MUX-D scan cell: a 2:1 mux (built from gates) in front of a posedge flop.
module muxdcell (d, scan_d, clk, sel, q);
input d, clk, sel, scan_d;
output q;
reg q;
wire a1, a2, a3, data;

assign a3 = ~sel;
assign a1 = sel & scan_d;  //sel enables the operation in scan mode
assign a2 = d & a3;        //functional data path when sel is low
assign data = a1 | a2;     //delay may be added to compensate the combinational overhead

always @(posedge clk)
begin
  q <= data;
end
endmodule

//lssd_cell.v
//LSSD cell: master latch (L1) plus slave latch (L2).
//scan_1/scan_2 are the non-overlapping scan clocks, clk is the free-running
//system clock, and scan_cntrl gates the scan-shift path.
module lssd (clk, q, scan_in, scan_1, scan_cntrl, scan_2, latch_output, d);
input clk, scan_1, scan_2, scan_cntrl, d, scan_in;
output reg q, latch_output;
wire shift;

assign shift = scan_cntrl & scan_1; //scan shift only when scan_cntrl is asserted

//L1 master latch: transparent to scan_in while shifting, or to d while
//clk is high (functional capture). clk and scan_1 must be non-overlapping.
always @(clk or shift or d or scan_in)
begin
  if (shift)
    latch_output <= scan_in;
  else if (clk)
    latch_output <= d;
end

//L2 slave latch: transparent to the L1 output while scan_2 is high;
//q doubles as the scan-out. scan_2 must not overlap scan_1.
always @(scan_2 or latch_output)
begin
  if (scan_2)
    q <= latch_output;
end
endmodule