## SYNOPSYS

## Addressing SDC challenges with Silicon Lifecycle Management

Jyotika Athavale October 6, 2023

## BACKGROUND



Moore's Law → More than Moore → SysMoore

- High levels of complexity in data center and automotive systems
- Technology scaling
- Integration of numerous devices and software with vulnerabilities
- Increased threat surface
- Users are demanding better guarantees of safe operations of devices, software, and systems
- Today's system diagnostics are inadequate

#### FAULT CLASSIFICATION

"Emerging Fault Modes: Challenges and Research Opportunities" S. Gurumurthi et al



### SILENT DATA CORRUPTION

Silent Data Corruption (SDC) refers to data errors that go undetected by the overall system. An SDC can result in any of the following:

- System or application crash or hang
- Change in the output of the application (miscompare)
- The error may be masked altogether

# **SDC CAUSES & CONSIDERATIONS**

- Sources include permanent, intermittent, transient, and degrading faults.
  - Root causes can be extrinsic manufacturing defects, intrinsic silicon aging, or transient errors.
  - Severe defects are easily detected during manufacturing test.
  - Weak defects can create circuit marginalities that fail only under certain operating conditions.
  - Latent defects are not symptomatic until the components have been operational for a certain duration.
  - Common mitigations rely on resiliency in HW and SW to detect and correct these errors, for example dual modular redundancy (DMR) and triple modular redundancy (TMR).
    However, these approaches are expensive and not sustainable.
  - Error mechanisms often manifest in the field as timing issues, so the best predictor of potential errors is reduced timing margin.
    - Monitoring environmental changes in the silicon and application stress, and tracking timing-margin changes for memory and logic paths over time, allows an SDC error to be predicted before it manifests.
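
To make the DMR and TMR mitigations mentioned above concrete, here is a minimal software sketch. It assumes three replicas of the same computation whose results can be compared directly; the function names `dmr_check` and `tmr_vote` are hypothetical and are not part of any Synopsys product or library.

```python
from collections import Counter
from typing import Sequence, TypeVar

T = TypeVar("T")


def dmr_check(a: T, b: T) -> T:
    """Dual modular redundancy: detects a mismatch but cannot correct it."""
    if a != b:
        raise RuntimeError("DMR miscompare: possible silent data corruption")
    return a


def tmr_vote(results: Sequence[T]) -> T:
    """Triple modular redundancy: return the value at least two replicas agree on."""
    if len(results) != 3:
        raise ValueError("TMR expects exactly three results")
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("TMR: no majority, all three results disagree")
    return value


# Hypothetical example: the second replica returns a result with one flipped bit.
good = sum(range(10))
voted = tmr_vote([good, good ^ 0x4, good])
print(voted)
```

The sketch also shows why these schemes are costly: every result is computed two or three times, which is the overhead the slide calls expensive and not sustainable.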

## DEGRADING FAULTS

A degrading fault exhibits characteristics that degrade over time and can result in an error.

Degrading faults can result in changes to circuit behavior before leading to a failure.

The degradation could be due to expected aging (intrinsic) or due to defects (extrinsic).

Intrinsic faults can be considered systematic, whereas extrinsic faults are primarily classified as random.

Extrinsic faults are caused by defects introduced by external sources during manufacturing.

A prognostics solution that monitors timing paths can be used to detect degrading faults and predict Remaining Useful Life (RUL) before a failure manifests.

#### **Traditional Planar Vs 3D FinFET**

- The use of FinFET transistors enables smaller process geometries and faster processing.
- However, it also changes the failure mode susceptibility characteristics compared to traditional planar transistor technologies found in 28nm and larger process technologies.



Schematic representations of a planar transistor (left) compared to a FinFET (right).

# FAULT MODEL TRENDS



NMOS gate leakage increase with stress is less in 22nm FinFETs than in 32nm planar technology. EOL leakage is matched between the technologies. [5]

#### Fault Model trends: BTI



BTI comparison between 22nm FinFET and 32nm planar transistors. NMOS is significantly improved due to gate optimization, oxide scaling, and work function tuning.



Normalized degradation of 32nm and 22nm nMOSFETs as a function of drain voltage. Results show that the reduction in channel length at low drain bias improves hot carrier-induced stress in the 22nm node compared to the 32nm node.

#### FinFET processes exhibit significant improvement in intrinsic silicon aging mechanisms, compared to planar


"Advanced CMOS Reliability Update: Sub 20 nm FinFET Assessment," Sandia National Labs, Walraven J. et al

#### Fault Model trends: TDDB

TDDB figure; axis label: Stress Voltage [V].

#### Fault Model trends: Gate Leakage

# INFANT MORTALITY (IM) FAILURES SHIFT RIGHT



- On sub-20nm process technologies, as wearout-related failures are reduced, the failing signatures of degrading (extrinsic) defects are observed continuing past the IM phase and into the useful-life region of the bathtub curve.
  - Prognostics capabilities can be used to:
    - Detect degrading faults before they manifest as failures (permanent faults / SDC) by monitoring whether Vmin exceeds a pre-defined threshold
    - Calculate the RUL based on the measured rate of degradation
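
The two prognostics steps above (threshold check on Vmin, then RUL from the rate of degradation) can be sketched as follows. This is an illustration under stated assumptions, not the Synopsys SLM algorithm: it assumes Vmin drifts roughly linearly with operating hours, fits the drift with ordinary least squares, and extrapolates to the threshold. The function names, sample values, and the 750 mV threshold are all hypothetical.

```python
from typing import Optional, Sequence


def degradation_rate(hours: Sequence[float], vmin_mv: Sequence[float]) -> float:
    """Least-squares slope of Vmin versus time, in mV per hour."""
    n = len(hours)
    if n < 2 or n != len(vmin_mv):
        raise ValueError("need at least two matched samples")
    mean_t = sum(hours) / n
    mean_v = sum(vmin_mv) / n
    sxx = sum((t - mean_t) ** 2 for t in hours)
    if sxx == 0:
        raise ValueError("samples must span more than one time point")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(hours, vmin_mv))
    return sxy / sxx


def remaining_useful_life(
    hours: Sequence[float], vmin_mv: Sequence[float], threshold_mv: float
) -> Optional[float]:
    """Hours until the fitted Vmin trend reaches the threshold.

    Returns 0.0 if the latest sample has already reached the threshold,
    and None if Vmin is not trending upward (no degradation observed).
    """
    if vmin_mv[-1] >= threshold_mv:
        return 0.0
    rate = degradation_rate(hours, vmin_mv)
    if rate <= 0:
        return None
    return (threshold_mv - vmin_mv[-1]) / rate


# Hypothetical Vmin readings taken every 1000 operating hours.
hours = [0.0, 1000.0, 2000.0, 3000.0]
vmin_mv = [700.0, 705.0, 710.0, 715.0]
rul = remaining_useful_life(hours, vmin_mv, threshold_mv=750.0)
print(f"estimated RUL: {rul:.0f} hours")
```

With these made-up readings the fitted drift is 5 mV per 1000 hours, so the remaining 35 mV of margin corresponds to roughly 7000 hours. A linear model is the simplest possible choice; real aging mechanisms may follow other laws, so the model form is itself an assumption.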

## Manifestation of a Degrading Fault



The IEEE 1856 model introduces three metrics:

- **Response Time** for the predictive algorithm, defined as the time between fault detection and the first correct prediction of RUL
- **Prognostic Distance**, defined as the time between the correct prediction and the occurrence of a failure
- **Prognostic System Accuracy**, defined as the difference between the predicted failure time and the actual failure time

IEEE Std 1856-2017, IEEE Standard Framework for Prognostics and Health Management of Electronic Systems
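
As a worked illustration of the three metrics, the sketch below computes them from four timestamps, following the definitions as worded on this slide rather than the text of the standard. The `PrognosticEvents` class, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PrognosticEvents:
    """Timestamps in operating hours (hypothetical field names)."""
    fault_detected_h: float            # degrading fault first detected
    first_correct_prediction_h: float  # first correct RUL prediction issued
    predicted_failure_h: float         # failure time given by that prediction
    actual_failure_h: float            # failure time actually observed


def response_time(e: PrognosticEvents) -> float:
    """Time between fault detection and the first correct RUL prediction."""
    return e.first_correct_prediction_h - e.fault_detected_h


def prognostic_distance(e: PrognosticEvents) -> float:
    """Time between the correct prediction and the occurrence of the failure."""
    return e.actual_failure_h - e.first_correct_prediction_h


def prognostic_accuracy_error(e: PrognosticEvents) -> float:
    """Predicted failure time minus actual failure time (negative = predicted early)."""
    return e.predicted_failure_h - e.actual_failure_h


events = PrognosticEvents(
    fault_detected_h=1000.0,
    first_correct_prediction_h=1200.0,
    predicted_failure_h=5000.0,
    actual_failure_h=5300.0,
)
print(response_time(events), prognostic_distance(events), prognostic_accuracy_error(events))
```

In this made-up case the algorithm needs 200 hours to produce a correct prediction, leaves 4100 hours in which to act, and predicts the failure 300 hours earlier than it occurs.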



ISO TR 9839 - Application of predictive maintenance to hardware

#### Silicon Lifecycle Management: Monitor, Collect, Analyze & Act



Silicon Lifecycle Management IP enables capabilities to:

1. Monitor the health of the part
2. Detect symptoms of a degrading fault
3. Predict an SDC error before it occurs
4. Take the necessary corrective action to improve availability

### Silicon Lifecycle Management (SLM)



Cloud, On-Prem, Edge, Embedded

### **SLM Use Cases**

SLM helps improve silicon health at critical stages within the device lifecycle



## SLM Silicon.da

Silicon insights and analytics from design through manufacturing

- Synopsys Silicon.da production analytics spans the design through product manufacturing phases.
- It automatically highlights silicon data outliers, enabling engineering teams to quickly identify and correct underlying issues in design and manufacturing.
- It boosts productivity by consolidating analytics across all manufacturing phases within a single environment.
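
To give a sense of what highlighting outliers in silicon data can involve, here is a generic sketch using the modified z-score (median and median absolute deviation). This is a common textbook technique chosen for illustration; the slides do not say which method Silicon.da uses, and the function name, cutoff, and sample readings are assumptions.

```python
from statistics import median
from typing import List, Sequence


def mad_outliers(values: Sequence[float], cutoff: float = 3.5) -> List[int]:
    """Return indices whose modified z-score magnitude exceeds `cutoff`."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure deviations against
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - med) / mad) > cutoff
    ]


# Hypothetical per-die ring oscillator frequencies in MHz; one die is slow.
freq_mhz = [100.1, 99.8, 100.3, 100.0, 99.9, 92.0, 100.2]
print(mad_outliers(freq_mhz))
```

The median-based score is used here instead of a mean and standard deviation because a single extreme die would otherwise inflate the spread estimate and hide itself.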





### SLM Monitoring IP

Throughout the Silicon Lifecycle



#### SLM PMM (Path Margin Monitor)



#### SLM PVT Monitor IP




#### SLM CDM (Clock & Delay Monitor)

The SLM solution also includes an AXI monitor, a signal monitor, ring oscillators, Test & Repair, and ECC.

# Error Estimation: Remaining Useful Life (RUL)



#### **Corrective Action**



Using the RUL calculated by the Synopsys SLM solution, we can identify the point at which a component or system is likely to fail and take action to prevent the failure.



This makes it possible to improve the system's reliability and availability by identifying potential issues early, before they lead to an SDC event.



This can help hyperscalers achieve their metrics while reducing maintenance costs and improving overall operational efficiency.
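
One way to picture the corrective-action step is a simple policy that maps an RUL estimate to an operational decision. The sketch below is purely illustrative: the action names and the 24-hour and 720-hour thresholds are hypothetical and would in practice depend on the operator's maintenance process.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    CONTINUE = "continue monitoring"
    SCHEDULE_MAINTENANCE = "schedule maintenance"
    REMOVE_FROM_SERVICE = "drain workloads and remove from service"


def choose_action(
    rul_hours: Optional[float],
    critical_h: float = 24.0,            # hypothetical: too close to failure to wait
    maintenance_window_h: float = 720.0, # hypothetical: next planned maintenance window
) -> Action:
    """Pick a corrective action from an RUL estimate (None = no degradation seen)."""
    if rul_hours is None:
        return Action.CONTINUE
    if rul_hours <= critical_h:
        return Action.REMOVE_FROM_SERVICE
    if rul_hours <= maintenance_window_h:
        return Action.SCHEDULE_MAINTENANCE
    return Action.CONTINUE


for estimate in (None, 7000.0, 300.0, 6.0):
    print(estimate, "->", choose_action(estimate).value)
```

Acting on a prediction in a planned window, rather than reacting to a failure, is what links the RUL estimate to the availability and maintenance-cost benefits described above.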

# SUMMARY

- The growing challenge of Silent Data Corruption calls for increased resiliency of hardware components, with enhanced RAS capabilities for HPC and mission-critical use cases.
- With technology scaling, integration, an increased threat surface, multi-die packages, and safety-critical applications, we need to consider mitigations in design, architecture, and test, and employ best practices throughout the hardware device lifecycle.
- To meet these challenges, Silicon Lifecycle Management (SLM) solutions will be critical to improving silicon health and operational metrics.
- Silicon Lifecycle Management IP supports the performance and resiliency needs of high-computation workloads and provides the monitoring and detection capabilities needed to greatly enhance manufacturing quality and product integrity in the field.