Cell Coprocessor Acceleration (CPA) at IBM for Computational Lithography

G. J. Dick ¹, J. Allen ¹, L. Biskup ²,
T. Dunham ¹, J. Rogich ¹, J. Russell ²

¹ IBM Semiconductor Research and Development Center
² IBM Systems and Technology Group, Operations
Computational Acceleration: Mentor Calibre™ nmOPC & Cell Broadband Engine™
Optical Proximity Correction (OPC)
What Is It and Why Is It Needed?
Increasing Lithography Imaging Complexity

**Optical Proximity Correction (OPC):**
Anticipate and compensate for sub-wavelength optical lithography distortions in manufacturing.
Need for Optical Proximity Correction

This is what the designer would like to see on the wafer.

The brown hatched area is what would be on the wafer if the design is not corrected.

Severe discrepancies exist between the actual wafer image and the designer’s intent.

The solid shapes are the result of Optical Proximity Correction.

The blue hatched area shows what would be on the wafer when the Optical Proximity Corrected shapes are used to make the mask.

Comparison of the pre-OPC (brown hatched) and post-OPC (blue hatched) wafer images with the design intent (shaded gray).

The wafer image now more closely matches the designer's intent.
Hardware Acceleration of OPC
Mentor Graphics + Mercury Computer Systems + IBM Partnership
A comprehensive three-company value proposition

**Mentor Graphics**

- Calibre® nmOPC
  - Dense image simulation
  - Co-Processor Acceleration
  - Hierarchy engine
  - New resist process model
  - Process window correction
  - Design intent awareness
- Application software support

**Mercury Computer Systems**

- MultiCore™ Plus middleware
- Software integration
- Algorithm optimization
- FFT library optimization
- CPA cluster integration and test
- Performance and tuning optimization
- IBM HW and HW support sale

**IBM**

- Cell Broadband Engine™ Architecture
- HW warranty and single-point-of-service
  - CPA standalone, or
  - Hybrid x86 + CPA cluster system
- High density computing
- Data center services

The value of the whole is greater than the sum of the parts
Mentor Calibre® nmOPC - Convergence of Technologies: **Dense Imaging Simulation and Hardware Acceleration for OPC**

**Why Transition from Sparse to Dense Imaging?**

*Grid-based simulation more efficient with increasing layout density*

65nm sparse simulation  
45nm sparse simulation  
45nm dense simulation

**Dense vs. Sparse Imaging**

- Computational efficiency
- Better model accuracy
- Support more complex shapes
- Designer intent
OPC Computational Acceleration: How it Works

Implementation by Mentor Graphics and Mercury

- Master / Dispatcher: Work Management
- x86 cores (four shown): Measurement, Cost Function, Edge Movement
- Cell/B.E. processor (one PPE and seven SPEs shown): Rasterization, Simulation, Contouring

4:1 ratio of x86 cores to Cell/B.E. processors

Enabled with Mercury’s MultiCore Framework (intra-Cell) and Parallel Acceleration System (inter-Cell) middleware
How OPC Works in Dense Imaging

Iterative Model-based OPC: A Great Fit for Cell/B.E.

- OPC simulations using FFTs are ideally suited for Cell/B.E.
  - >80% OPC run time consumed in simulation
  - OPC Aerial Image Calculation [1K x 1K Fourier Transform]:
    - 150ms - 3.2GHz Opteron
    - 1.2ms - Cell BE Processor (simulated)
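The two figures above (more than 80% of run time in simulation, and 150 ms versus 1.2 ms per transform) can be combined into a rough bound on whole-run speedup using Amdahl's law. The sketch below is illustrative only. It assumes the accelerated fraction is exactly 0.8 (the slide gives only a lower bound) and that the transform timing ratio applies to the entire simulation portion; neither assumption is stated in the slides.

```python
def amdahl_speedup(accelerated_fraction, kernel_speedup):
    """Overall speedup when only `accelerated_fraction` of run time is sped up."""
    serial_fraction = 1.0 - accelerated_fraction
    return 1.0 / (serial_fraction + accelerated_fraction / kernel_speedup)


# Timings quoted on the slide for one 1K x 1K Fourier transform.
opteron_ms = 150.0
cell_ms = 1.2
kernel_speedup = opteron_ms / cell_ms  # about 125x

# Assumption: exactly 80% of run time is simulation (the slide says ">80%").
simulation_fraction = 0.8

overall = amdahl_speedup(simulation_fraction, kernel_speedup)
# Limit of the overall speedup as the kernel speedup grows without bound.
ceiling = 1.0 / (1.0 - simulation_fraction)

print(f"kernel speedup  : {kernel_speedup:.0f}x")
print(f"overall speedup : {overall:.2f}x (ceiling {ceiling:.1f}x)")
```

Under these assumptions the work left on the x86 cores dominates, so the whole-run gain is much smaller than the transform-level gain; a larger simulation share raises the ceiling.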

- Mask Layout → Anti-aliasing & Rasterization → Raster Mask Image → Fast Fourier Transform (FFT) → Mask Image in Frequency Domain
- Convolve with Imaging Model (Pointwise Multiplication) → Inverse Fourier Transforms → Image Contour
- Move Mask Edge → back to Mask Layout (iterate)

Up to 100X Speedup
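The simulation step in the loop above (transform, pointwise multiplication, inverse transform, contour) can be sketched in a few lines. This is a minimal 1-D illustration in plain Python, not the Calibre or Mercury implementation: the slides describe 1K x 1K 2-D transforms, and the kernel values, the 0.5 contour threshold, and all function names here are hypothetical.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); length must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(values):
    """Inverse FFT with the 1/n normalization applied."""
    n = len(values)
    return [v / n for v in fft(values, inverse=True)]


def simulate_image(raster_mask, kernel):
    """Circular convolution: FFT, pointwise multiplication, inverse FFT."""
    mask_f = fft(raster_mask)
    kernel_f = fft(kernel)
    image_f = [m * k for m, k in zip(mask_f, kernel_f)]
    return [v.real for v in ifft(image_f)]


def direct_circular_convolution(signal, kernel):
    """Reference O(n^2) circular convolution used to check the FFT path."""
    n = len(signal)
    return [sum(signal[j] * kernel[(i - j) % n] for j in range(n))
            for i in range(n)]


# Rasterized 1-D mask: a line 4 pixels wide on a 16-pixel grid.
raster_mask = [0.0] * 6 + [1.0] * 4 + [0.0] * 6
# Hypothetical symmetric blur kernel centered on index 0 (wraps around).
kernel = [0.4, 0.2, 0.1] + [0.0] * 11 + [0.1, 0.2]

image = simulate_image(raster_mask, kernel)
reference = direct_circular_convolution(raster_mask, kernel)
max_error = max(abs(a - b) for a, b in zip(image, reference))

# Contour step: pixels at or above a hypothetical 0.5 intensity threshold.
printed = [i for i, v in enumerate(image) if v >= 0.5]
print(f"max error vs direct convolution: {max_error:.2e}")
print(f"printed pixels: {printed}")
```

The frequency-domain path costs O(n log n) per transform instead of O(n^2) for direct convolution, which is why the transform dominates the simulation cost and is the natural target for acceleration.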
Example Higher Density Computing

Performance / Sq. Ft. / Watt / $

With CPA, at equal runtime: **7.4X GFLOPs/Watt** and **7.2X GFLOPs/Sq. Ft.**

**Without CPA**

- 912 x86 Linux Cores
- 228 1U 2S Dual-Core Servers
- 5.43 Racks
- 90.5 KW = 16.7 KW/rack (397 W/server)
- ~5,472 SP GFLOPS

**With CPA** (x86 portion)

- 256 x86 Linux Cores
- 64 1U 2S Dual-Core Servers
- 1.52 Racks
- 25.4 KW = 16.7 KW/rack (397 W/server)
- ~1,536 SP GFLOPS

**With CPA** (Cell/B.E. portion)

- 64 Cell BE Processors
- 32 Cell BE Blades
- 3 BladeCenter H Chassis
- 0.64 Rack
- 12.6 KW = 19.7 KW/rack (5,527 W/chassis = 395 W/server)
- ~13,120 SP GFLOPS

**Typical 4:1 ratio of x86 cores to Cell/B.E. processors**

---

<table>
<thead>
<tr>
<th></th>
<th>From</th>
<th>To</th>
<th>Advantage</th>
</tr>
</thead>
<tbody>
<tr>
<td>SP GFLOPS</td>
<td>5,472</td>
<td>14,656</td>
<td>+9,184 (+168%)</td>
</tr>
<tr>
<td>KW</td>
<td>90.5</td>
<td>38.0</td>
<td>-52.5 (-58%)</td>
</tr>
<tr>
<td>Racks</td>
<td>5.43</td>
<td>2.16</td>
<td>-3.27 (-60%)</td>
</tr>
<tr>
<td>x86 Cores</td>
<td>912</td>
<td>256</td>
<td>656 Freed Up</td>
</tr>
</tbody>
</table>

CPA = Mentor Calibre nmOPC CoProcessor Acceleration
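The From/To/Advantage columns in the table follow directly from the per-configuration figures on this slide. The short sketch below recomputes them as a check; the dictionary keys and helper names are illustrative, and the input numbers are copied from the slide.

```python
# Figures copied from the slide.
without_cpa = {"gflops": 5472, "kw": 90.5, "racks": 5.43, "x86_cores": 912}
with_cpa_x86 = {"gflops": 1536, "kw": 25.4, "racks": 1.52, "x86_cores": 256}
with_cpa_cell = {"gflops": 13120, "kw": 12.6, "racks": 0.64, "x86_cores": 0}


def combine(a, b):
    """Sum two configurations field by field (rounded to hide float noise)."""
    return {key: round(a[key] + b[key], 2) for key in a}


def percent_change(old, new):
    """Relative change from old to new, as a whole-number percentage."""
    return round((new - old) / old * 100)


with_cpa = combine(with_cpa_x86, with_cpa_cell)

for key in ("gflops", "kw", "racks"):
    change = percent_change(without_cpa[key], with_cpa[key])
    print(f"{key:7s} {without_cpa[key]:>8} -> {with_cpa[key]:>8} ({change:+d}%)")

freed_cores = without_cpa["x86_cores"] - with_cpa["x86_cores"]
print(f"x86 cores freed up: {freed_cores}")
```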
Example Higher Density Computing

**Performance / Sq. Ft. / Watt / $**

<table>
<thead>
<tr>
<th></th>
<th>Without CPA</th>
<th>With CPA</th>
</tr>
</thead>
<tbody>
<tr>
<td>SP GFLOPS</td>
<td>624</td>
<td>5,954</td>
</tr>
<tr>
<td>KW</td>
<td>10.3</td>
<td>15.4</td>
</tr>
<tr>
<td>Racks</td>
<td>0.62</td>
<td>0.83</td>
</tr>
</tbody>
</table>

Typical 4:1 ratio of x86 cores to Cell/B.E. processors

**3.2X Speedup**

**Without CPA**

- 104 x86 Linux Cores
- 26 1U 2S Dual-Core Servers
- 0.62 Racks
- 10.3 KW = 16.6 KW/rack (397 W/server)
- ~624 SP GFLOPS

**With CPA** (x86 portion)

- 104 x86 Linux Cores
- 26 1U 2S Dual-Core Servers
- 0.62 Racks
- 10.3 KW = 16.6 KW/rack (397 W/server)
- ~624 SP GFLOPS

**With CPA** (Cell/B.E. portion)

- 26 Cell BE Processors
- 13 Cell BE Blades
- 1 BladeCenter H Chassis
- 0.21 Rack
- 5.1 KW = 24.3 KW/rack (5,527 W/chassis = 395 W/server)
- ~5,330 SP GFLOPS

Calibre® nmOPC - Lithography Run Time and Speedup

These are some of our very first results. They were run at the Mentor Graphics Corporation facility in Marlborough, MA, while we were bringing up our installation at IBM Burlington.
IBM’s Computational Lithography Experience Using Cell Coprocessor Acceleration (CPA)
Configuration of Accelerated Cluster

- QS21 Cell blades: 126 (Cell/B.E. processors: 252)
- x86 remote cores: Over 1000 available
  - A standard OPC run uses 120 x86 cores
- Standard ratio of x86 remote cores to Cell processors: 4:1
  - Standard accelerated run uses 30 Cell processors
  - The x86/Cell chip ratio can be varied.
- Memory per x86 remote core: 3GB
- Master: System p P4 Regatta, 32-way with 256GB memory
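The sizing arithmetic in this configuration (two Cell/B.E. processors per QS21 blade, and a 4:1 ratio of x86 cores to Cell processors) can be written out as a small helper. This is an illustrative sketch: the function names are hypothetical, and rounding up when the core count does not divide evenly is an assumption, since the slide only gives the 120-core case.

```python
import math

# From the slide: 126 QS21 blades hold 252 Cell/B.E. processors.
CELL_PROCESSORS_PER_QS21_BLADE = 2


def cell_processors_for(x86_cores, ratio=4):
    """Cell/B.E. processors for a run at the given x86:Cell ratio (rounded up)."""
    return math.ceil(x86_cores / ratio)


def blades_for(cell_processors):
    """QS21 blades needed to supply that many Cell/B.E. processors."""
    return math.ceil(cell_processors / CELL_PROCESSORS_PER_QS21_BLADE)


standard_cells = cell_processors_for(120)      # standard run: 120 x86 cores
standard_blades = blades_for(standard_cells)
cluster_cells = 126 * CELL_PROCESSORS_PER_QS21_BLADE

print(f"standard run : {standard_cells} Cell processors on {standard_blades} blades")
print(f"cluster total: {cluster_cells} Cell processors")
```

Passing a different `ratio` covers the case noted above where the x86/Cell ratio is varied.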
A 32nm Testsite Floor Plan

Chiplet #1
20.054 x 20.054mm

Chiplet #2
20.054 x 5.4720mm

Chiplet #3
20.054 x 20.054mm

Chiplet #4
10.402 x 10.054mm

Chiplet #5
10.402 x 15.472mm
32nm Testsite’s Chiplets’ Total Elapsed Times

120/0 is 120 x86 cores and no Cell B.E. Chips (no acceleration)
120/30 is 120 x86 cores and 30 Cell B.E. Chips (4:1 ratio)

Size (mm): ~1.5 x 26, ~20 x 5.5, ~20 x 20, ~10.4 x 10, ~10.4 x 15.5
32nm Testsite’s Chiplets’ Simulation Elapsed Times

120/0 is 120 x86 cores and no Cell B.E. Chips (no acceleration)
120/30 is 120 x86 cores and 30 Cell B.E. Chips (4:1 ratio)

![Bar chart showing simulation elapsed times for different chiplet sizes and configurations.]

Size (mm): ~1.5 x 26, ~20 x 5.5, ~20 x 20, ~10.4 x 10, ~10.4 x 15.5