## **JAIST Repository**

https://dspace.jaist.ac.jp/

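| Title        | A Study on Temporal and Spatial Redundancy Optimization for Efficient Fault-Tolerant Integrated Circuits (効率的耐故障集積回路のための時間・空間冗長最適化に関する研究) |
|--------------|------------------------------------|
| Author(s)    | Oh, Junghoon (呉, 政訓)               |
| Citation     |                                    |
| Issue Date   | 2018-03                            |
| Type         | Thesis or Dissertation             |
| Text version | ETD                                |
| URL          | http://hdl.handle.net/10119/15322  |
| Rights       |                                    |
| Description  | Supervisor: Mineo Kaneko (金子 峰雄), School of Information Science, Doctor |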



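| Item | Details |
|------|---------|
| Name | Junghoon Oh |
| Degree | Doctor (Information Science) |
| Degree number | 博情第 385 号 (No. 385) |
| Date conferred | March 23, 2018 (Heisei 30) |
| Dissertation title | Temporal and Spatial Redundancy Optimization for Efficient Fault-Tolerant LSIs |
| Chief examiner | Mineo Kaneko (金子 峰雄), Professor, Japan Advanced Institute of Science and Technology |
| Examiner | Yasushi Inoguchi (井口 寧), Professor, Japan Advanced Institute of Science and Technology |
| Examiner | Kiyofumi Tanaka (田中 清史), Associate Professor, Japan Advanced Institute of Science and Technology |
| Examiner | Tomoo Inoue (井上 智生), Professor, Graduate School of Information Sciences, Hiroshima City University |
| Examiner | Kazuhito Ito (伊藤 和人), Professor, Graduate School of Science and Engineering, Saitama University |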

## Abstract of the Dissertation

As VLSI feature sizes shrink, several reliability issues have become more pronounced.

Among these issues, soft errors are one of the dominant causes of faults in modern VLSIs. A soft error is a transient fault triggered by cosmic-ray-induced neutrons or by alpha rays emitted from radioactive contaminants in IC package materials. A soft error lasts only a short time, but it can affect several spatial points simultaneously.

Approaches to dealing with soft errors fall roughly into three groups: i) device-level approaches, such as selecting IC packaging materials and improving well structures; ii) circuit-level approaches, such as flip-flops with additional circuitry for error detection, error recovery, and error avoidance; and iii) system-level approaches based on multiple module redundancy, such as concurrent error detection and triple modular redundancy. Most semiconductor designs rely on computer-aided design systems, and from the viewpoint of optimization it is important to build reliability into semiconductor devices at a higher design level. Nevertheless, no approach confined to a single level is dominant for soft-error tolerance, and such a single-level scheme imposes high overheads in power, performance, and chip area, so a higher-level approach should be assisted by approaches at other levels. This research therefore assumes that several approaches are implemented across distinct levels, although the author focuses on the system-level approach via high-level synthesis. As a result of high-level synthesis, fault tolerance is implemented in datapath circuits at the register-transfer level (RTL).

This dissertation proposes a method to synthesize application-specific soft-error-tolerant datapaths via high-level synthesis. To guarantee reliability and strong real-time behavior, the proposed method is based on triple redundancy of computation algorithms, which are realized as datapath circuits. The method reduces the time overhead caused by redundancy while producing datapaths that tolerate soft errors affecting multiple components. To mitigate time overhead, the error detection parts (comparison) and the error correction parts (retry) share resources speculatively (speculative resource sharing). Operations in the retry parts are not executed as long as no error is detected, and during this period the resources bound to those operations are idle. If the probability that soft errors recur within a short period is assumed to be low enough, operations executed as a retry and other operations executed at the same time can share resources.
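The comparison and retry behavior described above can be illustrated with a minimal Python sketch. It is a software analogy only, not the dissertation's synthesis method: the function name `compare_and_retry` and the `faulty_copy` fault-injection parameter are hypothetical, and the sketch assumes that at most one copy is corrupted, in line with the assumption that soft errors rarely recur within a short period.

```python
from typing import Callable, Optional, Tuple


def compare_and_retry(op: Callable[[int], int], x: int,
                      faulty_copy: Optional[int] = None) -> Tuple[int, bool]:
    """Run two copies of `op`, compare them, and retry only on a mismatch.

    `faulty_copy` is a hypothetical fault-injection knob: the copy with that
    index (0, 1, or 2) has its least significant bit flipped to mimic a
    transient soft error. Returns (result, retry_was_executed).
    """
    def run(copy_id: int) -> int:
        y = op(x)
        return y ^ 1 if copy_id == faulty_copy else y

    first, second = run(0), run(1)
    if first == second:
        # No error detected: the retry copy is never executed, so the
        # resource reserved for it is free for other operations.
        return first, False

    # Mismatch detected: execute the third copy (the retry) and keep the
    # value that two of the three copies agree on.
    third = run(2)
    if third == first:
        return first, True
    if third == second:
        return second, True
    raise RuntimeError("all three copies disagree; assumption violated")
```

In the fault-free case the call returns after two copies, which is the situation in which the retry resources sit idle and can be shared speculatively.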

To use hardware resources more efficiently, the tripled data flow graph of a computation algorithm is partitioned into several connected subgraphs. The more comparison operations are selected, the more subgraphs the graph is partitioned into. Latency of the datapath can then improve, because fine-grained subgraphs satisfy the speculative resource sharing condition more easily than coarse-grained ones. On the other hand, latency may grow as the number of comparison operations increases. Deciding where to insert comparison operations is therefore an important factor in design optimization.
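The relationship between the number of comparison operations and the number of subgraphs can be sketched with a toy model. The assumption here, which is not taken from the dissertation, is that inserting a comparison on an operation's output cuts every data flow edge leaving that operation, and that the subgraphs are the connected components that remain. The function name `partition_dfg` and the node names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def partition_dfg(nodes: Iterable[str],
                  edges: Iterable[Tuple[str, str]],
                  check_nodes: Set[str]) -> List[Set[str]]:
    """Split a data flow graph into connected subgraphs.

    Toy assumption: a comparison placed on the output of a node in
    `check_nodes` cuts every edge that leaves that node.
    """
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for src, dst in edges:
        if src in check_nodes:
            continue  # edge is cut by the comparison on src's output
        adjacency[src].add(dst)
        adjacency[dst].add(src)

    seen: Set[str] = set()
    parts: List[Set[str]] = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        component: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        parts.append(component)
    return parts


# A four-operation chain: a -> b -> c -> d
chain_nodes = ["a", "b", "c", "d"]
chain_edges = [("a", "b"), ("b", "c"), ("c", "d")]
coarse = partition_dfg(chain_nodes, chain_edges, set())
fine = partition_dfg(chain_nodes, chain_edges, {"b"})
```

With no comparison the chain stays one subgraph, and checking the output of `b` splits it into `{a, b}` and `{c, d}`. Each added comparison yields finer subgraphs but also adds a comparison operation to schedule, which is the trade-off the paragraph describes.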

To avoid excessive fault tolerance and to mitigate the time overhead of executing retry schemes, the author introduces a spatial/temporal adjacency constraint between datapath components, based on the locality of soft errors. If a single soft error is bounded in space and time, so that it cannot cause errors in multiple components, several components can be executed at the same time. Majority-voting (M-V) schemes have the disadvantage that the third copies must always be executed, whereas the third copies in comparison-and-retry (C-R) schemes need not be executed as long as no error has occurred. Moreover, M-V mechanisms require more hardware resources, although datapaths that implement them can achieve small latency. With the adjacency constraint, an M-V mechanism can be used for error masking and correction instead of a C-R mechanism, and the three copies in every subgraph can be executed concurrently. An advantage of C-R based error correction, on the other hand, is that it can reduce power consumption, because the retry parts stay idle when no error is detected. Nevertheless, executing the retry parts causes time overhead, which is a drawback of C-R mechanisms. To combine the strengths of both, the author proposes a combination of the two error correction schemes. In addition, a heuristic algorithm to find a latency-optimized integration of the two schemes is suggested.
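The contrast between the two schemes can be made concrete with a small Python sketch. The bitwise majority function is the standard voter used in triple modular redundancy. The helper `copies_executed` is a hypothetical illustration of the cost difference stated above, not a model taken from the dissertation.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three copies: each output bit takes the value
    that at least two of the three inputs agree on."""
    return (a & b) | (b & c) | (a & c)


def copies_executed(scheme: str, error_detected: bool) -> int:
    """Number of redundant copies that run under each scheme.

    M-V always executes all three copies. C-R executes two copies and
    adds the third (the retry) only when the comparison detects an error.
    """
    if scheme == "M-V":
        return 3
    if scheme == "C-R":
        return 3 if error_detected else 2
    raise ValueError(f"unknown scheme: {scheme}")


# One corrupted copy is masked by the voter without any retry step.
masked = majority_vote(0b1010, 0b1010, 0b0110)
```

M-V masks the error in a single pass at the price of always running three copies, while C-R saves the third copy in the common fault-free case and pays with a retry when an error occurs.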

The proposed method is found to reduce latency in several different applications without degrading reliability or chip area, compared with a conventional C-R scheme. In addition, the experimental results show that the proposed method is more effective when the computation algorithm has higher parallelism and only a small number of resources is available.

## Summary of the Dissertation Review Results

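Advances in microfabrication and semiconductor device technology have made very large, high-speed integrated circuits (LSIs) possible. At the same time, operational errors that occur when an LSI is struck by alpha rays or neutron rays (soft errors) have become a major problem. As the IoT era arrives and all kinds of equipment come to be controlled by LSIs, fault tolerance, which masks the computation errors caused by a fault and keeps producing correct results, is an indispensable technology. Against this background, this dissertation proposes a design method for LSIs that tolerate soft errors. It consists of eight chapters written in English.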

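Chapter 1, Introduction, surveys the problem of computation errors caused by faults in LSIs and the available fault-tolerance techniques, and states that the goal of the dissertation is to establish a design theory and develop synthesis algorithms for application-specific LSIs that tolerate soft errors and use time and space (hardware resources) efficiently. Chapter 2, Related Works, surveys existing countermeasures against soft errors at the device level, the electronic circuit level, and the system level, and clarifies the position and characteristics of the proposed design method. Chapter 3, Preliminaries, explains the mechanism of error detection and correction by algorithm redundancy, which underlies the proposed method, and introduces several important concepts and notations used throughout the subsequent discussion. Chapter 4, Soft-Error Tolerant Datapath Synthesis Based on Speculative Resource Sharing, introduces a special form of resource sharing (Speculative Resource Sharing: SRS) between the duplicated computation for error detection and the recomputation for error correction, and shows that, compared with conventional configurations, it can shorten application computation time and save hardware resources with almost no loss of error correction capability against soft errors. It also proposes a high-level synthesis algorithm that, assuming SRS, synthesizes an optimal fault-tolerant LSI implementation specialized for the application, and reports that synthesis experiments achieved an execution time reduction of up to 32.3%. Chapter 5, Latency-Optimized Selection of Check Variables, proposes an optimization algorithm for selecting error detection points under SRS. Design experiments show that application computation time differs by 20% to 50% depending on the choice of error detection points, which demonstrates the importance of optimizing that choice. Chapter 6, Adjacency Constraint between Circuit Components, focuses on the temporal and spatial locality of soft errors and recasts the conditions under which SRS works as temporal and spatial constraints on resource sharing. On this basis it proposes a scheme that enables resource sharing in which temporal constraints can be relaxed in exchange for layout adjacency constraints. An optimal resource assignment method based on integer linear programming is developed and implemented, and its effectiveness is verified through experiments. Chapter 7, Mixed Error Correction Scheme and Its Design Optimization, extends the discussion of Chapter 6 and proposes a fault-tolerance scheme that mixes error correction by error detection and recomputation with majority-voting triplication, selecting between the two part by part as appropriate. It states that design optimization can further shorten application time. Chapter 8, Conclusion, summarizes the findings of this research and the issues that remain.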

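In summary, this dissertation proposes a number of fault-tolerance schemes and design optimization methods that use temporal and spatial resources efficiently in the design of soft-error-tolerant LSIs, and it makes a significant academic contribution. It is therefore recognized as fully worthy of a dissertation for the degree of Doctor (Information Science).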