| URL<br>Rights | http://hdl.handle.net/10119/1011           |
|---------------|--------------------------------------------|
| Text version  | author                                     |
| Туре          | Thesis or Dissertation                     |
| Issue Date    | 1997-03                                    |
| Citation      |                                            |
| Author(s)     | 松本,英樹                                      |
| Title         | FPGA利用を前提とした最小リンク付加による相互結合<br>網の耐故障化に関する研究 |



# A Fault Tolerant Interconnection Network Adding Minimal Links with FPGA

Hideki Matsumoto

School of Information Science, Japan Advanced Institute of Science and Technology

February 14, 1997

Keywords: Interconnection Network, FPGA, Alternative Path, Reliability.

## 1 Introduction

Recently, a scale of interconnection networks of parallel computers is growing by increasing data. The number of hardware components constructing the system, such as processors and links, increase. For the reliability for the system, The fault tolerant design having the system run correctly under faults is very important. Advance in processing technology has made the reliability of a processor high, while the links connected between processors still have many physical faults such as loose connection.

In this paper, we concentrate on recovering link faults. For the purpose of groping the best method of recovering link faults, we propose a strategy of recovering link faults using the unique property of FPGA(Field Programmable Gate Array)[3], which enables to change hardware in it. We also evaluate for the pattern of alternative path, and consider the reliability of the method.

# 2 Backgrounds

There have been many methods to recover faults link in interconnection networks. Most well-known method is duplicate links to make every link have a redundant connection, and to make the system use the redundant links corresponding the fault paths, when a link fault occurs. The cost of links is expensive on this method. Meanwhile, the other major method is based on detouring fault links. This method, however, causes the cost for routing and communication overhead.

Copyright © 1997 by Hideki Matsumoto

To construct an interconnection network using FPGAs as a node, FPGAs have to be connected by one daisy chain link for down-loading circuit data. Since the link can be used for I/O after the down-load is finished, it is available for a part of recovering link faults. In this case, the nodes connecting the fault links must be reconfigured to use the daisy chain link. The advantages of the method are (1) the performance of the communication does not decrease, (2) it does not spend the time to recover, and (3) the cost of the links is half compared to that of the duplicate link method. Therefore the method overcomes the weak points in the conventional strategies.

## 3 Strategies for Recovering Link Faults

I propose two fault recovering methods in this thesis — one is centralized one and the other is distributed one.

A host manages all of the system under the centralized method, and also controls the sequence, which is detecting faults and recovering the system. Thus, this method can simplify the configuration of the circuit in each node. However, when it occurres a fault on a configuration line under this method, it is impossible to down-load circuit data onto the FPGAs locating after the line. In this case, this method cannot recover the system, because those FPGAs cannot be reconfigured.

While under the distributed method, each FPGA has a dedicated ROM storing a recovering sequence. Each FPGA can independently load the sequence from the ROM, and recover the system even if a fault occurs on the alternative line. The advantage of this method avoids that a host reconfigures all nodes.

## 4 Generation of Alternative Paths

Since the alternative links have the number of patterns for connecting recoverable links, we have to show the way to connect alternative links to obtain higher reliability.

Due to select an optimal pattern from n! patterns in an n nodes system, I used topological and cost-based restrictions to reduce the patterns.

When the nodes are reconfigured once to recover fault links, it yields unrecoverable links, which depends on the patterns, on the next reconfiguration. The cost-based restriction makes use of the number of the unrecoverable links.

I obtained 36 alternative patterns having highest reliability. I also derive a rule to construct an optimal link pattern on an  $n \times n$  torus network.

## 5 Reliability of the Proposed Methods

Both the proposed recovering strategies in this thesis are based on alternative fault connections. As I assume that faulty links are not repaired and the system is regarded as a stuck-at model, I calculated the MTTF of an obtained optimal pattern based on the

Markov model [1, 2]. Since it is hard to calculate the MTTF for an  $n \times n$  torus network, I calculated the MTTF for a  $3 \times 3$  network.

I compared the MTTF of the optimal pattern with that of a duplicate link system, which is considered to be the most reliable for the interconnection network. The result showed that the centralized system acquired the reliability of 83% compared to the duplicated system even if the number of redundant links decreases to the half.

#### 6 Conclusions

In this thesis, I proposed two recovering strategies for link faults on an  $n \times n$  torus network using the reconfigurable property of FPGA. I also evaluated alternative link patterns and the reliability of the system.

I obtained 36 alternative patterns by limiting the search area, and analyzed them. As a result, I found a optimal rule to connect nodes with alternative path on an  $n \times n$  torus network.

I showed that the centralized method had 83% reliability under the half number of links compared to a duplicate link system.

I would like to evaluate the reliability of distributed method, and also implement the methods on an experimental system.

#### References

- [1] Haruo Yokota. On Applying RAID to Networks and Improving Reliability. CPSY 93-11, IEICE, April 1993.
- [2] Yasuyuki Mimatsu. Research on Performance and Reliability of Disk Systems. Master's thesis, 1995.
- [3] Toshio Asano. Research on Flexible Structure of Interconnection Network using FPGA. Master's thesis, 1995.