DCS Fault Analysis and Technical Measures



By jonson
25 January 2024

1. Overview
Units #3-#6 (215MW each) of Zhenhai Power Plant have undergone automation retrofits since 1998 using domestically produced DCS systems. Since 2007 the units have been upgraded one by one, and three units have been completed so far. The DCS network structure at Zhenhai Power Plant is divided into three levels from top to bottom: the monitoring network, the system network, and the control network, as shown in Figure 1. The engineer station and operator stations on the monitoring network and the advanced computing stations and on-site control stations on the system network are interconnected through the system servers. The control network is built on Profibus-DP and handles communication between the on-site control stations and the process I/O units. The system can contain multiple sets of servers, which divide it into multiple domains. The DCS of each 215MW unit at Zhenhai Power Plant is divided into two domains, a main domain and an auxiliary domain. Each domain has its own independent servers, system network, and several on-site control stations; data within a domain is configured and managed separately, so each domain performs relatively independent acquisition and control functions. The two domains share the monitoring network and the engineer station, and the operator stations log in to the different domains by domain name to perform operations.
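To make the layered, multi-domain structure described above concrete, the following minimal Python sketch models it as plain data. All names (Domain, the server and station labels) are illustrative assumptions, not identifiers from the MACS configuration tools.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Domain:
    """One DCS domain: its own redundant servers, system network, and control stations."""
    name: str
    servers: List[str]            # independent redundant server pair for this domain
    control_stations: List[str]   # on-site control stations on this domain's system network

# Hypothetical layout mirroring the description: each 215MW unit has a main
# and an auxiliary domain that share the monitoring network and engineer station.
main_domain = Domain("main", ["server-A", "server-B"],
                     [f"I/O station #{n}" for n in (11, 12, 20, 23)])
aux_domain = Domain("auxiliary", ["aux-server-A", "aux-server-B"],
                    ["aux I/O station #1", "aux I/O station #2"])

monitoring_network = {
    # shared by both domains; operator stations log in to a domain by its name
    "engineer_station": [main_domain.name, aux_domain.name],
    "operator_stations": ["OPS-1", "OPS-2"],
}
print(monitoring_network)
```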
The DCS of Zhenhai Power Plant had a relatively high failure rate before the upgrade and renovation. Analysis of the fault statistics over the years shows that the main faults were main controller faults, I/O module faults, server faults, control network faults, and other factors. Taking 2006 as an example, there were 39 DCS-related faults on Units #3-#6, including 13 main controller faults and 8 module faults, which together accounted for 53.8% of the total. Control system failures are therefore the main cause of thermal system failures; their classification statistics are shown in Figure 2.
2. DCS Fault Phenomena and Analysis
Based on the main types of DCS faults listed above, typical software and hardware failures encountered in the application of the DCS at Zhenhai Power Plant in recent years are analyzed below.
2.1 Main controller malfunction
Main controller failures account for a large proportion of the DCS failures at Zhenhai Power Plant, and their causes vary. Some can be cleared by a simple reset or restart, while others seriously affect the operation of the unit.
(1) Abnormal controller fails to switch over automatically
On August 31, 2009, an on-site inspection found that the main controllers of the #11 and #26 I/O stations of Unit #5 were faulty: in both cases the A main controller's fault light was flashing, its dual-redundancy communication light was off, and the B main controller was in standby. The engineer station showed A as the active controller and B as the backup, with normal status; the DCS history contained no related fault records, and all parameter acquisition and control equipment actions within the relevant I/O stations were normal. The analysis concluded that the main controllers were still operating normally but that dual-machine redundancy synchronization had failed; had a main-control switchover occurred at that moment, a significant disturbance would have resulted. Failure of an abnormal controller to switch over had occurred several times before. For example, when the automatic control deviation of the furnace pressure in the DCS of furnace #3 exceeded 360Pa, the operator manually intervened to operate the regulating actuator of the supply fan scoop tube, which was ineffective, and staff had to go to the site for manual operation. Through the engineer station, the thermal engineering team checked the corresponding #12 I/O station and found the A main controller offline and the B main controller in standby. At the I/O station, system light 1 and system light 2 of the A main controller were off and the fault light was off, indicating that the main controller had lost data communication with the system network, yet no redundant switchover had taken place. The #23 I/O station of furnace #3 also experienced an A main controller failure and went offline; its fault light and dual-machine redundant data-exchange light were off, and the main controller did not switch over automatically. These cases indicate that the redundant switchover function of the MACS system's main controller is incomplete and fails under certain fault states.
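The fault pattern above, in which the primary controller loses its system-network links but no failover occurs, suggests that the switchover decision does not treat loss of network communication as a failure condition. The sketch below is a minimal, hypothetical illustration of the kind of check a redundant pair would need; it is not the MACS implementation, and all names and fields are assumptions.

```python
from dataclasses import dataclass

@dataclass
class ControllerState:
    """Health flags as observed on the controller faceplate."""
    system_link_1: bool    # system network channel 1 alive
    system_link_2: bool    # system network channel 2 alive
    redundancy_sync: bool  # dual-machine data-exchange link alive
    fault_led: bool        # hardware fault indicator lit

def should_failover(primary: ControllerState, standby: ControllerState) -> bool:
    """Switch to the standby if the primary is unusable and the standby is healthy.

    Losing both system-network links makes the primary unusable even if its
    fault LED is off -- the situation at the #12 I/O station, where system
    lights 1 and 2 were out yet no switchover happened.
    """
    primary_unusable = primary.fault_led or not (primary.system_link_1 or primary.system_link_2)
    standby_healthy = not standby.fault_led and (standby.system_link_1 or standby.system_link_2)
    return primary_unusable and standby_healthy

# The #12 I/O station case: primary offline from the system network, standby OK.
primary = ControllerState(False, False, False, fault_led=False)
standby = ControllerState(True, True, False, fault_led=False)
assert should_failover(primary, standby)
```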
(2) Cooling fan failures causing main controller faults
A malfunctioning cooling fan inside the main controller greatly increases the controller's failure rate. Since 2005, Zhenhai Power Plant has recorded 13 main controller failures caused by abnormal operation of the internal cooling fans (in such failures one or several of the cooling fans run abnormally or have stopped entirely; normal operation can generally be restored by replacing the fans).
(3) The impact of the electronic room environment on the main controller
The temperature and humidity of the electronic room have a certain impact on the main controller, and controllers with forced-air cooling are affected the most. Excessive temperature or humidity does not necessarily cause an immediate failure, but long-term exposure to such an environment inevitably increases the failure rate, and according to our statistics the impact of humidity is greater than that of temperature. Statistics since 2005 show that main controller failures from March to June each year account for between one-third and one-half of the annual total. This period coincides with the humid, hot rainy season in the south, when the central air-conditioning frequently draws in large amounts of fresh air and raises the humidity in the electronic rooms. Main controller anomalies occurring under these conditions can generally be cleared by a reset or restart; only a few require replacement of the controller.
2.2 I/O module malfunction
Compared with main controller failures, module failures are relatively easy to resolve and can generally be cleared by resetting or replacing the module. However, some faults are quite particular because other factors are involved.
(1) External interference causing I/O modules to go offline
In January 2007, Unit #5 underwent a planned minor overhaul. During the shutdown, the operating personnel used the micro-oil ignition device to assist combustion. Soon afterwards the thermocouple measurement module for the wall temperature of the micro-oil burner malfunctioned, and the wall-temperature display became invalid. After the thermal personnel reset the module it returned to normal. During the overhaul and the subsequent start-up the module failed several more times and could each time be recovered by a reset; the module was also replaced during this period, but the fault persisted. After the unit was returned to service the module ran stably until March 4, when the fault recurred. On-site inspection found that the two thermocouples connected to the module were installed too close to the micro-oil ignition gun: when the ignition gun fired, high-energy electromagnetic interference was coupled into the module through the cables and caused it to go offline. This was confirmed by testing. After the installation positions of the thermocouples and the ignition gun were adjusted, the fault was resolved.
(2) Single channel failure
Module faults can be divided into hard and soft faults: those that can only be resolved by replacing the module we call hard faults, while those that can be cleared by resetting the module we call soft faults. Such faults may show up on only one channel and can be identified by actual measurement. On January 15, 2007, the make-up water control valve of Unit #5 could not be opened: regardless of the command given in the DCS, the current measured on site remained at 4mA. The module was then reset and control returned to normal. On another occasion, the electric drain valve for the periodic blowdown of boiler #4 was opened and could not be closed. On-site inspection of the corresponding switch-output module showed the output of the first channel as "1" (corresponding to the open command of the electric valve), while the status of this channel in the DCS was "0". Replacing the module was ineffective; after the main controller was reseated, control returned to normal.
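Both cases were diagnosed by comparing the command or status in the DCS with what was actually present at the field terminals. The sketch below is a generic illustration of such a readback consistency check; the 4-20mA range, the 0.5mA tolerance, and the function names are assumptions, not part of the DCS.

```python
def analog_channel_suspect(commanded_ma: float, measured_ma: float,
                           tol_ma: float = 0.5) -> bool:
    """Flag an analog output channel whose field-measured loop current does not
    track the DCS command, e.g. a valve commanded open while the loop stays at 4 mA."""
    return abs(commanded_ma - measured_ma) > tol_ma

def digital_channel_suspect(hardware_state: int, dcs_status: int) -> bool:
    """Flag a digital output whose hardware state disagrees with the DCS status,
    e.g. the drain-valve channel physically outputting 1 while the DCS shows 0."""
    return hardware_state != dcs_status

# The make-up water valve case: commands vary but the measured current stays at 4 mA.
for cmd in (4.0, 8.0, 12.0, 20.0):
    print(f"{cmd:5.1f} mA commanded ->",
          "suspect" if analog_channel_suspect(cmd, 4.0) else "ok")

# The drain valve case: channel output is 1 on site, 0 in the DCS display.
print("digital channel suspect:", digital_channel_suspect(1, 0))
```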
2.3 Server Failure
The monitoring network and the system network of the Zhenhai Power Plant DCS are interconnected through the servers, so a server failure causes the operator stations on the upper-level monitoring network to lose monitoring and control of the operating parameters and control equipment on the lower-level system network, which seriously affects the safe and stable operation of the unit. On June 11, 2007, the active server of the main domain of Unit #6 failed and the server did not switch over automatically. All parameters on the operator stations failed and control was lost; the operators relied on the DEH and backup instruments to keep the unit running. The thermal personnel manually switched to server B and the DCS resumed operation, but the system status diagram showed that the lower-level link from server A to the system network was still faulty and the local network was not connected. After the server was restarted the network connection was restored. Unit #6 later experienced similar faults several times; no abnormality was found in the server host or network card, and the server was even replaced, but the cause is still unknown. At present, regularly switching and restarting the servers has achieved a certain effect.
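Because the root cause was never found, the plant relies on regularly switching and restarting the servers. The sketch below illustrates, in hypothetical form, how such a scheduled rotation with a basic health check could be expressed; the server names, the rotation period, and the is_healthy probe are all assumptions.

```python
SERVERS = ["server-A", "server-B"]      # redundant pair within one domain
SWITCH_INTERVAL_DAYS = 7                # assumed rotation period

def is_healthy(server: str) -> bool:
    """Placeholder health probe. In practice this would also verify the
    lower-level link from the server to the system network, since in the
    June 2007 fault the host looked fine but that link was down."""
    return True

def next_active(current: str) -> str:
    """Pick the server to activate at the next scheduled rotation, keeping the
    current one if the candidate fails the health check."""
    candidate = SERVERS[(SERVERS.index(current) + 1) % len(SERVERS)]
    return candidate if is_healthy(candidate) else current

print(next_active("server-A"))   # -> server-B under the placeholder probe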
2.4 Control network failure
Generally speaking, DCS network failures tend to occur in network equipment such as switches and optical transceivers, and the problem can often be resolved by replacing the hardware. On January 29, 2007, a switch failure on Unit #3 caused monitoring network B to go offline; earlier, a switch on Unit #3 had crashed, causing system network A to go offline; an optical transceiver on Unit #5 malfunctioned, causing remote I/O station #30 to go offline. These faults were cleared by resetting or replacing the network devices.
Because the DCS control network connects the main controllers to the process I/O modules, control network faults have a relatively large impact on the system, often taking multiple modules on a segment of the link offline at the same time. The causes are varied:
(1) Faulty network cable connection accessories
On February 5, 2007, Unit #3 was operating normally when multiple parameter displays in the #20 I/O station became invalid and control equipment could not be operated. On-site inspection showed the A main controller of the #20 I/O station running and the B main controller in standby; the A-train modules were running normally while the B- and C-train modules were offline. After the necessary safety measures were taken and the main controller was switched over, most of the B and C modules resumed operation, some still dropped offline intermittently, and the A modules began to drop offline intermittently at intervals ranging from a few seconds to a few minutes. The cause was identified as a faulty DP-line plug on the B main controller's network (the DP plug contains an optional terminating resistor), which interrupted the link or caused an impedance mismatch. After the DP plug was replaced, operation returned to normal. Similar faults later occurred twice more on Unit #3 and were both cleared by replacing the DP plug, so during the unit overhaul we replaced all DP plugs of the same type.
(2) DP bus “virtual connection”
The control network of the DCS remote I/O stations at Zhenhai Power Plant expands the I/O modules by daisy-chaining the module bases. This connection method is flexible and convenient for decentralized installation, but it also has drawbacks: many potential DP communication fault points and unstable characteristic impedance of the communication bus. In April 2006, the remote temperature-measurement cabinet for the Unit #4 generator went offline from the second module downwards; after these modules were pressed or touched, they recovered. Similar faults occurred several times and were mostly resolved in the same way. Analysis showed that this type of fault arises because mechanical vibration of the vertically installed module bases loosens the contacts, and a poor on-site environment (humidity and heat) oxidizes them, producing a "virtual connection" on the DP bus and a mismatched characteristic impedance. Such faults occur mainly in remote I/O cabinets installed in the field and rarely in I/O stations installed in the electronic rooms. After all the bases in the Unit #4 generator temperature cabinet were replaced and reinstalled during unit maintenance, the situation improved greatly.
(3) The impact of faulty modules on the DP bus
When the communication interfaces of certain modules on a DP bus fail, all modules on the DP link may go offline. In the remote I/O cabinet for the pump temperatures of Unit #4, multiple modules frequently went offline at intervals ranging from a few seconds to several minutes or longer, with no virtual connection on the DP bus. Measures such as switching the main controller and replacing modules were ineffective. While plugging and unplugging modules, it was observed that when a certain module was unplugged the DP link returned to normal, and when it was plugged back in another module began to go offline; it was therefore concluded that a module failure was taking the entire DP link offline. Through step-by-step elimination, the faulty module was located. When the module was disassembled, signs of capacitor components having burst to varying degrees were visible to the naked eye.
It is difficult to locate the fault point when modules on a DP bus go offline because of a module failure: modules that are offline are not necessarily faulty, and faulty modules do not necessarily go offline. There is no good test method, and only step-by-step elimination can determine the fault point, which involves certain difficulties and risks while the unit is in operation. However, this type of bus fault will not occur when only one module is faulty, and the fault points inside a faulty module can be observed with the naked eye; therefore, modules can be disassembled and inspected during unit maintenance, which has a good preventive effect.
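The step-by-step elimination described above can be stated as a simple procedure: unplug part of the link, check whether the remaining modules recover, and repeat on the part that still misbehaves. The sketch below is a generic, simulated illustration of that idea (assuming, for simplicity, a single faulty module and a link that fails whenever that module is plugged in); it is not tied to any Profibus diagnostic API, and in practice each unplugging step needs the safety measures mentioned above.

```python
def link_ok(plugged_in: set, faulty: set) -> bool:
    """Simulated DP link: it only communicates when no faulty module is plugged in."""
    return plugged_in.isdisjoint(faulty)

def find_faulty_module(modules: list, faulty: set) -> str:
    """Locate one faulty module by repeatedly leaving only half of the remaining
    candidates plugged in and checking whether the link recovers."""
    candidates = list(modules)
    while len(candidates) > 1:
        first_half = candidates[: len(candidates) // 2]
        second_half = candidates[len(candidates) // 2:]
        # Leave only the first half plugged in; if the link recovers,
        # the fault lies in the second half, otherwise in the first.
        candidates = second_half if link_ok(set(first_half), faulty) else first_half
    return candidates[0]

modules = [f"module-{i}" for i in range(1, 9)]
print(find_faulty_module(modules, faulty={"module-6"}))   # -> module-6
```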
2.5 Faults caused by other factors
(1) The impact of the GPS clock on the DCS
The system clock of the Zhenhai Power Plant DCS is calibrated by the servers through communication with a GPS electronic clock. On September 17, 2006, while Unit #4 was operating normally, the DCS operator stations went offline, the active servers in both the main and auxiliary domains went offline, and the engineer station also went offline; the redundant servers in the main and auxiliary domains switched over automatically and successfully. The on-site thermal personnel immediately started the engineer station, and the operators maintained the operation of the unit through it. On-site analysis found that, because of a fault in the GPS electronic clock, the DCS system clock had been erroneously calibrated to the year 2178, and the resulting loss of the operator stations should be attributed to a bug in the system software. After the system clock was restored, the operator stations and servers were started one by one and the DCS resumed normal operation.
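A plausibility check on the time received from the GPS clock would keep a faulty jump such as the one to the year 2178 from being written into the system clock. The sketch below shows one simple form such a check could take; the 5-minute limit and the function name are assumptions, not features of the actual time-synchronization software.

```python
from datetime import datetime, timedelta

MAX_STEP = timedelta(minutes=5)   # assumed largest correction accepted in one calibration

def accept_gps_time(system_time: datetime, gps_time: datetime) -> bool:
    """Reject a GPS calibration that would shift the system clock by an
    implausible amount, such as the faulty jump to the year 2178."""
    return abs(gps_time - system_time) <= MAX_STEP

now = datetime(2006, 9, 17, 10, 0, 0)
print(accept_gps_time(now, datetime(2006, 9, 17, 10, 1, 30)))   # True: small drift
print(accept_gps_time(now, datetime(2178, 1, 1)))               # False: rejected
```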
