A MANAGEMENT APPROACH TO RELIABILITY GROWTH FOR COMPLEX ELECTROMECHANICAL SYSTEMS

This paper proposes a reliability management process for the development of complex electromechanical systems. Specific emphasis is placed on the development of these systems in an environment of limited development resources, and where small production quantities are envisaged. The results of this research provide a management strategy for reliability engineering activities, within a systems engineering environment, where concurrent engineering techniques are used to reduce development cycles and costs.
http://sajie.journals.ac.za


1. PROBLEM DEFINITION AND SCOPE
This paper proposes a reliability management process for the development of complex electromechanical systems. Specific emphasis is placed on the development of these systems in an environment of limited development resources, and where small production quantities are envisaged. A complex electromechanical system may be defined as a system using integrated electronic, electrical and mechanical subsystems, which together provide a functional solution to a customer's, or market's, requirement. A guided missile system is a good example of a complex electromechanical system.
A phenomenon of the new global economy is that product life-cycles are decreasing [9], and in order to maximise market share it is important to get new products onto the market as soon as possible. This requires the shortening of development cycles while still satisfying the customer's expectations and requirements as far as quality, cost and performance of the product are concerned [5,6,10].
In order to shorten product development cycles and to improve product quality while reducing costs, more and more organisations are turning to the concept of concurrent engineering [10]. Concurrent engineering is focussed on developing a product while simultaneously designing the manufacturing, test and support processes [11,10,7,8]. This paper applies this same principle to reliability growth management, by providing the systems engineering team with reliability design criteria and models as the design is evolving. Thus, the reliability growth of the product or system is managed concurrently with its design and development.
The following criteria were chosen to achieve the goal of managing the reliability growth of a product simultaneously with its design and development:
To ensure that reliability is designed into the system under development by providing reliability design data to the system engineering and development teams.
To optimise reliability testing resource expenditure by managing the reliability growth process through a design problem monitoring and solution management process.
To ensure that system life-cycle costs are optimised through good reliability design choices.
To provide an environment where reliability engineering is an integral part of the system engineering process so as to optimise the rapid development process while keeping development expenditure within acceptable limits.
After a brief discussion of the background concepts, the development of a proposal for a reliability growth management process is discussed.

2. BACKGROUND DISCUSSION
Three subjects are discussed briefly: The first is rapid product development; the second is reliability engineering and management; the third is software reliability engineering and management.

Rapid Product Development
Rapid product development exploits the advantages of concurrent engineering in reducing time to market and development costs of new products.
"Concurrent Engineering is a systematic approach to the integrated, simultaneous design of both products and their related processes, including manufacturing, test and support. This approach is intended to cause the developers, from the outset, to consider all elements of the product lifecycle from conception through disposal, including quality, cost, schedule, and user requirements" [11,10,7,8].
Figure 1 illustrates the concept of concurrent engineering, where design for manufacture and support takes place in parallel with the functional design and development. Implemented using cross-functional design teams, concurrent engineering seeks to ensure that a product or system is designed to meet its performance requirements, while still being easy to manufacture and support. Seven design tasks which take place in parallel can be identified:
Design for performance
Design for manufacturability
Design for testability
Design for quality
Design for serviceability
Design for compliance
Design for affordability
Reliability engineering has an impact on most of the design activities listed above. Therefore, in order for reliability engineering to positively influence product and system design and development, it should take place in parallel with the system engineering activities.

Reliability Engineering and Management
Reliability is: "The probability that an item will perform a required function without failure under stated conditions for a stated period of time" [1].
"Reliability is not a matter of chance; it has to be consciously and actively built into hardware through careful specification of good design and manufacturing processes" [3].
Reliability engineering and management encompasses the entire life-cycle of a system. This is illustrated by figure 2. During the design phases reliability engineering facilitates sound engineering decision making, through the provision of reliability data to the system design team.
In manufacturing, reliability engineering focusses on the quality assurance of production processes which ensure that the reliability which has been designed into the system is, in fact, built into the system. During support, the reliability engineering and management process seeks to ensure that the maintenance and support practices do not adversely affect the system's operational reliability. Reliability engineering and management during system support also seeks to enhance system reliability through the development of design enhancements. Reliability data also need to be fed back into the design process so that future iterations of the system's design, and indeed new systems, can benefit from operational experience.

Software Reliability Engineering and Management
Complex electromechanical systems have a significant portion of their functionality implemented in software. Reliable software is dependent on two important processes. The first relates to the software design process, which is ably supported by the methodologies proposed in the ISO/IEC 12207 standard [15]. The second process is the testing and elimination of software faults using Test, Analyse and Fix (TAAF) principles [14].
The reliability of software in complex systems cannot be ignored and should be taken into account when allocating system reliability requirements to subsystems and assemblies [14].
Testability of software is an important design feature, which should not be overlooked during the design process [14]. Good testability greatly enhances the productivity of software test personnel.

3. THE RELIABILITY GROWTH MANAGEMENT PROCESS
Reliability engineering must ultimately provide the customer benefits in the form of optimised life-cycle costs. To achieve this, reliability engineering must add value during system development by providing three services. The first service is to support the system engineer in determining the reliability performance specifications of the system and subsystems to be developed. The second service is to support design trade-off studies, with respect to reliability, in the overall design optimisation process. The third service is managing the reliability growth process. The aim is therefore to assist the system development team in providing the customer with a system which is cost effective to acquire and operate.
From a customer's perspective, the ultimate measure of success is for a system to perform its allocated tasks, and/or functions, with the required success rate at the lowest life-cycle cost. The developer's perspective is focussed on risk reduction. In managing risk, reliability engineering and management must ensure that the reliability targets, as required by the customer, are met when the system is delivered. The risk to be managed is the cost of providing warranty cover for the system during its warranty period. The responsibility of the reliability engineering team is to provide the management tools so that the reliability growth process can be managed to ensure that the system is not delivered before the specified levels of reliability have been achieved. Refer to figure 3.

Figure 3: Warranty risk of premature system delivery
With this in mind, the proposed reliability growth management process is summarised in the next few sections.

The System Development Team
Reliability growth management is a team effort. The team responsible for ensuring that a reliable product is developed is the system development team. While not necessarily a member of the core development team, the reliability engineer is responsible for co-ordinating and managing the reliability engineering growth processes.
The lead author has been part of a missile development team, at Kentron Division of Denel (Pty) Ltd., for which the reliability management model being discussed was developed. The management structure chosen for the missile development team is illustrated in figure 4. As can be seen in figure 4, the reliability engineer is not a member of the core development team, but plays an essential role in co-ordinating all the engineering management efforts which contribute to the reliability growth of the system. Taking this into account and referring to the reliability growth management process in figure 5, it can be deduced that the reliability engineer plays a very important part in ensuring that the engineering management loop is in fact closed.

Reliability Growth Management Process
The reliability growth management model which has been used is a closed loop management process and is illustrated in figure 5. The first three steps in the process, namely the reliability allocation, reliability budgeting and the determination of inherent reliability, are the reliability engineer's contribution to the design of the product. In the reliability allocation and budgeting processes the reliability engineer assists the systems engineer to define the reliability design specifications which will enable the system to satisfy the customer's operational reliability performance requirements. Determining the inherent reliability of various subsystem design proposals assists the system development engineers to perform trade-off studies to ensure the system design is optimised in terms of the reliability performance requirements.

Reliability
As described in Blanchard and Fabrycky [2], system level performance requirements must be allocated to the various subsystems, assemblies and components which make up the system. The reliability allocation process takes the top level reliability requirements, and determines what reliability performance each of the subsystems, assemblies and components must attain in order for the system to meet these requirements.
The reliability allocation process is described by figure 6.
Source: Rooney [16]
The first step in the reliability allocation process is to determine the customer's reliability performance requirements. The next step is to define the relative complexities of each of the lower level subsystems, assemblies and components. These complexity factors are then used to calculate the reliability performance requirements for each of these lower level system items.
The complexity factor is a proportional weighting of the technical complexity of each of the subsystems, assemblies and components. The failure rate of a system or subsystem is taken to be directly proportional to its technical complexity [2]. The proportion of technical complexity assigned to each subsystem is, therefore, a measure of what proportion of failures will occur in each of them. The more complex an assembly, for instance, the more failures it is likely to contribute to the overall system failure rate. Blanchard and Fabrycky [2] recommend that the complexity factor be based on an estimate of the number and relationship of parts, the equipment duty-cycle and whether the system is subjected to temperature extremes, amongst other factors.
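The complexity-weighted allocation described above can be sketched as follows. This is a minimal illustration and not the authors' tool: it assumes a series system with constant failure rates, so that the subsystem failure rates sum to the system failure rate, and the subsystem names, complexity factors and the 500 hour system MTBF requirement are hypothetical.

```python
def allocate_mtbf(system_mtbf_hours, complexity):
    """Allocate a system MTBF requirement to subsystems by complexity weight.

    Assumes a series system with constant failure rates, so the allocated
    failure rates must sum to the system failure rate.
    """
    if system_mtbf_hours <= 0:
        raise ValueError("system MTBF must be positive")
    if not complexity or any(c <= 0 for c in complexity.values()):
        raise ValueError("complexity factors must be positive")
    system_failure_rate = 1.0 / system_mtbf_hours
    total_complexity = sum(complexity.values())
    allocation = {}
    for name, factor in complexity.items():
        # A more complex item is allowed a larger share of the failure rate.
        weight = factor / total_complexity
        allocation[name] = 1.0 / (weight * system_failure_rate)
    return allocation


# Hypothetical complexity factors for three subsystems.
factors = {"seeker": 5.0, "guidance electronics": 3.0, "actuation": 2.0}
allocated = allocate_mtbf(500.0, factors)
for name, mtbf in allocated.items():
    print(f"{name}: allocated MTBF {mtbf:.0f} h")
```

The most complex item receives the largest share of the permitted failure rate and therefore the lowest allocated MTBF, which matches the reasoning that complex assemblies contribute most failures.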
The reliability allocation process discussed above is a top-down process, starting with the system level requirement and allocating the reliability requirements down to the lowest applicable assembly level. It is now necessary to examine each of these end level assemblies and components, and determine what reliability performance levels can realistically be achieved by each of them in practice. This is called the reliability budgeting process and is illustrated by figure 7.
The reliability allocation process calculates the minimum reliability for which each subsystem and assembly must be designed, in order for the system to attain the reliability requirements specified by the customer. This approach does not, however, take into account what can be practically achieved during the course of detailed design, hence the necessity for the reliability budgeting process. The reliability budget seeks to address two essential shortcomings in the reliability allocation. The first is that some of the allocated reliability figures may be well below what can be practically achieved by the designer. For example, experience within the author's company suggests that the typical reliability of a printed circuit board assembly of average complexity will easily exceed an MTBF of 5000 hours. The second requirement, which the reliability budget must address, is to ensure that a reliability safety margin exists. Refer to figure 8.
There is an important distinction to be made here. The system, as delivered, must comply with the reliability allocated from the customer specified reliability requirements. The reliability determined by the budget process is the target for which the subsystem or assembly must be designed. The allocated reliability is specified in the system development specification and the design targets are specified in the subsystem development specifications. These targets, where practically possible, should always be higher than the minimum allocated reliability specification to allow a reliability safety margin [4].

[Figure 8: The reliability safety margin. Recoverable labels: "Reliability growth strategy"; "λA - Specification"]
Wessels [4] advocates the use of the reliability safety margin as a design target (refer to figure 8) to minimise warranty risk when a system is first put into operation (refer to figure 3). The reliability safety margin is the difference between the reliability budget set as a design goal (λB) and the effective contracted reliability (λA) required by the customer. Pecht [3] states that experience has shown that only 70% of failures which can be attributed to design shortcomings are likely to be eliminated through a reliability growth process. For the project under review, the system development team decided on a general rule that a minimum safety margin of 1.5 times the allocated reliability figure was acceptable. In many cases the practical margins were much larger. In some cases a lesser margin was considered acceptable, usually when the assembly was an "off-the-shelf" item backed up by its manufacturer's specified data.
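The 1.5 times rule of thumb can be expressed as a simple check of each design target against its allocation. The sketch below assumes that the margin is applied to MTBF figures, so that a budgeted MTBF must be at least 1.5 times the allocated MTBF; the paper does not state which measure was used, and the function names and example values are hypothetical.

```python
def safety_margin(budget_mtbf_hours, allocated_mtbf_hours):
    """Return the ratio of the design target (budget) to the allocated MTBF."""
    if allocated_mtbf_hours <= 0:
        raise ValueError("allocated MTBF must be positive")
    return budget_mtbf_hours / allocated_mtbf_hours


def meets_margin(budget_mtbf_hours, allocated_mtbf_hours, minimum_margin=1.5):
    """True when the design target provides at least the minimum margin."""
    return safety_margin(budget_mtbf_hours, allocated_mtbf_hours) >= minimum_margin


# Hypothetical assemblies, both with an allocated MTBF of 2500 h.
print(meets_margin(5000.0, 2500.0))  # margin of 2.0
print(meets_margin(3000.0, 2500.0))  # margin of 1.2
```

An assembly that fails the check would either need a more demanding design target or an explicit decision to accept a lesser margin, as was done for some off-the-shelf items.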
The remaining steps illustrated in figure 5 are the management processes used to actively manage the reliability growth of the system. This is achieved by recording all system failures and design problems in a database, and using the recorded data to ensure that the subsystem design teams take the necessary actions to resolve them. This database system uses the Failure Reporting, Analysis and Corrective Action System (FRACAS) technique to manage the reliability growth of the system. In order to include and manage the resolution of design problems, which are not necessarily failures, the system has in practice been designated the Problem Reporting, Analysis and Corrective Action System (PRACAS). Refer to figure 9.
During full-scale development the data provided by the PRACAS system allows the system's design to be continuously improved, so as to achieve reliability growth and system design maturity. This results from the active management of the steady elimination of identified design weaknesses. This management is done by the development managers using management reports generated by the reliability engineer from the PRACAS database. Apart from providing management data to the development teams, the reliability engineer needs to use the PRACAS database for other purposes in order to make the most use of the data gathered. The primary use of the PRACAS data by the reliability engineer is to calculate the achieved reliability of the system. This is done only from those PRACAS reports which have been designated as genuine failures. One of the fields in the PRACAS database identifies failures as chargeable [13]. A chargeable failure is one that is counted as a failure for the purposes of calculating the reliability (MTBF) of a subsystem.
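The achieved-reliability calculation can be sketched as a point estimate: total operating time divided by the number of chargeable failures. This is a simplified illustration only; the record layout and the field name `chargeable` are hypothetical and do not represent the actual PRACAS database schema.

```python
def achieved_mtbf(reports, total_operating_hours):
    """Point estimate of MTBF from PRACAS-style reports.

    Only reports flagged as chargeable are counted as failures. Returns
    None when no chargeable failure has been recorded, because the point
    estimate is undefined in that case.
    """
    if total_operating_hours < 0:
        raise ValueError("operating time cannot be negative")
    chargeable_count = sum(1 for report in reports if report["chargeable"])
    if chargeable_count == 0:
        return None
    return total_operating_hours / chargeable_count


# Hypothetical reports: two chargeable failures and one design problem.
reports = [
    {"id": 1, "chargeable": True},
    {"id": 2, "chargeable": False},
    {"id": 3, "chargeable": True},
]
print(achieved_mtbf(reports, 1200.0))  # 600.0
```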

[Figure 9: flow chart; recoverable box labels: "Provide reason for not accepting report", "Record FRB decision in database". Source: Rooney [16], Figure 3.16]
Reference [1] lists the following points as being important data to be recorded for each failure:
A description of the failure's symptoms, and its effect.
A description of the immediate repair action taken.
A record of the equipment's total operating time at occurrence of the failure (e.g. elapsed time indicator reading, odometer reading, etc.).
A description of the operating conditions.
Date and time of the failure.
Failure classification (e.g. design, maintenance induced, quality control, etc.).
Report of the investigation into the failure, and its reclassification if necessary.
Recommended action to eliminate the failure mode.
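The data items listed above map naturally onto a single record type. The following is a minimal sketch of such a record, assuming one field per listed item; the class name, field names and the rule used for treating a report as closed are hypothetical and are not taken from the PRACAS implementation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FailureReport:
    """One failure record, with a field for each item in the list above."""
    symptoms_and_effect: str
    immediate_repair_action: str
    operating_time_hours: float
    operating_conditions: str
    occurred_at: datetime
    classification: str  # e.g. "design", "maintenance induced", "quality control"
    investigation_report: Optional[str] = None
    recommended_action: Optional[str] = None

    def is_closed(self) -> bool:
        # Assumed rule: closed once investigated and a corrective action exists.
        return bool(self.investigation_report) and bool(self.recommended_action)


# Hypothetical example entry.
report = FailureReport(
    symptoms_and_effect="Loss of telemetry during vibration test",
    immediate_repair_action="Connector reseated",
    operating_time_hours=42.5,
    operating_conditions="Random vibration at ambient temperature",
    occurred_at=datetime(1998, 3, 17, 10, 30),
    classification="design",
)
print(report.is_closed())
```

Keeping the investigation and recommended action as optional fields reflects the closed loop nature of the process: a report is opened when the failure occurs and completed later.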
Apart from being used to record and manage the resolution of system failures and problems, the data contained in the PRACAS database is also used to measure the reliability achieved by the system during operation and testing. This achieved reliability data can be utilised in two ways: firstly, to assist the design teams to identify design weaknesses, so that some form of design and/or management action can be taken to rectify the identified shortcoming; and ultimately, to prove to the customer that active steps have been taken to ensure that reliability growth has indeed taken place, that the product has matured, and that the customer's reliability performance criteria have been met.
The use of the PRACAS system to record and actively manage the resolution of design problems and failures has just been discussed. Functional design problems and failures are usually easily identified during the functional testing of a system. However, inherent design weaknesses, which may result in problems and failures when the system is put into operation, may not be detected in this way. Design weaknesses often result from complex interactions of various environmental factors, including various combinations of temperature and vibration cycling, humidity, ingress of dust and other contaminants, Electromagnetic Interference (EMI) and many others. In order to identify these potential design problems and failures in a system, it is necessary to institute an environmental testing regime to induce them. This testing regime is referred to as Test, Analyse and Fix (TAAF) [13], refer to figure 10. Tests using combinations of environmental stresses are conducted on systems and subsystems, and are specifically designed to induce failures. The test stresses are increased incrementally until a system failure is induced. The failure is then analysed and, if it is due to an inherent design weakness, the design is improved and the testing is repeated. Accelerated ageing tests are typical of TAAF testing. There are two methods of implementing a TAAF strategy described in Mil-Hdbk-189 [13]. The first is the Test-Fix-Test method and the second is the Test-Find-Test method. A combination of the two methods may also be appropriate in some circumstances.
Test-Fix-Test: When a failure is detected, a design improvement is implemented before the testing resumes. This is most appropriate to development testing of subsystems and laboratory testing of an integrated system. Continuing the testing after the fix has been implemented allows the effectiveness of the fix to be verified, confirming that it has indeed made a difference to the reliability. The advantage of this method is that a more immediate measure of reliability growth is possible.

Test-Find-Test: In this method the testing continues after a failure or problem has been identified. This method is used when an immediate fix is not practical, for example when a design improvement must be made to a component or assembly which has a long lead time for its implementation. A new printed circuit board layout is a good example. This method is also appropriate when the identified failure or problem does not have an impact on the testing that follows. The fixes implemented as a result of this method are referred to as delayed fixes in Mil-Hdbk-189 [13]. The disadvantage of this method is that immediate proof of reliability growth is not possible. Only the next series of tests will show the improvement as a step in the initial reliability.

Combined Test-Fix-Test and Test-Find-Test: Using a combination of the methods is probably a more efficient way of implementing TAAF. This allows the simpler design improvements to be implemented before testing continues and the more time consuming improvements to be delayed until a later date.
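The combined approach amounts to a triage decision for each finding. The sketch below assumes the decision is made purely on the lead time of the fix; the 7 day threshold, the field names and the example findings are hypothetical, and in practice the decision would also consider whether the problem affects the testing that follows.

```python
def plan_fixes(findings, immediate_limit_days=7):
    """Split TAAF findings into immediate fixes and delayed fixes.

    A fix that can be implemented within `immediate_limit_days` is made
    before testing resumes (Test-Fix-Test). Anything slower is recorded
    as a delayed fix and testing continues (Test-Find-Test).
    """
    immediate, delayed = [], []
    for finding in findings:
        if finding["fix_lead_time_days"] <= immediate_limit_days:
            immediate.append(finding["problem"])
        else:
            delayed.append(finding["problem"])
    return immediate, delayed


# Hypothetical findings from one test series.
findings = [
    {"problem": "Fastener loosens under vibration", "fix_lead_time_days": 2},
    {"problem": "Printed circuit board layout change", "fix_lead_time_days": 45},
]
immediate, delayed = plan_fixes(findings)
print("Fix before resuming test:", immediate)
print("Delayed fixes:", delayed)
```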
The principle of analyse and fix must be extended to all problems and failures encountered, no matter what their origin. This ensures that design weaknesses are forced out of the system and that reliability growth takes place.

4. CASE STUDY FINDINGS
Reliability growth has been achieved in an ongoing project on which the case study has been based, using the proposed closed loop, active reliability management process. This is visible in the decreasing trend of PRACAS reports received per development baseline, as illustrated by the analysed figures in table 1. In section 1, four management objectives, which need to be met in order to achieve successful reliability growth, were set. Each of these objectives is met by the reliability growth management process outlined in section 3. In summary, the objectives set, and the associated management techniques instituted in order to achieve them, are as follows:
To ensure that reliability is designed into the system under development by providing reliability design data to the system engineering and development teams. This has been achieved through a reliability allocation and budgeting process.
To optimise reliability testing resource expenditure by managing the reliability growth process through a design problem monitoring and solution management process. This has been achieved by establishing a Problem Reporting, Analysis and Corrective Action System and using it to manage the reliability growth process.
To ensure that system life-cycle costs are optimised through good reliability design choices. This has been achieved by performing design trade-off studies, using reliability determination techniques, based on the use of reliability databases such as Mil-Hdbk-217F [12].
To provide an environment where reliability engineering is an integral part of the system engineering process so as to optimise the rapid development process while keeping development expenditure within acceptable limits. This is achieved through recording, analysing and reporting reliability management data continuously throughout the development process using the PRACAS system.

5. CONCLUSIONS
The foundations for making reliability an inherent design feature of a system are laid by specifying the system and subsystem reliability performance criteria, using the reliability allocation and budgeting processes. However, in the author's opinion, the success of a reliability growth management process can only be guaranteed by using a closed loop management technique to identify and manage the resolution of design problems and failures. TAAF testing processes provide an essential tool to actively identify and analyse these problems and failures, while the PRACAS system and its associated database provide the tools to ensure that the necessary design and management actions are taken to resolve the root causes of the problems and failures.
In conclusion, the management process described can be successfully applied to managing the reliability growth of a complex electromechanical system, with limited financial and manpower resources, and within a short development timescale.

Figure 1: Concurrent product, process and support life-cycles (Source: Blanchard and Fabrycky [2], Figure 2.3)

Figure 2: Reliability in the system life-cycle (Source: Blanchard and Fabrycky [2], Figure 13.1)

Figure 9: Problem Reporting and Corrective Action System Flow Chart