##### By D Weeraddana, B Liang, Z Li, Y Wang, F Chen, D Phillips, N Saxena and L Bonazzi.

First published in *Water e-Journal* Vol 4 No 3 2019.

# Abstract

Data61 and Western Water worked collaboratively to apply engineering expertise and machine learning tools to find a cost-effective solution to the pipe failure problem in the region west of Melbourne, where on average 400 water main failures occur per year. To achieve this objective, we constructed a detailed picture of the behaviour of the water pipe network by 1) discovering the underlying drivers of water main breaks, and 2) developing a machine learning system to assess and predict the likelihood of water main failure using historical failure records, descriptors of pipes, and other environmental factors. The results open an avenue for Western Water to prioritise pipe renewals.

# Challenges and highlights of the work

- While there is significant existing literature on pipeline failure causes, discovering the major failure factors was critical to discern which of these causes were the most important for Western Water;
- Thus, an in-depth analysis was carried out to identify the underlying pipe failure factors, after pre-processing the data through a sequence of steps;
- A machine learning prediction model was developed to estimate future pipe failure likelihoods for every water main asset. These predictions were validated by separating the data into training and testing samples. Based on the prediction model, a ranked list of assets was derived and evaluated on the testing data;
- Some divergent trends were observed in the Western Water records (e.g. the failure rate for AC pipes decreases with age). Therefore, data mining techniques were used to explore the intricate interplay between age and other factors to reflect the true trend;
- A long-term forecasting model was developed for predicting which pipe assets are most likely to have a water main failure within the next twenty years, with burst and fitting failures considered separately;
- Finally, a user-friendly, end-to-end runnable tool was developed for the prediction.

# Introduction

The consequences of water pipeline failures can be extremely severe in terms of water supply disruption, high repair costs and compensation claims. However, predicting water main breaks is not an easy task, because their low failure rate and the high cost of inspection have led to sparse historical data.

Data scientists at Data61 and engineering experts at Western Water commenced model development to address this challenge and ultimately produced more targeted break mitigation and asset renewal programs for Western Water. Western Water is one of Victoria’s thirteen regional urban water corporations, servicing 69,371 properties and a population of 160,339 over an area of 3,000 square kilometres.

Mitigation of water main breakage and water asset renewal programs should balance the consequences of water main failure against the cost to customers. To achieve these dual objectives effectively, it is important to know: what are the causes of pipeline failure, what is the probability of failure for an individual pipeline asset, and what risk do these failures pose to the business?

Therefore, our main aim in this paper is to construct a detailed picture of the factors affecting pipe failure rate and to predict future pipe breakage likelihoods. These likelihoods will be combined with asset consequence factor data to calculate the risk distribution (Risk = Likelihood × Consequence) and to develop a risk-based investment decision framework for capital interventions.
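As a minimal sketch of the Risk = Likelihood × Consequence relationship, the snippet below ranks a few assets by risk. The asset identifiers, likelihood values and consequence scores are hypothetical and only illustrate the arithmetic; they are not Western Water data.

```python
def rank_by_risk(assets):
    """Return assets sorted by risk = likelihood * consequence, highest first."""
    scored = [
        {**asset, "risk": asset["likelihood"] * asset["consequence"]}
        for asset in assets
    ]
    return sorted(scored, key=lambda asset: asset["risk"], reverse=True)


# Hypothetical assets: annual failure likelihood and a unitless consequence score.
assets = [
    {"asset_id": "WM-001", "likelihood": 0.02, "consequence": 50.0},
    {"asset_id": "WM-002", "likelihood": 0.10, "consequence": 5.0},
    {"asset_id": "WM-003", "likelihood": 0.05, "consequence": 40.0},
]

ranked = rank_by_risk(assets)
for asset in ranked:
    print(asset["asset_id"], round(asset["risk"], 3))
```

In this toy example the asset with the highest likelihood (WM-002) ranks last, because its consequence score is low; this is the reason both factors are needed to prioritise renewals.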

Although the factors affecting pipe failures have been studied before, understanding of these factors remains largely incomplete because of their high complexity. Thus, comprehensive analyses were performed to identify the factors that lead to failures of water pipes. This involved exploring statistically significant correlations between water main breaks and operational factors sourced from Western Water’s internal databases, as well as external datasets from sources such as the Australian Bureau of Statistics and the Bureau of Meteorology. In addition, a machine learning-based data analytic model was developed to predict the probabilities of future pipe failures. Data mining techniques were used to explore the intricate interplay between age and other factors to reflect the true trend of failure rate over time. Finally, the probability of failure for an individual pipeline was calculated by extrapolating the past performance of similar assets in similar operational conditions elsewhere in the network. An annual failure probability value was calculated for all water main assets until 2037.

The results were validated by comparing the number and location (suburb) of breaks projected by the modelling with actual performance in calendar year 2017. Validation was also carried out at the asset level by comparing assets with high failure probability against the asset renewal program. In addition, the end-to-end data analytic process is automated within Docker Engine for the end user’s convenience.

# Existing methods

Analysis of water pipe breakage and forecasting future failure rates has been studied over the past few decades using a variety of methods and frameworks.

Uri et al. (1979) developed a forecasting technique to study how the number of breaks would change with time if the pipes were not replaced. In that study, the authors used a Poisson model based on the age of the pipes. Moreover, prediction of water main breaks has been studied using survival-based methods, such as Poisson regression (Asnaashari et al., 2009) and the Weibull model.

Most recently, tree-based machine learning techniques have been used to analyse water pipe breakages in Syracuse, USA by Avishek et al. (2017) and in Queensland, Australia by Liang et al. (2017).

Although there is significant existing literature, there still exist open questions regarding the intricate relationship among the major factors causing pipe failure, and their long-term effect on the lifetime of a pipe. Thus, discovery of major failure factors is critical to discern which of these factors are the most important for different water utilities.

# Data analytic model for failure prediction

The framework of the proposed model is depicted in Figure 2. The first step entails pre-processing the pipe attribute data and pipe failure data obtained from Western Water’s internal database. In the next step, the influential and significant factors are investigated, and a water main failure prediction model is developed using a machine learning model (Random Forest Regression). The performance of the model is then evaluated. Finally, a long-term failure forecasting model is developed, with an end-to-end runnable tool to automate the entire prediction process. The following subsections discuss each of these processes.

## Data pre-processing

There are three main data sources used as the input to the analytical model:

- **Network data** describes water main information such as asset number, installation date, material, diameter and length.
- **Work order data** describes water main failure information such as asset number, failure date, location and failure type (burst or fitting).
- **External data** includes information beyond the assets themselves, such as weather data from the Bureau of Meteorology and census data from the Australian Bureau of Statistics.

The above data should be sufficiently accurate for the intended use, so a data quality review was undertaken based on three key characteristics: completeness, validation, and consistency (examination for invalid values). The quality review demonstrated that the data is sufficient and accurate for further analysis. Accordingly, this process allows us to establish a comprehensive data file with complete information for each asset, which serves as an important input to further analysis.

Moreover, when information is gathered from multiple sources, it is essential to match the failure records with the network data and identify gaps in the datasets before adopting advanced analytic techniques. In addition, environmental and demographic factors need to be matched with the network data. Specifically, failure records and information are assigned to the corresponding assets based on the work order number, and environmental and demographic information is assigned to the assets based on their geographic locations.
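The matching step described above can be sketched as a keyed join that also reports the records it could not place. This is an illustrative sketch only: the field names (`asset_no`, `wo_no`, `failure_type`) and the sample records are hypothetical, and the join key used in the project may differ.

```python
from collections import defaultdict


def match_failures(pipes, work_orders, key="asset_no"):
    """Attach failure records to pipes by a shared key.

    Returns (matched, unmatched): a dict of key -> list of work orders,
    and the list of work orders with a missing or unknown key.
    """
    known = {pipe[key] for pipe in pipes}
    matched = defaultdict(list)
    unmatched = []
    for order in work_orders:
        if order.get(key) in known:
            matched[order[key]].append(order)
        else:
            unmatched.append(order)
    return dict(matched), unmatched


# Hypothetical network data and work order data.
pipes = [
    {"asset_no": "A1", "material": "AC"},
    {"asset_no": "A2", "material": "PVC"},
]
work_orders = [
    {"wo_no": 1, "asset_no": "A1", "failure_type": "burst"},
    {"wo_no": 2, "asset_no": "A1", "failure_type": "fitting"},
    {"wo_no": 3, "asset_no": "A9", "failure_type": "burst"},  # unknown asset
    {"wo_no": 4, "asset_no": None, "failure_type": "burst"},  # missing key
]

matched, unmatched = match_failures(pipes, work_orders)
match_rate = (len(work_orders) - len(unmatched)) / len(work_orders)
print(f"matched {match_rate:.0%} of work orders")
```

Keeping the unmatched records, rather than silently dropping them, makes the gaps in the datasets visible for the quality review.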

## Factor analysis

Factor analysis has been used to identify pipeline failure drivers and compare their relative impact on the network based on the water network information.

Factor analysis measures the correlation between asset performance, derived from the comprehensive data, and a large range of factors (including environmental, demographic and asset-specific factors). While there is significant existing literature on pipeline failure causes, this step is critical to discerning which of these causes would be the most important for Western Water. Asset performance is expressed as the failure rate, which is the number of asset failures per 100 km per year. Both single-factor and multi-factor analyses were performed to identify the possible driving factors. Asset performance is usually not related to only one factor, so it is essential to measure the correlation based on multiple factors. In contrast to single-factor analysis, multi-factor analysis is a factorial method that studies a group of individuals described by a set of factors.
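To make the failure-rate definition concrete (failures per 100 km per year), the sketch below computes it per group for a single factor. The pipe records, field names and the 12-year observation window are hypothetical values chosen for illustration.

```python
from collections import defaultdict


def failure_rate_by_group(pipes, key, n_years):
    """Failures per 100 km per year for each value of the grouping field `key`."""
    if n_years <= 0:
        raise ValueError("n_years must be positive")
    length_km = defaultdict(float)
    failures = defaultdict(int)
    for pipe in pipes:
        length_km[pipe[key]] += pipe["length_km"]
        failures[pipe[key]] += pipe["failures"]
    return {
        group: failures[group] / (length_km[group] / 100.0) / n_years
        for group in length_km
        if length_km[group] > 0
    }


# Hypothetical pipe records with failure counts over the observation window.
pipes = [
    {"material": "AC", "length_km": 100.0, "failures": 20},
    {"material": "AC", "length_km": 50.0, "failures": 10},
    {"material": "PVC", "length_km": 200.0, "failures": 12},
]

rates = failure_rate_by_group(pipes, key="material", n_years=12)
for material, rate in sorted(rates.items()):
    print(material, round(rate, 2))
```

Normalising by length and time is what allows groups of very different sizes to be compared; a multi-factor version would group on a tuple of fields instead of one.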

## Pipeline failure prediction

This phase involves predicting future (short-term) water pipe failure probabilities. We framed this scenario as determining the likelihood of failure of each given pipe within the next few years.

The model we developed includes specialised algorithms to handle the large volume of numerical calculations and data required for prediction. The underlying statistical principle employed here is Random Forest Regression, as trees are ideal candidates for capturing complex interactions in the pipe data. This model was initially reported by Breiman (2001) and extended by Harvey et al. (2014). Random Forest Regression captures and extrapolates non-linear interactions among failure factors.

The failure prediction is generated by training the machine learning model on historical failure records and other factors. Prediction accuracy is achieved by running many iterations of non-linear regression and then averaging the results. Finally, this trained model produces a failure probability score for each water main asset. This process is schematically illustrated in Figure 3.
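The "train many regressors, then average" idea can be illustrated with a deliberately simplified sketch: bootstrap aggregation (bagging) of depth-one regression trees. This is not Random Forest Regression proper (there is no feature subsampling and the trees have a single split) and it is not the project's implementation; the features (age in years, diameter in mm) and failure labels are hypothetical.

```python
import random
from statistics import mean


def fit_stump(X, y):
    """Fit a one-split regression tree by minimising the sum of squared errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            ml, mr = mean(left), mean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # no valid split: always predict the sample mean
        m = mean(y)
        return (0, float("inf"), m, m)
    return best[1:]


def predict_stump(stump, row):
    j, t, ml, mr = stump
    return ml if row[j] <= t else mr


def fit_bagged(X, y, n_trees=50, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return trees


def predict_bagged(trees, row):
    """Average the predictions of all stumps to get a failure score."""
    return mean(predict_stump(stump, row) for stump in trees)


# Hypothetical training data: [age_years, diameter_mm] -> failed (1) or not (0).
X = [[10, 100], [15, 100], [40, 100], [45, 150],
     [50, 100], [12, 150], [48, 150], [20, 100]]
y = [0, 0, 1, 1, 1, 0, 1, 0]

trees = fit_bagged(X, y, n_trees=50, seed=0)
old = predict_bagged(trees, [47, 100])
young = predict_bagged(trees, [11, 100])
print(f"old pipe score: {old:.2f}, young pipe score: {young:.2f}")
```

Averaging over bootstrap samples smooths out the instability of any single tree, which is why the ensemble's score behaves like a probability between 0 and 1 in this binary-label toy example.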

## Long-term forecasting

This phase extends the short-term prediction results to forecast pipe failures 20 years into the future. Here we assume that the failure rate is a linear function of age, which means the failure rate increases by a constant amount each year.

Our approach to choosing the optimal coefficient $a^{*}$ for the trend shown in Figure 4 is as follows:

- Calculate the maximum and minimum values of the coefficient, $a_{max}$ and $a_{min}$, from the single data points $(A_i, FR_i)$.
- Set the step size to $\epsilon$. For each integer $k$ with $a_{min} \le a_{min} + k \cdot \epsilon \le a_{max}$, obtain the sequence $\{L_k\}$ of sums of squared regression errors (losses), each of which is calculated as $L_k = \sum_i \left(FR_i - (a_{min} + k \cdot \epsilon) \cdot A_i\right)^2$.
- Select the smallest value in $\{L_k\}$ and return the corresponding $a^{*}$ as optimal.

Here, $A_i$ and $FR_i$ are the age and failure rate of each pipe $i$. To obtain the data points shown in Figure 4, we must use a set of pipes, because an individual pipe can only provide a small number of points with high variance. The set of pipes can be a category, which was initially fixed manually (e.g. all AC pipes). Thus, an optimum coefficient is calculated for each pipe type.
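The grid search described above can be sketched as follows. The source does not spell out how the bounds are derived from single data points, so this sketch assumes the per-point slope $FR_i / A_i$; that assumption, the step size and the sample data are all hypothetical.

```python
def best_linear_coefficient(ages, rates, eps=0.001):
    """Grid-search the slope a in FR = a * age that minimises the squared error.

    ASSUMPTION: the search bounds are the smallest and largest per-point
    slopes FR_i / A_i; the original formula for the bounds is not given.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    slopes = [fr / age for age, fr in zip(ages, rates) if age > 0]
    if not slopes:
        raise ValueError("need at least one data point with age > 0")
    a_min, a_max = min(slopes), max(slopes)

    best_a, best_loss = a_min, float("inf")
    k = 0
    while a_min + k * eps <= a_max + 1e-12:  # small tolerance for float drift
        a = a_min + k * eps
        loss = sum((fr - a * age) ** 2 for age, fr in zip(ages, rates))
        if loss < best_loss:
            best_a, best_loss = a, loss
        k += 1
    return best_a, best_loss


# Hypothetical (age, failure rate) points for one pipe category.
ages = [10, 20, 30, 40]
rates = [2.2, 3.8, 6.3, 7.9]

a_star, loss = best_linear_coefficient(ages, rates, eps=0.001)

# Long-term forecast under the linear assumption: rate at age 40 + h years.
forecast = [a_star * (40 + h) for h in (1, 10, 20)]
print(round(a_star, 3), [round(v, 2) for v in forecast])
```

For a line through the origin, the least-squares slope also has the closed form $\sum_i A_i FR_i / \sum_i A_i^2$, which is a convenient check on the grid result; the grid search returns the nearest grid point to it within the bounds.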

# Result analysis

## Data pre-processing

First of all, a data pre-processing task was carried out to clean the raw data and match pipe attribute data with failure records. Data cleaning was conducted to make sure the data is complete and valid.

**Completeness:** this measure does not allow empty values. For water pipes, all the records are complete, as shown in Figure 5. For failure incident records, however, 990 records have empty EVENT_DATE values, giving 95% completeness.

**Validity:** this measure does not allow invalid values. For water pipe data, 3432 records have invalid DATE_MADE values, giving 98.5% validity (see Figure 6). Failure data include 50 invalid records, making them 99% valid for processing.
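A minimal sketch of how such completeness and validity percentages can be computed is shown below. The date format, the plausible date range and the sample records are assumptions for illustration; only the field name EVENT_DATE comes from the text.

```python
from datetime import date, datetime


def completeness(records, field):
    """Share of records whose field is present and non-empty."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def is_valid_date(text, earliest=date(1850, 1, 1), latest=date(2019, 12, 31)):
    """True if text parses as YYYY-MM-DD and falls in an assumed plausible range."""
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return False
    return earliest <= parsed <= latest


def validity(records, field):
    """Share of records whose field holds a valid date."""
    return sum(1 for r in records if is_valid_date(r.get(field))) / len(records)


# Hypothetical failure incident records.
records = [
    {"EVENT_DATE": "2014-03-02"},
    {"EVENT_DATE": ""},            # empty: counts against completeness
    {"EVENT_DATE": "2015-13-40"},  # not a real date: counts against validity
    {"EVENT_DATE": "2016-07-19"},
]

print(f"completeness: {completeness(records, 'EVENT_DATE'):.0%}")
print(f"validity: {validity(records, 'EVENT_DATE'):.0%}")
```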

Figure 7 shows the data matching process, which matches the network data with work order data. Over 90% of data can be successfully matched.

Ultimately, the data pre-processing including the quality review has demonstrated that the data is sufficient and accurate for further analysis. Using the processed data, overall failure rates for burst and fitting failures were calculated for each year from 2005 to 2016, as depicted in Figure 8.

## Factor analysis outcomes

Factor analysis allowed Western Water to compare the relative impact each factor has on causing failures. For example, among operational factors, AC mains were found to fail more often than other materials (see Figure 9). It was also found that water mains laid before 1985 exhibit higher failure rates (see Figure 10).

Furthermore, environmental factors were analysed, including weather data and soil data. The weather data was extracted from the Bureau of Meteorology. Monthly mean temperature data over ten years (2006-2015) was used for the analysis. The results show that overall failure rates increase as temperature increases (see Figure 11).

To quantify the amount of pipe failure information stored in each feature in isolation, we calculated the mutual information between the pipe failure count and each feature. The resulting information scores for Western Water are presented in Figure 12. Pipe size (or diameter) shares the highest amount of mutual information with failures, while pipe type has the least effect on failures. In general, all predictors display very low levels of mutual information on their own, indicating that no single predictor explains failures sufficiently well.
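For discrete variables, mutual information can be computed directly from co-occurrence counts, as in the sketch below. The feature values and failure labels are hypothetical, and continuous features such as diameter would first need to be binned; the estimator used in the project is not stated in the text.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (count / n) * log2((count / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), count in pxy.items()
    )


# Hypothetical pipes: material, binned diameter, and whether each pipe failed.
material = ["AC", "AC", "PVC", "PVC"]
diameter = [100, 150, 100, 150]
failed = [1, 1, 0, 0]

mi_material = mutual_information(material, failed)
mi_diameter = mutual_information(diameter, failed)
print(round(mi_material, 3), round(mi_diameter, 3))
```

In this toy data, material determines failure completely (one bit of shared information) while diameter is independent of it (zero bits). Real features typically fall much closer to zero, which is consistent with the low scores reported above.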

## Failure prediction outcomes (model validation)

**Pipe length based:** Firstly, the prediction model was calibrated using the failure records from 2005 to 2015. The calibrated model was applied to predict the failure probability of each pipe from 2013 to 2016. The pipes were ranked by failure probability. Using the ranked list, actual breaks were accumulated from the highest probability to the lowest (cumulative sum of breaks). The percentage of detected breaks was then plotted against the percentage of inspected pipe length. Figure 13 shows that if the top-ranked 10% of pipe length is inspected, more than 30% of burst failures can be detected.
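The ranking-based validation above can be sketched as a cumulative detection curve. The tuples below (predicted probability, pipe length, actual breaks) are hypothetical and chosen only to show how the curve is built.

```python
def detection_curve(pipes):
    """Build (fraction of length inspected, fraction of breaks detected) points.

    `pipes` is a list of (predicted_probability, length_km, actual_breaks);
    pipes are inspected from the highest predicted probability to the lowest.
    """
    ranked = sorted(pipes, key=lambda pipe: pipe[0], reverse=True)
    total_length = sum(pipe[1] for pipe in ranked)
    total_breaks = sum(pipe[2] for pipe in ranked)
    curve = []
    cum_length, cum_breaks = 0.0, 0
    for _, length, breaks in ranked:
        cum_length += length
        cum_breaks += breaks
        curve.append((cum_length / total_length, cum_breaks / total_breaks))
    return curve


# Hypothetical pipes: (predicted probability, length in km, actual breaks).
pipes = [
    (0.9, 1.0, 3),
    (0.1, 4.0, 0),
    (0.6, 2.0, 1),
    (0.3, 3.0, 1),
]

curve = detection_curve(pipes)
for length_share, break_share in curve:
    print(f"{length_share:.0%} of length inspected -> {break_share:.0%} of breaks")
```

A model with no predictive power would trace the diagonal (10% of length finds about 10% of breaks), so the gap above the diagonal is what measures the value of the ranking.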

**Suburb based:** Results were also validated based on suburb. Here, the suburbs were ranked according to the accumulated failure probability of each pipe in a suburb. Using the ranked list, actual breaks from highest probability to lowest probability are accumulated (cumulative sum of breaks). The percentage of detected breaks is plotted against the percentage of inspected suburbs. Figure 14 shows that if the first 10 suburbs are inspected, more than 70% of burst failures can be detected.

Moreover, when the top 30 suburbs ranked by our model are compared with the top 30 suburbs by actual burst failures, 27 suburbs appear in both lists, as illustrated in Figure 15.

Figure 15: Number of overlapping suburbs between model output and actual burst failure.
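The overlap count can be sketched as the intersection of two top-k lists: one ranked by accumulated predicted probability and one by actual failures. The suburb names and values below are placeholders, not results from the study.

```python
from collections import defaultdict


def top_k_suburbs(values, k):
    """Sum values per suburb and return the k suburbs with the largest totals."""
    totals = defaultdict(float)
    for suburb, value in values:
        totals[suburb] += value
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [suburb for suburb, _ in ranked[:k]]


# Hypothetical per-pipe predicted probabilities and per-suburb actual failures.
predicted = [
    ("Suburb A", 0.30), ("Suburb A", 0.25),
    ("Suburb B", 0.40), ("Suburb C", 0.05), ("Suburb D", 0.20),
]
actual = [("Suburb A", 4), ("Suburb B", 1), ("Suburb C", 2), ("Suburb D", 0)]

k = 2
overlap = set(top_k_suburbs(predicted, k)) & set(top_k_suburbs(actual, k))
print(f"{len(overlap)} of the top {k} suburbs overlap: {sorted(overlap)}")
```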

Table 2: Top 10 risky suburbs for fitting failures

Suburbs are sorted by their normalised fitting failure probability in descending order. The top 10 risky suburbs from 2013 to 2016 are listed in Table 2. The suburbs highlighted in green overlap with actual failure records. In each year, our model correctly identifies 9 of the top 10 risky suburbs.

Figure 16 shows the top 10 risky suburbs for fitting failure in 2016.

The long-term burst failure prediction for all water main assets is given in Figure 17. Note that the prediction is a probability distribution, where darker areas represent high probability estimates and the lightly shaded upper and lower bounds represent low probability estimates. The uncertainty reflects the statistical significance of the correlations identified in the factor analysis.

When we consider the entire spectrum of years, failure rate increases with age (see Figure 4). In line with this pattern, the predicted failure rate rises gradually over the years starting from 2017. The mean prediction, or the trajectory most likely to occur, is represented by a solid line in Figure 17 and Figure 18. Our model predicts that by 2030 burst and fitting failures will increase by 22% and 26% respectively.

# Discussion

The machine learning model developed in this project is based on Random Forest Regression (RFR). We also compared RFR with gradient boosting (GB) and a few other machine learning techniques, such as Neural Networks and Gaussian Process (GP). We obtained the highest accuracies with both RFR and GB. GP is more computationally expensive than RFR. Therefore, we employed RFR to build the model.

## Significance of our approach

- Water mains in more than 40 suburbs in Victoria were studied.
- Data was pre-processed for application through a sequence of steps.
- The factors were analysed for their impact on failures.
- A prediction model was developed and evaluated on historical data.
- Evaluation shows that 20-40% or more of failures can be detected by inspecting 10% of total pipe length.
- Some divergent trends were observed in the Western Water records, such as the failure rate for AC pipes decreasing with age.
- We used a divide-and-conquer method, dividing the AC pipes into subsets based on different features and analysing the trend in each.
- The failure likelihood of each pipe for the next 20 years was predicted, and further analysed based on material and pipe size.
- A failure prediction tool was developed, which automates the process from data cleaning to long-term forecasting, as illustrated in Figure 19.

Our model provides a projection of the likelihood of pipe failure. These likelihoods along with the consequence of failures are being used in Western Water’s current investment planning to make risk-based investment decisions for capital interventions. The severity of the consequence of failures is determined with the input from Western Water’s internal data.

# Conclusion

This project has been developed to assist with forecasting and planning water main renewals with more confidence via predictive analytics. Pipeline maintenance and renewal programs balance level of service requirements and the need to minimise cost to customers. Therefore, we constructed a complete picture of factors causing pipe failures in Western Water’s water pipeline network and developed a prediction model to estimate the probabilities of water main breaks based on those factors.

Results demonstrate that our model is capable of providing valuable assistance to forecast and plan water main renewals with more confidence via predictive analytics. The next step is to apply a consequence rating to enhance the model predictions, as both of these factors are important to identify the priority of pipe renewals. Ultimately, we believe this work, at the intersection of machine learning and asset management, will lead to more effective and proactive infrastructure maintenance in the Australian water industry.

# About the authors

**Dilusha Weeraddana** | Dr. Dilusha received her PhD degree from Monash University, Australia in 2017. She is currently working as a Research Fellow at Data61-CSIRO. Dilusha has more than 7 years of experience obtained through both commercial and research pursuits. She has diverse research interests in data mining, machine learning, survival analysis, and mathematical modelling.

**Bin Liang** | Dr. Bin Liang received his PhD degree from Charles Sturt University, Australia in 2016. He is currently a lecturer at UTS. Before joining UTS, he was a post-doctoral fellow at Data61-CSIRO. His experience and research interests include computer vision, machine learning, and survival analysis for predictive modelling.

**Zhidong Li** | Dr. Zhidong Li received his PhD degree from UNSW, Australia. He is currently a senior lecturer at UTS. Before this, he was a senior engineer at Data61-CSIRO. Zhidong has received multiple awards, including the Australian Museum Eureka Prize for excellence in data science. His research interests include machine learning, data mining, and pattern recognition.

**Yang Wang** | Dr. Yang Wang is an associate professor at UTS as well as a visiting principal researcher at Data61-CSIRO. He received his PhD degree from the National University of Singapore in 2004. His research interests include machine learning and information fusion techniques, and their applications to asset management, intelligent infrastructure, medical imaging, and computer vision.

**Fang Chen** | Professor Fang is a prominent leader in AI/data science with industrial recognition. She is the winner of the Australian Museum Eureka Prize for Excellence in Data Science, 2018. Her practical contributions to industry transformation have earned her many industry recognitions, including being named “Water Professional of the Year” in 2016.

**Livia Bonazzi** | Livia is General Manager Strategy at Western Water. Livia is a passionate advocate for strategic asset management and sustainable urban development. Her current role involves community engagement and inter-agency collaboration to develop and implement innovative policies as well as integrated planning solutions to efficiently service the unprecedented urban growth in Melbourne’s western growth corridors.

**Dean Phillips** | Dean is a chemical engineer who has been in the water industry for 7 years working in trade waste and asset management. He is now in water treatment, where he leads a team of operators and drives efficiency, compliance and quality across all 7 of Western Water’s water filtration plants.

**Nitin Saxena** | Nitin is a former Manager Strategic Asset Management at Western Water. Before joining Western Water, he worked at SA Water for 5 years as a Manager Asset Maintenance. Nitin has over 10 years of experience in asset management and urban development.