Clustering data processing.
Cluster analysis is a method for organizing data so that the connections between data points can be better understood, revealing patterns that may otherwise not be visible. With any large data set, one faces the challenge of exactly how to interpret and connect the data in a way that makes sense. Without this, it is impossible to draw meaningful information and conclusions from the data.
In the most basic sense, cluster analysis involves arranging data so that observations with similar characteristics are grouped together. To demonstrate this, let us consider a simple example: say we have a data set containing information about consumers of a certain product. At the most basic level, these consumers could be clustered into, say, age groups, allowing us to consider the age demographics involved. However, more complex clusters can be created by taking further variables into account, so that the consumers would be grouped by their overall similarity to one another. Through cross-analysis with the data, this allows for a richer and deeper insight into the types of people consuming this product.
Are there different types of cluster analysis?
Just as there are many different techniques used for data analysis more broadly, there are also different approaches to cluster analysis. Among the most common are K-means, hierarchical clustering, and DBSCAN; the first two are described below.
K-means cluster analysis involves first choosing the number of clusters, K, that the data will be partitioned into. Each cluster is represented by a central value, or centroid, which is the mean of the data points within it. Once K has been chosen, the data is clustered so as to minimize the distance between each observation and the centroid of its cluster.
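As a minimal sketch of this idea (Lloyd's algorithm on one-dimensional data; the `kmeans` helper and the sample consumer ages are illustrative, not Nproperty™ code):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """K-means on 1-D data: pick K initial centroids, then repeatedly
    assign each point to its nearest centroid and recompute each
    centroid as the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Consumer ages with two natural groups, one around ~25 and one around ~60
ages = [22, 25, 27, 24, 61, 58, 63, 60]
centroids, clusters = kmeans(ages, k=2)
print(sorted(round(c) for c in centroids))  # → [24, 60]
```

The two centroids settle near the means of the young and older consumer groups, which is exactly the age-group partition described above.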
Hierarchical clustering follows a rather different method. First, each observation is considered on its own to constitute an entire cluster; the set of all observations clustered individually forms the ‘bottom level’ of the hierarchy. Then a merge takes place in which the two most similar clusters join, and the overall number of clusters decreases by 1. This repeats until the desired number of clusters is reached. Alternatively, a ‘top-down’ hierarchical method can be used, where the entire set begins together as one large cluster, which is then split, with each split increasing the number of clusters by 1, again continuing until the desired number of clusters is reached.
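The bottom-up (agglomerative) variant can be sketched as follows; this is a toy centroid-linkage version on one-dimensional data, with the `agglomerative` helper and sample values assumed for illustration:

```python
def agglomerative(points, target_k):
    """Bottom-up hierarchical clustering on 1-D data: start with each
    point as its own cluster, then repeatedly merge the two clusters
    whose means are closest until target_k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Find the pair of clusters with the closest means
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                mi = sum(clusters[i]) / len(clusters[i])
                mj = sum(clusters[j]) / len(clusters[j])
                d = abs(mi - mj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i; the overall count drops by 1
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

groups = agglomerative([22, 25, 27, 24, 61, 58, 63, 60], target_k=2)
print(sorted(sorted(g) for g in groups))  # → [[22, 24, 25, 27], [58, 60, 61, 63]]
```

Each pass of the `while` loop is one merge in the hierarchy; stopping earlier or later simply reads off a different level of that hierarchy.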
Both of these methods of cluster analysis are favored for being intuitive and simple to implement. In the context of Nproperty™, this means that our analysts can quickly and easily use cluster analysis when considering our customers’ data sets, and draw accurate conclusions. It is also worth noting that these two methods are highly sensitive to outliers in the data, and so it is important that our analysts remain aware of any observations that do not fit the typical data pattern to ensure results are not skewed in any way.
How is this done?
We have developed two different means of analysis that allow us to identify anomalies within data and ultimately provide recommendations to businesses. The first of these methods is used to identify any facilities operating with unusual electrical load profiles, and is based on an analysis of electric operational data across multiple facilities of the same business. The second is used to identify abnormal days of operation within any individual facility of the business.
Both of these methods employ cluster analysis techniques to arrive at our results. Through our experiments, we have identified cluster analysis as an efficient and effective tool for identifying anomalous electricity use. In our operations, it has meant that our analysts can easily identify problem facilities, and are not required to manually scan through all available data, allowing us to provide timely and accurate recommendations.
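As a hedged illustration of how clustering can surface anomalous electricity use (not Nproperty™'s actual pipeline; the `flag_outliers` helper and the sample kWh figures are assumptions), one simple approach is to flag points that sit unusually far from every cluster centroid:

```python
def flag_outliers(points, centroids, factor=2.0):
    """Flag points whose distance to the nearest centroid exceeds
    `factor` times the average such distance — a simple proxy for an
    'unusual' load once the data has been clustered."""
    dists = [min(abs(p - c) for c in centroids) for p in points]
    avg = sum(dists) / len(dists)
    return [p for p, d in zip(points, dists) if d > factor * avg]

# Hypothetical daily kWh totals for facilities of one business;
# the single centroid stands in for a prior clustering step
daily_kwh = [100, 104, 98, 102, 180, 99, 101]
anomalies = flag_outliers(daily_kwh, centroids=[100.0])
print(anomalies)  # → [180]
```

The analyst then only needs to inspect the flagged facilities, rather than scanning through every profile by hand.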
Projects using cluster analysis
Nproperty™ has access to very large volumes of time-series data – in our case, energy use data is recorded and stored for many buildings at 1-minute intervals. This data is drawn from a range of sources, including customers’ existing metering infrastructure, our installed meters and gateways, and third parties. A manual analysis of this data would not be possible due to the size and complexity of the information stored, and so cluster analysis is used as a tool for simplification.
With cluster analysis, however, we are also able to categorize this data in a way that allows us to view anomalies. This categorized data is then viewed by our Nproperty™ analysts, who are able to identify the similarities that exist within clusters and make inferences as to why these clusters are grouped together. These inferences are then used to identify the causes of excess electricity use, and recommendations can be made accordingly.
Data science and industry
Clustering and optimisation