Contents
Keywords
Abstract
Introduction
Literature Review
Clustering Algorithm Techniques
Partitional Clustering
Hierarchical Clustering
Density-Based Clustering (DBSCAN)
Neural Networks
Fuzzy Clustering
Grid Clustering
Conclusion
Recommendations for future research, agreement and disagreement
References
Keywords
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), Density-Based Clustering (DBSCAN), Distributed Density-Based Clustering (DDC), Ordering Points To Identify the Clustering Structure (OPTICS), Clustering Using Representatives (CURE) and Fuzzy C-Means (FCM)
Abstract
Clustering is a method in which a given collection of data is split into different sets, known as clusters, in such a manner that data items that are similar to each other lie in the same cluster. Clustering plays a vital role in data mining because data-mining tasks typically involve very large data sets. This study describes the main clustering techniques available for data mining and analyses them on the basis of different algorithms such as CURE, CLARANS, k-means, CLARA and DBSCAN.
Introduction
Clustering is defined as the unsupervised classification of data sets or observations: the data have not been divided into clusters beforehand and therefore carry no class attribute. Clustering is widely used as one of the vital steps in exploratory data analysis. Clustering algorithms help in finding useful and previously unknown patterns, and they divide the data into groups of similar objects; objects that are not similar are placed in different clusters. Depending on the selected metric, a data object can belong to exactly one cluster or to more than one cluster. For instance, consider a retail database that contains information about the products bought by customers; clustering helps in grouping the customers according to their purchasing patterns.

When objects are combined into clusters, simplification is achieved but some information is lost. Choosing the clustering method that best preserves this information is the main focus of this study, and selecting the suitable clustering algorithm to apply is a critical step. The primary aim of this paper is to offer an explicit evaluation of distinct clustering algorithms used in data mining, based on common criteria and on their complexity.

The paper contains three sections. The first section is a literature review; the second section describes different clustering techniques used in data mining and compares them in terms of time and space complexity; the third section provides a table of clustering algorithms with their functions and applications, followed by a conclusion. This review is important because it addresses the issues that arise when handling large and continuously arriving data sets. The paper discusses partitioning-based, hierarchical, density-based, model-based and grid-based clustering. The sources used in this review include electronic libraries, reliable information available on the internet, periodicals, journals, magazines, newsletters and reports from the collaborating sector.
Literature Review
According to Dalal et al. (2011), clustering techniques can be divided into three primary families: partitioning clustering, density-based clustering and hierarchical clustering, with hierarchical clustering further subdivided as demonstrated in Figure 1. According to Liu et al. (2015), if the data are continuous and large in nature, conventional mining methods are not suitable, because real-time data change very quickly and demand an instant response; this was an early obstacle to handling rapid, continuous data streams. Random access to a data stream is very expensive, so only a single pass over the streaming data is available, and the required storage is very large. Hence, for streaming data, both data-mining and clustering approaches are required. Fahad et al. (2016) described how the effectiveness of a candidate clustering algorithm can be measured using various external and internal validity metrics, scalability tests, run time and stability. Big data has its own weaknesses: it requires a large volume of storage, which makes analytical, processing and retrieval functions very time consuming. To address these challenges, big data should be clustered into a compact format that is still an informative version of the whole data set. Other clustering techniques available to deal with large data sets include BIRCH, OptiGrid and DENCLUE.
No single clustering algorithm currently satisfies all of these evaluation criteria, and future work is dedicated to addressing these issues and challenges so that each clustering algorithm can cope with big data. According to Douglass, document clustering was not well received at first because of certain constraints: clustering techniques are relatively slow when dealing with large document collections and, in addition, it was thought that clustering does not assist in improving retrieval. Douglass described a document-browsing technique that uses fast clustering algorithms as its main building blocks, resulting in a highly interactive browsing method called Scatter/Gather; a fast clustering technique is required for it to work. The working of this technique can be understood through the example of a book: if one needs to find a specific term, it can be looked up directly in the index, whereas to answer a broader question one needs to consult the table of contents. The proposed Scatter/Gather system uses cluster-based navigation to move between documents and can combine one or more similar documents for future reference. In the beginning, the system scatters the processed documents into various groups (clusters); the user then gathers selected groups into a sub-collection, which is scattered again. The technique therefore offers two facilities, clustering and re-clustering, and it relies on fast partial clustering algorithms such as Fractionation and Buckshot (Singh et al., 2013).
Distributed clustering techniques work on two kinds of architecture: homogeneous data sets, in which every local site has the same attributes, and heterogeneous data sets, in which the local sites have distinct attributes but are linked through a common attribute. According to Zhao et al. (2012), high-quality and fast algorithms are necessary to browse large volumes of documents. Here one can use agglomerative clustering and partitional clustering, both of which belong to the hierarchical family. Experimental evaluation showed that partitional algorithms are much better than agglomerative algorithms because their computational requirements are far lower while their clustering performance is comparable. Guha et al. (2018) defined a new class of clustering algorithms, named constrained agglomerative algorithms, which are an amalgamation of the agglomerative and partitional approaches. That study proposed a new clustering algorithm that is robust to outliers and can identify clusters that are not spherical in shape; the hierarchical algorithm provides a middle ground between the all-points and centroid-based methods. Guha et al. (2018) also offer a comparison between the CURE and BIRCH algorithms; to deal with large databases, the algorithm uses random sampling and partitioning. Clustering algorithms are used in various applications such as image segmentation, speech recognition, character recognition and vector quantisation; since these are machine-learning tasks, clustering plays an important role in machine learning. The main challenges for such algorithms are the random selection of the initial centroids and their ability to handle a continuous arrival of data. The study also considers the classification of information across distributed nodes: if conventional clustering algorithms are applied to a distributed database, all data sets must be transmitted to a central site, which is impractical given the huge data sets held at every local site and the privacy and bandwidth constraints. Distributed clustering is of two types. The first is hard (crisp) clustering, in which every data object belongs to exactly one cluster, so the clusters are disjoint; examples are k-means and PAM. The second is soft clustering, in which a data object may belong to more than one cluster; examples are particle swarm optimisation (PSO) and neural networks. PSO is a population-based stochastic optimisation technique built on a swarm of particles. Various researchers have described PSO-based clustering, which forms clusters on the basis of sample features such as accuracy, efficiency and error count (Guha et al., 2018).
Clustering Algorithm Techniques

Partitional Clustering
Partitional clustering represents clusters with the help of prototypes and uses an iterative control strategy to optimise the clusters. A partitioning algorithm splits the data into subsets called partitions, and every partition is a cluster. The clusters that are formed have the following characteristics: each cluster contains at least one data object, and each data object belongs to exactly one cluster.
These algorithms are divided into two categories, medoid algorithms and centroid algorithms. In medoid algorithms, every cluster is represented by the instance that lies closest to its centre of gravity, whereas in centroid algorithms the centre of gravity itself represents the cluster. For instance, in the k-means clustering method the data set is divided into k subsets in such a way that all points in a subset are closest to the same centre of gravity; a short code sketch is given after the complexity notes below. The efficiency and effectiveness of this algorithm depend on the objective function used to measure the distance between instances. The technique requires that all the data be available in advance.
Time Complexity: the time complexity is O(nkl), where n is the number of patterns, k is the number of clusters and l is the number of iterations required to converge. With k and l fixed in advance, the algorithm attains linear time complexity.
Space Complexity: the space complexity is O(k + n); additional storage is needed to hold the data.
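As an illustration of the centroid-based approach described above, the following is a minimal k-means sketch in Python using NumPy. The synthetic data, the choice of k = 2 and the iteration cap are assumptions made only for this example and are not taken from the paper.

import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pick k random points as initial centroids (the random initialisation noted
    # in the literature review as a weakness of the method).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        centroids_new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(centroids_new, centroids):
            break  # stop once the centroids no longer move
        centroids = centroids_new
    return labels, centroids

# Example usage on synthetic two-dimensional data.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = k_means(X, k=2)

Each pass over the data performs one assignment and one update step, which is where the O(nkl) running time quoted above comes from.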
Hierarchical Clustering
A dendrogram is a tree of clusters constructed by the hierarchical technique according to a proximity measure. Every cluster node may contain child (offspring) nodes, and nodes that share the same parent are known as sibling nodes. The hierarchical method has a special property of quick termination. Instances of this kind of clustering are BIRCH, CHAMELEON and CURE. The technique can be further bifurcated into the agglomerative method and the divisive method.
Agglomerative Method:
This method works in a bottom-up manner: it starts with every data instance in its own cluster and repeatedly merges the closest pairs of clusters until all instances fit into one cluster. The closeness of a pair of clusters can be defined as complete-link or single-link (a brief code sketch follows the complexity notes below).
Divisive Method:
This method works in a top-down manner, in which the data are repeatedly divided into smaller clusters until every cluster contains exactly one data instance.
Space Complexity: O(n²)
Time Complexity: O(n² log n), where n represents the number of patterns.
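As a brief sketch of the bottom-up (agglomerative) approach, the following Python example uses SciPy's hierarchical clustering routines; the synthetic data and the decision to cut the dendrogram into two clusters are assumptions made for illustration only.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 4])

# Build the dendrogram (tree of clusters) by repeatedly merging the closest pair.
Z_single = linkage(X, method='single')      # single-link: distance of the closest pair of points
Z_complete = linkage(X, method='complete')  # complete-link: distance of the farthest pair of points

# Cut the tree so that exactly two clusters remain.
labels = fcluster(Z_complete, t=2, criterion='maxclust')

The choice between single-link and complete-link only changes how the closeness of two clusters is measured during the merging steps described above.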
Density-Based Clustering (DBSCAN)
In this type of clustering, various distance metrics can be used and the number of clusters is determined automatically by the algorithm. Data objects are separated from each other on the basis of boundary, connectivity and local density. In DBSCAN, a data point either belongs to some cluster or is categorised as noise. Data points are classified as core points, border points and noise points (a code sketch follows the complexity notes below).
Core points: points that lie within a cluster. A point is a core point if the number of data points in its neighbourhood meets a given threshold.
Border points: points that are not core points themselves but lie in the close proximity of a core point.
Noise points: points that are neither core points nor border points.
Time Complexity: O(m × t), where m is the number of data points and t is the time required to find the points in a point's neighbourhood.
Space Complexity: O(m)
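The following is a hedged sketch of DBSCAN using scikit-learn; the eps radius and min_samples threshold are example values chosen for the synthetic data, not parameters given in the paper.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.vstack([np.random.randn(100, 2) * 0.3,
               np.random.randn(100, 2) * 0.3 + 3])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_  # a label of -1 marks noise points

# Core points have at least min_samples neighbours within distance eps.
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

# Border points belong to a cluster but are not core points themselves.
border_mask = (labels != -1) & ~core_mask
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

The masks correspond directly to the core, border and noise points defined above, and the number of clusters is discovered by the algorithm rather than fixed in advance.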
Neural Networks
Neural networks are non-linear data-modelling techniques that simulate the working of the brain. These networks are used to identify relationships among patterns based on the input and output values.
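The text does not name a specific network, so the sketch below uses simple winner-take-all competitive learning (essentially a one-dimensional self-organising map without a neighbourhood function), one common neural approach to clustering; the data and the number of units are illustrative assumptions.

import numpy as np

def competitive_learning(X, n_units=4, n_epochs=50, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Each unit holds a weight vector that acts as a cluster prototype.
    weights = X[rng.choice(len(X), size=n_units, replace=False)].astype(float)
    for epoch in range(n_epochs):
        alpha = lr * (1 - epoch / n_epochs)  # decaying learning rate
        for x in X[rng.permutation(len(X))]:
            # The winning unit is the prototype closest to the input pattern.
            winner = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Move the winner towards the input, strengthening that connection.
            weights[winner] += alpha * (x - weights[winner])
    return weights

X = np.vstack([np.random.randn(40, 2), np.random.randn(40, 2) + 5])
prototypes = competitive_learning(X)
labels = np.argmin(np.linalg.norm(X[:, None] - prototypes[None], axis=2), axis=1)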
Fuzzy Clustering
This clustering is based on fuzzy reasoning: patterns are associated with clusters through membership functions, so each pattern belongs to every cluster to some degree, and the resulting clusters overlap.
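Since the keywords list names Fuzzy C-Means (FCM), the following is a minimal FCM sketch in Python; the fuzzifier m = 2, the number of clusters and the synthetic data are assumptions made for this example.

import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random membership matrix: each row sums to 1 (degrees of membership).
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        Um = U ** m
        # Cluster centres are membership-weighted means of the data.
        centres = (Um.T @ X) / Um.T.sum(axis=1, keepdims=True)
        # Distance from every point to every centre (epsilon avoids division by zero).
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-10
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1)).
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
    return U, centres

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
U, centres = fuzzy_c_means(X, c=2)
hard_labels = U.argmax(axis=1)  # collapse the overlapping memberships if a crisp assignment is needed

Because every row of U sums to 1 but is spread across all clusters, the clusters overlap, which is exactly the property described above.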
Grid Clustering
The grid-based technique is very useful in spatial data mining. It splits the space into a number of cells, and the clustering operations are then carried out in this quantised space.
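The following is a hedged sketch of the grid-based idea for two-dimensional spatial data: quantise the space into cells, keep the dense cells and merge adjacent dense cells into clusters. The cell size and density threshold are illustrative assumptions, and the cell-merging rule is a simplified stand-in for the grid methods mentioned in the literature review (e.g. OptiGrid).

import numpy as np
from collections import deque

def grid_cluster(X, cell_size=1.0, density_threshold=3):
    # Quantise the space: map each 2-D point to the integer coordinates of its cell.
    cells = np.floor(X / cell_size).astype(int)
    counts = {}
    for cell in map(tuple, cells):
        counts[cell] = counts.get(cell, 0) + 1
    # Keep only the dense cells.
    dense = {cell for cell, n in counts.items() if n >= density_threshold}
    # Merge neighbouring dense cells into clusters with a breadth-first search.
    cluster_of, next_id = {}, 0
    for start in dense:
        if start in cluster_of:
            continue
        cluster_of[start] = next_id
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in cluster_of:
                        cluster_of[nb] = next_id
                        queue.append(nb)
        next_id += 1
    # Label each point by its cell's cluster; points in sparse cells get -1.
    return np.array([cluster_of.get(tuple(c), -1) for c in cells])

X = np.vstack([np.random.randn(80, 2), np.random.randn(80, 2) + 6])
labels = grid_cluster(X)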
Table 1. Clustering algorithms with their functions and applications

Clustering algorithm | Function | Application
K-means and fuzzy c-means | Extracting genetic patterns | Battle simulations
Density-based sequencing clustering | Identifying climate patterns | Meteorological department
Hierarchical clustering | Identifying trend changes in traffic | Traffic department
K-means clustering | Identifying share and stock patterns | Stock market
Mean shift clustering | Feature recognition | Image capture
EM learning | Monitoring the condition of tools | Public data
Conclusion
This paper discussed various kinds of clustering, such as partitional, density-based, hierarchical and grid-based clustering, their time and space complexities, and how they support data mining. Partitional clustering generally represents clusters with the help of prototypes; it is very useful for convex clusters of similar size, but the number of clusters must be specified in advance, and this prior prediction of the number of clusters is its main weakness. Hierarchical methods split the data sets into different levels, and this hierarchy of partitions is known as a dendrogram; the techniques described above are very effective for data mining, but constructing dendrograms is very expensive for large volumes of data. Another algorithm that is very useful for mining large data sets is density-based clustering, which can easily identify noise and can also handle clusters of arbitrary shape.
Recommendations for future research, agreement and disagreement
Vectorial data dominate current data mining, but future data will be stored in much more complex forms, and data mining will have to adapt to the ever-increasing volume of data. Another aspect is the notion of "patterns in data", whose importance will also grow as the field evolves. A further aspect is usability: the ability to detect patterns that are understandable will become more important, and data-mining methods will need to become more user friendly. Future data mining will have to cope with complicated inputs and pre-processing, and users may have to handle far larger amounts of data. Thus, achieving user friendliness with clear, or even minimal, parameterisation is a primary aim. Usability can also be improved by finding new kinds of patterns that are simple to interpret even when the input is very complicated.
Even though no one can predict the future, it is believed that ample challenges and issues lie ahead, and a few of them cannot be foreseen at the present moment.
References
Dalal, M. A. & Harale, N. D. (2011). A survey on clustering in data mining. International Conference and Workshop on Emerging Trends in Technology, TCET, Mumbai, India, 559-562.
Guha, S., Rastogi, R. & Shim, K. (2008). CURE: An efficient clustering algorithm for large databases. 73-84.
Zhao, Y. & Karypis, G. (2012). Evaluation of hierarchical clustering algorithms for document datasets.
Singh, D. & Gosain, A. (2013). A comparative analysis of distributed clustering algorithms: A survey. 2013 International Symposium on Computational and Business Intelligence, 165-169.
Liu, Q., Jin, W., Wu, S. & Zhou, Y. (2015). Clustering research using dynamic modelling based on granular computing. 539-543.
Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Zomaya, A., Khalil, I., Foufou, S. & Bouras, A. (2016). A survey of clustering algorithms for big data: Taxonomy and empirical analysis. 267-279.