In this article, you’ll learn about Outliers in Data mining.
An outlier is a data point that deviates markedly from the rest of the data.
An outlier may reflect genuine variability in the measurements or an experimental error. In other words, an outlier is an observation that lies far away from the overall pattern of the sample data.
Outliers are often confused with noise (noisy data is meaningless data). However, outliers differ from noise in the following ways:
- Noise is a random error, whereas an outlier is an observation that lies far away from the other observations.
- Noise should be removed for better outlier detection.
Outliers are of three types, namely –
- Global (or Point) Outliers
- Collective Outliers
- Contextual (or Conditional) Outliers
1. Global Outliers
They are also known as Point Outliers. These are the simplest form of outliers. If, in a given dataset, a data point strongly deviates from all the rest of the data points, it is known as a global outlier. Most outlier detection methods are aimed at finding global outliers.
For example, in an Intrusion Detection System, if a large number of packets are broadcast in a very short span of time, this may be considered a global outlier, and we can say that the particular system has potentially been hacked.
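One simple way to flag a global outlier is the interquartile range (IQR) rule listed among the detection methods later in this article. The sketch below is only an illustration: the traffic counts are made up, the function name `iqr_outliers` is hypothetical, and the multiplier `k = 1.5` is a common convention rather than something this article prescribes.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical packets-per-second counts; one burst is far above the rest.
packets_per_second = [120, 135, 128, 140, 131, 125, 138, 5000, 129, 133]
print(iqr_outliers(packets_per_second))  # -> [5000]
```

The burst of 5000 packets stands apart from every other reading, so it is flagged regardless of any context.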
2. Collective Outliers
As the name suggests, if in a given dataset, some of the data points, as a whole, deviate significantly from the rest of the dataset, they may be termed as collective outliers. Here, the individual data objects may not be outliers, but when seen as a whole, they may behave as outliers. To detect these types of outliers, we might need background information about the relationship between those data objects showing the behavior of outliers.
For example: In an Intrusion Detection System, a DoS (denial-of-service) packet sent from one computer to another may be considered normal behavior. However, if this happens with several computers at the same time, it may be considered abnormal behavior, and as a whole these events can be termed collective outliers.
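The idea that events are unremarkable individually but suspicious together can be sketched by grouping events into time windows and counting distinct sources. Everything in this sketch is an assumption made for illustration: the event format, the function name `collective_outlier_windows`, the 10-second window, and the threshold of 3 sources.

```python
from collections import defaultdict

def collective_outlier_windows(events, window_seconds=10, min_sources=3):
    """Return start times of windows in which at least `min_sources`
    distinct hosts sent a suspicious packet."""
    sources_by_window = defaultdict(set)
    for timestamp, source in events:
        # Assign each event to the fixed-size window that contains it.
        window_start = (timestamp // window_seconds) * window_seconds
        sources_by_window[window_start].add(source)
    return sorted(
        start for start, sources in sources_by_window.items()
        if len(sources) >= min_sources
    )

# Hypothetical (timestamp in seconds, source host) pairs.
events = [(3, "a"), (47, "b"), (101, "a"), (102, "b"),
          (104, "c"), (107, "d"), (250, "c")]
print(collective_outlier_windows(events))  # -> [100]
```

No single event here is flagged on its own; only the window starting at 100 seconds, where four hosts act together, is reported.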
3. Contextual Outliers
They are also known as Conditional Outliers. A data object is a contextual outlier if, in a given dataset, it deviates significantly from the other data points only under a specific context or condition. A data point may be an outlier under one condition and show normal behavior under another. Therefore, a context has to be specified as part of the problem statement in order to identify contextual outliers. Contextual outlier analysis gives users the flexibility to examine outliers in different contexts, which can be highly desirable in many applications. Whether a data point is an outlier is decided on the basis of both its contextual attributes (which define the context) and its behavioral attributes (which are evaluated within that context).
For example: A temperature reading of 40°C may behave as an outlier in the context of a “winter season” but will behave like a normal data point in the context of a “summer season”.
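The temperature example can be sketched by computing a z-score for each reading within its own context. This is a minimal illustration under stated assumptions: the readings are invented, `contextual_outliers` is a hypothetical helper, and the threshold of 2.0 is chosen only because the samples are very small.

```python
import statistics
from collections import defaultdict

def contextual_outliers(readings, threshold=2.0):
    """readings: list of (context, value) pairs. Return the pairs whose
    z-score within their own context exceeds `threshold` in absolute value."""
    by_context = defaultdict(list)
    for context, value in readings:
        by_context[context].append(value)

    flagged = []
    for context, value in readings:
        values = by_context[context]
        mean = statistics.mean(values)
        stdev = statistics.pstdev(values)  # population standard deviation
        if stdev > 0 and abs(value - mean) / stdev > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical temperature readings in degrees Celsius, tagged by season.
winter = [("winter", t) for t in [2, 4, 3, 5, 1, 3, 4, 2, 40]]
summer = [("summer", t) for t in [38, 41, 39, 40, 42, 37, 40]]
print(contextual_outliers(winter + summer))  # -> [('winter', 40)]
```

The same value of 40 appears in both seasons, but only the winter reading is flagged, because the comparison is made against readings that share its context.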
Outlier Detection methods
Some of the outlier detection methods are as follows:
- Z-Score Normalization
- Linear Regression Models (PCA, LMS)
- Information Theory Models
- High Dimensional Outlier Detection Methods (high dimensional sparse data)
- Proximity Based Models (non-parametric)
- Probabilistic and Statistical Modeling (parametric)
- Numeric Outlier
- Numeric Outlier: The numeric outlier technique is a nonparametric outlier detection technique for a one-dimensional feature space. Outliers are calculated by means of the interquartile range (IQR): values that lie more than a chosen multiple of the IQR (commonly 1.5) below the first quartile or above the third quartile are treated as outliers.
- Z-Score: The z-score is a data normalization technique that expresses each value as its distance from the mean in units of standard deviation. As an outlier test it assumes a Gaussian distribution of the data; values whose absolute z-score exceeds a chosen threshold (commonly 3) are treated as outliers.
- DBSCAN: This technique is based on the DBSCAN clustering algorithm. It is a density-based, nonparametric outlier detection technique for a one- or multi-dimensional feature space. DBSCAN classifies every data point as one of the following:
- Core Points
- Border Points
- Noise Points (these are identified as the outliers)
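The three DBSCAN point types can be sketched in a few lines of plain Python. This covers only the point-labelling step, not the full clustering algorithm, and the names `classify_points`, `eps`, and `min_pts` as well as the sample coordinates are assumptions made for illustration. A core point has at least `min_pts` points (itself included) within distance `eps`, a border point is not core but lies within `eps` of a core point, and a noise point is neither.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' using DBSCAN's definitions."""
    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    neighbourhoods = [neighbours(i) for i in range(len(points))]
    core = {i for i, nb in enumerate(neighbourhoods) if len(nb) >= min_pts}

    labels = []
    for i, nb in enumerate(neighbourhoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")   # near a core point, but not dense itself
        else:
            labels.append("noise")    # treated as an outlier
    return labels

# Hypothetical 2-D points: a tight group, one point on its edge, one far away.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (1.6, 1.0), (8.0, 8.0)]
print(classify_points(points, eps=0.5, min_pts=4))
# -> ['core', 'core', 'core', 'core', 'border', 'noise']
```

The brute-force neighbour search compares every pair of points, which is fine for a small example; for real datasets a library implementation with spatial indexing is the more practical choice.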
Various causes of outliers in Data Mining
There are various causes of outliers in Data Mining. Some of these causes are given below:
- Fraudulent activity in the banking sector, such as credit card hacking or similar frauds.
- Changes in the buying patterns of a customer.
- Typing errors and reporting errors made by humans.
- Errors or faults in machines or systems.
Why do outliers need to be handled in Data Mining?
There are various reasons to handle outliers in Data Mining. Some of those reasons are listed below:
- Outliers can distort the results of any analysis run on the data, for example by skewing averages.
- Outliers often lead to useful conclusions, because they can reveal trends or patterns that would otherwise go unrecorded.
- Outliers can also be beneficial in research, where they can be extremely useful in making discoveries.
- Outlier analysis is one of the key branches of data mining.
Applications of Outlier Detection in Data Mining
Outlier Detection is used extensively in Data Mining to uncover unusual patterns or trends in data. The applications of Outlier Detection in Data Mining are given below:
- Fraud Detection
- Telecom Fraud Detection
- Intrusion Detection in Cyber Security
- Medical Analysis
- Environmental Monitoring of events such as cyclones, tsunamis, floods, droughts and so on
- Noticing unforeseen entries in Databases