# Data Mining and Warehousing - 3

## Introduction to Data Mining Concepts and Techniques

## Introduction to Data and Data Mining

Data mining is the process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

The goal of data mining is to extract information from a data set and transform it into an understandable structure for further use.

**Data**

Data refers to values of qualitative or quantitative variables about objects.

Types of data:

Numerical - values represented by numbers like temperature, weight, etc.

Categorical - values represented by categories like gender, colour, etc.

Text - unstructured data like documents, articles, and social media posts.

Image - photos, videos, diagrams, etc.

Audio - recordings of sounds, voices, music, etc.

Data quality aspects:

Accuracy - free from errors

Completeness - contains all required information

Consistency - uses the same definitions and formats

Timeliness - up-to-date and recent

**Data mining**

Data mining involves six major tasks:

Classification - predicting class labels

Regression - predicting continuous values

Clustering - grouping similar data points

Association rule learning - finding correlations

Anomaly detection - finding unusual patterns

Feature engineering - constructing features for modelling

## Data Types

Data can be categorized into different types based on its structure and format:

**Structured data**

Structured data has a predefined format and is organized in a tabular format like rows and columns. It is easy to query, analyze and store in databases.

Examples of structured data:

Data in relational databases

Data in spreadsheets

Data from sensors and RFID tags

Features of structured data:

Organized in tables with fields, records and relationships

Follows predefined data models

Easy to query, search and aggregate

Advantages:

Well-organized

Easy to integrate with data mining and machine learning algorithms

**Unstructured data**

Unstructured data does not have a predefined data model, format or schema. It includes text, images, audio and video.

Examples:

Text documents, emails, social media posts

Images, videos, audio recordings

Sensor data, log files

Features:

Raw format with no structure

Difficult to query and analyze directly

Advantages:

- Contains valuable information not captured in structured data

**Semi-structured data**

Semi-structured data has some structure but not a predefined schema. It uses tags or other markers to denote fields, elements or attributes within the data.

Examples:

XML and JSON documents

Web pages

Features:

Some structure defined by tags

More flexible than structured data

Advantages:

- Easier to analyze than unstructured data

## Data Quality

For data to be useful for data mining and analysis, it needs to have high quality. The main aspects of data quality are:

**Accuracy**

Accuracy refers to the correctness of the data and the degree to which it is free from errors.

Some ways to ensure accuracy:

Data validation and checks

Data cleaning to correct or remove incorrect values

Using trusted and reliable sources

**Completeness**

Completeness means the data contains all relevant information and attributes required for analysis.

Some ways to improve completeness:

Identify missing data and fields

Impute missing values using means, medians or other techniques

Collect additional data to fill gaps

**Consistency**

Consistency means the data uses the same definitions, formats, codes and units of measurement.

Some ways to ensure consistency:

Define standards and rules for data collection

Standardize data entry forms and templates

Perform data integration and consolidation

**Timeliness**

Timeliness refers to how recent or up-to-date the data is. Stale data loses its value.

Some ways to maintain timeliness:

Set time limits for data refresh and update

Collect data in real time through sensors, apps, etc.

Archive old data that is no longer relevant

## Data Preprocessing

Data preprocessing refers to preparing raw data for data mining and machine learning algorithms. It involves cleaning, transforming and reducing data.

**Data cleaning**

Data cleaning involves:

Removing noise and outliers

Handling missing data

Correcting inconsistencies

Resolving duplicates

Techniques:

Imputation for missing values

Smoothing for noisy data

Winsorization to cap outliers

**Removal of noise and outliers:**

Noise and outliers refer to data points that deviate from the main data distribution. They need to be identified and removed during data cleaning.

**Noise:**

Noise refers to data points that are erroneous or irrelevant. Common types of noise are:

• Typographical errors - Due to mistakes while entering data • Measurement errors - Due to faulty sensors or equipment • Subjective biases - Due to personal judgements while labelling data

Noise can be removed using:

• Thresholding - Removing points outside a certain range

• Clustering - Isolating noisy points into a separate cluster • Domain knowledge - Removing points that are known to be incorrect

**Outliers:**

Outliers are data points that are very different from the rest of the data. They can be:

Extreme values - Data points that are unusually high or low

Isolated points - Data points that are far from the main distribution

Outliers can be identified using:

Standard deviation - Points more than 2-3 standard deviations away

Interquartile range - Points above 1.5 times the interquartile range

Distance-based - Points far from the mean or centroid of clusters

Once identified, outliers can be removed completely or capped to a threshold.

Some pros and cons of removing outliers:

Pros - Reduce noise, improve model accuracy

Cons - Can remove potentially useful information, distort data distribution

**Data integration**

Data integration combines data from multiple sources by:

Resolving schema differences

Matching common attributes

Eliminating duplicates

Techniques:

Schema mapping

Data fusion

Record linkage

**Merging multiple data sources:**

Merging data from multiple sources is an important task in data integration during data preprocessing. It involves combining relevant information from different datasets.

The main steps in merging multiple data sources are:

Identify relevant sources: Select data sources that contain complementary information about the same entities.

Assess data quality: Check for inconsistencies, missing values, outliers and noise in each data source. Perform necessary data cleaning.

Match records: Identify records that refer to the same real-world entity across different data sources. This is done using record linkage techniques.

Resolve conflicts: Handle inconsistent information about the same entity from different sources. Various conflict resolution techniques can be used.

Merge data: Combine the relevant information from matched records into a single integrated record. Redundant information is removed.

Evaluate quality: Evaluate the quality of the merged dataset using metrics like completeness, consistency and accuracy.

Some techniques used for merging data are:

• Record linkage - Compare attributes like name, ID, etc. to match records

• Data fusion - Combine information from matched records

• Ensemble methods - Combine predictions from multiple models to get a better result

**Data transformation**

Data transformation changes the format and scale of data:

Normalization scales numeric attributes

Discretization converts continuous attributes to discrete intervals

Feature construction generates new attributes from existing ones

Techniques:

Min-max normalization

Z-score normalization

Binning

**Normalization:**

Normalization is a process of transforming attributes in the dataset to make them fall within a small specified range, typically 0 to 1.

It is an important data transformation technique used during data preprocessing.

**Min-max normalization**

The original values are scaled to fall between a minimum and maximum value (typically 0 to 1):

x' = (x - min(x)) / (max(x) - min(x))

where x' is the normalized value and x is the original value.

Min-max normalization preserves all relationships in the original data.

**Z-score normalization**

The original values are scaled to have a mean of 0 and a standard deviation of 1:

x' = (x - mean(x)) / standard deviation(x)

Z-score normalization preserves relative differences between values and makes them comparable.

**Decimal scaling**

The original values are divided by a scaling factor (usually a power of 10) to reduce their range.

**Other techniques:**

Tanh normalization

Sigmoid normalization

The benefits of normalization include:

Improved model accuracy

Faster convergence for optimization algorithms

Ease of comparison between attributes

**Data Reduction**

Data reduction reduces data dimensionality by:

Removing irrelevant or redundant attributes

Grouping highly correlated attributes

Techniques:

Principal component analysis (PCA)

Linear discriminant analysis (LDA)

Feature selection methods

**Dimensionality reduction techniques:**

Dimensionality reduction refers to the process of reducing the number of random variables under consideration by obtaining a set of principal variables that contain most of the information in the original high-dimensional data.

Common dimensionality reduction techniques are:

**Feature selection**

Involves selecting a subset of relevant features that contain the most information.

Techniques:

Filter methods - Select features based on metrics like correlation, information gain, etc.

Wrapper methods - Use machine learning to evaluate feature subsets.

Embedded methods - Perform feature selection as part of model training.

**Feature extraction**

Involves transforming the features into a new set of uncorrelated variables called principal components.

Techniques:

Principal Component Analysis (PCA) - Finds orthogonal principal components that capture the most variance in data.

Linear Discriminant Analysis (LDA) - Finds components that best discriminate different classes.

Multidimensional Scaling (MDS) - Preserves the pairwise distances between data points.

Other techniques:

Autoencoders

T-distributed Stochastic Neighbor Embedding (t-SNE)

## Similarity Measures

Similarity measures are used to quantify how similar or different two data objects are. They play an important role in data mining tasks like clustering, classification and recommendation systems.

There are two main types of similarity measures:

Distance measures

Correlation measures

**Distance measures**

Distance measures quantify similarity as the inverse of distance between two data objects. Larger distances imply lower similarity and vice versa.

Common distance measures are:

**Euclidean distance**

It is the ordinary distance between two data points measured using the Pythagorean Theorem. Given two n-dimensional data points x and y, the Euclidean distance is calculated as:

d(x, y) = √∑(xi - yi)2

for i = 1 to n

Where n is the number of attributes.

**Manhattan distance**

It is the sum of the absolute differences of corresponding attributes. Given two n-dimensional data points x and y, the Manhattan distance is calculated as:

d(x, y) = ∑|xi - yi|

for i = 1 to n

Manhattan distance is faster to compute compared to Euclidean distance.

**Correlation measures**

Correlation measures quantify similarity as the strength of association or correlation between two data objects. A higher positive correlation implies higher similarity.

Common correlation measures are:

**Pearson correlation**

It measures the linear correlation between two variables X and Y. The Pearson correlation coefficient r is calculated as:

r = ∑(xi - x̄)(yi - ȳ)/√∑(xi - x̄)2 ∑(yi - ȳ)2

**Spearman correlation**

It measures the monotonic correlation between two variables. It is calculated by first ranking the values of both variables and then calculating the Pearson correlation coefficient of the ranks.

## Summary Statistics

Summary statistics are used to summarize the main characteristics of a data set in a simple and concise form. They help provide an overview of the data distribution and identify any outliers or anomalies.

Common summary statistics used in data mining are:

**Mean, Median and Mode**

**Mean**

The mean or average is the sum of all values divided by the number of values. It is calculated as:

Mean = (x1 + x2 + ... + xn) / n

The mean is highly influenced by outliers.

**Median**

The median is the middle value of an ordered list of values. It is the value that has an equal number of values above and below it.

The median is more robust to outliers compared to the mean.

**Mode**

The mode is the most frequent value in the data set. There may be multiple modes for a data set.

The mode provides information about the shape of the distribution.

**Variance and Standard Deviation**

**Variance**

The variance is the average of the squared differences from the mean. It measures how spread out the values are. It is calculated as:

Variance = Σ(xi - Mean)2/n

where n is the number of values.

**Standard Deviation**

The standard deviation is the square root of the variance. It represents the typical distance between data points and the mean.

**Percentiles**

Percentiles divide the data set into 100 equal parts. The nth percentile is the value below which n% of the observations fall.

For example, the 75th percentile is the value for which 75% of the observations have a lower value and 25% have a higher value.

Percentiles provide a simple summary of the data distribution.

## Data Distributions

Data distributions refer to the patterns in which data values are distributed in a data set. Understanding the distribution of data is important for data mining tasks like regression, clustering and outlier detection.

Common data distributions are:

**Normal distribution**

Also known as Gaussian distribution.

It is a continuous probability distribution that is symmetric and bell-shaped. It is characterized by two parameters:

Mean (μ): The average of the distribution

Standard deviation (σ): How spread out the values are

The normal distribution is used widely as it approximates many natural phenomena.

It is represented by the probability density function:

f(x) = (1/(σ√(2π))) e^(-((x-μ)/2σ)^2)

Where:

μ is the mean or expected value

σ is the standard deviation

e is the base of the natural logarithm (approximately 2.71828)

**Binomial distribution**

It describes the number of successes in a fixed number of independent yes/no experiments, each of which has a constant probability of success.

It is characterized by two parameters:

n: Number of experiments

p: Probability of success in a single experiment

The binomial distribution is used when modelling the number of successes in a sample.

The probability mass function is given by:

P(k) = (n!/(x!(n-x)!))p^x (1-p)^(n-x)

Where:

n is the number of trials

x is the number of successes

p is the probability of success in a single trial

**Poisson distribution**

It describes the number of events occurring in a fixed interval of time and/or space when these events happen with a known average rate and independently of the time since the last event.

It is characterized by one parameter:

- λ: The expected number of occurrences during the interval.

The Poisson distribution is used when modelling the number of rare events occurring in an interval.

The probability mass function is given by:

P(k) = e^(-λ) λ^x /x!

Where:

λ is the mean number of occurrences during the interval

x is the actual number of occurrences

## Basic Data Mining Tasks

The main data mining tasks are:

**Classification**

Classification is the task of assigning data points to predefined categories or classes.

It involves building a model based on labelled training data where the class labels are known. The model can then be used to predict the class labels of new, unlabeled data.

Examples:

Predicting if an email is spam or not spam

Detecting fraudulent credit card transactions

Algorithms: Decision trees, Naive Bayes, KNN, SVM, Neural networks, etc.

**Clustering**

Clustering is the task of grouping unlabeled data points into meaningful clusters such that data points within a cluster are similar and different clusters are dissimilar.

The number of clusters is not predefined. The algorithm determines the clusters automatically.

Examples:

Grouping customers based on purchasing behaviour

Categorizing web pages

Algorithms: K-means, Hierarchical clustering, DBSCAN, etc.

**Association Rule Mining**

It finds interesting relationships or associations among variables in a large data set.

It analyzes the frequency of occurrence of items together and derives rules that specify which items tend to co-occur.

Examples:

Customers who buy bread and butter also tend to buy jam.

People who view product A also tend to view product B.

Algorithms: Apriori, Eclat

**Anomaly Detection**

It identifies rare items, events or observations that raise suspicions by deviating from normal behaviour.

It involves modelling normal behaviour and then looking for significant deviations from that behaviour.

Examples:

Detecting credit card fraud

Identifying system performance issues

Algorithms: Density-based methods, Clustering-based methods, Statistical methods

## Data Mining vs KDD

**Data Mining**

Data mining refers specifically to the algorithmic processes involved in extracting patterns from data.

It involves applying machine learning, statistical and visualization techniques to uncover hidden patterns.

Data mining focuses on the algorithms and techniques for discovering patterns.

**Knowledge Discovery in Databases (KDD)**

KDD refers to the overall process of discovering useful knowledge from data.

It includes data cleaning, data integration, data selection, data mining and interpretation of the discovered patterns.

KDD involves more data preparation and preprocessing steps before applying data mining algorithms.

Data Mining | Knowledge Discovery in Databases |

Focuses on the algorithmic part of pattern extraction from data | Focuses on the entire process of discovering useful knowledge from data |

Involves less human intervention | Involves human decisions at various stages |

Works with ready-to-mine data | Includes data preparation and preprocessing steps |

Covers pattern extraction | Covers data understanding, cleansing and integration |

Discovery of patterns | Evaluation and interpretation of patterns to create knowledge |

Uses machine learning and statistical techniques | Uses machine learning, statistics, visualization and database techniques |

## Issues in Data Mining

Data mining promises many benefits, but there are also some issues and challenges that need to be addressed for effective data mining.

**Data Quality Issues**

Data quality issues arise due to incomplete, noisy or irrelevant data. This affects mining accuracy and results.

**Incomplete or Missing Data**

Data may be missing values for certain attributes due to errors in data collection or storage.

This can lead to inaccurate or incomplete patterns during mining since the full picture is not available.

Techniques like imputation can be used to fill in missing values but introduce some inaccuracy.

**Noisy or Inconsistent Data**

Real-world data often contains errors, outliers and inconsistencies that reduce its quality.

Noisy data makes it difficult for mining algorithms to identify true patterns and relationships.

Data cleaning and filtering techniques can improve data quality but are not perfect.

**Irrelevant Attributes**

Datasets often contain attributes that are not relevant to the mining task.

Irrelevant attributes introduce "noise" that reduces the effectiveness of mining algorithms.

Feature selection techniques can identify and remove irrelevant attributes but require domain knowledge.

**Ethical Issues**

Data mining raises some ethical concerns regarding privacy, security and proper use of results.

**Privacy and Security**

Data mining can reveal sensitive information about individuals present in the data.

Proper security measures and access controls are needed to protect personal data during and after mining.

**Misuse of Data**

Data mining results could potentially be misused for fraudulent purposes such as target marketing scams.

There are also concerns about possible discrimination based on patterns found in the data.

**Lack of Transparency**

The complex algorithms used in data mining are often seen as a "black box" lacking transparency.

It is difficult for users to verify and audit the mining process and results.

**Scalability Issues**

Large and complex datasets pose scalability challenges for data mining algorithms.

**High Dimensionality**

Data with many attributes (high dimensionality) can reduce mining performance.

Mining algorithms have difficulty identifying meaningful patterns in high-dimensional data.

Dimensionality reduction techniques can help address the "curse of dimensionality".

**Large Databases**

Mining very large databases can be computationally intensive and time-consuming.

Traditional algorithms may need to be scaled using techniques like sampling, clustering and grid computing.

**Evaluation Issues**

Issues related to correctly evaluating the quality of patterns found during mining.

**Overfitting**

Models that are too tuned to the training data may not generalize well to new, unseen data.

This is known as overfitting and results in poor performance on new data.

**Subjectivity of Patterns**

The patterns found depend on the selected mining algorithm and its parameters.

Different algorithms may discover different patterns in the same data.

**Interpreting Results**

Users may find it difficult to understand and correctly interpret the discovered patterns.

This is due to the complexity of patterns and lack of context about the mining process.

## Introduction to Fuzzy Sets and Fuzzy Logic

**Fuzzy sets**

Fuzzy sets are a generalization of ordinary sets where elements can have partial membership.

In ordinary sets, elements either belong (membership = 1) or do not belong (membership = 0).

In fuzzy sets, elements have a membership value between 0 and 1 indicating the degree of belonging.

Fuzzy sets are useful for representing imprecise, vague or uncertain concepts.

**Example**

Consider the set of "tall people". In an ordinary set, a person either belongs (is tall) or does not belong (is not tall).

In a fuzzy set, we can represent degrees of tallness using a membership value between 0 and 1:

A person of height 180 cm may have a tallness membership value of 0.9

A person of height 160 cm may have a tallness membership value of 0.3

A person of height 140 cm may have a tallness membership value of 0.1

**Membership functions**

Membership functions are used to represent the degree of membership of elements in a fuzzy set.

Common types of membership functions are:

Triangular

Trapezoidal

Gaussian (Bell-shaped)

Sigmoidal

The x-axis represents the input variable and the y-axis represents the membership value from 0 to 1.

Membership functions are used to map crisp input values to fuzzy membership values.

**Fuzzy Logic Operations**

Basic fuzzy logic operations are used to combine fuzzy sets.

Common operations are:

Union (OR): Takes the maximum membership value.

Intersection (AND): Takes the minimum membership value.

Complement (NOT): Subtracts the membership value from 1.

These operations allow us to perform fuzzy inference and reasoning.