Concept
data mining function:
- generalization
- pattern discovery
- classification
- cluster analysis
- outlier analysis
- time and ordering: sequential pattern, trend and evolution analysis
- structure and network analysis: graph mining
Data
characteristic of structured data:
Dimensionality
- Curse of dimensionality
Sparsity
- Only presence counts
Resolution
- Patterns depend on the scale
Distribution
- Centrality and dispersion
- Data sets are made up of data objects
- Data objects are described by attributes
attribute:
- dimensions, features, variables
type:
- Nominal:
auburn, black, blond, brown, grey, red, white
- Binary:
0/1
- ordinal:
small, medium, large
- Interval: Measured on a scale of equal sized units
temperature in Celsius or Fahrenheit
- Ratio: inherent zero-point
temperature in kelvin, count
type 2:
- Discrete Attribute: a finite or countably infinite set of values
zip code, profession
- Continuous Attribute: Has real numbers as attribute values
height, weight
statistical measurement:
mean: $\bar x=\frac1n\sum^n_{i=1}x_i$ or $\mu=\frac1N\sum x$
weighted mean: $\bar x=\frac{\sum^n_{i=1}w_ix_i}{\sum^n_{i=1}w_i}$
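median (approx, for grouped data): $L_1 + \left(\frac{n/2-\sum{freq}_l}{freq_{median}}\right)\times width$
$n$: number of values in the data set
$\sum{freq}_l$: sum of the frequencies of all intervals below the median interval
$freq_{median}$: frequency of the median interval
$width$: width of the median interval, $L_2 - L_1$
$L_1$: lower boundary of the median interval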
mode: Value that occurs most frequently in the data
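The measures above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names and the `(low, high)` interval representation are assumptions made for the example.

```python
from collections import Counter

def mean(xs):
    """Arithmetic mean: (1/n) * sum of x_i."""
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

def grouped_median(intervals, freqs):
    """Approximate median of grouped data.

    intervals: list of (low, high) bounds in increasing order (assumed layout)
    freqs:     frequency of each interval
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("no data")
    below = 0  # sum of frequencies of intervals below the current one
    for (low, high), f in zip(intervals, freqs):
        if below + f >= n / 2:
            # L1 + ((n/2 - sum freq_l) / freq_median) * width
            return low + (n / 2 - below) / f * (high - low)
        below += f

def modes(xs):
    """All values that occur most frequently (data may be multimodal)."""
    counts = Counter(xs)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]
```

For example, with intervals `[(0, 10), (10, 20), (20, 30)]` and frequencies `[2, 5, 3]`, the median interval is `(10, 20)` and the estimate is $10 + \frac{5-2}{5}\times 10 = 16$.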
data matrix:
- A data matrix of n data points with l dimensions has shape $n\times l$
- Dissimilarity (distance) matrix: triangular matrix
standardizing:
- z-score: $z=\frac{x-\mu}{\sigma}$; the mean absolute deviation can be used in place of $\sigma$
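A short sketch of both standardizations, assuming the population standard deviation for $\sigma$ and the mean absolute deviation $\frac1n\sum|x_i-\mu|$ as the alternative denominator. The function names are illustrative only.

```python
import math

def z_scores(xs):
    """Standardize with the mean and (population) standard deviation."""
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return [(x - mu) / sigma for x in xs]

def mad_scores(xs):
    """Standardize with the mean absolute deviation instead of sigma."""
    mu = sum(xs) / len(xs)
    mad = sum(abs(x - mu) for x in xs) / len(xs)
    return [(x - mu) / mad for x in xs]

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, sigma 2, mean absolute deviation 1.5
print(z_scores(data))
print(mad_scores(data))
```

Because the deviations are not squared, the mean absolute deviation is less sensitive to outliers than the standard deviation.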
distance:
- Minkowski distance (L-p norm):
properties:
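The notes leave the formula blank, so the sketch below assumes the standard definition $d(x,y)=\left(\sum_{k}|x_k-y_k|^p\right)^{1/p}$, with $p=1$ giving Manhattan distance, $p=2$ Euclidean distance, and $p\to\infty$ the maximum coordinate difference. For $p\ge 1$ this is a metric: it is non-negative, symmetric, and satisfies the triangle inequality. The function name is illustrative.

```python
def minkowski(x, y, p):
    """Minkowski (L-p) distance between two equal-length points."""
    if p == float("inf"):
        # Limit case: the largest coordinate difference (supremum distance)
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

a, b = (0, 0), (3, 4)
print(minkowski(a, b, 1))             # Manhattan
print(minkowski(a, b, 2))             # Euclidean
print(minkowski(a, b, float("inf")))  # supremum
```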
Model
type:
unimodal:
- Empirical formula (moderately skewed data): $mean-mode \approx 3\times (mean-median)$
multimodal:
- includes bimodal, trimodal, etc., depending on the number of peaks
distribution:
- symmetric
- skewed: positively skewed (long right tail, mean typically above the median) or negatively skewed (long left tail, mean typically below the median)
normal distribution curve
measurement:
Variance ($s^2 \text{ or } \sigma^2$) and standard deviation ($s \text{ or } \sigma$) measure how spread out the data are
$$ s^2=\frac{1}{n-1}\sum^n_{i=1}(x_i-\bar x)^2\\ \sigma^2=\frac1N\sum^N_{i=1}(x_i-\mu)^2 $$
n: sample size, N: population size
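The only difference between the two formulas is the divisor. A small sketch, with illustrative function names, checked against Python's standard `statistics` module (`variance` divides by $n-1$, `pvariance` by $N$):

```python
import statistics

def sample_variance(xs):
    """s^2: divide the squared deviations by n - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def population_variance(xs):
    """sigma^2: divide the squared deviations by N."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(data), statistics.variance(data))
print(population_variance(data), statistics.pvariance(data))
```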
Graph
- Boxplot: graphic display of five number summary
- Histogram: the x-axis shows values, the y-axis shows frequencies
- Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $100 f_i\%$ of the data are $\leq x_i$
- Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
box plot:
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
IQR: Q3 - Q1
Five-number summary: min, Q1, median, Q3, max
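A minimal sketch of the summary a boxplot draws, using `statistics.quantiles` from the standard library (Python 3.8+). The `method="inclusive"` choice is an assumption: several quartile conventions exist and they can give slightly different Q1 and Q3 on small data sets.

```python
import statistics

def five_number_summary(xs):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return min(xs), q1, med, q3, max(xs)

def iqr(xs):
    """Interquartile range: Q3 - Q1."""
    _, q1, _, q3, _ = five_number_summary(xs)
    return q3 - q1

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(five_number_summary(data))
print(iqr(data))
```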
Histogram:
Graph display of tabulated frequencies, shown as bars
- Differences between histograms and bar charts: Histograms are used to show distributions of variables while bar charts are used to compare variables
- Histograms often tell more than boxplots: different histograms may have the same boxplot representation
Correlation
cosine Similarity:
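The notes give no formula here, so this sketch assumes the usual definition $\cos(x,y)=\frac{x\cdot y}{\|x\|\,\|y\|}$: the cosine of the angle between two vectors, which ignores their lengths. The function name is illustrative.

```python
import math

def cosine_similarity(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal
```

Vectors pointing in the same direction score 1 regardless of magnitude, and orthogonal vectors score 0. The function is undefined for an all-zero vector.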
chi-square test:
- The larger the $\chi^2$ value, the more likely the variables are related
- Null hypothesis: the two variables are independent
- Correlation does not imply causality
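A sketch of the statistic for a contingency table, assuming the standard form $\chi^2=\sum\frac{(observed-expected)^2}{expected}$, where the expected count of a cell under independence is (row total × column total) / grand total. The table below is made-up illustrative data, and the function name is an assumption.

```python
def chi_square(observed):
    """Chi-square statistic for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected under independence
            stat += (o - e) ** 2 / e
    return stat

# Hypothetical 2x2 table: rows and columns are two binary attributes
table = [[250, 200],
         [50, 1000]]
print(chi_square(table))
```

The statistic is then compared with a critical value for $(rows-1)\times(columns-1)$ degrees of freedom. A large value rejects independence but says nothing about which variable, if either, causes the other.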
variance:
variance of a single variable: $\sigma^2=E((X-\mu)^2)$
covariance of two variables: $\sigma_{12}=E((X_1-\mu_1)(X_2-\mu_2))=E(X_1X_2)-\mu_1\mu_2$, where $\mu_i=E(X_i)$
- the sign of the covariance indicates the direction of the relationship
- if $X_1$ and $X_2$ are independent, $\sigma_{12}=0$, but the reverse is not true
correlation:
if $\rho_{12}>0$, positively correlated; $\rho_{12}=0$, uncorrelated; $\rho_{12}<0$, negatively correlated
$$ \rho_{12}=\frac{\sigma_{12}}{\sqrt{\sigma_1^2\sigma_2^2}} $$
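A small sketch of both quantities using the population form $E(X_1X_2)-E(X_1)E(X_2)$. The function names are illustrative. The second data set shows why zero covariance does not imply independence: $y=x^2$ is fully determined by $x$, yet the covariance is 0.

```python
import math

def covariance(xs, ys):
    """Population covariance: E[XY] - E[X]E[Y]."""
    n = len(xs)
    e_xy = sum(x * y for x, y in zip(xs, ys)) / n
    return e_xy - (sum(xs) / n) * (sum(ys) / n)

def correlation(xs, ys):
    """Correlation coefficient: covariance over the product of std deviations."""
    var_x = covariance(xs, xs)
    var_y = covariance(ys, ys)
    return covariance(xs, ys) / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]              # ys = 2 * xs, perfectly linear
print(covariance(xs, ys), correlation(xs, ys))

us = [-1, 0, 1]
vs = [u * u for u in us]           # dependent on us, but not linearly
print(covariance(us, vs))
```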
Kullback Leibler (KL) divergence:
Measure the difference between two probability distributions over the same variable x
$$ D_{KL}(p(x)||q(x))=\sum_{x\in X}p(x)\ln\frac{p(x)}{q(x)}\\ D_{KL}(p(x)||q(x))=\int_{-\infty}^{\infty}p(x)\ln\frac{p(x)}{q(x)}\,dx $$
- when $p(x) \neq 0$ but $q(x)=0$, $D_{KL}$ is defined as $\infty$, because $p$ treats the event as possible while $q$ treats it as impossible
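A sketch of the discrete form, with distributions given as equal-length lists of probabilities (an assumed representation). Terms with $p(x)=0$ are skipped, following the usual convention $0\ln 0=0$, and the $q(x)=0$ case returns infinity as described above.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions over the same outcomes."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # convention: 0 * ln(0 / q) = 0
        if qi == 0:
            return math.inf   # p says possible, q says impossible
        total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl_divergence(p, q))
print(kl_divergence(q, p))    # generally a different value
```

The divergence is not symmetric, so it is not a distance metric: swapping $p$ and $q$ generally changes the result.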
Data cleaning
dirty data:
- Incomplete:
Salary = ""
- Noisy:
Salary = "-10" (an error)
- Inconsistent:
Age=“42”, Birthday = “03/07/2022”
- Intentional:
Jan. 1 as everyone’s birthday?
data Integration:
Combining data from multiple sources into a coherent store
Shell Fragment Cubes
an approach to handling high-dimensional data cubes: split the dimensions into small fragments and precompute a cube for each fragment
space requirement:
Given a database of T tuples, D dimensions, and F shell fragment size, the fragment cubes’ space requirement is: $O\left(T\left\lceil\frac{D}{F}\right\rceil(2^F-1)\right)$
query: