CS412: Note for Midterm1

Note: this post was written 856 days ago and last modified 612 days ago; some of the information may be out of date.

Concept

data mining function:

generalization
pattern discovery
classification
Cluster Analysis
Outlier Analysis
Time and Ordering: Sequential Pattern, Trend, and Evolution Analysis
Structure and Network Analysis: graph mining

Data

characteristics of structured data:

Dimensionality
1. Curse of dimensionality
Sparsity
1. Only presence counts
Resolution
1. Patterns depend on the scale
Distribution
1. Centrality and dispersion

Data sets are made up of data objects
Data objects are described by attributes

attribute:

dimensions, features, variables

type:

Nominal: auburn, black, blond, brown, grey, red, white
Binary: 0/1
ordinal: small, medium, large
Interval: measured on a scale of equal-sized units, with no true zero point, e.g. temperature in °C or °F
Ratio: has an inherent zero point, e.g. temperature in Kelvin, counts

type 2:

Discrete Attribute: takes a finite or countably infinite set of values, e.g. zip code, profession
Continuous Attribute: takes real numbers as attribute values, e.g. height, weight

statistical measurement:

mean: $\bar x=\frac1n\sum^n_{i=1}x_i$ or $\mu=\frac1N\sum x$

weighted mean: $\bar x=\frac{\sum^n_{i=1}w_ix_i}{\sum^n_{i=1}w_i}$

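median (approx, for data grouped into intervals): $L_1 + \left(\frac{n/2-\sum{freq}_l}{freq_{median}}\right)\times width$

$n$: total number of values
$\sum{freq}_l$: sum of the frequencies of all intervals below the median interval
$freq_{median}$: frequency of the median interval
$width$: width of the median interval, $L_2 -L_1$
$L_1$: lower boundary of the median interval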

mode: Value that occurs most frequently in the data
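
The four measures above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the `(low, high)` interval representation are assumptions made for the example.

```python
from collections import Counter

def mean(xs):
    """Arithmetic mean: (1/n) * sum of x_i."""
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

def grouped_median(intervals, freqs):
    """Approximate median for grouped data.

    intervals: sorted list of (low, high) pairs (hypothetical representation)
    freqs:     number of values falling in each interval
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("no data")
    below = 0  # sum of frequencies below the current interval
    for (low, high), f in zip(intervals, freqs):
        if below + f >= n / 2:  # this is the median interval
            return low + (n / 2 - below) / f * (high - low)
        below += f

def modes(xs):
    """All values with the highest frequency (data may be multimodal)."""
    counts = Counter(xs)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]
```

For example, with intervals 0-10, 10-20, 20-30 and frequencies 2, 5, 3, the median interval is 10-20 and the estimate is 10 + (5 - 2) / 5 × 10 = 16.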

data matrix:

A data matrix of $n$ data points with $l$ dimensions has shape $n\times l$
Dissimilarity (distance) matrix: $n\times n$ and symmetric with a zero diagonal, so it is usually stored as a triangular matrix

standardizing:

z-score: $z=\frac{x-\mu}{\sigma}$; the mean absolute deviation can be used in place of $\sigma$, which is more robust to outliers
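
A small sketch of both variants, assuming the population standard deviation for $\sigma$ and the mean absolute deviation $s=\frac1n\sum|x_i-\mu|$ as the alternative; the function names are made up for illustration.

```python
import math

def z_scores(xs):
    """Standardize with the population standard deviation."""
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return [(x - mu) / sigma for x in xs]

def mad_scores(xs):
    """Standardize with the mean absolute deviation instead of sigma."""
    mu = sum(xs) / len(xs)
    s = sum(abs(x - mu) for x in xs) / len(xs)
    return [(x - mu) / s for x in xs]
```

For `[2, 4, 4, 4, 5, 5, 7, 9]` the mean is 5, $\sigma$ is 2, and the mean absolute deviation is 1.5, so the value 9 gets a z-score of 2.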

distance:

Minkowski distance (L-p norm):

properties:
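
For reference, the standard definition of the Minkowski distance of order $p$ between two $l$-dimensional points is $d(x,y)=\left(\sum_{f=1}^{l}|x_f-y_f|^p\right)^{1/p}$, with $p=1$ giving Manhattan distance, $p=2$ Euclidean distance, and $p\to\infty$ the supremum distance. The usual metric properties are non-negativity, symmetry, and the triangle inequality. The sketch below is one possible implementation; the function name `minkowski` is an assumption for illustration.

```python
import math

def minkowski(x, y, p):
    """Minkowski (L-p) distance; p = math.inf gives the supremum distance."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimensionality")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)

a, b, c = (1, 2), (3, 5), (2, 0)
manhattan = minkowski(a, b, 1)         # |1-3| + |2-5|
euclidean = minkowski(a, b, 2)         # sqrt(2^2 + 3^2)
supremum = minkowski(a, b, math.inf)   # max(2, 3)
```

On these points the three distances are 5, $\sqrt{13}$, and 3, and each one is symmetric and satisfies the triangle inequality.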

Model

type:

unimodal:

Empirical formula (for moderately skewed unimodal data): $mean-mode \approx 3\times (mean-median)$

multimodal:

includes bimodal, trimodal, etc., depending on the number of peaks

distribution:

symmetric
skewed: either positively skewed (long tail to the right, typically mean > median > mode) or negatively skewed (long tail to the left, typically mean < median < mode)

normal distribution curve

measurement:

Variance ($s^2 \text{ or } \sigma^2$) and standard deviation ($s \text{ or } \sigma$) are used to measure the dispersion (spread) of the data

$$ s^2=\frac{1}{n-1}\sum^n_{i=1}(x_i-\bar x)^2 $$

$$ \sigma^2=\frac1N\sum^N_{i=1}(x_i-\mu)^2 $$

n: sample size, N: population size
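
The only difference between the two formulas is the divisor ($n-1$ versus $N$). A minimal sketch follows; the function names are illustrative, and Python's standard `statistics` module offers the same two quantities as `statistics.variance` and `statistics.pvariance`.

```python
import math
import statistics

def sample_variance(xs):
    """s^2: divide by n - 1 (data is a sample)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def population_variance(xs):
    """sigma^2: divide by N (data is the whole population)."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n

data = [2, 4, 4, 4, 5, 5, 7, 9]
s2 = sample_variance(data)          # 32 / 7
sigma2 = population_variance(data)  # 32 / 8
sigma = math.sqrt(sigma2)           # standard deviation
```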

Graph

Boxplot: graphic display of the five-number summary
Histogram: the x-axis shows values (bins), the y-axis shows frequencies
Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $100 f_i\%$ of the data are $\leq x_i$
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Scatter plot: each pair of values is treated as a pair of coordinates and plotted as a point in the plane

box plot:

Quartiles: Q1 (25th percentile), Q3 (75th percentile)

IQR: Q3 - Q1

Five-number summary: min, Q1, median, Q3, max
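
A short sketch of computing these quantities with Python's standard library. Quartile conventions differ between textbooks and tools; this example assumes the `"inclusive"` interpolation method of `statistics.quantiles`, and the function names are made up for illustration.

```python
import statistics

def five_number_summary(xs):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(xs)
    # n=4 cuts the data into quartiles and returns the three cut points.
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, med, q3, max(data)

def iqr(xs):
    """Interquartile range: Q3 - Q1."""
    _, q1, _, q3, _ = five_number_summary(xs)
    return q3 - q1

summary = five_number_summary([9, 1, 8, 2, 7, 3, 6, 4, 5])
spread = iqr([9, 1, 8, 2, 7, 3, 6, 4, 5])
```

For the values 1 through 9 this gives Q1 = 3, median = 5, Q3 = 7, and an IQR of 4.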

Histogram:

Graph display of tabulated frequencies, shown as bars

Difference between histograms and bar charts: a histogram shows the distribution of a single variable, while a bar chart compares values across categories
Histograms often tell more than boxplots: different histograms may have the same boxplot representation

Correlation

cosine Similarity:
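
The standard definition is $\cos(x,y)=\frac{x\cdot y}{\|x\|\,\|y\|}$, i.e. the cosine of the angle between the two vectors. A minimal sketch, with the function name and the two term-frequency vectors chosen only for illustration:

```python
import math

def cosine_similarity(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

# Hypothetical term-frequency vectors of two documents.
d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
sim = cosine_similarity(d1, d2)  # 25 / (sqrt(42) * sqrt(17))
```

Here the dot product is 25 and the norms are $\sqrt{42}$ and $\sqrt{17}$, so the similarity is roughly 0.94.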

chi-square test:

The larger the $\chi^2$ value, the more likely the variables are related
Null hypothesis: the two variables are independent
Correlation does not imply causality
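
The statistic is the standard Pearson form $\chi^2=\sum\frac{(observed-expected)^2}{expected}$, where the expected count of a cell under independence is (row total × column total) / $n$. A minimal sketch on a contingency table; the function name and the counts are hypothetical.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if the two variables were independent.
            e = row_totals[i] * col_totals[j] / n
            stat += (o - e) ** 2 / e
    return stat

# Hypothetical 2x2 table of counts.
table = [[250, 200],
         [50, 1000]]
stat = chi_square(table)
```

For this table the expected counts are 90, 360, 210, and 840, which gives a statistic of about 507.9. That is far above the usual critical values for one degree of freedom, so the independence hypothesis would be rejected.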

variance:

variance for single variable: $E((X-\mu)^2)$

covariance for two variable: $E((X_1-\mu_1)(X_2-\mu_2))=E(X_1X_2)-\mu_1\mu_2=E[X_1X_2]-E[X_1]E[X_2]$

the sign of the covariance indicates the direction of the relationship
if $X_1$ and $X_2$ are independent, then $\sigma_{12}=0$, but the converse is not true: zero covariance does not imply independence
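
A standard counterexample for the converse, sketched with exact fractions: take $X$ uniform on $\{-1, 0, 1\}$ and $Y=X^2$. $Y$ is fully determined by $X$, yet the covariance is zero. The variable names are illustrative.

```python
from fractions import Fraction

values = [-1, 0, 1]
p = Fraction(1, 3)  # X is uniform on {-1, 0, 1}; Y = X^2

e_x = sum(p * x for x in values)             # E[X]
e_y = sum(p * x * x for x in values)         # E[Y] = E[X^2]
e_xy = sum(p * x * (x * x) for x in values)  # E[XY] = E[X^3]
cov = e_xy - e_x * e_y                       # E[XY] - E[X]E[Y]

# Dependence check: P(X=0, Y=0) versus P(X=0) * P(Y=0).
p_joint = p            # Y = 0 exactly when X = 0
p_product = p * p      # P(X=0) = 1/3 and P(Y=0) = 1/3
```

The covariance is exactly 0, while the joint probability 1/3 differs from the product 1/9, so the variables are dependent.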

correlation:

if $\rho_{12}>0$, the variables are positively correlated; if $\rho_{12}=0$, uncorrelated; if $\rho_{12}<0$, negatively correlated

$$ \rho_{12}=\frac{\sigma_{12}}{\sqrt{\sigma_1^2\sigma_2^2}} $$
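
A minimal sketch of the covariance and correlation formulas applied to paired observations, assuming population-style averaging (divide by the number of pairs); the function names are made up for illustration.

```python
import math

def covariance(xs, ys):
    """sigma_12 = E[(X1 - mu1)(X2 - mu2)], averaged over the pairs."""
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    return sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """rho_12 = sigma_12 / sqrt(sigma_1^2 * sigma_2^2)."""
    var_x = covariance(xs, xs)  # the variance is the covariance with itself
    var_y = covariance(ys, ys)
    return covariance(xs, ys) / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4]
rho_up = correlation(xs, [2, 4, 6, 8])    # perfectly increasing relation
rho_down = correlation(xs, [8, 6, 4, 2])  # perfectly decreasing relation
```

Dividing by the standard deviations removes the units, so $\rho_{12}$ always lies between -1 and 1.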

Kullback Leibler (KL) divergence:

Measure the difference between two probability distributions over the same variable x

$$ D_{KL}(p(x)||q(x))=\sum_{x\in X}p(x)\ln\frac{p(x)}{q(x)} $$

$$ D_{KL}(p(x)||q(x))=\int_{-\infty}^{\infty}p(x)\ln\frac{p(x)}{q(x)}\,dx $$

when $p \not=0$ but $q=0$, $D_{KL}$ is defined as $\infty$, because $p$ says the event is possible while $q$ says it is impossible
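
A sketch of the discrete form, assuming the usual convention that terms with $p(x)=0$ contribute 0; the function name is illustrative. It also shows that the divergence is not symmetric.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions given as aligned lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue         # convention: 0 * ln(0 / q) = 0
        if qi == 0:
            return math.inf  # p says possible, q says impossible
        total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_divergence(p, q)  # 0.5*ln(0.5/0.9) + 0.5*ln(0.5/0.1)
d_qp = kl_divergence(q, p)  # differs from d_pq: KL is not symmetric
```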

Data cleaning

dirty data:

Incomplete: Salary = ""
Noisy: Salary = "-10" (an error)
Inconsistent: Age = "42", Birthday = "03/07/2022"
Intentional (disguised missing data): Jan. 1 as everyone's birthday?

data Integration:

Combining data from multiple sources into a coherent store

Shell Fragment Cubes

a way to handle high-dimensional data cubes

space requirement:

Given a database of T tuples, D dimensions, and F shell fragment size, the fragment cubes’ space requirement is:
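
The bound usually quoted for this setting is $O\left(T\left\lceil\frac{D}{F}\right\rceil(2^F-1)\right)$: there are $\lceil D/F\rceil$ fragments, each fragment materializes $2^F-1$ cuboids, and each cuboid stores tuple-ID lists covering up to $T$ tuples. The sketch below takes that bound as an assumption and evaluates it for a few fragment sizes; the function name is hypothetical.

```python
import math

def fragment_space_bound(T, D, F):
    """Evaluate T * ceil(D / F) * (2**F - 1), the assumed space bound."""
    if F < 1 or D < 1:
        raise ValueError("D and F must be positive")
    return T * math.ceil(D / F) * (2 ** F - 1)

# Hypothetical database: one million tuples, 60 dimensions.
bounds = {F: fragment_space_bound(10**6, 60, F) for F in (1, 2, 3, 4)}
```

With $F=3$ this gives $10^6 \times 20 \times 7 = 1.4\times10^8$ entries. The $2^F-1$ factor grows exponentially, which is why the fragment size is kept small.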

query:

Download Note

Attachment: CS341 Mid1.pdf (235.1 KB), last modified 2023-03-29 22:52

CC BY-ND

This license enables reusers to copy and distribute the material in any medium or format in unadapted form only, and only so long as attribution is given to the creator. The license allows for commercial use.
