Dataset
Types of data:
- Record data: table
- Graphs and networks: relation
- Ordered data: time sequence, DNA, video stream
- Spatial, image and multimedia data: image, video, map
Characteristics:
- dimensionality
- sparsity
- resolution
- distribution
data object: represents an entity in the dataset
Attribute
or dimension, features, variables
- Nominal (e.g., red, blue)
- Binary (e.g., {true, false})
- Ordinal (e.g., {freshman, sophomore, junior, senior})
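- Numeric: quantitative
  - Interval: measured on a scale of equal-sized units, with no true zero (e.g., temperature in °C)
  - Ratio: has an inherent zero-point, so ratios are meaningful (e.g., length, count)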
Types
discrete attribute: takes a finite or countably infinite set of values
continuous attribute: takes real-number values, so the set of possible values is uncountably infinite
Measuring
Median
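approximate median (for data grouped into intervals):
sym | note |
---|---|
$n$ | total number of samples |
$L_1$ | lower limit of the median interval |
$width$ | width of the median interval ($L_2 - L_1$) |
$(\sum freq)_l$ | sum of the frequencies of all intervals below the median interval |
$freq_{median}$ | frequency of the median interval |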
$$ median = L_1 + (\frac{\frac n2-(\sum freq )_l}{freq_{median}})\cdot width $$
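The interpolation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `approximate_median` and the `(lower, upper, freq)` tuple layout are assumptions made for the example, and the intervals are assumed to be sorted and contiguous.

```python
def approximate_median(intervals):
    """Approximate the median of grouped data.

    intervals: list of (lower, upper, freq) tuples, sorted by lower bound.
    (Hypothetical layout chosen for this sketch.)
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("intervals must contain at least one observation")

    cumulative = 0  # (sum freq)_l: total frequency below the current interval
    for lower, upper, freq in intervals:
        # The median interval is the first one whose cumulative
        # frequency reaches n / 2.
        if cumulative + freq >= n / 2:
            width = upper - lower
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq


# n = 20, so n / 2 = 10 falls in the interval [10, 20):
# median = 10 + (10 - 5) / 10 * 10
print(approximate_median([(0, 10, 5), (10, 20, 10), (20, 30, 5)]))  # 15.0
```

The result is only an estimate: it assumes the values are spread evenly inside the median interval.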
Mode
the value that occurs most frequently in the data
Empirical formula:
holds only approximately, and ONLY for unimodal, moderately skewed data
$$ mean - mode \approx 3\cdot(mean - median) $$
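Rearranging the usual form of the empirical relation, mean - mode ≈ 3 (mean - median), gives mode ≈ 3·median - 2·mean. The sketch below applies that rearrangement; `estimate_mode` is a hypothetical helper name, and the result is a rough estimate that can differ noticeably from the true mode on small samples.

```python
from statistics import mean, median


def estimate_mode(values):
    """Estimate the mode from the mean and median (empirical relation).

    Only reasonable for unimodal, moderately skewed data.
    """
    m = mean(values)
    med = median(values)
    # mean - mode ~= 3 * (mean - median)  =>  mode ~= mean - 3 * (mean - median)
    return m - 3 * (m - med)


# mean = 3, median = 2, so the estimate is 3 - 3 * (3 - 2)
print(estimate_mode([1, 2, 2, 3, 7]))
```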
Variance&Std
There are two versions of variance:
-> sample variance, n: the size of the sample
$$ s^2 = \frac{1}{n-1}\sum(x_i - \bar{x})^2 $$
-> population variance, N: the size of the population
$$ \sigma^2 = \frac{1}{N}\sum(x_i - \mu)^2 $$
The standard deviation is the square root of the variance, denoted $s$ (sample) or $\sigma$ (population)
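To make the difference between the two divisors concrete, here is a small sketch that computes both variances by hand. The function names and the sample data are chosen for illustration only; Python's standard `statistics` module offers `variance` (divides by n - 1) and `pvariance` (divides by N) for the same purpose.

```python
from math import sqrt


def sample_variance(xs):
    """s^2: divide the squared deviations by n - 1 (needs n >= 2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def population_variance(xs):
    """sigma^2: divide the squared deviations by N."""
    big_n = len(xs)
    mu = sum(xs) / big_n
    return sum((x - mu) ** 2 for x in xs) / big_n


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, squared deviations sum to 32

print(population_variance(data))        # 32 / 8
print(sqrt(population_variance(data)))  # population standard deviation
print(sample_variance(data))            # 32 / 7
print(sqrt(sample_variance(data)))      # sample standard deviation
```

The sample version is always slightly larger than the population version for the same data, because it divides by n - 1 instead of N.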
Plot
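Histogram: graphical display of tabulated frequencies, shown as bars
Boxplot: data is summarized by a box spanning the first to the third quartile (Q1 to Q3), with a line at the median and whiskers extending toward the minimum and maximum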
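The numbers a boxplot draws can be computed with the standard library. The sketch below is illustrative: `five_number_summary` is a hypothetical helper, and `statistics.quantiles` uses its default "exclusive" method, which is only one of several common quartile conventions, so other tools may report slightly different Q1 and Q3.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum); needs at least two values."""
    data = sorted(values)
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    q1, med, q3 = quantiles(data, n=4)
    return data[0], q1, med, q3, data[-1]


print(five_number_summary([7, 1, 9, 3, 5, 8, 2, 6, 4]))
```

The box is drawn from Q1 to Q3, the line inside it sits at the median, and the whiskers reach toward the minimum and maximum.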
Differences between bar chart and histogram:
- Histograms are used to show distributions of variables while bar charts are used to compare variables
- Histograms plot binned quantitative data while bar charts plot categorical data
- Bars can be reordered in bar charts but not in histograms
- In a histogram it is the area of the bar that denotes the value, not the height as in a bar chart, a crucial distinction when the bins are not of uniform width
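The area-versus-height point can be illustrated with unequal bins: if the bar height is set to count / width, the area of each bar equals its count. This is a minimal sketch with a hypothetical helper `histogram_bars`; it uses half-open bins and ignores values that fall outside every bin.

```python
def histogram_bars(values, edges):
    """Return (counts, heights) for the bins [edges[i], edges[i+1])."""
    bins = list(zip(edges, edges[1:]))
    counts = [0] * len(bins)
    for v in values:
        for i, (lo, hi) in enumerate(bins):
            if lo <= v < hi:
                counts[i] += 1
                break
    # Height = count / width, so height * width (the area) equals the count.
    heights = [c / (hi - lo) for c, (lo, hi) in zip(counts, bins)]
    return counts, heights


values = [1, 2, 3, 4, 12, 18]
counts, heights = histogram_bars(values, edges=[0, 5, 20])
print(counts)   # 4 values in [0, 5), 2 values in [5, 20)
print(heights)  # the wide bin gets a much lower bar than its count suggests
```

Plotting the raw counts as heights here would make the wide bin look half as dense as the narrow one, when it is in fact six times less dense.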