Dataset

Types of data:

  • Record Data: table
  • Graphs and Networks: relation
  • Ordered data: time sequence, DNA, video stream
  • Spatial, image, and multimedia data: image, video, map

Characteristics:

  • dimensionality
  • sparsity
  • resolution
  • distribution

Data object: represents an entity in the dataset

Attribute

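Also called a dimension, feature, or variable.
  • Nominal (e.g., red, blue)
  • Binary (e.g., {true, false})
  • Ordinal (e.g., {freshman, sophomore, junior, senior})
  • Numeric: quantitative; either interval-scaled or ratio-scaled
  • Interval-scaled (numeric): measured on a scale of equal-sized units, with no true zero point (e.g., temperature in °C)
  • Ratio-scaled (numeric): has an inherent zero point, so ratios are meaningful (e.g., temperature in Kelvin, counts)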

Types

Discrete attribute: has a finite or countably infinite set of values

Continuous attribute: takes real numbers as values, so infinitely many values are possible within any range

Measuring

Median

approximate median:

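Symbols:

  • $n$: total number of samples
  • $L_1$: lower limit of the median interval
  • $width$: width of the median interval ($L_2 - L_1$)
  • $(\sum freq)_l$: sum of the frequencies of all intervals below the median interval
  • $freq_{median}$: frequency of the median interval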

$$ median = L_1 + (\frac{\frac n2-(\sum freq )_l}{freq_{median}})\cdot width $$
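The formula above can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function name `grouped_median`, the `(lower, upper)` interval representation, and the sample counts are all assumptions made up for illustration.

```python
def grouped_median(intervals, freqs):
    """Approximate the median from grouped (binned) frequency data.

    intervals: list of (lower, upper) bounds, sorted in ascending order
    freqs:     list of counts, one per interval
    """
    n = sum(freqs)
    cumulative = 0  # (sum freq)_l: total count below the current interval
    for (low, high), freq in zip(intervals, freqs):
        # The median interval is the first one where the running count
        # reaches n / 2.
        if freq > 0 and cumulative + freq >= n / 2:
            width = high - low
            return low + (n / 2 - cumulative) / freq * width
        cumulative += freq
    raise ValueError("no data")


# Hypothetical grouped data, made up for illustration.
intervals = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [5, 15, 20, 10]
approx_median = grouped_median(intervals, freqs)
print(approx_median)
```

Here n = 50, so the median interval is 20 to 30 (the running count reaches 25 there), with 20 samples below it and 20 samples inside it.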

Mode

The value that occurs most frequently in the data
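A small sketch of finding the mode by counting. Data can have more than one mode (bimodal, multimodal), so the hypothetical helper `modes` below returns a list; the sample values are made up for illustration.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


unimodal = modes([1, 2, 2, 3, 2, 4])  # one value is most frequent
bimodal = modes([1, 1, 2, 3, 3])      # two values tie for most frequent
print(unimodal, bimodal)
```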

Empirical formula:

Holds only approximately, and only for unimodal data that is moderately skewed:

$$ mean - mode \approx 3\cdot(mean - median) $$
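As a rough sketch, the empirical relation in its commonly cited form, mean - mode ≈ 3 · (mean - median), can be rearranged to estimate the mode as 3 · median - 2 · mean. The helper name `estimate_mode` and the data are assumptions made up for illustration, and the estimate is only a rule of thumb.

```python
import statistics


def estimate_mode(values):
    """Estimate the mode from the mean and the median.

    Rearranges mean - mode ~= 3 * (mean - median) into
    mode ~= 3 * median - 2 * mean.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * median - 2 * mean


# Symmetric, unimodal data: mean, median, and mode coincide.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
symmetric_estimate = estimate_mode(symmetric)

# Skewed data: the estimate is only approximate.
skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
skewed_estimate = estimate_mode(skewed)
print(symmetric_estimate, skewed_estimate)
```

On small or discrete samples the estimate can be noticeably off, so it is best treated as a sanity check rather than a replacement for counting.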

Variance & Std

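There are two versions of the variance:

-> Sample variance, where $n$ is the size of the sample and $\bar{x}$ is the sample mean:

$$ s^2 = \frac{1}{n-1}\sum(x_i - \bar{x})^2 $$

-> Population variance, where $N$ is the size of the population and $\mu$ is the population mean:

$$ \sigma ^2 = \frac{1}{N}\sum(x_i - \mu)^2 $$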

The standard deviation is the square root of the variance, denoted $s$ (sample) or $\sigma$ (population)
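A minimal sketch of both versions, written out by hand and cross-checked against Python's standard `statistics` module (`variance` divides by n - 1, `pvariance` divides by N). The function names and the data are made up for illustration.

```python
import math
import statistics


def sample_variance(xs):
    """s^2: divide the sum of squared deviations by n - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


def population_variance(xs):
    """sigma^2: divide the sum of squared deviations by N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
s2 = sample_variance(data)
sigma2 = population_variance(data)
s = math.sqrt(s2)          # sample standard deviation
sigma = math.sqrt(sigma2)  # population standard deviation

# Cross-check against the standard library.
assert math.isclose(s2, statistics.variance(data))
assert math.isclose(sigma2, statistics.pvariance(data))
print(s2, sigma2, s, sigma)
```

The sample version is always slightly larger than the population version on the same data, because it divides by n - 1 instead of N.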

Plot

Histogram: a graphical display of tabulated frequencies, shown as bars

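Box graph (box plot): the data is summarized by a box that spans the first to the third quartile (Q1 to Q3), with a line at the median and whiskers extending toward the minimum and maximum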

[Figure: box graph]
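A box graph is usually drawn from the five-number summary (minimum, Q1, median, Q3, maximum). The sketch below computes it with `statistics.quantiles` from the standard library; the helper name `five_number_summary` and the data are assumptions for illustration, and other tools may use a slightly different quartile convention.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # Quartile conventions differ between tools; "inclusive" treats the
    # data as the whole population and interpolates between data points.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]


# Hypothetical values, made up for illustration.
values = [7, 1, 9, 3, 5, 2, 8, 4, 6]
minimum, q1, median, q3, maximum = five_number_summary(values)
iqr = q3 - q1  # interquartile range: the length of the box
print(minimum, q1, median, q3, maximum, iqr)
```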

Differences between a bar chart and a histogram:

  • Histograms are used to show distributions of variables while bar charts are used to compare variables
  • Histograms plot binned quantitative data while bar charts plot categorical data
  • Bars can be reordered in bar charts but not in histograms
  • In a histogram it is the area of the bar that denotes the value, not the height as in a bar chart; this distinction is crucial when the bins are not of uniform width
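The area-versus-height point can be sketched with unequal bin widths: the bar height is the count divided by the bin width, so that height times width gives back the count. The function name `histogram_bins`, the bin edges, and the values are made up for illustration.

```python
def histogram_bins(values, edges):
    """Bin values and return (count, width, height) per bin.

    height = count / width, so the bar's area equals the count.
    Bins are half-open [low, high), except the last, which includes
    its upper edge.
    """
    bins = []
    for low, high in zip(edges, edges[1:]):
        is_last = high == edges[-1]
        count = sum(
            1 for v in values
            if low <= v < high or (is_last and v == high)
        )
        width = high - low
        bins.append((count, width, count / width))
    return bins


# Hypothetical data with one bin twice as wide as the others.
values = [1, 2, 2, 3, 4, 6, 7, 12, 15, 18]
edges = [0, 5, 10, 20]
bins = histogram_bins(values, edges)
for count, width, height in bins:
    print(count, width, height)
```

The last bin holds more values than the middle one (3 versus 2) but gets a lower bar, because its count is spread over twice the width.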

Resources

Attachment: 03DW_OLAP.pdf (5152.5 KB, last modified 2023-01-24 14:25)