Itemset

Itemset: A set of one or more items

k-itemset: An itemset containing k items

support:

  1. absolute support: sup{x} occurrences of an itemset X
  2. relative support: s{x} probability that a transaction contains X

frequent: An itemset (or a pattern) X is frequent if the support of X is no less than a minsup threshold σ

sample itemset
sample itemset

{Beer, Diaper}: 3/5 (60%) Beer: 3/5 (60%), if σ = 0.3, they are frequent

confident: The conditional probability that a transaction containing X also contains Y

Compress Pattern

find close pattern instead of all pattern

close pattern: pattern (itemset) X is closed if X is frequent, and there exists no super-pattern Y כ X, with the same support as X

T1: {a1, ..., a50}; T2: {a1, ..., a100}, close pattern: P1: “{a1, ..., a50}: 2”; P2: “{a1, ..., a100}: 1”
  • Closed pattern is a lossless compression of frequent patterns

max pattern: pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern Y כ X

T1: {a1, ..., a50}; T2: {a1, ..., a100}, max pattern: P: “{a1, ..., a100}: 1”
  • Max-pattern is a lossy compression, that is it will cause data loss
  • in many application, mining closed-patterns is more desirable than mining max-patterns

Mining Methods

Apriori Algorithm

  • BFS algorithm to find itemset frequent pattern

downward closure: Any subset of a frequent itemset must be frequent, If ANY subset of an itemset S is infrequent, then there is no chance for S to be frequent (use for pruning).

Apriori process
Apriori process

candidate generation:

  1. joining: the set A, B can only be joined if their first k-1 items are the same
  2. pruning: use downward closure

partitioning:

partitioning is a way to reduce passes of transaction database scans

  • Any itemset that is potentially frequent in TDB must be frequent in at least one of the partitions of TDB
  • $min_{support}$ in partitioned zone = $min_{support}$ $\times$ events number in this partition zone
  • only need to scan the DB twice to find frequent item

partition
partition

FP-Growth

  • DFS algorithm to find itemset frequent pattern, can use to deal with huge amount of data

process:

  1. scan DB, find single item frequent pattern
  2. sort frequent item in frequent descending
  3. scan DB again, find the ordered frequent itemlist for each transaction

    FP growth tree
    FP growth tree

  • FP-growth is faster than Apriori for most of situations
  • both FP and Apriori are mining from horizontal data format (TID: data)

Pattern Evaluate

lift: Measure of dependent/correlated events

$$ lift=\frac{s(B\cup C)}{s(B)\times s(C)} $$

  • $lift(B, C) = 1$: B and C are independent
  • $> 1$: positively correlated
  • $< 1$: negatively correlated

$\chi^2$: another way to measure test correlated events

$$ \chi^2 =\sum\frac{(Observed -Expected)^2}{Expected} $$

sample data
sample data

chi square calculation
chi square calculation

for both $\chi ^ 2$ and lift measure, they might be influenced by null transactions

Null invariance: The number of null transactions does not matter, does not change the measure value.

correlation formulas
correlation formulas

Imbalance Ratio (IR)

measure the imbalance of two itemsets A and B in rule implications

$$ IR(A,B)=\frac{|s(A)-s(B)|}{s(A)+s(B)-s(A\cup B)} $$

  • if $IR=0$, means two events are balanced
  • Lift and χ2 are good measures if null transactions are not predominant
  • Otherwise, Kulczynski + Imbalance Ratio should be used to judge the interestingness of a pattern

Mining Multilevel Patterns

min-support: there are two min-support thresholds

  1. uniform: same min-support across multiple levels
  2. level-reduced: min-support reduced as level become lower

redundancy filtering: redundant due to “ancestor” relationships

A rule is redundant if its support is close to the “expected” value, according to its “ancestor” rule, and it has a similar confidence as its “ancestor”

  • milk $\rightarrow$ wheat bread [support = 8%, confidence = 70%] (1)
  • 2% milk $\rightarrow$ wheat bread [support = 2%, confidence = 72%] (2)
  • Suppose the 2% milk sold is about ¼ of milk sold in gallons, then, the data above is redundant

customized min-supports: to analyze the item with less frequency (low support), we can use group-based min-support

{diamond, watch}: 0.05%; {bread, milk}: 5%

quantitative associations:

methods used when mining associations with numerical attributes

  • Static discretization: age: {0-10, 10-20, ..., 90-100} → {young, mid-aged, old}
  • Dynamic discretization based on data distribution
  • Clustering: Distance-based association
  • Deviation analysis: Gender = female $\rightarrow$ Wage: mean=$7/hr

mining extraordinary phenomena:

check if LHS (a subset of the population) behave extraordinarily compared to RHS (an extraordinary behavior of this subset)

  • The rule is accepted only if a statistical test (e.g., Z-test) confirms the inference with high confidence

negative pattern: Unlikely to happen together

If itemsets A and B are both frequent but rarely occur together, then:

$$ s(A\cup B) << s(A)\times s(B) $$

the equation above will be influenced by the null transaction, we can use Kulczynski-measure:

If itemsets A and B are frequent but: ($є$: negative pattern threshold)

$$ (\frac{s(A \cup B)}{s(A)} + \frac{s(A \cup B)}{s(B)})\cdot \frac{1}{2} < є $$

then A and B are negatively correlated.

Constraint-Based Mining

Mine together with user-provided constraints

type of constrains:

  • Knowledge type constraint—Specifying what kinds of knowledge to mine
  • Data constraint—using SQL-like queries
  • Dimension/level constraint—similar to projection in relational database
  • Interestingness constraint—various kinds of thresholds
  • Rule (or pattern) constraint

characteristic of constrain:

  1. Anti-Monotonicity: set violated, then its superset must violate as well e.g. $support(S) \rightarrow \sigma$
  2. Monotone: if set satisfied, then its superset must satisfied as well. e.g. $sum(S.Price) \rightarrow v$
  3. Convertible: Convert tough constraints into (anti-)monotone by proper ordering of items in transactions. e.g. $avg(S.profit)>20$
  4. Data anti-monotone: In the mining process, if a data entry t cannot contribute to a pattern p satisfying c, t cannot contribute to p’s superset
  5. Succinctness: If the constraint c can be enforced by directly manipulating the data
  6. Data succinct: Data space can be pruned at the initial pattern mining process

pruning
pruning

constrain type p1
constrain type p1

constrain type p2
constrain type p2

Download Note

附件
附件名称:midterm2 note.zip
文件大小:754.1 KB
下载次数: 426
最后修改: 2023-03-29 22:41