Itemset
Itemset: A set of one or more items
k-itemset: An itemset containing k items
support:
- absolute support: sup(X) — the number of occurrences of an itemset X
- relative support: s(X) — the probability that a transaction contains X
frequent: An itemset (or a pattern) X is frequent if the support of X is no less than a minsup
threshold σ
{Beer, Diaper}: 3/5 (60%); Beer: 3/5 (60%); if σ = 0.3, both are frequent
confidence: The conditional probability that a transaction containing X also contains Y: $c(X \Rightarrow Y) = s(X \cup Y) / s(X)$
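The definitions above can be sketched directly. A minimal example, assuming a toy transaction database chosen to match the Beer/Diaper counts above:

```python
# Toy transaction DB (assumed; chosen so {Beer, Diaper} has support 3/5).
transactions = [
    {"Beer", "Diaper", "Milk"},
    {"Beer", "Diaper", "Bread"},
    {"Beer", "Diaper"},
    {"Bread", "Milk"},
    {"Bread", "Diaper"},
]

def support(itemset, db):
    """Relative support: fraction of transactions containing every item of itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(x, y, db):
    """Conditional probability that a transaction containing X also contains Y."""
    return support(x | y, db) / support(x, db)

print(support({"Beer", "Diaper"}, transactions))       # 0.6
print(confidence({"Beer"}, {"Diaper"}, transactions))  # 1.0
```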
Compress Pattern
find closed patterns instead of all patterns
closed pattern: a pattern (itemset) X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
T1: {a1, ..., a50}; T2: {a1, ..., a100}; closed patterns: P1: “{a1, ..., a50}: 2”; P2: “{a1, ..., a100}: 1”
- Closed pattern is a lossless compression of frequent patterns
max pattern: a pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
T1: {a1, ..., a50}; T2: {a1, ..., a100}; max pattern: P: “{a1, ..., a100}: 1”
- Max-pattern is a lossy compression: the supports of sub-patterns are lost
- in many applications, mining closed patterns is more desirable than mining max-patterns
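A brute-force sketch of both definitions on an assumed tiny DB (exhaustive enumeration only works for toy examples; real miners use Apriori/FP-growth below):

```python
from itertools import combinations

# Tiny toy DB (assumed); absolute minsup = 2 occurrences.
db = [frozenset("abc"), frozenset("abcd"), frozenset("bd")]
minsup = 2
items = sorted(set().union(*db))

def sup(x):
    return sum(1 for t in db if x <= t)

# Enumerate all frequent itemsets by brute force.
frequent = {}
for k in range(1, len(items) + 1):
    for c in combinations(items, k):
        s = sup(frozenset(c))
        if s >= minsup:
            frequent[frozenset(c)] = s

# Closed: no proper superset with the same support (lossless compression).
closed = {x for x in frequent
          if not any(x < y and frequent[y] == frequent[x] for y in frequent)}
# Max: no frequent proper superset at all (lossy: sub-pattern supports are lost).
maximal = {x for x in frequent if not any(x < y for y in frequent)}

print(sorted("".join(sorted(x)) for x in closed))   # ['abc', 'b', 'bd']
print(sorted("".join(sorted(x)) for x in maximal))  # ['abc', 'bd']
```

Note that every max-pattern is closed, but not vice versa (here 'b' is closed yet not maximal).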
Mining Methods
Apriori Algorithm
- a BFS (level-wise) algorithm for mining frequent itemsets
downward closure: any subset of a frequent itemset must be frequent; equivalently, if ANY subset of an itemset S is infrequent, S cannot be frequent (used for pruning).
candidate generation:
- joining: two frequent k-itemsets A and B can be joined only if their first k−1 items are the same
- pruning: use downward closure
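Candidate generation (join + prune) in a minimal sketch, assuming frequent itemsets are represented as sorted tuples:

```python
from itertools import combinations

def apriori_gen(freq_k):
    """Generate (k+1)-candidates from frequent k-itemsets (sorted tuples).
    Join: merge two k-itemsets whose first k-1 items agree.
    Prune: downward closure — drop candidates with any infrequent k-subset."""
    freq_set = set(freq_k)
    candidates = []
    for a, b in combinations(sorted(freq_k), 2):
        if a[:-1] == b[:-1]:                     # join on common (k-1)-prefix
            c = a + (b[-1],)
            # prune: every k-subset of c must itself be frequent
            if all(s in freq_set for s in combinations(c, len(a))):
                candidates.append(c)
    return candidates

freq_2 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
print(apriori_gen(freq_2))   # [('a', 'b', 'c')]  — ('b','c','d') pruned: ('c','d') infrequent
```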
partitioning:
partitioning is a way to reduce passes of transaction database scans
- Any itemset that is potentially frequent in TDB must be frequent in at least one of the partitions of TDB
- local min-support count in a partition = relative min-support $\times$ number of transactions in that partition
- only two scans of the DB are needed to find all frequent itemsets (scan 1: local frequent itemsets per partition; scan 2: count the global candidates)
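The per-partition threshold is simple arithmetic; a sketch with hypothetical partition sizes and an assumed global relative min-support:

```python
# Relative min-support stays fixed; the absolute count threshold
# scales with each partition's transaction count (sizes are hypothetical).
rel_minsup = 0.25
partition_sizes = [400, 350, 250]

local_thresholds = [rel_minsup * n for n in partition_sizes]
print(local_thresholds)   # [100.0, 87.5, 62.5]
```

An itemset frequent in the whole DB must reach its local threshold in at least one partition, which is why the union of local frequent itemsets is a valid global candidate set.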
FP-Growth
- a DFS algorithm for mining frequent itemsets; suitable for large amounts of data
process:
- scan DB once, find the frequent single items
- sort the frequent items in frequency-descending order
- scan DB again, build the ordered frequent-item list for each transaction (these lists are then inserted into the FP-tree)
- FP-growth is faster than Apriori for most of situations
- both FP and Apriori are mining from horizontal data format (TID: data)
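The two scans in the process above can be sketched as follows (toy transactions and an assumed absolute minsup of 3; the FP-tree construction itself is omitted):

```python
from collections import Counter

# Toy DB (assumed). Scan 1: count single items, keep those with count >= minsup.
db = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"), list("afcelpmn")]
minsup = 3
counts = Counter(item for t in db for item in t)
freq = {i: c for i, c in counts.items() if c >= minsup}

# Frequency-descending order (alphabetical tie-break for determinism).
order = sorted(freq, key=lambda i: (-freq[i], i))
rank = {i: r for r, i in enumerate(order)}

# Scan 2: ordered frequent-item list per transaction (FP-tree insertion order).
ordered_lists = [sorted((i for i in t if i in freq), key=rank.get) for t in db]
print(ordered_lists[0])   # ['c', 'f', 'a', 'm', 'p']
```

Sorting every transaction by one global order is what lets shared prefixes collapse into shared FP-tree paths.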
Pattern Evaluate
lift: Measure of dependent/correlated events
$$ lift(B,C)=\frac{s(B\cup C)}{s(B)\times s(C)} $$
- $lift(B, C) = 1$: B and C are independent
- $> 1$: positively correlated
- $< 1$: negatively correlated
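Lift falls out of raw counts directly; a sketch with hypothetical counts:

```python
# Hypothetical counts: n transactions, n_b contain B, n_c contain C, n_bc both.
n = 100
n_b, n_c, n_bc = 40, 50, 30

s_b, s_c, s_bc = n_b / n, n_c / n, n_bc / n
lift = s_bc / (s_b * s_c)
print(round(lift, 3))   # 1.5 -> B and C are positively correlated
```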
$\chi^2$: another way to test whether events are correlated
$$ \chi^2 =\sum\frac{(Observed -Expected)^2}{Expected} $$
both the $\chi^2$ and lift measures can be distorted by null transactions (transactions containing neither itemset)
Null invariance: the number of null transactions does not matter — it does not change the measure's value.
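The $\chi^2$ sum runs over the cells of a 2×2 contingency table; a sketch with hypothetical counts:

```python
# Hypothetical 2x2 contingency table. Rows: B / not B; columns: C / not C.
observed = [[30, 10], [20, 40]]
n = sum(sum(row) for row in observed)
row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]

# Expected cell count under independence = row total * column total / n.
chi2 = sum((observed[i][j] - row_tot[i] * col_tot[j] / n) ** 2
           / (row_tot[i] * col_tot[j] / n)
           for i in range(2) for j in range(2))
print(round(chi2, 2))   # 16.67
```

Note the bottom-right cell counts the null transactions, which is exactly why $\chi^2$ is not null-invariant.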
Imbalance Ratio (IR)
measure the imbalance of two itemsets A and B in rule implications
$$ IR(A,B)=\frac{|s(A)-s(B)|}{s(A)+s(B)-s(A\cup B)} $$
- if $IR = 0$, the two events are balanced
- Lift and χ2 are good measures if null transactions are not predominant
- Otherwise, Kulczynski + Imbalance Ratio should be used to judge the interestingness of a pattern
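Both null-invariant measures from hypothetical supports:

```python
# Hypothetical relative supports of A, B, and A union B.
s_a, s_b, s_ab = 0.4, 0.1, 0.05

# Kulczynski: average of the two rule confidences (null-invariant).
kulc = 0.5 * (s_ab / s_a + s_ab / s_b)
# Imbalance Ratio: how lopsided the two directions of the rule are.
ir = abs(s_a - s_b) / (s_a + s_b - s_ab)
print(round(kulc, 4), round(ir, 4))   # 0.3125 0.6667
```

A Kulczynski value near 0.5 with a large IR flags an "interesting but imbalanced" pair that lift alone could misjudge.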
Mining Multilevel Patterns
min-support: there are two min-support strategies
- uniform: the same min-support across all levels
- level-reduced: min-support is reduced as the level becomes lower (more specific)
redundancy filtering: redundant due to “ancestor” relationships
A rule is redundant if its support is close to the “expected” value derived from its “ancestor” rule, and its confidence is similar to the ancestor's
- milk $\rightarrow$ wheat bread [support = 8%, confidence = 70%] (1)
- 2% milk $\rightarrow$ wheat bread [support = 2%, confidence = 72%] (2)
- Suppose 2% milk accounts for about ¼ of the milk sold; then rule (2) is redundant given rule (1)
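The redundancy test for the milk example is one multiplication plus two comparisons (the tolerances here are assumed for illustration):

```python
# Ancestor rule: milk -> wheat bread [support 8%, confidence 70%].
anc_sup, anc_conf = 0.08, 0.70
share = 0.25                      # 2% milk is ~1/4 of all milk sold
child_sup, child_conf = 0.02, 0.72

expected_sup = anc_sup * share    # 0.08 * 0.25 = 0.02
redundant = (abs(child_sup - expected_sup) < 0.005     # support near expected
             and abs(child_conf - anc_conf) < 0.05)    # similar confidence
print(redundant)   # True -> the child rule adds no new information
```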
customized min-supports: to mine items with low frequency (low support), we can use group-based min-supports
{diamond, watch}: 0.05%; {bread, milk}: 5%
quantitative associations:
methods used when mining associations with numerical attributes
- Static discretization: age: {0-10, 10-20, ..., 90-100} → {young, mid-aged, old}
- Dynamic discretization based on data distribution
- Clustering: Distance-based association
- Deviation analysis: Gender = female $\rightarrow$ Wage: mean=$7/hr
mining extraordinary phenomena:
check whether the LHS (a subset of the population) behaves extraordinarily with respect to the RHS (an extraordinary behavior of this subset)
- The rule is accepted only if a statistical test (e.g., Z-test) confirms the inference with high confidence
negative pattern: Unlikely to happen together
If itemsets A and B are both frequent but rarely occur together, then:
$$ s(A\cup B) \ll s(A)\times s(B) $$
the inequality above is influenced by null transactions; we can use the Kulczynski measure instead:
If itemsets A and B are frequent but ($\epsilon$: negative-pattern threshold):
$$ \frac{1}{2}\left(\frac{s(A \cup B)}{s(A)} + \frac{s(A \cup B)}{s(B)}\right) < \epsilon $$
then A and B are negatively correlated.
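A sketch of the null-invariant test, with an assumed threshold and hypothetical supports:

```python
# Negative-pattern test via Kulczynski (null-invariant).
eps = 0.01                        # assumed negative-pattern threshold
s_a, s_b, s_ab = 0.2, 0.3, 0.001  # hypothetical supports: A, B frequent, rarely together

kulc = 0.5 * (s_ab / s_a + s_ab / s_b)
print(kulc < eps)   # True -> A and B are negatively correlated
```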
Constraint-Based Mining
Mine together with user-provided constraints
types of constraints:
- Knowledge type constraint—Specifying what kinds of knowledge to mine
- Data constraint—using SQL-like queries
- Dimension/level constraint—similar to projection in relational database
- Interestingness constraint—various kinds of thresholds
- Rule (or pattern) constraint
characteristics of constraints:
- Anti-monotone: if a set violates the constraint, every superset violates it too, e.g. $support(S) \ge \sigma$
- Monotone: if a set satisfies the constraint, every superset satisfies it too, e.g. $sum(S.price) \ge v$
- Convertible: Convert tough constraints into (anti-)monotone by proper ordering of items in transactions. e.g. $avg(S.profit)>20$
- Data anti-monotone: In the mining process, if a data entry t cannot contribute to a pattern p satisfying c, t cannot contribute to p’s superset
- Succinctness: If the constraint c can be enforced by directly manipulating the data
- Data succinct: Data space can be pruned at the initial pattern mining process
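The pruning power of the monotone/anti-monotone properties can be checked on a hypothetical price table (items, prices, and the constraint forms here are illustrative; note $sum(S.price) \le v$ is anti-monotone only because prices are non-negative):

```python
# Hypothetical item prices.
price = {"a": 10, "b": 25, "c": 40}

def sum_price(s):
    return sum(price[i] for i in s)

v = 30
# Monotone: once sum(S.price) >= v holds, it holds for every superset,
# so the check never needs repeating further down that branch.
assert sum_price({"b", "c"}) >= v and sum_price({"a", "b", "c"}) >= v

# Anti-monotone: once sum(S.price) <= v fails (non-negative prices),
# it fails for every superset, so the whole branch can be pruned.
assert not sum_price({"b", "c"}) <= v and not sum_price({"a", "b", "c"}) <= v
print("pruning properties hold on this example")
```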
The light of 412 🥲
This course is going to turn me into light 💔