CS412: Note for Midterm2

请注意，本文编写于 922 天前，最后修改于 680 天前，其中某些信息可能已经过时。

Itemset

Itemset: A set of one or more items

k-itemset: An itemset containing k items

support:

absolute support: sup{x} occurrences of an itemset X
relative support: s{x} probability that a transaction contains X

frequent: An itemset (or a pattern) X is frequent if the support of X is no less than a minsup threshold σ

{Beer, Diaper}: 3/5 (60%) Beer: 3/5 (60%), if σ = 0.3, they are frequent

confident: The conditional probability that a transaction containing X also contains Y

Compress Pattern

find close pattern instead of all pattern

close pattern: pattern (itemset) X is closed if X is frequent, and there exists no super-pattern Y כ X, with the same support as X

T1: {a1, ..., a50}; T2: {a1, ..., a100}, close pattern: P1: “{a1, ..., a50}: 2”; P2: “{a1, ..., a100}: 1”

Closed pattern is a lossless compression of frequent patterns

max pattern: pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern Y כ X

T1: {a1, ..., a50}; T2: {a1, ..., a100}, max pattern: P: “{a1, ..., a100}: 1”

Max-pattern is a lossy compression, that is it will cause data loss
in many application, mining closed-patterns is more desirable than mining max-patterns

Mining Methods

Apriori Algorithm

BFS algorithm to find itemset frequent pattern

downward closure: Any subset of a frequent itemset must be frequent, If ANY subset of an itemset S is infrequent, then there is no chance for S to be frequent (use for pruning).

candidate generation:

joining: the set A, B can only be joined if their first k-1 items are the same
pruning: use downward closure

partitioning:

partitioning is a way to reduce passes of transaction database scans

Any itemset that is potentially frequent in TDB must be frequent in at least one of the partitions of TDB
$min_{support}$ in partitioned zone = $min_{support}$ $\times$ events number in this partition zone
only need to scan the DB twice to find frequent item

FP-Growth

DFS algorithm to find itemset frequent pattern, can use to deal with huge amount of data

process:

scan DB, find single item frequent pattern
sort frequent item in frequent descending
scan DB again, find the ordered frequent itemlist for each transaction
FP growth tree

FP-growth is faster than Apriori for most of situations
both FP and Apriori are mining from horizontal data format (TID: data)

Pattern Evaluate

lift: Measure of dependent/correlated events

$$ lift=\frac{s(B\cup C)}{s(B)\times s(C)} $$

$lift(B, C) = 1$: B and C are independent
$> 1$: positively correlated
$< 1$: negatively correlated

$\chi^2$: another way to measure test correlated events

$$ \chi^2 =\sum\frac{(Observed -Expected)^2}{Expected} $$

sample data
chi square calculation

for both $\chi ^ 2$ and lift measure, they might be influenced by null transactions

Null invariance: The number of null transactions does not matter, does not change the measure value.

Imbalance Ratio (IR)

measure the imbalance of two itemsets A and B in rule implications

$$ IR(A,B)=\frac{|s(A)-s(B)|}{s(A)+s(B)-s(A\cup B)} $$

if $IR=0$, means two events are balanced
Lift and χ2 are good measures if null transactions are not predominant
Otherwise, Kulczynski + Imbalance Ratio should be used to judge the interestingness of a pattern

Mining Multilevel Patterns

min-support: there are two min-support thresholds

uniform: same min-support across multiple levels
level-reduced: min-support reduced as level become lower

redundancy filtering: redundant due to “ancestor” relationships

A rule is redundant if its support is close to the “expected” value, according to its “ancestor” rule, and it has a similar confidence as its “ancestor”

milk $\rightarrow$ wheat bread [support = 8%, confidence = 70%] (1)
2% milk $\rightarrow$ wheat bread [support = 2%, confidence = 72%] (2)
Suppose the 2% milk sold is about ¼ of milk sold in gallons, then, the data above is redundant

customized min-supports: to analyze the item with less frequency (low support), we can use group-based min-support

{diamond, watch}: 0.05%; {bread, milk}: 5%

quantitative associations:

methods used when mining associations with numerical attributes

Static discretization: age: {0-10, 10-20, ..., 90-100} → {young, mid-aged, old}
Dynamic discretization based on data distribution
Clustering: Distance-based association
Deviation analysis: Gender = female $\rightarrow$ Wage: mean=$7/hr

mining extraordinary phenomena:

check if LHS (a subset of the population) behave extraordinarily compared to RHS (an extraordinary behavior of this subset)

The rule is accepted only if a statistical test (e.g., Z-test) confirms the inference with high confidence

negative pattern: Unlikely to happen together

If itemsets A and B are both frequent but rarely occur together, then:

$$ s(A\cup B) << s(A)\times s(B) $$

the equation above will be influenced by the null transaction, we can use Kulczynski-measure:

If itemsets A and B are frequent but: ($є$: negative pattern threshold)

$$ (\frac{s(A \cup B)}{s(A)} + \frac{s(A \cup B)}{s(B)})\cdot \frac{1}{2} < є $$

then A and B are negatively correlated.

Constraint-Based Mining

Mine together with user-provided constraints

type of constrains:

Knowledge type constraint—Specifying what kinds of knowledge to mine
Data constraint—using SQL-like queries
Dimension/level constraint—similar to projection in relational database
Interestingness constraint—various kinds of thresholds
Rule (or pattern) constraint

characteristic of constrain:

Anti-Monotonicity: set violated, then its superset must violate as well e.g. $support(S) \rightarrow \sigma$
Monotone: if set satisfied, then its superset must satisfied as well. e.g. $sum(S.Price) \rightarrow v$
Convertible: Convert tough constraints into (anti-)monotone by proper ordering of items in transactions. e.g. $avg(S.profit)>20$
Data anti-monotone: In the mining process, if a data entry t cannot contribute to a pattern p satisfying c, t cannot contribute to p’s superset
Succinctness: If the constraint c can be enforced by directly manipulating the data
Data succinct: Data space can be pruned at the initial pattern mining process

Download Note

附件

附件名称：midterm2 note.zip

文件大小：754.1 KB

下载次数: 582

最后修改: 2023-03-29 22:41

点击下载

CC BY-ND

This license enables reusers to copy and distribute the material in any medium or format in unadapted form only, and only so long as attribution is given to the creator. The license allows for commercial use.

cs412

CS412: Note for Midterm2

Itemset

Compress Pattern