What is the most frequently bought item with Infant Formula and Infant Diapers? This is one of the most important questions for a retailer to answer to be able to suggest other products based on the items already in the Shopping Cart. Association Rule Mining is a key to Market Basket Analysis.
An association rule is a implication expression of the form X → Y
, where X
and Y
are itemset. X is the Antecedent and Y is the Consequent
Example of Association Rules:
{Infant Diaper, Infant Milk} → {Coke},
{Milk, Bread} → {Eggs, Coke},
{Coke, Bread} → {Infant Milk}
Antecedent → Consequent
Now let’s assume we take {Infant Diaper, Infant Milk}
as the Antecedent, and we want to figure out the Consequent i.e. what ITEM is most often bought with Infant Diaper and Infant Milk. Or simply we need to find out, given N items in Cart, what is the N+1 item that the Customer will purchase in the same Order
In it’s simplest form the relationship between a Order and an Item can be represented as following
There are two relationship in play here. We need to include them both in the Property Graph to perform Association Rule Mining.
Association Rules are, by definition, extracted from the data i.e. they are core properties of the Data. We are not looking for correlation in the Itemsets in a Rule. This makes Property Graphs a good platform for Association Rule Mining
Let’s start with some sample Orders:
ORDER_NUMBER | Items |
---|---|
3 | infant formula, infant diaper, coffee, coke |
7 | infant formula, infant diaper, mountain dew, coke, coffee |
1 | bread, infant formula |
4 | bread, infant formula, infant diaper, coffee |
6 | infant formula, infant diaper, mountain dew |
5 | infant formula, infant diaper, coke |
2 | bread, infant diaper, coffee, eggs |
Now let’s create this dataset in the database
create table orders (order_number number);
create table items (item_name varchar(80));
insert into orders values (1), (2), (3), (4), (5), (6), (7);
insert into items values ('bread'), ('infant formula'), ('infant diaper'), ('coffee'), ('eggs'), ('coke'), ('mountain dew');
create table order_items(order_number number, item_name varchar(80)) ;
insert into order_items values
(1, 'bread')
, (1, 'infant formula')
, (2, 'bread')
, (2, 'infant diaper')
, (2, 'coffee')
, (2, 'eggs')
, (3, 'infant formula')
, (3, 'infant diaper')
, (3, 'coffee' )
, (3, 'coke')
, (4, 'bread')
, (4, 'infant formula' )
, (4, 'infant diaper')
, (4, 'coffee')
, (5, 'infant formula')
, (5, 'infant diaper')
, (5, 'coke')
, (6, 'infant formula')
, (6, 'infant diaper')
, (6, 'mountain dew')
, (7, 'infant formula')
, (7, 'infant diaper')
, (7, 'mountain dew')
, (7, 'coke')
, (7, 'coffee')
;
Now let’s create the Property Graph to represent the two relationship above. Note that we have two edges to represent the two relationships is_included_in
and includes
.
CREATE or replace PROPERTY GRAPH order_item_graph
VERTEX TABLES (
orders KEY (order_number) PROPERTIES ARE ALL COLUMNS,
items KEY (item_name) PROPERTIES ARE ALL COLUMNS
)
EDGE TABLES (
order_items as is_included_in
KEY (order_number, item_name)
SOURCE KEY (item_name) REFERENCES items(item_name)
DESTINATION KEY (order_number ) REFERENCES orders(order_number)
PROPERTIES ARE ALL COLUMNS,
order_items as includes
KEY (order_number, item_name)
SOURCE KEY (order_number ) REFERENCES orders(order_number)
DESTINATION KEY (item_name) REFERENCES items(item_name)
PROPERTIES ARE ALL COLUMNS
);
Next let’s query this Property Graph using sql-pgq
SELECT item1_name, item2_name, item3_name, count(*) as frequency
FROM GRAPH_TABLE( order_item_graph
MATCH (item1 is items) -> (o1 is orders) ->
(item2 is items) -> (o2 is orders) ->
(item3 is items)
where item1.item_name != item2.item_name and item2.item_name != item3.item_name and item1.item_name != item3.item_name
and o1.order_number = o2.order_number
COLUMNS ( item1.item_name as item1_name, item2.item_name as item2_name, item3.item_name as item3_name)
)
group by item1_name, item2_name, item3_name
order by frequency desc
;
This will output