"Visualization of Questionnaire Results by Association Rules Using Correspondence Analysis" was published in [Theory and Application of Data Analysis, Vol. I wondered whether the method could be used for something other than questionnaires, so I tried it in Python.
To summarize the paper briefly, the idea is to draw lines connecting the item combinations found by association analysis on top of the plot of the correspondence analysis results. The paper calls this an "attribute-specific feature extraction association plot".
As its example, the paper visualizes the results of a questionnaire on health concerns by media segment (the attribute).
The visualization lets you read off statements such as: "The F3 segment tends to have problems with XX, while the F1 and F2 segments tend to have problems with XX, but problems with ~~ are common to the F2 and F3 segments."
The same idea should carry over to ID-POS data: if the area or store plays the role of the attribute, and the number of purchases per product category or brand plays the role of the questionnaire answers, the purchasing tendencies of each area or store can be visualized. In fact, Computer Statistics Vol. 29, No. 22, "Analysis and Visualization of Store Classification and Purchasing Trends by Sales Trends," analyzes purchasing trends by store. This trial also uses POS data.
The data is Kaggle's superstore_data: four years of retail data for a global superstore, including Customer ID, Product ID, City, and so on, but no store ID. Since there is no store ID, this time I visualize purchasing tendencies by country and product sub-category.
First, import the required packages.
Read the data. This time, 80% of the data is discarded to make the processing lighter.
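The loading code is not shown here, so the following is only a minimal sketch of the "discard 80%" step. It assumes pandas, uses a made-up toy DataFrame in place of the real file, and assumes the sampling is done per row (the post does not say whether rows or customers were sampled).

```python
import pandas as pd

# Toy stand-in for the superstore data (made-up values, not the real file).
df_all = pd.DataFrame({
    "Customer ID": [f"C{i:03d}" for i in range(100)],
    "Sub-Category": ["Tables", "Chairs", "Paper", "Bindings"] * 25,
})

# Keep a random 20% of the rows, i.e. discard 80%.
# The seed is fixed only so that the sketch is repeatable.
df = df_all.sample(frac=0.2, random_state=0).reset_index(drop=True)
print(len(df_all), "->", len(df))
```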
First, preprocess the data (encoding and so on) so that association analysis can be performed.
# Concatenate the per-Customer-ID Country df and the per-Customer-ID Sub-Category df, then encode them
# Codes are numbered in ascending order
List the encoded numbers for each Customer ID.
# List the encoded numbers for each Customer ID
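As a rough illustration of the encoding steps described above, here is a sketch on toy data. The column names follow the ones mentioned in this post and the `Country_` prefix matches the `row_word` used later, but the data values and the variable names (`items`, `code_of`, `transactions`) are my own assumptions, not the original code.

```python
import pandas as pd

# Toy stand-in for the superstore data (made-up values).
df = pd.DataFrame({
    "Customer ID": ["A", "A", "B", "B", "C"],
    "Country": ["United States", "United States", "Mexico", "Mexico", "Mexico"],
    "Sub-Category": ["Tables", "Chairs", "Tables", "Paper", "Tables"],
})

# One row per (customer, country); prefix the country so it can be
# told apart from sub-categories later.
country = (df[["Customer ID", "Country"]].drop_duplicates()
           .assign(item=lambda d: "Country_" + d["Country"]))

# One row per (customer, sub-category).
subcat = (df[["Customer ID", "Sub-Category"]].drop_duplicates()
          .rename(columns={"Sub-Category": "item"}))

items = pd.concat([country[["Customer ID", "item"]],
                   subcat[["Customer ID", "item"]]])

# Encode labels as integers, numbered in ascending (sorted) order.
labels = sorted(items["item"].unique())
code_of = {label: i for i, label in enumerate(labels)}
items["code"] = items["item"].map(code_of)

# One sorted list of codes per customer.
transactions = (items.groupby("Customer ID")["code"]
                .apply(lambda s: sorted(s)).to_dict())
print(transactions)
```

Keeping `labels` around makes it easy to decode the numbers back into names after the association analysis.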
Run the association analysis.
# Narrow down to items with support of 5% or more
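To make the quantities concrete, here is a from-scratch sketch of support, confidence, and lift for item pairs. It is not the library call used for the actual analysis (that code is not shown here); the transactions are made up, and only the 5% support threshold comes from this post.

```python
from itertools import combinations

# Toy transactions: one set of items per customer (made-up data).
transactions = [
    {"Country_United States", "Tables", "Chairs"},
    {"Country_United States", "Tables"},
    {"Country_Mexico", "Paper"},
    {"Country_Mexico", "Tables", "Paper"},
]
n = len(transactions)
MIN_SUPPORT = 0.05  # the 5% threshold


def support(itemset):
    """Share of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / n


# Keep only single items that meet the support threshold.
all_items = sorted(set().union(*transactions))
frequent_items = [i for i in all_items if support({i}) >= MIN_SUPPORT]

# Build one-to-one rules in both directions for every frequent pair.
rules = []
for a, b in combinations(frequent_items, 2):
    pair_support = support({a, b})
    if pair_support < MIN_SUPPORT:
        continue
    for ante, cons in ((a, b), (b, a)):
        confidence = pair_support / support({ante})  # P(cons | ante)
        lift = confidence / support({cons})          # P(cons | ante) / P(cons)
        rules.append({"antecedent": ante, "consequent": cons,
                      "support": pair_support,
                      "confidence": confidence, "lift": lift})

for r in rules:
    print(r)
```

A lift above 1 means the consequent is bought more often among customers matching the antecedent than among customers overall.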
I want Country in the antecedent (condition part), so I extract the rules in which Country is the antecedent and Sub-Category is the consequent (conclusion part).
# Extract rules with Country in the antecedent and Sub-Category in the consequent
Later I also want to visualize combinations of Sub-Categories, so I extract the rules in which Sub-Category appears in both the antecedent and the consequent.
# Extract rules with Sub-Category in both the antecedent and the consequent
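The two extractions can be sketched as simple filters on the item prefix. The rule tuples and the helper `is_country` below are hypothetical, and the real rules table has more columns; only the `Country_` prefix is taken from this post.

```python
ROW_WORD = "Country_"  # prefix that marks country items

# Hypothetical rules as (antecedent, consequent, lift); values are made up.
rules = [
    ("Country_Mexico", "Tables", 1.7),
    ("Tables", "Country_Mexico", 1.7),
    ("Tables", "Chairs", 1.5),
    ("Chairs", "Paper", 1.1),
]


def is_country(item):
    """True when the item label is a country rather than a sub-category."""
    return item.startswith(ROW_WORD)


# Country in the antecedent, Sub-Category in the consequent.
country_to_subcat = [r for r in rules
                     if is_country(r[0]) and not is_country(r[1])]

# Sub-Category in both the antecedent and the consequent.
subcat_to_subcat = [r for r in rules
                    if not is_country(r[0]) and not is_country(r[1])]

print(country_to_subcat)
print(subcat_to_subcat)
```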
Decode the table of association analysis results.
#Make a table of association analysis results
Let's express the result of association analysis in a network diagram.
#Count the number of people by country
#Representing the results of association analysis in a network diagram
The resulting figure is hard to read ...
The size of the red square is proportional to the number of customers in the country, and the thickness of the gray line is proportional to the lift value.
It seems that developed countries buy products from a wide range of categories, while developing countries buy from only a few categories.
However, the positions of the nodes do not carry any meaning in this figure yet. The goal this time is to combine it with correspondence analysis so that the node positions become meaningful as well.
Create a data mart for the correspondence analysis.
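The exact layout of the mart is not shown here. One plausible form, and this is my assumption, is a Country x Sub-Category contingency table that counts unique customers, which could be sketched with pandas like this (toy data, made-up values):

```python
import pandas as pd

# Toy purchase records (made-up values); customer A bought Tables twice.
df = pd.DataFrame({
    "Customer ID": ["A", "A", "A", "B", "C", "C"],
    "Country": ["United States"] * 3 + ["Mexico"] * 3,
    "Sub-Category": ["Tables", "Tables", "Chairs", "Tables", "Tables", "Paper"],
})

# Count each customer at most once per (country, sub-category) cell.
unique_purchases = df.drop_duplicates(["Customer ID", "Country", "Sub-Category"])

# Rows: countries, columns: sub-categories, values: number of unique customers.
mart = pd.crosstab(unique_purchases["Country"], unique_purchases["Sub-Category"])
print(mart)
```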
Run the correspondence analysis.
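For reference, here is a from-scratch sketch of simple correspondence analysis using NumPy's SVD. It is not the library call used for the actual plots (that code is not shown here), and the contingency table `N` is made up; it only illustrates how row and column coordinates come out of the same decomposition.

```python
import numpy as np

# Made-up contingency table: rows = countries, columns = sub-categories.
N = np.array([[20.0, 5.0, 2.0],
              [4.0, 18.0, 6.0],
              [3.0, 7.0, 15.0]])

P = N / N.sum()    # correspondence matrix (relative frequencies)
r = P.sum(axis=1)  # row masses
c = P.sum(axis=0)  # column masses

# Standardized residuals: (P - r c^T) scaled by 1 / sqrt(r_i * c_j).
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)

# The SVD of the residuals gives the principal axes.
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of rows and columns on those axes.
row_coords = (U * sigma) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]

inertia = sigma ** 2  # variance explained by each axis
print(row_coords[:, :2])
print(col_coords[:, :2])
print(inertia / inertia.sum())
```

The first two columns of `row_coords` and `col_coords` are what would be plotted as the x and y positions of countries and sub-categories.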
Although the points overlap and are almost unreadable, this gives a diagram in which the positions of the countries and product categories are meaningful. (Zambia, Mauritania, Afghanistan and Singapore have similar purchasing trends. Is that true ...?)
If the edges from the association analysis can be added to this figure, the attribute-specific feature extraction association plot is complete.
First, perform correspondence analysis as before.
I used plt.scatter to plot the results of correspondence analysis earlier, but here I use networkx.
#Representing the results of correspondence analysis and association analysis in a network diagram
Now let's run it.
mca_association_plot(df_corre, df_label, rows, cols, new_labels2, strong_node_row=None, strong_node_col=None, xlim=xlim, ylim=ylim)
This produces a diagram that overlays the association analysis results on the correspondence analysis results.
The gray edges represent relationships between countries and product sub-categories, and the light blue edges represent relationships between product sub-categories.
Only edges with a lift of 1.6 or more (gray) or 1.5 or more (light blue) are drawn.
The red squares and blue crosses grow in proportion to the number of unique Customer IDs.
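The drawing rules above can be sketched without any plotting library. The thresholds 1.6 and 1.5 and the colors come from this post; the rule lists, the customer counts, and the scale factor `SIZE_PER_CUSTOMER` are made-up assumptions, and the body of `mca_association_plot` itself is not reproduced here.

```python
GRAY_MIN_LIFT = 1.6       # country -> sub-category edges
LIGHTBLUE_MIN_LIFT = 1.5  # sub-category -> sub-category edges

# Hypothetical rules as (antecedent, consequent, lift); values are made up.
country_rules = [("Country_Mexico", "Tables", 1.7),
                 ("Country_Mexico", "Paper", 1.2)]
subcat_rules = [("Tables", "Chairs", 1.55),
                ("Chairs", "Paper", 1.4)]

# Hypothetical number of unique Customer IDs per node.
customers = {"Country_Mexico": 40, "Tables": 25, "Chairs": 10, "Paper": 30}

# Keep only the edges that reach their threshold, tagged with a color.
edges = (
    [(a, c, "gray") for a, c, lift in country_rules
     if lift >= GRAY_MIN_LIFT]
    + [(a, c, "lightblue") for a, c, lift in subcat_rules
       if lift >= LIGHTBLUE_MIN_LIFT]
)

# Marker size proportional to the number of unique customers.
SIZE_PER_CUSTOMER = 10  # hypothetical scale factor
node_sizes = {node: SIZE_PER_CUSTOMER * count
              for node, count in customers.items()}

print(edges)
print(node_sizes)
```

Lists like `edges` and `node_sizes` could then be handed to a drawing routine together with the node positions from the correspondence analysis.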
However, the points overlap and it is hard to tell what is going on, so I'll zoom in on the center a little.
mca_association_plot(df_corre, df_label, rows, cols, new_labels2, strong_node_row=None, strong_node_col=None, xlim=xlim, ylim=ylim)
It is still hard to read even when enlarged.
Still, the United States has the most customers and sits where Bindings tend to sell, yet no edge connects the two; instead, an edge connects the United States to Tables.
Bindings may stand out in terms of sales, but the percentage of customers who bought Tables in the United States appears to be at least 1.6 times the percentage of customers who bought Tables across all countries. I'm not sure, though.
To reduce the number of plotted nodes, plot again using only items with a support of 0.05 or higher.
# Use only items with support of 0.05 or more
# Take one item from each frequent itemset and de-duplicate
strong_single_product = list({next(iter(itemset)) for itemset in frequent_itemsets['itemsets']})
row_word = 'Country_'
# Split the items into country nodes (rows) and sub-category nodes (columns)
strong_node_row = [v for v in strong_single_product if row_word in v]
strong_node_col = [v for v in strong_single_product if row_word not in v]
mca_association_plot(df_corre, df_label, rows, cols, new_labels2, strong_node_row=strong_node_row, strong_node_col=strong_node_col, xlim=xlim, ylim=ylim)
The edges still fly all over the place and the figure is hard to read ... Forcing an interpretation anyway:
・ The tendencies appear to differ between the United States on the left, Germany / United Kingdom in the middle, Brazil / Mexico on the upper right, Spain / Italy at the bottom, and so on.
・ The percentage of customers who purchased Tables in Mexico is at least 1.6 times the percentage of customers who purchased Tables across all countries.
・ Brazil shows the same tendency as Mexico, so recommending Tables there might work.
・ Many people seem to buy Tables, Chairs, and Paper in combination, so it might be worth recommending Chairs more in the United States.
Something like that? No, it really is difficult to interpret. It might be easier to interpret if the analysis were done by store within a single country.
I built an attribute-specific feature extraction association plot. However, it was quite difficult to interpret ... An approach similar to "Computer Statistics Vol. 29, No. 22" (https://www.jstage.jst.go.jp/browse/jscswabun/29/2/_contents/-char/ja) "Analysis and Visualization of Store Classification and Purchasing Trends by Sales Trends" might give a more meaningful analysis. Personally, I'm glad I got some practice with networkx.
That's all!