Python: Clustering with grouped data

By grouped data I mean the following: assume we have a data set which is grouped by a single feature, e.g. customer data grouped by the individual customer:
Customer | Purchase Nr | Item      | Paid Amount ($)
1        | 1           | TShirt    | 15
1        | 2           | Trousers  | 25
1        | 3           | Scarf     | 10
2        | 1           | Underwear | 5
2        | 2           | Dress     | 35
2        | 3           | Trousers  | 30
2        | 4           | TShirt    | 10
3        | 1           | TShirt    | 8
3        | 2           | Socks     | 5
4        | 1           | Shorts    | 13
I want to find clusters in such a way that a customer's purchases are all in one single cluster; in other words, a customer should not appear in two clusters.
I thought about grouping the data set by the customer with a groupby, though it is difficult to express all the information from the columns for one customer in only one column. Further, the order of purchases is important to me, e.g. whether a T-Shirt was bought first or second.
Is there any cluster algorithm which includes information about groups like this?
Thank you!
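For illustration, a minimal sketch of the groupby idea mentioned above: collapse each customer's purchases into a single feature row, so every customer ends up in exactly one cluster. The column names follow the example table; the chosen features and the number of clusters are only placeholders.
import pandas as pd
from sklearn.cluster import KMeans

# df holds the purchase table above, one row per purchase
features = (
    df.sort_values(['Customer', 'Purchase Nr'])
      .groupby('Customer')
      .agg(n_purchases=('Purchase Nr', 'count'),
           total_paid=('Paid Amount ($)', 'sum'),
           first_paid=('Paid Amount ($)', 'first'))   # keeps a bit of order information
)

# One row per customer, so each customer gets exactly one cluster label
features['cluster'] = KMeans(n_clusters=2, n_init=10).fit_predict(features)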

Related

How to count text event type and transform it into country-year data using pandas?

I am trying to convert a dataframe where each row is a specific event and each column has information about the event. I want to turn this into data in which each row is a country and year, with information about the number and characteristics of the events in that year. In this data set, each event is an occurrence of terrorism, and I want to count the number of events where the "target" is a government building. One of the columns is called "targettype" or "targettype_txt", and there are 5 different entries in this column I want to count (government building, military, police, diplomatic building, etc.). The targettype is also coded as a number if that is easier (i.e. there is another column where gov't building is 2, military installation is 4, etc.).
FYI, this data set covers 16 countries in West Africa over the years 2000-2020, with a total of roughly 8000 events recorded. The data comes from the Global Terrorism Database, and this is for a thesis/independent research project (i.e. not a graded class assignment).
Right now my data looks like this (there are a ton of other columns but they aren't important for this):
eventID  | iyear | country_txt | nkill | nwounded | nhostages | targettype_txt
10000102 | 2000  | Nigeria     | 3     | 10       | 0         | government building
10000103 | 2000  | Mali        | 1     | 3        | 15        | military installation
10000103 | 2000  | Nigeria     | 15    | 0        | 0         | government building
10000103 | 2001  | Benin       | 1     | 0        | 0         | police
10000103 | 2001  | Nigeria     | 1     | 3        | 15        | private business
...
And I would like it to look like this:
country_txt | iyear | total_nkill | total_nwounded | total_nhostages | total public_target
Nigeria     | 2000  | 200         | 300            | 300             | 15
Nigeria     | 2001  | 250         | 450            | 15              | 17
I was able to get the totals for nkill, nwounded, and nhostages using this super simple line:
df2 = cdf.groupby(['country', 'country_txt', 'iyear'])[['nkill', 'nwound', 'nhostkid']].sum()
But this is a little different because I want to only count certain entries and sum up the total number of times they occur. Any thoughts or suggestions are really appreciated!
Try:
cdf['CountCondition'] = ((cdf['targettype_txt'] == 'government building') |
                         (cdf['targettype_txt'] == 'military installation') |
                         (cdf['targettype_txt'] == 'police'))
df2 = cdf[cdf['CountCondition']].groupby(['country','country_txt', 'iyear', 'CountCondition']).count()
You create a new column 'CountCondition' which simply marks as True or False whether the condition in the statement holds. Then you just count the number of rows where CountCondition is True. Hope this makes sense.
It is possible to combine all of this into one statement and NOT create an additional column, but the statement gets quite convoluted and more difficult to understand:
df2 = cdf[(cdf['targettype_txt']=='government building') |
(cdf['targettype_txt']=='military installation') |
(cdf['targettype_txt']=='police')].groupby(['country','country_txt', 'iyear']).count()
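If you also want the summed casualty columns and the count of those target types in a single country-year table (closer to the desired output above), a possible variant is to turn the condition into a 0/1 column and aggregate everything in one groupby. This is only a sketch; the output column names are made up:
mask = cdf['targettype_txt'].isin(['government building', 'military installation', 'police'])
df2 = (cdf.assign(public_target=mask.astype(int))
          .groupby(['country_txt', 'iyear'], as_index=False)
          .agg(total_nkill=('nkill', 'sum'),
               total_nwounded=('nwound', 'sum'),
               total_nhostages=('nhostkid', 'sum'),
               total_public_target=('public_target', 'sum')))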

How to create a column with sub-columns (underneath it) in pandas

I have a long data sheet with many questions. There are many questions with two or more answers, like below:
[image: question format in sheet]
Q:1 is there electricity in your home | Q:2 What are the electric appliances in your home
yes                                   | tv
yes                                   | fridge
no                                    | laptop
no                                    | computer
yes                                   | tv
yes                                   | laptop
I want the output result as below:
[image: answer]
Q:1 is there electricity in your home | Q:2 What are the electric appliances in your home
total | yes | no                      | total | tv | fridge | laptop | computer
6     | 4   | 2                       | 6     | 2  | 1      | 2      | 1
I want an additional "total" column, as well as totals for each answer (yes, no, tv, ...) in the other columns, as shown in the image above.
Thank you all for your help.
Edit: The column headers are the questions (Q1 & Q2); the rows below are the answers from different people in the survey. It is a sample for your understanding.
This is a possible approach. You can iterate over each column, calculate the frequency of each value in that column, and build up a new multi-index dataframe:
new_df = list()
for column in df:
    # count each answer for this question, then add a "total" row
    column_count = df[column].value_counts().to_frame().stack()
    column_count.loc[("total", column)] = column_count.sum()
    new_df.append(column_count)
Now, let's create a single dataframe with all those counts (one per column) and pivot the table to format the output:
new_df = pd.concat(new_df).reset_index()
new_df = new_df.pivot_table(index=["level_1", "level_0"], values=0).T
This is the output of the code with the sample input:
# Sample input
Q1 Q2
0 yes tv
1 yes fridge
2 no laptop
3 no tv
# Sample output
level_1 Q1 Q2
level_0 no total yes fridge laptop total tv
0 2 4 2 1 1 4 2
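For reference, a more compact variant that builds the same kind of (question, answer) column layout with pd.concat on a dict of per-column counts. This is a sketch against the sample input above, not the full sheet:
import pandas as pd

df = pd.DataFrame({"Q1": ["yes", "yes", "no", "no"],
                   "Q2": ["tv", "fridge", "laptop", "tv"]})

parts = {}
for column in df:
    counts = df[column].value_counts()
    counts.loc["total"] = counts.sum()   # append a per-question total
    parts[column] = counts

# the dict keys become the outer column level after transposing
summary = pd.concat(parts).to_frame().T
print(summary)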

Aggregating in pandas with two different identification columns

I am trying to aggregate a dataset of purchases; I have shortened the example in this post to keep it simple. The purchases are distinguished based on two different columns used to identify both customer and transaction: rows with the same Reference belong to the same transaction, while the ID refers to the type of transaction.
I want to sum these records based on ID, while keeping the Reference in mind and not double-counting the Size. The example I provide clears it up.
What I tried so far is:
df_new = df.groupby(by = ['id'], as_index=False).agg(aggregate)
df_new = df.groupby(by = ['id','ref'], as_index=False).agg(aggregate)
Let me know if you have any idea what I can do in pandas, or otherwise in Python.
This is basically what I have,
Name | Reference | Side | Size | ID
Alex | 0         | BUY  | 2400 | 0
Alex | 0         | BUY  | 2400 | 0
Alex | 0         | BUY  | 2400 | 0
Alex | 1         | BUY  | 3000 | 0
Alex | 1         | BUY  | 3000 | 0
Alex | 1         | BUY  | 3000 | 0
Alex | 2         | SELL | 4500 | 1
Alex | 2         | SELL | 4500 | 1
Sam  | 3         | BUY  | 1500 | 2
Sam  | 3         | BUY  | 1500 | 2
Sam  | 3         | BUY  | 1500 | 2
What I am trying to achieve is the following,
Name | Side | Size | ID
Alex | BUY  | 5400 | 0
Alex | SELL | 4500 | 1
Sam  | BUY  | 1500 | 2
P.S. The records are not duplicates of each other; what I provide is a simplified version, but in reality 'Name' stands for 20 more columns identifying each row.
P.P.S. My solution was to first aggregate by Reference and then by ID.
Use drop_duplicates, groupby, and agg:
new_df = df.drop_duplicates().groupby(['Name', 'Side']).agg({'Size': 'sum', 'ID': 'first'}).reset_index()
Output:
>>> new_df
Name Side Size ID
0 Alex BUY 5400 0
1 Alex SELL 4500 1
2 Sam BUY 1500 2
Edit: richardec's solution is better as this will also sum the ID column.
This double groupby should achieve the output you want, as long as names are unique.
df.groupby(['Name', 'Reference']).max().groupby(['Name', 'Side']).sum()
Explanation: First we group by Name and Reference to get the following dataframe. The ".max()" could just as well be ".min()" or ".mean()" as it seems your data will have the same size per unique transaction:
Name | Reference | Side | Size | ID
Alex | 0         | BUY  | 2400 | 0
     | 1         | BUY  | 3000 | 0
     | 2         | SELL | 4500 | 1
Sam  | 3         | BUY  | 1500 | 2
Then we group this data by Name and Side with a ".sum()" operation to get the final result.
Name | Side | Size | ID
Alex | BUY  | 5400 | 0
     | SELL | 4500 | 1
Sam  | BUY  | 1500 | 2
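Note that the second ".sum()" also sums the ID column; it happens to give the right numbers here only because the repeated ID is 0 and the other groups contain a single row. A variant of the same idea that keeps ID intact (a sketch, not tested against the real 20-column data) could be:
deduped = df.groupby(['Name', 'Reference'], as_index=False).max()
result = (deduped.groupby(['Name', 'Side'], as_index=False)
                 .agg({'Size': 'sum', 'ID': 'first'}))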
Just drop duplicates first and then group by the identifying columns.
Something like this should do (not tested).
I always like to reset the index afterwards, i.e.:
df.drop_duplicates().groupby(["Name","Side","ID"]).sum()["Size"].reset_index()
or
# stops the double counts
df_dropped = df.drop_duplicates()
# groups by all the fields in your example
df_grouped = df_dropped.groupby(["Name","Side","ID"]).sum()["Size"]
# resets the 3 indexes created with above
df_reset = df_grouped.reset_index()

Checking top values for dataframe columns in Python

I have a large dataset that looks like:
Shop       | Date       | Hour Ending | Hours Operating | Produced
Cornerstop | 01-01-2010 | 0           | 1               | 9
Cornerstop | 01-01-2010 | 1           | 1               | 11
Cornerstop | 01-01-2010 | 2           | 1               | 10
...
Cornerstop | 01-01-2010 | 23          | 1               | 0
Leaf Grove | 01-01-2010 | 0           | 1               | 7
Leaf Grove | 01-01-2010 | 1           | 1               | 4
Leaf Grove | 01-01-2010 | 2           | 1               | 2
I want to find out which shops are the top 20 shops by how much they've produced. I've used data.describe() to check the top percentiles, but this doesn't help me, because if I threshold on the top percentile of 'Produced' some days are lost from the data.
This is a newbie question, but how can I easily pick out and target these top shops based on this criterion? Perhaps use the percentile just to create a range of the top shops and cut those out of the dataset? It feels like there's a much better way to do this.
Use sort_values() and head():
df.sort_values('Produced', ascending=False).head(20)
If you want to sum the production values for each shop and then sort, you can do:
df.groupby('Shop').agg({'Produced': 'sum'}).sort_values('Produced', ascending=False).head(20)
Use .nlargest
df.groupby('Shop').Produced.sum().nlargest(20)
Add .index.tolist() if you just need a list of Shops.
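If the goal is then to keep only the rows belonging to those top shops in the original data (rather than just ranking them), one way, assuming the column names above, is:
top_shops = df.groupby('Shop')['Produced'].sum().nlargest(20).index
top_df = df[df['Shop'].isin(top_shops)]   # all hourly rows for the top 20 shops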
What about the following to sort the column and then take the top 20?
df = df.sort_values(['Produced'], ascending=[False])
df.head(20)

Which ML model for Customer Segmentation based on the products used

I am trying to run machine learning models on customers, to segment customers that use similar products together. My dataset is huge, with 2.4 million records, and is in the following format:
customer_id prod_1 prod_2 prod_3 prod_4 ..... prod_10
000 1 0 0 1 ..... 1
001 0 0 1 1 ..... 1
011 0 1 0 1 ..... 0
021 1 0 1 1 ..... 0
...
Each row has a customer number and a 1 or 0 for each product based on whether or not they have it. I ran k-means and the results did not look impressive.
Any other suggestions on what type of models can be run on such data to segment customers based on the products they use together?
Use frequent itemset mining.
Abandon the idea that each customer belongs to exactly one segment. That doesn't hold in reality.
Instead, there are typical product combinations that identify segments. These can also overlap: one customer can be an electronics aficionado and a Star Wars fan at the same time.
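As a concrete starting point, the third-party mlxtend package implements frequent itemset mining directly on this kind of 0/1 product matrix. A rough sketch; the package choice and the support threshold are only assumptions:
from mlxtend.frequent_patterns import apriori

# keep only the one-hot product columns, as booleans
baskets = df.drop(columns=['customer_id']).astype(bool)

# product combinations shared by at least 5% of customers
itemsets = apriori(baskets, min_support=0.05, use_colnames=True)
print(itemsets.sort_values('support', ascending=False).head(10))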
