How to create a column with sub-columns (underneath) in pandas - Python

I have a long data sheet with many questions. Many of the questions have two or more answer options, like below:
Q:1 is there electricity in your home   Q:2 What are the electric appliances in your home
yes                                     tv
yes                                     fridge
no                                      laptop
no                                      computer
yes                                     tv
yes                                     laptop
I want the output result as below:

Q:1 is there electricity in your home   Q:2 What are the electric appliances in your home
total  yes  no                          total  tv  fridge  laptop  computer
6      4    2                           6      2   1       2       1
I want an additional "total" column, plus totals for yes, no, tv, and so on in the other columns as well, as shown in the layout above.
Thank you all for your help.
Edit: Each column header is a question (Q1 & Q2). The rows below it are the answers from different people in the survey. This is a sample for your understanding.

This is a possible approach. You can iterate over each column, calculate the frequency of each value in that column, and create a new multi-index dataframe:
new_df = list()
for column in df:
    # Count how often each value appears in this column, then stack the
    # result into a Series indexed by (value, column name)
    column_count = df[column].value_counts().to_frame().stack()
    # Add a "total" entry holding the sum of all counts for this column
    column_count.loc[("total", column)] = column_count.sum()
    new_df.append(column_count)
Now, let's create a single dataframe with all those counts (one per column) and pivot the table to format the output:
new_df = pd.concat(new_df).reset_index()
# Pivot so the questions become the top level of the columns and the
# answer values (plus "total") become the second level
new_df = new_df.pivot_table(index=["level_1", "level_0"], values=0).T
This is the output of the code with the sample input:
# Sample input
    Q1      Q2
0  yes      tv
1  yes  fridge
2   no  laptop
3   no      tv

# Sample output
level_1  Q1             Q2
level_0  no  total yes  fridge  laptop  total  tv
0         2      4   2       1       1      4   2
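For reference, a more compact sketch of the same idea, using pd.concat with a dict so the (question, answer) column MultiIndex is built directly (an alternative formulation, not the exact code above):

import pandas as pd

df = pd.DataFrame({"Q1": ["yes", "yes", "no", "no"],
                   "Q2": ["tv", "fridge", "laptop", "tv"]})

parts = {}
for col in df:
    counts = df[col].value_counts()
    counts.loc["total"] = counts.sum()  # grand total per question
    parts[col] = counts

# concat with a dict of Series creates a (question, answer) MultiIndex;
# transposing the single-column frame gives one row of counts
out = pd.concat(parts).to_frame().T
print(out)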


Creating an ID for every row based on the observations in variables

I want to create a system in Python where the observations in a variable map to a number. All the numbers from the (in this case) 5 different variables together form a unique code. The first number corresponds to the first variable. When an observation in a different row is the same as the first, the same number applies. As illustrated in the example, if Apple appears in rows 1 and 3, both IDs get a '1' as the first number.
The output should be a new column with the ID. If all the observations in a row are the same, the IDs will be the same. Below, the 5 variables lead to the unique ID on the right, which should be the output.
You can use pd.factorize:
# Factorize each column to 1-based codes, then join the per-column
# codes into a single string ID per row
df['UniqueID'] = (df.apply(lambda x: (1 + pd.factorize(x)[0]).astype(str))
                    .agg(''.join, axis=1))
print(df)
# Output
        Fruit     Toy Letter      Car Country UniqueID
0       Apple    Bear      A  Ferrari  Brazil    11111
1  Strawberry  Blocks      B  Peugeot   Chile    22222
2       Apple  Blocks      C  Renault   China    12333
3      Orange    Bear      D     Saab   China    31443
4      Orange    Bear      D  Ferrari   India    31414
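If pd.factorize is unfamiliar, here is a minimal illustration of what it returns for a single column (a toy Series, just for demonstration):

import pandas as pd

# codes are 0-based integers in order of first appearance;
# uniques holds the distinct labels in that same order
codes, uniques = pd.factorize(pd.Series(["Apple", "Strawberry", "Apple", "Orange"]))
print(codes)    # [0 1 0 2]
print(uniques)  # Index(['Apple', 'Strawberry', 'Orange'], dtype='object')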

How to groupby and count a binomial variable in Python, and make a plot of it?

I have a dataframe like this:

country  question1  question2
france   yes        no
italy    yes        yes
france   yes        no
germany  no         yes
italy    no         yes
I would like to get an output like a pivot table or grouping with a count of yes/no for each question and each country (similar to COUNTIFS in Excel).
I tried several methods, such as df.groupby('country').value_counts() or df.groupby('country').sum("Yes"), but I cannot get the wanted result.
I would also like to make a chart of the result, only for the yes answers.
Can someone give me some advice?
Thanks
How to groupby and count binomial variables?
We can encode the values in the columns question1 and question2 using get_dummies, then sum the encoded values per unique country to get the counts of yes and no answers for each question per country:
# One-hot encode the answer columns, then sum the dummies per country
counts = pd.get_dummies(df.set_index('country')).sum(level=0)
         question1_no  question1_yes  question2_no  question2_yes
country
france              0              2             2              0
italy               1              1             0              2
germany             1              0             0              1
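Note that on recent pandas versions (2.0 and later) the level argument of sum has been removed, so the same counts would need a groupby on the index level instead:

counts = pd.get_dummies(df.set_index('country')).groupby(level=0).sum()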
How to make a plot of this?
Filter the question columns whose names contain the _yes suffix, then call the dataframe's plot method with kind='bar' to create a bar chart showing the counts of yes answers per country:
counts.filter(like='_yes').plot(kind='bar')
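An alternative sketch that produces the same counts with melt and a grouped value_counts instead of get_dummies (assuming the same df as above):

import pandas as pd

df = pd.DataFrame({"country": ["france", "italy", "france", "germany", "italy"],
                   "question1": ["yes", "yes", "yes", "no", "no"],
                   "question2": ["no", "yes", "no", "yes", "yes"]})

# Reshape to long format (one answer per row), count answers per
# (country, question) pair, and spread the yes/no counts into columns
counts = (df.melt(id_vars="country", var_name="question", value_name="answer")
            .groupby(["country", "question"])["answer"]
            .value_counts()
            .unstack(fill_value=0))
print(counts)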

Is there a way to count and calculate mean for text columns using groupby?

I have been using pandas.groupby to pivot data and create descriptive charts and tables for my data. While doing a groupby over three variables, I keep running into a DataError: No numeric types to aggregate error when working with the cancelled column.
To describe my data, Year and Month contain yearly and monthly data for multiple columns (multiple years, all months), Type contains the type of order item (Clothes, Appliances, etc.), and cancelled contains yes or no string values indicating whether an order was cancelled or not.
I am hoping to plot a graph and show a table of the cancellation rate (and success rate) by order item. The following is what I'm using so far:
df.groupby(['Year', 'Month', 'Type'])['cancelled'].mean()
But this doesn't seem to be working.
Sample:

Year  Month  Type         cancelled
2012      1  electronics  yes
2012     10  fiber        yes
2012      9  clothes      no
2013      4  vegetables   yes
2013      5  appliances   no
2016      3  fiber        no
2017      1  clothes      yes
Use:
df = pd.DataFrame({
    'Year': [2020] * 6,
    'Month': [7, 8, 7, 8, 7, 8],
    'cancelled': ['yes', 'no'] * 3,
    'Type': list('aaaaba')
})
print(df)
Get counts per Year, Month, Type columns:
df1 = df.groupby(['Year', 'Month', 'Type','cancelled']).size().unstack(fill_value=0)
print (df1)
cancelled        no  yes
Year Month Type
2020 7     a      0    2
           b      0    1
     8     a      3    0
And then divide by the column sums to get percentages:
df2 = df1.div(df1.sum()).mul(100)
print (df2)
cancelled          no        yes
Year Month Type
2020 7     a      0.0  66.666667
           b      0.0  33.333333
     8     a    100.0   0.000000
It's possible I have misunderstood what you want your output to look like, but to find the cancellation rate for each item type, you could do something like this:
# change 'cancelled' to numeric values
df.loc[df['cancelled'] == 'yes', 'cancelled'] = 1
df.loc[df['cancelled'] == 'no', 'cancelled'] = 0

# get the mean of 'cancelled' for each item type
res = {}
for t in df['Type'].unique():
    res[t] = df.loc[df['Type'] == t, 'cancelled'].mean()

# if desired, put it into a dataframe
results = pd.DataFrame([res], index=['Rate']).T
Output:
             Rate
electronics   1.0
fiber         0.5
clothes       0.5
vegetables    1.0
appliances    0.0
Note: If you want to specify specific years or months, you can do that with loc as well, but given that your example data did not have any repeats within a given year or month, this would return your original dataframe for your given example.
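For completeness, a vectorized sketch of the same per-type rate, assuming the sample df above with the yes/no strings still in place:

# Compare each answer to 'yes' to get booleans, then average per Type;
# the mean of a boolean column is exactly the cancellation rate
rates = df['cancelled'].eq('yes').groupby(df['Type']).mean().rename('Rate')
print(rates)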

Python: Clustering with grouped data

By grouped data I mean the following: assume we have a data set which is grouped by a single feature, e.g. customer data grouped by the individual customer:
Customer  Purchase Nr  Item       Paid Amount ($)
1         1            TShirt     15
1         2            Trousers   25
1         3            Scarf      10
2         1            Underwear   5
2         2            Dress      35
2         3            Trousers   30
2         4            TShirt     10
3         1            TShirt      8
3         2            Socks       5
4         1            Shorts     13
I want to find clusters in such a way that a customer's purchases all end up in one single cluster; in other words, a customer should not appear in two clusters.
I thought about grouping the data set by customer with a groupby (see the sketch below), though it is difficult to express all the information of the columns for one customer in only one column. Further, the order of purchases is important to me, e.g. whether a T-Shirt was bought first or second.
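For what it's worth, a minimal sketch of that groupby idea, flattening each customer's history into a single row (column names taken from the sample above; this only prepares per-customer features, it does not answer the clustering question itself):

import pandas as pd

df = pd.DataFrame({
    "Customer": [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
    "Purchase Nr": [1, 2, 3, 1, 2, 3, 4, 1, 2, 1],
    "Item": ["TShirt", "Trousers", "Scarf", "Underwear", "Dress",
             "Trousers", "TShirt", "TShirt", "Socks", "Shorts"],
    "Paid Amount ($)": [15, 25, 10, 5, 35, 30, 10, 8, 5, 13],
})

# One row per customer: number of purchases, total spend, and the
# items in purchase order (so order information is preserved)
per_customer = df.sort_values("Purchase Nr").groupby("Customer").agg(
    n_purchases=("Purchase Nr", "size"),
    total_paid=("Paid Amount ($)", "sum"),
    items=("Item", list),
)
print(per_customer)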
Is there any cluster algorithm which includes information about groups like this?
Thank you!

Filling in a pandas column based on existing number of strings

I have a pandas data-frame that looks like this:
ID  Hobbby    Name
1   Travel    Kevin
2   Photo     Andrew
3   Travel    Kevin
4   Cars      NaN
5   Photo     Andrew
6   Football  NaN
... (1303 rows in total)
The number of distinct names might be larger than 2 as well. I would like to end up with the entire Name column filled, the NaN rows divided equally among the n existing names (or +1 for one of them when the split is uneven). I already store the total number of names in a variable; in the above case it's 2. I tried filtering and counting by each name, but I don't know how to do this when the number of names is dynamic.
Expected Dataframe:
ID  Hobbby    Name
1   Travel    Kevin
2   Photo     Andrew
3   Travel    Kevin
4   Cars      Kevin
5   Photo     Andrew
6   Football  Andrew
I tried replacing NaN with 0 in the Name column using fillna, then filtering the column to end up with a dataframe that has only the NaN fields, and using len(df) to get the number of NaN values; from there I created 2 dataframes, each containing half of the df. But I think this approach is completely wrong, as I do not always have 2 names. There could be 2, 3, 4, etc. (this is given by a dictionary).
Any help highly appreciated.
Thanks.
It's difficult to tell, but I think you need ffill:
df['Name'] = df['Name'].ffill()
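On the sample above, ffill happens to produce exactly the expected output, because each NaN directly follows a row holding the name that should replace it:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Hobbby": ["Travel", "Photo", "Travel", "Cars", "Photo", "Football"],
    "Name": ["Kevin", "Andrew", "Kevin", np.nan, "Andrew", np.nan],
})
df["Name"] = df["Name"].ffill()  # each NaN takes the most recent earlier name
print(df)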
