I have a dataframe (df) that looks like this:
+---------+-------+------------+----------+
| subject | pills | date | strength |
+---------+-------+------------+----------+
| 1 | 4 | 10/10/2012 | 250 |
| 1 | 4 | 10/11/2012 | 250 |
| 1 | 2 | 10/12/2012 | 500 |
| 2 | 1 | 1/6/2014 | 1000 |
| 2 | 1 | 1/7/2014 | 250 |
| 2 | 1 | 1/7/2014 | 500 |
| 2 | 3 | 1/8/2014 | 250 |
+---------+-------+------------+----------+
When I use reshape in R, I get what I want:
reshape(df, idvar = c("subject","date"), timevar = 'strength', direction = "wide")
+---------+------------+--------------+--------------+---------------+
| subject | date | strength.250 | strength.500 | strength.1000 |
+---------+------------+--------------+--------------+---------------+
| 1 | 10/10/2012 | 4 | NA | NA |
| 1 | 10/11/2012 | 4 | NA | NA |
| 1 | 10/12/2012 | NA | 2 | NA |
| 2 | 1/6/2014 | NA | NA | 1 |
| 2 | 1/7/2014 | 1 | 1 | NA |
| 2 | 1/8/2014 | 3 | NA | NA |
+---------+------------+--------------+--------------+---------------+
Using pandas:
df.pivot_table(index=['subject','date'], columns='strength')
+---------+------------+-------+----+-----+
| | | pills |
+---------+------------+-------+----+-----+
| | strength | 250 | 500| 1000|
+---------+------------+-------+----+-----+
| subject | date | | | |
+---------+------------+-------+----+-----+
| 1 | 10/10/2012 | 4 | NA | NA |
| | 10/11/2012 | 4 | NA | NA |
| | 10/12/2012 | NA | 2 | NA |
+---------+------------+-------+----+-----+
| 2 | 1/6/2014 | NA | NA | 1 |
| | 1/7/2014 | 1 | 1 | NA |
| | 1/8/2014 | 3 | NA | NA |
+---------+------------+-------+----+-----+
How do I get exactly the same output as in R with pandas? I only want one header row.
After pivoting, convert the dataframe to records and then back to a dataframe:
pivoted = df.pivot_table(index=['subject','date'], columns='strength')
flattened = pd.DataFrame(pivoted.to_records())
# subject date ('pills', 250) ('pills', 500) ('pills', 1000)
#0 1 10/10/2012 4.0 NaN NaN
#1 1 10/11/2012 4.0 NaN NaN
#2 1 10/12/2012 NaN 2.0 NaN
#3 2 1/6/2014 NaN NaN 1.0
#4 2 1/7/2014 1.0 1.0 NaN
#5 2 1/8/2014 3.0 NaN NaN
You can now "repair" the column names, if you want:
flattened.columns = [hdr.replace("('pills', ", "strength.").replace(")", "") \
for hdr in flattened.columns]
flattened
# subject date strength.250 strength.500 strength.1000
#0 1 10/10/2012 4.0 NaN NaN
#1 1 10/11/2012 4.0 NaN NaN
#2 1 10/12/2012 NaN 2.0 NaN
#3 2 1/6/2014 NaN NaN 1.0
#4 2 1/7/2014 1.0 1.0 NaN
#5 2 1/8/2014 3.0 NaN NaN
It's awkward, but it works.
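A less awkward alternative (my sketch, not part of the original answer) is to rename the pivoted columns directly and then call reset_index:

```python
import pandas as pd

# Rebuild the question's dataframe
df = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2, 2],
    "pills": [4, 4, 2, 1, 1, 1, 3],
    "date": ["10/10/2012", "10/11/2012", "10/12/2012",
             "1/6/2014", "1/7/2014", "1/7/2014", "1/8/2014"],
    "strength": [250, 250, 500, 1000, 250, 500, 250],
})

pivoted = df.pivot_table(index=["subject", "date"],
                         columns="strength", values="pills")
# Flatten the column index: 250 -> "strength.250", etc.
pivoted.columns = [f"strength.{c}" for c in pivoted.columns]
flattened = pivoted.reset_index()
```

This skips the round-trip through records entirely and gives the single-header frame directly.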
I'm trying to figure out a way to do:
COUNTIF(Col2,Col4,Col6,Col8,Col10,Col12,Col14,Col16,Col18,">=0.05")
SUMIF(Col2,Col4,Col6,Col8,Col10,Col12,Col14,Col16,Col18,">=0.05")
My attempt:
import pandas as pd
df=pd.read_excel(r'C:\\Users\\Downloads\\Prepped.xls') #Please use: https://github.com/BeboGhattas/temp-repo/blob/main/Prepped.xls
df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float) #changing dtype to float
#unconditional sum
df['sum']=df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float).sum(axis=1)
Whatever I put below won't work:
#sum if
df['greater-than-0.05']=df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float).sum([c for c in col if c >= 0.05])
| | # | word | B64684807 | B64684807Measure | B649845471 | B649845471Measure | B83344143 | B83344143Measure | B67400624 | B67400624Measure | B85229235 | B85229235Measure | B85630406 | B85630406Measure | B82615898 | B82615898Measure | B87558236 | B87558236Measure | B00000009 | B00000009Measure | 有效竞品数 | 关键词抓取时间 | 搜索量排名 | 月搜索量 | 在售商品数 | 竞争度 |
|---:|----:|:--------|------------:|:-------------------|-------------:|:-------------------------|------------:|:-------------------------|------------:|:-------------------|------------:|:-------------------|------------:|:-------------------|------------:|:-------------------|------------:|-------------------:|------------:|:-------------------|-------------:|:--------------------|-------------:|-----------:|-------------:|---------:|
| 0 | 1 | word 1 | 0.055639 | [主要流量词] | 0.049416 | nan | 0.072298 | [精准流量词, 主要流量词] | 0.00211 | nan | 0.004251 | nan | 0.007254 | nan | 0.074409 | [主要流量词] | 0.033597 | nan | 0.000892 | nan | 9 | 2022-10-06 00:53:56 | 5726 | 326188 | 3810 | 0.01 |
| 1 | 2 | word 2 | 0.045098 | nan | 0.005472 | nan | 0.010791 | nan | 0.072859 | [主要流量词] | 0.003423 | nan | 0.012464 | nan | 0.027396 | nan | 0.002825 | nan | 0.060989 | [主要流量词] | 9 | 2022-10-07 01:16:21 | 9280 | 213477 | 40187 | 0.19 |
| 2 | 3 | word 3 | 0.02186 | nan | 0.05039 | [主要流量词] | 0.007842 | nan | 0.028832 | nan | 0.044385 | [精准流量词] | 0.001135 | nan | 0.003866 | nan | 0.021035 | nan | 0.017202 | nan | 9 | 2022-10-07 00:28:31 | 24024 | 81991 | 2275 | 0.03 |
| 3 | 4 | word 4 | 0.000699 | nan | 0.01038 | nan | 0.001536 | nan | 0.021512 | nan | 0.007658 | nan | 5e-05 | nan | 0.048682 | nan | 0.001524 | nan | 0.000118 | nan | 9 | 2022-10-07 00:52:12 | 34975 | 53291 | 30970 | 0.58 |
| 4 | 5 | word 5 | 0.00984 | nan | 0.030248 | nan | 0.003006 | nan | 0.014027 | nan | 0.00904 | [精准流量词] | 0.000348 | nan | 0.000414 | nan | 0.006721 | nan | 0.00153 | nan | 9 | 2022-10-07 02:36:05 | 43075 | 41336 | 2230 | 0.05 |
| 5 | 6 | word 6 | 0.010029 | [精准流量词] | 0.120739 | [精准流量词, 主要流量词] | 0.014359 | nan | 0.002796 | nan | 0.002883 | nan | 0.028747 | [精准流量词] | 0.007022 | nan | 0.017803 | nan | 0.001998 | nan | 9 | 2022-10-07 00:44:51 | 49361 | 34791 | 517 | 0.01 |
| 6 | 7 | word 7 | 0.002735 | nan | 0.002005 | nan | 0.005355 | nan | 6.3e-05 | nan | 0.000772 | nan | 0.000237 | nan | 0.015149 | nan | 2.1e-05 | nan | 2.3e-05 | nan | 9 | 2022-10-07 09:48:20 | 53703 | 31188 | 511 | 0.02 |
| 7 | 8 | word 8 | 0.003286 | [精准流量词] | 0.058161 | [主要流量词] | 0.013681 | [精准流量词] | 0.000748 | [精准流量词] | 0.002684 | [精准流量词] | 0.013916 | [精准流量词] | 0.029376 | nan | 0.019792 | nan | 0.005602 | nan | 9 | 2022-10-06 01:51:53 | 58664 | 27751 | 625 | 0.02 |
| 8 | 9 | word 9 | 0.004273 | [精准流量词] | 0.025581 | [精准流量词] | 0.014784 | [精准流量词] | 0.00321 | [精准流量词] | 0.000892 | nan | 0.00223 | nan | 0.005315 | nan | 0.02211 | nan | 0.027008 | [精准流量词] | 9 | 2022-10-07 01:34:28 | 73640 | 20326 | 279 | 0.01 |
| 9 | 10 | word 10 | 0.002341 | [精准流量词] | 0.029604 | nan | 0.007817 | [精准流量词] | 0.000515 | [精准流量词] | 0.001865 | [精准流量词] | 0.010128 | [精准流量词] | 0.015378 | nan | 0.019677 | nan | 0.003673 | nan | 9 | 2022-10-07 01:17:44 | 80919 | 17779 | 207 | 0.01 |
So my question is: how can I do the SUMIF and COUNTIF on this exact table? (It should use col2, col4, ... etc., because every file will have the same format but different headers, so using df['B64684807'] isn't helpful.)
Sample file can be found at:
https://github.com/BeboGhattas/temp-repo/blob/main/Prepped.xls
IIUC, you can use a boolean mask:
df2 = df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float)
m = df2.ge(0.05)
df['countif'] = m.sum(axis=1)
df['sumif'] = df2.where(m).sum(axis=1)
output (last 3 columns only):
sum countif sumif
0 0.299866 3 0.202346
1 0.241317 2 0.133848
2 0.196547 1 0.050390
3 0.092159 0 0.000000
4 0.075174 0 0.000000
5 0.206376 1 0.120739
6 0.026360 0 0.000000
7 0.147246 1 0.058161
8 0.105403 0 0.000000
9 0.090998 0 0.000000
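The mask pattern above is self-contained enough to demonstrate on a toy frame (column names and values invented for illustration, not taken from the Excel file):

```python
import pandas as pd

# Toy stand-in for the spreadsheet's numeric columns
df = pd.DataFrame({"a": [0.06, 0.01, 0.20], "b": [0.04, 0.07, 0.10]})

m = df.ge(0.05)                  # True where value >= 0.05
countif = m.sum(axis=1)          # row-wise COUNTIF: True counts as 1
sumif = df.where(m).sum(axis=1)  # row-wise SUMIF: masked-out cells become NaN and are skipped
```

`where(m)` keeps only the cells passing the threshold, so the subsequent `sum` reproduces Excel's SUMIF semantics.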
I have a question about splitting a column into multiple rows in pandas with conditions.
For example, I tend to do something like the following, but it takes a very long time using a for loop:
| Index | Value |
| ----- | ----- |
| 0 | 1 |
| 1 | 1,3 |
| 2 | 4,6,8 |
| 3 | 1,3 |
| 4 | 2,7,9 |
into
| Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| ----- | - | - | - | - | - | - | - | - | - |
| 0 | 1 | | | | | | | | |
| 1 | 1 | | 3 | | | | | | |
| 2 | | | | 4 | | 6 | | 8 | |
| 3 | 1 | | 3 | | | | | | |
| 4 | | 2 | | | | | 7 | | 9 |
I wonder if there are any packages that can help with this, rather than writing a for loop to map all the indexes.
Assuming the "Value" column contains strings, you can use str.split and pivot like so:
value = df["Value"].str.split(",").explode().astype(int).reset_index()
output = value.pivot(index="index", columns="Value", values="Value")
output = output.reindex(range(value["Value"].min(), value["Value"].max()+1), axis=1)
>>> output
Value 1 2 3 4 5 6 7 8 9
index
0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 1.0 NaN 3.0 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN 4.0 NaN 6.0 NaN 8.0 NaN
3 1.0 NaN 3.0 NaN NaN NaN NaN NaN NaN
4 NaN 2.0 NaN NaN NaN NaN 7.0 NaN 9.0
Input df:
df = pd.DataFrame({"Value": ["1", "1,3", "4,6,8", "1,3", "2,7,9"]})
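An alternative sketch (mine, not from the answer above) uses `str.get_dummies` to build one indicator column per token and then multiplies each by its label; note it only produces columns for values that actually occur, so you would still `reindex` if you want gap columns like 5:

```python
import pandas as pd

df = pd.DataFrame({"Value": ["1", "1,3", "4,6,8", "1,3", "2,7,9"]})

dummies = df["Value"].str.get_dummies(sep=",")   # one 0/1 column per distinct token
dummies.columns = dummies.columns.astype(int)    # token strings -> ints
dummies = dummies.sort_index(axis=1)             # numeric column order
# Turn each indicator into the value itself, blanking out the zeros
out = dummies.mul(dummies.columns, axis=1).where(dummies.astype(bool))
```

This avoids the explode/pivot round-trip at the cost of the extra column relabeling.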
I have two dataframes.
df1 = pd.DataFrame({
'id':[1,1,1,1,1,1,2,2,2,2,2,2],
'pp':[3,'',2,'',1,0,4, 3, 2, 1, '', 0],
'pc':[6,5,4,3,2,1,6,5,4,3,2,1]
})
| | id | pp | pc |
|---:|-----:|:-----|-----:|
| 0 | 1 | 3 | 6 |
| 1 | 1 | | 5 |
| 2 | 1 | 2 | 4 |
| 3 | 1 | | 3 |
| 4 | 1 | 1 | 2 |
| 5 | 1 | 0 | 1 |
| 6 | 2 | 4 | 6 |
| 7 | 2 | 3 | 5 |
| 8 | 2 | 2 | 4 |
| 9 | 2 | 1 | 3 |
| 10 | 2 | | 2 |
| 11 | 2 | 0 | 1 |
df2 = pd.DataFrame({
'id':[1,1,1,2,2,2],
'pp':['', 3, 4, 1, 2, ''],
'yu':[1,2,3,4,5,6]
})
| | id | pp | yu |
|---:|-----:|:-----|-----:|
| 0 | 1 | | 1 |
| 1 | 1 | 3 | 2 |
| 2 | 1 | 4 | 3 |
| 3 | 2 | 1 | 4 |
| 4 | 2 | 2 | 5 |
| 5 | 2 | | 6 |
I'd like to merge the two so that the final result looks like this.
| | id | pp | pc | yu |
|---:|-----:|:-----|:-----|-----:|
| 0 | 1 | | | 1 |
| 1 | 1 | 0 | 1 | 2 |
| 2 | 1 | 3 | 6 | 3 |
| 3 | 2 | 1 | 3 | 4 |
| 4 | 2 | 2 | 4 | 5 |
| 5 | 2 | | | 6 |
Basically, df1 has the values that I need to look up.
df2 has the id and pp columns that are used for the lookup.
However, when I do
pd.merge(df2, df1, on=['id', 'pp'], how='left')
I get:
| | id | pp | pc | yu |
|---:|-----:|:-----|-----:|-----:|
| 0 | 1 | | 5 | 1 |
| 1 | 1 | | 3 | 1 |
| 2 | 1 | 3 | 6 | 2 |
| 3 | 1 | 4 | nan | 3 |
| 4 | 2 | 1 | 3 | 4 |
| 5 | 2 | 2 | 4 | 5 |
| 6 | 2 | | 2 | 6 |
This is not correct because it matches the empty rows as well.
If the value in df2 is empty, there should be no mapping.
I do want to keep df2's empty rows in the output, so I can't use an inner join.
We can drop the empty rows from df1 before merging (this relies on numpy imported as np):
out = pd.merge(df2, df1.replace({'': np.nan}).dropna(), on=['id', 'pp'], how='left')
Out[121]:
  id pp  yu   pc
0  1      1  NaN
1  1  3   2  6.0
2  1  4   3  NaN
3  2  1   4  3.0
4  2  2   5  4.0
5  2      6  NaN
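Put together with the question's data, the whole thing runs like this (a sketch, including the numpy import the answer relies on):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'pp': [3, '', 2, '', 1, 0, 4, 3, 2, 1, '', 0],
    'pc': [6, 5, 4, 3, 2, 1, 6, 5, 4, 3, 2, 1],
})
df2 = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'pp': ['', 3, 4, 1, 2, ''],
    'yu': [1, 2, 3, 4, 5, 6],
})

# Drop df1's blank-pp rows so '' in df2 never finds a merge partner
out = pd.merge(df2, df1.replace({'': np.nan}).dropna(),
               on=['id', 'pp'], how='left')
```

The left join preserves all six df2 rows; the blank-pp rows simply get NaN in pc.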
I have a df like this
| count | people | A | B | C |
|---------|--------|-----|-----|-----|
| yes | siya | 4 | 2 | 0 |
| no | aish | 4 | 3 | 0 |
| total | | 4 | | 0 |
| yes | dia | 6 | 4 | 0 |
| no | dia | 6 | 2 | 0 |
| total | | 6 | | 0 |
I want a output like below
| count | people | A | B | C |
|---------|--------|-----|-----|-----|
| yes | siya | 4 | 2 | 8 |
| no | aish | 4 | 3 | 0 |
| total | | 4 | | 0 |
| yes | dia | 6 | 4 | 0 |
| no | dia | 6 | 2 | 2 |
| total | | 6 | | 0 |
The goal is to calculate column C by multiplying A and B only when the count value is "yes". But if the same value appears in the people column for both a "yes" and a "no" row (as with dia), then we have to calculate for the "no" row instead.
This is what I've tried so far:
df.C= df.groupby("Host", as_index=False).apply(lambda dfx : df.A *
df.B if (df['count'] == 'no') else df.A *df.B)
But I'm not able to achieve the goal. Any idea how I can achieve this output?
import numpy as np

# Set conditions
c1 = df.groupby('people')['count'].transform('nunique').eq(1) & df['count'].eq('yes')
c2 = df.groupby('people')['count'].transform('nunique').gt(1) & df['count'].eq('no')
# Put conditions in a list
c = [c1, c2]
# Make choices corresponding to the condition list
choice = [df['A']*df['B'], len(df[df['count'].eq('no')])]
# Apply np.select
df['C'] = np.select(c, choice, 0)
print(df)
count people A B C
0 yes siya 4 2.0 8.0
1 no aish 4 3.0 0.0
2 total NaN 4 0.0 0.0
3 yes dia 6 4.0 0.0
4 no dia 6 2.0 2.0
5 total NaN 6 NaN 0.0
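For reference, here is a runnable version of that approach (my sketch; I've made B numeric, with NaN on the total rows where the question's table is blank):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "count": ["yes", "no", "total", "yes", "no", "total"],
    "people": ["siya", "aish", np.nan, "dia", "dia", np.nan],
    "A": [4, 4, 4, 6, 6, 6],
    "B": [2, 3, np.nan, 4, 2, np.nan],
})

nuniq = df.groupby("people")["count"].transform("nunique")
c1 = nuniq.eq(1) & df["count"].eq("yes")  # person has only one kind of row
c2 = nuniq.gt(1) & df["count"].eq("no")   # person has both yes and no rows
df["C"] = np.select([c1, c2],
                    [df["A"] * df["B"], len(df[df["count"].eq("no")])],
                    0)
```

Rows with NaN people fall out of the groupby, so both conditions are False there and those rows get the default 0.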
I have two dataframes. One is the master dataframe and the other is used to fill it.
What I want is to fill one column according to another column, without altering the other columns.
This is example of master df
| id | Purch. order | cost | size | code |
|----|--------------|------|------|------|
| 1 | G918282 | 8283 | large| hchs |
| 2 | EE18282 | 1283 | small| ueus |
| 3 | DD08282 | 5583 | large| kdks |
| 4 | GU88912 | 8232 | large| jdhd |
| 5 | NaN | 1283 | large| jdjd |
| 6 | Nan | 5583 | large| qqas |
| 7 | Nan | 8232 | large| djjs |
This is example of the another df
| id | Purch. order | cost |
|----|--------------|------|
| 1 | G918282 | 7728 |
| 2 | EE18282 | 2211 |
| 3 | DD08282 | 5321 |
| 4 | GU88912 | 4778 |
| 5 | NaN | 4283 |
| 6 | Nan | 9993 |
| 7 | Nan | 3442 |
This is the result I'd like
| id | Purch. order | cost | size | code |
|----|--------------|------|------|------|
| 1 | G918282 | 7728 | large| hchs |
| 2 | EE18282 | 2211 | small| ueus |
| 3 | DD08282 | 5321 | large| kdks |
| 4 | GU88912 | 4778 | large| jdhd |
| 5 | NaN | 1283 | large| jdjd |
| 6 | Nan | 5583 | large| qqas |
| 7 | Nan | 8232 | large| djjs |
Only the cost column should be modified, and only where the secondary df matches on Purch. order and the value isn't NaN.
I hope you can help me, and I'm sorry if my English is basic; it's not my mother language. Thanks a lot.
Let's try update, which works along indexes. By default overwrite is set to True, which will overwrite overlapping values in your target dataframe; use overwrite=False if you only want to fill NA values.
master_df = master_df.set_index(['id','Purch. order'])
another_df = another_df.dropna(subset=['Purch. order']).set_index(['id','Purch. order'])
master_df.update(another_df)
print(master_df)
cost size code
id Purch. order
1 G918282 7728.0 large hchs
2 EE18282 2211.0 small ueus
3 DD08282 5321.0 large kdks
4 GU88912 4778.0 large jdhd
5 NaN 1283.0 large jdjd
6 Nan 5583.0 large qqas
7 Nan 8232.0 large djjs
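A minimal standalone demonstration of update (a sketch with made-up values, independent of the question's data):

```python
import pandas as pd

master = pd.DataFrame({"cost": [8283, 1283], "size": ["large", "small"]},
                      index=[1, 2])
patch = pd.DataFrame({"cost": [7728, 2211]}, index=[1, 2])

master.update(patch)  # in place: aligns on index and columns, overwrites overlaps
```

Only the overlapping cost column changes; size is untouched because patch has no such column.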
You can do it with merge, followed by updating the cost column based on where the NaN values are:
final_df = df1.merge(df2[~df2["Purch. order"].isna()], on = 'Purch. order', how="left")
final_df.loc[~final_df['Purch. order'].isnull(), "cost"] = final_df['cost_y'] # not nan
final_df.loc[final_df['Purch. order'].isnull(), "cost"] = final_df['cost_x'] # nan
final_df = final_df.drop(['id_y','cost_x','cost_y'],axis=1)
Output:
   id_x Purch. order   size  code    cost
0 1 G918282 large hchs 7728.0
1 2 EE18282 small ueus 2211.0
2 3 DD08282 large kdks 5321.0
3 4 GU88912 large jdhd 4778.0
4 5 NaN large jdjd 1283.0
5 6 NaN large qqas 5583.0
6 7 NaN large djjs 8232.0