Pandas dataframe pivot table and grouping - python

I have a DataFrame which I made into a pivot table, but now I want to order the pivot table so that common values based on a particular column are aligned beside each other. For example, order the DataFrame so that all common countries align to the same row:
import pandas as pd

data = {'dt': ['2016-08-22', '2016-08-21', '2016-08-22', '2016-08-21', '2016-08-21'],
        'country': ['uk', 'usa', 'fr', 'fr', 'uk'],
        'number': [10, 21, 20, 10, 12]}
df = pd.DataFrame(data)
print (df)
country dt number
0 uk 2016-08-22 10
1 usa 2016-08-21 21
2 fr 2016-08-22 20
3 fr 2016-08-21 10
4 uk 2016-08-21 12
#pivot table by dt:
df['idx'] = df.groupby('dt')['dt'].cumcount()
df_pivot = df.set_index(['idx','dt']).stack().unstack([1,2])
print (df_pivot)
dt 2016-08-22 2016-08-21
country number country number
idx
0 uk 10 usa 21
1 fr 20 fr 10
2 NaN NaN uk 12
#what I really want:
dt 2016-08-22 2016-08-21
country number country number
0 uk 10 uk 12
1 fr 20 fr 10
2 NaN NaN usa 21
or even better:
2016-08-22 2016-08-21
country number number
0 uk 10 12
1 fr 20 10
2 usa NaN 21
i.e. uk values from both 2016-08-22 and 2016-08-21 are aligned on same row

You can use:
df_pivot = df.set_index(['dt','country']).stack().unstack([0,2]).reset_index()
print (df_pivot)
dt country 2016-08-22 2016-08-21
number number
0 fr 20.0 10.0
1 uk 10.0 12.0
2 usa NaN 21.0
#move the first column label ('country') from the top MultiIndex level to the second level
cols = df_pivot.columns.tolist()
df_pivot.columns = pd.MultiIndex.from_tuples([('', 'country')] + cols[1:])
print (df_pivot)
2016-08-22 2016-08-21
country number number
0 fr 20.0 10.0
1 uk 10.0 12.0
2 usa NaN 21.0
Another, simpler solution is with pivot:
df_pivot = df.pivot(index='country', columns='dt', values='number')
print (df_pivot)
dt 2016-08-21 2016-08-22
country
fr 10.0 20.0
uk 12.0 10.0
usa 21.0 NaN
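As a hedged aside (not part of the original answer): if a country/date pair could ever appear more than once, pivot raises a ValueError about duplicate entries; pivot_table with an explicit aggregation function is a duplicate-tolerant variant (aggfunc='sum' is an assumption here, pick whatever aggregation fits your data):
# duplicate-tolerant sketch of the pivot above
df_pivot = df.pivot_table(index='country', columns='dt', values='number', aggfunc='sum')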

Related

Pandas: Extract the row before and the row after based on a given value

I just started getting into pandas. I have searched through many sources and could not find a solution to my problem. Hope to learn from the specialists here.
This is the original dataframe:
Country  Sales  Item_A  Item_B
UK          28      20      30
Asia        75      15      20
USA        100      30      40
Assume that the Sales column is always sorted in ascending order.
Let's say, given Sales = 50 and Country = 'UK', how do I:
1. Identify the two rows that have the closest Sales values w.r.t. 50?
2. Insert a new row between those two rows with the given Country and Sales?
3. Interpolate the values for Item_A and Item_B?
This is the expected result:
Country  Sales  Item_A  Item_B
UK          28      20      30
UK          50    17.7    25.3
Asia        75      15      20
First, I would recommend just adding the new row at the bottom and sorting the column so that the row moves to your preferred position.
new = {'Country': ['UK'], 'Sales': [50]}
df = pd.concat([df, pd.DataFrame(new)]).sort_values(by=["Sales"]).reset_index(drop=True)
Country Sales Item_A Item_B
0 UK 28 20.0 30.0
1 UK 50 NaN NaN
2 Asia 75 15.0 20.0
3 USA 100 30.0 40.0
The second line adds the new row (concat), sorts on the relevant column (sort_values), and the row moves to the preferred index (reset_index).
But if you have your reasons for inserting directly at a given index: I am not aware of a pandas insert for rows, only for columns. So my recommendation would be to split the original dataframe into the rows before and after the insertion point. To do so, you need to find the index where your new row should go.
def check_index(value):
    ruler = sorted(df["Sales"])
    ruled = [i for i in range(len(ruler)) if ruler[i] < value]
    return max(ruled) + 1
This function sorts the relevant column of the original dataframe, compares it against the given value and returns the index where your new row should go.
df = pd.concat([df[:check_index(new["Sales"][0])], pd.DataFrame(new), df[check_index(new["Sales"][0]):]]).reset_index(drop=True)
Country Sales Item_A Item_B
0 UK 28 20.0 30.0
1 UK 50 NaN NaN
2 Asia 75 15.0 20.0
3 USA 100 30.0 40.0
This rips your dataframe apart and concatenates the rows before, the new row, and then the rows after. For the second part of your request, you can apply the same logic directly by naming the columns, but here I make sure to select the numeric columns first, since we are going to do arithmetic on them. We use shift to pick the previous and the subsequent value, then take their midpoint.
for col in df.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns.tolist():
    df[col] = df[col].fillna((df[col].shift() + df[col].shift(-1)) / 2)
Country Sales Item_A Item_B
0 UK 28 20.0 30.0
1 UK 50 17.5 25.0
2 Asia 75 15.0 20.0
3 USA 100 30.0 40.0
But please note that if the new row lands on the first row of the dataframe, its values will still be NaN, since there is no previous row to average with. For that case I added a second fillna line; you can replace it with the value/calculation of your choice.
Country Sales Item_A Item_B
0 UK 10 NaN NaN
1 UK 28 20.0 30.0
2 UK 50 NaN NaN
3 Asia 75 15.0 20.0
4 USA 100 30.0 40.0
for col in df.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns.tolist():
    df[col] = df[col].fillna((df[col].shift() + df[col].shift(-1)) / 2)
    df[col] = df[col].fillna(df[col].shift(-1) / 2)  # fallback for a new first row: half of the next value
Country Sales Item_A Item_B
0 UK 10 10.0 15.0
1 UK 28 20.0 30.0
2 UK 50 17.5 25.0
3 Asia 75 15.0 20.0
4 USA 100 30.0 40.0
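A hedged addition (not from the answer above): if you want values closer to the question's expected output (17.7 / 25.3, i.e. linear interpolation weighted by the distance in Sales rather than a plain midpoint), you can interpolate with method='index' after making Sales the index. The column names are the ones from the example:
# sketch: interpolate the item columns linearly along the Sales axis;
# assumes df already contains the new row with NaN in Item_A / Item_B
num_cols = ['Item_A', 'Item_B']
df[num_cols] = (df.set_index('Sales')[num_cols]
                  .interpolate(method='index')   # weight by the Sales values
                  .to_numpy())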

pandas writing to a subset based on condition till end of the group category

I have a pandas dataframe like below and the expected output column is 'check':
Country Temp check
0 Canada 25 0.0
1 Canada 26 0.0
2 Canada 27 1.0
3 Canada 25 1.0
4 Canada 24 1.0
5 USA 25 0.0
6 USA 26 0.0
7 USA 27 1.0
8 USA 23 1.0
9 USA 22 1.0
The check column turns to 1 when the temperature exceeds 26 degrees and stays 1 until the country changes. I did this with a loop:
check = 0
country_old = ''
for r in range(len(df)):
    country_new = df.iloc[r]['Country']
    if country_new != country_old:
        check = 0
        country_old = country_new
    if df.iloc[r]['Temp'] > 26:
        check = 1
    df.loc[r, 'check'] = check
But it is too slow for my dataframe (200k+ rows). Is there a faster way to do this?
Try groupby.transform with gt and cumsum:
>>> df['check'] = df.groupby('Country')['Temp'].transform(lambda x: x.gt(26).cumsum())
>>> df
Country Temp check
0 Canada 25 0
1 Canada 26 0
2 Canada 27 1
3 Canada 25 1
4 Canada 24 1
5 USA 25 0
6 USA 26 0
7 USA 27 1
8 USA 23 1
9 USA 22 1
>>>
If the temperature exceeds 26 more than once within a country, the cumsum() from #U12-Forward will return 0,0,1,2,3,... instead of a 0/1 flag. I suggest using cummax() plus astype('int') so the flag switches to 1 at the first exceedance and stays there:
df['check'] = df.groupby('Country')['Temp'].transform(lambda x: x.gt(26).cummax().astype('int'))
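As a hedged aside (not from either answer): groupby.transform with a Python lambda can still be slow on 200k+ rows; the same 0/1 flag can be computed with a grouped cummax on a plain integer column, which stays fully vectorized:
# sketch: once a country's temperature exceeds 26, keep the flag at 1 for that country
over = df['Temp'].gt(26).astype(int)                 # 1 where Temp > 26, else 0
df['check'] = over.groupby(df['Country']).cummax()   # stays 1 after the first exceedance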

Add the values of several columns when the number of columns exceeds 3 - Pandas

I have a pandas dataframe with several columns of dates, numbers and bill amounts. I would like to add the amounts of the 4th and later invoices onto the 3rd one and replace the 3rd invoice number with "1111".
Here is an example:
ID customer  Bill1  Date 1      ID Bill 1  Bill2  Date 2      ID Bill 2  Bill3  Date3       ID Bill 3  Bill4  Date 4      ID Bill 4  Bill5  Date 5      ID Bill 5
4            6      2000-10-04  1          45     2000-11-05  2          51     1999-12-05  3          23     2001-11-23  6          76     2011-08-19  12
6            8      2016-05-03  7          39     2017-08-09  8          38     2018-07-14  17         21     2009-05-04  9          Nan    Nan         Nan
12           14     2016-11-16  10         73     2017-05-04  15         Nan    Nan         Nan        Nan    Nan         Nan        Nan    Nan         Nan
And I would like to get this:
ID customer  Bill1  Date 1      ID Bill 1  Bill2  Date 2      ID Bill 2  Bill3  Date3       ID Bill 3
4            6      2000-10-04  1          45     2000-11-05  2          150    1999-12-05  1111
6            8      2016-05-03  7          39     2017-08-09  8          59     2018-07-14  1111
12           14     2016-11-16  10         73     2017-05-04  15         Nan    Nan         Nan
This example is a sample of my data, I may have many more than 5 columns.
Thanks for your help
With a little data manipulation, you should be able to do it as:
import numpy as np

df = df.replace('Nan', np.nan)
idx_col_bill3 = 7      # position of the 'Bill3' column
step = 3               # Bill / Date / ID repeat every 3 columns
idx_col_bill3_id = 10  # number of columns to keep (up to 'ID Bill 3')
cols = df.columns
bills = df[cols[range(idx_col_bill3, len(cols), step)]].sum(axis=1)
bills.replace(0, np.nan, inplace=True)
df = df[cols[range(idx_col_bill3_id)]]
df['Bill3'] = bills
df.loc[df['ID Bill 3'].notna(), 'ID Bill 3'] = 1111
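A hedged generalisation (not from the answer above, and collapse_bills is a made-up helper name): the same idea for any number of bill columns, assuming the layout repeats as (BillN, Date N, ID Bill N) after the leading 'ID customer' column:
import numpy as np
import pandas as pd

def collapse_bills(df, keep=3, group_size=3, first_bill_col=1):
    # collapse every bill from the `keep`-th one onwards into that bill column
    df = df.replace('Nan', np.nan)
    bill_cols = df.columns[first_bill_col::group_size]            # Bill1, Bill2, Bill3, ...
    extra = bill_cols[keep - 1:]                                  # Bill3, Bill4, Bill5, ...
    total = (df[extra].apply(pd.to_numeric, errors='coerce')
                      .sum(axis=1, min_count=1))                  # NaN when every bill is NaN
    out = df.iloc[:, :first_bill_col + keep * group_size].copy()  # keep columns up to 'ID Bill 3'
    out[bill_cols[keep - 1]] = total                              # overwrite Bill3 with the total
    out.loc[total.notna(), out.columns[-1]] = 1111                # flag the combined bill id
    return out

result = collapse_bills(df)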

Pandas: How do I normalize COVID-19 dataframe with different countries having different day of outbreak

In order to meaningfully compare across territories, I would like to normalize the COVID-19 confirmed cases by the starting date of the outbreak in different countries. For any territory, the day that territory reaches or exceeds 10 confirmed cases is considered 'day 0' of the outbreak.
Example dataframe:
[in]
import pandas as pd
confirmed_cases = {'Date':['1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20'], 'Australia':[0, 0, 0, 30, 50], 'Albania':[0, 20, 25, 30, 50], 'Algeria':[25, 40, 50, 50, 70]}
df = pd.DataFrame(confirmed_cases)
df
[out]
Date Australia Albania Algeria
0 1/22/20 0 0 25
1 1/23/20 0 20 40
2 1/24/20 0 25 50
3 1/25/20 30 30 50
4 1/26/20 50 50 70
Desired Results:
Day Since Outbreak Australia Albania Algeria
0 0 30 20 25
1 1 50 25 40
2 2 NaN 30 50
3 3 NaN 50 50
4 4 NaN NaN 70
Are there any ways to perform this task with a few simple lines of Python/pandas code?
Find the index of the first value exceeding the threshold (10) for each country and shift each column up by that many rows:
df2 = df[['Australia', 'Albania', 'Algeria']].apply(lambda x: x.shift(-(x > 10).idxmax()))
# df2
Australia Albania Algeria
0 30.0 20.0 25
1 50.0 25.0 40
2 NaN 30.0 50
3 NaN 50.0 50
4 NaN NaN 70
Reset the index to get the day-since column:
df2.reset_index().rename(columns={'index': 'Day Since Outbreak'})
Day Since Outbreak Australia Albania Algeria
0 0 30.0 20.0 25
1 1 50.0 25.0 40
2 2 NaN 30.0 50
3 3 NaN 50.0 50
4 4 NaN NaN 70
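A small hedged tweak (not in the answer above): the question counts the day a territory 'reaches or exceeds 10' cases, so ge(10) may match the intent better than the strict > 10 used here; for this sample data the result is identical:
# sketch: treat the first day with >= 10 cases as day 0
df2 = df[['Australia', 'Albania', 'Algeria']].apply(lambda x: x.shift(-x.ge(10).idxmax()))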
Determine how many times you need to shift each column based on the first run of values < 10, then shift them. The cummin ensures that a later intermittent value < 10 doesn't get counted in the shift.
df = df.drop(columns='Date')  # won't be needed
s = df.lt(10).cummin().sum()
for col, shift in s.items():
    df[col] = df[col].shift(-shift)
df['Days Since'] = range(len(df))  # duplicative with the index...
Australia Albania Algeria Days Since
0 30.0 20.0 25 0
1 50.0 25.0 40 1
2 NaN 30.0 50 2
3 NaN 50.0 50 3
4 NaN NaN 70 4
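A hedged variant of the loop above (not from the answer): the per-column shift can also be written with apply, assuming df holds only the country columns (i.e. after dropping 'Date' and before adding 'Days Since'):
# sketch: shift each country column by the length of its initial run of values < 10
shifts = df.lt(10).cummin().sum()
df2 = df.apply(lambda col: col.shift(-shifts[col.name]))
df2 = df2.rename_axis('Day Since Outbreak').reset_index()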

iterate the rows and join in python pandas

I have a master dataset like this:
master = pd.DataFrame({'Channel':['1','1','1','1','1'],'Country':['India','Singapore','Japan','United Kingdom','Austria'],'Product':['X','6','7','X','X']})
and a user table like this:
user = pd.DataFrame({'User':['101','101','102','102','102','103','103','103','103','103'],'Country':['India','Brazil','India','Brazil','Japan','All','Austria','Japan','Singapore','United Kingdom'],'count':['2','1','3','2','1','1','1','1','1','1']})
I want to left join the master table with the user table for each user, like below for one user:
merge_101 = pd.merge(master,user[(user.User=='101')],how='left',on=['Country'])
merge_102 = pd.merge(master,user[(user.User=='102')],how='left',on=['Country'])
merge_103 = pd.merge(master,user[(user.User=='103')],how='left',on=['Country'])
merge_all = pd.concat([merge_101, merge_102,merge_103], ignore_index=True)
How do I iterate over each user? Here I am first filtering the dataset, creating another dataset, and appending the whole thing at the end.
Is there a better way to do this task, like a for loop or some kind of join?
Thanks
IIUC, you need:
pd.concat([pd.merge(master,user[(user.User==x)],how='left',on=['Country']) for x in list(user['User'].unique())], ignore_index=True)
Output:
Channel Country Product User count
0 1 India X 101 2
1 1 Singapore 6 NaN NaN
2 1 Japan 7 NaN NaN
3 1 United Kingdom X NaN NaN
4 1 Austria X NaN NaN
5 1 India X 102 3
6 1 Singapore 6 NaN NaN
7 1 Japan 7 102 1
8 1 United Kingdom X NaN NaN
9 1 Austria X NaN NaN
10 1 India X NaN NaN
11 1 Singapore 6 103 1
12 1 Japan 7 103 1
13 1 United Kingdom X 103 1
14 1 Austria X 103 1
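A hedged alternative (not from the answer above): the Python-level list comprehension can be replaced by building the full user x master grid with a cross join and then left-merging the per-user counts. Note that this keeps the User value on every row instead of NaN for unmatched rows, and how='cross' needs pandas >= 1.2:
# sketch: cross join master with the distinct users, then attach counts per (User, Country)
users = user[['User']].drop_duplicates()
grid = master.merge(users, how='cross')
merge_all = grid.merge(user, on=['User', 'Country'], how='left')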
