pandas: pivoting on rank - python

Given this data:
df = pd.DataFrame({'id':['aaa','aaa','abb','abb','abb','acd','acd','acd'],
                   'loc':['US','UK','FR','US','IN','US','CN','CN']})
id loc
0 aaa US
1 aaa UK
2 abb FR
3 abb US
4 abb IN
5 acd US
6 acd CN
7 acd CN
How do I pivot it to this:
id loc1 loc2 loc3
aaa US UK None
abb FR US IN
acd US CN CN
I am looking for the most idiomatic method.

You can create a new column, cols, that numbers the rows within each id using groupby and cumcount, convert the counter to a string with astype, and then use pivot:
df['cols'] = 'loc' + (df.groupby('id')['id'].cumcount() + 1).astype(str)
print(df)
id loc cols
0 aaa US loc1
1 aaa UK loc2
2 abb FR loc1
3 abb US loc2
4 abb IN loc3
5 acd US loc1
6 acd CN loc2
7 acd CN loc3
print(df.pivot(index='id', columns='cols', values='loc'))
cols loc1 loc2 loc3
id
aaa US UK None
abb FR US IN
acd US CN CN
If you want to remove the index and column names, use rename_axis:
print(df.pivot(index='id', columns='cols', values='loc')
        .rename_axis(None)
        .rename_axis(None, axis=1))
loc1 loc2 loc3
aaa US UK None
abb FR US IN
acd US CN CN
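All together in one chain (thank you, Colin):
# assign builds the column labels on the fly, so no helper column is left on df
print(df.assign(cols='loc' + (df.groupby('id').cumcount() + 1).astype(str))
        .pivot(index='id', columns='cols', values='loc')
        .rename_axis(None)
        .rename_axis(None, axis=1))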
loc1 loc2 loc3
aaa US UK None
abb FR US IN
acd US CN CN
I also tried rank, but it raises an error in version 0.18.0:
print(df.groupby('id')['loc'].transform(lambda x: x.rank(method='first')))
#ValueError: first not supported for non-numeric data
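For reference, the cumcount and pivot steps above can be collected into one self-contained sketch for Python 3. The helper name pivot_by_position and its parameters are assumptions made for illustration; they are not part of pandas.

```python
import pandas as pd


def pivot_by_position(df, key, value, prefix):
    """Spread `value` into columns numbered by row position within each `key`.

    Hypothetical helper (not a pandas function), sketching the answer above.
    """
    # 1-based position of each row inside its group, in original row order
    position = df.groupby(key).cumcount() + 1
    labels = prefix + position.astype(str)
    return (df.assign(_label=labels)
              .pivot(index=key, columns='_label', values=value)
              .rename_axis(None)            # drop the index name
              .rename_axis(None, axis=1))   # drop the columns name


df = pd.DataFrame({'id': ['aaa', 'aaa', 'abb', 'abb', 'abb', 'acd', 'acd', 'acd'],
                   'loc': ['US', 'UK', 'FR', 'US', 'IN', 'US', 'CN', 'CN']})
wide = pivot_by_position(df, key='id', value='loc', prefix='loc')
print(wide)
```

Because pivot sorts the new column labels as strings, a group with ten or more rows would place loc10 before loc2; zero-padding the counter is one way around that.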

Related

pandas groupby count co-existances

I want to compute the affinities between countries based on the products they share.
I have a DataFrame like this:
cntr prod
0 fr cheese
1 ger potato
2 it cheese
3 it tomato
4 fr wine
5 it wine
6 ger cabbage
7 fr cabbage
I was trying to get a co-occurrence matrix counting the products each pair of countries has in common, which would tell me the country affinities, like this:
      fr  ger  it
fr         1    2
ger    1        0
it     2   0
My first attempt was a cross groupby, trying to add a third dimension so as to get:
fr   fr
     ger  1
     it   2
ger  fr   1
     ger
     it   0
it   fr   2
     ger  0
     it
This is what I tried, but it fails to add the second layer.
Any suggestions?
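I believe you need to merge the DataFrame with itself on prod (a self join that pairs up countries sharing a product), count the pairs with crosstab and, if necessary, set the diagonal to NaN with numpy.fill_diagonal: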
df = pd.merge(df, df, on='prod')
df = pd.crosstab(df['cntr_x'], df['cntr_y']).astype(float)
np.fill_diagonal(df.values, np.nan)
print (df)
cntr_y fr ger it
cntr_x
fr NaN 1.0 2.0
ger 1.0 NaN 0.0
it 2.0 0.0 NaN
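The following is a self-contained sketch of the same merge and crosstab idea on the question's data. It differs from the answer in one respect, which is my own assumption about what is safer: instead of writing into df.values in place, it fills the diagonal on an explicit copy of the array and rebuilds the DataFrame, so it does not rely on df.values being a writable view.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cntr': ['fr', 'ger', 'it', 'it', 'fr', 'it', 'ger', 'fr'],
    'prod': ['cheese', 'potato', 'cheese', 'tomato',
             'wine', 'wine', 'cabbage', 'cabbage'],
})

# Self join on product: one row per (country, country) pair sharing a product
pairs = df.merge(df, on='prod')

# Count how many products each pair of countries has in common
counts = pd.crosstab(pairs['cntr_x'], pairs['cntr_y'])

# Work on an explicit float copy, blank out the self-pairs on the diagonal
values = counts.to_numpy(dtype=float, copy=True)
np.fill_diagonal(values, np.nan)
affinity = pd.DataFrame(values, index=counts.index, columns=counts.columns)
print(affinity)
```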

Pandas partial transpose

I want to reshape a DataFrame by transposing some columns while keeping other columns fixed.
Original data:
ID subID values_A
-- ----- --------
A aaa 10
B baa 20
A abb 30
A acc 40
C caa 50
B bbb 60
Pivot once:
pd.pivot_table(df, index=["ID", "subID"])
Output:
ID subID values_A
-- ----- --------
A aaa 10
abb 30
acc 40
B baa 20
bbb 60
C caa 50
What I want to do (keep the ['ID'] column fixed and partially transpose the rest):
ID subID_1 value_1 subID_2 value_2 subID_3 value_3
-- ------- ------- -------- ------- ------- -------
A aaa 10 abb 30 acc 40
B baa 20 bbb 60 NaN NaN
C caa 50 NaN NaN NaN NaN
I know the maximum number of subIDs under each ID.
I don't need to aggregate any values when pivoting and transposing the DataFrame.
Please help.
Use cumcount to build a counter within each ID, create a MultiIndex with set_index, reshape with unstack, and sort the columns by the counter level with sort_index. Finally, flatten the column MultiIndex with a list comprehension and call reset_index:
g = df.groupby('ID').cumcount()
df = df.set_index(['ID', g]).unstack().sort_index(level=1, axis=1)
# Python 3.6+
df.columns = [f'{a}_{b+1}' for a, b in df.columns]
# older Python versions
#df.columns = ['{}_{}'.format(a, b+1) for a, b in df.columns]
df = df.reset_index()
print (df)
ID subID_1 values_A_1 subID_2 values_A_2 subID_3 values_A_3
0 A aaa 10.0 abb 30.0 acc 40.0
1 B baa 20.0 bbb 60.0 NaN NaN
2 C caa 50.0 NaN NaN NaN NaN
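Since the question does not show how the DataFrame is built, here is a self-contained sketch that constructs the sample data and applies the steps above. The variable names counter and wide are my own choices.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'B', 'A', 'A', 'C', 'B'],
    'subID': ['aaa', 'baa', 'abb', 'acc', 'caa', 'bbb'],
    'values_A': [10, 20, 30, 40, 50, 60],
})

# 0-based position of each row within its ID group
counter = df.groupby('ID').cumcount()

# Index by (ID, counter), move the counter into the columns,
# then order the columns by counter so each subID sits next to its value
wide = (df.set_index(['ID', counter])
          .unstack()
          .sort_index(level=1, axis=1))

# Flatten ('subID', 0) -> 'subID_1', ('values_A', 0) -> 'values_A_1', ...
wide.columns = [f'{name}_{pos + 1}' for name, pos in wide.columns]
wide = wide.reset_index()
print(wide)
```

The value columns come out as floats because the missing cells are filled with NaN.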

Resample rows for missing dates and forward fill values in all columns except one

I currently have the following sample dataframe:
No FlNo DATE Loc Type
20 1826 6/1/2017 AAA O
20 1112 6/4/2017 BBB O
20 1234 6/6/2017 CCC O
20 43 6/7/2017 DDD O
20 1840 6/8/2017 EEE O
I want to fill in the dates that are missing between consecutive rows. The non-date columns of each filled-in row should take the values from the row above, BUT the 'Type' column should be left blank for the filled-in rows.
Please see desired output:
No FlNo DATE Loc Type
20 1826 6/1/2017 AAA O
20 1826 6/2/2017 AAA
20 1826 6/3/2017 AAA
20 1112 6/4/2017 BBB O
20 1112 6/5/2017 BBB
20 1234 6/6/2017 CCC O
20 43 6/7/2017 DDD O
20 1840 6/8/2017 EEE O
I have searched Google and Stack Overflow but did not find any answers about filling in dates for a pandas DataFrame.
First, convert DATE to a datetime column using pd.to_datetime:
df.DATE = pd.to_datetime(df.DATE)
Option 1
Use resample + ffill, and then reset the Type column later. First, store the unique dates in some list:
dates = df.DATE.unique()
Now,
df = df.set_index('DATE').resample('1D').ffill().reset_index()
df.Type = df.Type.where(df.DATE.isin(dates), '')
df
DATE No FlNo Loc Type
0 2017-06-01 20 1826 AAA O
1 2017-06-02 20 1826 AAA
2 2017-06-03 20 1826 AAA
3 2017-06-04 20 1112 BBB O
4 2017-06-05 20 1112 BBB
5 2017-06-06 20 1234 CCC O
6 2017-06-07 20 43 DDD O
7 2017-06-08 20 1840 EEE O
If needed, you can convert DATE back to its original string format:
df.DATE = df.DATE.dt.strftime('%m/%d/%Y')
Option 2
Another option would be asfreq + ffill + fillna:
df = df.set_index('DATE').asfreq('1D').reset_index()
c = df.columns.difference(['Type'])
df[c] = df[c].ffill()
df['Type'] = df['Type'].fillna('')
df
DATE No FlNo Loc Type
0 2017-06-01 20.0 1826.0 AAA O
1 2017-06-02 20.0 1826.0 AAA
2 2017-06-03 20.0 1826.0 AAA
3 2017-06-04 20.0 1112.0 BBB O
4 2017-06-05 20.0 1112.0 BBB
5 2017-06-06 20.0 1234.0 CCC O
6 2017-06-07 20.0 43.0 DDD O
7 2017-06-08 20.0 1840.0 EEE O
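In the Option 2 output, No and FlNo are shown as floats: asfreq inserts rows of NaN, which turns the integer columns into float columns before ffill runs. The sketch below rebuilds the sample data and casts those two columns back to integers afterwards. The final astype step and the variable names filled and carry are my additions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'No': [20, 20, 20, 20, 20],
    'FlNo': [1826, 1112, 1234, 43, 1840],
    'DATE': ['6/1/2017', '6/4/2017', '6/6/2017', '6/7/2017', '6/8/2017'],
    'Loc': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE'],
    'Type': ['O', 'O', 'O', 'O', 'O'],
})
df['DATE'] = pd.to_datetime(df['DATE'], format='%m/%d/%Y')

# Insert one row per missing calendar day; new rows are all NaN
filled = df.set_index('DATE').asfreq('D').reset_index()

# Forward fill every column except Type, which stays blank on new rows
carry = filled.columns.difference(['Type'])
filled[carry] = filled[carry].ffill()
filled['Type'] = filled['Type'].fillna('')

# Added step: no NaN remains in these columns, so integers can be restored
filled = filled.astype({'No': 'int64', 'FlNo': 'int64'})
print(filled)
```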

Pandas dataframe pivot table and grouping

I have a DataFrame which I made into a pivot table, but now I want to order the pivot table so that rows sharing the same value in a particular column are aligned beside each other. For example, order the DataFrame so that each country ends up on the same row for every date:
data = {'dt': ['2016-08-22', '2016-08-21', '2016-08-22', '2016-08-21', '2016-08-21'],
'country':['uk', 'usa', 'fr','fr','uk'],
'number': [10, 21, 20, 10,12]
}
df = pd.DataFrame(data)
print(df)
country dt number
0 uk 2016-08-22 10
1 usa 2016-08-21 21
2 fr 2016-08-22 20
3 fr 2016-08-21 10
4 uk 2016-08-21 12
#pivot table by dt:
df['idx'] = df.groupby('dt')['dt'].cumcount()
df_pivot = df.set_index(['idx','dt']).stack().unstack([1,2])
print(df_pivot)
dt 2016-08-22 2016-08-21
country number country number
idx
0 uk 10 usa 21
1 fr 20 fr 10
2 NaN NaN uk 12
#what I really want:
dt 2016-08-22 2016-08-21
country number country number
0 uk 10 uk 12
1 fr 20 fr 10
2 NaN NaN usa 21
or even better:
2016-08-22 2016-08-21
country number number
0 uk 10 12
1 fr 20 10
2 usa NaN 21
i.e. uk values from both 2016-08-22 and 2016-08-21 are aligned on same row
You can use:
df_pivot = df.set_index(['dt','country']).stack().unstack([0,2]).reset_index()
print (df_pivot)
dt country 2016-08-22 2016-08-21
number number
0 fr 20.0 10.0
1 uk 10.0 12.0
2 usa NaN 21.0
#change first value of Multiindex from first to second level
cols = list(df_pivot.columns)
df_pivot.columns = pd.MultiIndex.from_tuples([('','country')] + cols[1:])
print (df_pivot)
2016-08-22 2016-08-21
country number number
0 fr 20.0 10.0
1 uk 10.0 12.0
2 usa NaN 21.0
A simpler solution is to use pivot:
df_pivot = df.pivot(index='country', columns='dt', values='number')
print (df_pivot)
dt 2016-08-21 2016-08-22
country
fr 10.0 20.0
uk 12.0 10.0
usa 21.0 NaN
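pivot returns the date columns in ascending order, whereas the desired output in the question lists 2016-08-22 first. As a small sketch (the extra sort_index call is my addition to the answer), the columns can be reversed after pivoting:

```python
import pandas as pd

df = pd.DataFrame({
    'dt': ['2016-08-22', '2016-08-21', '2016-08-22', '2016-08-21', '2016-08-21'],
    'country': ['uk', 'usa', 'fr', 'fr', 'uk'],
    'number': [10, 21, 20, 10, 12],
})

# One row per country, one column per date, newest date first
wide = (df.pivot(index='country', columns='dt', values='number')
          .sort_index(axis=1, ascending=False))
print(wide)
```

Note that pivot raises an error if the same (country, dt) pair appears more than once; in that case pivot_table with an aggregation function would be needed.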

drop column based on a string condition

How can I delete a dataframe column based on a certain string in its name?
Example:
house1 house2 chair1 chair2
index
1 foo lee sam han
2 fowler smith had sid
3 cle meg mag mog
I want to drop the columns that contain 'chair' in the string.
How can this be done in an efficient way?
Thanks.
df.drop([col for col in df.columns if 'chair' in col],axis=1,inplace=True)
UPDATE2:
In [315]: df
Out[315]:
3M110% 3M80% 6M90% 6M95% 1N90% 2M110% 3M95%
1 foo lee sam han aaa aaa fff
2 fowler smith had sid aaa aaa fff
3 cle meg mag mog aaa aaa fff
In [316]: df.loc[:, ~df.columns.str.contains('90|110')]
Out[316]:
3M80% 6M95% 3M95%
1 lee han fff
2 smith sid fff
3 meg mog fff
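The string passed to str.contains above is treated as a regular expression, which is why '90|110' matches either substring. If the substrings come from elsewhere and may contain regex metacharacters, escaping them first is safer. A minimal sketch, where drop_columns_containing is a hypothetical helper name and not a pandas function:

```python
import re

import pandas as pd


def drop_columns_containing(df, substrings):
    """Return df without the columns whose name contains any of `substrings`.

    Hypothetical helper: each substring is matched literally, not as a regex.
    """
    pattern = '|'.join(re.escape(s) for s in substrings)
    return df.loc[:, ~df.columns.str.contains(pattern)]


df = pd.DataFrame(
    [['foo', 'lee', 'sam', 'han'],
     ['fowler', 'smith', 'had', 'sid']],
    columns=['3M110%', '3M80%', '6M90%', '6M95%'],
)
kept = drop_columns_containing(df, ['90', '110'])
print(kept)
```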
UPDATE:
In [40]: df
Out[40]:
house1 house2 chair1 chair2 door1 window1 floor1
1 foo lee sam han aaa aaa fff
2 fowler smith had sid aaa aaa fff
3 cle meg mag mog aaa aaa fff
In [41]: df.filter(regex='^(?!(chair|door|window).*?)')
Out[41]:
house1 house2 floor1
1 foo lee fff
2 fowler smith fff
3 cle meg fff
Original answer:
Here are a few alternatives:
In [37]: df.drop(df.filter(like='chair').columns, axis=1)
Out[37]:
house1 house2
1 foo lee
2 fowler smith
3 cle meg
In [38]: df.filter(regex='^(?!chair.*)')
Out[38]:
house1 house2
1 foo lee
2 fowler smith
3 cle meg
This should do it:
df.drop(df.columns[df.columns.str.match(r'chair')], axis=1)
Timing
MaxU method 2
One more alternative:
import pandas as pd
df = pd.DataFrame({'house1':['foo','fowler','cle'],
'house2':['lee','smith','meg'],
'chair1':['sam','had','mag'],
'chair2':['han','sid','mog']})
mask = ['chair' not in x for x in df]
df = df[df.columns[mask]]
