Pandas: Column medians based on column names - python

I have the following pandas DataFrame.
df = pd.DataFrame(np.random.randn(3,6), columns=['A1','A2','A3','B1','B2','B3'])
df
A1 A2 A3 B1 B2 B3
0 -0.409420 2.382457 1.151565 0.625461 0.224453 -0.351573
1 -0.676554 -1.485376 0.597227 0.240113 0.033963 1.224241
2 0.678698 1.392778 1.031625 0.388137 -0.566746 -0.798156
How do I get the per-row median of each group of columns, like this?
medA medB
0 ... ...
1 ... ...
2 ... ...
My actual data frame has 300 columns, so I would like to differentiate by similarity in column name.

This looks like pd.wide_to_long:
(pd.wide_to_long(df.reset_index(),['A','B'],'index','idx')
.groupby('index').median().add_prefix('med_').rename_axis(None))
Or groupby on the first character of the column names with axis=1:
df.groupby(df.columns.str[0],axis=1).median().add_prefix('med_')
med_A med_B
0 -0.075465 -0.317335
1 -0.355822 -0.517270
2 0.279270 -1.134389
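Note that groupby with axis=1 is deprecated in recent pandas releases (2.1 and later). A small equivalent sketch that avoids it, assuming the same df, transposes, groups the row labels by their first character, and transposes back:
med = df.T.groupby(df.columns.str[0]).median().T.add_prefix('med_')  # same result without axis=1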

Here's a for loop answer:
dict = {}
dff = pd.DataFrame()
for letter in ['A', 'B']:
    dict[letter] = []
    for col in df.columns:
        if col.startswith(letter):
            dict[letter].append(col)
    dff[f'med_{letter}'] = df[dict[letter]].median(axis=1)
I'm not sure what you mean by "to differentiate by similarity in column name"; here it just compares the beginning of each column name with the entries in the given list (['A', 'B']).
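If the prefix letters are not known in advance, a small sketch (assuming a one-character prefix, as in the sample columns) can derive the list from the column names first:
# derive the prefix letters instead of hard-coding ['A', 'B']
prefixes = sorted({col[0] for col in df.columns})
dff = pd.DataFrame({f'med_{p}': df[[c for c in df.columns if c.startswith(p)]].median(axis=1)
                    for p in prefixes})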

Explode data frame columns into multiple rows

I have a large dataframe a that I would like to split or explode to become dataframe b (the real dataframe a contains 90 columns).
I tried to look up solutions to a similar problem but did not find any, since this is not about the values in the cells but about the column names.
Any pointer to the solution or to using an existing function in the pandas library would be appreciated.
Thank you in advance.
from pandas import DataFrame
import numpy as np
# current df
a = DataFrame([{'ID': 'ID_1', 'A-1': 'a1', 'B-1':'b1','C-1':'c1', 'A-2': 'a2', 'B-2':'b2','C-2':'c2'}])
# desired df
b = DataFrame([{'ID': 'ID_1', 'A': 'a1', 'B': 'b1', 'C': 'c1'},
               {'ID': 'ID_1', 'A': 'a2', 'B': 'b2', 'C': 'c2'}])
One idea I have is to split this dataframe into two dataframes (DataFrame 1 containing columns A-1 to C-1 and DataFrame 2 containing columns A-2 to C-2), rename the columns to A/B/C, and then concatenate both. But I am not sure about efficiency, since I have 90 columns that will grow over time.
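For reference, the split/rename/concatenate idea described in the question could look roughly like this (a sketch that assumes only the two suffix groups '-1' and '-2' from the sample frame a):
import pandas as pd

part1 = a[['ID', 'A-1', 'B-1', 'C-1']].rename(columns=lambda c: c.split('-')[0])
part2 = a[['ID', 'A-2', 'B-2', 'C-2']].rename(columns=lambda c: c.split('-')[0])
b = pd.concat([part1, part2], ignore_index=True)  # stack the two halves under common A/B/C headers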
This approach will generate some intermediate columns which will be removed later on.
First bring down those labels (A-1,...) from the header into a column
df = pd.melt(a, id_vars=['ID'], var_name='label')
Then split the label into character and number
df[['char', 'num']] = df['label'].str.split('-', expand=True)
Finally drop the label, set_index before unstack, and take care of the final table formats.
df.drop('label', axis=1)\
  .set_index(['ID', 'num', 'char'])\
  .unstack()\
  .droplevel(0, axis=1)\
  .reset_index()\
  .drop('num', axis=1)
pd.wide_to_long works well here assuming a small number of known stubnames:
b = (
    pd.wide_to_long(a, stubnames=['A', 'B', 'C'], sep='-', i='ID', j='to_drop')
    .droplevel(level='to_drop')
    .reset_index()
)
ID A B C
0 ID_1 a1 b1 c1
1 ID_1 a2 b2 c2
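If the stubnames are not known up front, a hedged variant derives them from the headers (assuming the 'letter-number' naming of the sample frame a):
stubs = sorted({c.split('-')[0] for c in a.columns if '-' in c})  # ['A', 'B', 'C'] for the sample
b = (
    pd.wide_to_long(a, stubnames=stubs, sep='-', i='ID', j='to_drop')
    .droplevel(level='to_drop')
    .reset_index()
)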
Alternatively set_index, split the columns on '-' with str.split and stack:
b = a.set_index('ID')
b.columns = b.columns.str.split('-', expand=True)
b = b.stack().droplevel(-1).reset_index()
ID A B C
0 ID_1 a1 b1 c1
1 ID_1 a2 b2 c2
One option is with the pivot_longer function from pyjanitor, which abstracts the reshaping process and is also efficient:
# pip install pyjanitor
import janitor
import pandas as pd
a.pivot_longer(index="ID", names_to=".value", names_pattern="(.).+")
ID A B C
0 ID_1 a1 b1 c1
1 ID_1 a2 b2 c2
The .value tells the function which part of the column names to retain. It takes its cue from names_pattern, which should be a regular expression with groups; the grouped parts are what remain as headers. In this case we are interested in the first letter of each column, which is represented by (.).
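If the prefixes can be longer than a single character, a hedged variant (assuming '-' is the separator, as in the sample) captures everything before the separator instead:
a.pivot_longer(index="ID", names_to=".value", names_pattern="(.+)-.+")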
Another option, with pivot_longer, is to use the names_sep parameter:
(a.pivot_longer(index="ID", names_to=(".value", "num"), names_sep="-")
.drop(columns="num")
)
ID A B C
0 ID_1 a1 b1 c1
1 ID_1 a2 b2 c2
Again, only values in the columns associated with .value remain as headers.
import pandas as pd
import math

# toy frame with 8 columns; the assert ensures the column count is even
df = pd.DataFrame(data={k: [i * k for i in range(1, 5)] for k in range(1, 9)})
assert df.shape[1] % 2 == 0
# split into left and right halves, give both the same column labels, then stack them vertically
df_1 = df.iloc[:, 0:math.floor(df.shape[1] / 2)]
df_2 = df.iloc[:, math.floor(df.shape[1] / 2):]
df_2.columns = df_1.columns
df_sum = pd.concat((df_1, df_2), axis=0)
display(df_sum)  # display() is available in Jupyter/IPython; use print(df_sum) elsewhere
Like this?

How to compare a value of a single column over multiple columns in the same row using pandas?

I have a dataframe that looks like this:
np.random.seed(21)
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B1', 'B2', 'B3'])
df['current_State'] = [df['B1'][0], df['B1'][1], df['B2'][2], df['B2'][3], df['B3'][4], df['B3'][5], df['B1'][6], df['B2'][7]]
df
I need to create a new column that contains the name of the column where the value of 'current_State' is the same; the desired output is the new_column shown in the answer below.
I tried many combinations of apply and lambda functions but without success. Any help is very welcome!
You can compare the current_State column with all the remaining columns to create a boolean mask, then use idxmax along axis=1 on this mask to get the name of the column where the value in the given row equals the corresponding value in current_State:
c = 'current_State'
df['new_column'] = df.drop(columns=c).eq(df[c], axis=0).idxmax(axis=1)
In case there is a possibility of no matching values, we can instead use:
c = 'current_State'
m = df.drop(columns=c).eq(df[c], axis=0)
df['new_column'] = m.idxmax(axis=1).mask(~m.any(axis=1))
>>> df
A B1 B2 B3 current_State new_column
0 -0.051964 -0.111196 1.041797 -1.256739 -0.111196 B1
1 0.745388 -1.711054 -0.205864 -0.234571 -1.711054 B1
2 1.128144 -0.012626 -0.613200 1.373688 -0.613200 B2
3 1.610992 -0.689228 0.691924 -0.448116 0.691924 B2
4 0.162342 0.257229 -1.275456 0.064004 0.064004 B3
5 -1.061857 -0.989368 -0.457723 -1.984182 -1.984182 B3
6 -1.476442 0.231803 0.644159 0.852123 0.231803 B1
7 -0.464019 0.697177 1.567882 1.178556 1.567882 B2

How to remove double quotes while assigning columns to dataframe

I have the below list:
ColumnName = 'Emp_id','Emp_Name','EmpAGe'
While I am trying to read the above columns and assign them inside a dataframe, I am getting extra double quotes.
df = pd.DataFrame(data, columns=[ColumnName])
columns=[ColumnName]
I am getting columns = ["'Emp_id','Emp_Name','EmpAGe'"]
How can I handle these extra double quotes and remove them while assigning the header to the data?
This code
ColumnName = 'Emp_id','Emp_Name','EmpAGe'
is a tuple and not a list.
If you want three columns, each named after a value in the tuple above, you need
df = pd.DataFrame(data, columns=list(ColumnName))
The problem is how you define the columns for the pandas DataFrame.
The example below will build a correct data frame:
import pandas as pd
ColumnName1 = 'Emp_id','Emp_Name','EmpAGe'
df1 = [['A1','A1','A2'],['1','2','1'],['a0','a1','a3']]
df = pd.DataFrame(data=df1,columns=ColumnName1 )
df
Result :
Emp_id Emp_Name EmpAGe
0 A1 A1 A2
1 1 2 1
2 a0 a1 a3
As the result above shows, there are no double quotes in the column names.
Just for the sake of understanding: you can use col.replace to get the desired result.
Let's take an example:
>>> df
col1" col2"
0 1 1
1 2 2
Result:
>>> df.columns = [col.replace('"', '') for col in df.columns]
# df.columns = df.columns.str.replace('"', '') <-- can use this as well
>>> df
col1 col2
0 1 1
1 2 2
OR
>>> df = pd.DataFrame({ '"col1"':[1, 2], '"col2"':[1,2]})
>>> df
"col1" "col2"
0 1 1
1 2 2
>>> df.columns = [col.replace('"', '') for col in df.columns]
>>> df
col1 col2
0 1 1
1 2 2
Your input is not quite right. ColumnName is already list-like and it should be passed on directly rather than wrapped in another list. In the latter case it would be interpreted as one single column.
df = pd.DataFrame(data, columns=ColumnName)

Conditionally merge pd.DataFrames

I want to know if this is possible with pandas:
From df2, I want to create new1 and new2.
new1 is the latest date found in df1 that matches on columns A and B.
new2 is the latest date found in df1 that matches on column A but not on B.
I managed to get new1 but not new2.
Code:
import pandas as pd
d1 = [['1/1/19', 'xy','p1','54'], ['1/1/19', 'ft','p2','20'], ['3/15/19', 'xy','p3','60'],['2/5/19', 'xy','p4','40']]
df1 = pd.DataFrame(d1, columns = ['Name', 'A','B','C'])
d2 =[['12/1/19', 'xy','p1','110'], ['12/10/19', 'das','p10','60'], ['12/20/19', 'fas','p50','40']]
df2 = pd.DataFrame(d2, columns = ['Name', 'A','B','C'])
d3 = [['12/1/19', 'xy','p1','110','1/1/19','3/15/19'], ['12/10/19', 'das','p10','60','0','0'], ['12/20/19', 'fas','p50','40','0','0']]
dfresult = pd.DataFrame(d3, columns = ['Name', 'A','B','C','new1','new2'])
Updated!
IIUC, you want to add two columns to df2: new1 and new2.
First I modified two things:
df1 = pd.DataFrame(d1, columns = ['Name1', 'A','B','C'])
df2 = pd.DataFrame(d2, columns = ['Name2', 'A','B','C'])
df1.Name1 = pd.to_datetime(df1.Name1)
Renamed Name into Name1 and Name2 for ease of use. Then I turned Name1 into a real date, so we can get the maximum date by group.
Then we merge df2 with df1 on the A column. This gives us the rows that match on that column:
aux = df2.merge(df1, on='A')
Then, where the B column is the same in both dataframes, we take Name1:
df2['new1'] = df2.index.map(aux[aux.B_x==aux.B_y].Name1).fillna(0)
If they're different, we get the maximum date for every A group:
df2['new2'] = df2.A.map(aux[aux.B_x!=aux.B_y].groupby('A').Name1.max()).fillna(0)
Output:
Name2 A B C new1 new2
0 12/1/19 xy p1 110 2019-01-01 00:00:00 2019-03-15 00:00:00
1 12/10/19 das p10 60 0 0
2 12/20/19 fas p50 40 0 0
You can do this by:
standard merge based on A
removing all entries which match B values
sorting for dates
dropping duplicates on A, keeping last date (n.b. assumes dates are in date format, not as strings!)
merging back on id
Thus:
source = df1.copy() # renamed
v = df2.merge(source, on='A', how='left') # get all values where df2.A == source.A
v = v[v['B_x'] != v['B_y']] # drop entries where B values are the same
nv = v.sort_values(by=['Name_y']).drop_duplicates(subset=['Name_x'], keep='last')
df2.merge(nv[['Name_y', 'Name_x']].rename(columns={'Name_y': 'new2', 'Name_x': 'Name'}),
          on='Name', how='left') # keeps non-matching, consider inner
This yields:
Out[94]:
Name A B C new2
0 12/1/19 xy p1 110 3/15/19
1 12/10/19 das p10 60 NaN
2 12/20/19 fas p50 40 NaN
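Note that the sample frames store the dates as plain strings, so the sort on Name_y above is lexicographic. A hedged preliminary step is to parse the dates first, so that "latest" really means the latest date:
df1['Name'] = pd.to_datetime(df1['Name'])
df2['Name'] = pd.to_datetime(df2['Name'])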
My initial thought was to do something like the below. Sadly, it is not elegant. Generally, this way of determining a value is frowned upon, mostly because it fails to scale and gets especially slow with large data.
def find_date(row, source=df1):  # renamed df1 to source
    t = source[source['B'] != row['B']]
    t = t[t['A'] == row['A']]
    if t.empty:
        return None  # no row in source matches on A
    # 'Name' holds the date strings; parse with pd.to_datetime for true date ordering
    return t.sort_values(by='Name', ascending=False)['Name'].iloc[0]

df2['new2'] = df2.apply(find_date, axis=1)

How to build an index from multiple columns and set it to a column in a pandas data frame?

I'd like to learn how to add a data frame column with a code mapped from multiple columns.
In the partial example below I was trying what could be a clumsy way, following this path: get the unique values as a temporary data frame; concatenate a prefix string with the temp row number as a new column; and then join the two data frames.
df = pd.DataFrame({'col1': ['A1', 'A2', 'A1', 'A3'],
                   'col2': ['B1', 'B2', 'B1', 'B1'],
                   'value': [100, 200, 300, 400],
                   })
tmp = df[['col1','col2']].drop_duplicates(['col1', 'col2'])
# col1 col2
# 0 A1 B1
# 1 A2 B2
# 3 A3 B1
The first question is: how do I get the tmp row number and its value into a tmp column?
And what is the clever pythonic way to achieve the result below from df?
dfnew = pd.DataFrame({'col1': ['A1', 'A2', 'A1', 'A3'],
                      'col2': ['B1', 'B2', 'B1', 'B1'],
                      'code': ['CODE0', 'CODE1', 'CODE0', 'CODE3'],
                      'value': [100, 200, 300, 400],
                      })
code col1 col2 value
0 CODE0 A1 B1 100
1 CODE1 A2 B2 200
2 CODE0 A1 B1 300
3 CODE3 A3 B1 400
thanks.
After the answers, and just as an exercise, I kept working on the non-pythonic version I had in mind, using the insights from the great answers, and reached this:
tmp = df[['col1','col2']].drop_duplicates(['col1', 'col2'])
tmp.reset_index(inplace=True)
tmp.drop('index', axis=1, inplace=True)
tmp['code'] = tmp.index.to_series().apply(lambda x: 'code' + format(x, '04d'))
dfnew = pd.merge(df, tmp, on=['col1', 'col2'])
At the time of posting this question, I did not realize that it would be nicer to reset the index so the codes follow a fresh sequence instead of the original index numbers.
I tried some variations but did not figure out how to chain 'reset_index' and 'drop' in just one command.
I'm starting to enjoy Python. Thank you all.
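A small sketch, assuming the same df as above, showing that the drop can be folded into reset_index itself via drop=True:
tmp = df[['col1', 'col2']].drop_duplicates(['col1', 'col2']).reset_index(drop=True)
tmp['code'] = tmp.index.to_series().apply(lambda x: 'code' + format(x, '04d'))
dfnew = pd.merge(df, tmp, on=['col1', 'col2'])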
groupby on df.index with ['col1', 'col2'] using transform('first') and map
df.assign(
    code=df.index.to_series().groupby(
        [df.col1, df.col2]
    ).transform('first').map('CODE{}'.format)
)[['code'] + df.columns.tolist()]
code col1 col2 value
0 CODE0 A1 B1 100
1 CODE1 A2 B2 200
2 CODE0 A1 B1 300
3 CODE3 A3 B1 400
explanation
# turn index to series so I can perform a groupby on it
idx_series = df.index.to_series()
# groupby col1 and col2 to establish uniqueness
idx_gb = idx_series.groupby([df.col1, df.col2])
# get first index value in each unique group
# and broadcast over entire group with transform
idx_tf = idx_gb.transform('first')
# map a format function to get desired string
code = idx_tf.map('CODE{}'.format)
# use assign to create new column
df.assign(code=code)
You can first sort_values on columns col1 and col2, then find all the duplicates with duplicated:
df = df.sort_values(['col1', 'col2'])
mask = df.duplicated(['col1','col2'])
print (mask)
0 False
2 True
1 False
3 False
dtype: bool
Then use insert if you need to specify the position of the output column code, together with numpy.where; fill the missing values forward with ffill. Finally, sort_index:
import numpy as np

df.insert(0, 'code', np.where(mask, np.nan, 'CODE' + df.index.astype(str)))
df.code = df.code.ffill()
df = df.sort_index()
print (df)
code col1 col2 value
0 CODE0 A1 B1 100
1 CODE1 A2 B2 200
2 CODE0 A1 B1 300
3 CODE3 A3 B1 400
How to get 'temp' row number and its value to a tmp column?
The value column is not propagating because you filter it out at the beginning with df[['col1','col2']]. This is fixed by changing it to tmp = df.drop_duplicates(['col1', 'col2']).
Index is preserved in the index column, if you want to copy it explicitly into data column, just do tmp['index'] = tmp.index.
What is the clever pythonic way to achieve the result below from df?
I do not know if it is particularly clever or not, as this is subjective, but one way of achieving that is
pd.concat([gr.assign(code='CODE{}'.format(min(gr.index))) for _, gr in df.groupby(['col1', 'col2'])])
Finally, to achieve the result in a form you specified, you can add .sort_index() and [['code', 'col1', 'col2', 'value']] to the above, in order to specify ordering of columns. Giving:
newdf = pd.concat([gr.assign(code='CODE{}'.format(min(gr.index))) for _, gr in df.groupby(['col1', 'col2'])]).sort_index()[['code', 'col1', 'col2', 'value']]
Possible performance bottleneck may be groupby and concat which may matter if you operate on large data sets.
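As a side note, a hedged alternative for generating the codes is groupby(...).ngroup(), which assigns a dense integer to each unique (col1, col2) pair. Note that it numbers groups 0, 1, 2, ... in order of appearance, so the last row here would get CODE2 rather than CODE3:
df['code'] = 'CODE' + df.groupby(['col1', 'col2'], sort=False).ngroup().astype(str)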
If you have df DataFrame like this:
state year population
0 California 2000 33871648
1 California 2010 37253956
2 New York 2000 18976457
3 New York 2010 19378102
4 Texas 2000 20851820
5 Texas 2010 25145561
you can create indexes from state and year columns with:
df2 = df.set_index(['state','year'])
which will give you a dataframe with a multi-index constructed from the columns state and year.
Accessing the multi-indexed data
The indexing examples below operate on the population column taken as a Series:
pop = df2['population']
pop['California', 2000]
Result: 33871648
pop[:, 2010]
Result:
state
California 37253956
New York 19378102
Texas 25145561
dtype: int64
pop.loc['California':'New York']
Result:
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
dtype: int64
