How to broadcast-map by columns from one dataframe to another? - python

I'd like to broadcast (expand) a dataframe column-wise from a smaller set of column labels to a larger one, based on a mapping specification. Here is an example; please excuse small mistakes, as it is untested:
import pandas as pd
# my broadcasting mapper spec
mapper = pd.Series(data=['a', 'b', 'c'], index=[1, 2, 2])
# my data
df = pd.DataFrame(data={1: [3, 4], 2: [5, 6]})
print(df)
# 1 2
# --------
# 0 3 5
# 1 4 6
# and I would like to get
df2 = ...
print(df2)
# a b c
# -----------
# 0 3 5 5
# 1 4 6 6
Simply mapping the columns will not work because the mapper index contains duplicates. Instead, I would like to expand to the new labels as defined in mapper:
# this will of course not work => raises InvalidIndexError
df.columns = df.columns.to_series().map(mapper)
A naive approach would just iterate over the spec ...
df2 = pd.DataFrame(index=df.index)
for i, v in mapper.items():
    df2[v] = df[i]

Use reindex and set_axis:
out = df.reindex(columns=mapper.index).set_axis(mapper, axis=1)
Output:
a b c
0 3 5 5
1 4 6 6

You can use pd.concat + df.get:
pd.concat({v:df.get(k) for k,v in mapper.items()},axis=1)
a b c
0 3 5 5
1 4 6 6
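For reference, here is a minimal self-contained sketch that runs both answers on the data from the question and compares them. The names `out` and `alt` are only illustrative, and `list(mapper)` is passed to `set_axis` purely to make the new labels explicit.

```python
import pandas as pd

# Mapping spec from the question: index = source column, value = new column name.
mapper = pd.Series(data=['a', 'b', 'c'], index=[1, 2, 2])
df = pd.DataFrame(data={1: [3, 4], 2: [5, 6]})

# Pick the source columns in mapper order (repeats are allowed because
# df.columns itself is unique), then relabel them with the mapper values.
out = df.reindex(columns=mapper.index).set_axis(list(mapper), axis=1)

# Dict-based variant: one entry per (source column, new name) pair.
alt = pd.concat({v: df[k] for k, v in mapper.items()}, axis=1)

print(out)
print(alt)
```

Both variants rely on the source columns being unique; with duplicated source labels, `reindex` is expected to fail.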


What is the most efficient way to swap the values of two columns of a 2D list in python when the number of rows is in the tens of thousands?

for example if I have an original list:
A B
1 3
2 4
to be turned into
A B
3 1
4 2
My two cents: there are 3 ways to do it.
1. You could add a 3rd column C, copy A to C, then delete A. This would take more memory.
2. You could create a swap function for the values in a row, then wrap it into a loop.
3. You could just swap the labels of the columns. This is probably the most efficient way.
You could use rename:
df2 = df.rename(columns={'A': 'B', 'B': 'A'})
output:
B A
0 1 3
1 2 4
If order matters:
df2 = df.rename(columns={'A': 'B', 'B': 'A'})[df.columns]
output:
A B
0 3 1
1 4 2
Use DataFrame.rename with a dictionary to swap the column names, then restore the original order by selecting the columns:
df = df.rename(columns=dict(zip(df.columns, df.columns[::-1])))[df.columns]
print (df)
A B
0 3 1
1 4 2
You can also simply assign the underlying values back in swapped order:
import pandas as pd
df = pd.DataFrame({"A":[1,2],"B":[3,4]})
df[["A","B"]] = df[["B","A"]].values
df
A B
0 3 1
1 4 2
for more than 2 columns:
df = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9], 'D':[10,11,12]})
print(df)
'''
A B C D
0 1 4 7 10
1 2 5 8 11
2 3 6 9 12
'''
df = df.set_axis(df.columns[::-1],axis=1)[df.columns]
print(df)
'''
A B C D
0 10 7 4 1
1 11 8 5 2
2 12 9 6 3
'''
I assume that your list is like this:
my_list = [[1, 3], [2, 4]]
So you can use this code:
print([[each_element[1], each_element[0]] for each_element in my_list])
The output is:
[[3, 1], [4, 2]]
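If the data really is a plain list of lists and a new outer list is not wanted, the swap can also be done in place. This is only a sketch under that assumption; `rows` is a hypothetical name for the 2D list.

```python
# Hypothetical 2D list: each inner list is one row [A, B].
rows = [[1, 3], [2, 4]]

# Swap the two values of every row in place using tuple unpacking.
for row in rows:
    row[0], row[1] = row[1], row[0]

print(rows)  # [[3, 1], [4, 2]]
```

This touches each row once, so the cost grows linearly with the number of rows.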

How to remove row and rename multiindex table

I have a MultiIndex data frame like the one below, and I would like to remove the row above 'A' (that is, shift the data frame up):
metric data data
F K
C B
A 2 3
B 4 5
C 6 7
D 8 9
desired output
ALIAS data data
metric F K
A 2 3
B 4 5
C 6 7
D 8 9
I looked at multiple posts but could not find anything close to the desired outcome. How can I achieve the desired output?
https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
Let's try DataFrame.droplevel to remove level 2 from the columns, and DataFrame.rename_axis to update column axis names:
df = df.droplevel(level=2, axis=1).rename_axis(['ALIAS', 'metric'], axis=1)
Or with the index equivalent methods Index.droplevel and Index.rename:
df.columns = df.columns.droplevel(2).rename(['ALIAS', 'metric'])
df:
ALIAS data
metric F K
A 2 3
B 4 5
C 6 7
D 8 9
Setup:
import numpy as np
import pandas as pd
df = pd.DataFrame(
np.arange(2, 10).reshape(-1, 2),
index=list('ABCD'),
columns=pd.MultiIndex.from_arrays([
['data', 'data'],
['F', 'K'],
['C', 'B']
], names=['metric', None, None])
)
df:
metric data
F K
C B
A 2 3
B 4 5
C 6 7
D 8 9
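Putting the setup and the droplevel answer together gives a short runnable sketch. It assumes the three-level column layout shown in the setup, with the unwanted labels in the last level (position 2).

```python
import numpy as np
import pandas as pd

# Setup from the answer: three column levels, only the first one is named.
df = pd.DataFrame(
    np.arange(2, 10).reshape(-1, 2),
    index=list('ABCD'),
    columns=pd.MultiIndex.from_arrays(
        [['data', 'data'], ['F', 'K'], ['C', 'B']],
        names=['metric', None, None],
    ),
)

# Drop the third column level, then name the two remaining levels.
df = df.droplevel(level=2, axis=1).rename_axis(['ALIAS', 'metric'], axis=1)

print(df.columns.names)
print(df)
```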

Getting the total for some columns (independently) in a data frame with python [duplicate]

I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
'b': [2, 3, 4],
'c': ['dd', 'ee', 'ff'],
'd': [5, 9, 1]})
df
Out [1]:
a b c d
0 1 2 dd 5
1 2 3 ee 9
2 3 4 ff 1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
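You can just call sum with axis=1 to sum across each row. Older pandas versions silently ignored non-numeric columns here; in pandas 2.0 and later, pass numeric_only=True to get the same behaviour:
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1, numeric_only=True)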
df
Out[91]:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1, numeric_only=True)
df
Out[99]:
a b c d e
0 1 2 dd 5 3
1 2 3 ee 9 5
2 3 4 ff 1 7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates new column e with the values:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
For longer lists of columns, EdChum's answer is preferred.
Create a list of the column names you want to add up (list_name below):
df['total']=df.loc[:,list_name].sum(axis=1)
If you want the sum for certain rows only, specify those rows in place of ':'.
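As a concrete sketch of this answer on the question's data: `list_name` is filled with the three columns from the question, and `partial` is a hypothetical name for a sum restricted to the first two rows.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})

list_name = ['a', 'b', 'd']  # columns to add up

# ':' selects every row, so each row gets its own total.
df['total'] = df.loc[:, list_name].sum(axis=1)

# A label slice instead of ':' restricts the sum to rows 0 through 1
# (label slices with .loc include the end label).
partial = df.loc[0:1, list_name].sum(axis=1)

print(df)
print(partial)
```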
This is a simpler way using iloc to select which columns to sum:
df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)
Produces:
a b c d e f g h
0 1 2 dd 5 8 3 3 6
1 2 3 ee 9 14 5 5 11
2 3 4 ff 1 8 7 7 4
I can't find a way to combine a range and specific columns that works, e.g. something like the following (neither line is valid syntax):
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
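One possible way to combine a range with individual positions is NumPy's `np.r_`, which concatenates slices and scalars into a single integer array that `iloc` accepts. This is a sketch on the question's data, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})

# np.r_[0:2, 3] expands the slice and appends the scalar: positions 0, 1 and 3.
cols = np.r_[0:2, 3]

# Sum columns a, b and d (positions 0, 1, 3) for each row.
df['i'] = df.iloc[:, cols].sum(axis=1)

print(df)
```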
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
I have a dataframe (awards_frame) as follows:
...and I want to create a new column that shows the sum of awards for each row:
Usage:
I simply pass my awards_frame into the function, also specifying the name of the new column, and a list of column names that are to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])
Result:
The following syntax helped me when the columns to sum are in sequence:
awards_frame.values[:,1:4].sum(axis =1)
You can use the function aggregate (or its alias agg):
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
sum prod min max
0 8 10 1 5
1 14 54 2 9
2 8 12 1 4
The shortest and simplest way here is to use eval, which returns a new DataFrame with the added column:
df = df.eval('e = a + b + d')

Get rows based on a given list without changing the order or deduplicating the list

I have a df that looks like the one below. I would like to get rows by matching the 'D' column against my list, without changing the order of the list or removing its duplicates.
A B C D
0 a b 1 1
1 a b 1 2
2 a b 1 3
3 a b 1 4
4 c d 2 5
5 c d 3 6 #df
My list
l = [4, 2, 6, 4] # my list
df.loc[df['D'].isin(l)].to_csv('output.csv', index = False)
When I use isin(), the result changes the order and drops the duplicates; df.loc[df['D'] == value] only prints the last line.
A B C D
3 a b 1 4
1 a b 1 2
5 c d 3 6
3 a b 1 4 # desired output
Any good way to do this? Thanks,
A solution without a loop, using merge:
In [26]: pd.DataFrame({'D':l}).merge(df, how='left')
Out[26]:
D A B C
0 4 a b 1
1 2 a b 1
2 6 c d 3
3 4 a b 1
You're going to have to iterate over your list, get copies of them filtered and then concat them all together
l = [4, 2, 6, 4] # you shouldn't use list = as list is a builtin
cache = {}
masked_dfs = []
for v in l:
    try:
        filtered_df = cache[v]
    except KeyError:
        filtered_df = df[df['D'] == v]
        cache[v] = filtered_df
    masked_dfs.append(filtered_df)
new_df = pd.concat(masked_dfs)
UPDATE: modified my answer to cache answers so that you don't have to do multiple searches for repeats
Just collect the indices of the values you are looking for, put them in a list, and then use that list to slice the data:
import pandas as pd
df = pd.DataFrame({
    'C' : [6, 5, 4, 3, 2, 1],
    'D' : [1,2,3,4,5,6]
})
l = [4, 2, 6, 4]
i_locs = [ind for elem in l for ind in df[df['D'] == elem].index]
df.loc[i_locs]
results in
C D
3 3 4
1 5 2
5 1 6
3 3 4
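Another possible approach, sketched here on the question's data, is to make 'D' the index and look up the list with `.loc`, which keeps the order of the list and repeats rows for repeated values. It assumes every value in the list occurs in 'D' (a missing value raises a KeyError) and it does not keep the original row labels.

```python
import pandas as pd

df = pd.DataFrame({'A': list('aaaacc'),
                   'B': list('bbbbdd'),
                   'C': [1, 1, 1, 1, 2, 3],
                   'D': [1, 2, 3, 4, 5, 6]})

l = [4, 2, 6, 4]

# Look rows up by their 'D' value in list order, then restore the
# original column order; the result gets a fresh 0..n-1 index.
out = df.set_index('D').loc[l].reset_index()[df.columns]

print(out)
```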

rename index of a pandas dataframe

I have a pandas dataframe whose indices look like:
df.index
['a_1', 'b_2', 'c_3', ... ]
I want to rename these indices to:
['a', 'b', 'c', ... ]
How do I do this without specifying a dictionary with explicit keys for each index value?
I tried:
df.rename( index = lambda x: x.split( '_' )[0] )
but this throws up an error:
AssertionError: New axis must be unique to rename
Perhaps you could get the best of both worlds by using a MultiIndex:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(8).reshape(4,2), index=['a_1', 'b_2', 'c_3', 'c_4'])
print(df)
# 0 1
# a_1 0 1
# b_2 2 3
# c_3 4 5
# c_4 6 7
index = pd.MultiIndex.from_tuples([item.split('_') for item in df.index])
df.index = index
print(df)
# 0 1
# a 1 0 1
# b 2 2 3
# c 3 4 5
# 4 6 7
This way, you can access things according to first level of the index:
In [30]: df.loc['c']
Out[30]:
0 1
3 4 5
4 6 7
or according to both levels of the index:
In [31]: df.loc[('c','3')]
Out[31]:
0 4
1 5
Name: (c, 3)
Moreover, all the DataFrame methods are built to work with DataFrames with MultiIndices, so you lose nothing.
However, if you really want to drop the second level of the index, you could do this:
df.reset_index(level=1, drop=True, inplace=True)
print(df)
# 0 1
# a 0 1
# b 2 3
# c 4 5
# c 6 7
That's the error you'd get if your function produced duplicate index values:
>>> df = pd.DataFrame(np.random.random((4,3)),index="a_1 b_2 c_3 c_4".split())
>>> df
0 1 2
a_1 0.854839 0.830317 0.046283
b_2 0.433805 0.629118 0.702179
c_3 0.390390 0.374232 0.040998
c_4 0.667013 0.368870 0.637276
>>> df.rename(index=lambda x: x.split("_")[0])
[...]
AssertionError: New axis must be unique to rename
If you really want that, I'd use a list comp:
>>> df.index = [x.split("_")[0] for x in df.index]
>>> df
0 1 2
a 0.854839 0.830317 0.046283
b 0.433805 0.629118 0.702179
c 0.390390 0.374232 0.040998
c 0.667013 0.368870 0.637276
but I'd think about whether that's really the right direction.
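A vectorized variant of the list comprehension uses the index's `.str` accessor. This is a sketch that assumes every label is a string containing the '_' separator; as with the list comprehension, the resulting index may contain duplicates.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(4, 2),
                  index=['a_1', 'b_2', 'c_3', 'c_4'])

# Split each label on '_' and keep only the first piece.
df.index = df.index.str.split('_').str[0]

print(df)
print(df.index.is_unique)  # False: 'c' now appears twice
```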
