Assign to *new* subset of a pandas DataFrame - python

Say I have some data in a DataFrame df. In particular, df.columns is a MultiIndex where the first level indicates "what kind of data" we are dealing with, and the second level indicates some sort of ID. To begin with, there is only a single unique value in the outermost column level:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(400, 5), columns=list('abcde'))
df.columns = pd.MultiIndex.from_tuples([('raw', c) for c in df.columns],
names=['datum', 'id'])
So say I want to compute a 10 period moving average of this chunk of data. I can easily do that with
df['raw'].rolling(window=10, min_periods=10).mean()
I'd like to assign this to a new section of the existing data frame. I wish the syntax were simply:
df['avg_10'] = df['raw'].rolling(window=10, min_periods=10).mean()
But that doesn't work. Instead, to get the equivalent, I need to do something clunky like:
a = df['raw'].rolling(window=10, min_periods=10).mean()
a.columns = pd.MultiIndex.from_tuples([('avg_10', c) for c in a.columns],
names=['datum', 'id'])
df = pd.concat([df, a], axis=1)
Is there a concise way to do this?

You can add the new columns in one shot like this:
df[df.columns.get_level_values(1)] = df['raw'].rolling(window=10, min_periods=10).mean()
Then tidy up the column levels so that the new columns sit under 'avg_10':
df.columns = pd.MultiIndex.from_tuples(
[t if t[0]=='raw' else ('avg_10', t[0]) for t in df.columns.tolist()]
)
Output:
In [121]: df.tail()
Out[121]:
raw avg_10 \
a b c d e a b
35 -0.036381 -0.202369 0.728408 -1.149906 -0.888169 0.174578 0.244956
36 1.700182 -0.957104 -0.005931 -1.035258 0.916398 0.304429 0.025519
37 1.142203 0.198508 -0.568147 0.006620 1.912575 0.408570 0.029939
38 -1.360093 0.638533 -0.899154 1.120311 1.702436 0.109886 0.155383
39 -1.860319 0.863798 0.876608 1.292301 0.547762 -0.069686 0.141820
c d e
35 -0.046456 -0.291078 0.176360
36 0.128143 -0.670730 0.213351
37 0.041724 -0.542027 0.301774
38 -0.147804 -0.363713 0.400007
39 0.005854 -0.164190 0.483140

Because of df.rolling as in your example, this solution only works with Pandas 0.18.0+.
# Create sample data with three columns.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(400, 3), columns=list('abc'))
df.columns = pd.MultiIndex.from_tuples([('raw', c) for c in df.columns],
names=['datum', 'id'])
# Have two window periods (e.g. 10, 30).
windows = [10, 30]
cols = df.columns.get_level_values(1)
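for window in windows:
    for col in cols:
        # Select the raw column explicitly so that avg_* columns added
        # in earlier iterations are not picked up again.
        df.loc[:, ('avg_{0}'.format(window), col)] = \
            df[('raw', col)].rolling(window=window, min_periods=window).mean()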
>>> df.tail()
datum raw avg_10 avg_30
id a b c a b c a b c
395 -0.177813 0.250998 1.054758 0.528226 0.266558 0.123020 0.046781 0.365069 0.233943
396 0.960048 -0.416499 -0.276823 0.459380 0.379910 0.140920 0.067177 0.329077 0.261536
397 1.123905 -0.173464 -0.510030 0.429155 0.268950 0.022079 0.105671 0.270666 0.271052
398 1.392518 1.037586 0.018792 0.485142 0.340002 -0.139202 0.170970 0.315509 0.262711
399 -0.593777 -2.011880 0.589704 0.387988 0.114828 -0.096127 0.133680 0.206199 0.265718
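A more compact variant, not taken from the answers above: passing a dict to pd.concat turns its keys into the outer column level, so the relabelling step disappears. This is a minimal sketch assuming the same 'raw' layout as in the question; the seed is arbitrary.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed, only for reproducibility
df = pd.DataFrame(np.random.randn(400, 5), columns=list('abcde'))
df.columns = pd.MultiIndex.from_tuples([('raw', c) for c in df.columns],
                                       names=['datum', 'id'])

avg = df['raw'].rolling(window=10, min_periods=10).mean()

# The dict key 'avg_10' becomes the outer level of the new columns.
new_block = pd.concat({'avg_10': avg}, axis=1, names=['datum', 'id'])
df = pd.concat([df, new_block], axis=1)

print(df.columns.get_level_values(0).unique().tolist())
```

Adding a second window would only need another key in the dict, for example 'avg_30'.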

Related

How to replace data in one pandas df by the data of another one?

Want to replace some rows of some columns in a bigger pandas df by data in a smaller pandas df. The column names are same in both.
Tried using combine_first but it only updates the null values.
For example, let's say df1.shape is (100, 25) and df2.shape is (10, 5):
df1
A B C D E F G ...Z Y Z
1 abc 10.20 0 pd.NaT
df2
A B C D E
1 abc 15.20 1 10
Now after replacing df1 should look like:
A B C D E F G ...Z Y Z
1 abc 15.20 1 10 ...
To replace values in df1 the condition is where df1.A = df2.A and df1.B = df2.B
How can it be achieved in the most pythonic way? Any help will be appreciated.
I'm not sure I fully understood your question, but does this solve your problem?
df1 = pd.DataFrame(data={'A':[1],'B':[2],'C':[3],'D':[4]})
df2 = pd.DataFrame(data={'A':[1],'B':[2],'C':[5],'D':[6]})
new_df=pd.concat([df1,df2]).drop_duplicates(['A','B'],keep='last')
print(new_df)
output:
A B C D
0 1 2 5 6
You could use a MultiIndex.
First, let us create DataFrames like the ones you are working with:
import numpy as np
import pandas as pd
from string import ascii_uppercase
cols = pd.Index(list(ascii_uppercase))
vals = np.arange(100*len(cols)).reshape(100, len(cols))
df = pd.DataFrame(vals, columns=cols)
df1 = pd.DataFrame(vals[:10,:5], columns=cols[:5])
Then turn A and B into the index of both frames, assign by label, and restore the columns:
df = df.set_index(["A","B"])
df1 = df1.set_index(["A","B"])*1.5 # multiply just to make the other values different
df.loc[df1.index, df1.columns] = df1
df = df.reset_index()
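The same key-based replacement can also be sketched with DataFrame.update, which aligns on index and columns and overwrites in place. The frames below are small made-up stand-ins for df1 and df2 from the question, not the asker's data. Note that update skips NaN values in the smaller frame, so this assumes the replacement values are not missing.

```python
import pandas as pd

# Hypothetical stand-ins for the larger and the smaller frame.
df1 = pd.DataFrame({
    "A": [1, 2, 3],
    "B": ["abc", "def", "ghi"],
    "C": [10.2, 11.0, 12.0],
    "D": [0.0, 0.0, 0.0],
    "F": ["x", "y", "z"],
})
df2 = pd.DataFrame({"A": [1], "B": ["abc"], "C": [15.2], "D": [1.0]})

keys = ["A", "B"]
left = df1.set_index(keys)

# update() modifies `left` in place, matching rows on (A, B) and columns by name.
left.update(df2.set_index(keys))

result = left.reset_index()
print(result)
```

Rows of df1 whose (A, B) pair does not appear in df2, and columns that df2 lacks (F here), are left untouched.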

Looping over data frame to cap and sum another data frame

I am trying to use entries from df1 to limit amounts in df2, then add them up based on their type and summarize in df3. I'm not sure how to get there; a for loop using iterrows is my best guess, but it's not complete.
Code:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'Caps':['25','50','100']})
df2 = pd.DataFrame({'Amounts':['45','25','65','35','85','105','80'], \
'Type': ['a' ,'b' ,'b' ,'c' ,'a' , 'b' ,'d' ]})
df3 = pd.DataFrame({'Type': ['a' ,'b' ,'c' ,'d']})
df1['Caps'] = df1['Caps'].astype(float)
df2['Amounts'] = df2['Amounts'].astype(float)
for index1, row1 in df1.iterrows():
    for index2, row2 in df3.iterrows():
        df3[str(row1['Caps']+'limit')] = df2['Amounts'].where(
            df2['Type'] == row2['Type']).where(
            df2['Amounts']<= row1['Caps'], row1['Caps']).sum()
# My ideal output would be this:
df3 = pd.DataFrame({'Type':['a','b','c','d'],
'Total':['130','195','35','80'],
'25limit':['50','75','25','25'],
'50limit':['95','125','35','50'],
'100limit':['130','190','35','80'],
})
Output:
>>> df3
Type Total 25limit 50limit 100limit
0 a 130 50 95 130
1 b 195 75 125 190
2 c 35 25 35 35
3 d 80 25 50 80
Use NumPy broadcasting to compare every value in Amounts with every value in Caps, producing a 2-D array a. Then build a DataFrame from it, sum per Type, transpose with DataFrame.T, and label the columns with DataFrame.add_suffix.
For the aggregated column, use DataFrame.insert to place the GroupBy.sum result first:
df1['Caps'] = df1['Caps'].astype(int)
df2['Amounts'] = df2['Amounts'].astype(int)
am = df2['Amounts'].to_numpy()
ca = df1['Caps'].to_numpy()
#pandas below 0.24
#am = df2['Amounts'].values
#ca = df1['Caps'].values
a = np.where(am <= ca[:, None], am[None, :], ca[:, None])
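# Note: the level= argument of sum() was removed in pandas 2.0, so this line needs an older pandas.
df1 = (pd.DataFrame(a,columns=df2['Type'],index=df1['Caps'])
         .sum(axis=1, level=0).T.add_suffix('limit'))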
df1.insert(0, 'Total', df2.groupby('Type')['Amounts'].sum())
df1 = df1.reset_index().rename_axis(None, axis=1)
print (df1)
Type Total 25limit 50limit 100limit
0 a 130 50 95 130
1 b 195 75 125 190
2 c 35 25 35 35
3 d 80 25 50 80
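For newer pandas, where sum() no longer accepts level=, the same broadcasting idea can be sketched with a groupby on the index instead. This is a variant rather than the answerer's exact code: np.minimum replaces the np.where comparison (it gives the same capped values), and the names capped and out are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Caps": [25, 50, 100]})
df2 = pd.DataFrame({
    "Amounts": [45, 25, 65, 35, 85, 105, 80],
    "Type": ["a", "b", "b", "c", "a", "b", "d"],
})

am = df2["Amounts"].to_numpy()
ca = df1["Caps"].to_numpy()

# Shape (n_caps, n_amounts): every amount clipped to every cap via broadcasting.
capped = np.minimum(am[None, :], ca[:, None])

out = (
    pd.DataFrame(capped.T, index=df2["Type"], columns=df1["Caps"])
    .groupby(level=0).sum()      # sum the capped amounts per Type
    .add_suffix("limit")         # 25 -> '25limit', and so on
)
out.insert(0, "Total", df2.groupby("Type")["Amounts"].sum())
out = out.reset_index().rename_axis(None, axis=1)
print(out)
```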
Here is my solution without the NumPy broadcasting step; however, it is about two times slower than @jezrael's solution (10.5 ms vs. 5.07 ms).
limcols= df1.Caps.to_list()
df2=df2.reindex(columns=["Amounts","Type"]+limcols)
df2[limcols]= df2[limcols].transform( \
lambda sc: np.where(df2.Amounts.le(sc.name),df2.Amounts,sc.name))
# Summations:
g=df2.groupby("Type")
df3= g[limcols].sum()
df3.insert(0,"Total", g.Amounts.sum())
# Renaming columns:
c_dic={ lim:f"{lim:.0f}limit" for lim in limcols}
df3= df3.rename(columns=c_dic).reset_index()
# Cleanup:
#df2=df2.drop(columns=limcols)

How do I filter an empty DataFrame and still keep the columns of that DataFrame?

Here is an example of why pandas is a terribly designed hacked together library:
import pandas as pd
df = pd.DataFrame()
df['A'] = [1,2,3]
df['B'] = [4,5,6]
print(df)
df1 = df[df.A.apply(lambda x:x == 4)]
df2 = df1[df1.B.apply(lambda x:x == 1)]
print(df2)
This will print
df
A B
0 1 4
1 2 5
2 3 6
df2
Empty DataFrame
Columns: []
Index: []
Note how Columns: [] is empty, which means any further filtering/selecting on df2 will fail. This is a huge issue, because it means I now have to always check if any table is empty before attempting to select from it, which is garbage behaviour.
For clarity, the sensible, thoughtful, reasonable, not totally broken behaviour would be to preserve the columns.
Anyone care to offer some hack I can apply on top of the collection of hacks which is the dataframe API?
Pandas handles almost every situation we need, especially simple cases like this one. Use .loc with the mask:
PS: Nothing wrong with pandas
df1 = df.loc[df.A.apply(lambda x:x == 4)]
df2 = df1.loc[df1.B.apply(lambda x:x == 1)]
df1
Out[53]:
Empty DataFrame
Columns: [A, B]
Index: []
df2
Out[54]:
Empty DataFrame
Columns: [A, B]
Index: []
df2 = df1[df1.B.apply(lambda x:x == 1).astype(bool)]
All other answers are missing the point (except for Wen's, which is an ok alternative)
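Both workarounds above suggest that the mask matters: an apply on an empty Series may not come back with a boolean dtype, in which case df1[...] is not treated as a row filter. As a small sketch of a third option, assuming the comparisons can be written without apply, a vectorized comparison produces a boolean mask even when the frame is empty, so the columns survive:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Vectorized comparisons yield boolean masks, also on empty frames.
df1 = df[df["A"] == 4]    # no row matches, so df1 is empty
df2 = df1[df1["B"] == 1]  # filtering the empty frame again

print(df2.columns.tolist())
```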

Compare 2 Pandas dataframes, row by row, cell by cell

I have 2 dataframes, df1 and df2, and want to do the following, storing results in df3:
for each row in df1:
    for each row in df2:
        create a new row in df3 (called "df1-1, df2-1" or whatever) to store results
        for each cell(column) in df1:
            for the cell in df2 whose column name is the same as for the cell in df1:
                compare the cells (using some comparing function func(a,b) ) and,
                depending on the result of the comparison, write result into the
                appropriate column of the "df1-1, df2-1" row of df3)
For example, something like:
df1
A B C D
foo bar foobar 7
gee whiz herp 10
df2
A B C D
zoo car foobar 8
df3
df1-df2 A B C D
foo-zoo func(foo,zoo) func(bar,car) func(foobar,foobar) func(7,8)
gee-zoo func(gee,zoo) func(whiz,car) func(herp,foobar) func(10,8)
I've started with this:
for r1 in df1.iterrows():
    for r2 in df2.iterrows():
        for c1 in r1:
            for c2 in r2:
but am not sure what to do with it, and would appreciate some help.
So to continue the discussion in the comments, you can use vectorization, which is one of the selling points of a library like pandas or numpy. Ideally, you shouldn't ever be calling iterrows(). To be a little more explicit with my suggestion:
# with df1 and df2 provided as above, an example
df3 = df1['A'] * 3 + df2['A']
# recall that df2 only has the one row, so index alignment leaves a NaN where df2 has no matching row
df3
0 foofoofoozoo
1 NaN
Name: A, dtype: object
# more generally
# we know that df1 and df2 share column names, so we can initialize df3 with those names
df3 = pd.DataFrame(columns=df1.columns)
for colName in df1:
    df3[colName] = func(df1[colName], df2[colName])
Now, you could even have different functions applied to different columns by, say, creating lambda functions and then zipping them with the column names:
# some example functions
colAFunc = lambda x, y: x + y
colBFunc = lambda x, y: x - y
....
columnFunctions = [colAFunc, colBFunc, ...]
# initialize df3 as above
df3 = pd.DataFrame(columns=df1.columns)
for func, colName in zip(columnFunctions, df1.columns):
    df3[colName] = func(df1[colName], df2[colName])
The only "gotcha" that comes to mind is that your function must be applicable to the data in your columns. For instance, df1['A'] - df2['A'] (with df1, df2 as you have provided) would raise a TypeError, because subtraction of two strings is undefined. Just something to be aware of.
Edit, re: your comment: That is doable as well. Iterate over the dfX.columns that is larger, so you don't run into a KeyError, and throw an if statement in there:
# all the other jazz
# let's say df1 is [['A', 'B', 'C']] and df2 is [['A', 'B', 'C', 'D']]
# so iterate over df2 columns
for colName in df2:
    if colName not in df1:
        df3[colName] = np.nan # be sure to import numpy as np
    else:
        df3[colName] = func(df1[colName], df2[colName])
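The column-wise loops above pair rows by index label. If every row of df1 should be compared with every row of df2, as in the question's expected df3, one possible sketch is a cross join followed by a per-column comparison. It assumes pandas 1.2 or later (for how="cross"), and func here is a hypothetical stand-in that simply tests equality.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["foo", "gee"], "B": ["bar", "whiz"],
                    "C": ["foobar", "herp"], "D": [7, 10]})
df2 = pd.DataFrame({"A": ["zoo"], "B": ["car"], "C": ["foobar"], "D": [8]})

def func(a, b):
    # Hypothetical comparison: element-wise equality of two aligned Series.
    return a == b

# One row per (df1 row, df2 row) pair; shared column names get suffixes.
pairs = df1.merge(df2, how="cross", suffixes=("_1", "_2"))

df3 = pd.DataFrame({col: func(pairs[col + "_1"], pairs[col + "_2"])
                    for col in df1.columns})
df3.index = pairs["A_1"] + "-" + pairs["A_2"]  # labels such as 'foo-zoo'
print(df3)
```

Because func receives whole columns, any vectorized comparison can be dropped in without changing the rest.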

Pandas read multiindexed csv with blanks

I'm struggling with properly loading a csv that has a multi lines header with blanks. The CSV looks like this:
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
What I would like to get is:
When I try to load with pd.read_csv(file, header=[0,1], sep=','), I end up with the following:
Is there a way to get the desired result?
Note: alternatively, I would accept this as a result:
Versions used:
Python: 2.7.8
Pandas 0.16.0
Here is an automated way to fix the column index. First,
pull the column level values into a DataFrame:
columns = pd.DataFrame(df.columns.tolist())
then rename the Unnamed: columns to NaN:
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
and then forward-fill the NaNs:
columns[0] = columns[0].fillna(method='ffill')
so that columns now looks like
In [314]: columns
Out[314]:
0 1
0 NaN A
1 NaN B
2 C X
3 C Y
4 C Z
5 D X
6 D Y
7 D Z
Now we can find the remaining NaNs and fill them with empty strings:
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
To make the first two columns, A and B, indexable as df['A'] and df['B'] -- as though they were single-leveled -- you could swap the values in the first and second columns:
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
Now you can build a new MultiIndex and assign it to df.columns:
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
Putting it all together, if data is
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0
then
import numpy as np
import pandas as pd
df = pd.read_csv('data', header=[0,1], sep=',')
columns = pd.DataFrame(df.columns.tolist())
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
columns[0] = columns[0].fillna(method='ffill')
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
print(df)
yields
A B C D
X Y Z X Y Z
0 1 2 3 4 5 6 7 8
1 3 4 5 6 7 8 9 0
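On recent pandas versions, fillna(method='ffill') is deprecated in favour of ffill(). The same steps can be sketched as below; this is an adaptation, not the answerer's exact code. It reads the sample from an in-memory string instead of a file and builds the tuples with itertuples.

```python
import io

import numpy as np
import pandas as pd

csv_text = (
    ",,C,,,D,,\n"
    "A,B,X,Y,Z,X,Y,Z\n"
    "1,2,3,4,5,6,7,8\n"
    "3,4,5,6,7,8,9,0\n"
)
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

columns = pd.DataFrame(df.columns.tolist())

# Blank upper-level cells arrive as 'Unnamed: ...'; turn them into NaN and forward-fill.
columns.loc[columns[0].str.startswith("Unnamed:"), 0] = np.nan
columns[0] = columns[0].ffill()

# Whatever is still NaN has no parent label: move the lower label up.
mask = columns[0].isna()
columns.loc[mask, 0] = columns.loc[mask, 1]
columns.loc[mask, 1] = ""

df.columns = pd.MultiIndex.from_tuples(
    list(columns.itertuples(index=False, name=None))
)
print(df)
```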
There is no magical way of making pandas aware of how you want your index to look. The closest you can get is to specify a lot yourself, like this:
names = ['A', 'B',
('C','X'), ('C', 'Y'), ('C', 'Z'),
('D','X'), ('D','Y'), ('D', 'Z')]
pd.read_csv(file, mangle_dupe_cols=True,
header=1, names=names, index_col=[0, 1])
Gives:
C D
X Y Z X Y Z
A B
1 2 3 4 5 6 7 8
To do this in a dynamic fashion, you could read the first two lines of the CSV as they are and loop through the columns you get to generate the names variable dynamically before loading the full dataset.
pd.read_csv(file, nrows=1, header=[0,1], index_col=[0, 1])
Then access the columns and loop to create your header.
Again, not a very clean solution, but should work.
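One way the dynamic approach could look, as a sketch under assumptions: the two header rows are read with the standard csv module, blanks in the first row are forward-filled by hand, and the data is then loaded without a header. The in-memory csv_text stands in for the real file, and A and B are kept as ordinary columns rather than as the index.

```python
import csv
import io

import pandas as pd

csv_text = (
    ",,C,,,D,,\n"
    "A,B,X,Y,Z,X,Y,Z\n"
    "1,2,3,4,5,6,7,8\n"
)

# Read only the two header rows.
reader = csv.reader(io.StringIO(csv_text))
top, bottom = next(reader), next(reader)

# Forward-fill blank cells of the first header row.
filled, current = [], ""
for label in top:
    current = label or current
    filled.append(current)

# Columns without any parent label keep their own name on the top level.
names = [(t, b) if t else (b, "") for t, b in zip(filled, bottom)]

df = pd.read_csv(io.StringIO(csv_text), header=None, skiprows=2)
df.columns = pd.MultiIndex.from_tuples(names)
print(df)
```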
You can read the file using:
df = pd.read_csv('file.csv', header=[0, 1], skipinitialspace=True, tupleize_cols=True)
and then
df.columns = pd.MultiIndex.from_tuples(df.columns)
Load the dataframe, with multiindex:
df = pd.read_csv(filelist,header=[0,1], sep=',')
Write a function to replace the index:
def replace_index(df):
    arr = df.columns.values
    l = [list(x) for x in arr]
    for i in range(len(l)):
        if l[i][0][:7] == 'Unnamed':
            if l[i-1][0][:7] != 'Unnamed':
                l[i][0] = l[i-1][0]
    for i in range(len(l)):
        if l[i][0][:7] == 'Unnamed':
            l[i][0] = l[i][1]
            l[i][1] = ''
    index = pd.MultiIndex.from_tuples(l)
    df.columns = index
    return df
Return the new dataframe properly indexed:
replace_index(df)
I used a technique to flatten from the multi-index columns and make one column. It works well for me.
your_df.columns = ['_'.join(col).strip() for col in your_df.columns.values]
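A small sketch of that flattening on a made-up frame. One assumption differs from the line above: strip("_") is used instead of strip(), because a blank second level would otherwise leave a trailing underscore (for example 'A_').

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples([("A", ""), ("B", ""), ("C", "X"), ("C", "Y")])
your_df = pd.DataFrame([[1, 2, 3, 4]], columns=cols)

# Join the levels of each column tuple and drop underscores left by blank levels.
your_df.columns = ["_".join(col).strip("_") for col in your_df.columns]
print(your_df.columns.tolist())
```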
Import your csv file providing the header row indexes:
df = pd.read_csv('file.csv', header=[0, 1, 2])
Then you can iterate over each column header, clean it up, collect it as a tuple, and re-assign the DataFrame columns using pd.MultiIndex.from_tuples(list_of_tuples):
df.columns = pd.MultiIndex.from_tuples(
[tuple(['' if y.find('Unnamed')==0 else y for y in x]) for x in df.columns]
)
This is the quick one-liner I was looking for when trying to figure this out.
