Confusion about Pandas DataFrame.apply() - python

Let's say I write the code:
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.randint(1, 5, 4).reshape((2, 2)), columns=['A', 'B'])
print("dataframe: \n", df2)
print("mean: \n", df2.mean(0))
df2 = df2.apply(lambda x: x - [1, 2], axis=0)
print("altered df2: \n", df2)
It gives me the results:
dataframe:
   A  B
0  1  1
1  2  2
mean:
A    1.5
B    1.5
dtype: float64
altered df2:
   A  B
0  0  0
1  0  0
So first, I've asked it to give me the mean along axis=0. In my mind, this means treating each row as a vector and finding the average of those vectors. It seems that Pandas agrees with me on this one!
However, I then use the DataFrame.apply() function and again specify axis=0. Here I expect the same logic to hold, i.e. for the operation, in this case lambda x: x - [1, 2], to be executed on each row. The result I expect is:
   A  B
0  0 -1
1  1  0
But instead, when I specify axis=0 (rows), it actually does the operation on the columns.
I'm having a very hard time with pandas, specifically wrapping my head around how it indexes rows and columns, and this has further added to the confusion. As it stands, every time I think I get how things work, I find out I'm wrong, so nothing ever sticks in my brain. I'm asking for an easy way to think about this stuff so that it sticks in my dyslexic little head.
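For what it's worth, here is one way to see what apply is doing (a sketch, not part of the original question): with axis=0 the function receives each column as the Series x, so x - [1, 2] subtracts 1 from the first row and 2 from the second row of every column; with axis=1 it receives each row instead, which produces the result expected above.
import pandas as pd

df2 = pd.DataFrame([[1, 1], [2, 2]], columns=['A', 'B'])

# axis=0: x is each COLUMN, so [1, 2] is subtracted going down the column
print(df2.apply(lambda x: x - [1, 2], axis=0))
#    A  B
# 0  0  0
# 1  0  0

# axis=1: x is each ROW, so [1, 2] is subtracted going across the row
print(df2.apply(lambda x: x - [1, 2], axis=1))
#    A  B
# 0  0 -1
# 1  1  0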

Related

Vector arithmetic by conditional selection from multiple columns in a dataframe

I'm trying to do arithmetic among different cells in my dataframe and can't figure out how to operate on each of my groups. I'm trying to find the difference in energy_use between a baseline case (in this example upgrade_name == 'b' is the baseline) and each upgrade, for each building. I have an arbitrary number of building_ids and an arbitrary number of upgrade_names.
I can do this successfully for a single building_id. Now I need to expand this out to a full dataset and am stuck. I will have tens of thousands of buildings and dozens of upgrades for each building.
The answer to this question Iterating within groups in Pandas may be related, but I'm not sure how to apply it to my problem.
I have a dataframe like this:
import pandas as pd

df = pd.DataFrame({'building_id': [1, 2, 1, 2, 1],
                   'upgrade_name': ['a', 'a', 'b', 'b', 'c'],
                   'energy_use': [100.4, 150.8, 145.1, 136.7, 120.3]})
In [4]: df
Out[4]:
   building_id upgrade_name  energy_use
0            1            a       100.4
1            2            a       150.8
2            1            b       145.1
3            2            b       136.7
4            1            c       120.3
For a single building_id I have the following code:
upgrades = df.loc[df.building_id == 1, ['upgrade_name', 'energy_use']]
starting_point = upgrades.loc[upgrades.upgrade_name == 'b', 'energy_use']
upgrades['diff'] = upgrades.energy_use - starting_point.values[0]
In [8]: upgrades
Out[8]:
  upgrade_name  energy_use  diff
0            a       100.4 -44.7
2            b       145.1   0.0
4            c       120.3 -24.8
How do I write this for an arbitrary number of building_ids, instead of my hard-coded building_id == 1?
The ideal solution looks like this (doesn't matter if the baseline differences are 0 or NaN):
In [17]: df
Out[17]:
   building_id upgrade_name  energy_use  ideal
0            1            a       100.4  -44.7
1            2            a       150.8   14.1
2            1            b       145.1    0.0
3            2            b       136.7    0.0
4            1            c       120.3  -24.8
Define a function computing the difference in energy use (for the group of rows belonging to the current building) as follows:
def euDiff(grp):
    euBase = grp[grp.upgrade_name == 'b'].energy_use.values[0]
    return grp.energy_use - euBase
Then compute the difference (for all buildings), applying it to each group:
df['ideal'] = df.groupby('building_id').apply(euDiff)\
    .reset_index(level=0, drop=True)
The result is just as you expected.
Thanks for sharing that example data! It made things a lot easier.
I suggest solving this in two parts:
1. Make a dictionary from your dataframe that contains the baseline energy use for each building.
2. Apply a lambda function to your dataframe to subtract each building's baseline value from its energy use values.
# set index to building_id, turn into a dictionary, keep only energy_use
building_baseline = df[df['upgrade_name'] == 'b'].set_index('building_id').to_dict()['energy_use']
# apply a lambda to the dataframe; axis=1 passes each row to the lambda
df['diff'] = df.apply(lambda row: row['energy_use'] - building_baseline[row['building_id']], axis=1)
You could also write a named function to do this. You also don't necessarily need the dictionary; it just makes things easier. If you're curious about these alternative solutions, let me know and I can add them for you.
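For instance, a dictionary-free variant could look like this (a sketch of the alternative hinted at above, not code from the original answer): build a baseline Series indexed by building_id and map it back onto the frame.
# Baseline energy use per building, as a Series indexed by building_id
baseline = df.loc[df['upgrade_name'] == 'b'].set_index('building_id')['energy_use']
# Vectorised subtraction: map() lines up each row's building_id with its baseline
df['diff'] = df['energy_use'] - df['building_id'].map(baseline)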

Indexing a dataframe from a dataframe of row indexes

I have two pandas DataFrames with the same shape, for example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(3, 2), index=np.arange(3), columns=['a', 'b'])
df2 = pd.DataFrame(np.random.randint(0, high=3, size=(3, 2)), index=np.arange(3), columns=['a', 'b'])
print(df1)
          a         b
0  0.336811 -2.132993
1 -1.492770  0.278024
2 -2.355762 -0.894376
print(df2)
   a  b
0  1  2
1  0  2
2  2  1
I would like to use the values in df2 as row indexes to select the values in df1 and create a new dataframe of equal shape.
Expected result:
print(df3)
          a         b
0 -1.492770 -0.894376
1  0.336811 -0.894376
2 -2.355762  0.278024
I have tried using .loc, and it works well for a single column:
df3 = df1.loc[df2['a'], 'a']
print(df3)
1   -1.492770
0    0.336811
2   -2.355762
Name: a, dtype: float64
But I was not able to use .loc or .iloc on all columns at the same time.
I would like to avoid loops to optimize performance since I am working on a large dataframe.
Any ideas?
Using numpy selection
pd.DataFrame([df1[col].values[df2[col]] for col in df1.columns], index=df1.columns).T
          a         b
0 -1.492770 -0.894376
1  0.336811 -0.894376
2 -2.355762  0.278024
If you want to avoid for loops, you have to play with raveling and unraveling. In a nutshell, you flatten the whole data frame into a single vector, add len(df1) to each successive block so that the indexes jump to the beginning of the next column, and then reshape back to the original size. All operations in this context are vectorized, so they should be fast.
For example,
df1.T.values.ravel()[df2.T.values.ravel() + np.repeat(np.arange(0, len(df1)+1, len(df1)), len(df1))].reshape(df1.T.shape).T
Gives
array([[-1.49277 , -0.894376],
       [ 0.336811, -0.894376],
       [-2.355762,  0.278024]])
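If your NumPy is recent enough to have it, np.take_along_axis does this per-column row lookup in a single vectorized call (a sketch, not from the original answers):
import numpy as np
import pandas as pd

# result[i, j] = df1.values[df2.values[i, j], j]: pick one row per cell, column by column
df3 = pd.DataFrame(np.take_along_axis(df1.values, df2.values, axis=0),
                   index=df1.index, columns=df1.columns)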

using pandas apply and inplace dataframe

I have a dataframe like the one below and want to transform it into the result df further down, applying the logic below with pandas' apply method.
As far as I know, the apply method returns a new series and does not modify the original df in place.
id  a  b
--------
a   1  4
b   2  5
c   6  2
if df['a'] > df['b']:
    df['a'] = df['b']
else:
    df['b'] = df['a']
result df:
id  a  b
--------
a   4  4
b   5  5
c   6  6
I am not sure what you need, since the expected output is different from your condition; here I can only fix your code:
for x, y in df.iterrows():
    if y['a'] > y['b']:
        df.loc[x, 'a'] = df.loc[x, 'b']
    else:
        df.loc[x, 'b'] = df.loc[x, 'a']
df
Out[40]:
  id  a  b
0  a  1  1
1  b  2  2
2  c  2  2
If I understand your problem correctly:
df.assign(**dict.fromkeys(['a', 'b'], np.where(df.a > df.b, df.a, df.b)))
Out[43]:
  id  a  b
0  a  4  4
1  b  5  5
2  c  6  6
Like the rest, I'm not totally sure what you're trying to do, so I'm going to ASSUME you mean to set both the "a" and "b" values in each row to the higher of that row's two values. If that assumption is correct, here's how that can be done with .apply().
First, most "clean" applications of .apply() (remembering that using .apply() is generally never recommended) use a function that takes the row fed to it by .apply() and returns that same object, modified as needed. With your dataframe in mind, here is a function to achieve the desired output, followed by its application against the dataframe using .apply().
# Create the function to be used within .apply()
def comparer(row):
    if row["a"] > row["b"]:
        row["b"] = row["a"]
    elif row["b"] > row["a"]:
        row["a"] = row["b"]
    return row

# Use .apply() to execute our function against each row, re-creating the "df" object as our new modified dataframe
df = df.apply(comparer, axis=1)
Most, if not everyone, seems to rail against .apply() usage, however. I'd probably heed their wisdom :)
Try:
df = pd.DataFrame({'a': [1, 2, 6], 'b': [4, 5, 2]})
df['a'] = df.max(axis=1)
df['b'] = df['a']
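Note that this last snippet drops the id column from the example. If your real frame still has the string id column, you would presumably want to restrict the row-wise max to the numeric columns (a hedged variation, not from the original answer):
import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c'], 'a': [1, 2, 6], 'b': [4, 5, 2]})
# Take the row-wise maximum over the numeric columns only, then copy it to both
df['a'] = df[['a', 'b']].max(axis=1)
df['b'] = df['a']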

Pandas axes explained

I'm trying to understand the axis parameter in python pandas. I understand that it's analogous to the numpy axis, but the following example still confuses me:
import pandas as pd

a = pd.DataFrame([[0, 1, 4], [1, 2, 3]])
print(a)
   0  1  2
0  0  1  4
1  1  2  3
According to this post, axis=0 runs along the rows (fixed column), while axis=1 runs along the columns (fixed row). Running print(a.drop(1, axis=1)) yields
   0  2
0  0  4
1  1  3
which results in a dropped column, while print(a.drop(1, axis=0)) drops a row. Why? That seems backwards to me.
It's slightly confusing, but axis=0 operates on rows and axis=1 operates on columns.
So when you use df.drop(1, axis=1), you are saying drop column number 1.
The other post has df.mean(axis=1), which essentially says calculate the mean across the columns, giving one value per row.
This is consistent with indexing numpy arrays, where the first index specifies the row number (the 0th dimension) and the second index the column number (the 1st dimension), and so on.
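A quick demonstration of both behaviours side by side (a sketch to make the rule concrete, not from the original answer):
import pandas as pd

a = pd.DataFrame([[0, 1, 4], [1, 2, 3]])

print(a.mean(axis=0))     # collapses the rows: one mean per column -> 0.5, 1.5, 3.5
print(a.mean(axis=1))     # collapses the columns: one mean per row -> 1.67, 2.0
print(a.drop(1, axis=0))  # drops the ROW labelled 1
print(a.drop(1, axis=1))  # drops the COLUMN labelled 1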

Transforming outliers in Pandas DataFrame using .apply, .applymap, .groupby

I'm attempting to transform a pandas DataFrame object into a new object that contains a classification of the points based upon some simple thresholds:
Value transformed to 0 if the point is NaN
Value transformed to 1 if the point is negative or 0
Value transformed to 2 if it falls outside certain criteria based on the entire column
Value is 3 otherwise
Here is a very simple self-contained example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [np.nan, 1000000, 3, 4, 5, 0, -7, 9, 10],
                   'b': [2, 3, -4, 5, 6, 1000000, 7, 9, np.nan]})
print(df)
The transformation process created so far:
# Loop through and find points greater in magnitude than the mean -- in this simple example, these are the 'outliers'
outliers = pd.DataFrame()
for datapoint in df.columns:
    tempser = pd.DataFrame(df[datapoint][np.abs(df[datapoint]) > (df[datapoint].mean())])
    outliers = pd.merge(outliers, tempser, right_index=True, left_index=True, how='outer')
outliers[outliers.isnull() == False] = 2
# Classify everything else as "3"
df[df > 0] = 3
# Classify negative and zero points as a "1"
df[df <= 0] = 1
# Update with the outliers
df.update(outliers)
# Everything else is a "0"
df.fillna(value=0, inplace=True)
Resulting in:
     a    b
0  0.0  3.0
1  2.0  3.0
2  3.0  1.0
3  3.0  3.0
4  3.0  3.0
5  1.0  2.0
6  1.0  3.0
7  3.0  3.0
8  3.0  0.0
I have tried to use .applymap() and/or .groupby() in order to speed up the process, with no luck. I found some guidance in this answer; however, I'm still unsure how .groupby() is useful when you're not grouping within a pandas column.
Here's a replacement for the outliers part. It's about 5x faster for your sample data on my computer.
>>> pd.DataFrame(np.where(np.abs(df) > df.mean(), 2, df), columns=df.columns)
    a   b
0 NaN   2
1   2   3
2   3  -4
3   4   5
4   5   6
5   0   2
6  -7   7
7   9   9
8  10 NaN
You could also do it with apply, though it will be slower than the np.where approach (approximately the same speed as what you are currently doing) while being much simpler. That's probably a good example of why you should avoid apply when possible if you care about speed.
>>> df[df.apply(lambda x: abs(x) > x.mean())] = 2
You could also do this, which is faster than apply but slower than np.where:
>>> mask = np.abs(df) > df.mean()
>>> df[mask] = 2
Of course, these things don't always scale linearly, so test them on your real data and see how that compares.
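Putting all four classification rules together in one pass is also possible; here is a sketch using np.select (my own assembly, not from the original answers). Conditions are checked in order, and the outlier test is placed before the sign test to mirror the original precedence, where df.update() lets outliers override the other classes:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1000000, 3, 4, 5, 0, -7, 9, 10],
                   'b': [2, 3, -4, 5, 6, 1000000, 7, 9, np.nan]})

conditions = [
    df.isnull(),                # -> 0: NaN
    np.abs(df) > df.mean(),     # -> 2: outlier relative to the column mean
    df <= 0,                    # -> 1: negative or zero
]
# Anything matching no condition falls through to the default, 3
result = pd.DataFrame(np.select(conditions, [0, 2, 1], default=3),
                      index=df.index, columns=df.columns)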
