Using pandas apply and in-place dataframe modification - python

I have a dataframe like the one below and want to turn it into the result df shown underneath, using the function below via pandas' 'apply' method.
As far as I know, 'apply' returns a new series rather than modifying the original df in place.
id  a  b
---------
a   1  4
b   2  5
c   6  2
if df['a'] > df['b']:
    df['a'] = df['b']
else:
    df['b'] = df['a']
result df:
id  a  b
---------
a   4  4
b   5  5
c   6  6

I am not sure what you need, since the expected output is different from your condition; here I can only fix your code:
for x, y in df.iterrows():
    if y['a'] > y['b']:
        df.loc[x, 'a'] = df.loc[x, 'b']
    else:
        df.loc[x, 'b'] = df.loc[x, 'a']

df
Out[40]:
  id  a  b
0  a  1  1
1  b  2  2
2  c  2  2
If I understand your problem correctly:

import numpy as np

df.assign(**dict.fromkeys(['a', 'b'], np.where(df.a > df.b, df.a, df.b)))
Out[43]:
  id  a  b
0  a  4  4
1  b  5  5
2  c  6  6
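For reference, here is a minimal, self-contained sketch of the same np.where idea that writes the row-wise maximum back into both columns; the imports and the example frame are assumptions added for illustration, not part of the original answer:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': list('abc'), 'a': [1, 2, 6], 'b': [4, 5, 2]})

# Row-wise maximum of the two columns, assigned back to both.
row_max = np.where(df['a'] > df['b'], df['a'], df['b'])
df['a'] = row_max
df['b'] = row_max
print(df)
#   id  a  b
# 0  a  4  4
# 1  b  5  5
# 2  c  6  6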

Like the others, I'm not totally sure what you're trying to do, so I'm going to ASSUME you mean to set both the "a" and "b" values in each row to the higher of the two values in that row. If that assumption is correct, here's how that can be done with ".apply()".
First, most "clean" applications of ".apply()" (remembering that ".apply()" is generally never recommended) use a function that takes the row fed to it by ".apply()" and returns that same object, modified as needed. With your dataframe in mind, here is a function to achieve the desired output, followed by its application to the dataframe using ".apply()".
# Create the function to be used within .apply()
def comparer(row):
    if row["a"] > row["b"]:
        row["b"] = row["a"]
    elif row["b"] > row["a"]:
        row["a"] = row["b"]
    return row

# Run the function against each row (axis=1) and rebind "df" to the modified result.
df = df.apply(comparer, axis=1)
Most, if not all, pandas users seem to rail against ".apply()" usage, however. I'd probably heed their wisdom :)

Try:
df = pd.DataFrame({'a': [1, 2, 6], 'b': [4,5,2]})
df['a'] = df.max(axis=1)
df['b'] = df['a']
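Note this assumes the frame contains only the numeric columns. If the string id column from the question is present, a sketch that restricts the row-wise max to the two numeric columns avoids mixing types:

import pandas as pd

df = pd.DataFrame({'id': list('abc'), 'a': [1, 2, 6], 'b': [4, 5, 2]})
df['a'] = df[['a', 'b']].max(axis=1)  # row-wise max over the numeric columns only
df['b'] = df['a']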

Pandas: Dataframe itertuples boolean series groupby optimization

I'm new to Python.
I have a data frame (DF), for example:
id  type
1   A
1   B
2   C
2   B
I would like to add a column, A_flag, computed per id group.
In the end I want the data frame (DF) to look like:
id  type  A_flag
1   A     1
1   B     1
2   C     0
2   B     0
I can do this in two steps:
DF['A_flag_tmp'] = [1 if x.type=='A' else 0 for x in DF.itertuples()]
DF['A_flag'] = DF.groupby(['id'])['A_flag_tmp'].transform(np.max)
It works, but it's very slow for a big data frame.
Is there any way to optimize this case?
Thanks for the help.
Change your slow iterative code into fast vectorized code by replacing the first step with a Pandas built-in function that generates the boolean series, e.g.
df['type'].eq('A')
Then, you can attach it to the groupby statement for the second step, as follows:
df['A_flag'] = df['type'].eq('A').groupby(df['id']).transform('max').astype(int)
Result
print(df)
id type A_flag
0 1 A 1
1 1 B 1
2 2 C 0
3 2 B 0
In general, if you have more complicated conditions, you can also define them in a vectorized way, e.g. define the boolean series m by:
m = df['type'].eq('A') & df['type1'].gt(1) | (df['type2'] != 0)
Then, use it in step 2 as follows:
m.groupby(df['id']).transform('max').astype(int)
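Put together, a runnable sketch of this general pattern (the type1 and type2 columns here are hypothetical, invented purely for illustration):

import pandas as pd

df = pd.DataFrame({
    'id':    [1, 1, 2, 2],
    'type':  ['A', 'B', 'C', 'B'],
    'type1': [2, 0, 1, 3],   # hypothetical column, illustration only
    'type2': [0, 0, 1, 0],   # hypothetical column, illustration only
})

m = df['type'].eq('A') & df['type1'].gt(1) | (df['type2'] != 0)
df['flag'] = m.groupby(df['id']).transform('max').astype(int)
print(df)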

Call a function on an object and assign the return to the same object at the same time

Let's say I want to convert a series of date strings to datetime using the following:
>>> import pandas as pd
>>> dataframe.loc[:, 'DATE'] = pd.to_datetime(dataframe.loc[:, 'DATE'])
Now, I see dataframe.loc[:, 'DATE'] as redundant. Is it possible in python that I call a function on an object and assign the return to the same object at the same time?
Something that looks like:
>>> pd.to_datetime(dataframe.loc[:,'DATE'], +)
or
dataframe.loc[:,'DATE'] += pd.to_datetime()
where + (or whatever) assigns the return of the function to its first argument
This question might be due to my lack of understanding on how programming languages are written/function, so please be gentle.
There is no such thing. But you can achieve the same with:
name = 'DATE'
dataframe[name] = pd.to_datetime(dataframe[name])
No need for .loc
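If you do this kind of column-in-place update often, a tiny helper keeps the repetition down. A minimal sketch; the name apply_to_column is made up for illustration, not a pandas API:

import pandas as pd

def apply_to_column(df, name, func):
    # Apply func to one column and write the result back to the same column.
    df[name] = func(df[name])

df = pd.DataFrame({'DATE': ['2020-01-01', '2020-06-15']})
apply_to_column(df, 'DATE', pd.to_datetime)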
Some methods support an inplace=True keyword argument.
For example, sorting a dataframe gives you new one:
>>> df = pd.DataFrame({'DATE': [10, 7, 1, 2, 3]})
>>> df.sort_values('DATE')
DATE
2 1
3 2
4 3
1 7
0 10
The original remains unchanged:
>>> df
DATE
0 10
1 7
2 1
3 2
4 3
Setting inplace=True modifies the original df:
>>> df.sort_values('DATE', inplace=True)
>>> df
DATE
2 1
3 2
4 3
1 7
0 10
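Note that plain reassignment reaches the same end state without inplace=True, and is the style many pandas users now prefer (a one-line sketch):

df = df.sort_values('DATE')  # rebinding the name has the same net effect as inplace=True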
The closest Pandas gets to this is the ad-hoc inplace keyword argument that exists for a good portion of DataFrame methods.
For example, an in-place datetime operation happens to be hidden in the set_index functionality:
df.set_index(df['Date'], inplace=True)

How is index modified by dataframe.sort_values()

I have a dataframe that I want to sort on one of my columns (a date).
However, I run a loop over the index (while i < df.shape[0]), and I need the loop to traverse the dataframe once it is sorted by date.
Is the current index modified accordingly by the sorting, or should I use df.reset_index()?
Maybe I'm not understanding the question, but a simple check shows that sort_values does modify the index: the original labels travel with their rows rather than being renumbered.
df = pd.DataFrame({'x':['a','c','b'], 'y':[1,3,2]})
df = df.sort_values(by = 'x')
Yields:
x y
0 a 1
2 b 2
1 c 3
And a subsequent:
df = df.reset_index(drop = True)
Yields:
x y
0 a 1
1 b 2
2 c 3
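Concretely, for the loop in the question: positional access with .iloc follows the new row order regardless of the labels, while .loc keeps using the old labels, so either iterate with .iloc or call reset_index first. A small sketch:

import pandas as pd

df = pd.DataFrame({'x': ['a', 'c', 'b'], 'y': [1, 3, 2]}).sort_values(by='x')

i = 0
while i < df.shape[0]:
    print(df.iloc[i]['x'])  # prints a, b, c - the sorted order
    i += 1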

Adding new columns to DataFrame Python. SettingWithCopyWarning

I'm trying to add a new column to a data frame. I have a column of dates, I turn it into seconds since epoch, and I add that to a new column of the data frame:
def addEpochTime(df):
    df[7] = np.nan  # Adding empty column.
    for n in range(0, len(df)):  # Writing to empty column.
        df[7][n] = df[0][n] - 5  # Conduct some mathematical mutations...

addEpochTime(df)
What I've written above works, but I do get a warning: SettingWithCopyWarning.
My question is: how can I add a new column to a data frame and write data to it?
I don't fully understand the way data frames are indexed, despite having read about it in the pandas documentation.
Since you say -
I have column of dates, I turn it into seconds-since-epoch and add that to a new column of the data frame
If what you are actually doing is as simple as df[7][n] = df[0][n] - 5, then you can use the Series.apply method to do the same thing. In your case:
def addEpochTime(df):
    df[7] = df[0].apply(lambda x: x - 5)
The .apply method accepts a function as its parameter; the function is passed the value of each row and should return the value after applying the logic.
You can also pass .apply() a function that accepts the date as a parameter and returns the seconds since epoch, which might be what you are looking for.
Example -
In [4]: df = pd.DataFrame([[1,2],[3,4]],columns=['A','B'])
In [5]: df
Out[5]:
A B
0 1 2
1 3 4
In [6]: df['C'] = df['A'].apply(lambda x: x-5)
In [7]: df
Out[7]:
A B C
0 1 2 -4
1 3 4 -2
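Since the question mentions seconds since epoch, here is a hedged sketch of that specific conversion; it assumes column 0 holds datetime-like strings, which is an assumption about the original data:

import pandas as pd

def add_epoch_time(df):
    # Parse to datetime, then express as whole seconds since the Unix epoch.
    dates = pd.to_datetime(df[0])
    df[7] = dates.astype('int64') // 10**9

df = pd.DataFrame({0: ['2020-01-01', '2021-06-15']})
add_epoch_time(df)
print(df)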
You can do it in a single line and avoid the warning:
df

   a
0  1
1  2

df['b'] = df['a'] - 5
df

   a  b
0  1 -4
1  2 -3

Apply function to pandas DataFrame that can return multiple rows

I am trying to transform a DataFrame such that some of the rows will be replicated a given number of times. For example:
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count':[1,0,2]})
class count
0 A 1
1 B 0
2 C 2
should be transformed to:
class
0 A
1 C
2 C
This is the reverse of aggregation with the count function. Is there an easy way to achieve it in pandas (without using for loops or list comprehensions)?
One possibility might be to allow the DataFrame.applymap function to return multiple rows (akin to the apply method of GroupBy). However, I do not think it is possible in pandas now.
You could use groupby:
def f(group):
    row = group.iloc[0]  # originally group.irow(0); irow has since been removed from pandas
    return pd.DataFrame({'class': [row['class']] * row['count']})

df.groupby('class', group_keys=False).apply(f)
so you get
In [25]: df.groupby('class', group_keys=False).apply(f)
Out[25]:
class
0 A
0 C
1 C
You can fix the index of the result however you like
I know this is an old question, but I was having trouble getting Wes' answer to work for multiple columns in the dataframe, so I made his code a bit more generic. Thought I'd share in case anyone else stumbles on this question with the same problem.
You basically just specify which column holds the counts, and you get an expanded dataframe in return.
import pandas as pd

df = pd.DataFrame({'class 1': ['A', 'B', 'C', 'A'],
                   'class 2': [1, 2, 3, 1],
                   'count':   [3, 3, 3, 1]})
print(df, "\n")

def f(group, *args):
    row = group.iloc[0]  # originally group.irow(0); irow has since been removed from pandas
    Dict = {}
    row_dict = row.to_dict()
    for item in row_dict:
        Dict[item] = [row[item]] * row[args[0]]
    return pd.DataFrame(Dict)

def ExpandRows(df, WeightsColumnName):
    df_expand = df.groupby(df.columns.tolist(), group_keys=False).apply(f, WeightsColumnName).reset_index(drop=True)
    return df_expand

df_expanded = ExpandRows(df, 'count')
print(df_expanded)
Returns:
class 1 class 2 count
0 A 1 3
1 B 2 3
2 C 3 3
3 A 1 1
class 1 class 2 count
0 A 1 1
1 A 1 3
2 A 1 3
3 A 1 3
4 B 2 3
5 B 2 3
6 B 2 3
7 C 3 3
8 C 3 3
9 C 3 3
With regards to speed: my base df is 10 columns by ~6k rows, and when expanded to ~100,000 rows it takes ~7 seconds. I'm not sure in this case whether grouping is necessary or wise, since it groups on all the columns, but hey, it's only 7 seconds.
There is an even simpler and significantly more efficient solution.
I had to make a similar modification to a table of about 3.5M rows, and the previously suggested solutions were extremely slow.
A better way is to use numpy's repeat procedure for generating a new index in which each row index is repeated multiple times according to its given count, and use iloc to select rows of the original table according to this index:
import pandas as pd
import numpy as np
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
spread_ixs = np.repeat(range(len(df)), df['count'])
spread_ixs
array([0, 2, 2])
df.iloc[spread_ixs, :].drop(columns='count').reset_index(drop=True)
class
0 A
1 C
2 C
This question is very old and the answers do not reflect pandas' modern capabilities. You can use iterrows to loop over every row and then use the DataFrame constructor to create new DataFrames with the correct number of rows. Finally, use pd.concat to concatenate all the rows together.
pd.concat([pd.DataFrame(data=[row], index=range(row['count']))
for _, row in df.iterrows()], ignore_index=True)
class count
0 A 1
1 C 2
2 C 2
This has the benefit of working with any size DataFrame.
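For completeness, current pandas also makes this a one-liner with Index.repeat, which avoids both groupby and Python-level loops (a small sketch):

import pandas as pd

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
out = (df.loc[df.index.repeat(df['count'])]
         .drop(columns='count')
         .reset_index(drop=True))
print(out)
#   class
# 0     A
# 1     C
# 2     C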
