Coming from R, the code would be
x <- data.frame(vals = c(100,100,100,100,100,100,200,200,200,200,200,200,200,300,300,300,300,300))
x$state <- cumsum(c(1, diff(x$vals) != 0))
Which marks every time the difference between rows is non-zero, so that I can use it to spot transitions in data, like so:
vals state
1 100 1
...
7 200 2
...
14 300 3
What would be a clean equivalent in Python?
Additional question
The answer to the original question is posted below, but won't work properly for a grouped dataframe with pandas.
Data here: https://pastebin.com/gEmPHAb7. Notice that there are 2 different filenames.
When imported as df_all I group it with the following, and then apply solution posted below.
df_grouped = df_all.groupby("filename")
df_all["state"] = (df_grouped['Fit'].diff() != 0).cumsum()
Using diff and cumsum, as in your R example:
df['state'] = (df['vals'].diff()!= 0).cumsum()
This uses the fact that True has integer value 1
Bonus question
df_grouped = df_all.groupby("filename")
df_all["state"] = (df_grouped['Fit'].diff() != 0).cumsum()
I think you misunderstand what groupby does. All groupby does is create groups based on the criterium (filename in this instance). You then need to tell add another operation to tell what needs to happen with this group.
Common operations are mean, sum, or more advanced as apply and transform.
You can find more information here or here
If you can explain more in detail what you want to achieve with the groupby I can help you find the correct method. If you want to perform the above operation per filename, you probably need something like this:
def get_state(group):
return (group.diff()!= 0).cumsum()
df_all['state'] = df_all.groupby('filename')['Fit'].transform(get_state)
Related
I’ve been struggling the past week trying to use apply to use functions over an entire pandas dataframe, including rolling windows, groupby, and especially multiple input columns and multiple output columns. I found a large amount of questions on SO about this topic and many old & outdated answers. So I started to create a notebook for every possible combination of x inputs & outputs, rolling, rolling & groupby combined and I focused on performance as well. Since I’m not the only one struggling with these questions I thought I’d provide my solutions here with working examples, hoping it helps any existing/future pandas-users.
Important notes
The combination of apply & rolling in pandas has a very strong output requirement. You have to return one single value. You can not return a pd.Series, not a list, not an array, not secretly an array within an array, but just one value, e.g. one integer. This requirement makes it hard to get a working solution when trying to return multiple outputs for multiple columns. I don’t understand why it has this requirement for 'apply & rolling', because without rolling 'apply' doesn’t have this requirement. Must be due to some internal pandas functions.
The combination of 'apply & rolling' combined with multiple input columns simply does not work! Imagine a dataframe with 2 columns, 6 rows and you want to apply a custom function with a rolling window of 2. Your function should get an input array with 2x2 values - 2 values of each column for 2 rows. But it seems pandas can’t handle rolling and multiple input columns at the same time. I tried to use the axis parameter to get it working but:
Axis = 0, will call your function per column. In the dataframe described above, it will call your function 10 times (not 12 because rolling=2) and since it’s per column, it only provides the 2 rolling values of that column…
Axis = 1, will call your function per row. This is what you probably want, but pandas will not provide a 2x2 input. It actually completely ignores the rolling and only provides one row with values of 2 columns...
When using 'apply' with multiple input columns, you can provide a parameter called raw (boolean). It’s False by default, which means the input will be a pd.Series and thus includes indexes next to the values. If you don’t need the indexes you can set raw to True to get a Numpy array, which often achieves a much better performance.
When combining 'rolling & groupby', it returns a multi-indexes series which can’t easily serve as an input for a new column. The easiest solution is to append a reset_index(drop=True) as answered & commented here (Python - rolling functions for GroupBy object).
You might ask me, when would you ever want to use a rolling, groupby custom function with multiple outputs!? Answer: I recently had to do a Fourier transform with sliding windows (rolling) over a dataset of 5 million records (speed/performance is important) with different batches within the dataset (groupby). And I needed to save both the power & phase of the Fourier transform in different columns (multiple outputs). Most people probably only need some of the basic examples below, but I believe that especially in the Machine Learning/Data-science sectors the more complex examples can be useful.
Please let me know if you have even better, clearer or faster ways to perform any of the solutions below. I'll update my answer and we can all benefit!
Code examples
Let’s create a dataframe first that will be used in all the examples below, including a group-column for the groupby examples.
For the rolling window and multiple input/output columns I just use 2 in all code examples below, but obviously this could be any number > 1.
df = pd.DataFrame(np.random.randint(0,5,size=(6, 2)), columns=list('ab'))
df['group'] = [0, 0, 0, 1, 1, 1]
df = df[['group', 'a', 'b']]
It will look like this:
group a b
0 0 2 2
1 0 4 1
2 0 0 4
3 1 0 2
4 1 3 2
5 1 3 0
Input 1 column, output 1 column
Basic
def func_i1_o1(x):
return x+1
df['c'] = df['b'].apply(func_i1_o1)
Rolling
def func_i1_o1_rolling(x):
return (x[0] + x[1])
df['d'] = df['c'].rolling(2).apply(func_i1_o1_rolling, raw=True)
Roling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
df['e'] = df.groupby('group')['c'].rolling(2).apply(func_i1_o1_rolling, raw=True).reset_index(drop=True)
Input 2 columns, output 1 column
Basic
def func_i2_o1(x):
return np.sum(x)
df['f'] = df[['b', 'c']].apply(func_i2_o1, axis=1, raw=True)
Rolling
As explained in point 2 in the notes above, there isn't a 'normal' solution for 2 inputs. The workaround below uses the 'raw=False' to ensure the input is a pd.Series, which means we also get the indexes next to the values. This enables us to get values from other columns at the correct indexes to be used.
def func_i2_o1_rolling(x):
values_b = x
values_c = df.loc[x.index, 'c'].to_numpy()
return np.sum(values_b) + np.sum(values_c)
df['g'] = df['b'].rolling(2).apply(func_i2_o1_rolling, raw=False)
Rolling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
df['h'] = df.groupby('group')['b'].rolling(2).apply(func_i2_o1_rolling, raw=False).reset_index(drop=True)
Input 1 column, output 2 columns
Basic
You could use a 'normal' solution by returning pd.Series:
def func_i1_o2(x):
return pd.Series((x+1, x+2))
df[['i', 'j']] = df['b'].apply(func_i1_o2)
Or you could use the zip/tuple combination which is about 8 times faster!
def func_i1_o2_fast(x):
return x+1, x+2
df['k'], df['l'] = zip(*df['b'].apply(func_i1_o2_fast))
Rolling
As explained in point 1 in the notes above, we need a workaround if we want to return more than 1 value when using rolling & apply combined. I found 2 working solutions.
1
def func_i1_o2_rolling_solution1(x):
output_1 = np.max(x)
output_2 = np.min(x)
# Last index is where to place the final values: x.index[-1]
df.at[x.index[-1], ['m', 'n']] = output_1, output_2
return 0
df['m'], df['n'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i1_o2_rolling_solution1, raw=False)
Pros: Everything is done within 1 function.
Cons: You have to create the columns first and it is slower since it doesn't use the raw input.
2
rolling_w = 2
nan_prefix = (rolling_w - 1) * [np.nan]
output_list_1 = nan_prefix.copy()
output_list_2 = nan_prefix.copy()
def func_i1_o2_rolling_solution2(x):
output_list_1.append(np.max(x))
output_list_2.append(np.min(x))
return 0
df['b'].rolling(rolling_w).apply(func_i1_o2_rolling_solution2, raw=True)
df['o'] = output_list_1
df['p'] = output_list_2
Pros: It uses the raw input which makes it about twice as fast. And since it doesn't use indexes to set the output values the code looks a bit more clear (to me at least).
Cons: You have to create the nan-prefix yourself and it takes a bit more lines of code.
Rolling & Groupby
Normally, I would use the faster 2nd solution above. However, since we're combining groups and rolling this means you'd have to manually set NaN's/zeros (depending on the number of groups) at the right indexes somewhere in the middle of the dataset. To me it seems that when combining rolling, groupby and multiple output columns, the first solution is easier and solves the automatic NaNs/grouping automatically. Once again, I use the reset_index solution at the end.
def func_i1_o2_rolling_groupby(x):
output_1 = np.max(x)
output_2 = np.min(x)
# Last index is where to place the final values: x.index[-1]
df.at[x.index[-1], ['q', 'r']] = output_1, output_2
return 0
df['q'], df['r'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i1_o2_rolling_groupby, raw=False).reset_index(drop=True)
Input 2 columns, output 2 columns
Basic
I suggest using the same 'fast' way as for i1_o2 with the only difference that you get 2 input values to use.
def func_i2_o2(x):
return np.mean(x), np.median(x)
df['s'], df['t'] = zip(*df[['b', 'c']].apply(func_i2_o2, axis=1))
Rolling
As I use a workaround for applying rolling with multiple inputs and I use another workaround for rolling with multiple outputs, you can guess I need to combine them for this one.
1. Get values from other columns using indexes (see func_i2_o1_rolling)
2. Set the final multiple outputs on the correct index (see func_i1_o2_rolling_solution1)
def func_i2_o2_rolling(x):
values_b = x.to_numpy()
values_c = df.loc[x.index, 'c'].to_numpy()
output_1 = np.min([np.sum(values_b), np.sum(values_c)])
output_2 = np.max([np.sum(values_b), np.sum(values_c)])
# Last index is where to place the final values: x.index[-1]
df.at[x.index[-1], ['u', 'v']] = output_1, output_2
return 0
df['u'], df['v'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i2_o2_rolling, raw=False)
Rolling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
def func_i2_o2_rolling_groupby(x):
values_b = x.to_numpy()
values_c = df.loc[x.index, 'c'].to_numpy()
output_1 = np.min([np.sum(values_b), np.sum(values_c)])
output_2 = np.max([np.sum(values_b), np.sum(values_c)])
# Last index is where to place the final values: x.index[-1]
df.at[x.index[-1], ['w', 'x']] = output_1, output_2
return 0
df['w'], df['x'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i2_o2_rolling_groupby, raw=False).reset_index(drop=True)
I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrmaes with informations on groups of the previous elements, say
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping in [my_df1,my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max=16
Bad=[]
for Sample in SampleList:
for n in Sample[1]['Group']:
df=Sample[0].loc[Sample[0]['Group']==n] #This is inelegant, but trying to work
#with Sample[1] in the for doesn't work
if (df['Value'].max()>my_max):
Bad.append(1)
else:
Bad.append(0)
Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
Sample[1] = Sample[1].query('Bad_Row == 0')
Which runs without errors, but doesn't work. In particular, this doesn't add the column Bad_Row to my df, nor modifies my DataFrame (but the query runs smoothly even if Bad_Rowcolumn doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do?
Based on your comment below, I think you are wanting to check if a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion of the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it will subset you initial data and use boolean logic to see if any of the 'Values' in your input data meet your specified criterion.
def my_func(row,grouped_df1):
if (grouped_df1.get_group(row['Group'])['Value']>16).any():
return 'Bad Row'
else:
return 'Good Row'
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x,grouped_df1), axis=1)
Returns:
Group Group_Value Bad_Row
0 1 57 Good Row
1 2 63 Bad Row
Based on dubbbdan idea, there is a code that works:
my_max=16
def my_func(row,grouped_df1):
if (grouped_df1.get_group(row['Group'])['Value']>my_max).any():
return 1
else:
return 0
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
for Sample in SampleList:
grouped_df = Sample[0].groupby('Group')
Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x,grouped_df), axis=1)
Sample[1].drop(Sample[1][Sample[1]['Bad_Row']!=0].index, inplace=True)
Sample[1].drop(['Bad_Row'], axis = 1, inplace = True)
I encountered this lambda expression today and can't understand how it's used:
data["class_size"]["DBN"] = data["class_size"].apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
The line of code doesn't seem to call the lambda function or pass any arguments into it so I'm confused how it does anything at all. The purpose of this is to take two columns CSD and SCHOOL CODE and combine the entries in each row into a new row, DBN. So does this lambda expression ever get used?
You're writing your results incorrectly to a column. data["class_size"]["DBN"] is not the correct way to select the column to write to. You've also selected a column to use apply with but you'd want that across the entire dataframe.
data["DBN"] = data.apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
the apply method of a pandas Series takes a function as one of its arguments.
here is a quick example of it in action:
import pandas as pd
data = {"numbers":range(30)}
def cube(x):
return x**3
df = pd.DataFrame(data)
df['squares'] = df['numbers'].apply(lambda x: x**2)
df['cubes'] = df['numbers'].apply(cube)
print df
gives:
numbers squares cubes
0 0 0 0
1 1 1 1
2 2 4 8
3 3 9 27
4 4 16 64
...
as you can see, either defining a function (like cube) or using a lambda function works perfectly well.
As has already been pointed out, if you're having problems with your particular piece of code it's that you have data["class_size"]["DBN"] = ... which is incorrect. I was assuming that was an odd typo because you didn't mention getting a key error, which is what that would result in.
if you're confused about this, consider:
def list_apply(func, mylist):
newlist = []
for item in mylist:
newlist.append(func(item))
this is a (not very efficient) function for applying a function to every item in a list. if you used it with cube as before:
a_list = range(10)
print list_apply(cube, a_list)
you get:
[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
this is a simplistic example of how the apply function in pandas is implemented. I hope that helps?
Are you using a multi-index dataframe (i.e. There are column hierarchies)? It's hard to tell without seeing your data, but I'm presuming it is the case, since just using data["class_size"].apply() would yield a series on a normal dataframe (meaning the lambda wouldn't be able to find your columns specified and then there would be an error!)
I actually found this answer which explains the problem of trying to create columns in multi-index dataframes, one confusing things with multi-index column creation is that you can try to create a column like you are doing and it will seem to run without any issues, but won't actually create what you want. Instead, you need to change data["class_size"]["DBN"] = ... to data["class_size", "DBN"] = ... So, in full:
data["class_size","DBN"] = data["class_size"].apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
Of course, if it isn't a mult-index dataframe then this won't help, and you should look towards one of the other answers.
I think 0:02d means 2 decimal place for "CSD" value. {}{} basically places the 2 values together to form 'DBN'.
I'd like to do some math on a series vector. I'd like to take the difference between two rows in a vector. My first intuition was:
def row_diff(prev, next):
return(next - prev)
and then using it
my_col_vec.apply(row_diff)
but this doesn't do what I'd like. It appears apply is row-wise, which is fine, but I can't seem to find an equivalent operation that will allow me to easy create a new vector from the old one by subtracting the previous row from the next.
Is there a better way to do this? I've been reading this document and it doesn't look like it.
Thanks!
To calculate inter-row differences use diff:
In [6]:
df = pd.DataFrame({'a':np.random.rand(5)})
df
Out[6]:
a
0 0.525220
1 0.031826
2 0.260853
3 0.273792
4 0.281368
In [7]:
df['diff'] = df['a'].diff()
df
Out[7]:
a diff
0 0.525220 NaN
1 0.031826 -0.493394
2 0.260853 0.229027
3 0.273792 0.012940
Also please try to avoid using apply as there is usually a vectorised method available
I'm working on a script that takes in an address and spits out two values: coordinates (as a list) and result (whether the geocoding was successful or not. This works fine, but since the data is returned as a list, I then have to assign new columns based on the indices of that list, which works but returns a warning:
A value is trying to be set on a copy of a slice from a DataFrame
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy.
EDIT: Just to be clear, I think I understand from that page that I should be using .loc to access the nested values. My question is more along the lines of generating two columns directly from a function as opposed to this workaround of having to dig the information out later.
I'd like to know the correct way to approach problems like these, as I actually have this problem twice in this project.
The actual specifics of the problem aren't important, so here's a simple example of how I've been approaching it:
def geo(address):
location = geocode(address)
result = location.result
coords = location.coords
return coords, result
df['output'] = df['address'].apply(geo)
Since this then yields a nested list into my df column, I then extract that into new columns as such:
df['coordinates'] = None
df['gps_status'] = None
for index, row in df.iterrows():
df['coordinates'][index] = df['output'][index][0]
df['gps_status'][index] = df['output'][index][1]
And again, I get the warning:
A value is trying to be set on a copy of a slice from a DataFrame
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Any advice on the correct way to do this would be appreciated.
Usually you want to avoid iterrows() since it is faster to operate on an entire column at once. You can assign the result from output directly to a new column.
import pandas as pd
def geo(x):
return x*2, x*3
df = pd.DataFrame({'address':[1,2,3]})
output = df['address'].apply(geo)
df['a'] = [x[0] for x in output]
df['b'] = [x[1] for x in output]
gives you
address a b
0 1 2 3
1 2 4 6
2 3 6 9
with no copy warning.
Your function should return a Series:
def geo(address):
location = geocode(address)
result = location.result
coords = location.coords
return pd.Series([coords, result], ['coordinates', 'gps_status'])
df['output'] = df['address'].apply(geo)
That said, this may be better written as a merge.