I've been wondering how to solve the following problem. Say I have a dataframe df, which looks like this:
Name quantity price
A 1 10.0
A 3 26.0
B 1 15.0
B 3 30.0
...
Now, say I want to extrapolate the price by quantity and, for each Name, create a row for each quantity = 1, 2, 3, where the price is some function of the list of available quantities and their respective prices. I.e., say I have a function extrapolate(qts, prices, n) that computes a price for quantity = n based on the known qts and prices; then the result would look like:
Name quantity price
A 1 10.0
A 2 extrapolate([1, 3], [10.0, 26.0], 2)
A 3 26.0
B 1 15.0
B 2 extrapolate([1, 3], [15.0, 30.0], 2)
B 3 30.0
...
I would appreciate some insight on how to achieve this, or a reference to learn more about how groupby can be used for this case.
Thank you in advance.
What you want is called missing data imputation. There are many approaches to it.
You may want to check the package called fancyimpute. It offers data imputation using MICE, which seems to do what you want.
Other than that, if your case is as simple in structure as the example, you can always groupby('Name').mean() and you will get the middle value for each subgroup.
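A rough sketch of that last idea (assuming, as in the example, the known quantities per Name are 1 and 3, so their mean price is exactly the quantity-2 price):
# mean price per Name; with quantities 1 and 3 this lands on the quantity-2 value
df.groupby('Name')['price'].mean()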
The following should do what you described:
def get_extrapolate_val(group, qts, prices, n):
    # do your actual calculations here; for now it returns just a dummy value
    some_value = (group[qts] * group[prices]).sum() / n
    return some_value
# some definitions
n = 2
quan_col = 'quantity'
price_col = 'price'
First we group by Name and then apply the function get_extrapolate_val to each group, passing the additional column names and n as arguments. As this returns a Series object, we need an additional reset_index and rename, which makes the concatenation easier.
new_stuff = df.groupby('Name').apply(get_extrapolate_val, quan_col, price_col, n).reset_index().rename(columns={0: price_col})
Add n as an additional column:
new_stuff[quan_col] = n
We concatenate the two dataframes and are done
final_df = pd.concat([df, new_stuff]).sort_values(['Name', quan_col]).reset_index(drop=True)
Name price quantity
0 A 10.0 1
1 A 44.0 2
2 A 26.0 3
3 B 15.0 1
4 B 52.5 2
5 B 30.0 3
The values I now added in are of course meaningless but are just there to illustrate the method.
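For instance, if simple linear inter-/extrapolation is good enough, the dummy body could be swapped for np.interp (just a sketch; note that np.interp expects the quantities within each group to be sorted ascending and it clips, rather than extrapolates, outside the known range):
import numpy as np

def get_extrapolate_val(group, qts, prices, n):
    # linearly interpolate the price at quantity n from the known (quantity, price) pairs
    return np.interp(n, group[qts], group[prices])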
OLD version
Assuming that there is always only 1 and 3 in your quantity column, the following should work:
new_stuff = df.groupby('Name', as_index=False)['price'].mean()
This gives
Name price
0 A 18.0
1 B 22.5
That - as written - assumes that it is always only 1 and 3, so we can simply calculate the mean.
Then we add the 2
new_stuff['quantity'] = 2
and concatenate the two dataframes with an additional sorting
pd.concat([df, new_stuff]).sort_values(['Name', 'quantity']).reset_index(drop=True)
which gives the desired outcome
Name price quantity
0 A 10.0 1
1 A 18.0 2
2 A 26.0 3
3 B 15.0 1
4 B 22.5 2
5 B 30.0 3
There are probably far more elegant ways to do this though...
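For completeness, here is one more compact variant (a sketch only, assuming the quantities should always span 1 to 3 and linear interpolation between the known prices is acceptable):
# reindex each Name's prices onto quantities 1..3 and fill the gap by linear interpolation
result = (df.set_index('quantity')
            .groupby('Name')['price']
            .apply(lambda s: s.reindex(pd.RangeIndex(1, 4, name='quantity')).interpolate())
            .reset_index())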
Related
Using Pandas 1.1.5, I have a test DataFrame like the following:
import numpy as np
import pandas as pd
df = pd.DataFrame({'id': ['a0','a0','a0','a1','a1','a1','a2','a2'],
'a': [4,5,6,1,2,3,7,9],
'b': [3,4,5,3,2,4,1,3],
'c': [7,4,3,8,9,7,4,6],
'denom_a': [7,8,9,7,8,9,7,8],
'denom_b': [10,11,12,10,11,12,10,11]})
I would like to apply the following custom aggregate function on a rolling window where the function's calculation depends on the column name as so:
def custom_func(s, df, colname):
    if 'a' in colname:
        denom = df.loc[s.index, "denom_a"]
        calc = s.sum() / np.max(denom)
    elif 'b' in colname:
        denom = df.loc[s.index, "denom_b"]
        calc = s.sum() / np.max(denom)
    else:
        calc = s.mean()
    return calc
df.groupby('id')\
  .rolling(2, 1)\
  .apply(lambda x: custom_func(x, df, x.name))
This results in TypeError: argument of type 'NoneType' is not iterable because the windowed subsets of each column do not retain the names of the original df columns. That is, x.name being passed in as an argument is in fact passing None rather than a string of the original column name.
Is there some way of making this approach work (say, retaining the column name being acted on with apply and passing that into the function)? Or are there any suggestions for altering it? I consulted the following reference for having the custom function utilize multiple columns within the same window calculation, among others:
https://stackoverflow.com/a/57601839/6464695
I wouldn't be surprised if there's a "better" solution, but I think this could at least be a "good start" (I don't do a whole lot with .rolling(...)).
With this solution, I make two critical assumptions:
All denom_<X> have a corresponding <X> column.
Everything you do with the (<X>, denom_<X>) pairs is the same. (This should be straightforward to customize as needed.)
With that said, I do the .rolling within the function, rather than outside, in part because it seems like .apply(...) on a RollingGroupBy can only work column-wise, which isn't too helpful here (imo).
from typing import List, Tuple

def cust_fn(df: pd.DataFrame, rolling_args: Tuple) -> pd.DataFrame:
    cols = df.columns
    denom_cols = ["id"]  # the whole dataframe is passed, so place identifiers / uncomputable variables here
    for denom_col in cols[cols.str.startswith("denom_")]:
        denom_cols += [denom_col, denom_col.replace("denom_", "")]
        col = denom_cols[-1]  # sugar
        df[f"calc_{col}"] = df[col].rolling(*rolling_args).sum() / df[denom_col].max()
    for col in cols[~cols.isin(denom_cols)]:
        # print(col, df[col])  # optional debugging aid
        df[f"calc_{col}"] = df[col].rolling(*rolling_args).mean()
    return df
Then the way you'd go about running this is the following (and you get the corresponding output):
>>> df.groupby("id").apply(cust_fn, rolling_args=(2, 1))
id a b c denom_a denom_b calc_a calc_b calc_c
0 a0 4 3 7 7 10 0.444444 0.250000 7.0
1 a0 5 4 4 8 11 1.000000 0.583333 5.5
2 a0 6 5 3 9 12 1.222222 0.750000 3.5
3 a1 1 3 8 7 10 0.111111 0.250000 8.0
4 a1 2 2 9 8 11 0.333333 0.416667 8.5
5 a1 3 4 7 9 12 0.555556 0.500000 8.0
6 a2 7 1 4 7 10 0.875000 0.090909 4.0
7 a2 9 3 6 8 11 2.000000 0.363636 5.0
If you need to dynamically state which non-numeric/uncomputable columns exist, then it might make sense to define cust_fn as follows:
def cust_fn(df: pd.DataFrame, rolling_args: Tuple, index_cols: List = []) -> pd.DataFrame:
    cols = df.columns
    denom_cols = index_cols
    # ... the rest is unchanged
Then you would adapt your calling of cust_fn as follows:
>>> df.groupby("id").apply(cust_fn, rolling_args=(2, 1), index_cols=["id"])
Of course, comment on this if you run into issues adapting it to your uses. 🙂
I have data like below:
id movie details value
5 cane1 good 6
5 wind2 ok 30.3
5 wind1 ok 18
5 cane1 good 2
5 cane22 ok 4
5 cane34 good 7
5 wind2 ok 2
I want the output with below criteria:
If movie name starts with 'cane' - sum the value
If movie name starts with 'wind' - count the occurrence.
So - the final output will be:
id movie value
5 cane1 8
5 cane22 4
5 cane34 7
5 wind1 1
5 wind2 2
I tried to use:
movie_df.groupby(['id']).apply(aggr)
def aggr(x):
    if x['movie'].str.startswith('cane'):
        y = x.groupby(['value']).sum()
    else:
        y = x.groupby(['movie']).count()
    return y
But it's not working. Can anyone please help?
You should aim for vectorised operations where possible.
You can calculate 2 results and then concatenate them.
mask = df['movie'].str.startswith('cane')
df1 = df[mask].groupby('movie')['value'].sum()
df2 = df[~mask].groupby('movie').size()
res = pd.concat([df1, df2], ignore_index=False)\
        .rename('value').reset_index()
print(res)
movie value
0 cane1 8.0
1 cane22 4.0
2 cane34 7.0
3 wind1 1.0
4 wind2 2.0
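If you want plain integers in the result, as in the desired output, you can cast at the end (assuming no NaN values survive the concatenation):
res['value'] = res['value'].astype(int)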
There might be multiple ways of doing this. One way would be to filter by the start of the movie name first, and then aggregate and concatenate afterwards.
cane = movie_df[movie_df['movie'].str.startswith('cane')]
wind = movie_df[movie_df['movie'].str.startswith('wind')]
cane_sum = cane.groupby(['id']).agg({'movie':'first', 'value':'sum'}).reset_index()
wind_count = wind.groupby(['id']).agg({'movie':'first', 'value':'count'}).reset_index()
pd.concat([cane_sum, wind_count])
First of all, you need to perform a string operation. I guess in your case you don't want digits in the movie name. Use the solution discussed at pandas applying regex to replace values.
Then you call groupby() on the new series.
FYI: some movie names have digits only; in that case, you need to use the update function. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html
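A rough sketch of the idea described here (my reading of this answer, not tested against the original data; it groups on the movie name with the digits stripped rather than on the full name):
# strip the digits from the movie name, then sum values for 'cane' groups and count rows for 'wind' groups
stripped = movie_df['movie'].str.replace(r'\d+', '', regex=True)
result = movie_df.groupby(stripped).apply(
    lambda g: g['value'].sum() if g.name.startswith('cane') else g['movie'].count()
)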
I would start by creating a column which defines the required groups. For the example at hand this can be done with
df['group'] = df.movie.transform(lambda x : x[:4])
The next step would be to group by this column
df.groupby('group').apply(agg_fun)
using the following aggregation function
def agg_fun(grp):
    if grp.name == "cane":
        value = grp.value.sum()
    else:
        value = grp.value.count()
    return value
The output of this code is
group
cane 19.0
wind 3.0
UPDATE:
Please download my full dataset here.
my datatype is:
>>> df.dtypes
increment int64
spread float64
SYM_ROOT category
dtype: object
I have realized that the problem might have been caused by the fact that my SYM_ROOT is a category variable.
To replicate the issue you might want to do the following first:
df=pd.read_csv("sf.csv")
df['SYM_ROOT']=df['SYM_ROOT'].astype('category')
But I am still puzzled as to why my SYM_ROOT results in the gaps in increment being filled with NA, unless grouping by a category and an integer column produces a balanced panel by default.
I noticed that the behaviour of pd.groupby().last is different from that of pd.groupby().tail(1).
For example, suppose I have the following data:
increment is an integer that spans from 0 to 4680. However, for some SYM_ROOT values there are gaps in between; for example, 4 could be missing.
What I want to do is to keep the last observation per group.
If I do df.groupby(['SYM_ROOT','increment']).last(), the dataframe becomes:
While if I do df.groupby(['SYM_ROOT','increment']).tail(1), the dataframe becomes:
It looks to me that the last() statement will create a balanced time-series data and fill in the gaps with NaN, while the tail(1) statement doesn't. Is it correct?
Update:
Your grouping column is a category dtype; here is a small reproduction:
df=pd.DataFrame({'A':[1,1,2,2],'B':[1,1,2,3],'C':[1,1,1,1]})
df.B=df.B.astype('category')
df.groupby(['A','B']).last()
Out[590]:
C
A B
1 1 1.0
2 NaN
3 NaN
2 1 NaN
2 1.0
3 1.0
When you use tail, it will not create the missing levels, since tail operates on the dataframe as a whole rather than on single columns:
df.groupby(['A','B']).tail(1)
Out[593]:
A B C
1 1 1 1
2 2 2 1
3 2 3 1
After changing it back using astype:
df.B=df.B.astype('int')
df.groupby(['A','B']).last()
Out[591]:
C
A B
1 1 1
2 2 1
3 1
This is actually tracked as an issue here on GitHub; the problem is mainly caused by groupby expanding categorical group keys to all their possible values.
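As a side note (an addition of mine, assuming a reasonably recent pandas): you can also keep the category dtype and pass observed=True to groupby, so that only the category combinations actually present in the data are used:
# group only on observed category combinations, avoiding the NaN-filled rows
df.groupby(['A', 'B'], observed=True).last()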
If I have a pandas dataframe such as:
timestamp label value new
etc. a 1 3.5
b 2 5
a 5 ...
b 6 ...
a 2 ...
b 4 ...
I want the new column to be the average of the last two a's and the last two b's... so for the first row it would be the average of 5 and 2, giving 3.5. It will be sorted by the timestamp. I know I could use a groupby to get the average of all the a's or all the b's, but I'm not sure how to get an average of just the last two. I'm kinda new to Python and coding, so this might not even be possible.
Edit: I should also mention this is not for a class or anything; it is just something I'm doing on my own, and it will run on a very large dataset. I'm just using this as an example. Also, I would want each a and each b row to have its own value for the last-two average, so the new column will have the same length as the others. So for the third line it would be the average of 2 and whatever the next a would be in the dataset.
IIUC one way (among many) to do that:
In [139]: df.groupby('label').tail(2).groupby('label').mean().reset_index()
Out[139]:
label value
0 a 3.5
1 b 5.0
Edited to reflect a change in the question specifying the last two, not the ones following the first, and that you wanted the same dimensionality with values repeated.
import pandas as pd
data = {'label': ['a','b','a','b','a','b'], 'value':[1,2,5,6,2,4]}
df = pd.DataFrame(data)
grouped = df.groupby('label')
results = {'label':[], 'tail_mean':[]}
for item, grp in grouped:
    subset_mean = grp.tail(2).mean()[0]
    results['label'].append(item)
    results['tail_mean'].append(subset_mean)
res_df = pd.DataFrame(results)
df = df.merge(res_df, on='label', how='left')
Outputs:
>> res_df
label tail_mean
0 a 3.5
1 b 5.0
>> df
label value tail_mean
0 a 1 3.5
1 b 2 5.0
2 a 5 3.5
3 b 6 5.0
4 a 2 3.5
5 b 4 5.0
Now you have a dataframe of your results only, if you need them, plus a column with it merged back into the main dataframe. Someone else posted a more succinct way to get to the results dataframe; probably no reason to do it the longer way I showed here unless you also need to perform more operations like this that you could do inside the same loop.
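As a side note, if all you need is the repeated per-label value in the original dataframe, a groupby transform could replace the loop and merge (a sketch, assuming "last two" means the last two rows per label in the current row order):
# broadcast each label's mean of its last two values back onto every row of that label
df['tail_mean'] = df.groupby('label')['value'].transform(lambda s: s.tail(2).mean())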
I'm trying to do an apparently simple operation in Python:
I have some datasets, say 6, and I want to sum the values of one column if the values of the other two columns coincide. After that, I want to divide the summed column by the number of datasets I have, in this case 6 (i.e. calculate the arithmetic mean). Also, I want to add 0 when the values of the other columns don't coincide.
I write down here two dataframes, as example:
Code1 Code2 Distance
0 15.0 15.0 2
1 15.0 60.0 3
2 15.0 69.0 2
3 15.0 434.0 1
4 15.0 842.0 0
Code1 Code2 Distance
0 14.0 15.0 4
1 14.0 60.0 7
2 15.0 15.0 0
3 15.0 60.0 1
4 15.0 69.0 9
The first column is the df.index column. I want to sum the 'Distance' column only where the 'Code1' and 'Code2' columns coincide. In this case the desired output would be something like:
Code1 Code2 Distance
0 14.0 15.0 2
1 14.0 60.0 3.5
2 15.0 15.0 1
3 15.0 60.0 2
4 15.0 69.0 5.5
5 15.0 434.0 0.5
6 15.0 842.0 0
I've tried to do this using conditionals, but for more than two dataframes it is really hard to do. Is there any method in pandas to do it faster?
Any help would be appreciated :-)
You could put all your data frames in a list and then use reduce to either append or merge them all.
Take a look at reduce here.
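For what it's worth, a minimal sketch of that idea (the names df1, df2 and the list dfs are placeholders for your actual dataframes; pairs missing from a frame simply contribute nothing, i.e. 0):
import pandas as pd

dfs = [df1, df2]  # put all 6 dataframes here

result = (pd.concat(dfs)
            .groupby(['Code1', 'Code2'], as_index=False)['Distance']
            .sum())
result['Distance'] /= len(dfs)  # divide by the number of datasets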
First, below some functions are defined for sample data generation.
from functools import reduce  # reduce lives in functools in Python 3

import pandas
import numpy as np
# GENERATE DATA
# Code 1 between 13 and 15
def generate_code_1(n):
    return np.floor(np.random.rand(n, 1) * 3 + 13)

# Code 2 between 1 and 1000
def generate_code_2(n):
    return np.floor(np.random.rand(n, 1) * 1000) + 1

# Distance between 0 and 9
def generate_distance(n):
    return np.floor(np.random.rand(n, 1) * 10)

# Generate a data frame as hstack of 3 arrays
def generate_data_frame(n):
    data = np.hstack([
        generate_code_1(n)
        , generate_code_2(n)
        , generate_distance(n)
    ])
    df = pandas.DataFrame(data=data, columns=['Code 1', 'Code 2', 'Distance'])
    # Remove possible duplications of Code 1 and Code 2. Take the smallest distance in case of duplications.
    # Duplications will break the merge method; however, they will not break the append method.
    df = df.groupby(['Code 1', 'Code 2'], as_index=False)
    df = df.aggregate(np.min)
    return df
# Generate n data frames each with m rows in a list
def generate_data_frames(n, m, with_count=False):
    df_list = []
    for k in range(0, n):
        df = generate_data_frame(m)
        # Add count column, needed for merge method to keep track of how many cases we have seen
        if with_count:
            df['Count'] = 1
        df_list.append(df)
    return df_list
Append method (faster, shorter, nicer)
df_list = generate_data_frames(94, 5)
# Append all data frames together using reduce
df_append = reduce(lambda df_1, df_2 : df_1.append(df_2), df_list)
# Aggregate by Code 1 and Code 2
df_append_grouped = df_append.groupby(['Code 1', 'Code 2'], as_index=False)
df_append_result = df_append_grouped.aggregate(np.mean)
df_append_result
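Note that np.mean divides each sum by the number of frames in which a (Code 1, Code 2) pair actually appears. If, as in the question, a pair missing from a frame should count as 0 and the divisor should always be the total number of data frames, a small variation would be (a sketch, untested against the generated data):
df_append_result = df_append_grouped.aggregate(np.sum)
df_append_result['Distance'] = df_append_result['Distance'] / len(df_list)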
Merge method
df_list = generate_data_frames(94, 5, with_count=True)
# Function to be passed to reduce. Merge 2 data frames and update Distance and Count
def merge_dfs(df_1, df_2):
    df = pandas.merge(df_1, df_2, on=['Code 1', 'Code 2'], how='outer', suffixes=('', '_y'))
    df = df.fillna(0)
    df['Distance'] = df['Distance'] + df['Distance_y']
    df['Count'] = df['Count'] + df['Count_y']
    del df['Distance_y']
    del df['Count_y']
    return df
# Use reduce to apply merge over the list of data frames
df_merge_result = reduce(merge_dfs, df_list)
# Replace distance with its mean and drop Count
df_merge_result['Distance'] = df_merge_result['Distance'] / df_merge_result['Count']
del df_merge_result['Count']
df_merge_result