I have the following code:
import pandas as pd
import numpy as np
df = pd.read_csv('C:/test.csv')
df.drop(columns=['SecurityID'], inplace=True)
Time = 1
trade_filter_size = 9
groupbytime = (str(Time) + "min")
df['dateTime_s'] = df['dateTime'].astype('datetime64[s]')
df['dateTime'] = pd.to_datetime(df['dateTime'])
df[str(Time)+"min"] = df['dateTime'].dt.floor(str(Time)+"min")
df['tradeBid'] = np.where(((df['tradePrice'] <= df['bid1']) & (df['isTrade']==1)), df['tradeVolume'], 0)
groups = df[df['isTrade'] == 1].groupby(groupbytime)
print("groups",groups.dtypes)
#THIS IS WORKING
df_grouped = (groups.agg({
'tradeBid': [('sum', np.sum),('downticks_number', lambda x: (x > 0).sum())],
}))
# creating a new data frame which is filtered
df2 = pd.DataFrame( df.loc[(df['isTrade'] == 1) & (df['tradeVolume']>=trade_filter_size)])
# recalculating all the bid/ask volume to be based on the filter size
df2['tradeBid'] = np.where(((df2['tradePrice'] <= df2['bid1']) & (df2['isTrade']==1)), df2['tradeVolume'], 0)
df2grouped = (df2.agg({
# here is the problem!!! NOT WORKING
'tradeBid': [('sum', np.sum), lambda x: (x > 0).sum()],
}))
The same aggregation, 'tradeBid': [('sum', np.sum), ('downticks_number', lambda x: (x > 0).sum())], is used in both cases. The first time it works fine, but when it is applied to the filtered data in the new df it causes an error:
ValueError: downticks_number is an unknown string function
When I use this code instead to work around the above:
'tradeBid': [('sum', np.sum), lambda x: (x > 0).sum()],
I get this error:
ValueError: cannot combine transform and aggregation operations
Any idea why I get different results for the same usage of code?
Since there were two conditions to match for the second groupby, I solved this by moving the filter into the DataFrame itself: I created a new column that applies both filter conditions and used it for the selection.
After that, the groupby ran without problems.
The order of operations was the problem.
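What seems to make the difference is that the working snippet aggregates a groupby object (groups.agg(...)), while the failing one calls df2.agg(...) directly on the DataFrame, which does not accept the (name, function) renaming tuples in the same way. Below is a minimal sketch of the fix described above, putting both filter conditions into a helper column and then grouping; the column names passes_filter and tradeBid_f are made up for illustration:
# both filter conditions in one helper column, then the same named aggregation as before
df['passes_filter'] = (df['isTrade'] == 1) & (df['tradeVolume'] >= trade_filter_size)
df['tradeBid_f'] = np.where(df['passes_filter'] & (df['tradePrice'] <= df['bid1']),
                            df['tradeVolume'], 0)
df2grouped = (df[df['passes_filter']]
              .groupby(groupbytime)
              .agg({'tradeBid_f': [('sum', np.sum),
                                   ('downticks_number', lambda x: (x > 0).sum())]}))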
Thank you for taking a look! I am having issues with a 4-level MultiIndex and attempting to make sure every possible value of the 4th index level is represented.
Here is my dataframe:
np.random.seed(5)
size = 25
data = {'Customer': np.random.choice(['Bob'], size),
        'Grouping': np.random.choice(['Corn', 'Wheat', 'Soy'], size),
        'Date': np.random.choice(pd.date_range('1/1/2018', '12/12/2022', freq='D'), size),
        'Data': np.random.randint(20, 100, size=size)}
df = pd.DataFrame(data)
# create the Sub-Group column
df['Sub-Group'] = np.nan
df.loc[df['Grouping'] == 'Corn', 'Sub-Group'] = np.random.choice(['White', 'Dry'], size=len(df[df['Grouping'] == 'Corn']))
df.loc[df['Grouping'] == 'Wheat', 'Sub-Group'] = np.random.choice(['SRW', 'HRW', 'SWW'], size=len(df[df['Grouping'] == 'Wheat']))
df.loc[df['Grouping'] == 'Soy', 'Sub-Group'] = np.random.choice(['Beans', 'Meal'], size=len(df[df['Grouping'] == 'Soy']))
df['Year'] = df.Date.dt.year
With that, I'm looking to create a groupby like the following:
(df.groupby(['Customer','Grouping','Sub-Group',df['Date'].dt.month,'Year'])
.agg(Units = ('Data','sum'))
.unstack()
)
This works as expected. I want to reindex this dataframe so that every single month (index level 3) is represented and filled with 0s. The reason I want this is that later on I'll be doing a cumulative sum over a groupby.
I have tried the following reindex and nothing happens; many months are still missing.
rere = pd.date_range('2018-01-01','2018-12-31', freq='M').month
(df.groupby(['Customer','Grouping','Sub-Group',df['Date'].dt.month,'Year'])
.agg(Units = ('Data','sum'))
.unstack()
.fillna(0)
.pipe(lambda x: x.reindex(rere, level=3, fill_value=0))
)
I've also tried the following:
(df.groupby(['Customer','Grouping','Sub-Group',df['Date'].dt.month,'Year'])
.agg(Units = ('Data','sum'))
.unstack()
.fillna(0)
.pipe(lambda x: x.reindex(pd.MultiIndex.from_product(x.index.levels)))
)
The issue with the last one is that the index is much too long: it takes the Cartesian product of Grouping and Sub-Group, when really there are no combinations like 'Wheat' as Grouping with 'Dry' as Sub-Group.
I'm looking for a flexible way to reindex this dataframe to make sure a specific index level (3rd in this case) has every option.
Thanks so much for any help!
try this:
def reindex_sub(g: pd.DataFrame):
    # drop the group-key levels, then make sure every month 1-12 is present
    g = g.droplevel([0, 1, 2])
    result = g.reindex(range(1, 13))
    return result
tmp = (df.groupby(['Customer','Grouping','Sub-Group',df['Date'].dt.month,'Year'])
.agg(Units = ('Data','sum'))
.unstack()
)
grouped = tmp.groupby(level=[0,1,2], group_keys=True)
out = grouped.apply(reindex_sub)
print(out)
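As a quick sanity check (assuming the random data above), every (Customer, Grouping, Sub-Group) combination that occurs in the data should now carry exactly 12 month rows, and the newly created months can be filled with zeros as in the question:
# each existing group should now have 12 month rows (1-12)
print(out.groupby(level=[0, 1, 2]).size())
# fill the months added by the reindex with 0, as in the question
out = out.fillna(0)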
I want to write an if condition for concatenating strings.
i.e. if cell A1 contains a specific format of text, only then concatenate, else leave it as is.
example:
If bill number looks like: CM2/0000/, then concatenate this string with the date column (month - year), else leave the bill number as it is.
Sample Data
You can create a function which does what you need and use df.apply() to execute it on all rows.
I use the example data from @Boomer's answer.
EDIT: you didn't show what you really have in the dataframe, and it seems bill_date holds datetimes while I used strings. I had to convert the strings to datetime to show how to work with this, and it now needs .strftime('%m-%y') (or sometimes .dt.strftime('%m-%y')) instead of .str[3:].str.replace('/','-'). Because pandas displays datetimes in different formats for different locales, I couldn't use str(x) here: it gives 2019-09-15 00:00:00 instead of your 15/09/19.
import pandas as pd
df = pd.DataFrame({
'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'])
def convert(row):
    if row['bill_number'].endswith('/'):
        #return row['bill_number'] + row['bill_date'].str[3:].replace('/','-')
        return row['bill_number'] + row['bill_date'].strftime('%m-%y')
    else:
        return row['bill_number']
df['bill_number'] = df.apply(convert, axis=1)
print(df)
Result:
bill_number bill_date
0 CM2/0000/09-19 15/09/19
1 CM2/0000 15/09/19
2 CM3/0000/09-19 15/09/19
3 CM3/0000 15/09/19
A second idea is to create a mask
mask = df['bill_number'].str.endswith('/')
and then use it to update all the matching rows
#df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].dt.strftime('%m-%y')
or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].dt.strftime('%m-%y')
The left side needs .loc[mask, 'bill_number'] instead of [mask]['bill_number'] to assign the values correctly, but the right side doesn't need it.
import pandas as pd
df = pd.DataFrame({
'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'])
mask = df['bill_number'].str.endswith('/')
#df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].str[3:].str.replace('/','-')
# or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].dt.strftime('%m-%y')
#or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].dt.strftime('%m-%y')
print(df)
A third idea is to use numpy.where()
import pandas as pd
import numpy as np
df = pd.DataFrame({
'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'])
df['bill_number'] = np.where(
df['bill_number'].str.endswith('/'),
#df['bill_number'] + df['bill_date'].str[3:].str.replace('/','-'),
df['bill_number'] + df['bill_date'].dt.strftime('%m-%y'),
df['bill_number'])
print(df)
Maybe this will work for you. It would be nice to have a data sample, as @Mike67 was saying, but based on your information this is what I came up with. It's bulky, but it works. I'm sure someone else will have a fancier version.
import pandas as pd
from pandas import DataFrame, Series
dat = {'num': ['CM2/0000/','CM2/0000', 'CM3/0000/', 'CM3/0000',],
'date': ['15/09/19','15/09/19','15/09/19','15/09/19']}
df = pd.DataFrame(dat)
df['date'] = df['date'].map(lambda x: str(x)[3:])
df['date'] = df['date'].str.replace('/','-')
for cols in df.columns:
    df.loc[df['num'].str.endswith('/'), cols] = df['num'] + df['date']
print(df)
Results:
num date
0 CM2/0000/09-19 09-19
1 CM2/0000 09-19
2 CM3/0000/09-19 09-19
3 CM3/0000 09-19
I am only beginning with Pandas and I am stuck with the following problem:
I want to use the row number in df.apply() so that it calculates (1+0.05)^(row_number), ex:
(1+0.05)^0 in its first row, (1+0.05)^1 in its second, (1+0.05)^2 in its third etc....
I tried the following but get AttributeError: 'int' object has no attribute 'name'
import pandas as pd
considered_period_years = 60
start_year = 2019
TDE = 0.02
year = list(range(start_year,start_year+considered_period_years))
df = pd.DataFrame(year, columns = ['Year'])
df.insert(0, 'Year Number', range(0,60), allow_duplicates = False)
df.insert(2, 'Investition', 0, allow_duplicates = False)
df['Investition2'] = df['Investition'].apply(lambda x: x*(1+TDE)**x.name)
Any ideas ?
Regards Johann
Welcome to pandas. Familiarize yourself with vectorized functions. The basic idea behind vectorized functions is that you apply the operation to every element in an array without an explicit loop. For example:
x + 1
means "add 1 to element in x".
Similarly:
x * y
means "multiply every element in x by every element in y, pair-wise".
Deep down, vectorized functions are implemented using highly-optimized C loops so they are both fast and convenient.
In your case:
df['Investition2'] = (1+TDE)**df.index
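If the intent is also to multiply by the existing Investition column, as in the original lambda, the same vectorized idea applies; here is a small sketch using the 'Year Number' column already created above:
# vectorized equivalent of x * (1 + TDE) ** row_number, using the existing 'Year Number' column
df['Investition2'] = df['Investition'] * (1 + TDE) ** df['Year Number']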
You can create a custom function (foo in our case) to access row.name. Since you were applying to df['Investition'], which is a Series, apply iterates over its plain values, which is why x.name failed; applying to the DataFrame with axis=1 passes whole rows instead.
import pandas as pd
considered_period_years = 60
start_year = 2019
TDE = 0.02
year = list(range(start_year,start_year+considered_period_years))
df = pd.DataFrame(year, columns = ['Year'])
df.insert(0, 'Year Number', range(0,60), allow_duplicates = False)
df.insert(2, 'Investition', 0, allow_duplicates = False)
def foo(row):
    # row.name is the row's index label (0, 1, 2, ...) when applying across rows with axis=1
    return row['Investition'] * (1 + TDE) ** row.name
df['Investition2'] = df.apply(foo, axis=1)
Another alternative is to use itertuples or iterrows.
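For completeness, here is a minimal sketch of the iterrows() alternative, assuming the df and TDE defined above; it is usually slower than apply or the vectorized version:
# iterrows() yields (index_label, row_Series); the label is 0, 1, 2, ... for the default RangeIndex
values = []
for idx, row in df.iterrows():
    values.append(row['Investition'] * (1 + TDE) ** idx)
df['Investition2'] = values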
I have a few pandas series with PeriodIndex of varying frequency. I'd like to filter these based on another PeriodIndex whose frequency is in principle unknown (specified directly in the example below as selectionA or selectionB, but in practice taken from another series).
I've found 3 approaches, each with its own downside, shown in the example below. Is there a better way?
import numpy as np
import pandas as pd
y = pd.Series(np.random.random(4), index=pd.period_range('2018', '2021', freq='A'), name='speed')
q = pd.Series(np.random.random(16), index=pd.period_range('2018Q1', '2021Q4', freq='Q'), name='speed')
m = pd.Series(np.random.random(48), index=pd.period_range('2018-01', '2021-12', freq='M'), name='speed')
selectionA = pd.period_range('2018Q3', '2020Q2', freq='Q') #subset of y, q, and m
selectionB = pd.period_range('2014Q3', '2015Q2', freq='Q') #not subset of y, q, and m
#Comparing some options:
#1: filter method
#2: slicing
#3: selection based on boolean comparison
#1: problem when frequencies unequal: always returns empty series
yA_1 = y.filter(selectionA, axis=0) #Fail: empty series
qA_1 = q.filter(selectionA, axis=0)
mA_1 = m.filter(selectionA, axis=0) #Fail: empty series
yB_1 = y.filter(selectionB, axis=0)
qB_1 = q.filter(selectionB, axis=0)
mB_1 = m.filter(selectionB, axis=0)
#2: problem when frequencies unequal: wrong selection and error instead of empty result
yA_2 = y[selectionA[0]:selectionA[-1]]
qA_2 = q[selectionA[0]:selectionA[-1]]
mA_2 = m[selectionA[0]:selectionA[-1]] #Fail: selects 22 months instead of 24
yB_2 = y[selectionB[0]:selectionB[-1]] #Fail: error
qB_2 = q[selectionB[0]:selectionB[-1]]
mB_2 = m[selectionB[0]:selectionB[-1]] #Fail: error
#3: works, but very verbose
yA_3 =y[(y.index >= selectionA[0].start_time) & (y.index <= selectionA[-1].end_time)]
qA_3 =q[(q.index >= selectionA[0].start_time) & (q.index <= selectionA[-1].end_time)]
mA_3 =m[(m.index >= selectionA[0].start_time) & (m.index <= selectionA[-1].end_time)]
yB_3 =y[(y.index >= selectionB[0].start_time) & (y.index <= selectionB[-1].end_time)]
qB_3 =q[(q.index >= selectionB[0].start_time) & (q.index <= selectionB[-1].end_time)]
mB_3 =m[(m.index >= selectionB[0].start_time) & (m.index <= selectionB[-1].end_time)]
Many thanks
I've solved it by adding start_time and end_time to the slice range:
yA_2fixed = y[selectionA[0].start_time: selectionA[-1].end_time]
qA_2fixed = q[selectionA[0].start_time: selectionA[-1].end_time]
mA_2fixed = m[selectionA[0].start_time: selectionA[-1].end_time] #now has 24 rows
yB_2fixed = y[selectionB[0].start_time: selectionB[-1].end_time] #doesn't fail; returns empty series
qB_2fixed = q[selectionB[0].start_time: selectionB[-1].end_time]
mB_2fixed = m[selectionB[0].start_time: selectionB[-1].end_time] #doesn't fail; returns empty series
But if there's a more concise way to write this, I'm still all ears. I especially would like to know if it's possible to do this filtering in a way that is more 'native' to the PeriodIndex, i.e., not converting it into datetime instances first with the start_time and end_time attributes.
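For conciseness, the same slice can be wrapped in a small helper (the function name is made up); this doesn't make it more PeriodIndex-native, but it removes the repetition:
# wraps the start_time/end_time slicing shown above so each filter is a one-liner
def filter_by_periods(s: pd.Series, selection: pd.PeriodIndex) -> pd.Series:
    return s[selection[0].start_time : selection[-1].end_time]

yA = filter_by_periods(y, selectionA)
mB = filter_by_periods(m, selectionB)   # empty series, no error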
I have 2 pandas.core.groupby.DataFrameGroupBy objects and would like to combine them by a key. How would I do that? Using 'as_index=False' does not work (it used to work before). I tried the following:
result = pd.merge(groupobject_a, groupobject_b, on='important_key', how='inner')
But I am getting the error below:
ValueError: can not merge DataFrame with instance of type <class 'pandas.core.groupby.DataFrameGroupBy'>
Here is the minimal code showing how I created my groupby objects:
import pandas as pd
my_dataframe = pd.read_csv("here is my csv")
groupobject_a= my_dataframe[(my_dataframe['colA'] > 0) & (my_dataframe['colB'] < 15) & (my_dataframe['colC'].notnull())].groupby(['important_key'], as_index=False)
groupobject_b= my_dataframe[(my_dataframe['colA'] > 25) & (my_dataframe['colB'] < 65) & (my_dataframe['colC'].notnull())].groupby(['important_key'], as_index=False)
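The ValueError points at the fact that a DataFrameGroupBy is not a DataFrame yet. A minimal, hedged sketch of one workaround is to aggregate each groupby first and then merge the resulting DataFrames; mean() is used here purely as an example aggregation, pick whichever you actually need:
# aggregate first (example: mean of the numeric columns), then merge the DataFrames;
# with as_index=False, 'important_key' stays a regular column, so merge(on=...) works
df_a = groupobject_a.mean(numeric_only=True)
df_b = groupobject_b.mean(numeric_only=True)
result = pd.merge(df_a, df_b, on='important_key', how='inner', suffixes=('_a', '_b'))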