Aggregating datetime64[ns] and float columns in a pandas dataframe - python

I have a pandas dataframe which looks like the one below.
racer      race_time_1  race_time_2  1st_Place  2nd_Place  ...
joe shmo   0:24:12      NaN          1          0
joe shmo   NaN          0:32:43      0          0
joe shmo   NaN          0:30:21      0          1
sally sue  NaN          0:29:54      1          0
I would like to group all the rows by racer name to show me total race times, places, etc.
I am attempting to do this with
df.groupby('racer', dropna=True).agg('sum')
Each column was initially loaded as object dtype, which causes issues when aggregating numeric values mixed with nulls.
For the race_time values, after lots of searching I tried changing the columns to datetime64[ns] dtype with dummy day/month/year data, but instead of being summed, the race_time columns are dropped from the dataframe when groupby is called.
The opposite issue arises when I change 1st_Place and 2nd_Place to float dtype: the aggregation works as expected, but every other (object) column is dropped.
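(Aside: datetime64 columns are dropped because summing absolute datetimes is undefined; durations are what can be added. A minimal sketch, assuming the times are 'H:MM:SS' strings, of a conversion that keeps the columns summable:)
import pandas as pd

# parse 'H:MM:SS' strings as durations (timedelta64), not absolute datetimes
df['race_time_1'] = pd.to_timedelta(df['race_time_1'])  # NaN becomes NaT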
For example, with joe shmo I would want to see:
racer      race_time_1  race_time_2  1st_Place  2nd_Place
joe shmo   0:24:12      1:03:04      1          1
How can I get pandas to aggregate my dataframe like this?

Use:
import numpy as np
import pandas as pd

#function for formatting timedeltas as H:MM:SS strings
def f(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return '{:02d}:{:02d}:{:02d}'.format(int(hours), int(minutes), int(seconds))

#convert Place columns to numeric
cols1 = df.filter(like='Place').columns
df[cols1] = df[cols1].apply(pd.to_numeric)

#convert time columns to timedeltas, then to integer nanoseconds so they can be summed
cols = df.filter(like='time').columns
df[cols] = df[cols].fillna('0').apply(pd.to_timedelta).astype(np.int64)

#aggregate sum
df = df.groupby('racer', dropna=True).sum()

#convert the summed nanoseconds back to timedeltas and format
df[cols] = df[cols].apply(lambda x: pd.to_timedelta(x).map(f))
print (df)
race_time_1 race_time_2 1st_Place 2nd_Place
racer
joe shmo 00:24:12 01:03:04 1 1
sally sue 00:00:00 00:29:54 1 0
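If your pandas version's groupby sum supports timedelta64 columns directly, the int64 round trip can be skipped. A sketch under that assumption (Place columns already converted to numeric as above):
import pandas as pd

#keep time columns as timedelta64; NaN becomes NaT and is skipped by sum()
cols = df.filter(like='time').columns
df[cols] = df[cols].apply(pd.to_timedelta)
out = df.groupby('racer').sum(numeric_only=False)
The summed columns then come back as Timedelta values (e.g. '0 days 01:03:04') rather than formatted strings, so the formatting helper above would still apply.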

Related

Pandas apply lambda to a function based on condition

I have a data frame of rental data and would like to annualise the rent based on whether a frequency column states that the rent is monthly, i.e. price * 12.
The frequency column contains the following values: 'Yearly', 'Monthly', nan.
I have tried:
np.where(df['frequency'] == "Monthly", df['price'].apply(lambda x: x*12), 0)
However, where there is monthly data, the figure seems to be copied 12 times rather than multiplied by 12, and I can't figure out how to get the price multiplied by 12 instead.
The problem is that your price column contains strings, not numeric values.
If you load your dataframe from a file (csv, xlsx), pass thousands=',' to pd.read_csv or pd.read_excel so that a string like '4,500' is interpreted as the number 4500.
Demo:
import pandas as pd
import io
csvdata = """\
frequency;price
Monthly;4,500
Yearly;30,200
"""
df1 = pd.read_csv(io.StringIO(csvdata), sep=';')
df2 = pd.read_csv(io.StringIO(csvdata), sep=';', thousands=',')
For df1:
>>> df1
frequency price
0 Monthly 4,500
1 Yearly 30,200
>>> df1.dtypes
frequency object
price object # not numeric
dtype: object
>>> df1['price'] * 2
0 4,5004,500
1 30,20030,200
Name: price, dtype: object
For df2:
>>> df2
frequency price
0 Monthly 4500
1 Yearly 30200
>>> df2.dtypes
frequency object
price int64 # numeric
dtype: object
>>> df2['price'] * 2
0 9000
1 60400
Name: price, dtype: int64
It seems there are strings instead of float numbers in the price column, so first replace ',' with '.', then convert to floats, and finally multiply by 12:
np.where(df['frequency'] == "Monthly", df['price'].str.replace(',','.').astype(float)*12, 0)
If the values are instead thousands-separated by ',', replace it with an empty string:
np.where(df['frequency'] == "Monthly", df['price'].str.replace(',','').astype(float)*12, 0)
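A combined sketch using pd.to_numeric (assumptions: ',' is a thousands separator, unparseable values should become NaN, and non-monthly rows keep their price as-is rather than 0):
import numpy as np
import pandas as pd

#strip thousands separators, coerce unparseable values to NaN, then annualise
price = pd.to_numeric(df['price'].str.replace(',', '', regex=False), errors='coerce')
df['annual_rent'] = np.where(df['frequency'] == 'Monthly', price * 12, price)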

How to count the sale of each day according to stocks with pandas

I want to count the sale of each day. The values in the original table are stocks, not sales.
I used Excel to solve the problem, but now I have millions of products, so I want to solve it with pandas.
I am still new to programming and pandas; I have read the pandas docs but am still unable to do it.
pandas.DataFrame.diff() is enough.
df['STOCK'] = df['STOCK'].diff()
df.rename(columns={'STOCK': 'SALE'}, inplace=True)
df.rename(columns={'ID1_stock': 'ID1_sale', 'ID2_stock': 'ID2_sale', 'ID3_stock': 'ID3_sale'}, level=1, inplace=True)
Use DataFrame.diff with rename: first rename the first level of the MultiIndex, then the second level via a lambda function:
print (df)
STOCK
ID1_stock ID2_stock
0 20 21
1 18 20
2 16 19
df = (df.diff()
        .rename(columns={'STOCK': 'SALE'}, level=0)
        .rename(columns=lambda x: x.replace('stock', 'sale'), level=1))
print (df)
SALE
ID1_sale ID2_sale
0 NaN NaN
1 -2.0 -1.0
2 -2.0 -1.0
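If sales should come out positive when stock decreases (an assumption about what the stock numbers mean), negate the diff and fill the first row:
#sketch: a stock drop of 2 units is a sale of 2; day 0 has no prior stock to diff against
sales = df.diff().mul(-1)
sales.iloc[0] = 0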

How to Loop over Numeric Column in Pandas Dataframe and filter Values?

df:
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
3 Foxtrix Ammy thirty 2000
4 Hensaui giny 33 ten
5 menuia rony fifty 7000
6 lopex nick 23 Ninety
I want to loop over the numeric columns (Age, Salary) and check whether each value is numeric; if a string value is present in a numeric column, filter that record out and create a new data frame without the error.
Output :
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
You could extend this answer to filter on multiple columns for numerical data types:
import pandas as pd
from io import StringIO
data = """
Org_Name,Emp_Name,Age,Salary
Axempl,Rick,29,1000
Lastik,John,34,2000
Xenon,sidd,47,9000
Foxtrix,Ammy,thirty,2000
Hensaui,giny,33,ten
menuia,rony,fifty,7000
lopex,nick,23,Ninety
"""
df = pd.read_csv(StringIO(data))
print('Original dataframe\n', df)
df = df[(df.Age.apply(lambda x: x.isnumeric())) &
        (df.Salary.apply(lambda x: x.isnumeric()))]
print('Filtered dataframe\n', df)
gives
Original dataframe
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
3 Foxtrix Ammy thirty 2000
4 Hensaui giny 33 ten
5 menuia rony fifty 7000
6 lopex nick 23 Ninety
Filtered dataframe
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
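One caveat (an added note, not from the original answer): isnumeric only exists on strings and rejects decimals like '23.5', so if a column mixes real numbers with strings, cast to str first, for example:
#sketch: guard against non-string cells before calling isnumeric
df = df[df['Age'].astype(str).str.isnumeric() & df['Salary'].astype(str).str.isnumeric()]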
I believe this can be solved using Pandas' "to_numeric" function.
import pandas as pd
df['Column to Check'] = pd.to_numeric(df['Column to Check'], downcast='integer', errors='coerce')
df.dropna(axis=0, inplace=True)
Where 'Column to Check' is the column name that you are checking for values that cannot be cast as an integer (or any numeric type); in your question I believe you will want to apply this code to 'Age' and 'Salary'. "to_numeric" will convert any values in those columns to NaN if they could not be cast as your selected type. The "dropna" method will remove all rows that have a NaN in any of your columns.
To loop over the columns like you ask, you could do the following:
for col in ['Age', 'Salary']:
    df[col] = pd.to_numeric(df[col], downcast='integer', errors='coerce')
df.dropna(axis=0, inplace=True)
EDIT:
In response to harry's comment: if there are preexisting NaNs in the data, something like the following should keep any valid row that had a preexisting NaN in one of the other columns.
for col in ['Age', 'Salary']:
    df[col] = pd.to_numeric(df[col], downcast='integer', errors='coerce')
    df = df[df[col].notnull()]
You can use a mask to indicate whether or not there is a string type among the Age and Salary columns:
mask_str = (df[['Age', 'Salary']]
            .applymap(lambda x: str(type(x)))
            .sum(axis=1)
            .str.contains("str"))
df[~mask_str]
This is assuming that the dataframe already contains the proper types. If not, you can convert them using the following:
def convert(val):
    try:
        return int(val)
    except ValueError:
        return val

df = (df.assign(Age=lambda f: f.Age.apply(convert),
                Salary=lambda f: f.Salary.apply(convert)))

Dividing rows for specific columns by date+n in Pandas

I want to divide rows in my dataframe via specific columns.
That is, I have a column named 'ticker', and each ticker has a 'date' and a 'price'.
I want to divide the price at date[i+2] by the price at date[i], where i and i+2 just mean the DAY and the DAY + 2 for that ticker. The date is also in proper datetime format for operations using pandas.
The data looks like:
date        ticker  price
2002-01-30  A       20
2002-01-31  A       21
2002-02-01  A       21.4
2002-02-02  A       21.3
...
That means I want to select the price based off the ticker and the DAY and the DAY + 2 specifically for each ticker to calculate the ratio date[i+2]/date[i].
I've considered using iloc but I'm not sure how to select for specific tickers only to do the math on.
Use groupby:
df.groupby('ticker')['price'].transform(lambda x: x / x.shift(2))
0 NaN
1 NaN
2 1.070000
3 1.014286
Name: price, dtype: float64
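An equivalent sketch using pct_change, assuming each ticker's rows are contiguous daily observations as in the sample:
#pct_change(2) returns x / x.shift(2) - 1, so add 1 back to get the ratio
df['ratio'] = df.groupby('ticker')['price'].pct_change(periods=2) + 1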

Shift time in multi-index to merge

I want to merge two datasets that are indexed by time and id. The problem is, the time is slightly different in each dataset. In one dataset, the time (Monthly) is mid-month, so the 15th of every month. In the other dataset, it is the last business day. This should still be a one-to-one match, but the dates are not exactly the same.
My approach is to shift mid-month dates to business day end-of-month dates.
Data:
import numpy as np
import pandas as pd

dt = pd.date_range('1/1/2011', '12/31/2011', freq='D')
dt = dt[dt.day == 15]
lst = [1, 2, 3]
idx = pd.MultiIndex.from_product([dt, lst], names=['date', 'id'])
df = pd.DataFrame(np.random.randn(len(idx)), index=idx)
df.head()
output:
0
date id
2011-01-15 1 -0.598584
2 -0.484455
3 -2.044912
2011-02-15 1 -0.017512
2 0.852843
This is what I want (I removed the performance warning):
In[83]:df.index.levels[0] + BMonthEnd()
Out[83]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
'2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
'2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
dtype='datetime64[ns]', freq='BM')
However, indexes are immutable, so this does not work:
In: df.index.levels[0] = df.index.levels[0] + BMonthEnd()
TypeError: 'FrozenList' does not support mutable operations.
The only solution I've got is to reset_index(), change the dates, then set_index() again:
df.reset_index(inplace=True)
df['date'] = df['date'] + BMonthEnd()
df.set_index(['date','id'], inplace=True)
This gives what I want, but is this the best way? Is there a set_level_values() function (I didn't see it in the API)?
Or maybe I'm taking the wrong approach to the merge. I could merge the dataset with keys df.index.get_level_values(0).year, df.index.get_level_values(0).month and id but this doesn't seem much better.
You can use set_levels in order to set multiindex levels (assign the result back; set_levels' inplace parameter was deprecated and later removed):
df.index = df.index.set_levels(df.index.levels[0] + pd.tseries.offsets.BMonthEnd(),
                               level='date')
>>> df.head()
0
date id
2011-01-31 1 -1.410646
2 0.642618
3 -0.537930
2011-02-28 1 -0.418943
2 0.983186
You could just build it again:
from pandas.tseries.offsets import BMonthEnd

df.index = pd.MultiIndex.from_arrays(
    [
        df.index.get_level_values(0) + BMonthEnd(),
        df.index.get_level_values(1)
    ])
set_levels implicitly rebuilds the index under the covers. If you have more than two levels, this solution becomes unwieldy, so consider using set_levels for brevity.
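For a generic rebuild over any number of levels (a sketch, assuming the level to shift is named 'date'):
import pandas as pd
from pandas.tseries.offsets import BMonthEnd

#rebuild every level, shifting only the 'date' level
arrays = [df.index.get_level_values(n) for n in df.index.names]
arrays[df.index.names.index('date')] += BMonthEnd()
df.index = pd.MultiIndex.from_arrays(arrays, names=df.index.names)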
Since you want to merge anyway, you can forget about changing the index and use pandas.merge_asof()
Data
df1
0
date id
2011-01-15 1 -0.810581
2 1.177235
3 0.083883
2011-02-15 1 1.217419
2 -0.970804
3 1.262364
2011-03-15 1 -0.026136
2 -0.036250
3 -1.103929
2011-04-15 1 -1.303298
And here is one with last business day of the month, df2
0
date id
2011-01-31 1 -0.277675
2 0.086539
3 1.441449
2011-02-28 1 1.330212
2 -0.028398
3 -0.114297
2011-03-31 1 -0.031264
2 -0.787093
3 -0.133088
2011-04-29 1 0.938732
merge
Use df1 as your left DataFrame and then choose the merge direction as forward, since the last business day is always after the 15th. Optionally, you can set a tolerance. This is useful when you are missing a month in the right DataFrame: it will prevent you from merging 03-31-2011 to 02-15-2011 if you are missing data for the last business day of February.
import pandas as pd
pd.merge_asof(df1.reset_index(), df2.reset_index(), by='id', on='date',
              direction='forward', tolerance=pd.Timedelta(days=20)).set_index(['date', 'id'])
Results in
0_x 0_y
date id
2011-01-15 1 -0.810581 -0.277675
2 1.177235 0.086539
3 0.083883 1.441449
2011-02-15 1 1.217419 1.330212
2 -0.970804 -0.028398
3 1.262364 -0.114297
2011-03-15 1 -0.026136 -0.031264
2 -0.036250 -0.787093
3 -1.103929 -0.133088
2011-04-15 1 -1.303298 0.938732
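Note that merge_asof requires both inputs to be sorted by the on key, so if the index order is not guaranteed (an added caveat, not from the original answer), sort after resetting the index:
#sketch: merge_asof raises if 'date' is unsorted in either frame
left = df1.reset_index().sort_values('date')
right = df2.reset_index().sort_values('date')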
