When resampling a Series with mean aggregation (daily to monthly), missing datetimes are filled with NaNs, which is fine since we can simply remove them with .dropna().
However, with sum/total aggregation, missing datetimes are filled with 0s (zeros), which is technically correct but a bit bothersome, since masks are then needed to remove them.
Is there a more efficient way to resample with an aggregate sum without zero-filling or using masks? Preferably something similar to dropna(), but for dropping 0s.
For example:
ser = pd.Series([1]*6)
ser.index = pd.to_datetime(['2000-01-01', '2000-01-02', '2000-03-01', '2000-03-02', '2000-05-01', '2000-05-02'])
# wanted output
# 2000-01-31 2.0
# 2000-03-31 2.0
# 2000-05-31 2.0
# ideal output but for aggregate sum.
ser.resample('M').mean().dropna()
# 2000-01-31 1.0
# 2000-03-31 1.0
# 2000-05-31 1.0
# not ideal
ser.resample('M').sum()
# 2000-01-31 2
# 2000-02-29 0
# 2000-03-31 2
# 2000-04-30 0
# 2000-05-31 2
Using .groupby() with pd.Grouper seems to have exactly the same behavior as resampling.
# not ideal
ser.groupby(pd.Grouper(freq='M')).sum()
# 2000-01-31 2
# 2000-02-29 0
# 2000-03-31 2
# 2000-04-30 0
# 2000-05-31 2
Using .groupby() with index.year is also doable; however, there does not seem to be an equivalent 'identity' for a calendar month. Note that .index.month is not what we are after, since it would lump the same month from different years together.
ser = pd.Series([1]*6)
ser.index = pd.to_datetime(['2000-01-01', '2000-01-02', '2002-03-01', '2002-03-02', '2005-05-01', '2005-05-02'])
ser.groupby(ser.index.year).sum()
# 2000 2
# 2002 2
# 2005 2
Use pd.offsets.MonthEnd and add it to the DatetimeIndex of ser to create a month-end grouper, then use Series.groupby with this grouper and aggregate with sum or mean:
grp = ser.groupby(ser.index + pd.offsets.MonthEnd())
s1, s2 = grp.sum(), grp.mean()
Result:
print(s1)
2000-01-31 2
2002-03-31 2
2005-05-31 2
dtype: int64
print(s2)
2000-01-31 1
2002-03-31 1
2005-05-31 1
dtype: int64
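Note that adding MonthEnd() to a date that already falls on a month end rolls it forward into the next month, so if month-end dates can occur in the data, pd.offsets.MonthEnd(0) (which leaves such dates in place) may be the safer offset. Alternatively, if something close to dropna() but for zeros is all that is needed, a minimal sketch (assuming no month legitimately sums to zero) is to resample as usual and filter out the zero rows:
ser.resample('M').sum().loc[lambda s: s != 0]
# 2000-01-31    2
# 2002-03-31    2
# 2005-05-31    2
# dtype: int64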
Related
I have the following dataframe.
For each time point (row), A1, A2, A3; A4, A5, A6; ... are 3 replicates. I would like to get the average and standard deviation for each group of 3 per row and add them to a new df.
I have tried:
new_df['A1-A3_mean']=np.mean(df[['A1','A2','A3']],axis=1)
new_df['A1-A3_std']=np.std(df[['A1','A2','A3']],axis=1)
which works but is quite manual and time consuming. I tried using groupby('Time').agg({'mean','std'}), but I don't know how to specify that it should always take 3 columns. Ideally the resulting columns would be named A1-3_mean / A1-3_stdev.
Thanks in advance!
You can try:
N = 3
cols = list(df.drop(columns='time'))
mapper = {c: f'{cols[i//N*N]}-{cols[i//N*N+N-1]}' for i,c in enumerate(cols)}
g = df[cols].rename(columns=mapper).groupby(level=0, axis=1)
out = pd.concat({x: g.agg(x) for x in ['mean', 'std']}, axis=1)
Output:
mean std
A1-A3 A4-A6 A1-A3 A4-A6
0 4.666667 3.000000 2.886751 2.000000
1 2.666667 4.333333 1.154701 3.214550
2 6.333333 4.333333 2.309401 1.154701
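For clarity, the mapper just renames every replicate column to the label of its group of N so that the column-wise groupby can aggregate over identical labels. A minimal sketch of what it produces, assuming six replicate columns A1..A6:
cols = ['A1', 'A2', 'A3', 'A4', 'A5', 'A6']
N = 3
mapper = {c: f'{cols[i//N*N]}-{cols[i//N*N+N-1]}' for i, c in enumerate(cols)}
print(mapper)
# {'A1': 'A1-A3', 'A2': 'A1-A3', 'A3': 'A1-A3', 'A4': 'A4-A6', 'A5': 'A4-A6', 'A6': 'A4-A6'}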
I have a DataFrame with 2 columns total_open_amount and invoice_currency.
invoice_currency has
USD 45011
CAD 3828
Name: invoice_currency, dtype: int64
And I want to convert all the CAD to USD in the total_open_amount column, with respect to invoice_currency, using an exchange rate of 1 CAD = 0.7 USD, and store the result in a separate column.
My code:
df_data['converted_usd'] = df_data['total_open_amount'].where(df_data['invoice_currency']=='CAD')
df_data['converted_usd']= df_data['converted_usd'].apply(lambda x: x*0.7)
df_data['converted_usd']
output:
0 NaN
1 NaN
2 NaN
3 2309.79
4 NaN
...
49995 NaN
49996 NaN
49997 NaN
49998 NaN
49999 NaN
Name: converted_usd, Length: 48839, dtype: float64
I was able to fill the new column with the converted CAD values, but how do I fill in the remaining (USD) values now?
We can use Series.mask or Series.where. Series.mask replaces the rows where the condition is met, here the rows where 'invoice_currency' is 'CAD'; by default they would become NaN, but with the other parameter we tell it to fill them with df_data['total_open_amount'] multiplied by 0.7, leaving the USD rows unchanged.
With Series.where, the rows that do not meet the condition are replaced instead. So we first multiply the series by 0.7, keep those values only in the rows where the condition is met (the CAD rows), and use the other parameter to leave the rest of the rows at their initial value.
Note that Series.mask and Series.where are the opposite of each other.
df_data['converted_usd'] = df_data['total_open_amount']\
.mask(df_data['invoice_currency'] == 'CAD',
other=df_data['total_open_amount'].mul(0.7))
Or:
df_data['converted_usd'] = df_data['total_open_amount'].mul(0.7)\
.where(df_data['invoice_currency'] == 'CAD',
df_data['total_open_amount'])
Or the NumPy version:
df_data['converted_usd'] = \
np.where(df_data['invoice_currency'] == 'CAD',
df_data['total_open_amount'].mul(0.7),
df_data['total_open_amount'])
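As a quick sanity check, all three variants give the same result. Here is a minimal sketch on a small, made-up frame (the rows and values are hypothetical, not from the original data):
import numpy as np
import pandas as pd

df_data = pd.DataFrame({
    'invoice_currency': ['USD', 'CAD', 'USD', 'CAD'],
    'total_open_amount': [100.0, 200.0, 50.0, 10.0],
})

# Keep USD amounts as-is, convert the CAD amounts at 1 CAD = 0.7 USD.
df_data['converted_usd'] = np.where(df_data['invoice_currency'] == 'CAD',
                                    df_data['total_open_amount'] * 0.7,
                                    df_data['total_open_amount'])
print(df_data)
#   invoice_currency  total_open_amount  converted_usd
# 0              USD              100.0          100.0
# 1              CAD              200.0          140.0
# 2              USD               50.0           50.0
# 3              CAD               10.0            7.0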
The following code should update the number of items in stock based on the index. The table dr with the old stock holds more than 1000 rows. The dataframe grp1 contains the number of sold items. I would like to subtract dataframe grp1 from dataframe dr and update dr. Everything is fine until I join grp1 to dr with pandas' join and fillna: first of all, the datatypes change from int to float, and not only the NaN values but also the values that should be non-null are replaced by 0. Is this a problem with non-matching indices?
I tried to make the dtypes uniform, but this did not change anything. Removing fillna while joining the two dataframes returns NaN for all columns.
dr has the following format (example):
druck_pseudonym lager_nr menge_im_lager
80009359 62808 1
80009360 62809 10
80009095 62810 0
80009364 62811 11
80009365 62812 10
80008572 62814 10
80009072 62816 18
80009064 62817 13
80009061 62818 2
80008725 62819 3
80008940 62820 12
dr.dtypes
lager_nr int64
menge_im_lager int64
dtype: object
and grp1 (example):
LagerArtikelNummer1 ArtMengen1
880211066 1
80211070 1
80211072 2
80211073 2
80211082 2
80211087 4
80211091 1
80211107 2
88889272 1
88889396 1
ArtMengen1 int64
dtype: object
#update list with "nicht_erledigt"
dr_update = dr.join(grp1).fillna(0)
dr_update["menge_im_lager"] = dr_update["menge_im_lager"] - dr_update["ArtMengen1"]
This returns:
lager_nr menge_im_lager ArtMengen1
druck_pseudonym
80009185 44402 26.0 0.0
80009184 44403 2.0 0.0
80009182 44405 16.0 0.0
80008894 44406 32.0 0.0
80008115 44407 3.0 0.0
80008974 44409 16.0 0.0
80008380 44411 4.0 0.0
dr_update.dtypes
lager_nr int64
menge_im_lager float64
ArtMengen1 float64
dtype: object
Edit after a comment: the indices are of object dtype.
Your indices are string objects. You need to convert these to numeric. Use
dr.index = pd.to_numeric(dr.index)
grp1.index = pd.to_numeric(grp1.index)
dr = dr.sort_index()
grp1 = grp1.sort_index()
Then try the rest...
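To see why the join silently produces NaN everywhere, here is a minimal sketch of what happens when one index holds strings and the other integers (the values are hypothetical):
import pandas as pd

s_str = pd.Series([1, 2], index=pd.Index(['80009359', '80009360'], dtype=object))
s_int = pd.Series([10, 20], index=pd.Index([80009359, 80009360]))

# The string label '80009359' never equals the integer 80009359,
# so nothing aligns and the joined column is all NaN (upcast to float).
print(s_str.to_frame('a').join(s_int.rename('b')))
#           a   b
# 80009359  1 NaN
# 80009360  2 NaN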
You can filter the old stock dataframe dr to match the sold stock, then subtract, and assign back to the original filtered dataframe.
# Filter the old stock dataframe so that you have matching index to the sold dataframe.
# Restrict just for menge_im_lager. Then subtract the sold stock
dr.loc[dr.index.isin(grp1.index), "menge_im_lager"] = (
dr.loc[dr.index.isin(grp1.index), "menge_im_lager"] - grp1["ArtMengen1"]
)
If I understand correctly, you want the non-matching indices to remain in your final dataset, and you want the final dataset to hold integers. You can use an 'outer' join and cast with astype(int).
So, at the join you can do it this way:
dr.join(grp1,how='outer').fillna(0).astype(int)
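Putting the pieces together, a minimal sketch (assuming dr and grp1 are indexed by druck_pseudonym and LagerArtikelNummer1 as shown above, and that the index labels are numeric strings) could look like this:
import pandas as pd

# Make both indices numeric so the join can align them.
dr.index = pd.to_numeric(dr.index)
grp1.index = pd.to_numeric(grp1.index)

# The outer join keeps all articles, missing sales become 0, then subtract.
dr_update = dr.join(grp1, how='outer').fillna(0).astype(int)
dr_update['menge_im_lager'] -= dr_update['ArtMengen1']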
I have a Pandas dataframe that is filled as follows:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
8/31/2010 1
9/30/2010 4
12/31/2010 2
Note how there are missing months (i.e. 7, 10, 11) in the data. I want to fill in the missing data through a forward filling method so that it looks like this:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
7/30/2010 3
8/31/2010 1
9/30/2010 4
10/29/2010 4
11/30/2010 4
12/31/2010 2
Each missing date should take the tag of the previous one. All dates represent the last business day of the month.
This is what I tried to do:
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df.ref_date.index = pd.to_datetime(df.ref_date.index)
df = df.reindex(index=[idx], columns=[ref_date], method='ffill')
It's giving me the error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
where pd is pandas and df is the dataframe.
I'm new to Pandas Dataframe, so any help would be appreciated!
You were very close; you just need to set the dataframe's index to ref_date, reindex it to the business-month-end index while specifying ffill as the method, then reset the index and rename back to the original:
# First ensure the dates are Pandas Timestamps.
df['ref_date'] = pd.to_datetime(df['ref_date'])
# Create a monthly index.
idx_monthly = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
# Set ref_date as the index, reindex to the monthly index with forward fill, then restore the column.
>>> (df
.set_index('ref_date')
.reindex(idx_monthly, method='ffill')
.reset_index()
.rename(columns={'index': 'ref_date'}))
ref_date tag
0 2010-01-29 1.0
1 2010-02-26 3.0
2 2010-03-31 4.0
3 2010-04-30 4.0
4 2010-05-31 1.0
5 2010-06-30 3.0
6 2010-07-30 3.0
7 2010-08-31 1.0
8 2010-09-30 4.0
9 2010-10-29 4.0
10 2010-11-30 4.0
11 2010-12-31 2.0
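Note that tag comes out as float here; if integer tags are preferred, one option (a small sketch continuing the code above) is to cast the column back after the fill:
out = (df
       .set_index('ref_date')
       .reindex(idx_monthly, method='ffill')
       .reset_index()
       .rename(columns={'index': 'ref_date'}))
# Cast the forward-filled tags back to integers.
out['tag'] = out['tag'].astype(int)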
Thanks to the previous person who answered this question but deleted their answer, I got the solution:
df[ref_date] = pd.to_datetime(df[ref_date])
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df = df.set_index(ref_date).reindex(idx).ffill().reset_index().rename(columns={'index': ref_date})
I've got a dataframe, and I'm trying to append a column of sequential differences to it. I have found a method that I like a lot (and generalizes well for my use case). But I noticed one weird thing along the way. Can you help me make sense of it?
Here is some data that has the right structure (code modeled on an answer here):
import pandas as pd
import numpy as np
import random
from itertools import product
random.seed(1) # so you can play along at home
np.random.seed(2) # ditto
# make a list of dates for a few periods
dates = pd.date_range(start='2013-10-01', periods=4).to_native_types()
# make a list of tickers
tickers = ['ticker_%d' % i for i in range(3)]
# make a list of all the possible (date, ticker) tuples
pairs = list(product(dates, tickers))
# put them in a random order
random.shuffle(pairs)
# exclude a few possible pairs
pairs = pairs[:-3]
# make some data for all of our selected (date, ticker) tuples
values = np.random.rand(len(pairs))
mydates, mytickers = zip(*pairs)
data = pd.DataFrame({'date': mydates, 'ticker': mytickers, 'value':values})
Ok, great. This gives me a frame like so:
date ticker value
0 2013-10-03 ticker_2 0.435995
1 2013-10-04 ticker_2 0.025926
2 2013-10-02 ticker_1 0.549662
3 2013-10-01 ticker_0 0.435322
4 2013-10-02 ticker_2 0.420368
5 2013-10-03 ticker_0 0.330335
6 2013-10-04 ticker_1 0.204649
7 2013-10-02 ticker_0 0.619271
8 2013-10-01 ticker_2 0.299655
My goal is to add a new column to this dataframe that will contain sequential changes. The data needs to be sorted to do this, but the sorting and the differencing need to be done "ticker-wise" so that gaps in another ticker don't cause NAs for a given ticker. I want to do this without perturbing the dataframe in any other way (i.e. I do not want the resulting DataFrame to be reordered based on what was necessary to do the differencing). The following code works:
data1 = data.copy() #let's leave the original data alone for later experiments
data1.sort(['ticker', 'date'], inplace=True)
data1['diffs'] = data1.groupby(['ticker'])['value'].transform(lambda x: x.diff())
data1.sort_index(inplace=True)
data1
and returns:
date ticker value diffs
0 2013-10-03 ticker_2 0.435995 0.015627
1 2013-10-04 ticker_2 0.025926 -0.410069
2 2013-10-02 ticker_1 0.549662 NaN
3 2013-10-01 ticker_0 0.435322 NaN
4 2013-10-02 ticker_2 0.420368 0.120713
5 2013-10-03 ticker_0 0.330335 -0.288936
6 2013-10-04 ticker_1 0.204649 -0.345014
7 2013-10-02 ticker_0 0.619271 0.183949
8 2013-10-01 ticker_2 0.299655 NaN
So far, so good. If I replace the middle line above with the more concise code shown here, everything still works:
data2 = data.copy()
data2.sort(['ticker', 'date'], inplace=True)
data2['diffs'] = data2.groupby('ticker')['value'].diff()
data2.sort_index(inplace=True)
data2
A quick check shows that, in fact, data1 is equal to data2. However, if I do this:
data3 = data.copy()
data3.sort(['ticker', 'date'], inplace=True)
data3['diffs'] = data3.groupby('ticker')['value'].transform(np.diff)
data3.sort_index(inplace=True)
data3
I get a strange result:
date ticker value diffs
0 2013-10-03 ticker_2 0.435995 0
1 2013-10-04 ticker_2 0.025926 NaN
2 2013-10-02 ticker_1 0.549662 NaN
3 2013-10-01 ticker_0 0.435322 NaN
4 2013-10-02 ticker_2 0.420368 NaN
5 2013-10-03 ticker_0 0.330335 0
6 2013-10-04 ticker_1 0.204649 NaN
7 2013-10-02 ticker_0 0.619271 NaN
8 2013-10-01 ticker_2 0.299655 0
What's going on here? When you call the .diff method on a Pandas object, is it not just calling np.diff? I know there's a diff method on the DataFrame class, but I couldn't figure out how to pass that to transform without the lambda-function syntax I used to make data1 work. Am I missing something? Why is the diffs column in data3 screwy? How can I call the Pandas diff method within transform without needing to write a lambda to do it?
Nice, easy-to-reproduce example! More questions should be like this.
Passing a lambda to transform is tantamount to passing a function object directly, e.g. Series.diff. So this is equivalent to data1/data2:
In [32]: data3['diffs'] = data3.groupby('ticker')['value'].transform(Series.diff)
In [34]: data3.sort_index(inplace=True)
In [25]: data3
Out[25]:
date ticker value diffs
0 2013-10-03 ticker_2 0.435995 0.015627
1 2013-10-04 ticker_2 0.025926 -0.410069
2 2013-10-02 ticker_1 0.549662 NaN
3 2013-10-01 ticker_0 0.435322 NaN
4 2013-10-02 ticker_2 0.420368 0.120713
5 2013-10-03 ticker_0 0.330335 -0.288936
6 2013-10-04 ticker_1 0.204649 -0.345014
7 2013-10-02 ticker_0 0.619271 0.183949
8 2013-10-01 ticker_2 0.299655 NaN
[9 rows x 4 columns]
I believe that np.diff doesn't follow numpy's own ufunc guidelines for processing array inputs (whereby it tries various methods to coerce the input and wrap the output, e.g. __array__ on input and __array_wrap__ on output). I am not really sure why; see a bit more info here. So the bottom line is that np.diff does not deal with the index properly and does its own calculation (which in this case is wrong).
Pandas has a lot of methods that don't just call the corresponding numpy function, mainly because they handle different dtypes, handle NaNs, and, in this case, handle 'special' diffs (e.g. you can pass a time frequency to a datelike index, where it calculates how many periods to actually diff).
You can see that the Series .diff() method is different to np.diff():
In [11]: data.value.diff() # Note the NaN
Out[11]:
0 NaN
1 -0.410069
2 0.523736
3 -0.114340
4 -0.014955
5 -0.090033
6 -0.125686
7 0.414622
8 -0.319616
Name: value, dtype: float64
In [12]: np.diff(data.value.values) # the values array of the column
Out[12]:
array([-0.41006867, 0.52373625, -0.11434009, -0.01495459, -0.09003298,
-0.12568619, 0.41462233, -0.31961629])
In [13]: np.diff(data.value) # on the column (Series)
Out[13]:
0 NaN
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 NaN
Name: value, dtype: float64
In [14]: np.diff(data.value.index) # er... on the index
Out[14]: Int64Index([8], dtype=int64)
In [15]: np.diff(data.value.index.values)
Out[15]: array([1, 1, 1, 1, 1, 1, 1, 1])
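For what it's worth, the NaN/0 pattern in In [13] is most likely down to index alignment: in the pandas version of that era, np.diff on a Series ends up computing something like ser[1:] - ser[:-1], and pandas aligns those two slices on their labels before subtracting, so every shared label yields value - value = 0 and the end labels yield NaN. A minimal sketch with a hypothetical three-element Series:
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0])
print(s[1:] - s[:-1])  # label-aligned subtraction, not positional
# 0    NaN
# 1    0.0
# 2    NaN
# dtype: float64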