The following code updates the number of items in stock based on the index. The table dr with the old stock holds more than 1000 rows. The data frame grp1 contains the number of sold items. I would like to subtract grp1 from dr and update dr. Everything is fine until I join grp1 to dr with pandas' join and fillna: the dtypes change from int to float, and not only the NaN values but also the values that should match are replaced by 0. Is this a problem with non-matching indices?
I tried to make the dtypes uniform, but that did not change anything. Removing fillna from the join returns NaN in all joined columns.
dr has the following format (example):
druck_pseudonym lager_nr menge_im_lager
80009359 62808 1
80009360 62809 10
80009095 62810 0
80009364 62811 11
80009365 62812 10
80008572 62814 10
80009072 62816 18
80009064 62817 13
80009061 62818 2
80008725 62819 3
80008940 62820 12
dr.dtypes
lager_nr int64
menge_im_lager int64
dtype: object
and grp1 (example):
LagerArtikelNummer1 ArtMengen1
880211066 1
80211070 1
80211072 2
80211073 2
80211082 2
80211087 4
80211091 1
80211107 2
88889272 1
88889396 1
grp1.dtypes
ArtMengen1 int64
dtype: object
#update list with "nicht_erledigt"
dr_update = dr.join(grp1).fillna(0)
dr_update["menge_im_lager"] = dr_update["menge_im_lager"] - dr_update["ArtMengen1"]
This returns:
lager_nr menge_im_lager ArtMengen1
druck_pseudonym
80009185 44402 26.0 0.0
80009184 44403 2.0 0.0
80009182 44405 16.0 0.0
80008894 44406 32.0 0.0
80008115 44407 3.0 0.0
80008974 44409 16.0 0.0
80008380 44411 4.0 0.0
dr_update.dtypes
lager_nr int64
menge_im_lager float64
ArtMengen1 float64
dtype: object
Edit after a comment: the indices are of dtype object.
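For illustration, a minimal sketch (with made-up values) of how non-matching indices produce exactly this symptom:
import pandas as pd

# the index values look identical, but one side carries stray whitespace,
# so nothing aligns on the join
dr = pd.DataFrame({'menge_im_lager': [10, 5]},
                  index=pd.Index(['80009359', '80009360'], name='druck_pseudonym'))
grp1 = pd.DataFrame({'ArtMengen1': [3, 4]},
                    index=pd.Index(['80009359 ', '80009360 '], name='LagerArtikelNummer1'))

print(dr.join(grp1))            # ArtMengen1 is all NaN: no index matches
print(dr.join(grp1).fillna(0))  # the NaNs become 0.0 and the column is float64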
Your indices are string objects; you need to convert them to numeric. Use
dr.index = pd.to_numeric(dr.index)
grp1.index = pd.to_numeric(grp1.index)
dr = dr.sort_index()
grp1 = grp1.sort_index()
Then try the rest...
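Putting it together (a sketch): after the conversion the indices align, and astype(int) restores the integer dtype that fillna upcast to float:
dr_update = dr.join(grp1).fillna(0)
dr_update["menge_im_lager"] = (dr_update["menge_im_lager"]
                               - dr_update["ArtMengen1"]).astype(int)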
You can filter the old stock dataframe dr to match the sold stock, then subtract, and assign the result back to the filtered rows of the original dataframe.
# Filter the old stock dataframe so its index matches the sold dataframe,
# restrict to "menge_im_lager", and then subtract the sold stock.
dr.loc[dr.index.isin(grp1.index), "menge_im_lager"] = (
dr.loc[dr.index.isin(grp1.index), "menge_im_lager"] - grp1["ArtMengen1"]
)
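The right-hand side aligns grp1["ArtMengen1"] on the index during the subtraction, so each matching row has exactly its sold quantity subtracted, and rows of dr without sales stay untouched.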
If I understand correctly, you want the non-matching indices to be kept in your final dataset, and you want the final dataset to stay integer. You can use an 'outer' join and cast back with astype(int). So, at the join you can do it this way:
dr.join(grp1, how='outer').fillna(0).astype(int)
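Continuing with the subtraction from the question (a sketch under the same column names):
dr_update = dr.join(grp1, how='outer').fillna(0).astype(int)
dr_update["menge_im_lager"] -= dr_update["ArtMengen1"]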
I have two DataFrames:
Master_DF:
Symbol,Strike_Price,C_BidPrice,Pecentage,Margin_Req,Underlay,C_LTP,LotSize
JETAIRWAYS,110.0,1.25,26.0,105308.9,81.05,1.2,2200
JETAIRWAYS,120.0,1.0,32.0,96156.9,81.05,1.15,2200
PCJEWELLER,77.5,0.95,27.0,171217.0,56.95,1.3,6500
PCJEWELLER,80.0,0.8,29.0,161207.0,56.95,0.95,6500
PCJEWELLER,82.5,0.55,31.0,154772.0,56.95,0.95,6500
PCJEWELLER,85.0,0.6,33.0,147882.0,56.95,0.7,6500
PCJEWELLER,90.0,0.5,37.0,138977.0,56.95,0.55,6500
and Child_DF:
Symbol,Strike_Price,C_BidPrice,Pecentage,Margin_Req,Underlay,C_LTP,LotSize
JETAIRWAYS,110.0,1.25,26.0,105308.9,81.05,1.2,2200
JETAIRWAYS,150.0,1.3,22.0,44156.9,81.05,1.05,2200
PCJEWELLER,77.5,0.95,27.0,171217.0,56.95,1.3,6500
PCJEWELLER,100.0,1.8,29.0,441207.0,46.95,4.95,6500
I want to compare child_DF with master_DF based on the columns (Symbol, Strike_Price), i.e. if the Symbol & Strike_Price combination is already available in master_DF, then the row is not considered new data.
New Rows are:
Symbol,Strike_Price,C_BidPrice,Pecentage,Margin_Req,Underlay,C_LTP,LotSize
JETAIRWAYS,150.0,1.3,22.0,44156.9,81.05,1.05,2200
PCJEWELLER,100.0,1.8,29.0,441207.0,46.95,4.95,6500
You can use a right merge with indicator=True, then query for 'right_only', and finally reindex() to get the columns in the order of child:
(master.merge(child, on=['Symbol','Strike_Price'], how='right',
              suffixes=('_',''), indicator=True)
 .query('_merge=="right_only"')).reindex(child.columns, axis=1)
Symbol Strike_Price C_BidPrice Pecentage Margin_Req Underlay \
2 JETAIRWAYS 150.0 1.3 22.0 44156.9 81.05
3 PCJEWELLER 100.0 1.8 29.0 441207.0 46.95
C_LTP LotSize
2 1.05 2200
3 4.95 6500
First, merge both dataframes on Symbol and Strike_Price, setting indicator=True and how='right':
result = pd.merge(master_df[['Symbol','Strike_Price']], child_df,
                  on=['Symbol','Strike_Price'], indicator=True, how='right')
Then filter right_only from the _merge column to get the desired result:
result = result[result['_merge']=='right_only']
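The snippet that was meant to follow presumably dropped the helper column; a sketch of that final step (assuming no further cleanup was intended):
result = result.drop(columns='_merge')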
I am trying to merge different dataframes.
Assume these two melted dataframes.
melted_dfs[0]=
Date Code delta_7
0 2014-04-01 GWA 0.08
1 2014-04-02 TVV -0.98
melted_dfs[1] =
Date Code delta_14
0 2014-04-01 GWA nan
1 2014-04-02 XRP -1.02
I am looking to merge both of the above dataframes along with the Volume & GR columns from my base dataframe.
base_df =
Date Code Volume GR
0 2014-04-01 XRP 74,776.48 482.76
1 2014-04-02 TRR 114,052.96 460.19
I tried to use Python's built-in reduce function by putting all the dataframes into a list, but it throws an error:
abt = reduce(lambda x,y: pd.merge(x,y,on=['Date', 'Code']), feature_dfs)
# feature_dfs is a list which contains all the above dfs.
ValueError: You are trying to merge on object and datetime64[ns] columns. If you wish to proceed you should use pd.concat
Any help is appreciated. Thanks!
This should work; as the error states, some of the DataFrames' Date columns are not in datetime format:
from functools import reduce
import pandas as pd

feature_dfs = [x.assign(Date=pd.to_datetime(x['Date'])) for x in feature_dfs]
abt = reduce(lambda x, y: pd.merge(x, y, on=['Date', 'Code']), feature_dfs)
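As a design note: assign returns new DataFrames, so the originals in feature_dfs are untouched, and pd.to_datetime parses the object-dtype Date strings so that every frame's merge key is datetime64[ns] before reduce chains the merges.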
One of your dataframes in feature_dfs probably has a non-datetime dtype.
Try printing the datatypes and index of the DataFrames:
for i, df in enumerate(feature_dfs):
    print('DataFrame index: {}'.format(i))
    df.info()  # prints dtypes and non-null counts
    print('-' * 72)
I would assume that one of the DataFrames is going to show a line like:
Date X non-null object
indicating that the Date column does not have a datetime dtype. That DataFrame is the culprit, and its position is the index from the print above.
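A programmatic version of the same check (a sketch using pandas' public dtype helpers):
import pandas as pd

bad = [i for i, df in enumerate(feature_dfs)
       if not pd.api.types.is_datetime64_any_dtype(df['Date'])]
print(bad)  # positions of the frames whose Date is not datetime64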
Hi, I am trying to resample a pandas DataFrame backwards.
This is my dataframe:
import numpy as np
import pandas as pd
from random import randint

seconds = np.arange(20, 700, 60)
timedeltas = pd.to_timedelta(seconds, unit='s')
vals = np.array([randint(-10, 10) for a in range(len(seconds))])
df = pd.DataFrame({'values': vals}, index=timedeltas)
then I have
In [252]: df
Out[252]:
values
00:00:20 8
00:01:20 4
00:02:20 5
00:03:20 9
00:04:20 7
00:05:20 5
00:06:20 5
00:07:20 -6
00:08:20 -3
00:09:20 -5
00:10:20 -5
00:11:20 -10
and
In [253]: df.resample('5min').mean()
Out[253]:
values
00:00:20 6.6
00:05:20 -0.8
00:10:20 -7.5
and what I would like is something like
Out[***]:
values
00:01:20 6
00:06:20 valb
00:11:20 -5.8
where the value at each new time is obtained by rolling the dataframe backwards and computing the mean of each bin from the end toward the beginning. For example, in this case the last value should be
valc = (-6 - 3 - 5 - 5 - 10) / 5 = -5.8
which is the average of the last 5 values, and the first one should be the average of only the first 2 values, because that "bin" is incomplete.
Reading the pandas documentation I thought I had to use the parameter how='last', but in my current version of pandas (0.20.3) this is not working. I also tried the options closed and convention, but I wasn't able to get this behaviour.
Thanks for the help
The easiest way is to sort the index in reverse order, then resample to get the desired results:
df.sort_index(ascending=False).resample('5min').mean()
From the resample reference: when resampling starts, the first bin has the maximum available length (here 5 values). The closed, label and convention parameters are helpful, but they do not compute the mean going from back to front; to do that, use the sort.
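For reference, a runnable version of this suggestion using the question's setup; the origin='end' line is an alternative from newer pandas (1.3+), flagged here as an assumption to verify for a TimedeltaIndex:
import numpy as np
import pandas as pd
from random import randint

seconds = np.arange(20, 700, 60)
idx = pd.to_timedelta(seconds, unit='s')
df = pd.DataFrame({'values': [randint(-10, 10) for _ in seconds]}, index=idx)

# suggested approach: sort in reverse order, then resample
out = df.sort_index(ascending=False).resample('5min').mean()

# newer-pandas alternative (assumption: check support for this index type):
# anchor the bins at the end of the data so the incomplete bin falls first
# out2 = df.resample('5min', origin='end').mean()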
This is a demo example of my DataFrame. The full DataFrame has multiple additional variables and covers 6 months of data.
sentiment date
1 2015-05-26 18:58:44
0.9 2015-05-26 19:57:31
0.7 2015-05-26 18:58:24
0.4 2015-05-27 19:17:34
0.6 2015-05-27 18:46:12
0.5 2015-05-27 13:32:24
1 2015-05-28 19:27:31
0.7 2015-05-28 18:58:44
0.2 2015-05-28 19:47:34
I want to group the DataFrame by just the day of the date column, but at the same time aggregate the median of the sentiment column.
Everything I have tried with groupby, the dt accessor and timegrouper has failed.
I want to return a pandas DataFrame not a GroupBy object.
The date column's dtype is datetime64[ns] (M8[ns]), and the sentiment column is float64.
You fortunately have the tools you need listed in your question.
In [61]: df.groupby(df.date.dt.date)[['sentiment']].median()
Out[61]:
sentiment
2015-05-26 0.9
2015-05-27 0.5
2015-05-28 0.7
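An equivalent spelling (not from the answer) uses pd.Grouper, which can be handy when date is a regular column; note that freq='D' also emits empty days if any are missing:
df.groupby(pd.Grouper(key='date', freq='D'))[['sentiment']].median()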
I would do this:
import numpy as np

df['date'] = df['date'].apply(lambda x: x.date())
df = df.groupby('date').agg({'sentiment': np.median}).reset_index()
You first replace the datetime column with the date.
Then you perform the groupby+agg operation.
I would do this, because you can do multiple aggregations (like median, mean, min, max, etc.) on multiple columns at the same time:
df.groupby(df.date.dt.date).agg({'sentiment': ['median']})
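For example, several statistics at once (illustrative):
df.groupby(df.date.dt.date).agg({'sentiment': ['median', 'mean', 'min', 'max']})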
You can get any number of metrics using one groupby and the .agg() function:
1) Create a new column extracting the date.
2) Use groupby and apply numpy.median, numpy.mean, etc.
import numpy as np
import pandas as pd

x = [[1, '2015-05-26 18:58:44'],
     [0.9, '2015-05-26 19:57:31']]
t = pd.DataFrame(x, columns=['a', 'b'])
t.b = pd.to_datetime(t['b'])
t['datex'] = t['b'].dt.date
t.groupby(['datex']).agg({'a': np.median})
Output -
datex
2015-05-26 0.95