merge different dataframes and add other columns from base dataframe - python

I am trying to merge different dataframes.
Assume these two melted dataframes.
melted_dfs[0]=
Date Code delta_7
0 2014-04-01 GWA 0.08
1 2014-04-02 TVV -0.98
melted_dfs[1] =
Date Code delta_14
0 2014-04-01 GWA nan
1 2014-04-02 XRP -1.02
I am looking to merge both of the above dataframes along with the Volume and GR columns from my base dataframe.
base_df =
Date Code Volume GR
0 2014-04-01 XRP 74,776.48 482.76
1 2014-04-02 TRR 114,052.96 460.19
I tried to use Python's built-in reduce function by converting all the dataframes into a list, but it throws an error:
abt = reduce(lambda x,y: pd.merge(x,y,on=['Date', 'Code']), feature_dfs)
# feature_dfs is a list which contains all the above dfs.
ValueError: You are trying to merge on object and datetime64[ns] columns. If you wish to proceed you should use pd.concat
Any help is appreciated. Thanks!

This should work. As the error states, some of the DataFrames have a Date column that is not in datetime format, so convert them first:
feature_dfs=[x.assign(Date=pd.to_datetime(x['Date'])) for x in feature_dfs]
abt = reduce(lambda x,y: pd.merge(x,y,on=['Date', 'Code']), feature_dfs)

One of your dataframes in feature_dfs probably has a Date column with a non-datetime dtype.

Try printing the datatypes and index of the DataFrames:
for i, df in enumerate(feature_dfs):
    print('DataFrame index: {}'.format(i))
    df.info()
    print('-' * 72)
I would assume that one of the DataFrames will show a line like:
Date X non-null object
Indicating that the Date column does not have a datetime dtype. That DataFrame is the culprit, and the index printed above tells you which one it is.
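Putting the two answers together, a minimal runnable sketch. The data below is made up to stand in for the frames in the question, with the Code values aligned so the inner merges actually match:

```python
from functools import reduce

import pandas as pd

# Stand-ins for the question's DataFrames; the melted frames carry Date
# as plain strings while base_df already has datetime64 -- the exact
# dtype mismatch that triggers the ValueError.
melted_7 = pd.DataFrame({'Date': ['2014-04-01', '2014-04-02'],
                         'Code': ['GWA', 'TVV'],
                         'delta_7': [0.08, -0.98]})
melted_14 = pd.DataFrame({'Date': ['2014-04-01', '2014-04-02'],
                          'Code': ['GWA', 'TVV'],
                          'delta_14': [float('nan'), -1.02]})
base_df = pd.DataFrame({'Date': pd.to_datetime(['2014-04-01', '2014-04-02']),
                        'Code': ['GWA', 'TVV'],
                        'Volume': [74776.48, 114052.96],
                        'GR': [482.76, 460.19]})

feature_dfs = [base_df, melted_7, melted_14]

# Normalise Date to datetime64 in every frame, then merge them all.
feature_dfs = [x.assign(Date=pd.to_datetime(x['Date'])) for x in feature_dfs]
abt = reduce(lambda x, y: pd.merge(x, y, on=['Date', 'Code']), feature_dfs)
```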

Related

How do I remove extra zeros in a dataframe column?

I have a dataframe that contains data looking like:
Date Values
2016-12-31 13000000.0
2017-12-31 -45000000.0
2018-12-31 -129000000.0
2019-12-31 276000000.0
Is there a way to remove the last 3 zeros before the decimal point in the Values column? For example, the value 13000000.0 would become 13000.0. Any help would be greatly appreciated.
To remove 3 zeros you need to divide the column by 1000:
df['Values'] = df['Values'] / 1e3
Try this:
df['Values'] = df['Values'] / 1000
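As a quick sanity check, a runnable sketch of the division, using the first two sample values from the question:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2016-12-31', '2017-12-31'],
                   'Values': [13000000.0, -45000000.0]})

# Dividing by 1000 (or 1e3) strips three zeros before the decimal point.
df['Values'] = df['Values'] / 1000
```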

pandas assigns column name randomly

I extracted some dates and temperature values from an XML file and wanted to make a dataframe out of them. After some loops I defined the variables temperature and date and appended their values to a list outside the loop (placeholder). Later I made a DataFrame from them and assigned the column names directly while creating the dataframe. But I realized that every time I run my code, the column names get assigned correctly or incorrectly at random.
Here is my code:
placeholder = []
for timeserie in timeseries:
    date = re.findall('<entryisIntraday\D*(\d*.\d*.\d*)', timeserie)
    temperature = re.findall('<value>(.*)<\/value>', timeserie)[0]
    placeholder.append([date, temperature])
print(placeholder)
df = pd.DataFrame(placeholder, columns={"DATE", "TEMP"})
print(df)
after running the code sometimes the result is like this:
[['2019-10-29', '4.4'], ['2019-10-30', '3.6'], ['2019-10-31', '2.1'] ...
TEMP DATE
0 2019-10-29 4.4
1 2019-10-30 3.6
2 2019-10-31 2.1
and sometimes like this:
[['2019-10-29', '4.4'], ['2019-10-30', '3.6'], ['2019-10-31', '2.1'], ...
DATE TEMP
0 2019-10-29 4.4
1 2019-10-30 3.6
2 2019-10-31 2.1
I didn't have this problem when I assigned the column names after I built the DataFrame:
df = pd.DataFrame(placeholder)
df = df.rename(columns={0: "DATE", 1: "TEMP"})
How can I solve this problem?
The columns argument of the DataFrame constructor should be a list, not a set, because a set has no defined order:
df = pd.DataFrame(placeholder, columns = ["DATE", "TEMP"])
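A minimal sketch of the fix, with a couple of hard-coded rows standing in for the values scraped from the XML file:

```python
import pandas as pd

# Hard-coded rows standing in for the extracted date/temperature pairs.
placeholder = [['2019-10-29', '4.4'], ['2019-10-30', '3.6']]

# A list pins the column order; a set such as {"DATE", "TEMP"} has no
# guaranteed iteration order, which is why the names flipped at random.
df = pd.DataFrame(placeholder, columns=["DATE", "TEMP"])
```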

How to apply the "diff()" method to multiple columns?

I'm trying to apply the diff() method to multiple columns to make the data stationary for a time series.
x1 = frc_data['004L004T10'].diff(periods=8)
x1
Date
2013-10-01 NaN
2013-11-01 NaN
2013-12-01 NaN
2014-01-01 NaN
2014-02-01 NaN
So diff works for a single column.
However, diff does not appear to work when looping over all the columns:
for x in frc_data.columns:
    frc_data[x].diff(periods=1)
No errors are raised, yet the data remains unchanged.
In order to change the DataFrame, you need to assign the diff back to the column, i.e.
for x in frc_data.columns:
    frc_data[x] = frc_data[x].diff(periods=1)
The loop is not necessary; simply drop the [x] to take the difference of all columns at once:
frc_data = frc_data.diff(periods=1)
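A runnable sketch of the loop-free version, with a small made-up frame standing in for frc_data:

```python
import pandas as pd

# Made-up stand-in for frc_data with two numeric columns.
frc_data = pd.DataFrame({'A': [1.0, 3.0, 6.0],
                         'B': [10.0, 10.0, 13.0]})

# diff() on the whole frame differences every column at once;
# the first row becomes NaN because there is nothing to subtract from.
diffed = frc_data.diff(periods=1)
```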

Compare master and child dataframe and extract new rows based on two column values only

I have two Dataframes as:
Master_DF:
Symbol,Strike_Price,C_BidPrice,Pecentage,Margin_Req,Underlay,C_LTP,LotSize
JETAIRWAYS,110.0,1.25,26.0,105308.9,81.05,1.2,2200
JETAIRWAYS,120.0,1.0,32.0,96156.9,81.05,1.15,2200
PCJEWELLER,77.5,0.95,27.0,171217.0,56.95,1.3,6500
PCJEWELLER,80.0,0.8,29.0,161207.0,56.95,0.95,6500
PCJEWELLER,82.5,0.55,31.0,154772.0,56.95,0.95,6500
PCJEWELLER,85.0,0.6,33.0,147882.0,56.95,0.7,6500
PCJEWELLER,90.0,0.5,37.0,138977.0,56.95,0.55,6500
and Child_DF:
Symbol,Strike_Price,C_BidPrice,Pecentage,Margin_Req,Underlay,C_LTP,LotSize
JETAIRWAYS,110.0,1.25,26.0,105308.9,81.05,1.2,2200
JETAIRWAYS,150.0,1.3,22.0,44156.9,81.05,1.05,2200
PCJEWELLER,77.5,0.95,27.0,171217.0,56.95,1.3,6500
PCJEWELLER,100.0,1.8,29.0,441207.0,46.95,4.95,6500
I want to compare child_DF with master_DF based on the columns (Symbol, Strike_Price), i.e. if the Symbol and Strike_Price are already present in master_DF then the row is not considered new data.
New Rows are:
Symbol,Strike_Price,C_BidPrice,Pecentage,Margin_Req,Underlay,C_LTP,LotSize
JETAIRWAYS,150.0,1.3,22.0,44156.9,81.05,1.05,2200
PCJEWELLER,100.0,1.8,29.0,441207.0,46.95,4.95,6500
You can use a right merge with indicator=True, then query for 'right_only', and finally reindex() to get the columns in the order of child:
(master.merge(child, on=['Symbol', 'Strike_Price'], how='right',
              suffixes=('_', ''), indicator=True)
 .query('_merge == "right_only"')).reindex(child.columns, axis=1)
Symbol Strike_Price C_BidPrice Pecentage Margin_Req Underlay \
2 JETAIRWAYS 150.0 1.3 22.0 44156.9 81.05
3 PCJEWELLER 100.0 1.8 29.0 441207.0 46.95
C_LTP LotSize
2 1.05 2200
3 4.95 6500
First, merge both dataframes on Symbol and Strike_Price, setting indicator=True and how='right':
result = pd.merge(master_df[['Symbol','Strike_Price']],child_df,on=['Symbol','Strike_Price'],indicator=True,how='right')
Then filter for right_only in the _merge column to get the desired result:
result = result[result['_merge']=='right_only']
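Both answers are variations on the same anti-join. A condensed, runnable sketch using a made-up subset of the question's columns:

```python
import pandas as pd

# Made-up subset of the question's columns.
master = pd.DataFrame({'Symbol': ['JETAIRWAYS', 'PCJEWELLER'],
                       'Strike_Price': [110.0, 77.5],
                       'C_BidPrice': [1.25, 0.95]})
child = pd.DataFrame({'Symbol': ['JETAIRWAYS', 'JETAIRWAYS', 'PCJEWELLER'],
                      'Strike_Price': [110.0, 150.0, 100.0],
                      'C_BidPrice': [1.25, 1.3, 1.8]})

# Right merge on the two key columns; rows marked 'right_only' exist
# only in child and are therefore the new ones.
new_rows = (master[['Symbol', 'Strike_Price']]
            .merge(child, on=['Symbol', 'Strike_Price'],
                   how='right', indicator=True)
            .query('_merge == "right_only"')
            .drop(columns='_merge'))
```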

Resampling a multi-index DataFrame

I want to resample a DataFrame with a multi-index containing both a datetime column and some other key. The DataFrame looks like:
import pandas as pd
from io import StringIO
csv = StringIO("""ID,NAME,DATE,VAR1
1,a,03-JAN-2013,69
1,a,04-JAN-2013,77
1,a,05-JAN-2013,75
2,b,03-JAN-2013,69
2,b,04-JAN-2013,75
2,b,05-JAN-2013,72""")
df = pd.read_csv(csv, index_col=['DATE', 'ID'], parse_dates=['DATE'])
df.columns.name = 'Params'
Because resampling is only allowed on datetime indexes, I thought unstacking the other index column would help. And indeed it does, but I can't stack it again afterwards.
print(df.unstack('ID').resample('W-THU'))
Params       VAR1
ID              1     2
DATE
2013-01-03   69.0  69.0
2013-01-10   76.0  73.5
But then stacking 'ID' again results in an IndexError:
print(df.unstack('ID').resample('W-THU').stack('ID'))
IndexError: index 0 is out of bounds for axis 0 with size 0
Strangely enough, I can stack the other column level with both:
print(df.unstack('ID').resample('W-THU').stack(0))
and
print(df.unstack('ID').resample('W-THU').stack('Params'))
The IndexError also occurs if I reorder (swap) the two column levels. Does anyone know how to overcome this issue?
The example unstacks the non-numerical column 'NAME', which is silently dropped but causes problems during re-stacking. The code below worked for me (note that in recent pandas, resample() must be followed by an explicit aggregation such as .mean()):
print(df[['VAR1']].unstack('ID').resample('W-THU').mean().stack('ID'))
Params         VAR1
DATE       ID
2013-01-03 1   69.0
           2   69.0
2013-01-10 1   76.0
           2   73.5
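For reference, a self-contained version of the accepted fix that runs under current pandas:

```python
from io import StringIO

import pandas as pd

csv = StringIO("""ID,NAME,DATE,VAR1
1,a,03-JAN-2013,69
1,a,04-JAN-2013,77
1,a,05-JAN-2013,75
2,b,03-JAN-2013,69
2,b,04-JAN-2013,75
2,b,05-JAN-2013,72""")
df = pd.read_csv(csv, index_col=['DATE', 'ID'], parse_dates=['DATE'])
df.columns.name = 'Params'

# Keep only the numeric column, move ID into the columns, resample the
# remaining DatetimeIndex, aggregate explicitly (required in recent
# pandas), then stack ID back into the row index.
res = df[['VAR1']].unstack('ID').resample('W-THU').mean().stack('ID')
```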
