merge different dataframes and add other columns from base dataframe - python

I am trying to merge different dataframes.
Assume these two melted dataframes.
melted_dfs[0]=
Date Code delta_7
0 2014-04-01 GWA 0.08
1 2014-04-02 TVV -0.98
melted_dfs[1] =
Date Code delta_14
0 2014-04-01 GWA nan
1 2014-04-02 XRP -1.02
I am looking to merge both of the above dataframes along with the Volume and GR columns from my base dataframe.
base_df =
Date Code Volume GR
0 2014-04-01 XRP 74,776.48 482.76
1 2014-04-02 TRR 114,052.96 460.19
I tried to use Python's built-in reduce function by converting all the dataframes into a list, but it throws an error:
abt = reduce(lambda x,y: pd.merge(x,y,on=['Date', 'Code']), feature_dfs)
# feature_dfs is a list which contains all the above dfs.
ValueError: You are trying to merge on object and datetime64[ns] columns. If you wish to proceed you should use pd.concat
Any help is appreciated. Thanks!

This should work. As the error states, some of the DataFrames have a Date column that is not in datetime format, so convert them first:
feature_dfs=[x.assign(Date=pd.to_datetime(x['Date'])) for x in feature_dfs]
abt = reduce(lambda x,y: pd.merge(x,y,on=['Date', 'Code']), feature_dfs)

One of your dataframes in feature_dfs probably has a Date column with a non-datetime dtype.

Try printing the datatypes and index of the DataFrames:
for i, df in enumerate(feature_dfs):
    print('DataFrame index: {}'.format(i))
    df.info()
    print('-' * 72)
I would assume that one of the DataFrames will show a line like:
Date X non-null object
Indicating that the Date column does not have a datetime dtype. That DataFrame is the culprit, and the index printed above tells you which one it is.
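Putting the two answers together, a minimal runnable sketch. The data below is made up to stand in for the frames in the question, with the Code values aligned so the inner merges actually match:

```python
from functools import reduce

import pandas as pd

# Stand-ins for the question's DataFrames; the melted frames carry Date
# as plain strings while base_df already has datetime64 -- the exact
# dtype mismatch that triggers the ValueError.
melted_7 = pd.DataFrame({'Date': ['2014-04-01', '2014-04-02'],
                         'Code': ['GWA', 'TVV'],
                         'delta_7': [0.08, -0.98]})
melted_14 = pd.DataFrame({'Date': ['2014-04-01', '2014-04-02'],
                          'Code': ['GWA', 'TVV'],
                          'delta_14': [float('nan'), -1.02]})
base_df = pd.DataFrame({'Date': pd.to_datetime(['2014-04-01', '2014-04-02']),
                        'Code': ['GWA', 'TVV'],
                        'Volume': [74776.48, 114052.96],
                        'GR': [482.76, 460.19]})

feature_dfs = [base_df, melted_7, melted_14]

# Normalise Date to datetime64 in every frame, then merge them all.
feature_dfs = [x.assign(Date=pd.to_datetime(x['Date'])) for x in feature_dfs]
abt = reduce(lambda x, y: pd.merge(x, y, on=['Date', 'Code']), feature_dfs)
```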

Related

How do I remove extra zeros in a dataframe column?

I have a dataframe that contains data looking like:
Date Values
2016-12-31 13000000.0
2017-12-31 -45000000.0
2018-12-31 -129000000.0
2019-12-31 276000000.0
Is there a way to remove the last 3 zeros before the decimal point in the Values column? For example, the value 13000000.0 would become 13000.0. Any help would be greatly appreciated.
To remove 3 zeros you need to divide the column by 1000:
df['Values'] = df['Values'] / 1e3
Try this:
df['Values'] = df['Values'] / 1000
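As a quick sanity check, a runnable sketch of the division, using the first two sample values from the question:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2016-12-31', '2017-12-31'],
                   'Values': [13000000.0, -45000000.0]})

# Dividing by 1000 (or 1e3) strips three zeros before the decimal point.
df['Values'] = df['Values'] / 1000
```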

pandas assigns column name randomly

I extracted some dates and temperature values from an XML file and wanted to make a dataframe out of them. After some loops I defined the variables temperature and date and appended their values to a list outside the loop (placeholder). Later I made a DataFrame from them and assigned the column names directly while creating the dataframe. But I realized that every time I run my code, the column names get assigned correctly or incorrectly at random.
Here is my code:
placeholder = []
for timeserie in timeseries:
    date = re.findall('<entryisIntraday\D*(\d*.\d*.\d*)', timeserie)
    temperature = re.findall('<value>(.*)<\/value>', timeserie)[0]
    placeholder.append([date, temperature])
print(placeholder)
df = pd.DataFrame(placeholder, columns={"DATE", "TEMP"})
print(df)
after running the code sometimes the result is like this:
[['2019-10-29', '4.4'], ['2019-10-30', '3.6'], ['2019-10-31', '2.1'] ...
TEMP DATE
0 2019-10-29 4.4
1 2019-10-30 3.6
2 2019-10-31 2.1
and sometimes like this:
[['2019-10-29', '4.4'], ['2019-10-30', '3.6'], ['2019-10-31', '2.1'], ...
DATE TEMP
0 2019-10-29 4.4
1 2019-10-30 3.6
2 2019-10-31 2.1
I didn't have this problem when I assigned the column names after I built the DataFrame:
df = pd.DataFrame(placeholder)
df = df.rename(columns={0: "DATE", 1: "TEMP"})
How can I solve this problem?
The columns argument of the DataFrame constructor should be a list, not a set, because a set has no defined order:
df = pd.DataFrame(placeholder, columns = ["DATE", "TEMP"])
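A minimal sketch of the fix, with a couple of hard-coded rows standing in for the values scraped from the XML file:

```python
import pandas as pd

# Hard-coded rows standing in for the extracted date/temperature pairs.
placeholder = [['2019-10-29', '4.4'], ['2019-10-30', '3.6']]

# A list pins the column order; a set such as {"DATE", "TEMP"} has no
# guaranteed iteration order, which is why the names flipped at random.
df = pd.DataFrame(placeholder, columns=["DATE", "TEMP"])
```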

How to apply the "diff()" method to multiple columns?

I'm trying to apply the diff() method to multiple columns to make the data stationary for a time series.
x1 = frc_data['004L004T10'].diff(periods=8)
x1
Date
2013-10-01 NaN
2013-11-01 NaN
2013-12-01 NaN
2014-01-01 NaN
2014-02-01 NaN
So diff works for a single column.
However, diff does not appear to work when looping over all the columns:
for x in frc_data.columns:
    frc_data[x].diff(periods=1)
No errors are raised, yet the data remains unchanged.
In order to change the DataFrame, you need to assign the diff back to the column, i.e.
for x in frc_data.columns:
    frc_data[x] = frc_data[x].diff(periods=1)
The loop is not necessary; simply drop the [x] to take the difference of all columns at once:
frc_data = frc_data.diff(periods=1)
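A runnable sketch of the loop-free version, with a small made-up frame standing in for frc_data:

```python
import pandas as pd

# Made-up stand-in for frc_data with two numeric columns.
frc_data = pd.DataFrame({'A': [1.0, 3.0, 6.0],
                         'B': [10.0, 10.0, 13.0]})

# diff() on the whole frame differences every column at once;
# the first row becomes NaN because there is nothing to subtract from.
diffed = frc_data.diff(periods=1)
```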

Compare master and child dataframe and extract new rows based on two column values only

I have two Dataframes as:
Master_DF:
Symbol,Strike_Price,C_BidPrice,Pecentage,Margin_Req,Underlay,C_LTP,LotSize
JETAIRWAYS,110.0,1.25,26.0,105308.9,81.05,1.2,2200
JETAIRWAYS,120.0,1.0,32.0,96156.9,81.05,1.15,2200
PCJEWELLER,77.5,0.95,27.0,171217.0,56.95,1.3,6500
PCJEWELLER,80.0,0.8,29.0,161207.0,56.95,0.95,6500
PCJEWELLER,82.5,0.55,31.0,154772.0,56.95,0.95,6500
PCJEWELLER,85.0,0.6,33.0,147882.0,56.95,0.7,6500
PCJEWELLER,90.0,0.5,37.0,138977.0,56.95,0.55,6500
and Child_DF:
Symbol,Strike_Price,C_BidPrice,Pecentage,Margin_Req,Underlay,C_LTP,LotSize
JETAIRWAYS,110.0,1.25,26.0,105308.9,81.05,1.2,2200
JETAIRWAYS,150.0,1.3,22.0,44156.9,81.05,1.05,2200
PCJEWELLER,77.5,0.95,27.0,171217.0,56.95,1.3,6500
PCJEWELLER,100.0,1.8,29.0,441207.0,46.95,4.95,6500
I want to compare child_DF with master_DF based on the columns (Symbol, Strike_Price), i.e. if the Symbol and Strike_Price are already present in master_DF then the row is not considered new data.
New Rows are:
Symbol,Strike_Price,C_BidPrice,Pecentage,Margin_Req,Underlay,C_LTP,LotSize
JETAIRWAYS,150.0,1.3,22.0,44156.9,81.05,1.05,2200
PCJEWELLER,100.0,1.8,29.0,441207.0,46.95,4.95,6500
You can use a right merge with indicator=True, then query for 'right_only', and finally reindex() to get the columns in the order of child:
(master.merge(child, on=['Symbol', 'Strike_Price'], how='right',
              suffixes=('_', ''), indicator=True)
 .query('_merge == "right_only"')).reindex(child.columns, axis=1)
Symbol Strike_Price C_BidPrice Pecentage Margin_Req Underlay \
2 JETAIRWAYS 150.0 1.3 22.0 44156.9 81.05
3 PCJEWELLER 100.0 1.8 29.0 441207.0 46.95
C_LTP LotSize
2 1.05 2200
3 4.95 6500
First, merge both dataframes on Symbol and Strike_Price, setting indicator=True and how='right':
result = pd.merge(master_df[['Symbol','Strike_Price']],child_df,on=['Symbol','Strike_Price'],indicator=True,how='right')
Then filter for right_only in the _merge column to get the desired result:
result = result[result['_merge']=='right_only']
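Both answers are variations on the same anti-join. A condensed, runnable sketch using a made-up subset of the question's columns:

```python
import pandas as pd

# Made-up subset of the question's columns.
master = pd.DataFrame({'Symbol': ['JETAIRWAYS', 'PCJEWELLER'],
                       'Strike_Price': [110.0, 77.5],
                       'C_BidPrice': [1.25, 0.95]})
child = pd.DataFrame({'Symbol': ['JETAIRWAYS', 'JETAIRWAYS', 'PCJEWELLER'],
                      'Strike_Price': [110.0, 150.0, 100.0],
                      'C_BidPrice': [1.25, 1.3, 1.8]})

# Right merge on the two key columns; rows marked 'right_only' exist
# only in child and are therefore the new ones.
new_rows = (master[['Symbol', 'Strike_Price']]
            .merge(child, on=['Symbol', 'Strike_Price'],
                   how='right', indicator=True)
            .query('_merge == "right_only"')
            .drop(columns='_merge'))
```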

Resampling a multi-index DataFrame

I want to resample a DataFrame with a multi-index containing both a datetime column and some other key. The DataFrame looks like:
import pandas as pd
from io import StringIO
csv = StringIO("""ID,NAME,DATE,VAR1
1,a,03-JAN-2013,69
1,a,04-JAN-2013,77
1,a,05-JAN-2013,75
2,b,03-JAN-2013,69
2,b,04-JAN-2013,75
2,b,05-JAN-2013,72""")
df = pd.read_csv(csv, index_col=['DATE', 'ID'], parse_dates=['DATE'])
df.columns.name = 'Params'
Because resampling is only allowed on datetime indexes, I thought unstacking the other index column would help. And indeed it does, but I can't stack it again afterwards.
print(df.unstack('ID').resample('W-THU'))
Params       VAR1
ID              1     2
DATE
2013-01-03   69.0  69.0
2013-01-10   76.0  73.5
But then stacking 'ID' again results in an IndexError:
print(df.unstack('ID').resample('W-THU').stack('ID'))
IndexError: index 0 is out of bounds for axis 0 with size 0
Strangely enough, I can stack the other column level with both:
print(df.unstack('ID').resample('W-THU').stack(0))
and
print(df.unstack('ID').resample('W-THU').stack('Params'))
The IndexError also occurs if I reorder (swap) the two column levels. Does anyone know how to overcome this issue?
The example unstacks the non-numerical column 'NAME', which is silently dropped but causes problems during re-stacking. The code below worked for me (note that in recent pandas, resample() must be followed by an explicit aggregation such as .mean()):
print(df[['VAR1']].unstack('ID').resample('W-THU').mean().stack('ID'))
Params         VAR1
DATE       ID
2013-01-03 1   69.0
           2   69.0
2013-01-10 1   76.0
           2   73.5
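For reference, a self-contained version of the accepted fix that runs under current pandas:

```python
from io import StringIO

import pandas as pd

csv = StringIO("""ID,NAME,DATE,VAR1
1,a,03-JAN-2013,69
1,a,04-JAN-2013,77
1,a,05-JAN-2013,75
2,b,03-JAN-2013,69
2,b,04-JAN-2013,75
2,b,05-JAN-2013,72""")
df = pd.read_csv(csv, index_col=['DATE', 'ID'], parse_dates=['DATE'])
df.columns.name = 'Params'

# Keep only the numeric column, move ID into the columns, resample the
# remaining DatetimeIndex, aggregate explicitly (required in recent
# pandas), then stack ID back into the row index.
res = df[['VAR1']].unstack('ID').resample('W-THU').mean().stack('ID')
```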
