Filter elements from 2 pandas dataframes - python

I have two dataframes which represent stock prices over time and stock related information over time (e.g. fundamental data on the company).
Both dataframes contain monthly data, but they cover different time spans: one covers 5 years, the other 10. They also do not contain the same set of stocks; there is only an 80% overlap.
Below is an example of the dataframes:
import numpy as np
import pandas as pd
days1 = pd.date_range(start='1/1/1980', end='7/1/1980', freq='M')
df1 = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'), index=days1)
days2 = pd.date_range(start='1/1/1980', end='5/1/1980', freq='M')
df2 = pd.DataFrame(np.random.randn(4, 6), columns=list('ABCDEF'), index=days2)
My goal is to reduce both dataframes to their inner join, so that both cover the same time period and contain the same stocks. My index contains the dates, and the column names are the stock names.
I have tried multiple variations of merge() etc., but those create a single merged dataframe, and I want to keep both dataframes. I have also tried isin(), but I am struggling with accessing the index of each dataframe. For instance:
df3=df1[df1.isin(df2)].dropna()
Does someone have any suggestions?

For the column intersection:
column_intersection = df1.columns.intersection(df2.columns)
For the row intersection:
row_intersection = df1.index.intersection(df2.index)
Then just subset each dataframe:
df1 = df1.loc[row_intersection, column_intersection]
df2 = df2.loc[row_intersection, column_intersection]
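A minimal, self-contained sketch of the subsetting approach above, under a few assumptions: the dates are month starts (`'MS'`) rather than the month ends used in the question, the random data is seeded only for reproducibility, and the names `common_cols`, `common_rows`, `df1_common` and `df2_common` are illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: month-start dates ('MS') stand in for the question's month-end dates.
days1 = pd.date_range(start="1980-01-01", periods=6, freq="MS")
days2 = pd.date_range(start="1980-01-01", periods=4, freq="MS")

rng = np.random.default_rng(0)  # fixed seed, illustrative data only
df1 = pd.DataFrame(rng.standard_normal((6, 4)), columns=list("ABCD"), index=days1)
df2 = pd.DataFrame(rng.standard_normal((4, 6)), columns=list("ABCDEF"), index=days2)

# Labels present in both frames.
common_cols = df1.columns.intersection(df2.columns)
common_rows = df1.index.intersection(df2.index)

# Each frame is reduced separately, so both are kept.
df1_common = df1.loc[common_rows, common_cols]
df2_common = df2.loc[common_rows, common_cols]

print(df1_common.shape, df2_common.shape)
```

Both results share the same index and columns, while the values of each original frame are left untouched.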

Related

Loop over rows of a single row to merge rows with overlapping dates

I have two dataframes with the following framework:
DF1:
df1 = pd.DataFrame({'Country': ['USA', 'USA','USA','UK','UK','UK'],
'City': ['NYC','NYC','NYC','London','London','London'],
'Start_Range': pd.to_datetime(['2016-01-01', '2020-07-01','2022-01-01','2019-01-01','2021-01-01','2023-01-01']),
'End_Range': pd.to_datetime(['2020-06-30', '2021-12-31','2023-12-31','2020-12-31','2022-12-31','2023-12-31'])
})
DF2:
df2 = pd.DataFrame({'Country': ['USA', 'USA','UK'],
'City': ['NYC','NYC','London'],
'Grade_Validity_Begin': pd.to_datetime(['2021-10-01','2023-01-01','2021-10-01']),
'Grade_Validity_End': pd.to_datetime(['2022-12-31', '2099-12-31','2023-12-31'])
})
DF1 contains the validity of the city across different time slices.
DF2 describes the range of another column called 'Grade' for each country-city combination having its own time slice validity.
I would like to merge these dataframes into a resultant dataframe such that:
For a Start_Range and End_Range with no grade in DF2, the DF2 values remain blank after merging.
In cases where the grade is valid in a time slice that overlaps with a time slice in DF1, the row needs to be split according to the validity of the time ranges in DF1.
Since there are multiple cities and countries in the dataframe, I have tried to left merge them and group them by city and country,
and then to loop over each group and compare the start and end dates. But I cannot figure out how this would work.
Thank you for any tips!
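As a possible first step only, here is a hedged sketch that pairs each DF1 slice with the grade intervals of the same country and city and computes where they overlap. It does not split the rows or fill the blank periods. The column names `Overlap_Start` and `Overlap_End` are hypothetical, and the end-date column is spelled `Grade_Validity_End` here.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "UK", "UK", "UK"],
    "City": ["NYC", "NYC", "NYC", "London", "London", "London"],
    "Start_Range": pd.to_datetime(["2016-01-01", "2020-07-01", "2022-01-01",
                                   "2019-01-01", "2021-01-01", "2023-01-01"]),
    "End_Range": pd.to_datetime(["2020-06-30", "2021-12-31", "2023-12-31",
                                 "2020-12-31", "2022-12-31", "2023-12-31"]),
})
df2 = pd.DataFrame({
    "Country": ["USA", "USA", "UK"],
    "City": ["NYC", "NYC", "London"],
    "Grade_Validity_Begin": pd.to_datetime(["2021-10-01", "2023-01-01", "2021-10-01"]),
    "Grade_Validity_End": pd.to_datetime(["2022-12-31", "2099-12-31", "2023-12-31"]),
})

# Pair every DF1 slice with every grade interval of the same country and city.
merged = df1.merge(df2, on=["Country", "City"], how="left")

# Overlap start is the later of the two start dates, overlap end the earlier end date.
merged["Overlap_Start"] = merged["Start_Range"].where(
    merged["Start_Range"] >= merged["Grade_Validity_Begin"],
    merged["Grade_Validity_Begin"],
)
merged["Overlap_End"] = merged["End_Range"].where(
    merged["End_Range"] <= merged["Grade_Validity_End"],
    merged["Grade_Validity_End"],
)

# A pair really overlaps only if the window is not empty.
overlaps = merged[merged["Overlap_Start"] <= merged["Overlap_End"]]
print(overlaps[["Country", "City", "Overlap_Start", "Overlap_End"]])
```

The DF1 slices that never appear in `overlaps` are the ones that would keep blank grade values in the final result.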

Divide two dataframes with multiple columns (column specific)

I have two identically sized dataframes (df1 & df2). I would like to create a new dataframe whose values are df1 column1 / df2 column1, and so on for each column.
So essentially df3 = df1(c1)/df2(c1), df1(c2)/df2(c2), df1(c3)/df2(c3)...
I've tried the code below, but both attempts give a dataframe filled with NaN:
#attempt 1
df3 = df2.divide(df1, axis='columns')
#attempt 2
df3= df2/df1
You can try the following code:
df3 = df2.div(df1.iloc[0], axis='columns')
To use the divide function, the indexes of the dataframes need to match. In this situation, df1 held beginning-of-month values and df2 end-of-month values. The question can be solved by:
df3 = df2.reset_index(drop=True) / df1.reset_index(drop=True)
df3.set_index(df2.index, inplace=True)  # set the index back to the original (i.e. end of month)
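To illustrate why mismatched indexes produce NaN, here is a small sketch with made-up data (the values and dates are assumptions, not the asker's data). It also shows one possible alternative to resetting the index: dividing by the raw NumPy values, which skips label alignment and assumes both frames have the same shape and column order.

```python
import pandas as pd

# Hypothetical data: df1 indexed by month start, df2 by month end.
df1 = pd.DataFrame({"c1": [1.0, 2.0], "c2": [4.0, 5.0]},
                   index=pd.to_datetime(["2020-01-01", "2020-02-01"]))
df2 = pd.DataFrame({"c1": [2.0, 8.0], "c2": [8.0, 10.0]},
                   index=pd.to_datetime(["2020-01-31", "2020-02-29"]))

# No row label is shared, so label alignment leaves only NaN.
aligned = df2 / df1
print(aligned)

# Dividing by the underlying array is purely positional and keeps df2's index.
positional = df2 / df1.to_numpy()
print(positional)
```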

Left join DataFrame where the Date in the left DataFrame is contained in the range of Dates based around a Date in the right DataFrame

import pandas as pd
df_A = pd.DataFrame({'Team_A': ['Cowboys', 'Giants'], 'Team_B': ['Eagles', 'Redskins'], 'Start':['2017-11-09','2017-09-10']})
df_B = pd.DataFrame({'Team_A': ['Cowboys', 'Cowboys', 'Giants'], 'Team_B': ['Eagles', 'Eagles','Redskins'], 'Start':['2017-11-09','2017-11-11','2017-09-10']})
df_A['Start'] = pd.to_datetime(df_A.Start)
df_B['Start'] = pd.to_datetime(df_B.Start)
I want to left join on df_A. The trouble is that the games may be repeated in df_B, usually with a slightly different date, no more than +- 4 days from the correct date (the one listed in df_A). In the example shown, the first game in df_A appears twice: first with the correct date, the second time with an incorrect date. It is not necessarily the case that the first date will be the correct one. It is also possible that more than one incorrect date is shown, so a game may appear more than twice. Please note also that the example above is simplified; in the actual problem there are several other columns which may or may not match. The other key point is that these teams will appear again several times in the real problem, but at dates much further apart than +- 4 days.
df_merge = pd.merge(df_A, df_B, on=['Team_A', 'Team_B', 'Start'], how='left')
This is close to what I want but only gives the games where the Start dates match exactly. I also want the games that are within +- 4 days of the Start date.
Merging two dataframes based on a date between two other dates without a common column
This tackles a similar problem but in my case the number of rows in each DataFrame are different so it won't work for me.
I also tried this one but could not get it to work for me:
How to join two table in pandas based on time with delay
I also tried:
a = df_A['Start'] - pd.Timedelta(4, unit='d')
b = df_A['Start'] + pd.Timedelta(4, unit='d')
df = df_B[df_B['Start'].between(a, b, inclusive='neither')]
but again this does not work because of the differing number of rows in each DataFrame.
IIUC, you would rather use an outer merge, as in the following example:
import pandas as pd
df_A = pd.DataFrame({'Team_A': ['Cowboys', 'Giants'], 'Team_B': ['Eagles', 'Redskins'], 'Start':['2017-11-09','2017-09-10']})
df_B = pd.DataFrame({'Team_A': ['Cowboys', 'Cowboys', 'Giants'], 'Team_B': ['Eagles', 'Eagles','Redskins'], 'Start':['2017-11-09','2017-11-11','2017-09-10']})
df_A['Start'] = pd.to_datetime(df_A.Start)
df_B['Start'] = pd.to_datetime(df_B.Start)
# +/- 4 days
df_A["lower"] = df_A["Start"]- pd.Timedelta(4, unit='d')
df_A["upper"] = df_A["Start"] + pd.Timedelta(4, unit='d')
# Get rid of Start col
df_A = df_A.drop("Start", axis=1)
# outer merge on Team_A, Team_B only
df = pd.merge(df_A, df_B, on=['Team_A', 'Team_B'], how='outer')
# filter
df = df[df["Start"].between(df["lower"], df["upper"])].reset_index(drop=True)
If your dataframe is huge you might consider using dask.
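One thing to be aware of with the merge-then-filter approach: the final filter drops any df_A game that has no df_B row inside the window. If such games should be kept, as in a true left join, one possible variation is sketched below. The third game (Jets vs Bills) and the `Start_B` suffix are assumptions added purely for illustration.

```python
import pandas as pd

df_A = pd.DataFrame({
    "Team_A": ["Cowboys", "Giants", "Jets"],
    "Team_B": ["Eagles", "Redskins", "Bills"],
    "Start": pd.to_datetime(["2017-11-09", "2017-09-10", "2017-10-01"]),
})
df_B = pd.DataFrame({
    "Team_A": ["Cowboys", "Cowboys", "Giants"],
    "Team_B": ["Eagles", "Eagles", "Redskins"],
    "Start": pd.to_datetime(["2017-11-09", "2017-11-11", "2017-09-10"]),
})

window = pd.Timedelta(days=4)

# Pair every df_A game with every df_B game between the same teams.
pairs = df_A.merge(df_B, on=["Team_A", "Team_B"], how="inner", suffixes=("", "_B"))

# Keep only the pairs whose dates are at most 4 days apart.
close = pairs[(pairs["Start_B"] - pairs["Start"]).abs() <= window]

# Left-join back so that df_A games without a nearby df_B row survive (Start_B is NaT).
result = df_A.merge(close, on=["Team_A", "Team_B", "Start"], how="left")
print(result)
```

The intermediate `pairs` frame can grow large when the same teams meet many times, which is the same cost the outer merge has.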

Is there a way to extract only one column from all the 30 dataframes?

I have 30 dataframes, but I want just one column from each of them. Each of these dataframes contains stock prices (OHLC), Adj Close and volumes. I want to extract only one column from the 30 dataframes, i.e. "Adj Close".
How do I do that without making the code lengthy?
Use a list comprehension:
dfs = [df1, df2, df3]  # ... and so on, up to df30
# if you need Series
out = [df['Adj Close'] for df in dfs]
# if you need one-column DataFrames
#out = [df[['Adj Close']] for df in dfs]
Or a loop:
out = []
for df in dfs:
    # if you need Series
    out.append(df['Adj Close'])
    # if you need one-column DataFrames
    #out.append(df[['Adj Close']])
Finally, if you need one big DataFrame with one column per Series:
df_big = pd.concat(out, ignore_index=True, axis=1)
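With `ignore_index=True` the resulting columns are numbered 0, 1, 2, and so on, so it is no longer obvious which stock each column belongs to. A possible variation, assuming the frames are kept in a dict keyed by a ticker name (the tickers and prices below are made up), is to let `pd.concat` label the columns from the dict keys:

```python
import pandas as pd

# Hypothetical stand-ins for the 30 frames, keyed by ticker.
frames = {
    "AAA": pd.DataFrame({"Close": [10.0, 11.0], "Adj Close": [9.5, 10.5]}),
    "BBB": pd.DataFrame({"Close": [20.0, 21.0], "Adj Close": [19.0, 20.0]}),
}

# Passing a dict to concat uses its keys as the new column labels.
adj_close = pd.concat({name: df["Adj Close"] for name, df in frames.items()}, axis=1)
print(adj_close)
```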

Joining multiple data frames with join with pandas

I have the two data frames described below.
The df1 dataframe has a SaleDate column as its unique key column.
The shape of df1 is (12, 11).
The second data frame is described below.
The df2 dataframe also has a SaleDate column as its unique key column.
The shape of df2 is (2, 19).
The dimensions of the two data frames are different.
Somehow I need to join the two data frames on a new [month-year] column, which can be derived from SaleDate, and apply the same urea price to the whole month of the respective year.
The expected output is described below.
The df3 data frame contains the monthly urea price for each row of the data frame.
The shape of the new dataframe is (13, 11).
Note: the actual df1 contains 2 million records and df2 contains 360 records.
I tried a left join of the two data frames to get the above output, but could not achieve it.
import pandas as pd # Import Pandas for data manipulation using dataframes
df1['month_year']=pd.to_datetime(df1['SaleDate']).dt.to_period('M')
df2['month_year'] = pd.to_datetime(df2['SaleDate']).dt.to_period('M')
df1 = pd.DataFrame({'Factory': ['MF0322','MF0657','MF0300','MF0790'],
                    'SaleDate': ['2013-02-07','2013-03-07','2013-06-07','2013-05-07'],
                    'month-year':['2013-02','2013-03','2013-06','2013-05']})
df2 = pd.DataFrame({'Price': ['398.17','425.63','398.13','363','343.33','325.13'],
                    'Month': ['2013-01-01','2013-02-01','2013-03-01','2013-04-01','2013-05-01','2013-06-01'],
                    'month-year':['2013-01','2013-02','2013-03','2013-04','2013-05','2013-06']})
Final data frame
s1 = pd.merge(df1, df2, how='left', on=['month_year'])
All values for the urea price were NaN.
Hope to get expert advice in this regard.
Assuming your SaleDate columns are string dtypes, you could just do:
df1['month_year'] = df1['SaleDate'].apply(lambda x: x[:7])
df2['month_year'] = df2['SaleDate'].apply(lambda x: x[:7])
And I think the rest should work!
I copied your code, without month_year column:
df1 = pd.DataFrame({'Factory': ['MF0322','MF0657','MF0300','MF0790'],
'SaleDate': ['2013-02-07','2013-03-07','2013-06-07','2013-05-07']})
df2 = pd.DataFrame({'Price': ['398.17','425.63','398.13','363','343.33','325.13'],
'Month': ['2013-01-01','2013-02-01','2013-03-01','2013-04-01','2013-05-01',
'2013-06-01']})
Then I created month_year column in both DataFrames:
df1['month_year'] = pd.to_datetime(df1['SaleDate']).dt.to_period('M')
df2['month_year'] = pd.to_datetime(df2['Month']).dt.to_period('M')
and merged them:
s1 = pd.merge(df1, df2, how='left', on=['month_year'])
When I executed print(s1) I got:
  Factory    SaleDate month_year   Price       Month
0  MF0322  2013-02-07    2013-02  425.63  2013-02-01
1  MF0657  2013-03-07    2013-03  398.13  2013-03-01
2  MF0300  2013-06-07    2013-06  325.13  2013-06-01
3  MF0790  2013-05-07    2013-05  343.33  2013-05-01
As you can see, Price column is correct, equal to Price for
respective month (according to SaleDate).
So generally your code is OK.
Check for other sources of errors. E.g. in your code snippet:
you first set month_year in each DataFrame,
then you create both DataFrames again, destroying the previous content.
Copy my code (and nothing more) and confirm that it gives the same result.
Maybe the source of your problem is in some totally other place?
Note that e.g. your df2 has Month column, not SaleDate.
Maybe this is the root cause?
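When a left merge returns only NaN on the right-hand side, two quick checks can help narrow down the cause: comparing the dtypes of the key columns, and passing `indicator=True` to `pd.merge` to see which rows actually found a match. A minimal sketch, using a shortened version of the sample data above:

```python
import pandas as pd

df1 = pd.DataFrame({"Factory": ["MF0322", "MF0657"],
                    "SaleDate": ["2013-02-07", "2013-03-07"]})
df2 = pd.DataFrame({"Price": ["425.63", "398.13"],
                    "Month": ["2013-02-01", "2013-03-01"]})

df1["month_year"] = pd.to_datetime(df1["SaleDate"]).dt.to_period("M")
df2["month_year"] = pd.to_datetime(df2["Month"]).dt.to_period("M")

# Both keys should report the same dtype; a string key on one side and a
# period key on the other would never match.
print(df1["month_year"].dtype, df2["month_year"].dtype)

# indicator=True adds a _merge column: 'both' for matched rows, 'left_only' otherwise.
s1 = pd.merge(df1, df2, how="left", on="month_year", indicator=True)
print(s1["_merge"].value_counts())
```

If every row shows `left_only`, the keys themselves differ (different dtype, format or column), rather than the merge call being wrong.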
