Match 2 data frames by date and column name to get values - python

I have two data frames (they are already in a data frame format but for illustration, I created them as a dictionary first):
import pandas as pd

first = {
'Date':['2013-02-14','2013-03-03','2013-05-02','2014-10-31'],
'Name':['Felix','Felix','Peter','Paul']}
df1 = pd.DataFrame(first)
And
second = {
'Date':['2013-02-28','2013-03-31','2013-05-30','2014-10-31'],
'Felix':['Value1_x','Value2_x','Value3_x','Value4_x'],
'Peter':['Value1_y','Value2_y','Value3_y','Value4_y']}
df2 = pd.DataFrame(second)
Now, I'd like to add an additional column to df1 containing the values of df2 if df1.Date matches df2.Date by year and month (the day does not usually match, since df2 contains end-of-month dates) AND if the column name in df2 matches the corresponding df1.Name value.
So the result should look like this:
df_new = {
'Date':['2013-02-14','2013-03-03','2013-05-02','2014-10-31'],
'Name':['Felix','Felix','Peter','Paul'],
'Values':['Value1_x','Value2_x','Value3_y','NaN']}
df_new = pd.DataFrame(df_new)
Do you have any suggestions how to solve this problem?
I considered creating additional columns for year and month (e.g. df1['year'] = df1['Date'].dt.year), then matching with df1[(df1['year'] == df2['year']) & (df1['month'] == df2['month'])] and looking up the df2 column, but I can't figure out how to put everything together.

In general, try not to post your data sets as images, because that makes it hard to help you.
I think the easiest thing to do would be to parse the dates and create a column in each data frame where the Date is rolled back to the first day of its month.
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
df1['Date_round'] = df1['Date'] - pd.offsets.MonthBegin(1)
df2['Date_round'] = df2['Date'] - pd.offsets.MonthBegin(1)
(Note that MonthBegin(1) rolls a date that already falls on the 1st back to the previous month; none of the example dates are on the 1st, so that is not an issue here.)
Then reshape df2 using melt.
df2_reshaped = df2.melt(id_vars=['Date','Date_round'], var_name='Name', value_name='Values')
And then you can join the data frames on Date_round and Name using pd.merge.
df = pd.merge(df1, df2_reshaped.drop('Date', axis=1), how='left', on=['Date_round', 'Name'])
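If you prefer to match on year and month directly, as considered in the question, a Period-based variant of the same melt-and-merge idea might look like this (just a sketch: it assumes the Date columns have been parsed with pd.to_datetime as above, and the 'month' column name is only illustrative):
# A monthly period ('2013-02', ...) plays the role of the rounded date.
df1['month'] = df1['Date'].dt.to_period('M')
df2['month'] = df2['Date'].dt.to_period('M')
# Reshape df2 so the Felix/Peter columns become a Name column.
df2_long = df2.melt(id_vars=['Date', 'month'], var_name='Name', value_name='Values')
# Left join on month and Name; Paul has no column in df2, so his Values is NaN.
df_new = pd.merge(df1, df2_long.drop(columns='Date'), how='left', on=['month', 'Name'])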

Related

Adding correction column to dataframe

I have a pandas dataframe I read from a csv file with df = pd.read_csv("data.csv"):
date,location,value1,value2
2020-01-01,place1,1,2
2020-01-02,place2,5,8
2020-01-03,place2,2,9
I also have a dataframe with corrections, df_corr = pd.read_csv("corrections.csv"):
date,location,value
2020-01-02,place2,-1
2020-01-03,place2,2
How do I apply these corrections where date and location match to get the following?
date,location,value1,value2
2020-01-01,place1,1,2
2020-01-02,place2,4,8
2020-01-03,place2,4,9
EDIT:
I got two good answers and decided to go with set_index(). Here is how I did it 'non-destructively'.
df = pd.read_csv("data.csv")
df_corr = pd.read_csv("corr.csv")
idx = ['date', 'location']
df_corrected = df.set_index(idx).add(
    df_corr.set_index(idx).rename(columns={"value": "value1"}),
    fill_value=0
).astype(int).reset_index()
It looks like you want to join the two DataFrames on the date and location columns. After that it's a simple matter of applying the correction by adding the value1 and value columns (and finally dropping the column containing the corrections).
# Join on the date and location columns.
df_corrected = pd.merge(df, df_corr, on=['date', 'location'], how='left')
# Apply the correction by adding the columns; rows without a correction
# have NaN in 'value', so fill those with 0 first.
df_corrected.value1 = df_corrected.value1 + df_corrected.value.fillna(0)
# Drop the correction column.
df_corrected.drop(columns='value', inplace=True)
Set date and location as the index in both dataframes, add the two, and fillna:
df.set_index(['date','location'], inplace=True)
df_corr.set_index(['date','location'], inplace=True)
df['value1'] = (df['value1'] + df_corr['value']).fillna(df['value1'])
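For reference, here is a self-contained sketch of the set_index() approach, building the frames inline instead of reading the CSVs (the inline data simply mirrors the CSV snippets above):
import pandas as pd

df = pd.DataFrame({'date': ['2020-01-01', '2020-01-02', '2020-01-03'],
                   'location': ['place1', 'place2', 'place2'],
                   'value1': [1, 5, 2],
                   'value2': [2, 8, 9]})
df_corr = pd.DataFrame({'date': ['2020-01-02', '2020-01-03'],
                        'location': ['place2', 'place2'],
                        'value': [-1, 2]})

idx = ['date', 'location']
# Rows without a correction are left untouched thanks to fill_value=0, and
# value2 passes through unchanged because df_corr has no such column.
df_corrected = df.set_index(idx).add(
    df_corr.set_index(idx).rename(columns={'value': 'value1'}),
    fill_value=0
).astype(int).reset_index()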

Divide two dataframes with multiple columns (column specific)

I have two identically sized dataframes (df1 & df2). I would like to create a new dataframe whose values are df1 column1 / df2 column1.
So essentially df3 = df1(c1)/df2(c1), df1(c2)/df2(c2), df1(c3)/df2(c3)...
I've tried the code below; however, both attempts give a dataframe filled with NaN:
#attempt 1
df3 = df2.divide(df1, axis='columns')
#attempt 2
df3= df2/df1
You can try the following code:
df3 = df2.div(df1.iloc[0], axis='columns')
To use the divide function, the indexes of the dataframes need to match. In this situation, df1 held beginning-of-month values and df2 end-of-month values. The question can be solved by:
df3 = df2.reset_index(drop=True)/df1.reset_index(drop=True)
df3.set_index(df2.index, inplace=True)  # set the index back to the original (i.e. end of month)
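To see why the all-NaN result happens, here is a tiny self-contained sketch with hypothetical month-start and month-end indexes:
import pandas as pd

df1 = pd.DataFrame({'c1': [2.0, 4.0]}, index=pd.to_datetime(['2021-01-01', '2021-02-01']))
df2 = pd.DataFrame({'c1': [10.0, 20.0]}, index=pd.to_datetime(['2021-01-31', '2021-02-28']))

# The indexes do not overlap, so element-wise division aligns on nothing
# and every cell comes out NaN.
df2 / df1

# Dropping the indexes divides by position instead; the original
# (end-of-month) index can then be restored.
df3 = df2.reset_index(drop=True) / df1.reset_index(drop=True)
df3.index = df2.index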

Left join DataFrame where the Date in the left DataFrame is contained in the range of Dates based around a Date in the right DataFrame

import pandas as pd
df_A = pd.DataFrame({'Team_A': ['Cowboys', 'Giants'], 'Team_B': ['Eagles', 'Redskins'], 'Start':['2017-11-09','2017-09-10']})
df_B = pd.DataFrame({'Team_A': ['Cowboys', 'Cowboys', 'Giants'], 'Team_B': ['Eagles', 'Eagles','Redskins'], 'Start':['2017-11-09','2017-11-11','2017-09-10']})
df_A['Start'] = pd.to_datetime(df_A.Start)
df_B['Start'] = pd.to_datetime(df_B.Start)
I want to left join on df_A. The trouble is that the games may be repeated in df_B, usually with a slightly different date, no more than +/- 4 days from the correct date (the one listed in df_A). In the example shown, the first game in df_A appears twice in df_B: first with the correct date, then with an incorrect date. It is not necessarily the case that the first date is the correct one, and more than one incorrect date may appear, so a game may show up more than twice. Please note also that the example above is simplified; in the actual problem there are several other columns which may or may not match. The other key point is that these teams will appear again several times in the real data, but at dates much further away than +/- 4 days.
df_merge = pd.merge(df_A, df_B, on=['Team_A', 'Team_B', 'Start'], how='left')
This is close to what I want but only gives the games where the Start dates match exactly. I also want the games that are within +- 4 days of the Start date.
Merging two dataframes based on a date between two other dates without a common column
This tackles a similar problem but in my case the number of rows in each DataFrame are different so it won't work for me.
I also tried this one but could not get it to work for me:
How to join two table in pandas based on time with delay
I also tried:
a = df_A['Start'] - pd.Timedelta(4, unit='d')
b = df_A['Start'] + pd.Timedelta(4, unit='d')
df = df_B[df_B['Start'].between(a, b, inclusive=False)]
but again this does not work because of the differing number of rows in each DataFrame.
IIUC, you would rather use an outer merge, as in the following example:
import pandas as pd
df_A = pd.DataFrame({'Team_A': ['Cowboys', 'Giants'], 'Team_B': ['Eagles', 'Redskins'], 'Start':['2017-11-09','2017-09-10']})
df_B = pd.DataFrame({'Team_A': ['Cowboys', 'Cowboys', 'Giants'], 'Team_B': ['Eagles', 'Eagles','Redskins'], 'Start':['2017-11-09','2017-11-11','2017-09-10']})
df_A['Start'] = pd.to_datetime(df_A.Start)
df_B['Start'] = pd.to_datetime(df_B.Start)
# +/- 4 days
df_A["lower"] = df_A["Start"]- pd.Timedelta(4, unit='d')
df_A["upper"] = df_A["Start"] + pd.Timedelta(4, unit='d')
# Get rid of Start col
df_A = df_A.drop("Start", axis=1)
# outer merge on Team_A, Team_B only
df = pd.merge(df_A, df_B, on=['Team_A', 'Team_B'], how='outer')
# filter
df = df[df["Start"].between(df["lower"], df["upper"])].reset_index(drop=True)
If your dataframe is huge you might consider using dask.
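As an alternative to the outer-merge-then-filter approach, pd.merge_asof with a tolerance can express the +/- 4 day window directly. This is only a rough sketch: it assumes df_A and df_B as originally constructed (i.e. df_A still has its Start column), and Start_correct is just an illustrative name for df_A's canonical date. Each df_B row gets the nearest df_A date within 4 days, or NaT if there is none:
df_A2 = df_A.rename(columns={'Start': 'Start_correct'})
df = pd.merge_asof(
    df_B.sort_values('Start'),
    df_A2.sort_values('Start_correct'),
    left_on='Start',
    right_on='Start_correct',
    by=['Team_A', 'Team_B'],
    tolerance=pd.Timedelta('4d'),
    direction='nearest',
)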

Joining multiple data frames with join with pandas

I have the two data frames mentioned below.
The df1 data frame has SaleDate as its unique key column; df1's shape is (12, 11).
The second data frame, df2, also has SaleDate as its unique key column; df2's shape is (2, 19).
So the dimensions of the two data frames are different.
Somehow I need to join the two data frames on a new [month-year] column, which can be derived from SaleDate, and attach the same urea price to every row of the respective month and year.
The expected output is mentioned below: the df3 data frame has the monthly urea price for each row; the shape of the new data frame is (13, 11).
***The actual df1 consists of 2 million records and df2 consists of 360 records.
I tried to join the two data frames with a left join to get the above output, but was unable to achieve it.
import pandas as pd # Import Pandas for data manipulation using dataframes
df1['month_year']=pd.to_datetime(df1['SaleDate']).dt.to_period('M')
df2['month_year'] = pd.to_datetime(df2['SaleDate']).dt.to_period('M')
df1 = pd.DataFrame({'Factory': ['MF0322','MF0657','MF0300','MF0790'],
                    'SaleDate': ['2013-02-07','2013-03-07','2013-06-07','2013-05-07'],
                    'month-year': ['2013-02','2013-03','2013-06','2013-05']})
df2 = pd.DataFrame({'Price': ['398.17','425.63','398.13','363','343.33','325.13'],
                    'Month': ['2013-01-01','2013-02-01','2013-03-01','2013-04-01','2013-05-01','2013-06-01'],
                    'month-year': ['2013-01','2013-02','2013-03','2013-04','2013-05','2013-06']})
Final data frame
s1 = pd.merge(df1, df2, how='left', on=['month_year'])
All values pertaining to the urea price were NaN.
Hope to get expert advice in this regard.
Assuming your SaleDate columns are string dtypes, you could just do:
df1['month_year'] = df1['SaleDate'].apply(lambda x: x[:7])
df2['month_year'] = df2['SaleDate'].apply(lambda x: x[:7])
And I think the rest should work!
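As a small aside, the same seven-character slice can be done without apply; for example (using SaleDate for df1 and the Month column that the posted df2 actually has):
df1['month_year'] = df1['SaleDate'].str[:7]
df2['month_year'] = df2['Month'].str[:7]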
I copied your code, without month_year column:
df1 = pd.DataFrame({'Factory': ['MF0322','MF0657','MF0300','MF0790'],
'SaleDate': ['2013-02-07','2013-03-07','2013-06-07','2013-05-07']})
df2 = pd.DataFrame({'Price': ['398.17','425.63','398.13','363','343.33','325.13'],
'Month': ['2013-01-01','2013-02-01','2013-03-01','2013-04-01','2013-05-01',
'2013-06-01']})
Then I created month_year column in both DataFrames:
df1['month_year'] = pd.to_datetime(df1['SaleDate']).dt.to_period('M')
df2['month_year'] = pd.to_datetime(df2['Month']).dt.to_period('M')
and merged them:
s1 = pd.merge(df1, df2, how='left', on=['month_year'])
When I executed print(s1) I got:
  Factory    SaleDate month_year   Price       Month
0  MF0322  2013-02-07    2013-02  425.63  2013-02-01
1  MF0657  2013-03-07    2013-03  398.13  2013-03-01
2  MF0300  2013-06-07    2013-06  325.13  2013-06-01
3  MF0790  2013-05-07    2013-05  343.33  2013-05-01
As you can see, the Price column is correct, equal to the Price for the
respective month (according to SaleDate).
So generally your code is OK.
Check for other sources of errors. E.g. in your code snippet:
you first set month_year in each DataFrame,
then you create both DataFrames again, destroying the previous content.
Copy my code (and nothing more) and confirm that it gives the same result.
Maybe the source of your problem is in some totally other place?
Note that e.g. your df2 has Month column, not SaleDate.
Maybe this is the root cause?

convert string to float and align columns in dataframe

I have a Dataframe df where a Date column belongs to each Value column. The contents are (still) formatted as strings:
       Date        Value1  Date        Value2
Index
0      30.01.2001  20,32   30.05.2005  50,55
1      30.02.2001  19,5    30.06.2005  49,21
2      30.03.2001  21,45   30.07.2005  48,1
my issues (in order of priority):
I do not manage to convert the Value columns to float, even after I successfully converted the ',' to '.' with
df.replace(to_replace=",", value='.', inplace=True, regex=True)
What can you suggest for converting to float? I suspect the reason it is not working is that there is sometimes only one decimal place after the comma. How can I solve this?
How can I align the dates so that Value2's dates line up with those of Value1 (i.e. the Value2 block needs to be shifted down until the dates match, provided that the rows continue up to the present day)?
What is the most efficient way to iterate through the columns in order to apply this formatting?
EDIT:
Based on the answers so far... how can I iterate through the larger dataframe and split it into single dataframes/series as suggested? (I have issues generating counter integers for the dfs, i.e. df1, df2, df3, ...)
A number of steps:
# I wanted to split into two dataframes
df1 = df.iloc[:, :2].copy()
# rename the duplicate column back to 'Date' (read_csv parsed it in as 'Date.1')
df2 = df.iloc[:, 2:].copy().rename(columns={'Date.1': 'Date'})
# As EdChum suggested.
df1.Value1 = df1.Value1.str.replace(',', '.').astype(float)
df2.Value2 = df2.Value2.str.replace(',', '.').astype(float)
# Convert to dates; the source format is day.month.year, so parse it
# explicitly instead of swapping the separators.
df1.Date = pd.to_datetime(df1.Date, format='%d.%m.%Y')
df2.Date = pd.to_datetime(df2.Date, format='%d.%m.%Y')
# set dates as index in anticipation for `pd.concat`
df1 = df1.set_index('Date')
df2 = df2.set_index('Date')
# line up dates.
pd.concat([df1, df2], axis=1)
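To address the EDIT about iterating over many Date/Value pairs, here is a minimal sketch. It assumes the wide frame alternates Date and Value columns (Date, Value1, Date.1, Value2, ...) and reuses the same conversions as above; the names pieces and aligned are only illustrative:
pieces = []
for i in range(0, df.shape[1], 2):
    pair = df.iloc[:, i:i + 2].copy()
    date_col, value_col = pair.columns
    pair[value_col] = pair[value_col].str.replace(',', '.', regex=False).astype(float)
    pair[date_col] = pd.to_datetime(pair[date_col], format='%d.%m.%Y')
    # Index each Value series by its own dates so concat can line them up.
    pieces.append(pair.set_index(date_col)[value_col])
aligned = pd.concat(pieces, axis=1)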
