Python Pandas merge and update dataframe

I am currently using Python and Pandas to build a stock price "database". I managed to find some code to download the stock prices.
df1 is my existing database. Each time I download the share prices, the new data looks like df2 and df3. I then need to combine df1, df2 and df3 so that the result looks like df4.
Each stock has its own column.
Each date has its own row.
df1: Existing database
+----------+-------+----------+--------+
| Date | Apple | Facebook | Google |
+----------+-------+----------+--------+
| 1/1/2018 | 161 | 58 | 1000 |
| 2/1/2018 | 170 | 80 | |
| 3/1/2018 | 190 | 84 | 100 |
+----------+-------+----------+--------+
df2: New data (2/1/2018 and 4/1/2018) and updated data (3/1/2018) for Google.
+----------+--------+
| Date | Google |
+----------+--------+
| 2/1/2018 | 500 |
| 3/1/2018 | 300 |
| 4/1/2018 | 200 |
+----------+--------+
df3: New data for Amazon
+----------+--------+
| Date | Amazon |
+----------+--------+
| 1/1/2018 | 1000 |
| 2/1/2018 | 1500 |
| 3/1/2018 | 2000 |
| 4/1/2018 | 3000 |
+----------+--------+
df4: Final output. Basically, it merges and updates all the data into the database (df1 + df2 + df3) --> this becomes the updated version of df1.
+----------+-------+----------+--------+--------+
| Date | Apple | Facebook | Google | Amazon |
+----------+-------+----------+--------+--------+
| 1/1/2018 | 161 | 58 | 1000 | 1000 |
| 2/1/2018 | 170 | 80 | 500 | 1500 |
| 3/1/2018 | 190 | 84 | 300 | 2000 |
| 4/1/2018 | | | 200 | 3000 |
+----------+-------+----------+--------+--------+
I do not know how to combine df1 and df3.
I also do not know how to combine df1 and df2: adding a new row (4/1/2018) while at the same time updating existing data (2/1/2018: original NaN, amended 500; 3/1/2018: original 100, amended 300) and leaving the already-correct data (1/1/2018) intact.
Can anyone help me to get df4? =)
Thank you.
EDIT: Based on Sociopath's suggestion, I amended the code to:
dataframes = [df2, df3]
df4 = df1
for i in dataframes:
    # Merge the dataframe
    df4 = df4.merge(i, how='outer', on='date')
    # Get the stock name
    stock_name = i.columns[1]
    # If a column with the "_x" suffix exists, combine the suffixed columns
    if stock_name + "_x" in df4.columns:
        x = stock_name + "_x"
        y = stock_name + "_y"
        df4[stock_name] = df4[y].fillna(df4[x])
        df4.drop([x, y], axis=1, inplace=True)

You need merge:
df1 = pd.DataFrame({'date':['2/1/2018','3/1/2018','4/1/2018'], 'Google':[500,300,200]})
df2 = pd.DataFrame({'date':['1/1/2018','2/1/2018','3/1/2018','4/1/2018'], 'Amazon':[1000,1500,2000,3000]})
df3 = pd.DataFrame({'date':['1/1/2018','2/1/2018','3/1/2018'], 'Apple':[161,171,181], 'Google':[1000,None,100], 'Facebook':[58,75,65]})
If the column is not present in the current database, simply use merge as below:
df_new = df3.merge(df2, how='outer',on=['date'])
If the column is present in the DB, then use fillna to update the values as below:
df_new = df_new.merge(df1, how='outer', on='date')
#print(df_new)
df_new['Google'] = df_new['Google_y'].fillna(df_new['Google_x'])
df_new.drop(['Google_x','Google_y'], axis=1, inplace=True)
Output:
       date  Apple  Facebook  Amazon  Google
0  1/1/2018  161.0      58.0    1000  1000.0
1  2/1/2018  171.0      75.0    1500   500.0
2  3/1/2018  181.0      65.0    2000   300.0
3  4/1/2018    NaN       NaN    3000   200.0
EDIT: A more generic solution for the later part.
dataframes = [df2, df3, df4]
for i in dataframes:
    stock_name = list(i.columns.difference(['date']))[0]
    df_new = df_new.merge(i, how='outer', on='date')
    x = stock_name + "_x"
    y = stock_name + "_y"
    # The merge only creates suffixed columns when the stock already exists in df_new
    if x in df_new.columns:
        df_new[stock_name] = df_new[y].fillna(df_new[x])
        df_new.drop([x, y], axis=1, inplace=True)
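For completeness, an equivalent approach is to index on 'date' and let each newly downloaded frame take precedence via combine_first. This is a sketch only; df_existing and new_downloads are placeholder names, not from the answer above.
import pandas as pd

def update_database(df_existing, new_downloads):
    # df_existing: current database; new_downloads: list of freshly downloaded frames,
    # each with a 'date' column plus one column per stock (placeholder assumption)
    db = df_existing.set_index('date')
    for new in new_downloads:
        # non-null values in `new` overwrite db; new dates/tickers are appended
        db = new.set_index('date').combine_first(db)
    return db.reset_index()

# usage with the sample frames defined above:
# df_new = update_database(df3, [df1, df2])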

Related

Plotly Dash callback between 2 pandas DataFrames

I have two pandas DataFrames. df1 is 2 years of time series data recorded hourly for 20,000+ users, and it looks something like this:
TimeStamp | UserID1 | UserID2 | ... | UserID20000 |
---------------------------------------------------------------
2017-01-01 00:00:00 | 1.5 | 22.5 | ... | 5.5 |
2017-01-01 01:00:00 | 4.5 | 3.2 | ... | 9.12 |
.
.
.
2019-12-31 22:00:00 | 4.2 | 7.6 | ... | 8.9 |
2019-12-31 23:00:00 | 3.2 | 0.9 | ... | 11.2 |
df2 is ~ 20 attributes for each of the users and looks something like this:
User | Attribute1 | Attribute2 | ... | Attribute20 |
------------------------------------------------------------
UserID1 | yellow | big | ... | 450 |
UserID2 | red | small | ... | 6500 |
.
.
.
UserID20000 | yellow | small | ... | 950 |
I would like to create a Plotly Dash with callbacks where a user can specify attribute values or ranges of values (ie Attribute1 == 'yellow', Attribute20 < 1000 AND Attribute20 > 500) to create line graphs of the time series data of only the users that meet the specified attribute criteria.
I'm new to Plotly, but I'm able to create static plots with matplotlib by filtering df2 based on the attributes I want, making a list of the User IDs after filter, and reindexing df1 with the list of filtered User IDs:
filtered_users = df2.loc[(df2[Attribute1] == 'yellow'), 'User'].to_list()
df1 = df1.reindex(filtered_users, axis=1)
While this works, I'm not sure if the code is that efficient, and I'd like to be able to explore the data interactively, hence the move to Plotly.
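For reference, a rough sketch of the kind of callback this might become. The component IDs, layout, and single-attribute dropdown are placeholders, and df1/df2 are assumed to be already loaded as shown above (with TimeStamp as a regular column; reset_index() first if it is the index).
import dash
from dash import dcc, html, Input, Output
import plotly.express as px

app = dash.Dash(__name__)

app.layout = html.Div([
    dcc.Dropdown(
        id='attr1-value',  # placeholder component ID
        options=[{'label': v, 'value': v} for v in df2['Attribute1'].unique()],
        value='yellow',
    ),
    dcc.Graph(id='ts-graph'),
])

@app.callback(Output('ts-graph', 'figure'), Input('attr1-value', 'value'))
def update_graph(attr1_value):
    # Same filtering as the matplotlib approach, reused inside the callback
    users = df2.loc[df2['Attribute1'] == attr1_value, 'User'].tolist()
    subset = df1[['TimeStamp'] + users]
    long_df = subset.melt(id_vars='TimeStamp', var_name='User', value_name='value')
    return px.line(long_df, x='TimeStamp', y='value', color='User')

if __name__ == '__main__':
    app.run(debug=True)  # app.run_server(debug=True) on older Dash versions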

Take single column from multiple CSV files and place them as new columns in dataframe

I have multiple CSV files with the same column headers that look like this:
| Date & Time | Rain | Flow |
| --------------------- | ----- | ---------- |
| 3/19/2018 12:00 | 0 | 0.51 |
| 3/19/2018 13:00 | 2 | 0.51 |
...
I want to take the 'Flow' column from each CSV and place them side by side, aligned by date. The issue I am facing is that the Date & Time values in each CSV are different; I want to align the columns by date, and if there is no value for a given date when I merge, I want to leave an empty space or NaN.
I created a new dataframe that has a range of dates that encapsulates all the dates found in the list of CSVs, but I am unable to merge the columns accordingly.
The final dataframe would look something like
| Date & Time | CSV 1 Flow | CSV 2 Flow | CSV 3 Flow |
| --------------------- | ---------------- | ---------------- | ---------------- |
| 3/19/2018 12:00 | 0.51 | NaN | 0.34 |
| 3/19/2018 13:00 | 0.51 | NaN | 0.47 |
...
What I tried so far looks like:
csv_files = glob.glob(os.path.join(pwd, "*.csv"))
range = pd.date_range('2017-01-01', periods=45985, freq='H')
df_full = pd.DataFrame({'Date & Time': range})
for j in csv_files:
    df_full[j] = ''
    df_hourly = pd.read_csv(j, usecols=['Date & Time', 'Flow'])
    df_merged = pd.merge(df_full, df_hourly, on='Date & Time', how='left')
I have gotten the code to look like:
range = pd.date_range('2017-01-01', periods=45985, freq='H')
df_full = pd.DataFrame({'Date & Time': range})
for filename in csv_files:
    df_full[filename] = ''
    df = pd.read_csv(filename, header=0, parse_dates=['Date & Time'],
                     usecols=['Date & Time', 'Flow'])
    df_combined = pd.merge(left=df_full, right=df, on='Date & Time', how='outer')
df_combined
Which gives an output DF that looks like
| Date & Time | CSV 1 Filepath | CSV 2 Filepath |... | - Flow- |
| --------------------- | ---------------- | ---------------- |... | ------- |
| 01/01/2017 00:00 | BLANK | BLANK |... | 0.34 |
| 01/01/2017 01:00 | BLANK | BLANK |... | 0.25 |
...
The entire table is blank except for the last column which is labeled 'Flow'. It seems that the script is not putting the values in the correct column.
Try something like this:
df1 = pd.read_csv('example.csv', parse_dates=['Date & Time'])
df2 = pd.read_csv('example.csv', parse_dates=['Date & Time'])
df_all = df1.merge(df2, on='Date & Time', how='left')
print(df_all)
Output:
Date & Time Rain_x Flow_x Rain_y Flow_y
0 2018-03-19 12:00:00 0 0.51 0 0.51
1 2018-03-19 13:00:00 2 0.51 2 0.51
Your loop will look approximately like this:
csv_files = glob.glob(os.path.join(pwd, "*.csv"))
df_all = pd.read_csv(csv_files[0], parse_dates=['Date & Time'], usecols=['Date & Time','Flow'])
for file in csv_files[1:]:
    df = pd.read_csv(file, parse_dates=['Date & Time'], usecols=['Date & Time', 'Flow'])
    df_all = df_all.merge(df, on='Date & Time', how='left')
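One possible refinement (a sketch, not part of the original answer): rename each file's Flow column to the file name before combining, so the merged columns stay distinguishable instead of ending up as Flow_x / Flow_y.
import glob
import os
import pandas as pd

csv_files = glob.glob(os.path.join(pwd, "*.csv"))  # pwd as in the question
frames = []
for file in csv_files:
    df = pd.read_csv(file, parse_dates=['Date & Time'], usecols=['Date & Time', 'Flow'])
    name = os.path.splitext(os.path.basename(file))[0]
    frames.append(df.rename(columns={'Flow': name + ' Flow'}).set_index('Date & Time'))

# Align on the date index; dates missing from a file become NaN
df_all = pd.concat(frames, axis=1).sort_index()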

Python Pandas - How to compare values from two columns of a dataframe to another Dataframe columns?

I have two dataframes that I need to compare on two columns, based on a condition, and then print the output. For example:
df1:
| ID | Date | value |
| 248 | 2021-10-30| 4.5 |
| 249 | 2021-09-21| 5.0 |
| 100 | 2021-02-01| 3,2 |
df2:
| ID | Date | value |
| 245 | 2021-12-14| 4.5 |
| 246 | 2021-09-21| 5.0 |
| 247 | 2021-10-30| 3,2 |
| 248 | 2021-10-30| 3,1 |
| 249 | 2021-10-30| 2,2 |
| 250 | 2021-10-30| 6,3 |
| 251 | 2021-10-30| 9,1 |
| 252 | 2021-10-30| 2,0 |
I want to write code that compares the ID and Date columns between the two dataframes, with conditions like the below:
if "ID and date is matching from df1 to df2": print(df1['compare'] = 'Both matching')
if "ID is matching and date is not matching from df1 to df2" : print(df1['compare'] = 'Date not matching')
if "ID is Not matching from df1 to df2" : print(df1['compare'] = 'ID not available')
My result df1 should look like below:
df1 (expected result):
| ID | Date | value | compare
| 248 | 2021-10-30| 4.5 | Both matching
| 249 | 2021-09-21| 5.0 | Id matching - Date not matching
| 100 | 2021-02-01| 3,2 | Id not available
How can I do this with a Python pandas dataframe?
What I suggest you do is use iterrows. It might not be the best idea, but it can still solve your problem:
compareColumn = []
for index, row in df1.iterrows():
    df2Row = df2[df2["ID"] == row["ID"]]
    if df2Row.shape[0] == 0:
        compareColumn.append("ID not available")
    else:
        check = False
        for jndex, row2 in df2Row.iterrows():
            if row2["Date"] == row["Date"]:
                compareColumn.append("Both matching")
                check = True
                break
        if check == False:
            compareColumn.append("Date not matching")
df1["compare"] = compareColumn
df1
Output
    ID        Date value            compare
0  248  2021-10-30   4.5      Both matching
1  249  2021-09-21     5  Date not matching
2  100  2021-02-01   3.2   ID not available
Suppose the 'ID' column is the index; then we can do it like this:
def f(x):
    if x.name in df2.index:
        return 'Both matching' if x['Date'] == df2.loc[x.name, 'Date'] else 'Date not matching'
    return 'ID not available'

df1 = df1.assign(compare=df1.apply(f, axis=1))
print(df1)
Date value compare
ID
248 2021-10-30 4.5 Both matching
249 2021-09-21 5.0 Date not matching
100 2021-02-01 3,2 ID not available
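A vectorized alternative (a sketch, not from the answers above, with ID as a regular column as in the original tables, and assuming IDs are unique in df2 as in the sample data): merge on ID and build the label with numpy.select.
import numpy as np

# Left-merge keeps one row per df1 row because df2 IDs are unique (assumption)
merged = df1.merge(df2[['ID', 'Date']], on='ID', how='left', suffixes=('', '_df2'))
df1['compare'] = np.select(
    [merged['Date_df2'].isna(),                # no matching ID in df2
     merged['Date'].eq(merged['Date_df2'])],   # ID and Date both match
    ['ID not available', 'Both matching'],
    default='Date not matching')
print(df1)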

Left join two dateframes with date columns on a range of dates?

Here's an example that fulfills the criteria. Dataframes df1 and df2 both have id columns and date columns. In df1 the id column is unique, while in df2 it is non-unique. I'd like to create a new dataframe where the join happens if df1.id == df2.id and some pair of dates in df2 is within 1 week of the date in df1.
df1
| customer_id (unique) | 'purchase_date'|
| -------- | -------------- |
| 1 | 2021-05-14 |
| 2 | 2021-09-16 |
df2
| customer_id | 'visit_dates' |
| -------- | -------------- |
| 1 | 2021-05-11 |
| 1 | 2021-05-16 |
| 1 | 2021-05-21 |
| 2 | 2021-07-14 |
| 2 | 2021-09-17 |
# New Df will only have 1 row.
# For customer 1 there is a date within the range(05-07 -> 05-14)
# and within the range(05-14 -> 05-21) with matching id.
# For customer 2, there are no dates within (09-09 -> 09-16) so it should be filtered
newdf
| customer_id (unique) | 'purchase_date'| begin_date_range | end_date_range
| -------- | -------------- | ---------------- | -------------
| 1 | 2021-05-14 | 2021-05-11 | 2021-05-16
I understand how to do this in SQL, but I don't know what functions allow similar date predicate filtering in Pandas.
To construct df1 and df2:
data1 = {'customer_id': [1, 2], 'purchase_date': ['2021-05-14', '2021-09-16']}
data2 = {'customer_id': [1, 1, 1, 2, 2],
         'visit_dates': ['2021-05-11', '2021-05-16', '2021-05-21', '2021-07-14', '2021-09-17']}
Building on @Raymond Kwok's excellent answer: we could merge twice, once to left merge df1 to df2, and then to merge the "begin_date_range" part with the "end_date_range" part.
merged = df1.merge(df2, on='customer_id', how='left')
merged['purchase_date'] = pd.to_datetime(merged['purchase_date'])
merged['visit_dates'] = pd.to_datetime(merged['visit_dates'])
day_diff = merged['purchase_date'].sub(merged['visit_dates']).dt.days
out = (merged[day_diff.between(0, 6)]
       .merge(merged[day_diff.between(-6, 0)], on=['customer_id', 'purchase_date'])
       .rename(columns={'visit_dates_x': 'begin_date_range', 'visit_dates_y': 'end_date_range'}))
Output:
customer_id purchase_date begin_date_range end_date_range
0 1 2021-05-14 2021-05-11 2021-05-16
Filter the merge result using the week condition.
df = df1.merge(df2, on='customer_id', how='left')
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
df['visit_dates'] = pd.to_datetime(df['visit_dates'])
df = df[(df['purchase_date'] - df['visit_dates']).dt.days.between(0, 6)]
df['visit_dates+1week'] = df['visit_dates'] + pd.Timedelta(days=6)

How can I check for matching values in a second dataframe, then return a value from a column in the second dataframe?

I have two dataframes. One contains a list of the most recent meeting for each customer. The second is a list of statuses that each customer has been recorded with, and their start date and end date.
I want to look up a customer and meeting date, and find out what status they were at when the meeting occurred.
What I think this will involve is creating a new column in my meeting dataframe that checks the rows of the statuses dataframe for a matching customer ID, then checks if the date from the first dataframe is between two dates in the second. If it is, the calculated column will take its value from the second dataframe's status column.
My dataframes are:
meeting
| CustomerID | MeetingDate |
|------------|-------------|
| 70704 | 2019-07-23 |
| 70916 | 2019-09-04 |
| 72712 | 2019-04-16 |
statuses
| CustomerID | Status | StartDate | EndDate |
|------------|--------|------------|------------|
| 70704 | First | 2019-04-01 | 2019-06-30 |
| 70704 | Second | 2019-07-01 | 2019-08-25 |
| 70916 | First | 2019-09-01 | 2019-10-13 |
| 72712 | First | 2019-03-15 | 2019-05-02 |
So, I think I want to take meeting.CustomerID and find a match in statuses.CustomerID. I then want to check if meeting.MeetingDate is between statuses.StartDate and statuses.EndDate. If it is, I want to return statuses.Status from the matching row, if not, ignore that row and move to the next to see if that matches the criteria and return the Status as described.
The final result should look like:
| CustomerID | MeetingDate | Status |
|------------|-------------|--------|
| 70704 | 2019-07-23 | Second |
| 70916 | 2019-09-04 | First |
| 72712 | 2019-04-16 | First |
I'm certain there must be a neater and more streamlined way to do this than what I've suggested, but I'm still learning the ins and outs of python and pandas and would appreciate if someone could point me in the right direction.
This should work. If the rows are not already sorted by CustomerID and Status, that sorting can easily be done first. This assumes your dates are already a datetime type. Here, df2 refers to the dataframe whose columns are CustomerID, Status, StartDate, and EndDate.
import numpy as np
df2 = df2[::-1]
row_arr = np.unique(df2.CustomerID, return_index = True)[1]
df2 = df2.iloc[row_arr, :].drop(['StartDate', 'EndDate'], axis = 1)
final = pd.merge(df1, df2, how = 'inner', on = 'CustomerID')
I managed to wrangle something that works for me:
df = statuses.merge(meetings, on='CustomerID')
df = df[(df['MeetingDate'] >= df['StartDate']) & (df['MeetingDate'] <= df['EndDate'])].reset_index(drop=True)
Gives:
| CustomerID | Status | StartDate | EndDate | MeetingDate |
|------------|--------|------------|------------|-------------|
| 70704 | Second | 2019-01-21 | 2019-07-28 | 2019-07-23 |
| 70916 | First | 2019-09-04 | 2019-10-21 | 2019-09-04 |
| 72712 | First | 2019-03-19 | 2019-04-17 | 2019-04-16 |
And I can just drop the now unneeded columns.
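Another option worth noting (a sketch, not part of the original answers, using the same frame names as the answer above and assuming the date columns are datetime dtype): merge_asof picks, for each meeting, the most recent status that started on or before the meeting date, which can then be checked against EndDate.
import pandas as pd

# merge_asof requires both frames to be sorted on their date keys
meetings_sorted = meetings.sort_values('MeetingDate')
statuses_sorted = statuses.sort_values('StartDate')

out = pd.merge_asof(meetings_sorted, statuses_sorted,
                    left_on='MeetingDate', right_on='StartDate',
                    by='CustomerID', direction='backward')
# Keep the match only if the meeting also falls on or before the status end date
out = out.loc[out['MeetingDate'] <= out['EndDate'],
              ['CustomerID', 'MeetingDate', 'Status']]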
