I am trying to merge multiple dataframes into one main dataframe, using the DateTime index and id from the main dataframe and the DateTime and id columns from the other dataframes.
Main dataframe
DateTime | id | data
(Df.Index)
---------|----|------
2017-9-8 | 1 | a
2017-9-9 | 2 | b
df1
id | data1 | data2 | DateTime
---|-------|-------|---------
1 | a | c | 2017-9-8
2 | b | d | 2017-9-9
5 | a | e | 2017-9-20
df2
id | data3 | data4 | DateTime
---|-------|-------|---------
1 | d | c | 2017-9-8
2 | e | a | 2017-9-9
4 | f | h | 2017-9-20
The main dataframe and the other dataframes are stored in different dictionaries. I want to read from each dictionary and merge whenever the join condition (DateTime, id) is met:
for sleep in dictOfSleep:  # main dataframes
    for sensorDevice in dictOfSensor:  # other dataframes
        try:
            dictOfSleep[sleep] = pd.merge(dictOfSleep[sleep], dictOfSensor[sensorDevice],
                                          how='outer', on=['DateTime', 'id'])
        except Exception:
            print('Join could not be done')
Desired Output:
DateTime | id | data | data1 | data2 | data3 | data4
(Df.Index)
---------|----|------|-------|-------|-------|-------|
2017-9-8 | 1 | a | a | c | d | c |
2017-9-9 | 2 | b | b | d | e | a |
I'm not sure how your dictionaries are set up, so you will most likely need to modify this, but I'd try something like:
for sensorDevice in dictOfSensor:
    df = dictOfSensor[sensorDevice]
    # set the df index to match the main_df index
    df = df.set_index(['DateTime'])
    # use join (not merge) when combining on the index
    main_df = main_df.join(df, how='outer')
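Note that joining on DateTime alone ignores id. If both keys matter, a variant of the same loop (a sketch, assuming main_df is indexed by DateTime and every frame carries an id column) can join on a (DateTime, id) MultiIndex:
main_df = main_df.reset_index().set_index(['DateTime', 'id'])
for sensorDevice in dictOfSensor:
    # index each sensor frame on both keys so the join condition is (DateTime, id)
    df = dictOfSensor[sensorDevice].set_index(['DateTime', 'id'])
    main_df = main_df.join(df, how='outer')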
Alternatively, if the id column is important, you can first reset the main_df index and then merge on both columns:
main_df = main_df.reset_index()
for sensorDevice in dictOfSensor:
    df = dictOfSensor[sensorDevice]
    # merge on both columns
    main_df = main_df.merge(df, how='outer', on=['DateTime', 'id'])
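A minimal end-to-end sketch of that second approach, with hypothetical dictionaries standing in for dictOfSleep and dictOfSensor:
import pandas as pd

# hypothetical stand-ins for the question's dictionaries
dictOfSleep = {'user1': pd.DataFrame({'DateTime': ['2017-9-8', '2017-9-9'],
                                      'id': [1, 2],
                                      'data': ['a', 'b']}).set_index('DateTime')}
dictOfSensor = {'dev1': pd.DataFrame({'id': [1, 2, 5],
                                      'data1': ['a', 'b', 'a'],
                                      'data2': ['c', 'd', 'e'],
                                      'DateTime': ['2017-9-8', '2017-9-9', '2017-9-20']})}

for sleep in dictOfSleep:
    main_df = dictOfSleep[sleep].reset_index()  # DateTime back to a column
    for sensorDevice in dictOfSensor:
        main_df = main_df.merge(dictOfSensor[sensorDevice],
                                how='outer', on=['DateTime', 'id'])
    dictOfSleep[sleep] = main_df.set_index('DateTime')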
Here's an example of the criteria. Dataframes df1 and df2 both have id and date columns. In df1 the id column is unique, while in df2 it is non-unique. I'd like to create a new dataframe where the join happens if df1.id == df2.id and some pair of dates in df2 is within 1 week of the date in df1 (one visit in the week before the purchase and one in the week after).
df1
| customer_id (unique) | purchase_date |
| -------------------- | ------------- |
| 1 | 2021-05-14 |
| 2 | 2021-09-16 |
df2
| customer_id | visit_dates |
| ----------- | ----------- |
| 1 | 2021-05-11 |
| 1 | 2021-05-16 |
| 1 | 2021-05-21 |
| 2 | 2021-07-14 |
| 2 | 2021-09-17 |
# New Df will only have 1 row.
# For customer 1 there is a date within the range(05-07 -> 05-14)
# and within the range(05-14 -> 05-21) with matching id.
# For customer 2, there are no dates within (09-09 -> 09-16), so it should be filtered out.
newdf
| customer_id (unique) | purchase_date | begin_date_range | end_date_range |
| -------------------- | ------------- | ---------------- | -------------- |
| 1 | 2021-05-14 | 2021-05-11 | 2021-05-16
I understand how to do this in SQL, but I don't know what functions allow similar date predicate filtering in Pandas.
To construct df1 and df2:
import pandas as pd

data1 = {'customer_id': [1, 2], 'purchase_date': ['2021-05-14', '2021-09-16']}
data2 = {'customer_id': [1, 1, 1, 2, 2],
         'visit_dates': ['2021-05-11', '2021-05-16', '2021-05-21', '2021-07-14', '2021-09-17']}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
Building on @Raymond Kwok's excellent answer: we can merge twice. First left-merge df1 to df2, then merge the "begin_date_range" candidates (visits in the week before the purchase) with the "end_date_range" candidates (visits in the week after). Only purchases that have a visit on both sides survive the second merge, which is why customer 2 drops out.
merged = df1.merge(df2, on='customer_id', how='left')
merged['purchase_date'] = pd.to_datetime(merged['purchase_date'])
merged['visit_dates'] = pd.to_datetime(merged['visit_dates'])
day_diff = merged['purchase_date'].sub(merged['visit_dates']).dt.days
out = (merged[day_diff.between(0,6)]
.merge(merged[day_diff.between(-6,0)], on=['customer_id','purchase_date'])
.rename(columns={'visit_dates_x': 'begin_date_range', 'visit_dates_y': 'end_date_range'}))
Output:
customer_id purchase_date begin_date_range end_date_range
0 1 2021-05-14 2021-05-11 2021-05-16
Filter the merge result using the week condition (the date columns need to be datetime dtype first):
df = df1.merge(df2, on='customer_id', how='left')
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
df['visit_dates'] = pd.to_datetime(df['visit_dates'])
df = df[(df['purchase_date'] - df['visit_dates']).dt.days.between(0, 6)]
# or make the end of the window an explicit column
df['visit_dates+1week'] = df['visit_dates'] + pd.Timedelta(days=6)
I have two dataframes with matching and non-matching timestamps. I want to join them so that the new dataframe contains the timestamps from both, and missing data from either dataframe is filled with that dataframe's previous value. I want to analyze the two data sets by comparing their values at any exact timestamp (moment).
DF1
| Data1 | Timestamp1 |
| ----- | ---------- |
| A | 1623974400000|
| B | 1623974400200|
| C | 1623974400200|
| D | 1623974400400|
DF2
| Data2 | Timestamp2 |
| ----- | ---------- |
| M | 1623974400000|
| N | 1623974400100|
| O | 1623974400200|
| P | 1623974400500|
Output:
DF3
| Data1 | Data2 | Timestamp |
| ----- | ----- | --------- |
| A | M | 1623974400000|
| A | N | 1623974400100|
| B | O | 1623974400200|
| C | O | 1623974400200|
| D | O | 1623974400400|
| D | P | 1623974400500|
As I see it, this is just an outer merge, then sort_values and a forward fill.
Code below
Rename columns
DF1.rename(columns={'Timestamp1':'Timestamp'}, inplace=True)
DF2.rename(columns={'Timestamp2':'Timestamp'}, inplace=True)
merge
pd.merge(DF1,DF2, on='Timestamp', how='outer').sort_values(by='Timestamp').fillna(method='ffill')
outcome
Data1 Timestamp Data2
0 A 1623974400000 M
4 A 1623974400100 N
1 B 1623974400200 O
2 C 1623974400200 O
3 D 1623974400400 O
5 D 1623974400500 P
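As an aside, fillna(method='ffill') is deprecated in recent pandas versions; the same pipeline with DataFrame.ffill (assuming the columns were renamed as above):
DF3 = (pd.merge(DF1, DF2, on='Timestamp', how='outer')
       .sort_values(by='Timestamp')
       .ffill())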
You can use pd.merge_asof:
# sort dfs by timestamp:
df1 = df1.sort_values(by="Timestamp1")
df2 = df2.sort_values(by="Timestamp2")
x = pd.merge_asof(df1, df2, left_on="Timestamp1", right_on="Timestamp2")
y = pd.merge_asof(df2, df1, left_on="Timestamp2", right_on="Timestamp1")
df_out = pd.concat([x, y]).drop_duplicates()
df_out["Timestamp"] = df_out[["Timestamp1", "Timestamp2"]].max(axis=1)
print(df_out[["Data1", "Data2", "Timestamp"]])
Prints:
Data1 Data2 Timestamp
0 A M 1623974400000
1 B O 1623974400200
2 C O 1623974400200
3 D O 1623974400400
1 A N 1623974400100
3 D P 1623974400500
I have two dataframes with the same index. I would like to add a column to one of them based on an equation that needs the value from the row of the other dataframe with the same index.
Using
df2['B'].loc[df2['Date'] == df1['Date']]
I get the 'Can only compare identically-labeled Series objects' error.
df1
+--------+---+
| Index  | A |
+--------+---+
| 3-2-20 | 3 |
| 4-2-20 | 1 |
| 5-2-20 | 3 |
+--------+---+
df2
+--------+---+
| Index  | A |
+--------+---+
| 1-2-20 | 2 |
| 2-2-20 | 4 |
| 3-2-20 | 3 |
| 4-2-20 | 1 |
| 5-2-20 | 3 |
+--------+---+
What I want is df1['B'] = 1 + df2['A'].loc[df2['Date'] == df1['Date']]. The index is a date, but in my real df I also have a column called Date with the same values.
df1 desired
+--------+---+---+
| Index  | A | B |
+--------+---+---+
| 3-2-20 | 3 | 4 |
| 4-2-20 | 1 | 2 |
| 5-2-20 | 3 | 4 |
+--------+---+---+
This should work. If not, just play with the column names, because they are similar in both tables. A_y is the df2['A'] column, auto-renamed by merge because both frames have a column named A:
df1['B']=df1.merge(df2, left_index=True, right_index=True)['A_y']+1
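A variant that avoids the merge and the auto-renamed column (a sketch, assuming both frames share the same date index):
# align df2['A'] to df1's index, then apply the equation
df1['B'] = df2['A'].reindex(df1.index) + 1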
I guess for now I will have to settle for cutting a clone of df2 down to the indexes of df1:
dfc = df2.copy()
t = list(df1['Date'])
dfc = dfc.loc[dfc['Date'].isin(t)]
df1['B'] = 1 + dfc['A']  # the assignment aligns on the shared index
I'm trying to fill a column of a dataframe from another dataframe based on conditions. Let's say my first dataframe is df1 and the second is named df2.
# df1 is described as below:
+------+------+
| Col1 | Col2 |
+------+------+
| A | 1 |
| B | 2 |
| C | 3 |
| A | 1 |
+------+------+
And
# df2 is described as below:
+------+------+
| Col1 | Col2 |
+------+------+
| A | NaN |
| B | NaN |
| D | NaN |
+------+------+
Each distinct value of Col1 has its own id number (in Col2), so what I want is to fill the NaN values in df2.Col2 where df2.Col1 matches df1.Col1.
So that my second dataframe will look like :
# df2 :
+------+------+
| Col1 | Col2 |
+------+------+
| A | 1 |
| B | 2 |
| D | NaN |
+------+------+
I'm using Python 2.7
Use drop_duplicates with set_index and combine_first:
df = df2.set_index('Col1').combine_first(df1.drop_duplicates().set_index('Col1')).reset_index()
If you need to check duplicates only in Col1:
df = df2.set_index('Col1').combine_first(df1.drop_duplicates('Col1').set_index('Col1')).reset_index()
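Note that combine_first returns the union of both indexes, so row C from df1 also appears in the result. An alternative sketch using Series.map keeps exactly df2's rows:
# build a lookup Series from df1 and fill df2's NaNs from it
lookup = df1.drop_duplicates('Col1').set_index('Col1')['Col2']
df2['Col2'] = df2['Col2'].fillna(df2['Col1'].map(lookup))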
Here is a solution with the filter df1.Col1 == df2.Col1 (note that this elementwise comparison only works when both frames have the same length and index):
df2['Col2'] = df1[df1.Col1 == df2.Col1]['Col2']
It is even better to use loc (but less clear from my point of view):
df2['Col2'] = df1.loc[df1.Col1 == df2.Col1, 'Col2']
Here is a pandas.DataFrame df.
| Foo | Bar |
|-----|-----|
| 0 | A |
| 1 | B |
| 2 | C |
| 3 | D |
| 4 | E |
I selected some rows and defined a new dataframe by df1 = df.iloc[[1,3],:].
| Foo | Bar |
|-----|-----|
| 1 | B |
| 3 | D |
What is the best way to get the rest of df, like the following?
| Foo | Bar |
|-----|-----|
| 0 | A |
| 2 | C |
| 4 | E |
Fast set-based diffing.
df2 = df.loc[df.index.difference(df1.index)]
df2
Foo Bar
0 0 A
2 2 C
4 4 E
Works as long as your index values are unique.
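If df1 was sliced directly from df, so every label in df1.index also exists in df.index, drop achieves the same thing:
# drop the selected labels from df
df2 = df.drop(df1.index)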
If I'm understanding correctly, you want to take a dataframe, select some rows from it and store those in a variable df2, and then select rows in df that are not in df2.
If that's the case, you can do df[~df.isin(df2)].dropna().
df[ x ] subsets the dataframe df based on the condition x
~df.isin(df2) is the negation of df.isin(df2), which evaluates to True for the cells of df that also appear (at the same labels) in df2; after negation, the rows belonging to df2 are masked out.
.dropna() drops rows with a NaN value. In this case the rows we don't want were coerced to NaN in the filtering expression above, so we get rid of those.
I assume that Foo can be treated as a unique index.
First select Foo values from df1:
idx = df1['Foo'].values
Then filter your original dataframe:
df2 = df[~df['Foo'].isin(idx)]
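A quick end-to-end check with the sample frame (a sketch; Foo is assumed unique):
import pandas as pd

df = pd.DataFrame({'Foo': range(5), 'Bar': list('ABCDE')})
df1 = df.iloc[[1, 3], :]

idx = df1['Foo'].values
df2 = df[~df['Foo'].isin(idx)]
print(df2)
#    Foo Bar
# 0    0   A
# 2    2   C
# 4    4   E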