Pandas - Join two Dataframes based on timestamp insert in between dates [duplicate] - python

I have two dataframes with partially matching timestamps. I want to join them so that the new dataframe contains the timestamps from both, and wherever one dataframe has no row at a given timestamp, its columns are filled with that dataframe's most recent previous value. The goal is to compare the two data sets' values at any exact moment in time.
DF1
| Data1 | Timestamp1    |
| ----- | ------------- |
| A     | 1623974400000 |
| B     | 1623974400200 |
| C     | 1623974400200 |
| D     | 1623974400400 |
DF2
| Data2 | Timestamp2    |
| ----- | ------------- |
| M     | 1623974400000 |
| N     | 1623974400100 |
| O     | 1623974400200 |
| P     | 1623974400500 |
Output:
DF3
| Data1 | Data2 | Timestamp     |
| ----- | ----- | ------------- |
| A     | M     | 1623974400000 |
| A     | N     | 1623974400100 |
| B     | O     | 1623974400200 |
| C     | O     | 1623974400200 |
| D     | O     | 1623974400400 |
| D     | P     | 1623974400500 |

As I see it, this is just an outer merge, then sort_values and a forward fill. Code below.
Rename the timestamp columns so they match:
DF1.rename(columns={'Timestamp1': 'Timestamp'}, inplace=True)
DF2.rename(columns={'Timestamp2': 'Timestamp'}, inplace=True)
Merge, sort, and forward-fill (.ffill() is the non-deprecated spelling of fillna(method='ffill')):
pd.merge(DF1, DF2, on='Timestamp', how='outer').sort_values(by='Timestamp').ffill()
Outcome:
Data1 Timestamp Data2
0 A 1623974400000 M
4 A 1623974400100 N
1 B 1623974400200 O
2 C 1623974400200 O
3 D 1623974400400 O
5 D 1623974400500 P
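For reference, a minimal runnable version of this approach, with the sample frames constructed from the question's data (column names already renamed to Timestamp):

import pandas as pd

DF1 = pd.DataFrame({'Data1': ['A', 'B', 'C', 'D'],
                    'Timestamp': [1623974400000, 1623974400200, 1623974400200, 1623974400400]})
DF2 = pd.DataFrame({'Data2': ['M', 'N', 'O', 'P'],
                    'Timestamp': [1623974400000, 1623974400100, 1623974400200, 1623974400500]})

# The outer merge keeps every timestamp from both frames; sorting puts them
# in chronological order, and the forward fill carries each frame's previous
# value into the rows where it had no observation.
DF3 = (pd.merge(DF1, DF2, on='Timestamp', how='outer')
         .sort_values(by='Timestamp')
         .ffill())
print(DF3)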

You can use pd.merge_asof:
# sort dfs by timestamp:
df1 = df1.sort_values(by="Timestamp1")
df2 = df2.sort_values(by="Timestamp2")
x = pd.merge_asof(df1, df2, left_on="Timestamp1", right_on="Timestamp2")
y = pd.merge_asof(df2, df1, left_on="Timestamp2", right_on="Timestamp1")
df_out = pd.concat([x, y]).drop_duplicates()
df_out["Timestamp"] = df_out[["Timestamp1", "Timestamp2"]].max(axis=1)
print(df_out[["Data1", "Data2", "Timestamp"]])
Prints:
Data1 Data2 Timestamp
0 A M 1623974400000
1 B O 1623974400200
2 C O 1623974400200
3 D O 1623974400400
1 A N 1623974400100
3 D P 1623974400500
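For context: pd.merge_asof with its default direction='backward' matches each row of the left frame to the most recent row of the right frame at or before its timestamp, which is why this answer runs the merge in both directions and concatenates the results. If you want to reproduce it, a sketch of the sample frames used above (built from the question's data):

import pandas as pd

df1 = pd.DataFrame({'Data1': ['A', 'B', 'C', 'D'],
                    'Timestamp1': [1623974400000, 1623974400200, 1623974400200, 1623974400400]})
df2 = pd.DataFrame({'Data2': ['M', 'N', 'O', 'P'],
                    'Timestamp2': [1623974400000, 1623974400100, 1623974400200, 1623974400500]})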

Related

How do you generate a date range and have that range appended to every row of a dataframe?

I know how to generate a daterange using this code:
pd.date_range(start='2022-10-16', end='2022-10-19')
How do I take the date range above and repeat it for every location in the dataframe below?
+----------+
| Location |
+----------+
| A |
| B |
| C |
+----------+
This is the result I want.
+----------+------------+
| Location | Date |
+----------+------------+
| A | 2022/10/16 |
| A | 2022/10/17 |
| A | 2022/10/18 |
| A | 2022/10/19 |
| B | 2022/10/16 |
| B | 2022/10/17 |
| B | 2022/10/18 |
| B | 2022/10/19 |
| C | 2022/10/16 |
| C | 2022/10/17 |
| C | 2022/10/18 |
| C | 2022/10/19 |
+----------+------------+
I have spent the whole day figuring this out. Any help would be appreciated!
You can cross join your date range and dataframe to get your desired result:
import pandas as pd

df = pd.DataFrame({'Location': ['A', 'B', 'C']})  # sample data from the question
date_range = (pd.date_range(start='2022-10-16', end='2022-10-19')
              .rename('Date')
              .to_series())
df = df.merge(date_range, how='cross')
print(df)
Output:
Location Date
0 A 2022-10-16
1 A 2022-10-17
2 A 2022-10-18
3 A 2022-10-19
4 B 2022-10-16
5 B 2022-10-17
6 B 2022-10-18
7 B 2022-10-19
8 C 2022-10-16
9 C 2022-10-17
10 C 2022-10-18
11 C 2022-10-19
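Note that how='cross' requires pandas 1.2 or newer. On older versions, a common workaround (a sketch using a temporary constant join key) is:

# Instead of the how='cross' merge above: emulate the cross join by
# merging on a constant key, then dropping it
df = (df.assign(key=1)
        .merge(date_range.to_frame().reset_index(drop=True).assign(key=1), on='key')
        .drop(columns='key'))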
You seem to be looking for the Cartesian product of two iterables, which is exactly what itertools.product computes.
In your case, you can try:
import pandas as pd
from itertools import product
# Test data:
df = pd.DataFrame(['A', 'B', 'C'], columns=['Location'])
dr = pd.date_range(start='2022-10-16', end='2022-10-19')
# Create the cartesian product:
res_df = pd.DataFrame(product(df['Location'], dr), columns=['Location', 'Date'])
print(res_df)

Select value from other dataframe where index is equal

I have two dataframes with the same index. I would like to add a column to one of them, computed from the corresponding row of the other dataframe (the row with the same index).
Using
df2['B'].loc[df2['Date'] == df1['Date']]
I get the 'Can only compare identically-labeled Series objects' error.
df1
+--------+---+
| Index  | A |
+--------+---+
| 3-2-20 | 3 |
| 4-2-20 | 1 |
| 5-2-20 | 3 |
+--------+---+
df2
+--------+---+
| Index  | A |
+--------+---+
| 1-2-20 | 2 |
| 2-2-20 | 4 |
| 3-2-20 | 3 |
| 4-2-20 | 1 |
| 5-2-20 | 3 |
+--------+---+
What I am effectively trying to do is df1['B'] = 1 + df2['A'].loc[df2['Date'] == df1['Date']]. The index is a date, but in my real dataframe I also have a column called Date with the same values.
df1 desired
+--------+---+---+
| Index  | A | B |
+--------+---+---+
| 3-2-20 | 3 | 4 |
| 4-2-20 | 1 | 2 |
| 5-2-20 | 3 | 4 |
+--------+---+---+
This should work. If not, adjust the column names, since they are similar in both tables; A_y is df2's A column, renamed automatically by the merge because both frames have a column called A.
df1['B'] = df1.merge(df2, left_index=True, right_index=True)['A_y'] + 1
I guess for now I will have to settle for a cut-down clone of df2, restricted to the indexes of df1:
dfc = df2
t = list(df1['Date'])
dfc = dfc.loc[dfc['Date'].isin(t)]
df1['B'] = 1 + dfc['A']
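Since both frames share the same date index, plain index alignment also works here without any merge. A minimal sketch, assuming df1's index is a subset of df2's:

# .loc with df1's index returns a Series aligned to df1's labels,
# so the assignment lines up row for row
df1['B'] = df2.loc[df1.index, 'A'] + 1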

How to count the number of items in a group after using Groupby in Pandas

I have multiple columns in my dataframe, of which I am using two: customer_id and trip_id. I used the groupby function data.groupby(['customer_id','trip_id']). Each customer takes multiple trips. I want to count how many trips each customer took, but when I use an aggregate function together with groupby I get 1 in every row. How should I proceed?
I want something in this format.
Example :
Customer_id | Trip_Id | Count
CustID1     | trip1   | 3
            | trip2   |
            | trip3   |
CustID2     | Trip450 | 2
            | Trip23  |
You can group by customer and count the number of unique trips using the built in nunique:
data.groupby('Customer_id').agg(Count=('Trip_id', 'nunique'))
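A quick runnable check of that pattern, on sample data shaped like the question's (the frame here is made up for illustration):

import pandas as pd

data = pd.DataFrame({'Customer_id': ['CustID1', 'CustID1', 'CustID1', 'CustID2', 'CustID2'],
                     'Trip_id': ['trip1', 'trip2', 'trip3', 'Trip450', 'Trip23']})
print(data.groupby('Customer_id').agg(Count=('Trip_id', 'nunique')))
#              Count
# Customer_id
# CustID1          3
# CustID2          2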
You can use data.groupby(['customer_id', 'trip_id']).count()
Example:
df1 = pd.DataFrame(columns=["c1", "c1a", "c1b"], data=[["x", 2, 3], ["z", 5, 6], ["z", 8, 9]])
print(df1)
# | c1 | c1a | c1b |
# |----|-----|-----|
# | x | 2 | 3 |
# | z | 5 | 6 |
# | z | 8 | 9 |
df2 = df1.groupby("c1").count()
print(df2)
# | | c1a | c1b |
# |----|-----|-----|
# | x | 1 | 1 |
# | z | 2 | 2 |

Pandas Merge multiple dataframes on index and column

I am trying to merge multiple dataframes into one main dataframe, using the datetime index and id of the main dataframe and the DateTime and id columns of the other dataframes.
Main dataframe
DateTime | id | data
(Df.Index)
---------|----|------
2017-9-8 | 1 | a
2017-9-9 | 2 | b
df1
id | data1 | data2 | DateTime
---|-------|-------|---------
1 | a | c | 2017-9-8
2 | b | d | 2017-9-9
5 | a | e | 2017-9-20
df2
id | data3 | data4 | DateTime
---|-------|-------|---------
1 | d | c | 2017-9-8
2 | e | a | 2017-9-9
4 | f | h | 2017-9-20
The main dataframe and the other dataframes are stored in different dictionaries. I want to read from each dictionary and merge whenever the joining condition (DateTime, id) is met:
for sleep in dictOfSleep:  # main dataframes
    for sensorDevice in dictOfSensor:  # other dataframes
        try:
            dictOfSleep[sleep] = pd.merge(dictOfSleep[sleep], dictOfSensor[sensorDevice],
                                          how='outer', on=['DateTime', 'id'])
        except Exception:
            print('Join could not be done')
Desired Output:
DateTime | id | data | data1 | data2 | data3 | data4
(Df.Index)
---------|----|------|-------|-------|-------|-------|
2017-9-8 | 1 | a | a | c | d | c |
2017-9-9 | 2 | b | b | d | e | a |
I'm not sure how your dictionaries are set up, so you will most likely need to modify this, but I'd try something like:
for sensorDevice in dictOfSensor:
    df = dictOfSensor[sensorDevice]
    # set df's index to match the main_df index
    df = df.set_index(['DateTime'])
    # use join (not merge) when combining on the index; a suffix is needed
    # because the frames share an id column
    main_df = main_df.join(df, how='outer', rsuffix='_' + str(sensorDevice))
Alternatively, if the id column is important, you can reset the main_df index first and then merge:
main_df = main_df.reset_index()
for sensorDevice in dictOfSensor:
    df = dictOfSensor[sensorDevice]
    # merge on both columns
    main_df = main_df.merge(df, how='outer', on=['DateTime', 'id'])
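For illustration, a minimal end-to-end sketch of the reset_index-then-merge variant, with frames shaped like the question's sample data (variable names invented here):

import pandas as pd

main_df = pd.DataFrame({'id': [1, 2], 'data': ['a', 'b']},
                       index=pd.to_datetime(['2017-09-08', '2017-09-09']).rename('DateTime'))
df1 = pd.DataFrame({'id': [1, 2, 5],
                    'data1': ['a', 'b', 'a'],
                    'data2': ['c', 'd', 'e'],
                    'DateTime': pd.to_datetime(['2017-09-08', '2017-09-09', '2017-09-20'])})

# Lift the DateTime index into a column so it can participate in the merge
out = main_df.reset_index().merge(df1, how='outer', on=['DateTime', 'id'])
print(out)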

What is the smartest way to get the rest of a pandas.DataFrame?

Here is a pandas.DataFrame df.
| Foo | Bar |
|-----|-----|
| 0 | A |
| 1 | B |
| 2 | C |
| 3 | D |
| 4 | E |
I selected some rows and defined a new dataframe, by df1 = df.iloc[[1,3],:].
| Foo | Bar |
|-----|-----|
| 1 | B |
| 3 | D |
What is the best way to get the rest of df, like the following?
| Foo | Bar |
|-----|-----|
| 0 | A |
| 2 | C |
| 4 | E |
Fast set-based diffing.
df2 = df.loc[df.index.difference(df1.index)]
df2
Foo Bar
0 0 A
2 2 C
4 4 E
Works as long as your index values are unique.
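An equivalent spelling of the same idea, again assuming unique index values, is to drop the selected labels directly:

df2 = df.drop(df1.index)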
If I'm understanding correctly, you want to take a dataframe, select some rows from it and store those in a variable df2, and then select rows in df that are not in df2.
If that's the case, you can do df[~df.isin(df2)].dropna().
df[x] subsets the dataframe df based on the boolean mask x.
~df.isin(df2) is the negation of df.isin(df2), which is True for each cell of df whose value matches df2 at the same index and column label.
.dropna() drops rows containing NaN values. Here, the rows we don't want were coerced to NaN by the masking expression above, so we get rid of them.
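One caveat worth knowing: the boolean masking fills the unwanted cells with NaN, which upcasts numeric columns to float, so after .dropna() you may want to restore the original dtype (the variable name rest is just for illustration):

rest = df[~df.isin(df2)].dropna()
rest['Foo'] = rest['Foo'].astype(int)  # the NaN masking turned Foo into float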
I assume that Foo can be treated as a unique index.
First select Foo values from df1:
idx = df1['Foo'].values
Then filter your original dataframe:
df2 = df[~df['Foo'].isin(idx)]
