How to join two tables in pandas based on time with a delay - Python

I actually have two CSV files, which I load into the dataframes df1 and df2.
When I use the command:
df1=pd.read_csv("path",index_col="created_at",parse_dates=["created_at"])
I get:
index likes ... user_screen_name sentiment
created_at ...
2019-02-27 05:36:29 0 94574 ... realDonaldTrump positive
2019-02-27 05:31:21 1 61666 ... realDonaldTrump negative
2019-02-26 18:08:14 2 151844 ... realDonaldTrump positive
2019-02-26 04:50:37 3 184597 ... realDonaldTrump positive
2019-02-26 04:50:36 4 181641 ... realDonaldTrump negative
... ... ... ... ... ...
When I use the command:
df2=pd.read_csv("path",index_col="created_at",parse_dates=["created_at"])
I get:
Unnamed: 0 Close Open Volume Day
created_at
2019-03-01 00:47:00 0 2784.49 2784.49 NaN STABLE
2019-03-01 00:21:00 1 2784.49 2784.49 NaN STABLE
2019-03-01 00:20:00 2 2784.49 2784.49 NaN STABLE
2019-03-01 00:19:00 3 2784.49 2784.49 NaN STABLE
2019-03-01 00:18:00 4 2784.49 2784.49 NaN STABLE
2019-03-01 00:17:00 5 2784.49 2784.49 NaN STABLE
... ... ... ... ... ...
As you know, when you use the command:
df3=df1.join(df2)
you join the two tables on the index "created_at", matching the exact date and time in both tables.
But I would like the join to allow a delay of, for example, 2 minutes.
For example, instead of:
file df1 file df2
created_at created_at
2019-02-27 05:36:29 2019-02-27 05:36:29
I would like to have the two tables join like this:
file df1 file df2
created_at created_at
2019-02-27 05:36:29 2019-02-27 05:38:29
It is important for my data that the time in df1 is before the time in df2, i.e. that the event in df1 happens before the event in df2.

For small dataframes, Merging two dataframes based on a date between two other dates without a common column contains a nice solution. However, it relies on a Cartesian product of both dataframes, so it will not scale nicely to larger dataframes.
A possible optimization is to add rounded datetime columns to the dataframes and join on those columns. As a join is much more efficient than a Cartesian product, the gain in memory and execution time should be noticeable.
What you want is (pseudo code here):
df1.created_at <= df2.created_at and df2.created_at - df1.created_at <= 2mins
I would add to both dataframes a ref column that floors created_at to its 2-minute slot, i.e. (still pseudo code): created_at.floor('2min')
If lines in both dataframes share the same ref value, their dates fall in the same 2-minute slot and are therefore less than 2 minutes apart. But this alone will not pick up all the expected cases, because dates can be less than 2 minutes apart yet fall into 2 different slots. To cope with that, I suggest adding a ref2 column to df1, defined as ref + 2 minutes, and doing a second join on df2.ref == df1.ref2. That is enough because you want the df1 event to be before the df2 one; otherwise we would also need a third column ref3 = ref - 2 minutes.
Then, as in the referenced answer, we can select the lines actually meeting the requirement and concatenate the two joined dataframes.
Pandas code could be:
# create auxiliary columns: floor each timestamp to its 2-minute slot
df1['ref'] = df1.index.floor('2min')
df1['ref2'] = df1['ref'] + pd.Timedelta(minutes=2)
df2['ref'] = df2.index.floor('2min')
df2.index.name = 'created_at_2'
df2 = df2.reset_index().set_index('ref')

# join on ref and select the lines meeting the time condition
x1 = df1.join(df2, on='ref', how='inner')
x1 = x1.loc[(x1.index <= x1.created_at_2)
            & (x1.created_at_2 - x1.index <= pd.Timedelta(minutes=2))]

# join on ref2 (df2 in the next slot) and select the lines meeting the time condition
x2 = df1.join(df2, on='ref2', how='inner')
x2 = x2.loc[(x2.index <= x2.created_at_2)
            & (x2.created_at_2 - x2.index <= pd.Timedelta(minutes=2))]

# concatenate the partial results and clean the resulting dataframe
merged = pd.concat([x1, x2]).drop(columns=['ref', 'ref2'])
merged.index.name = 'created_at'
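As a side note, if keeping only the single closest df1 row for each df2 row is acceptable, pd.merge_asof can express the same condition (df1 event at or before the df2 event, at most 2 minutes earlier) more directly. This is only a sketch under that assumption, starting again from the original df1 and df2 as read from the CSV files:
# sketch only: merge_asof keeps at most one df1 match per df2 row
df1_sorted = df1.sort_index().reset_index()        # 'created_at' becomes a column
df2_sorted = (df2.sort_index().reset_index()
                 .rename(columns={'created_at': 'created_at_2'}))

nearest = pd.merge_asof(
    df2_sorted, df1_sorted,
    left_on='created_at_2', right_on='created_at',
    direction='backward',                # df1 event at or before the df2 event
    tolerance=pd.Timedelta(minutes=2),   # and at most 2 minutes earlier
)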

Related

Converting Pandas DataFrame dates so that I can pick out particular dates

I have two dataframes with particular data that I need to merge.
Date Greenland Antarctica
0 2002.29 0.00 0.00
1 2002.35 68.72 19.01
2 2002.62 -219.32 -59.36
3 2002.71 -242.83 46.55
4 2002.79 -209.12 63.31
.. ... ... ...
189 2020.79 -4928.78 -2542.18
190 2020.87 -4922.47 -2593.06
191 2020.96 -4899.53 -2751.98
192 2021.04 -4838.44 -3070.67
193 2021.12 -4900.56 -2755.94
[194 rows x 3 columns]
and
Date Mean Sea Level
0 1993.011526 -38.75
1 1993.038692 -39.77
2 1993.065858 -39.61
3 1993.093025 -39.64
4 1993.120191 -38.72
... ... ...
1021 2020.756822 62.83
1022 2020.783914 62.93
1023 2020.811006 62.98
1024 2020.838098 63.00
1025 2020.865190 63.00
[1026 rows x 2 columns]
My ultimate goal is to pull out the data from the second dataframe (the Mean Sea Level column) that comes from (roughly) the same time frame as the dates in the first dataframe, and then merge that back into the first dataframe.
However, the only way I can think of to select certain dates involves first converting all of the dates in the Date columns of both dataframes to something pandas recognizes, and I have been unable to figure out how to do that. I wrote some code (below) that can convert individual dates to a more common date format, but it has been difficult to apply it to all of the dates in a dataframe. I'm also not sure I can then get pandas to convert that into a date format it recognizes.
from datetime import datetime

def fraction2datetime(year_fraction: float) -> datetime:
    year = int(year_fraction)
    fraction = year_fraction - year
    first = datetime(year, 1, 1)
    aux = datetime(year + 1, 1, 1)
    return first + (aux - first) * fraction
I also looked at pandas.to_datetime but I don't see a way to have it read the format the dates are initially in.
So does anyone have any guidance on this? Firstly with the conversion of dates, but also with the task of picking out the dates from the second dataframe if possible. Any help would be greatly appreciated.
Suppose you have these 2 dataframes:
df1:
Date Greenland Antarctica
0 2020.79 -4928.78 -2542.18
1 2020.87 -4922.47 -2593.06
2 2020.96 -4899.53 -2751.98
3 2021.04 -4838.44 -3070.67
4 2021.12 -4900.56 -2755.94
df2:
Date Mean Sea Level
0 2020.756822 62.83
1 2020.783914 62.93
2 2020.811006 62.98
3 2020.838098 63.00
4 2020.865190 63.00
To convert the dates:
def fraction2datetime(year_fraction: float) -> datetime:
    year = int(year_fraction)
    fraction = year_fraction - year
    first = datetime(year, 1, 1)
    aux = datetime(year + 1, 1, 1)
    return first + (aux - first) * fraction
df1["Date"] = df1["Date"].apply(fraction2datetime)
df2["Date"] = df2["Date"].apply(fraction2datetime)
print(df1)
print(df2)
Prints:
Date Greenland Antarctica
0 2020-10-16 03:21:35.999999 -4928.78 -2542.18
1 2020-11-14 10:04:47.999997 -4922.47 -2593.06
2 2020-12-17 08:38:24.000001 -4899.53 -2751.98
3 2021-01-15 14:23:59.999999 -4838.44 -3070.67
4 2021-02-13 19:11:59.999997 -4900.56 -2755.94
Date Mean Sea Level
0 2020-10-03 23:55:28.012795 62.83
1 2020-10-13 21:54:02.073603 62.93
2 2020-10-23 19:52:36.134397 62.98
3 2020-11-02 17:51:10.195198 63.00
4 2020-11-12 15:49:44.255992 63.00
For the join, you can use pd.merge_asof. For example, this will join on the nearest date within a 30-day tolerance (you can tweak these values as you want):
x = pd.merge_asof(
    df1, df2, on="Date", tolerance=pd.Timedelta(days=30), direction="nearest"
)
print(x)
Will print:
Date Greenland Antarctica Mean Sea Level
0 2020-10-16 03:21:35.999999 -4928.78 -2542.18 62.93
1 2020-11-14 10:04:47.999997 -4922.47 -2593.06 63.00
2 2020-12-17 08:38:24.000001 -4899.53 -2751.98 NaN
3 2021-01-15 14:23:59.999999 -4838.44 -3070.67 NaN
4 2021-02-13 19:11:59.999997 -4900.56 -2755.94 NaN
You can specify a timestamp format in to_datetime(). Otherwise, if you need a custom conversion function, you can use apply(). If performance is a concern, be aware that apply() does not perform as well as built-in pandas methods.
To combine the DataFrames you can then use an outer join on the date column.
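A rough sketch of that suggestion, reusing the fraction2datetime helper from the question (since the Date columns hold decimal years, for which to_datetime has no format code):
import pandas as pd
from datetime import datetime

def fraction2datetime(year_fraction: float) -> datetime:
    # convert a decimal year such as 2002.29 into a datetime
    year = int(year_fraction)
    fraction = year_fraction - year
    first = datetime(year, 1, 1)
    aux = datetime(year + 1, 1, 1)
    return first + (aux - first) * fraction

# assuming df1 and df2 are the two frames shown in the question
df1["Date"] = df1["Date"].apply(fraction2datetime)
df2["Date"] = df2["Date"].apply(fraction2datetime)

# combine with an outer join on the converted date column
combined = df1.merge(df2, on="Date", how="outer").sort_values("Date")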

Python (Pandas) How to merge 2 dataframes with different dates in incremental order?

I am trying to merge 2 dataframes by date index in order. Is this possible?
A sample code of what I need to manipulate
Link for sg_df:https://query1.finance.yahoo.com/v7/finance/download/%5ESTI?P=^STI?period1=1442102400&period2=1599955200&interval=1mo&events=history
Link for facemask_compliance_df: https://today.yougov.com/topics/international/articles-reports/2020/05/18/international-covid-19-tracker-update-18-may (YouGov COVID-19 behaviour changes tracker: Wearing a face mask when in public places)
# Singapore Index
# Read file
# Format Date
# index date column for easy referencing
sg_df = pd.read_csv("^STI.csv")
conv = lambda x: datetime.strptime(x, "%d/%m/%Y")
sg_df["Date"] = sg_df["Date"].apply(conv)
sg_df.sort_values("Date", inplace = True)
sg_df.set_index("Date", inplace = True)
# Will wear face mask in public
# Read file
# Format Date, Removing time
# index date column for easy referencing
facemask_compliance_df = pd.read_csv("yougov-chart.csv")
convert1 = lambda x: datetime.strptime(x, "%d/%m/%Y %H:%M")
facemask_compliance_df["DateTime"] = facemask_compliance_df["DateTime"].apply(convert1).dt.date
facemask_compliance_df.sort_values("DateTime", inplace = True)
facemask_compliance_df.set_index("DateTime", inplace = True)
sg_df = sg_df.merge(facemask_compliance_df["Singapore"], left_index = True, right_index = True, how = "outer").sort_index()
and I wish to output a table kind of like this.
Kindly let me know if you need any more info; I will provide it shortly if I am able to.
Edit:
This is the issue
data from yougov-chart
I think it is reading the dates even when it is not from Singapore
Use:
merge to merge the two tables.
1.1. on to choose which column to merge on:
Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
1.2. outer option:
outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
sort_values to sort by date
import pandas as pd
df1 = pd.read_csv("^STI.csv")
df1['Date'] = pd.to_datetime(df1.Date)
df2 = pd.read_csv("yougov-chart.csv")
df2['Date'] = pd.to_datetime(df2.DateTime)
result = df2.merge(df1, on='Date', how='outer')
result = result.sort_values('Date')
print(result)
Output:
Date US_GDP_Thousands Mask Compliance
6 2016-02-01 NaN 37.0
7 2017-07-01 NaN 73.0
8 2019-10-01 NaN 85.0
0 2020-02-21 50.0 27.0
1 2020-03-18 55.0 NaN
2 2020-03-19 60.0 NaN
3 2020-03-25 65.0 NaN
4 2020-04-03 70.0 NaN
5 2020-05-14 75.0 NaN
First use the parse_dates and index_col parameters in read_csv to get a DatetimeIndex in both dataframes, and in the second one remove the times with DatetimeIndex.floor:
sg_df = pd.read_csv("^STI.csv",
                    parse_dates=['Date'],
                    index_col=['Date'])
facemask_compliance_df = pd.read_csv("yougov-chart.csv",
                                     parse_dates=['DateTime'],
                                     index_col=['DateTime'])
# DateTime is the index here, so floor the index rather than a column
facemask_compliance_df.index = facemask_compliance_df.index.floor('d')
Then use DataFrame.merge on the indexes with an outer join, and sort the result with DataFrame.sort_index:
df = sg_df.merge(facemask_compliance_df,
                 left_index=True,
                 right_index=True,
                 how='outer').sort_index()
print(df)
Mask Compliance US_GDP_Thousands
Date
2016-02-01 37.0 NaN
2017-07-01 73.0 NaN
2019-10-01 85.0 NaN
2020-02-21 27.0 50.0
2020-03-18 NaN 55.0
2020-03-19 NaN 60.0
2020-03-25 NaN 65.0
2020-04-03 NaN 70.0
2020-05-14 NaN 75.0
If I remember right, in NumPy you can use np.vstack or np.hstack, depending on how you want to join them together.
In pandas there is pd.concat (https://pandas.pydata.org/docs/user_guide/merging.html), which I have used for merging dataframes.
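For illustration, a minimal sketch of those two options on toy data (not the question's dataframes):
import numpy as np
import pandas as pd

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

np.vstack([a, b])   # stack vertically: 4 rows, 2 columns
np.hstack([a, b])   # stack horizontally: 2 rows, 4 columns

# the pandas counterpart is pd.concat, along rows (axis=0) or columns (axis=1)
df_a = pd.DataFrame(a, columns=["x", "y"])
df_b = pd.DataFrame(b, columns=["x", "y"])
stacked = pd.concat([df_a, df_b], axis=0, ignore_index=True)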

Accessing column values with columns set up as values of an index

I have the following dataframe named df:
V1 V2
IDS
a 1 2
b 3 4
If I print out the index and the columns, this is the result:
> print(df.index)
Index(['a','b'],dtype='object',name='IDS',length=2)
> print(df.columns)
Index(['V1','V2'],dtype='object',length=2)
I want to perform a calculation on these two columns (row-wise) and add this to a new column. I have tried the following, but I can't seem to access the column as expected.
df['sum'] = df.apply(lambda row: row['V1'] + row['V2'], axis=1)
I get the following error running the last line of code:
KeyError: ('V1', 'occurred at index a')
How do I access these columns?
Update: the contrived example does not reproduce the error; here is the actual dataframe I am working with:
DATE ... gathering_size_100_to_26 shelter_in_place
FIPS
10001 2020-01-22 ... 2020-01-01 2020-01-01
10002 2020-01-22 ... 2020-01-01 2020-01-02
10003 2020-02-25 ... 2020-01-01 2020-01-03
... ... ... ... ...
9013 2020-02-22 ... 2020-01-01 2020-01-01
I want to take the difference between 'gathering_size_100_to_26' and 'DATE', as well as between 'shelter_in_place' and 'DATE', and replace each value in place.
df["v1_v2_sum"] = df["V1"] + df["V2"]
Anyway, avoid df.apply and user-defined functions; they perform poorly and are only needed when you have no other option.
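For the date differences mentioned in the update, the same vectorized idea applies. A sketch, assuming the three columns can be parsed as dates:
# assumption: these columns hold date strings that pd.to_datetime can parse
for col in ["DATE", "gathering_size_100_to_26", "shelter_in_place"]:
    df[col] = pd.to_datetime(df[col])

# replace each policy column with its offset from DATE, without apply()
df["gathering_size_100_to_26"] = df["gathering_size_100_to_26"] - df["DATE"]
df["shelter_in_place"] = df["shelter_in_place"] - df["DATE"]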
df = pd.DataFrame(data=[[0.8062, 0.9308], [0.364 , 0.6909]],index=['a','b'], columns=['V1','V2'])
print(df)
Output:
V1 V2
a 0.8062 0.9308
b 0.3640 0.6909
df['sum'] = df.apply(sum,axis=1)
print(df)
Output:
V1 V2 sum
a 0.8062 0.9308 1.7370
b 0.3640 0.6909 1.0549
I realized I had a typo... what was stated above (but reworked for my instance) works.

Grouping by date and number of unique users for multiple variables

I have a dataframe containing tweets. It has columns for the datetime, a unique user_id, and then columns indicating whether the tweet belongs to a thematic category. In the end I'd like to visualize it with a line graph.
The data looks as follows:
datetime user_id Meta News & Media Environment ...
0 2019-05-08 07:16:02 21741359 NaN NaN 1.0
1 2019-05-08 07:15:23 2785265103 NaN NaN 1.0
2 2019-05-08 07:14:11 606785697 NaN 1.0 NaN
3 2019-05-08 07:13:42 718989200616529921 1.0 NaN NaN
4 2019-05-08 07:13:27 939207240728350720 1.0 NaN 1.0
... ... ... ... ... ...
So far I've managed to produce one just summing each theme per day with the following code:
monthly_trends = tweets_df.groupby(pd.Grouper(key='datetime', freq='D'))[list(issues.keys())].sum().fillna(0)
which gives me:
Meta News & Media Environment ...
datetime
2019-05-07 586.0 25.0 30.0
2019-05-08 505.0 16.0 70.0
2019-05-09 450.0 12.0 50.0
2019-05-10 339.0 8.0 90.0
2019-05-11 254.0 5.0 10.0
I plot this with:
monthly_trends.plot(kind='line', figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Date', fontsize=20)
plt.title('Issue activity during the election period', size = 30)
plt.show()
This gives me a nice graph. But since a single user may just be spamming one theme, I'd like instead to get a count of unique users per theme per day. I've tried using additional groupbys but only got errors.
For pandas' DataFrame.plot across multiple series you need data in wide format with separate columns. However, for the unique user_id calculation you need data in long format for the aggregation. Therefore, consider melt, then groupby, then pivot back for plotting. Had you not needed the unique-user count, the wide-format daily sum you already have would have been enough.
### RESHAPE LONG AND AGGREGATE
long_df = (tweets_df.melt(id_vars=['datetime', 'user_id'],
                          value_name='Count', var_name='Issue')
           .query("Count >= 1")
           .groupby([pd.Grouper(key='datetime', freq='D'), 'Issue'])['user_id']
           .nunique()
           .reset_index())

### RESHAPE WIDE AND PLOT
(long_df.pivot(index='datetime', columns='Issue', values='user_id')
        .plot(kind='line', title='Unique Users by Day and Tweet Issue'))

plt.show()
plt.clf()
plt.close()
Stack all issues, group by issue and day, and count the unique user ids:
df.columns.names = ['issue']
df_users = (df.set_index(['datetime', 'user_id'])[issues]
            .stack()
            .reset_index()
            .groupby([pd.Grouper(key='datetime', freq='D'), 'issue'])
            .apply(lambda x: len(x.user_id.unique()))
            .rename('n_unique_users')
            .reset_index())
print(df_users)
datetime issue n_unique_users
0 2019-05-08 Environment 3
1 2019-05-08 Meta 2
2 2019-05-08 News & Media 1
Then you can reshape as required for plotting:
df_users.pivot_table(index='datetime', columns='issue', values='n_unique_users', aggfunc=sum)
issue Environment Meta News & Media
datetime
2019-05-08 3 2 1
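To get back to a line graph, the pivoted table can be plotted the same way as in the question (a sketch reusing the question's plotting style, assuming df_users from above):
import matplotlib.pyplot as plt

# one line per issue, unique users per day on the y-axis
(df_users.pivot_table(index='datetime', columns='issue',
                      values='n_unique_users', aggfunc='sum')
         .plot(kind='line', figsize=(20, 10), linewidth=5, fontsize=20))
plt.xlabel('Date', fontsize=20)
plt.title('Unique users per issue per day', size=30)
plt.show()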

Plotting counts of a dataframe grouped by timestamp

So I have a pandas dataframe which has a large number of columns, and one of the columns is a timestamp in datetime format. Each row in the dataframe represents a single "event". What I'm trying to do is graph the frequency of these events over time. Basically a simple bar graph showing how many events per month.
Started with this code:
data.groupby([(data.Timestamp.dt.year),(data.Timestamp.dt.month)]).count().plot(kind = 'bar')
plt.show()
This "kind of" works. But there are 2 problems:
1) The graph comes with a legend which includes all the columns in the original data (like 30+ columns). And each bar on the graph has a tiny sub-bar for each of the columns (all of which are the same value since I'm just counting events).
2) There are some months where there are zero events. And these months don't show up on the graph at all.
I finally came up with code to get the graph looking the way I wanted. But it seems to me that I'm not doing this the "correct" way, since this must be a fairly common use case.
Basically I created a new dataframe with one column "count" and an index that's a string representation of month/year. I populated that with zeroes over the time range I care about and then I copied over the data from the first frame into the new one. Here is the code:
import pandas as pd
import matplotlib.pyplot as plt

cnt = data.groupby([(data.Timestamp.dt.year), (data.Timestamp.dt.month)]).count()

index = []
for year in [2015, 2016, 2017, 2018]:
    for month in range(1, 13):
        index.append('%04d-%02d' % (year, month))

cnt_new = pd.DataFrame(index=index, columns=['count'])
cnt_new = cnt_new.fillna(0)
for i, row in cnt.iterrows():
    cnt_new.at['%04d-%02d' % i, 'count'] = row[0]

cnt_new.plot(kind='bar')
plt.show()
Anyone know an easier way to go about this?
EDIT --> Per request, here's an idea of the type of dataframe. It's the results from an SQL query. Actual data is my company's so...
Timestamp FirstName LastName HairColor \
0 2018-11-30 02:16:11 Fred Schwartz brown
1 2018-11-29 16:25:55 Sam Smith black
2 2018-11-19 21:12:29 Helen Hunt red
OK, so I think I got it. Thanks to Yuca for the resample command. I just need to run it on the Timestamp series (rather than on the whole dataframe) and it gives me exactly what I was looking for.
> data.index = data.Timestamp
> data.Timestamp.resample('M').count()
Timestamp
2017-11-30 0
2017-12-31 0
2018-01-31 1
2018-02-28 2
2018-03-31 7
2018-04-30 9
2018-05-31 2
2018-06-30 6
2018-07-31 5
2018-08-31 4
2018-09-30 1
2018-10-31 0
2018-11-30 5
So the OP's request is: "Basically a simple bar graph showing how many events per month."
Using resample with a monthly frequency yields the desired result:
df[['FirstName']].resample('M').count()
Output:
FirstName
Timestamp
2018-11-30 3
To include non-observed months, we need to create a baseline calendar:
df_a = pd.DataFrame(index = pd.date_range(df.index[0].date(), periods=12, freq='M'))
and then assign to it the result of our resample
df_a['count'] = df[['FirstName']].resample('M').count()
Output:
count
2018-11-30 3.0
2018-12-31 NaN
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 NaN
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 NaN
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 NaN
2019-10-31 NaN
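From there, one might fill the unobserved months with zero and draw the bar chart the OP asked for (a sketch, assuming df_a from above):
import matplotlib.pyplot as plt

# months with no events become 0 instead of NaN, then plot
df_a['count'] = df_a['count'].fillna(0)
df_a.plot(kind='bar', legend=False)
plt.show()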
