I have two dataframes that I need to merge.
Date Greenland Antarctica
0 2002.29 0.00 0.00
1 2002.35 68.72 19.01
2 2002.62 -219.32 -59.36
3 2002.71 -242.83 46.55
4 2002.79 -209.12 63.31
.. ... ... ...
189 2020.79 -4928.78 -2542.18
190 2020.87 -4922.47 -2593.06
191 2020.96 -4899.53 -2751.98
192 2021.04 -4838.44 -3070.67
193 2021.12 -4900.56 -2755.94
[194 rows x 3 columns]
and
Date Mean Sea Level
0 1993.011526 -38.75
1 1993.038692 -39.77
2 1993.065858 -39.61
3 1993.093025 -39.64
4 1993.120191 -38.72
... ... ...
1021 2020.756822 62.83
1022 2020.783914 62.93
1023 2020.811006 62.98
1024 2020.838098 63.00
1025 2020.865190 63.00
[1026 rows x 2 columns]
My ultimate goal is to pull out the data from the second dataframe (the Mean Sea Level column) that comes from roughly the same time frame as the dates in the first dataframe, and then merge that back into the first dataframe.
However, every way I can think of for selecting particular dates involves first converting the dates in the Date columns of both dataframes to something pandas recognizes, and I have been unable to figure out how to do that. I did work out some code (below) that converts an individual date to a more common date format, but it has been difficult to apply it successfully to all of the dates in a dataframe, and I'm not sure I can then get pandas to convert the result into a date format it recognizes.
from datetime import datetime

def fraction2datetime(year_fraction: float) -> datetime:
    year = int(year_fraction)
    fraction = year_fraction - year
    first = datetime(year, 1, 1)
    aux = datetime(year + 1, 1, 1)
    return first + (aux - first) * fraction
I also looked at pandas.to_datetime but I don't see a way to have it read the format the dates are initially in.
So does anyone have any guidance on this? Firstly with the conversion of dates, but also with the task of picking out the dates from the second dataframe if possible. Any help would be greatly appreciated.
Suppose you have these two dataframes:
df1:
Date Greenland Antarctica
0 2020.79 -4928.78 -2542.18
1 2020.87 -4922.47 -2593.06
2 2020.96 -4899.53 -2751.98
3 2021.04 -4838.44 -3070.67
4 2021.12 -4900.56 -2755.94
df2:
Date Mean Sea Level
0 2020.756822 62.83
1 2020.783914 62.93
2 2020.811006 62.98
3 2020.838098 63.00
4 2020.865190 63.00
To convert the dates:
from datetime import datetime

def fraction2datetime(year_fraction: float) -> datetime:
    year = int(year_fraction)
    fraction = year_fraction - year
    first = datetime(year, 1, 1)
    aux = datetime(year + 1, 1, 1)
    return first + (aux - first) * fraction

df1["Date"] = df1["Date"].apply(fraction2datetime)
df2["Date"] = df2["Date"].apply(fraction2datetime)
print(df1)
print(df2)
Prints:
Date Greenland Antarctica
0 2020-10-16 03:21:35.999999 -4928.78 -2542.18
1 2020-11-14 10:04:47.999997 -4922.47 -2593.06
2 2020-12-17 08:38:24.000001 -4899.53 -2751.98
3 2021-01-15 14:23:59.999999 -4838.44 -3070.67
4 2021-02-13 19:11:59.999997 -4900.56 -2755.94
Date Mean Sea Level
0 2020-10-03 23:55:28.012795 62.83
1 2020-10-13 21:54:02.073603 62.93
2 2020-10-23 19:52:36.134397 62.98
3 2020-11-02 17:51:10.195198 63.00
4 2020-11-12 15:49:44.255992 63.00
For the join, you can use pd.merge_asof. For example, this will join on the nearest date within a 30-day tolerance (you can tweak these values as you want):
x = pd.merge_asof(
df1, df2, on="Date", tolerance=pd.Timedelta(days=30), direction="nearest"
)
print(x)
Will print:
Date Greenland Antarctica Mean Sea Level
0 2020-10-16 03:21:35.999999 -4928.78 -2542.18 62.93
1 2020-11-14 10:04:47.999997 -4922.47 -2593.06 63.00
2 2020-12-17 08:38:24.000001 -4899.53 -2751.98 NaN
3 2021-01-15 14:23:59.999999 -4838.44 -3070.67 NaN
4 2021-02-13 19:11:59.999997 -4900.56 -2755.94 NaN
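The NaN rows are the df1 dates that had no df2 reading within the 30-day tolerance. If you would rather keep only the matched rows, one option (just a sketch; you could also widen the tolerance instead) is:
x = x.dropna(subset=["Mean Sea Level"])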
You can specify a timestamp format in to_datetime(). Otherwise, if you need to use a custom function, you can use apply(). If performance is a concern, be aware that apply() does not perform as well as built-in pandas methods.
To combine the DataFrames you can use an outer join on the date column.
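If the apply() call ever becomes a bottleneck, the fractional-year conversion can also be done in a fully vectorised way. This is only a sketch; fraction2datetime_vec is a name introduced here, not something from the posts above, and it assumes the Date column holds fractional years like 2002.29:

import pandas as pd

def fraction2datetime_vec(s: pd.Series) -> pd.Series:
    # Vectorised fractional-year -> datetime conversion (no apply).
    year = s.astype(int)
    start = pd.to_datetime(year.astype(str), format="%Y")      # Jan 1 of each year
    end = pd.to_datetime((year + 1).astype(str), format="%Y")  # Jan 1 of the next year
    return start + (end - start) * (s - year)

df1["Date"] = fraction2datetime_vec(df1["Date"])
df2["Date"] = fraction2datetime_vec(df2["Date"])

# An exact outer join keeps every row from both frames, but with these
# converted timestamps the dates will rarely line up exactly, which is why
# the merge_asof approach above is usually the better fit here.
combined = pd.merge(df1, df2, on="Date", how="outer")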
Related
I actually have two CSV files that I read into the dataframes df1 and df2.
When I use the command:
df1=pd.read_csv("path",index_col="created_at",parse_dates=["created_at"])
I get:
index likes ... user_screen_name sentiment
created_at ...
2019-02-27 05:36:29 0 94574 ... realDonaldTrump positive
2019-02-27 05:31:21 1 61666 ... realDonaldTrump negative
2019-02-26 18:08:14 2 151844 ... realDonaldTrump positive
2019-02-26 04:50:37 3 184597 ... realDonaldTrump positive
2019-02-26 04:50:36 4 181641 ... realDonaldTrump negative
... ... ... ... ... ...
When I use the command:
df2=pd.read_csv("path",index_col="created_at",parse_dates=["created_at"])
I get:
Unnamed: 0 Close Open Volume Day
created_at
2019-03-01 00:47:00 0 2784.49 2784.49 NaN STABLE
2019-03-01 00:21:00 1 2784.49 2784.49 NaN STABLE
2019-03-01 00:20:00 2 2784.49 2784.49 NaN STABLE
2019-03-01 00:19:00 3 2784.49 2784.49 NaN STABLE
2019-03-01 00:18:00 4 2784.49 2784.49 NaN STABLE
2019-03-01 00:17:00 5 2784.49 2784.49 NaN STABLE
... ... ... ... ... ...
As you know, when you use the command:
df3=df1.join(df2)
you join the two tables on the index created_at, matching the exact date and time in both tables.
However, I would like the join to allow a delay of, for example, 2 minutes.
For example, instead of:
file df1 file df2
created_at created_at
2019-02-27 05:36:29 2019-02-27 05:36:29
I would like to have the two tables join like this:
file df1 file df2
created_at created_at
2019-02-27 05:36:29 2019-02-27 05:38:29
It is important for my data that the df1 timestamp comes before the df2 timestamp; that is, the df1 event must happen before the df2 event.
For small dataframes, Merging two dataframes based on a date between two other dates without a common column contains a nice solution. However, it uses a Cartesian product of both dataframes and will not scale nicely to larger dataframes.
A possible optimization is to add rounded datetime columns to the dataframes and join on those columns. As a join is much more efficient than a Cartesian product, the gain in memory and execution time should be noticeable.
What you want is (pseudo code here):
df1.created_at <= df2.created_at and df2.created_at - df1.created_at <= 2mins
I would add to both dataframes a ref column defined as (still pseudo code): created_at - (created_at.minute % 2)
If lines in both dataframes share the same ref value, their dates should be less than 4 minutes apart. But this will not catch all the expected cases, because two dates can be closer than 2 minutes and still fall into two different slots. To cope with that, I suggest adding a ref2 column to df1, defined as ref + 2 minutes, and doing a second join on df1.ref2 == df2.ref. That is enough because you want the df1 event to be before the df2 one; otherwise we would also need a third column ref3 = ref - 2 minutes.
Then, as in the referenced answer, we can select the lines actually meeting the requirement and concatenate the two joined dataframes.
Pandas code could be:
import pandas as pd

# create auxiliary columns
df1['ref'] = df1.index - pd.to_timedelta(df1.index.minute % 2, unit='m')
df1['ref2'] = df1.ref + pd.Timedelta(minutes=2)
df2['ref'] = df2.index - pd.to_timedelta(df2.index.minute % 2, unit='m')
df2.index.name = 'created_at_2'
df2 = df2.reset_index().set_index('ref')

# join on ref and select the relevant lines
x1 = df1.join(df2, on='ref', how='inner')
x1 = x1.loc[(x1.index <= x1.created_at_2)
            & (x1.created_at_2 - x1.index <= pd.Timedelta(minutes=2))]

# join on ref2 and select the relevant lines
x2 = df1.join(df2, on='ref2', how='inner')
x2 = x2.loc[(x2.index <= x2.created_at_2)
            & (x2.created_at_2 - x2.index <= pd.Timedelta(minutes=2))]

# concatenate the partial results and clean the resulting dataframe
merged = pd.concat([x1, x2]).drop(columns=['ref', 'ref2'])
merged.index.name = 'created_at'
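As an aside, if your pandas version has pd.merge_asof (used in the first answer further up), the same "df1 before df2, at most 2 minutes apart" requirement can probably be expressed more directly. This is only a sketch, starting again from the two frames exactly as they were read from CSV (indexed by created_at), and note that merge_asof keeps only the first df2 row after each df1 row, whereas the bucket join above keeps every pair within the window:

import pandas as pd

# merge_asof needs both sides sorted on the join key.
left = df1.sort_index().reset_index()
right = df2.sort_index().reset_index().rename(columns={'created_at': 'created_at_2'})

merged = pd.merge_asof(
    left,
    right,
    left_on='created_at',
    right_on='created_at_2',
    direction='forward',                 # df2 event at or after the df1 event
    tolerance=pd.Timedelta(minutes=2),   # ...but no more than 2 minutes later
)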
I have a series of dataframes; some hold static values and some are time series.
I want to transpose the values from one time series to a new time series, applying a function which draws values from both the original time series dataframe and the dataframe that holds the static values.
A snip of the time series and static dataframes are below.
Time series dataframe (Irradiance)
Datetime Date Time GHI DIF flagR SE SA TEMP
2017-07-01 00:11:00 01.07.2017 00:11 0 0 0 -9.39 -179.97 11.1
2017-07-01 00:26:00 01.07.2017 00:26 0 0 0 -9.33 -176.47 11.0
2017-07-01 00:41:00 01.07.2017 00:41 0 0 0 -9.14 -172.98 10.9
2017-07-01 00:56:00 01.07.2017 00:56 0 0 0 -8.83 -169.51 10.9
2017-07-01 01:11:00 01.07.2017 01:11 0 0 0 -8.40 -166.04 11.0
Static dataframe (Properties)
Bearing (degrees) Inclination (degrees)
Home ID
151631 244 29
151632 244 29
151633 184 44
I have written a function which I want to use to populate a new dataframe using values from both of these.
def dif(DIF, Inclination, GHI):
    global Albedo
    return DIF * (1 + math.cos(math.radians(Inclination)) / 2) + (GHI * Albedo * (1 - math.cos(math.radians(Inclination)) / 2))
When I tried to do the same thing within a single dataframe I used the NumPy vectorize function, so I thought I would be able to iterate over each column of the new dataframe with the following code.
for column in DIF:
    DIF[column] = np.vectorize(dif)(irradiance['DIF'], properties.iloc['Inclination (degrees)'][column], irradiance['GHI'])
Instead this throws the following error.
TypeError: cannot do positional indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [Inclination (degrees)] of <class 'str'>
I've checked the dtype of the Inclination (degrees) values and it comes back as Int64, not str, so I'm not sure why this error is being generated.
I'm obviously missing something critical here. Are there alternative methods that would work better, or at all? Any help would be much appreciated.
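For what it's worth, the error message itself points at the likely cause: .iloc only accepts integer positions, while 'Inclination (degrees)' is a label, which is what .loc is for. A minimal sketch of the loop with a label-based lookup, assuming the columns of the new DIF frame are the Home IDs from the Properties frame and that Albedo is defined somewhere (a hypothetical value is used here):

import math
import numpy as np

Albedo = 0.2  # hypothetical value; the post does not say what Albedo actually is

def dif(DIF, Inclination, GHI):
    return DIF * (1 + math.cos(math.radians(Inclination)) / 2) + (GHI * Albedo * (1 - math.cos(math.radians(Inclination)) / 2))

for column in DIF:
    # .loc indexes by label (Home ID row, column name); .iloc would need
    # integer positions, hence the TypeError above.
    inclination = properties.loc[column, 'Inclination (degrees)']
    DIF[column] = np.vectorize(dif)(irradiance['DIF'], inclination, irradiance['GHI'])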
So I have a pandas dataframe which has a large number of columns, and one of the columns is a timestamp in datetime format. Each row in the dataframe represents a single "event". What I'm trying to do is graph the frequency of these events over time. Basically a simple bar graph showing how many events per month.
Started with this code:
data.groupby([(data.Timestamp.dt.year),(data.Timestamp.dt.month)]).count().plot(kind = 'bar')
plt.show()
This "kind of" works. But there are 2 problems:
1) The graph comes with a legend which includes all the columns in the original data (like 30+ columns). And each bar on the graph has a tiny sub-bar for each of the columns (all of which are the same value since I'm just counting events).
2) There are some months where there are zero events. And these months don't show up on the graph at all.
I finally came up with code to get the graph looking the way I wanted. But it seems to me that I'm not doing this the "correct" way, since this must be a fairly common use case.
Basically I created a new dataframe with one column "count" and an index that's a string representation of month/year. I populated that with zeroes over the time range I care about and then I copied over the data from the first frame into the new one. Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
cnt = data.groupby([(data.Timestamp.dt.year), (data.Timestamp.dt.month)]).count()

index = []
for year in [2015, 2016, 2017, 2018]:
    for month in range(1, 13):
        index.append('%04d-%02d' % (year, month))

cnt_new = pd.DataFrame(index=index, columns=['count'])
cnt_new = cnt_new.fillna(0)

for i, row in cnt.iterrows():
    cnt_new.at['%04d-%02d' % i, 'count'] = row[0]

cnt_new.plot(kind='bar')
plt.show()
Anyone know an easier way to go about this?
EDIT --> Per request, here's an idea of the type of dataframe. It's the result of an SQL query. Actual data is my company's so...
Timestamp FirstName LastName HairColor \
0 2018-11-30 02:16:11 Fred Schwartz brown
1 2018-11-29 16:25:55 Sam Smith black
2 2018-11-19 21:12:29 Helen Hunt red
OK, so I think I got it. Thanks to Yuca for the resample command. I just need to run it on the Timestamp series (rather than on the whole dataframe) and it gives me exactly what I was looking for.
> data.index = data.Timestamp
> data.Timestamp.resample('M').count()
Timestamp
2017-11-30 0
2017-12-31 0
2018-01-31 1
2018-02-28 2
2018-03-31 7
2018-04-30 9
2018-05-31 2
2018-06-30 6
2018-07-31 5
2018-08-31 4
2018-09-30 1
2018-10-31 0
2018-11-30 5
So the OP's request is: "Basically a simple bar graph showing how many events per month"
Using resample with a monthly frequency yields the desired result:
df[['FirstName']].resample('M').count()
Output:
FirstName
Timestamp
2018-11-30 3
To include non-observed months, we need to create a baseline calendar:
df_a = pd.DataFrame(index = pd.date_range(df.index[0].date(), periods=12, freq='M'))
and then assign to it the result of our resample
df_a['count'] = df[['FirstName']].resample('M').count()
Output:
count
2018-11-30 3.0
2018-12-31 NaN
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 NaN
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 NaN
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 NaN
2019-10-31 NaN
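To get back to the original goal (months with zero events still showing up on the chart), the NaN values can presumably just be filled with zero before plotting. A minimal sketch:

import matplotlib.pyplot as plt

# Months with no observed events are NaN after the assignment above;
# fill them with 0 so every month appears in the bar chart.
df_a['count'] = df_a['count'].fillna(0)
df_a.plot(kind='bar')
plt.show()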
I have a pandas dataframe that is filled as follows:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
8/31/2010 1
9/30/2010 4
12/31/2010 2
Note how there are missing months (i.e. 7, 10, 11) in the data. I want to fill in the missing data through a forward filling method so that it looks like this:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
7/30/2010 3
8/31/2010 1
9/30/2010 4
10/29/2010 4
11/30/2010 4
12/31/2010 2
Each missing date takes the tag of the previous date. All dates represent the last business day of the month.
This is what I tried to do:
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df.ref_date.index = pd.to_datetime(df.ref_date.index)
df = df.reindex(index=[idx], columns=[ref_date], method='ffill')
It's giving me the error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
where pd is pandas and df is the dataframe.
I'm new to Pandas Dataframe, so any help would be appreciated!
You were very close. You just need to set the dataframe's index to ref_date, reindex it to the business-day month-end index with ffill as the method, then reset the index and rename it back to the original column name:
# First ensure the dates are Pandas Timestamps.
df['ref_date'] = pd.to_datetime(df['ref_date'])

# Create a business-day month-end index.
idx_monthly = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')

# Reindex to the monthly index with forward fill, then restore the column name.
>>> (df
     .set_index('ref_date')
     .reindex(idx_monthly, method='ffill')
     .reset_index()
     .rename(columns={'index': 'ref_date'}))
ref_date tag
0 2010-01-29 1.0
1 2010-02-26 3.0
2 2010-03-31 4.0
3 2010-04-30 4.0
4 2010-05-31 1.0
5 2010-06-30 3.0
6 2010-07-30 3.0
7 2010-08-31 1.0
8 2010-09-30 4.0
9 2010-10-29 4.0
10 2010-11-30 4.0
11 2010-12-31 2.0
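One small follow-up: as the output above shows, tag comes back as floats after the reindex. If you want integers again, you can presumably just cast at the end of the chain (out is only a name introduced here for the sketch):

out = (df
       .set_index('ref_date')
       .reindex(idx_monthly, method='ffill')
       .reset_index()
       .rename(columns={'index': 'ref_date'})
       .astype({'tag': int}))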
Thanks to the previous person who answered this question but then deleted their answer, I got the solution:
df['ref_date'] = pd.to_datetime(df['ref_date'])
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df = df.set_index('ref_date').reindex(idx).ffill().reset_index().rename(columns={'index': 'ref_date'})
I have a data frame that consists of a time series data with 15-second intervals:
date_time value
2012-12-28 11:11:00 103.2
2012-12-28 11:11:15 103.1
2012-12-28 11:11:30 103.4
2012-12-28 11:11:45 103.5
2012-12-28 11:12:00 103.3
The data spans many years. I would like to group by both year and time to look at the distribution of time-of-day effect over many years. For example, I may want to compute the mean and standard deviation of every 15-second interval across days, and look at how the means and standard deviations change from 2010, 2011, 2012, etc. I naively tried data.groupby(lambda x: [x.year, x.time]) but it didn't work. How can I do such grouping?
In case date_time is not your index, a date_time-indexed DataFrame could be created with:
dfts = df.set_index('date_time')
From there you can group by intervals using
dfts.groupby(lambda x : x.month).mean()
to see mean values for each month. Similarly, you can do
dfts.groupby(lambda x : x.year).std()
for standard deviations across the years.
If I understood the task you would like to achieve, you could simply split the data into years using xs, group them, concatenate the results, and store this in a new DataFrame.
years = range(2012, 2015)
yearly_month_stats = [dfts.xs(str(year)).groupby(lambda x : x.month).mean() for year in years]
df2 = pd.concat(yearly_month_stats, axis=1, keys = years)
From which you get something like
2012 2013 2014
value value value
1 NaN 5.324165 15.747767
2 NaN -23.193429 9.193217
3 NaN -14.144287 23.896030
4 NaN -21.877975 16.310195
5 NaN -3.079910 -6.093905
6 NaN -2.106847 -23.253183
7 NaN 10.644636 6.542562
8 NaN -9.763087 14.335956
9 NaN -3.529646 2.607973
10 NaN -18.633832 0.083575
11 NaN 10.297902 14.059286
12 33.95442 13.692435 22.293245
You were close:
data.groupby([lambda x: x.year, lambda x: x.time])
Also be sure to set date_time as the index, as in kermit666's answer
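For a self-contained version of that, here is a minimal sketch, assuming date_time starts out as a regular column (as in the question) and that the measurement column is called value:

import pandas as pd

# Make sure the column really is a datetime, then use it as the index.
data['date_time'] = pd.to_datetime(data['date_time'])
dfts = data.set_index('date_time')

# Group by (year, time-of-day) and look at mean/std of each 15-second slot.
stats = dfts.groupby([dfts.index.year, dfts.index.time])['value'].agg(['mean', 'std'])
print(stats.head())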