I have a dataframe with IDs and timestamps as a multi-index. The index is sorted by IDs and timestamps, and I want to pick the latest timestamp for each ID. For example:
IDs timestamp value
0 2010-10-30 1
2010-11-30 2
1 2000-01-01 300
2007-01-01 33
2010-01-01 400
2 2000-01-01 11
So basically the result I want is
IDs timestamp value
0 2010-11-30 2
1 2010-01-01 400
2 2000-01-01 11
What is the command to do that in pandas?
Given this setup:
import pandas as pd
import numpy as np
import io
content = io.StringIO("""\
IDs timestamp value
0 2010-10-30 1
0 2010-11-30 2
1 2000-01-01 300
1 2007-01-01 33
1 2010-01-01 400
2 2000-01-01 11""")
df = pd.read_table(content, header=0, sep=r'\s+', parse_dates=[1])
df.set_index(['IDs', 'timestamp'], inplace=True)
Using reset_index followed by groupby:
df.reset_index(['timestamp'], inplace=True)
print(df.groupby(level=0).last())
yields
timestamp value
IDs
0 2010-11-30 00:00:00 2
1 2010-01-01 00:00:00 400
2 2000-01-01 00:00:00 11
This does not feel like the best solution, however. There should be a way to do this without calling reset_index...
As you point out in the comments, last ignores NaN values. To not skip NaN values, you could use groupby/agg like this:
df.reset_index(['timestamp'], inplace=True)
grouped = df.groupby(level=0)
print(grouped.agg(lambda x: x.iloc[-1]))
One can also use
df.groupby("IDs").tail(1)
This will take the last row of each label in level "IDs" and will not ignore NaN values.
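Applied to the MultiIndexed frame from the setup (before the in-place reset_index above), this keeps both index levels; a minimal sketch:
# Last row per ID; because the index is sorted by timestamp within each ID,
# this is the latest observation, and the (IDs, timestamp) MultiIndex is kept.
print(df.groupby(level='IDs').tail(1))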
I have this dataframe:
date value source
0 2020-02-14 0.438767 L8-SR
1 2020-02-15 0.422867 S2A-SR
2 2020-03-01 0.657453 L8-SR
3 2020-03-01 0.603989 S2B-SR
4 2020-03-11 0.717264 S2B-SR
5 2020-04-02 0.737118 L8-SR
I would like to group by the date column, keeping rows according to a ranking/importance of my choosing from the source column. For example, my ranking is L8-SR>S2B-SR>GP6_r, meaning that for all rows with the same date, keep the row where source==L8-SR; if none contain L8-SR, keep the row where source==S2B-SR, etc. How can I accomplish that with pandas groupby?
Output should look like this:
date value source
0 2020-02-14 0.438767 L8-SR
1 2020-02-15 0.422867 S2A-SR
2 2020-03-01 0.657453 L8-SR
3 2020-03-11 0.717264 S2B-SR
4 2020-04-02 0.737118 L8-SR
Let's try category dtype and drop_duplicates:
orders = ['L8-SR','S2B-SR','GP6_r']
df.source = df.source.astype('category')
# Keep any unranked sources (here S2A-SR) at the lowest priority instead of
# letting set_categories turn them into NaN, and reassign the result.
orders += [s for s in df.source.cat.categories if s not in orders]
df.source = df.source.cat.set_categories(orders, ordered=True)
df.sort_values(['date','source']).drop_duplicates(['date'])
Output:
date value source
0 2020-02-14 0.438767 L8-SR
1 2020-02-15 0.422867 S2A-SR
2 2020-03-01 0.657453 L8-SR
4 2020-03-11 0.717264 S2B-SR
5 2020-04-02 0.737118 L8-SR
Try the code below for the groupby operation. For ordering afterwards you can sort the result:
# Import pandas library
import pandas as pd
# Declare a dictionary containing the data shown in the table
pandasdata_dict = {'date':['2020-02-14', '2020-02-15', '2020-03-01', '2020-03-01', '2020-03-11', '2020-04-02'],
'value':[0.438767, 0.422867, 0.657453, 0.603989, 0.717264, 0.737118],
'source':['L8-SR', 'S2A-SR', 'L8-SR', 'S2B-SR', 'S2B-SR', 'L8-SR']}
# Convert the dictionary above to a DataFrame
df = pd.DataFrame(pandasdata_dict)
# display data frame
df
# Convert date field to datetime
df["date"] = pd.to_datetime(df["date"])
# Once the conversion is done, group the DataFrame by the date field
df.groupby([df['date'].dt.date])
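The groupby above only builds the grouped object; one way to finish the selection according to the ranking in the question is sketched below (the rank mapping and helper column are assumptions, not part of the original code):
# Map each source to its rank (lower = preferred); sources missing from the
# ranking get the lowest priority, so they are kept only when nothing better
# exists for that date.
rank = {'L8-SR': 0, 'S2B-SR': 1, 'GP6_r': 2}
df['rank'] = df['source'].map(rank).fillna(len(rank))
# Within each date keep the best-ranked row, then drop the helper column.
result = df.sort_values(['date', 'rank']).groupby('date').head(1).drop(columns='rank')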
I would like to calculate the mean of a timedelta series, excluding 00:00:00 values.
This is my time series:
1 00:28:00
3 01:57:00
5 00:00:00
7 01:27:00
9 00:00:00
11 01:30:00
I tried to replace rows 5 and 9 with NaN and then apply .mean() to the series; mean() doesn't include NaN values, so that would give me the desired result.
How can I do that?
I'm trying:
`df["time_column"].replace('0 days 00:00:00', np.NaN).mean()`
but no values are replaced
One idea is to use a zero Timedelta object:
out = df["time_column"].replace(pd.Timedelta(0), np.nan).mean()
print (out)
0 days 01:20:30
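A variant that skips the replace step and simply masks out the zero durations (a sketch of the same idea):
# Average only the non-zero durations.
s = df["time_column"]
print(s[s != pd.Timedelta(0)].mean())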
I'm trying to aggregate from the end of a date range instead of from the beginning. Despite the fact that I would think that adding closed='right' to the grouper would solve the issue, it doesn't. Please let me know how I can achieve my desired output shown at the bottom, thanks.
import pandas as pd
df = pd.DataFrame(columns=['date','number'])
df['date'] = pd.date_range('1/1/2000', periods=8, freq='T')
df['number'] = pd.Series(range(8))
df
date number
0 2000-01-01 00:00:00 0
1 2000-01-01 00:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 00:07:00 7
With the groupby and aggregation of the date I get the following. Since I have 8 dates and I'm grouping by periods of 3, it must choose whether to truncate the earliest date group or the latest date group, and it truncates the latest one (the last group has a count of 2):
df.groupby(pd.Grouper(key='date', freq='3T')).agg('count')
date number
2000-01-01 00:00:00 3
2000-01-01 00:03:00 3
2000-01-01 00:06:00 2
My desired output is to instead truncate the earliest date group:
date number
2000-01-01 00:00:00 2
2000-01-01 00:02:00 3
2000-01-01 00:05:00 3
Please let me know how this can be achieved, I'm hopeful there's just a parameter that can be set that I've overlooked. Note that this is similar to this question, but my question is specific to the date truncation.
EDIT: To reframe the question (thanks Alexdor) the default behavior in pandas is to bin by period [0, 3), [3, 6), [6, 9) but instead I'd like to bin by (-1, 2], (2, 5], (5, 8]
It seems like the grouper builds up the bins starting from the oldest time in the series you pass to it. I couldn't see a way to make it build the bins up from the newest time, but it's fairly easy to construct the bins from scratch.
freq = '3min'
minTime = df.date.min()
maxTime = df.date.max()
deltaT = pd.Timedelta(freq)
minTime -= deltaT - (maxTime - minTime) % deltaT # adjust min time to start of first bin
r = pd.date_range(start=minTime, end=maxTime, freq=freq)
df.groupby(pd.cut(df["date"], r)).agg('count')
Gives
date date number
(1999-12-31 23:58:00, 2000-01-01 00:01:00] 2 2
(2000-01-01 00:01:00, 2000-01-01 00:04:00] 3 3
(2000-01-01 00:04:00, 2000-01-01 00:07:00] 3 3
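On pandas 1.3 or later the Grouper can also be anchored to the end of the data, which may avoid building the bins by hand; a sketch, assuming that version (the group labels are then bin edges rather than the first timestamp of each group):
# Anchor the bins to the last timestamp so the incomplete bin falls at the
# start of the range instead of the end.
out = df.groupby(pd.Grouper(key='date', freq='3T', origin='end', closed='right')).agg('count')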
This is one hack which lets you group by a constant group size, counting bottom up.
from itertools import chain
def grouper(x, k=3):
    # size of the (possibly smaller) first group, then blocks of k to the end
    n = len(x.index)
    r = n % k
    return list(chain.from_iterable([[0] * r] + [[i] * k for i in range(1, n // k + 1)]))
df['grouper'] = grouper(df, 3)
res = df.groupby('grouper', as_index=False)\
        .agg({'date': 'first', 'number': 'count'})\
        .drop(columns='grouper')
# date number
# 0 2000-01-01 00:00:00 2
# 1 2000-01-01 00:02:00 3
# 2 2000-01-01 00:05:00 3
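A more compact variant of the same bottom-up idea, sketched with a plain array of group labels:
import numpy as np
# Label rows in blocks of 3 counted from the bottom, aggregate, and restore
# chronological order (the groups come out newest-first otherwise).
labels = np.arange(len(df))[::-1] // 3
res = (df.groupby(labels)
         .agg({'date': 'first', 'number': 'count'})
         .sort_values('date')
         .reset_index(drop=True))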
Ok, this is a question in two steps.
Step one: I have a pandas DataFrame like this:
date time value
0 20100201 0 12
1 20100201 6 22
2 20100201 12 45
3 20100201 18 13
4 20100202 0 54
5 20100202 6 12
6 20100202 12 18
7 20100202 18 17
8 20100203 6 12
...
As you can see, for instance between rows 7 and 8 there is data missing (in this case, the value for the 0 time). Sometimes, several hours or even a full day could be missing.
I would like to convert this DataFrame to the format like this:
value
2010-02-01 00:00:00 12
2010-02-01 06:00:00 22
2010-02-01 12:00:00 45
2010-02-01 18:00:00 13
2010-02-02 00:00:00 54
2010-02-02 06:00:00 12
2010-02-02 12:00:00 18
2010-02-02 18:00:00 17
...
I want this because I have another DataFrame (let's call it the "reliable DataFrame") in this format that I am sure has no missing values.
EDIT 2016/07/28: Studying the problem it seems there were also duplicated data in the dataframe. See the solution to also address this problem.
Step two: With the previous step done I want to compare row by row the index in the "reliable DataFrame" with the index in the DataFrame with missing values.
I want to add a row with the value NaN where there are missing entries in the first DataFrame. The final check would be to be sure that both DataFrames have the same dimension.
I know this is a long question, but I am stuck. I have tried to manage the dates with dateutil.parser.parse and to use set_index to set a new index, but I get lots of errors in the code. I am afraid this is clearly above my pandas level.
Thank you in advance.
Step 1 Answer
df['DateTime'] = (df['date'].astype(str) + ' ' + df['time'].astype(str) +':'+'00'+':'+'00').apply(lambda x: pd.to_datetime(str(x)))
df.set_index('DateTime', drop=True, append=False, inplace=True, verify_integrity=False)
df.drop(['date', 'time'], axis=1, level=None, inplace=True, errors='raise')
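A more compact, vectorized variant of the conversion in the first line above (a sketch, assuming date is YYYYMMDD and time is the hour as an integer):
# Build the timestamps in one call instead of a row-wise apply.
df['DateTime'] = pd.to_datetime(df['date'].astype(str) + df['time'].astype(str).str.zfill(2), format='%Y%m%d%H')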
If there are duplicates these can be removed by:
df = df.reset_index().drop_duplicates(subset='DateTime',keep='last').set_index('DateTime')
Step 2
df_join = df.join(df1, how='outer', lsuffix='x', sort=True)
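If the goal of step two is just to insert NaN rows where the reliable frame has timestamps that df lacks, reindexing against the reliable index also works (a sketch, with df1 standing for the reliable DataFrame as above):
# Align df to the reliable index; timestamps missing from df become NaN rows.
df_aligned = df.reindex(df1.index)
# Final check: both frames now have the same number of rows.
assert len(df_aligned) == len(df1)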
I have a csv file that I am trying to import into pandas.
There are two columns of interest, date and hour, and they are the first two columns.
E.g.
date,hour,...
10-1-2013,0,
10-1-2013,0,
10-1-2013,0,
10-1-2013,1,
10-1-2013,1,
How do I import this using pandas so that hour and date are combined, or is that best done after the initial import?
df = pd.read_csv('bingads.csv', sep=',')
If I do the initial import how do I combine the two as a date and then delete the hour?
Thanks
Define your own date_parser:
In [291]: from dateutil.parser import parse
In [292]: import datetime as dt
In [293]: def date_parser(x):
   .....:     date, hour = x.split(' ')
   .....:     return parse(date) + dt.timedelta(0, 3600*int(hour))
In [298]: pd.read_csv('test.csv', parse_dates=[[0,1]], date_parser=date_parser)
Out[298]:
date_hour a b c
0 2013-10-01 00:00:00 1 1 1
1 2013-10-01 00:00:00 2 2 2
2 2013-10-01 00:00:00 3 3 3
3 2013-10-01 01:00:00 4 4 4
4 2013-10-01 01:00:00 5 5 5
Use read_csv instead of read_clipboard to handle your actual data:
>>> df = pd.read_clipboard(sep=',')
>>> df['date'] = pd.to_datetime(df.date) + pd.to_timedelta(df.hour, unit='D')/24
>>> del df['hour']
>>> df
date ...
0 2013-10-01 00:00:00 NaN
1 2013-10-01 00:00:00 NaN
2 2013-10-01 00:00:00 NaN
3 2013-10-01 01:00:00 NaN
4 2013-10-01 01:00:00 NaN
[5 rows x 2 columns]
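Equivalently, the hour offset can be built directly with unit='h' instead of dividing a one-day timedelta by 24 (a small sketch of the same idea):
# Same combination, with the hours expressed directly as a timedelta.
df['date'] = pd.to_datetime(df.date) + pd.to_timedelta(df.hour, unit='h')
del df['hour']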
Take a look at the parse_dates argument which pandas.read_csv accepts.
You can do something like:
df = pandas.read_csv('some.csv', parse_dates=True)
# in which case pandas will try to parse the index as dates
df = pandas.read_csv('some.csv', parse_dates=[i,j,k])
# in which case pandas will parse the i, j and kth columns for dates
Since you are only using the two columns from the csv file and combining those into one, I would squeeze it into a series of datetime objects like so:
import pandas as pd
from io import StringIO
import datetime as dt
txt='''\
date,hour,A,B
10-1-2013,0,1,6
10-1-2013,0,2,7
10-1-2013,0,3,8
10-1-2013,1,4,9
10-1-2013,1,5,10'''
def date_parser(date, hour):
    dates = []
    for ed, eh in zip(date, hour):
        month, day, year = list(map(int, ed.split('-')))
        hour = int(eh)
        dates.append(dt.datetime(year, month, day, hour))
    return dates
p=pd.read_csv(StringIO(txt), usecols=[0,1],
parse_dates=[[0,1]], date_parser=date_parser, squeeze=True)
print(p)
Prints:
0 2013-10-01 00:00:00
1 2013-10-01 00:00:00
2 2013-10-01 00:00:00
3 2013-10-01 01:00:00
4 2013-10-01 01:00:00
Name: date_hour, dtype: datetime64[ns]