I want to group by two variables ['CIN','calendar'] and return the row of each group where the column MCelig is largest within that group. Multiple rows may share the max value, but I only want one row.
For example:
AidCode CIN MCelig calendar
0 None 1e 1 2014-03-08
1 01 1e 2 2014-03-08
2 01 1e 3 2014-05-08
3 None 2e 4 2014-06-08
4 01 2e 5 2014-06-08
Since the first two rows form a group, I want the row where MCelig = 2.
I came up with this line:
test = dfx.groupby(['CIN','calendar'], group_keys=False).apply(lambda x: x.loc[x.MCelig.idxmax()])
and it seemed to work, except that when a column has 'None' or np.nan for all the values in a group, that column is converted to a datetime! See the example below and watch AidCode go from an object to a date.
import datetime as DT
import numpy as np
import pandas as pd

d = {'CIN': pd.Series(['1e', '1e', '1e', '2e', '2e']),
     'AidCode': pd.Series([np.nan, '01', '01', np.nan, '01']),
     'calendar': pd.Series([DT.datetime(2014, 3, 8), DT.datetime(2014, 3, 8),
                            DT.datetime(2014, 5, 8), DT.datetime(2014, 6, 8),
                            DT.datetime(2014, 6, 8)]),
     'MCelig': pd.Series([1, 2, 3, 4, 5])}
dfx = pd.DataFrame(d)
#testing whether it was just the np.nan that was the problem, it isn't
#dfx = dfx.where((pd.notnull(dfx)), None)
test = dfx.groupby(['CIN','calendar'], group_keys=False).apply(lambda x: x.loc[x.MCelig.idxmax()])
Output:
Out[820]:
AidCode CIN MCelig calendar
CIN calendar
1e 2014-03-08 2015-01-01 1e 2 2014-03-08
2014-05-08 2015-01-01 1e 3 2014-05-08
2e 2014-06-08 2015-01-01 2e 5 2014-06-08
UPDATE:
I just figured out this simple solution:
x = dfx.sort_values(['CIN', 'calendar', 'MCelig']).groupby(['CIN', 'calendar'], as_index=False).last(); x
Since it works, I chose it for simplicity's sake.
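For reference, a hedged alternative sketch (not from the original post) that also returns exactly one row per group without apply, and so sidesteps the date-guessing issue discussed in the answer below:
# take the index label of each group's (first) maximum and select those rows directly
one_per_group = dfx.loc[dfx.groupby(['CIN', 'calendar'])['MCelig'].idxmax()]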
Pandas attempts to be extra helpful by recognizing columns that look like dates and converting the column to datetime64 dtype. It's being overly aggressive here.
A workaround would be to use transform to generate a boolean mask for each group which selects maximum rows:
def onemax(x):
    # mark only the position of the group's (first) maximum
    mask = np.zeros(len(x), dtype='bool')
    idx = np.argmax(x.values)
    mask[idx] = True
    return mask
dfx.loc[dfx.groupby(['CIN','calendar'])['MCelig'].transform(onemax).astype(bool)]
yields
AidCode CIN MCelig calendar
1 01 1e 2 2014-03-08
2 01 1e 3 2014-05-08
4 01 2e 5 2014-06-08
Technical detail: when groupby-apply glues the individual DataFrames (returned by the applied function) back together into one DataFrame, Pandas tries to guess whether columns with object dtype are date-like, and if so, converts them to an actual date dtype. If the values are strings, it tries to parse them as dates using dateutil.parser.
For better or for worse, dateutil.parser interprets '01' as a date:
In [37]: import dateutil.parser as DP
In [38]: DP.parse('01')
Out[38]: datetime.datetime(2015, 1, 1, 0, 0)
This causes Pandas to attempt to convert the entire AidCode column into dates. Since no error occurs, it thinks it just helped you out :)
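A quick way to see this happening with the example frame above is to compare dtypes before and after the groupby-apply round trip (a small check, not part of the original answer):
print(dfx['AidCode'].dtype)   # object
print(test['AidCode'].dtype)  # datetime64[ns] after the apply glues the groups back together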
Related
I have been working on a dataframe where one of the columns (flight_time) contains flight durations. The strings come in 3 different formats, for example:
"07 h 05 m"
"13h 55m"
"2h 23m"
I would like to change them all to HH:MM format and finally change the data type from object to time.
Can somebody tell me how to do this?
It's not possible to have a time dtype. You can have a datetime64 (pd.DatetimeIndex) or a timedelta64 (pd.TimedeltaIndex). In your case, I think it's better to have a TimedeltaIndex so you can use the pd.to_timedelta function:
df['flight_time2'] = pd.to_timedelta(df['flight_time'])
print(df)
# Output
flight_time flight_time2
0 07 h 05 m 0 days 07:05:00
1 13h 55m 0 days 13:55:00
2 2h 23m 0 days 02:23:00
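Once flight_time2 is a timedelta64 column, the .dt accessor works on it; a minimal sketch (the minutes column name is just illustrative, not from the original answer):
df['minutes'] = df['flight_time2'].dt.total_seconds() / 60               # e.g. 425.0 for "07 h 05 m"
hours_minutes = df['flight_time2'].dt.components[['hours', 'minutes']]   # per-row breakdown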
If you want individual datetime.time objects instead, use:
df['flight_time2'] = pd.to_datetime(df['flight_time'].str.findall(r'\d+')
                                    .str.join(':')).dt.time
print(df)
# Output
flight_time flight_time2
0 07 h 05 m 07:05:00
1 13h 55m 13:55:00
2 2h 23m 02:23:00
In this case, flight_time2 has still object dtype:
>>> df.dtypes
flight_time object
flight_time2 object
dtype: object
But each value is an instance of datetime.time:
>>> df.loc[0, 'flight_time2']
datetime.time(7, 5)
In the first case you can use vectorized methods, while in the second that is not possible. Furthermore, you lose the dt accessor.
I have the following pandas dataframe df:
C1 C2 C3
Date
2000-01-01 00:00:00 2 175 160
2000-01-01 01:00:00 4 192 164
2000-01-01 02:00:00 6 210 189
2000-01-01 03:00:00 8 217 199
2000-01-01 04:00:00 10 176 158
from which I need to get the value of C1, C2 and C3 for a specific datetime:
import datetime
my_specific_time = str(datetime.datetime(2000, 1, 1, 1, 0, 0))
print(df['C1'].loc[my_specific_time]) # prints 4
The problem is that I can only get values for the dates stored in the df. For example, getting the value of C1 for time 2000-01-01 01:30:00 is not possible unless I resample my dataframe:
upsampled = df.resample('30min').ffill()
my_specific_time = str(datetime.datetime(2000, 1, 1, 1, 30, 0))
print(upsampled['C1'].loc[my_specific_time]) # again prints 4
Note that the value of C1 anywhere in the span between 2000-01-01 01:00:00 and 2000-01-01 02:00:00 should be 4. The problem is that my_specific_time can be any random time, so I would need to resample df at a small enough frequency to cover every possible value, which I don't think is the best solution to this problem.
While looking for possible solutions I came across time spans in pandas, but I did not quite understand how I could use them for my problem.
Use the asof method (it is available on both DataFrame and Series):
print(df['C1'].asof(my_specific_time))
4
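A self-contained sketch of the same call, reconstructing an hourly frame like the one in the question (the data here is illustrative):
import datetime
import pandas as pd

idx = pd.date_range('2000-01-01', periods=5, freq='H', name='Date')
df = pd.DataFrame({'C1': [2, 4, 6, 8, 10]}, index=idx)

# last known value at or before the requested timestamp
print(df['C1'].asof(datetime.datetime(2000, 1, 1, 1, 30)))  # 4
# the same idea for every column at once (returns a Series)
print(df.asof(datetime.datetime(2000, 1, 1, 1, 30)))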
Ok, this is a question in two steps.
Step one: I have a pandas DataFrame like this:
date time value
0 20100201 0 12
1 20100201 6 22
2 20100201 12 45
3 20100201 18 13
4 20100202 0 54
5 20100202 6 12
6 20100202 12 18
7 20100202 18 17
8 20100203 6 12
...
As you can see, for instance between rows 7 and 8 there is data missing (in this case, the value for the 0 time). Sometimes, several hours or even a full day could be missing.
I would like to convert this DataFrame to the format like this:
value
2010-02-01 00:00:00 12
2010-02-01 06:00:00 22
2010-02-01 12:00:00 45
2010-02-01 18:00:00 13
2010-02-02 00:00:00 54
2010-02-02 06:00:00 12
2010-02-02 12:00:00 18
2010-02-02 18:00:00 17
...
I want this because I have another DataFrame (let's call it the "reliable DataFrame") in this format, which I am sure has no missing values.
EDIT 2016/07/28: Studying the problem it seems there were also duplicated data in the dataframe. See the solution to also address this problem.
Step two: With the previous step done I want to compare row by row the index in the "reliable DataFrame" with the index in the DataFrame with missing values.
I want to add a row with the value NaN where there are missing entries in the first DataFrame. The final check would be to be sure that both DataFrames have the same dimension.
I know this is a long question, but I am stuck. I have tried to handle the dates with dateutil.parser.parse and to use set_index to set a new index, but I keep getting errors in the code. I am afraid this is clearly above my pandas level.
Thank you in advance.
Step 1 Answer
df['DateTime'] = pd.to_datetime(df['date'].astype(str) + ' ' + df['time'].astype(str) + ':00:00')
df.set_index('DateTime', drop=True, inplace=True)
df.drop(['date', 'time'], axis=1, inplace=True)
If there are duplicates these can be removed by:
df = df.reset_index().drop_duplicates(subset='DateTime',keep='last').set_index('DateTime')
Step 2
df_join = df.join(df1, how='outer', lsuffix='x', sort=True)
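If the goal is only to line the gappy frame up against the reliable one (same index, NaN rows where timestamps are missing), a hedged alternative sketch, assuming df1 is the reliable DataFrame as in the join above:
# align df to the reliable frame's index; missing timestamps become NaN rows
df_aligned = df.reindex(df1.index)
assert len(df_aligned) == len(df1)  # both frames now have the same dimension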
I want to select all rows with a particular index. My DataFrame look like this:
>>> df
Code
Patient Date
1 2003-01-12 00:00:00 a
2003-02-13 00:00:00 b
2003-02-14 00:00:00 ba
2 2001-1-17 22:00:00 z
2002-1-21 00:00:00 d
2003-1-21 00:00:00 a
2005-12-1 00:00:00 ba
Selecting one of the first (Patient) index works:
>>> df.loc[1]
Code
Patient Date
1 2003-01-12 00:00:00 a
2003-02-13 00:00:00 b
2003-02-14 00:00:00 ba
But selecting multiple of the first (Patient) index does not:
>>> df.loc[[1, 2]]
Code
Patient Date
1 2003-01-12 00:00:00 a
2 2001-1-17 22:00:00 z
However, I would like to get the entire dataframe (as the result would be with [1, 1, 1, 2], i.e. the original dataframe).
When using a single index it works fine. For example:
>>> df.reset_index().set_index("Patient").loc[[1, 2]]
Date Code
Patient
1 2003-01-12 00:00:00 a
2003-02-13 00:00:00 b
2003-02-14 00:00:00 ba
2 2001-1-17 22:00:00 z
2002-1-21 00:00:00 d
2003-1-21 00:00:00 a
2005-12-1 00:00:00 ba
TL;DR Why do I have to repeat the index when using multiple indexes but not when I use a single index?
EDIT: Apparently it can be done with something like:
>>> df.loc[df.index.get_level_values("Patient").isin([1, 2])]
But this seems quite dirty to me. Is this the way, or is there another, better way?
For Pandas version 0.14 the recommended way, according to the above comment, is:
df.loc[([1,2],),:]
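A self-contained sketch with a made-up two-patient frame, showing that both the tuple form and pd.IndexSlice return the full cross-section (the data here is illustrative):
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, '2003-01-12'), (1, '2003-02-13'), (2, '2001-01-17'), (2, '2002-01-21')],
    names=['Patient', 'Date'])
df = pd.DataFrame({'Code': ['a', 'b', 'z', 'd']}, index=idx).sort_index()  # sort for clean MultiIndex slicing

print(df.loc[([1, 2],), :])                 # all Date rows for patients 1 and 2
print(df.loc[pd.IndexSlice[[1, 2], :], :])  # equivalent, via IndexSlice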
I have an hourly dataframe in the following format over several years:
Date/Time Value
01.03.2010 00:00:00 60
01.03.2010 01:00:00 50
01.03.2010 02:00:00 52
01.03.2010 03:00:00 49
.
.
.
31.12.2013 23:00:00 77
I would like to average the data so I can get the average of hour 0, hour 1, ..., hour 23 for each of the years.
So the output should look somehow like this:
Year Hour Avg
2010 00 63
2010 01 55
2010 02 50
.
.
.
2013 22 71
2013 23 80
Does anyone know how to obtain this in pandas?
Note: Now that Series have the dt accessor it's less important that date is the index, though Date/Time still needs to be a datetime64.
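If Date/Time is still a plain string column, one hedged way to get it to datetime64 first (assuming the day-first "01.03.2010 00:00:00" format shown in the question):
import pandas as pd

# assumption: timestamps are day-first strings like "01.03.2010 00:00:00"
df["Date/Time"] = pd.to_datetime(df["Date/Time"], format="%d.%m.%Y %H:%M:%S")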
Update: You can do the groupby more directly (without the lambda):
In [21]: df.groupby([df["Date/Time"].dt.year, df["Date/Time"].dt.hour]).mean()
Out[21]:
Value
Date/Time Date/Time
2010 0 60
1 50
2 52
3 49
In [22]: res = df.groupby([df["Date/Time"].dt.year, df["Date/Time"].dt.hour]).mean()
In [23]: res.index.names = ["year", "hour"]
In [24]: res
Out[24]:
Value
year hour
2010 0 60
1 50
2 52
3 49
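If you want the flat Year / Hour / Avg layout from the question rather than a MultiIndex, a small follow-up sketch (the column names are just relabelled to match the question):
out = res.reset_index().rename(columns={'year': 'Year', 'hour': 'Hour', 'Value': 'Avg'})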
If it's a datetime64 index you can do:
In [31]: df1.groupby([df1.index.year, df1.index.hour]).mean()
Out[31]:
Value
2010 0 60
1 50
2 52
3 49
Old answer (will be slower):
Assuming Date/Time was the index* you can use a mapping function in the groupby:
In [11]: year_hour_means = df1.groupby(lambda x: (x.year, x.hour)).mean()
In [12]: year_hour_means
Out[12]:
Value
(2010, 0) 60
(2010, 1) 50
(2010, 2) 52
(2010, 3) 49
For a more useful index, you could then create a MultiIndex from the tuples:
In [13]: year_hour_means.index = pd.MultiIndex.from_tuples(year_hour_means.index,
names=['year', 'hour'])
In [14]: year_hour_means
Out[14]:
Value
year hour
2010 0 60
1 50
2 52
3 49
* if not, then first use set_index:
df1 = df.set_index('Date/Time')
If your date/time column is in datetime format (see dateutil.parser for automatic parsing options), you can use pandas resample as below:
year_hour_means = df.resample('H').mean()
which will keep your data in the datetime format. This may help you with whatever you are going to be doing with your data down the line.