Python Pandas Groupby Dropping DateTime Columns

I am having some trouble using groupby.median() and groupby.mean() on a DataFrame containing intermittent NaT values. Specifically, I have several columns in a dataset calculating various time differences based on other columns. In some instances no time difference exists, which produces a NaT value, as in the example below:
Group  Category  Start Time    End Time      Time Diff
A      1         08:00:00.000  08:00:00.500  .500
B      1         09:00:00.000  09:02:00.000  2:00.000
B      1         09:00:00.000  NaT           NaT
A      2         09:00:00.000  09:02:00.000  2:00.000
A      2         09:00:00.000  09:01:00.000  1:00.000
A      2         08:00:00.000  08:00:01.500  1.500
Any time I run df.groupby(['Group', 'Category']).median() or .mean(), any column that contains NaT is dropped from the result set. I've attempted a fillna, but the NaTs seemed to remain. As an added point of context, this script worked correctly in an older version of Anaconda Python (1.x). I recently upgraded my work computer to 2.0.1, at which point this issue began cropping up.
EDIT: I will leave my thoughts about NaT's up above in the event that they are a factor, but upon further review it seems that my problem actually lies in the fact that these columns are timedelta64s. Does anyone know of any workarounds to obtain mean/median on timedeltas?
Thanks very much for any insight you may have!

After some further googling/experimentation I confirmed that the issue appeared to be related to columns which were timedelta64. In order to perform pd.groupby on these columns I first converted them to floats like so:
df['End Time'] = df['End Time'].astype('timedelta64[ms]') / 86400000
There may be a more elegant solution to this but this allowed me to move forward with my analysis.
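For example, one possibly more elegant route (a minimal sketch, using made-up data shaped like the example above) is to convert the timedelta column to seconds with .dt.total_seconds(); NaT becomes NaN, which mean/median skip by default:
import pandas as pd

# Made-up frame mirroring the example above
df = pd.DataFrame({
    'Group': ['A', 'B', 'B', 'A', 'A', 'A'],
    'Category': [1, 1, 1, 2, 2, 2],
    'Time Diff': pd.to_timedelta(['0.5s', '120s', None, '120s', '60s', '1.5s']),
})

# total_seconds() yields floats; NaT turns into NaN and is skipped by mean/median
df['Time Diff (s)'] = df['Time Diff'].dt.total_seconds()
print(df.groupby(['Group', 'Category'])['Time Diff (s)'].median())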
Thanks!

Related

Pandas is Reading .xlsx Column as Datetime rather than float

I obtained an Excel file with complicated formatting for some cells. Here is a sample:
The "USDC Amount USDC" column has formatting of "General" for the header cell, and the following for cells C2 through C6:
I need to read this column into pandas as a float value. However, when I use
import pandas
df = pandas.read_excel('Book1.xlsx')
print(['USDC Amount USDC'])
print(df['USDC Amount USDC'])
I get
['USDC Amount USDC']
0 NaT
1 1927-06-05 05:38:32.726400
2 1872-07-25 18:21:27.273600
3 NaT
4 NaT
Name: USDC Amount USDC, dtype: datetime64[ns]
I do not want these as datetimes, I want them as floats! If I remove the complicated formatting in the Excel document (change it to "general" in column C), they are read in as float values, like this, which is what I want:
['USDC Amount USDC']
0 NaN
1 10018.235101
2 -10018.235101
3 NaN
4 NaN
Name: USDC Amount USDC, dtype: float64
The problem is that I have to download these Excel documents on a regular basis, and cannot modify them from the source. I have to get Pandas to understand (or ignore) this formatting and interpret the value as a float on its own.
I'm on Pandas 1.4.4, Windows 10, and Python 3.8. Any idea how to fix this? I cannot change the source Excel file, all the processing must be done in the Python script.
EDIT:
I added the sample Excel document in my comment below to download for reference. Also, here are some other package versions in case these matter:
openpyxl==3.0.3
xlrd==1.2.0
XlsxWriter==1.2.8
It appears that updating OpenPyXL from 3.0.3 to 3.1.0 resolved this issue. A quick glance at the changelog (https://openpyxl.readthedocs.io/en/stable/changes.html) suggests it is related to bugfix 1413 or 1500.
You could use the dtype argument of read_excel, along the lines of
import numpy as np
df = pandas.read_excel('Book1.xlsx', dtype={'USDC Amount USDC':np.float64})
but that comes with some issues. In particular, your source data contains characters that can't be cast to a float. Your next best options are the object or string dtypes. So instead of :np.float64, you would use something like :"string", resulting in
df = pandas.read_excel('Book1.xlsx', dtype={'USDC Amount USDC':"string"})
After that, you need to extract the numbers from the column. Here's a resource that could help you get an idea of the overall process, although the exact method of doing so is up to you.
Finally, you would want to convert the now numbers-only column to floats. You can do that with the built-in casting:
df["numbers_only"] = df["numbers_only"].astype(np.float64)

What is the advantage of using mode() to replace nans in columns with Dtype=object?

I am currently learning Machine Learning and I came across a tutorial where, when a column is of dtype object, the NaNs are replaced by the column's mode.
The particular line where this is done is:
test_df['MSZoning']=test_df['MSZoning'].fillna(test_df['MSZoning'].mode()[0])
When checking the values of MSZoning with
test_df['MSZoning'].value_counts()
The output is
RL 1114
RM 242
FV 74
C (all) 15
RH 10
After taking the mode and filling the NaNs, the output seems to be the same.
It is not clear to me what mode() is actually doing here. I was wondering if someone could help me with this matter.
The notebook of this data: https://github.com/krishnaik06/Kaggle-Competitions/blob/master/Advance%20House%20PRice%20PRediction/HandleTestData.ipynb
Not sure why this is not working for you. fillna with the mode should fill the missing values with the most frequently occurring value in the column, which in this case is 'RL'. Are you sure the column has missing values?
I was working with this data recently and did not find any missing values in this particular column.
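To make the mechanics concrete, here is a tiny sketch with made-up data showing what mode() and mode()[0] return:
import pandas as pd

s = pd.Series(['RL', 'RM', 'RL', None, 'FV', 'RL'])

print(s.mode())               # a Series of the most frequent value(s); here a single entry, 'RL'
print(s.mode()[0])            # indexing with [0] picks that first (most common) value
print(s.fillna(s.mode()[0]))  # the missing entry is replaced by 'RL'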

Pandas datetime64 problem (datetime introduces spikes in data)

This is my first question on stackoverflow, so be kind :)
I work with imported CSV files and pandas, and I really like pandas' datetime capabilities for working with and filtering DataFrames. But I have serious problems plotting the data in a neat way when using dates as datetime64, whether I use pandas plots or seaborn plots.
my csv looks like this:
date time Flux_ConNT_C Flux_ConB1 Flux_ConB2 Flux_ConB3 Flux_ConB4 Flux_ConB4
0 01.01.2015 00:30 2.552032129 2.193558665 1.0093326 1.013124869 1.159512896 1.159512896
1 01.01.2015 01:00 2.553308464 2.195533756 1.01003938 1.013935693 1.160672989 1.160672989
2 01.01.2015 01:30 2.554585438 2.197510626 1.010746655 1.014747166 1.161834243 1.161834243
3 01.01.2015 02:00 2.55586305 2.199489276 1.011454426 1.015559289 1.162996658 1.162996658
4 01.01.2015 02:30 2.557141301 2.201469707 1.012162692 1.016372061 1.164160236 1.164160236
when I plot the data with
df.plot(figsize=(15,8))
my output looks right (the first plot),
but when I change the "date time" column to datetime64 with
df['date time'] = pd.to_datetime(df['date time'])
and use the same code to plot, the data is plotted with spikes and is not usable (the second plot).
There seems to be a problem with matplotlib, but I can't find any suggestion other than putting register_matplotlib_converters() before the plot, which doesn't change anything.
I'm working with Spyder IDE and Python 3.7 and all libraries are up to date.
Thanks for your help!
Your problem is no miracle; it's simply not reproducible.
Are you sure your csv doesn't have a header for the first index column 0..4?
Are you sure in the csv column 8 is a duplicate of column 7?
How did you actually import this csv and construct your dataframe?
The first plot only works after replacing the range index 0..4 with the "date time" column. What other transformations did you apply to the dataframe before calling the plot method?
Your to_datetime conversion only works on a column, not an index. Why don't you share all the code that you've been using?
In the 2 plots the first 5 rows don't differ. Why don't you share the data rows that are actually different in the 2 plots?
I will give you credit for trying to abstract the problem properly. Unfortunately, you omitted important information. Based on the limited information you've been showing here, there is no problem at all.
To make my point clear: What you observed is not related to the datetime64[ns] conversion, but to something probably very simple that you didn't consider important enough to share with us.
Have a look at How to create a Minimal, Reproducible Example. The idea is: when you're able to prepare your problem in a reproducible way, you'll probably be able to solve it yourself.
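For what it's worth, a minimal self-contained sketch of the conversion described in the question (made-up numbers in the shape of the csv above) plots without any spikes, which supports the point that to_datetime itself is not the culprit:
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data in the shape of the question's csv
df = pd.DataFrame({
    'date time': ['01.01.2015 00:30', '01.01.2015 01:00', '01.01.2015 01:30'],
    'Flux_ConNT_C': [2.552, 2.553, 2.554],
    'Flux_ConB1': [2.193, 2.195, 2.197],
})
df['date time'] = pd.to_datetime(df['date time'], format='%d.%m.%Y %H:%M')

# Plot against the parsed (and sorted) datetime index
df.set_index('date time').sort_index().plot(figsize=(15, 8))
plt.show()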

Why do I get NonExistentTimeError in python for time stamps between 12 and 1am on 2017-03-12

I get this error when trying to append two Pandas DFs together in a for loop:
Aggdata=Aggdata.append(Newdata)
This is the full error:
File "pandas\tslib.pyx", line 4096, in pandas.tslib.tz_localize_to_utc (pandas
\tslib.c:69713)
pytz.exceptions.NonExistentTimeError: 2017-03-12 02:01:24
However, in my files, I do not have such a time stamp, but I do have ones like 03/12/17 00:45:26 or 03/12/17 00:01:24. Where it is 2 hours before daylight savings. And if I manually delete the offending row, I get that same error for the next row with times between 12 and 1am on the 12th of March.
My original date/time column has no TZ info, but I calculate another column in EST, before the concatenation and localize it to EST, with time with TZ information:
data['EST_DateTimeStamp']=pd.DatetimeIndex(pd.to_datetime(data['myDate'])).tz_localize('US/Eastern').tz_convert('US/Eastern')
Doing some research here, I understand that 2 to 3am on the 12th should produce such an error, but why midnight to 1am? So am I localizing it incorrectly? And then why is the error on the append line, and not the localization line?
I was able to reproduce this behavior in a very simple MCVE, saved here:
https://codeshare.io/GLjrLe
It absolutely boggles my mind that the error is raised on the third append, and only if the next 3 appends follow. In other words, if I comment out the last 3 copies of appends, it works fine. I can't imagine what is happening.
Thank you for reading.
In case someone else may still find this helpful:
Talking about it with #hashcode55, the solution was to upgrade Pandas on my server, as this was likely a bug in my previous version of that module.
The problem seems to occur at the daylight saving switch: there are local times that do not exist, once per year. In the opposite direction there will be duplicate times.
This could be caused by, say, your input dates being converted from UTC to "local time" by adding a fixed offset. When you try to localize these, you will hit nonexistent times during that hour (or 30 minutes if you are in Adelaide).
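A small sketch with hypothetical timestamps reproduces the error and shows the newer tz_localize options for handling the missing hour:
import pandas as pd

# 02:xx on 2017-03-12 does not exist in US/Eastern (clocks jump from 02:00 to 03:00)
ts = pd.Series(pd.to_datetime(['2017-03-12 01:30:00', '2017-03-12 02:30:00']))

try:
    ts.dt.tz_localize('US/Eastern')
except Exception as exc:
    print(type(exc).__name__, exc)   # NonExistentTimeError: 2017-03-12 02:30:00

# Recent pandas versions can shift or NaT-out the nonexistent times instead of raising
print(ts.dt.tz_localize('US/Eastern', nonexistent='shift_forward'))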

Apply a function to each row python

I am trying to convert from UTC time to LocaleTime in my dataframe. I have a dictionary where I store the number of hours I need to shift for each country code. So, for example, if I have df['CountryCode'][0]='AU' and df['UTCTime'][0]=2016-08-12 08:01:00, I want to get df['LocaleTime'][0]=2016-08-12 19:01:00, which is
df['UTCTime'][0]+datetime.timedelta(hours=dateDic[df['CountryCode'][0]])
I have tried to do it with a for loop, but since I have more than 1 million rows it's not efficient. I have looked into the apply function, but I can't seem to get it to take inputs from two different columns.
Can anyone help me?
Without a more concrete example it's difficult, but try this:
pd.to_timedelta(df.CountryCode.map(dateDict), 'h') + df.UTCTime
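A quick end-to-end illustration of that approach, using a made-up offset dictionary in place of the question's dateDic:
import pandas as pd

dateDic = {'AU': 11, 'GB': 1}   # hypothetical hour offsets per country code
df = pd.DataFrame({
    'CountryCode': ['AU', 'GB'],
    'UTCTime': pd.to_datetime(['2016-08-12 08:01:00', '2016-08-12 08:01:00']),
})

# Map each code to its hour offset, turn the offsets into timedeltas, and add them vectorised
df['LocaleTime'] = pd.to_timedelta(df['CountryCode'].map(dateDic), unit='h') + df['UTCTime']
print(df)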
