Pandas datetime64 problem (datetime introduces spikes in data) - python

This is my first question on stackoverflow, so be kind :)
I work with imported csv files and pandas and really liked the pandas datetime possibilities to work and filter dataframes. But i have serious problems with plotting the data in a neat way when using dates as datetime64. Either when using pandas plots or seaborn plots.
my csv looks like this:
date time Flux_ConNT_C Flux_ConB1 Flux_ConB2 Flux_ConB3 Flux_ConB4 Flux_ConB4
0 01.01.2015 00:30 2.552032129 2.193558665 1.0093326 1.013124869 1.159512896 1.159512896
1 01.01.2015 01:00 2.553308464 2.195533756 1.01003938 1.013935693 1.160672989 1.160672989
2 01.01.2015 01:30 2.554585438 2.197510626 1.010746655 1.014747166 1.161834243 1.161834243
3 01.01.2015 02:00 2.55586305 2.199489276 1.011454426 1.015559289 1.162996658 1.162996658
4 01.01.2015 02:30 2.557141301 2.201469707 1.012162692 1.016372061 1.164160236 1.164160236
when I plot the data with
df.plot(figsize=(15,8))
my output is right output
but when I change the "date time" column to 'datetime64 with
df['date time'] = pd.to_datetime(df['date time'])
and use the same code to plot, the data is plotted with these spikes and its not usable false output
There seems to be a problem with matplotlib, but i can't find anything else than putting register_matplotlib_converters() before the plot, which doesn't change anything.
I'm working with Spyder IDE and Python 3.7 and all libraries are up to date.
Thanks for your help!

Your problem is no miracle, it's simply not reproduciable.
Are you sure your csv doesn't have a header for the first index column 0..4?
Are you sure in the csv column 8 is a duplicate of column 7?
How did you actually import this csv and construct your dataframe?
The first plot only works after replaceing the range index 0..4 by the "date time" column. What other transformations did you apply to the dataframe before calling the plot method?
Your to_datetime conversion only works on a column, not an index. Why don't you share all the code that you've been using?
In the 2 plots the first 5 rows don't don't differ. Why don't you share the data rows that are actually different in the 2 plots?
I will give you credit for trying to abstract the problem properly. Unfortunately, you omitted important information. Based on the limited information you've been showing here, there is no problem at all.
To make my point clear: What you observed is not related to the datetime64[ns] conversion, but to something probably very simple that you didn't consider important enough to share with us.
Have a look at How to create a Minimal, Reproducible Example. The idea is: When you're able to prepare your problem in a reproduciable way, you'll probably be ab le to solve it yourself.

Related

Plot Subset of Dataframe without Being Redundant

a bit of a Python newb here. As a beginner it's easy to learn different functions and methods from training classes but it's another thing to learn how to "best" code in Python.
I have a simple scenario where I'm looking to plot a portion of a dataframe spdf. I only want to plot instances where speed is greater than 0 and use datetime as my X-axis. The way I've managed to get the job done seems awfully redundant to me:
ts = pd.Series(spdf[spdf['speed']>0]['speed'].values, index=spdf[spdf['speed']>0]['datetime'])
ts.dropna().plot(title='SP1 over Time')
Is there a better way to plot this data without specifying the subset of my dataframe twice?
You don't need to build a new Series. You can plot using your original df
df[df['col'] > 0]].plot()
In your case:
spdf[spdf['speed'] > 0].dropna().plot(title='SP1 over Time')
I'm not sure what your spdf object is or how it was created. If you'll often need to plot using the 'datetime' column you can set that to be the index of the df.If you're reading the data from a csv you can do this using the parse_dates keyword argument or it you already have the dfyou can change the index using df.set_index('datetime'). You can use df.info() to see what is currently being used at your index and its datatype.

Apply a function to each row python

I am trying to convert from UTC time to LocaleTime in my dataframe. I have a dictionary where I store the number of hours I need to shift for each country code. So for example if I have df['CountryCode'][0]='AU' and I have a df['UTCTime'][0]=2016-08-12 08:01:00 I want to get df['LocaleTime'][0]=2016-08-12 19:01:00 which is
df['UTCTime'][0]+datetime.timedelta(hours=dateDic[df['CountryCode'][0]])
I have tried to do it with a for loop but since I have more than 1 million rows it's not efficient. I have looked into the apply function but I can't seem to be able to put it to take inputs from two different columns.
Can anyone help me?
Without having a more concrete example its difficult but try this:
pd.to_timedelta(df.CountryCode.map(dateDict), 'h') + df.UTCTime

Pandas plotting: How to format datetimeindex?

I am doing a barplot out of a dataframe with a 15min datetimeindex over a couple of years.
Using this code:
df_Vol.resample(
'A',how='sum'
).plot.bar(
title='Sums per year',
style='ggplot',
alpha=0.8
)
Unfortunately the ticks on the X-axis are now shown with the full timestamp like this: 2009-12-31 00:00:00.
I would prefer to Keep the code for plotting short, but I couldn't find an easy way to format the timestamp simply to the year (2009...2016) for the plot.
Can someone help on this?
As it does not seem to be possible to Format the date within the Pandas df.plot(), I have decided to create a new dataframe and plot from it.
The solution below worked for me:
df_Vol_new = df_Vol.resample('A',how='sum')
df_Vol_new.index = df_Vol_new.index.format(formatter=lambda x: x.strftime('%Y'))
ax2 =df_Vol_new.plot.bar(title='Sums per year',stacked=True, style='ggplot', alpha=0.8)
I figured an alternative (better, at least to me) way is to add the following to df_Vol_new.plot() command:
plt.legend(df_Vol_new.index.to_period('A'))
This way you would reserve df_Vol_new.index datetime format while getting better plots at the same time.

Python Pandas Groupby Dropping DateTime Columns

I am having some trouble using groupby.median() and groupby.mean() on a DataFrame containing intermittent NaT values. Specifically, I have several columns in a dataset calculating various time differences based on other columns. In some instances, no time difference exists, causing a NaT value similar to the example below:
Group Category Start Time End Time Time Diff
A 1 08:00:00.000 08:00:00.500 .500
B 1 09:00:00.000 09:02:00.000 2:00.000
B 1 09:00:00.000 NaT NaT
A 2 09:00:00.000 09:02:00.000 2:00.000
A 2 09:00:00.000 09:01:00.000 1:00.000
A 2 08:00:00.000 08:00:01.500 1.500
Any time I run df.groupby(['Group', 'Category'].median() or .mean() any column that contains NaT is dropped from the result set. I've attempted a fillna but NaT's seemed to remain. As an added point of context, this script worked correctly in an older version of Anaconda Python (1.x). I was recently able to upgrade my work computer to 2.0.1 at which point this issue began creeping up.
EDIT: I will leave my thoughts about NaT's up above in the event that they are a factor, but upon further review it seems that my problem actually lies in the fact that these columns are timedelta64s. Does anyone know of any workarounds to obtain mean/median on timedeltas?
Thanks very much for any insight you may have!
After some further googling/experimentation I confirmed that the issue appeared to be related to columns which were timedelta64. In order to perform pd.groupby on these columns I first converted them to floats like so:
df['End Time'] = df['End Time'].astype('timedelta64[ms]') / 86400000
There may be a more elegant solution to this but this allowed me to move forward with my analysis.
Thanks!

How to remove day from datetime index in pandas?

The idea behind this question is, that when I'm working with full datetime tags and data from different days, I sometimes want to compare how the hourly behavior compares.
But because the days are different, I can not directly plot two 1-hour data sets on top of each other.
My naive idea would be that I need to remove the day from the datetime index on both sets and then plot them on top of each other. What's the best way to do that?
Or, alternatively, what's the better approach to my problem?
This may not be exactly it but should help you along, assuming ts is your timeseries:
hourly = ts.resample('H')
hourly.index = pd.MultiIndex.from_arrays([hourly.index.hour, hourly.index.normalize()])
hourly.unstack().plot()
If you don't care about the day AT ALL, just hourly.index = hourly.index.hour should work

Categories