Python pandas plot time-series with gap - python

I am trying to plot a pandas DataFrame with a Timestamp index that has a time gap in it. Using pandas.plot() results in linear interpolation between the last Timestamp of the former segment and the first Timestamp of the next. I do not want linear interpolation, nor do I want empty space between the two date segments. Is there a way to do that?
Suppose we have a DataFrame with a Timestamp index:
>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> df = pd.DataFrame(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
>>> df = df.cumsum()
Now let's take two time chunks of it and plot them:
>>> df = pd.concat([df['Jan 2000':'Aug 2000'], df['Jan 2001':'Aug 2001']])
>>> df.plot()
>>> plt.show()
The resulting plot has an interpolation line connecting the TimeStamps enclosing the gap. I cannot figure out how to upload pictures on this machine, but these pictures from Google Groups show my problem (interpolated.jpg, no-interpolation.jpg and no gaps.jpg). I can recreate the first as shown above. The second is achievable by replacing all gap values with NaN (see also this question). How can I achieve the third version, where the time gap is omitted?

Try:
df.plot(x=df.index.astype(str))
You may want to customize ticks and tick labels.
EDIT
That works for me using pandas 0.17.1 and numpy 1.10.4.
All you really need is a way to convert the DatetimeIndex to another type which is not datetime-like. In order to get meaningful labels I chose str. If x=df.index.astype(str) does not work with your combination of pandas/numpy/whatever you can try other options:
df.index.to_series().dt.strftime('%Y-%m-%d')
df.index.to_series().apply(lambda x: x.strftime('%Y-%m-%d'))
...
I realized that resetting the index is not necessary so I removed that part.
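As a rough sketch of the "customize ticks and tick labels" step, one option is to plot against row positions and write the dates onto the axis by hand. This is an alternative to passing string labels, not the exact call from the answer above. The names full, positions, tick_positions and the spacing of 60 rows are assumptions made for illustration, and the Agg backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Rebuild data shaped like the question: a random walk with a gap in the dates.
rng = np.random.default_rng(0)
full = pd.DataFrame(
    {"value": rng.standard_normal(1000).cumsum()},
    index=pd.date_range("2000-01-01", periods=1000),
)
df = pd.concat([full.loc["2000-01":"2000-08"], full.loc["2001-01":"2001-08"]])

# Plot against row positions, so the missing months take up no horizontal space.
positions = np.arange(len(df))
fig, ax = plt.subplots()
ax.plot(positions, df["value"].to_numpy())

# Label every 60th row (arbitrary spacing) with the date stored at that row.
tick_positions = positions[::60]
tick_labels = df.index[tick_positions].strftime("%Y-%m-%d")
ax.set_xticks(tick_positions)
ax.set_xticklabels(tick_labels, rotation=45, ha="right")
fig.tight_layout()
```

In an interactive session you would finish with plt.show(). Because the x axis is now just a row counter, the spacing between ticks no longer reflects elapsed time, which is the intended trade-off here.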

In my case I had a DatetimeIndex rather than individual Timestamp objects, but the following worked for me in pandas 0.24.2: converting the DatetimeIndex to strings eliminates the gaps in the time series.
df = pd.read_sql_query(sql, sql_engine)
df.set_index('date', inplace=True)
df.index = df.index.map(str)

Related

Hourly Average of csv data of 15 minutes interval

The data in my CSV files are 15-minute averages and I want hourly averages. When I use the code below, it raises an error saying that 'how' is an unrecognised argument.
import pandas as pd
df = pd.read_csv("sirifort_with_data.csv",parse_dates=['Time_Stamp'])
df.resample('H', how='mean')
Indeed, in current pandas versions resample() no longer accepts a keyword argument named how; it was deprecated and later removed. resample() only groups your time-series data into bins. You then call an aggregation method on the result to compute one value per bin. Since you want the average of each hour, use .mean():
df.resample('H').mean()
import pandas as pd
df = pd.read_csv("sirifort_with_data.csv")
df['Time_Stamp'] = pd.to_datetime(df['Time_Stamp'])
df = df.set_index('Time_Stamp').resample('H').mean()
After converting the Time_Stamp column with pd.to_datetime and setting it as the index, it worked fine.
Thanks for the help.
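For reference, here is a minimal self-contained sketch of the same steps on made-up data. The column name pm25 and its values are hypothetical, and the rule is written as '60min' instead of 'H' only because the spelling of the hour alias differs between pandas versions.

```python
import pandas as pd

# Hypothetical 15-minute readings standing in for the CSV file.
df = pd.DataFrame({
    "Time_Stamp": pd.date_range("2020-01-01 00:00", periods=8, freq="15min"),
    "pm25": [10, 20, 30, 40, 50, 60, 70, 80],
})

# resample() needs a datetime-like index; mean() then averages each hourly bin.
hourly = df.set_index("Time_Stamp").resample("60min").mean()
print(hourly)
```

Each hourly row is the mean of the four 15-minute rows that fall inside that hour.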

pandas bar plot xlabel based on two column values

Given the code below, I get the expected plot (attached).
Is there a way to get pandas to plot the x labels as a combination of A and B?
I've tried passing in x=['A','B'] as well as x=('A','B'), neither of which works.
I would like each label to include both values.
It is possible to pivot and get a semi-workable solution, but I don't actually want to compare the B subsets side by side:
df.pivot(index='B',columns='A',values='Val').plot(kind='bar')
import pandas as pd
# Collect the rows first; DataFrame.append was removed in pandas 2.0
rows = []
for a in range(2):
    for b in range(3):
        rows.append({'A': str(a), 'B': str(b), 'Val': a + b})
df = pd.DataFrame(rows, columns=['A', 'B', 'Val'])
df.plot(kind='bar', x='B', y='Val')
You can create a MultiIndex by setting columns A and B as the index, then call plot with kind='bar':
df.set_index(['A', 'B']).plot(kind='bar', y='Val')
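If the tuple-style labels that a MultiIndex produces are not what you want, another possible approach is to build the label text yourself and plot against it. This is only a sketch: the label column and the '-' separator are assumptions for illustration, and the Agg backend is set just so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import pandas as pd

# Same data as in the question, built from a list of rows.
rows = [{"A": str(a), "B": str(b), "Val": a + b}
        for a in range(2) for b in range(3)]
df = pd.DataFrame(rows)

# Hypothetical helper column: join A and B into one string per row.
df["label"] = df["A"] + "-" + df["B"]

# One bar per row, labelled with the combined A-B text.
ax = df.plot(kind="bar", x="label", y="Val")
```

Since A and B are already strings here, plain concatenation works. With numeric columns you would convert them first, for example with astype(str).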

How can I get the difference between values in a Pandas dataframe grouped by another field?

I have a CSV of data I've loaded into a dataframe that I'm trying to massage: I want to create a new column that contains the difference from one record to another, grouped by another field.
Here's my code:
import pandas as pd
import matplotlib.pyplot as plt
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv'
all_counties = pd.read_csv(url, dtype={"fips": str})
all_counties.date = pd.to_datetime(all_counties.date)
oregon = all_counties.loc[all_counties['state'] == 'Oregon']
oregon.set_index('date', inplace=True)
oregon.sort_values('county', inplace=True)
# This is not working; I was hoping to find the differences from one day to another on a per-county basis
oregon['delta'] = oregon.groupby(['state','county'])['cases'].shift(1, fill_value=0)
oregon.tail()
Unfortunately, I'm getting results where the delta is always the same as the cases.
I'm new at Pandas and relatively inexperienced with Python, so bonus points if you can point me towards how to best read the documentation.
Let's try:
oregon['delta']=oregon.groupby(['state','county'])['cases'].diff().fillna(0)
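To see what the grouped diff() does, here is a small sketch on invented numbers. The county names and case counts are made up for illustration and are not taken from the NYT data set.

```python
import pandas as pd

# Invented cumulative case counts for two counties over three days.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-01", "2020-03-02",
                            "2020-03-02", "2020-03-03", "2020-03-03"]),
    "county": ["Lane", "Marion", "Lane", "Marion", "Lane", "Marion"],
    "cases": [1, 2, 3, 2, 6, 5],
})

# Sort so that each county's rows are in date order before differencing.
df = df.sort_values(["county", "date"])

# diff() subtracts the previous row within each county; the first row of
# every county has no predecessor, so it is NaN until fillna(0) replaces it.
df["delta"] = df.groupby("county")["cases"].diff().fillna(0)
print(df)
```

By contrast, shift(1) on its own only returns the previous row's value, so it would still have to be subtracted from cases to get a difference.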

Issue Creating Data Frame out of Columns Pandas - Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
df = pd.read_csv('time_series_covid_19_deaths_US.csv')
df = df.drop(['UID','iso2','iso3','code3','FIPS','Admin2','Combined_Key'],axis =1)
for name, values in df.iteritems():
    if '/' in name:
        df.drop([name],axis=1,inplace =True)
df2 = df.set_index(['Lat','Long_'])
print(df2.head())
lat = df2[df2["Lat"]]
print(lat)
long = df2[df2['Long_']]
Code is above. I got the data set from https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset - using the US deaths.
I have attached an image of the output. I do not know what this error means.
Apologies if worded ambiguously / incorrectly, or if there is a preexisting answer somewhere
When you define an index using one or more columns, e.g. via set_index(), these columns are promoted to index and by default no longer accessible using the df[<colname>] notation. This behavior can be changed with set_index(..., drop=False) but that's usually not necessary.
With the index in place, use df.loc[] to access single rows by their index value, also known as the label. The pandas documentation on label-based indexing covers this in detail.
To access the values of your MultiIndex as you would do with a column, you can use df.index.get_level_values(<colname>).array (or .to_numpy()). So in your case you could write:
lat = df2.index.get_level_values('Lat').array
print(lat)
long = df2.index.get_level_values('Long_').array
print(long)
BTW: read_csv() has a useful usecols argument that lets you specify which columns to load (others will be ignored).
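The points above can be tried out on a tiny inline CSV. This is a sketch with invented rows: csv_text, its values, and the long_ variable name are assumptions for illustration, not part of the Kaggle file.

```python
import io
import pandas as pd

# Invented stand-in for the real CSV file.
csv_text = """UID,Lat,Long_,Province_State,Population
1,32.5,-86.5,Alabama,55869
2,30.75,-87.75,Alabama,223234
"""

# usecols loads only the listed columns, so UID never reaches the frame.
df = pd.read_csv(io.StringIO(csv_text),
                 usecols=["Lat", "Long_", "Province_State", "Population"])

# After set_index the two columns live in the index, not in df2.columns,
# which is why df2["Lat"] raises a KeyError.
df2 = df.set_index(["Lat", "Long_"])
print("Lat" in df2.columns)  # False

# Read the index levels back as plain arrays.
lat = df2.index.get_level_values("Lat").to_numpy()
long_ = df2.index.get_level_values("Long_").to_numpy()

# drop=False keeps the columns available in addition to the index.
kept = df.set_index(["Lat", "Long_"], drop=False)
```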

Using str.startswith to access a dataframe slice

I have a dataframe with temperature values over several years. What I want to do is put all the rows from the year 2015 into a new dataframe. Currently, the Date column has object dtype, holding strings in the format YYYY-MM-DD.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv(r"C:\whatever\weather.csv")
weather_2015 = df.loc[df.Date == df.Date.str.startswith("2015"), :]
weather_2015.head()
this is what the data looks like in the main data frame
NOTE: if I do something like
weather_2015 = df.loc[df.Date == "2015-02-03", :]
weather_2015.head()
I get what I'd expect: only the rows whose date matches 2015-02-03.
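pd.Series.str.startswith already returns a boolean mask, so you don't need to compare it to df.Date again. Comparing the string column with that mask is False for every row, so the selection returns nothing. Just index with the mask directly: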
weather_2015 = df[df.Date.str.startswith("2015")]
You don't even need .loc here.
Note that if you want to modify this slice, you may prefer an independent copy, in which case call .copy() on the result:
weather_2015 = df[df.Date.str.startswith("2015")].copy()
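A small sketch of the mask on invented data follows. The Temp column and its values are made up, and the datetime-based variant at the end is an alternative added for comparison, not part of the answer above.

```python
import pandas as pd

# Invented rows in the same YYYY-MM-DD string format as the question.
df = pd.DataFrame({
    "Date": ["2014-12-31", "2015-01-01", "2015-02-03", "2016-01-01"],
    "Temp": [1.0, 2.0, 3.0, 4.0],
})

# startswith gives one True/False per row; use it directly as the filter.
mask = df["Date"].str.startswith("2015")
weather_2015 = df[mask].copy()

# Alternative: parse the strings once and compare the year numerically.
dates = pd.to_datetime(df["Date"])
weather_2015_by_year = df[dates.dt.year == 2015]
print(weather_2015)
```

Both selections pick the same two rows; the datetime variant is mainly useful if you need further date arithmetic afterwards.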
