Plot a DataFrame over a range of time indices - python

I am new to Python and pandas.
I have a dataset loaded into Python as a DataFrame. The index of the DataFrame are times of the format "2018-01-01 00:00:00". My dataset ranges from "2018-01-01 00:00:00" to "2018-12-31 23:59:59". The data column has a column name "X".
I can plot the entire dataset using matplotlib:
plt.plot(data.index, data["X"])
However, I want to plot different segments of the time series: 1 month, 6 months, 2 days, 3 seconds, etc.
What is the best way to do this?
Thanks

If you want to plot a single month, you could do
data.loc['2018-02', "X"].plot()
and for six months:
data.loc['2018-02':'2018-08', "X"].plot()
The same slicing logic applies to other ranges; a sketch for finer windows follows below.
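For finer-grained windows the same partial-string indexing works down to the second. Here is a minimal sketch, assuming a DataFrame shaped like the one in the question (the synthetic data uses a 1-minute frequency just to keep the example small):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the questioner's data: one year at 1-minute resolution
idx = pd.date_range('2018-01-01', '2018-12-31 23:59:59', freq='min')
data = pd.DataFrame({'X': np.random.randn(len(idx))}, index=idx)

# Two days: partial strings select whole days, and the end label is inclusive
data.loc['2018-03-01':'2018-03-02', 'X'].plot()
plt.show()

# A sub-day window: full timestamps narrow the slice as far as you like
data.loc['2018-03-01 12:00:00':'2018-03-01 12:30:00', 'X'].plot()
plt.show()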

You might need to do one more processing step on your index to ensure you're dealing with datetime objects rather than strings.
import pandas

new_data = (
    data
    .assign(datetime=lambda df: pandas.to_datetime(df.index))
    .set_index('datetime')
)
new_data["X"].plot()
This should get us really close to what you want, but I haven't tested it on data with your date format.
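A quick way to sanity-check the conversion on a tiny hypothetical frame (partial-string selection like .loc['2018-01'] only works once the index is a real DatetimeIndex; on a string index it raises a KeyError):
import pandas as pd

raw = pd.DataFrame(
    {'X': [1.0, 2.0, 3.0]},
    index=['2018-01-01 00:00:00', '2018-01-01 00:00:01', '2018-01-01 00:00:02'],
)

# Same conversion as above, written as a one-liner
new_data = raw.set_index(pd.to_datetime(raw.index))
print(new_data.loc['2018-01', 'X'])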

Related

Assigning a value to the same dates fulfilling a condition in a more efficient way in a dataframe

I have the following dataframe called df1 that contains data for a number of regions in the column NUTS_ID.
The index, called Date, has all the days of 2010. That is, for each code in NUTS_ID I have every day of 2010 (all days of the year for AT1, AT2, and so on). I created a list containing the dates corresponding to workdays, and I want to add a column with 0 for non-workdays and 1 for workdays.
For this, I simply used a for loop that checks day by day if it's in the workday list I created:
for day in df1.index:
    if day not in workdays_list:
        df1.loc[day, 'Workday'] = 0  # assigning 0 to non-workdays
    else:
        df1.loc[day, 'Workday'] = 1  # assigning 1 to workdays
This works well enough if the dataset is not big, but with some of the datasets I'm processing it takes a very long time. I would like to ask for ideas on how to make the process faster and more efficient. Thank you in advance for your input.
EDIT: One thing I have considered is that a groupby might be helpful, but I don't know if that is the right approach.
You can use np.where with isin to check if your Date (i.e. your index) is in the list you created:
import numpy as np
df1['Workday'] = np.where(df1.index.isin(workdays_list), 1, 0)
I can't reproduce your dataset, but something along those lines should work.
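For completeness, a self-contained sketch of that one-liner on made-up data (the two NUTS_ID codes and the weekday rule below are invented for illustration):
import numpy as np
import pandas as pd

# Fake stand-in for df1: every day of 2010, repeated for two region codes
days = pd.date_range('2010-01-01', '2010-12-31', freq='D')
df1 = pd.concat(
    [pd.DataFrame({'NUTS_ID': code}, index=days) for code in ('AT1', 'AT2')]
)
df1.index.name = 'Date'

# Pretend the workday list is simply all weekdays
workdays_list = list(days[days.weekday < 5])

df1['Workday'] = np.where(df1.index.isin(workdays_list), 1, 0)
print(df1.head())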

Using a sampling of a datetime index to select the features (rows) in a pandas dataframe at those datetimes

I have a datetime index object that consists of the index values of randomly sampled data from a LARGER dataframe on which I am training a learner. I'd like to use the datetime index, e.g.
DatetimeIndex(['1911-11-18', '2015-05-02', '1934-08-15', '1950-09-16',
               '1944-06-01', '2004-07-30', '1947-11-18', '1977-07-08',
               '1945-05-31', '1944-01-31',
               ...
               '1884-06-24', '1999-11-22', '1960-02-02', '1883-03-08',
               '1952-11-19', '1993-02-04', '1965-04-26', '1885-09-30',
               '1890-02-26', '2008-03-28'],
              dtype='datetime64[ns]', length=300000, freq=None)
of each training example to go back to the full data frame and look up the target value on those days, AND THEN go 1 year into the future from that date to use as the real target.
The overall context is training on a random sample from time series data, and targeting a value in the future.
My big dataframe is called toLearn, and the sample dataframe on which I am training is called dataSlice (a subset of toLearn).
Something like the following works for what I was trying to do.
# Find the target 7 years after each training sample
indicesOfTrainSamples = trainSamples.index
indicesOfTarget = indicesOfTrainSamples + pd.Timedelta(weeks=7 * 52)

targets = []
for i in indicesOfTarget:
    targets.append(toLearn.loc[i])

targetSlices = pd.DataFrame(targets, index=indicesOfTarget)
targetFeature = targetSlices['targetValues']
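The Python-level loop can usually be replaced by a single vectorized lookup. A sketch using the names from the question, assuming every shifted timestamp actually exists in toLearn's index (swap .loc for .reindex to get NaN instead of a KeyError for missing dates):
import pandas as pd

# Shift all sampled timestamps at once, then look them all up in one call
indicesOfTarget = trainSamples.index + pd.Timedelta(weeks=7 * 52)
targetSlices = toLearn.loc[indicesOfTarget]
targetFeature = targetSlices['targetValues']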

Append multiple columns into two columns python

I have a csv file which contains approximately 100 columns of data. Each column represents temperature values taken every 15 minutes throughout the day for each of the 100 days. The header of each column is the date for that day. I want to convert this into two columns, the first being the date time (I will have to create this somehow), and the second being the temperatures stacked on top of each other for each day.
My attempt:
with open("original_file.csv") as ofile:
stack_vec = []
next(ofile)
for line in ofile:
columns = lineo.split(',') # get all the columns
for i in range (0,len(columns)):
stack_vec.append(columnso[i])
np.savetxt("converted.csv",stack_vec, delimiter=",", fmt='%s')
In my attempt, I am trying to create a new vector with each column appended to the end of it. However, the code is extremely slow and likely not working! Once I have this step figured out, I then need to take the date from each column and add 15 minutes to the date time for each row. Any help would be greatly appreciated.
If I got this right, you have a csv with 96 rows and 100 columns and want to stack it into one vector, day after day, giving a vector with 9600 entries, right?
An easy approach would be to use numpy:
import numpy as np
x = np.genfromtxt('original_file.csv', delimiter=',')
data = x.ravel(order='F')
Note that numpy is a third-party library, but it is the go-to library for math.
The first line reads the csv into an ndarray, which is like a matrix (even though it behaves differently for some mathematical operations).
Then ravel vectorizes it. order='F' makes it read column by column, so the columns are stacked on top of each other, i.e. day after day. (Leave it as the default if you want time point after time point across days.)
For your date problem see "How can I make a python numpy arange of datetime"; I couldn't give a better example than that.
Once you have these two arrays, you can ensure the shape with x.reshape(9600, 1) and then stack them with np.concatenate([x, dates], axis=1), with dates being your date vector. A sketch of the date side follows below.
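A sketch, assuming the header row holds the 100 dates and each column starts at midnight with readings every 15 minutes (the filename comes from the question; skip_header keeps the date row out of the numeric parse):
import numpy as np
import pandas as pd

x = np.genfromtxt('original_file.csv', delimiter=',', skip_header=1)
values = x.ravel(order='F')  # stack the data day after day

# Read the header dates back and expand each into 96 15-minute timestamps
header = np.genfromtxt('original_file.csv', delimiter=',', max_rows=1, dtype=str)
stamps = np.concatenate(
    [pd.date_range(day, periods=96, freq='15min').values for day in header]
)

# Two columns: datetime and temperature, stacked day after day
out = np.column_stack([stamps.astype(str), values.astype(str)])
np.savetxt('converted.csv', out, delimiter=',', fmt='%s')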

Average of daily count of records per month in a Pandas DataFrame

I have a pandas DataFrame with a TIMESTAMP column, which is of the datetime64 data type. Please keep in mind, initially this column is not set as the index; the index is just regular integers, and the first few rows look like this:
                 TIMESTAMP  TYPE
0  2014-07-25 11:50:30.640     2
1  2014-07-25 11:50:46.160     3
2  2014-07-25 11:50:57.370     2
There is an arbitrary number of records for each day, and there may be days with no data. What I am trying to get is the average number of daily records per month, then plot it as a bar chart with months on the x-axis (April 2014, May 2014, etc.). I managed to calculate these values using the code below:
dfWIM.index = dfWIM.TIMESTAMP
for i in range(dfWIM.TIMESTAMP.dt.year.min(), dfWIM.TIMESTAMP.dt.year.max() + 1):
    for j in range(1, 13):
        print(dfWIM[(dfWIM.TIMESTAMP.dt.year == i) & (dfWIM.TIMESTAMP.dt.month == j)].resample('D', how='count').TIMESTAMP.mean())
which gives the following output:
nan
nan
3100.14285714
6746.7037037
9716.42857143
10318.5806452
9395.56666667
9883.64516129
8766.03225806
9297.78571429
10039.6774194
nan
nan
nan
This is OK as it is, and with some more work I can map the results to the correct month names and then plot the bar chart. However, I am not sure if this is the correct/best way, and I suspect there might be an easier way to get the results using pandas.
I would be glad to hear what you think. Thanks!
NOTE: If I do not set the TIMESTAMP column as the index, I get a "reduction operation 'mean' not allowed for this dtype" error.
I think you'll want to do two rounds of groupby, first to group by day and count the instances, and next to group by month and compute the mean of the daily counts. You could do something like this.
First I'll generate some fake data that looks like yours:
import numpy as np
import pandas as pd

# make 1000 random times throughout the year
N = 1000
times = pd.date_range('2014', '2015', freq='min')
ind = np.random.permutation(np.arange(len(times)))[:N]
data = pd.DataFrame({'TIMESTAMP': times[ind],
                     'TYPE': np.random.randint(0, 10, N)})
data.head()
Now I'll do the two groupbys using pd.TimeGrouper and plot the monthly average counts:
import seaborn as sns # for nice plot styles (optional)
daily = data.set_index('TIMESTAMP').groupby(pd.TimeGrouper(freq='D'))['TYPE'].count()
monthly = daily.groupby(pd.TimeGrouper(freq='M')).mean()
ax = monthly.plot(kind='bar')
The formatting along the x axis leaves something to be desired, but you can tweak that if necessary.
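Note that pd.TimeGrouper was removed in pandas 1.0; on current versions resample (or pd.Grouper) does the same job. A sketch of the equivalent, using the data frame built above:
# Count records per day, then average the daily counts within each month
daily = data.set_index('TIMESTAMP').resample('D')['TYPE'].count()
monthly = daily.resample('M').mean()  # use 'ME' on pandas >= 2.2
ax = monthly.plot(kind='bar')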

How to get vincent to display a pandas date/time axis correctly?

I have a pandas dataframe that I want to use in a vincent visualization. I can visualize the data; however, the x axis should show dates, and instead the dates are just given an integer index of 500, 1000, 1500, etc.
The dataframe looks like this:
       weight        date
0  125.200000  2013-11-18
Truncated for brevity.
My vincent code in my ipython notebook:
chart = vincent.Line(df[['weight']])
chart.legend(title='weight')
chart.axis_titles(x='Date', y='Weight')
chart.display()
How can I tell vincent that my dataframe contains dates such that the X axis labels are just like the dataframe's dates above, i.e. 2013-11-18?
OK, so here's what I did. I ran into this problem before with matplotlib, and it was so painful that I wrote a blog post about it (http://codrspace.com/szeitlin/biking-data-from-xml-to-plots-part-2/). Vincent is not exactly the same, but essentially you have to do 4 steps:
convert your dates to datetime objects, if you haven't already
df['date_objs'] = df['date'].apply(pandas.to_datetime)
convert your datetime objects to whatever format you want.
make your datetime objects into your index
df.index = df['date_objs'].values.astype('M8[D]')
tell vincent you want to plot your data (weight) as the y-axis. It will automatically use the index of your dataframe as the x-axis.
chart = vincent.Line(df[['weight']])
chart.axis_titles(x='dates', y='weight')
chart.display()
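Putting the four steps together, a minimal end-to-end sketch on a hypothetical frame shaped like the question's (the weights and dates below are placeholders):
import pandas as pd
import vincent

df = pd.DataFrame({'weight': [125.2, 124.8, 124.1],
                   'date': ['2013-11-18', '2013-11-19', '2013-11-20']})

df['date_objs'] = df['date'].apply(pd.to_datetime)  # step 1
df.index = df['date_objs'].values.astype('M8[D]')   # steps 2-3: day-resolution index

chart = vincent.Line(df[['weight']])                # step 4: the index becomes the x axis
chart.axis_titles(x='dates', y='weight')
chart.display()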
