I have a bunch of dates in the form MM / DD / YYYY. I use a parser to get them into something that looks like datetime.datetime(YYYY, MM, DD, 0, 0).
I have data corresponding to each date in an array y and would like to plot the two arrays against one another. Using matplotlib.dates.date2num I plot them as
x= matplotlib.dates.date2num(dates)
plot_date(dates,y)
When I do this, I get the following plot
Whereas I would prefer something that looks like a time series.
How can I fix this?
Presumably the dates and data that you're reading in are not in time order, in which case you'll need to sort them both before they're passed to plot_date():
s = np.argsort(dates)
plot_date(dates[s],y[s])
This should work whether you sort on the list/array of datetime instances (your dates) or their numerical equivalents (your x).
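A minimal sketch of the sorting step, assuming the dates start out as a plain Python list (the sample values here are made up for illustration). Indexing with the result of np.argsort needs NumPy arrays, so both sequences are converted first; indexing a plain list with an index array raises a TypeError.

```python
from datetime import datetime

import numpy as np

# Made-up sample data, deliberately out of time order
dates = [datetime(2015, 3, 1), datetime(2015, 1, 1), datetime(2015, 2, 1)]
y = [30.0, 10.0, 20.0]

# Convert to arrays so that they can be indexed with an index array
dates_arr = np.array(dates)
y_arr = np.asarray(y)

# Indices that put the dates in chronological order
order = np.argsort(dates_arr)

dates_sorted = dates_arr[order]
y_sorted = y_arr[order]
```

dates_sorted and y_sorted can then be passed to plot_date() in place of the unsorted inputs.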
I have a timeseries data and I would like to clean the data by approximating the missing data points and standardizing the sample rate.
Given the fact that there might be some unevenly spaced datapoints, I would like to define a function to get the timeseries and an interval X (e.g., 30 minutes or any other interval) as an input and gives the timeseries with points being spaced within X intervals as an output.
As you can see below, the periods are every 10 minutes but some data points are missing. So the algorithm should detect the missing times and remove them and create the appropriate times and generate the value for them. Then based on the defined function, the sample rate should be changed and standardized.
For approximating missing data and cleaning it, either average or linear interpolation would work.
Here is a part of raw data:
import pandas as pd
import numpy as np
df = pd.DataFrame({
"Time": ["10:09:00","10:19:00","10:29:00","10:43:00","10:59:00 ", "11:09:00"],
"Value": ["378","378","379","377","376", "377"],
})
df
First of all, you need to convert "Time" into a datetime index. Make pandas recognize the times as actual datetimes with df["Time"] = pd.to_datetime(df["Time"]). Then set the time as the index: df = df.set_index("Time").
Once you have the datetime index, you can do all sorts of time-based operations with it. In your case, you want to resample: df.resample('10min') ('10T' is an older alias for the same 10-minute frequency).
This leaves us with the following code:
# Strip stray whitespace, then parse as hours:minutes:seconds
df["Time"] = pd.to_datetime(df["Time"].str.strip(), format="%H:%M:%S")
df = df.set_index("Time")
df.resample('10min')
From here on you have a lot of options on how to treat cases in which you have missing data (fill / interpolate / ...), or in which you have multiple data points for one new one (average / sum / ...). I suggest you take a look at the pandas resampling api. For conversions and formatting between string and datetime refer to strftime.
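As one possible way to go from the resampler to a cleaned series, here is a sketch that averages the points falling into each interval and fills empty intervals by time-based linear interpolation. The function name regularize is hypothetical, and mean plus interpolation is only one of the options mentioned above. It also assumes the values should be numeric, so the strings are converted first.

```python
import pandas as pd

df = pd.DataFrame({
    "Time": ["10:09:00", "10:19:00", "10:29:00", "10:43:00", "10:59:00 ", "11:09:00"],
    "Value": ["378", "378", "379", "377", "376", "377"],
})


def regularize(frame, interval="10min"):
    """Resample `frame` onto a regular grid (hypothetical helper).

    Points within one interval are averaged; empty intervals are
    filled by linear interpolation along the time index.
    """
    out = frame.copy()
    # Strip stray whitespace before parsing, and make the values numeric
    out["Time"] = pd.to_datetime(out["Time"].str.strip(), format="%H:%M:%S")
    out["Value"] = pd.to_numeric(out["Value"])
    out = out.set_index("Time")
    return out.resample(interval).mean().interpolate(method="time")


every_10_min = regularize(df, "10min")
every_30_min = regularize(df, "30min")
```

Because no date is given, pandas attaches a default date to each timestamp; only the time of day is meaningful here.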
I code just once in a while and I am super basic at the moment. Might be a silly question, but it got me stuck in for a bit too much now.
Background
I have a function (get_profiles) that plots points every 5m along one transect line (100m long) and extracts elevation (from a geotiff).
The arguments are:
dsm (digital surface model)
transect_file (geopackage, holds many LineStrings with different transect_ID)
transect_id (int, extracted from transect_file)
step (int, number of meters to extract elevation along transect lines)
The output for one transect line is a dataframe like in the picture, which is what I expected, and I like it!
However, the big issue is when I iterate the function over the transect_ids (the transect_files has 10 Shapely LineStrings), like this:
tr_list = np.arange(1,transect_file.shape[0]-1)
geodb_transects= []
for i in tr_list:
    temp=get_profiles(dsm,transect_file,i,5)
    geodb_transects.append(temp)
I get a list. The error might be here, but I don't know another way to do it.
type(geodb_transects)
output:list
And, what's worse, I get headers (distance, z, tr_id, date) every time a new iteration starts.
How to get a clean pandas dataframe, just like the output of 1 iteration (20rows) but with all the tr_id chunks of 20row each aligned and without headers?
If your output is a DataFrame then you’re simply looking to concatenate the incremental DataFrame into some growing DataFrame.
It’s not the most efficient but something like
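import pandas
df = pandas.DataFrame()
for i in range(7):
    # df_ret_func stands for whatever function returns one DataFrame per call
    df = pandas.concat([df, df_ret_func(i)], ignore_index=True)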
You may also be interested in the from_records function if you have a list of elements that are all records of the same form and can be converted into the rows of a DataFrame.
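A sketch of a usually cheaper variant: keep appending to a list, as in the question, and call pandas.concat once at the end. The function fake_profile below is a hypothetical stand-in for get_profiles, and its columns are made up for illustration.

```python
import pandas as pd


def fake_profile(tr_id, n_points=20, step=5):
    """Hypothetical stand-in for get_profiles: one DataFrame per transect."""
    return pd.DataFrame({
        "distance": [k * step for k in range(n_points)],
        "z": [0.0] * n_points,   # placeholder elevations
        "tr_id": tr_id,          # scalar is repeated for every row
    })


# Collect one DataFrame per transect, then stack them in a single call
frames = [fake_profile(i) for i in range(1, 4)]
profiles = pd.concat(frames, ignore_index=True)
```

With ignore_index=True the result gets one continuous row index and a single header row, so the per-transect chunks line up as one table.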
I have a text data file with a time series whose entries are in the form (This is the first column):
20000101
20000102
20000103
...
20001231
20010101
...
20151231
Using these int values results in the points being clustered by year with unequal spacing. (This is logical, since the plot simply leaves the corresponding gap between 20001231 and 20010101.)
Now one solution to this is to use an array like this one (let's suppose I have the dates stored in an array called date):
xaxis= np.arange(0, len(date))
The problem is that, although the plot is correct, the x axis ticks are then labeled as 0,1,2,3...
I have been trying to modify the xticks, but whatever I do changes the whole figure resulting in a weird plot.
What is the best solution to this?
You have to pass both the tick positions (your x data) and the labels to show at those positions to the plt.xticks(xdata, xticks) function.
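A minimal sketch of that call, assuming a short made-up date array and placeholder y values. Only every second point is labelled so the labels do not overlap; the Agg backend is selected just so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for on-screen plots
import matplotlib.pyplot as plt
import numpy as np

# Made-up sample of integer dates spanning a year boundary
date = np.array([20000101, 20000102, 20000103, 20001231, 20010101, 20010102])
values = np.arange(len(date), dtype=float)  # placeholder data

# Evenly spaced positions, as in the question
xaxis = np.arange(len(date))

# Label every second position with the matching date
tick_positions = xaxis[::2]
tick_labels = [str(d) for d in date[::2]]

plt.plot(xaxis, values)
plt.xticks(tick_positions, tick_labels, rotation=45)
plt.tight_layout()
```

For a long series, a larger step than 2 (or hand-picked positions such as the first day of each year) keeps the axis readable.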
I am analysing race results from a CSV which looks like this:
Position,Time,Race #,Batch,Name,Surname,Category,Sex,Age
1,00:25:04,58,E,Luke,Schlebusch,Junior,Male,17
2,00:25:16,92,E,Anrich,Zimmermann,Junior,Male,17
3,00:26:27,147,E,Ryan,Mathaba,Open,Male,33
4,00:26:58,53,E,Daniel,Rademan,Junior,Male,16
5,00:27:17,19,E,Werner,Du Preez,Open,Male,29
6,00:27:44,148,E,Mazu,Ndandani,Open,Male,37
7,00:27:45,42,E,Dakota,Murphy,Open,Male,20
8,00:28:29,56,E,David,Schlebusch,Master,Male,51
9,00:28:32,52,E,Caleb,Rademan,Minimee,Male,12
I am using the following call to read_csv to parse this into a Pandas dataframe:
race1 = pandas.read_csv('data.csv', parse_dates=['Time'], index_col='Time')
This enables me to plot a cumulative distribution of race times very easily by just doing:
race1.Position.plot()
Pandas handles all the intricacies of the date data type and makes a nice x axis with proper formatting of the times.
Is there an elegant way of getting a histogram of times which is similarly straightforward? Ideally, I would like to be able to do race1.index.hist() or race1.index.to_series().hist(), but I know that doesn't work.
I've been able to coerce the time to a timedelta and get a working result with
times = race1.index.to_series()
((times - times[0]).dt.seconds/60).hist()
This produces a histogram of the correct shape, but obviously with wrong x values (they are off by the fastest time).
Is there an elegant way to read the column as a timedelta to begin with, and is there a better way of creating the histogram, including proper ticks? Proper ticks here means that they use the correct locator and update properly.
This appears to work pretty well, although I would be happier with it if it didn't go through the Matplotlib date specifics regarding ordinal dates.
import matplotlib.pyplot as plt
times = race1.index.to_series()
today = pandas.Timestamp('00:00:00')
timedelta = times - today
times_ordinal = timedelta.dt.seconds/(24*60*60) + today.toordinal()
ax = times_ordinal.hist()
ax.xaxis_date()
plt.gcf().autofmt_xdate()
plt.ylabel('Number of finishers')
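On the other part of the question, reading the column as a timedelta: one possible sketch is to read "Time" as plain text and convert it with pandas.to_timedelta. This is an alternative to the ordinal-date approach above, not a replacement for its tick formatting. The CSV text below is the first few rows of the sample data, inlined so the example is self-contained.

```python
import io

import pandas as pd

csv_text = """Position,Time,Race #,Batch,Name,Surname,Category,Sex,Age
1,00:25:04,58,E,Luke,Schlebusch,Junior,Male,17
2,00:25:16,92,E,Anrich,Zimmermann,Junior,Male,17
3,00:26:27,147,E,Ryan,Mathaba,Open,Male,33
"""

# Read "Time" as strings, then convert HH:MM:SS text to timedeltas
race1 = pd.read_csv(io.StringIO(csv_text))
race1["Time"] = pd.to_timedelta(race1["Time"])

# Finishing times in minutes, measured from the start of the race
minutes = race1["Time"].dt.total_seconds() / 60
```

Calling minutes.hist() then gives a histogram whose x values are true finishing times in minutes rather than offsets from the fastest runner, although the ticks are plain numbers instead of formatted times.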
I have dates in a Python script that I need to work with that are in a list. I have to keep the format that already exists. The format is YYYY-MM-DD. They are displayed in the form ['2010-05-12', '2011-04-15', 'Date', '2010-04-20', '2010-11-05'] where the order of the dates appears to be random and they are made into lists with seemingly insignificant lengths. The length of this data can get very large. I need to know how to sort these dates into a chronological order and omit the seemingly randomly placed entries of 'Date' from this order. Then I need to be able to perform math operations such as moving up and down the list. For example if I have five dates in order I need to be able to take one date and be able to find a date x spaces ahead or behind that date in the order. I'm very new to Python so simpler explanations and implementations are preferred. Let me know if any clarifications are needed. Thanks.
You are asking several questions at the same time, so I'll answer them in order.
To filter out the "Date" entries, use the filter function like this:
dates = ['2011-06-18', 'Date', '2010-01-13', '1997-12-01', '2007-08-11']
dates_filtered = list(filter(lambda d: d != 'Date', dates))
Or perhaps like this, using Python's list comprehensions, if you find it easier to understand:
dates_filtered = [d for d in dates if d != 'Date']
You might want to convert the data types of the date items in your list to the date class to get access to some date-related methods like this:
from datetime import datetime
date_objects = [datetime.strptime(x,'%Y-%m-%d').date() for x in dates_filtered]
And to sort the dates you simply use the sort method
date_objects.sort()
The syntax in Python for accessing items and ranges of items in lists (or any "sequence type") is quite powerful. You can read more about it here. For example, if you want to access the last two dates in your list you could do something like this:
print(date_objects[-2:])
If you put it all together you'll get something like this:
from datetime import datetime
dates = ['2011-06-18', 'Date', '2010-01-13', '1997-12-01', '2007-08-11']
my_dates = [datetime.strptime(d, '%Y-%m-%d').date()
            for d in dates
            if d != 'Date']
my_dates.sort()
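For the last part of the question, finding the date x places ahead of or behind a given date, here is a minimal sketch. The helper name shift_date is hypothetical; it looks up the position of the date in the sorted list and adds the offset, returning None when that would run off either end.

```python
from datetime import date, datetime


def shift_date(sorted_dates, start, offset):
    """Return the date `offset` places from `start`, or None if out of range.

    Hypothetical helper; `start` must be present in `sorted_dates`.
    """
    position = sorted_dates.index(start) + offset
    if 0 <= position < len(sorted_dates):
        return sorted_dates[position]
    return None


dates = ['2011-06-18', 'Date', '2010-01-13', '1997-12-01', '2007-08-11']
my_dates = sorted(datetime.strptime(d, '%Y-%m-%d').date()
                  for d in dates
                  if d != 'Date')

two_ahead = shift_date(my_dates, date(2007, 8, 11), 2)
one_behind = shift_date(my_dates, date(2007, 8, 11), -1)
too_far = shift_date(my_dates, date(2007, 8, 11), 5)
```

list.index scans the list from the start, which is fine for modest sizes; for very long sorted lists the standard-library bisect module can find the position faster.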