I have 5 columns in an array that hold DateTime values. There are 100k+ values in each column.
The values are the DateTimes of a timeline running from 2015-01-15 00:30 to 2020-12-31 23:00 in 30-minute increments.
Basically, I want to loop through the values in the array and check whether the current value is exactly one 30-minute timestep after the previous value. In an array with columns, the previous value is the one directly above the value being checked.
There are probably a few ways to do this, but I've included a pseudocode sample of how I'm thinking about it:
for row in _the_whole_array:
    for cell in row:
        if cell == to the 30 minute timestep of the cell above it:
            continue iterating
        else:
            store that value
return the smallest timestep found, and the biggest timestep found
I have looked at for loops as well as nditer, but I'm getting errors when iterating over date-times, and I'm also wondering how I can get the value of the cell directly above the current cell.
Any help is hugely appreciated
I wasn't sure of your axis, even though you use rows in your pseudocode, but the overall idea is the same: if your dtype is datetime, you can do basic arithmetic on it.
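import numpy as np
x, y = _the_whole_array.shape
# steps between neighbouring cells along each row
for row in range(x):
    diff = _the_whole_array[row][1:] - _the_whole_array[row][:-1]
    print(np.amin(diff), np.amax(diff))
# steps between each cell and the cell above it, column by column
for column in range(y):
    diff = _the_whole_array[:, column][1:] - _the_whole_array[:, column][:-1]
    print(np.amin(diff), np.amax(diff))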
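If the array has a datetime64 dtype, the same idea can be written without any explicit loop. The sketch below is only an illustration on made-up data: the array `arr`, its size, and the deliberately inserted 60-minute gap are assumptions, not the asker's real data. An object array of `datetime` objects should be convertible first with something like `arr.astype("datetime64[m]")`.

```python
import numpy as np

# Hypothetical stand-in for the real data: 5 columns of 30-minute timestamps.
start = np.datetime64("2015-01-15T00:30")
step = np.timedelta64(30, "m")
column = start + step * np.arange(10)
arr = np.repeat(column[:, None], 5, axis=1)  # shape (10, 5)
arr[6:, 2] += step  # assumed test case: a 60-minute gap in column 2

# Difference between each cell and the cell above it, for all columns at once.
diffs = np.diff(arr, axis=0)

smallest, largest = diffs.min(), diffs.max()

# Positions (in diffs) where the step is not exactly 30 minutes.
bad_rows, bad_cols = np.nonzero(diffs != step)
print(smallest, largest, bad_rows, bad_cols)
```

Row `i` of `diffs` compares row `i + 1` of `arr` with row `i`, so add one to `bad_rows` to get the position of the offending cell in the original array.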
As mentioned by @Heidiki, consider using pandas because it makes processing large data much easier.
To address your question, you can keep a temporary variable that stores the values of the previous row while looping. At every iteration you calculate the differences and update the overall min and max, exactly as in your pseudocode.
Here is an example with test data, a 3x5 matrix. Please check whether the output matches your requirement.
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
# test data
arr = np.array([
    [datetime(2019, 1, 2, 13, 12), datetime(2019, 1, 2, 15, 19),
     datetime(2019, 1, 2, 15, 59), datetime(2019, 1, 2, 17, 23),
     datetime(2019, 1, 2, 15, 18)],
    [datetime(2019, 1, 2, 13, 34), datetime(2019, 1, 2, 15, 57),
     datetime(2019, 1, 2, 18, 53), datetime(2019, 1, 2, 17, 34),
     datetime(2019, 1, 2, 15, 29)],
    [datetime(2019, 1, 2, 13, 49), datetime(2019, 1, 2, 16, 35),
     datetime(2019, 1, 2, 21, 18), datetime(2019, 1, 2, 17, 59),
     datetime(2019, 1, 2, 15, 46)]
])
def timedelta_to_minutes(dt: timedelta) -> int:
    return (dt.days * 24 * 60) + (dt.seconds // 60)
def min_max_timestep(data: np.array) -> tuple:
    prev_row = min_step = max_step = None
    df = pd.DataFrame(data)
    for idx, row in df.iterrows():
        if not idx:
            prev_row = row
            continue
        diff = row - prev_row
        min_diff, max_diff = min(diff), max(diff)
        if min_step is None or min_diff < min_step:
            min_step = min_diff
        if max_step is None or max_diff > max_step:
            max_step = max_diff
        prev_row = row
    # convert timedelta to minutes
    return timedelta_to_minutes(min_step), timedelta_to_minutes(max_step)
result = min_max_timestep(arr)
Try converting your NumPy array to a pandas DataFrame, as it allows you to iterate through its rows. Here is a code snippet to help you out.
import pandas as pd
import numpy as np
# Creating a DataFrame from the NumPy array
df = pd.DataFrame(your_array)
# Iterating through the DataFrame
for index, row in df.iterrows():
    # print current row
    print(row)
    # print previous row (iloc selects by row position; df[index - 1] would select a column)
    print(df.iloc[index - 1]) if index != 0 else print('no previous row')
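Once the data is in a DataFrame, the "value above" comparison does not need `iterrows()` at all. As a rough sketch on invented sample data (the column names `a` and `b` and the timestamps are assumptions for illustration), `DataFrame.diff()` subtracts the previous row from each row:

```python
import pandas as pd

# Hypothetical sample: two columns of timestamps, one with an irregular step.
df = pd.DataFrame({
    "a": pd.date_range("2019-01-02 13:00", periods=4, freq="30min"),
    "b": pd.to_datetime(["2019-01-02 13:00", "2019-01-02 13:30",
                         "2019-01-02 14:30", "2019-01-02 15:00"]),
})

# Each cell minus the cell above it; the first row has no predecessor (NaT).
steps = df.diff().iloc[1:]

smallest = steps.min().min()
largest = steps.max().max()

# Boolean mask of cells whose step differs from 30 minutes.
irregular = steps != pd.Timedelta(minutes=30)
print(smallest, largest)
print(irregular)
```

The mask keeps the original row labels, so a `True` at row 2 of column `b` means the value at row 2 is not 30 minutes after the value at row 1.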
Related
I can't figure out why un-commenting ts = ts.sort_index() in the code below throws a KeyError:
import datetime
import pandas as pd
df = pd.DataFrame({
    'x': [2, 1, 3],
    'd': [
        datetime.datetime(2018, 5, 21),
        datetime.datetime(2018, 5, 20),
        datetime.datetime(2018, 5, 22)
    ]
})
ts = df.set_index('d')
#ts = ts.sort_index()
ts['2018-05-21']
My assumption is that sort_index in some way generates a new index and therefore breaks the string selection, but I can't find any evidence of it.
To provide some context, I want to sort this time series in order to select a time range (e.g., ts['2018-05-21':]). If I don't sort it, it works for the example above but not for the time range.
I recommend using .loc:
#ts = df.set_index('d')
#ts = ts.sort_index()
ts.loc['2018-05-21':,:]
Out[102]:
            x
d
2018-05-21  2
2018-05-22  3
I am new to Python (from Matlab) and having some trouble with a simple task:
How can I create a regular time series from date X to date Y with intervals of Z units?
E.g. from 1st January 2013 to 31st January 2013 every 10 minutes
In Matlab:
t = datenum(2013,1,1):datenum(0,0,0,0,10,0):datenum(2013,12,31);
If you can use pandas then use pd.date_range
Ex:
import pandas as pd
d = pd.date_range(start='2013/1/1', end='2013/12/31', freq="10min")
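As a quick sanity check, here is the same call for the January range mentioned in the question (the end date below is taken from the question's text, not from the Matlab line). Both endpoints are included, so 30 days at 10-minute steps should give 30 * 144 + 1 points:

```python
import pandas as pd

# 1 January 2013 to 31 January 2013 (midnight), every 10 minutes.
d = pd.date_range(start="2013-01-01", end="2013-01-31", freq="10min")

step = d[1] - d[0]   # spacing between consecutive timestamps
n_points = len(d)    # both endpoints are included
print(d[0], d[-1], step, n_points)
```

If plain `datetime` objects are needed afterwards, `d.to_pydatetime()` returns them as a NumPy object array.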
This function will make an interval of anything, similar to range, but working also for non-integers:
def make_series(begin, end, interval):
    x = begin
    while x < end:
        yield x
        x = x + interval
You can then do this:
>>> import datetime
>>> date_x = datetime.datetime(2013,1,1)
>>> date_y = datetime.datetime(2013,12,31)
>>> step = datetime.timedelta(days=50)
>>>
>>> list(make_series(date_x, date_y, step))
[datetime.datetime(2013, 1, 1, 0, 0), datetime.datetime(2013, 2, 20, 0, 0),
 datetime.datetime(2013, 4, 11, 0, 0), datetime.datetime(2013, 5, 31, 0, 0),
 datetime.datetime(2013, 7, 20, 0, 0), datetime.datetime(2013, 9, 8, 0, 0),
 datetime.datetime(2013, 10, 28, 0, 0), datetime.datetime(2013, 12, 17, 0, 0)]
I picked a random date range, but here is how you can essentially get it done:
idx = pd.date_range('2018-2-6', '2018-3-4', freq='10Min')
a = pd.Series(range(len(idx)), index=idx)  # one value per timestamp
print(a)
You can add a date range as the index when creating the Series. The first argument to date_range() is the start date and the second is the end date. The freq argument sets the frequency, which is 10 minutes in this example. You can set it to 'H' ('h' in recent pandas versions) for 1-hour intervals, or to 'min' for 1-minute intervals; note that 'M' means month-end, not minutes. Another method worth noting is asfreq(), which returns a copy of the data conformed to a different frequency. Here is an example:
copy = a.asfreq('45Min', method='pad')
print(copy)
That's important if you want to study multiple frequencies. You can just copy your dataframe repeatedly in order to look at various time intervals. Hope that this answer helps.
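To make the effect of `asfreq()` with `method='pad'` concrete, here is a small sketch on a short, made-up series (10 points at 10-minute spacing, values 0 to 9; these numbers are assumptions chosen so the result is easy to follow):

```python
import numpy as np
import pandas as pd

# Hypothetical series: 00:00, 00:10, ..., 01:30 with values 0..9.
idx = pd.date_range("2018-02-06", periods=10, freq="10min")
a = pd.Series(np.arange(len(idx)), index=idx)

# Conform to a 45-minute grid; 'pad' carries the last known value forward
# to timestamps that do not exist in the original index (here 00:45).
b = a.asfreq("45min", method="pad")
print(b)
```

The new index is 00:00, 00:45 and 01:30. The value at 00:45 is taken from 00:40, the most recent original timestamp at or before it.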
I tried to create a data frame from lists:
lista_values = [2983, 2983, 5652, 12375, 13055, 26180]
labels = ['00_04', '04_08', '08_12', '12_16', '16_20', '20_24']
lista_index = [datetime.datetime(2017, 11, 11, 0, 0)]
df = pd.DataFrame(lista_values,index=lista_index,columns=labels)
then my error:
ValueError: Shape of passed values is (1, 6), indices imply (6, 1)
How can I create a data frame with these lists, for example? I really don't understand why the shapes don't match here if the values and labels have the same length.
Is this what you want? I flipped the dataframe with transpose (T).
import pandas as pd
import datetime
lista_values = [2983, 2983, 5652, 12375, 13055, 26180]
labels = ['00_04', '04_08', '08_12', '12_16', '16_20', '20_24']
lista_index = [datetime.datetime(2017, 11, 11, 0, 0)]
df = pd.DataFrame(data=lista_values, index=labels, columns = lista_index).T
df
Returns:
            00_04  04_08  08_12  12_16  16_20  20_24
2017-11-11   2983   2983   5652  12375  13055  26180
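An alternative that avoids the transpose, sketched here with the question's own lists: a flat list of six values is read as six rows of one column, which is why the original call complains about shapes. Wrapping the list in another list turns it into a single row of six columns, which matches one index entry and six column labels.

```python
import datetime
import pandas as pd

lista_values = [2983, 2983, 5652, 12375, 13055, 26180]
labels = ["00_04", "04_08", "08_12", "12_16", "16_20", "20_24"]
lista_index = [datetime.datetime(2017, 11, 11, 0, 0)]

# The outer list makes the six values one row instead of six rows.
df = pd.DataFrame([lista_values], index=lista_index, columns=labels)
print(df)
```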
I have an array of date time objects x and an array of y values corresponding to those datetimes. I'm trying to create a histogram which groups all those y values into the same bin by month. Basically adding all y values which are in the same month and creating a histogram which shows the total values for each month.
This is a simplified version of what my data looks like:
x = np.array([datetime.datetime(2014, 2, 1, 0, 0), datetime.datetime(2014, 2, 13, 0, 0),
              datetime.datetime(2014, 3, 4, 0, 0), datetime.datetime(2014, 3, 6, 0, 0)])
y = np.array([4, 3, 2, 6])
The end result should be a histogram showing month 2 in 2014 with y value 7 and month 3 in 2014 with y value 8.
The first thing I tried was creating a pandas dataframe out of my two arrays like so:
frame = pd.DataFrame({'x':x,'y':y})
This worked fine with x mapping to all datetime objects and y to all corresponding values. However after creating this dataframe I'm kind of lost on how to add all the y values by month and create bins out of these months using plt.hist()
First of all, thanks for a well-posed question with an example of your data.
This seems to be what you want:
import pandas as pd
import numpy as np
import datetime
%matplotlib inline
x = np.array([datetime.datetime(2014, 2, 1, 0, 0),
datetime.datetime(2014, 2, 13, 0, 0),
datetime.datetime(2014, 3, 4, 0, 0),
datetime.datetime(2014, 3, 6, 0, 0)])
y = np.array([4,3,2,6])
frame = pd.DataFrame({'x':x,'y':y})
(frame.set_index('x').                      # use date-time as index
       assign(month=lambda x: x.index.month).  # add new column with month
       groupby('month').                    # group by that column
       sum().                               # find a sum of the only column 'y'
       plot.bar())                          # make a barplot
Do This First
df = pd.DataFrame(dict(y=y), pd.DatetimeIndex(x, name='x'))
df
            y
x
2014-02-01  4
2014-02-13  3
2014-03-04  2
2014-03-06  6
Option 1
df.resample('M').sum().hist()
Option 2
# pd.TimeGrouper was removed in later pandas versions; pd.Grouper(freq=...) replaces it
df.groupby(pd.Grouper(freq='M')).sum().hist()
Or Do This First
df = pd.DataFrame(dict(x=pd.to_datetime(x), y=y))
df
           x  y
0 2014-02-01  4
1 2014-02-13  3
2 2014-03-04  2
3 2014-03-06  6
Option 3
df.resample('M', on='x').sum().hist()
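To check the numbers without plotting anything, the monthly totals can also be computed by grouping on a year-month period. This is a sketch of one possible variant, not the method used above; the Series name and the choice of `to_period("M")` are mine. Note that `.hist()` draws the distribution of the sums, so a bar plot of the totals is probably closer to what the question describes.

```python
import datetime
import pandas as pd

x = [datetime.datetime(2014, 2, 1), datetime.datetime(2014, 2, 13),
     datetime.datetime(2014, 3, 4), datetime.datetime(2014, 3, 6)]
y = [4, 3, 2, 6]

s = pd.Series(y, index=pd.DatetimeIndex(x, name="x"), name="y")

# Group by calendar month (year and month together) and add up the values.
monthly = s.groupby(s.index.to_period("M")).sum()
print(monthly)
# monthly.plot.bar() would then draw one bar per month (requires matplotlib).
```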
I have a Pandas time series with unevenly spaced dates/datapoints. I want to add 1 to the value of each data point that is the first value for each year.
The time series is very sparse and the data is sorted.
Is there a better way to do this than by looping through all data points and checking when the year changes?
Example:
dates = [datetime(2012, 1, 1, 1, 1), datetime(2012, 1, 1, 1, 2), datetime(2012, 1, 2, 0 ,0), datetime(2013, 1, 2, 0, 0), datetime(2014, 1, 3, 1, 1)]
ts = Series(np.random.randn(len(dates)), dates)
Using the example above I want to add 1 to value on 2012-01-01 01:01:00, 2013-01-02 00:00:00 and 2014-01-03 01:01:00
Sure. You can extract the year:
ts.index.year
Find where the adjacent difference is nonzero:
np.diff(ts.index.year) != 0
Remember that you also want to select the very first data point:
np.concatenate(([True], np.diff(ts.index.year) != 0))
And then modify those data points:
ts[np.concatenate(([True], np.diff(ts.index.year) != 0))] += 1
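Here are the steps above run end to end on the example dates. To make the result easy to verify, this sketch starts from zeros instead of random values (that substitution is my assumption, everything else follows the answer):

```python
from datetime import datetime
import numpy as np
import pandas as pd

dates = [datetime(2012, 1, 1, 1, 1), datetime(2012, 1, 1, 1, 2),
         datetime(2012, 1, 2, 0, 0), datetime(2013, 1, 2, 0, 0),
         datetime(2014, 1, 3, 1, 1)]
ts = pd.Series(np.zeros(len(dates)), index=dates)

# True where the year differs from the previous point; the first point always counts.
first_of_year = np.concatenate(([True], np.diff(ts.index.year) != 0))

ts[first_of_year] += 1
print(ts)
```

Only the entries at 2012-01-01 01:01, 2013-01-02 00:00 and 2014-01-03 01:01 are incremented. This relies on the series being sorted by date, as stated in the question.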