I have a dataframe with duplicated time indices and I would like to get the mean across all observations for the previous 2 days (I do not want to drop any observations; they are all information that I need). I've checked the pandas documentation and read previous posts on Stack Overflow (such as Apply rolling mean function on data frames with duplicated indices in pandas), but could not find a solution. Here's an example of what my data frame looks like and the output I'm looking for. Thank you in advance.
data:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],'t': [1, 2, 3, 2, 1, 2, 2, 3, 4],'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
output:
t   v2
1   -
2   -
3   4.167
4   5
5   6.667
A rough proposal: concatenate two copies of the input frame in which the values in 't' are shifted to 't+1' and 't+2' respectively. This way, the meaning of the column 't' becomes "the target day".
Setup:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],
't': [1, 2, 3, 2, 1, 2, 2, 3, 4],
'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
Implementation:
n = df.shape[0]
incr = pd.DataFrame({'id': [0]*n, 't': [1]*n, 'v1': [0]*n})  # adds +1 to 't' only
# each original row contributes to the means of target days t+1 and t+2
df2 = pd.concat([df + incr, df + incr + incr]).groupby('t').mean()
df2 = df2[1:-1]  # drop the target days that lack data from both of the 2 previous days
df2 = df2.rename(columns={'v1': 'v2'}).drop('id', axis=1)
Output:
v2
t
3 4.166667
4 5.000000
5 6.666667
Thank you for all the help. I ended up using groupby + rolling with a 2-day window ('2D'), and then dropping duplicates (keeping the last observation).
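For completeness, here is a minimal sketch of what that could look like. The poster's exact groupby key isn't shown, so this only covers the rolling window and the de-duplication, and the origin date used to turn the integer day 't' into timestamps is an arbitrary assumption. Note that a '2D' window includes the current day, so a further shift would still be needed to reproduce the "previous 2 days" table above exactly.
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 3, 3, 4, 4, 4],
                   't':  [1, 2, 3, 2, 1, 2, 2, 3, 4],
                   'v1': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# A time-based rolling window needs a datetime index; the origin below is an
# arbitrary assumption that only turns the integer day 't' into timestamps.
df['date'] = pd.to_datetime(df['t'], unit='D', origin='2021-01-01')
df = df.sort_values('date').set_index('date')

# rolling 2-day mean across all (duplicated) timestamps
rolled = df['v1'].rolling('2D').mean()

# keep only the last observation per timestamp
v2 = rolled[~rolled.index.duplicated(keep='last')]
print(v2)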
I am not understanding how to essentially say: columns=[0:6, 12:15]
When I try this I get "invalid syntax" at the colon.
import pandas as pd

data = pd.read_excel(rf'C:\Users\dusti\Desktop\bulk export.xlsx',
                     sheet_name=1,
                     header=None)
df = pd.DataFrame(data,
                  columns=[0, 1, 2, 3, 4, 5, 6, 12, 13, 14, 15])
df.to_csv(rf'C:\Users\dusti\Desktop\bulk export1.csv',
          header=False,
          index=False)
print(df)
What you are trying is slicing; it is used to select a subset of a list, not to build one inside square brackets.
You can use the range function to create the numbers and convert it to a list with the list function:
list(range(0,6+1)) + list(range(12,15+1))
# output:
[0, 1, 2, 3, 4, 5, 6, 12, 13, 14, 15]
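For what it's worth, that combined list can also be passed straight to read_excel through its usecols parameter, so only those columns are loaded in the first place. A minimal sketch reusing the paths from the question:
import pandas as pd

cols = list(range(0, 6 + 1)) + list(range(12, 15 + 1))   # [0..6] + [12..15]

# usecols accepts a list of column positions, so only these columns are read
df = pd.read_excel(rf'C:\Users\dusti\Desktop\bulk export.xlsx',
                   sheet_name=1,
                   header=None,
                   usecols=cols)
print(df)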
I have 5 columns in an array containing DateTime values, with 100k+ values in each column.
The values are the DateTime for a timeline going from 2015-01-15 00:30 to 2020-12-31 23:00 in 30-minute increments.
Basically, what I want to do is loop through the values in the array and check whether the current value is exactly one 30-minute timestep after the last value. In an array with columns, that would be the value directly above the current value being investigated.
There are probably a few ways to do this, but I've included a pseudocode sample of how I'm thinking about it:
for row in _the_whole_array:
    for cell in row:
        if cell == the 30-minute timestep of the cell above it:
            continue iterating
        else:
            store that value
return the smallest timestep found, and the biggest timestep found
I have looked at for loops as well as nditer, but I'm getting errors when iterating over datetimes, and I'm also wondering how I can access the cell value directly above the current cell.
Any help is hugely appreciated
I wasn't sure of your axis, even though you use rows in your pseudocode, but the overall idea is the same: if your dtype is datetime you can do basic arithmetic on it.
import numpy as np

x, y = _the_whole_array.shape

# differences between consecutive cells along each row
for row in range(x):
    diff = _the_whole_array[row][1:] - _the_whole_array[row][:-1]
    print(np.amin(diff), np.amax(diff))

# differences between consecutive cells along each column
for column in range(y):
    diff = _the_whole_array[:, column][1:] - _the_whole_array[:, column][:-1]
    print(np.amin(diff), np.amax(diff))
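As a follow-up to the arithmetic point: if the array already has a datetime64 dtype, the same idea works without an explicit loop via np.diff. A minimal sketch, using a small stand-in array; in practice you would apply it to _the_whole_array directly, assuming it is datetime64 with the timeline running down the columns:
import numpy as np

# a small stand-in for the real array: one column at 30-minute spacing
start = np.datetime64('2015-01-15T00:30')
stop = np.datetime64('2015-01-15T04:00')
the_array = np.arange(start, stop, np.timedelta64(30, 'm')).reshape(-1, 1)

# differences between each row and the row above it, per column (axis=0);
# subtracting datetime64 values yields timedelta64 values
diffs = np.diff(the_array, axis=0)

print(diffs.min(), diffs.max())                        # smallest and biggest timestep
print(bool((diffs == np.timedelta64(30, 'm')).all()))  # True only if every step is exactly 30 minutes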
As mentioned by #Heidiki, consider using pandas, because it makes processing large data much easier.
To address your question, you can create a temporary variable that stores the values of the previous row while looping. At every iteration, you calculate the differences and check for the overall min and max, exactly as in your pseudocode.
Here is an example with test data, a 3x5 matrix. Please check whether the output matches your requirement.
from datetime import datetime, timedelta
import pandas as pd
import numpy as np

# test data
arr = np.array([
    [datetime(2019, 1, 2, 13, 12), datetime(2019, 1, 2, 15, 19),
     datetime(2019, 1, 2, 15, 59), datetime(2019, 1, 2, 17, 23),
     datetime(2019, 1, 2, 15, 18)],
    [datetime(2019, 1, 2, 13, 34), datetime(2019, 1, 2, 15, 57),
     datetime(2019, 1, 2, 18, 53), datetime(2019, 1, 2, 17, 34),
     datetime(2019, 1, 2, 15, 29)],
    [datetime(2019, 1, 2, 13, 49), datetime(2019, 1, 2, 16, 35),
     datetime(2019, 1, 2, 21, 18), datetime(2019, 1, 2, 17, 59),
     datetime(2019, 1, 2, 15, 46)]
])

def timedelta_to_minutes(dt: timedelta) -> int:
    return (dt.days * 24 * 60) + (dt.seconds // 60)

def min_max_timestep(data: np.ndarray) -> tuple:
    prev_row = min_step = max_step = None
    df = pd.DataFrame(data)
    for idx, row in df.iterrows():
        if not idx:
            prev_row = row
            continue
        diff = row - prev_row
        min_diff, max_diff = min(diff), max(diff)
        if min_step is None or min_diff < min_step:
            min_step = min_diff
        if max_step is None or max_diff > max_step:
            max_step = max_diff
        prev_row = row
    # convert timedelta to minutes
    return timedelta_to_minutes(min_step), timedelta_to_minutes(max_step)

result = min_max_timestep(arr)
Try converting your NumPy array to a pandas DataFrame, as it allows you to iterate through its rows. Here is a code snippet to help you out.
import pandas as pd
import numpy as np

# creating a DataFrame from a NumPy array
df = pd.DataFrame(your_array)

# iterating through the dataframe
for index, row in df.iterrows():
    # print current row
    print(row)
    # print previous row (selected by position with iloc)
    print(df.iloc[index - 1]) if index != 0 else print('no previous row')
I can use np.select to insert a new column and set its values for a single DataFrame.
But after I combine both DataFrames, np.select no longer works; it seems to be an index/length error.
import pandas as pd
import numpy as np
df = pd.DataFrame([[3, 2, 1],[4, 5, 6]], columns=['col1','col2','col3'], index=['a','b'])
df2 = pd.DataFrame([[14, 15, 16],[17, 16, 15]], columns=['col1','col2','col3'], index=['c','e'])
count = df.append(df2)
print(count)
conditions = [
(df["col1"] >= df["col2"]) & (df["col2"] >= df["col3"]),
]
choices = [100]
count["col4"] = np.select(conditions,choices, default='WHAT')
count
This works for the single DataFrame.
After combining, however, it fails with this error:
ValueError: Length of values does not match length of index
I think there is a typo in your code when it comes to count vs df: the conditions must be built from count, the combined frame. The following code works fine.
import pandas as pd
import numpy as np
df = pd.DataFrame([[3, 2, 1],[4, 5, 6]], columns=['col1','col2','col3'], index=['a','b'])
df2 = pd.DataFrame([[14, 15, 16],[17, 16, 15]], columns=['col1','col2','col3'], index=['c','e'])
count = df.append(df2)
print(count)
conditions = [
(count["col1"] >= count["col2"]) & (count["col2"] >= count["col3"]),
]
print(conditions)
choices = [100]
count["col4"] = np.select(conditions,choices, default='WHAT')
count
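To make the cause of the error concrete, here is a small sketch of the length mismatch (using pd.concat, the non-deprecated equivalent of append): conditions built from the 2-row df force np.select to return only 2 values, which cannot be assigned as a column of the 4-row count frame.
import pandas as pd

df = pd.DataFrame([[3, 2, 1], [4, 5, 6]], columns=['col1', 'col2', 'col3'], index=['a', 'b'])
df2 = pd.DataFrame([[14, 15, 16], [17, 16, 15]], columns=['col1', 'col2', 'col3'], index=['c', 'e'])
count = pd.concat([df, df2])  # stacks the two frames, 4 rows total

# The conditions in the question are built from `df`, so they cover only its 2 rows;
# np.select(bad_conditions, [100], default='WHAT') would therefore yield 2 values,
# which cannot be assigned to the 4-row `count` frame.
bad_conditions = [(df["col1"] >= df["col2"]) & (df["col2"] >= df["col3"])]
print(len(bad_conditions[0]), len(count))  # 2 4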
I am new to Python (from Matlab) and having some trouble with a simple task:
How can I create a regular time series from date X to date Y with intervals of Z units?
E.g. from 1st January 2013 to 31st January 2013 every 10 minutes
In Matlab:
t = datenum(2013,1,1):datenum(0,0,0,0,10,0):datenum(2013,12,31);
If you can use pandas, then use pd.date_range.
Ex:
import pandas as pd
d = pd.date_range(start='2013/1/1', end='2013/12/31', freq="10min")
This generator function will make a series of anything, similar to range, but it also works for non-integers:
def make_series(begin, end, interval):
    x = begin
    while x < end:
        yield x
        x = x + interval
You can then do this:
>>> import datetime
>>> date_x = datetime.datetime(2013,1,1)
>>> date_y = datetime.datetime(2013,12,31)
>>> step = datetime.timedelta(days=50)
>>>
>>> list(make_series(date_x, date_y, step))
[datetime.datetime(2013, 1, 1, 0, 0), datetime.datetime(2013, 2, 20, 0, 0), datetime.datetime(2013, 4, 11, 0, 0), datetime.datetime(2013, 5, 31, 0, 0), datetime.datetime(2013, 7, 20, 0, 0), datetime.datetime(2013, 9, 8, 0, 0), datetime.datetime(2013, 10, 28, 0, 0), datetime.datetime(2013, 12, 17, 0, 0)]
I picked a random date range, but here is how you can essentially get it done:
idx = pd.date_range('2018-2-6', '2018-3-4', freq='10Min')  # every 10 minutes
a = pd.Series(range(len(idx)), index=idx)                  # dummy values aligned with the index
print(a)
You can attach a date range when creating the Series or DataFrame. The first argument you pass to date_range() is the start date, the second is the end date, and the freq argument sets the spacing, here 10 minutes. You can set it to 'H' for hourly or 'T' (or 'min') for 1-minute intervals as well ('M' would mean month-end). Another method worth noting is asfreq(), which lets you change the frequency or store a copy of a dataframe at a different frequency. Here is an example:
copy = a.asfreq('45Min', method='pad')
print(copy)
That's useful if you want to study multiple frequencies: you can resample copies of your dataframe to look at various time intervals. Hope that this answer helps.
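For reference, a quick sketch of the frequency aliases mentioned above (the specific dates are arbitrary):
import pandas as pd

# hourly, 1-minute and 10-minute spacing over the same short span
hourly = pd.date_range('2013-01-01', '2013-01-02', freq='H')
minutes = pd.date_range('2013-01-01 00:00', '2013-01-01 01:00', freq='min')
tens = pd.date_range('2013-01-01 00:00', '2013-01-01 01:00', freq='10min')
print(len(hourly), len(minutes), len(tens))  # 25 61 7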