I'm using Pandas 0.13.0 and I'm trying to do a sliding average based on the values of the index.
The index values are not equally distributed.
The index is sorted with increasing and unique values.
import pandas as pd
import quantities as pq

f = {
    'A': [0.0, 0.1, 0.2, 0.5, 1.0, 1.4, 1.5] * pq.m,
    'B': [10.0, 11.0, 12.0, 15.0, 20.0, 30.0, 50.0] * pq.kPa
}
df = pd.DataFrame(f)
df.set_index(df['A'], inplace=True)
The DataFrame gives:
in: print df
out:
A B
A
0.00 0.00 m 10.0 kPa
0.10 0.10 m 11.0 kPa
0.20 0.20 m 12.0 kPa
0.50 0.50 m 15.0 kPa
1.00 1.00 m 20.0 kPa
1.40 1.40 m 30.0 kPa
1.50 1.50 m 50.0 kPa
Now I would like to compute, for each index value x, the average of column B over the window from x to x+c, where c is a user-defined criterion.
For the sake of this example, c = 0.40.
The averaging process would give:
A B C
A
0.00 0.00 m 10.0 kPa 11.0 kPa = (10.0 + 11.0 + 12.0) / 3
0.10 0.10 m 11.0 kPa 12.7 kPa = (11.0 + 12.0 + 15.0) / 3
0.20 0.20 m 12.0 kPa 13.5 kPa = (12.0 + 15.0) / 2
0.50 0.50 m 15.0 kPa 15.0 kPa = (15.0) / 1
1.00 1.00 m 20.0 kPa 25.0 kPa = (20.0 + 30.0) / 2
1.40 1.40 m 30.0 kPa 40.0 kPa = (30.0 + 50.0) / 2
1.50 1.50 m 50.0 kPa 50.0 kPa = (50.0) / 1
Note that because the index values are not evenly spaced, sometimes x+c won't be found exactly. That is OK for now, though I will definitely add a way to estimate the value at x+c from the values just before and just after it, so I get a more accurate average.
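Here is a rough sketch of the interpolation refinement I have in mind (assuming the index and B are available as plain float arrays, with the Quantities units stripped):
import numpy as np

# hypothetical refinement: estimate B at the exact window edge x + c by linear
# interpolation between the neighbouring samples, instead of dropping it
a = np.array([0.0, 0.1, 0.2, 0.5, 1.0, 1.4, 1.5])         # index values
b = np.array([10.0, 11.0, 12.0, 15.0, 20.0, 30.0, 50.0])  # column B
x, c = 0.2, 0.4
edge_value = np.interp(x + c, a, b)  # 16.0, interpolated between B(0.5)=15 and B(1.0)=20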
I tried the solution found here from Zelazny7:
pandas rolling computation with window based on values instead of counts
But I can't make it work for my case, where the search is made on the index.
I also looked at:
Pandas Rolling Computations on Sliding Windows (Unevenly spaced)
But I don't understand how to apply it to my case.
Any idea how to solve this problem with an efficient pandas approach (using apply, map or rolling)?
Thanks.
What you needed to do, from the answer you linked to, was to turn the index into a Series so you can then call apply on it. The other key thing is that you also have to give the constructed Series the same index as your df, because the default is to create a fresh integer index (0, 1, 2, 3, ...).
In [26]:
def f(x, c):
    ser = df.loc[(df.index >= x) & (df.index <= x + c), 'B']
    return ser.mean()

df['C'] = pd.Series(data=df.index, index=df.index).apply(lambda x: f(x, c=0.4))
df
Out[26]:
A B C
A
0.0 0.0 10 11.000000
0.1 0.1 11 12.666667
0.2 0.2 12 13.500000
0.5 0.5 15 15.000000
1.0 1.0 20 25.000000
1.4 1.4 30 40.000000
1.5 1.5 50 50.000000
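If apply over the index becomes too slow on a much larger frame, a vectorized sketch along these lines should give the same forward-window means. This is a sketch only: it assumes the units have been stripped so that the index and column B are plain floats, and forward_window_mean is just an illustrative name.
import numpy as np
import pandas as pd

def forward_window_mean(frame, col='B', c=0.4):
    idx = np.asarray(frame.index, dtype=float)
    vals = np.asarray(frame[col], dtype=float)
    left = np.arange(len(idx))                            # each window starts at its own row
    right = np.searchsorted(idx, idx + c, side='right')   # first row beyond x + c (window is inclusive)
    csum = np.concatenate(([0.0], np.cumsum(vals)))       # prefix sums of the column
    return pd.Series((csum[right] - csum[left]) / (right - left), index=frame.index)

df['C'] = forward_window_mean(df)  # add a tiny tolerance to c if float noise matters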
Related
A snippet of the dataframe is as follows, but the actual dataset is 200000 x 130.
ID 1-jan 2-jan 3-jan 4-jan
1. 4 5 7 8
2. 2 0 1 9
3. 5 8 0 1
4. 3 4 0 0
I am trying to compute Mean Absolute Deviation for each row value like this.
ID 1-jan 2-jan 3-jan 4-jan mean
1. 4 5 7 8 12.5
1_MAD 8.5 7.5 5.5 4.5
2. 2 0 1 9 6
2_MAD 4 6 5 3
.
.
I tried this,
new_df = pd.DataFrame()
for rows in df['ID']:
    new_df[str(rows) + '_mad'] = mad(df.loc[rows][1:])
new_df.T
where mad is a function that compares the mean to each value.
But this is very time consuming, since I have a large dataset and I need to do it as quickly as possible.
pd.concat([df1.assign(mean1=df1.mean(axis=1)).set_index(df1.index.astype('str')),
           df1.assign(mean1=df1.mean(axis=1))
              .apply(lambda ss: ss.mean1 - ss, axis=1)
              .T.add_suffix('_MAD').T
              .assign(mean1='')]).sort_index().pipe(print)
1-jan 2-jan 3-jan 4-jan mean1
ID
1.0 4.00 5.00 7.00 8.00 6.0
1.0_MAD 2.00 1.00 -1.00 -2.00
2.0 2.00 0.00 1.00 9.00 3.0
2.0_MAD 1.00 3.00 2.00 -6.00
3.0 5.00 8.00 0.00 1.00 3.5
3.0_MAD -1.50 -4.50 3.50 2.50
4.0 3.00 4.00 0.00 0.00 1.75
4.0_MAD -1.25 -2.25 1.75 1.75
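Note that the deviations above keep their sign; if you want absolute deviations, as "Mean Absolute Deviation" suggests, a hedged tweak (assuming df1 is the ID-indexed frame used above) is to wrap the difference in abs():
devs = df1.assign(mean1=df1.mean(axis=1)).apply(lambda ss: (ss.mean1 - ss).abs(), axis=1)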
IIUC use:
#convert ID to index
df = df.set_index('ID')
#mean to Series
mean = df.mean(axis=1)
from toolz import interleave
#subtract all columns by mean, add suffix
df1 = df.sub(mean, axis=0).abs().rename(index=lambda x: f'{x}_MAD')
#join with original with mean and interleave indices
df = pd.concat([df.assign(mean=mean), df1]).loc[list(interleave([df.index, df1.index]))]
print (df)
1-jan 2-jan 3-jan 4-jan mean
ID
1.0 4.00 5.00 7.00 8.00 6.00
1.0_MAD 2.00 1.00 1.00 2.00 NaN
2.0 2.00 0.00 1.00 9.00 3.00
2.0_MAD 1.00 3.00 2.00 6.00 NaN
3.0 5.00 8.00 0.00 1.00 3.50
3.0_MAD 1.50 4.50 3.50 2.50 NaN
4.0 3.00 4.00 0.00 0.00 1.75
4.0_MAD 1.25 2.25 1.75 1.75 NaN
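If you would rather avoid the toolz dependency, the interleaving step can be sketched with plain zip (using the same df, df1 and mean as above, before df is overwritten by the concat):
#pair each data row label with its _MAD label, then flatten
order = [label for pair in zip(df.index, df1.index) for label in pair]
df = pd.concat([df.assign(mean=mean), df1]).loc[order]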
It's possible to specify axis=1 to apply the mean calculation across columns:
df['mean_across_cols'] = df.mean(axis=1)
I'm struggling to create a pandas sub-dataframe based on values of the primary dataframe. The primary dataframe, which will contain millions of rows, is set up as follows:
import pandas as pd
import numpy as np
def get_range(start, stop, step):
    return np.linspace(start, stop, int((stop - start) / step) + 1)
range1 = get_range(10, 10, 10)
range2 = get_range(10, 100, 10)
range3 = get_range(1.0, 1.0, 1.0)
range4 = get_range(7, 100, 1)
range5 = get_range(1.0, 1.0, 1.0)
range6 = get_range(0.2, 2.0, 0.01)
range7 = get_range(0.2, 2.0, 0.01)
df = pd.DataFrame(
    index=pd.MultiIndex.from_product(
        [
            range1,
            range2,
            range3,
            range4,
            range5,
            range6,
            range7,
        ],
        names=["col1", "col2", "col3", "col4", "col5", "col6", "col7"],
    )
).reset_index()
The above results in the following dataframe (30M+ rows), which is created very quickly, in about 5-6 seconds:
col1 col2 col3 col4 col5 col6 col7
0 10.0 10.0 1.0 7.0 1.0 0.2 0.20
1 10.0 10.0 1.0 7.0 1.0 0.2 0.21
2 10.0 10.0 1.0 7.0 1.0 0.2 0.22
3 10.0 10.0 1.0 7.0 1.0 0.2 0.23
4 10.0 10.0 1.0 7.0 1.0 0.2 0.24
... ... ... ... ... ... ... ...
30795335 10.0 100.0 1.0 100.0 1.0 2.0 1.96
30795336 10.0 100.0 1.0 100.0 1.0 2.0 1.97
30795337 10.0 100.0 1.0 100.0 1.0 2.0 1.98
30795338 10.0 100.0 1.0 100.0 1.0 2.0 1.99
30795339 10.0 100.0 1.0 100.0 1.0 2.0 2.00
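As a quick sanity check on the size of the cartesian product (this should match the row count printed above):
import numpy as np

sizes = [len(r) for r in (range1, range2, range3, range4, range5, range6, range7)]
print(int(np.prod(sizes)))  # expected to equal the ~30.8 million rows shown above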
Now, based on these values, I want to calculate a new table that will have the number of rows given in 'col4'. So for every row the table is created dynamically. Below is the function that creates the table with all the values I need. In this function I calculate only one value, because once I understand the solution, I can complete the whole table by myself.
def calculate_table(col4, col5, col6):
    table = []
    prev_value = 0
    for msotc in range(1, col4 + 1, 1):
        max_value = (prev_value * col6) + col5
        if max_value < 100:
            table.append(max_value)
        else:
            table.append("*")
        prev_value = max_value
    return table
So, for example, with the following values from the primary dataframe:
col4 = 7
col5 = 1
col6 = 2
The calculated table will be:
[0, 1, 3, 7, 15, 31, 63, '*']
t_col1 t_col2 t_col3 etc.
0 0.0
1 1.0
2 3.0
3 7.0
4 15.0
5 31.0
6 63.0
7 0
So after the calculation for every row, the result can be added to e.g. 'col8', and I want to be able to search with something like:
e.g. select every row where ['col8']['t_col1'] > 30 and ['col2'] > 50
which will result in a selection on the primary dataframe.
Because of the huge number of rows for which the sub table/dataframe has to be calculated, I'm really looking for the fastest way to do this. So the function itself may need to be rewritten to be as fast as possible, and the way the function is applied to every row is also important.
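One direction I am considering: the recurrence max_value = prev_value * col6 + col5, starting from 0, has a closed form, which might make it vectorisable. A rough sketch, assuming col6 != 1 and that col5 and col6 are plain per-row numbers:
import numpy as np

col4, col5, col6 = 7, 1, 2
k = np.arange(1, col4 + 1)
vals = col5 * (col6 ** k.astype(float) - 1) / (col6 - 1)  # closed form of the recurrence
table = [v if v < 100 else "*" for v in vals]
# gives [1.0, 3.0, 7.0, 15.0, 31.0, 63.0, '*'], matching the loop version above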
I have some earthquake data. I have a Magnitude, Distance, and Percent that I care about. I want to group all of the magnitudes together and sum the distances and percents for each magnitude. Here is part of my data:
import pandas as pd
data = {'Distance': [1, 5, 9, 3, 5, 4, 2, 3.1],
'Magnitude': [7.3, 7.3, 7.3, 6.0, 8.2, 6.0, 8.2, 5.7],
'Percent': [0.1, 0.05, 0.07, 0.11, 0.2, 0.07, 0.08,0.11]
}
df = pd.DataFrame(data)
print(df)
Distance Magnitude Percent
0 1.0 7.3 0.10
1 5.0 7.3 0.05
2 9.0 7.3 0.07
3 3.0 6.0 0.11
4 5.0 8.2 0.20
5 4.0 6.0 0.07
6 2.0 8.2 0.08
7 3.1 5.7 0.11
My idea was this. Groupby and sum:
df2 = df.groupby(['Distance','Magnitude','Percent'],as_index=False).agg({'Percent': 'sum'},{'Distance': 'sum'})
I get the same dataframe when I run my code, except it is sorted ascending by distance (which is fine), but nothing is grouped together or summed.
I want it to look like this:
Distance Magnitude Percent
0 3.1 5.7 0.11
1 7.0 6.0 0.18
2 15.0 7.3 0.22
3 7.0 8.2 0.28
There is only 1 value for each magnitude and the distances and percents have been summed for each magnitude.
This will do the task; you just need to group by Magnitude only:
df.groupby(by=["Magnitude"]).sum()
Output
Distance Percent
Magnitude
5.7 3.1 0.11
6.0 7.0 0.18
7.3 15.0 0.22
8.2 7.0 0.28
Or, to prevent Magnitude from becoming the index, as per @lsr729, you can use this as well:
df.groupby(by=["Magnitude"], as_index=False).sum()
Output2
Magnitude Distance Percent
0 5.7 3.1 0.11
1 6.0 7.0 0.18
2 7.3 15.0 0.22
3 8.2 7.0 0.28
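If you only want particular columns summed, a hedged variant passes agg a single column-to-function mapping:
df.groupby('Magnitude', as_index=False).agg({'Distance': 'sum', 'Percent': 'sum'})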
I am trying to figure out a way to generate a 'z score' from a pandas df for use in a calendar heatmap.
Here is a general example of what I'm trying to emulate. It shows the day of the week along the 'x' axis and the weeks along the 'y' axis. Each date has a numerical 'z score' assigned to it, and creating this z score is where I'm running into trouble.
My df is created from a csv file listing several different tasks with the following columns and some example data:
Job,Tool,Start,End
A,Hammer,2020-10-03,2020-11-02
A,Drill,2020-11-05,2020-12-02
A,Hammer,2020-12-03,2020-12-30
This data works well for a gantt chart, but it needs to be modified a bit for use with a heatmap. I have been able to use pandas to generate just the dates that matter:
def calendarmap():
    d1 = min(dff['Start'])
    d2 = max(dff['End'])
    delta = d2 - d1
    dates_that_matter = [d1 + dt.timedelta(i) for i in range(delta.days + 1)]
    # etc.
Regardless of the heatmap method used (sns, go.Heatmap, etc.), I need to create a list of z scores corresponding to the tool used.
fig.add_trace(go.Heatmap(z=z, x=x, y=y))
I would like to write a simple script that:
1. Iterates through my dates_that_matter.
2. Checks whether that date falls between the Start and End dates for every row in my df.
3. If the date is present in my df, writes a z score to a list corresponding to each unique tool. With this example data I would be happy with Hammer = 0.5 and Drill = 1.0.
4. If the date is not present, assigns a z score of 0. The date will still be present, but it should reflect that there is no job on that day.
5. Tolerates a different number of tools. In this example there are 3 z score states (0 = none, 0.5 = hammer, 1.0 = drill), but the number of z score states will likely fluctuate between 2 and 10.
Steps 2 and 5 are the parts that are challenging to me at the moment. Any help with this would be greatly appreciated. Thank you.
Only the data creation is answered here.
Process flow:
From each row of the original data frame, create a data frame covering each date from Start to End and append it to a new data frame (creation of "long" data).
Add a workload column.
Aggregate the amount of work by date.
Add the missing dates (dfs.reindex()).
Add columns for the week of the month, the day of the week, and the month.
This completes the graph data.
By the way, for verification, I transformed it into a horizontal format with month and day columns like a calendar.
dfs = pd.DataFrame()
for idx, row in df.iterrows():
    tmp_date = pd.date_range(row['Start'], row['End'], freq='1D')
    tmp_df = pd.DataFrame({'Date': pd.to_datetime(tmp_date), 'Job': row['Job'], 'Tool': row['Tool']})
    dfs = dfs.append(tmp_df, ignore_index=True)
dfs['workload'] = dfs['Tool'].apply(lambda x: 1 if x == 'Drill' else 0.5 if x == 'Hammer' else 0.75)
dfs.set_index('Date', inplace=True)
dfs = dfs.groupby(dfs.index)['workload'].sum().to_frame()
dfs = dfs.reindex(pd.date_range(dfs.index.min(), dfs.index.max(), freq='1D',name='Date'), fill_value=0, axis='index')
dfs.reset_index(inplace=True)
import calendar

def getNweek(x):
    first_dayofweek = calendar.monthrange(x.year, x.month)[0]
    offset = (first_dayofweek - 6) % 7
    return (x.day + offset - 1) // 7 + 1
dfs['nweek'] = dfs['Date'].apply(lambda x: getNweek(x))
dfs['month'] = dfs['Date'].dt.month
dfs['dayofweek'] = dfs['Date'].dt.dayofweek
dfs.head()
Date workload nweek month dayofweek
0 2020-10-03 0.5 1 10 5
1 2020-10-04 0.5 2 10 6
2 2020-10-05 0.5 2 10 0
3 2020-10-06 0.5 2 10 1
4 2020-10-07 0.5 2 10 2
dfs = dfs.pivot(index='nweek', columns=['month', 'dayofweek'], values='workload')
import itertools
dow = [6,0,1,2,3,4,5]
m = [10,11,12]
new_cols = list(itertools.product(m,dow))
dfs.reindex(new_cols, axis=1)
month 10 11 12
dayofweek 6 0 1 2 3 4 5 6 0 1 ... 3 4 5 6 0 1 2 3 4 5
nweek
1 NaN NaN NaN NaN NaN NaN 0.50 1.25 1.25 0.0 ... 1.0 1.0 1.0 NaN NaN 2.0 2.0 0.5 0.5 0.5
2 0.50 0.50 0.50 0.50 0.50 0.50 0.50 1.00 1.00 1.0 ... 1.0 1.0 1.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5
3 0.50 0.50 0.50 0.50 0.50 0.50 0.50 1.00 1.00 1.0 ... 1.0 1.0 1.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5
4 0.50 0.50 0.50 0.50 0.50 1.25 1.25 1.00 1.00 1.0 ... 2.0 2.0 2.0 0.5 0.5 0.5 1.0 1.0 1.0 1.0
5 1.25 1.25 1.25 1.25 1.25 1.25 1.25 2.00 2.00 NaN ... NaN NaN NaN 1.0 1.0 1.0 1.0 NaN NaN NaN
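As a hedged sketch only (not part of the data-creation answer above), the reordered pivot table could then be fed to the go.Heatmap call from the question along these lines:
import plotly.graph_objects as go

ordered = dfs.reindex(new_cols, axis=1)            # calendar-ordered columns from above
fig = go.Figure(go.Heatmap(
    z=ordered.values,                              # 2-D workload matrix (NaN = no data)
    x=[f'{m}-{d}' for m, d in ordered.columns],    # "month-dayofweek" labels
    y=ordered.index,                               # week of the month
))
fig.show()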
This is my first question on stackoverflow. Go easy on me!
I have two data sets acquired simultaneously by different acquisition systems with different sampling rates. One is very regular, and the other is not. I would like to create a single dataframe containing both data sets, using the regularly spaced timestamps (in seconds) as the reference for both. The irregularly sampled data should be interpolated on the regularly spaced timestamps.
Here's some toy data demonstrating what I'm trying to do:
import pandas as pd
import numpy as np
# evenly spaced times
t1 = np.array([0,0.5,1.0,1.5,2.0])
y1 = t1
# unevenly spaced times
t2 = np.array([0,0.34,1.01,1.4,1.6,1.7,2.01])
y2 = 3*t2
df1 = pd.DataFrame(data={'y1':y1,'t':t1})
df2 = pd.DataFrame(data={'y2':y2,'t':t2})
df1 and df2 look like this:
df1:
t y1
0 0.0 0.0
1 0.5 0.5
2 1.0 1.0
3 1.5 1.5
4 2.0 2.0
df2:
t y2
0 0.00 0.00
1 0.34 1.02
2 1.01 3.03
3 1.40 4.20
4 1.60 4.80
5 1.70 5.10
6 2.01 6.03
I'm trying to merge df1 and df2, interpolating y2 on df1.t. The desired result is:
df_combined:
t y1 y2
0 0.0 0.0 0.0
1 0.5 0.5 1.5
2 1.0 1.0 3.0
3 1.5 1.5 4.5
4 2.0 2.0 6.0
I've been reading documentation for pandas.resample, as well as searching previous stackoverflow questions, but haven't been able to find a solution to my particular problem. Any ideas? Seems like it should be easy.
UPDATE:
I figured out one possible solution: interpolate the second series first, then append to the first data frame:
from scipy.interpolate import interp1d
f2 = interp1d(t2,y2,bounds_error=False)
df1['y2'] = f2(df1.t)
which gives:
df1:
t y1 y2
0 0.0 0.0 0.0
1 0.5 0.5 1.5
2 1.0 1.0 3.0
3 1.5 1.5 4.5
4 2.0 2.0 6.0
That works, but I'm still open to other solutions if there's a better way.
If you construct a single DataFrame from Series, using time values as index, like this:
>>> t1 = np.array([0, 0.5, 1.0, 1.5, 2.0])
>>> y1 = pd.Series(t1, index=t1)
>>> t2 = np.array([0, 0.34, 1.01, 1.4, 1.6, 1.7, 2.01])
>>> y2 = pd.Series(3*t2, index=t2)
>>> df = pd.DataFrame({'y1': y1, 'y2': y2})
>>> df
y1 y2
0.00 0.0 0.00
0.34 NaN 1.02
0.50 0.5 NaN
1.00 1.0 NaN
1.01 NaN 3.03
1.40 NaN 4.20
1.50 1.5 NaN
1.60 NaN 4.80
1.70 NaN 5.10
2.00 2.0 NaN
2.01 NaN 6.03
You can simply interpolate it, and select only the part where y1 is defined:
>>> df.interpolate('index').reindex(y1)
y1 y2
0.0 0.0 0.0
0.5 0.5 1.5
1.0 1.0 3.0
1.5 1.5 4.5
2.0 2.0 6.0
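Equivalently, starting from the df1/df2 frames in the question, a hedged variant of the same idea is to align both series on the union of their time bases and interpolate on the index:
combined = pd.concat([df1.set_index('t')['y1'], df2.set_index('t')['y2']], axis=1)
result = combined.interpolate(method='index').loc[df1['t']]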
It's not exactly clear to me how you're getting rid of some of the values in y2, but it seems like if there is more than one for a given timepoint, you only want the first one. Also, it seems like your time values should be in the index. I also added column labels. It looks like this:
import pandas as pd
# evenly spaced times
t1 = [0,0.5,1.0,1.5,2.0]
y1 = t1
# unevenly spaced times
t2 = [0,0.34,1.01,1.4,1.6,1.7,2.01]
# round t2 values to the nearest half
new_t2 = [round(num * 2)/2 for num in t2]
# set y2 values
y2 = [3*z for z in new_t2]
# eliminate consecutive entries that rounded to the same index value (keep the first)
for x in range(len(new_t2) - 1, 0, -1):
    if new_t2[x] == new_t2[x-1]:
        del new_t2[x]
        del y2[x]
ser1 = pd.Series(y1, index=t1)
ser2 = pd.Series(y2, index=new_t2)
df = pd.concat((ser1, ser2), axis=1)
df.columns = ('Y1', 'Y2')
print df
This prints:
Y1 Y2
0.0 0.0 0.0
0.5 0.5 1.5
1.0 1.0 3.0
1.5 1.5 4.5
2.0 2.0 6.0