This is my first question on stackoverflow. Go easy on me!
I have two data sets acquired simultaneously by different acquisition systems with different sampling rates. One is very regular, and the other is not. I would like to create a single dataframe containing both data sets, using the regularly spaced timestamps (in seconds) as the reference for both. The irregularly sampled data should be interpolated on the regularly spaced timestamps.
Here's some toy data demonstrating what I'm trying to do:
import pandas as pd
import numpy as np
# evenly spaced times
t1 = np.array([0,0.5,1.0,1.5,2.0])
y1 = t1
# unevenly spaced times
t2 = np.array([0,0.34,1.01,1.4,1.6,1.7,2.01])
y2 = 3*t2
df1 = pd.DataFrame(data={'y1':y1,'t':t1})
df2 = pd.DataFrame(data={'y2':y2,'t':t2})
df1 and df2 look like this:
df1:
t y1
0 0.0 0.0
1 0.5 0.5
2 1.0 1.0
3 1.5 1.5
4 2.0 2.0
df2:
t y2
0 0.00 0.00
1 0.34 1.02
2 1.01 3.03
3 1.40 4.20
4 1.60 4.80
5 1.70 5.10
6 2.01 6.03
I'm trying to merge df1 and df2, interpolating y2 on df1.t. The desired result is:
df_combined:
t y1 y2
0 0.0 0.0 0.0
1 0.5 0.5 1.5
2 1.0 1.0 3.0
3 1.5 1.5 4.5
4 2.0 2.0 6.0
I've been reading documentation for pandas.resample, as well as searching previous stackoverflow questions, but haven't been able to find a solution to my particular problem. Any ideas? Seems like it should be easy.
UPDATE:
I figured out one possible solution: interpolate the second series first, then append to the first data frame:
from scipy.interpolate import interp1d

# linear interpolant for the irregular series; returns NaN outside t2's range
f2 = interp1d(t2, y2, bounds_error=False)
df1['y2'] = f2(df1.t)
which gives:
df1:
t y1 y2
0 0.0 0.0 0.0
1 0.5 0.5 1.5
2 1.0 1.0 3.0
3 1.5 1.5 4.5
4 2.0 2.0 6.0
That works, but I'm still open to other solutions if there's a better way.
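For completeness, a NumPy-only variant of the same idea also works (a sketch using the arrays defined above; unlike interp1d with bounds_error=False, np.interp clamps to the first/last y2 value outside the range of t2 rather than returning NaN):
# linear interpolation of y2 at the regular timestamps t1
df1['y2'] = np.interp(df1.t, t2, y2)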
If you construct a single DataFrame from Series, using time values as index, like this:
>>> t1 = np.array([0, 0.5, 1.0, 1.5, 2.0])
>>> y1 = pd.Series(t1, index=t1)
>>> t2 = np.array([0, 0.34, 1.01, 1.4, 1.6, 1.7, 2.01])
>>> y2 = pd.Series(3*t2, index=t2)
>>> df = pd.DataFrame({'y1': y1, 'y2': y2})
>>> df
y1 y2
0.00 0.0 0.00
0.34 NaN 1.02
0.50 0.5 NaN
1.00 1.0 NaN
1.01 NaN 3.03
1.40 NaN 4.20
1.50 1.5 NaN
1.60 NaN 4.80
1.70 NaN 5.10
2.00 2.0 NaN
2.01 NaN 6.03
You can simply interpolate it (using method='index', so interpolation is done against the index values rather than row positions), and then select only the rows where y1 is defined:
>>> df.interpolate('index').reindex(t1)
y1 y2
0.0 0.0 0.0
0.5 0.5 1.5
1.0 1.0 3.0
1.5 1.5 4.5
2.0 2.0 6.0
It's not exactly clear to me how you're getting rid of some of the values in y2, but it seems like if there is more than one for a given timepoint, you only want the first one. Also, it seems like your time values should be in the index. I also added column labels. It looks like this:
import pandas as pd

# evenly spaced times
t1 = [0, 0.5, 1.0, 1.5, 2.0]
y1 = t1
# unevenly spaced times
t2 = [0, 0.34, 1.01, 1.4, 1.6, 1.7, 2.01]
# round t2 values to the nearest half
new_t2 = [round(num * 2) / 2 for num in t2]
# set y2 values
y2 = [3 * z for z in new_t2]
# eliminate entries that have the same index value (keep the first)
for x in range(len(new_t2) - 1, 0, -1):
    if new_t2[x] == new_t2[x - 1]:
        del new_t2[x]
        del y2[x]
ser1 = pd.Series(y1, index=t1)
ser2 = pd.Series(y2, index=new_t2)
df = pd.concat((ser1, ser2), axis=1)
df.columns = ('Y1', 'Y2')
print(df)
This prints:
     Y1   Y2
0.0  0.0  0.0
0.5  0.5  1.5
1.0  1.0  3.0
1.5  1.5  4.5
2.0  2.0  6.0
I'm struggling to create a pandas sub-dataframe based on values of the primary dataframe. The primary dataframe, which will contain millions of rows, is set up as follows:
import pandas as pd
import numpy as np
def get_range(start, stop, step):
    return np.linspace(start, stop, int((stop - start) / step) + 1)
range1 = get_range(10, 10, 10)
range2 = get_range(10, 100, 10)
range3 = get_range(1.0, 1.0, 1.0)
range4 = get_range(7, 100, 1)
range5 = get_range(1.0, 1.0, 1.0)
range6 = get_range(0.2, 2.0, 0.01)
range7 = get_range(0.2, 2.0, 0.01)
df = pd.DataFrame(
    index=pd.MultiIndex.from_product(
        [
            range1,
            range2,
            range3,
            range4,
            range5,
            range6,
            range7,
        ],
        names=["col1", "col2", "col3", "col4", "col5", "col6", "col7"],
    )
).reset_index()
The above results in the following dataframe (30M+ rows), which is created very quickly, in about 5-6 seconds:
col1 col2 col3 col4 col5 col6 col7
0 10.0 10.0 1.0 7.0 1.0 0.2 0.20
1 10.0 10.0 1.0 7.0 1.0 0.2 0.21
2 10.0 10.0 1.0 7.0 1.0 0.2 0.22
3 10.0 10.0 1.0 7.0 1.0 0.2 0.23
4 10.0 10.0 1.0 7.0 1.0 0.2 0.24
... ... ... ... ... ... ... ...
30795335 10.0 100.0 1.0 100.0 1.0 2.0 1.96
30795336 10.0 100.0 1.0 100.0 1.0 2.0 1.97
30795337 10.0 100.0 1.0 100.0 1.0 2.0 1.98
30795338 10.0 100.0 1.0 100.0 1.0 2.0 1.99
30795339 10.0 100.0 1.0 100.0 1.0 2.0 2.00
Now, based on these values, I want to calculate a new table that will have the number of rows given in 'col4'. So for every row the table is dynamically created. Below is the function that will create the table with all the values I need. In this function I calculate only one value, because if I understand the solution I want, I can complete the whole table by myself.
def calculate_table(col4, col5, col6):
    table = []
    prev_value = 0
    for msotc in range(1, col4 + 1, 1):
        max_value = (prev_value * col6) + col5
        if max_value < 100:
            table.append(max_value)
        else:
            table.append("*")
        prev_value = max_value
    return table
So e.g. with the following values from the primary dataframe:
col4 = 7
col5 = 1
col6 = 2
The calculated table will be:
[0, 1, 3, 7, 15, 31, 63, '*']
t_col1 t_col2 t_col3 etc.
0 0.0
1 1.0
2 3.0
3 7.0
4 15.0
5 31.0
6 63.0
7 0
So after the calculation for every row, the result can be added to e.g. 'col8'. Then I want to be able to search something like:
e.g. select every row where ['col8']['t_col1'] > 30 and ['col2'] > 50
which will result in a selection on the primary dataframe.
Because of the huge number of rows for which the sub table/dataframe has to be calculated, I'm really looking for the fastest way to do this. So maybe the function needs to be rewritten to be as fast as possible, and the way the function is applied to every row is also important.
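To make the goal concrete, the straightforward row-wise apply would look roughly like this (just a sketch of the slow baseline that needs to be beaten; 'col8' is the column name from the description above):
# naive baseline: build the sub-table for every row with a Python-level apply
# (works, but is slow for 30M+ rows)
df['col8'] = df.apply(
    lambda r: calculate_table(int(r['col4']), r['col5'], r['col6']),
    axis=1,
)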
I am trying to figure out a way to generate a 'z score' from a pandas df for use in a calendar heatmap.
Here is a general example of what I'm trying to emulate. It shows the day of the week along the x axis and weeks along the y axis. Each date has a numerical value ('z score') assigned to it, and creating this z score is where I'm running into trouble.
My df is created from a csv file listing several different tasks with the following columns and some example data:
Job,Tool,Start,End
A,Hammer,2020-10-03,2020-11-02
A,Drill,2020-11-05,2020-12-02
A,Hammer,2020-12-03,2020-12-30
This data works well for a gantt chart, but it needs to be modified a bit for use with a heatmap. I have been able to use pandas to generate just the dates that matter:
import datetime as dt

def calendarmap():
    d1 = min(dff['Start'])
    d2 = max(dff['End'])
    delta = d2 - d1
    dates_that_matter = [d1 + dt.timedelta(i) for i in range(delta.days + 1)]
    # etc.
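(For context, dff is assumed here to be the frame loaded from the CSV shown above with the date columns parsed; the file name is just a placeholder:)
import pandas as pd

dff = pd.read_csv('jobs.csv', parse_dates=['Start', 'End'])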
Regardless of the heatmap method used (sns, go.heatmap, etc), I need to create a list that corresponds to the tool used (z score).
fig.add_trace(go.Heatmap(z=z, x=x, y=y))
I would like to write a simple script that:
1. Iterates through my dates_that_matter
2. Checks to see if that date falls between the Start and End dates of any row in my df
3. If the date is present in my df, it should write a z score to a list corresponding to each unique tool. With this example data I would be happy with Hammer = 0.5 and Drill = 1.0.
4. If the date is not present, the z score assigned should be 0. The date will still be present, but it should reflect that there is no job on that day.
5. Tolerate a different number of tools. In this example there are 3 z score states (0 = none, 0.5 = hammer, and 1.0 = drill), but the number of z score states will likely fluctuate between 2 and 10.
Steps 2 and 5 are the parts that are challenging to me at the moment. Any help with this would be greatly appreciated. Thank you.
Only the data creation part is answered here.
Process flow:
From each row of the original data frame, create a data frame covering the dates from Start to End, and append it to a new data frame (building the data in long/vertical form).
Add a workload column.
Aggregate the workload by date.
Add the missing dates (dfs.reindex()).
Add columns for the week of the month, day of the week, and month.
This completes the graph data.
By the way, for verification, I transformed it into a horizontal format with month and day columns like a calendar.
# build one row per day for each job, then stack them vertically
frames = []
for idx, row in df.iterrows():
    tmp_date = pd.date_range(row['Start'], row['End'], freq='1D')
    tmp_df = pd.DataFrame({'Date': pd.to_datetime(tmp_date), 'Job': row['Job'], 'Tool': row['Tool']})
    frames.append(tmp_df)
dfs = pd.concat(frames, ignore_index=True)

# workload per tool (Drill = 1, Hammer = 0.5, anything else = 0.75)
dfs['workload'] = dfs['Tool'].apply(lambda x: 1 if x == 'Drill' else 0.5 if x == 'Hammer' else 0.75)
dfs.set_index('Date', inplace=True)
# total workload per date
dfs = dfs.groupby(dfs.index)['workload'].sum().to_frame()
# fill in the missing dates with zero workload
dfs = dfs.reindex(pd.date_range(dfs.index.min(), dfs.index.max(), freq='1D', name='Date'), fill_value=0, axis='index')
dfs.reset_index(inplace=True)
import calendar
def getNweek(x):
    # week-of-month number, with weeks starting on Sunday
    first_dayofweek = calendar.monthrange(x.year, x.month)[0]  # Monday == 0
    offset = (first_dayofweek - 6) % 7
    return (x.day + offset - 1) // 7 + 1
dfs['nweek'] = dfs['Date'].apply(lambda x: getNweek(x))
dfs['month'] = dfs['Date'].dt.month
dfs['dayofweek'] = dfs['Date'].dt.dayofweek
dfs.head()
Date workload nweek month dayofweek
0 2020-10-03 0.5 1 10 5
1 2020-10-04 0.5 2 10 6
2 2020-10-05 0.5 2 10 0
3 2020-10-06 0.5 2 10 1
4 2020-10-07 0.5 2 10 2
dfs = dfs.pivot(index='nweek', columns=['month', 'dayofweek'], values='workload')
import itertools
# order the columns as Sunday first, then Monday..Saturday, for each month
dow = [6, 0, 1, 2, 3, 4, 5]
m = [10, 11, 12]
new_cols = list(itertools.product(m, dow))
dfs.reindex(new_cols, axis=1)
month 10 11 12
dayofweek 6 0 1 2 3 4 5 6 0 1 ... 3 4 5 6 0 1 2 3 4 5
nweek
1 NaN NaN NaN NaN NaN NaN 0.50 1.25 1.25 0.0 ... 1.0 1.0 1.0 NaN NaN 2.0 2.0 0.5 0.5 0.5
2 0.50 0.50 0.50 0.50 0.50 0.50 0.50 1.00 1.00 1.0 ... 1.0 1.0 1.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5
3 0.50 0.50 0.50 0.50 0.50 0.50 0.50 1.00 1.00 1.0 ... 1.0 1.0 1.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5
4 0.50 0.50 0.50 0.50 0.50 1.25 1.25 1.00 1.00 1.0 ... 2.0 2.0 2.0 0.5 0.5 0.5 1.0 1.0 1.0 1.0
5 1.25 1.25 1.25 1.25 1.25 1.25 1.25 2.00 2.00 NaN ... NaN NaN NaN 1.0 1.0 1.0 1.0 NaN NaN NaN
I got an example pandas dataframe like this:
a b
0 6.0 0.6
1 1.0 0.3
2 3.0 0.8
3 5.0 0.1
4 7.0 0.4
5 2.0 0.2
6 0.0 0.9
7 4.0 0.7
8 8.0 0.0
9 9.0 0.5
I want to add a new column, linear, which is the output of fitting a linear regression of b on a. So far I have:
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(df['a'].to_numpy().reshape(-1, 1), df['b'].to_numpy().reshape(-1, 1))
reg.predict(df['a'].to_numpy().reshape(-1, 1))  # linear regression prediction for the whole column
Now I want to do the linear regression on series a incrementally, so the first entry of linear will be b[0], the second will be b[0]/a[0]*a[1], the third will be the linear regression outcome of the first two entries, and so on. I have no clue how to do that with pandas except for iterating through all the entries; is there a better way?
You can use expanding with some custom apply functions. Interesting way to do LR...
from io import StringIO
import pandas as pd
import numpy as np
df = pd.read_table(StringIO(""" a b
0 6.0 0.6
1 1.0 0.3
2 3.0 0.8
3 5.0 0.1
4 7.0 0.4
5 2.0 0.2
6 0.0 0.9
7 4.0 0.7
8 8.0 0.0
9 9.0 0.5
10 10.0 0.4
11 11.0 0.35
12 12.0 0.3
13 13.0 0.28
14 14.0 0.27
15 15.0 0.22"""), sep=r'\s+')
df = df.sort_values(by='a')
ax = df.plot(x='a',y='b',kind='scatter')
m, b = np.polyfit(df['a'],df['b'],1)
lin_reg = lambda x, m, b : m*x + b
df['lin'] = lin_reg(df['a'], m, b)
def make_m(x):
    # slope of the fit over the expanding window (uses the global df for y)
    y = df['b'].iloc[0:len(x)]
    return np.polyfit(x, y, 1)[0]

def make_b(x):
    # intercept of the fit over the expanding window
    y = df['b'].iloc[0:len(x)]
    return np.polyfit(x, y, 1)[1]
df['new'] = df['a'].expanding().apply(make_m, raw=True)*df['a'] + df['a'].expanding().apply(make_b, raw=True)
# df = df.sort_values(by='a')
ax.plot(df.a,df.lin)
ax.plot(df.a,df.new)
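As a possible follow-up (a sketch, not part of the original answer): the expanding apply above runs np.polyfit twice per window, once for the slope and once for the intercept. A plain loop that fits each window once gives the same column (the 'new_onepass' name is just a placeholder); as in the expanding version, the very first window has a single point, so np.polyfit emits a RankWarning there.
def expanding_linreg(a, b):
    # one np.polyfit per expanding window, evaluated at the newest x value
    out = np.empty(len(a))
    for i in range(len(a)):
        m, c = np.polyfit(a[:i + 1], b[:i + 1], 1)
        out[i] = m * a[i] + c
    return out

df['new_onepass'] = expanding_linreg(df['a'].to_numpy(), df['b'].to_numpy())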
I have a dataframe consisting of 5 decreasing series (290 rows each) whose values lie between 0 and 1.
The data looks like that:
A B C D E
0.60 0.998494 1.0 1.0 1.0 1.0
0.65 0.997792 1.0 1.0 1.0 1.0
0.70 0.996860 1.0 1.0 1.0 1.0
0.75 0.995359 1.0 1.0 1.0 1.0
0.80 0.992870 1.0 1.0 1.0 1.0
I want to reindex the dataframe so that I have 0.01 increments between each row. I've tried pd.DataFrame.reindex but to no avail: that returns a dataframe where most of the values are np.NaN
import pandas as pd
import numpy as np

df = pd.read_csv('http://pastebin.com/raw/yeHdk2Gq', index_col=0)
print(df.reindex(np.arange(0.6, 3.5, 0.025)).head())
Which returns only two valid rows, and converts the 288 others to NaN:
A B C D E
0.600 0.998494 1.0 1.0 1.0 1.0
0.625 NaN NaN NaN NaN NaN
0.650 0.997792 1.0 1.0 1.0 1.0
0.675 NaN NaN NaN NaN NaN
0.700 NaN NaN NaN NaN NaN ##This row existed before reindexing
Pandas can't match the new index with the initial values, although there don't seem to be rounding issues (the initial index has no more than 2 decimals).
This seems somehow related to my data as the following works as intended:
df = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])\
    .reindex(np.arange(1, 10, 0.5))
print(df.head())
Which gives:
A B C
1.0 0.206539 0.346656 2.578709
1.5 NaN NaN NaN
2.0 1.164226 2.693394 1.183696
2.5 NaN NaN NaN
3.0 -0.532072 -1.044149 0.818853
Thanks for your help!
This is because of floating-point precision in NumPy:
In [31]: np.arange(0.6, 3.5, 0.025).tolist()[0:10]
Out[31]:
[0.6, 0.625, 0.65, 0.675, 0.7000000000000001, 0.7250000000000001,
0.7500000000000001, 0.7750000000000001, 0.8000000000000002, 0.8250000000000002]
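For illustration (not part of the original answer), rounding the generated values restores exact matches with the original index:
>>> np.round(np.arange(0.6, 3.5, 0.025), 3).tolist()[0:10]
[0.6, 0.625, 0.65, 0.675, 0.7, 0.725, 0.75, 0.775, 0.8, 0.825]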
As pointed out by @Danche and @EdChum, this was actually a NumPy rounding issue. The following works:
df = pd.read_csv('http://pastebin.com/raw/yeHdk2Gq', index_col=0)\
.reindex([round(i, 5) for i in np.arange(0.6, 3.5, 0.01)])\
.interpolate(kind='cubic', axis=0)
Returns as intended:
A B C D E
0.60 0.998494 1.0 1.0 1.0 1.0
0.61 0.998354 1.0 1.0 1.0 1.0
0.62 0.998214 1.0 1.0 1.0 1.0
0.63 0.998073 1.0 1.0 1.0 1.0
0.64 0.997933 1.0 1.0 1.0 1.0
Thanks
I'm using Pandas 0.13.0 and I'm trying to do a sliding average based on the values of the index.
The index values are not evenly spaced.
The index is sorted, with increasing and unique values.
import pandas as pd
import quantities as pq

f = {
    'A': [ 0.0,  0.1,  0.2,  0.5,  1.0,  1.4,  1.5] * pq.m,
    'B': [10.0, 11.0, 12.0, 15.0, 20.0, 30.0, 50.0] * pq.kPa
}
df = pd.DataFrame(f)
df.set_index(df['A'], inplace=True)
The DataFrame gives:
in: print df
out:
A B
A
0.00 0.00 m 10.0 kPa
0.10 0.10 m 11.0 kPa
0.20 0.20 m 12.0 kPa
0.50 0.50 m 15.0 kPa
1.00 1.00 m 20.0 kPa
1.40 1.40 m 30.0 kPa
1.50 1.50 m 50.0 kPa
Now I would like to compute, for each index value x, the average of column B between x and x+c, where c is a user-defined criterion.
For the sake of this example, c = 0.40.
The averaging process would give:
A B C
A
0.00 0.00 m 10.0 kPa 11.0 kPa = (10.0 + 11.0 + 12.0) / 3
0.10 0.10 m 11.0 kPa 12.7 kPa = (11.0 + 12.0 + 15.0) / 3
0.20 0.20 m 12.0 kPa 13.5 kPa = (12.0 + 15.0) / 2
0.50 0.50 m 15.0 kPa 15.0 kPa = (15.0) / 1
1.00 1.00 m 20.0 kPa 25.0 kPa = (20.0 + 30.0) / 2
1.40 1.40 m 30.0 kPa 40.0 kPa = (30.0 + 50.0) / 2
1.50 1.50 m 50.0 kPa 50.0 kPa = (50.0) / 1
Note that because the index values are not evenly spaced, sometimes the value x+c won't be found exactly. That is OK for now, though I will definitely add a way to take the average value at x+c between the value just before and the value just after x+c, so I get a more accurate average.
I tried the solution found here from Zelazny7:
pandas rolling computation with window based on values instead of counts
But I can't make it work for my case, where the search is made on the index.
I also looked at:
Pandas Rolling Computations on Sliding Windows (Unevenly spaced)
But I don't understand how to apply it to my case.
Any idea how to solve this problem with an efficient pandas approach (using apply, map or rolling)?
Thanks.
What you need to do, following the answer you linked to, is turn the index into a Series so you can call apply on it. The other key thing is that the constructed Series must carry the same index as your df; by default a fresh index (0, 1, 2, 3, ...) would be created instead.
In [26]:
def f(x, c):
    # mean of B over the rows whose index value lies in [x, x + c]
    ser = df.loc[(df.index >= x) & (df.index <= x + c), 'B']
    return ser.mean()

df['C'] = pd.Series(data=df.index, index=df.index).apply(lambda x: f(x, c=0.4))
df
Out[26]:
A B C
A
0.0 0.0 10 11.000000
0.1 0.1 11 12.666667
0.2 0.2 12 13.500000
0.5 0.5 15 15.000000
1.0 1.0 20 25.000000
1.4 1.4 30 40.000000
1.5 1.5 50 50.000000
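As a possible follow-up (a sketch, not part of the original answer): for large frames, the same forward-looking window mean can be computed without a Python-level apply by combining searchsorted with a cumulative sum. This assumes a recent pandas, a sorted index, and that column B holds plain numbers (units stripped), as in the output above; 'C_fast' is just a placeholder name and should reproduce the C column.
import numpy as np

c = 0.4
idx = df.index.to_numpy(dtype=float)
vals = df['B'].to_numpy(dtype=float)

# first position past each window [x, x + c]; the tiny tolerance guards
# against floating-point drift when computing x + c
right = np.searchsorted(idx, idx + c + 1e-9, side='right')
left = np.arange(len(idx))

# window sums via a cumulative sum, then divide by the window sizes
csum = np.concatenate(([0.0], np.cumsum(vals)))
df['C_fast'] = (csum[right] - csum[left]) / (right - left)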