I have a data frame in which one of the columns represents how much corn was produced at each timestamp.
For example:
timestamp  corns_produced  another_column
1          5               4
2          0               1
3          0               3
4          3               4
The dataframe is big: 100,000+ rows.
I want to calculate the moving average and standard deviation over a 1000-timestamp window of corns_produced.
Luckily, this is pretty easy using rolling:
my_df.rolling(1000).mean()
my_df.rolling(1000).std()
But the problem is that I want to ignore the zeros, meaning that if in the last 1000 timestamps there are only 5 instances in which corn was produced, I want to compute the mean and std over just those 5 elements.
How do I ignore the zeros?
Just to clarify, I don't want to do x = my_df[my_df['corns_produced'] != 0] and then do rolling on x, because that discards the timestamps and doesn't give me the result I need.
You can use Rolling.apply:
print(my_df.rolling(1000).apply(lambda x: x[x != 0].mean()))
print(my_df.rolling(1000).apply(lambda x: x[x != 0].std()))
A faster solution for large data: first set all zeros to np.nan, then take the rolling statistics.
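A minimal sketch of that approach (assuming the statistics are wanted on the corns_produced column; min_periods keeps windows that contain NaNs from returning NaN):

import numpy as np

# Zeros become NaN, and pandas' rolling statistics skip NaNs.
masked = my_df['corns_produced'].replace(0, np.nan)
rolling_mean = masked.rolling(1000, min_periods=1).mean()
rolling_std = masked.rolling(1000, min_periods=1).std()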
I have a dataframe similar to the one shown below and was wondering how I can loop through it and calculate fitting parameters for every set number of days. For example, I would like to be able to input 30 days and get new constants for the first 30 days, then the first 60 days, and so on until the end of the date range.
ID  date      amount  delta_t
1   2020/1/1  10.2    0
1   2020/1/2  11.2    1
2   2020/1/1  12.3    0
2   2020/1/2  13.3    1
I would like to have the parameters stored in another dataframe, which is what I am currently doing for the entire dataset, but that covers the whole time period rather than n-day blocks. Then, using the constants for each period, I will calculate the graph points and plot them.
Right now I am using groupby to group the wells by ID and then using the apply method to calculate the constants for each ID. This works for the entire dataframe, but the constants will change if I only use 30-day periods.
I don't know if there is a way in the apply method to more easily do this and output the constants either to a new column or a separate dataframe that is one row per ID. Any input is greatly appreciated.
from scipy.optimize import curve_fit

def parameters(x):
    # expo is the model function being fit (defined elsewhere)
    variables, _ = curve_fit(expo, x['delta_t'], x['amount'])
    return pd.Series({'param1': variables[0], 'param2': variables[1], 'param3': variables[2]})

param_series = df_filt.groupby('ID').apply(parameters)
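In case it helps, a hedged sketch of extending this to expanding n-day blocks (assuming delta_t counts days since each ID's first record, and expo and df_filt are as above):

import pandas as pd
from scipy.optimize import curve_fit

results = []
for well_id, grp in df_filt.groupby('ID'):
    # Fit on days [0, 30), then [0, 60), and so on until the last record.
    for end in range(30, int(grp['delta_t'].max()) + 31, 30):
        block = grp[grp['delta_t'] < end]
        if len(block) < 3:  # need at least as many points as parameters
            continue
        variables, _ = curve_fit(expo, block['delta_t'], block['amount'])
        results.append({'ID': well_id, 'days': end,
                        'param1': variables[0], 'param2': variables[1],
                        'param3': variables[2]})
params_by_block = pd.DataFrame(results)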
count 716865 716873 716884 716943
0 -0.16029615828413712 -0.07630309240006158 0.11220663712532133 -0.2726775504078691
1 -0.6687265363491811 -0.6135022705188075 -0.49097425130988914 -0.736020384028633
2 0.06735205699309535 0.07948417451634422 0.09240256047258057 0.0617964313591086
3 0.372935701728449 0.44324822316416074 0.5625073287879649 0.3199599294007491
4 0.39439310866886124 0.45960496068147993 0.5591549439131621 0.34928093849248304
5 -0.08007381002566456 -0.021313801077641505 0.11996141286735541 -0.15572679401876433
6 0.20853071107951396 0.26561990841073535 0.3661990387594055 0.15720649076873264
7 -0.0488049712326824 0.02909288268076153 0.18643283476719688 -0.1438092892727158
8 0.017648470149950992 0.10136455179350337 0.2722686729095633 -0.07928001803992157
9 0.4693208827819954 0.6601182040950377 1.0 0.2858790498612906
10 0.07597883305423633 0.0720868097090368 0.06089458880790768 0.08522329510499728
I want to manipulate this normalized dataframe to do something similar to the built-in .corr method, but with my own definition of correlation, and then build a heatmap, which I know how to do.
My end result is an NxN dataframe of 0 or 1 values that meets the criteria below. For the table shown above it will be 4x4.
The following are the criteria for my correlation method:
1. Loop through each column as the reference and subtract each of the other columns from it.
2. While looping, disregard values where both the reference and the correlating column have normalized absolute values below 0.2.
3. For the remaining values, if every difference is less than 10 percent, the correlation is good and I record a 1; if any of the differences of the count values is greater than 10%, I record a 0.
4. All the diagonal cells get a 1 (each column correlates perfectly with itself); every other cell gets either 0 or 1.
The following is what I have, but when I drop the deadband values it does not catch all of them for some reason.
subdf = []
deadband = 0.2
for i in range(len(df2_norm.columns)):
    # First, drop rows where this column's nonzero absolute value is below the deadband
    df2_norm_drop = df2_norm.drop(df2_norm[(df2_norm.abs().iloc[:, i] < deadband) &
                                           (df2_norm.abs().iloc[:, i] > 0)].index)
    # Take the difference of every column from the reference column's normalized values
    subdf.append(pd.DataFrame(df2_norm.subtract(df2_norm.iloc[:, i], axis=0)))
I know it looks like a lot, but I would really appreciate any help. Thank you!
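For reference, a minimal sketch of the criteria as described, with hypothetical helper names; it treats "less than 10 percent" as an absolute difference of 0.1 on the normalized values, which is an assumption:

import numpy as np
import pandas as pd

def binary_corr(df, deadband=0.2, tol=0.1):
    cols = df.columns
    # Start with 1s on the diagonal: every column correlates with itself.
    out = pd.DataFrame(np.eye(len(cols), dtype=int), index=cols, columns=cols)
    for ref in cols:
        for other in cols:
            if ref == other:
                continue
            # Disregard rows where both values sit below the deadband.
            mask = (df[ref].abs() >= deadband) | (df[other].abs() >= deadband)
            diff = (df.loc[mask, other] - df.loc[mask, ref]).abs()
            out.loc[ref, other] = int((diff < tol).all())
    return out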
I want to calculate the mean and standard deviation of a sample. The sample has two columns: the first is a time, and the second, separated by a space, is a value. I don't know how to calculate the mean and standard deviation of the second column of values using Python; maybe scipy? I want to use the method on large sets of data.
I also want to check which values in the set are more than seven times higher than the standard deviation.
Thanks for the help.
time  value
1     1.17e-5
2     1.27e-5
3     1.35e-5
4     1.53e-5
5     1.77e-5
The mean is 1.418e-5 and the standard deviation is 2.369e-6.
To answer your first question, assuming your sample's dataframe is df, the following should work:
import pandas as pd

df = pd.DataFrame({'time': [1, 2, 3, 4, 5], 'value': [1.17e-5, 1.27e-5, 1.35e-5, 1.53e-5, 1.77e-5]})
df will be something like this:
>>> df
   time     value
0     1  0.000012
1     2  0.000013
2     3  0.000013
3     4  0.000015
4     5  0.000018
Then, to obtain the standard deviation and mean of the value column respectively, run the following:
>>> df['value'].std()
2.368966019173766e-06
>>> df['value'].mean()
1.418e-05
To answer your second question, try the following:
std = df['value'].std()
df = df[(df.value > 7*std)]
I am assuming you want to obtain the rows at which value is greater than 7 times the sample standard deviation. If you actually want greater than or equal to, just change > to >=. You should then be able to obtain the following:
>>> df
   time     value
4     5  0.000018
Also, following #Mad Physicist's suggestion of setting the Delta Degrees of Freedom to ddof=0 (if you are unfamiliar with this, check out the Delta Degrees of Freedom article on Wikipedia), doing so results in the following:
std = df['value'].std(ddof=0)
df = df[(df.value > 7*std)]
with output:
>>> df
   time     value
3     4  0.000015
4     5  0.000018
P.S. If I am not wrong, it's a convention here to stick to one question per post, not two.
I have a dataframe that represents time series probabilities. Each value in column 'Single' represents the probability of that event in that time period (each row is one time period). Each value in column 'Cumulative' represents the probability of the event occurring in every time period up to that point (i.e., it is the product of every value in 'Single' from time 0 until now).
A simplified version of the dataframe looks like this:
      Single  Cumulative
0   0.990000    1.000000
1   0.980000    0.990000
2   0.970000    0.970200
3   0.960000    0.941094
4   0.950000    0.903450
5   0.940000    0.858278
6   0.930000    0.806781
7   0.920000    0.750306
8   0.910000    0.690282
9   0.900000    0.628157
10  0.890000    0.565341
In order to calculate the 'Cumulative' column based on the 'Single' column I am looping through the dataframe like this:
for index, row in df.iterrows():
    df['Cumulative'][index] = df['Single'][:index].prod()
In reality there is a lot of data, and looping is a drag on performance. Is it at all possible to achieve this without looping?
I've tried to find a way to vectorize this calculation or even use the pandas.DataFrame.apply function, but I don't believe I'm able to reference the current index value in either of those methods.
There's a built-in function for this in pandas:
df.cumprod()
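Note that in the sample above, 'Cumulative' at row n is the product of 'Single' up to but not including row n (row 0 is 1.0), so matching it exactly may need a shift; a minimal sketch:

# cumprod() includes the current row; shifting by one matches the sample,
# where Cumulative[0] == 1.0.
df['Cumulative'] = df['Single'].cumprod().shift(1, fill_value=1.0)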
I have multiple timeseries that are outputs of various algorithms. These algorithms can have various parameters and they produce timeseries as a result:
timestamp1 = 1
value1 = 5
timestamp2 = 2
value2 = 8
timestamp3 = 3
value3 = 4
timestamp4 = 4
value4 = 12

resultsOfAlgorithms = [
    {
        'algorithm': 'minmax',
        'param-a': '12',
        'param-b': '200',
        'result-of-algorithm': [[timestamp1, value1], [timestamp2, value2]]
    },
    {
        'algorithm': 'minmax',
        'param-a': '12',
        'param-b': '30',
        'result-of-algorithm': [[timestamp1, value1], [timestamp3, value3]]
    },
    {
        'algorithm': 'minmax',
        'param-a': '12',
        'param-b': '30',
        'result-of-algorithm': [[timestamp2, value2], [timestamp4, value4]]
    },
    {
        'algorithm': 'delta',
        'param-a': '12',
        'param-b': '50',
        'result-of-algorithm': [[timestamp2, value2], [timestamp4, value4]]
    }
]
I would like to be able to filter the timeseries by algorithm and parameters, and plot the filtered timeseries to see how given parameters affect the output. To do that I need to know all the occurring values for a given parameter and then be able to select the timeseries with the desired parameters. E.g., I would like to plot all results of the minmax algorithm with param-b == 30. There are 2 results that were produced with the minmax algorithm and param-b == 30, so I would like a plot with 2 timeseries in it.
Is this possible with pandas, or is it outside pandas functionality? How could this be implemented?
Edit:
Searching the internet some more, I think I am looking for a way to use hierarchical indexing. Also, the timeseries should stay separated: each result is an individual timeseries and should not be merged with the other results. I need to filter the results of the algorithms by the parameters used, and the result of the filter should still be a list of timeseries.
Edit 2:
There are multiple sub-problems:
Find all existing values for each parameter (the user does not know all the values, since parameters can be auto-generated by the system).
The user selects some of the values for filtering.
One way this could be provided by the user is a dictionary (but more user-friendly ideas are welcome):
filter = {
    'param-b': [30, 50],
    'algorithm': 'minmax'
}
Timeseries from resultsOfAlgorithms[1:3] (the 2nd and 3rd results) are returned by the filtering, since these results were produced by the minmax algorithm with param-b equal to 30. Thus in this case:
[
    [[timestamp1, value1], [timestamp3, value3]],
    [[timestamp2, value2], [timestamp4, value4]]
]
The result of filtering will return multiple time series, which I want to plot and compare.
The user wants to try various filters to see how they affect the results.
I am doing all this in a Jupyter notebook, and I would like to allow the user to try various filters with the least hassle possible.
Timestamps between results are not necessarily shared. E.g., all timeseries might occur between 1 pm and 3 pm and have roughly the same number of values, but neither the timestamps nor the number of values are identical.
So there are two options here: one is to clean up the dict first and then convert it easily to a dataframe; the second is to convert it to a dataframe and then clean up the column that will have nested lists in it. For the first solution, you can just restructure the dict like this:
import pandas as pd
from collections import defaultdict

data = defaultdict(list)
for roa in resultsOfAlgorithms:
    for i in range(len(roa['result-of-algorithm'])):
        data['algorithm'].append(roa['algorithm'])
        data['param-a'].append(roa['param-a'])
        data['param-b'].append(roa['param-b'])
        data['time'].append(roa['result-of-algorithm'][i][0])
        data['value'].append(roa['result-of-algorithm'][i][1])
df = pd.DataFrame(data)
In [31]: df
Out[31]:
  algorithm param-a param-b  time  value
0    minmax      12     200     1      5
1    minmax      12     200     2      8
2    minmax      12      30     1      5
3    minmax      12      30     3      4
4    minmax      12      30     2      8
5    minmax      12      30     4     12
6     delta      12      50     2      8
7     delta      12      50     4     12
And from here you can do whatever analysis you need with it, whether that's plotting, making the time column the index, grouping and aggregating, and so on. You can compare this to making a dataframe first in this link:
Splitting a List inside a Pandas DataFrame
where they basically did the same thing, splitting a column of lists into multiple rows. I think fixing the dictionary will be easier, though, depending on how representative your fairly simple example is of the real data.
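For the concrete filter from the question, a hedged sketch on this flat dataframe (note that the parameter values were stored as strings in the original dict):

# All rows belonging to minmax runs with param-b == '30'.
subset = df[(df['algorithm'] == 'minmax') & (df['param-b'] == '30')]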
Edit: If you wanted to turn this into a multi-index, you can add one more line:
df_mi = df.set_index(['algorithm', 'param-a', 'param-b'])
In [25]: df_mi
Out[25]:
                           time  value
algorithm param-a param-b
minmax    12      200         1      5
                  200         2      8
                  30          1      5
                  30          3      4
                  30          2      8
                  30          4     12
delta     12      50          2      8
                  50          4     12
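A hedged usage sketch for filtering on the multi-index, using the standard pandas cross-section method xs (again, the parameter values are strings here):

# Select all minmax rows with param-b == '30' across two index levels.
minmax_b30 = df_mi.xs(('minmax', '30'), level=['algorithm', 'param-b'])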