Create a new pandas DataFrame from observations that meet specific criteria - python

I have two original dataframes.
One contains limits: df_limits
feat_1 feat_2 feat_3
target 12 9 90
UL 15 10 120
LL 9 8 60
where target is ideal value,
UL - upper limit,
LL - lower limit
The other one contains the original data: df_to_check
ID feat_1 feat_2 feat_3
123 12.5 9.6 100
456 18 3 100
789 9 11 100
I'm writing a function whose desired output is the ID and the features that are below or above the threshold (the limits from the first DataFrame). So far I'm able to recognise which features are out of limits, but I'm getting the full output of the original DataFrame...
def table(df_limits, df_to_check, column):
    UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
    LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']
    UL_index = df_to_check.loc[df_to_check[column] > UL].index
    LL_index = df_to_check.loc[df_to_check[column] < LL].index
    if UL_index is not None:
        above_limit = {'ID': df_to_check['ID'],
                       'column': df_to_check[column],
                       'target': df_limits[column].loc['target']}
        return pd.DataFrame(above_limit)
What should I change so that my desired output would be as below (showing only the ID and the column where an observation is out of limits)? Ideally it would also show by how many percent the original value deviates from the ideal value target (I would be glad for advice on how to add such a column):
ID column target value deviate(%)
456 feat_1 12 18 50
456 feat_2 9 3 ...
789 feat_2 9 11 ...
Right now the function returns the whole dataset, because the statement says "if not None", and the index is never None. I understand why I have this issue, but I don't know how to change it.
The problem is the statement if UL_index is not None: it lets everything through, and I'm looking for a way to replace this part.
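For reference (and not part of the asker's code), a minimal sketch of the usual fix: the Index or DataFrame returned by .loc is never None, so test for emptiness instead and build the result only from the offending rows. It assumes UL and LL are absolute limits, which matches the expected output above:

import pandas as pd

def table(df_limits, df_to_check, column):
    # Assumption: UL and LL are absolute limits (as in the expected output above);
    # the question's code adds them to 'target' instead.
    UL = df_limits[column].loc['UL']
    LL = df_limits[column].loc['LL']
    target = df_limits[column].loc['target']
    out = df_to_check.loc[(df_to_check[column] > UL) | (df_to_check[column] < LL)]
    if out.empty:  # an Index/DataFrame is never None, so test emptiness instead
        return pd.DataFrame()
    return pd.DataFrame({
        'ID': out['ID'],
        'column': column,
        'target': target,
        'value': out[column],
        'deviate(%)': (out[column] - target) / target * 100,
    })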

Approach
reshape
merge
calculate
new_df = (df_to_check.set_index("ID").unstack().reset_index()
          .rename(columns={"level_0": "column", 0: "value"})
          .merge(df_limits.T, left_on="column", right_index=True)
          .assign(deviate=lambda dfa: (dfa.value - dfa.target) / dfa.target)
          )
 column   ID  value  target   UL  LL    deviate
 feat_1  123   12.5      12   15   9  0.0416667
 feat_1  456     18      12   15   9        0.5
 feat_1  789      9      12   15   9      -0.25
 feat_2  123    9.6       9   10   8  0.0666667
 feat_2  456      3       9   10   8  -0.666667
 feat_2  789     11       9   10   8   0.222222
 feat_3  123    100      90  120  60   0.111111
 feat_3  456    100      90  120  60   0.111111
 feat_3  789    100      90  120  60   0.111111

First of all, you have not provided a reproducible example (https://stackoverflow.com/help/minimal-reproducible-example), because you have not shared the code which produces the two initial dataframes. Please keep that in mind next time you ask a question. Without those, I made a toy example with my own (random) data.
I start by unpivoting what you call dataframe_to_check: that's because, if you want to check each feature independently, then that dataframe is not normalised (you might want to look up what database normalisation means).
The next step is a left outer join between the unpivoted dataframe you want to check and the (transposed) dataframe with the limits.
Once you have that, you can easily calculate whether a row is within the range, the deviation between value and target, etc, and you can of course group this however you want.
My code is below. It should be easy enough to customise it to your case.
import pandas as pd
import numpy as np

df_limits = pd.DataFrame(index=['min val', 'max val', 'target'])
df_limits['a'] = [2, 4, 3]
df_limits['b'] = [3, 5, 4.5]

df = pd.DataFrame(columns=df_limits.columns, data=np.random.rand(100, 2) * 6)

df_unpiv = pd.melt(df.reset_index().rename(columns={'index': 'id'}),
                   id_vars='id', var_name='feature', value_name='value')

# I reset the index because I couldn't get a join on a column and index,
# but there is probably a better way to do it
df_joined = pd.merge(df_unpiv,
                     df_limits.transpose().reset_index().rename(columns={'index': 'feature'}),
                     how='left', on='feature')

df_joined['abs diff from target'] = abs(df_joined['value'] - df_joined['target'])
df_joined['outside range'] = (df_joined['value'] < df_joined['min val']) | (df_joined['value'] > df_joined['max val'])

df_outside_range = df_joined.query(" `outside range` == True ")
df_inside_range = df_joined.query(" `outside range` == False ")
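As a small follow-up to the "group this however you want" remark (my own example on top of df_joined), the share of out-of-range observations per feature could be computed like this:

# fraction of observations outside the allowed range, per feature
out_of_range_share = df_joined.groupby('feature')['outside range'].mean()
print(out_of_range_share)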

I solved my issue, maybe in a bit clumsy way, but it works as desired...
If someone has a better answer I will still appreciate it.
The example below shows how to get only the observations above the limits; to have both, just concatenate the observations from UL_index and LL_index.
def table(df_limits, df_to_check, column):
    above_limit = []
    df_above_limit = pd.DataFrame()
    UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
    LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']
    UL_index = df_to_check.loc[df_to_check[column] > UL].index
    LL_index = df_to_check.loc[df_to_check[column] < LL].index
    df_to_check_UL = df_to_check.loc[UL_index]
    df_to_check_LL = df_to_check.loc[LL_index]
    above_limit = {
        'ID': df_to_check_UL['ID'],
        'feature value': df_to_check[column],
        'target': df_limits[column].loc['target']
    }
    df_above_limit = pd.DataFrame(above_limit, index=df_to_check_UL.index)
    return df_above_limit
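A rough sketch of the concatenation mentioned above (my own illustration, not the asker's code): build the same kind of frame for the below-limit rows as well and stack both results.

import pandas as pd

def table_both(df_limits, df_to_check, column):
    # same limit definition as in the answer above
    UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
    LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']
    target = df_limits[column].loc['target']

    df_above = df_to_check.loc[df_to_check[column] > UL]
    df_below = df_to_check.loc[df_to_check[column] < LL]

    frames = [pd.DataFrame({'ID': part['ID'],
                            'feature value': part[column],
                            'target': target},
                           index=part.index)
              for part in (df_above, df_below)]
    return pd.concat(frames)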

Related

Pandas: How to round the values which are closest to the whole number for the values more than 1?

Following is the dataframe. I would like to round the values in 'Period' which are closest to whole numbers. For example: 1.005479452 rounded to 1.0000, 2.002739726 rounded to 2.0000, 3.002739726 rounded to 3.0000, 5.005479452 rounded to 5.0000, 12.01369863 rounded to 12.0000, and so on. I have a big list. I am trying to do this because in a later program I have to concatenate this dataframe with other dataframes based on the 'period' column.
df = period rate
0.931506849 -0.001469
0.994520548 0.008677
1.005479452 0.11741125
1.008219178 0.073975
1.010958904 0.147474833
1.994520548 -0.007189219
2.002739726 0.1160815
2.005479452 0.06995
2.008219178 0.026808
2.010958904 0.1200695
2.980821918 -0.007745727
3.002739726 0.192208333
3.010958904 0.119895833
3.019178082 0.151857267
3.021917808 0.016165
3.863013699 0.005405321
4 0.06815
4.002739726 0.1240695
4.016438356 0.2410323
4.019178082 0.0459375
4.021917808 0.03161
4.997260274 0.0682
5.005479452 0.1249955
5.01369863 0.03260875
5.016438356 0.238069083
5.019178082 0.04590625
5.021917808 0.0120625
12.01369863 0.136991
12.01643836 0.053327917
12.01917808 0.2309365
I am trying to do something like below but couldn't move further.
df['period'] = np.where(df.period>1, df.period.round(), df.period.round(decimals = 4))
You can apply a lambda function. This one will check if the value is greater than one before rounding to a whole number, otherwise rounding to 4 decimal places for values less than one. I think that's what you want?
df['period'] = df['period'].apply(lambda x: round(x, 0) if x > 1 else round(x, 4))
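The asker's np.where attempt can also be made to work if "closest to a whole number" is expressed as a tolerance; a sketch of my own interpretation (the 0.02 threshold is an assumption, tune it to the data):

import numpy as np

# round to a whole number only when the value is within 0.02 of one and greater than 1,
# otherwise keep 4 decimal places
close_to_whole = (df['period'] - df['period'].round()).abs() < 0.02
df['period'] = np.where(close_to_whole & (df['period'] > 1),
                        df['period'].round(),
                        df['period'].round(4))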
I built a function that basically iterates from 1 to whatever the max whole value should be in the dataframe. This should be faster than a solution that just iterates row-by-row, though it does assume that the dataframe is sorted (like in your example).
import pandas as pd
df = pd.DataFrame(
{
"period": [0.931506849, 0.994520548, 1.005479452, 1.008219178, 1.010958904, 1.994520548, 2.002739726, 2.005479452, 2.008219178, 2.010958904, 2.980821918, 3.002739726, 3.010958904, 3.019178082, 3.021917808, 3.863013699, 4, 4.002739726, 4.016438356, 4.019178082, 4.021917808, 4.997260274, 5.005479452, 5.01369863, 5.016438356, 5.019178082, 5.021917808, 12.01369863, 12.01643836, 12.01917808]
}
)
print(df.head())
"""
period
0 0.931507
1 0.994521
2 1.005479
3 1.008219
4 1.010959
"""
def process_df(df: pd.DataFrame) -> pd.DataFrame:
    df_range_vals = [round(period) for period in df['period'].tolist()]
    out_df = df.loc[df['period'] < 1]
    for base in range(1, max(df_range_vals) + 1):
        # only keep the ones in the range we want
        temp_df = df.loc[(df['period'] >= base) & (df['period'] < base + 1)]
        # if there's nothing to change, then just skip
        if temp_df.empty:
            continue
        temp_df.loc[temp_df.first_valid_index(), 'period'] = temp_df.loc[temp_df.first_valid_index(), 'period'].round(0)
        out_df = pd.concat([out_df, temp_df], ignore_index=True)
    return out_df
df = process_df(df)
print(df.head())
"""
period
0 0.931507
1 0.994521
2 1.000000
3 1.008219
4 1.010959
"""
Try:
# Sort so that we know what is closest to the whole number
df = df.sort_values(by=['period'])
# Create a new column and round everything. This is done to do
# the partitioning effectively
df['round_period'] = df['period'].round()
df_of_values_close_to_whole_number = list(df.groupby('round_period').tail(1)['period'])

def round_func(x, df_of_val_close_to_whole_number):
    return '{:.5f}'.format(round(x)) if x in df_of_val_close_to_whole_number and x > 1 else x

# Apply round only to values closer to a whole number.
df['period'].apply(round_func, args=(df_of_values_close_to_whole_number,))
Output
0 0.931507
1 0.994521
2 1.00548
3 1.00822
4 1.00000
5 1.99452
6 2.00274
7 2.00548
8 2.00822
9 2.00000
10 2.98082
11 3.00274
12 3.01096
13 3.01918
14 3.00000
15 3.86301
16 4
17 4.00274
18 4.01644
19 4.01918
20 4.00000
21 4.99726
22 5.00548
23 5.0137
24 5.01644
25 5.01918
26 5.00000
27 12.0137
28 12.0164
29 12.00000
Name: period, dtype: object

how to construct an index from percentage change time series?

consider the values below
array1 = np.array([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
I convert these into a pandas Series object
import numpy as np
import pandas as pd
df = pd.Series(array1)
And compute the percentage change as
df = (1+df.pct_change(periods=1))
from here, how do i construct an index (base=100)? My desired output should be:
0 100.00
1 100.43
2 101.82
3 101.82
4 101.82
5 101.43
6 102.19
7 101.68
8 101.07
9 101.02
10 101.01
11 101.01
12 100.88
13 100.54
14 99.95
15 99.45
I can achieve the objective through an iterative (loop) solution, but that may not be practical if the data depth and breadth are large. Secondly, is there a way I can get this done in a single step on multiple columns? Thank you all for any guidance.
An index (base=100) is the relative change of a series in relation to its first element. So there's no need to take a detour through relative changes and recalculate the index from them when you can get it directly by
df = pd.Series(array1)/array1[0]*100
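For the "multiple columns in a single step" part of the question, the same idea carries over to a DataFrame by dividing every row by the first row; a small sketch of my own (the second column is made up for illustration):

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'a': array1, 'b': array1 * 1.01})  # 'b' is a fabricated example column
indexed = df2.div(df2.iloc[0]).mul(100)                # each column rebased to 100 at its first row
print(indexed.head())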
As far as I know, there is still no off-the-shelf expanding_window version for pct_change(). You can avoid the for-loop by using apply:
# generate data
import pandas as pd
series = pd.Series([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
# compute percentage change with respect to first value
series.apply(lambda x: ((x / series.iloc[0]) - 1) * 100) + 100
Output:
0 100.000000
1 100.434873
2 101.823050
3 101.821151
4 101.821151
5 101.433753
6 102.193357
7 101.680624
8 101.067244
9 101.015971
10 101.006476
11 101.006476
12 100.881141
13 100.535521
14 99.946828
15 99.445489
dtype: float64

Is there a pandas way of getting the averages between consecutive rows?

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(30,3))
df.head()
which gives:
0 1 2
0 0.741955 0.913681 0.110109
1 0.079039 0.662438 0.510414
2 0.469055 0.201658 0.259958
3 0.371357 0.018394 0.485339
4 0.850254 0.808264 0.469885
Say I want to add another column that holds the averages of consecutive values in column 2: between indices (0,1), (1,2), ..., (28,29).
I imagine this is a common task as column 2 are the x axis positions and I want the categorical labels on a plot to appear in the middle between the 2 points on the x axis.
So I was wondering if there is a pandas way for this:
averages = []
for index, item in enumerate(df[2]):
    if index < df[2].shape[0] - 1:
        averages.append((item + df[2].iloc[index + 1]) / 2)
df["averages"] = pd.Series(averages)
df.head()
which gives:
0 1 2 averages
0 0.997044 0.965708 0.211980 0.318781
1 0.716349 0.724811 0.425583 0.378653
2 0.729991 0.985072 0.331723 0.333138
3 0.996487 0.272300 0.334554 0.586686
as you can see 0.31 is an average of 0.21 and 0.42.
Thanks!
I think that you can do this with pandas.DataFrame.rolling. Using your dataframe head as an example:
df['averages'] = df[2].rolling(2).mean().shift(-1)
returns:
>>> df
0 1 2 averages
0 0.997044 0.965708 0.211980 0.318781
1 0.716349 0.724811 0.425583 0.378653
2 0.729991 0.985072 0.331723 0.333139
3 0.996487 0.272300 0.334554 NaN
The NaN at the end is there because there is no row indexed 4; but in your full dataframe, it would go on until the second to last row (the average of value at indices 28 and 29, i.e. your 29th and 30th values). I just wanted to show that this gives the same values as your desired output, so I used the exact data you provided. (for future reference, if you want to provide a reproducible dataframe for us from random numbers, use and show us a random seed such as np.random.seed(42) before creating the df, that way, we'll all have the same one.)
breaking it down:
df[2] is there because you're interested in column 2; .rolling(2) is there because you want to get the mean of 2 values (if you wanted the mean of 3 values, use .rolling(3), etc...), .mean() is whatever function you want (in your case, the mean); finally .shift(-1) makes sure that the new column is in the proper place (i.e., makes sure you show the mean of each value in column 2 and the value below, as the default would be the value above)
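The same result can also be written with shift directly, a one-line variant of my own that is equivalent to the rolling approach above:

# mean of each value in column 2 and the value in the next row
df['averages'] = (df[2] + df[2].shift(-1)) / 2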
This is one way, though slightly loopy. But #sacul's solution is better. I leave this here for reference only.
import pandas as pd
import numpy as np
from itertools import zip_longest
df = pd.DataFrame(np.random.rand(30, 3))
v = df.values[:, -1]
df = df.join(pd.DataFrame(np.array([np.mean([i, j], axis=0)
                                    for i, j in zip_longest(v, v[1:], fillvalue=v[-1])]),
                          columns=['2_pair_avg']))
# 0 1 2 2_pair_avg
# 0 0.382656 0.228837 0.053199 0.373678
# 1 0.812690 0.255277 0.694156 0.697738
# 2 0.040521 0.211511 0.701320 0.491044
# 3 0.558739 0.697916 0.280768 0.615398
# 4 0.262771 0.912669 0.950029 0.489550
# 5 0.217489 0.405125 0.029071 0.101794
# 6 0.577929 0.933565 0.174517 0.214530
# 7 0.067030 0.452027 0.254544 0.613225
# 8 0.580869 0.556112 0.971907 0.582547
# 9 0.483528 0.951537 0.193188 0.175215
# 10 0.481141 0.589833 0.157242 0.159363
# 11 0.087057 0.823691 0.161485 0.108634
# 12 0.319516 0.161386 0.055784 0.285276
# 13 0.901529 0.365992 0.514768 0.386599
# 14 0.270118 0.454583 0.258430 0.245463
# 15 0.379739 0.299569 0.232497 0.214943
# 16 0.017621 0.182647 0.197389 0.538386
# 17 0.720688 0.147093 0.879383 0.732239
# 18 0.859594 0.538390 0.585096 0.503846
# 19 0.360718 0.571567 0.422596 0.287384
# 20 0.874800 0.391535 0.152171 0.239078
# 21 0.935150 0.379871 0.325984 0.294485
# 22 0.269607 0.891331 0.262986 0.212050
# 23 0.140976 0.414547 0.161115 0.542682
# 24 0.851434 0.059209 0.924250 0.801210
# 25 0.389025 0.774885 0.678170 0.388856
# 26 0.679247 0.982517 0.099542 0.372649
# 27 0.670354 0.279138 0.645756 0.336031
# 28 0.393414 0.970737 0.026307 0.343947
# 29 0.479611 0.349401 0.661587 0.661587

Speeding up the insertion of null rows into a large Pandas DataFrame?

I have a Pandas data frame with hundreds of millions of rows that looks like this:
Date Attribute A Attribute B Value
01/01/16 A 1 50
01/05/16 A 1 60
01/02/16 B 1 59
01/04/16 B 1 90
01/10/16 B 1 84
For each unique combination (call it b) of Attribute A x Attribute B, I need to fill in empty dates starting from the oldest date for that unique group b to the maximum date in the entire dataframe df. That is, so it looks like this:
Date Attribute A Attribute B Value
01/01/16 A 1 50
01/02/16 A 1 0
01/03/16 A 1 0
01/04/16 A 1 0
01/05/16 A 1 60
01/02/16 B 1 59
01/03/16 B 1 0
01/04/16 B 1 90
01/05/16 B 1 0
01/06/16 B 1 0
01/07/16 B 1 0
01/08/16 B 1 84
and then calculate the coefficient of variation (standard deviation/mean) for each unique combination's values (after inserting 0s). My code is this:
final = pd.DataFrame()
max_date = df['Date'].max()
for name, group in df.groupby(['Attribute_A', 'Attribute_B']):
    idx = pd.date_range(group['Date'].min(), max_date)
    temp = group.set_index('Date').reindex(idx, fill_value=0)
    coeff_var = temp['Value'].std() / temp['Value'].mean()
    final = pd.concat([final, pd.DataFrame({'Attribute_A': [name[0]], 'Attribute_B': [name[1]], 'Coeff_Var': [coeff_var]})])
This runs insanely slow, and I'm looking for a way to speed it up.
Suggestions?
I don't have a ready solution, however this is how I suggest you approach the problem:
Understand what makes this slow
Find ways to make the critical parts faster
Or, alternatively, find a new approach
Here's the analysis of your code using line profiler:
Timer unit: 1e-06 s
Total time: 0.028074 s
File: <ipython-input-54-ad49822d490b>
Function: foo at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def foo():
2 1 875 875.0 3.1 final = pd.DataFrame()
3 1 302 302.0 1.1 max_date = df['Date'].max()
4 3 3343 1114.3 11.9 for name, group in df.groupby(['Attribute_A','Attribute_B']):
5 2 836 418.0 3.0 idx = pd.date_range(group['Date'].min(),
6 2 3601 1800.5 12.8 max_date)
7
8 2 6713 3356.5 23.9 temp = group.set_index('Date').reindex(idx, fill_value=0)
9 2 1961 980.5 7.0 coeff_var = temp['Value'].std()/temp['Value'].mean()
10 2 10443 5221.5 37.2 final = pd.concat([final, pd.DataFrame({'Attribute_A':[name[0]], 'Attribute_B':[name[1]],'Coeff_Var':[coeff_var]})])
In conclusion, the .reindex and concat statements take 60% of the time.
A first approach that saves 42% of time in my measurement is to collect the data for the final data frame as a list of rows, and create the dataframe as the very last step. Like so:
newdata = []
max_date = df['Date'].max()
for name, group in df.groupby(['Attribute_A', 'Attribute_B']):
    idx = pd.date_range(group['Date'].min(), max_date)
    temp = group.set_index('Date').reindex(idx, fill_value=0)
    coeff_var = temp['Value'].std() / temp['Value'].mean()
    newdata.append({'Attribute_A': name[0], 'Attribute_B': name[1], 'Coeff_Var': coeff_var})
final = pd.DataFrame.from_records(newdata)
Using timeit to measure best execution times I get
your solution: 100 loops, best of 3: 11.5 ms per loop
improved concat: 100 loops, best of 3: 6.67 ms per loop
For details, see this IPython notebook.
Note: Your mileage may vary - I used the sample data provided in the original post. You should run the line profiler on a subset of your real data - the dominating factor with regard to time use may well be something else in that case.
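For reference, a hedged sketch of how such a line-by-line profile can be produced with the line_profiler package, assuming the loop is wrapped in a function foo as in the profile above:

# pip install line_profiler
# In IPython/Jupyter:  %load_ext line_profiler   then   %lprun -f foo foo()
# As a plain script:
from line_profiler import LineProfiler

profiler = LineProfiler()
profiled_foo = profiler(foo)   # wrap foo so each of its lines is timed
profiled_foo()                 # foo() contains the groupby/reindex/concat loop
profiler.print_stats()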
I am not sure if my way is faster than the way that you set up, but here goes:
df = pd.DataFrame({'Date': ['1/1/2016', '1/5/2016', '1/2/2016', '1/4/2016', '1/10/2016'],
                   'Attribute A': ['A', 'A', 'B', 'B', 'B'],
                   'Attribute B': [1, 1, 1, 1, 1],
                   'Value': [50, 60, 59, 90, 84]})

unique_attributes = df['Attribute A'].unique()
groups = []
for i in unique_attributes:
    subset = df[df['Attribute A'] == i]
    dates = subset['Date'].tolist()
    Dates = pd.date_range(dates[0], dates[-1])
    subset.set_index('Date', inplace=True)
    subset.index = pd.DatetimeIndex(subset.index)
    subset = subset.reindex(Dates)
    subset['Attribute A'].fillna(method='ffill', inplace=True)
    subset['Attribute B'].fillna(method='ffill', inplace=True)
    subset['Value'].fillna(0, inplace=True)
    groups.append(subset)
result = pd.concat(groups)

python pandas: How to drop items in a dataframe

I have a huge number of points in my dataframe, so I want to drop some of them (ideally keeping the mean values).
e.g. currently I have
date calltime
0 1491928756414930 4643
1 1491928756419607 166
2 1491928756419790 120
3 1491928756419927 142
4 1491928756420083 121
5 1491928756420217 109
6 1491928756420409 52
7 1491928756420476 105
8 1491928756420605 35
9 1491928756420654 120
10 1491928756420787 105
11 1491928756420907 93
12 1491928756421013 37
13 1491928756421062 112
14 1491928756421187 41
Is there any way to drop some amount of items, based on sampling?
To give more details: my problem is the number of values at very close intervals, e.g. 1491928756421062 and 1491928756421187.
So I have a chart (not reproduced here) crowded with points at those close intervals.
Instead I wanted to somehow have a mean value for those close intervals. But maybe grouped by a second...
I would use sample(), but as you said it selects randomly. If you want to take a sample according to some logic, for instance only keeping rows whose value satisfies mean * 0.9 < value < mean * 1.1, you can try the following code. It all depends on your sampling strategy.
As an example, something like this could be done.
test.csv:
1491928756414930,4643
1491928756419607,166
1491928756419790,120
1491928756419927,142
1491928756420083,121
1491928756420217,109
1491928756420409,52
1491928756420476,105
1491928756420605,35
1491928756420654,120
1491928756420787,105
1491928756420907,93
1491928756421013,37
1491928756421062,112
1491928756421187,41
sampling:
df = pd.read_csv("test.csv", sep=",", header=None)
mean = df[1].mean()
my_sample = df[(mean * .90 < df[1]) & (df[1] < mean * 1.10)]
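For the "grouped by a second" idea from the question, and assuming the first column holds microseconds since the epoch (my reading of the sample data), the raw integers can be bucketed into whole seconds directly; a small sketch of my own:

# bucket the microsecond timestamps into whole seconds and average calltime per bucket
per_second = df.groupby(df[0] // 1_000_000)[1].mean()
print(per_second)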
You're looking for resample
df.set_index(pd.to_datetime(df.date, unit='us')).calltime.resample('s').mean()
This is a more complete example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

tidx = pd.date_range('2000-01-01', periods=10000, freq='10ms')
df = pd.DataFrame(dict(calltime=np.random.randint(200, size=len(tidx))), tidx)

fig, axes = plt.subplots(2, figsize=(25, 10))
df.plot(ax=axes[0])
df.resample('s').mean().plot(ax=axes[1])
fig.tight_layout()
