Python pandas. How to include rows before specific conditioned rows?

I have a csv file which, for example, looks something like this:
duration  concentration  measurement
1.2       0              10
1.25      0              12
...       ...            ...
10.3      0              11
10.5      10             100
10.6      20             150
10.67     30             156
10.75     0              12.5
11        0              12
...       ...            ...
I filtered out all the rows with concentration 0 with the following code:
dF2 = dF1[dF1["concentration"]>10][["duration","measurement","concentration"]]
But I would like to keep 100 (or some n) extra rows where the concentration is still 0, before the rows with concentrations bigger than 10 begin, so that I have a baseline when plotting the data.
Has anybody had experience with a similar problem / could somebody help me, please?

You can use boolean masks for boolean indexing:
# number of baseline rows to keep
n = 2
# cols to keep
cols = ['duration', 'measurement', 'concentration']
# is the concentration greater than 10?
m1 = dF1['concentration'].gt(10)
# is the row one of the n initial concentration 0?
m2 = dF1['concentration'].eq(0).cumsum().le(n)
# if you have values in between 0 and 10 and do not want those
# m2 = (m2:=dF1['concentration'].eq(0)) & m2.cumsum().le(n)
# or
# m2 = dF1.index.isin(dF1[dF1['concentration'].eq(0)].head(n).index)
# keep rows where either condition is met
dF2 = dF1.loc[m1|m2, cols]
If you only want to keep initial rows before the first value above threshold, change m2 to:
# keep up to n initial rows with concentration=0
# only until the first row above threshold is met
m2 = dF1['concentration'].eq(0).cumsum().le(n) & ~m1.cummax()
output:
duration measurement concentration
0 1.20 10.0 0
1 1.25 12.0 0
4 10.60 150.0 20
5 10.67 156.0 30

You can filter the records and concatenate them to get the desired result:
n = 100  # number of initial rows with concentration 0 required
dF2 = pd.concat([dF1[dF1["concentration"] == 0].head(n),
                 dF1[dF1["concentration"] > 10]])[["duration", "measurement", "concentration"]]

You can simply filter the data frame for when the concentration is zero, select the top 100 (or top n) rows from that filtered data frame using head, and append that to your dF2.
n = 100 # you can change this to include the number of rows you want.
df_baseline = dF1[dF1["concentration"] == 0][["duration","measurement","concentration"]].head(n)
dF2 = dF1[dF1["concentration"]>10][["duration","measurement","concentration"]]
df_final = df_baseline.append(dF2)
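Note that DataFrame.append was deprecated and then removed in pandas 2.0, so on recent versions the last line can be swapped for pd.concat; a minimal equivalent, reusing df_baseline and dF2 from above:
import pandas as pd
# pd.concat stacks the baseline rows on top of the filtered rows, just like append did
df_final = pd.concat([df_baseline, dF2])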

Related

find max value for index in dataframe and latest value

In the following dataframe, for each user I would like to subtract from the max value of u the value of u in the row corresponding to the max value of t. So it should be 21 (max of u) - 18 (u at max of t) = 3.
The dataframe is grouped by ['user','t']
user  t     u
1     0.0   -1.14
      2.30  2.8
      2.37  9.20
      2.40  21
      2.45  18
2     ...   ...
If t wasn't part of the index, I would have used something like df.groupby().agg({'u':'max'}) and df.groupby().agg({'t':'max'}), but since it is, I don't know how I could use agg() on t.
(edit)
I found out that I can use df.reset_index(level=['t'], inplace=True) to change t into a column, but now I realise that if I used
df.groupby(['user']).agg({"t":'max'}), the corresponding u values would be missing.
The goal is to create a new dataframe that contains the values like this:
user (U_max - U_tmax)
1 3
2 ...
Let's start by re-creating a dataframe similar to yours, with the below code:
import pandas as pd
import numpy as np
cols = ['user', 't', 'u']
df = pd.DataFrame(columns=cols)
size = 10
df['user'] = np.random.randint(1,3, size=size)
df['t'] = np.random.uniform(0.0,3.0, size=size)
df = df.groupby(['user','t']).sum()
df['u'] = np.random.randint(-30,30, size=len(df))
print(df)
The output is something like:
u
user t
1 0.545562 19
0.627296 23
0.945533 -13
1.697278 -18
1.904453 -10
2.008375 5
2.296342 -2
2 0.282291 14
1.461548 -6
2.594966 -19
The first thing we'll need to do in order to work on this df is to reset the index, so:
df = df.reset_index()
Now we have all our columns back and we can use them to apply our final groupby() function.
We can start by grouping by user, which is what we need, specifying u and t as columns, so that we can access them in a lambda function.
In this lambda function, we will subtract from the max value of u the u value corresponding to the max value of t.
So, the max value of u must be something like:
x['u'].max()
And the u value of max of t should look like:
x['u'][x['t'].idxmax()]
So as you can see we've found the index for the max value of t, and used it to slice x['u'].
Here is the final code:
df = df.reset_index()
df = df.groupby(['user'])[['u', 't']].apply(lambda x: (x['u'].max() - x['u'][x['t'].idxmax()]))
print(df)
Final output:
user
1 25
2 33
Gross Error Check:
max of u for user 1 is 23
max of t for user 1 is 2.296342, and the corresponding u is -2
23 - (-2) = 25
max of u for user 2 is 14
max of t for user 2 is 2.594966, and the corresponding u is -19
14 - (-19) = 33
Bonus tip: If you'd like to rename the returned column from groupby, use reset_index() along with set_index() after the groupby operation:
df = df.reset_index(name='(U_max - U_tmax)').set_index('user')
It will yield:
(U_max - U_tmax)
user
1 25
2 33
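For what it's worth, the same numbers can also be obtained without apply by combining the per-user max of u with idxmax on t; a rough sketch, assuming t has already been turned back into a regular column with reset_index as above:
g = df.groupby('user')
# u taken at the row where t is maximal, re-indexed by user so it aligns with g['u'].max()
u_at_tmax = df.loc[g['t'].idxmax(), ['user', 'u']].set_index('user')['u']
result = (g['u'].max() - u_at_tmax).rename('(U_max - U_tmax)')
print(result)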

Iterate rows and find sum of rows not exceeding a number

Below is a dataframe showing coordinate values from and to, each row having a corresponding value column.
I want to find the range of coordinates where the value column doesn't exceed 5. Below is the dataframe input.
import pandas as pd
From=[10,20,30,40,50,60,70]
to=[20,30,40,50,60,70,80]
value=[2,3,5,6,1,3,1]
df=pd.DataFrame({'from':From, 'to':to, 'value':value})
print(df)
Hence I want to convert the table printed above into an outcome like the one described below.
Further explanation:
Coordinates from 10 to 30 are joined and the value column changes to 5,
as that is the sum of the values from 10 to 30 (not exceeding 5)
Coordinates 30 to 40 equal 5
Coordinates 40 to 50 equal 6 (more than 5; however, it's included as it cannot be divided further)
Remaining coordinates sum up to a value of 5
What code is required to achieve the above?
We can do a groupby on cumsum:
s = df['value'].ge(5)
(df.groupby([~s, s.cumsum()], as_index=False, sort=False)
.agg({'from':'min','to':'max', 'value':'sum'})
)
Output:
from to value
0 10 30 5
1 30 40 5
2 40 50 6
3 50 80 5
Update: It looks like you want to accumulate the values so the new groups do not exceed 5. There are several threads on SO saying that this can only be done with a for loop. So we can do something like this:
thresh = 5
groups, partial, curr_grp = [], thresh, 0
for v in df['value']:
    if partial + v > thresh:
        curr_grp += 1
        partial = v
    else:
        partial += v
    groups.append(curr_grp)
df.groupby(groups).agg({'from':'min','to':'max', 'value':'sum'})
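If you need this grouping in more than one place, the loop can be wrapped in a small helper; a minimal sketch (the name group_by_threshold is just for illustration), which on the sample data reproduces the table shown above:
def group_by_threshold(df, thresh=5):
    # label consecutive rows so each group's summed 'value' stays <= thresh;
    # a single row larger than thresh ends up in its own group
    groups, partial, curr_grp = [], thresh, 0
    for v in df['value']:
        if partial + v > thresh:
            curr_grp += 1
            partial = v
        else:
            partial += v
        groups.append(curr_grp)
    return df.groupby(groups).agg({'from': 'min', 'to': 'max', 'value': 'sum'})

print(group_by_threshold(df))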

Randomly introduce NaN values in pandas dataframe

How could I randomly introduce NaN values into my dataset for each column, taking into account the null values already present in my starting data?
I want to have, for example, 20% of NaN values per column.
For example:
If I have 3 columns in my dataset, "A", "B" and "C", each with an existing NaN rate, how do I randomly introduce NaN values per column to reach 20% in each:
A: 10% nan
B: 15% nan
C: 8% nan
For the moment I tried this code, but it degrades my dataset too much and I don't think it is the right way:
df = df.mask(np.random.choice([True, False], size=df.shape, p=[.20,.80]))
I am not sure what you mean by the last part ("degrades too much"), but here is a rough way to do it.
import numpy as np
import pandas as pd
A = pd.Series(np.arange(99))
# Original missing rate (for illustration)
nanidx = A.sample(frac=0.1).index
A[nanidx] = np.NaN
###
# Complementing to 20%
# Original ratio
ori_rat = A.isna().mean()
# Adjusting for the dataframe without missing values
add_miss_rat = (0.2 - ori_rat) / (1 - ori_rat)
nanidx2 = A.dropna().sample(frac=add_miss_rat).index
A[nanidx2] = np.NaN
A.isna().mean()
Obviously, it will not always be exactly 20%...
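As a quick check of the adjustment formula: if the original missing rate is 10%, then add_miss_rat = (0.2 - 0.1) / (1 - 0.1) ≈ 0.111, and sampling that fraction of the remaining 90% non-missing values adds roughly 0.9 × 0.111 ≈ 0.10 of the column, so the total missing rate ends up around 10% + 10% = 20%.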
Update
Applying it for the whole dataframe
for col in df:
    ori_rat = df[col].isna().mean()
    if ori_rat >= 0.2: continue
    add_miss_rat = (0.2 - ori_rat) / (1 - ori_rat)
    vals_to_nan = df[col].dropna().sample(frac=add_miss_rat).index
    df.loc[vals_to_nan, col] = np.NaN
Update 2
I made a correction to also take into account the effect of dropping NaN values when calculating the ratio.
Unless you have a giant DataFrame and speed is a concern, the easy-peasy way to do it is by iteration.
import pandas as pd
import numpy as np
import random
df = pd.DataFrame({'A':list(range(100)),'B':list(range(100)),'C':list(range(100))})
#before adding nan
print(df.head(10))
nan_percent = {'A':0.10, 'B':0.15, 'C':0.08}
for col in df:
    for i, row_value in df[col].items():  # iteritems() was removed in pandas 2.0; items() is equivalent
        if random.random() <= nan_percent[col]:
            df.loc[i, col] = np.nan  # .loc avoids chained-assignment issues
#after adding nan
print(df.head(10))
Here is a way to get as close to 20% nan in each column as possible:
def input_nan(x, pct):
    n = int(len(x)*(pct - x.isna().mean()))
    idxs = np.random.choice(len(x), max(n, 0), replace=False, p=x.notna()/x.notna().sum())
    x.iloc[idxs] = np.nan

df.apply(input_nan, pct=.2)
It first takes the difference between the NaN percentage you want and the percentage of NaN values already in your dataset. Then it multiplies that by the length of the column, which gives you how many NaN values you want to put in (n). Then it uses np.random.choice, which randomly chooses n indexes that don't have NaN values in them.
Example:
df = pd.DataFrame({'y':np.random.randn(10), 'x1':np.random.randn(10), 'x2':np.random.randn(10)})
df.y.iloc[1]=np.nan
df.y.iloc[8]=np.nan
df.x2.iloc[5]=np.nan
# y x1 x2
# 0 2.635094 0.800756 -1.107315
# 1 NaN 0.055017 0.018097
# 2 0.673101 -1.053402 1.525036
# 3 0.246505 0.005297 0.289559
# 4 0.883769 1.172079 0.551917
# 5 -1.964255 0.180651 NaN
# 6 -0.247067 0.431622 -0.846953
# 7 0.603750 0.475805 0.524619
# 8 NaN -0.452400 -0.191480
# 9 -0.583601 -0.446071 0.029515
df.apply(input_nan, pct=.2)
# y x1 x2
# 0 2.635094 0.800756 -1.107315
# 1 NaN 0.055017 0.018097
# 2 0.673101 -1.053402 1.525036
# 3 0.246505 0.005297 NaN
# 4 0.883769 1.172079 0.551917
# 5 -1.964255 NaN NaN
# 6 -0.247067 0.431622 -0.846953
# 7 0.603750 NaN 0.524619
# 8 NaN -0.452400 -0.191480
# 9 -0.583601 -0.446071 0.029515
I have applied it to the whole dataset, but you can apply it to any column you want. For example, if you wanted 15% NaNs in columns y and x1, you could call df[['y','x1']].apply(input_nan, pct=.15)
I guess I am a little late to the party but if someone needs a solution that's faster and takes the percentage value into account when introducing null values, here's the code:
nan_percent = {'A': 0.15, 'B': 0.05, 'C': 0.23}
for col, perc in nan_percent.items():
    df['null'] = np.random.choice([0, 1], size=df.shape[0], p=[1-perc, perc])
    df.loc[df['null'] == 1, col] = np.nan
    df.drop(columns=['null'], inplace=True)
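Whichever variant you use, a quick sanity check is to compare the observed per-column NaN rate with the requested one (a minimal check, reusing df and nan_percent from above):
observed = df.isna().mean()
for col, perc in nan_percent.items():
    print(f"{col}: observed {observed[col]:.2%}, requested {perc:.0%}")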

Merge subgroup into adjacent subgroup after groupby

If we run the following code
np.random.seed(0)
features = ['f1','f2','f3']
df = pd.DataFrame(np.random.rand(5000,4), columns=features+['target'])
for f in features:
    df[f] = np.digitize(df[f], bins=[0.13,0.66])
df['target'] = np.digitize(df['target'], bins=[0.5]).astype(float)
df.groupby(features)['target'].agg(['mean','count']).head(9)
We get average values for each grouping of the feature set:
mean count
f1 f2 f3
0 0 0 0.571429 7
1 0.414634 41
2 0.428571 28
1 0 0.490909 55
1 0.467337 199
2 0.486726 113
2 0 0.518519 27
1 0.446281 121
2 0.541667 72
In the table above, some of the groups have too few observations and I want to merge them into an 'adjacent' group by some rule. For example, I may want to merge the group [0,0,0] with group [0,0,1] since it has no more than 30 observations. I wonder if there is any good way of performing such group combinations according to column values without creating a separate dictionary? More specifically, I may want to merge from the smallest-count group into its adjacent group (the next group within the index order) until the total number of groups is no more than 10.
A simple way to do it is with a for loop over the indexes meeting your condition:
df_group = df.groupby(features)['target'].agg(['mean','count'])
# First reset_index to get easier manipulation
df_group = df_group.reset_index()
list_indexes = df_group[df_group['count'] <= 58].index.values  # put any value you want
# loop over list_indexes
for ind in list_indexes:
    # check your condition again, in case merging the row at the previous
    # iteration has increased the count above your criteria
    if df_group.loc[ind, 'count'] <= 58:
        # add the count values to the next row (.loc avoids chained assignment)
        df_group.loc[ind + 1, 'count'] = df_group.loc[ind + 1, 'count'] + df_group.loc[ind, 'count']
        # do anything you want on mean
        # drop the row
        df_group = df_group.drop(axis=0, index=ind)
# Reindex your df
df_group = df_group.set_index(features)
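If you specifically want to keep merging the smallest group into the next one until no more than 10 groups remain, as described in the question, a rough sketch along the same lines (reusing df and features from above; folding the mean in as a count-weighted average is just one possible choice):
max_groups = 10
df_group = df.groupby(features)['target'].agg(['mean', 'count']).reset_index()
while len(df_group) > max_groups:
    pos = df_group['count'].values.argmin()                     # position of the smallest group
    target = pos + 1 if pos + 1 < len(df_group) else pos - 1    # next group, or previous for the last row
    c1, c2 = df_group['count'].iloc[pos], df_group['count'].iloc[target]
    m1, m2 = df_group['mean'].iloc[pos], df_group['mean'].iloc[target]
    # combine the counts and recompute the mean as a count-weighted average
    df_group.iloc[target, df_group.columns.get_loc('mean')] = (m1 * c1 + m2 * c2) / (c1 + c2)
    df_group.iloc[target, df_group.columns.get_loc('count')] = c1 + c2
    df_group = df_group.drop(df_group.index[pos]).reset_index(drop=True)
df_group = df_group.set_index(features)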

How to apply a python function to splitted 'from the end' pandas sub-dataframes and get a new dataframe?

The problem
Starting from a pandas dataframe df made of dim_df rows, I need a new dataframe df_new obtained by applying a function to every sub-dataframe of dimension dim_blk, ideally split starting from the last row (so the first block, not the last, may or may not have the right number of rows, dim_blk), in the most efficient way (maybe vectorized?).
Example
In the following example the dataframe is made of a few rows, but the real dataframe will be made of millions of rows, that's why I need an efficient solution.
dim_df = 7 # dimension of the starting dataframe
dim_blk = 3 # number of rows of the splitted block
df = pd.DataFrame(np.arange(1,dim_df+1), columns=['TEST'])
print(df)
Output:
TEST
0 1
1 2
2 3
3 4
4 5
5 6
6 7
The splitted blocks I want:
1      # note: this is the first block, composed of <= dim_blk rows
2,3,4
5,6,7  # note: this is the last block and it has exactly dim_blk rows
I've done it like this (I don't know if this is an efficient way):
lst = np.arange(dim_df, 0, -dim_blk) # [7 4 1]
lst_mod = lst[1:] # [4 1] to cut off the last empty sub-dataframe
split_df = np.array_split(df, lst_mod[::-1]) # splitted by reversed list
print(split_df)
Output:
split_df: [
TEST
0 1,
TEST
1 2
2 3
3 4,
TEST
4 5
5 6
6 7]
For example:
print(split_df[1])
Output:
TEST
1 2
2 3
3 4
How can I get a new dataframe, df_new, where every row is made of two columns, min and max (just an example), calculated for every block?
I.e:
# df_new
Min Max
0 1 1
1 2 4
2 5 7
Thank you,
Gilberto
You can convert split_df into a dataframe and then create a new dataframe using the min and max functions, i.e.
split_df = pd.DataFrame(np.array_split(df['TEST'], lst_mod[::-1]))
df_new = pd.DataFrame({"MIN":split_df.min(axis=1),"MAX":split_df.max(axis=1)}).reset_index(drop=True)
Output:
MAX MIN
0 1.0 1.0
1 4.0 2.0
2 7.0 5.0
Moved solution from question to answer:
The Solution
I've thought laterally and found a very speedy solution:
Apply a rolling function to the entire dataframe
Choose every dim_blk-th row starting from the end
The code (with different values):
import numpy as np
import pandas as pd
import time
dim_df = 500000
dim_blk = 240
df = pd.DataFrame(np.arange(1,dim_df+1), columns=['TEST'])
start_time = time.time()
df['MAX'] = df['TEST'].rolling(dim_blk).max()
df['MIN'] = df['TEST'].rolling(dim_blk).min()
df[['MAX', 'MIN']] = df[['MAX', 'MIN']].fillna(method='bfill')
df_split = pd.DataFrame(columns=['MIN', 'MAX'])
df_split['MAX'] = df['MAX'][-1::-dim_blk][::-1]
df_split['MIN'] = df['MIN'][-1::-dim_blk][::-1]
df_split.reset_index(inplace=True)
del(df_split['index'])
print(df_split.tail())
print('\n\nEND\n\n')
print("--- %s seconds ---" % (time.time() - start_time))
Time Stats
The original code stops after 545 secs. The new code stops after 0.16 secs. Awesome!
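For comparison, the same per-block min/max can also be computed with a single groupby on block labels built from the row position counted from the end, which avoids both the rolling window and the backwards slicing; a rough sketch under the same dim_df/dim_blk setup (an alternative, not the code above):
import numpy as np
import pandas as pd

dim_df, dim_blk = 500000, 240
df = pd.DataFrame(np.arange(1, dim_df + 1), columns=['TEST'])

# block label per row, counted from the end: the last dim_blk rows get label 0,
# the previous dim_blk rows get label 1, and the first (possibly shorter) block
# gets the highest label
labels = (dim_df - 1 - np.arange(dim_df)) // dim_blk

df_split = (df.groupby(labels)['TEST']
              .agg(MIN='min', MAX='max')
              .sort_index(ascending=False)   # restore top-to-bottom block order
              .reset_index(drop=True))
print(df_split.tail())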
