Pandas - Randomly Replace 10% of rows with other rows - python

I want to randomly select 10% of all rows in my df and replace each with a randomly sampled existing row from the df.
To randomly select 10% of the rows, rows_to_change = df.sample(frac=0.1) works, and I can get a single random existing row with replacement_sample = df.sample(n=1), but how do I put this together to quickly apply it across the entire 10%?
The df contains millions of rows x ~100 cols.
Example df:
df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],'B':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],'C':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
A B C
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 14 14 14
14 15 15 15
Let's say it randomly samples indexes 2 and 13 to be replaced with randomly selected indexes 6 and 9; the final df would look like:
A B C
0 1 1 1
1 2 2 2
2 7 7 7
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 10 10 10
14 15 15 15

You can take a random sample, then take another random sample of the same size and replace the values at those indices with the original sample.
import pandas as pd
df = pd.DataFrame({'A': range(1,16), 'B': range(1,16), 'C': range(1,16)})
samp = df.sample(frac=0.1)
samp
# returns:
A B C
6 7 7 7
9 10 10 10
replace = df.loc[~df.index.isin(samp.index)].sample(samp.shape[0])
replace
# returns:
A B C
3 4 4 4
7 8 8 8
df.loc[replace.index] = samp.values
This copies the sampled rows over the rows chosen for replacement; both samples here are drawn without replacement:
df
# returns:
A B C
0 1 1 1
1 2 2 2
2 3 3 3
3 7 7 7
4 5 5 5
5 6 6 6
6 7 7 7
7 10 10 10
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 14 14 14
14 15 15 15
To sample with replacement, pass the keyword argument replace=True when defining samp.
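For example, a minimal sketch of that variant (everything else stays the same):
samp = df.sample(frac=0.1, replace=True)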

@James' answer is a smart Pandas solution. However, given that you noted your dataset length is somewhere in the millions, you could also consider NumPy, since Pandas often comes with significant performance overhead.
import numpy as np

def repl_rows(df: pd.DataFrame, pct: float):
    # Modifies `df` in place.
    n, _ = df.shape
    rows = int(2 * np.ceil(n * pct))   # total rows across both sets
    idx = np.arange(n, dtype=np.intp)  # positional indices, dtype agnostic
    full = np.random.choice(idx, size=rows, replace=False)
    to_repl, repl_with = np.split(full, 2)
    # Writing through .values assumes a single-dtype frame; for mixed dtypes
    # (or pandas copy-on-write), use df.iloc[to_repl] = df.iloc[repl_with].to_numpy()
    df.values[to_repl] = df.values[repl_with]
Steps:
Get target rows as an integer.
Get a NumPy range-array the same length as your index. Might provide more stability than using the index itself if you have something like an uneven datetime index. (I'm not totally sure, something to toy around with.)
Sample from this index without replacement, sample size is 2 times the number of rows you want to manipulate.
Split the result in half to get targets and replacements. Should be faster than two calls to choice().
Replace at positions to_repl with values from repl_with.
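A quick usage sketch on the example DataFrame from the question (assuming the repl_rows function above and a single-dtype frame, since it writes through .values):
df = pd.DataFrame({'A': range(1, 16), 'B': range(1, 16), 'C': range(1, 16)})
repl_rows(df, 0.1)  # overwrites ~10% of the rows in place with other rows' values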

Related

Ordering a dataframe by each column

I have a dataframe that looks like this:
ID Age Score
0 9 5 3
1 4 6 1
2 9 7 2
3 3 2 1
4 12 1 15
5 2 25 6
6 9 5 4
7 9 5 61
8 4 2 12
I want to sort based on the first column, then the second column, and so on.
So I want my output to be this:
ID Age Score
5 2 25 6
3 3 2 1
8 4 2 12
1 4 6 1
0 9 5 3
6 9 5 4
7 9 5 61
2 9 7 2
4 12 1 15
I know I can do the above with df.sort_values(df.columns.to_list()), however I'm worried this might be quite slow for much larger dataframes (in terms of columns and rows).
Is there a more optimal solution?
You can use numpy.lexsort to improve performance.
import numpy as np
a = df.to_numpy()
out = pd.DataFrame(a[np.lexsort(np.rot90(a))],
                   index=df.index, columns=df.columns)
Assuming as input a random square DataFrame of side n:
df = pd.DataFrame(np.random.randint(0, 100, size=(n, n)))
here is the comparison for 100 to 100M items (lower runtime is better):
[Benchmark graphs omitted: absolute runtime, and the same graph with speed relative to pandas.]
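As a quick sanity check (a sketch assuming a default RangeIndex), the NumPy route should give the same rows as sort_values:
expected = df.sort_values(df.columns.to_list()).reset_index(drop=True)
assert out.reset_index(drop=True).equals(expected)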
By still using df.sort_values() you can speed it up a bit by selecting the type of sorting algorithm. By default it's set to quicksort, but there are the alternatives 'mergesort', 'heapsort', and 'stable'.
Maybe specifying one of these would improve it?
df.sort_values(df.columns.to_list(), kind="mergesort")
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

Merging rows in a dataframe based on reoccurring values

I have the following dataframe with each row containing two values.
print(x)
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17
6 16 18
7 16 19
8 17 18
9 17 19
10 18 19
11 20 21
I want to merge these values if one or both values of a particular row reoccur in another row. The principle can be explained as follows: if A and B are together in one row and B and C are together in another row, then it means that A, B and C should be together. What I want as an outcome, looking at the dataframe above, is:
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
I tried creating a loop with df.duplicated that would create such an outcome, but it hasn't worked out yet.
This seems like a graph theory problem dealing with connected components. You can use the networkx library:
import networkx as nx
g = nx.from_pandas_edgelist(df, 'a', 'b')
pd.concat([pd.Series([list(i)[0],
                      ' '.join(map(str, list(i)[1:]))],
                     index=['a', 'b'])
           for i in list(nx.connected_components(g))], axis=1).T
Output:
a b
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
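Note that from_pandas_edgelist above assumes the two columns are named 'a' and 'b'; a minimal setup for the example under that naming assumption would be:
import pandas as pd
df = pd.DataFrame({'a': [0, 4, 8, 10, 14, 16, 16, 16, 17, 17, 18, 20],
                   'b': [1, 5, 9, 11, 15, 17, 18, 19, 18, 19, 19, 21]})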

How to select data 3 times in row dataframe greater threshold - pandas

I want to select residual data that pass the threshold for 3 consecutive rows, where my threshold is 3. I have attached the CSV data at the link below. What I currently do is only the value filter below, but I also need the time criterion: consecutive data are rows that pass the threshold and are sequential in time.
df[df.residual_value >= 3]
Data csv
IIUC, you want to filter the rows that are greater than or equal to 3, but only when 3 consecutive rows match the criterion. You can use rolling + min:
processing:
df[df['col'].rolling(window=3).min().shift(-2).ge(3)]
example dataset:
np.random.seed(0)
df = pd.DataFrame({'col': np.random.randint(0,10,100)})
>>> df.head(15)
col
0 5
1 0
2 3
3 3
4 7
5 9
6 3
7 5
8 2
9 4
10 7
11 6
12 8
13 8
14 1
output:
col
2 3
3 3
4 7
5 9
9 4
10 7
11 6
...
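If you also want the trailing rows of each qualifying window (so that every row belonging to some run of 3 is kept), one possible sketch builds on the same mask:
mask = df['col'].rolling(window=3).min().shift(-2).ge(3)
df[mask | mask.shift(1, fill_value=False) | mask.shift(2, fill_value=False)]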

pandas add value to new column based on condition

I've been searching around but couldn't find the answer I was looking for, so I apologize for asking what I would imagine is a repetitive question.
I have two dataframes - df1 is a list of transaction data and df2 is a sort of key. df1['code'] references a column in df2.
If the code for the transaction found in df1 is in df2, I'd like to append a value to that df1 entry in a new column identifying that the transaction was valid. If the code is not in df2, I'd like to note the opposite in that same new column.
I understand how I might do this with a 'for' loop, but my understanding is I should learn how to use pandas without relying on that.
Thanks in advance for the help!
Use numpy.where():
df1['new_col'] = np.where(df1['df1_code'].isin(df2['df2_code']), 'VALID', 'INVALID')
Sample DF
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.DataFrame({'code':range(5,15), 'transaction':range(10)})
>>> df2 = pd.DataFrame({'code':range(12,22), 'transaction':range(7,17)})
>>> df1
code transaction
0 5 0
1 6 1
2 7 2
3 8 3
4 9 4
5 10 5
6 11 6
7 12 7
8 13 8
9 14 9
>>> df2
code transaction
0 12 7
1 13 8
2 14 9
3 15 10
4 16 11
5 17 12
6 18 13
7 19 14
8 20 15
9 21 16
>>> df1['new_col'] = np.where(df1['code'].isin(df2['code']), 'VALID', 'INVALID')
>>> df1
code transaction new_col
0 5 0 INVALID
1 6 1 INVALID
2 7 2 INVALID
3 8 3 INVALID
4 9 4 INVALID
5 10 5 INVALID
6 11 6 INVALID
7 12 7 VALID
8 13 8 VALID
9 14 9 VALID
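If a plain boolean flag is enough instead of the 'VALID'/'INVALID' strings, isin alone also works; a minimal sketch (the new column name is just an example):
df1['is_valid'] = df1['code'].isin(df2['code'])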

Python Pandas: Get 2 set of random samples per group

I have a pandas DataFrame like this:
user value
0 a 1
1 a 2
2 a 3
3 a 4
4 a 5
5 b 6
6 b 7
7 b 8
8 b 9
9 b 10
10 c 11
11 c 12
12 c 13
13 c 14
14 c 15
Now I want to group by user, and create two mutually exclusive random samples out of it, e.g.
Set1 with 1 sample per group:
user value
3 a 4
9 b 10
13 c 14
Set2 with 2 samples per group:
user value
0 a 1
1 a 2
5 b 6
6 b 7
10 c 11
11 c 12
So far I've tried this:
u = np.array(['a','b','c'])
u = np.repeat(u,5)
df = pd.DataFrame({'user':u,'value':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
set1 = df.groupby(['user']).tail(1)
set2 = df.groupby(['user']).head(2)
But these are not random samples, and I would like them to be mutually exclusive. Any ideas?
PS. Each group always has at least 3 elements
You can randomly select 3 records for each user:
a = df.groupby("user")["value"].apply(lambda x: x.sample(3))
a
Out[27]:
user
a 3 4
0 1
2 3
b 5 6
7 8
6 7
c 14 15
10 11
13 14
dtype: int64
And assign the first one to the first set, and the remaining two to the second set:
a.groupby(level=0).head(1)
Out[28]:
user
a 3 4
b 5 6
c 14 15
dtype: int64
a.groupby(level=0).tail(2)
Out[29]:
user
a 0 1
2 3
b 7 8
6 7
c 10 11
13 14
dtype: int64
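If you need the full DataFrame rows rather than just the value Series, one possible sketch is to index back into df using the inner index level of a:
set1 = df.loc[a.groupby(level=0).head(1).index.get_level_values(1)]
set2 = df.loc[a.groupby(level=0).tail(2).index.get_level_values(1)]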
This may be a bit naive, but all I did was reindex the DataFrame with a random permutation of its length and reset the index. After that I take the head and tail as you did with your original code, and it seems to work. This could probably be made into a function:
a = np.arange(len(df))
np.random.shuffle(a)
df = df.reindex(a).reset_index()
set1 = df.groupby(['user']).tail(1)
>>>
index user value
12 9 b 10
13 10 c 11
14 1 a 2
set2 = df.groupby(['user']).head(2)
>>>
index user value
0 6 b 7
1 2 a 3
2 5 b 6
3 13 c 14
4 3 a 4
6 12 c 13
Hope this helps.
There is likely a better solution, but what about just randomizing your data before grouping and then taking the head and tail per group? You could take your indices, take a random permutation of them, use that to create a new scrambled dataframe, and then do your current procedure.
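A minimal sketch of that idea, assuming the 'user'/'value' example df from the question:
shuffled = df.sample(frac=1)             # random permutation of the rows
set1 = shuffled.groupby('user').head(1)  # 1 random row per user
set2 = shuffled.groupby('user').tail(2)  # 2 other random rows per user
This works because groupby keeps the (shuffled) row order within each group, and head/tail are disjoint as long as every group has at least 3 rows.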
