pandas add value to new column based on condition - python

I've been searching around but couldn't find the answer I was looking for, so I apologize for asking what I would imagine is a repetitive question.
I have two dataframes - df1 is a list of transaction data and df2 is a sort of key. df1['code'] references a column in df2.
If the code for the transaction found in df1 is in df2, I'd like to append a value to that df1 entry in a new column identifying that the transaction was valid. If the code is not in df2, I'd like to note the opposite in that same new column.
I understand how I might do this with a 'for' loop, but my understanding is I should learn how to use pandas without relying on that.
Thanks in advance for the help!

Use numpy.where():
df1['new_col'] = np.where(df1['code'].isin(df2['code']), 'VALID', 'INVALID')
Sample DF
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.DataFrame({'code':range(5,15), 'transaction':range(10)})
>>> df2 = pd.DataFrame({'code':range(12,22), 'transaction':range(7,17)})
>>> df1
   code  transaction
0     5            0
1     6            1
2     7            2
3     8            3
4     9            4
5    10            5
6    11            6
7    12            7
8    13            8
9    14            9
>>> df2
   code  transaction
0    12            7
1    13            8
2    14            9
3    15           10
4    16           11
5    17           12
6    18           13
7    19           14
8    20           15
9    21           16
>>> df1['new_col'] = np.where(df1['code'].isin(df2['code']), 'VALID', 'INVALID')
>>> df1
   code  transaction  new_col
0     5            0  INVALID
1     6            1  INVALID
2     7            2  INVALID
3     8            3  INVALID
4     9            4  INVALID
5    10            5  INVALID
6    11            6  INVALID
7    12            7    VALID
8    13            8    VALID
9    14            9    VALID
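If you prefer to stay inside pandas, a left merge with indicator=True gives the same flag. A minimal sketch, assuming the sample frames above:
merged = df1.merge(df2[['code']].drop_duplicates(), on='code',
                   how='left', indicator=True)
df1['new_col'] = merged['_merge'].map({'both': 'VALID',
                                       'left_only': 'INVALID'}).values
drop_duplicates() keeps the row count equal to df1's, and the left merge preserves its row order, so the result can be assigned back positionally.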

Related

Ordering a dataframe by each column

I have a dataframe that looks like this:
   ID  Age  Score
0   9    5      3
1   4    6      1
2   9    7      2
3   3    2      1
4  12    1     15
5   2   25      6
6   9    5      4
7   9    5     61
8   4    2     12
I want to sort based on the first column, then the second column, and so on.
So I want my output to be this:
   ID  Age  Score
5   2   25      6
3   3    2      1
8   4    2     12
1   4    6      1
0   9    5      3
6   9    5      4
7   9    5     61
2   9    7      2
4  12    1     15
I know I can do the above with df.sort_values(df.columns.to_list()), however I'm worried this might be quite slow for much larger dataframes (in terms of columns and rows).
Is there a more optimal solution?
You can use numpy.lexsort to improve performance.
import numpy as np
a = df.to_numpy()
out = pd.DataFrame(a[np.lexsort(np.rot90(a))],
                   index=df.index, columns=df.columns)
Assuming as input a random square DataFrame of side n:
df = pd.DataFrame(np.random.randint(0, 100, size=(n, n)))
I compared both approaches for 100 to 100M items (lower runtime is better), and also plotted the speed relative to pandas. (Benchmark plots omitted.)
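If you want to reproduce the comparison yourself, here is a rough timing sketch with timeit (the sizes and repeat count are arbitrary choices, and only the sort itself is timed):
import timeit
import numpy as np
import pandas as pd

for n in (100, 1000):
    df = pd.DataFrame(np.random.randint(0, 100, size=(n, n)))
    a = df.to_numpy()
    t_pandas = timeit.timeit(lambda: df.sort_values(df.columns.to_list()), number=5)
    t_lexsort = timeit.timeit(lambda: a[np.lexsort(np.rot90(a))], number=5)
    print(f"n={n}: sort_values {t_pandas:.3f}s, lexsort {t_lexsort:.3f}s")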
If you stick with df.sort_values(), you can speed it up a bit by selecting the sorting algorithm via the kind parameter. The default is 'quicksort', but 'mergesort', 'heapsort' and 'stable' are also available, and specifying one of these may improve performance:
df.sort_values(df.columns.to_list(), kind="mergesort")
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

Replacing all values greater than a value of a specific column in pandas

I have tried several versions but all of them throw a warning, starting with:
colName = 'age'
df_plot[colName][df_plot[colName]>10] = 10
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame. See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Based on the link in the warning, I then tried:
df_plot.loc[:, (colName, df_plot[colName]>10)] = 10
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Next:
df_plot.loc[colName, df_plot[colName] > 10] = 10
TypeError: 'Series' objects are mutable, thus they cannot be hashed
And finally, based on a Stack Overflow answer:
df_plot[colName] = df_plot[colName].apply(lambda x: [y if y <= 10 else 10 for y in x])
TypeError: 'float' object is not iterable
What am I doing wrong here?
There's an easier way to clip values:
df_plot[colName] = df_plot[colName].clip(upper=10)
Your first attempt seems correct to me, and works without warnings on my machine.
df = pd.DataFrame({'age':range(5,15), 'size':range(10)})
colName = 'age'
df[colName][df[colName]>10] = 10
print(df)
Output:
   age  size
0    5     0
1    6     1
2    7     2
3    8     3
4    9     4
5   10     5
6   10     6
7   10     7
8   10     8
9   10     9
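If you do see the SettingWithCopyWarning, the form the warning message itself suggests, a single .loc call with a boolean row indexer and the column label, avoids chained indexing entirely. A small sketch reusing the sample frame above:
import pandas as pd

df_plot = pd.DataFrame({'age': range(5, 15), 'size': range(10)})
colName = 'age'
# one .loc assignment, no chained indexing
df_plot.loc[df_plot[colName] > 10, colName] = 10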
Have you tried numpy.where()? Let me use @Zvika's sample df.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'age':range(5,15), 'size':range(10)})
>>> df
   age  size
0    5     0
1    6     1
2    7     2
3    8     3
4    9     4
5   10     5
6   11     6
7   12     7
8   13     8
9   14     9
>>> df['age'] = np.where(df['age'] > 10, 10, df['age'])
>>> df
   age  size
0    5     0
1    6     1
2    7     2
3    8     3
4    9     4
5   10     5
6   10     6
7   10     7
8   10     8
9   10     9

Pandas - Randomly Replace 10% of rows with other rows

I want to randomly select 10% of all rows in my df and replace each with a randomly sampled existing row from the df.
To randomly select 10% of the rows, rows_to_change = df.sample(frac=0.1) works, and I can get a new random existing row with replacement_sample = df.sample(n=1), but how do I put this together to quickly iterate over the entire 10%?
The df contains millions of rows x ~100 cols.
Example df:
df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],'B':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],'C':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
     A   B   C
0    1   1   1
1    2   2   2
2    3   3   3
3    4   4   4
4    5   5   5
5    6   6   6
6    7   7   7
7    8   8   8
8    9   9   9
9   10  10  10
10  11  11  11
11  12  12  12
12  13  13  13
13  14  14  14
14  15  15  15
Let's say it randomly samples indexes 2 and 13 to replace with randomly selected indexes 6 and 9; the final df would look like:
     A   B   C
0    1   1   1
1    2   2   2
2    7   7   7
3    4   4   4
4    5   5   5
5    6   6   6
6    7   7   7
7    8   8   8
8    9   9   9
9   10  10  10
10  11  11  11
11  12  12  12
12  13  13  13
13  10  10  10
14  15  15  15
You can take a random sample, then take another random sample of the same size and replace the values at those indices with the original sample.
import pandas as pd
df = pd.DataFrame({'A': range(1, 16), 'B': range(1, 16), 'C': range(1, 16)})
samp = df.sample(frac=0.1)
samp
# returns:
    A   B   C
6   7   7   7
9  10  10  10
replace = df.loc[~df.index.isin(samp.index)].sample(samp.shape[0])
replace
# returns:
   A  B  C
3  4  4  4
7  8  8  8
df.loc[replace.index] = samp.values
This copies the rows without replacement
df
# returns:
     A   B   C
0    1   1   1
1    2   2   2
2    3   3   3
3    7   7   7
4    5   5   5
5    6   6   6
6    7   7   7
7   10  10  10
8    9   9   9
9   10  10  10
10  11  11  11
11  12  12  12
12  13  13  13
13  14  14  14
14  15  15  15
To sample with replacement, use the keyword replace=True when defining samp.
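For instance, keeping the names from above, this draws the source rows with replacement, so the same row may be copied to more than one target:
samp = df.sample(frac=0.1, replace=True)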
@James' answer is a smart pandas solution. However, since you noted that your dataset length is somewhere in the millions, you could also consider NumPy, as pandas often comes with significant performance overhead.
import numpy as np
import pandas as pd

def repl_rows(df: pd.DataFrame, pct: float):
    # Modifies `df` in place.
    n, _ = df.shape
    rows = int(2 * np.ceil(n * pct))    # total rows in both sets
    idx = np.arange(n, dtype=np.intp)   # positional index, dtype agnostic
    full = np.random.choice(idx, size=rows, replace=False)
    to_repl, repl_with = np.split(full, 2)
    # .iloc is safer than writing through df.values, which may be a copy
    df.iloc[to_repl] = df.iloc[repl_with].to_numpy()
Steps:
Get target rows as an integer.
Get a NumPy range-array the same length as your index. Might provide more stability than using the index itself if you have something like an uneven datetime index. (I'm not totally sure, something to toy around with.)
Sample from this index without replacement, sample size is 2 times the number of rows you want to manipulate.
Split the result in half to get targets and replacements. Should be faster than two calls to choice().
Replace at positions to_repl with values from repl_with.
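A quick sanity check on the example frame from the question (which rows get overwritten depends on the random draw):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(1, 16), 'B': range(1, 16), 'C': range(1, 16)})
repl_rows(df, pct=0.1)   # overwrites ~10% of the rows in place
print(df)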

Transpose in Pandas

I imported the data from a csv file with pandas. I want to split the column which includes 50 values (0 to 49) into 5 rows, each having ten values. Can anyone tell me how I can do this transpose in the form of a pandas DataFrame?
Let me rephrase what I said:
I attached the data that I have. I wanted to select the second column and split it into rows of 10 values each (two rows for the 20-row excerpt shown).
This is the code I have done so far (I couldn't get a picture of all 50 rows, so I have only put 20 rows):
import numpy as np
import pandas as pd
df = pd.read_csv('...csv')
df.iloc[:50,:2]
Consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(dict(mycolumn=np.random.randint(10, size=50)))
Using NumPy and reshaping, ignoring the indices:
pd.DataFrame(df.mycolumn.values.reshape(5, -1))
   0  1  2  3  4  5  6  7  8  9
0  0  2  7  3  8  7  0  6  8  6
1  0  2  0  4  9  7  3  2  4  3
2  3  6  7  7  4  5  3  7  5  9
3  8  7  6  4  7  6  2  6  6  5
4  2  8  7  5  8  4  7  6  1  5
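If you would rather fix the row width than the row count (assuming the column length is an exact multiple of 10), you can let NumPy infer the number of rows instead:
pd.DataFrame(df.mycolumn.values.reshape(-1, 10))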

Python Pandas: Get 2 set of random samples per group

I have a pandas DataFrame say this:
   user  value
0     a      1
1     a      2
2     a      3
3     a      4
4     a      5
5     b      6
6     b      7
7     b      8
8     b      9
9     b     10
10    c     11
11    c     12
12    c     13
13    c     14
14    c     15
Now I want to group by user and create two mutually exclusive random samples out of it, e.g.
Set 1 with 1 sample per group:
   user  value
3     a      4
9     b     10
13    c     14
Set2 with 2 samples per group:
   user  value
0     a      1
1     a      2
5     b      6
6     b      7
10    c     11
11    c     12
So far I've tried this:
u = np.array(['a','b','c'])
u = np.repeat(u,5)
df = pd.DataFrame({'user':u,'value':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
set1 = df.groupby(['user']).tail(1)
set2 = df.groupby(['user']).head(2)
But these are not random samples, and I would like them to be mutually exclusive. Any ideas?
PS. Each group always has at least 3 elements
You can randomly select 3 records for each user:
a = df.groupby("user")["value"].apply(lambda x: x.sample(3))
a
Out[27]:
user
a  3      4
   0      1
   2      3
b  5      6
   7      8
   6      7
c  14    15
   10    11
   13    14
dtype: int64
And assign first one to the first set, the remaining two to the second set:
a.groupby(level=0).head(1)
Out[28]:
user
a  3      4
b  5      6
c  14    15
dtype: int64
a.groupby(level=0).tail(2)
Out[29]:
user
a  0      1
   2      3
b  7      8
   6      7
c  10    11
   13    14
dtype: int64
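Note that a is a Series keyed by (user, original index). If you want the full rows of the original frame back, one option (a sketch, reusing a and df from above) is to slice df with the second index level:
set1 = df.loc[a.groupby(level=0).head(1).index.get_level_values(1)]
set2 = df.loc[a.groupby(level=0).tail(2).index.get_level_values(1)]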
This may be a bit naive, but all I did was reindex the DataFrame with a random permutation of its length and reset the index. After that I take the head and tail as you did with your original code, and it seems to work. This could probably be made into a function:
a = np.arange(len(df))
np.random.shuffle(a)
df = df.reindex(a).reset_index()
set1 = df.groupby(['user']).tail(1)
>>>
    index user  value
12      9    b     10
13     10    c     11
14      1    a      2
set2 = df.groupby(['user']).head(2)
>>>
   index user  value
0      6    b      7
1      2    a      3
2      5    b      6
3     13    c     14
4      3    a      4
6     12    c     13
Hope this helps.
There is likely a better solution but what about just randomizing your data before grouping and then taking the tail and head per group? You could take a set of your indices, take a random permutation of it and use that to create a new scrambled dataframe, then do your current procedure.
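A minimal sketch of that idea, using df.sample(frac=1) to shuffle (random_state is only there for reproducibility). Since every group has at least 3 rows, the per-group head(1) and tail(2) of the shuffled frame cannot overlap:
import numpy as np
import pandas as pd

df = pd.DataFrame({'user': np.repeat(['a', 'b', 'c'], 5),
                   'value': range(1, 16)})

shuffled = df.sample(frac=1, random_state=0)   # shuffle all rows once
set1 = shuffled.groupby('user').head(1)        # 1 random row per user
set2 = shuffled.groupby('user').tail(2)        # 2 different random rows per user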
