Pythonic way to randomly assign pandas dataframe entries - python

Suppose we have a data frame
In [1]: df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
In [2]: df
Out[3]:
A B C D
0 45 88 44 92
1 62 34 2 86
2 85 65 11 31
3 74 43 42 56
4 90 38 34 93
5 0 94 45 10
.. .. .. .. ..
How can I randomly replace x% of all entries with a value, such as None?
In [4]: something(df, percent=25)
Out[5]:
A B C D
0 45 88 None 92
1 62 34 2 86
2 None None 11 31
3 74 43 None 56
4 90 38 34 None
5 None 94 45 10
.. .. .. .. ..
I've found information about sampling particular axes, and I can imagine a way of randomly generating integers within the dimensions of my data frame and setting those equal to None, but that doesn't feel very Pythonic.

You could combine DataFrame.where and np.random.uniform:
In [37]: df
Out[37]:
A B C D
0 1 0 2 2
1 2 2 0 3
2 3 0 0 3
3 0 2 3 1
In [38]: df.where(np.random.uniform(size=df.shape) > 0.3, None)
Out[38]:
A B C D
0 1 0 2 None
1 2 2 0 3
2 3 0 None None
3 None 2 3 None
It's not the most concise, but gets the job done.
Note though that you should ask yourself whether you really want to do this if you still have computations to do. If you put None in a column, then pandas is going to have to use the slow object dtype instead of something fast like int64 or float64.
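If missing values are acceptable in place of None, the same idea can be sketched with DataFrame.mask, which writes NaN and so keeps numeric columns as float64 rather than object. The helper name mask_fraction and its seed parameter are made up for illustration. Each cell is chosen independently, so the fraction actually replaced is only approximately the requested one.

```python
import numpy as np
import pandas as pd

def mask_fraction(df, frac, seed=None):
    """Return a copy of df with roughly `frac` of its cells set to NaN (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # True marks the cells to blank out; each cell is drawn independently.
    drop = rng.random(df.shape) < frac
    return df.mask(drop)

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
masked = mask_fraction(df, 0.25, seed=0)
print(masked.head())
# Share of cells that were replaced: close to 0.25, but not exactly 0.25.
print(masked.isna().to_numpy().mean())
```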


Python: How to repeat each combination of rows in Dataframe ranging 1 to n?

Have got a dataframe df like below:
Store Aisle Table
11 59 2
11 61 3
Need to expand each combination of row 3 times generating new column 'bit' with range value as below:
Store Aisle Table Bit
11 59 2 1
11 59 2 2
11 59 2 3
11 61 3 1
11 61 3 2
11 61 3 3
I tried the code below, but it didn't work.
df.loc[df.index.repeat(range(3))]
Help me out! Thanks in Advance.
You should provide a number, not a range to repeat. Also, you need a bit of processing:
(df.loc[df.index.repeat(3)]
   .assign(Bit=lambda d: d.groupby(level=0).cumcount().add(1))
   .reset_index(drop=True)
)
output:
Store Aisle Table Bit
0 11 59 2 1
1 11 59 2 2
2 11 59 2 3
3 11 61 3 1
4 11 61 3 2
5 11 61 3 3
Alternatively, using MultiIndex.from_product:
idx = pd.MultiIndex.from_product([df.index, range(1,3+1)], names=(None, 'Bit'))
(df.reindex(idx.get_level_values(0))
   .assign(Bit=idx.get_level_values(1))
)
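Another option uses np.repeat on the row positions:
import numpy as np
df = df.iloc[np.repeat(np.arange(len(df)), 3)]
# df is now 3x as long; repeat the 1..3 block once per original row
df['Bit'] = list(range(1, 3+1)) * (len(df)//3)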
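The repeat-and-number pattern can also be wrapped in a small function. This is only a sketch: expand_rows and its parameters are hypothetical names, it assumes the DataFrame has a unique index, and it uses np.tile to build the 1..n counter.

```python
import numpy as np
import pandas as pd

def expand_rows(df, n, col="Bit"):
    """Repeat every row of df n times and number the copies 1..n (hypothetical helper)."""
    # Assumes df has a unique index, so .loc returns each row exactly n times.
    out = df.loc[df.index.repeat(n)].reset_index(drop=True)
    # np.tile repeats the whole 1..n block once per original row.
    out[col] = np.tile(np.arange(1, n + 1), len(df))
    return out

df = pd.DataFrame({"Store": [11, 11], "Aisle": [59, 61], "Table": [2, 3]})
result = expand_rows(df, 3)
print(result)
```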

Retrieving future value in Python using offset variable from another column

I'm trying to figure out how to retrieve values from future rows using an offset stored in a separate column in Python. For instance, I have the dataframe df below, and I'd like to find a way to produce Column C:
Orig A Orig B Desired Column C
54 1 76
76 4 46
14 3 46
35 1 -3
-3 0 -3
46 0 46
64 0 64
93 0 93
72 0 72
Any help is much appreciated, thank you!
You can use NumPy for a vectorised solution:
import numpy as np
idx = np.arange(df.shape[0]) + df['OrigB'].values
df['C'] = df['OrigA'].iloc[idx].values
print(df)
OrigA OrigB C
0 54 1 76
1 76 4 46
2 14 3 46
3 35 1 -3
4 -3 0 -3
5 46 0 46
6 64 0 64
7 93 0 93
8 72 0 72
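Alternatively, with a list comprehension:
import pandas as pd
# named data rather than dict, to avoid shadowing the built-in
data = {"Orig A": [54,76,14,35,-3,46,64,93,72],
        "Orig B": [1,4,3,1,0,0,0,0,0],
        "Desired Column C": [76,46,46,-3,-3,46,64,93,72]}
df = pd.DataFrame(data)
# for each row position i, look up the value j rows further down
df["desired_test"] = [df["Orig A"].values[i+j] for i,j in enumerate(df["Orig B"].values)]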
df
Orig A Orig B Desired Column C desired_test
0 54 1 76 76
1 76 4 46 46
2 14 3 46 46
3 35 1 -3 -3
4 -3 0 -3 -3
5 46 0 46 46
6 64 0 64 64
7 93 0 93 93
8 72 0 72 72
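Both answers assume that the row position plus its offset never runs past the last row. If that can happen, one possible guard is to mark out-of-range lookups as missing. That behaviour is an assumption, since the question does not say what should happen in this case. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"OrigA": [54, 76, 14, 35, -3], "OrigB": [1, 4, 3, 1, 0]})

values = df["OrigA"].to_numpy()
target = np.arange(len(df)) + df["OrigB"].to_numpy()
in_range = (target >= 0) & (target < len(df))

# Clip so the lookup itself never fails, then blank out the invalid positions.
looked_up = values[np.clip(target, 0, len(df) - 1)].astype(float)
looked_up[~in_range] = np.nan
df["C"] = looked_up
print(df)
```

Here rows 1 and 2 point past the end of the frame, so their C values come out as NaN.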

Performing operations on grouped rows in python

I have a dataframe where the pic_code value may repeat. If it repeats, I want to set the variable "keep" to "t" for the row whose weight is closest to its mpe_wgt.
For example, the second pic_code has "keep" set to t since it has the "weight" closest to its corresponding "mpe_weight". My code results in "keep" staying 'f' for all and "diff" staying "100" for all.
df['keep']='f'
df['diff']=100
def cln_df(data):
    if pd.unique(data['mpe_wgt']).shape==(1,):
        data['keep'][0:1]='t'
    elif pd.unique(data['mpe_wgt']).shape!=(1,):
        data['diff']=abs(data['weight']-(data['mpe_wgt']/100))
        data['keep'][data['diff']==min(data['diff'])]='t'
    return data
df=df.groupby('pic_code').apply(cln_df)
df before
pic_code weight mpe_wgt keep diff
1234 45 34 f 100
1234 32 23 f 100
45344 54 35 f 100
234 76 98 f 100
234 65 12 f 100
df output should be
pic_code weight mpe_wgt keep diff
1234 45 34 f 11
1234 32 23 t 9
45344 54 35 t 100
234 76 98 t 22
234 65 12 f 53
I'm fairly new to python so please keep the solutions as simple as possible. I really want to make my method work so please don't get too fancy. Thanks in advance for your help.
This is one way. Note I am using Boolean values True / False in place of strings "t" and "f". This is just good practice.
Note that all the below operations are vectorised, while groupby.apply with a custom function certainly is not.
Setup
print(df)
pic_code weight mpe_wgt
0 1234 45 34
1 1234 32 23
2 45344 54 35
3 234 76 98
4 234 65 12
Solution
# calculate difference
df['diff'] = (df['weight'] - df['mpe_wgt']).abs()
# sort by pic_code, then by diff
df = df.sort_values(['pic_code', 'diff'])
# define keep column as True only for non-duplicates by pic_code
df['keep'] = ~df.duplicated('pic_code')
Result
print(df)
pic_code weight mpe_wgt diff keep
3 234 76 98 22 True
4 234 65 12 53 False
1 1234 32 23 9 True
0 1234 45 34 11 False
2 45344 54 35 19 True
Use:
df['keep'] = df.assign(closest=(df['mpe_wgt']-df['weight']).abs())\
.sort_values('closest').duplicated(subset=['pic_code'])\
.replace({True:'f',False:'t'})
Output:
pic_code weight mpe_wgt keep
0 1234 45 34 f
1 1234 32 23 t
2 45344 54 35 t
3 234 76 98 t
4 234 65 12 f
Maybe you can try cumcount
df['diff'] = (df['weight'] - df['mpe_wgt']).abs()
df['keep'] = df.sort_values('diff').groupby('pic_code').cumcount().eq(0)
df
pic_code weight mpe_wgt diff keep
0 1234 45 34 11 False
1 1234 32 23 9 True
2 45344 54 35 19 True
3 234 76 98 22 True
4 234 65 12 53 False
Using eval and assign to execute similar logic as other answers.
m = dict(zip([False, True], 'tf'))
f = lambda d: d.sort_values('diff').duplicated('pic_code').map(m)
df.eval('diff=abs(weight - mpe_wgt)').assign(keep=f)
pic_code weight mpe_wgt keep diff
0 1234 45 34 f 11.0
1 1234 32 23 t 9.0
2 45344 54 35 t 19.0
3 234 76 98 t 22.0
4 234 65 12 f 53.0
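The same rule can also be expressed without sorting, using groupby with idxmin, which returns the index label of the smallest diff in each group. A sketch, assuming the index is unique and that ties should go to the first row encountered:

```python
import pandas as pd

df = pd.DataFrame({
    "pic_code": [1234, 1234, 45344, 234, 234],
    "weight": [45, 32, 54, 76, 65],
    "mpe_wgt": [34, 23, 35, 98, 12],
})

df["diff"] = (df["weight"] - df["mpe_wgt"]).abs()
# Index label of the row with the smallest diff within each pic_code.
best = df.groupby("pic_code")["diff"].idxmin()
# A row is kept if its label is one of those per-group winners.
df["keep"] = df.index.isin(best)
print(df)
```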

convert index into columns - pandas

df=pd.DataFrame({'c1':[12,45,21,49],'c2':[67,86,28,55]})
I'd like to convert the index into columns
c1 c2
0 1 2 3 0 1 2 3
12 45 21 49 67 86 28 55
I tried combining stack and unstack but so far without success
Use unstack + to_frame + T:
df=pd.DataFrame({'c1':[12,45,21,49],'c2':[67,86,28,55]})
print (df.unstack().to_frame().T)
c1 c2
0 1 2 3 0 1 2 3
0 12 45 21 49 67 86 28 55
Or DataFrame + numpy.ravel + numpy.reshape with MultiIndex.from_product. The values must be flattened column by column (order='F') so that they line up with the MultiIndex:
mux = pd.MultiIndex.from_product([df.columns, df.index])
print (pd.DataFrame(df.values.ravel(order='F').reshape(1, -1), columns=mux))
c1 c2
0 1 2 3 0 1 2 3
0 12 45 21 49 67 86 28 55
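The order in which the values are flattened matters because MultiIndex.from_product([df.columns, df.index]) lists every index entry for c1 before moving on to c2. A short sketch of the difference between NumPy's default row-major flattening and column-major flattening:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c1": [12, 45, 21, 49], "c2": [67, 86, 28, 55]})

row_major = df.to_numpy().ravel()           # walks the array row by row
col_major = df.to_numpy().ravel(order="F")  # walks the array column by column

print(row_major)  # [12 67 45 86 21 28 49 55]
print(col_major)  # [12 45 21 49 67 86 28 55]

# Only the column-major order lines up with (column, index) labels.
mux = pd.MultiIndex.from_product([df.columns, df.index])
wide = pd.DataFrame(col_major.reshape(1, -1), columns=mux)
print(wide)
```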

pandas set one column equal to 1 but both df changes

I'm trying to set column D of df b to 1. However, when I run this code, it also changes column D of df a to 1. Why are the variables linked, and how do I change df b only?
import pandas as pd, os, numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
a=df
b=df
b['D']=1
output:
>>> a
A B C D
0 98 84 3 1
1 13 35 76 1
2 17 84 28 1
3 22 9 41 1
4 54 3 20 1
>>> b
A B C D
0 98 84 3 1
1 13 35 76 1
2 17 84 28 1
3 22 9 41 1
4 54 3 20 1
>>>
a, b and df are references to the same object. When you change b['D'], you are actually changing that column of the actual object. Instead, it looks like you want to copy the DataFrame:
import pandas as pd, os, numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
a=df.copy()
b=df.copy()
b['D']=1
which yields
b.head()
Out:
A B C D
0 63 52 92 1
1 98 35 43 1
2 24 87 70 1
3 38 4 7 1
4 71 30 25 1
a.head()
Out:
A B C D
0 63 52 92 80
1 98 35 43 78
2 24 87 70 26
3 38 4 7 48
4 71 30 25 61
There are also detailed responses here.
Don't use = when trying to copy a dataframe; it only binds another name to the same object.
Use pd.DataFrame.copy(yourdataframe) instead:
a = pd.DataFrame.copy(df)
b = pd.DataFrame.copy(df)
b['D'] = 1
This should solve your problem
You should use copy. Change
a=df
b=df
to
a=df.copy()
b=df.copy()
Check out this reference where this issue is discussed a bit more in depth. I also had this confusion when I started using Pandas.
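The difference between a second name and a real copy can be checked directly with the is operator. A small sketch (the variable names alias and clone are just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(5, 4)), columns=list("ABCD"))

alias = df          # a second name for the same object
clone = df.copy()   # an independent copy (deep by default)

print(alias is df)  # True
print(clone is df)  # False

original_d = df["D"].copy()
clone["D"] = 1
# Editing the copy leaves df untouched.
print(df["D"].equals(original_d))  # True
```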
