Python Dataframe - Get max value between specific number vs. column value - python

Given the df below, I want to add a column 'C' that holds the larger of the fixed value 15 and column 'A' for the rows where B == 't'; the other rows should keep the value of 'A'.
testdf = pd.DataFrame({"A":[20, 16, 7, 3, 8],"B":['t','t','t','t','f']})
testdf
A B
0 20 t
1 16 t
2 7 t
3 3 t
4 8 f
I tried this:
testdf.loc[testdf['B']=='t', 'C'] = max(15,(testdf.loc[testdf['B']=='t','A']))
And desired output is:
A B C
0 20 t 20
1 16 t 16
2 7 t 15
3 3 t 15
4 8 f 8
Could you help me to get the output? Thank you!

Use np.where with clip:
testdf['C'] = np.where(testdf['B'].eq('t'),
                       testdf['A'].clip(lower=15), testdf['A'])
Or similarly with series.where:
testdf['C'] = (testdf['A'].clip(15)
.where(testdf['B'].eq('t'), testdf['A'])
)
output:
A B C
0 20 t 20
1 16 t 16
2 7 t 15
3 3 t 15
4 8 f 8

You could also use the update method:
testdf['C'] = testdf['A']
A B C
0 20 t 20
1 16 t 16
2 7 t 7
3 3 t 3
4 8 f 8
values = testdf.A[testdf.B.eq('t')].clip(15)
values
Out[16]:
0 20
1 16
2 15
3 15
Name: A, dtype: int64
testdf.update(values.rename('C'))
A B C
0 20 t 20.0
1 16 t 16.0
2 7 t 15.0
3 3 t 15.0
4 8 f 8.0

To apply any formula to the individual values of a dataframe column you can use
df['column'] = df['column'].apply(lambda x: anyFunc(x))
Here x receives the values of the column one at a time and passes each to the function, which can transform it and return the result.
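For the question above, the condition also depends on column 'B', so a row-wise apply is one way to spell it out. This is a minimal sketch, assuming the testdf from the question; it is usually slower than np.where or clip on large frames because the lambda runs once per row.

```python
import pandas as pd

testdf = pd.DataFrame({"A": [20, 16, 7, 3, 8],
                       "B": ['t', 't', 't', 't', 'f']})

# For rows where B == 't' take the larger of 15 and A; otherwise keep A.
testdf['C'] = testdf.apply(
    lambda row: max(15, row['A']) if row['B'] == 't' else row['A'],
    axis=1,
)
print(testdf)
```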

Related

How to randomly choose a string from a list and to iterate it over dataframe based on condition?

Let's say I have the following df -
data={'Location':[1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4]}
df = pd.DataFrame(data=data)
df
Location
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 2
8 2
9 3
10 3
11 3
12 3
13 3
14 3
15 4
16 4
17 4
In addition, I have the following dict:
Unlock={
1:"A",
2:"B",
3:"C",
4:"D",
5:"E",
6:"F",
7:"G",
8:"H",
9:"I",
10:"J"
}
I'd like to create another column that will randomly select a string from the 'Unlock' dict based on the condition that Location<=Unlock. So for example - for Location 2 some rows will get 'A' and some rows will get 'B'.
I've tried to do the following but with no luck (I'm getting an error) -
df['Name']=np.select(df['Location']<=Unlock,np.random.choice(Unlock,size=len(df))
Thanks in advance for your help!
You can convert your dictionary values to a list, and randomly select the values of a subset of this list: only up to Location number of elements.
With Python versions >= 3.7, dict maintains insertion order. For lower versions - see below.
lst = list(Unlock.values())
df['Name'] = df['Location'].transform(lambda loc: np.random.choice(lst[:loc]))
Example output:
Location Name
0 1 A
1 1 A
2 1 A
3 2 B
4 2 B
5 2 B
6 2 B
7 2 A
8 2 B
9 3 B
10 3 B
11 3 C
12 3 C
13 3 C
14 3 B
15 4 A
16 4 C
17 4 D
If you are using an older version of Python, you can build a list of the dictionary values, sorted by key:
lst = [value for key, value in sorted(Unlock.items())]
For a vectorized method, multiply by a random value in (0, 1] and take the ceiling, then map the result through your dictionary.
This gives an equiprobable integer between 1 and the current Location value (inclusive):
import numpy as np
df['random'] = (np.ceil(df['Location'].mul(1-np.random.random(size=len(df))))
.astype(int).map(Unlock)
)
output (reproducible with np.random.seed(0)):
Location random
0 1 A
1 1 A
2 1 A
3 2 B
4 2 A
5 2 B
6 2 A
7 2 B
8 2 B
9 3 B
10 3 C
11 3 B
12 3 B
13 3 C
14 3 A
15 4 A
16 4 A
17 4 D
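The same idea can also be sketched with NumPy's Generator API, which can draw one integer per row with a per-row upper bound. This is an alternative under stated assumptions rather than the method from the answers above: the seed value and the intermediate name keys are illustrative only.

```python
import numpy as np
import pandas as pd

Unlock = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E",
          6: "F", 7: "G", 8: "H", 9: "I", 10: "J"}
df = pd.DataFrame({'Location': [1, 1, 1, 2, 2, 2, 2, 2, 2,
                                3, 3, 3, 3, 3, 3, 4, 4, 4]})

rng = np.random.default_rng(0)  # seed chosen only for repeatability

# Draw one integer per row, uniformly from 1..Location (endpoint included).
keys = rng.integers(1, df['Location'].to_numpy(), endpoint=True)

# Translate the drawn keys into the strings stored in the dict.
df['Name'] = pd.Series(keys, index=df.index).map(Unlock)
print(df)
```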

Python dataframe rank each column based on row values

I have a data frame. I want to rank each column based on its row value
Ex:
xdf = pd.DataFrame({'A':[10,20,30],'B':[5,30,20],'C':[15,3,8]})
xdf =
A B C
0 10 5 15
1 20 30 3
2 30 20 8
Expected result:
xdf =
A B C Rk_1 Rk_2 Rk_3
0 10 5 15 C A B
1 20 30 3 B A C
2 30 20 8 A B C
OR
xdf =
A B C A_Rk B_Rk C_Rk
0 10 5 15 2 3 1
1 20 30 3 2 1 3
2 30 20 8 1 2 3
Why I need this:
I want to track the trend of each column and how it is changing. I would like to show this by the plot. Maybe a bar plot showing how many times A got Rank1, 2, 3, etc.
My approach:
xdf[['Rk_1','Rk_2','Rk_3']] = ""
for i in range(len(xdf)):
xdf.loc[i,['Rk_1','Rk_2','Rk_3']] = dict(sorted(dict(xdf[['A','B','C']].loc[i]).items(),reverse=True,key=lambda item:item[1])).keys()
Present output:
A B C Rk_1 Rk_2 Rk_3
0 10 5 15 C A B
1 20 30 3 B A C
2 30 20 8 A B C
I am iterating through each row, converting each row, column into a dictionary, sorting the values, and then extracting the keys (columns). Is there a better approach? My actual data frame has 10000 rows, 12 columns to be ranked. I just executed and it took around 2 minutes.
You should be able to get your desired dataframe by using:
ranked = xdf.join(xdf.rank(ascending=False, method='first', axis=1), rsuffix='_rank')
This'll give you:
A B C A_rank B_rank C_rank
0 10 5 15 2.0 3.0 1.0
1 20 30 3 2.0 1.0 3.0
2 30 20 8 1.0 2.0 3.0
Then do whatever you need to do plotting wise.
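Both forms requested in the question, plus the per-rank counts needed for a bar plot, can be sketched without a Python-level loop. The names value_cols, labels, ranks and counts are illustrative; ties are broken by column order here, which is an assumption about what is wanted.

```python
import numpy as np
import pandas as pd

xdf = pd.DataFrame({'A': [10, 20, 30], 'B': [5, 30, 20], 'C': [15, 3, 8]})
value_cols = ['A', 'B', 'C']

# Form 1: column names ordered from largest to smallest value in each row.
order = np.argsort(-xdf[value_cols].to_numpy(), axis=1, kind='stable')
labels = pd.DataFrame(np.array(value_cols)[order],
                      columns=['Rk_1', 'Rk_2', 'Rk_3'], index=xdf.index)

# Form 2: numeric rank of each column within its row (1 = largest).
ranks = xdf[value_cols].rank(axis=1, ascending=False,
                             method='first').astype(int)

# How often each column received each rank (rows = rank, columns = A/B/C).
counts = (ranks.apply(pd.Series.value_counts)
               .fillna(0).astype(int).sort_index())
print(labels)
print(counts)
```

If matplotlib is installed, something like counts.T.plot.bar() should draw one group of bars per column, though the exact styling is left open here.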

How to replace values in selected rows columns with an array in dataframe?

I have a df
df = pd.DataFrame(data={'A': [1,2,3,4,5,6,7,8],
'B': [10,20,30,40,50,60,70,80]})
A B
0 1 10
1 2 20
2 3 30
3 4 40
4 5 50
5 6 60
6 7 70
7 8 80
from which I select a few rows.
I also have a dictionary containing values that should be inserted into column B
wherever a key matches the value in column A of df:
my_dict = {2: 39622884,
4: 82709546,
5: 28166511,
7: 89465652}
When I use the following assignment
df.loc[df['A'].isin(my_dict.keys())]['B'] = list(my_dict.values())
I get the error:
ValueError: Length of values does not match length of index
The desirable output is
A B
0 1 10
1 2 39622884
2 3 30
3 4 82709546
4 5 28166511
5 6 60
6 7 89465652
7 8 80
What is the correct way to implement this procedure?
You can make do with map and fillna:
df['B'] = df['A'].map(my_dict).fillna(df['B'])
Output:
A B
0 1 10.0
1 2 39622884.0
2 3 30.0
3 4 82709546.0
4 5 28166511.0
5 6 60.0
6 7 89465652.0
7 8 80.0
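Note that map leaves NaN for the unmatched rows, which is why column B comes back as float. If integers are wanted, a cast after fillna is one option. This is a small sketch, assuming no missing values remain in B after the fill:

```python
import pandas as pd

df = pd.DataFrame(data={'A': [1, 2, 3, 4, 5, 6, 7, 8],
                        'B': [10, 20, 30, 40, 50, 60, 70, 80]})
my_dict = {2: 39622884,
           4: 82709546,
           5: 28166511,
           7: 89465652}

# Look up each A in the dict, fall back to the old B, then restore ints.
df['B'] = df['A'].map(my_dict).fillna(df['B']).astype('int64')
print(df)
```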

How to impute missing values based on other variables

I have a dataframe like below:
df = pd.DataFrame({'one' : pd.Series(['a', 'b', 'c', 'd','aa','bb',np.nan,'b','c',np.nan, np.nan] ),
'two' : pd.Series([10, 20, 30, 40,50,60,10,20,30,40,50])} )
The first column holds the variable names and the second column holds their values. Each variable's value is constant and never changes.
For example, the value of 'a' is 10: whenever 'a' appears, the corresponding value is 10.
Some entries are missing in the first column, e.g. the NaN next to 10 should be 'a' and the NaN next to 40 should be 'd'. The real dataframe contains 200 variables.
The values are not continuous; they are discrete and unsortable.
How can I impute the missing values in this case?
Expected output should be :
Please help me on this.
Regards,
Venkat.
I think in general it would be better to group and fill. We use DataFrame.groupby:
df.groupby('two').apply(lambda x: x.ffill().bfill())
It can be done without using groupby but you have to sort by both columns:
df.sort_values(['two','one']).ffill().sort_index()
Below I show how the method proposed in another answer may fail, using this example:
df=pd.DataFrame({'one':['a',np.nan,'c','d',np.nan,'c','b','b',np.nan,'a'],'two':[10,20,30,40,10,30,20,20,30,10]})
print(df)
one two
0 a 10
1 NaN 20
2 c 30
3 d 40
4 NaN 10
5 c 30
6 b 20
7 b 20
8 NaN 30
9 a 10
df.sort_values(['two']).ffill().sort_index()
one two
0 a 10
1 a 20
2 c 30
3 d 40
4 a 10
5 c 30
6 b 20
7 b 20
8 c 30
9 a 10
As you can see, the method proposed in the other answer fails here (see row 1). This happens because a NaN can be the first entry for a given value of column 'two', in which case it is filled with the last value of the previous group.
This doesn't happen if we group first:
df.groupby('two').apply(lambda x: x.ffill().bfill())
one two
0 a 10
1 b 20
2 c 30
3 d 40
4 a 10
5 c 30
6 b 20
7 b 20
8 c 30
9 a 10
As I said, we can use DataFrame.sort_values, but we need to sort by both columns. I recommend this method.
df.sort_values(['two','one']).ffill().sort_index()
one two
0 a 10
1 b 20
2 c 30
3 d 40
4 a 10
5 c 30
6 b 20
7 b 20
8 c 30
9 a 10
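Depending on the pandas version, groupby(...).apply may add the group key to the index or warn about operating on the grouping column. A variant that fills only column 'one' with transform avoids both. This is a sketch on the example frame from this answer, assuming every value of 'two' has at least one non-missing 'one':

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': ['a', np.nan, 'c', 'd', np.nan,
                           'c', 'b', 'b', np.nan, 'a'],
                   'two': [10, 20, 30, 40, 10, 30, 20, 20, 30, 10]})

# Fill each NaN from the other rows that share the same value of 'two'.
# transform keeps the original index, so the result aligns with df directly.
df['one'] = df.groupby('two')['one'].transform(lambda s: s.ffill().bfill())
print(df)
```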
Here it is:
df.ffill(inplace=True)
output:
one two
0 a 10
1 b 20
2 c 30
3 d 40
4 aa 50
5 bb 60
6 a 10
7 b 20
8 c 30
9 d 40
10 aa 50
Try this:
df = df.sort_values(['two']).ffill().sort_index()
Which will give you
one two
0 a 10
1 b 20
2 c 30
3 d 40
4 aa 50
5 bb 60
6 a 10
7 b 20
8 c 30
9 d 40
10 aa 50

Python Pandas: Get 2 set of random samples per group

I have a pandas DataFrame say this:
user value
0 a 1
1 a 2
2 a 3
3 a 4
4 a 5
5 b 6
6 b 7
7 b 8
8 b 9
9 b 10
10 c 11
11 c 12
12 c 13
13 c 14
14 c 15
Now I want to group by user, and create two mutually exclusive random samples out of it e.g
Set1 with 1 samples per group:
user value
3 a 4
9 b 10
13 c 14
Set2 with 2 samples per group:
user value
0 a 1
1 a 2
5 b 6
6 b 7
10 c 11
11 c 12
So far I've tried this:
u = np.array(['a','b','c'])
u = np.repeat(u,5)
df = pd.DataFrame({'user':u,'value':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
set1 = df.groupby(['user']).tail(1)
set2 = df.groupby(['user']).head(2)
But these are not random samples, and I would like them to be mutually exclusive. Any ideas?
PS. Each group always has at least 3 elements
You can randomly select 3 records for each user:
a = df.groupby("user")["value"].apply(lambda x: x.sample(3))
a
Out[27]:
user
a 3 4
0 1
2 3
b 5 6
7 8
6 7
c 14 15
10 11
13 14
dtype: int64
And assign first one to the first set, the remaining two to the second set:
a.groupby(level=0).head(1)
Out[28]:
user
a 3 4
b 5 6
c 14 15
dtype: int64
a.groupby(level=0).tail(2)
Out[29]:
user
a 0 1
2 3
b 7 8
6 7
c 10 11
13 14
dtype: int64
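The same two-step idea can be sketched on whole rows with GroupBy.sample, which is available in pandas 1.1 and later. The random_state value is only there to make the run repeatable, and the name picked is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user': np.repeat(['a', 'b', 'c'], 5),
                   'value': range(1, 16)})

# Draw 3 random rows per user without replacement.
picked = df.groupby('user').sample(n=3, random_state=0)

# First drawn row per user goes to set1, the remaining two go to set2.
set1 = picked.groupby('user').head(1)
set2 = picked.drop(set1.index)
print(set1)
print(set2)
```

Because set2 is built by dropping set1's index labels from the same draw, the two sets cannot share a row.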
This may be a bit naive, but all I did was reindex the DataFrame with a random permutation of its row positions and reset the index. After that I take the head and tail as you did in your original code, which seems to work. This could probably be made into a function:
a = np.arange(len(df))
np.random.shuffle(a)
df = df.reindex(a).reset_index()
set1 = df.groupby(['user']).tail(1)
>>>
index user value
12 9 b 10
13 10 c 11
14 1 a 2
set2 = df.groupby(['user']).head(2)
>>>
index user value
0 6 b 7
1 2 a 3
2 5 b 6
3 13 c 14
4 3 a 4
6 12 c 13
Hope this helps.
There is likely a better solution but what about just randomizing your data before grouping and then taking the tail and head per group? You could take a set of your indices, take a random permutation of it and use that to create a new scrambled dataframe, then do your current procedure.
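The shuffle-then-split idea can be sketched with DataFrame.sample(frac=1), which returns all rows in random order. The seed is illustrative, and the sets are disjoint only under the question's assumption that every group has at least 3 rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user': np.repeat(['a', 'b', 'c'], 5),
                   'value': range(1, 16)})

# Shuffle all rows; the original index labels are kept.
shuffled = df.sample(frac=1, random_state=0)

# Within each user, the last shuffled row and the first two shuffled rows
# are different rows as long as the group has 3 or more members.
set1 = shuffled.groupby('user').tail(1)
set2 = shuffled.groupby('user').head(2)
print(set1)
print(set2)
```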
