How do I replace the values of a dataframe on a condition - python

I have a pandas dataframe with more than 50 columns. All the data except the 1st column is float. I want to replace any value greater than 5.75 with 100. Can someone suggest a function to do this?
The replace function does not work here, because to_replace can only match values by equality, not by a greater-than comparison.

This can be done using np.where (with numpy imported as np):
df['ColumnName'] = np.where(df['ColumnName'] > 5.75, 100, df['ColumnName'])
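Since the question covers every column except the first, the same idea can be applied to a block of columns at once. Below is a minimal sketch, assuming a small made-up frame whose first column is text (the column names name, x and y are hypothetical). It uses DataFrame.mask, which replaces values where the condition is True and keeps the rest:

```python
import pandas as pd

# Hypothetical data: the first column is text, the others are floats.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "x": [1.0, 6.0, 5.75],
    "y": [7.5, 2.0, 5.76],
})

float_cols = df.columns[1:]  # every column except the first

# mask() replaces values where the condition is True and keeps the others.
df[float_cols] = df[float_cols].mask(df[float_cols] > 5.75, 100)
print(df)
```

Restricting the comparison to the float columns avoids comparing the text column with a number.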

You can make a custom function and pass it to apply:
import pandas as pd
import random
df = pd.DataFrame({'col_name': [random.randint(0,10) for x in range(100)]})
def f(x):
    # Replace values greater than 5.75 with 100, keep everything else
    if x > 5.75:
        return 100
    return x
df['modified'] = df['col_name'].apply(f)
print(df.head())
col_name modified
0 2 2
1 5 5
2 7 100
3 1 1
4 9 100

If you have a dataframe:
import pandas as pd
import random
df = pd.DataFrame({'first_column': [random.uniform(5,6) for x in range(10)]})
print(df)
Gives me:
first_column
0 5.620439
1 5.640604
2 5.286608
3 5.642898
4 5.742910
5 5.096862
6 5.360492
7 5.923234
8 5.489964
9 5.127154
Then use a boolean mask to replace every value greater than 5.75:
df[df > 5.75] = 100
print(df)
Gives me:
first_column
0 5.620439
1 5.640604
2 5.286608
3 5.642898
4 5.742910
5 5.096862
6 5.360492
7 100.000000
8 5.489964
9 5.127154

import numpy as np
import pandas as pd
#Create df
np.random.seed(0)
df = pd.DataFrame(2*np.random.randn(100,50))
for col_name in df.columns[1:]:  # Skip first column
    # A single .loc call avoids chained assignment
    df.loc[df[col_name] > 5.75, col_name] = 100

np.where(df.value > 5.75, 100, df.value)

Related

Python Dataframe find minimum among multiple set of columns

I have a data frame with four columns. I want to find the minimum among the first two columns and the minimum among the last two columns for each row.
Code:
np.random.seed(0)
xdf = pd.DataFrame({'a':np.random.rand(1,10)[0]*10,'b':np.random.rand(1,10)[0]*10,'c':np.random.rand(1,10)[0]*10,'d':np.random.rand(1,10)[0]*10,},index=np.arange(0,10,1))
xdf['ab_min'] = xdf[['a','b']].min(axis=1)
xdf['cd_min'] = xdf[['c','d']].min(axis=1)
xdf['minimum'] = xdf['ab_min'].list()+xdf['cd_min'].list()
Expected answer:
xdf['minimum']
0 [ab_min,cd_min]
1 [ab_min,cd_min]
2 [ab_min,cd_min]
3 [ab_min,cd_min]
Present answer:
AttributeError: 'Series' object has no attribute 'list'
Select the columns ab_min and cd_min, use to_numpy to convert them to a NumPy array, then call tolist and assign the resulting list of pairs to the minimum column:
xdf['minimum'] = xdf[['ab_min', 'cd_min']].to_numpy().tolist()
>>> xdf['minimum']
0 [3.23307959607905, 1.9836323494587338]
1 [6.189440334168731, 1.0578078219990983]
2 [3.1194570407645217, 1.2816570607783184]
3 [1.9170068676155894, 7.158027504597937]
4 [0.6244579166416464, 8.568849995324166]
5 [4.108986697339397, 0.6201685780268684]
6 [4.170639127277155, 2.3385281968695693]
7 [2.0831140755567814, 5.94063873401418]
8 [0.4887113296319978, 6.380570614449363]
9 [2.844815261473105, 0.9146457613970793]
Name: minimum, dtype: object
Try this:
import pandas as pd
import numpy as np
xdf = pd.DataFrame({'a':np.random.rand(1,10)[0]*10,'b':np.random.rand(1,10)[0]*10,'c':np.random.rand(1,10)[0]*10,'d':np.random.rand(1,10)[0]*10,},index=np.arange(0,10,1))
print(xdf)
ab = xdf['ab_min'] = xdf[['a','b']].min(axis=1)
cd = xdf['cd_min'] = xdf[['c','d']].min(axis=1)
blah = pd.concat([ab, cd], axis=1)
print(blah)
results:
You can use .apply with a lambda function along axis=1:
xdf['minimum'] = xdf.apply(lambda x: [x[['a','b']].min(),x[['c','d']].min()], axis=1)
Result:
>>> xdf
a b c d minimum
0 0.662634 4.166338 8.864823 9.004818 [0.6626341544146663, 8.864822751494284]
1 6.854054 6.163417 6.510728 0.049498 [6.163416966676091, 0.04949754019059838]
2 6.389760 4.462319 2.435369 3.732534 [4.462318678134215, 2.4353686460846893]
3 4.628735 7.571098 1.900726 9.046384 [4.628735362058981, 1.9007255361271058]
4 3.203285 4.364302 2.473973 2.911911 [3.203285015796596, 2.4739732602476727]
5 5.357440 3.166420 9.908758 0.910704 [3.166420385020304, 0.91070444348338]
6 8.120486 6.395869 0.970977 5.278279 [6.395868901095546, 0.9709769503958143]
7 1.574765 7.184971 3.835641 4.495135 [1.574765093192545, 3.835640598199231]
8 8.688497 0.069061 0.771772 8.971878 [0.06906065557899743, 0.7717717844423222]
9 5.455920 2.630342 1.966357 7.374366 [2.6303421168291843, 1.966357159086991]
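To make the two approaches above easier to compare, here is a minimal sketch on a small fixed frame (the values are made up for illustration, not taken from the question). It builds the pairs once with to_numpy().tolist() and once with apply, and both should give the same list of [ab_min, cd_min] pairs:

```python
import pandas as pd

# Made-up values so the result is reproducible.
xdf = pd.DataFrame({
    "a": [3.0, 1.0, 7.0],
    "b": [2.0, 5.0, 7.5],
    "c": [9.0, 4.0, 0.5],
    "d": [8.0, 6.0, 0.1],
})

ab_min = xdf[["a", "b"]].min(axis=1)
cd_min = xdf[["c", "d"]].min(axis=1)

# Vectorized: compute both minima column-wise, then turn the rows into lists.
via_numpy = pd.concat([ab_min, cd_min], axis=1).to_numpy().tolist()

# Row-wise: build the pair for each row with apply.
via_apply = xdf.apply(
    lambda row: [row[["a", "b"]].min(), row[["c", "d"]].min()], axis=1
).tolist()

xdf["minimum"] = via_numpy
print(xdf["minimum"])
```

The vectorized version avoids calling a Python function per row, which usually matters on larger frames.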

create a new data frame from existing data frame based on condition

I have a data frame df
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[0,1,1,0,1,0], [1,0,1,1,0,0], [1,1,0,0,0,1],[1,0,1,0,1,1],
[0,0,1,0,0,1]]))
df
Now, from data frame df, I would like to create a new data frame based on a condition.
Condition: if a column contains three or more '1' values, then the corresponding value in the new data frame is '1', otherwise '0'.
Expected output of the new data frame:
1 0 1 0 0 1
You can also get it without apply. Sum each column (axis=0) and build a boolean with gt(2):
res = df.sum(axis=0).gt(2).astype(int)
print(res)
0 1
1 0
2 1
3 0
4 0
5 1
dtype: int32
As David pointed out, the result of the above is a series. If you require a dataframe, you can chain to_frame() at the end of it.
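A minimal sketch of that chaining, using the data from the question. The extra .T is an assumption about the desired layout: it turns the single column into a single row so that it matches the expected output shown in the question.

```python
import pandas as pd

df = pd.DataFrame([
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 1],
])

# Column sums greater than 2 become 1, everything else 0.
res = df.sum(axis=0).gt(2).astype(int).to_frame().T
print(res)
```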
You could do the following:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[0,1,1,0,1,0], [1,0,1,1,0,0], [1,1,0,0,0,1],[1,0,1,0,1,1],
[0,0,1,0,0,1]]))
df_res = pd.DataFrame(df.apply(lambda c: 1 if np.sum(c) > 2 else 0))
In [6]: df_res
Out[6]:
0
0 1
1 0
2 1
3 0
4 0
5 1
Instead of np.sum(c) you can also do c.sum()
And if you want it transposed just do the following instead:
df_res = pd.DataFrame(df.apply(lambda c: 1 if c.sum() > 2 else 0)).T

Python Pandas correlation one column vs all

I'm trying to get the correlation between a single column and the rest of the numerical columns of the dataframe, but I'm stuck.
I'm trying with this:
corr = IM['imdb_score'].corr(IM)
But I get the error
operands could not be broadcast together with shapes
which I assume is because I'm trying to find a correlation between a vector (my imdb_score column) with the dataframe of several columns.
How can this be fixed?
The most efficient method is to use corrwith.
Example:
df.corrwith(df['A'])
Setup of example data:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(5, 5)), columns=list('ABCDE'))
# A B C D E
# 0 7 2 0 0 0
# 1 4 4 1 7 2
# 2 6 2 0 6 6
# 3 9 8 0 2 1
# 4 6 0 9 7 7
output:
A 1.000000
B 0.526317
C -0.209734
D -0.720400
E -0.326986
dtype: float64
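Applied to a frame shaped like the one in the question, this might look like the sketch below. Only imdb_score comes from the question; the other column names and all values are hypothetical. The non-numeric column is filtered out first, and the target column is dropped so that it is not correlated with itself:

```python
import pandas as pd

# Hypothetical stand-in for the IM frame from the question.
IM = pd.DataFrame({
    "imdb_score": [1.0, 2.0, 3.0, 4.0],
    "budget": [2.0, 4.0, 6.0, 8.0],
    "duration": [4.0, 3.0, 2.0, 1.0],
    "title": ["a", "b", "c", "d"],
})

numeric = IM.select_dtypes(include="number")  # keep numeric columns only
corr = numeric.drop(columns="imdb_score").corrwith(numeric["imdb_score"])
print(corr)
```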
I think you can just use .corr, which returns the correlations between all columns, and then select just the column you are interested in.
So, something like
IM.corr()['imdb_score']
should work.
Rather than calculating all correlations and keeping the ones of interest, it can be computationally more efficient to compute the subset of interesting correlations:
import pandas as pd
df = pd.DataFrame()
df['a'] = range(10)
df['b'] = range(10)
df['c'] = range(10)
pd.DataFrame([[c, df['a'].corr(df[c])] for c in df.columns if c!='a'], columns=['var', 'corr'])

Dropping every row with len >2 Pandas python

Suppose I have a dataframe
. Values
0 25
1 897
2 48
3 28
4 214
5 25
I am trying to drop all rows with len > 2 with the following code but nothing happens when I run it.
import pandas as pd
df = pd.read_csv('File.csv')
for index in df.index:
    if len(df.loc[index, 'Sevens']) > 2:
        df.drop([index])
    else:
        pass
Use Series.str.len with boolean indexing:
df1 = df[df['Values'].str.len() <= 2]
If the values are numbers, convert them to strings first:
df1 = df[df['Values'].astype(str).str.len() <= 2]
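The loop in the question appears to do nothing because DataFrame.drop returns a new frame instead of modifying df in place, and that result is never assigned. A minimal sketch, assuming the values are integers and the column is called Values as in the table shown in the question:

```python
import pandas as pd

df = pd.DataFrame({"Values": [25, 897, 48, 28, 214, 25]})

# drop() returns a new frame; df itself is left untouched.
dropped = df.drop([1])
print(len(df), len(dropped))

# Boolean indexing keeps only the rows whose value has at most two digits.
df1 = df[df["Values"].astype(str).str.len() <= 2]
print(df1)
```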

how to append/insert an item at the beginning of a series?

Imagine I have a series that looks like this:
Out[64]:
2 0
3 1
80 1
83 1
84 2
85 2
How can I insert an item at the very beginning of this series? The native pandas.Series.append function only appends at the end.
Thanks a lot.
There is a pandas.concat function...
import pandas as pd
a = pd.Series([2,3,4])
pd.concat([pd.Series([1]), a])
See the Merge, Join, and Concatenate documentation.
Using concat, or append, the resulting series will have duplicate indices:
for concat():
import pandas as pd
a = pd.Series([2,3,4])
pd.concat([pd.Series([1]), a])
Out[143]:
0 1
0 2
1 3
2 4
and for append() (Series.append was removed in pandas 2.0, so this only runs on older versions):
import pandas as pd
a = pd.Series([2,3,4])
a.append(pd.Series([1]))
Out[149]:
0 2
1 3
2 4
0 1
This could be a problem in the future, since a[0] (if you assign the result to a) will return two values in either case.
My solutions in this case are:
import pandas as pd
a = pd.Series([2,3,4])
b = [1]
b[1:] = a
pd.Series(b)
Out[199]:
0 1
1 2
2 3
3 4
or, by reindexing with concat():
import pandas as pd
a = pd.Series([2,3,4])
a.index = a.index + 1
pd.concat([pd.Series([1]), a])
Out[208]:
0 1
1 2
2 3
3 4
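Another way to avoid the duplicate index labels is to let concat renumber the result with ignore_index=True. This is a minimal sketch, and it assumes the original labels do not need to be preserved:

```python
import pandas as pd

a = pd.Series([2, 3, 4])

# ignore_index=True discards the old labels and assigns 0..n-1.
result = pd.concat([pd.Series([1]), a], ignore_index=True)
print(result)
```

If the existing labels carry meaning (as with 2, 3, 80, ... in the question), shifting or choosing the new label explicitly is the safer route.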
In case you need to prepend a single value from a different Series b, say its last value, this is what works for me:
import pandas as pd
a = pd.Series([2, 3, 4])
b = pd.Series([0, 1])
pd.concat([b[-1:], a])
Similarly, you can use append with a list or tuple of series (this needs pandas 0.13 or later; Series.append was removed in pandas 2.0, so use pd.concat there):
import pandas as pd
a = pd.Series([2,3,4])
pd.Series([1]).append([a])
