Pandas DataFrame: Conditionally select columns - python

I have a pandas data-frame as shown below...
C1 C2 C3 C4
0 -1 -3 3 0.75
1 10 20 30 -0.50
For each row, I want to sum only the values in the first two columns that are less than zero. For example, the series I would get for the above case would be as below...
CR
0 -4
1 0
I know how to apply other functions like the below...
df.iloc[:, :-2].abs().sum(axis = 1)
Is there a way using lambda functions?

It seems you need to select the first two columns by iloc and then use where with sum:
df = df.iloc[:,:2].where(df < 0).sum(axis=1)
print (df)
0 -4.0
1 0.0
dtype: float64
If you need a solution with selection by a callable:
df = df.iloc[:, lambda df: [0,1]].where(df < 0).sum(axis=1)
print (df)
0 -4.0
1 0.0
dtype: float64
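An equivalent sketch without where, assuming the original df: select the first two columns once and mask them with a boolean frame of the same shape, so non-negative values become NaN and are skipped by sum:
#mask the first two columns, then sum per row (NaNs are ignored)
sub = df.iloc[:, :2]
print (sub[sub < 0].sum(axis=1))
0 -4.0
1 0.0
dtype: float64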
A lambda function in Python is applicable here too. More examples of lambda in pandas:
#sample data
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
print (df)
A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
Get the difference (max - min) for each column with .apply(axis=0), which is the default, so the same as plain .apply():
#instead of the function f1 you can use a lambda, if the function is simple
print (df.apply(lambda x: x.max() - x.min()))
A 8
B 8
C 8
D 7
E 6
dtype: int64
def f1(x):
    #print (x)
    return x.max() - x.min()
print (df.apply(f1))
A 8
B 8
C 8
D 7
E 6
dtype: int64
Get the difference for each row with .apply(axis=1):
#instead of the function f2 you can use a lambda, if the function is simple
print (df.apply(lambda x: x.max() - x.min(), axis=1))
0 5
1 5
2 8
3 9
4 4
dtype: int64
def f2(x):
    #print (x)
    return x.max() - x.min()
print (df.apply(f2, axis=1))
0 5
1 5
2 8
3 9
4 4
dtype: int64
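For completeness, the same differences can usually be computed without apply by calling the vectorized max and min directly; a minimal sketch on the same df, which prints the same two Series as the apply versions above:
#per column, like the default apply
print (df.max() - df.min())
#per row, like apply(axis=1)
print (df.max(axis=1) - df.min(axis=1))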

Related

How to replace 0 values with mean based on groupby

I have a dataframe with two features: gps_height (numeric) and region (categorical).
The gps_height contains a lot of 0 values, which are missing values in this case. I want to fill the 0 values with the mean of the corresponding region.
My reasoning is as follows:
1. Drop the zero values and take the mean values of gps_height, grouped by region
df[df.gps_height !=0].groupby(['region']).mean()
But how do I replace the zero values in my dataframe with those mean values?
Sample data:
gps_height region
0 1390 Iringa
1 1400 Mara
2 0 Iringa
3 250 Iringa
...
Use:
df = pd.DataFrame({'region':list('aaabbbccc'),
                   'gps_height':[2,3,0,3,4,5,1,0,0]})
print (df)
region gps_height
0 a 2
1 a 3
2 a 0
3 b 3
4 b 4
5 b 5
6 c 1
7 c 0
8 c 0
Replace 0 with missing values, then replace the NaNs using fillna with per-group means from GroupBy.transform:
df['gps_height'] = df['gps_height'].replace(0, np.nan)
df['gps_height'] = df['gps_height'].fillna(df.groupby('region')['gps_height'].transform('mean'))
print (df)
region gps_height
0 a 2.0
1 a 3.0
2 a 2.5
3 b 3.0
4 b 4.0
5 b 5.0
6 c 1.0
7 c 1.0
8 c 1.0
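The same idea as a one-pass sketch, starting again from the original df: replace zeros with NaN once and reuse that intermediate Series for both the per-group means and the fill (np is assumed to be the usual numpy import):
#zeros become NaN, group means are computed without them, then NaNs are filled
s = df['gps_height'].replace(0, np.nan)
df['gps_height'] = s.fillna(s.groupby(df['region']).transform('mean'))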
Or filter out 0 values, aggregate means and map all 0 rows:
m = df['gps_height'] != 0
s = df[m].groupby('region')['gps_height'].mean()
df.loc[~m, 'gps_height'] = df['region'].map(s)
#alternative
#df['gps_height'] = np.where(~m, df['region'].map(s), df['gps_height'])
print (df)
region gps_height
0 a 2.0
1 a 3.0
2 a 2.5
3 b 3.0
4 b 4.0
5 b 5.0
6 c 1.0
7 c 1.0
8 c 1.0
I ended up facing the same problem that @ahbon raised: what if there is more than one column to group by? This was the closest question I found to my problem. After a serious struggle, I came to a solution.
As far as I know (there are pandas-specific functions to do similar things) it may not be an elegant/orthodox one, so I'd appreciate some feedback.
There it goes:
import pandas as pd
import random
random.seed(123)
df = pd.DataFrame({"A": list('a'*4+'b'*4+'c'*4+'d'*4),
                   "B": list('xy'*8),
                   "C": random.sample(range(17), 16)})
print(df)
A B C
0 a x 1
1 a y 8
2 a x 16
3 a y 12
4 b x 6
5 b y 4
6 b x 14
7 b y 0
8 c x 13
9 c y 5
10 c x 2
11 c y 9
12 d x 10
13 d y 11
14 d x 3
15 d y 15
First get the indices of the non-zero values, retrieve that data and get the mean by group.
idx = list(df[df["C"] != 0].index)
data_to_group = df.iloc[idx,]
grouped_data = pd.DataFrame(data_to_group.groupby(["A", "B"])["C"].mean())
And now the tricky part. Here is where I get the impression that it could be a more elegant solution:
Stack, unstack and reset index
Then merge with the subset of rows in df where C is 0; drop C from the first and keep C from the second
Finally, update df with this subset with no zeros in C.
grouped_data = grouped_data.stack().unstack().reset_index()
zero_rows = df[df.C == 0]
zero_rows_replaced = pd.merge(left=zero_rows, right=grouped_data,
                              how="left", on=["A", "B"],
                              suffixes=('_x', '')).drop('C_x', axis=1)
zero_rows_replaced = zero_rows_replaced.set_index(zero_rows.index.copy())
df.update(zero_rows_replaced)
print(df)
A B C
0 a x 1
1 a y 8
2 a x 16
3 a y 12
4 b x 6
5 b y 4
6 b x 14
7 b y 4
8 c x 13
9 c y 5
10 c x 2
11 c y 9
12 d x 10
13 d y 11
14 d x 3
15 d y 15
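For reference, the GroupBy.transform fill from the earlier answer also accepts several grouping keys, which may be a shorter route here; a sketch starting from the same df built above (before the update), assuming numpy is imported as np:
import numpy as np
#zeros become NaN, means are computed per (A, B) group, then NaNs are filled
s = df['C'].replace(0, np.nan)
df['C'] = s.fillna(s.groupby([df['A'], df['B']]).transform('mean'))
This fills the single zero at index 7 (group b, y) with the group mean 4, matching the result above.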

Is there an easy way to compute the intersection of two different indexes in a dataframe?

For example, if I have a DataFrame consisting of 5 rows (0-4) and 5 columns (A-E), I want to say, 0A * E3. Or more pseudo-like df[0,A] * df[3,E]?
I think you need to select the values by DataFrame.loc and then multiply:
a = df.loc[0,'A'] * df.loc[3,'E']
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
print (df)
A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
a = df.loc[0,'A'] * df.loc[3,'E']
print (a)
16
Btw, your pseudo code is very close to the real solution.
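If only single cells are ever needed, DataFrame.at is an alternative scalar-only accessor and is usually a bit faster than loc for this; a small sketch with the same frame:
a = df.at[0,'A'] * df.at[3,'E']
print (a)
16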

Pandas row-wise mapper

Does Pandas contain an easy method to apply a mapper to one row at a time?
For example:
import pandas as pd
df = pd.DataFrame(
    [[j + (3*i) for j in range(3)] for i in range(4)],
    columns=['a','b','c']
)
print(df)
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
And then apply some mapper (in pseudocode)
df_ret = df.rowmap(lambda d: d['a'] + d['c'])
print(df_ret)
0
0 2
1 8
2 14
3 20
Note, adding numbers really isn't the point here. The point is to have a row-wise mapper.
You can use apply with parameter axis=1:
df_ret = df.apply(lambda d: d['a'] + d['c'], axis=1)
print(df_ret)
0 2
1 8
2 14
3 20
dtype: int64
but it is faster to use vectorized solutions:
print (df.a + df.c)
0 2
1 8
2 14
3 20
dtype: int64
print (df.a.add(df.c))
0 2
1 8
2 14
3 20
dtype: int64
print (df[['a','c']].sum(axis=1))
0 2
1 8
2 14
3 20
dtype: int64
Fastest are the vectorized solutions, e.g. DataFrame.add (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.add.html), as they are internally optimized.
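If the rowmap spelling from the question is wanted, a thin wrapper around apply(axis=1) reproduces it; rowmap here is a hypothetical helper, not a pandas method:
#hypothetical helper: apply func to one row (passed as a Series) at a time
def rowmap(frame, func):
    return frame.apply(func, axis=1)

print (rowmap(df, lambda d: d['a'] + d['c']))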

Python Pandas: select column with the number of unique values greater than 10

In R, we can use sapply to extract columns with the number of unique values greater than 10 by:
X[, sapply(X, function(x) length(unique(x))) >=10]
How can we do the same thing in Python Pandas?
Also, how can we choose columns with missing proportion less than 10% like what we can do in R:
X[, sapply(X, function(x) sum(is.na(x))/length(x) ) < 0.1]
Thanks.
You can use nunique with apply, because nunique works only with a Series:
print (df.loc[:, df.apply(lambda x: x.nunique()) >= 10])
and for the second condition, isnull with mean:
print (df.loc[:, df.isnull().mean() < 0.1])
Sample:
df = pd.DataFrame({'A':[1,np.nan,3],
                   'B':[4,4,np.nan],
                   'C':[7,8,9],
                   'D':[3,3,5]})
print (df)
A B C D
0 1.0 4.0 7 3
1 NaN 4.0 8 3
2 3.0 NaN 9 5
print (df.loc[:, df.apply(lambda x: x.nunique()) >= 2])
A C D
0 1.0 7 3
1 NaN 8 3
2 3.0 9 5
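In newer pandas versions DataFrame.nunique exists directly, so the apply is not needed; a sketch with the same threshold of 2 used in the sample, which prints the same A, C, D selection as above:
print (df.loc[:, df.nunique() >= 2])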
print (df.isnull().sum())
A 1
B 1
C 0
D 0
dtype: int64
print (df.isnull().sum() / len(df.index))
A 0.333333
B 0.333333
C 0.000000
D 0.000000
dtype: float64
print (df.isnull().mean())
A 0.333333
B 0.333333
C 0.000000
D 0.000000
dtype: float64
print (df.loc[:, df.isnull().sum() / len(df.index) < 0.1])
C D
0 7 3
1 8 3
2 9 5
Or:
print (df.loc[:, df.isnull().mean() < 0.1])
C D
0 7 3
1 8 3
2 9 5
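Both filters can also be combined into one selection if needed; a small sketch on the same sample df, which keeps only C and D here:
mask = (df.apply(lambda x: x.nunique()) >= 2) & (df.isnull().mean() < 0.1)
print (df.loc[:, mask])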

Adding a column to pandas data frame fills it with NA

I have this pandas dataframe:
SourceDomain 1 2 3
0 www.theguardian.com profile.theguardian.com 1 Directed
1 www.theguardian.com membership.theguardian.com 2 Directed
2 www.theguardian.com subscribe.theguardian.com 3 Directed
3 www.theguardian.com www.google.co.uk 4 Directed
4 www.theguardian.com jobs.theguardian.com 5 Directed
I would like to add a new column which is a pandas series created like this:
Weights = Weights.value_counts()
However, when I try to add the new column using edgesFile[4] = Weights it fills it with NA instead of the values:
SourceDomain 1 2 3 4
0 www.theguardian.com profile.theguardian.com 1 Directed NaN
1 www.theguardian.com membership.theguardian.com 2 Directed NaN
2 www.theguardian.com subscribe.theguardian.com 3 Directed NaN
3 www.theguardian.com www.google.co.uk 4 Directed NaN
4 www.theguardian.com jobs.theguardian.com 5 Directed NaN
How can I add the new column keeping the values?
Thanks!
Dani
You are getting NaNs because the index of Weights does not match up with the index of edgesFile. If you want Pandas to ignore Weights.index and just paste the values in order then pass the underlying NumPy array instead:
edgesFile[4] = Weights.values
Here is an example which demonstrates the difference:
In [14]: df = pd.DataFrame(np.arange(4)*10, index=list('ABCD'))
In [15]: df
Out[15]:
0
A 0
B 10
C 20
D 30
In [16]: s = pd.Series(np.arange(4), index=list('CDEF'))
In [17]: s
Out[17]:
C 0
D 1
E 2
F 3
dtype: int64
Here we see Pandas aligning the index:
In [18]: df[4] = s
In [19]: df
Out[19]:
0 4
A 0 NaN
B 10 NaN
C 20 0
D 30 1
Here, Pandas simply pastes the values in s into the column:
In [20]: df[4] = s.values
In [21]: df
Out[21]:
0 4
A 0 0
B 10 1
C 20 2
D 30 3
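An equivalent alternative, as a sketch, is to rebuild the Series on the target index instead of dropping down to the NumPy array; the result is the same as above:
In [22]: df[4] = pd.Series(s.values, index=df.index)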
This is a small example for your question:
You can add a new column with a column name to an existing DataFrame:
>>> df = DataFrame([[1,2,3],[4,5,6]], columns = ['A', 'B', 'C'])
>>> df
A B C
0 1 2 3
1 4 5 6
>>> s = Series([7,8,9])
>>> s
0 7
1 8
2 9
dtype: int64
>>> df['D']=s
>>> df
A B C D
0 1 2 3 7
1 4 5 6 8
Or, you can make a DataFrame from the Series and then concat:
>>> df = DataFrame([[1,2,3],[4,5,6]])
>>> df
0 1 2
0 1 2 3
1 4 5 6
>>> s = DataFrame(Series([7,8])) # if you don't provide a column name, the default name will be 0
>>> s
0
0 7
1 8
>>> df = pd.concat([df,s], axis=1)
>>> df
0 1 2 0
0 1 2 3 7
1 4 5 6 8
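If a specific column name is wanted when concatenating, the Series can be renamed first; a small sketch, assuming the same starting df:
>>> df = DataFrame([[1,2,3],[4,5,6]])
>>> df = pd.concat([df, Series([7,8]).rename('D')], axis=1)
>>> df
0 1 2 D
0 1 2 3 7
1 4 5 6 8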
Hope this will help
