Pandas row-wise mapper - python

Does Pandas contain an easy method to apply a mapper to each row, one at a time?
For example:
import pandas as pd
df = pd.DataFrame(
    [[j + (3*i) for j in range(3)] for i in range(4)],
    columns=['a','b','c']
)
print(df)
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
And then apply some mapper (in pseudocode)
df_ret = df.rowmap(lambda d: d['a'] + d['c'])
print(df_ret)
0
0 2
1 8
2 14
3 20
Note, adding numbers really isn't the point here. The point is to have a row-wise mapper.

You can use apply with parameter axis=1:
df_ret = df.apply(lambda d: d['a'] + d['c'], axis=1)
print(df_ret)
0 2
1 8
2 14
3 20
dtype: int64
but it is faster to use vectorized solutions:
print (df.a + df.c)
0 2
1 8
2 14
3 20
print (df.a.add(df.c))
0 2
1 8
2 14
3 20
dtype: int64
print (df[['a','c']].sum(axis=1))
0 2
1 8
2 14
3 20
dtype: int64

The fastest solution is DataFrame.add (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.add.html), as it is internally optimized.
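To see the difference in practice, here is a rough timing sketch (not from the original answer; the exact numbers depend on your machine and data size):
import numpy as np
import pandas as pd
from timeit import timeit

# a larger frame so the difference is visible
big = pd.DataFrame(np.arange(300000).reshape(-1, 3), columns=['a', 'b', 'c'])

# row-wise apply calls the Python lambda once per row
t_apply = timeit(lambda: big.apply(lambda d: d['a'] + d['c'], axis=1), number=3)
# the vectorized add is a single optimized operation over whole columns
t_add = timeit(lambda: big.a.add(big.c), number=3)
print(f"apply(axis=1): {t_apply:.3f}s, vectorized add: {t_add:.3f}s")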

Related

Counting repeating words with numpy and pandas Python

I want to write code that outputs, for each distinct value in a, the number of times it is repeated. Then I want to put the result into a pandas DataFrame and print it. The sums code below does not work; how can I make it work and get the expected output?
import numpy as np
import pandas as pd
a = np.array([12,12,12,3,43,43,43,22,1,3,3,43])
uniques = np.unique(a)
sums = np.sum(uniques[:-1]==a[:-1])
Expected Output:
Value Repetition Count
1 1
3 3
12 3
22 1
43 4
Define a dataframe df based on the array a. Then, use .groupby() + .size() to get the size/count of unique values, as follows:
a = np.array([12,12,12,3,43,43,43,22,1,3,3,43])
df = pd.DataFrame({'Value': a})
df.groupby('Value').size().reset_index(name='Repetition Count')
Result:
Value Repetition Count
0 1 1
1 3 3
2 12 3
3 22 1
4 43 4
Edit
If you also want the percentages of the counts, you can use:
(df.groupby('Value', as_index=False)
   .agg(**{'Repetition Count': ('Value', 'size'),
           'Percent': ('Value', lambda x: round(x.size/len(a) *100, 2))})
)
Result:
Value Repetition Count Percent
0 1 1 8.33
1 3 3 25.00
2 12 3 25.00
3 22 1 8.33
4 43 4 33.33
Or use .value_counts() with normalize=True:
pd.Series(a).value_counts(normalize=True).mul(100)
Result:
43 33.333333
12 25.000000
3 25.000000
22 8.333333
1 8.333333
dtype: float64
You can use groupby:
>>> pd.Series(a).groupby(a).count()
1 1
3 3
12 3
22 1
43 4
dtype: int64
Or value_counts():
>>> pd.Series(a).value_counts().sort_index()
1 1
3 3
12 3
22 1
43 4
dtype: int64
It is easiest if you make a pandas DataFrame from the np.array and then use value_counts().
df = pd.DataFrame(data=a, columns=['col1'])
print(df.col1.value_counts())
43 4
12 3
3 3
22 1
1 1
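Since the question started from np.unique, a numpy-only sketch using return_counts=True (not part of the original answers) also gives the counts directly:
import numpy as np
import pandas as pd

a = np.array([12,12,12,3,43,43,43,22,1,3,3,43])
values, counts = np.unique(a, return_counts=True)  # distinct values and how often each occurs
print(pd.DataFrame({'Value': values, 'Repetition Count': counts}))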

Pandas 'partial melt' or 'group melt'

I have a DataFrame like this
>>> df = pd.DataFrame([[1,1,2,3,4,5,6],[2,7,8,9,10,11,12]],
...                   columns=['id', 'ax','ay','az','bx','by','bz'])
>>> df
id ax ay az bx by bz
0 1 1 2 3 4 5 6
1 2 7 8 9 10 11 12
and I want to transform it into something like this
id name x y z
0 1 a 1 2 3
1 2 a 7 8 9
2 1 b 4 5 6
3 2 b 10 11 12
This is an unpivot / melt problem, but I don't know of any way to melt while keeping these groups intact. I know I can create projections across the original dataframe and then concat those, but I'm wondering if I'm missing some common melt tricks from my toolbelt.
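For concreteness, a minimal sketch of that projection-and-concat approach (the column renames are assumptions for illustration, not code from the question):
a_part = df[['id', 'ax', 'ay', 'az']].rename(columns={'ax': 'x', 'ay': 'y', 'az': 'z'}).assign(name='a')
b_part = df[['id', 'bx', 'by', 'bz']].rename(columns={'bx': 'x', 'by': 'y', 'bz': 'z'}).assign(name='b')
result = pd.concat([a_part, b_part], ignore_index=True)[['id', 'name', 'x', 'y', 'z']]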
Set the index to id, convert the columns to a MultiIndex, and stack:
df = df.set_index('id')
df.columns = [df.columns.str[1], df.columns.str[0]]
new_df = df.stack().reset_index().rename(columns = {'level_1': 'name'})
id name x y z
0 1 a 1 2 3
1 1 b 4 5 6
2 2 a 7 8 9
3 2 b 10 11 12
Not melt, but wide_to_long combined with stack and unstack:
pd.wide_to_long(df, ['a','b'], i='id', j='drop', suffix=r'\w+').stack().unstack(1)
Out[476]:
drop x y z
id
1 a 1 2 3
b 4 5 6
2 a 7 8 9
b 10 11 12
An addition to the already excellent answers: pivot_longer from pyjanitor can help to abstract the reshaping:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(index='id',
                names_to=('name', '.value'),
                names_pattern=r"(.)(.)")
id name x y z
0 1 a 1 2 3
1 2 a 7 8 9
2 1 b 4 5 6
3 2 b 10 11 12

applying several functions in transform in pandas

After a groupby, when using agg, if a dict of columns:functions is passed, the functions will be applied to the corresponding columns. Nevertheless, this syntax doesn't work with transform. Is there another way to apply several functions in transform?
Let's give an example:
import numpy as np
import pandas as pd
df_test = pd.DataFrame([[1,2,3],[1,20,30],[2,30,50],[1,2,33],[2,4,50]],columns = ['a','b','c'])
Out[1]:
a b c
0 1 2 3
1 1 20 30
2 2 30 50
3 1 2 33
4 2 4 50
def my_fct1(series):
    return series.mean()

def my_fct2(series):
    return series.std()
df_test.groupby('a').agg({'b':my_fct1,'c':my_fct2})
Out[2]:
c b
a
1 16.522712 8
2 0.000000 17
The previous example shows how to apply different functions to different columns in agg, but if we want to transform the columns without aggregating them, agg can't be used anymore. Therefore:
df_test.groupby('a').transform({'b':np.cumsum,'c':np.cumprod})
Out[3]:
TypeError: unhashable type: 'dict'
How can we perform such an action with the following expected output:
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
You can still use a dict, but with a bit of a hack:
df_test.groupby('a').transform(lambda x: {'b': x.cumsum(), 'c': x.cumprod()}[x.name])
Out[427]:
b c
0 2 3
1 22 90
2 30 50
3 24 2970
4 34 2500
If you need to keep column a, you can do:
df_test.set_index('a')\
       .groupby('a')\
       .transform(lambda x: {'b': x.cumsum(), 'c': x.cumprod()}[x.name])\
       .reset_index()
Out[429]:
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
Another way is to use an if else to check column names:
df_test.set_index('a')\
       .groupby('a')\
       .transform(lambda x: x.cumsum() if x.name=='b' else x.cumprod())\
       .reset_index()
I think that as of pandas 0.20.2, transform is not implemented with a dict of column names to functions the way agg is.
If the functions return a Series of the same length:
df1 = df_test.set_index('a').groupby('a').agg({'b':np.cumsum,'c':np.cumprod}).reset_index()
print (df1)
a c b
0 1 3 2
1 1 90 22
2 2 50 30
3 1 2970 24
4 2 2500 34
But if the aggregation returns a different length, a join is needed:
df2 = df_test[['a']].join(df_test.groupby('a').agg({'b':my_fct1,'c':my_fct2}), on='a')
print (df2)
a c b
0 1 16.522712 8
1 1 16.522712 8
2 2 0.000000 17
3 1 16.522712 8
4 2 0.000000 17
With more recent versions of Pandas, you can use the assign method along with transform to either append new columns or replace existing columns with new values:
grouper = df_test.groupby("a")
df_test.assign(b=grouper["b"].transform("cumsum"),
               c=grouper["c"].transform("cumprod"))
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
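Another common pattern, not in the original answers but a simple sketch: transform each column separately and assign the results back:
out = df_test.copy()
out['b'] = out.groupby('a')['b'].cumsum()    # cumulative sum within each group
out['c'] = out.groupby('a')['c'].cumprod()   # cumulative product within each group
print(out)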

Pandas Data-Frame : Conditionally select columns

I have a pandas data-frame as shown below...
C1 C2 C3 C4
0 -1 -3 3 0.75
1 10 20 30 -0.50
For each row, I want to add only those values among the first two columns that are less than zero. For example, the series I would get for the above case would be as below...
CR
0 -4
1 0
I know how to apply other functions like the below...
df.iloc[:, :-2].abs().sum(axis = 1)
Is there a way using lambda functions?
It seems you need to select with iloc, then use where and sum:
df = df.iloc[:,:2].where(df < 0).sum(axis=1)
print (df)
0 -4.0
1 0.0
dtype: float64
If you need a solution with selection by a callable:
df = df.iloc[:, lambda df: [0,1]].where(df < 0).sum(axis=1)
print (df)
0 -4.0
1 0.0
dtype: float64
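Since the question asks about lambda functions, a row-wise apply sketch is also possible (assuming df is still the original frame from the question; this is slower than the vectorized where above):
result = df.iloc[:, :2].apply(lambda row: row[row < 0].sum(), axis=1)
print(result)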
A Python lambda function is applicable here too. Here is how lambda works in pandas:
# sample data
import numpy as np
import pandas as pd
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
print (df)
A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
Get the difference per column with .apply(axis=0), which is the default, so it is the same as plain .apply():
# a lambda can be used instead of the named function f1 (defined below) when the function is simple
print (df.apply(lambda x: x.max() - x.min()))
A 8
B 8
C 8
D 7
E 6
dtype: int64
def f1(x):
    # print(x)
    return x.max() - x.min()
print (df.apply(f1))
A 8
B 8
C 8
D 7
E 6
dtype: int64
Get the difference per row with .apply(axis=1):
# a lambda can be used instead of the named function f2 (defined below) when the function is simple
print (df.apply(lambda x: x.max() - x.min(), axis=1))
0 5
1 5
2 8
3 9
4 4
dtype: int64
def f2(x):
    # print(x)
    return x.max() - x.min()
print (df.apply(f2, axis=1))
0 5
1 5
2 8
3 9
4 4
dtype: int64

Adding a column to pandas data frame fills it with NA

I have this pandas dataframe:
SourceDomain 1 2 3
0 www.theguardian.com profile.theguardian.com 1 Directed
1 www.theguardian.com membership.theguardian.com 2 Directed
2 www.theguardian.com subscribe.theguardian.com 3 Directed
3 www.theguardian.com www.google.co.uk 4 Directed
4 www.theguardian.com jobs.theguardian.com 5 Directed
I would like to add a new column which is a pandas series created like this:
Weights = Weights.value_counts()
However, when I try to add the new column using edgesFile[4] = Weights it fills it with NA instead of the values:
SourceDomain 1 2 3 4
0 www.theguardian.com profile.theguardian.com 1 Directed NaN
1 www.theguardian.com membership.theguardian.com 2 Directed NaN
2 www.theguardian.com subscribe.theguardian.com 3 Directed NaN
3 www.theguardian.com www.google.co.uk 4 Directed NaN
4 www.theguardian.com jobs.theguardian.com 5 Directed NaN
How can I add the new column keeping the values?
Thanks,
Dani
You are getting NaNs because the index of Weights does not match up with the index of edgesFile. If you want Pandas to ignore Weights.index and just paste the values in order then pass the underlying NumPy array instead:
edgesFile[4] = Weights.values
Here is an example which demonstrates the difference:
In [14]: df = pd.DataFrame(np.arange(4)*10, index=list('ABCD'))
In [15]: df
Out[15]:
0
A 0
B 10
C 20
D 30
In [16]: s = pd.Series(np.arange(4), index=list('CDEF'))
In [17]: s
Out[17]:
C 0
D 1
E 2
F 3
dtype: int64
Here we see Pandas aligning the index:
In [18]: df[4] = s
In [19]: df
Out[19]:
0 4
A 0 NaN
B 10 NaN
C 20 0
D 30 1
Here, Pandas simply pastes the values in s into the column:
In [20]: df[4] = s.values
In [21]: df
Out[21]:
0 4
A 0 0
B 10 1
C 20 2
D 30 3
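Another option, not from the original answer but a small sketch: relabel the Series' index to match the DataFrame's index (Series.set_axis, available in reasonably recent pandas versions), so the assignment aligns as expected:
# give s the same index labels as df, then assign; same result as using .values here
df[4] = s.set_axis(df.index)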
Here is a small example based on your question.
You can add a new column with a column name to an existing DataFrame:
>>> df = pd.DataFrame([[1,2,3],[4,5,6]], columns = ['A', 'B', 'C'])
>>> df
A B C
0 1 2 3
1 4 5 6
>>> s = pd.Series([7,8,9])
>>> s
0 7
1 8
2 9
>>> df['D']=s
>>> df
A B C D
0 1 2 3 7
1 4 5 6 8
Or, you can make a DataFrame from the Series and then concat:
>>> df = pd.DataFrame([[1,2,3],[4,5,6]])
>>> df
0 1 2
0 1 2 3
1 4 5 6
>>> s = pd.DataFrame(pd.Series([7,8])) # if you don't provide a column name, the default name will be 0
>>> s
0
0 7
1 8
>>> df = pd.concat([df,s], axis=1)
>>> df
0 1 2 0
0 1 2 3 7
1 4 5 6 8
Hope this will help
