Using Pandas to "applymap" with access to index/column? - python

What's the most effective way to solve the following pandas problem?
Here's a simplified example with some data in a data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 4)), columns=['a', 'b', 'c', 'd'],
                  index=np.random.randint(0, 10, size=10))
This data looks like this:
a b c d
1 0 0 9 9
0 2 2 1 7
3 9 3 4 0
2 5 0 9 4
1 7 7 7 2
6 4 4 6 4
1 1 6 0 0
7 8 0 9 3
5 0 0 8 3
4 5 0 2 4
Now I want to apply some function f to each value in the data frame (the function below, for example) and get a data frame back as a resulting output. The tricky part is the function I'm applying depends on the value of the index I am currently at.
def f(cell_val, row_val):
    """some function which needs to know row_val to use it"""
    try:
        return cell_val / row_val
    except ZeroDivisionError:
        return -1
Normally, if I wanted to apply a function to each individual cell in the data frame, I would just call df.applymap(f). Even if I had to pass in a second argument ('row_val', in this case), if that argument were a fixed number I could just write a lambda expression such as lambda x: f(x, i), where i is the fixed number I wanted. However, my second argument varies depending on the row in the data frame I am currently calling the function from, which means that I can't just use .applymap().
How would I go about solving a problem like this efficiently? I can think of a few ways to do this, but none of them feel "right". I could:
loop through each individual value and replace them one by one, but that seems really awkward and slow.
create a completely separate data frame containing (cell value, row value) tuples and use the builtin pandas applymap() on my tuple data frame. But that seems pretty hacky and I'm also creating a completely separate data frame as an extra step.
there must be a better solution to this (a fast solution would be appreciated, because my data frame could get very large).

IIUC you can use div with axis=0; you also need to convert the Index object to a Series object using to_series:
In [121]:
df.div(df.index.to_series(), axis=0).replace(np.inf, -1)
Out[121]:
a b c d
1 0.000000 0.000000 9.000000 9.000000
0 -1.000000 -1.000000 -1.000000 -1.000000
3 3.000000 1.000000 1.333333 0.000000
2 2.500000 0.000000 4.500000 2.000000
1 7.000000 7.000000 7.000000 2.000000
6 0.666667 0.666667 1.000000 0.666667
1 1.000000 6.000000 0.000000 0.000000
7 1.142857 0.000000 1.285714 0.428571
5 0.000000 0.000000 1.600000 0.600000
4 1.250000 0.000000 0.500000 1.000000
Additionally, as division by zero results in inf, you need to call replace to convert those values to -1.
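If the function genuinely can't be vectorized, a minimal sketch (using a small made-up frame) of getting at the index via apply over rows, where row.name holds the row's index label:

```python
import pandas as pd

def f(cell_val, row_val):
    """Divide the cell by the row's index value; -1 on division by zero."""
    try:
        return cell_val / row_val
    except ZeroDivisionError:
        return -1

df = pd.DataFrame([[0, 0, 9, 9], [2, 2, 1, 7]],
                  columns=['a', 'b', 'c', 'd'], index=[1, 0])

# axis=1 passes each row as a Series; row.name is that row's index label
result = df.apply(lambda row: row.map(lambda v: f(v, row.name)), axis=1)
print(result)
```

This is slower than the vectorized div answer, but it works for arbitrary functions.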

Here's how you can add the index values to the dataframe:
pd.DataFrame(df.values + df.index.values[:, None], df.index, df.columns)
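The [:, None] reshape is what makes this broadcast work: it turns the index into a column vector so each row's label is applied across all columns. A quick sketch with a tiny made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'], index=[10, 20])

# index.values has shape (2,); [:, None] makes it (2, 1), which
# broadcasts across the columns, adding each row's index label
out = pd.DataFrame(df.values + df.index.values[:, None], df.index, df.columns)
print(out)
```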

Related

Iterating through excel row using pandas

To begin with, I would like to mention that I am new to Python. I am trying to iterate over rows in pandas. My data comes from an Excel file and looks like this:
I would like to create a loop that calculates the mean of specific rows. For instance row 0,1,2 and then 9,10,11 and so on.
What I have already done:
import pandas as pd
import numpy as np
df = pd.read_excel("Excel name.xlsx")
for i in range([0,1,2],154,3)
    x = df.iloc[[i]].mean()
    print(x)
But I am not getting results. Any idea? Thank you in advance.
What I am doing and my code actually works is:
x1= df.iloc[[0,1,2]].mean()
x2= df.iloc[[9,10,11]].mean()
x3= df.iloc[[18,19,20]].mean()
x4= df.iloc[[27,28,29]].mean()
x5= df.iloc[[36,37,38]].mean()
x6= df.iloc[[45,46,47]].mean()
....
....
....
x17= df.iloc[[146,147,148]].mean()
What if I had 100 x? It would be impossible to code. So my question is if there is a way to automate this procedure with a loop.
Don't loop. Instead, select all the needed rows at once with a little maths: take the index modulo 9 and keep the values 0, 1, 2 via Index.isin, then group by integer division by 9 and aggregate with mean:
np.random.seed(2021)
df = pd.DataFrame(np.random.randint(10, size=(20, 3)))
mask = (df.index % 9).isin([0,1,2])
print(df[mask].groupby(df[mask].index // 9).mean())
0 1 2
0 4.000000 5.666667 6.666667
1 3.666667 6.000000 8.333333
2 6.500000 8.000000 7.000000
Detail:
print(df[mask])
0 1 2
0 4 5 9
1 0 6 5
2 8 6 6
9 1 6 7
10 5 6 9
11 5 6 9
18 4 9 7
19 9 7 7
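Equivalently, the x1, x2, ... assignments from the question can be generated in a single loop (slower than the vectorized mask above, but it automates the pattern). A sketch with made-up data and the same 9-row spacing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(60).reshape(20, 3))

# start positions 0, 9, 18, ...; mean of each start..start+2 block
means = [df.iloc[start:start + 3].mean() for start in range(0, len(df), 9)]
result = pd.concat(means, axis=1).T.reset_index(drop=True)
print(result)
```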

apply() method: Normalize the first column by the sum of the second

I'm having trouble understanding how a function works:
""" the apply() method lets you apply an arbitrary function to the group
result. The function take a DataFrame and returns a Pandas object (a df or
series) or a scalar.
For example: normalize the first column by the sum of the second"""
def norm_by_data2(x):
    # x is a DataFrame of group values
    x['data1'] /= x['data2'].sum()
    return x
print (df); print (df.groupby('key').apply(norm_by_data2))
(Excerpt from: "Python Data Science Handbook", Jake VanderPlas pp. 167)
Returns this:
key data1 data2
0 A 0 5
1 B 1 0
2 C 2 3
3 A 3 3
4 B 4 7
5 C 5 9
key data1 data2
0 A 0.000000 5
1 B 0.142857 0
2 C 0.166667 3
3 A 0.375000 3
4 B 0.571429 7
5 C 0.416667 9
For me, the best way to understand how this works is by manually calculating the values.
Can someone explain how to manually arrive at the second value of the column 'data1': 0.142857?
It's 1/7, but where do these values come from?
Thanks!
I got it!!
The sum of column 'data2' for each group of 'key' is:
A: 5 + 3 = 8
B: 0 + 7 = 7
C: 3 + 9 = 12
For example, to arrive at 0.142857, divide 1 by the sum for group B (which is 7): 1/7 = 0.142857
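The manual arithmetic above can be checked directly; a sketch reconstructing the book's example frame (group_keys=False keeps the original row index):

```python
import pandas as pd

df = pd.DataFrame({'key': list('ABCABC'),
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

def norm_by_data2(x):
    # x is the sub-DataFrame for one value of 'key'
    x = x.copy()
    x['data1'] = x['data1'] / x['data2'].sum()
    return x

out = df.groupby('key', group_keys=False).apply(norm_by_data2)
print(out)
```

Row 1 belongs to group B, whose data2 sum is 0 + 7 = 7, so its data1 becomes 1/7 = 0.142857.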

Input contains NaN, infinity or a value too large for dtype('float64') when I scale my data

I am trying to normalize my data like this:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
trainX = scaler.fit_transform(X_data_train)
and I get this error :
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
X_data_train is a pandas DataFrame of size (95538, 550). What is really odd is that when I write
print (X_data_train.min().min())
it gives -5482.4473 and similarly for the max, I get 28738212.0, which does not seem for me to be extra-high values...
Moreover, based on the command given by the 54+ voted answer, I did check that I have no NaN or infinity for sure. I also don't have blanks in my csv or anything like that, as I checked the dimensions.
So, where is the problem ??
You can also check NaNs and inf:
df = pd.DataFrame({'B': [4, 5, 4, 5, 5, np.inf],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [np.nan, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4]})
print (df)
B C D E
0 4.000000 7 NaN 5
1 5.000000 8 3.0 3
2 4.000000 9 5.0 6
3 5.000000 4 7.0 9
4 5.000000 2 1.0 2
5 inf 3 0.0 4
nan = df[df.isnull().any(axis=1)]
print (nan)
B C D E
0 4.0 7 NaN 5
inf = df[df.eq(np.inf).any(axis=1)]
print (inf)
B C D E
5 inf 3 0.0 4
If you want to find all indices with at least one NaN in the row:
print (df.index[np.isnan(df).any(axis=1)])
Int64Index([0], dtype='int64')
And columns:
print (df.columns[np.isnan(df).any()])
Index(['D'], dtype='object')
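Applied to the scaling problem, the same checks can locate the offending columns before calling fit_transform; a sketch with a small made-up frame:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({'a': [1.0, 2.0, np.nan],
                  'b': [4.0, np.inf, 6.0],
                  'c': [7.0, 8.0, 9.0]})

# columns containing any NaN or +/-inf, which MinMaxScaler rejects;
# np.isfinite is False for both NaN and infinity
bad_cols = X.columns[~np.isfinite(X).all()]
print(list(bad_cols))
```

Note that min()/max() checks can miss NaN, since NaN is skipped by those aggregations by default.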

Baffled by dataframe groupby.diff()

I have just read this question:
In a Pandas dataframe, how can I extract the difference between the values on separate rows within the same column, conditional on a second column?
and I am completely baffled by the answer. How does this work???
I mean, when I groupby('user') shouldn't the result be, well, grouped by user?
Whatever the function I use (mean, sum etc) I would expect a result like this:
aa=pd.DataFrame([{'user':'F','time':0},
{'user':'T','time':0},
{'user':'T','time':0},
{'user':'T','time':1},
{'user':'B','time':1},
{'user':'K','time':2},
{'user':'J','time':2},
{'user':'T','time':3},
{'user':'J','time':4},
{'user':'B','time':4}])
aa2=aa.groupby('user')['time'].sum()
print(aa2)
user
B 5
F 0
J 6
K 2
T 4
Name: time, dtype: int64
How does diff() instead return a diff of each row with the previous, within each group?
aa['diff']=aa.groupby('user')['time'].diff()
print(aa)
time user diff
0 0 F NaN
1 0 T NaN
2 0 T 0.0
3 1 T 1.0
4 1 B NaN
5 2 K NaN
6 2 J NaN
7 3 T 2.0
8 4 J 2.0
9 4 B 3.0
And more important, how is the result not a unique list of 'user' values?
I found many answers that use groupby.diff() but none of them explain it in detail. It would be extremely useful to me, and hopefully to others, to understand the mechanics behind it. Thanks.
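One way to see the mechanics: sum is an aggregation (one value per group), while diff is a transformation (one value per input row, placed back at each row's original position, which is why the result is not a unique list of users). A sketch reproducing the diff column by hand for user 'T' from the data above:

```python
import pandas as pd

aa = pd.DataFrame({'user': ['F', 'T', 'T', 'T', 'B', 'K', 'J', 'T', 'J', 'B'],
                   'time': [0, 0, 0, 1, 1, 2, 2, 3, 4, 4]})

# groupby('user')['time'].diff() subtracts, within each user's rows,
# the previous row's time; the first row of each user has no
# predecessor, hence NaN
t = aa.loc[aa['user'] == 'T', 'time']   # rows 1, 2, 3, 7
print(t.diff())
```

Those values, NaN, 0, 1, 2, are exactly what appears at positions 1, 2, 3, 7 of the 'diff' column in the question's output.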

What does .div do in Pandas (Python)

I'm confused as to the highlighted line. What exactly is this line doing. What does .div do? I tried to look through the documentation which said
"Floating division of dataframe and other, element-wise (binary operator truediv)"
I'm not exactly sure what this means. Any help would be appreciated!
You can divide one dataframe by another, and pandas will automatically align the indexes and columns before dividing the corresponding values, e.g. df1 / df2.
If you divide a dataframe by a series, pandas automatically aligns the series index with the columns of the dataframe. It may be that you want to align the index of the series with the index of the dataframe instead. If that is the case, then you will have to use the div method.
So instead of:
df / s
You use
df.div(s, axis=0)
Which says to align the index of s with the index of df then perform the division while broadcasting over the other dimension, in this case columns.
In the above example, what it is essentially doing is dividing pclass_xt along axis 0 by the series that pclass_xt.sum(1) generates. That sum adds up the values across the columns (axis=1), which gives you the total of both survived and not survived for each pclass. Then .div with axis=0 simply divides each row of the dataframe by that row's total.
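pclass_xt itself isn't shown in the excerpt, so here is a sketch of the same pattern with made-up counts: normalizing each row of a crosstab-like frame by its row total:

```python
import pandas as pd

ct = pd.DataFrame({'survived': [120, 80], 'died': [80, 320]},
                  index=['first', 'third'])

# row totals (sum across columns), then divide each row by its own
# total; axis=0 aligns the sums with the dataframe's index
fractions = ct.div(ct.sum(axis=1), axis=0)
print(fractions)
```

Each row of the result sums to 1.0, i.e. the counts become within-row fractions.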
import pandas as pd
import numpy as np

data = {"A": np.arange(10), "B": np.random.randint(1, 10, 10), "C": np.random.random(10)}
df2 = pd.DataFrame(data=data)
print("DataFrame values:\n", df2)
s1 = pd.Series(np.arange(1, 11))
print("s1 series values:\n", s1)
print("Result of Division:\n", df2.div(s1, axis=0))
# So here is how the div works, row by row:
# df row 0 / s1 row 0 -> 0/1  2/1  0.265/1
# df row 1 / s1 row 1 -> 1/2  2/2  0.0556/2
Output:
DataFrame values:
A B C
0 0 2 0.265396
1 1 2 0.055646
2 2 7 0.963006
3 3 9 0.958677
4 4 6 0.256558
5 5 6 0.859066
6 6 8 0.818831
7 7 4 0.656055
8 8 6 0.885797
9 9 4 0.412497
s1 series values:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
dtype: int64
Result of Division:
A B C
0 0.000000 2.000000 0.265396
1 0.500000 1.000000 0.027823
2 0.666667 2.333333 0.321002
3 0.750000 2.250000 0.239669
4 0.800000 1.200000 0.051312
5 0.833333 1.000000 0.143178
6 0.857143 1.142857 0.116976
7 0.875000 0.500000 0.082007
8 0.888889 0.666667 0.098422
9 0.900000 0.400000 0.041250
