I'm confused about the highlighted line. What exactly is it doing, and what does .div do? I tried to look through the documentation, which says:
"Floating division of dataframe and other, element-wise (binary operator truediv)"
I'm not exactly sure what this means. Any help would be appreciated!
You can divide one dataframe by another, and pandas will automatically align the index and columns and then divide the corresponding values, e.g. df1 / df2.
If you divide a dataframe by a series, pandas automatically aligns the series index with the columns of the dataframe. It may be that you want to align the index of the series with the index of the dataframe instead. If that is the case, you have to use the div method.
So instead of:
df / s
You use
df.div(s, axis=0)
Which says to align the index of s with the index of df then perform the division while broadcasting over the other dimension, in this case columns.
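A minimal sketch of that difference (the frame and series here are made up for illustration):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])
s = pd.Series([10, 100], index=['x', 'y'])

# df / s aligns s with the columns 'a' and 'b'; nothing matches s's
# labels 'x' and 'y', so the result is all NaN.
print(df / s)

# df.div(s, axis=0) aligns s with the index instead: row 'x' is divided
# by 10 and row 'y' by 100, broadcasting across the columns.
print(df.div(s, axis=0))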
In the above example, it is essentially dividing pclass_xt along axis 0 by the series that pclass_xt.sum(1) generates. In pclass_xt.sum(1), .sum adds up the values along axis=1, which gives the total of survived plus not survived for each pclass. Then .div divides the entire dataframe along axis 0 by those totals, i.e. each row is divided by the sum of that row.
import pandas as pd
import numpy as np

data = {"A": np.arange(10), "B": np.random.randint(1, 10, 10), "C": np.random.random(10)}
df2 = pd.DataFrame(data=data)
print("DataFrame values:\n", df2)
s1 = pd.Series(np.arange(1, 11))
print("s1 series values:\n", s1)
print("Result of Division:\n", df2.div(s1, axis=0))
# So here div works row by row, as shown below (values taken from the output):
# df row 0 / s1 row 0 -> 0/1, 2/1, 0.265396/1
# df row 1 / s1 row 1 -> 1/2, 2/2, 0.055646/2
Output:
DataFrame values:
A B C
0 0 2 0.265396
1 1 2 0.055646
2 2 7 0.963006
3 3 9 0.958677
4 4 6 0.256558
5 5 6 0.859066
6 6 8 0.818831
7 7 4 0.656055
8 8 6 0.885797
9 9 4 0.412497
s1 series values:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
dtype: int64
Result of Division:
A B C
0 0.000000 2.000000 0.265396
1 0.500000 1.000000 0.027823
2 0.666667 2.333333 0.321002
3 0.750000 2.250000 0.239669
4 0.800000 1.200000 0.051312
5 0.833333 1.000000 0.143178
6 0.857143 1.142857 0.116976
7 0.875000 0.500000 0.082007
8 0.888889 0.666667 0.098422
9 0.900000 0.400000 0.041250
To begin with, I would like to mention that I am new to Python. I am trying to iterate over rows in pandas. My data comes from an Excel file and looks like this:
I would like to create a loop that calculates the mean of specific rows: for instance rows 0, 1, 2, then rows 9, 10, 11, and so on.
What I have already done:
import pandas as pd
import numpy as np
df = pd.read_excel("Excel name.xlsx")
for i in range([0,1,2],154,3)
x =df.iloc[[i]].mean()
print(x)
But I am not getting results. Any idea? Thank you in advance.
What I am doing, and my code actually works, is:
x1= df.iloc[[0,1,2]].mean()
x2= df.iloc[[9,10,11]].mean()
x3= df.iloc[[18,19,20]].mean()
x4= df.iloc[[27,28,29]].mean()
x5= df.iloc[[36,37,38]].mean()
x6= df.iloc[[45,46,47]].mean()
....
....
....
x17= df.iloc[[146,147,148]].mean()
What if I had 100 of these x variables? It would be impractical to write them all out. So my question is whether there is a way to automate this procedure with a loop.
Don't loop. Instead, select all the relevant rows at once with a little arithmetic - take the index modulo 9 and keep the values 0, 1, 2 with Index.isin, then group by the index floor-divided by 9 and aggregate the mean:
import pandas as pd
import numpy as np

np.random.seed(2021)
df = pd.DataFrame(np.random.randint(10, size=(20, 3)))
mask = (df.index % 9).isin([0, 1, 2])
print(df[mask].groupby(df[mask].index // 9).mean())
0 1 2
0 4.000000 5.666667 6.666667
1 3.666667 6.000000 8.333333
2 6.500000 8.000000 7.000000
Detail:
print(df[mask])
0 1 2
0 4 5 9
1 0 6 5
2 8 6 6
9 1 6 7
10 5 6 9
11 5 6 9
18 4 9 7
19 9 7 7
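For completeness, if you do want an explicit loop in the spirit of the original attempt, here is a sketch (assuming, as in the question, groups of three rows starting at every ninth row):
means = [df.iloc[i:i + 3].mean() for i in range(0, len(df), 9)]
result = pd.DataFrame(means)
print(result)
This yields one mean row per group without naming x1 ... x17 by hand.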
The actual dataframe consists of more than a million rows.
Say for example a dataframe is:
UniqueID Code Value OtherData
1 A 5 Z01
1 B 6 Z02
1 C 7 Z03
2 A 10 Z11
2 B 11 Z24
2 C 12 Z23
3 A 10 Z21
4 B 8 Z10
I want to obtain ratio of A/B for each UniqueID and put it in a new dataframe. For example, for UniqueID 1, its ratio of A/B = 5/6.
What is the most efficient way to do this in Python?
Want:
UniqueID RatioAB
1 5/6
2 10/11
3 Inf
4 0
Thank you.
One approach is using pivot_table, aggregating with the sum in case there are multiple occurrences of the same letter per UniqueID (otherwise a simple pivot will do), and evaluating A/B on the result:
df.pivot_table(index='UniqueID', columns='Code', values='Value', aggfunc='sum').eval('A/B')
UniqueID
1 0.833333
2 0.909091
3 NaN
4 NaN
dtype: float64
If there is at most one occurrence of each letter per group:
df.pivot(index='UniqueID', columns='Code', values='Value').eval('A/B')
UniqueID
1 0.833333
2 0.909091
3 NaN
4 NaN
dtype: float64
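Note that the one-sided groups come out as NaN rather than the Inf and 0 the question asked for. A sketch that fills each missing side with 0 before dividing (relying on float division, where a positive value over 0 yields inf):
pivoted = df.pivot_table(index='UniqueID', columns='Code', values='Value', aggfunc='sum')
# UniqueID 3 has no B row: 10/0 -> inf. UniqueID 4 has no A row: 0/8 -> 0.
ratio = pivoted['A'].fillna(0) / pivoted['B'].fillna(0)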
If you only care about A/B ratio:
df1 = df[df['Code'].isin(['A','B'])][['UniqueID', 'Code', 'Value']]
df1 = df1.pivot(index='UniqueID',
columns='Code',
values='Value')
df1['RatioAB'] = df1['A']/df1['B']
The most apparent way is via groupby with apply:
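# Note: .iloc[0] assumes every UniqueID group contains both an 'A' and a 'B'
# row; it raises IndexError for one-sided groups (UniqueID 3 and 4 here).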
df.groupby('UniqueID').apply(lambda g: g.query("Code == 'A'")['Value'].iloc[0] / g.query("Code == 'B'")['Value'].iloc[0])
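If apply is too slow for a million rows, a vectorized sketch that picks out the A and B values per UniqueID and divides the two aligned Series (assuming at most one A and one B per UniqueID):
a = df.loc[df['Code'] == 'A'].set_index('UniqueID')['Value']
b = df.loc[df['Code'] == 'B'].set_index('UniqueID')['Value']
ratio = a / b  # aligns on UniqueID; one-sided groups become NaN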
I have a dataframe with a bunch of columns labelled in 'YYYY-MM' format, along with several other columns. I need to collapse the date columns into calendar quarters and take the mean; I was able to do it manually, but there are a few hundred date columns in my real data and I'd like to not have to map every single one of them by hand. I'm generating the initial df from a CSV; I didn't see anything in read_csv that seemed like it would help, but if there's anything I can leverage there that would be great. I found dataframe.dt.to_period("Q") that will convert a datetime object to quarter, but I'm not quite sure how to apply that here, if I can at all.
Here's a sample df (code below):
foo bar 2016-04 2016-05 2016-06 2016-07 2016-08
0 6 5 3 3 5 8 1
1 9 3 6 9 9 7 8
2 8 5 8 1 9 9 4
3 5 8 1 2 3 5 6
4 4 5 1 2 7 2 6
This code will do what I'm looking for, but I had to generate mapping by hand:
mapping = {'2016-04':'2016q2', '2016-05':'2016q2', '2016-06':'2016q2', '2016-07':'2016q3', '2016-08':'2016q3'}
df = df.set_index(['foo', 'bar']).groupby(mapping, axis=1).mean().reset_index()
New df:
foo bar 2016q2 2016q3
0 6 5 3.666667 4.5
1 9 3 8.000000 7.5
2 8 5 6.000000 6.5
3 5 8 2.000000 5.5
4 4 5 3.333333 4.0
Code to generate the initial df:
df = pd.DataFrame(np.random.randint(1, 11, size=(5, 7)), columns=('foo', 'bar', '2016-04', '2016-05', '2016-06', '2016-07', '2016-08'))
Pass groupby a callable; it gets applied to the axis labels. Use axis=1 to apply it to the column labels instead of the index.
(df.set_index(['foo', 'bar'])
.groupby(lambda x: pd.Period(x, 'Q'), axis=1)
.mean().reset_index())
foo bar 2016Q2 2016Q3
0 6 5 3.666667 4.5
1 9 3 8.000000 7.5
2 8 5 6.000000 6.5
3 5 8 2.000000 5.5
4 4 5 3.333333 4.0
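On recent pandas versions, where groupby(..., axis=1) is deprecated, the same result can be obtained by transposing first (a sketch under that assumption):
(df.set_index(['foo', 'bar'])
   .T
   .groupby(lambda x: pd.Period(x, 'Q'))
   .mean()
   .T
   .reset_index())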
The solution is quite short:
Start by copying the "monthly" columns to another DataFrame and converting the
column names to a PeriodIndex:
df2 = df.iloc[:, 2:]
df2.columns = pd.PeriodIndex(df2.columns, freq='M')
Then, to get the result, resample the columns by quarter,
compute the mean for each quarter, and join with the 2 "initial" columns:
df.iloc[:, :2].join(df2.resample('Q', axis=1).agg('mean'))
import pandas as pd

data = [[2,2,2,3,3,3],[1,2,2,3,4,5],[1,2,2,3,4,5],[1,2,2,3,4,5],[1,2,2,3,4,5],[1,2,2,3,4,5]]
df = pd.DataFrame(data, columns=['A','1996-04','1996-05','2000-07','2000-08','2010-10'])
# separate the monthly columns
df3 = df.iloc[:, 1:]
# separate the other columns
df2 = df.iloc[:, 0]
# group the monthly columns by quarter and take the mean
df3 = df3.groupby(pd.PeriodIndex(df3.columns, freq='Q'), axis=1).mean()
final_df = pd.concat([df3,df2], axis=1)
print(final_df)
The output:
   1996Q2  2000Q3  2010Q4  A
0     2.0     3.0     3.0  2
1     2.0     3.5     5.0  1
2     2.0     3.5     5.0  1
3     2.0     3.5     5.0  1
4     2.0     3.5     5.0  1
5     2.0     3.5     5.0  1
I'm having trouble understanding how a function works:
""" the apply() method lets you apply an arbitrary function to the group
result. The function take a DataFrame and returns a Pandas object (a df or
series) or a scalar.
For example: normalize the first column by the sum of the second"""
def norm_by_data2(x):
# x is a DataFrame of group values
x['data1'] /= x['data2'].sum()
return x
print (df); print (df.groupby('key').apply(norm_by_data2))
(Excerpt from: "Python Data Science Handbook", Jake VanderPlas pp. 167)
Returns this:
key data1 data2
0 A 0 5
1 B 1 0
2 C 2 3
3 A 3 3
4 B 4 7
5 C 5 9
key data1 data2
0 A 0.000000 5
1 B 0.142857 0
2 C 0.166667 3
3 A 0.375000 3
4 B 0.571429 7
5 C 0.416667 9
For me, the best way to understand how this works is by manually calculating the values.
Can someone explain how to manually arrive at the second value of the column 'data1', 0.142857?
It's 1/7, but where do these values come from?
Thanks!
I got it!!
The sum of column data2 for each group is:
A: 5 + 3 = 8
B: 0 + 7 = 7
C: 3 + 9 = 12
For example, to arrive at 0.142857, take the data1 value in row 1 (which is 1 and belongs to group B) and divide it by the data2 sum of group B (7): 1/7 = 0.142857.
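A short sketch reconstructing the book's example to verify this arithmetic:
import pandas as pd

df = pd.DataFrame({'key': list('ABCABC'),
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})
sums = df.groupby('key')['data2'].sum()   # A: 8, B: 7, C: 12
print(df['data1'] / df['key'].map(sums))  # row 1 gives 1/7 = 0.142857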
What's the most effective way to solve the following pandas problem?
Here's a simplified example with some data in a data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=['a','b','c','d'],
index=np.random.randint(0,10,size=10))
This data looks like this:
a b c d
1 0 0 9 9
0 2 2 1 7
3 9 3 4 0
2 5 0 9 4
1 7 7 7 2
6 4 4 6 4
1 1 6 0 0
7 8 0 9 3
5 0 0 8 3
4 5 0 2 4
Now I want to apply some function f to each value in the data frame (the function below, for example) and get a data frame back as a resulting output. The tricky part is the function I'm applying depends on the value of the index I am currently at.
def f(cell_val, row_val):
    """some function which needs to know row_val to use it"""
    try:
        return cell_val / row_val
    except ZeroDivisionError:
        return -1
Normally, if I wanted to apply a function to each individual cell in the data frame, I would just call .applymap() on f. Even if I had to pass in a second argument ('row_val', in this case), if the argument was a fixed number I could just write a lambda expression such as lambda x: f(x,i) where i is the fixed number I wanted. However, my second argument varies depending on the row in the data frame I am currently calling the function from, which means that I can't just use .applymap().
How would I go about solving a problem like this efficiently? I can think of a few ways to do this, but none of them feel "right". I could:
loop through each individual value and replace them one by one, but that seems really awkward and slow.
create a completely separate data frame of (cell value, row value) tuples and use the built-in applymap() on that, but that seems pretty hacky, and it creates a whole extra data frame as an intermediate step.
Surely there must be a better solution (a fast one would be appreciated, because my data frame could get very large).
IIUC you can use div with axis=0 plus you need to convert the Index object to a Series object using to_series:
In [121]:
df.div(df.index.to_series(), axis=0).replace(np.inf, -1)
Out[121]:
a b c d
1 0.000000 0.000000 9.000000 9.000000
0 -1.000000 -1.000000 -1.000000 -1.000000
3 3.000000 1.000000 1.333333 0.000000
2 2.500000 0.000000 4.500000 2.000000
1 7.000000 7.000000 7.000000 2.000000
6 0.666667 0.666667 1.000000 0.666667
1 1.000000 6.000000 0.000000 0.000000
7 1.142857 0.000000 1.285714 0.428571
5 0.000000 0.000000 1.600000 0.600000
4 1.250000 0.000000 0.500000 1.000000
Additionally, as division by zero results in inf, you need to call replace to convert those values to -1.
Here's how you can add the index values to the dataframe using NumPy broadcasting:
pd.DataFrame(df.values + df.index.values[:, None], df.index, df.columns)
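The same broadcasting trick works for the division in the question. A sketch (note that NaN from 0/0 is also mapped to the -1 sentinel, matching f's behavior):
import numpy as np
import pandas as pd

out = pd.DataFrame(df.values / df.index.values[:, None], df.index, df.columns)
out = out.replace([np.inf, np.nan], -1)  # rows whose index is 0 divide by zero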