why pandas is giving different result when multiplying by index - python

I have a pandas dataframe to which I am adding a column, that column is the product of row index value and a constant. My question why pandas give different result as shown below:
c = pd.DataFrame(
columns = ['a', 'b'],
index = [0,30,45]
)
c.loc[:, ['a', 'b']] = c.index*60, 3
c.to_clipboard()
# a b
# 0 Int64Index([0, 1800, 2700], dtype='int64') 3
# 30 Int64Index([0, 1800, 2700], dtype='int64') 3
# 45 Int64Index([0, 1800, 2700], dtype='int64') 3
Second way, this what I was expecting the result to be except I don't want c column but a column should be as below:
c = pd.DataFrame(
columns = ['a', 'b'],
index = [0,30,45]
)
c.loc[:, ['a', 'c']] = c.index*60, 3
c.to_clipboard()
# a b c
# 0 0 NaN 3
# 30 1800 NaN 3
# 45 2700 NaN 3

Related

How to use the pandas 'isin' function to give actual values of the df row instead of a boolean expression?

I have two dataframes and I'm comparing their columns labeled 'B'. If the value of column B in df2 matches the value of column B in df1, I want to extract the value of column C from df2 and add it to a new column in df1.
Example:
df1
df2
Expected Result of df1:
I've tried the following. I know that this checks if there's a match of column B in both the dataframes - it returns a boolean value of True/False in the 'New' column. Is there a way to extract the value indicated under column 'C' when there's a match and add it to the 'New' column in df1 instead of the boolean values?
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
df1['New'] = df2['B'].isin(df1['B'])
import pandas as pd
df1 = pd.DataFrame({'B': ['a', 'b', 'f', 'd', 'h'], 'C':[1, 5, 777, 10, 3]})
df2 = pd.DataFrame({'B': ['k', 'l', 'f', 'j', 'h'], 'C':[0, 9, 555, 15, 1]})
ind = df2[df2['B'].isin(df1['B'])].index
df1.loc[ind, 'new'] = df2.loc[ind, 'C']
df2
B C
0 k 0
1 l 9
2 f 555
3 j 15
4 h 1
Output df1
B C new
0 a 1 NaN
1 b 5 NaN
2 f 777 555.0
3 d 10 NaN
4 h 3 1.0
Here in ind are obtained indexes of rows df2 where there are matches. Further using loc, where on the left are the row indices, on the right are the column names.

Pandas split dataframe on grouped index

Given a dataframe like
df = pd.DataFrame({
'A': ['a', 'b', 'b'],
'B': ['x', 'x', 'y'],
'C': [1, 2, 3]
})
agg = df.groupby(['A', 'B']).agg('sum')
I get
C
A B
a x 1
b x 2
b y 3
Now I would like to transform this to:
C_x C_y
A 1 NaN
a 2 NaN
b NaN 3
How can I split agg into columns on the second index?

Subtract one row from another in Pandas DataFrame

I am trying to subtract one row from another in a Pandas DataFrame. I have multiple descriptor columns preceding one numerical column, forcing me to set the index of the DataFrame on the two descriptor columns.
When I do this I get a KeyError on whatever the first column name listed in the set_index() list of columns is. In this case it is 'COL_A':
df = pd.DataFrame({'COL_A': ['A', 'A'],
'COL_B': ['B', 'B'],
'COL_C': [4, 2]})
df.set_index(['COL_A', 'COL_B'], inplace=True)
df.iloc[1] = (df.iloc[1] / df.iloc[0])
df.reset_index(inplace=True)
KeyError: 'COL_A'
I did not give this a second thought and cannot figure out why the KeyError is how this resolves.
I came upon this question for a quick answer. Here's what my solution ended up being.
>>> df = pd.DataFrame(data=[[5,5,5,5], [3,3,3,3]], index=['r1', 'r2'])
>>> df
0 1 2 3
r1 5 5 5 5
r2 3 3 3 3
>>> df.loc['r3'] = df.loc['r1'] - df.loc['r2']
>>> df
0 1 2 3
r1 5 5 5 5
r2 3 3 3 3
r3 2 2 2 2
>>>
Not sure I understand you correctly:
df = pd.DataFrame({'COL_A': ['A', 'A'],
'COL_B': ['B', 'B'],
'COL_C': [4, 2]})
gives:
COL_A COL_B COL_C
0 A B 4
1 A B 2
then
df.set_index(['COL_A', 'COL_B'], inplace=True)
df.iloc[1] = (df.iloc[1] / df.iloc[0])
yields:
COL_A COL_B
A B 4.0
B 0.5
If you now want to subtract, say row 0 from row 1, you can:
df.iloc[1].subtract(df.iloc[0])
to get:
COL_C -3.5

How can I check the ID of a pandas data frame in another data frame in Python?

Hello I have the following Data Frame:
df =
ID Value
a 45
b 3
c 10
And another dataframe with the numeric ID of each value
df1 =
ID ID_n
a 3
b 35
c 0
d 7
e 1
I would like to have a new column in df with the numeric ID, so:
df =
ID Value ID_n
a 45 3
b 3 35
c 10 0
Thanks
Use pandas merge:
import pandas as pd
df1 = pd.DataFrame({
'ID': ['a', 'b', 'c'],
'Value': [45, 3, 10]
})
df2 = pd.DataFrame({
'ID': ['a', 'b', 'c', 'd', 'e'],
'ID_n': [3, 35, 0, 7, 1],
})
df1.set_index(['ID'], drop=False, inplace=True)
df2.set_index(['ID'], drop=False, inplace=True)
print pd.merge(df1, df2, on="ID", how='left')
output:
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
You could use join(),
In [14]: df1.join(df2)
Out[14]:
Value ID_n
ID
a 45 3
b 3 35
c 10 0
If you want index to be numeric you could reset_index(),
In [17]: df1.join(df2).reset_index()
Out[17]:
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
You can do this in a single operation. join works on the index, which you don't appear to have set. Just set the index to ID, join df after also setting its index to ID, and then reset your index to return your original dataframe with the new column added.
>>> df.set_index('ID').join(df1.set_index('ID')).reset_index()
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
Also, because you don't do an inplace set_index on df1, its structure remains the same (i.e. you don't change its indexing).

python pandas dataframe : fill nans with a conditional mean

I have the following dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'Cat' : ['A', 'A', 'A','B', 'B', 'A', 'B'],
'Vals' : [1, 2, 3, 4, 5, np.nan, np.nan]})
Cat Vals
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 A NaN
6 B NaN
And I want indexes 5 and 6 to be filled with the conditional mean of 'Vals' based on the 'Cat' column, namely 2 and 4.5
The following code works fine:
means = df.groupby('Cat').Vals.mean()
for i in df[df.Vals.isnull()].index:
df.loc[i, 'Vals'] = means[df.loc[i].Cat]
Cat Vals
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 A 2
6 B 4.5
But I'm looking for something nicer, like
df.Vals.fillna(df.Vals.mean(Conditionally to column 'Cat'))
Edit: I found this, which is one line shorter, but I'm still not happy with it:
means = df.groupby('Cat').Vals.mean()
df.Vals = df.apply(lambda x: means[x.Cat] if pd.isnull(x.Vals) else x.Vals, axis=1)
We wish to "associate" the Cat values with the missing NaN locations.
In Pandas such associations are always done via the index.
So it is natural to set Cat as the index:
df = df.set_index(['Cat'])
Once this is done, then fillna works as desired:
df['Vals'] = df['Vals'].fillna(means)
To return Cat to a column, you could then of course use reset_index:
df = df.reset_index()
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'Cat' : ['A', 'A', 'A','B', 'B', 'A', 'B'],
'Vals' : [1, 2, 3, 4, 5, np.nan, np.nan]})
means = df.groupby(['Cat'])['Vals'].mean()
df = df.set_index(['Cat'])
df['Vals'] = df['Vals'].fillna(means)
df = df.reset_index()
print(df)
yields
Cat Vals
0 A 1.0
1 A 2.0
2 A 3.0
3 B 4.0
4 B 5.0
5 A 2.0
6 B 4.5

Categories