I have a DataFrame (df) with a MultiIndex consisting of three levels, say 'A', 'B', and 'C', and I have a column called Quantity containing floats.
What I would like to do is perform a groupby on 'A' and 'B', summing the values in Quantity. How would I do this? The standard approach does not work because pandas does not recognize the index levels as columns, and if I use something like
df.groupby(level=0).sum()
it seems I can only select a single level. How would one go about this?
You can specify multiple levels like:
df.groupby(level=[0, 1]).sum()
#alternative
df.groupby(level=['A','B']).sum()
Or pass the level parameter to sum:
df.sum(level=[0, 1])
#alternative
df.sum(level=['A','B'])
Sample:
import pandas as pd

df = pd.DataFrame({'A':[1,1,2,2,3],
                   'B':[3] * 5,
                   'C':[3,4,5,4,5],
                   'Quantity':[1.0,3,4,5,6]}).set_index(['A','B','C'])
print (df)
Quantity
A B C
1 3 3 1.0
4 3.0
2 3 5 4.0
4 5.0
3 3 5 6.0
df1 = df.groupby(level=[0, 1]).sum()
print (df1)
Quantity
A B
1 3 4.0
2 3 9.0
3 3 6.0
df1 = df.groupby(level=['A','B']).sum()
print (df1)
Quantity
A B
1 3 4.0
2 3 9.0
3 3 6.0
df1 = df.sum(level=[0, 1])
print (df1)
Quantity
A B
1 3 4.0
2 3 9.0
3 3 6.0
df1 = df.sum(level=['A','B'])
print (df1)
Quantity
A B
1 3 4.0
2 3 9.0
3 3 6.0
Trying to groupby in pandas, then sort values, and have a result column show what you need to add to get to the next row in the group; if you are at the end of the group, the value should be replaced with the number 3. Anyone have an idea how to do it?
import pandas as pd
df = pd.DataFrame({'label': 'a a b c b c'.split(), 'Val': [2,6,6, 4,16, 8]})
df
label Val
0 a 2
1 a 6
2 b 6
3 c 4
4 b 16
5 c 8
I'd like the results as shown below: you have to add 4 to 2 to get 6, so within each group the values are sorted. If there is no next value in the group, NaN would be produced, and I want to replace it with the value 3. I have shown below what the results should look like:
label Val Results
0 a 2 4.0
1 a 6 3.0
2 b 6 10.0
3 c 4 4.0
4 b 16 3.0
5 c 8 3.0
I tried this, thinking of shifting values up, but the problem is that the labels aren't sorted.
df['Results'] = df.groupby('label').apply(lambda x: x - x.shift())
df
label Val Results
0 a 2 NaN
1 a 6 4.0
2 b 6 NaN
3 c 4 NaN
4 b 16 10.0
5 c 8 4.0
Hope someone can help:D!
Use groupby, diff and abs:
df['Results'] = abs(df.groupby('label')['Val'].diff(-1)).fillna(3)
label Val Results
0 a 2 4.0
1 a 6 3.0
2 b 6 10.0
3 c 4 4.0
4 b 16 3.0
5 c 8 3.0
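An equivalent sketch without abs, using groupby and shift(-1) to take the next value within each group and subtract the current one directly:

```python
import pandas as pd

df = pd.DataFrame({'label': 'a a b c b c'.split(),
                   'Val': [2, 6, 6, 4, 16, 8]})

# next value within each group minus the current value;
# the last row of each group has no next value, so fill with 3
df['Results'] = (df.groupby('label')['Val'].shift(-1) - df['Val']).fillna(3)
print(df['Results'].tolist())  # [4.0, 3.0, 10.0, 4.0, 3.0, 3.0]
```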
I am trying to sort each row of a DataFrame element wise.
Input:
A B C
0 10 5 6
1 3 6 5
2 1 2 3
Output:
A B C
0 10 6 5
1 6 5 3
2 3 2 1
It feels like this should be easy, but I've been failing for a while... Very much a beginner in Python.
Use np.sort and reverse the order with indexing:
import numpy as np

df1 = pd.DataFrame(np.sort(df.to_numpy(), axis=1)[:, ::-1],
                   index=df.index,
                   columns=df.columns)
print (df1)
A B C
0 10 6 5
1 6 5 3
2 3 2 1
A pandas solution, slower, is to apply sorting to each row separately, convert to an array and then to a Series:
f = lambda x: pd.Series(x.sort_values(ascending=False).to_numpy(), index=df.columns)
df1 = df.apply(f, axis=1)
print (df1)
A B C
0 10 6 5
1 6 5 3
2 3 2 1
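A shorter pandas variant, sketched here, uses apply with result_type='broadcast' so the sorted values are written back under the original index and columns:

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 3, 1], 'B': [5, 6, 2], 'C': [6, 5, 3]})

# sort each row descending; 'broadcast' keeps the original shape and labels
df1 = df.apply(lambda r: r.sort_values(ascending=False).to_numpy(),
               axis=1, result_type='broadcast')
print(df1.to_numpy().tolist())  # [[10, 6, 5], [6, 5, 3], [3, 2, 1]]
```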
If there can be missing values, the NumPy solution still works for me (np.sort places NaN last, so the reversed order puts them first):
print (df)
A B C
0 10.0 6.0 5.0
1 5.0 3.0 NaN
2 2.0 1.0 NaN
df1 = pd.DataFrame(np.sort(df.to_numpy(), axis=1)[:, ::-1],
index=df.index,
columns=df.columns)
print (df1)
A B C
0 10.0 6.0 5.0
1 NaN 5.0 3.0
2 NaN 2.0 1.0
In pandas it is possible to use the na_position parameter to specify where the missing values go:
f = lambda x: pd.Series(x.sort_values(ascending=False, na_position='first').to_numpy(),
index=df.columns)
df1 = df.apply(f, axis=1)
print (df1)
A B C
0 10.0 6.0 5.0
1 NaN 5.0 3.0
2 NaN 2.0 1.0
f = lambda x: pd.Series(x.sort_values(ascending=False, na_position='last').to_numpy(),
index=df.columns)
df1 = df.apply(f, axis=1)
print (df1)
A B C
0 10.0 6.0 5.0
1 5.0 3.0 NaN
2 2.0 1.0 NaN
I need to interpolate the NaN values over a Dataframe but I want that interpolation to get the first values of the DataFrame in case the NaN value is the last value. Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({"a": [1,2,3], "b":[1,2,np.nan]})
So the DataFrame is:
a b
0 1 1.0
1 2 2.0
2 3 NaN
But when I interpolate the nan values like:
df.interpolate(method="linear", inplace=True)
I got:
a b
0 1 1.0
1 2 2.0
2 3 2.0
The interpolation doesn't wrap around to the first value. My desired output would be a fill value of 1.5, because of that circular interpolation.
One possible solution is to append the first row, interpolate, and then drop the added row (note that DataFrame.append was removed in pandas 2.0):
df = df.append(df.iloc[0]).interpolate(method="linear").iloc[:-1]
print (df)
a b
0 1.0 1.0
1 2.0 2.0
2 3.0 1.5
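On pandas >= 2.0, where DataFrame.append no longer exists, the same idea can be sketched with pd.concat:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, np.nan]})

# append a copy of the first row, interpolate linearly, then drop the helper row
out = (pd.concat([df, df.iloc[[0]]])
         .interpolate(method="linear")
         .iloc[:-1])
print(out["b"].tolist())  # [1.0, 2.0, 1.5]
```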
EDIT:
More general solution:
df = pd.DataFrame.from_dict({"a": [1,2,3,4], "b":[np.nan,1,2,np.nan]})
df = pd.concat([df] * 3).interpolate(method="linear").iloc[len(df):-len(df)]
print (df)
a b
0 1 1.333333
1 2 1.000000
2 3 2.000000
3 4 1.666667
Or, if you need to work only with the last and first non-missing values:
df = pd.DataFrame.from_dict({"a": [1,2,3,4], "b":[np.nan,1,2,np.nan]})
df1 = df.ffill().iloc[[-1]]
df2 = df.bfill().iloc[[0]]
df = pd.concat([df1, df, df2]).interpolate(method="linear").iloc[1:-1]
print (df)
a b
0 1 1.5
1 2 1.0
2 3 2.0
3 4 1.5
I need to create a column that computes the difference between another column's elements:
Column A Computed Column
10 blank # nothing to compute for first record
9 1 # = 10-9
7 2 # = 9-7
4 3 # = 7-4
I am assuming this is a lambda function, but I am not sure how to reference the elements in 'Column A'.
Any help/direction you can provide would be great- thanks!
You can do it by shifting the column.
import pandas as pd
dict1 = {'A': [10,9,7,4]}
df = pd.DataFrame.from_dict(dict1)
df['Computed'] = df['A'].shift() - df['A']
print(df)
giving
A Computed
0 10 NaN
1 9 1.0
2 7 2.0
3 4 3.0
EDIT: OP extended the requirement to multiple columns.
dict1 = {'A': [10,9,7,4], 'B': [10,9,7,4], 'C': [10,9,7,4]}
df = pd.DataFrame.from_dict(dict1)
columns_to_update = ['A', 'B']
for col in columns_to_update:
    df['Computed' + col] = df[col].shift() - df[col]
print(df)
By using the columns_to_update, you can choose the columns you want.
A B C ComputedA ComputedB
0 10 10 10 NaN NaN
1 9 9 9 1.0 1.0
2 7 7 7 2.0 2.0
3 4 4 4 3.0 3.0
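The loop can also be sketched as a single vectorized expression over the chosen columns, with add_prefix naming the new columns:

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 9, 7, 4], 'B': [10, 9, 7, 4], 'C': [10, 9, 7, 4]})
cols = ['A', 'B']

# previous minus current for all chosen columns at once
df = df.join((df[cols].shift() - df[cols]).add_prefix('Computed'))
print(df['ComputedA'].tolist())  # [nan, 1.0, 2.0, 3.0]
```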
Use diff.
df = pd.DataFrame(data=[10,9,7,4], columns=['A'])
df['B'] = df.A.diff(-1).shift(1)
Output:
df
Out[140]:
A B
0 10 NaN
1 9 1.0
2 7 2.0
3 4 3.0
I would just do:
df = pd.DataFrame(data=[10,9,7,4], columns=['A'])
df['B'] = abs(df['A'].diff())
The reason for abs() is that diff() computes current minus previous, whereas you want previous minus current. diff() is already built into the Series class, so wrapping it in abs() gives the correct magnitude either way round.
To demonstrate:
import pandas as pd
df = pd.DataFrame(data=[10,9,7,4], columns=['A'])
df['B'] = abs(df['A'].diff())
>>> df
# Output
A B
0 10 NaN
1 9 1.0
2 7 2.0
3 4 3.0
df2 = pd.DataFrame(data=[10,4,7,9], columns=['A'])
df2['B'] = abs(df2['A'].diff())
>>> df2
# Output
A B
0 10 NaN
1 4 6.0
2 7 3.0
3 9 2.0
To still outperform #cosmic_inquiry's solution, multiply by -1 instead of taking the absolute value; this also preserves the sign of the difference:
import pandas as pd
df = pd.DataFrame(data=[10,9,7,4], columns=['A'])
df2 = pd.DataFrame(data=[10,4,7,9], columns=['A'])
df['B'] = df['A'].diff() * -1
df2['B'] = df2['A'].diff() * -1
>>> df
# Output:
A B
0 10 NaN
1 9 1.0
2 7 2.0
3 4 3.0
>>> df2
# Output:
A B
0 10 NaN
1 4 6.0
2 7 -3.0
3 9 -2.0
I'm working with a huge dataframe in python and sometimes I need to add an empty row or several rows in a definite position to dataframe. For this question I create a small dataframe df in order to show, what I want to achieve.
df = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=['A', 'B', 'C'])
   A  B  C
0  4  5  2
1  6  7  0
2  8  1  9
Let's say I need to add an empty row, if I have a zero-value in the column 'C'. Here the empty row should be added after the second row. So at the end I want to have a new dataframe like:
new_df
     A    B    C
0    4    5    2
1    6    7    0
2  NaN  NaN  NaN
3    8    1    9
I tried with concat and append, but I didn't get what I want to. Could you help me please?
You can try it this way:
l = df[df['C'] == 0].index.tolist()
for c, i in enumerate(l):
    dfs = np.split(df, [i + 1 + c])
    df = pd.concat([dfs[0], pd.DataFrame([[np.nan, np.nan, np.nan]], columns=df.columns), dfs[1]], ignore_index=True)
print(df)
Input:
A B C
0 4 3 0
1 4 0 4
2 4 4 2
3 3 2 1
4 3 1 2
5 4 1 4
6 1 0 4
7 0 2 0
8 2 0 3
9 4 1 3
Output:
A B C
0 4.0 3.0 0.0
1 NaN NaN NaN
2 4.0 0.0 4.0
3 4.0 4.0 2.0
4 3.0 2.0 1.0
5 3.0 1.0 2.0
6 4.0 1.0 4.0
7 1.0 0.0 4.0
8 0.0 2.0 0.0
9 NaN NaN NaN
10 2.0 0.0 3.0
11 4.0 1.0 3.0
One last thing: it can happen that the last row has 0 in 'C', so you can add:
if df["C"].iloc[-1] == 0:
    df.loc[len(df)] = [np.nan, np.nan, np.nan]
Try using slicing.
First, you need to find the rows where C == 0, so let's create a boolean mask for this. I'll just name it 'a':
a = (df['C'] == 0)
So, whenever C == 0, a == True.
Now we need to find the index of each row where C == 0, create an empty row and insert it into the df:
df2 = df.copy()  # make a copy because we want to be safe here
for i in df.loc[a].index:
    empty_row = pd.DataFrame([], index=[i])  # creating the empty data
    j = i + 1  # just to make things easier to read
    df2 = pd.concat([df2.loc[:i], empty_row, df2.loc[j:]])  # slicing the df (.ix was removed in pandas 1.0; use .loc)
df2 = df2.reset_index(drop=True)  # reset the index
I must say... I don't know the size of your df and if this is fast enough, but give it a try
In case you know the index where you want to insert a new row, concat can be a solution.
Example dataframe:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# A B C
# 0 1 4 7
# 1 2 5 8
# 2 3 6 9
Your new row as a dataframe with index 1:
new_row = pd.DataFrame({'A': np.nan, 'B': np.nan,'C': np.nan}, index=[1])
Inserting your new row after the second row:
new_df = pd.concat([df.loc[:1], new_row, df.loc[2:]]).reset_index(drop=True)
# A B C
# 0 1.0 4.0 7.0
# 1 2.0 5.0 8.0
# 2 NaN NaN NaN
# 3 3.0 6.0 9.0
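Another sketch for this pattern: give each blank row a fractional index just after the matching row, then sort the index (this assumes the default RangeIndex):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [4, 6, 8], 'B': [5, 7, 1], 'C': [2, 0, 9]})

# one all-NaN row per zero in 'C', indexed just after the matching row
blank = pd.DataFrame(np.nan, index=df.index[df['C'].eq(0)] + 0.5,
                     columns=df.columns)
new_df = pd.concat([df, blank]).sort_index().reset_index(drop=True)
print(new_df)
```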
Something like this should work for you. Note that assigning to df.loc[key + 1] would overwrite the next existing row rather than insert a new one, so collect the rows and rebuild instead:
rows = []
for key, row in df.iterrows():
    rows.append(row)
    if row['C'] == 0:
        rows.append(pd.Series(np.nan, index=df.columns))
df = pd.DataFrame(rows).reset_index(drop=True)