The following is my dataframe:
   a  b
0  1  3
1  2  4
2  3  5
3  4  6
4  5  7
5  6  8
6  7  9
I want to add a new column, call it sum, which holds the sum of its respective row values.
Expected output:
   a  b  sum
0  1  3    4
1  2  4    6
2  3  5    8
3  4  6   10
4  5  7   12
5  6  8   14
6  7  9   16
How can I achieve this using the pandas map, apply, and applymap functions?
My Code
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7],
    'b': [3, 4, 5, 6, 7, 8, 9]
})

def sum(df):
    return df['a'] + df['b']
# Methods I tried
df['sum'] = df.apply(sum(df))
df['sum']=df[['a',"b"]].map(sum)
df['sum'] = df.apply(lambda x: x['a'] + x['b'])
Note: this is just dummy code. The original code has a function that returns a different output for each individual row, so it is not as simple as applying a sum function. I therefore ask you to write a custom sum function and implement those methods, so that I can learn and apply the same approach to my code.
You can use the pandas sum method along axis=1, like below:
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7], "b": [3, 4, 5, 6, 7, 8, 9]})
df["sum"] = df.sum(axis=1)
print(df)
And if you have to use a lambda with apply, you can try:
import pandas as pd
def add(a, b):
    return a + b

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7],
    'b': [3, 4, 5, 6, 7, 8, 9]
})

df['sum'] = df.apply(lambda row: add(row['a'], row['b']), axis=1)
print(df)
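To tie this back to the attempts in the question: apply needs axis=1 to receive whole rows, and it expects a function object rather than the result of calling one, while map/applymap work element-wise and cannot see both columns at once. A minimal sketch with a custom row function (row_sum is just an illustrative name):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [3, 4, 5, 6, 7, 8, 9]})

# custom row-wise function: with axis=1, apply passes each row in as a Series
def row_sum(row):
    return row['a'] + row['b']

df['sum'] = df.apply(row_sum, axis=1)  # pass the function itself, not row_sum(df)
print(df)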
I created a list as the mean of two other columns; the length of the list is the same as the number of rows in the dataframe. But when I try to add that list as a column to the dataframe, the entire list gets assigned to each row instead of only the corresponding value of the list.
glucose_mean = []
for i in range(len(df)):
    mean = (df['h1_glucose_max'] + df['h1_glucose_min']) / 2
    glucose_mean.append(mean)
df['glucose'] = glucose_mean
data after adding list
I think you overcomplicated it. You don't need a for-loop, only one line:
df['glucose'] = (df['h1_glucose_max'] + df['h1_glucose_min']) / 2
EDIT:
If you want to work with every row separately, then you can use .apply():
def func(row):
    return (row['h1_glucose_max'] + row['h1_glucose_min']) / 2

df['glucose'] = df.apply(func, axis=1)
And if you really need to use a for-loop, then you can use .iterrows() (or similar functions):
glucose_mean = []
for index, row in df.iterrows():
    mean = (row['h1_glucose_max'] + row['h1_glucose_min']) / 2
    glucose_mean.append(mean)
df['glucose'] = glucose_mean
Minimal working example:
import pandas as pd

data = {
    'h1_glucose_min': [1, 2, 3],
    'h1_glucose_max': [4, 5, 6],
}
df = pd.DataFrame(data)

# - version 1 -
df['glucose_1'] = (df['h1_glucose_max'] + df['h1_glucose_min']) / 2

# - version 2 -
def func(row):
    return (row['h1_glucose_max'] + row['h1_glucose_min']) / 2

df['glucose_2'] = df.apply(func, axis=1)

# - version 3 -
glucose_mean = []
for index, row in df.iterrows():
    mean = (row['h1_glucose_max'] + row['h1_glucose_min']) / 2
    glucose_mean.append(mean)
df['glucose_3'] = glucose_mean

print(df)
You do not need to iterate over your frame. Use this instead (example for a pseudo data frame):
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8], 'col2': [10, 9, 8, 7, 6, 5, 4, 100]})
df['mean_col1_col2'] = df[['col1', 'col2']].mean(axis=1)
df
-----------------------------------
   col1  col2  mean_col1_col2
0     1    10             5.5
1     2     9             5.5
2     3     8             5.5
3     4     7             5.5
4     5     6             5.5
5     6     5             5.5
6     7     4             5.5
7     8   100            54.0
-----------------------------------
As you can see in the following example, your code appends an entire column each time the for-loop executes, so when you assign the glucose_mean list as a column, each element is a whole column instead of a single value:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [2, 3, 4, 5]})
glucose_mean = []
for i in range(len(df)):
    glucose_mean.append(df['col1'])

print(glucose_mean[0])
df['col2'] = [5, 6, 7, 8]
print(df)
Output:
0    1
1    2
2    3
3    4
Name: col1, dtype: int64
   col1  col2
0     1     5
1     2     6
2     3     7
3     4     8
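If you really want to keep a loop like the one in the question, a minimal fix (assuming the frame has the h1_glucose_min and h1_glucose_max columns) is to append one scalar per row instead of whole columns:
glucose_mean = []
for i in range(len(df)):
    # take the i-th values only, not the whole columns
    row_mean = (df['h1_glucose_max'].iloc[i] + df['h1_glucose_min'].iloc[i]) / 2
    glucose_mean.append(row_mean)
df['glucose'] = glucose_mean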
How can I create a Pandas DataFrame that shows the relative position of each value, when those values are sorted from low to high for each column?
So in this case, how can you transform 'df' into 'dfOut'?
import pandas as pd
import numpy as np
#create DataFrame
df = pd.DataFrame({'A': [12, 18, 9, 21, 24, 15],
'B': [18, 22, 19, 14, 14, 11],
'C': [5, 7, 7, 9, 12, 9]})
# How to assign a value to the order in the column, when sorted from low to high?
dfOut = pd.DataFrame({'A': [2, 4, 1, 5, 6, 3],
'B': [3, 5, 4, 2, 2, 1],
'C': [1, 2, 2, 3, 4, 3]})
If you need to map the same values to the same output, try using the rank method of a DataFrame. Like this:
>>> dfOut = df.rank(method="dense").astype(int)  # type transformation added to match your output
>>> dfOut
   A  B  C
0  2  3  1
1  4  5  2
2  1  4  2
3  5  2  3
4  6  2  4
5  3  1  3
The rank method computes the rank for each column following a specific criterion. According to the Pandas documentation, the "dense" method ensures that "rank always increases by 1 between groups", and that might match your use case.
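For illustration, a small sketch of how method="dense" differs from the default method="average" on the tied column C from the question:
import pandas as pd

s = pd.Series([5, 7, 7, 9, 12, 9])  # column C from the question

# "dense": ties share a rank and the next distinct value is exactly one rank higher
print(s.rank(method='dense').astype(int).tolist())  # [1, 2, 2, 3, 4, 3]

# default "average": ties get the mean of the positions they occupy
print(s.rank().tolist())  # [1.0, 2.5, 2.5, 4.5, 6.0, 4.5]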
Original answer: In case repeated numbers are not required to map to the same output value, np.argsort could be applied to each column to retrieve the positions that would sort the column. Combine this with the apply method of a DataFrame to apply the function to each column, and you have this:
>>> dfOut = df.apply(lambda column: np.argsort(column.values))
>>> dfOut
   A  B  C
0  2  5  0
1  0  3  1
2  5  4  2
3  1  0  3
4  3  2  5
5  4  1  4
Here is my attempt using some functions:
def sorted_idx(l, num):
    x = sorted(list(set(l)))
    for i in range(len(x)):
        if x[i] == num:
            return i + 1

def output_list(l):
    ret = [sorted_idx(l, elem) for elem in l]
    return ret

dfOut = df.apply(lambda column: output_list(column))
print(dfOut)
I reduce the original list to unique values and then sort it. Finally, I return index+1 at the position where each element of the original list matches this unique, sorted list, which gives the values in your expected output.
Output:
   A  B  C
0  2  3  1
1  4  5  2
2  1  4  2
3  5  2  3
4  6  2  4
5  3  1  3
I'm struggling to figure out how to do a couple of transformations with pandas. I want a new dataframe with the sum of the values from the columns in the original. I also want to be able to merge two of these 'summed' dataframes.
Example #1: Summing the columns
Before:
A B C D
1 4 7 0
2 5 8 1
3 6 9 2
After:
A B C D
6 15 24 3
Right now I'm getting the sums of the columns I'm interested in, storing them in a dictionary, and creating a dataframe from the dictionary. I feel like there is a better way to do this with pandas that I'm not seeing.
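A sketch of what such a dictionary-based approach might look like (illustrative, not the actual code from the question):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [7, 8, 9], 'D': [0, 1, 2]})

# sum each column of interest into a dict, then build a one-row frame from it
sums = {col: df[col].sum() for col in ['A', 'B', 'C', 'D']}
summed = pd.DataFrame([sums])
print(summed)
#    A   B   C  D
# 0  6  15  24  3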
Example #2: merging 'summed' dataframes
Before:
A B C D F
6 15 24 3 1
A B C D E
1 2 3 4 2
After:
A B C D E F
7 17 27 7 2 1
First question:
Summing the columns
Use sum, then convert the Series to a DataFrame and transpose:
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                    'C': [7, 8, 9], 'D': [0, 1, 2]})
df1 = df1.sum().to_frame().T
print(df1)
# Output:
   A   B   C  D
0  6  15  24  3
Second question:
Merging 'summed' dataframes
Use combine
df2 = pd.DataFrame({'A': [1], 'B': [2], 'C': [3], 'D': [4], 'E': [2]})
out = df1.combine(df2, sum, fill_value=0)
print(out)
# Output:
   A   B   C  D  E
0  7  17  27  7  2
For the first part, use DataFrame.sum() to sum the columns, then convert the Series to a dataframe with .to_frame() and finally transpose:
df_sum = df.sum().to_frame().T
Result:
print(df_sum)
   A   B   C  D
0  6  15  24  3
For the second part, use DataFrame.add() with the fill_value parameter, as follows:
df_sum2 = df1.add(df2, fill_value=0)
Result:
print(df_sum2)
   A   B   C  D    E    F
0  7  17  27  7  2.0  1.0
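A self-contained version of this second part, assuming the two one-row frames from the question (F only in the first, E only in the second):
import pandas as pd

df1 = pd.DataFrame({'A': [6], 'B': [15], 'C': [24], 'D': [3], 'F': [1]})
df2 = pd.DataFrame({'A': [1], 'B': [2], 'C': [3], 'D': [4], 'E': [2]})

# add() aligns on column labels; fill_value=0 treats a column that is missing
# from one frame as 0 (those filled columns come back as floats)
df_sum2 = df1.add(df2, fill_value=0)
print(df_sum2)
#    A   B   C  D    E    F
# 0  7  17  27  7  2.0  1.0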
What is the Pandas equivalent of top_n() in dplyr?
In R dplyr 0.8.5:
> df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 6))
> df %>% top_n(2, wt=x)
x
1 10
2 6
3 6
As the dplyr documentation highlights, note that we get more than 2 values here because there's a tie: top_n() either takes all rows with a value, or none.
My attempt in Pandas 1.0.1:
df = pd.DataFrame({'x': [10, 4, 1, 6, 3, 1, 6]})
df = df.sort_values('x', ascending=False)
df.groupby('x').head(2)
Result:
x
0 10
3 6
6 6
1 4
4 3
2 1
5 1
Expected results:
x
1 10
2 6
3 6
Use the parameter keep='all' in DataFrame.nlargest; sorting is not necessary here:
df = df.nlargest(2, 'x', keep='all')
print(df)
x
0 10
3 6
6 6
IIUC, try Series.nlargest with Series.isin:
df[df['x'].isin(df['x'].nlargest(2))]#.reset_index(drop=True)
x
0 10
3 6
6 6
top_n in dplyr is superseded by slice_max/slice_min. See:
https://dplyr.tidyverse.org/reference/top_n.html
With datar in Python, you can do it in a similar way:
>>> import pandas as pd
>>> from datar.all import f, slice_max
>>>
>>> df = pd.DataFrame({'x': [10, 4, 1, 6, 3, 1, 6]})
>>> df
x
<int64>
0 10
1 4
2 1
3 6
4 3
5 1
6 6
>>> df >> slice_max(n=3, order_by=f.x)
x
<int64>
0 10
3 6
6 6
Disclaimer: I am the author of the datar package.
So let us say I have a dataframe, created like this, which has 3 products A, B, C:
df = pd.DataFrame({'type' : ['A','A','B','B','C','C'], 'x' : [1,2,3,4,5,6]})
which, when printed, looks like below:
  type  x
0    A  1
1    A  2
2    B  3
3    B  4
4    C  5
5    C  6
Now I create a function called f, which returns a tuple:
def f(x):
    return x*2, x*3, x*4
And I apply this to the dataframe with a groupby on type:
df.groupby('type').apply(lambda x : f(x.x))
And now the result is a Series of three-array tuples, as below. But how do I merge it back into the dataframe correctly?
type
A ([2, 4], [3, 6], [4, 8])
B ([6, 8], [9, 12], [12, 16])
C ([10, 12], [15, 18], [20, 24])
dtype: object
What I want to see is
type x a b c
A 1 2 3 4
A 2 4 6 8
B 3 6 9 12
B 4 8 12 16
C 5 10 15 20
C 6 12 18 24
EDITED:
Please note that I gave the f function as a very simple example, which makes it look as if I could just create new columns with direct multiplication. But imagine a more complex function f that uses 3 columns and then generates tuples, where it is not straightforward column multiplication.
That is why I asked this question.
The real function in question is talib.BBANDS.
Assuming that in your real case the groupby is needed and your function takes several columns as input and returns several columns as output, your function could return a dataframe:
def f(x):
    return pd.DataFrame({'a': x*2, 'b': x*3, 'c': x*4}, index=x.index)

# then assign directly or use join
df[['a', 'b', 'c']] = df.groupby('type').apply(lambda x: f(x.x))
print(df)

  type  x   a   b   c
0    A  1   2   3   4
1    A  2   4   6   8
2    B  3   6   9  12
3    B  4   8  12  16
4    C  5  10  15  20
5    C  6  12  18  24
Edit: given the name of the function used, talib.BBANDS, I guess you can create a wrapper:
def f(x):
    upper, middle, lower = talib.BBANDS(x, ...)  # enter the parameters you need
    return pd.DataFrame({'upper': upper, 'middle': middle, 'lower': lower},
                        index=x.index)

df[['upper', 'middle', 'lower']] = df.groupby('type').apply(lambda x: f(x.x))
import pandas as pd

df = pd.DataFrame({'type': ['A', 'A', 'B', 'B', 'C', 'C'], 'x': [1, 2, 3, 4, 5, 6]})
newcol = df['x']**2
df['x**2'] = newcol
df
Output:
  type  x  x**2
0    A  1     1
1    A  2     4
2    B  3     9
3    B  4    16
4    C  5    25
5    C  6    36