Get the sum of absolutes of columns for a dataframe - python

If I have a dataframe and I want to sum the values of the columns I could do something like
import pandas as pd
studentdetails = {
    "studentname": ["Ram","Sam","Scott","Ann","John","Bobo"],
    "mathantics": [80,90,85,70,95,100],
    "science": [85,95,80,90,75,100],
    "english": [90,85,80,70,95,100]
}
index_labels = ['r1','r2','r3','r4','r5','r6']
df = pd.DataFrame(studentdetails, index=index_labels)
print(df)
df3 = df.sum()
print(df3)
col_list= ['studentname', 'mathantics', 'science']
print( df[col_list].sum())
How can I do something similar, but get the sum of absolute values of some columns instead of the plain sum (which in this particular case would be the same)?
I tried abs in several ways but it did not work.
Edit:
studentname mathantics science english
r1 Ram 80 85 90
r2 Sam 90 95 -85
r3 Scott -85 80 80
r4 Ann 70 90 70
r5 John 95 -75 95
r6 Bobo 100 100 100
Expected output
mathantics 520
science 525
english 520
Edit2:
The col_list cannot include string-valued columns.

You need numeric columns for absolute values:
# drop the string column by name
col_list = df.columns.difference(['studentname'])
df[col_list].abs().sum()
# or move it into the index
df.set_index('studentname').abs().sum()
# or keep only numeric dtypes (needs import numpy as np)
df.select_dtypes(np.number).abs().sum()
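For the edited data with negatives, all three variants give the expected output (a quick self-contained check, reusing the sample above with the signs from the edit):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "studentname": ["Ram","Sam","Scott","Ann","John","Bobo"],
    "mathantics": [80,90,-85,70,95,100],
    "science": [85,95,80,90,-75,100],
    "english": [90,-85,80,70,95,100]
}, index=['r1','r2','r3','r4','r5','r6'])

print(df.select_dtypes(np.number).abs().sum())
# mathantics    520
# science       525
# english       520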

Related

Get the column name for rows containing a value

I have a .csv that looks like the below. I was wondering what the best way would be to keep the first few columns (id, account_id, date, amount, payments) intact while creating a new column containing the column name for observations marked with an 'X'.
The first 10 rows of the csv look like:
id,account_id,date,amount,payments,24_A,12_B,12_A,60_D,48_C,36_D,36_C,12_C,48_A,24_C,60_C,24_B,48_D,24_D,48_B,36_A,36_B,60_B,12_D,60_A
4959,2,1994-01-05,80952,3373,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4961,19,1996-04-29,30276,2523,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4962,25,1997-12-08,30276,2523,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4967,37,1998-10-14,318480,5308,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4968,38,1998-04-19,110736,2307,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4973,67,1996-05-02,165960,6915,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4986,97,1997-08-10,102876,8573,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4988,103,1997-12-06,265320,7370,-,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4989,105,1998-12-05,352704,7348,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4990,110,1997-09-08,162576,4516,-,-,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-
There used to be a DataFrame.lookup method for this, but it has been deprecated in favor of melt + .loc[].
The idea is to use the id columns as id_vars, so that all the other columns get melted into a single column alongside their respective values. Then filter to the rows where that value is 'X', effectively dropping the other rows.
import pandas as pd

df = pd.read_csv('test.txt')
# melt the marker columns into (x_col, value) pairs, keeping the id columns
df = df.melt(id_vars=['id','account_id','date','amount','payments'], var_name='x_col')
# keep only the marked rows, then drop the helper column
df = df.loc[df['value']=='X'].drop(columns='value')
print(df)
Output
id account_id date amount payments x_col
0 4959 2 1994-01-05 80952 3373 24_A
5 4973 67 1996-05-02 165960 6915 24_A
11 4961 19 1996-04-29 30276 2523 12_B
22 4962 25 1997-12-08 30276 2523 12_A
26 4986 97 1997-08-10 102876 8573 12_A
33 4967 37 1998-10-14 318480 5308 60_D
44 4968 38 1998-04-19 110736 2307 48_C
48 4989 105 1998-12-05 352704 7348 48_C
57 4988 103 1997-12-06 265320 7370 36_D
69 4990 110 1997-09-08 162576 4516 36_C
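If you want to try this without a test.txt on disk, the same steps run on an in-memory copy of the data (a sketch using io.StringIO and a truncated subset of the sample columns):
import io
import pandas as pd

# a truncated subset of the sample, standing in for test.txt
csv_data = """id,account_id,date,amount,payments,24_A,12_B,12_A
4959,2,1994-01-05,80952,3373,X,-,-
4961,19,1996-04-29,30276,2523,-,X,-
4962,25,1997-12-08,30276,2523,-,-,X"""

df = pd.read_csv(io.StringIO(csv_data))
df = df.melt(id_vars=['id','account_id','date','amount','payments'], var_name='x_col')
print(df.loc[df['value']=='X'].drop(columns='value'))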

Check if a row in one DataFrame exist in another, BASED ON SPECIFIC COLUMNS ONLY

I have two Pandas DataFrames with a different number of columns.
df1 is a single-row DataFrame:
a X0 b Y0 c
0 233 100 56 shark -23
df2, instead, is a multi-row DataFrame:
d X0 e f Y0 g h
0 snow 201 32 36 cat 58 336
1 rain 176 99 15 tiger 63 845
2 sun 193 81 42 dog 48 557
3 storm 100 74 18 shark 39 673 # <-- This row
4 cloud 214 56 27 wolf 66 406
I would like to verify whether df1's row is in df2, but considering the X0 AND Y0 columns only, ignoring all other columns.
In this example df1's row matches df2's row at index 3, which has 100 in X0 and 'shark' in Y0.
The output for this example is:
True
Note: True/False as output is enough for me, I don't care about index of matched row.
I found similar questions, but all of them check the entire row...
Use df.merge, which defaults to an inner join on the shared columns (X0 and Y0 here), with an if condition check on len:
In [219]: if len(df1[['X0', 'Y0']].merge(df2)):
...: print(True)
...:
True
OR:
In [225]: not (df1[['X0', 'Y0']].merge(df2)).empty
Out[225]: True
Try this (isin checks each column independently, which is fine here because df1 has a single row; with several rows in df1 it could pair X0 and Y0 values from different rows):
df2[(df2.X0.isin(df1.X0)) & (df2.Y0.isin(df1.Y0))]
Output:
d X0 e f Y0 g h
3 storm 100 74 18 shark 39 673
duplicated
df2.append(df1).duplicated(['X0', 'Y0']).iat[-1]
True
Save a tad bit of time
df2[['X0', 'Y0']].append(df1[['X0', 'Y0']]).duplicated().iat[-1]
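Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same duplicated trick is spelled with pd.concat:
import pandas as pd

# same idea as above, with concat instead of the removed append
pd.concat([df2[['X0', 'Y0']], df1[['X0', 'Y0']]]).duplicated().iat[-1]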

Use lambda with pandas to calculate a new column conditional on existing column

I need to create a new column in a pandas DataFrame which is calculated as the ratio of 2 existing columns in the DataFrame. However, the denominator in the ratio calculation will change based on the value of a string which is found in another column in the DataFrame.
Example sample dataset:
import pandas as pd

df = pd.DataFrame(data={'hand': ['left','left','both','both'],
                        'exp_force': [25,28,82,84],
                        'left_max': [38,38,38,38],
                        'both_max': [90,90,90,90]})
I need to create a new DataFrame column df['ratio'] based on the condition of df['hand'].
If df['hand']=='left' then df['ratio'] = df['exp_force'] / df['left_max']
If df['hand']=='both' then df['ratio'] = df['exp_force'] / df['both_max']
You can use np.where():
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'hand': ['left','left','both','both'],
                        'exp_force': [25,28,82,84],
                        'left_max': [38,38,38,38],
                        'both_max': [90,90,90,90]})
df['ratio'] = np.where(df['hand']=='left',
                       df['exp_force'] / df['left_max'],
                       df['exp_force'] / df['both_max'])
df
Out[42]:
hand exp_force left_max both_max ratio
0 left 25 38 90 0.657895
1 left 28 38 90 0.736842
2 both 82 38 90 0.911111
3 both 84 38 90 0.933333
Alternatively, in a real-life scenario with lots of conditions and results, you can use np.select() instead of repeating nested np.where() statements. It's better to use np.select in these situations:
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'hand': ['left','left','both','both'],
                        'exp_force': [25,28,82,84],
                        'left_max': [38,38,38,38],
                        'both_max': [90,90,90,90]})
c1 = (df['hand']=='left')
c2 = (df['hand']=='both')
r1 = df['exp_force'] / df['left_max']
r2 = df['exp_force'] / df['both_max']
conditions = [c1, c2]
results = [r1, r2]
df['ratio'] = np.select(conditions, results)
df
Out[430]:
hand exp_force left_max both_max ratio
0 left 25 38 90 0.657895
1 left 28 38 90 0.736842
2 both 82 38 90 0.911111
3 both 84 38 90 0.933333
Enumerate
# assumes the default RangeIndex, so the positional i matches the row label for .at
for i, e in enumerate(df['hand']):
    if e == 'left':
        df.at[i,'ratio'] = df.at[i,'exp_force'] / df.at[i,'left_max']
    if e == 'both':
        df.at[i,'ratio'] = df.at[i,'exp_force'] / df.at[i,'both_max']
df
Output:
hand exp_force left_max both_max ratio
0 left 25 38 90 0.657895
1 left 28 38 90 0.736842
2 both 82 38 90 0.911111
3 both 84 38 90 0.933333
You can use the apply() method of your DataFrame:
df['ratio'] = df.apply(
lambda x: x['exp_force'] / x['left_max'] if x['hand']=='left' else x['exp_force'] / x['both_max'],
axis=1
)
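A small variant not shown in the answers above: select the per-row denominator first with np.where, then divide once (a sketch, assuming the same df as before):
import numpy as np

# pick the denominator per row, then do a single vectorized division
denominator = np.where(df['hand'].eq('left'), df['left_max'], df['both_max'])
df['ratio'] = df['exp_force'] / denominator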

Pandas: Adding a new column with random numbers to DF based on multiple criteria from row

I am a beginner. I've looked all over and read a bunch of related questions but can't quite figure this out. I know I am the problem and that I'm missing something, but I'm hoping someone will be kind and help me out. I am attempting to convert data from one video game (a college basketball simulation) into data consistent with another video game's (pro basketball simulation) format.
I have a DF that has columns:
Name, Pos, Height, Weight, Shot, Points
With values such as:
Jon Smith, C, 84, 235, Exc, 19.4
Greg Jones, PG, 72, 187, Fair, 12.0
I want to create a new column for "InsideScoring". What I'd like to do is assign each player a randomly generated number within a certain range based on what position they played, height, weight, shot rating and points scored.
I tried a bunch of attempts like:
import numpy as np

df1['InsideScoring'] = 0
df1.loc[(df1.Pos == "C") &
        (df1.Height > 82) &
        (df1.Points > 19.0) &
        (df1.Weight > 229), 'InsideScoring'] = np.random.randint(85,100)
When I do this, all the players (rows in the "InsideScoring" column) that meet the criteria get assigned the same value between 85 and 100, rather than a random mix of numbers in that range.
Eventually, what I want to do is go through the list of players and based on those four criteria, assign values from different ranges. Any ideas appreciated.
Related questions I've already read:
Pandas: Create a new column with random values based on conditional
Numpy "where" with multiple conditions
My recommendation would be to use np.select here. You set up your conditions and your outputs, and you're good to go. However, to avoid iteration, and also to avoid assigning the same random value to every row that meets a condition, create random values equal to the length of your DataFrame and select from those:
Setup
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Chris', 'John'],
    'Height': [72, 84],
    'Pos': ['PG', 'C'],
    'Weight': [165, 235],
    'Shot': ['Amazing', 'Fair'],
    'Points': [999, 25]
})
Name Height Pos Weight Shot Points
0 Chris 72 PG 165 Amazing 999
1 John 84 C 235 Fair 25
Now set up your ranges and your conditions (create as many of these as you like):
cond1 = df.Pos.eq('C') & df.Height.gt(80) & df.Weight.gt(200)
cond2 = df.Pos.eq('PG') & df.Height.lt(80) & df.Weight.lt(200)
range1 = np.random.randint(85, 100, len(df))
range2 = np.random.randint(50, 85, len(df))
df.assign(InsideScoring=np.select([cond1, cond2], [range1, range2]))
Name Height Pos Weight Shot Points InsideScoring
0 Chris 72 PG 165 Amazing 999 72
1 John 84 C 235 Fair 25 89
Now to verify this doesn't assign the same value to every matching row:
df = pd.concat([df]*5)
... # Setup the ranges and conditions again
df.assign(InsideScoring=np.select([cond1, cond2], [range1, range2]))
Name Height Pos Weight Shot Points InsideScoring
0 Chris 72 PG 165 Amazing 999 56
1 John 84 C 235 Fair 25 96
0 Chris 72 PG 165 Amazing 999 74
1 John 84 C 235 Fair 25 93
0 Chris 72 PG 165 Amazing 999 63
1 John 84 C 235 Fair 25 97
0 Chris 72 PG 165 Amazing 999 55
1 John 84 C 235 Fair 25 95
0 Chris 72 PG 165 Amazing 999 60
1 John 84 C 235 Fair 25 90
And we can see that random values are assigned, even though they all match one of two conditions. While this is less memory efficient than iterating and picking a random value, since we are creating a lot of unused numbers, it will still be faster as these are vectorized operations.
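One caveat worth adding (not from the original answer): rows that match none of the conditions get np.select's default of 0, so either make the conditions exhaustive or pass an explicit default:
import numpy as np

# rows matching neither condition receive the default instead of 0
df = df.assign(InsideScoring=np.select([cond1, cond2], [range1, range2], default=-1))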

filter pandas dataframe based on another column

this might be a basic question, but I have not been able to find a solution. I have two dataframes with identical rows and columns, called Volumes and Prices, which look like this
Volumes
Index ProductA ProductB ProductC ProductD Limit
0 100 300 400 78 100
1 110 370 20 30 100
2 90 320 200 121 100
3 150 320 410 99 100
....
Prices
Index ProductA ProductB ProductC ProductD Limit
0 50 110 30 90 0
1 51 110 29 99 0
2 49 120 25 88 0
3 51 110 22 96 0
....
I want to assign 0 to the "cells" of the Prices dataframe which correspond to Volumes less than the value in the Limit column
so, the ideal output would be
Prices
Index ProductA ProductB ProductC ProductD Limit
0 50 110 30 0 0
1 51 110 0 0 0
2 0 120 25 88 0
3 51 110 22 0 0
....
I tried
import pandas as pd
import numpy as np

d_price = {'ProductA': [50, 51, 49, 51], 'ProductB': [110,110,120,110],
           'ProductC': [30,29,25,22], 'ProductD': [90,99,88,96], 'Limit': [0]*4}
d_volume = {'ProductA': [100,110,90,150], 'ProductB': [300,370,320,320],
            'ProductC': [400,20,200,410], 'ProductD': [78,30,121,99], 'Limit': [100]*4}
Prices = pd.DataFrame(d_price)
Volumes = pd.DataFrame(d_volume)
Prices[Volumes > Volumes.Limit] = 0
but I do not obtain any changes to the Prices dataframe... obviously I'm having a hard time understanding boolean slicing, any help would be great
The problem is in
Prices[Volumes > Volumes.Limit] = 0
Comparing a DataFrame with a Series aligns the Series index against the DataFrame's columns, not its rows, so the mask comes out empty and nothing changes. Also, to match your expected output you want to zero prices where the volume is below the limit, not above it. You can build the mask row by row with apply:
Prices[Volumes.apply(lambda x: x < x.Limit, axis=1)] = 0
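The same row-wise comparison can be written without apply, since the comparison methods take an axis argument (a vectorized equivalent of the lambda above):
# compare every column against the Limit column, aligned along rows
Prices[Volumes.lt(Volumes['Limit'], axis=0)] = 0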
you can use mask-style boolean indexing to solve this problem. I am not an expert either, but this solution does what you want to do (note that .ix has been removed from pandas; .ge with axis=0 compares each row against its own Limit):
test = Volumes.loc[:, 'ProductA':'ProductD'].ge(Volumes['Limit'], axis=0)
final = Prices[test].fillna(0)
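For a one-liner, pandas also has DataFrame.mask, which replaces values wherever the condition is True (a sketch using the same frames, not from the original answers):
# zero out prices wherever the volume falls below that row's limit
final = Prices.mask(Volumes.lt(Volumes['Limit'], axis=0), 0)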
