I have a .csv that looks like the sample below. I was wondering what the best way would be to keep the first few columns (id, account_id, date, amount, payments) intact while creating a new column containing the name of the column marked with an 'X' in each row.
The first 10 rows of the csv look like:
id,account_id,date,amount,payments,24_A,12_B,12_A,60_D,48_C,36_D,36_C,12_C,48_A,24_C,60_C,24_B,48_D,24_D,48_B,36_A,36_B,60_B,12_D,60_A
4959,2,1994-01-05,80952,3373,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4961,19,1996-04-29,30276,2523,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4962,25,1997-12-08,30276,2523,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4967,37,1998-10-14,318480,5308,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4968,38,1998-04-19,110736,2307,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4973,67,1996-05-02,165960,6915,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4986,97,1997-08-10,102876,8573,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4988,103,1997-12-06,265320,7370,-,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4989,105,1998-12-05,352704,7348,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4990,110,1997-09-08,162576,4516,-,-,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-
There used to be DataFrame.lookup for this, but it has been deprecated in favor of melt + loc[].
The idea is to use the first five columns as id_vars, so every other column gets stacked into a single column alongside its value. Then filter to the rows where that value is 'X', effectively dropping the others.
import pandas as pd

df = pd.read_csv('test.txt')
# Stack every non-id column into 'x_col'/'value' pairs
df = df.melt(id_vars=['id', 'account_id', 'date', 'amount', 'payments'], var_name='x_col')
# Keep only the marked rows, then drop the helper column
df = df.loc[df['value'] == 'X'].drop(columns='value')
print(df)
Output
id account_id date amount payments x_col
0 4959 2 1994-01-05 80952 3373 24_A
5 4973 67 1996-05-02 165960 6915 24_A
11 4961 19 1996-04-29 30276 2523 12_B
22 4962 25 1997-12-08 30276 2523 12_A
26 4986 97 1997-08-10 102876 8573 12_A
33 4967 37 1998-10-14 318480 5308 60_D
44 4968 38 1998-04-19 110736 2307 48_C
48 4989 105 1998-12-05 352704 7348 48_C
57 4988 103 1997-12-06 265320 7370 36_D
69 4990 110 1997-09-08 162576 4516 36_C
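For reference, the same result can be had without melt by asking each row which column holds the 'X'. A sketch assuming the same test.txt and exactly one 'X' per row:

import pandas as pd

df = pd.read_csv('test.txt')
id_cols = ['id', 'account_id', 'date', 'amount', 'payments']
# idxmax on a boolean frame returns the first column that is True in each row;
# a row with no 'X' at all would wrongly report the first column
out = df.set_index(id_cols).eq('X').idxmax(axis=1).rename('x_col').reset_index()
print(out)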
I need to create a new column in a pandas DataFrame which is calculated as the ratio of 2 existing columns in the DataFrame. However, the denominator in the ratio calculation will change based on the value of a string which is found in another column in the DataFrame.
Example sample dataset:
import pandas as pd

df = pd.DataFrame(data={'hand': ['left', 'left', 'both', 'both'],
                        'exp_force': [25, 28, 82, 84],
                        'left_max': [38, 38, 38, 38],
                        'both_max': [90, 90, 90, 90]})
I need to create a new DataFrame column df['ratio'] based on the condition of df['hand'].
If df['hand']=='left' then df['ratio'] = df['exp_force'] / df['left_max']
If df['hand']=='both' then df['ratio'] = df['exp_force'] / df['both_max']
You can use np.where():
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'hand': ['left', 'left', 'both', 'both'],
                        'exp_force': [25, 28, 82, 84],
                        'left_max': [38, 38, 38, 38],
                        'both_max': [90, 90, 90, 90]})
# Vectorized choice of denominator based on the condition
df['ratio'] = np.where(df['hand'] == 'left',
                       df['exp_force'] / df['left_max'],
                       df['exp_force'] / df['both_max'])
df
Out[42]:
hand exp_force left_max both_max ratio
0 left 25 38 90 0.657895
1 left 28 38 90 0.736842
2 both 82 38 90 0.911111
3 both 84 38 90 0.933333
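One caveat worth knowing: np.where evaluates both result arrays in full, so the "unused" branch is still computed for every row. If a denominator column could contain 0 (this sample has none), you may want to neutralize it first. A hypothetical sketch:

# Hypothetical guard: turn zero denominators into NaN so the unused branch
# yields NaN instead of inf (the sample data has no zeros)
safe_left = df['left_max'].replace(0, np.nan)
df['ratio'] = np.where(df['hand'] == 'left',
                       df['exp_force'] / safe_left,
                       df['exp_force'] / df['both_max'])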
Alternatively, in a real-life scenario with lots of conditions and results, you can use np.select() so that you don't have to keep nesting np.where() statements (as I have done a lot in my older code). It's better to use np.select in these situations:
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'hand': ['left', 'left', 'both', 'both'],
                        'exp_force': [25, 28, 82, 84],
                        'left_max': [38, 38, 38, 38],
                        'both_max': [90, 90, 90, 90]})
c1 = df['hand'] == 'left'
c2 = df['hand'] == 'both'
r1 = df['exp_force'] / df['left_max']
r2 = df['exp_force'] / df['both_max']
conditions = [c1, c2]
results = [r1, r2]
df['ratio'] = np.select(conditions, results)
df
Out[430]:
hand exp_force left_max both_max ratio
0 left 25 38 90 0.657895
1 left 28 38 90 0.736842
2 both 82 38 90 0.911111
3 both 84 38 90 0.933333
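One note on np.select: rows that match none of the conditions receive the default fill value, which is 0 unless you say otherwise. If 0 could be mistaken for a real ratio, an explicit default is safer, e.g. NaN:

# Rows matching neither condition become NaN instead of np.select's default 0
df['ratio'] = np.select(conditions, results, default=np.nan)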
Enumerate
# Loop over rows and pick the denominator for each hand type
for i, e in enumerate(df['hand']):
    if e == 'left':
        df.at[i, 'ratio'] = df.at[i, 'exp_force'] / df.at[i, 'left_max']
    elif e == 'both':
        df.at[i, 'ratio'] = df.at[i, 'exp_force'] / df.at[i, 'both_max']
df
Output:
hand exp_force left_max both_max ratio
0 left 25 38 90 0.657895
1 left 28 38 90 0.736842
2 both 82 38 90 0.911111
3 both 84 38 90 0.933333
You can use the apply() method of your DataFrame:
df['ratio'] = df.apply(
    lambda x: x['exp_force'] / x['left_max'] if x['hand'] == 'left'
    else x['exp_force'] / x['both_max'],
    axis=1
)
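Since the denominator column name here is just the hand value plus '_max', another vectorized route is the get_indexer idiom that pandas suggests as a replacement for the deprecated DataFrame.lookup. A sketch assuming the df above:

import numpy as np

# Build each row's denominator column name, then pull those cells by position
cols = df['hand'] + '_max'          # 'left_max' or 'both_max' per row
idx = df.columns.get_indexer(cols)  # column position for each row
denom = df.to_numpy()[np.arange(len(df)), idx].astype(float)
df['ratio'] = df['exp_force'] / denom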
I am a beginner. I've looked all over and read a bunch of related questions but can't quite figure this out. I know I'm missing something, and I'm hoping someone will be kind enough to help me out. I am attempting to convert data from one video game (a college basketball simulation) into the format used by another video game (a pro basketball simulation).
I have a DF that has columns:
Name, Pos, Height, Weight, Shot, Points
With values such as:
Jon Smith, C, 84, 235, Exc, 19.4
Greg Jones, PG, 72, 187, Fair, 12.0
I want to create a new column for "InsideScoring". What I'd like to do is assign each player a randomly generated number within a certain range based on position, height, weight, shot rating, and points scored.
I tried several variations like:
df1['InsideScoring'] = 0
df1.loc[(df1.Pos == "C") &
        (df1.Height > 82) &
        (df1.Points > 19.0) &
        (df1.Weight > 229), 'InsideScoring'] = np.random.randint(85, 100)
When I do this, every player that meets the criteria gets assigned the same value between 85 and 100 in "InsideScoring", rather than a random mix of numbers in that range.
Eventually, what I want to do is go through the list of players and, based on those criteria, assign values from different ranges. Any ideas appreciated.
Related questions I've already read:
Pandas: Create a new column with random values based on conditional
Numpy "where" with multiple conditions
My recommendation would be to use np.select here. You set up your conditions, your outputs, and you're good to go. However, to avoid iteration, and also to avoid assigning the same random value to every row that meets a condition, create random values equal in length to your DataFrame and select from those:
Setup
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Chris', 'John'],
    'Height': [72, 84],
    'Pos': ['PG', 'C'],
    'Weight': [165, 235],
    'Shot': ['Amazing', 'Fair'],
    'Points': [999, 25]
})
Name Height Pos Weight Shot Points
0 Chris 72 PG 165 Amazing 999
1 John 84 C 235 Fair 25
Now set up your ranges and your conditions (create as many of these as you like):
cond1 = df.Pos.eq('C') & df.Height.gt(80) & df.Weight.gt(200)
cond2 = df.Pos.eq('PG') & df.Height.lt(80) & df.Weight.lt(200)
range1 = np.random.randint(85, 100, len(df))
range2 = np.random.randint(50, 85, len(df))
df.assign(InsideScoring=np.select([cond1, cond2], [range1, range2]))
Name Height Pos Weight Shot Points InsideScoring
0 Chris 72 PG 165 Amazing 999 72
1 John 84 C 235 Fair 25 89
Now, to verify that this doesn't assign the same value to every matching row:
df = pd.concat([df]*5)
... # Setup the ranges and conditions again
df.assign(InsideScoring=np.select([cond1, cond2], [range1, range2]))
Name Height Pos Weight Shot Points InsideScoring
0 Chris 72 PG 165 Amazing 999 56
1 John 84 C 235 Fair 25 96
0 Chris 72 PG 165 Amazing 999 74
1 John 84 C 235 Fair 25 93
0 Chris 72 PG 165 Amazing 999 63
1 John 84 C 235 Fair 25 97
0 Chris 72 PG 165 Amazing 999 55
1 John 84 C 235 Fair 25 95
0 Chris 72 PG 165 Amazing 999 60
1 John 84 C 235 Fair 25 90
And we can see that distinct random values are assigned, even though every row matches one of the two conditions. While this is less memory efficient than iterating and picking one random value per row, since we generate a lot of unused numbers, it will still be faster because these are vectorized operations.
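If the draws need to be reproducible across runs, one option is numpy's seeded Generator API (the seed value below is arbitrary):

# Same seed -> same InsideScoring draws on every run
rng = np.random.default_rng(42)
range1 = rng.integers(85, 100, len(df))
range2 = rng.integers(50, 85, len(df))
df.assign(InsideScoring=np.select([cond1, cond2], [range1, range2]))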
This might be a basic question, but I have not been able to find a solution. I have two DataFrames with identical rows and columns, called Volumes and Prices, which look like this:
Volumes
Index ProductA ProductB ProductC ProductD Limit
0 100 300 400 78 100
1 110 370 20 30 100
2 90 320 200 121 100
3 150 320 410 99 100
....
Prices
Index ProductA ProductB ProductC ProductD Limit
0 50 110 30 90 0
1 51 110 29 99 0
2 49 120 25 88 0
3 51 110 22 96 0
....
I want to assign 0 to each cell of the Prices DataFrame whose corresponding value in Volumes is less than that row's Limit.
So the ideal output would be:
Prices
Index ProductA ProductB ProductC ProductD Limit
0 50 110 30 0 0
1 51 110 0 0 0
2 0 120 25 88 0
3 51 110 22 0 0
....
I tried
import pandas as pd
import numpy as np

d_price = {'ProductA': [50, 51, 49, 51], 'ProductB': [110, 110, 120, 110],
           'ProductC': [30, 29, 25, 22], 'ProductD': [90, 99, 88, 96],
           'Limit': [0] * 4}
d_volume = {'ProductA': [100, 110, 90, 150], 'ProductB': [300, 370, 320, 320],
            'ProductC': [400, 20, 200, 410], 'ProductD': [78, 30, 121, 99],
            'Limit': [100] * 4}
Prices = pd.DataFrame(d_price)
Volumes = pd.DataFrame(d_volume)

Prices[Volumes > Volumes.Limit] = 0
but I do not obtain any changes to the Prices DataFrame... obviously I'm having a hard time understanding boolean slicing; any help would be great.
The problem is in

Prices[Volumes > Volumes.Limit] = 0

Comparing a DataFrame to a Series aligns the Series index with the DataFrame's columns, not its rows, so the mask comes out all False because no labels line up. Since Limit lives on each row, compare row-wise instead, for example with apply (note the comparison is <, to match your expected output):

Prices[Volumes.apply(lambda x: x < x.Limit, axis=1)] = 0
You can use a boolean mask to solve this problem. I am not an expert either, but this solution does what you want to do (.ix has been removed from modern pandas, so use .loc, and compare row-wise with ge(..., axis=0)):

# True where a volume meets its row's Limit; those prices are kept
test = Volumes.loc[:, 'ProductA':'ProductD'].ge(Volumes['Limit'], axis=0)
final = Prices[test].fillna(0)
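And since this answer mentions mask: pandas also has an actual DataFrame.mask method, which gives a non-mutating version of the same idea. A sketch assuming the same Prices and Volumes frames:

prod_cols = ['ProductA', 'ProductB', 'ProductC', 'ProductD']
# True where a volume falls below that row's Limit
low = Volumes[prod_cols].lt(Volumes['Limit'], axis=0)
final = Prices.copy()
final[prod_cols] = Prices[prod_cols].mask(low, 0)  # replace flagged cells with 0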