My dataset looks like this, and my code is:
import pandas as pd
df = pd.read_csv("Admission_data.csv")
correlations = df.corr()
correlations
admit gre gpa rank
0.0 380.0 3.61 3.0
1.0 660.0 3.67 3.0
1.0 800.0 4.0 1.0
1.0 640.0 3.19 4.0
0.0 520.0 2.93 4.0
1.0 760.0 3.0 2.0
1.0 560.0 2.98 1.0
0.0 400.0 3.08 2.0
1.0 540.0 3.39 3.0
0.0 700.0 3.92 2.0
0.0 800.0 4.0 4.0
0.0 440.0 3.22 1.0
1.0 760.0 4.0 1.0
0.0 700.0 3.08 2.0
1.0 700.0 4.0 1.0
0.0 480.0 3.44 3.0
0.0 780.0 3.87 4.0
0.0 360.0 2.56 3.0
0.0 800.0 3.75 2.0
1.0 540.0 3.81 1.0
0.0 500.0 3.17 3.0
1.0 660.0 3.63 2.0
0.0 600.0 2.82 4.0
0.0 680.0 3.19 4.0
1.0 760.0 3.35 2.0
1.0 800.0 3.66 1.0
1.0 620.0 3.61 1.0
1.0 520.0 3.74 4.0
1.0 780.0 3.22 2.0
0.0 520.0 3.29 1.0
0.0 540.0 3.78 4.0
0.0 760.0 3.35 3.0
0.0 600.0 3.4 3.0
1.0 800.0 4.0 3.0
0.0 360.0 3.14 1.0
0.0 400.0 3.05 2.0
0.0 580.0 3.25 1.0
0.0 520.0 2.9 3.0
1.0 500.0 3.13 2.0
1.0 520.0 2.68 3.0
0.0 560.0 2.42 2.0
1.0 580.0 3.32 2.0
1.0 600.0 3.15 2.0
0.0 500.0 3.31 3.0
0.0 700.0 2.94 2.0
1.0 460.0 3.45 3.0
1.0 580.0 3.46 2.0
0.0 500.0 2.97 4.0
0.0 440.0 2.48 4.0
0.0 400.0 3.35 3.0
0.0 640.0 3.86 3.0
0.0 440.0 3.13 4.0
0.0 740.0 3.37 4.0
1.0 680.0 3.27 2.0
0.0 660.0 3.34 3.0
1.0 740.0 4.0 3.0
0.0 560.0 3.19 3.0
0.0 380.0 2.94 3.0
0.0 400.0 3.65 2.0
0.0 600.0 2.82 4.0
1.0 620.0 3.18 2.0
0.0 560.0 3.32 4.0
0.0 640.0 3.67 3.0
1.0 680.0 3.85 3.0
0.0 580.0 4.0 3.0
0.0 600.0 3.59 2.0
0.0 740.0 3.62 4.0
0.0 620.0 3.3 1.0
0.0 580.0 3.69 1.0
0.0 800.0 3.73 1.0
0.0 640.0 4.0 3.0
0.0 300.0 2.92 4.0
0.0 480.0 3.39 4.0
0.0 580.0 4.0 2.0
0.0 720.0 3.45 4.0
0.0 720.0 4.0 3.0
0.0 560.0 3.36 3.0
1.0 800.0 4.0 3.0
0.0 540.0 3.12 1.0
1.0 620.0 4.0 1.0
0.0 700.0 2.9 4.0
0.0 620.0 3.07 2.0
0.0 500.0 2.71 2.0
0.0 380.0 2.91 4.0
1.0 500.0 3.6 3.0
0.0 520.0 2.98 2.0
0.0 600.0 3.32 2.0
0.0 600.0 3.48 2.0
0.0 700.0 3.28 1.0
1.0 660.0 4.0 2.0
0.0 700.0 3.83 2.0
1.0 720.0 3.64 1.0
0.0 800.0 3.9 2.0
0.0 580.0 2.93 2.0
1.0 660.0 3.44 2.0
0.0 660.0 3.33 2.0
0.0 640.0 3.52 4.0
0.0 480.0 3.57 2.0
0.0 700.0 2.88 2.0
0.0 400.0 3.31 3.0
I tried correlation methods, method='pearson', numeric_only=True, etc., but nothing works.
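If df.corr() comes back empty, the usual cause is that the columns were read in as strings (object dtype), so there is nothing numeric to correlate. A minimal sketch of a fix, assuming that is what happened here (the file name is taken from the question above):

import pandas as pd

df = pd.read_csv("Admission_data.csv")
print(df.dtypes)  # check whether the columns really are numeric

# Coerce every column to numeric; anything unparseable becomes NaN.
df = df.apply(pd.to_numeric, errors='coerce')

correlations = df.corr()  # pairwise Pearson correlation by default
print(correlations)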
I have the following dataframe in pandas:
Race_ID Athlete_ID Finish_time
0 1.0 1.0 56.1
1 1.0 3.0 60.2
2 1.0 2.0 57.1
3 1.0 4.0 57.2
4 2.0 2.0 56.2
5 2.0 1.0 56.3
6 2.0 3.0 56.4
7 2.0 4.0 56.5
8 3.0 1.0 61.2
9 3.0 2.0 62.1
10 3.0 3.0 60.4
11 3.0 4.0 60.0
12 4.0 2.0 55.0
13 4.0 1.0 54.0
14 4.0 3.0 53.0
15 4.0 4.0 52.0
where Race_ID is in descending order of time (i.e. 1 is the most recent race and 4 is the oldest race).
And I want to add a new column Relative_time#t-1, which is the athlete's Finish_time in the previous race relative to the fastest time in that previous race. Hence the output would look something like:
Race_ID Athlete_ID Finish_time Relative_time#t-1
0 1.0 1.0 56.1 56.3/56.2
1 1.0 3.0 60.2 56.4/56.2
2 1.0 2.0 57.1 56.2/56.2
3 1.0 4.0 57.2 56.5/56.2
4 2.0 2.0 56.2 62.1/60
5 2.0 1.0 56.3 61.2/60
6 2.0 3.0 56.4 60.4/60
7 2.0 4.0 56.5 60/60
8 3.0 1.0 61.2 54/52
9 3.0 2.0 62.1 55/52
10 3.0 3.0 60.4 53/52
11 3.0 4.0 60.0 52/52
12 4.0 2.0 55.0 0
13 4.0 1.0 54.0 0
14 4.0 3.0 53.0 0
15 4.0 4.0 52.0 0
Here's the code:
import pandas as pd
import numpy as np

data = [[1,1,56.1,'56.3/56.2'],
[1,3,60.2,'56.4/56.2'],
[1,2,57.1,'56.2/56.2'],
[1,4,57.2,'56.5/56.2'],
[2,2,56.2,'62.1/60'],
[2,1,56.3,'61.2/60'],
[2,3,56.4,'60.4/60'],
[2,4,56.5,'60/60'],
[3,1,61.2,'54/52'],
[3,2,62.1,'55/52'],
[3,3,60.4,'53/52'],
[3,4,60,'52/52'],
[4,2,55,'0'],
[4,1,54,'0'],
[4,3,53,'0'],
[4,4,52,'0']]
df = pd.DataFrame(data,columns=['Race_ID','Athlete_ID','Finish_time','Relative_time#t-1'],dtype=float)
I intentionally made Relative_time#t-1 a str instead of a number, to show the formula.
Here is what I have tried:
df.sort_values(by = ['Race_ID', 'Athlete_ID'], ascending=[True, True], inplace=True)
df['Finish_time#t-1'] = df.groupby('Athlete_ID')['Finish_time'].shift(-1)
df['Finish_time#t-1'] = df['Finish_time#t-1'].replace(np.nan, 0, regex = True)
So I get the numerator for the new column, but I don't know how to get the minimum time for each Race_ID (i.e. the value in the denominator).
Thank you in advance.
Try this:
(df.groupby('Athlete_ID')['Finish_time']
   .shift(-1)                              # numerator: athlete's time in the previous race
   .div(df['Race_ID'].map(
       df.groupby('Race_ID')['Finish_time']
         .min()                            # fastest time in each race...
         .shift(-1)))                      # ...shifted to the previous race
   .fillna(0))                             # the oldest race has no prior race
Output:
0 1.001779
1 1.003559
2 1.000000
3 1.005338
4 1.035000
5 1.020000
6 1.006667
7 1.000000
8 1.038462
9 1.057692
10 1.019231
11 1.000000
12 0.000000
13 0.000000
14 0.000000
15 0.000000
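The same logic reads a little more clearly broken into named steps and assigned to the new column (a sketch, assuming df is sorted by Race_ID ascending as in the question):

# Fastest time in each race, shifted so that each Race_ID maps to the
# minimum of the next Race_ID (i.e. the previous race in time).
prev_race_min = df.groupby('Race_ID')['Finish_time'].min().shift(-1)

# Each athlete's own time in the previous race.
prev_time = df.groupby('Athlete_ID')['Finish_time'].shift(-1)

df['Relative_time#t-1'] = prev_time.div(df['Race_ID'].map(prev_race_min)).fillna(0)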
How can we sort the rows of the dataframe below by month, from January to December? Currently the dataframe is in alphabetical order.
Col1 Col2 Col3 ... Col22 Col23 Col24
1 April 53.0 0.0 ... 11.0 0.0 0.0
2 August 43.0 0.0 ... 11.0 3.0 5.0
3 December 36.0 0.0 ... 4.0 1.0 0.0
4 February 48.0 0.0 ... 16.0 0.0 0.0
5 January 55.0 0.0 ... 24.0 4.0 0.0
6 July 45.0 0.0 ... 4.0 8.0 1.0
7 June 34.0 0.0 ... 4.0 8.0 1.0
8 March 34.0 2.0 ... 24.0 4.0 1.0
9 May 52.0 1.0 ... 3.0 2.0 1.0
10 November 33.0 0.0 ... 7.0 2.0 3.0
11 October 21.0 1.0 ... 7.0 1.0 2.0
12 September 27.0 0.0 ... 5.0 3.0 3.0
We can also use pd.date_range with month_name() and month:
month = pd.date_range(start='2018-01', freq='M', periods=12)  # pandas >= 2.2 prefers freq='ME'

# Map each month name to its month number, sort by it, and reorder the rows.
df.loc[df['Col1'].map(dict(zip(month.month_name(), month.month))).sort_values().index]
Col1 Col2 Col3 Col22 Col23 Col24
5 January 55.0 0.0 24.0 4.0 0.0
4 February 48.0 0.0 16.0 0.0 0.0
8 March 34.0 2.0 24.0 4.0 1.0
1 April 53.0 0.0 11.0 0.0 0.0
9 May 52.0 1.0 3.0 2.0 1.0
7 June 34.0 0.0 4.0 8.0 1.0
6 July 45.0 0.0 4.0 8.0 1.0
2 August 43.0 0.0 11.0 3.0 5.0
12 September 27.0 0.0 5.0 3.0 3.0
11 October 21.0 1.0 7.0 1.0 2.0
10 November 33.0 0.0 7.0 2.0 3.0
3 December 36.0 0.0 4.0 1.0 0.0
You can use the calendar module to create a month-name to month-number mapping, then sort the values and reindex:
import calendar

# calendar.month_name is ['', 'January', ..., 'December'], so enumerating it
# maps each name to its month number (the empty string maps to 0, harmlessly).
df.reindex(df['Col1'].map({name: num for num, name in enumerate(calendar.month_name)})
                     .sort_values().index)
Col1 Col2 Col3 ... Col22 Col23 Col24
5 January 55.0 0.0 ... 24.0 4.0 0.0
4 February 48.0 0.0 ... 16.0 0.0 0.0
8 March 34.0 2.0 ... 24.0 4.0 1.0
1 April 53.0 0.0 ... 11.0 0.0 0.0
9 May 52.0 1.0 ... 3.0 2.0 1.0
7 June 34.0 0.0 ... 4.0 8.0 1.0
6 July 45.0 0.0 ... 4.0 8.0 1.0
2 August 43.0 0.0 ... 11.0 3.0 5.0
12 September 27.0 0.0 ... 5.0 3.0 3.0
11 October 21.0 1.0 ... 7.0 1.0 2.0
10 November 33.0 0.0 ... 7.0 2.0 3.0
3 December 36.0 0.0 ... 4.0 1.0 0.0
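Another option (a sketch, not taken from the answers above) is to make Col1 an ordered categorical, which sorts correctly and also keeps the month order for any later sorts or groupbys:

import calendar
import pandas as pd

months = list(calendar.month_name)[1:]   # ['January', ..., 'December']
df['Col1'] = pd.Categorical(df['Col1'], categories=months, ordered=True)
df = df.sort_values('Col1')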
I couldn't find an efficient way of doing this. I have the DataFrame below in Python, with columns from A to Z:
A B C ... Z
0 2.0 8.0 1.0 ... 5.0
1 3.0 9.0 0.0 ... 4.0
2 4.0 9.0 0.0 ... 3.0
3 5.0 8.0 1.0 ... 2.0
4 6.0 8.0 0.0 ... 1.0
5 7.0 9.0 1.0 ... 0.0
I need to multiply each of the columns from B to Z by A (B x A, C x A, ..., Z x A), and save the results in new columns (R1, R2, ..., R25).
I would have something like this:
A B C ... Z R1 R2 ... R25
0 2.0 8.0 1.0 ... 5.0 16.0 2.0 ... 10.0
1 3.0 9.0 0.0 ... 4.0 27.0 0.0 ... 12.0
2 4.0 9.0 0.0 ... 3.0 36.0 0.0 ... 12.0
3 5.0 8.0 1.0 ... 2.0 40.0 5.0 ... 10.0
4 6.0 8.0 0.0 ... 1.0 48.0 0.0 ... 6.0
5 7.0 9.0 1.0 ... 0.0 63.0 7.0 ... 0.0
I was able to calculate the results using the code below, but from there I would need to merge with the original df, which doesn't seem efficient. There must be a simpler/cleaner way of doing this:
df.loc[:,'B':'D'].multiply(df['A'], axis="index")
That's just an example; my real DataFrame has 160 columns x 16k rows.
Create the new column names with a list comprehension, then join to the original:
df1 = df.loc[:,'B':'D'].multiply(df['A'], axis="index")  # multiply the block by A, row-wise
df1.columns = ['R{}'.format(x) for x in range(1, len(df1.columns) + 1)]  # rename to R1..Rn
df = df.join(df1)
print (df)
A B C Z R1 R2
0 2.0 8.0 1.0 5.0 16.0 2.0
1 3.0 9.0 0.0 4.0 27.0 0.0
2 4.0 9.0 0.0 3.0 36.0 0.0
3 5.0 8.0 1.0 2.0 40.0 5.0
4 6.0 8.0 0.0 1.0 48.0 0.0
5 7.0 9.0 1.0 0.0 63.0 7.0
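For the stated size (160 columns x 16k rows), a NumPy broadcasting variant may be worth benchmarking. A sketch under the same layout, assuming the block to scale really is 'B':'Z':

import numpy as np
import pandas as pd

block = df.loc[:, 'B':'Z']                     # the columns to scale
res = pd.DataFrame(
    block.to_numpy() * df['A'].to_numpy()[:, None],  # row-wise multiply by A
    columns=['R{}'.format(i) for i in range(1, block.shape[1] + 1)],
    index=df.index)
df = pd.concat([df, res], axis=1)              # attach R1..R25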
I have a pandas DataFrame; it looks like this:
# Output
# A B C D
# 0 3.0 6.0 7.0 4.0
# 1 42.0 44.0 1.0 3.0
# 2 4.0 2.0 3.0 62.0
# 3 90.0 83.0 53.0 23.0
# 4 22.0 23.0 24.0 NaN
# 5 5.0 2.0 5.0 34.0
# 6 NaN NaN NaN NaN
# 7 NaN NaN NaN NaN
# 8 2.0 12.0 65.0 1.0
# 9 5.0 7.0 32.0 7.0
# 10 2.0 13.0 6.0 12.0
# 11 NaN NaN NaN NaN
# 12 23.0 NaN 23.0 34.0
# 13 61.0 NaN 63.0 3.0
# 14 32.0 43.0 12.0 76.0
# 15 24.0 2.0 34.0 2.0
What I would like to do is fill the NaNs in columns A, B, and C with the most recent preceding row's B value. For column D, I would like the NaNs replaced with zeros.
I've looked into ffill and fillna, but neither seems to be able to do the job.
My solution so far:
def fix_abc(row, column, df):
    # If the row/column value is null/nan
    if pd.isnull(row[column]):
        # Get the value of column B from the row before
        prior = row.name
        value = df[prior-1:prior]['B'].values[0]
        # If that value is empty too, go to the row before that
        while pd.isnull(value) and prior >= 1:
            prior = prior - 1
            value = df[prior-1:prior]['B'].values[0]
    else:
        value = row[column]
    return value

df['A'] = df.apply(lambda x: fix_abc(x, 'A', df), axis=1)
df['B'] = df.apply(lambda x: fix_abc(x, 'B', df), axis=1)
df['C'] = df.apply(lambda x: fix_abc(x, 'C', df), axis=1)

def fix_d(x):
    if pd.isnull(x['D']):
        return 0
    return x['D']

df['D'] = df.apply(lambda x: fix_d(x), axis=1)
This feels quite inefficient and slow, so I'm wondering if there is a quicker, more efficient way to do it.
Example output:
#     A     B     C     D
# 0   3.0   6.0   7.0   4.0
# 1   42.0  44.0  1.0   3.0
# 2   4.0   2.0   3.0   62.0
# 3   90.0  83.0  53.0  23.0
# 4   22.0  23.0  24.0  0.0
# 5   5.0   2.0   5.0   34.0
# 6   2.0   2.0   2.0   0.0
# 7   2.0   2.0   2.0   0.0
# 8   2.0   12.0  65.0  1.0
# 9   5.0   7.0   32.0  7.0
# 10  2.0   13.0  6.0   12.0
# 11  13.0  13.0  13.0  0.0
# 12  23.0  13.0  23.0  34.0
# 13  61.0  13.0  63.0  3.0
# 14  32.0  43.0  12.0  76.0
# 15  24.0  2.0   34.0  2.0
I have dumped the code including the data for the dataframe into a python fiddle available (here)
fillna allows for various ways to do the filling. In this case, column D can simply be filled with 0, column B can be forward-filled via pad, and then columns A and C can be filled from column B, like:
Code:
df['D'] = df.D.fillna(0)                # D: NaN -> 0
df['B'] = df.B.fillna(method='pad')     # B: forward-fill from the previous valid B
df['A'] = df.A.fillna(df['B'])          # A: take the (already filled) B value
df['C'] = df.C.fillna(df['B'])          # C: likewise
Test Code:
import pandas as pd
from io import StringIO

df = pd.read_fwf(StringIO(u"""
A B C D
3.0 6.0 7.0 4.0
42.0 44.0 1.0 3.0
4.0 2.0 3.0 62.0
90.0 83.0 53.0 23.0
22.0 23.0 24.0 NaN
5.0 2.0 5.0 34.0
NaN NaN NaN NaN
NaN NaN NaN NaN
2.0 12.0 65.0 1.0
5.0 7.0 32.0 7.0
2.0 13.0 6.0 12.0
NaN NaN NaN NaN
23.0 NaN 23.0 34.0
61.0 NaN 63.0 3.0
32.0 43.0 12.0 76.0
24.0 2.0 34.0 2.0"""), header=1)
print(df)
df['D'] = df.D.fillna(0)
df['B'] = df.B.fillna(method='pad')
df['A'] = df.A.fillna(df['B'])
df['C'] = df.C.fillna(df['B'])
print(df)
Results:
A B C D
0 3.0 6.0 7.0 4.0
1 42.0 44.0 1.0 3.0
2 4.0 2.0 3.0 62.0
3 90.0 83.0 53.0 23.0
4 22.0 23.0 24.0 NaN
5 5.0 2.0 5.0 34.0
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 2.0 12.0 65.0 1.0
9 5.0 7.0 32.0 7.0
10 2.0 13.0 6.0 12.0
11 NaN NaN NaN NaN
12 23.0 NaN 23.0 34.0
13 61.0 NaN 63.0 3.0
14 32.0 43.0 12.0 76.0
15 24.0 2.0 34.0 2.0
A B C D
0 3.0 6.0 7.0 4.0
1 42.0 44.0 1.0 3.0
2 4.0 2.0 3.0 62.0
3 90.0 83.0 53.0 23.0
4 22.0 23.0 24.0 0.0
5 5.0 2.0 5.0 34.0
6 2.0 2.0 2.0 0.0
7 2.0 2.0 2.0 0.0
8 2.0 12.0 65.0 1.0
9 5.0 7.0 32.0 7.0
10 2.0 13.0 6.0 12.0
11 13.0 13.0 13.0 0.0
12 23.0 13.0 23.0 34.0
13 61.0 13.0 63.0 3.0
14 32.0 43.0 12.0 76.0
15 24.0 2.0 34.0 2.0
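Note that fillna(method='pad') is deprecated in recent pandas. A sketch of the same fill with the current accessors (assuming df as above):

df['D'] = df['D'].fillna(0)     # D: NaN -> 0
df['B'] = df['B'].ffill()       # ffill is the modern spelling of method='pad'
for col in ['A', 'C']:
    df[col] = df[col].fillna(df['B'])   # fill A and C from the already-filled B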