Basic Python: How do I normalize a data series?

I have a dataframe with 5 columns indexed by date. I would like to normalize each of these data series by its first value.
A B C D E
1/1/2017 3 4 1 2 3
1/2/2017 7 4 4 3 3
1/3/2017 2 5 5 4 3
1/4/2017 2 5 3 6 3
1/5/2017 2 2 2 6 6
For example, in column A I would like to divide everything by 3, the first item in the column. Same for columns B to E.
Thanks for your help!

In [100]: df.div(df.iloc[0])
Out[100]:
A B C D E
1/1/2017 1.000000 1.00 1.0 1.0 1.0
1/2/2017 2.333333 1.00 4.0 1.5 1.0
1/3/2017 0.666667 1.25 5.0 2.0 1.0
1/4/2017 0.666667 1.25 3.0 3.0 1.0
1/5/2017 0.666667 0.50 2.0 3.0 2.0
or
In [101]: df / df.iloc[0]
Out[101]:
A B C D E
1/1/2017 1.000000 1.00 1.0 1.0 1.0
1/2/2017 2.333333 1.00 4.0 1.5 1.0
1/3/2017 0.666667 1.25 5.0 2.0 1.0
1/4/2017 0.666667 1.25 3.0 3.0 1.0
1/5/2017 0.666667 0.50 2.0 3.0 2.0

By using div with an explicit axis:
df.div(df.iloc[0, :], axis=1)
Out[496]:
A B C D E
1/1/2017 1.000000 1.00 1.0 1.0 1.0
1/2/2017 2.333333 1.00 4.0 1.5 1.0
1/3/2017 0.666667 1.25 5.0 2.0 1.0
1/4/2017 0.666667 1.25 3.0 3.0 1.0
1/5/2017 0.666667 0.50 2.0 3.0 2.0
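All three answers do the same thing: divide every row by the first row, with columns aligned by label. A minimal end-to-end sketch (the column names and sample values are taken from the question; the date index is a reconstruction):

```python
import pandas as pd

# Reconstruct the question's frame with a date index (an assumption;
# the original index format is just printed dates).
df = pd.DataFrame(
    {"A": [3, 7, 2, 2, 2],
     "B": [4, 4, 5, 5, 2],
     "C": [1, 4, 5, 3, 2],
     "D": [2, 3, 4, 6, 6],
     "E": [3, 3, 3, 3, 6]},
    index=pd.date_range("2017-01-01", periods=5),
)

# Divide every row by the first row; pandas aligns on column labels,
# so each column is scaled by its own first value.
normalized = df.div(df.iloc[0])
print(normalized)
```

The first row of the result is all 1.0 by construction, which is a quick sanity check that the normalization is per-column.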

Related

Set the values of rows from one dataframe based on the rows of another dataframe

I made a mock-up example to illustrate my problem; what I am actually working with is far more complex, but reading the example will make everything easier to understand. My goal is to use the row reference values of one dataframe to set the values of a new column in another dataframe. Taking my example, I want to create a new column in df1 named z1; that column is formed from the values of x1, using the y2 values of df2 as the reference.
import numpy as np
import pandas as pd

x1 = np.array([])
for i in np.arange(0, 15, 3):
    x1i = np.repeat(i, 3)
    x1 = np.append(x1, x1i)
y1 = np.linspace(0, 1, len(x1))
x2 = np.arange(0, 15, 3)
y2 = np.linspace(0, 1, len(x2))
df1 = pd.DataFrame([x1, y1]).T
df2 = pd.DataFrame([x2, y2]).T
df1.columns = ['x1', 'y1']
df2.columns = ['x2', 'y2']
So, we have that df1 is:
x1 y1
0 0.0 0.000000
1 0.0 0.071429
2 0.0 0.142857
3 3.0 0.214286
4 3.0 0.285714
5 3.0 0.357143
6 6.0 0.428571
7 6.0 0.500000
8 6.0 0.571429
9 9.0 0.642857
10 9.0 0.714286
11 9.0 0.785714
12 12.0 0.857143
13 12.0 0.928571
14 12.0 1.000000
and df2 is:
x2 y2
0 0.0 0.00
1 3.0 0.25
2 6.0 0.50
3 9.0 0.75
4 12.0 1.00
What I would like to achieve is:
x1 y1 z1
0 0.0 0.000000 0.00
1 0.0 0.071429 0.00
2 0.0 0.142857 0.00
3 3.0 0.214286 0.25
4 3.0 0.285714 0.25
5 3.0 0.357143 0.25
6 6.0 0.428571 0.50
7 6.0 0.500000 0.50
8 6.0 0.571429 0.50
9 9.0 0.642857 0.75
10 9.0 0.714286 0.75
11 9.0 0.785714 0.75
12 12.0 0.857143 1.00
13 12.0 0.928571 1.00
14 12.0 1.000000 1.00
You can use map for this.
df1['z'] = df1['x1'].map(df2.set_index('x2')['y2'])
x1 y1 z
0 0.0 0.000000 0.00
1 0.0 0.071429 0.00
2 0.0 0.142857 0.00
3 3.0 0.214286 0.25
4 3.0 0.285714 0.25
5 3.0 0.357143 0.25
6 6.0 0.428571 0.50
7 6.0 0.500000 0.50
8 6.0 0.571429 0.50
9 9.0 0.642857 0.75
10 9.0 0.714286 0.75
11 9.0 0.785714 0.75
12 12.0 0.857143 1.00
13 12.0 0.928571 1.00
14 12.0 1.000000 1.00
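As a non-authoritative alternative to map, the same lookup can be done with a left merge; this sketch rebuilds the question's arrays and names the result column z1 as the question asked:

```python
import numpy as np
import pandas as pd

# Rebuild the question's data (np.repeat is an equivalent shortcut
# for the question's append loop).
x1 = np.repeat(np.arange(0, 15, 3), 3).astype(float)
y1 = np.linspace(0, 1, len(x1))
x2 = np.arange(0, 15, 3).astype(float)
y2 = np.linspace(0, 1, len(x2))
df1 = pd.DataFrame({"x1": x1, "y1": y1})
df2 = pd.DataFrame({"x2": x2, "y2": y2})

# Left-merge df2's y2 onto df1 wherever x1 matches x2; a left merge
# with unique right-hand keys preserves df1's row order.
out = df1.merge(df2, left_on="x1", right_on="x2", how="left")
df1["z1"] = out["y2"]
print(df1)
```

merge is handier than map when more than one reference column has to come across; for a single column, map on an indexed Series (as in the answer above) is the lighter tool.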

Relative minimum values in pandas

I have the following dataframe in pandas:
Race_ID Athlete_ID Finish_time
0 1.0 1.0 56.1
1 1.0 3.0 60.2
2 1.0 2.0 57.1
3 1.0 4.0 57.2
4 2.0 2.0 56.2
5 2.0 1.0 56.3
6 2.0 3.0 56.4
7 2.0 4.0 56.5
8 3.0 1.0 61.2
9 3.0 2.0 62.1
10 3.0 3.0 60.4
11 3.0 4.0 60.0
12 4.0 2.0 55.0
13 4.0 1.0 54.0
14 4.0 3.0 53.0
15 4.0 4.0 52.0
where Race_ID is in descending order of time (i.e. 1 is the most recent race and 4 is the oldest race).
And I want to add a new column Relative_time#t-1 which is the Athlete's Finish_time in the last race relative to the fastest time in the last race. Hence the output would look something like
Race_ID Athlete_ID Finish_time Relative_time#t-1
0 1.0 1.0 56.1 56.3/56.2
1 1.0 3.0 60.2 56.4/56.2
2 1.0 2.0 57.1 56.2/56.2
3 1.0 4.0 57.2 56.5/56.2
4 2.0 2.0 56.2 62.1/60
5 2.0 1.0 56.3 61.2/60
6 2.0 3.0 56.4 60.4/60
7 2.0 4.0 56.5 60/60
8 3.0 1.0 61.2 54/52
9 3.0 2.0 62.1 55/52
10 3.0 3.0 60.4 53/52
11 3.0 4.0 60.0 52/52
12 4.0 2.0 55.0 0
13 4.0 1.0 54.0 0
14 4.0 3.0 53.0 0
15 4.0 4.0 52.0 0
Here's the code:
data = [[1, 1, 56.1, '56.3/56.2'],
        [1, 3, 60.2, '56.4/56.2'],
        [1, 2, 57.1, '56.2/56.2'],
        [1, 4, 57.2, '56.5/56.2'],
        [2, 2, 56.2, '62.1/60'],
        [2, 1, 56.3, '61.2/60'],
        [2, 3, 56.4, '60.4/60'],
        [2, 4, 56.5, '60/60'],
        [3, 1, 61.2, '54/52'],
        [3, 2, 62.1, '55/52'],
        [3, 3, 60.4, '53/52'],
        [3, 4, 60, '52/52'],
        [4, 2, 55, '0'],
        [4, 1, 54, '0'],
        [4, 3, 53, '0'],
        [4, 4, 52, '0']]
df = pd.DataFrame(data,columns=['Race_ID','Athlete_ID','Finish_time','Relative_time#t-1'],dtype=float)
I intentionally made Relative_time#t-1 a str instead of a float to show the formula.
Here is what I have tried:
df.sort_values(by = ['Race_ID', 'Athlete_ID'], ascending=[True, True], inplace=True)
df['Finish_time#t-1'] = df.groupby('Athlete_ID')['Finish_time'].shift(-1)
df['Finish_time#t-1'] = df['Finish_time#t-1'].replace(np.nan, 0, regex = True)
So I get the numerator for the new column but I don't know how to get the minimum time for each Race_ID (i.e. the value in the denominator)
Thank you in advance.
Try this:
(df.groupby('Athlete_ID')['Finish_time']
   .shift(-1)
   .div(df['Race_ID'].map(
       df.groupby('Race_ID')['Finish_time']
         .min()
         .shift(-1)))
   .fillna(0))
Output:
0 1.001779
1 1.003559
2 1.000000
3 1.005338
4 1.035000
5 1.020000
6 1.006667
7 1.000000
8 1.038462
9 1.057692
10 1.019231
11 1.000000
12 0.000000
13 0.000000
14 0.000000
15 0.000000
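Putting the answer together as a self-contained sketch (the data is the question's; variable names prev_time and prev_min are mine, introduced to show the numerator and denominator separately):

```python
import pandas as pd

df = pd.DataFrame(
    {"Race_ID": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
     "Athlete_ID": [1, 3, 2, 4, 2, 1, 3, 4, 1, 2, 3, 4, 2, 1, 3, 4],
     "Finish_time": [56.1, 60.2, 57.1, 57.2, 56.2, 56.3, 56.4, 56.5,
                     61.2, 62.1, 60.4, 60.0, 55.0, 54.0, 53.0, 52.0]},
    dtype=float,
)

# Numerator: each athlete's time in the previous race. Races appear in
# blocks in descending recency, so shift(-1) within each athlete's group
# looks one race further back.
prev_time = df.groupby("Athlete_ID")["Finish_time"].shift(-1)

# Denominator: fastest time per race, shifted so each Race_ID maps to
# the PREVIOUS race's minimum, then broadcast back onto the rows.
prev_min = df["Race_ID"].map(
    df.groupby("Race_ID")["Finish_time"].min().shift(-1))

# The oldest race has no predecessor, so both sides are NaN there.
df["Relative_time#t-1"] = prev_time.div(prev_min).fillna(0)
print(df)
```

The map step is the piece the asker was missing: it turns the per-race minimum Series (indexed by Race_ID) back into a row-aligned denominator.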

Adding column to Pandas DataFrame based on dynamic indexing condition

I have a dataframe with a column that randomly starts a "count" back at 1. My goal is to produce a new_col that divides my current column by the last value in each count. See below for an example.
This is my current DataFrame:
col
0 1.0
1 2.0
2 3.0
3 1.0
4 2.0
5 1.0
6 2.0
7 3.0
8 4.0
9 5.0
10 1.0
11 2.0
12 3.0
Trying to get an output like so:
col new_col
0 1.0 0.333
1 2.0 0.667
2 3.0 1.000
3 1.0 0.500
4 2.0 1.000
5 1.0 0.200
6 2.0 0.400
7 3.0 0.600
8 4.0 0.800
9 5.0 1.000
10 1.0 0.333
11 2.0 0.667
12 3.0 1.000
This is what I have tried so far:
df['col_bool'] = pd.DataFrame(df['col'] == 1.0)
idx_lst = [x - 2 for x in df.index[df['col_bool']].tolist()]
idx_lst = idx_lst[1:]
mask = (df['col'] != 1.0)
df_valid = df[mask]
for i in idx_lst:
    df['new_col'] = 1.0 / df_valid.iloc[i]['col']
    df.loc[mask, 'new_col'] = df_valid['col'] / df_valid.iloc[i]['col']
This understandably results in an index error. Maybe I need to make a copy of a DataFrame each time and concat. I believe this would work but I want to ask if I am missing any shortcuts here?
Try:
df['new_col'] = df['col'].div(df.groupby((df['col'] == 1).cumsum())['col'].transform('last'))
Output:
col new_col
0 1.0 0.333333
1 2.0 0.666667
2 3.0 1.000000
3 1.0 0.500000
4 2.0 1.000000
5 1.0 0.200000
6 2.0 0.400000
7 3.0 0.600000
8 4.0 0.800000
9 5.0 1.000000
10 1.0 0.333333
11 2.0 0.666667
12 3.0 1.000000
You can try:
df['new_col'] = df.groupby(df.col.ne(df.col.shift().add(1)).cumsum())[
    'col'].transform(lambda x: x.div(len(x)))
Or:
df['new_col'] = df.col.div(df.groupby(df.col.ne(df.col.shift().add(1)).cumsum())[
    'col'].transform('count'))
OUTPUT:
col new_col
0 1.0 0.333333
1 2.0 0.666667
2 3.0 1.000000
3 1.0 0.500000
4 2.0 1.000000
5 1.0 0.200000
6 2.0 0.400000
7 3.0 0.600000
8 4.0 0.800000
9 5.0 1.000000
10 1.0 0.333333
11 2.0 0.666667
12 3.0 1.000000
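The key idea in both answers is that a cumulative sum over the restart condition labels the runs, and those labels can then drive a groupby. A short sketch of the first answer's approach on the question's data:

```python
import pandas as pd

df = pd.DataFrame({"col": [1.0, 2.0, 3.0, 1.0, 2.0, 1.0, 2.0,
                           3.0, 4.0, 5.0, 1.0, 2.0, 3.0]})

# Each 1 starts a new run, so the cumulative sum of (col == 1)
# gives every row a run label: 1,1,1,2,2,3,3,3,3,3,4,4,4.
run_id = df["col"].eq(1).cumsum()

# Divide each value by the last value of its run.
df["new_col"] = df["col"].div(df.groupby(run_id)["col"].transform("last"))
print(df)
```

The cumsum-of-a-boolean trick is worth remembering: it is the standard pandas way to segment a column into groups whenever some condition marks the start of a new group.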

How to fill a particular value with mean value of the column between first row and the corresponding row in pandas dataframe

I have a df like this,
A B C D E
1 2 3 0 2
2 0 7 1 1
3 4 0 3 0
0 0 3 4 3
I am trying to replace every 0 with the mean of the values between the first row and the row containing the 0, for the corresponding column.
My expected output is,
A B C D E
1.0 2.00 3.000000 0.0 2.0
2.0 1.00 7.000000 1.0 1.0
3.0 4.00 3.333333 3.0 1.0
1.5 1.75 3.000000 4.0 3.0
The main problem here is that when a column contains multiple 0s, each replacement needs the previously computed means, which makes it really hard to create a vectorized solution:
def f(x):
    for i, v in enumerate(x):
        if v == 0:
            x.iloc[i] = x.iloc[:i+1].mean()
    return x

df1 = df.astype(float).apply(f)
print(df1)
A B C D E
0 1.0 2.00 3.000000 0.0 2.0
1 2.0 1.00 7.000000 1.0 1.0
2 3.0 4.00 3.333333 3.0 1.0
3 1.5 1.75 3.000000 4.0 3.0
Better solution:
# collect row/column indices of the zero values in a helper DataFrame
a, b = np.where(df.values == 0)
df1 = pd.DataFrame({'rows': a, 'cols': b})
# for the first row it is not necessary to compute means
df1 = df1[df1['rows'] != 0]
print(df1)
rows cols
1 1 1
2 2 2
3 2 4
4 3 0
5 3 1
# loop over each row of the helper df and assign the means
for i in df1.itertuples():
    df.iloc[i.rows, i.cols] = df.iloc[:i.rows+1, i.cols].mean()
print(df)
A B C D E
0 1.0 2.00 3.000000 0 2.0
1 2.0 1.00 7.000000 1 1.0
2 3.0 4.00 3.333333 3 1.0
3 1.5 1.75 3.000000 4 3.0
Another similar solution (looping over all zero positions directly):
for i, j in zip(*np.where(df.values == 0)):
    df.iloc[i, j] = df.iloc[:i+1, j].mean()
print(df)
A B C D E
0 1.0 2.00 3.000000 0.0 2.0
1 2.0 1.00 7.000000 1.0 1.0
2 3.0 4.00 3.333333 3.0 1.0
3 1.5 1.75 3.000000 4.0 3.0
IIUC
def f(x):
    for z in range(x.size):
        if x[z] == 0:
            x[z] = np.mean(x[:z+1])
    return x

df.astype(float).apply(f)
A B C D E
0 1.0 2.00 3.000000 0.0 2.0
1 2.0 1.00 7.000000 1.0 1.0
2 3.0 4.00 3.333333 3.0 1.0
3 1.5 1.75 3.000000 4.0 3.0
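All three answers agree that a per-column Python loop is the practical approach here. A self-contained version of the apply-based solution, using the question's data (the function name is mine):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 0],
                   "B": [2, 0, 4, 0],
                   "C": [3, 7, 0, 3],
                   "D": [0, 1, 3, 4],
                   "E": [2, 1, 0, 3]})

def fill_zero_with_running_mean(col):
    # Walk down the column; each 0 becomes the mean of everything from
    # the first row down to (and including) the 0 itself, where earlier
    # zeros have already been replaced by their own means.
    col = col.astype(float)
    for i in range(1, len(col)):
        if col.iloc[i] == 0:
            col.iloc[i] = col.iloc[:i + 1].mean()
    return col

out = df.apply(fill_zero_with_running_mean)
print(out)
```

Note that the sequential dependence (a later 0 sees the values earlier 0s were replaced with) is exactly what rules out a simple expanding-mean one-liner.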

Pandas fillna() not working as expected

I'm trying to replace NaN values in my dataframe with means from the same row.
sample_df = pd.DataFrame({'A': [1.0, np.nan, 5.0],
                          'B': [1.0, 4.0, 5.0],
                          'C': [1.0, 1.0, 4.0],
                          'D': [6.0, 5.0, 5.0],
                          'E': [1.0, 1.0, 4.0],
                          'F': [1.0, np.nan, 4.0]})
sample_mean = sample_df.apply(lambda x: np.mean(x.dropna().values.tolist()) ,axis=1)
Produces:
0 1.833333
1 2.750000
2 4.500000
dtype: float64
But when I try to use fillna() to fill the missing dataframe values with values from the series, it doesn't seem to work.
sample_df.fillna(sample_mean, inplace=True)
A B C D E F
0 1.0 1.0 1.0 6.0 1.0 1.0
1 NaN 4.0 1.0 5.0 1.0 NaN
2 5.0 5.0 4.0 5.0 4.0 4.0
What I expect is:
A B C D E F
0 1.0 1.0 1.0 6.0 1.0 1.0
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.0 5.0 4.0 5.0 4.0 4.0
I've reviewed the other similar questions and can't seem to uncover the issue. Thanks in advance for your help.
By using pandas
sample_df.T.fillna(sample_df.T.mean()).T
Out[1284]:
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
Here's one way -
sample_df[:] = np.where(np.isnan(sample_df), sample_df.mean(1)[:,None], sample_df)
Sample output -
sample_df
Out[61]:
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
Another pandas way:
>>> sample_df.where(pd.notnull(sample_df), sample_df.mean(axis=1), axis='rows')
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
An "if condition is True" logic is in operation here: where elements of pd.notnull(sample_df) are True, use the corresponding elements from sample_df; otherwise use the elements from sample_df.mean(axis=1), applying this logic along axis='rows'.
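The root cause, which all three answers work around, is that fillna with a Series aligns the Series index against the frame's column labels, not its rows; the row means are indexed 0, 1, 2, which match no column, so nothing is filled. One more way to fix that (not from the answers above, but a common idiom) is to fill each row with its own mean via apply along axis=1:

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({'A': [1.0, np.nan, 5.0],
                          'B': [1.0, 4.0, 5.0],
                          'C': [1.0, 1.0, 4.0],
                          'D': [6.0, 5.0, 5.0],
                          'E': [1.0, 1.0, 4.0],
                          'F': [1.0, np.nan, 4.0]})

# Within each row, row.mean() skips NaN by default, so every missing
# cell receives that row's mean of the present values.
filled = sample_df.apply(lambda row: row.fillna(row.mean()), axis=1)
print(filled)
```

This is slower than the vectorized np.where and where answers above, but it makes the row-versus-column alignment issue explicit.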
