I'm still quite new to Python, and after looking intensively here on SO, I've decided to just ask.
I have a DataFrame, df
df
NO2 NO2 NO3 DK1 DK2
0 1.0 3.0 2.0 1.0 1.0
1 1.0 3.0 3.0 3.0 1.0
2 2.0 2.0 2.0 1.0 1.0
Now, what I want to do is sum up all values in row 0 that are equal to the value in column "DK1" (incl. itself) and return it in a new column.
Then after doing that for row 0, the same procedure for row 1, then row 2, etc.
Desired output:
df2
NO2 NO2 NO3 DK1 DK2 Sum
0 1.0 3.0 2.0 1.0 1.0 3.0
1 1.0 3.0 3.0 3.0 1.0 9.0
2 2.0 2.0 2.0 1.0 1.0 2.0
Compare all values with the DK1 column, multiply the resulting boolean mask by that column, and finally sum per row:
df['sum'] = df.eq(df['DK1'], axis=0).mul(df['DK1'], axis=0).sum(axis=1)
print (df)
NO2 NO2.1 NO3 DK1 DK2 sum
0 1.0 3.0 2.0 1.0 1.0 3.0
1 1.0 3.0 3.0 3.0 1.0 9.0
2 2.0 2.0 2.0 1.0 1.0 2.0
Details:
print (df.eq(df['DK1'], axis=0))
NO2 NO2.1 NO3 DK1 DK2
0 True False False True True
1 False True True True False
2 False False False True True
print (df.eq(df['DK1'], axis=0).mul(df['DK1'], axis=0))
NO2 NO2.1 NO3 DK1 DK2
0 1.0 0.0 0.0 1.0 1.0
1 0.0 3.0 3.0 3.0 0.0
2 0.0 0.0 0.0 1.0 1.0
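For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample frame by hand (the duplicate NO2 header is assumed to be renamed to NO2.1, as in the printout above) and uses DataFrame.where as an equivalent alternative to the multiplication step:

```python
import pandas as pd

# Sample frame from the question; the second "NO2" column is named
# "NO2.1" here (an assumption matching the printed output above).
df = pd.DataFrame({
    'NO2': [1.0, 1.0, 2.0],
    'NO2.1': [3.0, 3.0, 2.0],
    'NO3': [2.0, 3.0, 2.0],
    'DK1': [1.0, 3.0, 1.0],
    'DK2': [1.0, 1.0, 1.0],
})

# True wherever a cell equals the DK1 value of its own row
matches = df.eq(df['DK1'], axis=0)

# Keep matching cells, replace the rest with 0, then sum each row
df['sum'] = df.where(matches, 0).sum(axis=1)
print(df)
```

Both variants should give the same result for numeric data; where just avoids the bool-times-number multiplication.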
@jezrael, I didn't know how to put this in a comment:
NO1 NO2 ... DK1 DK2 sum
0 28.4 28.4 ... 21.0 21.0 2121
1 28.2 28.2 ... 25.1 25.1 25,125,125,125,125,125,125,125,125,1
2 28.0 28.0 ... 25.1 25.1 25,125,125,125,125,125,125,125,125,1
3 28.0 28.0 ... 16.0 16.0 1616
4 28.0 28.0 ... 16.4 16.4 16,416,4
Naturally, my actual dataset is not as simple as the ones I started out with. These are my actual values and the result that I get. Does that help?
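One possible reading of that output (an assumption, since the dtypes are not shown) is that the columns hold strings with decimal commas, so the "sum" ends up concatenating text instead of adding numbers. A rough sketch with made-up values in that shape, converting to numeric first:

```python
import pandas as pd

# Hypothetical raw data: numbers stored as text with decimal commas.
raw = pd.DataFrame({
    'NO1': ['28,4', '28,2'],
    'DK1': ['21,0', '25,1'],
    'DK2': ['21,0', '25,1'],
})
print(raw.dtypes)  # object (or string) dtypes hint at text, not floats

# Swap the decimal comma for a dot and parse every column as a number
num = raw.apply(lambda s: pd.to_numeric(s.str.replace(',', '.', regex=False)))

# Same expression as in the answer, now on numeric data
num['sum'] = num.eq(num['DK1'], axis=0).mul(num['DK1'], axis=0).sum(axis=1)
print(num)
```

If the data comes from a CSV file, passing decimal=',' to pd.read_csv may avoid the problem at load time.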
I have the following dataframe in pandas:
Race_ID Athlete_ID Finish_time
0 1.0 1.0 56.1
1 1.0 3.0 60.2
2 1.0 2.0 57.1
3 1.0 4.0 57.2
4 2.0 2.0 56.2
5 2.0 1.0 56.3
6 2.0 3.0 56.4
7 2.0 4.0 56.5
8 3.0 1.0 61.2
9 3.0 2.0 62.1
10 3.0 3.0 60.4
11 3.0 4.0 60.0
12 4.0 2.0 55.0
13 4.0 1.0 54.0
14 4.0 3.0 53.0
15 4.0 4.0 52.0
where Race_ID is in descending order of time (i.e. 1 is the most recent race and 4 is the oldest race).
And I want to add a new column Relative_time#t-1 which is the Athlete's Finish_time in the last race relative to the fastest time in the last race. Hence the output would look something like
Race_ID Athlete_ID Finish_time Relative_time#t-1
0 1.0 1.0 56.1 56.3/56.2
1 1.0 3.0 60.2 56.4/56.2
2 1.0 2.0 57.1 56.2/56.2
3 1.0 4.0 57.2 56.5/56.2
4 2.0 2.0 56.2 62.1/60
5 2.0 1.0 56.3 61.2/60
6 2.0 3.0 56.4 60.4/60
7 2.0 4.0 56.5 60/60
8 3.0 1.0 61.2 54/52
9 3.0 2.0 62.1 55/52
10 3.0 3.0 60.4 53/52
11 3.0 4.0 60.0 52/52
12 4.0 2.0 55.0 0
13 4.0 1.0 54.0 0
14 4.0 3.0 53.0 0
15 4.0 4.0 52.0 0
Here's the code:
data = [[1,1,56.1,'56.3/56.2'],
[1,3,60.2,'56.4/56.2'],
[1,2,57.1,'56.2/56.2'],
[1,4,57.2,'56.5/56.2'],
[2,2,56.2,'62.1/60'],
[2,1,56.3,'61.2/60'],
[2,3,56.4,'60.4/60'],
[2,4,56.5,'60/60'],
[3,1,61.2,'54/52'],
[3,2,62.1,'55/52'],
[3,3,60.4,'53/52'],
[3,4,60,'52/52'],
[4,2,55,'0'],
[4,1,54,'0'],
[4,3,53,'0'],
[4,4,52,'0']]
df = pd.DataFrame(data,columns=['Race_ID','Athlete_ID','Finish_time','Relative_time#t-1'],dtype=float)
I intentionally made the Relative_time#t-1 as str instead of int to show the formula.
Here is what I have tried:
df.sort_values(by = ['Race_ID', 'Athlete_ID'], ascending=[True, True], inplace=True)
df['Finish_time#t-1'] = df.groupby('Athlete_ID')['Finish_time'].shift(-1)
df['Finish_time#t-1'] = df['Finish_time#t-1'].replace(np.nan, 0, regex = True)
So I get the numerator for the new column but I don't know how to get the minimum time for each Race_ID (i.e. the value in the denominator)
Thank you in advance.
Try this:
(df.groupby('Athlete_ID')['Finish_time']
.shift(-1)
.div(df['Race_ID'].map(
df.groupby('Race_ID')['Finish_time']
.min()
.shift(-1)))
.fillna(0))
Output:
0 1.001779
1 1.003559
2 1.000000
3 1.005338
4 1.035000
5 1.020000
6 1.006667
7 1.000000
8 1.038462
9 1.057692
10 1.019231
11 1.000000
12 0.000000
13 0.000000
14 0.000000
15 0.000000
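Here is the same idea as a self-contained sketch that stores the result in the new column. It assumes the Race_ID values are consecutive integers, so that shifting the per-race minimum by one position lines each race up with the one before it in time; the intermediate names are only illustrative:

```python
import pandas as pd

data = [[1, 1, 56.1], [1, 3, 60.2], [1, 2, 57.1], [1, 4, 57.2],
        [2, 2, 56.2], [2, 1, 56.3], [2, 3, 56.4], [2, 4, 56.5],
        [3, 1, 61.2], [3, 2, 62.1], [3, 3, 60.4], [3, 4, 60.0],
        [4, 2, 55.0], [4, 1, 54.0], [4, 3, 53.0], [4, 4, 52.0]]
df = pd.DataFrame(data, columns=['Race_ID', 'Athlete_ID', 'Finish_time'])

# Numerator: each athlete's time in the previous race (the next row
# for that athlete, because a larger Race_ID means an older race)
prev_time = df.groupby('Athlete_ID')['Finish_time'].shift(-1)

# Denominator: fastest time per race, shifted so that race k
# points at the minimum of race k + 1
race_min = df.groupby('Race_ID')['Finish_time'].min()
prev_race_min = df['Race_ID'].map(race_min.shift(-1))

# The oldest race has no predecessor, so its NaN ratios become 0
df['Relative_time#t-1'] = prev_time.div(prev_race_min).fillna(0)
print(df)
```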
What am I missing? I tried appending .round(3) to the end of the API call, but it doesn't work, and it also doesn't work in separate calls. The data type of all columns is numpy.float32.
>>> summary_data = api._get_data(units=list(units.id),
downsample=downsample,
table='summary_tb',
db=db).astype(np.float32)
>>> summary_data.head()
id asset_id cycle hs alt Mach TRA T2
0 10.0 1.0 1.0 1.0 3081.0 0.37945 70.399887 522.302124
1 20.0 1.0 1.0 1.0 3153.0 0.38449 70.575668 522.428162
2 30.0 1.0 1.0 1.0 3229.0 0.39079 70.575668 522.645020
3 40.0 1.0 1.0 1.0 3305.0 0.39438 70.575668 522.651184
4 50.0 1.0 1.0 1.0 3393.0 0.39690 70.663559 522.530090
>>> summary_data = summary_data.round(3)
>>> summary_data.head()
id asset_id cycle hs alt Mach TRA T2
0 10.0 1.0 1.0 1.0 3081.0 0.379 70.400002 522.302002
1 20.0 1.0 1.0 1.0 3153.0 0.384 70.575996 522.427979
2 30.0 1.0 1.0 1.0 3229.0 0.391 70.575996 522.645020
3 40.0 1.0 1.0 1.0 3305.0 0.394 70.575996 522.651001
4 50.0 1.0 1.0 1.0 3393.0 0.397 70.664001 522.530029
>>> print(type(summary_data))
pandas.core.frame.DataFrame
>>> print([type(summary_data[col][0]) for col in summary_data.columns])
[numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32]
It does in fact look like some form of rounding is taking place, but something weird is happening. Thanks in advance.
EDIT
The point of this is to use 32-bit floating-point numbers, not 64-bit. I have since used pd.set_option('precision', 3), but according to the documentation this only affects the display, not the underlying value. As mentioned in a comment below, I am trying to minimize the number of atomic operations. Calculations on 70.575996 vs 70.57600 are more expensive, and this is the issue I am trying to tackle. Thanks in advance.
Hmm, this might be a floating-point issue. You could change the dtype to float instead of np.float32:
>>> summary_data.astype(float).round(3)
id asset_id cycle hs alt Mach TRA T2
0 10.0 1.0 1.0 1.0 3081.0 0.379 70.400 522.302
1 20.0 1.0 1.0 1.0 3153.0 0.384 70.576 522.428
2 30.0 1.0 1.0 1.0 3229.0 0.391 70.576 522.645
3 40.0 1.0 1.0 1.0 3305.0 0.394 70.576 522.651
4 50.0 1.0 1.0 1.0 3393.0 0.397 70.664 522.530
If you change it back to np.float32 afterwards, it re-exhibits the issue:
>>> summary_data.astype(float).round(3).astype(np.float32)
id asset_id cycle hs alt Mach TRA T2
0 10.0 1.0 1.0 1.0 3081.0 0.379 70.400002 522.302002
1 20.0 1.0 1.0 1.0 3153.0 0.384 70.575996 522.427979
2 30.0 1.0 1.0 1.0 3229.0 0.391 70.575996 522.645020
3 40.0 1.0 1.0 1.0 3305.0 0.394 70.575996 522.651001
4 50.0 1.0 1.0 1.0 3393.0 0.397 70.664001 522.530029
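As far as I can tell, the rounding itself works. The value 70.576 simply has no exact binary representation, and a 32-bit float only carries about 7 significant decimal digits, so the nearest storable value prints as 70.575996. A small sketch to check this with plain NumPy (the variable names are just for illustration):

```python
import numpy as np

x32 = np.float32(70.576)

# Widening to a Python float exposes the value actually stored in 32 bits
print(float(x32))

# Gap between adjacent float32 values in this range
print(np.spacing(x32))

# Check whether rounding a float32 lands on that same nearest value
print(np.round(np.float32(70.575668), 3) == x32)
```

So the extra digits are a display matter: a float32 occupies the same 32 bits whatever value it holds.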
I have the following dataframe, with cumulative results quarter by quarter that reset at 1°Q.
I need the net variation per quarter, so I need to subtract each column from the next one, except for the 1°Q columns.
from pandas import DataFrame
data = {'Financials': ['EPS','Earnings','Sales','Margin'],
'1°Q19': [1,2,3,4],
'2°Q19': [2,4,6,8],
'3°Q19': [3,6,9,12],
'4°Q19': [4,8,12,16],
'1°Q20': [1,2,3,4],
'2°Q20': [2,4,6,8],
'3°Q20': [3,6,9,12],
'4°Q20': [4,8,12,16]
}
df = DataFrame(data,columns=['Financials','1°Q19','2°Q19','3°Q19','4°Q19',
'1°Q20','2°Q20','3°Q20','4°Q20'])
print(df)
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1 2 3 4 1 2 3 4
1 Earnings 2 4 6 8 2 4 6 8
2 Sales 3 6 9 12 3 6 9 12
3 Margin 4 8 12 16 4 8 12 16
I've started like this and then I got stuck big time:
if ~df.columns.str.contains('1°Q'):
    # here I want to subtract: 1°Q remains unchanged, then 2°Q - 1°Q, 3°Q - 2°Q, 4°Q - 3°Q
In order to get this desired result:
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
I've tried
new_df = df.diff(axis=1).fillna(df)
print(new_df)
But the result in this case is not the desired one for 1°Q20:
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1.0 1.0 1.0 1.0 -3.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0 -6.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0 -9.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0 -12.0 4.0 4.0 4.0
IIUC, use DataFrame.diff with axis=1 and then fill the NaN values with DataFrame.fillna:
new_df = df.diff(axis=1).fillna(df)
print(new_df)
Financials 1°Q 2°Q 3°Q 4°Q
0 EPS 1.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0
for expected output:
new_df = new_df.astype(int)
EDIT
df.groupby(df.columns.str.contains('1°Q').cumsum(),axis=1).diff(axis=1).fillna(df)
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
or
df.diff(axis=1).T.mask(df.columns.to_series().str.contains('1°Q')).T.fillna(df)
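Recent pandas releases deprecate groupby with axis=1, and diff can also fail on the text column, so here is a sketch of the same idea that avoids both. It assumes Financials is moved into the index first; net and is_q1 are just illustrative names:

```python
import pandas as pd

data = {'Financials': ['EPS', 'Earnings', 'Sales', 'Margin'],
        '1°Q19': [1, 2, 3, 4], '2°Q19': [2, 4, 6, 8],
        '3°Q19': [3, 6, 9, 12], '4°Q19': [4, 8, 12, 16],
        '1°Q20': [1, 2, 3, 4], '2°Q20': [2, 4, 6, 8],
        '3°Q20': [3, 6, 9, 12], '4°Q20': [4, 8, 12, 16]}

# Keep only numeric columns in the frame so diff works on all of them
df = pd.DataFrame(data).set_index('Financials')

# Boolean flag per column: True for every first quarter
is_q1 = df.columns.str.contains('1°Q')

# Column-over-column difference, then restore the first-quarter
# columns, which must keep their original (already net) values
net = df.diff(axis=1)
net.loc[:, is_q1] = df.loc[:, is_q1]
net = net.astype(int)
print(net.reset_index())
```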
You can leverage df.shift for the subtraction, and fillna to fix the NaN values left from the shift
df=df.set_index('Financials')
df-(df.shift(1, axis=1).fillna(0))
1°Q 2°Q 3°Q 4°Q
Financials
EPS 1.0 1.0 1.0 1.0
Earnings 2.0 2.0 2.0 2.0
Sales 3.0 3.0 3.0 3.0
Margin 4.0 4.0 4.0 4.0
I have a DataFrame (df1) as given below
Hair Feathers Legs Type Count
R1 1 NaN 0 1 1
R2 1 0 NaN 1 32
R3 1 0 2 1 4
R4 1 NaN 4 1 27
I want to merge rows based on the different combinations of values in each column and also add up the Count values for each merged row. The resulting DataFrame (df2) will look like this:
Hair Feathers Legs Type Count
R1 1 0 0 1 33
R2 1 0 2 1 36
R3 1 0 4 1 59
The merging is performed in such a way that any NaN value will be merged with 0 or 1. In df2, R1 is calculated by merging the NaN value of Feathers (df1, R1) with the 0 value of Feathers (df1, R2). Similarly, the value of 0 in Legs (df1, R1) is merged with the NaN value of Legs (df1, R2). Then the counts of R1 (1) and R2 (32) are added. In the same manner, R2 and R3 are merged because the Feathers value in R2 (df1) equals the one in R3 (df1) and the NaN value of Legs in R2 is merged with the 2 in R3 (df1), and the counts of R2 (32) and R3 (4) are added.
I hope the explanation makes sense. Any help will be highly appreciated
A possible way to do it is by replicating each of the rows containing NaN and filling them with the possible values for the column.
First, we need to get the possible not-null unique values per column:
unique_values = df.iloc[:, :-1].apply(
lambda x: x.dropna().unique().tolist(), axis=0).to_dict()
> unique_values
{'Hair': [1.0], 'Feathers': [0.0], 'Legs': [0.0, 2.0, 4.0], 'Type': [1.0]}
Then iterate through each row of the dataframe and replace each NaN by the possible values for each column. We can do this using pandas.DataFrame.iterrows:
mask = df.iloc[:, :-1].isnull().any(axis=1)
# Keep the rows that do not contain NaN,
# then add the modified rows
list_of_df = [r for i, r in df[~mask].iterrows()]
for row_index, row in df[mask].iterrows():
    for c in row[row.isnull()].index:
        # For each NaN column of the row, replace
        # NaN by the possible values for the column
        for v in unique_values[c]:
            list_of_df.append(row.copy().fillna({c: v}))
df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T
The result is a dataframe where all the NaN have been filled with possible values for the column:
> df_res
Hair Feathers Legs Type Count
0 1.0 0.0 2.0 1.0 4.0
1 1.0 0.0 0.0 1.0 1.0
2 1.0 0.0 0.0 1.0 32.0
3 1.0 0.0 2.0 1.0 32.0
4 1.0 0.0 4.0 1.0 32.0
5 1.0 0.0 4.0 1.0 27.0
To get the final result of Count grouping by possible combinations of ['Hair', 'Feathers', 'Legs', 'Type'] we just need to do:
> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()
Hair Feathers Legs Type Count
0 1.0 0.0 0.0 1.0 33.0
1 1.0 0.0 2.0 1.0 36.0
2 1.0 0.0 4.0 1.0 59.0
Hope it serves
UPDATE
If more than one element in a row is missing, the procedure has to look at all the possible combinations for the missing values at the same time. Let us add a new row with two elements missing:
> df
Hair Feathers Legs Type Count
0 1.0 NaN 0.0 1.0 1.0
1 1.0 0.0 NaN 1.0 32.0
2 1.0 0.0 2.0 1.0 4.0
3 1.0 NaN 4.0 1.0 27.0
4 1.0 NaN NaN 1.0 32.0
We will proceed in a similar way, but the replacement combinations will be obtained using itertools.product:
import itertools
unique_values = df.iloc[:, :-1].apply(
    lambda x: x.dropna().unique().tolist(), axis=0).to_dict()
mask = df.iloc[:, :-1].isnull().any(axis=1)
list_of_df = [r for i, r in df[~mask].iterrows()]
for row_index, row in df[mask].iterrows():
    cols = row[row.isnull()].index.tolist()
    for p in itertools.product(*[unique_values[c] for c in cols]):
        list_of_df.append(row.copy().fillna({c: v for c, v in zip(cols, p)}))
df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T
> df_res.sort_values(['Hair', 'Feathers', 'Legs', 'Type']).reset_index(drop=True)
Hair Feathers Legs Type Count
1 1.0 0.0 0.0 1.0 1.0
2 1.0 0.0 0.0 1.0 32.0
6 1.0 0.0 0.0 1.0 32.0
0 1.0 0.0 2.0 1.0 4.0
3 1.0 0.0 2.0 1.0 32.0
7 1.0 0.0 2.0 1.0 32.0
4 1.0 0.0 4.0 1.0 32.0
5 1.0 0.0 4.0 1.0 27.0
8 1.0 0.0 4.0 1.0 32.0
> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()
Hair Feathers Legs Type Count
0 1.0 0.0 0.0 1.0 65.0
1 1.0 0.0 2.0 1.0 68.0
2 1.0 0.0 4.0 1.0 91.0
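In case it is useful, here is the whole procedure wrapped into one self-contained sketch and run on the original four rows of the question. The helper name expand_and_sum and its count_col parameter are made up for this example:

```python
import itertools

import numpy as np
import pandas as pd


def expand_and_sum(df, count_col='Count'):
    """Fill NaN keys with every observed value, then sum the counts."""
    keys = [c for c in df.columns if c != count_col]
    # Non-null values observed in each key column
    candidates = {c: df[c].dropna().unique().tolist() for c in keys}
    rows = []
    for _, row in df.iterrows():
        missing = [c for c in keys if pd.isna(row[c])]
        # product() over zero iterables yields one empty tuple,
        # so complete rows pass through unchanged
        for combo in itertools.product(*(candidates[c] for c in missing)):
            filled = row.to_dict()
            filled.update(zip(missing, combo))
            rows.append(filled)
    expanded = pd.DataFrame(rows, columns=df.columns)
    return expanded.groupby(keys, as_index=False)[count_col].sum()


df1 = pd.DataFrame({
    'Hair': [1, 1, 1, 1],
    'Feathers': [np.nan, 0, 0, np.nan],
    'Legs': [0, np.nan, 2, 4],
    'Type': [1, 1, 1, 1],
    'Count': [1, 32, 4, 27],
}, index=['R1', 'R2', 'R3', 'R4'])

result = expand_and_sum(df1)
print(result)
```

Note that the number of expanded rows grows with the product of the candidate values, so this may get slow for rows with many missing columns.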
I couldn't find an efficient way of doing that.
I have the DataFrame below in Python, with columns from A to Z
A B C ... Z
0 2.0 8.0 1.0 ... 5.0
1 3.0 9.0 0.0 ... 4.0
2 4.0 9.0 0.0 ... 3.0
3 5.0 8.0 1.0 ... 2.0
4 6.0 8.0 0.0 ... 1.0
5 7.0 9.0 1.0 ... 0.0
I need to multiply each of the columns from B to Z by A, (B x A, C x A, ..., Z x A), and save the results on new columns (R1, R2 ..., R25).
I would have something like this:
A B C ... Z R1 R2 ... R25
0 2.0 8.0 1.0 ... 5.0 16.0 2.0 ... 10.0
1 3.0 9.0 0.0 ... 4.0 27.0 0.0 ... 12.0
2 4.0 9.0 0.0 ... 3.0 36.0 0.0 ... 12.0
3 5.0 8.0 1.0 ... 2.0 40.0 5.0 ... 10.0
4 6.0 8.0 0.0 ... 1.0 48.0 0.0 ... 6.0
5 7.0 9.0 1.0 ... 0.0 63.0 7.0 ... 0.0
I was able to calculate the results using the code below, but from here I would need to merge them with the original df. That doesn't sound efficient. There must be a simple, clean way of doing this.
df.loc[:,'B':'D'].multiply(df['A'], axis="index")
That's an example, my real DataFrame has 160 columns x 16k rows.
Create the new column names with a list comprehension and then join the result to the original:
df1 = df.loc[:,'B':'D'].multiply(df['A'], axis="index")
df1.columns = ['R{}'.format(x) for x in range(1, len(df1.columns) + 1)]
df = df.join(df1)
print (df)
A B C Z R1 R2
0 2.0 8.0 1.0 5.0 16.0 2.0
1 3.0 9.0 0.0 4.0 27.0 0.0
2 4.0 9.0 0.0 3.0 36.0 0.0
3 5.0 8.0 1.0 2.0 40.0 5.0
4 6.0 8.0 0.0 1.0 48.0 0.0
5 7.0 9.0 1.0 0.0 63.0 7.0
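To cover every column except A without spelling out a label range, something along these lines might work. It is a sketch on a cut-down frame with only the columns A, B, C and Z, and others and products are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [2.0, 3.0, 4.0],
    'B': [8.0, 9.0, 9.0],
    'C': [1.0, 0.0, 0.0],
    'Z': [5.0, 4.0, 3.0],
})

# Every column except the multiplier
others = df.columns.drop('A')

# Row-wise multiplication of all those columns by A in one call
products = df[others].mul(df['A'], axis=0)
products.columns = [f'R{i}' for i in range(1, len(others) + 1)]

# Attach all new columns at once
out = pd.concat([df, products], axis=1)
print(out)
```

Building all result columns in one frame and concatenating once should scale fine to 160 columns x 16k rows, since it avoids adding columns one by one.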