Need to compare one Pandas (Python) dataframe with values from another dataframe

So I've pulled data from an SQL server and loaded it into a dataframe. All the data is discrete and increases in 0.1 steps in one direction (0.0, 0.1, 0.2... 9.8, 9.9, 10.0), with multiple power values at each step (e.g. 1000, 1412, 134.5, 657.1 at step 0.1; 14.5, 948.1, 343.8 at step 5.5) - hopefully you see what I'm trying to say.
I've managed to group the data into these individual steps using the following, and have then taken the mean and standard deviation for each group.
group = df.groupby('step').power.mean()
group2 = df.groupby('step').power.std().fillna(0)
This results in two Series (group and group2) which hold the mean and standard deviation for each of the 0.1 steps. It's then easy to create an upper and lower limit for each step using the following:
upperlimit = group + 3*group2
lowerlimit = group - 3*group2
lowerlimit[lowerlimit<0] = 0
Now comes the bit I'm confused about! I need to go back into the original dataframe and remove rows/instances where the power value is outside these calculated limits (note there is a different upper and lower limit for each 0.1 step).
Here's 50 lines of the sample data:
Index Power Step
0 106.0 5.0
1 200.4 5.5
2 201.4 5.6
3 226.9 5.6
4 206.8 5.6
5 177.5 5.3
6 124.0 4.9
7 121.0 4.8
8 93.9 4.7
9 135.6 5.0
10 211.1 5.6
11 265.2 6.0
12 281.4 6.2
13 417.9 6.9
14 546.0 7.4
15 619.9 7.9
16 404.4 7.1
17 241.4 5.8
18 44.3 3.9
19 72.1 4.6
20 21.1 3.3
21 6.3 2.3
22 0.0 0.8
23 0.0 0.9
24 0.0 3.2
25 0.0 4.6
26 33.3 4.2
27 97.7 4.7
28 91.0 4.7
29 105.6 4.8
30 97.4 4.6
31 126.7 5.0
32 134.3 5.0
33 133.4 5.1
34 301.8 6.3
35 298.5 6.3
36 312.1 6.5
37 505.3 7.5
38 491.8 7.3
39 404.6 6.8
40 324.3 6.6
41 347.2 6.7
42 365.3 6.8
43 279.7 6.3
44 351.4 6.8
45 350.1 6.7
46 573.5 7.9
47 490.1 7.5
48 520.4 7.6
49 548.2 7.9

To put your goal another way: you want to perform some manipulations on grouped data, and then project the results of those manipulations back onto the ungrouped rows so you can use them to filter those rows. One way to do this is with transform:
The transform method returns an object that is indexed the same (same size) as the one being grouped. Thus, the passed transform function should return a result that is the same size as the group chunk.
You can then create the new columns directly (np.nan_to_num turns the NaN std of a single-row group into 0, since a scalar has no fillna method):
df['upper'] = df.groupby('step').power.transform(lambda p: p.mean() + 3*np.nan_to_num(p.std()))
df['lower'] = df.groupby('step').power.transform(lambda p: p.mean() - 3*np.nan_to_num(p.std()))
df.loc[df['lower'] < 0, 'lower'] = 0
And filter accordingly:
df = df[(df.power <= df['upper']) & (df.power >= df['lower'])]
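If you'd rather keep the per-step limits as a separate table, an equivalent merge-based sketch (assuming the same 'step' and 'power' column names used in the code above) computes the statistics once and joins them back:

import pandas as pd

# Per-step stats in one pass; fillna(0) handles single-row steps whose std is NaN.
stats = df.groupby('step').power.agg(['mean', 'std']).fillna(0)
stats['upper'] = stats['mean'] + 3 * stats['std']
stats['lower'] = (stats['mean'] - 3 * stats['std']).clip(lower=0)

# Attach each row's limits via its step value, then filter.
merged = df.merge(stats[['upper', 'lower']], left_on='step', right_index=True)
df = merged[(merged.power >= merged.lower) & (merged.power <= merged.upper)]

This avoids recomputing the group statistics once per new column.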

Related

Multiply values in certain columns by fixed metric if multiple conditions exist

I have a dataset of hockey statistics and I want to apply a weight multiplier to certain statistics based on certain conditions.
A snippet of my dataset:
Player Pos GP G GF/GP S Shots/GP S% TOI/GP
2 Andrew Cogliano 1.0 79.2 11.0 0.1 126.8 1.6 8.3 14.44
12 Artturi Lehkonen 2.0 73.0 14.6 0.2 158.6 2.2 9.3 15.29
28 Cale Makar 4.0 59.3 16.0 0.3 155.0 2.6 9.8 23.67
31 Darren Helm 1.0 66.6 10.5 0.2 125.0 1.9 8.6 14.37
61 Gabriel Landeskog 2.0 72.0 24.3 0.3 196.1 2.7 12.8 19.46
103 Nathan MacKinnon 1.0 73.8 27.8 0.4 274.4 3.7 9.9 19.69
What I am trying to do is create a function that multiplies 'G', 'GF/GP', 'S', and 'Shots/GP' by a specific weight - 1.1 for example.
But I want to only do that for players based on two categories:
Defence ('Pos' = 4.0) with 50 or more games ('GP') and 20 min or more time on ice per game ('TOI/GP')
Offense ('Pos' != 4.0) with 50 or more games ('GP') and 14 min or more time on ice per game ('TOI/GP')
I can identify these groups by:
def_cond = df.loc[(df["Pos"]==4.0) & (df["GP"]>=50) & (df["TOI/GP"] >=20.00)]
off_cond = df.loc[(df["Pos"]!=4.0) & (df["GP"]>=50) & (df["TOI/GP"] >=14.00)]
Output for def_cond:
Player Pos GP G GF/GP S Shots/GP S% TOI/GP
28 Cale Makar 4.0 59.3 16.0 0.3 155.0 2.6 9.8 23.67
41 Devon Toews 4.0 58.8 8.2 0.1 120.5 2.1 6.7 22.14
45 Erik Johnson 4.0 67.4 7.3 0.1 140.9 2.1 5.1 22.22
112 Samuel Girard 4.0 68.0 4.4 0.1 90.8 1.3 5.0 20.75
Issue:
What I want to do is take this output and multiply 'G', 'GF/GP', 'S', and 'Shots/GP' by a weight value - again 1.1 for example.
I have tried various things such as:
if def_cond == True:
    df[["G", "GF/GP", "S", "Shots/GP"]].multiply(1.1, axis="index")
Or simply
if def_cond == True:
    df["G"] = (df["G"]*1.1)
Pretty much everything I try results in the following error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I am new to this so any and all advice is welcome!
I would try this:
def f(df, weight):
    for i in df.index:
        if (
            (df.loc[i, 'Pos'] == 4.0 and df.loc[i, 'GP'] >= 50
             and df.loc[i, 'TOI/GP'] >= 20)
            or
            (df.loc[i, 'Pos'] != 4.0 and df.loc[i, 'GP'] >= 50
             and df.loc[i, 'TOI/GP'] >= 14)
        ):
            df.loc[i, ['G', 'GF/GP', 'S', 'Shots/GP']] *= weight
Though I'm pretty sure it is not the best solution...
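Incidentally, the ValueError in the question comes from putting a whole DataFrame (def_cond) in a boolean context; boolean masks avoid that entirely, and also replace the row loop. A minimal vectorized sketch, assuming the same column names and the example weight of 1.1:

cols = ['G', 'GF/GP', 'S', 'Shots/GP']
def_mask = (df['Pos'] == 4.0) & (df['GP'] >= 50) & (df['TOI/GP'] >= 20.0)
off_mask = (df['Pos'] != 4.0) & (df['GP'] >= 50) & (df['TOI/GP'] >= 14.0)
# Scale the stat columns in place for every row matching either condition.
df.loc[def_mask | off_mask, cols] *= 1.1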

AttributeError: 'list' object has no attribute 'assign'

I have this dataframe:
SRC Coup Vint Bal Mar Apr May Jun Jul BondSec
0 JPM 1.5 2021 43.9 5.6 4.9 4.9 5.2 4.4 FNCL
1 JPM 1.5 2020 41.6 6.2 6.0 5.6 5.8 4.8 FNCL
2 JPM 2.0 2021 503.9 7.1 6.3 5.8 6.0 4.9 FNCL
3 JPM 2.0 2020 308.3 9.3 7.8 7.5 7.9 6.6 FNCL
4 JPM 2.5 2021 345.0 8.6 7.8 6.9 6.8 5.6 FNCL
5 JPM 4.5 2010 5.7 21.3 20.0 18.0 17.7 14.6 G2SF
6 JPM 5.0 2019 2.8 39.1 37.6 34.6 30.8 24.2 G2SF
7 JPM 5.0 2018 7.3 39.8 37.1 33.4 30.1 24.2 G2SF
8 JPM 5.0 2010 3.9 23.3 20.0 18.6 17.9 14.6 G2SF
9 JPM 5.0 2009 4.2 22.8 21.2 19.5 18.6 15.4 G2SF
I want to duplicate all the rows that have FNCL as the BondSec, and rename the value of BondSec in those new duplicate rows to FGLMC. I'm able to accomplish half of that with the following code:
if "FGLMC" not in jpm['BondSec']:
is_FNCL = jpm['BondSec'] == "FNCL"
FNCL_try = jpm[is_FNCL]
jpm.append([FNCL_try]*1,ignore_index=True)
But if I instead try to implement the change to the BondSec value in the same line as below:
jpm.append(([FNCL_try]*1).assign(**{'BondSecurity': 'FGLMC'}),ignore_index=True)
I get the following error:
AttributeError: 'list' object has no attribute 'assign'
Additionally, I would like to insert the duplicated rows based on an index condition, not just at the bottom as additional rows. The condition cannot be simply a row position because this will have to work on future files with different numbers of rows. So I would like to insert the duplicated rows at the position where the BondSec column values change from FNCL to FNCI (FNCI is not showing here, but basically it would be right below the last row with FNCL). I'm assuming this could be done with an np.where function call, but I'm not sure how to implement that.
I'll also eventually want to do this same exact process with rows with FNCI as the BondSec value (duplicating them and transforming the BondSec value to FGCI, and inserting at the index position right below the last row with FNCI as the value).
I'd suggest a helper function to handle all your duplications:
def duplicate_and_rename(df, target, value):
    return pd.concat([df, df[df["BondSec"] == target].assign(BondSec=value)])
Then
for target, value in (("FNCL", "FGLMC"), ("FNCI", "FGCI")):
    df = duplicate_and_rename(df, target, value)
Then, after all that, you can convert the BondSec column to a categorical with a custom order and sort on it, which drops each duplicated block directly below its source rows:
ordering = ["FNCL", "FGLMC", "FNCI", "FGCI", "G2SF"]
df["BondSec"] = pd.Categorical(df["BondSec"], ordering)
df = df.sort_values("BondSec").reset_index(drop=True)
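A quick end-to-end demonstration on a toy frame (the Bal values here are made up purely for illustration):

import pandas as pd

def duplicate_and_rename(df, target, value):
    return pd.concat([df, df[df["BondSec"] == target].assign(BondSec=value)])

df = pd.DataFrame({"BondSec": ["FNCL", "FNCI", "G2SF"], "Bal": [1.0, 2.0, 3.0]})
for target, value in (("FNCL", "FGLMC"), ("FNCI", "FGCI")):
    df = duplicate_and_rename(df, target, value)

df["BondSec"] = pd.Categorical(df["BondSec"], ["FNCL", "FGLMC", "FNCI", "FGCI", "G2SF"])
df = df.sort_values("BondSec").reset_index(drop=True)
# Rows now come out FNCL, FGLMC, FNCI, FGCI, G2SF, so each duplicate
# sits right below the block it was copied from.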
Alternatively, you can use a dictionary for your ordering, as explained in this answer.

Creating a new column from two columns using a dictionary in Pandas

I want to create a column based on a per-group cutoff: each group in one column has its own threshold, which is compared against the values of another column.
The dataframe is below:
df_in ->
unique_id myvalue identif
0 CTA15 19.0 TOP
1 CTA15 22.0 TOP
2 CTA15 28.0 TOP
3 CTA15 18.0 TOP
4 CTA15 22.4 TOP
5 AC007 2.0 TOP
6 AC007 2.3 SDME
7 AC007 2.0 SDME
8 AC007 5.0 SDME
9 AC007 3.0 SDME
10 AC007 31.4 SDME
11 AC007 4.4 SDME
12 CGT6 9.7 BTME
13 CGT6 44.5 BTME
14 TVF5 6.7 BTME
15 TVF5 9.1 BTME
16 TVF5 10.0 BTME
17 BGD1 1.0 BTME
18 BGD1 1.6 NON
19 GHB 51.0 NON
20 GHB 54.0 NON
21 GHB 4.7 NON
So I have created a dictionary with one threshold for each group of the 'identif' column:
md = {'TOP': 22, 'SDME': 10, 'BTME': 20, 'NON':20}
So my goal is to create a new column, say 'chk', based on the following condition: if the "identif" value matches a key in the dictionary "md" and the corresponding "myvalue" entry is >= the value stored under that key, 'chk' should be 1, otherwise 0.
However, I am trying to find a good way using map/groupby/apply to create the new output data frame. Right now I am doing it in a very inefficient way (which takes considerable time on real data with a million rows), using the following function:
def myfilter(df, idCol, valCol, mydict):
    for index, row in df.iterrows():
        for key, value in mydict.items():
            if row[idCol] == key and row[valCol] >= value:
                df.loc[index, 'chk'] = 1
            elif row[idCol] == key and row[valCol] < value:
                df.loc[index, 'chk'] = 0
    return df
Getting the output via the following call:
df_out = myfilter(df_in, 'identif', 'myvalue', md)
So my output will be like:
df_out ->
unique_id myvalue identif chk
0 CTA15 19.0 TOP 0
1 CTA15 22.0 TOP 1
2 CTA15 28.0 TOP 1
3 CTA15 18.0 TOP 0
4 CTA15 22.4 TOP 1
5 AC007 2.0 TOP 0
6 AC007 2.3 SDME 0
7 AC007 2.0 SDME 0
8 AC007 5.0 SDME 0
9 AC007 3.0 SDME 0
10 AC007 31.4 SDME 1
11 AC007 4.4 SDME 0
12 CGT6 9.7 BTME 0
13 CGT6 44.5 BTME 1
14 TVF5 6.7 BTME 0
15 TVF5 9.1 BTME 0
16 TVF5 10.0 BTME 0
17 BGD1 1.0 BTME 0
18 BGD1 1.6 NON 0
19 GHB 51.0 NON 1
20 GHB 54.0 NON 1
21 GHB 4.7 NON 0
This works, but it's extremely inefficient, and I would like a much better way to do it.
This should be faster:
def func(identif, value):
    if identif in md:
        if value >= md[identif]:
            return 1.0
        else:
            return 0.0
    else:
        return np.nan

df['chk'] = df.apply(lambda row: func(row['identif'], row['myvalue']), axis=1)
The timing on this little example:
CPU times: user 1.64 ms, sys: 73 µs, total: 1.71 ms
Wall time: 1.66 ms
Your version timing:
CPU times: user 8.6 ms, sys: 1.92 ms, total: 10.5 ms
Wall time: 8.79 ms
Although on such a small example it's not conclusive.
First, you're doing redundant work: for each row in the data frame you're traversing every element in your dictionary. You can change your function to look the key up once per row instead, which will speed up your original function. Try something like:
def myfilter(df, idCol, valCol, mydict):
    for index, row in df.iterrows():
        value = mydict.get(row[idCol])
        if value is None:
            continue
        if row[valCol] >= value:
            df.loc[index, 'chk'] = 1
        else:
            df.loc[index, 'chk'] = 0
    return df
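You can also drop the Python-level loop entirely by mapping the thresholds onto the identif column. A sketch, assuming unmatched identif values should become NaN as in the apply version above:

import numpy as np

# Look up each row's threshold in one vectorized pass (NaN where identif is not in md).
thresholds = df['identif'].map(md)
df['chk'] = np.where(thresholds.isna(), np.nan,
                     (df['myvalue'] >= thresholds).astype(float))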

Array into dataframe interpolation

I have the following array:
[299.13953679 241.1902389 192.58645951 ... 8.53750551 24.38822528
71.61117789]
For each value in the array I want to get the interpolated wind speed based on the values in the column power in the following pd.DataFrame:
wind speed power
5 2.5 0
6 3.0 25
7 3.5 82
8 4.0 154
9 4.5 244
10 5.0 354
11 5.5 486
12 6.0 643
13 6.5 827
14 7.0 1038
15 7.5 1272
16 8.0 1525
17 8.5 1794
18 9.0 2037
19 9.5 2211
20 10.0 2362
21 10.5 2386
22 11.0 2400
So basically I'd like to retrieve the following array:
[4.7 4.5 4.3 ... 2.6 3.0 3.4]
Any suggestions on where to start? I was looking at the pd.DataFrame.interpolate function, but reading through its functionality it does not seem to help with my problem. Or am I wrong?
Using interp from numpy:
np.interp(ary,df['power'].values,df['wind speed'].values)
Out[202]:
array([4.75063426, 4.48439022, 4.21436922, 2.67075011, 2.98776451,
3.40886998])
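One caveat worth knowing: np.interp assumes its x-coordinates (here, the power column) are increasing, which holds for this power curve, and it clamps queries outside the table to the endpoint values. A tiny self-contained check using the first three rows of the table above, with made-up query values:

import numpy as np

power = np.array([0, 25, 82])
wind_speed = np.array([2.5, 3.0, 3.5])
print(np.interp([10, 50], power, wind_speed))   # approx. [2.7, 3.22]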

Import space delimited .csv into python3, ignoring text at the start?

I would like to import the following .csv data (.txt file) into python lists for each column of data, ignoring the text at the start. I can't change the format of the file. I'm getting the error:
"Traceback (most recent call last):
File "/Users/Hamish/Desktop/Python/AWBM/Import.py", line 13, in <module>
rain_column = float(row[7])
IndexError: list index out of range"
This is the code which I'm trying to get working...
import csv
import numpy as np

file = open('Data_Bris.txt')
reader = csv.reader(file, delimiter=' ')
datelist = []
rainlist = []
evaplist = []
for row in reader:
    # row = [date, day, date2, T.Max, Smx, T.Min, Smn, Rain, Srn, Evap, Sev, Rad, Ssl, VP, Svp, maxT, minT, Span, Ssp]
    date_column = str(row[0])
    rain_column = float(row[7])
    evap_column = float(row[9])
    datelist.append([date_column])
    rainlist.append([rain_column])
    evaplist.append([evap_column])
date = np.array([datelist])
rain = np.array([rainlist])
evap = np.array([evaplist])
timeseries = np.arange(rain.size)
This is the data file that I would like to import (it continues in the same format beyond what is shown)...
"17701231" 365 31/12/1770 -99.9 999 -99.9 999 9999.9 999 999.9 999 999.9 999 999.9 999 9999.9 9999.9 9999.9 999
""
" This file is SPACE DELIMITED for easy import into both spreadsheets and programs."
"The first line 17701231 contains dummy data and is provided to allow spreadsheets to sense the columns"
" To read into a spreadsheet select DELIMITED and SPACE."
" "
" "
"========= The following essential information and notes should be kept in the data file =========="
" "
"The Data Drill system and data are copyright to the Queensland Government Department of Science, Information Technology and Innovation (DSITI)."
"SILO data, with the exception of Patched Point data for Queensland, are supplied to the licencee only and may not be given, lent, or sold to any other party"
" "
"Notes:"
" * Data Drill for Lat, Long: -27.5000 153.0000 (DECIMAL DEGREES), 27 30'S 153 00'E Your Ref: Data_Bris"
" * Elevation: 102m "
" * Extracted from Silo on 20171214"
" * Please read the documentation on the Data Drill at http://www.longpaddock.qld.gov.au/silo"
" "
" * As evaporation is read at 9am, it has been shifted to the day before"
" ie The evaporation measured on 20 April is in row for 19 April"
" * The 6 Source columns Smx, Smn, Srn, Sev, Ssl, Svp indicate the source of the data to their left, namely Max temp, Min temp, Rainfall, Evaporation, Radiation and Vapour Pressure respectively "
" "
" 35 = interpolated from daily observations using anomaly interpolation method for CLIMARC data
" 25 = interpolated daily observations, 75 = interpolated long term average"
" 26 = synthetic pan evaporation "
" "
" * Relative Humidity has been calculated using 9am VP, T.Max and T.Min"
" RHmaxT is estimated Relative Humidity at Temperature T.Max"
" RHminT is estimated Relative Humidity at Temperature T.Min"
" Span = a calibrated estimate of class A pan evaporation based on vapour pressure deficit and solar radiation
" * The accuracy of the data depends on many factors including date, location, and variable."
" For consistency data is supplied using one decimal place, however it is not accurate to that precision."
" Further information is available from http://www.longpaddock.qld.gov.au/silo"
"===================================================================================================="
" "
Date Day Date2 T.Max Smx T.Min Smn Rain Srn Evap Sev Radn Ssl VP Svp RHmaxT RHminT Span Ssp
(yyyymmdd) () (ddmmyyyy) (oC) () (oC) () (mm) () (mm) () (MJ/m2) () (hPa) () (%) (%) (mm) ()
18890101 1 1-01-1889 29.5 35 21.5 35 0.3 25 6.2 75 23.0 35 26.0 35 63.1 100.0 5.6 26
18890102 2 2-01-1889 32.0 35 21.5 35 0.1 25 6.2 75 23.0 35 21.0 35 44.2 81.9 6.9 26
18890103 3 3-01-1889 31.5 35 21.5 35 0.0 25 6.2 75 23.0 35 24.0 35 51.9 93.6 6.4 26
18890104 4 4-01-1889 29.5 35 21.0 35 0.0 25 6.2 75 23.0 35 22.0 35 53.4 88.5 6.1 26
18890105 5 5-01-1889 30.0 35 19.0 35 0.0 25 6.2 75 23.0 35 19.0 35 44.8 86.5 6.5 26
18890106 6 6-01-1889 28.5 35 18.5 35 0.0 25 6.2 75 23.0 35 23.0 35 59.1 100.0 5.7 26
18890107 7 7-01-1889 30.0 35 18.5 35 0.1 25 6.2 75 23.0 35 20.0 35 47.1 94.0 6.4 26
18890108 8 8-01-1889 28.0 35 18.5 35 0.0 25 6.2 75 23.0 35 21.0 35 55.6 98.7 5.8 26
18890109 9 9-01-1889 28.5 35 19.0 35 0.0 25 6.2 75 24.0 35 22.0 35 56.5 100.0 6.0 26
18890110 10 10-01-1889 29.0 35 20.0 35 0.0 25 6.2 75 23.0 35 21.0 35 52.4 89.9 6.1 26
Here, you want to ignore all the header lines, including the ones giving the names and formats of the columns. A simple way to achieve that is to skip any line not starting with a digit. With a generator expression (to avoid loading the whole file into memory), you can create your reader with:
...
reader = csv.reader((line for line in file if line[:1].isdigit()),
                    delimiter=' ', skipinitialspace=True)
...
The skipinitialspace=True makes the reader treat runs of spaces as a single delimiter.
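Putting it together with the loop from the question (a sketch; it assumes the column layout shown above, with rain in column 7 and evap in column 9):

import csv
import numpy as np

datelist, rainlist, evaplist = [], [], []
with open('Data_Bris.txt') as file:
    # Keep only lines that start with a digit, i.e. the actual data rows;
    # the quoted dummy line and all header text are skipped.
    reader = csv.reader((line for line in file if line[:1].isdigit()),
                        delimiter=' ', skipinitialspace=True)
    for row in reader:
        datelist.append(row[0])
        rainlist.append(float(row[7]))
        evaplist.append(float(row[9]))

rain = np.array(rainlist)
evap = np.array(evaplist)
timeseries = np.arange(rain.size)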
