I have just started using Python and I am trying to find outliers per year using quantiles.
My data is organized as follows:
a column of years, and for each year the months with their corresponding salinity and temperature:
year=[1997:2021]
month=[1,2...]
SAL=[33,32,50,......,35,...]
Following is my code:
#1st quartile
Q1 = DF['SAL'].quantile(0.25)
#3rd quartile
Q3 = DF['SAL'].quantile(0.75)
#calculate IQR
IQR = Q3 - Q1
print(IQR)
df_out = DF['SAL'][((DF['SAL'] < (Q1 - 1.5 * IQR)) |(DF['SAL'] > (Q3 + 1.5 * IQR)))]
I want to identify the month and year of each outlier and replace its value with NaN.
To get the outliers per year, you need to compute the quartiles for each year via groupby. Other than that, there's not much to change in your code, but I recently learned about between which seems useful here:
import numpy as np
import pandas as pd

clean_data = list()
for year, group in DF.groupby('year'):
    group = group.copy()  # work on a copy so the assignment below is unambiguous
    Q1 = group['SAL'].quantile(0.25)
    Q3 = group['SAL'].quantile(0.75)
    IQR = Q3 - Q1
    # set all values to np.nan that are not (~) strictly between the two fences
    # (pandas >= 1.3 spelling; older versions use inclusive=False)
    group.loc[~group['SAL'].between(Q1 - 1.5 * IQR,
                                    Q3 + 1.5 * IQR,
                                    inclusive='neither'),
              'SAL'] = np.nan
    clean_data.append(group)
clean_df = pd.concat(clean_data)
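The same per-year fences can also be computed without an explicit loop, which makes it easy to list the year and month of each outlier as the question asks. This is only a sketch: `demo` is a made-up stand-in for `DF`, and the names `is_outlier`, `outlier_rows` and `cleaned` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for DF: two years, one obvious outlier in each
demo = pd.DataFrame({
    'year':  [1997] * 6 + [1998] * 6,
    'month': list(range(1, 7)) * 2,
    'SAL':   [33.0, 32.0, 50.0, 34.0, 33.5, 32.5,
              35.0, 34.0, 36.0, 35.5, 10.0, 34.5],
})

# Broadcast each year's quartiles back onto the rows of that year
by_year = demo.groupby('year')['SAL']
q1 = by_year.transform(lambda s: s.quantile(0.25))
q3 = by_year.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1
is_outlier = (demo['SAL'] < q1 - 1.5 * iqr) | (demo['SAL'] > q3 + 1.5 * iqr)

# Year and month of every outlier
outlier_rows = demo.loc[is_outlier, ['year', 'month', 'SAL']]
print(outlier_rows)

# Copy of the data with the outliers replaced by NaN
cleaned = demo.assign(SAL=demo['SAL'].mask(is_outlier))
```

Keeping the boolean mask around means the original values stay available in `demo` while `cleaned` carries the NaNs.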
You can use the following function. It treats a value as an outlier when it lies below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, as is classically done for boxplots.
import pandas as pd
import numpy as np
df = pd.DataFrame({'year': np.repeat(range(1997,2022), 12),
                   'month': np.tile(range(12), 25)+1,
                   'SAL': np.random.randint(20,40, size=12*25)+np.random.choice([0,-20, 20], size=12*25, p=[0.9,0.05,0.05]),
                   })

def outliers(s, replace=np.nan):
    # keep values strictly inside the boxplot fences, replace the rest
    Q1, Q3 = np.percentile(s, [25 ,75])
    IQR = Q3-Q1
    return s.where((s > (Q1 - 1.5 * IQR)) & (s < (Q3 + 1.5 * IQR)), replace)

# add new column with excluded outliers
# (transform keeps the original row index, so the result aligns with df in recent pandas versions too)
df['SAL_excl'] = df.groupby('year')['SAL'].transform(outliers)
Checking that it works:
with outliers:
import seaborn as sns
sns.boxplot(data=df, x='year', y='SAL')
without outliers:
sns.boxplot(data=df, x='year', y='SAL_excl')
NB. New outliers may appear after filtering, because the remaining data has a new Q1/Q3/IQR.
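A tiny example of that effect, as a sketch only: the series `s` and the helper `drop_outliers` are made up for illustration, and the helper uses `Series.quantile` (which skips NaN) instead of `np.percentile` so that it can be applied repeatedly.

```python
import numpy as np
import pandas as pd

def drop_outliers(s):
    # Series.quantile ignores NaN, so the function can be re-applied to its own output
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.where((s > q1 - 1.5 * iqr) & (s < q3 + 1.5 * iqr))

s = pd.Series([1, 2, 3, 4, 10, 100], dtype=float)

first_pass = drop_outliers(s)             # only 100 falls outside the fences
second_pass = drop_outliers(first_pass)   # with 100 gone, 10 is now outside
third_pass = drop_outliers(second_pass)   # nothing else is removed

print(first_pass.isna().sum(), second_pass.isna().sum(), third_pass.isna().sum())
```

Whether to filter once or repeat until nothing changes is a judgment call; the boxplot convention corresponds to a single pass.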
How to retrieve rows with outliers:
df[df['SAL_excl'].isna()]
output:
     year  month  SAL  SAL_excl
28   1999      5   53       NaN
33   1999     10    7       NaN
94   2004     11   52       NaN
100  2005      5   38       NaN
163  2010      8    6       NaN
182  2012      3   25       NaN
188  2012      9   22       NaN
278  2020      3   53       NaN
294  2021      7    9       NaN
I'd like to get some % rates based on a .groupby() in pandas. My goal is to take an indicator column Ind and get the Rate of A (numerator) divided by the total (A+B) in that year
Example Data:
import pandas as pd
import numpy as np
df: pd.DataFrame = pd.DataFrame([['2011','A',1,2,3], ['2011','B',4,5,6],['2012','A',15,20,4],['2012','B',17,12,12]], columns=["Year","Ind","X", "Y", "Z"])
print(df)
   Year Ind   X   Y   Z
0  2011   A   1   2   3
1  2011   B   4   5   6
2  2012   A  15  20   4
3  2012   B  17  12  12
Example for year 2011: XRate would be the sum of the A indicators for X (which is 1) divided by the total (A+B), which is 5, so I would get an XRate of 0.20.
I would like to do this for all columns X, Y, Z to get the rates. I've tried doing lambda applys but can't quite get the desired results.
Desired Results:
   Year  XRate  YRate  ZRate
0  2011   0.20   0.29   0.33
1  2012   0.47   0.63   0.25
You can group the dataframe on Year and aggregate using sum:
cols = ['X', 'Y', 'Z']  # sum only the numeric columns; 'Ind' holds strings
s1 = df.groupby('Year')[cols].sum()
s2 = df.query("Ind == 'A'").groupby('Year')[cols].sum()
s2.div(s1).round(2).add_suffix('Rate')
      XRate  YRate  ZRate
Year
2011   0.20   0.29   0.33
2012   0.47   0.62   0.25
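To get `Year` back as an ordinary column, as in the desired layout, a `reset_index()` at the end should be enough. Below is a self-contained sketch of the whole computation; the names `value_cols`, `totals`, `a_only` and `rates` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [['2011', 'A', 1, 2, 3], ['2011', 'B', 4, 5, 6],
     ['2012', 'A', 15, 20, 4], ['2012', 'B', 17, 12, 12]],
    columns=['Year', 'Ind', 'X', 'Y', 'Z'])

value_cols = ['X', 'Y', 'Z']

# Denominator: A + B per year; numerator: A only per year
totals = df.groupby('Year')[value_cols].sum()
a_only = df[df['Ind'] == 'A'].groupby('Year')[value_cols].sum()

# Divide, round, rename, and turn the Year index back into a column
rates = a_only.div(totals).round(2).add_suffix('Rate').reset_index()
print(rates)
```

Note that the 2012 YRate is exactly 20/32 = 0.625, which `round(2)` reports as 0.62 (ties go to the even digit) rather than the 0.63 in the desired table.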
I have two sets of continuous data that I would like to pass into a contour plot. The x-axis would be time, the y-axis would be mass, and the z-axis would be frequency (as in how many times that data point appears). However, most data points are not identical but rather very similar. Thus, I suspect it's easiest to discretize both the x-axis and y-axis.
Here's the data I currently have:
INPUT
import pandas as pd
df = pd.read_excel('data.xlsx')
df['Dates'].head(5)
df['Mass'].head(5)
OUTPUT
13 2003-05-09
14 2003-09-09
15 2010-01-18
16 2010-11-21
17 2012-06-29
Name: Date, dtype: datetime64[ns]
13 2500.0
14 3500.0
15 4000.0
16 4500.0
17 5000.0
Name: Mass, dtype: float64
I'd like to convert the data such that it groups up data points within the year (ex: all datapoints taken in 2003) and it groups up data points within different levels of mass (ex: all datapoints between 3000-4000 kg). Next, the code would count how many data points are within each of these blocks and pass that as the z-axis.
Ideally, I'd also like to be able to adjust the levels of slices. Ex: grouping points up every 100kg instead of 1000kg, or passing a custom list of levels that aren't equally distributed. How would I go about doing this?
I think the function you are looking for is pd.cut
import pandas as pd
import numpy as np
import datetime
n = 10
scale = 1e3
Min = 0
Max = 1e4
np.random.seed(6)
Start = datetime.datetime(2000, 1, 1)
Dates = np.array([Start + datetime.timedelta(days=i*180) for i in range(n)])
Mass = np.random.rand(n)*10000
df = pd.DataFrame(index = Dates, data = {'Mass':Mass})
print(df)
gives you:
Mass
2000-01-01 8928.601514
2000-06-29 3319.798053
2000-12-26 8212.291231
2001-06-24 416.966257
2001-12-21 1076.566799
2002-06-19 5950.520642
2002-12-16 5298.173622
2003-06-14 4188.074286
2003-12-11 3354.078493
2004-06-08 6225.194322
If you want to group your masses in steps of, say, 1000, or use your own custom bins, you can do this:
Bins,Labels=np.arange(Min,Max+.1,scale),(np.arange(Min,Max,scale))+(scale)/2
EqualBins = pd.cut(df['Mass'],bins=Bins,labels=Labels)
df.insert(1,'Equal Bins',EqualBins)
Bins,Labels=[0,1000,5000,10000],['Small','Medium','Big']
CustomBins = pd.cut(df['Mass'],bins=Bins,labels=Labels)
df.insert(2,'Custom Bins',CustomBins)
If you want to just show the year, month, etc it is very simple:
df['Year'] = df.index.year
df['Month'] = df.index.month
but you can also do custom date ranges if you like:
Bins=[datetime.datetime(1999, 12, 31),datetime.datetime(2000, 9, 1),
datetime.datetime(2002, 1, 1),datetime.datetime(2010, 9, 1)]
Labels = ['Early','Middle','Late']
CustomDateBins = pd.cut(df.index,bins=Bins,labels=Labels)
df.insert(3,'Custom Date Bins',CustomDateBins)
print(df)
This yields something like what you want:
                   Mass Equal Bins Custom Bins Custom Date Bins  Year  Month
2000-01-01  8928.601514     8500.0         Big            Early  2000      1
2000-06-29  3319.798053     3500.0      Medium            Early  2000      6
2000-12-26  8212.291231     8500.0         Big           Middle  2000     12
2001-06-24   416.966257      500.0       Small           Middle  2001      6
2001-12-21  1076.566799     1500.0      Medium           Middle  2001     12
2002-06-19  5950.520642     5500.0         Big             Late  2002      6
2002-12-16  5298.173622     5500.0         Big             Late  2002     12
2003-06-14  4188.074286     4500.0      Medium             Late  2003      6
2003-12-11  3354.078493     3500.0      Medium             Late  2003     12
2004-06-08  6225.194322     6500.0         Big             Late  2004      6
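The .groupby function is probably of interest to you as well:
yeargroup = df.groupby(df.index.year).mean(numeric_only=True)  # skip the categorical bin columns
massgroup = df.groupby(df['Equal Bins'], observed=False).count()  # keep empty bins in the counts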
print(yeargroup)
print(massgroup)
             Mass    Year     Month
2000  6820.230266  2000.0  6.333333
2001   746.766528  2001.0  9.000000
2002  5624.347132  2002.0  9.000000
2003  3771.076389  2003.0  9.000000
2004  6225.194322  2004.0  6.000000
            Mass  Custom Bins  Custom Date Bins  Year  Month
Equal Bins
500.0          1            1                 1     1      1
1500.0         1            1                 1     1      1
2500.0         0            0                 0     0      0
3500.0         2            2                 2     2      2
4500.0         1            1                 1     1      1
5500.0         2            2                 2     2      2
6500.0         1            1                 1     1      1
7500.0         0            0                 0     0      0
8500.0         2            2                 2     2      2
9500.0         0            0                 0     0      0
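For the contour plot itself, the question needs a 2-D grid of counts (mass bin by year). One way to get there, as a sketch rather than a definitive recipe, is to combine pd.cut with pd.crosstab. The frame `demo`, its random contents and the `mass_edges` array are assumptions made up for this example; the column names `Dates` and `Mass` follow the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the Excel file from the question
rng = np.random.default_rng(0)
offsets = pd.to_timedelta(rng.integers(0, 5 * 365, size=200), unit='D')
demo = pd.DataFrame({'Dates': pd.Timestamp('2000-01-01') + offsets,
                     'Mass': rng.uniform(0, 10_000, size=200)})

# Bin edges: change the step (e.g. 100) or pass any custom list of edges
mass_edges = np.arange(0, 10_001, 1000)
mass_bin = pd.cut(demo['Mass'], bins=mass_edges, include_lowest=True)
year = demo['Dates'].dt.year

# Rows: mass bins, columns: years, values: number of data points in each block
counts = pd.crosstab(mass_bin, year)
print(counts)
```

The values in `counts` can then serve as the z-axis, for instance by passing `counts.to_numpy()` together with the years and the bin midpoints to a contour function.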
I'm relatively new to Python and pandas, and I am trying to work out how to write an IF statement (or any other construct) that, once it first returns a value, carries on with another IF statement for as long as the year is within a given range.
I have tried .between, .loc, and if statements but am still struggling. I have tried to recreate what is happening in my code but cannot replicate it precisely. Any suggestions or ideas around this problem?
import pandas as pd
data = {'Yrs': [ '2018','2019', '2020', '2021', '2022'], 'Val': [1.50, 1.75, 2.0, 2.25, 2.5] }
data2 = {'F':['2015','2018', '2020'], 'L': ['2019','2022', '2024'], 'Base':['2','5','5'],
'O':[20, 40, 60], 'S': [5, 10, 15]}
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
r = pd.DataFrame()
#use this code to get first value when F <= Yrs
r.loc[(df2['F'] <= df.at[0,'Yrs']), '2018'] = \
    (1/pd.to_numeric(df2['Base']))*(pd.to_numeric(df2['S']))* \
    (pd.to_numeric(df.at[0, 'Val']))+(pd.to_numeric(df2['O']))
#use this code to get the rest of the values until L = Yrs
r.loc[(df2['L'] <= df.at[1,'Yrs']) & (df2['L'] >= df.at[1,'Yrs']),\
    '2019'] = (pd.to_numeric(r['2018'])- pd.to_numeric(df2['O']))* \
    pd.to_numeric(df.at[1, 'Val'] / pd.to_numeric(df.at[0, 'Val'])) + \
    pd.to_numeric(df2['O'])
r
I expect the output to be (the values may be different, but this is the pattern I want):
    2018    2019  2020   2021   2022
0   7.75   8.375   NaN    NaN    NaN
1   11.0    11.5    12   12.5   13.0
2    NaN     NaN    18  18.75  19.25
But I get:
    2018    2019  2020   2021   2022
0   7.75   8.375   9.0  9.625  10.25
1   11.0    11.5    12    NaN    NaN
2  16.50   17.25    18    NaN    NaN
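One possible way to get that pattern, offered as a sketch under stated assumptions: build a boolean mask that is True where F <= Yrs <= L and compute all years at once. It assumes that the recurrence in the code above, (previous - O) * Val_t / Val_(t-1) + O, collapses to S / Base * Val_t + O for every year in range, and that the intended offset column is `O`. The names `vals`, `params`, `in_range` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

vals = pd.DataFrame({'Yrs': ['2018', '2019', '2020', '2021', '2022'],
                     'Val': [1.50, 1.75, 2.0, 2.25, 2.5]})
params = pd.DataFrame({'F': ['2015', '2018', '2020'],
                       'L': ['2019', '2022', '2024'],
                       'Base': ['2', '5', '5'],
                       'O': [20, 40, 60],
                       'S': [5, 10, 15]})

years = vals['Yrs'].astype(int).to_numpy()             # shape (5,)
first = params['F'].astype(int).to_numpy()[:, None]    # shape (3, 1)
last = params['L'].astype(int).to_numpy()[:, None]     # shape (3, 1)

# True where the year lies inside the row's [F, L] range
in_range = (first <= years) & (years <= last)

# Assumed closed form of the recurrence: S / Base * Val + O
scale = (params['S'] / pd.to_numeric(params['Base'])).to_numpy()[:, None]
offset = params['O'].to_numpy()[:, None]
values = scale * vals['Val'].to_numpy() + offset

# Keep values inside the range, NaN elsewhere
result = pd.DataFrame(np.where(in_range, values, np.nan),
                      index=params.index, columns=vals['Yrs'])
print(result)
```

The numbers differ from the table in the question, but the NaN pattern (row 0 stops after 2019, row 2 starts in 2020) is the one requested.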
I have two dataframes from excels which look like the below. The first dataframe has a multi-index header.
I am trying to find the correlation between each column in the dataframe with the corresponding dataframe based on the currency (i.e KRW, THB, USD, INR). At the moment, I am doing a loop to iterate through each column, matching by index and corresponding header before finding the correlation.
for stock_name in index_data.columns.get_level_values(0):
    stock_prices = index_data.xs(stock_name, level=0, axis=1)
    stock_prices = stock_prices.dropna()
    fx = currency_data[stock_prices.columns.get_level_values(1).values[0]]
    fx = fx[fx.index.isin(stock_prices.index)]
    merged_df = pd.merge(stock_prices, fx, left_index=True, right_index=True)
    merged_df[0].corr(merged_df[1])
Is there a more panda-ish way of doing this?
So you wish to find the correlation between the stock price and its related currency. (Or stock price correlation to all currencies?)
import numpy as np
import pandas as pd

# dummy data
date_range = pd.date_range('2019-02-01', '2019-03-01', freq='D')
stock_prices = pd.DataFrame(
    np.random.randint(1, 20, (date_range.shape[0], 4)),
    index=date_range,
    columns=[['BYZ6DH', 'BLZGSL', 'MBT', 'BAP'],
             ['KRW', 'THB', 'USD', 'USD']])
fx = pd.DataFrame(np.random.randint(1, 20, (date_range.shape[0], 3)),
                  index=date_range, columns=['KRW', 'THB', 'USD'])
This is what it looks like. Correlations calculated on this data will not be meaningful, since it is random.
>>> print(stock_prices.head())
           BYZ6DH BLZGSL MBT BAP
              KRW    THB USD USD
2019-02-01     15     10  19  19
2019-02-02      5      9  19   5
2019-02-03     19      7  18  10
2019-02-04      1      6   7  18
2019-02-05     11     17   6   7
>>> print(fx.head())
            KRW  THB  USD
2019-02-01   15   11   10
2019-02-02    6    5    3
2019-02-03   13    1    3
2019-02-04   19    8   14
2019-02-05    6   13    2
Use apply to calculate the correlation between columns with the same currency.
def f(x, fx):
    # x.name is the (stock, currency) column label; pick the matching fx column
    correlation = x.corr(fx[x.name[1]])
    return correlation

correlation = stock_prices.apply(f, args=(fx,), axis=0)
>>> print(correlation)
BYZ6DH  KRW   -0.247529
BLZGSL  THB    0.043084
MBT     USD   -0.471750
BAP     USD    0.314969
dtype: float64
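An alternative that avoids the per-column function is to line up one fx column per stock column and let `DataFrame.corrwith` do the pairing. This is a sketch with seeded dummy data; `currencies` and `fx_aligned` are illustrative names.

```python
import numpy as np
import pandas as pd

# Seeded dummy data with the same layout as above
rng = np.random.default_rng(0)
dates = pd.date_range('2019-02-01', '2019-03-01', freq='D')
stock_prices = pd.DataFrame(
    rng.integers(1, 20, (len(dates), 4)),
    index=dates,
    columns=[['BYZ6DH', 'BLZGSL', 'MBT', 'BAP'],
             ['KRW', 'THB', 'USD', 'USD']])
fx = pd.DataFrame(rng.integers(1, 20, (len(dates), 3)),
                  index=dates, columns=['KRW', 'THB', 'USD'])

# One fx column per stock column, chosen by the currency level of the header,
# then relabelled so both frames share the same column index
currencies = list(stock_prices.columns.get_level_values(1))
fx_aligned = fx[currencies].set_axis(stock_prices.columns, axis=1)

# Column-wise correlation between identically labelled columns
correlation = stock_prices.corrwith(fx_aligned)
print(correlation)
```

Both frames are aligned on the date index before the correlation is computed, so dates missing from one side are simply left out of the calculation.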
I have a dataframe such as the following:
What's the best way to calculate a cumulative return to fill the NaN values? The logic of each cell is shown.
Following is the intended result:
import numpy as np
import pandas as pd
df = pd.DataFrame({"DATE":[2018,2019,2020,2021,2022,2023,2024],"RATIO":[0.03,0.04,0.05,0.06,0.07,0.08,0.09],"PROFIT":[10,20,np.nan,np.nan,np.nan,np.nan,np.nan]})
df.loc[df['DATE']==2020, ['PROFIT']] = 20000*(1+0.04)
df.loc[df['DATE']==2021, ['PROFIT']] = 20000*(1+0.04)*(1+0.050)
df.loc[df['DATE']==2022, ['PROFIT']] = 20000*(1+0.04)*(1+0.050)*(1+0.060)
df.loc[df['DATE']==2023, ['PROFIT']] = 20000*(1+0.04)*(1+0.050)*(1+0.060)*(1+0.070)
df.loc[df['DATE']==2024, ['PROFIT']] = 20000*(1+0.04)*(1+0.050)*(1+0.060)*(1+0.070)*(1+0.080)
df
You are looking for cumprod
df['PROFIT']=df['PROFIT'].fillna(df.RATIO.shift().add(1).iloc[2:].cumprod()*20000)
df
Out[30]:
   DATE  RATIO       PROFIT
0  2018   0.03     10.00000
1  2019   0.04     20.00000
2  2020   0.05  20800.00000
3  2021   0.06  21840.00000
4  2022   0.07  23150.40000
5  2023   0.08  24770.92800
6  2024   0.09  26752.60224
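The same idea can be written without hard-coding the position of the first NaN. The helper below is a hypothetical sketch (`fill_compounded` is not a pandas function). It assumes the missing values form one trailing block, and it takes the starting amount as a parameter because the question's code compounds from 20000 while the column itself holds 20.

```python
import numpy as np
import pandas as pd

def fill_compounded(profit, ratio, base=None):
    """Fill trailing NaNs in `profit` by compounding the previous row's ratio."""
    missing = profit.isna()
    if base is None:
        base = profit[~missing].iloc[-1]   # default: last observed profit
    # Growth factor (1 + previous ratio) on missing rows, 1 elsewhere
    growth = ratio.shift().add(1).where(missing, 1.0).cumprod()
    return profit.fillna(base * growth)

df = pd.DataFrame({"DATE": [2018, 2019, 2020, 2021, 2022, 2023, 2024],
                   "RATIO": [0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09],
                   "PROFIT": [10, 20, np.nan, np.nan, np.nan, np.nan, np.nan]})

df["PROFIT"] = fill_compounded(df["PROFIT"], df["RATIO"], base=20000)
print(df)
```

Leaving out `base` compounds from the last observed value (20 here) instead of 20000.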