Custom function to extract and visualize outliers in Python

For the dataframe below, I want to write a function that:
I. extracts the outliers for each column and exports the output as a CSV file (I need help with this one)
II. visualizes each column with a boxplot and exports the figure as a PDF file
Outlier definition: either points beyond ±3 standard deviations from the mean,
OR
any data point that lies more than 1.5 IQRs below the first quartile (Q1) or above the third quartile (Q3) in the data set:
High = Q3 + 1.5 * IQR
Low = Q1 - 1.5 * IQR
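For reference, here is a minimal sketch of both definitions applied to a single pandas Series (the values are just the cost column from my dataset below):
import pandas as pd

s = pd.Series([120.05, 181.90, 10.21, 133.01, 311.19, 2003.4, 112.4, 763.2, 414.8, 812.5])

# definition 1: beyond +/- 3 standard deviations from the mean
z_out = s[(s - s.mean()).abs() > 3 * s.std()]

# definition 2: below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_out = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]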
See below for the dataset and my attempt:
# dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# initialise data of lists
data = {'region': ['R1', 'R1', 'R2', 'R2', 'R2', 'R1', 'R1', 'R1', 'R2', 'R2'],
        'cost': [120.05, 181.90, 10.21, 133.01, 311.19, 2003.4, 112.4, 763.2, 414.8, 812.5],
        'commission': [110.21, 191.12, 190.21, 15.31, 245.09, 63.41, 811.3, 10.34, 153.10, 311.17],
        'salary': [10022, 19910, 19113, 449999, 25519, 140.29, 291.07, 390.22, 245.09, 4122.62],
        'revenue': [14029, 29100, 39022, 24509, 412271, 110.21, 191.12, 190.21, 12.00, 245.09],
        'tax': [120.05, 181.90, 10.34, 153.10, 311.17, 52119, 32991, 52883, 69359, 57835],
        'debt': [100.22, 199.10, 191.13, 199.99, 255.19, 41218, 52991, 1021, 69152, 79355],
        'income': [43211, 7672991, 56881, 211, 77342, 100.22, 199.10, 191.13, 199.99, 255.19],
        'rebate': [31.21, 429.01, 538.18, 621.58, 6932.5, 120.05, 181.90, 10.34, 153.10, 311.17],
        'scale': ['small', 'small', 'small', 'mid', 'mid', 'large', 'large', 'mid', 'large', 'small']}

# Create DataFrame
df = pd.DataFrame(data)

# Print the output
df
############## my attempt ####################
def outlier_extractor(data):
    # select numeric columns
    numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
    # (I) Extract and export outliers as csv ..... I need help with this one
    # (II) boxplot visualization
    plt.figure(figsize=(10, 9))
    for i, variable in enumerate(numeric_columns):
        plt.subplot(4, 4, i + 1)
        plt.boxplot(data[variable], whis=1.5)
        plt.tight_layout()
        plt.title(variable)
    plt.savefig('graph_outliers.pdf')
    plt.show()

# driver code
outlier_extractor(df)
Please comment and share your full code. Thanks in advance.

def outlier_extractor(data):
    # work on the numeric columns only
    numeric_data = data.select_dtypes(include=np.number)
    Q1, Q3 = numeric_data.quantile(.25), numeric_data.quantile(.75)
    IQR = Q3 - Q1
    # replace every outlier (IQR rule) with NaN
    numeric_data[:] = np.where((numeric_data > Q3 + 1.5 * IQR) | (numeric_data < Q1 - 1.5 * IQR),
                               np.nan, numeric_data)
    # export each filtered column (outliers dropped) to its own csv file
    numeric_data.apply(lambda series: series.dropna().to_csv(series.name + ".csv"))
    # boxplot visualization
    plt.figure(figsize=(10, 9))
    for i, variable in enumerate(numeric_data.columns):
        plt.subplot(4, 4, i + 1)
        plt.boxplot(data[variable], whis=1.5)
        plt.tight_layout()
        plt.title(variable)
    # plt.savefig('graph_outliers.pdf')
    plt.show()

outlier_extractor(df)
Note that the apply call saves each filtered column to a separate csv file; from your description I thought that this was your task.
Note also that you don't need the seaborn package.
EDIT
To export the whole filtered dataframe, with missing values replacing the outliers, replace the to_csv row with:
numeric_data.to_excel("filtered_numeric_data.xlsx")
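If you want to export the outliers themselves instead (part I of the question), here is a minimal sketch under the same IQR rule, inverting the mask and writing one combined file (the name outliers.csv is just an example):
def outlier_exporter(data, path="outliers.csv"):
    numeric_data = data.select_dtypes(include=np.number)
    Q1, Q3 = numeric_data.quantile(.25), numeric_data.quantile(.75)
    IQR = Q3 - Q1
    # keep only the outliers; every non-outlier becomes NaN
    mask = (numeric_data > Q3 + 1.5 * IQR) | (numeric_data < Q1 - 1.5 * IQR)
    outliers = numeric_data.where(mask)
    # drop rows that contain no outlier at all, then export
    outliers.dropna(how="all").to_csv(path)

outlier_exporter(df)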

Matplotlib time-based heatmap

Background: I picked up Python about a month ago, so my experience level is pretty slim. I'm pretty comfortable with VBA through years of data analysis in Excel and PI ProcessBook.
I have 27 thermocouples that I pull data for at 1 s intervals. I would like to heatmap them from hottest to coldest at a given instant in time. I've leveraged seaborn heatmaps, but the problem with those is that they compare temperatures across time as well, and the aggregate of these thermocouples changes dramatically over time. See the chart below:
Notice how in the attached chart the pink line is colder than the rest when all of them are cold, but when they all heat up, the cold spot transfers to the orange and green lines (and even the blue one for a little bit at the peak).
In Excel, I would write a Do loop to apply conditional formatting to each individual timestamp (row); in Python, however, I can't figure it out for the life of me. The following is the code that I used to develop the above chart, so I'm hoping I can modify this to make it work.
tsStartTime = pd.Timestamp(strStart_Time)
tsEndTime = pd.Timestamp(strEnd_Time)
t = np.linspace(tsStartTime.value, tsEndTime.value, 150301)
TimeAxis = pd.to_datetime(t)
fig, ax = plt.subplots(figsize=(25, 5))
plt.subplots_adjust(bottom=0.25)
x = TimeAxis
i = 1
while i < 28:
    globals()['y' + str(i)] = forceconvert_v(globals()['arTTXD' + str(i)])
    ax.plot(x, globals()['y' + str(i)])
    i += 1
I've tried to use seaborn heatmaps, but when I slice the data by timestamp, the output array has shape (27,) instead of (27, 1), so it gets rejected.
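For reference, a minimal sketch of the shape problem and the reshape that fixes it (df and ts are hypothetical here: a DataFrame holding the 27 series indexed by timestamp, and one timestamp from that index):
import numpy as np
import seaborn as sns

row = df.loc[ts]                             # slicing one timestamp gives shape (27,)
sns.heatmap(np.asarray(row).reshape(-1, 1))  # reshaping to (27, 1) makes it valid 2-D input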
Ultimately, I'm looking for an output that looks like this:
Notice how the values of 15 in the middle are blue despite being higher than the red 5s in the beginning. I didn't fill out every cell, but hopefully you get the gist of what I'm trying to accomplish.
This data is being pulled from OSIsoft PI via the PIConnect library. PI leverages its own classes, but they are essentially either series or dataframes, and I can manipulate them into whatever they need to be if someone has any awesome ideas to handle this.
Here's the link to the data: https://file.io/JS0RoQvDL6AB
Thanks!
You are going the wrong way with globals. In this case, I suggest using a pandas.DataFrame.
What you are looking for can be produced like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Settings
col_number = 5
start = '1/1/2022 10:00:00'
end = '1/1/2022 10:10:00'

# prepare a data frame
index = pd.date_range(start=start, end=end, freq="S")
columns = [f'y{i}' for i in range(col_number)]
df = pd.DataFrame(index=index, columns=columns)

# fill in the data
for n, col in enumerate(df.columns):
    df[col] = np.array([n + np.sin(2*np.pi*i/len(df)) for i in range(len(df))])

# drawing a heatmap
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 5))
ax1.plot(df)
ax1.legend(df.columns)
ax2.imshow(df.T, aspect='auto', cmap='plasma')
ax2.set_yticks(range(len(df.columns)))
ax2.set_yticklabels(df.columns)
plt.show()
Here:
Since you didn't supply data to reproduce your case, I use sin as illustrative values.
Transposing with df.T is needed to lay the records out horizontally. Of course, we could write the data horizontally in the first place; it's up to you.
set_yticks is there to avoid trouble when changing the y-labels on the second figure.
seaborn.heatmap(...) can be used as well:
import seaborn as sns
data = df.T
data.columns = df.index.strftime('%H:%M:%S')
plt.figure(figsize=(15,3))
sns.heatmap(data, cmap='plasma', xticklabels=60)
Update
To compare values at each point in time, min-max normalize each column; since data is transposed, every column holds one timestamp, so DataFrame.min() and DataFrame.max() operate per timestamp:
data = (data - data.min()) / (data.max() - data.min())
sns.heatmap(data, cmap='plasma', xticklabels=60)

Power law test using XY scatter plot

I have daily crude oil prices downloaded from FRED, about 10k observations; some values are blank (the code cleans them). I believe that I cannot share Excel sheets here, so I will just give you a screenshot of what the data looks like:
I calculate the differences and returns and clean up the data, but I am kind of stuck.
Here is what the code looks like to get you started:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv("DCOILWTICO.csv")
# FRED marks missing observations with "." or an empty string
nan_value = float("NaN")
data.replace("", nan_value, inplace=True)
data.replace(".", nan_value, inplace=True)
data['Previous'] = data['DCOILWTICO'].shift(1)
data.dropna(subset=['Previous'], inplace=True)
data['DCOILWTICO'] = data['DCOILWTICO'].astype(float)
data['Previous'] = data['Previous'].astype(float)
data['Diff'] = data['DCOILWTICO'] - data['Previous']
data['Return'] = (data['DCOILWTICO'] - data['Previous']) / data['Previous']
Here comes the question: I am trying to duplicate the graph below (which I believe was generated using Mathematica). The difficult part is creating the bins in the right way. Looking at the graph, it looks like there are around 200 bins. On the x-axis are the returns and on the y-axis are the binned frequencies.
I think you are asking how to make equally spaced bins in log space. If so, use np.geomspace (geometric spacing) rather than np.linspace (linear spacing).
plt.figure()
bins = np.geomspace(data['Return'].min(), data['Return'].max(), 200)
plt.hist(data['Return'], bins=bins)
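One caveat (my addition): np.geomspace requires both endpoints to be non-zero and of the same sign, and daily returns cross zero, so the call above will fail on the raw Return column. A possible workaround is to mirror log-spaced edges around zero; a sketch, assuming the cleaned data from the question:
import numpy as np
import matplotlib.pyplot as plt

r = data['Return'].dropna()
abs_r = r.abs()
abs_r = abs_r[abs_r > 0]                      # geomspace cannot include zero

# log-spaced edges for the positive side, mirrored for the negative side
pos = np.geomspace(abs_r.min(), abs_r.max(), 100)
bins = np.concatenate([-pos[::-1], pos])

plt.hist(r, bins=bins)
plt.xscale('symlog', linthresh=abs_r.min())   # log-like axis on both sides of zero
plt.show()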

How to plot multiple CSV files with separate plots for each category

I have 3 CSV files. Each file contains a column named Method, with 4 different methods appearing in each file.
I can create a plot that contains all 4 methods for one CSV file. The code is given below:
from pathlib import Path

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

ROOT = Path(__file__).resolve().parent
CSVS = [
    ROOT / "CSV_1.csv",
    ROOT / "CSV_2.csv",
    ROOT / "CSV_3.csv",
]

def accuracy_plot():
    for csv in CSVS:
        error_kind = csv.stem[: csv.stem.find("_")]
        data = pd.read_csv(csv)
        plt.figure(figsize=(8, 8))
        sns.scatterplot(x="EC", y="MAE", data=data, hue="Method")
        plt.title("MAE vs EC")
        plt.xlabel("EC")
        plt.ylabel("MAE")
        plt.savefig(error_kind + ".png")
The plot looks like the one below.
Now, I want to create a plot for one method with data that comes from all CSV files. More specifically, I want a plot that shows only one method's info, but with data from all 3 CSV files. In this plot, I need to use different markers (+, -, *) to identify the data that comes from different datasets; alternatively, each dataset could get a different color. Ultimately, I will have 4 different plots for the 4 different methods.
Could you tell me how I can do this? I used seaborn, but you are most welcome to use matplotlib.
Option 1
The simplest way I can think of is to combine all the files into a single dataframe
Plot the dataframe with seaborn.relplot
This option creates a single FacetGrid
# create the dataframe from all the csv files
df = pd.concat([pd.read_csv(csv).assign(file=csv.stem) for csv in CSVS]).reset_index(drop=True)
# create the plots
p = sns.relplot(data=df, kind='scatter', col='Method', col_wrap=2, x='EC', y='MAE', hue='file', height=4)
# save
p.savefig("output.png")
Option 2
As requested in the edit, this option will create separate plots for each method.
Add a column to the dataframe to identify which file the data is from, using .assign. Using .assign, instead of a for-loop, came from Simon Bowly.
Select data for the given method, and use hue='file'
Set hue_order, to ensure the same color is assigned within each plot.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# create a single dataframe
df = pd.concat([pd.read_csv(csv).assign(file=csv.stem) for csv in CSVS]).reset_index(drop=True)

# plot
for method in df.Method.unique():
    # select data only for the current method
    data = df[df.Method.eq(method)]
    sns.scatterplot(data=data, x='EC', y='MAE', hue='file', hue_order=sorted(df.file.unique()))
    plt.title(method)
    plt.xlim(0, 1)
    plt.savefig(f'{method}.png')
    plt.show()  # or plt.clf(); either clears the figure between loop iterations
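A side note (my addition, not part of the original answer): the question also asked for different markers per file; seaborn's style parameter handles that inside the same loop, for example:
# style= maps each unique 'file' value to its own marker shape
sns.scatterplot(data=data, x='EC', y='MAE', hue='file', style='file',
                hue_order=sorted(df.file.unique()))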
Sample Data
import numpy as np
import pandas as pd

def data(method, loc):
    n = 100
    v = np.random.normal(loc=loc, scale=0.01, size=(n,))
    m = np.random.uniform(low=0.0, high=120.0, size=(n,))
    f = np.random.choice(['csv_1', 'csv_2', 'csv_3'], size=(n,))
    met = [method] * n
    d = {'EC': v, 'MAE': m, 'Method': met, 'file': f}
    return pd.DataFrame(d)

samples = [('method_1', 0.05), ('method_1', 0.12),
           ('method_2', 0.053), ('method_2', 0.11), ('method_2', 0.21),
           ('method_3', 0.63), ('method_3', 0.72), ('method_3', 0.9),
           ('method_4', 0.7), ('method_4', 0.8), ('method_4', 0.9)]

np.random.seed(1)
df = pd.concat([data(method, loc) for (method, loc) in samples]).reset_index(drop=True)

Colour code the plot based on the two data frame values

I would like to colour-code the scatter plot based on two dataframe columns: each distinct df[1] value gets its own color, and within a group of rows sharing the same df[1] value the opacity varies with df[2], so that the highest df[2] value in the group is fully opaque and the lowest is the least opaque.
Here is the code:
def func():
    ...

df = pd.read_csv(PATH + file, sep=",", header=None)
b = 2.72
a = 0.00000009
popt, pcov = curve_fit(func, df[2], df[5] / df[4], p0=[a, b])
perr = np.sqrt(np.diag(pcov))
# plot responsible for the data points in the figure
plt.scatter(df[1], df[5] / df[4] / df[2])
# plot responsible for the curve in the figure
plt.plot(df[1], func(df[2], *popt) / df[2], "r")
plt.legend(loc="upper left")
Here is the sample dataset:
df[0],df[1],df[2],df[3],df[4],df[5],df[6]
file_name_1_i1,31,413,36120,10,9,10
file_name_1_i2,31,1240,60488,10,25,27
file_name_1_i3,31,2769,107296,10,47,48
file_name_1_i4,31,8797,307016,10,150,150
file_name_2_i1,34,72,10868,11,9,10
file_name_2_i2,34,6273,250852,11,187,196
file_name_3_i1,36,84,29568,12,9,10
file_name_3_i2,36,969,68892,12,25,26
file_name_3_i3,36,6545,328052,12,150,151
file_name_4_i1,69,116,40712,13,25,26
file_name_4_i2,69,417,80080,13,47,48
file_name_4_i2,69,1313,189656,13,149,150
file_name_4_i4,69,3009,398820,13,195,196
file_name_4_i5,69,22913,2855044,13,3991,4144
file_name_5_i1,85,59,48636,16,47,48
file_name_5_i2,85,163,64888,15,77,77
file_name_5_i3,85,349,108728,16,103,111
file_name_5_i4,85,1063,253180,14,248,248
file_name_5_i5,85,2393,526164,15,687,689
file_name_5_i6,85,17713,3643728,15,5862,5867
file_name_6_i1,104,84,75044,33,137,138
file_name_6_i2,104,455,204792,28,538,598
file_name_6_i3,104,1330,513336,31,2062,2063
file_name_6_i4,104,2925,1072276,28,3233,3236
file_name_6_i5,104,6545,2340416,28,7056,7059
...
So the x-axis would be df[1], whose values are 31, 31, 31, 31, 34, 34, ..., and the y-axis plots df[5], df[4], and df[2], e.g. 9, 10, 413. For each distinct value of df[1], a new colour needs to be assigned; it would be fine to repeat the colour cycle after, say, 6 unique colours. Within each colour, the opacity needs to vary with the value of df[2] (even though the y-axis shows df[5], df[4], and df[2]): the highest gets the darkest version of the colour and the lowest the lightest.
and the scatter plot:
This is roughly what my desired colour-coding needs to look like:
I have around 200 entries in the csv file.
Would using NumPy be more advantageous in this scenario?
Let me know if this is appropriate or if I have misunderstood anything:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# not needed for you
# df = pd.read_csv('~/Documents/tmp.csv')
# NOTE: the column labels here are the strings '1' and '2'; if you read your
# file with header=None they will be the integers 1 and 2 instead
max_2 = pd.DataFrame(df.groupby('1').max()['2'])

no_unique_colors = 3
color_set = [np.random.random(3) for _ in range(no_unique_colors)]

# assign colors to the unique df1 values in cyclic order
max_2['colors'] = [color_set[unique_df1 % no_unique_colors] for unique_df1 in range(max_2.shape[0])]

# calculate the opacities for each entry in the dataframe
colors = [list(max_2.loc[df1].colors) + [float(df['2'].iloc[i]) / max_2['2'].loc[df1]] for i, df1 in enumerate(df['1'])]

# repeat thrice so that df2, df4 and df5 share the same opacity
colors = [x for x in colors for _ in range(3)]

plt.scatter(df['1'].values.repeat(3), df[['2', '4', '5']].values.reshape(-1), c=colors)
plt.show()
Well, what do you know. I understood this task totally differently. I thought the point was to have alpha levels according to all df[2], df[4], and df[5] values for each df[1] value. Oh well, since I have done the work already, why not post it?
from matplotlib import pyplot as plt
import pandas as pd
from itertools import cycle
from matplotlib.colors import to_rgb

# read the data; column numbers will be generated automatically
df = pd.read_csv("data.txt", sep=",", header=None)

# our figure with the ax object
fig, ax = plt.subplots(figsize=(10, 10))

# definition of the colors
sc_color = cycle(["tab:orange", "red", "blue", "black"])

# get groups with the same df[1] value; they are sorted at the same time
dfgroups = df.iloc[:, [2, 4, 5]].groupby(by=df[1])

# plot each group with a different colour
for groupkey, groupval in dfgroups:
    # create a group dataframe with the df[1] value as x and the df[2], df[4], and df[5] values as y
    groupval = groupval.melt(var_name="x", value_name="y")
    groupval.x = groupkey
    # get min and max y for the normalization
    y_high = groupval.y.max()
    y_low = groupval.y.min()
    # read out the r, g, and b values of the next color in the cycle
    r, g, b = to_rgb(next(sc_color))
    # create a colour array with nonlinearly normalized alpha levels
    # between roughly 0.2 and 1.0, so that all data points remain visible
    group_color = [(r, g, b, 0.19 + 0.8 * ((y_high - val) / (y_high - y_low)) ** 7) for val in groupval.y]
    # and plot
    ax.scatter(groupval.x, groupval.y, c=group_color)

plt.show()
Sample output of your data:
There are two main problems here. One is that the alpha keyword of a scatter plot does not accept an array, but color does; hence the detour of reading out the RGB values and creating an RGBA array with per-point alpha levels.
The other is that your data are spread over a rather wide range, and a linear normalization would make changes near the lowest values invisible. There is surely some optimization possible; I like, for instance, this suggestion.
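As one possible refinement of that nonlinear normalization (my own sketch, not part of the answer), matplotlib.colors.LogNorm can map a wide-ranged, strictly positive column onto [0, 1] for the alpha channel:
import numpy as np
from matplotlib.colors import LogNorm, to_rgb

def rgba_with_log_alpha(values, base_color, lo=0.2, hi=1.0):
    """Map strictly positive values to RGBA tuples whose alpha grows with log(value)."""
    norm = LogNorm(vmin=min(values), vmax=max(values))
    r, g, b = to_rgb(base_color)
    # norm(v) lies in [0, 1]; rescale into [lo, hi] so no point becomes invisible
    return [(r, g, b, lo + (hi - lo) * float(norm(v))) for v in values]

colors = rgba_with_log_alpha([9, 25, 150, 4144], "tab:orange")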

How to use the decile cut from one data set to cut another?

I know we can use the following code to create a decile column based on a column of a given data set when there are ties in the data (see How to qcut with non unique bin edges?):
import numpy as np
import pandas as pd
# create a sample
np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(100, 3), columns=list('ABC'))
# sort by column C
df = df.sort_values(['C'], ascending=False)
# create decile by column C
df['decile'] = pd.qcut(df['C'].rank(method='first'), 10, labels=np.arange(10, 0, -1))
Is there an easy way to save the cut points from df and then use the same cut points to cut a new data set? For example:
np.random.seed([1])
df_new = pd.DataFrame(np.random.rand(100, 1), columns=list('C'))
You can use .left to get all the bins:
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
s2 = pd.Series([2, 3, 4, 6, 1])

# decile intervals fitted on s1
a = pd.qcut(s1, 10).unique()
# left edges of those intervals, with inf closing the last bin
bins = [x.left for x in a] + [np.inf]
# reuse the same cut points on the new data
pd.cut(s2, bins=bins)
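Alternatively (my suggestion, not from the answer above), pd.qcut can return the exact cut points via retbins=True, which you can then reuse with pd.cut on the new data; note the labels here run 0-9 from low to high rather than the 10-1 labelling used in the question:
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(100, 3), columns=list('ABC'))

# fit decile edges on the original column; duplicates='drop' guards against ties
df['decile'], edges = pd.qcut(df['C'], 10, labels=False, retbins=True, duplicates='drop')

# reuse the same edges on new data; values outside the fitted range become NaN
np.random.seed([1])
df_new = pd.DataFrame(np.random.rand(100, 1), columns=list('C'))
df_new['decile'] = pd.cut(df_new['C'], bins=edges, labels=False, include_lowest=True)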
