How to prepare training data (remove boundary values) - python

Is there a numpy or pandas function that does something like this?
For me, boundary values are the values farthest from the regression line, that is:
the farthest point above the line and the farthest point below the line.
Suppose I have the data:
l1 = [0,1,4,3,4,3]
df = pd.DataFrame(l1)
It looks like this:
   0
0  0
1  1
2  4
3  3
4  4
5  3
How can I find the data at index 1 and index 4?
I need to detect them from a Python script and remove them. I know how to remove them, but I do not know how to find them.
What I want to do:
First I calculate a linear regression, then I remove the farthest values above and below the line, and then I recalculate the linear regression without them.
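A direct way to find the farthest point above and the farthest point below the regression line is to fit the line and inspect the residuals. Here is a minimal sketch using np.polyfit (the variable names are my own):

import numpy as np
import pandas as pd

l1 = [0,1,4,3,4,3]
s = pd.Series(l1)
x = np.arange(len(s))

# first regression: fit a straight line to the data
slope, intercept = np.polyfit(x, s.to_numpy(), 1)
residuals = s.to_numpy() - (slope * x + intercept)

# farthest point above and farthest point below the line
idx_above = int(residuals.argmax())
idx_below = int(residuals.argmin())

# remove the two farthest points and refit
s2 = s.drop(index=[idx_above, idx_below])
slope2, intercept2 = np.polyfit(s2.index, s2.to_numpy(), 1)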

To remove outliers, you can use Series.quantile:
Suppose the following dataframe:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(2022)
df = pd.DataFrame({'A': np.random.normal(5, 2, size=50)})
df.plot.hist(bins=25)
plt.xlim(0, 10)
plt.show()
Now filter your dataframe, keeping only the values between the 25th and 75th percentiles:
df1 = df.loc[df['A'].between(*df['A'].quantile([0.25, 0.75]).values)]
df1.plot.hist(bins=10)
plt.xlim(0, 10)
plt.show()
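The starred unpacking in the filter above is equivalent to pulling the two quantiles out explicitly, which may be easier to read:

# same filter, written out step by step
lo, hi = df['A'].quantile([0.25, 0.75])
df1 = df.loc[df['A'].between(lo, hi)]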

Related

Plot average of multiple line plots with different x values

I have multiple dataframes that look similar to this:
x  y        x    y
1  2        0.5  2
2  4        1.5  6
3  6        3    12
Where the x columns are my indices. I want to plot the average line plot for these multiple datasets. My idea was to concatenate the two dataframes so that I would have a scatterplot and could fit a best-fit line, but pandas throws the error "Reindexing only valid with uniquely valued Index objects". I've read other questions about this error message and have renamed my index and column names to x_1, x_2, y_1 and y_2, but it still complains, I believe because some of the x values are the same. What am I doing wrong here?
Not sure if I understand completely how your dataframes look, but you can concatenate two (or more) dataframes df1, df2, ... by doing:
new_dataframe = pd.DataFrame(np.concatenate([df1, df2]), columns=['x','y'])
where my imports are
import pandas as pd
import numpy as np
Are you just looking for a best-fit line through all the points? If so, you can concat and use lmplot.
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'x':[1,2,3],'y':[2,4,6]})
df2 = pd.DataFrame({'x':[.5,1.5,3], 'y':[2,6,12]})
out = pd.concat([df,df2])
sns.lmplot(data=out, x='x', y='y', ci=None);
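If you want the literal average line rather than a regression fit, one approach is to interpolate each dataframe onto a common x grid with np.interp and average pointwise. A sketch, where the grid resolution and the overlap-only range are my own choices:

import numpy as np
import matplotlib.pyplot as plt

# common x grid spanning the overlap of the two datasets
grid = np.linspace(max(df['x'].min(), df2['x'].min()),
                   min(df['x'].max(), df2['x'].max()), 50)

# interpolate each dataset onto the grid, then average pointwise
ys = [np.interp(grid, d['x'], d['y']) for d in (df, df2)]
mean_y = np.mean(ys, axis=0)

plt.plot(grid, mean_y, label='average')
plt.scatter(df['x'], df['y'])
plt.scatter(df2['x'], df2['y'])
plt.legend()
plt.show()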

How do I format a y-axis 'y' in matplotlib going between pandas dataframes and simple variables?

CSV1only is a dataframe loaded from a CSV.
Suppose, as a small example, that CSV1only has a single column:
TRADINGITEMID:
1233
2455
3123
1235
5098
How can I plot a scatterplot accordingly, specifically the y-axis?
I tried:
import pandas as pd
import matplotlib.pyplot as plt
CSV1only.plot(kind='scatter',x='TRADINGITEMID', y= [1,2], color='b')
plt.xlabel('TRADINGITEMID Numbers')
plt.ylabel('Range')
plt.title('Distribution of ItemIDNumbers')
and it doesn't work because of the y argument.
So, my main question is just how I can get a 0, 1, 2 y-axis for this scatter plot, as I want to make a distribution graph.
The following code doesn't work because the new column doesn't match the number of rows in the original TRADINGITEMID column, which has 5000 rows:
newcolumn_values = [1, 2]
CSV1only['un et deux'] = newcolumn_values
# and then I changed the y = [1,2] from before into y = ['un et deux']
Therefore the solution would need to work for any number of rows N, while keeping the y-axis to some small range [0, m], m being some arbitrary integer.
You don't need to worry about the actual pandas dataframe CSV1only. The TRADINGITEMID column contains 5000 rows of unique numbers, so I just want to plot those numbers on a line, with the y-axis showing instances (which will never pass 1, since each ID is unique).
I think you are looking for the following: you need to generate y-values from 0 to n-1, where n is the total number of rows:
import numpy as np
import matplotlib.pyplot as plt

# one y-value per row: 0, 1, ..., n-1
y = np.arange(len(CSV1only['TRADINGITEMID']))
plt.scatter(CSV1only['TRADINGITEMID'], y, c='DarkBlue')
plt.show()
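Alternatively, since every TRADINGITEMID is unique, each ID occurs exactly once; if the y-axis should show instance counts, a constant y of 1 does the job. A sketch, assuming CSV1only is loaded as in the question:

import numpy as np
import matplotlib.pyplot as plt

# each unique ID occurs exactly once, so its instance count is 1
y = np.ones(len(CSV1only['TRADINGITEMID']))
plt.scatter(CSV1only['TRADINGITEMID'], y, c='DarkBlue')
plt.ylim(0, 2)   # y-axis showing the requested 0..2 range
plt.show()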

Multiple columns visualization with plotly or seaborn

I have data for factories and their error codes during production, such as below:
PlantID  A  B  C  D
1        0  1  2  4
1        3  0  2  0
3        0  0  0  1
4        0  1  1  5
Each row represents a production order.
I want to create a graph with PlantID on the x-axis and A, B, C, D as separate bars on the y-axis.
That way I can see in one graph which factory has the most D errors and which has the most A errors.
I usually use plotly and seaborn, but I couldn't find any solution for this; in every example the y-axis is a single column.
Thanks in advance.
Seaborn likes its data in long-form or wide-form.
As the seaborn documentation puts it, seaborn will be most powerful when your datasets have a particular organization. This format is alternately called "long-form" or "tidy" data and is described in detail by Hadley Wickham in this academic paper. The rules can be simply stated:
Each variable is a column
Each observation is a row
The following code converts the original dataframe to a long-form dataframe by stacking the columns on top of each other, so that every row corresponds to a single record specifying the column name and the value (the count).
import numpy as np
import pandas as pd
import seaborn as sns
# Generating some data
N = 20
PlantID = np.random.choice(np.arange(1, 4), size=N, replace=True)
data = dict((k, np.random.randint(0, 50, size=N)) for k in ['A', 'B', 'C', 'D'])
df = pd.DataFrame(data, index=PlantID)
df.index = df.index.set_names('PlantID')
# Stack the columns and reset the index to create a long format (plus some renaming)
df = df.stack().reset_index().rename({'level_1' : 'column', 0: 'count'},axis=1)
sns.barplot(x='PlantID', y='count', hue='column', data=df)
Pandas also has really clever built-in plotting functionality:
import matplotlib.pyplot as plt

df.plot(kind='bar')   # one group of bars per row of the wide dataframe
plt.show()
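Note that df.plot(kind='bar') draws one group of bars per row, so repeated PlantIDs appear as separate groups. If you want one group per plant, a reasonable sketch is to aggregate first (summing the error counts per plant is my own assumption), rebuilding the question's example data:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'PlantID': [1, 1, 3, 4],
                   'A': [0, 3, 0, 0],
                   'B': [1, 0, 0, 1],
                   'C': [2, 2, 0, 1],
                   'D': [4, 0, 1, 5]})

# total errors per plant, one bar per error code
df.groupby('PlantID').sum().plot(kind='bar')
plt.ylabel('error count')
plt.show()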

Pandas: How to detect the peak points (outliers) in a dataframe?

I have a pandas dataframe with continuously moving speed values. Since it is sensor data, we often get errors at some points, and the moving average does not seem to help. What methods can I use to remove these outliers or peak points from the data?
Example:
data_points = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]
In this data, the points 4, 4, 5, 6 are clearly outlier values.
I have previously used a rolling mean with a 5-minute window to smooth these values, but I still get a lot of these blip points, which I want to remove. Can anyone suggest a technique to get rid of them?
I have an image which gives a clearer view of the data, showing the outlier points that I have to remove. Any idea of a possible way to get rid of these points?
I really think the z-score using scipy.stats.zscore() is the way to go here. Have a look at the related issue in this post. There they focus on which method to use before removing potential outliers. As I see it, your challenge is a bit simpler: judging by the data provided, it would be pretty straightforward to identify potential outliers without having to transform the data. Below is a code snippet that does just that. Just remember, though, that what does and does not look like an outlier will depend entirely on your dataset. And after removing some outliers, what did not look like an outlier before suddenly will. Have a look:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats

# your data (as a list)
data = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]

# initial plot
df1 = pd.DataFrame(data=data)
df1.columns = ['data']
df1.plot(style='o')

# Function to identify and remove outliers
def outliers(df, level):
    # 1. temporary dataframe
    df = df.copy(deep=True)
    # 2. Select a level for a Z-score to identify and remove outliers
    df_Z = df[(np.abs(stats.zscore(df)) < level).all(axis=1)]
    ix_keep = df_Z.index
    # 3. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]
    return df_keep
Original data:
Test run 1 : Z-score = 4:
As you can see, no data has been removed because the level was set too high.
Test run 2 : Z-score = 2:
Now we're getting somewhere. Two outliers have been removed, but there is still some dubious data left.
Test run 3 : Z-score = 1.2:
This is looking really good. The remaining data now seem a bit more evenly distributed than before. But now some points that did not stand out in the original data are starting to look like potential outliers. So where to stop? That's going to be entirely up to you!
EDIT: Here's the whole thing for an easy copy&paste:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats

# your data (as a list)
data = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]

# initial plot
df1 = pd.DataFrame(data=data)
df1.columns = ['data']
df1.plot(style='o')

# Function to identify and remove outliers
def outliers(df, level):
    # 1. temporary dataframe
    df = df.copy(deep=True)
    # 2. Select a level for a Z-score to identify and remove outliers
    df_Z = df[(np.abs(stats.zscore(df)) < level).all(axis=1)]
    ix_keep = df_Z.index
    # 3. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]
    return df_keep

# remove outliers
level = 1.2
print("df_clean = outliers(df = df1, level = " + str(level) + ')')
df_clean = outliers(df=df1, level=level)

# final plot
df_clean.plot(style='o')
You might cut values above a certain quantile as follows:
import numpy as np

# convert to an array first so the comparison broadcasts element-wise
data_points = np.array(data_points)
clean_data = data_points[data_points <= np.percentile(data_points, 95)]
In pandas you would use df.quantile; see the pandas documentation.
Or you may use the Q3 + 1.5*IQR approach to eliminate the outliers, like you would through a boxplot.
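A minimal sketch of that rule, using the usual boxplot upper fence:

import numpy as np

data_points = np.array([0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,
                        0.5,0.5,4,5,6,0.4,0.7,0.8,0.9])

q1, q3 = np.percentile(data_points, [25, 75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr   # boxplot upper fence

clean_data = data_points[data_points <= upper]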

Why does Pandas qcut give me unequal sized bins?

Pandas docs have this to say about the qcut function:
Discretize variable into equal-sized buckets based on rank or based on sample quantiles.
So I would expect this code to give me 4 bins of 10 values each:
import numpy as np
import pandas as pd
np.random.seed(4242)
y = pd.Series(np.random.randint(low=1, high=10, size=40))
quartiles = pd.qcut(y, 4, labels=['1st', '2nd', '3rd', '4th'])
print('Quartiles:')
print(quartiles.value_counts(sort=False))
y.groupby(quartiles).agg(['count', 'mean']).plot(kind='bar');
But instead I get this:
Quartiles:
1st 14
2nd 6
3rd 11
4th 9
dtype: int64
What am I doing wrong here?
This happens because of tied values: qcut computes the quartile boundaries first, and all records with the same value must land in the same bin, so ties that straddle a boundary make the bins unequal. A simple adjustment to your code will produce the desired result:
import numpy as np
import pandas as pd
np.random.seed(4242)
y = pd.Series(np.random.randint(low=1, high=10, size=40))
quartiles = pd.qcut(y.rank(method = 'first'), 4, labels=['1st', '2nd', '3rd', '4th'])
print('Quartiles:')
print(quartiles.value_counts(sort=False))
y.groupby(quartiles).agg(['count', 'mean']).plot(kind='bar');
Ranking with method='first' breaks ties by order of appearance, making every value unique, so records that would otherwise straddle a bin boundary are split cleanly across the bins. In this case, I've used method='first' as the argument for the rank() function.
The output I get is as follows:
Quartiles:
1st 10
2nd 10
3rd 10
4th 10
dtype: int64
Looking at the boundaries of the bins highlights the problem:
boundaries = [1, 2, 3.5, 6, 9]
These boundaries are correct. pandas first computes the quantile values (inside qcut), and afterwards the samples are put into the bins. The run of 2s overlaps the boundary of the first quartile, which is why the first bin ends up larger than the second.
The boundary 3.5 arises because the value below the threshold is a 3 and the value above it is a 4; pandas' quantile function places the boundary between the two neighboring values.
In conclusion: a concept like quantiles becomes more and more appropriate as the number of samples grows, since more values are available to fix the boundaries.
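You can verify those boundaries yourself by asking qcut for the bin edges with retbins=True (part of the pandas API):

import numpy as np
import pandas as pd

np.random.seed(4242)
y = pd.Series(np.random.randint(low=1, high=10, size=40))

# retbins=True also returns the computed quartile boundaries
quartiles, bins = pd.qcut(y, 4, labels=['1st', '2nd', '3rd', '4th'], retbins=True)
print(bins)   # the quartile boundaries quoted above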
