I need to check the data distributions of all my numeric columns in a dataset. I chose the Fitter library to do so. I loop over all the columns, but I get only one plot and summary table as the outcome. What is wrong with my code?
from fitter import Fitter
import numpy as np
df_numeric = df.select_dtypes(include=np.number).sample(n=5000)
num_cols = df_numeric.columns.tolist()
distr = ['cauchy', 'chi2', 'expon', 'exponpow', 'gamma', 'beta',
         'lognorm', 'logistic', 'norm', 'powerlaw', 'rayleigh', 'uniform']
for col in num_cols:
    modif_col = df_numeric[col].fillna(0).values
    dist_fitter = Fitter(modif_col, distributions=distr)
    dist_fitter.fit()
    dist_fitter.summary()
Maybe there is another approach to check distributions in a loop?
Your code is correctly looping over all the numeric columns in the dataframe, fitting the candidate distributions to each column with the Fitter library, and calling summary() on each fit. However, you only see one plot and summary table because every iteration draws onto the same figure and overwrites the previous output.
To see a separate plot and summary table for each column, create a new figure on every iteration, print the summary explicitly, and give each plot the column name as its title so you can tell the outputs apart.
Here is a code example you can use:
import matplotlib.pyplot as plt
for col in num_cols:
    modif_col = df_numeric[col].fillna(0).values
    dist_fitter = Fitter(modif_col, distributions=distr)
    dist_fitter.fit()
    plt.figure()                     # a new figure for every column
    print(col)
    print(dist_fitter.summary())     # plots the best fits and prints the summary table
    plt.title(col)
    plt.show()
I have daily crude oil prices downloaded from FRED, about 10k observations; some values are blank (the code cleans them). I believe I cannot share Excel sheets here, so I will just give you a screenshot of what the data looks like:
I calculate the differences and returns and clean up the data but I am kind of stuck.
Here is what the code looks like to get you started:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv("DCOILWTICO.csv")
nan_value = float("NaN")
data.replace("", nan_value, inplace=True)
data.replace(".", nan_value, inplace=True)
data['Previous'] = data['DCOILWTICO'].shift(1)
data.dropna(subset=['Previous'],inplace=True)
data.replace("", nan_value, inplace=True)
data.replace(".", nan_value, inplace=True)
data['DCOILWTICO'] = data['DCOILWTICO'].astype(float)
data['Previous'] = data['Previous'].astype(float)
data['Diff'] = data['DCOILWTICO'] - data['Previous']
data['Return'] = (data['DCOILWTICO'] - data['Previous'])/data['Previous']
Here comes the question: I am trying to duplicate the graph below (which I believe was generated using Mathematica). The difficult part is creating the bins in the right way. Looking at the graph, there appear to be around 200 bins. On the x-axis are the returns and on the y-axis are the frequencies (which have been binned).
I think you are asking how to make equally spaced bins in logspace. If so then use the np.geomspace function (geometric space), rather than np.linspace (linear space).
plt.figure()
bins = np.geomspace(data['Return'].min(), data['Return'].max(), 200)
plt.hist(data['Return'], bins=bins)
plt.show()
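Note that np.geomspace needs both endpoints to be nonzero and of the same sign, and daily returns usually cross zero. If that is true for your Return column (an assumption about your data), equally spaced linear bins are a simple fallback; a sketch:
plt.figure()
bins = np.linspace(data['Return'].min(), data['Return'].max(), 200)  # 200 equally spaced bin edges
plt.hist(data['Return'], bins=bins)
plt.show()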
I have 4 variables, of which 2 are nominal (dtype=object) and 2 are numeric (dtypes int and float).
df.head(1)
OUT:
OS_type|Week_day|clicks|avg_app_speed
iOS|Monday|400|3.4
Now, I want to throw the dataframe into a seaborn heatmap visualization.
import numpy as np
import seaborn as sns
ax = sns.heatmap(df)
But I get an error indicating I cannot use categorical variables, only numbers. How do I process this correctly and then feed it back into the heatmap?
The heatmap needs numeric values on a comparable scale. For correlations between numerical variables you can use Pearson's r, for correlations between categorical variables (the corrected) Cramér's V, and for correlations between a categorical and a numerical variable the correlation ratio.
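For example, a plain (uncorrected) Cramér's V between two categorical columns could be computed roughly like this; this is a sketch using pandas and scipy, and the column names in the usage comment are the ones from your example:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Uncorrected Cramér's V between two categorical series."""
    confusion = pd.crosstab(x, y)          # contingency table
    chi2 = chi2_contingency(confusion)[0]  # chi-squared statistic
    n = confusion.to_numpy().sum()         # number of observations
    r, c = confusion.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

# e.g. cramers_v(df['OS_type'], df['Week_day'])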
As for creating numerical representations of categorical variables, there are a number of ways to do that:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv('some_source.csv') # has categorical var 'categ_var'
# method 1: uses pandas
df['numerized1'] = df['categ_var'].astype('category').cat.codes
# method 2: uses pandas, sorts values descending by frequency
df['numerized2'] = df['categ_var'].apply(lambda x: df['categ_var'].value_counts().index.get_loc(x))
# method 3: uses sklearn, result is the same as method 1
lbl = LabelEncoder()
df['numerized3'] = lbl.fit_transform(df['categ_var'])
# method 4: uses pandas; xyz captures a list of the unique values
df['numerized4'], xyz = pd.factorize(df['categ_var'])
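Once everything is numeric, one way to feed it back into seaborn (a sketch, assuming you want a heatmap of correlations rather than of the raw values) is:
import seaborn as sns
import matplotlib.pyplot as plt

# correlation matrix of the (now numeric) columns
corr = df.select_dtypes('number').corr()
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1)
plt.show()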
Is it possible to group the data (for defining the x and y variables) for running a regression directly in regplot (or any other seaborn feature)? I am unable to find a built-in feature of that sort.
For example, I have a categorical variable "C" in a column, and I am trying to fit a regression line (with x and y) using the median for each category of C. Is there any functionality to do so?
You need to group your data with pandas first and then plot it with seaborn. Since you didn't provide your dataframe, I will use a seaborn sample dataset to demonstrate.
import pandas as pd
import seaborn as sns
# load dataframe
df = sns.load_dataset('car_crashes')
The dataframe looks like the following. The abbrev column is a categorical column. I will use the total and speeding variables as y and x.
First, use pandas' .groupby() method, pass your categorical variable, and chain .median() so that pandas aggregates your data and returns the median for each group. Pandas will return a dataframe that looks like the following.
Then select the columns you want to plot, in our case total and speeding, and pass them as x and y to seaborn's .regplot():
# group by category and take the median of each group
medians = df.groupby('abbrev').median()
x = medians.speeding
y = medians.total
# plot (newer seaborn versions require keyword arguments)
sns.regplot(x=x, y=y)
I have a pandas dataframe with many speed values, which are continuously moving sensor readings, so we often get errors in the middle, and at some points the moving average does not seem to help either. What methods can I use to remove these outliers or peak points from the data?
Example:
data points = {0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9}
In this data, the points 4, 4, 5, 6 are clearly outlier values.
Previously I used a rolling mean with a 5-minute window to smooth these values, but I still get a lot of these blip points, which I want to remove. Can anyone suggest a technique to get rid of them?
I have an image which gives a clearer view of the data:
You can see here how the data shows some outlier points that I have to remove.
Any idea what would be a possible way to get rid of these points?
I really think the z-score using scipy.stats.zscore() is the way to go here. Have a look at the related issue in this post; there the focus is on which method to use before removing potential outliers. As I see it, your challenge is a bit simpler, since, judging by the data provided, it would be pretty straightforward to identify potential outliers without having to transform the data. Below is a code snippet that does just that. Just remember that what does and does not look like an outlier depends entirely on your dataset, and after removing some outliers, what did not look like an outlier before may suddenly look like one. Have a look:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
# your data (as a list)
data = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]
# initial plot
df1 = pd.DataFrame(data = data)
df1.columns = ['data']
df1.plot(style = 'o')
# Function to identify and remove outliers
def outliers(df, level):
    # 1. temporary dataframe
    df = df.copy(deep=True)
    # 2. Select a level for a Z-score to identify and remove outliers
    df_Z = df[(np.abs(stats.zscore(df)) < level).all(axis=1)]
    ix_keep = df_Z.index
    # 3. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]
    return df_keep
Original data:
Test run 1 : Z-score = 4:
As you can see, no data has been removed because the level was set too high.
Test run 2 : Z-score = 2:
Now we're getting somewhere. Two outliers have been removed, but there is still some dubious data left.
Test run 3 : Z-score = 1.2:
This is looking really good. The remaining data now seems to be a bit more evenly distributed than before. But now the data point highlighted in the original plot is starting to look a bit like a potential outlier. So where to stop? That's going to be entirely up to you!
EDIT: Here's the whole thing for an easy copy&paste:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
# your data (as a list)
data = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]
# initial plot
df1 = pd.DataFrame(data = data)
df1.columns = ['data']
df1.plot(style = 'o')
# Function to identify and remove outliers
def outliers(df, level):
    # 1. temporary dataframe
    df = df.copy(deep=True)
    # 2. Select a level for a Z-score to identify and remove outliers
    df_Z = df[(np.abs(stats.zscore(df)) < level).all(axis=1)]
    ix_keep = df_Z.index
    # 3. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]
    return df_keep
# remove outliers
level = 1.2
print("df_clean = outliers(df = df1, level = " + str(level)+')')
df_clean = outliers(df = df1, level = level)
# final plot
df_clean.plot(style = 'o')
You might cut values above a certain quantile as follows:
import numpy as np
data_points = np.array(data_points)
clean_data = data_points[data_points <= np.percentile(data_points, 95)]
In pandas you would use df.quantile for the same idea (see the pandas documentation).
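A minimal sketch of the same cut with pandas, assuming data_points holds the values from the question:
import pandas as pd

s = pd.Series(data_points)
clean = s[s <= s.quantile(0.95)]  # keep values at or below the 95th percentile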
Or you may use the Q3 + 1.5*IQR rule to eliminate the outliers, as you would with a boxplot.
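A sketch of that boxplot-style rule, again assuming data_points holds the values from the question:
import numpy as np

arr = np.array(data_points)
q1, q3 = np.percentile(arr, [25, 75])  # first and third quartiles
iqr = q3 - q1
clean = arr[(arr >= q1 - 1.5 * iqr) & (arr <= q3 + 1.5 * iqr)]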