I am trying to loop through a Pandas DataFrame and produce a bar chart only for columns that contain exactly two unique values. I envision each bar chart having the two unique values on the X axis and the number of rows on the Y axis.
I've been able to produce a Series off my data frame (df_clean) which shows me the number of unique values per column:
col_values = df_clean.apply(lambda x: len(x.unique()))
But I am completely lost on how to:
loop through my df_clean and plot only the columns with two unique values
produce multiple graphs in one figure (I think matplotlib's subplot would help?)
In the same code, I have been able to successfully loop through my df_clean and successfully plot all the int and float type columns. I am struggling with how to modify this working code for the above issue.
i = 1
c_num_cols = len(df_clean.select_dtypes(["int64", "float64"]).columns)
for column in df_clean.select_dtypes(["int64", "float64"]).columns:
    plt.subplot(c_num_cols, (c_num_cols % 2) + 1, i)
    plt.subplots_adjust(hspace=0.5)
    df_clean[column].plot(kind='hist', figsize=[15, c_num_cols * 4], title=column)
    i += 1
Try using DataFrame.nunique and Series.value_counts:
binary_cols = df.nunique()[lambda x: x == 2].index

for i, col in enumerate(binary_cols):
    plt.subplot(len(binary_cols), (len(binary_cols) % 2) + 1, i + 1)
    plt.subplots_adjust(hspace=0.5)
    df[col].value_counts().plot(kind='bar')
Example
# Setup
df = pd.DataFrame({'col1': list('aaaaaaabbbbbbbb'),
                   'col2': list('aaabbbcccdddeee'),
                   'col3': [1] * 9 + [3] * 6})

binary_cols = df.nunique()[lambda x: x == 2].index

for i, col in enumerate(binary_cols):
    plt.subplot(len(binary_cols), (len(binary_cols) % 2) + 1, i + 1)
    plt.subplots_adjust(hspace=0.5)
    df[col].value_counts().plot(kind='bar')
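If you want more control over the grid layout, a sketch using plt.subplots could look like this (same df and binary_cols as above; the 2-column layout is just an assumption):

import math
import matplotlib.pyplot as plt

ncols = 2
nrows = math.ceil(len(binary_cols) / ncols)
fig, axes = plt.subplots(nrows, ncols, figsize=(10, 4 * nrows), squeeze=False)
# draw one bar chart per binary column, one subplot each
for ax, col in zip(axes.flat, binary_cols):
    df[col].value_counts().plot(kind='bar', ax=ax, title=col)
plt.tight_layout()
plt.show()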
Related
I'm very new to Python and am plotting a graph with matplotlib from values in a CSV, and I'm trying to figure out the most efficient way to remove outliers from my lists. The CSV has three variables, x, y, z, which I've put into separate lists.
I want to find the standard deviation of each list and remove every point that lies more than 2x stdev from the mean (removing the point from all three lists - x, y, z - not just one).
I'm having a hard time figuring out how to efficiently remove a point that is represented in three separate lists while making sure that I don't mix up different data points.
Do I use a while loop and delete the value at a certain position for each variable? If so, how would I reference the position in the list where the number is more than 2x stdev away? Thanks!
import matplotlib.pyplot as plt
import csv
import statistics as stat

# making a list for each variable
x = []
y = []
z = []

with open('fundata.csv', 'r') as csvfile:
    plots = csv.reader(csvfile, delimiter=',')
    # skip the header line in the CSV
    next(plots)
    # import each variable from the CSV file into a list as a float
    for row in plots:
        x.append(float(row[0]))
        y.append(float(row[1]))
        z.append(float(row[2]))

# cleaning up the data
stdev_x = stat.stdev(x)
stdev_y = stat.stdev(y)
stdev_z = stat.stdev(z)
print(stdev_x)
print(stdev_y)
print(stdev_z)

# making the graph
fig, ax = plt.subplots()
# make a scatter plot of x vs y with z as the colour; each point has size 3
ax.scatter(x, y, c=z, s=3)
# set chart title and label the axes
ax.set_title("Heatmap of variables", fontsize=18)
ax.set_xlabel("Var 1", fontsize=14)
ax.set_ylabel("Var 2", fontsize=14)
# open the Matplotlib viewer
plt.show()
Data set is as follows but is ~35000 rows long with more variability:

var1     var2     var3
3876514  3875931  3875846
3876515  3875931  3875846
3876516  3875931  3875846
It is nearly always easier to use pandas to deal with data of this kind. Calculate the mean and standard deviation of each column, then select the values within the required range. The outliers are replaced with missing values, and you can then use dropna to drop all the rows that contain missing values.
import pandas as pd

# header=0 skips the var1,var2,var3 header row; the columns are renamed x, y, z
df = pd.read_csv("fundata.csv", header=0, names=["x", "y", "z"])

mean = df.mean(axis=0)
std = df.std(axis=0)
edited = df[(mean - 2 * std <= df) & (df <= mean + 2 * std)].dropna()
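If you then want the cleaned columns back as plain lists for the plotting code from the question, something like this should work:

x, y, z = edited["x"].tolist(), edited["y"].tolist(), edited["z"].tolist()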
Alternatively, use scipy.stats.zscore, which will do the calculation for you:
from scipy.stats import zscore
...
edited = df[(abs(zscore(df)) <= 2).all(axis=1)]
If you want to avoid pandas for some reason, then one way would be to replace all the outliers within each column with None:
import statistics

def replace_outliers(values):
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    for v in values:
        if mean - 2 * stdev <= v <= mean + 2 * stdev:
            yield v
        else:
            yield None
x, y, z = [replace_outliers(column) for column in [x, y, z]]
Then zip the columns together and select rows that do not contain None:
selected_rows = [row for row in zip(x, y, z) if None not in row]
Finally if needed you can zip the rows together to transpose the data back into three column lists:
x, y, z = zip(*selected_rows)
I can create one pie chart using the 'Churn' column to group the data. However, I'm not sure how to create a function that accepts a DataFrame as input and returns pie charts for all the appropriate categorical features, showing the percentage distribution in each chart.
As the DataFrame, I am using "Telco-Customer-Churn.csv".
f, axes = plt.subplots(1, 2, figsize=(17, 7))
df_churn['Churn'].value_counts().plot.pie(autopct='%1.1f%%', ax=axes[0])
sns.countplot('Churn', data=df_churn, ax=axes[1])
axes[0].set_title('Categorical Variable Pie Chart')
plt.show()
I did something like this; not sure if I did it right:
#%% PlotMultiplePie
# Input:  df = Pandas dataframe, categorical_features = list of features,
#         dropna = boolean flag controlling whether NaN is counted
# Output: prints multiple px.pie()

import plotly.express as px

def PlotMultiplePie(df_churn, categorical_features=None, dropna=False):
    # set a threshold on unique values; too many distinct values leads to ugly pie charts
    threshold = 40
    # if the user did not set categorical_features, take all object/category columns
    if categorical_features is None:
        categorical_features = df_churn.select_dtypes(['object', 'category']).columns.to_list()
        print(categorical_features)
    # loop through the list of categorical_features
    for cat_feature in categorical_features:
        num_unique = df_churn[cat_feature].nunique(dropna=dropna)
        num_missing = df_churn[cat_feature].isna().sum()
        # print a pie chart and info if the number of unique values is below the threshold
        if num_unique <= threshold:
            print('Pie Chart for: ', cat_feature)
            print('Number of Unique Values: ', num_unique)
            print('Number of Missing Values: ', num_missing)
            fig = px.pie(df_churn[cat_feature].value_counts(dropna=dropna),
                         values=cat_feature,
                         names=df_churn[cat_feature].value_counts(dropna=dropna).index,
                         title=cat_feature, template='ggplot2')
            fig.show()
        else:
            print('Pie Chart for ', cat_feature, ' is unavailable due to the high number of Unique Values')
            print('Number of Unique Values: ', num_unique)
            print('Number of Missing Values: ', num_missing)
        print('\n')
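As a rough usage sketch (assuming the Telco CSV is in the working directory and has been loaded with pandas):

import pandas as pd

df_churn = pd.read_csv("Telco-Customer-Churn.csv")
PlotMultiplePie(df_churn, dropna=False)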
This worked for me. I defined a function to plot pie charts for all categorical variables in a dataframe.
# Function to plot pie charts for all categorical variables in the dataframe
def pie_charts_for_CategoricalVar(df_pie, m):
    '''Takes in a dataframe (df_pie) and plots pie charts for all categorical columns.
    m = number of columns required in the grid.'''
    # get all the column names in the dataframe
    a = []
    for i in df_pie:
        a.append(i)
    # isolate the categorical variable names from a into b
    b = []
    for i in a:
        if df_pie[i].dtype.name == 'category':
            b.append(i)
    plt.figure(figsize=(15, 12))
    plt.subplots_adjust(hspace=0.2)
    plt.suptitle("Pie-Charts for Categorical Variables in the dataframe", fontsize=18, y=0.95)
    # number of columns, as passed in when calling the function
    ncols = m
    # calculate the number of rows
    nrows = len(b) // ncols + (len(b) % ncols > 0)
    # loop through 'b' and keep track of the index
    for n, i in enumerate(b):
        # add a new subplot iteratively using nrows and ncols
        ax = plt.subplot(nrows, ncols, n + 1)
        # plot column 'i' of the dataframe on the new subplot axis
        df_pie.groupby(i).size().plot(kind='pie', autopct='%.2f%%', ax=ax)
        ax.set_title(i.upper())
        ax.set_xlabel("")
        ax.set_ylabel("")
    plt.show()
#calling the function to plot pie-charts for categorical variable
pie_charts_for_CategoricalVar(df,5) #dataframe, no. of cols in the grid
I want to make a linear equation with some dynamic inputs, for example
y = θ0*x0 + θ1*x1
or
y = θ0*x0 + θ1*x1 + θ2*x2 + θ3*x3 + θ4*x4
For that I have
a dictionary for x0, x1, x2 ... xn
and an array for θ0, θ1, θ2 ... θn
I'm new to Python, so I tried this function but I'm stuck.
So my question is: how can I write a function that takes x_values and theta_values as parameters and gives y_values as output?
X = pd.DataFrame({'x0': np.ones(6), 'x1': np.linspace(0, 5, 6)})
θ = np.matrix('0 1')

def line_func(features, parameters):
    result = []
    for feat, param in zip(features.iteritems(), parameters):
        for i in feat:
            result.append(i * param)
    return result

line_func(X, θ)
If you want to multiply your thetas with a list of features, then you technically multiply a matrix (the features) with a vector (theta).
You can do this as follows:
import numpy as np

x_array = x.values
theta = np.array([theta_0, theta_1])
x_array.dot(theta)
Just order your theta vector the way your columns are ordered in x. But note that this gives the row-wise sum of the products theta_i*x_i for all i. If you don't want it summed up row-wise, you just need to write x_array * theta.
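For instance, with the X from the question and a plain NumPy vector for θ (a minimal sketch; np.array is used here instead of np.matrix so the shapes stay simple):

import numpy as np
import pandas as pd

X = pd.DataFrame({'x0': np.ones(6), 'x1': np.linspace(0, 5, 6)})
theta = np.array([0, 1])       # one theta per column of X, in column order

y = X.values.dot(theta)        # row-wise sums theta_0*x0 + theta_1*x1 -> array of length 6
products = X.values * theta    # element-wise products, not summed -> shape (6, 2)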
If you want to work with pandas (which I wouldn't recommend) also for the multiplication, and want a dataframe with the products of each column value and the corresponding theta, you could do this as follows:
# define the theta-x mapping (theta value per column name in x)
thetas = {'x1': 1, 'x2': 3}

# create an empty result dataframe with the index of x
df_result = pd.DataFrame(index=x.index)

# assign the calculated columns in a loop
for col_name, col_series in x.items():  # .items() replaces the deprecated .iteritems()
    df_result[col_name] = col_series * thetas[col_name]

df_result
This results in:
   x1  x2
0   1   6
1  -1   3
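If you then want the y values from the question, you could sum the products across the columns; a small follow-up to the sketch above:

# y_i = theta_1*x1_i + theta_2*x2_i for each row
y = df_result.sum(axis=1)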
I have multiple DataFrames (up to 30) which all contain timestamps with associated values. The timestamps in the DataFrames do not necessarily overlap, and the recorded values can only stay the same or increase. A DataFrame may look like this:
time coverage
0 0.000000 32.111748
1 0.875050 32.482579
2 1.850576 32.784133
3 3.693440 34.205134
...
I uploaded a couple of csv files with data here 1, 2, 3, 4.
So what I am trying to do is to plot the increase of the mean and median coverage values over time for all recordings, as follows:
# data is a list of dataframes
keys = ["Run " + str(i) for i in range(len(data))]
glued = pd.concat(data, keys=keys).reset_index(level=0).rename(columns={'level_0': 'Run'})
glued["roundtime"] = glued["time"] / 60
glued["roundtime"] = glued["roundtime"].round(0) # 1 significant digit
f, (ax1, ax2) = plt.subplots(2)
my_dpi = 96
stepsize = 5
start = 0
end = 60
ax1.set_title("Mean")
ax2.set_title("Median")
f.set_size_inches(1980 / my_dpi, 1080 / my_dpi)
ax1 = sns.lineplot(x="roundtime", y="coverage", ci="sd", estimator="mean", data=glued, ax=ax1)
ax1.set(xlabel="Time", ylabel="Coverage in percent")
ax1.xaxis.set_ticks(np.arange(start, end, stepsize))
ax1.set_xlim(0, 70)
ax2 = sns.lineplot(x="roundtime", y="coverage", ci="sd", estimator='median', data=glued, ax=ax2)
ax2.set(xlabel="Time", ylabel="Coverage in percent")
ax2.xaxis.set_ticks(np.arange(start, end, stepsize))
ax2.set_xlim(0, 70)
plt.show()
The result looks like this.
However, the curve should never decrease as the "coverage" values can never decrease either. The reason for this, I suspect, is that at certain points in time I only have recordings of some DataFrames with lower values and therefore the mean/median is also lower.
I tried to fix this by aligning the indices of all the DataFrames and filling missing values with previous recordings, before doing any of the previous code. Like this:
# create a common index
index = None
for df in data:
    df.set_index("time", inplace=True, drop=False)
    if index is not None:
        index = index.union(df.index)
    else:
        index = df.index

# reindex all dataframes and fill missing values
new_data = []
for df in data:
    print(df)
    new_df = df.reindex(index, fill_value=np.NaN)
    new_df = new_df.fillna(method="ffill")
    new_data.append(new_df)
data = new_data
The result, however, does not change much and still decreases at certain times. It looks like this:
Is this approach wrong or am I simply missing something?
How can I change the background color of a line chart based on a variable that is not in the chart?
For example if I have the following dataframe:
import numpy as np
import pandas as pd
dates = pd.date_range('20000101', periods=800)
df = pd.DataFrame(index=dates)
df['A'] = np.cumsum(np.random.randn(800))
df['B'] = np.random.randint(-1,2,size=800)
If I do a line chart of df.A, how can I change the background color based on the values of column 'B' at that point in time?
For example, if B = 1 in that date, then background at that date is green.
If B = 0 then background that date should be yellow.
If B = -1 then background that date should be red.
Adding the workaround that I was originally thinking of doing with axvline, but @jakevdp's answer is exactly what I was looking for because it needs no for loops:
First I need to add an 'i' column as a counter, and then the whole code looks like this:
dates = pd.date_range('20000101', periods=800)
df = pd.DataFrame(index=dates)
df['A'] = np.cumsum(np.random.randn(800))
df['B'] = np.random.randint(-1, 2, size=800)
df['i'] = range(800)  # positional counter, so df.index[x] lines up with each row

# getting the rows where those values are true, with the 'i' value
zeros = df[df['B'] == 0]['i']
pos_1 = df[df['B'] == 1]['i']
neg_1 = df[df['B'] == -1]['i']

ax = df.A.plot()
for x in zeros:
    ax.axvline(df.index[x], color='y', linewidth=5, alpha=0.03)
for x in pos_1:
    ax.axvline(df.index[x], color='g', linewidth=5, alpha=0.03)
for x in neg_1:
    ax.axvline(df.index[x], color='r', linewidth=5, alpha=0.03)
You can do this with a plot command followed by pcolor() or pcolorfast(). For example, using the data you define above:
ax = df['A'].plot()
ax.pcolorfast(ax.get_xlim(), ax.get_ylim(),
              df['B'].values[np.newaxis],
              cmap='RdYlGn', alpha=0.3)
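If you would rather shade whole date ranges instead of stacking many translucent lines, a hedged alternative sketch (using matplotlib directly so axvspan can take the datetime index) could look like this:

import matplotlib.pyplot as plt

color_map = {1: 'g', 0: 'y', -1: 'r'}
fig, ax = plt.subplots()
ax.plot(df.index, df['A'])
# group consecutive dates that share the same B value and shade each run once
runs = (df['B'] != df['B'].shift()).cumsum()
for _, seg in df.groupby(runs):
    ax.axvspan(seg.index[0], seg.index[-1], color=color_map[seg['B'].iloc[0]], alpha=0.3)
plt.show()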