python pandas histogram plot including NaN values - python

I want to draw a histogram of some data. Sorry that I cannot attach a sample histogram; I don't have enough reputation, so I hope my description of the problem is clear. I am using pandas, and it appears that any NaN value is treated as 0 by pandas. Is there a way to include the count of NaN values in the histogram? That is, the x-axis should have an entry for NaN as well. Thank you very much.

I was looking for the same thing. I ended up with the following solution:
import numpy
import pandas
import matplotlib.pyplot as plt
figure = plt.figure(figsize=(6, 9), dpi=100)
graph = figure.add_subplot(111)
# dropna=False is required; otherwise the NaN count is not part of freq at all
freq = pandas.Series(data).value_counts(dropna=False)
present = freq[freq.index.notna()]
graph.bar(present.index, present.values)  # bars for the non-missing values
graph.bar([0], [freq[freq.index.isna()].sum()])  # bar at x=0 for the number of missing values
figure.show()
This gave me a histogram with a column at 0 showing the number of missing values in the data.

Did you try replacing NaN with some other unique value and then plot the histogram?
x = -1  # some unique value that does not occur in the data
plt.hist(df.replace(np.nan, x))
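A minimal sketch of the sentinel idea, assuming a small hypothetical Series `values` and a sentinel of -1 that never occurs in the real data. It uses `np.histogram` so the counts can be inspected without drawing anything; the bin edges are chosen so that each integer gets its own bin.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing entries.
values = pd.Series([1.0, 2.0, 2.0, 4.0, 3.0, np.nan, np.nan])

# Assumption: -1 never appears as a genuine value in the data.
sentinel = -1.0
filled = values.fillna(sentinel)

# One bin per integer from -1 to 4, each centered on the integer.
edges = np.arange(-1.5, 5.5, 1.0)
hist, _ = np.histogram(filled, bins=edges)

# The first bin holds the sentinel, i.e. the number of missing values.
print(hist)
```

Passing `filled` and the same `edges` to `plt.hist` should draw the matching plot, with the bar at -1 standing in for NaN.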

As pointed out by Sreeram TP, it is possible to use the argument dropna=False in the function value_counts to include the counts of NaNs.
df = pd.DataFrame({'feature1': [1, 2, 2, 4, 3, 2, 3, 4, np.nan],
                   'feature2': [4, 4, 3, 4, 1, 4, 3, np.nan, np.nan]})
# Calculates the histogram for feature1
counts = df['feature1'].value_counts(dropna=False)
counts.plot.bar(title='feat1', grid=True)
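As a rough sketch of what `dropna=False` produces, assuming a hypothetical Series named `values`: the NaN entry becomes an ordinary row of the result, and relabeling it as the string "NaN" gives a readable tick label.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing entries.
values = pd.Series([1, 2, 2, 4, 3, 2, 3, np.nan, np.nan])

# dropna=False keeps NaN as its own row in the result.
counts = values.value_counts(dropna=False)

# Relabel the index so NaN becomes a readable string label,
# then order the rows by label instead of by frequency.
counts.index = ["NaN" if pd.isna(v) else str(int(v)) for v in counts.index]
counts = counts.sort_index()

print(counts)
```

Calling `counts.plot.bar()` on this result should then draw one bar per label, including the "NaN" bar.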

By using .iloc[::-1] on the output of value_counts(), you can reverse its order.
The code would look like this:
df["column"].value_counts().iloc[::-1]

Related

Is this an error in the seaborn.lineplot hue parameter?

With this code snippet, I'm expecting a line plot with one line per hue, which has these distinct values: [1, 5, 10, 20, 40].
import math
import pandas as pd
import seaborn as sns
sns.set(style="whitegrid")
TANH_SCALING = [1, 5, 10, 20, 40]
X_VALUES = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
COLUMNS = ['x', 'y', 'hue group']
tanh_df = pd.DataFrame(columns=COLUMNS)
for sc in TANH_SCALING:
    data = {
        COLUMNS[0]: X_VALUES,
        COLUMNS[1]: [math.tanh(x/sc) for x in X_VALUES],
        COLUMNS[2]: len(X_VALUES)*[sc]}
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    tanh_df = pd.concat(
        [tanh_df, pd.DataFrame(data=data, columns=COLUMNS)],
        ignore_index=True
    )
sns.lineplot(x=COLUMNS[0], y=COLUMNS[1], hue=COLUMNS[2], data=tanh_df)
However, what I get is a hue legend with values [0, 15, 30, 45], and an additional line, like so:
Is this a bug or am I missing something obvious?
This is a known bug of seaborn when the hue can be cast to integers. You could add a prefix to the hue so casting to integers fails:
for sc in TANH_SCALING:
    data = {
        COLUMNS[0]: X_VALUES,
        COLUMNS[1]: [math.tanh(x/sc) for x in X_VALUES],
        COLUMNS[2]: len(X_VALUES)*[f'A{sc}']}  # changes here
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    tanh_df = pd.concat(
        [tanh_df, pd.DataFrame(data=data, columns=COLUMNS)],
        ignore_index=True
    )
Or after you created your data:
# data creation (unchanged numeric hue values)
for sc in TANH_SCALING:
    data = {
        COLUMNS[0]: X_VALUES,
        COLUMNS[1]: [math.tanh(x/sc) for x in X_VALUES],
        COLUMNS[2]: len(X_VALUES)*[sc]}
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    tanh_df = pd.concat(
        [tanh_df, pd.DataFrame(data=data, columns=COLUMNS)],
        ignore_index=True
    )
# hue manipulation
sns.lineplot(x=COLUMNS[0], y=COLUMNS[1],
             hue='A_' + tanh_df[COLUMNS[2]].astype(str),  # change hue here
             data=tanh_df)
As @LudvigH's comment on the other answer says, this isn't a bug, even if the default behavior is surprising in this case. As explained in the docs:
The default treatment of the hue (and to a lesser extent, size) semantic, if present, depends on whether the variable is inferred to represent “numeric” or “categorical” data. In particular, numeric variables are represented with a sequential colormap by default, and the legend entries show regular “ticks” with values that may or may not exist in the data. This behavior can be controlled through various parameters, as described and illustrated below.
Here are two specific ways to control the behavior.
If you want to keep the numeric color mapping but have the legend show the exact values in your data, set legend="full":
sns.lineplot(x=COLUMNS[0], y=COLUMNS[1], hue=COLUMNS[2], data=tanh_df, legend="full")
If you want to have seaborn treat the levels of the hue parameter as discrete categorical values, pass a named categorical colormap or either a list or dictionary of the specific colors you want to use:
sns.lineplot(x=COLUMNS[0], y=COLUMNS[1], hue=COLUMNS[2], data=tanh_df, palette="deep")
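Another option along the same lines is to change the dtype of the hue column so that it can no longer be inferred as numeric. The sketch below only builds the data (the frame is assembled with a single `pd.concat` call, which is my choice here and not part of the answers above) and converts the hue column to strings; it assumes, as the documentation quoted above suggests, that non-numeric values are treated as categorical.

```python
import math
import pandas as pd

TANH_SCALING = [1, 5, 10, 20, 40]
X_VALUES = list(range(16))

# Build one frame per scaling factor and concatenate them once at the end.
frames = [
    pd.DataFrame({
        "x": X_VALUES,
        "y": [math.tanh(x / sc) for x in X_VALUES],
        "hue group": sc,  # scalar is broadcast to every row
    })
    for sc in TANH_SCALING
]
tanh_df = pd.concat(frames, ignore_index=True)

# Strings cannot be inferred as numeric, so each level is a separate category.
tanh_df["hue group"] = tanh_df["hue group"].astype(str)

print(tanh_df["hue group"].unique())
```

Passing this frame to `sns.lineplot(x="x", y="y", hue="hue group", data=tanh_df)` should then give one line and one legend entry per scaling factor.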

Plot paired data using seaborn

Is there a way to make a scatterplot using seaborn where the (x, y) values are paired? I'd like the x-axis to represent values under condition A, and the y-axis to represent values under condition B. Concretely, suppose that the x-axis is patient weight before treatment and the y-axis is patient weight after treatment. My data is formatted like the following:
df = pd.DataFrame(
    {'n': [1, 2, 3, 1, 2, 3],
     'treatment': ['before', 'before', 'before', 'after', 'after', 'after'],
     'weight': np.random.rand(6)})
   n treatment    weight
0  1    before  0.431438
1  2    before  0.053631
2  3    before  0.567058
3  1     after  0.324254
4  2     after  0.624151
5  3     after  0.519498
I think this qualifies as tidy data but the single variable I want to plot is weight. All the documentation I see for seaborn shows examples for paired data like plotting variable x of each item against variable y of each item. For example, sepal_length versus sepal_width. But how might I plot x versus y where my x and y are coming from the same column?
Is the solution to reformat my data so that I have a weight_before and a weight_after column? If so, can you provide the cleanest way to modify the data using pandas? I know I can do something like the following, but I feel like it's not great syntax.
df['weight_before'] = df['weight']
df.loc[df['treatment'] != 'before', 'weight_before'] = np.nan
# and similar for df['weight_after']
If I understand your question correctly, this might work for you:
sns.lmplot(data=df.pivot(index='n', columns='treatment', values='weight'),
x='before', y='after', fit_reg=False)
Another way of doing it:
Pivot the dataframe:
df2 = pd.pivot_table(df, index='n', columns='treatment', values='weight', aggfunc='sum')
df2.reset_index(drop=True, inplace=True)
Plot the scatter:
ax = sns.scatterplot(x="before", y="after", data=df2)
Chained solution:
ax = sns.scatterplot(x="before", y="after", data=pd.pivot_table(df, index='n', columns='treatment', values='weight', aggfunc='sum').reset_index(drop=True))
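To make the reshaping step concrete, here is a small sketch with fixed, made-up weights instead of random ones. It only shows what the pivoted frame looks like before it is handed to a plotting function.

```python
import pandas as pd

# Made-up weights, used only so the result is reproducible.
df = pd.DataFrame({
    "n": [1, 2, 3, 1, 2, 3],
    "treatment": ["before", "before", "before", "after", "after", "after"],
    "weight": [0.43, 0.05, 0.57, 0.32, 0.62, 0.52],
})

# One row per patient, one column per treatment condition.
wide = df.pivot(index="n", columns="treatment", values="weight")
wide.columns.name = None  # drop the leftover "treatment" label

print(wide)
```

Each row of `wide` is now one (before, after) pair, which is the shape that `x="before", y="after"` expects.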

3D surface plot never shows any data

The 3D surface plot in plotly never shows the data. The plot itself shows up, but nothing appears in it, as if I had plotted an empty DataFrame.
At first, I tried something like the solution I found here (Plotly Plot surface 3D not displayed), but had the same result: another plot with no data.
df3 = pd.DataFrame({'x':[1, 2, 3, 4, 5],'y':[10, 20, 30, 40, 50],'z': [5, 4, 3, 2, 1]})
iplot(dict(data=[Surface(x=df3['x'], y=df3['y'], z=df3['z'])]))
So I tried the code on the plotly website (the first cell of this notebook: https://plot.ly/python/3d-scatter-plots/), exactly as it is there, just to see whether their example worked, but I get an error.
I am getting this:
https://lh3.googleusercontent.com/sOxRsIDLVkBGKTksUfVqm3HtaSQAN_ybQq2HLA-aclzEU-9ekmvd1ETdfsswC2SdbysizOI=s151
But I should get this:
https://lh3.googleusercontent.com/5Hy2Z-97_vwd3ftKBA6dYZfikJHnA-UMEjd3PHvEvdBzw2m2zeEHBtneLC1jzO3RmE2lyw=s151
Note: I could not post the images themselves because of lack of reputation.
In order to plot a surface you have to provide a value for each point. In this case your x and y are series of size 5, that means that your z should have a shape (5, 5).
If I had a bit more info I could give you more details but for a minimal working example try to pass a (5, 5) dataframe, numpy array or even a list of lists to the z value of data.
EDIT:
In a notebook environment the following code works for me:
from plotly import offline
from plotly import graph_objs as go
offline.init_notebook_mode(connected=False)
df3 = {'x':[1, 2, 3, 4, 5],'y':[10, 20, 30, 40, 50],'z': [[5, 4, 3, 2, 1]]*5}
offline.iplot(dict(data=[go.Surface(x=df3['x'], y=df3['y'], z=df3['z'])]))
I'm using plotly 3.7.0.
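A sketch of one way to build a z grid of the right shape with NumPy. The height function below is hypothetical and chosen only for illustration; the point is that `z` ends up with one value per (y, x) combination.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([10, 20, 30, 40, 50])

# meshgrid returns 2-D arrays of shape (len(y), len(x)).
xx, yy = np.meshgrid(x, y)

# Hypothetical height function, used only to fill the grid.
z = np.sin(xx) + yy / 10.0

print(z.shape)
```

The resulting array can be passed as `go.Surface(x=x, y=y, z=z)` in place of the repeated list used above.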

how to find mean for mixed categorical variables in pandas dataframe?

I have a survey dataset about how people of different ages use various social media platforms. I want to calculate averages by social media app usage (see the update below for details).
Here is a reproducible pandas DataFrame:
df=pd.DataFrame({'age': np.random.randint(10,100,size=10),
'web1a': np.random.choice([1, 2], size=(10,)),
'web1b': np.random.choice([1, 2], size=(10,), p=[1./3, 2./3]),
'web1c': np.random.choice([1, 2], size=(10,)),
'web1d': np.random.choice([1, 2], size=(10,))})
here is what I tried:
df.pivot_table(df, values='web1a', index='age', aggfunc='mean')
but it is not efficient and didn't produce my desired output. Any idea how to get this done? Thanks.
Update:
I think the way to do this is to first select the categorical values in each column and compute the mean for each of them, repeating the same step for the other columns. If I do that, how can I plot the results nicely?
Note that in columns web1a, web1b, web1c and web1d, 1 means user and 2 means non-user. I want to compute the average age of the users and of the non-users. How can I do that? Thanks!
Using melt and groupby (the level argument of mean was removed in pandas 2.0):
df.melt('age').groupby(['variable', 'value'])['age'].mean().unstack().plot(kind='bar')
This can be done using groupby method:
df.groupby(['web1a', 'web1b', 'web1c', 'web1d']).mean()
You can groupby the 'web*' columns and calculate the mean on the 'age' column.
You can also plot bar charts (colors can be defined in the subplot). I'm not sure pie charts make sense in this case.
I tried with your data, taking only the columns starting with 'web'. There are more values than 1s and 2s, so I assumed you only wanted to analyze the users and non-users and nothing else. You can change the values or add other values in the chart in the same way, as long as you know what values you want to draw.
df = df.filter(regex='web|age', axis=1)
userNr = 1      # code for "user" (use '1' if your columns hold strings)
nonUserNr = 2   # code for "non-user"
users = []
nonUsers = []
labels = [x for x in df.columns if 'web' in x]
for col in labels:
    mean_age = df.groupby(col)['age'].mean()  # mean age per answer code
    users.append(mean_age.loc[userNr])
    nonUsers.append(mean_age.loc[nonUserNr])
from matplotlib import pyplot as plt
x = np.arange(1, len(labels)+1)
ax = plt.subplot(111)
ax.bar(x-0.1, users, width=0.2,color='g')
ax.bar(x+0.1,nonUsers, width=0.2,color='r')
plt.xticks(x, labels)
plt.legend(['users','non-users'])
plt.show()
df.melt(id_vars='age').groupby(['variable', 'value']).mean()
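To show what the melt-then-groupby approach yields, here is a sketch on a tiny hand-made frame with only two platform columns. The column names `platform` and `status` and the label mapping are my own choices for readability, not part of the original data.

```python
import pandas as pd

# Tiny hand-made survey: 1 = user, 2 = non-user.
df = pd.DataFrame({
    "age":   [20, 30, 40, 50],
    "web1a": [1, 1, 2, 2],
    "web1b": [1, 2, 1, 2],
})

# Long format: one row per (respondent, platform) pair.
long = df.melt(id_vars="age", var_name="platform", value_name="status")
long["status"] = long["status"].map({1: "user", 2: "non-user"})

# Mean age per platform and usage status, with statuses as columns.
mean_age = long.groupby(["platform", "status"])["age"].mean().unstack()

print(mean_age)
```

`mean_age.plot(kind="bar")` should then draw a grouped bar chart with one group per platform.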

Why does Pandas qcut give me unequal sized bins?

Pandas docs have this to say about the qcut function:
Discretize variable into equal-sized buckets based on rank or based on sample quantiles.
So I would expect this code to give me 4 bins of 10 values each:
import numpy as np
import pandas as pd
np.random.seed(4242)
y = pd.Series(np.random.randint(low=1, high=10, size=40))
quartiles = pd.qcut(y, 4, labels=['1st', '2nd', '3rd', '4th'])
print('Quartiles:')
print(quartiles.value_counts(sort=False))
y.groupby(quartiles).agg(['count', 'mean']).plot(kind='bar');
But instead I get this:
Quartiles:
1st 14
2nd 6
3rd 11
4th 9
dtype: int64
What am I doing wrong here?
This happens because of ties at the bin boundaries: qcut assigns records to bins by value, so records that share a value always land in the same bin, even when a quartile boundary falls in the middle of them. A simple adjustment to your code will produce the desired result:
import numpy as np
import pandas as pd
np.random.seed(4242)
y = pd.Series(np.random.randint(low=1, high=10, size=40))
quartiles = pd.qcut(y.rank(method = 'first'), 4, labels=['1st', '2nd', '3rd', '4th'])
print('Quartiles:')
print(quartiles.value_counts(sort=False))
y.groupby(quartiles).agg(['count', 'mean']).plot(kind='bar');
Ranking the data first with rank() gives qcut an unambiguous way to handle records that would otherwise sit on a bin boundary. In this case I've used method='first', which gives every record a distinct rank by breaking ties in order of appearance.
The output I get is as follows:
Quartiles:
1st 10
2nd 10
3rd 10
4th 10
dtype: int64
Looking at the boundaries of the bins highlights the problem described in the comments.
boundaries = [1, 2, 3.5, 6, 9]
These boundaries are correct. pandas first computes the quantile values (inside qcut) and then puts the samples into the bins. The run of 2s straddles the boundary of the first quartile, so all of them end up in the first bin.
The third boundary is 3.5 because the value just below the median is a 3 and the value just above it is a 4; pandas' quantile function is called so that the boundary lies between the two neighboring values.
In conclusion: quantiles become more meaningful as the number of samples grows, because more values are available to fix the boundaries.
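The effect can be reproduced on a small hand-made series, which avoids relying on a random seed. This sketch uses `retbins=True` to expose the edges that qcut computes, then compares the bin sizes with and without ranking first.

```python
import pandas as pd

# Twelve values with a run of tied 2s near the first quartile.
y = pd.Series([1, 2, 2, 2, 2, 3, 4, 5, 6, 7, 8, 9])

# retbins=True also returns the quantile edges used for binning.
binned, edges = pd.qcut(y, 4, retbins=True)
counts = binned.value_counts().sort_index()

# Ranking with method="first" breaks ties by order of appearance.
ranked = pd.qcut(y.rank(method="first"), 4)
ranked_counts = ranked.value_counts().sort_index()

print(edges)
print(counts.tolist())         # [5, 1, 3, 3]
print(ranked_counts.tolist())  # [3, 3, 3, 3]
```

All four 2s fall into the first bin because they share a value, while the ranked version splits them across the boundary.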
