I have a dataframe with both positive and negative values. I would like to draw a bar chart with two bars: one showing the percentage of positive values and the other showing the percentage of negative values.
dummy = pd.DataFrame({'A' : [-4, -3, -1, 0, 1, 2, 3, 4, 5], 'B' : [-4, -3, -1, 0, 1, 2, 3, 4, 5]})
less_than_0 = dummy['A'][dummy['A'] < 0]
greater_than_0 = dummy['A'][dummy['A'] >= 0]
I am able to split the positive and negative values. I tried this using seaborn.
sns.barplot(dummy['A'])
but the positive and negative values both end up in a single bar.
I also tried this:
sns.barplot(less_than_0)
sns.barplot(greater_than_0)
Is there any way to show 2 bars, 1 for percentage of positive values and other for percentage of negative values?
This isn't the most elegant solution, but you can create a new DataFrame that contains two columns: labels that contain the labels you want to display on the x-axis of the barplot, and percentages that contain the percentages of negative and positive values.
Then you can pass these column names with the relevant information to sns.barplot as the x and y parameters.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dummy = pd.DataFrame({'A' : [-4, -3, -1, 0, 1, 2, 3, 4, 5], 'B' : [-4, -3, -1, 0, 1, 2, 3, 4, 5]})
df_percentages = pd.DataFrame({
    'labels': ['Less than 0', 'Greater than or equal to 0'],
    'percentage': [100 * x / len(dummy['A']) for x in [sum(dummy['A'] < 0), sum(dummy['A'] >= 0)]]
})
sns.barplot(x='labels', y='percentage', data=df_percentages)
plt.show()
You can use Series.value_counts with normalize=True and then plot the resulting Series:
d = {True:'Less than 0',False:'Greater than or equal to 0'}
s = (dummy['A'] < 0).value_counts(normalize=True).rename(d)
print (s)
Greater than or equal to 0 0.666667
Less than 0 0.333333
Name: A, dtype: float64
sns.barplot(x=s.index, y=s.to_numpy())
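If zero should not be counted together with the positive values, one possible variation is to classify each value by its sign first. This is only a minimal sketch: the labels 'Negative', 'Zero' and 'Positive' and the names sign_labels and pct are chosen here for illustration and are not part of the answers above.

```python
import numpy as np
import pandas as pd

dummy = pd.DataFrame({'A': [-4, -3, -1, 0, 1, 2, 3, 4, 5]})

# np.sign maps every value to -1, 0 or 1; the label names are illustrative.
sign_labels = {-1: 'Negative', 0: 'Zero', 1: 'Positive'}
signs = np.sign(dummy['A']).map(sign_labels)

# Share of each sign as a percentage of all rows, in a fixed display order.
pct = (
    signs.value_counts(normalize=True)
    .mul(100)
    .reindex(['Negative', 'Zero', 'Positive'], fill_value=0)
)
print(pct)
```

The resulting Series can be plotted the same way as before, for example with sns.barplot(x=pct.index, y=pct.to_numpy()).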
I am working with a Series in Python. What I want to achieve is to get the highest value out of every n values in the Series.
For example:
if n is 3
Series: 2, 1, 3, 5, 3, 6, 1, 6, 9
Expected Series: 3, 6, 9
I have tried the nlargest function in pandas, but it returns the largest values in descending order, and I need the values in the order of the original series.
There are various options. If the series is guaranteed to have a length of a multiple of n, you could drop down to numpy and do a .reshape followed by .max along an axis.
Otherwise, if the index is the default (0, 1, 2, ...), you can use groupby:
import pandas as pd
n = 3
ser = pd.Series([2, 1, 3, 5, 3, 6, 1, 6, 9])
out = ser.groupby(ser.index // n).max()
out:
0 3
1 6
2 9
dtype: int64
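The numpy option mentioned at the start of this answer could look roughly like the sketch below. It assumes the length of the series is an exact multiple of n; the second part is a variation (not from the original answer) that groups by position instead of by index, so it does not depend on a default index or on the length. The names ser2 and out_any_index are illustrative.

```python
import numpy as np
import pandas as pd

n = 3
ser = pd.Series([2, 1, 3, 5, 3, 6, 1, 6, 9])

# Reshape approach: only valid when len(ser) is an exact multiple of n.
assert len(ser) % n == 0
out = pd.Series(ser.to_numpy().reshape(-1, n).max(axis=1))
print(out.tolist())  # [3, 6, 9]

# Positional grouping: works for any index and any length.
ser2 = pd.Series([2, 1, 3, 5, 3, 6, 1, 6, 9, 4], index=list('abcdefghij'))
out_any_index = ser2.groupby(np.arange(len(ser2)) // n).max()
print(out_any_index.tolist())  # [3, 6, 9, 4]
```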
I am trying to calculate a point biserial correlation for a set of columns in my dataset. I am able to do it for an individual variable; however, if I try to calculate it for all the columns in one iteration, it shows an error.
Below is the code:
df = pd.DataFrame({'A':[1, 0, 1, 0, 1], 'B':[6, 7, 8, 9, 10],'C':[9, 4, 6,9,10],'D':[8,9,5,7,10]})
from scipy import stats
corr_list = {}
y = df['A'].astype(float)
for column in df:
    x = df[['B','C','D']].astype(float)
    corr = stats.pointbiserialr(x, y)
    corr_list[['B','C','D']] = corr
print(corr_list)
TypeError: No loop matching the specified signature and casting was found for ufunc add
x must be a single column, not a DataFrame. If you pass one column at a time instead of the whole DataFrame, it will work. You can try this:
df = pd.DataFrame({'A':[1, 0, 1, 0, 1], 'B':[6, 7, 8, 9, 10],'C':[9, 4, 6,9,10],'D':[8,9,5,7,10]})
print(df)
from scipy import stats
corr_list = []
y = df['A'].astype(float)
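for column in df:
    # Pass one column at a time; pointbiserialr expects 1-D inputs.
    # Note that this loop also visits 'A', so the first entry is A against itself.
    x = df[column]
    corr = stats.pointbiserialr(list(x), list(y))
    # corr[0] is the correlation coefficient, corr[1] the p-value.
    corr_list.append(corr[0])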
print(corr_list)
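By the way, you can use print(df.corr()), which gives you the correlation matrix of the DataFrame (Pearson by default, which coincides with the point biserial correlation when one of the two variables is binary).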
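To keep the p-values as well and to skip the target column itself, the loop above could be written roughly as follows. This is a minimal sketch, not the original answer's code; the names results and summary are illustrative.

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({'A': [1, 0, 1, 0, 1], 'B': [6, 7, 8, 9, 10],
                   'C': [9, 4, 6, 9, 10], 'D': [8, 9, 5, 7, 10]})

y = df['A']  # the binary variable
results = {}
for column in ['B', 'C', 'D']:
    # pointbiserialr returns the correlation first and the p-value second.
    r, p = stats.pointbiserialr(y, df[column])
    results[column] = {'r': r, 'p_value': p}

# One row per column, with the correlation and its p-value side by side.
summary = pd.DataFrame(results).T
print(summary)
```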
You can use the pd.DataFrame.corrwith() function:
df[['B', 'C', 'D']].corrwith(df['A'].astype('float'), method=stats.pointbiserialr)
The output lists each column with its correlation (row 0) and p-value (row 1) against the target DataFrame or Series:
              B         C         D
0  4.547937e-18  0.400066 -0.094916
1  1.000000e+00  0.504554  0.879331
Is there a built-in function (numpy or pandas I'm thinking) that would help combine multiple rows of one column in a dataframe, keeping the same dimensions, but different scale? Also, combined with that, summing the values from a different column between the intervals? Or is it something I just need to build from scratch? Example below, I'm not sure exactly how to ask. This would need to be scalable; the example is simple, in reality I'm working with a 250 dim array and theoretically unlimited rows.
Ex:
import pandas as pd
import numpy as np
#Creating DF
df = pd.DataFrame([[[-2,-1,0,1,2],[-10,-5,5,5,-10]],
[[-.5,.5,1.5,2.5,3.5],[-3,-2,0,-2,-3]]])
output:
                            0                     1
0           [-2, -1, 0, 1, 2]  [-10, -5, 5, 5, -10]
1  [-0.5, 0.5, 1.5, 2.5, 3.5]   [-3, -2, 0, -2, -3]
where the answer is [-2,-0.625,0.75,2.125,3.5] (column0 combined with dim 5) , [-10,-5,0,-5,-5] (sum of column1 between steps of column0 where (interval-1) < x<=interval)
answer = pd.DataFrame([[[-2,-.625,.75,2.125,3.5],[-10,-5,0,-5,-5]]])
This is an example dataframe:
import pandas as pd
import numpy as np
values = np.array([
[0, 1, 2, 0, 0, 4],
[1, 0, 0, 1, 1, 0 ],
[0, 4, 0, 0, 2, 1],
[2, 0, 2, 0, 4, 0],
])
indexes= 'a','b','c','d'
columns='ab','bc','cd','de','ef','fg'
df = pd.DataFrame(index=indexes,columns=columns, data=values)
print(df)
From this dataframe I need to create a series of pie charts, one for every column, shown on the same figure, where every slice has the same fixed size (100/len(indexes)) and the color of each slice depends on the value in that row: white if 0, green if 1, yellow if 2, red if 4.
What suggestions can you give me?
I found that:
df.plot(kind='pie', subplots=True, figsize=(len(columns)*2, 2))
It creates a series of pie charts, but I can't control the input values.
I've created a pie for a column, but then I wasn't able to link the color to the value of index:
labels = indexes
sizes = np.linspace(100/len(labels),100/len(labels), num=len(labels))
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels)
ax1.axis('equal')
plt.show()
ImportanceOfBeingErnest's answer helped me give the pie chart the look I wanted:
fig1, ax1 = plt.subplots()
labels = indexes
sizes = np.linspace(100/len(labels),100/len(labels), num=len(labels))
coldic = {0 : "w", 1 : "g", 2 : "y", 4 : "r" }
colors = [coldic[v] for v in values[:,0]]
ax1.pie(sizes, labels=labels, colors=colors,counterclock=False, startangle=90)
ax1.axis('equal')
plt.show()
Now the colors are linked to the values, and the sizes of the slices are fixed. I just need the same pie chart for all the columns, in the same image.
The importance of these charts is given by the colors, not the dimensions of the slices, which I want to be always equal.
Thanks for your time!
Without relying on pandas' internal plotting functions (which are of course limited), one can use matplotlib's pie function to plot the diagrams.
The colors can be set as a list, which is generated from the values according to some mapping dictionary.
import numpy as np
import matplotlib.pyplot as plt
coldic = {0 : "w", 1 : "g", 2 : "y", 4 : "r" }
values = np.array([
[0, 1, 2, 0, 0, 4],
[1, 0, 0, 1, 1, 0 ],
[0, 4, 0, 0, 2, 1],
[2, 0, 2, 0, 4, 0],
])
labels= ['a','b','c','d']
fig1, axes = plt.subplots(ncols=values.shape[1], )
for i in range(values.shape[1]):
    colors = [coldic[v] for v in values[:,i]]
    labs = [l if values[j,i] > 0 else "" for j, l in enumerate(labels)]
    axes[i].pie(values[:,i], labels=labs, colors=colors)
    axes[i].set_aspect("equal")
plt.show()
For fixed wedge sizes, just pass a constant array to pie.
import numpy as np
import matplotlib.pyplot as plt
coldic = {0 : "w", 1 : "g", 2 : "y", 4 : "r" }
values = np.array([
[0, 1, 2, 0, 0, 4],
[1, 0, 0, 1, 1, 0 ],
[0, 4, 0, 0, 2, 1],
[2, 0, 2, 0, 4, 0],
])
labels= ['a','b','c','d']
fig1, axes = plt.subplots(ncols=values.shape[1], )
for i in range(values.shape[1]):
    colors = [coldic[v] for v in values[:,i]]
    axes[i].pie(np.ones(values.shape[0]), labels=labels, colors=colors,
                wedgeprops=dict(linewidth=1, edgecolor="k"))
    axes[i].set_aspect("equal")
    axes[i].set_title("".join(list(map(str,values[:,i]))))
plt.show()
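The same idea can be driven directly from the DataFrame in the question, so that each pie is titled with its column name. The sketch below is one way to arrange it; the helper names colors_per_column and draw_pies are hypothetical, and the styling arguments simply reuse the ones shown above.

```python
import numpy as np
import pandas as pd

coldic = {0: "w", 1: "g", 2: "y", 4: "r"}

df = pd.DataFrame(
    data=np.array([
        [0, 1, 2, 0, 0, 4],
        [1, 0, 0, 1, 1, 0],
        [0, 4, 0, 0, 2, 1],
        [2, 0, 2, 0, 4, 0],
    ]),
    index=['a', 'b', 'c', 'd'],
    columns=['ab', 'bc', 'cd', 'de', 'ef', 'fg'],
)

def colors_per_column(df, mapping):
    """Return {column name: list of slice colors}, one color per row."""
    return {col: [mapping[v] for v in df[col]] for col in df.columns}

def draw_pies(df, mapping):
    """Draw one equal-slice pie per column on a single figure."""
    import matplotlib.pyplot as plt  # only needed when actually plotting
    colors = colors_per_column(df, mapping)
    fig, axes = plt.subplots(ncols=len(df.columns),
                             figsize=(2 * len(df.columns), 2))
    for ax, col in zip(np.atleast_1d(axes), df.columns):
        # A constant array gives every slice the same size.
        ax.pie(np.ones(len(df.index)), labels=list(df.index),
               colors=colors[col], counterclock=False, startangle=90,
               wedgeprops=dict(linewidth=1, edgecolor="k"))
        ax.set_title(col)
        ax.set_aspect("equal")
    return fig

print(colors_per_column(df, coldic)['ab'])  # ['w', 'g', 'w', 'y']
```

Calling draw_pies(df, coldic) followed by plt.show() would then display all the pies on one figure.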
This paper has a nice way of visualizing clusters of a dataset with binary features by plotting a 2D matrix and sorting the values according to a cluster.
In this case, there are three clusters, as indicated by the black dividing lines; the rows are sorted, and show which examples are in each cluster, and the columns are the features of each example.
Given a vector of cluster assignments and a pandas DataFrame, how can I replicate this using a Python library (e.g. seaborn)? Plotting a DataFrame using seaborn isn't difficult, nor is sorting the rows of the DataFrame to align with the cluster assignments. What I am most interested in is how to display those black dividing lines which delineate each cluster.
Dummy data:
"""
col1 col2
x1_c0 0 1
x2_c0 0 1
================= I want a line drawn here
x3_c1 1 0
================= and here
x4_c2 1 0
"""
import pandas as pd
import seaborn as sns
df = pd.DataFrame(
data={'col1': [0, 0, 1, 1], 'col2': [1, 1, 0, 0]},
index=['x1_c0', 'x2_c0', 'x3_c1', 'x4_c2']
)
clus = [0, 0, 1, 2] # This is the cluster assignment
sns.heatmap(df)
The link that mwaskom posted in a comment is a good starting place. The trick is figuring out what the coordinates are for the vertical and horizontal lines.
To illustrate what the code is actually doing, it's worthwhile to just plot all of the lines individually:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df = pd.DataFrame(data={'col1': [0, 0, 1, 1], 'col2': [1, 1, 0, 0]},
                  index=['x1_c0', 'x2_c0', 'x3_c1', 'x4_c2'])
f, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(df, ax=ax)
# One vertical line between the two columns, spanning the full height
ax.axvline(1, 0, 1, linewidth=3, c='w')
# One horizontal line between each pair of adjacent rows
ax.axhline(1, 0, 1, linewidth=3, c='w')
ax.axhline(2, 0, 1, linewidth=3, c='w')
ax.axhline(3, 0, 1, linewidth=3, c='w')
f.tight_layout()
The way the axvline method works is that the first argument is the x location of the line, followed by the lower and upper bounds of the line, given as fractions of the axes height between 0 and 1 (in this case 1, 0, 1). The horizontal line takes the y location and then the x start and x stop of the line, again as fractions of the axes. The defaults draw the line across the entire plot, so you can typically leave those out.
The code above draws a line between every row and every column of the dataframe. If you want to create groups for the heatmap, you will want to create an index in your data frame, or some other list of values to loop through. For instance, with a more complicated example using code from this example:
df = pd.DataFrame(data={'col1': [0, 0, 1, 1, 1.5], 'col2': [1, 1, 0, 0, 2]},
index=['x1_c0', 'x2_c0', 'x3_c1', 'x4_c2', 'x5_c2'])
df['id_'] = df.index
df['group'] = [1, 2, 2, 3, 3]
df.set_index(['group', 'id_'], inplace=True)
df
             col1  col2
group id_
1     x1_c0   0.0     1
2     x2_c0   0.0     1
      x3_c1   1.0     0
3     x4_c2   1.0     0
      x5_c2   1.5     2
Then plot the heatmap with the groups:
f, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(df)
groups = df.index.get_level_values(0)
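for i, group in enumerate(groups):
    # Draw a line wherever the group label changes from the previous row.
    if i and group != groups[i - 1]:
        # len(groups) - i assumes y is counted from the bottom row up; if your
        # seaborn version inverts the heatmap's y axis, use i here instead.
        ax.axhline(len(groups) - i, c="w", linewidth=3)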
ax.axvline(1, c="w", linewidth=3)
f.tight_layout()
Because your heatmap is not symmetric, you may need to use a separate for loop for the columns.
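Starting from the cluster assignment vector in the question rather than a MultiIndex, the line positions can also be computed separately from the plotting. The sketch below assumes the rows are already sorted by cluster and that row i of the heatmap spans y = i to y = i + 1 counted from the top, which is how recent seaborn releases lay it out; with an older layout the positions would need to be flipped. The helper names cluster_boundaries and draw_cluster_lines are hypothetical.

```python
import numpy as np

def cluster_boundaries(clus):
    """Return the row positions at which the cluster label changes.

    Assumes the rows are already sorted so that each cluster is contiguous.
    """
    arr = np.asarray(clus)
    # Compare every label with its predecessor; +1 converts to a row position.
    return (np.flatnonzero(arr[1:] != arr[:-1]) + 1).tolist()

def draw_cluster_lines(ax, boundaries, **line_kwargs):
    """Draw one horizontal line per boundary on an existing Axes."""
    for y in boundaries:
        ax.axhline(y, **line_kwargs)

clus = [0, 0, 1, 2]  # cluster assignment from the question
print(cluster_boundaries(clus))  # [2, 3]

# Typical use (requires seaborn and a DataFrame df sorted by cluster):
# ax = sns.heatmap(df)
# draw_cluster_lines(ax, cluster_boundaries(clus), color="k", linewidth=3)
```

Keeping the boundary computation in its own function makes it easy to check the positions before drawing anything.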