I would like to plot a heatmap from a DataFrame in pandas. The data looks like
df = pd.DataFrame({"A": np.random.random(100), "B": np.random.random(100), "C": np.random.random(100)})
To show how C changes as a function of A and B, I want to bin the data on A and B and calculate the average value of C in each bin. The heatmap should then have A and B as its X-axis and Y-axis, with the color indicating the corresponding mean C value.
I tried to use seaborn.heatmap, but the function expects data that is already arranged as a 2-D grid, which means I have to bin the data first.
Is there a way to directly generate what I want from the DataFrame?
If not, how can I bin DataFrame into 2-D grids?
I know pandas.cut can do the trick, but it seems to be able to cut on only one column at a time. Of course I could write a tedious function that chains two cuts, but I am wondering if there is a simpler way to do the task.
A scatter plot can give similar results, but I want a heatmap, something like this, not this.
Something like this?
>>> df.groupby([pd.cut(df.A, 3), pd.cut(df.B, 3)]).C.mean().unstack()
B                  (0.00223, 0.335]  (0.335, 0.666]  (0.666, 0.998]
A
(0.000763, 0.334]          0.579832        0.454004        0.349740
(0.334, 0.667]             0.587145        0.677880        0.559560
(0.667, 1]                 0.566409        0.496061        0.420541
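A self-contained sketch of the same two-way binning, assuming three equal-width bins per axis (the bin count, the seed and the variable names are arbitrary choices, not from the question). Passing observed=False keeps bins that happen to be empty, so the grid stays rectangular.

```python
import numpy as np
import pandas as pd

# Reproducible stand-in for the DataFrame in the question
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.random(100), "B": rng.random(100), "C": rng.random(100)})

n_bins = 3  # assumed bin count per axis

# Bin A and B independently, then group on both binnings at once
a_bins = pd.cut(df["A"], n_bins)
b_bins = pd.cut(df["B"], n_bins)

# Mean of C per (A-bin, B-bin) cell; unstack turns the B bins into columns
grid = df.groupby([a_bins, b_bins], observed=False)["C"].mean().unstack()
print(grid)
```

The resulting grid is an ordinary 2-D DataFrame, so it can be handed to seaborn.heatmap(grid) directly.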
I'm working with a DataFrame in which one column contains a np.array per row (in this case, the mean waveform of brain recordings over time). I want to calculate the Pearson correlation between the entries of this column (array by array).
This is my code
import numpy as np
from scipy import stats
length = len(df.Mean)
Mean = []
for i in range(len(df.Mean)):
    Mean.append(df.Mean[i])
Correlation_p = np.zeros((length, length))
P_Value_p = np.zeros((length, length))
for i in range(length):
    for j in range(length):
        # pearsonr returns the correlation coefficient and its p-value
        Correlation_p[i][j], P_Value_p[i][j] = stats.pearsonr(df.Mean[i], df.Mean[j])
This works, but I want to know if there is a more Pythonic way to do it, maybe using df.corr(). I tried that but could not work out how to apply it here.
EDIT: the output of df.Mean.head()
0 [-0.2559348091247745, 0.02743063113723536, 0.3...
1 [-0.37025615099744325, -0.11299328141596175, 0...
2 [-1.0543681894876467, -0.8452798699354909, -0....
3 [-0.23527437766943646, -0.28657810260136585, -...
4 [0.45557980303095674, 0.6055674269814991, 0.74...
Name: Mean, dtype: object
If I am not mistaken, the arrays that you would like to correlate each sit in a single cell of the DataFrame. The following brings them into a format where each array occupies its own column.
I made an example dataset that resembles the format of df.Mean.head():
df = pd.DataFrame({'x':[np.random.randint(0,5,10), np.random.randint(0,5,10), np.random.randint(0,5,10)]})
You can turn these arrays into columns using this:
df = pd.DataFrame(np.array(df['x'].tolist()).transpose())
Adapt the reshaping (here a transpose) to your own dimensions.
From there, it would be fairly straightforward.
A correlation matrix can be created by:
df.corr()
A visualization of the correlation matrix:
import matplotlib.pyplot as plt
plt.matshow(df.corr())
plt.show()
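As a minimal sketch of the whole route, assuming every cell of the column holds an array of the same length (the random data below is only a stand-in for df.Mean): stacking the cells gives a 2-D array, and both np.corrcoef and DataFrame.corr then return the full Pearson matrix without an explicit double loop.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df.Mean: one equal-length waveform per row
rng = np.random.default_rng(0)
df = pd.DataFrame({"Mean": [rng.normal(size=50) for _ in range(4)]})

# Stack the cells into a 2-D array of shape (n_rows, n_samples)
waves = np.vstack(df["Mean"].tolist())

# np.corrcoef treats each row as one variable
corr_np = np.corrcoef(waves)

# Same result with pandas: one column per waveform, then .corr()
corr_pd = pd.DataFrame(waves.T).corr()

print(corr_pd.round(3))
```

Neither call returns p-values; if those are needed, the scipy.stats.pearsonr loop from the question still applies.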
I have a survey dataset on how people of different ages use various social media platforms. I want to calculate averages broken down by social media app usage. Here is what the example data looks like:
Here is a reproducible pandas DataFrame:
df=pd.DataFrame({'age': np.random.randint(10,100,size=10),
                 'web1a': np.random.choice([1, 2], size=(10,)),
                 'web1b': np.random.choice([1, 2], size=(10,), p=[1./3, 2./3]),
                 'web1c': np.random.choice([1, 2], size=(10,)),
                 'web1d': np.random.choice([1, 2], size=(10,))})
here is what I tried:
df.pivot_table(df, values='web1a', index='age', aggfunc='mean')
But it is not efficient and did not produce my desired output. Any idea how to get this done? Thanks
Update:
For me, the way to do this would be to first select the categorical values in each column and compute the mean for each, then do the same for the other columns. If I do that, how can I plot the results nicely?
Note that in columns web1a, web1b, web1c and web1d, 1 means user and 2 means non-user. I want to compute the average age of users and non-users. How can I do that? Can anyone give me an idea of how to make this happen? Thanks!
Using
df.melt('age').set_index(['variable','value']).groupby(level=[0,1]).mean().unstack().plot(kind='bar')
This can be done using groupby method:
df.groupby(['web1a', 'web1b', 'web1c', 'web1d']).mean()
You can groupby the 'web*' columns and calculate the mean on the 'age' column.
You can also plot bar charts (colors can be defined in the subplot). I'm not sure pie charts make sense in this case.
I tried this with your data, taking only the columns starting with 'web'. There are more values than '1' and '2', so I assumed you only wanted to analyze the users and non-users and nothing else. You can change the values or add other values to the chart in the same way, as long as you know which values you want to draw.
import numpy as np
from matplotlib import pyplot as plt
df = df.filter(regex=('web|age'),axis=1)
userNr = '1'  # use 1 and 2 instead if the columns hold integers
nonUserNr = '2'
users = list()
nonUsers = list()
labels = [x for x in df.columns.tolist() if 'web' in x]
for col in labels:
    # mean age for each value found in this column
    means = df.groupby(col)['age'].mean()
    users.append(means.loc[userNr])
    nonUsers.append(means.loc[nonUserNr])
x = np.arange(1, len(labels)+1)
ax = plt.subplot(111)
ax.bar(x-0.1, users, width=0.2,color='g')
ax.bar(x+0.1,nonUsers, width=0.2,color='r')
plt.xticks(x, labels)
plt.legend(['users','non-users'])
plt.show()
df.melt(id_vars='age').groupby(['variable', 'value']).mean()
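To make the melt-then-groupby idea concrete, here is a small sketch on hand-written data (the ages and answers below are invented for illustration, and the names platform, status and mean_age are my own). It reshapes the survey to long format and computes the mean age of users (1) and non-users (2) per platform.

```python
import pandas as pd

# Small illustrative survey: 1 = user, 2 = non-user
df = pd.DataFrame({
    "age":   [20, 30, 40, 50, 60, 70],
    "web1a": [1, 1, 2, 2, 1, 2],
    "web1b": [2, 2, 2, 1, 1, 1],
})

# Long format: one row per (respondent, platform) pair
long_df = df.melt(id_vars="age", var_name="platform", value_name="status")

# Mean age per platform and status, with the status values as columns
mean_age = long_df.groupby(["platform", "status"])["age"].mean().unstack()
mean_age = mean_age.rename(columns={1: "user", 2: "non-user"})
print(mean_age)
```

Calling mean_age.plot(kind='bar') on this table draws one pair of bars per platform.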
I am relatively new to python and am currently trying to generate a scatterplot based off of some data using pandas & seaborn.
The data I'm using ('ErrorMedianScatter') is as follows (apologies for the link, I have yet to get permissions to embed images!):
Image of data
Each participant has two data points of interest: the mean when MissingLimb = 0 and the mean when MissingLimb = 1.
I want to create a scatterplot for participants where the x-axis represents their value for 'mean' when 'MissingLimb' = 0, and the y-axis represents their value for 'mean' when 'MissingLimb' = 1.
I am using the current code so far to create a scatterplot:
sns.lmplot(x="mean",
           y="mean",
           data=ErrorMedianScatter,
           fit_reg=False,
           hue="participant")
This generates a perfectly functional, but very uninteresting, scatterplot. What I'm stuck on is creating an x-/y-axis variable that allows for me to specify that I'm interested in the mean of a participant based on the value of 'MissingLimb' column.
Many thanks in advance!
There are most likely multiple ways to solve your problem. The method I'd take is to first transform your dataset so that there is a single row (observation) for each participant, with one column that reports the mean where MissingLimb is 0 and another column that reports the mean where MissingLimb is 1.
You can accomplish this data transformation with this code:
df = pd.pivot_table(ErrorMedianScatter,
                    values='mean',
                    index='participant',
                    columns='MissingLimb')
df.columns = ['MissingLimb 0', 'MissingLimb 1']
You can then use this (transformed) dataframe to create the scatterplot:
sns.lmplot(data=df, x='MissingLimb 0', y='MissingLimb 1')
Notice that in addition to specifying the data to plot (using the data parameter), I also specified the data to plot on the x- and y-axis (using the x and y parameters, respectively). You can add additional arguments to the sns.lmplot call and customize the plot to your specifications.
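A quick way to check the reshaping is to run it on a tiny hand-made frame. The participant labels and values below are invented; only the column names come from the question.

```python
import pandas as pd

# Invented long-format data: two rows per participant
ErrorMedianScatter = pd.DataFrame({
    "participant": ["p1", "p1", "p2", "p2", "p3", "p3"],
    "MissingLimb": [0, 1, 0, 1, 0, 1],
    "mean":        [0.10, 0.30, 0.20, 0.25, 0.40, 0.15],
})

# One row per participant, one column per MissingLimb value
wide = pd.pivot_table(ErrorMedianScatter,
                      values="mean",
                      index="participant",
                      columns="MissingLimb")
wide.columns = [f"MissingLimb {c}" for c in wide.columns]
print(wide)
```

Each row of wide is now one point of the scatterplot, so passing fit_reg=False to the sns.lmplot call gives a plain scatter without the regression line.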
I am trying to plot a heatmap with 2 columns of data from a pandas dataframe. However, I would like to use a 3rd column to label the x axis, ideally by colour though another method such as an additional axis would be equally suitable. My dataframe is:
MUT   SAMPLE  VAR            GROUP
True  s1      1_1334442_T    CC002
True  s2      1_1334442_T    CC006
True  s1      1_1480354_GAC  CC002
True  s2      1_1480355_C    CC006
True  s2      1_1653038_C    CC006
True  s3      1_1730932_G    CC002
...
Just to give a better idea of the data; there are 9 different types of 'GROUP', ~60,000 types of 'VAR' and 540 'SAMPLE's. I am not sure if this is the best way to build a heatmap in python but here is what I figured out so far:
pivot = pd.crosstab(df_all['VAR'],df_all['SAMPLE'])
sns.set(font_scale=0.4)
g = sns.clustermap(pivot, row_cluster=False, yticklabels=False, linewidths=0.1, cmap="YlGnBu", cbar=False)
plt.show()
I am not sure how to get 'GROUP' to display along the x-axis, either as an additional axis or just colouring the axis labels? Any help would be much appreciated.
I'm not sure if the 'MUT' column being a boolean variable is an issue here, df_all is 'TRUE' on every 'VAR' but as pivot is made, any samples which do not have a particular 'VAR' are filled as 0, others are filled with 1. My aim was to try and cluster samples with similar 'VAR' profiles. I hope this helps.
Please let me know if I can clarify anything further? Many thanks
Take a look at this example. You can give a list or a DataFrame column to the clustermap function. By specifying either the col_colors argument or the row_colors argument, you can colour the rows or the columns based on that list.
In the example below I use the iris dataset and make a pandas series object that specifies which colour the specific row should have. That pandas series is given as an argument for row_colors.
iris = sns.load_dataset("iris")
species = iris.pop("species")
lut = dict(zip(species.unique(), "rbg"))
row_colors = species.map(lut)
g = sns.clustermap(iris, row_colors=row_colors,row_cluster=False)
This code results in the following image.
You may need to tweak a bit further to also include a legend for the colouring for groups.
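Applied to the question's data, the same idea might look like the sketch below. It only builds the per-sample colour Series with pandas, assuming each SAMPLE belongs to exactly one GROUP; the colour names are arbitrary and the six rows are the ones shown in the question.

```python
import pandas as pd

# The rows shown in the question
df_all = pd.DataFrame({
    "MUT": [True] * 6,
    "SAMPLE": ["s1", "s2", "s1", "s2", "s2", "s3"],
    "VAR": ["1_1334442_T", "1_1334442_T", "1_1480354_GAC",
            "1_1480355_C", "1_1653038_C", "1_1730932_G"],
    "GROUP": ["CC002", "CC006", "CC002", "CC006", "CC006", "CC002"],
})

pivot = pd.crosstab(df_all["VAR"], df_all["SAMPLE"])

# One GROUP per SAMPLE (assumes a sample never appears in two groups)
sample_group = df_all.drop_duplicates("SAMPLE").set_index("SAMPLE")["GROUP"]

# Arbitrary colour per group; extend the list for all nine groups
palette = ["tab:blue", "tab:orange"]
lut = dict(zip(sorted(sample_group.unique()), palette))

# Colours ordered like the heatmap columns
col_colors = sample_group.map(lut).reindex(pivot.columns)
print(col_colors)
```

The Series can then be passed as sns.clustermap(pivot, col_colors=col_colors, row_cluster=False), which draws a coloured strip above the columns.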
I would like to plot boxplots for several datasets based on a criterion.
Imagine a dataframe similar to the example below:
df = pd.DataFrame({'Group':[1,1,1,2,3,2,2,3,1,3],'M':np.random.rand(10),'F':np.random.rand(10)})
df = df[['Group','M','F']]
   Group         M         F
0      1  0.465636  0.537723
1      1  0.560537  0.727238
2      1  0.268154  0.648927
3      2  0.722644  0.115550
4      3  0.586346  0.042896
5      2  0.562881  0.369686
6      2  0.395236  0.672477
7      3  0.577949  0.358801
8      1  0.764069  0.642724
9      3  0.731076  0.302369
In this case, I have three groups, so I would like to make a boxplot for each group and for M and F separately having the groups on Y axis and the columns of M and F colour-coded.
This answer is very close to what I want to achieve, but I would prefer something more robust, applicable for larger dataframes with greater number of groups. I feel that groupby is the way to go, but I am not familiar with groupby objects and I am failing to even slice them.
The desired output would look something like this:
Looks like years ago, someone had the same problem, but got no answers :( Having a boxplot as a graphical representation of the describe function of groupby
My questions are:
How to implement groupby to feed the desired data into the boxplot
What is the correct syntax for the box plot if I want to control what is displayed and not just use the default settings (I don't even know what they are, as I find the documentation rather vague)? To be specific, can I have the box cover the mean +/- standard deviation and keep the vertical line at the median value?
I think you should use the Seaborn library, which lets you create this type of customized plot. In your case I first melted your DataFrame to convert it into the proper long format and then created the boxplot of your choice.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dd=pd.melt(df,id_vars=['Group'],value_vars=['M','F'],var_name='sex')
sns.boxplot(y='Group',x='value',data=dd,orient="h",hue='sex')
The plot looks similar to your required plot.
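To see what the melt step produces, here is a short sketch on four rows (values rounded from the question's table, chosen only for illustration). Each M and F measurement becomes its own row, tagged by a sex column, which is the long format that the hue argument of sns.boxplot works with.

```python
import pandas as pd

# A few rows in the question's wide layout (rounded, illustrative values)
df = pd.DataFrame({
    "Group": [1, 1, 2, 3],
    "M": [0.47, 0.56, 0.72, 0.59],
    "F": [0.54, 0.73, 0.12, 0.04],
})

# Wide -> long: one row per (Group, sex) measurement
dd = pd.melt(df, id_vars=["Group"], value_vars=["M", "F"], var_name="sex")
print(dd)
```

Because the grouping lives in ordinary columns, the same two lines work unchanged for any number of groups.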
Finally, I found a solution by slightly modifying this answer. It does not use groupby object, so it is more tedious to prepare the data, but so far it looks like the best solution to me. Here it is:
# here I prepare the data (group them manually and then store in lists)
import numpy as np
import matplotlib.pyplot as plt
Groups=[1,2,3]
Columns=df.columns.tolist()[1:]
print(Columns)
Mgroups=[]
Fgroups=[]
for g in Groups:
    dfgc = df[df['Group']==g]
    m=dfgc['M'].dropna()
    f=dfgc['F'].dropna()
    Mgroups.append(m.tolist())
    Fgroups.append(f.tolist())
fig=plt.figure()
ax = plt.axes()
def setBoxColors(bp,cl):
    plt.setp(bp['boxes'], color=cl, linewidth=2.)
    plt.setp(bp['whiskers'], color=cl, linewidth=2.5)
    plt.setp(bp['caps'], color=cl,linewidth=2)
    plt.setp(bp['medians'], color=cl, linewidth=3.5)
# whis=(0, 100) extends the whiskers over the full data range
bpl = plt.boxplot(Mgroups, positions=np.array(range(len(Mgroups)))*3.0-0.4,vert=False,whis=(0, 100), sym='', widths=0.6)
bpr = plt.boxplot(Fgroups, positions=np.array(range(len(Fgroups)))*3.0+0.4,vert=False,whis=(0, 100), sym='', widths=0.6)
setBoxColors(bpr, '#D7191C') # colors are from http://colorbrewer2.org/
setBoxColors(bpl, '#2C7BB6')
# draw temporary red and blue lines and use them to create a legend
plt.plot([], c='#D7191C', label='F')
plt.plot([], c='#2C7BB6', label='M')
plt.legend()
plt.yticks(range(0, len(Groups) * 3, 3), Groups)
plt.ylim(-3, len(Groups)*3)
#plt.xlim(0, 8)
plt.show()
The result looks mostly like what I wanted (as far as I have been able to find, the box always ranges from first to third quartile, so it is not possible to set it to +/- standard deviation). So I am a bit disappointed there is no one-line solution, but I am glad it is possible. However, for hundreds of groups this would not be good enough...
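One route that may get closer to both wishes is sketched below: let groupby split the data and hand precomputed statistics to Matplotlib's Axes.bxp, which draws boxes from dictionaries instead of raw data. The helper box_stats and the choice of mean +/- standard deviation as box edges are my own assumptions, not part of the solution above; the whiskers are set to the data minimum and maximum.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"Group": [1, 1, 1, 2, 3, 2, 2, 3, 1, 3],
                   "M": rng.random(10), "F": rng.random(10)})

def box_stats(values):
    """Hypothetical helper: box = mean +/- std, line = median, whiskers = min/max."""
    mean, std = values.mean(), values.std()
    return {"med": values.median(), "q1": mean - std, "q3": mean + std,
            "whislo": values.min(), "whishi": values.max(), "fliers": []}

colors = {"M": "#2C7BB6", "F": "#D7191C"}
offsets = {"M": -0.2, "F": 0.2}
groups = sorted(df["Group"].unique())

fig, ax = plt.subplots()
all_stats = {}
for col in ["M", "F"]:
    # groupby iterates over the groups in sorted key order
    all_stats[col] = [box_stats(g[col]) for _, g in df.groupby("Group")]
    ax.bxp(all_stats[col],
           positions=[i + offsets[col] for i in range(len(groups))],
           widths=0.3, vert=False, showfliers=False, manage_ticks=False,
           boxprops={"color": colors[col]}, medianprops={"color": colors[col]})

# One tick per group, centred between its M and F boxes
ax.set_yticks(range(len(groups)))
ax.set_yticklabels([str(g) for g in groups])
```

Because the loop runs over whatever groups groupby finds, it scales to many groups without manual lists; call plt.show() to display the figure. Newer Matplotlib releases also offer an orientation argument as an alternative to vert=False.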