I have a dataframe and want precise control over the legend labels when using df.plot(). Example:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),columns=['a', 'b', 'c'])
When I plot this dataframe, the column names are shown as labels, but I would like to add more text above those labels in LaTeX format, for instance $V_{sd}$. In the end I want the legend to look like:
(transparent) $V_{sd}$
(blue) a
(orange) b
(green) c
The text in parentheses is the color of each label/line, which I also want to control precisely.
One way to do this is to use matplotlib.pyplot.plot, make an empty plot with the extra label, and then plot each column/row one by one, but I wonder if there is an easier way to do it with pandas since I have a bunch of dataframes.
Do you mean this?
df.plot(color=['b','orange','green'])
plt.legend(title='$V_{sd}$')
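Since the question mentions a bunch of dataframes, the two lines above could be wrapped in a small helper. This is only a sketch: the function name plot_with_legend_title and the frames dict are made up for illustration, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_with_legend_title(df, colors, title):
    """Plot every column of df as a line and put `title` above the legend entries."""
    fig, ax = plt.subplots()
    df.plot(ax=ax, color=colors)  # one line per column, colored in column order
    ax.legend(title=title)        # the legend title accepts mathtext such as $V_{sd}$
    return ax


# hypothetical collection of dataframes that share the same columns
frames = {
    "first": pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=["a", "b", "c"]),
    "second": pd.DataFrame(np.array([[9, 8, 7], [6, 5, 4], [3, 2, 1]]), columns=["a", "b", "c"]),
}
axes = {
    name: plot_with_legend_title(frame, ["b", "orange", "green"], r"$V_{sd}$")
    for name, frame in frames.items()
}
```

Each call returns the Axes, so a figure can still be adjusted afterwards (axis labels, limits) before it is shown or saved.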
Is there a way to make a scatterplot using seaborn where the (x, y) values are paired? I'd like the x-axis to represent values under condition A, and the y-axis to represent values under condition B. Concretely, suppose that the x-axis is patient weight before treatment and the y-axis is patient weight after treatment. My data is formatted like the following:
df = pd.DataFrame(
{'n': [1, 2, 3, 1, 2, 3],
'treatment': ['before', 'before', 'before', 'after', 'after', 'after'],
'weight': np.random.rand(6)})
   n treatment    weight
0  1    before  0.431438
1  2    before  0.053631
2  3    before  0.567058
3  1     after  0.324254
4  2     after  0.624151
5  3     after  0.519498
I think this qualifies as tidy data but the single variable I want to plot is weight. All the documentation I see for seaborn shows examples for paired data like plotting variable x of each item against variable y of each item. For example, sepal_length versus sepal_width. But how might I plot x versus y where my x and y are coming from the same column?
Is the solution to reformat my data so that I have a weight_before and a weight_after column? If so, can you provide the cleanest way to modify the data using pandas? I know I can do something like the following, but I feel like it's not great syntax.
df['weight_before'] = df['weight']
df.loc[df['treatment'] != 'before', 'weight_before'] = np.nan
# and similar for df['weight_after']
If I understand your question correctly, this might work for you:
sns.lmplot(data=df.pivot(index='n', columns='treatment', values='weight'),
x='before', y='after', fit_reg=False)
Another way of doing it:
Pivot the dataframe:
df2=pd.pivot_table(df, index='n',columns='treatment', values='weight', aggfunc=np.sum)
df2.reset_index(drop=True, inplace=True)
Plot Scatter
ax = sns.scatterplot(x="before", y="after",data=df2)
Chained solution
ax = sns.scatterplot(x="before", y="after",data=pd.pivot_table(df, index='n',columns='treatment', values='weight', aggfunc=np.sum).reset_index(drop=True))
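To see what the pivot step in these answers produces, here is a minimal sketch that uses the weights printed in the question instead of random numbers. It stops at the reshaped frame and leaves out the seaborn call.

```python
import pandas as pd

# weights copied from the table printed in the question
df = pd.DataFrame(
    {"n": [1, 2, 3, 1, 2, 3],
     "treatment": ["before", "before", "before", "after", "after", "after"],
     "weight": [0.431438, 0.053631, 0.567058, 0.324254, 0.624151, 0.519498]})

# one row per patient n, one column per treatment value
wide = df.pivot(index="n", columns="treatment", values="weight")
print(wide)
```

The resulting frame has a 'before' and an 'after' column, so it can be passed as data with x="before", y="after" exactly as in the answers above.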
The 3D surface plot in plotly never shows the data. The plot itself shows up, but nothing appears in it, as if I had plotted an empty DataFrame.
At first, I tried something like the solution I found here (Plotly Plot surface 3D not displayed), but had the same result: another plot with no data.
df3 = pd.DataFrame({'x':[1, 2, 3, 4, 5],'y':[10, 20, 30, 40, 50],'z': [5, 4, 3, 2, 1]})
iplot(dict(data=[Surface(x=df3['x'], y=df3['y'], z=df3['z'])]))
So I tried the code from the plotly website (the first cell of this notebook: https://plot.ly/python/3d-scatter-plots/), exactly as it appears there, just to see whether their example worked, but I get an error.
I am getting this:
https://lh3.googleusercontent.com/sOxRsIDLVkBGKTksUfVqm3HtaSQAN_ybQq2HLA-aclzEU-9ekmvd1ETdfsswC2SdbysizOI=s151
But I should get this:
https://lh3.googleusercontent.com/5Hy2Z-97_vwd3ftKBA6dYZfikJHnA-UMEjd3PHvEvdBzw2m2zeEHBtneLC1jzO3RmE2lyw=s151
Note: I could not post the images because I lack the reputation.
In order to plot a surface you have to provide a z value for every (x, y) point. In this case your x and y are series of size 5, which means that your z should have shape (5, 5).
If I had a bit more info I could give you more details, but for a minimal working example try passing a (5, 5) dataframe, numpy array or even a list of lists as the z value of data.
EDIT:
In a notebook environment the following code works for me:
from plotly import offline
from plotly import graph_objs as go
offline.init_notebook_mode(connected=False)
df3 = {'x':[1, 2, 3, 4, 5],'y':[10, 20, 30, 40, 50],'z': [[5, 4, 3, 2, 1]]*5}
offline.iplot(dict(data=[go.Surface(x=df3['x'], y=df3['y'], z=df3['z'])]))
I'm using plotly 3.7.0.
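To make the shape requirement concrete, here is a sketch that builds a (5, 5) z grid with NumPy from the x and y values in the question. The surface z = x + y is an arbitrary choice for illustration. As far as I know, plotly's Surface reads z[i][j] as the height at (x[j], y[i]), which matches what np.meshgrid returns by default, but treat that as an assumption to check against the plotly reference.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([10, 20, 30, 40, 50])

# meshgrid returns arrays of shape (len(y), len(x)):
# row i belongs to y[i], column j belongs to x[j]
xx, yy = np.meshgrid(x, y)

# arbitrary example surface, chosen only for illustration
z = xx + yy

print(z.shape)
```

The z array can then be passed as the z argument in place of the list of lists used above.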
I have a survey dataset with the ages of people and whether they use various social media platforms. I want to calculate averages by social media app usage.
Here is a reproducible pandas dataframe:
df=pd.DataFrame({'age': np.random.randint(10,100,size=10),
'web1a': np.random.choice([1, 2], size=(10,)),
'web1b': np.random.choice([1, 2], size=(10,), p=[1./3, 2./3]),
'web1c': np.random.choice([1, 2], size=(10,)),
'web1d': np.random.choice([1, 2], size=(10,))})
here is what I tried:
df.pivot_table(df, values='web1a', index='age', aggfunc='mean')
but it is inefficient and did not produce the output I want. Any ideas on how to get this done? Thanks.
Update:
As I see it, the way to do this is to first select the categorical values in each column and compute the mean for each one; the same approach applies to the other columns. If I do that, how can I plot the results nicely?
Note that in columns web1a, web1b, web1c and web1d, 1 means user and 2 means non-user. I want to compute the average age of users and non-users. How can I do that? Thanks!
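Using melt and groupby (mean(level=...) no longer works in recent pandas versions, so group on the two columns instead):
df.melt('age').groupby(['variable','value']).mean().unstack().plot(kind='bar')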
This can be done using groupby method:
df.groupby(['web1a', 'web1b', 'web1c', 'web1d']).mean()
You can groupby the 'web*' columns and calculate the mean on the 'age' column.
You can also plot bar charts (the colors can be set in the subplot). I'm not sure pie charts make sense in this case.
I tried with your data, taking only the columns starting with 'web'. There are more values than 1 and 2, so I assumed you only wanted to analyze the users and non-users and nothing else. You can change the values or add other values to the chart in the same way, as long as you know which values you want to draw.
import numpy as np
from matplotlib import pyplot as plt
# keep only the age column and the web* columns
df = df.filter(regex='web|age', axis=1)
# codes as they appear in the data; use '1' and '2' instead if your columns hold strings
userNr = 1
nonUserNr = 2
users = list()
nonUsers = list()
labels = [x for x in df.columns.tolist() if 'web' in x]
for col in labels:
    # mean age for each code in this column
    meanAge = df.groupby(col)['age'].mean()
    users.append(meanAge.loc[userNr])
    nonUsers.append(meanAge.loc[nonUserNr])
x = np.arange(1, len(labels)+1)
ax = plt.subplot(111)
# side-by-side bars: green for users, red for non-users
ax.bar(x-0.1, users, width=0.2, color='g')
ax.bar(x+0.1, nonUsers, width=0.2, color='r')
plt.xticks(x, labels)
plt.legend(['users','non-users'])
plt.show()
df.melt(id_vars='age').groupby(['variable', 'value']).mean()
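As a small check of what the melt/groupby approach returns, here is a sketch on a tiny hand-made dataframe (the ages and codes below are invented for illustration, not taken from the survey). The rename step assumes, as stated in the question, that 1 means user and 2 means non-user.

```python
import pandas as pd

# invented example data: two platforms, four respondents
df = pd.DataFrame({"age": [20, 30, 40, 50],
                   "web1a": [1, 1, 2, 2],
                   "web1b": [1, 2, 1, 2]})

# long format: one row per (respondent, platform) pair
long_df = df.melt(id_vars="age")

# mean age for each platform and each code, then one column per code
summary = long_df.groupby(["variable", "value"])["age"].mean().unstack()
summary = summary.rename(columns={1: "user", 2: "non-user"})
print(summary)
```

Calling summary.plot(kind='bar') on the result gives one pair of bars per platform.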
I've started to use Holoviews with Python3 and Jupyter notebooks, and I'm looking for a good way to put long names and units on my plot axis. An example looks like this:
import holoviews as hv
import pandas as pd
from IPython.display import display
hv.notebook_extension()
dataframe = pd.DataFrame({"time": [0, 1, 2, 3],
"photons": [10, 30, 20, 15],
"norm_photons": [0.33, 1, 0.67, 0.5],
"rate": [1, 3, 2, 1.5]}, index=[0, 1, 2, 3])
hvdata = hv.Table(dataframe, kdims=["time"])
display(hvdata.to.curve(vdims='rate'))
This gives me a nice plot, but instead of 'time' on the x-axis and 'rate' on the y-axis, I would prefer something like 'Time (ns)' and 'Rate (1/s)', but I don't want to type that in the code every time.
I've found this blog post by PhilippJFR which does roughly what I need, but the DFrame() function it uses is deprecated, so I would like to avoid it if possible. Any ideas?
Turns out it's easy to do but hard to find in the documentation. You just pass a holoviews.Dimension instead of a string as the kdims parameter:
hvdata = hv.Table(dataframe, kdims=[hv.Dimension('time', label='Time', unit='ns')])
display(hvdata.to.curve(vdims=hv.Dimension('rate', label='Rate', unit='1/s')))
You can find good alternatives in this SO question:
Setting x and y labels with holoviews
I like doing it like this:
Creating a tuple with the name of the variable and the long name you would like to see printed on the plot:
hvdata = hv.Table(
dataframe,
kdims=[('time', 'Time (ns)')],
vdims=[('rate', 'Rate (1/s)')],
)
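To avoid retyping the long names in every cell, the (name, label) tuples can be generated from one lookup table. This is a plain-Python sketch that does not import holoviews; AXIS_INFO and dim are hypothetical names introduced here, and the tuple format is the one used in the answer above.

```python
# hypothetical lookup table: column name -> (long name, unit)
AXIS_INFO = {
    "time": ("Time", "ns"),
    "rate": ("Rate", "1/s"),
}


def dim(name):
    """Return a (name, 'Label (unit)') tuple for use in kdims or vdims."""
    label, unit = AXIS_INFO[name]
    return (name, f"{label} ({unit})")


kdims = [dim("time")]
vdims = [dim("rate")]
print(kdims, vdims)
```

The two lists would then be passed as kdims=kdims and vdims=vdims to hv.Table in place of the literal tuples.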
I would like to plot a heatmap from a DataFrame in pandas. The data looks like
df = pd.DataFrame({"A": np.random.random(100), "B": np.random.random(100), "C": np.random.random(100)})
To show how C changes as a function of A and B, I want to bin the data based on A and B and calculate the average value of C in each bin. The heatmap should then have A and B as the X-axis and Y-axis, with the color indicating the corresponding C value.
I tried to use seaborn.heatmap, but the function expects a 2-D (rectangular) dataset, which means I should bin the data first.
Is there a way to directly generate what I want from the DataFrame?
If not, how can I bin DataFrame into 2-D grids?
I know pandas.cut can do the trick, but it seems to cut on only one column at a time. Of course I could write a tedious function that chains two cuts, but I am wondering whether there is a simpler way to do the task.
Scatter plot can give similar results but I want heatmap, something like this, not this.
Something like this?
>>> df.groupby([pd.cut(df.A, 3), pd.cut(df.B, 3)]).C.mean().unstack()
B                  (0.00223, 0.335]  (0.335, 0.666]  (0.666, 0.998]
A
(0.000763, 0.334]          0.579832        0.454004        0.349740
(0.334, 0.667]             0.587145        0.677880        0.559560
(0.667, 1]                 0.566409        0.496061        0.420541
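The binned table above can be drawn as a heatmap directly. The sketch below uses matplotlib's imshow rather than seaborn to keep dependencies small; the seed, the number of bins and the Agg backend are arbitrary choices made so the example is reproducible and runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.random(100), "B": rng.random(100), "C": rng.random(100)})

n_bins = 3
# mean of C in every (A bin, B bin) cell; observed=False keeps empty cells as NaN
grid = (
    df.groupby([pd.cut(df["A"], n_bins), pd.cut(df["B"], n_bins)], observed=False)["C"]
    .mean()
    .unstack()
)

fig, ax = plt.subplots()
# transpose so that A bins run along the x-axis and B bins along the y-axis
image = ax.imshow(grid.to_numpy().T, origin="lower", aspect="auto")
ax.set_xticks(range(n_bins))
ax.set_xticklabels([str(b) for b in grid.index])
ax.set_yticks(range(n_bins))
ax.set_yticklabels([str(b) for b in grid.columns])
ax.set_xlabel("A")
ax.set_ylabel("B")
fig.colorbar(image, ax=ax, label="mean of C")
```

Because grid is already a rectangular table, it could equally be passed to seaborn.heatmap, which is what the question originally asked about.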