Grouped bar chart of multiindex - python
First of all: I'm completely new to Python.
I'm trying to visualize some measured data. Each entry has a quadrant, number and sector. The original data lies in a .xlsx file. I've managed to use a .pivot_table to sort the data according to its sector. Due to overlapping, number and quadrant also have to be indexed. Now I want to plot it as a bar chart, where the bars are grouped by sector and the colors represent the quadrant.
But because number also has to be indexed, it shows up in the bar chart as a separate group. There should only be three groups, 0, i and a.
MWE:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
d = {'quadrant': ["0","0","0","0","0","0","I","I","I","I","I","I","I","I","I","I","I","I","II","II","II","II","II","II","II","II","II","II","II","II","III","III","III","III","III","III","III","III","III","III","III","III","IV","IV","IV","IV","IV","IV","IV","IV","IV","IV","IV","IV"], 'sector': [0,"0","0","0","0","0","a","a","a","a","a","a","i","i","i","i","i","i","a","a","a","a","a","a","i","i","i","i","i","i","a","a","a","a","a","a","i","i","i","i","i","i","a","a","a","a","a","a","i","i","i","i","i","i"], 'number': [1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6], 'Rz_m': [67.90,44.17,44.30,63.43,49.87,39.33,61.17,69.37,66.20,44.20,64.77,39.93,44.33,50.97,55.90,51.33,58.23,44.53,50.03,47.40,58.67,71.57,57.60,70.77,63.93,47.37,46.90,34.73,41.27,48.23,58.30,47.07,50.53,51.20,32.67,50.37,37.50,55.50,41.20,48.07,56.80,49.77,40.87,44.43,44.00,60.03,63.73,72.80,51.60,45.53,60.27,71.00,59.63,48.70]}
df = pd.DataFrame(data=d)
B = df.pivot_table(index=['sector','number', 'quadrant'])
B.unstack().plot.bar(y='Rz_m')
The data viz ecosystem in Python is pretty diverse and there are multiple libraries you can use to produce the same chart. Matplotlib is a very powerful library, but it's also quite low-level, meaning you often have to do a lot of preparatory work before getting to the chart. That's why you'll usually find people using seaborn for static visualisations, especially if there is a scientific element to them (it has built-in support for things like error bars, etc.).
Seaborn is built on top of matplotlib and, out of the box, has a lot of chart types to support exploratory data analysis. For your example, if I understood it right, it would be as simple as:
import seaborn as sns
# ci=None turns off the error bars; newer seaborn versions (0.12+) use errorbar=None instead
sns.catplot(x="sector", y="Rz_m", hue="quadrant", data=df, ci=None,
            height=6, kind="bar", palette="muted")
And the output would look like this:
Note that in your example you missed the quotes around one of the zeroes in sector, so the integer 0 and the string "0" are plotted as separate columns. If you're using seaborn, you don't need to pivot the data; just feed it the df as you've defined it.
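If you would rather stay with pandas plotting, something along these lines might work. It is only a minimal sketch: the small frame is a hypothetical, shortened stand-in for your data, and I'm assuming that the mean over number is what each bar should show (that is also what the seaborn bar plot draws by default).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import pandas as pd

# Hypothetical, shortened stand-in for the data in the question.
df = pd.DataFrame({
    "quadrant": ["0", "0", "I", "I", "I", "I", "II", "II", "II", "II"],
    "sector":   [0, "0", "a", "a", "i", "i", "a", "a", "i", "i"],  # note the stray integer 0
    "number":   [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
    "Rz_m":     [67.90, 44.17, 61.17, 69.37, 44.33, 50.97, 50.03, 47.40, 63.93, 47.37],
})

# Make every sector label a string so that 0 and "0" end up in the same group.
df["sector"] = df["sector"].astype(str)

# Rows = sector (the bar groups), columns = quadrant (the colours).
# The values for the different numbers are averaged, so number no longer
# shows up as a separate group.
table = df.pivot_table(index="sector", columns="quadrant", values="Rz_m", aggfunc="mean")

ax = table.plot.bar(rot=0)
ax.set_ylabel("mean Rz_m")
```

The key difference from the code in the question is that quadrant goes into columns and number is left out of the index entirely, so the x-axis only has the three sector groups.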
For interactive visualisations (with tooltips, zoom, pan, etc.), you can also check out bokeh.
There is an interesting wrinkle to this - how to center the nested bars on the label. By default the bars are drawn with center alignment which works fine for an odd number of columns. However, for an even number, you'd want them to be centered on the right edge. You can make a small alteration in the source code categorical.py, lines beginning 1642 like so:
# Draw the bars
offpos = barpos + self.hue_offsets[j]
barfunc(offpos, self.statistic[:, j], -self.nested_width,
        color=self.colors[j], align="edge",
        label=hue_level, **kws)
You can then save the .png and change the source back, but editing the library code is not ideal. It's probably worth flagging up to the library maintainers.
Related
Why am I unable to make a plot containing subplots in plotly using a px.scatter plot?
I have been trying to make a figure using plotly that combines multiple figures together. In order to do this, I have been trying to use the make_subplots function, but I have found it very difficult to have the plots added in such a way that they are properly formatted. I can currently make singular plots (as seen directly below):
However, whenever I try to combine these singular plots using make_subplots, I end up with this:
This figure has the subplots set up completely wrong, since I need each of the four subplots to contain data pertaining to the four methods (A, B, C, and D). In other words, I would like to have four subplots that look like my singular plot example above. I have set up the code in the following way:
for sequence in sequences:
    #process for making sequence profile is done here
    sequence_df = pd.DataFrame(sequence_profile)
    row_number=1
    grand_figure = make_subplots(rows=4, cols=1)
    #there are four groups per sequence, so the grand figure should have four subplots in total
    for group in sequence_df["group"].unique():
        figure_df_group = sequence_df[(sequence_df["group"]==group)]
        figure_df_group.sort_values("sample", ascending=True, inplace=True)
        figure = px.line(figure_df_group, x = figure_df_group["sample"], y = figure_df_group["intensity"], color= figure_df_group["method"])
        figure.update_xaxes(title= "sample")
        figure.update_traces(mode='markers+lines')
        #note: the next line fails, since data must be extracted from the figure, hence why it is commented out
        #grand_figure.append_trace(figure, row = row_number, col=1)
        figure.update_layout(title_text="{} Profile Plot".format(sequence))
        grand_figure.append_trace(figure.data[0], row = row_number, col=1)
        row_number+=1
        figure.write_image(os.path.join(output_directory+"{}_profile_plot_subplots_in_{}.jpg".format(sequence, group)))
    grand_figure.write_image(os.path.join(output_directory+"grand_figure_{}_profile_plot_subplots.jpg".format(sequence)))
I have tried following directions (like for example, here: ValueError: Invalid element(s) received for the 'data' property) but I was unable to get my figures added as is as subplots. At first it seemed like I needed to use the graph object (go) module in plotly (https://plotly.com/python/subplots/), but I would really like to keep the formatting/design of my current singular plot. I just want the plots to be conglomerated in groups of four. However, when I try to add the subplots like I currently do, I need to use the data property of the figure, which causes the design of my scatter plot to be completely messed up. Any help for how I can ameliorate this problem would be great.
OK, so I found a solution here. Rather than using the make_subplots function, I instead exported all the figures to an .html file (Plotly saving multiple plots into a single html) and then converted it into an image (HTML to IMAGE using Python). This isn't exactly the approach I would have preferred, but it does work.
UPDATE: I have found that plotly express offers another solution, as px.line has facet parameters (here facet_col) that allow one to set up multiple subplots within one plot. My code is set up like this, and differs from the code above in that the dataframe does not need to be iterated over in a for loop based on its groups:
sequence_df = pd.DataFrame(sequence_profile)
figure = px.line(sequence_df, x = sequence_df["sample"], y = sequence_df["intensity"], color= sequence_df["method"], facet_col= sequence_df["group"])
Although it still needs more formatting, my plot now looks like this, which works much better for my purposes:
In a pairplot, how can I not show confidence intervals but display grid lines instead? [duplicate]
How can I loop through a list of elements and create time series plots in Python
Here is a sample of the data I'm working with: WellAnalyticalData
I'd like to loop through each well name and create a time series chart for each parameter with sample date on the x-axis and the value on the y-axis. I don't think I want subplots, I'm just looking for individual plots of each analyte for each well. I've used pandas to try grouping by well name and then attempting to plot, but that doesn't seem to be the way to go. I'm fairly new to Python and I think I'm also having trouble figuring out how to construct the loop statement. I'm running Python 3.x and am using the matplotlib library to generate the plots.
So if I understand your question correctly, you want one plot for each combination of Well and Parameter. No subplots, just a new plot for each combination. Each plot should have SampleDate on the x-axis and Value on the y-axis. I've written a loop here that does just that, although you'll see that since your data has just one date per well per parameter, the plots are just a single dot.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.DataFrame({'WellName':['A','A','A','A','B','B','C','C','C'],
                   'SampleDate':['2018-02-15','2018-03-31','2018-06-07','2018-11-14','2018-02-15','2018-11-14','2018-02-15','2018-03-31','2018-11-14'],
                   'Parameter':['Arsenic','Lead','Iron','Magnesium','Arsenic','Iron','Arsenic','Lead','Magnesium'],
                   'Value':[0.2,1.6,0.05,3,0.3,0.79,0.3,2.7,2.8]
                   })
# one figure per well and, within each well, per parameter
for well in df.WellName.unique():
    temp1 = df[df.WellName==well]
    for param in temp1.Parameter.unique():
        fig = plt.figure()
        temp2 = temp1[temp1.Parameter==param]
        plt.scatter(temp2.SampleDate,temp2.Value)
        plt.title('Well {} and Parameter {}'.format(well,param))
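If you want to use groupby after all, a sketch along these lines should give the same set of plots. It reuses the sample frame from this answer and assumes that converting SampleDate with pd.to_datetime is acceptable for your real data; the figures dict is just a hypothetical way of keeping hold of the results.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "WellName":   ["A", "A", "A", "A", "B", "B", "C", "C", "C"],
    "SampleDate": ["2018-02-15", "2018-03-31", "2018-06-07", "2018-11-14", "2018-02-15",
                   "2018-11-14", "2018-02-15", "2018-03-31", "2018-11-14"],
    "Parameter":  ["Arsenic", "Lead", "Iron", "Magnesium", "Arsenic",
                   "Iron", "Arsenic", "Lead", "Magnesium"],
    "Value":      [0.2, 1.6, 0.05, 3, 0.3, 0.79, 0.3, 2.7, 2.8],
})

# Real dates give a proper time axis instead of string categories.
df["SampleDate"] = pd.to_datetime(df["SampleDate"])

figures = {}
# Grouping by both columns yields one sub-frame per (well, parameter) pair.
for (well, param), group in df.groupby(["WellName", "Parameter"]):
    group = group.sort_values("SampleDate")
    fig, ax = plt.subplots()
    ax.plot(group["SampleDate"], group["Value"], marker="o")
    ax.set_title("Well {} and Parameter {}".format(well, param))
    ax.set_xlabel("SampleDate")
    ax.set_ylabel("Value")
    fig.autofmt_xdate()  # tilt the date labels so they do not overlap
    figures[(well, param)] = fig
```

Grouping by both columns at once replaces the two nested loops, and the marker keeps single-point series visible.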
Why do sns.lmplot and FacetGrid+plt.scatter create different scatter points from the same data?
I'm quite new to Python, pandas DataFrames and Seaborn. When I was trying to understand Seaborn better, particularly sns.lmplot, I came across a difference between two figures made of the same data that I thought were supposed to look alike, and I wonder why that is.
Data: My data is a pandas DataFrame that has 454 rows and 19 columns. The data relevant to this question includes 4 columns and looks something like this:
Columns: Av_density; pred2; LOC; Year
Variable type: Continuous variable; Continuous variable; Categorical variable 1...4; Categorical 2012...2014
There are no missing data points.
My aim is to draw a 2x2 figure panel describing the relationship between Av_density and pred2 separately for each LOC (=location) with years marked with different colours. I call seaborn with:
import seaborn as sns
sns.set(style="whitegrid")
np.random.seed(sum(map(ord, "linear_categorical")))
(Side point: for some reason calling "linear_quantitative" does not work, i.e. I get a "File "stdin", line 2 sns.lmplot("Av_density", "pred2", Data, col="LOC", hue="YEAR", col_wrap=2); ^ SyntaxError: invalid syntax")
Figure method 1, FacetGrid + scatter:
sur=sns.FacetGrid(Data,col="LOC", col_wrap=2,hue="YEAR")
sur.map(plt.scatter, "Av_density", "pred2" );
plt.legend()
This produces a nice scatter of the data accurately. You can see the picture here: https://drive.google.com/file/d/0B7h2wsx9mUBScEdUbGRlRk5PV1E/view?usp=sharing
Figure method 2, sns.lmplot:
sns.lmplot("Av_density", "pred2", Data, col="LOC", hue="YEAR", col_wrap=2);
This produces the figure panel divided by LOC accurately, with Years in different colours, but the scatter of the data points does not look right. Instead, it looks like lmplot has linearised the data points, and lost the original scatter points that it is supposed to be drawing in addition to the regression lines. You can see the figure here: https://drive.google.com/file/d/0B7h2wsx9mUBSRkN5ZXhBeW9ob1E/view?usp=sharing
My data produces only three points per location per year, and I was first wondering if this is what makes the "mistake" in lmplot datapoint. Optimally I would have a shorter line describing the trend between years instead of a proper regression, but I have not figured out the code to this yet. But before tackling that issue, I would really like to know if there is something I am doing wrong that I can fix, or if this is an issue of lmplot trying to handle my data? Any help, comments and ideas on this are warmly welcome! -TA-
Ps. I'm running Python 2.7.8 with Spyder 2.3.4
EDIT: I get shorter "trend lines" with the first method by adding:
sur.map(plt.plot,"Av_density", "pred2" );
Still would like to know what is messing the figure with lmplot.
The issue is probably only that the added regression line is stretching the y-axis, so that the variability in the data cannot be seen. Try resetting the y-axis limits based on the variability in your original plot to see whether both plots show the same thing, in your case e.g.:
fig1 = sns.lmplot(x="Av_density", y="pred2", data=Data, col="LOC", hue="YEAR", col_wrap=2)
fig1.set(ylim=(-0.03, 0.05))
plt.show()
How to get rid of grid lines when plotting with Seaborn + Pandas with secondary_y
I'm plotting two data series with Pandas with seaborn imported. Ideally I would like the horizontal grid lines shared between both the left and the right y-axis, but I'm under the impression that this is hard to do. As a compromise I would like to remove the grid lines altogether. The following code however produces the horizontal gridlines for the secondary y-axis.
import pandas as pd
import numpy as np
import seaborn as sns
data = pd.DataFrame(np.cumsum(np.random.normal(size=(100,2)),axis=0),columns=['A','B'])
data.plot(secondary_y=['B'],grid=False)
You can take the Axes object out after plotting and perform .grid(False) on both axes.
# Gets the axes object out after plotting
ax = data.plot(...)
# Turns off grid on the left Axis.
ax.grid(False)
# Turns off grid on the secondary (right) Axis.
ax.right_ax.grid(False)
sns.set_style("whitegrid", {'axes.grid' : False})
Note that the style can be whichever valid one you choose. For a nice article on this, refer to this site.
The problem is with using the default pandas formatting (or whatever formatting you chose). Not sure how things work behind the scenes, but these parameters are trumping the formatting that you pass in the plot function. You can see a list of them here in the mpl_style dictionary. In order to get around it, you can do this:
import matplotlib
import numpy as np
import pandas as pd
# mpl_style is an option of older pandas versions; it has since been removed
pd.options.display.mpl_style = 'default'
new_style = {'grid': False}
matplotlib.rc('axes', **new_style)
data = pd.DataFrame(np.cumsum(np.random.normal(size=(100,2)),axis=0),columns=['A','B'])
data.plot(secondary_y=['B'])
This feels like buggy behavior in Pandas, with not all of the keyword arguments getting passed to both Axes. But if you want to have the grid off by default in seaborn, you just need to call sns.set_style("dark"). You can also use sns.axes_style in a with statement if you only want to change the default for one figure.
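As a rough sketch of the with-statement variant mentioned above (assuming the "dark" style is acceptable for that one figure, and switching the grid off explicitly as well, in case the style alone does not cover both axes):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

data = pd.DataFrame(np.cumsum(np.random.normal(size=(100, 2)), axis=0),
                    columns=["A", "B"])

# The style is only active inside the with block, so it affects the axes
# created here and leaves the global defaults untouched.
with sns.axes_style("dark"):
    fig, ax = plt.subplots()
    ax = data.plot(secondary_y=["B"], ax=ax)

# Explicitly turn the grid off on the left and on the secondary (right) axis.
ax.grid(False)
ax.right_ax.grid(False)
```

Figures created after the block use whatever style was set before it.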
You can just set: sns.set_style("ticks") It goes back to normal.