Indexing my dataframe properly with pandas

I'm trying to plot a bar graph with error bars from my test results. I found some code on the internet that shows how to make one, but it does not produce the table layout I want.
I've tried leaving parts out, but I don't understand the dataframe well enough to know what code I need to process the data correctly.
import pandas as pd
import matplotlib.pyplot as plt

# em1..em6 and en1..en6 are the measured values from my tests
order = pd.MultiIndex.from_arrays(
    [['402515', '402515', '402515', '402510', '402510', '402510'],
     ['z', 'z', 'z', 'z', 'z', 'z']],
    names=['letter', 'word'])
datas = pd.DataFrame(
    {'first cracking strength': [em1, em2, em3, em4, em5, em6],
     'flexural strength': [en1, en2, en3, en4, en5, en6]},
    index=order)
gp4 = datas.groupby(level=['letter', 'word'])
means = gp4.mean()
errors = gp4.std()
print(means)
fig, ax = plt.subplots()
means.plot.bar(yerr=errors, ax=ax, capsize=4)
The MultiIndex code requires two labels on the dataset (the 'z' and the '402515'/'402510'), but I only want one: the '402515'/'402510' label. What code would do that?
How it looks when I run the code.
How I want it to look.
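One possible way to get a single label per bar, sketched under assumptions: build the frame with a plain one-level index and group on that level. The strength values below are made-up placeholders standing in for em1..em6 and en1..en6, and the index name `mix` and the helper `plot_means` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Placeholder measurements standing in for em1..em6 and en1..en6 (made-up numbers).
first_cracking = [3.1, 3.4, 3.2, 2.8, 2.9, 3.0]
flexural = [5.0, 5.3, 5.1, 4.6, 4.9, 4.7]

# A plain one-level index: only the '402515' / '402510' labels, no second 'z' level.
labels = pd.Index(['402515'] * 3 + ['402510'] * 3, name='mix')  # 'mix' is a hypothetical name

datas = pd.DataFrame(
    {'first cracking strength': first_cracking, 'flexural strength': flexural},
    index=labels,
)

grouped = datas.groupby(level='mix')
means = grouped.mean()   # one row per label
errors = grouped.std()   # sample standard deviation, used as the error bar height
print(means)


def plot_means(means, errors):
    """Draw grouped bars with error bars; matplotlib is imported lazily."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    means.plot.bar(yerr=errors, ax=ax, capsize=4)
    return fig, ax
```

Calling `plot_means(means, errors)` should then give one group of bars per label. If the two-level MultiIndex already exists, grouping on `level='letter'` alone should lead to the same single-label result.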

Related

Why am I unable to make a plot containing subplots in plotly using a px.scatter plot?

I have been trying to make a figure using plotly that combines multiple figures together. In order to do this, I have been trying to use the make_subplots function, but I have found it very difficult to have the plots added in such a way that they are properly formatted. I can currently make singular plots (as seen directly below):
However, whenever I try to combine these singular plots using make_subplots, I end up with this:
This figure has the subplots set up completely wrong, since I need each of the four subplots to contain data pertaining to the four methods (A, B, C, and D). In other words, I would like to have four subplots that look like my singular plot example above.
I have set up the code in the following way:
for sequence in sequences:
    #process for making sequence profile is done here
    sequence_df = pd.DataFrame(sequence_profile)
    row_number=1
    grand_figure = make_subplots(rows=4, cols=1)
    #there are four groups per sequence, so the grand figure should have four subplots in total
    for group in sequence_df["group"].unique():
        figure_df_group = sequence_df[(sequence_df["group"]==group)]
        figure_df_group.sort_values("sample", ascending=True, inplace=True)
        figure = px.line(figure_df_group, x = figure_df_group["sample"], y = figure_df_group["intensity"], color= figure_df_group["method"])
        figure.update_xaxes(title= "sample")
        figure.update_traces(mode='markers+lines')
        #note: the next line fails, since data must be extracted from the figure, hence why it is commented out
        #grand_figure.append_trace(figure, row = row_number, col=1)
        figure.update_layout(title_text="{} Profile Plot".format(sequence))
        grand_figure.append_trace(figure.data[0], row = row_number, col=1)
        row_number+=1
        figure.write_image(os.path.join(output_directory+"{}_profile_plot_subplots_in_{}.jpg".format(sequence, group)))
    grand_figure.write_image(os.path.join(output_directory+"grand_figure_{}_profile_plot_subplots.jpg".format(sequence)))
I have tried following directions (like for example, here: ValueError: Invalid element(s) received for the 'data' property) but I was unable to get my figures added as is as subplots. At first it seemed like I needed to use the graph object (go) module in plotly (https://plotly.com/python/subplots/), but I would really like to keep the formatting/design of my current singular plot. I just want the plots to be conglomerated in groups of four. However, when I try to add the subplots like I currently do, I need to use the data property of the figure, which causes the design of my scatter plot to be completely messed up. Any help for how I can ameliorate this problem would be great.

Ok, so I found a solution here. Rather than using the make_subplots function, I just instead exported all the figures onto an .html file (Plotly saving multiple plots into a single html) and then converted it into an image (HTML to IMAGE using Python). This isn't exactly the approach I would have preferred to have, but it does work.
UPDATE
I have found that plotly express offers another solution: px.line accepts a facet parameter (facet_col here) that sets up multiple subplots within a single figure. My code is set up like this, and differs from the code above in that the dataframe does not need to be iterated over in a for loop based on its groups:
sequence_df = pd.DataFrame(sequence_profile)
figure = px.line(sequence_df, x = sequence_df["sample"], y = sequence_df["intensity"], color= sequence_df["method"], facet_col= sequence_df["group"])
Although it still needs more formatting, my plot now looks like this, which works much better for my purposes:

Python: A histogram for selected columns split by a variable

I do most of my work in R and am trying to explore a bit more of Python. My fluency of the latter is pretty rubbish so explaining anything super simple won't offend me :)
I am starting some exploratory analysis and want to show the distribution of each variable by what will become the target variable. The outcome I would like is a histogram for every column in the DF, with the data split by the target. In R this is super simple; in the example below x, y and z are the columns and 'cut' is the target.
How could I reproduce this in Python?
# R
library(ggplot2)
library(tidyr)
shinyStuff <- gather(diamonds,KPI,numbers,x:z)
ggplot(data = shinyStuff)+geom_histogram(aes(x=numbers,color=cut),stat='count') + facet_wrap(~KPI)
I have tried looping over DF like this:
# Python
for num, col in enumerate(diamonds):
    print(num)
    plt.figure()
    axs[num].hist(diamonds[diamonds['cut']=='Fair'].iloc[:,num],alpha=0.6)
    axs[num].hist(diamonds[diamonds['cut']=='Good'].iloc[:,num],alpha=0.6)
This didn't work full stop.
I have tried splitting the DF and mapping
# Python
fig, ax = plt.subplots()
diamonds[diamonds['Cut']=='Fair'].hist(figsize = (16,20),color='red',ax=ax,alpha=0.6)
diamonds[diamonds['Cut']=='Good'].hist(figsize = (16,20),color='blue',ax=ax,alpha=0.6);
This just overwrites the first.
I tried a few more things which I won't post. They may well have been along the right lines, but I am not versed enough in Python to get them right, so I don't think a list of failed examples will help here.
I am using Python 3 and open to all solutions using any dependencies.
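A rough Python counterpart to the R snippet, sketched under assumptions: `DataFrame.melt` plays the role of `gather`, and a loop over the reshaped data draws one subplot per column with one overlaid histogram per target class. The small `diamonds` frame below is made-up stand-in data (its values are not from the real dataset), and `plot_histograms` is a hypothetical helper name.

```python
import pandas as pd

# Tiny made-up stand-in for the diamonds data: three numeric columns and the target 'cut'.
diamonds = pd.DataFrame({
    'cut': ['Fair', 'Good', 'Fair', 'Good', 'Fair', 'Good'],
    'x': [4.0, 4.2, 3.9, 4.5, 4.1, 4.3],
    'y': [4.1, 4.3, 4.0, 4.4, 4.2, 4.2],
    'z': [2.4, 2.6, 2.3, 2.7, 2.5, 2.6],
})

# Long format, comparable to tidyr::gather(diamonds, KPI, numbers, x:z).
long_df = diamonds.melt(id_vars='cut', value_vars=['x', 'y', 'z'],
                        var_name='KPI', value_name='numbers')


def plot_histograms(long_df, target='cut'):
    """One subplot per KPI, one overlaid histogram per target class."""
    import matplotlib.pyplot as plt  # imported lazily so the reshaping runs without it
    kpis = long_df['KPI'].unique()
    # squeeze=False keeps axs two-dimensional even when there is a single KPI.
    fig, axs = plt.subplots(1, len(kpis), figsize=(4 * len(kpis), 3), squeeze=False)
    for ax, kpi in zip(axs[0], kpis):
        subset = long_df[long_df['KPI'] == kpi]
        for level, values in subset.groupby(target)['numbers']:
            ax.hist(values, alpha=0.6, label=str(level))
        ax.set_title(kpi)
        ax.legend(title=target)
    return fig, axs
```

Calling `plot_histograms(long_df)` should give a row of panels similar in spirit to `facet_wrap(~KPI)`. Each class is drawn on the same axes object, so the second histogram overlays the first instead of replacing it.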

How do I make this subplotting work in Python?

First of all, I'm learning to use Python and sometimes it's a little tricky for me.
I'm using a Game of Thrones dataset from Kaggle to learn visualizations. Now I'm trying to see how many characters of each house died in each book.
So I wrote this code:
houses_deathbybook = data_deathsB.groupby(['Book_of_Death', 'Allegiances']).count()[['Name']]
This gives a count of deaths by house and book.
I then used the subplot command to produce this graph.
I'm now trying to make that graph more useful using this code:
fig, axes = plt.subplots(nrows=1, ncols=1, gridspec_kw={'wspace': 0.1, 'hspace': 0.9})
data_deathsB.loc[data_deathsB['Allegiances']=='House Arryn'.groupby(['Book_of_Death']).agg('count').plot(x='Book of Death', y='Muertes',kind='bar',figsize=(20,15),color='limegreen',grid=True,ax=axes[1,0], title='House Arryn',fontsize=13)
The second part of the code will be repeated for each house.
But it does not seem to work. As a test, I set the grid to just 1 row and 1 column to check a single house, and it gives me the following error: "unexpected EOF while parsing".
Could you help me?

The problem in your second approach is that you have defined a figure with two subplots, arranged in two rows and a single column. When the grid has only a single row or a single column, you can't use two indices such as [0,0] to access the subplots. In that case you have to use a single index, like the following:
ax=axes[0],title='House Arryn')
and
ax=axes[1],title='House Arryn')
The two-index style ([0,0], [0,1], etc.) works only when you have more than one row and more than one column.
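The indexing rule above can be checked directly by looking at the shape of the array that `plt.subplots` returns. This is a minimal sketch; the `Agg` backend and the variable names are choices made for this example only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Two rows, one column: axes is a 1-D array, indexed with a single number.
fig1, axes_1d = plt.subplots(nrows=2, ncols=1)
print(axes_1d.shape)  # (2,)

# Two rows, two columns: axes is a 2-D array, indexed as [row, col].
fig2, axes_2d = plt.subplots(nrows=2, ncols=2)
print(axes_2d.shape)  # (2, 2)

# squeeze=False keeps the array 2-D whatever the grid size is.
fig3, axes_always_2d = plt.subplots(nrows=2, ncols=1, squeeze=False)
print(axes_always_2d.shape)  # (2, 1)

# Close the figures so they do not accumulate in memory.
for fig in (fig1, fig2, fig3):
    plt.close(fig)
```

With `nrows=1, ncols=1` and the default `squeeze=True`, `plt.subplots` returns a single Axes object and not an array, so it cannot be indexed at all. Passing `squeeze=False` is one way to keep the `[row, col]` style working for every grid size.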

It worked!
This is the result of the following code (just one of the graphs):
fig, axes = plt.subplots(nrows=2,gridspec_kw={'hspace': 1})
data_deathsB.loc[data_deathsB['Allegiances']=='House Arryn',['Allegiances', 'Name', 'Book_of_Death']].groupby(['Book_of_Death'],as_index=False).agg('count').plot(x='Book_of_Death', kind='bar',figsize=(20,15),color='limegreen',grid=True,ax=axes[0],title='House Arryn')
data_deathsB.loc[data_deathsB['Allegiances']=='House Baratheon',['Allegiances', 'Name', 'Book_of_Death']].groupby(['Book_of_Death'],as_index=False).agg('count').plot(x='Book_of_Death', kind='bar',figsize=(20,15),color='limegreen',grid=True,ax=axes[1],title='House Baratheon')
The next step is to make the graphs a little prettier.
Thanks to everyone!

How do I "reset the index" for a matplotlib plot?

I have the following code:
fig, ax = plt.subplots(1, 1)
calls["2016-12-24"].resample("1h").sum().plot(ax=ax)
calls["2016-12-25"].resample("1h").sum().plot(ax=ax)
calls["2016-12-26"].resample("1h").sum().plot(ax=ax)
which generates the following image:
How can I make this so the lines share the x-axis? In other words, how do I make them not switch days?

If you don't care about using the correct datetime as index, you could just reset the index as you suggested for all the series. This is going to overlap all the time series, if this is what you're trying to achieve.
# the below should
calls["2016-12-24"].resample("1h").sum().reset_index(drop=True).plot(ax=ax)
calls["2016-12-25"].resample("1h").sum().reset_index(drop=True).plot(ax=ax)
calls["2016-12-26"].resample("1h").sum().reset_index(drop=True).plot(ax=ax)
Otherwise, you could try resampling the three columns at the same time. Have a go with the code below, but without knowing what your original dataframe looks like, I'm not sure it will fit your case. You should post some more information about the input dataframe.
# have a try with the below
calls[["2016-12-24","2016-12-25","2016-12-26"]].resample('1h').sum().plot()
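Another way to put the days on a shared x-axis, shown here as a sketch and not as the approach from the answer above: keep the hour of day as the index and turn each calendar date into its own column. The `calls` series below is made-up data with an hourly DatetimeIndex, and the name `by_hour` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the call data: one value per hour over three days.
idx = pd.date_range("2016-12-24", periods=72, freq="h")
calls = pd.Series(np.arange(72), index=idx, name="calls")

hourly = calls.resample("1h").sum()

# Rows: hour of day (0..23); columns: calendar date.
# Each day becomes one column, so every line shares the same x-axis.
by_hour = pd.DataFrame({
    "date": hourly.index.date,
    "hour": hourly.index.hour,
    "calls": hourly.values,
}).pivot(index="hour", columns="date", values="calls")

print(by_hour.shape)
```

Calling `by_hour.plot()` should then draw one line per day against the hour of day, with a legend entry for each date.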

Why do sns.lmplot and FacetGrid+plt.scatter create different scatter points from the same data?

I'm quite new to Python, pandas DataFrames and Seaborn. When I was trying to understand Seaborn better, particularly sns.lmplot, I came across a difference between two figures made of the same data, that I thought were supposed to look alike, and I wonder why that is.
Data: My data is a pandas DataFrame that has 454 rows and 19 columns. The data relevant to this question includes 4 columns and looks something like this:
Columns: Av_density; pred2; LOC; Year;
Variable type: Continuous variable; Continuous variable; Categorical variable 1...4;Categorical 2012...2014
There are no missing data points.
My aim is to draw a 2x2 figure panel describing the relationship between Av_density and pred2 separately for each LOC(=location) with years marked with different colours. I call seaborn with:
import seaborn as sns
sns.set(style="whitegrid")
np.random.seed(sum(map(ord, "linear_categorical")))
(Side point: for some reason calling "linear_quantitative" does not work, i.e. I get a "File "stdin", line 2
sns.lmplot("Av_density", "pred2", Data, col="LOC", hue="YEAR", col_wrap=2);
^
SyntaxError: invalid syntax")
Figure method 1, FacetGrid + scatter:
sur=sns.FacetGrid(Data,col="LOC", col_wrap=2,hue="YEAR")
sur.map(plt.scatter, "Av_density", "pred2" );
plt.legend()
This produces a nice, accurate scatter of the data. You can see the picture here: https://drive.google.com/file/d/0B7h2wsx9mUBScEdUbGRlRk5PV1E/view?usp=sharing
Figure method 2, sns.lmplot:
sns.lmplot("Av_density", "pred2", Data, col="LOC", hue="YEAR", col_wrap=2);
This produces the figure panel divided by LOC accurately, with Years in different colours, but the scatter of the data points does not look right. Instead, it looks like lmplot has linearised the data points, and lost the original scatter points that it is supposed to be drawing in addition to the regression lines.
You can see the figure here: https://drive.google.com/file/d/0B7h2wsx9mUBSRkN5ZXhBeW9ob1E/view?usp=sharing
My data produces only three points per location per year, and I was first wondering if this is what makes the "mistake" in lmplot datapoint. Optimally I would have a shorter line describing the trend between years instead of a proper regression, but I have not figured out the code to this yet.
But before tackling that issue, I would really like to know if there is something I am doing wrong that I can fix, or if this is an issue of lmplot trying to handle my data?
Any help, comments and ideas on this are warmly welcome!
-TA-
Ps. I'm running Python 2.7.8 with Spyder 2.3.4
EDIT: I get shorter "trend lines" with the first method by adding:
sur.map(plt.plot,"Av_density", "pred2" );
Still would like to know what is messing the figure with lmplot.

The issue is probably only that the added regression line is messing up the y-axis, so that the variability in the data cannot be seen.
Try resetting the y-axis based on the variability in your original plot to see if they show the same thing, in your case e.g.
fig1 = sns.lmplot("Av_density", "pred2", Data, col="LOC", hue="YEAR", col_wrap=2);
fig1.set(ylim=(-0.03, 0.05))
plt.show(fig1)
