Plotting proportional data in Python (stacked bar plot)

I have a data set where clients answer a question, and each client belongs to a certain category. The category is ordinal. I want to visualize the change in percentages as a proportional stacked bar plot. Here is some test data:
answer  categ
1       1
2       1
3       2
1       2
2       3
3       3
1       1
2       1
3       2
1       2
2       3
3       3
1       3
2       2
3       1
Here is how you can generate it:
pd.DataFrame({'answer':[1,2,3]*5, 'categ':[1,1,2,2,3,3]*2+[3,2,1]})
Using some convoluted code, which can probably be written much more nicely and efficiently, I got to percentages within each answer:
test = pd.DataFrame({'answer': [1,2,3]*5, 'categ': [1,1,2,2,3,3]*2 + [3,2,1]})
rel_data = pd.merge(
    test.groupby(['answer', 'categ']).size().reset_index(),
    test.groupby('answer').size().reset_index(),
    how='left', on='answer')
rel_data.columns = ['answer', 'categ', 'number_combination', 'number_answer']
rel_data['perc'] = rel_data['number_combination'] / rel_data['number_answer']
rel_data[['answer', 'categ', 'perc']]
This results in:
answer  categ  perc
1       1      0.4
1       2      0.4
1       3      0.2
2       1      0.4
2       2      0.2
2       3      0.4
3       1      0.2
3       2      0.4
3       3      0.4
How do I get this into a stacked bar plot with one bar per answer and a colored area per category?

Once I had the last dataframe, I could get it fairly easily by doing this:
ax = rel_data.groupby(['answer','categ']).\
    perc.sum().unstack().plot(kind='bar', stacked=True, ylim=(0,1))
It's again dirty but at least it got the job done. The perc.sum() collapses each group to a single value (even though each group already was a single value), unstack() turns it into a DataFrame with the categories as columns and the answers as rows, and plot() turns this into a proportional stacked bar plot. The ylim is there because of a tiny floating-point rounding error where a bar could add up to 1.00001, which added a whole new tick.
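For what it's worth, a less dirty route (a sketch, not from the original answer) is pd.crosstab with normalize='index', which produces both the proportions and the wide layout in one step, starting from the raw test frame built in the question:

import pandas as pd
import matplotlib.pyplot as plt

# Sketch: crosstab counts answer/categ combinations and, with
# normalize='index', divides each row by its total, so categories
# become columns ready for a stacked bar plot.
pd.crosstab(test['answer'], test['categ'], normalize='index') \
    .plot(kind='bar', stacked=True, ylim=(0, 1))
plt.show()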

This is by no means perfect but it's a start:
import numpy as np
import matplotlib.pyplot as plt

colors = ["r", "g", "b", "y", "m"]  # etc.; note "o" is not a valid matplotlib color code
# assumes df (the percentage frame) lists each category's rows in the same answer order
for i in sorted(set(df.categ)):
    if i == 1:
        # zero baseline for the bottom category
        x = np.zeros(len(set(df.answer)))
    else:
        # raise the baseline by the previous category's bar heights
        x += df[df.categ == i - 1].perc.to_numpy()  # .as_matrix() was removed from modern pandas
    plt.bar(df[df.categ == i].answer, df[df.categ == i].perc, bottom=x, color=colors[i - 1])
plt.xticks(sorted(set(df.answer)))
plt.show()
The approach is to group the data by category and then iterate over each category to get the answers, which become the individual bars. The i == 1 check detects the first iteration and creates a zero array used as the base when stacking; we draw the first bars on it, and on each later iteration we add the previous category's bar heights into the variable x so every layer sits on top of the one before.
The colors list is just there so you can differentiate the bars a bit better.
Hope this helps.

You can make a bar plot with the matplotlib library. Follow this tutorial: http://matplotlib.org/examples/api/barchart_demo.html
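For a stacked variant of that demo, a minimal hand-rolled sketch (hard-coding the proportions from the question's table) could look like this:

import matplotlib.pyplot as plt

# Sketch: proportions copied from the question's percentage table;
# each layer's bottom is the running sum of the layers drawn below it.
answers = [1, 2, 3]
cat1 = [0.4, 0.4, 0.2]
cat2 = [0.4, 0.2, 0.4]
cat3 = [0.2, 0.4, 0.4]
plt.bar(answers, cat1, label='categ 1')
plt.bar(answers, cat2, bottom=cat1, label='categ 2')
plt.bar(answers, cat3, bottom=[a + b for a, b in zip(cat1, cat2)], label='categ 3')
plt.xticks(answers)
plt.legend()
plt.show()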

Related

Filter dataframe based on matching values from two columns

I have a dataframe as shown below
cdf = pd.DataFrame({'Id': [1,2,3,4,5],
                    'Label': [1,2,3,0,0]})
I would like to filter the dataframe based on the below criteria
cdf['Id']==cdf['Label'] # first 3 rows are matching for both columns in cdf
I tried the below
flag = np.where[cdf['Id'].eq(cdf['Label'])==True,1,0]
final_df = cdf[cdf['flag']==1]
but I got the below error
TypeError: 'function' object is not subscriptable
I expect my output to be as shown below
Id Label
0 1 1
1 2 2
2 3 3
I think you're overthinking this. Just compare the columns:
>>> cdf[cdf['Id'] == cdf['Label']]
Id Label
0 1 1
1 2 2
2 3 3
Your particular error though is coming from the fact that you're using square brackets to call np.where, e.g. np.where[...], which is wrong. You should be using np.where(...) instead, but the above solution is bound to be as fast as it gets ;)
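If you do want to keep the flag-based version, here is a sketch of the original attempt with both issues fixed (parentheses instead of brackets, and the flag actually assigned as a column before filtering):

import numpy as np

# np.where is a function, so it is called with parentheses; the flag
# must be stored as a column before it can be used to filter rows.
cdf['flag'] = np.where(cdf['Id'].eq(cdf['Label']), 1, 0)
final_df = cdf[cdf['flag'] == 1]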
You can also check query:
cdf.query('Id == Label')
Out[248]:
Id Label
0 1 1
1 2 2
2 3 3

How to perform groupby and remove duplicate based on first occurrence of a column condition?

This problem is a bit hard for me to wrap my head around so I hope I can explain it properly below.
I have a data frame with a lot of rows but only 3 columns like below:
data = {'line_group': [1, 1, 8, 8, 4, 4, 5, 5],
        'route_order': [1, 2, 1, 2, 1, 2, 1, 2],
        'StartEnd': ['20888->20850', '20888->20850', '20888->20850', '20888->20850',
                     '20961->20960', '20961->20960', '20961->20960', '20961->20960']}
df = pd.DataFrame(data)
In the end, I want to use this data to plot routes between points, for instance 20888 to 20850. But the problem is that a lot of trips/line_groups also go through these two points, so when I plot things it will be overlapping and very slow, which is not what I want.
So I only want the first line_group that has each unique StartEnd, like in the data frame below.
I believe it could have something to do with groupby, like in the following code I have tried, but it doesn't produce the result I want. Also, in the full dataset route orders aren't usually just from one point to another and can be much longer (e.g. 1, 2, 3, 4, ...).
df.drop_duplicates(subset='StartEnd', keep='first')
Group by StartEnd and keep only the first line_group value:
unique_groups = df.groupby('StartEnd')['line_group'].agg(lambda x: list(x)[0]).reset_index()
       StartEnd  line_group
0  20888->20850           1
1  20961->20960           4
Then filter to the rows whose line_group is among those unique line groups:
unique_line_groups = unique_groups['line_group']
filtered_df = df[df['line_group'].isin(unique_line_groups)]
Final Output
line_group route_order StartEnd
1 1 20888->20850
1 2 20888->20850
4 1 20961->20960
4 2 20961->20960
You can add route_order to the subset argument to get the output you want.
In [8]: df.drop_duplicates(subset=['StartEnd', 'route_order'], keep='first')
Out[8]:
line_group route_order StartEnd
0 1 1 20888->20850
1 1 2 20888->20850
4 4 1 20961->20960
5 4 2 20961->20960
You can use groupby.first():
df.groupby(["route_order", "StartEnd"], as_index=False).first()
output:
route_order StartEnd line_group
0 1 20888->20850 1
1 1 20961->20960 4
2 2 20888->20850 1
3 2 20961->20960 4

How can I speed up data labelling for a large pandas dataframe?

I have a large pandas data frame which looks roughly like this:
Identity periods one two three Label
0 one 1 -0.462407 0.022811 -0.277357
1 one 1 -0.617588 1.667191 -0.370436
2 one 2 -0.604699 0.635473 -0.556088
3 one 2 -0.852943 1.087415 -0.784377
4 two 3 0.421453 2.390097 0.176333
5 two 3 -0.447321 -1.215280 -0.187156
6 two 4 0.398953 -0.334095 -1.194132
7 two 4 -0.324348 -0.842357 0.970825
I need to be able to categorise the data according to groupings in the various columns. For example, one of my categorisation criteria is to label each of the groups in the identity column with a label if there are between x and y periods in the periods column.
The code I have to categorise this looks like this, generating a final column:
for i in df['Identity'].unique():
    if 2 <= df[df['Identity'] == i]['periods'].max() <= 5:
        df.loc[df['Identity'] == i, 'label'] = 'label 1'
I have also tried a version using df.groupby('Identity').apply(), but it is no quicker.
My data is approximately 2.8m rows at the moment, and there are about 900 unique identities. The code takes about 5 minutes to run, which suggests to me that it's the code within the loop that is slow, rather than the looping itself.
Let's try to improve performance by using vectorized Pandas operations throughout, instead of loops or the .apply() function, which also just runs relatively slow Python loops internally.
Use .groupby() and .transform() to broadcast the max() of periods within each group, giving a series aligned with the original rows for building a mask. Then use .loc[] with the mask of the condition 2 <= max <= 5 and set the label for the rows fulfilling the mask.
This assumes the same label applies to all rows of an Identity group whenever the group's max period is within 2 <= max <= 5.
m = df.groupby('Identity')['periods'].transform('max')
df.loc[(m >=2) & (m <=5), 'Label'] = 'label 1'
print(df)
Identity periods one two three Label
0 one 1 -0.462407 0.022811 -0.277357 label 1
1 one 1 -0.617588 1.667191 -0.370436 label 1
2 one 2 -0.604699 0.635473 -0.556088 label 1
3 one 2 -0.852943 1.087415 -0.784377 label 1
4 two 3 0.421453 2.390097 0.176333 label 1
5 two 3 -0.447321 -1.215280 -0.187156 label 1
6 two 4 0.398953 -0.334095 -1.194132 label 1
7 two 4 -0.324348 -0.842357 0.970825 label 1
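If several period ranges should map to several labels, the same transform-based mask idea extends with np.select. A sketch, where the m > 5 range and the 'label 2'/'unlabelled' names are made up for illustration:

import numpy as np

# Sketch: evaluate several vectorized conditions in one pass; the
# second range and the label names here are hypothetical.
m = df.groupby('Identity')['periods'].transform('max')
df['Label'] = np.select(
    [(m >= 2) & (m <= 5), m > 5],
    ['label 1', 'label 2'],
    default='unlabelled')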

Hue two pandas series

I have two pandas series which I want to compare visually by plotting them on top of each other. I already tried the following
>>> s1 = pd.Series([1,2,3,4,5])
>>> s2 = pd.Series([3,3,3,3,3])
>>> df = pd.concat([s1, s2], axis=1)
>>> sns.stripplot(data = df)
which yields the following picture:
Now, I am aware of the hue keyword of sns.stripplot, but trying to apply it requires me to use the keywords x and y. I already tried to transform my data into a different dataframe like this
>>> df = pd.concat([pd.DataFrame({'data':s1, 'type':'s1'}), pd.DataFrame({'data':s2, 'type':'s2'})])
so I can "hue over" type; but even then I have no idea what to put for the keyword x (assuming y='data'). Ignoring the keyword x like this
>>> sns.stripplot(y='data', data=df, hue='type')
fails to hue anything:
seaborn generally works best with long-form data, so you might need to rearrange your dataframe slightly. The hue keyword is expecting a column, so we'll use .melt() to get one.
long_form = df.melt()
long_form['X'] = 1
sns.stripplot(data=long_form, x='X', y='value', hue='variable')
Will give you a plot that roughly reflects your requirements:
When we do pd.melt, we change the frame from having multiple columns of values to having a single column of values, with a "variable" column to identify which of our original columns they came from. We add in an 'X' column because stripplot needs both x and hue to work properly in this case. Our long_form dataframe, then, looks like this:
variable value X
0 0 1 1
1 0 2 1
2 0 3 1
3 0 4 1
4 0 5 1
5 1 3 1
6 1 3 1
7 1 3 1
8 1 3 1
9 1 3 1
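As an aside, the question's own long-form construction works the same way once it gets a constant dummy x column. A sketch (long_df is a name introduced here for illustration):

# Sketch: rebuild the question's long-form frame; a constant dummy x
# gives a single tick, and hue='type' overlays the two series on it.
long_df = pd.concat([pd.DataFrame({'data': s1, 'type': 's1'}),
                     pd.DataFrame({'data': s2, 'type': 's2'})])
long_df['x'] = 0
sns.stripplot(data=long_df, x='x', y='data', hue='type')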

How to plot a histogram of maximum values of a dataframe

I have a dataframe with 3 columns, df = ["a", "b", "value"]. (Actually this is a snippet; the solution should be able to handle n variables, like "a", "b", "c", "d", ...) In this case, the "value" column has been generated depending on the "a" and "b" values, doing something like:
for a in range(1, 10):
    for b in range(1, 10):
        generate_value(a, b)
The resulting data is similar to:
a b value
0 1 1 0.23
1 1 2 6.34
2 1 3 0.25
3 1 4 2.17
4 1 5 5.97
[...]
I want to know the statistically better combinations of "a" and "b" that give me the bigger "value". So I want to draw some kind of histogram that shows me which values of "a" and "b" statistically generate a bigger "value". I tried something like:
fig = plot.figure()
ax=fig.add_subplot(111)
ax.hist(df["a"],bins=50, normed=True)
or:
plot.plot(df["a"].values, df["value"].values, "o")
But the results are not good. I think I should use some kind of histogram or Gaussian bell curve, but I'm not sure how to plot it.
So, how to plot the statistically better "a" and "b" to get maximum "value"?
Note: answer 1 is perfect for two variables a and b, but the problem is that the correct answer would need to work for multiple variables: a, b, c, d, ...
Edit 1: Please note that although I'm asking about two variables, the solution can't be to bind "a" to the x axis and "b" to the y axis, as there may be more variables. So if we have "a", "b", "c", "d", "e", the solution should still be valid.
Edit 2: Trying to explain it better, let's take the following dataframe:
a b c d value
0 1 6 9 7 0.23
1 5 2 3 5 11.34
2 6 7 8 4 0.25
3 1 4 9 3 2.17
4 1 5 9 1 4.97
5 6 6 4 7 25.9
6 3 5 5 2 10.37
7 1 5 1 2 7.87
8 2 5 3 3 8.12
9 1 5 2 1 2.97
10 7 5 4 9 5.97
11 3 5 2 3 9.92
[...]
Row 5 is clearly the winner, with a value of 25.9, so the supposedly better values of a,b,c,d are 6, 6, 4, 7. But we can see that statistically it is a strange result: it is the only one so high with those values of a,b,c,d, so it is very unlikely that we're going to get a high value in the future by choosing those values. Instead, it seems much safer to choose numbers that have generated "value" between 8 and 11. Although a gain of 8 to 11 is less than 25.9, the probability that those values of a,b,c,d (5,2,3,3) generate this higher "value" is bigger.
Edit 3: Although a,b,c,d are discrete, their combination/order will generate different results. I mean, there is a function that will return a value inside a small range, like value = func(a,b,c,d). That value will depend not only on the values of a,b,c,d but also on some random things. So, for instance, func(5,2,3,5) could return a value of 11.34, but it could also return a similar value, like 10.8 or 9.5 (a value in the range between 8 and 11). Also, func(1,6,9,7) will return 0.23, or it could return 2.7, but it probably won't return 10.1, as that is very far from its range.
Following the example, I'm trying to get the numbers that will most probably generate something in the range of 8-11 (well, the maximum). Probably the numbers I want to visualize will be some kind of combination of the numbers 3, 5 and 2. But probably there won't be any 6, 7, 4 numbers, as they usually generate smaller "value" results.
I don't think there are any statistics involved here. You can plot the value as a function of a and b.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
A, B = np.meshgrid(np.arange(10), np.arange(10))
df = pd.DataFrame({"a": A.flatten(), "b": B.flatten(),
                   "value": np.random.rand(100)})
ax = df.plot.scatter(x="a",y="b", c=df["value"])
plt.colorbar(ax.collections[0])
plt.show()
The darker the dots, the higher the value.
This problem seems too complicated to solve with a single built-in function.
I think it should be solved in this way:
exclude outliers from the data
select the n largest values
summarize the results with a bar plot or any other plot
Clean data from outliers
We might choose any appropriate method for outlier detection, e.g. 3*sigma, 1.5*IQR etc. I used 1.5*IQR in the example below.
from scipy import stats  # needed for stats.iqr (interquartile range)

cleaned_data = data[data['value'] < 1.5 * stats.iqr(data['value'])]
Select n largest values
Pandas provides the method nlargest, so you can use it to select the n largest values:
largest_values = cleaned_data.nlargest(5, 'value')
or you can use an interval of values:
largest_values = cleaned_data[cleaned_data['value'] > cleaned_data['value'].max() - 3]
Summarize results
Here we should count occurrences of values in each column and then plot this data.
melted = pd.melt(largest_values['here you should select columns with explanatory variables'])
table = pd.crosstab(melted['variable'], melted['value'])
table.plot.bar()
[example of resulting plot]
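For convenience, here is a runnable sketch stringing the three steps together, assuming the example frame from Edit 2 with explanatory columns a, b, c, d (the frame name data and the cutoff n=5 are illustrative):

import pandas as pd
from scipy import stats

# Sketch: drop outliers, keep the n largest remaining values, then
# count how often each explanatory value appears among them.
cleaned_data = data[data['value'] < 1.5 * stats.iqr(data['value'])]
largest_values = cleaned_data.nlargest(5, 'value')
melted = pd.melt(largest_values[['a', 'b', 'c', 'd']])
table = pd.crosstab(melted['variable'], melted['value'])
table.plot.bar()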
