How to plot my data using proportions and violin plots?

Let's say I have people chew a type of gum while reading a question, and then answer a test question. Sometimes they would chew orange gum while reading and answering a question. Sometimes they would chew peppermint. Not everyone chewed and answered all of the questions.
Let's say I have my data laid out like this:
ID  Gum Type    Test (1= correct, 2=incorrect)
1   Orange      1
1   Orange      0
1   Peppermint  0
1   Peppermint  1
2   Orange      0
2   Peppermint  1
I want to create a violin plot with Gum Type on the x-axis and the proportion correct on the test on the y-axis, where each participant contributes exactly one data point per gum type. For example, participant 1 would show up on the "Orange" violin as a single data point in the middle (got 50% of the orange questions correct), and as a single data point on the "Peppermint" violin.

Use:
import pandas as pd
import seaborn as sns
col = 'Test (1= correct, 2=incorrect)'
rows = '''1 Orange 1
1 Orange 0
1 Peppermint 0
1 Peppermint 1
2 Orange 0
2 Peppermint 1'''
# The header contains spaces, so the column names are given explicitly instead of being split
df = pd.DataFrame([r.split() for r in rows.split('\n')], columns=['ID', 'Gum Type', col])
df[col] = df[col].astype(int)
# Mean of the 0/1 scores per participant and gum type = proportion correct
df1 = df.groupby(['ID', 'Gum Type'])[col].mean().reset_index()
ax = sns.violinplot(x='Gum Type', y=col, data=df1)
Output (figure): violin plot of the proportion correct for each gum type.
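The aggregation step can also be checked on its own, without plotting. This is a minimal sketch using the question's six rows; the short column names gum, correct and prop_correct are illustrative assumptions, not names from the original data.

```python
import pandas as pd

# Hypothetical short column names, used here only for readability
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2],
    "gum": ["Orange", "Orange", "Peppermint", "Peppermint", "Orange", "Peppermint"],
    "correct": [1, 0, 0, 1, 0, 1],  # 1 = correct, 0 = incorrect
})

# The mean of a 0/1 column is the proportion of correct answers,
# so each (ID, gum) pair collapses to a single row
prop = (df.groupby(["ID", "gum"])["correct"]
          .mean()
          .reset_index(name="prop_correct"))
print(prop)
```

Each participant now has one row per gum type, which is what the violin plot needs. Passing prop to sns.violinplot(x="gum", y="prop_correct", data=prop) should draw the same figure, and an sns.stripplot call with the same arguments could be layered on top if the individual participants should be visible as points.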

Related

How can I count instances of a string in a dataframe column of lists that matches the string of a column in a different dataframe?

I have a dataframe containing a column of produce and a column of a list of colors the produce comes in:
import pandas as pd
data = {'produce':['zucchini','apple','citrus','banana','pear'],
'colors':['green, yellow','green, red, yellow','orange, yellow ,green','yellow','green, yellow, brown']}
df = pd.DataFrame(data)
print(df)
Dataframe looks like:
produce colors
0 zucchini green, yellow
1 apple green, red, yellow
2 citrus orange, yellow, green
3 banana yellow
4 pear green, yellow, brown
I am trying to create a second dataframe with one row per color, and count the number of rows in the first dataframe that contain that color. I am able to get the unique list of colors into a dataframe:
#Create Dataframe with a column of unique values
unique_colors = df['colors'].str.split(",").explode().unique()
df2 = pd.DataFrame()
df2['Color'] = unique_colors
print(df2)
But some of the colors repeat some of the time:
Color
0 green
1 yellow
2 red
3 orange
4 green
5 yellow
6 brown
and I am unable to find a way to add a column that counts the instances in the other dataframe. I have tried:
#df['Count'] = data['colors'] == df2['Color']
df['Count'] = ()
for i in df2['Color']:
    count=0
    if df["colors"].str.contains(i):
        count+1
    df['Count']=count
but I get the error "ValueError: Length of values (0) does not match length of index (5)"
How can I
1. make sure values aren't repeated in the list, and
2. count the instances of the color in the other dataframe?
(This is a simplification of a much larger dataframe, so I can't just edit values in the first dataframe to fix the unique color issue).

You need to account for the spaces around the comma when splitting. To count the occurrences of each color, you can use Series.value_counts().
out = (df['colors'].str.split(' *, *')
       .explode().value_counts()
       .to_frame('Count')
       .rename_axis('Color')
       .reset_index())
print(out)
Color Count
0 yellow 5
1 green 4
2 red 1
3 brown 1
4 orange 1
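The duplicates in the question come from whitespace left over after splitting on a bare comma. A minimal sketch of that effect, using the question's sample data (the names raw, clean and counts are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "produce": ["zucchini", "apple", "citrus", "banana", "pear"],
    "colors": ["green, yellow", "green, red, yellow", "orange, yellow ,green",
               "yellow", "green, yellow, brown"],
})

# Splitting on "," alone keeps the surrounding spaces, so " yellow",
# " yellow " and "yellow" are treated as three different strings
raw = df["colors"].str.split(",").explode()
print(raw.unique())

# Stripping each piece collapses them to the five real colors
clean = raw.str.strip()
counts = clean.value_counts()
print(counts)
```

Stripping after the split is an alternative to the regular-expression separator used above; both should give the same counts on this data.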

Proposed script
import operator
y_c = (df['colors'].agg(lambda x: [e.strip() for e in x.split(',')])
.explode()
)
clrs = pd.DataFrame.from_dict({c: [operator.countOf(y_c, c)] for c in y_c.unique()})
Two presentations for the result
1 - Horizontal :
print(clrs.rename(index={0:'count'}))
# green yellow red orange brown
# count 4 5 1 1 1
2- Vertical :
print(clrs.T.rename(columns={0:'count'}))
# count
# green 4
# yellow 5
# red 1
# orange 1
# brown 1

Incorrect labels for bars in bar plot

I'm taking a biostatistics class and we've been asked to manipulate some data from a CSV into various different types of plots. I'm having issues getting each bar on a bar plot to show the correct categorical variable. I'm following an example the professor provided and not getting what I want. I'm totally new to this, so my apologies for formatting errors.
I've created the dataframe variable and am now trying to plot it as a bar graph (and later on other variables in the CSV as other types of plots). Not sure if I'm providing the code in the correct manner, but here's what I have so far. We're supposed to create a bar plot of PET using the number of cases (number of each pet/type of pet).
This is the data for this particular question. In the CSV it's shown as just the type of pet each student has (not sure how to share the CSV, but if it'd help I can post it).
I'm editing the post to show the code I've run to get the plot, and include the CSV info (hope I'm doing this right):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
HW2 = pd.read_csv("/Path/to/file")
HW2Grouped = HW2.groupby('Pet').count()
HW2Grouped['Pet'] = HW2Grouped.index
HW2Grouped.columns = ['Pet', 'Count', 'col_1', 'col_2', 'col_3', 'col_4']
%matplotlib inline
HW2bar = HW2Grouped.plot.bar(x = 'Pet', y = 'Count', title = "Pet count for students")
HW2bar.set_xlabel('Pet Type')
t = HW2bar.set_ylabel('Count')
This is the data I have to work with (sorry it's just a screenshot).
This is the bar plot I got from the code I ran.

It seems to me that when you added a new column, Pet, it became the new last column. Then you renamed columns of the HW2Grouped, and the first column (where the results of count aggregation are) was renamed to Pet, and the actual Pet column became col_4.
Let me now trace back through the steps you tried, to make it clear what was going on.
When you grouped your DataFrame with this code:
HW2Grouped = HW2.groupby('Pet').count()
You received this:
       Height  Ice Cream  n of letters  Favorite TA  Minutes to Hometown
Pet
Cat         1          1             1            1                    1
Dog        17         17            17           17                   17
Horse       2          2             2            2                    2
None        4          4             4            4                    4
After you added a new column Pet to HW2Grouped (which you might have thought was creating a variable), it looked like this:
       Height  Ice Cream  n of letters  Favorite TA  Minutes to Hometown    Pet
Pet
Cat         1          1             1            1                    1    Cat
Dog        17         17            17           17                   17    Dog
Horse       2          2             2            2                    2  Horse
None        4          4             4            4                    4   None
Then, when you changed the .columns attribute, your grouped DataFrame became like this:
       Pet  Count  col_1  col_2  col_3  col_4
Pet
Cat      1      1      1      1      1    Cat
Dog     17     17     17     17     17    Dog
Horse    2      2      2      2      2  Horse
None     4      4      4      4      4   None
Then, when plotting HW2Grouped, you passed Pet as x, but after the renaming the column called Pet no longer held the pet names: it was the former Height column, which holds counts. This led to the wrong bar labels.
You may try:
%matplotlib inline
HW2bar = HW2Grouped.plot.bar(x = 'col_4', y = 'Count', title = "Pet count for students")
HW2bar.set_xlabel('Pet Type')
t = HW2bar.set_ylabel('Count')
I think what you originally intended to do was this (except you didn't indicate the column to perform the count on):
HW2Grouped = HW2.groupby('Pet')['Pet'].count()
However, this won't sort the bars in a descending order.
There is a shorter way that needs no added or renamed columns, and it also sorts the bars by count:
HW2['Pet'].value_counts().plot.bar()
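The renaming problem can be reproduced on a small frame. This is a sketch under assumptions: the hw2 data below is invented as a stand-in for the CSV, with fewer columns than the real file.

```python
import pandas as pd

# Hypothetical stand-in for the CSV; three columns are enough to show the effect
hw2 = pd.DataFrame({
    "Height": [60, 62, 65, 70, 68],
    "Ice Cream": ["Vanilla", "Mint", "Vanilla", "Fudge", "Mint"],
    "Pet": ["Dog", "Dog", "Cat", "Dog", "Horse"],
})

grouped = hw2.groupby("Pet").count()   # columns: Height, Ice Cream
grouped["Pet"] = grouped.index         # the new column is appended last
print(list(grouped.columns))

# Assigning to .columns renames by position, so the name "Pet"
# lands on the old Height column, which holds counts
grouped.columns = ["Pet", "Count", "col_1"]
print(grouped["Pet"].tolist())         # counts, not pet names
print(grouped["col_1"].tolist())       # the real pet names

# Shorter route: count each pet directly, most frequent first
pet_counts = hw2["Pet"].value_counts()
print(pet_counts)
```

Because value_counts keeps the pet names in the index, pet_counts.plot.bar() should label each bar with the right pet without any renaming.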

problem with re index dataframe (dealing with categorical data)

I have data that looks like this:
subject_id  hour_measure  urine color  heart_rate
3           1             red          40
3           1.15          red          60
4           2             yellow       50
I want to reindex the data so that every patient has 24 hours of measurements.
I use the following code:
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1,24)],
names=['subject_id','hour_measure'])
df = df.groupby(['subject_id','hour_measure']).mean().reindex(mux).reset_index()
df.to_csv('totalafterreindex.csv')
It works well for numeric values, but the categorical columns are dropped.
How can I change this code to use the mean for numeric columns and the most frequent value for categorical ones?
The wanted output:
subject_id  hour_measure  urine color  heart_rate
3           1             red          40
3           2             red          60
3           3             yellow       50
3           4             yellow       50
..          ..            ..

The idea is to use GroupBy.agg with the mean for numeric columns and the mode for categorical ones. next with iter is added so that None is returned if mode returns an empty result:
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1,24)],
names=['subject_id','hour_measure'])
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else next(iter(x.mode()), None)
df1 = df.groupby(['subject_id','hour_measure']).agg(f).reindex(mux).reset_index()
Detail:
print (df.groupby(['subject_id','hour_measure']).agg(f))
                        urine color  heart_rate
subject_id hour_measure
3          1.00                 red          40
           1.15                 red          60
4          2.00              yellow          50
Finally, if necessary, forward fill the missing values per subject_id in the reindexed frame with GroupBy.ffill:
cols = df1.columns.difference(['subject_id','hour_measure'])
df1[cols] = df1.groupby('subject_id')[cols].ffill()
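The whole pipeline can be sketched end to end on a tiny frame. The data below is invented for illustration, the helper name mean_or_mode is hypothetical, and the hour grid is shortened to 1..4 so the result stays readable.

```python
import numpy as np
import pandas as pd

# Invented sample data: subject 3 has two readings in hour 1
df = pd.DataFrame({
    "subject_id": [3, 3, 3, 4],
    "hour_measure": [1, 1, 3, 2],
    "urine color": ["red", "red", "yellow", "yellow"],
    "heart_rate": [40, 60, 50, 50],
})

def mean_or_mode(s):
    """Mean for numeric columns, most frequent value otherwise."""
    if pd.api.types.is_numeric_dtype(s):
        return s.mean()
    modes = s.mode()
    return modes.iloc[0] if not modes.empty else None

hours = np.arange(1, 5)  # shortened grid: hours 1, 2, 3, 4
mux = pd.MultiIndex.from_product([df["subject_id"].unique(), hours],
                                 names=["subject_id", "hour_measure"])

# Aggregate per (subject, hour), then expand to the full grid
df1 = (df.groupby(["subject_id", "hour_measure"])
         .agg(mean_or_mode)
         .reindex(mux)
         .reset_index())

# Forward fill within each subject so gaps take the previous hour's values
cols = list(df1.columns.difference(["subject_id", "hour_measure"]))
df1[cols] = df1.groupby("subject_id")[cols].ffill()
print(df1)
```

Two details are worth checking against real data. np.arange(1, 24) produces the hours 1 to 23, so np.arange(1, 25) would be needed if hour 24 should be included. Rows whose hour_measure is not on the grid (such as 1.15) are dropped by reindex, and hours before a subject's first measurement stay empty because forward fill has nothing to copy.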

Pandas DataFrame - How to make a stacked area graph stack (matplotlib)

I am trying to convert data in a pandas DataFrame into a stacked area graph but cannot seem to get it to stack.
The data is in the format
index | datetime (yyyy/mm/dd) | name | weight_change
With 6 different people each measured daily.
I want the stacked graph to show the weight_change (y) over the datetime (x) but with weight_change for each of the 6 people stacked on top of each other
The closest I have been able to get to it is with:
df = df.groupby(['datetime', 'name'], as_index=False).agg({'weight_change': 'sum'})
agg = df.groupby('datetime').sum()
agg.plot.area()
This produces the area graph for the aggregate of the weight_change values (the sum of each person's weight_change for each day), but I can't figure out how to split this up for each person like the different values here:
I have tried various things with no luck. Any ideas?

A simplified version of your data:
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(days=list(range(4))*2,  # list() is needed on Python 3
                       change=np.random.rand(8)*2.,
                       name=['John']*4 + ['Jane']*4))
df:
change days name
0 0.238336 0 John
1 0.293901 1 John
2 0.818119 2 John
3 1.567114 3 John
4 1.295725 0 Jane
5 0.592008 1 Jane
6 0.674388 2 Jane
7 1.763043 3 Jane
Now we can simply use pyplot's stackplot:
import matplotlib.pyplot as plt
days = df.days[df.name == 'John']
plt.stackplot(days, df.change[df.name == 'John'],
df.change[df.name == 'Jane'])
This produces a stacked area plot with one band per person (figure not shown).
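With more people, listing each series by hand gets tedious. One alternative, sketched here under assumptions, is to pivot the long table so that each person becomes a column and then let pandas stack the columns. The names, dates and random values below are invented; only the column names datetime, name and weight_change follow the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
names = ["John", "Jane", "Ann"]  # invented names
dates = pd.date_range("2020-01-01", periods=4, freq="D")

# Long format, as in the question: one row per person per day
df = pd.DataFrame({
    "datetime": list(dates) * len(names),
    "name": np.repeat(names, len(dates)),
    "weight_change": rng.random(len(names) * len(dates)) * 2.0,
})

# Wide format: one row per day, one column per person
wide = df.pivot_table(index="datetime", columns="name",
                      values="weight_change", aggfunc="sum")

# Each column becomes one band of the stacked area chart
ax = wide.plot.area()
ax.set_ylabel("weight_change")
```

The sample values are all non-negative on purpose. As far as I know, a stacked area plot in pandas expects each column to be entirely non-negative or entirely non-positive, so real weight changes that mix signs may need a different chart or stacked=False.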

Plotting proportional data python (stacked barplot)

I have a data set where clients answer a question, and clients belong to a certain category. The category is ordinal. I want to visualize the change in percentages as a proportional stacked barplot. Here is some test data:
answer | categ
1 1
2 1
3 2
1 2
2 3
3 3
1 1
2 1
3 2
1 2
2 3
3 3
1 3
2 2
3 1
Here is how you can generate it:
pd.DataFrame({'answer':[1,2,3]*5, 'categ':[1,1,2,2,3,3]*2+[3,2,1]})
Using some convoluted code which can probably be written much nicer and more efficient I got to percentages within the answer.
test = pd.DataFrame({'answer':[1,2,3]*5, 'categ':[1,1,2,2,3,3]*2+[3,2,1]})
rel_data = pd.merge(pd.DataFrame(test.groupby(['answer','categ']).size()).reset_index(),pd.DataFrame(test.groupby('answer').size()).reset_index(), how='left', on='answer')
rel_data.columns = ['answer', 'categ', 'number_combination', 'number_answer']
rel_data['perc'] = rel_data['number_combination']/rel_data['number_answer']
rel_data[['answer', 'categ', 'perc']]
This results in:
answer | categ | perc
1 1 0.4
1 2 0.4
1 3 0.2
2 1 0.4
2 2 0.2
2 3 0.4
3 1 0.2
3 2 0.4
3 3 0.4
How do I get this into a stacked bar plot with per answer a bar and colored areas per category?

Once I had the last dataframe, I could get it fairly easily. By doing this:
rel_data = rel_data.groupby(['answer','categ']).\
perc.sum().unstack().plot(kind='bar', stacked=True, ylim=(0,1))
It's again dirty but at least it got the job done. The perc.sum turns it into one value per group (even though it already was that), the unstack() turns it into a DF with the categories as columns and the answers as rows, the plot turns this into a proportional stacked barplot. The ylim is due to some tiny rounding error where it could add up to 1.00001 which added a whole new tick.
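The percentages and the reshaping can probably be obtained in one step with pd.crosstab and its normalize argument. A minimal sketch using the question's test data (the name props is illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

test = pd.DataFrame({"answer": [1, 2, 3] * 5,
                     "categ": [1, 1, 2, 2, 3, 3] * 2 + [3, 2, 1]})

# Rows are answers, columns are categories; normalize="index" divides
# each row by its total, so every row sums to 1
props = pd.crosstab(test["answer"], test["categ"], normalize="index")
print(props)

# One bar per answer, one colored segment per category
ax = props.plot(kind="bar", stacked=True, ylim=(0, 1))
```

This replaces the merge, the manual division and the groupby/unstack, and the table it prints should match the perc values listed in the question.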

This is by no means perfect but it's a start:
import numpy as np
import matplotlib.pyplot as plt
# df is assumed to be the rel_data frame from the question (columns answer, categ, perc)
colors = ["r", "g", "b", "y", "c"]  # one color per category; extend as needed
for i in sorted(set(df.categ)):
    if i == 1:
        # first category: start stacking from zero
        x = np.zeros(len(set(df.answer)))
    else:
        # add the heights of the previous category's bars
        x += df[df.categ == i - 1].perc.to_numpy()
    plt.bar(df[df.categ == i].answer, df[df.categ == i].perc, bottom=x, color=colors[i - 1])
plt.xticks(list(set(df.answer)))
plt.show()
The approach is to iterate over the categories and, for each one, draw its bars at the answer positions. The i == 1 check detects the first iteration and creates an array of zeros that serves as the base for stacking, so the first bars start at zero. On each later iteration, the heights of the previous category's bars are added to x, which then becomes the bottom of the next layer.
The colors list is there just so you can tell the bars apart more easily.
Hope this helps.

You can make a bar plot with the matplotlib library. Follow this tutorial: http://matplotlib.org/examples/api/barchart_demo.html
