I'm taking a biostatistics class and we've been asked to turn some data from a CSV into various types of plots. I'm having trouble getting each bar on a bar plot to show the correct categorical variable. I'm following an example the professor provided and not getting what I want. I'm totally new to this, so my apologies for formatting errors.
I've created the dataframe variable and am now trying to plot it as a bar graph (and later on other variables in the CSV as other types of plots). Not sure if I'm providing the code in the correct manner, but here's what I have so far. We're supposed to create a bar plot of PET using the number of cases (number of each pet/type of pet).
This is the data for this particular question. In the CSV it's shown as just the type of pet each student has (not sure how to share the CSV, but if it'd help I can post it).
I'm editing the post to show the code I've run to get the plot, and include the CSV info (hope I'm doing this right):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
HW2 = pd.read_csv("/Path/to/file")
HW2Grouped = HW2.groupby('Pet').count()
HW2Grouped['Pet'] = HW2Grouped.index
HW2Grouped.columns = ['Pet', 'Count', 'col_1', 'col_2', 'col_3', 'col_4']
%matplotlib inline
HW2bar = HW2Grouped.plot.bar(x = 'Pet', y = 'Count', title = "Pet count for students")
HW2bar.set_xlabel('Pet Type')
t = HW2bar.set_ylabel('Count')
This is the data I have to work with (sorry it's just a screenshot).
This is the bar plot I got from the code I ran.
It seems to me that when you added a new column, Pet, it became the new last column. Then you renamed the columns of HW2Grouped, so the first column (which holds the results of the count aggregation) was renamed to Pet, and the actual Pet column became col_4.
Let me now trace back through the steps you tried, to make clear what went wrong.
When you grouped your DataFrame with this code:
HW2Grouped = HW2.groupby('Pet').count()
You received this:
       Height  Ice Cream  n of letters  Favorite TA  Minutes to Hometown
Pet
Cat         1          1             1            1                    1
Dog        17         17            17           17                   17
Horse       2          2             2            2                    2
None        4          4             4            4                    4
After you added a new column, Pet, to HW2Grouped (which you may have thought was creating a variable), it looked like this:
       Height  Ice Cream  n of letters  Favorite TA  Minutes to Hometown    Pet
Pet
Cat         1          1             1            1                    1    Cat
Dog        17         17            17           17                   17    Dog
Horse       2          2             2            2                    2  Horse
None        4          4             4            4                    4   None
Then, when you assigned to the .columns attribute, the columns were renamed by position and your grouped DataFrame looked like this:
       Pet  Count  col_1  col_2  col_3  col_4
Pet
Cat      1      1      1      1      1    Cat
Dog     17     17     17     17     17    Dog
Horse    2      2      2      2      2  Horse
None     4      4      4      4      4   None
Then, when plotting HW2Grouped, you passed Pet as x, but after the rename Pet no longer held the pet names: it was the former Height column, which holds counts. That is why the bar labels were wrong.
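The trace above can be reproduced on a tiny table. This is only a sketch: the three-column frame below is a made-up stand-in for the CSV (the real file has more columns), so the rename list is shorter than in the question.

```python
import pandas as pd

# Hypothetical miniature of the CSV; values are invented for illustration
HW2 = pd.DataFrame({
    "Pet": ["Dog", "Dog", "Cat", "Horse"],
    "Height": [60, 65, 70, 62],
    "Ice Cream": ["Vanilla", "Mint", "Vanilla", "Chocolate"],
})

grouped = HW2.groupby("Pet").count()         # columns: Height, Ice Cream
grouped["Pet"] = grouped.index               # appended as the LAST column
grouped.columns = ["Pet", "Count", "col_1"]  # renames by position, not by name

# 'Pet' now holds counts (the former Height column),
# and the pet names ended up under 'col_1'
print(grouped)
```

Printing the frame after each step is a quick way to catch this kind of positional rename mistake.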
You may try:
%matplotlib inline
HW2bar = HW2Grouped.plot.bar(x = 'col_4', y = 'Count', title = "Pet count for students")
HW2bar.set_xlabel('Pet Type')
t = HW2bar.set_ylabel('Count')
I think what you originally intended to do was this (except you didn't indicate the column to perform the count on):
HW2Grouped = HW2.groupby('Pet')['Pet'].count()
However, this won't sort the bars in descending order of count.
There is a shorter way that needs no added or renamed columns, and it sorts the bars by count:
HW2['Pet'].value_counts().plot.bar()
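To see the difference between the two approaches without drawing anything, here is a small sketch on invented data (the frame below is a hypothetical stand-in for the CSV). It compares value_counts, which orders by count, with a groupby, which orders by group label.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: one row per student
HW2 = pd.DataFrame({
    "Pet": ["Dog", "Cat", "Dog", "None", "Horse", "Dog"],
    "Height": [60, 65, 70, 62, 68, 71],
})

# Sorted by count, largest first; the index holds the pet names
counts_sorted = HW2["Pet"].value_counts()

# Sorted by group label (alphabetical), one count per pet
counts_alpha = HW2.groupby("Pet").size()

print(counts_sorted)
print(counts_alpha)
```

Either Series can be passed straight to .plot.bar(); the index supplies the bar labels, so no x or y argument is needed.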
Let's say I have people chew a type of gum while reading a question, and then answer a test question. Sometimes they would chew orange gum while reading and answering a question. Sometimes they would chew peppermint. Not everyone chewed and answered all of the questions.
Let's say I have my data laid out like this:
ID  Gum Type    Test (1= correct, 2=incorrect)
1   Orange      1
1   Orange      0
1   Peppermint  0
1   Peppermint  1
2   Orange      0
2   Peppermint  1
I want to create a violin plot where on my x-axis, I have Gum Type, and on my Y-axis, I have the Proportion correct on the test, and participant 1 would show up as only one data point for Orange, and One data point for Peppermint. So participant one would show up on the "Orange" violin plot as one data point, in the middle (got 50% of orange questions correct).
Use:
import pandas as pd
import seaborn as sns

# Tab-separated, because the column names themselves contain spaces
data = '''ID\tGum Type\tTest (1= correct, 2=incorrect)
1\tOrange\t1
1\tOrange\t0
1\tPeppermint\t0
1\tPeppermint\t1
2\tOrange\t0
2\tPeppermint\t1'''
data = [x.split('\t') for x in data.split('\n')]
df = pd.DataFrame(data[1:], columns=data[0])
df['Test (1= correct, 2=incorrect)'] = df['Test (1= correct, 2=incorrect)'].astype(int)
# Mean of the 0/1 scores = proportion correct per participant and gum type
df1 = df.groupby(['ID', 'Gum Type'])['Test (1= correct, 2=incorrect)'].mean().to_frame().reset_index()
ax = sns.violinplot(x="Gum Type", y="Test (1= correct, 2=incorrect)", data=df1)
Output:
This is my first question on Stack Overflow.
I'm implementing a machine learning classification algorithm and I want to generalize it to any input dataset that has its target class in the last column. For that, I want to modify all values of this column without needing to know the names of the columns or rows, using pandas in Python.
For example, let's suppose I load a dataset:
dataset = pd.read_csv('random_dataset.csv')
Let's say the last column has the following data:
0 dog
1 dog
2 cat
3 dog
4 cat
I want to change each "dog" appearance to 1 and each "cat" appearance to 0, so that the column would look like this:
0 1
1 1
2 0
3 1
4 0
I have found some ways of changing the values of specific cells using pandas, but for this case, what would be the best way to do that?
I appreciate each answer.
You can use pandas.Categorical:
df['column'] = pd.Categorical(df['column']).codes
You can also use the built-in category dtype for this:
df['column'] = df['column'].astype('category').cat.codes
Use map to assign the values explicitly:
df['col_name'] = df['col_name'].map({'dog' : 1 , 'cat': 0})
Or use factorize (encodes the object as an enumerated type) if any numeric labels will do. The codes are not random: they are assigned in order of first appearance, so here dog would become 0 and cat 1:
df['col_name'] = df['col_name'].factorize()[0]
OUTPUT (using map):
0 1
1 1
2 0
3 1
4 0
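The question asks for the last column without knowing its name, which the snippets above leave open. A minimal sketch, assuming the target really is the last column: look its name up with dataset.columns[-1] and encode it there. The small frame is a hypothetical stand-in for random_dataset.csv, and sort=True is my choice so that the codes follow alphabetical order (cat = 0, dog = 1) rather than order of appearance.

```python
import pandas as pd

# Hypothetical stand-in for random_dataset.csv
dataset = pd.DataFrame({
    "feature": [0.1, 0.4, 0.3, 0.9, 0.5],
    "label": ["dog", "dog", "cat", "dog", "cat"],
})

# Name of the last column, whatever it happens to be called
target = dataset.columns[-1]

# codes: integer label per row; classes: the original values, in code order
codes, classes = pd.factorize(dataset[target], sort=True)
dataset[target] = codes

print(dataset)
print(list(classes))
```

Keeping classes around lets you map predictions back to the original labels later, for example classes[1] for code 1.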
The following is an excerpt from my dataframe:
In[1]: df
Out[1]:
  LongName  BigDog
1  Big Dog       1
2  Mastiff       0
3  Big Dog       1
4      Cat       0
I want to use regex to update BigDog values to 1 if LongName is a mastiff. I need other values to stay the same. I tried this, and although it assigns 1 to mastiffs, it nulls all other values instead of keeping them intact.
def BigDog(longname):
    if re.search('(?i)mastiff', longname):
        return '1'
df['BigDog'] = df['LongName'].apply(BigDog)
I'm not sure what to do, could anybody please help?
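You don't need a loop or apply; use str.match with DataFrame.loc. Your function returns None whenever the pattern doesn't match, which is why the other values were nulled. Note that str.match only matches at the start of each string; str.contains is the closer equivalent of re.search if "mastiff" can appear mid-name: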
df.loc[df['LongName'].str.match('(?i)mastiff'), 'BigDog'] = 1
  LongName  BigDog
1  Big Dog       1
2  Mastiff       1
3  Big Dog       1
4      Cat       0
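If the breed name might not be at the start of the string, a variant with str.contains may be closer to the original re.search. This is a sketch only: the "English Mastiff" row is an invented example added to show the difference, and case=False stands in for the (?i) flag.

```python
import pandas as pd

df = pd.DataFrame({
    # Last row is a hypothetical addition, not from the question
    "LongName": ["Big Dog", "Mastiff", "Big Dog", "Cat", "English Mastiff"],
    "BigDog": [1, 0, 1, 0, 0],
})

# True wherever 'mastiff' appears anywhere in the name, ignoring case;
# na=False treats missing names as non-matches
mask = df["LongName"].str.contains("mastiff", case=False, na=False)

# Only the matching rows are overwritten; all others keep their values
df.loc[mask, "BigDog"] = 1

print(df)
```

With str.match the "English Mastiff" row would be left at 0, because the pattern is anchored to the beginning of the string.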
I have a data frame with a single column of values and an index of sample names:
>>> df = pd.DataFrame(data={'value':[1,3,4]},index=['cat','dog','bird'])
>>> print(df)
value
cat 1
dog 3
bird 4
I would like to convert this to a square matrix wherein each cell of the matrix shows the difference between every set of two values:
      cat  dog  bird
cat     0    2     3
dog     2    0     1
bird    3    1     0
Is this possible? If so, how do I go about doing this?
I have tried to use scipy.spatial.distance.squareform to convert my starting data frame into a matrix, but apparently what I am starting with is not the right type of vector. Any help would be much appreciated!
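No answer is included for this one, so here is one possible approach rather than an accepted solution: NumPy broadcasting can subtract every value from every other value, and the result can be wrapped back into a DataFrame labelled with the original index. Taking the absolute value is an assumption based on the expected output shown above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'value': [1, 3, 4]}, index=['cat', 'dog', 'bird'])

values = df['value'].to_numpy()

# Column vector minus row vector broadcasts to an n x n grid of differences
diff = np.abs(values[:, None] - values[None, :])

# Reuse the sample names for both rows and columns
matrix = pd.DataFrame(diff, index=df.index, columns=df.index)

print(matrix)
```

As for squareform, it expects a condensed distance vector such as the one scipy.spatial.distance.pdist returns, not the raw values, which is probably why it rejected the starting frame.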
I am trying to convert data in a pandas DataFrame into a stacked area graph but cannot seem to get it to stack.
The data is in the format
index | datetime (yyyy/mm/dd) | name | weight_change
With 6 different people each measured daily.
I want the stacked graph to show the weight_change (y) over the datetime (x) but with weight_change for each of the 6 people stacked on top of each other
The closest I have been able to get to it is with:
df = df.groupby(['datetime', 'name'], as_index=False).agg({'weight_change': 'sum'})
agg = df.groupby('datetime').sum()
agg.plot.area()
This produces the area graph for the aggregate of the weight_change values (the sum of each person's weight_change for each day), but I can't figure out how to split this up for each person like the different values here:
I have tried various things with no luck. Any ideas?
A simplified version of your data:
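import numpy as np
import pandas as pd
# list(range(4)) is needed on Python 3, where a range object cannot be multiplied
df = pd.DataFrame(dict(days=list(range(4)) * 2,
                       change=np.random.rand(8) * 2.,
                       name=['John'] * 4 + ['Jane'] * 4))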
df:
change days name
0 0.238336 0 John
1 0.293901 1 John
2 0.818119 2 John
3 1.567114 3 John
4 1.295725 0 Jane
5 0.592008 1 Jane
6 0.674388 2 Jane
7 1.763043 3 Jane
Now we can simply use pyplot's stackplot:
import matplotlib.pyplot as plt
days = df.days[df.name == 'John']
plt.stackplot(days, df.change[df.name == 'John'],
df.change[df.name == 'Jane'])
This produces the following plot:
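With six people, passing each series to stackplot by hand gets tedious. A possible alternative, sketched here on made-up data in the question's long format (the names, dates and values are all hypothetical), is to pivot so that each person becomes a column.

```python
import numpy as np
import pandas as pd

# Hypothetical data in the question's long format: one row per person per day
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=4, freq="D")
names = ["John", "Jane", "Ann"]
df = pd.DataFrame({
    "datetime": dates.repeat(len(names)),
    "name": names * len(dates),
    "weight_change": rng.random(len(dates) * len(names)),
})

# One row per date, one column per person
wide = df.pivot_table(index="datetime", columns="name",
                      values="weight_change", aggfunc="sum")

print(wide)
```

Calling wide.plot.area() on the result should then draw one band per person, since pandas area plots stack the columns by default. As far as I know, stacking requires each column to be all positive or all negative, so real weight_change data with mixed signs may need stacked=False or a different chart type.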