Plot against dummy variables and grouped values - python

These are some of the values in my table:
country colour ...
1 Spain red
2 USA blue
3 Greece green
4 Italy white
5 USA red
6 USA blue
7 Spain red
I want to group the rows by country and plot a bar chart with the country on the x axis and the count of each colour for that country. For example, USA has 2 blues and 1 red, Spain has 2 reds, and so on. I would like to do this with either matplotlib or seaborn.
I assume I would have to use dummy variables for the 'colour' column, but I'm not sure how to plot a grouped column against dummy variables.
I would much appreciate it if you could show and explain the process. Thank you.

Try with crosstab:
pd.crosstab(df['country'], df['colour']).plot.bar()
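A minimal sketch of the crosstab approach on the sample rows from the question. The DataFrame is rebuilt by hand here (the elided "..." columns are left out), and the `Agg` backend and the axis label are assumptions added only so the snippet runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import pandas as pd

# Sample rows copied from the question (only the two columns shown there)
df = pd.DataFrame({
    "country": ["Spain", "USA", "Greece", "Italy", "USA", "USA", "Spain"],
    "colour": ["red", "blue", "green", "white", "red", "blue", "red"],
})

# crosstab counts how often each (country, colour) pair occurs:
# one row per country, one column per colour, so no dummy variables are needed
counts = pd.crosstab(df["country"], df["colour"])

# One group of bars per country, one bar per colour
ax = counts.plot.bar()
ax.set_ylabel("count")
```

Passing `stacked=True` to `plot.bar()` would stack the colours into a single bar per country instead of drawing them side by side.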

Related

Keep rows in dataframe where values are present in another dataframe

I have two dataframes, df1 and df2:
df1:
Transport  City      Color
Car        Paris     red
Car        London    white
Bike       Paris     red
Car        New York  blue
df2:
Color
red
blue
blue
They are not the same length.
I want to make a new dataframe from the first one, keeping a row only if its color is also present in the second dataframe, so that the output would be:
Transport  City      Color
Car        Paris     red
Bike       Paris     red
Car        New York  blue
Is there a way to do that? I want to write something like:
df1[df1.Color.isin(df2.Color)]
But it does not seem to work.
Edit: I think the issue is that the data type in the first dataframe is str, but not in the second.
If both should match you need to merge and then filter out the nulls:
df1 = df1.merge(right=df2.drop_duplicates(), on=['City','Color'], how='left')
df1.dropna(subset=['Transport'], inplace = True)
Let me know if this works for you
You can try a merge after dropping duplicates, which avoids a Cartesian product:
df1.merge(df2.drop_duplicates(), on=['City','Color'])
Outputting:
Transport City Color
0 Car Paris red
1 Bike Paris red
2 Car New York blue
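For reference, here is a hedged sketch of the `isin` filter from the question, using the tables as they are shown there (df2 has only a Color column). It assumes the failure came from a dtype or stray-whitespace mismatch, as the edit to the question suggests, so both sides are normalized to stripped strings before comparing. That normalization is an assumption, not something confirmed in the thread:

```python
import pandas as pd

# Tables rebuilt from the question
df1 = pd.DataFrame({
    "Transport": ["Car", "Car", "Bike", "Car"],
    "City": ["Paris", "London", "Paris", "New York"],
    "Color": ["red", "white", "red", "blue"],
})
df2 = pd.DataFrame({"Color": ["red", "blue", "blue"]})

# Assumption: the two Color columns differ in dtype or padding,
# so cast both to str and strip whitespace before comparing
wanted = df2["Color"].astype(str).str.strip()
mask = df1["Color"].astype(str).str.strip().isin(wanted)

# Keep only the rows of df1 whose color also appears in df2
result = df1[mask].reset_index(drop=True)
print(result)
```

Unlike a merge, `isin` never duplicates rows, so there is no need to drop duplicates from df2 first.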

Creating single chart from three categoric values using python

I am fairly new to Python and its terminology, so I may be clumsy at describing the problem. Sorry for that.
I have three cities that each produced three fruits over two years, and I need to draw a single static chart that best summarizes the data.
The fact that the dataframe has three categorical variables (city, fruits and year) and one measure confuses me.
At first I tried a stacked bar chart; however, if I use fruits in the bars and cities on the x axis, I cannot find a place for the year value.
I tried the pivot method to convert the year values into measures, but then I could not proceed with two measures.
I mainly used Matplotlib.
Any help is appreciated.
import pandas as pd
data = {
'city':['amsterdam','amsterdam','amsterdam','amsterdam','amsterdam','amsterdam','paris','paris','paris','paris','paris','paris','berlin','berlin','berlin','berlin','berlin','berlin'],
'fruits':['apples','oranges','bananas','apples','oranges','bananas','apples','oranges','bananas','apples','oranges','bananas','apples','oranges','bananas','apples','oranges','bananas'],
'year':[2000,2000,2000,2001,2001,2001,2000,2000,2000,2001,2001,2001,2000,2000,2000,2001,2001,2001],
'amount':[384,289,347,242,390,274,175,334,245,116,252,366,255,400,300,240,600,180]
}
df=pd.DataFrame(data)
df.head()
   city       fruits   year  amount
0  amsterdam  apples   2000  384
1  amsterdam  oranges  2000  289
2  amsterdam  bananas  2000  347
3  amsterdam  apples   2001  242
4  amsterdam  oranges  2001  390
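One possible way to fit all three categorical variables into a single chart, offered only as a sketch and not as an answer from the thread: put each (city, year) pair on the x axis and stack the fruits inside each bar. The choice of `pivot_table` with `aggfunc="sum"`, the `Agg` backend and the axis label are assumptions made for this illustration:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import pandas as pd

# Same data as in the question, built more compactly
cities = ["amsterdam", "paris", "berlin"]
fruits = ["apples", "oranges", "bananas"]
years = [2000, 2001]
data = {
    "city": [c for c in cities for _ in range(6)],
    "fruits": fruits * 6,
    "year": [y for _ in cities for y in years for _ in fruits],
    "amount": [384, 289, 347, 242, 390, 274, 175, 334, 245,
               116, 252, 366, 255, 400, 300, 240, 600, 180],
}
df = pd.DataFrame(data)

# Rows: one per (city, year); columns: one per fruit; values: amount
wide = df.pivot_table(index=["city", "year"], columns="fruits",
                      values="amount", aggfunc="sum")

# Each bar is a (city, year) pair, with the three fruits stacked inside it
ax = wide.plot(kind="bar", stacked=True)
ax.set_ylabel("amount")
```

Dropping `stacked=True` gives grouped bars instead, which can make it easier to compare a single fruit across years.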

Merging different sized data frames and plotting the difference of a column

I have two dataframes, Region_education_0 and Region_education_1:
Region_education_0
index  Region                 ConvertedComp
1      Australia/New Zealand  122573.834171
2      Caribbean              53562.111111
3      Central Asia           134422.000000
4      East Asia              112492.507042
5      Melanesia              605
Region_education_1
index  Region                 ConvertedComp
1      Australia/New Zealand  122573.834171
2      Caribbean              53562.111111
3      Central Asia           134422.000000
4      East Asia              112492.507042
Index 5, Melanesia, is not present in Region_education_1 because of a condition. I want to compare the two dataframes and plot the result, so I tried this:
from matplotlib.pyplot import *
Region_education_combined=Region_education_0.merge(Region_education_1,left_on="Region",right_on="Region")
Region_education_combined.columns=["Region","Max of Bachelors Higher Ed","Higher Formal Education"]
Region_education_combined['Diff_HigherEd_Vals'] = Region_education_combined['Higher Formal Education'] - Region_education_combined['Max of Bachelors Higher Ed']
print(Region_education_combined)
comp_df.style.bar(subset=['Diff_HigherEd_Vals'], align='mid', color=['#d65f5f', '#5fba7d'])
index  Max of Bachelors Higher Ed  Higher Formal Education  Diff_HigherEd_Vals
1      151698.500659               122573.834171            -29124.666488
2      28413.753425                53562.111111             53562.111111
3      3944.750000                 5883.000000              1938.250000
4      45091.041667                27052.384615             -18038.657051
The Region column is missing from the output. To include it, I tried
comp_df.style.bar(subset=['Diff_HigherEd_Vals','Region'], align='mid', color=['#d65f5f', '#5fba7d'])
and
comp_df.style.bar(Region_education_combined, align='mid', color=['#d65f5f', '#5fba7d'])
Is there any way to include Region in the final output?
Also, "Index 5, Melanesia" from the Region_education_0 dataframe is left out. Is there any way to include that in the output too?
You can keep the missing Region by passing how="outer" when you call merge, like this:
Region_education_combined=Region_education_0.merge(Region_education_1, on="Region", how="outer")
Note that in this case the table will contain NaN wherever there is no match; in your case Melanesia will have NaN in the Higher Formal Education column. To avoid problems you can set a default value like this:
Region_education_combined["Higher Formal Education"] = Region_education_combined["Higher Formal Education"].fillna(0)
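A small sketch of the outer merge on the sample values from the question. The shortened names `region_0`, `region_1` and `combined`, and the `suffixes` argument, are choices made for this illustration rather than part of the original answer:

```python
import pandas as pd

# Sample values copied from the two tables in the question
region_0 = pd.DataFrame({
    "Region": ["Australia/New Zealand", "Caribbean", "Central Asia",
               "East Asia", "Melanesia"],
    "ConvertedComp": [122573.834171, 53562.111111, 134422.0,
                      112492.507042, 605.0],
})
region_1 = pd.DataFrame({
    "Region": ["Australia/New Zealand", "Caribbean", "Central Asia",
               "East Asia"],
    "ConvertedComp": [122573.834171, 53562.111111, 134422.0, 112492.507042],
})

# how="outer" keeps regions that appear in only one of the two frames
combined = region_0.merge(region_1, on="Region", how="outer",
                          suffixes=("_0", "_1"))

# Melanesia has no match in region_1, so its value there is NaN; default to 0
combined["ConvertedComp_1"] = combined["ConvertedComp_1"].fillna(0)

combined["Diff"] = combined["ConvertedComp_1"] - combined["ConvertedComp_0"]
print(combined)
```

Because Region stays an ordinary column after the merge, it is still present in `combined`. Calling `combined.set_index("Region")` before styling or plotting is one way to use it as the row label.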

Populate Pandas dataframe with group_by calculations made in Pandas series

I have created a dataframe from a dictionary as follows:
my_dict = {'VehicleType':['Truck','Car','Truck','Car','Car'],'Colour':['Green','Green','Black','Yellow','Green'],'Year':[2002,2014,1975,1987,1987],'Frequency': [0,0,0,0,0]}
df = pd.DataFrame(my_dict)
So my dataframe df currently looks like this:
VehicleType Colour Year Frequency
0 Truck Green 2002 0
1 Car Green 2014 0
2 Truck Black 1975 0
3 Car Yellow 1987 0
4 Car Green 1987 0
I'd like it to look like this:
VehicleType Colour Year Frequency
0 Truck Green 2002 1
1 Car Green 2014 2
2 Truck Black 1975 1
3 Car Yellow 1987 1
4 Car Green 1987 2
i.e., the Frequency column should represent the totals of VehicleType AND Colour combinations (but leaving out the Year column). So in row 4 for example, the 2 in the Frequency column tells you that there are a total of 2 rows with the combination of 'Car' and 'Green'.
This is essentially a 'Count' with 'Group By' calculation, and Pandas provides a way to do the calculation as follows:
grp_by_series = df.groupby(['VehicleType', 'Colour']).size()
grp_by_series
VehicleType  Colour
Car          Green     2
             Yellow    1
Truck        Black     1
             Green     1
dtype: int64
What I'd like to do next is to extract the calculated group-by values from the pandas Series and put them into the Frequency column of the dataframe. I've tried various approaches but without success.
The example I've given is hugely simplified - the dataframes I'm using are derived from genomic data and have hundreds of millions of rows, and will have several frequency columns based on various combinations of other columns, so ideally I need a solution which is fast and scales well.
Thanks for any help!
You are on a good path. You can continue like this:
grp_by_series=grp_by_series.reset_index()
res=df[['VehicleType', 'Colour']].merge(grp_by_series, how='left')
df['Frequency'] = res[0]
print(df)
Output:
VehicleType Colour Year Frequency
0 Truck Green 2002 1
1 Car Green 2014 2
2 Truck Black 1975 1
3 Car Yellow 1987 1
4 Car Green 1987 2
I think a .transform() does what you want:
df['Frequency'] = df.groupby(['VehicleType', 'Colour'])['Year'].transform('count')
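A runnable sketch of the transform approach on the example data. It uses `'size'` rather than `'count'`, which is a deliberate swap: `'count'` skips rows where Year is missing, while `'size'` counts every row in the group. On this sample both give the same result:

```python
import pandas as pd

my_dict = {
    "VehicleType": ["Truck", "Car", "Truck", "Car", "Car"],
    "Colour": ["Green", "Green", "Black", "Yellow", "Green"],
    "Year": [2002, 2014, 1975, 1987, 1987],
    "Frequency": [0, 0, 0, 0, 0],
}
df = pd.DataFrame(my_dict)

# transform returns one value per original row, aligned to df's index,
# so the group size can be assigned straight back without a merge
df["Frequency"] = (
    df.groupby(["VehicleType", "Colour"])["Year"].transform("size")
)
print(df)
```

Further frequency columns for other column combinations can be added the same way, by changing the list passed to `groupby`.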

Grouping by multiple years in a single column and plotting the result stacked

I have a dataframe that looks like this, with the default pandas index starting at 0:
index Year Count Name
0 2005 70000 Apple
1 2005 60000 Banana
2 2006 20000 Pineapple
3 2007 70000 Cherry
4 2007 60000 Coconut
5 2007 40000 Pear
6 2008 90000 Grape
7 2008 10000 Apricot
I would like to create a stacked bar plot of this data.
However, df.groupby() only lets me call a function such as .mean() or .count() on the data in order to plot it by year. The result I get separates each data point and does not group them by the shared year.
I have seen the matplotlib example for stacked bar charts, but there the bars are grouped by a common index, and in this case I do not have a common index to plot by. Is there a way to group and plot this data without rearranging the entire dataframe?
If I understood you correctly, you could do this using pivot first:
df1 = pd.pivot_table(df, values='Count', index='Year', columns='Name')
df1.plot(kind='bar')
Or with the argument stacked=True:
df1.plot(kind='bar', stacked=True)
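The pivot step can be tried end to end on the sample rows from the question. This sketch adds `aggfunc="sum"` and `fill_value=0` so that years in which a fruit does not appear hold 0 instead of NaN. Both arguments, along with the `Agg` backend and the axis label, are assumptions that go beyond the original answer:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "Year": [2005, 2005, 2006, 2007, 2007, 2007, 2008, 2008],
    "Count": [70000, 60000, 20000, 70000, 60000, 40000, 90000, 10000],
    "Name": ["Apple", "Banana", "Pineapple", "Cherry",
             "Coconut", "Pear", "Grape", "Apricot"],
})

# One row per Year, one column per Name; missing combinations become 0
df1 = pd.pivot_table(df, values="Count", index="Year", columns="Name",
                     aggfunc="sum", fill_value=0)

# One stacked bar per year, one segment per name
ax = df1.plot(kind="bar", stacked=True)
ax.set_ylabel("Count")
```

The pivot creates a new, small table for plotting and leaves the original `df` unchanged.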
