I am trying to generate annual data for a product from base-year data and a growth rate.
In the toy example, each product's efficiency grows at a different annual rate depending on its 'color', and I want to generate yearly data up to 2030.
I have the base-year data (base_year) as follows:
year color shape efficiency
0 2018 red circle 50
1 2018 red square 30
2 2018 blue circle 100
3 2018 blue square 60
And each type of product's growth rate (growthrate) as:
color rate
0 red 30
1 blue 20
The result I want is:
year color shape efficiency
0 2018 red circle 50
1 2018 red square 30
2 2018 blue circle 100
3 2018 blue square 60
4 2019 red circle 65
5 2019 red square 39
6 2019 blue circle 120
7 2019 blue square 72
8 2020 red circle 84.5
... (until 2030)
The data used in the toy example is:
base_year = pd.DataFrame(data = {'year': [2018,2018,2018,2018],
'color': ['red', 'red', 'blue', 'blue'],
'shape' : ['circle', 'square', 'circle', 'square'],
'efficiency' : [50, 30, 100, 60]}, columns = ['year', 'color', 'shape', 'efficiency'])
growthrate = pd.DataFrame(data = {'color': ['red', 'blue'],
'rate' : [30, 20]}, columns = ['color', 'rate'])
I've tried some approaches using .loc, but they seem quite inefficient.
Any suggestions or hints would be appreciated. Thank you in advance!
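Here is one way to do this:
# Number of years to generate: 2018 through 2030 inclusive
years = 2031 - 2018
# Attach each row's growth rate, then build one compounded copy per year offset
merged = base_year.merge(growthrate, on='color')
df = (pd.concat([merged.assign(year=merged['year'] + i,
                               efficiency=merged['efficiency'] * (merged['rate'] / 100 + 1) ** i)
                 for i in range(years)], ignore_index=True)
      .drop('rate', axis=1))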
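The compounding used in this answer is efficiency * (1 + rate/100) ** n for a year offset n. A minimal loop-based sketch of the same idea, reusing the question's toy data (the names merged, frames and result are only illustrative, not part of the original answer):

```python
import pandas as pd

base_year = pd.DataFrame({'year': [2018, 2018, 2018, 2018],
                          'color': ['red', 'red', 'blue', 'blue'],
                          'shape': ['circle', 'square', 'circle', 'square'],
                          'efficiency': [50, 30, 100, 60]})
growthrate = pd.DataFrame({'color': ['red', 'blue'], 'rate': [30, 20]})

# Attach the percentage growth rate of each color to every base-year row
merged = base_year.merge(growthrate, on='color')

frames = []
for offset in range(2030 - 2018 + 1):  # offsets 0..12 cover 2018..2030
    factor = (1 + merged['rate'] / 100) ** offset  # compound growth factor
    frames.append(merged.assign(year=merged['year'] + offset,
                                efficiency=merged['efficiency'] * factor))

# Stack the yearly frames and drop the helper column
result = pd.concat(frames, ignore_index=True).drop(columns='rate')
print(result.head(9))
```

Because each year is computed directly from the base year, no row depends on the previous year's row, which avoids row-by-row updates with .loc.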
I'm trying to plot the world population from 1961 to 2013 as an animated choropleth map.
I decided to bin the 'count' series, which holds the population of each country and year, into a categorical Series using pandas.cut().
I created 7 ranges:
bins= [0, 10_000_000, 50_000_000, 100_000_000, 200_000_000, 500_000_000, 1_000_000_000, 1_500_000_000]
labels= ['0 to 10 Millions', '10 to 50 Millions', '50 to 100 Millions', '100 to 200 Millions', '200 to 500 Millions', '500 Millions to 1 Billion', '> 1 Billion']
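To make the binning step concrete, here is a small sketch of how pandas.cut() assigns these labels. The sample values are made up for illustration; by default each bin is closed on the right, and values outside the outermost edges become NaN.

```python
import pandas as pd

bins = [0, 10_000_000, 50_000_000, 100_000_000, 200_000_000,
        500_000_000, 1_000_000_000, 1_500_000_000]
labels = ['0 to 10 Millions', '10 to 50 Millions', '50 to 100 Millions',
          '100 to 200 Millions', '200 to 500 Millions',
          '500 Millions to 1 Billion', '> 1 Billion']

# Hypothetical population counts, chosen to sit on and around bin edges
sample = pd.Series([5_000_000, 10_000_000, 10_000_001,
                    1_300_000_000, 1_600_000_000])

# 8 bin edges define 7 intervals, one per label
ranges = pd.cut(sample, bins=bins, labels=labels)
print(ranges)
print(ranges.cat.ordered)  # the result is an ordered categorical
```

Note that converting the result with astype(str) discards this ordering and turns missing values into the string 'nan'.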
This is what the dataset looks like:
Dataset
When I plot it, only 6 of them are displayed and they are not even ordered (even though the result of unique() on the range column shows that there are actually 7, in the right order).
Graph at year: 1961
If I move the slider to the year 1982, one country (China) appears to have no value:
Graph at year 1987
But if I check the dataframe, it does have one:
Dataset filter to check if value of China are missing from 1982 to 1996
After reaching the year 1996, the legend changes: the range that wasn't shown before appears, but another one disappears. The country that was shown with no value is suddenly represented correctly, even though its range has been the same since 1982 (the first year in which its range changed).
Graph at year 2013
This is the code:
bins= [0, 10_000_000, 50_000_000, 100_000_000, 200_000_000, 500_000_000, 1_000_000_000, 1_500_000_000] # Assigning bins
labels= ['0 to 10 Millions', '10 to 50 Millions', '50 to 100 Millions', '100 to 200 Millions', '200 to 500 Millions', '500 Millions to 1 Billion', '> 1 Billion'] # Assigning labels
cc = coco.CountryConverter()
lst = pop_df['country_name'].unique() #list of unique countries of the dataframe
pop_iso3 = cc.convert(names=lst, to='ISO3', not_found=np.nan) #converting the countries in iso3 (np.NaN was removed in NumPy 2.0)
pop_df['iso_3'] = pop_df['country_name'].map({n:m for n, m in zip(lst, pop_iso3)}) #creating the iso3 column
#setting dataframe years from 1961 to 2013 and binning population count into categorical
pop_df['year'] = pop_df['year'].astype(int)
pop_df['pop_range'] = pd.cut(pop_df['count'], bins=bins, labels=labels)
pop_df['pop_range'] = pop_df['pop_range'].astype(str)
cond = (pop_df['year'] >= 1961) & (pop_df['year'] <= 2013)
choro_df = pop_df.loc[cond]
# Graph representation
fig = px.choropleth(
choro_df, locations="iso_3",
color="pop_range",
hover_name="country_name",
scope='world',
animation_frame='year',
color_discrete_sequence=px.colors.sequential.Plasma_r
)
# Additional traces settings
fig.update_traces(
marker=dict(
line=dict(
color='#cfcfbe',
width=1
)
)
)
# #Add chart title, format the chart, etc.
fig.update_layout(
title_text='Countries population by year (1961-2013)',
geo=dict(
showframe=False,
showcoastlines=False,
showlakes=False,
projection_type='equirectangular',
coastlinecolor='#cfcfbe',
coastlinewidth=0.5,
visible=True,
resolution=110
),
dragmode=False,
height=900,
annotations = [{
'x':0.05,
'y':0.15,
'xref':'paper',
'yref':'paper',
'text':'Source: Worldbank.org',
'showarrow':False
}]
)
fig.layout.updatemenus[0].buttons[0].args[1]['frame']['duration'] = 100
fig.layout.updatemenus[0].buttons[0].args[1]['transition']['duration'] = 5
fig.show()
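One thing that may be worth checking here: the Plotly Express documentation notes that animations work best when the categorical values mapped to color are constant across frames. A hedged diagnostic sketch, using a hypothetical miniature of choro_df (the toy rows are invented), lists which ranges are absent from each year:

```python
import pandas as pd

# Hypothetical miniature of choro_df: two years, three countries
toy = pd.DataFrame({
    'year': [1961, 1961, 1961, 1982, 1982, 1982],
    'iso_3': ['CHN', 'IND', 'USA', 'CHN', 'IND', 'USA'],
    'pop_range': ['500 Millions to 1 Billion', '200 to 500 Millions',
                  '100 to 200 Millions', '> 1 Billion',
                  '500 Millions to 1 Billion', '200 to 500 Millions'],
})

# Every range that occurs anywhere in the data
all_ranges = set(toy['pop_range'])

# For each animation frame (year), the ranges that never occur in it
missing_per_year = {
    year: sorted(all_ranges - set(group['pop_range']))
    for year, group in toy.groupby('year')
}
for year, missing in missing_per_year.items():
    print(year, missing)
```

If the real data shows ranges missing from the first frame, that would be consistent with the legend showing only 6 entries in 1961, though this sketch alone does not confirm the cause.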
I have the following data frame.
_id message_date_time country
0 {'$oid': '61f7dfd24b11720cdbda5c86'} {'$date': '2021-12-24T12:30:09Z'} RUS
1 {'$oid': '61f7eb7b4b11720cdbda9322'} {'$date': '2021-12-20T21:58:20Z'} RUS
2 {'$oid': '61f7fdad4b11720cdbdb0beb'} {'$date': '2021-12-15T15:29:13Z'} RUS
3 {'$oid': '61f8234f4b11720cdbdbec52'} {'$date': '2021-12-10T00:03:43Z'} USA
4 {'$oid': '61f82c274b11720cdbdc21c7'} {'$date': '2021-12-09T15:10:35Z'} USA
With these values
df["country"].value_counts()
RUS 156
USA 139
FRA 19
GBR 11
AUT 9
AUS 8
DEU 7
CAN 4
BLR 3
ROU 3
GRC 3
NOR 3
NLD 3
SWE 2
ESP 2
CHE 2
POL 1
HUN 1
DNK 1
ITA 1
ISL 1
BIH 1
Name: country, dtype: int64
I'm trying to plot each country against its frequency using the following:
plt.figure(figsize=(15, 8))
plt.xlabel("Frequency")
plt.ylabel("Country")
plt.hist(df["country"])
plt.show()
What I need is to show the country frequency above every bar and keep a very small space between the bars.
Arguably the easiest way is to use plt.bar(). For example:
counts = df["country"].value_counts()
names, values = counts.index.tolist(), counts.values.tolist()
plt.bar(names, values)
height_above_bar = 0.05 # distance of count from bar
fontsize = 12 # the fontsize that you want the count to have
for i, val in enumerate(values):
    # center each count horizontally above its bar
    plt.text(i, val + height_above_bar, str(val), fontsize=fontsize, ha='center')
plt.show()
For this I have used plt.bar on the count of each country, annotating every bar with its height.
counts = df["country"].value_counts()  # frequency of each country
plt.figure(figsize=(20, 5))
bars = plt.bar(counts.index, counts.values)
for bar in bars.patches:
    plt.annotate(text=str(int(bar.get_height())), xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()), va="bottom", ha="center")
plt.show()
The output should be something like this:
If you want something else to be on the graph instead of the height, just change the text parameter of the annotate function (named s before Matplotlib 3.3) to a value of your choice.
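As a further option, Matplotlib 3.4 and later provide Axes.bar_label, which places the labels without a manual loop. A minimal sketch, assuming a small stand-in for the question's df (the rows below are invented) and a bar width of 0.95 to leave only a narrow gap between bars:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the question's dataframe
df = pd.DataFrame({"country": ["RUS"] * 3 + ["USA"] * 2 + ["FRA"]})
counts = df["country"].value_counts()

fig, ax = plt.subplots(figsize=(15, 8))
# width close to 1 keeps only a small space between neighbouring bars
bars = ax.bar(counts.index, counts.values, width=0.95)
# write each bar's height just above it; padding is in points
texts = ax.bar_label(bars, padding=3)
ax.set_xlabel("Country")
ax.set_ylabel("Frequency")
# call plt.show() here when running interactively
```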
Can anyone explain how I can draw a clustered column chart exactly like this in Matplotlib? I found some similar graphs, but I want exactly the graph shown. I have fruit names such as apples and pears as keys, and their sales per year as the values of these keys.
The following code first creates some toy data and then uses matplotlib to draw a bar plot.
import matplotlib.pyplot as plt
from matplotlib.transforms import blended_transform_factory
from matplotlib.ticker import MultipleLocator
import numpy as np
import pandas as pd
import seaborn as sns
fruits = ['apples', 'pears', 'nectarines', 'plums', 'grapes', 'strawberries']
years = [2015, 2016, 2017]
num_fruit = len(fruits)
num_years = len(years)
df = pd.DataFrame({'fruit': np.tile(fruits, num_years),
'year': np.repeat(years, num_fruit),
'value': np.random.randint(1, 8, num_fruit * num_years)})
width = 0.8
for i, fruit in enumerate(fruits):
    for j, year in enumerate(years):
        plt.bar(i + width / num_years * (j - (num_years - 1) / 2),
                df[(df['fruit'] == fruit) & (df['year'] == year)]['value'],
                width / num_years, color='skyblue', ec='white')
plt.xticks([i + width / num_years * (j - (num_years - 1) / 2) for i in range(num_fruit) for j in range(num_years)],
np.tile(years, num_fruit), rotation=45)
ax = plt.gca()
ax.yaxis.set_major_locator(MultipleLocator(1))
ax.yaxis.set_minor_locator(MultipleLocator(0.2))
ax.grid(True, axis='y')
ax.autoscale(False, axis='y')
trans = blended_transform_factory(ax.transData, ax.transAxes)
for i, fruit in enumerate(fruits):
    ax.text(i, -0.2, fruit, transform=trans, ha='center')
    if i != 0:
        ax.vlines(i - 0.5, 0, -0.3, color='lightgrey', clip_on=False, transform=trans)
plt.tight_layout()
print(df)
plt.show()
For this example the data looked like:
fruit year value
0 apples 2015 1
1 pears 2015 3
2 nectarines 2015 6
3 plums 2015 3
4 grapes 2015 3
5 strawberries 2015 1
6 apples 2016 4
7 pears 2016 6
8 nectarines 2016 1
9 plums 2016 6
10 grapes 2016 4
11 strawberries 2016 5
12 apples 2017 3
13 pears 2017 6
14 nectarines 2017 7
15 plums 2017 3
16 grapes 2017 5
17 strawberries 2017 1
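If the exact tick layout is not essential, a shorter route is to pivot the long table so that each year becomes a column and let pandas draw the clusters. This is a sketch of a different technique from the answer above; it uses a legend for the years instead of a year label under every bar, and the seeded random data is only a stand-in.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

fruits = ['apples', 'pears', 'nectarines']
years = [2015, 2016, 2017]
rng = np.random.default_rng(0)  # seeded so the toy data is reproducible
df = pd.DataFrame({'fruit': np.tile(fruits, len(years)),
                   'year': np.repeat(years, len(fruits)),
                   'value': rng.integers(1, 8, len(fruits) * len(years))})

# One row per fruit, one column per year; .loc restores the original fruit order
wide = df.pivot(index='fruit', columns='year', values='value').loc[fruits]

# Each row becomes one cluster of bars, each column one bar within the cluster
ax = wide.plot.bar(width=0.8, rot=0)
ax.set_ylabel('value')
# call plt.show() here when running interactively
```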
I'm new to Python and working with data manipulation.
I have a dataframe:
df3
Out[22]:
Breed Lifespan
0 New Guinea Singing Dog 18
1 Chihuahua 17
2 Toy Poodle 16
3 Jack Russell Terrier 16
4 Cockapoo 16
.. ... ...
201 Whippet 12--15
202 Wirehaired Pointing Griffon 12--14
203 Xoloitzcuintle 13
204 Yorkie--Poo 14
205 Yorkshire Terrier 14--16
As you can see above, some of the lifespans are given as a range like 14--16. The datatype of [Lifespan] is
type(df3['Lifespan'])
Out[24]: pandas.core.series.Series
I want it to reflect the average of these two numbers, i.e. 15. I do not want any ranges, just the average as a single number. How do I do this?
Using str.split with expand=True:
df = pd.DataFrame({'Breed': ['Dog1', 'Dog2'],
'Lifespan': [12, '14--15']})
df['Lifespan'] = (df['Lifespan']
.astype(str).str.split('--', expand=True)
.astype(float).mean(axis=1)
)
df
# Breed Lifespan
# 0 Dog1 12.0
# 1 Dog2 14.5
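To see why this works, it can help to look at the intermediate steps. The sketch below uses a made-up Series (the variable names are illustrative): splitting with expand=True yields one column per part, single values leave the second column empty, and mean(axis=1) skips the missing entries.

```python
import pandas as pd

# Hypothetical lifespans mixing plain numbers and '--' ranges
lifespan = pd.Series([18, "12--15", "14--16", 13])

# Step 1: split on '--' into separate columns; single values leave column 1 empty
parts = lifespan.astype(str).str.split("--", expand=True)

# Step 2: convert every column to numbers (missing parts become NaN)
numeric = parts.apply(pd.to_numeric)

# Step 3: row-wise mean; NaN is skipped, so single values stay unchanged
average = numeric.mean(axis=1)
print(average)
```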
I have a dataframe that I want to make a scatter plot of.
The dataframe looks like:
year length Animation
0 1971 121 1
1 1939 71 1
2 1941 7 0
3 1996 70 1
4 1975 71 0
I want the points in my scatter plot to have a different color depending on the value in the Animation column.
So Animation = 1 would be yellow,
Animation = 0 would be black,
or something similar.
I tried doing the following:
dfScat = df[['year','length', 'Animation']]
dfScat = dfScat.loc[dfScat.length < 200]
axScat = dfScat.plot(kind='scatter', x=0, y=1, alpha=1/15, c=2)
This results in a continuous color bar, which makes it hard to tell the two values apart.
You can also assign discrete colors to the points by passing an array to c=,
like this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
d = {"year" : (1971, 1939, 1941, 1996, 1975),
"length" : ( 121, 71, 7, 70, 71),
"Animation" : ( 1, 1, 0, 1, 0)}
df = pd.DataFrame(d)
print(df)
colors = np.where(df["Animation"]==1,'y','k')
df.plot.scatter(x="year",y="length",c=colors)
plt.show()
This gives:
Animation length year
0 1 121 1971
1 1 71 1939
2 0 7 1941
3 1 70 1996
4 0 71 1975
Use the c parameter in scatter:
df.plot.scatter('year', 'length', c='Animation', colormap='jet')
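If more than two categories are involved, or the colors should be named explicitly, a dictionary lookup can replace np.where. A minimal sketch, assuming the same toy data; color_map and its color names are illustrative choices:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"year": [1971, 1939, 1941, 1996, 1975],
                   "length": [121, 71, 7, 70, 71],
                   "Animation": [1, 1, 0, 1, 0]})

# Explicit category -> color lookup (hypothetical choice of colors)
color_map = {1: "gold", 0: "black"}

# One color name per row, passed to c= as an array
colors = df["Animation"].map(color_map).to_numpy()

ax = df.plot.scatter(x="year", y="length", c=colors)
# call plt.show() here when running interactively
```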