I am trying to create a stacked area chart, which shows the number of customers by country.
So my data frame is:
date people country
2021-11-18 509 USA
2021-11-18 289 France
2021-11-18 234 Germany
2021-11-18 148 Poland
2021-11-18 101 China
I don't understand how to change the appearance (the colors) of this plot:
table.groupby(['date','country'])['people'].sum().unstack().plot(
    kind='area',
    figsize=(10,4))
I also tried to use the Bokeh library for a nicer visualization, but I don't know how to write the code.
Thanks for your help. It's my first post. Sorry if I missed something.
I think you are looking for the varea_stack() function in Bokeh.
My solution is based on the varea_stack-example which is part of the official documentation.
Let's assume this is your data (I added one day):
text = """date people country
2021-11-18 509 USA
2021-11-18 289 France
2021-11-18 234 Germany
2021-11-18 148 Poland
2021-11-18 101 China
2021-11-19 409 USA
2021-11-19 389 France
2021-11-19 134 Germany
2021-11-19 158 Poland
2021-11-19 191 China"""
First I bring the data into the same form as the example:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(text), sep=r'\s+', parse_dates=True, index_col=0)  # raw string avoids an invalid-escape warning
df = df.groupby(['date','country']).sum().unstack()
df.columns = df.columns.droplevel(0)
df.index.name=None
df.columns.name=None
Now the DataFrame looks like this:
China France Germany Poland USA
2021-11-18 101 289 234 148 509
2021-11-19 191 389 134 158 409
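For comparison, the same wide layout can also be reached in one step with pivot_table. This is only a sketch, assuming the columns are named date, people and country as in the question; it is an alternative, not the approach this answer uses, and the names long_df and wide_df are made up for the illustration.

```python
from io import StringIO

import pandas as pd

# A few sample rows copied from the data above.
text = """date people country
2021-11-18 509 USA
2021-11-18 289 France
2021-11-19 409 USA
2021-11-19 389 France"""

long_df = pd.read_csv(StringIO(text), sep=r"\s+", parse_dates=["date"])

# One row per date, one column per country, summed people as the values.
wide_df = long_df.pivot_table(index="date", columns="country",
                              values="people", aggfunc="sum")
wide_df.index.name = None
wide_df.columns.name = None
print(wide_df)
```

Either route should give a frame whose columns can be passed as stackers.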
Now the rest is straightforward. If your index is a DatetimeIndex, you have to set the x_axis_type of the Bokeh figure to 'datetime'. I did this for the plot below.
from bokeh.palettes import brewer
from bokeh.plotting import figure, show, output_notebook
output_notebook()
n = df.shape[1]
p = figure(x_axis_type='datetime')
p.varea_stack(stackers=df.columns, x='index', source=df, color=brewer['Spectral'][n],)
show(p)
The output looks like this:
You can redefine the color using the color-keyword if you like.
You should add colors to your source, or you could use the color palettes in Bokeh. Please check here.
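For the pandas call in the question, colors can also be set directly through the color keyword of DataFrame.plot. The sketch below assumes the wide table from the answer above; the hex colors are an arbitrary, hypothetical choice, and the Agg backend is selected only so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

# The wide table from the answer above, rebuilt by hand.
wide_df = pd.DataFrame(
    {"China": [101, 191], "France": [289, 389], "Germany": [234, 134],
     "Poland": [148, 158], "USA": [509, 409]},
    index=pd.to_datetime(["2021-11-18", "2021-11-19"]),
)

# Hypothetical color choice: one hex color per country column, in column order.
colors = ["#d7191c", "#fdae61", "#ffffbf", "#abdda4", "#2b83ba"]

ax = wide_df.plot(kind="area", figsize=(10, 4), color=colors)
```

Instead of an explicit list, a named matplotlib colormap can be passed, for example colormap="Spectral".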
Related
I am trying to create a Plotly choropleth map of the UK local authorities, using predicted autism prevalence. The script hangs, loading indefinitely, when I try to assign color='predictedprevalence' so that the choropleth shows the predicted autism rates from the dataset in each authority area. I am unsure what I need to do.
I have attached an image showing where the script hangs.
import pandas as pd
import json
from urllib.request import urlopen
import numpy as np
#With Plotly
import plotly.express as px
from geojson_rewind import rewind
with urlopen('https://raw.githubusercontent.com/plummy95/1312306/main/localauth.json') as response:
    counties = json.load(response)
df=pd.read_csv('https://raw.githubusercontent.com/plummy95/1312306/main/localauth.json')
counties_corrected=rewind(counties,rfc7946=False)
fig = px.choropleth(df, geojson=counties_corrected, locations='nuts318cd', featureidkey="properties.nuts318cd", color='predictedprevalence',
color_continuous_scale="PurPor", labels={'label name':'label name'}, title='MAP TITLE',
scope="europe")
fig.update_geos(fitbounds="locations", visible=False)
This is what my dataset (lauth2020.csv) looks like. There are 311 entries:
dataset
I have been a wee stuck for a few days with this and would really appreciate some assistance.
Thank you
Are you sure you are specifying the data correctly? It should be in CSV format, but you are reading the JSON file with pd.read_csv. I extracted the data from your CSV image and ran the code. The graph appears to be displayed correctly.
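One way to catch this kind of mix-up early is to check the columns right after loading. The helper below is a hypothetical sketch (load_prevalence and REQUIRED are names made up here); it assumes the choropleth only needs the nuts318cd and predictedprevalence columns.

```python
from io import StringIO

import pandas as pd

# Columns the choropleth call relies on (assumption based on the question's code).
REQUIRED = {"nuts318cd", "predictedprevalence"}


def load_prevalence(source):
    """Read the CSV and fail early if the plotted columns are missing."""
    df = pd.read_csv(source)
    missing = REQUIRED - set(df.columns)
    if missing:
        raise ValueError(f"not the expected CSV, missing columns: {sorted(missing)}")
    return df


# Small in-memory stand-in for the real CSV file.
good = StringIO("nuts318cd,predictedprevalence\nUKC11,357\nUKC12,551\n")
df = load_prevalence(good)
print(df)
```

Pointing the helper at a file with different contents raises a ValueError instead of handing an unusable frame to px.choropleth.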
import pandas as pd
import json
from urllib.request import urlopen
import numpy as np
#With Plotly
import plotly.express as px
from geojson_rewind import rewind
with urlopen('https://raw.githubusercontent.com/plummy95/1312306/main/localauth.json') as response:
    counties = json.load(response)
counties_corrected=rewind(counties,rfc7946=False)
import io
data = '''
nuts318cd "local authority" year predictedprevalence latitude longitude
UKC11 Adur 2020 357 50.84572 -0.32417
UKC12 Allerdale 2020 551 54.68524 -3.2809
UKC13 "Amber Valley" 2020 746 53.02884 -1.46219
UKC14 Arun 2020 845 50.84321 -0.64999
UKC21 Ashfield 2020 761 53.09747 -1.25422
UKC22 Ashford 2020 742 51.13096 0.823374
UKC23 Babergh 2020 496 52.0645 0.916149
UKD11 "Barking and Dagenham" 2020 1307 51.54555 0.129479
UKD12 Barnet 2020 2476 51.61107 -0.21819
UKD33 Barnsley 2020 1470 53.52577 -1.54925
UKD34 "Barrow-in-Furness" 2020 390 54.15731 -3.1999
UKD35 Basildon 2020 1091 51.59036 0.475055
UKD36 "Basingstoke and Deane" 2020 1055 51.25937 -1.22021
UKD37 Bassetlaw 2020 686 53.35604 -0.9787
UKD41 "Bath and North East Somerset" 2020 1226 51.35604 -2.48654
UKD42 Bedford 2020 1011 52.19628 -0.45463
UKD44 Bexley 2020 1475 51.45822 0.146212
UKD45 Birmingham 2020 7088 52.48404 -1.88141
UKD46 Blaby 2020 595 52.57706 -1.19887
UKD47 "Blackburn with Darwen" 2020 899 53.7008 -2.4636
UKD61 Blackpool 2020 821 53.82164 -3.02199
UKD62 Bolsover 2020 485 53.23875 -1.27228
UKD63 Bolton 2020 1682 53.58449 -2.47952
UKD71 Boston 2020 419 52.97794 -0.11218
UKD72 "Bournemouth, Christchurch and Poole" 2020 2381 50.74609 -1.84807
UKD73 "Bracknell Forest" 2020 755 51.4113 -0.73363
UKD74 Bradford 2020 3145 53.84382 -1.87389
UKE11 Braintree 2020 871 51.91634 0.575911
UKE12 Breckland 2020 790 52.59421 0.818716
UKE13 Brent 2020 2216 51.56438 -0.27568
UKE21 Brentwood 2020 432 51.64108 0.290091
UKE22 "Brighton and Hove" 2020 2070 50.8465 -0.15079
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')  # delim_whitespace is deprecated in recent pandas
fig = px.choropleth(df,
geojson=counties_corrected,
locations='nuts318cd',
featureidkey="properties.nuts318cd",
projection='mercator',
color='predictedprevalence',
color_continuous_scale="PurPor",
labels={'label name':'label name'},
title='MAP TITLE',
scope="europe"
)
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(autosize=False,
width=800,
height=600,
margin={"r":0,"t":20,"l":0,"b":0})
fig.show()
I have a data frame with 20 values, and I am trying to draw a bar plot of it using matplotlib. When I do it, I see only 10 bars instead of 20. I have 5 NaN values in it and 4 of them.
Here is a sample of dataframe:
Name Bonus
Jack Carpenter 890
John Clegg 653
Mike Holiday 367
Rene Moukad 900
........... ...
My code is standard:
import matplotlib.pyplot as plt
fig,ax = plt.subplots(figsize=(16,6))
plt.bar(df.Name, df.Bonus)
fig.autofmt_xdate(rotation=45)
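The post does not say why bars go missing, so the following is only a diagnostic sketch under two assumptions: a NaN height draws no visible bar, and repeated names land on the same position of matplotlib's categorical axis. The data frame here is hypothetical.

```python
import pandas as pd

# Hypothetical data: one missing bonus and one repeated name.
df = pd.DataFrame({
    "Name": ["Jack Carpenter", "John Clegg", "Mike Holiday", "John Clegg"],
    "Bonus": [890, 653, None, 367],
})

n_missing = int(df["Bonus"].isna().sum())        # NaN heights draw no visible bar
n_repeated = int(df["Name"].duplicated().sum())  # repeated labels share one x position
print(n_missing, n_repeated)

# Plotting against a unique integer position keeps one slot per row.
positions = list(range(len(df)))
```

If either count is non-zero, drawing with ax.bar(positions, df["Bonus"].fillna(0)) and labelling afterwards with ax.set_xticks(positions) and ax.set_xticklabels(df["Name"]) gives every row its own slot.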
I got the idea to try and visualize data for election donations from the fec website. Basically, I would like to create a stacked bar chart, with the X-axis being the State, Y-axis being the donated amount, and the 'stacks' being the different candidates, showing how much each candidate received from each state.
Code:
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path
pathName = r"R:\Downloads\indiv20\by_date"
dataDir = Path(pathName)
filename = "itcont_2020_20010425_20190425.txt"
fullName = dataDir / filename
data = pd.read_csv(fullName, low_memory=False, sep="|", usecols=[0, 9, 12, 14])
data.columns = ['Filer ID', 'State', 'Occupation', 'Donation Amount ($)']
data = data.dropna(subset=['Donation Amount ($)'])
# Sum only the numeric column; summing the text columns is slow and can fail
donations_by_state = data.groupby('State')['Donation Amount ($)'].sum()
plt.bar(donations_by_state.index, donations_by_state)
plt.ylabel('Donation Amount ($)')
plt.xlabel('State')
plt.title('Donations per State')
plt.show()
This plots the total contributions per state, and works great. However, when I try this groupby method to group all the data I want, I'm not sure how to plot a stacked bar chart from this data:
donations_per_candidate_per_state = data['Donation Amount ($)'].groupby([data['State'], data['Filer ID']]).sum()
State Filer ID
AA C00005561 350
C00010603 600
C00042366 115
C00309567 1675
C00331694 2500
C00365536 270
C00401224 4495
C00411330 100
C00492991 300
C00540500 300
C00641381 250
C00696948 2800
C00697441 250
C00699090 67
C00703108 1400
AB C00401224 1386
AE C00000935 295
C00003418 276
C00010603 1750
C00027466 320
C00193433 105
C00211037 251
C00216614 226
C00341396 20
C00369033 150
C00394957 50
C00401224 26538
C00438713 50
C00457325 310
C00492785 300
...
ZZ C00580100 1490
C00603084 95
C00607861 750
C00608380 125
C00618371 2199
C00630665 1000
C00632133 600
C00632398 400
C00639500 208
C00639591 1450
C00640623 6402
C00653816 1000
C00666149 1000
C00666453 2800
C00683102 1000
C00689430 3524
C00693234 13283
C00693713 1000
C00694018 2750
C00694455 12761
C00695510 1045
C00696245 250
C00696419 3000
C00696526 500
C00696948 31296
C00697441 34396
C00698050 350
C00698258 2800
C00699090 5757
C00700732 475
Name: Donation Amount ($), Length: 32662, dtype: int64
It seems to have the data tabulated in the way I need, just not sure how to plot it.
You can use the following as described here:
df = donations_per_candidate_per_state.unstack('Filer ID')
df.plot(kind='bar', stacked=True)
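To see what unstack does here, the sketch below rebuilds a handful of the (State, Filer ID) totals from the question's output by hand. The variable names are made up for the illustration, and fill_value=0 is an optional extra so that candidates with no donations from a state contribute a zero-height segment instead of NaN.

```python
import pandas as pd

# A few (State, Filer ID) totals copied from the output in the question.
index = pd.MultiIndex.from_tuples(
    [("AA", "C00005561"), ("AA", "C00010603"), ("AA", "C00401224"),
     ("AB", "C00401224"), ("AE", "C00010603"), ("AE", "C00401224")],
    names=["State", "Filer ID"],
)
donations = pd.Series([350, 600, 4495, 1386, 1750, 26538], index=index,
                      name="Donation Amount ($)")

# Rows become states, columns become filers; absent pairs are filled with 0.
wide = donations.unstack("Filer ID", fill_value=0)
print(wide)
```

Calling wide.plot(kind='bar', stacked=True) on this frame then draws one bar per state with one segment per filer. With thousands of filers the legend becomes unreadable, so restricting the columns to a few candidates first is probably worthwhile.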
I am new to Python and I'm trying to plot an overlaid histogram for a manipulated data set from Kaggle. I tried doing it with matplotlib. This is a dataset that shows the history of gun violence in the USA in recent years. I have selected only a few columns for EDA.
import pandas as pd
data_set = pd.read_csv("C:/Users/Lenovo/Documents/R related Topics/Assignment/Assignment_day2/04 Assignment/GunViolence.csv")
# .copy() avoids a SettingWithCopyWarning when adding columns below
state_wise_crime = data_set[['date', 'state', 'n_killed', 'n_injured']].copy()
date_value = pd.to_datetime(state_wise_crime['date'])
state_wise_crime['Month'] = date_value.dt.month
state_wise_crime['Year'] = date_value.dt.year  # needed by the groupby below
state_wise_crime = state_wise_crime.drop('date', axis=1)  # drop() returns a new frame
no_of_killed = state_wise_crime.groupby(['state', 'Year'])[['n_killed', 'n_injured']].sum()
I want an overlaid histogram that shows the number of people killed and the number of people injured, with the different states on the x-axis.
Welcome to Stack Overflow! Next time, please post your data in the format below (not as a link or an image) to make it easier for us to work on the problem. Also, if you ask about a graph output, showing what the desired graph should look like (even as a hand drawing) would be very helpful :)
df
state Year n_killed n_injured
0 Alabama 2013 9 3
1 Alabama 2014 591 325
2 Alabama 2015 562 385
3 Alabama 2016 761 488
4 Alabama 2017 856 544
5 Alabama 2018 219 135
6 Alaska 2014 49 29
7 Alaska 2015 84 70
8 Alaska 2016 103 88
9 Alaska 2017 70 69
As I commented on your original post, a bar plot would be more appropriate than a histogram in this case, since your purpose appears to be visualizing the summary statistics (sum) of each year with a state-wise comparison. As far as I know, the easiest option is to use Seaborn. It depends on how you want to show the data, but below is one example. The code is as simple as this:
import seaborn as sns
sns.barplot(x='Year', y='n_killed', hue='state', data=df)
Output:
Hope this helps.
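If the goal is to see killed and injured side by side for each state, as the question describes, one possible route is to reshape the table to long form first. This is a sketch under the assumption that the columns are named as in the sample table above; measure, count, totals and long_df are hypothetical names.

```python
import pandas as pd

# A few rows of the sample table above.
df = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "Year": [2013, 2014, 2014, 2015],
    "n_killed": [9, 591, 49, 84],
    "n_injured": [3, 325, 29, 70],
})

# Total per state, then one row per (state, measure) pair.
totals = df.groupby("state", as_index=False)[["n_killed", "n_injured"]].sum()
long_df = totals.melt(id_vars="state", value_vars=["n_killed", "n_injured"],
                      var_name="measure", value_name="count")
print(long_df)
```

With the data in this shape, sns.barplot(x='state', y='count', hue='measure', data=long_df) draws two bars per state.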
I am new to Pandas and I am just starting to take in the versatility of the package. While working with a small practice csv file, I pulled the following data in:
Rank Corporation Sector Headquarters Revenue (thousand PLN) Profit (thousand PLN) Employees
1.ÿ PKN Orlen SA oil and gas P?ock 79 037 121 2 396 447 4,445
2.ÿ Lotos Group SA oil and gas Gda?sk 29 258 539 584 878 5,168
3.ÿ PGE SA energy Warsaw 28 111 354 6 165 394 44,317
4.ÿ Jer¢nimo Martins retail Kostrzyn 25 285 407 N/A 36,419
5.ÿ PGNiG SA oil and gas Warsaw 23 003 534 1 711 787 33,071
6.ÿ Tauron Group SA energy Katowice 20 755 222 1 565 936 26,710
7.ÿ KGHM Polska Mied? SA mining Lubin 20 097 392 13 653 597 18,578
8.ÿ Metro Group Poland retail Warsaw 17 200 000 N/A 22,556
9.ÿ Fiat Auto Poland SA automotive Bielsko-Bia?a 16 513 651 83 919 5,303
10.ÿ Orange Polska telecommunications Warsaw 14 922 000 1 785 000 23,805
I have two serious problems with it that I cannot seem to find solution for:
1) Data in the "Revenue" and "Profit" columns is pulled in as strings because of the funny formatting with spaces between thousands, and I cannot figure out how to make Pandas translate them into floating point values.
2) Data under "Rank" column is pulled in as "1.?", "2.?" etc. What's happening there? Again, when I am trying to re-write this data with something more appropriate like "1.", "2." etc. the DataFrame just does not budge.
Ideas? Suggestions? I am also open for outright bashing because my problem might be quite obvious and silly - excuse my lack of experience then :)
I would use the converters parameter.
Pass this to your pd.read_csv call:
def space_float(x):
    # Remove the spaces between thousands; 'N/A' and empty cells become NaN
    x = x.replace(' ', '')
    return float(x) if x not in ('', 'N/A') else float('nan')
converters = {
    'Revenue (thousand PLN)': space_float,
    'Profit (thousand PLN)': space_float,
    'Rank': str.strip
}
pd.read_csv(..., converters=converters)
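A small end-to-end sketch of the converters idea follows. It rests on two assumptions the question does not confirm: that the file is tab-separated, and that the stray character after each rank is a non-breaking space (\xa0). Only two rows and four columns from the question are reproduced.

```python
from io import StringIO

import pandas as pd


def space_float(x):
    """Turn '79 037 121' into 79037121.0; 'N/A' and empty cells become NaN."""
    cleaned = x.replace(" ", "").replace("\xa0", "")
    return float(cleaned) if cleaned not in ("", "N/A") else float("nan")


# Assumption: tab-separated input, with a non-breaking space after the rank.
raw = (
    "Rank\tCorporation\tRevenue (thousand PLN)\tProfit (thousand PLN)\n"
    "1.\xa0\tPKN Orlen SA\t79 037 121\t2 396 447\n"
    "8.\xa0\tMetro Group Poland\t17 200 000\tN/A\n"
)

converters = {
    "Revenue (thousand PLN)": space_float,
    "Profit (thousand PLN)": space_float,
    "Rank": str.strip,  # str.strip also removes a trailing non-breaking space
}

df = pd.read_csv(StringIO(raw), sep="\t", converters=converters)
print(df.dtypes)
```

If the odd characters persist in the real file, the encoding argument of read_csv is probably the next thing to check, since a converter can only clean up what has been decoded correctly.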