Stacked bar chart of election donation information with groupby - python

I got the idea to try to visualize election donation data from the FEC website. Basically, I would like to create a stacked bar chart, with the X-axis being the state, the Y-axis being the donated amount, and the 'stacks' being the different candidates, showing how much each candidate received from each state.
Code:
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path
pathName = r"R:\Downloads\indiv20\by_date"
dataDir = Path(pathName)
filename = "itcont_2020_20010425_20190425.txt"
fullName = dataDir / filename
data = pd.read_csv(fullName, low_memory=False, sep="|", usecols=[0, 9, 12, 14])
data.columns = ['Filer ID', 'State', 'Occupation', 'Donation Amount ($)']
data = data.dropna(subset=['Donation Amount ($)'])
donations_by_state = data.groupby('State').sum()
plt.bar(donations_by_state.index, donations_by_state['Donation Amount ($)'])
plt.ylabel('Donation Amount ($)')
plt.xlabel('State')
plt.title('Donations per State')
plt.show()
This plots the total contributions per state and works great. However, after using the following groupby to group the data the way I want, I'm not sure how to plot a stacked bar chart from the result:
donations_per_candidate_per_state = data['Donation Amount ($)'].groupby([data['State'], data['Filer ID']]).sum()
State Filer ID
AA C00005561 350
C00010603 600
C00042366 115
C00309567 1675
C00331694 2500
C00365536 270
C00401224 4495
C00411330 100
C00492991 300
C00540500 300
C00641381 250
C00696948 2800
C00697441 250
C00699090 67
C00703108 1400
AB C00401224 1386
AE C00000935 295
C00003418 276
C00010603 1750
C00027466 320
C00193433 105
C00211037 251
C00216614 226
C00341396 20
C00369033 150
C00394957 50
C00401224 26538
C00438713 50
C00457325 310
C00492785 300
...
ZZ C00580100 1490
C00603084 95
C00607861 750
C00608380 125
C00618371 2199
C00630665 1000
C00632133 600
C00632398 400
C00639500 208
C00639591 1450
C00640623 6402
C00653816 1000
C00666149 1000
C00666453 2800
C00683102 1000
C00689430 3524
C00693234 13283
C00693713 1000
C00694018 2750
C00694455 12761
C00695510 1045
C00696245 250
C00696419 3000
C00696526 500
C00696948 31296
C00697441 34396
C00698050 350
C00698258 2800
C00699090 5757
C00700732 475
Name: Donation Amount ($), Length: 32662, dtype: int64
It seems to have the data tabulated the way I need; I'm just not sure how to plot it.

You can unstack the 'Filer ID' index level into columns and then plot the result as a stacked bar chart:
df = donations_per_candidate_per_state.unstack('Filer ID')
df.plot(kind='bar', stacked=True)
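For example, a slightly fuller sketch using your variable names (the figure size and the suppressed legend are just optional choices, since there are thousands of Filer IDs):
import matplotlib.pyplot as plt

# Pivot Filer ID into columns; states with no donation to a given
# committee become NaN, so fill with 0 before stacking.
df = donations_per_candidate_per_state.unstack('Filer ID').fillna(0)

ax = df.plot(kind='bar', stacked=True, figsize=(14, 6), legend=False)
ax.set_xlabel('State')
ax.set_ylabel('Donation Amount ($)')
ax.set_title('Donations per State and Candidate')
plt.tight_layout()
plt.show()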


Is there a website or a Python function that creates example DataFrames?

Is there a website or a function that creates example DataFrames, so that they can be used in tutorials?
Something like this:
df = pd.DataFrame({'age': [3, 29],
                   'height': [94, 170],
                   'weight': [31, 115]})
or
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
or
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
You can get over 750 datasets from pydataset
pip install pydataset
To see a list of the datasets:
from pydataset import data
# To see a list of the datasets
print(data())
Output:
dataset_id title
0 AirPassengers Monthly Airline Passenger Numbers 1949-1960
1 BJsales Sales Data with Leading Indicator
2 BOD Biochemical Oxygen Demand
3 Formaldehyde Determination of Formaldehyde
4 HairEyeColor Hair and Eye Color of Statistics Students
.. ... ...
752 VerbAgg Verbal Aggression item responses
753 cake Breakage Angle of Chocolate Cakes
754 cbpp Contagious bovine pleuropneumonia
755 grouseticks Data on red grouse ticks from Elston et al. 2001
756 sleepstudy Reaction times in a sleep deprivation study
[757 rows x 2 columns]
Usage
And to load one of the example datasets into a dataframe, it is as simple as passing the dataset_id:
from pydataset import data
df = data('cake')
print(df)
Output:
replicate recipe temperature angle temp
1 1 A 175 42 175
2 1 A 185 46 185
3 1 A 195 47 195
4 1 A 205 39 205
5 1 A 215 53 215
.. ... ... ... ... ...
266 15 C 185 28 185
267 15 C 195 25 195
268 15 C 205 25 205
269 15 C 215 31 215
270 15 C 225 25 225
[270 rows x 5 columns]
Note:
There are other packages with their own functionality, or you can create your own, as sketched below.
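For instance, a minimal sketch of rolling your own example data with NumPy (the column names and sizes here are arbitrary):
import numpy as np
import pandas as pd

# Reproducible random example data
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'group': rng.choice(list('ABC'), size=10),
    'value': rng.integers(0, 100, size=10),
    'score': rng.random(10).round(2),
})
print(df)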
You can get over 17000 datasets from the datasets package:
pip install datasets
To list all of the datasets:
from datasets import list_datasets
# Print all the available datasets
print(list_datasets())
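And to actually work with one of them as a pandas DataFrame, something along these lines should work ("imdb" is just an example dataset id):
from datasets import load_dataset

# Download a dataset and convert one split to pandas
ds = load_dataset("imdb")
df = ds["train"].to_pandas()
print(df.head())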

Different bar plot for same x-axis value

I have a dataframe as follows:
match_id team team_score
411 RCB 263
7937 KKR 250
620 RCB 248
206 CSK 246
11338 KKR 241
61 CSK 240
562 RCB 235
Now, I want to plot all these values as individual bars, but the output I'm getting is something different:
Is there any way I can make separate bars for the same x-axis value?
When 'team' is used as x, all the values for each team are averaged and a small error bar shows a confidence interval. To have each entry of the table as a separate bar, the index of the dataframe can be used for x. After creating the bars, they can be labeled with the team names.
Optionally, hue='team' colors the bars per team. Then dodge=False is needed to have the bars positioned nicely. In that case, Seaborn also creates a legend, which is not very useful here, as the same information is now also present in the x-tick labels. The legend can be suppressed via ax.legend_.remove().
from matplotlib import pyplot as plt
import pandas as pd
from io import StringIO
import seaborn as sns
data_str = StringIO("""match_id team team_score
411 RCB 263
7937 KKR 250
620 RCB 248
206 CSK 246
11338 KKR 241
61 CSK 240
562 RCB 235""")
df = pd.read_csv(data_str, delim_whitespace=True)
color_dict = {'RCB': 'dodgerblue', 'KKR': 'darkviolet', 'CSK': 'gold'}
ax = sns.barplot(x=df.index, y='team_score', hue='team', palette=color_dict, dodge=False, data=df)
ax.set_xticklabels(df['team'])
ax.legend_.remove()
plt.tight_layout()
plt.show()
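If you prefer to stay with plain pandas/matplotlib, roughly the same picture can be drawn by mapping each row's team to a color; a sketch reusing the color_dict and df from above:
# One bar per row, colored by team, without seaborn
fig, ax = plt.subplots()
ax.bar(df.index, df['team_score'], color=df['team'].map(color_dict))
ax.set_xticks(df.index)
ax.set_xticklabels(df['team'])
ax.set_ylabel('team_score')
plt.tight_layout()
plt.show()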

How to plot Numerical Values in matplotlib

So I have this kind of dataframe:
Time Type Profit
2 82 s/l -51.3
5 9 t/p 164.32
8 38 s/l -53.19
11 82 s/l -54.4
14 107 s/l -54.53
.. ... ... ...
730 111 s/l -70.72
731 111 s/l -70.72
732 111 s/l -70.72
733 113 s/l -65.13
734 113 s/l -65.13
[239 rows x 3 columns]
I want to plot a chart where X is the time (already converted to hour of week) and Y is the profit (which can be positive or negative). For each hour (X), I would like two bars showing the profit. The negative profit would be shown as positive too in this case, but in a separate bar.
For example, if we have -65 and 70, they would show as 65 and 70 on the chart, but the loss would have a different bar color.
This is my code so far:
import pandas as pd
import matplotlib.pyplot as plt

# read the csv file
data = pd.read_csv(filename)
df = pd.DataFrame(data, columns = ['Time','Type','Profit']).astype(str)
#turns time column into hours of week
df['Time'] = df['Time'].apply(lambda x: findHourOfWeek(x))
#Takes in winning trades (t/p) and losing trades(s/l)
df = df[(df['Type'] == 't/p') | (df['Type'] == 's/l')]
#Plots the chart
ax = df.plot(title='Profits and Losses (Hour Of Week)',kind='bar')
#ax.legend(['Losses', 'Winners'])
plt.xlabel('Hour of Week')
plt.ylabel('Amount Of Profit/Loss')
plt.show()
You can groupby, unstack and plot:
(df.groupby(['Time', 'Type']).Profit.sum().abs()
   .unstack('Type')
   .plot.bar()
)
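A slightly fuller sketch with fixed colors per Type and axis labels (this assumes Profit is numeric, so drop the .astype(str) from your code; the color choices are arbitrary):
import matplotlib.pyplot as plt

ax = (df.groupby(['Time', 'Type']).Profit.sum().abs()
        .unstack('Type')
        .plot.bar(color={'s/l': 'tab:red', 't/p': 'tab:green'},
                  title='Profits and Losses (Hour Of Week)'))
ax.set_xlabel('Hour of Week')
ax.set_ylabel('Amount Of Profit/Loss')
plt.show()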
For your sample data above, the output is a bar chart with a separate bar per Type for each Time value.

Subplot Multiple Columns in Pandas Python

I am new to Python and struggling to solve this efficiently. I have read a number of examples, but they were complex and hard to understand. For the dataframe below, I would like to create one subplot per column, ignoring the first two, i.e. Site_ID and Cell_ID:
Availability
VoLTE CSSR
VoLTE Attempts
Each subplot (Availability, etc.) will include the grouped Site_ID values as its legend. Each subplot is saved to a desired location.
Sample Data:
Date Site_ID Cell_ID Availability VoLTE CSSR VoLTE Attempts
22/03/2019 23181 23181B11 100 99.546435 264
03/03/2019 91219 91219A11 100 99.973934 663
17/04/2019 61212 61212A80 100 99.898843 1289
29/04/2019 91219 91219B26 99.907407 100 147
24/03/2019 61212 61212A11 100 99.831425 812
25/04/2019 61212 61212B11 100 99.91107 2677
29/03/2019 91219 91219A26 100 99.980066 1087
05/04/2019 91705 91705C11 100 99.331263 1090
04/04/2019 91219 91219A26 100 99.984588 914
19/03/2019 61212 61212B11 94.21875 99.934376 2318
23/03/2019 23182 23182B11 100 99.47367 195
02/04/2019 91219 91219A26 100 99.980123 958
26/03/2019 23181 23181A11 100 99.48185 543
19/03/2019 61212 61212A11 94.21875 99.777605 1596
18/04/2019 23182 23182B11 100 99.978012 264
26/03/2019 23181 23181C11 100 99.829911 1347
01/03/2019 91219 91219A11 100 99.770661 1499
12/03/2019 91219 91219B11 100 99.832273 1397
19/04/2019 61212 61212B80 100 99.987946 430
12/03/2019 91705 91705C11 100 98.789819 1000
Here is my inefficient solution; given there are over 100 columns, I am quite worried.
#seperates dataframes
Avail = new_df.loc[:,["Site_ID","Cell_ID","Availability"]]
V_CSSR = new_df.loc[:,["Site_ID","Cell_ID","VoLTE CSSR"]]
V_Atte = new_df.loc[:,["Site_ID","Cell_ID","VoLTE Attempts"]]
#plot each dataframe
Avail.groupby("Site_ID")["Availability"].plot(y="Availability", legend = True)
V_CSSR.groupby("Site_ID")["VoLTE CSSR"].plot(y="VoLTE CSSR", legend = True)
V_Atte.groupby("Site_ID")["VoLTE Attempts"].plot(y="VoLTE Attempts", legend = True)
This is the outcome I am after.
Not the best solution, but you can try:
import matplotlib.pyplot as plt

cols = ['Availability', 'VoLTE CSSR', 'VoLTE Attempts']  # columns to plot
fig, axes = plt.subplots(1, 3, figsize=(10, 4))
for col, ax in zip(cols, axes):
    for site in df.Site_ID.unique():
        tmp_df = df[df.Site_ID.eq(site)]
        ax.plot(tmp_df.Date, tmp_df[col], label=site)
    ax.set_title(col)
    ax.legend()
plt.show()
Output: three line subplots side by side, one per column, with one line per Site_ID.
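If each plot also needs to be saved to its own file, one option is to create a separate figure per column instead of sharing one figure; a rough sketch (the output filename is just a placeholder):
import matplotlib.pyplot as plt

cols = ['Availability', 'VoLTE CSSR', 'VoLTE Attempts']
for col in cols:
    fig, ax = plt.subplots(figsize=(8, 4))
    for site in df.Site_ID.unique():
        tmp_df = df[df.Site_ID.eq(site)]
        ax.plot(tmp_df.Date, tmp_df[col], label=site)
    ax.set_title(col)
    ax.legend()
    fig.savefig(f'{col}.png')  # placeholder path
    plt.close(fig)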

How to transform data frames into one big dataframe?

I have written a program (code below) that gives me a data frame for each file in a folder. The data frame contains the quarters of the year found in the file and the counts (how often each quarter occurs in the file). The output for one file in the loop looks, for example, like:
2008Q4 230
2009Q1 186
2009Q2 166
2009Q3 173
2009Q4 246
2010Q1 341
2010Q2 336
2010Q3 200
2010Q4 748
2011Q1 625
2011Q2 690
2011Q3 970
2011Q4 334
2012Q1 573
2012Q2 53
How can I create a big data frame where the counts for the quarters are summed up for all files in the folder?
import os
import glob
import pandas as pd

path = "crisisuser"
os.chdir(path)
result = [i for i in glob.glob('*.{}'.format("csv"))]
os.chdir("..")
for i in result:
    df = pd.read_csv("crisisuser/" + i)
    df['quarter'] = pd.PeriodIndex(df.time, freq='Q')
    df = df['quarter'].value_counts().sort_index()
I think you need to append all the Series to a list, then use concat and sum per index value:
out = []
for i in result:
    df = pd.read_csv("crisisuser/" + i)
    df['quarter'] = pd.PeriodIndex(df.time, freq='Q')
    out.append(df['quarter'].value_counts().sort_index())

s = pd.concat(out).sum(level=0)
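Note that in more recent pandas versions Series.sum(level=...) is deprecated; grouping on the index level should give the same result:
s = pd.concat(out).groupby(level=0).sum()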
