calculate average number of voters - python

npop     wht    blk    his    bush    gore
198326   74.4   21.5   4.7    34224   47465
20761    82.4   16.8   1.5    5710    2492
146223   80     12.4   1.2    38737   18950
I'm trying to figure out how to calculate the average vote per county obtained by the candidates (bush, gore). I'm using Pandas to work with the datasets. Here's what I have so far:
import pandas as pd

def averageCandidateVotes(filename, column):
    df = pd.read_csv(filename)   # 'input' shadows a builtin, so use 'df'
    candidate = df[column]
    avg_vote = candidate.mean()  # average votes per county for this candidate
    return avg_vote
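A minimal sketch of calling it, assuming the table above is saved as a CSV with those six column headers (the file name election.csv is hypothetical):
avg_bush = averageCandidateVotes('election.csv', 'bush')
avg_gore = averageCandidateVotes('election.csv', 'gore')
# for the three rows shown above: avg_bush is about 26223.67, avg_gore is 22969.0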

Related

Trouble rearranging columns with same name in pandas dataframe in Python

I have a pandas dataframe whose columns I have turned into a list, edited, and rearranged. I'm trying to reassign the columns as follows:
sapie_columns = sapie_df_working.columns.tolist()
sapie_columns = [sapie_columns[-1]] + sapie_columns[3:-1]
sapie_df_working = sapie_df_working[sapie_columns]
but this turns my dataframe (initially 32 columns) into one with 164 columns. I think this is because a number of the existing columns share the same name (e.g., "90% CI Lower Bound"). I'm curious why this is happening and how I can rearrange and edit my dataframe's columns the way I want.
For reference, here is a snippet of my dataframe:
# sapie_df_working
2 State FIPS Code County FIPS Code Postal Code Name Poverty Estimate, All Ages 90% CI Lower Bound 90% CI Upper Bound Poverty Percent, All Ages 90% CI Lower Bound 90% CI Upper Bound ... 90% CI Upper Bound Median Household Income 90% CI Lower Bound 90% CI Upper Bound Poverty Estimate, Age 0-4 90% CI Lower Bound 90% CI Upper Bound Poverty Percent, Age 0-4 90% CI Lower Bound 90% CI Upper Bound
3 00 000 US United States 38371394 38309115 38433673 11.9 11.9 11.9 ... 14.9 67340 67251 67429 3146325 3133736 3158914 16.8 16.7 16.9
4 01 000 AL Alabama 714568 695249 733887 14.9 14.5 15.3 ... 20.7 53958 53013 54903 66169 61541 70797 23.3 21.7 24.9
5 01 001 AL Autauga County 6242 4930 7554 11.2 8.8 13.6 ... 19.3 67565 59132 75998 . . . . . .
6 01 003 AL Baldwin County 20189 15535 24843 8.9 6.8 11 ... 16.1 71135 66540 75730 . . . . . .
7 01 005 AL Barbour County 5548 4210 6886 25.5 19.3 31.7 ... 47.2 38866 33510 44222 . . . . . .
df = df[specific_column_names] is indeed producing this result because of duplicate column names. Filtering with column names in this case is tricky, as it's unclear exactly which column is being referred to.
In case of duplicate column names I would instead use column indices to filter the DataFrame.
Let's look at an example:
>>> import pandas as pd
>>> mock_data = [[11.29, 33.1283, -1.219, -33.11, 930.1, 33.91, 0.1213, 0.134], [9.0, 99.101, 9381.0, -940.11, 55.41, -941.1, -1.3913, 1933.1], [-192.1, 0.123, 0.1243, 0.213, 751.1, 991.1, -1.333, 9481.1]]
>>> mock_columns = ['a', 'b', 'c', 'a', 'd', 'b', 'g', 'a']
>>> df = pd.DataFrame(columns=mock_columns, data=mock_data)
>>> df
a b c a d b g a
0 11.29 33.1283 -1.2190 -33.110 930.10 33.91 0.1213 0.134
1 9.00 99.1010 9381.0000 -940.110 55.41 -941.10 -1.3913 1933.100
2 -192.10 0.1230 0.1243 0.213 751.10 991.10 -1.3330 9481.100
>>> columns = df.columns.tolist()
>>> filtered_column_indices = [len(columns) - 1] + list(range(3, len(columns) - 1))
>>> df.iloc[:, filtered_column_indices]
a a d b g
0 0.134 -33.110 930.10 33.91 0.1213
1 1933.100 -940.110 55.41 -941.10 -1.3913
2 9481.100 0.213 751.10 991.10 -1.3330
In the example, instead of extracting column names with [sapie_columns[-1]] + sapie_columns[3:-1], I extracted the equivalent indices and used that to filter the DataFrame using iloc.
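Applied to the question's dataframe, a minimal sketch under the same slicing intent as the original name-based code (last column first, then columns 3 through the second-to-last):
n = len(sapie_df_working.columns)
sapie_df_working = sapie_df_working.iloc[:, [n - 1] + list(range(3, n - 1))]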

How to sum all rows from multiple columns

I want to perform the same operations repeatedly over several columns, but I haven't managed to do it with a list comprehension or a loop.
The dataframe I have is concern_polls and I want to rescale the percentages and the total amounts.
text very somewhat \
0 How concerned are you that the coronavirus wil... 19.00 33.00
1 How concerned are you that the coronavirus wil... 26.00 32.00
2 Taking into consideration both your risk of co... 13.00 26.00
3 How concerned are you that the coronavirus wil... 23.00 32.00
4 How concerned are you that you or someone in y... 11.00 24.00
.. ... ... ...
625 How worried are you personally about experienc... 33.09 36.55
626 How do you feel about the possibility that you... 30.00 31.00
627 Are you concerned, or not concerned about your... 34.00 35.00
628 Are you personally afraid of contracting the C... 28.00 32.00
629 Taking into consideration both your risk of co... 22.00 40.00
not_very not_at_all url
0 23.00 11.00 https://morningconsult.com/wp-content/uploads/...
1 25.00 7.00 https://morningconsult.com/wp-content/uploads/...
2 43.00 18.00 https://d25d2506sfb94s.cloudfront.net/cumulus_...
3 24.00 9.00 https://morningconsult.com/wp-content/uploads/...
4 33.00 20.00 https://projects.fivethirtyeight.com/polls/202...
.. ... ... ...
625 14.92 12.78 https://docs.google.com/spreadsheets/d/1cIEEkz...
626 14.00 16.00 https://www.washingtonpost.com/context/jan-10-...
627 19.00 12.00 https://drive.google.com/file/d/1H3uFRD7X0Qttk...
628 16.00 15.00 https://leger360.com/wp-content/uploads/2021/0...
629 21.00 16.00 https://docs.cdn.yougov.com/4k61xul7y7/econTab...
[630 rows x 15 columns]
The variables very, somewhat, not_very and not_at_all are percentages of the column sample_size, which is not shown in the snippet above. The percentages don't always add up to 100%, so I want to rescale them.
To do this, I take the following steps: I compute the row-wise sum of the four columns into a variable sums; I then recompute each percentage relative to that sum (this intermediate step could stay a variable rather than becoming a new column in the df); finally, I compute the absolute amounts from sample_size.
The code I have so far is this:
sums = concern_polls['very'] + concern_polls['somewhat'] + concern_polls['not_very'] + concern_polls['not_at_all']
concern_polls['Very'] = concern_polls['very'] / sums * 100
concern_polls['Somewhat'] = concern_polls['somewhat'] / sums * 100
concern_polls['Not_very'] = concern_polls['not_very'] / sums * 100
concern_polls['Not_at_all'] = concern_polls['not_at_all'] / sums * 100
concern_polls['Total_Very'] = concern_polls['Very'] / 100 * concern_polls['sample_size']
concern_polls['Total_Somewhat'] = concern_polls['Somewhat'] / 100 * concern_polls['sample_size']
concern_polls['Total_Not_very'] = concern_polls['Not_very'] / 100 * concern_polls['sample_size']
concern_polls['Total_Not_at_all'] = concern_polls['Not_at_all'] / 100 * concern_polls['sample_size']
I have tried to write this as a list comprehension but I can't get it to work.
Could someone offer a suggestion?
The problem I run into is that I want to do the same repetitive operations over several columns, and to sum rows across several columns, but those columns are not all of the columns in the df.
Thank you.
df['newcolumn'] = df.apply(lambda row: function(row), axis=1)
is your friend here, I think.
axis=1 means it applies the function row by row.
As an example:
concern_polls['Very'] = concern_polls.apply(lambda row: row['very'] / sums * 100, axis=1)
And if you want sums to be the combined total of all four of those df columns, it'll be
sums = concern_polls[['very', 'somewhat', 'not_very', 'not_at_all']].sum().sum()
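For the rescaling in the asker's own code, though, the sum is per row rather than a grand total. A minimal loop sketch that reproduces the eight repeated assignments above, assuming the column names from the question:
cols = ['very', 'somewhat', 'not_very', 'not_at_all']
row_sums = concern_polls[cols].sum(axis=1)  # row-wise total of the four shares
for c in cols:
    rescaled = concern_polls[c] / row_sums * 100
    concern_polls[c.capitalize()] = rescaled
    concern_polls['Total_' + c.capitalize()] = rescaled / 100 * concern_polls['sample_size']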

Find shared sub-ranges defined by start and endpoints in pandas dataframe

I need to combine two dataframes that contain information about train track sections: the "Line" column identifies a track section, while the two attributes "A" and "B" are given for subsections of the line defined by a start point and an end point; these subsections do not match between the two dataframes:
df1
Line startpoint endpoint Attribute_A
100 2.506 2.809 B-70
100 2.809 2.924 B-91
100 2.924 4.065 B-84
100 4.065 4.21 B-70
100 4.21 4.224 B-91
...
df2
Line startpoint endpoint Attribute_B
100 2.5 2.6 140
100 2.6 2.7 158
100 2.7 2.8 131
100 2.8 2.9 124
100 2.9 3.0 178
...
What I would need is a merged dataframe that gives me the combination of Attributes A and B for the respective minimal subsections where they are shared:
df3
Line startpoint endpoint Attribute_A Attribute_B
100 2.5 2.506 nan 140
100 2.506 2.6 B-70 140
100 2.6 2.7 B-70 158
100 2.7 2.8 B-70 131
100 2.8 2.809 B-70 124
100 2.809 2.9 B-91 124
100 2.9 2.924 B-91 178
100 2.924 3.0 B-84 178
...
How can I best do this in Python? I'm somewhat new to it, and while I can manage basic calculations between rows and columns, I'm at my wit's end with this problem; the approach of merging and sorting the two dataframes and calculating the differences between start/end points didn't get me very far, and I can't seem to find applicable information on the forums. I'm grateful for any hint!
Here is my solution, a bit long but it works:
First step is finding the intervals:
all_start_points = set(df1['startpoint'].values.tolist() + df2['startpoint'].values.tolist())
all_end_points = set(df1['endpoint'].values.tolist() + df2['endpoint'].values.tolist())
all_points = sorted(list(all_start_points.union(all_end_points)))
intervals = [(start, end) for start, end in zip(all_points[:-1], all_points[1:])]
Then we need to find the relevant interval in each dataframe (if present):
import numpy as np
import pandas as pd

def find_interval(df, interval):
    return df[(df['startpoint'] <= interval[0]) &
              (df['endpoint'] >= interval[1])]
attr_A = [find_interval(df1, intv)['Attribute_A'] for intv in intervals]
attr_A = [el.iloc[0] if len(el)>0 else np.nan for el in attr_A]
attr_B = [find_interval(df2, intv)['Attribute_B'] for intv in intervals]
attr_B = [el.iloc[0] if len(el)>0 else np.nan for el in attr_B]
Finally, we put everything together:
out = pd.DataFrame(intervals, columns = ['startpoint', 'endpoint'])
out = pd.concat([out, pd.Series(attr_A).to_frame('Attribute_A'), pd.Series(attr_B).to_frame('Attribute_B')], axis = 1)
out['Line'] = 100
And I get the expected result:
out
Out[111]:
startpoint endpoint Attribute_A Attribute_B Line
0 2.500 2.506 NaN 140.0 100
1 2.506 2.600 B-70 140.0 100
2 2.600 2.700 B-70 158.0 100
3 2.700 2.800 B-70 131.0 100
4 2.800 2.809 B-70 124.0 100
5 2.809 2.900 B-91 124.0 100
6 2.900 2.924 B-91 178.0 100
7 2.924 3.000 B-84 178.0 100
8 3.000 4.065 B-84 NaN 100
9 4.065 4.210 B-70 NaN 100
10 4.210 4.224 B-91 NaN 100
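A possible alternative to the per-interval scan is a vectorised lookup with pd.IntervalIndex; a sketch, assuming the subsections within each dataframe do not overlap:
mids = [(s + e) / 2 for s, e in intervals]  # a point strictly inside each minimal interval
ii1 = pd.IntervalIndex.from_arrays(df1['startpoint'], df1['endpoint'])
pos = ii1.get_indexer(mids)  # -1 where no df1 subsection covers the point
attr_A = [df1['Attribute_A'].iloc[p] if p != -1 else np.nan for p in pos]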

Txt file into a list of lists in Python

I have looked at other posts here but I can't seem to find any help with what I'm specifically trying to do.
So I have data in 'food.txt'. It represents annual consumption per capita, and I have to open and read the txt file into a list of lists named data.
FOOD | 1980 1985 1990 1995 2000 2005
-----+-------------------------------------------------
BEEF | 72.1 68.1 63.9 63.5 64.5 62.4
PORK | 52.1 49.2 46.4 48.4 47.8 46.5
FOWL | 40.8 48.5 56.2 62.1 67.9 73.6
FISH | 12.4 13.7 14.9 14.8 15.2 16.1
This is what I have so far, to split it into lines:
data = []
filename = 'food.txt'
with open(filename, 'r') as inputfile:
    for line in inputfile:
        data.append(line.strip().split(','))
This separates the file into lines, but I can't use the result as input for the graphs, which is the second part (that I do know how to do). I should be able to index it as below, since that would give only the numerical values, which is what I need.
years = data[0][1:]
porkconsumption = data[2][1:]
Any help would be appreciated, thank you.
I suspect what you have, after processing, is a list of strings like so
['ABC 345 678','DEF 789 657']
Change your code to say line.strip().split(), and you will see your data list will be filled with lists like so:
[['ABC', '345', '678'],['DEF','789','657']]
Then loop through these to turn them into numbers you can plot. Note that the consumption values are floats (so use float, not int), each data row contains a '|' token, and the dashed separator ends up as row 1, making PORK row 3:
pork = [float(x) for x in data[3][2:]]
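Putting it together, a minimal sketch of the full parse under the file layout shown in the question (header row, dashed separator row, then one row per food):
data = []
with open('food.txt') as inputfile:
    for line in inputfile:
        data.append(line.strip().split())
years = [int(y) for y in data[0][2:]]   # skip 'FOOD' and '|'
beef = [float(x) for x in data[2][2:]]  # row 1 is the dashed separator
pork = [float(x) for x in data[3][2:]]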

Find DAY with HIGHEST and LOWEST windspeeds using R, Python, or EXCEL

I have an Excel file with 31 tabs, each corresponding to a day in the month of May. Each tab or sheet has 3 columns (Height, Spd, Dir).
I want to find the day that has the maximum wind speeds. I tried using Excel's MAX function, =MAX(wind1:wind31!C1:C17), but it only gave a maximum value. Is there a way to determine the day that has the highest wind speeds of the entire month, not just one max value, seeing as the height plays a role? Do I have to do some statistical juggling (pardon the lingo)?
I have the R software as well as Python but I am mostly a novice.
These are data from 3 of the 31 sheets.
Day 1 Day 2 Day 3 and so on
Height Dir Spd Height Dir Spd Height Dir Spd
139 333 6.5 110 254 3.6 157 341 6.9
790 343 5.9 767 264 4.3 814 357 6.2
1492 343 5.7 1471 274 6.6 1522 0 5.6
3079 297 9.4 3061 284 14.9 3127 317 10.3
4311 293 19 4291 289 21.9 4375 309 14.9
5731 291 28.6 5706 292 30.4 5809 306 19.1
7406 288 38.7 7381 294 42.8 7498 299 22.4
9462 286 47.6 9440 294 56 9550 290 22.5
10694 285 47.9 10679 293 61 10777 288 22.4
12129 281 46.9 12130 296 60.6 12207 292 23.8
13940 279 33.8 13936 296 40.4 13994 282 25.4
16473 279 13.8 16464 282 13.7 16517 286 11.7
18673 278 3 18665 324 2.9 18716 323 2.6
20786 63 2.3 20775 61 2.9 20824 59 4.1
24036 100 6 24015 104 4.4 24072 96 6.9
26676 85 5.5 26656 73 4 26719 83 7.9
31287 103 6.9 31253 102 7.9 31335 101 10.2
If you get your data into a contiguous format like this:
Day Height Dir Spd
1 139 333 6.5
1 790 343 5.9
1 1492 343 5.7
. . . .
. . . .
. . . .
2 110 254 3.6
2 767 264 4.3
. . . .
. . . .
31 26719 83 7.9
31 31335 101 10.2
You can simply use the Excel formula =OFFSET(A1,MATCH(MAX(Spd),Spd,0),0), where cell A1 is the top left of the grid and contains the word Day, and Spd names the whole Spd column. OFFSET and MATCH are built-in Excel functions.
Another solution would be to name the ranges of the Spd data in each sheet, say Spd_1, Spd_2, and so on, for each day. The Excel functions MAX(INDIRECT("Spd_1")), MAX(INDIRECT("Spd_2")), etc., could then be used on the named ranges, represented as strings, in a single sheet. You could then take a single MAX over those results to find the corresponding day.
If you can load the same data up in R as a data frame, then you can do something like this
subset(df,Spd==max(df[,"Spd"]))$Day where df is the name of the data frame you read in via read.csv, or read.table, or something similar.
Both of the above can be repeated with min in place of max to find the lowest speed.
If you can't get it into that format, or cannot use Excel's INDIRECT, then the best solution would be to use simple VBA in Excel to loop through the sheets.
In all cases you may have to think about how you will deal with ties - as in 2 or more different days with the same (maximum) speed.
If you can live with R making unique names for repeated column names, you won't need to muck with getting the day # into the individual column names (that munging is a bit much for this post). You can just remove the "Day" header row, leaving a month of columns of readings side by side as you have above, and make that into a CSV that R can read with read.csv().
This is the R data frame structure from reading in the data snippet above:
dat <- structure(list(Height = c(139L, 790L, 1492L, 3079L, 4311L, 5731L,
7406L, 9462L, 10694L, 12129L, 13940L, 16473L, 18673L, 20786L,
24036L, 26676L, 31287L), Dir = c(333L, 343L, 343L, 297L, 293L,
291L, 288L, 286L, 285L, 281L, 279L, 279L, 278L, 63L, 100L, 85L,
103L), Spd = c(6.5, 5.9, 5.7, 9.4, 19, 28.6, 38.7, 47.6, 47.9,
46.9, 33.8, 13.8, 3, 2.3, 6, 5.5, 6.9), Height.1 = c(110L, 767L,
1471L, 3061L, 4291L, 5706L, 7381L, 9440L, 10679L, 12130L, 13936L,
16464L, 18665L, 20775L, 24015L, 26656L, 31253L), Dir.1 = c(254L,
264L, 274L, 284L, 289L, 292L, 294L, 294L, 293L, 296L, 296L, 282L,
324L, 61L, 104L, 73L, 102L), Spd.1 = c(3.6, 4.3, 6.6, 14.9, 21.9,
30.4, 42.8, 56, 61, 60.6, 40.4, 13.7, 2.9, 2.9, 4.4, 4, 7.9),
Height.2 = c(157L, 814L, 1522L, 3127L, 4375L, 5809L, 7498L,
9550L, 10777L, 12207L, 13994L, 16517L, 18716L, 20824L, 24072L,
26719L, 31335L), Dir.2 = c(341L, 357L, 0L, 317L, 309L, 306L,
299L, 290L, 288L, 292L, 282L, 286L, 323L, 59L, 96L, 83L,
101L), Spd.2 = c(6.9, 6.2, 5.6, 10.3, 14.9, 19.1, 22.4, 22.5,
22.4, 23.8, 25.4, 11.7, 2.6, 4.1, 6.9, 7.9, 10.2)), .Names = c("Height",
"Dir", "Spd", "Height.1", "Dir.1", "Spd.1", "Height.2", "Dir.2",
"Spd.2"), class = "data.frame", row.names = c(NA, -17L))
and here it is in a slightly more descriptive format:
str(dat)
## 'data.frame': 17 obs. of 9 variables:
## $ Height : int 139 790 1492 3079 4311 5731 7406 9462 10694 12129 ...
## $ Dir : int 333 343 343 297 293 291 288 286 285 281 ...
## $ Spd : num 6.5 5.9 5.7 9.4 19 28.6 38.7 47.6 47.9 46.9 ...
## $ Height.1: int 110 767 1471 3061 4291 5706 7381 9440 10679 12130 ...
## $ Dir.1 : int 254 264 274 284 289 292 294 294 293 296 ...
## $ Spd.1 : num 3.6 4.3 6.6 14.9 21.9 30.4 42.8 56 61 60.6 ...
## $ Height.2: int 157 814 1522 3127 4375 5809 7498 9550 10777 12207 ...
## $ Dir.2 : int 341 357 0 317 309 306 299 290 288 292 ...
## $ Spd.2 : num 6.9 6.2 5.6 10.3 14.9 19.1 22.4 22.5 22.4 23.8 ...
To get the column name of the max speed value for the whole data frame, we'll need to first just work on the "Spd" columns:
# only work with "Spd" columns
tmp <- dat[,which(grepl("Spd", names(dat)))]
# showing what we have left
str(tmp)
## 'data.frame': 17 obs. of 3 variables:
## $ Spd : num 6.5 5.9 5.7 9.4 19 28.6 38.7 47.6 47.9 46.9 ...
## $ Spd.1: num 3.6 4.3 6.6 14.9 21.9 30.4 42.8 56 61 60.6 ...
## $ Spd.2: num 6.9 6.2 5.6 10.3 14.9 19.1 22.4 22.5 22.4 23.8 ...
Then get the max value for each column:
# get max value in each "Spd" column
apply(tmp, 2, max)
## Spd Spd.1 Spd.2
## 47.9 61.0 25.4
But we really just want the column with the overall max value, so we'll feed that apply into which.max:
# which one of those has the max value (returns name & position)
which.max(apply(tmp, 2, max))
## Spd.1
## 2
And are left with the column name/# with the max value.
All of that can be done in one horribly unreadable line:
which.max(apply(dat[, which(grepl("Spd", names(dat)))], 2, max))
which I'm only including to show that the operation isn't as complex as the explanation might make it seem.
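For comparison, the same column-wise idea could be sketched in pandas (the file name may_wide.csv is hypothetical; when the side-by-side sheet is saved as one CSV, pandas mangles the duplicate headers to Spd, Spd.1, Spd.2, ... much like R does):
import pandas as pd
wide = pd.read_csv('may_wide.csv')
spd = wide.filter(like='Spd')   # keep only the Spd columns
print(spd.max().idxmax())       # column name holding the overall maximum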
Python with the pandas module is one possible solution:
#! /usr/bin/env python
import pandas as pd
# Export the tabs as csv-files: day1.csv, day2.csv, ..., day31.csv.
# Assume the first line is a header line and that columns are
# separated by ',':
#
# Height , Dir , Spd
# 139 , 333 , 6.5
# 790 , 343 , 5.9
# ...
#
# Use our own column names and skip the header.
column_names = ['height', 'direction', 'speed']
# Read in the data for each day.
alldays = []
for d in range(1, 32):
fname = "day{}.csv".format(d)
frame = pd.read_csv(fname, names=column_names, header=0)
frame['day'] = d
alldays.append(frame)
# Concatenate all days into DataFrame.
data = pd.concat(alldays, ignore_index=True)
# Get index for max and use it to retrieve the day and the speed.
idx_max = data.speed.idxmax()
max_row = data.loc[idx_max]
print("Maximum wind speed {} on day {}".format(max_row.speed, int(max_row.day)))
# Same as above but for the minimum.
idx_min = data.speed.idxmin()
min_row = data.loc[idx_min]
print("Minimum wind speed {} on day {}".format(min_row.speed, int(min_row.day)))
Save this as the script highlow.py. Using IPython and the example data provided, I get the following:
>>> run highlow
Maximum wind speed 61.0 on day 2
Minimum wind speed 2.3 on day 1
>>> data.speed.describe()
count 51.000000
mean 18.209804
std 16.784853
min 2.300000
25% 5.800000
50% 10.300000
75% 24.600000
max 61.000000
dtype: float64
>>>
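As a follow-up to the note about ties above, a short sketch that returns every day attaining the maximum rather than just the first match:
top = data.speed.max()
print(data.loc[data.speed == top, ['day', 'speed']])  # all days tied at the maximum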
