Python, Web Scraping a bar graph

I am currently trying to scrape the bar graph/chart from this page, but am unsure which BeautifulSoup features are needed to extract this kind of bar chart. Additionally, if anyone has a link describing which BeautifulSoup features are used for scraping different types of charts/graphs, that would be greatly appreciated. https://www.statista.com/statistics/215655/number-of-registered-weapons-in-the-us-by-state/
Here is the code I have so far:
import pandas as pd
import requests
from bs4 import BeautifulSoup
dp = 'https://www.statista.com/statistics/215655/number-of-registered-weapons-in-the-us-by-state/'
page = requests.get(dp).text
soup = BeautifulSoup(page, 'html.parser')
#This is what I am trying to figure out
new = soup.find("div", id="bar")
print(new)

This script will get all of the data behind the bar graph (Statista also embeds it in the page as an HTML table):
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.statista.com/statistics/215655/number-of-registered-weapons-in-the-us-by-state/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
tds = soup.select('#statTableHTML td')
data = []
for td1, td2 in zip(tds[::2], tds[1::2]):
    data.append({'State': td1.text, 'Number': td2.text})
df = pd.DataFrame(data)
print(df)
Prints:
State Number
0 Texas 725,368
1 Florida 432,581
2 California 376,666
3 Virginia 356,963
4 Pennsylvania 271,427
5 Georgia 225,993
6 Arizona 204,817
7 North Carolina 181,209
8 Ohio 175,819
9 Alabama 168,265
10 Illinois 147,698
11 Wyoming 134,050
12 Indiana 133,594
13 Maryland 128,289
14 Tennessee 121,140
15 Washington 119,829
16 Louisiana 116,398
17 Colorado 112,691
18 Arkansas 108,801
19 New Mexico 105,836
20 South Carolina 99,283
21 Minnesota 98,585
22 Nevada 96,822
23 Kentucky 93,719
24 Utah 93,440
25 New Jersey 90,217
26 Missouri 88,270
27 Michigan 83,355
28 Oklahoma 83,112
29 New York 82,917
30 Wisconsin 79,639
31 Connecticut 74,877
32 Oregon 74,722
33 District of Columbia 59,832
34 New Hampshire 59,341
35 Idaho 58,797
36 Kansas 54,409
37 Mississippi 52,346
38 West Virginia 41,651
39 Massachusetts 39,886
40 Iowa 36,540
41 South Dakota 31,134
42 Nebraska 29,753
43 Montana 23,476
44 Alaska 20,520
45 North Dakota 19,720
46 Maine 17,410
47 Hawaii 8,665
48 Vermont 7,716
49 Delaware 5,281
50 Rhode Island 4,655
51 *Other US Territories 866
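Note that the Number values come back as text with thousands separators. If you want actual numbers for sorting or arithmetic, a small follow-up on the df built above (a sketch, assuming the scrape succeeded) is:
df['Number'] = df['Number'].str.replace(',', '', regex=False).astype(int)
print(df.dtypes)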

You can find more about web scraping in this tutorial: https://www.datacamp.com/community/tutorials/web-scraping-using-python

Related

Is there a way to iterate through a column in pandas if it is an index

I have a pandas DataFrame which looks like this
Region Sub Region Country Size Plants Birds Mammals
Africa Northern Africa Algeria 2380000 22 41 15
Egypt 1000000 8 58 14
Libya 1760000 7 32 8
Sub-Saharan Africa Angola 1250000 34 53 32
Benin 115000 20 40 12
Western Africa Cape Verde 4030 51 35 7
Americas Latin America Antigua 440 4 31 3
Argentina 2780000 70 42 52
Bolivia 1100000 106 8 55
Northern America Canada 9980000 18 44 24
Grenada 340 3 29 2
USA 9830000 510 251 91
Asia Central Asia Kazakhstan 2720000 14 14 27
Kyrgyz 200000 13 3 15
Uzbekistan 447000 16 7 19
Eastern Asia China 9560000 593 136 96
Japan 378000 50 77 49
South Korea 100000 31 28 33
So I am trying to prompt the user to input a value and, if the input exists within the Sub Region column, perform a particular task.
I tried turning the 'Sub Region' column into a list and iterating through it to match the user input:
sub_region_list = []
for i in world_data.index.values:
    sub_region_list.append(i[1])
print(sub_region_list[0])
That is not the output I had in mind.
I believe there is an easier way to do this but cannot seem to figure it out.
You can use get_level_values to filter.
sub_region = input("Enter a sub region:")
if sub_region not in df.index.get_level_values('Sub Region'):
    raise ValueError("You must enter a valid sub-region")
If you want to save the column values in a list, try:
df.index.get_level_values("Sub Region").unique().to_list()
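For reference, here is a minimal self-contained sketch of the same membership check (the tiny frame below is made up for illustration):
import pandas as pd
idx = pd.MultiIndex.from_tuples(
    [('Africa', 'Northern Africa'), ('Africa', 'Sub-Saharan Africa'),
     ('Asia', 'Central Asia')],
    names=['Region', 'Sub Region'])
world_data = pd.DataFrame({'Plants': [22, 34, 14]}, index=idx)
# Membership test against a single index level
print('Central Asia' in world_data.index.get_level_values('Sub Region'))  # True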

Web scrape Sports-Reference with Python Beautiful Soup

I am trying to scrape data from Nick Saban's sports reference page so that I can pull in the list of All-Americans he coached and then his Bowl Win-Loss percentage.
I am new to Python so this has been a massive struggle. When I inspect the page I see <div id="leaderboard_all-americans" class="data_grid_box">.
When I run the code below I am getting the Coaching Record table, which is the first table on the site. I tried using different indexes, thinking it might give me a different result, but that did not work either.
Ultimately, I want to get the All-American data and turn it into a data frame.
import requests
import bs4
import pandas as pd
saban2 = requests.get("https://www.sports-reference.com/cfb/coaches/nick-saban-1.html")
saban_soup2 = bs4.BeautifulSoup(saban2.text,"lxml")
saban_select = saban_soup2.select('div',{"id":"leaderboard_all-americans"})
saban_df2 = pd.read_html(str(saban_select))
sports-reference.com stores the HTML tables as comments in the basic request response. You have to first grab the commented block with the All-Americans and bowl results, and then parse that result:
import bs4
from bs4 import BeautifulSoup as soup
import requests, pandas as pd
# Parse the page; the leaderboard tables are embedded in HTML comments
d = soup(requests.get('https://www.sports-reference.com/cfb/coaches/nick-saban-1.html').text, 'html.parser')
# Find the comment node that contains the All-Americans leaderboard
block = [i for i in d.find_all(string=lambda text: isinstance(text, bs4.Comment)) if 'id="leaderboard_all-americans"' in i][0]
# Re-parse the comment text as HTML
b = soup(str(block), 'html.parser')
players = [i for i in b.select('#leaderboard_all-americans table.no_columns tr')]
# Each row holds a linked name plus a year range in the trailing text
p_results = [{'name': i.td.a.text, 'year': i.td.contents[-1][2:-1]} for i in players]
all_americans = pd.DataFrame(p_results)
# The bowl win-loss percentage lives in the same commented block
bowl_win_loss = b.select_one('#leaderboard_win_loss_pct_post td.single').contents[-2]
print(all_americans)
print(bowl_win_loss)
Output:
all_americans
name year
0 Jonathan Allen 2016
1 Javier Arenas 2009
2 Mark Barron 2011
3 Antoine Caldwell 2008
4 Ha Ha Clinton-Dix 2013
5 Terrence Cody 2008-2009
6 Landon Collins 2014
7 Amari Cooper 2014
8 Landon Dickerson 2020
9 Minkah Fitzpatrick 2016-2017
10 Reuben Foster 2016
11 Najee Harris 2020
12 Derrick Henry 2015
13 Dont'a Hightower 2011
14 Mark Ingram 2009
15 Jerry Jeudy 2018
16 Mike Johnson 2009
17 Barrett Jones 2011-2012
18 Mac Jones 2020
19 Ryan Kelly 2015
20 Cyrus Kouandjio 2013
21 Chad Lavalais 2003
22 Alex Leatherwood 2020
23 Rolando McClain 2009
24 Demarcus Milliner 2012
25 C.J. Mosley 2012-2013
26 Reggie Ragland 2015
27 Josh Reed 2001
28 Trent Richardson 2011
29 A'Shawn Robinson 2015
30 Cam Robinson 2016
31 Andre Smith 2008
32 DeVonta Smith 2020
33 Marcus Spears 2004
34 Patrick Surtain II 2020
35 Tua Tagovailoa 2018
36 Deionte Thompson 2018
37 Chance Warmack 2012
38 Ben Wilkerson 2004
39 Jonah Williams 2018
40 Quinnen Williams 2018
bowl_win_loss:
' .63 (#23)'
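As an alternative to walking the rows by hand, once you have the commented block you can often let pandas parse the embedded table directly (a sketch reusing the block variable from the code above; pd.read_html needs lxml or html5lib installed):
tables = pd.read_html(str(block))
print(tables[0].head())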

Leave rows in pandas dataframe based on a column in another data frame

I have this df
nhl_df=pd.read_csv("assets/nhl.csv")
cities=pd.read_html("assets/wikipedia_data.html")[1]
cities=cities.iloc[:-1,[0,3,5,6,7,8]]
cities = cities.rename(columns={'Population (2016 est.)[8]': 'Population'})
cities = cities[['Metropolitan area','Population']]
print(cities)
Metropolitan area Population
0 New York City 20153634
1 Los Angeles 13310447
2 San Francisco Bay Area 6657982
3 Chicago 9512999
4 Dallas–Fort Worth 7233323
5 Washington, D.C. 6131977
6 Philadelphia 6070500
7 Boston 4794447
8 Minneapolis–Saint Paul 3551036
9 Denver 2853077
10 Miami–Fort Lauderdale 6066387
11 Phoenix 4661537
12 Detroit 4297617
13 Toronto 5928040
14 Houston 6772470
15 Atlanta 5789700
16 Tampa Bay Area 3032171
17 Pittsburgh 2342299
18 Cleveland 2055612
19 Seattle 3798902
20 Cincinnati 2165139
21 Kansas City 2104509
22 St. Louis 2807002
23 Baltimore 2798886
24 Charlotte 2474314
25 Indianapolis 2004230
26 Nashville 1865298
27 Milwaukee 1572482
28 New Orleans 1268883
29 Buffalo 1132804
30 Montreal 4098927
31 Vancouver 2463431
32 Orlando 2441257
33 Portland 2424955
34 Columbus 2041520
35 Calgary 1392609
36 Ottawa 1323783
37 Edmonton 1321426
38 Salt Lake City 1186187
39 Winnipeg 778489
40 San Diego 3317749
41 San Antonio 2429609
42 Sacramento 2296418
43 Las Vegas 2155664
44 Jacksonville 1478212
45 Oklahoma City 1373211
46 Memphis 1342842
47 Raleigh 1302946
48 Green Bay 318236
49 Hamilton 747545
50 Regina 236481
It has 50 rows.
My second df has 28 rows
W/L Ratio
city
Arizona 0.707317
Boston 2.500000
Buffalo 0.555556
Calgary 1.057143
Carolina 1.028571
Chicago 0.846154
Colorado 1.433333
Columbus 1.500000
Dallas–Fort Worth 1.312500
Detroit 0.769231
Edmonton 0.900000
Florida 1.466667
Los Angeles 1.655862
Minnesota 1.730769
Montreal 0.725000
Nashville 2.944444
New York City 1.111661
Ottawa 0.651163
Philadelphia 1.615385
Pittsburgh 1.620690
San Jose 1.666667
St. Louis 1.375000
Tampa Bay 2.347826
Toronto 1.884615
Vancouver 0.775000
Vegas 2.125000
Washington 1.884615
Winnipeg 2.600000
I need to remove from the first dataframe the rows where the metropolitan area is not in the city column of the 2nd data frame.
I tried this:
cond = nhl_df['city'].isin(cities['Metropolitan Area'])
But I got this error which makes no sense
KeyError: 'city'
You need to select the Metropolitan area column of cities and test membership against the second DataFrame's index with isin, then keep the matching rows with the boolean mask:
cond = cities['Metropolitan area'].isin(nhl_df.index)
df = cities[cond]
If the first column is not the index in the second DataFrame:
cond = cities['Metropolitan area'].isin(nhl_df['city'])
df = cities[cond]
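Here is a tiny self-contained demonstration of the isin pattern (toy data, not the frames above):
import pandas as pd
cities = pd.DataFrame({'Metropolitan area': ['Boston', 'Regina', 'Denver']})
nhl_df = pd.DataFrame({'W/L Ratio': [2.5, 1.43]}, index=['Boston', 'Denver'])
cond = cities['Metropolitan area'].isin(nhl_df.index)
print(cities[cond])  # keeps Boston and Denver; Regina is dropped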

Webscraping data from collapsible table returning empty frame

Trying to scrape COVID cases from here:
https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html
If you click on the "+" next to "States" below the map, you'll see the count of cases for each state. I want a dataframe that looks like this for each state:
Alabama 1841
Alaska 185
American Samoa 0
With my attempt, containers is empty:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("div", {"class": "rt-td"})
I understand I'll need to loop through to get the info for each state, but I need help getting the basic code to work. This is my first attempt at web scraping; I'm pretty sure I'm using the wrong tags or findAll arguments. I've tried a couple of different combinations and none work.
I found a lady who did something similar to what I want here:
https://towardsdatascience.com/scrape-cdc-for-covid-19-cases-a162924073ad
But she's a developer I think and her skills are above mine. There has to be a much simpler way to do this. Right?
Thanks in advance.
Yes, there is a far better way here. The data is returned as a JSON response. Simply pull the JSON, then use pandas to normalize it.
import requests
from pandas.io.json import json_normalize
url="https://www.cdc.gov/coronavirus/2019-ncov/map-cases-us.json"
jsonData = requests.get(url).json()
df = json_normalize(jsonData['data'])
df = df[['Jurisdiction', 'Cases Reported']].dropna()
Output:
print(df)
Jurisdiction Cases Reported
0 Alabama 1841.0
1 Alaska 185.0
2 American Samoa 0.0
3 Arizona 2269.0
4 Arkansas 853.0
5 California 13438.0
6 Colorado 4950.0
7 Connecticut 5675.0
8 Delaware 673.0
9 District of Columbia 998.0
10 Florida 11961.0
11 Georgia 6752.0
12 Guam 110.0
13 Hawaii 324.0
14 Idaho 1101.0
15 Illinois 11256.0
16 Indiana 4411.0
17 Iowa 868.0
18 Kansas 813.0
19 Kentucky 955.0
20 Louisiana 13010.0
21 Maine 470.0
22 Marshall Islands 0.0
23 Maryland 4045.0
24 Massachusetts 12500.0
25 Michigan 15718.0
26 Micronesia 0.0
27 Minnesota 986.0
28 Mississippi 1738.0
29 Missouri 2367.0
30 Montana 300.0
31 Nebraska 367.0
32 Nevada 1836.0
33 New Hampshire 669.0
34 New Jersey 37505.0
35 New Mexico 624.0
36 New York 119435.0
37 North Carolina 2870.0
38 North Dakota 207.0
39 Northern Marianas 8.0
40 Ohio 4043.0
41 Oklahoma 1250.0
42 Oregon 1068.0
43 Palau 0.0
44 Pennsylvania 11510.0
45 Puerto Rico 475.0
46 Rhode Island 922.0
47 South Carolina 2049.0
48 South Dakota 240.0
49 Tennessee 3633.0
50 Texas 6812.0
51 Utah 1605.0
52 Vermont 512.0
53 Virgin Islands 42.0
54 Virginia 2878.0
55 Washington 6973.0
56 West Virginia 324.0
57 Wisconsin 2267.0
58 Wyoming 200.0
To find this (it's not always available), go to the site, right-click, and open Inspect (Dev Tools). Then search in Network -> XHR. If it's empty/blank you may need to refresh/reload the page.
Then you need to search/investigate to see if the data you want is there, which in this case I did find.
Once you find it, you can go to Headers to find the relevant info/parameters you'll need to fetch the data.
This can/will be different for other sites, some more complicated than this, and some sites won't work at all. But this is the general method.
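One more note on the code above: since pandas 1.0, json_normalize is available at the top level, so the import can be simplified (a sketch, assuming the endpoint still returns the same shape):
import requests
import pandas as pd
url = "https://www.cdc.gov/coronavirus/2019-ncov/map-cases-us.json"
jsonData = requests.get(url).json()
df = pd.json_normalize(jsonData['data'])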

Pybaseball: Extract standings data and save to disk using pandas

What I am trying to do is take this output from pybaseball, which is returned as a list:
[ Tm W L W-L% GB 1 Boston Red Sox 94 44 .681 -- 2 New York Yankees 86 51 .628]
and put it into a csv file using pandas. So far, these are the queries I have tried. I have the information for this output stored as data. Whenever I try to write it to csv, it tells me:
AttributeError: 'list' object has no attribute 'to_csv'.
So I wrap it in a dataframe using df = pd.DataFrame(data), and that prints out just the headers:
0 Teams W L W-L% GB
0 Tm Tm
1 W W
2 L L
3 W-L% W-L%
4 GB GB
How would I get all of the information in the list written to the csv?
from pybaseball import standings
import pandas as pd
data = standings()
data.to_csv('file.csv', header = True, sep = ',')
Looks like standings() returns a list of dataframes:
from pybaseball import standings
import pandas as pd
data = standings()
print(type(data))
print(type(data[0]))
Output:
<class 'list'>
<class 'pandas.core.frame.DataFrame'>
To write it to file, you need to concatenate the list of dataframes into a single dataframe before writing:
all_data = pd.concat(data)
print(all_data)
all_data.to_csv("baseball_data.csv", sep=",", index=False)
Output:
Tm W L W-L% GB
1 Boston Red Sox 95 44 .683 --
2 New York Yankees 86 52 .623 8.5
3 Tampa Bay Rays 74 63 .540 20.0
4 Toronto Blue Jays 62 75 .453 32.0
5 Baltimore Orioles 40 98 .290 54.5
1 Cleveland Indians 77 60 .562 --
2 Minnesota Twins 63 74 .460 14.0
3 Chicago White Sox 56 82 .406 21.5
4 Detroit Tigers 55 83 .399 22.5
5 Kansas City Royals 46 91 .336 31.0
1 Houston Astros 85 53 .616 --
2 Oakland Athletics 83 56 .597 2.5
3 Seattle Mariners 77 61 .558 8.0
4 Los Angeles Angels 67 71 .486 18.0
5 Texas Rangers 60 78 .435 25.0
1 Atlanta Braves 76 61 .555 --
2 Philadelphia Phillies 72 65 .526 4.0
3 Washington Nationals 69 69 .500 7.5
4 New York Mets 62 75 .453 14.0
5 Miami Marlins 55 83 .399 21.5
1 Chicago Cubs 81 56 .591 --
2 Milwaukee Brewers 78 61 .561 4.0
3 St. Louis Cardinals 76 62 .551 5.5
4 Pittsburgh Pirates 67 71 .486 14.5
5 Cincinnati Reds 59 79 .428 22.5
1 Colorado Rockies 75 62 .547 --
2 Los Angeles Dodgers 75 63 .543 0.5
3 Arizona Diamondbacks 74 64 .536 1.5
4 San Francisco Giants 68 71 .489 8.0
5 San Diego Padres 55 85 .393 21.5
And you'll have a file baseball_data.csv which is a comma-separated representation of the dataframe above.
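If you only need a single division rather than all of them, you can skip the concat and write one element of the list directly (a sketch; index 0 is the AL East in the output above):
from pybaseball import standings
data = standings()  # list of DataFrames, one per division
data[0].to_csv("al_east.csv", sep=",", index=False)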
