Iterating through a list of URLs in Python with bs4

I have one .txt file (named test_1.txt) that is formatted as follows:
https://maps.googleapis.com/maps/api/directions/xml?origin=Bethesda,MD&destination=Washington,DC&sensor=false&mode=walking
https://maps.googleapis.com/maps/api/directions/xml?origin=Miami,FL&destination=Mobile,AL&sensor=false&mode=walking
https://maps.googleapis.com/maps/api/directions/xml?origin=Chicago,IL&destination=Scranton,PA&sensor=false&mode=walking
https://maps.googleapis.com/maps/api/directions/xml?origin=Baltimore,MD&destination=Charlotte,NC&sensor=false&mode=walking
If you open one of the links above you'll see the output in XML. With the code below, the script only reaches the second directions request (Miami to Mobile), and it prints seemingly random data that isn't what I want. I can get it working, printing exactly the data I need, when I fetch a single URL hard-coded in the script rather than read from the .txt file. Why does it only visit the second URL and print the wrong info? The Python code is below:
import urllib2
from bs4 import BeautifulSoup

with open('test_1.txt', 'r') as f:
    f.readline()
    mapcalc = f.readline()
    response = urllib2.urlopen(mapcalc)
    soup = BeautifulSoup(response)
    for leg in soup.select('route > leg'):
        duration = leg.duration.text.strip()
        distance = leg.distance.text.strip()
        start = leg.start_address.text.strip()
        end = leg.end_address.text.strip()
        print duration
        print distance
        print start
        print end
EDIT:
This is the output of the Python Code in the Shell:
56
1 min
77
253 ft
Miami, FL, USA
Mobile, AL, USA

Here's a link that could shed more light on the behavior you can get when opening files and reading lines, etc. (related to Lev Levitsky's comment).
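In short, the behavior comes straight from how readline() works: each call consumes one line of the file. A minimal sketch of what the posted code actually does (comments are mine):

with open('test_1.txt', 'r') as f:
    first = f.readline()    # reads and discards the first URL (Bethesda -> Washington)
    mapcalc = f.readline()  # now holds the *second* URL (Miami -> Mobile)
# so only the Miami -> Mobile request is ever fetched;
# to visit every URL, loop over the file object instead:
#     for mapcalc in f: ...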
One way:
import httplib2
from bs4 import BeautifulSoup

http = httplib2.Http()
with open('test_1.txt', 'r') as f:
    for mapcalc in f:
        status, response = http.request(mapcalc)
        soup = BeautifulSoup(response)
        for leg in soup.select('route > leg'):
            duration = leg.duration.text.strip()
            distance = leg.distance.text.strip()
            start = leg.start_address.text.strip()
            end = leg.end_address.text.strip()
            print duration
            print distance
            print start
            print end
I'm new to this sort of thing but I got the above code to work with the following output:
4877
1 hour 21 mins
6582
4.1 mi
Bethesda, MD, USA
Washington, DC, USA
56
1 min
77
253 ft
Miami, FL, USA
Mobile, AL, USA
190
3 mins
269
0.2 mi
Chicago, IL, USA
Scranton, PA, USA
12
1 min
15
49 ft
Baltimore, MD, USA
Charlotte, NC, USA
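For anyone on Python 3, where urllib2 no longer exists, a rough equivalent of that loop would be the following untested sketch using urllib.request (the 'xml' parser requires lxml to be installed):

from urllib.request import urlopen
from bs4 import BeautifulSoup

with open('test_1.txt') as f:
    for mapcalc in f:
        mapcalc = mapcalc.strip()
        if not mapcalc:
            continue  # skip blank lines
        soup = BeautifulSoup(urlopen(mapcalc), 'xml')
        for leg in soup.select('route > leg'):
            print(leg.duration.text.strip())
            print(leg.distance.text.strip())
            print(leg.start_address.text.strip())
            print(leg.end_address.text.strip())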

Related

Struggling to grab data from baseball reference

I'm trying to grab the batting-against tables for all pitchers found on this page.
I believe the problem lies with the data being behind a comment.
For the sake of the example, I'd like to find, say, Sandy Alcantara's home runs allowed.
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page = requests.get(url)
soup = bs(page.content, "html.parser")
for tr in soup.find_all('tr'):
    td = tr.find_all('td')
    print(td)
This prints a lot of team data, but doesn't print the pitcher data.
How can I cleverly get it to print the pitcher data? Ideally, I'd have it in a list or something.
object[0]=Rk
object[1]=Name
object[4]=IP
object[13]=HR
The problem with extracting the table content is that the table itself is stored inside a comment string.
After you have fetched the web page and loaded it into BeautifulSoup, you can solve this scraping issue with the following steps:
gather the div tagged id='all_players_batting_pitching', which contains your table
extract the table from the comment using the decode_contents function, then reload the text into a new soup
extract each record of the table by looking for the tr tag, then each value by looking for the td tag, keeping only td indices [0, 3, 12] (your [1, 4, 13] shift down by one because Rk is a th cell, not a td)
load your values into a pandas DataFrame, ready to be used
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# fetching web page
url = "https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page = requests.get(url)

# extracting table from html (it is embedded in a comment)
soup = bs(page.content, "html.parser")
table = soup.find(id='all_players_batting_pitching')
tab_text = table.decode_contents().split('--')[1].strip()
tab_soup = bs(tab_text, "html.parser")

# extracting records from table
records = []
for i, row in enumerate(tab_soup.find_all('tr')):
    record = [ele.text.strip() for j, ele in enumerate(row.find_all('td')) if j in [0, 3, 12]]
    if record != []:
        records.append([row.a['href']] + [i] + record)

# loading records into a DataFrame
df = pd.DataFrame(records, columns=['href', 'Rk', 'Name', 'IP', 'HR'])
print(df)
Output:
href Rk Name IP HR
0 /players/a/abbotco01.shtml 1 Cory Abbott 48.0 12
1 /players/a/abreual01.shtml 2 Albert Abreu 38.2 5
2 /players/a/abreual01.shtml 3 Albert Abreu 8.2 2
3 /players/a/abreual01.shtml 4 Albert Abreu 4.1 1
4 /players/a/abreual01.shtml 5 Albert Abreu 25.2 2
... ... ... ... ... ...
1063 /players/z/zastrro01.shtml 1106 Rob Zastryzny* 1.0 0
1064 /players/z/zastrro01.shtml 1107 Rob Zastryzny* 3.0 0
1065 /players/z/zerpaan01.shtml 1108 Angel Zerpa* 11.0 2
1066 /players/z/zeuchtj01.shtml 1109 T.J. Zeuch 10.2 5
1067 /players/z/zimmebr02.shtml 1110 Bruce Zimmermann* 73.2 21
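As an aside, instead of string-splitting on '--', you can pull the comment node directly with bs4's Comment type; a sketch of just that extraction step, assuming the same soup and div as in the answer above:

from bs4 import BeautifulSoup as bs, Comment

div = soup.find(id='all_players_batting_pitching')
comment = div.find(string=lambda text: isinstance(text, Comment))
tab_soup = bs(comment, 'html.parser')  # the commented-out table, now parseable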

How do I use Python and BeautifulSoup to scrape data from an html table?

If you look at this page https://metals-api.com/currencies there is an HTML table with 2 columns. I would like to extract all the rows from column 1 into a list/array. How do I go about this?
import requests
from bs4 import BeautifulSoup

URL = "https://metals-api.com/currencies"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

with open('outpu2t.txt', 'w', encoding='utf-8') as f:
    f.write(soup.text)
To clarify: I am not looking to run fetch-price commands against these tickers; I'm trying to compile a list of tickers so I can add them to a dropdown menu for my app.
If I understand the question, you can try the following example:
import requests
from bs4 import BeautifulSoup
import pandas as pd

data = []
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

for code in soup.select('.table tbody tr td:nth-child(1)'):
    data.append(code.text)

df = pd.DataFrame(data, columns=['code'])
#df.to_csv('code.csv', index=False)  # to store the data
print(df)
Output:
code
0 XAU
1 XAG
2 XPT
3 XPD
4 XCU
.. ...
209 LINK
210 XLM
211 ADA
212 BCH
213 LTC
[214 rows x 1 columns]
I sit corrected; I initially just tried pd.read_html("https://metals-api.com/currencies"), which normally works, but with a very slight workaround it can still work just fine.
import pandas as pd
import requests

URL = "https://metals-api.com/currencies"
page = requests.get(URL)
df = pd.read_html(page.content)[0]
print(df)
Output:
Code Name
0 XAU 1 Ounce of 24K Gold. Use Carat endpoint to dis...
1 XAG Silver
2 XPT Platinum
3 XPD Palladium
4 XCU Copper
.. ... ...
209 LINK Chainlink
210 XLM Stellar
211 ADA Cardano
212 BCH Bitcoin Cash
213 LTC Litecoin
[214 rows x 2 columns]
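Since the end goal is a list of tickers for a dropdown, either answer reduces to a plain Python list; for example, from the read_html DataFrame above:

codes = df['Code'].tolist()
print(codes[:5])  # ['XAU', 'XAG', 'XPT', 'XPD', 'XCU']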

How to loop through scraped items and add them to a dictionary or pandas dataframe?

For a project I'm scraping data from futbin players and I would like to add that scraped data to a dict or pandas dataframe. I've been stuck for a couple of hours and would like some help if possible. I'll put my code below, showing what I have so far. This piece of code only prints out the data, and from there I'm clueless about what to do.
Code:
from requests_html import HTMLSession
import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.futbin.com/21/player/87/pele', 'https://www.futbin.com/21/player/27751/robert-lewandowski']

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    info = soup.find('div', id='info_content')
    rows = info.find_all('td')
    for cell in rows:
        print(cell.text.strip())
The work you have already done to identify the table you want is good. Building on it:
use read_html() to convert the table to a dataframe
apply basic transforms to turn it from key/value pairs into columns
use a list comprehension to get the details of all wanted footballers
import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.futbin.com/21/player/87/pele', 'https://www.futbin.com/21/player/27751/robert-lewandowski']

def myhtml(url):
    # use BS4 to get the table that has the required data
    html = str(BeautifulSoup(requests.get(url).content, 'html.parser').find('div', id='info_content').find("table"))
    # read_html() returns a list, take the first one; the first column holds
    # attribute names, so transpose to build the DF
    return pd.read_html(html)[0].set_index(0).T

df = pd.concat([myhtml(u) for u in urls])
Output (the two player records side by side; nan where a page lacks the attribute):

Name        Edson Arantes Nascimento    Robert Lewandowski
Club        FUT 21 ICONS                FC Bayern
Nation      Brazil                      Poland
League      Icons                       Bundesliga
Skills      5                           4
Weak Foot   4                           4
Intl. Rep   5                           4
Foot        Right                       Right
Height      173cm | 5'8"                184cm | 6'0"
Weight      70                          80
Revision    Icon                        TOTY
Def. WR     Med                         Med
Att. WR     High                        High
Added on    2020-09-10                  2021-01-22
Origin      Prime                       TOTY
R.Face      nan                         nan
B.Type      Unique                      Unique
DOB         23-10-1940                  nan

(Each page also contributes a page-title column, e.g. "Robert Lewandowski FIFA 21 Career Mode", and an Age column, which come through as mostly nan after the concat.)
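From there, df behaves like any other frame; for example, a quick look at the columns you care about (column names taken from the output above), or persisting it for later:

print(df[['Name', 'Club', 'Nation', 'League']])
df.to_csv('players.csv', index=False)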
I would do it with open() and write():
file = open("filename.txt", "w")
The "w" specifies the following:
"w" - Write - Opens a file for writing, creates the file if it does not exist
And then:
file.write(text_to_save)
If you also need to build or check file paths, include os.path:
import os.path
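A sketch tying the two steps together with a context manager, so the file is closed automatically (filename.txt and text_to_save are placeholders):

with open("filename.txt", "w", encoding="utf-8") as file:
    file.write(text_to_save)  # text_to_save must be defined beforehand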

concat pdf tables into one excel table using python

I'm using tabula to concatenate all the tables in the following PDF file into one table in Excel format.
Here's my code:
from tabula import read_pdf
import pandas as pd

allin = []
for page in range(1, 115):
    table = read_pdf("goal.pdf", pages=page,
                     pandas_options={'header': None})[0]
    allin.append(table)

new = pd.concat(allin)
new.to_excel("out.xlsx", index=False)
I also tried the following:
from tabula import read_pdf
import pandas as pd

table = read_pdf("goal.pdf", pages='all', pandas_options={'header': None})
new = pd.concat(table, ignore_index=True)
new.to_excel("out.xlsx", index=False)
Current output: check
But the issue I'm facing is that from page 91 onward the data stops being formatted correctly in the Excel file.
I've debugged the pages individually and couldn't figure out why they're formatted wrongly, especially since they use the same layout.
from tabula import read_pdf
import pandas as pd

table = read_pdf("goal.pdf", pages='91', pandas_options={'header': None})[0]
print(table)
Example:
from tabula import read_pdf
import pandas as pd

table = read_pdf("goal.pdf", pages='90-91', pandas_options={'header': None})
new = pd.concat(table, ignore_index=True)
new.to_excel("out.xlsx", index=False)
Here I've run the code for the two pages 90 and 91.
Starting from row 48 you will see the difference:
the name and address are placed into one cell, and the city and state are placed into one cell as well.
I dug into the source code: it has a columns option with which you can manually define column boundaries. When you set columns, you also have to use guess=False.
tabula-py uses the Java program tabula-java, and in its documentation I found that it needs values in percents or points (not pixels). So I used the program inkscape to measure the boundaries in points.
from tabula import read_pdf
import pandas as pd

# display all columns in the dataframe
pd.set_option('display.width', None)

columns = [210, 350, 420, 450]   # boundaries in points
#columns = ['210,350,420,450']   # boundaries in points

pages = '90-92'
#pages = [90, 91, 92]
#pages = list(range(90, 93))
#pages = 'all'  # read all pages

tables = read_pdf("goal.pdf",
                  pages=pages,
                  pandas_options={'header': None},
                  columns=columns,
                  guess=False)

df = pd.concat(tables).reset_index(drop=True)

#df.rename(columns=df.iloc[0], inplace=True)  # convert first row to headers
#df.drop(df.index[0], inplace=True)           # remove first row with headers

# display
#for x in range(0, len(df), 20):
#    print(df.iloc[x:x+20])
#    print('----------')

print(df.iloc[45:50])

#df.to_csv('output-pdf.csv')
#print(df[ df['State'].str.contains(' ') ])
#print(df[ df.iloc[:,3].str.contains(' ') ])
Result:
0 1 2 3 4
45 JARRARD, GARY 930 FORT WORTH DRIVE DENTON TX (940) 565-6548
46 JARRARD, GARY 2219 COLORADO BLVD DENTON TX (940) 380-1661
47 MASON HARRISON, RATLIFF ENTERPRISES 1815 W. UNIVERSITY DRIVE DENTON TX (940) 387-5431
48 MASON HARRISON, RATLIFF ENTERPRISES 109 N. LOOP #288 DENTON TX (940) 484-2904
49 MASON HARRISON, RATLIFF ENTERPRISES 930 FORT WORTH DRIVE DENTON TX (940) 565-6548
EDIT:
It may also need the area option (also in points) to skip headers, or you will have to remove the first row on the first page.
I didn't check all rows, but the column boundaries may need some adjustment.
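For example, a sketch reusing pages and columns from above, with an assumed header band of 50 points on a US-letter page (612 x 792 points); these numbers are assumptions you would measure per document, e.g. with inkscape:

tables = read_pdf("goal.pdf",
                  pages=pages,
                  pandas_options={'header': None},
                  area=[50, 0, 792, 612],  # [top, left, bottom, right] in points
                  columns=columns,
                  guess=False)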
EDIT:
A few rows are problematic, probably because the text in City is too long.
col3 = df.iloc[:,3]
print(df[ col3.str.contains(' ') ])
Result:
0 1 2 3 4
1941 UMSTATTD RESTAURANTS, LLC 120 WEST US HIGHWAY 54 EL DORADO SPRING MS O (417) 876-5755
2079 SIMONS, GARY 1412 BURLINGTON NORTH KANSAS CIT MY O (816) 421-5941
2763 GRISHAM, ROBERT (RB) 403 WEST COURT STREET WASHINGTON COU ORTH HOU S(E740) 335-7830
2764 STAUFFER, JACOB 403 WEST COURT STREET WASHINGTON COU ORTH HOU S(E740) 335-7830

Python parsing HTML with BeautifulSoup

I'm trying to take specific data from this webpage (http://www.scoresandodds.com/index.html) and eventually want to put it into a table of my own, but for right now I just want to be able to get the data I want to show up. With the code below I am able to get all the teams with class 'team even' to show up; however, I want both 'team odd' and 'team even' to show up, preferably with 'team odd' first and then 'team even'.
I'm only focused on taking the names out for now. Any help would be greatly appreciated; I've been trying to figure this out all day and can't quite crack it! I just started learning Python and don't want you to give me the answer, just point me in the correct direction.
Thanks!
import requests
from bs4 import BeautifulSoup

# Scraping all data from the website
url = 'http://www.scoresandodds.com/index.html'
response = requests.get(url)
html = response.content

# Searching through the content to find certain elements with certain attributes
soup = BeautifulSoup(html, "html.parser")
table = soup.find('tbody')
for row in table.findAll('tr', attrs={'class': 'team even'}):
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
    print(list_of_cells)
To just get the names is simple: use class_="team" so you match both the odd and even rows, then just pull the td whose class is "name":
from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get("http://www.scoresandodds.com/index.html").content, "html.parser")
table = soup.select_one("#mlb").find_next("table")
head = ",".join([th.text for th in table.select("tr th")])
print(head)

for tr in table.find_all("tr", class_="team"):
    print(tr.find("td", "name").text.strip())
Which will give you:
951 SAN FRANCISCO GIANTS
952 PITTSBURGH PIRATES
953 SAN DIEGO PADRES
954 CINCINNATI REDS
955 CHICAGO CUBS
956 MIAMI MARLINS
957 NEW YORK METS
958 ATLANTA BRAVES
959 ARIZONA DIAMONDBACKS
960 COLORADO ROCKIES
961 SEATTLE MARINERS
962 DETROIT TIGERS
963 CHICAGO WHITE SOX
964 BOSTON RED SOX
965 OAKLAND ATHLETICS
966 LOS ANGELES ANGELS
967 PHILADELPHIA PHILLIES
968 MINNESOTA TWINS
To get multiple pieces of data, you can pass a list of classes:
from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get("http://www.scoresandodds.com/index.html").content, "html.parser")
table = soup.select_one("#mlb").find_next("table")
head = ",".join([th.text for th in table.select("tr th")])
print(head)

for tr in table.find_all("tr", class_="team"):
    print(", ".join([td.text.strip() for td in tr.find_all("td", ["name", "pitcher", "currentline", "score"])]))
If you look at the source, you'll see that some class names are repeated, like line. So we can also use the ids to get the current line, run line, etc., by matching on partial id text:
for tr in table.find_all("tr", class_="team"):
    print(tr.select_one("td[id*=Pitcher]").text)
    print(tr.select_one("td[id*=Current]").text)
    print(tr.select_one("td[id*=Line]").text)
    print("")
Which would give you:
(r) suárez, a
8.5o15
+1.5(-207)
(l) niese, j
-108
-1.5(+190)
(l) friedrich, c
9.5o15
+1.5(-195)
(l) lamb, j
-115
-1.5(+179)
(l) lester, j
-156
-1.5(-105)
(l) chen, w
7.5o15
+1.5(-103)
(r) harvey, m
-155
-1.5(+106)
(r) wisler, m
7.5u15
+1.5(-115)
(r) greinke, z
-150
-1.5(+109)
(r) butler, e
10.5
+1.5(-118)
(r) sampson, a
10u15
+1.5(-170)
(l) norris, d
-123
-1.5(+156)
(r) shields, j
10o20
+1.5(+117)
(r) porcello, r
-235
-1.5(-127)
(r) graveman, k
8o15
+1.5(-170)
(r) lincecum, t
-133
-1.5(+156)
(r) eickhoff, j
8.5
+1.5(-154)
(r) nolasco, r
-151
-1.5(+142)
You should be able to piece it all together to get all the table data you want.
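For instance, a sketch collecting one record per team row with the selectors used above (the dict keys are my own naming):

records = []
for tr in table.find_all("tr", class_="team"):
    records.append({
        "name": tr.find("td", "name").text.strip(),
        "pitcher": tr.select_one("td[id*=Pitcher]").text.strip(),
        "line": tr.select_one("td[id*=Line]").text.strip(),
    })
print(records[:2])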
