I am trying to isolate the Location column and then eventually get it to output to a database file. My code is as follows:
import urllib
import urllib2
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/List_of_ongoing_armed_conflicts"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
trs = soup.find_all('td')
for tr in trs:
for link in tr.find_all('a'):
fulllink = link.get ('href')
tds = tr.find_all("tr")
location = str(tds[3].get_text())
print location
but I always get 1 of 2 errors either list being out of range or exit code '0'. I am uncertain on beautfulsoup as I am trying to learn it so any help is appreciated thanks !
There is an easier way to locate the Location column. Use a table.wikitable tr CSS Selector, find all td elements for every row and get the 4th td by index.
Besides, if there are multiple locations inside a cell, you need to treat them separately:
import urllib2
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/List_of_ongoing_armed_conflicts"
soup = BeautifulSoup(urllib2.urlopen(url))
for row in soup.select('table.wikitable tr'):
cells = row.find_all('td')
if cells:
for text in cells[3].find_all(text=True):
text = text.strip()
if text:
print text
Prints:
Afghanistan
Nigeria
Cameroon
Niger
Chad
...
Iran
Nigeria
Mozambique
You simply swap td and tr balises in your code. And be careful with str() function because you can have unicode string in your web page that cannot be converted in simple ascii string. Your code should be:
import urllib
import urllib2
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/List_of_ongoing_armed_conflicts"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
trs = soup.find_all('tr') # 'tr' instead of td
for tr in trs:
for link in tr.find_all('a'):
fulllink = link.get ('href')
tds = tr.find_all("td") # 'td' instead of td
location = tds[3].get_text() # remove of str function
print location
And voilà!!
Related
Need to get the links from a td in rows that has a certain td value.
this is a tr in the table and I want to get the link from the div "Match" if the div "Home team" is of a certain value. There are many rows and I want to find every link that is matching. I have tried this and every time I only get the first row of the table. Here is the link https://wp.nif.no/PageTournamentDetailWithMatches.aspx?tournamentId=403373&seasonId=200937&number=all . Note that I translated some of the values to English in the examples below
homegames = browser.find_elements_by_xpath('//div[#data-title = "Home team"]/a[text()="Cleveland"]//parent::div//parent::td//parent::tr')
for link in homegames:
print(link.find_element_by_xpath('//td[3]/div/a').get_attribute('href'))
<td><div data-title="Date">23.10.2021</div></td>
<td><div data-title="Tid">16:15</div></td>
<td>div data-title="Matchnr">
2121503051
</div>
</td><td><div data-title="Home team">Cleveland</div></td>
<td><div data-title="Away team">
Ohio Travellers</div></td>
<td><div data-title="Court">F21</div></td><td><div data-title="Result">71 - 64</div></td>
<td><div data-title="Referee">John Doe<br>Will Smith<br></div></td></tr>```
The data is within the html source (so no need to use Selenium). But regardless of using Selenium or not, what you can do here is let BeautifulSoup find the specific tags you are after.
Without Selenium, it requires a little manipulation as decode the html.
import requests
from bs4 import BeautifulSoup
import json
import html
keyword = 'Askim'
url = 'https://wp.nif.no/PageTournamentDetailWithMatches.aspx?tournamentId=403373&seasonId=200937&number=all'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
jsonStr = soup.find('div', {'class':'xwp_table_bg'}).find_next('input')['value']
jsonData = json.loads(jsonStr)
links_list = []
for each in jsonData['data']:
#each = jsonData['data'][6]
htmlStr = ''.join(each)
htmlStr = html.unescape(htmlStr)
soup = BeautifulSoup(htmlStr, 'html.parser')
if soup.find('div', {'data-title':'Hjemmelag'}, text=keyword):
link = soup.find('div', {'data-title':'Kampnr'}).find('a')['href']
links_list.append(link)
import requests
from bs4 import BeautifulSoup
results = requests.get("https://en.wikipedia.org/wiki/List_of_multiple_Olympic_gold_medalists")
src = results.content
soup = BeautifulSoup(src, 'lxml')
trs = soup.find_all("tr")
for tr in trs:
print(tr.text)
This is the code I write for the scraping table from the page "https://en.wikipedia.org/wiki/List_of_multiple_Olympic_gold_medalists"
If I am only targeting the table in the session "List of most Olympic gold medals over career", how can I specify the table I need? There are 2 sortable jquery-tablesorter so I cannot use the class attribute to select the table I needed.
One more question, if I know that the page I am scraping contains a lot of tables and the one I need always have 10 td in 1 row, can I have something like
If len(td) == 10:
print(tr)
to extract the data I wanted
Update on code:
from bs4 import BeautifulSoup
results = requests.get("https://en.wikipedia.org/wiki/List_of_multiple_Olympic_gold_medalists")
src = results.content
soup = BeautifulSoup(src, 'lxml')
tbs = soup.find("tbody")
trs = tbs.find_all("tr")
for tr in trs:
print(tr.text)
I have one of the solution, not a good one, just to extract the first table from the page which is the one I needed, any suggestion/ improvement are welcomed!
Thank you.
To only get the first table you can use a CSS Selector nth-of-type(1):
import requests
from bs4 import BeautifulSoup
URL = "https://en.wikipedia.org/wiki/List_of_multiple_Olympic_gold_medalists"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
table = soup.select_one("table.wikitable:nth-of-type(1)")
trs = table.find_all("tr")
for tr in trs:
print(tr.text)
I want to extract text from th tags in a table so I can print a list of metro stations from a table in a Wikipedia page. I only need text from a certain table (there are two of them in the page)
import urllib.request
url = "https://en.wikipedia.org/wiki/List_of_London_Underground_stations"
page = urllib.request.urlopen(url)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, "lxml")
stations_table = soup.find("table", class_= "wikitable sortable plainrowheaders")
stations_table
for i in soup.find_all('th', stations_table):
print(i.text)
I can get the table stored in the stations_table variable but cannot print the text in th tags within the wikitable sortable plainrowheaders table. While it does print the station name, it also prints the headers:
Station
Local authority
Zone(s)[†]
Opened[4]
Main lineopened
Usage[5]
How can I filter those out?
It shows all th in table - not only stations but also headers like Stations, Lines
To skip it I search all tr, skip first row and then I search th in every row
for i in stations_table.find_all('tr')[1:]
print(i.find('th').text.strip())
Full code
import urllib.request
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/List_of_London_Underground_stations"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
stations_table = soup.find("table", class_= "wikitable sortable plainrowheaders")
for i in stations_table.find_all('tr')[1:]:
print(i.find('th').text.strip())
#print(i.th.text.strip())
for i in soup.find_all('th', stations_table):
searches for all the table headings and the table rows. What can be done for this, is to extract all the rows and start printing from the second row (ignoring the title's row) as below
for i in stations_table.find_all('tr')[1:]:
print(i.find('th').text)
I am trying to get the following done:
Open a webpage and get the tr tag with bold font and then filter the links
This works great:
import urllib2
from bs4 import BeautifulSoup
response = urllib2.urlopen("https://wiki.guildwars.com/wiki/Weekly_activities")
data = response.read()
soup = BeautifulSoup(data, 'html.parser')
weekliesunsorted = soup.findAll('tr', style="font-weight: bold;")
pvebonus = weekliesunsorted[0].findAll('a')[0]
#and so on...
But unfortunately this code doesn't, the fluxunsorted variable is empty, but the tr row does exist
response = urllib2.urlopen("https://wiki.guildwars.com/wiki/Flux")
data = response.read()
soup = BeautifulSoup(data, 'html.parser')
fluxunsorted = soup.findAll('tr', style="font-weight:bold;") #empty?
flux = fluxunsorted[0].findAll('a')[0]
That's why I am getting the index out of range error.
Why is the list empty?
I am writing a Python script using BeautifulSoup to scrape values from this webpage: https://uk-air.defra.gov.uk/latest/currentlevels
I want to use soup.find() to get values for "Hourly mean Nitrogen dioxide" and "Last updated" from the table row where the "Monitoring site" is "Edinburgh St Leonards".
As I am new to web scraping I am having a bit of trouble so would be grateful for any help on this.
Scrap all the html tables in a list of tables.
The table index may change, then you should not rely on a row/column index.
A part of the folowing script look up for the index of the searched data. Moreover, it prints the header name: so you know want are the data you get.
from bs4 import BeautifulSoup
import urllib.request
import re
with urllib.request.urlopen('https://uk-air.defra.gov.uk/latest/currentlevels?view=region') as response:
htmlData = response.read()
soup = BeautifulSoup(htmlData, 'html5lib')
tables = soup.find_all('table', attrs={'class':'current_levels_table'})
#what you want to check:
Iwant = ['nitrogen', 'update']
about = 'Edinburgh'
for table in tables:
#get header to have the data (we're looking for) column number and table real names
table_head = table.find('thead')
headrows = table_head.find_all('tr')
measures = headrows[1].find_all('th')
for colnum, measure in enumerate(measures):
index.update({colnum: measure.text.strip() for wanted in Iwant if re.search(wanted+'(?iu)', measure.text)})
#get table content and look for Edinburgh
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
cels = row.find_all('td')
rowContent = [cel.text.strip().replace(u'\xa0', u' ').replace(u'\n Timeseries Graph', u'') for cel in cels if cel]
if re.search(about+'(?iu)', rowContent[0]):
for indexwanted, measurewanted in index.items():
print(measurewanted, ':', rowContent[indexwanted])
Making use of the suggestion from d2718nis, you can do it in this way. Of course, many other ways would work too.
First, find the link that has the 'Edinburgh St Leonards' text in it. Then find the grandparent of that link element, which is a tr element. Now identify the td elements in the tr. When you examine the table you see that the columns you want are the 4th and 7th. Get those from all of the td elements as the (0-relative) 3rd and 6th. Finally, display the crude texts of these elements.
You will need to do something clever to extract properly readable strings from these results.
>>> import requests
>>> import bs4
>>> page = requests.get('https://uk-air.defra.gov.uk/latest/currentlevels', headers={'User-Agent': 'Not blank'}).content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> Edinburgh_link = soup.find_all('a',string='Edinburgh St Leonards')[0]
>>> Edinburgh_link
Edinburgh St Leonards
>>> Edinburgh_row = Edinburgh_link.findParent('td').findParent('tr')
>>> Edinburgh_columns = Edinburgh_row.findAll('td')
>>> Edinburgh_columns[3]
<td class="center"><span class="bg_low1 bold">20 (1 Low)</span></td>
>>> Edinburgh_columns[6]
<td>05/08/2017<br/>14:00:00</td>
>>> Edinburgh_columns[3].text
'20\xa0(1\xa0Low)'
>>> Edinburgh_columns[6].text
'05/08/201714:00:00'
you can start with this:
import requests
from bs4 import BeautifulSoup
# Request the page, set headers to prevent 403 Forbidden
page = requests.get(
url='https://uk-air.defra.gov.uk/latest/currentlevels',
headers={'User-Agent': 'Not blank'})
# Get html from page
html = page.text
# BeautifulSoup object
soup = BeautifulSoup(html, 'html5lib')
for table in soup.find_all('table'):
# Print all tables on the page
print(table)