Access element using xpath? - python

I would like to get the links to all of the elements in the first column in this page (http://en.wikipedia.org/wiki/List_of_school_districts_in_Alabama).
I am comfortable using BeautifulSoup, but it seems less well-suited to this task (I've been trying to access the first child of the contents of each tr but that hasn't been working so well).
The xpaths follow a regular pattern, the row number updating for each new row in the following expression:
xpath = '//*[@id="mw-content-text"]/table[1]/tbody/tr[' + str(counter) + ']/td[1]/a'
Would someone help me by posting a means of iterating through the rows to get the links?
I was thinking something along these lines:
urls = []
while counter < 100:
    urls.append(get the xpath('//*[@id="mw-content-text"]/table[1]/tbody/tr[' + str(counter) + ']/td[1]/a'))
    counter += 1
Thanks!

Here's an example of how you can get all of the links from the first column:
from lxml import etree
import requests

URL = "http://en.wikipedia.org/wiki/List_of_school_districts_in_Alabama"
response = requests.get(URL)

parser = etree.HTMLParser()
tree = etree.fromstring(response.text, parser)

for row in tree.xpath('//*[@id="mw-content-text"]/table[1]/tr'):
    links = row.xpath('./td[1]/a')
    if links:
        link = links[0]
        print link.text, link.attrib.get('href')
Note that tbody is inserted by the browser; lxml won't see this tag, so just leave it out of the xpath.
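You can see the tbody point with a small inline snippet instead of a fetched page (a minimal sketch; the table content is made up):

```python
from lxml import etree

# a tiny inline table; a browser would insert <tbody> around the rows,
# but lxml's HTML parser keeps the markup exactly as written
snippet = '<table><tr><td><a href="/district1">District 1</a></td></tr></table>'
tree = etree.fromstring(snippet, etree.HTMLParser())

# no tbody step in the path, and the link is still found
assert tree.xpath('//table/tr/td/a/@href') == ['/district1']
```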
Hope that helps.

This should work:
from lxml import html

urls = []
parser = html.parse("http://url/to/parse")
for element in parser.xpath(your_xpath_query):
    urls.append(element.attrib['href'])
You could also access the href attribute in the XPath query directly, e.g.:
for href in parser.xpath("//a/@href"):
    urls.append(href)
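As a small self-contained illustration (using an inline snippet rather than a real page), an XPath ending in @href hands back the attribute values directly as strings, so there is no need to touch .attrib at all:

```python
from lxml import html

doc = html.fromstring('<p><a href="/a">A</a><a href="/b">B</a></p>')

# selecting @href returns plain strings, not elements
urls = doc.xpath('//a/@href')
assert urls == ['/a', '/b']
```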

The page you linked to does not seem to have content at the XPath you specified. Here is a different XPath which does the job:
import urllib2
import lxml.html as LH

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', "Mozilla/5.0")]
url = 'http://en.wikipedia.org/wiki/List_of_school_districts_in_Alabama'
xpath = '//table[@class="wikitable sortable"]//tr/td[1]/a/@href'

doc = LH.parse(opener.open(url))
urls = doc.xpath(xpath)
print(urls)

Maybe you are looking for something like:
urls = []
while True:
    try:
        counter = len(urls) + 1
        (node,) = tree.xpath('//*[@id="mw-content-text"]/table[1]/tbody/tr[' + str(counter) + ']/td[1]/a')
        urls.append(node)
    except ValueError:
        break
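The trick here is that unpacking into (node,) raises ValueError unless the xpath result has exactly one element, which is what ends the loop once the rows run out. A minimal sketch of the mechanism, with plain lists standing in for xpath results:

```python
def take_exactly_one(items):
    # (node,) = items raises ValueError unless len(items) == 1
    (node,) = items
    return node

assert take_exactly_one(['link']) == 'link'

stopped = False
try:
    take_exactly_one([])  # no row at this position: ValueError ends the loop
except ValueError:
    stopped = True
assert stopped
```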


How to get the href value of a link with bs4?

I need to extract all of the matches from the 2020/2021 season's URLs from this [website][1] and scrape them.
I am sending a request to this link.
The section of the HTML that I want to retrieve is this part:
Here's the code that I am using:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import urllib.parse

website = 'https://www.espncricinfo.com/series/ipl-2020-21-1210595/match-results'
response = requests.get(website)
soup = BeautifulSoup(response.content, 'html.parser')

match_result = soup.find_all('a', {'class': 'match-info-link-FIXTURES'})
soup.get('href')

url_part_1 = 'https://www.espncricinfo.com/'
url_part_2 = []
for item in match_result:
    url_part_2.append(item.get('href'))

url_joined = []
for link_2 in url_part_2:
    url_joined.append(urllib.parse.urljoin(url_part_1, link_2))

first_link = url_joined[0]

match_url = soup.find_all('div', {'class': 'link-container border-bottom'})
soup.get('href')

url_part_3 = 'https://www.espncricinfo.com/'
url_part_4 = []
for item in match_result:
    url_part_4.append(item.get('href'))

print(url_part_4)
[1]: https://www.espncricinfo.com/series/ipl-2020-21-1210595/match-results
You don't need the second find_all('a', {'class': 'match-info-link-FIXTURES'}) call below the for item in match_result: loop, since match_result already holds the tags with the hrefs.
You can get the href with item.get('href').
You can do:
url_part_1 = 'https://www.espncricinfo.com/'
url_part_2 = []
for item in match_result:
    url_part_2.append(item.get('href'))
The result will look something like:
['/series/ipl-2020-21-1210595/delhi-capitals-vs-mumbai-indians-final-1237181/full-scorecard',
'/series/ipl-2020-21-1210595/delhi-capitals-vs-sunrisers-hyderabad-qualifier-2-1237180/full-scorecard',
'/series/ipl-2020-21-1210595/royal-challengers-bangalore-vs-sunrisers-hyderabad-eliminator-1237178/full-scorecard',
'/series/ipl-2020-21-1210595/delhi-capitals-vs-mumbai-indians-qualifier-1-1237177/full-scorecard',
'/series/ipl-2020-21-1210595/sunrisers-hyderabad-vs-mumbai-indians-56th-match-1216495/full-scorecard',
...
]
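The urljoin step in the question then turns these relative paths into absolute links, for example:

```python
from urllib.parse import urljoin

url_part_1 = 'https://www.espncricinfo.com/'
href = '/series/ipl-2020-21-1210595/delhi-capitals-vs-mumbai-indians-final-1237181/full-scorecard'

# urljoin resolves the leading slash against the scheme and host
full = urljoin(url_part_1, href)
assert full == 'https://www.espncricinfo.com' + href
```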
From the official docs:
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_.
Try
soup.find_all("a", class_="match-info-link-FIXTURES")
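A self-contained sketch of the class_ form, using an inline snippet instead of the live page:

```python
from bs4 import BeautifulSoup

snippet = '<a class="match-info-link-FIXTURES" href="/series/x/full-scorecard">m</a>'
soup = BeautifulSoup(snippet, 'html.parser')

# class_ avoids the clash with Python's reserved word "class"
links = soup.find_all('a', class_='match-info-link-FIXTURES')
assert [a.get('href') for a in links] == ['/series/x/full-scorecard']
```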

Using startswith function to filter a list of urls

I have the following piece of code which extracts all links from a page and puts them in a list (links=[]), which is then passed to the function filter_links().
I wish to filter out any links that are not from the same domain as the starting link, aka the first link in the list. This is what I have:
import requests
from bs4 import BeautifulSoup
import re

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')

links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)
        return filtered_links

print(filter_links(links))
I have used the built-in startswith function, but it's filtering out everything except the starting url.
Eventually I want to pass several different start urls through this program, so I need a generic way of filtering urls that are within the same domain as the starting url. I think I could use regex, but this function should work too?
Try this:
import requests
from bs4 import BeautifulSoup
import re
import tldextract

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')

links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

def filter_links(links):
    ext = tldextract.extract(start_url)
    domain = ext.domain
    filtered_links = []
    for link in links:
        if domain in link:
            filtered_links.append(link)
    return filtered_links

print(filter_links(links))
Note:
You need to move the return statement out of the for loop. Otherwise the function returns after iterating over just the first element, so only the first item of the list gets returned.
Use the tldextract module to extract the domain name from the urls more reliably. If you want to explicitly check whether the links start with links[0], that's up to you.
Output:
['http://enzymebiosystems.org', 'http://enzymebiosystems.org/', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/recent-developments/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/contact-us/', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 
'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/investors-media/news/', 'http://enzymebiosystems.org/investors-media/investor-relations/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/investors-media/stock-information/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/contact-us']
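If installing tldextract is not an option, a rough stdlib-only sketch with urllib.parse.urlparse can compare hostnames instead (the helper name and the www-stripping rule here are my own simplification, not a full public-suffix parse):

```python
from urllib.parse import urlparse

def same_host(link, start_url='http://www.enzymebiosystems.org/'):
    # compare hostnames, ignoring an optional leading "www."
    strip_www = lambda host: host[4:] if host.startswith('www.') else host
    return strip_www(urlparse(link).netloc) == strip_www(urlparse(start_url).netloc)

assert same_host('http://enzymebiosystems.org/contact-us/')
assert not same_host('https://twitter.com/share')
```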
Okay, so you made an indentation error in filter_links(links). The function should look like this:
def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)
    return filtered_links
Notice that in your code, you kept the return statement inside the for loop, so the loop executes once and then returns the list.
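The effect is easy to see side by side; with the return indented inside the loop, only the first element survives:

```python
def broken(items):
    out = []
    for x in items:
        out.append(x)
        return out  # returns on the first iteration

def fixed(items):
    out = []
    for x in items:
        out.append(x)
    return out      # returns after the loop completes

assert broken([1, 2, 3]) == [1]
assert fixed([1, 2, 3]) == [1, 2, 3]
```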
Hope this helps :)
Possible Solution
What if you kept all links which 'contain' the domain?
For example
import pandas as pd

links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

all_links = pd.DataFrame(links, columns=["Links"])
enzyme_df = all_links[all_links.Links.str.contains("enzymebiosystems")]
# results in a dataframe with links containing "enzymebiosystems"
If you want to search multiple domains, see this answer

Extract CSS Selector for an element with Selenium

For my project I need to extract the CSS selectors for a given element that I will find through parsing. What I do is navigate to a page with Selenium, and then with Python and Beautiful Soup I parse the page and check whether there are any elements that I need the CSS selector of.
For example I may try to find any input tags with id "print".
soup.find_all('input', {'id': 'print'})
If I manage to find such an element, I want to extract its CSS selector, something like "input#print". I don't just search by ids but also by a combination of classes and regular expressions.
Is there any way to achieve this?
Try this.
from scrapy.selector import Selector
from selenium import webdriver

link = "https://example.com"
xpath_desire = "normalize-space(//input[@id = 'print'])"

path1 = "./chromedriver"
driver = webdriver.Chrome(executable_path=path1)
driver.get(link)

temp_test = driver.find_element_by_css_selector("body")
elem = temp_test.get_attribute('innerHTML')
value = Selector(text=elem).xpath(xpath_desire).extract()[0]
print(value)
Ok, I am totally new to Python so I am sure there is a better answer for this, but here's my two cents :)
import requests
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/questions/49168556/extract-css-selector-for-an-element-with-selenium"
element = 'a'
idName = 'nav-questions'

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tags = soup.find_all(element, id=idName)

if tags:
    for tag in tags:
        getClassNames = tag.get('class')
        classNames = ''.join(str('.' + x) for x in getClassNames)
        print element + '#' + idName + classNames
else:
    print ':('
This would print something like:
a#nav-questions.-link.js-gps-track
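The tag#id.class1.class2 string itself can be assembled from the pieces BeautifulSoup gives you; the helper below is a hypothetical sketch, not a standard API:

```python
def build_selector(tag_name, id_name, class_names):
    # hypothetical helper: join tag, id, and classes in CSS syntax
    return tag_name + '#' + id_name + ''.join('.' + c for c in class_names)

assert build_selector('a', 'nav-questions', ['-link', 'js-gps-track']) == 'a#nav-questions.-link.js-gps-track'
```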

Having problems following links with webcrawler

I am trying to create a webcrawler that parses all the html on the page, grabs a specified (via raw_input) link, follows that link, and then repeats this process a specified number of times (once again via raw_input). I am able to grab the first link and successfully print it. However, I am having problems "looping" the whole process, and usually grab the wrong link. This is the first link
https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html
(Full disclosure, this question pertains to an assignment for a Coursera course)
Here's my code
import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
rpt = raw_input('Enter Position')
rpt = int(rpt)
cnt = raw_input('Enter Count')
cnt = int(cnt)

count = 0
counts = 0
tags = list()
soup = None

while x == 0:
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Retrieve all of the anchor tags
    tags = soup.findAll('a')
    for tag in tags:
        url = tag.get('href')
        count = count + 1
        if count == rpt:
            break
    counts = counts + 1
    if counts == cnt:
        x == 1
    else:
        continue

print url
Based on DJanssens' response, I found the solution:
url = tags[position-1].get('href')
did the trick for me!
Thanks for the assistance!
I also worked on that course, and with help from a friend, I got this worked out:
import urllib
from bs4 import BeautifulSoup

url = "http://python-data.dr-chuck.net/known_by_Happy.html"
rpt = 7
position = 18
count = 0
counts = 0
tags = list()
soup = None
x = 0

while x == 0:
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.findAll('a')
    url = tags[position-1].get('href')
    count = count + 1
    if count == rpt:
        break

print url
I believe this is what you are looking for:
import urllib
from bs4 import *

url = raw_input('Enter - ')
position = int(raw_input('Enter Position'))
count = int(raw_input('Enter Count'))

# perform the loop "count" times
for _ in xrange(0, count):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    tags = soup.findAll('a')
    # if the link does not exist at that position, show error
    if not tags[position-1]:
        print "A link does not exist at that position."
    # if the link at that position exists, overwrite it so the next search will use it
    url = tags[position-1].get('href')

print url
The code will now loop the number of times specified in the input; each time it takes the href at the given position and replaces the url with it, so each loop looks further down the tree structure.
I advise you to use full names for variables, which makes the code a lot easier to understand. In addition, you could cast them and read them in a single line, which makes the beginning easier to follow.
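The loop's behaviour can be sketched offline, with a dict of hypothetical pages standing in for the fetched html:

```python
# hypothetical pages: each url maps to the list of hrefs found on it
pages = {
    'start.html': ['a.html', 'b.html', 'c.html'],
    'b.html': ['d.html', 'e.html', 'f.html'],
    'e.html': ['g.html', 'h.html', 'i.html'],
}

url, position, count = 'start.html', 2, 2
for _ in range(count):
    # take the href at the given position and follow it
    url = pages[url][position - 1]

assert url == 'e.html'
```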
Here is my 2-cents:
import urllib
#import ssl
from bs4 import BeautifulSoup

#'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
url = raw_input('Enter URL : ')
position = int(raw_input('Enter position : '))
count = int(raw_input('Enter count : '))

print('Retrieving: ' + url)
soup = BeautifulSoup(urllib.urlopen(url).read())

for x in range(1, count + 1):
    link = list()
    for tag in soup('a'):
        link.append(tag.get('href', None))
    print('Retrieving: ' + link[position - 1])
    soup = BeautifulSoup(urllib.urlopen(link[position - 1]).read())

Scraperwiki + lxml. How to get the href attribute of a child of an element with a class?

The link that contains 'alpha' in the URL has many links (hrefs) which I would like to collect from 20 different pages and append to the end of the general url (second to last line). The hrefs are found in a table whose td class is mys-elastic mys-left, and the a is obviously the element which contains the href attribute. Any help would be greatly appreciated; I have been working at this for about a week.
for i in range(1, 11):
    # The HTML scraper for the 20 pages that list all the exhibitors
    url = 'http://ahr13.mapyourshow.com/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page=' + str(i) + '#GotoResults'
    print url
    list_html = scraperwiki.scrape(url)
    root = lxml.html.fromstring(list_html)
    href_element = root.cssselect('td.mys-elastic mys-left a')
    for element in href_element:
        # Convert HTML to lxml object
        href = href_element.get('href')
        print href
        page_html = scraperwiki.scrape('http://ahr13.mapyourshow.com' + href)
        print page_html
No need to muck about with javascript - it's all there in the html:
import scraperwiki
import lxml.html

html = scraperwiki.scrape('http://ahr13.mapyourshow.com/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page=1')
root = lxml.html.fromstring(html)

# get the links
hrefs = root.xpath('//td[@class="mys-elastic mys-left"]/a')
for href in hrefs:
    print 'http://ahr13.mapyourshow.com' + href.attrib['href']
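One detail worth noting: in XPath, @class="..." compares the whole attribute string, so both classes have to appear exactly as written. A minimal inline check (the exhibitor link is made up):

```python
import lxml.html

snippet = '<table><tr><td class="mys-elastic mys-left"><a href="/exh/1">Exhibitor</a></td></tr></table>'
root = lxml.html.fromstring(snippet)

# @class must equal the full value, "mys-elastic mys-left"
assert root.xpath('//td[@class="mys-elastic mys-left"]/a/@href') == ['/exh/1']
assert root.xpath('//td[@class="mys-elastic"]/a/@href') == []
```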
import lxml.html as lh
from itertools import chain

URL = 'http://ahr13.mapyourshow.com/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page='
BASE = 'http://ahr13.mapyourshow.com'
path = '//table[2]//td[@class="mys-elastic mys-left"]//@href'

results = []
for i in range(1, 21):
    doc = lh.parse(URL + str(i))
    results.append(BASE + i for i in doc.xpath(path))

print list(chain(*results))
