I am using the following function to construct a CSS selector with BS4:
def nth_of_type(elem):
    count, curr = 0, 0
    for i, e in enumerate(elem.find_parent().find_all(recursive=False), 1):
        if e.name == elem.name:
            count += 1
        if e == elem:
            curr = i
    return '' if count == 1 else ':nth-child({})'.format(curr)

def getCssPath(elem):
    rv = [elem.name + nth_of_type(elem)]
    while True:
        elem = elem.find_parent()
        if not elem or elem.name == '[document]':
            return '>'.join(rv[::-1])
        rv.append(elem.name + nth_of_type(elem))
So if I scrape a page using:
page_r = requests.get('<my url>')
page_soup = BeautifulSoup(page_r.content, 'html.parser')
elements = page_soup.find_all('a')
print(getCssPath(elements[0]))
# html>body>div:nth-child(2)>div:nth-child(6)>div>div>main>article>div>div:nth-child(1)>div:nth-child(1)>div>div>div>div:nth-child(2)>div:nth-child(1)>div>div>div>div>div:nth-child(1)>div:nth-child(2)>div>div:nth-child(4)>a
But this is very long, so I want to get the shortest possible CSS selector, similar to the one you get when you right-click an element in Chrome and choose Copy > Selector. That selector can involve classes, ids, etc.
Is there an existing BS4 function for this, or how should the function above be modified to get it?
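As far as I know there is no built-in BS4 helper for this, but one common way to shorten the path is to stop walking up the tree as soon as an ancestor has an id and use that id as the anchor, since ids are normally unique. The following is only a rough sketch of that idea (the name getShortCssPath is my own, it reuses nth_of_type from above, and it is not a faithful reimplementation of Chrome's Copy > Selector logic):

def getShortCssPath(elem):
    # Same walk as getCssPath, but an ancestor with an id becomes the anchor
    # and the remaining ancestors are dropped (assumes the id is unique).
    rv = [elem.name + nth_of_type(elem)]
    while True:
        elem = elem.find_parent()
        if not elem or elem.name == '[document]':
            return '>'.join(rv[::-1])
        if elem.get('id'):
            rv.append('#{}'.format(elem['id']))
            return '>'.join(rv[::-1])
        rv.append(elem.name + nth_of_type(elem))

This typically yields something like #content>div:nth-child(2)>a instead of the full html>body>... chain, provided the page actually uses ids.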
I'm trying to get the number of pages from a webpage, along with their links, but for some of the webpages there is no href on the page number. I've tried to handle this with an if statement, but it still returns the error.
The aim was to say: if the page count is not present, just assign it the value one. I'm pretty inexperienced, so I'd appreciate some support on this.
It seems to fail on the seventh page:
final_data = []
for m in range(0, 10):
    df = {'data': []}
    driver.get(links_countries['links'][m])
    time.sleep(3)
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    pages = soup.select("#yw2 a")
    pag = []
    for p in pages:
        pag.append(p.get_text(strip=True).replace('', ''))
    pag = [string for string in pag if string != ""]
    if int(pag[-1]) < 1:
        int(1)
    else:
        continue
    print('Page number', pag)
    href = []
    for t in pages:
        href.append(t['href'])
    href = [string for string in href if string != ""]
    urls = "https://www.transfermarkt.co.uk" + href[0]
    print(urls)
This output is given by replacing the if statement with pag = pag[-1]
Output:
Page number 2
https://www.transfermarkt.co.uk/spieler-statistik/wertvollstespieler/marktwertetop/plus/ausrichtung/alle/spielerposition_id/alle/altersklasse/alle/jahrgang/0/land_id/1/kontinent_id/0/yt0/Show/0/
Page number 10
https://www.transfermarkt.co.uk/spieler-statistik/wertvollstespieler/marktwertetop/plus/ausrichtung/alle/spielerposition_id/alle/altersklasse/alle/jahrgang/0/land_id/2/kontinent_id/0/yt0/Show/0/
Page number 10
https://www.transfermarkt.co.uk/spieler-statistik/wertvollstespieler/marktwertetop/plus/ausrichtung/alle/spielerposition_id/alle/altersklasse/alle/jahrgang/0/land_id/3/kontinent_id/0/yt0/Show/0/
Page number 10
https://www.transfermarkt.co.uk/spieler-statistik/wertvollstespieler/marktwertetop/plus/ausrichtung/alle/spielerposition_id/alle/altersklasse/alle/jahrgang/0/land_id/4/kontinent_id/0/yt0/Show/0/
Page number 4
https://www.transfermarkt.co.uk/spieler-statistik/wertvollstespieler/marktwertetop/plus/ausrichtung/alle/spielerposition_id/alle/altersklasse/alle/jahrgang/0/land_id/5/kontinent_id/0/yt0/Show/0/
Page number 5
https://www.transfermarkt.co.uk/spieler-statistik/wertvollstespieler/marktwertetop/plus/ausrichtung/alle/spielerposition_id/alle/altersklasse/alle/jahrgang/0/land_id/6/kontinent_id/0/yt0/Show/0/
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-199-1fd034d415eb> in <module>
10 pag.append(p.get_text(strip=True).replace('', ''))
11 pag = [string for string in pag if string != ""]
---> 12 pag = int(pag[-1])
13 print('Page number', pag)
14 href = []
IndexError: list index out of range
Take a look at the error message.
IndexError: list index out of range
The reason you get the above error is that your list pag is empty, so pag[-1] cannot be evaluated.
If you want to skip the page, then at least use this check instead of handling the error.
if pag:
    pag = int(pag[-1])
else:
    continue
The statement if pag is equivalent to if len(pag) > 0. However, I follow the Google Python Style Guide, so I'm going to stick with if pag.
I managed to at least skip the URL with:
try:
    pag = int(pag[-1])
except:
    continue
I'm trying to write a scraper that randomly chooses a wiki article link from a page, goes there, grabs another, and loops. I want to exclude links with "Category:", "File:", or "List" in the href. I'm pretty sure the links I want are all inside of p tags, but when I include "p" in find_all, I get an "int object is not subscriptable" error.
The code below returns wiki pages but does not exclude the things I want to filter out.
This is a learning journey for me. All help is appreciated.
import requests
from bs4 import BeautifulSoup
import random
import time

def scrapeWikiArticle(url):
    response = requests.get(
        url=url,
    )

    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find(id="firstHeading")
    print(title.text)
    print(url)
    allLinks = soup.find(id="bodyContent").find_all("a")
    random.shuffle(allLinks)
    linkToScrape = 0

    for link in allLinks:
        # Here I am trying to select hrefs with /wiki/ in them and exclude hrefs
        # with "Category:" etc. It does select for wikis but does not exclude anything.
        if link['href'].find("/wiki/") == -1:
            if link['href'].find("Category:") == 1:
                if link['href'].find("File:") == 1:
                    if link['href'].find("List") == 1:
                        continue

        # Use this link to scrape
        linkToScrape = link
        articleTitles = open("savedArticles.txt", "a+")
        articleTitles.write(title.text + ", ")
        articleTitles.close()
        time.sleep(6)
        break

    scrapeWikiArticle("https://en.wikipedia.org" + linkToScrape['href'])

scrapeWikiArticle("https://en.wikipedia.org/wiki/Anarchism")
You need to modify the for loop. .attrs is used to access the attributes of a tag, and if you want to exclude a link when its href value contains a particular keyword, use a != -1 comparison.
Modified code:
import requests
from bs4 import BeautifulSoup
import random
import time

def scrapeWikiArticle(url):
    response = requests.get(
        url=url,
    )

    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find(id="firstHeading")
    allLinks = soup.find(id="bodyContent").find_all("a")
    random.shuffle(allLinks)
    linkToScrape = 0

    for link in allLinks:
        if("href" in link.attrs):
            if link.attrs['href'].find("/wiki/") == -1 or link.attrs['href'].find("Category:") != -1 or link.attrs['href'].find("File:") != -1 or link.attrs['href'].find("List") != -1:
                continue

            linkToScrape = link
            articleTitles = open("savedArticles.txt", "a+")
            articleTitles.write(title.text + ", ")
            articleTitles.close()
            time.sleep(6)
            break

    if(linkToScrape):
        scrapeWikiArticle("https://en.wikipedia.org" + linkToScrape.attrs['href'])

scrapeWikiArticle("https://en.wikipedia.org/wiki/Anarchism")
This section seems problematic.
if link['href'].find("/wiki/") == -1:
    if link['href'].find("Category:") == 1:
        if link['href'].find("File:") == 1:
            if link['href'].find("List") == 1:
                continue
find returns the index of the substring you are looking for (or -1 if it is not found), and you are also using it wrong. The condition you want is: if "/wiki/" is not found, or "Category:", "File:", etc. does appear in the href, then continue.
if link['href'].find("/wiki/") == -1 or \
   link['href'].find("Category:") != -1 or \
   link['href'].find("File:") != -1 or \
   link['href'].find("List") != -1:
    print("skipped " + link["href"])
    continue
Saint Petersburg
https://en.wikipedia.org/wiki/St._Petersburg
National Diet Library
https://en.wikipedia.org/wiki/NDL_(identifier)
Template talk:Authority control files
https://en.wikipedia.org/wiki/Template_talk:Authority_control_files
skipped #searchInput
skipped /w/index.php?title=Template_talk:Authority_control_files&action=edit§ion=1
User: Tom.Reding
https://en.wikipedia.org/wiki/User:Tom.Reding
skipped http://toolserver.org/~dispenser/view/Main_Page
Iapetus (moon)
https://en.wikipedia.org/wiki/Iapetus_(moon)
87 Sylvia
https://en.wikipedia.org/wiki/87_Sylvia
skipped /wiki/List_of_adjectivals_and_demonyms_of_astronomical_bodies
Asteroid belt
https://en.wikipedia.org/wiki/Main_asteroid_belt
Detached object
https://en.wikipedia.org/wiki/Detached_object
Use :not() to handle the list of exclusions on href, together with the * (contains) attribute operator; this filters out hrefs containing the specified substrings. Precede it with an attribute-contains selector requiring /wiki/ in the href. I have specified a case-insensitive match via the i flag for the first two exclusions, which can be removed:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://en.wikipedia.org/wiki/2018_FIFA_World_Cup#Prize_money')
soup = bs(r.content, 'lxml') # 'html.parser'
links = [i['href'] for i in soup.select('#bodyContent a[href*="/wiki/"]:not([href*="Category:" i], [href*="File:" i], [href*="List"])')]
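To sanity-check the selector (my own addition), you can print a few of the surviving hrefs and confirm that Category:, File: and List pages are gone:

# Inspect a small, arbitrary sample of the kept links.
for href in links[:10]:
    print(href)
print(len(links), 'links kept in total')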
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page=1&q=laptop&sid=6bo%2Fb5g&viewType=list")
c = r.content
soup = BeautifulSoup(c, "html.parser")

all = soup.find_all("div", {"class": "col _2-gKeQ"})
page_nr = soup.find_all("a", {"class": "_33m_Yg"})[-1].text
print(page_nr, "number of pages were found")
#all[0].find("div",{"class":"_1vC4OE _2rQ-NK"}).text

l = []
base_url = "https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page=1&q=laptop&sid=6bo%2Fb5g&viewType=list"

for page in range(0, int(page_nr)*10, 10):
    print()
    r = requests.get(base_url + str(page) + ".html")
    c = r.content
    #c=r.json()["list"]
    soup = BeautifulSoup(c, "html.parser")
    for item in all:
        d = {}
        #price
        d["Price"] = item.find("div", {"class": "_1vC4OE _2rQ-NK"}).text
        #Name
        d["Name"] = item.find("div", {"class": "_3wU53n"}).text
        for li in item.find_all("li", {"class": "_1ZRRx1"}):
            if " EMI" in li.text:
                d["EMI"] = li.text
            else:
                d["EMI"] = None
        for li1 in item.find_all("li", {"class": "_1ZRRx1"}):
            if "Special " in li1.text:
                d["Special Price"] = li1.text
            else:
                d["Special Price"] = None
        for val in item.find_all("li", {"class": "tVe95H"}):
            if "Display" in val.text:
                d["Display"] = val.text
            elif "Warranty" in val.text:
                d["Warrenty"] = val.text
            elif "RAM" in val.text:
                d["Ram"] = val.text
        l.append(d)

import pandas
df = pandas.DataFrame(l)
This might work on standard pagination:
i = 1
items_parsed = set()
loop = True
base_url = "https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page={}&q=laptop&sid=6bo%2Fb5g&viewType=list"

while True:
    page = requests.get(base_url.format(i))
    items = requests.get(#yourelements#)
    if not items:
        break

    for item in items:
        # Scrape your item and, once the scrape succeeds, put the URL of the
        # parsed item into url_parsed (details below the code), for example:
        url_parsed = your_stuff(items)
        if url_parsed in items_parsed:
            loop = False
        items_parsed.add(url_parsed)

    if not loop:
        break
    i += 1
I formatted your URL so that ?page=X becomes base_url.format(i); that way the loop can iterate until no items are found on a page, or until the site sends you back to page 1 once you go past max_page.
If, above the maximum page, you get the items you already parsed on the first page, you can declare a set(), add the URL of every item you parse, and then check whether you have already parsed it.
Note that this is just an idea.
Since the page number in the URL is almost in the middle I'd apply a similar change to your code:
base_url="https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page="
end_url ="&q=laptop&sid=6bo%2Fb5g&viewType=list"
for page in range(1, page_nr + 1):
r=requests.get(base_url+str(page)+end_url+".html")
From the initial URL you only have access to the first 10 pages.
You can make a loop from "&page=1" to "&page=26".
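A minimal sketch of that loop (my own, under the assumption that only the page parameter needs to change between requests and that 26 really is the last page):

import requests
from bs4 import BeautifulSoup

base_url = "https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page="
end_url = "&q=laptop&sid=6bo%2Fb5g&viewType=list"

for page in range(1, 27):
    # Fetch and parse one results page per iteration.
    soup = BeautifulSoup(requests.get(base_url + str(page) + end_url).content, "html.parser")
    # ... extract the items of this page from soup here ...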
My problem is related to this answer.
I have the following code:
import urllib.request
from bs4 import BeautifulSoup

time = 0
html = urllib.request.urlopen("https://www.kramerav.com/de/Product/VM-2N").read()
html2 = urllib.request.urlopen("https://www.kramerav.com/de/Product/SDIA-IN2-F16").read()

try:
    div = str(BeautifulSoup(html).select("div.large-image")[0])
    if(str(BeautifulSoup(html).select("div.large-image")[1]) != ""):
        div += str(BeautifulSoup(html).select("div.large-image")[1])
    time = time + 1
except IndexError:
    div = ""
    time = time + 1
finally:
    print(str(time) + div)
The page behind html has 2 divs with the class "large-image"; the page behind html2 only has 1.
With html the program works as intended, but if I switch to html2 the variable div ends up completely empty.
I would like to save the single div rather than saving nothing. How could I achieve this?
the variable div is going to be completely empty.
That's because your error handler assigned it the empty string.
Please don't use subscripts, conditionals, and handlers in that way. It would be more natural to iterate over the results of select() with for, building up a result list (or string).
Also, you should create soup = BeautifulSoup(html) just once, as that can be a fairly expensive operation, since it carefully parses a potentially long web page. With that, you could build up a list of HTML fragments with:
images = [str(image)
          for image in soup.select('div.large-image')]
Or, if for some reason you're not fond of list comprehensions, you could equivalently write:
images = []
for image in soup.select('div.large-image'):
    images.append(str(image))
and then get the required html with div = '\n'.join(images).
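Putting those pieces together, a minimal sketch for the SDIA-IN2-F16 page from the question (same variable names as the question; the parser choice is mine):

import urllib.request
from bs4 import BeautifulSoup

html2 = urllib.request.urlopen("https://www.kramerav.com/de/Product/SDIA-IN2-F16").read()
soup = BeautifulSoup(html2, 'html.parser')   # parse once

images = [str(image) for image in soup.select('div.large-image')]
time = len(images)            # works for 0, 1 or 2 matches
div = '\n'.join(images)
print(str(time) + div)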
You can concatenate all items inside a for loop:
all_divs = soup.select("div.large-image")

for item in all_divs:
    div += str(item)
    time += 1
or using join()
time = len(all_divs)
div = ''.join(str(item) for item in all_divs)
You can also write to the file directly inside the for loop, one row per item:
for item in all_divs:
    csv_writer.writerow([str(item).strip()])
    time += 1
Working example
import urllib.request
from bs4 import BeautifulSoup
import csv

div = ""
time = 0

f = open('output.csv', 'w')
csv_writer = csv.writer(f)

all_urls = [
    "https://www.kramerav.com/de/Product/VM-2N",
    "https://www.kramerav.com/de/Product/SDIA-IN2-F16",
]

for url in all_urls:
    print('url:', url)
    html = urllib.request.urlopen(url).read()

    try:
        soup = BeautifulSoup(html)
        all_divs = soup.select("div.large-image")

        for item in all_divs:
            div += str(item)
            time += 1

        # or

        time = len(all_divs)
        div = ''.join(str(item) for item in all_divs)

        # or

        for item in all_divs:
            #div += str(item)
            #time += 1
            csv_writer.writerow([time, str(item).strip()])

    except IndexError as ex:
        print('Error:', ex)
        time += 1
    finally:
        print(time, div)

f.close()
I'm writing a solution to test this phenomenon in Python. I have most of the logic done, but there are many edge cases that arise when following links in Wikipedia articles.
The problem I'm running into arises for a page like this where the first <p> has multiple levels of child elements and the first <a> tag after the first set of parentheses needs to be extracted. In this case, (to extract this link), you have to skip over the parentheses, and then get to the very next anchor tag/href. In most articles, my algorithm can skip over the parentheses, but with the way that it looks for links in front of parentheses (or if they don't exist), it is finding the anchor tag in the wrong place. Specifically, here: <span style="font-size: small;"><span id="coordinates">Coordinates
The algorithm works by iterating through the elements in the first paragraph tag (in the main body of the article), stringifying each element iteratively, and first checking whether it contains either an '(' or an '<a'.
Is there any straightforward way to avoid embedded anchor tags and only take the first link that is a direct child of the first <p>?
Below is the function in question, for reference:
def getValidLink(self, currResponse):
    currRoot = BeautifulSoup(currResponse.text, "lxml")
    temp = currRoot.body.findAll('p')[0]
    parenOpened = False
    parenCompleted = False
    openCount = 0
    foundParen = False

    while temp.next:
        temp = temp.next
        curr = str(temp)
        if '(' in curr and str(type(temp)) == "<class 'bs4.element.NavigableString'>":
            foundParen = True
            break
        if '<a' in curr and str(type(temp)) == "<class 'bs4.element.Tag'>":
            link = temp
            break

    temp = currRoot.body.findAll('p')[0]

    if foundParen:
        while temp.next and not parenCompleted:
            temp = temp.next
            curr = str(temp)
            if '(' in curr:
                openCount += 1
                if parenOpened is False:
                    parenOpened = True
            if ')' in curr and parenOpened and openCount > 1:
                openCount -= 1
            elif ')' in curr and parenOpened and openCount == 1:
                parenCompleted = True
        try:
            return temp.findNext('a').attrs['href']
        except KeyError:
            print "\nReached article with no main body!\n"
            return None

    try:
        return str(link.attrs['href'])
    except KeyError:
        print "\nReached article with no main body\n"
        return None
I think you are seriously overcomplicating the problem.
There are multiple ways to use the direct parent-child relationship between the elements in BeautifulSoup. One way is the > CSS selector:
In [1]: import requests
In [2]: from bs4 import BeautifulSoup
In [3]: url = "https://en.wikipedia.org/wiki/Sierra_Leone"
In [4]: response = requests.get(url)
In [5]: soup = BeautifulSoup(response.content, "html.parser")
In [6]: [a.get_text() for a in soup.select("#mw-content-text > p > a")]
Out[6]:
['West Africa',
'Guinea',
'Liberia',
...
'Allen Iverson',
'Magic Johnson',
'Victor Oladipo',
'Frances Tiafoe']
Here we've found the a elements located directly under the p elements that are themselves directly under the element with id="mw-content-text"; from what I understand, this is where the main Wikipedia article body is located.
If you need a single element, use select_one() instead of select().
Also, if you want to solve it via find*(), pass the recursive=False argument.
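A rough equivalent with find()/find_all() (my own sketch; it relies on the same assumption as the CSS selector above, namely that the p elements are direct children of #mw-content-text):

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/Sierra_Leone").content, "html.parser")
content = soup.find(id="mw-content-text")

first_level_links = []
for p in content.find_all("p", recursive=False):
    # recursive=False keeps only <a> tags that are direct children of this <p>,
    # so anchors nested inside spans, sups, etc. are skipped.
    first_level_links.extend(p.find_all("a", recursive=False))

print([a.get_text() for a in first_level_links[:5]])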