Python Beautiful Soup: Not finding all links

I am trying to scrape match data for League of Legends. However, when using this link (https://eu.lolesports.com/en/schedule#slug=all), the match links (with the "Show Match" button) do not show up.
I am using the following code:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = 'https://eu.lolesports.com/en/schedule#slug=na-lcs'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()
soup = BeautifulSoup(html_page, "lxml")
for a in soup.find_all('a', href=True):
    link = a['href']
    print(link)
I would like to find the match links that have this format: "/en/lck/lck_2018_spring/match/2018-01-23/bbq-olivers-vs-rox-tigers". But instead I only get links like these:
http://euw.leagueoflegends.com/en/legal/privacy
https://euw.leagueoflegends.com/en/legal/cookie-policy
/en/about
http://twitch.tv/riotgames
http://facebook.com/lolesports
http://twitter.com/lolesports
http://youtube.com/lolesports
http://www.azubu.tv/lolesports
http://instagram.com/lolesports
http://leagueoflegends.com
Is there something I can change in my code so that I can get the match links? Thanks in advance.

It looks like a JavaScript-rendered page. You need to use a WebKit library (or another tool that executes JavaScript) to render the page -> get the HTML -> scrape the HTML for links.
This link should be useful: https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/
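As a minimal sketch of that approach using Selenium with headless Chrome (my choice of tool here, assuming Chrome and a matching chromedriver are installed), filtering on the "/match/" segment taken from the link format in the question:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://eu.lolesports.com/en/schedule#slug=na-lcs")
html = driver.page_source  # HTML after the JavaScript has run
driver.quit()

soup = BeautifulSoup(html, "lxml")
for a in soup.find_all("a", href=True):
    if "/match/" in a["href"]:  # keep only the match links
        print(a["href"])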

Related

Exporting data from HTML to Excel

I just started programming.
I have the task of extracting data from an HTML page into Excel.
I am using Python 3.7.
My problem is that I have a website with more URLs inside, and behind these URLs there are again more URLs. I need the data behind the third URL.
My first problem would be: how can I tell the program to choose only specific links from a ul, rather than every ul on the page?
from bs4 import BeautifulSoup
import urllib.request
import re

page = urllib.request.urlopen("file").read()
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())
for link in soup.find_all("a", href=re.compile("katalog_")):
    links = link.get("href")
    if "katalog" in links:
        for link in soup.find_all("a", href=re.compile("alle_")):
            links = link.get("href")
            print(soup.get_text())
There are many ways; one is to use "find_all" and be specific about the tags, like "a", just as you did. If that's the only option, then use a regular expression on your output. You can refer to this thread: Python BeautifulSoup Extract specific URLs. Also, please show us either the link or the HTML structure of the links you want to extract; we would like to see the differences between the URLs.
PS: Sorry, I can't make comments because of <50 reputation, or I would have.
Updated answer based on understanding:
from bs4 import BeautifulSoup
import urllib.request

page = urllib.request.urlopen("https://www.bsi.bund.de/DE/Themen/ITGrundschutz/ITGrundschutzKompendium/itgrundschutzKompendium_node.html").read()
soup = BeautifulSoup(page, "html.parser")
for firstlink in soup.find_all("a", {"class": "RichTextIntLink NavNode"}):
    firstlinks = firstlink.get("href")
    if "bausteine" in firstlinks:
        bausteinelinks = "https://www.bsi.bund.de/" + str(firstlinks.split(';')[0])
        response = urllib.request.urlopen(bausteinelinks).read()
        soup2 = BeautifulSoup(response, 'html.parser')  # separate name so the outer soup is not overwritten
        secondlink = "https://www.bsi.bund.de/" + str((soup2.find("a", {"class": "RichTextIntLink Basepage"})["href"]).split(';')[0])
        res = urllib.request.urlopen(secondlink).read()
        soup3 = BeautifulSoup(res, 'html.parser')
        listoftext = soup3.find_all("div", {"id": "content"})
        for text in listoftext:
            print(text.text)
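Since the original task was to get the data into Excel, here is a minimal sketch of that last step using openpyxl (my assumption; pandas or any other Excel writer works just as well), where listoftext is the result collected above:

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
for row, text in enumerate(listoftext, start=1):
    ws.cell(row=row, column=1, value=text.text)  # one div's text per row
wb.save("output.xlsx")  # hypothetical output filename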

how to fix the def to return the links

I located some links on a website with BeautifulSoup and need to return them in a list (or a txt file) to use them later on.
The goal is to get some text from the pages those links lead to. I tried to write a def to return the links, but I couldn't get it working.
for link in soup.find_all('a', href=True):
    print(link["href"])
I get a list of links from the code above and could write it into a text file (by myself) and make a new Python script, but I would rather "return" it so the script can continue, and learn something along the way.
I came up with this, but it doesn't work:
def linkgetter(soup):
    for link in soup.find('a', href=True):
        return soup
it prints out the whole site's html code and doesn't filter the links.
def get_links(soup):
    return [link["href"] for link in soup.find_all('a', href=True)]
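Usage would look something like this (a sketch with a hypothetical URL, assuming the page is fetched with requests):

import requests
from bs4 import BeautifulSoup

res = requests.get("https://example.com")  # hypothetical URL
soup = BeautifulSoup(res.text, "html.parser")
print(get_links(soup))  # e.g. ['/about', 'https://...']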
You can try the following:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

def parse_links(url):
    links = []
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all('a'):
        links.append(link.get('href'))
    return links

print(parse_links("https://stackoverflow.com/questions/57826906/how-to-fix-the-def-to-return-the-links#57826972"))
If you would like to get the links starting with http://, you can use:
soup.findAll('a', attrs={'href': re.compile("^http://")})

How to extract href attribute in html source code

This is HTML source code that I am dealing with:
<a href="/people/charles-adams" class="gridlist__link">
So what I want to do is extract the href attribute, which in this case would be "/people/charles-adams", with the BeautifulSoup module. I need this because I want to get the HTML source code of that particular webpage with the soup.findAll method. But I am struggling to extract this attribute from the webpage. Could anyone help me with this problem?
P.S.
I am using this method to get the HTML source code with the Python module BeautifulSoup:
import requests
from bs4 import BeautifulSoup

request = requests.get(link, headers=header)  # link and header are defined elsewhere
html = request.text
soup = BeautifulSoup(html, 'html.parser')
Try something like:
refs = soup.find_all('a')
for i in refs:
    if i.has_attr('href'):
        print(i['href'])
It should output:
/people/charles-adams
You can tell beautifulsoup to find all anchor tags with soup.find_all('a'). Then you can filter it with list comprehension and get the links.
request = requests.get(link, headers=header)
html = request.text
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all('a')
tags = [tag for tag in tags if tag.has_attr('href')]
links = [tag['href'] for tag in tags]
links will be ['/people/charles-adams']
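Note that the extracted href is relative. If you need absolute URLs, one option (a sketch; the base URL here is hypothetical) is urllib.parse.urljoin:

from urllib.parse import urljoin

base = "https://example.com"  # hypothetical: whatever site you requested
links = [urljoin(base, href) for href in links]
# ['https://example.com/people/charles-adams']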

How to disable all links not in a list, using beautiful soup

I am currently working on a web application (using Flask for the backend).
In my backend, I retrieve the page source of a given url using selenium. I want to go through the page_source and disable all links whose href is not inside a list. Something like:
body = browser.page_source
soup = BeautifulSoup(body, 'html.parser')
for link in soup.a:
    if not (link['href'] in link_list):
        link['href'] = ""
I am new to Beautiful Soup, so I am unsure about the syntax. I am using Beautiful Soup 4.
Figured it out:
soup = BeautifulSoup(c_body, 'lxml')  # you can also use html.parser
for a in soup.findAll('a'):
    if not (a['href'] in src_lst):  # src_lst is a list of the urls you want to keep
        del a['href']
        a.name = 'span'  # to avoid the style associated with links
soup.span.unwrap()  # to remove span tags and keep text only
c_body = str(soup)  # c_body will be displayed in an iframe using srcdoc
EDIT: The above code might break if there are no span tags, so this would be a better approach:
soup = BeautifulSoup(c_body, 'lxml')
for a in soup.findAll('a'):
    if a.has_attr("href"):
        if not (a['href'] in src_lst):
            del a['href']
            a.name = 'span'
if len(soup.findAll('span')) > 0:
    soup.span.unwrap()
c_body = str(soup)

How can I reach seq tag data via web scraping with BeautifulSoup?

I am a newbie to web scraping. I am trying to get a FASTA file from here, but somehow I cannot. First of all, the problem starts for me at the span tag; I tried a couple of suggestions, but they did not work for me, and I am suspecting that maybe there is a privacy problem.
The FASTA data is in this class, but when I run this code, I can only see the FASTA title:
url = "https://www.ncbi.nlm.nih.gov/nuccore/193211599?report=fasta"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
fasta_data = soup.find_all("div")
for link in soup.find_all("div", {"class": "seqrprt seqviewer"}):
print link.text
url = "https://www.ncbi.nlm.nih.gov/nuccore/193211599?report=fasta"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
fasta_data = soup.find_all("div")
for link in soup.find_all("div", {"class": "seqrprt seqviewer"}):
print link.text
##When I try to reach directly via span, output is empty.
div = soup.find("div", {'id':'viewercontent1'})
spans = div.find_all('span')
for span in spans:
print span.string
Every scraping job involves two phases:
1. Understand the page that you want to scrape. (How does it work? Is content loaded via Ajax? Are there redirections? POST? GET? iframes? anti-scraping measures?...)
2. Emulate the webpage using your favourite framework.
Do not write a single line of code before working through point 1. Your browser's network inspector (e.g. Chrome DevTools) is your friend; use it!
Regarding your webpage, it seems that the report is loaded into a viewer that gets its data from this URL:
https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=193211599&db=nuccore&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
Use that URL and you will get your report.
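For instance, a minimal sketch with requests (assuming the endpoint still responds as it did when this answer was written):

import requests

# The exact viewer URL quoted above, which returns the FASTA report.
url = ("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"
       "?id=193211599&db=nuccore&report=fasta&extrafeat=0&fmt_mask=0"
       "&retmode=html&withmarkup=on&tool=portal&log$=seqview"
       "&maxdownloadsize=1000000")
res = requests.get(url)
print(res.text)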
