Cannot scrape Google Patents URL through Python and Beautiful Soup

I am currently trying to scrape a link to Google Patents from this page,
https://datatool.patentsview.org/#detail/patent/10745438, but when I try to print out all of the links with an 'a' tag, only unrelated links come up.
Here is my code so far:
import requests
from bs4 import BeautifulSoup

url = 'https://datatool.patentsview.org/#detail/patent/10745438'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
links = []
print(soup)
for link in soup.find_all('a', href=True):
    print(link['href'])
When I print out the soup, the 'a' tag with the link to Google Patents isn't printed, nor does the link appear in the list. The only thing printed is
http://uspto.gov/
tel:1-800-786-9199
./#viz/relationships
./#viz/locations
./#viz/comparisons
, all of which is irrelevant. Is Google protecting their links in some way, or is there another way I can retrieve the link to the Google patent or redirect to the page?

Don't scrape it, just do some link hacking:
url = 'https://datatool.patentsview.org/#detail/patent/10745438'
google_patents_url = 'https://www.google.com/patents/US' + url.rsplit('/', 1)[1]
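The reason requests never sees the Google Patents link is that everything after the # is a URL fragment: the browser hands it to the page's JavaScript and never sends it to the server, so the static HTML contains only the site chrome. Here is a minimal sketch of the link hack above, with an added verification step (the redirect to patents.google.com is an assumption about Google's current behavior, not part of the original answer):

import requests

url = 'https://datatool.patentsview.org/#detail/patent/10745438'

# The patent number is the last segment of the fragment
patent_number = url.rsplit('/', 1)[1]
google_patents_url = 'https://www.google.com/patents/US' + patent_number

# Sanity check: follow redirects to the canonical Google Patents page
response = requests.get(google_patents_url, allow_redirects=True)
print(response.status_code, response.url)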

Related

Fetch only specific links using selenium in python

I am trying to fetch the links to all news articles related to Apple from this webpage: https://finance.yahoo.com/quote/AAPL/news?p=AAPL. But there are also a lot of advertisement links in between, and other links leading to other pages of the website. How do I selectively fetch only the links to news articles?
Here is the code I have written so far:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path='C:\\Users\\Home\\OneDrive\\Desktop\\AJ\\chromedriver_win32\\chromedriver.exe')
driver.get("https://finance.yahoo.com/quote/AAPL/news?p=AAPL")
links = []
for a in driver.find_elements_by_xpath('.//a'):
    links.append(a.get_attribute('href'))

def get_info(url):
    # send request
    response = requests.get(url)
    # parse
    soup = BeautifulSoup(response.text, 'html.parser')
    # get the information we need
    news = soup.find('div', attrs={'class': 'caas-body'}).text
    headline = soup.find('h1').text
    date = soup.find('time').text
    return news, headline, date
Can anyone offer guidance on how to do this, or point me to a resource that can help? Thanks!
Try this XPath to get all the news links from that page (note that attributes in XPath are prefixed with @):
//li[contains(@class,'js-stream-content')]/div[@data-test-locator='mega']//h3/a
import time
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.implicitly_wait(10)
driver.maximize_window()
driver.get("https://finance.yahoo.com/quote/AAPL/news?p=AAPL")
time.sleep(10)  # give the page time to load its news stream
links = driver.find_elements_by_xpath("//li[contains(@class,'js-stream-content')]/div[@data-test-locator='mega']//h3/a")
for link in links:
    print(link.get_attribute("href"))
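Once you have the links, you can feed each href into the get_info function from the question. A short usage sketch continuing from the code above (the try/except is my addition, since some hrefs lead to pages without a caas-body div):

articles = []
for link in links:
    href = link.get_attribute("href")
    try:
        news, headline, date = get_info(href)
        articles.append({"headline": headline, "date": date, "news": news})
    except AttributeError:
        pass  # page without the expected layout; skip it
print(len(articles), "articles scraped")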

Trying to Crawl Yelp Search Results page for Profile URLs

I am trying to scrape the profile URLs from a Yelp search results page using Beautiful Soup. This is the code I currently have:
url="https://www.yelp.com/search?find_desc=tree+-+removal+-+&find_loc=Baltimore+MD&start=40"
response=requests.get(url)
data=response.text
soup = BeautifulSoup(data,'lxml')
for a in soup.find_all('a', href=True):
with open(r'C:\Users\my.name\Desktop\Yelp-URLs.csv',"a") as f:
print(a,file=f)
This gives me every href link on the page, not just profile URLs. Additionally, I am getting the full anchor element (a class="lemon...."), when I just need the business profile URLs.
Please help.
You can narrow down the matched hrefs by using select with a CSS attribute selector.
for a in soup.select('a[href^="/biz/"]'):
    with open(r'/Users/my.name/Desktop/Yelp-URLs.csv', "a") as f:
        print(a.attrs['href'], file=f)
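The ^= operator matches hrefs that start with /biz/, the prefix Yelp uses for business profile paths. Since those hrefs are relative, you may also want absolute URLs, and opening the file once outside the loop avoids reopening it per link. A sketch continuing from the soup above (the set-based deduplication is my addition, since the same profile can appear several times on a results page):

from urllib.parse import urljoin

seen = set()
with open(r'/Users/my.name/Desktop/Yelp-URLs.csv', "a") as f:
    for a in soup.select('a[href^="/biz/"]'):
        full_url = urljoin("https://www.yelp.com", a.attrs['href'])
        if full_url not in seen:
            seen.add(full_url)
            print(full_url, file=f)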

How to disable all links not in a list, using beautiful soup

I am currently working on a web application (using Flask for the backend).
In my backend, I retrieve the page source of a given URL using Selenium. I want to go through the page_source and disable all links whose href is not inside a list. Something like:
body = browser.page_source
soup = BeautifulSoup(body, 'html.parser')
for link in soup.find_all('a'):
    if not (link['href'] in link_list):
        link['href'] = ""
I am new to Beautiful Soup, so I am unsure about the syntax. I am using Beautiful Soup 4.
Figured it out:
soup = BeautifulSoup(c_body, 'lxml')  # you can also use html.parser
for a in soup.find_all('a'):
    if not (a['href'] in src_lst):  # src_lst is a list of the urls you want to keep
        del a['href']
        a.name = 'span'  # to avoid the style associated with links
soup.span.unwrap()  # to remove span tags and keep text only
c_body = str(soup)  # c_body will be displayed in an iframe using srcdoc
EDIT: The above code might break if there are no span tags, so this would be a better approach:
soup = BeautifulSoup(c_body, 'lxml')
for a in soup.find_all('a'):
    if a.has_attr('href'):
        if not (a['href'] in src_lst):
            del a['href']
            a.name = 'span'
for span in soup.find_all('span'):  # unwrap every span to keep text only
    span.unwrap()
c_body = str(soup)
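Note that unwrapping every span also removes any spans the page had to begin with. If you want to unwrap only the anchors you just converted, you can collect them as you go; a sketch (the disabled list is my name, not from the original answer):

soup = BeautifulSoup(c_body, 'lxml')
disabled = []
for a in soup.find_all('a'):
    if a.has_attr('href') and a['href'] not in src_lst:
        del a['href']
        a.name = 'span'
        disabled.append(a)
for span in disabled:  # unwrap only the anchors we converted
    span.unwrap()
c_body = str(soup)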

Python Beautiful Soup: Not finding all links

I am trying to scrape match data for League of Legends. However, when using this link (https://eu.lolesports.com/en/schedule#slug=all), the match links (with the button: "Show Match") do not show up.
I am using the following code:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = 'https://eu.lolesports.com/en/schedule#slug=na-lcs'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()
soup = BeautifulSoup(html_page, "lxml")
for a in soup.find_all('a', href=True):
    link = a['href']
    print(link)
I would like to find the match links that have this format: "/en/lck/lck_2018_spring/match/2018-01-23/bbq-olivers-vs-rox-tigers". But instead I only get links like these:
http://euw.leagueoflegends.com/en/legal/privacy
https://euw.leagueoflegends.com/en/legal/cookie-policy
/en/about
http://twitch.tv/riotgames
http://facebook.com/lolesports
http://twitter.com/lolesports
http://youtube.com/lolesports
http://www.azubu.tv/lolesports
http://instagram.com/lolesports
http://leagueoflegends.com
Is there something that can be done to change my code so I can get the match links? Thanks in advance.
It looks like a JavaScript-rendered page. You need to use a browser engine to render the page, get the resulting HTML, and then scrape that HTML for links.
This link should be useful: https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/
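A minimal sketch of that approach using Selenium (assumptions: chromedriver is on your PATH, the fixed sleep is long enough for the schedule to render, and the match links still contain the /match/ path segment shown in the question):

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://eu.lolesports.com/en/schedule#slug=na-lcs')
time.sleep(10)  # crude wait for the JavaScript to render the schedule

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

for a in soup.find_all('a', href=True):
    if '/match/' in a['href']:  # keep only the match links
        print(a['href'])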

How can I reach seq tag data via web scraping with BeautifulSoup?

I am a newbie to web scraping. I am trying to get the FASTA file from here, but somehow I cannot. The first problem for me is the span tag; I tried a couple of suggestions, but they did not work for me, and I suspect that maybe there is a privacy problem.
The FASTA data is in this class, but when I run this code, I can only see the FASTA title:
url = "https://www.ncbi.nlm.nih.gov/nuccore/193211599?report=fasta"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
fasta_data = soup.find_all("div")
for link in soup.find_all("div", {"class": "seqrprt seqviewer"}):
print link.text
url = "https://www.ncbi.nlm.nih.gov/nuccore/193211599?report=fasta"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
fasta_data = soup.find_all("div")
for link in soup.find_all("div", {"class": "seqrprt seqviewer"}):
print link.text
# When I try to reach the sequence directly via the spans, the output is empty.
div = soup.find("div", {'id': 'viewercontent1'})
spans = div.find_all('span')
for span in spans:
    print(span.string)
Every scraping job involves two phases:
Understand the page that you want to scrape. (How does it work? Is content loaded via Ajax? Are there redirections? POST? GET? iframes? Anti-scraping measures?)
Emulate the webpage using your favourite framework
Do not write a single line of code before working on point 1. Chrome's network inspector is your friend; use it!
Regarding your webpage, it seems that the report is loaded into a viewer that gets its data from this URL:
https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=193211599&db=nuccore&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
Use that URL and you will get your report.
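A minimal sketch of fetching that endpoint directly (the params dict simply splits up the query string above; the get_text() call is my assumption for stripping the viewer's HTML markup, since the endpoint is called with retmode=html):

import requests
from bs4 import BeautifulSoup

# Parameters copied from the viewer URL found in the network inspector
params = {
    'id': '193211599',
    'db': 'nuccore',
    'report': 'fasta',
    'extrafeat': '0',
    'fmt_mask': '0',
    'retmode': 'html',
    'withmarkup': 'on',
    'tool': 'portal',
    'log$': 'seqview',
    'maxdownloadsize': '1000000',
}
res = requests.get('https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi', params=params)

# Strip the markup to get plain FASTA text
soup = BeautifulSoup(res.text, 'html.parser')
print(soup.get_text())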
