I have a bunch of Google queries for which I would like to get the URL of the first hit.
A piece of my code:
import requests
query = 'hello world'
url = 'http://google.com/search?q=' + query
page = requests.get(url)
print(url)
What I would like to retrieve is the URL of the first Google hit, in this case the Wikipedia page: https://en.wikipedia.org/wiki/%22Hello,_World!%22_program
I have the rest of the code, but I don't know how to retrieve that URL.
You can use select_one to limit the selection to the first match, and the class r to limit matches to search results. Class and type selectors are faster than attribute selectors, which is why I use .r and a.
import requests
from bs4 import BeautifulSoup as bs
query = 'hello world'
url = 'http://google.com/search?q=' + query
page = requests.get(url)
soup = bs(page.content, 'lxml')
print(soup.select_one('.r a')['href'])
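Note that Google's result markup changes over time, and requests without a User-Agent header often get back a different, simplified page, so the .r class may not be present in the HTML you receive; sending a browser-like User-Agent (as a later answer here does) makes this more reliable.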
I would suggest using something like Beautiful Soup to target the HTML elements which contain the URLs of the results. Then, you can store the URLs and do with them as you please.
import requests
from bs4 import BeautifulSoup
query = 'hello world'
url = 'http://google.com/search?q=' + query
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
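Note that find_all('a') also returns Google's own navigation links. As a minimal sketch for keeping only result URLs, assuming Google's historical /url?q= redirect format (which can and does change):
from urllib.parse import parse_qs, urlparse
results = []
for link in soup.find_all('a'):
    href = link.get('href', '')
    # Organic results have historically been wrapped as /url?q=<target>&...
    if href.startswith('/url?q='):
        results.append(parse_qs(urlparse(href).query)['q'][0])
if results:
    print(results[0])  # the first organic hit, if the markup matched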
You can use BeautifulSoup to find the "Web results" label, then take the first link that follows it:
import requests
import bs4
query = 'hello world'
url = 'http://google.com/search?q=' + query
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(page.text, 'html.parser')
for elem in soup(text='Web results'):
    print(elem.find_next('a')['href'])
Output:
https://en.wikipedia.org/wiki/%22Hello,_World!%22_program
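Note that this relies on the literal "Web results" label being present, which depends on Google's page layout and on the interface language of the response.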
I've been bouncing around a ton of similar questions, but nothing that seems to fix the issue... I've set this up (with help) to scrape the HREF tags from a different URL.
I'm now trying to take the href links in the "Result" column from the URL below.
The script doesn't seem to be working like it did for other sites.
The table is an HTML element, but no matter how I tweak my script, I can't retrieve anything except a blank result.
Could someone explain to me why this is the case? I'm watching many YouTube videos trying to understand, but this just doesn't make sense to me.
import requests
from bs4 import BeautifulSoup
profiles = []
urls = [
'https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100'
]
for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    for profile in soup.find_all('a'):
        profile = profile.get('href')
        profiles.append(profile)
print(profiles)
The following code works:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'}
r = requests.get('https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
for x in soup.select('a'):
    print(x.get('href'))
The main issue in that case is that you do not send a user-agent. Some sites, regardless of whether it is a good idea, use this as a basis to decide that you are a bot, and then serve no content, or only limited content.
So at a minimum, provide that information when making your request:
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
Also take a closer look at your selection. Assuming you want to get the team links only, you should adjust it; I used CSS selectors:
for profile in soup.select('table a[href^="/team/"]'):
You also need to concatenate the base URL to the extracted values:
profile = 'https://stats.ncaa.org' + profile.get('href')
Example
from bs4 import BeautifulSoup
import requests
profiles = []
urls = ['https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100']
for url in urls:
    req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(req.text, 'html.parser')
    for profile in soup.select('table a[href^="/team/"]'):
        profile = 'https://stats.ncaa.org' + profile.get('href')
        profiles.append(profile)
print(profiles)
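As a side note, urllib.parse.urljoin from the standard library joins the base URL and an extracted path more robustly than string concatenation, since it copes with missing or duplicate slashes and with hrefs that are already absolute:
from urllib.parse import urljoin
# Drop-in replacement for the concatenation in the loop above.
profile = urljoin('https://stats.ncaa.org', profile.get('href'))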
I'm trying to scrape finviz (https://finviz.com/quote.ashx?t=aapl) for the market cap in the fundamental table, but couldn't for the life of me locate the table or the class with Beautiful Soup. It seems that every time I use soup.find_all() it works for 'div', 'td', 'table', etc., but it returns an empty list when I try to add a class like {'class':'snapshot-td2'}. Does anyone know how I can fix this?
import requests
from bs4 import BeautifulSoup
import bs4
def parseMarketCap():
    response = requests.get("https://finviz.com/quote.ashx?t=aapl")
    soup = bs4.BeautifulSoup(response.content, "lxml")
    table = soup.find_all('td', {'class': 'snapshot-td2'})
    print(table)
I also tried the following for the table, but no luck as well:
table = soup.find_all('table', {'class': "snapshot-table2"})
You need a user-agent header; then you can use :-soup-contains() to target the label td by its text and move to the desired value field:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://finviz.com/quote.ashx?t=aapl', headers = {'User-Agent':'Mozilla/5.0'})
soup = bs(r.content, 'lxml')
print(soup.select_one('td:-soup-contains("Market Cap")').find_next('b').text)
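Note that :-soup-contains() is a soupsieve pseudo-class, so it requires a reasonably recent BeautifulSoup/soupsieve installation; older versions spelled it :contains(), which is now deprecated.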
As QHarr suggests, add a user-agent header to get a proper response:
response = requests.get("https://finviz.com/quote.ashx?t=aapl",headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'})
soup = BeautifulSoup(response.text, "lxml")
content_table = soup.find('table', {'class':'snapshot-table2'})
table_rows = content_table.find_all('tr')
market_cap = table_rows[1].find_all('td')[1].text
print(market_cap)
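Index-based access breaks if finviz reorders the table. As a hedged alternative sketch, you can build a {label: value} dict instead, assuming labels and values alternate across each row (true of the snapshot table's current layout, which may change):
data = {}
for row in table_rows:
    cells = [td.text for td in row.find_all('td')]
    # Pair each even-indexed cell (label) with the following odd-indexed cell (value).
    data.update(zip(cells[::2], cells[1::2]))
print(data.get('Market Cap'))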
Try to use User-Agent for your request like this:
user_agent = {'User-Agent': 'YOUR_USER_AGENT'}
r = requests.get('YOURURL', headers=user_agent)
...
Hey, I'm trying to extract URLs between two tags.
This is what I've got so far:
html_doc = '<div class="b_attribution" u="1|5075|4778623818559697|b0YAhIRjW_h9ERBLSt80gnn9pWk7S76H"><cite>https://www.developpez.net/forums/d1497343/environnements-developpem...</cite><span class="c_tlbxTrg">'
soup = BeautifulSoup(html_doc, "html.parser")
links = []
for links in soup.findAll('cite'):
    print(links.get('cite'))
I have tried different things but I couldn't extract the URL between
<cite>.....</cite>
My updated code:
import requests
from bs4 import BeautifulSoup as bs
dorks = input("Keyword : ")
binglist = "http://www.bing.com/search?q="
with open(dorks, mode="r", encoding="utf-8") as my_file:
    for line in my_file:
        clean = binglist + line
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Cafari/537.36'}
        r = requests.get(clean, headers=headers)
        soup = bs(r.text, 'html.parser')
        links = soup.find('cite')
        print(links)
In the keyword file you just need to put keywords, one per line, like:
test
games
Thanks for your help
You can do it as follows:
html_doc = '<div class="b_attribution" u="1|5075|4778623818559697|b0YAhIRjW_h9ERBLSt80gnn9pWk7S76H"><cite>https://www.developpez.net/forums/d1497343/environnements-developpem...</cite><span class="c_tlbxTrg">'
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.find('cite')
print(link.text)
You can webscrape Bing as follows:
import requests
from bs4 import BeautifulSoup as bs
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Cafari/537.36'}
r = requests.get("https://www.bing.com/search?q=test", headers=headers)
soup = bs(r.text, 'html.parser')
links = soup.find_all('cite')
for link in links:
    print(link.text)
This code does the following:
With requests we get the web page we're looking for. We set headers to avoid being blocked by Bing (for more information, see: https://oxylabs.io/blog/5-key-http-headers-for-web-scraping).
Then we parse the HTML and extract all cite tags with find_all (this returns a list).
For each element in the list, we only want what's inside the cite tag, so we print it with .text.
Please pay attention to the headers!
Try this:
html_doc = '<div class="b_attribution" u="1|5075|4778623818559697|b0YAhIRjW_h9ERBLSt80gnn9pWk7S76H"><cite>https://www.developpez.net/forums/d1497343/environnements-developpem...</cite><span class="c_tlbxTrg">'
soup = BeautifulSoup(html_doc, "html.parser")
links = soup.find_all('cite')
for link in links:
    print(link.text)
You're looking for this to get links from Bing organic results:
# container with needed data: title, link, snippet, etc.
for result in soup.select(".b_algo"):
    link = result.select_one("h2 a")["href"]
Specifically for the example you provided:
from bs4 import BeautifulSoup
html_doc = '<div class="b_attribution" u="1|5075|4778623818559697|b0YAhIRjW_h9ERBLSt80gnn9pWk7S76H"><cite>https://www.developpez.net/forums/d1497343/environnements-developpem...</cite><span class="c_tlbxTrg">'
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.select_one('.b_attribution cite').text
print(link)
# https://www.developpez.net/forums/d1497343/environnements-developpem...
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"
}
params = {
    "q": "lasagna",
    "hl": "en",
}
html = requests.get("https://www.bing.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, "lxml")
for links in soup.select(".b_algo"):
    link = links.select_one("h2 a")["href"]
    print(link)
------------
'''
https://www.allrecipes.com/recipe/23600/worlds-best-lasagna/
https://www.foodnetwork.com/topics/lasagna
https://www.tasteofhome.com/recipes/best-lasagna/
https://www.simplyrecipes.com/recipes/lasagna/
'''
Alternatively, you can achieve the same thing by using the Bing Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with extraction, maintenance, or bypassing blocks; instead, you only need to iterate over structured JSON and get what you want.
Code to integrate to achieve your goal:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "bing",
    "q": "lasagna"
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    link = result['link']
    print(link)
------------
'''
https://www.allrecipes.com/recipe/23600/worlds-best-lasagna/
https://www.foodnetwork.com/topics/lasagna
https://www.tasteofhome.com/recipes/best-lasagna/
https://www.simplyrecipes.com/recipes/lasagna/
'''
Disclaimer, I work for SerpApi.
I'm trying to get the number of search results of a Google search, which looks like this in the HTML if I just save the page from the browser:
<div id="resultStats">About 8,660,000,000 results<nobr> (0.49 seconds) </nobr></div>
But the HTML retrieved by Python looks like a mobile website when I open it in a browser, and it doesn't contain 'resultStats'.
I already tried (1) adding parameters to the URL, like https://www.google.com/search?client=firefox-b-d&q=test, and (2) copying a complete URL from a browser, but it didn't help.
import requests
from bs4 import BeautifulSoup
import re
def google_results(query):
    url = 'https://www.google.com/search?q=' + query
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', id='resultStats')
    return int(''.join(re.findall(r'\d+', div.text.split()[1])))
print(google_results('test'))
Error:
Traceback: line 11, in google_results
return int(''.join(re.findall(r'\d+', div.text.split()[1])))
AttributeError: 'NoneType' object has no attribute 'text'
The solution is to add headers (Thanks, John):
import requests
from bs4 import BeautifulSoup
import re
def google_results(query):
    url = 'https://www.google.com/search?q=' + query
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
    }
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', id='resultStats')
    return int(''.join(re.findall(r'\d+', div.text.split()[1])))
print(google_results('test'))
Output:
9280000000
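One caveat: div.text.split()[1] assumes the count is the second word, as in "About 8,660,000,000 results". For small result sets Google drops the "About", the count becomes the first word, and int('') raises a ValueError. A slightly more robust sketch, assuming the timing always follows in parentheses:
def parse_count(stats_text):
    # Handles both "About 8,660,000,000 results (0.49 seconds)"
    # and "5 results (0.21 seconds)".
    return int(''.join(re.findall(r'\d+', stats_text.split('(')[0])))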
I am trying to scrape a table from the link below. To do that I need to scrape the 'href' links from the page and then scrape the table from each of them. I tried the following code but couldn't find them:
from bs4 import BeautifulSoup
import requests
url = 'http://www.stats.gov.cn/was5/web/search?channelid=288041&andsen=%E6%B5%81%E9%80%9A%E9%A2%86%E5%9F%9F%E9%87%8D%E8%A6%81%E7%94%9F%E4%BA%A7%E8%B5%84%E6%96%99%E5%B8%82%E5%9C%BA%E4%BB%B7%E6%A0%BC%E5%8F%98%E5%8A%A8%E6%83%85%E5%86%B5'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
#table = soup.find("table")
#print(table)
# links = []
# for href in soup.find_all(class_='searchresulttitle'):
#     print(href)
#     links.append(href.find('a').get('href'))
# print(links)
link = soup.find(attrs={"class": "searchresulttitle"})
print(link)
So please guide me on how to find the hrefs and scrape the tables from them.
The URLs are stored in the HTML as variables inside JavaScript. BeautifulSoup can be used to grab all the <script> elements, and then a regular expression can be used to extract the value of urlstr.
Assuming Python 3.6+ is being used (where dictionaries preserve insertion order), a dictionary can be used to create a unique, ordered list of the URLs displayed:
from bs4 import BeautifulSoup
import requests
import re
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
url = 'http://www.stats.gov.cn/was5/web/search?channelid=288041&andsen=%E6%B5%81%E9%80%9A%E9%A2%86%E5%9F%9F%E9%87%8D%E8%A6%81%E7%94%9F%E4%BA%A7%E8%B5%84%E6%96%99%E5%B8%82%E5%9C%BA%E4%BB%B7%E6%A0%BC%E5%8F%98%E5%8A%A8%E6%83%85%E5%86%B5'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
urls = {}  # Use a dictionary to create unique, ordered URLs (assuming Python >= 3.6)
for script in soup.find_all('script'):
    for m in re.findall(r"var urlstr = '(.*?)';", script.text):
        urls[m] = None
urls = list(urls.keys())
print(urls)
This would display URLs starting as:
['http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html',
'http://www.stats.gov.cn/tjsj/zxfb/201810/t20181024_1629464.html',
'http://www.stats.gov.cn/tjsj/zxfb/201810/t20181015_1627579.html',
...]
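Since the question also asks about scraping the table from each link, here is a hedged follow-up sketch; whether a <table> exists, and what its layout is, depends on each individual article page:
for u in urls[:3]:
    article = requests.get(u, headers=headers)
    article_soup = BeautifulSoup(article.content, 'html.parser')
    table = article_soup.find('table')
    if table is not None:
        rows = table.find_all('tr')
        if rows:
            # Print the first row's cells as a quick sanity check.
            print(u, [cell.get_text(strip=True) for cell in rows[0].find_all(['td', 'th'])])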