I have an issue using Python and BeautifulSoup to extract URLs from the Bing search engine. I want to extract the content within <div class="b_title"> tags, but when I run this code, the urls variable is empty:
import requests, re
from bs4 import BeautifulSoup
payload = { 'q' : 'sport', 'first' : '11' }
headers = { 'User-agent' : 'Mozilla/11.0' }
req = requests.get( 'https://www.bing.com/search', payload, headers=headers )
soup = BeautifulSoup( req.text, 'html.parser' )
urls = soup.find_all('div', class_="b_title")
print(urls)
You need to go two elements up and select the li element by its class (it worked for me), or you can use SelectorGadget to grab CSS selectors and pass them to the select() or select_one() methods.
Code and full example:
from bs4 import BeautifulSoup
import requests
import lxml

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get(
    "https://www.bing.com/search?form=QBRE&q=lasagna",
    headers=headers).text

soup = BeautifulSoup(response, 'lxml')

for container in soup.select('.b_algo h2 a'):
    links = container['href']
    print(links)
Output:
https://www.allrecipes.com/recipe/23600/worlds-best-lasagna/
https://www.tasteofhome.com/recipes/best-lasagna/
https://www.foodnetwork.com/topics/lasagna
https://www.allrecipes.com/recipes/502/main-dish/pasta/lasagna/
https://www.simplyrecipes.com/recipes/lasagna/
https://www.delish.com/cooking/recipe-ideas/recipes/a51337/classic-lasagna-recipe/
https://www.marthastewart.com/343399/lasagna
https://www.thepioneerwoman.com/food-cooking/recipes/a11728/best-lasagna-recipe/
https://therecipecritic.com/lasagna-recipe/
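If you also want the 'first' parameter from the original question (Bing's result offset, used for pagination), here is a minimal sketch, assuming the same .b_algo h2 a selector as above still matches, that passes it through requests params:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"}

# q is the search query; first is the 1-based offset of the first result (11 = second page of 10 results)
params = {"q": "sport", "first": "11"}

response = requests.get("https://www.bing.com/search", params=params, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select(".b_algo h2 a"):
    print(link["href"])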
Alternatively, you can use Bing Search Engine Results API from SerpApi. It's a paid API with a free plan.
Part of JSON:
"organic_results": [
{
"position": 1,
"title": "World's Best Lasagna | Allrecipes",
"link": "https://www.allrecipes.com/recipe/23600/worlds-best-lasagna/",
"displayed_link": "https://www.allrecipes.com/recipe/23600",
"sitelinks": {
"inline": [
{
"title": "Play Video",
"link": "https://www.allrecipes.com/recipe/23600/worlds-best-lasagna/"
}
]
}
}
]
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "q": "lasagna",
    "engine": "bing",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for link in results["organic_results"]:
    print(f"Link: {link['link']}")
Output:
https://www.allrecipes.com/recipe/23600/worlds-best-lasagna/
https://www.tasteofhome.com/recipes/best-lasagna/
https://www.foodnetwork.com/topics/lasagna
https://www.simplyrecipes.com/recipes/lasagna/
https://www.delish.com/cooking/recipe-ideas/recipes/a51337/classic-lasagna-recipe/
https://www.marthastewart.com/343399/lasagna
https://www.thepioneerwoman.com/food-cooking/recipes/a11728/best-lasagna-recipe/
Disclaimer: I work for SerpApi.
I am trying to pull the data that follows 'series: ', as shown below.
}
},
series: [{ name: '', showInLegend: false, animation: false, color: '#c84329', lineWidth: 2, data: [[1640926800000,164243],[1638248400000,224192],[1635566400000,143606],[1632974400000,208461],[1630382400000,85036],[1627704000000,25604],[1625025600000,44012],[1622433600000,111099],[1619755200000,53928],[1617163200000,12286],[1614488400000,12622],[1612069200000,4519],[1609390800000,12665],[1606712400000,314],[1604116800000,3032],[1601438400000,4164],[1598846400000,3302],[1596168000000,22133],[1593489600000,8098],[1590897600000,-1385],[1588219200000,43165],[1585627200000,427],[1582952400000,175],[1580446800000,174],[1577768400000,116],[1575090000000,196],[1572494400000,215],[1569816000000,418],[1567224000000,375],[1564545600000,375],[1561867200000,179],[1559275200000,132],[1556596800000,146],[1554004800000,163],[1551330000000,3],[1548910800000,49],[1546232400000,-29],[1543381200000,108],[1540958400000,35],[1538280000000,159],[1535688000000,287],[1533009600000,1152],[1530331200000,1306]] }],
navigation: { menuItemStyle: { fontSize: '9px' } }
});
More specifically, I'm trying to pull the data, which is a list of Unix timestamps and ints. This is what I have so far...
url = "https://socialblade.com/twitter/user/twitter"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)
soup = bs(response.read(), 'html.parser')
soup = soup.find_all('script', {"type": "text/javascript"})
script = soup[6].text
Any thoughts?
The script is a string, so we can use the re module to find all occurrences of "data" in it. Every data block in the script ends with "}", so we can locate the first "}" after each occurrence. Using the index where the "data" substring starts and the index of that first "}", we can slice the string to pull out the data. You can see the code given below.
import re

sub = "data"
res = re.finditer(sub, script)

for i in res:
    k = script.find("}", i.start())
    print(script[i.start():k])
Complete script for your required data:
import requests
from bs4 import BeautifulSoup

url = "https://socialblade.com/twitter/user/twitter"

s = requests.Session()
r = requests.get(
    url,
    headers={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
    },
)

soup = BeautifulSoup(r.text, "html.parser")
req = soup.find_all("script", {"type": "text/javascript"})
script = req[6].contents[0]

data = script[2447:3873]
print(data)
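The hard-coded slice script[2447:3873] is brittle: if the page layout shifts, the indices move. As a rough sketch, assuming the page still embeds a literal "series: [{ ... data: [[...]] }]" block (an assumption about markup that may change), you can pull the array out with a regular expression and parse it with json instead:

import json
import re

import requests
from bs4 import BeautifulSoup

url = "https://socialblade.com/twitter/user/twitter"
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"}

soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
scripts = soup.find_all("script", {"type": "text/javascript"})

# join all inline scripts so nothing depends on the script's position (index 6 above)
source = "\n".join(sc.get_text() for sc in scripts)

# grab the data: [[...]] array inside the series: block; the pattern mirrors the
# source excerpt shown in the question and may need adjusting if the page changes
match = re.search(r"series:\s*\[\{.*?data:\s*(\[\[.*?\]\])", source, re.DOTALL)
if match:
    points = json.loads(match.group(1))  # list of [timestamp_ms, value] pairs
    print(points[:3])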
I want to get the Verify href from the Gmailnator inbox; the page contains the Discord verify href that I'm after.
I want to get that href using bs4 and then pass it to a Selenium driver, like driver.get(url), the url being the href of course.
Can someone show me code to scrape the href from the Gmailnator inbox? I did try the page source, but the page source does not contain the href.
This is the code I have written to get the href, but the href I need (the Discord one) is in a frame source, so I think that's why it doesn't come up.
UPDATE: Everything is done and fixed.
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes a Chrome driver is available on PATH

driver.get('https://www.gmailnator.com/inbox/#for.ev.e.r.my.girlt.m.p#gmail.com')
time.sleep(6)
driver.find_element_by_xpath('//*[@id="mailList"]/tbody/tr[2]/td/a/table/tbody/tr/td[1]').click()
time.sleep(4)

url = driver.current_url
email_for_data = driver.current_url.split('/')[-3]
print(url)
time.sleep(2)

print('Getting Your Discord Verify link')
print('Time To Get Your Discord Link')

soup = BeautifulSoup(requests.get(url).text, "lxml")
token = soup.find("meta", {"name": "csrf-token"})["content"]
cf_email = soup.find("a", class_="__cf_email__")["data-cfemail"]

endpoint = "https://www.gmailnator.com/mailbox/get_single_message/"

data = {
    "csrf_gmailnator_token": token,
    "action": "get_message",
    "message_id": url.split("#")[-1],
    "email": f"{email_for_data}",
}

headers = {
    "referer": f"https://www.gmailnator.com/{email_for_data}/messageid/",
    "cookie": f"csrf_gmailnator_cookie={token}; ci_session={cf_email}",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.86 "
                  "YaBrowser/21.3.0.740 Yowser/2.5 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}

r = requests.post(endpoint, data=data, headers=headers)
the_real_slim_shady = (
    BeautifulSoup(r.json()["content"], "lxml")
    .find_all("a", {"target": "_blank"})[1]["href"]
)
print(the_real_slim_shady)
You can fake it all with pure requests to get the Verify link. First, you need to get the token and the cf_email values. Then, things are pretty straightforward.
Here's how to get the link:
import requests
from bs4 import BeautifulSoup

url = "https://www.gmailnator.com/geralddoreyestmp/messageid/#179b454b4c482c4d"

soup = BeautifulSoup(requests.get(url).text, "lxml")
token = soup.find("meta", {"name": "csrf-token"})["content"]
cf_email = soup.find("a", class_="__cf_email__")["data-cfemail"]

endpoint = "https://www.gmailnator.com/mailbox/get_single_message/"

data = {
    "csrf_gmailnator_token": token,
    "action": "get_message",
    "message_id": url.split("#")[-1],
    "email": "geralddoreyestmp",
}

headers = {
    "referer": "https://www.gmailnator.com/geralddoreyestmp/messageid/",
    "cookie": f"csrf_gmailnator_cookie={token}; ci_session={cf_email}",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.86 "
                  "YaBrowser/21.3.0.740 Yowser/2.5 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}

r = requests.post(endpoint, data=data, headers=headers)
the_real_slim_shady = (
    BeautifulSoup(r.json()["content"], "lxml")
    .find_all("a", {"target": "_blank"})[1]["href"]
)
print(the_real_slim_shady)
Output (your link will be different!):
https://click.discord.com/ls/click?upn=qDOo8cnwIoKzt0aLL1cBeARJoBrGSa2vu41A5vK-2B4us-3D77CR_3Tswyie9C2vHlXKXm6tJrQwhGg-2FvQ76GD2o0Zl2plCYHULNsKdCuB6s-2BHk1oNirSuR8goxCccVgwsQHdq1YYeGQki4wtPdDA3zi661IJL7H0cOYMH0IJ0t3sgrvr2oMX-2BJBA-2BWZzY42AwgjdQ-2BMAN9Y5ctocPNK-2FUQLxf6HQusMayIeATMiTO-2BlpDytu-2FnIW4axB32RYQpxPGO-2BeHtcSj7a7QeZmqK-2B-2FYkKA4dl5q8I-3D
I'm trying to get the names of all games on this website: "https://slotcatalog.com/en/The-Best-Slots#anchorFltrList". To do so, I'm using the following code:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = "https://slotcatalog.com/en/The-Best-Slots#anchorFltrList"
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
data = []
table = soup.find_all('div', attrs={'class':'providerCard'})
for game in range(0, len(table) - 1):
    print(table[game].find('a')['title'])
and I get what I want.
I would like to replicate the same across all pages available on the website, but since the URL does not change, I looked at the network (XHR) events that happen when clicking on a different page, and I tried to send a request using the following code:
for page_no in range(1, 100):
    data = {
        "blck": "fltrGamesBlk",
        "ajax": "1",
        "lang": "end",
        "p": str(page_no),
        "translit": "The-Best-Slots",
        "tag": "TOP",
        "dt1": "",
        "dt2": "",
        "sorting": "SRANK",
        "cISO": "GB",
        "dt_period": "",
        "rtp_1": "50.00",
        "rtp_2": "100.00",
        "max_exp_1": "2.00",
        "max_exp_2": "250000.00",
        "min_bet_1": "0.01",
        "min_bet_2": "5.00",
        "max_bet_1": "3.00",
        "max_bet_2": "10000.00"
    }
    page = requests.post('https://slotcatalog.com/index.php',
                         data=data,
                         headers={'Host': 'slotcatalog.com',
                                  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0'})
    soup = BeautifulSoup(page.content, 'html.parser')
    for row in soup.find_all('div', attrs={'class': 'providerCard'}):
        name = row.find('a')['title']
        print(name)
result : ("KeyError: 'title'") - meaning that its not finding the class "providerCard".
Has the request to the website been done in the wrong way? If so, where should i change the code?
thanks in advance
Alright, so, you had a typo. It was "lang": "end" in the payload; it should have been "lang": "en", among other things.
Anyhow, I've cleaned your code up a bit and it works as expected. You can keep looping for all the games, if you want.
import requests
from bs4 import BeautifulSoup

headers = {
    "referer": "https://slotcatalog.com/en/The-Best-Slots",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/50.0.2661.102 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}

payload = {
    "blck": "fltrGamesBlk",
    "ajax": "1",
    "lang": "en",
    "p": 1,
    "translit": "The-Best-Slots",
    "tag": "TOP",
    "dt1": "",
    "dt2": "",
    "sorting": "SRANK",
    "cISO": "EN",
    "dt_period": "",
    "rtp_1": "50.00",
    "rtp_2": "100.00",
    "max_exp_1": "2.00",
    "max_exp_2": "250000.00",
    "min_bet_1": "0.01",
    "min_bet_2": "5.00",
    "max_bet_1": "3.00",
    "max_bet_2": "10000.00"
}

page = requests.post(
    "https://slotcatalog.com/index.php",
    data=payload,
    headers=headers,
)

soup = BeautifulSoup(page.content, "html.parser")
print([i.get("title") for i in soup.find_all("a", {"class": "providerName"})])
Output (for page 1 only):
['Starburst', 'Bonanza', 'Rainbow Riches', 'Book of Dead', "Fishin' Frenzy", 'Wolf Gold', 'Twin Spin', 'Slingo Rainbow Riches', "Gonzo's Quest", "Gonzo's Quest Megaways", 'Eye of Horus (Reel Time Gaming)', 'Age of the Gods God of Storms', 'Lightning Roulette', 'Buffalo Blitz', "Fishin' Frenzy Megaways", 'Fluffy Favourites', 'Blue Wizard', 'Legacy of Dead', '9 Pots of Gold', 'Buffalo Blitz II', 'Cleopatra (IGT)', 'Quantum Roulette', 'Reel King Mega', 'Mega Moolah', '7s Deluxe', "Rainbow Riches Pick'n'Mix", "Shaman's Dream"]
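If you do want to loop over every page, here is a minimal sketch. It assumes that reusing the payload above with an increasing "p" value keeps working and that a page returning no providerName links means you have run past the last page; both are assumptions about the site's behaviour. The payload is trimmed to the essentials here, so add back the filter fields from the full payload above if the endpoint requires them.

import requests
from bs4 import BeautifulSoup

headers = {
    "referer": "https://slotcatalog.com/en/The-Best-Slots",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/50.0.2661.102 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}

all_games = []
page_no = 1

while True:
    payload = {
        "blck": "fltrGamesBlk",
        "ajax": "1",
        "lang": "en",
        "p": page_no,
        "translit": "The-Best-Slots",
        "tag": "TOP",
        "sorting": "SRANK",
        "cISO": "EN",
    }
    page = requests.post("https://slotcatalog.com/index.php", data=payload, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    titles = [a.get("title") for a in soup.find_all("a", {"class": "providerName"})]

    # an empty page is taken to mean there are no more results (assumption)
    if not titles:
        break

    all_games.extend(titles)
    page_no += 1

print(len(all_games), "games collected")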
I'm looking to capture a word corrected by Google Search. I tried the code below, generating a search with the misspelled word "pithon", but I couldn't get the word "python" suggested under "Including results for:". Here is the code snippet and the part of the page source where I want to get the word:
q="pithon"
q = str(str.lower(q)).strip()
url = "http://www.google.com/search?q=" + urllib.parse.quote(q)
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
ans = soup.findAll('a', attrs={'class' : 'spell'})
print(ans)
<p class="gqLncc card-section" aria-level="3" role="heading">
<span class="gL9Hy d2IKib">Including results for:</span>
<a class="gL9Hy" href="/search?client=firefox-b-d&channel=crow2&sxsrf=ALeKk03QpAwp78UNkWeDgzZYYpT73zlopg:1592965054382&q=python&spell=1&sa=X&ved=2ahUKEwiFytKhsZnqAhWVIbkGHUPEAsAQBSgAegQIDRAq">
<b>
<i>python</i>
</b>
</a>
</p>
To get the correct result page, specify a User-Agent string. Also, add the parameter hl=en to get the English version of the page:
import requests
from bs4 import BeautifulSoup
q="pithon"
parameters = {'q': q, 'hl': 'en'}
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
url = "http://www.google.com/search"
html = requests.get(url, params=parameters, headers=headers).text
soup = BeautifulSoup(html, "html.parser")
did_you_mean = soup.select_one('span:contains("Did you mean:")')
if did_you_mean:
    print(did_you_mean.find_next('i').text)
Prints:
python
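One caveat: newer versions of soupsieve (the CSS engine behind select_one) deprecate the :contains() pseudo-class in favour of :-soup-contains(). A self-contained toy example on made-up HTML showing the equivalent call:

from bs4 import BeautifulSoup

# made-up fragment standing in for Google's "Did you mean:" block
html = '<div><span>Did you mean: <a href="/search?q=python"><i>python</i></a></span></div>'
soup = BeautifulSoup(html, "html.parser")

did_you_mean = soup.select_one('span:-soup-contains("Did you mean:")')
if did_you_mean:
    print(did_you_mean.find_next('i').text)  # -> python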
A complementary answer to Andrej Kesely's: scraping the corrected word and its link using CSS selectors.
Note that Andrej said his solution works for English pages only. The solution below works for other languages as well.
Code:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    'q': 'londebn',
    'hl': 'en',
    'gl': 'us',
}

html = requests.get('https://www.google.com/search?q=', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

correct_me = soup.select_one('a.gL9Hy').text
correct_me_link = soup.select_one('a.gL9Hy')['href']

print(f'Corrected word: {correct_me}')
print(f'https://www.google.com{correct_me_link}')
# output:
'''
Corrected word: london
https://www.google.com/search?hl=en&gl=us&q=london&spell=1&sa=X&ved=2ahUKEwigm4Cgqq3xAhUNXc0KHUNpBOgQBSgAegQIARAw
'''
Alternatively, you can use Google Search Engine Results API from SerpApi to get corrected word. It's a paid API with a free trial of 5,000 searches. Check out the Playground.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "londebn",
    "google_domain": "google.com",
    "gl": "us",
    "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

corrected_word = results['search_information']['spelling_fix']
print(corrected_word)
# output:
'''
london
'''
Disclaimer: I work for SerpApi.
I'm trying to build a simple script to scrape Google's first Search Results Page and export the results in .csv.
I managed to get URLs and Titles, but I cannot retrieve Descriptions.
I have been using the following code:
import urllib
import requests
from bs4 import BeautifulSoup

# desktop user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
# mobile user-agent
MOBILE_USER_AGENT = "Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36"

query = "pizza recipe"
query = query.replace(' ', '+')
URL = f"https://google.com/search?q={query}"

headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)

if resp.status_code == 200:
    soup = BeautifulSoup(resp.content, "html.parser")
    results = []
    for g in soup.find_all('div', class_='r'):
        anchors = g.find_all('a')
        if anchors:
            link = anchors[0]['href']
            title = g.find('h3').text
            desc = g.select('span')
            description = g.find('span', {'class': 'st'}).text
            item = {
                "title": title,
                "link": link,
                "description": description
            }
            results.append(item)

import pandas as pd
df = pd.DataFrame(results)
df.to_excel("Export.xlsx")
I get the following message when I run the code:
description = g.find('span',{'class':'st'}).text
AttributeError: 'NoneType' object has no attribute 'text'
Essentially, the field is empty. Can somebody please help me with this line so that I can get all the information from the snippet?
It's not within the div class="r"; it's under div class="s".
So change to this for description:
description = g.find_next_sibling("div", class_='s').find('span',{'class':'st'}).text
From the current element, it'll find the next sibling div with class="s". Then you can pull out the <span> tag.
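To see what find_next_sibling() does here, a self-contained toy example on a made-up fragment shaped like Google's old markup (a div class="r" followed by a sibling div class="s"; the real class names may have changed since):

from bs4 import BeautifulSoup

# made-up fragment mimicking the old Google result markup
html = """
<div class="g">
  <div class="r"><a href="https://example.com"><h3>Example title</h3></a></div>
  <div class="s"><span class="st">Example description snippet.</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
g = soup.find("div", class_="r")

# jump from the result block to the sibling block that holds the snippet
description = g.find_next_sibling("div", class_="s").find("span", {"class": "st"}).text
print(description)  # -> Example description snippet.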
Try to use the select_one() or select() bs4 methods. They're more flexible and easier to read. See the CSS selectors reference.
Also, you can pass URL params, since requests does everything for you, like so:
# instead of this:
query = "pizza recipe"
query = query.replace(' ', '+')
URL = f"https://google.com/search?q={query}"

# try to use this:
params = {
    'q': 'fus ro dah',  # query
    'hl': 'en'
}
requests.get('URL', params=params)
If you want to write to .csv, then you need to use .to_csv() rather than .to_excel().
If you want to get rid of the pandas index column, you can pass index=False, e.g. df.to_csv('FILE_NAME', index=False).
Code and example in the online IDE:
import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {
    "User-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    'q': 'fus ro dah',  # query
    'hl': 'en'
}

resp = requests.get("https://google.com/search", headers=headers, params=params)

if resp.status_code == 200:
    soup = BeautifulSoup(resp.text, "html.parser")
    results = []
    for result in soup.select('.tF2Cxc'):
        title = result.select_one('.DKV0Md').text
        link = result.select_one('.yuRUbf a')['href']
        snippet = result.select_one('#rso .lyLwlc').text
        item = {
            "title": title,
            "link": link,
            "description": snippet
        }
        results.append(item)

    df = pd.DataFrame(results)
    df.to_csv("BS4_Export.csv", index=False)
Alternatively, you can do the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't need to figure out which selectors to use or why they don't work even though they should, since that's already done for the end user.
Code to integrate:
from serpapi import GoogleSearch
import os
import pandas as pd

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "fus ro dah",
    "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

data = []

for result in results['organic_results']:
    title = result['title']
    link = result['link']
    snippet = result['snippet']

    data.append({
        "title": title,
        "link": link,
        "snippet": snippet
    })

df = pd.DataFrame(data)
df.to_csv("SerpApi_Export.csv", index=False)
P.S. I wrote a more detailed blog post about how to scrape Google Organic Results.
Disclaimer: I work for SerpApi.