I'm trying to write a program that downloads the most upvoted picture from a subreddit, but for some reason BeautifulSoup does not find all the links on the page. I know I could try other methods, but I'm curious why it isn't finding every link each time.
Here is the code as well.
from PIL import Image
import requests
from bs4 import BeautifulSoup
url = 'https://www.reddit.com/r/wallpaper/top/'
result = requests.get(url)
soup = BeautifulSoup(result.text,'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
The site is loaded via JavaScript, which bs4 cannot render. Therefore, I've located the data within a script tag instead.
import requests
import re
import json
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}
def main(url):
    r = requests.get(url, headers=headers)
    # The page embeds its state as a JSON object assigned to window.___r;
    # grab everything between the first '{' and the closing '}'.
    match = re.search(r"window\.___r = ({.+})", r.text).group(1)
    data = json.loads(match)
    # print(data.keys())
    # humanreadable = json.dumps(data, indent=4)
main("https://www.reddit.com/r/wallpaper/top/")
Shorter version:
match = re.finditer(r'permalink":"(.+?)"', r.text)
for item in match:
    print(item.group(1))
Output:
https://www.reddit.com/r/wallpaper/comments/fv9ubr/khyber_pakhtunkhwa_pakistan_balakot_1920x1024/
https://www.reddit.com/user/wsopgame/comments/fvbxom/join_the_official_wsop_online_poker_game_and/
https://www.reddit.com/user/wsopgame/comments/fvbxom/join_the_official_wsop_online_poker_game_and/?instanceId=t3_p%3DgAAAAABeiiTtw4FM0zBerf9DDiq5tmonjJbAwzQb_UwA-VHlw2J8zUxw-y6Doa6j-jPP0qt05lRZfyReQwnLH9pN6wdSBBvqhgxgRS3uKyKCRvkk6WNwns5wpad0ijMgHwqVnZSGMT0KWP4WB15zBNkb3j96ifm23pT4uACb6cpNVh-TE05GiTtDnD9UUMir02Z7hOr0x4f_wLJEIplafXRp2yiAFPh5VzH_4VSsPx9zV7v3IJwN5ctYLfIcdCW5Z3W-z3bbOVUCU2HqqRAoh0XEj0LrgdicMexa9fzPbtWOshfx3kIazwFhYXoSowPBRZUquSs9zEaQwP1B-wg951edNb7RSjYTrDpQ75zsMfIkasKvAOH-V58%3D
https://www.reddit.com/r/wallpaper/comments/fv6wew/lone_road_in_nowhere_arizona_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fvaqaa/the_hobbit_house_1920_x_1080/
https://www.reddit.com/r/wallpaper/comments/fvcs4j/something_i_made_in_illustrator_5120_2880/
https://www.reddit.com/r/wallpaper/comments/fv09u2/bath_time_in_rocky_mountain_national_park_1280x720/
https://www.reddit.com/r/wallpaper/comments/fuyomz/up_is_still_my_favorite_film_grandpa_carl_cams/
https://www.reddit.com/r/wallpaper/comments/fvagex/beautiful_and_colorful_nature_wallpaper_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fv3nnn/maroon_bells_co_photo_credit_to/
https://www.reddit.com/r/wallpaper/comments/fuyg0z/volcano_lightening_19201080/
https://www.reddit.com/r/wallpaper/comments/fvgohk/doctor_strange1920x1080/
https://www.reddit.com/user/redditads/comments/ezogdp/reach_your_audience_on_reddit/
https://www.reddit.com/user/redditads/comments/ezogdp/reach_your_audience_on_reddit/?instanceId=t3_p%3DgAAAAABeiiTt9isPY03zwoimtzcC7w3uLzUDCuoD5cU6ekeEYt48cRAqoMsc1ZDBJ6OeK1U3Bs2Zo1ZSWzdQ4DOux21vGvWzJkxNWQ14XzDWag_GlrE-t_4rpFA_73kW94xGUQchsXL7f4VkbbHIyn8SMlUlTtt3j3lJCViwINOQgIF3p5N8Q4ri-swtJC-JyEUYa4dJazlZ9xLYyOHSvMkiR3k9lDx0NEKqpqfbQ9__f3xLUzgS4yF4OngMDFUVFa5nyH3I32mkP3KezXLxOR6H8CSGI_jqRA4dBV-AnHLuzPlgENRpfaMhWJ04vTEOjmG4sm4xs65OZCumqNstzlDEvR7ryFwL6LeH02a9E3czck5jfKY7HXQ%3D
https://www.reddit.com/r/wallpaper/comments/fuzjza/ghost_cloud_1280x720/
https://www.reddit.com/r/wallpaper/comments/fvg88o/park_autumn_tress_wallpaper_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fv47r8/audi_quattro_s1_3840x2160_fh4/
https://www.reddit.com/r/wallpaper/comments/fuybjs/spacecrafts_1920_x_1080/
https://www.reddit.com/r/wallpaper/comments/fv043i/dragonfly_1280x720/
https://www.reddit.com/r/wallpaper/comments/fv06ud/muskrat_swim_1280x720/
https://www.reddit.com/r/wallpaper/comments/fvdafk/natural_beauty_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fvbnuc/cigar_man_19201080/
https://www.reddit.com/r/wallpaper/comments/fvcww4/thunder_road_3840_x_2160/
https://www.reddit.com/user/redditads/comments/7w17su/interested_in_gaining_a_new_perspective_on_things/
https://www.reddit.com/user/redditads/comments/7w17su/interested_in_gaining_a_new_perspective_on_things/?instanceId=t3_p%3DgAAAAABeiiTtxVzGp9KwvtRNa1pOVCgz2IBkTGRxqdyXk4WTsjAkWS9wzyDVF_1aSOz36HqHOVrngfj3z_9O1cAkzz-0fwhxyJ_8jePT3F88mrveLChf_YRIbAtxb-Ln_OaeeXUnyrFVl-OPN7cqXvtgh3LoymBx3doL-bEVnECOWkcSXvUIwpMn-flVZ5uNcGL1nKEiszUcORqq1oQ32BnrmWHomrDb3Q%3D%3D
https://www.reddit.com/r/wallpaper/comments/fv3xqs/social_distancing_log_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fvbcpl/neon_city_wallpaper_19201080/
https://www.reddit.com/r/wallpaper/comments/fvbhdb/sunrise_wallpaper_19201080/
https://www.reddit.com/r/wallpaper/comments/fv2eno/second_heavy_bike_in_ghost_recon_breakpoint/
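To get from those permalinks back to the original goal of downloading the most upvoted picture, here is a minimal sketch. It assumes direct image links appear as i.redd.it URLs in the served HTML and that the first match on the "top" listing belongs to the top post; both assumptions may need checking against the live page.
import re
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

r = requests.get('https://www.reddit.com/r/wallpaper/top/', headers=headers)

# Direct image links typically point at i.redd.it (assumption: the first
# match on the "top" listing is the most upvoted post).
image_urls = re.findall(r'https://i\.redd\.it/\w+\.(?:jpg|png)', r.text)

if image_urls:
    img = requests.get(image_urls[0], headers=headers)
    with open('top_wallpaper.jpg', 'wb') as f:
        f.write(img.content)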
https://imgur.com/a/JcTnbiw
How do I retrieve the highlighted text with BeautifulSoup?
An example would be the best answer, thank you ;)
Edit: here's the code
import requests
import pyperclip
from bs4 import BeautifulSoup
import time
url = 'https://sales.elhst.co/'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36"}
site = requests.get(url, headers=headers)
if site.status_code == 200:
    print("Site is up..")
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
time.sleep(2)
target = soup.find("pp", id="copies")
print(target)
and the output is:
Site is up..
<pp id="copies"></pp>
and I want to get this text:
https://imgur.com/a/JcTnbiw
Is there any way to do it?
The data you see on the page is loaded from an external URL. You can try this script to print the number of copies:
import re
import json
import requests
url = 'https://sales.elhst.co/socket.io/?EIO=3&transport=polling'
copies_url = 'https://sales.elhst.co/socket.io/?EIO=3&transport=polling&sid={sid}'
# The first request performs the socket.io handshake and returns a session id.
r = requests.get(url).text
sid = json.loads(re.search(r'(\{".*)', r).group(1))['sid']

# Polling again with that sid returns an event array whose last element
# is the number of copies.
r = requests.get(copies_url.format(sid=sid)).text
copies = json.loads(re.search(r'(\[".*)', r).group(1))[-1]
print(copies)
Prints:
0
from lxml import html
import requests
page = requests.get('http://url')
tree = html.fromstring(page.content)
# This will extract the text you need
buyers = tree.xpath('//pp[@id="copies"]/text()')
It should work, but pp is not a standard HTML tag; I think it's a mistake and the tag should be <p>.
More info about lxml here.
I want to scrape emails from the results of a search query, but when I access the class with a CSS selector via select and print it, it always shows an empty list. How can I access the .r class or class="g"?
import requests
from bs4 import BeautifulSoup
url = "https://www.google.com/search?sxsrf=ACYBGNQA4leQETe0psVZPu7daLWbdsc9Ow%3A1579194494737&ei=fpggXpvRLMakwQKkqpSICg&q=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&oq=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&gs_l=psy-ab.12...0.0..7407...0.0..0.0.0.......0......gws-wiz.82okhpdJLYg&ved=0ahUKEwibiI_3zYjnAhVGUlAKHSQVBaEQ4dUDCAs"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
test = soup.select('.r')
print(test)
Your program is correct, but to get the correct answer from Google, you need to specify a User-Agent header:
import requests
from bs4 import BeautifulSoup
url = "https://www.google.com/search?sxsrf=ACYBGNQA4leQETe0psVZPu7daLWbdsc9Ow%3A1579194494737&ei=fpggXpvRLMakwQKkqpSICg&q=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&oq=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&gs_l=psy-ab.12...0.0..7407...0.0..0.0.0.......0......gws-wiz.82okhpdJLYg&ved=0ahUKEwibiI_3zYjnAhVGUlAKHSQVBaEQ4dUDCAs"
headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0'}
response = requests.get(url, headers=headers) # <-- specify custom header
soup = BeautifulSoup(response.text, "html.parser")
test = soup.select('.r')
print(test)
Prints:
[<div class="r"><a href="https://www.yahoo.com/news/11-course-complete-computer-science-171322233.html" onmousedown="return rwt(this,'','','','1','AOvVaw2wM4TUxc_4V7s9GjeWTNAG','','2ahUKEwjt17Kk-YjnAhW2R0EAHcnsC3QQFjAAegQIAxAB','','',event)"><div class="TbwUpd"><img alt="https://...
...
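From there, one might pull the result links out of each .r block; a rough sketch (Google's markup changes frequently, so the selectors are not guaranteed to be stable):
for r_div in test:
    link = r_div.find('a')
    if link and link.get('href'):
        print(link['href'])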
To get the emails out of the Google Search results you need to use a regex:
# this regex may need modifications
re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', variable_where_to_search_from)
Code:
from bs4 import BeautifulSoup
import requests, lxml, re
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q="computer science ""usa" "@yahoo.com"', headers=headers)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    try:
        snippet = result.select_one('.lyLwlc').text
    except AttributeError:
        snippet = None

    match_email = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', str(snippet))
    email = '\n'.join(match_email).strip()
    print(email)
----------
'''
ahmed_733@yahoo.com
yjzou@uguam.uog
yzou2002@yahoo.com
...
'''
Alternatively, you can do the same thing using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
It doesn't extract emails using regex, although that would be a great possible feature. The main difference is that it's much easier and faster to get things done rather than building everything from scratch.
Code to integrate:
from serpapi import GoogleSearch
import re
params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": '"computer science ""usa" "@yahoo.com"',
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    try:
        snippet = result['snippet']
    except KeyError:
        snippet = None

    match_email = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', str(snippet))
    email = '\n'.join(match_email).strip()
    print(email)
---------
'''
shaikotweb@yahoo.com
ahmed_733@yahoo.com
RPeterson@L1id.com
rj_peterson@yahoo.com
'''
Disclaimer: I work for SerpApi.
I'm trying to get the number of search results of a Google search, which looks like this in the HTML if I just save it from the browser:
<div id="resultStats">About 8,660,000,000 results<nobr> (0.49 seconds) </nobr></div>
But the HTML retrieved by Python looks like a mobile website when I open it in a browser, and it doesn't contain 'resultStats'.
I already tried (1) adding parameters to the URL like https://www.google.com/search?client=firefox-b-d&q=test and (2) copying a complete URL from a browser, but it didn't help.
import requests
from bs4 import BeautifulSoup
import re
def google_results(query):
    url = 'https://www.google.com/search?q=' + query
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', id='resultStats')
    return int(''.join(re.findall(r'\d+', div.text.split()[1])))
print(google_results('test'))
Error:
Traceback: line 11, in google_results
return int(''.join(re.findall(r'\d+', div.text.split()[1])))
AttributeError: 'NoneType' object has no attribute 'text'
The solution is to add headers (Thanks, John):
import requests
from bs4 import BeautifulSoup
import re
def google_results(query):
    url = 'https://www.google.com/search?q=' + query
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
    }
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', id='resultStats')
    return int(''.join(re.findall(r'\d+', div.text.split()[1])))
print(google_results('test'))
Output:
9280000000
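As a side note, letting requests build the query string avoids problems with spaces and special characters in query; a small variation on the same call inside the function:
html = requests.get('https://www.google.com/search',
                    params={'q': query}, headers=headers).text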
I am scraping artists from discogs.com. I am unable to get the artist names as they appear on the page. E.g. artist Andrés appears as Andr\xe9s when I run my code.
Can anyone explain what I'm doing wrong?
from bs4 import BeautifulSoup
import requests
import urllib2
from itertools import chain
import codecs
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0' }
all_artists = []
result_pages = 1 #446
def load_artists():
    for page in xrange(1, result_pages+1):
        url = 'https://www.discogs.com/search/?sort=have%2Cdesc&style_exact=House&genre_exact=Electronic&decade=2010&page=' + str(page)
        r = requests.get(url, headers=headers)
        soup = BeautifulSoup(r.content.decode('utf-8'), 'html.parser')
        for tag in soup.select('div#search_results h5 span'):
            all_artists.append(tag["title"])
load_artists()
all_artists
You need to use Python 3, and you will no longer suffer from this.
Nothing is wrong; the names are stored as Unicode, and they print correctly when you ask Python to print them:
for a in all_artists:
    print(a)
...
Andrés
...
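If the names also need to be written to a file under Python 2, encoding explicitly on write avoids the same confusion. A minimal sketch, assuming all_artists was populated as above:
import io

# io.open behaves the same on Python 2 and 3 and encodes on write
with io.open('artists.txt', 'w', encoding='utf-8') as f:
    for name in all_artists:
        f.write(name + u'\n')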
I want to extract all the available items in the équipements section, but I can only get the first four items, and then I get '+ Plus'.
import urllib2
from bs4 import BeautifulSoup
import re
import requests
headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
url = 'https://www.airbnb.fr/rooms/8261637?s=bAMrFL5A'
req = urllib2.Request(url = url, headers = headers)
html = urllib2.urlopen(req)
bsobj = BeautifulSoup(html.read(),'lxml')
b = bsobj.findAll("div",{"class": "row amenities"})
The result of b does not include the full list inside the tag, and the last element is '+ Plus', which looks like the following:
<span data-reactid=".mjeft4n4sg.0.0.0.0.1.8.1.0.0.$1.1.0.0">+ Plus</span></strong></a></div></div></div></div></div>]
This is because the data is filled in by React after the page loads, so if you download the page via requests you can't see it. Instead, you have to use the Selenium WebDriver: open the page and let it execute all the JavaScript. Then you can access all the data you expect.
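A minimal Selenium sketch, assuming Firefox and geckodriver are installed; the '+ Plus' link text comes from the question, but Airbnb's markup changes often, so the selectors may need adjusting:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('https://www.airbnb.fr/rooms/8261637?s=bAMrFL5A')
driver.implicitly_wait(10)  # give React time to render

# Expand the amenity list if the '+ Plus' link is present
try:
    driver.find_element(By.PARTIAL_LINK_TEXT, 'Plus').click()
except Exception:
    pass  # everything may already be visible

for div in driver.find_elements(By.CSS_SELECTOR, 'div.row.amenities'):
    print(div.text)

driver.quit()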