Extracting a specific substring from a specific hyper-reference using Python

I'm new to Python, and for my second attempt at a project, I wanted to extract a substring – specifically, an identifying number – from a hyper-reference on a url.
For example, this URL is the result of my search query, giving the hyper-reference http://www.chessgames.com/perl/chessgame?gid=1012809. From this I want to extract the identifying number "1012809" and use it to navigate to that game page, after which I plan to download the file at the URL http://www.chessgames.com/pgn/alekhine_naegeli_1932.pgn?gid=1012809. But I am currently stuck a few steps before that, because I can't figure out a way to extract the identifier.
Here is my MWE:
from bs4 import BeautifulSoup
import urllib2  # Python 2; the code below never imported this
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
import re
y = str(soup)
x = re.findall("gid=[0-9]+",y)
print x
z = re.sub("gid=", "", x(1)) #At this point, things have completely broken down...

As Albin Paul commented, re.findall returns a list, so you need to extract elements from it. By the way, you don't need BeautifulSoup here: urllib2.urlopen(url).read() gets you the page content as a string. The re.sub is also not needed, since a single regex pattern with a capture group, (?:gid=)([0-9]+), is enough.
import re
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url).read()
result = re.findall(r"(?:gid=)([0-9]+)",page)
print(result[0])
#'1012809'
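The capture-group behaviour is easy to sanity-check offline; a minimal sketch on a hard-coded anchor (the snippet below is assumed, not fetched from the site):

```python
import re

# A sample anchor tag shaped like the one on the results page (assumed snippet)
html = '<a href="http://www.chessgames.com/perl/chessgame?gid=1012809">Alekhine vs Naegeli, 1932</a>'

# (?:gid=) anchors the match without capturing, so only the digits land in the result
gids = re.findall(r"(?:gid=)([0-9]+)", html)
print(gids[0])  # 1012809
```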

You don't need a regex here at all. A CSS selector along with a little string manipulation will lead you in the right direction. Try the script below:
import requests
from bs4 import BeautifulSoup
page_link = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
soup = BeautifulSoup(requests.get(page_link).text, 'lxml')
item_num = soup.select_one("[href*='gid=']")['href'].split("gid=")[1]
print(item_num)
Output:
1012809
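Since gid is an ordinary query-string parameter, the standard library can also extract it without regexes or splitting; a sketch using urllib.parse on the href from the question:

```python
from urllib.parse import urlparse, parse_qs

href = 'http://www.chessgames.com/perl/chessgame?gid=1012809'
# parse_qs maps each parameter name to a list of values
params = parse_qs(urlparse(href).query)
print(params['gid'][0])  # 1012809
```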

Related

Regex issue in Python printing entire URL

I am trying to pull all the URLs that contain "https://play.google.com/store/" and print each full string. When I run my current code, it prints only "https://play.google.com/store/", but I am looking for the entire URL. Can someone point me in the right direction? Here is my code:
import pandas as pd
import os
import requests
from bs4 import BeautifulSoup
import re
URL = "https://www.pocketgamer.com/android/best-tycoon-games-android/?page=3"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "html.parser")
links = []
for link in soup.findAll("a", target="_blank"):
    links.append(link.get('href'))
x = re.findall("https://play.google.com/store/", str(links))
print(x)
re.findall just returns the part of the text that matches the regex, so all you are getting is the https://play.google.com/store/ that is in the regex. You could modify the regex, but given what you are searching is a list of links, it's easier to just check if they start with https://play.google.com/store/. For example:
x = [link for link in links if link.startswith('https://play.google.com/store/')]
Output (for your query):
[
'https://play.google.com/store/apps/details?id=com.auxbrain.egginc',
'https://play.google.com/store/apps/details?id=net.kairosoft.android.gamedev3en',
'https://play.google.com/store/apps/details?id=com.pixodust.games.idle.museum.tycoon.empire.art.history',
'https://play.google.com/store/apps/details?id=com.AdrianZarzycki.idle.incremental.car.industry.tycoon',
'https://play.google.com/store/apps/details?id=com.veloxia.spacecolonyidle',
'https://play.google.com/store/apps/details?id=com.uplayonline.esportslifetycoon',
'https://play.google.com/store/apps/details?id=com.codigames.hotel.empire.tycoon.idle.game',
'https://play.google.com/store/apps/details?id=com.mafgames.idle.cat.neko.manager.tycoon',
'https://play.google.com/store/apps/details?id=com.atari.mobile.rctempire',
'https://play.google.com/store/apps/details?id=com.pixodust.games.rocket.star.inc.idle.space.factory.tycoon',
'https://play.google.com/store/apps/details?id=com.idlezoo.game',
'https://play.google.com/store/apps/details?id=com.fluffyfairygames.idleminertycoon',
'https://play.google.com/store/apps/details?id=com.boomdrag.devtycoon2',
'https://play.google.com/store/apps/details?id=com.TomJarStudio.GamingShop2D',
'https://play.google.com/store/apps/details?id=com.roasterygames.smartphonetycoon2'
]
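If you would rather keep re.findall, the pattern can also be widened so it consumes the rest of each URL instead of stopping at the literal prefix; a sketch with a shortened stand-in list (escaped dots, matching up to the quote that str() wraps around each list item):

```python
import re

# A shortened stand-in for the scraped list of hrefs
links = [
    'https://play.google.com/store/apps/details?id=com.idlezoo.game',
    'https://www.pocketgamer.com/android/best-tycoon-games-android/',
]
# [^'] keeps matching until the closing quote that str(links) puts around each item
x = re.findall(r"https://play\.google\.com/store/[^']+", str(links))
print(x)  # ['https://play.google.com/store/apps/details?id=com.idlezoo.game']
```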

Scraping Search Results Using Soup and Python, Split Returns Only One Value Instead Of a List?

When trying to scrape Google search results using BeautifulSoup and Python 3.x, the result after the split is a single value, just one URL out of many.
The expected output is a list of all the URLs found, which would then be cleaned using the head, sep, tail partition method.
It happens after this for loop.
for link in links:
    x = re.split('="/url?q="', link["href"].replace("/url?q=",""))
The variable links holds all the results from the search page, and the loop is supposed to iterate through them via the variable link:
Full Code
import requests
from urllib.parse import urlparse
import re
from bs4 import BeautifulSoup
import urllib.request
srchTerm = ['64503']
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

opener = AppURLopener()
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
for term in srchTerm:
    resp = opener.open("https://www.google.com/search?q=site:https://private.xx.co.bd/++" + term)
    soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))
    links = soup.find_all("a", href=re.compile("(?<=/url\?q=)(https://private.xx.co.bd/)"))
    for link in links:
        x = re.split('="/url?q="', link["href"].replace("/url?q=",""))
        ## for linka in x:
        ##     head, sep, tail = linka.partition('&sa')
        ##     print(head)
This prints only one result:
<a data-uch="1" href="/url?q=https://private.xx.co.bd/blalbalba/4B1041344.aspx&sa=U&ved=2ahUKEwi-pOWSv4HqAhWGJTQIHUI-BCgQFjACegQIAxAB&usg=AOvVaw3joBh4SH9QwW5WHmwn-7cs"><h3 class="zBAuLc"><div class="BNeawe vvjwJb AP7Wnd"><span dir="rtl">xxxxxxx</span></div></h3><div class="BNeawe UPmit AP7Wnd"><span dir="rtl">xxx‹ https://private.xxx.co.il</span></div></a>
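Worth noting (this observation is not from the original thread): re.split returns its input unchanged, as a one-element list, whenever the pattern never matches, and the literal ="/url?q=" never occurs inside an href. The head/sep/tail cleanup the question hints at works with plain string methods; a sketch on an href of the shape shown above (sample value, trailing parameters shortened):

```python
# An href of the shape Google returns (sample value, trailing parameters shortened)
href = "/url?q=https://private.xx.co.bd/blalbalba/4B1041344.aspx&sa=U&ved=2ahUK"
clean = href.replace("/url?q=", "")       # drop the redirect prefix
head, sep, tail = clean.partition("&sa")  # cut everything from '&sa' onwards
print(head)  # https://private.xx.co.bd/blalbalba/4B1041344.aspx
```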

Scraping the URLs of dynamically changing images from a website

I'm creating a Python program that collects images from this website by Google.
The images on the website change after a certain number of seconds, and the image URL also changes with time. This change is handled by a script on the website, and I have no idea how to get the image links from it.
I tried using BeautifulSoup and the requests library to get the image links from the site's html code:
import requests
from bs4 import BeautifulSoup
url = 'https://clients3.google.com/cast/chromecast/home'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
tags = soup('img')
for tag in tags:
    print(tag)
But instead of real links, the code returns tags whose image src ("ng-src") still contains the {{backgroundUrl}} template placeholder.
For example:
<img class="S9aygc-AHe6Kc" id="picture-background" image-error-handler="" image-index="0" ng-if="backgroundUrl" ng-src="{{backgroundUrl}}"/>
How can I get the image links from a dynamically changing site? Can BeautifulSoup handle this? If not what library will do the job?
import requests
import re
def main(url):
    r = requests.get(url)
    match = re.search(r"(lh4\.googl.+?mv)", r.text).group(1)
    match = match.replace("\\", "").replace("u003d", "=")
    print(match)

main("https://clients3.google.com/cast/chromecast/home")
Just a minor addition to the answer by αԋɱҽԃ αмєяιcαη (ahmed american), in case anyone is wondering:
the subdomain (lhx) in lhx.google.com is also dynamic. As a result, the link can be lh3 or lh4, et cetera.
This code fixes the problem:
import requests
import re
r = requests.get("https://clients3.google.com/cast/chromecast/home").text
match = re.search(r"(lh.\.googl.+?mv)", r).group(1)
match = match.replace('\\', '').replace("u003d", "=")
print(match)
The major difference is that the lh4 in the code by ahmed american has been replaced with "lh." so that all images are matched no matter the subdomain.
EDIT: This line does not work:
match = match.replace('\\', '').replace("u003d", "=")
Replace with:
match = match.replace("\\", "")
match = match.replace("u003d", "=")
None of the provided answers worked for me; the issues may be related to using an older version of Python and/or the source page changing some things around.
Also, this version returns all matches instead of only the first one.
Tested in Python 3.9.6.
import requests
import re
url = 'https://clients3.google.com/cast/chromecast/home'
r = requests.get(url)
for match in re.finditer(r"(ccp-lh\..+?mv)", r.text, re.S):
    image_link = 'https://%s' % (match.group(1).replace("\\", "").replace("u003d", "="))
    print(image_link)

Unable to extract content from DOM element with $0 thru BeautifulSoup

Here is the website from which I am trying to scrape the number of reviews.
I want to extract the number 272, but it returns None every time.
I have to use BeautifulSoup.
I tried-
import requests
from bs4 import BeautifulSoup

sources = requests.get('https://www.thebodyshop.com/en-us/body/body-butter/olive-body-butter/p/p000016')
soup = BeautifulSoup(sources.content, 'lxml')
x = soup.find('div', {'class': 'columns five product-info'}).find('div')
print(x)
output - empty tag
I want to go inside that tag further.
The number of reviews is dynamically retrieved from a URL you can find in the network tab. You can simply extract it from response.text with a regex. The endpoint is part of a defined ajax handler.
You can find a lot of the API instructions in one of the js files: https://thebodyshop-usa.ugc.bazaarvoice.com/static/6097redes-en_us/bvapi.js
For example:
You can trace back through a whole lot of jQuery if you really want to.
tl;dr: I think you need only add the product_id to a constant string.
import requests, re
from bs4 import BeautifulSoup as bs
p = re.compile(r'"numReviews":(\d+),')
ids = ['p000627']
with requests.Session() as s:
    for product_id in ids:
        r = s.get(f'https://thebodyshop-usa.ugc.bazaarvoice.com/6097redes-en_us/{product_id}/reviews.djs?format=embeddedhtml')
        print(int(p.findall(r.text)[0]))
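The numReviews pattern can be sanity-checked against a hard-coded fragment shaped like the Bazaarvoice payload (the sample text below is a hypothetical fragment, not a real response):

```python
import re

# Hypothetical fragment shaped like the embedded Bazaarvoice payload
payload = 'webAnalyticsConfig:{"numReviews":272,"numRatings":272}'
p = re.compile(r'"numReviews":(\d+),')
print(int(p.findall(payload)[0]))  # 272
```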

Trying to scrape information from an iterated list of links using Beautiful Soup or ElementTree

I'm attempting to scrape an XML sitemap list of links for these addresses. (The second link is an example page that actually contains some addresses; many of the links don't.)
I'm able to retrieve the list of initial links I'd like to crawl through, but I can't seem to go one step further and extract the final information I'm looking for (the addresses).
I assume there's an error in my syntax; I've tried scraping it using both Beautiful Soup and Python's included library, but neither works.
BSoup:
from bs4 import BeautifulSoup
import requests
import re
resultsdict = {}
companyname = []
url1 = 'http://www.agenzia-interinale.it/sitemap-5.xml'
html = requests.get(url1).text
bs = BeautifulSoup(html)
# find the links to companies
company_menu = bs.find_all('loc')
for company in company_menu:
    data = bs.find("html",{"i"})
    print data
Non 3rd Party:
import requests
import xml.etree.ElementTree as et
req = requests.get('http://www.agenzia-interinale.it/sitemap-5.xml')
root = et.fromstring(req.content)
for i in root:
    print i[0].text
Any input is appreciated! Thanks.
Your syntax is OK. You simply need to follow those links from the first page; here's how it looks for the Milano pages:
from bs4 import BeautifulSoup
import requests
import re
resultsdict = {}
companyname = []
url1 = 'http://www.agenzia-interinale.it/sitemap-5.xml'
html = requests.get(url1).text
bs = BeautifulSoup(html)
company_menu = bs.find_all('loc')
for item in company_menu:
    if 'milano' in item.text:
        subpage = requests.get(item.text)
        subsoup = BeautifulSoup(subpage.text)
        adresses = subsoup.find_all(class_='riquadro_agenzia_off')
        for adress in adresses:
            companyname.append(adress.text)

print companyname
To get all addresses you can simply remove the if 'milano' check in the code. I don't know if all the subpages are formatted according to coherent rules; for Milano, the addresses sit under a div with class="riquadro_agenzia_off", and if the other subpages follow the same layout this should work for them too. Anyway, this should get you started. Hope it helps.
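One note on the ElementTree variant from the question: sitemap files declare a default XML namespace, so bare tag names like 'loc' won't match in findall; you need to qualify them. A self-contained sketch with a minimal sample fragment (not fetched from the site):

```python
import xml.etree.ElementTree as et

# A minimal sitemap fragment using the standard sitemap namespace
xml_doc = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.agenzia-interinale.it/milano</loc></url>
</urlset>"""

root = et.fromstring(xml_doc)
# Map a prefix to the default namespace so findall can qualify the tag
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
locs = [loc.text for loc in root.findall('.//sm:loc', ns)]
print(locs)  # ['http://www.agenzia-interinale.it/milano']
```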
