Unique phrase in the source code of an HTML page in Python3 - python

I'm trying to figure out how to get Python3 to display a certain phrase from an HTML document. For example, I'll be using the search engine https://duckduckgo.com .
I'd like the code to search for var error=document.getElementById(...); and display what is inside the parentheses, which in this case would be "error_homepage". Any help would be appreciated.
import urllib.request
u = input('Please enter URL: ')
x = urllib.request.urlopen(u)
print(x.read())

You can simply read the website of interest, as you suggested, using urllib.request, and use regular expressions to search the retrieved HTML/JS/... code:
import re
import urllib.request
# the URL that data is read from
url = "http://..."
# the regex pattern for extracting element IDs
pattern = r"var error = document\.getElementById\(['\"](?P<element_id>[a-zA-Z0-9_-]+)['\"]\);"
# fetch HTML code
with urllib.request.urlopen(url) as f:
    html = f.read().decode("utf8")
# extract element IDs
for m in re.findall(pattern, html):
    print(m)
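As a usage sketch, assuming the fetched page contains a line like the sample string below, the named group can also be pulled out directly with re.search:

```python
import re

# same pattern as above, with the dots in the method name escaped
pattern = r"var error = document\.getElementById\(['\"](?P<element_id>[a-zA-Z0-9_-]+)['\"]\);"

# a sample line as it might appear in the page source (assumed for illustration)
html = 'var error = document.getElementById("error_homepage");'

m = re.search(pattern, html)
if m:
    print(m.group("element_id"))  # error_homepage
```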

Removing Empty Lines of a list in python

My goal is to get a simple text output like:
https://widget.reviews.io/rating-snippet/dist.js
But I keep getting output like this:
https://widget.reviews.io/rating-snippet/dist.js
All these empty lines are the problem
--> Before there were [] in the output, but I removed them with ''.join
Now I only have these empty lines.
Here is my code:
import requests
import re
from bs4 import BeautifulSoup

html = requests.get("https://www.nutrimuscle.com")
soup = BeautifulSoup(html.text, "html.parser")
# Find all script tags
for n in soup.find_all('script'):
    # Check if the src attribute exists, and if it does grab the source URL
    if 'src' in n.attrs:
        javascript = n['src']
    # Otherwise assume that the javascript is contained within the tags
    else:
        javascript = ''
    kameleoonRegex = re.compile(r'[\w].*rating-snippet/dist.js')
    # Everything I tried :D
    kameleeonScript = kameleoonRegex.findall(javascript)
    text = ''.join(kameleeonScript)
    print(text)
It's probably not that hard but I've been on this for hours
if kameleeonScript: print(kameleeonScript[0])
did the job :)
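To show why that works, here is a self-contained sketch of the same filtering logic run against an assumed in-memory list of src values instead of the live site; printing only when the regex actually matched is what removes the blank lines:

```python
import re

# assumed sample of src attributes as they might be collected from the script tags
srcs = [
    "https://widget.reviews.io/rating-snippet/dist.js",
    "https://example.com/other.js",
    "",  # inline <script> tags have no src attribute
]

# escape the literal dot so "dist.js" cannot match e.g. "distxjs"
kameleoon_regex = re.compile(r"\S*rating-snippet/dist\.js")

found = []
for src in srcs:
    match = kameleoon_regex.findall(src)
    if match:  # only print on an actual hit, avoiding empty lines
        found.append(match[0])
        print(match[0])
```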

Extract information present in dictionaries from script while web scraping

I am trying to scrape
URL="https://www.bankmega.com/en/about-us/bank-mega-network/"
to extract Bank name and address information. I am able to see the required information within the script tags. How can I extract it?
import requests
from bs4 import BeautifulSoup
import json
r = requests.get(URL)
soup = BeautifulSoup(r.content)
soup.find_all('script',type="text/javascript")
If you are able to select the relevant JavaScript, the easiest way is probably to search the script text for the first occurrence of "[" and "]", since these two mark the boundaries of the embedded data. If you put only that content (including the square brackets) into a separate string, you can use the json library to convert the string into a Python object. The code below is a bit ugly when performing the string cleaning, but it does the job.
import requests
from bs4 import BeautifulSoup
import json
import re

URL = "https://www.bankmega.com/en/about-us/bank-mega-network/"
r = requests.get(URL)
soup = BeautifulSoup(r.content, "html.parser")
for element in soup.find_all('script', type="text/javascript"):
    if "$('#table_data_atm').hide();" in element.get_text():
        string_raw = element.get_text()
        # take everything between the first "[" and the first "]"
        first_bracket_open = string_raw.find("[")
        first_bracket_close = string_raw.find("]")
        # quote the bare keys and strip newlines so json can parse it
        cleaned_string = string_raw[first_bracket_open:first_bracket_close+1].replace('city:', '"city":').replace('lokasi:', '"lokasi":').replace('alamat:', '"alamat":').replace("\n", "")
        cleaned_string = re.sub(r"\s\s+", " ", cleaned_string)
        # drop the trailing commas that are valid in JavaScript but not in JSON
        cleaned_string = cleaned_string.replace(", },", "},").replace(", ]", "]").replace("\t", " ")
        parsed = json.loads(cleaned_string)
        print(parsed)
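The cleaning step can be illustrated offline on a small assumed snippet of the kind of inline JavaScript the answer targets; a regex that quotes the bare keys is a slightly more general alternative to the chained replace calls (the variable names and sample data here are made up for illustration):

```python
import json
import re

# assumed sample resembling the inline JavaScript data on the page
raw = """var data = [
    {city: "Jakarta", lokasi: "HQ", alamat: "Jl. Example 1", },
    {city: "Bandung", lokasi: "Branch", alamat: "Jl. Example 2", },
];"""

# take everything between the first "[" and the first "]"
snippet = raw[raw.find("["):raw.find("]") + 1]
# quote bare keys (word characters followed by a colon)
snippet = re.sub(r"(\w+)\s*:", r'"\1":', snippet)
# drop trailing commas before "}" or "]", which JSON forbids
snippet = re.sub(r",\s*([}\]])", r"\1", snippet)

parsed = json.loads(snippet)
print(parsed[0]["city"])  # Jakarta
```

Note that the key-quoting regex is only safe because none of the string values here contain a colon; for messier input a real JavaScript parser would be more robust.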

Extracting a specific substring from a specific hyper-reference using Python

I'm new to Python, and for my second attempt at a project, I wanted to extract a substring – specifically, an identifying number – from a hyper-reference on a url.
For example, this url is the result of my search query, giving the hyper-reference http://www.chessgames.com/perl/chessgame?gid=1012809. From this I want to extract the identifying number "1012809" and append it to navigate to the url http://www.chessgames.com/perl/chessgame?gid=1012809, after which I plan to download the file at the url http://www.chessgames.com/pgn/alekhine_naegeli_1932.pgn?gid=1012809 . But I am currently stuck a few steps behind this because I can't figure out a way to extract the identifier.
Here is my MWE:
from bs4 import BeautifulSoup
import urllib2
import re

url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
y = str(soup)
x = re.findall("gid=[0-9]+", y)
print x
z = re.sub("gid=", "", x(1)) #At this point, things have completely broken down...
As Albin Paul commented, re.findall returns a list, so you need to extract elements from it. By the way, you don't need BeautifulSoup here: use urllib2.urlopen(url).read() to get the content as a string. The re.sub is also not needed, since one regex pattern, (?:gid=)([0-9]+), is enough.
import re
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url).read()
result = re.findall(r"(?:gid=)([0-9]+)",page)
print(result[0])
#'1012809'
You don't need regex here at all. A CSS selector along with some string manipulation will lead you in the right direction. Try the script below:
import requests
from bs4 import BeautifulSoup
page_link = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
soup = BeautifulSoup(requests.get(page_link).text, 'lxml')
item_num = soup.select_one("[href*='gid=']")['href'].split("gid=")[1]
print(item_num)
Output:
1012809
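A third option, assuming the href has already been extracted, is to let the standard library parse the query string instead of splitting it by hand:

```python
from urllib.parse import urlparse, parse_qs

# assumed example href as found on the results page
href = "http://www.chessgames.com/perl/chessgame?gid=1012809"

# parse_qs maps each query parameter to a list of its values
query = parse_qs(urlparse(href).query)
gid = query["gid"][0]
print(gid)  # 1012809
```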

Extracting table from html using python

I am trying to extract the table "Pharmacology-and-Biochemistry" from the url https://pubchem.ncbi.nlm.nih.gov/compound/23677941#section=Pharmacology-and-Biochemistry. I have written this code:
from lxml import etree
import urllib.request as ur
url = "https://pubchem.ncbi.nlm.nih.gov/compound/23677941#section=Chemical-and-Physical-Properties"
web = ur.urlopen(url)
s = web.read()
html = etree.HTML(s)
print (html)
nodes = html.xpath('//li[#id="Pharmacology-and-Biochemistry"/descendant::*]')
print(nodes)
but the script is not getting the node specified in the xpath, and the output is an empty list:
[]
I tried several other xpaths, but nothing worked.
Please help!
I think the problem is that the table you are searching for doesn't exist at this url.
Try running this:
from urllib import urlopen
text = urlopen('https://pubchem.ncbi.nlm.nih.gov/compound/23677941#section=Pharmacology-and-Biochemistry').read()
print 'Pharmacology-and-Biochemistry' in text
The result is:
False
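Separately, the XPath in the question has two syntax problems: attribute tests use @ rather than the CSS-style #, and the descendant step belongs outside the predicate, i.e. //li[@id="Pharmacology-and-Biochemistry"]/descendant::* in lxml. A runnable sketch of the corrected selection, using the standard library's ElementTree (which supports a subset of XPath) on an assumed minimal fragment:

```python
import xml.etree.ElementTree as ET

# assumed minimal stand-in for the kind of markup the XPath targets
html = """<ul>
  <li id='Pharmacology-and-Biochemistry'><b>MoA</b><span>text</span></li>
  <li id='Other'><i>skip</i></li>
</ul>"""

root = ET.fromstring(html)
# correct predicate syntax: @id, not #id
node = root.find(".//li[@id='Pharmacology-and-Biochemistry']")
# iter() yields the node itself first, so skip it to get only descendants
descendants = [child.tag for child in node.iter() if child is not node]
print(descendants)  # ['b', 'span']
```

Of course, as shown above, the fix does not help on the live page, because the section content is not in the downloaded HTML at all.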

(Python) Trying to isolate some data from a website

Essentially the script will download images from wallbase.cc's random and toplist pages. It looks for a seven-digit string which identifies each image, then inputs that id into a url and downloads it. The only problem I seem to have is isolating the seven-digit string.
What I want to be able to do is..
Search for <div id="thumbxxxxxxx" and then assign xxxxxxx to a variable.
Here's what I have so far.
import urllib
import os
import sys
import re
#Written in Python 2.7 with LightTable

def get_id():
    import urllib.request
    req = urllib.request.Request('http://wallbase.cc/'+initial_prompt)
    response = urllib.request.urlopen(req)
    the_page = response.read()
    for "data-id="" in the_page

def toplist():
    #We need to define how to find the images to download
    #The idea is to go to http://wallbase.cc/x and to take all of strings containing <a href="http://wallbase.cc/wallpaper/xxxxxxx" </a>
    #And to request the image file from that URL.
    #Then the file will be put in a user defined directory
    image_id = raw_input("Enter the seven digit identifier for the image to be downloaded to "+ directory+ "...\n>>> ")
    f = open(directory+image_id+ '.jpg','wb')
    f.write(urllib.urlopen('http://wallpapers.wallbase.cc/rozne/wallpaper-'+image_id+'.jpg').read())
    f.close()

directory = raw_input("Enter the directory in which the images will be downloaded.\n>>> ")
initial_prompt = input("What do you want to download from?\n\t1: Toplist\n\t2: Random\n>>> ")
if initial_prompt == 1:
    urlid = 'toplist'
    toplist()
elif initial_prompt == 2:
    urlid = 'random'
    random()
Any/all help is very much appreciated :)
You probably want to use a web scraping library like BeautifulSoup; see e.g. this SO question on web scraping in Python.
import re
import urllib2
from BeautifulSoup import BeautifulSoup

# download and parse HTML
url = 'http://wallbase.cc/toplist'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
# find the links we want
links = soup('a', href=re.compile(r'^http://wallbase.cc/wallpaper/\d+$'))
for l in links:
    href = l.get('href')
    print href # u'http://wallbase.cc/wallpaper/1750539'
    print href.split('/')[-1] # u'1750539'
If you want to use only the standard library, you could use regular expressions.
pattern = re.compile(r'<div id="thumb(.{7})"')
...
for data_id in re.findall(pattern, the_page):
    pass # do something with data_id
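A sketch of that regex applied to an assumed fragment of the page source; since the ids are known to be seven digits, \d{7} is a slightly tighter capture than .{7}:

```python
import re

# assumed snippet of the page markup containing thumbnail divs
the_page = '<div id="thumb1750539"></div><div id="thumb2468101"></div>'

pattern = re.compile(r'<div id="thumb(\d{7})"')
ids = pattern.findall(the_page)
print(ids)  # ['1750539', '2468101']
```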
