Extract information present in dictionaries from script while web scraping - python

I am trying to scrape
URL="https://www.bankmega.com/en/about-us/bank-mega-network/"
to extract Bank name and address information. I am able to see the required information within the script tags. How can I extract it?
import requests
from bs4 import BeautifulSoup
import json
r = requests.get(URL)
soup = BeautifulSoup(r.content)
soup.find_all('script',type="text/javascript")

If you are able to select the relevant JavaScript, the easiest way is probably to search the script text for the first occurrence of "[" and the first occurrence of "]", since these two mark the boundaries of the list of dictionaries. If you can put only that content (including the square brackets) into a separate string, you can use the json library to convert the string into a Python object. The code below is a bit ugly when performing the string cleaning, but it does the job.
import requests
from bs4 import BeautifulSoup
import json
import re

URL = "https://www.bankmega.com/en/about-us/bank-mega-network/"
r = requests.get(URL)
soup = BeautifulSoup(r.content, "html.parser")

for element in soup.find_all('script', type="text/javascript"):
    if "$('#table_data_atm').hide();" in element.get_text():
        string_raw = element.get_text()
        # take everything between the first "[" and the first "]" (inclusive)
        first_bracket_open = string_raw.find("[")
        first_bracket_close = string_raw.find("]")
        # quote the bare keys and strip newlines so the string becomes valid JSON
        cleaned_string = string_raw[first_bracket_open:first_bracket_close+1].replace('city:', '"city":').replace('lokasi:', '"lokasi":').replace('alamat:', '"alamat":').replace("\n", "")
        cleaned_string = re.sub(r"\s\s+", " ", cleaned_string)
        # drop the trailing commas, which json.loads does not accept
        cleaned_string = cleaned_string.replace(", },", "},").replace(", ]", "]").replace("\t", " ")
        parsed = json.loads(cleaned_string)
        print(parsed)
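If you then want just the name and address for each branch, you can loop over the parsed list. This is a minimal sketch, assuming each entry keeps the keys visible in the script ('city', 'lokasi' and 'alamat'; 'lokasi' appears to hold the branch name and 'alamat' the address):
for branch in parsed:
    # .get() avoids a KeyError if an entry is missing one of the assumed keys
    print(branch.get("city"), "-", branch.get("lokasi"), "-", branch.get("alamat"))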

Related

How to search for a specific unicode string when web scraping?

I recently got interested in web scraping in Python and tried it on some simple examples, but I don't know how to handle languages that don't use ASCII characters, for example when searching for a specific string in the HTML or writing such strings to a file.
from urllib.parse import urljoin
import requests
import bs4

website = 'http://book.iranseda.ir'
book_url = 'http://book.iranseda.ir/DetailsAlbum/?VALID=TRUE&g=209103'
soup1 = bs4.BeautifulSoup(requests.get(book_url).text, 'lxml')
match1 = soup1.find_all('a', class_='download-mp3')
for m in match1:
    m = m['href'].replace('q=10', 'q=9')
    url = urljoin(website, m)
    print(url)
    print()
Looking at this website under book_url, each row has different text, but the text is in the Persian language.
Let's say I need the last row to be considered.
The text is "صدای کل کتاب"
How can I search for this string in <li>, <div>, and <a> tags?
You need to set the encoding from requests to UTF-8. It looks like the requests module was not using the decoding you wanted. As mentioned in this SO post, you can tell requests what encoding to expect.
from urllib.parse import urljoin
import requests
import bs4

website = 'http://book.iranseda.ir'
book_url = 'http://book.iranseda.ir/DetailsAlbum/?VALID=TRUE&g=209103'
req = requests.get(book_url)
req.encoding = 'UTF-8'
soup1 = bs4.BeautifulSoup(req.text, 'lxml')
match1 = soup1.find_all('a', class_='download-mp3')
for m in match1:
    m = m['href'].replace('q=10', 'q=9')
    url = urljoin(website, m)
    print(url)
    print()
The only change here is
req = requests.get(book_url)
req.encoding = 'UTF-8'
soup1 = bs4.BeautifulSoup(req.text, 'lxml')
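Once the text decodes correctly, you can also search for the Persian string itself in <li>, <div>, and <a> tags by matching on the tag text. A minimal sketch (assuming the exact string appears as the text of one of those tags; in older BeautifulSoup versions use text= instead of string=):
target = 'صدای کل کتاب'
# match any <li>, <div> or <a> whose string contains the target text
matches = soup1.find_all(['li', 'div', 'a'], string=lambda s: s and target in s)
for tag in matches:
    print(tag.name, tag.get_text(strip=True))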

Extracting a specific substring from a specific hyper-reference using Python

I'm new to Python, and for my second attempt at a project, I wanted to extract a substring – specifically, an identifying number – from a hyper-reference on a url.
For example, this search query gives a result whose hyper-reference is http://www.chessgames.com/perl/chessgame?gid=1012809. From this I want to extract the identifying number "1012809" and use it to navigate to that url, after which I plan to download the file at the url http://www.chessgames.com/pgn/alekhine_naegeli_1932.pgn?gid=1012809. But I am currently stuck a few steps before that because I can't figure out a way to extract the identifier.
Here is my MWE:
from bs4 import BeautifulSoup
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
import re
y = str(soup)
x = re.findall("gid=[0-9]+", y)
print x
z = re.sub("gid=", "", x(1))  # At this point, things have completely broken down...
As Albin Paul commented, re.findall returns a list, so you need to extract elements from it. By the way, you don't need BeautifulSoup here; use urllib2.urlopen(url).read() to get the content as a string. The re.sub is also not needed here; a single regex pattern (?:gid=)([0-9]+) is enough.
import re
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url).read()
result = re.findall(r"(?:gid=)([0-9]+)",page)
print(result[0])
#'1012809'
You don't need regex here at all. A CSS selector along with some string manipulation will lead you in the right direction. Try the script below:
import requests
from bs4 import BeautifulSoup
page_link = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
soup = BeautifulSoup(requests.get(page_link).text, 'lxml')
item_num = soup.select_one("[href*='gid=']")['href'].split("gid=")[1]
print(item_num)
Output:
1012809
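If you then want to visit the game page mentioned in the question, you can append the extracted number to the chessgame url. A small follow-up reusing item_num from the snippet above:
game_url = 'http://www.chessgames.com/perl/chessgame?gid=' + item_num
print(game_url)  # http://www.chessgames.com/perl/chessgame?gid=1012809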

Python3 scraper. Doesn't parse the xpath till the end

I'm using the lxml.html module.
from lxml import html
page = html.parse('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution')
# print(page.content)
unis = page.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')
print(len(unis))
with open('workfile.txt', 'w') as f:
    for uni in unis:
        f.write(uni + '\n')
The website right here (http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z) is full of universities.
The problem is that it only parses up to the letter 'H' (244 unis).
I can't understand why, since as far as I can see it parses all the HTML to the end.
I also checked that 244 is not a limit on lists or anything else in Python 3.
That page simply isn't valid HTML; it's totally broken. But the following will do what you want. It uses the BeautifulSoup parser.
from lxml.html.soupparser import parse
import urllib.request
url = 'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution'
page = parse(urllib.request.urlopen(url))
unis = page.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')
See http://lxml.de/lxmlhtml.html#really-broken-pages for more info.
For web scraping I recommend using BeautifulSoup 4.
With bs4 this is easily done:
from bs4 import BeautifulSoup
import urllib.request

universities = []
result = urllib.request.urlopen('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z')
soup = BeautifulSoup(result.read(), 'html.parser')
table = soup.find_all(lambda tag: tag.name == 'table')
for t in table:
    rows = t.find_all(lambda tag: tag.name == 'tr')
    for r in rows:
        # there are also the A-Z headers -> check length
        # there are also empty headers -> check isspace()
        headers = r.find_all(lambda tag: tag.name == 'h3' and not tag.text.isspace() and len(tag.text.strip()) > 2)
        for h in headers:
            universities.append(h.text)
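If you also want to write the results to a file, as the original snippet does with workfile.txt, a short follow-up reusing the universities list built above:
with open('workfile.txt', 'w') as f:
    for uni in universities:
        f.write(uni + '\n')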

Scraping in Python with BeautifulSoup

I've read quite a few posts here about this, but I'm very new to Python in general so I was hoping for some more info.
Essentially, I'm trying to write something that will pull word definitions from a site and write them to a file. I've been using BeautifulSoup, and I've made quite some progress, but here's my issue -
from __future__ import print_function
import requests
import urllib2, urllib
from BeautifulSoup import BeautifulSoup
wordlist = open('test.txt', 'a')
word = raw_input('Paste your word ')
url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query=%s' % word
# print url
html = urllib.urlopen(url).read()
# print html
soup = BeautifulSoup(html)
visible_text = soup.find('pre')(text=True)
print(visible_text, file=wordlist)
this seems to pull what I need, but puts it in this format
[u'passable\n adj 1: able to be passed or traversed or crossed; "the road is\n passable"
but I need it to be in plaintext. I've tried using a sanitizer (I was running it through bleach), but that didn't work. I've read some of the other answers here, but they don't explain HOW the code works, and I don't want to add something if I don't understand how it works.
Is there any way to just pull the plaintext?
edit: I ended up doing
from __future__ import print_function
import requests
import urllib2, urllib
from bs4 import BeautifulSoup
wordlist = open('test.txt', 'a')
word = raw_input('Paste your word ')
url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query=%s' % word
# print url
html = urllib.urlopen(url).read()
# print html
soup = BeautifulSoup(html)
visible_text = soup.find('pre')(text=True)[0]
print(visible_text, file=wordlist)
The code is already giving you plaintext; it just happens to have some characters encoded as entity references. In this case, special characters that form part of the XML/HTML syntax are encoded to prevent them from breaking the structure of the text.
To decode them, use the HTMLParser module:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('&quot;the road is passable&quot;')
u'"the road is passable"'
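If you move to Python 3 at some point (where the HTMLParser module was renamed), the same decoding is available through the standard html module; a minimal equivalent:
import html
print(html.unescape('&quot;the road is passable&quot;'))  # "the road is passable"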

search a specific word in BeautifulSoup python

I am trying to make a python script that reads crunchyroll's page and gives me the ssid of the subtitle.
For example :- http://www.crunchyroll.com/i-cant-understand-what-my-husband-is-saying/episode-1-wriggling-memories-678035
Go to the source code and look for ssid; I want to extract the number after ssid in the element whose link text is "English (US)".
I want to extract "154757", but I can't seem to get my script working
This is my current script:
import feedparser
import re
import urllib2
from urllib2 import urlopen
from bs4 import BeautifulSoup
feed = feedparser.parse('http://www.crunchyroll.com/rss/anime')
url1 = feed['entries'][0]['link']
soup = BeautifulSoup(urlopen(url1), 'html.parser')
How can I modify my code to search and extract that particular number?
This should get you started with extracting the ssid for each entry. Note that some of those links don't have any ssid, so you'll have to account for that with some error catching. No need for the re or urllib2 modules here.
import feedparser
import requests
from bs4 import BeautifulSoup

d = feedparser.parse('http://www.crunchyroll.com/rss/anime')
for url in d.entries:
    #print url.link
    r = requests.get(url.link)
    soup = BeautifulSoup(r.text)
    #print soup
    subtitles = soup.find_all('span', {'class': 'showmedia-subtitle-text'})
    for ssid in subtitles:
        x = ssid.findAll('a')
        for a in x:
            print a['href']
Output:
--snip--
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166035
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165817
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165819
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166783
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165839
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165989
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166051
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166011
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=165995
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=165997
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166033
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=165825
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166013
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166009
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166003
/etotama/episode-11-catrat-shuffle-678659?ssid=166007
/etotama/episode-11-catrat-shuffle-678659?ssid=165969
/etotama/episode-11-catrat-shuffle-678659?ssid=166489
/etotama/episode-11-catrat-shuffle-678659?ssid=166023
/etotama/episode-11-catrat-shuffle-678659?ssid=166015
/etotama/episode-11-catrat-shuffle-678659?ssid=166049
/etotama/episode-11-catrat-shuffle-678659?ssid=165993
/etotama/episode-11-catrat-shuffle-678659?ssid=165981
--snip--
There are more, but I left them out for brevity. From these results you should be able to easily parse out the ssid with some slicing, since it looks like the ssids are all 6 digits long. Doing something like:
print a['href'][-6:]
would do the trick and get you just the ssid.
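If you would rather not rely on the ssid always being exactly 6 digits, you could parse the query string instead of slicing. A small sketch using the standard library, assuming the hrefs keep the ?ssid=... form shown in the output above:
from urlparse import urlparse, parse_qs  # urllib.parse in Python 3

def extract_ssid(href):
    # returns the ssid query parameter, or None if the link has no ssid
    values = parse_qs(urlparse(href).query).get('ssid')
    return values[0] if values else None

print extract_ssid('/etotama/episode-11-catrat-shuffle-678659?ssid=165981')  # 165981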
