I am building a scraper where I want to extract the data from some tags as-is, without any conversion. But BeautifulSoup converts some hex character references to ASCII. For example, the hex references in the first title below get converted to plain ASCII:
html = """\
<title>&#x42;illing &#x61;ddress - PayPal</title>
<title>Billing address - PayPal</title>"""
Here's a small example of the code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
for element in soup.findAll(['title', 'form', 'a']):
    print(str(element))
But I want to extract the data in its original form. I believe BeautifulSoup 4 is auto-converting HTML entities, and this is what I don't want. Any help would be really appreciated.
BTW, I am using Python 3.5 and BeautifulSoup 4.
You might try using the re module (regular expressions). For instance, the code below will extract the title tag contents without converting them (I assume you declared the html variable beforehand):
import re
result = re.search(r'<title>.*</title>', html).group(0)
print(result)  # It'll print <title>&#x42;illing &#x61;ddress - PayPal</title>
You can do the same for the other tags as well.
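Following the same idea, here is a minimal sketch that covers all three tags the question loops over (title, form, and a), assuming those tags are not nested inside one another:
import re

for tag_name in ('title', 'form', 'a'):
    # Non-greedy match from the opening tag to its closing tag;
    # re.DOTALL lets a form element span multiple lines.
    for match in re.findall(r'<%s\b.*?</%s>' % (tag_name, tag_name), html, re.DOTALL):
        print(match)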
I have the complete HTML of a page, and from it I need to find its GA (Google Analytics) ID. For example:
<script>ga('create', 'UA-4444444444-1', 'auto');</script>
From the above string I need to get UA-4444444444-1, which starts with "UA-" and ends with "-1". I have tried this:
re.findall(r"\"trackingId\"\s?:\s?\"(UA-\d+-\d+)\"", raw_html)
but didn't have any success. Please let me know what mistake I am making. Thanks.
It seems that you are overthinking it: your pattern requires a literal "trackingId" key that doesn't appear in the page. You can just search for the UA token directly:
re.findall(r"UA-\d+-\d+", raw_html)
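A quick check of that pattern against the snippet from the question:
import re

raw_html = "<script>ga('create', 'UA-4444444444-1', 'auto');</script>"
print(re.findall(r"UA-\d+-\d+", raw_html))  # ['UA-4444444444-1']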
Never use regex to parse the HTML itself; BeautifulSoup is fine for extracting text from tags. Here we extract the script tags from the HTML with BeautifulSoup, then apply a regex to the text located inside them.
import re
from bs4 import BeautifulSoup

html = "<script>ga('create', 'UA-4444444444-1', 'auto');</script>"
soup = BeautifulSoup(html, 'lxml')
pattern = re.compile(r"UA-\d+-\d+")
ids = []
for script in soup.find_all("script"):
    # extend() instead of indexing [0], so scripts without a match don't raise an IndexError
    ids.extend(pattern.findall(script.text))
print(ids)  # ['UA-4444444444-1']
I'm living in Germany, where ZIP codes are in most cases a 5-digit number, e.g. 53525. I would really like to extract that information from a website using Beautiful Soup.
I am new to Python/Beautiful Soup, and I am not sure how to translate "find every run of 5 digits followed by a space" into Python.
import requests
import urllib.request, re
from bs4 import BeautifulSoup
source = requests.get('DOMAIN').text
soup = BeautifulSoup(source, 'lxml')
soup.find_all(NOTSUREHERE)
In the simplest scenario:
NOTSUREHERE should be replaced by name='tag_name', where tag_name is a tag in which you are certain to find ZIP codes (and no other numerical field that could be mistaken for a ZIP code).
Then each element of that result should be passed to re.findall(regex, string), where regex = '([0-9]{5})' (from what I understand the pattern to be) and string is the element from which you're extracting ZIP codes.
import re
import requests
from bs4 import BeautifulSoup

source = requests.get('DOMAIN').text
soup = BeautifulSoup(source, 'lxml')
tag_list = soup.find_all(name='tag_name')
match_list = []
for tag in tag_list:
    match_list.append(re.findall('([0-9]{5})', str(tag)))
You should watch out for possible matches that aren't ZIP codes. It may be a case of refining the soup.find_all() call by adding more arguments. The documentation will give you even more options, but the attrs argument can be set to {'target_attribute': 'target_att_value'}, those being an attribute and a value that definitely mark a tag containing a ZIP code; see the sketch below.
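A minimal sketch of that refinement; the tag name and the attribute/value pair are hypothetical placeholders, so inspect the page source for the real ones:
# 'span' and {'class': 'address'} are placeholders -- substitute whatever
# actually wraps the ZIP codes on the target page.
tag_list = soup.find_all('span', attrs={'class': 'address'})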
EDIT: Regarding possible empty elements, this link has a very straightforward solution: Removing empty elements from an array in Python
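Applied to the code above, that boils down to one list comprehension:
# Drop the empty lists left over by tags that contained no 5-digit match.
match_list = [m for m in match_list if m]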
I'm trying to write a simple application that reads the HTML from a webpage, converts it to a string, and displays certain slices of that string to the user.
However, it seems like these slices change themselves! Each time I run my code I get a different output! Here's the code.
# import urllib so we can get HTML source
from urllib.request import urlopen
# import time, so we can choose which date to read from
import time
# save HTML to a variable
content = urlopen("http://www.islamicfinder.org/prayerDetail.php?country=canada&city=Toronto&state=ON&lang")
# make HTML readable and convert HTML to a string
content = str(content.read())
# select part of the string containing the prayer time table
table = content[24885:24935]
print(table) # print to test what is being selected
I'm not sure what's going on here.
You should really be using something like Beautiful Soup. Something along the lines of the following should help. From looking at the source for that URL, there is no id/class on the table, which makes it a little trickier to find.
from bs4 import BeautifulSoup
import requests

url = "http://www.islamicfinder.org/prayerDetail.php?country=canada&city=Toronto&state=ON&lang"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
for table in soup.find_all('table'):
    # here you can identify the table you want and deal with the results
    print(table)
You shouldn't be grabbing the part you want by slicing at fixed indexes; websites are often dynamic, and the page won't contain the exact same content at the same positions each time.
What you want to do is search for the table you want. Say the table started with the keyword class="prayer_table"; you could find it with str.find(), as in the sketch below.
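A minimal sketch of that approach; class="prayer_table" is the hypothetical marker from the sentence above, so check the real page source for the actual attribute:
# content is the page source string from the question's code.
start = content.find('class="prayer_table"')  # hypothetical marker
if start != -1:
    end = content.find('</table>', start)     # end of that table's markup
    print(content[start:end])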
Better yet, extract the tables from the webpage instead of relying on str.find(). The code below is adapted from a question on extracting tables from a webpage:
from lxml import etree
from urllib.request import urlopen

web = urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()
html = etree.HTML(s)
## Get all 'tr'
tr_nodes = html.xpath('//table[@id="Report1_dgReportDemographic"]/tr')
## 'th' is inside first 'tr'
header = [i[0].text for i in tr_nodes[0].xpath("th")]
## Get text from all remaining 'tr'
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]
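To inspect what was extracted, you could print the pieces (a quick usage sketch):
print(header)           # the column names taken from the th cells
for row in td_content:  # one list of cell texts per data row
    print(row)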
My idea was to explore Groupon's website to extract the URLs of the deals. The problem is that I'm trying to do a findall on the Groupon page (this one: http://www.groupon.de/alle-deals/muenchen/restaurant-296) to find data like this:
"category":"RESTAURANT1","dealPermaLink":"/deals/muenchen-special/Casa-Lavecchia/24788330", and I'd like to get the 'deals/muenchen-special/Casa-Lavecchia/24788330'.
I tried the whole night but I'm unable to find a correct regex. I tried:
import urllib2
import re
Page_Web = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
for m in re.findall('category*RESATAURANT1*dealPermaLink*:?/*/*/*/*\d$',Page_Web):
    print m
But it doesn't print anything.
In order to extract the block that interests you, I would do it this way:
from bs4 import BeautifulSoup
import urllib2

html = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
soup = BeautifulSoup(html, 'lxml')
scriptResults = soup('script', {'type': 'text/javascript'})
js_block = scriptResults[12]  # the script block holding the deal data; its position may change
Starting from this, you can parse it with a regex if you want, or try to interpret the JS (there are some threads on Stack Overflow about that).
Anyway, like the others said, you should use the Groupon API...
P.S.
The block that you are parsing can easily be read as a dictionary; it is already a list of dictionaries if you look closely...
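For the regex route, a minimal sketch that pulls every dealPermaLink value out of the selected script block (Python 2, matching the code above):
import re

# js_block is the <script> tag selected above; .text is its raw JS source.
links = re.findall(r'"dealPermaLink":"([^"]+)"', js_block.text)
print links  # e.g. ['/deals/muenchen-special/Casa-Lavecchia/24788330', ...]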
How about changing RESATAURANT1 to RESTAURANT1, for starters?
Hey all, I am using BeautifulSoup (after unsuccessfully struggling for two days with Scrapy) to scrape StarCraft 2 league data; however, I am encountering a problem.
I have this table of results, from which I want the string content of all the a tags, which I do like this:
from BeautifulSoup import *
from urllib import urlopen

def parseWithSoup(url):
    print "Reading:", url
    html = urlopen(url).read().lower()
    bs = BeautifulSoup(html)
    table = bs.find(lambda tag: tag.name == 'table' and tag.has_key('id') and tag['id'] == "tblt_table")
    rows = table.findAll(lambda tag: tag.name == 'tr')
    rows.pop(0)  # first row is header
    for row in rows:
        tags = row.findAll(lambda tag: tag.name == 'a')
        content = []
        for tagcontent in tags:
            content.append(tagcontent.string)
        print content

if __name__ == '__main__':
    content = "http://www.teamliquid.net/tlpd/sc2-international/games#tblt-5018-1-1-DESC"
    metSoup = parseWithSoup(content)
however the output is as follows:
[u'+', u'gadget show live i..', u'crevasse', u'naniwa', u'socke']
[u'+', u'gadget show live i..', u'metalopolis 1.1', u'naniwa', u'socke']
[u'+', u'gadget show live i..', u'shakuras plateau 2.0', u'socke', u'select']
etc...
My question is: where does the u'' come from (is it Unicode?) and how can I remove it? I just need the strings that are inside the u''...
The u means Unicode string. It doesn't change anything for you as a programmer, and you should just disregard it. Treat them like normal strings. You actually want that u there.
Be aware that all Beautiful Soup output is Unicode. That's a good thing, because if you run across any Unicode characters in your scraping, you won't have any problems. If you really want to get rid of the u (I don't recommend it), you can use the unicode string's encode() method.
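For example, a minimal sketch of that encode() route, using the loop variables from the question (Python 2, and assuming every a tag has a string):
# Encode each unicode string to a UTF-8 byte string, which drops the u'' prefix.
content = [tagcontent.string.encode('utf-8') for tagcontent in tags]
print content  # ['+', 'gadget show live i..', 'crevasse', 'naniwa', 'socke']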
What you see are Python unicode strings. Check the Unicode HOWTO in the Python documentation,
http://docs.python.org/howto/unicode.html
to learn how to deal correctly with unicode strings.