Hey all, I am using BeautifulSoup (after unsuccessfully struggling with Scrapy for two days) to scrape StarCraft 2 league data, however I am encountering a problem.
I have this table of results, and I want the string content of all the a tags in it, which I do like this:
from BeautifulSoup import *
from urllib import urlopen

def parseWithSoup(url):
    print "Reading:", url
    html = urlopen(url).read().lower()
    bs = BeautifulSoup(html)
    table = bs.find(lambda tag: tag.name == 'table' and tag.has_key('id') and tag['id'] == "tblt_table")
    rows = table.findAll(lambda tag: tag.name == 'tr')
    rows.pop(0)  # first row is the header
    for row in rows:
        tags = row.findAll(lambda tag: tag.name == 'a')
        content = []
        for tagcontent in tags:
            content.append(tagcontent.string)
        print content

if __name__ == '__main__':
    content = "http://www.teamliquid.net/tlpd/sc2-international/games#tblt-5018-1-1-DESC"
    metSoup = parseWithSoup(content)
However, the output is as follows:
[u'+', u'gadget show live i..', u'crevasse', u'naniwa', u'socke']
[u'+', u'gadget show live i..', u'metalopolis 1.1', u'naniwa', u'socke']
[u'+', u'gadget show live i..', u'shakuras plateau 2.0', u'socke', u'select']
etc...
My question is: where does the u'' come from (is it from unicode?) and how can I remove this? I just need the strings that are in u''...
The u means Unicode string. It doesn't change anything for you as a programmer and you should just disregard it. Treat them like normal strings. You actually want that u there.
Be aware that all Beautiful Soup output is Unicode. That's a good thing, because if you run across any Unicode characters in your scraping, you won't have any problems. If you really want to get rid of the u (I don't recommend it), you can use the unicode string's encode() method.
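If you do go that route, a minimal Python 2 sketch (using one of the values from the output above):
>>> name = u'naniwa'
>>> name
u'naniwa'
>>> name.encode('utf-8')
'naniwa'
But the cleaner habit is to keep the unicode objects around and only encode at the output boundary, if at all.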
What you see are Python unicode strings.
Check the Python documentation (http://docs.python.org/howto/unicode.html) in order to deal correctly with unicode strings.
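A quick illustration (Python 2, using the first row of the output above) that the u'' only shows up in the repr, not when the strings are actually printed:
content = [u'+', u'gadget show live i..', u'crevasse', u'naniwa', u'socke']
print content      # the list repr, with the u'' markers
for item in content:
    print item     # the plain text, no u'' anywhere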
Related
I have a bs4.element.Tag, the product of a web scrape. I usually do json.loads(soup.find('script', type='application/ld+json').text), but on this page the data only appears inside a plain <script> </script> tag, so I had to do scripts = soup.find_all('script') and go through them until I got to the one that interests me: script = scripts[18].
The variable in question is script. My problem is that I want to access its attributes, for example script['goodsInfo'], but since it is of type bs4.element.Tag, doing script.attrs just returns {}. I then tried to convert it to JSON with json.loads(str(script)) and it throws the exception: 'JSONDecodeError: Expecting value: line 1 column 1 (char 0)'
This is my code:
import json
from bs4 import BeautifulSoup
import requests
url_aux = 'https://www.shein.com/Mock-neck-Brush-Stroke-Print-Bodycon-Dress-p-941649-cat-1727.html?scici=navbar_2~~tab01navbar04~~4~~real_1727~~~~0~~0'
response = requests.get(url_aux)
soup = BeautifulSoup(response.content, "html.parser")
scripts = soup.find_all('script')
script = scripts[18]
print(json.loads(str(script)))
#output: JSONDecodeError: Expecting value: line 1 column 1 (char 0)
print(type(script))
#output: bs4.element.Tag
print(str(json.loads(str(script))))
You can use the json module to extract the data, but first it's necessary to locate the right info - you can use the re module for that.
For example:
import re
import json
import requests
url = 'https://eur.shein.com/Mock-neck-Brush-Stroke-Print-Bodycon-Dress-p-941649-cat-1727.html?scici=navbar_2~~tab01navbar04~~4~~real_1727~~~~0~~0&ref=www&rep=dir&ret=eur'
txt = re.findall(r'goodsInfo\s*:\s*({.*})', requests.get(url).text)[0]
data = json.loads(txt)
# print(json.dumps(data, indent=4)) # <-- uncomment to see all data
print(data['detail']['goods_name'])
print(data['detail']['brand'])
print('Num of comments:', data['detail']['comment']['comment_num'])
Prints:
Mock-neck Brush Stroke Print Bodycon Dress
SHEIN
Num of comments: 17
BS4 does not parse javascript; from BS4's Tag object's POV the text in a <script> tag is, well, just text. I don't have any idea what this script looks like (since you didn't post it and I'm not going to bother trying to find it), but if your expectation was that script['goodsInfo'] would return the value of a JS variable named 'goodsInfo' then, bad news, it's not going to work that way.
Also, Javascript is not JSON, so the chances that a JS snippet will be valid JSON are rather small, to say the least. The proper syntax to test it would be quite simply the same as the one you used for your first use case, i.e. json.loads(script.text), but I assume that's the first thing you tried ;-)
So, well, I'm afraid you'll have to manually parse this script to extract the relevant part. Depending on what the js code looks like, it may be a matter of a few lines of basic string parsing / regexp stuff, or it may require a proper Javascript parser etc.
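For illustration, a rough sketch of that manual step, working from the Tag you already located (it assumes script is the bs4.element.Tag from your code and that goodsInfo is a JSON-compatible object literal on a single line, the same assumption the regex answer above makes):
import json
import re

raw = script.string or ''                       # the raw javascript text inside the <script> tag
m = re.search(r'goodsInfo\s*:\s*({.*})', raw)   # slice out the object literal, same pattern as above
if m:
    goods_info = json.loads(m.group(1))
    print(goods_info['detail']['goods_name'])
else:
    print('goodsInfo not found in this script tag')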
I am building a scraper where I want to extract the data from some tags as-is, without any conversion. But BeautifulSoup is changing some hex values to ASCII. For example, this markup gets converted into ASCII:
html = """\
<title>Billing address - PayPal</title>
<title>Billing address - PayPal</title>"""
Here's a small example of the code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
for element in soup.findAll(['title', 'form', 'a']):
    print(str(element))
But I want to extract the data in its original form. I believe BeautifulSoup 4 is auto-converting HTML entities, and this is what I don't want. Any help would be really appreciated.
BTW I am using Python 3.5 and BeautifulSoup 4.
You might try using the re module (regular expressions). For instance, the code below will extract the title tag info without converting it (I assume that you declared the html variable before):
import re
result = re.search(r'<title>.*</title>', html).group(0)
print(result)  # It'll print <title>Billing address - PayPal</title>
You may do the same for the other tags as well
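If it helps, here is a rough sketch of the same idea extended to the other tags from your findAll call (this assumes, as above, that html holds the raw markup and that none of these tags is nested inside another tag of the same name):
import re

for tag_name in ('title', 'form', 'a'):
    pattern = r'<{0}\b[^>]*>.*?</{0}>'.format(tag_name)   # non-greedy match, raw markup left untouched
    for match in re.findall(pattern, html, re.DOTALL):
        print(match)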
I'm trying to append the links from a to results, and it should print the plain http:// links. I want to be able to print out the results like so: results[:4]
I'm thankful for any help!
This is the code:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup

results = []

def extract(soup):
    section = soup.find('section', {'class' : 'content left'})
    for post in section.findAll('article'):
        header = post.find('header', {'class' : 'loop-data'})
        a = header.findAll('a', href=True)
        for x in a:
            results.append(x.get('href'))
    print results

br = Browser()
url = "http://www.hotglobalnews.com/category/politics/"
page1 = br.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1)
This is my outcome:
[u'http://www.hotglobalnews.com/canada-just-legalized-heroin-to-control-drug-addiction/', u'http://www.hotglobalnews.com/justin-trudeau-announces-deal-with-uber-uberweed/', u'http://www.hotglobalnews.com/donald-trump-to-legalize-marijuana-in-all-50-states/', u'http://www.hotglobalnews.com/obama-to-create-law-banning-words/', u'http://www.hotglobalnews.com/trudeau-says-trump-is-a-racist-bastard/', u'http://www.hotglobalnews.com/donald-trump-to-build-replica-of-guantanamo-bay-for-mexicans/', u'http://www.hotglobalnews.com/donald-trump-to-legalize-incest-marriages-if-elected/', u'http://www.hotglobalnews.com/justin-trudeau-to-build-statue-of-trudeau-in-2017/', u'http://www.hotglobalnews.com/donald-trump-muslims-invented-global-warming-to-destroy-u-s-economy/', u'http://www.hotglobalnews.com/isis-member-found-disguised-as-syrian-refugee-in-canada/', u'http://www.hotglobalnews.com/donald-trump-says-he-is-more-influential-than-martin-luther-king-jr/', u'http://www.hotglobalnews.com/obama-wears-fuck-trump-tshirt-to-white-house-barbecue/', u'http://www.hotglobalnews.com/donald-trump-says-he-could-shoot-somebody/', u'http://www.hotglobalnews.com/donald-trump-says-black-history-month-is-too-long/', u'http://www.hotglobalnews.com/justin-trudeau-to-ban-uber-in-canada/', u'http://www.hotglobalnews.com/justin-trudeau-accepts-comedy-central-new-years-roast/', u'http://www.hotglobalnews.com/donald-trumps-muslim-comment-disqualifies-him-from-presidency/', u'http://www.hotglobalnews.com/paris-terrorist-spotted-live-on-news-after-terror-attacks-on-paris/', u'http://www.hotglobalnews.com/anonymus-hacker-collective-declares-war-on-islamic-sate-group/', u'http://www.hotglobalnews.com/paris-attacks-over-100-killed-in-gunfire-and-blasts2/']
Nothing is wrong with the list you get. The u sigil tells you that the stuff in the string is Unicode, but that's not "wrong" in any way. Printing the string will produce the desired result (provided your OS is correctly configured to display the characters; for what looks like essentially plain ASCII strings, this should not be an issue).
Python 3 changes these things somewhat, but generally for the better. You still need to understand the difference between byte strings and Unicode strings (at least if you need to work with byte strings too), but by default all strings are Unicode, which makes good sense in this day and age.
https://nedbatchelder.com/text/unipain.html is still a good place to start, especially if you have not yet made the transition to Python 3.
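A tiny Python 3 illustration of that split, using the URL from your code:
url = 'http://www.hotglobalnews.com/category/politics/'   # a plain str, which is already Unicode in Python 3
data = url.encode('utf-8')                                # the explicit step down to bytes
print(url)                            # no u'' prefix anywhere
print(data)                           # b'http://www.hotglobalnews.com/category/politics/'
print(data.decode('utf-8') == url)    # True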
I'm having a hard time extracting data from an HTTP request response.
Can somebody help me? Here's a part of my code:
import requests
r = requests.get('https://www.example.com', verify=True)
keyword = r.text.find('loginfield')
print (keyword)
>>> 42136
A return value of 42136 basically means that the string 'loginfield' exists in response.text. But how do I extract specific strings from it?
Like for example I want to extract these exact strings:
<title>Some title here</title>
or this one:
<div id='bla...' # continuing to extract strings up to the point where I want it to stop.
Anybody got an idea on how should I approach this problem?
You can use BeautifulSoup to parse HTML and get tags. Here's an example piece of code:
import requests
from bs4 import BeautifulSoup as BS
r = requests.get('https://www.example.com', verify=True)
soup = BS(r.text, 'html.parser')  # naming a parser explicitly avoids BS4's "no parser specified" warning
print(soup.find('title').text)
Should print:
Some title here
But this depends on whether it's the first title or not.
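If there could be more than one, a small sketch that walks all of them (same soup object as above):
for title_tag in soup.find_all('title'):
    print(title_tag.text)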
Please note that for HTML-page data extraction, you should take a look at a specialized library like Beautiful Soup. Your program will be less fragile and more maintainable that way.
string.find will return -1 if the string does not exist.
There is no string "loginfield" in the page you retrieved.
Once you have the correct index for your string, the returned value is the position of the first char of that string.
since you edited your question:
>>> r.text.find('loginfield')
42136
That means the string "loginfield" starts at offset 42136 in the text. You could display, say, 200 chars starting at that position this way:
>>> print(r.text[42136:42136+200])
To find the various values you are looking for, you have to figure out where they are relative to that position.
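For example, a rough sketch that slices out the <title> the same way (assuming the page actually contains one):
start = r.text.find('<title>') + len('<title>')
end = r.text.find('</title>', start)
print(r.text[start:end])     # e.g. "Some title here"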
Thanks in advance. I'm currently using Beautiful Soup to parse comment tags out of a set block of HTML. The issue I'm having is that the HTML being scraped has no quotation marks encapsulating the attribute values of the tags. However, BeautifulSoup seems to add these in, which in some cases may be desirable but unfortunately not for mine.
Which would be the best route: leave the actual HTML intact without BeautifulSoup adding the quotes in, or can these be added back in?
You have a tag where some attribute values are quoted and some unquoted. What do you mean by 'add quoting back'? Either edit each attribute value to kludge the quotes in (probably a terrible idea), or else add the quoting when it renders. It depends on what other processing you're doing to the tag. Here's code to add quotes when it prints:
input = "<html><sometag attr1=dont_quote_me attr2='but this one is quoted'>Text</sometag></html>"
bs = BeautifulSoup(input)
bs2 = bs.find('sometag')
for a in bs2.attrs:
(attr,aval) = a
print "%s='%s'" % (attr,aval),
gives attr1='dont_quote_me' attr2='but this one is quoted'
It's up to you which way. I assume they're all single words, i.e. they match the regex \w+.
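As a rough sketch of that single-word assumption, here is a hypothetical helper built on the (attr, value) pairs that the BeautifulSoup 3 attrs list gives you above:
import re

def render_attrs(attrs):
    parts = []
    for attr, aval in attrs:
        if re.match(r'^\w+$', aval):
            parts.append("%s=%s" % (attr, aval))      # a single word: leave it unquoted
        else:
            parts.append("%s='%s'" % (attr, aval))    # anything else: quote it
    return ' '.join(parts)

print render_attrs(bs2.attrs)   # attr1=dont_quote_me attr2='but this one is quoted'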