Why is the output in wrong Unicode? - python

I'm trying to append each link's href in a to results, and it should print the plain http:// links. I want to be able to print out results like so: results[:4]
I'm thankful for any help! Thanks!
This is the code:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup

results = []

def extract(soup):
    section = soup.find('section', {'class': 'content left'})
    for post in section.findAll('article'):
        header = post.find('header', {'class': 'loop-data'})
        a = header.findAll('a', href=True)
        for x in a:
            results.append(x.get('href'))
    print results

br = Browser()
url = "http://www.hotglobalnews.com/category/politics/"
page1 = br.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1)
This is my outcome:
[u'http://www.hotglobalnews.com/canada-just-legalized-heroin-to-control- drug-addiction/', u'http://www.hotglobalnews.com/justin-trudeau-announces-deal-with-uber-uberweed/', u'http://www.hotglobalnews.com/donald-trump-to-legalize-marijuana-in-all-50-states/', u'http://www.hotglobalnews.com/obama-to-create-law-banning-words/', u'http://www.hotglobalnews.com/trudeau-says-trump-is-a-racist-bastard/', u'http://www.hotglobalnews.com/donald-trump-to-build-replica-of-guantanamo-bay-for-mexicans/', u'http://www.hotglobalnews.com/donald-trump-to-legalize-incest-marriages-if-elected/', u'http://www.hotglobalnews.com/justin-trudeau-to-build-statue-of-trudeau-in-2017/', u'http://www.hotglobalnews.com/donald-trump-muslims-invented-global-warming-to-destroy-u-s-economy/', u'http://www.hotglobalnews.com/isis-member-found-disguised-as-syrian-refugee-in-canada/', u'http://www.hotglobalnews.com/donald-trump-says-he-is-more-influential-than-martin-luther-king-jr/', u'http://www.hotglobalnews.com/obama-wears-fuck-trump-tshirt-to-white-house-barbecue/', u'http://www.hotglobalnews.com/donald-trump-says-he-could-shoot-somebody/', u'http://www.hotglobalnews.com/donald-trump-says-black-history-month-is-too-long/', u'http://www.hotglobalnews.com/justin-trudeau-to-ban-uber-in-canada/', u'http://www.hotglobalnews.com/justin-trudeau-accepts-comedy-central-new-years-roast/', u'http://www.hotglobalnews.com/donald-trumps-muslim-comment-disqualifies-him-from-presidency/', u'http://www.hotglobalnews.com/paris-terrorist-spotted-live-on-news-after-terror-attacks-on-paris/', u'http://www.hotglobalnews.com/anonymus-hacker-collective-declares-war-on-islamic-sate-group/', u'http://www.hotglobalnews.com/paris-attacks-over-100-killed-in-gunfire-and-blasts2/']

Nothing is wrong with the list you get. The u sigil tells you that the stuff in the string is Unicode, but that's not "wrong" in any way. Printing the string will produce the desired result (provided your OS is correctly configured to display the characters; for what looks like essentially plain ASCII strings, this should not be an issue).
Python 3 changes these things somewhat, but generally for the better. You still need to understand the difference between byte strings and Unicode strings (at least if you need to work with byte strings too), but by default all strings are Unicode, which makes good sense in this day and age.
https://nedbatchelder.com/text/unipain.html is still a good place to start, especially if you have not yet made the transition to Python 3.
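The point can be sketched in a couple of lines (shown Python 3 style, where the u prefix is accepted but redundant; the link is a made-up stand-in, not one from the scraped list):

```python
# A u'' string and a plain '' string holding the same ASCII text compare
# equal and print identically -- the u prefix is only how the repr marks
# a Unicode string, not part of the string's content.
url = u'http://example.com/some-post/'
print(url)
print(url == 'http://example.com/some-post/')
```

Printing the list elements one by one (rather than printing the list, which shows each element's repr) is all that's needed to see the plain links.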

Related

Adding chevrons and underscores with BeautifulSoup

I want BeautifulSoup to add strings like this to my HTML pages:
{{< Transfer/component_short_name >}}
(If you are interested why, this is a Hugo shortcode, a kind of variable for markdown)
When I build it programmatically in Python and add it using tag.insert_after(), what ends up in the document looks like this:
{{< Transfer/component\_short\_name >}}
which of course does not work the same.
I managed a workaround for the chevrons > < using string replaces, but the underscores '_' would require going into regex, leaving complicated code for a simple operation, so I'm wondering whether there's an option in BeautifulSoup.
I tried various approaches, such as var_name = var_name.replace("\\_", "_") , but that does not work.
I don't see a way to avoid the < and > conversion using BeautifulSoup, but as you say they could be converted afterwards. In the following example there is no underscore escaping:
from bs4 import BeautifulSoup
import re

shortcode = "{{< Transfer/component_short_name >}}"
html = "<html><body><h1>hello world</h1></body>"
soup = BeautifulSoup(html, "html.parser")
soup.h1.insert_after(shortcode)
# undo the entity escaping (and any \_ escaping) in a single pass
fixed = re.sub(r'\{\{&lt;|&gt;\}\}|\\_',
               lambda x: {'{{&lt;': '{{<', '&gt;}}': '>}}', '\\_': '_'}[x.group(0)],
               str(soup))
print(fixed)
Giving the HTML as:
<html><body><h1>hello world</h1>{{< Transfer/component_short_name >}}</body></html>
Here, the \_ replacement does not appear to be needed but I have included it for completeness.
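As an alternative sketch, Beautiful Soup's decode() accepts a formatter argument; passing formatter=None skips entity substitution entirely, so the chevrons survive without any post-processing (the trade-off being that nothing else gets escaped either, so only use it when you trust the document's contents):

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>hello world</h1></body>"
soup = BeautifulSoup(html, "html.parser")
soup.h1.insert_after("{{< Transfer/component_short_name >}}")

# formatter=None disables entity substitution, so < and > are left as-is
print(soup.decode(formatter=None))
```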

How do I remove double quotes from within retrieved JSON data

I'm currently using BeautifulSoup to web-scrape listings from a jobs website, and outputting the data into JSON via the site's HTML code.
I fix bugs with regex as they come along, but this particular issue has me stuck. When web-scraping the job listing, instead of extracting info from each container of interest, I've chosen to extract the JSON data within the HTML source code (<script type="application/ld+json">). From there I convert the BeautifulSoup results into strings, clean out the HTML leftovers, then convert the string into JSON. However, I've hit a snag due to text within the job listing using quotes. Since the actual data is large, I'll just use a substitute.
example_string = '{"Category_A" : "Words typed describing stuff",
"Category_B" : "Other words speaking more irrelevant stuff",
"Category_X" : "Here is where the "PROBLEM" lies"}'
Now the above won't run in Python, but the string I have that has been extracted from the job listing's HTML is pretty much in the above format. When it's passed into json.loads(), it returns the error: json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 5035
I'm not at all sure how to address this issue.
EDIT Here's the actual code leading to the error:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import json, re

uClient = urlopen("http://www.ethiojobs.net/display-job/227974/Program-Manager---Mental-Health%2C-Child-Care-Gender-%26-Protection.html")
page_html = uClient.read()
uClient.close()

listing_soup = BeautifulSoup(page_html, "lxml")
json_script = listing_soup.find("script", {"type": "application/ld+json"}).strings
extracted_json_str = ''.join(json_script)

## Clean up the string with regex
extracted_json_str_CLEAN1 = re.sub(pattern=r"\r+|\n+|\t+|\\l+| | |amp;|\u2013|</?.{,6}>",  # last is to get rid of </p> and </strong>
                                   repl='',
                                   string=extracted_json_str)
extracted_json_str_CLEAN2 = re.sub(pattern=r"\\u2019",
                                   repl=r"'",
                                   string=extracted_json_str_CLEAN1)
extracted_json_str_CLEAN3 = re.sub(pattern=r'\u25cf',
                                   repl=r" -",
                                   string=extracted_json_str_CLEAN2)
extracted_json_str_CLEAN4 = re.sub(pattern=r'\\',
                                   repl="",
                                   string=extracted_json_str_CLEAN3)

## Convert to JSON (HERE'S WHERE THE ERROR ARISES)
json_listing = json.loads(extracted_json_str_CLEAN4)
I do know what's leading to the error: within the last bullet point of Objective 4 in the job description, the author used quotes when referring to a required task of the job (i.e. "quality control"). The way I've been extracting information from these job listings, a simple instance of someone using quotes causes my whole approach to blow up. Surely there's a better way to build this script without liabilities like this (and without having to use regex to fix each breakdown as it arises).
Thanks!
You need to apply the escape character (\) if you want a double quote (") inside a value. So your string input to json.loads() should look like the below:
example_string = '{"Category_A": "Words typed describing stuff", "Category_B": "Other words speaking more irrelevant stuff", "Category_X": "Here is where the \\"PROBLEM\\" lies"}'
json.loads can parse this.
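A minimal runnable sketch of the same idea, using a shortened stand-in for the listing data:

```python
import json

# The inner quotes are escaped (\"), so this is valid JSON; note that in the
# Python source the backslash itself must be doubled inside a '...' literal.
example_string = '{"Category_X": "Here is where the \\"PROBLEM\\" lies"}'
data = json.loads(example_string)
print(data["Category_X"])  # Here is where the "PROBLEM" lies
```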
When you extract the text, I think you should add a check for this. For example:

if "\"" in extraction:
    extraction = extraction.replace("\"", "\'")
    print(extraction)

In this case you convert the " characters in extraction into '. You will need to convert one of them, because Python lets you use both quote styles: if you want " inside a string, you either delimit the string with the other symbol or escape it:

"this is a 'test'"
'this was a "test"'
"this is not a \"test\""

# in case the condition is met
if "\"" in item:
    # either convert to single quotes
    item = item.replace("\"", "\'")
    # or escape them
    item = item.replace("\"", "\\\"")

Stripping HTML Tags from Forum using Python/bs4

I am a (very) new Python user, and decided some of my first work would be to grab some lyrics from a forum and sort them according to word frequency. I obviously haven't gotten to the frequency part yet, but the following code does not work for obtaining the string values I want, resulting in "AttributeError: 'ResultSet' object has no attribute 'getText'":
from bs4 import BeautifulSoup
import urllib.request
url = 'http://www.thefewgoodmen.com/thefgmforum/threads/gdr-marching-songs-section-b.14998'
wp = urllib.request.urlopen(url)
soup = BeautifulSoup(wp.read())
message = soup.findAll("div", {"class": "messageContent"})
words = message.getText()
print(words)
If I alter the code to have getText() operate on the soup object:
words = soup.getText()
I, of course, get all of the string values throughout the webpage, rather than those limited to only the class messageContent.
My question, therefore, is two-fold:
1) Is there a simple way to limit the tag-stripping to only the intended sections?
2) What simple thing do I not understand in that I cannot have getText() operate on the message object?
Thanks.
The message in this case is a BeautifulSoup ResultSet, which is a list of BeautifulSoup Tags. What you need to do is call getText on each element of message, like so:
words = [item.getText() for item in message]
Similarly, if you are just interested in a single Tag (let's say the first one for the sake of argument), you could get its content with,
words = message[0].getText()
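Putting that together with the word-frequency goal from the question, here's a small self-contained sketch (the HTML is a made-up stand-in for the forum page, so it runs without hitting the network):

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical stand-in for the forum page's markup
html = ('<div class="messageContent">the song and the anthem</div>'
        '<div class="messageContent">the march</div>')
soup = BeautifulSoup(html, "html.parser")
message = soup.findAll("div", {"class": "messageContent"})

# Call getText on each Tag in the ResultSet, then count words
words = [item.getText() for item in message]
freq = Counter(w for text in words for w in text.lower().split())
print(freq.most_common(1))  # [('the', 3)]
```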

Python 3 - Getting some strings from an HTTP request response

I'm having a hard time extracting data from a httprequest response.
Can somebody help me? Here's a part of my code:
import requests
r = requests.get('https://www.example.com', verify=True)
keyword = r.text.find('loginfield')
print (keyword)
>>> 42136
The value 42136 basically means that the string 'loginfield' exists in response.text. But how do I extract specific strings from it?
Like for example I want to extract these exact strings:
<title>Some title here</title>
or this one:
<div id='bla...' #continues extracting of strings until it stops where I want it to stop extracting.
Anybody got an idea on how should I approach this problem?
You can use BeautifulSoup to parse HTML and get tags. Here's an example piece of code:
import requests
from bs4 import BeautifulSoup as BS
r = requests.get('https://www.example.com', verify=True)
soup = BS(r.text, "html.parser")
print(soup.find('title').text)
Should print:
Some title here
But that depends on it being the first title tag on the page.
Please note that for HTML-page data extraction, you should take a look at a specialized library like Beautiful soup. Your program will be less fragile and more maintainable that way.
str.find will return -1 if the string does not exist; there was no string "loginfield" in the page you originally retrieved.
Once you have the correct index for your string, the returned value is the position of that string's first character.
since you edited your question:
>>> r.text.find('loginfield')
42136
That means, the string "loginfield" starts at offset 42136 in the text. You could display say 200 chars starting at that position that way:
>>> print(r.text[42136:42136+200])
To find the various values you're looking for, you have to figure out where they are relative to that position.
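The offset arithmetic can be shown on a tiny stand-in page rather than the real response:

```python
# Hypothetical page text; str.find() gives the offset of the first match,
# so we can slice out the content between the opening and closing tags.
text = "<html><head><title>Some title here</title></head></html>"
start = text.find("<title>") + len("<title>")
end = text.find("</title>", start)
print(text[start:end])  # Some title here
```

This is exactly the kind of fiddly bookkeeping the BeautifulSoup answer above avoids, which is why a parser is the more maintainable choice.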

using output from beautifulsoup in python

Hey all, I am using BeautifulSoup (after unsuccessfully struggling for two days with Scrapy) to scrape StarCraft 2 league data; however, I am encountering a problem.
I have this table, and I want the string content of all the <a> tags in it, which I do like this:
from BeautifulSoup import *
from urllib import urlopen

def parseWithSoup(url):
    print "Reading:", url
    html = urlopen(url).read().lower()
    bs = BeautifulSoup(html)
    table = bs.find(lambda tag: tag.name == 'table' and tag.has_key('id') and tag['id'] == "tblt_table")
    rows = table.findAll(lambda tag: tag.name == 'tr')
    rows.pop(0)  # first row is header
    for row in rows:
        tags = row.findAll(lambda tag: tag.name == 'a')
        content = []
        for tagcontent in tags:
            content.append(tagcontent.string)
        print content

if __name__ == '__main__':
    content = "http://www.teamliquid.net/tlpd/sc2-international/games#tblt-5018-1-1-DESC"
    metSoup = parseWithSoup(content)
however the output is as follows:
[u'+', u'gadget show live i..', u'crevasse', u'naniwa', u'socke']
[u'+', u'gadget show live i..', u'metalopolis 1.1', u'naniwa', u'socke']
[u'+', u'gadget show live i..', u'shakuras plateau 2.0', u'socke', u'select']
etc...
My question is: where does the u'' come from (is it Unicode?) and how can I remove it? I just need the strings that are inside the u''...
The u means Unicode string. It doesn't change anything for you as a programmer and you should just disregard it. Treat them like normal strings. You actually want that u there.
Be aware that all Beautiful Soup output is Unicode. That's a good thing, because if you run across any Unicode characters in your scraping, you won't have any problems. If you really want to get rid of the u (I don't recommend it), you can use the Unicode string's encode() method.
What you see are Python Unicode strings.
Check the Python documentation (http://docs.python.org/howto/unicode.html) in order to deal correctly with Unicode strings.
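If you really do need bytes (for example, to write to a file opened in binary mode), the conversion goes through encode(), not decode(). A quick sketch, using one of the strings from the output above:

```python
# encode() turns a (unicode) string into bytes; decode() turns bytes back.
s = u'metalopolis 1.1'
b = s.encode('utf-8')
print(b.decode('utf-8') == s)  # True
```

Round-tripping like this is lossless for UTF-8, which is why leaving the strings as Unicode until the moment you actually need bytes is the usual advice.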
