BeautifulSoup in Python not parsing right

I am running Python 2.7.5 and using the built-in html parser for what I am about to describe.
The task I am trying to accomplish is to take a chunk of html that is essentially a recipe. Here is an example.
html_chunk = "<h1>Miniature Potato Knishes</h1><p>Posted by bettyboop50 at recipegoldmine.com May 10, 2001</p><p>Makes about 42 miniature knishes</p><p>These are just yummy for your tummy!</p><p>3 cups mashed potatoes (about<br> 2 very large potatoes)<br>2 eggs, slightly beaten<br>1 large onion, diced<br>2 tablespoons margarine<br>1 teaspoon salt (or to taste)<br>1/8 teaspoon black pepper<br>3/8 cup Matzoh meal<br>1 egg yolk, beaten with 1 tablespoon water</p><p>Preheat oven to 400 degrees F.</p><p>Sauté diced onion in a small amount of butter or margarine until golden brown.</p><p>In medium bowl, combine mashed potatoes, sautéed onion, eggs, margarine, salt, pepper, and Matzoh meal.</p><p>Form mixture into small balls about the size of a walnut. Brush with egg yolk mixture and place on a well-greased baking sheet and bake for 20 minutes or until well browned.</p>"
The goal is to separate out the header, junk, ingredients, instructions, serving, and number of ingredients.
Here is my code that accomplishes that:
from bs4 import BeautifulSoup

def list_to_string(list):
    joined = ""
    for item in list:
        joined += str(item)
    return joined

def get_ingredients(soup):
    for p in soup.find_all('p'):
        if p.find('br'):
            return p

def get_instructions(p_list, ingredient_index):
    instructions = []
    instructions += p_list[ingredient_index+1:]
    return instructions

def get_junk(p_list, ingredient_index):
    junk = []
    junk += p_list[:ingredient_index]
    return junk

def get_serving(p_list):
    for item in p_list:
        item_str = str(item).lower()
        if ("yield" or "make" or "serve" or "serving") in item_str:
            yield_index = p_list.index(item)
            del p_list[yield_index]
            return item

def ingredients_count(ingredients):
    ingredients_list = ingredients.find_all(text=True)
    return len(ingredients_list)

def get_header(soup):
    return soup.find('h1')

def html_chunk_splitter(soup):
    ingredients = get_ingredients(soup)
    if ingredients == None:
        error = 1
        header = ""
        junk_string = ""
        instructions_string = ""
        serving = ""
        count = ""
    else:
        p_list = soup.find_all('p')
        serving = get_serving(p_list)
        ingredient_index = p_list.index(ingredients)
        junk_list = get_junk(p_list, ingredient_index)
        instructions_list = get_instructions(p_list, ingredient_index)
        junk_string = list_to_string(junk_list)
        instructions_string = list_to_string(instructions_list)
        header = get_header(soup)
        error = ""
        count = ingredients_count(ingredients)
    return (header, junk_string, ingredients, instructions_string,
            serving, count, error)
It works well except on chunks that contain strings like "Sauté", because soup = BeautifulSoup(html_chunk) turns "Sauté" into "SautÃ©". This is a problem because I have a huge CSV file of recipes like html_chunk, and I'm trying to structure all of them nicely and then get the output back into a database. I tried checking whether "SautÃ©" comes out right using an HTML previewer, and it still comes out as "SautÃ©". I don't know what to do about this.
What's stranger is that when I do what BeautifulSoup's documentation shows
BeautifulSoup("Sacré bleu!")
# <html><head></head><body>Sacré bleu!</body></html>
I get
# SacrÃ© bleu!
But my colleague tried that on his Mac, running from terminal, and he got exactly what the documentation shows.
I really appreciate all your help. Thank you.

This is not a parsing problem; it is an encoding problem.
Whenever you work with text that might contain non-ASCII characters (or write Python programs that contain such characters, e.g. in comments or docstrings), you should put a coding cookie in the first line, or in the second line if the first is a shebang:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
... and make sure this matches your file encoding (with vim: :set fenc=utf-8).

BeautifulSoup tries to guess the encoding and sometimes gets it wrong, but you can specify the encoding explicitly with the from_encoding parameter, for example:
soup = BeautifulSoup(html_text, from_encoding="UTF-8")
The encoding is usually declared in the page's HTTP headers or in a <meta> tag in its <head>.
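As a rough sketch of how either fix might look in the recipe pipeline above (the raw_bytes variable and the assumption that the CSV fields are UTF-8 encoded are illustrative, not from the original post):

from bs4 import BeautifulSoup

# Hypothetical: raw_bytes is one recipe chunk read from the CSV in binary mode,
# assumed here to be UTF-8 encoded.
raw_bytes = b"<p>Saut\xc3\xa9 diced onion in butter.</p>"

# Option 1: tell BeautifulSoup which encoding the bytes are in.
soup = BeautifulSoup(raw_bytes, from_encoding="utf-8")

# Option 2: decode to Unicode yourself before parsing.
soup = BeautifulSoup(raw_bytes.decode("utf-8"))

print(soup.p.get_text())  # Sauté diced onion in butter.

Either way the soup holds proper Unicode; the mojibake appears when the bytes get decoded with the wrong codec somewhere along the way.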

Related

Does anyone know how to add row numbers?

I can open this file directly from the net, and I want to add row numbers to each line based on a rule: if the header row should be numbered, start from number 1; if not, start from the next line. This is my code; I have tried a lot, but it doesn't work. The output looks like the attached picture. Does anyone know how to solve this problem? Thanks in advance!
import sys

class Main:
    def task1(self):
        print('*' * 30, 'Task')
        import urllib.request
        # url
        url = 'http://www.born.nhely.hu/group_list.txt'
        # Initiate a request to get a response
        while True:
            try:
                response = urllib.request.urlopen(url)
            except Exception as e:
                print('An error has occurred, the request is being made again, the error message is as follows:', e)
            else:
                break
        # Print all student information
        content = response.read().decode('utf-8')
        # add row number
        header_row = input("Do you want to know header_row numbers? Y OR N?")
        if header_row == 'Y':
            for i, line in enumerate(content, start=1):
                print(f'{i},{line}')
        else:
            for i, line in enumerate(content, start=0):
                print('{},{}'.format(i, line.strip()))

    def start(self):
        self.task1()

Main().start()
Have a look at the data you are downloading:
Name;Short name;Email;Country;Other spoken languages
ABOUELHASSAN Shehab Ibrahim Adbelazin;?;dwedar909#gmail.com;?;?
AGHAEI HOSSEIN ABADI Mohammad Mehdi;Matt;mahdiaghaei355#gmail.com;Iran;English
...
Now look at the results you are getting:
1,N
2,a
3,m
4,e
5,;
6,S
7,h
8,o
...
It should be apparent that you are looping character by character, not line by line.
When you have:
for i, line in enumerate(content, start=1):
print(f'{i},{line}')
content is a string -- not a list of lines -- so you will loop over the string character by character with the for loop.
So to fix, do:
for i, line in enumerate(content.splitlines(), start=1):
print(f'{i},{line}')
Or, you can change the method of reading from the server to reading lines instead of characters:
content = response.readlines()
You're absorbing the .txt content into one big string; if you use .readlines() instead of .read(), you can achieve what you want.
You should modify this:
# Print all student information
content = response.read().decode('utf-8')
To this:
# Print all student information
content = response.readlines()
You can use the repr() method to take a look at your data:
print(repr(content))
'Name;Short name;Email;Country;Other spoken languages\r\nABOUELHASSAN Shehab Ibrahim Adbelazin;?;dwedar909#gmail.com;?;?\r\nAGHAEI HOSSEIN ABADI Mohammad Mehdi;Matt;mahdiaghaei355#gmail.com;Iran;English\r\nAMIN Asjad;?;;?;?\r\nATILA Arda Burak;Arda;arda_atila#hotmail.com;Turkey;English\r\nBELTRAN CASTRO Carlos Ricardo;Ricardo;crbeltrancas#gmail.com;Colombia;English, Chinese\r\nBhatti Muhammad Hasan;?;;?;?\r\nCAKIR Alp Hazar;Alp;alphazarc#gmail.com;Turkey;English\r\nDENG Zhihui;Deng;dzhfalcon0727#gmail.com;China;English\r\nDURUER Ahmet Enes;Ahmet / kahverengi;hello#ahmetduruer.com;Turkey;English\r\nENKHZAYA Jagar;Jager;japman2400#gmail.com;Mongolia;English\r\nGHAIBAH Sanaa;Sanaa;sanaagheibeh12#gmail.com;Syria;English\r\nGUO Ruizheng;?;ruizhengguo#gmail.com;China;English\r\nGURBANZADE Gurban;Qurban;gurbanzade01#gmail.com;Azeribaijan;English, Russian, Turkish\r\nHASNAIN Syed Muhammad;Hasnain;syedhasnainhijazy313#gmail.com;Pakistan;?\r\nISMAYILOV Firdovsi;Firi;firiisi#gmail.com;Azeribaijan ?;English,Russian,Turkish\r\nKINGRANI Muskan;Muskan;muskankingrani4#gmail.com;India;English\r\nKOKO Susan Kekeli Ruth;Susan;susankoko3#gmail.com;Ghana;N/A\r\nKOLA-OLALEYE Adeola Damilola;Adeola;inboxadeola#gmail.com;Nigeria;French\r\nLEWIS Madison Buse;?;madisonbuse#yahoo.com;Turkey;Turkish\r\nLI Ting;Ting;514053044#qq.com;China;English\r\nMARUSENKO Svetlana;Svetlana;svetlana.maru#gmail.com;Russia;English, German\r\nMOHANTY Cyrus;cyrus;cyrusmohanty5261#gmail.com;India;English\r\nMOTHOBI Thabo Emmanuel;thabo;thabomothobi#icloud.com;South Africa;English\r\nNayudu Yashmit Vinay;?;;?;?\r\nPurevsuren Davaadorj;?;Purevsuren.davaadorj99#gmail.com;Mongolia ?;English\r\nSAJID Anoosha;Anoosha;anooshasajid12#gmail.com;Pakistan;English\r\nSHANG Rongxiang;Xiang;1074482757#qq.com;China;English\r\nSU Haobo;Su;2483851740#qq.com;China;English\r\nTAKEUCHI ROSSMAN Elly;Elly;elliebanana10th#gmail.com;Japan;English\r\nULUSOY Nedim Can;Nedim;nedimcanulusoy#gmail.com;Turkey;English, Hungarian\r\nXuan Qijian;Xuan;xjwjadon#gmail.com;China ?;?\r\nYUAN Gaopeng;Yuan;1277237374#qq.com;China;English\r\n'
vs
print(repr(content))
[b'Name;Short name;Email;Country;Other spoken languages\r\n', b'ABOUELHASSAN Shehab Ibrahim Adbelazin;?;dwedar909#gmail.com;?;?\r\n', b'AGHAEI HOSSEIN ABADI Mohammad Mehdi;Matt;mahdiaghaei355#gmail.com;Iran;English\r\n', b'AMIN Asjad;?;;?;?\r\n', b'ATILA Arda Burak;Arda;arda_atila#hotmail.com;Turkey;English\r\n', b'BELTRAN CASTRO Carlos Ricardo;Ricardo;crbeltrancas#gmail.com;Colombia;English, Chinese\r\n', b'Bhatti Muhammad Hasan;?;;?;?\r\n', b'CAKIR Alp Hazar;Alp;alphazarc#gmail.com;Turkey;English\r\n', b'DENG Zhihui;Deng;dzhfalcon0727#gmail.com;China;English\r\n', b'DURUER Ahmet Enes;Ahmet / kahverengi;hello#ahmetduruer.com;Turkey;English\r\n', b'ENKHZAYA Jagar;Jager;japman2400#gmail.com;Mongolia;English\r\n', b'GHAIBAH Sanaa;Sanaa;sanaagheibeh12#gmail.com;Syria;English\r\n', b'GUO Ruizheng;?;ruizhengguo#gmail.com;China;English\r\n', b'GURBANZADE Gurban;Qurban;gurbanzade01#gmail.com;Azeribaijan;English, Russian, Turkish\r\n', b'HASNAIN Syed Muhammad;Hasnain;syedhasnainhijazy313#gmail.com;Pakistan;?\r\n', b'ISMAYILOV Firdovsi;Firi;firiisi#gmail.com;Azeribaijan ?;English,Russian,Turkish\r\n', b'KINGRANI Muskan;Muskan;muskankingrani4#gmail.com;India;English\r\n', b'KOKO Susan Kekeli Ruth;Susan;susankoko3#gmail.com;Ghana;N/A\r\n', b'KOLA-OLALEYE Adeola Damilola;Adeola;inboxadeola#gmail.com;Nigeria;French\r\n', b'LEWIS Madison Buse;?;madisonbuse#yahoo.com;Turkey;Turkish\r\n', b'LI Ting;Ting;514053044#qq.com;China;English\r\n', b'MARUSENKO Svetlana;Svetlana;svetlana.maru#gmail.com;Russia;English, German\r\n', b'MOHANTY Cyrus;cyrus;cyrusmohanty5261#gmail.com;India;English\r\n', b'MOTHOBI Thabo Emmanuel;thabo;thabomothobi#icloud.com;South Africa;English\r\n', b'Nayudu Yashmit Vinay;?;;?;?\r\n', b'Purevsuren Davaadorj;?;Purevsuren.davaadorj99#gmail.com;Mongolia ?;English\r\n', b'SAJID Anoosha;Anoosha;anooshasajid12#gmail.com;Pakistan;English\r\n', b'SHANG Rongxiang;Xiang;1074482757#qq.com;China;English\r\n', b'SU Haobo;Su;2483851740#qq.com;China;English\r\n', b'TAKEUCHI ROSSMAN Elly;Elly;elliebanana10th#gmail.com;Japan;English\r\n', b'ULUSOY Nedim Can;Nedim;nedimcanulusoy#gmail.com;Turkey;English, Hungarian\r\n', b'Xuan Qijian;Xuan;xjwjadon#gmail.com;China ?;?\r\n', b'YUAN Gaopeng;Yuan;1277237374#qq.com;China;English\r\n']
Also, instead of hard-coding the charset as utf-8, you can use response.headers.get_content_charset()
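Putting both answers together, a minimal sketch of the reading and numbering step (it keeps the question's Y/N prompt; falling back to utf-8 when the server declares no charset is an assumption):

import urllib.request

url = 'http://www.born.nhely.hu/group_list.txt'
response = urllib.request.urlopen(url)

# Use the charset declared by the server, falling back to UTF-8 (assumption).
charset = response.headers.get_content_charset() or 'utf-8'
content = response.read().decode(charset)

# splitlines() yields whole lines, so enumerate() numbers lines, not characters.
header_row = input("Do you want to know header_row numbers? Y OR N?")
start = 1 if header_row == 'Y' else 0
for i, line in enumerate(content.splitlines(), start=start):
    print(f'{i},{line}')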

How to add a SYLT (synced lyrics) tag to an ID3v2 mp3 file using Python?

I want to add synced lyrics from a VTT file to my mp3 file using Python. I tried the mutagen module, but it didn't work as intended.
from mutagen.id3 import ID3, USLT, SLT
import sys
import webvtt

lyrics = webvtt.read(sys.argv[2])
lyri = []
lyr = []
for lyric in lyrics:
    times = [int(x) for x in lyric.start.replace(".", ":").split(":")]
    ms = times[-1] + 1000*times[-2] + 1000*60*times[-3] + 1000*60*60*times[-4]
    lyri.append((lyric.text, ms))
    lyr.append(lyric.text)

fil = ID3(sys.argv[1])
tag = USLT(encoding=3, lang='kor', text="\n".join(lyr))  # this is unsynced lyrics
# tag = SLT(encoding=3, lang='kor', format=2, type=1, text=lyri) --- not working
print(tag)
fil.add(tag)
fil.save(v1=0)
How can I solve this problem?
I used mutagen to parse an mp3 file that already had SYLT data, and found the usage of SYLT:
from mutagen.id3 import ID3, SYLT, Encoding

tag = ID3(mp3path)
sync_lrc = [("Do you know what's worth fighting for", 17640),
            ("When it's not worth dying for?", 23640), ...]  # [(lrc, millisecond), ]
tag.setall("SYLT", [SYLT(encoding=Encoding.UTF8, lang='eng', format=2, type=1, text=sync_lrc)])
tag.save(v2_version=3)
But I can't figure out what format=2, type=1 means.
Check https://id3.org/id3v2.3.0#Synchronised_lyrics.2Ftext:
format 1: absolute time, 32 bit sized, using MPEG frames as unit
format 2: absolute time, 32 bit sized, using milliseconds as unit
type 0: other
type 1: lyrics
type 2: text transcription
type 3: movement/part name (e.g. "Adagio")
type 4: events (e.g. "Don Quijote enters the stage")
type 5: chord (e.g. "Bb F Fsus")
type 6: trivia/'pop up' information
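Combining the answer's SYLT call with the question's VTT parsing gives a rough sketch like this (the argument order and lang='kor' follow the question; treating every cue start as "HH:MM:SS.mmm" is an assumption):

import sys
import webvtt
from mutagen.id3 import ID3, SYLT, Encoding

def vtt_to_sylt_pairs(vtt_path):
    # Convert VTT cues to the [(text, milliseconds), ...] list SYLT expects.
    pairs = []
    for cue in webvtt.read(vtt_path):
        h, m, s = cue.start.split(":")  # cue.start looks like "00:01:23.456"
        ms = int((int(h) * 3600 + int(m) * 60 + float(s)) * 1000)
        pairs.append((cue.text, ms))
    return pairs

mp3_path, vtt_path = sys.argv[1], sys.argv[2]  # same argument order as the question
tag = ID3(mp3_path)
tag.setall("SYLT", [SYLT(encoding=Encoding.UTF8, lang='kor',
                         format=2,  # timestamps are absolute milliseconds
                         type=1,    # content type: lyrics
                         text=vtt_to_sylt_pairs(vtt_path))])
tag.save(v2_version=3)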

Debugging ScraperWiki scraper (producing spurious integer)

Here is a scraper I created using Python on ScraperWiki:
import lxml.html
import re
import scraperwiki
pattern = re.compile(r'\s')
html = scraperwiki.scrape("http://www.shanghairanking.com/ARWU2012.html")
root = lxml.html.fromstring(html)
for tr in root.cssselect("#UniversityRanking tr:not(:first-child)"):
    if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
        data = {
            'arwu_rank' : str(re.sub(pattern, r'', tr.cssselect("td.ranking")[0].text_content())),
            'university' : tr.cssselect("td.rankingname")[0].text_content().strip()
        }
    # DEBUG BEGIN
    if not type(data["arwu_rank"]) is str:
        print type(data["arwu_rank"])
        print data["arwu_rank"]
        print data["university"]
    # DEBUG END
    if "-" in data["arwu_rank"]:
        arwu_rank_bounds = data["arwu_rank"].split("-")
        data["arwu_rank"] = int( ( float(arwu_rank_bounds[0]) + float(arwu_rank_bounds[1]) ) * 0.5 )
    if not type(data["arwu_rank"]) is int:
        data["arwu_rank"] = int(data["arwu_rank"])
    scraperwiki.sqlite.save(unique_keys=['university'], data=data)
It works perfectly except when scraping the final data row of the table (the "York University" line), at which point, instead of the data = {...} assignment (lines 9 through 11 of the code) retrieving the string "401-500" from the table and assigning it to data["arwu_rank"], those lines somehow seem to cause the int 450 to be assigned to data["arwu_rank"]. You can see that I've added a few lines of "debugging" code to get a better understanding of what's going on, but also that that debugging code doesn't go very deep.
I have two questions:
What are my options for debugging scrapers run on the ScraperWiki infrastructure, e.g. for troubleshooting issues like this? Is there, for example, a way to step through the code?
Can you tell me why the int 450, instead of the string "401-500", is being assigned to data["arwu_rank"] for the "York University" line?
EDIT 6 May 2013, 20:07h UTC
The following scraper completes without issue, but I'm still unsure why the first one failed on the "York University" line:
import lxml.html
import re
import scraperwiki
pattern = re.compile(r'\s')
html = scraperwiki.scrape("http://www.shanghairanking.com/ARWU2012.html")
root = lxml.html.fromstring(html)
for tr in root.cssselect("#UniversityRanking tr:not(:first-child)"):
    if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
        data = {
            'arwu_rank' : str(re.sub(pattern, r'', tr.cssselect("td.ranking")[0].text_content())),
            'university' : tr.cssselect("td.rankingname")[0].text_content().strip()
        }
        # DEBUG BEGIN
        if not type(data["arwu_rank"]) is str:
            print type(data["arwu_rank"])
            print data["arwu_rank"]
            print data["university"]
        # DEBUG END
        if "-" in data["arwu_rank"]:
            arwu_rank_bounds = data["arwu_rank"].split("-")
            data["arwu_rank"] = int( ( float(arwu_rank_bounds[0]) + float(arwu_rank_bounds[1]) ) * 0.5 )
        if not type(data["arwu_rank"]) is int:
            data["arwu_rank"] = int(data["arwu_rank"])
        scraperwiki.sqlite.save(unique_keys=['university'], data=data)
There's no easy way to debug your scripts on ScraperWiki; unfortunately, it just sends your code off in its entirety and gets the results back, and there's no way to execute the code interactively.
I added a couple more prints to a copy of your code, and it looks like the if check before the bit that assigns data
if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
doesn't trigger for "York University", so data keeps the int value (which you set later in the loop) from the previous time around the loop.
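Based on that diagnosis, one way to guard against reusing stale data is simply to skip rows that lack either cell. This is a sketch along the lines of the original scraper, not the answerer's code:

import re
import lxml.html
import scraperwiki

pattern = re.compile(r'\s')
html = scraperwiki.scrape("http://www.shanghairanking.com/ARWU2012.html")
root = lxml.html.fromstring(html)
for tr in root.cssselect("#UniversityRanking tr:not(:first-child)"):
    # Skip rows missing either cell instead of falling through to the save
    # with data left over from the previous iteration.
    if not (tr.cssselect("td.ranking") and tr.cssselect("td.rankingname")):
        continue
    data = {
        'arwu_rank' : re.sub(pattern, '', tr.cssselect("td.ranking")[0].text_content()),
        'university' : tr.cssselect("td.rankingname")[0].text_content().strip()
    }
    if "-" in data["arwu_rank"]:  # e.g. "401-500" becomes the midpoint 450
        low, high = data["arwu_rank"].split("-")
        data["arwu_rank"] = int((float(low) + float(high)) * 0.5)
    else:
        data["arwu_rank"] = int(data["arwu_rank"])
    scraperwiki.sqlite.save(unique_keys=['university'], data=data)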

Script like google suggest in python

I am writing a script that works like Google Suggest. The problem is that I am trying to get suggestions for the next two most likely words.
The example uses a txt file, working_bee.txt. When typing the text "mis" I should get suggestions like "Miss Mary, Miss Taylor, ...", but I only get "Miss, ...". I suspect the Ajax responseText property gives only a single word?
Any ideas what is wrong?
# Something that looks like Google suggest
def count_words(xFile):
    frequency = {}
    words = []
    for l in open(xFile, "rt"):
        l = l.strip().lower()
        for r in [',', '.', "'", '"', "!", "?", ":", ";"]:
            l = l.replace(r, " ")
        words += l.split()
    for i in range(len(words)-1):
        frequency[words[i]+" "+words[i+1]] = frequency.get(words[i]+" "+words[i+1], 0) + 1
    return frequency

# read valid words from file
ws = count_words("c:/mod_python/working_bee.txt").keys()

def index(req):
    req.content_type = "text/html"
    return '''
<script>
function complete(q) {
    var xhr, ws, e
    e = document.getElementById("suggestions")
    if (q.length == 0) {
        e.innerHTML = ''
        return
    }
    xhr = XMLHttpRequest()
    xhr.open('GET', 'suggest_from_file.py/complete?q=' + q, true)
    xhr.onreadystatechange = function() {
        if (xhr.readyState == 4) {
            ws = eval(xhr.responseText)
            e.innerHTML = ""
            for (i = 0; i < ws.length; i++)
                e.innerHTML += ws[i] + "<br>"
        }
    }
    xhr.send(null)
}
</script>
<input type="text" onkeyup="complete(this.value)">
<div id="suggestions"></div>
'''

def complete(req, q):
    req.content_type = "text"
    return [w for w in ws if w.startswith(q)]
txt file:
IV. Miss Taylor's Working Bee
"So you must. Well, then, here goes!" Mr. Dyce swung her up to his shoulder and went, two steps at a time, in through the crowd of girls, so that he arrived there first when the door was opened. There in the hall stood Miss Mary Taylor, as pretty as a pink.
"I heard there was to be a bee here this afternoon, and I've brought Phronsie; that's my welcome," he announced.
"See, I've got a bag," announced Phronsie from her perch, and holding it forth.
So the bag was admired, and the girls trooped in, going up into Miss Mary's pretty room to take off their things. And presently the big library, with the music-room adjoining, was filled with the gay young people, and the bustle and chatter began at once.
"I should think you'd be driven wild by them all wanting you at the same minute." Mr. Dyce, having that desire at this identical time, naturally felt a bit impatient, as Miss Mary went about inspecting the work, helping to pick out a stitch here and to set a new one there, admiring everyone's special bit of prettiness, and tossing a smile and a gay word in every chance moment between.
"Oh, no," said Miss Mary, with a little laugh, "they're most of them my Sunday- school scholars, you know."
Looking at your code, I believe you are not sending the correct thing back to Apache: you are sending Apache a list, while Apache expects a string. I would suggest changing your return value to JSON:
import json

def complete(req, q):
    req.content_type = "text"
    return json.dumps([w for w in ws if w.startswith(q)])
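For illustration (the sample words are invented), the point of json.dumps is that the response body becomes one JSON string, which the page's eval(xhr.responseText) can turn back into a JavaScript array:

import json

suggestions = ["miss mary", "miss taylor"]  # hypothetical matches for the prefix "mis"
print(suggestions)              # ['miss mary', 'miss taylor'] -- Python repr, not JSON
print(json.dumps(suggestions))  # ["miss mary", "miss taylor"] -- what the browser should receive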

Crawling web pages with Python

I have a seed file of 250 URLs of IMDB's top 250 movies.
I need to crawl each one of them and get some info from it.
I've created a function that gets a URL of a movie and returns the info I need. It works great. My problem is when I'm trying to run this function on all of the 250 URLs.
After a certain number (not constant!) of URLs have been crawled successfully, the program stops running. The python.exe process sits at 0% CPU and its memory consumption doesn't change. After some debugging, I figured out that the problem is with the parsing: it just stops working, and I have no idea why (it gets stuck on a find command).
I'm using urllib2 to get the HTML content of the URL, then I parse it as a string and continue to the next URL (I go over each of these strings only once, in linear time for all the checks and extractions).
Any idea what can cause this kind of behavior?
EDIT:
I'm attaching the code of one of the problematic functions (I have one more, but I'm guessing it's the same problem):
def getActors(html, actorsDictionary):
    counter = 0
    actorsLeft = 3
    actorFlag = 0
    imdbURL = "http://www.imdb.com"
    for line in html:
        # we have 3 actors, stop
        if (actorsLeft == 0):
            break
        # current line contains actor information
        if (actorFlag == 1):
            endTag = str(line).find('/" >')
            endTagA = str(line).find('</a>')
            if (actorsLeft == 3):
                actorList = str(line)[endTag+7:endTagA]
            else:
                actorList += ", " + str(line)[endTag+7:endTagA]
            actorURL = imdbURL + str(line)[str(line).find('href=')+6:endTag]
            actorFlag = 0
            actorsLeft -= 1
            actorsDictionary[actorURL] = str(line)[endTag+7:endTagA]
        # check if next line contains actor information
        if (str(line).find('<td class="name">') > -1):
            actorFlag = 1
    # convert commas and clean \n
    actorList = actorList.replace(",", ", ")
    actorList = actorList.replace("\n", "")
    return actorList
I'm calling the function this way:
for url in seedFile:
    moviePage = urllib.request.urlopen(url)
    print(getTitleAndYear(moviePage), ",", movieURL, ",", getPlot(moviePage), getActors(moviePage, actorsDictionary))
This works great without the getActors function. There is no exception raised here (I removed the try/except for now); it just gets stuck in the for loop after some iterations.
EDIT 2: If I run only the getActors function, it works well and finishes all 250 URLs in the seed file.
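For reference, a minimal sketch of the calling pattern the question describes, fetching each page once and handing the same lines to every extraction function (getTitleAndYear, getPlot, getActors and the seed file come from the question; the file name and the UTF-8 decode are assumptions):

import urllib.request

actorsDictionary = {}
with open("imdb_top250_urls.txt") as seedFile:  # hypothetical seed file name
    for url in seedFile:
        url = url.strip()
        raw = urllib.request.urlopen(url).read()
        lines = raw.decode("utf-8", errors="replace").splitlines()  # one fetch, reused below
        print(getTitleAndYear(lines), ",", url, ",",
              getPlot(lines), getActors(lines, actorsDictionary))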
