I'm using pyparsing to parse HTML. I'm grabbing all embed tags, but in some cases there's an a tag directly following that I also want to grab if it's available.
example:
import pyparsing
target = pyparsing.makeHTMLTags("embed")[0]
target.setParseAction(pyparsing.withAttribute(src=pyparsing.withAttribute.ANY_VALUE))
target.ignore(pyparsing.htmlComment)
result = target.searchString(""".....
<object....><embed>.....</embed></object><br />blah
""")
I haven't been able to find any character offset in the result objects, otherwise I could just grab a slice of the original input string and work from there.
EDIT:
Someone asked why I don't use BeautifulSoup. That's a good question, let me show you why I chose not to use it with a code sample:
import BeautifulSoup
import urllib
import re
import socket
socket.setdefaulttimeout(3)
# get some random blogs
xml = urllib.urlopen('http://rpc.weblogs.com/shortChanges.xml').read()
success, failure = 0.0, 0.0
for url in re.compile(r'\burl="([^"]+)"').findall(xml)[:30]:
    print url
    try:
        BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
    except IOError:
        pass
    except Exception, e:
        print e
        failure += 1
    else:
        success += 1
print failure / (failure + success)
When I try this, BeautifulSoup fails with parse errors 20-30% of the time. These aren't rare edge cases. pyparsing is slow and cumbersome but it hasn't blown up no matter what I throw at it. If I can be enlightened as to a better way to use BeautifulSoup then I would be really interested in knowing that.
If there is an optional <a> tag that would be interesting if it follows an <embed> tag, then add it to your search pattern:
embedTag = pyparsing.makeHTMLTags("embed")[0]
aTag = pyparsing.makeHTMLTags("a")[0]
target = embedTag + pyparsing.Optional(aTag)
result = target.searchString(""".....
<object....><embed>.....</embed></object><br />blah
""")
print result.dump()
If you want to capture the character location of an expression within your parser, insert one of these, with a results name:
loc = pyparsing.Empty().setParseAction(lambda s,locn,toks: locn)
target = (loc("beforeEmbed") + embedTag + loc("afterEmbed") +
          pyparsing.Optional(aTag))
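For example, here is a small sketch (not run against your exact input) of how the recorded offsets let you slice the raw HTML back out of the input string; note that scanString also gives you the start and end offsets of each whole match:
import pyparsing

embedTag = pyparsing.makeHTMLTags("embed")[0]
aTag = pyparsing.makeHTMLTags("a")[0]
loc = pyparsing.Empty().setParseAction(lambda s, locn, toks: locn)
target = (loc("beforeEmbed") + embedTag + loc("afterEmbed") +
          pyparsing.Optional(aTag))

html = 'blah <embed src="movie.swf"> <a href="more.html">more</a> blah'
for tokens, start, end in target.scanString(html):
    print(html[tokens["beforeEmbed"]:tokens["afterEmbed"]])  # the <embed ...> tag portion
    print(html[start:end])                                   # the embed plus the optional <a>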
Why would you write your own HTML parser? The standard library includes HTMLParser, and BeautifulSoup can handle any job HTMLParser can't.
Do you not prefer using a normal regex? Or is it because it's a bad habit to parse HTML that way? :D
re.findall("<object.*?</object>(?:<br /><a.*?</a>)?",a)
I was able to run your BeautifulSoup code and received no errors. I'm running BeautifulSoup 3.0.7a
Please use BeautifulSoup 3.0.7a; 3.1.0.1 has bugs that prevent it from working at all in some cases (such as yours).
Related
I am writing a little script to get my F@H user data from a basic HTML page.
I want to locate my username on that page and the numbers before and after it.
All the data I want is between two HTML <tr> and </tr> tags.
I am currently using this:
re.search(r'<tr>(.*?)</tr>', htmlstring)
I know this works for any substring, as all the Google results for my question show. The difference here is that I need it only when that substring also contains a specific word.
However, that only returns the first string between those two delimiters, not all of them.
This pattern occurs hundreds of times on the page. I suspect it doesn't get them all because I'm not handling the newline characters correctly, but I'm not sure.
If it returned all of them, I could at least sort through each result.group() to find one that contains my username, but I can't even do that.
I have been fiddling with different regex expressions for ages now but can't figure out which one I need, to much frustration.
TL;DR -
I need a re.search() pattern that finds a substring between two words, that also contains a specific word.
If I understand correctly, something like this might work:
<tr>(?:(?:(?:(?!<\/tr>).)*?)\bWORD\b(?:.*?))<\/tr>
<tr> find "<tr>"
(?:(?:(?!<\/tr>).)*?) Find anything except "</tr>" as few times as possible
\bWORD\b find WORD
(?:.*?)) find anything as few times as possible
<\/tr> find "</tr>"
Sample
There are a few ways to do it but I prefer the pandas way:
from urllib import request
import pandas as pd # you need to install pandas
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
web_df = pd.read_html(web_request, attrs={'class': 'members'})[0]  # read_html returns a list of matching tables
web_df = web_df.set_index(keys=['Name'])
# print(web_df)
user_name_to_find_in_table = 'SteveMoody'
user_name_df = web_df.loc[user_name_to_find_in_table]
print(user_name_df)
Then there are plenty of other ways to do this: using just BeautifulSoup's find or CSS selectors, or maybe re as Peter suggests.
Using BeautifulSoup's "find" method together with re, you can do it the following way:
import re
from bs4 import BeautifulSoup as bs # you need to install beautifulsoup4
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
page_soup = bs(web_request, 'lxml') # need to install lxml and bs4(beautifulsoup for Python 3+)
user_name_to_find_in_table = 'SteveMoody'
row_tag = page_soup.find(
    lambda t: t.name == "td"
    and re.findall(user_name_to_find_in_table, t.text, flags=re.I)
).find_parent(name="tr")
print(row_tag.get_text().strip())
Using BeautifulSoup and CSS selectors (no re, just BeautifulSoup):
from bs4 import BeautifulSoup as bs # you need to install beautifulsoup4
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
page_soup = bs(web_request, 'lxml') # need to install lxml and bs4(beautifulsoup for Python 3+)
user_name_to_find_in_table = 'SteveMoody'
row_tag = page_soup.select_one(f'tr:has(> td:contains({user_name_to_find_in_table})) ')
print(row_tag.get_text().strip())
In your case I would favor the pandas example as you keep headers and can easily get other stats, and it runs very quickly.
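For instance, a small follow-up sketch building on the pandas snippet above (no specific column names assumed, since they depend on the page):
# building on web_df / user_name_df from the pandas example above
print(web_df.columns.tolist())   # the stat columns the page provides
print(user_name_df.to_dict())    # all stats for that user as a dict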
Using re:
So far, the best input is Peter's comment, so I just adapted it to Python code (happy to have it edited), as this solution doesn't need installing any extra libraries.
import re
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
user_name_to_find_in_table = 'SteveMoody'
re_pattern = rf'<tr>(?:(?:(?:(?!<\/tr>).)*?)\b{user_name_to_find_in_table}\b(?:.*?))<\/tr>'
res = re.search(pattern=re_pattern, string=str(web_request))
print(res.group(0))
Helpful link on using variables in a regex: Stack Overflow
I have an element of type bs4.element.Tag, the product of web scraping. I usually do json.loads(soup.find('script', type='application/ld+json').text), but on this page the data only appears inside a plain <script></script> tag, so I had to do scripts = soup.find_all('script') until I got to the one that interests me: script = scripts[18].
The variable in question is script. My problem is that I want to access its attributes, for example script['goodsInfo']. Since it is of type bs4.element.Tag, I tried script.attrs and it returns {}. Then I tried to convert it to JSON with json.loads(str(script)) and it throws the exception 'JSONDecodeError: Expecting value: line 1 column 1 (char 0)'.
This is my code:
import json
from bs4 import BeautifulSoup
import requests
url_aux = 'https://www.shein.com/Mock-neck-Brush-Stroke-Print-Bodycon-Dress-p-941649-cat-1727.html?scici=navbar_2~~tab01navbar04~~4~~real_1727~~~~0~~0'
response = requests.get(url_aux)
soup = BeautifulSoup(response.content, "html.parser")
scripts = soup.find_all('script')
script = scripts[18]
print(json.loads(str(script)))
#output: JSONDecodeError: Expecting value: line 1 column 1 (char 0)
print(type(script))
#output: bs4.element.Tag
print(str(json.loads(str(script))))
You can use the json module to extract the data, but first it's necessary to locate the right info; you can use the re module for that.
For example:
import re
import json
import requests
url = 'https://eur.shein.com/Mock-neck-Brush-Stroke-Print-Bodycon-Dress-p-941649-cat-1727.html?scici=navbar_2~~tab01navbar04~~4~~real_1727~~~~0~~0&ref=www&rep=dir&ret=eur'
txt = re.findall(r'goodsInfo\s*:\s*({.*})', requests.get(url).text)[0]
data = json.loads(txt)
# print(json.dumps(data, indent=4)) # <-- uncomment to see all data
print(data['detail']['goods_name'])
print(data['detail']['brand'])
print('Num of comments:', data['detail']['comment']['comment_num'])
Prints:
Mock-neck Brush Stroke Print Bodycon Dress
SHEIN
Num of comments: 17
BS4 does not parse JavaScript; from BS4's Tag object's POV, the text in a <script> tag is, well, just text. I don't have any idea what this script looks like (since you didn't post it and I'm not going to bother trying to find it), but if your expectation was that script['goodsInfo'] would return the value of a JS variable named 'goodsInfo' then, bad news, it's not going to work that way.
Also, JavaScript is not JSON, so the chances that a JS snippet will be valid JSON are rather small, to say the least. The proper syntax to test it would be quite simply the same as the one you used for your first use case, i.e. json.loads(script.text), but I assume that's the first thing you tried ;-)
So, well, I'm afraid you'll have to manually parse this script to extract the relevant part. Depending on what the JS code looks like, it may be a matter of a few lines of basic string parsing / regexp stuff, or it may require a proper JavaScript parser, etc.
I am trying to make a simple Python script to extract certain links from a webpage. I am able to extract the links successfully, but now I want to extract some more information like bitrate, size, and duration given on that webpage.
I am using the below xpath to extract the above mentioned info
>>> doc = lxml.html.parse('http://mp3skull.com/mp3/linkin_park_faint.html')
>>> info = doc.xpath(".//*[@id='song_html']/div[1]/text()")
>>> info[0:7]
['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']
Now what I need is that for a particular link the info I require is generated in a form of tuple like (bitrate,size,duration).
The XPath I mentioned above generates the required info, but it is ill-formatted; that is, it is not possible to achieve my required format with any logic, at least I am not able to do it.
So, is there any way to achieve the output in my format?
I think BeautifulSoup will do the job, it parses even badly formatted HTML:
http://www.crummy.com/software/BeautifulSoup/
parsing is quite easy with BeautifulSoup - for example:
import bs4
import urllib
soup = bs4.BeautifulSoup(urllib.urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read())
print soup.find_all('a')
and it has quite good docs:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
You can actually strip everything out with XPath:
translate(.//*[@id='song_html']/div[1]/text(), "\n\t,'", '')
So for your additional question, either:
info[0:len(info)]
for altogether, or:
info.rfind(" ")
The translate leaves a space character, but you could replace that with whatever you want.
Additional info found here
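Here is a rough sketch (the site may no longer respond, and the grouping of values per link is an assumption) of cleaning up the text() nodes with lxml and classifying them into a (bitrate, size, duration) tuple by their shape:
import re
import lxml.html

doc = lxml.html.parse('http://mp3skull.com/mp3/linkin_park_faint.html')
# grab the raw text() nodes, then drop the whitespace-only ones
fields = [t.strip() for t in doc.xpath(".//*[@id='song_html']/div[1]/text()") if t.strip()]
# classify each cleaned value by its shape; the patterns are assumptions based on
# the sample output in the question ('3.71 mb', '192 kbps', '2:41')
size = next((f for f in fields if re.match(r'[\d.]+\s*mb$', f, re.I)), None)
bitrate = next((f for f in fields if re.match(r'\d+\s*kbps$', f, re.I)), None)
duration = next((f for f in fields if re.match(r'\d+:\d{2}$', f)), None)
print((bitrate, size, duration))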
How are you with regular expressions and python's re module?
http://docs.python.org/library/re.html may be essential.
As far as getting the data out of the array goes, re.match(regex, info[n]) should suffice; as far as the triple tuple goes, the Python tuple syntax takes care of it. Simply match members of your info array with re.match.
import re
matching_re = '.*' # this re matches whole strings, rather than what you need
incoming_value_1 = re.match(matching_re,info[1])
# etc.
truple = (incoming_value_1, incoming_value_2, incoming_value_3)  # plain Python tuple syntax
I have made a script with BeautifulSoup which works fine and is very readable, but I want to redistribute it some day, and BeautifulSoup is an external dependency I would like to avoid, specially considering Windows use.
Here is the code, it gets every usermap link from a given google maps user. The ####### marked lines are the ones using BeautifulSoup:
# coding: utf-8
import urllib, re
from BeautifulSoup import BeautifulSoup as bs
uid = '200931058040775970557'
start = 0
shown = 1
while True:
    url = 'http://maps.google.com/maps/user?uid='+uid+'&ptab=2&start='+str(start)
    source = urllib.urlopen(url).read()
    soup = bs(source) ####
    maptables = soup.findAll(id=re.compile('^map[0-9]+$')) #################
    for table in maptables:
        for line in table.findAll('a', 'maptitle'): ################
            mapid = re.search(uid+'\.([^"]*)', str(line)).group(1)
            mapname = re.search('>(.*)</a>', str(line)).group(1).strip()[:-3]
            print shown, mapid, '\t', mapname
            shown += 1
            urllib.urlretrieve('http://maps.google.com.br/maps/ms?msid=' + uid + '.' + str(mapid) +
                               '&msa=0&output=kml', mapname + '.kml')
    if '<span>Next</span>' in str(source):
        start += 5
    else:
        break
As you can see, there are just three lines of code using BSoup, but I am not a programmer and I had a lot of difficulty trying to use other standard HTML and XML parsing tools, probably because I tried the wrong way, I guess.
EDIT: This question is more about replacing the three lines of code of this script than to find a way to solve generic html parsing problems there might be.
Any help will be much appreciated, thanks for reading!
Unfortunately, Python does not have useful HTML parsing in the standard library, so the only reasonable way to parse HTML is by using a third party module like lxml.html or BeautifulSoup. This does not mean that you have to have a separate dependency--these modules are free software and if you do not want an external dependency, you're welcome to bundle them with your code, which then won't make them any more a dependency than the code you write yourself.
To parse HTML code, I see three solutions:
use simple string search (.find(), ...), which is fast!
use regular expressions (aka regex)
use HTMLParser (a sketch of this option follows the code below)
I have tried this code (see below) and it shows a list of links. As I don't have BeautifulSoup installed and don't want to, it is very difficult for me to check the results against what your code gives.
The "pure" Python code without any "soup" is even shorter and more readable.
Anyway, here it is. Tell me what you think! Friendly, Louis.
#coding: utf-8
import urllib, re

uid = '200931058040775970557'
start = 0
shown = 1
while True:
    url = 'http://maps.google.com/maps/user?uid='+uid+'&ptab=2&start='+str(start)
    source = urllib.urlopen(url).read()
    rest = source
    while True:
        # locate the next map title link with a plain string search
        endit = rest.find('maptitle')
        if endit == -1:
            break
        # grab the whole <a ... class="maptitle">...</a> tag around that position
        tag_end = rest.find('</a>', endit) + len('</a>')
        line = rest[rest.rfind('<a', 0, endit):tag_end]
        mapid = re.search(uid+'\.([^"]*)', line).group(1)
        mapname = re.search('>(.*)</a>', line).group(1).strip()[:-3]
        print shown, mapid, '\t', mapname
        shown += 1
        urllib.urlretrieve('http://maps.google.com.br/maps/ms?msid=' + uid + '.' + str(mapid) + '&msa=0&output=kml', mapname + '.kml')
        rest = rest[tag_end:]  # continue the search after this link
    if '<span>Next</span>' in source:
        start += 5
    else:
        break
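For completeness, here is a rough sketch of the third option, the standard library HTMLParser, in the same Python 2 style as the script above; the 'maptitle' class comes from your script, 'source' is assumed to be the downloaded page HTML, and I have not run this against the live page:
from HTMLParser import HTMLParser

class MapLinkParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_maptitle = False
        self.links = []  # list of [href, link text] pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('class') == 'maptitle':
            self.in_maptitle = True
            self.links.append([attrs.get('href', ''), ''])

    def handle_data(self, data):
        if self.in_maptitle:
            self.links[-1][1] += data

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_maptitle:
            self.in_maptitle = False

parser = MapLinkParser()
parser.feed(source)  # 'source' is the page HTML fetched above
for href, text in parser.links:
    print href, text.strip()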
I'd like to grab all the index words and its definitions from here. Is it possible to scrape web content with Python?
Firebug exploration shows that the following URL returns my desired contents, including both the index and its definition, for 'a'.
http://pali.hum.ku.dk/cgi-bin/cpd/pali?acti=xart&arid=14179&sphra=undefined
What modules are used? Is there any tutorial available?
I do not know how many words are indexed in the dictionary. I'm an absolute beginner in programming.
You should use urllib2 for getting the URL contents and BeautifulSoup for parsing the HTML/XML.
Example - retrieving all questions from the StackOverflow.com main page:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://stackoverflow.com")
soup = BeautifulSoup(page)
for incident in soup('h3'):
    print [i.decode('utf8') for i in incident.contents]
    print
This code sample was adapted from the BeautifulSoup documentation.
You can get data from the web using the built-in urllib or urllib2, but the parsing itself is the most important part. May I suggest the wonderful BeautifulSoup? It can deal with just about anything.
http://www.crummy.com/software/BeautifulSoup/
The documentation is built like a tutorial. Sorta:
http://www.crummy.com/software/BeautifulSoup/documentation.html
In your case, you probably need to use wildcards to see all entries in the dictionary. You can do something like this:
import urllib2
def getArticles(query, start_index, count):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xsea&tsearch=%s&rfield=entr&recf=%d&recc=%d' %
                          (query, start_index, count))
    # TODO:
    # parse xml code here (using BeautifulSoup or an xml parser like Python's
    # own xml.etree). We should at least have the name and ID for each article:
    # article = (article_name, article_id)
    return article_names  # a list of parsed (article_name, article_id) pairs from the XML

def getArticleContent(article_id):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xart&arid=%d&sphra=undefined' % article_id)
    # TODO: parse xml
    return parsed_article
Now you can loop over things. For instance, to get all articles starting in 'ana', use the wildcard 'ana*', and loop until you get no results:
query = 'ana*'
article_dict = {}
i = 0
while True:
    new_articles = getArticles(query, i, 100)
    if len(new_articles) == 0:
        break
    i += 100
    for article_name, article_id in new_articles:
        article_dict[article_name] = getArticleContent(article_id)
Once done, you'll have a dictionary of the content of all articles, referenced by names. I omitted the parsing itself, but it's quite simple in this case, since everything is XML. You might not even need to use BeautifulSoup (even though it's still handy and easy to use for XML).
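If you do the parsing with the standard library instead of BeautifulSoup, the TODO in getArticles might look roughly like this; the element and attribute names ('article', 'name', 'id') are hypothetical, since I have not inspected the XML that the CGI script actually returns:
import xml.etree.ElementTree as ET

def parseArticleList(xml_text):
    # hypothetical structure: <results><article id="14179" name="ana">...</article>...</results>
    root = ET.fromstring(xml_text)
    return [(a.get('name'), int(a.get('id'))) for a in root.iter('article')]
You would then call it inside getArticles as article_names = parseArticleList(xml.read()).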
A word of warning though:
You should check the site's usage policy (and maybe robots.txt) before trying to heavily scrape articles. If you're just getting a few articles for yourself they may not care (the dictionary copyright owner, if it's not public domain, may care though), but if you're going to scrape the entire dictionary, this is going to be some heavy usage.