OK, please be gentle - this is my first Stack Overflow question and I've struggled with this for a few hours. I'm sure the answer is something obvious staring me in the face, but I give up.
I'm trying to grab an element from a webpage on a name site (i.e., determine the gender of a name).
The Python code I've written is here:
import re
import urllib2

response = urllib2.urlopen("http://www.behindthename.com/name/janet")
html = response.read()
print html

patterns = ['Masculine', 'Feminine']
for pattern in patterns:
    print "Looking for %s in %s<<<" % (pattern, html)
    if re.findall(pattern, html):
        print "Found a match!"
        exit
    else:
        print "No match!"
When I dump html I see 'Feminine' there, but re.findall isn't matching. What in the world am I doing wrong?
Do not parse HTML with regex; use a specialized tool - an HTML parser.
Example using BeautifulSoup:
from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://www.behindthename.com/name/janet'
soup = BeautifulSoup(urlopen(url))
print soup.select('div.nameinfo span.info')[0].text  # prints "Feminine"
Or, you can find an element by text:
gender = soup.find(text='Feminine')
And then, check whether it is None (not found) or not: gender is None.
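For instance, a minimal sketch of that check (assuming the page layout hasn't changed):

from urllib2 import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen('http://www.behindthename.com/name/janet'))
gender = soup.find(text='Feminine')  # None if the text is absent
if gender is None:
    print "No match!"
else:
    print "Found a match!"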
Related
I want to extract the word of the day and its meaning using Python, but I'm bad at web scraping and I'm struggling a lot with scraping text. I wrote some code but it prints nothing and I don't know why.
As you can see in the image above, I want to get the highlighted text.
The code I wrote is:
import requests
from bs4 import BeautifulSoup

html_text = requests.get('https://www.transparent.com/word-of-the-day/today/japanese.html').text
soup = BeautifulSoup(html_text, 'html.parser')
content = soup.find_all('div', class_='wotdr-item wotdr-item--translation js-col-item')
for i in content:
    for p in i.find_all('p'):
        print(p.text)
But this doesn't print anything. Can somebody tell me my mistake?
The information you want is dynamic: it is generated by JavaScript while the page loads, so it isn't in the HTML that requests fetches. First, look at the requests the page makes and find the right one - here, the data comes from an RSS feed.
import requests
import xml.etree.ElementTree as ET

# The page loads its data from this RSS feed
url = 'https://wotd.transparent.com/rss/ja-widget.xml'
response = requests.get(url)
root = ET.fromstring(response.text)

print('Japanese word:', root.find('words/word').text)
print('English translation:', root.find('words/translation').text)
print('Part of speech:', root.find('words/wordtype').text)
print('Japanese example:', root.find('words/fnphrase').text)
print('English example:', root.find('words/enphrase').text)
OUTPUT:
Japanese word: 警報
English translation: alarm, alert, warning, warning signal
Part of speech: noun
Japanese example: 台風警報に気を付けなければいけません。
English example: You must pay attention if you see any typhoon warnings.
UPDATE
for wotd in root.findall('words'):
    print('Yomigana:', wotd.find('{http://www.transparent.com/word-of-the-day/}transliteratedWord').text)
    print('Yomigana example:', wotd.find('{http://www.transparent.com/word-of-the-day/}transliteratedSentence').text)
OR:
print('Yomigana:', root.find('words/{http://www.transparent.com/word-of-the-day/}transliteratedWord').text)
print('Yomigana example:', root.find('words/{http://www.transparent.com/word-of-the-day/}transliteratedSentence').text)
OUTPUT:
Yomigana: けいほう
Yomigana example: たいほうけいほうにきをつけなければいけません。
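As a side note, ElementTree also accepts a namespace mapping, which avoids repeating the full URI in every path (the wotd prefix below is an arbitrary local choice, not something the feed defines):

import requests
import xml.etree.ElementTree as ET

root = ET.fromstring(requests.get('https://wotd.transparent.com/rss/ja-widget.xml').text)
ns = {'wotd': 'http://www.transparent.com/word-of-the-day/'}
print('Yomigana:', root.find('words/wotd:transliteratedWord', ns).text)
print('Yomigana example:', root.find('words/wotd:transliteratedSentence', ns).text)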
I am pulling an HTML page with requests and trying to extract a link from it using a regex, but I keep getting TypeError: expected string or buffer.
Code:
import re
import requests

r = requests.get('https://reddit.com/r/spacex')
subreddit = r.text
match = re.search(r'(<li class="first"><a href=")(.+)(")', subreddit)
if match is not None:
    print(match.group(2))
However, if I take a chunk of the HTML and hardcode it as a string, then my code works:
subreddit = '<li class="first"><a href="http://www.reddit.com/r/spacex/comments/3115xw/latest_update_pic_of_slc4_pad_shows_ln2_tankage/"'
r = requests.get('https://reddit.com/r/spacex')
match = re.search(r'(<li class="first"><a href=")(.+)(")', subreddit)
if match is not None:
    print(match.group(2))
I also tried doing
match = re.search(r'(<li class="first"><a href=")(.+)(")', str(subreddit))
as suggested around here, but that didn't work. I did not receive any errors, but match.group(2) never printed the link.
I expect you have multiple lines in subreddit separated by '\n' when you do subreddit = r.text, and . in your pattern does not match a newline, so the match can't span lines. Try adding the re.DOTALL flag (re.MULTILINE only changes how ^ and $ behave), or search each line with for line in subreddit.split('\n').
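A minimal sketch of the flag-based variant (the non-greedy (.+?) is deliberate, so the match stops at the first closing quote):

import re
import requests

r = requests.get('https://reddit.com/r/spacex')
# re.DOTALL lets . match newlines as well
match = re.search(r'<li class="first"><a href="(.+?)"', r.text, re.DOTALL)
if match is not None:
    print(match.group(1))

To check whether requests.get() is returning anything at all, run the diagnostic below: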
import requests

r = requests.get('https://reddit.com/r/spacex')
subreddit = r.text
print('subreddit:' + subreddit)
subreddit.split('\n')
If the code above produces "'NoneType' object has no attribute 'split'", then there is something wrong with your requests.get() - it's not returning anything. Maybe a proxy? Post the output of this code anyway.
If you use BeautifulSoup, it will be a lot easier for you:
>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> soup = BeautifulSoup(urllib2.urlopen('https://reddit.com/r/spacex').read())
>>> for x in soup.find_all("li", "first"):
...     print x.a['href']
Or you can simply do:
>>> soup.select('a[href="https://www.reddit.com/r/spacex/comments/3115xw/latest_update_pic_of_slc4_pad_shows_ln2_tankage/"]')
[<a class="comments may-blank" href="https://www.reddit.com/r/spacex/comments/3115xw/latest_update_pic_of_slc4_pad_shows_ln2_tankage/">10 comments</a>]
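(Note that a[href="..."] is an exact-match attribute selector, so it only finds that one specific link; a prefix selector such as a[href^="https://www.reddit.com/r/spacex/comments/"] should match any comments link.)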
I'm running Python 3.4, and this code, which uses a regex, works for me.
>>> import re
>>> import requests
>>> r = requests.get('https://reddit.com/r/spacex')
>>> re.search(r'(<li class="first"><a href=")(.+?)(")', r.text).group(2)
'https://www.reddit.com/r/spacex/comments/31p51d/tory_bruno_posts_infographic_of_ula_vs_spacex/'
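Note the non-greedy (.+?) here, where the question used the greedy (.+): a greedy .+ runs to the last closing quote it can find, so it captures far more than the URL. A quick self-contained illustration:

import re

s = '<a href="first">x</a> <a href="second">'
print(re.search(r'href="(.+)"', s).group(1))   # greedy: first">x</a> <a href="second
print(re.search(r'href="(.+?)"', s).group(1))  # non-greedy: first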
I know this isn't using re like you asked about, but in a similar vein to the BeautifulSoup answer above: can you use PyQuery along with requests? Are these the links you're looking for?
import requests
from pyquery import PyQuery

r = requests.get('https://reddit.com/r/spacex')
subreddit = r.text
pyq_to_parse = PyQuery(subreddit)
result = pyq_to_parse(".first").find("a")
print result
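If you just want the href values, iterating a PyQuery object yields the underlying lxml elements, so a sketch like this should work (untested against the live page):

for a in result:
    print a.get('href')  # attribute access on the lxml element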
How do you parse multiple lines of HTML using regex in Python? I have managed to match string patterns on the same line using the code below.
i = 0
while i < len(newschoollist):
    url = "http://profiles.doe.mass.edu/profiles/general.aspx?topNavId=1&orgcode=" + newschoollist[i] + "&orgtypecode=6&"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '>Phone:</td><td>(.+?)</td></tr>'
    pattern = re.compile(regex)
    value = re.findall(pattern, htmltext)
    print newschoollist[i], valuetag, value
    i += 1
However, when I try to recognize more complicated HTML like this...
<td>Attendance Rate</td>
<td class='center'> 90.1</td>
I get null values. I believe the problem is with my syntax. I have googled regex and read most of the documentation, but I'm looking for some help with this kind of application. I am hoping someone can point me in the right direction. Is there a (.+?)-like combination that will help me tell regex to jump down a line of HTML?
What I want the findall to pick up is the 90.1 when it finds "Attendance Rate".
Thanks!
Use an HTML Parser. Example using BeautifulSoup:
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = 'http://profiles.doe.mass.edu/profiles/general.aspx?topNavId=1&orgcode=00350326'
soup = BeautifulSoup(urlopen(url))
for label in soup.select('div#whiteboxRight table td'):
    value = label.find_next_sibling('td')
    if not value:
        continue

    print label.get_text(strip=True), value.get_text(strip=True)
    print "----"
Prints (profile contact information):
...
----
NCES ID: 250279000331
----
Web Site: http://www.bostonpublicschools.org
----
MA School Type: Public School
----
NCES School Reconstituted: No
...
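For completeness, if you did want to stay with a regex for the two-line snippet in the question, \s* matches newlines too, so a sketch like this (reusing htmltext from the question) would jump down a line - brittle if the markup changes, which is why the parser approach above is preferable:

import re

# \s* absorbs the newline and indentation between the two <td> cells
match = re.search(r"<td>Attendance Rate</td>\s*<td[^>]*>\s*([\d.]+)", htmltext)
if match:
    print match.group(1)  # 90.1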
I ended up using soup.get_text() and it worked great. Thanks!
I have half-written code to pull the titles and links from an RSS feed, but it results in the above error (an AttributeError: 'NoneType' object has no attribute 'getText'). The error occurs in both functions while getting the text. I want to strip the title and link tags from the entered string.
from bs4 import BeautifulSoup
import urllib.request
import re

def getlink(a):
    a = str(a)
    bsoup = BeautifulSoup(a)
    a = bsoup.find('link').getText()
    return a

def gettitle(b):
    b = str(b)
    bsoup = BeautifulSoup(b)
    b = bsoup.find('title').getText()
    return b

webpage = urllib.request.urlopen("http://feeds.feedburner.com/JohnnyWebber?format=xml").read()
soup = BeautifulSoup(webpage)
titlesoup = soup.findAll('title')
linksoup = soup.findAll('link')

for i, j in zip(titlesoup, linksoup):
    i = getlink(i)
    j = gettitle(j)
    print(i)
    print(j)
    print("\n")
EDIT: falsetru's method worked perfectly.
I have one more question: can text be extracted from any tag by just calling getText()?
I expect the problem is in
def getlink(a):
    ...
    a = bsoup.find('link').getText()
    ...
Remember, find matches tag names; there is no link tag here, but an a tag. BeautifulSoup returns None from find if there is no matching tag, thus the NoneType error. Check the docs for details.
Edit:
If you really are looking for the text 'link' you can use bsoup.find(text=re.compile('link'))
i and j are already the title and link elements. Why do you find them again?
for i, j in zip(titlesoup, linksoup):
    print(i.getText())
    print(j.getText())
    print("\n")
Besides that, pass features='xml' to BeautifulSoup if you are parsing an XML file:
soup = BeautifulSoup(webpage, features='xml')
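(Note: BeautifulSoup's 'xml' feature uses the lxml library under the hood, so lxml must be installed for this to work.)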
b = bsoup.find('title') returns None; try checking your input.
I tried to fetch the source of a 4chan board and get links to threads. I have a problem with the regexp (it isn't working). Source:
import urllib2, re

req = urllib2.Request('http://boards.4chan.org/wg/')
resp = urllib2.urlopen(req)
html = resp.read()

print re.findall("res/[0-9]+", html)
#print re.findall("^res/[0-9]+$", html)
The problem is that:
print re.findall("res/[0-9]+", html)
is giving duplicates.
I can't use:
print re.findall("^res/[0-9]+$", html)
I have read the Python docs, but they didn't help.
That's because there are multiple copies of the link in the source.
You can easily make them unique by putting them in a set.
>>> print set(re.findall("res/[0-9]+", html))
set(['res/3833795', 'res/3837945', 'res/3835377', 'res/3837941', 'res/3837942',
'res/3837950', 'res/3100203', 'res/3836997', 'res/3837643', 'res/3835174'])
But if you are going to do anything more complex than this, I'd recommend you use a library that can parse HTML. Either BeautifulSoup or lxml.
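A minimal sketch of that parser-based approach (assuming thread links still look like res/<digits>; untested against the live board):

import re
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://boards.4chan.org/wg/').read()
soup = BeautifulSoup(html)
# collect unique thread links whose href contains res/<digits>
links = set(a['href'] for a in soup.find_all('a', href=re.compile(r'res/[0-9]+')))
print links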