Scraping in python using regex not giving any result? [closed] - python

I am using Python 3 to scrape a website and print a value. Here is the code:
import urllib.request
import re
url = "http://in.finance.yahoo.com/q?s=spy"
hfile = urllib.request.urlopen(url)
htext = hfile.read().decode('utf-8')
regex = '<span id="yfs_l84_SPY">(.+?)</span>'
code = re.compile(regex)
price = re.findall(code,htext)
print (price)
When I run this snippet, it prints an empty list, i.e. [], but I am expecting a value such as 483.33.
What am I getting wrong?

I have to recommend that you not use regex to parse HTML, because HTML is not a regular language. Yes, you could use it here. It's not a good habit to get into.
The biggest issue I imagine you're having is that the real id of the span you're looking for on that page is yfs_l84_spy, in lowercase, while your pattern uses SPY. Regex matching is case-sensitive by default.
That said, here is a quick implementation in BeautifulSoup.
import urllib.request
from bs4 import BeautifulSoup
url = "http://in.finance.yahoo.com/q?s=spy"
hfile = urllib.request.urlopen(url)
htext = hfile.read().decode('utf-8')
soup = BeautifulSoup(htext, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
soup.find('span',id="yfs_l84_spy")
Out[18]: <span id="yfs_l84_spy">176.12</span>
And to get at that number:
found_tag = soup.find('span',id="yfs_l84_spy") # found_tag is a bs4 Tag object
found_tag.string # the tag's only child, i.e. its text
Out[36]: '176.12'
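To see the case issue in isolation, here is a minimal sketch using only the re module. The sample_html string is a made-up stand-in for the downloaded page (the real markup may differ), and the variable names are mine, not from the question.

```python
import re

# Hypothetical stand-in for the downloaded page; the real markup may differ.
sample_html = '<div><span id="yfs_l84_spy">176.12</span></div>'

# The pattern from the question (uppercase SPY) and the corrected one.
upper_pattern = re.compile(r'<span id="yfs_l84_SPY">(.+?)</span>')
lower_pattern = re.compile(r'<span id="yfs_l84_spy">(.+?)</span>')
# re.IGNORECASE makes the whole pattern case-insensitive.
relaxed_pattern = re.compile(r'<span id="yfs_l84_SPY">(.+?)</span>', re.IGNORECASE)

upper_result = upper_pattern.findall(sample_html)
lower_result = lower_pattern.findall(sample_html)
relaxed_result = relaxed_pattern.findall(sample_html)

print(upper_result)    # []
print(lower_result)    # ['176.12']
print(relaxed_result)  # ['176.12']
```

The empty list from the uppercase pattern matches what the question reports, which suggests the case mismatch alone is enough to explain it.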

Your pattern needs the lowercase id; with that fixed, there are two ways of running it:
1.
regex = '<span id="yfs_l84_spy">(.+?)</span>'
code = re.compile(regex)
price = code.findall(htext)
2.
regex = '<span id="yfs_l84_spy">(.+?)</span>'
price = re.findall(regex, htext)
Note that Python's re module caches recently compiled patterns internally, so precompiling with re.compile has only a limited effect on speed.
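For what it's worth, a quick check suggests the call style is not what decides the result. The sketch below runs both forms, plus the form used in the question (a compiled pattern passed to re.findall), against a made-up sample string; sample_html and the result names are assumptions for illustration.

```python
import re

# Hypothetical one-line sample instead of the real page.
sample_html = '<span id="yfs_l84_spy">176.12</span>'

regex = '<span id="yfs_l84_spy">(.+?)</span>'
code = re.compile(regex)

via_method = code.findall(sample_html)                  # way 1: method on the compiled pattern
via_module_with_string = re.findall(regex, sample_html) # way 2: module function with a string
via_module_with_compiled = re.findall(code, sample_html) # the form used in the question

print(via_method, via_module_with_string, via_module_with_compiled)
```

All three return the same list here, so the lowercase id looks like the change that matters.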

Related

Search and create list from a string Python [closed]

I am very new to Python and I am trying to create a list out of a string.
Input = '<html><body><ul style="padding-left: 5pt"><i>(See attached file: File1.pdf)</i><i>(See attached file: File2.ppt)</i><i>(See attached file: File3.docx)</i></ul></body></html>'
Desired Output = [File1.pdf, File2.ppt, File3.docx]
What is the most efficient and Pythonic way to achieve this? Any help will be very much appreciated.
Thanks
You can use BeautifulSoup, which has HTML parsing utilities.
>>> from bs4 import BeautifulSoup
>>> html = """<html><body><ul style="padding-left: 5pt"><i>(See attached file: File1.pdf)</i><i>(See attached file: File2.ppt)</i><i>(See attached file: File3.docx)</i></ul></body></html>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> files_list = [i.text.split('file: ')[1].replace(')', '') for i in soup.find_all('i')]
>>> print(files_list)
['File1.pdf', 'File2.ppt', 'File3.docx']
There might be a nice way to do this using an HTML parser like shree.pat18 suggested, but here is a quick and dirty way using str.split():
Output = [s.split(")")[0] for s in Input.split("file: ")[1:]]
By first splitting on "file: " we get list of strings, the first one contains the first part of the original string so we don't care about that one. The others start with the filenames that we want and the first character we don't care about is ")". So split on ")" and take the first part.
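If the wording around each filename is always "(See attached file: ...)", a single regular expression is another option. This is only a sketch under that assumption; it reuses the input string from the question and will miss filenames that themselves contain a closing parenthesis.

```python
import re

html = ('<html><body><ul style="padding-left: 5pt">'
        '<i>(See attached file: File1.pdf)</i>'
        '<i>(See attached file: File2.ppt)</i>'
        '<i>(See attached file: File3.docx)</i>'
        '</ul></body></html>')

# Capture everything between "file: " and the next ")".
files_list = re.findall(r'\(See attached file: ([^)]+)\)', html)
print(files_list)  # ['File1.pdf', 'File2.ppt', 'File3.docx']
```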

finding HTTPS images with BeautifulSoup, python [closed]

I'm iterating through all the img's in a request.POST to see if they are HTTPS (I'm using Beautiful Soup to help)
Here's my code:
content = request.POST['content']
print(content) #prints:
# <p>test test test</p><br><p><img src="https://www.treefrogfarm.com/store/images/source/IFE_A-K/ClarySage2.jpg" alt=""></p><br><p>2nd 2nd</p><br><p><img src="https://www.treefrogfarm.com/store/images/source/IFE_A-K/ClarySage2.jpg" alt=""></p>
soup = BeautifulSoup(content, 'html.parser')
for image in soup.find_all('img'):
    print('Source:', image.get('src')[:8]) #prints Source: https://
    if image.get('src')[:7] == "https://":
        print('HTTPS')
    else:
        print('Not HTTPS')
Even though image.get('src')[:7] == "https://", the code still prints Not HTTPS.
Any idea why?
Well for starters, 'https://' is 8 characters, so there's no way that a slice of 7 characters can match it.
Also, please make your question titles actually indicative of the problem you're having rather than unrelated accusations about the python operators.
To match the https:// string, the appropriate slice is [:8] instead of [:7]:
if image.get('src')[:8] == "https://":
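One way to avoid counting characters at all is str.startswith(), or comparing the scheme that urllib.parse reports. The sketch below works on plain strings so it runs without BeautifulSoup; is_https is a hypothetical helper name, and the second URL is invented for contrast.

```python
from urllib.parse import urlparse

def is_https(src):
    """Return True when src is a string that starts with the https scheme."""
    return isinstance(src, str) and src.startswith("https://")

sources = [
    "https://www.treefrogfarm.com/store/images/source/IFE_A-K/ClarySage2.jpg",
    "http://example.com/picture.jpg",  # hypothetical non-HTTPS source
    None,  # what image.get('src') returns when the attribute is missing
]
flags = [is_https(src) for src in sources]
print(flags)  # [True, False, False]

# Alternative: compare the parsed scheme instead of a string prefix.
scheme = urlparse(sources[0]).scheme
print(scheme)  # https
```

In the loop from the question, the condition would then read `if is_https(image.get('src')):`.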

Web Scraping - How to get a specific part of a weblink [closed]

I have the following link:
https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk
I have multiple links in a dataset, and each link follows the same pattern. I want to extract a specific part of each link: the text starting at the second https and ending just before the first + sign. For the link above, that is https://cooking.nytimes.com/learn-to-cook.
I don't know how to do this using regex. I am working in Python. Kindly help me out.
If each link has the same pattern, you do not need regex. You can use str.find() and string slicing:
link = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
# This finds the second occurrence of "https://" and returns the position
second_https = link.find("https://", link.find("https://")+1)
# Index of the end of the link
end_of_link = link.find("+")
new_link = link[second_https:end_of_link]
print(new_link)
This will return "https://cooking.nytimes.com/learn-to-cook" and will work if the link follows the same pattern as described (it is the second https:// in the link and ends with + sign)
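Since the dataset has many links, it may help to wrap the same find-and-slice steps in a function that also copes with links that do not follow the pattern. This is a sketch only; extract_cached_url is a hypothetical name, and returning None for non-matching links is my choice, not something the question asks for.

```python
def extract_cached_url(link):
    """Return the text from the second "https://" up to the next "+", or None."""
    first = link.find("https://")
    # str.find returns -1 when nothing is found, so guard before searching again.
    second = link.find("https://", first + 1) if first != -1 else -1
    if second == -1:
        return None
    end = link.find("+", second)
    if end == -1:
        end = len(link)  # no "+" after the second URL: keep the rest of the string
    return link[second:end]

links = [
    "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk",
    "https://example.com/no-second-url",  # hypothetical link that breaks the pattern
]
results = [extract_cached_url(link) for link in links]
print(results)
```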
I'd go with urlparse (Python 2) or urllib.parse (Python 3) and a little bit of regex. The Python 3 version:
import re
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse
url_example = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
parsed = urlparse(url_example)
result = re.findall('https?.*', parsed.query)[0].split('+')[0]
print(result)
Output:
https://cooking.nytimes.com/learn-to-cook
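Since the question asked for a regex, the same extraction can probably be done with one pattern. The sketch assumes every link contains "cache:", an identifier without colons, another colon, and then the target URL up to the first "+"; CACHED_URL and cached_target are hypothetical names.

```python
import re

# "cache:" + identifier + ":" + target URL, which ends before the first "+".
CACHED_URL = re.compile(r'cache:[^:]+:(https?://[^+]+)')

def cached_target(link):
    """Return the cached page's URL, or None if the link does not fit the pattern."""
    match = CACHED_URL.search(link)
    return match.group(1) if match else None

url_example = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
target = cached_target(url_example)
print(target)  # https://cooking.nytimes.com/learn-to-cook
```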

I have an unindent error while using geany with python [closed]

I'm using geany and I get the following error
File "autoblog2.py", line 9
htmlfile = urllib.urlopen(url)
^
IndentationError: unindent does not match any outer indentation level
here is my code.
import urllib
import re
symbols_list = ["aapl","spy","goog","nflx"]
i = 0
while i<len(symbols_list):
url = 'https://uk.finance.yahoo.com/q?s='+symbols_list[i]
htmlfile = urllib.urlopen(url)
htmltext = htmlfile.read()
regex = '<span id="yfs_l84_aapl">(.+?)'+symbols_list[i]'</span>'
pattern = re.compile(regex)
price = re.findall(pattern,htmltext)
print 'the price of' +symbols_list[i]
i+=1
I don't get any errors when I run the same code on a single URL. I've only had this error since trying it with a while loop. I'm using Python 2.
This can happen when you edit one script with two editors.
Your indent settings can differ from editor to editor.
Take a look at the script with another editor.
If the script shows the same indents in other editors, the remaining fix is to remove all indents and add them again.
I would recommend IDLE (python-idle).
It should show the indents the way the interpreter reads them.
Good Luck.
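A common cause of this error is one editor indenting with tabs and another with spaces. The sketch below is one rough way to check for that: it lists which lines are indented with tabs and which with spaces. indentation_report and the sample source are made up for illustration; they are not taken from autoblog2.py.

```python
def indentation_report(source):
    """Return (tab_lines, space_lines): line numbers whose indent uses tabs / spaces."""
    tab_lines, space_lines = [], []
    for number, line in enumerate(source.splitlines(), start=1):
        # Leading whitespace is whatever lstrip would remove from the front.
        indent = line[:len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            tab_lines.append(number)
        if " " in indent:
            space_lines.append(number)
    return tab_lines, space_lines

# Hypothetical loop body: line 2 is indented with a tab, line 3 with spaces.
sample = "while i < 4:\n\turl = 'a'\n    htmlfile = 'b'\n"
tabs, spaces = indentation_report(sample)
print(tabs, spaces)  # [2] [3]
```

If both lists are non-empty, the file mixes the two styles. The standard library's tabnanny module (`python -m tabnanny autoblog2.py`) performs a more thorough check of the same kind.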

python regular expression grammar [closed]

... html ...
[{"url":"/test/test/url","id":"111111"},{"url":"/test/test/url","id":"111111"}, {"url":"/test/test/url","id":"1111"}]
.... html ...
I have a JSON-like string embedded in some HTML.
How can I write a regular expression that extracts values such as
"/test/test/url" and the "1111" that comes after "id":?
Thanks in advance.
Don't use regular expressions here, use the json module. This is what it's designed for.
import json
# html must contain only the JSON text, not the surrounding markup
mylist = json.loads(html)
for subdict in mylist:
    print(subdict['url'])
    print(subdict['id'])
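Because json.loads() needs the JSON text on its own, the array has to be cut out of the surrounding HTML first. Here is a minimal sketch, assuming the page contains exactly one JSON array and no other square brackets; the page string mirrors the question's example, and the slicing approach is my assumption rather than part of the answer above.

```python
import json

page = ('... html ...\n'
        '[{"url":"/test/test/url","id":"111111"},'
        '{"url":"/test/test/url","id":"111111"}, '
        '{"url":"/test/test/url","id":"1111"}]\n'
        '.... html ...')

# Slice from the first "[" to the last "]" (assumes no other brackets in the page).
start = page.find('[')
end = page.rfind(']') + 1
records = json.loads(page[start:end])

pairs = [(record['url'], record['id']) for record in records]
print(pairs)
```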
You should go with @Haidro's answer on this, but if you want to use a regex, or see how you would, then here's some sample code:
import re
regex = re.compile(r'"url":("[^"]+"),"id":("[^"]+")')
match = re.finditer(regex, yourString)
for m in match:
    print(m.group(1), m.group(2))
[^"] is a character class for accepting all non- " characters.
EDIT:
I love how I recommend the other answer, but explain how to do it if one really wants to know, yet I somehow still get downvoted.
