My idea was to explore Groupon's website to extract the URLs of the deals. The problem is that I'm trying to run a findall on Groupon's page to find data like this (from this page: http://www.groupon.de/alle-deals/muenchen/restaurant-296):
"category":"RESTAURANT1","dealPermaLink":"/deals/muenchen-special/Casa-Lavecchia/24788330", and I'd like to get the 'deals/muenchen-special/Casa-Lavecchia/24788330' part.
I tried the whole night but I'm unable to find a correct regex. I tried:
import urllib2
import re
Page_Web = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
for m in re.findall('category*RESATAURANT1*dealPermaLink*:?/*/*/*/*\d$',Page_Web):
    print m
But it doesn't print anything.
To extract the block that interests you, I would do it this way:
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
soup = BeautifulSoup(html)
scriptResults = soup('script',{'type' : 'text/javascript'})
js_block = scriptResults[12]
Starting from this, you can parse the block with a regex if you want, or try to interpret the JavaScript (there are some threads on Stack Overflow about that).
Anyway, like the others said, you should use groupon api...
P.S.
The block you are parsing can easily be read as a dictionary; if you look closely, it is already a list of dictionaries.
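As a rough sketch of that idea, the list of dictionaries can be cut out of the script text and handed to the json module. The `js_text` value below is a hypothetical stand-in for the real script block (the variable name `deals` and the exact layout are assumptions), built around the fragment quoted in the question:

```python
import json
import re

# Hypothetical script text; the real block on the page may be laid out differently.
js_text = (
    'var deals = [{"category":"RESTAURANT1",'
    '"dealPermaLink":"/deals/muenchen-special/Casa-Lavecchia/24788330"}];'
)

# Grab the outermost [...] literal and parse it as JSON.
match = re.search(r'\[.*\]', js_text, flags=re.DOTALL)
deals = json.loads(match.group(0)) if match else []

# Drop the leading slash so the result looks like the value asked for.
perma_links = [deal["dealPermaLink"].lstrip("/") for deal in deals]
print(perma_links)
```

This only works when the embedded literal is valid JSON; if the script uses JavaScript-only syntax, a regex over the text is the simpler fallback.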
How about changing RESATAURANT1 to RESTAURANT1, for starters?
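Beyond the typo, the original pattern treats `*` as a wildcard, whereas in a regex it only repeats the preceding character. A minimal sketch of a pattern that captures the link, run against the fragment quoted in the question (the variable names are illustrative):

```python
import re

# Fragment copied from the question; a real page would contain many of these.
page_fragment = (
    '"category":"RESTAURANT1",'
    '"dealPermaLink":"/deals/muenchen-special/Casa-Lavecchia/24788330",'
)

# Capture everything between the leading slash and the closing quote.
pattern = re.compile(r'"dealPermaLink":"/([^"]+)"')
links = pattern.findall(page_fragment)
print(links)
```

Anchoring on the key name and stopping at the next double quote avoids having to describe the path segments one by one.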
I am writing a little script to get my F@H user data from a basic HTML page.
I want to locate my username on that page and the numbers before and after it.
All the data I want is between two HTML <tr> and </tr> tags.
I am currently using this:
re.search(r'<tr>(.*?)</tr>', htmlstring)
I know this works for any substring, as all Google results for my question show. The difference here is that I need it only when that substring also contains a specific word.
However that only returns the first string between those two delimiters, not even all of them.
This pattern occurs hundreds of times on the page. I suspect it doesn't get them all because I'm not handling all the newline characters correctly but I'm not sure.
If it would return all of them, I could at least then sort them out to find one that contains my username going through each result.group(), but I can't even do that.
I have been fiddling with different regular expressions for ages now but, much to my frustration, can't figure out which one I need.
TL;DR -
I need a re.search() pattern that finds a substring between two words, that also contains a specific word.
If I understand correctly, something like this might work:
<tr>(?:(?:(?:(?!<\/tr>).)*?)\bWORD\b(?:.*?))<\/tr>
<tr> find "<tr>"
(?:(?:(?!<\/tr>).)*?) Find anything except "</tr>" as few times as possible
\bWORD\b find WORD
(?:.*?) find anything as few times as possible
<\/tr> find "</tr>"
Sample
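A small Python sketch of the pattern above, run on a made-up table (the rows and numbers are invented for illustration). It assumes re.DOTALL so that `.` also crosses newlines inside a row, and uses re.findall to collect every matching row rather than only the first:

```python
import re

# Invented sample table; the second row spans several lines on purpose.
html = """<table>
<tr><td>1</td><td>Alice</td><td>100</td></tr>
<tr>
<td>2</td><td>SteveMoody</td><td>250</td>
</tr>
<tr><td>3</td><td>Bob</td><td>75</td></tr>
</table>"""

word = "SteveMoody"

# (?:(?!</tr>).)*? walks forward without ever crossing a </tr>,
# so the match cannot start in one row and end in another.
pattern = re.compile(
    rf"<tr>((?:(?!</tr>).)*?\b{re.escape(word)}\b.*?)</tr>",
    flags=re.DOTALL,
)

rows = pattern.findall(html)
numbers = re.findall(r"\d+", rows[0]) if rows else []
print(rows)
print(numbers)
```

Only the row containing the word is returned; the numbers before and after the name can then be pulled out of that single row.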
There are a few ways to do it but I prefer the pandas way:
from urllib import request
import pandas as pd # you need to install pandas
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
web_tables = pd.read_html(web_request, attrs={'class': 'members'})  # read_html returns a list of DataFrames
web_df = web_tables[0].set_index(keys=['Name'])
# print(web_df)
user_name_to_find_in_table = 'SteveMoody'
user_name_df = web_df.loc[user_name_to_find_in_table]
print(user_name_df)
Beyond that, there are plenty of other ways to do this: BeautifulSoup's find method, CSS selectors, or re, as Peter suggests.
Using beautifulsoup and "find" method, and re, you can do it the following way:
import re
from bs4 import BeautifulSoup as bs # you need to install beautifulsoup
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
page_soup = bs(web_request, 'lxml') # need to install lxml and bs4(beautifulsoup for Python 3+)
user_name_to_find_in_table = 'SteveMoody'
row_tag = page_soup.find(
    lambda t: t.name == "td"
    and re.findall(user_name_to_find_in_table, t.text, flags=re.I)
).find_parent(name="tr")
print(row_tag.get_text().strip())  # strip() trims surrounding whitespace only
Using BeautifulSoup and CSS selectors (no re, only BeautifulSoup):
from bs4 import BeautifulSoup as bs # you need to install beautifulsoup
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
page_soup = bs(web_request, 'lxml') # need to install lxml and bs4(beautifulsoup for Python 3+)
user_name_to_find_in_table = 'SteveMoody'
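# newer Soup Sieve releases spell this pseudo-class :-soup-contains()
row_tag = page_soup.select_one(f'tr:has(> td:contains({user_name_to_find_in_table}))')
print(row_tag.get_text().strip())  # strip() trims surrounding whitespace only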
In your case I would favor the pandas example as you keep headers and can easily get other stats, and it runs very quickly.
Using re:
So far, the best input is Peter's comment, so I just adapted it to Python code (happy for it to be edited), as this solution doesn't need any extra libraries installed.
import re
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
user_name_to_find_in_table = 'SteveMoody'
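# \b on both sides matches the whole user name; re.escape guards any special characters in it
re_pattern = rf'<tr>(?:(?:(?:(?!<\/tr>).)*?)\b{re.escape(user_name_to_find_in_table)}\b(?:.*?))<\/tr>'
html_text = web_request.decode('utf-8', errors='replace')  # bytes -> str
res = re.search(pattern=re_pattern, string=html_text, flags=re.DOTALL)  # DOTALL lets . cross newlines
print(res.group(0))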
Helpful link on using variables in a regex: Stack Overflow
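When a variable is dropped into a pattern, any regex metacharacters in it are interpreted unless they are escaped. A short sketch of the difference, using an invented user name that contains a dot:

```python
import re

# Invented data: the dot in the user name is a regex metacharacter.
user_name = "Steve.Moody"
text = "SteveXMoody 10, Steve.Moody 20"

# Without escaping, "." matches any character, so both names are found.
unsafe = re.compile(rf"\b{user_name}\b")
# re.escape turns the dot into a literal dot.
safe = re.compile(rf"\b{re.escape(user_name)}\b")

unsafe_hits = unsafe.findall(text)
safe_hits = safe.findall(text)
print(unsafe_hits)
print(safe_hits)
```

The `rf` prefix combines a raw string (so `\b` stays a word boundary) with f-string interpolation.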
I have the complete HTML of a page, and from it I need to find the page's GA (Google Analytics) ID. For example:
<script>ga('create', 'UA-4444444444-1', 'auto');</script>
From above string I need to get UA-4444444444-1, which starts from "UA-" and ends with "-1". I have tried this:
re.findall(r"\"trackingId\"\s?:\s?\"(UA-\d+-\d+)\"", raw_html)
but didn't get any success. Please let me know what mistake I am making.
Thanks
It seems that you are overthinking it; you could just search for the UA token directly:
re.findall(r"UA-\d+-\d+", raw_html)
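The pattern in the question looks for a "trackingId" key, which the sample snippet does not contain, so it cannot match there. A quick check of both patterns against the sample string from the question:

```python
import re

# Sample string copied from the question.
raw_html = "<script>ga('create', 'UA-4444444444-1', 'auto');</script>"

# The original attempt requires a "trackingId": "..." pair.
original_hits = re.findall(r"\"trackingId\"\s?:\s?\"(UA-\d+-\d+)\"", raw_html)
# Searching for the token itself does not depend on the surrounding syntax.
tracking_ids = re.findall(r"UA-\d+-\d+", raw_html)

print(original_hits)
print(tracking_ids)
```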
Never use regex to parse through the HTML itself. BeautifulSoup is fine for extracting text from tags. Here we extract the script tags from the HTML, then apply a regex to the text inside them.
import re
from bs4 import BeautifulSoup as bs4
html = "<script>ga('create', 'UA-4444444444-1', 'auto');</script>"
soup = bs4(html, 'lxml')
pattern = re.compile("UA-[0-9]+-[0-9]+")
ids = []
for i in soup.find_all("script"):
    # extend() also copes with script tags that contain no tracking ID
    ids.extend(pattern.findall(i.text))
print(ids)
This is what I'm trying to scrape:
<p>Some.Title.html<br />
https://www.somelink.com/yep.html<br />
Some.Title.txt<br />
https://www.somelink.com/yeppers.txt<br />
I have tried several variations of the following:
match = re.compile('^(.+?)<br \/><a href="https://www.somelink.com(.+?)">',re.DOTALL).findall(html)
I am looking to match lines with the "p" tag and without. "p" tag only occurs on the first instance. Terrible at python so I am pretty rusty, have searched through here and google and nothing seemed to be quite the same. Thanks for any help. Really do appreciate the help I get here when I am stuck.
Desired output is an index:
http://www.SomeLink.com/yep.html
http://www.SomeLink.com/yeppers.txt
As the commenters noted above, the Beautiful Soup and requests modules are a better fit for something like this than regex.
import requests
import bs4
html_site = 'https://www.google.com'  # or whatever site you need scraped; requests needs the scheme
site_data = requests.get(html_site)  # downloads the site into a Response object
site_parsed = bs4.BeautifulSoup(site_data.text, 'html.parser')  # converts the site text into a bs4 object
a_tags = site_parsed.select('a')  # selects all 'a' tags and returns a list of them
This is just simple code that selects all the 'a' tags on the site and stores them in a list; each tag's href attribute then gives a link in the format you illustrated above. I'd also advise working through a bs4 tutorial and reading the official docs.
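For comparison, here is a dependency-free sketch of the same idea using the standard library's html.parser instead of bs4. The `LinkCollector` class and the sample markup are illustrative; the markup assumes the links in the question are wrapped in 'a' tags, as the regex attempt there suggests:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Assumed markup, modelled on the snippet in the question.
sample_html = (
    '<p>Some.Title.html<br />\n'
    '<a href="https://www.somelink.com/yep.html">https://www.somelink.com/yep.html</a><br />\n'
    'Some.Title.txt<br />\n'
    '<a href="https://www.somelink.com/yeppers.txt">https://www.somelink.com/yeppers.txt</a><br />\n'
)

collector = LinkCollector()
collector.feed(sample_html)
collector.close()
links = collector.links
print(links)
```

Because the parser reacts to tags rather than line layout, it does not matter that the "p" tag appears only before the first entry.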
I am new to Python (using 2.7.3). I was trying to do web scraping with Python, but I am not getting the expected output:
import urllib
import re
regex='<title>(.+?)<\title>'
pattern=re.compile(regex)
dummy="fsdfsdf<title>Test<\title>dsf"
html=urllib.urlopen('http://www.google.com')
text=html.read()
print pattern.findall(text)
print pattern.findall(dummy)
The second print statement works fine, but the first one should print Google and instead gives an empty list.
Try changing:
regex='<title>(.+?)<\title>'
to
regex='<title>(.+?)</title>'
You mistyped the slash:
regex='<title>(.+?)<\title>'
should be:
regex='<title>(.+?)</title>'
HTML uses a forward slash in closing tags.
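A short sketch of why the dummy string in the question matched anyway: in a normal (non-raw) string literal, `\t` is a tab character, so both the pattern and the dummy string contained the same tab, while real HTML does not. Tested here against a correctly closed tag:

```python
import re

# A correctly closed title tag, as real HTML would have it.
dummy = "fsdfsdf<title>Test</title>dsf"

# "\t" in a non-raw literal is a tab, so this looks for "<", a tab, then "itle>".
broken = re.compile('<title>(.+?)<\title>')
# Forward slash, written as a raw string to keep backslashes literal in general.
fixed = re.compile(r'<title>(.+?)</title>')

broken_matches = broken.findall(dummy)
fixed_matches = fixed.findall(dummy)
print(broken_matches)
print(fixed_matches)
```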
That said, don't use regular expressions to parse HTML. Matching HTML with such expressions gets too complicated, too fast.
Use an HTML parser instead; Python has several to choose from. I recommend BeautifulSoup, a popular third-party library.
BeautifulSoup example:
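import urllib
from bs4 import BeautifulSoup
url = 'http://www.google.com'  # the page from the question
response = urllib.urlopen(url)  # Python 2 API, matching the question
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text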
I am new to lxml and want to extract <p>PARAGRAPHS</p> and <li>PARAGRAPHS</li> from a given url and use them for further steps.
I followed an example from a post, and tried the following code with no luck:
html = lxml.html('http://www.google.com/intl/en/about/corporate/index.html')
url = 'http://www.google.com/intl/en/about/corporate/index.html'
print html.parse.xpath('//p/text()')
I tried to look into the examples in lxml.html, but didn't find any example using url.
Could you give me any hint on what methods should I use? Thanks.
import lxml.html
htmltree = lxml.html.parse('http://www.google.com/intl/en/about/corporate/index.html')
print htmltree.xpath('//p/text()')