I am trying to scrape all the URLs from a website that meet a certain criterion. My code so far is as follows:
import pandas as pd
from urllib.request import urlopen
import lxml.html
links = []
connection = urlopen("http://www.open.ac.uk/courses/modules")
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'):
    links.append(link)
This is getting me the URLs in a list. However, I only want the ones that end with /[some letters][3 numbers]. I have the following expression which works at www.regex101.com:
\/[a-z]*[0-9][0-9][0-9]
Ideally I would like to amend the scrape so it only returns the required information. How can I use the expression on the list to filter it?
I have found a few things that kind of answer my question but nothing that is the same as my problem.
An example of the data I am getting is
/courses/modules/tm352
/courses/modules/a332
/courses/modules/ke322
/courses/modules/e318
/postgraduate
#int-site
http://www.open.ac.uk/contact/
http://www2.open.ac.uk/tutors/help/who-to-contact
http://www.open.ac.uk/about/employment/
http://www.open.ac.uk/about/main/management/policies-and-
statements/website-accessibility-open-university
http://www.open.ac.uk/wales/cy
Of those, the first 4 lines match what I want; the rest do not.
try this:
df = pd.DataFrame(links)
df[0] = df[0].str.extract(r'(.+[A-Za-z]+\d\d\d$)', expand=False)
df.loc[df[0].notnull()]
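If you would rather filter the plain list directly, the question's own pattern works with the re module. A minimal sketch (I've anchored it with $ and used + instead of *, so only URLs that end in at least one letter plus three digits are kept):
import re
# keep only links whose path ends in letters followed by three digits,
# e.g. /courses/modules/tm352
pattern = re.compile(r'/[a-z]+\d{3}$')
filtered = [link for link in links if pattern.search(link)]
print(filtered)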
From this link: https://www.basketball-reference.com/teams/MIA/2022.html
I want to read the Shooting table.
I use this code:
import pandas as pd
url="https://www.basketball-reference.com/teams/MIA/2022.html"
pd.read_html(url,match="Shooting")
But it says: ValueError: No tables found matching pattern 'Shooting'.
If I try pd.read_html(url, match="Roster") or pd.read_html(url, match="Totals"), it finds those tables.
It's the second table that you want to read. You can simply do:
import pandas as pd
url="https://www.basketball-reference.com/teams/MIA/2022.html"
pd.read_html(url)[1]
I've discovered that the HTML commented out inside each div#all_* is identical to the actual scoring tables' content. So it looks like the tables are somehow generated from the comments using JavaScript after the page loads. Obviously it's some kind of scraping protection.
Screenshots of what I mean (for the Shooting section you want to get):
Well, the only solution I see for now is to load the whole HTML of the page first, then strip the HTML comment markers with replace (so the commented-out tables become plain markup), and finally get the table you want using pandas:
import requests
import pandas as pd
url = "https://www.basketball-reference.com/teams/MIA/2022.html"
req = requests.get(url)
req = req.text.replace('<!--', '')
# req = req.replace('-->', '') # not necessary in this case
pd.read_html(req, match="Shooting")
Since the HTML no longer contains the comment markers, I recommend getting the tables by index.
For Shooting - Regular Season tab:
pd.read_html(req)[15]
and for Shooting - Playoffs tab:
pd.read_html(req)[16]
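Alternatively, the commented-out block itself can be parsed instead of de-commenting the whole page. A sketch assuming BeautifulSoup is installed and that the hidden Shooting table's id is "shooting" (that id is an assumption; check the page source if it differs):
import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment
url = "https://www.basketball-reference.com/teams/MIA/2022.html"
soup = BeautifulSoup(requests.get(url).text, "lxml")
# each hidden table lives inside an HTML comment; find the comment
# containing the Shooting table and feed just that to read_html
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if 'id="shooting"' in comment:
        shooting = pd.read_html(str(comment))[0]
        print(shooting.head())
        break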
pd.read_html() isn't finding all the table tags. Only 7 are being returned.
Roster, Per Game, Totals, Advanced and 3 others. Shooting is not among them so pd.read_html(url,match="Shooting") is going to give you an error.
import pandas as pd
url = 'https://www.basketball-reference.com/teams/MIA/2022.html'
x = pd.read_html(url)
print(len(x)) #7
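To see which tables those 7 are, a quick peek at each one's columns helps (a sketch; the exact headers depend on the page):
# inspect what read_html actually returned
for i, table in enumerate(x):
    print(i, list(table.columns)[:4])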
I am writing a little script to get my F@H (Folding@home) user data from a basic HTML page.
I want to locate my username on that page and the numbers before and after it.
All the data I want is between two HTML <tr> and </tr> tags.
I am currently using this:
re.search(r'<tr>(.*?)</tr>', htmlstring)
I know this works for any substring, as all Google results for my question show. The difference here is that I need it only when that substring also contains a specific word.
However that only returns the first string between those two delimiters, not even all of them.
This pattern occurs hundreds of times on the page. I suspect it doesn't get them all because I'm not handling all the newline characters correctly but I'm not sure.
If it returned all of them, I could at least go through each result.group() to find the one that contains my username, but I can't even do that.
I have been fiddling with different regex expressions for ages now but can't figure out which one I need, much to my frustration.
TL;DR -
I need a re.search() pattern that finds a substring between two words, that also contains a specific word.
If I understand correctly, something like this might work:
<tr>(?:(?:(?:(?!<\/tr>).)*?)\bWORD\b(?:.*?))<\/tr>
<tr> find "<tr>"
(?:(?:(?!<\/tr>).)*?) Find anything except "</tr>" as few times as possible
\bWORD\b find WORD
(?:.*?) find anything, as few times as possible
<\/tr> find "</tr>"
Sample
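Here is a minimal, self-contained check of that pattern (hypothetical HTML and username; note re.S so '.' can also cross newlines inside a row on pages where rows span lines, and re.findall to get every match rather than just the first):
import re
html = """
<table>
<tr><td>Alice</td><td>100</td></tr>
<tr><td>SteveMoody</td><td>250</td></tr>
</table>
"""
# only non-capturing groups are used, so findall returns full matched rows
pattern = r'<tr>(?:(?:(?!</tr>).)*?)\bSteveMoody\b(?:.*?)</tr>'
print(re.findall(pattern, html, flags=re.S))
# ['<tr><td>SteveMoody</td><td>250</td></tr>']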
There are a few ways to do it but I prefer the pandas way:
from urllib import request
import pandas as pd # you need to install pandas
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
tables = pd.read_html(web_request, attrs={'class': 'members'})  # read_html returns a list of DataFrames
web_df = tables[0].set_index(keys=['Name'])
# print(web_df)
user_name_to_find_in_table = 'SteveMoody'
user_name_df = web_df.loc[user_name_to_find_in_table]
print(user_name_df)
Then there are plenty of ways to do this: using just BeautifulSoup's find or CSS selectors, or maybe re as Peter suggests.
Using BeautifulSoup's find method together with re, you can do it the following way:
import re
from bs4 import BeautifulSoup as bs # you need to install beautifulsoup4
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
page_soup = bs(web_request, 'lxml') # need to install lxml and bs4(beautifulsoup for Python 3+)
user_name_to_find_in_table = 'SteveMoody'
row_tag = page_soup.find(
    lambda t: t.name == "td"
    and re.findall(user_name_to_find_in_table, t.text, flags=re.I)
).find_parent(name="tr")
print(row_tag.get_text().strip('tr'))
Using BeautifulSoup and CSS selectors (no re, just BeautifulSoup):
from bs4 import BeautifulSoup as bs # you need to install beautifulsoup
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
page_soup = bs(web_request, 'lxml') # need to install lxml and bs4(beautifulsoup for Python 3+)
user_name_to_find_in_table = 'SteveMoody'
row_tag = page_soup.select_one(f'tr:has(> td:contains({user_name_to_find_in_table})) ')
print(row_tag.get_text().strip('tr'))
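One caveat on that selector: newer versions of Soup Sieve (the selector engine BeautifulSoup uses) deprecate :contains() in favour of :-soup-contains(), so if the select_one line warns or stops matching, this variant should behave the same:
row_tag = page_soup.select_one(f'tr:has(> td:-soup-contains({user_name_to_find_in_table}))')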
In your case I would favor the pandas example as you keep headers and can easily get other stats, and it runs very quickly.
Using Re:
So far, the best input is Peter's comment above, so I just adapted it to Python code (happy to have it edited), as this solution doesn't need any extra libraries installed.
import re
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
user_name_to_find_in_table = 'SteveMoody'
re_pattern = rf'<tr>(?:(?:(?:(?!<\/tr>).)*?)\b{user_name_to_find_in_table}\b(?:.*?))<\/tr>'
res = re.search(pattern=re_pattern, string=web_request.decode(), flags=re.S)
print(res.group(0))
Helpful link on using variables in regex: Stack Overflow.
I am trying to scrape a list of dates from: https://ca.finance.yahoo.com/quote/AAPL/options
The dates are located within a drop down menu right above the option chain. I've scraped text from this website before but this text is using a 'select' & 'option' syntax. How would I adjust my code to gather this type of text? I have used many variations of the code below to try and scrape the text but am having no luck.
Thank you very much.
from bs4 import BeautifulSoup
import requests
datesLink = ('https://ca.finance.yahoo.com/quote/AAPL/options')
datesPage = requests.get(datesLink)
datesSoup = BeautifulSoup(datesPage.text, 'lxml')
datesQuote = datesSoup.find('div', {'class': 'Cf Pt(18px)controls'}).find('option').text
The reason you can't seem to extract this dropdown list is that it is generated dynamically. The easiest way to confirm this is to save your HTML content into a file and give it a manual look in a text editor.
You CAN, however, parse those dates out of the script source code, which is in the same HTML file, using some ugly regex. For example, this seems to work:
import requests, re
from datetime import datetime, timezone
content = requests.get('https://ca.finance.yahoo.com/quote/AAPL/options').content.decode()
match = re.search(r'"OptionContractsStore".*?"expirationDates".*?\[(.*?)\]', content)
dates = [datetime.fromtimestamp(int(x), tz=timezone.utc) for x in match.group(1).split(',')]
for d in dates:
    print(d.strftime('%Y-%m-%d'))
It should be obvious that parsing stuff in such a nasty way isn't fool-proof and is likely to break sooner rather than later. But the same can be said about web scraping in general.
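For contrast, here is what the straightforward select/option approach from the question would look like; a sketch assuming requests and BeautifulSoup, included mainly to show why it comes back empty on this page:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://ca.finance.yahoo.com/quote/AAPL/options')
soup = BeautifulSoup(page.text, 'lxml')
# the expiry <select> is injected by JavaScript after load, so the static
# HTML served to requests has no <option> elements to read
options = [o.get_text() for o in soup.select('select option')]
print(options)  # likely empty, which is why the regex route above digs
                # the dates out of the embedded script data instead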
You can simply read HTML directly to Pandas:
import pandas as pd
URI = 'https://ca.finance.yahoo.com/quote/AAPL/options'
df = pd.read_html(URI)[0] #[1] depending on the table you wish for
My idea was to explore Groupon's website to extract the URLs of the deals. The problem is that I'm trying to do a findall on the Groupon page to find data like this (from this page: http://www.groupon.de/alle-deals/muenchen/restaurant-296):
"category":"RESTAURANT1","dealPermaLink":"/deals/muenchen-special/Casa-Lavecchia/24788330", and I'd like to get 'deals/muenchen-special/Casa-Lavecchia/24788330'.
I tried the whole night but I'm unable to find a correct regex. I tried:
import urllib2
import re
Page_Web = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
for m in re.findall('category*RESATAURANT1*dealPermaLink*:?/*/*/*/*\d$', Page_Web):
    print m
But it doesn't print anything.
In order to extract the block that interests you, I would do it this way:
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
soup = BeautifulSoup(html)
scriptResults = soup('script',{'type' : 'text/javascript'})
js_block = scriptResults[12]
Starting from this you can parse with a regex if you want, or try to interpret the JS (there are some threads on Stack Overflow about that).
Anyway, like the others said, you should use the Groupon API...
P.S.
The block that you are parsing can easily be read as a dictionary; it is already a list of dictionaries if you look closely...
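For instance, a sketch of that dictionary route (js_block comes from the snippet above; the exact shape of the script is an assumption, so the regex may need adjusting):
import json
import re
# hypothetical: pull a JSON array out of the script block and load it
m = re.search(r'=\s*(\[.*?\])\s*;', js_block.text, flags=re.S)
if m:
    deals = json.loads(m.group(1))
    for deal in deals:
        print(deal.get('dealPermaLink'))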
How about changing RESATAURANT1 to RESTAURANT1, for starters?
I am new to lxml and want to extract <p>PARAGRAPHS</p> and <li>PARAGRAPHS</li> from a given url and use them for further steps.
I followed an example from a post, and tried the following code with no luck:
html = lxml.html('http://www.google.com/intl/en/about/corporate/index.html')
url = 'http://www.google.com/intl/en/about/corporate/index.html'
print html.parse.xpath('//p/text()')
I tried to look into the examples in lxml.html, but didn't find any that used a URL.
Could you give me a hint on which methods I should use? Thanks.
import lxml.html
htmltree = lxml.html.parse('http://www.google.com/intl/en/about/corporate/index.html')
print htmltree.xpath('//p/text()')
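Since the question also asked for <li> text, the same parsed tree covers both; a minimal sketch:
import lxml.html
url = 'http://www.google.com/intl/en/about/corporate/index.html'
htmltree = lxml.html.parse(url)
# text() returns the direct text nodes of each matched element
paragraphs = htmltree.xpath('//p/text()')
list_items = htmltree.xpath('//li/text()')
print(paragraphs)
print(list_items)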