Scraping a website with data hidden under "read more" - python

I am trying to scrape reviews from Tripadvisor.com and I want to get the data under the 'Read More' button of the site. Is there any way to scrape this without using Selenium?
So far, this is the code that I have used:
import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.tripadvisor.com.ph/Hotel_Review-g8762949-d1085145-Reviews-El_Rio_y_Mar_Resort-San_Jose_Coron_Busuanga_Island_Palawan_Province_Mimaropa.html#REVIEWS')
rsp_soup = BeautifulSoup(resp.text, 'html.parser')
rsp_soup.findAll(attrs={"class": "hotels-review-list-parts-ExpandableReview__reviewText--3oMkH"})
But it can't scrape the contents hidden behind the 'Read more' button.

Reviews are only partially present in the HTML; clicking 'Read more' does not make an Ajax call but updates the page from data contained in window.__WEB_CONTEXT__. You can access this data by looking at the <script> tag in which it appears:
<script>
window.__WEB_CONTEXT__={pageManifest:{"assets":["/components/dist/#ta/platform.polyfill.084d8cdf5f.js","/components/dist/runtime.56c5df2842.js", .... }
</script>
Once you've got it, you can extract and process the data, which is in JSON format. Here is the full code:
import json
import re

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.tripadvisor.com.ph/Hotel_Review-g8762949-d1085145-Reviews-El_Rio_y_Mar_Resort-San_Jose_Coron_Busuanga_Island_Palawan_Province_Mimaropa.html#REVIEWS')
data = BeautifulSoup(resp.content, 'html.parser').find('script', text=re.compile('window.__WEB_CONTEXT__')).text

# Some text processing to make the tag content valid JSON
pageManifest = json.loads(data.replace('window.__WEB_CONTEXT__=', '').replace('{pageManifest:', '{"pageManifest":')[:-1])

reviews = []
for x in pageManifest['pageManifest']['apolloCache']:
    try:
        reviews = x['result']['locations'][0]['reviewList']['reviews']
    except (KeyError, IndexError, TypeError):
        pass

print([x['text'] for x in reviews])
Output
['Do arrange for airport transfers! From the airport, you will be taking a van for around 20 minutes, then you\'ll be transferred to a banca/boat for a 25 minute ride to the resort. Upon arrival, you\'ll be greeted by a band that plays their "welcome, welcome" song and in our case, we were met by Maria (awesome gal!) who introduced the group to the resort facilities and checks you in at the bar.I booked a deluxe room, which is actually a duplex with 2 adjoining rooms, ideal
for families, which accommodates 4 to a room.Rooms are clean and bed is comfortable.Potable water is provided upon check in , but is chargeable thereafter.Don\'t worry, ...FULL REVIEW...',
"Stayed with my wife and 2 children, 10y and 13y. ...FULL REVIEW...",
'Beginning at now been in Coron for a couple of ...FULL REVIEW...',
'This was the most beautiful and relaxing place ...FULL REVIEW...',
'We spent 2 nights at El rio. It was incredible, ...FULL REVIEW... ']

In general, no. It all depends on what happens when you hit "Read More", i.e. where the actual data is.
There are usually two possibilities (not mutually exclusive):
the data lies in the same page, hidden, and the "read more" is e.g. a label for a hidden checkbox that, when selected, hides the "read more" span and makes the rest of the text appear. This way the page as displayed is smaller and more readable, yet everything is loaded in the same call. In that case you just need to find a suitable selector (it could be for example #someotherselector+input[type=checkbox] ~ div.moreText or something like that; see the sketch after this list).
the data is not there, it will be loaded via AJAX after some time, remaining hidden, or only when you click on the "read more", to be displayed then. This allows keeping a small page that loads quickly and yet contains lots of items that would load slowly, loading them in background or on demand. In this case you need to inspect the actual AJAX call (it usually carries along an id or a data-value held in the 'Load More...' element: <span class="loadMore" data-text-id="x19834">Read more...</span>) and issue the same call with the appropriate headers:
resp2 = requests.get('https://www.tripadvisor.com.ph/whatever/api/is/used?id=' + element.attr('data-text-id'))
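For the first case, a minimal BeautifulSoup sketch might look like the following; the URL and the selector here are purely hypothetical and would have to be adapted to whatever markup the real page uses:
import requests
from bs4 import BeautifulSoup

# Hypothetical page whose full text is already in the HTML, only collapsed by CSS
resp = requests.get('https://example.com/some-article')
soup = BeautifulSoup(resp.text, 'html.parser')

# The hidden text is in the document, so a plain CSS selector reaches it
for hidden in soup.select('input[type=checkbox] ~ div.moreText'):
    print(hidden.get_text(strip=True))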
Without knowing how the data is retrieved and where the relevant elements (e.g. the name and content of the id-carrying attribute, etc.) are, it is not possible to give an answer that will work every time.
You might also want to consider whether you should be doing this at all: the data you're scraping is protected by copyright, and TripAdvisor might change things enough that you'll have trouble maintaining the scraper.

Related

Can I search multiple HTML elements within the soup.find_all() function?

I'm trying to scrape a website for the most viewed headlines. The class selector of the text I want shares common words with other items on the page. For example, I want the text inside <a> tags with the class "black_color". Other items use the same tag but have the class "color_black hover_color_gray_90", and I don't want these included. I was thinking I could use more HTML elements to be more specific, but I'm not sure how to incorporate them.
import requests
from bs4 import BeautifulSoup

def getHeadlines():
    url = "https://www.bostonglobe.com/"
    source_code = requests.get(url)
    plainText = source_code.text
    soup = BeautifulSoup(plainText, "html.parser")
    #results = soup.find_all("h2", {"class": "headline"})
    results = soup.find_all("a", {"class": "black_color"})
    with open("headlines.txt", "w", encoding="utf-8") as f:
        for i in results:
            f.write(str(i.text + ' \n' + '\n'))

getHeadlines()
I think looking at the <a> tag may actually be harder than using the matching <h2>, which has a 'headline' class.
Try this:
soup = BeautifulSoup(source_code.text, "html.parser")
for headline in soup.find_all("h2", class_="headline"):
    print(headline.text)
Output:
Boston College outbreak worries epidemiologists, students, community
Nantucket finds ‘community spread’ of COVID-19 among tradespeople The town’s Select Board and Board of Health will hold an emergency meeting at 10 a.m. Monday to consider placing restrictions on some of those trades, officials said Friday.
Weddings in a pandemic: Welcome to the anxiety vortexNewlyweds are spending their honeymoons praying they don’t hear from COVID‐19 contact tracers. Relatives are agonizing over “damned if we RSVP yes, damned if we RSVP no” decisions. Wedding planners are adding contract clauses specifying they’ll walk off the job if social distancing rules are violated.
Fauci says US should plan to ‘hunker down’ for fall and winter
Of struggling area, Walsh says, ‘We have to get it better under control’
...
After looking at it for a while, I think the <a> tag may actually have some classes added dynamically, which BeautifulSoup will not pick up. Just searching for the color_black class and excluding the color_black hover_color_gray_90 class still yields links you don't want (e.g. 'Sports'), even though when I look at the actual web page source code, I see it's differentiated in the way you've indicated.
That's usually a good sign that there are post-load CSS changes being made to the page. (I may be wrong about that, but in either case I hope the <h2> approach gets you what you need.)
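If you still want to work from the <a> tags, one option is to require the exact class list, so that anchors carrying extra classes such as hover_color_gray_90 are dropped. This is only a sketch built from the class names quoted in the question, and it will still miss anything the site adds to the markup after page load:
import requests
from bs4 import BeautifulSoup

source_code = requests.get("https://www.bostonglobe.com/")
soup = BeautifulSoup(source_code.text, "html.parser")

# Keep only anchors whose class attribute is exactly ["black_color"],
# excluding elements that also carry additional classes.
headlines = [a for a in soup.find_all("a", class_="black_color")
             if a.get("class") == ["black_color"]]
for a in headlines:
    print(a.get_text(strip=True))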

Python, extract text from webpage

I am working on a project where I am crawling thousands of websites to extract text data, the end use case is natural language processing.
EDIT: since I am crawling hundreds of thousands of websites I cannot tailor scraping code to each one, which means I cannot search for specific element IDs; the solution I am looking for is a general one.
I am aware of solutions such as the .get_text() function from Beautiful Soup. The issue with this method is that it gets all the text from the website, much of it irrelevant to the main topic of that particular page. For the most part a page will be dedicated to a single main topic, but on the sides, top and bottom there may be links or text about other subjects, promotions or other content.
The .get_text() function returns all the text on the page in one go; the problem is that it lumps the relevant parts together with the irrelevant ones. Is there another function similar to .get_text() that returns the text as a list, where every list item is a specific section of the text, so that it can be known where new subjects start and end?
As a bonus, is there a way to identify the main body of text on a web page?
Below are some snippets that you could use to query the data in the desired way using BeautifulSoup4 and Python 3:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://yoursite/page')
soup = BeautifulSoup(response.text, 'html.parser')

# Print the first child of the body
print(soup.body.contents[0])

# Print the first div found on the page
print(soup.find('div'))

# Print all divs on the page as a list
print(soup.find_all('div'))

# Print the element with id 'required_element_id'
print(soup.find(id='required_element_id'))

# Print all elements matching a CSS selector, as a list
print(soup.select('your-css-selector'))

# Print the value of an attribute
print(soup.find(id='someid').get("attribute-name"))

# You can also break one large query into multiple queries
parent = soup.find(id='someid')

# getText() returns the text between the opening and closing tags
print(parent.select(".some-class")[0].getText())
For more advanced requirements, you can check out Scrapy as well. Let me know if you face any challenges implementing this or if your requirement is something else.
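If what you are after is text broken up by section rather than one big blob, a rough, site-agnostic sketch is to strip the obvious non-content tags and then collect text per block-level element. This is only a heuristic, not a reliable way to isolate the "main body", but it does give you separable chunks:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://yoursite/page')
soup = BeautifulSoup(response.text, 'html.parser')

# Remove elements that are rarely part of the main content
for tag in soup(['script', 'style', 'nav', 'header', 'footer', 'aside']):
    tag.decompose()

# One list entry per block-level element, so section boundaries are preserved
sections = [el.get_text(' ', strip=True)
            for el in soup.find_all(['p', 'h1', 'h2', 'h3', 'li'])
            if el.get_text(strip=True)]
print(sections)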

Scrapy + Python, Error in Finding links from a website

I am trying to find the URLs of all the events of this page:
https://www.eventshigh.com/delhi/food?src=exp
But I can see the URL only in a JSON format:
{
  "@context":"http://schema.org",
  "@type":"Event",
  "name":"DANDIYA NIGHT 2018",
  "image":"https://storage.googleapis.com/ehimages/2018/9/4/img_b719545523ac467c4ad206c3a6e76b65_1536053337882_resized_1000.jpg",
  "url":"https://www.eventshigh.com/detail/Delhi/5b30d4b8462a552a5ce4a5ebcbefcf47-dandiya-night-2018",
  "eventStatus": "EventScheduled",
  "startDate":"2018-10-14T18:30:00+05:30",
  "doorTime":"2018-10-14T18:30:00+05:30",
  "endDate":"2018-10-14T22:30:00+05:30",
  "description" : "Dress code : TRADITIONAL (mandatory)\u00A0 \r\n Dandiya sticks will be available at the venue ( paid)\u00A0 \r\n Lip smacking food, professional dandiya Dj , media coverage , lucky draw \u00A0, Dandiya Garba Raas , Shopping and Games .\u00A0 \r\n \u00A0 \r\n Winners\u00A0 \r\n \u00A0 \r\n Best dress ( all",
  "location":{
    "@type":"Place",
    "name":"K And L Community Hall (senior Citizen Complex )",
    "address":"80 TO 49, Pocket K, Sarita Vihar, New Delhi, Delhi 110076, India"
  },
Here it is:
"url":"https://www.eventshigh.com/detail/Delhi/5b30d4b8462a552a5ce4a5ebcbefcf47-dandiya-night-2018"
But I cannot find any other HTML/XML tag which contains the links. Also I cannot find the corresponding JSON file which contains the links. Could you please help me to scrape the links of all events of this page:
https://www.eventshigh.com/delhi/food?src=exp
Gathering information from a JavaScript-powered page like this one may look daunting at first, but it is often more productive, because all the information is in one place instead of scattered across a lot of expensive HTTP-request lookups.
So when a page gives you JSON data like this, you can thank them by being nice to the server and using it! :)
With a little time invested in the "source-view analysis" you have already done, this will also be more efficient than trying to get the information through an (expensive) Selenium/Splash/etc. render pipeline.
The tool that is invaluable to get there, is XPath. Sometimes a little additional help from our friend regex may be required.
Assuming you have successfully fetched the page, and have a Scrapy response object (or you have a Parsel.Selector() over an otherwise gathered response-body), you will be able to access the xpath() method as response.xpath or selector.xpath:
>>> response.status
200
You have determined the data exists as plain text (json), so we need to drill down to where it hides, to ultimately extract the raw JSON content.
After that, converting it to a Python dict for further use will be trivial.
In this case it's inside a container node <script type="application/ld+json">. Our XPath for that could look like this:
>>> response.xpath('//script[@type="application/ld+json"]')
[<Selector xpath='//script[@type="application/ld+json"]' data='<script type="application/ld+json">\n{\n '>,
 <Selector xpath='//script[@type="application/ld+json"]' data='<script type="application/ld+json">\n{\n '>,
 <Selector xpath='//script[@type="application/ld+json"]' data='<script type="application/ld+json">\n '>]
This will find every script node in the HTML page which has a "type" attribute with the value "application/ld+json".
Apparently that is not specific enough, since we find three nodes (Selector-wrapped in our returned list).
From your analysis we know that our JSON must contain "@type":"Event", so let our XPath do a little substring search for that:
>>> response.xpath("""//script[@type="application/ld+json"]/self::node()[contains(text(), '"@type":"Event"')]""")
[<Selector xpath='//script[@type="application/ld+json"]/self::node()[contains(text(), \'"@type":"Event"\')]' data='<script type="application/ld+json">\n '>]
Here we added a second qualifier which says our script node must contain the given text.
(The 'self::node()' shows some XPath axes magic to reference back to our current script node at this point, instead of its descendants. We will simplify this, though.)
Now our return list contains a single node/Selector. As we see from the data= string, if we were to extract() this, we would now
get some string like <script type="application/ld+json">[...]</script>.
Since we care about the content of the node, but not the node itself, we have one more step to go:
>>> response.xpath("""//script[@type="application/ld+json"][contains(text(), '"@type":"Event"')]/text()""")
[<Selector xpath='//script[@type="application/ld+json"][contains(text(), \'"@type":"Event"\')]/text()' data='\n [\n \n \n '>]
And this returns (a SelectorList of) our target text(). As you may see we could also do away with the self-reference.
Now, xpath() always returns a SelectorList, but we have a little helper for this: response.xpath().extract_first() will grab the list's first element (checking that it exists) before processing it.
We can put this result into a data variable, after which it's simple to json.loads(data) this into a Python dictionary and look up our values:
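For example, reusing the XPath built above:
>>> data = response.xpath("""//script[@type="application/ld+json"][contains(text(), '"@type":"Event"')]/text()""").extract_first()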
>>> events = json.loads(data)
>>> [item['url'] for item in events]
['<url>',
'<url>',
'<url>',
'<url>']
Now you can turn them into scrapy.Request(url)s, and you'll know how to continue from there.
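Putting the pieces together, a minimal spider sketch might look something like this (the spider name and the parse_event callback are made up for illustration; the XPath is the one developed above):
import json

import scrapy

class EventsSpider(scrapy.Spider):
    name = 'eventshigh'
    start_urls = ['https://www.eventshigh.com/delhi/food?src=exp']

    def parse(self, response):
        # The ld+json block that contains the event data
        data = response.xpath(
            """//script[@type="application/ld+json"][contains(text(), '"@type":"Event"')]/text()"""
        ).extract_first()
        events = json.loads(data)
        for item in events:
            yield scrapy.Request(item['url'], callback=self.parse_event)

    def parse_event(self, response):
        # Continue scraping the individual event page here
        pass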
As always, crawl responsibly and keep the 'net a nice place to be. I do not endorse any unlawful behavior.
Assessing one's rights, or gaining permission to access the specified target resource, is one's own responsibility.

How to scrape paginated table

I am trying to scrape all the data from the table on this website (https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s) but can't seem to figure out how I would go about scraping all of the subsequent pages. This is the code to scrape the first page of results into a CSV file:
import csv
import requests
from bs4 import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, "html.parser")

fileList = []

# For the table-header cells
tableHeader = soup.find('tr', attrs={'class': 'table-header'})
rowList = []
for cell in tableHeader.findAll('th'):
    cellText = cell.text.replace(' ', '').replace('\n', '')
    rowList.append(cellText)
fileList.append(rowList)

# For the table body cells
table = soup.find('tbody', attrs={'class': 'stripe'})
for row in table.findAll('tr'):
    rowList = []
    for cell in row.findAll('td'):
        cellText = cell.text.replace(' ', '').replace('\n', '')
        if cellText == "Details":
            continue
        rowList.append(cellText)
    fileList.append(rowList)

outfile = open("./prison-inmates.csv", "w")
writer = csv.writer(outfile)
writer.writerows(fileList)
How do I get to the next page of results?
Code taken from this tutorial (http://first-web-scraper.readthedocs.io/en/latest/)
Although I wasn't able to get your posted code to run, I did find that the original tutorial code you linked to can be changed on the url = line to:
url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s' \
+ '?max_rows=250'
Running python scrape.py then successfully outputs inmates.csv with all available records.
In short, this works by reframing the question: instead of "How do I get to the next page?", we ask "How do I remove pagination?" We make the page send all records at once, so there is no pagination to deal with in the first place. This allows us to use the original tutorial code to save the complete set of records.
Explanation
url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s' to use the new URL. The old URL in the tutorial, http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp, redirects to this new URL but doesn't work with our solution, so we can't use the old URL
\ is a line break allowing me to continue the line of code on the next line, for readability
+ is to concatenate so we can add the ?max_rows=250.
So the result is equivalent to url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s?max_rows=250'
?max_rows=<number-of-records-to-display> is a query string I found that works for this particular Current Detainees page. It can be discovered by first noticing the Page Size text entry field meant for users to set a custom number of rows per page; it shows a default value of 50. Examine its HTML, for example in Firefox (52.7.3): press Ctrl+Shift+I to open the Web Developer Inspector, click the Select element button (the icon resembling a box outline with a mouse cursor arrow on it), then click on the input field containing 50. The HTML pane below highlights: <input class="mrcinput" name="max_rows" size="3" title="max_rowsp" value="50" type="text">. This means the form submits a variable named max_rows, a number that defaults to 50. Some web pages, depending on how they are coded, recognize such variables when appended to the URL as a query string, so it is worth trying ?max_rows= plus a number of your choice. At the time, the page said 250 Total Items, so I tried 250 by changing my browser address bar to load https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s?max_rows=250. It successfully displayed 250 records, making pagination unnecessary, so ?max_rows=250 is what we use to form the URL in our script
However, the page now says 242 Total Items, so it seems they are removing inmates, or at least some of the listed inmate records. You could use ?max_rows=242, but ?max_rows=250 will still work because 250 is larger than the total number of records (242); as long as it is larger, the page will not need to paginate, and you get all the records on one page.
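If you prefer not to splice the query string together by hand, requests can also build it from a params dict; this is just a variation on the same ?max_rows trick, reusing the tbody class from the tutorial code:
import requests
from bs4 import BeautifulSoup

url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s'
# requests appends ?max_rows=250 to the URL for us
response = requests.get(url, params={'max_rows': 250})
soup = BeautifulSoup(response.content, 'html.parser')

# Same table body the tutorial code parses
rows = soup.find('tbody', attrs={'class': 'stripe'}).find_all('tr')
print(len(rows))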
Warranty
This isn't a universal solution for scraping table data when you encounter pagination. It works for this Current Detainees page and for pages that happen to be coded the same way.
That is because pagination isn't implemented uniformly, so any code or solution depends on how the particular page implements it. Here we use ?max_rows=.... Another website, even if it has an adjustable per-page limit, may use a different name for this variable, or ignore query strings altogether, in which case our solution will not work there.
Scalability issues: if you are dealing with a different website where you need millions of records, for example, a download-all-at-once approach like this can run into memory limits both on the server side and on your computer; either could time out and fail to finish delivering or processing. A different approach, resembling the pagination you originally asked about, would definitely be more suitable.
So in the future, if you need to download large numbers of records, this download-all-at-once approach will likely run into memory-related trouble; but for scraping this particular Current Detainees page, it will get the job done.
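If you ever do need the page-by-page route on some other site, the general shape is a loop that stops when a page comes back empty. The URL and the page parameter below are hypothetical; the real names depend entirely on how the target site implements pagination:
import requests
from bs4 import BeautifulSoup

base_url = 'https://example.com/records'  # placeholder URL
page = 1
all_rows = []
while True:
    # 'page' is a made-up parameter name; check the site's own pagination links
    resp = requests.get(base_url, params={'page': page})
    soup = BeautifulSoup(resp.text, 'html.parser')
    rows = soup.find_all('tr')
    if not rows:
        break  # no more records, stop paging
    all_rows.extend(rows)
    page += 1
print(len(all_rows))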

Python Using wildcard inside of strings

I am trying to scrape data from boxofficemojo.com and I have everything set up correctly. However I am receiving a logical error that I cannot figure out. Essentially I want to take the top 100 movies and write the data to a CSV file.
I am currently using html from this site for testing (Other years are the same): http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm
There's a lot of code however this is the main part that I am struggling with. The code block looks like this:
def grab_yearly_data(self, page, year):
    # page is the url that was downloaded, year in this case is 2014.
    rank_pattern = r'<td align="center"><font size="2">([0-9,]*?)</font>'
    mov_title_pattern = r'(.htm">[A-Z])*?</a></font></b></td>'
    #mov_title_pattern = r'.htm">*?</a></font></b></td>' # Testing
    self.rank = [g for g in re.findall(rank_pattern, page)]
    self.mov_title = [g for g in re.findall(mov_title_pattern, page)]
self.rank works perfectly. However, self.mov_title does not store the data correctly. I am supposed to receive a list of 102 elements with the movie titles; instead I receive 102 empty strings: ''. The rest of the program will be pretty simple once I figure out what I am doing wrong, I just can't find the answer to my question online. I've tried changing mov_title_pattern plenty of times and I either receive nothing or 102 empty strings. Please help, I really want to move forward with my project.
Just don't attempt to parse HTML with regex: it will save you time and, most importantly, hair, and it will make your life easier.
Here is a solution using BeautifulSoup HTML parser:
from bs4 import BeautifulSoup
import requests

url = 'http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

for row in soup.select('div#body td[colspan="3"] > table[border="0"] tr')[1:-3]:
    cells = row.find_all('td')
    if len(cells) < 2:
        continue
    rank = cells[0].text
    title = cells[1].text
    print(rank, title)
Prints:
1 Guardians of the Galaxy
2 The Hunger Games: Mockingjay - Part 1
3 Captain America: The Winter Soldier
4 The LEGO Movie
...
98 Transcendence
99 The Theory of Everything
100 As Above/So Below
The expression inside the select() call is a CSS selector, a convenient and powerful way of locating elements. But since the elements on this particular page are not conveniently marked with ids or classes, we have to rely on attributes like colspan or border. The [1:-3] slice is there to eliminate the header and total rows.
For this page, you can also rely on the chart_container element and get its next table sibling:
for row in soup.find('div', id='chart_container').find_next_sibling('table').find_all('tr')[1:-3]:
    ...
mov_title_pattern=r'.htm">([A-Za-z0-9 ]*)</a></font></b></td>'
Try this. It should work for your case. See the demo:
https://www.regex101.com/r/fG5pZ8/6
Your regex does not make much sense. It matches .htm">[A-Z] as few times as possible, which is usually zero, yielding an empty string.
Moreover, with a very general regular expression like that, there is no guarantee that it only matches on the result rows. The generated page contains a lot of other places where you could expect to find .htm"> followed by something.
More generally, I would advocate an approach where you craft a regular expression which precisely identifies each generated result row, and extracts from that all the values you want. In other words, try something like
re.findall('stuff (rank) stuff (title) stuff stuff stuff')
(where I have left it as an exercise to devise a precise regular expression with proper HTML fragments where I have the stuff placeholders)
and extract both the "rank" group and the "title" group out of each matched row.
Granted, scraping is always a brittle business. If you make your regex really tight, chances are it will stop working if the site changes some details of its layout. If you make it too relaxed, it will sometimes return the wrong things.
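If you decide to stay with regex anyway, here is a sketch of such a row-matching pattern, built only from the HTML fragments quoted in the question (so it may well need adjusting against the live page); the file name is just a placeholder for the HTML you have already downloaded:
import re

# 'page' stands for the downloaded chart HTML, as in the question's grab_yearly_data()
page = open('chart_2014.html', encoding='utf-8').read()

row_pattern = (r'<td align="center"><font size="2">([0-9,]+)</font>'  # rank cell
               r'.*?\.htm">([^<]+)</a></font></b></td>')              # title link, non-greedy gap
matches = re.findall(row_pattern, page, re.DOTALL)
# matches is a list of (rank, title) tuples
print(matches[:5])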
