Scrapy + Python: error in finding links from a website

I am trying to find the URLs of all the events on this page:
https://www.eventshigh.com/delhi/food?src=exp
But I can see the URLs only in JSON format:
{
  "@context": "http://schema.org",
  "@type": "Event",
  "name": "DANDIYA NIGHT 2018",
  "image": "https://storage.googleapis.com/ehimages/2018/9/4/img_b719545523ac467c4ad206c3a6e76b65_1536053337882_resized_1000.jpg",
  "url": "https://www.eventshigh.com/detail/Delhi/5b30d4b8462a552a5ce4a5ebcbefcf47-dandiya-night-2018",
  "eventStatus": "EventScheduled",
  "startDate": "2018-10-14T18:30:00+05:30",
  "doorTime": "2018-10-14T18:30:00+05:30",
  "endDate": "2018-10-14T22:30:00+05:30",
  "description": "Dress code : TRADITIONAL (mandatory)\u00A0 \r\n Dandiya sticks will be available at the venue ( paid)\u00A0 \r\n Lip smacking food, professional dandiya Dj , media coverage , lucky draw \u00A0, Dandiya Garba Raas , Shopping and Games .\u00A0 \r\n \u00A0 \r\n Winners\u00A0 \r\n \u00A0 \r\n Best dress ( all",
  "location": {
    "@type": "Place",
    "name": "K And L Community Hall (senior Citizen Complex )",
    "address": "80 TO 49, Pocket K, Sarita Vihar, New Delhi, Delhi 110076, India"
  },
Here it is:
"url":"https://www.eventshigh.com/detail/Delhi/5b30d4b8462a552a5ce4a5ebcbefcf47-dandiya-night-2018"
But I cannot find any other HTML/XML tag that contains the links, nor can I find a corresponding JSON file that contains them. Could you please help me scrape the links of all the events on this page:
https://www.eventshigh.com/delhi/food?src=exp

Gathering information from a JavaScript-powered page like this one may look daunting at first, but it is often actually more productive, since all the information is in one place instead of scattered across a lot of expensive HTTP-request lookups.
So when a page gives you JSON data like this, you can thank the site by being nice to its server and using it! :)
With a little time invested in the "source-view analysis" you have already done, this will also be more efficient than trying to get the information through an (expensive) Selenium/Splash/etc. render pipeline.
The invaluable tool for getting there is XPath. Sometimes a little additional help from our friend regex is required.
Assuming you have successfully fetched the page and have a Scrapy response object (or a parsel.Selector over an otherwise gathered response body), you will be able to access the xpath() method as response.xpath or selector.xpath:
>>> response.status
200
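If you are not working inside a Scrapy spider, a minimal sketch of the parsel route (assuming you fetch the body with requests) could look like this:
import requests
from parsel import Selector

# Fetch the page body ourselves and wrap it in a parsel Selector,
# which exposes the same .xpath() method as a Scrapy response.
body = requests.get('https://www.eventshigh.com/delhi/food?src=exp').text
selector = Selector(text=body)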
You have determined that the data exists as plain text (JSON), so we need to drill down to where it hides and ultimately extract the raw JSON content.
After that, converting it to a Python dict for further use will be trivial.
In this case the data is inside a container node <script type="application/ld+json">. Our XPath for that could look like this:
>>> response.xpath('//script[@type="application/ld+json"]')
[<Selector xpath='//script[@type="application/ld+json"]' data='<script type="application/ld+json">\n{\n '>,
 <Selector xpath='//script[@type="application/ld+json"]' data='<script type="application/ld+json">\n{\n '>,
 <Selector xpath='//script[@type="application/ld+json"]' data='<script type="application/ld+json">\n '>]
This finds every script node in the page that has a type attribute with the value "application/ld+json".
Apparently that is not specific enough, since we find three such nodes (each wrapped in a Selector in our returned list).
From your analysis we know that our JSON must contain "@type":"Event", so let our XPath do a little substring search for that:
>>> response.xpath("""//script[#type="application/ld+json"]/self::node()[contains(text(), '"#type":"Event"')]""")
[<Selector xpath='//script[#type="application/ld+json"]/self::node()[contains(text(), \'"#type":"Event"\')]' data='<script type="application/ld+json">\n '>]
Here we added a second qualifier which says our script node must contain the given text.
(The self::node() part shows some XPath axes magic to refer back to our current script node at this point, instead of to its descendants. We will simplify this, though.)
Now our return list contains a single node/Selector. As we see from the data= string, if we were to extract() this, we would
get some string like <script type="application/ld+json">[...]</script>.
Since we care about the content of the node, but not the node itself, we have one more step to go:
>>> response.xpath("""//script[#type="application/ld+json"][contains(text(), '"#type":"Event"')]/text()""")
[<Selector xpath='//script[#type="application/ld+json"][contains(text(), \'"#type":"Event"\')]/text()' data='\n [\n \n \n '>]
And this returns (a SelectorList of) our target text(). As you can see, we could also do away with the self-reference.
Now, xpath() always returns a SelectorList, but we have a little helper for this: response.xpath(...).extract_first() grabs the list's first element (checking that it exists) before processing it.
We can put this result into a data variable, after which it is simple to json.loads(data) it into a Python dictionary and look up our values:
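For example, continuing the session above:
>>> import json
>>> data = response.xpath("""//script[@type="application/ld+json"][contains(text(), '"@type":"Event"')]/text()""").extract_first()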
>>> events = json.loads(data)
>>> [item['url'] for item in events]
['<url>',
'<url>',
'<url>',
'<url>']
Now you can turn them into scrapy.Request(url)s, and you'll know how to continue from there.
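As a minimal sketch of that last step inside a Scrapy spider (the spider name and the parse_event callback are hypothetical):
import json

import scrapy

class EventsSpider(scrapy.Spider):
    name = 'events'
    start_urls = ['https://www.eventshigh.com/delhi/food?src=exp']

    def parse(self, response):
        # Grab the JSON-LD block holding the event list, as derived above.
        data = response.xpath(
            """//script[@type="application/ld+json"][contains(text(), '"@type":"Event"')]/text()"""
        ).extract_first()
        for item in json.loads(data):
            # Follow each event URL; parse_event is a hypothetical detail callback.
            yield scrapy.Request(item['url'], callback=self.parse_event)

    def parse_event(self, response):
        pass  # extract whatever event details you need here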
As always, crawl responsibly and keep the 'net a nice place to be. I do not endorse any unlawful behavior.
Assessing your rights, or gaining permission to access a given target resource, is your own responsibility.

Related

Scraping a website with data hidden under "read more"

I am trying to scrape reviews from Tripadvisor.com, and I want to get the data under the 'Read More' button on the site. Is there any way to scrape this without using Selenium?
So far this is the code that I have used:
import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.tripadvisor.com.ph/Hotel_Review-g8762949-d1085145-Reviews-El_Rio_y_Mar_Resort-San_Jose_Coron_Busuanga_Island_Palawan_Province_Mimaropa.html#REVIEWS')
rsp_soup = BeautifulSoup(resp.text, 'html.parser')
rsp_soup.findAll(attrs={"class": "hotels-review-list-parts-ExpandableReview__reviewText--3oMkH"})
But it cannot get the content hidden under 'Read more'.
Reviews are only partially revealed in the HTML until you click on 'Read more', which actually does not make an Ajax call but updates the page from data contained in window.__WEB_CONTEXT__. You can access this data by looking into the <script> tag in which it appears:
<script>
window.__WEB_CONTEXT__={pageManifest:{"assets":["/components/dist/#ta/platform.polyfill.084d8cdf5f.js","/components/dist/runtime.56c5df2842.js", .... }
</script>
Once you've got it, you can extract and process the data, which is in JSON format. Here is the full code:
import json
import re

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.tripadvisor.com.ph/Hotel_Review-g8762949-d1085145-Reviews-El_Rio_y_Mar_Resort-San_Jose_Coron_Busuanga_Island_Palawan_Province_Mimaropa.html#REVIEWS')
data = BeautifulSoup(resp.content, 'html.parser').find('script', text=re.compile('window.__WEB_CONTEXT__')).text

# Some text processing to make the tag content valid JSON
pageManifest = json.loads(data.replace('window.__WEB_CONTEXT__=', '').replace('{pageManifest:', '{"pageManifest":')[:-1])

# Find the cache entry that actually holds the review list
for x in pageManifest['pageManifest']['apolloCache']:
    try:
        reviews = x['result']['locations'][0]['reviewList']['reviews']
    except (KeyError, IndexError, TypeError):
        pass

print([review['text'] for review in reviews])
Output
['Do arrange for airport transfers! From the airport, you will be taking a van for around 20 minutes, then you\'ll be transferred to a banca/boat for a 25 minute ride to the resort. Upon arrival, you\'ll be greeted by a band that plays their "welcome, welcome" song and in our case, we were met by Maria (awesome gal!) who introduced the group to the resort facilities and checks you in at the bar.I booked a deluxe room, which is actually a duplex with 2 adjoining rooms, ideal for families, which accommodates 4 to a room.Rooms are clean and bed is comfortable.Potable water is provided upon check in , but is chargeable thereafter.Don\'t worry, ...FULL REVIEW...',
"Stayed with my wife and 2 children, 10y and 13y. ...FULL REVIEW...",
'Beginning at now been in Coron for a couple of ...FULL REVIEW...',
'This was the most beautiful and relaxing place ...FULL REVIEW...',
'We spent 2 nights at El rio. It was incredible, ...FULL REVIEW... ']
In general, no. It all depends on what happens when you hit "Read More", i.e. where the actual data is.
There are usually two possibilities (not mutually exclusive):
- The data lies in the same page, hidden, and the "read more" is e.g. a label for a hidden checkbox that, when selected, hides the "read more" span and makes the rest of the text appear. This way the displayed page is smaller and more readable, yet it is all loaded within the same call. In that case you just need to find a suitable selector (it could be, for example, #someotherselector+input[type=checkbox] ~ div.moreText or something like that); see the sketch after the next item.
- The data is not there; it will be loaded via AJAX either after some time (remaining hidden) or only when you click on the "read more", and is displayed then. This allows keeping a small page that loads quickly yet contains lots of items that would load slowly, loading them in the background or on demand. In this case you need to inspect the actual AJAX call (it usually carries along an id or a data-value held in the 'Load More...' element: <span class="loadMore" data-text-id="x19834">Read more...</span>) and issue the same call with the appropriate headers:
resp2 = requests.get('https://www.tripadvisor.com.ph/whatever/api/is/used?id=' + element['data-text-id'])
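For the first, hidden-in-page case, a minimal sketch (the CSS selector here illustrates the checkbox pattern described above; it is not any site's actual markup, and resp is the response fetched earlier):
from bs4 import BeautifulSoup

soup = BeautifulSoup(resp.text, 'html.parser')
# Hidden full-text blocks sitting next to the "read more" checkbox;
# 'div.moreText' is a hypothetical class name used for illustration.
hidden_blocks = soup.select('input[type=checkbox] ~ div.moreText')
print([block.get_text() for block in hidden_blocks])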
Without knowing how the data is retrieved and where the relevant elements (e.g. the name and content of the id-carrying attribute, etc.) are, it is not possible to give an answer that will work every time.
You might also be interested in doing this the right way. The data you're scraping is protected by copyright, and TripAdvisor might change things enough that you'll have problems maintaining the scraper.

How to get the job description using scrapy?

I'm new to Scrapy and XPath but have been programming in Python for some time. I would like to get the email, the name of the person making the offer, and the phone number from the page https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ using Scrapy. As you see, the email and phone are provided as text inside a <p> tag, and that makes them hard to extract.
My idea is to first get the text inside the Job Overview, or at least all the text about this particular job, and use regex to get the email, the phone number and, if possible, the name of the person.
So, I fired up the Scrapy shell using the command scrapy shell https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ and got the response from there.
Now, I try to get all the text from the div job_description, where I actually get nothing. I used:
full_des = response.xpath('//div[@class="job_description"]/text()').extract()
It returns [u'\t\t\t\n\t\t ']
How do I get all the text from the page mentioned? Obviously, the task of getting the attributes mentioned before will come afterwards, but first things first.
Update: this selection only returns []: response.xpath('//div[@class="job_description"]/div[@class="container"]/div[@class="row"]/text()').extract()
You were close with
full_des = response.xpath('//div[@class="job_description"]/text()').extract()
The div tag actually does not have any text besides what you are getting:
<div class="job_description" (...)>
    "This is the text you are getting"
    <p>"This is the text you want"</p>
</div>
As you see, the text you are getting with response.xpath('//div[@class="job_description"]/text()').extract() is the text directly inside the div tag, not the text inside the tags nested within it. For that you would need:
response.xpath('//div[@class="job_description"]//*/text()').extract()
What this does is select all the child nodes of div[@class="job_description"] and return their text (see here for what the different XPath expressions do).
You will see that this also returns a lot of useless text, as you are still getting all the \n characters and such. For this I suggest that you narrow your XPath down to the element that you want, instead of taking a broad approach.
For example, the entire job description would be in
response.xpath('//div[@class="col-sm-5 justify-text"]//*/text()').extract()

Perl cannot get content from an HTML page, while the page has all the necessary tags

I'm trying to build a Redfin API, and I'm trying to get the content of this link using LWP::Simple:
https://www.redfin.com/CA/San-Jose/947-Hummingbird-Dr-95125/home/1309375#schools
The content I get as a result does not have the details of the school. What I want is "Willow Glen Middle School"; I have another API that takes this text as input and returns an exact API score.
I tried the same thing using Python, with the same result. Below I'm dumping code in Perl; whichever works, I'll take it.
It is just simple code for now:
use LWP::Simple;
$content = get("https://www.redfin.com/CA/San-Jose/947-Hummingbird-Dr-95125/home/1309375#schools");
print "Call API" if($content =~ /Willow Glen Middle School/);
You are not getting the school result because it is not in the raw HTML: the content of the page is populated using JavaScript, whereas your get method returns the HTML without processing any JavaScript. You need to use something like WWW::Mechanize::Firefox to get your example to work. Note, however, that it will be much slower than LWP.
Here is some sample code:
#use LWP::Simple;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();
$mech->get("https://www.redfin.com/CA/San-Jose/947-Hummingbird-Dr-95125/home/1309375#schools");
#print $mech->content;
if ($mech->content =~ /Willow Glen Middle School/) {
    print "ya\n";
}
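Since the question mentions trying Python as well, an equivalent sketch using Selenium (my suggestion, not something from the question; it requires a browser driver such as geckodriver on your PATH) could be:
from selenium import webdriver

# Drive a real browser so the JavaScript that injects the school
# details actually runs before we inspect the page source.
driver = webdriver.Firefox()
driver.get("https://www.redfin.com/CA/San-Jose/947-Hummingbird-Dr-95125/home/1309375#schools")
if "Willow Glen Middle School" in driver.page_source:
    print("Call API")
driver.quit()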

Python: using a wildcard inside of strings

I am trying to scrape data from boxofficemojo.com and I have everything set up correctly. However, I am running into a logic error that I cannot figure out. Essentially, I want to take the top 100 movies and write the data to a CSV file.
I am currently using HTML from this site for testing (other years are the same): http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm
There's a lot of code, but this is the main part that I am struggling with. The code block looks like this:
def grab_yearly_data(self, page, year):
    # page is the url that was downloaded, year in this case is 2014
    rank_pattern = r'<td align="center"><font size="2">([0-9,]*?)</font>'
    mov_title_pattern = r'(.htm">[A-Z])*?</a></font></b></td>'
    #mov_title_pattern = r'.htm">*?</a></font></b></td>'  # Testing
    self.rank = [g for g in re.findall(rank_pattern, page)]
    self.mov_title = [g for g in re.findall(mov_title_pattern, page)]
self.rank works perfectly. However, self.mov_title does not store the data correctly. I am supposed to receive a list of 102 elements containing the movie titles, but instead I receive 102 empty strings: ''. The rest of the program will be pretty simple once I figure out what I am doing wrong; I just can't find the answer to my question online. I've tried changing mov_title_pattern plenty of times, and I either receive nothing or 102 empty strings. Please help, I really want to move forward with my project.
Just don't attempt to parse HTML with regex: it will save you time and, most importantly, hair, and it will make your life easier.
Here is a solution using the BeautifulSoup HTML parser:
from bs4 import BeautifulSoup
import requests

url = 'http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

for row in soup.select('div#body td[colspan="3"] > table[border="0"] tr')[1:-3]:
    cells = row.find_all('td')
    if len(cells) < 2:
        continue
    rank = cells[0].text
    title = cells[1].text
    print(rank, title)
Prints:
1 Guardians of the Galaxy
2 The Hunger Games: Mockingjay - Part 1
3 Captain America: The Winter Soldier
4 The LEGO Movie
...
98 Transcendence
99 The Theory of Everything
100 As Above/So Below
The expression inside the select() call is a CSS selector, a convenient and powerful way of locating elements. But, since the elements on this particular page are not conveniently mapped with ids or marked with classes, we have to rely on attributes like colspan or border. The [1:-3] slice is there to eliminate the header and total rows.
For this page, to get to the table you can also rely on the chart element and get its next table sibling:
for row in soup.find('div', id='chart_container').find_next_sibling('table').find_all('tr')[1:-3]:
    ...
mov_title_pattern = r'.htm">([A-Za-z0-9 ]*)</a></font></b></td>'
Try this. It should work for your case. See the demo:
https://www.regex101.com/r/fG5pZ8/6
Your regex does not make much sense. It matches .htm">[A-Z] as few times as possible, which is usually zero times, yielding an empty string.
Moreover, with a very general regular expression like that, there is no guarantee that it only matches on the result rows. The generated page contains a lot of other places where you could expect to find .htm"> followed by something.
More generally, I would advocate an approach where you craft a regular expression that precisely identifies each generated result row and extracts all the values you want from it. In other words, try something like
re.findall('stuff (rank) stuff (title) stuff stuff stuff')
(where I have left it as an exercise to devise a precise regular expression with proper HTML fragments where I have the stuff placeholders)
and extract both the "rank" group and the "title" group out of each matched row.
Granted, scraping is always a brittle business. If you make your regex really tight, chances are it will stop working if the site changes some details in its layout. If you make it too relaxed, it will sometimes return the wrong things.
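As a toy illustration of that row-anchored approach (the combined pattern below is an assumption stitched together from the question's own fragments, not verified against the live page; page is the downloaded HTML from the question's code):
import re

# Rank cell followed (eventually) by the title link on the same row;
# re.DOTALL lets .*? span the markup between the two cells.
row_pattern = (r'<td align="center"><font size="2">([0-9,]+)</font>'
               r'.*?\.htm">([^<]+)</a>')
for rank, title in re.findall(row_pattern, page, re.DOTALL):
    print(rank, title)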

BeautifulSoup find and find_all not working as expected

I just started using BeautifulSoup and I am encountering a problem. I set up an HTML snippet below and make a BeautifulSoup object:
html_snippet = '<p class="course"><span class="text84">Ae 100. Research in Aerospace. </span><span class="text85">Units to be arranged in accordance with work accomplished. </span><span class="text83">Open to suitably qualified undergraduates and first-year graduate students under the direction of the staff. Credit is based on the satisfactory completion of a substantive research report, which must be approved by the Ae 100 adviser and by the option representative. </span> </p>'
subject = BeautifulSoup(html_snippet)
I have tried several find and find_all operations, like those below, but all I am getting is nothing or an empty list:
subject.find(text = 'A')
subject.find(text = 'Research')
subject.next_element.find('A')
subject.find_all(text = 'A')
When I created a BeautifulSoup object from an HTML file on my computer before, the find and find_all operations all worked fine. However, when I pulled the html_snippet from reading a webpage online through urllib2, I am getting these problems.
Can anyone point out where the issue is?
Pass the argument like this:
import re
subject.find(text=re.compile('A'))
The default behavior of the text filter is to match against the entire text body. Passing in a regular expression lets you match on fragments.
EDIT: To match only text bodies beginning with A, you can use the following:
subject.find(text=re.compile('^A'))
To match only text bodies containing words that begin with A, you can use:
subject.find_all(text=re.compile(r'\bA'))
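As a quick check against the question's html_snippet (a sketch; exact return types may vary slightly across bs4 versions):
import re
from bs4 import BeautifulSoup

# html_snippet as defined in the question
subject = BeautifulSoup(html_snippet, 'html.parser')
# Returns the whole text node whose content contains 'Research':
print(subject.find(text=re.compile('Research')))  # 'Ae 100. Research in Aerospace. '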
It's difficult to tell more specifically what you're looking for; let me know if I've misinterpreted what you're asking.
