Python - Scrapy unable to fetch data

I am just starting out with Python/Scrapy.
I have written a spider that crawls a website and fetches information, but I am stuck in two places.
I am trying to retrieve the telephone numbers from a page, where they are coded like this:
<span class="mrgn_right5">(+001) 44 42676000,</span>
<span class="mrgn_right5">(+011) 44 42144100</span>
The code I have is:
getdata = soup.find(attrs={"class": "mrgn_right5"})
if getdata:
    aditem['Phone'] = getdata.get_text().strip()
    #print phone
But it is fetching only the first set of numbers and not the second one. How can I fix this?
On the same page there is another set of information, for which I am using this code:
getdata = soup.find(attrs={"itemprop": "pricerange"})
if getdata:
    #print getdata
    aditem['Pricerange'] = getdata.get_text().strip()
    #print pricerange
But it is not fetching anything.
Any help on fixing these two would be great.

From a browse of the Beautiful Soup documentation, find will only return a single result. If multiple results are expected or required, use find_all instead. Since there are two matching elements here, a list will be returned, so the elements of the list need to be joined together (for example) before being added to the Phone field of your AdItem.
getdata = soup.find_all(attrs={"class": "mrgn_right5"})
if getdata:
    aditem['Phone'] = ''.join([x.get_text().strip() for x in getdata])
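Since the first span's text already ends with a comma, joining on a space instead of the empty string keeps the two numbers readable (a small variation on the above):
aditem['Phone'] = ' '.join(x.get_text().strip() for x in getdata)
# -> "(+001) 44 42676000, (+011) 44 42144100"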
For the second issue, the price range is evidently stored in an attribute rather than in the tag's text, so you need to access the attributes of the returned object. Try the following:
getdata = soup.find(attrs={"itemprop": "pricerange"})
if getdata:
    aditem['Pricerange'] = getdata.attrs['content']
And for the address information, the following code works but is very hacky and could no doubt be improved by someone who understands Beautiful Soup better than me.
getdata = soup.find(attrs={"itemprop":"address"})
address = getdata.span.get_text()
addressLocality = getdata.meta.attrs['content']
addressRegion = getdata.find(attrs={"itemprop":"addressRegion"}).attrs['content']
postalCode = getdata.find(attrs={"itemprop":"postalCode"}).attrs['content']
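A slightly less hacky version of the same idea (a sketch, assuming the whole address block uses schema.org itemprop markup, as the lines above suggest) walks every itemprop tag inside getdata and takes either its content attribute or its visible text:
address = {}
for tag in getdata.find_all(attrs={"itemprop": True}):
    # meta tags carry their value in the content attribute, visible tags in their text
    address[tag["itemprop"]] = tag.get("content") or tag.get_text().strip()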

Related

Beautiful Soup find_all on ul with specified class returns None regardless of specified class; find_all works on a different ul in the same program

I am trying to search Craigslist across multiple cities.
The first time I use find_all is to grab each city and save its link; that works no problem.
The second time I use find_all is in a for loop; the idea is that, city by city, we save the single link needed. Unfortunately it saves None every time. Using find will save the first list item, which shows me it kind of works, so I must be doing something wrong with the find_all method. I've used the find_all method a few times in single loops with no issue; is there a problem because I'm calling it from a nested for loop? find works no problem though...
import requests
from bs4 import BeautifulSoup

# cityPage holds the HTML of the Craigslist sites page, fetched earlier in the program
soup = BeautifulSoup(cityPage, 'lxml')
cities = soup.find('ul', class_="height6 geo-site-list")
# saves link to each city in list
city_hyperlink = cities.find_all('li')
for city in city_hyperlink:
    # for each city, goal is to extract 1 list item before going to next city
    # make url link for new soup object using each new city link
    so = requests.get(city.a['href']).text
    soupy = BeautifulSoup(so, 'lxml')
    # save the ul so we can find the specific list item we need
    car_col_class = soupy.find("ul", id="sss0")
    # issue starts here: this returns None. find() pulls the first list item,
    # but we need a specific one down the list
    # it ignores the specified class tag and just returns None
    for col in car_col_class:
        search = car_col_class.find_all('li', class_="cta")
        # just a test to see if it found correct url
        print(search)
Searching by data attribute, as opposed to class name, solved the problem. No idea why, but it works:
search = car_col_class.find(attrs={"data-cat": "cta"})
as opposed to
search = car_col_class.find('li', class_="cta")
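For what it's worth, the same lookup can be written as a CSS attribute selector, which also returns every match rather than only the first (select() is standard bs4):
search = car_col_class.select('li[data-cat="cta"]')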

Access the next page of list results in Reddit API

I'm trying to play around with the Reddit API, and I understand most of it, but I can't seem to figure out how to access the next page of results (since each page is only 25 entries).
Here is the code I'm using:
import requests
import json

r = requests.get(r'https://www.reddit.com/r/Petscop/top.json?sort=top&show=all&t=all')
listing = r.json()
after = listing['data']['after']
data = listing['data']['children']
for entry in data:
    post = entry['data']
    print post['score']

query = 'https://www.reddit.com/r/Petscop/top.json?after='+after
r = requests.get(query)
listing = r.json()
data = listing['data']['children']
for entry in data:
    post = entry['data']
    print post['score']
So I extract the after ID as after and pass it into the next request. However, after the first 25 entries (the first page), the code returns just an empty list ([]). I tried changing the second query to:
r = requests.get(r'https://www.reddit.com/r/Petscop/top.json?after='+after)
And the result is the same. I also tried replacing "after" with "before", but the result was again the same.
Is there a better way to get the next page of results?
Also, what the heck is the r in the get argument? I copied it from an example, but I have no idea what it actually means. I ask because I don't know whether it is needed to access the next page, and if it is, I don't know how to modify the query dynamically by adding after to it.
The after token belongs to a specific listing, so keep the original query parameters when you request the next page. Try:
query = 'https://www.reddit.com/r/Petscop/top.json?sort=top&show=all&t=all&after='+after
or better:
query = 'https://www.reddit.com/r/Petscop/top.json?sort=top&show=all&t=all&after={}'.format(after)
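Taking that a step further, a small loop can walk through every page (a sketch: the params dict keeps the listing parameters consistent between requests, and the User-Agent string is a made-up example, though Reddit does expect a descriptive one):
import requests

base = 'https://www.reddit.com/r/Petscop/top.json'
params = {'sort': 'top', 'show': 'all', 't': 'all'}
headers = {'User-Agent': 'pagination-example/0.1'}

after = None
while True:
    if after:
        params['after'] = after
    listing = requests.get(base, params=params, headers=headers).json()
    for entry in listing['data']['children']:
        print(entry['data']['score'])
    after = listing['data']['after']
    if after is None:
        # no more pages
        break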
As for the r prefix on strings: it marks a raw string literal, in which backslashes are not treated as escape sequences. These URLs contain no backslashes, so you can omit it.

soup.find("div", id = "tournamentTable"), None returned - python 2.7 - BS 4.5.1

I'm trying to parse the following page: http://www.oddsportal.com/soccer/france/ligue-1-2015-2016/results/
The part I'm interested in is getting the table along with the scores and odds.
The code I have so far:
import requests
from bs4 import BeautifulSoup

url = "http://www.oddsportal.com/soccer/france/ligue-1-2015-2016/results/"
req = requests.get(url, timeout=9)
soup = BeautifulSoup(req.text)
print soup.find("div", id="tournamentTable"), soup.find("#tournamentTable")
>>> <div id="tournamentTable"></div> None
Very simple, but I am weirdly stuck at finding the table in the tree. Although I have already found prepared datasets elsewhere, I would like to know why the printed results are an empty tag and None.
Any ideas?
Thanks
First, this page uses JavaScript to fetch its data; if you disable JS in your browser, you will notice that the div tag exists but has nothing in it, so the first find prints a single empty tag.
Second, # is a CSS selector; you cannot use it in find(). The documentation says:
Any argument that's not recognized will be turned into a filter on one of a tag's attributes.
Here, though, "#tournamentTable" is passed as the tag name, so the second find looks for a tag named #tournamentTable; nothing matches, and it returns None.
It looks like the table gets populated by an Ajax call back to the server. That is why, when you print soup.find("div", id="tournamentTable"), you get only the empty tag. When you print soup.find("#tournamentTable"), you get None, because it is trying to find an element whose tag name is "#tournamentTable". If you want to use CSS selectors, you should use soup.select(), like this: soup.select('#tournamentTable'), or soup.select('div#tournamentTable') if you want to be even more particular.
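Since the rows only exist after the JavaScript runs, one common workaround (not part of the original answers; a sketch that assumes Selenium and a Chrome driver are installed) is to let a real browser render the page and then parse the result:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("http://www.oddsportal.com/soccer/france/ligue-1-2015-2016/results/")
# page_source reflects the DOM after the Ajax call has populated the table
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

table = soup.select_one("#tournamentTable")
if table:
    print len(table.find_all("tr"))  # should now be non-zero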

BeautifulSoup to access available bikes in DC bikeshare

I'm new to programming and python and am trying to access the number of available bikes at a given station in the DC bikeshare program. I believe that the best way to do that is with BeautifulSoup. The good news is that the data is available in what appears to be a clean format here: https://www.capitalbikeshare.com/data/stations/bikeStations.xml
Here's an example of a station:
<station>
<id>1</id>
<name>15th & S Eads St</name>
<terminalName>31000</terminalName>
<lastCommWithServer>1460217337648</lastCommWithServer>
<lat>38.858662</lat>
<long>-77.053199</long>
<installed>true</installed>
<locked>false</locked>
<installDate>0</installDate>
<removalDate/>
<temporary>false</temporary>
<public>true</public>
<nbBikes>7</nbBikes>
<nbEmptyDocks>8</nbEmptyDocks>
<latestUpdateTime>1460192501598</latestUpdateTime>
</station>
I'm looking for the <nbBikes> value. I had what I thought would be the start of a python script that would show me the value for the first 5 stations (I'll tackle picking the station I want once I get this under control) but it doesn't return any values. Here's the script:
# bikeShareParse.py - parses the capital bikeshare info page
import bs4, requests

url = "https://www.capitalbikeshare.com/data/stations/bikeStations.xml"
res = requests.get(url)
res.raise_for_status()

# create the soup element from the file
soup = bs4.BeautifulSoup("res.text", "lxml")

# defines the part of the page we are looking for
nbikes = soup.select('#text')

# limits number of results for testing
numOpen = 5
for i in range(numOpen):
    print nbikes
I believe that my problem (besides not understanding how to format code correctly in a Stack Overflow question) is that the value for nbikes = soup.select('#text') is incorrect. However, I can't seem to substitute anything for '#text' that gets any values, let alone the ones I want.
Am I approaching this the right way? If so, what am I missing?
Thanks
This script builds parallel lists of station IDs and available bikes and zips them into (station_ID, bikes_remaining) tuples. It is modified from the beginning of this: http://www.plotsofdots.com/archives/68
# from http://www.plotsofdots.com/archives/68
import xml.etree.ElementTree as ET
import urllib2

# we parse the data using urllib2 and xml
site = 'https://www.capitalbikeshare.com/data/stations/bikeStations.xml'
htm = urllib2.urlopen(site)
doc = ET.parse(htm)

# we get the root tag
root = doc.getroot()

# we define empty lists for the station IDs and available bikes
sID = []
embikes = []

# we now use a for loop to extract the information we are interested in
for station in root.findall('station'):
    sID.append(station.find('id').text)
    embikes.append(int(station.find('nbBikes').text))

# this just tests that the process above works, can be commented out
#print embikes
#print sID

# use zip to create tuples pairing each station with its bike count
prov = zip(sID, embikes)
print prov[0]
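Since the question asked about BeautifulSoup specifically, the same extraction works with bs4 in XML mode (a sketch; assumes bs4 and lxml are installed, and the "xml" parser matters because HTML parsers would lowercase the <nbBikes> tag name):
import bs4, requests

res = requests.get("https://www.capitalbikeshare.com/data/stations/bikeStations.xml")
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "xml")

# name and available-bike count for the first 5 stations
for station in soup.find_all("station")[:5]:
    print(station.find("name").text + ": " + station.find("nbBikes").text)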

duckduckgo API not returning results

Edit: I now realize the API is simply inadequate and is not even working.
I would like to redirect my question: I want to be able to auto-magically search DuckDuckGo using their "I'm feeling ducky" feature, so that I can search for "stackoverflow", for instance, and get the main page (https://stackoverflow.com/) as my result.
I am using the duckduckgo API. Here
And I found that when using:
r = duckduckgo.query("example")
The results do not reflect a manual search, namely:
for result in r.results:
    print result
Results in:
>>>
>>>
Nothing.
And indexing into results raises an out-of-bounds error, since the list is empty.
How am I supposed to get results for my search?
It seems the API (according to its documented examples) is supposed to answer questions and give a sort of "I'm feeling ducky" result in the form of r.answer.text.
But the website is made in such a way that I cannot search it and parse the results using normal methods.
I would like to know how I am supposed to parse search results with this API, or by any other method, from this site.
Thank you.
If you visit the DuckDuckGo API page, you will find some notes about using the API. The first note says clearly that:
As this is a Zero-click Info API, most deep queries (non topic names)
will be blank.
And here's the list of those fields:
Abstract: ""
AbstractText: ""
AbstractSource: ""
AbstractURL: ""
Image: ""
Heading: ""
Answer: ""
Redirect: ""
AnswerType: ""
Definition: ""
DefinitionSource: ""
DefinitionURL: ""
RelatedTopics: [ ]
Results: [ ]
Type: ""
So it might be a pity, but their API just truncates a bunch of results and does not give them to you, possibly to work faster; it seems nothing can be done about it except using DuckDuckGo.com directly.
So, obviously, in that case the API is not the way to go.
As for me, I see only one way out: retrieving the raw HTML from duckduckgo.com and parsing it using, e.g., html5lib (it is worth mentioning that their HTML is well structured).
It is also worth mentioning that parsing HTML pages is not the most reliable way to scrape data, because the HTML structure can change, while an API usually stays stable until changes are publicly announced.
Here's an example of how such parsing can be achieved with BeautifulSoup:
from BeautifulSoup import BeautifulSoup
import urllib
import re
site = urllib.urlopen('http://duckduckgo.com/?q=example')
data = site.read()
parsed = BeautifulSoup(data)
topics = parsed.findAll('div', {'id': 'zero_click_topics'})[0]
results = topics.findAll('div', {'class': re.compile('results_*')})
print results[0].text
This script prints:
u'Eixample, an inner suburb of Barcelona with distinctive architecture'
The problem with querying the main page directly is that it uses JavaScript to produce the required results (not related topics), so you have to use the HTML-only version to get results. The HTML version has a different link:
http://duckduckgo.com/?q=example # JavaScript version
http://duckduckgo.com/html/?q=example # HTML-only version
Let's see what we can get:
site = urllib.urlopen('http://duckduckgo.com/html/?q=example')
data = site.read()
parsed = BeautifulSoup(data)
first_link = parsed.findAll('div', {'class': re.compile('links_main*')})[0].a['href']
The result stored in the first_link variable is a link to the first result (not a related search) that the search engine outputs:
http://www.iana.org/domains/example
To get all the links, you can iterate over the found tags (other data besides links can be retrieved in a similar way):
for i in parsed.findAll('div', {'class': re.compile('links_main*')}):
    print i.a['href']
http://www.iana.org/domains/example
https://twitter.com/example
https://www.facebook.com/leadingbyexample
http://www.trythisforexample.com/
http://www.myspace.com/leadingbyexample?_escaped_fragment_=
https://www.youtube.com/watch?v=CLXt3yh2g0s
https://en.wikipedia.org/wiki/Example_(musician)
http://www.merriam-webster.com/dictionary/example
...
Note that the HTML-only version contains only results; for related searches you must use the JavaScript version (without the html part in the URL).
After already getting an answer to my question, which I accepted and awarded the bounty for, I found a different solution, which I would like to add here for completeness, with a big thank-you to all those who helped me reach it. Even though this isn't the solution I asked for, it may help someone in the future.
It was found after a long and hard conversation on this site and with some support mails: https://duck.co/topic/strange-problem-when-searching-intel-with-my-script
And here is the solution code (from an answer in the thread posted above):
>>> import duckduckgo
>>> print duckduckgo.query('! Example').redirect.url
http://www.iana.org/domains/example
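Wrapped in a helper, that is exactly the "I'm feeling ducky" behaviour the question asked for (a sketch; feeling_ducky is a made-up name, and duckduckgo is the module from the question):
import duckduckgo

def feeling_ducky(terms):
    # the leading "!" makes DuckDuckGo redirect to the first search result
    return duckduckgo.query('! ' + terms).redirect.url

print feeling_ducky('stackoverflow')  # expected: https://stackoverflow.com/, per the question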
Try:
for result in r.results:
    print result.text
If it suits your application, you might also try the related searches:
r = duckduckgo.query("example")
for i in r.related_searches:
    if i.text:
        print i.text
This yields:
Eixample, an inner suburb of Barcelona with distinctive architecture
Example (musician), a British musician
example.com, example.net, example.org, example.edu and .example, domain names reserved for use in documentation as examples
HMS Example (P165), an Archer-class patrol and training vessel of the British Royal Navy
The Example, a 1634 play by James Shirley
The Example (comics), a 2009 graphic novel by Tom Taylor and Colin Wilson
For Python 3 users, here is a transcription of @Rostyslav Dzinko's code:
import re
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd

query = "your query"
site = urllib.request.urlopen("http://duckduckgo.com/html/?q=" + query)
data = site.read()
soup = BeautifulSoup(data, "html.parser")

my_list = soup.find("div", {"id": "links"}).find_all("div", {'class': re.compile('.*web-result.*')})[0:15]
(result__snippet, result_url) = ([] for i in range(2))
for i in my_list:
    try:
        result__snippet.append(i.find("a", {"class": "result__snippet"}).get_text().strip("\n").strip())
    except AttributeError:
        result__snippet.append(None)
    try:
        result_url.append(i.find("a", {"class": "result__url"}).get_text().strip("\n").strip())
    except AttributeError:
        result_url.append(None)
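The pandas import only makes sense if the two lists end up in a DataFrame, so presumably the code was meant to finish along these lines (df is a hypothetical name):
df = pd.DataFrame({"snippet": result__snippet, "url": result_url})
print(df.head())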
