Edit: I now realize the API is simply inadequate and does not even work.
I would like to redirect my question: I want to be able to automatically search DuckDuckGo using their "I'm feeling ducky" feature, so that I can search for "stackoverflow", for instance, and get the main page ("https://stackoverflow.com/") as my result.
I am using the duckduckgo API.
And I found that when using:
r = duckduckgo.query("example")
The results do not reflect a manual search, namely:
for result in r.results:
    print result
Results in:
>>>
>>>
Nothing.
Indexing into results raises an out-of-bounds error, since the list is empty.
How am I supposed to get results for my search?
It seems the API (according to its documented examples) is supposed to answer questions and give a sort of "I'm feeling ducky" in the form of r.answer.text
But the website is made in such a way that I can not search it and parse results using normal methods.
I would like to know how I am supposed to parse search results with this API or any other method from this site.
Thank you.
If you visit the DuckDuckGo API page, you will find some notes about using the API. The first note says clearly that:
As this is a Zero-click Info API, most deep queries (non topic names)
will be blank.
And here's the list of those fields:
Abstract: ""
AbstractText: ""
AbstractSource: ""
AbstractURL: ""
Image: ""
Heading: ""
Answer: ""
Redirect: ""
AnswerType: ""
Definition: ""
DefinitionSource: ""
DefinitionURL: ""
RelatedTopics: [ ]
Results: [ ]
Type: ""
So it might be a pity, but their API simply leaves out a bunch of results and does not give them to you, possibly to work faster, and it seems nothing can be done except using DuckDuckGo.com.
So, obviously, in that case the API is not the way to go.
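To see at a glance whether a response is one of these blank ones, it can help to list only the fields that carry data. The following is a minimal sketch that works on an already-decoded JSON response; the helper name non_empty_fields and both sample payloads are made up for illustration, with keys taken from the field list quoted above.

```python
import json

def non_empty_fields(payload):
    """Return the names of the response fields that actually carry data."""
    return sorted(key for key, value in payload.items() if value)

# A blank response, shaped like (a subset of) the field list quoted above.
blank = json.loads('{"Abstract": "", "Answer": "", "Redirect": "", '
                   '"RelatedTopics": [], "Results": [], "Type": ""}')

# A hypothetical response in which only the redirect is filled in.
redirect_only = dict(blank, Redirect="http://www.iana.org/domains/example")

print(non_empty_fields(blank))          # []
print(non_empty_fields(redirect_only))  # ['Redirect']
```

An empty list means the query was treated as a "deep query" and nothing useful came back.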
As for me, I see only one way out: retrieving the raw HTML from duckduckgo.com and parsing it using, e.g., html5lib (it is worth mentioning that their HTML is well-structured).
It is also worth mentioning that parsing HTML pages is not the most reliable way to scrape data, because the HTML structure can change, while an API usually stays stable until changes are publicly announced.
Here's an example of how such parsing can be achieved with BeautifulSoup:
from BeautifulSoup import BeautifulSoup
import urllib
import re
site = urllib.urlopen('http://duckduckgo.com/?q=example')
data = site.read()
parsed = BeautifulSoup(data)
topics = parsed.findAll('div', {'id': 'zero_click_topics'})[0]
results = topics.findAll('div', {'class': re.compile('results_*')})
print results[0].text
This script prints:
u'Eixample, an inner suburb of Barcelona with distinctive architecture'
The problem with querying the main page directly is that it uses JavaScript to produce the required results (as opposed to the related topics), so you can only get results from the HTML version. The HTML version has a different link:
http://duckduckgo.com/?q=example # JavaScript version
http://duckduckgo.com/html/?q=example # HTML-only version
Let's see what we can get:
site = urllib.urlopen('http://duckduckgo.com/html/?q=example')
data = site.read()
parsed = BeautifulSoup(data)
first_link = parsed.findAll('div', {'class': re.compile('links_main*')})[0].a['href']
The result stored in first_link variable is a link to the first result (not a related search) that search engine outputs:
http://www.iana.org/domains/example
To get all the links, you can iterate over the found tags (data other than links can be retrieved in a similar way):
for i in parsed.findAll('div', {'class': re.compile('links_main*')}):
    print i.a['href']
http://www.iana.org/domains/example
https://twitter.com/example
https://www.facebook.com/leadingbyexample
http://www.trythisforexample.com/
http://www.myspace.com/leadingbyexample?_escaped_fragment_=
https://www.youtube.com/watch?v=CLXt3yh2g0s
https://en.wikipedia.org/wiki/Example_(musician)
http://www.merriam-webster.com/dictionary/example
...
Note that the HTML-only version contains only results; for related searches you must use the JavaScript version (without the html part in the URL).
After already getting an answer to my question which I accepted and gave bounty for - I found a different solution, which I would like to add here for completeness. And a big thank you to all those who helped me reach this solution. Even though this isn't the solution I asked for, it may help someone in the future.
Found after a long and hard conversation on this site and with some support mails: https://duck.co/topic/strange-problem-when-searching-intel-with-my-script
And here is the solution code (from an answer in the thread posted above):
>>> import duckduckgo
>>> print duckduckgo.query('! Example').redirect.url
http://www.iana.org/domains/example
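If the duckduckgo module is not available, the same "!" trick can presumably be applied by calling the API endpoint directly and reading the Redirect field from the JSON. The sketch below only builds the request URL; it assumes the api.duckduckgo.com endpoint and its format and no_redirect parameters behave as described on the API page, and the helper name ducky_api_url is made up.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint of the Instant Answer API.
API_ENDPOINT = "https://api.duckduckgo.com/"

def ducky_api_url(query):
    """Build an API URL for an "I'm feeling ducky" style query."""
    params = {
        "q": "! " + query,   # the leading "!" asks for the first result
        "format": "json",    # ask for a JSON response
        "no_redirect": "1",  # report the target instead of redirecting to it
    }
    return API_ENDPOINT + "?" + urlencode(params)

url = ducky_api_url("Example")
print(url)

# The query string decodes back to the parameters that were passed in.
decoded = parse_qs(urlsplit(url).query)
```

Fetching that URL and decoding the body as JSON should then expose the first result under the Redirect key, if the API answers as it does for the module above.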
Try:
for result in r.results:
    print result.text
If it suits your application, you might also try the related searches
r = duckduckgo.query("example")
for i in r.related_searches:
    if i.text:
        print i.text
This yields:
Eixample, an inner suburb of Barcelona with distinctive architecture
Example (musician), a British musician
example.com, example.net, example.org, example.edu and .example, domain names reserved for use in documentation as examples
HMS Example (P165), an Archer-class patrol and training vessel of the British Royal Navy
The Example, a 1634 play by James Shirley
The Example (comics), a 2009 graphic novel by Tom Taylor and Colin Wilson
For Python 3 users, a transcription of @Rostyslav Dzinko's code:
import re
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup
query = "your query"
# Quote the query so that spaces and special characters are URL-safe
site = urllib.request.urlopen("http://duckduckgo.com/html/?q=" + urllib.parse.quote_plus(query))
data = site.read()
soup = BeautifulSoup(data, "html.parser")
my_list = soup.find("div", {"id": "links"}).find_all("div", {"class": re.compile("web-result")})[0:15]
result__snippet, result_url = [], []
for i in my_list:
    # A result may lack a snippet or a URL; find() then returns None
    try:
        result__snippet.append(i.find("a", {"class": "result__snippet"}).get_text().strip())
    except AttributeError:
        result__snippet.append(None)
    try:
        result_url.append(i.find("a", {"class": "result__url"}).get_text().strip())
    except AttributeError:
        result_url.append(None)
Related
There is a page that contains more than one table. I would like to scrape any table whatever I want.
I have noticed that using the code below I receive only access to the first table:
import requests
import lxml.html as lh
url= 'some url'
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')
for t in tr_elements[0]:
    name = t.text_content()
    print(name)
Following the answers to "How can I find an element by CSS class with XPath?", I tried the following to get access to the other table. I wrote
doc.xpath('//*[contains(@class, "some name of the class")]//tr') instead of just
doc.xpath('//tr'). However, this gave me no result. I must admit that my knowledge of XPath is very limited, so I would like to receive an answer rather than just being told that someone has asked a similar question.
Thank you in advance for your help.
EDIT:
here is the url: https://biznes.interia.pl/gieldy/notowania-gpw/profil-akcji-mab,wId,6852,tab,przebieg-sesji
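The idea of restricting the row search to one table by its class can be tried offline on a small snippet. The sketch below uses the standard library's ElementTree and its limited XPath subset on well-formed sample markup; the class names summary and quotes and the cell values are invented, not taken from the page above. With lxml, the corresponding expression would be doc.xpath('//table[contains(@class, "quotes")]//tr').

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample with two tables; only the second is wanted.
SAMPLE = """
<html><body>
  <table class="summary"><tr><td>ignored</td></tr></table>
  <table class="quotes">
    <tr><td>10:00</td><td>1.25</td></tr>
    <tr><td>10:05</td><td>1.30</td></tr>
  </table>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Select the rows of the table whose class attribute is exactly "quotes".
rows = root.findall(".//table[@class='quotes']//tr")

# Collect the text of every cell, row by row.
cells = [[td.text for td in row.findall("td")] for row in rows]
print(cells)
```

If an expression like this returns nothing on the real page, it is worth checking whether the table is present in the downloaded HTML at all, since it may be filled in by JavaScript after the page loads.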
I'm pretty new at this and I'm trying to figure out a way to look up a list of websites automatically. I have a very large list of companies and essentially I'd want the algorithm to type the company into Google, click the first link (most likely the company website) and figure out whether the company matches the target industry (ice cream distributors) or has anything to do with the industry. The way I'd want to check for this is by seeing if the home page contains any of the key words in a given dictionary (let's say, 'chocolate, vanilla, ice cream, etc'). I would really appreciate some help with this - thank you so much.
I recommend using a combination of requests and lxml. To accomplish this you could do something similar to this.
import requests
from lxml.cssselect import CSSSelector
from lxml import html
Use requests or grequests to get the HTML from all the pages.
queries = ['cats', 'dogs']
# Pass each query to the search page as the q parameter
responses = [requests.get('https://www.google.com/search', params={'q': x}) for x in queries]
data = [x.text for x in responses]
Parse the HTML with lxml and extract the first link on each page.
data = [html.document_fromstring(x) for x in data]
sel = CSSSelector('h3.r a')
links = [sel(x)[0] for x in data]
Finally, grab the HTML from all the first results.
pages = [requests.get(a.attrib['href']).text for a in links]
This will give you an HTML string for each of the pages you want. From there you should be able to simply search for the words you want in each page's HTML. You might find a Counter helpful.
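A minimal sketch of that last step, assuming the page text has already been downloaded: count whole-phrase occurrences of each keyword with collections.Counter and treat any hit as a match. The keyword list, the helper name keyword_counts, and the sample sentence are only illustrative.

```python
import re
from collections import Counter

# Illustrative keyword list for the target industry.
KEYWORDS = ["ice cream", "chocolate", "vanilla"]

def keyword_counts(page_text, keywords=KEYWORDS):
    """Count case-insensitive, whole-phrase occurrences of each keyword."""
    text = page_text.lower()
    counts = Counter()
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
        counts[keyword] = len(re.findall(pattern, text))
    return counts

sample = "We distribute ice cream: vanilla ice cream, chocolate, and more."
counts = keyword_counts(sample)

# Treat the page as relevant if at least one keyword occurs.
matches_industry = sum(counts.values()) > 0
print(counts, matches_industry)
```

A threshold higher than one hit may work better in practice, since a single stray mention of "chocolate" does not make a company an ice cream distributor.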
I'd like to scrape all the ~62000 names from this petition, using python. I'm trying to use the beautifulsoup4 library.
However, it's just not working.
Here's my code so far:
import urllib2, re
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())
divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs
[]
What am I doing wrong? Also, I want to somehow access the next page to add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.
You could try something like this:
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')
# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')
results = []
while True:
    # Read the web page in XML mode
    soup = BeautifulSoup(html.read(), "xml")
    try:
        for s in soup.find_all("signature"):
            # Scrape the names from the XML
            firstname = s.find('firstname').contents[0]
            lastname = s.find('lastname').contents[0]
            results.append(str(firstname) + " " + str(lastname))
    except:
        pass
    # Find the next page to scrape
    prev = soup.find("prev_signature")
    # Check if another page of result exists - if not break from loop
    if prev == None:
        break
    # Get the previous URL
    url = prev.contents[0]
    # Open the next page of results
    html = urllib2.urlopen(url)
    print("Extracting data from {}".format(url))
# Print the results
print("\n")
print("====================")
print("= Printing Results =")
print("====================\n")
print(results)
Be warned though there is a lot of data there to go through and I have no idea if this is against the terms of service of the website so you would need to check it out.
In most cases it is extremely inconsiderate to simply scrape a site. You put a fairly large load on the site in a short amount of time, slowing down legitimate users' requests, not to mention taking all of their data.
Consider an alternate approach such as asking (politely) for a dump of the data (as mentioned above).
Or if you do absolutely need to scrape:
Space your requests using a timer
Scrape smartly
I took a quick glance at that page and it appears to me they use AJAX to request the signatures. Why not simply copy their AJAX request, it'll most likely be using some sort of REST call. By doing this you lessen the load on their server by only requesting the data you need. It will also be easier for you to actually process the data because it will be in a nice format.
Re-edit: I looked at their robots.txt file. It disallows /xml/. Please respect this.
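A check like this can be automated with the standard library's urllib.robotparser. The sketch below parses a simplified stand-in for the rules rather than downloading the real file: only the /xml/ disallow rule is taken from the remark above, and the rest of the robots.txt content is assumed.

```python
from urllib.robotparser import RobotFileParser

# Simplified stand-in for the site's robots.txt (assumed content).
ROBOTS_TXT = """\
User-agent: *
Disallow: /xml/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

xml_url = ("http://www.thepetitionsite.com/xml/petitions/"
           "104/781/496/signatures/latest.xml")
page_url = ("http://www.thepetitionsite.com/104/781/496/"
            "ban-pesticides-used-to-kill-tigers/index.html")

# Ask whether a generic crawler may fetch each URL under these rules.
allowed_xml = parser.can_fetch("*", xml_url)
allowed_page = parser.can_fetch("*", page_url)
print(allowed_xml, allowed_page)  # False True
```

Against the live site, set_url() followed by read() would load the actual file instead of the stand-in.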
What do you mean by not working? An empty list or an error?
If you are receiving an empty list, it is because the class "name_location" does not exist in the document. Also check out bs4's documentation on findAll.
I'd like to grab all the index words and their definitions from here. Is it possible to scrape web content with Python?
Exploring with Firebug shows that the following URL returns the content I want, including both the index word and its definition, for 'a'.
http://pali.hum.ku.dk/cgi-bin/cpd/pali?acti=xart&arid=14179&sphra=undefined
Which modules are used for this? Is there any tutorial available?
I do not know how many words are indexed in the dictionary. I'm an absolute beginner in programming.
You should use urllib2 for getting the URL contents and BeautifulSoup for parsing the HTML/XML.
Example - retrieving all questions from the StackOverflow.com main page:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://stackoverflow.com")
soup = BeautifulSoup(page)
for incident in soup('h3'):
    print [i.decode('utf8') for i in incident.contents]
    print
This code sample was adapted from the BeautifulSoup documentation.
You can get data from the web using the built-in urllib or urllib2, but the parsing itself is the most important part. May I suggest the wonderful BeautifulSoup? It can deal with just about anything.
http://www.crummy.com/software/BeautifulSoup/
The documentation is built like a tutorial. Sorta:
http://www.crummy.com/software/BeautifulSoup/documentation.html
In your case, you probably need to use wildcards to see all entries in the dictionary. You can do something like this:
import urllib2
def getArticles(query, start_index, count):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xsea&tsearch=%s&rfield=entr&recf=%d&recc=%d' %
                          (query, start_index, count))
    # TODO:
    # parse xml code here (using BeautifulSoup or an xml parser like Python's
    # own xml.etree). We should at least have the name and ID for each article.
    # article = (article_name, article_id)
    return articles  # a list of (article_name, article_id) tuples parsed from the XML
def getArticleContent(article_id):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xart&arid=%d&sphra=undefined' % article_id)
    # TODO: parse xml
    return parsed_article
Now you can loop over things. For instance, to get all articles starting in 'ana', use the wildcard 'ana*', and loop until you get no results:
query = 'ana*'
article_dict = {}
i = 0
while True:
    new_articles = getArticles(query, i, 100)
    if len(new_articles) == 0:
        break
    i += 100
    for article_name, article_id in new_articles:
        article_dict[article_name] = getArticleContent(article_id)
Once done, you'll have a dictionary of the content of all articles, referenced by names. I omitted the parsing itself, but it's quite simple in this case, since everything is XML. You might not even need to use BeautifulSoup (even though it's still handy and easy to use for XML).
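As a rough idea of what the omitted parsing could look like with xml.etree, here is a sketch that turns a search response into (name, id) tuples. The XML layout is purely hypothetical: the results, article, and name tags and the id attribute are invented for illustration, and the real response will need its own tag names. Only the ID 14179 comes from the URL in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical response layout; the real tag and attribute names will differ.
SAMPLE_XML = """\
<results>
  <article id="14179"><name>a</name></article>
  <article id="14180"><name>ana</name></article>
</results>
"""

def parse_articles(xml_text):
    """Return a list of (article_name, article_id) tuples."""
    root = ET.fromstring(xml_text)
    return [(node.findtext("name"), int(node.get("id")))
            for node in root.iter("article")]

articles = parse_articles(SAMPLE_XML)
print(articles)  # [('a', 14179), ('ana', 14180)]
```

An empty result list from such a parser is what the loop above uses as its stop condition.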
A word of warning though:
You should check the site's usage policy (and maybe robots.txt) before trying to scrape articles heavily. If you're just getting a few articles for yourself they may not care (the dictionary's copyright owner, if it's not in the public domain, may care though), but if you're going to scrape the entire dictionary, this is going to be some heavy usage.
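If the site does permit it, one simple way to keep the load down is to pause between requests. The sketch below is one possible shape for that, not a recommendation of a specific delay: the fetch_all helper and its parameters are made up, and the download function is passed in so the pacing logic can be tried without touching the network.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results

# Demonstration with stand-ins, so nothing is downloaded and nothing sleeps.
pauses = []
pages = fetch_all(["page-1", "page-2", "page-3"],
                  fetch=lambda url: url.upper(),
                  delay_seconds=2.0,
                  sleep=pauses.append)
print(pages)   # ['PAGE-1', 'PAGE-2', 'PAGE-3']
print(pauses)  # [2.0, 2.0]
```

In real use, fetch would be a function that downloads and returns one page, and sleep would be left at its default.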
I've been trying to write a program for the last several hours that does what I thought would be an incredibly simple task:
Program asks for user input (let's say they type 'happiness')
Program queries the website thinkexist using this format ("http://thinkexist.com/search/searchQuotation.asp?search=USERINPUT")
Program returns first quote from the website.
I've tried using Xpath with lxml, but have no experience and every single construction comes back with a blank array.
The actual meat of the quote appears to be contained in the class "sqq."
If I navigate the site via Firebug and click the DOM tab, it appears the quote is in a textNode attribute "wholeText" or "textContent", but I don't know how to use that knowledge programmatically.
Any ideas?
import lxml.html
import urllib
site = 'http://thinkexist.com/search/searchquotation.asp'
userInput = raw_input('Search for: ').strip()
url = site + '?' + urllib.urlencode({'search':userInput})
root = lxml.html.parse(url).getroot()
quotes = root.xpath('//a[@class="sqq"]')
print quotes[0].text_content()
... and if you enter 'Shakespeare', it returns
In real life, unlike in Shakespeare, the sweetness
of the rose depends upon the name it bears. Things
are not only what they are. They are, in very important
respects, what they seem to be.
If it's not necessary for you to implement this via XPath, you may use the BeautifulSoup library like this (let the myXml variable contain the page's HTML source):
soup = BeautifulSoup(myXml)
for a in soup.findAll('a', {'class': 'sqq'}):
    # this is your quote
    print a.contents
Anyway, read the BS documentation, it may be very useful for some scraping needs that don't require the power of XPath.
You could open the html source to find out the exact class you are looking for. For example, to grab the first StackOverflow username encountered on the page you could do:
#!/usr/bin/env python
from lxml import html
url = 'http://stackoverflow.com/questions/4710307'
tree = html.parse(url)
path = '//div[@class="user-details"]/a[@href]'
print tree.findtext(path)
# -> Parseltongue
# OR to print text including the text in children
a = tree.find(path)
print a.text_content()
# -> Parseltongue