I need to make an app that uses Python to search for specific names on a website. For instance, I have to check if the string "Robert Paulson" is being used on a website. If it is, it returns True; else, False. Also, is there any library that can help me make that?
Since you have not attempted to make your application first, I am not going to post code for you. I will, however, suggest using:
urllib2:
A robust module for interacting with web pages, i.e. pulling back the HTML of a page.
BeautifulSoup (from bs4 import BeautifulSoup):
An awesome module to "regex" HTML to find what it is that you're looking for.
Good luck my friend!
You could do something similar to this other answer. You will just need the regex to find your string.
I have also used Selenium webdriver to solve some more complex website searching, although I think the link I provided would solve your problem more simply.
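A minimal sketch of that idea, assuming Python 3 (where urllib2 became urllib.request) and a plain server-rendered page; the URL is a placeholder:

import re
import urllib.request

def name_on_page(url, name="Robert Paulson"):
    # Fetch the raw HTML of the page (urllib2's role, here via urllib.request).
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
    # A simple regex search; for a fixed string, `name in html` would work just as well.
    return re.search(re.escape(name), html) is not None

print(name_on_page("http://example.com"))  # True or False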
When you search for something in your browser, it gives you by default a list of websites related to the search you have done, but I was wondering if there is a way to store/print/iterate the list of URLs shown on that results page.
I haven't tried anything because I don't even know which Python library I should use.
Which library should I use for this purpose?
I hope that it is a valid question.
Beautiful Soup
Requests
Selenium
Pick your poison.
Read the docs.
???
Profit!
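If you go the Requests + Beautiful Soup route, a rough sketch follows. The URL and the assumption that result links are plain <a> tags are placeholders; real search engines change their markup often and may block automated requests, so check their terms or consider an official API.

import requests
from bs4 import BeautifulSoup

# Placeholder URL: substitute a results page you are allowed to fetch.
resp = requests.get("https://example.com/search?q=python")
soup = BeautifulSoup(resp.text, "html.parser")

# Collect every href on the page; a real results page needs a more specific selector.
urls = [a["href"] for a in soup.find_all("a", href=True)]
for url in urls:
    print(url)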
I would like to know if it is possible to scrape Google search specifying a date range. I read about googlesearch and I am trying to use its search function. However, it seems that something is not working.
Using 'cdr:1,cd_min:01/01/2020,cd_max:01/01/2020' to search for all results about a query (for example, Kevin Spacey), it is not returning the expected URLs. I guess something is not working with the function (as defined in the library). Has anyone ever tried to use it?
I am looking for results in Italian (only pages in Italian and with the domain google.it). Another way to scrape these results would also be welcome.
Many thanks
May this information help you:
Then, use HTTP Spy to get the details of the request. It's useful when Google changes its search format and the module has not yet been updated to match.
Good luck!
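For what it's worth, a sketch of the call described in the question, assuming the googlesearch package (pip install google) and that its tbs parameter is still passed through to Google unchanged:

from googlesearch import search

# Italian pages on google.it, restricted to the custom date range from the question.
# Whether this works depends on Google not having changed its parameters or markup.
for url in search("Kevin Spacey",
                  tld="it",
                  lang="it",
                  tbs="cdr:1,cd_min:01/01/2020,cd_max:01/01/2020",
                  num=10,
                  stop=20,
                  pause=2.0):
    print(url)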
I'm currently using urllib2 and BeautifulSoup to open and parse HTML data. However, I've run into a problem with a site that uses JavaScript to load the images after the page has been rendered (I'm trying to find the image source for a certain image on the page).
I'm thinking Twill could be a solution, and am trying to open the page and use a regular expression with 'find' to return the html string I'm looking for. I'm having some trouble getting this to work though, and can't seem to find any documentation or examples on how to use regular expressions with Twill.
Any help or advice on how to do this or solve this problem in general would be much appreciated.
I'd rather use CSS selectors or "real" regexps on the page source. Twill is, AFAIK, not being worked on anymore. Have you tried BS or PyQuery with CSS selectors?
Twill does not work with JavaScript (see http://twill.idyll.org/browsing.html).
Use a webdriver if you want to handle JavaScript.
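A minimal webdriver sketch, assuming Selenium 4, Firefox, and a placeholder URL and CSS selector for the image you are after:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get("http://example.com/page-with-js-images")  # placeholder URL
    # Wait for the JavaScript-inserted image to appear before reading its src.
    img = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "img#target-image"))
    )
    print(img.get_attribute("src"))
finally:
    driver.quit()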
I am writing a programme in Python to extract all the URLs from a given website. All the URLs from a site, not from a single page.
As I suppose I am not the first one who wants to do that, I was wondering if there is a ready-made solution or if I have to write the code myself.
It's not gonna be easy, but a decent starting point would be to look into these two libraries:
urllib
BeautifulSoup
I didn't see any ready-made scripts that do this on a quick Google search.
Using the scrapy framework makes this almost trivial.
The time-consuming part would be learning how to use Scrapy. Their tutorials are great though and shouldn't take you that long.
http://doc.scrapy.org/en/latest/intro/tutorial.html
Creating a solution that others can use is one of the joys of being part of a programming community. If a scraper doesn't exist, you can create one that everyone can use to get all links from a site!
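As a rough illustration of how little Scrapy code this takes (assuming a recent Scrapy; the domain and URLs are placeholders, and the tutorial linked above covers project setup):

import scrapy

class SiteLinksSpider(scrapy.Spider):
    name = "site_links"
    allowed_domains = ["example.com"]      # placeholder: the site you want to crawl
    start_urls = ["http://example.com/"]

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            yield {"url": response.urljoin(href)}              # record every link found
            yield response.follow(href, callback=self.parse)   # keep crawling the site

Saved to a file, a spider like this can be run with scrapy runspider yourfile.py -o urls.json (the file name is up to you).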
The given answers are what I would have suggested (+1).
But if you really want to do something quick and simple, and you're on a *NIX platform, try this:
lynx -dump YOUR_URL | grep http
Where YOUR_URL is the URL that you want to check. This should get you all the links you want (except for relative links that are not written out in full).
You first have to download the page's HTML content using a package like urllib or requests.
After that, you can use Beautiful Soup to extract the URLs. In fact, their tutorial shows how to extract all links enclosed in <a> elements as a specific example:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
If you also want to find links not enclosed in <a> elements, you may have to write something more complex on your own.
EDIT: I also just came across two Scrapy link extractor classes that were created specifically for this task:
http://doc.scrapy.org/en/latest/topics/link-extractors.html
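Inside a spider callback, a link extractor can be used directly; `response` here is whatever page Scrapy just downloaded, and the domain filter is an assumption:

from scrapy.linkextractors import LinkExtractor

# Inside a spider's parse() method, where `response` is the downloaded page:
for link in LinkExtractor(allow_domains=["example.com"]).extract_links(response):
    print(link.url)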
How can I execute links2 to open a web page and locate and click a text link with Python?
Is pexpect able to do it? Any examples are appreciated.
Not sure why you want to do this. If you want to grab the web link and process the page content, urllib2 together with an HTML parser (BeautifulSoup for example) may be just fine.
If you do want to simulate mouse clicks, you may want to use AutoPy.
Why do you want to use links2? I don't see how you could benefit from that. It is probably better to approach your problem in a different way, like with mechanize or maybe even twill.
Please provide a description of your overall problem instead of that specific question.
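With mechanize, for example, following a link by its visible text is roughly this (the URL and link text are placeholders):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)            # some sites block the default robots handling
br.open("http://example.com/")         # placeholder URL
resp = br.follow_link(text="Sign in")  # placeholder link text
print(resp.geturl())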
If you want JavaScript support, use Selenium RC with whatever language you are comfortable with.
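For example, with a modern Selenium WebDriver (the successor to Selenium RC), clicking a link by its text might look like this; the URL and link text are placeholders:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://example.com/")                                  # placeholder URL
driver.find_element(By.LINK_TEXT, "More information...").click()   # placeholder link text
print(driver.current_url)
driver.quit()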