Extract email and phone numbers from websites using Python and Scrapy

I have a list of thousands of websites and I would like to extract phone numbers and emails if available, possibly using Python + Scrapy.
I found this one:
https://levelup.gitconnected.com/scraping-websites-for-phone-numbers-and-emails-with-python-5557fcfa1596
but it looks like the package is not available anymore.
Any suggestions?
Thanks!

This is a broad question, so I can't answer it completely here.
Basically, you need to follow these steps:
First, download and parse the website's HTML, e.g. with Requests + BS4, or with Scrapy.
Then use regular expressions to find the emails and phone numbers.
Also check this article: https://www.geeksforgeeks.org/email-id-extractor-project-from-sites-in-scrapy-python/
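For a concrete starting point, here is a minimal sketch of those two steps using Requests plus regular expressions (the URL is a placeholder, and the patterns are deliberately loose, so expect false positives on real pages):
import re
import requests

# loose patterns: good enough for a first pass, not RFC-compliant
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().\-/]{6,}\d")

def extract_contacts(url):
    # download the raw HTML and scan it for email- and phone-like strings
    html = requests.get(url, timeout=10).text
    return set(EMAIL_RE.findall(html)), set(PHONE_RE.findall(html))

emails, phones = extract_contacts("https://example.com")  # placeholder URL
print(emails, phones)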

Related

How can you iterate through the list of websites returned by a search engine in Python?

When you search for something in your browser, it gives you by default a list of websites related to your search, but I was wondering if there is a way to store/print/iterate the list of URLs shown on that main results page.
I haven't tried anything because I don't even know which Python library I should use.
Which library should I use for this purpose?
I hope that it is a valid question.
Beautiful Soup
Requests
Selenium
Pick your poison.
Read the docs.
???
Profit!
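If you want something concrete to start from, here is a minimal Requests + Beautiful Soup sketch. It assumes DuckDuckGo's HTML-only endpoint and its result__a link class, both of which are details of a third-party page and may change at any time:
import requests
from bs4 import BeautifulSoup

# query the HTML-only search endpoint (assumed; may change without notice)
resp = requests.get(
    "https://html.duckduckgo.com/html/",
    params={"q": "python web scraping"},
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=10,
)
soup = BeautifulSoup(resp.text, "html.parser")

# iterate the result links; "result__a" is the class DuckDuckGo currently
# uses for result titles (an assumption about an external page)
for link in soup.select("a.result__a"):
    print(link.get("href"))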

How to scrape a website and all its directories from one link?

Sorry if this is not a valid question; I personally feel it borders on the edge.
Assuming the website involved has given full permission:
How could I download the ENTIRE contents (HTML) of that website using a Python data scraper? By entire contents I refer to not only the current page you are on, but any other directory that branches off of that main website. E.g.
Using the link:
https://www.dogs.com
could I pull info from:
https://www.dogs.com/about-us
and any other directory attached to "https://www.dogs.com/"?
(I have no idea if dogs.com is a real website or not, it's just an example.)
I have already made a scraper that will pull info from a single link (nothing further than that), but I want to improve it further so I don't have to feed it heaps of links. I understand I can use an API, but if this is possible I would rather do it this way. Cheers!
While there is Scrapy to do it professionally, you can use Requests to get the URL data and BS4 to parse the HTML and look into it. It's also easier for a beginner, I guess.
Whichever way you go, you need a starting point; then you just follow the links in the page, and then the links within those pages.
You might need to check whether each URL links to another website or is still within the targeted website. Find the pages one by one and scrape them, as in the sketch below.
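A minimal sketch of that follow-the-links approach with Requests + BS4 (dogs.com is the asker's placeholder, not a tested site):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def crawl(start_url, max_pages=100):
    # breadth-first crawl that never leaves the starting domain
    domain = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        pages[url] = html
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            # only follow links that stay on the targeted website
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

pages = crawl("https://www.dogs.com")  # placeholder URL from the question
print(len(pages), "pages downloaded")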

Web scraping phone numbers

First of all, I'm a total newbie to programming and my English is not the best.
I'm using Python 3.6 on Windows 10 Pro.
After some trial and error I finally figured out how to scrape data from a webpage via lxml and how to use Beautiful Soup and csv to add it to an Excel sheet.
So far that works for me. It was pretty easy to collect lists of names, addresses and distances. But when I tried to extract the phone numbers and emails, I got into trouble. After some research I found out they split the phone number and kind of encoded it. Emails are also kind of tricky.
The webpage I want to extract that data from is:
https://www.gelbeseiten.de/schluesselfertigbau/bergheim,,,,,umkreis-50000
I found out that the first part of the phone number is in here:
<span class="nummer">(02271) 6 79</span>
They hid the rest in here:
<span class="suffix encode_me telSelector128028047679_2623072" data-telselector="telSelector128028047679_2623072" data-telsuffix="IDcw"> 70</span>
Even though the first part seems easy, I can't use lxml the way I'm used to for extracting it.
So my question is whether it's still possible for a beginner to extract those phone numbers and emails.
Or should I try to get those numbers out of the print PDF files?
Try the solution below to get the phone numbers:
import requests
from lxml import html

# fetch the page and parse the HTML with lxml
source = html.fromstring(requests.get("https://www.gelbeseiten.de/schluesselfertigbau/bergheim,,,,,umkreis-50000").text)
# join the non-empty text nodes under the phone list item
phone_number = "".join([text_node for text_node in source.xpath('//li[@class="phone"]//text()') if text_node.strip()])
print(phone_number)
Output:
'(02271) 6 79 70'
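If you want to reassemble the split numbers yourself instead, note that the data-telsuffix attribute quoted in the question is plain Base64: base64.b64decode("IDcw") gives b" 70", which matches the visible suffix. Here is a sketch with Beautiful Soup, assuming the class names from the question's markup are still in use on the page:
import base64
import requests
from bs4 import BeautifulSoup

url = "https://www.gelbeseiten.de/schluesselfertigbau/bergheim,,,,,umkreis-50000"
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

# "nummer" and "encode_me" come from the markup quoted in the question
# and may have changed since it was asked
for prefix in soup.select("span.nummer"):
    number = prefix.get_text(strip=True)
    suffix = prefix.find_next("span", class_="encode_me")
    if suffix and suffix.get("data-telsuffix"):
        # the hidden tail of the number is Base64-encoded, e.g. "IDcw" -> " 70"
        number += base64.b64decode(suffix["data-telsuffix"]).decode("utf-8")
    print(number)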

How to extract all the URLs from a website?

I am writing a program in Python to extract all the URLs from a given website. All the URLs from a site, not just from a single page.
As I suppose I am not the first one who wants to do this, I was wondering if there is a ready-made solution or if I have to write the code myself.
It's not gonna be easy, but a decent starting point would be to look into these two libraries:
urllib
BeautifulSoup
I didn't see any ready-made scripts that do this on a quick Google search.
Using the Scrapy framework makes this almost trivial.
The time-consuming part would be learning how to use Scrapy. Their tutorials are great, though, and shouldn't take you that long.
http://doc.scrapy.org/en/latest/intro/tutorial.html
Creating a solution that others can use is one of the joys of being part of a programming community. If a scraper doesn't exist, you can create one that everyone can use to get all links from a site!
The given answers are what I would have suggested (+1).
But if you really want to do something quick and simple, and you're on a *NIX platform, try this:
lynx -dump YOUR_URL | grep http
Where YOUR_URL is the URL that you want to check. This should get you all the links you want (except for links that are not written out in full, i.e. relative links).
You first have to download the page's HTML content using a package like urllib or Requests.
After that, you can use Beautiful Soup to extract the URLs. In fact, their tutorial shows how to extract all links enclosed in <a> elements as a specific example:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
If you also want to find links not enclosed in <a> elements, you may have to write something more complex on your own.
EDIT: I also just came across two Scrapy link extractor classes that were created specifically for this task:
http://doc.scrapy.org/en/latest/topics/link-extractors.html
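A minimal sketch of such a spider, built on Scrapy's CrawlSpider and LinkExtractor (example.com is a placeholder for the real site):
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class AllUrlsSpider(CrawlSpider):
    name = "all_urls"
    allowed_domains = ["example.com"]  # placeholder domain
    start_urls = ["https://example.com"]

    # follow every in-domain link and record each page that gets visited
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_item(self, response):
        yield {"url": response.url}

Saved as all_urls_spider.py, this can be run standalone with something like scrapy runspider all_urls_spider.py -o urls.json.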

A simple spider question

I am a newbie trying to achieve this simple task using Scrapy, with no luck so far. I am asking your advice about how to do this with Scrapy or with any other tool (with Python). Thank you.
I want to
start from a page that lists bios of attorneys whose last names start with A: initial_url = www.example.com/Attorneys/List.aspx?LastName=A
from LastName=A, extract links to the actual bios: /BioLinks/
visit each of the /BioLinks/ to extract the school info for each attorney.
I am able to extract the /BioLinks/ and the school information, but I am unable to go from the initial URL to the bio pages.
If you think this is the wrong way to go about this, then, how would you achieve this goal?
Many thanks.
Not sure I fully understand what you're asking, but maybe you need to get the absolute URL to each bio and retrieve the source code for that page:
import urllib2
bio_page = urllib2.urlopen(bio_url).read()
Then use regular expressions or other parsing to get the attorney's law school.
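For the Scrapy route the asker mentioned, here is a sketch of the two-hop pattern. The start URL is the placeholder from the question, and both selectors are hypothetical stand-ins for whatever actually matches the listing and bio pages:
import scrapy

class AttorneySpider(scrapy.Spider):
    name = "attorneys"
    # placeholder URL from the question
    start_urls = ["http://www.example.com/Attorneys/List.aspx?LastName=A"]

    def parse(self, response):
        # hop 1: extract the /BioLinks/ from the listing page
        # ("a.bio-link" is a hypothetical selector; adjust to the real markup)
        for href in response.css("a.bio-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_bio)

    def parse_bio(self, response):
        # hop 2: pull the school info from each bio page
        # ("div.school" is likewise hypothetical)
        yield {
            "url": response.url,
            "school": response.css("div.school::text").get(),
        }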
