First of all, I'm a total newbie to programming and my English is not the best.
I'm using Python 3.6 on Windows 10 Pro.
After some trial and error I finally figured out how to scrape data from a webpage via lxml and how to use BeautifulSoup and csv to add it to an Excel sheet.
So far that works for me. It was pretty easy to collect lists of names, addresses and distances. But when I tried to extract the phone numbers and emails, I ran into trouble. After some research I found out that the site splits each phone number and encodes part of it. Emails are also tricky.
The webpage I want to extract that data from is:
https://www.gelbeseiten.de/schluesselfertigbau/bergheim,,,,,umkreis-50000
I found out that the first part of the phone number is in here:
<span class="nummer">(02271) 6 79</span>
The rest is hidden in here:
<span class="suffix encode_me telSelector128028047679_2623072" data-telselector="telSelector128028047679_2623072" data-telsuffix="IDcw"> 70</span>
Even though the first part looks easy, I can't extract it with lxml the way I'm used to.
So my question is: is it still possible for a beginner to extract those phone numbers and emails?
Or should I try to get the numbers out of the printable PDF files instead?
Try the solution below to get the phone number:
import requests
from lxml import html
source = html.fromstring(requests.get("https://www.gelbeseiten.de/schluesselfertigbau/bergheim,,,,,umkreis-50000").text)
# Join every non-empty text node inside the first-matching <li class="phone"> elements.
phone_number = "".join([text_node for text_node in source.xpath('//li[@class="phone"]//text()') if text_node.strip()])
print(phone_number)
Output:
'(02271) 6 79 70'
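As a side note, the data-telsuffix value in the question ("IDcw") happens to be the Base64 encoding of " 70", the visible suffix. The sketch below rebuilds the number from the two parts on that basis. It assumes every suffix on the site is encoded the same way, which was only checked against this one sample; the function names are made up for illustration.

```python
import base64


def decode_tel_suffix(encoded_suffix):
    """Decode a data-telsuffix value, assuming it is Base64-encoded UTF-8 text."""
    return base64.b64decode(encoded_suffix).decode("utf-8")


def full_number(prefix, encoded_suffix):
    """Join the visible prefix with the decoded suffix."""
    return prefix + decode_tel_suffix(encoded_suffix)


# Values copied from the two <span> elements shown in the question.
print(full_number("(02271) 6 79", "IDcw"))
```

If the text of the suffix span is already readable in the downloaded HTML, as in the answer above, the decoding step is not needed; it may help when only the attribute is available.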
I have to scrape text data from this website. I have read some blogs on web scraping, but the major challenge I have found is parsing the HTML code. I am entirely new to this field. Can I get some help on how to scrape the text data (where that is possible) and turn it into a CSV? Is this possible at all without knowledge of HTML? Could someone show a good demonstration of Python code that solves my problem, so that I can then try it on my own for other websites?
TIA
The tools you can use in Python to scrape and parse HTML data are the requests module and the Beautiful Soup library.
Parsing HTML files into, for example, CSV files is entirely possible; it just requires some effort to learn the tools. In my view there's no better way to learn this than by trying it out yourself.
As for "do you need to know HTML to parse HTML files?": well, yes, you do, but the good thing is that HTML is actually quite simple. I suggest you take a look at some tutorials like this one, then inspect the webpage you're interested in and see if you can relate the two.
I appreciate that my answer is not really what you were looking for. However, as I said, I think there's no better way to learn than to try things out yourself. If you then get stuck on anything in particular, you can ask on SO for specific help :)
I didn't check the HTML of the website, but you can use BeautifulSoup for parsing
the HTML and pandas for converting the data into CSV.
Sample code:
import requests
from bs4 import BeautifulSoup

# Placeholder URL; requests needs the scheme (http:// or https://).
res = requests.get('https://yourwebsite.com')
soup = BeautifulSoup(res.content, 'html.parser')
# Suppose I want the links inside all 'li' tags.
lis = soup.find_all("li")
links = []
for li in lis:
    a_tag = li.find("a")
    # Skip list items that contain no link.
    if a_tag is not None:
        links.append(a_tag.get("href"))
You can find lots of pandas tutorials online.
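For the CSV step, pandas is not strictly required. Here is a minimal sketch that uses the standard-library csv module instead (a swapped-in alternative, not the pandas route mentioned above). The function name, the "link" column header and the sample links are assumptions for illustration.

```python
import csv
import io


def links_to_csv(links):
    """Return CSV text with a header row and one link per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["link"])  # hypothetical column name
    for link in links:
        writer.writerow([link])
    return buffer.getvalue()


# Hypothetical links, standing in for the list collected by the scraper.
print(links_to_csv(["/about", "https://example.com/contact"]))
```

To write a file instead of a string, the same writer can be pointed at open(path, "w", newline="").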
I have a list of thousands of websites and I would like to extract phone numbers and emails if available.
Possibly using python + scrapy
I found this one
https://levelup.gitconnected.com/scraping-websites-for-phone-numbers-and-emails-with-python-5557fcfa1596
but it looks like the package is not available anymore.
Any suggestions?
thanks!
This is a broad question, so I can't answer it completely here.
Basically, you need to follow these steps:
First, scrape the website's HTML using BS4 or Scrapy.
Then use regular expressions to find the emails and phone numbers.
Also check this article: https://www.geeksforgeeks.org/email-id-extractor-project-from-sites-in-scrapy-python/
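The regex step could look roughly like the sketch below. Both patterns are deliberately loose assumptions rather than validators: the email pattern misses unusual addresses, and the phone pattern will also match other long digit runs such as dates or IDs, so the results need checking. The sample text is made up.

```python
import re

# Loose patterns for illustration only; tune them for the sites being scraped.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def extract_contacts(text):
    """Return (emails, phones) found in text, de-duplicated and sorted."""
    emails = sorted(set(EMAIL_RE.findall(text)))
    phones = sorted(set(match.strip() for match in PHONE_RE.findall(text)))
    return emails, phones


sample = "Contact: info@example.com or call +49 2271 67970."
print(extract_contacts(sample))
```

In practice, text would be the page text obtained in the first step, for example from BeautifulSoup's get_text().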
<br>Questionnaire score: 0 out of 0<br>
<br>Question: 1: Present Location ? Ready to relocate?<br>
<br>Answer: Yes<br>
<br>Question: 2: Highest level of education and completion date<br>
<br>Answer: Bachelors<br>
<br>Question: 3: Are you authorized to work in UK?<br>
<br>Answer: Yes<br>
The questions are fixed, but the answers may differ. I have tried HTMLParser and beautifulsoup4.
Please help with suitable code.
Thanks..!
You can use Scrapy, one of the most popular frameworks for web scraping with Python; it makes this kind of task simple.
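For a fixed layout like the one in the question, a plain regular expression over the raw markup may already be enough. This is a minimal sketch that assumes every question is followed directly by its answer and that both are delimited by <br> exactly as in the sample; the function name is made up.

```python
import re

raw = """<br>Questionnaire score: 0 out of 0<br>
<br>Question: 1: Present Location ? Ready to relocate?<br>
<br>Answer: Yes<br>
<br>Question: 2: Highest level of education and completion date<br>
<br>Answer: Bachelors<br>
<br>Question: 3: Are you authorized to work in UK?<br>
<br>Answer: Yes<br>"""

# One match per question/answer pair: number, question text, answer text.
PAIR_RE = re.compile(
    r"Question:\s*(\d+):\s*(.*?)<br>\s*<br>Answer:\s*(.*?)<br>", re.S
)


def parse_questionnaire(markup):
    """Return a list of (number, question, answer) tuples."""
    return [
        (int(number), question.strip(), answer.strip())
        for number, question, answer in PAIR_RE.findall(markup)
    ]


for number, question, answer in parse_questionnaire(raw):
    print(number, question, "->", answer)
```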
I am having some issues parsing an IM chat log using Python 2.7. I am currently using BeautifulSoup.get_text. This generally works, but sometimes masks interesting stuff. For instance:
<font color="#A82F2F"><font size="2">(3/11/2016 3:11:57 PM)</font> <b>user name:</b></font> <html xmlns='http://jabber.org/protocol/xhtml-im'><body xmlns='http://www.w3.org/1999/xhtml'><p>Have you posted the key to https://___.edu/sshkeys/?</p></body></html><br/>
In this case, I get the Have you posted the key to part, but it strips out the https:________ part.
Most, but not all, of the lines are formatted the same way, i.e. date and time, user, interesting stuff.
Is there a better way to parse this to get the text AND all the interesting stuff?
You can use find_all:
for anchor in soup.find_all('a', href=True):
    # anchor['href'] is the link target; get_text() returns the link text.
    print("The anchor url={} text={}".format(anchor['href'], anchor.get_text()))
Depending on how you want to output this information, you'd have to get more or less clever.
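In the sample line from the question the URL is plain text inside the <p> element rather than an <a> tag, so an anchor search would not find it. As an alternative sketch (not the approach above), the standard-library html.parser can collect every text node. It is written for Python 3, whereas the question uses 2.7, and the class and function names are made up.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every text node, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(markup):
    """Return the concatenated text content of an HTML fragment."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.parts)


# The sample log line from the question.
line = """<font color="#A82F2F"><font size="2">(3/11/2016 3:11:57 PM)</font> <b>user name:</b></font> <html xmlns='http://jabber.org/protocol/xhtml-im'><body xmlns='http://www.w3.org/1999/xhtml'><p>Have you posted the key to https://___.edu/sshkeys/?</p></body></html><br/>"""

print(visible_text(line))
```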
I am a newbie trying to achieve this simple task with Scrapy, with no luck so far. I am asking for your advice about how to do this with Scrapy or with any other tool (in Python). Thank you.
I want to:
Start from a page that lists bios of attorneys whose last name starts with A: initial_url = www.example.com/Attorneys/List.aspx?LastName=A
From LastName=A, extract the links to the actual bios: /BioLinks/
Visit each of the /BioLinks/ to extract the school info for each attorney.
I am able to extract the /BioLinks/ and School information but I am unable to go from the initial url to the bio pages.
If you think this is the wrong way to go about this, then, how would you achieve this goal?
Many thanks.
Not sure I fully understand what you're asking, but maybe you need to get the absolute URL of each bio and retrieve the source code for that page:
# Python 2; in Python 3 use urllib.request.urlopen instead.
import urllib2
bio_page = urllib2.urlopen(bio_url).read()
Then use a regular expression or another parsing method to get the attorney's law school.
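Building the absolute URLs could be sketched as follows, in Python 3 with the standard library. The bio paths are hypothetical placeholders for the /BioLinks/ values extracted from the list page, and an http:// scheme is assumed for the URL from the question. fetch is defined but not called, since example.com has no such pages.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def absolute_bio_urls(list_page_url, relative_links):
    """Resolve each relative bio link against the list page URL."""
    return [urljoin(list_page_url, link) for link in relative_links]


def fetch(url):
    """Download a page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


initial_url = "http://www.example.com/Attorneys/List.aspx?LastName=A"
# Hypothetical relative links, standing in for the extracted /BioLinks/ values.
bio_links = ["/BioLinks/smith.aspx", "/BioLinks/jones.aspx"]

for bio_url in absolute_bio_urls(initial_url, bio_links):
    print(bio_url)  # each of these would then be passed to fetch()
```

In Scrapy itself, the usual pattern is to yield a new request for each bio URL from the list-page callback and to parse the school info in a second callback.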