Python 2.7 BeautifulSoup, email scraping

Hope you are all well. I'm new to Python and I'm using Python 2.7.
I'm trying to extract only the mailto links from this public business directory: http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search
The emails I'm looking for are the ones shown in every widget, A to Z, in the full directory. Unfortunately, this directory does not have an API.
I'm using BeautifulSoup, but with no success so far.
Here is my code:
import urllib
from bs4 import BeautifulSoup

website = raw_input("Type website here:>\n")
html = urllib.urlopen('http://' + website).read()
soup = BeautifulSoup(html)

# grab every anchor tag and print its href
tags = soup('a')
for tag in tags:
    print tag.get('href', None)
What I get is just the site's own URLs, like http://www.tecomdirectory.com, and other hrefs rather than the mailto links or the websites in the widgets. I also tried replacing soup('a') with soup('target'), but no luck! Can anybody help me, please?

You cannot just grab every anchor; you need to specifically look for "mailto:" in the href. You can use the CSS selector a[href^="mailto:"], which matches anchor tags whose href starts with mailto::
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search").content, "html.parser")
print([a["href"] for a in soup.select('a[href^="mailto:"]')])
Or extract the text:
print([a.text for a in soup.select('a[href^="mailto:"]')])
Using find_all("a") you would need a regex to achieve the same:
import re

soup.find_all("a", href=re.compile(r"^mailto:"))

Related

Web scraping a YouTube page

I'm trying to get the title of YouTube videos given a link, but I'm unable to access the element that holds the title. I'm using bs4 to parse the HTML.
I noticed I'm unable to access any element that sits inside the 'ytd-app' tag on the YouTube page.
import bs4
import requests
listed_url = "https://www.youtube.com/watch?v=9IfT8KXX_9c&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=31"
listed = requests.get(listed_url)
soup = bs4.BeautifulSoup(listed.text, "html.parser")
a = soup.find_all(attrs={"class": "style-scope ytd-video-primary-info-renderer"})
print(a)
So how can I get the video title? Is there something I'm doing wrong, or did YouTube intentionally structure the page like this to prevent web scraping?
The class you are using is rendered through JavaScript and all of that content is dynamic, so it is very difficult to find that data with bs4.
What you can do instead is inspect the soup manually and find a tag that is present in the static HTML.
You can also try pytube (a sketch follows the code below).
import bs4
import requests
listed_url = "https://www.youtube.com/watch?v=9IfT8KXX_9c&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=31"
listed = requests.get(listed_url)
soup = bs4.BeautifulSoup(listed.text, "html.parser")
soup.find("title").get_text()

How to find all Elements of a specific Type with the new Requests-HTML library

I want to find all specific fields in an HTML document. In BeautifulSoup everything works with this code:
soup = BeautifulSoup(html_text, 'html.parser')
urls_previous = soup.find_all('h2', {'class': 'b_algo'})
But how can I do the same search with the requests-html library? Or can requests-html only find a single element in an HTML document? I couldn't find how to do it in the docs or examples.
https://html.python-requests.org/
Example:
<li class="b_algo"><h2>Vereinigte Staaten – Wikipedia</h2>https://de.wikipedia.org/wiki/Vereinigte_Staaten</div><p>U.S., I wanna have THIS text here</p></li>
How can I find all elements of a specific type with the requests-html library?
With requests-html:
from requests_html import HTML
doc = """<li class="b_algo"><h2>Vereinigte Staaten – Wikipedia</h2>https://de.wikipedia.org/wiki/Vereinigte_Staaten</div><p>U.S., I wanna have THIS text here</p></li>"""
#load html from string
html = HTML(html=doc)
x = html.find('h2')
print(x)
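If you also need the class filter (the equivalent of find_all('h2', {'class': 'b_algo'})), here is a hedged sketch using requests-html's CSS selectors and the Element.text attribute; it assumes your real page wraps each result in an element with class b_algo:
from requests_html import HTML

doc = """<li class="b_algo"><h2>Vereinigte Staaten – Wikipedia</h2>https://de.wikipedia.org/wiki/Vereinigte_Staaten<p>U.S., I wanna have THIS text here</p></li>"""

html = HTML(html=doc)
for item in html.find("li.b_algo"):           # CSS selector: tag plus class
    print(item.find("h2", first=True).text)   # heading text
    print(item.find("p", first=True).text)    # paragraph text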

Extract data from span tag within a div tag

I have the HTML element below and I'm trying to extract the 3:40 as text to use in my Python script. How would I go about grabbing that info?
[screenshot of the example element]
I would use the BeautifulSoup library. Here is how I would grab this info, knowing that you already have the HTML file:
from bs4 import BeautifulSoup

with open(html_path) as html_file:
    html_page = BeautifulSoup(html_file, 'html.parser')
    div = html_page.find('div', class_='playbackTimeline__duration')
    span = div.find('span', {'aria-hidden': 'true'})
    text = span.get_text()
I'm not sure if it works, but it gives you an idea of how to do this kind of thing. Look up "web scraping" if you want more information. :)

Extract links from html page using BeautifulSoup

I need to extract some articles from the Biography website.
so from this page http://www.biography.com/people I need all the sublinks.
for example:
/people/ryan-seacrest-21095899
/people/edgar-allan-poe-9443160
but I have two problems:
1- When I try to find all <a> tags, I can't find the hrefs that I need.
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.biography.com/people"
text = urllib2.urlopen(url).read()
soup = BeautifulSoup(text)

divs = soup.findAll('a')
for div in divs:
    print(div)
2- There is a "see more" button. How can I get the links for all the people on the website, not just those that appear on the first page?
The site you show uses Angular, and part of the content is generated with JS. BeautifulSoup does not execute JS. You need to use http://selenium-python.readthedocs.io/ or a similar tool. Alternatively, you can dig into the AJAX GET (or maybe POST) request the page makes and fetch the data through it.
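A rough Selenium sketch, assuming chromedriver is on your PATH and the page still renders its people links under /people/... (both are assumptions on my part):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://www.biography.com/people")

# NOTE: to cover more than the first page you would also have to click the
# "see more" button in a loop before collecting the links; the selector for
# that button depends on the current page markup.
links = {a.get_attribute("href")
         for a in driver.find_elements(By.CSS_SELECTOR, 'a[href^="/people/"]')}
print(links)

driver.quit()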

How to find specific video html tag using beautiful soup?

Does anyone know how to use BeautifulSoup in Python?
I have this search engine with a list of different URLs.
I want to get only the HTML tag containing a video embed URL, and then get the link.
Example:
import BeautifulSoup
html = '''https://archive.org/details/20070519_detroit2'''
#or this.. html = '''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
#or this... html = '''https://www.youtube.com/watch?v=fI3zBtE_S_k'''
soup = BeautifulSoup.BeautifulSoup(html)
What should I do next to get the HTML tag of the video, or the object, or the exact link to the video?
I need it to put into my iframe. I will integrate the Python with my PHP, so after getting the link of the video and outputting it with Python, I will echo it into my iframe.
You need to get the HTML of the page, not just the URL.
Use the built-in lib urllib like this:
import urllib
from bs4 import BeautifulSoup as BS
url = '''https://archive.org/details/20070519_detroit2'''
#open and read page
page = urllib.urlopen(url)
html = page.read()
#create BeautifulSoup parse-able "soup"
soup = BS(html)
#get the src attribute from the video tag
video = soup.find("video").get("src")
Also, with the site you are using, I noticed that to get the embed link you just change details in the link to embed, so it looks like this:
https://archive.org/embed/20070519_detroit2
So if you want to do it for multiple URLs without having to parse, just do something like this:
url = '''https://archive.org/details/20070519_detroit2'''
spl = url.split('/')
spl[3] = 'embed'
embed = "/".join(spl)
print embed
EDIT
To get the embed link for the other links you provided in your edit, you need to look through the HTML of the page you are parsing until you find the link, then get the tag it's in, then the attribute.
for
'''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
just do
soup.find("iframe").get("src")
the iframe because the link is in the iframe tag, and the .get("src") because the link is the src attribute.
You can try the next one yourself, because you should learn how to do it if you want to be able to do it in the future :)
Good luck!
You can't parse a URL. BeautifulSoup is used to parse an HTML page. Retrieve the page first:
import urllib2
data = urllib2.urlopen("https://archive.org/details/20070519_detroit2")
html = data.read()
Then you can use find, and then take the src attribute:
soup = BeautifulSoup(html)
video = soup.find('video')
src = video['src']
This is a one-liner to get all the downloadable MP4 files on that page, in case you need it.
import bs4, urllib2
url = 'https://archive.org/details/20070519_detroit2'
soup = bs4.BeautifulSoup(urllib2.urlopen(url))
links = [a['href'] for a in soup.find_all(lambda tag: tag.name == "a" and '.mp4' in tag.get('href', ''))]
print links
Here is the output:
['/download/20070519_detroit2/20070519_detroit_jungleearth.mp4',
'/download/20070519_detroit2/20070519_detroit_sweetkissofdeath.mp4',
'/download/20070519_detroit2/20070519_detroit_goodman.mp4',
...
'/download/20070519_detroit2/20070519_detroit_wilson_512kb.mp4']
These are relative links; join them with the domain and you get absolute URLs.
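A small follow-up sketch (my addition, not part of the original answer): urljoin turns the relative /download/... paths into absolute URLs.
import urlparse

base = 'https://archive.org'
absolute = [urlparse.urljoin(base, link) for link in links]
print absolute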
