How to save a file from BeautifulSoup? - python

I am trying to scrape a website. So far I am able to scrape it, but I want to write the output to a text file and then delete some strings from it.
from urllib.request import urlopen
from bs4 import BeautifulSoup
delete = ['https://', 'http://', 'b\'http://', 'b\'https://']
url = urlopen('https://openphish.com/feed.txt')
bs = BeautifulSoup(url.read(), 'html.parser')
print(bs.encode('utf_8'))
The result is a lot of links, I can show a sample.
"b'https://certain-wrench.000webhostapp.com/auth/signin/details.html\nhttps://sweer-adherence.000webhostapp.com/auth/signin/details.html\n"
UPDATED
import requests
from bs4 import BeautifulSoup

url = "https://openphish.com/feed.txt"
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
with open('url.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(soup.prettify())

delete = ["</p>", "</body>", "</html>", "<body>", "<p>", "<html>", "www.",
          "https://", "http://", " ", " ", " "]
with open(r'C:\Users\v-morisv\Desktop\scripts\url.txt', 'r') as file:
    with open(r'C:\Users\v-morisv\Desktop\scripts\url1.txt', 'w') as file1:
        for line in file:
            for word in delete:
                line = line.replace(word, "")
            print(line, end='')
            file1.write(line)
The code above works, but there is a problem: I am not getting only the domain, I am getting everything after the forward slash as well, so it looks like this:
bofawebplus.webcindario.com/index4.html. I want to remove the "/" and everything after it.

This seems like a proper situation for using a regular expression.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = urlopen('https://openphish.com/feed.txt')
bs = BeautifulSoup(url.read(), 'html.parser')

domain_list = re.findall(re.compile('http[s]?://([^/]*)/'), bs.text)
print('\n'.join(domain_list))

There's no reason to use BeautifulSoup here; it is used for parsing HTML, but the URL being opened is plain text.
Here's a solution that should do what you need. It uses Python's urlparse as an easier and more reliable way of extracting the domain name.
It also uses a Python set to remove duplicate entries, since there were quite a few.
from urllib.request import urlopen
from urllib.parse import urlparse

feed_list = urlopen('https://openphish.com/feed.txt')

domains = set()
for line in feed_list:
    url = urlparse(line)
    domain = url.netloc.decode('utf-8')  # decode from utf-8 bytes to string
    domains.add(domain)  # keep all the domains in the set to remove duplicates

for domain in domains:
    print(domain)
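Since the original goal was to end up with a text file, the deduplicated set can then be written out; a minimal sketch reusing the domains set from above (url.txt is just an example filename):
# sketch: write the deduplicated domains to a text file, one per line
with open('url.txt', 'w', encoding='utf-8') as f_out:
    for domain in sorted(domains):
        f_out.write(domain + '\n')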

Related

Parse URL beautifulsoup

import requests
import csv
from bs4 import BeautifulSoup
import re

page = requests.get("https://www.google.com/search?q=cars")
soup = BeautifulSoup(page.content, "lxml")

links = soup.findAll("a")
with open('aaa.csv', 'wb') as myfile:
    for link in soup.find_all("a", href=re.compile("(?<=/url\?q=)(htt.*://.*)")):
        a = re.split(":(?=http)", link["href"].replace("/url?q=", ""))
        wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
        wr.writerow(a)
The output of this code is a CSV file with 28 saved URLs, however the URLs are not correct. For example this is a wrong URL:-
http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A
Instead it should be:-
http://www.imdb.com/title/tt0317219/
How can I remove the second part of each URL if it contains "&sa="?
The part of the URL starting from:-
"&sa=" should be removed, so that all URLs are saved like the second URL.
I am using python 2.7 and Ubuntu 16.04.
If the redundant part of the URL always starts with &, you can apply split() to each URL:
url = 'http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A'
url = url.split('&')[0]
print(url)
output:
http://www.imdb.com/title/tt0317219/
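An alternative (not from the answers here) is to split on the "&sa=" marker specifically, which leaves URLs whose path legitimately contains an ampersand untouched; a minimal sketch:
url = 'http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A'
print(url.split('&sa=')[0])  # -> http://www.imdb.com/title/tt0317219/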
Not the best way, but you could split one more time, adding one more line after a:
a=[a[0].split("&")[0]]
print(a)
Result:
['https://de.wikipedia.org/wiki/Cars_(Film)']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:I2SHYtLktRcJ']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Handlung']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Synchronisation']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Soundtrack']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Kritik']
['https://www.mytoys.de/disney-cars/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:9Ohx4TRS8KAJ']
['https://www.youtube.com/watch%3Fv%3DtNmo09Q3F8s']
['https://www.youtube.com/watch%3Fv%3DtNmo09Q3F8s']
['https://www.youtube.com/watch%3Fv%3DkLAnVd5y7M4']
['https://www.youtube.com/watch%3Fv%3DkLAnVd5y7M4']
['http://cars.disney.com/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:1BoR6M9fXwcJ']
['http://cars.disney.com/']
['http://cars.disney.com/']
['https://www.whichcar.com.au/car-style/12-cartoon-cars']
['https://www.youtube.com/watch%3Fv%3D6JSMAbeUS-4']
['http://filme.disney.de/cars-3-evolution']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:fO7ypFFDGk0J']
['http://www.4players.de/4players.php/spielinfonews/Allgemein/36859/2169193/Project_CARS_2-Zehn_Ferraris_erweitern_den_virtuellen_Fuhrpark.html']
['http://www.4players.de/4players.php/spielinfonews/Allgemein/36859/2169193/Project_CARS_2-Zehn_Ferraris_erweitern_den_virtuellen_Fuhrpark.html']
['http://www.play3.de/2017/08/02/project-cars-2-6/']
['http://www.imdb.com/title/tt0317219/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:-xdXy-yX2fMJ']
['http://www.carmagazine.co.uk/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:PRPbHf_kD9AJ']
['http://google.com/search%3Ftbm%3Disch%26q%3DCars']
['http://www.imdb.com/title/tt0317219/']
['https://de.wikipedia.org/wiki/Cars_(Film)']

How to get eBay feedbacks from URL using Python, BeautifulSoup, re

Does someone know how to get eBay feedbacks from the site using python3, beautifulsoup, re...?
I have this code, but it is not easy to find the feedbacks.
import urllib.request
import re
from bs4 import BeautifulSoup

fhand = urllib.request.urlopen('http://feedback.ebay.com/ws/eBayISAPI.dll?ViewFeedback2&userid=nana90store&iid=-1&de=off&items=25&searchInterval=30&which=positive&interval=30&_trkparms=positive_30')

for line in fhand:
    print(line.strip())
    f = open('feedbacks1.txt', 'a')
    f.write(str(line) + '\n')
    f.close()

file = open('feedbacks1.txt', 'r')
cleaned = open('cleaned.txt', 'w')
soup = BeautifulSoup(file)
page = soup.getText()
letters_only = re.sub("[^a-zA-Z]", " ", page)
cleaned.write(str(letters_only))
If you just care for the feedback text this might be what you are looking for:
import urllib.request
import re
from bs4 import BeautifulSoup

fhand = urllib.request.urlopen('http://feedback.ebay.com/ws/eBayISAPI.dll?ViewFeedback2&userid=nana90store&iid=-1&de=off&items=25&searchInterval=30&which=positive&interval=30&_trkparms=positive_30')
soup = BeautifulSoup(fhand.read(), 'html.parser')
table = soup.find(attrs={'class': 'FbOuterYukon'})
for tr in table.findAll('tr'):
    if not tr.get('class'):
        print(list(tr.children)[1].getText())
I first find the table that contains the feedback, then the rows that contain the feedback (no class attribute), and then parse the text of the relevant cell. This can also be adapted for similar needs.
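If the goal is a text file like the cleaned.txt from the question, the same loop can write the feedback out; a minimal sketch reusing the table object from above:
# sketch: write one feedback text per line to the question's cleaned.txt
with open('cleaned.txt', 'w', encoding='utf-8') as cleaned:
    for tr in table.findAll('tr'):
        if not tr.get('class'):
            cleaned.write(list(tr.children)[1].getText().strip() + '\n')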

Open links from txt file in python

I would like to ask for help with an RSS program. What I'm doing is collecting sites which contain information relevant to my project and then checking whether they have RSS feeds.
The links are stored in a txt file (one link on each line).
So I have a txt file full of base URLs that need to be checked for RSS.
I have found this code which would make my job much easier.
import requests
from bs4 import BeautifulSoup

def get_rss_feed(website_url):
    if website_url is None:
        print("URL should not be null")
    else:
        source_code = requests.get(website_url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.find_all("link", {"type": "application/rss+xml"}):
            href = link.get('href')
            print("RSS feed for " + website_url + "is -->" + str(href))

get_rss_feed("http://www.extremetech.com/")
But I would like to open my collected URLs from the txt file, rather than typing each one in manually.
So I have tried to extend the program with this:
from bs4 import BeautifulSoup, SoupStrainer

with open('test.txt', 'r') as f:
    for link in BeautifulSoup(f.read(), parse_only=SoupStrainer('a')):
        if link.has_attr('http'):
            print(link['http'])
But this returns an error saying that BeautifulSoup is not an HTTP client.
I have also extended it with this:
def open():
    f = open("file.txt")
    lines = f.readlines()
    return lines
But this gave me a list separated with ",".
I would be really thankful if someone would be able to help me.
Typically you'd do something like this:
with open('links.txt', 'r') as f:
    for line in f:
        get_rss_feed(line.strip())  # strip the trailing newline before passing the URL
Also, it's a bad idea to define a function with the name open unless you intend to replace the builtin function open.
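If you just want the URLs from the file as a clean list (without shadowing the builtin open), a minimal sketch, where read_urls is a hypothetical helper name:
def read_urls(path):
    # hypothetical helper: return one URL per line, skipping blanks and surrounding whitespace
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

for url in read_urls('test.txt'):
    get_rss_feed(url)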
I guess you can make it work by using urllib:
import urllib

f = open('test.txt', 'r')
# considering each url is on a new line...
while True:
    URL = f.readline()
    if not URL:
        break
    mycontent = urllib.urlopen(URL).read()
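Note that urllib.urlopen is Python 2; under Python 3 the equivalent call is urllib.request.urlopen. A minimal sketch of the same loop for Python 3:
import urllib.request

with open('test.txt', 'r') as f:
    for URL in f:
        URL = URL.strip()       # drop the trailing newline
        if URL:
            mycontent = urllib.request.urlopen(URL).read()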

Python3 scraper. Doesn't parse the xpath till the end

I'm using the lxml.html module.
from lxml import html

page = html.parse('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution')
# print(page.content)
unis = page.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')
print(len(unis))
with open('workfile.txt', 'w') as f:
    for uni in unis:
        f.write(uni + '\n')
The website right here (http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z) is full of universities.
The problem is that it only parses up to the letter 'H' (244 unis).
I can't understand why, since as far as I can see it parses all the HTML to the end.
I have also checked that 244 is not some limit on list size in Python 3.
That page simply isn't valid HTML; it's totally broken. But the following will do what you want. It uses the BeautifulSoup parser.
from lxml.html.soupparser import parse
import urllib.request

url = 'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution'
page = parse(urllib.request.urlopen(url))
unis = page.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')
See http://lxml.de/lxmlhtml.html#really-broken-pages for more info.
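To finish the job as in the question, the extracted names can then be written out the same way; a minimal sketch reusing unis from above (workfile.txt mirrors the question's filename):
print(len(unis))
with open('workfile.txt', 'w') as f:
    for uni in unis:
        f.write(uni + '\n')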
For web scraping I recommend using BeautifulSoup 4.
With bs4 this is easily done:
from bs4 import BeautifulSoup
import urllib.request

universities = []
result = urllib.request.urlopen('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z')
soup = BeautifulSoup(result.read(), 'html.parser')
table = soup.find_all(lambda tag: tag.name == 'table')
for t in table:
    rows = t.find_all(lambda tag: tag.name == 'tr')
    for r in rows:
        # there are also the A-Z headers -> check length
        # there are also empty headers -> check isspace()
        headers = r.find_all(lambda tag: tag.name == 'h3'
                             and tag.text.isspace() == False
                             and len(tag.text.strip()) > 2)
        for h in headers:
            universities.append(h.text)

Scraping in Python with BeautifulSoup

I've read quite a few posts here about this, but I'm very new to Python in general so I was hoping for some more info.
Essentially, I'm trying to write something that will pull word definitions from a site and write them to a file. I've been using BeautifulSoup, and I've made quite some progress, but here's my issue -
from __future__ import print_function
import requests
import urllib2, urllib
from BeautifulSoup import BeautifulSoup
wordlist = open('test.txt', 'a')
word = raw_input('Paste your word ')
url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query=%s' % word
# print url
html = urllib.urlopen(url).read()
# print html
soup = BeautifulSoup(html)
visible_text = soup.find('pre')(text=True)
print(visible_text, file=wordlist)
This seems to pull what I need, but it puts it in this format:
[u'passable\n adj 1: able to be passed or traversed or crossed; "the road is\n passable"
but I need it to be in plaintext. I've tried using a sanitizer (I was running it through bleach), but that didn't work. I've read some of the other answers here, but they don't explain HOW the code works, and I don't want to add something if I don't understand how it works.
Is there any way to just pull the plaintext?
edit: I ended up doing
from __future__ import print_function
import requests
import urllib2, urllib
from bs4 import BeautifulSoup
wordlist = open('test.txt', 'a')
word = raw_input('Paste your word ')
url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query=%s' % word
# print url
html = urllib.urlopen(url).read()
# print html
soup = BeautifulSoup(html)
visible_text = soup.find('pre')(text=True)[0]
print(visible_text, file=wordlist)
The code is already giving you plaintext; it just happens to have some characters encoded as entity references. In this case, special characters that form part of the XML/HTML syntax are encoded to prevent them from breaking the structure of the text.
To decode them, use the HTMLParser module:
import HTMLParser
h = HTMLParser.HTMLParser()
h.unescape('&quot;the road is passable&quot;')
>>> u'"the road is passable"'
