New to Python, have a simple, situational question:
Trying to use BeautifulSoup to parse a series of pages.
from bs4 import BeautifulSoup
import urllib.request
BeautifulSoup(urllib.request.urlopen('http://bit.ly/xg7enD'))
Traceback ...
html.parser.HTMLParseError: expected name token at '<!=KN\x01...
Working on Windows 7 64-bit with Python 3.2.
Do I need Mechanize? (which would entail Python 2.X)
If that URL is correct, you're asking why an HTML parser throws an error parsing an MP3 file. I believe the answer to this to be self-evident...
If you were trying to download that MP3, you could do something like this:
import urllib2

BLOCK_SIZE = 16 * 1024

req = urllib2.urlopen("http://bit.ly/xg7enD")
# Make sure to write as a binary file
fp = open("someMP3.mp3", 'wb')
try:
    while True:
        data = req.read(BLOCK_SIZE)
        if not data:
            break
        fp.write(data)
finally:
    fp.close()
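Note that urllib2 exists only in Python 2; since you are on Python 3.2, here is a sketch of the same download using the standard library's urllib.request (block size and file name kept from the snippet above):

import urllib.request

BLOCK_SIZE = 16 * 1024

req = urllib.request.urlopen("http://bit.ly/xg7enD")
# Write as a binary file; read() returns bytes
with open("someMP3.mp3", 'wb') as fp:
    while True:
        data = req.read(BLOCK_SIZE)
        if not data:
            break
        fp.write(data)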
If you want to download a file in Python, you can also use urlretrieve (which lives in urllib.request on Python 3):

from urllib.request import urlretrieve

urlretrieve("http://bit.ly/xg7enD", "myfile.mp3")

It will save the file in the current working directory under the name "myfile.mp3". I am able to download all types of files through it.
Hope it may help!
Instead of urllib.request, I suggest using requests, and its get() function:
from requests import get
from bs4 import BeautifulSoup
soup = BeautifulSoup(
    get(url="http://www.google.com").content,
    'html.parser'
)
I have written the following code to open a local HTML file saved on my Desktop.
However, while running this code I get the following error:
I have no prior experience handling this in Python or BS4. I tried various solutions online but couldn't solve it.
Code:
import csv
from email import header
from fileinput import filename
from tokenize import Name
import requests
from bs4 import BeautifulSoup
url = "C:\ Users\ ASUS\ Desktop\ payment.html"
page = open(url)
# r = requests.get(url)
# htmlContent = r.content
soup = BeautifulSoup(page.read())
head_tag = soup.head
for child in head_tag.descendants:
    print(child)
Need help!
Thank you in advance.
It's a unicode escape error: "\U" in "C:\Users" starts a unicode escape sequence. Prefix the path with r (to produce a raw string):
url = r"C:\Users\ASUS\Desktop\payment.html"
When I attempt to parse a locally stored copy of a webpage, BeautifulSoup returns gibberish. I don't understand why, as I've never faced this problem when using the requests and bs4 modules together for scraping tasks.
Here's my code:
import requests
from bs4 import BeautifulSoup as BS
import os
url_2 = r'/Users/davidferreira/Documents/coding_2/ak_screen_scraping/bmra/'
os.chdir(url_2)
f = open('re_2.html')
soup = BS(url_2, "lxml")
f.close()
print soup
This code returns the following:
<html><body><p>/Users/davidferreira/Documents/coding_2/ak_screen_scraping/bmra/</p></body></html>
I wasn't able to find a similar problem online, so I've posted it here. Any help would be much appreciated.
You are passing the path (which you named url_2) to BeautifulSoup, so it treats that string itself as the web page text and returns it, neatly wrapped in some minimal HTML. That part is working as expected.
Try constructing the BS from the file's contents instead. See here how it works: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
soup = BS(f, "lxml")
should do...
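Putting it together, a minimal corrected version (keeping the lxml parser from the original code):

from bs4 import BeautifulSoup as BS
import os

path = r'/Users/davidferreira/Documents/coding_2/ak_screen_scraping/bmra/'
os.chdir(path)

with open('re_2.html') as f:
    soup = BS(f, "lxml")  # parse the file object, not the path string

print(soup.prettify())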
In Python 3, I want to load this_file, which is in JSON format.
Basically, I want to do something like [pseudocode]:
>>> read_from_url = urllib.some_method_open(this_file)
>>> my_dict = json.load(read_from_url)
>>> print(my_dict['some_key'])
some value
You were close:
import requests
import json
response = json.loads(requests.get("your_url").text)
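As a side note, requests can also decode JSON itself via Response.json(), which folds the json.loads step in:

import requests

my_dict = requests.get("your_url").json()  # same result in one step
print(my_dict['some_key'])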
Just use the json and requests modules:
import requests, json

content = requests.get("http://example.com")
data = json.loads(content.content)
(Don't name the result json, though; that would shadow the module.)
Or using the standard library:
from urllib.request import urlopen
import json
data = json.loads(urlopen(url).read().decode("utf-8"))
So you want to be able to reference specific values by key? If I understand what you want to do, this should help you get started. You will need the libraries urllib2, json, and bs4; just pip install them.
import urllib2
import json
from bs4 import BeautifulSoup
url = urllib2.urlopen("https://www.govtrack.us/data/congress/113/votes/2013/s11/data.json")
content = url.read()
soup = BeautifulSoup(content, "html.parser")
newDictionary = json.loads(str(soup))
I used a commonly used URL to practice with.
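For what it's worth, BeautifulSoup adds nothing when the payload is pure JSON; the same dictionary can be built without it:

import urllib2
import json

# fetch the JSON document and parse it directly
url = urllib2.urlopen("https://www.govtrack.us/data/congress/113/votes/2013/s11/data.json")
newDictionary = json.loads(url.read())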
I am trying to import a list of URLs and grab pn2 and main1. I can run it without importing the file, so I know it works, but I have no idea what to do with the import. Here is my most recent attempt, and below it a small portion of the URLs. Thanks in advance.
import urllib
import urllib.request
import csv
from bs4 import BeautifulSoup
csvfile = open("ecco1.csv")
csvfilelist = csvfile.read()
theurl = "csvfilelist"

soup = BeautifulSoup(theurl, "html.parser")
for row in csvfilelist:
    for pn in soup.findAll('td', {"class": "productText"}):
        pn2.append(pn.text)
    for main in soup.find_all('div', {"class": "breadcrumb"}):
        main1 = main.text

print(main1)
print('\n'.join(pn2))
URLs:
http://www.eccolink.com/products/productresults.aspx?catId=2458
http://www.eccolink.com/products/productresults.aspx?catId=2464
http://www.eccolink.com/products/productresults.aspx?catId=2435
http://www.eccolink.com/products/productresults.aspx?catId=2446
http://www.eccolink.com/products/productresults.aspx?catId=2463
From what I see, you are opening a CSV file and using BeautifulSoup to parse it.
That is not the way to do it.
BeautifulSoup parses HTML files, not CSV.
Looking at your code, it would be correct if you were passing HTML code to bs4.
from bs4 import BeautifulSoup
import requests

links = []
file = open('links.txt', 'w')  # open for writing
html = requests.get('http://www.example.com')
soup = BeautifulSoup(html.text, 'html.parser')  # parse the response body, not the Response object
for x in soup.find_all('a', {"class": "abc"}):
    links.append(x)
    file.write(str(x) + '\n')
file.close()
Above is a very basic implementation of how I would grab a target element from the HTML and write it to a file or append it to a list. Use requests rather than urllib; it is a better and more modern library.
If you want to read your input from a CSV file, the best option is the csv module's reader, as sketched below.
Hope that helps.
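A minimal sketch of the whole flow, assuming the CSV holds one URL per row in its first column (the file name and CSS classes are taken from the question):

import csv
import requests
from bs4 import BeautifulSoup

pn2 = []
with open("ecco1.csv", newline="") as csvfile:
    for row in csv.reader(csvfile):
        url = row[0]  # assumes one URL per row, in the first column
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        # collect part numbers and print each breadcrumb, as in the question
        for pn in soup.find_all('td', {"class": "productText"}):
            pn2.append(pn.text)
        for main in soup.find_all('div', {"class": "breadcrumb"}):
            print(main.text)

print('\n'.join(pn2))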
I am trying to get all the URLs on a website using Python. At the moment I am just copying the website's HTML into the Python program and then using code to extract all the URLs. Is there a way I could do this straight from the web without having to copy the entire HTML?
In Python 2, you can use urllib2.urlopen:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
In Python 3, you can use urllib.request.urlopen:
import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
    html = response.read()
If you have to perform more complicated tasks like authentication or passing parameters, I suggest having a look at the requests library; a small illustration follows.
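For instance, a sketch of passing query parameters and basic auth with requests (the parameters and credentials here are purely hypothetical):

import requests

# hypothetical query string and credentials, for illustration only
response = requests.get(
    'http://python.org/',
    params={'q': 'example'},
    auth=('user', 'password'),
)
html = response.text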
The most straightforward way would probably be urllib.urlopen if you're using Python 2, or urllib.request.urlopen if you're using Python 3 (you have to do import urllib or import urllib.request first, of course). That way you get a file-like object from which you can read (i.e. f.read()) the HTML document.
Example for python 2:
import urllib

f = urllib.urlopen("http://stackoverflow.com")
http_document = f.read()
f.close()
The good news is that you seem to have done the hard part, which is analyzing the HTML document for links.
You might want to use the bs4 (BeautifulSoup) library.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
You can download bs4 with the following command at the command line: pip install BeautifulSoup4
import urllib2
import urlparse
from bs4 import BeautifulSoup
url = "http://www.google.com"
response = urllib2.urlopen(url)
content = response.read()
soup = BeautifulSoup(content, "html.parser")
for link in soup.find_all('a', href=True):
    print urlparse.urljoin(url, link['href'])
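On Python 3, the same approach works with urllib.request and urllib.parse (a sketch, mirroring the snippet above):

from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://www.google.com"
content = urlopen(url).read()
soup = BeautifulSoup(content, "html.parser")
for link in soup.find_all('a', href=True):
    print(urljoin(url, link['href']))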
You can simply use the combination of requests and BeautifulSoup.
First make an HTTP request using requests to get the HTML content. You will get it as a Python string, which you can manipulate as you like.
Take the HTML content string and feed it to BeautifulSoup, which parses the DOM for you; then collect all the URLs, i.e. the href attributes of the <a> elements.
Here is an example of how to fetch all links from StackOverflow:
import requests
from bs4 import BeautifulSoup, SoupStrainer
response = requests.get('http://stackoverflow.com')
html_str = response.text
bs = BeautifulSoup(html_str, 'html.parser', parse_only=SoupStrainer('a'))
for a_element in bs:
    if a_element.has_attr('href'):
        print(a_element['href'])
Sample output:
/questions/tagged/facebook-javascript-sdk
/questions/31743507/facebook-app-request-dialog-keep-loading-on-mobile-after-fb-login-called
/users/3545752/user3545752
/questions/31743506/get-nuspec-file-for-existing-nuget-package
/questions/tagged/nuget
...