I have looked at several previous questions on this topic.
I am trying to consolidate news and notes from websites.
Reputable news sites allow users to post comments and views.
I am trying to get only the news content, without the user comments. I tried working with BeautifulSoup and html2text, but the user comments keep ending up in the text file. I have also tried writing a custom program, with no more progress than with the two libraries above.
Can anybody provide a clue on how to proceed?
The code:
import urllib2
from bs4 import BeautifulSoup

URL = 'http://www.example.com'
print 'Following: ', URL
print "Loading..."
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
identify_as = {'User-Agent': user_agent}
print "Reading URL: " + str(URL)

def process(URL, identify_as):
    req = urllib2.Request(URL, data=None, headers=identify_as)
    response = urllib2.urlopen(req)
    _BSobj = BeautifulSoup(response).prettify(encoding='utf-8')
    return _BSobj  # returns the prettified markup as a string, not a BeautifulSoup object

print 'Processing URL...'
new_string = process(URL, identify_as).split()

print 'Building requested text'
tagB = ['<title>', '<p>']
tagC = ['</title>', '</p>']
reqText = []

for num in xrange(len(new_string)):
    buffText = []  # initialize and reset
    if new_string[num] in tagB:
        tag = tagB.index(new_string[num])
        # collect tokens until the matching closing tag
        while new_string[num] != tagC[tag]:
            buffText.append(new_string[num])
            num += 1
        reqText.extend(buffText)

reqText = ''.join(reqText)
fileID = open('reqText.txt', 'w')
fileID.write(reqText)
fileID.close()
Here's a quick example I wrote using urllib which gets the contents of a page to a file:
import urllib.request

myurl = "http://www.mysite.com"
sock = urllib.request.urlopen(myurl)
# decode the raw bytes so we write readable text rather than a bytes repr
pagedata = sock.read().decode('utf-8', errors='replace')
sock.close()

file = open("output.txt", "w")
file.write(pagedata)
file.close()
Then with a lot of string formatting you should be able to extract the parts of the html you want. This gives you something to get started from.
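If you would rather not do all that string slicing by hand, a minimal BeautifulSoup sketch along these lines may be enough; the comments-container tag and class name below are assumptions you would have to adapt to the actual news site's markup:
# sketch: keep the title and paragraph text, drop an assumed comments container
from bs4 import BeautifulSoup

soup = BeautifulSoup(pagedata, "html.parser")  # pagedata from the snippet above

# remove a hypothetical comments section before extracting text
for comments in soup.find_all("div", class_="comments"):
    comments.decompose()

title = soup.title.get_text(strip=True) if soup.title else ""
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

with open("output.txt", "w", encoding="utf-8") as out:
    out.write(title + "\n\n" + "\n".join(paragraphs))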
How can I get a website's cookies from the browser using Python? The code currently being used is:
get_title = lambda html: re.findall('<title>(.*?)</title>', html, flags=re.DOTALL)[0].strip()
url = config.base_url
public_html = urllib2.urlopen(url).read()
print get_title(public_html)
cj = browsercookie.firefox()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_html = opener.open(url).read()
print get_title(login_html)
This code comes after the application has logged in.
config.base_url = "https://10.194.13.71"
It is giving this error:

File "/root/Desktop/mysonicwallnew/testservice.py", line 26, in test_service
    public_html = urllib2.urlopen(url).read()
CertificateError: hostname '10.194.31.71' doesn't match either of 'www.abc.com', 'abc.com'
How do I fix this?
This works for me -
import requests
import browsercookie
import re
cj = browsercookie.chrome()
r = requests.get('http://stackoverflow.com', cookies=cj)
get_title = lambda html: re.findall('<title>(.*?)</title>', html, flags=re.DOTALL)[0].strip()
print r.content
print get_title(r.content)
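Since the question's URL is an HTTPS address by IP with a certificate issued for a different hostname, requests will also refuse the connection by default. A hedged workaround, for a trusted internal host only, is to point verify at the server's CA certificate, or to disable verification while testing:
# sketch: config.base_url is the HTTPS IP from the question
r = requests.get(config.base_url, cookies=cj, verify='/path/to/ca.pem')  # preferred: trust the server's own CA
# r = requests.get(config.base_url, cookies=cj, verify=False)            # quick test only, skips certificate checks
print get_title(r.content)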
Try updating the question with the error you are facing, or the exact thing you are looking for from the cookie, to get more specific answers.
I have gone through the relevant questions, and I did not find an answer to this one:
I want to open a URL and parse its contents.
When I do that on, say, google.com, there is no problem.
When I do it on a URL that does not have a file name, I often get told that I read an empty string.
See the code below as an example:
import urllib.request
#urls = ["http://www.google.com", "http://www.whoscored.com", "http://www.whoscored.com/LiveScores"]
#urls = ["http://www.whoscored.com", "http://www.whoscored.com/LiveScores"]
urls = ["http://www.whoscored.com/LiveScores"]
print("Type of urls: {0}.".format(str(type(urls))))
for url in urls:
    print("\n\n\n\n---------------------------------------------\n\nUrl is: {0}.".format(url))
    sock = urllib.request.urlopen(url)
    print("I have this sock: {0}.".format(sock))
    htmlSource = sock.read()
    print("I read the source code...")
    htmlSourceLine = sock.readlines()
    sock.close()
    htmlSourceString = str(htmlSource)
    print("\n\nType of htmlSourceString: " + str(type(htmlSourceString)))
    htmlSourceString = htmlSourceString.replace(">", ">\n")
    htmlSourceString = htmlSourceString.replace("\\r\\n", "\n")
    print(htmlSourceString)
    print("\n\nI am done with this url: {0}.".format(url))
I do not know why I sometimes get that empty string back for URLs that don't have a file name, such as "www.whoscored.com/LiveScores" in the example, whereas "google.com" or "www.whoscored.com" seem to work all the time.
I hope that my formulation is understandable...
It looks like the site is coded to explicitly reject requests from non-browser clients. You'll have to mimic a real browser session, ensuring that cookies are passed back and forth as required. The third-party requests library can help you with these tasks, but the bottom line is that you are going to have to find out more about how that site operates.
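As a rough sketch of the kind of session handling meant here (the User-Agent string is a placeholder, and the site may well demand more than this):
# sketch: a persistent session that sends a browser-like User-Agent and
# keeps whatever cookies the site sets between requests
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # placeholder browser string

response = session.get("http://www.whoscored.com/LiveScores")
print(response.status_code)
print(response.text[:500])  # start of the page, if the site accepted the request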
Your code worked intermittently for me but using requests and sending the user-agent worked perfectly:
import requests

headers = {
    'User-agent': 'Mozilla/5.0,(X11; U; Linux i686; en-GB; rv:1.9.0.1): Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'}

urls = ["http://www.whoscored.com/LiveScores"]

print("Type of urls: {0}.".format(str(type(urls))))

for url in urls:
    print("\n\n\n\n---------------------------------------------\n\nUrl is: {0}.".format(url))
    sock = requests.get(url, headers=headers)
    print("I have this sock: {0}.".format(sock))
    htmlSource = sock.content
    print("I read the source code...")
    htmlSourceString = str(htmlSource)
    print("\n\nType of htmlSourceString: " + str(type(htmlSourceString)))
    htmlSourceString = htmlSourceString.replace(">", ">\n")
    htmlSourceString = htmlSourceString.replace("\\r\\n", "\n")
    print(htmlSourceString)
    print("\n\nI am done with this url: {0}.".format(url))
I am trying to automatically download PDFs from URLs like this to make a library of UN resolutions.
If I use Beautiful Soup or mechanize to open that URL, I get "Your browser does not support frames", and I get the same thing if I use the "copy as cURL" feature in Chrome dev tools.
The standard advice for the "Your browser does not support frames" error when using mechanize or Beautiful Soup is to follow the source of each individual frame and load that frame. But if I do so, I get an error message saying that the page is not authorized.
How can I proceed? I guess I could try this in Zombie or Phantom, but I would prefer not to use those tools as I am not that familiar with them.
Okay, this was an interesting task to do with requests and BeautifulSoup.
There is a set of underlying calls to un.org and daccess-ods.un.org that are important and set relevant cookies. This is why you need to maintain a requests.Session() and visit several URLs before getting access to the PDF.
Here's the complete code:
import re
from urlparse import urljoin
from bs4 import BeautifulSoup
import requests
BASE_URL = 'http://www.un.org/en/ga/search/'
URL = "http://www.un.org/en/ga/search/view_doc.asp?symbol=A/RES/68/278"
BASE_ACCESS_URL = 'http://daccess-ods.un.org'
# start session
session = requests.Session()
response = session.get(URL, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})
# get frame links
soup = BeautifulSoup(response.text)
frames = soup.find_all('frame')
header_link, document_link = [urljoin(BASE_URL, frame.get('src')) for frame in frames]
# get header
session.get(header_link, headers={'Referer': URL})
# get document html url
response = session.get(document_link, headers={'Referer': URL})
soup = BeautifulSoup(response.text)
content = soup.find('meta', content=re.compile('URL='))['content']
document_html_link = re.search('URL=(.*)', content).group(1)
document_html_link = urljoin(BASE_ACCESS_URL, document_html_link)
# follow html link and get the pdf link
response = session.get(document_html_link)
soup = BeautifulSoup(response.text)
# get the real document link
content = soup.find('meta', content=re.compile('URL='))['content']
document_link = re.search('URL=(.*)', content).group(1)
document_link = urljoin(BASE_ACCESS_URL, document_link)
print document_link
# follow the frame link with login and password first - would set the important cookie
auth_link = soup.find('frame', {'name': 'footer'})['src']
session.get(auth_link)
# download file
with open('document.pdf', 'wb') as handle:
    response = session.get(document_link, stream=True)
    for block in response.iter_content(1024):
        if not block:
            break
        handle.write(block)
You should probably extract separate blocks of code into functions to make it more readable and reusable.
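For example, the repeated "find the meta refresh and follow its URL=" step could live in a small helper; this is only a sketch of that refactor, built from the code above:
# sketch: wrap the repeated meta-refresh handling from the script above
def follow_meta_refresh(session, url, base_url=BASE_ACCESS_URL):
    # GET a page, read its meta refresh 'URL=' target and return it as an absolute URL
    response = session.get(url)
    soup = BeautifulSoup(response.text)
    content = soup.find('meta', content=re.compile('URL='))['content']
    target = re.search('URL=(.*)', content).group(1)
    return urljoin(base_url, target)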
FYI, all of this could be done more easily through a real browser with the help of selenium or Ghost.py.
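For instance, a rough selenium sketch (assuming Firefox is available locally) lets the browser take care of the frames and cookies itself:
# rough sketch: let a real browser deal with the frames and cookies
from selenium import webdriver

driver = webdriver.Firefox()
driver.get(URL)  # the view_doc.asp URL from above

# list the frames the browser loaded; the document frame's src can then be followed
for frame in driver.find_elements_by_tag_name('frame'):
    print frame.get_attribute('src')

driver.quit()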
Hope that helps.
I have seen questions like this asked many, many times, but none of the answers are helpful.
I'm trying to submit data to a form on the web. I've tried requests and urllib, and neither has worked.
For example, here is code that should search for the [python] tag on SO:
import urllib
import urllib2
url = 'http://stackoverflow.com/'
# Prepare the data
values = {'q' : '[python]'}
data = urllib.urlencode(values)
# Send HTTP POST request
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
html = response.read()
# Print the result
print html
Yet when I run it, I get the HTML source of the home page.
here is an example of using requests:
import requests
data= {
'q': '[python]'
}
r = requests.get('http://stackoverflow.com', data=data)
print r.text
Same result! I don't understand why these methods aren't working. I've tried them on various sites with no success, so if anyone has successfully done this, please show me how!
Thanks so much!
If you want to pass q as a parameter in the URL using requests, use the params argument, not data (see Passing Parameters In URLs):
r = requests.get('http://stackoverflow.com', params=data)
This will request https://stackoverflow.com/?q=%5Bpython%5D, which isn't what you are looking for.
You really want to POST to a form. Try this:
r = requests.post('https://stackoverflow.com/search', data=data)
This is essentially the same as GET-ting https://stackoverflow.com/questions/tagged/python, but I think you'll get the idea from this.
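As a quick sanity check on what that POST actually returned (just a sketch; the regex only pulls out the page title, not the individual results):
import re
import requests

data = {'q': '[python]'}
r = requests.post('https://stackoverflow.com/search', data=data)

print r.status_code   # 200 if the search page answered
print r.url           # the URL you actually ended up on
print re.search('<title>(.*?)</title>', r.text, re.DOTALL).group(1).strip()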
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
This makes a POST request with the data specified in values. We need urllib to encode the data and urllib2 to send the request.
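Adapted to the original Stack Overflow search example, the same pattern would look roughly like this (a sketch, simply swapping in the search endpoint and the [python] query from the question):
# same urllib/urllib2 pattern, pointed at the search form from the question
import urllib
import urllib2

data = urllib.urlencode({'q': '[python]'})
req = urllib2.Request('https://stackoverflow.com/search', data)
response = urllib2.urlopen(req)
print response.read()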
The mechanize library for Python is also great, and it even lets you submit forms. You can use the following code to create a browser object and make requests.
import mechanize,re
br = mechanize.Browser()
br.set_handle_robots(False) # ignore robots
br.set_handle_refresh(False) # can sometimes hang without this
br.addheaders = [('User-agent', 'Firefox')]
br.open( "http://google.com" )
br.select_form( 'f' )
br.form[ 'q' ] = 'foo'
br.submit()
resp = None
for link in br.links():
    siteMatch = re.compile( 'www.foofighters.com' ).search( link.url )
    if siteMatch:
        resp = br.follow_link( link )
        break
content = resp.get_data()
print content
I'm having a very tough time searching Google Image Search with Python. I need to do it using only standard Python libraries (so urllib, urllib2, json, ...).
Can somebody please help? Assume the image is jpeg.jpg and is in the same folder I'm running Python from.
I've tried a hundred different code versions, using headers, user-agent, base64 encoding, different URLs (images.google.com, http://images.google.com/searchbyimage?hl=en&biw=1060&bih=766&gbv=2&site=search&image_url={{URL To your image}}&sa=X&ei=H6RaTtb5JcTeiALlmPi2CQ&ved=0CDsQ9Q8, etc.).
Nothing works; it's always an error: 404, 401, or broken pipe :(
Please show me some Python script that will actually search Google Images with my own image as the search data ('jpeg.jpg' stored on my computer/device).
Thank you to whomever can solve this,
Dave :)
I use the following code in Python to search for Google images and download the images to my computer:
import os
import sys
import time
from urllib import FancyURLopener
import urllib2
import simplejson
# Define search term
searchTerm = "hello world"
# Replace spaces ' ' in search term for '%20' in order to comply with request
searchTerm = searchTerm.replace(' ','%20')
# Start FancyURLopener with defined version
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

myopener = MyOpener()

# Set count to 0
count = 0

for i in range(0, 10):
    # Notice that the start changes for each iteration in order to request a new set of images for each loop
    url = ('https://ajax.googleapis.com/ajax/services/search/images?' + 'v=1.0&q='+searchTerm+'&start='+str(i*4)+'&userip=MyIP')
    print url
    request = urllib2.Request(url, None, {'Referer': 'testing'})
    response = urllib2.urlopen(request)

    # Get results using JSON
    results = simplejson.load(response)
    data = results['responseData']
    dataInfo = data['results']

    # Iterate for each result and get unescaped url
    for myUrl in dataInfo:
        count = count + 1
        print myUrl['unescapedUrl']
        myopener.retrieve(myUrl['unescapedUrl'], str(count)+'.jpg')

    # Sleep for one second to prevent IP blocking from Google
    time.sleep(1)
You can also find very useful information here.
The Google Image Search API is deprecated; here we use a Google search to download the images, using regex and Beautiful Soup.
from bs4 import BeautifulSoup
import requests
import re
import urllib2
import os
def get_soup(url, header):
    return BeautifulSoup(urllib2.urlopen(urllib2.Request(url, headers=header)))

image_type = "Action"
# you can change the query for the image here
query = "Terminator 3 Movie"
query = query.split()
query = '+'.join(query)
url = "https://www.google.co.in/search?es_sm=122&source=lnms&tbm=isch&sa=X&ei=4r_cVID3NYayoQTb4ICQBA&ved=0CAgQ_AUoAQ&biw=1242&bih=619&q=" + query
print url

header = {'User-Agent': 'Mozilla/5.0'}
soup = get_soup(url, header)

images = [a['src'] for a in soup.find_all("img", {"src": re.compile("gstatic.com")})]
#print images

for img in images:
    raw_img = urllib2.urlopen(img).read()
    # add the directory for your image here
    DIR = "C:\Users\hp\Pictures\\valentines\\"
    cntr = len([i for i in os.listdir(DIR) if image_type in i]) + 1
    print cntr
    f = open(DIR + image_type + "_" + str(cntr) + ".jpg", 'wb')
    f.write(raw_img)
    f.close()