How to process URLs with Python - python

I have the following code (doop.py), which strips a .html file of all the 'nonsense' html script, outputting only the 'human-readable' text; eg. it will take a file containing the following:
<html>
<body>
<a href="http://www.w3schools.com">
This is a link</a>
</body>
</html>
and give
$ ./doop.py
File name: htmlexample.html
This is a link
The next thing I need to do is add a function that, if any of the html arguments within the file represent a URL (a web address), the program will read the content of the designated webpage instead of a disk file. (For present purposes, it is sufficient for doop.py to recognize an argument beginning with http:// (in any mixture of letter-cases) as a URL.)
I'm not sure where to start with this - I'm sure it would involve telling python to open a URL, but how do I do that?
Thanks,
A

Apart from urllib2 that others already mentioned, you can take a look at Requests module by Kenneth Reitz. It has a more concise and expressive syntax than urllib2.
import requests
r = requests.get('https://api.github.com', auth=('user', 'pass'))
r.text

As with most things pythonic: there is a library for that.
Here you need the urllib2 library
This allows you to open a url like a file, and read and writ from it like a file.
The code you would need would look something like this:
import urllib2
urlString = "http://www.my.url"
try:
f = urllib2.urlopen(urlString) #open url
pageString = f.read() #read content
f.close() #close url
readableText = getReadableText(pageString)
#continue using the pageString as you wish
except IOException:
print("Bad URL")
Update:
(I don't have a python interpreter to hand, so can't test that this code will work or not, but it should!!)
Opening the URL is the easy part, but first you need to extract the URLs from your html file. This is done using regular expressions (regex's), and unsurprisingly, python has a library for that (re). I recommend that you read up on both regex's, but they are basically a patter against which you can match text.
So what you need to do is write a regex that matches URLs:
(http|ftp|https)://[\w-_]+(.[\w-_]+)+([\w-.,#?^=%&:/~+#]*[\w-\#?^=%&/~+#])?
If you don't want to follow urls to ftp resources, then remove "ftp|" from the beginning of the pattern. Now you can scan your input file for all character sequences that match this pattern:
import re
input_file_str = #open your input file and read its contents
pattern = re.compile("(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?") #compile the pattern matcher
matches = pattern.findall(input_file_str) #find all matches, storing them in an interator
for match in matches : #go through iteratr
urlString = match #get the string that matched the pattern
#use the code above to load the url using matched string!
That should do it

You can use third part libraries like beautifulsoup or Standard HTML Parser . Here is a previous stack overflow question. html parser python
Other Links
http://unethicalblogger.com/2008/05/03/parsing-html-with-python.html
Standard Library
http://docs.python.org/library/htmlparser.html
Performance comparision
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
While parsing one needs to parse for http

Rather than write your own HTML Parser / Scraper, I would personally recommend Beautiful Soup which you can use to load up your HTML, get the elements you want out of it, find all the links, and then use urllib to fetch the new links for you to parse and process further.

Related

how should I automation of work with Python

I'm very beginner to python but I know intermediate JavaScript. I have one project to complete this is like a scraper but I want to automate some work for me.
1 ) I have a Excel with more than 1000 data and it also has URLs. I want to code that python visit every URL from that Excel sheet and search first page for Some predefine search texts (List of Texts)
2 ) If my code find any of the Text from that web page then it should return true else false
I want any idea or logic to do this kind of process. Any help will make my head pain less 😅
it is very heavy work which is not very good idea to do in JavaScript that's why I want to do it in Python
An easy way to do this would be to get the requests module. Then learn how to use the csv module which can read spreadsheets such as excel spreadsheets. Then here is what you want to do
import csv
import requests
URLS = []
def GetUrlFromCSVFile():
global URLS
#Figure out how to get link from csv file then append them to the URLS list
for url in URLS:
r = requests.get(URL, headers=#You Should Probs get some headers)
if whatever_keyword_u_looking_for in r.text:
print("Found")
else:
print("Not here")
I suggest the following:
Read about the csv library - to read the content of an excel file.
Read about the requests library - to get the page's content from its URL.
Read about regular expressions in the re library.

Python found URL is invalid

Hi there I have following Problem:
I extracted a list of URL's from a .txt file with Python using this:
import re
with open('html.txt') as f:
urls = f.read()
links = re.findall('"((http)s?://.*?)"', urls)
for url in links:
print(url[0])
And the Output contains for some files following:
https://url.com/?download_file=259&order=wc_order_xDxDxD&email=testmail%40gmail.com&key=1234-1234-1234-1234-8c368abd9c22
PROBLEM IS:
as you see it printed out "#038;" I'm thinking that translates into "&" but there is already a "&" infront of that and if I follow the Link its invalid.
However if I delete all "#038;" the Link works just fine.
How can I print them so that I dont have "#038;" inside and the Link works?
Thanks so much
Looks like a url encoding issue.
Since, you are only printing, you can use string replace function.
for url in links:
url[0].replace("#038","")
You are almost there &#038 = &
HTML ACIIcharacters

Python extract all hrefs from a .txt file

I have a folder that contains thousands of raw html code. I would like to extract all the href from each page. What would be the fastest way to do that?
href="what_i_need_here"
import re
with open('file', 'r') as f:
print (re.findall(r"href=\"(.+?)\"\n", ''.join(f.readlines())))
This would be what I guess might work, but there's no way to tell since you didn't provide any information. The regex used is href="(.+?)"\n. I read the content using f.readlines(), then combined it into a line to search using ''.join. See if it works, or add examples of the text.

Extracting links from HTML in Python

i have to basically make a program that take a user-input web address and parses html to find links . then stores all the links in another HTML file in a certain format. i only have access to builtin python modules (python 3) . im able to get the HTML code from the link using urllib.request and put that into a string. how would i actually go about extracting links from this string and putting them into a string array? also would it be possible to identify links (such as an image link / mp3 link) so i can put them into different arrays (then i could catagorize them when im creating the output file)
You can use the re module to parse the HTML text for links. Particularly the findall method can return every match.
As far as sorting by file type that depends on whether the url actually contains the extension (i.e. .mp3, .js, .jpeg, etc...)
You could do a simple for loop like such:
import re
html = getHTMLText()
mp3s = []
other = []
for match in re.findall('<reexpression>',html):
if match.endswith('.mp3'):
mp3s.append(match)
else:
other.append(match)
try to use HTML.Parser library or re library
they will help you to do that
and i think you can use regex to do it
r'http[s]?://[^\s<>"]+|www.[^\s<>"]+

Writing modified Beautiful Soup tree to file, while maintaining original XML formatting

We have an XML document that has a tag we wish to alter:
...<version>1.0</version>...
It's buried deep in the XML file, but we're successfully able to use Beautiful Soup to replace its contents with a command-line parameter.
The problem is that after modifying the tree, we need to write back to the file we read it from. But, we want to maintain the original formatting of the document. When I use:
fileForWriting = open(myXmlFile, 'w')
fileForWriting.write(soup.prettify())
The prettify() call breaks the formatting, and I end up with:
<version>
1.0
</version>
Is there any way to maintain the original formatting of the XML document, while replacing that single tag text?
Note: Using simply:
fileForWriting.write(str(soup))
Keeps the text and tags on the same line, but eliminates the indents and extra newlines that had been human-added for readability. Close, but no cigar.
By request, the entire script:
from BeautifulSoup import BeautifulSoup as bs
import sys
xmlFile = sys.argv[1:][0]
version = sys.argv[1:][1]
fileForReading = open(xmlFile, 'r')
xmlString = fileForReading.read()
fileForReading.close()
soup = bs(xmlString)
soup.findAll('version')[1].contents[0].replaceWith(version)
fileForWriting = open(xmlFile, 'w')
fileForWriting.write(str(soup))
fileForWriting.close()
The script is then run using:
python myscript.py someFile.xml 1.2
And if you use xml.elementtree, the tree.write(file) method replaces the CRLF by LF only, which also creates issues when trying to import the XML file into i.e. PyXB.
The solution I found is to use ElementTree just to find what I have to replace. Then I do source_XML = 'new value'.join(source_XML.split('what you need to replace)) Finally a file.write(source_XML)
it's not nice, but it solves the issue. However, I do not mind about the indentations, so on this I can't really say. I would only use pprint.pprint() whenever I need to print it.

Categories