I have a folder that contains thousands of raw HTML files. I would like to extract all the href values from each page. What would be the fastest way to do that?
href="what_i_need_here"
import re

with open('file', 'r') as f:
    print(re.findall(r'href="(.+?)"', f.read()))
This is a guess at what might work, since you didn't provide a sample of the input. The regex used is href="(.+?)", which captures everything between the quotes non-greedily. I read the whole file with f.read() and search it in one pass. See if it works, or add examples of the text.
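Since you have a whole folder of files, here is a minimal sketch that applies the same regex to every file. The folder name html_pages and the *.html glob are assumptions; adjust them to match your data:

import re
from pathlib import Path

href_re = re.compile(r'href="(.+?)"')

all_hrefs = []
for path in Path('html_pages').glob('*.html'):  # assumed folder name and extension
    text = path.read_text(errors='ignore')      # skip undecodable bytes
    all_hrefs.extend(href_re.findall(text))

print(all_hrefs)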
Suppose I have a collection of random strings in a .txt file, where the data looks like this:
a4HpekGN78MaHcT0vcGA
R1gnLzvsvgvf2hU08jqO
CsWCv0s6OZGEgAXAuhgZ
1293gdxhIUpIbTQbBqJc
vbCAyd6IbVfIjgkzJXJt
and I want to insert each of these values individually into a URL,
e.g. https://example.com/?stringvalue=**a4HpekGN78MaHcT0vcGA**/action?complete
I am a beginner at Python and want to develop my skills by working on individual projects.
This is a good use case for Python's f-strings. I don't know what you plan to do with each URL afterwards, but to build it you can use the following code.
with open('file.txt', 'r') as f:
    for line in f:
        url = f"https://example.com/?stringvalue=**{line.strip()}**/action?complete"
        print(url)  # or whatever you want
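If you need the URLs afterwards rather than just printed, a small variation on the same code collects them into a list:

with open('file.txt', 'r') as f:
    urls = [f"https://example.com/?stringvalue=**{line.strip()}**/action?complete" for line in f]
print(urls)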
I'm working with a large .json file full of Twitter bios and would like to extract the screen_names. To prevent the search from also returning users mentioned in the bio text, it is important to extract only the first match of each line.
When I open the file in Notepad++ I can use the following regex to do exactly that:
(^.*?)\K"screen_name": "(\w+)"
Using the same pattern as part of an re.findall or re.search in Python does not result in any matches.
I'm totally new to both Python and regex so I'm fairly certain I'm not fully aware of all the necessary coding.
Many thanks in advance!
As noted by other users, Python and Notepad++ use different regex flavors (Python's re module does not support \K), so to achieve my desired result I used the following code:
import re

regex = re.compile(r'"screen_name":\s*"(\w+)"')
with open("followers.json", "r") as f, open("followers.txt", "w") as outp:
    for line in f:
        output = regex.search(line)  # first match on the line only
        if output:  # skip lines without a match
            outp.write(output.group(1) + "\n")
This reads your specified .json file line by line and saves the first match of each line to the file "followers.txt", skipping lines without a match.
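As an aside, if followers.json is actually valid JSON rather than one object per line, the standard json module sidesteps the regex entirely. A sketch, assuming the file holds a top-level list of user objects (adjust to your real structure):

import json

with open("followers.json", "r") as f:
    data = json.load(f)

with open("followers.txt", "w") as outp:
    for user in data:  # assumed: a top-level list of user objects
        outp.write(user["screen_name"] + "\n")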
I am trying to write a script to automate browsing to my most commonly visited websites. I have put the websites into a list and am trying to open them using the webbrowser module in Python. My code looks like the following at the moment:
import webbrowser

f = open("URLs", "r")
list = f.readline()
for line in list:
    webbrowser.open_new_tab(list)
This only reads the first line from my file "URLs" and opens it in the browser. Could anyone please help me understand how I can read through the entire file and open each URL in a different tab?
Other approaches that achieve the same result are also welcome.
You have two main problems.
The first problem is that you are using readline rather than readlines. readline returns a single line from the file, while readlines returns a list of all of the file's lines.
Take this file as an example:
# urls.txt
http://www.google.com
http://www.imdb.com
Also, get into the habit of using a context manager, as this will close the file for you once you have finished reading from it. Right now you are leaving your file open; for what you are doing there is no real danger, but it is a habit worth forming.
Here is the information from the documentation on files. There is a mention about best practices with handling files and using with.
The second problem is that when you iterate over list (which you should not use as a variable name, since it shadows the builtin list), you pass list itself into your webbrowser call. That is definitely not what you want; you want to pass the loop variable, line.
So, taking all of this into account, your final solution is:
import webbrowser

with open("urls.txt") as f:
    for url in f:
        webbrowser.open_new_tab(url.strip())
Note the strip call, which removes the trailing newline character from each line.
You're not reading the file properly; you're only reading the first line. Also, even if you were reading the file properly, you would still be trying to open list, which is incorrect: you should be opening line.
This should work for you:
import webbrowser

with open('file name goes here') as f:
    all_urls = f.read().split('\n')

for each_url in all_urls:
    if each_url:  # skip blank entries, e.g. from a trailing newline
        webbrowser.open_new_tab(each_url)
My answer assumes that you have one URL per line in the text file. If they are separated by spaces, simply change the line to all_urls = f.read().split(' '). If they're separated in some other way, just change the split accordingly.
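If you're not sure whether the separator is newlines, spaces, or a mix, split() with no argument splits on any run of whitespace and never produces empty strings:

with open('file name goes here') as f:
    all_urls = f.read().split()  # splits on any whitespace, no empty entries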
I basically have to make a program that takes a user-input web address and parses the HTML to find links, then stores all the links in another HTML file in a certain format. I only have access to built-in Python modules (Python 3). I'm able to get the HTML code from the link using urllib.request and put it into a string. How would I actually go about extracting links from this string and putting them into a string array? Also, would it be possible to identify links (such as an image link / mp3 link) so I can put them into different arrays (then I could categorize them when creating the output file)?
You can use the re module to search the HTML text for links; in particular, re.findall returns every match.
As for sorting by file type, that depends on whether the URL actually contains the extension (e.g. .mp3, .js, .jpeg, etc.).
You could do a simple for loop like this:
import re

html = getHTMLText()  # placeholder: however you obtained the page source
mp3s = []
other = []
for match in re.findall('<reexpression>', html):  # substitute your regex for the placeholder
    if match.endswith('.mp3'):
        mp3s.append(match)
    else:
        other.append(match)
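As a concrete (hedged) sketch, using the href pattern shown earlier in this thread in place of the placeholder, and sorting into buckets by extension (the extension tuples are just examples; extend them as needed):

import re

html = getHTMLText()  # placeholder: however you obtained the page source
buckets = {'images': [], 'audio': [], 'other': []}
for url in re.findall(r'href="(.+?)"', html):
    if url.endswith(('.jpg', '.jpeg', '.png', '.gif')):  # example image extensions
        buckets['images'].append(url)
    elif url.endswith(('.mp3', '.wav', '.ogg')):  # example audio extensions
        buckets['audio'].append(url)
    else:
        buckets['other'].append(url)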
Try the html.parser module from the standard library, or the re module; either will help you do this.
If you go the regex route, a pattern along these lines can match URLs:
r'https?://[^\s<>"]+|www\.[^\s<>"]+'
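Since you're restricted to built-in modules, the standard html.parser module is more robust than a regex here. A minimal sketch that collects every href from <a> tags (html_string is assumed to hold the page source you fetched with urllib.request):

from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

parser = LinkCollector()
parser.feed(html_string)  # html_string: the page source fetched earlier
print(parser.links)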
I have the following code (doop.py), which strips a .html file of all the 'nonsense' HTML markup, outputting only the human-readable text; e.g. it will take a file containing the following:
<html>
<body>
<a href="http://www.w3schools.com">
This is a link</a>
</body>
</html>
and give
$ ./doop.py
File name: htmlexample.html
This is a link
The next thing I need to do is add a function so that, if any of the arguments passed to the program represents a URL (a web address), the program will read the content of the designated webpage instead of a disk file. (For present purposes, it is sufficient for doop.py to recognize an argument beginning with http:// (in any mixture of letter cases) as a URL.)
I'm not sure where to start with this. I'm sure it would involve telling Python to open a URL, but how do I do that?
Thanks,
A
Apart from urllib2, which others have already mentioned, you can take a look at the Requests module by Kenneth Reitz. It has a more concise and expressive syntax than urllib2.
import requests
r = requests.get('https://api.github.com', auth=('user', 'pass'))
print(r.text)
As with most things pythonic: there is a library for that.
Here you need the urllib2 library
This allows you to open a URL and read from it like a file.
The code you would need would look something like this:
import urllib2

urlString = "http://www.my.url"
try:
    f = urllib2.urlopen(urlString)  # open the URL
    pageString = f.read()           # read the content
    f.close()                       # close the connection
    readableText = getReadableText(pageString)  # your existing doop.py function
    # continue using pageString as you wish
except urllib2.URLError:
    print("Bad URL")
Update:
(I don't have a Python interpreter to hand, so I can't test this code, but it should work!)
Opening the URL is the easy part, but first you need to extract the URLs from your HTML file. This is done using regular expressions (regexes), and unsurprisingly, Python has a library for that (re). I recommend that you read up on regexes; they are basically patterns against which you can match text.
So what you need to do is write a regex that matches URLs:
(http|ftp|https)://[\w_-]+(\.[\w_-]+)+([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
If you don't want to follow URLs to ftp resources, then remove "ftp|" from the beginning of the pattern. Now you can scan your input file for all character sequences that match this pattern:
import re

# open your input file and read its contents (the file name here is just an example)
with open("input.html") as f:
    input_file_str = f.read()

# compile the pattern matcher; groups are non-capturing so findall returns whole URLs
pattern = re.compile(r"(?:http|ftp|https)://[\w_-]+(?:\.[\w_-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?")

matches = pattern.findall(input_file_str)  # find all matches, stored in a list
for urlString in matches:  # go through the matched strings
    pass  # use the code above to load the url using the matched string!
That should do it
You can use third-party libraries like BeautifulSoup, or the standard-library HTMLParser. Here is a previous Stack Overflow question: html parser python
Other Links
http://unethicalblogger.com/2008/05/03/parsing-html-with-python.html
Standard Library
http://docs.python.org/library/htmlparser.html
Performance comparison
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
While parsing, you will need to look for the http prefix to identify links.
Rather than write your own HTML Parser / Scraper, I would personally recommend Beautiful Soup which you can use to load up your HTML, get the elements you want out of it, find all the links, and then use urllib to fetch the new links for you to parse and process further.
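To make that concrete, here is a minimal sketch with Beautiful Soup (this assumes the bs4 package is installed and uses example.com as a stand-in URL):

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen("http://www.example.com").read()
soup = BeautifulSoup(html, "html.parser")
# find_all("a", href=True) returns only anchors that actually have an href
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)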