Scraping text from multiple web pages in Python - python

I've been tasked to scrape all the text off of any webpage a certain client of ours hosts. I've managed to write a script that will scrape the text off a single webpage, and you can manually replace the URL in the code each time you want to scrape a different webpage. But obviously this is very inefficient. Ideally, I could have Python connect to some list that contains all the URLs I need, iterate through the list, and print all the scraped text into a single CSV.
I've tried to write a "test" version of this code by creating a list of 2 URLs and trying to get my code to scrape both. However, as you can see, my code only keeps the text from the last URL in the list and does not hold onto the first page it scraped. I think this is due to a deficiency in my print statement, since it will always write over itself. Is there a way to have everything I scraped held somewhere until the loop has gone through the entire list, AND then print everything?
Feel free to totally dismantle my code. I know nothing of computer languages. I just keep getting assigned these tasks and use Google to do my best.
import urllib.request
import re
from bs4 import BeautifulSoup
data_file_name = 'C:\\Users\\confusedanalyst\\Desktop\\python_test.csv'
urlTable = ['url1','url2']
def extractText(string):
    page = urllib.request.urlopen(string)
    soup = BeautifulSoup(page, 'html.parser')
    ##Extracts all paragraph and header variables from URL as GroupObjects
    text = soup.find_all("p")
    headers1 = soup.find_all("h1")
    headers2 = soup.find_all("h2")
    headers3 = soup.find_all("h3")
    ##Forces GroupObjects into str
    text = str(text)
    headers1 = str(headers1)
    headers2 = str(headers2)
    headers3 = str(headers3)
    ##Strips HTML tags and brackets from extracted strings
    text = text.strip('[')
    text = text.strip(']')
    text = re.sub('<[^<]+?>', '', text)
    headers1 = headers1.strip('[')
    headers1 = headers1.strip(']')
    headers1 = re.sub('<[^<]+?>', '', headers1)
    headers2 = headers2.strip('[')
    headers2 = headers2.strip(']')
    headers2 = re.sub('<[^<]+?>', '', headers2)
    headers3 = headers3.strip('[')
    headers3 = headers3.strip(']')
    headers3 = re.sub('<[^<]+?>', '', headers3)
    print_to_file = open (data_file_name, 'w' , encoding = 'utf')
    print_to_file.write(text + headers1 + headers2 + headers3)
    print_to_file.close()
for i in urlTable:
    extractText (i)

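Try this. 'w' truncates the file and opens it with the pointer at the beginning, so every call to extractText overwrites what the previous call wrote. You want the pointer at the end of the file, which is what append mode ('a') gives you: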
print_to_file = open (data_file_name, 'a' , encoding = 'utf')
Here are all of the different read and write modes, for future reference:
The argument mode points to a string beginning with one of the following sequences (additional characters may follow these sequences):
'r'   Open text file for reading. The stream is positioned at the beginning of the file.
'r+'  Open for reading and writing. The stream is positioned at the beginning of the file.
'w'   Truncate file to zero length or create text file for writing. The stream is positioned at the beginning of the file.
'w+'  Open for reading and writing. The file is created if it does not exist, otherwise it is truncated. The stream is positioned at the beginning of the file.
'a'   Open for writing. The file is created if it does not exist. The stream is positioned at the end of the file. Subsequent writes to the file will always end up at the then current end of file, irrespective of any intervening fseek(3) or similar.
'a+'  Open for reading and writing. The file is created if it does not exist. The stream is positioned at the end of the file. Subsequent writes to the file will always end up at the then current end of file, irrespective of any intervening fseek(3) or similar.
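A minimal sketch of the difference between 'w' and 'a', followed by the alternative the question asks about: holding everything in a list and writing the CSV once at the end. It uses only the standard library and a temporary directory. The names scrape_all and fake_extract, the 'url'/'text' column headers and the file names are assumptions made for this illustration; fake_extract stands in for a real scraping function such as extractText so that the sketch runs without network access.

```python
import csv
import os
import tempfile

work_dir = tempfile.mkdtemp()

# Part 1: 'w' truncates the file on every open, 'a' keeps what is already there.
w_path = os.path.join(work_dir, 'w_mode.txt')
a_path = os.path.join(work_dir, 'a_mode.txt')
for page_text in ['first page', 'second page']:
    with open(w_path, 'w', encoding='utf-8') as f:
        f.write(page_text + '\n')
    with open(a_path, 'a', encoding='utf-8') as f:
        f.write(page_text + '\n')

with open(w_path, encoding='utf-8') as f:
    w_result = f.read()   # only the last page survives
with open(a_path, encoding='utf-8') as f:
    a_result = f.read()   # both pages are present


# Part 2: collect every result first, then write the CSV in one go.
def scrape_all(urls, extract_text, out_path):
    """Return one (url, text) row per URL and write them all to out_path."""
    rows = [(url, extract_text(url)) for url in urls]
    # newline='' is recommended by the csv module for files given to csv.writer
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['url', 'text'])
        writer.writerows(rows)
    return rows


def fake_extract(url):
    # Hypothetical stand-in for a function that downloads and cleans a page.
    return 'text scraped from ' + url


csv_path = os.path.join(work_dir, 'python_test.csv')
scrape_all(['url1', 'url2'], fake_extract, csv_path)
with open(csv_path, newline='', encoding='utf-8') as f:
    written = list(csv.reader(f))
print(w_result, a_result, written, sep='\n')
```

With the second approach the file is opened exactly once, so the choice between 'w' and 'a' no longer matters, and a rerun of the script does not keep appending to old output.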

Related

PyQt5 QWebEngineView - Make a Markdown from a file - Python

I have this method, which reads a file and puts its content into a plain-text area.
def show_open_dialog():
    global file_path
    if not save_if_modified():
        return
    file_name, _ = QFileDialog.getOpenFileName(
        window_area,
        'Open file...',
        os.getcwd(),
        'Text files (*.txt *.py)'
    )
    if file_name:
        with open(file_name, 'r') as f:
            # Print content into text area.
            text_area.setPlainText(f.read())
        file_path = file_name
When this method is called, it opens a dialog where I can choose a file and load it, like Windows Notepad, and it works fine. Now what I want to do is render the content of that file as Markdown, that is, convert it to HTML.
I have already created the QWebEngineView element.
browser_area = QWebEngineView()
And these are the modifications I made inside the "with open" block, but they do not work.
# Print content into text area.
text_area.setPlainText(f.read())
# Raw data.
file_content = f.read()
# To HTML.
browser_area.setHtml(file_content)
# Show it.
browser_area.show()
After the content is printed, only an empty window is shown.
I also tried Markdown2 (markdown2.markdown(file_content)) instead of .setHtml(), but it does not work either.
For the moment I just want to show the content in a new window and show a message if the HTML cannot be loaded.
When accessing file objects, read(size=-1) reads up to size characters (or bytes, for a file opened in binary mode) from the stream and leaves the stream position just after them.
with open('somefile', 'r') as f:
    # reads the first 10 characters
    start = f.read(10)
    # reads the *next* 10 characters
    more = f.read(10)
    # move the position back to the beginning
    f.seek(0)
    another = f.read(10)
    print(start == another)
    # This will print "True"
If size is -1 (the default) this means that the whole object is read, and after that the position is at the end. Since there's nothing left to read at the end of the file, if you try to read again you will get, obviously, nothing.
If you need to access the read data multiple times, you should store it in a temporary variable:
with open(file_name, 'r') as f:
    data = f.read()
text_area.setPlainText(data)
browser_area.setHtml(data)
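The behaviour can be checked without any Qt code. The sketch below uses io.StringIO as a stand-in for the opened file (an assumption made so that it runs anywhere); a file object returned by open() in text mode behaves the same way for read() and seek(0).

```python
import io

# In-memory stand-in for the file chosen in the dialog.
f = io.StringIO('# Title\n\nSome *Markdown* text.\n')

first = f.read()    # reads everything; the position is now at the end
second = f.read()   # nothing left to read
f.seek(0)           # rewind to the beginning
third = f.read()    # the full content again

print(repr(second))
print(first == third)
```

Storing the result of the first read() in a variable, as in the answer above, avoids the need for seek(0) entirely.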

Create hyperlinks from urls in text file using QTextBrowser

I have a text file with some basic text:
For more information on this topic, go to (http://moreInfo.com)
This tool is available from (https://www.someWebsite.co.uk)
Contacts (https://www.contacts.net)
I would like the URLs to show up as hyperlinks in a QTextBrowser, so that when clicked, the web browser will open and load the website. I have seen this post which uses:
<a href="...">Bar</a>
but as the text file can be edited by anyone (i.e. they might include text which does not provide a web address), I would like it if these addresses, if any, can be automatically hyperlinked before being added to the text browser.
This is how I read the text file:
def info(self):
    text_browser = self.dockwidget.text_browser
    file_path = 'path/to/text.txt'
    f = open(file_path, 'r')
    text = f.read()
    text_browser.setText(text)
    text_browser.setOpenExternalLinks(True)
    self.dockwidget.show()
Edit:
Made some headway and managed to get the hyperlinks using (assuming the links are inside parentheses):
import re
def info(self):
    text_browser = self.dockwidget.text_browser
    file_path = 'path/to/text.txt'
    f = open(file_path, 'r')
    text = f.read()
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
    for x in urls:
        if x in text:
            # Wrap each URL in an anchor tag that points at itself.
            text = text.replace(x, x.replace('http', '<a href="http') + '">' + x + '</a>')
    text_browser.setHtml(text)
    text_browser.setOpenExternalLinks(True)
    self.dockwidget.show()
However, it all appears in one line and not in the same format as in the text file. How could I solve this?
Matching URLs correctly is more complex than your current solution might suggest. For a full breakdown of the issues, see: What is the best regular expression to check if a string is a valid URL?
The other problem is much easier to solve. To preserve newlines, you can use this:
text = '<br>'.join(text.splitlines())
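Putting the two parts together, here is a rough sketch that wraps URLs in anchor tags and keeps the line breaks. The pattern is deliberately simple and is an assumption of this example (a URL is taken to run until whitespace, a parenthesis or an angle bracket); it is not a complete URL matcher. The function name linkify is also hypothetical.

```python
import html
import re

# Simplified pattern, assumed for this sketch only: stop at whitespace,
# parentheses or angle brackets.
URL_RE = re.compile(r'https?://[^\s()<>]+')


def linkify(text):
    """Return HTML where URLs are clickable and newlines become <br>."""
    # Escape &, < and > first so plain text cannot be read as markup.
    escaped = html.escape(text, quote=False)
    linked = URL_RE.sub(
        lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), escaped)
    return '<br>'.join(linked.splitlines())


sample = 'Contacts (https://www.contacts.net)\nNo link here'
result = linkify(sample)
print(result)
```

The returned string could then be passed to setHtml() on the text browser, with setOpenExternalLinks(True) as in the question.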

python : empty output is printing while reading the web browser output

I have a set of URL links in a file. I need to open every link, fetch the output, and store it in a file. But when I try to print the output, only empty lines are printed.
Please find the code below and help me with this.
import urllib2
import webbrowser
with open('C:\\Users\\home\\Desktop\\11.txt','r') as fp:
    for line in fp:
        password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
        top_level_url = "https://facebook.com"
        password_mgr.add_password(None, top_level_url, "appsdev", "--omitted--")
        handler = urllib2.HTTPBasicAuthHandler(password_mgr)
        opener = urllib2.build_opener(handler)
        r=opener.open(top_level_url)
        r.read()
        print r.read()
If the code you posted is correct and the 2nd r.read() isn't a typo, then it's because you have two reads.
On file-like objects (like the return value from opener.open()), calling read() will return the entire contents and set the current position to the end of the file. Subsequent calls to read() will return empty strings, since the cursor is already at the end of the file.
In your code
r.read() # This returns the entire contents
print r.read() # Empty string
Just get rid of the first r.read().
Before writing the content into another file, assign it to a variable, like this:
out_data = r.read()
new_file = open('file.txt','w')
new_file.write(out_data)
new_file.close()
That's it; your scraped data will be written to file.txt.
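For reference, a Python 3 sketch of the same loop. It assumes the intent was to open each line of the file as a URL (the posted code opens top_level_url on every iteration and never uses line), builds the authenticated opener once instead of once per line, and reads each response exactly once. The helper names make_opener, fetch_all and fake_open, the credentials and the example.com URLs are all hypothetical; fake_open replaces opener.open so the sketch runs without network access.

```python
import io
import urllib.request


def make_opener(top_level_url, user, password):
    """Build an opener that sends HTTP basic auth for top_level_url."""
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, top_level_url, user, password)
    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(handler)


def fetch_all(lines, open_url):
    """Open every non-empty line as a URL and keep each body."""
    results = []
    for line in lines:
        url = line.strip()           # drop the trailing newline
        if not url:
            continue
        response = open_url(url)
        body = response.read()       # read once and keep the result
        results.append((url, body))
    return results


def fake_open(url):
    # Stand-in for opener.open(url); returns a file-like object.
    return io.BytesIO(b'body of ' + url.encode('ascii'))


opener = make_opener('https://example.com', 'user', 'secret')
lines = ['https://example.com/a\n', '\n', 'https://example.com/b\n']
results = fetch_all(lines, fake_open)   # real use: fetch_all(fp, opener.open)
print(results)
```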

Setting HTML source for QtWebKit to a string value vs. file.read(), encoding issue?

I have a script that reads a bunch of JavaScript files into a variable, and then places the contents of those files into placeholders in a Python template. This results in the value of the variable src (described below) being a valid HTML document including scripts.
import re
from string import Template
# Open the source HTML file to get the paths to the JavaScript files
f = open('srcfile.html', 'r')
src = f.read()
f.close()
js_scripts = re.findall(r'script\ssrc="(.*)"', src)
# Put all of the scripts in a variable
js = ''
for script in js_scripts:
    f = open(script, 'r')
    js = js + f.read() + '\n'
    f.close()
# Open/read the template
template = open('template.html')
templateSrc = Template(template.read())
# Substitute the scripts for the placeholder variable
src = str(templateSrc.safe_substitute(javascript_content=js))
# Write a Python file containing the string
with open('htmlSource.py', 'w') as f:
    f.write('#-*- coding: utf-8 -*-\n\nhtmlSrc = """' + src + '"""')
If I try to open it up via PyQt5/QtWebKit in Python...
from htmlSource import htmlSrc
webWidget.setHtml(htmlSrc)
...it doesn't load the JS files in the web widget. I just end up with a blank page.
But if I get rid of everything else, and just write to file '"""src"""', when I open the file up in Chrome, it loads everything as expected. Likewise, it'll also load correctly in the web widget if I read from the file itself:
f = open('htmlSource.py', 'r')
htmlSrc = f.read()
webWidget.setHtml(htmlSrc)
In other words, when I run this script, it produces the Python output file with the variable; then I try to import that variable and pass it to webWidget.setHtml(); but the page doesn't render. But if I use open() and read it as a file, it does.
I suspect there's an encoding issue going on here. But I've tried several variations of encode and decode without any luck. The scripts are all UTF-8.
Any suggestions? Many thanks!
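One difference between the two routes that can be checked in isolation: when the generated file is imported, Python interprets backslash escapes inside the triple-quoted literal, whereas open().read() returns the characters exactly as written. Whether this is what breaks the page here is an assumption, not something the post confirms, but JavaScript that contains backslashes (regular expressions, "\n" inside strings) would be altered by the import route. The sketch below uses a made-up one-line script and exec() in place of a real import.

```python
# A tiny piece of JavaScript containing the two characters backslash and t.
js = 'var s = "a\\tb";'

# What the generator in the question effectively writes to htmlSource.py.
naive_module = 'htmlSrc = """' + js + '"""'
naive_ns = {}
exec(naive_module, naive_ns)          # stands in for "from htmlSource import htmlSrc"
naive = naive_ns['htmlSrc']

# Writing the literal with repr() escapes backslashes and quotes correctly.
safe_module = 'htmlSrc = ' + repr(js)
safe_ns = {}
exec(safe_module, safe_ns)
safe = safe_ns['htmlSrc']

print(naive == js)   # False: the backslash-t became a tab character
print(safe == js)    # True
```

If this turns out to be the cause, writing repr(src) instead of wrapping src in triple quotes would keep the import route and the read() route identical.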

PyQuery Python not working with for loop

I am trying to write a program that pulls the urls from each line of a .txt file and performs a PyQuery to scrape lyrics data off of LyricsWiki, and everything seems to work fine until I actually put the PyQuery stuff in. For example, when I do:
full_lyrics = ""
#open up the input file
links = open('links.txt')
for line in links:
    full_lyrics += line
print(full_lyrics)
links.close()
It prints everything out as expected, one big string with all the data in it. However, when I implement the actual html parsing, it only pulls the lyrics from the last url and skips through all the previous ones.
import requests, re, sqlite3
from pyquery import PyQuery
from collections import Counter
full_lyrics = ""
#open up the input file
links = open('links.txt')
output = open('web.txt', 'w')
output.truncate()
for line in links:
    r = requests.get(line)
    #create the PyQuery object and parse text
    results = PyQuery(r.text)
    results = results('div.lyricbox').remove('script').text()
    full_lyrics += (results + " ")
output.write(full_lyrics)
links.close()
output.close()
I am writing to a txt file to avoid encoding issues with PowerShell. Anyway, after I run the program and open up the txt file, it only shows the lyrics of the last link in links.txt.
For reference, 'links.txt' should contain several links to lyricswiki song pages, like this:
http://lyrics.wikia.com/Taylor_Swift:Shake_It_Off
http://lyrics.wikia.com/Maroon_5:Animals
'web.txt' should be a blank output file.
Why is it that PyQuery breaks the for loop? It clearly works when it's doing something simpler, like just concatenating the individual lines of a file.
The problem is the additional newline character at the end of every line that you read from the file (links.txt). Only the last line has no trailing newline, which is why only the last link works. Try adding a newline after the last entry in links.txt and you'll see that even the last entry will not be processed.
I recommend that you do a right strip on the line variable at the top of the for loop, like this:
for line in links:
    line = line.rstrip()
    r = requests.get(line)
    ...
It should work.
I also think that you don't need requests to get the html. Try results = PyQuery(line) and see if it works.
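The trailing newline is easy to see without making any requests. This sketch uses io.StringIO in place of links.txt (an assumption made so that it runs anywhere) with the two example URLs from the question and no newline after the last one.

```python
import io

links = io.StringIO(
    'http://lyrics.wikia.com/Taylor_Swift:Shake_It_Off\n'
    'http://lyrics.wikia.com/Maroon_5:Animals'
)

# Iterating over a file yields each line with its newline still attached.
raw = [line for line in links]

links.seek(0)
# rstrip() removes the newline; the "if" skips lines that are blank.
cleaned = [line.rstrip() for line in links if line.strip()]

print(repr(raw[0][-1]))   # the newline kept on the first line
print(repr(raw[1][-1]))   # the last line ends with a normal character
print(cleaned)
```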
