The code below saves a list of URLs. I want to take those lines of text and convert them to links within an HTML file by adding the A tags, and place those links within properly formatted HTML code.
#!/usr/bin/env python
import sys
import os
import shutil

try:
    from googlesearch import search
except ImportError:
    print("No module named 'google' found")

# keyword query user input
query = raw_input('Enter keyword or keywords to search: ')

# print results from search into a file called keyword.txt
with open("keyword.txt", "w+") as f:
    for j in search(query, tld='co.in', lang='en', num=10, start=0, stop=200, pause=3):
        f.write("%s\n" % j)
f.close()

# add keyword to list of keywords file
sys.stdout = open("keywords", "a+")
print(query)
sys.stdout.close()

# rename file to reflect query input
os.rename('keyword.txt', query + ".txt")

# move created data file to proper directory and clean up the mess
source = os.listdir("/home/user/search/")
destination = "/home/user/search/index/"
for files in source:
    if files.endswith(".txt"):
        shutil.copy(files, destination)
os.remove(query + ".txt")
The expected result would be an HTML file with clickable links.
Based on your comment, it appears that you are struggling to write the URL string obtained from the search function into a file along with the required HTML tags. Try:
with open("keyword.txt","w+") as f:
for j in search(query, tld='co.in', lang='en', num=10, start=0, stop=200, pause=3):
f.write('{1} <br>\n'.format(j,j))
This will write each URL as a hyperlink. You might want to write <html> ... </html> and <body> ... </body> tags to keyword.txt as well. This can be done like this:
with open("keyword.txt","w+") as f:
f.write('<html> \n <body> \n')
for j in search(query, tld='co.in', lang='en', num=10, start=0, stop=200, pause=3):
f.write('{1} <br>\n'.format(j,j))
f.write('\n</body> \n </html>')
Also, you don't have to close the file with f.close() if you use with open; see: https://stackoverflow.com/a/8011836/937153
Personally, I prefer format over %. I won't attempt a full comparison between the two here; you can see Python string formatting: % vs. .format for a detailed discussion on this topic.
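To illustrate the two styles side by side, here is a minimal sketch (the URL is a made-up example); both produce the same anchor line:

url = "https://example.com"

# old-style % interpolation
percent_style = '<a href="%s">%s</a> <br>\n' % (url, url)

# str.format with positional placeholders
format_style = '<a href="{0}">{1}</a> <br>\n'.format(url, url)

assert percent_style == format_style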
I am new to Python, with only one script behind me for searching strings in PDFs. Now I would like to build a script which gives me results in a new CSV/xlsx file containing the first line of each page of a given PDF file along with its page number. For now I have the code below for printing a whole page:
from PyPDF2 import PdfFileReader

pdf_document = "example.pdf"
with open(pdf_document, "rb") as filehandle:
    pdf = PdfFileReader(filehandle)
    info = pdf.getDocumentInfo()
    pages = pdf.getNumPages()
    print(info)
    print("number of pages: %i" % pages)
    page1 = pdf.getPage(0)
    print(page1)
    print(page1.extractText())
You can read the PDF file page by page, split each page's text on '\n' (if that is the character that separates lines), then use the csv package to write into a CSV file, as in the script below. Note that if the PDF contains images, this code will not be able to extract their text; you would need an OCR module to convert images to text first.
from PyPDF2 import PdfFileReader
import csv

pdf_document = "test.pdf"
with open(pdf_document, "rb") as filehandle:
    pdf = PdfFileReader(filehandle)
    with open('result.csv', 'w') as csv_file:
        csv_writer = csv.writer(csv_file)
        csv_writer.writerow(['page number', 'first line'])
        for i in range(0, pdf.getNumPages()):
            content = pdf.getPage(i).extractText().split('\n')
            print(content[0])  # prints first line
            print(i + 1)       # prints page number
            print('-------------')
            csv_writer.writerow([i + 1, content[0]])
I have to convert a whole PDF to text. I have seen many examples of converting a particular page of a PDF to text, but not the whole document.
from PyPDF2 import PdfFileReader
import os

def text_extractor(path):
    with open(os.path.join(path, file), 'rb') as f:
        pdf = PdfFileReader(f)
        ### Here I can specify a page, but I need to convert the whole pdf without specifying pages ###
        page = pdf.getPage(0)
        text = page.extractText()
        print(text)

if __name__ == '__main__':
    path = "C:\\Users\\AAAA\\Desktop\\BB"
    for file in os.listdir(path):
        if not file.endswith(".pdf"):
            continue
        text_extractor(path)
How do I convert a whole PDF file to text without using getPage()?
You may want to use textract, as this answer recommends, to get the full document if all you want is the text.
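For what that route looks like, here is a minimal sketch (assuming textract is installed; the filename is a placeholder):

import textract

# textract.process() returns the text of the whole document as bytes
text = textract.process("mypdf.pdf")
print(text.decode("utf-8"))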
If you want to use PyPDF2, you can first get the number of pages and then iterate over each page, like this:
from PyPDF2 import PdfFileReader
import os

def text_extractor(path):
    with open(os.path.join(path, file), 'rb') as f:
        pdf = PdfFileReader(f)
        # iterate over every page instead of specifying one
        text = ""
        for page_num in range(pdf.getNumPages()):
            page = pdf.getPage(page_num)
            text += page.extractText()
        print(text)

if __name__ == '__main__':
    path = "C:\\Users\\AAAA\\Desktop\\BB"
    for file in os.listdir(path):
        if not file.endswith(".pdf"):
            continue
        text_extractor(path)
Though you may want to remember which page the text came from, in which case you could use a list:
page_text = []
for page_num in range(pdf.getNumPages()):    # for each page
    page = pdf.getPage(page_num)             # get that page's reference
    page_text.append(page.extractText())     # add that page's text to our list
for page in page_text:
    print(page)  # print each page
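If you also want the page number back when you print, enumerate gives it to you (a small sketch continuing from the page_text list built above):

for page_num, page in enumerate(page_text, start=1):  # page numbers starting at 1
    print("--- page %d ---" % page_num)
    print(page)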
You could use tika to accomplish this task, but the output needs a little cleaning.
from tika import parser

parse_entire_pdf = parser.from_file('mypdf.pdf', xmlContent=True)
parse_entire_pdf = parse_entire_pdf['content']
print(parse_entire_pdf)
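One way to do that cleaning is to parse the XHTML content with BeautifulSoup; a sketch, assuming tika's usual behaviour of wrapping each PDF page in a <div class="page">:

from bs4 import BeautifulSoup
from tika import parser

raw = parser.from_file('mypdf.pdf', xmlContent=True)
soup = BeautifulSoup(raw['content'], 'lxml')

# each PDF page becomes a <div class="page"> in tika's XHTML output
for page_num, div in enumerate(soup.find_all('div', class_='page'), start=1):
    print("--- page %d ---" % page_num)
    print(div.get_text().strip())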
This answer uses PyPDF2 and encode('utf-8') to keep the output per page together.
from PyPDF2 import PdfFileReader

def pdf_text_extractor(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        # get the total pdf page number
        totalPageNumber = pdf.numPages
        currentPageNumber = 0
        while currentPageNumber < totalPageNumber:
            page = pdf.getPage(currentPageNumber)
            text = page.extractText()
            # the encoding puts each page on a single line
            # type is <class 'bytes'>
            print(text.encode('utf-8'))
            #################################
            # This outputs the text to a list,
            # but it doesn't keep paragraphs
            # together
            #################################
            # output = text.encode('utf-8')
            # split = str(output, 'utf-8').split('\n')
            # print(split)
            #################################
            # process next page
            currentPageNumber += 1

path = 'mypdf.pdf'
pdf_text_extractor(path)
Try pdfreader. You can extract either plain text or decoded text containing "pdf markdown":
from pdfreader import SimplePDFViewer, PageDoesNotExist

fd = open(you_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)

plain_text = ""
pdf_markdown = ""
try:
    while True:
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        viewer.next()
except PageDoesNotExist:
    pass
PDF is a page-oriented format, so you'll need to deal with the concept of pages.
What makes it perhaps even more difficult is that you're not guaranteed the text excerpts you extract come out in the same order as they are presented on the page: PDF allows one to say "put this text within a 4x3 box situated 1" from the top, with a 1" left margin", and then put the next run of text somewhere else entirely on the same page.
Your extractText() function simply gets the extracted text blocks in document order, not presentation order.
Tables are notoriously difficult to extract in a common, meaningful way: you see them as tables, PDF sees them as text blocks placed on the page with little or no relationship between them.
Still, getPage() and extractText() are good starting points, and if you have simply formatted pages, they may work fine.
I found a very simple way to do this.
You have to follow these steps:
Install PyPDF2: if you use Anaconda, search for Anaconda Prompt and type the following command (you need administrator permission to do this):
pip install PyPDF2
If you're not using Anaconda, you have to install pip and add its path to your cmd or terminal.
Python code: the following code shows how to convert a pdf file very easily:
import PyPDF2

with open("pdf file path here", 'rb') as file_obj:
    pdf_reader = PyPDF2.PdfFileReader(file_obj)
    raw = pdf_reader.getPage(0).extractText()
    print(raw)
I just used the pdftotext module to get this done easily.
import pdftotext

# load your PDF
with open("test.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# create a text file by iterating through all pages in the pdf
file = open("test.txt", "w")
for page in pdf:
    file.write(page)
file.close()
Link: https://github.com/manojitballav/pdf-text
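If you would rather have the whole document as a single string instead of writing pages to a file, joining the pages works too (a small sketch):

import pdftotext

with open("test.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# pdftotext.PDF is iterable over pages, so join collapses them into one string
all_text = "\n\n".join(pdf)
print(all_text)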
I have a text file with some basic text:
For more information on this topic, go to (http://moreInfo.com)
This tool is available from (https://www.someWebsite.co.uk)
Contacts (https://www.contacts.net)
I would like the URLs to show up as hyperlinks in a QTextBrowser, so that when clicked, the web browser opens and loads the website. I have seen this post, which uses:
<a href="...">Bar</a>
but since the text file can be edited by anyone (i.e. they might include text which does not provide a web address), I would like these addresses, if any, to be automatically hyperlinked before being added to the text browser.
This is how I read the text file:
def info(self):
    text_browser = self.dockwidget.text_browser
    file_path = 'path/to/text.txt'
    f = open(file_path, 'r')
    text = f.read()
    text_browser.setText(text)
    text_browser.setOpenExternalLinks(True)
    self.dockwidget.show()
Edit:
Made some headway and managed to get the hyperlinks using the following (assuming the links are inside parentheses):
import re

def info(self):
    text_browser = self.dockwidget.text_browser
    file_path = 'path/to/text.txt'
    f = open(file_path, 'r')
    text = f.read()
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
    for x in urls:
        if x in text:
            text = text.replace(x, '<a href="' + x + '">' + x + '</a>')
    text_browser.setHtml(text)
    text_browser.setOpenExternalLinks(True)
    self.dockwidget.show()
However, it all appears in one line and not in the same format as in the text file. How could I solve this?
Matching URLs correctly is more complex than your current solution might suggest. For a full breakdown of the issues, see: What is the best regular expression to check if a string is a valid URL?
The other problem is much easier to solve. To preserve newlines, you can use this:
text = '<br>'.join(text.splitlines())
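Putting both fixes together inside your info() method might look like this (a sketch reusing your own names; the regex is the one from your edit, and set() avoids wrapping the same URL twice):

import re

def info(self):
    text_browser = self.dockwidget.text_browser
    file_path = 'path/to/text.txt'
    with open(file_path, 'r') as f:
        text = f.read()
    # wrap every detected url in an anchor tag
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
    for x in set(urls):
        text = text.replace(x, '<a href="{0}">{0}</a>'.format(x))
    # setHtml collapses plain newlines, so convert them to <br> first
    text = '<br>'.join(text.splitlines())
    text_browser.setHtml(text)
    text_browser.setOpenExternalLinks(True)
    self.dockwidget.show()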
I'm new to Python and trying to write a program that does the following:
Open all folders and subfolders in a directory path
Identify the HTML files
Load the HTML in BeautifulSoup
Find the first body tag
If the body tag is immediately followed by <!-- Google Tag Manager -->, then continue
If not, then add the <!-- Google Tag Manager --> code and save the file.
I'm not able to scan all subfolders within each folder.
I'm not able to set seen() when <!-- Google Tag Manager --> appears immediately after the body tag.
Any help to perform the above tasks is appreciated.
My code attempt is as follows:
import sys
import os
from os import path
from bs4 import BeautifulSoup

directory_path = '/input'
files = [x for x in os.listdir(directory_path) if path.isfile(directory_path + os.sep + x)]
for root, dirs, files in os.walk(directory_path):
    for fname in files:
        seen = set()
        a = directory_path + os.sep + fname
        if fname.endswith(".html"):
            with open(a) as f:
                soup = BeautifulSoup(f)
                for li in soup.select('body'):
                    if li in seen:
                        continue
                    else:
                        seen.add("<!-- Google Tag Manager --><noscript><iframe src='//www.googletagmanager.com/ns.html?id=GTM-54QWZ8'height='0' width='0' style='display:none;visibility:hidden'></iframe></noscript><script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0], j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-54QWZ8');</script><!-- End Google Tag Manager -->\n")
You can use the iglob function from Python's built-in glob module. With iglob you can recursively traverse the main directory you specify and its sub-directories and list all the files with a given extension. Then open up each HTML file, read all the lines, and traverse them manually until you find "<body>" for the tag (some users on a framework might have other content inside the body tag). Either way, loop through the lines looking for the start of the body tag, then check the next line; if the "Google Tag Manager" text you specified is not in the next line, write it out. Please keep in mind I wrote this assuming you will always want the Google Tag Manager tags right after the body tag.
Please keep in mind that:
If the Google Tag Manager text is not directly after the body tag, this code will add it anyway; so if a working Google Tag Manager snippet already sits somewhere else between the two body tags, this could break its functionality.
I am using Python 3.x for this, so if you are using Python 2, you might have to translate this to that version of Python.
Replace 'path.html' with the variable path so that it rewrites the file it is looking at with the modifications. I put in 'path.html' so that I could see the output and compare it to the original while I was writing the script.
Here is the code:
import glob

types = ('*.html', '*.htm')
paths = []
for fType in types:
    for filename in glob.iglob('./**/' + fType, recursive=True):
        paths.append(filename)
# print(paths)

for path in paths:
    print(path)
    with open(path, 'r') as f:
        lines = f.readlines()
    with open(path, 'w') as w:
        for i in range(0, len(lines)):
            w.write(lines[i])
            if "<body>" in lines[i]:
                # guard against <body> being the last line of the file
                if i + 1 >= len(lines) or "<!-- Google Tag Manager -->" not in lines[i + 1]:
                    w.write('<!-- Google Tag Manager --> <!-- End Google Tag Manager -->\n')
My take on it, might have some bugs:
Edited to add: I have since realized that this code does not ensure <!-- Google Tag Manager --> is the first tag after <body>; instead it ensures it is the first comment after <body>, which is not what the question asked for.
import fnmatch
import os
from bs4 import BeautifulSoup, Comment
from HTMLParser import HTMLParser

def get_soup(filename):
    with open(filename, 'r') as myfile:
        data = myfile.read()
    return BeautifulSoup(data, 'lxml')

def write_soup(filename, soup):
    with open(filename, "w") as file:
        output = HTMLParser().unescape(soup.prettify())
        file.write(output)

def needs_insertion(soup):
    comments = soup.find_all(text=lambda text: isinstance(text, Comment))
    try:
        if comments[0] == ' Google Tag Manager ':
            return False  # has correct comment
        else:
            return True   # has comments, but not correct one
    except IndexError:
        return True  # has no comments

def get_html_files_in_dir(top_level_directory):
    matches = []
    for root, dirnames, filenames in os.walk(top_level_directory):
        for filename in fnmatch.filter(filenames, '*.html'):
            matches.append(os.path.join(root, filename))
    return matches

my_html_files_path = '/home/azrad/whateveryouneedhere'
for full_file_name in get_html_files_in_dir(my_html_files_path):
    soup = get_soup(full_file_name)
    if needs_insertion(soup):
        soup.body.insert(0, '<!-- Google Tag Manager --> <!-- End Google Tag Manager -->')
        write_soup(full_file_name, soup)
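As a side note on the design: the HTMLParser().unescape() round-trip in write_soup is needed because the comment inserted as a plain string gets escaped on output. Inserting a real bs4 Comment node avoids that (a small standalone sketch):

from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup('<html><body><p>hello</p></body></html>', 'lxml')
# a Comment node is emitted unescaped, so no unescape step is required
soup.body.insert(0, Comment(' Google Tag Manager '))
print(soup.prettify())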
I am trying to write a program that pulls the URLs from each line of a .txt file and performs a PyQuery to scrape lyrics data off of LyricsWiki, and everything seems to work fine until I actually put the PyQuery stuff in. For example, when I do:
full_lyrics = ""
#open up the input file
links = open('links.txt')
for line in links:
full_lyrics += line
print(full_lyrics)
links.close()
It prints everything out as expected, one big string with all the data in it. However, when I implement the actual HTML parsing, it only pulls the lyrics from the last URL and skips all the previous ones.
import requests, re, sqlite3
from pyquery import PyQuery
from collections import Counter

full_lyrics = ""

# open up the input file
links = open('links.txt')
output = open('web.txt', 'w')
output.truncate()
for line in links:
    r = requests.get(line)
    # create the PyQuery object and parse text
    results = PyQuery(r.text)
    results = results('div.lyricbox').remove('script').text()
    full_lyrics += (results + " ")
output.write(full_lyrics)
links.close()
output.close()
I'm writing to a txt file to avoid encoding issues with PowerShell. Anyway, after I run the program and open up the txt file, it only shows the lyrics of the last link in the links.txt document.
For reference, 'links.txt' should contain several links to lyricswiki song pages, like this:
http://lyrics.wikia.com/Taylor_Swift:Shake_It_Off
http://lyrics.wikia.com/Maroon_5:Animals
'web.txt' should be a blank output file.
Why is it that pyquery breaks the for loop? It clearly works when it's doing something simpler, like just concatenating the individual lines of a file.
The problem is the additional newline character in every line that you read from the file (links.txt). Try adding another blank line at the end of your links.txt and you'll see that even the last entry will not be processed.
I recommend that you do a right strip on the line variable inside the for loop, like this:
for line in links:
    line = line.rstrip()
    r = requests.get(line)
    ...
It should work.
I also think that you don't need requests to get the HTML. Try results = PyQuery(line) and see if it works.
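For completeness, the fixed loop with both suggestions applied might look like this (a sketch; it uses PyQuery's documented url= keyword so PyQuery fetches the page itself):

from pyquery import PyQuery

full_lyrics = ""
with open('links.txt') as links:
    for line in links:
        url = line.rstrip()  # drop the trailing newline
        if not url:
            continue  # skip blank lines
        # PyQuery fetches and parses the page when given a url
        results = PyQuery(url=url)
        full_lyrics += results('div.lyricbox').remove('script').text() + " "

with open('web.txt', 'w') as output:
    output.write(full_lyrics)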