Syntax for Conditional Statement With Regex Function - python

I have written code to parse multiple PDF files and return one line of data from each page. I ran into the issue that some of the pages within my PDF files do not have this line. When this happens my code just omits the page entirely; however, I would like it to print a single 'None' for the pages where it cannot find the specified line. I thought this was a simple fix, but it's proving to be a little more complicated than I thought. Here is an example of the line I am pulling and what I have tried:
# pattern I told my code to look for within each page of the PDF
sqft_re = re.compile(r'(\d+(sqft)\s+[$]\d+[.]\d+\s+\d{2}/\d{2})')
#this is an example of what the line I want in each page looks like:
'1600sqft $154.98 10/14'
Basically I want the code to parse through every PDF and return the line if it can find it. If it cannot, I want it to return a single 'None' for the page without that line. I collect the lines in a list like so:
lines = []
Here is how I set up my for loop to look through each page of my PDF files:
for files in os.listdir(directory):
    if files.endswith(".pdf"):
        with pdfplumber.open(files) as pdf:
            pages = pdf.pages
            for page in pdf.pages:
                text = page.extract_text()
                for line in text.split('\n'):
                    line = sqft_re.search(line)
                    if line:
                        line.group(1)
                        lines.append(line)
Example of output:
lines
'1600sqft $154.98 10/14'
'1450qft $113.02 07/05'
'90sqft $60.17 05/12'
'3000sqft $500.98 09/20'
This code successfully returns the list of data for pages with the line. However, pages without the line are omitted. Here is what I thought would fix the problem and simply print 'None' for pages without the line:
for files in os.listdir(directory):
    if files.endswith(".pdf"):
        with pdfplumber.open(files) as pdf:
            pages = pdf.pages
            for page in pdf.pages:
                text = page.extract_text()
                for line in text.split('\n'):
                    line = sqft_re.search(line)
                    if line:
                        line.group(1)
                    else:
                        line = 'None'
                    lines.append(line)
However, this did not work. Now, instead of just substituting 'None' for pages without the value, every single line within the PDF page is printed as 'None' except where it matches the pattern. So basically I now have a list that looks like this:
lines
'None'
'None'
'None'
'1600sqft $154.98 10/14'
'None'
'None'
'None'
'1450qft $113.02 07/05' #etc.....
I have tried some other things, like calling a different function when it does not match what I am looking for, making my own string to substitute the value with, and a couple more. I am still getting the same problem. In my sample PDF there is only one page without this line, so my list should look like:
'1600sqft $154.98 10/14'
'1450qft $113.02 07/05'
'90sqft $60.17 05/12'
'3000sqft $500.98 09/20'
'None'
I am also pretty new to Python (R is what I primarily work with), so I am sure I am overlooking something here, but any guidance as to what I am missing would be appreciated!

You should append the matched text (the captured group) to lines, not the match object itself, unless that is your intention.
Besides, you need to set a flag to False before checking each page and, once there is a match, set it to True. If it is still False at the end of the page, append 'None' to lines.
Here is sample Python code for the loop:
for page in pdf.pages:
    text = page.extract_text()
    found = False
    for line in text.split('\n'):
        match = sqft_re.search(line)
        if match:
            found = True
            lines.append(match.group(1))
    if not found:
        lines.append('None')
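As a variation on the same idea, the per-page check can also be written with next() and a default value, which keeps exactly one entry per page without an explicit flag. This is only a sketch: it reuses the question's pattern, assumes one expected match per page, and uses os.path.join because os.listdir returns bare file names rather than paths.

import os
import re

import pdfplumber

sqft_re = re.compile(r'(\d+(sqft)\s+[$]\d+[.]\d+\s+\d{2}/\d{2})')
directory = "."  # adjust to the folder holding the PDFs
lines = []

for name in os.listdir(directory):
    if name.endswith(".pdf"):
        with pdfplumber.open(os.path.join(directory, name)) as pdf:
            for page in pdf.pages:
                # extract_text() can return None for pages with no text
                text = page.extract_text() or ""
                # first matching line on the page, or 'None' when nothing matches
                lines.append(next(
                    (m.group(1) for m in map(sqft_re.search, text.split('\n')) if m),
                    'None',
                ))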

Related

Streamlit file objects creating very large line spacings when decoded to string, printed to file

I am writing a Streamlit app that takes in tensor output data from a .txt file, formats it, and both shows information on the data and prints the formatted data back to a new .txt file for later use.
After uploading the txt file to Streamlit and decoding it to a single long string, I alter the string and write it to a new txt file. When I open the txt file, the line spacings are huge; it looks like extra newlines have been inserted, but when you highlight the text, it is just large line spacings.
As well as this, when I use splitlines() on the string, the array that is returned is empty. This is the case even though the string is not empty and does contain newlines - I think it is to do with the large line spacings, but I am not sure.
The program is split into modules, but the code that is meant to format the file is in just two functions. One adds delimiters and works like this (with Streamlit as st):
def delim(file):
    # read the selected file and write it to variable elems as a string
    elems = file.decode('utf-8')
    # replace the applicable parts of variable elems with the delimiters
    elems = elems.replace('e+002', 'e+002, ')
    elems = elems.replace('e+003', 'e+003, ')
    elems = elems.replace('e+004', 'e+004, ')
    elems = elems.replace('e+005', 'e+005, ')
    elems = elems.replace('e+006', 'e+006, ')
    elems = elems.replace('e+007', 'e+007, ')
    elems = elems.replace('e+008', 'e+008, ')
    elems = elems.replace('e+009', 'e+009, ')
    with open('final_file.txt', 'w') as magma_file:
        # write a txt file with the stored, altered text in variable elems
        magma_file.write(elems)
        # close the writeable file to be safe
        magma_file.close()
    st.success('Delimiters successfully added')
The second part, where I am getting the empty array, is in a second function. The whole function is not necessary to see the issue, but the part that is not working is here:
def addElem(file):
    # create counting variables
    counter = 0
    linecount = 1
    # put file as string in variable checks
    checks = file.decode('utf-8')
    checks.splitlines()
    # check to see if the start of the file is formatted correctly. This is the part giving me strife
    if checks[0].rstrip().endswith('5'):
        with open('final_file.txt', 'w') as ff:
            # iterate through the lines in the file
            for line in checks:
                counter += 1
                # and so on, not relevant to the problem
The variable checks does contain a string after decoding the file, but when I use splitlines() then look inside checks[0], checks[1] etc., they are all empty. I tried commenting out other code, the conditional statement, removing the rstrip() and just seeing what was in the checks array after splitting the string, but it was still nothing. I tried changing splitlines() to split() using various delimiters including \n, but the array remained empty.
This program logic worked perfectly when I was running it locally using a console application interacting directly with the file system, so probably the problem is something to do with how a Streamlit "file like object" works. I read through the docs at Streamlit, but it doesn't give much detail on this.
This program is not for my use, so I can't keep it as a console app. I did ask about this on the Streamlit community a month ago, but so far no one has answered and I am not sure whether it is an unusual problem or just a terrible question.
I am wondering if there is a better way to decode the file to a string, but decoding to unicode doesn't explain the line spacings so I think something else is going on.
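One detail worth noting about the snippet above: str.splitlines() returns a new list and leaves the original string untouched, so after an unassigned call, checks[0] is still just the first character of the string. A minimal sketch of the assignment, keeping the question's names:

def addElem(file):
    checks = file.decode('utf-8')
    # splitlines() does not modify the string in place; bind its result to a name
    check_lines = checks.splitlines()
    # indexing the list now gives whole lines rather than single characters
    if check_lines and check_lines[0].rstrip().endswith('5'):
        for line in check_lines:
            ...  # same per-line processing as in the question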

How to go to a changed URL in Selenium Python

I am a complete beginner in programming, and I am trying to make my first Python script. I have a URL in this form: https://stackoverflow.com/questions/ID
where ID is changed every time in a loop, and a list of IDs is given in a text file.
Now I tried to do it this way:
if id_str != "":
    y = f'https://stackoverflow.com/questions/{id}'
    browser.get(y)
but it opens only the first ID in the text file, so I need to know how to make it get a different ID from the text file every time.
Thanks in Advance
Generally it can be something like this:
with open(filename) as file:
    lines = file.readlines()

for line in lines:
    id_str = line.strip()
    if id_str != "":
        y = f'https://stackoverflow.com/questions/{id_str}'
        browser.get(y)
where filename is a text file containing the question ids, each line containing a single id string.
It can be more complicated, according to your needs / implementation.
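For example, a slightly tidier variant reads the ids up front and skips blank lines. This is only a sketch; filename and browser are assumed to be set up as in the question.

with open(filename) as file:
    # one question id per line; ignore blank lines
    ids = [line.strip() for line in file if line.strip()]

for id_str in ids:
    browser.get(f'https://stackoverflow.com/questions/{id_str}')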

Remove first 5 lines from every PDF page in Python

I need to extract text from a PDF using Python (NLP application), and want to leave out the first 5 lines of the text on every page. I tried looking online, but couldn't find anything substantial. I am using the code below to read all the text on the pages. Is there a post-extraction step that can remove the first few lines from every page, or maybe something that can be done at the extraction stage itself?
fileReader = PyPDF2.PdfFileReader(file)
s = ""
for i in range(2, fileReader.numPages):
    s += fileReader.getPage(i).extractText()
Split the text on "\n" and slice off the first 5 lines:
import pdfplumber

pdf = pdfplumber.open("CS.pdf")
for page in pdf.pages:
    text = page.extract_text()
    for line in text.split("\n")[5:]:
        print(line)
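The same post-extraction slicing also works with the PyPDF2 reader from the question, though extractText() line breaks are not always as clean as pdfplumber's. A sketch, assuming file is the open PDF object as in the question:

import PyPDF2

fileReader = PyPDF2.PdfFileReader(file)
s = ""
for i in range(2, fileReader.numPages):
    page_text = fileReader.getPage(i).extractText()
    # drop the first 5 lines of this page before accumulating the rest
    s += "\n".join(page_text.split("\n")[5:])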

Python increment pulling from an array in a loop

I am trying to create something that would:
Read a file
Separate everything into an array
Use the first string in the array and do something
When done, go back to the second string in the array and do something
Keep repeating this process until the array is done.
So far I have
users = open('users.txt', "r")
userL = users.read().splitlines()
What I want it to do exactly is open the text file, which is already separated one line per string, have Python put that into an array, get the first string, and set that into a variable. From there the variable will be used in a URL for xbox.com.
After it checks it, I will have some JSON read the page and see if the gamertag from my list is being used; if it is, it will go back to the array, move to the second string, and check again. This needs to be a constant loop of checking gamertags. If it does find a gamertag in the array (from the text file) that isn't used, it will save that to another text file entitled "available gamertags" and keep moving on.
What I want it to do (requested in comments):
Open the program
Have it read a text file of usernames I have created
Have the program test each gamertag at the end of the gamertag viewer link for Xbox
Read the page's JSON and, if it says the name is taken, go back to the list and use the next gamertag
Keep doing this
Log all the gamertags that worked and save them to a text file
Exit
The problem in doing that is I don't know how to go back to the file and access the line after the one it just tested and continue this pattern until the file is completely read.
Use a for loop:
with open("users.txt") as f:
for line in f:
# do whatever with the line
For example, to achieve your goal here, you might do something like this:
# import our required modules
import json
import urllib2

# declare some initial variables
input_file = "users.txt"
output_file = "available_tags.txt"
available_tags = []  # an empty list to hold our available gamertags

# open the file
with open(input_file) as input_f:
    # loop through each line
    for line in input_f:
        # strip any new line character from the line
        tag = line.strip()
        # set up our URL and open a connection to the API
        url = "http://360api.chary.us/?gamertag=%s" % tag
        u = urllib2.urlopen(url)
        # load the returned data as a JSON object
        data = json.loads(u.read())
        # check if the gamertag is available
        if not data['GamertagExists']:
            # print it and add it to our list of available tags if so
            print "Tag %s is available." % tag
            available_tags.append(tag)
        else:
            print "Tag %s is not available." % tag  # otherwise

# check that we have at least one valid tag to store
if len(available_tags) > 0:
    # open our output file
    with open(output_file, "w") as output_f:
        # loop through our available tags
        for tag in available_tags:
            # write each one to the file
            output_f.write("%s\n" % tag)
else:
    print "No valid tags to be written to output file."
To get you started, the following code will read a whole file from start to finish in that order, and print each line individually:
with open(r"path/to.file.txt") as fin:
for line in fin.readlines():
print(line) # Python 2.7: Use 'print line' instead
If you need to remove the trailing new lines from each string, use .strip().
To write data to a file, use something like the following:
with open(r"path/to/out/file.txt", "w") as fout:
fout.writelines(data_to_write)
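The longer example above is Python 2 code (urllib2 and print statements). If it ever needs to run on Python 3, roughly the same flow looks like the sketch below; it keeps the same endpoint and GamertagExists field used in the answer, and that API may well no longer exist.

import json
import urllib.request

input_file = "users.txt"
output_file = "available_tags.txt"
available_tags = []

with open(input_file) as input_f:
    for line in input_f:
        tag = line.strip()
        url = "http://360api.chary.us/?gamertag=%s" % tag
        # query the API and decode the JSON response
        with urllib.request.urlopen(url) as u:
            data = json.loads(u.read().decode("utf-8"))
        if not data.get('GamertagExists'):
            print("Tag %s is available." % tag)
            available_tags.append(tag)
        else:
            print("Tag %s is not available." % tag)

if available_tags:
    with open(output_file, "w") as output_f:
        for tag in available_tags:
            output_f.write("%s\n" % tag)
else:
    print("No valid tags to be written to output file.")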

Python: Problems finding string in website source code

I open a website with urlopen. I then put the website source code into a variable like so:
source = website.read()
When I just print the source it comes out formatted correctly; however, when I try to iterate through each line, each character is its own line.
for example
when I just print it looks like this
<HTML> title</html>
When I do this
for line in source:
    print line
it looks like this
<
H
T
M
L
... etc
I need to find a string that starts with "var" and then print that entire line.
Use readlines() instead of read() to get a list of lines.
Or use:
for line in source.split("\n"):
    ...
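To then pull out the lines that start with "var", something like this works (a sketch, assuming source has already been decoded to a normal str):

for line in source.splitlines():
    # keep only the lines whose first non-whitespace characters are "var"
    if line.lstrip().startswith("var"):
        print(line)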
