Python: Problems finding string in website source code - python

I open a website with urlopen. I then put the website source code into a variable like so:
source = website.read()
When I just print the source it comes out formatted correctly. However, when I try to iterate over it line by line, each character ends up on its own line.
For example, when I just print it, it looks like this:
<HTML> title</html>
When I do this:
for line in source:
    print(line)
it looks like this:
<
H
T
M
L
... etc
I need to find a string that starts with "var" and then print that entire line.

Use readlines() instead of read() to get a list of lines.

Or use:
for line in source.split("\n"):
    ...
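To tie this back to the original goal, here is a minimal sketch, assuming the page source is already available as a single string. The function name find_var_lines and the sample markup are illustrative assumptions, not part of the original post. Iterating over a string directly yields one character at a time, which is why the loop in the question printed single characters; splitlines() yields whole lines instead.

```python
def find_var_lines(source):
    """Return every line of `source` whose first non-blank text is 'var'."""
    matches = []
    # splitlines() handles \n, \r\n and \r line endings alike
    for line in source.splitlines():
        if line.strip().startswith("var"):
            matches.append(line)
    return matches


# Hypothetical sample standing in for website.read()
sample = "<html>\n<script>\nvar answer = 42;\n</script>\n</html>"
for line in find_var_lines(sample):
    print(line)
```

In Python 3, read() returns bytes, so the result would need decoding first, for example website.read().decode('utf-8') if the page is UTF-8 (an assumption about the page).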

Related

Streamlit file objects creating very large line spacings when decoded to string, printed to file

I am writing a Streamlit app that takes in tensor output data from a .txt file, formats it, and both shows information on the data and prints the formatted data back to a new .txt file for later use.
After uploading the txt file to Streamlit and decoding it to a single long string, I alter the string and write it to a new txt file. When I open the txt file, the line spacings are huge. It looks as if extra newlines have been inserted, but when you highlight the text, it is just large line spacing.
As well as this, when I use splitlines() on the string, the array that is returned is empty. This is the case even though the string is not empty and does contain newlines - I think it is to do with the large line spacings, but I am not sure.
The program is split into modules, but the code that is meant to format the file is in just two functions. One adds delimiters and works like this (with Streamlit as st):
def delim(file):
    #read the selected file and write it to variable elems as a string
    elems = file.decode('utf-8')
    #replace the applicable parts of variable elems with the delimiters
    elems = elems.replace('e+002', 'e+002, ')
    elems = elems.replace('e+003', 'e+003, ')
    elems = elems.replace('e+004', 'e+004, ')
    elems = elems.replace('e+005', 'e+005, ')
    elems = elems.replace('e+006', 'e+006, ')
    elems = elems.replace('e+007', 'e+007, ')
    elems = elems.replace('e+008', 'e+008, ')
    elems = elems.replace('e+009', 'e+009, ')
    with open('final_file.txt', 'w') as magma_file:
        #write a txt file with the stored, altered text in variable elems
        magma_file.write(elems)
        #close the writeable file to be safe
        magma_file.close()
    st.success('Delimiters successfully added')
The second part, where I am getting the empty array, is in a second function. The whole function is not necessary to see the issue, but the part that is not working is here:
def addElem(file):
    #create counting variables
    counter = 0
    linecount = 1
    #put file as string in variable checks
    checks = file.decode('utf-8')
    checks.splitlines()
    #check to see if the start of the file is formatted correctly. This is the part giving me strife
    if checks[0].rstrip().endswith('5'):
        with open('final_file.txt', 'w') as ff:
            #iterate through the lines in the file
            for line in checks:
                counter+=1
                # and so on, not relevant to the problem
The variable checks does contain a string after decoding the file, but when I use splitlines() then look inside checks[0], checks[1] etc., they are all empty. I tried commenting out other code, the conditional statement, removing the rstrip() and just seeing what was in the checks array after splitting the string, but it was still nothing. I tried changing splitlines() to split() using various delimiters including \n, but the array remained empty.
This program logic worked perfectly when I was running it locally using a console application interacting directly with the file system, so probably the problem is something to do with how a Streamlit "file like object" works. I read through the docs at Streamlit, but it doesn't give much detail on this.
This program is not for my use, so I can't keep it as a console app. I did ask about this on the Streamlit community a month ago, but so far no one has answered and I am not sure whether it is an unusual problem or just a terrible question.
I am wondering if there is a better way to decode the file to a string, but decoding to unicode doesn't explain the line spacings so I think something else is going on.
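One thing that stands out in addElem is that checks.splitlines() returns a new list and leaves checks unchanged (strings are immutable), so checks[0] is still a single character of the string rather than the first line. The sketch below keeps the returned list; the helper names decode_lines and normalise_newlines and the sample bytes are hypothetical. As for the wide spacing, a possible cause (an assumption, not verified against the asker's file) is Windows \r\n line endings surviving the decode and then being written in text mode, which on Windows turns each \r\n into \r\r\n. Rejoining the split lines with "\n" avoids carrying the \r along.

```python
def decode_lines(raw_bytes, encoding="utf-8"):
    """Decode uploaded bytes and return a list of lines without line endings."""
    text = raw_bytes.decode(encoding)
    # splitlines() returns a NEW list; the result must be kept
    return text.splitlines()


def normalise_newlines(raw_bytes, encoding="utf-8"):
    """Return the decoded text with every line ending rewritten as a single \\n."""
    return "\n".join(decode_lines(raw_bytes, encoding)) + "\n"


# Hypothetical sample resembling a file saved on Windows
sample = b"1.25e+002 5\r\n3.50e+003 7\r\n"
lines = decode_lines(sample)
print(lines[0].rstrip().endswith("5"))
print(repr(normalise_newlines(sample)))
```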

Seeking and deleting elements in lists of a parsed file and saving result to another file

I have a large .txt file that is the result of parsing a C file. It contains various blocks of data, but about 90% of them are useless to me. I'm trying to get rid of them and then save the result to another file, but I am having a hard time doing so. At first I tried to delete all the useless information in the unparsed file, but then it won't parse. My .txt file is built like this:
//Update: The files I'm trying to work on come from the pycparser module, which I found on GitHub.
File before being parsed looks like this:
And after using pycparser:
file_to_parse = pycparser.parse_file(current_directory + r"\D_Out_Clean\file.d_prec")
I want to delete all blocks that start with the word Typedef. The module stores them in one big list that I can access via its attribute.
Currently my code looks like this:
len_of_ext_list = len(file_to_parse.ext)
i = 0
while i < len_of_ext_list:
    if 'TypeDecl' not in file_to_parse.ext[i]:
        print("NOT A TYPEDECL")
        print(file_to_parse.ext[i], type(file_to_parse.ext[i]))
        parsed_file_2 = open(current_directory + r"\Zadanie\D_Out_Clean_Parsed\clean_file.d_prec", "w+")
        parsed_file_2.write("%s%s\n" %("", file_to_parse.ext[i]))
        parsed_file_2.close
    #file_to_parse_2 = file_to_parse.ext[i]
    i+=1
But the above code only saves the last FuncDef from the unparsed file, and I don't know how to change it.
So, now I'm trying to get rid of all typedefs in the parsed file, as they don't hold any valuable information for me. I want to know what function definitions and declarations are in the file, and what types of global variables are stored in the parsed file. I hope this is clearer now.
I suggest reading the entire input file into a string, and then doing a regex replacement:
import re

with open(current_directory + r"\D_Out\file.txt", "r+") as file:
    with open(current_directory + r"\D_Out_Clean\clean_file.txt", "w+") as output:
        data = file.read()
        # strip every "type" block, up to its closing brace or semicolon
        data = re.sub(r'type(?:\n\{.*?\}|[^;]*?;)\n?', '', data, flags=re.S)
        # write the cleaned text
        output.write(data)
Here is a regex demo showing that the replacement logic is working.
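Separately from the regex route, the loop in the question keeps only the last entry because the output file is reopened in "w+" mode on every iteration, which truncates it each time (and parsed_file_2.close without parentheses never actually calls close). A minimal sketch of the alternative, opening the file once and writing every node that is not a Typedef, follows. The stand-in classes and the helper write_non_typedefs are hypothetical; with pycparser the nodes would come from file_to_parse.ext, and comparing the class name is an assumption chosen to keep the sketch free of third-party imports.

```python
import os
import tempfile


class Typedef:
    """Stand-in for a parsed typedef node (hypothetical, for illustration only)."""
    def __str__(self):
        return "Typedef(...)"


class FuncDef:
    """Stand-in for a parsed function definition node (hypothetical)."""
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return "FuncDef(%s)" % self.name


def write_non_typedefs(nodes, path):
    """Write str(node) for every node whose class is not named 'Typedef'."""
    kept = 0
    # Open the file once, outside the loop, so earlier writes are not truncated
    with open(path, "w") as out:
        for node in nodes:
            if type(node).__name__ != "Typedef":
                out.write("%s\n" % node)
                kept += 1
    return kept


nodes = [Typedef(), FuncDef("main"), Typedef(), FuncDef("helper")]
out_path = os.path.join(tempfile.mkdtemp(), "clean_file.txt")
print(write_non_typedefs(nodes, out_path))
```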

Syntax for Conditional Statement With Regex Function

I have written code to parse multiple PDF files and return a line of data from each page. I came across the issue that some of the pages within my PDF files do not have this line. When this happens my code just omits the page entirely; however, I would like it to print a single 'none' for the pages where it cannot find the specified line. I thought this was a simple fix, but it's proving to be a little more complicated than I thought. Here is an example of the line I am pulling and what I have tried:
#pattern I told my code to look for within each page of pdf
sqft_re = re.compile(r'(\d+(sqft)\s+[$]\d+[.]\d+\s+\d{2}/\d{2})')
#this is an example of what the line I want in each page looks like:
'1600sqft $154.98 10/14'
Basically, I want the code to parse every PDF and return the line if it can find it. If it cannot, I want it to return a single 'none' for the page without that line. I collect the lines in a list like so:
lines = []
Here is how I set my for loop to look through each page of my pdf files:
for files in os.listdir(directory):
    if files.endswith(".pdf"):
        with pdfplumber.open(files) as pdf:
            pages = pdf.pages
            for page in pdf.pages:
                text = page.extract_text()
                for line in text.split('\n'):
                    line = sqft_re.search(line)
                    if line:
                        line.group(1)
                        lines.append(line)
Example of output:
lines
'1600sqft $154.98 10/14'
'1450sqft $113.02 07/05'
'90sqft $60.17 05/12'
'3000sqft $500.98 09/20'
This code successfully returns the list of data for pages with the line. However, pages without the line are omitted. Here is what I thought would fix the problem and simply print none for pages without the line:
for files in os.listdir(directory):
    if files.endswith(".pdf"):
        with pdfplumber.open(files) as pdf:
            pages = pdf.pages
            for page in pdf.pages:
                text = page.extract_text()
                for line in text.split('\n'):
                    line = sqft_re.search(line)
                    if line:
                        line.group(1)
                    else:
                        line = 'None'
                    lines.append(line)
However, this did not work: instead of substituting a single 'None' for pages without the value, every line within the PDF page that does not match is now added as 'None'. So I now have a list that looks like this:
lines
'None'
'None'
'None'
'1600sqft $154.98 10/14'
'None'
'None'
'None'
'1450sqft $113.02 07/05' #etc.....
I have tried some other things like calling a different function when it does not match what I am looking for, making my own string to substitute the value with and a couple more. I am still getting the same problem. In my sample pdf there is only one page without this line so my list should look like:
'1600sqft $154.98 10/14'
'1450sqft $113.02 07/05'
'90sqft $60.17 05/12'
'3000sqft $500.98 09/20'
'None'
I am also pretty new to python (R is what I primarily work with) so I am sure I am overlooking something here but any guidance to what I am missing would be appreciated!
You should append the matched text (line.group(1)) to the lines variable, not the match object itself, unless that is your intention.
Besides, you need to set a flag to False before checking each page and, once there is a match, set it to True. If it is still False at the end of the page, add None to the lines.
Here is sample Python code for the loop:
for page in pdf.pages:
    text = page.extract_text()
    found = False  # no match seen on this page yet
    for line in text.split('\n'):
        match = sqft_re.search(line)
        if match:
            found = True
            lines.append(match.group(1))
    if not found:
        lines.append('None')
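The same flag logic can be tried without any PDFs. The sketch below assumes each page's text is already a plain string (for example the return value of page.extract_text()); the function name collect_matches and the sample pages are hypothetical. It also treats missing or empty page text as a page without a match, which is an assumption about how pages with no extractable text should be handled.

```python
import re

sqft_re = re.compile(r'(\d+(sqft)\s+[$]\d+[.]\d+\s+\d{2}/\d{2})')


def collect_matches(page_texts):
    """Return matched lines per page, with one 'None' for each page without a match."""
    lines = []
    for text in page_texts:
        found = False  # reset the flag for every page
        for line in (text or "").split('\n'):
            match = sqft_re.search(line)
            if match:
                found = True
                lines.append(match.group(1))
        if not found:
            lines.append('None')
    return lines


# Hypothetical page texts: the second page lacks the line
pages = [
    "Header\n1600sqft $154.98 10/14\nFooter",
    "Header\nnothing useful here\nFooter",
    "90sqft $60.17 05/12",
]
print(collect_matches(pages))
```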

Python write to file - New line

I am trying to write Python code that writes to a file, and I want each f.write call to start a new line. I thought \n would do it, but right now everything appears on one line. What am I doing wrong?
Code (localtime, sum and value are variables):
f = open('/var/www/html/index.html','w')
f.write("<font size='35px'>"+sum+"</font>")
f.write('\n'+localtime)
f.write('\n Sist oppdatert '+value)
f.close()
Browsers treat newline characters in HTML source as ordinary whitespace, so use <br/> tags for HTML line breaks:
f = open('/var/www/html/index.html','w')
f.write("<font size='35px'>"+sum+"</font>")
f.write('<br/>'+localtime)
f.write('<br/> Sist oppdatert '+value)
f.close()
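The newline characters are in fact written to the file; the browser simply does not render them as line breaks. A small sketch that keeps both, so the source file stays readable and the page shows separate lines, might look like this. The function name build_page and the sample values are hypothetical.

```python
def build_page(total, localtime, value):
    """Return HTML in which the three parts are separated by visible line breaks."""
    parts = [
        "<font size='35px'>" + str(total) + "</font>",
        str(localtime),
        " Sist oppdatert " + str(value),
    ]
    # <br/> breaks the line in the browser; \n keeps the source file readable
    return "<br/>\n".join(parts)


html = build_page(12, "2024-01-01 10:00", "now")
print(html)
```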

Reading in line from txt file in python

I seem to be having trouble reading lines from a text file. When I call f.readline() I can save the result to a string and print the correct text. However, when I print, say, the first or second character of that string, I get a strange dotted, checker-pattern character instead of the correct letter.
Edit: OK, so when I try alfasin's method I seem to get the correct length of each line except for the first line that is read in. If I'm reading in, say, 5 lines and looking for a space, the first line will find the first space at position 13 when it should find it at position 8. However, the following lines all produce the correct length and location of the space.
Edit2: Also the text file I am reading in is UTF-8.
Edit3: Definitely was an issue with the encoding of the text file. I changed it to ANSI and everything started working as it should.
Try the following:
with open('filename.txt') as file:
    for line in file:
        print(line)
        # and if you want to break it down to characters:
        print(list(line))
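The edits in the question point at the file's encoding. One possible explanation (an assumption, not confirmed in the thread) is a byte order mark at the start of the file, which would shift character positions on the first line only. A minimal sketch that reads the file as text with an explicit encoding follows; "utf-8-sig" is Python's UTF-8 codec that drops a leading BOM if one is present, and the helper name read_text_lines and the temporary demo file are hypothetical.

```python
import io
import os
import tempfile


def read_text_lines(path, encoding="utf-8-sig"):
    """Read a text file with an explicit encoding and return its lines."""
    # io.open accepts an encoding argument on both Python 2 and Python 3
    with io.open(path, encoding=encoding) as handle:
        return handle.read().splitlines()


# Hypothetical demo file: UTF-8 text preceded by a byte order mark
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "wb") as handle:
    handle.write(b"\xef\xbb\xbfline one\nline two\n")

for line in read_text_lines(demo_path):
    print(line.find(" "))
```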
