Python BeautifulSoup is unnecessarily slow


While this code runs pretty fast:
for olay in soup("li", {"class":"textb"}):
    tanim = olay("strong")
    try:
        print tanim[0]
    except IndexError:
        pass
Getting the string property like this makes the code considerably slower:
for olay in soup("li", {"class":"textb"}):
    tanim = olay("strong")
    try:
        print tanim[0].string
    except IndexError:
        pass
My question is: am I doing something I shouldn't by getting the string property like that? Should I have used something else to get the plain-text version of an object?
Update:
This also runs pretty fast, so I guess the slowness is unique to the string property?
for olay in soup("li", {"class":"textb"}):
    tanim = olay("strong")
    try:
        print tanim[0].text
    except IndexError:
        pass

If you just want to print the string representation of tanim[0], you should just do print str(tanim[0]). Also, do a dir(tanim[0]) to see whether it has a property called string at all.
for olay in soup("li", {"class":"textb"}):
    tanim = olay("strong")
    try:
        print str(tanim[0])
    except IndexError:
        pass
For everyone to provide a better answer, you could also post the target HTML or the URI and mention which bit you are trying to extract out of it.

Related

Python string indices must be integers

I'm reading a Dictionary from an API which has a field called 'price'.
I'm reading it fine for a while (so, the code works) until I get to a point I get the error message: string indices must be integers.
That breaks my code.
So, I would like to find a way to skip it (ignore it) when this happens and continue with the code, just printing something out so I know it happened.
So far I haven't managed to see which value is causing this error.
If I test this by itself, it works fine.
fill = {'price': 0.00002781 }
price = fill['price'] # OUTPUT: string indices must be integers
print(price)
I've tried many things:
from decimal import Decimal
price = Decimal(fill['price'])
also:
price = int(fill['price']) # but it's not really an int
and:
price = float(fill['price']) # but sometimes it's a very big float so I need decimal

It seems that what you get from the API is not exactly what you expect:
The variable fill is a string (at least at the time you get the error).
As strings can't have string indices (like dictionaries can) you get the TypeError exception.
To handle the exception and troubleshoot it, you can use try-except, like so:
try:
    price = fill['price']
except TypeError as e:
    print(f"fill: {fill}, exception: {str(e)}")
This way, when there is an issue, the fill value will be printed as well as the exception.

string indices must be integers tells you that at some point during runtime fill is a str instead of a dict. I suggest that you add type checking or an assertion to your program to make sure fill is of the expected type.
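A minimal sketch of the type check suggested above, assuming that a fill which is not a dict should be reported and skipped. The helper name extract_price is hypothetical, and going through str() before Decimal is just one way to avoid binary float noise; the original post does not specify either:

```python
from decimal import Decimal


def extract_price(fill):
    """Return fill['price'] as a Decimal, or None if fill is not usable.

    extract_price is a hypothetical helper, not part of any API.
    """
    if not isinstance(fill, dict):
        # The API returned something else (for example an error string).
        print(f"Skipping unexpected fill of type {type(fill).__name__}: {fill!r}")
        return None
    if "price" not in fill:
        print(f"Skipping fill without a price: {fill!r}")
        return None
    # Go through str() so the Decimal reflects the printed value of the float.
    return Decimal(str(fill["price"]))
```

The caller can then test the result against None and carry on with the next fill instead of crashing.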

If you want to just ignore it you could use try and except blocks.
try:
    price = fill['price']
except Exception as e:
    print(f"Error reading the price. Error: {e}")

Python try-except-except

I'm going to include the description of the task this code is supposed to do, in case someone needs it to answer me.
#Write a function called "load_file" that accepts one
#parameter: a filename. The function should open the
#file and return the contents.
#
# - If the contents of the file can be interpreted as
# an integer, return the contents as an integer.
# - Otherwise, if the contents of the file can be
# interpreted as a float, return the contents as a
# float.
# - Otherwise, return the contents of the file as a
# string.
#
#You may assume that the file has only one line.
#
#Hints:
#
# - Don't forget to close the file when you're done!
# - Remember, anything you read from a file is
# initially interpreted as a string.
#Write your function here!
def load_file(filename):
    file=open(filename, "r")
    try:
        return int(file.readline())
    except ValueError:
        return float(file.readline())
    except:
        return str(file.readline())
    finally:
        file.close()
#Below are some lines of code that will test your function.
#You can change the value of the variable(s) to test your
#function with different inputs.
#
#If your function works correctly, this will originally
#print 123, followed by <class 'int'>.
contents = load_file("LoadFromFileInput.txt")
print(contents)
print(type(contents))
When the code is tested with a file which contains "123", everything works fine. When the website loads in another file to test this code, the following error occurs:
[Executed at: Sat Feb 2 7:02:54 PST 2019]
We found a few things wrong with your code. The first one is shown below, and the rest can be found in full_results.txt in the dropdown in the top left:
We tested your code with filename = "AutomatedTest-uwixoW.txt". We expected load_file to return the float -97.88285. However, it instead encountered the following error:
ValueError: could not convert string to float:
So I'm guessing the error occurs inside the first except statement, but I don't understand why. If an error occurs when the value inside a file is being converted to float, shouldn't the code just go to the second except statement? And in the second except it would be converted to string, which will work anyway? I'm guessing I misunderstand something about how try-except(specified error)-except(no specified error) works.
Sorry for the long post.

shouldn't the code just go to the second except statement?
Nope: the except branches of this "flat" try/except statement only guard the try block itself. If an exception occurs there, the first matching except branch catches it and runs its block. If another exception occurs inside that except block, nothing catches it, because that block is not itself wrapped in a try.
So, you'd have to do a whole lot of nested try/except statements:
try:
    do_this()
except ValueError:
    try:
        do_that()
    except ValueError:
        do_this()
    except:
        do_that_one()
except:
    # a whole bunch of try/except here as well
    pass
You may need to add an extra level of nesting.
This is terribly inefficient in terms of the amount of code you'll need to write. A better option might be:
data = file.readline()
for converter in (int, float, str):
    try:
        return converter(data)
    except ValueError:
        pass
Note that if you do converter(file.readline()), a new line will be read on each iteration (or, in your case, in any new try/except block), which may not be what you need.

No, only one of those except blocks -- the first one matching the exception -- will be executed. The behavior you are describing would correspond to
except ValueError:
    try:
        return float(file.readline())
    except:
        return str(file.readline())
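To see the difference in isolation, here is a small sketch with no file involved. The function names flat_handlers and nested_handlers are hypothetical and only serve the comparison:

```python
def flat_handlers(text):
    try:
        return int(text)
    except ValueError:
        # A failure here is NOT passed on to the bare except below.
        return float(text)
    except:
        return str(text)


def nested_handlers(text):
    try:
        return int(text)
    except ValueError:
        try:
            return float(text)
        except ValueError:
            # Only now is the string fallback reached.
            return str(text)
```

Calling flat_handlers("abc") lets the ValueError from float() escape to the caller, while nested_handlers("abc") returns the string.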

def load_file(filename):
    file=open(filename, "r")
    try:
        val = file.readline()
        return int(val)
    except ValueError:
        try:
            return float(val)
        except:
            return str(val)
    finally:
        file.close()

Iterate through list and if value doesn't exist hide error and continue

I've got a list like:
results = ['SDV_GAMMA','SDV_BETA','...','...']
and then comes a for loop like:
for i in range (len(results)):
    a = instance.elementSets[results[i]]
The strings defined in the results list are part of a *.odb result file, and if they don't exist there, an error is raised.
I would like my program not to stop because of an error. It should go on and check whether the other result values exist.
That way I don't have to sort every result before I start my program. If it's not in the list, there is no problem, and if it exists I get my data.
I hope you know what I mean.

You can use a try..except block.
Ex:
for name in results:
    try:
        a = instance.elementSets[name]
    except KeyError:  # assuming a missing set raises KeyError
        pass

You can simply check the presence of results[i] in instance.elementSets before extracting it.
If instance.elementSets is a dictionary, use the dict.get method.
https://docs.python.org/3/library/stdtypes.html#dict.get
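A short sketch of both suggestions. A plain dict named element_sets stands in for instance.elementSets here, and its contents are made up for illustration; whether the real object supports in and get the way a dict does is an assumption to verify:

```python
# Hypothetical stand-in for instance.elementSets.
element_sets = {"SDV_GAMMA": [1, 2, 3]}
results = ["SDV_GAMMA", "SDV_BETA"]

# Option 1: check for presence before extracting.
found = {}
for name in results:
    if name in element_sets:
        found[name] = element_sets[name]

# Option 2: dict.get returns None (or a given default) for a missing key.
missing = []
for name in results:
    if element_sets.get(name) is None:
        missing.append(name)
```

Neither loop raises for a missing name, so the program keeps going through the rest of the list.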

Python: Append a parsed string but throw out non-compliant values?

Warning: I'm a total newbie; apologies if I didn't search for the right thing before submitting this question. I found lots on how to ignore errors, but nothing quite like what I'm trying to do here.
I have a simple script that I'm using to grab data off a database, parse some fields apart, and re-write the parsed values back to the database. Multiple users are submitting to the database according to a delimited template, but there is some degree of non-compliance, meaning sometimes the string won't contain all/any delimiters. My script needs to be able to handle those instances by throwing them out entirely.
I'm having trouble throwing out non-compliant strings, rather than just ignoring the errors they raise. When I've tried try-except-pass, I've ended up getting errors when my script attempts to append parsed values into the array I'm ultimately writing back to the db.
Originally, my script said:
def parse_comments(comments):
    parts = comments.split("||")
    if len(parts) < 20:
        raise ValueError("Comment didn't have enough || delimiters")
    return Result._make([parts[i].strip() for i in xrange(2, 21, 3)])
Fully compliant uploads would append Result to an array and write back to db.
I've tried try/except:
def parse_comments(comments):
    parts = comments.split("||")
    try:
        Thing._make([parts[i].strip() for i in xrange(2, 21, 3)])
    except:
        pass
    return Thing
But I end up getting an error when I try to append the parsed values to an array -- specifically TypeError: 'type' object has no attribute '__getitem__'
I've also tried:
def parse_comments(comments):
    parts = comments.split("||")
    if len(parts) >= 20:
        Thing._make([parts[i].strip() for i in xrange(2, 21, 3)])
    else:
        pass
    return Thing
but to no avail.
tl;dr: I need to parse stuff and append parsed items. If a string can't be parsed how I want it, I want my code to ignore that string entirely and move on.

But I end up getting an error when I try to append the parsed values to an array -- specifically TypeError: 'type' object has no attribute '__getitem__'
Because Thing means the Thing class itself, not an instance of that class.
You need to think more clearly about what you want to return when the data is invalid. It may be the case that you can't return anything directly usable here, so that the calling code has to explicitly check.
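One way to make the caller check explicitly is to return None for non-compliant strings and skip those when collecting results. This is only a sketch: the field names of Result are hypothetical because the question never shows its definition, range replaces xrange so it runs on Python 3, and the length check uses 21 because the highest index read is 20:

```python
from collections import namedtuple

# Hypothetical field names; the original Result definition is not shown.
Result = namedtuple(
    "Result",
    ["field_1", "field_2", "field_3", "field_4", "field_5", "field_6", "field_7"],
)


def parse_comments(comments):
    parts = comments.split("||")
    if len(parts) < 21:
        # Non-compliant string: signal it with None instead of raising.
        return None
    return Result._make(parts[i].strip() for i in range(2, 21, 3))


def parse_all(all_comments):
    parsed = []
    for comments in all_comments:
        result = parse_comments(comments)
        if result is not None:  # throw out non-compliant strings entirely
            parsed.append(result)
    return parsed
```

Here parse_comments returns an instance (or None), never the class itself, which avoids the TypeError above.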

I am not sure I understand everything you want to do. But I think you are not catching the error at the right place. You said yourself that it arose when you wanted to append the value to an array. So maybe you should do:
try:
    results.append(parsed)  # append the parsed values to an array
except TypeError:
    pass
You should give the exception type to catch after except; otherwise it will catch any exception, even a user's CTRL+C, which raises a KeyboardInterrupt.
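The sketch below illustrates that last point. All names in it are hypothetical, and the KeyboardInterrupt is raised by hand to simulate CTRL+C:

```python
class Thing:
    pass


def run_ignoring_type_error(action):
    """Call action() and swallow TypeError only (hypothetical helper)."""
    try:
        action()
    except TypeError as exc:
        return f"ignored TypeError: {exc}"
    return "ok"


def index_the_class():
    # Indexing the class itself, not an instance, raises TypeError.
    return Thing[0]


def simulate_ctrl_c():
    raise KeyboardInterrupt()
```

run_ignoring_type_error(index_the_class) returns normally, while run_ignoring_type_error(simulate_ctrl_c) lets the KeyboardInterrupt propagate. A bare except: would have swallowed it too.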

Python trying to write and read class from file but something went horribly wrong

Considering this is only for my homework I don't expect much help but I just can't figure this out and honestly I can't get my head around what's going wrong. Usually I have an idea where the problem is but now I just don't get it.
Long story short: I'm trying to create a valid-looking telephone number within a class, load it into an array or list, and later save all of them as strings to a file. When I start the program again I want it to read the file, re-create my class and load it back into the list. (Basically a very simple repository.)
The problem is that even though I validate the stored phone number in exactly the same way I validate it as input data ... I get an error which makes no sense.
Another small problem is that when I re-use the data, for some reason it creates white space in the file, which in turn messes my program up badly.
Here I validate phone numbers:
def validateTel(call_ID):
    if isinstance (call_ID, str) == True:
        call_ID = call_ID.replace (" ", "")
        if (len (call_ID) != 10):
            print ("Telephone numbers are 10 digits long")
            return False
        for item in call_ID:
            try:
                int(item)
            except:
                print ("Telephone numbers should contain non-negative digits")
                return False
            else:
                if (int(item) < 0):
                    print ("Digits are non-negative")
After this I use it and other non-relevant (to this discussion) data to create an object (class instance) and move them to a list.
Inside my class I have a load from string and a load to string. What they do is take everything from my class object so I can write it to a file using "+" as a separator so I can use string.split("+") and write it to a file. This works nicely, but when I read it ... well it's not working.
def load_data():
    f = open ("data.txt", "r")
    ch = f.read()
    contact = agenda.contact () # class object
    if ch in (""," ","None"," None"):
        f.close()
        return [] # if the file is empty or has None in some way I pass an empty stack
    else:
        stack = [] # the list where I load all my class objects
        f.seek(0,0)
        for line in f:
            contact.loadFromString(line) # explained below
            stack.append(deepcopy(contact))
        f.close()
        return stack
In loadFromString(line) all I do is validate the line (see if the data inside it at least looks OK).
Now here is the place where I validate the string I just read from the file:
def validateString (load_string):
    string = string.split("+")
    if len (string) != 4:
        print ("System error in loading from file: Program skipping segment of corrupt data")
        return False
    if string[0] == "" or string[0] == " " or string[0] == None or string[0] == "None" or string[0] == " None":
        print ("System error in loading from file: Name field cannot be empty")
    try:
        int(string[1])
    except:
        print("System error in loading from file: ID is not integer")
        return False
    if (validateTel(str(string[2])) == False):
        print ("System error in loading from file: Call ID (telephone number)")
        return False
    return True
Small recap:
I try to load the data from file using loadFromString(). The only relevant thing it does is try to validate my data with validateString(string), and in there the only thing that trips me up is validateTel. But my input data gets validated in the same way my stored data does. They are perfectly identical, yet it gives a "System error". BUT to give such an error it should also have printed an error in the validate sub-program, and it doesn't.
I hope this is enough info, because my program is kind of big (for me anyway); however, the bug should be here somewhere.
I thank anyone brave enough to sift through this mess.
EDIT:
The class is very simple, it looks like this:
class contact:
    def __init__ (self, name = None, ID = None, tel = None, address = None):
        self.__name = name
        self.__id = ID
        self.__tel = tel
        self.__address = address
After this I have a series of setters and getters (to modify contacts and to return parts of the abstract data)
Here I also have my loadFromString and loadToString but those work just fine (except maybe they cause a small jump after each line (an empty line) which then kills my program, but that I can deal with)
My problem is somewhere in the validate or a way the repository interacts with it. The point is that even if it gives an error in the loading of the data, first the validate should print an error ... but it doesn't -_-

You said "I just can't figure this out and honestly I can't get my head around what's going wrong." I think this is a great quote which sums up a large part of programming and software development in general -- dealing with crazy, weird problems and spending a lot of time trying to wrap your head around them.
Figuring out how to turn ridiculously complicated problems into small, manageable problems is the hardest part of programming, but also arguably the most important and valuable.
Here's some general advice which I think might help you:
use meaningful names for functions and variables (validateString doesn't tell me anything about what the function does; string tells me nothing about the meaning of its contents)
break down problems into small, well-defined pieces
specify your data -- what is a phone number? 10 positive digits, no spaces, no punctuation?
document/comment the input/output from functions if it's not obvious
Specific suggestions:
validateTel could probably be replaced with a simple regular expression match
try using json for serialization
if you're using json, then it's easy to use lists. I would strongly recommend this over using + as a separator -- that looks highly questionable to me
Example: using a regex
import re
def validateTel(call_ID):
    phoneNumberRegex = re.compile(r"^\d{10}$") # match a string of 10 digits
    return phoneNumberRegex.match(call_ID)
Example: using json
import json
phoneNumber1, phoneNumber2, phoneNumber3 = ... whatever ...
mylist = [phoneNumber1, phoneNumber2, phoneNumber3]
print json.dumps(mylist)

For starters, don't name your variables after standard modules or built-ins (string is the name of a standard library module, though not actually a reserved keyword). Instead of calling it string, call it telNumber or s or my_string.
def validateString (my_string):
    working_string = my_string.split("+")
    if len (working_string) != 4:
        print ("System error in loading from file: Program skipping segment of corrupt data")
        return False
The next line I don't really get - what is this if chain for? Accounting for bad data or something? Probably better to check for good data; bad data can come in infinite variety.
    if working_string[0] == "" or working_string[0] == " " or working_string[0] == None or working_string[0] == "None" or working_string[0] == " None":
        print ("System error in loading from file: Name field cannot be empty")
    try:
        int(working_string[1])
    except:
        print("System error in loading from file: ID is not integer")
        return False
    if (validateTel(str(working_string[2])) == False):
        print ("System error in loading from file: Call ID (telephone number)")
        return False
    return True
Also, to give you a hint - you may want to look into regular expressions.

Wow, there are many issues here that may be connected to your problem. Also, as I commented, I suspect your problem is with turning the telephone number object into a string.
f is a file object, so it won't be equal to any of those strings. If you want to check whether the file exists, just put try/except around the block that opens the file, like:
try:
    f = open ('data.txt','r') # you could also call c = f.read() and check whether c is empty; not really needed, because iterating over f in the next part already covers an empty file
except IOError:
    return
for line in f:
    pass # all sorts of stuff
return stack
Don't use string as a variable name, and checking for negative numbers is very strange. Is this part of the homework? And why check by converting to int? That could also break your code, since the rest is a string.
All that said, I still suspect your main problem is the way you turn the object into string data. It will never remain an instance of your class unless you use json/pickle/something else to stringify it. An object instance isn't just the class str.
And another thing: keep it simple. Python is (also) about elegant and simple coding, and you are throwing brute force, with everything you know, at a simple problem. Focus, relax and rewrite the program.

I don't follow all of the logic.
For the moment, I can tell you that you should correct the code of load_data() as follows:
def load_data():
    f = open ("data.txt", "r")
    ch = f.read()
    contact = agenda.contact () # class object
    if ch in (""," ","None"," None"):
        f.close()
        return [] # if the file is empty or has None in some way I pass an empty stack
    else:
        stack = [] # the list where I load all my class objects
        f.seek(0,0)
        for line in f:
            contact.loadFromString(line) # explained below
            stack.append(deepcopy(contact))
        f.close()
        return stack
I don't see how the file handle f could ever have a value of None or a string, so I think you want to test the content of the file -> f.read()
But then the file's pointer is at the end and must be moved back to the start -> seek(0,0)
I will add further remarks as I understand more of the problem.
edit
To test whether a file is empty or not:
import os.path
if os.path.isfile(filepath) and os.path.getsize(filepath):
    .........
If the file with path filepath doesn't exist, getsize() raises an error. So the preliminary os.path.isfile() test is necessary to keep the second condition from being evaluated.
But if your file can contain the strings "None" or " None" (really?!), getsize() will return 4 or 5 in that case.
You should avoid working with files that contain this kind of useless data.
edit
In validateTel(), after the instruction if isinstance (call_ID, str) == True: you are sure that call_ID is a string. The iteration for item in call_ID: will then produce only items that are ONE character long, so it's useless to test if (int(item) < 0): because it will never happen. There could be a - sign in the string call_ID, but this last condition won't detect it.
In fact, as you test each character of call_ID, it is enough to test whether it is one of the digits 0,1,2,3,4,5,6,7,8,9. If the - sign is in call_ID, it will be detected as not being a digit.
To test whether all the characters in call_ID are digits, Python provides an easy way: all()
def validateTel(call_ID):
    if isinstance(call_ID, str):
        call_ID = call_ID.replace (" ", "")
        if len(call_ID) != 10:
            print ("A telephone number must be 10 digits long")
            return False
        if not all(c in '0123456789' for c in call_ID):
            print ("Every character in a telephone number must be a digit")
            return False
        return True
    else:
        print ("call_ID must be a string")
        return False
If one of the characters c in call_ID isn't a digit, c in '0123456789' is False and the function all() stops examining the following characters and returns False; otherwise, it returns True.
