I am trying to figure out a good way to handle blacklists for words via a MySQL database. I have hit a roadblock when it comes to handling the data returned from the database.
cursor.execute('SELECT word FROM blacklist')
blacklist1 = []
for word in cursor.fetchall():
    if word in blacklist1:
        return
    else:
        blacklist1.append(word)
The above code is what I am using to pull the info which I know works. However, I need some help with converting this:
[('word1',), ('word2',), ('word3',), ('word4',), ('word5',)]
into this:
['word1', 'word2', 'word3', 'word4', 'word5']
My biggest issue is that it needs to scale: the blacklist may hold anywhere from no words to several thousand, and every word has to be checked. I know a for loop would work for checking them against the message, but I will not be able to check the words until they are in a normal list. Any help would be appreciated.
In each iteration of for word in cursor.fetchall(), the variable word is a tuple, or a collection of values. This is documented here.
These correspond to each column returned, i.e. if you had a second column in your select statement ('SELECT word, replacement FROM blacklist') you would get tuples of two elements.
Use a set, and add the one and only element of the tuple, instead of the tuple itself:
blacklist1 = set()
for word_tuple in cursor.fetchall():
    blacklist1.add(word_tuple[0])
Looking at the code more closely, if word in blacklist1: return may be a logical error - as soon as you see a duplicate, you'll stop reading rows from the database. You were likely looking to just skip that duplicate - you don't actually need that logic anymore because sets automatically remove duplicates.
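A minimal sketch of that idea, using a hard-coded list in place of cursor.fetchall(). The rows, the message, and the lower()/split() tokenising are made-up assumptions for illustration, not part of the original code:

```python
# Sample rows shaped like the output of cursor.fetchall() for a one-column SELECT.
rows = [('word1',), ('word2',), ('word3',), ('word2',)]

# Take the single element of each tuple; the set drops the duplicate 'word2'.
blacklist = {row[0] for row in rows}

# Hypothetical message to check; lower() and split() are simplifying assumptions.
message = "this has word3 in it"
hits = [w for w in message.lower().split() if w in blacklist]

print(sorted(blacklist))
print(hits)
```

Because membership tests on a set take roughly constant time, the same check should behave the same way whether the blacklist is empty or holds several thousand words.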
Your list currently contains one element tuples. If you want to extract the strings you could try this:
blacklist1 = []
for word_tuple in cursor.fetchall():
    if word_tuple[0] in blacklist1:
        return
    else:
        blacklist1.append(word_tuple[0])
For your use case you might also benefit from having blacklist1 be a set, that way you can check for membership in O(1) time:
blacklist1 = set()
for word_tuple in cursor.fetchall():
    if word_tuple[0] in blacklist1:
        return
    else:
        blacklist1.add(word_tuple[0])
First, your actual problem is that the cursor is a wrapper of an iterator over rows returned from MySQL, so it can be operated on similarly to a list of tuples. That being said, my advice would be to split your "business" logic from your data access logic. This might seem trivial but it will make debugging much easier. The overall approach will look like this:
def get_from_database():
    cursor.execute('SELECT word FROM blacklist')
    return [row[0] for row in cursor.fetchall()]

def get_blacklist():
    words = get_from_database()
    return list(set(words))
In this approach, get_from_database retrieves all the words from MySQL and returns them in the format your program needs. get_blacklist encapsulates this logic and also makes the returned list unique. So now, if there's a bug, you can verify each independently.
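To try the two-function split without a MySQL server, here is a rough, self-contained sketch. It makes two changes that are assumptions of this example rather than part of the answer: an in-memory SQLite database stands in for MySQL, and the cursor is passed in as an argument instead of being a global. It also returns a set instead of list(set(...)), since a set is what membership checks want.

```python
import sqlite3

def get_from_database(cursor):
    """Data access only: return the first column of every row as a flat list."""
    cursor.execute('SELECT word FROM blacklist')
    return [row[0] for row in cursor.fetchall()]

def get_blacklist(cursor):
    """Business logic only: deduplicate the words for fast membership tests."""
    return set(get_from_database(cursor))

# In-memory SQLite stands in for MySQL here (an assumption for the demo).
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE blacklist (word TEXT)')
cur.executemany('INSERT INTO blacklist (word) VALUES (?)',
                [('word1',), ('word2',), ('word1',)])

blacklist = get_blacklist(cur)
print(sorted(blacklist))
conn.close()
```

Each function can now be checked on its own: get_from_database against a known table, and get_blacklist against a known list of words.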
This seems like a pretty rudimentary question, but I'm wondering because the items in these lists change every so often when a website is scraped...
employees = ['leadership(x)', 'drivers(y)', 'trainers(z)']
Where x,y,z are the number of employees in those specific roles, and are the values that change every so often.
If I know that the strings will always be 'leadership' 'drivers' and 'trainers', just with a difference in what's in between the parentheses, how can I dynamically remove these strings without having to hardcode it every week that I run the program?
The obvious but not so successful solution is...
employees = ['leadership(x)', 'drivers(y)', 'trainers(z)']
unwanted = ['leadership(x)', 'drivers(y)', 'trainers(z)']
for i in unwanted:
    if i in employees:
        employees.remove(i)
This of course fails because the values are hardcoded and the values are bound to change, any help with this would be greatly appreciated!
You could do something like
unwanted_prefixes = ['leadership', 'drivers', 'trainers']
unwanted = [s for s in employees if s.split('(')[0] in unwanted_prefixes]
This will make the list of things to delete contain any string beginning with those 3 prefixes and either containing nothing else or immediately followed by a parenthesis.
If that version deletes strings you want to keep, a stricter solution follows roughly the same idea but uses a regex:
import re
unwanted_re = re.compile(r'(leadership|drivers|trainers)\(\d+\)')
unwanted = [x for x in employees if unwanted_re.fullmatch(x)]
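Both answers build the list of unwanted strings; the removal itself can be sketched as a filter that keeps everything else. The sample counts and the extra 'interns(4)' entry below are made up for illustration:

```python
import re

# Sample data; the counts and the extra 'interns(4)' entry are made up.
employees = ['leadership(3)', 'drivers(12)', 'interns(4)', 'trainers(7)']

# Match a known role name followed by a parenthesised number, and nothing else.
unwanted_re = re.compile(r'(leadership|drivers|trainers)\(\d+\)')

# Build a new list instead of calling remove() while iterating.
remaining = [e for e in employees if not unwanted_re.fullmatch(e)]
print(remaining)
```

Since the pattern matches any number inside the parentheses, nothing needs to be edited when the counts change from week to week.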
Trying to understand the docs has been very difficult in relation to trying to understand how to do a simple scan in AWS DynamoDB.
Can someone please explain to me in simple terms how to do a basic scan?
What is a Scan?
The Scan operation returns one or more items and item attributes by accessing every item in a table or a secondary index.
Explanation
A scan operation in its simplest form looks through everything in your table. Most of the time, you probably don't need the whole table to be returned or even looked at. As a result, many people use filters to cut down on what is looked through, processed and returned.
How do I Scan?
Here is a simple scan operation in Python. Even if you aren't using Python, this guide will be very helpful.
# Table = 'grades'
# Year_levels = {0-12}
# Sort_key = overall_rank
# Attribute_categories = math, english, science | out of 100
import boto3
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table("grades")
result = table.scan(
    FilterExpression='math > :math AND english > :eng',
    ExpressionAttributeValues={':math': 80, ':eng': 70},
    Select='SPECIFIC_ATTRIBUTES',
    ProjectionExpression='year_level,overall_rank,math,english',
    Limit=50,  # This is the number of items to SCAN, not necessarily RETURN.
)
# return or print result
Explanation
FilterExpression and ExpressionAttributeValues. There are multiple ways of understanding how these work. One way is to see them as an item attribute value checker. In other words, for every item the scan goes through, the conditions the filter places on its attributes must be true for the item to be returned, e.g. a math score above 80 and an english score above 70.
Select and ProjectionExpression. In technical terms, the way I explain this is incorrect; in practical terms, however, this way of understanding holds up: you can see that there is a SECOND filter, not for the item, but for the ATTRIBUTES of the item that will be returned, e.g. I only want year_level, overall_rank, math and english to be returned, but not science.
Now if we combine the two we have an example: If an item is checked and matches the criteria placed upon it by the FilterExpression, it will be returned. HOWEVER, we only want SPECIFIC_ATTRIBUTES to be returned. At this point, the item will then be checked AGAIN against, this time, the Select criteria. The select criteria tells you what attributes FROM the item to return.
Limit is the maximum number of items to check through, which is not necessarily the number of items returned.
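Because Limit caps the items checked rather than the items returned (and a single scan call also stops after reading about 1 MB of data), one call may return only part of the matches. When more of the table remains, the response contains a LastEvaluatedKey, which is passed back as ExclusiveStartKey on the next call. A minimal sketch of that loop follows; scan_all and FakeTable are hypothetical names introduced here, and FakeTable exists only so the example runs without AWS credentials:

```python
def scan_all(table, **scan_kwargs):
    """Collect items from every page of a scan.

    A scan response holds 'Items' and, when more data remains, a
    'LastEvaluatedKey' that is passed back as 'ExclusiveStartKey'.
    """
    items = []
    kwargs = dict(scan_kwargs)
    while True:
        response = table.scan(**kwargs)
        items.extend(response.get('Items', []))
        last_key = response.get('LastEvaluatedKey')
        if last_key is None:
            return items
        kwargs['ExclusiveStartKey'] = last_key


class FakeTable:
    """Hypothetical stand-in for a boto3 Table so the sketch runs offline."""

    def __init__(self, pages):
        self._pages = pages

    def scan(self, **kwargs):
        # The fake uses a plain page index as its "key"; real keys are dicts.
        index = kwargs.get('ExclusiveStartKey', 0)
        response = {'Items': self._pages[index]}
        if index + 1 < len(self._pages):
            response['LastEvaluatedKey'] = index + 1
        return response


# The middle page is empty on purpose: a page can match nothing while
# later pages still hold results.
fake = FakeTable([[{'math': 90}], [], [{'math': 85}]])
all_items = scan_all(fake)
print(all_items)
```

With a real table, the same function would be called as scan_all(table, FilterExpression=..., ExpressionAttributeValues=...), using the arguments from the example above.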
I've searched around and most of the errors I see are when people are trying to iterate over a list and modify it at the same time. In my case, I am trying to take one list, and remove items from that list that are present in a second list.
import pymysql
schemaOnly = ["table1", "table2", "table6", "table9"]
db = pymysql.connect(my connection stuff)
tables = db.cursor()
tables.execute("SHOW TABLES")
tablesTuple = tables.fetchall()
tablesList = []
# I do this because there is no way to remove items from a tuple
# which is what I get back from tables.fetchall
for item in tablesTuple:
    tablesList.append(item)
for schemaTable in schemaOnly:
    tablesList.remove(schemaTable)
When I put various print statements in the code, everything looks proper and like it is going to work. But when it gets to the actual tablesList.remove(schemaTable) I get the dreaded ValueError: list.remove(x): x not in list.
If there is a better way to do this I am open to ideas. It just seemed logical to me to iterate through the list and remove items.
Thanks in advance!
** Edit **
Everyone in the comments and the first answer is correct. The reason this is failing is because the conversion from a Tuple to a list is creating a very badly formatted list. Hence there is nothing that matches when trying to remove items in the next loop. The solution to this issue was to take the first item from each Tuple and put those into a list like so: tablesList = [x[0] for x in tablesTuple] . Once I did this the second loop worked and the table names were correctly removed.
Thanks for pointing me in the right direction!
I assume that fetchall returns tuples, one for each database row matched.
Now the problem is that the elements in tablesList are tuples, whereas schemaTable contains strings. Python does not consider these to be equal.
Thus when you attempt to call remove on tablesList with a string from schemaTable, Python cannot find any such value.
You need to inspect the values in tablesList and find a way to convert them to strings. I suspect it would be by simply taking the first element out of the tuple, but I do not have a MySQL database at hand so I cannot test that.
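The mismatch can be reproduced without a database. In this sketch the table names are made up and a literal tuple of rows stands in for the result of fetchall():

```python
# Rows shaped like the result of fetchall() after SHOW TABLES (made-up names).
tablesTuple = (('table1',), ('table2',), ('table3',), ('table6',))
schemaOnly = ['table1', 'table2', 'table6']

# A one-element tuple never equals the string it contains.
print(('table1',) == 'table1')

# Unwrap each row first; then remove() can find the names.
tablesList = [row[0] for row in tablesTuple]
for schemaTable in schemaOnly:
    if schemaTable in tablesList:  # guard against names missing from the database
        tablesList.remove(schemaTable)
print(tablesList)
```

The membership guard is an addition of this sketch: without it, a name in schemaOnly that does not exist in the database would raise the same ValueError.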
Regarding your question, if there is a better way to do this: Yes.
Instead of adding items to the list, and then removing them, you can append only the items that you want. For example:
for item in tablesTuple:
    # each row is a one-element tuple, so compare its first element
    if item[0] not in schemaOnly:
        tablesList.append(item[0])
Also, schemaOnly can be written as a set, to improve search complexity from O(n) to O(1):
schemaOnly = {"table1", "table2", "table6", "table9"}
This will only be meaningful with big lists, but in my experience it's useful semantically.
And finally, you can write the whole thing in one list comprehension:
tablesList = [item[0] for item in tablesTuple if item[0] not in schemaOnly]
And if you don't need to keep repetitions (or if there aren't any in the first place), you can also do this:
tablesSet = {item[0] for item in tablesTuple} - schemaOnly  # schemaOnly must be a set here
This also has the best big-O complexity of all these variations.
I would like to loop through a database, find the appropriate values and insert them in the appropriate cell in a separate file. It may be a CSV, or any other human-readable format.
In pseudo-code:
for item in huge_db:
    for list_of_objects_to_match:
        if itemmatch():
            if there_arent_three_matches_yet_in_list():
                matches++
                result=performoperationonitem()
                write_in_file(result, row=object_to_match_id, col=matches)
                if matches is 3:
                    remove_this_object_from_object_to_match_list()
Can you think of any way other than going through the whole output file line by line every time?
I don't even know what to search for...
Even better, are there better ways to find three matching objects in a db and have the results in real time? (The operation will take a while, but I'd like to see the results popping out in real time.)
Assuming itemmatch() is a reasonably simple function, this will do what I think you want better than your pseudocode:
for match_obj in list_of_objects_to_match:
    db_objects = query_db_for_matches(match_obj)
    if len(db_objects) >= 3:
        result=performoperationonitem()
        write_in_file(result, row=match_obj.id, col=matches)
    else:
        write_blank_line(row=match_obj.id) # if you want
Then the trick becomes writing the query_db_for_matches() function. Without detail, I'll assume you're looking for objects that match in one particular field, call it type. In pymongo such a query would look like:
def query_db_for_matches(match_obj):
    return pymongo_collection.find({"type":match_obj.type})
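To get this to run efficiently, make sure your database has an index on the field(s) you're querying on by first creating one. The original call here was ensure_index, which was deprecated in PyMongo 3.0 and removed in 4.0; create_index is the current equivalent:
pymongo_collection.create_index([("type", 1)])
The first call could take a long time for a huge collection. Once the index exists, repeating the call is cheap because there is nothing left to build, although each call is still a round trip to the server, so it is simplest to create the index once before the loop rather than inside query_db_for_matches.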
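On the original question of avoiding a line-by-line walk through the output file: one option is to collect the matches in memory during a single pass and write each row once at the end. The sketch below is database-agnostic and rests on assumptions: first_matches is a hypothetical helper, the records are made-up dictionaries, and matching is done on a single type field as in the answer above.

```python
import csv
import io

def first_matches(items, targets, key, limit=3):
    """Single pass over items; keep at most `limit` matches per target."""
    found = {t: [] for t in targets}
    pending = set(targets)
    for item in items:
        t = key(item)
        if t in pending:
            found[t].append(item)
            if len(found[t]) == limit:
                pending.discard(t)   # stop looking for this target
                if not pending:
                    break            # every target is satisfied
    return found

# Made-up records standing in for the database.
items = [{'type': 'a', 'v': 1}, {'type': 'b', 'v': 2}, {'type': 'a', 'v': 3},
         {'type': 'a', 'v': 4}, {'type': 'a', 'v': 5}]
found = first_matches(items, ['a', 'b'], key=lambda item: item['type'])

# One row per target: its id followed by up to three result columns.
buffer = io.StringIO()
writer = csv.writer(buffer)
for target, matches in found.items():
    writer.writerow([target] + [m['v'] for m in matches])
print(buffer.getvalue())
```

io.StringIO is used only to keep the example self-contained; passing a file opened with open(path, 'w', newline='') to csv.writer would write the same rows to disk.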
I'm trying to add some records into a dictionary.
Initially I was doing it this way
licenses = [dict(licenseid=row[0], client=row[1], macaddress=row[2], void=row[18]) for row in db]
But I've since realized I need to do some processing to filter records from db, so I tried changing the code to:
for rec in db:
    if rec['deleted'] == False:
        licenses.update(dict(licenseid=row[0], client=row[1], macaddress=row[2], void=row[18]))
That code runs without exceptions, but I only end up with the last db record in licenses, which is confusing me.
I think licenses is a list:
licenses = []
...
and you should append to it new dictionaries:
licenses.append(dict(...))
If I understand correctly, you want to add multiple records in a single dictionary, right? Instead of making a list of dictionaries, why not make a dictionary of lists instead?
Start by building a list of the keys you'll need (so that you always access them in the same order).
keys = ["licenseid", "client", "macaddress", "void"]
Construct a dictionary with an empty list for each key:
licenses = dict((k, []) for k in keys)
Then loop over the rows and append each field to the matching list:
for row in db:
    values = (row[0], row[1], row[2], row[18])
    for k, item in zip(keys, values):
        licenses[k].append(item)
Of course, it might be easier to build a list of all your records first, and then construct a dictionary at the very end.
Quoth the dict.update() documentation:
update([other]) Update the dictionary with the key/value pairs from
other, overwriting existing keys. Return None.
This explains why the last update "wins". licenses cannot be a list, as lists have no update method.
If the code in your post is your genuine code, then consider replacing row with rec in the last line (the one with the update); otherwise you may be updating your dictionary with the same values every time!
Edit: There's obviously something very wrong in this code. From the other answer I see that I overlooked the fact that licenses was declared as a list. The only explanations for not getting an exception are that the snippets you show are not the genuine code, or that every record has rec['deleted'] set to True (so that the update method is never called).
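The difference between the two approaches can be seen in a short sketch. The rows below are made-up tuples (with the deleted flag in the last position), not the asker's real schema:

```python
# Made-up rows; the last field marks deleted records.
db = [(1, 'acme', 'aa:bb', False), (2, 'globex', 'cc:dd', True),
      (3, 'initech', 'ee:ff', False)]

# update() on one dict overwrites the same keys each time.
single = {}
for row in db:
    single.update(dict(licenseid=row[0], client=row[1]))
print(single)

# append() on a list keeps one dict per record.
licenses = []
for row in db:
    if not row[3]:
        licenses.append(dict(licenseid=row[0], client=row[1]))
print(licenses)
```

After the first loop only the values from the last row survive, which matches the symptom described in the question.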
After responses, I've amended my code:
licenses = []
for row in db:
    if row.deleted == False:
        licenses.append(dict(licenseid=row[0], client=row[1], macaddress=row[2], void=row[18]))
Which now works perfectly. Thanks for spotting my stupidity! ;)