Incorrect value from DynamoDB table description and scan count - Python

I'm having a problem with DynamoDB. I'm attempting to verify the data contained within,
but Scan seems to return only a subset of the data. Here is the code I'm using with the Python boto bindings:
#!/usr/bin/python
# Check the scanned length of a table against the table description
import boto.dynamodb

# Connect
TABLENAME = "MyTableName"
sdbconn = boto.dynamodb.connect_to_region(
    "eu-west-1",
    aws_access_key_id='-snipped-',
    aws_secret_access_key='-snipped-')

# Initial scan
results = sdbconn.layer1.scan(TABLENAME, count=True)
previouskey = results['LastEvaluatedKey']

# Create counting variable
count = results['Count']

# DynamoDB scan results are limited to 1MB per call but return a LastEvaluatedKey
# to carry on from for the next MB, so loop until no continuation point is returned
while previouskey != False:
    results = sdbconn.layer1.scan(TABLENAME, exclusive_start_key=previouskey, count=True)
    print(count)
    count = count + results['Count']
    try:
        # get next key
        previouskey = results['LastEvaluatedKey']
    except:
        # no key returned, so that's all folks!
        print(previouskey)
        print("Reached End")
        previouskey = False

# these presumably should match; they don't on the MyTableName table, not even close
print(sdbconn.describe_table(TABLENAME)['Table']['ItemCount'])
print(count)
print(sdbconn.describe_table(TABLENAME)['Table']['ItemCount']) gives me 1748175 and
print(count) gives me 583021.
I was under the impression that these should always match (I'm aware the ItemCount is only updated roughly every six hours), but only about 300 rows have been added in the last 24 hours.
Does anyone know if this is an issue with DynamoDB, or does my code make a wrong assumption?

Figured it out finally: it's to do with Local Secondary Indexes. They show up in the table description as distinct items, and the table has two LSIs, so the ItemCount reports roughly 3x the number of items actually present.
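For anyone checking the same thing with the current SDK, here is a minimal sketch along the lines of the code above, assuming boto3 (the table name and region are placeholders): it compares the DescribeTable ItemCount, which per the explanation above can also reflect LSI items and is refreshed only every few hours, with a paginated COUNT scan of the base table.

# Minimal sketch, assuming boto3; "MyTableName" and the region are placeholders.
import boto3

client = boto3.client("dynamodb", region_name="eu-west-1")

# ItemCount from the table description (updated only every ~6 hours and, as noted
# above, inflated here by the two Local Secondary Indexes)
described = client.describe_table(TableName="MyTableName")
item_count = described["Table"]["ItemCount"]

# Paginated COUNT scan of the base table
scanned = 0
kwargs = {"TableName": "MyTableName", "Select": "COUNT"}
while True:
    page = client.scan(**kwargs)
    scanned += page["Count"]
    last_key = page.get("LastEvaluatedKey")
    if not last_key:
        break
    kwargs["ExclusiveStartKey"] = last_key

print("DescribeTable ItemCount:", item_count)
print("Scan count:", scanned)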

Related

What is the fastest way to generate 1 billion custom unique codes using Python and MariaDB?

This is the code:
import mysql.connector as mariadb
import time
import random

mariadb_connection = mariadb.connect(user='root', password='xxx', database='UniqueCode',
                                     port='3306', host='192.168.xx.xx')
cursor = mariadb_connection.cursor()

FullChar = 'CFLMNPRTVWXYK123456789'  # just the characters I need
total = 5000
count = 10
limmit = 0
count = int(count)
entries = []
uq_id = 0
total_all = 0

def inputDatabase(data):
    try:
        maria_insert_query = "INSERT INTO SN_UNIQUE_CODE(unique_code) VALUES (%s)"
        cursor.executemany(maria_insert_query, data)
        mariadb_connection.commit()
        print("Commiting " + str(total) + " entries..")
    except Exception:
        maria_alter_query = "ALTER TABLE UniqueCode.SN_UNIQUE_CODE AUTO_INCREMENT=0"
        cursor.execute(maria_alter_query)
        print("UniqueCode Increment Altered")

while (0 < 1):
    for i in range(total):
        unique_code = ''.join(random.sample(FullChar, count))
        entry = (unique_code,)  # one-element tuple so executemany binds a single %s
        entries.append(entry)
    inputDatabase(entries)
    #print(entries)
    entries.clear()
    time.sleep(0.1)
Output:
Id unique_code
1 N5LXK2V7CT
2 7C4W3Y8219
3 XR9M6V31K7
The code above runs well and generating the codes is fast. The problem I faced is when the unique_code values stored in the tuples are inserted into MariaDB: to avoid duplicate data, I added a unique index on the unique_code column.
The more data that has already been entered, the more checking each new unique_code requires, which makes inserting into the database slower and slower.
Given that, how can I load 1 billion codes into the database in a short time?
Note: the process slows down once there are more than 150 million unique_codes in the database.
Thanks a lot.
The quick way
If you want to insert many records into the database, you can bulk-insert them as you do now.
I would recommend disabling the keys on the table before inserting, so the unique check is skipped; otherwise you will have a bad time, as @CryptoFool mentioned.
ALTER TABLE SN_UNIQUE_CODE DISABLE KEYS;
<run code>
ALTER TABLE SN_UNIQUE_CODE ENABLE KEYS;
If I were you, I would experiment with the maximum you can insert at once. Try changing the max_allowed_packet variable in MariaDB if necessary.
The table
It seems like your unique_code could be a natural key, so you could drop the auto-increment column; it won't buy much performance, but it is a start.
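Putting these suggestions together, here is a minimal sketch of what the loading loop might look like. It reuses the table and connection details from the question; the batch size, the in-process set used to pre-deduplicate codes, and the 1,000,000 target are illustrative assumptions rather than part of either post (a full billion codes would need a disk-backed or sharded dedup structure).

# Sketch only: bulk-load in batches with index maintenance disabled during the load.
import random
import mysql.connector as mariadb

FullChar = 'CFLMNPRTVWXYK123456789'
BATCH = 10000  # tune together with max_allowed_packet

conn = mariadb.connect(user='root', password='xxx', database='UniqueCode',
                       port='3306', host='192.168.xx.xx')
cur = conn.cursor()

cur.execute("ALTER TABLE SN_UNIQUE_CODE DISABLE KEYS")  # per the answer above
try:
    seen = set()   # cheap in-process dedup so the database sees fewer collisions
    batch = []
    while len(seen) < 1000000:  # illustrative target
        code = ''.join(random.sample(FullChar, 10))
        if code in seen:
            continue
        seen.add(code)
        batch.append((code,))
        if len(batch) >= BATCH:
            cur.executemany("INSERT INTO SN_UNIQUE_CODE(unique_code) VALUES (%s)", batch)
            conn.commit()
            batch.clear()
    if batch:
        cur.executemany("INSERT INTO SN_UNIQUE_CODE(unique_code) VALUES (%s)", batch)
        conn.commit()
finally:
    cur.execute("ALTER TABLE SN_UNIQUE_CODE ENABLE KEYS")  # rebuild indexes once at the end
    conn.close()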

How to update an Azure Table Storage entity with an ETag

Overview: When I upload a blob to blob storage under container/productID(folder)/blobName, an event subscription saves the event in a storage queue. An Azure Function then polls this event and does the following:
1- read from the corresponding table the current Count property (how many blobs are stored under productID(folder)) together with its ETag
2- increase the count by 1
3- write it back to the corresponding table; if the ETag matches, the Count field is updated, otherwise an error is thrown. If an error is thrown, sleep for a while and go back to step 1 (while loop)
4- if the property was successfully increased, return
Scenario: trying to upload five items to blob storage.
Expectation: the Count property in table storage ends up at 5.
Problem: after the first four items are inserted successfully, the code gets stuck in an infinite loop for the fifth item and the Count property keeps increasing forever. I have no idea why this happens; any ideas would be appreciated.
# more code
header_etag = "random-etag"
response_etag = "random-response"

while (response_etag != header_etag):
    sleep(random.random())  # sleep between 0 and 1 second
    header = table_service.get_entity_table(
        client_table, client_table, client_product)
    new_count = header['Count'] + 1
    entity_product = create_product_entity(
        client_table, client_product, new_count, client_image_table)
    header_etag = header['etag']
    try:
        response_etag = table_service1.merge_entity(client_table, entity_product,
                                                    if_match=header_etag)
    except:
        logging.info("race condition detected")
Try implementing your parameters and logic in a while loop like the code below:
val = 0  # initial value of zero
success = False

while True:
    val = val + 1  # incrementing
    CheckValidations = input('Check')  # add validations
    if CheckValidations == "abc123":
        success = True  # this lets us leave the loop
        break
    if val > 5:
        break
    print("Please Try Again")

if success == True:
    print("Welcome!")
Also check the .py file below from the Azure Python SDK samples, which has clear functions for updating, merging, and inserting entities in table storage:
azure-sdk-for-python/sample_update_upsert_merge_entities.py at main · Azure/azure-sdk-for-python (github.com)
Refer to the Microsoft documentation to see how to pass the exact parameters for creating an entity.
For more insight into table samples, check the Azure samples:
storage-table-python-getting-started/table_basic_samples.py at master · Azure-Samples/storage-table-python-getting-started (github.com)

Execute an action based on how many times an element appears in a list

I have a list:
CASE 1 : group_member = ['MEU1', 'MEU1','MEU2', 'MEU1','MEU1','MEU2','MEU1','MEU2','MEU1','MEU3']
CASE 2 : group_member = ['MEU1','MEU2','MEU3','None','None']
CASE 3 : group_member = ['MEU1','MEU2','MEU3','MEU1','CEU1']
What I'm trying to do is insert a value into a SQL table if 70% of the list has the same value, or send a mail to some users if no value reaches 70%.
For the list I have above it would be the first case, because the EU1 value is above 70%.
I tried something like this:
from collections import Counter

freqDict = Counter(group_member)
size = len(group_member)

if len(group_member) > 0:
    for (key, val) in freqDict.items():
        if (val >= (7*size/10)):
            print(">= than 70%")
            insert_into_table(group)
        elif (val < (7*size/10)):
            print("under 70%")
            send_mail_notification(group)
The problem with this is that it checks every key/value combination, so even if one value is >= 70%, the other values will still fall into the elif branch and send mail multiple times for the same group, which is unacceptable, and I haven't found a solution for this yet.
How can I avoid these cases? For the first list it should only insert the value into the table and move on to the next list; for the second list it should send a mail notification only once, because no element reaches 70%.
I need to implement the following cases:
If >= 70% of the elements have the same value (e.g. MEU1 in CASE 1), then insert into a table.
If >= 70% are in the same unit (M) but not the same tribe (as in CASE 3, where 4 of 5 elements contain M and so belong to the same unit), then send a notification.
I believe you should check whether there is at least one item whose count reaches 70%, and only send mail if there is no such value. This means the decision to send mail should be made after you have gone through the whole list.
from collections import Counter

freqDict = Counter(group_member)
size = len(group_member)
foundBigVal = False

if len(group_member) > 0:
    for (key, val) in freqDict.items():
        if (val >= (7*size/10)):
            print(">= than 70%")
            insert_into_table(group)
            foundBigVal = True
            break  # no need to check the list further, since only one value can reach 70%

if not foundBigVal:
    # no value in the list reached 70%, so notify once for the whole group
    print("under 70%")
    send_mail_notification(group)
I put the if outside the loop so that send_mail_notification is called at most once, while every element is still checked inside the loop.
You could try something like this:
from collections import Counter

a = ['EU1', 'EU1', 'EU2', 'EU1', 'EU1', 'EU2', 'EU1', 'EU2', 'EU1', 'EU3']
max_percentage = Counter(a).most_common(1)[0][1] / len(a) * 100
most_common(1)[0] returns a tuple in the form of (most_common_element, its_count). We just extract the count and divide it by the total length of the list to find the percentage.
Note, though, that the percentage of EU1 in the list above is actually 60%, not 70% as you mentioned.
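Tying this together, a short usage sketch (the handle_group wrapper is mine; insert_into_table, send_mail_notification, group_member and group are the names from the question) could branch once per list on that percentage:

from collections import Counter

def handle_group(group_member, group):
    # Decide once per list: insert if the most common value reaches 70%, otherwise mail once.
    if not group_member:
        return
    value, count = Counter(group_member).most_common(1)[0]
    if count / len(group_member) >= 0.7:
        insert_into_table(group)
    else:
        send_mail_notification(group)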

Filter Twitter files by location

I am trying to find lat/long information for numerous tweets. One path to a tweet's lat/long data in a JSON tweet is:
{u'location': {u'geo': {u'coordinates': [120.0, -5.0]}}}
I want to check each tweet for whether this location path exists. If it does, I want to use that information in a function later on. If it does not, I want to check another location path and then move on to the next tweet.
Here is the code I currently have to check whether this path exists and whether there is corresponding data. 'data' is a list of tweets from the Twitter files, which I built using data.append(json.loads(line)).
counter = 0
for line in data:
    if u'coordinates' in data[counter][u'location'][u'geo']:
        print counter, "HAS FIELD"
        counter += 1
    else:
        counter += 1
        print counter, 'no location data'
I get a KeyError with this code. If I just run the code below it works, but it is not specific enough to actually get me to the information that I need.
counter = 0
for line in data:
    if u'location' in data[counter]:
        print counter, "HAS FIELD"
        counter += 1
    else:
        counter += 1
        print counter, 'no location data'
Does anyone have a way to do this?
Below is some more background on what I am doing overall, but the above sums up where I am stuck.
Background: I have access to 12 billion tweets, purchased through Gnip, that are divided up into multiple files. I am trying to comb through those tweets one by one, find which ones have location (lat/long) data, and then see if the corresponding coordinates fall in a certain country. If a tweet does fall in that country, I will add it to a new database which is a subset of my larger database.
I have successfully created a function to test whether a lat/long falls in the bounding box of my target country, but I am having difficulty pulling the lat/long for each tweet, for two reasons: 1) there are multiple places where lat/long data can be stored in each JSON file, if it exists at all, and 2) the tweets are organized in a complex dictionary of dictionaries that I have difficulty maneuvering through.
I need to be able to loop through each tweet, see if a lat/long combination exists in any of the different location paths, and pull it so that I can feed it into my function that tests whether that tweet originated in my country of interest.
I get a KeyError error with this code
Assuming the keys should be in double quotes because they contain a ':
counter = 0
for line in data:
    if "u'coordinates" in data[counter]["u'location"]["u'geo"]:
        print counter, "HAS FIELD"
        counter += 1
    else:
        counter += 1
        print counter, 'no location data'
The solution that I found may not be the most efficient, but it is functional. It uses if statements nested inside try-except blocks, which lets me check different location paths while pushing through KeyErrors so that I can move on to other tweets and pathways. Below is my code. It goes through multiple tweets and checks for ones that have an available lat/long combo in any of 3 pathways. It works with my addTOdb function, which checks whether that lat/long combo is in my target country. It also builds a separate dictionary called Lat_Long where I can view all the tweets that had lat/long combos and which path I pulled them through.
# use try/except to see if an entry is in the json files
# initialize counter that numbers each json entry
counter = 0
# This is a test dict to see what lat/long was selected
Lat_Long = {}

for line in data:
    TweetLat = 0
    TweetLong = 0
    # variable that will indicate which path was used for the coordinate lat/long
    CoordSource = 0
    # Sets while variable to False. Will change if coords are found.
    GotCoord = False
    while GotCoord == False:
        # check 1st path using the geo field
        try:
            if u'coordinates' in data[counter][u'geo'] and GotCoord == False:
                TweetLat = data[counter][u'geo'][u'coordinates'][0]
                TweetLong = data[counter][u'geo'][u'coordinates'][1]
                #print 'TweetLat', TweetLat
                print counter, "HAS FIELD"
                addTOdb(TweetLat, TweetLong, North, South, East, West)
                CoordSource = 1
                GotCoord = True
        except KeyError:
            pass
        # check 2nd path using gnip info
        try:
            if u'coordinates' in data[counter][u'gnip'][u'profileLocations'][0][u'geo'] and GotCoord == False:
                TweetLat = data[counter][u'gnip'][u'profileLocations'][0][u'geo'][u'coordinates'][1]
                TweetLong = data[counter][u'gnip'][u'profileLocations'][0][u'geo'][u'coordinates'][0]
                print counter, "HAS FIELD"
                addTOdb(TweetLat, TweetLong, North, South, East, West)
                CoordSource = 2
                GotCoord = True
        except KeyError:
            pass
        # check 3rd path using location polygon info
        try:
            if u'coordinates' in data[counter][u'location'][u'geo'] and GotCoord == False:
                TweetLat = data[counter][u'location'][u'geo'][u'coordinates'][0][0][1]
                TweetLong = data[counter][u'location'][u'geo'][u'coordinates'][0][0][0]
                print counter, "HAS FIELD"
                addTOdb(TweetLat, TweetLong, North, South, East, West)
                CoordSource = 3
                GotCoord = True
        except KeyError:
            pass
        if GotCoord == True:
            Lat_Long[counter] = [CoordSource, TweetLat, TweetLong]
        else:
            print counter, "no field"
            GotCoord = True
    counter += 1
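As a follow-up, the three near-identical try/except blocks above could be collapsed with a small helper that walks a key path and returns None when any step is missing. This is only a sketch; get_path, PATHS and find_lat_long are my names, and the key paths are copied from the code above.

def get_path(record, path):
    # Walk a list of keys/indexes, returning None if any step is missing.
    current = record
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return None
    return current

# The three pathways from the code above, expressed as key paths.
# Each entry maps a CoordSource number to (lat path, long path).
PATHS = {
    1: ([u'geo', u'coordinates', 0], [u'geo', u'coordinates', 1]),
    2: ([u'gnip', u'profileLocations', 0, u'geo', u'coordinates', 1],
        [u'gnip', u'profileLocations', 0, u'geo', u'coordinates', 0]),
    3: ([u'location', u'geo', u'coordinates', 0, 0, 1],
        [u'location', u'geo', u'coordinates', 0, 0, 0]),
}

def find_lat_long(tweet):
    # Returns (source, lat, long) for the first pathway that exists, else None.
    for source, (lat_path, long_path) in sorted(PATHS.items()):
        lat = get_path(tweet, lat_path)
        lng = get_path(tweet, long_path)
        if lat is not None and lng is not None:
            return source, lat, lng
    return None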

Tracking how many elements are processed in a generator

I have a problem in which I process documents from files using Python generators. The number of files I need to process is not known in advance. Each file contains records which consume a considerable amount of memory; because of that, generators are used to process the records. Here is a summary of the code I am working on:
def process_all_records(files):
    for f in files:
        fd = open(f, 'r')
        recs = read_records(fd)
        recs_p = (process_records(r) for r in recs)
        write_records(recs_p)
My process_records function checks the content of each record and only returns the records which have a specific sender. My problem is the following: I want a count of the number of elements returned by read_records. I have been keeping track of the number of records in the process_records function using a list:
def process_records(r):
    if r.sender('sender_of_interest'):
        records_list.append(1)
    else:
        records_list.append(0)
    ...
The problem with this approach is that records_list could grow without bound depending on the input. I want to be able to consume the contents of records_list once it grows to a certain point and then restart the process. For example, after 20 records have been processed, I want to find out how many records are from 'sender_of_interest' and how many are from other sources, and then empty the list. Can I do this without using a lock?
You could make your generator a class with an attribute that contains a count of the number of records it has processed. Something like this:
class RecordProcessor(object):
    def __init__(self, recs):
        self.recs = recs
        self.processed_rec_count = 0

    def __iter__(self):
        # iterating the instance drives the generator, so write_records(recs_p) works
        for r in self.recs:
            if r.sender('sender_of_interest'):
                self.processed_rec_count += 1
            # process record r...
            yield r  # processed record

def process_all_records(files):
    for f in files:
        fd = open(f, 'r')
        recs_p = RecordProcessor(read_records(fd))
        write_records(recs_p)
        print 'records processed:', recs_p.processed_rec_count
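If you also want the every-20-records report from the question, one possible variation on the class above (a sketch; the report_every threshold and the second counter are my additions) counts both matched and total records and resets them periodically inside the generator itself:

class RecordProcessor(object):
    # Sketch: counts matched vs. total records and reports every report_every records.
    def __init__(self, recs, report_every=20):
        self.recs = recs
        self.report_every = report_every
        self.total = 0
        self.matched = 0

    def __iter__(self):
        for r in self.recs:
            self.total += 1
            if r.sender('sender_of_interest'):
                self.matched += 1
            # process record r...
            yield r
            if self.total >= self.report_every:
                print '%d of %d recent records were from the sender of interest' % (
                    self.matched, self.total)
                self.total = 0
                self.matched = 0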
Here's the straightforward approach. Is there some reason why something this simple won't work for you?
seen = 0
matched = 0

def process_records(r):
    global seen, matched
    seen = seen + 1
    if r.sender('sender_of_interest'):
        matched = matched + 1
        records_list.append(1)
    else:
        records_list.append(0)

    if seen > 1000 or someOtherTimeBasedCriteria:
        print "%d of %d total records had the sender of interest" % (matched, seen)
        seen = 0
        matched = 0
If you have the ability to close your stream of messages and re-open them, you might want one more total seen variable, so that if you had to close that stream and re-open it later, you could go to the last record you processed and pick up there.
In this code, "someOtherTimeBasedCriteria" might be a timestamp check: record the current time when you begin processing, and if the current time is more than 20,000 ms (20 seconds) later, reset the seen/matched counters.
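A sketch of that time-based reset, using time.time() (which returns seconds, so the check is written as 20 seconds rather than 20,000 ms); window_start is my name for the extra variable:

import time

seen = 0
matched = 0
window_start = time.time()  # when the current counting window began

def process_records(r):
    global seen, matched, window_start
    seen += 1
    if r.sender('sender_of_interest'):
        matched += 1

    # reset after 1000 records or 20 seconds, whichever comes first
    if seen > 1000 or time.time() - window_start > 20:
        print "%d of %d total records had the sender of interest" % (matched, seen)
        seen = 0
        matched = 0
        window_start = time.time()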
