Filter Twitter files by location - Python

I am trying to find lat/long information for numerous tweets. One path to a tweet's lat/long data in its JSON is:
{u'location': {u'geo': {u'coordinates': [120.0, -5.0]}}}
I want to check each tweet for whether this location path exists. If it does, I want to use that information in a function later on; if it does not, I want to check another location path and finally move on to the next tweet.
Here is the code I currently have to check whether this path exists and has corresponding data. data is a list of parsed tweets that I built with data.append(json.loads(line)).
counter = 0
for line in data:
    if u'coordinates' in data[counter][u'location'][u'geo']:
        print counter, "HAS FIELD"
        counter += 1
    else:
        counter += 1
        print counter, 'no location data'
I get a KeyError with this code. The code below works, but it is not specific enough to actually get me to the information that I need.
counter = 0
for line in data:
    if u'location' in data[counter]:
        print counter, "HAS FIELD"
        counter += 1
    else:
        counter += 1
        print counter, 'no location data'
Does anyone have a way to do this?
Below is some more background on what I am doing overall, but the above sums up where I am stuck.
Background: I have access to 12 billion tweets, purchased through Gnip, that are divided up into multiple files. I am trying to comb through those tweets one by one and find which ones have location (lat/long) data, and then see whether the corresponding coordinates fall in a certain country. If a tweet does fall in that country, I will add it to a new database that is a subset of my larger database.
I have successfully created a function to test whether a lat/long falls in the bounding box of my target country, but I am having difficulty populating the lat/long for each tweet for two reasons: 1) there are multiple places where lat/long data can be stored in each JSON record, if it exists at all, and 2) the tweets are organized as a complex dictionary of dictionaries that I have difficulty maneuvering through.
I need to be able to loop through each tweet and see if a specific lat/long combination exists for the different location paths so that I can pull it and feed it into my function that tests if that tweet originated in my country of interest.

I get a KeyError with this code
Assuming the keys should be in double quotes because they contain ':
counter = 0
for line in data:
    if "u'coordinates" in data[counter]["u'location"]["u'geo"]:
        print counter, "HAS FIELD"
        counter += 1
    else:
        counter += 1
        print counter, 'no location data'

The solution I found may not be the most efficient, but it is functional. It uses if statements nested in try/except blocks, which lets me check different location paths while pushing through KeyErrors so that I can move on to other tweets and pathways. Below is my code. It goes through multiple tweets and checks for ones that have a lat/long combo available in any of three pathways. It works with my addTOdb function, which checks whether the lat/long combo is in my target country. It also builds a separate dictionary called Lat_Long where I can view all the tweets that had lat/long combos and which path I pulled them through.
#use try/except to see if an entry is in the json files
#initialize counter that numbers each json entry
counter = 0
#This is a test dict to see which lat/long was selected
Lat_Long = {}
for line in data:
    TweetLat = 0
    TweetLong = 0
    #variable that will indicate which path was used for the coordinate lat/long
    CoordSource = 0
    #Sets the while variable to False. Will change if coords are found.
    GotCoord = False
    while GotCoord == False:
        #check 1st path using geo field
        try:
            if u'coordinates' in data[counter][u'geo'] and GotCoord == False:
                TweetLat = data[counter][u'geo'][u'coordinates'][0]
                TweetLong = data[counter][u'geo'][u'coordinates'][1]
                #print 'TweetLat', TweetLat
                print counter, "HAS FIELD"
                addTOdb(TweetLat, TweetLong, North, South, East, West)
                CoordSource = 1
                GotCoord = True
        except KeyError:
            pass
        #check 2nd path using gnip info
        try:
            if u'coordinates' in data[counter][u'gnip'][u'profileLocations'][0][u'geo'] and GotCoord == False:
                TweetLat = data[counter][u'gnip'][u'profileLocations'][0][u'geo'][u'coordinates'][1]
                TweetLong = data[counter][u'gnip'][u'profileLocations'][0][u'geo'][u'coordinates'][0]
                print counter, "HAS FIELD"
                addTOdb(TweetLat, TweetLong, North, South, East, West)
                CoordSource = 2
                GotCoord = True
        except KeyError:
            pass
        #check 3rd path using location polygon info
        try:
            if u'coordinates' in data[counter][u'location'][u'geo'] and GotCoord == False:
                TweetLat = data[counter][u'location'][u'geo'][u'coordinates'][0][0][1]
                TweetLong = data[counter][u'location'][u'geo'][u'coordinates'][0][0][0]
                print counter, "HAS FIELD"
                addTOdb(TweetLat, TweetLong, North, South, East, West)
                CoordSource = 3
                GotCoord = True
        except KeyError:
            pass
        if GotCoord == True:
            Lat_Long[counter] = [CoordSource, TweetLat, TweetLong]
        else:
            print counter, "no field"
            GotCoord = True
    counter += 1
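As a more compact sketch (assuming tweet stands for one parsed entry such as data[counter]), a small helper that walks a nested path and returns None when any key or index is missing can replace the repeated try/except blocks:
def get_nested(obj, *path):
    """Follow a sequence of keys/indexes; return None if any step is missing."""
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            return None
    return obj

# Try the same three pathways as above, in order of preference.
coords = (get_nested(tweet, u'geo', u'coordinates')
          or get_nested(tweet, u'gnip', u'profileLocations', 0, u'geo', u'coordinates')
          or get_nested(tweet, u'location', u'geo', u'coordinates'))
Note that the lat/long ordering still differs between pathways, so the caller has to keep track of which path produced the coordinates, as CoordSource does above.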

Related

How to update an Azure Table Storage entity with an ETag

Overview: When I upload a blob to blob storage under container/productID(folder)/blobName, an event subscription saves the event in a storage queue. An Azure Function then polls this event and does the following:
1- read the current count property (how many blobs are stored under productID(folder)) from the corresponding table, together with its ETag
2- increase the count by 1
3- write it back to the corresponding table; if the ETag matches, the count field is increased, otherwise an error is thrown. If an error is thrown, sleep a while and go back to step 1 (while loop)
4- if the count field was successfully increased, return
Scenario: trying to upload five items to blob storage.
Expectation: the count property in table storage ends up at 5.
Problem: after the first four items are inserted successfully, the code gets into an infinite loop while inserting the fifth item, and the count property increases forever. I have no idea why this could happen; any ideas would be appreciated.
#more code
header_etag = "random-etag"
response_etag = "random-response"
while response_etag != header_etag:
    sleep(random.random())  # sleep between 0 and 1 second
    header = table_service.get_entity_table(
        client_table, client_table, client_product)
    new_count = header['Count'] + 1
    entity_product = create_product_entity(
        client_table, client_product, new_count, client_image_table)
    header_etag = header['etag']
    try:
        response_etag = table_service1.merge_entity(client_table, entity_product,
                                                    if_match=header_etag)
    except:
        logging.info("race condition detected")
Try implementing your parameters and logic in a while loop like the code below:
val = 0  # initial value of zero
success = False
while True:
    val = val + 1  # incrementing
    CheckValidations = input('Check')  # add validations
    if CheckValidations == "abc123":
        success = True  # this will allow the loop to end
        break
    if val > 5:
        break
    print("Please Try Again")
if success == True:
    print("Welcome!")
Also check the .py file below from the Azure Python SDK samples, which has clear functions for updating, merging, and inserting entities in table storage:
azure-sdk-for-python/sample_update_upsert_merge_entities.py at main · Azure/azure-sdk-for-python (github.com)
Refer to the Microsoft documentation to check how to pass the exact parameters for creating an entity.
For more insight into table samples, check the Azure samples:
storage-table-python-getting-started/table_basic_samples.py at master · Azure-Samples/storage-table-python-getting-started (github.com)

How to count how many times the user has deleted a value from a dictionary?

I am running into an issue. I would like to track the number of times a user has deleted a value from a key, so that even if the user exits the program it still retains that number. Then, if the user deletes any values again in the future, it will use the existing number and add on from there.
Edit: just to add, the whole dict will be stored in a .txt file.
dict = {}  # start off with an empty dict
key_search = input("Enter to find a key")
if options_choose == 2:
    c = input('Which value would you like to change? ')
    c = c.lower()
    if c in list_of_value:
        loc = list_of_value.index(c)
        list_of_value.remove(c)
        correction = input("Enter correction: ")
        correction = correction.lower()
        print(f"value(s) found relating to the key '{key_search}' are:")
        list_of_value.insert(loc, correction)
        list_of_value = dict[key_search]
        for key, value in enumerate(list_of_value, 1):
            print(f"{key}.) {value}")
    else:
        print('Entry invalid')
As shown in the screenshot (not reproduced here), I exited and re-entered the program and the counter kept its value.
You can adapt this to fit the features of your program. Since the code you provided seems incomplete, I have substituted an example of dictionary modification to show how you can store and read a value after program termination.
You need a folder containing test.py and modification_value.txt. Write "0" inside the text file.
ex_dict = {'example_entry': 1}
#Read in the value stored in the text file.
with open('modification_value.txt', 'r') as file:
    counter = int(file.readline())  # counter value imported from file
print('Counter: ', counter)
modify_dict = True  # boolean to check when the user wants to stop making changes
while modify_dict == True:
    for key in ex_dict:
        dict_key = key
        new_value = input('Enter value for new key.\n')
        ex_dict[dict_key] = new_value
        counter += 1
        print('New dictionary: ', ex_dict, '\n')
    response = input("Do you still want to modify Y/N?\n")
    if response == 'Y':
        continue
    elif response == 'N':
        modify_dict = False
#Write the counter value recorded by the program to the text file so the program can access it when it is run again after termination.
with open('modification_value.txt', 'w+') as file:
    file.write(str(counter))
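Since the question mentions keeping the whole dict in a .txt file, the same idea extends to per-key deletion counts by persisting a small dictionary with json instead of a single integer. A minimal sketch; the file name and helper are hypothetical:
import json
import os

COUNTS_FILE = 'deletion_counts.txt'  # hypothetical file name

# Load previously saved counts, or start fresh on the first run.
if os.path.exists(COUNTS_FILE):
    with open(COUNTS_FILE, 'r') as f:
        delete_counts = json.load(f)
else:
    delete_counts = {}

def record_deletion(key):
    """Increment the stored deletion count for a key and write it back to disk."""
    delete_counts[key] = delete_counts.get(key, 0) + 1
    with open(COUNTS_FILE, 'w') as f:
        json.dump(delete_counts, f)

# Example: call record_deletion(key_search) wherever a value is removed.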

Compiling a dictionary by pulling data from other dictionaries

I am doing a project in which I extract data from three different data sets and combine it to look at campaign contributions. To do this I turned the relevant data from two of the sets into dictionaries (canDict and otherDict) with ID numbers as keys and the information I need (party affiliation) as values. Then I wrote a program to pull party information based on the key (my third set included these ID numbers as well) and match them with the employer of the donating party, and the amount donated. That was a long winded explanation, but I thought it would help with understanding this chunk of code.
My problem is that, for some reason, my third dictionary (employerDict) won't compile. By the end of this step I should have a dictionary containing employers as keys, and a list of tuples as values, but after running it, the dictionary remains blank. I've been over this line by line a dozen times and I'm pulling my hair out - I can't for the life of me think why it won't work, which is making it hard to search for answers. I've commented almost every line to try to make it easier to understand out of context. Can anyone spot my mistake?
Update: I added a counter, n, to the outermost for loop to see if the program was iterating at all.
Update 2: I added another if statement in the creation of the variable party, in case the ID at data[0] did not exist in canDict or in otherDict. I also added some already suggested fixes from the comments.
n = 0
with open(path3) as f:  # path3 is a txt file
    for line in f:
        n += 1
        if n % 10000 == 0:
            print(n)
        data = line.split("|")  # splitting each line into its entries (delimited by the symbol |)
        party = canDict.get(data[0])  # data[0] is an ID number; canDict and otherDict contain these IDs as keys with party affiliations as values
        if party is None:
            party = otherDict[data[0]]  # if there is no matching ID number in canDict, search otherDict
            if party is None:
                party = 'Other'
            else:
                print('ERROR: party is None')
        x = (party, int(data[14]))  # tuple of the party (found above) and an integer amount from the file path3
        employer = data[11]  # index 11 in path3 is the employer of the person
        if employer != '':
            value = employerDict.get(employer)  # if the employer field is not blank, see if this employer is already a key in employerDict
            if value is None:
                employerDict[employer] = [x]  # if the key does not exist, create it with a list containing the tuple x as its value
            else:
                employerDict[employer].append(x)  # if it does exist, add the tuple x to the existing value
        else:
            print("ERROR: employer == ''")
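As an aside, collections.defaultdict removes the need for the None check when building employerDict. A sketch using the same names as the question (canDict, otherDict, path3), not a drop-in replacement for the full script:
from collections import defaultdict

employerDict = defaultdict(list)  # unknown employers start with an empty list

with open(path3) as f:
    for line in f:
        data = line.split("|")
        # fall back to otherDict, then to 'Other', when the ID is missing
        party = canDict.get(data[0]) or otherDict.get(data[0], 'Other')
        employer = data[11]
        if employer:
            employerDict[employer].append((party, int(data[14])))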
Thanks for all the input, everyone. However, it looks like it's a problem with my data file, not a problem with the program. Dangit.

Python - Perform file check based on format of 3 values then perform tasks

All,
I am trying to write a Python script that will go through a crime file and separate it based on the following sections: UPDATES, INCIDENTS, and ARRESTS. The reports I receive generally show these sections either as listed above or as **UPDATES**, **INCIDENTS**, and **ARRESTS**. I have already started writing the following script to separate the files based on the ** format. However, is there a better way to check the files for both formats at the same time? Also, sometimes there is no UPDATES or ARRESTS section, which causes my code to break. Is there a check I can do for this case, and if so, how can I still get the INCIDENTS section without the other two?
with open('CrimeReport20150518.txt', 'r') as f:
    content = f.read()
print content.index('**UPDATES**')
print content.index('**INCIDENTS**')
print content.index('**ARRESTS**')
updatesLine = content.index('**UPDATES**')
incidentsLine = content.index('**INCIDENTS**')
arrestsLine = content.index('**ARRESTS**')
#print content[updatesLine:incidentsLine]
updates = content[updatesLine:incidentsLine]
#print updates
incidents = content[incidentsLine:arrestsLine]
#print incidents
arrests = content[arrestsLine:]
print arrests
You are currently using .index() to locate the headings in the text. The documentation states:
Like find(), but raise ValueError when the substring is not found.
That means that you need to catch the exception in order to handle it. For example:
try:
    updatesLine = content.index('**UPDATES**')
    print "Found updates heading at", updatesLine
except ValueError:
    print "Note: no updates"
    updatesLine = -1
From here you can determine the correct indexes for slicing the string based on which sections are present.
Alternatively, you could use the .find() method referenced in the documentation for .index().
Return -1 if sub is not found.
Using find you can just test the value it returned.
updatesLine = content.find('**UPDATES**')
# the following is straightforward, but unwieldy
if updatesLine != -1:
    if incidentsLine != -1:
        updates = content[updatesLine:incidentsLine]
    elif arrestsLine != -1:
        updates = content[updatesLine:arrestsLine]
    else:
        updates = content[updatesLine:]
Either way, you'll have to deal with all combinations of which sections are and are not present to determine the correct slice boundaries.
I would prefer to approach this using a state machine. Read the file line by line and add the line to the appropriate list. When a header is found then update the state. Here is an untested demonstration of the principle:
data = {
    'updates': [],
    'incidents': [],
    'arrests': [],
}
state = None
with open('CrimeReport20150518.txt', 'r') as f:
    for line in f:
        # .strip() drops the trailing newline before comparing against headers
        if line.strip() == '**UPDATES**':
            state = 'updates'
        elif line.strip() == '**INCIDENTS**':
            state = 'incidents'
        elif line.strip() == '**ARRESTS**':
            state = 'arrests'
        else:
            if state is None:
                print "Warn: no header seen; skipping line"
            else:
                data[state].append(line)
print ''.join(data['arrests'])
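To cover both heading styles mentioned in the question (UPDATES as well as **UPDATES**), the equality tests in that loop could be replaced with a regular expression. A sketch along the same lines, assuming each heading sits on its own line:
import re

# Match a heading line with or without surrounding asterisks.
heading_re = re.compile(r'^\*{0,2}(UPDATES|INCIDENTS|ARRESTS)\*{0,2}\s*$')

data = {'updates': [], 'incidents': [], 'arrests': []}
state = None
with open('CrimeReport20150518.txt', 'r') as f:
    for line in f:
        match = heading_re.match(line.strip())
        if match:
            state = match.group(1).lower()  # switch to the new section
        elif state is not None:
            data[state].append(line)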
Try using content.find() instead of content.index(). Instead of breaking when the string isn't there, it returns -1. Then you can do something like this:
updatesLine = content.find('**UPDATES**')
incidentsLine = content.find('**INCIDENTS**')
arrestsLine = content.find('**ARRESTS**')
if incidentsLine != -1 and arrestsLine != -1:
    # Do what you normally do
    updatesLine = content.index('**UPDATES**')
    incidentsLine = content.index('**INCIDENTS**')
    arrestsLine = content.index('**ARRESTS**')
    updates = content[updatesLine:incidentsLine]
    incidents = content[incidentsLine:arrestsLine]
    arrests = content[arrestsLine:]
elif incidentsLine != -1:
    # Do whatever you need to do to files that don't have an arrests section here
    pass
elif arrestsLine != -1:
    # Handle files that don't have an incidents section here
    pass
else:
    # Handle files that are missing both
    pass
Probably you'll need to handle all four possible combinations slightly differently.
Your solution generally looks OK to me as long as the sections always come in the same order and the files don't get too big. You can get more detailed feedback at Stack Exchange's Code Review: https://codereview.stackexchange.com/

Incorrect value from dynamodb table description and scan count

I'm having a problem with DynamoDB. I'm attempting to verify the data it contains, but scan seems to return only a subset of the data. Here is the code I'm using with the Python boto bindings:
#!/usr/bin/python
#Check the scanned length of a table against the table description
import boto.dynamodb

#Connect
TABLENAME = "MyTableName"
sdbconn = boto.dynamodb.connect_to_region(
    "eu-west-1",
    aws_access_key_id='-snipped-',
    aws_secret_access_key='-snipped-')

#Initial scan
results = sdbconn.layer1.scan(TABLENAME, count=True)
previouskey = results['LastEvaluatedKey']

#Create counting variable
count = results['Count']

#DynamoDB scan results are limited to 1 MB but return a key to carry on with the next MB,
#so loop until no continuation point is returned
while previouskey != False:
    results = sdbconn.layer1.scan(TABLENAME, exclusive_start_key=previouskey, count=True)
    print(count)
    count = count + results['Count']
    try:
        #get next key
        previouskey = results['LastEvaluatedKey']
    except:
        #no key returned, so that's all folks!
        print(previouskey)
        print("Reached End")
        previouskey = False

#these presumably should match; they don't on the MyTableName table, not even close
print(sdbconn.describe_table(TABLENAME)['Table']['ItemCount'])
print(count)
print(sdbconn.describe_table(TABLENAME)['Table']['ItemCount']) gives me 1748175, and
print(count) gives me 583021.
I was under the impression that these should always match? (I'm aware that ItemCount is only updated roughly every six hours, but only about 300 rows have been added in the last 24 hours.)
Does anyone know if this is an issue with DynamoDB, or does my code make a wrong assumption?
Figured it out finally: it's to do with Local Secondary Indexes. They show up in the table description as separate items, and the table has two LSIs, causing it to show 3x the number of items actually present.
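For reference, the same pagination pattern in boto3 (rather than the legacy boto bindings used above) might look like the following sketch; the table name and region are taken from the question:
import boto3

client = boto3.client("dynamodb", region_name="eu-west-1")

count = 0
kwargs = {"TableName": "MyTableName", "Select": "COUNT"}
while True:
    response = client.scan(**kwargs)
    count += response["Count"]
    last_key = response.get("LastEvaluatedKey")
    if not last_key:
        break  # no continuation point: the whole table has been scanned
    kwargs["ExclusiveStartKey"] = last_key

print(count)
Note that describe_table's ItemCount is still only refreshed periodically, so an exact match with a live scan count is not guaranteed.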
