Compiling a dictionary by pulling data from other dictionaries


I am doing a project in which I extract data from three different data sets and combine them to look at campaign contributions. To do this I turned the relevant data from two of the sets into dictionaries (canDict and otherDict) with ID numbers as keys and the information I need (party affiliation) as values. Then I wrote a program to pull party information based on the key (my third set includes these ID numbers as well) and match it with the employer of the donating party and the amount donated. That was a long-winded explanation, but I thought it would help with understanding this chunk of code.
My problem is that, for some reason, my third dictionary (employerDict) won't compile. By the end of this step I should have a dictionary containing employers as keys, and a list of tuples as values, but after running it, the dictionary remains blank. I've been over this line by line a dozen times and I'm pulling my hair out - I can't for the life of me think why it won't work, which is making it hard to search for answers. I've commented almost every line to try to make it easier to understand out of context. Can anyone spot my mistake?
Update: I added a counter, n, to the outermost for loop to see if the program was iterating at all.
Update 2: I added another if statement in the creation of the variable party, in case the ID at data[0] did not exist in canDict or in otherDict. I also added some already suggested fixes from the comments.
n = 0
with open(path3) as f:  # path3 is a txt file
    for line in f:
        n += 1
        if n % 10000 == 0:
            print(n)  # Progress counter to confirm the loop is iterating
        data = line.split("|")  # Splitting each line into its entries (delimited by the symbol |)
        party = canDict.get(data[0])  # data[0] is an ID number. canDict and otherDict contain these IDs as keys with party affiliations as values
        if party is None:
            party = otherDict.get(data[0])  # If there is no matching ID number in canDict, search otherDict
            if party is None:
                party = 'Other'  # The ID exists in neither dictionary
        x = (party, int(data[14]))  # Creating a tuple of the party (found above) and an integer amount from the file path3
        employer = data[11]  # Index 11 in path3 is the employer of the person
        if employer != '':
            value = employerDict.get(employer)  # If the employer field is not blank, see if this employer is already a key in employerDict
            if value is None:
                employerDict[employer] = [x]  # If the key does not exist, create it and add a list including the tuple x as its value
            else:
                employerDict[employer].append(x)  # If it does exist, add the tuple x to the existing value
        else:
            print("ERROR: employer == ''")

Thanks for all the input, everyone. However, it looks like it's a problem with my data file, not a problem with the program. Dangit.
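Because the cause turned out to be the data file, it may help to make malformed rows visible instead of letting them fail silently. The following is only a sketch under assumptions: it keeps the field positions from the question (ID at index 0, employer at index 11, amount at index 14), while the function name `build_employer_dict` and the idea of counting skipped rows are hypothetical additions, not part of the original program.

```python
from collections import defaultdict

def build_employer_dict(lines, can_dict, other_dict):
    """Group (party, amount) tuples by employer and count rows that cannot be parsed."""
    employer_dict = defaultdict(list)
    skipped = 0
    for line in lines:
        data = line.rstrip("\n").split("|")
        # A row with fewer than 15 fields has no amount at index 14.
        if len(data) < 15:
            skipped += 1
            continue
        try:
            amount = int(data[14])
        except ValueError:
            # The amount field is not an integer.
            skipped += 1
            continue
        # Look the ID up in can_dict first, then other_dict, then fall back to 'Other'.
        party = can_dict.get(data[0], other_dict.get(data[0], "Other"))
        employer = data[11]
        if employer:
            employer_dict[employer].append((party, amount))
    return dict(employer_dict), skipped
```

Called as `build_employer_dict(open(path3), canDict, otherDict)`, a large `skipped` count would point at the data file rather than at the loop.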

Related

Extracting multiple data from a single list

I'm working on a text file that contains several kinds of information. I converted it into a list in Python, and now I'm trying to separate the different data into different lists. The data is laid out as follows:
CODE / DESCRIPTION / Unity / Value1 / Value2 / Value3 / Value4, and then it repeats. An example would be:
P03133 Auxiliar helper un 203.02 417.54 437.22 675.80
My approach so far has been:
Creating lists to store each piece of information:
codes = []
description = []
unity = []
cost = []
Then, looping to find each code based on its structure, and using the code's index as a base to find the remaining values.
Finding a code is easy; it is a distinct type of information among the other data.
For the remaining values, I made a loop that finds the first numeric value after a code. That lets me delimit the other indexes:
The unity is at the code's index + the offset to the first numeric value - 1, since it is the last item before the first numeric value in each line.
The cost is at the code's index + the offset to the first numeric value + 2; the third value is the only one I need to store.
The description is a little harder, because the number of elements that compose it varies across the list. So I used slicing, starting at the code's index + 1 and ending at the offset to the first numeric value - 2.
for i, carc in enumerate(txtl):
    if carc[0] == "P" and carc[1].isnumeric():
        codes.append(carc)
        j = 0
        while not txtl[i+j].isnumeric():
            j = j + 1
        description.append(" ".join(txtl[i+1:i+j-2]))
        unity.append(txtl[i+j-1])
        cost.append(txtl[i+j])
I'm facing a problem with this approach: although there will always be more elements in the list after a code, I'm getting this error:
while not txtl[i+j].isnumeric():
txtl[i+j] list index out of range.
I'd accept any fix for my code, or even a new approach to the problem.
Note: I'm also going to have to do this for a very similar data source, but there the code is just a sequence of 7 digits, and thus harder to find among the other data. Any solution that covers this case is also appreciated!

A slight addition to your code should resolve this:
while i+j < len(txtl) and not txtl[i+j].isnumeric():
    j += 1
The first condition fails when out of bounds, so the second one doesn't get checked.
Also, please use a list of dict items instead of 4 different lists, for example:
thelist = []
thelist.append({'codes': 69, 'description': 'random text', 'unity': 'whatever', 'cost': 'your life'})
In this way you always have the correct values together in the list, and you don't need to keep track of where you are with indexes or other black magic...
EDIT after comment interactions:
Ok, so in this case you split the line you are processing on the space character, and then process the words in the line.
from pprint import pprint  # just for pretty printing
textl = 'P03133 Auxiliar helper un 203.02 417.54 437.22 675.80'
the_list = []
def handle_line(textl: str):
    # split() splits on whitespace by default; the first item is always the code
    code, *rest = textl.split()
    words = []
    values = []
    for word in rest:
        # str.isnumeric() returns False for strings containing '.' or ',',
        # so strip those before testing. See https://stackoverflow.com/a/23639915/9267296
        if word.replace(',', '').replace('.', '').isnumeric():
            values.append(word)
        else:
            words.append(word)
    # the last non-numeric word is the unity; everything before it is the description
    unity = words[-1] if words else None
    description = ' '.join(words[:-1])
    return {'code': code, 'description': description, 'unity': unity, 'values': values}
the_list.append(handle_line(textl))
pprint(the_list)
Note that str.isnumeric() returns False for strings that contain a decimal point, such as '203.02', so on its own it only recognizes integer strings. See https://stackoverflow.com/a/23639915/9267296
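If each record really does sit on its own line, a regular expression can pull all the fields out in one step and also cover the 7-digit codes mentioned in the question. This is a sketch under assumptions, not a tested solution for the real file: it assumes one record per line, a single-token unity, and exactly four numeric values at the end. The names `LINE_RE` and `parse_line` are made up for illustration.

```python
import re

LINE_RE = re.compile(
    r"^(?P<code>P\d+|\d{7})\s+"                       # code: 'P' plus digits, or exactly 7 digits
    r"(?P<description>.+?)\s+"                        # description: one or more words
    r"(?P<unity>\S+)\s+"                              # unity: the single token before the numbers
    r"(?P<values>\d[\d.,]*(?:\s+\d[\d.,]*){3})\s*$"   # four numeric values
)

def parse_line(line):
    """Return a dict of fields, or None when the line does not fit the assumed layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "code": match.group("code"),
        "description": match.group("description"),
        "unity": match.group("unity"),
        "values": match.group("values").split(),  # kept as strings, like the answer above
    }

print(parse_line("P03133 Auxiliar helper un 203.02 417.54 437.22 675.80"))
```

Lines that return None are worth logging, since they show where the real data departs from the assumed layout.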

Returning a value based on matching one lists values to another list based on order

I have four lists:
user = [0,0,1,1]
names = ["jake","ryan","paul","david"]
disliked_index = [0,1]
ranked_names = ["paul","ryan","david","jake"]
List "user" holds a user's response to which names they like (1 if like, 0 if dislike) from list "names". disliked_index holds the list spots that user indicated 0 in the user list. ranked_names holds the names ranked from most popular to least popular based on the data set (multiple students). What I am trying to achieve is to return the most popular name that the user responded they didn't like. So that
mostpopular_unlikedname = "ryan"
So far what I have done is:
placement = []
for i in disliked_index:
    a = names[i]
    placement.append(a)
Where I now have a list that holds the names the user did not like.
placement = ["jake","ryan"]
Here my logic is to run a loop to check which names in the placement list appear in the ranked_names list and get added to the top_name list in the order from most popular to least.
top_name = []
for i in range(len(ranked_names)):
    if ranked_names[i] == placement:
        top_name.append[i]
Nothing ends up being added to the top_name list. I am stuck on this part and wanted to see if this is an alright direction to continue or if I should try something else.
Any guidance would be appreciated, thanks!

You don't really need disliked_index list for this. Just do something along these lines:
dis_pos = []
for name, sentiment in zip(names, user):
    if sentiment == 0:
        dis_pos.append(ranked_names.index(name))
mostpopular_unlikedname = ranked_names[min(dis_pos)]
print(mostpopular_unlikedname)
Output:
ryan
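An alternative, sketched here with the same lists from the question, is to walk ranked_names in order and stop at the first name the user disliked. The set `disliked` is a name introduced for this example; the `None` default covers the case where the user disliked nothing.

```python
user = [0, 0, 1, 1]
names = ["jake", "ryan", "paul", "david"]
ranked_names = ["paul", "ryan", "david", "jake"]

# Names the user marked with 0.
disliked = {name for name, liked in zip(names, user) if liked == 0}

# ranked_names is ordered from most to least popular, so the first hit is the answer.
mostpopular_unlikedname = next(
    (name for name in ranked_names if name in disliked), None
)
print(mostpopular_unlikedname)  # ryan
```

This avoids calling ranked_names.index() once per disliked name, which matters only when the lists are long.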

Python, loop that gets a value then tests that value again

This is my first time asking here. I tried searching for an answer, but wasn't certain how to phrase what I need so I decided to ask.
I am working on a character creator for a tabletop RPG. I want to get the results for the character's previous occupation, which are on a list, then test that value again to get the occupation previous to that.
I already have a way of getting the first occupation, which is then compared with a text database, with entries such as:
Captain ,Explorer,Knight,Sergeant,
Where Captain is the first occupation and the commas mark the beginning and the end of the possible previous occupations. I have managed to get one of those randomly, but I haven't been able to make the loop then take the selected occupation and run it again. For example:
Explorer ,Cartographer,
Here's the simplified version of my code. It gets the first part right, but I'm not sure how to trigger a loop for the next.
import random
def carOld(carrera, nivPoder):
    carActual = carrera
    u = 0
    indPoder = int(nivPoder)
    carAnterior = []
    commas = []
    entTemp = []
    # Read every line up front; the with block closes the file
    with open("listaCarreras.txt", "r") as d:
        f = d.readlines()
    while indPoder != 0:
        indPoder = indPoder - 1
        for line in f:
            if carActual in line:
                entTemp = line.split(",")
        del entTemp[0]   # drop the current occupation
        del entTemp[-1]  # drop the piece after the trailing comma
        print(entTemp)
        carAnterior = random.choice(entTemp)

I think this does what you want. Based on your description, I believe the current occupation is at the front of the list, and the previous occupations come next in the list.
str_occs = 'Occ1,Occ2,Occ3'
list_occs = str_occs.split(',')
def prev_occ(occupation, list_occs):
    prev_occ_index = list_occs.index(occupation) + 1
    try:
        ret_val = list_occs[prev_occ_index]
    except IndexError:
        ret_val = "No prior occupations."
    return ret_val
You can try it out here: https://repl.it/B08A
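To repeat the lookup, as the question asks, one option is to feed each chosen occupation back in as the next current occupation. The sketch below assumes the file format shown in the question (the occupation first, then its possible previous occupations between commas); `parse_careers` and `career_history` are hypothetical names, and the function stops early when an occupation has no entry.

```python
import random

def parse_careers(lines):
    """Map each occupation to the list of its possible previous occupations."""
    careers = {}
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        if fields and fields[0]:
            # Empty strings come from the trailing comma and the line ending.
            careers[fields[0]] = [field for field in fields[1:] if field]
    return careers

def career_history(careers, current, depth, rng=random):
    """Pick up to `depth` previous occupations, each chosen from the one before it."""
    history = []
    for _ in range(depth):
        options = careers.get(current)
        if not options:
            break  # no entry, or no previous occupations listed
        current = rng.choice(options)  # the choice becomes the next lookup key
        history.append(current)
    return history

careers = parse_careers(["Captain ,Explorer,Knight,Sergeant,\n", "Explorer ,Cartographer,\n"])
print(career_history(careers, "Captain", 3))
```

Matching on exact dictionary keys also avoids the substring test `carActual in line`, which would match any line that merely mentions the occupation.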

Best way to deal with giant number of combinations python

I have a bunch of Twitter data (300 million messages from 450k users) and am trying to unravel a social network through #mentions. My end goal is to have a bunch of pairs where the first item is a pair of #mentions and the second item is the number of users who mention both people. For example: [(#sam, #kim), 25]. The order of the #mentions doesn't matter, so (#sam,#kim)=(#kim,#sam).
First I am creating a dictionary where the key is the user id and the value is a set of #mentions
for row in data:
    user_id = int(row[1])
    msg = str(unicode(row[0], errors='ignore'))
    if user_id not in userData:
        userData[user_id] = set([ tag.lower() for tag in msg.split() if tag.startswith("#") ])
    else:
        userData[user_id] |= set([ tag.lower() for tag in msg.split() if tag.startswith("#") ])
I then loop through the users and create a dictionary where the key is a tuple of #mentions and the value is the number of users who mention both:
for user in userData.keys():
    if len(userData[user]) < MENTION_THRESHOLD:
        continue
    for ht in itertools.combinations(userData[user], 2):
        if ht in hashtag_set:
            hashtag_set[ht] += 1
        else:
            hashtag_set[ht] = 1
This second part is taking FOREVER to run. Is there a better way to run this and/or a better way to store this data?

Instead of trying to do all this stuff in-memory as you are now, I would suggest using generators to pipeline your data. Check out this slide deck from PyCon 2008 by David Beazley: http://www.dabeaz.com/generators-uk/GeneratorsUK.pdf
In particular, Part 2 has a number of examples of parsing big data that directly apply to what you want to do. By using generators, you can avoid most of the memory consumption you have now, and I would expect you to see significant performance improvements as a result.
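As a rough illustration of the generator idea applied to the pair-counting step, the sketch below yields pairs one user at a time and lets collections.Counter do the tallying. It is written for Python 3 and rests on assumptions: `user_data` is a small made-up sample in the same shape as userData, and `mention_pairs` is a hypothetical helper. Sorting each user's mentions first makes (#kim, #sam) and (#sam, #kim) land on the same key. It does not reduce the number of pairs, which still grows quadratically with the mentions per user.

```python
from collections import Counter
from itertools import combinations

def mention_pairs(user_data, threshold=2):
    """Yield every pair of mentions made by the same user, lazily."""
    for mentions in user_data.values():
        if len(mentions) < threshold:
            continue
        # Sorting fixes the order inside each pair, so each unordered pair has one key.
        yield from combinations(sorted(mentions), 2)

# Made-up sample data: user id -> set of mentions.
user_data = {
    1: {"#sam", "#kim"},
    2: {"#kim", "#sam", "#lee"},
    3: {"#sam"},
}

pair_counts = Counter(mention_pairs(user_data))
print(pair_counts.most_common(1))  # [(('#kim', '#sam'), 2)]
```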

Organizing and printing information by a specific row in a csv file

I wrote some code that takes in some data, and I end up with a csv file that looks like the following:
1,Steak,Martins
2,Fish,Martins
2,Steak,Johnsons
4,Veggie,Smiths
3,Chicken,Johnsons
1,Veggie,Johnsons
where the first column is a quantity, the second column is the type of item (in this case the meal), and the third column is an identifier (in this case it is family name). I need to print this information to a text file in a specific way:
Martins
1 Steak
2 Fish
Johnsons
2 Steak
3 Chicken
1 Veggie
Smiths
4 Veggie
So what I want is the family name followed by what that family ordered. I wrote the following code to accomplish this, but it doesn't seem to be quite there.
import csv
orders = "orders.txt"
messy_orders = "needs_sorting.csv"
with open(messy_orders, 'rb') as orders_for_sorting, open(orders, 'a') as final_orders_file:
    comp = []
    reader_sorting = csv.reader(orders_for_sorting)
    for row in reader_sorting:
        test_bit = [row[2]]
        if test_bit not in comp:
            comp.append(test_bit)
            final_orders_file.write(row[2])
            for row in reader_sorting:
                if [row[2]] == test_bit:
                    final_orders_file.write(row[0], row[1])
        else:
            print "already here"
            continue
What I end up with is the following
Martins
2 Fish
Additionally, I never see it print "already here", though I think I should if it were working properly. What I suspect is happening is that the program goes through the second for loop, then exits without continuing the first loop. Unfortunately, I'm not sure how to make it go back to the original loop once it has identified and printed all instances of a given family name in the file. The reason I have it set up this way is so that I can get the family name written as a header; otherwise I would just sort the file by family name. Please note that after running the orders through my first program, I did manage to sort everything such that each row represents the complete quantity of that type of food for that family (there are no recurring instances of a row containing both Steak and Martins).

This is a problem that you solve with a dictionary, which will accumulate your items by the family name in your file.
The second thing you have to do is accumulate a total of each type of meal - keeping in mind that the data you are reading is a string, and not an integer that you can add, so you'll have to do some conversion.
To put all that together, try this snippet:
import csv
d = dict()
with open(r'd:/file.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        # if the family name doesn't
        # exist in our dictionary,
        # set it with a default value of a blank dictionary
        if row[2] not in d:
            d[row[2]] = dict()
        # If the meal type doesn't exist for this
        # family, set it up as a key in their dictionary
        # and set the value to int value of the count
        if row[1] not in d[row[2]]:
            d[row[2]][row[1]] = int(row[0])
        else:
            # Both the family and the meal already
            # exist in the dictionary, so just add the
            # count to the total
            d[row[2]][row[1]] += int(row[0])
Once you run through that loop, d looks like this:
{'Johnsons': {'Chicken': 3, 'Steak': 2, 'Veggie': 1},
'Martins': {'Fish': 2, 'Steak': 1},
'Smiths': {'Veggie': 4}}
Now it's just a matter of printing it out:
for family, data in d.items():  # items() works on both Python 2 and Python 3
    print('{}'.format(family))
    for meal, total in data.items():
        print('{} {}'.format(total, meal))
At the end of the loop, you'll have:
Johnsons
3 Chicken
2 Steak
1 Veggie
Smiths
4 Veggie
Martins
2 Fish
1 Steak
You can later improve this snippet by using collections.defaultdict.
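For reference, a defaultdict version of the same accumulation might look like the sketch below. It is written for Python 3 and uses io.StringIO with the rows from the question so that it runs without a file; `total_orders` is a name invented for this example, and a real run would pass csv.reader over the opened file instead.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the question; a real run would open the CSV file instead.
sample = io.StringIO(
    "1,Steak,Martins\n2,Fish,Martins\n2,Steak,Johnsons\n"
    "4,Veggie,Smiths\n3,Chicken,Johnsons\n1,Veggie,Johnsons\n"
)

def total_orders(rows):
    """Sum quantities per family and meal; missing keys start at 0 automatically."""
    totals = defaultdict(lambda: defaultdict(int))
    for quantity, meal, family in rows:
        totals[family][meal] += int(quantity)
    return totals

totals = total_orders(csv.reader(sample))
for family, meals in totals.items():
    print(family)
    for meal, total in meals.items():
        print(f"{total} {meal}")
```

On Python 3.7 and later, dictionaries keep insertion order, so the families print in the order they first appear in the file: Martins, Johnsons, Smiths.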

First time replier so here's a go. Have you considered keeping track of the orders and then writing to a file? I tried something using a dict based approach and it seems to work fine. The idea was to index by the family name and store a list of pairs containing the order quantities and types.
You may also want to consider the readability of your code; it's hard to follow and debug. However, what I think is happening is that the line
for row in reader_sorting:
iterates through reader_sorting. You read the 1st line, extract the family name, and then proceed to iterate over reader_sorting again. This time you start at the 2nd line, whose family name matches, and you print it successfully. The rest of the lines don't match, but you still iterate through them all. Now you've finished iterating through reader_sorting, and the loop finishes, even though the outer loop has read only one line.
One solution may be to create another iterator in the outer for loop, so that you don't exhaust the iterator the outer loop goes through. However, you would then still need to deal with the possibility of double counting, or keep track of indices. Another way may be to keep track of the orders by family as you iterate.
import csv
orders = {}
with open('needs_sorting.csv') as file:
    needs_sorting = csv.reader(file)
    for amount, meal, family in needs_sorting:
        if family not in orders:
            orders[family] = []
        orders[family].append((amount, meal))
with open('orders.txt', 'a') as file:
    for family in orders:
        file.write('%s\n' % family)
        for amount, meal in orders[family]:
            file.write('%s %s\n' % (amount, meal))
