Identifying keys with multiple values in a hash table

Identifying keys with multiple values in a hash table - python

I am a beginner in Python scripting.
I have a CSV file which has 5 columns and over 1000 rows. I am attaching a screenshot to give an idea of what the file looks like. (I have included only 4 rows, but the real file has over 1000 rows.) The task I am trying to achieve is this:
I need to write an output CSV file containing the rows of the original CSV file, subject to the following conditions.
Each "number" field (column 1) is supposed to have just one "name" field associated with it. If it has more than one name associated with it, the program must throw an error (or display a message next to the number in output.csv).
If a number field has just one name associated with it, simply print the entire row.
The data in CSV file is in the below format.
Number Name Choices
11234 ABCDEF A1B6N5
11234 ABCDEF A2B6C4
11234 EFGHJK A4F2
11235 ABCDEF A3F5H7
11236 MNOPQR F3D4D5
So my expected output should look something like this. Flag and Message should be displayed only when a "number" has more than one "name" associated with it.
If a "name" is associated with more than one "number", it should not be flagged (for example, 11235 has the same name as 11234, but is not flagged).
Number Name Choices Flag Message
11234 1 More than 1 name
11234
11234
11235 ABCDEF A3F5H7
11236 MNOPQR F3D4D5
I do understand that this can be implemented as a hashtable, where the number serves as a key and the name serves as value. If the value count is more than 1 for any key, we can probably set a flag and print the error message accordingly.
But could someone help me get started with this? As in, how do I implement this in Python?
Any help is appreciated.
Thanks!

Here are a few concepts you should learn and understand first:
Importing and Exporting CSV: https://docs.python.org/2/library/csv.html
Counter: https://docs.python.org/2/library/collections.html#collections.Counter
or
Defaultdict(int) for counting: https://docs.python.org/2/library/collections.html#collections.defaultdict
It sounds like you need column1 to be the key of a dictionary. If you're trying to count how many times it appears (that's not clear), then you can use names = defaultdict(int); names[key]+=1
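As a rough sketch of the counting idea, here are both defaultdict(int) and Counter applied to the first column of the sample rows from the question. The list `numbers` is typed in by hand for illustration; in practice it would come from the CSV reader.

```python
from collections import Counter, defaultdict

# Column 1 of the sample rows in the question (hand-copied for illustration)
numbers = ["11234", "11234", "11234", "11235", "11236"]

# defaultdict(int) starts every missing key at 0, so += 1 just works
names = defaultdict(int)
for key in numbers:
    names[key] += 1

# Counter does the same counting in one call
counts = Counter(numbers)

print(dict(names))
print(dict(counts))
```

Note that this counts rows per number, not distinct names per number, which is why it may not be what you are after.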
If all you want is to remove the duplicates with no counting or crash if there's a duplicate, then here's what you can do:
import csv

mydict = {}
with open('yourfile.csv', mode='r', newline='') as infile:
    reader = csv.reader(infile)
    for row in reader:
        key = row[0]
        if key in mydict:
            # Could handle this separately instead of stopping
            print("Bad key, already found: %s. Offending row: %s" % (key, row))
            raise ValueError("Element already found: %s" % key)
        mydict[key] = row

# Write to a different file so the input is not truncated while it is being read
with open('output.csv', mode='w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(mydict.values())
If this doesn't work, please give us sample input and expected output. Either way, this should get you started. Also, be patient: you'll learn most by doing things wrong and figuring out why they are wrong. Good luck!
====
Update:
You have a few choices. The easiest for a beginner is probably to build two dictionaries and then output them.
Use key = row[1]
If key is already in the dictionary, remove it (del mydict[key]) and add it to the other dict multiple_dict = {}; multiple_dict[key] = [number, None, None, Data, Message]
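mydict = {}
multiple_dict = {}

def proc_entry(row):
    key = row[1]
    if key in mydict:
        # Second sighting: move the key out of mydict and flag it
        multiple_dict[key] = [key, None, None, 1, "Message"]
        del mydict[key]
    elif key in multiple_dict:
        # Key was already duplicated, increase flag?
        multiple_dict[key][3] += 1
    else:
        # First sighting: save the row
        mydict[key] = row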
At this point, your code is getting complicated enough to use things like:
number, name, value = row, and splitting your code into functions. Then you should test the functions with known input to see if the output is as expected.
i.e. pre-load "mydict", then call your processing function and see how it worked. Even better? Learn to write simple unit tests :) .
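To make the idea concrete, here is a minimal two-pass sketch run against the sample rows from the question. It assumes the data is comma-separated with a header line, and it flags every row of an offending number rather than only the first one, so treat it as a starting point and not as the exact expected output. It collects the distinct names per number in a defaultdict(set) instead of using the two-dict approach above.

```python
import csv
import io
from collections import defaultdict

# Sample rows from the question, written as comma-separated text (assumed layout)
sample = """Number,Name,Choices
11234,ABCDEF,A1B6N5
11234,ABCDEF,A2B6C4
11234,EFGHJK,A4F2
11235,ABCDEF,A3F5H7
11236,MNOPQR,F3D4D5
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# First pass: collect the distinct names seen for each number
names_by_number = defaultdict(set)
for row in rows:
    names_by_number[row["Number"]].add(row["Name"])

# Second pass: flag rows whose number has more than one distinct name
output = []
for row in rows:
    if len(names_by_number[row["Number"]]) > 1:
        output.append([row["Number"], "", "", 1, "More than 1 name"])
    else:
        output.append([row["Number"], row["Name"], row["Choices"], "", ""])

for line in output:
    print(line)
```

Because the checks are plain comparisons on small known input, the same snippet doubles as a first test: feed it rows whose answer you already know and compare.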
While we could write it for you, that's not the spirit of Stackoverflow. If you have any more questions, you might want to split this into precise questions that haven't been answered already. Everything I mentioned above could have been found on Stackoverflow and a bit of practice. Knowing what solution to go for is then the art of programming! Have fun ...or hire a programmer if this isn't fun to you!

Related

can someone explain to me what I did wrong?

I need help unscrambling this code. I am only allowed to use these specific lines of code, but I need to 'unscramble' it to make it work. To me, this code looks good but I don't seem to get it to work so I would like to find out why this is the case.
The assignment that I am trying to solve is as follows:
Read in the file using the csv reader and build a dictionary with the tree species as the key and a count of the number of times the tree appears. Use the "in" operator to see if a tree has been added, and if not set it to 1.
Print the dictionary with the counts at the end.
My code is as follows:
from BrowserFile import open as _
import csv
with open("treeinventory.csv", "r", newline='') as f:
    count = {}
    reader = csv.reader(f)
    for yard in reader:
        for tree in yard:
            if tree in count:
                count[tree] = 1
            else:
                count[tree] = count[tree] + 1
print(count)
I would love if someone can help me and also explain why this code is not able to work as it is, i am trying to learn and this would be very helpful!
thank you!

Generally, we don't solve "homework" problems on SO. You should also try to ask specific questions. Also put better titles on your questions. And, as such, I always like to post This to help new question askers out.
Since I'm here: The answer to your assignment is that line 9 and line 11 are swapped.
This is because, as written, the logic sets count[tree] to 1 if the key is already in the dict, and adds 1 to the value stored at count[tree] if it is not in the dict. The second case raises a KeyError when count[tree] is read for the addition in count[tree] + 1, because there is no value there yet.
Of course, without the input file, I can't actually run the code to verify it, so please try this out for yourself and update your question with specific issues if any come up.
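For reference, a small sketch of the loop with the two assignments in the right places. The inline sample text is a made-up stand-in for treeinventory.csv, and the course-specific BrowserFile import is left out.

```python
import csv
import io

# Hypothetical stand-in for treeinventory.csv: one yard per line
sample = io.StringIO("oak,maple\noak,pine\n")

count = {}
for yard in csv.reader(sample):
    for tree in yard:
        if tree in count:
            # Seen before: increment the existing count
            count[tree] = count[tree] + 1
        else:
            # First sighting: start the count at 1
            count[tree] = 1
print(count)
```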

Printing Dictionary Key if Values were found in CSV List

I'm pretty new to python, so forgive me if this is a long explanation to a simple problem. I need some help in understanding how to use a dictionary to find matches from csv list, then print the key in a reporting type output.
Goal: I have a list of clear text privacy data like Social Security Numbers. I need to compare the hash of that clear text and at the same time, obfuscate the clear text to the last 4 digits (XXX-XX-1245). If there is a match from my clear text hash, to a hash I already have in a CSV lookup, I do a mini report linking demographic information of who the found hash might belong to. Also, because nothing is easy, in the mini report needs to print the obfuscated SPI value.
The output should look like this if the hash I just generated matches the hash in column 2 of my spreadsheet:
user@gmail.com Full Name Another Full Name xxx-xx-1234 location1 location2
Problem: All of the hash, obfuscation, and matching is done and stored in lists and works correctly. I need help figuring out how to print the key from the dictionary with my other columns below without printing the whole set each time in the for-loop.
This works outside of my reader:
for i in hashes_ssnxxxx:
    print(i)
but I do not know how to take that value and put it in my print statement inside of the reader.
clear_text_hash = [] #Where Hash of clear text value found is stored
obfuscate_xxxxssn = [] #Where obfuscated SPI found by using re.sub is stored
#Zip them in a dictionary to keep the two related
hashes_ssnxxxx = dict(zip(obfuscate_xxxxssn,clear_text_hash))
book_of_record = open('path\to\bookofrecord.csv', 'rt', encoding='UTF-8')
a1 = csv.reader(book_of_record, delimiter=',')
for row in a1:
    hashes = row[2]
    if hashes in hashes_ssnxxxx.values():
        print(row[16], row[6], hashes_ssnxxxx.keys(), row[13], row[35], row[18], row[43])
UPDATE [Solved]
Using the list comprehension suggested by @tianhua liao, all it needed was:
if hashes in hashes_ssnxxxx.values():
    obfuscate = [k for k,v in hashes_ssnxxxx.items() if hashes == v]
    print(row[16], obfuscate, row[6], row[13], row[35], row[18], row[43])

Actually, I am not sure what your problem really is. It would help if you could give us some simple examples of hashes_ssnxxxx and hashes.
Here are some guessed answers.
Once the check if hashes in hashes_ssnxxxx.values(): succeeds, you want to print only the matching key from hashes_ssnxxxx.keys() instead of all of them.
Maybe you could use a list comprehension to do that simply, like this:
[key for key,vals in hashes_ssnxxxx.items() if hashes == vals]
The output of that code is a list. If you want to make it more readable, you could take the first element with [0] or use ','.join() when printing it.
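A small self-contained sketch of that lookup follows. The dictionary contents and the value of `hashes` are invented placeholders. The inverted dictionary at the end is an optional alternative, not part of the original suggestion, and only makes sense if each hash maps to a single obfuscated value.

```python
# Hypothetical sample data: obfuscated value -> hash of the clear text
hashes_ssnxxxx = {
    "XXX-XX-1234": "hash_a",
    "XXX-XX-5678": "hash_b",
}
hashes = "hash_b"  # stands in for row[2] read from the CSV

# Keys whose value equals the hash found in the current row
matches = [key for key, val in hashes_ssnxxxx.items() if val == hashes]

first = matches[0] if matches else None   # single value, or None if no match
joined = ",".join(matches)                # readable string for printing

# Alternative: invert the dict once so each row is a direct lookup
# (assumes every hash value is unique)
ssn_by_hash = {val: key for key, val in hashes_ssnxxxx.items()}

print(first, joined, ssn_by_hash.get(hashes))
```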

Python--adding list into dict (beginner)

I'm very new to programming (taking my first class in it now), so bear with me for format issues and misunderstandings, or missing easy fixes.
I have a dict with tweet data: 'user' as keys and then 'text' as their values. My goal here is to find the tweets where they are replying to another user, signified by starting with the # symbol, and then make a new dict that contains the author's user and the users of everyone he replied to. That's the fairly simple if statement I have below. I was also able to use the split function to isolate the username of the person they are replying to (the function takes all the text between the # symbol and the next space after it).
st='#'
en=' '
task1dict={}
for t in a,b,c,d,e,f,g,h,i,j,k,l,m,n:
    if t['text'][0]=='#':
        user=t['user']
        repliedto=t['text'].split(st)[-1].split(en)[0]
        task1dict[user]=[repliedto]
Username1 replied to username2. Username2 replied to both username3 and username5.
I am trying to create a dict (called tweets1) that reads something like:
'user':'repliedto'
username1:[username2]
username2:[username3, username5]
etc.
Is there a better way to isolate the usernames, and then put them into a new dict? Here's a 2 entry sample of the tweet data:
{"user":"datageek88","text":"#sundevil1992 good question! #joeclarknet Is this on the exam?"},
{"user":"joeclarkphd","text":"Exam questions will be answered in due time #sundevil1992"}
I am now able to add them to a dict, but it would only save one 'repliedto' for each 'user', so instead of showing username2 have replied to both 3 and 5, it just shows the latest one, 5:
{'username1': ['username2'],
'username2': ['username5']}
Again, if I'm making a serious no-no anywhere in here, I apologize, and please show me what I'm doing wrong!

Modify the last line to
task1dict.setdefault(user, [])
task1dict[user].append(repliedto)
You were overwriting the user's replied-to list each time you assigned to it. The setdefault method stores an empty list under the key if the key doesn't already exist. Then you just append to the list.
EDIT: same code using a set for uniqueness.
task1dict.setdefault(user, set())
task1dict[user].add(repliedto)
With a set you add an element, whereas with a list you append to it.
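A minimal sketch of the setdefault pattern in isolation. The (user, repliedto) pairs are hypothetical and stand in for whatever the loop over the tweets extracts.

```python
# Hypothetical (user, repliedto) pairs standing in for the parsed tweets
replies = [
    ("username1", "username2"),
    ("username2", "username3"),
    ("username2", "username5"),
]

task1dict = {}
for user, repliedto in replies:
    # setdefault returns the existing list, or stores and returns a new one
    task1dict.setdefault(user, []).append(repliedto)

print(task1dict)
```

Since setdefault returns the stored value, the two lines from the answer can be chained into one, as done here.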

I might do it like this. Use the following regular expression to identify all usernames.
r"#([^\s]*)"
It means: look for the # symbol, then capture all the characters that follow it up to the next whitespace. A defaultdict is simply a dictionary that creates a default value if the key isn't found. In this case, I specify an empty set as the default in the event that we are adding a new key.
import re
from collections import defaultdict

tweets = [{"user":"datageek88","text":"#sundevil1992 good question! #joeclarknet Is this on the exam?"},
          {"user":"joeclarkphd","text":"Exam questions will be answered in due time #sundevil1992"}]
from_to = defaultdict(set)
for tweet in tweets:
    if "#" in tweet['text']:
        user = tweet['user']
        for replied_to in re.findall(r"#([^\s]*)", tweet['text']):
            from_to[user].add(replied_to)
print(from_to)
Output (the order of elements within each set may vary)
defaultdict(<class 'set'>, {'datageek88': {'sundevil1992', 'joeclarknet'},
'joeclarkphd': {'sundevil1992'}})

Python, make a list for a for loop when parsing a CSV

I'm working on parsing a CSV from an export of my company's database. This slimmed-down version has around 15 columns; the actual CSV has over 400 columns of data (all necessary). The below works perfectly:
inv = csv.reader(open('inventory_report.txt', 'rU'), dialect='excel', delimiter="\t")
for PART_CODE,MODEL_NUMBER,PRODUCT_NAME,COLOR,TOTAL_ONHAND,TOTAL_ON_ORDER,TOTAL_SALES,\
        SALES_YEAR_TO_DATE,SALES_LASTYEAR_TO_DATE,TOTAL_NUMBER_OF_QTYsSOLD,TOTAL_PURCHASES,\
        PURCHASES_YEAR_TO_DATE,PURCHASES_LASTYEAR_TO_DATE,TOTAL_NUMBER_OF_QTYpurchased,\
        DATE_LAST_SOLD,DATE_FIRST_SOLD in inv:
    print ('%-20s %-90s OnHand: %-10s OnOrder: %-10s') % (MODEL_NUMBER,PRODUCT_NAME,\
        TOTAL_ONHAND,TOTAL_ON_ORDER)
As you can already tell, it will be very painful to read when the 'for' loop has 400+ names attached to it, one for each item of the row in the CSV. Annoying as it is, this method is very handy for getting at the output I'm after. I can easily get specific items and perform calculations using the common names we're already familiar with in our point of sale database.
I've been attempting to make this more readable. Trying to figure out a way where I could define a list of all these names in the for loop but still be able to call for them by name when it's time to do calculations and print the output.
Any thoughts?

You can use csv.DictReader, which reads each row as a dict. This assumes the first line of the file holds the column names.
import csv
inv = csv.DictReader(open('file.csv'))
for i in inv:
    print('%-20s %-90s OnHand: %-10s OnOrder: %-10s' % (i['MODEL_NUMBER'],i['PRODUCT_NAME'],i['TOTAL_ONHAND'],i['TOTAL_ON_ORDER']))
If you want the column names, such as 'MODEL_NUMBER', to come from a list, define a list with all the column names. Assuming l = ['MODEL_NUMBER','PRODUCT_NAME','TOTAL_ONHAND','TOTAL_ON_ORDER'], the print statement in the code above becomes:
    print('%-20s %-90s OnHand: %-10s OnOrder: %-10s' % (i[l[0]],i[l[1]],i[l[2]],i[l[3]]))
Code not checked.. :)

To make your code more readable and easier to reuse, you should read the names of the columns dynamically. CSV files usually have a header with this information at the top of the file, so you can read the first line and store it in a tuple or a list.
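A minimal sketch of reading the header first and then addressing columns by name. The tab-delimited sample text is invented to stand in for inventory_report.txt and uses only a handful of the column names from the question.

```python
import csv
import io

# Hypothetical tab-delimited sample standing in for inventory_report.txt
sample = io.StringIO(
    "PART_CODE\tMODEL_NUMBER\tPRODUCT_NAME\tTOTAL_ONHAND\tTOTAL_ON_ORDER\n"
    "P1\tM-100\tWidget\t5\t2\n"
    "P2\tM-200\tGadget\t0\t7\n"
)

reader = csv.reader(sample, delimiter="\t")
header = next(reader)  # the first line holds the column names

# Pair each row with the header so values can be looked up by name
rows = [dict(zip(header, row)) for row in reader]

for row in rows:
    print('%-20s %-30s OnHand: %-10s OnOrder: %-10s' % (
        row['MODEL_NUMBER'], row['PRODUCT_NAME'],
        row['TOTAL_ONHAND'], row['TOTAL_ON_ORDER']))
```

csv.DictReader does the same header handling for you; the manual version is shown only to make the step visible.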

Search a single column for a particular value in a CSV file and return an entire row

Issue
The code does not correctly identify the input (item). It simply dumps to my failure message even if such a value exists in the CSV file. Can anyone help me determine what I am doing wrong?
Background
I am working on a small program that asks for user input (function not given here), searches a specific column in a CSV file (Item) and returns the entire row. The CSV data format is shown below. I have shortened the data from the actual amount (49 field names, 18000+ rows).
Code
import csv
from collections import namedtuple
from contextlib import closing

def search():
    item = 1000001
    raw_data = 'active_sanitized.csv'
    failure = 'No matching item could be found with that item code. Please try again.'
    check = False
    with closing(open(raw_data, newline='')) as open_data:
        read_data = csv.DictReader(open_data, delimiter=';')
        item_data = namedtuple('item_data', read_data.fieldnames)
        while check == False:
            for row in map(item_data._make, read_data):
                if row.Item == item:
                    return row
                else:
                    return failure
CSV structure
active_sanitized.csv
Item;Name;Cost;Qty;Price;Description
1000001;Name here:1;1001;1;11;Item description here:1
1000002;Name here:2;1002;2;22;Item description here:2
1000003;Name here:3;1003;3;33;Item description here:3
1000004;Name here:4;1004;4;44;Item description here:4
1000005;Name here:5;1005;5;55;Item description here:5
1000006;Name here:6;1006;6;66;Item description here:6
1000007;Name here:7;1007;7;77;Item description here:7
1000008;Name here:8;1008;8;88;Item description here:8
1000009;Name here:9;1009;9;99;Item description here:9
Notes
My experience with Python is relatively little, but I thought this would be a good problem to start with in order to learn more.
I determined the methods to open (and wrap in a close function) the CSV file, read the data via DictReader (to get the field names), and then create a named tuple to be able to quickly select the desired columns for the output (Item, Cost, Price, Name). Column order is important, hence the use of DictReader and namedtuple.
While there is the possibility of hard-coding each of the field names, I felt that if the program can read them on file open, it would be much more helpful when working on similar files that have the same column names but different column organization.
Research
CSV Header and named tuple:
What is the pythonic way to read CSV file data as rows of namedtuples?
Converting CSV data to tuple: How to split a CSV row so row[0] is the name and any remaining items are a tuple?
There were additional links of research, but I cannot post more than two.

You have three problems with this:
You return on the first failure, so it will never get past the first line.
You are reading strings from the file, and comparing to an int.
_make iterates over the dictionary keys, not the values, producing the wrong result (item_data(Item='Name', Name='Price', Cost='Qty', Qty='Item', Price='Cost', Description='Description')).
for row in (item_data(**data) for data in read_data):
    if row.Item == str(item):
        return row
return failure
This fixes the issues at hand - we check against a string, and we only return the failure message if none of the items matched (although you might want to convert the strings in the data to ints rather than rely on this hackish fix for the string/int issue).
I have also changed the way you are looping - using a generator expression makes for a more natural syntax, using the normal construction syntax for named attributes from a dict. This is cleaner and more readable than using _make and map(). It also fixes problem 3.
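Put together, the corrected search might look like the sketch below. It takes the item code and an already-open file object as parameters, which is a change made here purely so the example can run on inline text; the sample is a shortened copy of the CSV from the question.

```python
import csv
import io
from collections import namedtuple

# Shortened copy of active_sanitized.csv from the question
SAMPLE = (
    "Item;Name;Cost;Qty;Price;Description\n"
    "1000001;Name here:1;1001;1;11;Item description here:1\n"
    "1000002;Name here:2;1002;2;22;Item description here:2\n"
)

def search(item, open_data):
    failure = 'No matching item could be found with that item code. Please try again.'
    read_data = csv.DictReader(open_data, delimiter=';')
    item_data = namedtuple('item_data', read_data.fieldnames)
    # Build each namedtuple from the row dict by keyword, so fields line up by name
    for row in (item_data(**data) for data in read_data):
        if row.Item == str(item):
            return row
    # Only reached when no row matched
    return failure

found = search(1000002, io.StringIO(SAMPLE))
missing = search(42, io.StringIO(SAMPLE))
print(found)
print(missing)
```

Note that item_data(**data) relies on every column name being a valid Python identifier, which holds for the headers shown in the question.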
