I'm pretty new to Python, so forgive me if this is a long explanation of a simple problem. I need some help in understanding how to use a dictionary to find matches from a CSV list, then print the key in a reporting-type output.
Goal: I have a list of clear-text privacy data like Social Security Numbers. I need to compute the hash of that clear text and, at the same time, obfuscate the clear text down to the last 4 digits (XXX-XX-1245). If there is a match between my clear-text hash and a hash I already have in a CSV lookup, I do a mini report linking demographic information of who the found hash might belong to. Also, because nothing is easy, the mini report needs to print the obfuscated SPI value.
The output should look like this if the hash I just generated matches the hash in column 2 of my spreadsheet:
user#gmail.com Full Name Another Full Name xxx-xx-1234 location1 location2
Problem: All of the hash, obfuscation, and matching is done and stored in lists and works correctly. I need help figuring out how to print the key from the dictionary with my other columns below without printing the whole set each time in the for-loop.
This works outside of my reader:
for i in hashes_ssnxxxx:
    print(i)
but I do not know how to take that value and put it in my print statement inside of the reader.
import csv

clear_text_hash = []    # where the hash of each clear-text value found is stored
obfuscate_xxxxssn = []  # where the obfuscated SPI found via re.sub is stored

# Zip them into a dictionary to keep the two related
hashes_ssnxxxx = dict(zip(obfuscate_xxxxssn, clear_text_hash))

# Raw string so the backslashes in the path aren't treated as escapes
book_of_record = open(r'path\to\bookofrecord.csv', 'rt', encoding='UTF-8')
a1 = csv.reader(book_of_record, delimiter=',')
for row in a1:
    hashes = row[2]
    if hashes in hashes_ssnxxxx.values():
        print(row[16], row[6], hashes_ssnxxxx.keys(), row[13], row[35], row[18], row[43])
UPDATE [Solved]
Using the list comprehension suggested by @tianhua liao, all it needed was:
if hashes in hashes_ssnxxxx.values():
    obfuscate = [k for k, v in hashes_ssnxxxx.items() if hashes == v]
    print(row[16], obfuscate, row[6], row[13], row[35], row[18], row[43])
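For what it's worth, if the lookup always goes from hash to obfuscated value, the dict can be inverted once so each row is an O(1) lookup instead of a scan over .items(). A small sketch with made-up hash values, not the real data:

```python
# Made-up values standing in for the real obfuscated SSNs and hashes
hashes_ssnxxxx = {
    "xxx-xx-1234": "hash_a",
    "xxx-xx-9999": "hash_b",
}

# Invert once: hash -> obfuscated SSN (assumes each hash appears only once)
hash_to_obfuscated = {v: k for k, v in hashes_ssnxxxx.items()}

row_hash = "hash_a"
if row_hash in hash_to_obfuscated:
    print(hash_to_obfuscated[row_hash])  # xxx-xx-1234
```

With 18000+ rows this avoids re-scanning the whole dict for every CSV line.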
Actually, I am not sure what your problem really is. If you could give us some simple examples of hashes_ssnxxxx and hashes, that would help.
Here is a guessed answer.
After the check if hashes in hashes_ssnxxxx.values(): succeeds, you want to print only the matching key(s) from hashes_ssnxxxx.keys() instead of all of them.
You could use a list comprehension to do that simply, like:
[key for key, val in hashes_ssnxxxx.items() if hashes == val]
The output of that code is a list. If you want to make it more readable, you may need to use an index [0] or ','.join() to print it.
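A minimal, self-contained illustration of that comprehension plus ','.join(), using made-up values:

```python
# Made-up dict: obfuscated SSN -> hash (two keys deliberately share a hash)
hashes_ssnxxxx = {
    "xxx-xx-1234": "abc123",
    "xxx-xx-5678": "def456",
    "xxx-xx-0000": "abc123",
}
hashes = "abc123"

# Collect every key whose value matches the hash we just computed
matches = [key for key, val in hashes_ssnxxxx.items() if val == hashes]
print(", ".join(matches))  # xxx-xx-1234, xxx-xx-0000
```

Note that a list comes back even when there is only one match, which is why the answer mentions [0] or ','.join() for printing.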
I am a beginner in Python scripting.
I have a CSV file which has 5 columns and over 1000 rows. I am attaching a screenshot to give an idea of what the file looks like. (I have included only 4 rows, but the real file has over 1000 rows.) So the task I am trying to achieve is this:
I need to produce an output CSV file which contains the rows of the original CSV file, based on the following conditions.
Each "number" field (column1) is supposed to have just one "name" field associated with it. If it has more than one name fields associated with it, it must throw an error (or display a message next to the number in the output.csv)
If a number field has just one name associated with it, simply print the entire row.
The data in CSV file is in the below format.
Number Name Choices
11234 ABCDEF A1B6N5
11234 ABCDEF A2B6C4
11234 EFGHJK A4F2
11235 ABCDEF A3F5H7
11236 MNOPQR F3D4D5
So my expected output should look something like this. Flag and Message should be displayed only when a "number" has more than one "name" associated with it.
If a "name" has been associated to more than one "number" it should not be flagged. (like 11235 had same name as 11234, but not flagged).
Number Name Choices Flag Message
11234 1 More than 1 name
11234
11234
11235 ABCDEF A3F5H7
11236 MNOPQR F3D4D5
I do understand that this can be implemented as a hashtable, where the number serves as a key and the name serves as value. If the value count is more than 1 for any key, we can probably set a flag and print the error message accordingly.
But could someone help me get started with this? As in, how do I implement this in Python?
Any help is appreciated.
Thanks!
Here are a few concepts you should learn and understand first:
Importing and Exporting CSV: https://docs.python.org/2/library/csv.html
Counter: https://docs.python.org/2/library/collections.html#collections.Counter
or
Defaultdict(int) for counting: https://docs.python.org/2/library/collections.html#collections.defaultdict
It sounds like you need column1 to be the key of a dictionary. If you're trying to count how many times it appears (that's not clear), then you can use names = defaultdict(int); names[key]+=1
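A short sketch of that counting idea with invented (number, name) pairs mirroring the question's sample rows. Here defaultdict(set) tracks the distinct names per number, which is what the flag condition actually needs (repeated identical names should not trigger the flag):

```python
from collections import defaultdict

# Invented rows mirroring the question's sample data
rows = [
    ("11234", "ABCDEF"),
    ("11234", "ABCDEF"),
    ("11234", "EFGHJK"),
    ("11235", "ABCDEF"),
    ("11236", "MNOPQR"),
]

# number -> set of distinct names seen for it
names = defaultdict(set)
for number, name in rows:
    names[number].add(name)

# Flag every number that has more than one distinct name
flagged = [num for num, ns in names.items() if len(ns) > 1]
print(flagged)  # ['11234']
```

Note 11235 is not flagged even though it reuses ABCDEF, matching the question's requirement that a name shared across numbers is fine.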
If all you want is to remove the duplicates with no counting or crash if there's a duplicate, then here's what you can do:
import csv

mydict = {}
with open('yourfile.csv', mode='r') as infile:
    reader = csv.reader(infile)
    for row in reader:
        key = row[0]
        if key in mydict:
            # Could handle this separately
            print "Bad key, already found: %s. Ignoring row: %s" % (key, row)
            raise ValueError("Element already found: %s" % key)
        mydict[key] = row

# Write to a separate file -- reading and writing the same file at once would clobber it
with open('output.csv', mode='w') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(mydict.values())
If this doesn't work, please give us sample input and expected output. Either way, this should get you started. Also, be patient: you'll learn most by doing things wrong and figuring out why they are wrong. Good luck!
====
Update:
You have a few choices. The easiest for a beginner is probably to build two dicts and then output them.
Use key = row[1].
If the key is already in the dictionary, remove it (del mydict[key]) and add it to the other dict: multiple_dict = {}; multiple_dict[key] = [number, None, None, Data, Message]
def proc_entry(row):
    key = row[1]
    if key in mydict:
        # Save the existing data
        multiple_dict[key] = [key, None, None, 1, "Message"]
        del mydict[key]
    elif key in multiple_dict:
        # Key was already duplicated, increase the flag count (index 3)
        multiple_dict[key][3] += 1
At this point, your code is getting complicated enough to use things like:
number, name, value = row, and splitting your code into functions. Then you should test the functions with known input to see if the output is as expected.
i.e. pre-load "mydict", then call your processing function and see how it worked. Even better? Learn to write simple unit tests :) .
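For instance, a tiny function plus a bare-assertion test on known input (made-up data, not the asker's file) might look like this:

```python
def count_duplicate_names(rows):
    """Return the set of numbers that appear with more than one name."""
    seen = {}
    duplicates = set()
    for number, name in rows:
        # A number seen before with a *different* name is a duplicate
        if number in seen and seen[number] != name:
            duplicates.add(number)
        seen.setdefault(number, name)
    return duplicates

# A minimal unit test: known input, expected output
sample = [("11234", "ABCDEF"), ("11234", "EFGHJK"), ("11235", "ABCDEF")]
assert count_duplicate_names(sample) == {"11234"}
print("test passed")
```

Once a check like this passes, the same function can be pointed at the real CSV rows with confidence.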
While we could write it for you, that's not the spirit of Stackoverflow. If you have any more questions, you might want to split this into precise questions that haven't been answered already. Everything I mentioned above could have been found on Stackoverflow and a bit of practice. Knowing what solution to go for is then the art of programming! Have fun ...or hire a programmer if this isn't fun to you!
I have a server log file, linked here
my goal is to find "valid" server visits, and write a summary of the valid top-level domains to an output file.
The output file must be a .tsv file. An example file is linked here
I have gotten the results of my regex (which definitely still needs some tweaking) into a dictionary of dictionaries, in order to conform to the desired output. My problem is sorting the inner dictionary keys (the top-level domains) alphabetically, and then doing the same for the outer dictionary keys (the dates). Additionally, I'm not sure how to sneak the "\t" character in between each key/value pair.
I know that dictionaries are unsortable; I'm just having difficulty moving the data into a sortable format, and then writing the output in a style that conforms to the example file.
I have included my code thus far below:
import re

fhandle = open("access_log.txt", "rU")
access_log = fhandle.readlines()
validfile = open("valid.tsv", "w")
invalidfile = open("invalid.tsv", "w")
valid_list = list()
valid_dict = dict()
invalid_list = list()

# write results into respective log files
for line in access_log:
    valid = re.findall(r'(\d+/[a-zA-Z]+/\d+).*?(GET|POST)\s(http://|https://)([a-zA-Z]+)\.(\w+)\.((?<=com)\.[a-zA-Z]+|[a-zA-Z]+).*?(200)', line)
    if valid:
        date = valid[0][0]
        domain = valid[0][5]
        # write results into a 2-D dictionary (dictionary of dictionaries);
        # create the inner dict if needed, then always count the domain
        if date not in valid_dict:
            valid_dict[date] = {}
        if domain in valid_dict[date]:
            valid_dict[date][domain] += 1
        else:
            valid_dict[date][domain] = 1
    else:
        invalid_list.append(line)

for k, v in valid_dict.items():
    valid_list.append([k, v])

for key in sorted(valid_dict.iterkeys()):
    print key, valid_dict[key]

fhandle.close()
validfile.close()
invalidfile.close()
You are right, you cannot sort a dictionary. What could you use instead? Perhaps a list of keys?
In general when coding it helps to walk backwards from the solution. Your goal is to write a summary. What do you need for the summary? Stub it out:
def write_summary(data):
    # output summary
    print data
Now where would data come from? Start simple, add complexity.
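One possible way to stub it out further, assuming data is the dict-of-dicts from the question (the dates and counts here are invented). sorted() is the way around the "unsortable" dictionary:

```python
# Invented dates/domains standing in for the parsed log data
valid_dict = {
    "02/Jan/2020": {"com": 3, "org": 1},
    "01/Jan/2020": {"net": 2},
}

def write_summary(data):
    lines = []
    for date in sorted(data):  # sorted() handles the "unsortable" dict
        # date first, then alphabetized domain:count pairs, tab-separated
        parts = [date] + ["%s:%s" % (dom, data[date][dom])
                          for dom in sorted(data[date])]
        lines.append("\t".join(parts))
    return "\n".join(lines)

print(write_summary(valid_dict))
```

Returning a string rather than printing makes the stub easy to test before wiring it to the real file output.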
After some tinkering (and sleep), I have a solution that creates output that exactly matches the example output file.
after applying the regex function and adding valid entries into a 2d dictionary, the following code properly formats and writes the output file:
# Step 2
# Format the output file for TSV
# Ordered chronologically, with key:value pairs organized alphabetically by key (domain name)
date_workspace = ''
domain_workspace = ''
for date in sorted(valid_dict.iterkeys()):
    date_workspace += date + "\t"
    for domain_name in sorted(valid_dict[date].iterkeys()):
        domain_workspace += "%s:%s\t" % (domain_name, valid_dict[date][domain_name])
    date_workspace += domain_workspace
    date_workspace += "\n"
    domain_workspace = ''

# Step 3
# Write output
validfile.write(date_workspace)
for line in invalid_list:
    invalidfile.write(line)
Going line by line:
The loop first adds the date value to date_workspace, then immediately switches to iterating over the inner dictionary.
Wrapping .iterkeys() in sorted() yields the keys in alphabetical order. We can now append all of our key:value pairs as strings onto domain_workspace, along with the tab character, in order to create the tab separation.
Once this is done, we concatenate domain_workspace (the string containing all of the k:v pairs) onto date_workspace.
Lastly, we concatenate a newline character and clear out domain_workspace in order to ensure that each date value has the correct k:v pairs attached to it. At the end of the loop, the date_workspace variable will look something like
date_workspace = "date\tk:v\tk:v\tk:v\t\ndate\tk:v\tk:v\tk:v\t\n"
Is it the prettiest output? No. Does it do some things in a less elegant fashion than some available modules? Yes. However, it creates output that is formatted to the exact specifications of the example.
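For comparison, the same tab-separated output can be produced with csv.writer using a tab delimiter. A sketch with invented counts, written to an in-memory buffer so it is self-contained:

```python
import csv
import io

# Invented counts standing in for the parsed log data
valid_dict = {"02/Jan/2020": {"com": 3}, "01/Jan/2020": {"net": 2}}

out = io.StringIO()
writer = csv.writer(out, delimiter="\t")
for date in sorted(valid_dict):
    # One row per date: the date, then alphabetized domain:count pairs
    row = [date] + ["%s:%s" % (d, valid_dict[date][d])
                    for d in sorted(valid_dict[date])]
    writer.writerow(row)

print(out.getvalue())
```

In the real script, out would simply be the already-open validfile handle; csv.writer takes care of the tab separators and line endings.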
I'm working on parsing a CSV from an export of my company's database. This slimmed-down version has around 15 columns; the actual CSV has over 400 columns of data (all necessary). The below works perfectly:
inv = csv.reader(open('inventory_report.txt', 'rU'), dialect='excel', delimiter="\t")
for PART_CODE, MODEL_NUMBER, PRODUCT_NAME, COLOR, TOTAL_ONHAND, TOTAL_ON_ORDER, TOTAL_SALES, \
        SALES_YEAR_TO_DATE, SALES_LASTYEAR_TO_DATE, TOTAL_NUMBER_OF_QTYsSOLD, TOTAL_PURCHASES, \
        PURCHASES_YEAR_TO_DATE, PURCHASES_LASTYEAR_TO_DATE, TOTAL_NUMBER_OF_QTYpurchased, \
        DATE_LAST_SOLD, DATE_FIRST_SOLD in inv:
    print ('%-20s %-90s OnHand: %-10s OnOrder: %-10s') % (MODEL_NUMBER, PRODUCT_NAME,
                                                          TOTAL_ONHAND, TOTAL_ON_ORDER)
As you can already tell, it will be very painful to read when the for loop has 400+ names attached to it, one per column of each row in the CSV. However annoying, it is very handy for accessing the output I'm after: I can easily grab specific items and perform calculations using the column names we're already familiar with from our point-of-sale database.
I've been attempting to make this more readable: trying to figure out a way to define a list of all these names for the for loop, but still be able to refer to them by name when it's time to do calculations and print the output.
Any thoughts?
You can use csv.DictReader. Each element is read as a dict. This assumes the first line holds the column names.
inv = csv.DictReader(open('file.csv'))
for i in inv:
    print ('%-20s %-90s OnHand: %-10s OnOrder: %-10s') % (i['MODEL_NUMBER'], i['PRODUCT_NAME'], i['TOTAL_ONHAND'], i['TOTAL_ON_ORDER'])
And if you want i['MODEL_NUMBER'] to come from a list, define a list with all the column names. Assuming l = ['MODEL_NUMBER', 'PRODUCT_NAME', 'TOTAL_ONHAND', 'TOTAL_ON_ORDER'], the print statement in the above code becomes:
print ('%-20s %-90s OnHand: %-10s OnOrder: %-10s') % (i[l[0]], i[l[1]], i[l[2]], i[l[3]])
Code not checked.. :)
To make your code more readable and easier to reuse, you should read the names of the columns dynamically. CSV files usually have a header with this information at the top of the file, so you can read the first line and store it in a tuple or a list.
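A sketch of that idea, using a few of the question's column names over an invented two-line file held in memory:

```python
import csv
import io

# Invented two-line stand-in for the real export (first line is the header)
data = "PART_CODE\tMODEL_NUMBER\tPRODUCT_NAME\nP1\tM1\tWidget\n"

reader = csv.reader(io.StringIO(data), delimiter="\t")
header = next(reader)  # read the column names once
col = {name: i for i, name in enumerate(header)}

for row in reader:
    # Look fields up by name instead of unpacking 400+ variables
    print(row[col["MODEL_NUMBER"]], row[col["PRODUCT_NAME"]])
```

With the real file, io.StringIO(data) would just be the open file handle; the header-to-index dict keeps the readable names without the giant unpacking line.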
Issue
The code does not correctly identify the input (item). It simply dumps to my failure message even if such a value exists in the CSV file. Can anyone help me determine what I am doing wrong?
Background
I am working on a small program that asks for user input (function not given here), searches a specific column in a CSV file (Item) and returns the entire row. The CSV data format is shown below. I have shortened the data from the actual amount (49 field names, 18000+ rows).
Code
import csv
from collections import namedtuple
from contextlib import closing

def search():
    item = 1000001
    raw_data = 'active_sanitized.csv'
    failure = 'No matching item could be found with that item code. Please try again.'
    check = False
    with closing(open(raw_data, newline='')) as open_data:
        read_data = csv.DictReader(open_data, delimiter=';')
        item_data = namedtuple('item_data', read_data.fieldnames)
        while check == False:
            for row in map(item_data._make, read_data):
                if row.Item == item:
                    return row
                else:
                    return failure
CSV structure
active_sanitized.csv
Item;Name;Cost;Qty;Price;Description
1000001;Name here:1;1001;1;11;Item description here:1
1000002;Name here:2;1002;2;22;Item description here:2
1000003;Name here:3;1003;3;33;Item description here:3
1000004;Name here:4;1004;4;44;Item description here:4
1000005;Name here:5;1005;5;55;Item description here:5
1000006;Name here:6;1006;6;66;Item description here:6
1000007;Name here:7;1007;7;77;Item description here:7
1000008;Name here:8;1008;8;88;Item description here:8
1000009;Name here:9;1009;9;99;Item description here:9
Notes
My experience with Python is relatively little, but I thought this would be a good problem to start with in order to learn more.
I determined the methods to open (and wrap in a close function) the CSV file, read the data via DictReader (to get the field names), and then create a named tuple to be able to quickly select the desired columns for the output (Item, Cost, Price, Name). Column order is important, hence the use of DictReader and namedtuple.
While there is the possibility of hard-coding each of the field names, I felt that if the program can read them on file open, it would be much more helpful when working on similar files that have the same column names but different column organization.
Research
CSV Header and named tuple:
What is the pythonic way to read CSV file data as rows of namedtuples?
Converting CSV data to tuple: How to split a CSV row so row[0] is the name and any remaining items are a tuple?
There were additional links of research, but I cannot post more than two.
You have three problems with this:
You return on the first failure, so it will never get past the first line.
You are reading strings from the file, and comparing to an int.
_make iterates over the dictionary keys, not the values, producing the wrong result (item_data(Item='Name', Name='Price', Cost='Qty', Qty='Item', Price='Cost', Description='Description')).
for row in (item_data(**data) for data in read_data):
    if row.Item == str(item):
        return row
return failure
This fixes the issues at hand - we check against a string, and we only return if none of the items matched (although you might want to begin converting the strings to ints in the data rather than this hackish fix for the string/int issue).
I have also changed the way you are looping - using a generator expression makes for a more natural syntax, using the normal construction syntax for named attributes from a dict. This is cleaner and more readable than using _make and map(). It also fixes problem 3.
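As a sketch of that int conversion (using the question's field names, but only a one-row, in-memory sample rather than the real file):

```python
import csv
import io
from collections import namedtuple

# One-row, in-memory sample using the question's field names
data = ("Item;Name;Cost;Qty;Price;Description\n"
        "1000001;Name here:1;1001;1;11;Item description here:1\n")

read_data = csv.DictReader(io.StringIO(data), delimiter=';')
item_data = namedtuple('item_data', read_data.fieldnames)

def make_row(d):
    d['Item'] = int(d['Item'])  # convert once, so comparisons can use ints
    return item_data(**d)

rows = [make_row(d) for d in read_data]
match = next((r for r in rows if r.Item == 1000001), None)
print(match.Name)  # Name here:1
```

Converting at load time keeps every later comparison a plain int == int, instead of remembering str(item) at each call site.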
I'm trying to load a specific key from a dictionary, with keys like "character 1", "translation 1", etc. (I'm working a a flashcard program for my chinese studies). First, I load the dictionary flawlessly from a .txt file with
f = codecs.open(('%s.txt' % list_name),'r','utf-8')
quiz_list = eval(f.read())
Then, I want the program to print the list in order, so that I would get something along the lines of
"1. ('character 1' value) ('pinyin 1' value) ('translation 1' value)"
The program registers how many entries the list has and calculates the amount of chinese words it has to show (with each word having its own character, transcription and translation and entry number). Now, I want to load the first chinese word from the list, with the 3 keys "character 1", "pinyin 1" and "translation 1".
The tried-and-tested way of retrieving values from a dictionary is through my_dictionary[key]. However, if I were to insert the name of a variable in the part between brackets, python would read the name of this variable as the name of a key, and not use the value of the variable. Is there a way of doing the latter the right way? I have, for example, tried the following (obviously to no avail) to load key "character 1" from the list:
i = 1
character_to_load = "character %s" % str(i)
print quiz_list[character_to_load]
Any hints are extremely appreciated!
A more general solution to this problem, instead of flattening the data structure into a dictionary keyed by strings, is to use a better data structure. For instance, if you have a lot of keys that look like "translation n" for numbers n, you'd be better off making translation a dictionary keyed by numbers. You might even want to make the lookup go the other way: you could have a Word object (or whatever) which has properties translation, pinyin and character, and then have a list of Words.
You should build this data structure properly, instead of evaling a file. That's basically never a good idea, because it's horribly fragile: you're forcing the text file to be pure Python, but not writing it as a module. Instead, you should iterate over the lines in the file and build up the data structure as you go.
If you tell me the current structure of your file I can give you an example as to how to parse it properly.
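To illustrate the Word idea with a couple of invented entries (the field names and example words are assumptions, not the asker's file format):

```python
from collections import namedtuple

# Hypothetical structure -- one object per word instead of "character 1" keys
Word = namedtuple('Word', ['character', 'pinyin', 'translation'])

words = [
    Word('認爲', 'ren4wei2', 'to believe'),
    Word('學習', 'xue2xi2', 'to study'),
]

# Entry i is simply words[i - 1]; no string keys need to be built
for i, w in enumerate(words, start=1):
    print("%d. %s %s %s" % (i, w.character, w.pinyin, w.translation))
```

The numbering falls out of the list position for free, so there is nothing like "character %s" % i to construct at lookup time.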
If I got your question right, I believe you might have a bug in the code, as this works fine:
>>> d = {'translation 1': 'dilligent', 'pinyin 1': 'ren4wei2', 'character 1': '\xe8\xaa\x8d\xe7\x88\xb2'}
>>> key = "translation %s" % 1
>>> d[key]
'dilligent'
Does it help?