Creating proper TSV output from a dictionary of dictionaries - Python

I have a server log file, linked here
My goal is to find "valid" server visits, and write a summary of valid top-level domains to an output file.
The output file must be a .tsv file. An example file is linked here
I have gotten the results of my regex (which definitely still needs some tweaking) into a dictionary of dictionaries, in order to conform to the desired output. My problem is sorting the dictionary keys alphabetically by the inner dictionary keys (the top-level domains), and then doing the same for the outer dictionary keys (the dates). Additionally, I'm not sure how to sneak the "\t" character in between each key/value pair.
I know that dictionaries are unsortable; I'm just having difficulty moving the data into a sortable format, and then writing the output in a style that conforms to the example file.
I have included my code thus far below:
import re

fhandle = open("access_log.txt", "rU")
access_log = fhandle.readlines()
validfile = open("valid.tsv", "w")
invalidfile = open("invalid.tsv", "w")
valid_list = list()
valid_dict = dict()
invalid_list = list()
# write results into respective log files
for line in access_log:
    valid = re.findall(r'(\d+/[a-zA-Z]+/\d+).*?(GET|POST)\s(http://|https://)([a-zA-Z]+)\.(\w+)\.((?<=com)\.[a-zA-Z]+|[a-zA-Z]+).*?(200)', line)
    if valid:
        date = valid[0][0]
        domain = valid[0][5]
        # write results into a 2d dictionary (dictionary of dictionaries)
        if date not in valid_dict:
            valid_dict[date] = {}
        # count the domain even when the date entry was just created
        if domain in valid_dict[date]:
            valid_dict[date][domain] += 1
        else:
            valid_dict[date][domain] = 1
    else:
        invalid_list.append(line)
for k, v in valid_dict.items():
    valid_list.append([k, v])
for key in sorted(valid_dict.iterkeys()):
    print key, valid_dict[key]
fhandle.close()
validfile.close()
invalidfile.close()

You are right, you cannot sort a dictionary. What could you use instead? Perhaps a list of keys?
In general when coding it helps to walk backwards from the solution. Your goal is to write a summary. What do you need for the summary? Stub it out:
def write_summary(data):
    # output summary
    print data
Now where would data come from? Start simple, add complexity.
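For the sorting part, a plain list of keys is all you need; a tiny interpreter sketch with made-up data:
>>> d = {'b': 2, 'a': 1}
>>> keys = sorted(d)            # sorted() returns the keys as a sorted list
>>> [(k, d[k]) for k in keys]
[('a', 1), ('b', 2)]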

After some tinkering (and sleep), I have a solution that creates output that exactly matches the example output file.
After applying the regex and adding valid entries to the 2d dictionary, the following code properly formats and writes the output file:
# step 2
# format output file for tsv
# ordered chronologically, with Key:Value pairs organized alphabetically by key (Domain Name)
date_workspace = ''
domain_workspace = ''
for date in sorted(valid_dict.iterkeys()):
    date_workspace += date + "\t"
    for domain_name in sorted(valid_dict[date].iterkeys()):
        domain_workspace += "%s:%s\t" % (domain_name, valid_dict[date][domain_name])
    date_workspace += domain_workspace
    date_workspace += "\n"
    domain_workspace = ''
# step 3
# write output
validfile.write(date_workspace)
for line in invalid_list:
    invalidfile.write(line)
Going line by line...
The loop first adds the date value to date_workspace, then immediately switches to iterating over the inner dictionary.
Passing .iterkeys() to sorted() gives us the keys in sorted order. We can now append all of our key:value pairs as strings onto domain_workspace, each followed by the tab character that creates the tab separation.
Once this is done, we concatenate domain_workspace (our string containing all of the K:V pairs) onto date_workspace.
Lastly, we concatenate a newline character and clear out domain_workspace to ensure that each date has the correct K:V pairs attached to it. At the end of the loop, the date_workspace variable will look something like
date_workspace="date\tk:v\tk:v\tk:v\t\ndate\tk:v\tk:v\tk:v\t\n"
Is it the prettiest output? No. Does it do some things in a less elegant fashion than some available modules? Yes. However, it creates output that is formatted to the exact specifications of the example.
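For the record, the standard csv module can build the same kind of tab-separated rows. Here's a sketch assuming the same valid_dict structure; note it does not emit the trailing tab before each newline, so verify it against the example file:
import csv

with open('valid.tsv', 'wb') as out:  # binary mode for Python 2's csv module
    writer = csv.writer(out, delimiter='\t')
    for date in sorted(valid_dict):
        row = [date]
        for domain in sorted(valid_dict[date]):
            row.append('%s:%s' % (domain, valid_dict[date][domain]))
        writer.writerow(row)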

Related

How to read "well" from a file in python

I have to read a file that has always the same format.
As I know it has the same format, I can readline() and tokenize. But I guess there is a way to read it that is more, how to say it, "pretty to the eyes".
The file I have to read has this format :
Nom NMS-01
MAC AAAAAAAAAAA
UDPport 2019
TCPport 9129
I just want a different way to read it without having to tokenize, if that is possible.
Your question seems to imply that "tokenizing" is some kind of mysterious and complicated process. But in fact, the thing you are trying to do is exactly tokenizing.
Here is a perfectly valid way to read the file you show, break it up into tokens, and store it in a data structure:
def read_file_data(data_file_path):
    result = {}
    with open(data_file_path) as data_file:
        for line in data_file:
            # split on the first space only; strip the trailing newline first
            key, value = line.rstrip('\n').split(' ', maxsplit=1)
            result[key] = value
    return result
That wasn't complicated, it wasn't a lot of code, it doesn't need a third-party library, and it's easy to work with:
data = read_file_data('path/to/file')
print(data['Nom']) # prints "NMS-01"
Now, this implementation makes many assumptions about the structure of the file. Among other things, it assumes:
The entire file is structured as key/value pairs
Each key/value pair fits on a single line
Every line in the file is a key/value pair (no comments or blank lines)
The key cannot contain space characters
The value cannot contain newline characters
The same key does not appear multiple times in the file (or, if it does, it is acceptable for the last value given to be the only one returned)
Some of these assumptions may be false, but they are all true for the data sample you provided.
More generally: if you want to parse some kind of structured data, you need to understand the structure of the data and how values are delimited from each other. That's why common structured data formats like XML, JSON, and YAML (among many others!) were invented. Once you know the language you are parsing, tokenization is simply the code you write to match up the language with the text of your input.
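As a quick illustration (a hypothetical variant, not your actual file): if the same data were stored as JSON, the tokenizing is already done for you by the json module:
import json

text = '{"Nom": "NMS-01", "MAC": "AAAAAAAAAAA", "UDPport": 2019, "TCPport": 9129}'
data = json.loads(text)
print(data['Nom'])  # prints "NMS-01"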
Pandas does many magical things, so maybe that is prettier for you?
import pandas as pd
pd.read_csv('input.txt', sep=' ', header=None, index_col=0)
This gives you a dataframe that you can manipulate further:
0 1
Nom NMS-01
MAC AAAAAAAAAAA
UDPport 2019
TCPport 9129
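A small usage sketch: with header=None and index_col=0, the single remaining column is labeled 1, so values come out by label:
import pandas as pd

df = pd.read_csv('input.txt', sep=' ', header=None, index_col=0)
print(df.loc['Nom', 1])  # NMS-01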

Printing Dictionary Key if Values were found in CSV List

I'm pretty new to python, so forgive me if this is a long explanation to a simple problem. I need some help in understanding how to use a dictionary to find matches from csv list, then print the key in a reporting type output.
Goal: I have a list of clear text privacy data like Social Security Numbers. I need to compare the hash of that clear text and, at the same time, obfuscate the clear text down to the last 4 digits (XXX-XX-1245). If there is a match between my clear text hash and a hash I already have in a CSV lookup, I do a mini report linking demographic information of who the found hash might belong to. Also, because nothing is easy, the mini report needs to print the obfuscated SPI value.
output should look like this if hash I just generated, matches the hash of column 2 in my spreadsheet:
user#gmail.com Full Name Another Full Name xxx-xx-1234 location1 location2
Problem: All of the hash, obfuscation, and matching is done and stored in lists and works correctly. I need help figuring out how to print the key from the dictionary with my other columns below without printing the whole set each time in the for-loop.
This works outside of my reader:
for i in hashes_and_ssnxxxx:
    print(i)
but I do not know how to take that value and put it in my print statement inside of the reader.
clear_text_hash = []    # where hash of clear text value found is stored
obfuscate_xxxxssn = []  # where obfuscated SPI found by using re.sub is stored
# zip them into a dictionary to keep the two related
hashes_and_ssnxxxx = dict(zip(obfuscate_xxxxssn, clear_text_hash))
book_of_record = open(r'path\to\bookofrecord.csv', 'rt', encoding='UTF-8')
a1 = csv.reader(book_of_record, delimiter=',')
for row in a1:
    hashes = row[2]
    if hashes in hashes_and_ssnxxxx.values():
        print(row[16], row[6], hashes_and_ssnxxxx.keys(), row[13], row[35], row[18], row[43])
UPDATE [Solved]
Using the list comprehension suggested by @tianhua liao, all it needed was:
if hashes in hashes_and_ssnxxxx.values():
    obfuscate = [k for k, v in hashes_and_ssnxxxx.items() if hashes == v]
    print(row[16], obfuscate, row[6], row[13], row[35], row[18], row[43])
Actually, I am not sure what your problem really is. If you could give us some simple examples of hashes_and_ssnxxxx and hashes, that would help.
Here I just give some guessed answers.
After the check if hashes in hashes_and_ssnxxxx.values(): succeeds, you want to print the matching key from hashes_and_ssnxxxx.keys() instead of all of them.
Maybe you could use a list comprehension to do that simply. Just like
[key for key, vals in hashes_and_ssnxxxx.items() if hashes == vals]
The output of that code is a list. If you want to make it more readable, you may need to index it with [0] or join it with ','.join() when printing.
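Alternatively (a sketch, assuming each hash maps to a single obfuscated value), you can invert the dictionary once up front so every lookup is direct instead of rescanning .items() per row:
# invert {obfuscated: hash} into {hash: obfuscated}
ssn_by_hash = {v: k for k, v in hashes_and_ssnxxxx.items()}

for row in a1:
    hashes = row[2]
    if hashes in ssn_by_hash:
        print(row[16], ssn_by_hash[hashes], row[6], row[13], row[35], row[18], row[43])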

Python, make a list for a for loop when parsing a CSV

I'm working on parsing a CSV from an export of my company's database. This slimmed-down version has around 15 columns; the actual CSV has over 400 columns of data (all necessary). The below works perfectly:
inv = csv.reader(open('inventory_report.txt', 'rU'), dialect='excel', delimiter="\t")
for PART_CODE, MODEL_NUMBER, PRODUCT_NAME, COLOR, TOTAL_ONHAND, TOTAL_ON_ORDER, TOTAL_SALES, \
        SALES_YEAR_TO_DATE, SALES_LASTYEAR_TO_DATE, TOTAL_NUMBER_OF_QTYsSOLD, TOTAL_PURCHASES, \
        PURCHASES_YEAR_TO_DATE, PURCHASES_LASTYEAR_TO_DATE, TOTAL_NUMBER_OF_QTYpurchased, \
        DATE_LAST_SOLD, DATE_FIRST_SOLD in inv:
    print ('%-20s %-90s OnHand: %-10s OnOrder: %-10s') % (MODEL_NUMBER, PRODUCT_NAME,
                                                          TOTAL_ONHAND, TOTAL_ON_ORDER)
As you can already tell, it will be very painful to read when the for loop has 400+ names attached to it, one for each item of the row in the CSV. However annoying, it is very handy for accessing the output I'm after: I can easily get specific items and perform calculations using the names we're already familiar with from our point of sale database.
I've been attempting to make this more readable, trying to figure out a way to define a list of all these names for the for loop while still being able to call them by name when it's time to do calculations and print the output.
Any thoughts?
You can use csv.DictReader, which reads each row as a dict, assuming the first line holds the column names:
inv = csv.DictReader(open('file.csv'))
for i in inv:
    print ('%-20s %-90s OnHand: %-10s OnOrder: %-10s') % (i['MODEL_NUMBER'], i['PRODUCT_NAME'], i['TOTAL_ONHAND'], i['TOTAL_ON_ORDER'])
And if you want the i['MODEL_NUMBER'] to come from a list, define a list with all the column names. Assuming l = ['MODEL_NUMBER', 'PRODUCT_NAME', 'TOTAL_ONHAND', 'TOTAL_ON_ORDER'], the print statement in the above code becomes
print ('%-20s %-90s OnHand: %-10s OnOrder: %-10s') % (i[l[0]], i[l[1]], i[l[2]], i[l[3]])
Code not checked.. :)
To make your code more readable and easier to reuse, you should read the names of the columns dynamically. CSV files usually have a header with this information at the top of the file, so you can read the first line and store it in a tuple or a list, as sketched below.
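A minimal sketch of that idea against the question's file (assuming the column names are on the first line, same delimiter as before):
import csv

with open('inventory_report.txt', 'rU') as f:
    inv = csv.reader(f, dialect='excel', delimiter='\t')
    header = next(inv)                                        # first row: column names
    col = dict((name, i) for i, name in enumerate(header))   # name -> index lookup
    for row in inv:
        print ('%-20s %-90s OnHand: %-10s OnOrder: %-10s') % (
            row[col['MODEL_NUMBER']], row[col['PRODUCT_NAME']],
            row[col['TOTAL_ONHAND']], row[col['TOTAL_ON_ORDER']])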

Search a single column for a particular value in a CSV file and return an entire row

Issue
The code does not correctly identify the input (item). It simply dumps to my failure message even if such a value exists in the CSV file. Can anyone help me determine what I am doing wrong?
Background
I am working on a small program that asks for user input (function not given here), searches a specific column in a CSV file (Item) and returns the entire row. The CSV data format is shown below. I have shortened the data from the actual amount (49 field names, 18000+ rows).
Code
import csv
from collections import namedtuple
from contextlib import closing
def search():
    item = 1000001
    raw_data = 'active_sanitized.csv'
    failure = 'No matching item could be found with that item code. Please try again.'
    check = False
    with closing(open(raw_data, newline='')) as open_data:
        read_data = csv.DictReader(open_data, delimiter=';')
        item_data = namedtuple('item_data', read_data.fieldnames)
        while check == False:
            for row in map(item_data._make, read_data):
                if row.Item == item:
                    return row
                else:
                    return failure
CSV structure
active_sanitized.csv
Item;Name;Cost;Qty;Price;Description
1000001;Name here:1;1001;1;11;Item description here:1
1000002;Name here:2;1002;2;22;Item description here:2
1000003;Name here:3;1003;3;33;Item description here:3
1000004;Name here:4;1004;4;44;Item description here:4
1000005;Name here:5;1005;5;55;Item description here:5
1000006;Name here:6;1006;6;66;Item description here:6
1000007;Name here:7;1007;7;77;Item description here:7
1000008;Name here:8;1008;8;88;Item description here:8
1000009;Name here:9;1009;9;99;Item description here:9
Notes
My experience with Python is relatively little, but I thought this would be a good problem to start with in order to learn more.
I determined the methods to open (and wrap in a close function) the CSV file, read the data via DictReader (to get the field names), and then create a named tuple to be able to quickly select the desired columns for the output (Item, Cost, Price, Name). Column order is important, hence the use of DictReader and namedtuple.
While there is the possibility of hard-coding each of the field names, I felt that if the program can read them on file open, it would be much more helpful when working on similar files that have the same column names but different column organization.
Research
CSV Header and named tuple:
What is the pythonic way to read CSV file data as rows of namedtuples?
Converting CSV data to tuple: How to split a CSV row so row[0] is the name and any remaining items are a tuple?
There were additional links of research, but I cannot post more than two.
You have three problems with this:
You return on the first failure, so it will never get past the first line.
You are reading strings from the file, and comparing to an int.
_make iterates over the dictionary keys, not the values, producing the wrong result (item_data(Item='Name', Name='Price', Cost='Qty', Qty='Item', Price='Cost', Description='Description')).
for row in (item_data(**data) for data in read_data):
    if row.Item == str(item):
        return row
return failure
This fixes the issues at hand - we check against a string, and we only return if none of the items matched (although you might want to begin converting the strings to ints in the data rather than this hackish fix for the string/int issue).
I have also changed the way you are looping - using a generator expression makes for a more natural syntax, using the normal construction syntax for named attributes from a dict. This is cleaner and more readable than using _make and map(). It also fixes problem 3.
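Putting it together, a sketch of the whole corrected function (here converting row.Item to int as suggested, rather than comparing strings; the while loop is dropped because it never actually looped):
import csv
from collections import namedtuple
from contextlib import closing

def search(item=1000001, raw_data='active_sanitized.csv'):
    failure = 'No matching item could be found with that item code. Please try again.'
    with closing(open(raw_data, newline='')) as open_data:
        read_data = csv.DictReader(open_data, delimiter=';')
        item_data = namedtuple('item_data', read_data.fieldnames)
        for row in (item_data(**data) for data in read_data):
            if int(row.Item) == item:  # compare as ints, not str vs int
                return row
    return failure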

python2.7 - taking a value from a dictionary, where the key is a variable

I'm trying to load a specific key from a dictionary, with keys like "character 1", "translation 1", etc. (I'm working on a flashcard program for my Chinese studies). First, I load the dictionary flawlessly from a .txt file with
f = codecs.open(('%s.txt' % list_name),'r','utf-8')
quiz_list = eval(f.read())
Then, I want the program to print the list in order, so that I would get something along the lines of
"1. ('character 1' value) ('pinyin 1' value) ('translation 1' value)"
The program registers how many entries the list has and calculates the number of Chinese words it has to show (each word having its own character, transcription, translation, and entry number). Now, I want to load the first Chinese word from the list, with the 3 keys "character 1", "pinyin 1" and "translation 1".
The tried-and-tested way of retrieving values from a dictionary is through my_dictionary[key]. However, if I were to insert the name of a variable between the brackets, Python would read the name of this variable as the name of a key, and not use the value of the variable. Is there a way of doing the latter the right way? I have, for example, tried the following (obviously to no avail) to load key "character 1" from the list:
i = 1
character_to_load = "character %s" % str(i)
print quiz_list[character_to_load]
Any hints are extremely appreciated!
A more general solution to this problem, instead of flattening the data structure into a dictionary keyed by strings, is to use a better data structure. For instance, if you have a lot of keys that look like "translation n" for numbers n, you'd be better off making translation a dictionary keyed by numbers. You might even want to make the lookup go the other way: you could have a Word object (or whatever) which has properties translation, pinyin and character, and then have a list of Words.
You should build this data structure properly, instead of evaling a file. That's basically never a good idea, because it's horribly fragile: you're forcing the text file to be pure Python, but not writing it as a module. Instead, you should iterate over the lines in the file and build up the data structure as you go.
If you tell me the current structure of your file I can give you an example as to how to parse it properly.
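For instance, here is a minimal sketch of that approach; the tab-separated line format and the load_words name are assumptions, since the actual file layout wasn't shown:
import codecs
from collections import namedtuple

Word = namedtuple('Word', ['character', 'pinyin', 'translation'])

def load_words(list_name):
    # assumed format: one entry per line, "character<TAB>pinyin<TAB>translation"
    words = []
    f = codecs.open('%s.txt' % list_name, 'r', 'utf-8')
    for line in f:
        character, pinyin, translation = line.rstrip('\n').split('\t')
        words.append(Word(character, pinyin, translation))
    f.close()
    return words

quiz_list = load_words('my_list')
print quiz_list[0].translation  # attribute access instead of string-built keys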
If I got your question right, I believe you might have a bug in the code, as this works fine:
>>> d = {'translation 1': 'dilligent', 'pinyin 1': 'ren4wei2', 'character 1': '\xe8\xaa\x8d\xe7\x88\xb2'}
>>> key = "translation %s" % 1
>>> d[key]
'dilligent'
Does it help?
