Python 2.7: Persisting search and indexing - python

I wrote a small 'search engine' that finds all the text files in a directory and its sub-directories. I can add the code here, but I don't think it is necessary for my question.
It works by creating a dictionary in a format like this:
term_frequency = {'file1' : {'WORD1' : 1, 'WORD2' : 2, 'WORD3' : 3},
                  'file2' : {'WORD1' : 1, 'WORD3' : 3, 'WORD4' : 4}}
...continues with all the files it has found...
From gathered information it creates a second dictionary like such:
document_frequency = {'WORD1' : ['file1', 'file2', ...],
                      'WORD2' : ['file1', ...],
                      ...every found word...}
The term_frequency dictionary records how many times each word has been used in each file, and document_frequency records how many documents each word has been used in.
Then, when given a word, it calculates the relevance of every file as tf/df and lists the files with non-zero scores in descending order of relevance.
for example:
file1 : 0.75
file2 : 0.5
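A rough sketch of that scoring step, going by the description above (the function and variable names here are illustrative, not the actual code):
def search(word, term_frequency, document_frequency):
    # df = number of documents the word appears in at all
    df = len(document_frequency.get(word, []))
    if df == 0:
        return []
    scores = {}
    for filename, counts in term_frequency.items():
        tf = counts.get(word, 0)            # how often the word occurs in this file
        if tf:
            scores[filename] = float(tf) / df   # the simple tf/df relevance
    # highest relevance first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)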
I am aware that this is a very simple take on tf-idf, but I am new to Python and programming (2 weeks) and am still getting familiar with it all.
Sorry for the long-ish intro but I feel it is relevant to the question... which brings me right to it:
How do I go about making an indexer that saves those dictionaries to a file, and then a 'searcher' that reads those dictionaries back from the file? The issue right now is that every time you want to look for a different word it has to read ALL the files once again and build the same two dictionaries over and over.

The pickle (and for that matter cPickle) library is your friend here. By using pickle.dump(), you can turn the entire dictionary into one file which can be read back later by pickle.load().
In this case, you could use something like this:
import pickle
termfile = open('terms.pkl', 'wb')
documentfile = open('documents.pkl', 'wb')
pickle.dump(term_frequency, termfile)
pickle.dump(document_frequency, documentfile)
termfile.close()
documentfile.close()
and read it back like so:
termfile = open('terms.pkl', 'rb')
documentfile = open('documents.pkl', 'rb')
term_frequency = pickle.load(termfile)
document_frequency = pickle.load(documentfile)
termfile.close()
documentfile.close()
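To avoid rebuilding the index on every search, one option is to pickle once and load on later runs. A rough sketch, where build_dictionaries() is a placeholder for your existing indexing code and the file name is arbitrary:
import os
import pickle

INDEX_FILE = 'index.pkl'

def save_index(term_frequency, document_frequency):
    # Store both dictionaries in one pickle file.
    with open(INDEX_FILE, 'wb') as f:
        pickle.dump((term_frequency, document_frequency), f)

def load_index():
    # Read the dictionaries back without rescanning the text files.
    with open(INDEX_FILE, 'rb') as f:
        return pickle.load(f)

if os.path.exists(INDEX_FILE):
    term_frequency, document_frequency = load_index()
else:
    term_frequency, document_frequency = build_dictionaries()  # placeholder for your indexer
    save_index(term_frequency, document_frequency)
The searcher then only needs load_index(); delete index.pkl (or compare file timestamps) whenever the text files change, so the index gets rebuilt.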

Related

How to Extract Data from CSV file and import to a Dictionary in Python?

I'm new to Python and a new developer; I just started my internship.
So, I have a CSV file with data formatted like this:
Event Category,Event Label,Total Events,Unique Events,Event Value,Avg. Value
From each row of the file I want to extract the port labels (below) into a dictionary, along with the total and unique events. The total and unique events have to be summed only across rows with the same port label (so there are no duplicates).
My data looks like this:
'Search,Santorin (JTR) - Paros (PAS) - Santorin (JTR),"2,199","1,584",0,0.00'
I want my dictionary to look like this:
data_file = 'Analytics.csv'
ports_dict = {
    # "ATH-HER" : [10000, 5000],
    # "ATH-JTR" : [20000, 3500],
    # "HER-JTR" : [100, 500]
}
data = 'Analytics.csv'
#row= 'Search,Santorin (JTR) - Paros (PAS) - Santorin (JTR),"2,199","1,584",0,0.00'
def extract_counts(data):
    ports = []
    for i in data.split('"')[1:]:
        ports.append(i.split('"')[0])
    return ports
An example run is shown below. When I run it with row it works OK, but when I use data it gives me back an empty list. Can anyone help me with this?
extract_counts(data)
Out[13]: []
What do I have to do to run this for the whole CSV?
Thank you for your help!
First of all, data is just a string variable holding the file name, not the file's contents. data.split('"') returns the whole string as a one-element list (there is no '"' character in 'Analytics.csv'), so slicing it with [1:] leaves an empty list and the loop body never runs; that is why extract_counts returns [].
To begin your journey with reading CSV files in Python, I recommend:
https://realpython.com/python-csv/
https://docs.python.org/3/library/csv.html
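For what it's worth, here is a rough sketch of reading the file with the csv module and summing the events per route label. It assumes the layout of your sample row (route label in the second column, total and unique events in the third and fourth); extracting just the airport codes such as "JTR-PAS" would need an extra parsing step:
import csv

ports_dict = {}   # e.g. {'Santorin (JTR) - Paros (PAS) - ...': [total_events, unique_events]}

with open('Analytics.csv', newline='') as f:
    reader = csv.reader(f)   # the csv module handles the quoted "2,199" fields for you
    next(reader)             # skip the header row
    for row in reader:
        label = row[1]
        total = int(row[2].replace(',', ''))    # "2,199" -> 2199
        unique = int(row[3].replace(',', ''))   # "1,584" -> 1584
        if label not in ports_dict:
            ports_dict[label] = [0, 0]
        ports_dict[label][0] += total
        ports_dict[label][1] += unique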

Python 3 for loop error inside of a dictionary

I'm trying to put a for loop inside of a dict. (See my code below.)
tempdict["data"] = {
"id": words[0],
for j in range(1, 100):
tempdict[fieldnames[j]] = words[j]
}
So, I'm reading out of a CSV file and converting it to JSON. (This is to automate a process at work)
Not sure if there is anything else to add, just ask if you need to know more.
I would appreciate any help I can get :)
So, extra context:
I'm trying to map each value from the CSV file into the dict, with the key being one of the fieldnames (see fieldnames below).
fieldnames = ["id", "subscriber-A-01", "subscriber-A-02", "subscriber-A-03", "subscriber-A-04", "subscriber-A-05", "subscriber-A-06", "subscriber-A-07", "subscriber-A-08", "subscriber-A-09", "subscriber-A-10", "subscriber-B-01", "subscriber-B-02", "subscriber-B-03", "subscriber-B-04", "subscriber-B-05", "subscriber-B-06", "subscriber-B-07", "subscriber-B-08", "subscriber-B-09", "subscriber-B-10", "subscriber-C-01", "subscriber-C-02", "subscriber-C-03", "subscriber-C-04", "subscriber-C-05", "subscriber-C-06", "subscriber-C-07", "subscriber-C-08", "subscriber-C-09", "subscriber-C-10", "subscriber-D-01", "subscriber-D-02", "subscriber-D-03", "subscriber-D-04", "subscriber-D-05", "subscriber-D-06", "subscriber-D-07", "subscriber-D-08", "subscriber-D-09", "subscriber-D-10", "subscriber-E-01", "subscriber-E-02", "subscriber-E-03", "subscriber-E-04", "subscriber-E-05", "subscriber-E-06", "subscriber-E-07", "subscriber-E-08", "subscriber-E-09", "subscriber-E-10", "subscriber-F-01", "subscriber-F-02", "subscriber-F-03", "subscriber-F-04", "subscriber-F-05", "subscriber-F-06", "subscriber-F-07", "subscriber-F-08", "subscriber-F-09", "subscriber-F-10", "subscriber-G-01", "subscriber-G-02", "subscriber-G-03", "subscriber-G-04", "subscriber-G-05", "subscriber-G-06", "subscriber-G-07", "subscriber-G-08", "subscriber-G-09", "subscriber-G-10", "subscriber-H-01", "subscriber-H-02", "subscriber-H-03", "subscriber-H-04", "subscriber-H-05", "subscriber-H-06", "subscriber-H-07", "subscriber-H-08", "subscriber-H-09", "subscriber-H-10", "subscriber-I-01", "subscriber-I-02", "subscriber-I-03", "subscriber-I-04", "subscriber-I-05", "subscriber-I-06", "subscriber-I-07", "subscriber-I-08", "subscriber-I-09", "subscriber-I-10", "subscriber-J-01", "subscriber-J-02", "subscriber-J-03", "subscriber-J-04", "subscriber-J-05", "subscriber-J-06", "subscriber-J-07", "subscriber-J-08", "subscriber-J-09", "subscriber-J-10", "DISPLAY_TYPE", "stationCode"]
So, subscriber-A-01 to subscriber-J-10 should be mapped to values 1 to 100 of the CSV file.
Update
This is how I made it work.
tempdict["data"] = dict(zip(fieldnames[1:100], words[1:100]))
tempdict["data"].update({"DISPLAY_TYPE": words[102]})
Thank you all for the help!
You can pair the sliced list of keys and the sliced list of values with zip and pass the zipped pairs to the dict constructor to build the desired dict instead:
tempdict['data'] = dict(zip(fieldnames[:101], words[:101]))
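For illustration, the same pattern on a tiny made-up example:
fieldnames = ["id", "subscriber-A-01", "subscriber-A-02"]
words = ["42", "Alice", "Bob"]

# zip() pairs each field name with the value in the same position,
# and dict() turns those pairs into key/value entries.
tempdict = {"data": dict(zip(fieldnames, words))}
# {'id': '42', 'subscriber-A-01': 'Alice', 'subscriber-A-02': 'Bob'}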

How to iterate over two files effectively (25000+ Lines)

So, I am trying to make a combined list inside of Python for matching data of about 25,000 lines.
The first list's data comes from the file mac.uid and looks like this:
Mac|ID
The second list's data comes from serial.uid and looks like this:
Serial|Mac
Mac from list 1 must equal the Mac from list 2 before it's joined.
This is what I am currently doing; I believe there is too much repetition going on:
combined = []

def combineData():
    lines = open('mac.uid', 'r+')
    for line in lines:
        with open('serial.uid', 'r+') as serial:
            for each in serial:
                a, b = line.strip().split('|')
                a = a.lower()
                x, y = each.strip().split('|')
                y = y.lower()
                if a == y:
                    combined.append(a+""+b+""+x)
The final list is supposed to look like this:
Mac(List1), ID(List1), Serial(List2)
So that I can import it into an excel sheet.
Thanks for any and all help!
Instead of your nested loops (which give quadratic complexity), you should use a dictionary, which brings the work down to roughly linear time since dictionary lookups are effectively constant time. To do so, first read serial.uid once and store the mapping of MAC addresses to serial numbers in a dict.
serial = dict()
with open('serial.uid') as istr:
    for line in istr:
        (ser, mac) = split_fields(line)
        serial[mac] = ser
Then you can close the file again and process mac.uid looking up the serial number for each MAC address in the dictionary you've just created.
combined = list()
with open('mac.uid') as istr:
    for line in istr:
        (mac, uid) = split_fields(line)
        combined.append((mac, uid, serial[mac]))
Note that I've changed combined from a list of strings to a list of tuples. I've also factored the splitting logic out into a separate function. (You'll have to put its definition before its use, of course.)
def split_fields(line):
    return line.strip().lower().split('|')
Finally, I recommend that you start using more descriptive names for your variables.
For files of 25k lines, you should have no issues storing everything in memory. If your data sets become too large for that, you might want to start looking into using a database. Note that the Python standard library ships with an SQLite module.
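Since the end goal is an Excel import, you can then write combined straight to a CSV file. A minimal sketch (the output file name is arbitrary; on Python 3, pass newline='' to open() to avoid blank rows on Windows):
import csv

with open('combined.csv', 'w') as out:
    writer = csv.writer(out)
    writer.writerow(['Mac', 'ID', 'Serial'])   # header row
    writer.writerows(combined)                 # one row per (mac, uid, serial) tuple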

Filter large file using python, using contents of another

I have a ~1GB text file of data entries and another list of names that I would like to use to filter them. Running through every name for each entry will be terribly slow. What's the most efficient way of doing this in Python? Is it possible to use a hash table if the name is embedded in the entry? Can I make use of the fact that the name part is consistently placed?
Example files:
Entries file -- each part of the entry is separated by a tab, until the names
246 lalala name="Jack";surname="Smith"
1357 dedada name="Mary";surname="White"
123456 lala name="Dan";surname="Brown"
555555 lalala name="Jack";surname="Joe"
Names file -- each on a newline
Jack
Dan
Ryan
Desired output -- only entries with a name in the names file
246 lalala name="Jack";surname="Smith"
123456 lala name="Dan";surname="Brown"
555555 lalala name="Jack";surname="Joe"
You can use the set data structure to store the names: it offers efficient lookup, but if the names list is very large then you may run into memory trouble.
The general idea is to iterate through all the names, adding them to a set, then checking if each name from each line from the data file is contained in the set. As the format of the entries doesn't vary, you should be able to extract the names with a simple regular expression.
If you run into trouble with the size of the names set, you can read n lines from the names file at a time and repeat the process for each batch of names, unless you require sorting.
My first instinct was to make a dictionary with the names as keys, assuming that it was most efficient to look up the names using the hash of the keys in the dictionary.
Given the answer by @rfw, using a set of names, I edited the code as below and tested both approaches, using a dict of names and a set.
I built a dummy dataset of over 40 M records and over 5400 names. Using this dataset, the set method consistently had the edge on my machine.
import re
from collections import Counter
import time

# names file downloaded from http://www.tucows.com/preview/520007
# the set contains over 5400 names
f = open('./names.txt', 'r')
names = [ name.rstrip() for name in f.read().split(',') ]
name_set = set(names)        # set of unique names
names_dict = Counter(names)  # Counter ~= dict of names with counts

# Expect: 246 lalala name="Jack";surname="Smith"
pattern = re.compile(r'.*\sname="([^"]*)"')

def select_rows_set():
    f = open('./data.txt', 'r')
    out_f = open('./data_out_set.txt', 'a')
    for record in f.readlines():
        name = pattern.match(record).groups()[0]
        if name in name_set:
            out_f.write(record)
    out_f.close()
    f.close()

def select_rows_dict():
    f = open('./data.txt', 'r')
    out_f = open('./data_out_dict.txt', 'a')
    for record in f.readlines():
        name = pattern.match(record).groups()[0]
        if name in names_dict:
            out_f.write(record)
    out_f.close()
    f.close()

if __name__ == '__main__':
    # One round to time the use of name_set
    t0 = time.time()
    select_rows_set()
    t1 = time.time()
    time_for_set = t1-t0
    print 'Total set: ', time_for_set

    # One round to time the use of names_dict
    t0 = time.time()
    select_rows_dict()
    t1 = time.time()
    time_for_dict = t1-t0
    print 'Total dict: ', time_for_dict
I assumed that a Counter, being at heart a dictionary and easier to build from the dataset, does not add any overhead to the access time. Happy to be corrected if I am missing something.
Your data is clearly structured as a table, so this may be applicable:
Data structure for maintaining tabular data in memory?
You could create a custom data structure with its own "search by name" function. That would be a list of dictionaries of some sort. It should take less memory than the size of your text file, as it removes the duplicate information you have on each line, such as "name" and "surname", which would become dictionary keys. If you know a bit of SQL (very little is required here), an in-memory SQLite table is another option; see the sketch below.
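If you want to try the SQL route, here is a rough, un-benchmarked sketch using the standard library's sqlite3 module, reusing the data.txt and names.txt file names from the answer above:
import re
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE entries (line TEXT, name TEXT)')
conn.execute('CREATE TABLE names (name TEXT PRIMARY KEY)')

name_re = re.compile(r'name="([^"]*)"')

# Load the entries, extracting the embedded name from each line.
rows = []
with open('data.txt') as f:
    for line in f:
        m = name_re.search(line)
        if m:
            rows.append((line.rstrip('\n'), m.group(1)))
conn.executemany('INSERT INTO entries VALUES (?, ?)', rows)

# Load the names to filter by.
with open('names.txt') as f:
    conn.executemany('INSERT OR IGNORE INTO names VALUES (?)',
                     ((n.strip(),) for n in f if n.strip()))

# Keep only the entries whose name appears in the names table.
for (line,) in conn.execute(
        'SELECT e.line FROM entries e JOIN names n ON e.name = n.name'):
    print(line)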

Find and replace in CSV files with Python

Related to a previous question, I'm trying to do replacements over a number of large CSV files.
The column order (and contents) change between files, but each file has about 10 columns that I want, which I can identify by the column header names. I also have 1-2 dictionaries for each of those columns. So for each column I want, I want to apply only the correct dictionaries, and I want to apply them sequentially.
An example of how I've tried to solve this:
# -*- coding: utf-8 -*-
import re

# imaginary csv file. pretend that we do not know the column order.
Header = [u'col1', u'col2']
Line1 = [u'A',u'X']
Line2 = [u'B',u'Y']
fileLines = [Line1,Line2]

# dicts to translate lines
D1a = {u'A':u'a'}
D1b = {u'B':u'b'}
D2 = {u'X':u'x',u'Y':u'y'}

# dict to correspond header names with the correct dictionary.
# i would like the dictionaries to be read sequentially in col1.
refD = {u'col1':[D1a,D1b],u'col2':[D2]}

# clunky replace function
def freplace(str, dict):
    rc = re.compile('|'.join(re.escape(k) for k in dict))
    def trans(m):
        return dict[m.group(0)]
    return rc.sub(trans, str)

# get correspondence between dictionary and column
C = []
for i in range(len(Header)):
    if Header[i] in refD:
        C.append([refD[Header[i]],i])

# loop through lines and make replacements
for line in fileLines:
    for i in range(len(line)):
        for j in range(len(C)):
            if C[j][1] == i:
                for dict in C[j][0]:
                    line[i] = freplace(line[i], dict)
My problem is that this code is quite slow, and I can't figure out how to speed it up. I'm a beginner, and my guess was that my freplace function is largely what is slowing things down, because it has to compile for each column in each row. I would like to take the line rc = re.compile('|'.join(re.escape(k) for k in dict)) out of that function, but don't know how to do that and still preserve what the rest of my code is doing.
There's a ton of things that you can do to speed this up:
First, use the csv module. It provides efficient and bug-free methods for reading and writing CSV files. The DictReader object in particular is what you're interested in: it will present every row it reads from the file as a dictionary keyed by its column name.
Second, compile your regexes once, not every time you use them. Save the compiled regexes in a dictionary keyed by the column that you're going to apply them to.
Third, consider that if you apply a hundred regexes to a long string, you're going to be scanning the string from start to finish a hundred times. That may not be the best approach to your problem; you might be better off investing some time in an approach that lets you read the string from start to end once.
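To make the first two suggestions concrete, here is a rough sketch; the input/output file names are placeholders, and refD mirrors the structure from the question:
import csv
import re

refD = {u'col1': [{u'A': u'a'}, {u'B': u'b'}],
        u'col2': [{u'X': u'x', u'Y': u'y'}]}

# Compile one regex per (column, dictionary) pair, once, outside the row loop.
compiled = {}
for col, dicts in refD.items():
    compiled[col] = [(re.compile('|'.join(re.escape(k) for k in d)), d)
                     for d in dicts]

with open('input.csv') as src, open('output.csv', 'w') as dst:
    reader = csv.DictReader(src)                 # rows come back keyed by column name
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for col, patterns in compiled.items():
            if col in row:
                for rc, d in patterns:           # apply the dictionaries sequentially
                    row[col] = rc.sub(lambda m: d[m.group(0)], row[col])
        writer.writerow(row)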
You don't need re:
# -*- coding: utf-8 -*-

# imaginary csv file. pretend that we do not know the column order.
Header = [u'col1', u'col2']
Line1 = [u'A',u'X']
Line2 = [u'B',u'Y']
fileLines = [Line1,Line2]

# dicts to translate lines
D1a = {u'A':u'a'}
D1b = {u'B':u'b'}
D2 = {u'X':u'x',u'Y':u'y'}

# dict to correspond header names with the correct dictionary
refD = {u'col1':[D1a,D1b],u'col2':[D2]}

# now let's have some fun...
for line in fileLines:
    for i, (param, word) in enumerate(zip(Header, line)):
        for minitranslator in refD[param]:
            if word in minitranslator:
                line[i] = minitranslator[word]
returns:
[[u'a', u'x'], [u'b', u'y']]
So if that's the case, and all 10 columns have the same names each time but appear out of order (I'm not sure if this is what you're doing up there, but here goes): keep one array for the heading names and one for the current line split into elements (there should be 10 items per line). Then choose which regex to apply with a case/select on the header: compare against the element number of your header array, and inside each case reference the data array at the same offset. Since the column name routes you to the right case, you can reuse the same 10 regexes repeatedly and never have to recompile a new "command" each time.
I hope that makes sense. I'm sorry I don't know the Python syntax to help you out, but I hope my idea is what you're looking for.
EDIT:
I.e. initialize all regexes before starting your loops. Then, after you read a line (and after the header line):

select array[n]
    case "column1"
        regex(data[0]);
    case "column2"
        regex(data[1]);
    ...
end select

This should call the right regex for the right columns.
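In Python terms, that dispatch might look something like the following sketch; the column names, patterns, and replacement here are only placeholders:
import re

# One precompiled regex and replacement per column, built once before any loop.
column_rules = {
    'column1': (re.compile(r'A|B'), lambda m: m.group(0).lower()),
    'column2': (re.compile(r'X|Y'), lambda m: m.group(0).lower()),
}

def process_row(header, row):
    # Look the rule up by header name (the "case/select" step); nothing is recompiled per row.
    out = []
    for name, value in zip(header, row):
        rule = column_rules.get(name)
        if rule is not None:
            pattern, repl = rule
            value = pattern.sub(repl, value)
        out.append(value)
    return out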
