Efficiently parsing a large text file in Python?

I have a series of large, flat text files that I need to parse in order to insert the data into a SQL database. Each record spans multiple lines and consists of about a hundred fixed-length fields. I am trying to figure out how to parse them efficiently without loading the entire file into memory.
Each record starts with a numeric "1" as the first character on a new line (though not every line that starts with "1" is a new record) and terminates many lines later with a series of 20 spaces. While each field is fixed-width, each record is variable-length because it may or may not contain several optional fields. So I've been using "...20 spaces...\n1" as a record delimiter.
I've been trying to work with something like this to process 1 KB at a time:
def read_in_chunks(file_object, chunk_size):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

file = open('test.txt')
for piece in read_in_chunks(file, chunk_size=1024):
    # Do stuff
However, the problem I'm running into is when a single record spans multiple chunks. Am I overlooking an obvious design pattern? This problem would seem to be somewhat common. Thanks!

def recordsFromFile(inputFile):
    record = ''
    # Lines keep their trailing newline, so include it in the terminator.
    terminator = ' ' * 20 + '\n'
    for line in inputFile:
        if line.startswith('1') and record.endswith(terminator):
            yield record
            record = ''
        record += line
    yield record

inputFile = open('test.txt')
for record in recordsFromFile(inputFile):
    # Do stuff
BTW, file is a built-in function. It's bad style to change its value.
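If you do want to stick with fixed-size chunks, the usual pattern is to carry any partial record over to the next chunk in a buffer. A minimal sketch, assuming a record always ends with a line of 20 spaces (the function name here is made up for illustration):

def records_from_chunks(file_object, chunk_size=1024):
    # Buffer partial records across chunk boundaries.
    terminator = ' ' * 20 + '\n'
    buffer = ''
    for chunk in iter(lambda: file_object.read(chunk_size), ''):
        buffer += chunk
        # Emit every complete record in the buffer; keep the trailing partial one.
        while True:
            index = buffer.find(terminator)
            if index == -1:
                break
            yield buffer[:index + len(terminator)]
            buffer = buffer[index + len(terminator):]
    if buffer:
        yield buffer  # whatever is left at end of file

with open('test.txt') as f:
    for record in records_from_chunks(f, chunk_size=1024):
        # Do stuff
        pass

If a run of 20 spaces can occur in the middle of a record, you would also need to check that the next line starts with '1', as in the generator above.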

Related

How to read only a section of a file Python

I have a log file that has what's known as a header section, and then the rest of it is a lot of data. The header section contains certain key-value pairs that tell a db table information about said file.
One of my tasks is to parse out some of this header info. The other task is to go through the entire file and count occurrences of certain strings. The latter part I have a function for, attached below:
with open(filename, 'rb') as f:
    time_data_count = 0
    while True:
        memcap = f.read(102400)
        # f.seek(-tdatlength, 1)
        poffset_set = set(config_offset.keys())
        # need logic to check if key value exists
        time_data_count += memcap.count(b'TIME_DATA')
        if len(memcap) <= 8:
            break
    if time_data_count > 20:
        print("time_data complete")
    else:
        print("incomplete time_data data")
    print(time_data_count)
The issue with this is that it is not line-by-line processing (which would take a lot of time anyway). I want to get only the first 50 lines of this log and parse them, and then have the rest of the function go through the remainder of the file, without going line by line, to do the counting part.
Is it possible to extract the first 50 lines without going through the entire file?
The first 50 lines have header info of the form
ProdID: A785X
What I really need is to get the value of ProdID in that log file
You can read line by line for the first 50, by using a for loop or a list comprehension to read the next line 50 times. This moves the read pointer down through the file, so when you call .read() or any other method, you won't get anything you've already consumed. You can then process the rest as a batch, or however else you need to:
with open(filename, 'rb') as f:
    first_50_lines = [next(f) for _ in range(50)]  # first 50 lines
    remainder_of_file = f.read()  # however much of the file remains
You can alternate various methods of reading the file, as long as the same file object (f in this case) is in play the entire time. Line-by-line, sized-chunk by chunk, or all at once (though .read() is always going to preclude further processing, on account of consuming the whole thing at once).
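Putting both pieces together, a rough sketch, assuming the header lines really do look like "ProdID: A785X" and reusing the 102400-byte reads from your code:

prod_id = None
time_data_count = 0
with open(filename, 'rb') as f:
    # header section: only the first 50 lines
    for _ in range(50):
        line = next(f, b'')
        if line.startswith(b'ProdID:'):
            prod_id = line.split(b':', 1)[1].strip().decode()
    # rest of the file: read in large chunks and count occurrences
    for chunk in iter(lambda: f.read(102400), b''):
        time_data_count += chunk.count(b'TIME_DATA')
        # note: a b'TIME_DATA' split across two chunks would not be counted
print(prod_id, time_data_count)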

How to search through only the first column of a delimited text file using python

Search through the first column of a pipe ('|') delimited .txt file containing 10 million rows using Python. The first column contains phone numbers. I would like to output the entire row for a given phone number.
The file is a 5 GB .txt file, and I am unable to open it in either MS Excel or MS Access. So I want to write Python code that can search through the file and print out the entire row that matches a particular phone number. The phone number is in the first column. I wrote some code, but it searches the entire file and is very slow. I just want to search the first column, and my search item is the phone number.
f = open("F:/.../master.txt", "rt")  # open file master.txt
for line in f:                       # check each line in the file handle f
    if '999995555' in line:          # if a particular phone number is found
        print(line)                  # print the entire row
f.close()                            # close file
I expect the entire row to be printed on screen where the first column contains the phone number I am searching for, but it is taking a lot of time, as I am unable to restrict the search to the first column because I don't know how.
Well, you are on the correct track there. Since it is a 5 GB file, you probably want to avoid allocating 5 GB of RAM for this. You're probably better off using .readline(), since it is designed for exactly your scenario (a big file).
Something like the following should do the trick. Note that .readline() returns '' at the end of the file and '\n' for empty lines. The .strip() call merely removes the '\n' that .readline() keeps at the end of each line actually in the file.
def search_file_line_prefix(path, search_prefix):
    with open(path, 'r') as file_handle:
        while True:
            line = file_handle.readline()
            if line == '':
                break
            if line.startswith(search_prefix):
                yield line.strip()

for result in search_file_line_prefix('file_path', 'phone number'):
    print(result)
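If you want to restrict the match to the first column specifically, you can split each line on the '|' delimiter and compare only the first field. A minimal sketch (the file name and phone number are placeholders):

def find_rows_by_phone(path, phone):
    with open(path, 'r') as f:
        for line in f:
            # compare only the first '|'-delimited column
            if line.split('|', 1)[0] == phone:
                yield line.rstrip('\n')

for row in find_rows_by_phone('master.txt', '999995555'):
    print(row)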

Text file manipulation with Python

First off, I am very new to Python. When I started to do this it seemed very simple. However I am at a complete loss.
I want to take a text file with as many as 90k entries and put each group of data on a single line, separated by ';'. My examples are below. Keep in mind that the groups of data vary in size; they could be two entries or 100 entries.
Raw Data
group1
data
group2
data
data
data
group3
data
data
data
data
data
data
data
data
data
data
data
data
group4
data
data
Formatted Data
group1;data;
group2;data;data;data;
group3;data;data;data;data;data;data;data;data;data;data;data;data;
group4;data;data;
Try something like the following (untested... you can learn a bit of Python by debugging!).
Create a Python file "parser.py":
import sys

f = open('filename.txt', 'r')
for line in f:
    txt = line.strip()
    if txt == '':
        # blank line: end of a group, start a new output line
        sys.stdout.write('\n\n')
        sys.stdout.flush()
        continue
    sys.stdout.write(txt + ';')
    sys.stdout.flush()
f.close()
and in a shell, type:
python parser.py > output.txt
and see if output.txt is what you want.
Assuming the groups are separated with an empty line, you can use the following one-liner:
>>> print "\n".join([item.replace('\n', ';') for item in open('file.txt').read().split('\n\n')])
group1;data
group2;data;data;data
group3;data;data;data;data;data;data;data;data;data;data;data;data
group4;data;data;
where file.txt contains
group1
data
group2
data
data
data
group3
data
data
data
data
data
data
data
data
data
data
data
data
group4
data
data
First the file content (open().read()) is split on empty lines split('\n\n') to produce a list of blocks, then, in each block [item ... for item in list], newlines are replaced with semi-colons, and finally all blocks are printed separated with a newline "\n".join(list)
Note that the above is not production-safe; it is the kind of code you would write for interactive data transformation, not for production-level scripts.
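If you prefer a plain script over the one-liner, here is a rough Python 3 sketch of the same idea, again assuming groups are separated by blank lines (the file names are placeholders):

with open('file.txt') as src, open('output.txt', 'w') as dst:
    block = []
    for line in src:
        line = line.strip()
        if line:
            block.append(line)
        elif block:
            # blank line: end of a group, write it out semicolon-separated
            dst.write(';'.join(block) + ';\n')
            block = []
    if block:
        # last group, if the file doesn't end with a blank line
        dst.write(';'.join(block) + ';\n')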
What have you tried? What is the text file for, or from? File manipulation is one of the last "basic" things I plan on learning. I'm saving it for when I understand the nuances of for loops, while loops, dictionaries, lists, appending, and a million other handy functions out there. That's after 2-3 months of research, coding, and creating GUIs, by the way.
Anyway, here are some basic suggestions.
';'.join(group) will put a ';' between the elements of group, effectively creating one long (semicolon-delimited) string.
group.replace("SPACE CHARACTER", ";") will replace any spaces or a specified character (like a newline) within a group with a semicolon.
There are a lot of other approaches, including loading the txt file into a Python script, .append() calls, and putting the groups into lists, dictionaries, or matrices, etc.
These are my bits to throw on the problem:
from collections import defaultdict
import codecs
import csv

res = defaultdict(list)
cgroup = ''
with codecs.open('tmp.txt', encoding='UTF-8') as f:
    for line in f:
        if line.startswith('group'):
            cgroup = line.strip()
            continue
        res[cgroup].append(line.strip())

with codecs.open('out.txt', 'w', encoding='UTF-8') as f:
    w = csv.writer(f, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for k in res:
        w.writerow([k] + res[k])
Let me explain a bit why I did things the way I did. First, I used the codecs module to open the data file with an explicit encoding, since data should always be handled explicitly rather than by guessing what it might be. Then I used a defaultdict, which has nice documentation online, because it's more Pythonic, at least according to Mr. Hettinger; it is one of the patterns worth learning when you use Python.
Finally, I used a csv writer to generate the output, because writing CSV files is not as easy as one might think. To meet the right criteria and get the data into a correct CSV format, it is better to use what many eyes have seen than to reinvent the wheel.

Why can't I repeat the 'for' loop for csv.Reader?

I am a beginner at Python. I am now trying to figure out why the second 'for' loop doesn't work in the following script. I mean that I only get the result of the first 'for' loop, but nothing from the second one. I copied and pasted my script and the data CSV below.
It would be helpful if you could tell me why it behaves this way and how to make the second 'for' loop work as well.
My SCRIPT:
import csv

file = "data.csv"
fh = open(file, 'rb')
read = csv.DictReader(fh)

for e in read:
    print(e['a'])

for e in read:
    print(e['b'])
"data.csv":
a,b,c
tree,bough,trunk
animal,leg,trunk
fish,fin,body
The csv reader is an iterator over the file. Once you go through it once, you read to the end of the file, so there is no more to read. If you need to go through it again, you can seek to the beginning of the file:
fh.seek(0)
This will reset the file to the beginning so you can read it again. Depending on the code, it may also be necessary to skip the field name header:
next(fh)
This is necessary for your code, since the DictReader consumed that line the first time around to determine the field names, and it's not going to do that again. It may not be necessary for other uses of csv.
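Applied to the code in the question, that would look roughly like this:

for e in read:
    print(e['a'])

fh.seek(0)   # rewind the underlying file
next(fh)     # skip the "a,b,c" header line; the reader already knows the field names
for e in read:
    print(e['b'])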
If the file isn't too big and you need to do several things with the data, you could also just read the whole thing into a list:
data = list(read)
Then you can do what you want with data.
I have created a small function that takes the path of a CSV file, reads it, and returns a list of dicts all at once; you can then loop through that list as many times as you like:
import csv

def read_csv_data(path):
    """
    Reads CSV from the given path and returns a list of dicts with mapping.
    """
    data = csv.reader(open(path))
    # Read the column names from the first line of the file
    fields = next(data)
    data_lines = []
    for row in data:
        items = dict(zip(fields, row))
        data_lines.append(items)
    return data_lines
Regards

How do quickly search through a .csv file in Python

I'm reading a 6 million entry .csv file with Python, and I want to be able to search through this file for a particular entry.
Are there any tricks to search the entire file? Should you read the whole thing into a dictionary or should you perform a search every time? I tried loading it into a dictionary but that took ages so I'm currently searching through the whole file every time which seems wasteful.
Could I possibly utilize that the list is alphabetically ordered? (e.g. if the search word starts with "b" I only search from the line that includes the first word beginning with "b" to the line that includes the last word beginning with "b")
I'm using import csv.
(A side question: is it possible to make csv go to a specific line in the file? I want to make the program start at a random line.)
Edit: I already have a copy of the list as an .sql file as well, how could I implement that into Python?
If the csv file isn't changing, load in it into a database, where searching is fast and easy. If you're not familiar with SQL, you'll need to brush up on that though.
Here is a rough example of inserting from a csv into a sqlite table. Example csv is ';' delimited, and has 2 columns.
import csv
import sqlite3
con = sqlite3.Connection('newdb.sqlite')
cur = con.cursor()
cur.execute('CREATE TABLE "stuff" ("one" varchar(12), "two" varchar(12));')
f = open('stuff.csv')
csv_reader = csv.reader(f, delimiter=';')
cur.executemany('INSERT INTO stuff VALUES (?, ?)', csv_reader)
cur.close()
con.commit()
con.close()
f.close()
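Once the rows are in SQLite, a lookup is a single query, and an index on the search column keeps it fast. A rough follow-up sketch (table and column names are taken from the example above, and the search value is a placeholder):

import sqlite3

con = sqlite3.connect('newdb.sqlite')
cur = con.cursor()
cur.execute('CREATE INDEX IF NOT EXISTS idx_stuff_one ON stuff (one);')
cur.execute('SELECT * FROM stuff WHERE one = ?', ('search_value',))
print(cur.fetchall())
con.close()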
You can use memory mapping for really big files:
import mmap, os, re

reportFile = open("big_file")
length = os.fstat(reportFile.fileno()).st_size
try:
    mapping = mmap.mmap(reportFile.fileno(), length, mmap.MAP_PRIVATE, mmap.PROT_READ)
except AttributeError:
    mapping = mmap.mmap(reportFile.fileno(), 0, None, mmap.ACCESS_READ)
data = mapping.read(length)
pat = re.compile("b.+", re.M | re.DOTALL)  # compile your pattern here.
print pat.findall(data)
Well, if your words aren't too big (meaning they'll fit in memory), then here is a simple way to do this (I'm assuming that they are all words).
from bisect import bisect_left

f = open('myfile.csv')
words = []
for line in f:
    words.extend(line.strip().split(','))

# bisect requires the list to be sorted (the question says the file is alphabetical)
wordtofind = 'bacon'
ind = bisect_left(words, wordtofind)
if words[ind] == wordtofind:
    print '%s was found!' % wordtofind
It might take a minute to load all of the values from the file. This uses binary search to find your words. In this case I was looking for bacon (who wouldn't look for bacon?). If there are repeated values, you might also want to use bisect_right to find the index one past the rightmost element that equals the value you are searching for. You can still use this if you have key:value pairs; you'll just have to make each object in your words list be a list of [key, value].
Side Note
I don't think that you can really go from line to line in a csv file very easily. You see, these files are basically just long strings with \n characters that indicate new lines.
You can't go directly to a specific line in the file because lines are variable-length, so the only way to know when line #n starts is to search for the first n newlines. And it's not enough to just look for '\n' characters because CSV allows newlines in table cells, so you really do have to parse the file anyway.
My idea is to use the Python ZODB module to store the dictionary-type data and then create a new CSV file using that data structure, doing all of your operations on it at that point.
There is a fairly simple way to do this. Depending on how many columns you want Python to print, you may need to add or remove some of the print lines.
import csv

search = input('Enter string to search: ')
stock = open('FileName.csv', 'r')        # open for reading, not writing
reader = csv.reader(stock)
for row in reader:
    for field in row:
        if field == search:              # compare against the entered string
            print('Record found! \n')
            print(row[0])
            print(row[1])
            print(row[2])
I hope this managed to help.
