I am running Python 3.4 on Windows 7 64-bit.
Basically I have a CSV file which looks like this:
#Begin
#Column1Title;Column2Title;Column3Title;Column4Title
value1;value2;value3;value4
#End
#Begin
#Column1Title;Column2Title;Column3Title;Column4Title;Column5Title;Column6Title
value1;value2;value3;value4;value5;value6
value1;value2;value3;value4;value5;value6
#End
....
A single CSV file contains several tables (with different numbers of columns) delimited by #Begin and #End tags. Each table has a header (column titles) and has nothing to do with the other tables in the file; the file has almost 14,000 lines.
I'd like to determine only the positions of the #Begin and #End tags in order to efficiently extract the data between them. I'd like to avoid reading the file line by line unless someone advises me otherwise.
I tried to get my head around pandas and installed version 0.15.2. So far, I have not been able to produce anything close to what I want with it.
Since the file is long and the next step will be to parse multiple files like this at the same time, I am looking for the most efficient approach in terms of execution time.
Computational efficiency is likely less important in most cases than storage access speed, so the biggest speedup will come from reading the file only once (rather than finding each #Begin first and then iterating again to process the data).
Looping line by line is not in fact terribly inefficient, so an effective approach could be to check each line for a #Begin tag and then enter a data-processing loop for each table that exits when it finds an #End tag, as in the sketch below.
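For illustration, a minimal single-pass sketch of that idea (the function name, the pandas use, and the assumption that the tags appear on their own lines are mine, not from the question):

import pandas as pd

def read_tables(path):
    # yield one DataFrame per #Begin/#End block, in a single pass over the file
    block = []
    inside = False
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line == '#Begin':
                inside = True
                block = []
            elif line == '#End':
                inside = False
                # the first collected line is the '#Column1Title;...' header
                header = block[0].lstrip('#').split(';')
                rows = [row.split(';') for row in block[1:]]
                yield pd.DataFrame(rows, columns=header)
            elif inside:
                block.append(line)

# usage: tables = list(read_tables('data.csv'))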
Related
I have some large files (more than 30 GB) with pieces of information on which I need to do some calculations, like averaging. The pieces I mention are slices of the file, and I know the beginning line number and the count of following lines for each slice.
So I have a dictionary with beginning line numbers as keys and counts of following rows as values, and I use this dictionary to loop through the file and take slices of it. For each slice, I create a table, do some conversions and averaging, create a new table and convert it into a dictionary. I use islice for the slicing and pandas DataFrames to create tables from each slice.
However, over time the process gets slower and slower, even though the sizes of the slices are more or less the same.
First 1k slices - processed in 1h
Second 1k slices - processed in 4h
Third 1k slices - processed in 8h
Fourth 1k slices - processed in 17h
And I am waiting for days to complete the processes.
Right now I am doing this on a Windows 10 machine with a 1 TB SSD and 32 GB of RAM. Previously I also tried on a Linux machine (Ubuntu 18.04) with a 250 GB SSD and 8 GB of RAM + 8 GB of virtual RAM. Both gave more or less the same results.
What I noticed on Windows is that 17% of the CPU and 11% of the memory is being used, but disk usage is at 100%. I do not fully understand what disk usage means or how I can improve it.
As part of the code I was also importing data into MongoDB while working on Linux, and I thought maybe it was because of indexing in MongoDB. But when I print the processing time and the import time, I see that almost all the time is spent on processing; the import takes a few seconds.
Also, to save time, I am now doing the processing part on a stronger Windows machine and writing the documents as txt files. I expect that writing to disk slows the process down a bit, but the txt files are no larger than 600 KB.
Below is the piece of code showing how I read the file:
from itertools import islice

with open(infile) as inp:
    for i in range(0, len(seg_ids)):
        inp.seek(0)
        segment_slice = islice(inp,
                               list(seg_ids.keys())[i],
                               list(seg_ids.keys())[i] + list(seg_ids.values())[i] + 1)
        segment = list(segment_slice)
        for _, line in enumerate(segment[1:]):
            ...  # create dataframe and perform calculations
So I want to know whether there is a way to improve the processing time. I suppose my code reads the whole file from the beginning for each slice, so towards the end of the file the reading time gets longer and longer.
As a note, because of time constraints I started with the most important slices first, so the rest will be more random slices across the files. The solution should therefore work for random slices as well, if possible.
I am not experienced in scripting, so please forgive me if I am asking a silly question, but I really could not find any answer.
A couple of things come to mind.
First, if you bring the data into a pandas DataFrame, there is a chunksize argument for importing large data. It allows you to process and discard what you do or don't need, while providing information such as df.describe(), which will give you summary stats.
Also, I hear great things about Dask. It is a scalable platform for parallel, multi-core, multi-machine processing, is almost as simple to use as pandas and numpy, and requires very little management of resources.
Use pandas or Dask and pay attention to the options for read_csv(), mainly: chunksize, nrows, skiprows, usecols, engine (use C), low_memory, memory_map.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
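As a rough sketch of the chunked-reading pattern (the file name, separator and column names below are placeholders, not from the question):

import pandas as pd

# read the file in pieces instead of all at once; each chunk is a DataFrame
reader = pd.read_csv('big_file.csv', sep=';', engine='c',
                     usecols=['col_a', 'col_b'],   # placeholder column names
                     chunksize=100000)

for chunk in reader:
    # aggregate or filter each chunk, then let it go out of scope
    print(chunk.describe())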
The problem here is you are rereading a HUGE unindexed file multiple times line by line from the start of the file. No wonder this takes HOURS.
Each islice in your code is starting from the very beginning of the file -- every time -- and reads and discards all the lines from the file before it even reaches the start of the data of interest. That is very slow and inefficient.
The solution is to create a poor man's index for that file and then read smaller chunks for each slice.
Let's create a test file:
from pathlib import Path

p = Path('/tmp/file')
with open(p, 'w') as f:
    for i in range(1024 * 1024 * 500):
        f.write(f'{i+1}\n')

print(f'{p} is {p.stat().st_size/(1024**3):.2f} GB')
That creates a file of roughly 4.78 GB. Not as large as 30 GB but big enough to be SLOW if you are not thoughtful.
Now try reading that entire file line-by-line with the Unix utility wc to count the total lines (wc being the fastest way to count lines, in general):
$ time wc -l /tmp/file
524288000 file
real 0m3.088s
user 0m2.452s
sys 0m0.636s
Compare that with the speed for Python3 to read the file line by line and print the total:
$ time python3 -c 'with open("file","r") as f: print(sum(1 for l in f))'
524288000
real 0m53.880s
user 0m52.849s
sys 0m0.940s
Python is almost 18x slower to read the file line by line than wc.
Now do a further comparison. Check out the speed of the Unix utility tail to print the last n lines of a file:
$ time tail -n 3 file
524287998
524287999
524288000
real 0m0.007s
user 0m0.003s
sys 0m0.004s
The tail utility is roughly 440x faster than wc (and roughly 8,000x faster than Python) at getting to the last three lines of the file because it uses a windowed buffer: tail reads some number of bytes at the end of the file and then extracts the last n lines from the buffer it read.
It is possible to apply the same tail approach to your application.
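For illustration, here is a minimal Python imitation of that windowed read (the block size is an arbitrary assumption and must be large enough to contain the last n lines):

def tail(path, n=3, block_size=64 * 1024):
    # return the last n lines without reading the whole file
    with open(path, 'rb') as f:
        f.seek(0, 2)                          # jump to the end of the file
        size = f.tell()
        f.seek(max(0, size - block_size))     # back up one block from the end
        lines = f.read().split(b'\n')
        return [l.decode() for l in lines if l][-n:]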
Consider a photo of a 1950s rack of IBM data tapes:
The approach you are using is the equivalent of reading every tape on that rack to find the data that sits on only two of the tapes in the middle -- and doing it over and over and over...
In the 1950s (the era of the photo) each tape was roughly indexed by what it held. Computers would call for a specific tape from the rack -- not ALL the tapes in the rack.
The solution to your issue (in overview) is to build a tape-like indexing scheme:
Run through the 30 GB file ONCE and create an index of sub-blocks keyed by the starting line number of each block. Think of each sub-block as roughly one tape (except that it runs easily all the way to the end...).
Instead of using inp.seek(0) before every read, you would seek to the block that contains the line number of interest (just like tail does) and then use islice with the offset adjusted to where that block's starting line number sits relative to the start of the file.
You have a MASSIVE advantage compared with what they had to do in the '50s and '60s: you only need to find the starting block, since you have access to the entire remaining file. A 1950s tape index would call for tapes x, y, z, ... to read data larger than one tape could hold; you only need to find x, the block containing the starting line number of interest.
BTW, since each IBM tape of this type held roughly 3 MB, your 30 GB file would be more than 10 million of these tapes...
Correctly implemented (and this is not terribly hard to do) it would speed up the read performance by 100x or more.
Constructing a useful index to a text file by line offset might be as easy as something like this:
def index_file(p, delimiter=b'\n', block_size=1024 * 1024):
    index = {0: 0}                  # line number -> byte offset of that line's start
    total_lines, cnt = (0, 0)
    with open(p, 'rb') as f:
        while buf := f.raw.read(block_size):
            cnt = buf.count(delimiter)          # lines completed in this block
            idx = buf.rfind(delimiter)          # last newline within the block
            key = cnt + total_lines             # number of the first line after it
            index[key] = f.tell() - (len(buf) - idx) + len(delimiter)
            total_lines += cnt
    return index

# this index is created in about 4.9 seconds on my computer...
# with a 4.8 GB file, there are ~4,800 index entries
That constructs an index correlating starting line number (in that block) with byte offset from the beginning of the file:
>>> idx=index_file(p)
>>> idx
{0: 0, 165668: 1048571, 315465: 2097150, 465261: 3145722,
...
524179347: 5130682368, 524284204: 5131730938, 524288000: 5131768898}
Then if you want access to lines[524179347:524179500] you don't need to read 4.5 GB to get there; you can just do f.seek(5130682368) and start reading instantly.
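To make that concrete, here is a sketch of how such an index could be used (the helper name and the block lookup are assumptions layered on top of the index above):

from itertools import islice

def read_lines(path, index, start, count):
    # read `count` lines beginning at line number `start` using the block index
    block_line = max(k for k in index if k <= start)   # nearest indexed line <= start
    with open(path, 'rb') as f:
        f.seek(index[block_line])                      # jump straight to that block
        # skip only the lines between the block start and the slice start
        return list(islice(f, start - block_line, start - block_line + count))

# e.g. lines = read_lines(p, idx, 524179347, 153)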
I have a problem that I have not been able to solve. I have 4 .txt files each between 30-70GB. Each file contains n-gram entries as follows:
blabla1/blabla2/blabla3
word1/word2/word3
...
What I'm trying to do is count how many times each item appears, and save this data to a new file, e.g.:
blabla1/blabla2/blabla3 : 1
word1/word2/word3 : 3
...
My attempts so far have been to simply save all entries in a dictionary and count them, i.e.:
from collections import defaultdict

entry_count_dict = defaultdict(int)
with open(file) as f:
    for line in f:
        entry_count_dict[line] += 1
However, using this method I run into memory errors (I have 8 GB of RAM available). The data follows a Zipfian distribution; the majority of the items occur only once or twice.
The total number of entries is unclear, but a (very) rough estimate is that there is somewhere around 15,000,000 entries in total.
In addition to this, I've tried h5py, where each entry is saved as an h5py dataset containing the array [1], which is then updated, e.g.:
import h5py
import numpy as np

entry_count_file = h5py.File(filename)
with open(file) as f:
    for line in f:
        if line in entry_count_file:
            entry_count_file[line][0] += 1
        else:
            entry_count_file.create_dataset(line,
                                            data=np.array([1]),
                                            compression="lzf")
However, this method is way too slow; the writing speed gets slower and slower. As such, unless the writing speed can be increased, this approach is not feasible. Also, processing the data in chunks and opening/closing the h5py file for each chunk did not show any significant difference in processing speed.
I've been thinking about saving entries which start with certain letters in separate files, i.e. all the entries which start with a are saved in a.txt, and so on (this should be doable using defaultdict(int)).
However, to do this the file would have to be iterated over once for every letter, which is not feasible given the file sizes (max = 69 GB).
Perhaps, when iterating over the file, one could open a pickle, save the entry in a dict, and then close the pickle. But doing this for each item slows the process down quite a lot because of the time it takes to open, load and close the pickle file.
One way of solving this would be to sort all the entries in one pass, then iterate over the sorted file and count the entries alphabetically. However, even sorting the file is painfully slow using the Linux command:
sort file.txt > sorted_file.txt
And I don't really know how to solve this in Python, given that loading the whole file into memory for sorting would cause memory errors. I have some superficial knowledge of different sorting algorithms, but they all seem to require that the whole object to be sorted gets loaded into memory.
Any tips on how to approach this would be much appreciated.
There are a number of algorithms for performing this type of operation. They all fall under the general heading of External Sorting.
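A rough sketch of the classic run-and-merge variant, assuming each run fits in memory (the run size, temp-file handling and function name are my own choices, not a specific library's API):

import heapq
import os
from itertools import islice
from tempfile import mkstemp

def external_sort(in_path, out_path, lines_per_run=5000000):
    # phase 1: write sorted runs that each fit in memory
    run_paths = []
    with open(in_path) as f:
        while True:
            run = sorted(islice(f, lines_per_run))
            if not run:
                break
            fd, path = mkstemp(text=True)
            with os.fdopen(fd, 'w') as run_file:
                run_file.writelines(run)
            run_paths.append(path)
    # phase 2: k-way merge of the sorted runs, streaming to the output file
    run_files = [open(p) for p in run_paths]
    with open(out_path, 'w') as out:
        out.writelines(heapq.merge(*run_files))
    for rf in run_files:
        rf.close()
    for p in run_paths:
        os.remove(p)

Once the file is sorted, counting is just a matter of walking it once with itertools.groupby, since equal entries are now adjacent.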
What you described there -- saving entries which start with certain letters in separate files -- is actually a bucket sort, which should, in theory, be faster. Try it with sliced data sets.
or,
try Dask, a DARPA- and Anaconda-backed distributed computing library with interfaces familiar from numpy and pandas that works like Apache Spark (it works on a single machine too).
By the way, it scales.
I suggest trying dask.array, which cuts a large array into many small ones and implements the numpy ndarray interface with blocked algorithms to utilize all of your cores when computing on these larger-than-memory datasets.
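For counting lines of text specifically, dask.bag (rather than dask.array) may be the closer fit; a hedged sketch, assuming the n-gram files match a glob pattern:

import dask.bag as db

# lazily read all n-gram files; blocksize splits each big file into partitions
lines = db.read_text('ngrams_*.txt', blocksize='64MB')

# count how often each distinct line occurs across all files
counts = lines.map(str.strip).frequencies()

# write '<ngram>\t<count>' records without materializing everything in memory
counts.map(lambda kv: f'{kv[0]}\t{kv[1]}').to_textfiles('counts-*.txt')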
I've been thinking about saving entries which start with certain letters in separate files, i.e. all the entries which start with a are saved in a.txt, and so on (this should be doable using defaultdict(int)). However, to do this the file would have to be iterated over once for every letter, which is not feasible given the file sizes (max = 69 GB).
You are almost there with this line of thinking. What you want to do is split the file based on a prefix -- you don't have to iterate once for every letter. This is trivial in awk. Assuming your input files are in a directory called input:
mkdir output
awk '/./ {print $0 > ("output/" substr($0, 1, 1))}' input/*
This will append each line to a file named after the first character of that line (note this will be odd if your lines can start with a space; since these are n-grams I assume that's not relevant). You could also do this in Python, but managing the opening and closing of the files is somewhat more tedious; a sketch follows below.
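If you prefer to stay in Python, a sketch of the same single-pass bucketing (the directory names are placeholders, and lines are assumed to start with a character that is safe to use as a filename):

import os
from glob import glob

os.makedirs('output', exist_ok=True)
handles = {}   # one open output file per first character

try:
    for in_path in glob('input/*'):
        with open(in_path) as f:
            for line in f:
                if not line.strip():
                    continue
                prefix = line[0]
                if prefix not in handles:
                    handles[prefix] = open(os.path.join('output', prefix), 'a')
                handles[prefix].write(line)
finally:
    for h in handles.values():
        h.close()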
Because the files have been split up they should be much smaller now. You could sort them but there's really no need - you can read the files individually and get the counts with code like this:
from collections import Counter

ngrams = Counter()
for line in open(filename):
    ngrams[line.strip()] += 1

for key, val in ngrams.items():
    print(key, val, sep='\t')
If the files are still too large you can increase the length of the prefix used to bucket the lines until the files are small enough.
I am attempting to convert a very large JSON file to CSV. I have been able to convert a small file of this type to a 10-record (for example) CSV file. However, when trying to convert a large file (on the order of 50,000 rows in the CSV file) it does not work. The data was created by a curl command with -o pointing to the JSON file to be created; the file that is output does not have newline characters in it. The CSV file will be written with csv.DictWriter() and (where data is the JSON file input) has the form
rowcount = len(data['MainKey'])
colcount = len(data['MainKey'][0]['Fields'])
I then loop through the range of the rows and columns to get the csv dictionary entries
csvkey = data['MainKey'][recno]['Fields'][colno]['name']
csvval = data['MainKey'][recno]['Fields'][colno]['Values']['value']
I attempted to use the answers from other questions, but they did not work with a big file (du -m bigfile.json = 157) and the files that I want to handle are even larger.
An attempt to get the size of each line:
myfile = open('file.json', 'r')
line = myfile.readline()
print(len(line))
shows that this reads the entire file as a single string. Thus, one small file will show a length of 67744, while a larger file will show 163815116.
An attempt to read the data directly from
data=json.load(infile)
gives the error that other questions have discussed for the large files
An attempt to use the
def json_parse(self, fileobj, decoder=JSONDecoder(), buffersize=2048):
    ...
    yield results
as shown in another answer works with a 72 KB file (10 rows, 22 columns) but seems to either lock up or take an interminable amount of time for an intermediate-sized file of 157 MB (from du -m bigfile.json).
Note that a debug print shows that each chunk is 2048 bytes, as specified by the default input argument. It appears that it is trying to go through the entire 163,815,116 bytes (the len shown above) in 2048-byte chunks. If I change the chunk size to 32768, simple math shows that it would take about 5,000 cycles through the loop to process the file.
A change to a chunk size of 524288 exits the loop approximately every 11 chunks, but it should still take approximately 312 chunks to process the entire file.
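The incremental parser from that other answer is not reproduced here, but a generator of that general shape (a sketch, not the original code, dropping the self parameter) usually looks something like this:

from json import JSONDecoder

def json_parse(fileobj, decoder=JSONDecoder(), buffersize=8388608):
    # yield complete JSON objects as soon as they can be decoded from the stream
    buffer = ''
    for chunk in iter(lambda: fileobj.read(buffersize), ''):
        buffer += chunk
        while buffer:
            try:
                result, index = decoder.raw_decode(buffer)
            except ValueError:
                break                       # need more data for a complete object
            yield result
            buffer = buffer[index:].lstrip()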
If I can get it to stop at the end of each row item, I would be able to process that row and send it to the csv file based on the form shown below.
vi on the small file shows that it is of the form
{"MainKey":[{"Fields":[{"Value": {'value':val}, 'name':'valname'}, {'Value': {'value':val}, 'name':'valname'}}], (other keys)},{'Fields' ... }] (other keys on MainKey level) }
I cannot use ijson as I must set this up for systems that I cannot import additional software for.
I wound up using a chunk size of 8388608 (0x800000) in order to process the files. I then processed the lines that had been read in as part of the loop, keeping count of rows processed and rows discarded. At each call of the process function, I added its counts to the totals so that I could keep track of the total records processed.
This appears to be the way it needs to go.
Next time a question like this is asked, please emphasize that a large chunk size must be specified, not the 2048 shown in the original answer.
The loop goes
first = True
for data in self.json_parse(inf):
    records = len(data['MainKey'])
    columns = len(data['MainKey'][0]['Fields'])
    if first:
        # Initialize output as DictWriter
        ofile, outf, fields = self.init_csv(csvname, data, records, columns)
        first = False
    reccount, errcount = self.parse_records(outf, data, fields, records)
Within the parsing routine
for rec in range(records):
    currec = data['MainKey'][rec]
    # If each column count can be different
    columns = len(currec['Fields'])
    retval, valrec = self.build_csv_row(currec, columns, fields)
To parse the columns use
for col in range(columns):
    dataname = currec['Fields'][col]['name']
    dataval = currec['Fields'][col]['Values']['value']
Thus the references now work and the processing is handled correctly. The large chunk apparently allows the processing to be fast enough to handle the data while being small enough not to overload the system.
I am currently programming a game that requires reading and writing lines in a text file. I was wondering if there is a way to read a specific line in the text file (e.g. the first line). Also, is there a way to write a line at a specific location (e.g. change the first line of the file, write a couple of other lines, and then change the first line again)? I know that we can read lines sequentially by calling:
f.readline()
Edit: Based on responses, apparently there is no way to read specific lines if they are different lengths. I am only working on a small part of a large group project and to change the way I'm storing data would mean a lot of work.
But is there a method to change specifically the first line of the file? I know calling:
f.write('text')
Writes something into the file, but it writes the line at the end of the file instead of the beginning. Is there a way for me to specifically rewrite the text at the beginning?
If all your lines are guaranteed to be the same length, then you can use f.seek(N) to position the file pointer at the N'th byte (where N is LINESIZE*line_number) and then f.read(LINESIZE). Otherwise, I'm not aware of any way to do it in an ordinary ASCII file (which I think is what you're asking about).
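A small illustration of that fixed-record idea (LINESIZE is an assumed constant and must include the newline, so every line really does have to be the same length):

LINESIZE = 28   # assumed fixed record length, newline included

def read_record(f, line_number):
    # jump straight to the record instead of scanning the file
    f.seek(LINESIZE * line_number)
    return f.read(LINESIZE)

def write_record(f, line_number, text):
    # overwrite one record in place; text must be exactly LINESIZE characters
    assert len(text) == LINESIZE
    f.seek(LINESIZE * line_number)
    f.write(text)

# with open('data.txt', 'r+') as f:
#     first = read_record(f, 0)
#     write_record(f, 0, 'x' * (LINESIZE - 1) + '\n')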
Of course, you could store some sort of record information in the header of the file and read that first to let you know where to seek to in your file -- but at that point you're better off using some external library that has already done all that work for you.
Unless your text file is really big, you can always store each line in a list:
with open('textfile','r') as f:
    lines = [L[:-1] for L in f.readlines()]
(note I've stripped off the newline so you don't have to remember to keep it around)
Then you can manipulate the list by adding entries, removing entries, changing entries, etc.
At the end of the day, you can write the list back to your text file:
with open('textfile','w') as f:
    f.write('\n'.join(lines))
Here's a little test which works for me on OS-X to replace only the first line.
test.dat
this line has n characters
this line also has n characters
test.py
#First, I get the length of the first line -- if you already know it, skip this block
f=open('test.dat','r')
l=f.readline()
linelen=len(l)-1
f.close()
#apparently mode='a+' doesn't work on all systems :( so I use 'r+' instead
f=open('test.dat','r+')
f.seek(0)
f.write('a'*linelen+'\n') #'a'*linelen = 'aaaaaaaaa...'
f.close()
These days, jumping within files in an optimized fashion is a task for high performance applications that manage huge files.
Are you sure that your software project requires reading/writing random places in a file during runtime? I think you should consider changing the whole approach:
If the data is small, you can keep / modify / generate the data at runtime in memory within appropriate container formats (list or dict, for instance) and then write it entirely at once (on change, or only when your program exits). You could consider looking at simple databases. Also, there are nice data exchange formats like JSON, which would be the ideal format in case your data is stored in a dictionary at runtime.
An example, to make the concept more clear. Consider you already have data written to gamedata.dat:
[{"playtime": 25, "score": 13, "name": "rudolf"}, {"playtime": 300, "score": 1, "name": "peter"}]
This is UTF-8-encoded, JSON-formatted data. Read the file during the runtime of your Python game:
import json

with open("gamedata.dat", encoding="utf-8") as f:
    s = f.read()
Convert the data to Python types:
gamedata = json.loads(s)
Modify the data (add a new user):
user = {"name": "john", "score": 1337, "playtime": 1}
gamedata.append(user)
John really is a 1337 gamer. However, at this point, you also could have deleted a user, changed the score of Rudolf or changed the name of Peter, ... In any case, after the modification, you can simply write the new data back to disk:
with open("gamedata.dat", "w") as f:
f.write(json.dumps(gamedata).encode("utf-8"))
The point is that you manage (create/modify/remove) data during runtime within appropriate container types. When writing data to disk, you write the entire data set in order to save the current state of the game.
I have a file that has over 200 lines in this format:
name old_id new_id
The name is useless for what I'm trying to do currently, but I still want it there because it may become useful for debugging later.
Now I need to go through every file in a folder and find all the instances of old_id and replace them with new_id. The files I'm scanning are code files that could be thousands of lines long. I need to scan every file with each of the 200+ ids that I have, because some may be used in more than one file, and multiple times per file.
What is the best way to go about doing this? So far I've been creating Python scripts to figure out the list of old ids and new ids and which ones match up with each other, but I've been doing it very inefficiently: I basically scanned the first file line by line, got the current id of the current line, then scanned the second file line by line until I found a match. Then I did this over again for each line in the first file, which ended up with me reading the second file a lot. I didn't mind doing it inefficiently because they were small files.
Now that I'm searching probably somewhere around 30-50 files that can have thousands of lines of code each, I want it to be a little more efficient. This is just a hobbyist project, so it doesn't need to be super good; I just don't want it to take more than 5 minutes to find and replace everything, only to look at the result, see that I made a little mistake, and have to do it all over again. Taking a few minutes is fine (although I'm sure computers nowadays could do this almost instantly), I just don't want it to be ridiculous.
So what's the best way to go about doing this? So far I've been using Python, but it doesn't need to be a Python script. I don't care about elegance in the code or the way I do it; I just want an easy way to replace all of my old ids with my new ids using whatever tool is easiest to use or implement.
Examples:
Here is a line from the list of ids. The first part is the name and can be ignored, the second part is the old id, and the third part is the new id that needs to replace the old id.
unlock_music_play_grid_thumb_01 0x108043c 0x10804f0
Here is an example line in one of the files to be replaced:
const v1, 0x108043c
I need to be able to replace that id with the new id so it looks like this:
const v1, 0x10804f0
Use something like multiwordReplace (I've edited it for your situation) with mmap.
import os
import os.path
import re
from contextlib import closing
from mmap import mmap, ACCESS_READ

id_filename = 'path/to/id/file'
directory_name = 'directory/to/replace/in'

# read the ids into a dictionary mapping old to new (as bytes, to match the mmap)
with open(id_filename, 'rb') as id_file:
    ids = dict(line.split()[1:] for line in id_file if line.strip())

# compile a regex that matches any of the old ids
id_regex = re.compile(b'|'.join(map(re.escape, ids)))

def translate(match):
    return ids[match.group(0)]

def multiwordReplace(text):
    return id_regex.sub(translate, text)

for code_filename in os.listdir(directory_name):
    path = os.path.join(directory_name, code_filename)
    with open(path, 'rb') as code_file:
        with closing(mmap(code_file.fileno(), 0, access=ACCESS_READ)) as code_map:
            new_file = multiwordReplace(code_map)
    with open(path, 'wb') as code_file:
        code_file.write(new_file)