python - Issue in processing files with big size

python - Issue in processing files with big size - python

Basically, I wanted to create a Python script for my daily tasks wherein i wanted to compare two file with any size & wanted to generated 2 new files having matching records & non-matching records from both file.
I have written below python script & found it's working properly for file size having few records.
But when i am executing same script with files with 200,000 and 500,000 records then resulting file getting generated is not giving valid output.
So, can you check below script and help to identify issue in it causing wrong output...?
Thanks in advance.
from sys import argv
script, filePathName1, filePathName2 = argv
def FileDifference(filePathName1, filePathName2):
fileObject1 = open(filePathName1,'r')
fileObject2 = open(filePathName2,'r')
newFilePathName1 = filePathName1 + ' - NonMatchingRecords.txt'
newFilePathName2 = filePathName1 + ' - MatchingRecords.txt'
newFileObject1 = open(newFilePathName1,'a')
newFileObject2 = open(newFilePathName2,'a')
file1 = fileObject1.readlines()
file2 = fileObject2.readlines()
Differece = [ diff for diff in file1 if diff not in file2 ]
for i in range(0,len(Differece)):
newFileObject1.write(Differece[i])
Matching = [ match for match in file1 if match in file2 ]
for j in range(0,len(Matching)):
newFileObject2.write(Matching[j])
fileObject1.close()
fileObject2.close()
newFileObject1.close()
newFileObject2.close()
FileDifference(filePathName1, filePathName2)
Edit-1 : Pls note that above program executes without any error. Its just that output is incorrect and program takes much longer time to get over large file.

I'll take a wild guess and assume that "no valid output" means: "runs forever and does nothing useful".
Which would be logical because of your list comprehensions:
Differece = [ diff for diff in file1 if diff not in file2 ]
for i in range(0,len(Differece)):
newFileObject1.write(Differece[i])
Matching = [ match for match in file1 if match in file2 ]
for i in range(0,len(Matching)):
newFileObject2.write(Matching[i])
They perform O(n) lookup, which is okay on a small number of lines but never ends if, say len(file1) == 100000 and so is file2. You then perform 100000*100000 iterations => 10**10 => forever.
Fix is simple: create sets and use intersection & difference, much faster.
file1 = set(fileObject1.readlines())
file2 = set(fileObject2.readlines())
difference = file1 - file2
for i in difference:
newFileObject1.write(i)
matching = file1 & file2
for i in matching:
newFileObject2.write(matching)

Related

Split huge file into n files keeping first 7 columns + next 3 columns until column n

I have a huge data frame with columns names:
A,B,C,D,F,G,H,GT_a,N_a_,E_a,GT_b,N_b_,E_b,GT_c,N_c_,E_c,...,GT_n,N_n,E_n
Using unix/bash or python, I want to produce n individual files with the following columns:
A,B,C,D,F,G,H,GT_a,N_a_,E_a
A,B,C,D,F,G,H,GT_b,N_b_,E_b
A,B,C,D,F,G,H,GT_c,N_c_,E_c
....
A,B,C,D,F,G,H,GT_n,N_n_,E_n
Each file should be called: a.txt, b.txt, c.txt,...,n.txt

Here are a couple of solutions with bash tools.
1. bash
Using cut inside a bash loop.This will raise n processes and parse the file n times.
Update for the case we don't have just a sequence of letters as _ids in column names, but many string ids, repeating the same every 3 lines after the first 7 lines. We have to first read the header of the file and extract them, e.g. a quick solution is to use awk and print them every 8th, 11th, etc column into the bash array.
#!/bin/bash
first=7
#ids=( {a..n} )
ids=( $( head -1 "$1" | awk -F"_" -v RS="," -v f="$first" 'NR>f && (NR+1)%3==0{print $2}' ) )
for i in "${!ids[#]}"; do
cols="1-$first,$((first+1+3*i)),$((first+2+3*i)),$((first+3+3*i))"
cut -d, -f"$cols" "$1" > "${ids[i]}.txt"
done
Usage: bash test.sh file
2. awk
Or you can use awk. Here I customize just the number of outputs, but the others can also be done like in the first solution.
BEGIN { FS=OFS=","; times=14 }
{
for (i=1;i<=times;i++) {
print $1,$2,$3,$4,$5,$6,$7,$(5+3*i),$(6+3*i),$(7+3*i) > sprintf("%c.txt",i+96)
}
}
Usage: awk -f test.awk file.
This solution should be fast, as it parses the file once. But it shouldn't be used like this, for a large number of output files, it could throw a "too many files open" error. For the range of the letters, it should be ok.

This should write out the different files, with different headers for each file. You'll have to change the COL_NAMES_TO_WRITE to be the ones that you want.
It uses the standard library, so no pandas. It won't write out more than 26 different files.. but the filename generator could be changed to augment that and allow that.
If I'm interpreting this question correctly, you want to split this into 14 files (a..n)
You'll have to copy this code below into a file, splitter.py
And then run this command:
python3.8 splitter.py --fn largefile.txt -n 14
Where largefile.txt is your huge file that you need to split.
import argparse
import csv
import string
COL_NAMES_TO_WRITE = "A,B,C,D,F,G,H,GT_{letter},N_{letter},E_{letter}"
WRITTEN_HEADERS = set() # place to keep track of whether headers have been written
def output_file_generator(num):
if num > 26: raise ValueError(f"Can only print out 26 different files, not {num}")
i = 0
while True:
prefix = string.ascii_lowercase[i]
i = (i + 1) % num # increment modulo number of files we want
yield f"{prefix}.txt"
def col_name_generator(num):
i = 0
while True:
col_suffix = string.ascii_lowercase[i]
i = (i + 1) % num # increment modulo number of files we want
print( COL_NAMES_TO_WRITE.format(letter=col_suffix).split(','))
yield COL_NAMES_TO_WRITE.format(letter=col_suffix).split(',')
def main(filename, num_files=4):
"""Split a file into multiple files
Args:
filename (str): large filename that needs to be split into multiple files
num_files (int): number of files to split filename into
"""
print(filename)
with open(filename, 'r') as large_file_fp:
reader = csv.DictReader(large_file_fp)
output_files = output_file_generator(num_files)
col_names = col_name_generator(num_files)
for line in reader:
print(line)
filename_for_this_file = output_files.__next__()
print("filename ", filename_for_this_file)
column_names_for_this_file = col_names.__next__()
print("col names:", column_names_for_this_file)
with open(filename_for_this_file, 'a') as output_fp:
writer = csv.DictWriter(output_fp, fieldnames=column_names_for_this_file)
if filename_for_this_file not in WRITTEN_HEADERS:
writer.writeheader()
WRITTEN_HEADERS.add(filename_for_this_file)
just_these_fields = {k:v for k,v in line.items() if k in column_names_for_this_file}
writer.writerow(just_these_fields)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("-fn", "--filename", required=True, default='large_file.txt', help="filename of large file to be split")
parser.add_argument("-n", "--num_files", required=False, default=4, help="number of separate files to split large_file into")
args = parser.parse_args()
main(args.filename, int(args.num_files))

import pandas as pd
import numpy as np
c = "A,B,C,D,F,G,H,GT_a,N_a_,E_a,GT_b,N_b_,E_b,GT_c,N_c_,E_c,GT_d,N_d_,E_d,GT_e,N_e_,E_e".split(',')
df = pd.DataFrame(np.full((30, 22), c), columns=c)
c = None
c = list(df.columns)
default = c[:7]
var = np.matrix(c[7:])
var = pd.DataFrame(var.reshape(var.shape[1]//3, 3))
def dump(row):
cols = default + list(row)
magic = cols[-1][-1]
df[cols].to_csv(magic + '.txt')
var.apply(dump, axis=1)

Drawing multiple sequences from 1 file, based on shared fields in another file

I'm trying to run a python script to draw sequences from a separate file (merged.fas), in respect to a list (gene_fams_eggnog.txt) I have as output from another program.
The code is as follows:
from Bio import SeqIO
import os, sys, re
from collections import defaultdict
sequences = "merged.fas"
all_seqs = SeqIO.index(sequences, "fasta")
gene_fams = defaultdict(list)
gene_fams_file = open("gene_fams_eggnog.txt")
for line in gene_fams_file:
fields = re.split("\t", line.rstrip())
gene_fams[fields[0]].append[fields[1]]
for fam in gene_fams.keys():
output_filename = str(fam) + ".fasta"
outh = open(output_filename, "w")
for id in gene_fams[fam]:
if id in all_seqs:
outh.write(">" + all_seqs[id].description + "\n" + str(all_seqs[id].seq) + "\n")
else:
print "Uh oh! Sequence with ID " + str(id) + " is not in the all_seqs file!"
quit()
outh.close()
The list looks like this:
1 Saccharomycescerevisiae_DAA09367.1
1 bieneu_EED42827.1
1 Asp_XP_749186.1
1 Mag_XP_003717339.1
2 Mag_XP_003716586.1
2 Mag_XP_003709453.1
3 Asp_XP_749329.1
The field 0 denotes a grouping based by a similarity between the sequences. The script was meant to take all the sequences from merged.fas that correspond to the code in the field 1 and write them into a file base on field 0.
So in the case of the portion of the list I have shown, all the sequences that have a 1 in field 0 (Saccharomycescerevisiae_DAA09367.1, bieneu_EED42827.1, Asp_XP_749186.1, Mag_XP_003717339.1) would have been written into a file called 1.fasta. This should continue from 2.fasta-however many groups there are.
So this has worked, however it doesn't include all the sequences that are in the group, it'll only include the last one to be listed as a part of that group. Using my example above, I'd only have a file (1.fasta) with one sequence (Mag_XP_003717339.1), instead of all four.
Any and all help is appreciated,
Thanks,
JT

Although I didn't spot the cause of the issue you complained about, I'm surprised your code runs at all with this error:
gene_fams[fields[0]].append[fields[1]]
i.e. append[...] instead of append(...). But perhaps that's also, "not there in the actual script I'm running". I rewrote your script below, and it works fine for me. One issue was your use of the variable name id which is a Python builtin. You'll see I go to an extreme to avoid such errors:
from Bio import SeqIO
from collections import defaultdict
SEQUENCE_FILE_NAME = "merged.fas"
FAMILY_FILE_NAME = "gene_families_eggnog.txt"
all_sequences = SeqIO.index(SEQUENCE_FILE_NAME, "fasta")
gene_families = defaultdict(list)
with open(FAMILY_FILE_NAME) as gene_families_file:
for line in gene_families_file:
family_id, gene_id = line.rstrip().split()
gene_families[family_id].append(gene_id)
for family_id, gene_ids in gene_families.items():
output_filename = family_id + ".fasta"
with open(output_filename, "w") as output:
for gene_id in gene_ids:
assert gene_id in all_sequences, "Sequence {} is not in {}!".format(gene_id, SEQUENCE_FILE_NAME)
output.write(all_sequences[gene_id].format("fasta"))

Correct mistakes in a python program dealing with CSV

I'm trying to edit a CSV file using informations from a first one. That doesn't seem simple to me as I should filter multiple things. Let's explain my problem.
I have two CSV files, let's say patch.csv and origin.csv. Output csv file should have the same pattern as origin.csv, but with corrected values.
I want to replace trip_headsign column fields in origin.csv using forward_line_name column in patch.csv if direction_id field in origin.csv row is 0, or using backward_line_name if direction_id is 1.
I want to do this only if the part of the line_id value in patch.csv between ":" and ":" symbols is the same as the part of route_id value in origin.csv before the ":" symbol.
I know how to replace a whole line, but not only some parts, especially that I sometimes have to look only part of a value.
Here is a sample of origin.csv:
route_id,service_id,trip_id,trip_headsign,direction_id,block_id
210210109:001,2913,70405957139549,70405957,0,
210210109:001,2916,70405961139553,70405961,1,
and a sample of patch.csv:
line_id,line_code,line_name,forward_line_name,forward_direction,backward_line_name,backward_direction,line_color,line_sort,network_id,commercial_mode_id,contributor_id,geometry_id,line_opening_time,line_closing_time
OIF:100110010:10OIF439,10,Boulogne Pont de Saint-Cloud - Gare d'Austerlitz,BOULOGNE / PONT DE ST CLOUD - GARE D'AUSTERLITZ,OIF:SA:8754700,GARE D'AUSTERLITZ - BOULOGNE / PONT DE ST CLOUD,OIF:SA:59400,DFB039,91,OIF:439,metro,OIF,geometry:line:100110010:10,05:30:00,25:47:00
OIF:210210109:001OIF30,001,FFOURCHES LONGUEVILLE PROVINS,Place Mérot - GARE DE LONGUEVILLE,,GARE DE LONGUEVILLE - Place Mérot,OIF:SA:63:49,000000 1,OIF:30,bus,OIF,,05:39:00,19:50:00
Each file has hundred of lines I need to parse and edit this way.
Separator is comma in my csv files.
Based on mhopeng answer to a previous question, I obtained that code:
#!/usr/bin/env python2
from __future__ import print_function
import fileinput
import sys
# first get the route info from patch.csv
f = open(sys.argv[1])
d = open(sys.argv[2])
# ignore header line
#line1 = f.readline()
#line2 = d.readline()
# get line of data
for line1 in f.readline():
line1 = f.readline().split(',')
route_id = line1[0].split(':')[1] # '210210109'
route_forward = line1[3]
route_backward = line1[5]
line_code = line1[1]
# process origin.csv and replace lines in-place
for line in fileinput.input(sys.argv[2], inplace=1):
line2 = d.readline().split(',')
num_route = line2[0].split(':')[0]
# prevent lines with same route_id but different line_code to be considered as the same line
if line.startswith(route_id) and (num_route == line_code):
if line.startswith(route_id):
newline = line.split(',')
if newline[4] == 0:
newline[3] = route_backward
else:
newline[3] = route_forward
print('\t'.join(newline),end="")
else:
print(line,end="")
But unfortunately, that doesn't push the right forward or backward_line_name in trip_headsign (always forward is used), the condition to compare patch.csv line_code to the end of route_id of origin.csv (after the ":") doesn't work, and the script finally triggers that error, before finishing parsing the file:
Traceback (most recent call last):
File "./GTFS_enhancer_headsigns.py", line 28, in
if newline[4] == 0:
IndexError: list index out of range
Could you please help me fixing these three problems?
Thanks for your help :)

You really should consider using the python csv module instead of split().
Out of experience , everything is much easier when working with csv files and the csv module.
This way you can iterate through the dataset in a structured way without the risk of getting index out of range errors.

Use grep on file in Python

I have searched the grep answers on here and cannot find an answer. They all seem to search for a string in a file, not a list of strings from a file. I already have a search function that works, but grep does it WAY faster. I have a list of strings in a file sn.txt (with one string on each line, no deliminators). I want to search another file (Merge_EXP.exp) for lines that have a match and write it out to a new file. The file I am searching in has a half millions lines, so searching for a few thousand in there takes hours without grep.
When I run it from command prompt in windows, it does it in minutes:
grep --file=sn.txt Merge_EXP.exp > Merge_EXP_Out.exp
How can I call this same process from Python? I don't really want alternatives in Python because I already have one that works but takes a while. Unless you think you can significantly improve the performance of that:
def match_SN(serialnumb, Exp_Merge, output_exp):
fout = open(output_exp,'a')
f = open(Exp_Merge,'r')
# skip first line
f.readline()
for record in f:
record = record.strip().rstrip('\n')
if serialnumb in record:
fout.write (record + '\n')
f.close()
fout.close()
def main(Output_CSV, Exp_Merge, updated_exp):
# create a blank output
fout = open(updated_exp,'w')
# copy header records
f = open(Exp_Merge,'r')
header1 = f.readline()
fout.write(header1)
header2 = f.readline()
fout.write(header2)
fout.close()
f.close()
f_csv = open(Output_CSV,'r')
f_csv.readline()
for rec in f_csv:
rec_list = rec.split(",")
sn = rec_list[2]
sn = sn.strip().rstrip('\n')
match_SN(sn,Exp_Merge,updated_exp)

Here is a optimized version of pure python code:
def main(Output_CSV, Exp_Merge, updated_exp):
output_list = []
# copy header records
records = open(Exp_Merge,'r').readlines()
output_list = records[0:2]
serials = open(Output_CSV,'r').readlines()
serials = [x.split(",")[2].strip().rstrip('\n') for x in serials]
for s in serials:
items = [x for x in records if s in x]
output_list.extend(items)
open(updated_exp, "w").write("".join(output_list))
main("sn.txt", "merge_exp.exp", "outx.txt")
Input
sn.txt:
x,y,0011
x,y,0002
merge_exp.exp:
Header1
Header2
0011abc
0011bcd
5000n
5600m
6530j
0034k
2000lg
0002gg
Output
Header1
Header2
0011abc
0011bcd
0002gg
Try this out and see how much time it takes...

When I use full path to grep location it worked (I pass it the grep_loc, Serial_List, Export):
import os
Export_Dir = os.path.dirname(Export)
Export_Name = os.path.basename(Export)
Output = Export_Dir + "\Output_" + Export_Name
print "\nOutput: " + Output + "\n"
cmd = grep_loc + " --file=" + Serial_List + " " + Export + " > " + Output
print "grep usage: \n" + cmd + "\n"
os.system(cmd)
print "Output created\n"

I think you have not chosen the right title for your question: What you want to do is the equivalent of a database JOIN. You can use grep for that in this particular instance, because one of your files only has keys and no other information. However, I think it is likely (but of course I don't know your case) that in the future your sn.txt may also contain extra information.
So I would solve the generic case. There are multiple solutions:
import all data into a database, then do a LEFT JOIN (in sql) or equivalent
use a python large data tool
For the latter, you could try numpy or, recommended because you are working with strings, pandas. Pandas has an optimized merge routine, which is very fast in my experience (uses cython under the hood).
Here is pandas PSEUDO code to solve your problem. It is close to real code but I need to know the names of the columns that you want to match on. I assumed here the one column in sn.txt is called key, and the matching column in merge_txt is called sn. I also see you have two header lines in merge_exp, read the docs for that.
# PSEUDO CODE (but close)
import pandas
left = pandas.read_csv('sn.txt')
right = pandas.read_csv('merge_exp.exp')
out = pandas.merge(left, right, left_on="key", right_on="sn", how='left')
out.to_csv("outx.txt")

Python: How to extract floating point numbers from a text file with mixed content?

I have a tab delimited text file with the following data:
ahi1
b/se
ahi
test -2.435953
1.218364
ahi2
b/se
ahi
test -2.001858
1.303935
I want to extract the two floating point numbers to a separate csv file with two columns, ie.
-2.435953 1.218264
-2.001858 1.303935
Currently my hack attempt is:
import csv
from itertools import islice
results = csv.reader(open('test', 'r'), delimiter="\n")
list(islice(results,3))
print results.next()
print results.next()
list(islice(results,3))
print results.next()
print results.next()
Which is not ideal. I am a Noob to Python so I apologise in advance and thank you for your time.

Here is the code to do the job:
import re
# this is the same data just copy/pasted from your question
data = """ ahi1
b/se
ahi
test -2.435953
1.218364
ahi2
b/se
ahi
test -2.001858
1.303935"""
# what we're gonna do, is search through it line-by-line
# and parse out the numbers, using regular expressions
# what this basically does is, look for any number of characters
# that aren't digits or '-' [^-\d] ^ means NOT
# then look for 0 or 1 dashes ('-') followed by one or more decimals
# and a dot and decimals again: [\-]{0,1}\d+\.\d+
# and then the same as first..
pattern = re.compile(r"[^-\d]*([\-]{0,1}\d+\.\d+)[^-\d]*")
results = []
for line in data.split("\n"):
match = pattern.match(line)
if match:
results.append(match.groups()[0])
pairs = []
i = 0
end = len(results)
while i < end - 1:
pairs.append((results[i], results[i+1]))
i += 2
for p in pairs:
print "%s, %s" % (p[0], p[1])
The output:
>>>
-2.435953, 1.218364
-2.001858, 1.303935
Instead of printing out the numbers, you could save them in a list and zip them together afterwards..
I'm using the python regular expression framework to parse the text. I can only recommend you pick up regular expressions if you don't already know it. I find it very useful to parse through text and all sorts of machine generated output-files.
EDIT:
Oh and BTW, if you're worried about the performance, I tested on my slow old 2ghz IBM T60 laptop and I can parse a megabyte in about 200ms using the regex.
UPDATE:
I felt kind, so I did the last step for you :P

Maybe this can help
zip(*[results]*5)
eg
import csv
from itertools import izip
results = csv.reader(open('test', 'r'), delimiter="\t")
for result1, result2 in (x[3:5] for x in izip(*[results]*5)):
... # do something with the result

Tricky enough but more eloquent and sequential solution:
$ grep -v "ahi" myFileName | grep -v se | tr -d "test\" " | awk 'NR%2{printf $0", ";next;}1'
-2.435953, 1.218364
-2.001858, 1.303935
How it works: Basically remove specific text lines, then remove unwanted text in lines, then join every second line with formatting. I just added the comma for beautification purposes. Leave the comma out of awks printf ", " if you don't need it.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

python - Issue in processing files with big size - python

Related

Split huge file into n files keeping first 7 columns + next 3 columns until column n

Drawing multiple sequences from 1 file, based on shared fields in another file

Correct mistakes in a python program dealing with CSV

Use grep on file in Python

Python: How to extract floating point numbers from a text file with mixed content?

Categories

Resources