Python cross-file search with regexp

I have 2 files, and I want to get all lines from file2 (fsearch) that contain any given line from file1 (forig).
I wrote a simple Python script that looks like this:
import re

matches = []

def search_string(w, file):
    global matches
    reg = re.compile(r'(^|^.*\|)' + w.strip("\r\n") + r'(\t|\|).*$', re.M)
    match = reg.findall(file)
    matches.extend(match)

# fsearch and forig are the already-opened file objects for file2 and file1
fsearch_text = fsearch.read()
for fword in forig:
    search_string(fword, fsearch_text)
There are about 100,000 lines in file1, and about 200,000 lines in file2, so my script takes about 6 hours to complete.
Is there a better algorithm to achieve the same goal in less time?
Edit:
I should have provided an example of why I need a regexp:
I am searching for a list of words from file1 and trying to match them to translations in file2. If I do not use a regexp to limit possible matches, I also match translations for words that merely contain the word I am searching for as a substring, for example:
Word I search: 浸し
Matched word: お浸し|御浸し|御したし &n boiled greens in bonito-flavoured soy sauce (vegetable side dish)
So I have to anchor the start of the match at either ^ or |, and the end at \t or |, while still capturing the whole line.

Assuming you can fit both files in memory, you can read them, sort them,
and then compare the lines linearly.
f1 = open('e:\\temp\\file1.txt')
lines1 = sorted([line for line in f1])
f2 = open('e:\\temp\\file2.txt')
lines2 = sorted([line for line in f2])

i1 = 0
i2 = 0
matchCount = 0
while i1 < len(lines1) and i2 < len(lines2):
    line1 = lines1[i1]
    line2 = lines2[i2]
    if line1 < line2:
        i1 += 1
    elif line1 > line2:
        i2 += 1
    else:
        matchCount += 1
        i2 += 1

print('matchCount')
print(matchCount)

If it is possible for you to use UNIX/GNU/Linux commands, you could do this:
# fill example files
for i in {1..100000}; do echo $RANDOM; done > file1.txt
for i in {1..200000}; do echo $RANDOM; done > file2.txt
# get every line of file2.txt which is also in file1.txt
# for static string matching:
grep -F -x -f file1.txt file2.txt
# for regex matching use (regular expressions in file1.txt):
grep -f file1.txt file2.txt
grep is optimized for such operations, so the above call takes less than a second (have a look at this).

Related

How to completely erase the duplicated lines by linux tools?

This question is not the same as How to print only the unique lines in BASH?, because that one suggests removing all copies of the duplicated lines, while this one is about eliminating only their duplicates, i.e., changing 1, 2, 3, 3 into 1, 2, 3 instead of just 1, 2.
This question is hard to phrase precisely, but the example makes it clear. If I have a file like this:
1
2
2
3
4
After parsing the file and erasing the duplicated lines, it should become this:
1
3
4
I know some Python, so this is a Python script I wrote to do it. Create a file called clean_duplicates.py and run it as shown in the header comment:
import sys

#
# To run it use:
# python clean_duplicates.py < input.txt > clean.txt
#
def main():
    lines = sys.stdin.readlines()
    # print( lines )
    clean_duplicates( lines )

#
# It only removes adjacent duplicated lines, so you need to sort the input
# (case-sensitively) before running it.
#
def clean_duplicates( lines ):
    lastLine    = lines[ 0 ]
    nextLine    = None
    currentLine = None
    linesCount  = len( lines )

    # If it is a one-line file, print it and stop
    if linesCount == 1:
        sys.stdout.write( lines[ linesCount - 1 ] )
        sys.exit()

    # Print the first line
    if linesCount > 1 and lines[ 0 ] != lines[ 1 ]:
        sys.stdout.write( lines[ 0 ] )

    # Print the middle lines
    for index in range( 1, linesCount - 1 ):
        currentLine = lines[ index ]
        nextLine    = lines[ index + 1 ]

        if currentLine == lastLine:
            continue

        lastLine = lines[ index ]

        if currentLine == nextLine:
            continue

        sys.stdout.write( currentLine )

    # Print the last line
    if linesCount > 2 and lines[ linesCount - 2 ] != lines[ linesCount - 1 ]:
        sys.stdout.write( lines[ linesCount - 1 ] )

if __name__ == "__main__":
    main()
That said, while searching around, removing duplicate lines seems easier with tools such as grep, sort, sed, and uniq:
How to remove duplicate lines inside a text file?
removing line from list using sort, grep LINUX
Find duplicate lines in a file and count how many time each line was duplicated?
Remove duplicate entries in a Bash script
How to delete duplicate lines in a file without sorting it in Unix?
How to delete duplicate lines in a file...AWK, SED, UNIQ not working on my file
You may use uniq with the -u/--unique option. As per the uniq man page:
-u / --unique
Don't output lines that are repeated in the input.
Print only lines that are unique in the INPUT.
For example:
cat /tmp/uniques.txt | uniq -u
OR, as mentioned in UUOC: Useless use of cat, better way will be to do it like:
uniq -u /tmp/uniques.txt
Both of these commands will return:
1
3
4
where /tmp/uniques.txt holds the numbers mentioned in the question, i.e.
1
2
2
3
4
Note: uniq requires the content of the file to be sorted. As mentioned in the docs:
By default, uniq prints the unique lines in a sorted file; it discards all but one of identical successive input lines, so that the OUTPUT contains unique lines.
In case the file is not sorted, you need to sort the content first
and then use uniq over the sorted content:
sort /tmp/uniques.txt | uniq -u
No sorting required and output order will be the same as input order:
$ awk 'NR==FNR{c[$0]++;next} c[$0]==1' file file
1
3
4
Europe Finland Office Supplies Online H 5/21/2015 193508565 7/3/2015 2339 651.21 524.96 1523180.19 1227881.44 295298.75
Europe Greece Household Online L 9/11/2015 895509612 9/26/2015 49 668.27 502.54 32745.23 24624.46 8120.77
Europe Hungary Beverages Online C 8/21/2012 722931563 8/25/2012 370 47.45 31.79 17556.50 11762.30 5794.20
Europe Hungary Beverages Online C 8/21/2012 722931563 8/25/2012 370 47.45 31.79 17556.50 11762.30 5794.20
If you have lines like these, you can use this command:
[isuru#192 ~]$ sort duplines.txt | sed 's/\ /\-/g' | uniq | sed 's/\-/\ /g'
But keep special characters in mind: if your lines already contain dashes, make sure to use a different substitute symbol. In the sed expressions above, the escaped space (\ ) is what gets replaced with a dash and later restored.
(The original answer includes before/after screenshots of the file.)
Kindly use the sort command with the -u argument to list the unique values of any command's output.
cat file_name | sort -u
1
2
3
4

Identify duplicates in rows of two different CSVs

I need to identify duplicates between column A of CSV1 and column A of CSV2. If a duplicate first name is identified, the entire row from CSV2 needs to be copied to a new CSV3. Can somebody help in Python?
CSV1
Adam
Eve
John
George
CSV2
Steve
Mark
Adam Smith
John Smith
CSV3
Adam Smith
John Smith
Here is a quick answer. It's O(n²), with n the number of lines in your CSV, and assumes two equal-length CSVs. If you need an O(n) solution (clearly optimal), then let me know; the trick there would be building a set of the elements of column A of csv1 (a rough sketch follows the code below).
lines1 = open('csv1.txt').read().split('\n')
delim = ', '
fields1 = [line.split(delim) for line in lines1]
lines2 = open('csv2.txt').read().split('\n')
fields2 = [line.split(delim) for line in lines2]

duplicates = []
for line1 in fields1:
    for line2 in fields2:
        if line1[0] == line2[0]:
            duplicates.append(line2)

print(duplicates)
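A minimal sketch of that O(n) variant (it assumes the same comma-space delimiter and plain text files as the snippet above, and is untested against your data):

delim = ', '
# Collect the first column of csv1 into a set; membership tests are then O(1).
names1 = set()
for line in open('csv1.txt').read().splitlines():
    if line:
        names1.add(line.split(delim)[0])

# Single pass over csv2: keep every row whose first field appears in csv1.
duplicates = []
for line in open('csv2.txt').read().splitlines():
    fields = line.split(delim)
    if fields and fields[0] in names1:
        duplicates.append(fields)
print(duplicates)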
Using any of the 3 one-liners:
Option 1: Parse file1 in the BEGIN Block
perl -lane 'BEGIN {$csv2 = pop; $seen{(split)[0]}++ while <>; @ARGV = $csv2 } print if $seen{$F[0]}' csv1 csv2
Option 2: Using a ternary
perl -lane 'BEGIN {($csv1) = @ARGV } $ARGV eq $csv1 ? $seen{$F[0]}++ : ($seen{$F[0]} && print)' csv1 csv2
Option 3: Using a single if
perl -lane 'BEGIN {($csv1) = @ARGV } print if $seen{$F[0]} += $ARGV eq $csv1 and $ARGV ne $csv1' csv1 csv2
Explanation:
Switches:
-l: Enable line ending processing
-a: Splits the line on space and loads the fields into the array @F
-n: Creates a while(<>){..} loop for each line in your input file.
-e: Tells perl to execute the code on command line.
A clean and Pythonic way to solve your problem:
words_a = set([])
words_b = set([])

with open('csv1') as a:
    words_a = set([w.strip()
                   for l in a.readlines()
                   for w in l.split(" ")
                   if w.strip()])

with open('csv2') as b:
    words_b = set([w.strip()
                   for l in b.readlines()
                   for w in l.split(" ")
                   if w.strip()])

with open('csv3', 'w') as wf:
    for w in words_a.intersection(words_b):
        wf.write(w)
        wf.write('\n')

How to find a word matching the first string in a file consisting of words that are space separated in python?

Let me explain my question better!
I have an input file that is of this format
word1 word2
word3 word4 word5
word4 word6
Given word3, I would like to be able to get the entire row and obtain word4 and word5.
Opening the file and parsing it line by line is possible, but my file is huge and it takes a very long time.
Is there a cost-efficient way in which this can be done?
Any help appreciated!
Unless the data are ordered in some predictable way (e.g. sorted), you have to read every line to find the relevant one.
with open('/path/file.txt') as input:
    for line in input:
        words = line.split()
        if words and words[0] == 'trigger':
            print(words[1:])
            break  # delete this line if you may have multiple matches
The above doesn't read the entire file into memory at once (if it is large); it processes the lines "one by one" (they will be read in buffer-sized chunks).
One possible improvement applies if all lines are the same size and very long: then you could seek to and read just the start of each line (a sketch follows). But they would have to be very long for that to be useful.
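A hypothetical sketch of that idea; LINE_SIZE, the path, and the word 'trigger' are placeholders, and every line must really be exactly LINE_SIZE bytes (newline included) for this to work:

LINE_SIZE = 1024              # assumed fixed record length, including the newline
PREFIX = b'trigger '          # the word we are looking for, plus the separator

with open('/path/file.txt', 'rb') as f:
    f.seek(0, 2)                              # jump to the end to get the file size
    nlines = f.tell() // LINE_SIZE
    for i in range(nlines):
        f.seek(i * LINE_SIZE)
        if f.read(len(PREFIX)) == PREFIX:     # read only the start of each line
            f.seek(i * LINE_SIZE)
            print(f.read(LINE_SIZE).split()[1:])
            break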
If you're on Unix, you might find it's quicker to execute a grep command in a subprocess, but that is still going to scan the entire file (albeit more quickly, in optimized C code).
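For example, a rough sketch of shelling out to grep (the path and the word 'trigger' are placeholders; check_output raises CalledProcessError if grep finds nothing):

import subprocess

out = subprocess.check_output(
    ['grep', '-m', '1', '^trigger ', '/path/file.txt'],
    universal_newlines=True)
print(out.split()[1:])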
I don't think using readlines() is really a problem for memory or time. Here is a short example that I ran against a file with 4000 lines, each having at least 600 characters.
import MyUtils as utils  # the poster's own timing helper

LOGDIR = '/opt/lsf_events/7.0.6/work/blr_ifx/logdir/lsb.acct.1'

utils.Timer.start()
with open(LOGDIR, 'r') as fHeader:
    for line in fHeader.readlines():
        if '1381671028' in line:  # that particular number exists in the last line of the file
            print(line)
utils.Timer.end()
The Output is...
Started Recording Time for the process...
"JOB_FINISH" "7.06" 1381671036 51303 22965 503578626 1 1381671028 0 0 1381671028 "umashank" "batch" "select[ ((type==X64LIN && osrel==50 && clearcase))]" "" "" "blrlc275" "/home/padbgl/spt9_m5p120_5v0_cm112/nodm/default/units/top/simulation/titan/FE/TPL_100_tx_top_new_ls" "" "" "" "1381671028.51303" 0 1 "blrlc275" 64 225.0 "" "/home/padbgl/bin/prjgate -e -- /home/umashank/.lsbatch/blrlc275.21758.0.1381671027.TITAN" 1.037842 0.119981 10116 0 -1 0 0 21997 0 0 0 0 -1 0 0 0 3735 82 -1 "" "padbgl_spt9_m5p120_5v0_cm112" 0 1 "" "" 0 3068 44332 "" "" "" "" 0 "" 0 "" -1 "/umashank" "" "" "" -1 "" "" 5136 "" 1381671028 "" "" 0
Process ended at : 15-10-13 08:02:56
Total time taken by the process is : 0:00:00.011601
Hopefully you can comfortably use readlines(), as it took very little time and is almost instant for a 3 MB file.
This is not an alternative to what you asked for; I'm just pointing out that there won't be any harm in using the typical, traditional way of reading a file.
Python's linecache module is the fastest way I know to look up a given line number from a file. You want a line matching the first word in that line, but maybe we can use linecache to get there. So let's create a mapping from words to line numbers:
from linecache import getline, getlines
from collections import defaultdict

first_words = defaultdict(int)
first_words.update(
    (line.split()[0], number)
    for number, line in enumerate(getlines(filename), 1)
    if line
)
From here, to get a line, just do:
>>> getline(filename, first_words['word3'])
'word3 word4 word5\n'
>>> getline(filename, first_words['word4'])
'word4 word6\n'
If you try to get a word that wasn't the first word in a line, you'll just get the empty string.
>>> getline(filename, first_words['word6'])
''
Now, I suppose it's possible you could have the same word beginning some lines, and in that case you might want to get more than one line back. So here's a modified version that accounts for that case:
from linecache import getline, getlines
from collections import defaultdict
from operator import itemgetter

first_words = defaultdict(list)
for number, line in enumerate(getlines(filename), 1):
    if line:
        first_words[line.split()[0]].append(number)
Now to get the lines (note that getlines() returns a plain 0-indexed list, while the stored numbers are 1-based, hence the - 1):
itemgetter(*(n - 1 for n in first_words['word3']))(getlines(filename))

Multi-line Matching in Python

I've read all of the articles I could find, even understood a few of them but as a Python newb I'm still a little lost and hoping for help :)
I'm working on a script to parse items of interest out of an application-specific log file. Each line begins with a time stamp which I can match, and I can define two things to identify what I want to capture: some partial content and a string that marks the end of what I want to extract.
My issue is multi-line entries: in most cases every log line is terminated with a newline, but some entries contain SQL that may have newlines within it and therefore spans multiple lines in the log.
So, in a simple case I may have this:
[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc (execution took 2083 milliseconds)
This all appears as one line which I can match with this:
re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2}).*(milliseconds)')
However, in some cases there may be line breaks in the SQL; I still want to capture the entry (and potentially replace the line breaks with spaces). I am currently reading the file a line at a time, which obviously isn't going to work, so...
Do I need to process the whole file in one go? They are typically 20mb in size. How do I read the entire file and iterate through it looking for single or multi-line blocks?
How would I write a multi-line RegEx that would match either the whole thing on one line or if it is spread across multiple lines?
My overall goal is to parameterize this so I can use it for extracting log entries that match different patterns of the starting string (always the start of a line), the ending string (where I want to capture to) and a value that is between them as an identifier.
Thanks in advance for any help!
Chris.
import sys, getopt, os, re

sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/Test.log"
lines = []

print("--- START ----")
lineStartsWith = re.compile(r'\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile('.*BMXAA6720W.*')
lineEndsWith = re.compile('(?:.*milliseconds.*)')

with open(logFileName, 'r') as f:
    for line in f:
        if lineStartsWith.match(line) and lineContains.match(line):
            if lineEndsWith.match(line):
                print('Full Line Found')
                print(line)
                print("- Record Separator -")
            else:
                print('Partial Line Found')
                print(line)
                print("- Record Separator -")

print("--- DONE ----")
Next step, for my partial line I'll continue reading until I find lineEndsWith and assemble the lines into one block.
I'm no expert so suggestions are always welcome!
UPDATE - So I have it working, thanks to all the responses that helped direct things, I realize it isn't pretty and I need to clean up my if / elif mess and make it more efficient but IT's WORKING! Thanks for all the help.
import sys, getopt, os, re

sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/Test.log"

print("--- START ----")
lineStartsWith = re.compile(r'\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile('.*BMXAA6720W.*')
lineEndsWith = re.compile('(?:.*milliseconds.*)')
lines = []
multiLine = False

with open(logFileName, 'r') as f:
    for line in f:
        if lineStartsWith.match(line) and lineContains.match(line) and lineEndsWith.match(line):
            lines.append(line.replace("\n", " "))
        elif lineStartsWith.match(line) and lineContains.match(line) and not multiLine:
            # Found the start of a multi-line entry
            multiLineString = line
            multiLine = True
        elif multiLine and not lineEndsWith.match(line):
            multiLineString = multiLineString + line
        elif multiLine and lineEndsWith.match(line):
            multiLineString = multiLineString + line
            multiLineString = multiLineString.replace("\n", " ")
            lines.append(multiLineString)
            multiLine = False

for line in lines:
    print(line)
Do I need to process the whole file in one go? They are typically 20mb in size. How do I read the entire file and iterate through it looking for single or multi-line blocks?
There are two options here.
You could read the file block by block, making sure to attach any "leftover" bit at the end of each block to the start of the next one, and search each block. Of course you will have to figure out what counts as "leftover" by looking at what your data format is and what your regex can match, and in theory it's possible for multiple blocks to all count as leftover…
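A rough sketch of that first option, assuming a 1 MiB block size and reusing the placeholder names pattern and do_stuff (neither is from the original post):

import re

pattern = re.compile(rb'\[.*?\].*?milliseconds\)', re.DOTALL)
leftover = b''
with open('bigfile', 'rb') as f:
    while True:
        block = f.read(1 << 20)                 # read 1 MiB at a time
        if not block:
            break
        data = leftover + block
        last_end = 0
        for match in pattern.finditer(data):    # search leftover + new block
            do_stuff(match)
            last_end = match.end()
        leftover = data[last_end:]              # unmatched tail joins the next block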
Or you could just mmap the file. An mmap acts like a bytes (or like a str in Python 2.x), and leaves it up to the OS to handle paging blocks in and out as necessary. Unless you're trying to deal with absolutely huge files (gigabytes in 32-bit, even more in 64-bit), this is trivial and efficient:
import mmap

with open('bigfile', 'rb') as f:
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as m:
        for match in compiled_re.finditer(m):
            do_stuff(match)
In older versions of Python, mmap isn't a context manager, so you'll need to wrap contextlib.closing around it (or just use an explicit close if you prefer).
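For example, a minimal sketch for those older versions (compiled_re and do_stuff are the same placeholders as above):

import contextlib
import mmap

with open('bigfile', 'rb') as f:
    with contextlib.closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)) as m:
        for match in compiled_re.finditer(m):
            do_stuff(match)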
How would I write a multi-line RegEx that would match either the whole thing on one line or if it is spread across multiple lines?
You could use the DOTALL flag, which makes the . match newlines. You could instead use the MULTILINE flag and put appropriate $ and/or ^ characters in, but that makes simple cases a lot harder, and it's rarely necessary. Here's an example with DOTALL (using a simpler regexp to make it more obvious):
>>> s1 = """[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc (execution took 2083 milliseconds)"""
>>> s2 = """[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and
(exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc (execution took 2083 milliseconds)"""
>>> r = re.compile(r'\[(.*?)\].*?milliseconds\)', re.DOTALL)
>>> r.findall(s1)
['8/21/13 11:30:33:557 PDT']
>>> r.findall(s2)
['8/21/13 11:30:33:557 PDT']
As you can see the second .*? matched the newline just as easily as a space.
If you're just trying to treat a newline as whitespace, you don't need either; '\s' already catches newlines.
For example:
>>> s1 = 'abc def\nghi\n'
>>> s2 = 'abc\ndef\nghi\n'
>>> r = re.compile(r'abc\s+def')
>>> r.findall(s1)
['abc def']
>>> r.findall(s2)
['abc\ndef']
You can read the entire file into a string and then use re.split to make a list of all the entries separated by timestamps. Here's an example:
f = open(...)
allLines = ''.join(f.readlines())
entries = re.split(regex, allLines)
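For instance, a hypothetical pattern for that split (the lookahead keeps each timestamp attached to its entry; splitting on zero-width matches requires Python 3.7+; allLines comes from the snippet above):

import re

# Split right before every "[d/d/dd " style timestamp.
entries = re.split(r'(?=\[\d{1,2}/\d{1,2}/\d{2} )', allLines)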

Text processing with two files

I have two text files in the following format:
The first is this on every line:
Key1:Value1
The second is this:
Key2:Value2
Is there a way I can replace Value1 in file1 by the Value2 obtained from using it as a key in file2?
For example:
file1:
foo:hello
bar:world
file2:
hello:adam
bar:eve
I would like to get:
foo:adam
bar:eve
There isn't necessarily a match between the two files on every line. Can this be done neatly in awk or something, or should I do it naively in Python?
Create two dictionaries, one for each file. For example:
file1 = {}
for line in open('file1', 'r'):
    k, v = line.strip().split(':')
    file1[k] = v
Or if you prefer a one-liner:
file1 = dict(l.strip().split(':') for l in open('file1', 'r'))
Then you could do something like:
result = {}
for key, value in file1.items():
    if value in file2:
        result[key] = file2[value]
Another way is to generate the key-value pairs in reverse for file1 and use sets. For example, if your file1 contains foo:bar, your file1 dict becomes {'bar': 'foo'}.
for key in set(file1) & set(file2):
    result[file1[key]] = file2[key]
Basically, you can quickly find common elements using set intersection, so those elements are guaranteed to be in file2 and you don't waste time checking for their existence.
Edit: As pointed out by @pepr, you can use collections.OrderedDict for the first method if order is important to you.
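For instance, a minimal sketch of that variant, keeping the one-liner from above but preserving the order in which keys appear in file1:

from collections import OrderedDict

file1 = OrderedDict(l.strip().split(':') for l in open('file1', 'r'))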
The awk solution:
awk '
BEGIN {FS = OFS = ":"}
NR==FNR {val[$1] = $2; next}
$1 in val {$2 = val[$1]}
{print}
' file2 file1
join -t : -1 2 -2 1 -o 0 2.2 -a 2 <(sort -k 2 -t : file1) <(sort file2)
The input files must be sorted on the field they are joined on.
The options:
-t : - Use a colon as the delimiter
-1 2 - Join on field 2 of file 1
-2 1 - Join on field 1 of file 2
-o 0 2.2 - Output the join field followed by field 2 from file2 (separated by the delimiter character)
-a 2 - Output unjoined lines from file2
Once you have:
file1 = {'foo':'hello', 'bar':'world'}
file2 = {'hello':'adam', 'bar':'eve'}
You can do an ugly one-liner:
print(dict([(i, file2[i]) if i in file2 else (i, file2[j]) if j in file2 else (i, j) for i, j in file1.items()]))
{'foo': 'adam', 'bar': 'eve'}
This works because, as in your example, you are using both the keys and the values of file1 as keys into file2.
This might work for you (probably GNU sed):
sed 's#\([^:]*\):\(.*\)#/\\(^\1:\\|:\1$\\)/s/:.*/:\2/#' file2 | sed -f - file1
If you do not consider using basic Unix/Linux commands cheating, then here is a solution using paste and awk.
paste file1.txt file2.txt | awk -F ":" '{ print $1":"$3 }'
TXR:
#(next "file2")
#(collect)
#key:#value1
# (cases)
# (next "file1")
# (skip)
#value2:#key
# (or)
# (bind value2 key)
# (end)
# (output)
#value2:#value1
# (end)
#(end)
Run:
$ txr subst.txr
foo:adam
bar:eve
