grep piping python equivalent - python

I use this bash command to find lines containing certain strings in a text file and print their first field:
cat a.txt | grep 'a' | grep 'b' | grep 'c' | cut -d" " -f1
How can I implement this in Python? I don't want to call OS commands, because it should be a cross-platform script.

You may try this,
with open('a.txt') as f:                             # open the file
    for line in f:                                   # iterate over the lines
        if all(i in line for i in ('a', 'b', 'c')):  # check if the line contains all of 'a', 'b', 'c'
            print line.split(" ")[0]                 # if yes, split on space and print the first value

You can always use the os library to do a system call:
import os
bashcmd = " cat a.txt | grep 'a' | grep 'b' | grep 'c' | cut -d' ' -f1"
print os.system( bashcmd )
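If you do shell out anyway, subprocess is generally preferred over os.system, since it captures the command's output instead of just echoing it and returning the exit status; note that either way the pipeline depends on cat, grep and cut, so it is not cross-platform. A minimal sketch:
import subprocess

# Run the same pipeline through a shell and capture its output
# (POSIX-only: it relies on cat/grep/cut being installed).
output = subprocess.check_output(
    "cat a.txt | grep 'a' | grep 'b' | grep 'c' | cut -d' ' -f1",
    shell=True)
print(output.decode())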

Related

Compare strings from a txt with bash or python ignoring pattern

I want to search a txt file for duplicate lines, ignoring [p] and the extension in the comparison. Once the matching lines are identified, show only the line that does not contain [p], keeping its extension. I have these lines in test.txt:
Peliculas/Desperados (2020)[p].mp4
Peliculas/La Duquesa (2008)[p].mp4
Peliculas/Nueva York Año 2012 (1975).mkv
Peliculas/Acoso en la noche (1980) .mkv
Peliculas/Angustia a Flor de Piel (1982).mkv
Peliculas/Desperados (2020).mkv
Peliculas/Angustia (1947).mkv
Peliculas/Días de radio (1987) BR1080[p].mp4
Peliculas/Mona Lisa (1986) BR1080[p].mp4
Peliculas/La decente (1970) FlixOle WEB-DL 1080p [Buzz][p].mp4
Peliculas/Mona Lisa (1986) BR1080.mkv
In this file, lines 1 and 6 are the same, as are lines 9 and 11 (ignoring the extension and [p]). Output needed:
Peliculas/Desperados (2020).mkv
Peliculas/Mona Lisa (1986) BR1080.mkv
I tried this, but it only shows the matching lines with the extension and the [p] pattern removed; I can't tell which original line they correspond to, and I need the complete line:
sed 's/\[p\]//' ./test.txt | sed 's/\.[^.]*$//' | sort | uniq -d
Error output (missing extension):
Peliculas/Desperados (2020)
Peliculas/Mona Lisa (1986) BR1080
Because you mentioned bash...
Remove any line with a p:
cat test.txt | grep -v p
home/folder/house from earth.mkv
home/folder3/window 1.avi
Remove any line with [p]:
cat test.txt | grep -v '\[p\]'
home/folder/house from earth.mkv
home/folder3/window 1.avi
home/folder4/little mouse.mpg
Probably not what you need, but for completeness: remove [p] from every line, then dedupe:
cat test.txt | sed 's/\[p\]//g' | sort | uniq
home/folder/house from earth.mkv
home/folder/house from earth.mp4
home/folder2/test.mp4
home/folder3/window 1.avi
home/folder3/window 1.mp4
home/folder4/little mouse.mpg
If a 2-pass solution (which reads the test.txt file twice) is acceptable, would you please try:
declare -A ary                      # associate the filename with the base
while IFS= read -r file; do
    if [[ $file != *\[p\]* ]]; then # the filename does not include "[p]"
        base="${file%.*}"           # remove the extension
        ary[$base]="$file"          # create a map
    fi
done < test.txt

while IFS= read -r base; do
    echo "${ary[$base]}"
done < <(sed 's/\[p\]//' ./test.txt | sed 's/\.[^.]*$//' | sort | uniq -d)
Output:
Peliculas/Desperados (2020).mkv
Peliculas/Mona Lisa (1986) BR1080.mkv
In the 1st pass, it reads the file line by line to create a map which associates the filename (with an extension) with the base (w/o the extension).
In the 2nd pass, it replaces the output (base) with the filename.
If you prefer a 1-pass solution (which will be faster), please try:
declare -A ary                      # associate the filename with the base
declare -A count                    # count the occurrences of the base
while IFS= read -r file; do
    base="${file%.*}"               # remove the extension
    if [[ $base =~ (.*)\[p\](.*) ]]; then
        # "$base" contains the substring "[p]"
        (( count[${BASH_REMATCH[1]}${BASH_REMATCH[2]}]++ ))   # increment the counter
    else
        (( count[$base]++ ))        # increment the counter
        ary[$base]="$file"          # map the filename
    fi
done < test.txt

for base in "${!ary[@]}"; do        # loop over the keys of ary
    if (( count[$base] > 1 )); then
        # the base is duplicated
        echo "${ary[$base]}"
    fi
done
In Python, you can use itertools.groupby with a key function that returns the filename with any [p] removed and the extension stripped. Since groupby only groups consecutive items, the lines are first sorted by that same key.
For any group of size 2 or more, the filenames not containing '[p]' are printed.
import itertools
import re

def make_key(line):
    return re.sub(r'\.[^.]*$', '', line.replace('[p]', ''))

with open('test.txt') as f:
    lines = sorted((line.strip() for line in f), key=make_key)

for key, group in itertools.groupby(lines, make_key):
    files = list(group)
    if len(files) > 1:
        for file in files:
            if '[p]' not in file:
                print(file)
This gives:
home/folder/house from earth.mkv
home/folder3/window 1.avi
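If you would rather avoid sorting the input first, a dict keyed on the same normalisation works as a single pass in Python as well. A sketch mirroring the 1-pass bash approach above:
import re
from collections import defaultdict

def make_key(line):
    # same normalisation as above: drop '[p]' and the extension
    return re.sub(r'\.[^.]*$', '', line.replace('[p]', ''))

groups = defaultdict(list)
with open('test.txt') as f:
    for line in f:
        groups[make_key(line.strip())].append(line.strip())

for files in groups.values():
    if len(files) > 1:
        for file in files:
            if '[p]' not in file:
                print(file)
For the test.txt above, this prints the two .mkv lines from the expected output.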

Python to search and print out the whole line just like Linux grep?

Let's use this as example.
>>> t = '''Line 1
... Line 2
... Line 3'''
>>>
re.findall only returns the matched pattern itself, which is similar to Linux grep -o:
>>> re.findall('2', t)
['2']
>>>
Linux grep
wolf@linux:~$ echo 'Line 2' | grep 2
Line 2
wolf@linux:~$
Linux grep -o
wolf@linux:~$ echo 'Line 2' | grep 2 -o
2
wolf@linux:~$
I know it's possible to print out the whole line; I just can't think of the logic at the moment.
Expected Output in Python
Line 2
If there's better way to do this, please let me know.
print([l for l in t.splitlines() if "2" in l])
Or, if you want it separated as in grep,
print('\n'.join([l for l in t.splitlines() if "2" in l]))
Put .* around what you want to find:
re.findall(r'.*2.*', t)
t = '''Line 1
Line 2
Line 3'''
for line in t.split('\n'):
    if line.find("2") != -1:
        print(line)
This should work for your use case; find() is used to check whether a substring is present in a string (it returns -1 when it is not).
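Putting these answers together, a small grep-like helper is easy to write (a sketch; the function name and signature are just for illustration):
import re

def grep(pattern, text):
    """Yield whole matching lines (like grep), not just the matched part (grep -o)."""
    for line in text.splitlines():
        if re.search(pattern, line):
            yield line

t = 'Line 1\nLine 2\nLine 3'
print('\n'.join(grep('2', t)))   # prints: Line 2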

How to pipe many bash commands from python?

Hi I'm trying to call the following command from python:
comm -3 <(awk '{print $1}' File1.txt | sort | uniq) <(awk '{print $1}' File2.txt | sort | uniq) | grep -v "#" | sed "s/\t//g"
How could I do the calling when the inputs for the comm command are also piped?
Is there an easy and straight forward way to do it?
I tried the subprocess module:
subprocess.call("comm -3 <(awk '{print $1}' File1.txt | sort | uniq) <(awk '{print $1}' File2.txt | sort | uniq) | grep -v '#' | sed 's/\t//g'")
Without success, it says:
OSError: [Errno 2] No such file or directory
Or do I have to create the different calls individually and then pass them using PIPE as it is described in the subprocess documentation:
p1 = Popen(["dmesg"], stdout=PIPE)
p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
p1.stdout.close() # Allow p1 to receive a SIGPIPE if p2 exits.
output = p2.communicate()[0]
Process substitution (<()) is bash-only functionality. Thus, you need a shell, but it can't be just any shell (like /bin/sh, as used by shell=True on non-Windows platforms) -- it needs to be bash.
subprocess.call(['bash', '-c', "comm -3 <(awk '{print $1}' File1.txt | sort | uniq) <(awk '{print $1}' File2.txt | sort | uniq) | grep -v '#' | sed 's/\t//g'"])
By the way, if you're going to go this route with arbitrary filenames, pass them out-of-band (as below, passing _ as $0, File1.txt as $1, and File2.txt as $2):
subprocess.call(['bash', '-c',
    '''comm -3 <(awk '{print $1}' "$1" | sort | uniq) '''
    '''        <(awk '{print $1}' "$2" | sort | uniq) '''
    ''' | grep -v '#' | tr -d "\t"''',
    '_', "File1.txt", "File2.txt"])
That said, the best-practices approach is indeed to set up the chain yourself. The below is tested with Python 3.6 (note the need for the pass_fds argument to subprocess.Popen to make the file descriptors referred to via /dev/fd/## links available):
awk_filter = '''! /#/ && !seen[$1]++ { print $1 }'''
p1 = subprocess.Popen(['awk', awk_filter],
                      stdin=open('File1.txt', 'r'),
                      stdout=subprocess.PIPE)
p2 = subprocess.Popen(['sort', '-u'],
                      stdin=p1.stdout,
                      stdout=subprocess.PIPE)
p3 = subprocess.Popen(['awk', awk_filter],
                      stdin=open('File2.txt', 'r'),
                      stdout=subprocess.PIPE)
p4 = subprocess.Popen(['sort', '-u'],
                      stdin=p3.stdout,
                      stdout=subprocess.PIPE)
p5 = subprocess.Popen(['comm', '-3',
                       ('/dev/fd/%d' % (p2.stdout.fileno(),)),
                       ('/dev/fd/%d' % (p4.stdout.fileno(),))],
                      pass_fds=(p2.stdout.fileno(), p4.stdout.fileno()),
                      stdout=subprocess.PIPE)
p6 = subprocess.Popen(['tr', '-d', '\t'],
                      stdin=p5.stdout,
                      stdout=subprocess.PIPE)
result = p6.communicate()
This is a lot more code, but (assuming that the filenames are parameterized in the real world) it's also safer code -- you aren't vulnerable to bugs like ShellShock that are triggered by the simple act of starting a shell, and don't need to worry about passing variables out-of-band to avoid injection attacks (except in the context of arguments to commands -- like awk -- that are scripting language interpreters themselves).
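As a small usage note for the chain above: p6.communicate() returns a (stdout, stderr) tuple, so the final text can be pulled out of result with something like:
stdout, _ = result        # result comes from p6.communicate() above
print(stdout.decode())    # the comm -3 output with tabs removed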
That said, another thing to think about is just implementing the whole thing in native Python.
lines_1 = set(line.split()[0] for line in open('File1.txt', 'r') if '#' not in line)
lines_2 = set(line.split()[0] for line in open('File2.txt', 'r') if '#' not in line)
not_common = (lines_1 - lines_2) | (lines_2 - lines_1)
for line in sorted(not_common):
    print(line)
Also check out plumbum; it makes life easier: http://plumbum.readthedocs.io/en/latest/ (see the Pipelining section).
This may be wrong, but you can try this:
from plumbum.cmd import grep, comm, awk, sort, uniq, sed
_c1 = awk['{print $1}', 'File1.txt'] | sort | uniq
_c2 = awk['{print $1}', 'File2.txt'] | sort | uniq
chain = comm['-3', _c1(), _c2() ] | grep['-v', '#'] | sed['s/\t//g']
chain()
Let me know if this goes wrong and I will try to fix it.
Edit: As pointed out, I missed the process substitution; I think it would have to be done explicitly by redirecting the output of the commands above to temporary files and then passing those files as arguments to comm.
So the above would now actually become:
from plumbum.cmd import grep, comm, awk, sort, uniq, sed
_c1 = awk['{print $1}', 'File1.txt'] | sort | uniq
_c2 = awk['{print $1}', 'File2.txt'] | sort | uniq
(_c1 > "/tmp/File1.txt")(), (_c2 > "/tmp/File2.txt")()
chain = comm['-3', "/tmp/File1.txt", "/tmp/File2.txt" ] | grep['-v', '#'] | sed['s/\t//g']
chain()
Alternatively, you can use the method described by @Charles, making use of mkfifo.
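For completeness, a rough sketch of that mkfifo variant (POSIX-only, Python 3; the shell strings here assume fixed, trusted filenames):
import os
import subprocess
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    fifo1 = os.path.join(tmp, 'f1')
    fifo2 = os.path.join(tmp, 'f2')
    os.mkfifo(fifo1)
    os.mkfifo(fifo2)

    # Writers: each pipeline feeds its sorted unique first column into a fifo;
    # opening the fifo for writing blocks until comm opens it for reading.
    w1 = subprocess.Popen("awk '{print $1}' File1.txt | sort -u > " + fifo1, shell=True)
    w2 = subprocess.Popen("awk '{print $1}' File2.txt | sort -u > " + fifo2, shell=True)

    # Reader: comm consumes both fifos, which unblocks the writers.
    out = subprocess.check_output(['comm', '-3', fifo1, fifo2])
    w1.wait()
    w2.wait()

print(out.decode().replace('\t', ''))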

Fastest way to "grep" big files

I have big log files (from 100MB to 2GB) that contain a (single) particular line I need to parse in a Python program. I have to parse around 20,000 files, and I know that the searched line is within the last 200 lines of each file, or within the last 15,000 bytes.
As it is a recurring task, I need it to be as fast as possible. What is the fastest way to do it?
I have thought about 4 strategies:
read the whole file in Python and search a regex (method_1)
read only the last 15,000 bytes of the file and search a regex (method_2)
make a system call to grep (method_3)
make a system call to grep after tailing the last 200 lines (method_4)
Here are the functions I created to test these strategies:
import os
import re
import subprocess

def method_1(filename):
    """Method 1: read whole file and regex"""
    regex = r'\(TEMPS CP :[ ]*.*S\)'
    with open(filename, 'r') as f:
        txt = f.read()
        match = re.search(regex, txt)
        if match:
            print match.group()

def method_2(filename):
    """Method 2: read part of the file and regex"""
    regex = r'\(TEMPS CP :[ ]*.*S\)'
    with open(filename, 'r') as f:
        size = min(15000, os.stat(filename).st_size)
        f.seek(-size, os.SEEK_END)
        txt = f.read(size)
        match = re.search(regex, txt)
        if match:
            print match.group()

def method_3(filename):
    """Method 3: grep the entire file"""
    cmd = 'grep "(TEMPS CP :" {} | head -n 1'.format(filename)
    process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
    print process.communicate()[0][:-1]

def method_4(filename):
    """Method 4: tail of the file and grep"""
    cmd = 'tail -n 200 {} | grep "(TEMPS CP :"'.format(filename)
    process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
    print process.communicate()[0][:-1]
I ran these methods on two files ("trace" is 207MB and "trace_big" is 1.9GB) and got the following computation time (in seconds):
+----------+-----------+-----------+
| | trace | trace_big |
+----------+-----------+-----------+
| method_1 | 2.89E-001 | 2.63 |
| method_2 | 5.71E-004 | 5.01E-004 |
| method_3 | 2.30E-001 | 1.97 |
| method_4 | 4.94E-003 | 5.06E-003 |
+----------+-----------+-----------+
So method_2 seems to be the fastest. But is there any other solution I did not think about?
Edit
In addition to the previous methods, Gosha F suggested a fifth method using mmap:
import contextlib
import math
import mmap

def method_5(filename):
    """Method 5: use memory mapping and regex"""
    regex = re.compile(r'\(TEMPS CP :[ ]*.*S\)')
    offset = max(0, os.stat(filename).st_size - 15000)
    ag = mmap.ALLOCATIONGRANULARITY
    offset = ag * (int(math.ceil(offset/ag)))
    with open(filename, 'r') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY, offset=offset)
        with contextlib.closing(mm) as txt:
            match = regex.search(txt)
            if match:
                print match.group()
I tested it and get the following results:
+----------+-----------+-----------+
| | trace | trace_big |
+----------+-----------+-----------+
| method_5 | 2.50E-004 | 2.71E-004 |
+----------+-----------+-----------+
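The benchmarking code itself is not shown; a minimal harness along these lines could produce such a table (a sketch, assuming the method_* functions defined above; note the methods print their matches, so the timing output is interleaved with them):
import timeit

for method in (method_1, method_2, method_3, method_4, method_5):
    for filename in ('trace', 'trace_big'):
        # average over 10 runs of each method on each file
        t = timeit.timeit(lambda: method(filename), number=10) / 10
        print('%s %s %.2e s' % (method.__name__, filename, t))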
You may also consider using memory mapping (the mmap module) like this:
import contextlib
import mmap
import os
import re

def method_5(filename):
    """Method 5: use memory mapping and regex"""
    regex = re.compile(r'\(TEMPS CP :[ ]*.*S\)')
    offset = max(0, os.stat(filename).st_size - 15000)
    offset -= offset % mmap.ALLOCATIONGRANULARITY  # mmap offsets must be a multiple of ALLOCATIONGRANULARITY
    with open(filename, 'r') as f:
        with contextlib.closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY, offset=offset)) as txt:
            match = regex.search(txt)
            if match:
                print match.group()
Also some side notes:
in the case of using a shell command, ag (the silver searcher) may in some cases be orders of magnitude faster than grep (although with only 200 lines of greppable text the difference probably vanishes compared to the overhead of starting a shell)
just compiling your regex in the beginning of the function may make some difference
Probably faster to do the processing in the shell so as to avoid the python overhead. Then you can pipe the result into a python script. Otherwise it looks like you did the fastest thing.
Seeking and then matching a regex should be very fast. Methods 2 and 4 do the same thing, but with method 4 you incur the extra overhead of Python spawning a subprocess.
Does it have to be in Python? Why not a shell script?
My guess is that method 4 will be the fastest/most efficient. That's certainly how I'd write it as a shell script, and it's got to be faster than 1 or 3. I'd still time it in comparison to method 2 to be 100% sure, though.

process a text file using various delimiters

My text file (unfortunately) looks like this...
<amar>[amar-1000#Fem$$$_Y](1){india|1000#Fem$$$,mumbai|1000#Mas$$$}
<akbar>[akbar-1000#Fem$$$_Y](1){}
<john>[-0000#$$$_N](0){USA|0100#$avi$$,NJ|0100#$avi$$}
It contains the customer name followed by some information. The sequence is...
text string followed by list, set and then dictionary
<> [] () {}
This is not a Python-compatible file, so the data cannot be read as-is. I want to process the file and extract some information:
amar 1000 | 1000 | 1000
akbar 1000
john 0000 | 0100 | 0100
1) The name between <>
2) The number between - and # in the list
3 & 4) Split the dictionary on commas and take the numbers between | and # (there can be more than 2 entries here)
I am open to using any tool best suited for this task.
The following Python script will read your text file and give you the desired results:
import re, itertools

with open("input.txt", "r") as f_input:
    for line in f_input:
        reLine = re.match(r"<(\w+)>\[(.*?)\].*?{(.*?)\}", line)
        lNumbers = [re.findall(r".*?(\d+).*?", entry) for entry in reLine.groups()[1:]]
        lNumbers = list(itertools.chain.from_iterable(lNumbers))
        print reLine.group(1), " | ".join(lNumbers)
This prints the following output:
amar 1000 | 1000 | 1000
akbar 1000
john 0000 | 0100 | 0100
As the grammar is quite complex, you might find a proper parser the best solution.
#!/usr/bin/env python
import fileinput
from pyparsing import Word, Regex, Optional, Suppress, ZeroOrMore, alphas, nums
name = Suppress('<') + Word(alphas) + Suppress('>')
reclist = Suppress('[' + Optional(Word(alphas)) + '-') + Word(nums) + Suppress(Regex("[^]]+]"))
digit = Suppress('(' + Word(nums) + ')')
dictStart = Suppress('{')
dictVals = Suppress(Word(alphas) + '|') + Word(nums) + Suppress('#' + Regex('[^,}]+') + Optional(','))
dictEnd = Suppress('}')
parser = name + reclist + digit + dictStart + ZeroOrMore(dictVals) + dictEnd
for line in fileinput.input():
    print ' | '.join(parser.parseString(line))
This solution uses the pyparsing library and running produces:
$ python parse.py file
amar | 1000 | 1000 | 1000
akbar | 1000
john | 0000 | 0100 | 0100
You can add all delimiters to the FS variable in awk and count fields, like:
awk -F'[<>#|-]' '{ print $2, $4, $6, $8 }' infile
In case you have more than two entries between curly braces, you could use a loop to traverse all fields until the last one, like:
awk -F'[<>#|-]' '{
    printf "%s %s ", $2, $4
    for (i = 6; i <= NF; i += 2) {
        printf "%s ", $i
    }
    printf "\n"
}' infile
Both commands yield the same results:
amar 1000 1000 1000
akbar 1000
john 0000 0100 0100
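The same multi-delimiter idea translates directly to Python with re.split (a sketch; 'input.txt' is assumed, and awk's field $n becomes index n-1 here):
import re

with open('input.txt') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        fields = re.split(r'[<>#|-]', line)
        # the name is awk's $2 (index 1 here); the numbers sit at indices 3, 5, 7, ...
        print(fields[1] + ' ' + ' '.join(fields[3::2]))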
You could use a regex to capture the values.
Sample:
a="<john>[-0000#$$$_N](0){USA|0100#$avi$$,NJ|0100#$avi$$}"
name=" ".join(re.findall("<(\w+)>[\s\S]+?-(\d+)#",a)[0])
others=re.findall("\|(\d+)#",a)
print name+" | "+" | ".join(others) if others else " "
output:
'john 0000 | 0100 | 0100'
Full code:
import re

with open("input.txt", "r") as inp:
    for line in inp:
        name = " ".join(re.findall(r"<(\w+)>[\s\S]+?-(\d+)#", line)[0])
        others = re.findall(r"\|(\d+)#", line)
        print name + (" | " + " | ".join(others) if others else "")
For one line of your file:
test='<amar>[amar-1000#Fem$$$_Y](1){india|1000#Fem$$$,mumbai|1000#Mas$$$}'
Replace < with an empty string and remove everything after > to get the name:
echo $test | sed -e 's/<//g' | sed -e 's/>.*//g'
Get all the 4-digit sequences:
echo $test | grep -o '[0-9]\{4\}'
Replace spaces with your favorite separator:
sed -e 's/ /|/g'
Putting it together:
echo $(echo $test | sed -e 's/<//g' | sed -e 's/>.*//g') $(echo $test | grep -o '[0-9]\{4\}') | sed -e 's/ /|/g'
This will output:
amar|1000|1000|1000
With a quick script you've got it: your_script.sh input_file output_file
#!/bin/bash
IFS=$'\n' #line delimiter
#empty your output file
cp /dev/null "$2"
for i in $(cat "$1"); do
    newline=`echo $(echo $i | sed -e 's/<//g' | sed -e 's/>.*//g') $(echo $i | grep -o '[0-9]\{4\}') | sed -e 's/ /|/g'`
    echo $newline >> "$2"
done
cat "$2"
