process a text file using various delimiters - python

My text file (unfortunately) looks like this...
<amar>[amar-1000#Fem$$$_Y](1){india|1000#Fem$$$,mumbai|1000#Mas$$$}
<akbar>[akbar-1000#Fem$$$_Y](1){}
<john>[-0000#$$$_N](0){USA|0100#$avi$$,NJ|0100#$avi$$}
It contains the customer name followed by some information. The sequence is...
a text string, followed by a list, a set and then a dictionary:
<> [] () {}
This is not a Python-compatible file, so the data cannot be loaded as-is. I want to process the file and extract some information.
amar 1000 | 1000 | 1000
akbar 1000
john 0000 | 0100 | 0100
1) the name between <>
2) the number between - and # in the list
3 & 4) split the dictionary on commas and take the numbers between | and # (there can be more than 2 entries here)
I am open to using any tool best suited for this task.

The following Python script will read your text file and give you the desired results:
import re, itertools

with open("input.txt", "r") as f_input:
    for line in f_input:
        reLine = re.match(r"<(\w+)>\[(.*?)\].*?{(.*?)\}", line)
        lNumbers = [re.findall(r".*?(\d+).*?", entry) for entry in reLine.groups()[1:]]
        lNumbers = list(itertools.chain.from_iterable(lNumbers))
        print(reLine.group(1), " | ".join(lNumbers))
This prints the following output:
amar 1000 | 1000 | 1000
akbar 1000
john 0000 | 0100 | 0100
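To see what the two regexes capture, here is the intermediate data for the first sample line (an interpreter sketch using the answer's own patterns):
>>> import re
>>> line = '<amar>[amar-1000#Fem$$$_Y](1){india|1000#Fem$$$,mumbai|1000#Mas$$$}'
>>> m = re.match(r"<(\w+)>\[(.*?)\].*?{(.*?)\}", line)
>>> m.groups()
('amar', 'amar-1000#Fem$$$_Y', 'india|1000#Fem$$$,mumbai|1000#Mas$$$')
>>> [re.findall(r".*?(\d+).*?", g) for g in m.groups()[1:]]
[['1000'], ['1000', '1000']]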

As the grammar is quite complex, you might find a proper parser the best solution.
#!/usr/bin/env python
import fileinput
from pyparsing import Word, Regex, Optional, Suppress, ZeroOrMore, alphas, nums

name = Suppress('<') + Word(alphas) + Suppress('>')
reclist = Suppress('[' + Optional(Word(alphas)) + '-') + Word(nums) + Suppress(Regex("[^]]+]"))
digit = Suppress('(' + Word(nums) + ')')
dictStart = Suppress('{')
dictVals = Suppress(Word(alphas) + '|') + Word(nums) + Suppress('#' + Regex('[^,}]+') + Optional(','))
dictEnd = Suppress('}')
parser = name + reclist + digit + dictStart + ZeroOrMore(dictVals) + dictEnd

for line in fileinput.input():
    print(' | '.join(parser.parseString(line)))
This solution uses the third-party pyparsing library (pip install pyparsing); running it produces:
$ python parse.py file
amar | 1000 | 1000 | 1000
akbar | 1000
john | 0000 | 0100 | 0100
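For reference, parseString returns a ParseResults object; for the first sample line it flattens to a plain list like this (a sketch assuming pyparsing's classic parseString/asList API):
>>> parser.parseString('<amar>[amar-1000#Fem$$$_Y](1){india|1000#Fem$$$,mumbai|1000#Mas$$$}').asList()
['amar', '1000', '1000', '1000']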

You can put all the delimiters in awk's FS variable and pick out the fields you need by number, like:
awk -F'[<>#|-]' '{ print $2, $4, $6, $8 }' infile
In case you have more than two entries between curly braces, you could use a loop to traverse all fields until the last one, like:
awk -F'[<>#|-]' '{
    printf "%s %s ", $2, $4
    for (i = 6; i <= NF; i += 2) {
        printf "%s ", $i
    }
    printf "\n"
}' infile
Both commands yield the same results:
amar 1000 1000 1000
akbar 1000
john 0000 0100 0100
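To see why $2, $4 and then every second field hold the values, here is the same split done in Python (a sketch; re.split keeps the empty field before the leading <, so list index i lines up with awk's field i+1):
import re

line = '<amar>[amar-1000#Fem$$$_Y](1){india|1000#Fem$$$,mumbai|1000#Mas$$$}'
fields = re.split(r'[<>#|-]', line)
# fields[0] == ''       -> awk $1 (empty, before the leading '<')
# fields[1] == 'amar'   -> $2
# fields[3] == '1000'   -> $4
# fields[5], fields[7]  -> $6, $8: the dictionary values alternate with junk
print(fields[1], fields[3], *fields[5::2])   # amar 1000 1000 1000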

You could use regexes to capture the values.
Sample:
import re
a = "<john>[-0000#$$$_N](0){USA|0100#$avi$$,NJ|0100#$avi$$}"
name = " ".join(re.findall(r"<(\w+)>[\s\S]+?-(\d+)#", a)[0])
others = re.findall(r"\|(\d+)#", a)
print(name + (" | " + " | ".join(others) if others else ""))
output:
john 0000 | 0100 | 0100
Full code:
import re

with open("input.txt", "r") as inp:
    for line in inp:
        name = " ".join(re.findall(r"<(\w+)>[\s\S]+?-(\d+)#", line)[0])
        others = re.findall(r"\|(\d+)#", line)
        print(name + (" | " + " | ".join(others) if others else ""))

For one line of your file:
test='<amar>[amar-1000#Fem$$$_Y](1){india|1000#Fem$$$,mumbai|1000#Mas$$$}'
Replace < with nothing and remove everything after > to get the name:
echo $test | sed -e 's/<//g' | sed -e 's/>.*//g'
Get every run of 4 digits:
echo $test | grep -o '[0-9]\{4\}'
Replace spaces with your favorite separator:
sed -e 's/ /|/g'
Putting it all together:
echo $(echo $test | sed -e 's/<//g' | sed -e 's/>.*//g') $(echo $test | grep -o '[0-9]\{4\}') | sed -e 's/ /|/g'
This outputs:
amar|1000|1000|1000
With a quick script you have it: your_script.sh input_file output_file
#!/bin/bash
IFS=$'\n' # line delimiter
# empty the output file
cp /dev/null "$2"
for i in $(cat "$1"); do
    newline=$(echo $(echo $i | sed -e 's/<//g' | sed -e 's/>.*//g') $(echo $i | grep -o '[0-9]\{4\}') | sed -e 's/ /|/g')
    echo $newline >> "$2"
done
cat "$2"

Related

Compare strings from a txt with bash or python ignoring pattern

I want to search a text file for duplicate lines, excluding [p] and the extension from the comparison. Once the equal lines are identified, show only the line that does not contain [p], with its extension intact. I have these lines in test.txt:
Peliculas/Desperados (2020)[p].mp4
Peliculas/La Duquesa (2008)[p].mp4
Peliculas/Nueva York Año 2012 (1975).mkv
Peliculas/Acoso en la noche (1980) .mkv
Peliculas/Angustia a Flor de Piel (1982).mkv
Peliculas/Desperados (2020).mkv
Peliculas/Angustia (1947).mkv
Peliculas/Días de radio (1987) BR1080[p].mp4
Peliculas/Mona Lisa (1986) BR1080[p].mp4
Peliculas/La decente (1970) FlixOle WEB-DL 1080p [Buzz][p].mp4
Peliculas/Mona Lisa (1986) BR1080.mkv
In this file, lines 1 & 6 and lines 9 & 11 are the same (without the extension and [p]). Output needed:
Peliculas/Desperados (2020).mkv
Peliculas/Mona Lisa (1986) BR1080.mkv
I tried this, but it only shows the matching lines with the extension and the [p] pattern deleted; I don't know which line is the right one, and I need the complete original line:
sed 's/\[p\]//' ./test.txt | sed 's/\.[^.]*$//' | sort | uniq -d
Error output (missing extension):
Peliculas/Desperados (2020)
Peliculas/Mona Lisa (1986) BR1080
Because you mentioned bash... (note: the examples below run against a different, simplified test.txt)
Remove any line with a p:
cat test.txt | grep -v p
home/folder/house from earth.mkv
home/folder3/window 1.avi
Remove any line with [p]:
cat test.txt | grep -v '\[p\]'
home/folder/house from earth.mkv
home/folder3/window 1.avi
home/folder4/little mouse.mpg
Probably not what you need, but just because: remove [p] from every line, then dedupe:
cat test.txt | sed 's/\[p\]//g' | sort | uniq
home/folder/house from earth.mkv
home/folder/house from earth.mp4
home/folder2/test.mp4
home/folder3/window 1.avi
home/folder3/window 1.mp4
home/folder4/little mouse.mpg
If a 2-pass solution (which reads the test.txt file twice) is acceptable, would you please try:
declare -A ary                          # associates the filename with the base
while IFS= read -r file; do
    if [[ $file != *\[p\]* ]]; then     # the filename does not include "[p]"
        base="${file%.*}"               # remove the extension
        ary[$base]="$file"              # create a map
    fi
done < test.txt

while IFS= read -r base; do
    echo "${ary[$base]}"
done < <(sed 's/\[p\]//' ./test.txt | sed 's/\.[^.]*$//' | sort | uniq -d)
Output:
Peliculas/Desperados (2020).mkv
Peliculas/Mona Lisa (1986) BR1080.mkv
In the 1st pass, it reads the file line by line to create a map which associates the filename (with extension) with the base (without extension).
In the 2nd pass, it replaces the output of the sed|sort|uniq pipeline (the base) with the mapped filename.
If you prefer a 1-pass solution (which will be faster), please try:
declare -A ary      # associates the filename with the base
declare -A count    # counts the occurrences of the base
while IFS= read -r file; do
    base="${file%.*}"                   # remove the extension
    if [[ $base =~ (.*)\[p\](.*) ]]; then
        # "$base" contains the substring "[p]"
        (( count[${BASH_REMATCH[1]}${BASH_REMATCH[2]}]++ ))
        # increment the counter
    else
        (( count[$base]++ ))            # increment the counter
        ary[$base]="$file"              # map the filename
    fi
done < test.txt

for base in "${!ary[@]}"; do            # loop over the keys of ary
    if (( count[$base] > 1 )); then
        # it is duplicated
        echo "${ary[$base]}"
    fi
done
In Python, you can use itertools.groupby with a key function that returns the filename without any [p] and with the extension removed. Since groupby only groups adjacent items, the lines are sorted by that key first.
For any group of size 2 or more, the filenames not containing '[p]' are printed.
import itertools
import re

def make_key(line):
    return re.sub(r'\.[^.]*$', '', line.replace('[p]', ''))

with open('test.txt') as f:
    lines = [line.strip() for line in f]

# groupby only groups adjacent items, so sort by the key first
lines.sort(key=make_key)
for key, group in itertools.groupby(lines, make_key):
    files = list(group)
    if len(files) > 1:
        for file in files:
            if '[p]' not in file:
                print(file)
With the simplified test.txt used in the earlier answer, this gives:
home/folder/house from earth.mkv
home/folder3/window 1.avi
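If the duplicates can sit far apart and you would rather not sort, a dictionary keyed by the same make_key does the grouping in one pass (a sketch using collections.defaultdict, reading the same test.txt):
import re
from collections import defaultdict

def make_key(line):
    # strip '[p]' and the extension, as above
    return re.sub(r'\.[^.]*$', '', line.replace('[p]', ''))

groups = defaultdict(list)
with open('test.txt') as f:
    for line in f:
        groups[make_key(line.strip())].append(line.strip())

for files in groups.values():
    if len(files) > 1:                  # only groups with duplicates
        for file in files:
            if '[p]' not in file:       # print the line without '[p]'
                print(file)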

grep piping python equivalent

I use this bash command to catch a string in a text file:
cat a.txt | grep 'a' | grep 'b' | grep 'c' | cut -d" " -f1
How can I implement this in Python? I don't want to call os commands because it should be a cross-platform script.
You may try this,
with open(file) as f:                                # open the file
    for line in f:                                   # iterate over the lines
        if all(i in line for i in ('a', 'b', 'c')):  # check if the line contains all of a, b, c
            print(line.split(" ")[0])                # if yes, split on space and print the first value
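A quick check in the interpreter with made-up lines (hypothetical data, just to show the filter):
>>> lines = ["abc one", "ab two", "bca three"]
>>> [l.split(" ")[0] for l in lines if all(i in l for i in ('a', 'b', 'c'))]
['abc', 'bca']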
You can always use the os library to do a system call:
import os
bashcmd = "cat a.txt | grep 'a' | grep 'b' | grep 'c' | cut -d' ' -f1"
print(os.system(bashcmd))
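Note that os.system only returns the exit status; the pipeline's output goes straight to the terminal. To capture the output as a string, subprocess can run the same pipeline (a sketch; like the os.system call, it still needs a shell with grep and cut, so it is not cross-platform):
import subprocess

bashcmd = "cat a.txt | grep 'a' | grep 'b' | grep 'c' | cut -d' ' -f1"
# check_output runs the pipeline through the shell and returns its stdout
out = subprocess.check_output(bashcmd, shell=True, text=True)
print(out, end='')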

How to remove whitespace from file and extract the corresponding indices from another file? - bash

I have three files:
file1.txt:
XYZ与ABC
DFC什么
FBFBBBFde
warlaugh世界
file2.txt:
XYZ 与 ABC
warlaugh 世界
file3.txt:
XYZ with abc
DFC whatever
FBFBBBF
world of warlaugh
file2.txt is a processed version of file1.txt, with spaces added. The lines of file1.txt align with those of file3.txt, e.g. XYZ与ABC <-> XYZ with abc.
The processing threw away some lines from file2.txt for some reason, but what matters is retrieving the corresponding lines from file3.txt after processing.
How could I check for which lines have been removed in file2.txt and then produce a file4.txt that looks like this:
file4.txt:
XYZ with abc
world of warlaugh
I could do it with python but I'm sure there's a simple way with sed/awk or bash tricks:
with open('file1.txt', 'r') as file1, open('file2.txt') as file2, open('file3.txt', 'r') as file3:
    file2_nospace = [i.replace(' ', '') for i in file2.readlines()]
    file2_indices = [i for i, j in enumerate(file1.readlines()) if j in file2_nospace]
    file4 = [j.rstrip('\n') for i, j in enumerate(file3.readlines()) if i in file2_indices]
open('file4.txt', 'w').write('\n'.join(file4))
How can I create file4.txt with sed/awk/grep or bash tricks?
First remove the spaces in file2.txt to make its lines look like file1.txt's:
sed 's/ //g' file2.txt
Then use that as patterns to match against file1.txt. Do this with grep -f, and use -n to see the line numbers of file1.txt that match the patterns constructed from file2.txt:
$ grep -nf <(sed 's/ //g' file2.txt) file1.txt
1:XYZ与ABC
4:warlaugh世界
Now remove everything after each : to make new patterns that match the numbered lines of file3.txt:
$ grep -nf <(sed 's/ //g' file2.txt) file1.txt | sed 's/:.*/:/'
1:
4:
To add a line number to each line of file3.txt, use this:
$ nl -s':' file3.txt | sed -r 's/^ +//'
1:XYZ with abc
2:DFC whatever
3:FBFBBBF
4:world of warlaugh
Now you can use the first output as patterns to match against the second:
$ grep -f <(grep -nf <(sed 's/ //g' file2.txt) file1.txt | sed 's/:.*/:/') <(nl -s':' file3.txt | sed -r 's/^ +//')
1:XYZ with abc
4:world of warlaugh
And to remove the leading line numbers, simply use cut:
$ grep -f <(grep -nf <(sed 's/ //g' file2.txt) file1.txt | sed 's/:.*/:/') <(nl -s':' file3.txt | sed -r 's/^ +//') | cut -d':' -f2
XYZ with abc
world of warlaugh
Finally, save the result to file4.txt:
$ grep -f <(grep -nf <(sed 's/ //g' file2.txt) file1.txt | sed 's/:.*/:/') <(nl -s':' file3.txt | sed -r 's/^ +//') | cut -d':' -f2 > file4.txt
You can do it similarly in a single call to awk:
awk 'FILENAME ~ /file2.txt/ { gsub(/ /, ""); a[$0]; next }
FILENAME ~ /file1.txt/ && $0 in a { b[FNR]; next }
FILENAME ~ /file3.txt/ && FNR in b { print }' file2.txt file1.txt file3.txt
You can also use two awks to avoid using the FILENAME variable:
awk 'FNR==NR { gsub(/ /, ""); a[$0]; next }
$0 in a { print FNR }' file2.txt file1.txt |
awk 'FNR==NR { a[$0]; next } FNR in a { print }' - file3.txt
Append > file4.txt to either command to write the output to file4.txt.
Basically it:
takes file2.txt and stores its lines in an associative array after stripping spaces,
records which line numbers of file1.txt match that array, storing them in a second associative array keyed by line number,
and prints each line of file3.txt whose line number is in the second array.
Loop through the original file, and look for the corresponding line in file2.
When the lines match, print the corresponding line from file3.
linenr=0
filternr=1
for line in $(cat file1.txt); do
    (( linenr = linenr + 1 ))
    line2=$(sed -n ${filternr}p file2.txt | cut -d" " -f1)
    if [[ "${line}" = ${line2}* ]]; then
        (( filternr = filternr + 1 ))
        sed -n ${linenr}p file3.txt
    fi
done > file4.txt
When the files are large (more precisely, when the number of lines in file2 is large), you will want to change this solution so that sed does not have to run through file2 and file3 on every iteration. The result will be less simple to write/understand/maintain...
Reading each file only once can be done with diff and redirection of stdin. This solution only works when you are sure the lines do not contain a '|' character:
#!/bin/bash
function mycheck {
    if [ -z "${filteredline}" ]; then
        exec 0<file2.txt
        read filteredline
    fi
    line2=${filteredline%% *}
    if [[ "${line}" = ${line2}* ]]; then
        echo ${line} | sed 's/.*|\t//'
        read filteredline
        if [ -z "${filteredline}" ]; then
            return
        fi
    fi
}

IFS="
"
for line in $(diff -y file1.txt file3.txt); do
    mycheck "${line}"
done > file4.txt

Append values into key pair from two lines

My text file output looks like this, on two lines:
DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|rca
My desired output is to look like this:
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
I'm having no luck trying this:
sed '$!N;s/|/\n/' foo
Any advice would be welcomed, thank you.
As you have just two lines, this can be a way:
$ paste -d' ' <(head -1 file | sed 's/|/\n/g') <(tail -1 file | sed 's/|/\n/g')
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
Piece by piece: let's get the first line and replace every pipe | with a newline:
$ head -1 file | sed 's/|/\n/g'
DelayTimeThreshold
MaxDelayPerMinute
Name
And do the same with the last line:
$ tail -1 file | sed 's/|/\n/g'
10000
5
rca
Then it is just a matter of pasting both results with a space as delimiter:
paste -d' ' output1 output2
This awk one-liner would work for your requirement:
awk -F'|' '!f{gsub(/\||$/," %s\n");f=$0;next}{printf f,$1,$2,$3}' file
output:
kent$ echo "DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|rca"|awk -F'|' '!f{gsub(/\||$/," %s\n");f=$0;next}{printf f,$1,$2,$3}'
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
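The trick is that the header line itself becomes a printf format string: every | and the end of the line are replaced with " %s\n". The same idea in Python (a sketch):
import re

header = "DelayTimeThreshold|MaxDelayPerMinute|Name"
values = "10000|5|rca"

# as the awk gsub does: '|' and end-of-line become ' %s\n'
fmt = re.sub(r'\||$', ' %s\n', header)
# fmt == 'DelayTimeThreshold %s\nMaxDelayPerMinute %s\nName %s\n'
print(fmt % tuple(values.split('|')), end='')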
Using the Array::Transpose module:
perl -MArray::Transpose -F'\|' -lane '
push @a, [@F]
} END {print for map {join " ", @$_} transpose(\@a)
' <<END
DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|rca
END
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
As a perl script:
#!/usr/bin/perl -w
use strict;
use Data::Dumper;

my $file1 = 'DelayTimeThreshold|MaxDelayPerMinute|Name';
my $file2 = '10000|5|rca';
my @file1 = split('\|', $file1);
my @file2 = split('\|', $file2);
my %hash;
@hash{@file1} = @file2;     # hash slice: pair names with values
print Dumper \%hash;
Output:
$VAR1 = {
          'Name' => 'rca',
          'DelayTimeThreshold' => '10000',
          'MaxDelayPerMinute' => '5'
        };
OR:
for (my $i = 0; $i <= $#file1; $i++) {
    print "$file1[$i] $file2[$i]\n";
}
Output:
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
Suppose you have a file that contains a single header row with column names, followed by multiple detail rows with column values, for example,
DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|abc
20001|6|def
30002|7|ghk
40003|8|jkl
50004|9|mnp
The following code would print that file, using the names from the first row, paired with values from each subsequent (detail) row,
#!/usr/bin/perl -w
use strict;

my ($fn,$fh) = ("header.csv");   # whatever the file is named...
open($fh,"< $fn") || die "cannot open $fn";
my ($count,$line,@names,@vals) = (0);
while(<$fh>)
{
    chomp $_;
    @vals = split(/\|/,$_);
    if($count++ < 1) { @names = @vals; next; }    # first line holds the names
    for (my $ndx = 0; $ndx <= $#names; $ndx++) {  # print each name/value pair
        print "$names[$ndx] $vals[$ndx]\n";
    }
}
Suppose you want to keep around each row, annotated with names, in an array,
my @records;
while(<$fh>)
{
    chomp $_;
    @vals = split(/\|/,$_);
    if($count++ < 1) { @names = @vals; next; }
    my %row;                  # a fresh hash per record, so each ref is distinct
    @row{@names} = @vals;     # hash slice: pair names with values
    push(@records,\%row);
}
Maybe you want to refer to the rows by some key column,
my %records;
while(<$fh>)
{
    chomp $_;
    @vals = split(/\|/,$_);
    if($count++ < 1) { @names = @vals; next; }
    my %row;                  # fresh hash per record
    @row{@names} = @vals;
    $records{$vals[0]} = \%row;
}
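For comparison, the whole transpose is a one-zip job in Python (a sketch assuming exactly one header line followed by one value line):
with open('file') as f:
    names, vals = (line.rstrip('\n').split('|') for line in f)

for name, val in zip(names, vals):
    print(name, val)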

How to cut out the Values from a list of Key-Value pairs

I have a file with multiple KV pairs.
Input:
$ cat input.txt
k1:v1 k2:v2 k3:v3
...
I am only interested in the values. The keys (names) are just there to remind me what each value means. Essentially I am looking to cut the keys out so that I can plot the value columns.
Output:
$ ...
v1 v2 v3
Is there a one-liner bash command that can help me achieve this?
UPDATE
This is how I am currently doing it (looks ugly)
>> cat input.txt | python -c "import sys; \
lines = sys.stdin.readlines(); \
values = [[i.split(':')[1] for i in item] for item in \
[line.split() for line in lines]]; \
import os; [os.system('echo %s'%v) for v in \
['\t'.join(value) for value in values]]" > output.txt
Is this OK for you?
sed -r 's/\w+://g' yourfile
test:
kent$ echo "k1:v1 k2:v2 k3:v3"|sed -r 's/\w+://g'
v1 v2 v3
Update
Well, if your key contains characters like "-" etc., see below:
kent$ echo "k1#-$%-^=:v1 k2:v2 k3:v3"|sed -r 's/[^ ]+://g'
v1 v2 v3
awk -v FS=':' -v RS=' ' -v ORS=' ' '{print $2}' foo.txt
http://www.gnu.org/software/gawk/manual/gawk.html#Field-Separators
I see sed, awk and python, so here's plain bash:
while IFS=' ' read -a kv ; do printf '%s ' "${kv[@]#*:}" ; done < input.txt
Just for good measure, here's a perl version:
perl -n -e 'print(join(" ",values %{{@{[split(/[:\s]/,$_)]}}})," ")' < input.txt
The order of the values changes, though, so it's probably not going to be what you want.
Solution with awk:
awk '{split($0,p," "); for(kv in p) {split(p[kv],a,":"); printf "%s ",a[2];} print ""}' foo.txt
Try this
Input.txt
k1:v1 k2:v2 k3:v3
Code
awk -F " " '{for( i =1 ; i<=NF ;i+=1) print $i}' Input.txt | cut -d ":" -f 2 | tr '\n' ' '
Output
v1 v2 v3
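For completeness, the asker's Python pipeline can be written without os.system (a minimal sketch reading the same input.txt):
with open('input.txt') as f:
    for line in f:
        # keep the part after the first ':' of each whitespace-separated k:v pair
        print(' '.join(pair.split(':', 1)[1] for pair in line.split()))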
