Pattern Matching in a csv file and appending to matched lines

Pattern Matching in a csv file and appending to matched lines - python

I want to extract those lines a csv file which match a Pattern and then append the same Pattern to the end of each extracted line as a newly added column of the csv file.
file.csv
file.csv
/var/log/0,33,New file,0
/var/log/0,34,Size increased,2345
/abc/Repli,11,New file,0
/abc/Repli,87,Size Increase,11
In above file file.csv, I executed
sed -n -i"" '/Repli/ s/$/,Repli/p' file.csv
This deletes remaining lines, which I do not want.

Extracting only the lines that match a pattern and modifying them
To select only lines containing pattern and then add pattern as a new column at the end of the line:
awk '/pattern/ {print $0 ",pattern"}' file.csv >tmp$$ && mv tmp$$ file.csv
Or,
sed -b -n -i"" '/pattern/ s/$/,pattern/p' file.csv
Keeping all lines but modifying those that match a pattern
awk '/pattern/ {$0=$0 ",pattern"} 1'
Or,
sed -b -i"" '/pattern/ s/$/,pattern/' file.csv
Remove Windows line endings while keeping all lines and modifying those that match a pattern
sed -i"" 's/\r//; /pattern/ s/$/,pattern/' file.csv
Remove Windows line endings while keeping all lines and modifying those that match a pattern containing slashes
Suppose that the pattern contains slashes like /var/log/abc/file/0/. Then:
sed -i"" 's/\r//; \|pattern| s|$|,pattern|' file.csv
For example:
sed -i"" 's/\r//; \|/var/log/abc/file/0/| s|$|,/var/log/abc/file/0/|' file.csv

I found a solution to match paths using sed. I did it through escap character and it worked.
Pattern="\/var\/log\/Model\/1\/"
Module=BE
sudo sed -i"" "s/\r//; /$Pattern/ s/$/,$Module/" resultFile.csv
Worked Fine!!

Below snippet appends the pattern in case if a match is found. If no match found just prints the line,
awk '{if($0 ~ /pattern/) print $0",pattern"; else print $0;}' file.csv

Related

How can I print a grep result with the matched word?

I have two files, "file A":
Adygei
Albanian
Armenia_C
Armenia_Caucasus
Armenia_EBA
Armenia_LBA
Armenia_MBA
Armenian.DG
Austria_EN_HG_LBK
Austria_EN_LBK
And "fileB":
HG01880.SG Aygei_o1.SG
HG01988.SG Adygei_o2.SG
HG02419.SG Albanian_o2.SG
HG01879.SG Albanian.SG
HG01882.SG Armenia_C.SG
HG01883.SG Armenia_C.SG
HG01885.SG Armenia_EBA.SG
HG01886.SG Armenia_EBA.SG
HG01889.SG Armenia_LBA.SG
HG01890.SG Armenia_MBA.SG
What I want at the end is create a new columne (doesn't matter the position of the column) with the grep word with the word that matched. Like This:
HG01880.SG Aygei_o1.SG Adygei
HG01988.SG Adygei_o2.SG Adygei
HG02419.SG Albanian_o2.SG Albanian
HG01879.SG Albanian.SG Albanian
HG01882.SG Armenia_C.SG Armenia_C
HG01883.SG Armenia_C.SG Armenia_C
HG01885.SG Armenia_EBA.SG Armenia_EBA
HG01886.SG Armenia_EBA.SG Armenia_EBA
HG01889.SG Armenia_LBA.SG Armenia_LBA
HG01890.SG Armenia_MBA.SG Armenia_MBA
What I used to match both files in bash is grep -wFf fileA fileB > newfileA_B.txt. This can be both in python or bash

You can try something like that:
for line in $(cat fileA.txt)
do
echo "$line $(grep $line fileB.txt)"
done

Here is an example (probably inefficient) algorithm in Python (using strings instead of files) https://colab.research.google.com/drive/1bUnFXJg0m6FvXRkPybUqWux_reaJRt1c?usp=sharing

Perform the grep search once more but this time adding the flag -o which only lists the matching words. Then use paste to add it as column (define the delimiter using the -d flag).
paste -d ' ' <(grep -wFf fileA fileB) <(grep -woFf fileA fileB)

vim and wc give different line counts

I have two csv files that give different results when I use wc -l (gives 65 lines for the first, 66 for the second) and when I use vim file.csv and then :$ to go to the bottom of the file (66 lines for both). I have tried viewing newline characters in vim using :set list and they look identical.
I have created the second (which shows one extra line with wc) was created from the first using pandas in Python and to_csv.
Is there anything within pandas that might generate new lines or other bash/vim tools I can use to verify the differences?

If the last character of the file is not a newline, wc won't count the last line:
$ printf 'a\nb\nc' | wc -l
2
In fact, that's how wc -l is documented to work: from man wc
-l, --lines
print the newline counts
^^^^^^^^^^^^^

find and replace after the second column

I have the following lines
92520536843;Sof_voya_Faible_Email_am;EMAIL;28/01/2015;1;0;0;Sof_voya_Faible_Email_am;30/01/2015;Sof_voya_Faible_Email_Relance_am
92515196529;Sof_trav_Fort_Email_pm_%yyyy%mm%dd%;EMAIL;05/02/2015;1;0;0;Sof_trav_Fort_Email_pm_%yyyy%mm%dd%;09/02/2015;Export Trav_Fort Postal
I'm trying to replace strings like Sof_ or _%yyyy%mm%dd% after the 7th field.
I thought about using sed
sed -i 's/<string_to_look_for>/<string_to_replace>/7g' filename
But it is only changing the field delimiter.
I thought about using this
awk -F";" '{ for (i=7; i<=NF; i++) print $i }' filename
but I don't know how to insert a search and replace for the strings I want to replace.
Any help is welcomed.
edit : expected outcome after replacing strings like Sof_ or _%yyyy%mm%dd% after the 7th column.
92520536843;Sof_voya_Faible_Email_am;EMAIL;28/01/2015;1;0;0;voya_Faible_Email_am;30/01/2015;voya_Faible_Email_Relance_am
92515196529;Sof_trav_Fort_Email_pm_%yyyy%mm%dd%;EMAIL;05/02/2015;1;0;0;trav_Fort_Email_pm;09/02/2015;Export Trav_Fort Postal
to Python and Perl gurus, as i'm trying to ramp up my knowledge in these languages, your helps are welcomed:)

You can use this awk:
awk 'BEGIN{FS=OFS=";"} {for (i=7;i<=NF;i++) gsub(/Sof_|_%yyyy%mm%dd%/, "", $i) } 1' file
92520536843;Sof_voya_Faible_Email_am;EMAIL;28/01/2015;1;0;0;voya_Faible_Email_am;30/01/2015;voya_Faible_Email_Relance_am
92515196529;Sof_trav_Fort_Email_pm_%yyyy%mm%dd%;EMAIL;05/02/2015;1;0;0;trav_Fort_Email_pm;09/02/2015;Export Trav_Fort Postal

Through python3.
#!/usr/bin/python3
import sys
fil = sys.argv[1]
with open(fil) as f:
for line in f:
part1 = ';'.join(line.split(';')[:7])
part2 = ';'.join(line.split(';')[7:]).replace('Sof_','').replace('_%yyyy%mm%dd%', '')
print(part1+';'+part2, end="")
save the above text in a file say script.py and then run it by,
python3 script.py inputfile
Through Perl.
$ perl -pe 's/^(?:[^;]*;){7}(*SKIP)(*F)|(?:_%yyyy%mm%dd%|Sof_)//g' file
92520536843;Sof_voya_Faible_Email_am;EMAIL;28/01/2015;1;0;0;voya_Faible_Email_am;30/01/2015;voya_Faible_Email_Relance_am
92515196529;Sof_trav_Fort_Email_pm_%yyyy%mm%dd%;EMAIL;05/02/2015;1;0;0;trav_Fort_Email_pm;09/02/2015;Export Trav_Fort Postal

In Python you would use the re and csv modules to do this:
import re
import csv
with open(fn) as fin:
r=csv.reader(fin, delimiter=';')
for line in r:
result=line[:7]
for field in line[:7]:
if re.search(r'Sof_', field):
field=re.sub(r'Sof_', 'repalcaement for Sof_', field)
if re.search(r'_%yyyy%mm%dd%', field):
field=re.sub(r'Sof_', 'repalcaement for _%yyyy%mm%dd%', field)
result.append(field)
print result

This might work for you (GNU sed):
sed -r ':a;s/^(([^;]*;){7}.*)(Sof_|_%yyyy%mm%dd%)/\1/;ta' file
This stores the first seven fields and following strings (that do not match the required strings) in the first backreference, then replaces the required strings by the said backreference.

Assuming you want the while line from the input file, and note: this starts with field #7. Your data exists earlier in each line.
awk -F";" '{ for (i=7; i<=NF; i++)
{gsub(/Sof_/,"newstring", ($i) } ;
print $0} ' filename
will replace Sof_ with "newstring". I'm not positive this is what you are looking for.
Correct syntax error - removed erratn ' character - thanks

Here is another way using perl's -F -a and autosplit:
perl -F";" -anE 'for ( #F[7..$#F] ) { $_ =~ s/Sof_|_%yyyy%mm%dd%//g }
print join ";", #F;' file.txt
This grabs elements 7 to last ($#F) of the autocreated #F array and removes/substitutes the text.

Compare 2 files and remove any lines in file2 when they match values found in file1

I have two files. i am trying to remove any lines in file2 when they match values found in file1. One file has a listing like so:
File1
ZNI008
ZNI009
ZNI010
ZNI011
ZNI012
... over 19463 lines
The second file includes lines that match the items listed in first:
File2
copy /Y \\server\foldername\version\20050001_ZNI008_162635.xml \\server\foldername\version\folder\
copy /Y \\server\foldername\version\20050001_ZNI010_162635.xml \\server\foldername\version\folder\
copy /Y \\server\foldername\version\20050001_ZNI012_162635.xml \\server\foldername\version\folder\
copy /Y \\server\foldername\version\20050001_ZNI009_162635.xml \\server\foldername\version\folder\
... continues listing until line 51360
What I've tried so far:
grep -v -i -f file1.txt file2.txt > f3.txt
does not produce any output to f3.txt or remove any lines. I verified by running
wc -l file2.txt
and the result is
51360 file2.txt
I believe the reason is that there are no exact matches. When I run the following it shows nothing
comm -1 -2 file1.txt file2.txt
Running
( tr '\0' '\n' < file1.txt; tr '\0' '\n' < file2.txt ) | sort | uniq -c | egrep -v '^ +1'
shows only one match, even though I can clearly see there is more than one match.
Alternatively putting all the data into one file and running the following:
grep -Ev "$(cat file1.txt)" 1>LinesRemoved.log
says argument has too many lines to process.
I need to remove lines matching the items in file1 from file2.
i am also trying this in python:
`
#!/usr/bin/python
s = set()
# load each line of file1 into memory as elements of a set, 's'
f1 = open("file1.txt", "r")
for line in f1:
s.add(line.strip())
f1.close()
# open file2 and split each line on "_" separator,
# second field contains the value ZNIxxx
f2 = open("file2.txt", "r")
for line in f2:
if line[0:4] == "copy":
fields = line.split("_")
# check if the field exists in the set 's'
if fields[1] not in s:
match = line
else:
match = 0
else:
if match:
print match, line,
`
it is not working well.. as im getting
'Traceback (most recent call last):
File "./test.py", line 14, in ?
if fields[1] not in s:
IndexError: list index out of range'

What about:
grep -F -v -f file1 file2 > file3

I like the grep solution from byrondrossos better, but here's another option:
sed $(awk '{printf("-e /%s/d ", $1)}' file1) file2 > file3

this is using Bash and GNU sed because of the -i switch
cp file2 file3
while read -r; do
sed -i "/$REPLY/d" file3
done < file1
there is surely a better way but here's a hack around -i :D
cp file2 file3
while read -r; do
(rm file3; sed "/$REPLY/d" > file3) < file3
done < file1
this exploits shell evaluation order
alright, I guess the correct way with this idea is using ed. This should be POSIX too.
cp file2 file3
while read -r line; do
ed file3 <<EOF
/$line/d
wq
EOF
done < file1
in any case, grep seems to do be the right tool for the job.
#byrondrossos answer should work for you well ;)

This is admittedly ugly but it does work. However, the path must be the same for all of the (except of course the ZNI### portion). All but the ZNI### of the path is removed so the command grep -vf can run correctly on the sorted files.
First Convert "testfile2" to "testfileconverted" to just show the ZNI###
cat /testfile2 | sed 's:^.*_ZNI:ZNI:g' | sed 's:_.*::g' > /testfileconverted
Second use inverse grep of the converted file compared to the "testfile1" and add the reformatted output to "testfile3"
bash -c 'grep -vf <(sort /testfileconverted) <(sort /testfile1)' | sed "s:^:\copy /Y \\\|server\\\foldername\\\version\\\20050001_:g" | sed "s:$:_162635\.xml \\\|server\\\foldername\\\version\\\folder\\\:g" | sed "s:|:\\\:g" > /testfile3

Bash or python for changing spacing in files

I have a set of 10000 files. In all of them, the second line, looks like:
AAA 3.429 3.84
so there is just one space (requirement) between AAA and the two other columns. The rest of lines on each file are completely different and correspond to 10 columns of numbers.
Randomly, in around 20% of the files, and due to some errors, one gets
BBB 3.429 3.84
so now there are two spaces between the first and second column.
This is a big error so I need to fix it, changing from 2 to 1 space in the files where the error takes place.
The first approach I thought of was to write a bash script that for each file reads the 3 values of the second line and then prints them with just one space, doing it for all the files.
I wonder what do oyu think about this approach and if you could suggest something better, bashm python or someother approach.
Thanks

Performing line-based changes to text files is often simplest to do in sed.
sed -e '2s/ */ /g' infile.txt
will replace any runs of multiple spaces with a single space. This may be changing more than you want, though.
sed -e '2s/^\([^ ]*\) /\1 /' infile.txt
should just replace instances of two spaces after the first block of space-free text with a single space (though I have not tested this).
(edit: inserted 2 before s in each instance to tie the edit to the second line, specifically.)

Use sed.
for file in *
do
sed -i '' '2s/ / /' "$file"
done
The -i '' flag means to edit in-place without a backup.
Or use ed!
for file in *
do
printf "2s/ / /\nwq\n" |ed -s "$file"
done

if the error always can occur at 2nd line,
for file in file*
do
awk 'NR==2{$1=$1}1' file >temp
mv temp "$file"
done
or sed
sed -i.bak '2s/ */ /' file* # do 2nd line
Or just pure bash scripting
i=1
while read -r line
do
if [ "$i" -eq 2 ];then
echo $line
else
echo "$line"
fi
((i++))
done <"file"

Since it seems every column is separated by one space, another approach not yet mentioned is to use tr to squeeze all multi spaces into single spaces:
tr -s " " < infile > outfile

I am going to be different and go with AWK:
awk '{print $1,$2,$3}' file.txt > file1.txt
This will handle any number of spaces between fields, and replace them with one space
To handle a specific line you can add line addresses:
awk 'NR==2{print $1,$2,$3} NR!=2{print $0}' file.txt > file1.txt
i.e. rewrite line 2, but leave unchanged the other lines.
A line address can be a regular expression as well:
awk '/regexp/{print $1,$2,$3} !/regexp/{print}' file.txt > file1.txt

This answer assumes you don't want to mess with any except the second line.
#!/usr/bin/env python
import sys, os
for fname in sys.argv[1:]:
with open(fname, "r") as fin:
line1 = fin.readline()
line2 = fin.readline()
fixedLine2 = " ".join(line2.split()) + '\n'
if fixedLine2 == line2:
continue
with open(fname + ".fixed", "w") as fout:
fout.write(line1)
fout.write(line2)
for line in fin:
fout.write(line)
# Enable these lines if you want the old files replaced with the new ones.
#os.remove(fname)
#os.rename(fname + ".fixed", fname)

I don't quite understand, but yes, sed is an option. I don't think any POSIX compliant version of sed has an in file option (-i), so a fully POSIX compliant solution would be.
sed -e 's/^BBB /BBB /' <file> > <newfile>

Use sed:
sed -e 's/[[:space:]][[:space:]]/ /g' yourfile.txt >> newfile.txt
This will replace any two adjacent spaces with one. The use of [[:space:]] just makes it a little bit clearer

sed -i -e '2s/ / /g' input.txt
-i: edit files in place

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pattern Matching in a csv file and appending to matched lines - python

I found a solution to match paths using sed. I did it through escap character and it worked. Pattern="\/var\/log\/Model\/1\/" Module=BE sudo sed -i"" "s/\r//; /$Pattern/ s/$/,$Module/" resultFile.csv Worked Fine!!

Below snippet appends the pattern in case if a match is found. If no match found just prints the line, awk '{if($0 ~ /pattern/) print $0",pattern"; else print $0;}' file.csv

Related

How can I print a grep result with the matched word?

vim and wc give different line counts

find and replace after the second column

Compare 2 files and remove any lines in file2 when they match values found in file1

Bash or python for changing spacing in files

Categories

Resources