find and replace after the second column - python

I have the following lines
92520536843;Sof_voya_Faible_Email_am;EMAIL;28/01/2015;1;0;0;Sof_voya_Faible_Email_am;30/01/2015;Sof_voya_Faible_Email_Relance_am
92515196529;Sof_trav_Fort_Email_pm_%yyyy%mm%dd%;EMAIL;05/02/2015;1;0;0;Sof_trav_Fort_Email_pm_%yyyy%mm%dd%;09/02/2015;Export Trav_Fort Postal
I'm trying to replace strings like Sof_ or _%yyyy%mm%dd% after the 7th field.
I thought about using sed
sed -i 's/<string_to_look_for>/<string_to_replace>/7g' filename
But it is only changing the field delimiter.
I thought about using this
awk -F";" '{ for (i=7; i<=NF; i++) print $i }' filename
but I don't know how to insert a search and replace for the strings I want to replace.
Any help is welcome.
Edit: expected outcome after replacing strings like Sof_ or _%yyyy%mm%dd% after the 7th column.
92520536843;Sof_voya_Faible_Email_am;EMAIL;28/01/2015;1;0;0;voya_Faible_Email_am;30/01/2015;voya_Faible_Email_Relance_am
92515196529;Sof_trav_Fort_Email_pm_%yyyy%mm%dd%;EMAIL;05/02/2015;1;0;0;trav_Fort_Email_pm;09/02/2015;Export Trav_Fort Postal
To the Python and Perl gurus: as I'm trying to ramp up my knowledge in these languages, your help is welcome. :)

You can use this awk (it loops over fields 7 through NF and gsubs away the unwanted strings; setting both FS and OFS to ";" keeps the output delimiter intact):
awk 'BEGIN{FS=OFS=";"} {for (i=7;i<=NF;i++) gsub(/Sof_|_%yyyy%mm%dd%/, "", $i) } 1' file
92520536843;Sof_voya_Faible_Email_am;EMAIL;28/01/2015;1;0;0;voya_Faible_Email_am;30/01/2015;voya_Faible_Email_Relance_am
92515196529;Sof_trav_Fort_Email_pm_%yyyy%mm%dd%;EMAIL;05/02/2015;1;0;0;trav_Fort_Email_pm;09/02/2015;Export Trav_Fort Postal

Through python3.
#!/usr/bin/python3
import sys

fil = sys.argv[1]
with open(fil) as f:
    for line in f:
        part1 = ';'.join(line.split(';')[:7])
        part2 = ';'.join(line.split(';')[7:]).replace('Sof_', '').replace('_%yyyy%mm%dd%', '')
        print(part1 + ';' + part2, end="")
Save the above text in a file, say script.py, then run it with:
python3 script.py inputfile
Through Perl (the (?:[^;]*;){7}(*SKIP)(*F) branch matches the first seven fields and then discards that match without retrying inside it, so the substitution can only apply to the rest of the line):
$ perl -pe 's/^(?:[^;]*;){7}(*SKIP)(*F)|(?:_%yyyy%mm%dd%|Sof_)//g' file
92520536843;Sof_voya_Faible_Email_am;EMAIL;28/01/2015;1;0;0;voya_Faible_Email_am;30/01/2015;voya_Faible_Email_Relance_am
92515196529;Sof_trav_Fort_Email_pm_%yyyy%mm%dd%;EMAIL;05/02/2015;1;0;0;trav_Fort_Email_pm;09/02/2015;Export Trav_Fort Postal

In Python you would use the re and csv modules to do this:
import re
import csv

with open(fn) as fin:  # fn is the input file name
    r = csv.reader(fin, delimiter=';')
    for line in r:
        result = line[:7]
        for field in line[7:]:
            if re.search(r'Sof_', field):
                field = re.sub(r'Sof_', 'replacement for Sof_', field)
            if re.search(r'_%yyyy%mm%dd%', field):
                field = re.sub(r'_%yyyy%mm%dd%', 'replacement for _%yyyy%mm%dd%', field)
            result.append(field)
        print(result)
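If you want the output back in ';'-separated form (matching the expected outcome above) rather than as a Python list, a minimal sketch using csv.writer could look like this, with fn again standing for the input file name:
import sys
import re
import csv

with open(fn) as fin:
    reader = csv.reader(fin, delimiter=';')
    writer = csv.writer(sys.stdout, delimiter=';')
    for line in reader:
        # Strip the unwanted strings from fields 8 onwards, as in the expected output.
        fixed = line[:7] + [re.sub(r'Sof_|_%yyyy%mm%dd%', '', f) for f in line[7:]]
        writer.writerow(fixed)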

This might work for you (GNU sed):
sed -r ':a;s/^(([^;]*;){7}.*)(Sof_|_%yyyy%mm%dd%)/\1/;ta' file
This stores the first seven fields, plus any following text that does not contain the unwanted strings, in the first backreference, and replaces the whole match with that backreference, removing one occurrence per pass; the :a/ta loop repeats until no occurrences remain.

Assuming you want the whole line from the input file; note that this starts at field 7, and your data also appears in the earlier fields of each line.
awk -F";" -v OFS=";" '{ for (i=7; i<=NF; i++)
    { gsub(/Sof_/, "newstring", $i) }
  print $0 }' filename
will replace Sof_ with "newstring". I'm not positive this is what you are looking for.
Corrected a syntax error - removed an errant ' character - thanks.

Here is another way using perl's -F -a and autosplit:
perl -F";" -anE 'for ( #F[7..$#F] ) { $_ =~ s/Sof_|_%yyyy%mm%dd%//g }
print join ";", #F;' file.txt
This grabs elements 7 to last ($#F) of the autocreated #F array and removes/substitutes the text.


How to remove starting newlines or the starting newline from a binary file?

I see there are discussions about removing trailing newlines.
How can I delete a newline if it is the last character in a file?
But I don't find a discussion about removing starting newlines. Could anybody let me know the best way to delete starting newlines (one-liner preferred)? Thanks.
The equivalent-opposite Perl code to chomp is s/^\n//. Instead of doing it on the last line (eof), do it on the first line. Even though it will only be an empty line, removing the newline will mean that line will print nothing in the output.
perl -pe 's/^\n// if $. == 1' filename >filename2
or in place:
perl -pi -e 's/^\n// if $. == 1' filename
Since starting newlines are by definition empty lines, you can also just skip printing them by using -n instead of -p (same behavior but without printing, so you can determine which lines to print).
perl -ni -e 'print unless $. == 1 and m/^\n/' filename
If you want to remove potentially multiple starting newlines, you could take another approach; advance the handle yourself in the beginning until you receive a non-empty line.
perl -pi -e 'if ($. == 1) { $_ = <> while m/^\n/ }' filename
It's all much easier if you don't mind reading the entire file into memory at once rather than line by line:
perl -0777 -pi -e 's/^\n+//' filename
To avoid doing any excess work editing the file unless it starts with newline characters, you could condition the edit by prefixing it with another command (reads first line of the file and causes a non-zero exit status if it doesn't start with a newline):
perl -e 'exit 1 unless <> =~ m/^\n/' filename && perl ...
In Python, read lines in a loop, without writing anything, until you get a non-empty line.
outdata = ""
with open(filename) as infile:
    while True:
        line = infile.readline()
        if line != "\n":
            break
    if line:
        outdata = line  # save first non-empty line
    outdata += infile.read()  # save the rest of the file
with open(filename, "w") as outfile:
    outfile.write(outdata)
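If the whole file fits in memory, a shorter sketch (not part of the original answer) that mirrors the perl -0777 slurp approach above would be:
with open(filename) as f:
    data = f.read()
with open(filename, "w") as f:
    f.write(data.lstrip("\n"))  # drop every leading newline character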
Simple filter to skip leading empty lines
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

my $begin = 1;
while ( <> ) {
    chomp;
    next if /^$/ and $begin;
    $begin = 0;
    say;
}
One-liner version: perl -0777 -pe 's/^\n+//' filename
This is what I came up with, I'm sure it can still be improved a bit.
with open('../resources/temp_in.txt', 'r+') as file:
    overwrite = False
    for line in file:
        if line.strip():  # first line that is not just a newline
            overwrite = True
            first_line = line
            break
    if overwrite:
        contents = first_line + file.read()
        file.seek(0)
        file.write(contents)
        file.truncate()
Here is an alternative solution, which opens the file twice.
with open('../resources/temp_in.txt') as file:
    for line in file:
        if line.strip():
            contents = line + file.read()
            break
    else:
        contents = ''

with open('../resources/temp_in.txt', 'w') as file:
    file.write(contents)
Set a flag when you find a line that's not just a newline and print when that flag is set:
awk '/./{f=1}f' file
e.g.:
$ printf '\n\n\nfoo\nbar\n'



foo
bar
$ printf '\n\n\nfoo\nbar\n' | awk '/./{f=1}f'
foo
bar

URL List Comparison for Redirection

I was given 2 CSV files, with upwards of 3000+ URL's contained in each.
What I am tasked with is to create a .htaccess "redirection" chunk from "old site" to "new site", and rather than go through and manually compare them, I thought I could simply either try a bash/python script, or import them into MySQL to do the comparisons.
So, in Bash I tried the following code:
#!/bin/bash
awk 'BEGIN{FS=OFS="/"} {gsub(/\/$/, ""); $NF=tolower($NF)} NR==FNR{a[$NF]=$0; next} $NF in a {print a[$NF] " " $0 > "combined.csv"}' oldsite.csv newsite.csv
However, it returns an empty "combined.csv", so I thought maybe Python... but alas, I know very little about Python. Then I thought MySQL: if I just import each CSV into a new table, I could run a comparison SQL statement and dump the results out to a new two-column table. Alas again, I am not really sure where to begin with the comparison (I was figuring on a LIKE comparison), but what I am wondering here is what the "best" (meaning most accurate) comparison method would be... and if Python, how?
CSV SAMPLES
NEW URLS
"new-url"
"/product/dangle-hoop-earrings-for-girls-with-cz-and-heart-dangle-in-14k-gold/"
"/product/dangle-hoop-earrings-for-girls-with-cz-and-butterfly-dangle-in-14k-gold/"
"/product/petite-lever-back-earrings-for-little-girls-in-14k-yellow-gold-with-blue-topaz-high-end-childrens-earrings/"
OLD URLS
"old-url"
"/product/0903-HUGGIEGK/Dangle-Hoop-Earrings-for-Girls-with-CZ-and-Heart-Dangle-in-14K-Gold/"
"/product/0954-HUGGIEGK/Dangle-Hoop-Earrings-for-Girls-with-CZ-and-Butterfly-Dangle-in-14K-Gold/"
"/product/10049Y4JBT/Petite-Lever-Back-Earrings-for-Little-Girls-in-14K-Yellow-Gold-with-Blue-Topaz---High-End-Childrens-Earrings/"
EXPECTED COMBINED
"old-url", "new-url"
"/product/0903-HUGGIEGK/Dangle-Hoop-Earrings-for-Girls-with-CZ-and-Heart-Dangle-in-14K-Gold/", "/product/dangle-hoop-earrings-for-girls-with-cz-and-heart-dangle-in-14k-gold/"
"/product/0954-HUGGIEGK/Dangle-Hoop-Earrings-for-Girls-with-CZ-and-Butterfly-Dangle-in-14K-Gold/", "/product/dangle-hoop-earrings-for-girls-with-cz-and-butterfly-dangle-in-14k-gold/"
"/product/10049Y4JBT/Petite-Lever-Back-Earrings-for-Little-Girls-in-14K-Yellow-Gold-with-Blue-Topaz---High-End-Childrens-Earrings/", "/product/petite-lever-back-earrings-for-little-girls-in-14k-yellow-gold-with-blue-topaz-high-end-childrens-earrings/"
As we discovered in our comment thread, you needed to convert your data so it can be processed in awk/unix by removing the \r part of MS-DOS line-endings with
dos2unix file
which converts file line endings from \r\n to \n. Note that you can call dos2unix with multiple filenames and each file will be processed, i.e.
dos2unix old.csv new.csv many_more ...
Here is your revised code, which will also create a separate file for unmatched records in the "new" file. The only correction I found necessary was to change the final output to include the , character, i.e. print a[$NF] "," $0.
#!/bin/bash
awk 'BEGIN{FS=OFS="/"}
{ gsub(/\/$/, "")
# print "#dbg: FILENAME="FILENAME "\tNR="NR "\tFNR="FNR
$NF=tolower($NF)
}
NR==FNR{
a[$NF]=$0; next
}
{
if ($NF in a) {
print a[$NF] "," $0 > "combined.csv"
}
else {
print a[$NF] "," $0 > "unmatched.csv"
}
}
' oldsite.csv newsite.csv
output
cat combined.csv
"/product/10049Y4JBT/Petite-Lever-Back-Earrings-for-Little-Girls-in-14K-Yellow-Gold-with-Blue-Topaz---High-End-Childrens-Earrings/","/product/dangle-hoop-earrings-for-girls-with-cz-and-heart-dangle-in-14k-gold/"
"/product/10049Y4JBT/Petite-Lever-Back-Earrings-for-Little-Girls-in-14K-Yellow-Gold-with-Blue-Topaz---High-End-Childrens-Earrings/","/product/dangle-hoop-earrings-for-girls-with-cz-and-butterfly-dangle-in-14k-gold/"
"/product/10049Y4JBT/Petite-Lever-Back-Earrings-for-Little-Girls-in-14K-Yellow-Gold-with-Blue-Topaz---High-End-Childrens-Earrings/","/product/petite-lever-back-earrings-for-little-girls-in-14k-yellow-gold-with-blue-topaz-high-end-childrens-earrings/"
cat unmatched.csv
,"new-url"
IHTH
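If you'd rather do the join in Python, here is a minimal sketch of the same idea (not from the original answer): key each URL by its lowercased final path segment and pair old with new. The helper names are mine, and it shares the awk version's limitation, so segments that differ by more than case (like the --- vs - example above) still won't match.
import csv

def last_segment(url):
    # Key a URL by its lowercased final path segment, ignoring the trailing slash.
    return url.rstrip('/').rsplit('/', 1)[-1].lower()

def load(path):
    # Map lowercased last segment -> full URL, skipping the header row.
    with open(path, newline='') as f:
        rows = [r for r in csv.reader(f) if r]
    return {last_segment(r[0]): r[0] for r in rows[1:]}

old = load('oldsite.csv')
new = load('newsite.csv')

with open('combined.csv', 'w', newline='') as out:
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    writer.writerow(['old-url', 'new-url'])
    for key, new_url in new.items():
        if key in old:
            writer.writerow([old[key], new_url])
Each matched pair can then be emitted as a Redirect 301 old new line for the .htaccess file.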

python subprocess awk with -F option and using variable for input file

I have a text file that has data delimited with '|'
E.g.
123 | 456 | 789
I want to print the second column only.
I can use awk in the shell like this: awk -F'|' '{print $2}' file.txt
However, I want to use python subprocess to do this. And also the input file must be a variable.
Right now, this is what I have.
import subprocess
file = "file-03-10-2016.txt"
with open('another_file.txt', 'wb') as output:
    var = subprocess.check_call(['awk', '{print $2}', file])
print var
This prints the second column but it uses space as a delimiter. I want to change the delimiter to '|' using the -F option for awk.
Try:
var = subprocess.check_call(['awk', '-F|', '{print $2}', file])
However, I feel like I should point out that this task is very easy to do in pure python:
def awk_split(file_name, column, fs=None):
    with open(file_name, 'r') as file_stream:
        for line in file_stream:
            yield line.split(fs)[column]

for val in awk_split(file, 1, fs='|'):
    pass  # do something with val...
Note that subprocess.check_call takes a list of strings that are passed directly to the program as separate arguments, with no shell in between, so no shell quoting is needed: '-F|' reaches awk exactly as written. If you quote it the way you would on a command line, e.g.
var = subprocess.check_call(['awk', "-F'|'", '{print $2}', file])
the quote characters become part of the field separator that awk sees, which is not what you want.
Hope that helps.
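As an aside (an assumption about the goal, since the question opens another_file.txt but never uses it): if you want awk's output written to that file, or captured in a Python variable, a minimal sketch would be:
import subprocess

infile = "file-03-10-2016.txt"

# Send awk's output to the file instead of the terminal.
with open("another_file.txt", "wb") as output:
    subprocess.check_call(["awk", "-F|", "{print $2}", infile], stdout=output)

# Or capture it as a string in Python.
second_column = subprocess.check_output(["awk", "-F|", "{print $2}", infile])
print(second_column.decode())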

Calling gawk from Python

I am trying to call gawk (the GNU implementation of AWK) from Python in this manner.
import os
import string
import codecs
ligand_file=open( "2WTKA_ab.txt", "r" ) #Open the receptor.txt file
ligand_lines=ligand_file.readlines() # Read all the lines into the array
ligand_lines=map( string.strip, ligand_lines )
ligand_file.close()
for i in ligand_lines:
    os.system ( " gawk %s %s"%( "'{if ($2==""i"") print $0}'", 'unique_count_a_from_ac.txt' ) )
My problem is that "i" is not being replaced by the value it represents. The value that "i" represents is an integer, not a string. How can I fix this problem?
That's a non-portable and messy way to check if something is in a file. Imagine you have 1000 lines: you would be making 1000 system calls to gawk, which is super inefficient. You are using Python, so do it in Python.
....
ligand_file = open("2WTKA_ab.txt", "r")  # Open the receptor.txt file
ligand_lines = ligand_file.readlines()   # Read all the lines into the array
ligand_lines = map(str.strip, ligand_lines)
ligand_file.close()

for line in open("unique_count_a_from_ac.txt"):
    sline = line.strip().split()
    if sline[1] in ligand_lines:
        print line.rstrip()
Or you can also use this one-liner if Python is not a must (it first loads every line of 2WTKA_ab.txt into the array a, then prints each line of the second file whose second field is in a):
gawk 'FNR==NR{a[$0]; next}($2 in a)' 2WTKA_ab.txt unique_count_a_from_ac.txt
Your problem is in the quoting: in Python, something like "some text "" with quotes" will not give you a quote character, because adjacent string literals are simply concatenated. Try this instead:
os.system('''gawk '{if ($2=="%s") print $0}' unique_count_a_from_ac.txt''' % i)
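If you do want to keep calling gawk once per value (despite the efficiency caveat above), a sketch that sidesteps the quoting problem entirely is to pass an argument list to subprocess, reusing ligand_lines from the question:
import subprocess

for i in ligand_lines:
    # Build the awk program with the value already interpolated; the argument
    # list form runs gawk without a shell, so no shell quoting is needed.
    prog = '$2 == "%s" { print $0 }' % i
    subprocess.check_call(["gawk", prog, "unique_count_a_from_ac.txt"])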

Bash or python for changing spacing in files

I have a set of 10000 files. In all of them, the second line, looks like:
AAA 3.429 3.84
so there is just one space (requirement) between AAA and the two other columns. The rest of lines on each file are completely different and correspond to 10 columns of numbers.
Randomly, in around 20% of the files, and due to some errors, one gets
BBB  3.429 3.84
so now there are two spaces between the first and second column.
This is a big error so I need to fix it, changing from 2 to 1 space in the files where the error takes place.
The first approach I thought of was to write a bash script that for each file reads the 3 values of the second line and then prints them with just one space, doing it for all the files.
I wonder what you think about this approach and whether you could suggest something better, in bash, Python, or some other approach.
Thanks
Performing line-based changes to text files is often simplest to do in sed.
sed -e '2s/  */ /g' infile.txt
will replace any runs of multiple spaces with a single space. This may be changing more than you want, though.
sed -e '2s/^\([^ ]*\)  /\1 /' infile.txt
should just replace instances of two spaces after the first block of space-free text with a single space (though I have not tested this).
(edit: inserted 2 before s in each instance to tie the edit to the second line, specifically.)
Use sed.
for file in *
do
sed -i '' '2s/  / /' "$file"
done
The -i '' flag means to edit in-place without a backup.
Or use ed!
for file in *
do
printf "2s/ / /\nwq\n" |ed -s "$file"
done
if the error always occurs on the 2nd line ($1=$1 makes awk rebuild that line with single-space separators),
for file in file*
do
awk 'NR==2{$1=$1}1' "$file" >temp
mv temp "$file"
done
or sed
sed -i.bak '2s/  */ /' file* # do 2nd line
Or just pure bash scripting (the unquoted $line on line 2 lets word splitting collapse the extra spaces):
i=1
while read -r line
do
if [ "$i" -eq 2 ];then
echo $line
else
echo "$line"
fi
((i++))
done <"file"
Since it seems every column is separated by one space, another approach not yet mentioned is to use tr to squeeze all runs of multiple spaces into single spaces:
tr -s " " < infile > outfile
I am going to be different and go with AWK:
awk '{print $1,$2,$3}' file.txt > file1.txt
This will handle any number of spaces between fields, and replace them with one space.
To handle a specific line you can add line addresses:
awk 'NR==2{print $1,$2,$3} NR!=2{print $0}' file.txt > file1.txt
i.e. rewrite line 2, but leave the other lines unchanged.
A line address can be a regular expression as well:
awk '/regexp/{print $1,$2,$3} !/regexp/{print}' file.txt > file1.txt
This answer assumes you don't want to mess with any line except the second.
#!/usr/bin/env python
import sys, os

for fname in sys.argv[1:]:
    with open(fname, "r") as fin:
        line1 = fin.readline()
        line2 = fin.readline()
        fixedLine2 = " ".join(line2.split()) + '\n'
        if fixedLine2 == line2:
            continue
        with open(fname + ".fixed", "w") as fout:
            fout.write(line1)
            fout.write(fixedLine2)
            for line in fin:
                fout.write(line)
    # Enable these lines if you want the old files replaced with the new ones.
    #os.remove(fname)
    #os.rename(fname + ".fixed", fname)
I don't quite understand the whole requirement, but yes, sed is an option. I don't think any POSIX-compliant version of sed has an in-place option (-i), so a fully POSIX-compliant solution would be
sed -e 's/^BBB  /BBB /' <file> > <newfile>
Use sed:
sed -e 's/[[:space:]][[:space:]]/ /g' yourfile.txt >> newfile.txt
This will replace any two adjacent spaces with one. The use of [[:space:]] just makes it a little bit clearer.
sed -i -e '2s/  / /g' input.txt
-i: edit files in place
