I want to add a new column at the end, based on the text of another column (with an if statement), and then I want to add a new column/field name.
I am close but struggling with the syntax. I am using awk, but apologies, it's been a while since I used it. I wondered if I should use Python/Anaconda (Jupyter notebook), but I am going with the easiest environment I have available at the minute: awk.
This is my file:
$ cat file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5
Here I want to create a new column at the end, based on the text in column 4. I am winging this a bit, but I got it to work:
$ awk -F, '{if (substr($4,1,1)=="A")
print $0 (NR>1 ? FS substr($4,1,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,2) : "")
}' file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
But here I want to add a field/column name at the end, which I am close to, I believe.
$ awk -F, -v OFS=, 'NR==1{ print $0, "test"}
NR>1
{
if (substr($4,1,1)=="A")
print $0 (NR>1 ? FS substr($4,1,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,2) : "")
}
' file1
f1,f2,f3,f4,f5,test
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
What I want is this:
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
EDIT1
For my reference, this is the awk I want:
awk -F, '{if (substr($4,1,1)=="P")
print $0 (NR>1 ? FS substr($4,5,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,4) : "")
}' file1
outputting it to file2:
awk -F, '{if (substr($4,1,1)=="P")
print $0 (NR>1 ? FS substr($4,5,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,4) : "")
}' file1 > file2
Two files; file2 has the extra column added:
$ ls
file1  file2
$ cat file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5
$ cat file2
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
EDIT2 -- Correction
file2 is what I want:
cat file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5
cat file2
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
awk -F, -v OFS=, 'NR==1{ print $0, "test"}
NR>1 {
if (substr($4,1,1)=="P")
print $0 (NR>1 ? FS substr($4,5,4) : "")
else
print $0 (NR>1 ? FS substr($4,1,4) : "")
}
' file1
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
Newlines matter. Change:
NR>1
{
to
NR>1 {
As written you have 2 independent statements equivalent to:
NR>1 { print }
<true condition> {
if (whatever) print foo; else print bar
}
instead of what you intended:
NR>1 {
if (whatever) print foo; else print bar
}
Having said that, try this instead of what you have:
awk '
BEGIN { FS=OFS="," }
NR == 1 { x = "test" }
NR > 1 { x = substr( $4, 1, ($4 ~ /^A/ ? 4 : 2) ) }
{ print $0, x }
' file1
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
Same functionality, just more concise.
Rename x to some mnemonic of whatever you really intend that last column to represent.
You may use this awk command, which removes any carriage return (if present) from each line before computing the value of the last column:
awk 'BEGIN {FS=OFS=","}
{
    sub(/\r$/, "")
    print $0, (NR==1 ? "test" : (substr($4,1,1)=="A" ? substr($4,1,4) : substr($4,1,2)))
}' file
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
An awk one-liner:
gawk '$++NF =(_^=!_)==NR ? "test" : substr(__=$4,_++,_^_^(__~"^A"))' FS=, OFS=,
Output:
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
To deal with Windows/DOS files, do this instead:
mawk 'BEGIN { RS ="\r?\n"; ___ = "test"; OFS = FS = ","
_++ } $++NF = _==NR ? ___ : substr(__=$4,_,++_^_--^(__~"^A"))'
This works because the regex selects whether to take 2^2^1 or 2^2^0, which work out to 4 and 2 respectively.
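A quick sketch of that arithmetic, for reference (awk's ^ is right-associative, and a regex test such as (__ ~ "^A") evaluates to 1 or 0, so the match result sits at the top of the exponent tower):
awk 'BEGIN { print 2^2^1 }'    # matched: 2^(2^1) = 4 characters
awk 'BEGIN { print 2^2^0 }'    # not matched: 2^(2^0) = 2 characters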
A Python solution using solely the standard library. Let file.csv content be:
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5
then
import csv
with open('file.csv', newline='') as infile:
    with open('fileout.csv', 'w', newline='') as outfile:
        reader = csv.DictReader(infile)
        outfields = reader.fieldnames + ['test']
        writer = csv.DictWriter(outfile, outfields)
        writer.writeheader()
        for row in reader:
            row['test'] = row['f4'][:4] if row['f4'][0] == 'A' else row['f4'][:2]
            writer.writerow(row)
This creates (or overwrites, if it already exists) the file fileout.csv with the following content:
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
Explanation: I am using csv.DictReader and csv.DictWriter from csv. First, I create two context managers (with open...) with newline handling suitable for the reader and the writer (see the linked docs for further explanation); infile is opened for reading (the default), whilst outfile is opened for writing (w). I use the reader to parse infile, and I concatenate its fieldnames (column names) with a list holding the single element test to get the fieldnames for the output file. Then I output the header (column names). Finally, for each data row in the input file, I compute the value for test using the ternary operator (observe it is valueiftrue if condition else valueiffalse, which is a different order from GNU AWK's condition ? valueiftrue : valueiffalse) and string slicing ([:n] means take the first n characters), insert that into the row dict, and write the row out.
(tested in Python 3.8.10)
I'm not really sure of the best way to do this... I was thinking I might need to do it in Python?
filea.html contains data-tx-text="9817db21ccc2d9acc021c4536690b90a_se"
fileb.html contains data-tx-text="0850235fcb0e503150c224dad3156312_se"
There are exactly the same number of occurrences of data-tx-text values in filea.html and fileb.html (171).
I want to be able to use a regex pattern or a simple Python program to
Find data-tx-text="(.*?)" in filea.html
Find data-tx-text="(.*?)" in fileb.html
Replace the value from filea.html with the value found in fileb.html
Move to the next occurrence.
Continue until the end of the file, or until all values in filea.html match those in fileb.html
I have the basics. For instance, I know the regex pattern that I need, and I am guessing I need to loop this in Python or something similar?
Maybe I can do it with sed, but I'm not that good with it, so any help is greatly appreciated.
In awk, you could use something like this:
NR == FNR {
    # First file (fileb): collect each data-tx-text value, in order
    match($0, /data-tx-text="[^"]+"/);
    if (RSTART > 0) {
        # skip the 14-char prefix data-tx-text=" and drop the closing quote
        data[++a] = substr($0, RSTART + 14, RLENGTH - 15);
    }
    next;
}
/data-tx-text/ {
    # Second file (filea): swap in the next collected value
    sub(/data-tx-text="[^"]+"/, "data-tx-text=\"" data[++b] "\"");
}
{ print; }
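A usage sketch, assuming the script above is saved as replace.awk (both filenames below are illustrative); fileb.html must be listed first so its values are collected before filea.html is rewritten:
awk -f replace.awk fileb.html filea.html > filea.new.html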
With GNU awk for the 3rd arg to match():
$ cat tst.awk
match($0,/(.*)(data-tx-text="[^"]+")(.*)/,a) {
    if (NR==FNR) {
        # first file (fileb): save the whole attribute string
        fileb[++bcnt] = a[2]
    }
    else {
        # second file (filea): splice in the next saved attribute
        $0 = a[1] fileb[++acnt] a[3]
    }
}
NR>FNR
$ awk -f tst.awk fileb filea
data-tx-text="0850235fcb0e503150c224dad3156312_se"
With other awks you'd use 3 calls to substr() after the match():
$ cat tst.awk
match($0,/data-tx-text="[^"]+"/) {
    if (NR==FNR) {
        # first file (fileb): save the matched attribute string
        fileb[++bcnt] = substr($0,RSTART,RLENGTH)
    }
    else {
        # second file (filea): replace the match with the next saved attribute
        $0 = substr($0,1,RSTART-1) fileb[++acnt] substr($0,RSTART+RLENGTH)
    }
}
NR>FNR
$ awk -f tst.awk fileb filea
data-tx-text="0850235fcb0e503150c224dad3156312_se"
Open filea, find stringa.
Open fileb, find stringb.
Replace stringa with stringb.
Replace stringb with stringa.
Write the files back.
In code, as below:
import re

pattern = 'data-tx-text="(.*?)"'

with open('filea.html', 'r') as f:
    filea = f.read()
with open('fileb.html', 'r') as f:
    fileb = f.read()

# re.search scans the whole text; re.match would only match at the very start
stringa = re.search(pattern, filea).group()
stringb = re.search(pattern, fileb).group()

filea = filea.replace(stringa, stringb)
fileb = fileb.replace(stringb, stringa)

with open('filea.html', 'w') as f:
    f.write(filea)
with open('fileb.html', 'w') as f:
    f.write(fileb)
So this is how I have solved it using Python. It's a bit manual, in that I have to change the names of filea and fileb each time, but it works.
I think I can improve the regex with escapes?
import re

with open('filea.html') as originalFile:
    originalFileContents = originalFile.read()

pattern = re.compile(r'[0-9a-f]{32}_se')
originalMatches = pattern.findall(originalFileContents)

counter = 0

def replaceId(match):
    global counter
    value = match.group()
    newValue = originalMatches[counter]
    print(counter, '=> replacing', value, 'with', newValue)
    counter = counter + 1
    return newValue

with open('fileb.html') as targetFile:
    targetFileContents = targetFile.read()
    changedTargetFileContents = pattern.sub(replaceId, targetFileContents)
    print(changedTargetFileContents)

new_file = open("Output.html", "w")
new_file.write(changedTargetFileContents)
new_file.close()
Available on GitHub: https://github.com/timm088/rehjex-py
Here's how I'd do it using Beautiful Soup:
from bs4 import BeautifulSoup as bs

replacements, replaced_html = [], ''

with open('fileb.html') as fileb:
    # Extract replacements
    soup = bs(fileb, 'html.parser')
    tags = soup.find_all(lambda tag: tag.get('data-tx-text'))
    replacements = [tag.get('data-tx-text') for tag in tags]

with open('filea.html') as filea:
    # Replace values
    soup = bs(filea, 'html.parser')
    tags = soup.find_all(lambda tag: tag.get('data-tx-text'))
    for tag in tags:
        tag['data-tx-text'] = replacements.pop(0)
    replaced_html = str(soup)

with open('filea.html', 'w') as new_filea:
    # Update file
    new_filea.write(replaced_html)
I have a very large tab-delimited text file. Many lines in the file have the same value in one of the columns, and I want to put them onto the same line. For example:
a foo
a bar
a foo2
b bar
c bar2
After running the script it should become:
a foo;bar;foo2
b bar
c bar2
How can I do this in either a shell script or in Python?
Thanks.
With awk you can try this:
{ a[$1] = a[$1] ";" $2 }
END { for (item in a) print item, a[item] }
So if you save this awk script in a file called awkf.awk, and your input file is ifile.txt, run:
awk -f awkf.awk ifile.txt | sed 's/ ;/ /'
The sed script removes the leading ;.
Hope this helps
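As a variant sketch on the same assumed input, the sed step can be avoided by only prepending the separator once the key has already been seen:
awk '{ a[$1] = ($1 in a) ? a[$1] ";" $2 : $2 }
END { for (item in a) print item, a[item] }' ifile.txt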
from collections import defaultdict

items = defaultdict(list)
for line in open('sourcefile'):
    key, val = line.rstrip('\n').split('\t')  # strip the newline so it doesn't end up in val
    items[key].append(val)

result = open('result', 'w')
for k in sorted(items):
    result.write('%s\t%s\n' % (k, ';'.join(items[k])))
result.close()
not tested
Tested with Python 2.7:
import csv

data = {}
reader = csv.DictReader(open('infile','r'), fieldnames=['key','value'], delimiter='\t')
for row in reader:
    if row['key'] in data:
        data[row['key']].append(row['value'])
    else:
        data[row['key']] = [row['value']]

writer = open('outfile','w')
for key in data:
    writer.write(key + '\t' + ';'.join(data[key]) + '\n')
writer.close()
A Perl way to do it:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
open my $fh, '<', 'path/to/file' or die "unable to open file:$!";
my %res;
while(<$fh>) {
my ($k, $v) = split;
push #{$res{$k}}, $v;
}
print Dumper \%res;
Output:
$VAR1 = {
          'c' => [
                   'bar2'
                 ],
          'a' => [
                   'foo',
                   'bar',
                   'foo2'
                 ],
          'b' => [
                   'bar'
                 ]
        };
#! /usr/bin/env perl
use strict;
use warnings;

# for demo only
*ARGV = *DATA;

my %record;
my @order;
while (<>) {
    chomp;
    my ($key, $combine) = split;
    push @order, $key unless exists $record{$key};
    push @{ $record{$key} }, $combine;
}

print $_, "\t", join(";", @{ $record{$_} }), "\n" for @order;
__DATA__
a foo
a bar
a foo2
b bar
c bar2
Output (with tabs converted to spaces because Stack Overflow breaks the output):
a foo;bar;foo2
b bar
c bar2
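For comparison, the same order-preserving idea sketched in awk (ifile.txt is an assumed input name): record each key the first time it appears, then print in that recorded order:
awk '!($1 in a) { order[++n] = $1 }
{ a[$1] = ($1 in a) ? a[$1] ";" $2 : $2 }
END { for (i = 1; i <= n; i++) print order[i] "\t" a[order[i]] }' ifile.txt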
def compress(infilepath, outfilepath):
    input = open(infilepath, 'r')
    output = open(outfilepath, 'w')
    prev_index = None
    for line in input:
        index, val = line.rstrip('\n').split('\t')  # strip the newline before splitting
        if index == prev_index:
            output.write(";%s" % val)
        else:
            if prev_index is not None:  # no leading newline before the first line
                output.write("\n")
            output.write("%s\t%s" % (index, val))
        prev_index = index              # remember the key just written
    output.write("\n")
    input.close()
    output.close()
Untested, but should work. Please leave a comment if there are any concerns.