Merge multiple lines into a single line by the value of a column - Python
I have a tab-delimited text file that is very large. Many lines in the file share the same value in one of the columns. I want to merge those lines into a single line. For example:
a foo
a bar
a foo2
b bar
c bar2
After running the script, it should become:
a foo;bar;foo2
b bar
c bar2
How can I do this with either a shell script or Python?
Thanks.
With awk you can try this:

{ a[$1] = a[$1] ";" $2 }
END { for (item in a) print item, a[item] }

So if you save this awk script in a file called awkf.awk, and your input file is ifile.txt, run:

awk -f awkf.awk ifile.txt | sed 's/ ;/ /'

The sed command removes the leading ; that the awk script adds before the first value.
Hope this helps.
from collections import defaultdict

items = defaultdict(list)
for line in open('sourcefile'):
    key, val = line.rstrip('\n').split('\t')   # strip the newline so it doesn't end up in the value
    items[key].append(val)

result = open('result', 'w')
for k in sorted(items):
    result.write('%s\t%s\n' % (k, ';'.join(items[k])))
result.close()

Not tested.
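If the input is already grouped by key, as in the example, itertools.groupby gives a streaming variant of the same idea without holding everything in memory. A minimal sketch, assuming sorted input and the same placeholder file names as above:

from itertools import groupby
from operator import itemgetter

with open('sourcefile') as src, open('result', 'w') as out:
    # each row is [key, value]; groupby batches consecutive rows sharing a key
    rows = (line.rstrip('\n').split('\t') for line in src)
    for key, group in groupby(rows, key=itemgetter(0)):
        out.write('%s\t%s\n' % (key, ';'.join(val for _, val in group)))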
Tested with Python 2.7:
import csv

data = {}
reader = csv.DictReader(open('infile', 'r'), fieldnames=['key', 'value'], delimiter='\t')
for row in reader:
    if row['key'] in data:
        data[row['key']].append(row['value'])
    else:
        data[row['key']] = [row['value']]

writer = open('outfile', 'w')
for key in data:
    writer.write(key + '\t' + ';'.join(data[key]) + '\n')
writer.close()
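One caveat: a plain dict in Python 2.7 does not preserve insertion order, so the output lines come out in arbitrary order. If the input order should be kept, collections.OrderedDict is a drop-in replacement; a sketch along the same lines, untested against real data:

import csv
from collections import OrderedDict

data = OrderedDict()  # remembers keys in first-seen order
with open('infile') as f:
    reader = csv.DictReader(f, fieldnames=['key', 'value'], delimiter='\t')
    for row in reader:
        data.setdefault(row['key'], []).append(row['value'])

with open('outfile', 'w') as out:
    for key, vals in data.items():
        out.write(key + '\t' + ';'.join(vals) + '\n')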
A Perl way to do it:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

open my $fh, '<', 'path/to/file' or die "unable to open file: $!";
my %res;
while (<$fh>) {
    my ($k, $v) = split;
    push @{ $res{$k} }, $v;
}
print Dumper \%res;
Output:

$VAR1 = {
          'c' => [
                   'bar2'
                 ],
          'a' => [
                   'foo',
                   'bar',
                   'foo2'
                 ],
          'b' => [
                   'bar'
                 ]
        };
#! /usr/bin/env perl
use strict;
use warnings;

# for demo only
*ARGV = *DATA;

my %record;
my @order;
while (<>) {
    chomp;
    my ($key, $combine) = split;
    push @order, $key unless exists $record{$key};
    push @{ $record{$key} }, $combine;
}

print $_, "\t", join(";", @{ $record{$_} }), "\n" for @order;

__DATA__
a foo
a bar
a foo2
b bar
c bar2
Output (with tabs converted to spaces, because Stack Overflow mangles tabs in output):
a foo;bar;foo2
b bar
c bar2
def compress(infilepath, outfilepath):
    input = open(infilepath, 'r')
    output = open(outfilepath, 'w')
    prev_index = None
    for line in input:
        index, val = line.rstrip('\n').split('\t')
        if index == prev_index:
            output.write(";%s" % val)              # same key: extend the current line
        else:
            if prev_index is not None:
                output.write("\n")                 # close off the previous line
            output.write("%s\t%s" % (index, val))
        prev_index = index
    output.write("\n")
    input.close()
    output.close()
Untested, but should work. Please leave a comment if there are any concerns.
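One thing to note: this streaming approach only merges keys that appear on consecutive lines, as they do in the example. If equal keys can be scattered through the file, sort it by key first; a rough sketch (reads the whole file into memory, so only for inputs that fit):

def sort_by_key(infilepath, outfilepath):
    # order lines by their first tab-separated column
    with open(infilepath) as f:
        lines = sorted(f, key=lambda line: line.split('\t', 1)[0])
    with open(outfilepath, 'w') as f:
        f.writelines(lines)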
Related
awk + adding a column based on values of another column + adding a field name, in one command
I want to add a new column at the end, based on the text of another column (with an if statement), and then I want to add a new column/field name. I am close, but I am struggling with the syntax. I am using awk, but apologies, it's been a while since I used it, and I am wondering if I should use python/anaconda (jupyter notebook), but I am going with the easiest environment I have available to me at the minute: awk. This is my file:

$ cat file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5

Here I want, based on the text in column 4, to create a new column at the end. I am winging this a bit, but I got it to work:

$ awk -F, '{if (substr($4,1,1)=="A") print $0 (NR>1 ? FS substr($4,1,4) : "") else print $0 (NR>1 ? FS substr($4,1,2) : "") }' file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF

But here I want to add a field/column name at the end as well, and I believe I am close:

$ awk -F, -v OFS=, 'NR==1{ print $0, "test"} NR>1 { if (substr($4,1,1)=="A") print $0 (NR>1 ? FS substr($4,1,4) : "") else print $0 (NR>1 ? FS substr($4,1,2) : "") } ' file1
f1,f2,f3,f4,f5,test
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF

What I want is this:

f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF

EDIT1, for my reference: this is the awk I want:

awk -F, '{if (substr($4,1,1)=="P") print $0 (NR>1 ? FS substr($4,5,4) : "") else print $0 (NR>1 ? FS substr($4,1,4) : "") }' file1

outputting it to file2:

awk -F, '{if (substr($4,1,1)=="P") print $0 (NR>1 ? FS substr($4,5,4) : "") else print $0 (NR>1 ? FS substr($4,1,4) : "") }' file1 > file2

2 files, where file2 has the extra column added:

$ ls
file1  file2
$ cat file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5
$ cat file2
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ

EDIT2 -- Correction: file2 below is what I want:

cat file1
f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5

cat file2
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ

awk -F, -v OFS=, 'NR==1{ print $0, "test"} NR>1 { if (substr($4,1,1)=="P") print $0 (NR>1 ? FS substr($4,5,4) : "") else print $0 (NR>1 ? FS substr($4,1,4) : "") } ' file1
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
Newlines matter. Change:

NR>1
{

to:

NR>1 {

As written you have 2 independent statements, equivalent to:

NR>1 { print }
<true condition> { if (whatever) print foo; else print bar }

instead of what you intended:

NR>1 { if (whatever) print foo; else print bar }

Having said that, try this instead of what you have:

awk '
BEGIN { FS=OFS="," }
NR == 1 { x = "test" }
NR > 1  { x = substr( $4, 1, ($4 ~ /^A/ ? 4 : 2) ) }
{ print $0, x }
' file1
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF

Same functionality, just more concise. Rename x to some mnemonic of whatever you really intend that last column to represent.
You may use this awk command, which removes any carriage return present at the end of each line before computing the value of the last column:

awk 'BEGIN { FS=OFS="," }
{
    sub(/\r$/, "")
    print $0, (NR==1 ? "test" : (substr($4,1,1)=="A" ? substr($4,1,4) : substr($4,1,2)))
}' file

f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
awk one-liner:

gawk '$++NF =(_^=!_)==NR ? "test" : substr(__=$4,_++,_^_^(__~"^A"))' FS=, OFS=,

f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF

To deal with Windows/DOS files, do this instead:

mawk 'BEGIN { RS ="\r?\n"; ___ = "test"; OFS = FS = ","
              _++ }
      $++NF = _==NR ? ___ : substr(__=$4,_,++_^_--^(__~"^A"))'

This works because the regex selects whether to take 2^2^1 or 2^2^0 characters, which works out to 4 and 2 respectively.
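To make the golfed exponent trick less opaque, here is the same column-selection logic spelled out in plain Python (an illustrative sketch, not part of the original answer):

def extra_column(f4):
    # length is 2**2**1 = 4 when f4 starts with 'A', else 2**2**0 = 2
    n = 2 ** 2 ** int(f4.startswith('A'))
    return f4[:n]

assert extra_column('AWERF') == 'AWER'
assert extra_column('SBCDE') == 'SB'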
Python solution using solely the standard library. Let file.csv content be:

f1,f2,f3,f4,f5
row1_1,row1_2,row1_3,SBCDE,row1_5
row2_1,row2_2,row2_3,AWERF,row2_5
row3_1,row3_2,row3_3,ASDFG,row3_5

then

import csv

with open('file.csv', newline='') as infile:
    with open('fileout.csv', 'w', newline='') as outfile:
        reader = csv.DictReader(infile)
        outfields = reader.fieldnames + ['test']
        writer = csv.DictWriter(outfile, outfields)
        writer.writeheader()
        for row in reader:
            row['test'] = row['f4'][:4] if row['f4'][0] == 'A' else row['f4'][:2]
            writer.writerow(row)

creates (or overwrites, if it already exists) fileout.csv with the following content:

f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SB
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF

Explanation: I am using csv.DictReader and csv.DictWriter from csv. Firstly I create two context managers (with open...) with newline set suitably for reader and writer (see the linked docs for further explanation); infile is opened for reading (the default), whilst outfile is opened for writing (w). I use reader to parse infile, and concatenate its fieldnames (column names) with a list holding the single element test to get the fieldnames for the output file. Then I write the header (column names); then, for each data row of the input file, I compute the value for test using the ternary operator (observe that it is valueiftrue if condition else valueiffalse, a different order from GNU AWK's condition ? valueiftrue : valueiffalse) and string slicing ([:n] means take the first n characters), insert that into the row dict, and write the row out. (Tested in Python 3.8.10.)
Get a string in Shell/Python with subprocess
After this topic, Get a string in Shell/Python using sys.argv, I need to change my code. I need to use a subprocess in a main.py with this function:

def download_several_apps(self):
    subproc_two = subprocess.Popen(
        ["./readtext.sh", self.inputFileName_download],
        stdout=subprocess.PIPE)

Here is my readtext.sh file:

#!/bin/bash

filename="$1"
counter=1
while IFS=: true; do
    line=''
    read -r line
    if [ -z "$line" ]; then
        break
    fi
    python3 ./download.py \
        -c ./credentials.json \
        --blobs \
        "$line"
done < "$filename"

And my download.py file:

if (len(sys.argv) == 2):
    downloaded_apk_default_location = 'Downloads/'
else:
    readtextarg = os.popen("ps " + str(os.getppid()) + " | awk ' { out = \"\"; for(i = 6; i <= NF; i++) out = out$i\" \" } END { print out } ' ").read()
    textarg = readtextarg.split(" ")[1 : -1][0]
    downloaded_apk_default_location = 'Downloads/' + textarg[1:]

How can I get and print self.inputFileName_download in my download.py file? I used sys.argv as answered by @tripleee in my previous post, but it doesn't work as I need.
OK, I changed the last line to:

downloaded_apk_default_location = 'Downloads/' + textarg.split("/")[-1]

to get the text file name.
The shell indirection seems completely superfluous here.

import download

with open(self.inputFileName_download) as apks:
    for line in apks:
        if line == '\n':
            break
        blob = line.rstrip('\n')
        download.something(blob=blob, credentials='./credentials.json')

... where obviously I had to speculate about what the relevant function from download.py might be called.
Python: search for lines starting with a keyword and replace them in a file
I have file1.txt, which has the below contents:

if [ "x${GRUB_DEVICE_UUID}" = "x" ] || [ "x${GRUB_DISABLE_LINUX_UUID}" = "xtrue" ] \
    || ! test -e "/dev/disk/by-uuid/${GRUB_DEVICE_UUID}" \
    || uses_abstraction "${GRUB_DEVICE}" lvm; then
  LINUX_ROOT_DEVICE=${GRUB_DEVICE}
else
  LINUX_ROOT_DEVICE=UUID=${GRUB_DEVICE_UUID}
fi
GRUBFS="`${grub_probe} --device ${GRUB_DEVICE} --target=fs 2>/dev/null || true`"
Linux_CMDLINE="nowatchdog rcupdate.rcu_cpu_stall_suppress=1"

I want to find the line that starts with Linux_CMDLINE=" and replace it with Linux_CMDLINE="". I tried the code below and it is not working. I also suspect it is not the best way to implement this. Is there an easier method to achieve it?

with open('/etc/grub.d/42_sgi', 'r') as f:
    newlines = []
    for line in f.readlines():
        if line.startswith('Linux_CMDLINE=\"'):
            newlines.append("Linux_CMDLINE=\"\"")
        else:
            newlines.append(line)

with open('/etc/grub.d/42_sgi', 'w') as f:
    for line in newlines:
        f.write(line)

Expected output:

if [ "x${GRUB_DEVICE_UUID}" = "x" ] || [ "x${GRUB_DISABLE_LINUX_UUID}" = "xtrue" ] \
    || ! test -e "/dev/disk/by-uuid/${GRUB_DEVICE_UUID}" \
    || uses_abstraction "${GRUB_DEVICE}" lvm; then
  LINUX_ROOT_DEVICE=${GRUB_DEVICE}
else
  LINUX_ROOT_DEVICE=UUID=${GRUB_DEVICE_UUID}
fi
GRUBFS="`${grub_probe} --device ${GRUB_DEVICE} --target=fs 2>/dev/null || true`"
Linux_CMDLINE=""
repl = 'Linux_CMDLINE=""\n'   # include the newline so lines stay separated

with open('/etc/grub.d/42_sgi', 'r') as f:
    newlines = []
    for line in f.readlines():
        if line.startswith('Linux_CMDLINE='):
            line = repl
        newlines.append(line)
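An alternative is to do the whole substitution in one step with re.sub anchored to the start of a line; a hedged sketch, not from the original answer and untested on a real grub config:

import re

with open('/etc/grub.d/42_sgi') as f:
    text = f.read()

# with re.M, ^...$ matches each whole line; the newline itself is untouched
new_text = re.sub(r'^Linux_CMDLINE=".*$', 'Linux_CMDLINE=""', text, flags=re.M)

with open('/etc/grub.d/42_sgi', 'w') as f:
    f.write(new_text)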
Minimal code, thanks to opening the file for both reading and writing:

# Read and write (r+)
with open("file.txt", "r+") as f:
    find = 'Linux_CMDLINE="'
    changeto = 'Linux_CMDLINE=""\n'   # keep the newline when swapping the line
    # walk the file's lines and glue them back together with join
    newstring = ''.join([i if not i.startswith(find) else changeto for i in f])
    f.seek(0)
    f.write(newstring)
    f.truncate()
Compare a regex match from two separate files and replace with values from one of them
I'm not really sure of the best way to do this... I was thinking I might need to do it in Python?

filea.html contains:

data-tx-text="9817db21ccc2d9acc021c4536690b90a_se"

fileb.html contains:

data-tx-text="0850235fcb0e503150c224dad3156312_se"

There are exactly the same number of occurrences of data-tx-text values in filea.html and fileb.html (171). I want to be able to use a regex pattern or a simple Python program to:

Find data-tx-text="(.*?)" in filea.html
Find data-tx-text="(.*?)" in fileb.html
Replace the value from filea.html with the value found in fileb.html
Move to the next occurrence
Continue until the end of the file, or until all values in filea.html match those in fileb.html

I have the basics. For instance, I know the regex pattern that I need, and I am guessing I need to loop this in Python or something similar? Maybe I can do it with sed, but I'm not that good with that, so any help is greatly appreciated.
In awk, you could use something like this:

NR == FNR {
    match($0, /data-tx-text="[^"]+"/);
    if (RSTART > 0) {
        data[++a] = substr($0, RSTART + 14, RLENGTH - 15);
    }
    next;
}

/data-tx-text/ {
    sub(/data-tx-text="[^"]+"/, "data-tx-text=\"" data[++b] "\"");
    print;
}
With GNU awk for the 3rd arg to match():

$ cat tst.awk
match($0,/(.*)(data-tx-text="[^"]+")(.*)/,a) {
    if (NR==FNR) {
        fileb[++bcnt] = a[2]
    }
    else {
        $0 = a[1] fileb[++acnt] a[3]
    }
}
NR>FNR

$ awk -f tst.awk fileb filea
data-tx-text="0850235fcb0e503150c224dad3156312_se"

With other awks you'd use 3 calls to substr() after the match():

$ cat tst.awk
match($0,/data-tx-text="[^"]+"/) {
    if (NR==FNR) {
        fileb[++bcnt] = substr($0,RSTART,RLENGTH)
    }
    else {
        $0 = substr($0,1,RSTART-1) fileb[++acnt] substr($0,RSTART+RLENGTH)
    }
}
NR>FNR

$ awk -f tst.awk fileb filea
data-tx-text="0850235fcb0e503150c224dad3156312_se"
Open filea and find stringa; open fileb and find stringb; replace stringa with stringb, and stringb with stringa; then write the files back. In code, as below:

import re

pattern = 'data-tx-text="(.*?)"'

with open('filea.html', 'r') as f:
    filea = f.read()
with open('fileb.html', 'r') as f:
    fileb = f.read()

stringa = re.search(pattern, filea).group()
stringb = re.search(pattern, fileb).group()

filea = filea.replace(stringa, stringb)
fileb = fileb.replace(stringb, stringa)

with open('filea.html', 'w') as f:
    f.write(filea)
with open('fileb.html', 'w') as f:
    f.write(fileb)
So this is how I have solved it using Python. It's a bit manual in that I have to change the names of filea and fileb each time, but it works. I think I can improve the regex with escapes?

import re
import sys

with open('filea.html') as originalFile:
    originalFileContents = originalFile.read()

pattern = re.compile(r'[0-9a-f]{32}_se')
originalMatches = pattern.findall(originalFileContents)

counter = 0

def replaceId(match):
    global counter
    value = match.group()
    newValue = originalMatches[counter]
    print counter, '=> replacing', value, 'with', newValue
    counter = counter + 1
    return newValue

with open('fileb.html') as targetFile:
    targetFileContents = targetFile.read()
    changedTargetFileContents = pattern.sub(replaceId, targetFileContents)
    print changedTargetFileContents

new_file = open("Output.html", "w")
new_file.write(changedTargetFileContents)
new_file.close()

Available on GitHub: https://github.com/timm088/rehjex-py
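A slightly tidier spin on the same idea replaces the global counter with an iterator, since re.sub calls the replacement function once per match, in document order. A sketch, untested against the real files:

import re

pattern = re.compile(r'[0-9a-f]{32}_se')

with open('filea.html') as f:
    replacements = iter(pattern.findall(f.read()))

with open('fileb.html') as f:
    # each match in fileb is swapped for the next id found in filea
    result = pattern.sub(lambda m: next(replacements), f.read())

with open('Output.html', 'w') as f:
    f.write(result)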
Here's how I'd do it using Beautiful Soup:

from bs4 import BeautifulSoup as bs

replacements, replaced_html = [], ''

with open('fileb.html') as fileb:
    # Extract replacements
    soup = bs(fileb, 'html.parser')
    tags = soup.find_all(lambda tag: tag.get('data-tx-text'))
    replacements = [tag.get('data-tx-text') for tag in tags]

with open('filea.html') as filea:
    # Replace values
    soup = bs(filea, 'html.parser')
    tags = soup.find_all(lambda tag: tag.get('data-tx-text'))
    for tag in tags:
        tag['data-tx-text'] = replacements.pop(0)
    replaced_html = str(soup)

with open('filea.html', 'w') as new_filea:
    # Update file
    new_filea.write(replaced_html)
Extracting the same lines from two files while disregarding lower/uppercase
The aim is to extract the same lines from two files while disregarding lower/uppercase and also disregarding punctuation. I have two files.

source.txt:

Foo bar
blah blah black sheep
Hello World
Kick the, bucket

processed.txt:

foo bar
blah sheep black
Hello world
kick the bucket ,

Desired output (from source.txt):

Foo bar
Hello World
Kick the, bucket

I have been doing it as such:

from string import punctuation

with open('source.txt', 'r') as f1, open('processed.txt', 'r') as f2:
    for i, j in zip(f1, f2):
        lower_depunct_f1 = " ".join("".join([ch.lower() for ch in i if ch not in punctuation]).split())
        lower_depunct_f2 = " ".join("".join([ch.lower() for ch in j if ch not in punctuation]).split())
        if lower_depunct_f1 == lower_depunct_f2:
            print i
        else:
            print

Is there a way to do this with bash tools? perl, shell, awk, sed?
Easier to do this using awk:

awk 'FNR==NR {
    s=toupper($0); gsub(/[[:blank:][:punct:]]+/, "", s); a[s]++; next
}
{
    s=toupper($0); gsub(/[[:blank:][:punct:]]+/, "", s); print (s in a) ? $0 : ""
}' file2 file1
Foo bar

Hello World
Kick the, bucket
The Perl solution is quite similar to the Python one:

open my $S1, '<', 'source.txt' or die $!;
open my $S2, '<', 'processed.txt' or die $!;

while (defined(my $s1 = <$S1>) and defined(my $s2 = <$S2>)) {
    s/[[:punct:]]//g for $s1, $s2;
    $_ = lc for $s1, $s2;
    print $s1 eq $s2 ? $s1 : "\n";
}

Note that the result is different from yours, as the space after "kick the bucket" was not removed.
Bash solution, quite similar to the Perl one, with the same differing result (as the space after "kick the bucket" is not removed):

#!/bin/bash

shopt -s nocasematch

exec 3<> source.txt      # Open source.txt and assign fd 3 to it.
exec 4<> processed.txt

while read <&3 varline && read <&4 varpro
do
    varline_noPunct=`echo $varline | tr -d '[:punct:]'`
    varpro_noPunct=`echo $varpro | tr -d '[:punct:]'`
    [[ $varline_noPunct == $varpro_noPunct ]] && echo "$varline" || echo
done

exec 3>&-   # Close fd 3.
exec 4>&-
Check if this solution helps you:

use strict;
use warnings;

my $f1 = $ARGV[0];
open FILE1, "<", $f1 or die $!;
my $f2 = $ARGV[1];
open FILE2, "<", $f2 or die $!;
open OUTFILE, ">", "cmp.txt" or die $!;

my %seen;
while (<FILE1>) {
    $_ =~ s/[[:punct:]]//isg;
    $seen{lc($_)} = 1;
}
while (<FILE2>) {
    $_ =~ s/[[:punct:]]//isg;
    if ($seen{lc($_)}) {
        print OUTFILE $_;
    }
}
close OUTFILE;
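For comparison, the same lookup-table idea back in Python, normalizing roughly the way the question's code does (a sketch; like the Perl answer, it matches lines anywhere in the other file rather than position by position):

from string import punctuation

def normalize(line):
    # lowercase, drop punctuation, and collapse whitespace into a word tuple
    return tuple(''.join(ch for ch in line.lower() if ch not in punctuation).split())

with open('processed.txt') as f2:
    seen = {normalize(line) for line in f2}

with open('source.txt') as f1:
    for line in f1:
        print(line.rstrip('\n') if normalize(line) in seen else '')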