My text file output looks like this, on two lines:
DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|rca
My desired output should look like this:
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
I'm having no luck trying this:
sed '$!N;s/|/\n/' foo
Any advice would be welcomed, thank you.
Since you have just two lines, this is one way:
$ paste -d' ' <(head -1 file | sed 's/|/\n/g') <(tail -1 file | sed 's/|/\n/g')
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
Step by step: first, take the first line and replace every pipe | with a newline:
$ head -1 file | sed 's/|/\n/g'
DelayTimeThreshold
MaxDelayPerMinute
Name
And do the same with the last line:
$ tail -1 file | sed 's/|/\n/g'
10000
5
rca
Then it is just a matter of pasting both results with a space as delimiter:
paste -d' ' output1 output2
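For comparison, the same pairing idea can be sketched in Python. This is only an illustration, assuming exactly one header line and one value line; the function name transpose_pairs is made up for the example.

```python
def transpose_pairs(header_line, value_line, sep="|"):
    """Pair each column name with its value, one 'name value' string per column."""
    names = header_line.rstrip("\n").split(sep)
    values = value_line.rstrip("\n").split(sep)
    # zip() walks both lists in step, much like paste joins two columns
    return [f"{name} {value}" for name, value in zip(names, values)]


lines = ["DelayTimeThreshold|MaxDelayPerMinute|Name\n", "10000|5|rca\n"]
print("\n".join(transpose_pairs(lines[0], lines[1])))
```

Like paste, zip simply walks both inputs side by side; if the two lines have different numbers of fields, zip stops at the shorter one.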
This awk one-liner should do what you need:
awk -F'|' '!f{gsub(/\||$/," %s\n");f=$0;next}{printf f,$1,$2,$3}' file
output:
kent$ echo "DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|rca"|awk -F'|' '!f{gsub(/\||$/," %s\n");f=$0;next}{printf f,$1,$2,$3}'
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
Using the Array::Transpose module:
perl -MArray::Transpose -F'\|' -lane '
push @a, [@F]
} END {print for map {join " ", @$_} transpose(\@a)
' <<END
DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|rca
END
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
As a perl script:
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my $file1 = ('DelayTimeThreshold|MaxDelayPerMinute|Name');
my $file2 = ('10000|5|rca');
my @file1 = split('\|', $file1);
my @file2 = split('\|', $file2);
my %hash;
@hash{@file1} = @file2;
print Dumper \%hash;
Output:
$VAR1 = {
'Name' => 'rca',
'DelayTimeThreshold' => '10000',
'MaxDelayPerMinute' => '5'
};
OR:
for (my $i = 0; $i <= $#file1; $i++) {
print "$file1[$i] $file2[$i]\n";
}
Output:
DelayTimeThreshold 10000
MaxDelayPerMinute 5
Name rca
Suppose you have a file that contains a single header row with column names, followed by multiple detail rows with column values, for example,
DelayTimeThreshold|MaxDelayPerMinute|Name
10000|5|abc
20001|6|def
30002|7|ghk
40003|8|jkl
50004|9|mnp
The following code would print that file, using the names from the first row, paired with values from each subsequent (detail) row,
#!/bin/perl -w
use strict;
my ($fn,$fh)=("header.csv"); #whatever the file is named...
open($fh,"< $fn") || die "cannot open $fn: $!";
my ($count,$line,@names,@vals)=(0);
while(<$fh>)
{
chomp $_;
@vals=split(/\|/,$_);
if($count++<1) { @names=@vals; next; } #first line is names
for (my $ndx=0; $ndx<=$#names; $ndx++) { #print each
print "$names[$ndx] $vals[$ndx]\n";
}
}
Suppose you want to keep around each row, annotated with names, in an array,
my @records;
while(<$fh>)
{
chomp $_;
@vals=split(/\|/,$_);
if($count++<1) { @names=@vals; next; }
my %row; #fresh hash per row, so each record keeps its own values
@row{@names} = @vals;
push(@records,\%row);
}
Maybe you want to refer to the rows by some key column,
my %records;
while(<$fh>)
{
chomp $_;
@vals=split(/\|/,$_);
if($count++<1) { @names=@vals; next; }
my %row; #fresh hash per row, so each record keeps its own values
@row{@names} = @vals;
$records{$vals[0]}=\%row;
}
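The keyed-records idea can also be sketched in Python with the standard csv module. This is a rough equivalent, not a translation of the Perl above; the helper name records_by_key and the choice of DelayTimeThreshold as the key column are assumptions for the example.

```python
import csv
import io


def records_by_key(lines, key, sep="|"):
    """Map the value in column `key` to a dict holding the whole row."""
    reader = csv.DictReader(lines, delimiter=sep)  # the first line supplies the names
    # DictReader builds a new dict for every row, so records never share storage
    return {row[key]: row for row in reader}


sample = io.StringIO(
    "DelayTimeThreshold|MaxDelayPerMinute|Name\n"
    "10000|5|abc\n"
    "20001|6|def\n"
)
records = records_by_key(sample, key="DelayTimeThreshold")
print(records["20001"]["Name"])
```

If two rows share the same key value, the later row replaces the earlier one, so this only suits a column whose values are unique.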
I have two files: disk.txt contains 57665977 rows and database.txt contains 39035203 rows.
To test my script I made two example files:
$ cat database.txt
01fffca9-05c8-41a9-8539-8bb2f587cef2
02fffd0d-fbcf-4759-9478-cfd32c987101
03fffd54-8d62-4555-a4ce-370f061048d5
04fffdb6-24f9-4b98-865f-ce32bc44872c
05fffe0c-2b9d-47fa-8ee9-2d20d0b28334
06fffea1-46f2-4aa2-93b9-be627189e38b
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
11ffffaf-fd54-49f3-9719-4a63690430d9
12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29
$ cat disk.txt
01fffca9-05c8-41a9-8539-8bb2f587cef2
02fffd0d-fbcf-4759-9478-cfd32c987101
03fffd54-8d62-4555-a4ce-370f061048d5
04fffdb6-24f9-4b98-865f-ce32bc44872c
05fffe0c-2b9d-47fa-8ee9-2d20d0b28334
06fffea1-46f2-4aa2-93b9-be627189e38b
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
What I am trying to accomplish is to create files with the differences:
A file with the uniques in disk.txt (so I can delete them from disk)
A file with the uniques in database.txt (so I can retrieve them from backup and restore them)
Using comm to retrieve differences
I used comm to see the differences between the two files. Sadly comm also returns the duplicates after some uniques.
$ comm -13 database.txt disk.txt
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
Running comm on these large files takes 28.38 s. That is really fast, but on its own it is not a solution.
Using fgrep to strip duplicates from the comm result
I can use fgrep to remove the duplicates from the comm result and this works on the example.
$ fgrep -vf duplicate-plus-uniq-disk.txt duplicate-plus-uniq-database.txt
11ffffaf-fd54-49f3-9719-4a63690430d9
12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29
$ fgrep -vf duplicate-plus-uniq-database.txt duplicate-plus-uniq-disk.txt
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
On the large files this script just crashed after a while. So it is not a viable option to solve my problem.
Using python difflib to get uniques
I tried using this python script I got from BigSpicyPotato's answer on a different post
import difflib
with open(r'disk.txt','r') as masterdata:
    with open(r'database.txt','r') as useddata:
        with open(r'uniq-disk.txt','w+') as Newdata:
            usedfile = [ x.strip('\n') for x in list(useddata) ]
            masterfile = [ x.strip('\n') for x in list(masterdata) ]
            for line in masterfile:
                if line not in usedfile:
                    Newdata.write(line + '\n')
This also works on the example. It is currently still running and takes up a lot of my CPU power. Judging by the uniq-disk.txt file, it is really slow as well.
Question
Is there any faster or better option I can try in bash or Python? I was also looking into awk or sed to maybe parse the results from comm.
From man comm (emphasis added by me):
Compare **sorted** files FILE1 and FILE2 line by line.
You have to sort the files for comm.
sort database.txt > database_sorted.txt
sort disk.txt > disk_sorted.txt
comm -13 database_sorted.txt disk_sorted.txt
See man sort for various speed- and memory-related options, such as --batch-size, --temporary-directory, --buffer-size and --parallel.
A file with the uniques in disk.txt
A file with the uniques in database.txt
After sorting, you can write a Python program that compares the files line by line and writes to the files you mentioned, just like comm but with custom output. Do not store whole files in memory.
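A minimal sketch of such a line-by-line comparison is below. It assumes both inputs were sorted in plain byte order (for example with LC_ALL=C sort), because Python compares strings by code point and not by locale collation. The names split_unique and write_uniques are invented for the example.

```python
def split_unique(left_lines, right_lines):
    """Yield ('left', line) or ('right', line) for lines present in only one sorted input."""
    left_iter, right_iter = iter(left_lines), iter(right_lines)
    left, right = next(left_iter, None), next(right_iter, None)
    while left is not None and right is not None:
        if left == right:            # present in both: skip it
            left, right = next(left_iter, None), next(right_iter, None)
        elif left < right:           # only in the left input
            yield "left", left
            left = next(left_iter, None)
        else:                        # only in the right input
            yield "right", right
            right = next(right_iter, None)
    while left is not None:          # drain whatever remains on the left
        yield "left", left
        left = next(left_iter, None)
    while right is not None:         # drain whatever remains on the right
        yield "right", right
        right = next(right_iter, None)


def write_uniques(disk_path, db_path, disk_out, db_out):
    """Stream two sorted files and write each side's unique lines to its own file."""
    with open(disk_path) as disk, open(db_path) as db, \
            open(disk_out, "w") as only_disk, open(db_out, "w") as only_db:
        disk_lines = (line.rstrip("\n") for line in disk)
        db_lines = (line.rstrip("\n") for line in db)
        for side, line in split_unique(disk_lines, db_lines):
            (only_disk if side == "left" else only_db).write(line + "\n")
```

Only one line per file is held in memory at a time, so the size of the inputs should not matter much. A call might look like write_uniques("disk_sorted.txt", "database_sorted.txt", "unique_in_disk.txt", "unique_in_database.txt").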
You can also do something along these lines with join or comm --output-delimiter=' ':
join -v1 -v2 -o 1.1,2.1 disk_sorted.txt database_sorted.txt | tee >(
cut -d' ' -f1 | grep -v '^$' > unique_in_disk.txt) |
cut -d' ' -f2 | grep -v '^$' > unique_in_database.txt
comm does exactly what I needed. I had a trailing space at the end of line 10 of my disk.txt file, and therefore comm returned it as a unique string. Please check @KamilCuk's answer for more context about sorting your files and using comm.
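Stray trailing whitespace like this is easy to miss by eye. A small, hypothetical helper along these lines could flag such lines before comparing the files (the function name is made up for the example):

```python
def lines_with_trailing_whitespace(lines):
    """Return (line_number, text) pairs whose text ends in a space, tab or carriage return."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if text != text.rstrip():    # something besides the newline was stripped
            flagged.append((number, text))
    return flagged


sample = ["09ffff29-ba3d-4582-8ce2-80b47ed927d1\n", "10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49 \n"]
print(lines_with_trailing_whitespace(sample))
```

Passing an open file object instead of a list works the same way, since the function only iterates over lines.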
# WHINY_USERS=1 isn't trying to insult anyone -
# it's a special shell variable recognized by
# mawk-1 to presort the results
WHINY_USERS=1 {m,g}awk '
function phr(_) {
print \
"\n Uniques for file : { "\
(_)" } \n\n -----------------\n"
}
BEGIN {
split(_,____)
split(_,______)
PROCINFO["sorted_in"] = "@val_num_asc"
FS = "^$"
} FNR==NF {
______[++_____]=FILENAME
} {
if($_ in ____) {
delete ____[$_]
} else {
____[$_]=_____ ":" NR
}
} END {
for(__ in ______) {
phr(______[__])
_____=_<_
for(_ in ____) {
if(+(___=____[_])==+__) {
print " ",++_____,_,
"(line "(substr(___,
index(___,":")+!!__))")"
} } }
printf("\n\n") } ' testfile_disk.txt testfile_database.txt
Uniques for file : { testfile_disk.txt }
-----------------
1 07fffeed-5a0b-41f8-86cd-e6d99834c187 (line 7)
2 08ffff24-fb12-488c-87eb-1a07072fc706 (line 8)
3 09ffff29-ba3d-4582-8ce2-80b47ed927d1 (line 9)
Uniques for file : { testfile_database.txt }
-----------------
1 11ffffaf-fd54-49f3-9719-4a63690430d9 (line 18)
2 12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29 (line 19)
(An adaptation of David Erickson's question here)
Given a CSV file with columns A, B, and C and some values:
echo 'a,b,c' > file.csv
head -c 10000000 /dev/urandom | od -d | awk 'BEGIN{OFS = ","}{print $2, $3, $4}' | head -n 10000 >> file.csv
We would like to sort by columns a and b:
sort -t ',' -k1,1n -k2,2n file.csv > file_.csv
head -n 3 file_.csv
>a,b,c
3,50240,18792
7,54871,39438
And then for every unique pair (a, b) create a new CSV titled '{a}_Invoice_{b}.csv'.
The main challenge seems to be the I/O overhead of writing thousands of files - I started trying with awk but ran into awk: 17 makes too many open files.
Is there a quicker way to do this, in awk, Python, or some other scripting language?
Additional info:
I know I can do this in Pandas - I'm looking for a faster way using text processing
Though I used urandom to generate the sample data, the real data has runs of recurring values: for example a few rows where a=3, b=7. If so these should be saved as one file. (The idea is to replicate Pandas' groupby -> to_csv)
In python:
import pandas as pd
df = pd.read_csv("file.csv")
for (a, b), gb in df.groupby(['a', 'b']):
    gb.to_csv(f"{a}_Invoice_{b}.csv", header=True, index=False)
In awk you can split like so; you will need to put the header back on each resulting file:
awk -F',' '{ out=$1"_Invoice_"$2".csv"; print >> out; close(out) }' file.csv
With adding the header line back:
awk -F',' 'NR==1 { hdr=$0; next } { out=$1"_Invoice_"$2".csv"; if (!seen[out]++) {print hdr > out} print >> out; close(out); }' file.csv
The benefit of this last example is that the input file.csv doesn't need to be sorted and is processed in a single pass.
Since your input is to be sorted on the key fields all you need is:
sort -t ',' -k1,1n -k2,2n file.csv |
awk -F ',' '
NR==1 { hdr=$0; next }
{ out = $1 "_Invoice_" $2 ".csv" }
out != prev {
close(prev)
print hdr > out
prev = out
}
{ print > out }
'
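The same "one output file open at a time" idea can be sketched in plain Python without pandas. This is a sketch under the assumption that the rows are already sorted on the first two columns, so that each (a, b) pair forms one consecutive run; split_by_key is a made-up name.

```python
import csv
import os
import tempfile
from itertools import groupby


def split_by_key(rows, header, out_dir):
    """Write one '{a}_Invoice_{b}.csv' per consecutive run of equal (a, b) values."""
    paths = []
    for (a, b), group in groupby(rows, key=lambda row: (row[0], row[1])):
        path = os.path.join(out_dir, f"{a}_Invoice_{b}.csv")
        # "w" assumes sorted input: an unsorted repeat of a key would overwrite its file
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle, lineterminator="\n")
            writer.writerow(header)
            writer.writerows(group)
        paths.append(path)
    return paths


rows = [["3", "7", "10"], ["3", "7", "11"], ["4", "1", "12"]]
out_dir = tempfile.mkdtemp()
print([os.path.basename(p) for p in split_by_key(rows, ["a", "b", "c"], out_dir)])
```

Because each file is closed as soon as its run ends, the number of simultaneously open files stays at one regardless of how many keys there are. For a real file, rows could come from csv.reader after reading the header with next().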
I have an input file which looks like this:
input.txt
THISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
I have another file with the positions of the letters I want to change and the letter I want to change each one to, such as this:
textpos.txt
Position Text_Change
1 A
2 B
3 X
(Actually there will be about 10,000 alphabet changes)
And I would like one separate output file for each text change, which should look like this:
output1.txt
AHISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Next one:
output2.txt
TBISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Next one:
output3.txt
THXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
I would like to learn how to do this both with an awk command and in a Pythonic way. What would be the best and quickest way to do this?
Thanks in advance.
Could you please try the following (assuming your actual input files have the same kind of data in them). This solution should avoid the "Too many open files" error when running the awk command, since the awk code closes each output file.
awk '
FNR==NR{
a[++count]=$0
next
}
FNR>1{
close(file)
file="output"(FNR-1)".txt"
for(i=1;i<=count;i++){
if($1==1){
print $2 substr(a[i],2) > file
}
else{
print substr(a[i],1,$1-1) $2 substr(a[i],$1+1) > file
}
}
}' input.txt textpos.txt
This creates 3 output files named output1.txt, output2.txt and output3.txt, with the following content.
cat output1.txt
AHISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
cat output2.txt
TBISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
cat output3.txt
THXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Explanation: Adding explanation for above code here.
awk '
FNR==NR{ ##Condition FNR==NR will be TRUE when first file named input.txt is being read.
a[++count]=$0 ##Creating an array named a whose index is increasing value of count and value is current line.
next ##next will skip all further statements from here.
}
FNR>1{ ##This condition will be executed when 2nd Input_file textpos.txt is being read(excluding its header).
close(file) ##Closing file named file whose value will be output file names, getting created further.
file="output"(FNR-1)".txt" ##Creating output file named output FNR-1(line number -1) and .txt in it.
for(i=1;i<=count;i++){ ##Starting a for loop from 1 to till count value.
if($1==1){ ##Checking condition if value of 1st field is 1 then do following.
print $2 substr(a[i],2) > file ##Printing $2 substring of value of a[i] which starts from 2nd position till end of line to output file.
}
else{
print substr(a[i],1,$1-1) $2 substr(a[i],$1+1) > file ##Printing substrings 1st 1 to till value of $1-1 $2 and then substring from $1+1 till end of line.
}
}
}' input.txt textpos.txt ##Mentioning Input_file names here.
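Since the question also asks for a Python approach, here is a minimal sketch of the same substitution using string slicing. It assumes 1-based positions, as in textpos.txt; the function name apply_changes is invented for the example.

```python
def apply_changes(text, changes):
    """Yield one modified copy of text per (position, letter) pair; positions are 1-based."""
    for position, letter in changes:
        index = position - 1         # convert to Python's 0-based indexing
        yield text[:index] + letter + text[index + 1:]


text = "THISISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT"
changes = [(1, "A"), (2, "B"), (3, "X")]
for number, result in enumerate(apply_changes(text, changes), start=1):
    print(f"output{number}.txt -> {result}")
```

To write real files, each result could be written inside a with open(f"output{number}.txt", "w") block, which closes every file before the next one is opened.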
Using gawk:
$ awk 'NR > 1 && FNR == NR { r[$1] = $2; next } {
for (i in r) {
print substr($0, 1, i - 1) r[i] substr($0, i + 1) > "output" i ".txt"
}
}' textpos.txt input.txt
Using awk, abusing FS="" for the second file making each letter a column of its own:
$ awk '
NR==FNR {
a[$1]=$2; next } # hash positions and letters to a
{
for(i in a) # for all positions
$i=a[i] # replace the letters in them
}1' textpos FS="" OFS="" file
ABXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Another using for and substr to build a variable char by char from a[] and $0:
$ awk '
NR==FNR {
a[$1]=$2; next } # hash textpos to a
{
for(i=1;i<=length($1);i++) # for each position in $0
b=b ((i in a)?a[i]:substr($0,i,1)) # get char from a[] or $0, in that order
print b; b="" # output and reset b for next round
}' textpos file
ABXSISANEXAMPLEOFANINPUTFILEWITHALONGSTRINGOFTEXT
Suppose I have a huge text file like below:
19990231
blabla
sssssssssssss
hhhhhhhhhhhhhh
ggggggggggggggg
20090812
blbclg
hhhhhhhhhhhhhh
ggggggggggggggg
hhhhhhhhhhhhhhh
20010221
fgghgg
sssssssssssss
hhhhhhhhhhhhhhh
ggggggggggggggg
<etc>
How can I randomly remove 100 blocks that start with numeric characters and end with a blank line? Eg:
20090812
blbclg
hhhhhhhhhhhhhh
ggggggggggggggg
hhhhhhhhhhhhhhh
<blank line>
This is not that difficult. The trick is to define the records first, which can be done with the record separator:
RS: The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
So the number of records is given by:
$ NR=$(awk 'BEGIN{RS=""}END{print NR}' <file>)
You can then use shuf to get a hundred random numbers between 1 and NR:
$ shuf -i 1-$NR -n 100
You then feed this command into awk again to select the records:
$ awk -v n=100 '(NR==n){RS="";ORS="\n\n"} # reset the RS for reading <file>
(NR==FNR){a[$1];next} # load 100 numbers in memory
!(FNR in a) { print } # print records
' <(shuf -i 1-$NR -n 100) <file>
We can also do this in one go, using the Knuth shuffle and a double pass over the file:
awk -v n=100 '
# Create n random numbers between 1 and m
function shuffle(m,n, b, i, j, t) {
for (i = m; i > 0; i--) b[i] = i
for (i = m; i > 1; i--) {
# j = random integer from 1 to i
j = int(i * rand()) + 1
# swap b[i], b[j]
t = b[i]; b[i] = b[j]; b[j] = t
}
for (i = n; i > 0; i--) a[b[i]]
}
BEGIN{RS=""; srand()}
(NR==FNR) {next}
(FNR==1) {shuffle(NR-1,n) }
!(FNR in a) { print }' <file> <file>
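The double-pass idea can be sketched in Python as well. This is an illustration under a few assumptions: blocks are separated by blank (or whitespace-only) lines, the caller can supply a fresh line iterator twice, and the names iter_blocks and drop_random_blocks are made up. random.sample picks distinct indices, so no shuffle of the whole range is needed.

```python
import random


def iter_blocks(lines):
    """Group lines into blocks separated by one or more blank lines."""
    block = []
    for line in lines:
        if line.strip():
            block.append(line.rstrip("\n"))
        elif block:
            yield block
            block = []
    if block:
        yield block


def drop_random_blocks(make_lines, n, rng=random):
    """Yield every block except n chosen at random; make_lines() returns a fresh line iterator."""
    total = sum(1 for _ in iter_blocks(make_lines()))           # first pass: count blocks
    doomed = set(rng.sample(range(total), min(n, total)))       # distinct block indices
    for index, block in enumerate(iter_blocks(make_lines())):   # second pass: filter
        if index not in doomed:
            yield block


sample = "1\n2\n3\n\n4\n5\n6\n\n7\n8\n9\n".splitlines(keepends=True)
for block in drop_random_blocks(lambda: iter(sample), n=1):
    print("\n".join(block), end="\n\n")
```

For a file on disk, make_lines could be something like lambda: open("file"), at the cost of leaving the closing of those handles to the garbage collector.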
Using awk and shuf to delete 4 blocks out of 6 blocks where each block is 3 lines long:
$ cat tst.awk
BEGIN { RS=""; ORS="\n\n" }
NR==FNR { next }
FNR==1 {
cmd = sprintf("shuf -i 1-%d -n %d", NR-FNR, numToDel)
oRS=RS; RS="\n"
while ( (cmd | getline line) > 0 ) {
badNrs[line]
}
RS=oRS
close(cmd)
}
!(FNR in badNrs)
$ awk -v numToDel=4 -f tst.awk file file
1
2
3
10
11
12
Just change numToDel=4 to numToDel=100 for your real input.
The input file used to test against above was generated by:
$ seq 18 | awk '1; !(NR%3){print ""}' > file
which produced:
$ cat file
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Here is a solution without a shuffle:
$ awk -v RS= -v ORS='\n\n' -v n=100 '
BEGIN {srand()}
NR==FNR{next}
FNR==1 {r[0];
while(length(r)<=n) r[int(rand()*NR)]}
!(FNR in r)' file{,}
This is a two-pass algorithm: the first pass counts the number of records, then a random list of index numbers up to the required count is created, and the second pass prints the records not in the list. Note that if the number of deleted records is close to the total number of records, performance will degrade (the probability of drawing a new number becomes low). For your case, 100 out of 600, this will not be a problem. In the opposite case, it would be easier to pick the records to print instead of the records to delete.
Since shuf is very fast, I don't think this will buy you any performance gain, but perhaps it is simpler this way.
I've got a CSV file with a column which I want to sift through. I want to use a pattern file to find all entries where the pattern exists even in part of the column's value, and replace the whole cell value with this "pattern".
I made a list of keywords that I want to use as my "pattern" bank;
So, if a cell in this column (this case only second) has this "pattern" as part of its string, then I want to replace the whole cell with this "pattern".
so for example:
my target file:
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis & Private Hire,moreinfo2
id3,Tax Services,moreinfo3
id4,Tools & Hardware,moreinfo4
id5,Tool Sharpening,moreinfo5
id6,Tool Shops,moreinfo6
id7,Video Conferencing,moreinfo7
id8,Video & DVD Shops,moreinfo8
id9,Woodworking Equipment & Supplies,moreinfo9
my "pattern" file:
Taxidermy Equipment & Supplies
Taxis
Tax Services
Tool
Video
Wood
output file:
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis,moreinfo2
id3,Tax Services,moreinfo3
id4,Tool,moreinfo4
id5,Tool,moreinfo5
id6,Tool,moreinfo6
id7,Video,moreinfo7
id8,Video,moreinfo8
id9,Wood,moreinfo9
I came up with the usual "find and replace" sed:
sed -i 's/PATTERN/REPLACE/g' file.csv
but I want it to run on a specific column, so I came up with:
awk 'BEGIN{OFS=FS="|"}$2==PATTERN{$2=REPLACE}{print}' file.csv
but it doesn't work on part of a string ([Video]: "Video & DVD Shops" -> "Video"), and I can't figure out how to make awk take its patterns from a file for the "Pattern" block.
Is there an awk script for this? Or do I have to write something (in Python with the built-in csv module, for example)?
In awk, using index. It only prints a record if a replacement is made, but it's easy to modify it to print even when there is no match (for example, replace the print $1,i,$3} with $0=$1 OFS i OFS $3} 1):
$ awk -F, -v OFS=, '
NR==FNR { a[$1]; next } # store "patterns" to a arr
{ for(i in a) # go thru whole a for each record
if(index($2,i)) # if "pattern" matches $2
print $1,i,$3 # print with replacement
}
' pattern_file target_file
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis,moreinfo2
id3,Tax Services,moreinfo3
id4,Tool,moreinfo4
id5,Tool,moreinfo5
id6,Tool,moreinfo6
id7,Video,moreinfo7
id8,Video,moreinfo8
id9,Wood,moreinfo9
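As the question mentions Python's csv module, here is a rough sketch of the same substring test there. It assumes the patterns are tried in file order and the first one found wins; the function name replace_with_pattern is invented for the example. Unlike the awk above, rows without any match are passed through unchanged.

```python
import csv
import io


def replace_with_pattern(rows, patterns, column=1):
    """Yield rows with `column` replaced by the first pattern it contains, if any."""
    for row in rows:
        for pattern in patterns:
            if pattern and pattern in row[column]:   # plain substring test, like index()
                row = row[:column] + [pattern] + row[column + 1:]
                break
        yield row


patterns = ["Taxidermy Equipment & Supplies", "Taxis", "Tax Services", "Tool", "Video", "Wood"]
target = io.StringIO("id2,Taxis & Private Hire,moreinfo2\nid4,Tools & Hardware,moreinfo4\n")
for row in replace_with_pattern(csv.reader(target), patterns):
    print(",".join(row))
```

With real files, the patterns could be read with a list comprehension over the pattern file, and the rows written back with csv.writer so that any quoted fields survive.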
Perl solution, using Text::CSV_XS:
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS qw{ csv };
my ($input_file, $pattern_file) = @ARGV;
open my $pfh, '<', $pattern_file or die $!;
chomp( my @patterns = <$pfh> );
my $aoa = csv(in => $input_file);
for my $line (@$aoa) {
for my $pattern (@patterns) {
if (-1 != index $line->[1], $pattern) {
$line->[1] = $pattern;
last
}
}
}
csv(in => $aoa, quote_space => 0, eol => "\n", out => \*STDOUT);
Here's a (mostly) awk solution:
#!/bin/bash
patterns_regex=`cat patterns_file | tr '\n' '|'`
cat target_file | awk -F"," -v patterns="$patterns_regex" '
BEGIN {
OFS=",";
split(patterns, patterns_split, "|");
}
{
for (pattern_num in patterns_split) {
pattern=patterns_split[pattern_num];
if (pattern != "" && $2 ~ pattern) {
print $1,pattern,$3
}
}
}'
When you want to solve this with sed, you will need some steps.
For each pattern you will need a command like
sed 's/^\([^,]*\),\(.*Tool.*\),/\1,Tool,/' inputfile
You will need each pattern twice, you can translate the patternfile with
sed 's/.*/"&" "&"/' patternfile
# Change the / into #, that's easier for the final command
sed 's#.*#"&" "&"#' patternfile
When you instruct sed to read a command file, you do not need to start each line with sed. The command file can be generated with
sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#;s#&#\\&#g' patternfile
You can store this in a file and use the file, but with process substitution you can do things like
cat <(echo "Now this line from echo is handled as a file")
Nice. Let's test the solution:
sed -f <(sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#' patternfile) inputfile
Almost there! Only the first output line is strange. What's happening?
The first pattern has a &, and that has a special meaning.
We can patch our command by adding a backslash in the pattern:
sed -f <(sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#;s#&#\\&#g' patternfile) inputfile