I have two files: disk.txt contains 57665977 rows and database.txt contains 39035203 rows.
To test my script I made two example files:
$ cat database.txt
01fffca9-05c8-41a9-8539-8bb2f587cef2
02fffd0d-fbcf-4759-9478-cfd32c987101
03fffd54-8d62-4555-a4ce-370f061048d5
04fffdb6-24f9-4b98-865f-ce32bc44872c
05fffe0c-2b9d-47fa-8ee9-2d20d0b28334
06fffea1-46f2-4aa2-93b9-be627189e38b
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
11ffffaf-fd54-49f3-9719-4a63690430d9
12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29
$ cat disk.txt
01fffca9-05c8-41a9-8539-8bb2f587cef2
02fffd0d-fbcf-4759-9478-cfd32c987101
03fffd54-8d62-4555-a4ce-370f061048d5
04fffdb6-24f9-4b98-865f-ce32bc44872c
05fffe0c-2b9d-47fa-8ee9-2d20d0b28334
06fffea1-46f2-4aa2-93b9-be627189e38b
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
What I am trying to accomplish is to create files for the differences:
A file with the uniques in disk.txt (so I can delete them from disk)
A file with the uniques in database.txt (so I can retrieve them from backup and restore)
Using comm to retrieve differences
I used comm to see the differences between the two files. Sadly comm also returns the duplicates after some uniques.
$ comm -13 database.txt disk.txt
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
$ comm -23 database.txt disk.txt
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
11ffffaf-fd54-49f3-9719-4a63690430d9
12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29
Using comm on these large files takes 28,38s. This is really fast, but on its own it is not a solution.
Using fgrep to strip duplicates from the comm result
I can use fgrep to remove the duplicates from the comm result and this works on the example.
$ fgrep -vf duplicate-plus-uniq-disk.txt duplicate-plus-uniq-database.txt
11ffffaf-fd54-49f3-9719-4a63690430d9
12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29
$ fgrep -vf duplicate-plus-uniq-database.txt duplicate-plus-uniq-disk.txt
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
On the large files this approach just crashed after a while, so it is not a viable option to solve my problem.
Using python difflib to get uniques
I tried using this python script I got from BigSpicyPotato's answer on a different post:
import difflib
with open(r'disk.txt','r') as masterdata:
    with open(r'database.txt','r') as useddata:
        with open(r'uniq-disk.txt','w+') as Newdata:
            usedfile = [ x.strip('\n') for x in list(useddata) ]
            masterfile = [ x.strip('\n') for x in list(masterdata) ]
            for line in masterfile:
                if line not in usedfile:
                    Newdata.write(line + '\n')
This also works on the example. Currently it is still running and takes up a lot of my CPU power. Looking at the uniq-disk file, it is really slow as well.
Question
Is there any faster / better option I can try in bash / python? I was also looking into awk / sed to maybe parse the results from comm.
From man comm (** added by me):
Compare **sorted** files FILE1 and FILE2 line by line.
You have to sort the files for comm.
sort database.txt > database_sorted.txt
sort disk.txt > disk_sorted.txt
comm -13 database_sorted.txt disk_sorted.txt
See man sort for various speed and memory enhancing options, like --batch-size, --temporary-directory, --buffer-size and --parallel.
A file with the uniques in disk.txt
A file with the uniques in database.txt
After sorting, you can implement your Python program so that it compares the files line by line and writes to the two files mentioned above, just like comm but with custom output. Do not store the whole files in memory.
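A minimal sketch of that idea in Python might look like this; it assumes both inputs were sorted under the same collation (e.g. LC_ALL=C sort), and the function name and output file names are my own choices:

def merge_diff(disk_path, db_path, only_disk_path, only_db_path):
    # Walk both sorted files in lock-step, like comm, writing uniques to separate files.
    with open(disk_path) as disk, open(db_path) as db, \
         open(only_disk_path, 'w') as only_disk, open(only_db_path, 'w') as only_db:
        a, b = disk.readline(), db.readline()
        while a and b:
            # strip trailing whitespace so a stray space does not create a false unique
            ka, kb = a.rstrip(), b.rstrip()
            if ka == kb:                 # present in both files
                a, b = disk.readline(), db.readline()
            elif ka < kb:                # only in disk
                only_disk.write(ka + '\n')
                a = disk.readline()
            else:                        # only in database
                only_db.write(kb + '\n')
                b = db.readline()
        while a:                         # leftover disk lines are unique
            only_disk.write(a.rstrip() + '\n')
            a = disk.readline()
        while b:                         # leftover database lines are unique
            only_db.write(b.rstrip() + '\n')
            b = db.readline()

merge_diff('disk_sorted.txt', 'database_sorted.txt', 'uniq-disk.txt', 'uniq-database.txt')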
You can also do something along these lines with join or comm --output-delimiter=' ':
join -v1 -v2 -o 1.1,2.1 disk_sorted.txt database_sorted.txt | tee >(
cut -d' ' -f1 | grep -v '^$' > unique_in_disk.txt) |
cut -d' ' -f2 | grep -v '^$' > unique_in_database.txt
comm does exactly what I needed. I had a whitespace character at the end of line 10 of my disk.txt file, therefore comm returned it as a unique string. Please check @KamilCuk's answer for more context about sorting your files and using comm.
# WHINY_USERS=1 isn't trying to insult anyone -
# it's a special shell variable recognized by
# mawk-1 to presort the results
WHINY_USERS=1 {m,g}awk '
function phr(_) {
print \
"\n Uniques for file : { "\
(_)" } \n\n -----------------\n"
}
BEGIN {
split(_,____)
split(_,______)
PROCINFO["sorted_in"] = "#val_num_asc"
FS = "^$"
} FNR==NF {
______[++_____]=FILENAME
} {
if($_ in ____) {
delete ____[$_]
} else {
____[$_]=_____ ":" NR
}
} END {
for(__ in ______) {
phr(______[__])
_____=_<_
for(_ in ____) {
if(+(___=____[_])==+__) {
print " ",++_____,_,
"(line "(substr(___,
index(___,":")+!!__))")"
} } }
printf("\n\n") } ' testfile_disk.txt testfile_database.txt
Uniques for file : { testfile_disk.txt }
-----------------
1 07fffeed-5a0b-41f8-86cd-e6d99834c187 (line 7)
2 08ffff24-fb12-488c-87eb-1a07072fc706 (line 8)
3 09ffff29-ba3d-4582-8ce2-80b47ed927d1 (line 9)
Uniques for file : { testfile_database.txt }
-----------------
1 11ffffaf-fd54-49f3-9719-4a63690430d9 (line 18)
2 12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29 (line 19)
I have a file with about 150k rows and two columns. I need to run a python script on the first field and save its output as a third column, so that the change looks like this:
Original File:
Col1 Col2
d 2
e 4
f 6
New file:
Col1 Col2 Col3
d 2 output
e 4 output
f 6 output
I'm not able to run the script from inside awk.
cat original.list | awk -F" " ' {`/homes/script.py $1`}'
If I were able to, I would then want to save it as a variable, and print the new variable, plus $1 and $2 to the new file.
thanks in advance (related question here)
the answer to the "related question" you linked (and the one posted in the comments) actually solves your problem,
it just needs to be adapted to your specific case.
cat original.list | awk -F" " ' {`/homes/script.py $1`}'
cat is useless here because awk can open and read the file by itself
you don't need -F" " because awk will split fields by spaces by default
backticks `` won't run your script: that's a shell feature (and a discouraged one), it doesn't work in awk
we can use command | getline var to execute a command and store its
(first line of) output in a variable. from man awk:
command | getline var
pipes a record from command into var.
using your example file:
$ cat original
Col1 Col2
d 2
e 4
f 6
$
and a dummy script.py:
$ cat script.py
#!/bin/python
print("output")
$
we can do something like this:
$ awk '
NR == 1 { print $0, "Col3" }
NR > 1 { cmd="./script.py " $1; cmd | getline out; close(cmd); print $0, out }
' original
Col1 Col2 Col3
d 2 output
e 4 output
f 6 output
$
the first action runs on the first line of input, adds Col3 to the header and
avoids passing Col1 to the python script.
in the other action, we first build the command by concatenating $1 to the
script's path, then we run it and store the first line of its output in the out
variable (I'm assuming your python script's output is just one line). close(cmd) is important because after getline, the pipe reading
from cmd's output would remain open, and doing this for many records could lead
to errors like "too many open files". at the end we print $0 and cmd's
output.
the third column's formatting looks a bit off; you can improve it either from
awk using printf or with an external program like column, e.g.:
$ awk '
NR == 1 { print $0, "Col3" }
NR > 1 { cmd="./script.py " $1; cmd | getline out; close(cmd); print $0, out }
' original | column -t
Col1 Col2 Col3
d 2 output
e 4 output
f 6 output
$
lastly, doing all this on a 150k-row file means calling the python script 150k
times, so it will probably be a slow task. I think performance could be
improved by doing everything directly in the python script, as already
suggested in the comments, but whether or not that is applicable to your specific case goes
beyond the scope of this question/answer.
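As a rough sketch of that last idea, assuming the logic of script.py can be wrapped in a function (compute() below is a hypothetical name, not the real script, and new.list is my own output name), the whole task can be done in a single Python process:

def compute(value):
    # placeholder for whatever script.py actually does with its argument
    return "output"

with open('original.list') as src, open('new.list', 'w') as dst:
    header = next(src).split()
    dst.write(' '.join(header + ['Col3']) + '\n')                # extend the header
    for line in src:
        col1, col2 = line.split()[:2]
        dst.write(' '.join([col1, col2, compute(col1)]) + '\n')  # third column from compute()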
Suppose I have a huge text file like below:
19990231
blabla
sssssssssssss
hhhhhhhhhhhhhh
ggggggggggggggg
20090812
blbclg
hhhhhhhhhhhhhh
ggggggggggggggg
hhhhhhhhhhhhhhh
20010221
fgghgg
sssssssssssss
hhhhhhhhhhhhhhh
ggggggggggggggg
<etc>
How can I randomly remove 100 blocks that start with numeric characters and end with a blank line? Eg:
20090812
blbclg
hhhhhhhhhhhhhh
ggggggggggggggg
hhhhhhhhhhhhhhh
<blank line>
This is not that difficult. The trick is to define the records first, and this can be done with the record separator:
RS: The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
So the number of records is given by :
$ NR=$(awk 'BEGIN{RS=""}END{print NR}' <file>)
You can then use shuf to get a hundred random numbers between 1 and NR:
$ shuf -i 1-$NR -n 100
You then feed this command's output back into awk to select the records:
$ awk -v n=100 '(NR==n){RS="";ORS="\n\n"} # reset the RS for reading <file>
       (NR==FNR){a[$1];next}           # load 100 numbers in memory
!(FNR in a) { print } # print records
' <(shuf -i 1-$NR -n 100) <file>
We can also do this in one go, using the Knuth shuffle and doing a double pass of the file:
awk -v n=100 '
# Create n random numbers between 1 and m
function shuffle(m,n, b, i, j, t) {
for (i = m; i > 0; i--) b[i] = i
for (i = m; i > 1; i--) {
# j = random integer from 1 to i
j = int(i * rand()) + 1
# swap b[i], b[j]
t = b[i]; b[i] = b[j]; b[j] = t
}
for (i = n; i > 0; i--) a[b[i]]
}
BEGIN{RS=""; srand()}
(NR==FNR) {next}
(FNR==1) {shuffle(NR-1,n) }
!(FNR in a) { print }' <file> <file>
Using awk and shuf to delete 4 blocks out of 6 blocks where each block is 3 lines long:
$ cat tst.awk
BEGIN { RS=""; ORS="\n\n" }
NR==FNR { next }
FNR==1 {
cmd = sprintf("shuf -i 1-%d -n %d", NR-FNR, numToDel)
oRS=RS; RS="\n"
while ( (cmd | getline line) > 0 ) {
badNrs[line]
}
RS=oRS
close(cmd)
}
!(FNR in badNrs)
$ awk -v numToDel=4 -f tst.awk file file
1
2
3
10
11
12
Just change numToDel=4 to numToDel=100 for your real input.
The input file used to test against above was generated by:
$ seq 18 | awk '1; !(NR%3){print ""}' > file
which produced:
$ cat file
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
here is a solution without shuffle
$ awk -v RS= -v ORS='\n\n' -v n=100 '
BEGIN {srand()}
NR==FNR{next}
FNR==1 {r[0];
while(length(r)<=n) r[int(rand()*NR)]}
!(FNR in r)' file{,}
double pass algorithm: the first round counts the number of records, then a random list of index numbers up to the required value is created, and the records not in the list are printed. Note that if the number of records to delete is close to the total number of records, performance will degrade (the probability of getting a new number becomes low). For your case of 100 out of 600 this will not be a problem. In the opposite case, it would be easier to pick the records to be printed instead of the ones to be deleted.
Since shuf is very fast I don't think this will buy you performance gains, but it is perhaps simpler this way.
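For comparison only, here is a short Python sketch of the same idea; the file names are placeholders, and unlike the two-pass awk it holds the whole file in memory:

import random

with open('file') as f:
    blocks = f.read().split('\n\n')           # records are separated by blank lines
blocks = [b for b in blocks if b.strip()]      # drop empty leftovers

drop = set(random.sample(range(len(blocks)), 100))   # assumes at least 100 blocks
with open('file.out', 'w') as out:
    for i, block in enumerate(blocks):
        if i not in drop:
            out.write(block.rstrip('\n') + '\n\n')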
We wrote an awk one-liner to split an input csv file (Assay_51003_target_pairs.csv) into multiple files. Rows whose column 1 matches another row's column 1, whose column 2 matches that row's column 2, etc., are grouped together into a new file. The new file is named using those column values.
Here is the one-liner:
awk -F "," 'NF>1 && NR>1 && $1==$1 && $2==$2 && $9==$9 && $10==$10{print $0 >> ("Assay_"$1"_target_"$3"_assay_" $9 "_bcassay_" $10 "_bcalt_assay.csv");close("Assay_"$1"_target_"$3"_assay_" $9 "_bcassay_" $10 "_bcalt_assay.csv")}' Assay_51003_target_pairs.csv
This will generate the following example output (Assay_$1_target_$3_assay_$9_bcassay_$10_bcalt_assay.csv):
Assay_51003_target_1645_assay_7777_bcassay_8888_bcalt_assay.csv
51003,666666,1645,11145,EC50,,0.2,uM,7777,8888,IC50,,1,uM,,3,2.0555,3011-02-0100:00:00,1911-04-1100:00:00,Cell,Biochemical
51003,666666,1645,1680,EC50,<,0.1,uM,7777,8888,IC50,,1,uM,,2,2.8579,3004-06-0300:00:00,3000-04-1100:00:00,Cell,Biochemical
Assay_51003_target_1645_assay_7777_bcassay_9999_bcalt_assay.csv
51003,666666,1645,11145,EC50,,0.2,uM,7777,9999,IC50,,1,uM,,3,2.0555,3011-02-0100:00:00,1911-04-1100:00:00,Cell,Biochemical
51003,666666,1645,1680,EC50,<,0.1,uM,7777,9999,IC50,,1,uM,,2,2.8579,3004-06-0300:00:00,3000-04-1100:00:00,Cell,Biochemical
Assay_51003_target_1688_assay_7777_bcassay_9999_bcalt_assay.csv
51003,666666,1688,11145,EC50,,0.2,uM,7777,9999,IC50,,1,uM,,3,2.0555,3011-02-0100:00:00,1911-04-1100:00:00,Cell,Biochemical
51003,666666,1688,1680,EC50,<,0.1,uM,7777,9999,IC50,,1,uM,,2,2.8579,3004-06-0300:00:00,3000-04-1100:00:00,Cell,Biochemical
Later on we would like to do, for example,
awk -F, -f max_min.awk Assay_51003_target_1645_assay_7777_bcassay_8888_bcalt_assay.csv
awk -F, -f max_min.awk Assay_51003_target_1645_assay_7777_bcassay_9999_bcalt_assay.csv
awk -F, -f max_min.awk Assay_51003_target_1688_assay_7777_bcassay_9999_bcalt_assay.csv
#################################################
for b in 1645 1688
do
for c in 8888 9999
do
awk -F, -f max_min.awk Assay_51003_target_$b_assay_7777_bcassay_$c_bcalt_assay.csv
done
done
However, we don't know if there is any way to write a loop for the follow-up work because the output file names are "random". May we know if there is any way for linux/bash to parse part of the file name into loop variables (such as parsing 1645 and 1688 into b, and 8888 & 9999 into c)?
With Bash it should be pretty easy, granted that the values are always numbers:
shopt -s nullglob
FILES=(Assay_*_target_*_assay_*_bcassay_*_bcalt_assay.csv) ## No need to do +([[:digit:]]). The difference is unlikely.
for FILE in "${FILES[@]}"; do
IFS=_ read -a A <<< "$FILE"
# Do something with ${A[1]} ${A[3]} ${A[5]} and ${A[7]}
...
# Or
IFS=_ read __ A __ B __ C __ D __ <<< "$FILE"
# Do something with $A $B $C and $D
...
done
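If it is ever more convenient to drive the follow-up work from Python instead of Bash, a small sketch along the same lines would be (the glob pattern mirrors the generated names; the variable names are my own):

import glob

for path in glob.glob('Assay_*_target_*_assay_*_bcassay_*_bcalt_assay.csv'):
    parts = path.split('_')
    # for Assay_51003_target_1645_assay_7777_bcassay_8888_bcalt_assay.csv this gives
    # parts[1] = '51003', parts[3] = '1645', parts[5] = '7777', parts[7] = '8888'
    b, c = parts[3], parts[7]
    print(path + ' ' + b + ' ' + c)   # or run max_min.awk on path here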
Asking if $1 == $1, etc., is pointless, since it will always be true. The following code is equivalent:
awk -F, '
NF > 1 && NR > 1 {
f = "Assay_" $1 "_target_" $3 "_assay_" $9 \
"_bcassay_" $10 "_bcalt_assay.csv"
print >> f;
close(f)
}' Assay_51003_target_pairs.csv
The reason this works is that the same file is appended to if the fields used in the construction of the filename match. But I wonder if it's an error on your part to be using $3 instead of $2 since you mention $2 in your description.
At any rate, what you are doing seems very odd. If you can give a straightforward description of what you are actually trying to accomplish, there may be a completely different way to do it.
I am new to python and I searched a few articles but did not find the correct syntax to read a file and do awk-style line processing in python. I need your help in solving this problem.
This is how my bash script for build and deploy looks. I read a configuration file in bash which looks like the one below:
backup /apps/backup
oracle /opt/qosmon/qostool/oracle oracle-client-12.1.0.1.0
and the bash section that reads it looks like below:
while read line
do
    case "$line" in */package*) continue ;; esac
    host_file_array+=("$line")
done < ${HOST_FILE}

for ((i=0 ; i < ${#host_file_array[*]}; i++))
do
    # echo "${host_file_array[i]}"
    host_file_line="${host_file_array[i]}"
    if [[ "$host_file_line" != "#"* ]];
    then
        COMPONENT_NAME=$(echo $host_file_line | awk '{print $1;}' )
        DIRECTORY=$(echo $host_file_line | awk '{print $2;}' )
        VERSION=$(echo $host_file_line | awk '{print $3;}' )
        if [[ ("${COMPONENT_NAME}" == *"oracle"*) ]];
        then
            print_parameters "Status ${DIRECTORY}/${COMPONENT_NAME}"
            /bin/bash ${DIRECTORY}/${COMPONENT_NAME}/current/script/manage-oracle.sh ${FORMAT_STRING} start
        fi
etc .........
How can the same be converted to Python? This is what I have prepared so far in python:
f = open ('%s' % host_file,"r")
array = []
line = f.readline()
index = 0
while line:
    line = line.strip("\n ' '")
    line = line.split()
    array.append([])
    for item in line:
        array[index].append(item)
    line = f.readline()
    index += 1
f.close()
I tried with split in python; since the config file does not have an equal number of columns in all rows, I get an index out of bounds error. What is the best way to process it?
I think dictionaries might be a good fit here, you can generate them as follows:
>>> result = []
>>> keys = ["COMPONENT_NAME", "DIRECTORY", "VERSION"]
>>> with open(hosts_file) as f:
... for line in f:
... result.append(dict(zip(keys, line.strip().split())))
...
>>> result
[{'DIRECTORY': '/apps/backup', 'COMPONENT_NAME': 'backup'},
{'DIRECTORY': '/opt/qosmon/qostool/oracle', 'VERSION': 'oracle-client-12.1.0.1.0', 'COMPONENT_NAME': 'oracle'}]
As you see this creates a list of dictionaries. Now when you're accessing the dictionaries, you know that some of them might not contain a 'VERSION' key. There are multiple ways of handling this. Either you try/except KeyError or get the value using dict.get().
Example:
>>> for r in result:
... print r.get('VERSION', "No version")
...
...
No version
oracle-client-12.1.0.1.0
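For completeness, the try/except variant mentioned above gives the same result:

>>> for r in result:
...     try:
...         print r['VERSION']
...     except KeyError:
...         print "No version"
...
No version
oracle-client-12.1.0.1.0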
result = [line.strip().split() for line in open(host_file)]