Split csv file thousands of times based on groupby - python

(An adaptation of David Erickson's question here)
Given a CSV file with columns A, B, and C and some values:
echo 'a,b,c' > file.csv
head -c 10000000 /dev/urandom | od -d | awk 'BEGIN{OFS = ","}{print $2, $3, $4}' | head -n 10000 >> file.csv
We would like to sort by columns a and b:
sort -t ',' -k1,1n -k2,2n file.csv > file_.csv
head -n 3 file_.csv
>a,b,c
3,50240,18792
7,54871,39438
And then for every unique pair (a, b) create a new CSV titled '{a}_Invoice_{b}.csv'.
The main challenge seems to be the I/O overhead of writing thousands of files - I started trying with awk but ran into awk: 17 makes too many open files.
Is there a quicker way to do this, in awk, Python, or some other scripting language?
Additional info:
I know I can do this in Pandas - I'm looking for a faster way using text processing
Though I used urandom to generate the sample data, the real data has runs of recurring values: for example a few rows where a=3, b=7. If so these should be saved as one file. (The idea is to replicate Pandas' groupby -> to_csv)

In python:
import pandas as pd
df = pd.read_csv("file.csv")
for (a, b), gb in df.groupby(['a', 'b']):
gb.to_csv(f"{a}_Invoice_{b}.csv", header=True, index=False)
In awk you can split like so, you will need to put the header back on each resultant file:
awk -F',' '{ out=$1"_Invoice_"$2".csv"; print >> out; close(out) }' file.csv
With adding the header line back:
awk -F',' 'NR==1 { hdr=$0; next } { out=$1"_Invoice_"$2".csv"; if (!seen[out]++) {print hdr > out} print >> out; close(out); }' file.csv
The benefit of this last example is that the input file.csv doesn't need to be sorted and is processed in a single pass.

Since your input is to be sorted on the key fields all you need is:
sort -t ',' -k1,1n -k2,2n file.csv |
awk -F ',' '
NR==1 { hdr=$0; next }
{ out = $1 "_Invoice_" $2 ".csv" }
out != prev {
close(prev)
print hdr > out
prev = out
}
{ print > out }
'

Related

How to comparing two big files on unique strings using fgrep/comm?

I have two files. A file disk.txt contains 57665977 rows and database.txt 39035203 rows;
To test my script I made two example files:
$ cat database.txt
01fffca9-05c8-41a9-8539-8bb2f587cef2
02fffd0d-fbcf-4759-9478-cfd32c987101
03fffd54-8d62-4555-a4ce-370f061048d5
04fffdb6-24f9-4b98-865f-ce32bc44872c
05fffe0c-2b9d-47fa-8ee9-2d20d0b28334
06fffea1-46f2-4aa2-93b9-be627189e38b
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
11ffffaf-fd54-49f3-9719-4a63690430d9
12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29
$ cat disk.txt
01fffca9-05c8-41a9-8539-8bb2f587cef2
02fffd0d-fbcf-4759-9478-cfd32c987101
03fffd54-8d62-4555-a4ce-370f061048d5
04fffdb6-24f9-4b98-865f-ce32bc44872c
05fffe0c-2b9d-47fa-8ee9-2d20d0b28334
06fffea1-46f2-4aa2-93b9-be627189e38b
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
What I try to accomplish is to create files for the differences.
A file with the uniques in disk.txt (so I can delete them from disk)
A file with the uniques in database.txt (So I can retrieve them from backup and restore)
Using comm to retrieve differences
I used comm to see the differences between the two files. Sadly comm also returns the duplicates after some uniques.
$ comm -13 database.txt disk.txt
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
$ comm -13 database.txt disk.txt
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
using comm on one of these large files takes 28,38s. This is really fast but is solely not a solution.
using fgrep to strip duplicates from comm result
I can use fgrep to remove the duplicates from the comm result and this works on the example.
$ fgrep -vf duplicate-plus-uniq-disk.txt duplicate-plus-uniq-database.txt
11ffffaf-fd54-49f3-9719-4a63690430d9
12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29
$ fgrep -vf duplicate-plus-uniq-database.txt duplicate-plus-uniq-disk.txt
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
On the large files this script just crashed after a while. So it is not a viable option to solve my problem.
Using python difflib to get uniques
I tried using this python script I got from BigSpicyPotato's answer on a different post
import difflib
with open(r'disk.txt','r') as masterdata:
with open(r'database.txt','r') as useddata:
with open(r'uniq-disk.txt','w+') as Newdata:
usedfile = [ x.strip('\n') for x in list(useddata) ]
masterfile = [ x.strip('\n') for x in list(masterdata) ]
for line in masterfile:
if line not in usedfile:
Newdata.write(line + '\n')
this also works on the example. Currently this is still running and takes up alot my CPU power.. Looking at the uniq-disk file it is really slow aswell..
Question
Any faster / better option I can try in bash / python? I was aswell looking into awk / sed to maybe parse the the results form comm.
From man comm, * added by me:
Compare **sorted** files FILE1 and FILE2 line by line.
You have to sort the files for comm.
sort database.txt > database_sorted.txt
sort disk.txt > disk_sorted.txt
comm -13 database_sorted.txt disk_sorted.txt
See man sort for various speed and memory enhancing options, like --batch-size, --temporary-directory --buffer-size --parallel.
A file with the uniques in disk.txt
A file with the uniques in database.txt
After sorting, you can implement your python program that compares line-by-line the files and write to mentioned files, just like comm with custom output. Do not store whole files in memory.
You can also do something along this with join or comm --output-delimiter=' ':
join -v1 -v2 -o 1.1,2.1 disk_sorted.txt database_sorted.txt | tee >(
cut -d' ' -f1 | grep -v '^$' > unique_in_disk.txt) |
cut -d' ' -f2 | grep -v '^$' > unique_in_database.txt
comm does exactly what I needed. I had a white space behind line 10 of my disk.txt file. therefor comm returned it as a unique string. Please check #KamilCuk answer for more context about sorting your files and using comm.
# WHINY_USERS=1 isn't trying to insult anyone -
# it's a special shell variable recognized by
# mawk-1 to presort the results
WHINY_USERS=1 {m,g}awk '
function phr(_) {
print \
"\n Uniques for file : { "\
(_)" } \n\n -----------------\n"
}
BEGIN {
split(_,____)
split(_,______)
PROCINFO["sorted_in"] = "#val_num_asc"
FS = "^$"
} FNR==NF {
______[++_____]=FILENAME
} {
if($_ in ____) {
delete ____[$_]
} else {
____[$_]=_____ ":" NR
}
} END {
for(__ in ______) {
phr(______[__])
_____=_<_
for(_ in ____) {
if(+(___=____[_])==+__) {
print " ",++_____,_,
"(line "(substr(___,
index(___,":")+!!__))")"
} } }
printf("\n\n") } ' testfile_disk.txt testfile_database.txt
|
Uniques for file : { testfile_disk.txt }
-----------------
1 07fffeed-5a0b-41f8-86cd-e6d99834c187 (line 7)
2 08ffff24-fb12-488c-87eb-1a07072fc706 (line 8)
3 09ffff29-ba3d-4582-8ce2-80b47ed927d1 (line 9)
Uniques for file : { testfile_database.txt }
-----------------
1 11ffffaf-fd54-49f3-9719-4a63690430d9 (line 18)
2 12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29 (line 19)

Calling a python script from awk

I have a file without about 150k rows, and two columns. I need to to run a a python script on the first field, and save its output as a third column, such that the change looks like this:
Original File:
Col1 Col2
d 2
e 4
f 6
New file:
Col1 Col2 Col3
d 2 output
e 4 output
f 6 output
I'm not able to run the script from inside awk.
cat original.list | awk -F" " ' {`/homes/script.py $1`}'
If I were able to, I would then want to save it as a variable, and print the new variable, plus $1 and $2 to the new file.
thanks in advance (related question here)
the answer to the "related question" you linked (and the one posted in the comments) actually solve your problem,
it just need to be adapted to your specific case.
cat original.list | awk -F" " ' {`/homes/script.py $1`}'
cat is useless here because awk can open and read the file by itself
you don't need -F" " because awk will split fields by spaces by default
backticks `` wont run your script, that's a shell (discouraged)
feature, doesn't work in
awk
we can use command | getline var to execute a command and store its
(first line of) output in a variable. from man awk:
command | getline var
pipes a record from command into var.
using your example file:
$ cat original
Col1 Col2
d 2
e 4
f 6
$
and a dummy script.py:
$ cat script.py
#!/bin/python
print("output")
$
we can do something like this:
$ awk '
NR == 1 { print $0, "Col3" }
NR > 1 { cmd="./script.py " $1; cmd | getline out; close(cmd); print $0, out }
' original
Col1 Col2 Col3
d 2 output
e 4 output
f 6 output
$
the first action runs on the first line of input, adds Col3 to the header and
avoids passing Col1 to the python script.
in the other action, we first build the command concatenating $1 to the
script's path, then we run it and store its first line of output in out
variable (I'm assuming your python script output is just one line). close(cmd) is important because after getline, the pipe reading
from cmd's output would remain open and doing this for many records could lead
to errors like too many open files. at the end we print $0 and cmd's
output.
third's column formatting looks a bit off, you can improve it either from
awk using printf or with an external program like column, e.g:
$ awk '
NR == 1 { print $0, "Col3" }
NR > 1 { cmd="./script.py " $1; cmd | getline out; close(cmd); print $0, out }
' original | column -t
Col1 Col2 Col3
d 2 output
e 4 output
f 6 output
$
lastly, doing all this on a 150k rows file means calling the python script 150k
times etc.., it probably will be a slow task, I think performance could be
improved by doing everything directly in the python script as already
suggested in the comments, but whether or not it is applicable to your specific case, goes
beyond the scope of this question/answer.

Find part of a string in CSV and replace whole cell with new entry?

I've got a CSV file with a column which I want to sift through. I want to use a pattern file to find all entries where the pattern exists even in part of the column's value, and replace the whole cell value with this "pattern".
I made a list of keywords that I want to use as my "pattern" bank;
So, if a cell in this column (this case only second) has this "pattern" as part of its string, then I want to replace the whole cell with this "pattern".
so for example:
my target file:
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis & Private Hire,moreinfo2
id3,Tax Services,moreinfo3
id4,Tools & Hardware,moreinfo4
id5,Tool Sharpening,moreinfo5
id6,Tool Shops,moreinfo6
id7,Video Conferencing,moreinfo7
id8,Video & DVD Shops,moreinfo8
id9,Woodworking Equipment & Supplies,moreinfo9
my "pattern" file:
Taxidermy Equipment & Supplies
Taxis
Tax Services
Tool
Video
Wood
output file:
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis,moreinfo2
id3,Tax Services,moreinfo3
id4,Tool,moreinfo4
id5,Tool,moreinfo5
id6,Tool,moreinfo6
id7,Video,moreinfo7
id8,Video,moreinfo8
id9,Wood,moreinfo9
I came up with the usual "find and replace" sed:
sed -i 's/PATTERN/REPLACE/g' file.csv
but I want it to run on a specific column, so I came up with:
awk 'BEGIN{OFS=FS="|"}$2==PATTERN{$2=REPLACE}{print}' file.csv
but it doesn't work on "part of string" ([Video]:"Video & DVD Shops" -> "Video") and I can't seem to get it how awk takes input as a file for the "Pattern" block.
Is there an awk script for this? Or do I have to write something (in python with the built in csv suit for example?)
In awk, using index. It only prints record if a replacement is made but it's easy to modify to printing even if there is no match (for example replace the print $1,i,$3} with $0=$1 OFS i OFS $3} 1):
$ awk -F, -v OFS=, '
NR==FNR { a[$1]; next } # store "patterns" to a arr
{ for(i in a) # go thru whole a for each record
if(index($2,i)) # if "pattern" matches $2
print $1,i,$3 # print with replacement
}
' pattern_file target_file
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis,moreinfo2
id3,Tax Services,moreinfo3
id4,Tool,moreinfo4
id5,Tool,moreinfo5
id6,Tool,moreinfo6
id7,Video,moreinfo7
id8,Video,moreinfo8
id9,Wood,moreinfo9
Perl solution, using Text::CSV_XS:
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS qw{ csv };
my ($input_file, $pattern_file) = #ARGV;
open my $pfh, '<', $pattern_file or die $!;
chomp( my #patterns = <$pfh> );
my $aoa = csv(in => $input_file);
for my $line (#$aoa) {
for my $pattern (#patterns) {
if (-1 != index $line->[1], $pattern) {
$line->[1] = $pattern;
last
}
}
}
csv(in => $aoa, quote_space => 0, eol => "\n", out => \*STDOUT);
Here's a (mostly) awk solution:
#/bin/bash
patterns_regex=`cat patterns_file | tr '\n' '|'`
cat target_file | awk -F"," -v patterns="$patterns_regex" '
BEGIN {
OFS=",";
split(patterns, patterns_split, "|");
}
{
for (pattern_num in patterns_split) {
pattern=patterns_split[pattern_num];
if (pattern != "" && $2 ~ pattern) {
print $1,pattern,$3
}
}
}'
When you want to solve this with sed, you will need some steps.
For each pattern you will need a command like
sed 's/^\([^,]*\),\(.*Tool.*\),/\1,Tool,/' inputfile
You will need each pattern twice, you can translate the patternfile with
sed 's/.*/"&" "&"/' patternfile
# Change the / into #, thats easier for the final command
sed 's#.*#"&" "&"#' patternfile
When you instruct sed to read a commandfile, you do need to start each line with sed. The commandfile will look like
sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#;s#&#\\&#g' patternfile
You can store this is a file and use the file, but with process substitution you can do things like
cat <(echo "Now this line from echo is handled as a file")
Nice. Lets test the solution
sed -f <(sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#' patternfile) inputfile
Almost there! Only the first output line is strange. Whats happening?
The first pattern has a &, and that has a special meaning.
We can patch our command by adding a backslash in the pattern:
sed -f <(sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#;s#&#\\&#g' patternfile) inputfile

Comparing field#1 in two adjacent lines and printing fields#2 from all the lines where field#1 is identical

Can be considered a variant of this bash question that I looked up before asking this question here.
Following is a sample file:
C=b933cda8ce0/0 p=880080
C=b933cdd6580/0 p=880080
C=b933d02a240/0 p=880080
C=b933d059610/0 p=880080
C=b933d1c8690/0 p=880080
C=b933d2c1b60/0 p=1560315
C=b933d2c1b60/0 p=880080
C=b933d32f240/0 p=1229793
C=b933d32f240/0 p=123412
Output here should be:
C=b933d2c1b60/0 p1=1560315 p2=880080
C=b933d32f240/0 p1=1229793 p2=123412
I need to print out all the values of field#2 against field#1 from all the lines where field#1 matches.
Although I got the job done using the following long one-liner but it doesn't really seem elegant/efficient to me:
d=0; q=0; cat file |while read -r c p; do if [[ $c = $d ]]; then printf "$c\t$p\t$q\n"; fi; d=$c; q=$p; done
Code could be in any of the langs/tools tagged.
awk to the rescue
awk '{
c[$1]++;
sub("p","p"c[$1],$2);
sep=(c[$1]>1)?FS:"";
a[$1]=a[$1] sep $2
}
END {
for(i in a) print i, a[i]
}' file
concatenate second fields based on the key (first field). Suffix p with the index of the parameter (counted in c). There is a formatting trick to not to double space the first and second fields.

Linux/bash/awk read parts of the files name as variables

We wrote an awk one liner to split an input csv file (Assay_51003_target_pairs.csv) into multiple files. For any row if their column 1 is equal to another column column 1, the column 2 is equal to another column 2, etc., these rows will be categorized into a new file. The new file will be named using the column values.
Here is the one liner
awk -F "," 'NF>1 && NR>1 && $1==$1 && $2==$2 && $9==$9 && $10==$10{print $0 >> ("Assay_"$1"_target_"$3"_assay_" $9 "_bcassay_" $10 "_bcalt_assay.csv");close("Assay_"$1"_target_"$3"_assay_" $9 "_bcassay_" $10 "_bcalt_assay.csv")}' Assay_51003_target_pairs.csv
This will generate the following example output (Assay_$1_target_$3_assay_$9_bcassay_$10_bcalt_assay.csv):
Assay_51003_target_1645_assay_7777_bcassay_8888_bcalt_assay.csv
51003,666666,1645,11145,EC50,,0.2,uM,7777,8888,IC50,,1,uM,,3,2.0555,3011-02-0100:00:00,1911-04-1100:00:00,Cell,Biochemical
51003,666666,1645,1680,EC50,<,0.1,uM,7777,8888,IC50,,1,uM,,2,2.8579,3004-06-0300:00:00,3000-04-1100:00:00,Cell,Biochemical
Assay_51003_target_1645_assay_7777_bcassay_9999_bcalt_assay.csv
51003,666666,1645,11145,EC50,,0.2,uM,7777,9999,IC50,,1,uM,,3,2.0555,3011-02-0100:00:00,1911-04-1100:00:00,Cell,Biochemical
51003,666666,1645,1680,EC50,<,0.1,uM,7777,9999,IC50,,1,uM,,2,2.8579,3004-06-0300:00:00,3000-04-1100:00:00,Cell,Biochemical
Assay_51003_target_1688_assay_7777_bcassay_9999_bcalt_assay.csv
51003,666666,1688,11145,EC50,,0.2,uM,7777,9999,IC50,,1,uM,,3,2.0555,3011-02-0100:00:00,1911-04-1100:00:00,Cell,Biochemical
51003,666666,1688,1680,EC50,<,0.1,uM,7777,9999,IC50,,1,uM,,2,2.8579,3004-06-0300:00:00,3000-04-1100:00:00,Cell,Biochemical
Later on we would like to do, for example,
awk -F, -f max_min.awk Assay_51003_target_1645_assay_7777_bcassay_8888_bcalt_assay.csv
awk -F, -f max_min.awk Assay_51003_target_1645_assay_7777_bcassay_9999_bcalt_assay.csv
awk -F, -f max_min.awk Assay_51003_target_1688_assay_7777_bcassay_9999_bcalt_assay.csv
#################################################
for b in 1645 1688
do
for c in 8888 9999
do
awk -F, -f max_min.awk Assay_51003_target_$b_assay_7777_bcassay_$c_bcalt_assay.csv
done
done
However, we don't know if there is any way to write a loop for the followup work because the outfile names are "random". May we know if there is any way for linux/bash to parse part of the file name into loop variables (such as parse 1645 and 1688 into b and 8888 & 9999 into c)?
With Bash it should be pretty much easy granting the values are always numbers:
shopt -s nullglob
FILES=(Assay_*_target_*_assay_*_bcassay_*_bcalt_assay.csv) ## No need to do +([[:digit:]]). The difference is unlikely.
for FILE in "${FILES[#]}"; do
IFS=_ read -a A <<< "$FILE"
# Do something with ${A[1]} ${A[3]} ${A[5]} and ${A[7]}
...
# Or
IFS=_ read __ A __ B __ C __ D __ <<< "$FILE"
# Do something with $A $B $C and $D
...
done
Asking if $1 == $1, etc., is pointless, since it will always be true. The following code is equivalent:
awk -F, '
NF > 1 && NR > 1 {
f = "Assay_" $1 "_target_" $3 "_assay_" $9 \
"_bcassay_" $10 "_bcalt_assay.csv"
print >> f;
close(f)
}' Assay_51003_target_pairs.csv
The reason this works is that the same file is appended to if the fields used in the construction of the filename match. But I wonder if it's an error on your part to be using $3 instead of $2 since you mention $2 in your description.
At any rate, what you are doing seems very odd. If you can give a straightforward description of what you are actually trying to accomplish, there may be a completely different way to do it.

Categories