Getting (row count - 1) of CSV files - python

Mainly have 2 questions in respect to this topic:
I'm looking to get the row counts of a few CSV files. In Bash, I know I can do wc -l < filename.csv. How do I do this and subtract 1 from it (because of headers)?
For anyone familiar with CSV files and the pitfalls of grabbing a raw line count: how plausible is it that a single row is wrapped across multiple lines? I know it can happen in principle, but I'd like to be able to assume it never does. If it is a real possibility, would Python's csv package be the better tool? Does it read rows based on delimiters and quoting rather than raw lines?

As Barmar points out, (1) it is quite possible for CSV files to have wrapped lines and (2) CSV programming libraries can handle this well. As an example, here is a Python one-liner which uses the csv module to count the number of rows in file.csv minus 1:
python -c 'import csv; print( sum(1 for line in csv.reader(open("file.csv")))-1 )'
The -c option tells python to treat the argument string as a program to execute. The program makes the csv module available with the import statement and then prints the number of rows minus one. The construct sum(1 for line in csv.reader(open("file.csv"))) counts the rows one at a time, so the whole file never has to be held in memory.
If your CSV file has a non-typical format, you will need to set options such as the delimiter or quoting character; see the csv module documentation for details.
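For instance, a sketch of passing non-default options; the semicolon delimiter and single-quote character here are made-up values, only meant to show the keyword arguments:
import csv

# hypothetical dialect: semicolon-delimited fields quoted with single quotes
with open('file.csv', newline='') as f:
    reader = csv.reader(f, delimiter=';', quotechar="'")
    print(sum(1 for row in reader) - 1)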
Example
Consider this test file:
$ cat file.csv
First name,Last name,Address
John,Smith,"P O Box 1234
Somewhere, State"
Jane,Doe,"Unknown"
This file has two rows plus a header. One of the rows is split over two lines. Python's csv module correctly understands this:
$ python -c 'import csv; print( sum(1 for line in csv.reader(open("file.csv")))-1 )'
2
gzipped files
To open gzip files in python, we use the gzip module:
$ python -c 'import csv, gzip; print( sum(1 for line in csv.reader(gzip.open("file.csv.gz", "rt")))-1 )'
2

To get the line count, just subtract 1 from the value returned by wc using an arithmetic expansion:
count=$(($(wc -l < filename.csv) - 1))
CSV format allows fields to contain newlines, by surrounding the field with quotes, e.g.
field1,field2,"field3 broken
across lines",field4
Dealing with this in a plain bash script would be difficult (indeed, any CSV processing that needs to handle quoted fields is tricky). If you need to deal with the full generality of CSV, you should probably use a programming language with a CSV library.
But if you know that your CSV files will never be like this, you can ignore it.

As an alternative to subtracting one from the total row count, you can discard the first line of the file before counting:
row_count=$( { read; wc -l; } < filename.csv )
(This is in no way better than simply using $(($(wc -l < filename.csv) - 1)); it's just a useful trick to know.)
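The same header-skipping idea works in Python: a small sketch that consumes the header row with next() before counting (for the example file.csv above, this prints 2):
import csv

with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader, None)                 # discard the header row
    print(sum(1 for row in reader))    # 2 for the example file above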

Related

tools for appending column(s) to large CSV-file (merging CSV-files by column(s))

To create two csv-files:
echo -e "123\n456" > t0.txt
echo -e '"foo","bar"\n"foo\"bar\"","baz"' > t1.txt
Now, I want to append the columns in t1.txt to t0.txt, so that the result becomes this:
123,"foo","bar"
456,"foo\"bar\"","baz"
First try, using csvtool
csvtool paste t0.txt t1.txt
Fatal error: exception Csv.Failure(2, 1, "Bad '"' in quoted field")
So, csvtool doesn't seem to handle the escaped quotation mark in "foo\"bar\"".
My real-world use case has two CSV-files with more than 150,000,000 rows and 11 columns, so I need a tool which can do the task without holding all the data in RAM at the same time.
Can csvtool be used with escaped quotation marks, or is there another tool that could solve this?
The final target for the CSV-file is a database in mariadb, so an efficient import to mariadb using t0.txt and t1.txt directly would be even better, but as far as I know LOAD DATA INFILE only works on a single CSV-file.
I definitely prefer a ready-made tool, but if there is none, then some C, Perl or Python snippets would be appreciated too.
Here's a quick perl script that reads your broken CSV files, merges them, and outputs properly escaped CSV all in one pass:
#!/usr/bin/env perl
use warnings;
use strict;
use autodie;

# Install through your OS package manager or CPAN client.
# libtext-csv-xs-perl on Debian/Ubuntu and family.
use Text::CSV_XS;

open my $file0, "<", $ARGV[0];
open my $file1, "<", $ARGV[1];

my $csv = Text::CSV_XS->new({ binary => 1, escape_char => "\\",
                              auto_diag => 2, strict => 0 });
my $out = Text::CSV_XS->new({ binary => 1 });

while ((my $row0 = $csv->getline($file0)) &&
       (my $row1 = $csv->getline($file1))) {
    push @$row0, @$row1;
    $out->say(\*STDOUT, $row0);
}
Example:
$ perl mergecsv.pl t0.txt t1.txt
123,foo,bar
456,"foo""bar""",baz
CSV files generally escape double quotes by repetition ("" rather than \"), so your files could be considered invalid.
You could use a find and replace tool, such as sed on Unix, to fix the escaped quotes to this more common format.
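If you would rather stay in Python than use sed, here is a minimal sketch that re-reads the backslash-escaped file and rewrites it with standard doubled quotes (t1_fixed.txt is a made-up output name):
import csv

# read t1.txt treating \" as an escaped quote, then write it back out
# using the standard "" doubling
with open('t1.txt', newline='') as src, open('t1_fixed.txt', 'w', newline='') as dst:
    reader = csv.reader(src, escapechar='\\')
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
    writer.writerows(reader)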
If you're looking for some other command line tool to work with CSV files, I've authored one that's available at https://github.com/pjshumphreys/querycsv

Read a python variable in a shell script?

my python file has these 2 variables:
week_date = "01/03/16-01/09/16"
cust_id = "12345"
How can I read these into a shell script that takes in these 2 variables?
My current shell script requires manual editing of "dt" and "id". I want to read the python variables into the shell script so I can just edit my python parameter file and not so many files.
shell file:
#!/bin/sh
dt="01/03/16-01/09/16"
cust_id="12345"
In a new python file I could just import the parameter python file.
Consider something akin to the following:
#!/bin/bash
# ^^^^ NOT /bin/sh, which doesn't have process substitution available.
python_script='
import sys
d = {}                                  # create a context for variables
exec(open(sys.argv[1], "r").read(), d)  # execute the Python code in that context
for k in sys.argv[2:]:
    sys.stdout.write("%s\0" % str(d[k]).split("\0")[0])  # ...and extract your strings NUL-delimited
'
read_python_vars() {
  local python_file=$1; shift
  local varname
  for varname; do
    IFS= read -r -d '' "${varname#*:}"
  done < <(python -c "$python_script" "$python_file" "${@%%:*}")
}
You might then use this as:
read_python_vars config.py week_date:dt cust_id:id
echo "Customer id is $id; date range is $dt"
...or, if you didn't want to rename the variables as they were read, simply:
read_python_vars config.py week_date cust_id
echo "Customer id is $cust_id; date range is $week_date"
Advantages:
Unlike a naive regex-based solution (which would have trouble with some of the details of Python parsing -- try teaching sed to handle both raw and regular strings, and both single and triple quotes without making it into a hairball!) or a similar approach that used newline-delimited output from the Python subprocess, this will correctly handle any object for which str() gives a representation with no NUL characters that your shell script can use.
Running content through the Python interpreter also means you can determine values programmatically -- for instance, you could have some Python code that asks your version control system for the last-change-date of relevant content.
Think about scenarios such as this one:
start_date = '01/03/16'
end_date = '01/09/16'
week_date = '%s-%s' % (start_date, end_date)
...using a Python interpreter to parse Python means you aren't restricting how people can update/modify your Python config file in the future.
Now, let's talk caveats:
If your Python code has side effects, those side effects will obviously take effect (just as they would if you chose to import the file as a module in Python). Don't use this to extract configuration from a file whose contents you don't trust.
Python strings are Pascal-style: They can contain literal NULs. Strings in shell languages are C-style: They're terminated by the first NUL character. Thus, some values can exist in Python that cannot be represented in shell without nonliteral escaping. To prevent an object whose str() representation contains NULs from spilling forward into other assignments, this code terminates strings at their first NUL.
Now, let's talk about implementation details.
${@%%:*} is an expansion applied to each element of $@ which trims all content after and including the first : in each argument, thus passing only the Python variable names to the interpreter. Similarly, ${varname#*:} is an expansion which trims everything up to and including the first : from the variable name passed to read. See the bash-hackers page on parameter expansion.
Using <(python ...) is process substitution syntax: The <(...) expression evaluates to a filename which, when read, will provide output of that command. Using < <(...) redirects output from that file, and thus that command (the first < is a redirection, whereas the second is part of the <( token that starts a process substitution). Using this form to get output into a while read loop avoids the bug mentioned in BashFAQ #24 ("I set variables in a loop that's in a pipeline. Why do they disappear after the loop terminates? Or, why can't I pipe data to read?").
The IFS= read -r -d '' construct has a series of components, each of which makes the behavior of read more true to the original content:
Clearing IFS for the duration of the command prevents whitespace from being trimmed from the end of the variable's content.
Using -r prevents literal backslashes from being consumed by read itself rather than represented in the output.
Using -d '' sets the first character of the empty string '' to be the record delimiter. Since C strings are NUL-terminated and the shell uses C strings, that character is a NUL. This ensures that variables' content can contain any non-NUL value, including literal newlines.
See BashFAQ #001 ("How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?") for more on the process of reading record-oriented data from a string in bash.
Other answers give a way to do exactly what you ask for, but I think the idea is a bit crazy. There's a simpler way to satisfy both scripts - move those variables into a config file. You can even preserve the simple assignment format.
Create the config itself: (ini-style)
dt="01/03/16-01/09/16"
cust_id="12345"
In python:
config_vars = {}
with open('the/file/path', 'r') as f:
    for line in f:
        if '=' in line:
            k, v = line.split('=', 1)
            config_vars[k] = v.strip().strip('"')  # drop the newline and surrounding quotes
week_date = config_vars['dt']
cust_id = config_vars['cust_id']
In bash:
source "the/file/path"
And you don't need to do crazy source parsing anymore. Alternatively, you can use JSON for the config file and then parse it with the json module in Python and jq in the shell.
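For instance, a minimal sketch of the JSON route, assuming a config.json that holds the same two values:
import json

# config.json is assumed to look like:
# {"week_date": "01/03/16-01/09/16", "cust_id": "12345"}
with open('config.json') as f:
    config = json.load(f)

week_date = config['week_date']
cust_id = config['cust_id']
On the shell side the same file can be read with jq, e.g. week_date=$(jq -r .week_date config.json).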
I would do something like this. You may want to modify it a little to include/exclude quotes, as I haven't really tested it for your scenario:
#!/bin/sh
exec <$python_filename
while read line
do
    match=`echo $line|grep "week_date ="`
    if [ $? -eq 0 ]; then
        dt=`echo $line|cut -d '"' -f 2`
    fi
    match=`echo $line|grep "cust_id ="`
    if [ $? -eq 0 ]; then
        cust_id=`echo $line|cut -d '"' -f 2`
    fi
done

Script to compare a string in two different files

I am brand new to stackoverflow and to scripting. I was looking for help to get started in a script, not necessarily looking for someone to write it.
Here's what I have:
File1.csv - contains some information, I am only interested in MAC addresses.
File2.csv - has some different information, but also contains MAC address.
I need a script that parses the MAC addresses from file1.csv and logs a report if any MAC address shows up in file2.csv.
The questions:
Any tips on the language I use, preferably perl, python or bash?
Can anyone suggest some structure for the logic needed (even if just in pseudo-code)?
update
Using @Adam Wagner's approach, I am really close!
import csv

# Need to strip out NUL values from .csv file to make python happy
class FilteredFile(file):
    def next(self):
        return file.next(self).replace('\x00','').replace('\xff\xfe','')

reader = csv.reader(FilteredFile('wifi_clients.csv', 'rb'), delimiter=',', quotechar='|')
s1 = set(rec[0] for rec in reader)

inventory = csv.reader(FilteredFile('inventory.csv','rb'), delimiter=',')
s2 = set(rec[6] for rec in inventory)

shared_items = s1.intersection(s2)
print shared_items
This always outputs:(even if I doctor the .csv files to have matching MAC addresses)
set([])
Contents of the csv files
wifi_clients.csv
macNames, First time seen, Last time seen,Power, # packets, BSSID, Probed ESSIDs
inventory.csv
Name,Manufacturer,Device Type,Model,Serial Number,IP Address,MAC Address,...
Here's the approach I'd take:
Iterate over each csv file (python has a handy csv module for accomplishing this), capturing the mac-address and placing it in a set (one per file). Once again, python has a great builtin set type; see the csv module documentation for usage examples.
Next, you can get the intersection of set1 (file1) and set2 (file2). This will show you mac-addresses that exist in both files one and two.
Example (in python):
s1 = set([1,2,3]) # You can add things incrementally with "s1.add(value)"
s2 = set([2,3,4])
shared_items = s1.intersection(s2)
print shared_items
Which outputs:
set([2, 3])
Logging these shared items could be done with anything from printing (then redirecting output to a file), to using the logging module, to saving directly to a file.
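For instance, a minimal sketch of the logging-module option (the log filename is a made-up placeholder, and shared_items is the intersection from the example above):
import logging

# send each shared MAC address, one per line, to a log file
logging.basicConfig(filename="shared_macs.log", level=logging.INFO, format="%(message)s")
for mac in shared_items:
    logging.info(mac)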
I'm not sure how in-depth of an answer you were looking for, but this should get you started.
Update: CSV/Set usage example
Assuming you have a file "foo.csv", that looks something like this:
bob,123,127.0.0.1,mac-address-1
fred,124,127.0.0.1,mac-address-2
The simplest way to build the set, would be something like this:
import csv

set1 = set()
for record in csv.reader(open('foo.csv', 'rb')):
    user, machine_id, ip_address, mac_address = record
    set1.add(mac_address)
    # or simply "set1.add(record[3])", if you don't need the other fields.
Obviously, you'd need something like this for each file, so you may want to put this in a function to make life easier.
Finally, if you want to go the less-verbose-but-cooler-python-way, you could also build the set like this:
csvfile = csv.reader(open('foo.csv', 'rb'))
set1 = set(rec[3] for rec in csvfile) # Assuming mac-address is the 4th column.
I strongly recommend python to do this.
'Cause you didn't give the structure of the csv file, I can only show a framework:
def get_MAC_from_file1():
    ...  # parse the file to get MAC
    return a_MAC_list

def get_MAC_from_file2():
    ...  # parse the file to get MAC
    return a_MAC_list

def log_MACs():
    MAC_list1, MAC_list2 = get_MAC_from_file1(), get_MAC_from_file2()
    for a_MAC in MAC_list1:
        if a_MAC in MAC_list2:
            ...  # write your logs
If the data set is large, use a dict or set (and the intersection operation) instead of the list. But as these are MAC addresses, I guess your dataset is not that large, so keeping the script easy to read is the most important thing. A set-based variant of log_MACs is sketched below.
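A minimal sketch of that set-based variant, reusing the framework functions above (the log filename is a made-up placeholder):
def log_MACs(logfile="shared_macs.log"):
    # build sets from the two lists and log their intersection
    macs1 = set(get_MAC_from_file1())
    macs2 = set(get_MAC_from_file2())
    with open(logfile, "w") as out:
        for a_MAC in macs1 & macs2:
            out.write(a_MAC + "\n")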
Awk is perfect for this
{
    mac = $1                                # assuming the mac addresses are in the first column
    do_grep = "grep " mac " otherfilename"  # we'll use grep to check if the mac address is in the other file
    do_grep | getline mac_in_other_file     # pipe the output of the grep command into a new variable
    close(do_grep)                          # close the pipe
    if (mac_in_other_file != "") {          # if grep found the mac address in the other file
        print mac > "naughty_macs.log"      # append the mac address to the log file
    }
}
Then you'd run that on the first file:
awk -f logging_script.awk mac_list.txt
(this code is untested and I'm not the greatest awk hacker, but it should give the general idea)
For the purpose of the example, let's generate 2 files that look like yours.
File1:
for i in `seq 100`; do
    echo -e "user$i\tmachine$i\t192.168.0.$i\tmac$i";
done > file1.csv
File2 (contains random entries of "mac addresses" numbered from 1-200)
for j in `seq 100`; do
    i=$(($RANDOM % 200));
    echo -e "mac$i\tmachine$i\tuser$i";
done > file2.csv
Simplest approach would be to use join command and do a join on the appropriate field. This approach has the advantage that fields from both files would be available in the output.
Based on the example files above, the command would look like this:
join -1 4 -2 1 <(sort -k4 file1.csv) <(sort -k1 file2.csv)
join needs the input to be sorted by the field you are matching, that's why the sort is there (-k tells which column to use)
The command above matches rows from file1.csv with rows from file2.csv if column 4 in the first file is equal with column 1 from the second file.
If you only need specific fields, you can specify the output format to the join command:
join -1 4 -2 1 -o 1.4,1.2 <(sort -k4 file1.csv) <(sort -k1 file2.csv)
This would print only the mac address and the machine field from the first file.
If you only need a list of matching mac addresses, you can use uniq or sort -u. Since the join output will be sorted by mac, uniq is faster. But if you need a unique list of another field, sort -u is better.
If you only need the mac addresses that match, grep can accept patterns from a file, and you can use cut to extract only the fourth field.
fgrep -f<(cut -f4 file1.csv) file2.csv
The above would list all the lines in file2.csv that contain a mac address from file1.
Note that I'm using fgrep, which matches fixed strings rather than patterns. Also, if file1 is big, this may be slower than the first approach, and it assumes that the mac appears only in field 1 of file2 and that the other fields don't contain mac addresses.
If you only need the mac, you can use the -o option of fgrep (though some grep variants don't have it), or you can pipe the output through cut and then sort -u:
fgrep -f<(cut -f4 file1.csv) file2.csv | cut -f1 | sort -u
This would be the bash way.
Python and awk hints have been shown above; I will take a stab at perl:
#!/usr/bin/perl -w
use strict;

open F1, $ARGV[0];
my %searched_mac_addresses = map { chomp; (split /\t/)[3] => 1 } <F1>;
close F1;

open F2, $ARGV[1];
while (<F2>) {
    print if $searched_mac_addresses{(split "\t")[0]};
}
close F2;
First you create a dictionary containing all the mac addresses from the first file:
my %searched_mac_addresses = map {chomp; (split /\t/)[3] => 1 } <F1>;
<F1> reads all the lines from file1
chomp removes the end-of-line character
split splits the line based on tab; you can use a more complex regexp if needed
() around split force an array context
[3] selects the fourth field
map runs a piece of code for all elements of the array
=> generates a dictionary (hash in perl's terminology) element instead of an array
Then you read line by line the second file, and check if the mac exists in the above dictionary:
while (<F2>) {
    print if $searched_mac_addresses{(split "\t")[0]};
}
while (<F2>) will read the file F2 line by line and put each line in the $_ variable
print without any parameters prints the default variable $_
if can be used as a statement modifier after an instruction
dictionary elements can be accessed via {}
split by default splits the $_ default variable

Automating a directory diff while ignoring some particular lines in files

I need to compare two directories, and produce some sort of structured output (text file is fine) of the differences. That is, the output might looks something like this:
file1 exists only in directory2
file2 exists only in directory1
file3 is different between directory1 and directory2
I don't care about the format, so long as the information is there. The second requirement is that I need to be able to ignore certain character sequences when diffing two files. Araxis Merge has this ability: you can type in a Regex and any files whose only difference is in character sequences matching that Regex will be reported as identical.
That would make Araxis Merge a good candidate, BUT, as of yet I have found no way to produce a structured output of the diff. Even when launching consolecompare.exe with command-line arguments, it just opens an Araxis GUI window showing the differences.
So, does either of the following exist?
A way to get Araxis Merge to print a diff result to a text file?
Another utility that can do a diff while ignoring certain character sequences and produce structured output?
Extra credit if such a utility exists as a module or plugin for Python. Please keep in mind this must be done entirely from a command line / python script - no GUIs.
To some extent, the plain old diff command can do just that, i.e. compare directory contents and ignoring changes that match a certain regex pattern (Using the -I option).
From man diff:
-I regexp
Ignore changes that just insert or delete lines that match regexp.
Quick demo:
[me@home]$ diff images/ images2
Only in images2: x
Only in images/: y
diff images/z images2/z
1c1
< zzz
---
> zzzyy2
[me@home]$ # a less verbose version
[me@home]$ diff -q images/ images2
Only in images2: x
Only in images/: y
Files images/z and images2/z differ
[me@home]$ # ignore diffs on lines that contain "zzz"
[me@home]$ diff -q -I ".*zzz.*" images/ images2/
Only in images2/: x
Only in images/: y
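If you want to drive this from Python instead of typing the diff commands by hand (the question's "extra credit"), a minimal sketch that simply wraps the same diff -q -I invocation (the directory names are placeholders):
import subprocess

# recursive, quiet diff that ignores lines containing "zzz";
# "directory1" and "directory2" are placeholder names
result = subprocess.run(
    ["diff", "-r", "-q", "-I", ".*zzz.*", "directory1", "directory2"],
    capture_output=True, text=True,
)
print(result.stdout, end="")   # e.g. "Only in directory2: x" / "Files ... differ"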

File Manipulation: Scripting Question

I have a script which connects to a database and gets all records which satisfy the query. These record results are files present on a server, so now I have a text file which has all the file names in it.
I want a script which can tell me:
What is the size of each file in the output.txt file?
What is the total size of all the files present in that text file?
Update:
I would like to know how can I achieve my task using Perl programming language, any inputs would be highly appreciated.
Note: I do not have any specific language constraint; it could be either a Perl or a Python script that I can run from the Unix prompt. Currently I am using the bash shell and have an sh and a py script. How can this be done?
My scripts:
#!/usr/bin/ksh
export ORACLE_HOME=database specific details
export PATH=$ORACLE_HOME/bin:path information
sqlplus database server information<<EOF
SET HEADING OFF
SET ECHO OFF
SET PAGESIZE 0
SET LINESIZE 1000
SPOOL output.txt
select * from my table_name;
SPOOL OFF
EOF
I know du -h would be the command I should be using, but I am not sure what my script should look like, so I have tried something in Python. I am totally new to Python and this is my first attempt.
Here it is:
import os

folderpath='folder_path'
file=open('output file which has all listing of query result','r')
for line in file:
    filename=line.strip()
    filename=filename.replace(' ', '\ ')
    fullpath=folderpath+filename
    # print (fullpath)
    os.system('du -h '+fullpath)
File names in the output text file for example are like: 007_009_Bond Is Here_009_Yippie.doc
Any guidance would be highly appreciated.
Update:
How can I move all the files listed in the output.txt file to some other folder location using Perl?
After doing step 1, how can I delete all the files listed in the output.txt file?
Any suggestions would be highly appreciated.
In perl, the -s filetest operator is probably what you want.
use strict;
use warnings;
use File::Copy;

my $folderpath = 'the_path';
my $destination = 'path/to/destination/directory';
open my $IN, '<', 'path/to/infile';

my $total;
while (<$IN>) {
    chomp;
    my $size = -s "$folderpath/$_";
    print "$_ => $size\n";
    $total += $size;
    move("$folderpath/$_", "$destination/$_") or die "Error when moving: $!";
}
print "Total => $total\n";
Note that -s gives size in bytes not blocks like du.
On further investigation, perl's -s is equivalent to du -b. You should probably read the man pages on your specific du to make sure that you are actually measuring what you intend to measure.
If you really want the du values, change the assignment to $size above to:
my ($size) = split(' ', `du "$folderpath/$_"`);
Eyeballing, you can make YOUR script work this way:
1) Delete the line filename=filename.replace(' ', '\ '). Escaping is more complicated than that, and you should just quote the full path or use a Python library to escape it based on the specific OS;
2) You are probably missing a delimiter between the path and the file name;
3) You need single quotes around the full path in the call to os.system.
This works for me:
#!/usr/bin/python
import os

folderpath='/Users/andrew/bin'
file=open('ft.txt','r')
for line in file:
    filename=line.strip()
    fullpath=folderpath+"/"+filename
    os.system('du -h '+"'"+fullpath+"'")
The file "ft.txt" has file names with no path and the path part is '/Users/andrew/bin'. Some of the files have names that would need to be escaped, but that is taken care of with the single quotes around the file name.
That will run du -h on each file in the .txt file, but does not give you the total. This is fairly easy in Perl or Python.
Here is a Python script (based on yours) to do that:
#!/usr/bin/python
import os

folderpath='/Users/andrew/bin/testdir'
file=open('/Users/andrew/bin/testdir/ft.txt','r')
blocks=0
i=0
template='%d total files in %d blocks using %d KB\n'
for line in file:
    i+=1
    filename=line.strip()
    fullpath=folderpath+"/"+filename
    if(os.path.exists(fullpath)):
        info=os.stat(fullpath)
        blocks+=info.st_blocks
        print `info.st_blocks`+"\t"+fullpath
    else:
        print '"'+fullpath+"'"+" not found"
print `blocks`+"\tTotal"
print " "+template % (i,blocks,blocks*512/1024)
Notice that you do not have to quote or escape the file name this time; since os.stat is called directly, no shell is involved. This calculates file sizes using allocation blocks, the same way that du does. If I run du -ahc against the same files that I have listed in ft.txt I get roughly the same number (du reports it as 25M and I get 24324 KB), and it reports the same number of blocks. (Side note: "blocks" are assumed to be 512 bytes under Unix even though the actual block size on a larger disk is usually bigger.)
Finally, you may want to consider making your script so that it can read a command line group of files rather than hard coding the file and the path in the script. Consider:
#!/usr/bin/python
import os, sys

total_blocks=0
total_files=0
template='%d total files in %d blocks using %d KB\n'
print
for arg in sys.argv[1:]:
    print "processing: "+arg
    blocks=0
    i=0
    file=open(arg,'r')
    for line in file:
        abspath=os.path.abspath(arg)
        folderpath=os.path.dirname(abspath)
        i+=1
        filename=line.strip()
        fullpath=folderpath+"/"+filename
        if(os.path.exists(fullpath)):
            info=os.stat(fullpath)
            blocks+=info.st_blocks
            print `info.st_blocks`+"\t"+fullpath
        else:
            print '"'+fullpath+"'"+" not found"
    print "\t"+template % (i,blocks,blocks*512/1024)
    total_blocks+=blocks
    total_files+=i
print template % (total_files,total_blocks,total_blocks*512/1024)
You can then execute the script (after chmod +x [script_name].py) with ./script.py ft.txt, and it will use the directory of the listing file given on the command line ("ft.txt") as the assumed path to the files it names. You can process multiple listing files as well.
You can do it in your shell script itself.
You have all the file names in your spooled file output.txt; all you have to add at the end of the existing script is a line that feeds those names to du, for example (GNU xargs):
xargs -d '\n' du -ch < output.txt
The -d '\n' keeps file names containing spaces intact, and -c prints a total at the end along with the size of each file.
You can use the Python skeleton that you've sketched out and add os.path.getsize(fullpath) to get the size of each individual file.
For example, if you wanted a dictionary with the file name and size you could:
dict((f.strip(), os.path.getsize(f.strip())) for f in file)  # strip the trailing newline from each name
Keep in mind that the result from os.path.getsize(...) is in bytes so you'll have to convert it to get other units if you want.
In general os.path is a key module for manipulating files and paths.
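For example, a small sketch that totals the sizes of the files listed in output.txt (the folder path is a placeholder, and each line is stripped of its trailing newline):
import os

folderpath = '/path/to/files'   # placeholder: directory that contains the listed files
total_bytes = 0
with open('output.txt') as listing:
    for line in listing:
        path = os.path.join(folderpath, line.strip())
        if os.path.exists(path):
            total_bytes += os.path.getsize(path)
print('total: %.1f MB' % (total_bytes / (1024.0 * 1024.0)))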
