I'm using perl to convert a comma separated file to a tab separated file with this command:
perl -e ' $sep=","; while(<>) { s/\Q$sep\E/\t/g; print $_; } warn "Changed $sep to tab on $. lines\n" ' csvfile.csv > tabfile.tab
However, my file has additional commas inside specific columns that I do not want to be split on. Here's an example of my file:
ADNP, "descript1, descript2", 1
PTB, "descriptA, descriptB", 5
I only want to convert the commas outside of the quotation marks to tabs, like so:
ADNP descript1, descript2 1
PTB descriptA, descriptB 5
Is there any way to do this with Perl, Python, or bash?
Trivial in Perl, using Text::CSV:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
#configure our read format using the default separator of ","
my $input_csv = Text::CSV->new( { binary => 1 } );
#configure our output format with a tab as separator.
my $output_csv = Text::CSV->new( { binary => 1, sep_char => "\t", eol => "\n" } );
#open input file
open my $input_fh, '<', "sample.csv" or die $!;
#iterate input file - reading in 'comma separated'
#printing out (to stdout -can use filehandle) tab separated.
while ( my $row = $input_csv->getline($input_fh) ) {
    $output_csv->print( \*STDOUT, $row );
}
In python
import csv
with open('input', 'rb') as inf:
    reader = csv.reader(inf)
    with open('output', 'wb') as out:
        writer = csv.writer(out, delimiter='\t')
        writer.writerows(reader)
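The csv approach above can be sketched for Python 3 as follows. This is a minimal sketch, not the original answer's code: the helper name csv_to_tsv is hypothetical, and it assumes the input has a space after each comma (as in the sample), which is why skipinitialspace=True is passed so that the quote following ", " is still recognised as a quote.

```python
import csv
import io


def csv_to_tsv(text):
    """Convert comma-separated text to tab-separated text (hypothetical helper)."""
    # skipinitialspace=True ignores the space after each comma, so a field
    # written as , "a, b" is still parsed as one quoted field.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


sample = 'ADNP, "descript1, descript2", 1\nPTB, "descriptA, descriptB", 5\n'
print(csv_to_tsv(sample), end="")
```

For real files in Python 3, open both files in text mode with newline='' rather than 'rb' and 'wb'.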
You need regular expressions to help you. In Python it would simply be:
>>> import re
>>> re.split(r'(?!\B"[^"]*),(?![^"]*"\B)', 'ADNP, "descript1, descript2", 1')
['ADNP', ' "descript1, descript2"', ' 1']
Building off rll's regex answer, you can turn it into a Perl one-liner like the one you're currently using:
perl -ne 'BEGIN{$,="\t";}@a=split(/(?!\B"[^"]*),(?![^"]*"\B)/);print @a' csvfile.csv > tabfile.tab
This'll work:
perl -e '$sep=","; while(<STDIN>) { @data = split(/(\Q$sep\E?\s*"[^"]+"\s*\Q$sep\E?)/); foreach(@data){if(/"/){s/^\Q$sep\E\s*"//;s/"\s*\Q$sep\E$//;}else{s/\Q$sep\E/\t/g;}}print(join("\t",@data));} warn "Changed $sep to tab on $. lines\n"' < csvfile.csv > tabfile.tab
Putting parentheses in the split pattern returns the captured separators along with the split elements, which effectively puts the quoted strings into their own list elements so they can be treated differently when quotes are detected. You just strip the commas and quotes from the quoted strings and substitute tabs for commas in the other elements, then join the elements with tabs (so that the quoted strings are joined with tabs to the other, already tabbed, strings).
The Text::CSV module is what you're looking for. There are a lot of considerations when parsing CSV files, and you really don't want to handle all of them yourself.
I'm new to Perl and Python.
I need to handle a file in Python (as a dataframe), and its contents then need to be processed in Perl.
At first I tried Python's subprocess module, but it did not work (broken pipe).
I need to send multiple lines from Python, and the Perl code needs to read and process them.
So I just used | on the command line, and that worked, but Perl skips the odd-numbered lines and only reads the even-numbered lines.
How can I fix it?
My Python code is:
import pandas as pd
data = pd.read_csv('./data.txt', sep = '\t', header = None)
datalist = list(data[0] + '_' + data[1])
for line in datalist:
    print(line)
And my Perl code is:
//
use strict;
my %new_list = ();
while (<STDIN>){
    my $line = <STDIN>;
    # print STDERR $line;
    # chomp $line;
    my ($name, $title) = split('_', <STDIN>);
    $new_list{$title} = $name;
    print STDERR $name, "\t", $title, "\n";
}
print STDERR scalar(keys %new_list);
My Python script outputs 657 lines, but the Perl script only reports 329.
How can I fix it?
The expression <STDIN> reads a line from standard input, so your Perl code reads two lines for every iteration of the while loop.
It is sufficient to say
while (<STDIN>) {
    my $line = $_;
    ...
or just
while (my $line = <STDIN>) {
    ...
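The same pitfall can be reproduced in Python, which may make it easier to see why roughly half the lines go missing. This is only an illustrative sketch: the function names and the sample input are made up, and io.StringIO stands in for standard input.

```python
import io


def parse_pairs_buggy(stream):
    """Reads twice per iteration, so every other line is thrown away."""
    pairs = {}
    for _ignored in iter(stream.readline, ""):  # first read: line is discarded
        line = stream.readline()                # second read: the line actually used
        if not line:
            break
        name, title = line.rstrip("\n").split("_", 1)
        pairs[title] = name
    return pairs


def parse_pairs(stream):
    """Reads exactly one line per iteration."""
    pairs = {}
    for line in iter(stream.readline, ""):
        name, title = line.rstrip("\n").split("_", 1)
        pairs[title] = name
    return pairs


text = "a_1\nb_2\nc_3\n"
print(len(parse_pairs_buggy(io.StringIO(text))), len(parse_pairs(io.StringIO(text))))
```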
I've got a CSV file with a column which I want to sift through. I want to use a pattern file to find all entries where the pattern exists even in part of the column's value, and replace the whole cell value with this "pattern".
I made a list of keywords that I want to use as my "pattern" bank;
So, if a cell in this column (this case only second) has this "pattern" as part of its string, then I want to replace the whole cell with this "pattern".
so for example:
my target file:
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis & Private Hire,moreinfo2
id3,Tax Services,moreinfo3
id4,Tools & Hardware,moreinfo4
id5,Tool Sharpening,moreinfo5
id6,Tool Shops,moreinfo6
id7,Video Conferencing,moreinfo7
id8,Video & DVD Shops,moreinfo8
id9,Woodworking Equipment & Supplies,moreinfo9
my "pattern" file:
Taxidermy Equipment & Supplies
Taxis
Tax Services
Tool
Video
Wood
output file:
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis,moreinfo2
id3,Tax Services,moreinfo3
id4,Tool,moreinfo4
id5,Tool,moreinfo5
id6,Tool,moreinfo6
id7,Video,moreinfo7
id8,Video,moreinfo8
id9,Wood,moreinfo9
I came up with the usual "find and replace" sed:
sed -i 's/PATTERN/REPLACE/g' file.csv
but I want it to run on a specific column, so I came up with:
awk 'BEGIN{OFS=FS="|"}$2==PATTERN{$2=REPLACE}{print}' file.csv
but it doesn't match on part of a string ([Video]: "Video & DVD Shops" -> "Video"), and I can't figure out how to make awk read the "pattern" values from a file.
Is there an awk script for this? Or do I have to write something myself (in Python with the built-in csv module, for example)?
In awk, using index(). It only prints a record if a replacement is made, but it's easy to modify it to print records without a match as well (for example, replace the print $1,i,$3} with $0=$1 OFS i OFS $3} 1):
$ awk -F, -v OFS=, '
NR==FNR { a[$1]; next }       # store "patterns" to a arr
{ for(i in a)                 # go thru whole a for each record
    if(index($2,i))           # if "pattern" matches $2
      print $1,i,$3           # print with replacement
}
' pattern_file target_file
id1,Taxidermy Equipment & Supplies,moreinfo1
id2,Taxis,moreinfo2
id3,Tax Services,moreinfo3
id4,Tool,moreinfo4
id5,Tool,moreinfo5
id6,Tool,moreinfo6
id7,Video,moreinfo7
id8,Video,moreinfo8
id9,Wood,moreinfo9
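Since the question also asks about Python's built-in csv module, here is a rough equivalent of the same substring test. It is a sketch under assumptions: replace_column is a hypothetical helper, the data is inlined instead of read from files, and (unlike the awk version above) it passes unmatched rows through unchanged and stops at the first pattern that matches.

```python
import csv
import io


def replace_column(rows, patterns, column=1):
    """Yield rows with rows[column] replaced by the first pattern it contains."""
    for row in rows:
        for pattern in patterns:
            if pattern and pattern in row[column]:
                row = row[:column] + [pattern] + row[column + 1:]
                break
        yield row


patterns = ["Taxidermy Equipment & Supplies", "Taxis", "Tax Services",
            "Tool", "Video", "Wood"]
target = ("id2,Taxis & Private Hire,moreinfo2\n"
          "id4,Tools & Hardware,moreinfo4\n"
          "id8,Video & DVD Shops,moreinfo8\n")

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerows(replace_column(csv.reader(io.StringIO(target)), patterns))
print(out.getvalue(), end="")
```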
Perl solution, using Text::CSV_XS:
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS qw{ csv };
my ($input_file, $pattern_file) = @ARGV;
open my $pfh, '<', $pattern_file or die $!;
chomp( my @patterns = <$pfh> );
my $aoa = csv(in => $input_file);
for my $line (@$aoa) {
    for my $pattern (@patterns) {
        if (-1 != index $line->[1], $pattern) {
            $line->[1] = $pattern;
            last
        }
    }
}
csv(in => $aoa, quote_space => 0, eol => "\n", out => \*STDOUT);
Here's a (mostly) awk solution:
#!/bin/bash
patterns_regex=`cat patterns_file | tr '\n' '|'`
cat target_file | awk -F"," -v patterns="$patterns_regex" '
BEGIN {
    OFS=",";
    split(patterns, patterns_split, "|");
}
{
    for (pattern_num in patterns_split) {
        pattern=patterns_split[pattern_num];
        if (pattern != "" && $2 ~ pattern) {
            print $1,pattern,$3
        }
    }
}'
When you want to solve this with sed, you will need some steps.
For each pattern you will need a command like
sed 's/^\([^,]*\),\(.*Tool.*\),/\1,Tool,/' inputfile
You will need each pattern twice, you can translate the patternfile with
sed 's/.*/"&" "&"/' patternfile
# Change the / into #, that's easier for the final command
sed 's#.*#"&" "&"#' patternfile
When you instruct sed to read a command file, you do not need to start each line with sed. The command file can be generated with
sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#;s#&#\\&#g' patternfile
You can store this in a file and use the file, but with process substitution you can do things like
cat <(echo "Now this line from echo is handled as a file")
Nice. Let's test the solution
sed -f <(sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#' patternfile) inputfile
Almost there! Only the first output line is strange. What's happening?
The first pattern contains an &, which has a special meaning in a sed replacement (it stands for the matched text).
We can patch our command by adding a backslash before each & in the pattern:
sed -f <(sed 's#.*#s/^\\([^,]*\\),\\(.*&.*\\),/\\1,&,/#;s#&#\\&#g' patternfile) inputfile
I'm learning Python and wanted to automate one of my assignments in a cybersecurity class.
I'm trying to figure out how I would look for the contents of a file that are enclosed in a set of parentheses. The contents of the (.txt) file look like:
cow.jpg : jphide[v5](asdfl;kj88876)
fish.jpg : jphide[v5](65498ghjk;0-)
snake.jpg : jphide[v5](poi098*/8!##)
test_practice_0707.jpg : jphide[v5](sJ*=tT#&Ve!2)
test_practice_0101.jpg : jphide[v5](nKFdFX+C!:V9)
test_practice_0808.jpg : jphide[v5](!~rFX3FXszx6)
test_practice_0202.jpg : jphide[v5](X&aC$|mg!wC2)
test_practice_0505.jpg : jphide[v5](pe8f%yC$V6Z3)
dog.jpg : negative
And here is my code so far:
import sys, os, subprocess, glob, shutil
# Finding the .jpg files that will be copied.
sourcepath = os.getcwd() + '\\imgs\\'
destpath = 'stegdetect'
rawjpg = glob.glob(sourcepath + '*.jpg')
# Copying the said .jpg files into the destpath variable
for filename in rawjpg:
    shutil.copy(filename, destpath)
# Asks user for what password file they want to use.
passwords = raw_input("Enter your password file with the .txt extension:")
shutil.copy(passwords, 'stegdetect')
# Navigating to stegdetect. Feel like this could be abstracted.
os.chdir('stegdetect')
# Preparing the arguments then using subprocess to run
args = "stegbreak.exe -r rules.ini -f " + passwords + " -t p *.jpg"
# Uses open to open the output file, and then write the results to the file.
with open('cracks.txt', 'w') as f: # opens cracks.txt and prepares to w
    subprocess.call(args, stdout=f)
# Processing whats in the new file.
f = open('cracks.txt')
If it just has to be bound by ( and ), you can use the following regex, which requires an opening ( and a closing ) and allows letters, digits and spaces between them. You can add any other symbols that you want to allow to the character class.
[\(][a-z A-Z 0-9]*[\)]
[\(] - starts the bracket
[a-z A-Z 0-9]* - all text inside bracket
[\)] - closes the bracket
So for input sdfsdfdsf(sdfdsfsdf)sdfsdfsdf , the output will be (sdfdsfsdf)
Test this regex here: https://regex101.com/
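Because the passwords in the sample contain symbols such as ; and *, listing every allowed character gets tedious. One alternative (an assumption on my part, not part of the answer above) is a negated character class that accepts anything except parentheses. A minimal sketch:

```python
import re

# \( and \) match literal parentheses; [^()]* matches any run of characters
# that are not parentheses, so symbols need not be listed one by one.
PAREN = re.compile(r"\([^()]*\)")

line = "cow.jpg : jphide[v5](asdfl;kj88876)"
match = PAREN.search(line)
found = match.group(0) if match else None
print(found)
```

Note that this variant stops at the first closing parenthesis, so it would not capture a password that itself contains ( or ).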
I'm learning Python
If you are learning you should consider alternative implementations, not only regexps.
To iterate over a text file line by line, you just open the file and loop over the file handle with for:
with open('file.txt') as f:
    for line in f:
        do_something(line)
Each line is a string with the line contents, including the end-of-line character '\n'. To find the start index of a specific substring in a string you can use find:
>>> A = "hello (world)"
>>> A.find('(')
6
>>> A.find(')')
12
To get a substring from the string you can use slice notation. The end index is exclusive, so the character at index 12 is not included:
>>> A[6:12]
'(world'
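Putting find and slicing together, the text between the parentheses can be extracted without any regex. This is a sketch only: between_parens is a hypothetical helper, and using rfind for the closing parenthesis is my own choice so that a ) inside the password does not cut it short.

```python
def between_parens(line):
    """Return the text between the first '(' and the last ')', or None."""
    start = line.find("(")
    end = line.rfind(")")
    if start == -1 or end < start:
        return None  # no parenthesised part on this line
    # start + 1 skips the '(' itself; the exclusive end index drops the ')'
    return line[start + 1:end]


print(between_parens("hello (world)"))
print(between_parens("dog.jpg : negative"))
```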
You should use regular expressions, which are implemented in the Python re module.
A simple regex like \(.*\) could match your "parenthesis string",
but it would be better to use a group, \((.*)\), which lets you get only the content inside the parentheses.
import re
test_string = """cow.jpg : jphide[v5](asdfl;kj88876)
fish.jpg : jphide[v5](65498ghjk;0-)
snake.jpg : jphide[v5](poi098*/8!##)
test_practice_0707.jpg : jphide[v5](sJ*=tT#&Ve!2)
test_practice_0101.jpg : jphide[v5](nKFdFX+C!:V9)
test_practice_0808.jpg : jphide[v5](!~rFX3FXszx6)
test_practice_0202.jpg : jphide[v5](X&aC$|mg!wC2)
test_practice_0505.jpg : jphide[v5](pe8f%yC$V6Z3)
dog.jpg : negative"""
REGEX = re.compile(r'\((.*)\)', re.MULTILINE)
print(REGEX.findall(test_string))
# ['asdfl;kj88876', '65498ghjk;0-', 'poi098*/8!##', 'sJ*=tT#&Ve!2', 'nKFdFX+C!:V9' , '!~rFX3FXszx6', 'X&aC$|mg!wC2', 'pe8f%yC$V6Z3']
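If you also need to know which file each password belongs to, the same idea can be applied line by line with named groups. The sketch below assumes every cracked line has the form "name : tool(password)" as in the sample; parse_cracks and the group names are hypothetical.

```python
import re

# name: non-space run before " : "; then a tool tag without "(" such as
# jphide[v5]; then the password between the outermost parentheses.
LINE = re.compile(r"^(?P<name>\S+) : [^\s(]+\((?P<password>.*)\)\s*$")


def parse_cracks(lines):
    """Map file name -> password for every line that has a cracked password."""
    results = {}
    for line in lines:
        m = LINE.match(line)
        if m:
            results[m.group("name")] = m.group("password")
    return results


sample_lines = ["cow.jpg : jphide[v5](asdfl;kj88876)\n", "dog.jpg : negative\n"]
print(parse_cracks(sample_lines))
```

With a real file you could pass the open file handle directly, for example parse_cracks(open('cracks.txt')).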
I have a string that I want to pass to a python script, e.g.
$ printf "tas\nty\n"
yields
tas
ty
however when I pipe (e.g. printf "tas\nty\n" | ./pumpkin.py) where pumpkin.py is :
#!/usr/bin/python
import sys
data = sys.stdin.readlines()
print data
I get the output
['tas\n', 'ty\n']
How do I prevent the newline character from being read by python?
You can strip all whitespace (at the beginning and at the end of each line) using strip:
data = [s.strip() for s in sys.stdin.readlines()]
If you need to strip only \n in the end you can do:
data = [s.rstrip('\n') for s in sys.stdin.readlines()]
Or use splitlines method:
data = sys.stdin.read().splitlines()
http://www.tutorialspoint.com/python/string_splitlines.htm
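The three options differ only in how they treat other whitespace. A small sketch with a made-up input (io.StringIO stands in for sys.stdin, and the second line has deliberate leading and trailing spaces) shows the difference:

```python
import io

stream_text = "tas\n  ty \n"

# strip() removes all leading and trailing whitespace, not just the newline
stripped = [s.strip() for s in io.StringIO(stream_text).readlines()]
# rstrip('\n') removes only trailing newline characters
rstripped = [s.rstrip("\n") for s in io.StringIO(stream_text).readlines()]
# splitlines() splits on line boundaries and drops them
split = io.StringIO(stream_text).read().splitlines()

print(stripped, rstripped, split)
```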
I have an S19 file looking something like below:
S0030000FC
S30D0003C0000F0000000000000020
S3FD00000000782EFF1FB58E00003D2B00003D2B00003D2B00003D2B00003D2B00003D
S3ED000000F83D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D
S31500000400FFFFFFFFFFFFFFFFFFFFFFFF7EF9FFFF7D
S3FD0000041010B5DFF828000468012147F22C10C4F20300016047F22010C4F2030000
S70500008EB4B8
I want to separate the first two characters and also the next two characters, and so on... I want it to look like below (last two characters are also to be separated for each line):
S0, 03, 0000, FC
S3, 0D, 0003C000, 0F00000000000000, 20
S3, FD, 00000000, 782EFF1FB58E00003D2B00003D2B00003D2B00003D2B00003D2B0000, 3D
S3, ED, 000000F8, 3D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D2B0000, 3D
S3, 15, 00000400, FFFFFFFFFFFFFFFFFFFFFFFF7EF9FFFF, 7D
S3, FD, 00000410, 10B5DFF828000468012147F22C10C4F20300016047F22010C4F20300, 00
S7, 05, 00008EB4, B8
How can I do this in Python?
I have something like this:
#!/usr/bin/python
import string,os,sys,re,fileinput
print "hi"
inputfile = "k60.S19"
outputfile = "k60_out.S19"
# open the source file and read it
fh = file(inputfile, 'r')
subject = fh.read()
fh.close()
# create the pattern object. Note the "r". In case you're unfamiliar with Python
# this is to set the string as raw so we don't have to escape our escape characters
pattern2 = re.compile(r'S3')
pattern3 = re.compile(r'S7')
pattern1 = re.compile(r'S0')
# do the replace
result1 = pattern1.sub("S0, ", subject)
result2 = pattern2.sub("S3, ", subject)
result3 = pattern3.sub("S7, ", subject)
# write the file
f_out = file(outputfile, 'w')
f_out.write(result1)
f_out.write(result2)
f_out.write(result3)
f_out.close()
#EoF
but it is not working the way I want. Can someone help me come up with the proper regular expression for this?
Try the bincopy package; it may be what you need.
bincopy - Interpret strings as packed binary data
Mangling of various file formats that conveys binary information (Motorola S-Record, Intel HEX and binary files).
import bincopy
f = bincopy.BinFile()
f.add_srec_file("path/to/your/s19/file.s19")
f.as_binary() # print s19 as binary
Or you can simply use open() and slice each line:
with open("path/to/your/s19/file.s19") as s19:
    for line in s19:
        line = line.rstrip()  # drop the newline so the last slice is the checksum
        type = line[0:2]
        count = line[2:4]
        address = line[4:12]  # assumes an 8-digit address (S3/S7 records)
        data = line[12:-2]
        crc = line[-2:]
        print(type + ", " + count + ", " + address + ", " + data + ", " + crc)
hope it helps.
Motorola S-record file format
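The address field is not always 8 hex digits: in the S-record format its width depends on the record type (4 digits for S0, S1, S5 and S9, 6 for S2, S6 and S8, 8 for S3 and S7), which is why the expected output for the S0 line has a 4-digit address. A slicing sketch that takes this into account could look like the following; split_srecord and ADDRESS_DIGITS are hypothetical names, and the sketch does no checksum or length validation.

```python
# Number of hex digits in the address field for each S-record type digit
ADDRESS_DIGITS = {"0": 4, "1": 4, "2": 6, "3": 8, "5": 4,
                  "6": 6, "7": 8, "8": 6, "9": 4}


def split_srecord(line):
    """Split one S-record line into type, count, address, data, checksum."""
    line = line.strip()
    record_type, count = line[0:2], line[2:4]
    width = ADDRESS_DIGITS[record_type[1]]
    address = line[4:4 + width]
    data = line[4 + width:-2]   # may be empty, e.g. for S7 records
    checksum = line[-2:]
    fields = [record_type, count, address, data, checksum]
    return ", ".join(field for field in fields if field)


print(split_srecord("S0030000FC"))
print(split_srecord("S70500008EB4B8"))
```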
You can do it using a callback function as replacement with re.sub:
#!/usr/bin/python
import re
data = r'''S0030000FC
S30D0003C0000F0000000000000020
S3FD00000000782EFF1FB58E00003D2B00003D2B00003D2B00003D2B00003D2B00003D
S3ED000000F83D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D2B00003D
S31500000400FFFFFFFFFFFFFFFFFFFFFFFF7EF9FFFF7D
S3FD0000041010B5DFF828000468012147F22C10C4F20300016047F22010C4F2030000
S70500008EB4B8'''
pattern = re.compile(r'^(..)(..)((?:.{4}){1,2})(.*)(?=..)', re.M)
def repl(m):
    repstr = ''
    for g in m.groups():
        if (g):
            repstr += g + ', '
    return repstr
print re.sub(pattern, repl, data)
However, as Mark Setchell notes, there is probably a nice way to do it with slicing.
I know you are thinking Python and regexes, but this was made for awk and the following will maybe help you work out the way to do it using slicing:
awk '{r=length($0);print substr($0,1,2),substr($0,3,2),substr($0,5,8),substr($0,13,r-14),substr($0,r-1)}' OFS=, k60.s19
That says "get the length of the line in variable r, then print the first two characters, the next two characters, the next 8 characters and so on... and use a comma as the field separator".
EDITED
Here are a few more hints to get you started...
if you want to avoid printing line 1, you can do
awk 'FNR==1{next} ...rest of awk script above ... '
If you want to only process lines longer than 40 characters, you can do
awk 'length($0)>40 {print}' yourfile
If you only want to process lines where the second field is "xx", you can do
awk '$2 ~ "xx" {print}' yourfile