How can I sort out this data? - python

Screenshot (big version): http://img32.imageshack.us/img32/6649/workspace1001.png
I have this product data in a CSV file, but some of the fields are wrong.
Look at the screenshot. Some of the image fields are like this:
image.jpg#foobar
when they need to be
image.jpg
Not all of them have this problem, but they are all .jpg files.
Is there something I can do in sed or Python/Perl to fix this?

sed -i.bk -e 's/jpg#[^,]*/jpg/g' filename
This edits the file in place (keeping a .bk backup) and replaces "jpg#" plus everything up to the next comma with just "jpg".

So all you want to do is strip the #... from column S, the image column, right?
Perl can do this neatly. It handles quoted columns in the CSV and only updates the column you specify.
use IO::File;
use Text::CSV_XS;

my $in  = IO::File->new( "<old.csv" );
my $out = IO::File->new( ">new.csv" );
my $csv = Text::CSV_XS->new( { binary => 1, eol => "\n" } );

while ( my $rec = $csv->getline($in) ) {
    $rec->[18] =~ s/\#.*$//s;    # column S is the 19th column, index 18
    $csv->print( $out, $rec );
}
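For completeness, here is a minimal Python sketch of the same idea using the csv module (the file names and the column index are assumptions based on the screenshot; column S is the 19th column, so index 18):
import csv

with open("old.csv", newline="") as fin, open("new.csv", "w", newline="") as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for row in reader:
        row[18] = row[18].split("#", 1)[0]  # keep only "image.jpg", drop "#foobar"
        writer.writerow(row)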

Adding text in the middle of the file

I have these files:
actions.js - append before }
import {constants} from "./constants";
export const setUser = (value) => ({
  type: constants.SET_USER,
  payload: value,
});
//here
constants.js - append to the end
export const constants = {
  SET_USER: "SET_USER",
  //here
};
reducers.js - add a const above export and inside the combineReducers object
import {constants} from "./constants";
import {combineReducers} from "redux";
const user = (state = null, action) => action.type === constants.SET_USER ? action.payload : state;
//here
export const reducers = combineReducers({
  user,
  // here
})
And I want to add code into these files at the places where I put //here. How can I do that with Python? I know I can overwrite a file with open('file', 'w').write('string'), but how can I add text without losing what is already in the file? I don't want to create or overwrite the file; I want to keep the old text and add the new text to it. How can I achieve this with Python?
I made it append to actions.js like this:
import sys

reducer = sys.argv[1]
constant = "SET_" + reducer.upper()  # derive the constant name, e.g. "user" -> "SET_USER"

open("actions.js", "a").write("""export const set{reducer} = (value) => ({{
  type: constants.{constant},
  payload: value,
}});
""".format(reducer=reducer.capitalize(), constant=constant))
But I have no idea how to get the others done
Read the file, slice the string at the index you want, concatenate the pieces in order, and then write back to the file with the cursor at 0. Let x.txt be your file. "export" in the index() call here is just a unique, non-repeating word; you can use unique comments to slice the string at the respective positions.
with open("x.txt","r+") as f:
old=f.read()
print(old)
constant_text= "What you want to add??"
result=old[0:old.index("export")] + constant_text + old[old.index("export"):]
# print(result)
f.seek(0)
f.write(result)
print("######################################")
print(result)
Make sure the index keywords are unique if you want to slice in multiple locations using keywords!
To my knowledge, this is not possible in the way you suggest in a single operation. My solution of choice would be to iterate over the file's lines and, once you hit your // here marker, insert the code.
new_content = ""
with open(file_name) as f:
    for line in f.readlines():
        new_content += line
        if line.strip() == "// here":
            new_content += text_to_insert
After this loop, new_content should hold the old text and the new* inserted at the right place, which you can then write to any file you like.
*assuming that your input is properly formatted, including line breaks and so on.
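A minimal end-to-end sketch of that approach, wrapped in a helper that writes the result back to the same file (the marker and the inserted snippet below are just examples; adapt them to the //here comments in your files):
def insert_after_marker(path, marker, text_to_insert):
    new_content = ""
    with open(path) as f:
        for line in f:
            new_content += line
            if line.strip() == marker:
                new_content += text_to_insert
    with open(path, "w") as f:  # rewrite the file: old text plus the insertion
        f.write(new_content)

# e.g. add a new constant right after the //here marker in constants.js
insert_after_marker("constants.js", "//here", 'SET_PROFILE: "SET_PROFILE",\n')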

Collapse rows based on column 1

I want to parse InterProScan results for the TopGO R package.
I would like to have the file in a format a bit different from what I have.
# input file (gene_ID GO_ID1, GO_ID2, GO_ID3, ....)
Q97R95 GO:0004349, GO:0005737, GO:0006561
Q97R95 GO:0004349, GO:0006561
Q97R95 GO:0005737, GO:0006561
Q97R95 GO:0006561
# desired output (removed duplicates and rows collapsed)
Q97R95 GO:0004349,GO:0005737,GO:0006561
You can test your tool with the whole data file here:
https://drive.google.com/file/d/0B8-ZAuZe8jldMHRsbGgtZmVlZVU/view?usp=sharing
You can make use of a 2-D array in gnu awk:
awk -F'[, ]+' '{for(i=2;i<=NF;i++)r[$1][$i]}
END{for(x in r){
printf "%s ",x;b=0;
for(y in r[x]){printf "%s%s",(b?",":""),y;b=1}
print ""}
}' file
It gives:
Q97R95 GO:0005737,GO:0006561,GO:0004349
The duplicated fields are removed; however, the order is not kept.
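If keeping the original order matters, a minimal Python sketch along the same lines (the input file name is a placeholder):
from collections import OrderedDict

merged = OrderedDict()                   # gene -> ordered "set" of GO terms
with open("interproscan.txt") as f:      # placeholder file name
    for line in f:
        gene, _, terms = line.strip().partition(" ")
        bucket = merged.setdefault(gene, OrderedDict())
        for term in terms.split(","):
            term = term.strip()
            if term:
                bucket[term] = None      # keys act as an ordered set

for gene, terms in merged.items():
    print(gene, ",".join(terms))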
Here is a, hopefully tidy, Perl solution. It preserves order of keys and values as far as possible, and doesn't keep the whole file contents in memory, only as much as necessary to do the job.
#!perl
use strict;
use warnings;

my ($prev_key, @seen_values, %seen_values);
$prev_key = '';

while (<>) {
    # Parse the input
    chomp;
    my ($key, $values) = split /\s+/, $_, 2;
    my @values = split /,\s*/, $values;

    # If we have a new key...
    if ($key ne $prev_key) {
        # output the old data, as long as there is some,
        if (@seen_values) {
            print "$prev_key\t", join(", ", @seen_values), "\n";
        }
        # clear it out,
        @seen_values = %seen_values = ();
        # and remember the new key for next time.
        $prev_key = $key;
    }

    # Merge this line's values with previous ones, de-duplicating
    # but preserving order.
    for my $value (@values) {
        push @seen_values, $value unless $seen_values{$value}++;
    }
}

# Output what's left after the last line
if (@seen_values) {
    print "$prev_key\t", join(", ", @seen_values), "\n";
}

NumPy load file interspersed with headers

I'm trying to parse files with repeating blocks of the following format:
ITEM: TIMESTEP
5000
ITEM: NUMBER OF ATOMS
4200
ITEM: BOX BOUNDS pp pp ff
0 47.6892
0 41.3
-11.434 84.1378
ITEM: ATOMS id type z vx
5946 27 11.8569 0.00180946
5948 28 11.1848 -0.0286474
5172 27 12.1796 0.00202046
...
where ... will be NUMBER OF ATOMS entries (4200 for this particular file). Each file contains many of these blocks in succession and will range from 1-5 million lines.
I want to completely ignore all of the header data contained in the first 9 lines of each block and only need an array containing all of the "z" values (3rd column in a data entry) and an array containing the "vx" values (4th column in a data entry).
The headers for each block will always be the same within a file except for the number following the ITEM: TIMESTEP entry. The header format will remain the same across files and the files differ only in the number of entries (atoms).
I wrote some incredibly dirty code that did the trick for some shorter files I was working with previously but it's very slow for these files. I tried using the genfromtxt function but I haven't found a way to bend it to do what I want in this case. Any tips on making this faster?
EDIT:
The following worked for me:
grep -E '^[-.0123456789]+ [-.0123456789]+ [-.0123456789]+ [-.0123456789]'
As did this:
import re
import numpy as np

with open(data, 'r') as fh:
    wrapper = (i for i in fh if re.match(r'^[-.1234567890]+ [-.1234567890]+ [-.1234567890]+ [-.1234567890]', i))
    z_vx = np.genfromtxt(wrapper, usecols=(2, 3))
This ended up being the fastest for my case:
regexp = r'\d+\s+\d+\s+([0-9]*\.?[0-9]+)\s+([-+]?[0-9]*\.?[0-9]+)\s+\n'
data = np.fromregex(file_path, regexp, dtype=[('z', float), ('vx', float)])
If you want speed, you can grep only the relevant lines and then use np.genfromtxt().
Something like this for grep (assuming the relevant rows are the ones with 4 numeric fields):
grep -P '^[-.0123456789]+ [-.0123456789]+ [-.0123456789]+ [-.0123456789]+$'
A more pythonic solution would be to wrap the file handle with a generator like this:
wrapper = (i for i in fh if re.match(r'^[-.1234567890]+ [-.1234567890]+ [-.1234567890]+ [-.1234567890]+$',i))
np.genfromtxt(wrapper,...)
I had a similar problem. I ended up using sed to add a # in front of the header lines and then using np.loadtxt.
So the bash script was
for i in *.data
do
    b=$(basename "$i" .data)
    sed '1,9{/^#/!s/^/#/}' "$i" > "$b.tmp"
    mv "$b.tmp" "$i"
done
and then in Python:
from numpy import loadtxt
data = loadtxt("atoms.data")
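If you'd rather skip the shell preprocessing, you can also walk the file block by block in pure Python, using the NUMBER OF ATOMS count to know how many data rows follow each header (a rough sketch assuming the block layout from the question; the file name is a placeholder):
import numpy as np

z_vals, vx_vals = [], []
with open("dump.txt") as f:                  # placeholder file name
    for line in f:
        if line.startswith("ITEM: NUMBER OF ATOMS"):
            n_atoms = int(next(f))           # the count is on the following line
        elif line.startswith("ITEM: ATOMS"):
            for _ in range(n_atoms):         # read exactly n_atoms data rows
                cols = next(f).split()
                z_vals.append(float(cols[2]))
                vx_vals.append(float(cols[3]))

z = np.array(z_vals)
vx = np.array(vx_vals)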

How can I elegantly combine/concat files by section with python?

Like many an unfortunate programmer soul before me, I am currently dealing with an archaic file format that refuses to die. I'm talking ~1970 format specification archaic. If it were solely up to me, we would throw out both the file format and any tool that ever knew how to handle it, and start from scratch. I can dream, but unfortunately that won't resolve my issue.
The format: pretty loosely defined, as years of nonsensical revisions have destroyed almost all backward compatibility it once had. Basically, the only constant is that there are section headings, with few rules about what comes before or after these lines. The headings are sequential (e.g. HEADING1, HEADING2, HEADING3, ...), but not numbered and not required (e.g. HEADING1, HEADING3, HEADING7). Thankfully, all possible heading permutations are known. Here's a fake example:
# Bunch of comments
SHOES # First heading
# bunch text and numbers here
HATS # Second heading
# bunch of text here
SUNGLASSES # Third heading
...
My problem: I need to concatenate multiple of these files by these section headings. I have a perl script that does this quite nicely:
while (my $l = <>) {
    if    ($l =~ /^SHOES/i)      { $r = \$shoes; name($r); }
    elsif ($l =~ /^HATS/i)       { $r = \$hats;  name($r); }
    elsif ($l =~ /^SUNGLASSES/i) { $r = \$sung;  name($r); }
    elsif ($l =~ /^DRESS/i || $l =~ /^SKIRT/i) { $r = \$dress; name($r); }
    ...
    ...
    elsif ($l =~ /^END/i) { $r = \$end; name($r); }
    else {
        $$r .= $l;
    }
    print STDERR "Finished processing $ARGV\n" if eof;
}
As you can see, with the Perl script I basically just change where a reference points when I get to a certain pattern match, and concatenate each line of the file onto its respective string until I get to the next pattern match. These are then printed out later as one big concatenated file.
I would and could stick with Perl, but my needs are becoming more complex every day and I would really like to see how this problem can be solved elegantly with Python (can it?). As of right now my method in Python is basically to load the entire file as a string, search for the heading locations, then split up the string based on the heading indices and concatenate the strings. This requires a lot of regex, if-statements and variables for something that seems so simple in another language.
It seems that this really boils down to a fundamental language issue. I found a very nice SO discussion about python's "call-by-object" style as compared with that of other languages that are call-by-reference.
How do I pass a variable by reference?
Yet, I still can't think of an elegant way to do this in python. If anyone can help kick my brain in the right direction, it would be greatly appreciated.
That's not even elegant Perl.
my @headers = qw( shoes hats sunglasses dress );
my $header_pat = join "|", map quotemeta, @headers;
my $header_re = qr/$header_pat/i;

my ( $section, %sections );
while (<>) {
    if (/($header_re)/) { name( $section = \$sections{$1} ); }
    elsif (/skirt/i)    { name( $section = \$sections{'dress'} ); }
    else                { $$section .= $_; }

    print STDERR "Finished processing $ARGV\n" if eof;
}
Or if you have many exceptions:
my @headers = qw( shoes hats sunglasses dress );
my %aliases = ( 'skirt' => 'dress' );
my $header_pat = join "|", map quotemeta, @headers, keys(%aliases);
my $header_re = qr/$header_pat/i;

my ( $section, %sections );
while (<>) {
    if (/($header_re)/) {
        name( $section = \$sections{ $aliases{$1} // $1 } );
    } else {
        $$section .= $_;
    }

    print STDERR "Finished processing $ARGV\n" if eof;
}
Using a hash saves the countless my declarations you didn't show.
You could also do $header_name = $1; name(\$sections{$header_name}); and $sections{$header_name} .= $_ for a bit more readability.
I'm not sure if I understand your whole problem, but this seems to do everything you need:
import sys

headers = [None, 'SHOES', 'HATS', 'SUNGLASSES']
sections = [[] for header in headers]

for arg in sys.argv[1:]:
    section_index = 0
    with open(arg) as f:
        for line in f:
            if (section_index + 1 < len(headers)
                    and line.startswith(headers[section_index + 1])):
                section_index += 1
            else:
                sections[section_index].append(line)
Obviously you could change this to read or mmap the whole file, then re.search or just buf.find for the next header. Something like this (untested pseudocode):
import sys
from collections import defaultdict

headers = [None, 'SHOES', 'HATS', 'SUNGLASSES']
sections = defaultdict(list)

for arg in sys.argv[1:]:
    with open(arg) as f:
        buf = f.read()
    section = None
    start = 0
    for header in headers[1:]:
        idx = buf.find('\n' + header, start)
        if idx != -1:
            sections[section].append(buf[start:idx])
            section = header
            start = buf.find('\n', idx + 1)
            if start == -1:
                break
    else:
        sections[section].append(buf[start:])
And there are plenty of other alternatives, too.
But the point is, I can't see anywhere where you'd need to pass a variable by reference in any of those solutions, so I'm not sure where you're stumbling on whichever one you've chosen.
So, what if you want to treat two different headings as the same section?
Easy: create a dict mapping headers to sections. For example, for the second version:
headers_to_sections = {None: None, 'SHOES': 'SHOES', 'HATS': 'HATS',
'DRESSES': 'DRESSES', 'SKIRTS': 'DRESSES'}
Now, in the code that does sections[section], just do sections[headers_to_sections[section]].
For the first, just make this a mapping from strings to indices instead of strings to strings, or replace sections with a dict. Or just flatten the two collections by using a collections.OrderedDict.
My deepest sympathies!
Here's some code (please excuse minor syntax errors)
def foundSectionHeader(l, secHdrs):
    # Return the matching header (so it can be used as a dict key), or None.
    for s in secHdrs:
        if s in l:
            return s
    return None

def main():
    fileList = ['file1.txt', 'file2.txt', ...]
    sectionHeaders = ['SHOES', 'HATS', ...]
    sectionContents = dict()
    for section in sectionHeaders:
        sectionContents[section] = []

    for file in fileList:
        fp = open(file)
        lines = fp.readlines()
        fp.close()
        idx = 0
        while idx < len(lines):
            sec = foundSectionHeader(lines[idx], sectionHeaders)
            if sec:
                idx += 1
                while idx < len(lines) and not foundSectionHeader(lines[idx], sectionHeaders):
                    sectionContents[sec].append(lines[idx])
                    idx += 1
            else:
                idx += 1
This assumes that you don't have content lines which look like "SHOES"/"HATS" etc.
Assuming you're reading from stdin, as in the perl script, this should do it:
import sys
import collections

headings = {'SHOES': 'SHOES', 'HATS': 'HATS', 'DRESS': 'DRESS', 'SKIRT': 'DRESS'}  # etc...
sections = collections.defaultdict(str)
key = None
for line in sys.stdin:
    sline = line.strip()
    if sline not in headings:
        sections[headings.get(key)] += line
    else:
        key = sline
You'll end up with a dictionary like this:
{
    None: <all lines, as a single string, before any heading>,
    'HATS': <all lines, as a single string, below the HATS heading and before the next heading>,
    etc...
}
The headings dict does not have to be defined in the same order as the headings appear in the input.
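To write the merged result back out as one big file, you can then just dump the sections in whatever heading order you want, for example (the heading order here is only illustrative):
for heading in [None, 'SHOES', 'HATS', 'DRESS']:
    if heading in sections:
        if heading is not None:
            print(heading)
        print(sections[heading], end="")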

Search/Replace/Delete Jekyll YAML Front Matter Category Tags

I've inherited a Jekyll website and I'm coming from a .NET world so it's a learning curve for me.
This Jekyll site takes forever to build, and I think it is because there are literally thousands of category tags that require those pages to be removed. I'm able to get a list of all the categories and have created a CSV that I'd like to loop through to figure out whether a category tag is still needed. The structure of the CSV is:
old_tag,new_tag
Clearly I'd like to update the tags based on those mappings (e.g. make all C#, C-Sharp, C # and C Sharp categories just C-Sharp). But I'd also like to delete some, where the old tag field exists and the new one is blank:
old_tag,new_tag
C#, C-Sharp
C Sharp, C-Sharp
Crazy,
C #, C-Sharp
Using Ruby or Python I'd like to figure out how to loop through over 4000 markdown files and use the CSV to conditionally update each one. The database person in me just can't think how this would work with flat files.
I'd recommend starting with a Hash, using it like a translation table. Hash lookups are very fast, and can organize your tags and their replacements nicely.
hash = {
# old_tag => new_tag
'C#' => 'C-Sharp',
'C Sharp' => 'C-Sharp',
'Crazy' => '',
'C #' => 'C-Sharp',
}
You can see there's a lot of redundancy in the values, which could be fixed by reversing the hash, which reduces it nicely:
hash = {
# new_tag => old_tag
'C-Sharp' => ['C#', 'C Sharp', 'C #'],
}
'Crazy' is an outlier, but we will deal with that.
Ruby's String#gsub has a nice but little-used feature: we can pass it a regular expression and a hash, and it'll replace all regex matches with the corresponding value in the hash. We can build that regex easily:
regex = /(?:#{ Regexp.union(hash.keys).source })/
=> /(?:C\-Sharp)/
Now, you're probably saying, "but wait, I have a lot more tags to find!", and, because of the way the hash is built, they're hidden in the values. To remedy that, we'll reverse the hash's keys and values, exploding the value arrays into their individual elements:
reversed_hash = Hash[hash.flat_map{ |k,v| v.map{ |i| [i,k] } }]
=> {
"C#" => "C-Sharp",
"C Sharp" => "C-Sharp",
"C #" => "C-Sharp",
}
Adding in 'Crazy' is easy, by merging a second hash of the "special cases":
special_cases = {
'Crazy' => ''
}
reversed_hash = Hash[hash.flat_map{ |k,v| v.map{ |i| [i,k] } }].merge(special_cases)
=> {
"C#" => "C-Sharp",
"C Sharp" => "C-Sharp",
"C #" => "C-Sharp",
"Crazy" => ""
}
Using that with the regex buildin' code:
regex = /(?:#{ Regexp.union(reversed_hash.keys).source })/
=> /(?:C\#|C\ Sharp|C\ \#|Crazy)/
That will find the tags using an auto-generated regex. If it needs to be case-insensitive, use:
regex = /(?:#{ Regexp.union(reversed_hash.keys).source })/i
Creating some text to test against:
text =<<EOT
This is "#C#"
This is "C Sharp"
This is "C #"
This is "Crazy"
EOT
=> "This is \"#C#\"\nThis is \"C Sharp\"\nThis is \"C #\"\nThis is \"Crazy\"\n"
And testing the gsub:
puts text.gsub(regex, reversed_hash)
Which outputs:
This is "#C-Sharp"
This is "#C-Sharp"
This is "#C-Sharp"
This is "#"
Now, I'm not a big fan of slurping big files into memory, because that doesn't scale well. Today's machines usually have many GB of memory, but I see files that still exceed the RAM in a machine. So, instead of using a File.read to load the file, then a single gsub to process it, I recommend using File.foreach. Using that changes the code.
Here's how I'd do it:
file_to_read = '/path/to/file/to/read'
File.open(file_to_read + '.new', 'w') do |fo|
File.foreach(file_to_read) do |li|
fo.puts li.gsub(regex, reversed_hash)
end
end
File.rename(file_to_read, file_to_read + '.bak')
File.rename(file_to_read + '.new', file_to_read)
This will create a .bak version of each file processed, so if something goes wrong you have a fall-back, which is always a good practice.
Edit: I forgot about the CSV file:
You can read/create one easily with Ruby using the CSV module; however, I'd go with a YAML file, because it lets you lay out your hash in a file that is easy to edit by hand or to generate from code.
Edit: More about CSV, YAML and generating one from the other
Here's how to read the CSV and convert it into the recommended hash format:
require 'csv'
text = <<EOT
C#, C-Sharp
C Sharp, C-Sharp
Crazy,
C #, C-Sharp
EOT
hash = Hash.new{ |h,k| h[k] = [] }
special_cases = []
CSV.parse(text) do |k,v|
(
(v.nil? || v.strip.empty?) ? special_cases : hash[v.strip]
) << k.strip
end
Picking up from before:
reversed_hash = Hash[hash.flat_map{ |k,v| v.map{ |i| [i,k] } }].merge(Hash[special_cases.map { |k| [k, ''] }])
puts reversed_hash
# => {"C#"=>"C-Sharp", "C Sharp"=>"C-Sharp", "C #"=>"C-Sharp", "Crazy"=>""}
To convert the CSV file to something more editable and useful, use the above code to create hash and special_cases, then:
require 'yaml'
puts ({
'hash' => hash,
'special_cases' => special_cases
}).to_yaml
Which looks like:
---
hash:
C-Sharp:
- C#
- C Sharp
- ! 'C #'
special_cases:
- Crazy
The rest you can figure out from the YAML docs.
Here's one possible approach; not sure how well it will work for large amounts of data:
require "stringio"
require "csv"
class MarkdownTidy
  def initialize(rules)
    @csv = CSV.new(rules.is_a?(IO) ? rules : StringIO.new(rules))
    @from_to = {}.tap do |hsh|
      @csv.each do |from, to|
        re = Regexp.new(Regexp.escape(from.strip))
        hsh[re] = to.to_s.strip   # to_s guards against a blank new_tag field
      end
    end
  end

  def tidy(str)
    cpy = str.dup
    @from_to.each do |re, canonical|
      cpy.gsub! re, canonical
    end
    cpy
  end
end
csv = <<-TEXT
C#, C-Sharp
C Sharp, C-Sharp
Crazy,
C #, C-Sharp
TEXT
markdown = <<-TEXT
C# some text C # some text Crazy
C#, C Sharp
TEXT
mt = MarkdownTidy.new(csv)
[markdown].each do |str|
  puts mt.tidy(str)
end
The idea is that you would replace the loop at the very end with one that opens up the files, reads them and then saves them back to disk.
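Since the question mentions Python as an option too, here is a rough Python sketch of the same overall idea: read the CSV into a mapping, build one regular expression from the old tags, and rewrite each markdown file in place (the CSV path and the glob pattern are placeholders):
import csv
import glob
import re

# Build old_tag -> new_tag from the CSV; a blank new_tag means "delete the tag".
with open("tags.csv", newline="") as f:          # placeholder CSV path
    reader = csv.reader(f)
    next(reader, None)                           # skip the old_tag,new_tag header row
    mapping = {old.strip(): new.strip() for old, new in reader}

# One regex matching any old tag; longest tags first so "C Sharp" wins over shorter overlaps.
pattern = re.compile("|".join(re.escape(t) for t in sorted(mapping, key=len, reverse=True)))

for path in glob.glob("_posts/*.md"):            # placeholder glob
    with open(path) as f:
        text = f.read()
    new_text = pattern.sub(lambda m: mapping[m.group(0)], text)
    if new_text != text:
        with open(path, "w") as f:
            f.write(new_text)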
