Python file delimited with double tabs - python

I'm new to Python and am working on a script to read from a file that is delimited by double tabs (except for the first row, which is delimited by single tabs).
I tried the following:
import csv
f = open('data.csv', 'rU')
source = list(csv.reader(f, skipinitialspace=True, delimiter='\t'))
for row in source:
    print row
The thing is that csv.reader won't take a two-character delimiter. Is there a good way to make a double-tab delimiter work?
The output currently looks like this:
['2011-11-28 10:25:44', '', '2011-11-28 10:33:00', '', 'Showering', '']
['2011-11-28 10:34:23', '', '2011-11-28 10:43:00', '', 'Breakfast', '']
['2011-11-28 10:49:48', '', '2011-11-28 10:51:13', '', 'Grooming','']
There should only be three columns of data, however, it is picking up the extra empty fields because of the double tabs that separate the fields.

If performance is not an issue here, would you be fine with this quick and hacky solution?
import csv
f = open('data.csv', 'rU')
source = list(csv.reader(f, skipinitialspace=True, delimiter='\t'))
for row in source:
    print row[::2]
row[::2] takes every second element of row, i.e. the elements at even indexes (0, 2, 4, ...). For the output shown above, striding with a step of 2 is one way to drop the empty fields.
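One caveat with plain striding: the question says the first row is delimited by single tabs, so row[::2] would drop real header fields there. Below is a minimal sketch that strides only when every in-between field is empty. The helper name drop_empty_alternates and the header names in the sample are made up for illustration, and the code is written for Python 3.

```python
import csv
import io


def drop_empty_alternates(row):
    """Return every second field, but only if all fields in between are empty."""
    if len(row) > 1 and not any(row[1::2]):
        return row[::2]
    return row


# Hypothetical sample: a single-tab header followed by a double-tab data row.
sample = (
    "start\tend\tactivity\n"
    "2011-11-28 10:25:44\t\t2011-11-28 10:33:00\t\tShowering\t\n"
)

cleaned = [
    drop_empty_alternates(row)
    for row in csv.reader(io.StringIO(sample), delimiter='\t')
]
for row in cleaned:
    print(row)
```

The header row is left untouched because its odd-indexed fields are not empty, while the data row is reduced to its three real columns.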

How much do you know about your data? Is it ever possible that an entry contains a double-tab? If not, I'd abandon the csv module and use the simple approach:
with open('data.csv') as data:
    for line in data:
        print line.strip().split('\t\t')
The csv module is nice for doing tricky things, like determining when a delimiter should split a string, and when it shouldn't because it is part of an entry. For example, say we used spaces as delimiters, and we had a row such as:
"this" "is" "a test"
We surround each entry with quotes, giving three entries. Clearly, if we use the approach of splitting on spaces, we'll get
['"this"', '"is"', '"a', 'test"']
which is not what we want. The csv module is useful here. But if we can guarantee that whenever a space shows up, it is a delimiter, then there's no need to use the power of the csv module. Just use str.split and call it a day.
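The contrast described above can be checked directly. This is a small Python 3 sketch using the same example row; the variable names are only for illustration.

```python
import csv

line = '"this" "is" "a test"'

# Naive splitting on spaces breaks the quoted entry apart.
naive = line.split(' ')

# csv.reader honours the quotes, so the space inside "a test" is kept.
parsed = next(csv.reader([line], delimiter=' '))

print(naive)
print(parsed)
```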

Related

Python CSV output - additional formatting

To start...Python noob...
My first goal is to read the first row of a CSV and output. The following code does that nicely.
import csv
csvfile = open('some.csv','rb')
csvFileArray = []
for row in csv.reader(csvfile, delimiter=','):
    csvFileArray.append(row)
print(csvFileArray[0])
Output looks like...
['Date', 'Time', 'CPU001 User%', 'CPU001 Sys%',......
My second and third tasks deal with formatting.
Thus, if I want the print(csvFileArray[0]) output to contain 'double quotes' for the delimiter how best can I handle that?
I'd like to see...
["Date","Time", "CPU001 User%", "CPU001 Sys%",......
I have played with formatting the csvFileArray field and all I can get it to do is to prefix or append data.
I have also looked into the 'dialect', 'quoting', etc., but am just all over the place.
My last task is to add text into each value (into the array).
Example:
["Test Date","New Time", "Red CPU001 User%", "Blue CPU001 Sys%",......
I've researched a number of methods to do this but am awash in the multiple ways.
Should I ditch the Array as this is too constraining?
Looking for direction not necessarily someone to write it for me.
Thanks.
OK.....refined the code a bit and am looking for direction, not direct solution (need to learn).
import csv
with open('ba200952fd69 - Copy.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        print(row)
        break
The code nicely reads the first line of the CSV and outputs the first row as follows:
['Date', 'Time', 'CPU001 User%', 'CPU001 Sys%',....
If I want to add formatting to each/any item within that row, would I be performing those actions within the quotes of the print command? Example: If I wanted each item to have double-quotes, or have a prefix of 'XXXX', etc.
I have read through examples of .join type commands, etc., and am sure that there are much easier ways to format print output than I'm aware of.
Again, looking for direction, not immediate solutions.
For your first task, I'd recommend using the next function to grab the first row rather than iterating through the whole csv. Also, it might be useful to take a look at with blocks as they are the standard way of dealing with opening and closing files.
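As a rough illustration of the next suggestion (Python 3), the sketch below reads only the header row. The in-memory sample and its data values are invented so the example runs without a file; with a real file you would pass the object returned by open(..., newline='') instead.

```python
import csv
import io

# Hypothetical stand-in for the CSV file; the second row is made-up data.
sample = io.StringIO(
    "Date,Time,CPU001 User%,CPU001 Sys%\n"
    "2011-11-28,10:25:44,1.5,0.3\n"
)

reader = csv.reader(sample)
header = next(reader)  # consumes just the first row
print(header)
```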
For your second question, it looks like you want to change the format of the print statement. Note that it is printing strings, which is indicated by the single quotes around each element in the array. This has nothing to do with the csv module; it is simply because you are printing an array of strings. To print with double quotes, you would have to reformat the print statement. You could take a look at this for some ways of doing that.
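Two possible ways to get double quotes in the printed output, sketched for Python 3. Neither is the only option, and the header list is simply copied from the question.

```python
import json

header = ['Date', 'Time', 'CPU001 User%', 'CPU001 Sys%']

# Option 1: build the string by hand with str.join and str.format.
joined = '[' + ', '.join('"{}"'.format(item) for item in header) + ']'

# Option 2: json.dumps renders a list of strings with double quotes.
dumped = json.dumps(header)

print(joined)
print(dumped)
```

Both lines print the same text here; they would differ for values containing quotes or non-ASCII characters, which json.dumps escapes.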
For your last question, I'd recommend looking at list comprehensions. E.g.,
["Test " + word for word in words].
If words = ["word1", "word2"], then this would return ["Test word1", "Test word2"].
Edit: If you want to add a different value to each value in the array, you could do something similar. Let prefixes be an array of prefixes you want to add to the word in words at the same index location. You could then use the list comprehension:
[prefix + " " + word for prefix, word in zip(prefixes, words)]
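Put together as a runnable Python 3 snippet, using the header from the question and the prefixes from the asker's desired output:

```python
words = ['Date', 'Time', 'CPU001 User%', 'CPU001 Sys%']
prefixes = ['Test', 'New', 'Red', 'Blue']

# zip pairs each prefix with the word at the same index.
labelled = [prefix + " " + word for prefix, word in zip(prefixes, words)]
print(labelled)
```

Note that zip stops at the shorter of the two lists, so a missing prefix silently drops the corresponding word.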

Retaining Blank Fields in Python Using the Split Method

I have a tab delimited file that has entries that look like this:
strand1 strand2 genename ID
AGCTCTG AGCTGT Erg1 ENSG010101
However, some of them have blank fields, for example:
strand1 strand2 genename ID
AGCGTGT AGTTGTT ENSG12955729
When I read in the lines in python:
data = [line.strip().split() for line in filename]
The second example becomes collapsed into a list of 3 indices:
['AGCGTGT', 'AGTTGTT', 'ENSG12955729']
I would rather that the empty field be retained so that the second example becomes a list of 4 indices:
['AGCGTGT', 'AGTTGTT', '', 'ENSG12955729']
How can I do this?
You can split explicitly on tab:
>>> "foo\tbar\t\tbaz".split('\t')
['foo', 'bar', '', 'baz']
By default, split() is going to split on any amount of whitespace.
Unless you can ensure that the first and last columns won't be blank, strip() is going to cause problems. If the data is otherwise well-formatted, this solution will work.
If you know that the only tabs are field delimiters, and you still want to strip other whitespace (spaces) from around individual column values:
map(str.strip, line.split('\t'))
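To make the strip() problem concrete, here is a short Python 3 sketch. The sample line is hypothetical: it has a blank first column, which strip() silently removes, whereas stripping only the line ending keeps it.

```python
# Hypothetical row whose first column is blank.
line = "\tAGTTGTT\t\tENSG12955729\n"

# strip() eats the leading tab, so the first (blank) column disappears.
stripped = line.strip().split('\t')

# Removing only the line ending keeps all four columns.
kept = line.rstrip('\r\n').split('\t')

print(stripped)
print(kept)
```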
As others have stated you can explicitly split on tabs, but you would still need to cleanup the line endings.
Better would be to use the csv module which handles delimited files:
import csv
with open('filename.txt', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    headers = next(reader)
    data = list(reader)
When you don't give a parameter to the str.split() method, it treats any contiguous sequence of whitespace characters as a single separator. When you do give it a parameter, .split('\t') perhaps, it treats each individual instance of that string as a separator.
Called without an argument, the split method treats any run of consecutive whitespace as a single separator, so it collapses the empty fields. You need to pass an argument to the method, which in your case is '\t'.
I'm always looking for puzzles to which I can apply pyparsing, no matter how impractical the results might prove to be. If nothing else, I can always look through my old answers to see what I've tried.
Don't judge me too harshly. :)
import pyparsing as pp
item = pp.Word(pp.alphanums) | pp.Empty().setParseAction(lambda x: '')
TAB = pp.Suppress(r'\t')
process_line = pp.Group(item('item') + TAB + item('item') + TAB + item('item') + TAB + item('item'))
with open('tab_delim.txt', 'rb') as tabbed:
    while True:
        line = tabbed.readline()
        if line:
            line = line.decode().strip().replace('\t', '\\t' )+3*'\\t'
            print (line.replace('\\t', ' '))
            print('\t', process_line.parseWithTabs().parseString(line))
        else:
            break
Output:
strand1 strand2 genename ID
[['strand1', 'strand2', 'genename', 'ID']]
AGCTCTG AGCTGT Erg1 ENSG010101
[['AGCTCTG', 'AGCTGT', 'Erg1', 'ENSG010101']]
AGCGTGT AGTTGTT ENSG12955729
[['AGCGTGT', 'AGTTGTT', '', 'ENSG12955729']]
ABC DEF
[['ABC', 'DEF', '', '']]
Edit: Altered the line TAB = pp.Suppress(r'\t') to what was suggested by PaulMcG in a comment (from a construction with a double-slash prior to the 't' in a non-raw string).

Python Regex split string into 5 pieces

I'm playing around with Python, and I have run into a problem.
I have a large data file where each string is structured like this:
"id";"userid";"userstat";"message";"2013-10-19 06:33:20 (date)"
I need to split each line into 5 pieces, with the semicolon as the delimiter, while keeping what is within the quotation marks together.
It's hard to explain, so I hope you understand what I mean.
That format looks a lot like ssv: semicolon-separated values (like "csv", but semicolons instead of commas). We can use the csv module to handle this:
import csv
with open("yourfile.txt", "rb") as infile:
    reader = csv.reader(infile, delimiter=";")
    for row in reader:
        print row
produces
['id', 'userid', 'userstat', 'message', '2013-10-19 06:33:20 (date)']
One advantage of this method is that it will correctly handle the case of semicolons within the quoted data automatically.
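A quick Python 3 sketch of that advantage. The sample line is invented for illustration, with a semicolon inside the quoted message field.

```python
import csv

# Hypothetical line: the fourth field contains a semicolon inside the quotes.
line = '"1";"42";"active";"hello; world";"2013-10-19 06:33:20"'

# str.split cuts the message in two, giving six pieces.
naive = line.split(';')

# csv.reader keeps the quoted field whole and removes the quotes.
parsed = next(csv.reader([line], delimiter=';'))

print(len(naive), naive)
print(len(parsed), parsed)
```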
Use str.split, no need of regex:
>>> strs = '"id";"userid";"userstat";"message";"2013-10-19 06:33:20 (date)"'
>>> strs.split(';')
['"id"', '"userid"', '"userstat"', '"message"', '"2013-10-19 06:33:20 (date)"']
If you don't want the double quotes as well, then:
>>> [x.strip('"') for x in strs.split(';')]
['id', 'userid', 'userstat', 'message', '2013-10-19 06:33:20 (date)']
You can split by ";" in your case; also consider using a regexp, like ^("[^"]+");("[^"]+");("[^"]+");("[^"]+");("[^"]+")$
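A sketch of how that pattern might be applied in Python 3. The name FIVE_FIELDS is made up for this example. Because each group uses [^"]+, the pattern only matches lines where all five fields are non-empty and contain no double quotes.

```python
import re

# The pattern from the answer above, compiled once for reuse.
FIVE_FIELDS = re.compile(
    r'^("[^"]+");("[^"]+");("[^"]+");("[^"]+");("[^"]+")$'
)

line = '"id";"userid";"userstat";"message";"2013-10-19 06:33:20 (date)"'

match = FIVE_FIELDS.match(line)
# match is None for lines that do not fit the five-field shape.
fields = [group.strip('"') for group in match.groups()] if match else None
print(fields)
```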

loading strings with spaces as numpy array

I would like to load a csv file as a numpy array. Each row contains string fields with spaces.
I tried with both loadtxt() and genfromtxt() methods available in numpy. By default both methods consider space as a delimiter and separate each word in the string into its own column. Is there any way to load this sort of data using loadtxt() or genfromtxt(), or will I have to write my own code for it?
Sample row from my file:
826##25733##Emanuele Buratti## ##Mammalian cell expression
Here ## is the delimiter and space denotes missing values.
I think your problem is that the default comments character # is conflicting with your delimiter. I was able to load your data like this:
>>> import numpy as np
>>> np.loadtxt('/tmp/sample.txt', dtype=str, delimiter='##', comments=None)
array(['826', '25733', 'Emanuele Buratti', ' ', 'Mammalian cell expression'],
      dtype='|S25')
You can see that the dtype has been automatically set to whatever the maximum length string was. You can use dtype=object if that is troublesome. As an aside, since your data is not numeric, I would probably recommend using csv module rather than numpy for this job.
Here is the csv equivalent, as wim suggested:
import csv
with open('somefile.txt') as f:
    reader = csv.reader(f, delimiter='##')
    rows = list(reader)
As @wim pointed out in the comments, this doesn't really work since the delimiter must be one character. So if you change the above so that delimiter='#', you get this as the result:
[['826', '', '25733', '', 'Emanuele Buratti', '', ' ', '', 'Mammalian cell expression']]
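Since csv insists on a single-character delimiter, another option is to split each line on '##' directly. This is a minimal Python 3 sketch, assuming the fields themselves never contain '##'; the helper name split_rows is made up for illustration.

```python
def split_rows(lines, delimiter='##'):
    """Split each line on a multi-character delimiter, keeping blank fields."""
    return [line.rstrip('\r\n').split(delimiter) for line in lines]


# The sample row from the question, as it would come from iterating a file.
sample = ["826##25733##Emanuele Buratti## ##Mammalian cell expression\n"]

rows = split_rows(sample)
print(rows)
```

If a NumPy array is still wanted, the resulting list of lists could then be passed to np.array(rows, dtype=object).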

How to use python csv module for splitting double pipe delimited data

I have got data which looks like:
"1234"||"abcd"||"a1s1"
I am trying to read and write using Python's csv reader and writer.
As the csv module's delimiter is limited to single char, is there any way to retrieve data cleanly? I cannot afford to remove the empty columns as it is a massively huge data set to be processed in time bound manner. Any thoughts will be helpful.
The docs and experimentation prove that only single-character delimiters are allowed.
Since csv.reader accepts any object that supports the iterator protocol, you can use a generator expression to replace each || with |, and then feed this generator to the reader:
import csv
def read_this_funky_csv(source):
    # be sure to pass a source object that supports
    # iteration (e.g. a file object, or a list of csv text lines)
    return csv.reader((line.replace('||', '|') for line in source), delimiter='|')
This code is pretty effective since it operates on one CSV line at a time, provided your CSV source yields lines that do not exceed your available RAM :)
>>> import csv
>>> reader = csv.reader(['"1234"||"abcd"||"a1s1"'], delimiter='|')
>>> for row in reader:
...     assert not ''.join(row[1::2])
...     row = row[0::2]
...     print row
...
['1234', 'abcd', 'a1s1']
>>>
Unfortunately, delimiter is represented by a character in C. This means that it is impossible to have it be anything other than a single character in Python. The good news is that it is possible to ignore the values which are null:
reader = csv.reader(['"1234"||"abcd"||"a1s1"'], delimiter='|')
# iterate through the reader.
for x in reader:
    # you have to use a numeric range here to ensure that you eliminate the
    # right things.
    for i in range(len(x)):
        # Odd indexes will be discarded.
        if i%2 == 0: x[i]  # x[i] where i%2 == 0 represents the values you want.
There are other ways to accomplish this (a function could be written, for one), but this gives you the logic which is needed.
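The loop above only evaluates x[i] without storing it. One way the same logic could be completed so that the even-indexed values are actually collected (a Python 3 sketch, not necessarily what the answer's author had in mind):

```python
import csv

reader = csv.reader(['"1234"||"abcd"||"a1s1"'], delimiter='|')

# Keep only the even-indexed fields; the odd ones are the empty strings
# that csv produces between the two halves of each '||'.
rows = [
    [value for i, value in enumerate(row) if i % 2 == 0]
    for row in reader
]
print(rows)
```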
If your data literally looks like the example (the fields never contain '||' and are always quoted), and you can tolerate the quote marks, or are willing to slice them off later, just use .split:
>>> '"1234"||"abcd"||"a1s1"'.split('||')
['"1234"', '"abcd"', '"a1s1"']
>>> list(s[1:-1] for s in '"1234"||"abcd"||"a1s1"'.split('||'))
['1234', 'abcd', 'a1s1']
csv is only needed if the delimiter is found within the fields, or to delete optional quotes around fields.
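Because the question asks about both reading and writing, here is a minimal Python 3 sketch of a round trip built on the .split approach. It assumes every field is quoted and that no field contains '||' or a double quote; the helper names parse_row and format_row are made up for illustration.

```python
def parse_row(line, delimiter='||'):
    """Split on the delimiter and slice one quote off each end of every field."""
    return [field[1:-1] for field in line.rstrip('\r\n').split(delimiter)]


def format_row(fields, delimiter='||'):
    """Wrap each field in double quotes and join with the delimiter."""
    return delimiter.join('"{}"'.format(field) for field in fields)


line = '"1234"||"abcd"||"a1s1"'
fields = parse_row(line)
rebuilt = format_row(fields)

print(fields)
print(rebuilt)
```

Both helpers work on one line at a time, so they can be applied while iterating over a large file without loading it all into memory.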
