loading strings with spaces as numpy array - python

I would like to load a csv file as a numpy array. Each row contains string fields with spaces.
I tried both the loadtxt() and genfromtxt() methods available in numpy. By default both methods treat space as a delimiter and split each word in the string into a separate column. Is there any way to load this sort of data using loadtxt() or genfromtxt(), or will I have to write my own code for it?
Sample row from my file:
826##25733##Emanuele Buratti## ##Mammalian cell expression
Here ## is the delimiter and space denotes missing values.

I think your problem is that the default comments character # is conflicting with your delimiter. I was able to load your data like this:
>>> import numpy as np
>>> np.loadtxt('/tmp/sample.txt', dtype=str, delimiter='##', comments=None)
array(['826', '25733', 'Emanuele Buratti', ' ', 'Mammalian cell expression'],
      dtype='|S25')
You can see that the dtype has been automatically set to match the longest string. You can use dtype=object if that is troublesome. As an aside, since your data is not numeric, I would probably recommend using the csv module rather than numpy for this job.
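For reference, a minimal sketch of the dtype=object variant; note that recent NumPy releases restrict loadtxt to single-character delimiters, so the two-character '##' here relies on an older NumPy:
import numpy as np

# dtype=object keeps native Python strings instead of fixed-width '|S25' ones
data = np.loadtxt('/tmp/sample.txt', dtype=object, delimiter='##', comments=None)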

Here is the csv equivalent, as wim suggested:
import csv
with open('somefile.txt') as f:
    reader = csv.reader(f, delimiter='##')
    rows = list(reader)
As wim pointed out in the comments, this doesn't actually work, since the csv module requires the delimiter to be a single character. So if you change the above to delimiter='#', you get this as the result:
[['826', '', '25733', '', 'Emanuele Buratti', '', ' ', '', 'Mammalian cell expression']]
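Since '##' never appears inside a field here, plain str.split also handles the two-character delimiter directly; a minimal sketch (assuming that guarantee holds):
with open('somefile.txt') as f:
    rows = [line.rstrip('\n').split('##') for line in f]
# [['826', '25733', 'Emanuele Buratti', ' ', 'Mammalian cell expression']]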

Related

parsing using multiple delimiters in python

I have a data file in which the data is stored with comma, tab and newline delimiters, like this:
[32135, 311351, 88686
123152, 3153131, 131513
....]
I want to extract an n×3 array out of it. How could I do that?
I have tried using split and splitlines, but it only partially parsed the file.
import numpy as np
filename="Elem_Output.inp"
f = open(filename,"r")
pmax=f.read()
p1=pmax.split()
I expect to extract an array with every line as a row and each number in its own column.
After pmax=f.read(), you may want to write:
# Replace tabs and newlines with comma separators
pmax = pmax.replace("\n", ",").replace("\t", ",")
# Collapse repeated delimiters into a single instance
pmax = pmax.replace(",,,", ",").replace(",,", ",")
Needless to say, this can be coded much better using regex (import re).
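For example, a minimal re.split sketch (assuming the bracketed, mixed-delimiter format shown in the question):
import re
import numpy as np

with open("Elem_Output.inp") as f:
    raw = f.read()

# Split on any run of commas, tabs, newlines, spaces or brackets
tokens = [t for t in re.split(r"[,\t\n\[\] ]+", raw) if t]
arr = np.array(tokens, dtype=int).reshape(-1, 3)  # n x 3 array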
Secondly, if your file starts and ends with square brackets, you may want to additionally add:
pmax = pmax.replace("[","").replace("]","")
Now, if you want this output as an array of numbers instead of a string, try this:
from array import array

# "l" is the typecode for signed long integers
array_pmax = array("l", (int(x) for x in pmax.split(",")))
The first argument of the array() function indicates the typecode. To know more, just use help(array).
Hope that helps!!

Python file delimited with double tabs

I'm new to python and am working on a script to read from a file that is delimited by double tabs (except for the first row, which is delimited by single tabs).
I tried the following:
import csv

f = open('data.csv', 'rU')
source = list(csv.reader(f, skipinitialspace=True, delimiter='\t'))
for row in source:
    print row
The thing is that csv.reader won't take a two-character delimiter. Is there a good way to make a double-tab delimiter work?
The output currently looks like this:
['2011-11-28 10:25:44', '', '2011-11-28 10:33:00', '', 'Showering', '']
['2011-11-28 10:34:23', '', '2011-11-28 10:43:00', '', 'Breakfast', '']
['2011-11-28 10:49:48', '', '2011-11-28 10:51:13', '', 'Grooming','']
There should only be three columns of data, however, it is picking up the extra empty fields because of the double tabs that separate the fields.
If performance is not an issue here, would you be fine with this quick and hacky solution?
import csv

f = open('data.csv', 'rU')
source = list(csv.reader(f, skipinitialspace=True, delimiter='\t'))
for row in source:
    print row[::2]
row[::2] strides over the list row, keeping the indexes that are multiples of 2. For the above-mentioned output, index striding with a step (here it's 2) is one way to go!
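For instance, applied to one of the rows shown above:
row = ['2011-11-28 10:25:44', '', '2011-11-28 10:33:00', '', 'Showering', '']
print row[::2]
# ['2011-11-28 10:25:44', '2011-11-28 10:33:00', 'Showering']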
How much do you know about your data? Is it ever possible that an entry contains a double-tab? If not, I'd abandon the csv module and use the simple approach:
with open('data.csv') as data:
    for line in data:
        print line.strip().split('\t\t')
The csv module is nice for doing tricky things, like determining when a delimiter should split a string, and when it shouldn't because it is part of an entry. For example, say we used spaces as delimiters, and we had a row such as:
"this" "is" "a test"
We surround each entry with quotes, giving three entries. Clearly, if we use the approach of splitting on spaces, we'll get
['"this"', '"is"', '"a', 'test"']
which is not what we want. The csv module is useful here. But if we can guarantee that whenever a space shows up, it is a delimiter, then there's no need to use the power of the csv module. Just use str.split and call it a day.
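To make the contrast concrete, here is a small sketch of both approaches on that row (csv.reader accepts any iterable of lines, so a one-element list works):
import csv

line = '"this" "is" "a test"'
print next(csv.reader([line], delimiter=' '))
# ['this', 'is', 'a test']
print line.split(' ')
# ['"this"', '"is"', '"a', 'test"']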

Python Regex split string into 5 pieces

I'm playing around with Python, and I have run into a problem.
I have a large data file where each string is structured like this:
"id";"userid";"userstat";"message";"2013-10-19 06:33:20 (date)"
I need to split each line into 5 pieces, semicolon being the delimiter, while keeping what is inside the quotation marks intact.
It's hard to explain, so I hope you understand what I mean.
That format looks a lot like ssv: semicolon-separated values (like "csv", but with semicolons instead of commas). We can use the csv module to handle this:
import csv
with open("yourfile.txt", "rb") as infile:
reader = csv.reader(infile, delimiter=";")
for row in reader:
print row
produces
['id', 'userid', 'userstat', 'message', '2013-10-19 06:33:20 (date)']
One advantage of this method is that it will correctly handle the case of semicolons within the quoted data automatically.
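For example, a quick sketch with a made-up field containing a semicolon inside quotes:
import csv

line = '"id";"user;stat";"message"'
print next(csv.reader([line], delimiter=';'))
# ['id', 'user;stat', 'message'] - the quoted semicolon is not split on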
Use str.split, no need for a regex:
>>> strs = '"id";"userid";"userstat";"message";"2013-10-19 06:33:20 (date)"'
>>> strs.split(';')
['"id"', '"userid"', '"userstat"', '"message"', '"2013-10-19 06:33:20 (date)"']
If you don't want the double quotes as well, then:
>>> [x.strip('"') for x in strs.split(';')]
['id', 'userid', 'userstat', 'message', '2013-10-19 06:33:20 (date)']
You can split by ";" in your case; also consider using a regexp, like ^("[^"]+");("[^"]+");("[^"]+");("[^"]+");("[^"]+")$
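A minimal sketch of that regexp in use (note that the groups keep the surrounding quotes, as written):
import re

strs = '"id";"userid";"userstat";"message";"2013-10-19 06:33:20 (date)"'
m = re.match(r'^("[^"]+");("[^"]+");("[^"]+");("[^"]+");("[^"]+")$', strs)
if m:
    print list(m.groups())
# ['"id"', '"userid"', '"userstat"', '"message"', '"2013-10-19 06:33:20 (date)"']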

Slice specific characters in CSV using python

I have data in tab delimited format that looks like:
0/0:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00 0/0:74:-1.02,-23.28,-301.81:99.00
I am only interested in the first 3 characters of each entry (i.e. 0/0 and 0/1). I figured the best way to do this would be to use re.match together with genfromtxt in numpy. This example is as far as I have gotten:
import re
csvfile = 'home/python/batch1.hg19.table'
from numpy import genfromtxt
data = genfromtxt(csvfile, delimiter="\t", dtype=None)
for i in data[1]:
    m = re.match('[0-9]/[0-9]', i)
    if m:
        print m.group(0),
    else:
        print "NA",
This works for a single row of the data, but I am having a hard time figuring out how to expand it to every row of the input file.
Should I make it a function and apply it to each row separately, or is there a more Pythonic way to do this?
Unless you really want to use NumPy, try this:
file = open('home/python/batch1.hg19.table')
for line in file:
    for cell in line.split('\t'):
        print(cell[:3])
Which just iterates through each line of the file, tokenizes the line using the tab character as the delimiter, then prints the slice of the text you are looking for.
Numpy is great when you want to load in an array of numbers.
The format you have here is too complicated for numpy to recognize, so you just get an array of strings. That's not really playing to numpy's strength.
Here's a simple way to do it without numpy:
import re

result = []
with open(csvfile, 'r') as f:
    for line in f:
        row = []
        for text in line.split('\t'):
            match = re.search('([0-9]/[0-9])', text)
            if match:
                row.append(match.group(1))
            else:
                row.append("NA")
        result.append(row)
print(result)
yields
# [['0/0', '0/1', '0/0'], ['NA', '0/1', '0/0']]
on this data:
0/0:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00 0/0:74:-1.02,-23.28,-301.81:99.00
---:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00 0/0:74:-1.02,-23.28,-301.81:99.00
It's pretty easy to parse the whole file without regular expressions:
for line in open('yourfile').read().split('\n'):
    for token in line.split('\t'):
        print token[:3] if token else 'NA'
I haven't written Python in a while, but I would probably write it like this:
file = open("home/python/batch1.hg19.table")
for line in file:
columns = line.split("\t")
for column in columns:
print column[:3]
file.close()
Of course if you need to validate the first three characters, you'll still need the regex.
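Combining the two, a small sketch that slices and validates in one pass (assuming the tab-delimited layout above):
import re

with open("home/python/batch1.hg19.table") as f:
    for line in f:
        print [c[:3] if re.match(r'[0-9]/[0-9]', c) else 'NA'
               for c in line.rstrip('\n').split('\t')]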

Reading CSV files in numpy where delimiter is ","

I've got a CSV file with a format that looks like this:
"FieldName1", "FieldName2", "FieldName3", "FieldName4"
"04/13/2010 14:45:07.008", "7.59484916392", "10", "6.552373"
"04/13/2010 14:45:22.010", "6.55478493312", "9", "3.5378543"
...
Note that there are double quote characters at the start and end of each line in the CSV file, and the "," string is used to delimit fields within each line. The number of fields in the CSV file can vary from file to file.
When I try to read this into numpy via:
import numpy as np
data = np.genfromtxt(csvfile, dtype=None, delimiter=',', names=True)
all the data gets read in as string values, surrounded by double-quote characters. Not unreasonable, but not much use to me, as I then have to go back and convert every column to its correct type.
When I use delimiter='","' instead, everything works as I'd like, except for the 1st and last fields. As the start and end of each line is a single double-quote character, this isn't seen as a valid delimiter for the 1st and last fields, so they get read in as e.g. "04/13/2010 14:45:07.008 and 6.552373" - note the leading and trailing double-quote characters respectively. Because of these redundant characters, numpy assumes the 1st and last fields are both string types; I don't want that to be the case.
Is there a way of instructing numpy to read in files formatted in this fashion as I'd like, without having to go back and "fix" the structure of the numpy array after the initial read?
The basic problem is that NumPy doesn't understand the concept of stripping quotes (whereas the csv module does). When you say delimiter='","', you're telling NumPy that the column delimiter is literally a quoted comma, i.e. the quotes are around the comma, not the value, so the extra quotes you get on the first and last columns are expected.
Looking at the function docs, I think you'll need to set the converters parameter to strip quotes for you (the default does not):
import re
import numpy as np

fieldFilter = re.compile(r'^"?([^"]*)"?$')

def filterTheField(s):
    m = fieldFilter.match(s.strip())
    if m:
        return float(m.group(1))
    else:
        return 0.0  # or whatever default

# ...
# Yes, sorry, you have to know the number of columns, since the NumPy docs
# don't say you can specify a default converter for all columns.
convs = dict((col, filterTheField) for col in range(numColumns))

data = np.genfromtxt(csvfile, dtype=None, delimiter=',', names=True,
                     converters=convs)
Or abandon np.genfromtxt() and let csv.reader give you the file's contents a row at a time, as lists of strings; then you just iterate through the elements and build the matrix:
reader = csv.reader(open(csvfile))
headers = next(reader)  # BTW, the column headings are in this first row.
result = np.array([[float(col) for col in row] for row in reader])
EDIT: Okay, so it looks like your file isn't all floats. In that case, you can set convs as needed in the genfromtxt case, or create a vector of conversion functions in the csv.reader case:
from datetime import datetime

# Timestamp format taken from the sample data above
def parse_date(s):
    return datetime.strptime(s, '%m/%d/%Y %H:%M:%S.%f')

reader = csv.reader(open(csvfile))
headers = next(reader)  # BTW, the column headings are in this first row.
converters = [parse_date, float, int, float]
result = np.array([[conv(col) for col, conv in zip(row, converters)]
                   for row in reader])
EDIT 2: Okay, variable column count... Your data source just wants to make life difficult. Luckily, we can just use magic...
reader = csv.reader(open(csvfile))
headers = next(reader)  # skip the header row
result = np.array([[magic(col) for col in row] for row in reader])
... where magic() is just a name I got off the top of my head for a function. (Psyche!)
At worst, it could be something like:
def magic(s):
    if '/' in s:
        # Timestamp format taken from the sample data
        return datetime.strptime(s, '%m/%d/%Y %H:%M:%S.%f')
    elif '.' in s:
        return float(s)
    else:
        return int(s)
Maybe NumPy has a function that takes a string and returns a single element with the right type. numpy.fromstring() looks close, but it might interpret the space in your timestamps as a column separator.
P.S. One downside I see with csv.reader is that it doesn't discard comments; real csv files don't have comments.
