Python Regex split string into 5 pieces - python

I'm playing around with Python, and I have run into a problem.
I have a large data file where each string is structured like this:
"id";"userid";"userstat";"message";"2013-10-19 06:33:20 (date)"
I need to split each line into 5 pieces, with the semicolon as the delimiter, while respecting the quotation marks. It's hard to explain, so I hope you understand what I mean.

That format looks a lot like ssv: semicolon-separated values (like "csv", but with semicolons instead of commas). We can use the csv module to handle this:
import csv
with open("yourfile.txt", "rb") as infile:
    reader = csv.reader(infile, delimiter=";")
    for row in reader:
        print row
produces
['id', 'userid', 'userstat', 'message', '2013-10-19 06:33:20 (date)']
One advantage of this method is that it will correctly handle the case of semicolons within the quoted data automatically.
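For example, a semicolon inside a quoted field is kept as part of the field rather than treated as a delimiter (the sample line below is made up for illustration):

```python
import csv

# Hypothetical line whose quoted message field contains a semicolon
lines = ['"1";"u42";"active";"hello; world";"2013-10-19 06:33:20 (date)"']
rows = list(csv.reader(lines, delimiter=';'))
print(rows[0])
# ['1', 'u42', 'active', 'hello; world', '2013-10-19 06:33:20 (date)']
```

The embedded semicolon in `hello; world` survives, and the surrounding quotes are stripped automatically.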

Use str.split, no need of regex:
>>> strs = '"id";"userid";"userstat";"message";"2013-10-19 06:33:20 (date)"'
>>> strs.split(';')
['"id"', '"userid"', '"userstat"', '"message"', '"2013-10-19 06:33:20 (date)"']
If you don't want the double quotes as well, then:
>>> [x.strip('"') for x in strs.split(';')]
['id', 'userid', 'userstat', 'message', '2013-10-19 06:33:20 (date)']

You can split on ";" in your case; alternatively, consider using a regexp, like ^("[^"]+");("[^"]+");("[^"]+");("[^"]+");("[^"]+")$
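As a sketch of the regexp approach (assuming every field is quoted and never contains an embedded quote; the capture groups are moved inside the quotes here so the results come back unquoted):

```python
import re

# Capture groups placed inside the quotes so matches come back unquoted
pattern = r'^"([^"]+)";"([^"]+)";"([^"]+)";"([^"]+)";"([^"]+)"$'
line = '"id";"userid";"userstat";"message";"2013-10-19 06:33:20 (date)"'
m = re.match(pattern, line)
if m:
    print(m.groups())
    # ('id', 'userid', 'userstat', 'message', '2013-10-19 06:33:20 (date)')
```

Note that this pattern breaks down if a message field ever contains a quote or a semicolon adjacent to quotes; the csv module is the safer choice in that case.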

Related

Retaining Blank Fields in Python Using the Split Method

I have a tab delimited file that has entries that look like this:
strand1 strand2 genename ID
AGCTCTG AGCTGT Erg1 ENSG010101
However, some of them have blank fields, for example:
strand1 strand2 genename ID
AGCGTGT AGTTGTT ENSG12955729
When I read in the lines in python:
data = [line.strip().split() for line in filename]
The second example becomes collapsed into a list of 3 indices:
['AGCGTGT', 'AGTTGTT', 'ENSG12955729']
I would rather that the empty field be retained so that the second example becomes a list of 4 indices:
['AGCGTGT', 'AGTTGTT', '', 'ENSG12955729']
How can I do this?
You can split explicitly on tab:
>>> "foo\tbar\t\tbaz".split('\t')
['foo', 'bar', '', 'baz']
By default, split() is going to split on any amount of whitespace.
Unless you can ensure that the first and last columns won't be blank, strip() is going to cause problems. If the data is otherwise well-formatted, this solution will work.
If you know that the only tabs are field delimiters, and you still want to strip other whitespace (spaces) from around individual column values:
map(str.strip, line.split('\t'))
As others have stated you can explicitly split on tabs, but you would still need to cleanup the line endings.
Better would be to use the csv module which handles delimited files:
import csv
with open('filename.txt', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    headers = next(reader)
    data = list(reader)
When you don't give a parameter to the str.split() method, it treats any contiguous sequence of whitespace characters as a single separator. When you do give it a parameter, .split('\t') perhaps, it treats each individual instance of that string as a separator.
The split method without any argument treats a continuous run of whitespace as a single separator, so all the empty fields are dropped. You need to pass an argument to the method, which in your case is '\t'.
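A quick comparison of the two behaviours on a line from the example data:

```python
s = 'AGCGTGT\tAGTTGTT\t\tENSG12955729'
print(s.split())      # ['AGCGTGT', 'AGTTGTT', 'ENSG12955729'] -- runs of whitespace collapse
print(s.split('\t'))  # ['AGCGTGT', 'AGTTGTT', '', 'ENSG12955729'] -- empty field retained
```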
I'm always looking for puzzles to which I can apply pyparsing, no matter how impractical the results might prove to be. If nothing else, I can always look through my old answers to see what I've tried.
Don't judge me too harshly. :)
import pyparsing as pp
item = pp.Word(pp.alphanums) | pp.Empty().setParseAction(lambda x: '')
TAB = pp.Suppress(r'\t')
process_line = pp.Group(item('item') + TAB + item('item') + TAB + item('item') + TAB + item('item'))
with open('tab_delim.txt', 'rb') as tabbed:
    while True:
        line = tabbed.readline()
        if line:
            line = line.decode().strip().replace('\t', '\\t') + 3 * '\\t'
            print(line.replace('\\t', ' '))
            print('\t', process_line.parseWithTabs().parseString(line))
        else:
            break
Output:
strand1 strand2 genename ID
[['strand1', 'strand2', 'genename', 'ID']]
AGCTCTG AGCTGT Erg1 ENSG010101
[['AGCTCTG', 'AGCTGT', 'Erg1', 'ENSG010101']]
AGCGTGT AGTTGTT ENSG12955729
[['AGCGTGT', 'AGTTGTT', '', 'ENSG12955729']]
ABC DEF
[['ABC', 'DEF', '', '']]
Edit: Altered the line TAB = pp.Suppress(r'\t') to what was suggested by PaulMcG in a comment (from a construction with a double-slash prior to the 't' in a non-raw string).

Trying to copy column1 from a csv file to another empty file using python

I'm looking for a way using python to copy the first column from a csv into an empty file. I'm trying to learn python so any help would be great!
So if this is test.csv
A 32
D 21
C 2
B 20
I want this output
A
D
C
B
I've tried the following commands in python but the output file is empty
f= open("test.csv",'r')
import csv
reader = csv.reader(f,delimiter="\t")
names=""
for each_line in reader:
    names = each_line[0]
First, you want to open your files. A good practice is to use the with statement (which, technically speaking, introduces a context manager) so that, when your code exits from the with block, all the files are automatically closed:
with open('test.csv') as inpfile, open('out.csv', 'w') as outfile:
next you want a loop on the lines of the input file (note the indentation, we are inside the with block), line splitting is automatic when you read a text file with lines separated by newlines…
    for line in inpfile:
each line is a string, but you think of it as two fields separated by white space — this situation is so common that strings have a method to deal with this situation (note again the increasing indent, we are in the for loop block)
        fields = line.split()
by default .split() splits on white space, but you can use, e.g., split(',') to split on commas, etc — that said, fields is a list of strings, for your first record it is equal to ['A', '32'] and you want to output just the first field in this list… for this purpose a file object has the .write() method, that writes a string, just a string, to the file, and fields[0] IS a string, but we have to add a newline character to it because, in this respect, .write() is different from print().
        outfile.write(fields[0]+'\n')
That's all, but if you omit my comments it's 4 lines of code
with open('test.csv') as inpfile, open('out.csv', 'w') as outfile:
    for line in inpfile:
        fields = line.split()
        outfile.write(fields[0]+'\n')
When you are done with learning (some) Python, ask for an explanation of this...
with open('test.csv') as ifl, open('out.csv', 'w') as ofl:
    ofl.write('\n'.join(line.split()[0] for line in ifl))
Addendum
The csv module in such a simple case adds the additional conveniences of
auto-splitting each line into a list of strings
taking care of the details of output (newlines, etc)
and when learning Python it's more fruitful to see how these steps can be done using the bare language, or at least that it is my opinion…
The situation is different when your data file is complex, has headers, has quoted strings possibly containing quoted delimiters, etc.; in those cases the use of csv is recommended, as it takes into account all the gory details. For complex data analysis requirements you will need other packages, not included in the standard library, e.g., numpy and pandas, but that is another story.
This answer reads the CSV file, treating a column as demarcated by a space character. You have to add header=None, otherwise the first row will be taken to be the header / names of columns.
ss is a slice - the 0th column, taking all rows as denoted by :
The last line writes the slice to a new filename.
import pandas as pd
df = pd.read_csv('test.csv', sep=' ', header=None)
ss = df.iloc[:, 0]  # .ix has been removed from modern pandas; .iloc does positional indexing
ss.to_csv('new_path.csv', sep=' ', index=False)
import csv
reader = csv.reader(open("test.csv","rb"), delimiter='\t')
writer = csv.writer(open("output.csv","wb"))
for e in reader:
    writer.writerow([e[0]])  # wrap in a list; writerow(e[0]) would split the string into characters
The best you can do is create an empty list, append the column to it, and then write that new list into another csv, for example:
import csv
def writetocsv(l):
    #convert the set to the list
    b = list(l)
    print(b)
    with open("newfile.csv", 'w', newline='') as f:
        w = csv.writer(f, delimiter=',')
        for value in b:
            w.writerow([value])

adcb_list = []
f = open("test.csv", 'r')
reader = csv.reader(f, delimiter="\t")
for each_line in reader:
    adcb_list.append(each_line[0])  # keep just the first column of each row
writetocsv(adcb_list)
hope this works for you :-)

Python file delimited with double tabs

I'm new to Python and am working on a script to read from a file that is delimited by double tabs (except for the first row, which is delimited by single tabs).
I tried the following:
f = open('data.csv', 'rU')
source = list(csv.reader(f,skipinitialspace=True, delimiter='\t'))
for row in source:
    print row
The thing is that csv.reader won't take a two-character delimiter. Is there a good way to make a double-tab delimiter work?
The output currently looks like this:
['2011-11-28 10:25:44', '', '2011-11-28 10:33:00', '', 'Showering', '']
['2011-11-28 10:34:23', '', '2011-11-28 10:43:00', '', 'Breakfast', '']
['2011-11-28 10:49:48', '', '2011-11-28 10:51:13', '', 'Grooming','']
There should only be three columns of data, however, it is picking up the extra empty fields because of the double tabs that separate the fields.
If performance is not an issue here, would you be fine with this quick and hacky solution?
f = open('data.csv', 'rU')
source = list(csv.reader(f,skipinitialspace=True, delimiter='\t'))
for row in source:
    print row[::2]
row[::2] does a stride on the list row, keeping the indexes that are multiples of 2. For the above-mentioned output, index striding by an offset (here it's 2) is one way to go!
How much do you know about your data? Is it ever possible that an entry contains a double-tab? If not, I'd abandon the csv module and use the simple approach:
with open('data.csv') as data:
    for line in data:
        print line.strip().split('\t\t')
The csv module is nice for doing tricky things, like determining when a delimiter should split a string, and when it shouldn't because it is part of an entry. For example, say we used spaces as delimiters, and we had a row such as:
"this" "is" "a test"
We surround each entry with quotes, giving three entries. Clearly, if we use the approach of splitting on spaces, we'll get
['"this"', '"is"', '"a', 'test"']
which is not what we want. The csv module is useful here. But if we can guarantee that whenever a space shows up, it is a delimiter, then there's no need to use the power of the csv module. Just use str.split and call it a day.
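If you do need the csv module's quote handling but the delimiter is two characters, one workaround (a sketch, assuming no field ever contains a single tab on its own) is to collapse each double tab to a single tab on the fly and feed the result to the reader:

```python
import csv

def double_tab_reader(source):
    # Assumption: a lone '\t' never appears inside a field,
    # so rewriting '\t\t' as '\t' is safe.
    return csv.reader((line.replace('\t\t', '\t') for line in source),
                      delimiter='\t')

lines = ['2011-11-28 10:25:44\t\t2011-11-28 10:33:00\t\tShowering']
for row in double_tab_reader(lines):
    print(row)  # ['2011-11-28 10:25:44', '2011-11-28 10:33:00', 'Showering']
```

The generator expression processes one line at a time, so this also works for large files.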

Effective way to get part of string until token

I'm parsing a very big csv (big = tens of gigabytes) file in python and I need only the value of the first column of every line. I wrote this code, wondering if there is a better way to do it:
delimiter = ','
f = open('big.csv','r')
for line in f:
    pos = line.find(delimiter)
    id = int(line[0:pos])
Is there a more effective way to get the part of the string before the first delimiter?
Edit: I do know about the CSV module (and I have used it occasionally), but I do not need to load in memory every line of this file - I need the first column. So let's focus on string parsing.
>>> a = '123456'
>>> print a.split('2', 1)[0]
1
>>> print a.split('4', 1)[0]
123
>>>
But, if you're dealing with a CSV file, then:
import csv
with open('some.csv') as fin:
    for row in csv.reader(fin):
        print int(row[0])
And the csv module will handle quoted columns containing quotes etc...
If the first field can't have an escaped delimiter in it (as in your case, where the first field is an integer) and there are no embedded newlines in any field (i.e., each row corresponds to exactly one physical line in the file), then the csv module is overkill and you could use your code from the question, or line.split(',', 1) as suggested by @Jon Clements.
To handle occasional lines that have no delimiter in them you could use str.partition:
with open('big.csv', 'rb') as file:
    for line in file:
        first, sep, rest = line.partition(b',')
        if sep:  # the line has ',' in it
            process_id(int(first))  # or `yield int(first)`
Note: s.split(',', 1)[0] silently returns a wrong result (the whole string) if there is no delimiter in the string.
The 'rb' file mode is used to avoid unnecessary end-of-line manipulation (and implicit decoding to Unicode on Python 3). It is safe to use if the csv file has '\n' at the end of each row, i.e., the newline is either '\n' or '\r\n'.
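A small demonstration of the difference between the two approaches:

```python
line = 'no delimiter here'

print(line.split(',', 1)[0])
# 'no delimiter here' -- the whole string, silently

first, sep, rest = line.partition(',')
print(sep == '')
# True -- partition makes the missing delimiter detectable
```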
Personally, I would do it with generators:
from itertools import imap
import csv
def int_of_0(x):
    return int(x[0])

def obtain(filepath, treat):
    with open(filepath, 'rb') as f:
        for i in imap(treat, csv.reader(f)):
            yield i

for x in obtain('essai.txt', int_of_0):
    pass  # instructions

How to use python csv module for splitting double pipe delimited data

I have got data which looks like:
"1234"||"abcd"||"a1s1"
I am trying to read and write using Python's csv reader and writer.
As the csv module's delimiter is limited to single char, is there any way to retrieve data cleanly? I cannot afford to remove the empty columns as it is a massively huge data set to be processed in time bound manner. Any thoughts will be helpful.
The docs and experimentation prove that only single-character delimiters are allowed.
Since cvs.reader accepts any object that supports iterator protocol, you can use generator syntax to replace ||-s with |-s, and then feed this generator to the reader:
def read_this_funky_csv(source):
    # be sure to pass a source object that supports
    # iteration (e.g. a file object, or a list of csv text lines)
    return csv.reader((line.replace('||', '|') for line in source), delimiter='|')
This code is pretty effective since it operates on one CSV line at a time, provided your CSV source yields lines that do not exceed your available RAM :)
>>> import csv
>>> reader = csv.reader(['"1234"||"abcd"||"a1s1"'], delimiter='|')
>>> for row in reader:
... assert not ''.join(row[1::2])
... row = row[0::2]
... print row
...
['1234', 'abcd', 'a1s1']
>>>
Unfortunately, delimiter is represented by a character in C. This means that it is impossible to have it be anything other than a single character in Python. The good news is that it is possible to ignore the values which are null:
reader = csv.reader(['"1234"||"abcd"||"a1s1"'], delimiter='|')
#iterate through the reader.
for x in reader:
    #use a numeric range here to ensure that you eliminate the
    #right things: odd indexes hold the empty strings produced
    #by the doubled delimiter, so they are discarded.
    values = [x[i] for i in range(len(x)) if i % 2 == 0]
There are other ways to accomplish this (a function could be written, for one), but this gives you the logic which is needed.
If your data literally looks like the example (the fields never contain '||' and are always quoted), and you can tolerate the quote marks, or are willing to slice them off later, just use .split
>>> '"1234"||"abcd"||"a1s1"'.split('||')
['"1234"', '"abcd"', '"a1s1"']
>>> list(s[1:-1] for s in '"1234"||"abcd"||"a1s1"'.split('||'))
['1234', 'abcd', 'a1s1']
csv is only needed if the delimiter is found within the fields, or to delete optional quotes around fields
