I have a tab delimited file that has entries that look like this:
strand1 strand2 genename ID
AGCTCTG AGCTGT Erg1 ENSG010101
However, some of them have blank fields, for example:
strand1 strand2 genename ID
AGCGTGT AGTTGTT ENSG12955729
When I read in the lines in python:
data = [line.strip().split() for line in filename]
The second example becomes collapsed into a list of 3 indices:
['AGCGTGT', 'AGTTGTT', 'ENSG12955729']
I would rather that the empty field be retained so that the second example becomes a list of 4 indices:
['AGCGTGT', 'AGTTGTT', '', 'ENSG12955729']
How can I do this?
You can split explicitly on tab:
>>> "foo\tbar\t\tbaz".split('\t')
['foo', 'bar', '', 'baz']
By default, split() is going to split on any amount of whitespace.
Unless you can ensure that the first and last columns won't be blank, strip() is going to cause problems. If the data is otherwise well-formatted, this solution will work.
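To see the strip() pitfall concretely: if the first field happens to be blank, the leading tab is part of the data, and strip() removes it before you split. A REPL sketch with a hypothetical row whose first column is blank:
>>> line = '\tAGTTGTT\tErg1\tENSG12955729\n'
>>> line.strip().split('\t')
['AGTTGTT', 'Erg1', 'ENSG12955729']
>>> line.rstrip('\n').split('\t')
['', 'AGTTGTT', 'Erg1', 'ENSG12955729']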
If you know that the only tabs are field delimiters, and you still want to strip other whitespace (spaces) from around individual column values:
map(str.strip, line.split('\t'))
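Note that in Python 3, map() returns a lazy iterator rather than a list, so wrap it if you want to index the result:
fields = list(map(str.strip, line.split('\t')))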
As others have stated, you can explicitly split on tabs, but you would still need to clean up the line endings.
Better would be to use the csv module which handles delimited files:
import csv

with open('filename.txt', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    headers = next(reader)
    data = list(reader)
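If you'd rather look fields up by name than by position, csv.DictReader consumes the header row for you. A small sketch against the same assumed filename:
import csv

with open('filename.txt', newline='') as f:
    for row in csv.DictReader(f, delimiter='\t'):
        print(row['genename'] or '<blank>', row['ID'])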
When you don't give a parameter to the str.split() method, it treats any contiguous sequence of whitespace characters as a single separator. When you do give it a parameter, .split('\t') perhaps, it treats each individual instance of that string as a separator.
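The difference is easy to see in the REPL:
>>> "a \t b\t\tc".split()
['a', 'b', 'c']
>>> "a \t b\t\tc".split('\t')
['a ', ' b', '', 'c']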
The split method without any argument treats a continuous run of whitespace as a single separator, so it swallows any amount of whitespace. You need to pass an explicit argument to the method, which in your case is '\t'.
I'm always looking for puzzles to which I can apply pyparsing, no matter how impractical the results might prove to be. If nothing else, I can always look through my old answers to see what I've tried.
Don't judge me too harshly. :)
import pyparsing as pp

item = pp.Word(pp.alphanums) | pp.Empty().setParseAction(lambda x: '')
TAB = pp.Suppress(r'\t')
process_line = pp.Group(item('item') + TAB + item('item') + TAB + item('item') + TAB + item('item'))

with open('tab_delim.txt', 'rb') as tabbed:
    while True:
        line = tabbed.readline()
        if line:
            line = line.decode().strip().replace('\t', '\\t') + 3 * '\\t'
            print(line.replace('\\t', ' '))
            print('\t', process_line.parseWithTabs().parseString(line))
        else:
            break
Output:
strand1 strand2 genename ID
[['strand1', 'strand2', 'genename', 'ID']]
AGCTCTG AGCTGT Erg1 ENSG010101
[['AGCTCTG', 'AGCTGT', 'Erg1', 'ENSG010101']]
AGCGTGT AGTTGTT ENSG12955729
[['AGCGTGT', 'AGTTGTT', '', 'ENSG12955729']]
ABC DEF
[['ABC', 'DEF', '', '']]
Edit: altered the line TAB = pp.Suppress(r'\t') to the form PaulMcG suggested in a comment (it previously used a double backslash before the 't' in a non-raw string).
Related
I want to create a program in Python which reads the text from a text file and returns a string where everything except punctuation (period, comma, colon, semicolon, exclamation point, question mark) has been removed. This is my code:
def punctuation(filename):
    with open(filename, mode='r') as f:
        s = ''
        punctations = '''.,;:!?'''
        for line in f:
            for c in line:
                if c == punctations:
                    c.append(s)
        return s
But it only returns ''. I have also tried s = s + c instead of the append call, since append might not work on strings, but the problem still remains. Does anyone want to help me find out why?
How it should work:
If we have a text file named hello.txt with the text "Hello, how are you today?" then punctuation('hello.txt') should give us the output string ',?'
You were comparing each character to the whole string when you should have been checking whether it belongs in punctations. Also, append is not the appropriate method here, because you are not building a list; instead you can concatenate the characters onto s.
def punctuation(filename):
    with open(filename, mode='r') as f:
        s = ''
        punctations = set('.,;:!?')
        text = f.read()
        for char in text:  # iterates over characters, not lines
            if char in punctations:
                s += char
        return s
Another approach you could take to check whether something is a symbol is the isalnum() method, since it treats every value that is not a letter or digit as a symbol, in case you miss any symbols out:
if char != " " and char != "\n" and not char.isalnum():
The problem is that c == punctations will never be True, since c is a single character and punctations is a longer string. Another problem is that append doesn't work on strings; you should use + to concatenate strings instead.
def punctuation(filename):
    with open(filename, mode='r') as f:
        s = ''
        punctations = '''.,;:!?'''
        for line in f:
            for c in line:
                if c in punctations:
                    s += c
        return s
Issues
Some statements seem to have issues:
if c == punctations:  # 1
    c.append(s)       # 2
A single character is never equal to a string of many characters like your punctations (e.g. '.' == '.?' is never true). So we have to use a different boolean comparison operator: in, because a character can be an element in a collection of characters, whether a string, list or set.
You spotted it already: since c is a character and s is a str, not a list, we cannot use the method append. So we have to use s = s + c, or the shortcut s += c (your attempted fix was almost right).
Extract a testable & reusable function
Why not extract and test the part that fails:
def extract_punctuation(line):
    punctuation_chars = set('.,;:!?')  # typo in name, unique thus set
    symbols = []
    for char in line:
        if char in punctuation_chars:
            symbols.append(char)
    return symbols

# test
symbol_list = extract_punctuation('Hello, how are you today?')
print(symbol_list)           # [',', '?']
print(''.join(symbol_list))  # ',?'
Solution: use a function on file-read
Then you could reuse that function on any text, or a file like:
def punctuation(filename):
    symbols = []
    with open(filename, mode='r') as f:
        symbols += extract_punctuation(f.read())
    return ''.join(symbols)
Explained:
The default result is defined first as an empty list [] (so an empty file yields '').
The list of extracted symbols is added to symbols using += for each file-read inside the with block (here the whole file is read at once).
''.join(symbols) returns either '' for an empty list, or the collected symbols, e.g. ',?'.
See:
How do I concatenate two lists in Python?
Extend: return a list to play with
For a file with multiple sentences like dialogue.txt:
Hi, how are you?
Well, I am fine!
What about you .. ready to start, huh?
You could get a list (ordered by appearance) like:
[',', '?', ',', '!', '.', '.', ',', '?']
which will result in a string with ordered duplicates:
,?,!..,?
To extend, a list might be a better return type:
Filter unique as set: set( list_punctuation(filename) )
Count frequency using pandas: pd.Series(list_punctuation(filename)).value_counts()
def list_punctuation(filename):
    with open(filename, mode='r') as f:
        return extract_punctuation(f.read())

lp = list_punctuation('dialogue.txt')
print(lp)
print(''.join(lp))

unique = set(lp)
print(unique)

# pass the list to pandas to easily do statistics
import pandas as pd
frequency = pd.Series(lp).value_counts()
print(frequency)
This prints the above list and string, plus the following set
{',', '?', '!', '.'}
as well as the ranked frequency for each punctuation symbol:
, 3
? 2
. 2
! 1
Today I learned - by playing with punctuation & Python's data structures.
I have a txt file, with these contents:
a,b
c,d
e,f
g,h
i,j
k,l
And I am putting them into a list, using these lines:
keywords = []
solutions = []
for i in file:
    keywords.append((i.split(","))[0])
    solutions.append(((i.split(","))[1]))
but when I print() the solutions, here is what it displays:
['b\n', 'd\n', 'f\n', 'h\n', 'j\n', 'l']
How do I make it so that the \n-s are removed from the ends of the first 5 elements, but the last element is left unaltered, using as few lines as possible?
You can use str.strip() to trim the trailing whitespace. But as a more Pythonic approach, you'd be better off using the csv module to load your file content: it accepts a delimiter and returns an iterable of rows of separated items (here, single characters). Then use the zip() function to get the columns.
import csv

with open(file_name) as f:
    reader_obj = csv.reader(f, delimiter=',')  # here passing the delimiter is optional because by default it will consider comma as delimiter
    first_column, second_column = zip(*reader_obj)
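For the sample file above, this should give two tuples of strings (a quick check, assuming the file contains exactly the six lines shown):
print(first_column)   # ('a', 'c', 'e', 'g', 'i', 'k')
print(second_column)  # ('b', 'd', 'f', 'h', 'j', 'l')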
You need to str.strip() the whitespace/newline characters from the string after reading it, to remove the \n:
keywords = []
solutions = []
for i_raw in file:
    i = i_raw.strip()  # <-- removes extraneous whitespace from start/end of string
    keywords.append((i.split(","))[0])
    solutions.append(((i.split(","))[1]))
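If the goal really is as few lines as possible, here is a sketch using a generator expression with zip(), assuming every line contains exactly one comma (note it yields tuples, so wrap them in list() if you need mutable lists):
keywords, solutions = zip(*(line.rstrip('\n').split(',') for line in file))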
[[u'the', u'terse', u'announcement', u'state-run', u'news', u'agency', u'didnt', u'identify', u'the', u'aggressor', u'but', u'mister', u'has', u'accused', u'neighboring', u'country', u'of', u'threatening', u'to', u'attack', u'its', u'nuclear', u'installations'], [], [u'government', u'officials']]
I want to remove the empty lines, represented by empty brackets above[]. I am currently using:
import codecs

dataset = []
with codecs.open("textfile.txt", "r", "utf-8") as f:
    for line in f:
        dataset.append(line.lower().strip().split())  # dataset contains the data above in the format shown

lines = [sum((line for line in dataset if line), [])]
This statement takes pretty long to remove the empty lines. Is there a better way to remove empty lines from a list of lists and still maintain the format shown?
You can skip the whitespace-only lines when reading the file:
with codecs.open("textfile.txt", "r", "utf-8") as f:
    dataset = [line.lower().split() for line in f if not line.isspace()]
Note that split() ignores leading/trailing whitespace, so strip() is redundant.
EDIT:
Your question is very unclear, but it seems from the comments that all you want to do is read a file and remove all the empty lines. If that is correct then you simply need to do:
with codecs.open("textfile.txt", "r", "utf-8") as f:
    dataset = [line.lower() for line in f if not line.isspace()]
Now dataset is a list of lower-cased lines (i.e. strings). If you want to combine them into one string, you can do:
text = ''.join(dataset)
I am a little confused why you are doing:
lines = [sum((line for line in dataset if line), [])]
First off, by adding square brackets around the call to sum, you end up with a list containing a single element (the result of sum); not sure if that was intended...
Regardless, the result of sum() will be a list of all the words in the file that were separated by whitespace. If this is the desired end result, then you can simply use re.split:
import re

with open(...) as f:
    links = [re.split(r"\W+", f.read())]
    # is it possible you instead wanted:
    # links = re.split(r"\W+", f.read())
the "\W" simply means any whitespace ("\n"," ","\t" etc.) and the + means (1 or more multiples) so it will handle multiple newlines or multiple spaces.
I'm new to Python and am working on a script to read from a file that is delimited by double tabs (except for the first row, which is delimited by single tabs).
I tried the following:
import csv

f = open('data.csv', 'rU')
source = list(csv.reader(f, skipinitialspace=True, delimiter='\t'))
for row in source:
    print row
The thing is that csv.reader won't take a two-character delimiter. Is there a good way to make a double-tab delimiter work?
The output currently looks like this:
['2011-11-28 10:25:44', '', '2011-11-28 10:33:00', '', 'Showering', '']
['2011-11-28 10:34:23', '', '2011-11-28 10:43:00', '', 'Breakfast', '']
['2011-11-28 10:49:48', '', '2011-11-28 10:51:13', '', 'Grooming','']
There should only be three columns of data, however, it is picking up the extra empty fields because of the double tabs that separate the fields.
If performance is not an issue here, would you be fine with this quick and hacky solution?
f = open('data.csv', 'rU')
source = list(csv.reader(f, skipinitialspace=True, delimiter='\t'))
for row in source:
    print row[::2]
row[::2] does a stride on the list row, keeping the indexes that are multiples of 2. For the above output, index striding by an offset (here it's 2) is one way to go!
How much do you know about your data? Is it ever possible that an entry contains a double-tab? If not, I'd abandon the csv module and use the simple approach:
with open('data.csv') as data:
    for line in data:
        print line.strip().split('\t\t')
The csv module is nice for doing tricky things, like determining when a delimiter should split a string, and when it shouldn't because it is part of an entry. For example, say we used spaces as delimiters, and we had a row such as:
"this" "is" "a test"
We surround each entry with quotes, giving three entries. Clearly, if we use the approach of splitting on spaces, we'll get
['"this"', '"is"', '"a', 'test"']
which is not what we want. The csv module is useful here. But if we can guarantee that whenever a space shows up, it is a delimiter, then there's no need to use the power of the csv module. Just use str.split and call it a day.
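To make that concrete, here is the quoted example run through the csv module (csv.reader accepts any iterable of lines, so the string is inlined for the demo):
import csv

row = next(csv.reader(['"this" "is" "a test"'], delimiter=' '))
print(row)  # ['this', 'is', 'a test']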
I'm parsing a very big csv (big = tens of gigabytes) file in python and I need only the value of the first column of every line. I wrote this code, wondering if there is a better way to do it:
delimiter = ','
f = open('big.csv', 'r')
for line in f:
    pos = line.find(delimiter)
    id = int(line[0:pos])
Is there a more effective way to get the part of the string before the first delimiter?
Edit: I do know about the CSV module (and I have used it occasionally), but I do not need to load in memory every line of this file - I need the first column. So lets focus on string parsing.
>>> a = '123456'
>>> print a.split('2', 1)[0]
1
>>> print a.split('4', 1)[0]
123
>>>
But, if you're dealing with a CSV file, then:
import csv

with open('some.csv') as fin:
    for row in csv.reader(fin):
        print int(row[0])
And the csv module will handle quoted columns containing quotes etc...
If the first field can't have an escaped delimiter in it (as in your case, where the first field is an integer) and there are no embedded newlines in any field, i.e., each row corresponds to exactly one physical line in the file, then the csv module is overkill and you could use your code from the question, or line.split(',', 1) as suggested by @Jon Clements.
To handle occasional lines that have no delimiter in them you could use str.partition:
with open('big.csv', 'rb') as file:
    for line in file:
        first, sep, rest = line.partition(b',')
        if sep:  # the line has ',' in it
            process_id(int(first))  # or `yield int(first)`
Note: s.split(',', 1)[0] silently returns a wrong result (the whole string) if there is no delimiter in the string.
'rb' file mode is used to avoid unnecessary end-of-line manipulation (and implicit decoding to Unicode on Python 3). It is safe to use if the csv file has '\n' at the end of each row, i.e., the newline is either '\n' or '\r\n'.
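A quick REPL illustration of that note:
>>> 'no delimiter here'.split(',', 1)[0]
'no delimiter here'
>>> 'no delimiter here'.partition(',')
('no delimiter here', '', '')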
Personally, I would do it with generators:
from itertools import imap
import csv

def int_of_0(x):
    return int(x[0])

def obtain(filepath, treat):
    with open(filepath, 'rb') as f:
        for i in imap(treat, csv.reader(f)):
            yield i

for x in obtain('essai.txt', int_of_0):
    # instructions
    pass
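Note that itertools.imap exists only in Python 2; in Python 3 the builtin map() is already lazy, so an equivalent sketch (same hypothetical essai.txt) would be:
import csv

def obtain(filepath, treat):
    with open(filepath, newline='') as f:
        for i in map(treat, csv.reader(f)):
            yield i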