Reading in a file with a varying number of spaces as the delimiter? - python

I'm trying to read in a file, but it's looking really awkward because the number of spaces between the columns varies. This is what I have so far:
with open('sextractordata1488.csv') as f:
    # getting rid of title, aka unusable lines:
    for _ in xrange(15):
        next(f)
    for line in f:
        cols = line.split(' ')
        # 9 because it's 9 spaces before the first column with real data
        print cols[10]
I looked up how to do this and saw tr and sed commands, but they gave syntax errors when I attempted them, and I wasn't really sure where in the code to put them (in the for loop or before it?). I want to reduce all the spaces between columns to one space so that I can consistently get the one column without issues. At the moment, because it's a counter column running from 1 to 101, I only get 10 through 99, plus a bunch of spaces and pieces of other columns in between, since 1 and 101 have a different number of characters, and thus a different number of spaces from the beginning of the line.

Just use str.split() without an argument. The string is then split on arbitrary width whitespace. That means it doesn't matter how many spaces there are between non-whitespace content anymore:
>>> ' this is rather \t\t hard to parse without\thelp\n'.split()
['this', 'is', 'rather', 'hard', 'to', 'parse', 'without', 'help']
Note that leading and trailing whitespace are removed as well. Tabs, spaces, newlines, and carriage returns are all considered whitespace.
For completeness' sake, the first argument can also be set to None for the same effect. This is helpful to know when you need to limit the split with the second argument:
>>> ' this is rather \t\t hard to parse without\thelp\n'.split(None)
['this', 'is', 'rather', 'hard', 'to', 'parse', 'without', 'help']
>>> ' this is rather \t\t hard to parse without\thelp\n'.split(None, 3)
['this', 'is', 'rather', 'hard to parse without\thelp\n']

cols = line.split() should be sufficient
>> "a b".split()
['a', 'b']
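Applied to the loop from the question, that one change is all you need. A sketch reusing the question's filename and 15-line header skip; note that after splitting on arbitrary whitespace, the counter column should simply be index 0 (an assumption based on the question's description):
with open('sextractordata1488.csv') as f:
    # skip the 15 header lines, as in the question
    for _ in xrange(15):
        next(f)
    for line in f:
        cols = line.split()  # splits on any run of whitespace
        print cols[0]        # the counter column is now simply index 0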

Related

Replace spaces with non-breaking spaces according to a specific criterion

I want to clean up files that contain bad formatting, more precisely, replace "normal" spaces with non-breaking spaces according to a given criterion.
For example:
If in a sentence, I have:
"You need to walk 5 km."
I need to replace the space between 5 and km with a non-breaking space.
So far, I have managed to do this:
import os

unites = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']
# iterate over and read all files in the directory
for file in os.listdir():
    # check that the entry is a file
    if os.path.isfile(file):
        # open the file
        with open(file, 'r', encoding='utf-8') as f:
            # read the file
            content = f.read()
            # search for each unit in the file
            for i in unites:
                if i in content:
                    # find the next character after the unit
                    next_char = content[content.find(i) + len(i)]
                    # check if the next character is a space
                    if next_char == ' ':
                        # replace the space with a non-breaking space
                        content = content.replace(i + ' ', i + '\u00A0')
But this replaces all the spaces in the document, not just the ones I want.
Can you help me?
EDIT
After UlfR's answer, which was very useful and relevant, I would like to push my criteria further and make my search/replace more complex.
Now I would like to look at the characters before/after a word in order to replace spaces with non-breaking spaces. For example:
I want to search for the phrase "Can the search be hypothetical ?"
I would like the space between hypothetical and ? to be replaced by a non-breaking space.
Similarly, for "In the search it is necessary to refer to {figure 1.12}":
I would like the space between figure and 1.12 to be non-breaking, and likewise any space next to { or } (so all the spaces inside the braces in this case).
I've tried to do this :
units = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']
units_before_after = ['{']
nbsp = '\u00A0'
rgx = re.sub(r'(\b\d+)(%s) (%s)\b' % (units, units_before_after), r'\1%s\2' % nbsp, text)
print(rgx)
But I'm having some trouble; do you have any ideas to share?
You should use re to do the replacement. Like so:
import re

text = "You need to walk 5 km or 500000 cm."
units = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']
nbsp = '\u00A0'
print(re.sub(r'(\b\d+) (%s)\b' % '|'.join(units), r'\1%s\2' % nbsp, text))
Both the search and replace patterns are dynamically built, but basically you have a pattern that matches:
At the beginning of something \b
1 or more digits \d+
One space
One of the units km|m|cm|...
At the end of something \b
Then we replace all of that with the two captured groups and the nbsp string between them.
See the re documentation for more info on how to use regular expressions in Python. It's well worth the time invested to learn the basics, since it's a very powerful and useful tool!
Have fun :)
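Following up on the EDIT: the same re.sub idea extends to the new criteria. Here is a minimal sketch under my own assumptions (the sample text and the exact punctuation set are mine, not from the question):
import re

nbsp = '\u00A0'
text = 'Can the search be hypothetical ? See {figure 1.12} for details.'

# 1) non-breaking space before trailing punctuation such as ? ! ; :
text = re.sub(r' ([?!;:])', nbsp + r'\1', text)

# 2) inside {...}, make every ordinary space non-breaking
text = re.sub(r'\{[^{}]*\}',
              lambda m: m.group(0).replace(' ', nbsp),
              text)

print(text)
# -> 'Can the search be hypothetical\xa0? See {figure\xa01.12} for details.'
#    (\xa0 is the non-breaking space)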

Why does split() return more elements than split(" ") on same string?

I am using split() and split(" ") on the same string. But why does split(" ") return fewer elements than split()? I want to know for what specific input this would happen.
str.split with None as the argument (or no argument) splits on all whitespace characters, and this isn't limited to just the space you type with your spacebar.
In [457]: text = 'this\nshould\rhelp\tyou\funderstand'
In [458]: text.split()
Out[458]: ['this', 'should', 'help', 'you', 'understand']
In [459]: text.split(' ')
Out[459]: ['this\nshould\rhelp\tyou\x0cunderstand']
A list of all the whitespace characters that split(None) splits on can be found at All the Whitespace Characters? Is it language independent?
If you run the help command on the split() function you'll see this:
split(...)
    S.split([sep [,maxsplit]]) -> list of strings

    Return a list of the words in the string S, using sep as the delimiter
    string. If maxsplit is given, at most maxsplit splits are done. If sep
    is not specified or is None, any whitespace string is a separator and
    empty strings are removed from the result.
Therefore the difference between the two is that split() without a specified delimiter deletes the empty strings, while the call with a delimiter won't.
The method str.split called without arguments has a somewhat different behaviour.
First it splits by any whitespace character.
'foo bar\nbaz\tmeh'.split() # ['foo', 'bar', 'baz', 'meh']
But it also removes the empty strings from the output list.
' foo bar '.split(' ') # ['', 'foo', 'bar', '']
' foo bar '.split() # ['foo', 'bar']
In Python, the split function splits on a specific string if one is given, otherwise on whitespace (and then you can access the resulting list by index as usual):
s = "Hello world! How are you?"
s.split()
Out[9]: ['Hello', 'world!', 'How', 'are', 'you?']
s.split("!")
Out[10]: ['Hello world', ' How are you?']
s.split("!")[0]
Out[11]: 'Hello world'
In my own experience, the most confusion has come from split()'s different treatment of whitespace.
Passing a separator like ' ' vs. None triggers different behavior in split(). According to the Python documentation:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
Below is an example in which the sample string has a trailing space ' ', the same whitespace as the one passed to the second split(). The two calls behave differently, not because of some whitespace-character mismatch, but because of how the method was designed to work, perhaps for convenience in common scenarios; still, it can be confusing for people who expect split() to just split.
sample = "a b "
sample.split()
>>> ['a', 'b']
sample.split(' ')
>>> ['a', 'b', '']
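To see the difference in isolation, here is a tiny example of my own showing how consecutive spaces produce an extra empty string with an explicit separator:
>>> 'a  b'.split(' ')  # two spaces between a and b
['a', '', 'b']
>>> 'a  b'.split()
['a', 'b']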

How can I iterate over the words of a file in Python?

I have a .txt file and I want to iterate over its words. The problem is that I need to remove the punctuation marks before iterating over the words. I have tried this, but it isn't removing the punctuation marks.
file = open(file_name, "r")
for word in file.read().strip(",;.:- '").split():
    print word
file.close()
The problem with your current method is that .strip() doesn't really do what you want. It removes leading and trailing characters (and you want to remove ones within the text), and if you want to strip characters in addition to whitespace, they need to be passed as a single string of characters.
Another problem is that there are many more potential punctuation characters (question marks, exclamations, unicode ellipses, em dashes) that wouldn't get filtered out by your list. Instead, you can use string.punctuation to get a wide range of characters (note that string.punctuation doesn't include some non-English characters, so its viability may depend on the source of your input):
import string
punctuation = set(string.punctuation)
text = ''.join(char for char in text if char not in punctuation)
An even faster method (shown in other answers on SO) uses string.translate() to replace the characters:
import string
text = text.translate(string.maketrans('', ''), string.punctuation)
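Note that this two-argument translate call is Python 2 only. On Python 3, str.translate takes a single mapping, and str.maketrans has a three-argument form whose third argument lists characters to delete; a rough equivalent (my adaptation, not part of the original answer):
import string

# Python 3: build a table that maps every punctuation character to None,
# which deletes it
text = text.translate(str.maketrans('', '', string.punctuation))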
strip() only removes characters found at the beginning or end of a string.
So split() first to cut the line into words, then strip() to remove the punctuation.
import string

with open(file_name, "rt") as finput:
    for line in finput:
        for word in line.split():
            print word.strip(string.punctuation)
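For instance, on a sample line (my own example):
>>> import string
>>> [w.strip(string.punctuation) for w in "Hello, world again!".split()]
['Hello', 'world', 'again']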
Or use a natural language aware library like nltk: http://www.nltk.org/
You can try using the re module:
import re

with open(file_name) as f:
    for word in re.split('\W+', f.read()):
        print word
See the re documentation for more details.
Edit: The previous code ignores non-ASCII characters. In that case the following code can help:
import re

with open(file_name) as f:
    for word in re.compile('\W+', re.UNICODE).split(f.read().decode('utf8')):
        print word
The following code preserves apostrophes and blanks, and could easily be modified to preserve double quotations marks, if desired. It works by using a translation table based on a subclass of the string object. I think the code is fairly easy to understand. It might be made more efficient if necessary.
class SpecialTable(str):
    def __getitem__(self, chr):
        # keep spaces, apostrophes, digits, and ASCII letters; drop everything else
        if chr == 32 or chr == 39 or 48 <= chr <= 57 \
                or 65 <= chr <= 90 or 97 <= chr <= 122:
            return chr
        else:
            return None

specialTable = SpecialTable()

with open('temp2.txt') as inputText:
    for line in inputText:
        print(line)
        convertedLine = line.translate(specialTable)
        print(convertedLine)
        print(convertedLine.split(' '))
Here's typical output.
This! is _a_ single (i.e. 1) English sentence that won't cause any trouble, right?
This is a single ie 1 English sentence that won't cause any trouble right
['This', 'is', 'a', 'single', 'ie', '1', 'English', 'sentence', 'that', "won't", 'cause', 'any', 'trouble', 'right']
'nother one.
'nother one
["'nother", 'one']
I would remove the punctuation marks with the replace function after storing the words in a list like so:
with open(file_name, "r") as f_r:
    words = []
    for row in f_r:
        words.extend(row.split())  # extend, not append, so words stays a flat list of strings
    punctuation = [',', ';', '.', ':', '-']
    # strip each punctuation character in turn
    for y in punctuation:
        words = [x.replace(y, '') for x in words]

How to split up a string on multiple delimiters but only capture some?

I want to split a string on any combination of delimiters I provide. For example, if the string is:
s = 'This, I think,., کباب MAKES , some sense '
And the delimiters are \., ,, and \s. However I want to capture all delimiters except whitespace \s. The output should be:
['This', ',', 'I', 'think', ',.,', 'کباب', 'MAKES', ',', 'some', 'sense']
My solution so far is using the re module:
pattern = '([\.,\s]+)'
re.split(pattern, s)
However, this captures whitespace as well. I have tried using other patterns like [(\.)(,)\s]+ but they don't work.
Edit: #PadraicCunningham made an astute observation. For delimiters like Some text ,. , some more text, I'd only want to remove leading and trailing whitespace from ,. , and not whitespace within.
The following approach would be the simplest one, I suppose ...
import re

s = 'This, I think,., کباب MAKES , some sense '
pattern = '([\.,\s]+)'
splitted = [i.strip() for i in re.split(pattern, s) if i.strip()]
The output:
['This', ',', 'I', 'think', ',.,', 'کباب', 'MAKES', ',', 'some', 'sense']
NOTE: According to the new edit on the question, I've improved my old regex. The new one is quite long, but trust me, it works!
I suggest the pattern below as the delimiter for re.split():
(?<![,\.\ ])(?=[,\.]+)|(?<=[,\.])(?![,\.\ ])|(?<=[,\.])\ +(?![,\.\ ])|(?<![,\.\ ])\ +(?=[,\.][,\.\ ]+)|(?<![,\.\ ])\ +(?![,\.\ ])
My workaround here doesn't require any pre/post space modification. What makes the regex work is how you order the alternatives joined by |. My basic strategy is that any pattern dealing with leading spaces is evaluated last.
See DEMO
Additional
In a comment, #revo provided a shortened version of mine:
\s+(?=[^.,\s])|\b(?:\s+|(?=[,.]))|(?<=[,.])\b
See DEMO
I believe this is the most memory-efficient option, and it is quite efficient in computation time too:
import re
from itertools import chain
from operator import methodcaller

input_str = 'This, I think,., کباب MAKES , some sense '

iterator = filter(None,                  # filter out all the Nones and empty strings
    chain.from_iterable(                 # flatten the tuples into one long iterable
        map(methodcaller("groups"),      # take the groups from each match
            re.finditer("(.*?)(?:([\.,]+)|\s+|$)", input_str))))
# If you want a list:
list(iterator)
Update based on OP's last edit
Python 3.*:
list(filter(None, re.split('([.,]+(?:\s+[.,]+)*)|\s', s)))
Output:
['This', ',', 'I', 'think', ',.,', 'کباب', 'MAKES', ',', 'some', 'sense']
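The same pattern also handles the ,. , case from the edit, keeping the whitespace inside the delimiter group while dropping the surrounding spaces (my own test string):
>>> s2 = 'Some text ,. , some more text'
>>> list(filter(None, re.split('([.,]+(?:\s+[.,]+)*)|\s', s2)))
['Some', 'text', ',. ,', 'some', 'more', 'text']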

Stripping punctuation from unique strings in an input file

This question (Best way to strip punctuation from a string in Python) deals with stripping punctuation from an individual string. However, I'm hoping to read text from an input file but only print ONE COPY of each string, without ending punctuation. I have started something like this:
f = open('#file name ...', 'a+')
for x in set(f.read().split()):
    print x
But the problem is that if the input file has, for instance, this line:
This is not is, clearly is: weird
It treats the three different cases of "is" differently, but I want to ignore any punctuation and have it print "is" only once, rather than three times. How do I remove any kind of ending punctuation and then put the resulting string in the set?
Thanks for any help. (I am really new to Python.)
import re
for x in set(re.findall(r'\b\w+\b', f.read())):
should be more able to distinguish words correctly.
This regular expression finds compact groups of alphanumerical characters (a-z, A-Z, 0-9, _).
If you want to find letters only (no digits and no underscore), then replace the \w with [a-zA-Z].
>>> re.findall(r'\b\w+\b', "This is not is, clearly is: weird")
['This', 'is', 'not', 'is', 'clearly', 'is', 'weird']
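And the letters-only variant mentioned above behaves like this (my own example):
>>> re.findall(r'\b[a-zA-Z]+\b', "This is 2nd test")
['This', 'is', 'test']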
You can use translation tables if you don't mind replacing your punctuation characters with whitespace, for example:
>>> from string import maketrans
>>> punctuation = ",;.:"
>>> replacement = "    "  # must be the same length as the punctuation string
>>> trans_table = maketrans(punctuation, replacement)
>>> 'This is not is, clearly is: weird'.translate(trans_table)
'This is not is  clearly is  weird'
# And for your case of creating a set of unique words.
>>> set('This is not is  clearly is  weird'.split())
set(['This', 'not', 'is', 'clearly', 'weird'])
