I'm using regular expressions to split a string using multiple delimiters. But if two of my delimiters occur next to each other in the string, it puts an empty string in the resulting list. For example:
re.split(',|;', "This,is;a,;string")
Results in
['This', 'is', 'a', '', 'string']
Is there any way to avoid getting '' in my list without adding ,; as a delimiter?
Try this:
import re
re.split(r'[,;]+', 'This,is;a,;string')
> ['This', 'is', 'a', 'string']
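One caveat: the character class only collapses runs of delimiters inside the string. If the input starts or ends with a delimiter, re.split() still returns an empty string at that end. A minimal sketch of filtering those out follows; the helper name split_nonempty and its default delimiters are made up for illustration and are not part of the original answer.

```python
import re

def split_nonempty(text, delimiters=",;"):
    """Split text on runs of the given delimiter characters, dropping empty pieces."""
    # re.escape guards against delimiters that are special inside a character class.
    pattern = "[" + re.escape(delimiters) + "]+"
    return [piece for piece in re.split(pattern, text) if piece]

print(split_nonempty("This,is;a,;string"))    # ['This', 'is', 'a', 'string']
print(split_nonempty(",;leading;trailing,"))  # ['leading', 'trailing']
```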
I'm trying to read in a file but it's looking really awkward because each of the spaces between columns is different. This is what I have so far:
with open('sextractordata1488.csv') as f:
    #getting rid of title, aka unusable lines:
    for _ in xrange(15):
        next(f)
    for line in f:
        cols = line.split(' ')
        #9 because it's 9 spaces before the first column with real data
        print cols[10]
I looked up how to do this and saw tr and sed commands, but they gave syntax errors when I tried them, and I wasn't really sure where in the code to put them (in the for loop or before it?). I want to reduce all the spaces between columns to one space so that I can consistently get the one column without issues. At the moment, because it's a counter column from 1 to 101, I only get 10 through 99, plus a bunch of spaces and parts of other columns in between, because 1 and 101 have a different number of characters and thus a different number of spaces from the beginning of the line.
Just use str.split() without an argument. The string is then split on arbitrary width whitespace. That means it doesn't matter how many spaces there are between non-whitespace content anymore:
>>> ' this is rather \t\t hard to parse without\thelp\n'.split()
['this', 'is', 'rather', 'hard', 'to', 'parse', 'without', 'help']
Note that leading and trailing whitespace are removed as well. Tabs, spaces, newlines, and carriage returns are all considered whitespace.
For completeness' sake, the first argument can also be set to None for the same effect. This is helpful to know when you need to limit the number of splits with the second argument:
>>> ' this is rather \t\t hard to parse without\thelp\n'.split(None)
['this', 'is', 'rather', 'hard', 'to', 'parse', 'without', 'help']
>>> ' this is rather \t\t hard to parse without\thelp\n'.split(None, 3)
['this', 'is', 'rather', 'hard to parse without\thelp\n']
cols = line.split() should be sufficient:
>>> "a    b".split()
['a', 'b']
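Applied to the file-reading loop from the question, the idea might look like the sketch below. It is written for Python 3, uses io.StringIO with made-up rows as a stand-in for the real file, and skips two header lines rather than fifteen; the helper name read_column is hypothetical.

```python
import io
from itertools import islice

# Stand-in for the real file: two header lines, then rows padded with
# a varying number of spaces (made-up data, not from sextractordata1488.csv).
sample = io.StringIO(
    "# header line 1\n"
    "# header line 2\n"
    "         1   10.5   3.2\n"
    "        10   11.0   4.1\n"
    "       101    9.8   2.7\n"
)

def read_column(f, index, header_lines):
    """Return one whitespace-separated column from f, skipping the header."""
    values = []
    for line in islice(f, header_lines, None):
        cols = line.split()   # collapses any run of spaces or tabs
        if cols:              # ignore blank lines
            values.append(cols[index])
    return values

print(read_column(sample, 0, header_lines=2))  # ['1', '10', '101']
```

Because split() ignores leading whitespace, the counter column is always at index 0, no matter how many digits it has.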
I have a string composed of numbers and letters: string = 'this1234is5678it', and I would like the string.split() output to give me a list like ['this', '1234', 'is', '5678', 'it'], splitting at where numbers and letter meet. Is there an easy way to do this?
You can use a regular expression for this.
import re
s = 'this1234is5678it'
re.split(r'(\d+)', s)
Running example http://ideone.com/JsSScE
Outputs ['this', '1234', 'is', '5678', 'it']
Update
Steve Rumbalski mentioned in the comments the importance of the parentheses in the regex. He quotes from the documentation:
"If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."
Without the parentheses the result would be ['this', 'is', 'it'].
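To make the difference concrete, here is a small comparison of the two patterns. The last part goes beyond the original answer: it shows an edge case (an input that starts and ends with digits, chosen here for illustration) where the capturing split leaves empty strings at the ends, and one way to drop them.

```python
import re

s = 'this1234is5678it'

print(re.split(r'\d+', s))    # ['this', 'is', 'it']
print(re.split(r'(\d+)', s))  # ['this', '1234', 'is', '5678', 'it']

# When the string starts or ends with digits, the split produces empty
# strings at the ends; filtering on truthiness drops them.
parts = [p for p in re.split(r'(\d+)', '12ab34') if p]
print(parts)                  # ['12', 'ab', '34']
```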
This function takes in a string text, and returns a list which contains lists of strings, one list for each sentence in the string text.
Sentences are separated by one of the strings ".", "?", or "!". We ignore the possibility of other punctuation separating sentences, so 'Mr.X' will turn into two sentences, and 'don't' will be two words.
For example, the text is
Hello, Jack. How is it going? Not bad; pretty good, actually... Very very
good, in fact.
And the function returns:
[['hello', 'jack'],
['how', 'is', 'it', 'going'],
['not', 'bad', 'pretty', 'good', 'actually'],
['very', 'very', 'good', 'in', 'fact']]
The most confusing part is how to make the function detect the characters , . ! ? and how to build a list of lists containing the words of each sentence.
Thank you.
This sounds very much like a homework problem to me, so I'll provide general tips instead of exact code.
A string has the split(char) function on it. You can use this to split your string based on a specific character. However, you will have to use a loop and perform the split multiple times.
You could also use a regular expression to find matches (that would be a better solution). That would let you find all matches at once. Then you would iterate over the matches and split them on spaces, while stripping out punctuation.
Edit: Here's an example of a regular expression you could use to get sentence groups all at once:
\s*([^.?!]+)\s*
The \s* on either side of the parentheses causes any extra spaces to be removed from the result, and the parentheses form a capture group. You can use re.findall() to get a list of all captured results, and then you can loop over these items and use re.split() and some conditional logic to append all the words to a new list.
Let me know how you get along with that, and if you have any other questions please provide us the code you have so far.
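As a quick check of what that pattern captures, the sketch below runs re.findall() on the example text from the question. It only covers the sentence-grouping step; splitting each group into words is left as described above.

```python
import re

text = ("Hello, Jack. How is it going? Not bad; pretty good, actually... "
        "Very very good, in fact.")

# Each match is one run of characters that contains no '.', '?' or '!'.
sentences = re.findall(r'\s*([^.?!]+)\s*', text)
print(sentences)
# ['Hello, Jack', 'How is it going', 'Not bad; pretty good, actually', 'Very very good, in fact']
```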
You can use re.split():
>>> s="Hello, Jack. How is it going? Not bad; pretty good, actually... Very very good, in fact."
>>> import re
>>> my_s = [re.split(r'\W',i) for i in re.split(r'\.|\?|\!',s) if len(i)]
To remove the empty strings, you can do this:
>>> [[x for x in i if len(x)]for i in my_s]
[['Hello', 'Jack'], ['How', 'is', 'it', 'going'], ['Not', 'bad', 'pretty', 'good', 'actually'], ['Very', 'very', 'good', 'in', 'fact']]
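The expected output in the question is lowercase, which the session above does not do. One possible way to fold both steps and the lowercasing into a single function is sketched here; the function name sentences_to_words is made up, and using re.findall(r'\w+', ...) instead of re.split(r'\W', ...) is a substitution on my part, not the answer's method.

```python
import re

def sentences_to_words(text):
    """Return a list of sentences, each a list of lowercase words."""
    result = []
    for sentence in re.split(r'[.?!]', text):
        # \w+ picks out runs of letters, digits and underscores, so
        # "don't" becomes 'don' and 't', as the question allows.
        words = re.findall(r'\w+', sentence.lower())
        if words:  # skip the empty pieces left by "..." or trailing punctuation
            result.append(words)
    return result

text = "Hello, Jack. How is it going? Not bad; pretty good, actually... Very very good, in fact."
print(sentences_to_words(text))
# [['hello', 'jack'], ['how', 'is', 'it', 'going'], ['not', 'bad', 'pretty', 'good', 'actually'], ['very', 'very', 'good', 'in', 'fact']]
```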
Using Python, I'm trying to replace the special chars in the following text:
"The# dog.is.yelling$at!Me to"
with spaces so that afterwards I would get the list:
['The','dog','is','yelling','at','me']
How do I do that in one line?
You can use regular expressions:
>>> import re
>>> re.split(r"[#$!.\s]+", "The# dog.is.yelling$at!Me to" )
['The', 'dog', 'is', 'yelling', 'at', 'Me', 'to']
Or, to split on any sequence of non-alphanumeric chars, as pointed out by @thg435:
>>> re.split(r"\W+", "The# dog.is.yelling$at!Me to" )
['The', 'dog', 'is', 'yelling', 'at', 'Me', 'to']
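The question literally asks for the special characters to be replaced with spaces first. A sketch of that two-step variant, assuming "special" means anything that is neither a word character nor whitespace:

```python
import re

text = "The# dog.is.yelling$at!Me to"

# Replace every character that is neither a word character nor whitespace
# with a space, then split on runs of whitespace.
words = re.sub(r'[^\w\s]', ' ', text).split()
print(words)  # ['The', 'dog', 'is', 'yelling', 'at', 'Me', 'to']
```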
>>> ".a string".split('.')
['', 'a string']
>>> "a .string".split('.')
['a ', 'string']
>>> "a string.".split('.')
['a string', '']
>>> "a ... string".split('.')
['a ', '', '', ' string']
>>> "a ..string".split('.')
['a ', '', 'string']
>>> 'this  is a test'.split(' ')
['this', '', 'is', 'a', 'test']
>>> 'this  is a test'.split()
['this', 'is', 'a', 'test']
Why is split() different from split(' ') when the string being split only has spaces as whitespace?
Why does split('.') turn "..." into ['','']? split() does not consider there to be an empty word between two separators...
The docs are clear about this (see @agf below), but I'd like to know why this behaviour was chosen.
I have looked in the source code (here) and thought line 136 should be just less than: ...i < str_len...
See the str.split docs; this behavior is specifically mentioned:
If sep is given, consecutive delimiters are not grouped together and
are deemed to delimit empty strings (for example, '1,,2'.split(',')
returns ['1', '', '2']). The sep argument may consist of multiple
characters (for example, '1<>2<>3'.split('<>') returns ['1', '2',
'3']). Splitting an empty string with a specified separator returns
[''].
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace. Consequently,
splitting an empty string or a string consisting of just whitespace
with a None separator returns [].
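The cases named in the quoted documentation can be checked directly. The first and third groups below restate the docs' own examples; the whitespace string in the middle group is an extra input chosen for illustration.

```python
# With an explicit separator, every delimiter counts, so empty strings appear.
assert '1,,2'.split(',') == ['1', '', '2']
assert '1<>2<>3'.split('<>') == ['1', '2', '3']
assert ''.split(',') == ['']

# With no separator (or None), a run of whitespace acts as one delimiter
# and leading or trailing whitespace produces no empty strings.
assert ' 1  2 \t 3 \n'.split() == ['1', '2', '3']
assert ' 1  2 \t 3 \n'.split(None) == ['1', '2', '3']

# Empty or whitespace-only input gives an empty list.
assert ''.split() == []
assert '   '.split() == []

print("all split() checks passed")
```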
Python tries to do what you would expect. Most people not thinking too hard would probably expect
'1 2 3 4 '.split()
to return
['1', '2', '3', '4']
Think about splitting data where spaces have been used instead of tabs to create fixed-width columns -- if the data is different widths, there will be different number of spaces in each row.
There is often trailing whitespace at the end of a line that you can't see, and the default ignores it as well -- it gives you the answer you'd visually expect.
When it comes to the algorithm used when a delimiter is specified, think about a row in a CSV file:
1,,3
means there is data in the 1st and 3rd columns, and none in the second, so you would want
'1,,3'.split(',')
to return
['1', '', '3']
otherwise you wouldn't be able to tell what column each string came from.
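A short sketch of the column argument, using the same '1,,3' row: keeping the empty string preserves each field's position, while dropping it (done here by hand, purely for illustration) shifts the later fields.

```python
line = '1,,3'

fields = line.split(',')
print(fields)     # ['1', '', '3']
print(fields[2])  # '3' is still in the third column

# If empty fields were dropped, the third column's value would move
# into the second position and the column mapping would be lost.
collapsed = [field for field in fields if field]
print(collapsed)  # ['1', '3']
```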