I'm trying to split an inputted document at specific characters. I need to split them at [ and ] but I'm having a difficult time figuring this out.
def main():
    for x in docread:
        words = x.split('[]')
        for word in words:
            doclist.append(word)
This is the part of the code that splits the lines into my list. However, it is returning each whole line of the document instead of the pieces.
For example, I want to convert
['I need to [go out] to lunch', 'and eat [some food].']
to
['I need to', 'go out', 'to lunch and eat', 'some food', '.']
Thanks!
You could try using re.split() instead:
>>> import re
>>> re.split(r"[\[\]]", "I need to [go out] to lunch")
['I need to ', 'go out', ' to lunch']
The odd-looking regular expression [\[\]] is a character class that means split on either [ or ]. The internal \[ and \] must be backslash-escaped because they use the same characters as the [ and ] to surround the character class.
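As a minimal sketch of how this could be applied to the list from the question: the lines are first joined with a single space (an assumption, based on the desired output merging 'to lunch' and 'and eat' into one entry), then split on either bracket, and empty or whitespace-only pieces are dropped. The function name split_on_brackets is made up for this illustration.

```python
import re

def split_on_brackets(lines):
    # Join the lines so text can continue across line boundaries (assumption).
    text = " ".join(lines)
    # Split at every '[' or ']' using a character class.
    parts = re.split(r"[\[\]]", text)
    # Trim surrounding whitespace and drop pieces that end up empty.
    return [p.strip() for p in parts if p.strip()]

doc = ['I need to [go out] to lunch', 'and eat [some food].']
print(split_on_brackets(doc))
```

If the surrounding whitespace matters for your use case, leave out the strip() calls and keep the raw pieces.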
str.split() splits at the exact string you pass to it, not at any of its characters. Passing "[]" would split at occurrences of [], but not at individual brackets. Possible solutions are
splitting twice:
words = [z for y in x.split("[") for z in y.split("]")]
using re.split().
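To make the difference concrete, here is a small sketch (the sample strings are only illustrative) showing what str.split() does with "[]" as the separator, and how the splitting-twice approach behaves on one line from the question.

```python
line = "I need to [go out] to lunch"

# str.split() looks for the whole separator "[]", which never occurs here,
# so the line comes back as a single, unsplit element.
whole = line.split("[]")
print(whole)

# The separator only matches where "[" is immediately followed by "]".
adjacent = "a[]b".split("[]")
print(adjacent)

# Splitting twice: first on "[", then each resulting piece on "]".
words = [z for y in line.split("[") for z in y.split("]")]
print(words)
```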
string.split(s), the one you are using, treats the entire content of 's' as a separator. In other words, your input should have looked like "[]'I need to []go out[] to lunch', 'and eat []some food[].'[]" for it to give you the results you want.
You need to use split() from the re module, which treats its pattern argument as a regex. Note that '[]' on its own is not a valid regex, so the brackets have to be escaped inside a character class:
import re

def main():
    for x in docread:
        # Split at either '[' or ']'
        words = re.split(r'[\[\]]', x)
        for word in words:
            doclist.append(word)
Related
I'm doing research on sentiment analysis. In a list of data, I'd like to remove all punctuation, in order to get to the words in their pure version. But I would like to keep emoticons, such as :) and :/.
Is there a way to say in Python that I want to remove all punctuation signs unless they appear in a combination such as ":)", ":/", "<3"?
Thanks in advance
This is my code for the stripping:
for message in messages:
    message=message.lower()
    message=message.replace("!","")
    message=message.replace(".","")
    message=message.replace(",","")
    message=message.replace(";","")
    message=message.replace(";","")
    message=message.replace("?","")
    message=message.replace("/","")
    message=message.replace("#","")
You can try this regex:
(?<=\w)[^\s\w](?![^\s\w])
Usage:
import re
print(re.sub(r'(?<=\w)[^\s\w](?![^\s\w])', '', your_data))
Here is an online demo.
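The idea is to match a single special character (anything that is neither whitespace nor a word character) if it is preceded by a word character and not followed by another special character.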
If the regex doesn't work as you expect, you can customize it a little. For example if you don't want it to match commas, you can remove them from the character class like so: (?<=\w)[^\s\w,](?![^\s\w]). Or if you want to remove the emoticon :-), you can add it to the regex like so: (?<=\w)[^\s\w](?![^\s\w])|:-\).
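A small sketch of how the pattern behaves on a few made-up sample strings may help when deciding whether it needs customizing. The names PUNCT_RE and strip_punctuation are just for this illustration.

```python
import re

# Pattern from the answer above: a character that is neither whitespace nor a
# word character, preceded by a word character and not followed by another
# such special character.
PUNCT_RE = re.compile(r"(?<=\w)[^\s\w](?![^\s\w])")

def strip_punctuation(text):
    return PUNCT_RE.sub("", text)

print(strip_punctuation("Hello, world! :) nice <3"))
print(strip_punctuation("Sorry:("))
print(strip_punctuation("wait..."))
```

Note that runs of punctuation such as "..." are left alone, because each character in the run either follows or precedes another special character. Whether that is acceptable depends on your data.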
Going off of the work you've already done using str.replace, you could do something like this:
lines = [
    "Sentence 1.",
    "Sentence 2 :)",
    "Sentence <3 ?"
]
emoticons = {
    ":)": "000smile",
    "<3": "000heart"
}
emoticons_inverse = {v: k for k, v in emoticons.items()}
punctuation = ",./<>?;':\"[]\\{}|`~!@#$%^&*()_+-="
lines_clean = []
for line in lines:
    #Replace emoticons with non-punctuation
    for emote, rpl in emoticons.items():
        line = line.replace(emote, rpl)
    #Remove punctuation
    for char in line:
        if char in punctuation:
            line = line.replace(char, "")
    #Revert emoticons
    for emote, rpl in emoticons_inverse.items():
        line = line.replace(emote, rpl)
    lines_clean.append(line)
print(lines_clean)
This is not super efficient, though, so if performance becomes a bottleneck you might want to examine how you can make this faster.
Output: python3 test.py
['Sentence 1', 'Sentence 2 :)', 'Sentence <3 ']
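If performance does become an issue, one possible direction (a sketch, not something from the answer above) is to do the whole job in a single regex pass: the pattern tries the emoticons first and falls back to a single punctuation character, and the replacement keeps only what the emoticon group captured. The emoticon list and the names EMOTICONS, PATTERN and remove_punctuation are assumptions for this illustration.

```python
import re
import string

EMOTICONS = [":)", ":/", "<3"]  # assumed list; extend as needed

# Longest emoticons first, so a longer one is never cut short by a prefix.
_emoticon_alt = "|".join(
    re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True)
)
# Group 1 matches an emoticon; otherwise match one punctuation character.
PATTERN = re.compile(
    "(" + _emoticon_alt + ")|[" + re.escape(string.punctuation) + "]"
)

def remove_punctuation(text):
    # Keep the emoticon if group 1 matched, otherwise drop the character.
    return PATTERN.sub(lambda m: m.group(1) or "", text)

for line in ["Sentence 1.", "Sentence 2 :)", "Sentence <3 ?"]:
    print(repr(remove_punctuation(line)))
```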
Your best bet might be to simply declare a list of emoticons as a variable. Then compare your punctuation to the list. If it's not in the list, remove it from the string.
Edit: Instead of using a whole block of str.replace() over and over, you might try something like:
to_remove = ".,;:!()\""
for char in to_remove:
    message = message.replace(char, "")
Edit 2:
The simplest way (skill-wise) might be to try this:
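from string import punctuation

emoticons = [":)", ":D", ":("]
word_list = message.split(" ")
cleaned = []
for word in word_list:
    if word not in emoticons:
        # Python 3: delete every punctuation character from the word
        word = word.translate(str.maketrans("", "", punctuation))
    cleaned.append(word)
output = " ".join(cleaned)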
Once again, this will only work on emoticons that are separated from other characters, i.e. "Sure :D" but not "Sorry:(".
>>>user_sentence = "hello \t how are you?"
>>>import re
>>>user_sentenceSplit = re.findall(r"([\s]|[\w']+|[.,!?;])",user_sentence)
>>>print user_sentenceSplit
I get ['hello', '\t', 'how', 'are', 'you', '?']
I don't know how to create any code that will replace the '\t' with 'tab'.
I do not believe that replacing \t in the original string will work reliably. You have two issues:
Your code also outputs spaces as tokens, but you do not want them.
If the \t sits between letters, the replacement text will become part of a word token.
So, replace [\s] with the pattern [^\S ], which matches any whitespace character except a regular space (add more excluded whitespace characters to the negated character class if necessary). Then iterate through all the tokens and, whenever a token is equal to a tab, replace it with the value tab. The best way is to use re.finditer and push the found values into a list variable; see the sample code below:
import re
user_sentence = "hello \t how are you?"
user_sentenceSplit = []
for x in re.finditer(r"[^\S ]|[\w']+|[.,!?;]",user_sentence):
    if x.group() == "\t": # if it is a tab, replace the value
        user_sentenceSplit.append("tab")
    else: # else, push the match value
        user_sentenceSplit.append(x.group())
print(user_sentenceSplit)
See the Python demo
I think str.replace would do the job.
user_sentence.replace('\t', 'tab')
Do this before splitting the string.
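One caveat with replacing before splitting is that a tab sitting directly between two words would glue 'tab' onto them. A possible workaround, sketched below under the assumption that padding the replacement with spaces is acceptable, is to replace each tab with ' tab ' and then tokenize without the whitespace alternative. The names TOKEN_RE and tokenize are made up for this example.

```python
import re

# Words (with apostrophes) or single punctuation marks; whitespace is skipped.
TOKEN_RE = re.compile(r"[\w']+|[.,!?;]")

def tokenize(sentence):
    # Pad the replacement so 'tab' never fuses with a neighbouring word.
    padded = sentence.replace("\t", " tab ")
    return TOKEN_RE.findall(padded)

print(tokenize("hello \t how are you?"))
print(tokenize("hello\thow"))
```

With this approach a real word "tab" in the input cannot be told apart from a replaced tab character, so it only fits if that ambiguity does not matter.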
This is just how Python displays strings, and you should not worry about it. The interactive interpreter shows the repr of a string, in which a tab character is written as \t. You do not need to do anything about it, because the value is still treated as a tab in any operation performed on it. For example:
>>> my_string = 'Yes Hello So?' # <- String with tab
>>> my_string
'Yes\tHello\tSo?' # <- Stored tab as '\t'
>>> print my_string
Yes Hello So? # While printing, again tab
However, your exact requirement is not clear to me. In case you want to replace each \t with the string tab, you may do:
>>> my_string = my_string.replace('\t', 'tab')
>>> my_string
'YestabHellotabSo?'
where my_string is holding the value I mentioned in previous example.
Anyone know how I can find the character in the center that is surrounded by spaces?
1 + 1
I'd like to be able to separate the + in the middle to use in a if/else statement.
Sorry if I'm not too clear, I'm a Python beginner.
I think you are looking for something like the split() method which will split on white space by default.
Suppose we have a string s
s = "1 + 1"
chunks = s.split()
print(chunks[1]) # Will print '+'
This regular expression will detect a single character surrounded by spaces, if the character is a plus, minus, multiplication, or division sign: r' ([-+*/]) '. The hyphen comes first inside the brackets so that it is read literally rather than as a range. Note the spaces inside the apostrophes. The parentheses "capture" the character in the middle. If you need to recognize a different set of characters, change the set inside the brackets.
If you haven't dealt with regular expressions before, read up on the re module. They are very useful for simple text processing. The two relevant features here are "character classes" (the square brackets in my example) and "capturing parentheses" (the round parens).
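Here is a minimal sketch of how the captured character could feed the if/else statement mentioned in the question. The function names find_operator and describe, and the strings they return, are assumptions made for this example.

```python
import re

# One operator character with a space on each side; the hyphen is listed
# first inside the class so it is taken literally.
OPERATOR_RE = re.compile(r" ([-+*/]) ")

def find_operator(expression):
    match = OPERATOR_RE.search(expression)
    return match.group(1) if match else None

def describe(expression):
    op = find_operator(expression)
    if op == "+":
        return "addition"
    elif op == "-":
        return "subtraction"
    elif op == "*":
        return "multiplication"
    elif op == "/":
        return "division"
    else:
        return "no operator found"

print(find_operator("1 + 1"))
print(describe("9 * 5"))
```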
You can use regex:
import re
s = "1 + 1"
a = re.compile(r' (?P<sym>.) ')
a.search(s).group('sym')
import re

def find_between(string, start_=' ', end_=' '):
    re_str = r'{}([-+*/%^]){}'.format(start_, end_)
    try:
        return re.search(re_str, string).group(1)
    except AttributeError:
        return None

print(find_between('9 * 5', ' ', ' '))
If you don't know how many spaces separate your central character, I'd use the following:
s = '1 +   1'
middle = list(filter(None, s.split(' ')))[1]
print(middle) # +
The split works as in the solution provided by Zac, but when you split explicitly on a single space and there is more than one space between tokens, the returned list will have a bunch of '' elements, which we can get rid of with the filter(None, ...) function. (Calling s.split() with no argument collapses runs of whitespace on its own, so it never produces '' elements.)
Then it's just a matter of extracting your second element.
Check it in action at https://eval.in/636622
If we look at it step-by-step, then here is how it all works using a python console:
>>> s = '1 +   1'
>>> s.split(' ')
['1', '+', '', '', '1']
>>> list(filter(None, s.split(' ')))
['1', '+', '1']
>>> list(filter(None, s.split(' ')))[1]
'+'
I have a text file something like -
$ abc
defghjik
am here
not now
$ you
are not
here but go there
$ ....
I want to extract text between two $ signs and put that text into a list or a dict. How can I do this in python by reading the file?
I tried regex but it gives me alternate values of the text file:
f1 = open('some.txt','r')
lines = f1.read()
x = re.findall(r'$(.*?)$', lines, re.DOTALL)
I want the output as something like below -
['abc', 'defghjik', 'am here', 'not now']
['you', 'are not', 'here but go there']
Sorry but am new to python and trying to learn, any help appreciated! Thanks!
In regular expressions $ is a character of special meaning and needs to be escaped to match a literal character. Also to match multiple parts I would use a lookahead (?=...) assertion to assert matching a literal $ character.
>>> x = re.findall(r'(?s)\$\s*(.*?)(?=\$)', lines)
>>> [i.splitlines() for i in x]
[['abc', 'defghjik', 'am here', 'not now'], ['you', 'are not', 'here but go there']]
Working Demo
$ has a special meaning in regex, so to match it literally you need to escape it first. Note that inside a character class ([]), $ and most other metacharacters lose their special meaning, so no escaping is required there. The following regex should do it:
\$\s*([^$]+)(?=\$)
Debuggex Demo
Demo:
>>> lines = '''$ abc
defghjik
am here
not now
$ you
are not
here but go there
$'''
>>> it = re.finditer(r'\$\s*([^$]+)(?=\$)', lines, re.DOTALL)
>>> [x.group(1).splitlines() for x in it]
[['abc', 'defghjik', 'am here', 'not now'], ['you', 'are not', 'here but go there']]
Regex may not actually be what you want: your desired output has every line as an individual entry in a list. I'd suggest just using lines.split(), and then iterating over the resulting array.
I'll write this as if you just need to print the text you want as output. Adapt as necessary.
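with open('some.txt', 'r') as f1:
    lines = f1.read()

lists = []
for s in lines.split('\n'):
    if s.startswith('$'):
        # A '$' line starts a new section: print the previous one first.
        if lists:
            print(lists)
        lists = []
        rest = s[1:].strip()  # text after the '$' belongs to the new section
        if rest:
            lists.append(rest)
    elif s:
        lists.append(s)
if lists:
    print(lists)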
Happy Python-ing! Welcome to the club. :)
$ holds a special meaning in a regex. It is an anchor. It matches the end of the string or just before the newline at the end of the string. See here :
Regular Expression Operations
You can escape the $ sign by prefixing it with a '\' character, so it won't be treated as an anchor.
Better yet, you don't need to use regex at all here. You can use the split method of strings in python.
>>> string = '''$ abc
defghjik
am here
not now
$ you
are not
here but go there
$ '''
>>> string.split('$')
['', ' abc\ndefghjik\nam here\nnot now\n', ' you\nare not\nhere but go there\n', ' ']
And you get a list. To remove the empty string entries if you want, you can do this:
a = string.split('$')
while a.count('') > 0:
    a.remove('')
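To get from the split result to the list-of-lists shown in the question, each chunk can be broken into lines as well. The following is a rough sketch; the function name sections_from_text is hypothetical, and it assumes blank lines and surrounding whitespace should be discarded.

```python
def sections_from_text(text):
    sections = []
    for chunk in text.split("$"):
        # Break the chunk into lines, trimming whitespace and skipping blanks.
        lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
        if lines:
            sections.append(lines)
    return sections

sample = """$ abc
defghjik
am here
not now
$ you
are not
here but go there
$ """

for section in sections_from_text(sample):
    print(section)
```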
Reading parts of files often boils down to an "iteration pattern." There are a number of generators in the itertools package that can help. Or you can craft your own generator. For example:
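def take_sections(predicate, iterable, firstpost=lambda x: x):
    i = iter(iterable)
    batch = None
    try:
        nextone = next(i)
        while True:
            # Start a new section with the cleaned-up marker line
            batch = [firstpost(nextone)]
            nextone = next(i)
            while not predicate(nextone):
                batch.append(nextone)
                nextone = next(i)
            yield batch
    except StopIteration:
        # Input exhausted: emit the last, still-open section (if any)
        if batch is not None:
            yield batch
        return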
This is similar to itertools.takewhile, except that it is more of a take-until loop (i.e. the test is at the bottom, not the top). It also has a built-in clean-up/post-process function for the first line in a section (the "section marker"). Once you've abstracted this iteration pattern, you need to read the lines in the file, define how the section markers are identified and cleaned up, and run the generator:
with open('some.txt', 'r') as f1:
    lines = [l.strip() for l in f1.readlines()]
dollar_line = lambda x: x.startswith('$')
clean_dollar_line = lambda x: x[1:].lstrip()
print(list(take_sections(dollar_line, lines, clean_dollar_line)))
Yielding:
[['abc', 'defghjik', 'am here', 'not now'],
['you', 'are not', 'here but go there'],
['....']]
I am trying to write code that will take a string and remove specific data from it. I know that the data will look like the line below, and I only need the data within the " " marks, not the marks themselves.
inputString = 'type="NN" span="123..145" confidence="1.0" '
Is there a way to take a Substring of a string within two characters to know the start and stop points?
You can extract all the text between pairs of " characters using regular expressions:
import re
inputString = 'type="NN" span="123..145" confidence="1.0" '
pat = re.compile('"([^"]*)"')
strings = []  # collected values
while True:
    mat = pat.search(inputString)
    if mat is None:
        break
    strings.append(mat.group(1))
    # Continue searching after the end of this match
    inputString = inputString[mat.end():]
print(strings)
or, easier:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
strings=re.findall('"([^"]*)"', inputString)
print strings
Output for both versions:
['NN', '123..145', '1.0']
fields = inputString.split('"')
print fields[1], fields[3], fields[5]
You could split the string at each space to get a list of 'key="value"' substrings and then use regular expressions to parse the substrings.
Using your input string:
>>> input_string = 'type="NN" span="123..145" confidence="1.0" '
>>> input_string_split = input_string.split()
>>> print input_string_split
[ 'type="NN"', 'span="123..145"', 'confidence="1.0"' ]
Then use regular expressions:
>>> import re
>>> pattern = r'"([^"]+)"'
>>> for substring in input_string_split:
...     match_obj = re.search(pattern, substring)
...     print(match_obj.group(1))
...
NN
123..145
1.0
The regular expression '"([^"]+)"' matches anything within quotation marks (provided there is at least one character). The round brackets indicate the bit of the regular expression that you are interested in.
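If the keys are useful too, a variation on the same idea is to capture both sides of each key="value" pair and build a dict. This is only a sketch: the function name parse_attributes is made up, and it assumes keys consist of word characters and values never contain a double quote.

```python
import re

def parse_attributes(text):
    # Each match yields a (key, value) tuple: word characters before '=',
    # then everything between the following pair of double quotes.
    return dict(re.findall(r'(\w+)="([^"]*)"', text))

input_string = 'type="NN" span="123..145" confidence="1.0" '
attrs = parse_attributes(input_string)
print(attrs)
print(attrs["span"])
```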