Delimiters for splitting a string in Python

I know this question has been asked a few times, but what I'm asking is not how to do it, but which delimiter should be used.
So I have a very long string and I want to split it into words. The result was not what I wanted, so I thought about adding more delimiters.
The problem is that there are words like vs. and U.S. in the string. If I use . as a delimiter, I get vs, but U.S. becomes U and S, which is not what I wanted.
Another example: there are words like brainf*ck *7 F***ing x*x+y*y works* f*k in the string. If I use * as a delimiter, the result will be very messy (brainf*ck becomes brainf and ck, F***ing becomes F and ing, and so on).
The ' delimiter has the same problem (don't, 'starting out', what's, do's, dont's).
- = + ( ) also have some minor problems, but I can handle those delimiters. The problem is with . * and '.
Does anyone have any idea how to tackle this problem?

What about using re and splitting on whitespace, rather than on the punctuation itself:
import re
text = 'U.S. vs. brainf*ck *7 F***ing x*x+y*y works* f*k'
get = re.split(r'\s+', text)
# ['U.S.', 'vs.', 'brainf*ck', '*7', 'F***ing', 'x*x+y*y', 'works*', 'f*k']
# Example
print(get[0])  # U.S.
print(get[1])  # vs.
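If you don't actually need a regex, note that plain str.split() with no arguments does the same thing: it splits on any run of whitespace and drops empty strings:
text = 'U.S. vs. brainf*ck *7 F***ing x*x+y*y works* f*k'
print(text.split())
# ['U.S.', 'vs.', 'brainf*ck', '*7', 'F***ing', 'x*x+y*y', 'works*', 'f*k']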


splitting first and last name regex [duplicate]

Hello, I have a string of full names:
string='Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser'
I would like to split it into first and last names, to get output like this:
['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']
I tried using this code:
splitted = re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', string))
that, after splitting the result on whitespace, gives this:
['Christof', 'Koch', 'Jonathan', 'Harel', 'Moran', 'Cerf', 'Wolfgang', 'Einhaeuser']
I would like to have each full name as an item.
Any suggestions? Thanks
You can use a lookahead after any lowercase letter to check whether it's immediately followed by an uppercase letter or end-of-line, such as [a-zA-Z\s]+?[a-z](?=[A-Z]|$) (more specific) or even .+?[a-z](?=[A-Z]|$) (broader).
import re
string = 'Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser'
print(re.findall(r".+?[a-z](?=[A-Z]|$)", string))
# -> ['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']
Having provided this answer, definitely check out Falsehoods Programmers Believe About Names; depending on your data, it might be erroneous to assume that your format will be parseable using the lower->upper assumption.
For your list of strings in this format from the comments, just add a list comprehension. The regex I provided above happens to be robust to the middle initials without modification (but I have to emphasize that if your dataset is enormous, that might not hold).
import re
names = ['Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser', 'Zaïd HarchaouiCéline Levy-leduc', 'David A. ForsythDuan Tran', 'Arnold SmeuldersSennay GhebreabPieter Adriaans', 'Peter L. BartlettAmbuj Tewari', 'Javier R. MovellanPaul L. RuvoloIan Fasel', 'Deli ZhaoXiaoou Tang']
result = [re.findall(r".+?[a-z](?=[A-Z]|$)", x) for x in names]
for name in result:
    print(name)
Output:
['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']
['Zaïd Harchaoui', 'Céline Levy-leduc']
['David A. Forsyth', 'Duan Tran']
['Arnold Smeulders', 'Sennay Ghebreab', 'Pieter Adriaans']
['Peter L. Bartlett', 'Ambuj Tewari']
['Javier R. Movellan', 'Paul L. Ruvolo', 'Ian Fasel']
['Deli Zhao', 'Xiaoou Tang']
And if you want all of these names in one list, add
flattened = [x for y in result for x in y]
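An alternative sketch, under the same lower-to-uppercase assumption: instead of matching the names, insert a separator at each boundary with re.sub and then split on it (the '|' here is arbitrary and assumes it never appears in your data):
import re
string = 'Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser'
parts = re.sub(r'(?<=[a-z])(?=[A-Z])', '|', string).split('|')
print(parts)
# ['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']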
It'll most likely produce some false positives and false negatives, yet it may be OK to start with:
[A-Z][^A-Z\r\n]*\s+[A-Z][^A-Z\r\n]*
Test
import re
expression = r"[A-Z][^A-Z\r\n]*\s+[A-Z][^A-Z\r\n]*"
string = """
Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser
"""
print(re.findall(expression, string))
Output
['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']
If you wish to explore/simplify/modify the expression, it's been explained on the top right panel of regex101.com, where you can also see how it would match against some sample inputs.

Find a string within a string using pattern matching in python

I'd like to use part of a string ('project') that is returned from an API. The string looks like this:
{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}
I'd like to store the 'LS003942_EP... ' part in a new variable called foldername. I thought a good way would be to use a regex to find the text after Title. Here's my code:
orders = api.get_all(view='Folder', fields='Project Title', maxRecords=1)
for new in orders:
    print("Found 1 new project")
    print(new['fields'])
    project = (new['fields'])
    s = re.search('Title(.+?)', result)
    if s:
        foldername = s.group(1)
        print(foldername)
This gives me an error -
TypeError: expected string or bytes-like object.
I'm hoping for foldername = 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'
You can use ast.literal_eval to safely evaluate a string containing a Python literal:
import ast
s = "{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}"
print(ast.literal_eval(s)['Project Title'])
# LS003942_EP - 5 Random Road, Sunny Place, SA 5000
It seems (to me) that you have a dictionary, not a string. In that case, you may try:
s = {'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}
print(s['Project Title'])
If you have time, take a look at dictionaries.
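Tying this back to your loop: if new['fields'] is already a dict (which the TypeError suggests), a minimal sketch of the fix is just indexing it, with no regex or parsing needed:
for new in orders:
    project = new['fields']  # already a dict, not a string
    foldername = project['Project Title']
    print(foldername)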
I don't think you need a regex here:
string = "{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}"
foldername = string[string.index(":") + 3: len(string) - 2]
Essentially, I'm finding the position of the first colon, then adding 3 to skip the colon, the space, and the opening apostrophe, which gives the starting index of your foldername; I then slice everything from that index up to (but not including) the closing apostrophe and brace at the end.
However, if your string is always going to be in the form of a valid Python dict, you could simply do foldername = list(eval(string).values())[0]. Here, I'm treating your string as a dict and getting the first value from it, which is your desired foldername. But, as #AKX notes in the comments, eval() isn't safe, as somebody could pass malicious code as a string. Unless you're sure that your input strings won't contain code, it's best to use ast.literal_eval(), as it only evaluates literals.
But, as #MaximilianPeters notes in the comments, your response looks like valid JSON, so you could parse it with json.loads() (after normalizing the quotes, since strict JSON requires double quotes).
You could try this pattern: (?<='Project Title': )[^}]+
Explanation: it uses a positive lookbehind to ensure that the match occurs right after 'Project Title': . Then it matches everything until a } is encountered: [^}]+.
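For instance, a quick sketch of that pattern in Python (the strip is only there to drop the surrounding quotes from the match):
import re
s = "{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}"
m = re.search(r"(?<='Project Title': )[^}]+", s)
if m:
    print(m.group(0).strip("'"))
# LS003942_EP - 5 Random Road, Sunny Place, SA 5000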

replacing certain expressions in file but only one time

I extracted some expressions from a file and I want to insert these expressions back into the same file in a different format, e.g. between brackets. My problem is that I want each expression to be replaced only once.
the file looks like this
file = """he is a good man
she is a beautiful woman
this is a clever student
he is a bad neighbour
they are bad men
She is very beautiful"""
and the expressions are like this
ex = """ good, clever, beautiful, bad,"""
the code used is
adj = ex.split(",")
for a in adj:
if a in file:
file = file.replace(a, ' ' +'[[' + a + ']]')
print file
this gives the following output:
he is a [[good]] man [[
]]she is a [[ beautiful]] woman [[
]]this is a [[ clever]] student [[
]]he is a [[ bad]] neighbour [[
]]they are [[ bad]] men [[
]]She is very [[ beautiful]] [[
]] [[
]]
while the expected output is
he is a [[good]] man
she is a [[ beautiful]] woman
this is a [[ clever]] student
he is a [[ bad]] neighbour
they are bad men # 'bad' is not replaced here because another 'bad' was already replaced
She is very beautiful # and 'beautiful' is not replaced here, like 'bad'
If the file content is stored as a string:
The replace method of a string also takes an optional third argument, count, which caps the number of replacements (some references, like the one below, call it max).
http://www.tutorialspoint.com/python/string_replace.htm
This will allow you to replace only the first occurrence of a word.
for instance,
>>> "he is a good man, and a good husband".replace('good', '[[ good ]]', 1)
'he is a [[ good ]] man, and a good husband'
>>>
Example 2: read from a file, one line at a time.
The method above assumes that you have read the file and stored its content as a single string. Below, I will show how you might implement your code to solve the problem line by line.
Assuming you have a file testfile.txt with the following content :
he is a good man
she is a beautiful woman
this is a clever student
he is a bad neighbour
they are bad men
She is very beautiful
Here is your python code
#!/usr/bin/env python
# your expression
ex = """ good, clever, beautiful, bad,"""
# list comprehension to clean up your expression,
# first by splitting it on commas and then removing anything that is just an empty string
wanted_terms = [x.strip() for x in ex.split(',') if x.strip() != '']
## read the file using a with statement
with open('testfile.txt') as f:
    for line in f:
        line = line.strip()
        ## for each wanted term, check if it exists in the line
        ## (iterate over a copy, since we remove items from the list inside the loop)
        for x in wanted_terms[:]:
            if x in line:
                ## I prefer to use string formatting here:
                #replacement = "[[ %s ]]" % x
                #line = line.replace(x, replacement, 1)
                ## if the term exists, do the replacement. Use count=1 to replace only the first instance.
                line = line.replace(x, '[[' + x + ']]', 1)
                ## remove it from the term list so that future occurrences are left untouched
                wanted_terms.remove(x)
        print line
biobirdman seems to have a good solution, so use that for the correct fix. My post here is just to explain what went wrong. When you did:
ex = """ good, clever, beautiful, bad,"""
adj = ex.split(",")
You got something other than what you thought:
print adj
[' good', ' clever', ' beautiful', ' bad', '']
I don't know if you meant to have a space before each string, but you almost certainly didn't mean to have a '' at the end. In fact, I think you didn't have this in your example, otherwise you'd see a different bad behavior: what you probably had was a newline character at the end of ex, so the '' showing up here was actually a newline in your attempt.
So it matched all the ones you expected, plus the newlines. Anyone running the code exactly as posted will instead get a match between every pair of characters:
[[]]h [[]]e [[]] [[]]i [[]]s [[]] [[]]a ........
To fix: get rid of the newline and eliminate the extra spaces. How? Take a look at strip.
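For example, one way to clean up ex before looping (a sketch using strip in a comprehension):
ex = """ good, clever, beautiful, bad,"""
adj = [a.strip() for a in ex.split(',') if a.strip()]
print adj
# ['good', 'clever', 'beautiful', 'bad']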
Two changes to your code: avoid the empty string in adj, and remove the leading whitespace when you replace word with [[word]] (word has values like " beautiful" and " clever" in your code).
file = """he is a good man
she is a beautiful woman
this is a clever student
he is a bad neighbour
they are bad men
She is very beautiful"""
ex = """ good, clever, beautiful, bad,"""
adj = filter(None, ex.split(","))  # removing empty strings from the list
# SO ref: http://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings
for a in adj:
    if a in file:
        file = file.replace(a, ' ' + '[[' + a.strip() + ']]')  # strip() removes leading/trailing whitespace
print file

identifying strings which can't be spelt in a list item

I have a list
['mPXSz0qd6j0 youtube ', 'lBz5XJRLHQM youtube ', 'search OpHQOO-DwlQ ',
'sachin 47427243 ', 'alex smith ', 'birthday JEaM8Lg9oK4 ',
'nebula 8x41n9thAU8 ', 'chuck norris ',
'searcher O6tUtqPcHDw ', 'graham wXqsg59z7m0 ', 'queries K70QnTfGjoM ']
Is there some way to identify the strings which can't be spelt in the list item and remove them?
You can use, e.g. PyEnchant for basic dictionary checking and NLTK to take minor spelling issues into account, like this:
import enchant
import nltk
spell_dict = enchant.Dict('en_US')  # or whatever language is supported

def get_distance_limit(w):
    '''
    The word is considered good
    if it's no further from a known word than this limit.
    '''
    return len(w) / 5 + 2  # just for example, allowing around 1 typo per 5 chars

def check_word(word):
    if spell_dict.check(word):
        return True  # a known dictionary word
    # try similar words
    max_dist = get_distance_limit(word)
    for suggestion in spell_dict.suggest(word):
        if nltk.edit_distance(suggestion, word) < max_dist:
            return True
    return False
Add case normalisation and a filter for digits and you'll get a pretty good heuristic.
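For instance, applied to the list from the question (a sketch; check_word is the function above, and exactly which tokens survive will depend on your dictionary and distance limit):
data = ['mPXSz0qd6j0 youtube ', 'search OpHQOO-DwlQ ', 'alex smith ']
cleaned = [' '.join(w for w in item.split() if check_word(w)) for item in data]
# e.g. ['youtube', 'search', 'alex smith'] -- the gibberish tokens dropped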
It is entirely possible to compare your list members against a list of words that you believe to be valid for your input.
This can be done in many ways, partially depending on your definition of "properly spelled" and what you end up using for a comparison list. If you decide that numbers preclude an entry from being valid, or underscores, or mixed case, you could test for regex matching.
Post regex, you would have to decide what a valid character to split on should be. Is it spaces (are you willing to break on 'ad hoc' ('ad' is an abbreviation, 'hoc' is not a word))? Is it hyphens (this will break on hyphenated last names)?
With these above criteria decided, it's just a decision of what word, proper name, and common slang list to use and a list comprehension:
word_list[:] = [term for term in word_list if passes_my_membership_criteria(term)]
where passes_my_membership_criteria() is a function that contains the rules for staying in the list of words, returning False for things that you've decided are not valid.
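A minimal sketch of such a function, assuming the criteria are "every space-separated token is purely alphabetic and appears in a word set you supply" (english_words here is a placeholder for whatever reference list you choose):
def passes_my_membership_criteria(term):
    # english_words: a set of known-good words, e.g. loaded from a word list file
    return all(w.isalpha() and w.lower() in english_words
               for w in term.split())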

Discovering Poetic Form with NLTK and CMU Dict

Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools
I'm a linguist who has recently picked up Python, and I'm working on a project that hopes to automatically analyze poems, including detecting the form of the poem. E.g. if it found a 10-syllable line with a 0101010101 stress pattern, it would declare it iambic pentameter; a poem with a 5-7-5 syllable pattern would be a haiku.
I'm using the following code, part of a larger script, but I have a number of problems which are listed below the program:
corpus in the script is simply the raw text input of the poem.
import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit
...
def cmuform():
    tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
    d = cmudict.dict()
    text = nltk.Text(tokens)
    words = [w.lower() for w in text]
    regexp = "[A-Za-z]+"
    exp = re.compile(regexp)

    def nsyl(word):
        lowercase = word.lower()
        if lowercase not in d:
            return 0
        else:
            first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
            second = ''.join(first)
            third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
            return third
        #return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])

    sum1 = 0
    for a in words:
        if exp.match(a):
            print a, nsyl(a),
            sum1 = sum1 + len(str(nsyl(a)))
    print "\nTotal syllables:", sum1
I guess that the output that I want would be like this:
1101111101
0101111001
1101010111
The first problem is that I lost the line breaks during tokenization, and I really need the line breaks to be able to identify form. That should not be too hard to deal with, though. The bigger problems are that:
I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllable count of the line will probably decrease.
In addition, the CMU dictionary often says that there is stress on a word ('1') when there is not ('0'). That is why the output looks like 1101111101 when it should be the stress pattern of iambic pentameter: 0101010101.
So how would I add some fudging factor so the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good to code a function that identifies lines of 01's when the CMU dictionary is not going to output such a clean result. I suppose I'm asking how to code a 'partial match' algorithm.
Welcome to stack overflow. I'm not that familiar with Python, but I see you have not received many answers yet so I'll try to help you with your queries.
First some advice: You'll find that if you focus your questions your chances of getting answers are greatly improved. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.
Back on topic:
Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:
split corpus into lines
For each line: find the syllable length and stress pattern.
Classify stress patterns.
You'll find that the first step is a single function call in Python:
corpus.split("\n")
and it can remain in the main function; but the second step would be better placed in its own function, and the third step would need to be split up itself and would probably be better tackled with an object-oriented approach. If you're in academia, you might be able to convince the CS faculty to lend you a post-grad for a couple of months to help you, instead of some workshop requirement.
Now to your other questions:
Not losing line breaks: as #ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
Words not in dictionary/errors: The CMU dictionary home page says:
Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)
There is probably a way to add custom words to the dictionary / change existing ones, look in their site, or contact the dictionary maintainers directly.
You can also ask here in a separate question if you can't figure it out. There's bound to be someone in stackoverflow that knows the answer or can point you to the correct resource.
Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway to improve the dictionary.
Classifying input corpus when it doesn't exactly match the pattern: You might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look for:
Levenshtein distance: gives you a measure of how different two strings are, as the number of changes needed to turn one string into the other. Pros: easy to implement. Cons: not normalized; a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3.
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement but gives you normalized values (0.0 - completely different, 1.0 - the same) and is suitable for classifying the stress patterns. A CS postgrad or last year undergrad should not have too much trouble with it ( hint hint ).
I think those were all your questions. Hope this helps a bit.
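As a rough sketch of the normalization point above, using nltk.edit_distance and scaling by pattern length (so 1.0 means identical and 0.0 means completely different):
import nltk

def stress_similarity(observed, ideal):
    # normalized Levenshtein distance
    dist = nltk.edit_distance(observed, ideal)
    return 1.0 - float(dist) / max(len(observed), len(ideal))

print stress_similarity('1101111101', '0101010101')  # 0.7 -- three edits over ten positions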
To preserve newlines, parse line by line before sending each line to the cmu parser.
For dealing with single-syllable words, you probably want to try both 0 and 1 for it when nltk returns 1 (looks like nltk already returns 0 for some words that would never get stressed, like "the"). So, you'll end up with multiple permutations:
1101111101
0101010101
1101010101
and so forth. Then you have to pick the ones that look like known forms.
For non-dictionary words, I'd also fudge it the same way: figure out the number of syllables (the dumbest way would be by counting the vowels), and permute all possible stresses. Maybe add some more rules like "ea is a single syllable, trailing e is silent"...
I've never worked with other kinds of fuzzying, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.
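For the permutation idea, a sketch with itertools.product, assuming you have built one list of candidate stress strings per word (hard-coded here just for illustration):
from itertools import product

# one list of alternatives per word; monosyllables get both '0' and '1'
word_options = [['1', '0'], ['01'], ['1', '0'], ['10']]
candidates = set(''.join(p) for p in product(*word_options))
print candidates
# each candidate line pattern can then be scored against known forms like '0101010101'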
This is my first post on Stack Overflow, and I'm a Python newbie, so please excuse any deficits in code style.
But I too am attempting to extract accurate metre from poems, and the code included in this question helped me, so I'm posting what I came up with building on that foundation. It is one way to extract the stress as a single string, correct the cmudict bias toward 1 with a 'fudging factor', and not lose words that are not in cmudict.
import nltk
from nltk.corpus import cmudict

prondict = cmudict.dict()

#
# parseStressOfLine(line)
# function that takes a line,
# parses it for stress,
# corrects the cmudict bias toward 1,
# and returns two strings:
#
# 'stress' in form '0101*,*110110'
# -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
def parseStressOfLine(line):
    stress = ""
    stress_no_punct = ""
    print line
    tokens = [words.lower() for words in nltk.word_tokenize(line)]
    for word in tokens:
        word_punct = strip_punctuation_stressed(word.lower())
        word = word_punct['word']
        punct = word_punct['punct']
        #print word
        if word not in prondict:
            # if the word is not in the dictionary,
            # add it to the string that includes punctuation
            stress = stress + "*" + word + "*"
        else:
            zero_bool = True
            for s in prondict[word]:
                # oppose the cmudict bias toward 1:
                # search for a zero in the array returned from prondict;
                # if it exists, use it
                # print strip_letters(s), word
                if strip_letters(s) == "0":
                    stress = stress + "0"
                    stress_no_punct = stress_no_punct + "0"
                    zero_bool = False
                    break
            if zero_bool:
                stress = stress + strip_letters(prondict[word][0])
                stress_no_punct = stress_no_punct + strip_letters(prondict[word][0])
        if len(punct) > 0:
            stress = stress + "*" + punct + "*"
    return {'stress': stress, 'stress_no_punct': stress_no_punct}

# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
    # define punctuations
    punctuations = '!()-[]{};:"\,<>./?##$%^&*_~'
    my_str = word
    # remove punctuations from the string
    no_punct = ""
    punct = ""
    for char in my_str:
        if char not in punctuations:
            no_punct = no_punct + char
        else:
            punct = punct + char
    return {'word': no_punct, 'punct': punct}

# CONVERT the cmudict prondict entries into just numbers
def strip_letters(ls):
    #print "strip_letters"
    nm = ''
    for ws in ls:
        #print "ws", ws
        for ch in list(ws):
            #print "ch", ch
            if ch.isdigit():
                nm = nm + ch
                #print "ad to nm", nm, type(nm)
    return nm

# TESTING results
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print parseStressOfLine(line)
line = "Apollo play'd the midwife's part;"
print parseStressOfLine(line)
line = "Into the world Corinna fell,"
print parseStressOfLine(line)

"""
OUTPUT
This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}
"""
