Python - .split(), - two arguments - python

I need to make a modification on a python code.
This code scrapes information from a .csv file, to finally integrate it in a new .csv file, in a different structure.
In one of the columns of the source files, I have a value (string), which is in 99% of the time formed this way: 'block1 block2 block3'.
Block2 always ends with the value 'm' 99% of the time.
example: 'R2 180m RFT'.
By browsing the source dataset, I realized that in 1% of the cases, the block2 can end with 'M'.
As I need all the values after the 'm' or 'M' value, I'm a bit stuck.
I used the .split() function, like this in my :
'Newcolumn': getattr(row_unique_ids, 'COLUMNINTHEDATASET').split ('m') [1],
By doing so, my script falls in error, because it falls on a value of this style :
R2 180M AST'.
So I would like to know how to integrate an additional argument, which would allow me to make the split work well if the script falls on 'm' or 'M'.
Thank you for your help.

One solution is to
s = getattr(row_unique_ids, 'COLUMNINTHEDATASET')
s = s.lower()
s.split('m')[1]
But that will mess up your casing. If you want to preserve casing,
another solution is to do:
x = ''
s = getattr(row_unique_ids, 'COLUMNINTHEDATASET')
for c in s:
if c == 'M'
x += 'm'
x += c
x.split('m')[1]

One way to do multi-arguments split is, in general:
import re
string = "this is 3an infamous String4that I need to s?plit in an infamou.s way"
#Preserve the original char
print (re.sub(r"([0-9]|[?.]|[A-Z])",r'\1'+"DELIMITER",string).split('DELIMITER'))
#Discard the original char
print (re.sub(r"([0-9]|[?.]|[A-Z])","DELIMITER",string).split('DELIMITER'))
Output:
['this is 3', 'an infamous S', 'tring4', 'that I', ' need to s?', 'plit in an infamou.', 's way']
['this is ', 'an infamous ', 'tring', 'that ', ' need to s', 'plit in an infamou', 's way']
In your context:
import re
string = "R2 180m RFT R2 180M RFT"
print (re.sub(r"\b([0-9]+)[mM]\b",r'\1'+"M",string).split('M'))
#print (re.sub(r"\b([0-9]+)[mM]\b",r'\1'+"M",getattr(row_unique_ids, 'COLUMNINTHEDATASET')).split('M'))
Output:
['R2 180', ' RFT R2 180', ' RFT']
It will split on m and M if those are preceded by a number.

Related

How to select sub-strings based on the presence of word pairs? Python

I have a large number of sentences, from which I want to extract sub-sentences that start with certain word combinations. For example, I want to extract sentence segments that begin with "what does" or "what is', etc. (essentially eliminating the words from the sentence that appear before the word-pairs). Both the sentences and the word-pairs are stored in a DataFrame:
'Sentence' 'First2'
0 If this is a string what does it say? 0 can I
1 And this is a string, should it say more? 1 should it
2 This is yet another string. 2 what does
3 etc. etc. 3 etc. etc
The result I want from the above example would be:
0 what does it say?
1 should it say more?
2
The most obvious solution (at least to me) below does not work. It only uses the first word-pair b to go over all the sentences r, but not the other b's.
a = df['Sentence']
b = df['First2']
#The function seems to loop over all r's but only over the first b:
def func(z):
for x in b:
if x in r:
s = z[z.index(x):]
return s
else:
return ‘’
df['Segments'] = a.apply(func)
It seems that looping over two DataFrames simultaneously in this way does not work. Is there a more efficient and effective way to do this?
I believe there is a bug in your code.
else:
return ''
This means if the 1st comparison is not a match, 'func' will return immediately. That might be why the code does not return any matches.
A sample working code is below:
# The function seems to loop over all r's but only over the first b:
def func(sentence, first_twos=b):
for first_two in first_twos:
if first_two in sentence:
s = sentence[sentence.index(first_two):]
return s
return ''
df['Segments'] = a.apply(func)
And the output:
df:
{
'First2': ['can I', 'should it', 'what does'],
'Segments': ['what does it say? ', 'should it say more?', ''],
'Sentence': ['If this is a string what does it say? ', 'And this is a string, should it say more?', 'This is yet another string. ' ]
}
you can loop over two things easily via zip(iterator,iterator_foo)
My question was answered by the following code:
def func(r):
for i in b:
if i in r:
q = r[r.index(i):]
return q
return ''
df['Segments'] = a.apply(func)
The solution was pointed out here by Daming Lu (only the last line is different from his). The problem was in the last two lines of the original code:
else:
return ''
This caused the function to return too early. Daming Lu's answer was better than the answer to the possible duplicate question python for-loop only executes once? which created other problems - as explained in my respons to wii. (So I am not sure mine really is a duplicate.)

Function's result is not updated?

I started learning Python quite recently. I am trying to a piece of code that does some simple text editing. The program is suppose to take a txt file encoding in UTF-8, make sure everything is indented 1 space starting from the second line and delete any potential double or triple spaces.
My plan is, reading the information from the txt file and store it in a list. Then I am going to process the elements in the list, then finally rewrite them back to the file (which has not been implemented yet) The first part of auto indent code is working I think.
However for the code that detects and deletes unwanted spaces, I tried in the function method, I think it is working; However when I test for the list content in the body code, the contents seem unaltered (the original state). What could I have been done wrong?
To give an idea of an example file, I will post parts of the txt file I am trying to process
Original:
There are various kinds of problems concerning human rights. Every day we hear news reporting on human rights violation. Human rights NGOs (For example, Amnesty International or Human Rights Watch) have been trying to deal with and resolve these problems in order to restore the human rights of individuals.
Expected:
There are various kinds of problems concerning human rights. Every day we hear news reporting on human rights violation. Human rights NGOs (For example, Amnesty International or Human Rights Watch) have been trying to deal with and resolve these problems in order to restore the human rights of individuals.
My code is as follows
import os
os.getcwd()
os.chdir('D:')
os.chdir('/Documents/2011_data/TUFS_08_2011')
words = []
def indent(string):
for x in range(0, len(string)):
if x>0:
if string[x]!= "\n":
if string[x][0] != " ":
y = " " + string[x]
def delete(self):
for x in self:
x = x.replace(" ", " ")
x = x.replace(" ", " ")
x = x.replace(" ", " ")
print(x, end='')
return self
with open('dummy.txt', encoding='utf_8') as file:
for line in file:
words.append(line)
file.close()
indent(words)
words = delete(words)
for x in words:
print(x, end='')
You can easily remove spaces with a split() and a join;
In [1]: txt = ' This is a text with multiple spaces. '
Using the split() method of a string gives you a list of words without whitespace.
In [3]: txt.split()
Out[3]: ['This', 'is', 'a', 'text', 'with', 'multiple', 'spaces.']
Then you can use the join method with a single space;
In [4]: ' '.join(txt.split())
Out[4]: 'This is a text with multiple spaces.'
If you want an extra space in front, insert an empty string in the list;
In [7]: s = txt.split()
In [8]: s
Out[8]: ['This', 'is', 'a', 'text', 'with', 'multiple', 'spaces.']
In [9]: s.insert(0, '')
In [10]: s
Out[10]: ['', 'This', 'is', 'a', 'text', 'with', 'multiple', 'spaces.']
In [11]: ' '.join(s)
Out[11]: ' This is a text with multiple spaces.'
Your delete function iterates through a list, assigning each string to x, then successively reassigns x with the result of various replaces. But it never then puts the result back in the list, which is returned unchanged.
The easiest thing to do would be to build up a new list consisting of the results of the modifications, and then return that.
def delete(words):
result = []
for x in words:
... modify...
result.append(x)
return result
(Note it's not a good idea to use the name 'self', as that implies you're in an object method, which you're not.)

Edit content of a list with split and find

I have a dictionary named dicitionario1. I need to replace the content of dicionario[chave][1] which is a list, for the list lista_atributos.
lista_atribtutos uses the content of dicionario[chave][1] to get a list where:
All the information is separed by "," except when it finds the characters "(#" and ")". In this case, it should create a list with the content between those characters (also separated by ","). It can find one or more entries of '(#' and I need to work with every single of them.
Although this might be easy, I'm stuck with the following code:
dicionario1 = {'#998' : [['IFCPROPERTYSET'],["'0siSrBpkjDAOVD99BESZyg',#41,'Geometric Position',$,(#977,#762,#768,#754,#753,#980,#755,#759,#757)"]],
'#1000' : [['IFCRELDEFINESBYPROPERTIES'],["'1dEWu40Ab8zuK7fuATUuvp',#41,$,$,(#973,#951),#998"]]}
for chave in dicionario1:
lista_atributos = []
ini = 0
for i in dicionario1[chave][1][0][ini:]:
if i == '(' and dicionario1[chave][1][0][dicionario1[chave][1][0].index(i) + 1] == '#':
ini = dicionario1[chave][1][0].index(i) + 1
fim = dicionario1[chave][1][0].index(')')
lista_atributos.append(dicionario1[chave][1][0][:ini-2].split(','))
lista_atributos.append(dicionario1[chave][1][0][ini:fim].split(','))
lista_atributos.append(dicionario1[chave][1][0][fim+2:].split(','))
print lista_atributos
Result:
[["'1dEWu40Ab8zuK7fuATUuvp'", '#41', '$', '$'], ['#973', '#951'], ['#998']]
[["'0siSrBpkjDAOVD99BESZyg'", '#41', "'Geometric Position'", '$'], ['#977', '#762', '#768', '#754', '#753', '#980', '#755', '#759', '#757'], ['']]
Unfortunately I can figure out how to iterate over the dictionario1[chave][1][0] to get this result:
[["'1dEWu40Ab8zuK7fuATUuvp'"], ['#41'], ['$'], ['$'], ['#973', '#951'], ['#998']]
[["'0siSrBpkjDAOVD99BESZyg'", ['#41'], ["'Geometric Position'"], ['$'], ['#977', '#762', '#768', '#754', '#753', '#980', '#755', '#759', '#757']]
I need the"["'1dEWu40Ab8zuK7fuATUuvp'", '#41', '$', '$']..." in the result, also to turn into ["'1dEWu40Ab8zuK7fuATUuvp'"], ['#41'], ['$'], ['$']...
Also If I modify "Geometric Position" to "(Geometric Position)" the result becomes:
[["'1dEWu40Ab8zuK7fuATUuvp'", '#41', '$', '$'], ['#973', '#951'], ['#998']]
SOLUTION: (thanks to Rob Watts)
import re
dicionario1 =["'0siSrBpkjDAOVD99BESZyg',#41,'(Geometric) (Position)',$,(#977,#762,#768,#754,#753,#980,#755,#759,#757)"]
dicionario1 = re.findall('\([^)]*\)|[^,]+', dicionario1[0])
for i in range(len(dicionario1)):
if dicionario1[i].startswith('(#'):
dicionario1[i] = dicionario1[i][1:-1].split(',')
else:
pass
print dicionario1
["'0siSrBpkjDAOVD99BESZyg'", '#41', "'(Geometric) (Position)'", '$', ['#977', '#762', '#768', '#754', '#753', '#980', '#755', '#759', '#757']]
One problem I see with your code is the use of index:
ini = dicionario1[chave][1][0].index(i) + 2
fim = dicionario1[chave][1][0].index(')')
index returns the index of the first occurrence of the character. So if you have two ('s in your string, then both times it will give you the index of the first one. That (and your break statement) is why in your example you've got ['2.1', '2.2', '2.3'] correctly but also have '(#5.1', '5.2', '5.3)'.
You can get around this by specifying a starting index to the index method, but I'd suggest a different strategy. If you don't have any commas in the parsed strings, you can use a fairly simple regex to find all your groups:
'\([^)]*\)|[^,]+'
This will find everything inside parenthesis and also everything that doesn't contain a comma. For example:
>>> import re
>>> teststr = "'1',$,#41,(#10,#5)"
>>> re.findall('\([^)]*\)|[^,]+', teststr)
["'1'", '$', '#41', '(#10,#5)']
This leaves you will everything grouped appropriately. You still have to do a little bit of processing on each entry, but it should be fairly straightforward.
During your processing, the startswith method should be helpful. For example:
>>> '(something)'.startswith('(')
True
>>> '(something)'.startswith('(#')
False
>>> '(#1,#2,#3)'.startswith('(#')
True
This will make it easy for you to distinguish between (...) and (#...). If there are commas in the (...), you could always split on comma after you've used the regex.

identifying strings which cant be spelt in a list item

I have a list
['mPXSz0qd6j0 youtube ', 'lBz5XJRLHQM youtube ', 'search OpHQOO-DwlQ ',
'sachin 47427243 ', 'alex smith ', 'birthday JEaM8Lg9oK4 ',
'nebula 8x41n9thAU8 ', 'chuck norris ',
'searcher O6tUtqPcHDw ', 'graham wXqsg59z7m0 ', 'queries K70QnTfGjoM ']
Is there some way to identify the strings which can't be spelt in the list item and remove them?
You can use, e.g. PyEnchant for basic dictionary checking and NLTK to take minor spelling issues into account, like this:
import enchant
import nltk
spell_dict = enchant.Dict('en_US') # or whatever language supported
def get_distance_limit(w):
'''
The word is considered good
if it's no further from a known word than this limit.
'''
return len(w)/5 + 2 # just for example, allowing around 1 typo per 5 chars.
def check_word(word):
if spell_dict.check(word):
return True # a known dictionary word
# try similar words
max_dist = get_distance_limit(word)
for suggestion in spell_dict.suggest(word):
if nltk.edit_distance(suggestion, word) < max_dist:
return True
return False
Add a case normalisation and a filter for digits and you'll get a pretty good heuristics.
It is entirely possible to compare your list members to words that you don't believe to be valid for your input.
This can be done in many ways, partially depending on your definition of "properly spelled" and what you end up using for a comparison list. If you decide that numbers preclude an entry from being valid, or underscores, or mixed case, you could test for regex matching.
Post regex, you would have to decide what a valid character to split on should be. Is it spaces (are you willing to break on 'ad hoc' ('ad' is an abbreviation, 'hoc' is not a word))? Is it hyphens (this will break on hyphenated last names)?
With these above criteria decided, it's just a decision of what word, proper name, and common slang list to use and a list comprehension:
word_list[:] = [term for term in word_list if passes_my_membership_criteria(term)]
where passes_my_membership_criteria() is a function that contains the rules for staying in the list of words, returning False for things that you've decided are not valid.

How to save a regular expression user input value (Python)

I am making a simple chat bot in Python. It has a text file with regular expressions which help to generate the output. The user input and the bot output are separated by a | character.
my name is (?P<'name'>\w*) | Hi {'name'}!
This works fine for single sets of input and output responses, however I would like the bot to be able to store the regex values the user inputs and then use them again (i.e. give the bot a 'memory'). For example, I would like to have the bot store the value input for 'name', so that I can have this in the rules:
my name is (?P<'word'>\w*) | You said your name is {'name'} already!
my name is (?P<'name'>\w*) | Hi {'name'}!
Having no value for 'name' yet, the bot will first output 'Hi steve', and once the bot does have this value, the 'word' rule will apply. I'm not sure if this is easily feasible given the way I have structured my program. I have made it so that the text file is made into a dictionary with the key and value separated by the | character, when the user inputs some text, the program compares whether the user input matches the input stored in the dictionary, and prints out the corresponding bot response (there is also an 'else' case if no match is found).
I must need something to happen at the comparing part of the process so that the user's regular expression text is saved and then substituted back into the dictionary somehow. All of my regular expressions have different names associated with them (there are no two instances of 'word', for example...there is 'word', 'word2', etc), I did this as I thought it would make this part of the process easier. I may have structured the thing completely wrong to do this task though.
Edit: code
import re
io = {}
with open("rules.txt") as brain:
for line in brain:
key, value = line.split('|')
io[key] = value
string = str(raw_input('> ')).lower()+' word'
x = 1
while x == 1:
for regex, output in io.items():
match = re.match(regex, string)
if match:
print(output.format(**match.groupdict()))
string = str(raw_input('> ')).lower()+' word'
else:
print ' Sorry?'
string = str(raw_input('> ')).lower()+' word'
I had some difficulty to understand the principle of your algorithm because I'm not used to employ the named groups.
The following code is the way I would solve your problem, I hope it will give you some ideas.
I think that having only one dictionary isn't a good principle, it increases the complexity of reasoning and of the algorithm. So I based the code on two dictionaries: direg and memory
Theses two dictionaries have keys that are indexes of groups, not all the indexes, only some particular ones, the indexes of the groups being the last in each individual patterns.
Because, for the fun, I decided that the regexes must be able to have several groups.
What I call individual patterns in my code are the following strings:
"[mM]y name [Ii][sS] (\w*)"
"[Ii]n repertory (\w*) I [wW][aA][nN][tT] file (\w*)"
"[Ii] [wW][aA][nN][tT] to ([ \w]*)"
You see that the second individual pattern has 2 capturing groups: consequently there are 3 individual patterns, but a total of 4 groups in all the individual groups.
So the creation of the dictionaries needs some additional care to take account of the fact that the index of the last matching group ( which I use with help of the attribute of name lastindex of a regex MatchObject ) may not correspond to the numbering of individual regexes present in the regex pattern: it's harder to explain than to understand. That's the reason why I count in the function distr() the occurences of strings {0} {1} {2} {3} {4} etc whose number MUST be the same as the number of groups defined in the corresponding individual pattern.
I found the suggestion of Laurence D'Oliveiro to use '||' instead of '|' as separator interesting.
My code simulates a session in which several inputs are done:
import re
regi = ("[mM]y name [Ii][sS] (\w*)"
"||Hi {0}!"
"||You said that your name was {0} !!!",
"[Ii]n repertory (\w*) I [wW][aA][nN][tT] file (\w*)"
"||OK here's your file {0}\\{1} :"
"||I already gave you the file {0}\\{1} !",
"[Ii] [wW][aA][nN][tT] to ([ \w]*)"
"||OK, I will do {0}"
"||You already did {0}. Do yo really want again ?")
direg = {}
memory = {}
def distr(regi,cnt = 0,di = direg,mem = memory,
regnb = re.compile('{\d+}')):
for i,el in enumerate(regi,start=1):
sp = el.split('||')
cnt += len(regnb.findall(sp[1]))
di[cnt] = sp[1]
mem[cnt] = sp[2]
yield sp[0]
regx = re.compile('|'.join(distr(regi)))
print 'direg :\n',direg
print
print 'memory :\n',memory
for inp in ('I say that my name is Armano the 1st',
'In repertory ONE I want file SPACE',
'I want to record music',
'In repertory ONE I want file SPACE',
'I say that my name is Armstrong',
'But my name IS Armstrong now !!!',
'In repertory TWO I want file EARTH',
'Now my name is Helena'):
print '\ninput ==',inp
mat = regx.search(inp)
if direg[mat.lastindex]:
print 'output ==',direg[mat.lastindex]\
.format(*(d for d in mat.groups() if d))
direg[mat.lastindex] = None
memory[mat.lastindex] = memory[mat.lastindex]\
.format(*(d for d in mat.groups() if d))
else:
print 'output ==',memory[mat.lastindex]\
.format(*(d for d in mat.groups() if d))
if not memory[mat.lastindex].startswith('Sorry'):
memory[mat.lastindex] = 'Sorry, ' \
+ memory[mat.lastindex][0].lower()\
+ memory[mat.lastindex][1:]
result
direg :
{1: 'Hi {0}!', 3: "OK here's your file {0}\\{1} :", 4: 'OK, I will do {0}'}
memory :
{1: 'You said that your name was {0} !!!', 3: 'I already gave you the file {0}\\{1} !', 4: 'You already did {0}. Do yo really want again ?'}
input == I say that my name is Armano the 1st
output == Hi Armano!
input == In repertory ONE I want file SPACE
output == OK here's your file ONE\SPACE :
input == I want to record music
output == OK, I will do record music
input == In repertory ONE I want file SPACE
output == I already gave you the file ONE\SPACE !
input == I say that my name is Armstrong
output == You said that your name was Armano !!!
input == But my name IS Armstrong now !!!
output == Sorry, you said that your name was Armano !!!
input == In repertory TWO I want file EARTH
output == Sorry, i already gave you the file ONE\SPACE !
input == Now my name is Helena
output == Sorry, you said that your name was Armano !!!
OK, let me see if I understand this:
You want to a dictionary of key-value pairs. This will be the “memory” of the chatbot.
You want to apply regular-expression rules to user input. But which rules might apply is conditional on which keys are already present in the memory dictionary: if “name” is not yet defined, then the rule that defines “name” applies; but if it is, then the rule that mentions “word” applies.
Seems to me you need more information attached to your rules. For example, the “word” rule you gave above shouldn’t actually add “word” to the dictionary, otherwise it would only apply once (imagine if the user keeps trying to say “my name is x” more than twice).
Does that give you a bit more idea about how to proceed?
Oh, by the way, I think “|” is a poor choice for a separator character, because it can occur in regular expressions. Not sure what to suggest: how about “||”?

Categories