I have a dictionary named dicionario1. I need to replace the content of dicionario1[chave][1], which is a list, with the list lista_atributos.
lista_atributos is built from the content of dicionario1[chave][1], so that:
All the information is separated by "," except where the characters "(#" and ")" are found. In that case, it should create a list with the content between those characters (also separated by ","). There can be one or more occurrences of '(#', and I need to handle every one of them.
Although this might be easy, I'm stuck with the following code:
dicionario1 = {'#998' : [['IFCPROPERTYSET'],["'0siSrBpkjDAOVD99BESZyg',#41,'Geometric Position',$,(#977,#762,#768,#754,#753,#980,#755,#759,#757)"]],
'#1000' : [['IFCRELDEFINESBYPROPERTIES'],["'1dEWu40Ab8zuK7fuATUuvp',#41,$,$,(#973,#951),#998"]]}
for chave in dicionario1:
    lista_atributos = []
    ini = 0
    for i in dicionario1[chave][1][0][ini:]:
        if i == '(' and dicionario1[chave][1][0][dicionario1[chave][1][0].index(i) + 1] == '#':
            ini = dicionario1[chave][1][0].index(i) + 1
            fim = dicionario1[chave][1][0].index(')')
            lista_atributos.append(dicionario1[chave][1][0][:ini-2].split(','))
            lista_atributos.append(dicionario1[chave][1][0][ini:fim].split(','))
            lista_atributos.append(dicionario1[chave][1][0][fim+2:].split(','))
    print lista_atributos
Result:
[["'1dEWu40Ab8zuK7fuATUuvp'", '#41', '$', '$'], ['#973', '#951'], ['#998']]
[["'0siSrBpkjDAOVD99BESZyg'", '#41', "'Geometric Position'", '$'], ['#977', '#762', '#768', '#754', '#753', '#980', '#755', '#759', '#757'], ['']]
Unfortunately I can't figure out how to iterate over dicionario1[chave][1][0] to get this result:
[["'1dEWu40Ab8zuK7fuATUuvp'"], ['#41'], ['$'], ['$'], ['#973', '#951'], ['#998']]
[["'0siSrBpkjDAOVD99BESZyg'"], ['#41'], ["'Geometric Position'"], ['$'], ['#977', '#762', '#768', '#754', '#753', '#980', '#755', '#759', '#757']]
I also need the ["'1dEWu40Ab8zuK7fuATUuvp'", '#41', '$', '$'] part of the result to turn into ["'1dEWu40Ab8zuK7fuATUuvp'"], ['#41'], ['$'], ['$'] ...
Also, if I modify "Geometric Position" to "(Geometric Position)", the result becomes:
[["'1dEWu40Ab8zuK7fuATUuvp'", '#41', '$', '$'], ['#973', '#951'], ['#998']]
SOLUTION: (thanks to Rob Watts)
import re
dicionario1 = ["'0siSrBpkjDAOVD99BESZyg',#41,'(Geometric) (Position)',$,(#977,#762,#768,#754,#753,#980,#755,#759,#757)"]
dicionario1 = re.findall('\([^)]*\)|[^,]+', dicionario1[0])
for i in range(len(dicionario1)):
    if dicionario1[i].startswith('(#'):
        dicionario1[i] = dicionario1[i][1:-1].split(',')
    else:
        pass
print dicionario1
["'0siSrBpkjDAOVD99BESZyg'", '#41', "'(Geometric) (Position)'", '$', ['#977', '#762', '#768', '#754', '#753', '#980', '#755', '#759', '#757']]
One problem I see with your code is the use of index:
ini = dicionario1[chave][1][0].index(i) + 2
fim = dicionario1[chave][1][0].index(')')
index returns the index of the first occurrence of the character. So if you have two ('s in your string, then both times it will give you the index of the first one. That (and your break statement) is why in your example you've got ['2.1', '2.2', '2.3'] correctly but also have '(#5.1', '5.2', '5.3)'.
You can get around this by specifying a starting index to the index method, but I'd suggest a different strategy. If you don't have any commas in the parsed strings, you can use a fairly simple regex to find all your groups:
'\([^)]*\)|[^,]+'
This will find everything inside parentheses, and also every run of characters that doesn't contain a comma. For example:
>>> import re
>>> teststr = "'1',$,#41,(#10,#5)"
>>> re.findall('\([^)]*\)|[^,]+', teststr)
["'1'", '$', '#41', '(#10,#5)']
This leaves you with everything grouped appropriately. You still have to do a little bit of processing on each entry, but it should be fairly straightforward.
During your processing, the startswith method should be helpful. For example:
>>> '(something)'.startswith('(')
True
>>> '(something)'.startswith('(#')
False
>>> '(#1,#2,#3)'.startswith('(#')
True
This will make it easy for you to distinguish between (...) and (#...). If there are commas in the (...), you could always split on comma after you've used the regex.
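Putting the two pieces together, the regex grouping and the startswith check can be combined into one small function. This is only a sketch: the name parse_attributes is hypothetical, and it assumes that plain values never contain a comma and that '(#...)' groups are not nested.

```python
import re

def parse_attributes(raw):
    """Split on commas, but turn each '(#...)' group into its own sub-list.

    parse_attributes is a hypothetical helper name, not part of the
    original code.
    """
    tokens = re.findall(r'\([^)]*\)|[^,]+', raw)
    result = []
    for token in tokens:
        if token.startswith('(#'):
            # drop the surrounding parentheses, then split the references
            result.append(token[1:-1].split(','))
        else:
            result.append(token)
    return result

parsed = parse_attributes("'1dEWu40Ab8zuK7fuATUuvp',#41,$,$,(#973,#951),#998")
print(parsed)
# ["'1dEWu40Ab8zuK7fuATUuvp'", '#41', '$', '$', ['#973', '#951'], '#998']
```

Because the check is on '(#' rather than '(', a quoted value such as '(Geometric Position)' is left as a plain string.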
I need some help please.
I have a dataframe with multiple columns, two of which are:
Content_Clean: a column of content strings
Removals: a list of strings to be removed from the Content_Clean column
Problem: I am trying to replace words in Content_Clean with spaces if they appear in the Removals column:
Example:
Content Clean: 'Johnny and Mary went to the store'
Removals: ['Johnny','Mary']
Output: 'and went to the store'
Example Code:
for i in data_eng['Removals']:
    for u in i:
        data_eng['Content_Clean_II'] = data_eng['Content_Clean'].str.replace(u,' ')
This does not work because the Removals column contains a list per row.
Another Example:
data_eng['Content_Clean_II'] = data_eng['Content_Clean'].apply(lambda x: re.sub(data_eng.loc[data_eng['Content_Clean'] == x, 'Removals'].values[0], '', x))
This does not work because the code only looks for a single string.
The problem is that each row of the Removals column holds a list, and I want to use that list to remove words from (or replace them with spaces in) the Content_Clean column on a per-row basis.
The example image link might help
Here you go. This worked on my test data. Let me know if it works for you
def repl(row):
    for word in row['Removals']:
        row['Content_Clean'] = row['Content_Clean'].replace(word, '')
    return row
data_eng = data_eng.apply(repl, axis=1)
You can call the str.replace(old, new) method to remove unwanted words from a string.
Here is one small example I have done.
a_string = "I do not like to eat apples and watermelons"
stripped_string = a_string.replace(" do not", "")
print(stripped_string)
This will remove " do not" from the sentence.
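Both answers above rely on plain str.replace, which also removes a word when it appears inside a longer word and leaves stray spaces behind. One possible refinement, sketched here with a hypothetical helper named remove_words, is to match whole words with a regex and then collapse the leftover whitespace. It assumes the removals are ordinary words, so that the \b word boundary behaves as expected.

```python
import re

def remove_words(text, removals):
    """Remove whole-word occurrences of each removal, then tidy the spacing.

    remove_words is a hypothetical helper, not part of the original answers.
    """
    if not removals:
        return text
    # re.escape guards against regex metacharacters inside the removal strings
    pattern = r'\b(?:' + '|'.join(map(re.escape, removals)) + r')\b'
    stripped = re.sub(pattern, ' ', text)
    # split() with no argument collapses runs of whitespace
    return ' '.join(stripped.split())

cleaned = remove_words('Johnny and Mary went to the store', ['Johnny', 'Mary'])
print(cleaned)  # and went to the store
```

With the dataframe from the question, this could presumably be applied per row with something like data_eng.apply(lambda row: remove_words(row['Content_Clean'], row['Removals']), axis=1).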
I need to modify some Python code.
The code reads information from a .csv file and writes it to a new .csv file with a different structure.
In one of the columns of the source file, I have a string value which, 99% of the time, has this form: 'block1 block2 block3'.
Block2 ends with 'm' 99% of the time.
Example: 'R2 180m RFT'.
Browsing the source dataset, I realized that in 1% of cases block2 ends with 'M' instead.
As I need everything after the 'm' or 'M', I'm a bit stuck.
I used the .split() function, like this:
'Newcolumn': getattr(row_unique_ids, 'COLUMNINTHEDATASET').split('m')[1],
With this, my script raises an error when it reaches a value like this:
'R2 180M AST'.
So I would like to know how to add an additional argument that would make the split work whether the script encounters 'm' or 'M'.
Thank you for your help.
One solution is to lowercase the string before splitting:
s = getattr(row_unique_ids, 'COLUMNINTHEDATASET')
s = s.lower()
s.split('m')[1]
But that will mess up your casing. If you want to preserve casing,
another solution is to do:
x = ''
s = getattr(row_unique_ids, 'COLUMNINTHEDATASET')
for c in s:
    if c == 'M':
        x += 'm'  # turn only 'M' into 'm'
    else:
        x += c    # keep every other character as-is
x.split('m')[1]
One way to do multi-arguments split is, in general:
import re
string = "this is 3an infamous String4that I need to s?plit in an infamou.s way"
#Preserve the original char
print (re.sub(r"([0-9]|[?.]|[A-Z])",r'\1'+"DELIMITER",string).split('DELIMITER'))
#Discard the original char
print (re.sub(r"([0-9]|[?.]|[A-Z])","DELIMITER",string).split('DELIMITER'))
Output:
['this is 3', 'an infamous S', 'tring4', 'that I', ' need to s?', 'plit in an infamou.', 's way']
['this is ', 'an infamous ', 'tring', 'that ', ' need to s', 'plit in an infamou', 's way']
In your context:
import re
string = "R2 180m RFT R2 180M RFT"
print (re.sub(r"\b([0-9]+)[mM]\b",r'\1'+"M",string).split('M'))
#print (re.sub(r"\b([0-9]+)[mM]\b",r'\1'+"M",getattr(row_unique_ids, 'COLUMNINTHEDATASET')).split('M'))
Output:
['R2 180', ' RFT R2 180', ' RFT']
It will split on m and M if those are preceded by a number.
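The same idea can also be written with re.split directly, which avoids the intermediate substitution. The sketch below is an assumption-laden variant, not the code from the answers: after_unit is a hypothetical helper name, it splits only on the first 'm' or 'M' that directly follows a digit, and it strips the surrounding spaces from the remainder.

```python
import re

def after_unit(value):
    """Return the text after the first 'm'/'M' that directly follows a digit.

    after_unit is a hypothetical helper; an empty string is returned when
    no such unit is found, instead of raising IndexError.
    """
    parts = re.split(r'(?<=\d)[mM]\b', value, maxsplit=1)
    return parts[1].strip() if len(parts) > 1 else ''

print(after_unit('R2 180m RFT'))  # RFT
print(after_unit('R2 180M AST'))  # AST
```

Requiring a preceding digit means that an 'M' or 'm' elsewhere in the value (for example in block3) does not trigger the split.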
I'm trying to modify list elements and replace the original element with the newly modified one. However, I've noticed that the behavior differs depending on how I construct my for loop. For example:
samples = ['The cat sat on the mat.', 'The dog at my homework.']
punctuation = ['\'', '\"', '?', '!', ',', '.']
for sample in samples:
    sample = [character for character in sample if character not in punctuation]
    sample = ''.join(sample)
print(samples)
for i in range(len(samples)):
    samples[i] = [character for character in samples[i] if character not in punctuation]
    samples[i] = ''.join(samples[i])
print(samples)
This program outputs:
['The cat sat on the mat.', 'The dog at my homework.']
['The cat sat on the mat', 'The dog at my homework']
The second for loop is the desired output with the punctuation removed from the sentence, but I'm having trouble understanding why that happens. I've searched online and found this Quora answer to be helpful in explaining the technical details, but I'm wondering if it's impossible to modify list elements using the first method of for loops, and if I have to resort to using functions like range or enumerate to modify list elements within loops.
Thank you.
Modifying the loop variable is not enough; you need to modify the list as well.
You need to replace the item in the list, not update the local variable created by the for loop. One option would be to use a range and update by index.
for i in range(len(samples)):
    sample = [character for character in samples[i] if character not in punctuation]
    samples[i] = ''.join(sample)
That said, a more pythonic approach would be to use a comprehension. You can also use the regex library to do the substitution.
import re
clean_samples = [
    re.sub("['\"?!,.]", "", sample)
    for sample in samples
]
Try this out:
samples = ['The cat sat on the mat.', 'The dog at my homework.']
punctuation = ['\'', '\"', '?', '!', ',', '.']
new_sample = []
for sample in samples:
    sample = [character for character in sample if character not in punctuation]
    sample = ''.join(sample)
    new_sample.append(sample)
print(new_sample)
In this case, sample is just a loop variable bound to each element in turn. Assigning to it rebinds the name; it does not update the element stored in the list.
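Since the question mentions enumerate, here is a minimal sketch of that variant. It keeps the readable "for item in list" style while still assigning back into the list by index; the data is the same as in the question.

```python
samples = ['The cat sat on the mat.', 'The dog at my homework.']
punctuation = ['\'', '\"', '?', '!', ',', '.']

for i, sample in enumerate(samples):
    # assigning to samples[i] changes the list itself;
    # assigning to sample would only rebind the loop variable
    samples[i] = ''.join(ch for ch in sample if ch not in punctuation)

print(samples)  # ['The cat sat on the mat', 'The dog at my homework']
```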
I am parsing a string that I know will definitely only contain the following distinct phrases that I want to parse:
'Man of the Match'
'Goal'
'Assist'
'Yellow Card'
'Red Card'
The string that I am parsing could contain everything from none of the elements above to all of them (i.e. the string being parsed could be anything from None to 'Man of the Match Goal Assist Yellow Card Red Card').
For those of you that understand football, you will also realise that the elements 'Goal' and 'Assist' could in theory be repeated an infinite number of times. The element 'Yellow Card' could be repeated 0, 1 or 2 times also.
I have built the following Regex (where 'incident1' is the string being parsed), which I believed would return an unlimited number of all preceding Regexes, however all I am getting is single instances:
regex1 = re.compile("Man of the Match*", re.S)
regex2 = re.compile("Goal*", re.S)
regex3 = re.compile("Assist*", re.S)
regex4 = re.compile("Red Card*", re.S)
regex5 = re.compile("Yellow Card*", re.S)
mysearch1 = re.search(regex1, incident1)
mysearch2 = re.search(regex2, incident1)
mysearch3 = re.search(regex3, incident1)
mysearch4 = re.search(regex4, incident1)
mysearch5 = re.search(regex5, incident1)
#print mystring
print "incident1 = ", incident1
if mysearch1 is not None:
    print "Man of the match = ", mysearch1.group()
if mysearch2 is not None:
    print "Goal = ", mysearch2.group()
if mysearch3 is not None:
    print "Assist = ", mysearch3.group()
if mysearch4 is not None:
    print "Red Card = ", mysearch4.group()
if mysearch5 is not None:
    print "Yellow Card = ", mysearch5.group()
This works as long as there is only one instance of every element encountered in a string, however if a player was for example to score more than one goal, this code only returns one instance of 'Goal'.
Can anyone see what I am doing wrong?
You can try something like this:
import re
s = "here's an example Man of the Match match and a Red Card match, and another Red Card match"
patterns = [
    'Man of the Match',
    'Goal',
    'Assist',
    'Yellow Card',
    'Red Card',
]
repattern = '|'.join(patterns)
matches = re.findall(repattern, s, re.IGNORECASE)
print matches # ['Man of the Match', 'Red Card', 'Red Card']
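If the goal is to know how many times each incident occurred, the findall result can be fed into collections.Counter. The following is a sketch in Python 3 syntax; count_incidents is a hypothetical helper name and the sample string is made up for illustration. As a side note on the original attempt: in a pattern such as "Goal*", the * applies only to the final "l" (so it means "Goa" followed by zero or more "l"), and re.search stops at the first match in any case.

```python
import re
from collections import Counter

PATTERNS = ['Man of the Match', 'Goal', 'Assist', 'Yellow Card', 'Red Card']
INCIDENT_RE = re.compile('|'.join(map(re.escape, PATTERNS)))

def count_incidents(incident):
    """Count every occurrence of each known phrase (hypothetical helper)."""
    if not incident:
        # the question notes that the parsed value may be None
        return Counter()
    return Counter(INCIDENT_RE.findall(incident))

counts = count_incidents('Man of the Match Goal Goal Assist Yellow Card')
print(counts['Goal'])      # 2
print(counts['Red Card'])  # 0
```

A Counter returns 0 for phrases that never occurred, so no extra None checks are needed when reading the totals.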
Some general overview on regex methods in python:
re.search | re.match
In your previous attempt, you tried to use re.search. This only returned one result, and as you'll see this isn't unusual. These two functions are used to identify if a line contains a certain regex. You'd use these for something like:
import subprocess
s = subprocess.check_output('ipconfig') # calls ipconfig and sends output to s
for line in s.splitlines():
    if re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", str(line)):
        # if line contains an IP address...
        print(line)
You use re.match to specifically check if the regex matches at the BEGINNING of the string. This is usually used with a regex that matches the WHOLE string. For example:
lines = ['Adam Smith, Age: 24, Male, Favorite Thing: Reading page: 16',
         'Adam Smith, Age: 16, Male, Favorite Thing: Being a regex example']
# two Adams, but we only want the one who is 16 years old.
repattern = re.compile(r'''Adam \w+, Age: 16, (?:Male|Female), Favorite Thing: [^,]*?''')
for line in lines:
    if repattern.match(line):
        print(line)
# Adam Smith, Age: 16, Male, Favorite Thing: Being a regex example
# note if we'd used a case-insensitive re.search for "age: 16", it would have found both lines!
The takeaway is that you use these two functions to select lines in a longer document (or any iterable).
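A compact way to see the difference between the two functions is to run the same pattern through both. The sketch below uses a made-up sample line, and also shows re.fullmatch, which is not mentioned in the answer but is the standard-library way to require that the whole string matches.

```python
import re

line = 'Adam Smith, Age: 16, Male'

anchored = re.match(r'Age: \d+', line)   # only tries at the start of the string
anywhere = re.search(r'Age: \d+', line)  # scans the whole string
whole = re.fullmatch(r'Adam \w+, Age: \d+, (?:Male|Female)', line)

print(anchored)            # None
print(anywhere.group())    # Age: 16
print(whole is not None)   # True
```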
re.findall | re.finditer
It seems in this case, you aren't trying to match a line, you're trying to pull some specifically-formatted information from the string. Let's see some examples of that.
s = """Phone book:
Adam: (555)123-4567
Joe: (555)987-6543
Alice:(555)135-7924"""
pat = r'''(?:\(\d{3}\))?\d{3}-?\d{4}'''
phone_numbers = re.findall(pat, s)
print(phone_numbers)
# ['(555)123-4567','(555)987-6543','(555)135-7924']
re.finditer returns an iterator of match objects instead of a list of strings. You'd use it the same way you'd use xrange instead of range in Python 2: re.findall(some_pattern, some_string) can build a GIANT list if there are a TON of matches, while re.finditer yields them one at a time.
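As a small illustration of re.finditer on the same phone book, each item produced is a match object, so the position of the match is available alongside its text. This is only a sketch; the numbers list is there purely to collect the results for inspection.

```python
import re

s = """Phone book:
Adam: (555)123-4567
Joe: (555)987-6543
Alice:(555)135-7924"""
pat = r'(?:\(\d{3}\))?\d{3}-?\d{4}'

numbers = []
for m in re.finditer(pat, s):
    # m.start() is the offset of the match, m.group() is the matched text
    numbers.append((m.start(), m.group()))
    print(m.start(), m.group())
```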
other methods: re.split | re.sub
re.split is great if you have a number of things you need to split by. Imagine you had the string:
s = '''Hello, world! It's great that you're talking to me, and everything, but I'd really rather you just split me on punctuation marks. Okay?'''
There's no great way to do that with str.split like you're used to, so instead do:
separators = [".", "!", "?", ","]
splitpattern = '|'.join(map(re.escape, separators))
# re.escape takes a string and escapes out any characters that regex considers
# special, for instance that . would otherwise be "any character"!
split_s = re.split(splitpattern, s)
print(split_s)
# ['Hello', ' world', " It's great that you're talking to me", ' and everything', " but I'd really rather you just split me on punctuation marks", ' Okay', '']
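One related detail of re.split that may be useful here: if the pattern contains a capturing group, the separators themselves are kept in the result. A minimal sketch, using a shortened sample string rather than the one above:

```python
import re

s = "Hello, world! Okay?"
# the parentheses form a capturing group, so each separator is returned too
parts = re.split(r'([.!?,])', s)
print(parts)  # ['Hello', ',', ' world', '!', ' Okay', '?', '']
```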
re.sub is great in cases where you know something will be formatted regularly, but you're not sure exactly how. However, you REALLY want to make sure they're all formatted the same! This will be a little advanced and use several methods, but stick with me....
dates = ['08/08/2014', '09-13-2014', '10.10.1997', '9_29_09']
separators = list()
valid_dates = list()
new_sep = "/"
match_pat = re.compile(r'''
    \d{1,2}          # one or two digits
    (.)              # followed by a separator (capture)
    \d{1,2}          # one or two more digits
    \1               # a backreference to that separator
    \d{2}(?:\d{2})?  # two digits, optionally followed by two more''', re.X)
for date in dates:
    match = match_pat.match(date)
    if match:
        sep = match.group(1) # the separator
        separators.append(sep)
        valid_dates.append(date)
    # else: this isn't really a date, is it? Skip it rather than calling
    # dates.pop() while iterating, which would make the loop skip entries.
repl_pat = '|'.join(map(re.escape, separators))
final_dates = re.sub(repl_pat, new_sep, '\n'.join(valid_dates))
print(final_dates)
# 08/08/2014
# 09/13/2014
# 10/10/1997
# 9/29/09
For a slightly less advanced example: re.sub also accepts a function as the replacement. The function is called with each match object, and whatever it returns is substituted in. For instance:
def get_department(dept_num):
    departments = {'1': 'I.T.',
                   '2': 'Administration',
                   '3': 'Human Resources',
                   '4': 'Maintenance'}
    if hasattr(dept_num, 'group'): # then it's a match, not a number
        dept_num = dept_num.group(0)
    return departments.get(dept_num, "Unknown Dept")
file = r"""Name,Performance Review,Department
Adam,3,1
Joe,5,2
Alice,1,3
Eve,12,4""" # this looks like a csv file
dept_names = re.sub(r'''\d+$''', get_department, file, flags=re.M)
print(dept_names)
# Name,Performance Review,Department
# Adam,3,I.T.
# Joe,5,Administration
# Alice,1,Human Resources
# Eve,12,Maintenance
Without using regex here you could do:
replaced_lines = []
departments = {'1': 'I.T.',
               '2': 'Administration',
               '3': 'Human Resources',
               '4': 'Maintenance'}
for line in file.splitlines():
    the_split_line = line.split(',')
    # wrap the looked-up name in a list so it can be concatenated with the slice
    # (note: unlike the regex version, this also rewrites the header row)
    replaced_lines.append(','.join(the_split_line[:-1] +
        [departments.get(the_split_line[-1], "Unknown Dept")]))
new_file = '\n'.join(replaced_lines)
# LOTS OF STRING MANIPULATION, YUCK!
Instead we replace all that for loop and string splitting, list slicing, and string manipulation with a function and a re.sub call. In fact, if you use a lambda it's even easier! Just remember that the lambda receives a match object, so look up its matched text with .group(0):
departments = {'1': 'I.T.',
               '2': 'Administration',
               '3': 'Human Resources',
               '4': 'Maintenance'}
re.sub(r'''\d+$''', lambda m: departments.get(m.group(0), "Unknown Dept"), file, flags=re.M)
# DONE!
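For reference, here is the lambda variant as a self-contained sketch that can be run on its own. The replacement callable passed to re.sub receives a match object rather than a string, so the lookup uses m.group(0). The variable name records is used here instead of file only to avoid confusion with a file handle.

```python
import re

departments = {'1': 'I.T.',
               '2': 'Administration',
               '3': 'Human Resources',
               '4': 'Maintenance'}

records = """Name,Performance Review,Department
Adam,3,1
Joe,5,2
Alice,1,3
Eve,12,4"""

# with re.M, $ matches at the end of every line, so only the last field changes
result = re.sub(r'\d+$',
                lambda m: departments.get(m.group(0), 'Unknown Dept'),
                records, flags=re.M)
print(result)
```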
I have a list
['mPXSz0qd6j0 youtube ', 'lBz5XJRLHQM youtube ', 'search OpHQOO-DwlQ ',
'sachin 47427243 ', 'alex smith ', 'birthday JEaM8Lg9oK4 ',
'nebula 8x41n9thAU8 ', 'chuck norris ',
'searcher O6tUtqPcHDw ', 'graham wXqsg59z7m0 ', 'queries K70QnTfGjoM ']
Is there some way to identify the strings which can't be spelt in the list item and remove them?
You can use, e.g. PyEnchant for basic dictionary checking and NLTK to take minor spelling issues into account, like this:
import enchant
import nltk
spell_dict = enchant.Dict('en_US') # or whatever language supported
def get_distance_limit(w):
    '''
    The word is considered good
    if it's no further from a known word than this limit.
    '''
    return len(w)/5 + 2 # just for example, allowing around 1 typo per 5 chars.
def check_word(word):
    if spell_dict.check(word):
        return True # a known dictionary word
    # try similar words
    max_dist = get_distance_limit(word)
    for suggestion in spell_dict.suggest(word):
        if nltk.edit_distance(suggestion, word) < max_dist:
            return True
    return False
Add case normalisation and a filter for digits and you'll get a pretty good heuristic.
It is entirely possible to compare your list members to words that you don't believe to be valid for your input.
This can be done in many ways, partially depending on your definition of "properly spelled" and what you end up using for a comparison list. If you decide that numbers preclude an entry from being valid, or underscores, or mixed case, you could test for regex matching.
Post regex, you would have to decide what a valid character to split on should be. Is it spaces (are you willing to break on 'ad hoc' ('ad' is an abbreviation, 'hoc' is not a word))? Is it hyphens (this will break on hyphenated last names)?
With these above criteria decided, it's just a decision of what word, proper name, and common slang list to use and a list comprehension:
word_list[:] = [term for term in word_list if passes_my_membership_criteria(term)]
where passes_my_membership_criteria() is a function that contains the rules for staying in the list of words, returning False for things that you've decided are not valid.
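As a rough illustration of that last step, the sketch below applies such a rule to individual tokens rather than to whole entries, because each entry in the question mixes real words with IDs. Everything here is an assumption made for the example: KNOWN_WORDS is a tiny hypothetical stand-in for a real word list (or for a dictionary check such as check_word above), and the rule simply rejects any token that contains non-letters.

```python
# Hypothetical stand-in for a real dictionary or word list.
KNOWN_WORDS = {'youtube', 'search', 'birthday', 'nebula'}

def passes_my_membership_criteria(token):
    """Example rule: letters only, and present in the word list."""
    return token.isalpha() and token.lower() in KNOWN_WORDS

def clean_entry(entry):
    """Keep only the tokens of one entry that pass the rule."""
    return ' '.join(tok for tok in entry.split()
                    if passes_my_membership_criteria(tok))

word_list = ['mPXSz0qd6j0 youtube ', 'search OpHQOO-DwlQ ', 'nebula 8x41n9thAU8 ']
word_list[:] = [clean_entry(entry) for entry in word_list]
print(word_list)  # ['youtube', 'search', 'nebula']
```

Random-looking tokens made only of letters would still need a proper dictionary lookup to be caught, which is where a spell checker comes in.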