I have the following list Novar:
["'population_3000|road_class_3_3000'", "'population_3000|road_class_3_3000|trafBuf25'", "'population_3000|road_class_3_3000|trafBuf25|population_1000'"]
How can I append a character (e.g. $) to a specific part of the items in the list, for instance to road_class_3_3000, so that the updated list becomes:
["'population_3000|road_class_3_3000$'", "'population_3000|road_class_3_3000$|trafBuf25'", "'population_3000|road_class_3_3000$|trafBuf25|population_1000'"]
Most similar questions on Stack Overflow seem to focus on manipulating the item itself, rather than a part of the item, e.g. here and here.
Therefore, applying the following code:
if (item == "road_class_3_3000"):
    item.append("$")
Would be of no use since road_class_3_3000 is part of the items "'population_3000|road_class_3_3000'", "'population_3000|road_class_3_3000|trafBuf25'" and "'population_3000|road_class_3_3000|trafBuf25|population_1000'"
You could use the re module (part of the standard library) for this task, as follows:
import re
novar = ["'population_3000|road_class_3_3000'", "'population_3000|road_class_3_3000|trafBuf25'", "'population_3000|road_class_3_3000|trafBuf25|population_1000'"]
novar2 = [re.sub(r'(?<=road_class_3_3000)', '$', i) for i in novar]
print(novar2)
Output:
["'population_3000|road_class_3_3000$'", "'population_3000|road_class_3_3000$|trafBuf25'", "'population_3000|road_class_3_3000$|trafBuf25|population_1000'"]
The feature I used is called a positive lookbehind, which is a kind of zero-length assertion. It matches the zero-length position immediately after road_class_3_3000, and re.sub then inserts the $ character at that position.
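Since the target here is a fixed substring rather than a pattern, the same result can also be obtained without a regex. A minimal sketch using str.replace, assuming road_class_3_3000 never occurs as the beginning of a longer name that should stay untouched (the name target is only illustrative):

```python
novar = [
    "'population_3000|road_class_3_3000'",
    "'population_3000|road_class_3_3000|trafBuf25'",
    "'population_3000|road_class_3_3000|trafBuf25|population_1000'",
]

target = "road_class_3_3000"  # illustrative name for the part to mark

# Replace every occurrence of the target with the target followed by "$".
novar2 = [item.replace(target, target + "$") for item in novar]
print(novar2)
```

The lookbehind version is the better fit if the condition later becomes a real pattern, for example any road_class_* name.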
I have a string that includes multiple comma-separated lists of values, always embedded between <mks:Field name="MyField"> and </mks:Field>.
For example:
<mks:Field name="MyField">X001_ABC</mks:Field><mks:Field name="AnotherField">X002_XYZ</mks:Field><mks:Field name="MyField"></mks:Field><mks:Field name="MyField">X000_Test1,X000_Test2</mks:Field><mks:Field name="MyField">X001_ABC,X000_Test1</mks:Field><mks:Field name="MyField">X000_Test1,X000_Test2,X002_XYZ</mks:Field>
In this example I have the following values to work with:
X001_ABC
(empty)
X000_Test1,X000_Test2
X001_ABC,X000_Test1
X000_Test1,X000_Test2,X002_XYZ
Now I want to remove all the values that do not start with the prefix "X000_", including any needless commas, so that my result looks like this:
<mks:Field name="MyField"></mks:Field><mks:Field name="AnotherField">X002_XYZ</mks:Field><mks:Field name="MyField"></mks:Field><mks:Field name="MyField">X000_Test1,X000_Test2</mks:Field><mks:Field name="MyField">X000_Test1</mks:Field><mks:Field name="MyField">X000_Test1,X000_Test2</mks:Field>
I have tried the following regex, but it does not work properly when a field contains only a single value that does not match. I also do not want to have to change the regex whenever a new value with my prefix is introduced (e.g. X000_Test3).
Search: (?<=name="MyField">)[^<>](?:.*?(X000_Test1,X000_Test2|X000_Test1|X000_Test2))?.*?(?=</mks:Field>)
Replace: \1
This gives me the following result that does not match the expected output:
<mks:Field name="MyField">X000_Test1,X000_Test2</mks:Field><mks:Field name="MyField">X000_Test1</mks:Field><mks:Field name="MyField">X000_Test2</mks:Field>
Unfortunately I cannot simply parse the string with something else - I only have the option of a regex search/replace in this case.
Thank you in advance, any help would be appreciated.
If you are using JavaScript, use this:
const prefix = 'X000';
let pattern = new RegExp(`((?<=>)|,)((?!${prefix}|[>\<,]).)*(,|(?=\<))`, 'g');
For any other language whose regex flavor supports lookbehind, use this:
'/((?<=>)|,)((?!X000|[>\<,]).)*(,|(?=\<))/';
Here X000 is the prefix you want to keep. Replace each match with an empty string.
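To see what this pattern does, here is a small sketch that applies it with Python's re.sub and an empty replacement. The function name strip_unprefixed and the shortened sample are assumptions made for illustration. As far as I can tell, the pattern is not restricted to MyField, so values in other fields such as AnotherField would be filtered as well, and an unwanted value sitting between two kept values would take both of its commas with it.

```python
import re


def strip_unprefixed(text, prefix="X000"):
    """Delete values that the pattern treats as not carrying `prefix`."""
    pattern = re.compile(
        r"((?<=>)|,)"                                # start right after '>' or at a comma
        r"((?!" + re.escape(prefix) + r"|[><,]).)*"  # consume chars, but never '>', '<', ',' or the start of the prefix
        r"(,|(?=<))"                                 # end at a comma, or just before '<'
    )
    return pattern.sub("", text)


sample = (
    '<mks:Field name="MyField">X001_ABC</mks:Field>'
    '<mks:Field name="MyField">X001_ABC,X000_Test1</mks:Field>'
    '<mks:Field name="MyField">X000_Test1,X000_Test2,X002_XYZ</mks:Field>'
)
print(strip_unprefixed(sample))
```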
First off, this is homework. (I couldn't use a tag in the title and nothing showed up in the tag list at the bottom for homework, so please let me know if I should EDIT something else regarding this matter).
So I have been reading through the python docs and scavenging SO, finding several solutions that are close to what I want, but not exact.
I have a dictionary which I read into a string:
a
aa
aabbaa
...
z
We are practicing various regex patterns on this data.
The specific problem here is to return a list of words which match the pattern, NOT tuples with the groups within each match.
For example:
Given a subset of this dictionary like:
someword
sommmmmeword
someworddddd
sooooomeword
I want to return:
['sommmmmeword', 'someworddddd']
NOT:
[('sommmmmeword', 'mmmmm', ...), ...] # or any other variant
EDIT:
My reasoning behind the above example is that I want to see how I can avoid making a second pass over the results. That is, instead of writing:
res = re.match(re.compile(r'pattern'), dictionary)
return [r[0] for r in res]
I specifically want a mechanism where I can just use:
return re.match(re.compile(r'pattern'), dictionary)
I know that may sound silly, but I am doing this to really dig into regex. I mention this at the bottom.
This is what I have tried:
# learned about back refs
r'\b([b-z&&[^eiou]])\1+\b' -> # nothing
# back refs were weird, I want to match something N times
r'\b[b-z&&[^eiou]]{2}\b' -> # nothing
Somewhere in testing I noticed a pattern returning things like '\nsomeword'. I couldn't figure out what it was but if I find the pattern again I will include it here for completeness.
# Maybe the \b word markers don't work how I think?
r'.*[b-z&&[^eiou]]{2}' -> # still nothing
# Okay, let's just try to match something in between anything
r'.*[b-z&&[^eiou]].*' -> # nope
# Since it's words, maybe I should be more explicit.
r'[a-z]*[b-z&&[^eiou]][a-z]*' -> # still nope
# Decided to go back to grouping.
r'([b-z&&[^eiou]])(\1)' # I realize set difference may be the issue
# I saw someone (on SO) use set difference claiming it works
# but I gave up on it...
# OKAY getting close
r'(([b-df-hj-np-tv-xz])(\2))' -> [('ll', 'l', 'l'), ...]
# Trying the previous ones without set difference
r'\b(.*(?:[b-df-hj-np-tv-xz]{3}).*)\b' -> # returned everything (all words)
# Here I realize I need a non-greedy leading pattern (.* -> .*?)
r'\b(.*?(?:[b-df-hj-np-tv-xz]{3}).*)\b' -> # still everything
# Maybe I need the comma in {3,} to get anything 3 or more
r'\b(.*?(?:[b-df-hj-np-tv-xz]{3,}).*)\b' -> # still everything
# okay I'll try a 1 line test just in case
r'\b(.*?([b-df-hj-np-tv-xz])(\2{3,}).*)\b'
# Using 'asdfdffff' -> [('asdfdffff', 'f', 'fff')]
# Using dictionary -> [] # WAIT WHAT?!
How does this last one work? Maybe there are no words with 3+ repeating consonants? I'm using /usr/share/dict/cracklib-small on my school's server, which is about 50,000 words, I think.
I am still working on this but any advice would be awesome.
One thing I find curious is that you cannot back reference a non-capturing group. If I want to output only the full word, I use (?:...) to avoid capture, but then I cannot back reference. Obviously I could leave the captures, loop over the results and filter out the extra stuff, but I absolutely want to figure this out using ONLY regex!
Perhaps there is a way to do the non-capture, but still allow back reference? Or maybe there is an entirely different expression I haven't tested yet.
Here are some points to consider:
Use re.findall to get all the results, not re.match (which looks for a single match, and only at the start of the string).
[b-z&&[^eiou]] is Java/ICU regex syntax (character class intersection), which is not supported by Python's re. In Python, you can either redefine the ranges to skip the vowels, or use (?![eiou])[b-z].
To avoid "extra" values in tuples with re.findall, do not use capturing groups. If you need backreferences, use re.finditer instead of re.findall and access .group() of each match.
Coming back to the question of how you can use a backreference and still get the whole match, here is a working demo:
import re
s = """someword
sommmmmeword
someworddddd
sooooomeword"""
res = [x.group() for x in re.finditer(r"\w*([b-df-hj-np-tv-xz])\1\w*", s)]
print(res)
# => ['sommmmmeword', 'someworddddd']
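To make the difference between re.findall and re.finditer concrete, here is a small sketch that runs both on the same data. It uses the (?![eiou])[b-z] form mentioned above instead of spelled-out ranges; note that this class also accepts y, unlike the ranges in the demo. The variable names are only illustrative.

```python
import re

words = """someword
sommmmmeword
someworddddd
sooooomeword"""

# A non-vowel letter captured in group 1, immediately followed by itself.
pattern = re.compile(r"\w*((?![eiou])[b-z])\1\w*")

# With one capturing group, findall returns that group, not the whole match.
groups_only = pattern.findall(words)

# finditer yields match objects, so .group() gives the full matched word.
whole_words = [m.group() for m in pattern.finditer(words)]

print(groups_only)   # ['m', 'd']
print(whole_words)   # ['sommmmmeword', 'someworddddd']
```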
I have a python script which parses an xml file and then gives me the required information. My output looks like this, and is 100% correct:
output = ['77:275,77:424,77:425,77:426,77:427,77:412,77:413,77:414,77:412,77:413,77:414,77:412,77:413,77:414,77:412,77:413,77:414,77:431,77:432,77:433,77:435,77:467,77:470,77:471,77:484,77:485,77:475,77:476,77:437,77:438,77:439,77:440,77:442,77:443,77:444,77:445,77:446,77:447,77:449,77:450,77:451,77:454,77:455,77:456,77:305,77:309,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:317,77:321,77:346,77:349,77:350,77:351,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:362,77:367,77:369,77:374,77:370,77:372,77:373,77:387,77:388,77:389,77:392,77:393,77:394,77:328,77:283,77:284,77:285,77:288,77:289,77:290,77:292,']
It is all fine, but I want to remove the duplicate values inside the single list element, as in the case above. I tried using OrderedDict and also a simple list(set(output)), but obviously neither worked. Does anyone have a tip on how to solve this problem?
You have one element in a list. If you expected it to be treated as separate elements, you need to explicitly split it.
You could split the string on the ',' comma character into a list with str.split():
separate_elements = output[0].split(',')
after which you can use set() (unordered) or OrderedDict (maintaining order) and re-join the string if you still need just the one string object:
','.join(set(separate_elements))
You can put that back into a list with just one element, but there is little point if all you ever handle is that one string.
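If the original order matters, one option on Python 3.7+ is dict.fromkeys, since plain dicts preserve insertion order. A minimal sketch on a shortened, made-up sample in the same format as output; dropping the empty piece left by the trailing comma is an assumption about what is wanted:

```python
# Shortened, made-up sample in the same shape as the real output list.
output = ['77:275,77:424,77:412,77:413,77:412,77:413,77:275,']

# Split the single string and drop the empty piece after the trailing comma.
separate_elements = [part for part in output[0].split(',') if part]

# dict.fromkeys keeps the first occurrence of each key, in order.
unique_elements = list(dict.fromkeys(separate_elements))

# Re-join into one string and wrap it in a one-element list again.
deduplicated = [','.join(unique_elements)]
print(deduplicated)  # ['77:275,77:424,77:412,77:413']
```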