regex extraction 2 groups resulting only in one match

New to regex.
Consider you have the following text structure:
"hello_1:45||hello_2:67||bye_1:45||bye_5:89||.....|| bye_last:100" and so on
I want to build a dictionary out of it taking the string value as a key, and the decimal number as the dict value.
I was trying to check my concept using this nice tool
I wrote my regex expression:
(\w+):(\d+)
And got only one match, the first in the string: hello_1:45
I also tried something like:
.*(\w+):(\d+).*
But that did not work either. Any ideas?

You should use the g (global) modifier to get all the matches instead of stopping at the first one. In Python, you can use the re.findall function to get all the matches. Check the example here.
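As a minimal sketch of that suggestion, the snippet below feeds the question's own pattern to re.findall and builds the dictionary from the resulting (key, value) tuples. The variable names and the shortened sample string are only illustrative, and converting the values to int is an assumption about what "decimal number" means here.

```python
import re

# Shortened sample taken from the question (illustrative only)
text = "hello_1:45||hello_2:67||bye_1:45||bye_5:89"

# With two capturing groups, findall returns a list of (key, value) tuples
pairs = re.findall(r"(\w+):(\d+)", text)

# Assumption: the numeric part should be stored as int
result = {key: int(value) for key, value in pairs}
print(result)
```

Unlike a single re.search or re.match call, findall keeps scanning after the first hit, which is why every pair ends up in the dictionary.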

You can also do this with str.split alone, without any regex.
s = "hello_1:45||hello_2:67||bye_1:45||bye_5:89"
print({i.split(':')[0]: i.split(':')[1] for i in s.split('||')})
Try this if you want to convert the value part to int:
print({i.split(':')[0]: int(i.split(':')[1]) for i in s.split('||')})
or to float:
print({i.split(':')[0]: float(i.split(':')[1]) for i in s.split('||')})

Related

Extract values in name=value lines with regex

I'm really sorry for asking, because there are some questions like this around, but I can't adapt their answers to my problem.
These are the input lines (e.g. from a config file):
profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9
I just want to extract the values after the = sign using regex in Python 3.9.
I tried this on regex101.
^profile[0-9]\.name=(.*?)
But this gives me the variable name including the = sign as the result, e.g. profile2.name=. I want exactly the opposite.
The expected result (what Python's re.findall() returns) is
['share2', 'share8', 'shareSSH', 'share9']

Try the pattern profile\d+\.name=(.*) (see the Regex 101 example):
import re
# txt holds the input lines; the raw string keeps \d and \. intact
re.findall(r'profile\d+\.name=(.*)', txt)
# output
['share2', 'share8', 'shareSSH', 'share9']
But this problem doesn't necessarily need regex, split should work absolutely fine:
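The split-based code is not shown above, so the following is only a sketch of what such an approach could look like, not the answerer's original. It assumes the input is available as one multi-line string named txt and that the value is everything after the first = on each line.

```python
# Sample input copied from the question; the name txt is an assumption
txt = """profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9"""

# Split each line once on "=" and keep the right-hand side
values = [line.split("=", 1)[1] for line in txt.splitlines() if "=" in line]
print(values)
```

Passing 1 as the second argument to split keeps any further = characters inside the value intact.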

Try removing the ? quantifier. It makes your capture group lazy, so it matches an empty string.
regex101
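To illustrate the point about the ? quantifier, here is a small comparison of the lazy and greedy forms of the question's pattern. The re.MULTILINE flag is an assumption added so that ^ matches at the start of every line; the question does not say how the input is fed to the regex.

```python
import re

# Sample input copied from the question
txt = """profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9"""

# Lazy group: (.*?) is satisfied with zero characters, so each capture is empty
lazy = re.findall(r"^profile[0-9]\.name=(.*?)", txt, flags=re.MULTILINE)

# Greedy group: (.*) runs to the end of the line
greedy = re.findall(r"^profile[0-9]\.name=(.*)", txt, flags=re.MULTILINE)

print(lazy)
print(greedy)
```

Nothing follows the lazy group in the pattern, so the regex engine has no reason to extend it beyond the empty string.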

Regex to remove strings from list that do not match given prefix

I have a string that includes multiple comma-separated lists of values, always embedded between <mks:Field name="MyField"> and </mks:Field>.
For example:
<mks:Field name="MyField">X001_ABC</mks:Field><mks:Field name="AnotherField">X002_XYZ</mks:Field><mks:Field name="MyField"></mks:Field><mks:Field name="MyField">X000_Test1,X000_Test2</mks:Field><mks:Field name="MyField">X001_ABC,X000_Test1</mks:Field><mks:Field name="MyField">X000_Test1,X000_Test2,X002_XYZ</mks:Field>
In this example I have the following values to work with:
X001_ABC
(empty)
X000_Test1,X000_Test2
X001_ABC,X000_Test1
X000_Test1,X000_Test2,X002_XYZ
Now I want to remove all the values that do not start with the prefix "X000_", including any needless commas, so that my result looks like this:
<mks:Field name="MyField"></mks:Field><mks:Field name="AnotherField">X002_XYZ</mks:Field><mks:Field name="MyField"></mks:Field><mks:Field name="MyField">X000_Test1,X000_Test2</mks:Field><mks:Field name="MyField">X000_Test1</mks:Field><mks:Field name="MyField">X000_Test1,X000_Test2</mks:Field>
I have tried the following regex, but it does not work properly if the only value present does not match my prefix. I also do not want to change my regex whenever a new value matching my prefix is introduced (e.g. X000_Test3).
Search: (?<=name="MyField">)[^<>](?:.*?(X000_Test1,X000_Test2|X000_Test1|X000_Test2))?.*?(?=</mks:Field>)
Replace: \1
This gives me the following result that does not match the expected output:
<mks:Field name="MyField">X000_Test1,X000_Test2</mks:Field><mks:Field name="MyField">X000_Test1</mks:Field><mks:Field name="MyField">X000_Test2</mks:Field>
Unfortunately I cannot simply parse the string with something else - I only have the option of a regex search/replace in this case.
Thank you in advance, any help would be appreciated.

If you are using JavaScript, use this:
const prefix = 'X000';
let pattern= new RegExp(`((?<=>)|,)((?!${prefix}|[>\<,]).)*(,|(?=\<))`, 'g');
For any other language, use this as the search pattern, with an empty replacement:
'/((?<=>)|,)((?!X000|[>\<,]).)*(,|(?=\<))/';
Here X000 is the prefix you want to keep.
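As a rough way to try the pattern outside a search/replace dialog, the sketch below runs it through Python's re.sub with an empty replacement. The function name strip_non_prefixed and the two sample strings are illustrative only. Tracing the pattern suggests two limits worth checking against your data: it is not restricted to name="MyField", so values in other fields are stripped too, and a non-matching value sitting between two kept values takes both of its commas with it.

```python
import re

prefix = "X000"  # prefix to keep, as in the answer

# Same pattern as above, with the unnecessary backslashes before < dropped
pattern = re.compile(r"((?<=>)|,)((?!%s|[><,]).)*(,|(?=<))" % re.escape(prefix))

def strip_non_prefixed(markup):
    """Delete every value that does not start with the prefix (hypothetical helper)."""
    return pattern.sub("", markup)

one = '<mks:Field name="MyField">X001_ABC,X000_Test1</mks:Field>'
two = '<mks:Field name="MyField">X000_Test1,X000_Test2,X002_XYZ</mks:Field>'

print(strip_non_prefixed(one))
print(strip_non_prefixed(two))
```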

Find and replace semi-common strings in dataframe?

I am attempting to find a semi-common occurring string and remove all other data in the column. Pandas and Re have been imported. For instance, I have dataframe...
>>>df
COLUMN COUNT DATA
1 this row RA-123: data 8b43a
2 here RA-5372: data 94h63c
I need to keep just the RA-'number that follows' and remove everything before and after. The numbers that follow are not always the same length and the 'RA-' string does not always occur in the same position. There is a colon after every instance that can be used as a delimiter.
I tried this (a friend wrote the regex search piece for me because I am not familiar with it).
df.assign(DATA= df['DATA'].str.extract(re.search('RA[^:]+')))
But python returned
TypeError: search() missing 1 required positional argument: 'string'
What am I missing here? Thanks in advance!

You should use a capturing group with extract:
df['DATA'].str.extract(r'(RA-\d+)')
Here, (RA-\d+) is a capturing group matching RA, then a hyphen and then one or more digits.
You may use your own pattern, but you still need to wrap it with capturing parentheses, r'(RA[^:]+)'.
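To show what the capturing group picks out, and why the original call raised the TypeError, here is a sketch that uses only the re module on a plain list. The sample strings are a guess at what the DATA column holds, and the helper first_group is hypothetical; it is a stand-in for what extract does for each row, not pandas code.

```python
import re

# Assumed DATA values, loosely based on the table in the question
rows = ["this row RA-123: data 8b43a", "here RA-5372: data 94h63c"]

# re.search needs both a pattern and a string to search; calling it with
# only the pattern is what triggered the TypeError in the question.
pattern = re.compile(r"(RA-\d+)")

def first_group(text):
    """Return the first captured group, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None

extracted = [first_group(row) for row in rows]
print(extracted)
```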

Looking at the docs, you don't need the re.search method. You just call df['DATA'] = df['DATA'].str.extract(r'(RA[^:]+)', expand=False). Note that the pattern must contain a capturing group.

As I mentioned earlier, no need for re here.
Other answers have already shown how to use extract directly. However, to answer your question specifically: if you really want to use re, the way to go is to use re.compile instead of re.search.
df.assign(DATA= df['DATA'].str.extract(re.compile(regex_str)))

compute character frequencies in Python strings

I was wondering if there is a way in Python 3.5 to check if a string contains a certain symbol. Also I'd like to know if there is a way to check the amount the string contains. For example, if I want to check how many times the character '$' appears in this string...
^$#%#$$,
how would I do that?

You can use the in operator to check whether the symbol is in the string:
if '$' in your_str:
    print(your_str.count('$'))
You can also use re.findall:
import re
print(len(re.findall(r'\$', your_str)))
The length is 0 if the symbol is not in the string; otherwise it is the number of times the symbol occurs.
But the easiest way is to call count directly, with no separate check:
print(your_str.count('$'))
It returns 0 if nothing is found.
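If you need the frequency of every character rather than just one, collections.Counter from the standard library is another option. This is a swapped-in alternative to count, sketched on the string from the question; the variable names are illustrative.

```python
from collections import Counter

text = "^$#%#$$,"  # string from the question

# Counter maps each character to the number of times it occurs
freq = Counter(text)

dollars = freq["$"]       # occurrences of '$'
missing = freq["@"]       # absent characters count as 0, no KeyError
has_dollar = "$" in text  # plain membership test

print(dollars, missing, has_dollar)
```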

These are the built-in string methods find, index and count. You can find full documentation at the official site. Please get used to doing the research on your own; the first step is to get familiar with the names of the language elements. Note that index raises ValueError when the character is absent, so find, which returns -1 in that case, is the safer presence check.
if my_str.find('$') != -1:
    # Found a dollar sign
    print(my_str.count('$'))

Compare & manipulate strings with python

I've written an XML parser in Python and have just added functionality to read a further script from a different directory.
I've got two args, first is the path where I'm parsing XML. Second is a string in another XML file which I want to match with the first path;
arg1 = \work\parser\main\tools\app\shared\xml\calculators\2012\example\calculator
path = calculators/2012/example/calculator
How can I compare the two strings to identify that they're both referencing the same thing? Also, how can I strip calculator from either string so I can store that and use it?
edit
Just had a thought. I have used a Regex to get the year out of the path already with year = re.findall(r"\.(\d{4})\.", path) following a problem Python has with numbers when converting the path to an import statement.
I could obviously split the strings and use a regex to match the path as a pattern in arg1 but this seems a long way round. Surely there's a better method?

Here I am assuming you are actually talking about strings, and not file paths, for which @mgilson's suggestion is better.
How can I compare the two strings to identify that they're both
referencing the same thing
Well, first you need to define what you mean by "the same thing".
At first glance, it seems that if the first string ends with the second string with its slashes reversed, you have a match.
arg1 = r'\work\parser\main\tools\app\shared\xml\calculators\2012\example\calculator'
arg2 = r'calculators/2012/example/calculator'
>>> arg1.endswith(arg2.replace('/','\\'))
True
and also, how can I strip calculator from
either string so I can store that & use it?
You also need to decide if you want to strip the first calculator, the last calculator or any occurrence of calculator in the string.
If you just want to take the last component after the separator, then it's simply:
>>> arg2.split('/')[-1]
'calculator'
Now to get the original string back, without the last bit:
>>> '/'.join(arg2.split('/')[:-1])
'calculators/2012/example'

check out os.path.samefile:
http://docs.python.org/library/os.path.html#os.path.samefile
and os.path.dirname:
http://docs.python.org/library/os.path.html#os.path.dirname
or maybe os.path.basename (I'm not sure what part of the string you want to keep).
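Note that os.path.samefile needs both paths to exist on disk. If the two values are only strings, a possible alternative is pathlib's pure path classes, which parse paths without touching the filesystem. The sketch below is one way to do it under that assumption; the helper ends_with_parts is hypothetical.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Values from the question, parsed with the matching path flavour
arg1 = PureWindowsPath(r"\work\parser\main\tools\app\shared\xml\calculators\2012\example\calculator")
path = PurePosixPath("calculators/2012/example/calculator")

def ends_with_parts(full, tail):
    """True when the last components of full equal all components of tail."""
    n = len(tail.parts)
    return n > 0 and full.parts[-n:] == tail.parts

same_target = ends_with_parts(arg1, path)
last_component = path.name   # final component of the path
remainder = path.parent      # everything before the final component

print(same_target, last_component, remainder)
```

Comparing components instead of raw strings avoids having to rewrite slashes by hand.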

Here, try this:
arg1 = r"\work\parser\main\tools\app\shared\xml\calculators\2012\example\calculator"
path = "calculators/2012/example/calculator"
# Normalize both strings to backslash separators before comparing
arg1 = arg1.replace("/", "\\")
path = path.replace("/", "\\")
if arg1.endswith(path) or path.endswith(arg1):
    print("Match")
That should work for your needs. Cheers :)
