How to extract a certain text from a string using Python

Consider a string in the format
sampleapp-ABCD-1234-us-eg-123456789. I need to extract the text ABCD-1234; that is, ABCD and then the digits up to the next -.
Please let me know how I can do that.

You could use str.split(), so it would be:
string = 'sampleapp-ABCD-1234-us-eg-123456789'
example = string.split('-')
Then you can access 'ABCD' and '1234' as example[1] and example[2] respectively. You can also join them back together into one string, if need be, with str.join().
string = 'sampleapp-ABCD-1234-us-eg-123456789'
example = string.split('-')
newstring = ' '.join(example[1:3])
print(newstring)
You can also change the separator: '-'.join(example[1:3]) gives 'ABCD-1234' rather than 'ABCD 1234'.

You can use a regular expression (regex).
Here's a Python script you can use:
import re
txt = "sampleapp-ABCD-1234-us-eg-123456789"
x = re.findall("([ABCD]+[-][0-9]+)", txt)
print(x)
A more general version, matching any four uppercase letters:
x = re.findall("([A-Z]{4}[-][0-9]+)", txt)
For more info about regex, you can learn it here: regexr.com
Hope this helps. Cheers!
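
A note on the patterns above: [ABCD]+ is a character class, so it matches any run of the letters A, B, C and D (for example BAD), not only the literal text ABCD. Also, re.findall returns a list. The following is a minimal sketch that returns a single string; it assumes the wanted part is always the first "uppercase letters, hyphen, digits" field that follows a hyphen, which the question does not confirm.

```python
import re

txt = "sampleapp-ABCD-1234-us-eg-123456789"

# Assumed layout: <app>-<UPPERCASE LETTERS>-<digits>-...
match = re.search(r"-([A-Z]+-[0-9]+)", txt)
code = match.group(1) if match else None
print(code)  # ABCD-1234

# The character class [ABCD]+ also accepts other orderings of those letters.
loose = re.findall(r"([ABCD]+[-][0-9]+)", "x-BAD-99")
print(loose)  # ['BAD-99']
```

re.search stops at the first match and returns None when nothing matches, so the result should be checked before calling group().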

You can do that with slicing, if the positions are fixed:
txt = "sampleapp-ABCD-1234-us-eg-123456789"
abcd = txt[10:14]
digits = txt[15:19]
print(abcd)
print(digits)
You can also split the text using txt.split("-") and then extract what you want:
abcd = txt.split("-")[1]
digits = txt.split("-")[2]

Please treat this post as a request for clarification if it doesn't answer your question.
If what you are saying is that the string is of the form a-B-X0-c-d-X1
and that you want to extract B-X0 from it, then you can do the following:
text = 'a-B-X0-c-d-X1'
extracted_val = '-'.join(text.split('-')[1:3])
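
The split-and-join idea used in several answers above can be wrapped in a small helper. This is only a sketch: the function name extract_fields and its parameters are hypothetical, and it assumes the wanted fields are identified by position.

```python
def extract_fields(text, start, stop, sep="-"):
    """Return fields start..stop-1 of text, re-joined with sep."""
    parts = text.split(sep)
    if len(parts) < stop:
        # Fail loudly instead of silently returning a shorter result.
        raise ValueError(f"expected at least {stop} fields, got {len(parts)}")
    return sep.join(parts[start:stop])


result = extract_fields("sampleapp-ABCD-1234-us-eg-123456789", 1, 3)
print(result)  # ABCD-1234
```

The length check is a design choice: a plain slice never raises, so malformed input would otherwise pass through unnoticed.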

Related

Remove dot following specific text

I'm trying to remove the dot (.) before specific words like com and org, for text cleaning in Python, e.g.:
Input: cnnindonesia.com liputan.org
Output: cnnindonesiacom liputanorg
Does anybody have an idea using regex or iteration? Thank you.

You can use .replace() and a list comprehension; regular expressions aren't necessary here:
data = ["cnnindonesia.com", "liputan.org"]
print([url.replace(".com", "com").replace(".org", "org") for url in data])

Try this (note that it removes every dot in the string, not only those before com and org):
input = "cnnindonesia.com liputan.org"
output = input.replace(".", "")
print(output)
Output
cnnindonesiacom liputanorg

You can split on the '.' and then join it.
input = "cnnindonesia.com liputan.org"
output = input.split(".")
output = ("").join(output)

If you have multiple patterns, re would be useful:
import re
s = "cnnindonesia.com liputan.org example.net twitch.tv"
output = re.sub(r"\.(com|org|net|tv)", r"\1", s)
print(output) # cnnindonesiacom liputanorg examplenet twitchtv
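
One caveat with the pattern above: \.(com|org|net|tv) also matches when the suffix is only the start of a longer word, such as .community. The sketch below adds a word boundary (\b) to avoid that; the extra test word is made up for illustration.

```python
import re

s = "cnnindonesia.com liputan.org example.community"

# \b requires the suffix to end at a word boundary, so ".community" is left alone.
cleaned = re.sub(r"\.(com|org|net|tv)\b", r"\1", s)
print(cleaned)  # cnnindonesiacom liputanorg example.community

# Without the boundary, the dot in ".community" is removed as well.
greedy = re.sub(r"\.(com|org|net|tv)", r"\1", s)
print(greedy)  # cnnindonesiacom liputanorg examplecommunity
```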

How to split or cut the string in python

I am trying to split a string with the following Python code:
import os
f = "Retirement-User-Portfolio-DEV-2020-7-29.xml"
to_output = os.path.splitext(f)[0]
print(to_output)
I received this output:
Retirement-User-Portfolio-DEV-2020-7-29
However, I want the output below, with "-DEV-2020-7-29" removed from the string:
Retirement-User-Portfolio

You can use split() and join() to split on the kth occurrence of a character.
f = "Retirement-User-Portfolio-DEV-2020-7-29.xml"
to_output = '-'.join(f.split('-')[0:3])
You should explain your question in more detail, including the pattern you are trying to match: is the cut always at the third occurrence of the character? Other solutions (e.g., regex) may be more appropriate.

Try this code:
f = "Retirement-User-Portfolio-DEV-2020-7-29.xml"
a = f.split('-')
print('-'.join(a[:3]))
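
If the name part can contain a varying number of hyphens, counting from the right may be more robust than taking the first three fields. This sketch assumes the unwanted suffix always has exactly four hyphen-separated parts (environment, year, month, day), which the question suggests but does not state; the function name strip_env_date is hypothetical.

```python
import os


def strip_env_date(filename):
    """Drop the extension and the trailing -ENV-YYYY-M-D suffix (assumed layout)."""
    stem = os.path.splitext(filename)[0]
    # rsplit with maxsplit=4 cuts at the last four hyphens only.
    return stem.rsplit("-", 4)[0]


base = strip_env_date("Retirement-User-Portfolio-DEV-2020-7-29.xml")
print(base)  # Retirement-User-Portfolio
```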

regex to extract data between quotes

As the title says, the string is '="24digit number"' and I want to extract the number between the quotes (example: ="000021484123647598423458" should get me '000021484123647598423458').
There are answers that explain how to get data between quotes, but in my case I also need to confirm that =" exists, without capturing it (there are also other "\d{24}" strings, but they are for other stuff).
I couldn't modify these answers to get what I need.
My latest regex was ((?<=\")\d{24}(?=\")) and string is ="000021484123647598423458".
UPDATE: I think I will settle with pattern r'^(?:\=\")(\d{24})(?:\")' because I just want to capture digit characters.
import re
word = '="000021484123647598423458"'
pattern = r'^(?:\=\")(\d{24})(?:\")'
match = re.findall(pattern, word)[0]
Thank you all for suggestions.

You could have it like:
=(['"])(\d{24})\1
See a demo on regex101.com.
In Python:
import re
string = '="000021484123647598423458"'
rx = re.compile(r'''=(['"])(\d{24})\1''')
print(rx.search(string).group(2))
# 000021484123647598423458

Any one of the following works:
>>> st = '="000021484123647598423458"'
>>> import re
>>> re.findall(r'".*\d+.*"',st)
['"000021484123647598423458"']
or
>>> re.findall(r'".*\d{24}.*"',st)
['"000021484123647598423458"']
or
>>> re.findall(r'"\d{24}"',st)
['"000021484123647598423458"']
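
The three patterns above return the value with its quotes and do not check for the leading =, which the question asks for. A minimal sketch using a lookbehind and a lookahead follows; the sample text with a second quoted number is made up for illustration.

```python
import re

# Hypothetical input: one quoted number without '=' and one with it.
text = 'id "111111111111111111111111" ref="000021484123647598423458"'

# (?<==") requires '="' right before the digits; (?=") requires a closing quote.
# Neither assertion is part of the match, so only the digits are returned.
numbers = re.findall(r'(?<==")\d{24}(?=")', text)
print(numbers)  # ['000021484123647598423458']
```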

Python - Extract text from string

What are the most efficient ways to extract text from a string? Are there some available functions or regex expressions, or some other way?
For example, my string is below and I want to extract the IDs as well
as the ScreenNames, separately.
[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]
Thank you!
Edit: These are the text strings that I want to pull. I want them to be in a list.
Target_IDs = 1234567890, 233323490, 4459284
Target_ScreenNames = RandomNameHere, AnotherRandomName, YetAnotherName

import re
s = '[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]'
# re.findall returns the captured group for every match
print('Target IDs = ' + ','.join(re.findall(r'ID=(\d+)', s)))
print('Target ScreenNames = ' + ','.join(re.findall(r' ScreenName=(\w+)', s)))
Output :
Target IDs = 1234567890,233323490,4459284
Target ScreenNames = RandomNameHere,AnotherRandomName,YetAnotherName

It depends. Assuming that all your text comes in the form of
TagName = TagValue1, TagValue2, ...
You need just two calls to split.
tag, value_string = string.split('=')
values = value_string.split(',')
Remove the excess whitespace (a strip() call on each value will probably suffice) and you are done. Or you can use a regex. Regexes are slightly more powerful, but in this case I think it's a matter of personal taste.
If you want more complex syntax with nonterminals, terminals and all that, you'll need lex/yacc, which will require some background in parsers. A rather interesting thing to play with, but not something you'll want to use for storing program options and such.

The regex I'd use would be:
(?:ID=|ScreenName=)+(\d+|[\w\d]+)
However, this assumes that ID is only digits (\d) and usernames are only letters or numbers ([\w\d]).
This regex (when combined with re.findall) would return a list of matches that could be iterated through and sorted in some fashion like so:
import re
s = "[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]"
pattern = re.compile(r'(?:ID=|ScreenName=)+(\d+|[\w\d]+)')
ids = []
names = []
for p in pattern.findall(s):
    # All-digit matches are treated as IDs, everything else as a screen name
    if p.isnumeric():
        ids.append(p)
    else:
        names.append(p)
print(ids, names)
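
An alternative that keeps each ID paired with its screen name is to capture both in one pattern. This is a sketch assuming every entry has exactly the form ID=<digits>, ScreenName=<word characters>, as in the sample string.

```python
import re

s = ("[User(ID=1234567890, ScreenName=RandomNameHere), "
     "User(ID=233323490, ScreenName=AnotherRandomName), "
     "User(ID=4459284, ScreenName=YetAnotherName)]")

# With two groups, findall returns a list of (id, name) tuples.
pairs = re.findall(r"ID=(\d+), ScreenName=(\w+)", s)
ids = [user_id for user_id, _ in pairs]
names = [name for _, name in pairs]
print(ids)    # ['1234567890', '233323490', '4459284']
print(names)  # ['RandomNameHere', 'AnotherRandomName', 'YetAnotherName']
```

Because the pairing comes from the pattern itself, a purely numeric screen name would not be mistaken for an ID.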

Regex issue in python

I have a string x = "value=4020a345-f646-4984-a848-3f7f5cb51f21" and the following code:
if re.search( "value=\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*", x ):
    x = re.search( "value=\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*", x )
    m = x.group(1)
m only gives me 4020a345; I'm not sure why it does not give me the entire "4020a345-f646-4984-a848-3f7f5cb51f21".
Can anyone tell me what I am doing wrong?

Try out this regex; it looks like you are trying to match a GUID:
value=[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}

This should match what you want, if all the strings are of the form you've shown:
value=((\w*\d*\-?)*)
You can also use this website to validate your regular expressions:
http://regex101.com/

The below regex works as you expect.
value=([\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*]+)

You are trying to match hex numbers, which is why this regex is more correct than using [\w\d]:
import re
pattern = "value=([0-9a-fA-F]{8}-([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12})"
data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
res = re.search(pattern, data)
print(res.group(1))
If you don't care about validation, i.e. checking that the value is correct hex, there is no reason not to use simple string manipulation as shown below ("value=" is 6 characters long).
>>> data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
>>> print(data[6:])
4020a345-f646-4984-a848-3f7f5cb51f21
>>> # or maybe
...
>>> print(data[6:].replace('-',''))
4020a345f6464984a8483f7f5cb51f21
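
If the value is expected to be a GUID/UUID, the standard library can do the validation instead of a hand-written pattern. This is a sketch under that assumption; note that uuid.UUID is lenient about formatting (it also accepts, for example, a 32-digit hex string without hyphens), so it checks the value rather than the exact layout.

```python
import uuid

data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"

# partition splits at the first '=' without needing a hard-coded index.
key, _, raw = data.partition("=")

try:
    parsed = uuid.UUID(raw)
except ValueError:
    parsed = None  # not a valid UUID

print(key)         # value
print(parsed)      # 4020a345-f646-4984-a848-3f7f5cb51f21
print(parsed.hex)  # 4020a345f6464984a8483f7f5cb51f21
```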

You can get the subparts of the value as a list:
import re
txt = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
parts = re.findall(r'\w+', txt)[1:]
parts is ['4020a345', 'f646', '4984', 'a848', '3f7f5cb51f21']
If you really want the entire string:
full = "-".join(parts)
A simpler way:
full = re.findall(r"[\w-]+", txt)[-1]
full is 4020a345-f646-4984-a848-3f7f5cb51f21

value=([\w\d]*\-[\w\d]*\-[\w\d]*\-[\w\d]*\-[\w\d]*)
Try this and grab the capture group. Your regex was not returning the whole value because you used the | operator: if the alternative on the left side of | matches, the regex engine does not try the later alternatives.
See the demo:
http://regex101.com/r/hQ1rP0/45
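
The effect of the | operator can be checked directly. The sketch below uses a shortened version of the question's pattern; the corrected pattern is one possible fix (hyphen-separated runs of word characters), not the only one.

```python
import re

x = "value=4020a345-f646-4984-a848-3f7f5cb51f21"

# '|' has the lowest precedence: this reads as  (value=\w*)  OR  (\d*\-\w*).
# The left alternative matches at position 0 and \w* stops at the first '-'.
first_alternative = re.search(r"value=\w*|\d*\-\w*", x).group(0)
print(first_alternative)  # value=4020a345

# One capture group covering hyphen-separated runs of word characters.
whole_value = re.search(r"value=((?:\w+-)*\w+)", x).group(1)
print(whole_value)  # 4020a345-f646-4984-a848-3f7f5cb51f21
```

Note that the question's pattern contains no parentheses at all, so it defines no group 1; the full match is available as group(0).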
