Remove dot following specific text - python

I'm trying to remove the dot (.) in front of specific words like com and org for text cleaning using Python, e.g.
Input: cnnindonesia.com liputan.org
Output: cnnindonesiacom liputanorg
Does anybody have an idea how to do this using regex or iteration? Thank you.

You can use .replace() and a list comprehension; regular expressions aren't necessary here:
data = ["cnnindonesia.com", "liputan.org"]
print([url.replace(".com", "com").replace(".org", "org") for url in data])

Try this. Note that it removes every dot in the string, not only the ones before com or org:
text = "cnnindonesia.com liputan.org"
output = text.replace(".", "")
print(output)
Output
cnnindonesiacom liputanorg

You can split on the '.' and then join the pieces back together.
text = "cnnindonesia.com liputan.org"
parts = text.split(".")
output = "".join(parts)

If you have multiple patterns, re would be useful:
import re
s = "cnnindonesia.com liputan.org example.net twitch.tv"
output = re.sub(r"\.(com|org|net|tv)", r"\1", s)
print(output) # cnnindonesiacom liputanorg examplenet twitchtv
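One caveat with the pattern above is that it would also rewrite ".community" to "community", since ".com" matches inside it. A minimal sketch of one way around that, assuming the suffixes should only match as whole words (the SUFFIXES tuple and the strip_suffix_dots name are made up for illustration):

```python
import re

# Hypothetical list of suffixes to clean; adjust to your data.
SUFFIXES = ("com", "org", "net", "tv")

# \b after the group stops the pattern from matching inside longer
# words such as ".community".
PATTERN = re.compile(r"\.(" + "|".join(map(re.escape, SUFFIXES)) + r")\b")

def strip_suffix_dots(text):
    """Remove the dot in front of each listed suffix, keeping the suffix."""
    return PATTERN.sub(r"\1", text)

print(strip_suffix_dots("cnnindonesia.com liputan.org my.community v1.2"))
# cnnindonesiacom liputanorg my.community v1.2
```

Building the alternation from a tuple keeps the list of suffixes in one place if it grows.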

Related

How to extract a certain text from a string using Python

Consider I have a string in the format
sampleapp-ABCD-1234-us-eg-123456789. I need to extract the text ABCD-1234. It's more like I need ABCD and then the numbers before the -.
Please let me know how I can do that.
You could use string.split(), so it would be:
string = 'sampleapp-ABCD-1234-us-eg-123456789'
example = string.split('-')
Then you can access 'ABCD' and '1234' as example[1] and example[2] respectively. You can also join them back together into one string if need be with str.join().
string = 'sampleapp-ABCD-1234-us-eg-123456789'
example = string.split('-')
newstring = ' '.join(example[1:3])
print(newstring)
You can also change the separator: '-'.join would make the output 'ABCD-1234' rather than 'ABCD 1234'.
You can use Regex (Regular expression)
Here's the Python script you can use:
import re
txt = "sampleapp-ABCD-1234-us-eg-123456789"
x = re.findall("([ABCD]+[-][0-9]+)", txt)
print(x)
A more general version:
x = re.findall("([A-Z]{4}[-][0-9]+)", txt)
To learn more about regex, see regexr.com.
Hope this helps. Cheers!
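A note on the first pattern above: [ABCD]+ is a character class, so it accepts any run of the letters A to D in any order, not the literal text ABCD. A small sketch of that, plus a search that copes with no match, assuming the wanted part always looks like uppercase letters, a hyphen, then digits, with a hyphen on each side (that layout is an assumption, not something the question states):

```python
import re

txt = "sampleapp-ABCD-1234-us-eg-123456789"

# [ABCD]+ accepts any mix of the letters A-D, so "DCBA" matches too.
assert re.findall(r"[ABCD]+-[0-9]+", "x-DCBA-99-y") == ["DCBA-99"]

# Assumed layout: "<uppercase letters>-<digits>" between two hyphens.
match = re.search(r"-([A-Z]+-\d+)-", txt)

# re.search returns None when nothing matches, so guard before .group().
result = match.group(1) if match else None
print(result)  # ABCD-1234
```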
You can do it like this:
txt = "sampleapp-ABCD-1234-us-eg-123456789"
abcd = txt[10:14]
digits = txt[15:19]
print(abcd)
print(digits)
You can also split the text using txt.split("-") and then extract what you want:
abcd = txt.split("-")[1]
digits = txt.split("-")[2]
Please keep this post as an enquiry if it doesn't answer your question.
If what you are saying is that the string is of the form a-B-X0-c-d-X1
and that you want to extract B-X0 from the string then you can do the following:
text = 'a-B-X0-c-d-X1'
extracted_val = '-'.join(text.split('-')[1:3])

How to split or cut the string in python

I am trying to split a string with the following Python code:
import os
f = "Retirement-User-Portfolio-DEV-2020-7-29.xml"
to_output = os.path.splitext(f)[0]
print(to_output)
I received this output:
Retirement-User-Portfolio-DEV-2020-7-29
However, I want the output below, with "-DEV-2020-7-29" removed from the string:
Retirement-User-Portfolio
You can use split() and join() to split on the kth occurrence of a character.
f = "Retirement-User-Portfolio-DEV-2020-7-29.xml"
to_output = '-'.join(f.split('-')[0:3])
You should explain your question in more detail, in particular the pattern you are trying to match: is the cut always at the third hyphen? Other solutions (e.g., regex) may be more appropriate.
Try this code:
f = "Retirement-User-Portfolio-DEV-2020-7-29.xml"
a = f.split('-')
print('-'.join(a[:3]))
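If the number of hyphenated parts before "-DEV-..." can vary, counting hyphens from the left will not work. Here is a rough regex sketch that strips the suffix from the right instead. It assumes the suffix always has the shape -ENV-YYYY-M-D with an uppercase environment name; the question does not confirm this, and the function name is made up for illustration:

```python
import os
import re

def strip_env_and_date(filename):
    """Drop the extension and a trailing -ENV-YYYY-M-D suffix, if present."""
    stem = os.path.splitext(filename)[0]
    # Assumed suffix shape: -<UPPERCASE ENV>-<4-digit year>-<month>-<day>
    return re.sub(r"-[A-Z]+-\d{4}-\d{1,2}-\d{1,2}$", "", stem)

print(strip_env_and_date("Retirement-User-Portfolio-DEV-2020-7-29.xml"))
# Retirement-User-Portfolio
```

A name that does not end in that suffix comes back with only the extension removed.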

Getting word from string

How can I get the word example from a string like this:
str = "http://test-example:123/wd/hub"
I wrote something like this:
print(str[10:str.rfind(':')])
but it doesn't work correctly if the string looks like
"http://tests-example:123/wd/hub"
You can use this regex to capture the value preceded by - and followed by : using lookarounds
(?<=-).+(?=:)
Python code,
import re
str = "http://test-example:123/wd/hub"
print(re.search(r'(?<=-).+(?=:)', str).group())
Outputs,
example
Non-regex way to get the same is using these two splits,
str = "http://test-example:123/wd/hub"
print(str.split(':')[1].split('-')[1])
Prints,
example
You can use the following non-regex approach, because you know example is a 7-letter word:
s.split('-')[1][:7]
For any arbitrary word, that would change to:
s.split('-')[1].split(':')[0]
There are many ways.
Using splitting:
example_str = str.split('-')[-1].split(':')[0]
This is fragile, and could break if there are more hyphens or colons in the string.
using regex:
import re
pattern = re.compile(r'-(.*):')
example_str = pattern.search(str).group(1)
This still expects a particular format, but is more easily adaptable (if you know how to write regexes).
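Since the input is a URL, another option is to let the standard library pull out the host name first, so colons and hyphens elsewhere in the URL cannot interfere. A minimal sketch using urllib.parse.urlsplit, assuming the wanted word is everything after the first hyphen in the host (the function name is made up for illustration):

```python
from urllib.parse import urlsplit

def word_after_hyphen(url):
    """Return the part of the host name after its first hyphen, or None."""
    host = urlsplit(url).hostname  # "test-example" for the URL in the question
    if host is None or "-" not in host:
        return None
    # maxsplit=1 keeps any later hyphens inside the returned word.
    return host.split("-", 1)[1]

print(word_after_hyphen("http://test-example:123/wd/hub"))   # example
print(word_after_hyphen("http://tests-example:123/wd/hub"))  # example
```

Note that hostname is returned in lower case, which may matter if the original casing is needed.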
I am not sure why you want to get a particular word from a string. I guess you want to check whether the word is present in the given string.
If that is the case, the code below can be used.
import re
str1 = "http://tests-example:123/wd/hub"
matched = re.findall('example',str1)
Split on the -, and then on :
s = "http://test-example:123/wd/hub"
print(s.split('-')[1].split(':')[0])
#example
using re
import re
text = "http://test-example:123/wd/hub"
m = re.search('(?<=-).+(?=:)', text)
if m:
    print(m.group())
Python strings have a built-in find method:
a="http://test-example:123/wd/hub"
b="http://test-exaaaample:123/wd/hub"
print(a.find('example'))
print(b.find('example'))
will print:
12
-1
It is the index of the found substring. If it equals -1, the substring was not found in the string. You can also use the in keyword:
'example' in 'http://test-example:123/wd/hub'
True

regex to extract data between quotes

As title says string is '="24digit number"' and I want to extract number between "" (example: ="000021484123647598423458" should get me '000021484123647598423458').
There are answers that explain how to get data between quotes, but in my case I also need to confirm that =" exists, without capturing it (there are also other "\d{24}" strings, but they are for other stuff).
I couldn't modify these answers to get what I need.
My latest regex was ((?<=\")\d{24}(?=\")) and the string is ="000021484123647598423458".
UPDATE: I think I will settle with pattern r'^(?:\=\")(\d{24})(?:\")' because I just want to capture digit characters.
import re
word = '="000021484123647598423458"'
pattern = r'^(?:\=\")(\d{24})(?:\")'
match = re.findall(pattern, word)[0]
Thank you all for suggestions.
You could have it like:
=(['"])(\d{24})\1
See a demo on regex101.com.
In Python:
import re
string = '="000021484123647598423458"'
rx = re.compile(r'''=(['"])(\d{24})\1''')
print(rx.search(string).group(2))
# 000021484123647598423458
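Any one of the following matches the number, though the result includes the surrounding quotes and none of them checks for the leading =: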
>>> st = '="000021484123647598423458"'
>>> import re
>>> re.findall(r'".*\d+.*"',st)
['"000021484123647598423458"']
or
>>> re.findall(r'".*\d{24}.*"',st)
['"000021484123647598423458"']
or
>>> re.findall(r'"\d{24}"',st)
['"000021484123647598423458"']
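To get only the digits while still requiring the =" in front, a lookbehind can check for it without making it part of the match. This is a minimal sketch, assuming the value is always exactly 24 digits directly between =" and a closing quote; the sample text and the QUOTED_ID name are made up for illustration:

```python
import re

# (?<==") requires =" right before the digits; (?=") requires a closing quote.
# Neither lookaround becomes part of the match.
QUOTED_ID = re.compile(r'(?<==")\d{24}(?=")')

# Made-up sample: one value preceded by =" and one plain quoted 24-digit string.
text = 'id="000021484123647598423458" other "' + "1" * 24 + '"'

found = QUOTED_ID.findall(text)
print(found)  # ['000021484123647598423458']
```

The second quoted number is skipped because it is not preceded by =.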

Regex: Replace one pattern with another

I am trying to replace one regex pattern with another regex pattern.
st_srt = 'Awake.01x02.iNTERNAL.WEBRiP.XViD-GeT.srt'
st_mkv = 'Awake.S01E02.iNTERNAL.WEBRiP.XViD-GeT.mkv'
pattern = re.compile('\d+x\d+') # for st_srt
re.sub(pattern, 'S\1E\2',st_srt)
I know the use of S\1E\2 is wrong here. The reason I am using \1 and \2 is to catch the values 01 and 02 and use them in S\1E\2.
My desired output is:
st_srt = 'Awake.S01E02.iNTERNAL.WEBRiP.XViD-GeT.srt'
So, what is the correct way to achieve this?
You need to capture what you're trying to preserve. Try this:
pattern = re.compile(r'(\d+)x(\d+)') # for st_srt
st_srt = re.sub(pattern, r'S\1E\2', st_srt)
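Two things changed compared with the question's code: the parentheses create the groups, and the r prefix keeps \1 and \2 from being read as string escapes. A small sketch of both points, with named groups as an optional variation (the group names season and episode are my own choice):

```python
import re

st_srt = 'Awake.01x02.iNTERNAL.WEBRiP.XViD-GeT.srt'

# Without the r prefix, '\1' is the control character chr(1),
# not a reference to group 1.
assert 'S\1E\2' == 'S\x01E\x02'

# Named groups make the replacement template easier to read.
pattern = re.compile(r'(?P<season>\d+)x(?P<episode>\d+)')
renamed = pattern.sub(r'S\g<season>E\g<episode>', st_srt)
print(renamed)  # Awake.S01E02.iNTERNAL.WEBRiP.XViD-GeT.srt
```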
Well, it looks like you already accepted an answer, but I think this is what you said you're trying to do, which is get the replace string from 'st_mkv', then use it in 'st_srt':
import re
st_srt = 'Awake.01x02.iNTERNAL.WEBRiP.XViD-GeT.srt'
st_mkv = 'Awake.S01E02.iNTERNAL.WEBRiP.XViD-GeT.mkv'
replace_pattern = re.compile(r'Awake\.([^.]+)\.')
m = replace_pattern.match(st_mkv)
replace_string = m.group(1)
new_srt = re.sub(r'^Awake\.[^.]+\.', 'Awake.{0}.'.format(replace_string), st_srt)
print(new_srt)
Try using this regex:
([\w+\.]+){5}\-\w+
Copy the strings into http://www.gskinner.com/RegExr/
and paste the regex at the top.
It captures the names of each string, leaving out the extension.
You can then go ahead and append the extension you want, to the string you want.
EDIT:
Here's what I used to do what you're after:
import re
st_srt = 'Awake.01x02.iNTERNAL.WEBRiP.XViD-GeT.srt'  # don't actually need this one
st_mkv = 'Awake.S01E02.iNTERNAL.WEBRiP.XViD-GeT.mkv'
replace_pattern = re.compile(r'([\w+\.]+){5}\-\w+')
m = replace_pattern.match(st_mkv)
new_string = m.group(0)
new_string += '.srt'
>>> new_string
'Awake.S01E02.iNTERNAL.WEBRiP.XViD-GeT.srt'
import re
st_srt = 'Awake.01x02.iNTERNAL.WEBRiP.XViD-GeT.srt'
st_mkv = 'Awake.S01E02.iNTERNAL.WEBRiP.XViD-GeT.mkv'
pattern = re.compile(r'(\d+)x(\d+)')
st_srt_new = re.sub(pattern, r'S\1E\2', st_srt)
print(st_srt_new)
