Remove Prefixes From a String - python

What's a cute way to do this in python?
Say we have a list of strings:
clean_be
clean_be_al
clean_fish_po
clean_po
and we want the output to be:
be
be_al
fish_po
po

Another approach which will work for all scenarios:
import re

data = ['clean_be',
        'clean_be_al',
        'clean_fish_po',
        'clean_po', 'clean_a', 'clean_clean', 'clean_clean_1']

for item in data:
    item = re.sub('^clean_', '', item)
    print(item)
Output:
be
be_al
fish_po
po
a
clean
clean_1

Here is a possible solution that works with any prefix:
prefix = 'clean_'
result = [s[len(prefix):] if s.startswith(prefix) else s for s in lst]
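For example, applied to the sample list from the question (bound here to the lst name the comprehension assumes):
lst = ['clean_be', 'clean_be_al', 'clean_fish_po', 'clean_po']
prefix = 'clean_'
result = [s[len(prefix):] if s.startswith(prefix) else s for s in lst]
print(result)  # ['be', 'be_al', 'fish_po', 'po']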

You've only provided minimal information about what you're trying to achieve, but the desired output for the four given inputs can be produced with the following function:
def func(string):
    return "_".join(string.split("_")[1:])
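For instance, applied to the sample strings it produces the desired output:
print([func(s) for s in ['clean_be', 'clean_be_al', 'clean_fish_po', 'clean_po']])
# ['be', 'be_al', 'fish_po', 'po']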

You can do this:
strlist = ['clean_be', 'clean_be_al', 'clean_fish_po', 'clean_po']

def func(myList: list, start: str):
    ret = []
    for element in myList:
        # Note: lstrip removes any leading characters found in `start`,
        # not the literal prefix, so this works for these inputs but is
        # fragile (see the lstrip drawback described in another answer).
        ret.append(element.lstrip(start))
    return ret

print(func(strlist, 'clean_'))
I hope it was useful, Nohab

There are many ways to do this based on what you have provided.
Apart from the above answers, you can do in this way too:
string = 'clean_be_al'
string = string.replace('clean_','',1)
This would remove the first occurrence of clean_ in the string.
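For example, the count of 1 means only the leading occurrence is removed, even when the prefix text appears again later in the string (using 'clean_clean_1' from the first answer):
print('clean_clean_1'.replace('clean_', '', 1))  # clean_1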
Also, if the first word is guaranteed to be 'clean', you can try it this way too:
string = 'clean_be_al'
print(string[6:])  # be_al

You can use lstrip to remove a prefix and rstrip to remove a suffix
line = "clean_be"
print(line.lstrip("clean_"))
Drawback:
lstrip([chars])
The [chars] argument is not a prefix; rather, all combinations of its values are stripped.
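A quick demonstration of that drawback, using a made-up string where the stripped character set eats into the rest of the value:
line = "clean_nemo"
print(line.lstrip("clean_"))  # prints 'mo', not 'nemo': the leading 'n' and 'e' are stripped as well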

Related

python string split slice and into a list

I have a string for example "streemlocalbbv"
and I have my_function that takes this string and a string that I want to find ("loc") in the original string. And what I want to get returned is this;
my_function("streemlocalbbv", "loc")
output = ["streem","loc","albbv"]
what I did so far is
def find_split(string, find_word):
    length = len(string)
    find_word_start_index = string.find(find_word)
    find_word_end_index = find_word_start_index + len(find_word)
    string[find_word_start_index:find_word_end_index]
    a = string[0:find_word_start_index]
    b = string[find_word_start_index:find_word_end_index]
    c = string[find_word_end_index:length]
    return [a, b, c]
Trying to find the index of the string I am looking for in the original string, and then split the original string. But from here I am not sure how I should do it.
You can use str.partition which does exactly what you want:
>>> "streemlocalbbv".partition("loc")
('streem', 'loc', 'albbv')
Use the split function:
def find_split(string, find_word):
    ends = string.split(find_word)
    return [ends[0], find_word, ends[1]]
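For the example from the question this returns the expected pieces:
print(find_split("streemlocalbbv", "loc"))  # ['streem', 'loc', 'albbv']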
Use the split, index and insert functions to solve this:
def my_function(word, split_by):
    l = word.split(split_by)
    l.insert(l.index(word[:word.find(split_by)]) + 1, split_by)
    return l

print(my_function("streemlocalbbv", "loc"))
# ['streem', 'loc', 'albbv']

Function call to convert a list of alpha characters to numeric

I am trying a manual implementation of the Soundex Algorithm and this requires converting alpha text characters to numeric text characters. I have defined the following function:
import re

def sub_pattern(text):
    sub = [str(i) for i in range(1, 4)]
    string = text
    abc = re.compile('[abc]')
    xyz = re.compile('[xyz]')
    encode = [abc, xyz]
    encode_iter = iter(encode)
    alpha_search = re.compile('[a-zA-Z]')
    for i in sub:
        if alpha_search.search(string):
            pattern = next(encode_iter)
            string = pattern.sub(i, string)
        else:
            return string
This function will encode abc characters to 1 and xyz characters to 2. However, it only works for a single string and I need to pass a list of strings to the function. I've gotten the results I want using:
list(map(sub_pattern, ['aab', 'axy', 'bzz']))
But I want to be able to pass the list to the function directly. I've tried this with no success, as it ends up only returning the first string from the list.
def sub_pattern(text_list):
    all_encoded = []
    sub = [str(i) for i in range(1, 4)]
    abc = re.compile('[abc]')
    xyz = re.compile('[xyz]')
    encode = [abc, xyz]
    encode_iter = iter(encode)
    alpha_search = re.compile('[a-zA-Z]')
    for string in text_list:
        for i in sub:
            if alpha_search.search(string):
                pattern = next(encode_iter)
                string = pattern.sub(i, string)
            else:
                all_encoded.append(string)
A couple of things to note:
Because I am implementing the Soundex Algorithm, the order of the text when I encode it matters. I would prefer to update the string character at its original index to avoid having to reorganize it afterwards. In other words, you can't do any sorting of the string. I've created the iterator to incrementally update the string, and it only grabs the next regex pattern if all the characters have not already been converted.
This function will be part of two custom classes that I am creating. Both will call the __iter__ method so that I can create the iterable. That's why I use the iter() function to create an iterable: it will create a new instance of the iterator automatically.
I know this may seem like a trivial issue relative to what I'm doing, but I'm stuck.
Thank you in advance.
How about using your own function recursively? You get to keep the original exactly as it is, in case you needed it:
import re

def sub_pattern(text):
    if isinstance(text, str):
        sub = [str(i) for i in range(1, 4)]
        string = text
        abc = re.compile('[abc]')
        xyz = re.compile('[xyz]')
        encode = [abc, xyz]
        encode_iter = iter(encode)
        alpha_search = re.compile('[a-zA-Z]')
        for i in sub:
            if alpha_search.search(string):
                pattern = next(encode_iter)
                string = pattern.sub(i, string)
            else:
                return string
    else:
        return [sub_pattern(t) for t in text]

print(list(map(sub_pattern, ['aab', 'axy', 'bzz'])))  # old version still works
print(sub_pattern(['aab', 'axy', 'bzz']))  # new version yields the same result
In case a reader doesn't know what "recursively" means: it is calling a function from within itself.
It is allowed because each function call creates its own scope.
It can be useful when you can solve a problem by performing a simple operation multiple times, or can't predict in advance how many times you need to perform it to reach your solution, e.g. when you need to unpack nested structures.
It is defined by choosing a base case (the solution) and calling the function in all other cases until you reach the base case.
I assume the issue with your example was that once you had traversed the iterator, you ran into StopIteration for the next string.
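A minimal sketch of that failure mode, independent of the regexes:
encode_iter = iter(['abc-pattern', 'xyz-pattern'])
next(encode_iter)
next(encode_iter)
next(encode_iter)  # raises StopIteration: the iterator is exhausted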
I'm not sure this is what you want, but I would create a new iterator for each string, since you have to be able to traverse over all of it for every new item. I tweaked some variable names that may cause confusion, too (string and sub). See comments for changes:
def sub_pattern(text_list):
    all_encoded = []
    digits = [str(i) for i in range(1, 4)]
    abc = re.compile('[abc]')
    xyz = re.compile('[xyz]')
    encode = [abc, xyz]
    alpha_search = re.compile('[a-zA-Z]')
    for item in text_list:
        # Create a new iterator for each string.
        encode_iter = iter(encode)
        for i in digits:
            if alpha_search.search(item):
                pattern = next(encode_iter)
                item = pattern.sub(i, item)
            else:
                all_encoded.append(item)
                # You likely want appending to end once no more letters can be found.
                break
    # Return the encoded texts.
    return all_encoded
Test:
print(sub_pattern(['aab', 'axy', 'bzz'])) # Output: ['111', '122', '122']

Remove unwanted substring from a list of strings at specified indexes

New to python and I want to remove the prefix of two strings: removing everything before the J and also removing the .json.
I tried using [:1] but it removes the entire first string
name = ['190523-105238-J105150.json',
'190152-105568-J616293.json']
I want to output this
name = ['J105150',
'J616293']
You can use split() in a list-comprehension:
name = ['190523-105238-J105150.json',
'190152-105568-J616293.json']
print([x.rsplit('-', 1)[1].split('.')[0] for x in name])
# ['J105150', 'J616293']
You could use the find() function and string slicing.
name = ['190523-105238-J105150.json', '190152-105568-J616293.json']
for i in range(len(name)):
    start_of_json = name[i].find('.json')
    start_of_name = name[i].find('J')
    name[i] = name[i][start_of_name:start_of_json]
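After the loop, name holds the trimmed strings:
print(name)  # ['J105150', 'J616293']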
Doing [:1] will slice your current list to take only the elements that are before index 1, so only the element at index 0 will be present.
This is not what you want.
A regex can help you reach your goal:
import re
output = [re.search(r'-(\w+)\.json', x).group(1) for x in your_list]
# output == ['J105150', 'J616293']
Firstly, it is not a data frame; it is a list.
You could use something as simple as the line below for this, assuming you have a static structure.
name = [x[x.index("J"):x.index(".")] for x in name]
Here are two possible approaches:
One is more verbose. The other does essentially the same thing but condenses it into a one-liner, if you will.
Approach 1:
In approach 1, we create an empty list to store the results temporarily.
From there we parse each item of name and .split() each item on the hyphens.
For each item, this will yield a list composed of three elements: ['190523', '105238', 'J105150.json'] for example.
We use the index [-1] to select just the last element and then .replace() the text .json with the empty string '' effectively removing the .json.
We then append the item to the new_names list.
Lastly, we overwrite the variable label name, so that it points at the new list we generated.
name = ['190523-105238-J105150.json', '190152-105568-J616293.json']
new_names = []
for item in name:
    item = item.split('-')[-1]
    new_names.append(item.replace('.json', ''))
name = new_names
Approach 2:
name = ['190523-105238-J105150.json', '190152-105568-J616293.json']
name = [item.split('-')[-1].replace('.json', '') for item in name]
Originally the list is name = ['190523-105238-J105150.json', '190152-105568-J616293.json'].
List comprehensions in python are extremely useful and powerful.
eq = [name[i][name[i].find("J"):name[i].rfind(".json")] for i in range(len(name))] uses a list comprehension to create a new list of values from the list name by slicing each item from the "J" up to (but not including) the ".json". The result of find() is an integer index.
The complete code can be seen below.
def main():
    name = ['190523-105238-J105150.json', '190152-105568-J616293.json']
    eq = [name[i][name[i].find("J"):name[i].rfind(".json")] for i in range(len(name))]
    print(eq)

if __name__ == "__main__":
    main()
output: ['J105150', 'J616293']

Remove comma and change string to float

I want to find "money" in a file and change the string to a float. For example, I use a regular expression to find "$33,326" and would like to change it to [33326.0, "$"] (i.e., remove the comma and $ sign and change to float). I wrote the following function but it gives me an error:
import locale, re

def currencyToFloat(x):
    empty = []
    reNum = re.compile(r"""(?P<prefix>\$)?(?P<number>[.,0-9]+)(?P<suffix>\s+[a-zA-Z]+)?""")
    new = reNum.findall(x)
    for i in new:
        i[1].replace(",", "")
        float(i[1])
        empty.append(i[1])
        empty.append(i[0])
    return empty

print currencyToFloat("$33,326")
Can you help me debug my code?
money = "$33,326"
money_list = [float("".join(money[1:].split(","))), "$"]
print(money_list)
OUTPUT
[33326.0, '$']
When you do
float(i[1])
you are not modifying anything. You should store the result in some variable, like:
temp = ...
The same goes for replace(): it returns a new string. Since the comma here is just a thousands separator, remove it before casting:
temp = i[1].replace(",", "")
and then cast it to float and append it to the list:
empty.append(float(temp))
Note:
Something important you should know is that when you loop through a list, like
for i in new:
i is just a name bound to each element, so reassigning it makes no changes to the list new. To modify the list you can iterate over the indices:
for i in range(len(new)):
    new[i] = ...
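Putting those pieces together, a minimal corrected sketch of the function from the question (keeping its regex and the [value, prefix] output):
import re

def currencyToFloat(x):
    empty = []
    reNum = re.compile(r"(?P<prefix>\$)?(?P<number>[.,0-9]+)(?P<suffix>\s+[a-zA-Z]+)?")
    for prefix, number, suffix in reNum.findall(x):
        cleaned = number.replace(",", "")  # keep the new string replace() returns
        empty.append(float(cleaned))
        empty.append(prefix)
    return empty

print(currencyToFloat("$33,326"))  # [33326.0, '$']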
You can use str.translate() (the two-argument form below is Python 2 only):
>>> money = "$33,326"
>>> float(money.translate(None, ",$"))
33326.0
With Python 3 you can use str.maketrans with str.translate:
money = "$33,326"
print('money: {}'.format(float(money.translate(str.maketrans('', '', ",$")))))
Output: money: 33326.0

Regular Expression in Python

I'm trying to build a list of domain names from an Enom API call. I get back a lot of information and need to locate the domain name related lines, and then join them together.
The string that comes back from Enom looks somewhat like this:
SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1
I'd like to build a list from that which looks like this:
[domain1.com, domain2.org, domain3.co.uk, domain4.net]
To find the different domain name components I've tried the following (where "enom" is the string above) but have only been able to get the SLD and TLD matches.
re.findall("^.*(SLD|TLD).*$", enom, re.M)
Edit:
Every time I see a question asking for a regular expression solution, I have this bizarre urge to try and solve it without regular expressions. Most of the time it's more efficient than using a regex; I encourage the OP to test which of the solutions is most efficient.
Here is the naive approach:
a = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""
b = a.split("\n")
c = [x.split("=")[1] for x in b if x != 'TLDOverride=1']
for x in range(0, len(c), 2):
    print ".".join(c[x:x+2])
>> domain1.com
>> domain2.org
>> domain3.co.uk
>> domain4.net
You have a capturing group in your expression. re.findall documentation says:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
That's why only the content of the capturing group is returned.
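A quick illustration of that behaviour on a single line of the response:
>>> import re
>>> re.findall("^.*(SLD|TLD).*$", "SLD1=domain1", re.M)
['SLD']
>>> re.findall("^(?:SLD|TLD).*$", "SLD1=domain1", re.M)
['SLD1=domain1']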
try:
re.findall("^.*((?:SLD|TLD)\d*)=(.*)$", enom, re.M)
This would return a list of tuples:
[('SLD1', 'domain1'), ('TLD1', 'com'), ('SLD2', 'domain2'), ('TLD2', 'org'), ('SLD3', 'domain3'), ('TLD4', 'co.uk'), ('SLD5', 'domain4'), ('TLD5', 'net')]
Combining SLDs and TLDs is then up to you.
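For example, one way to combine them, assuming (as in the sample response) that each SLD line is immediately followed by its TLD line:
pairs = re.findall(r"^.*((?:SLD|TLD)\d*)=(.*)$", enom, re.M)
values = [value for key, value in pairs]
domains = ['.'.join(values[i:i + 2]) for i in range(0, len(values), 2)]
# ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']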
This works for your example:
>>> sld_list = re.findall("^.*SLD[0-9]*?=(.*?)$", enom, re.M)
>>> tld_list = re.findall("^.*TLD[0-9]*?=(.*?)$", enom, re.M)
>>> map(lambda x: x[0] + '.' + x[1], zip(sld_list, tld_list))
['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
I'm not sure why you are talking about regular expressions. I mean, why not just run a for loop?
A famous quote seems to be appropriate here:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
domains = []
components = []
for line in enom.split('\n'):
    k, v = line.split('=')
    if k == 'TLDOverride':
        continue
    components.append(v)
    if k.startswith('TLD'):
        domains.append('.'.join(components))
        components = []
P.S. I'm not sure what's this TLDOverride so the code just ignores it.
Here's one way:
import re
print map('.'.join, zip(*[iter(re.findall(r'^(?:S|T)LD\d+=(.*)$', text, re.M))]*2))
# ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
Just for fun, map -> filter -> map:
input = """
SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
"""
splited = map(lambda x: x.split("="), input.split())
slds = filter(lambda x: x[1][0].startswith('SLD'), enumerate(splited))
print map(lambda x: '.'.join([x[1][1], splited[x[0] + 1][1], ]), slds)
>>> ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
This appears to do what you want:
domains = re.findall('SLD\d+=(.+)', re.sub(r'\nTLD\d+=', '.', enom))
It assumes that the lines are sorted and an SLD always comes before its TLD. If that might not be the case, try this slightly more verbose code without regexes:
d = dict(x.split('=') for x in enom.strip().splitlines())
domains = [
    d[key] + '.' + d.get('T' + key[1:], '')
    for key in d if key.startswith('SLD')
]
You need to use a multiline regex for this. This is similar to this post.
data = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""
domain_seq = re.compile(r"SLD\d=(\w+)\nTLD\d=([\w.]+)", re.M)  # [\w.]+ so dotted TLDs like co.uk match in full
for item in domain_seq.finditer(data):
    domain, tld = item.group(1), item.group(2)
    print "%s.%s" % (domain, tld)
As some other answers already said, there's no need to use a regular expression here. A simple split and some filtering will do nicely:
lines = data.split("\n")  # assuming data contains your input string
# the digit check skips the TLDOverride lines
sld, tld = [[x.split("=")[1] for x in lines if x[:3] == t and x[3:4].isdigit()] for t in ("SLD", "TLD")]
result = [x + '.' + y for x, y in zip(sld, tld)]
# ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
