Extracting part of a string based on its naming convention


I'm trying to extract a piece of information about a certain file. The file name is extracted from an XML file.
The information I want is stored in the name of the file: I want to know how to extract the letters between the 2nd and 3rd period in the string.
E.g. the name is extracted from the XML and stored as a string that looks something like "aa.bb.cccc.dd.ee", and I need to find what "cccc" actually is in each of the strings I extract (~50 of them).
I've done some searching and some playing around with slicing etc., but I can't get even close.
I can't just specify the range [6:11], because the length of the string varies, as does the number of characters before the part I want to find.
UPDATE: Solution Added.
Because the data I was trying to split came from an XML file, it was being stored as an element rather than a string.
I iterated through the list of estate elements and stored the EstateName attribute of each one in a variable:
for element in EstateList:
    EstateStr = element.getAttribute('EstateName')
I then used the split on this new variable which contains strings rather than elements and wrote them to the desired text file:
asset = EstateStr.split('.', 3)[2]
z.write(asset + "\n")
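For reference, the whole flow might look like the sketch below. It assumes the file is parsed with xml.dom.minidom (which provides getAttribute); the sample XML and the Estate tag name are made up for illustration and will differ from the real file.

```python
from xml.dom import minidom

# Hypothetical sample data; the real file's tag names may differ.
SAMPLE_XML = """<Estates>
  <Estate EstateName="aa.bb.cccc.dd.ee"/>
  <Estate EstateName="a.b.c.d.e"/>
</Estates>"""

def extract_assets(xml_text):
    """Return the third dot-separated field of every EstateName attribute."""
    doc = minidom.parseString(xml_text)
    assets = []
    for element in doc.getElementsByTagName("Estate"):
        estate_str = element.getAttribute("EstateName")  # a plain str
        assets.append(estate_str.split(".", 3)[2])
    return assets

print(extract_assets(SAMPLE_XML))  # ['cccc', 'c']
```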

If you are certain it will always have this format (5 blocks of characters separated by 4 periods), you can split on '.' and then index the third element with [2].
>>> 'aa.bb.cccc.dd.ee'.split('.')[2]
'cccc'
This works for strings of any length, so you don't have to worry about absolute positions as you would with slicing, your first approach.
>>> 'a.b.c.d.e'.split('.')[2]
'c'
>>> 'eeee.ddddd.ccccc.bbbbb.aaaa'.split('.')[2]
'ccccc'
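If the format is not guaranteed, indexing with [2] raises an IndexError on names with fewer than three parts. A small sketch of a more defensive variant (third_field is a hypothetical helper name, not something from the question):

```python
def third_field(name, sep="."):
    """Return the third sep-separated field of name, or None if there is none."""
    parts = name.split(sep)
    return parts[2] if len(parts) > 2 else None

print(third_field("aa.bb.cccc.dd.ee"))  # cccc
print(third_field("aa.bb"))             # None
```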

Split the string on the period:
third_part = inputstring.split('.', 3)[2]
I've used str.split() with a limit here for efficiency; no point in splitting the dd.ee part here, for example.
The [2] index then picks out the third result from the split, your cccc string:
>>> "aa.bb.cccc.dd.ee".split('.', 3)[2]
'cccc'
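To see what the limit does, compare the two calls side by side. This is just an illustration of the maxsplit argument on the example string:

```python
name = "aa.bb.cccc.dd.ee"

full = name.split(".")        # split at every period
limited = name.split(".", 3)  # stop after three splits

print(full)     # ['aa', 'bb', 'cccc', 'dd', 'ee']
print(limited)  # ['aa', 'bb', 'cccc', 'dd.ee']

# Either way, index 2 is the same field.
print(full[2] == limited[2])  # True
```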

You could use the re module to extract the string between the second and third dots.
>>> import re
>>> re.search(r'^[^.]*\.[^.]*\.([^.]*)\..*', "aa.bb.cccc.dd.ee").group(1)
'cccc'
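Note that re.search returns None when nothing matches, so calling .group(1) directly fails on names with fewer than three dots. A minimal sketch that guards against this (third_field_re is a hypothetical name; the pattern is the one above without the trailing .*):

```python
import re

# Two dot-free runs, each followed by a dot, then capture the third run,
# which must itself be followed by a dot.
THIRD_FIELD = re.compile(r'^[^.]*\.[^.]*\.([^.]*)\.')

def third_field_re(name):
    """Return the text between the 2nd and 3rd dot, or None if absent."""
    match = THIRD_FIELD.search(name)
    return match.group(1) if match else None

print(third_field_re("aa.bb.cccc.dd.ee"))  # cccc
print(third_field_re("aa.bb"))             # None
```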

Related

Regex substitution that returns a trimmed version of the input?

I am dealing with a variety of "five and two" strings that refer to an individual. The strings have the first five letters of an individual's last name, and then the first two letters of the individual's first name. Each string concludes with a two digit numeral that acts as a "tiebreaker" if more than two individuals have the same "five and two." The numerals are to be considered strings. In the event of an individual who possesses a last name shorter than five letters, the entire last name is included in the string with no extra characters to fill in the gap.
Examples:
adamsjo02
allenje01
alstoga01
ariasge01
aucoide01
ayraujo01
belkti01 #This individual has a last name with only four letters
I wish to convert each of these strings into a "four and one" string that has a three digit numeral. The result of the above examples after being converted should look like this:
adamj002
allej001
alstg001
ariag001
aucod001
ayraj001
belkt001
I am using python throughout my project. I suspect that a regex substitution would be the best course of action to achieve what I need. I have little experience with regexes, and have come up with this thus far to detect the regex:
re.compile(r'(/w){2,5}(/w/w)(/w/w)')
While this does not work for me, it does lay out that I perceive there to be three groupings in each string. The last name portion, the first name portion, and the numerals (to be treated as strings). Each of those groupings ought to be undergoing a change, with exception to any individual that may have a last name of four or fewer letters.

You can do this with the proper escape character (\ rather than /) and an f-string:
import re
text = '''adamsjo02
allenje01
alstoga01
ariasge01
aucoide01
ayraujo01
belkti01
maja01'''
p = re.compile(r"(\w{2,5})(\w{2})(\d{2})")
output = [f"{m.group(1):_<4.4}{m.group(2):1.1}{m.group(3):0>3}" for m in map(p.search, text.splitlines())]
print(output)
# ['adamj002', 'allej001', 'alstg001', 'ariag001', 'aucod001', 'ayraj001', 'belkt001', 'ma__j001']
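Since the question asks about a regex substitution specifically, here is a sketch using re.sub with a replacement function. It assumes the codes are lowercase ASCII letters followed by two digits, and unlike the answer above it does not pad short last names with underscores; the function names are made up.

```python
import re

# last name (2-5 letters), first name (2 letters), two-digit tiebreaker
PATTERN = re.compile(r"^([a-z]{2,5})([a-z]{2})(\d{2})$")

def _convert(match):
    last, first, number = match.groups()
    # keep 4 letters of the last name, 1 of the first, widen number to 3 digits
    return f"{last[:4]}{first[0]}{int(number):03d}"

def five_two_to_four_one(code):
    """Convert one code; strings that do not match are returned unchanged."""
    return PATTERN.sub(_convert, code)

print(five_two_to_four_one("adamsjo02"))  # adamj002
print(five_two_to_four_one("belkti01"))   # belkt001
```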

In this case, since you have a very specific format, I'd say regex is not necessary, though it does the job. I'm proposing, then, an alternate solution without using it.
def to_four_one(code: str) -> str:
    last, first, number = code[:-4][:4], code[-4:-2], int(code[-2:])
    return f"{last}{first[-2]}{number:03}"
It's a simple function that rearranges the elements in the string. It gets the last name, first name and number as separate elements, and rewrites them as the new format asks (clipping the last name to 4 characters and the first name to 1, and formatting the number with 3 digits).
Usage below. I added two more names with even fewer characters to show it doesn't break in those cases.
codes = [
"adamsjo02",
"allenje01",
"alstoga01",
"ariasge01",
"aucoide01",
"ayraujo01",
"belkti01",
"jorma03",
"baka02"]
for code in codes:
    print(to_four_one(code))
Output:
adamj002
allej001
alstg001
ariag001
aucod001
ayraj001
belkt001
jorm003
bak002

Taking a specific character in the string for a list of strings in python

I have a list of 22000 strings like abc.wav. I want to take a specific character out of each one in Python, for example the character just before .wav. How can I do that?

Finding the position of a character could be done with .split(), but if you want the character at a specific position in a string, you can use list[stringNum][letterNum]. list[stringNum].split("a") gives you the separate strings on either side of each letter "a". By comparing the lengths of those strings with the length of the original string, you can work out where the letter occurred. Just a simple algorithm idea, I guess. You'd have to play around with it.

I am assuming you are trying to reconstruct the same string without the letter before the extension.
resultList = []
for item in file_names:  # file_names is your list of strings
    newstr, extstr = item.rsplit('.', 1)  # 'abc.wav' -> 'abc', 'wav'
    locstr = newstr[:-1]  # change the slice here depending on the char you want to remove
    endstr = locstr + '.' + extstr  # put the dot back before the extension
    resultList.append(endstr)
If you only want to save a list of the removed letters, do the following:
resultList = []
for item in file_names:  # file_names is your list of strings
    newstr = item.split('.')[0]
    endstr = newstr[-1]  # the character just before the first dot
    resultList.append(endstr)
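As an alternative sketch, os.path.splitext from the standard library separates the extension for you, which also copes with names that contain more than one dot. The function name and the sample names are only illustrative:

```python
import os.path

def char_before_extension(file_names):
    """Return the last character of each file name's stem."""
    result = []
    for name in file_names:
        stem, _ext = os.path.splitext(name)  # 'abc.wav' -> ('abc', '.wav')
        if stem:  # skip empty names
            result.append(stem[-1])
    return result

print(char_before_extension(["abc.wav", "xyz.wav"]))  # ['c', 'z']
```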

import pandas as pd
df = pd.DataFrame({'something': ['asb.wav', 'xyz.wav']})
df.something.str.extract(r"(\w*)(\.wav$)", expand=True)
Gives:
     0     1
0  asb  .wav
1  xyz  .wav

Is it possible to search and replace a string with "any" characters?

There are probably several ways to solve this problem, so I'm open to any ideas.
I have a file that contains the string "D133330593". Note: I know the exact position of this string within the file, but I don't know if that helps.
Following this string there are 6 digits; I need to replace these 6 digits with 6 other digits.
This is what I have so far:
def editfile():
    f = open(filein,'r')
    filedata = f.read()
    f.close()
    #This is the line that needs help
    newdata = filedata.replace( -TOREPLACE- ,-REPLACER-)
    #Basically what I need is something that lets me say "D133330593******"
    #->"D133330593123456" Note: The following 6 digits don't need to be
    #anything specific, just different from the original 6
    f = open(filein,'w')
    f.write(newdata)
    f.close()

Use the re module to define your pattern and then use the sub() function to substitute occurrence of that pattern with your own string.
import re
...
pat = re.compile(r"D133330593\d{6}")
newdata = re.sub(pat, "D133330593abcdef", filedata)
The above defines a pattern as -- your string ("D133330593") followed by six decimal digits. Then the next line replaces ALL occurrences of this pattern with your replacement string ("abcdef" in this case), if that is what you want.
If you only want to replace the first few occurrences, you can use the count keyword argument of sub(), which limits the number of replacements made. If you want a different replacement string for each occurrence, you can pass a function as the replacement instead of a string.
Check out this library for more info - https://docs.python.org/3.6/library/re.html
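Put together as a reusable function, the idea could be sketched like this. The function name and its parameters are hypothetical; re.escape is used so the marker is always treated as literal text, and the replacement is given as a function so it is inserted verbatim.

```python
import re

def replace_digits_after(data, marker, new_digits):
    """Replace the six digits that follow each occurrence of marker."""
    pattern = re.compile(re.escape(marker) + r"\d{6}")
    # A function replacement avoids any backslash processing of the text.
    return pattern.sub(lambda m: marker + new_digits, data)

sample = "zshisjD133330593090909fdjgsl"
print(replace_digits_after(sample, "D133330593", "123456"))
# zshisjD133330593123456fdjgsl
```

Inside editfile(), the result would then be written back as newdata.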

Let's simplify your problem to you having a string:
s = "zshisjD133330593090909fdjgsl"
and wanting to replace the 6 characters after "D133330593" with "123456" to produce:
"zshisjD133330593123456fdjgsl"
To achieve this, we first need to find the index of "D133330593". This is done with str.index:
i = s.index("D133330593")
Then replace the next 6 characters, but for this, we should first calculate the length of our string that we want to replace:
l = len("D133330593")
then do the replace:
s[:i+l] + "123456" + s[i+l+6:]
which gives us the desired result of:
'zshisjD133330593123456fdjgsl'
I am sure that you can now integrate this into your code to work with a file, but this is how you can solve the heart of your problem.
Note that storing i and l in variables as above is preferable, since each value is then computed only once. Nevertheless, if your file isn't too long (i.e. efficiency isn't a big deal), you can do the whole process outlined above in one line:
s[:s.index("D133330593")+len("D133330593")] + "123456" + s[s.index("D133330593")+len("D133330593")+6:]
which gives the same result.
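Note that str.index raises a ValueError when the marker is missing. A sketch of the same slicing approach wrapped in a function that uses str.find and leaves the string untouched in that case (replace_following is a made-up name):

```python
def replace_following(s, marker, replacement):
    """Overwrite the characters right after the first marker with replacement."""
    i = s.find(marker)
    if i == -1:          # marker not present: nothing to do
        return s
    start = i + len(marker)
    return s[:start] + replacement + s[start + len(replacement):]

s = "zshisjD133330593090909fdjgsl"
print(replace_following(s, "D133330593", "123456"))
# zshisjD133330593123456fdjgsl
```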

Python: Replace the first nth matching letter to another letter

This is related to cleaning up a csv file.
I have a malformed csv file that has 4 columns, but the last column contains many (an unknown number of) commas.
I want to replace the delimiter with another character such as "|".
For example, string = "a,b,c,d,e,f" becomes "a|b|c|d,e,f".
The following code works, but I'd like to find a better and more efficient way to process large text files.
sample_txt='a,b,c,d,e,f'
temp=sample_txt.split(",")
output_txt='|'.join(temp[0:3])+'|'+','.join(temp[3:])

Python has the perfect way to do this, with str.replace:
>>> sample_txt='a,b,c,d,e,f'
>>> print(sample_txt.replace(',', '|', 3))
a|b|c|d,e,f
str.replace takes an optional third argument (or fourth if you count self) which dictates the maximum number of replacements to happen.

sample_txt='a,b,c,d,e,f'
output_txt = sample_txt.replace(',', '|', 3)
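For a large file, the same call can be applied one line at a time so the whole file never has to be held in memory. A minimal sketch, assuming 4 columns as in the question; io.StringIO stands in for a real file opened with open(), and convert_lines is a hypothetical name:

```python
import io

def convert_lines(lines, n_fields=4):
    """Yield each line with its first n_fields - 1 commas turned into pipes."""
    for line in lines:
        yield line.replace(",", "|", n_fields - 1)

# StringIO stands in for a file object here.
sample = io.StringIO("a,b,c,d,e,f\n1,2,3,4,5\n")
for converted in convert_lines(sample):
    print(converted, end="")
# a|b|c|d,e,f
# 1|2|3|4,5
```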

Finding various string repeats in python in next 10 characters

So I'm working on a problem where I have to find various string repeats after encountering an initial string, say we take ACTGAC so the data file has sequences that look like:
AAACTGACACCATCGATCAGAACCTGA
So in that string once we find ACTGAC then I need to analyze the next 10 characters for the string repeats which go by some rules. I have the rules coded but can anyone show me how once I find the string that I need, I can make a substring for the next ten characters to analyze. I know that str.partition function can do that once I find the string, and then the [1:10] can get the next ten characters.
Thanks!

You almost have it already (but note that indexes start counting from zero in Python).
The partition method will split a string into head, separator, tail, based on the first occurrence of separator.
So you just need to take a slice of the first ten characters of the tail:
>>> data = 'AAACTGACACCATCGATCAGAACCTGA'
>>> head, sep, tail = data.partition('ACTGAC')
>>> tail[:10]
'ACCATCGATC'
Python allows you to leave out the start-index in slices (it defaults to zero, the start of the string), and also the end-index (it defaults to the length of the string).
Note that you could also do the whole operation in one line, like this:
>>> data.partition('ACTGAC')[2][:10]
'ACCATCGATC'
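When the separator is not found, partition returns the whole string as head and empty strings for separator and tail, so the slice is simply empty. A sketch that makes that case explicit (following is a hypothetical helper name):

```python
def following(data, marker, length=10):
    """Return up to length characters after the first marker, or None."""
    head, sep, tail = data.partition(marker)
    if not sep:  # marker was not found
        return None
    return tail[:length]

data = 'AAACTGACACCATCGATCAGAACCTGA'
print(following(data, 'ACTGAC'))  # ACCATCGATC
print(following(data, 'GGGGGG'))  # None
```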

So, based on marcog's answer in "Find all occurrences of a substring in Python", I propose:
>>> import re
>>> data = 'AAACTGACACCATCGATCAGAACCTGAACTGACTGACAAA'
>>> sep = 'ACTGAC'
>>> [data[m.start()+len(sep):][:10] for m in re.finditer('(?=%s)'%sep, data)]
['ACCATCGATC', 'TGACAAA', 'AAA']
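The same result can be had without a regex by calling str.find in a loop and advancing one character at a time, so overlapping occurrences are found too. A sketch (windows_after_all is a made-up name):

```python
def windows_after_all(data, sep, length=10):
    """Return the text following every (possibly overlapping) occurrence of sep."""
    results = []
    start = data.find(sep)
    while start != -1:
        end = start + len(sep)
        results.append(data[end:end + length])
        # Search again from the next character to catch overlaps.
        start = data.find(sep, start + 1)
    return results

data = 'AAACTGACACCATCGATCAGAACCTGAACTGACTGACAAA'
print(windows_after_all(data, 'ACTGAC'))
# ['ACCATCGATC', 'TGACAAA', 'AAA']
```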
