Split string by special pattern - python

I have a long string that can consist of several sub-strings (not always: sometimes it is a single string, sometimes there are 4 sub-strings stuck together). Each one starts with a length byte, for example 4D or 4E. Below is an example of a big string that consists of 4 sub-strings:
4D44B9096268182113077A95C84005D55FCD9D79476DDA4346C7EF1F4F07D4B46693F51812C8B74E4E44B9097368182113077A340040058D55E7E8D3924C57182F6E07A4D3617E100D1652169668636CB54E44B9096868182113077A37004005705FE9461E85F69A4C8E1B00CE03E6337B8F3D853A51C447B9694E44B9096668182113077AA400400555C9FAADA21F1EC93DBD5B579E4E07DDAF75A45D095E72010DBB
After splitting by pattern, the output SHOULD BE:
4D44B9096268182113077A95C84005D55FCD9D79476DDA4346C7EF1F4F07D4B46693F51812C8B74E
4E44B9097368182113077A340040058D55E7E8D3924C57182F6E07A4D3617E100D1652169668636CB5
4E44B9096868182113077A37004005705FE9461E85F69A4C8E1B00CE03E6337B8F3D853A51C447B969
4E44B9096668182113077AA400400555C9FAADA21F1EC93DBD5B579E4E07DDAF75A45D095E72010DBB
Each long string has an ID, in this case 44B909, and each line has this ID right after the length byte. My original code took the first 6 characters (4D44B9) and split the string on them. It works in 95% of cases, namely when EVERY line has the same length, for example 4D. The problem is that the lines do not always have the same length, as in the string above. Here is my code:
def repeat():
    string = input('Please paste string below:'+'\n')
    code = string[:6]
    print('\n')
    print('SPLITTED:')
    string = string.replace(code, '\n'+'\n'+code)
    print(string)
while True:
    repeat()
When you paste this long string, it won't be split, because the first line starts with 4D and the rest start with 4E. I'd like it to "ignore" (for a moment) the first 2 characters (4D or 4E) and use the next six characters as the split pattern. The output should be the 4 lines above! I changed the code a bit, but I got some strange results, like the ones below:
44B9096268182113077A95C84005D55FCD9D79476DDA4346C7EF1F4F07D4B46693F51812C8B74E
44B9097368182113077A340040058D55E7E8D3924C57182F6E07A4D3617E100D1652169668636CB54E
44B9096868182113077A37004005705FE9461E85F69A4C8E1B00CE03E6337B8F3D853A51C447B9694E
44B9096668182113077AA400400555C9FAADA21F1EC93DBD5B579E4E07DDAF75A45D095E72010DBB
How can I make it work??

If the first two characters encode the string's length in hex, why not use that to decide how much of the string to consume?
However, the length prefixes in your example don't seem to match the data: 4D is 77 in decimal and 4E is 78, but the first sub-string is 80 characters long and the other three are 82.
For the question about how to split on a slightly variable pattern, a regular expression seems like a good solution.
import re
splitted = re.split(r'4[DE](?=44B909)', string)
In so many words, this says "use 4D or 4E as the delimiter to split on, but only if it's immediately followed by 44B909".
(There will be an empty string before the first value, but that's easy to shift off; or change the regex to r'(?<!^)4[DE](?=44B909)'.)
If you don't want to discard anything, include everything in the lookahead:
splitted = re.split(r'(?<!^)(?=4[DE]44B909)', string)
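A rough sketch of the zero-width split above, applied to the sample data from the question. The function name split_records and its record_id parameter are illustrative assumptions, and splitting on an empty match needs Python 3.7 or later.

```python
import re

def split_records(data, record_id="44B909"):
    """Split before every 4D/4E prefix that is directly followed by the ID."""
    # (?<!^) skips the very start of the string, so no empty first item appears.
    # The lookahead consumes nothing, so each 4D/4E prefix stays with its record.
    pattern = r"(?<!^)(?=4[DE]" + re.escape(record_id) + ")"
    return re.split(pattern, data)

# The four sub-strings the question expects as output.
expected = [
    "4D44B9096268182113077A95C84005D55FCD9D79476DDA4346C7EF1F4F07D4B46693F51812C8B74E",
    "4E44B9097368182113077A340040058D55E7E8D3924C57182F6E07A4D3617E100D1652169668636CB5",
    "4E44B9096868182113077A37004005705FE9461E85F69A4C8E1B00CE03E6337B8F3D853A51C447B969",
    "4E44B9096668182113077AA400400555C9FAADA21F1EC93DBD5B579E4E07DDAF75A45D095E72010DBB",
]
data = "".join(expected)  # the big string as it would be pasted

records = split_records(data)
for record in records:
    print(record)
```

A 4E that appears inside a record (for example the one that ends the first sub-string) is left alone, because it is not followed by the ID.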

Related

Regex substitution that returns a trimmed version of the input?

I am dealing with a variety of "five and two" strings that refer to an individual. The strings have the first five letters of an individual's last name, and then the first two letters of the individual's first name. Each string concludes with a two digit numeral that acts as a "tiebreaker" if more than two individuals have the same "five and two." The numerals are to be considered strings. In the event of an individual who possesses a last name shorter than five letters, the entire last name is included in the string with no extra characters to fill in the gap.
Examples:
adamsjo02
allenje01
alstoga01
ariasge01
aucoide01
ayraujo01
belkti01 #This individual has a last name with only four letters
I wish to convert each of these strings into a "four and one" string that has a three digit numeral. The result of the above examples after being converted should look like this:
adamj002
allej001
alstg001
ariag001
aucod001
ayraj001
belkt001
I am using python throughout my project. I suspect that a regex substitution would be the best course of action to achieve what I need. I have little experience with regexes, and have come up with this thus far to detect the regex:
re.compile(r'(/w){2,5}(/w/w)(/w/w)')
While this does not work for me, it does lay out that I perceive there to be three groupings in each string. The last name portion, the first name portion, and the numerals (to be treated as strings). Each of those groupings ought to be undergoing a change, with exception to any individual that may have a last name of four or fewer letters.
You can do it with the proper escape character \ and an f-string:
import re
text = '''adamsjo02
allenje01
alstoga01
ariasge01
aucoide01
ayraujo01
belkti01
maja01'''
p = re.compile(r"(\w{2,5})(\w{2})(\d{2})")
output = [f"{m.group(1):_<4.4}{m.group(2):1.1}{m.group(3):0>3}" for m in map(p.search, text.splitlines())]
print(output)
# ['adamj002', 'allej001', 'alstg001', 'ariag001', 'aucod001', 'ayraj001', 'belkt001', 'ma__j001']
In this case, since you have a very specific format, I'd say regex is not necessary, though it does the job. I'm proposing, then, an alternate solution without using it.
def to_four_one(code: str) -> str:
    # last name (clipped to 4 letters), first-name pair, numeric tiebreaker
    last, first, number = code[:-4][:4], code[-4:-2], int(code[-2:])
    return f"{last}{first[0]}{number:03}"
It's a simple function that rearranges the elements of the string. It takes the last name, first name and number as separate elements and rewrites them in the new format (clipping the last name to 4 letters and the first name to 1 letter, and formatting the number with 3 digits).
Usage below. I added two more names with even fewer characters to show that it doesn't break in those cases.
codes = [
"adamsjo02",
"allenje01",
"alstoga01",
"ariasge01",
"aucoide01",
"ayraujo01",
"belkti01",
"jorma03",
"baka02"]
for code in codes:
    print(to_four_one(code))
Output:
adamj002
allej001
alstg001
ariag001
aucod001
ayraj001
belkt001
jorm003
bak002
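Since the question asks specifically about a regex substitution, here is a minimal sketch of one with re.sub. It assumes lowercase ASCII codes in exactly the layout described; the names FIVE_AND_TWO and to_four_and_one are made up for illustration, and a code that does not match the pattern is returned unchanged.

```python
import re

# Group 1: up to four letters of the last name (an optional fifth letter is dropped).
# Group 2: the first letter of the first name (its second letter is dropped).
# Group 3: the two-digit tiebreaker.
FIVE_AND_TWO = re.compile(r"^([a-z]{1,4})[a-z]?([a-z])[a-z](\d{2})$")

def to_four_and_one(code):
    # \g<2>0 keeps the literal 0 from being read as part of the group number.
    return FIVE_AND_TWO.sub(r"\g<1>\g<2>0\g<3>", code)

for code in ["adamsjo02", "belkti01", "jorma03"]:
    print(code, "->", to_four_and_one(code))
```

The regex engine backtracks to find the split between last and first name, so short last names such as belk need no special case.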

regex extraction with comma and thousand separators of various sizes [duplicate]

I am wondering what a regular expression for testing the correct format of a number for German culture would look like.
In German, a comma is used as the decimal mark and a dot is used to separate thousands.
Therefore:
1.000 equals to 1000
1,000 equals to 1
1.000,89 equals to 1000.89
1.000.123.456,89 equals to 1000123456.89
The real trick, it seems to me, is to make sure that there can be several dots, optionally followed by a comma separator.
This is the regex I would use:
^-?\d{1,3}(?:\.\d{3})*(?:,\d+)?$
And this is a code example to interpret it as a valid floating point (notice the parseFloat() after the string replacements).
Edit: as mentioned in Severin Klug's answer, the code below assumes that the numbers are known to be in German format. Attempting to "detect" whether a string contains a German-format or US-format number is not trivial and is out of scope for this question. '1.234' is valid in both formats but with different values; without context it is impossible to know for sure which format was meant.
var numbers = ['1.000', '1,000', '1.000,89', '1.000.123.456,89'];
document.getElementById('out').value=numbers.map(function(str) {
    return parseFloat(str.replace(/\./g, '').replace(',', '.'));
}).join('\n');
<textarea id="out" rows="10" style="width:100%"></textarea>
I would have posted this as a comment, but I don't have enough reputation.
@funkwurm, your post https://stackoverflow.com/a/28361329/7329611 contains this JavaScript:
var numbers = ['1.000', '1,000', '1.000,89', '1.000.123.456,89', '1.2'];
numbers.map(function(str) {
    return parseFloat(str.replace(/\./g, '').replace(',', '.'));
}).join('\n');
which should convert German numbers to English/international ones. It does so for every number that has exactly three digits after each German thousands dot, like the numbers in the example array. BUT, and this is the critical flaw: it also deletes the dots from any string that does not have three digits after the dot.
So if you pass in a string like '1.2' it returns 12, and if you pass in '1.23' it returns 123.
This is very dangerous if someone simply copies the snippet above and assumes it will convert any given number correctly, because numbers that are already in English format will be corrupted! So be careful, please.
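One way to avoid that pitfall is to validate the format before converting. The following is only a sketch in Python: it reuses the shape of the anchored regex from the first answer, and the names GERMAN_NUMBER and parse_german_number are assumptions made for this example.

```python
import re
from decimal import Decimal

# Same shape as the anchored regex given earlier; re.ASCII limits \d to 0-9.
GERMAN_NUMBER = re.compile(r"-?\d{1,3}(?:\.\d{3})*(?:,\d+)?", re.ASCII)

def parse_german_number(text):
    """Return a Decimal for a German-formatted number, or raise ValueError."""
    # fullmatch plays the role of the ^ and $ anchors.
    if GERMAN_NUMBER.fullmatch(text) is None:
        raise ValueError(f"not a German-formatted number: {text!r}")
    # Drop the thousands dots, then turn the decimal comma into a point.
    return Decimal(text.replace(".", "").replace(",", "."))

for sample in ["1.000", "1,000", "1.000,89", "1.000.123.456,89", "1.2"]:
    try:
        print(sample, "->", parse_german_number(sample))
    except ValueError as error:
        print(sample, "-> rejected:", error)
```

With this check, '1.2' and '1.23' are rejected instead of being silently turned into 12 and 123. It still cannot tell whether '1.234' was meant as German or English input.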
This regex should work :
([0-9]{1,3}(?:\.[0-9]{3})*(?:\,[0-9]+)?)
A good regex would be something like this
Regex regex = new Regex(@"-?\d{1,3}(?:\.\d{3})*(?:,\d+)?");
Match match = regex.Match(input);
Decimal result = Decimal.Zero;
if (match.Success)
    result = Decimal.Parse(match.Value, new CultureInfo("de-DE"));
The result is the German number as a parsed decimal value.
Try this it will match your inputs:
^(\d+\.)*\d+(,\d+)?
This regex would work for positive numbers:
/^[0-9]{0,3}(\.[0-9]{3})*(,[0-9]{0,2})?$/
Breakdown
[0-9]{0,3} - allows zero to three digits. An empty value, '1', '26' and '789' are valid; '1589' is invalid.
(\.[0-9]{3})* - allows zero or more dots; each dot must be followed by exactly three digits. '2.589' is valid; '2.5896' and '2.45' are invalid.
(,[0-9]{0,2})? - allows an optional comma followed by zero to two digits. '25,', '25,5' and '25,45' are valid; '25,456' and '25,45,8' are invalid.
Hope this is helpful

String splitting in python by finding non-zero character

I want to do the following split:
input: 0x0000007c9226fc output: 7c9226fc
input: 0x000000007c90e8ab output: 7c90e8ab
input: 0x000000007c9220fc output: 7c9220fc
I use the following line of code to do this but it does not work!
split = element.rpartition('0')
I got these outputs which are wrong!
input: 0x000000007c90e8ab output: e8ab
input: 0x000000007c9220fc output: fc
what is the fastest way to do this kind of split?
The only idea for me right now is to make a loop and perform checking but it is a little time consuming.
I should mention that the number of zeros in input is not fixed.
Each string can be converted to an integer using int() with a base of 16. Then convert back to a string.
for s in '0x000000007c9226fc', '0x000000007c90e8ab', '0x000000007c9220fc':
    print('%x' % int(s, 16))
Output
7c9226fc
7c90e8ab
7c9220fc
input[2:].lstrip('0')
That should do it. The [2:] skips over the leading 0x (which I assume is always there), then the lstrip('0') removes all the zeros from the left side.
In fact, we can use lstrip's ability to remove any of several leading characters to simplify:
input.lstrip('x0')
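The two approaches so far agree on the sample inputs but differ when the value is zero. A small sketch of both, with hypothetical helper names strip_with_int and strip_with_lstrip, assuming every input starts with 0x:

```python
def strip_with_int(hex_text):
    # int() with base 16 accepts the 0x prefix; format(..., "x") writes the
    # value back as lowercase hex without prefix or leading zeros.
    return format(int(hex_text, 16), "x")

def strip_with_lstrip(hex_text):
    # Skip the assumed 0x prefix, then drop every leading zero.
    return hex_text[2:].lstrip("0")

for sample in ["0x0000007c9226fc", "0x000000007c90e8ab", "0x0000"]:
    print(sample, strip_with_int(sample), repr(strip_with_lstrip(sample)))
```

For an all-zero input the int() version returns '0', while the lstrip version returns an empty string; which of the two is wanted depends on the surrounding code.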
format is handy for this:
>>> print('{:x}'.format(0x000000007c90e8ab))
7c90e8ab
>>> print('{:x}'.format(0x000000007c9220fc))
7c9220fc
In this particular case you can just do
your_input[10:]
You'll most likely want to properly parse this; your idea of splitting on separation of non-zero does not seem safe at all.
Seems to be the XY problem.
If the number of characters in the string is constant, then you can use the following code.
input = "0x000000007c9226fc"
output = input[10:]
Also, you are using rpartition, which is defined as
str.rpartition(sep)
Split the string at the last occurrence of sep, and return a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return a 3-tuple containing two empty strings, followed by the string itself.
Since your input can contain several 0s and rpartition splits only at the last occurrence, this is the bug in your code.
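A quick sketch of what rpartition('0') actually returns for one of the inputs from the question; the variable names are only for illustration.

```python
value = "0x000000007c90e8ab"

# The split happens at the LAST '0', which sits inside the significant digits.
before, sep, after = value.rpartition("0")
print(before)  # 0x000000007c9
print(after)   # e8ab

# When the separator is missing, the whole string lands in the last slot.
print("7c9e8ab".rpartition("0"))  # ('', '', '7c9e8ab')
```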
A regular expression for 0x00000 and similar prefixes is (0x[0]+); replace the match with an empty string.
import re
st="0x000007c922433434000fc"
reg='(0x[0]+)'
rep=re.sub(reg, '',st)
print(rep)

Finding various string repeats in python in next 10 characters

So I'm working on a problem where I have to find various string repeats after encountering an initial string, say we take ACTGAC so the data file has sequences that look like:
AAACTGACACCATCGATCAGAACCTGA
So in that string once we find ACTGAC then I need to analyze the next 10 characters for the string repeats which go by some rules. I have the rules coded but can anyone show me how once I find the string that I need, I can make a substring for the next ten characters to analyze. I know that str.partition function can do that once I find the string, and then the [1:10] can get the next ten characters.
Thanks!
You almost have it already (but note that indexes start counting from zero in Python).
The partition method will split a string into head, separator and tail, based on the first occurrence of the separator.
So you just need to take a slice of the first ten characters of the tail:
>>> data = 'AAACTGACACCATCGATCAGAACCTGA'
>>> head, sep, tail = data.partition('ACTGAC')
>>> tail[:10]
'ACCATCGATC'
Python allows you to leave out the start index in slices (it defaults to zero, the start of the string), and also the end index (it defaults to the length of the string).
Note that you could also do the whole operation in one line, like this:
>>> data.partition('ACTGAC')[2][:10]
'ACCATCGATC'
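If the marker might be missing from some sequences, partition returns an empty separator and an empty tail, so the slice silently yields ''. A small sketch that makes this case explicit; the helper name window_after and its width parameter are assumptions for this example.

```python
def window_after(data, marker, width=10):
    """Return up to `width` characters after the first `marker`, or None."""
    head, sep, tail = data.partition(marker)
    if not sep:
        # partition gives an empty separator when the marker is not found
        return None
    # The slice may be shorter than `width` if the marker is near the end.
    return tail[:width]

data = "AAACTGACACCATCGATCAGAACCTGA"
print(window_after(data, "ACTGAC"))  # ACCATCGATC
print(window_after(data, "GGGGGG"))  # None
```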
So, based on marcog's answer in Find all occurrences of a substring in Python , I propose:
>>> import re
>>> data = 'AAACTGACACCATCGATCAGAACCTGAACTGACTGACAAA'
>>> sep = 'ACTGAC'
>>> [data[m.start()+len(sep):][:10] for m in re.finditer('(?=%s)'%sep, data)]
['ACCATCGATC', 'TGACAAA', 'AAA']

python: regular expressions, how to match a string of undefind length which has a structure and finishes with a specific group

I need to create a regexp to match strings like this 999-123-222-...-22
The string can be finished by &Ns=(any number) or without this... So valid strings for me are
999-123-222-...-22
999-123-222-...-22&Ns=12
And following are not valid:
999-123-222-...-22&N=1
I have tried testing it several hours already... But did not manage to solve, really need some help
Not sure if you want to literally match 999-123-222-...-22 or if that can be any sequence of numbers/dashes. Here are two different regexes:
/^[\d-]+(&Ns=\d+)?$/
/^999-123-222-\.\.\.-22(&Ns=\d+)?$/
The key idea is the (&Ns=\d+)?$ part, which matches an optional &Ns=<digits>, and is anchored to the end of the string with $.
If you just want to allow the strings 999-123-222-...-22 and 999-123-222-...-22&Ns=12, you had better use a plain string comparison.
If you want to allow any numbers between - you can use the regex:
^(\d+-){3}[.]{3}-\d+(&Ns=\d+)?$
If the numbers must be of only 3 digits and the last number of only 2 digits you can use:
^(\d{3}-){3}[.]{3}-\d{2}(&Ns=\d{2})?$
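In Python, the first of these two patterns could be tried out as follows. This is only a sketch: it assumes the three dots are literal characters, uses fullmatch in place of the ^ and $ anchors, and the names PATTERN and is_valid are made up for illustration.

```python
import re

# Three dash-terminated numbers, a literal "...", a dash, a final number,
# and optionally "&Ns=" followed by digits.
PATTERN = re.compile(r"(\d+-){3}[.]{3}-\d+(&Ns=\d+)?")

def is_valid(text):
    # fullmatch requires the whole string to fit the pattern.
    return PATTERN.fullmatch(text) is not None

for sample in ["999-123-222-...-22", "999-123-222-...-22&Ns=12", "999-123-222-...-22&N=1"]:
    print(sample, is_valid(sample))
```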
This looks like a phone number and extension information..
Why not make things simpler for yourself (and anyone who has to read this later) and split the input rather than use a complicated regex?
s = '999-123-222-...-22&Ns=12'
parts = s.split('&Ns=') # splits on '&Ns=' and removes it
If the piece before the "&" is a phone number, you could do another split and get the area code etc into separate fields, like so:
phone_parts = parts[0].split('-') # breaks up the digit string and removes the '-'
area_code = phone_parts[0]
The portion found after the optional '&Ns=' can be checked with the string method isdigit, which returns True if all characters in the string are digits and there is at least one character, and False otherwise.
if len(parts) > 1:
    extra_digits_ok = parts[1].isdigit()
