Python extract string starting with index up to character - python

Say I have an incoming string that varies a little:
" 1 |r|=1.2e10 |v|=2.4e10"
" 12 |r|=-2.3e10 |v|=3.5e-04"
"134 |r|= 3.2e10 |v|=4.3e05"
I need to extract the numbers (i.e. 1.2e10, 3.5e-04, etc.), so I would like to start at the end of '|r|' and grab all characters up to the ' ' (space) after it. Same for '|v|'.
I've been looking for something that would:
Extract a substring from a string starting at an index and ending on a specific character...
But have not found anything remotely close.
Ideas?
NOTE: Added new scenario, which is the one that is causing lots of head-scratching...

To keep it elegant and generic, let's utilize split:
First, we split on ' |' to get the tokens
Then we check whether each token has an equal sign and parse the key-value pair
sabich = "134 |r| = 3.2e10 |v|=4.3e05"
parts = sabich.split(' |')
values = {}
for p in parts:
    if '=' in p:
        k, v = p.split('=')
        values[k.replace('|', '').strip()] = v.strip(' ')
# {'r': '3.2e10', 'v': '4.3e05'}
print(values)
This can be converted to the one-liner:
sabich = "134 |r| = 3.2e10 |v|=4.3e05"
values = {t[0].replace('|', '').strip() : t[1].strip(' ') for t in [tuple(p.split('=')) for p in sabich.split(' |') if '=' in p]}
# {'r': '3.2e10', 'v': '4.3e05'}
print(values)

You can solve it with a regular expression.
import re
strings = [
    " 1 |r|=1.2e10 |v|=2.4e10",
    " 12 |r|=-2.3e10 |v|=3.5e-04"
]
out = []
pattern = r'(?P<name>\|[\w]+\|)=(?P<value>-?\d+(?:\.\d*)(?:e-?\d*)?)'
for s in strings:
    out.append(dict(re.findall(pattern, s)))
print(out)
Output
[{'|r|': '1.2e10', '|v|': '2.4e10'}, {'|r|': '-2.3e10', '|v|': '3.5e-04'}]
And if you want to convert the strings to numbers:
out = []
pattern = r'(?P<name>\|[\w]+\|)=(?P<value>-?\d+(?:\.\d*)(?:e-?\d*)?)'
for s in strings:
    # out.append(dict(re.findall(pattern, s)))
    out.append({
        name: float(value)
        for name, value in re.findall(pattern, s)
    })
print(out)
Output
[{'|r|': 12000000000.0, '|v|': 24000000000.0}, {'|r|': -23000000000.0, '|v|': 0.00035}]
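The third sample line in the question has a space after the equals sign ("|r|= 3.2e10"), which a pattern that expects the number directly after '=' will skip. A minimal sketch that tolerates optional whitespace around '=' could look like this. The names FIELD and extract_fields are made up for illustration, and the number format (optional sign, optional fraction, optional exponent) is an assumption based on the samples shown:

```python
import re

# Hypothetical pattern: a |name| label, '=' with optional spaces around it,
# then a float with optional sign, fraction and exponent.
FIELD = re.compile(r"\|(\w+)\|\s*=\s*([-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)")

def extract_fields(line):
    """Return {name: float} for every |name|=value pair found in line."""
    return {name: float(value) for name, value in FIELD.findall(line)}

print(extract_fields("134 |r|= 3.2e10 |v|=4.3e05"))
# {'r': 32000000000.0, 'v': 430000.0}
```

Here the bars are left out of the captured name, so the keys are 'r' and 'v' rather than '|r|' and '|v|'.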

Related

How to replace every third word in a string with the # length equivalent

Input:
string = "My dear adventurer, do you understand the nature of the given discussion?"
expected output:
string = 'My dear ##########, do you ########## the nature ## the given ##########?'
How can you replace the third word in a string of words with the # length equivalent of that word while avoiding counting special characters found in the string such as apostrophes('), quotations("), full stops(.), commas(,), exclamations(!), question marks(?), colons(:) and semicolons (;).
I took the approach of converting the string to a list of elements but am finding difficulty filtering out the special characters and replacing the words with the # equivalent. Is there a better way to go about it?
I solved it with:
s = "My dear adventurer, do you understand the nature of the given discussion?"
def replace_alphabet_with_char(word: str, replacement: str) -> str:
    new_word = []
    alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    for c in word:
        if c in alphabet:
            new_word.append(replacement)
        else:
            new_word.append(c)
    return "".join(new_word)
every_nth_word = 3
s_split = s.split(' ')
result = " ".join([replace_alphabet_with_char(s_split[i], '#') if i % every_nth_word == every_nth_word - 1 else s_split[i] for i in range(len(s_split))])
print(result)
Output:
My dear ##########, do you ########## the nature ## the given ##########?
There are more efficient ways to solve this question, but I hope this is the simplest!
My approach is:
Split the sentence into a list of the words
Using that, make a list of every third word.
Remove unwanted characters from this
Replace third words in original string with # times the length of the word.
Here's the code (explained in comments) :
# original line
line = "My dear adventurer, do you understand the nature of the given discussion?"
# printing original line
print(f'\n\nOriginal Line:\n"{line}"\n')
# printing something to indicate that the next few prints show what is happening after each line
print('\n\nStages of parsing:')
# splitting by spaces, into list
wordList = line.split(' ')
# printing wordlist
print(wordList)
# making list of every third word
thirdWordList = [wordList[i-1] for i in range(1,len(wordList)+1) if i%3==0]
# printing third-word list
print(thirdWordList)
# characters that you don't want hashed
unwantedCharacters = ['.','/','|','?','!','_','"',',','-','#','\n','\\',':',';','(',')','<','>','{','}','[',']','%','*','&','+']
# replacing these characters by empty strings in the list of third-words
for unwantedchar in unwantedCharacters:
    for i in range(0,len(thirdWordList)):
        thirdWordList[i] = thirdWordList[i].replace(unwantedchar,'')
# printing third word list, now without punctuation
print(thirdWordList)
# replacing with #
# note: str.replace replaces every occurrence of the word in the line, not only every third word
for word in thirdWordList:
    line = line.replace(word,len(word)*'#')
# Voila! Printing the result:
print(f'\n\nFinal Output:\n"{line}"\n\n')
Hope this helps!
Following works and does not use regular expressions
special_chars = {'.','/','|','?','!','_','"',',','-','#','\n','\\'}
def format_word(w, fill):
    if w[-1] in special_chars:
        return fill*(len(w) - 1) + w[-1]
    else:
        return fill*len(w)
def obscure(string, every=3, fill='#'):
    return ' '.join(
        (format_word(w, fill) if (i+1) % every == 0 else w)
        for (i, w) in enumerate(string.split())
    )
Here are some example usage
In [15]: obscure(string)
Out[15]: 'My dear ##########, do you ########## the nature ## the given ##########?'
In [16]: obscure(string, 4)
Out[16]: 'My dear adventurer, ## you understand the ###### of the given ##########?'
In [17]: obscure(string, 3, '?')
Out[17]: 'My dear ??????????, do you ?????????? the nature ?? the given ???????????'
With help of some regex. Explanation in the comments.
import re
imp = "My dear adventurer, do you understand the nature of the given discussion?"
every_nth = 3 # in case you want to change this later
out_list = []
# split the input at spaces, enumerate the parts for looping
for idx, word in enumerate(imp.split(' ')):
    # only do the special logic for multiples of n (0-indexed, thus +1)
    if (idx + 1) % every_nth == 0:
        # find how many special chars there are in the current segment
        # (add more special chars to the character class if needed)
        len_special_chars = len(re.findall(r'[.,!?:;\'"]', word))
        # subtract the number of special chars from the length of segment
        str_len = len(word) - len_special_chars
        # keep the special chars, assumed to sit at the end of the word
        trailing = word[-len_special_chars:] if len_special_chars > 0 else ''
        # repeat '#' for every non-special char and add the special chars
        out_list.append('#'*str_len + trailing)
    else:
        # if the index is not a multiple of n, just add the word
        out_list.append(word)
print(' '.join(out_list))
A mix of regex and string manipulation:
import re
string = "My dear adventurer, do you understand the nature of the given discussion?"
new_string = []
for i, s in enumerate(string.split()):
    if (i+1) % 3 == 0:
        s = re.sub(r'[^\.:,;\'"!\?]', '#', s)
    new_string.append(s)
new_string = ' '.join(new_string)
print(new_string)
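The answers above share one idea: walk over the words, pick every third one, and replace only the characters that count as part of the word. A compact sketch of that idea without regex is below. The function name mask_every_nth is invented here, and treating "special character" as "anything that is not a letter or digit" (via str.isalnum) is an assumption, not something the question spells out:

```python
def mask_every_nth(sentence, n=3, fill="#"):
    """Replace letters and digits of every n-th word with fill, keeping punctuation."""
    words = sentence.split(" ")
    # Indices n-1, 2n-1, ... are the n-th, 2n-th, ... words (lists are 0-indexed).
    for idx in range(n - 1, len(words), n):
        words[idx] = "".join(fill if ch.isalnum() else ch for ch in words[idx])
    return " ".join(words)

text = "My dear adventurer, do you understand the nature of the given discussion?"
print(mask_every_nth(text))
# My dear ##########, do you ########## the nature ## the given ##########?
```

Because each character is checked on its own, punctuation in the middle of a word (such as the apostrophe in "don't") is also left untouched.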

Is there a regex script that can be used to extract text by defining a start and an end in a text file [duplicate]

Let's say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part.
I only know the few characters directly before (AAA) and directly after (ZZZ) the part I am interested in (1234).
With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
How to do the same thing in Python?
Using regular expressions - documentation for further reference
import re
text = 'gfgfdAAA1234ZZZuijjk'
m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)
# found: 1234
or:
import re
text = 'gfgfdAAA1234ZZZuijjk'
try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = '' # apply your error handling
# found: 1234
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> start = s.find('AAA') + 3
>>> end = s.find('ZZZ', start)
>>> s[start:end]
'1234'
Then you can use regexps with the re module as well, if you want, but that's not necessary in your case.
regular expression
import re
re.search(r"(?<=AAA).*?(?=ZZZ)", your_text).group(0)
The above as-is will fail with an AttributeError if there are no "AAA" and "ZZZ" in your_text
string methods
your_text.partition("AAA")[2].partition("ZZZ")[0]
The above will return an empty string if "AAA" doesn't exist in your_text. If "AAA" exists but "ZZZ" doesn't, it returns everything after "AAA".
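str.partition also returns the separator it found (or an empty string when it found none), which makes it possible to tell "marker missing" apart from "nothing between the markers". A small sketch of that, where the helper name between and the choice to return None on a missing marker are assumptions for illustration:

```python
def between(text, start, end):
    """Return the text between the first start marker and the next end marker.

    Returns None when either marker is missing.
    """
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None  # start marker missing
    middle, found_end, _ = rest.partition(end)
    if not found_end:
        return None  # end marker missing
    return middle

print(between("gfgfdAAA1234ZZZuijjk", "AAA", "ZZZ"))  # 1234
print(between("gfgfdAAA1234", "AAA", "ZZZ"))          # None
```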
PS Python Challenge?
Surprised that nobody has mentioned this which is my quick version for one-off scripts:
>>> x = 'gfgfdAAA1234ZZZuijjk'
>>> x.split('AAA')[1].split('ZZZ')[0]
'1234'
You can do it using just one line of code:
>>> import re
>>> re.findall(r'\d{1,5}','gfgfdAAA1234ZZZuijjk')
['1234']
The result is a list of the digit runs found in the string.
import re
print(re.search('AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1))
You can use re module for that:
>>> import re
>>> re.compile(".*AAA(.*)ZZZ.*").match("gfgfdAAA1234ZZZuijjk").groups()
('1234',)
In Python, extracting a substring from a string can be done using the findall method in the regular expression (re) module.
>>> import re
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> ss = re.findall('AAA(.+)ZZZ', s)
>>> print(ss)
['1234']
text = 'I want to find a string between two substrings'
left = 'find a '
right = 'between two'
print(text[text.index(left)+len(left):text.index(right)])
Gives
string
>>> s = '/tmp/10508.constantstring'
>>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')
'10508'
With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
You could do the same with re.sub function using the same regex.
>>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
'1234'
In basic sed, capturing groups are written as \(..\), but in Python they are written as (..).
You can find first substring with this function in your code (by character index). Also, you can find what is after a substring.
def FindSubString(strText, strSubString, Offset=None):
    try:
        Start = strText.find(strSubString)
        if Start == -1:
            return -1 # Not Found
        else:
            if Offset is None:
                Result = strText[Start+len(strSubString):]
            elif Offset == 0:
                return Start
            else:
                AfterSubString = Start+len(strSubString)
                Result = strText[AfterSubString:AfterSubString + int(Offset)]
            return Result
    except:
        return -1
# Example:
Text = "Thanks for contributing an answer to Stack Overflow!"
subText = "to"
print("Start of first substring in a text:")
start = FindSubString(Text, subText, 0)
print(start); print("")
print("Exact substring in a text:")
print(Text[start:start+len(subText)]); print("")
print("What is after substring \"%s\"?" %(subText))
print(FindSubString(Text, subText))
# Your answer:
Text = "gfgfdAAA1234ZZZuijjk"
subText1 = "AAA"
subText2 = "ZZZ"
AfterText1 = FindSubString(Text, subText1, 0) + len(subText1)
BeforText2 = FindSubString(Text, subText2, 0)
print("\nYour answer:\n%s" %(Text[AfterText1:BeforText2]))
Using PyParsing
import pyparsing as pp
word = pp.Word(pp.alphanums)
s = 'gfgfdAAA1234ZZZuijjk'
rule = pp.nestedExpr('AAA', 'ZZZ')
for match in rule.searchString(s):
    print(match)
which yields:
[['1234']]
One liner with Python 3.8 if text is guaranteed to contain the substring:
text[text.find(start:='AAA')+len(start):text.find('ZZZ')]
Just in case somebody has to do the same thing that I did: I had to extract everything inside parentheses in a line. For example, if I have a line like 'US president (Barack Obama) met with ...' and I want to get only 'Barack Obama', this is the solution:
import re
regex = r'.*\((.*?)\).*'
matches = re.search(regex, line)
line = matches.group(1) + '\n'
I.e. you need to escape the parentheses with a backslash (\). Though this is more a question about regular expressions than about Python.
Also, in some cases you may see an 'r' prefix before the regex definition. Without the r prefix (a raw string), you need to escape backslashes as in C. Here is more discussion on that.
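If a line can contain more than one parenthesised part, re.findall with the same escaped parentheses returns all of them. A short sketch, where the function name parenthesised and the sample sentence are made up for illustration:

```python
import re

def parenthesised(line):
    """Return the contents of every (...) group in line, without the parentheses."""
    # Raw string: the backslashes in \( and \) reach the regex engine unchanged.
    return re.findall(r"\((.*?)\)", line)

print(parenthesised("US president (Barack Obama) met with (some) people"))
# ['Barack Obama', 'some']
```

The non-greedy .*? stops at the first closing parenthesis, so this sketch does not handle nested parentheses.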
Also, you can find all combinations with the function below:
s = 'Part 1. Part 2. Part 3 then more text'
def find_all_places(text, word):
    word_places = []
    i = 0
    while True:
        word_place = text.find(word, i)
        if word_place < 0:
            break
        word_places.append(word_place)
        # continue searching just past the current match
        i = word_place + len(word)
    return word_places
def find_all_combination(text, start, end):
    start_places = find_all_places(text, start)
    end_places = find_all_places(text, end)
    combination_list = []
    for start_place in start_places:
        for end_place in end_places:
            if start_place >= end_place:
                continue
            combination_list.append(text[start_place:end_place])
    return combination_list
find_all_combination(s,"Part","Part")
result:
['Part 1. ', 'Part 1. Part 2. ', 'Part 2. ']
In case you want to look for multiple occurrences:
content ="Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"
strings = []
for c in content.split('Prefix_'):
    spos = c.find('_Suffix')
    if spos!=-1:
        strings.append( c[:spos])
print( strings )
Or more quickly :
strings = [ c[:c.find('_Suffix')] for c in content.split('Prefix_') if c.find('_Suffix')!=-1 ]
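For comparison, the same multiple-occurrence extraction can be sketched with re.findall. The helper name all_between is invented for this example, and re.escape is used on the assumption that the markers should be matched literally rather than as regex syntax:

```python
import re

def all_between(text, start, end):
    """Return every shortest substring found between start and end markers."""
    # re.escape makes the markers literal; (.*?) is the non-greedy capture.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    return re.findall(pattern, text)

content = "Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"
print(all_between(content, "Prefix_", "_Suffix"))
# ['helloworld', '42']
```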
Here's a solution without regex that also accounts for scenarios where the first substring contains the second substring. This function will only find a substring if the second marker is after the first marker.
def find_substring(string, start, end):
    len_until_end_of_first_match = string.find(start) + len(start)
    after_start = string[len_until_end_of_first_match:]
    return string[string.find(start) + len(start):len_until_end_of_first_match + after_start.find(end)]
Another way of doing it is using lists (supposing the substring you are looking for is made of numbers, only) :
string = 'gfgfdAAA1234ZZZuijjk'
numbersList = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
output = []
for char in string:
    if char in numbersList:
        output.append(char)
print(f"output: {''.join(output)}")
### output: 1234
Typescript. Gets string in between two other strings.
Searches shortest string between prefixes and postfixes
prefixes - string / array of strings / null (means search from the start).
postfixes - string / array of strings / null (means search until the end).
public getStringInBetween(str: string, prefixes: string | string[] | null,
                          postfixes: string | string[] | null): string {
    if (typeof prefixes === 'string') {
        prefixes = [prefixes];
    }
    if (typeof postfixes === 'string') {
        postfixes = [postfixes];
    }
    if (!str || str.length < 1) {
        throw new Error(str + ' should contain ' + prefixes);
    }
    let start = prefixes === null ? { pos: 0, sub: '' } : this.indexOf(str, prefixes);
    const end = postfixes === null ? { pos: str.length, sub: '' } : this.indexOf(str, postfixes, start.pos + start.sub.length);
    let value = str.substring(start.pos + start.sub.length, end.pos);
    if (!value || value.length < 1) {
        throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
    }
    while (true) {
        try {
            start = this.indexOf(value, prefixes);
        } catch (e) {
            break;
        }
        value = value.substring(start.pos + start.sub.length);
        if (!value || value.length < 1) {
            throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
        }
    }
    return value;
}
a simple approach could be the following:
string_to_search_in = 'could be anything'
start = string_to_search_in.find(str("sub string u want to identify"))
length = len("sub string u want to identify")
First_part_removed = string_to_search_in[start:]
end_coord = length
Extracted_substring=First_part_removed[:end_coord]
One liners that return other string if there was no match.
Edit: improved version uses next function, replace "not-found" with something else if needed:
import re
res = next( (m.group(1) for m in [re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk" ),] if m), "not-found" )
My other method to do this is less optimal and runs the regex a second time; I still haven't found a shorter way:
import re
res = ( ( re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk") or re.search("()","") ).group(1) )

How to split a string based on word match from different lists?

I have a string. I want to split the string into parts wherever anything matches from two different lists. How can I do that? Here is what I have:
dummy_word = "I have a HTML file"
dummy_type = ["HTML","JSON","XML"]
dummy_file_type = ["file","document","paper"]
for e in dummy_type:
    if e in dummy_word:
        type_found = e
        print("type ->" , e)
        dum = dummy_word.split(e)
        complete_dum = "".join(dum)
for c in dummy_file_type:
    if c in complete_dum:
        then = complete_dum.split("c")
        print("file type ->",then)
In the given scenario my expected output is ["I have a", "HTML","file"]
These sorts of tasks are handled pretty well by itertools.groupby(). Here the key is the word itself if the word is in the set of special words, or False if it's not. This lets all the consecutive non-special words group together while each special word becomes its own element:
from itertools import groupby
dummy_word = "I have a HTML file"
dummy_type = ["HTML","JSON","XML"]
dummy_file_type = ["file","document","paper"]
words = set(dummy_type).union(dummy_file_type)
[" ".join(g) for k, g in
groupby(dummy_word.split(), key=lambda word: (word in words) and word)]
# ['I have a', 'HTML', 'file']
This worked for me:
dummy_word = "I have a HTML file"
dummy_type = ["HTML","JSON","XML"]
dummy_file_type = ["file","document","paper"]
temp = ""
dummy_list = []
for word in dummy_word.split():
    if word in dummy_type or word in dummy_file_type:
        if temp:
            dummy_list.append(temp)
            print(temp, "delete")
        print(temp)
        new_word = word + " "
        dummy_list.append(new_word)
        temp = ""
    else:
        temp += word + " "
        print(temp)
print(dummy_list)
One more way using re:
>>> import re
>>> list(map(str.strip, re.sub("|".join(dummy_type + dummy_file_type), lambda x: "," + x.group(), dummy_word).split(',')))
['I have a', 'HTML', 'file']
First, form a regex pattern by concatenating all the types using join. Using re.sub, the string is replaced where tokens are prepended with a comma, and then we split the string using comma separator. map is used to strip the whitespaces.
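A variation on the same idea that avoids the temporary comma (which would misfire if the text itself contained commas) is re.split with a capturing group, since re.split keeps captured separators in its result. This is only a sketch: the function name split_on_keywords is invented, and matching on whole words with \b is an assumption about what is wanted:

```python
import re

def split_on_keywords(text, keywords):
    """Split text so that each keyword becomes its own element."""
    # The capturing group makes re.split keep the matched keywords.
    pattern = r"\b(" + "|".join(map(re.escape, keywords)) + r")\b"
    parts = (part.strip() for part in re.split(pattern, text))
    # Drop the empty strings left over between or after adjacent keywords.
    return [part for part in parts if part]

dummy_type = ["HTML", "JSON", "XML"]
dummy_file_type = ["file", "document", "paper"]
print(split_on_keywords("I have a HTML file", dummy_type + dummy_file_type))
# ['I have a', 'HTML', 'file']
```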

How to copy spaces from one string to another in Python?

I need a way to copy all of the positions of the spaces of one string to another string that has no spaces.
For example:
string1 = "This is a piece of text"
string2 = "ESTDTDLATPNPZQEPIE"
output = "ESTD TD L ATPNP ZQ EPIE"
Insert characters as appropriate into a placeholder list and concatenate it after using str.join.
it = iter(string2)
output = ''.join(
[next(it) if not c.isspace() else ' ' for c in string1]
)
print(output)
'ESTD TD L ATPNP ZQ EPIE'
This is efficient as it avoids repeated string concatenation.
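If string2 might be shorter or longer than the number of non-space characters in string1, the same iterator idea can be wrapped in a function that does not raise. The name copy_spaces, and the choice to append leftover characters at the end, are assumptions made for this sketch:

```python
def copy_spaces(template, letters):
    """Lay out letters following the space positions of template."""
    it = iter(letters)
    # next(it, "") yields an empty string once letters runs out.
    out = [" " if ch.isspace() else next(it, "") for ch in template]
    # Any letters that did not fit are appended unchanged.
    out.extend(it)
    return "".join(out)

print(copy_spaces("This is a piece of text", "ESTDTDLATPNPZQEPIE"))
# ESTD TD L ATPNP ZQ EPIE
```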
You need to iterate over the indexes and characters in string1 using enumerate().
On each iteration, if the character is a space, add a space to the output string (note that this is inefficient as you are creating a new object as strings are immutable), otherwise add the character in string2 at that index to the output string.
So that code would look like:
output = ''
si = 0
for i, c in enumerate(string1):
    if c == ' ':
        si += 1
        output += ' '
    else:
        output += string2[i - si]
However, it would be more efficient to use a very similar method, but with a generator and then str.join. This removes the slow concatenations to the output string:
def chars(s1, s2):
    si = 0
    for i, c in enumerate(s1):
        if c == ' ':
            si += 1
            yield ' '
        else:
            yield s2[i - si]
output = ''.join(chars(string1, string2))
You can try the insert method:
string1 = "This is a piece of text"
string2 = "ESTDTDLATPNPZQEPIE"
string3=list(string2)
for j,i in enumerate(string1):
    if i==' ':
        string3.insert(j,' ')
print("".join(string3))
Output:
ESTD TD L ATPNP ZQ EPIE

Python string split join 4

import re
string = "is2 Thi1s T4est 3a"
def order(sentence):
    res = ''
    count = 1
    list = sentence.split()
    for i in list:
        for i in list:
            a = re.findall('\d+', i)
            if a == [str(count)]:
                res += " ".join(i)
                count += 1
    print(res)
order(string)
Above there is a code which I have problem with. Output which I should get is:
"Thi1s is2 3a T4est"
Instead I'm getting the correct order but with spaces in the wrong places:
"T h i 1 si s 23 aT 4 e s t"
Any idea how to make it work with this code concept?
You are joining the characters of each word:
>>> " ".join('Thi1s')
'T h i 1 s'
You want to collect your words into a list and join that instead:
def order(sentence):
    number_words = []
    count = 1
    words = sentence.split()
    for word in words:
        for word in words:
            matches = re.findall('\d+', word)
            if matches == [str(count)]:
                number_words.append(word)
                count += 1
    result = ' '.join(number_words)
    print(result)
I used more verbose and clear variable names. I also removed the list variable; don't use list as a variable name if you can avoid it, as that masks the built-in list name.
What you implemented comes down to an O(N^2) (quadratic time) sort. You could instead use the built-in sorted() function to bring this to O(NlogN); you'd extract the digits and sort on their integer value:
def order(sentence):
    digit = re.compile(r'\d+')
    return ' '.join(
        sorted(sentence.split(),
               key=lambda w: int(digit.search(w).group())))
This differs a little from your version in that it'll only look at the first (consecutive) digits, it doesn't care about the numbers being sequential, and will break for words without digits. It also uses a return to give the result to the caller rather than print. Just use print(order(string)) to print the return value.
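If words without digits can occur, the key function can return a tuple so that those words sort after the numbered ones instead of raising an AttributeError. This is a sketch under that assumption; placing unnumbered words last is a choice made here, not something the question asks for:

```python
import re

DIGITS = re.compile(r"\d+")

def order(sentence):
    """Sort words by their embedded number; words without digits go last."""
    def key(word):
        match = DIGITS.search(word)
        # Tuples compare element-wise: (0, n) for numbered words, (1, 0) otherwise.
        return (0, int(match.group())) if match else (1, 0)
    return " ".join(sorted(sentence.split(), key=key))

print(order("is2 Thi1s T4est 3a"))
# Thi1s is2 3a T4est
```

Since sorted() is stable, words without digits keep their original relative order.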
If you assume the words are numbered consecutively starting at 1, then you can sort them in O(N) time even:
def order(sentence):
    digit = re.compile(r'\d+')
    words = sentence.split()
    result = [None] * len(words)
    for word in words:
        index = int(digit.search(word).group())
        result[index - 1] = word
    return ' '.join(result)
This works by creating a list of the same length, then using the digits from each word to put the word into the correct index (minus 1, as Python lists start at 0, not 1).
I think the bug is simply in the misuse of join(). You want to concatenate the current sorted string. i is simply a token, hence simply add it to the end of the string. Code untested.
import re
string = "is2 Thi1s T4est 3a"
def order(sentence):
    res = ''
    count = 1
    list = sentence.split()
    for i in list:
        for i in list:
            a = re.findall('\d+', i)
            if a == [str(count)]:
                res = res + " " + i # your bug here
                count += 1
    print(res)
order(string)
