Non-standard splitting - Python

I have string value like:
a='[-sfdfj aidjf -dugs jfdsif -usda [[s dfdsf sdf]]]'
I want to transform "a" into a dictionary: the strings preceded by a "-" character should be keys, and whatever follows the space should be the value of the preceding key.
If we are working with "a", then what I want is the resulting dictionary like:
dict_a={'-sfdfj': 'aidjf', '-dugs': 'jfdsif', '-usda': '[[s dfdsf sdf]]'}
This would be simple if not for the last value ('[[s dfdsf sdf]]'), which contains spaces. Otherwise I would just strip the outer brackets, split "a", and convert the resulting list into dict_a, but alas, reality is not on my side.
Even if I get the list like:
list_a=['-sfdfj', 'aidjf', '-dugs', 'jfdsif', '-usda', '[[s dfdsf sdf]]']
this would be enough.
Any help will be appreciated.

You can split the string by '-' and then add the '-' back.
a = '[-sfdfj aidjf -dugs jfdsif -usda [[s dfdsf sdf]]]'
a = a[1:-1]  # get rid of the leading and trailing []
sections = a.split('-')
dict_a = {}
for s in sections:
    s = s.strip()
    if len(s) == 0:
        continue
    key_value = s.split(' ')         # split key and value by spaces
    key = '-' + key_value[0]         # the first element is the key
    value = ' '.join(key_value[1:])  # the rest is the value
    dict_a[key] = value

I can tell you a way to go about it.
Strip the quotes and the outer brackets. Then split the string on spaces. Iterate over the resulting list and check for opening brackets. Keep a count of the opening brackets and join the list items (with spaces between them) until you encounter an equal number of closing brackets. The remaining items stay as they are. You could try implementing it; if you face any issues, I'll help you with the code.
@chong's answer is a neater way to go about it.
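As an illustration, here is a minimal sketch of that bracket-counting idea (the variable names are mine, not the answerer's):
a = '[-sfdfj aidjf -dugs jfdsif -usda [[s dfdsf sdf]]]'
tokens = a[1:-1].split()          # strip the outer brackets, split on spaces
merged = []                       # tokens with bracketed groups re-joined
depth = 0                         # count of unclosed '[' seen so far
for tok in tokens:
    if depth > 0:
        merged[-1] += ' ' + tok   # still inside a bracketed value
    else:
        merged.append(tok)
    depth += tok.count('[') - tok.count(']')
dict_a = dict(zip(merged[::2], merged[1::2]))  # pair each '-key' with the next token
print(dict_a)  # {'-sfdfj': 'aidjf', '-dugs': 'jfdsif', '-usda': '[[s dfdsf sdf]]'}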

Using a regular expression:
>>> import re
>>> dict(re.findall('(-\S+) ([^-]+)', a[:-1].replace(' -', '-')))
{'-sfdfj': 'aidjf', '-dugs': 'jfdsif', '-usda': '[[s dfdsf sdf]]'}
Using @ChongTang's idea:
>>> dict(('-' + b).strip().split(maxsplit=1) for b in a[1:-1].split('-') if b)
{'-sfdfj': 'aidjf', '-dugs': 'jfdsif', '-usda': '[[s dfdsf sdf]]'}

You can try this:
import re
a = '[-sfdfj aidjf -dugs jfdsif -usda [[s dfdsf sdf]]]'
a = a[1:-1]  # drop the outer [ ] so the last value keeps only its own brackets
pattern_key = re.compile(r'(?P<key>-\S+)\s+')
pattern_val = re.compile(r' (?P<val>[^-].*?)( -|\Z)')
d = {}
matches = pattern_key.finditer(a)
matches1 = pattern_val.finditer(a)
for m, n in zip(matches, matches1):
    d[m.group('key')] = n.group('val')
print d

Related

Replacing substring but skipping previous occurrence

I have a long string that may contain multiple identical sub-strings. I would like to extract certain sub-strings using a regex. Then, for each extracted sub-string, I want to append [i] and replace the original one.
Using a regex, I extracted ['df.Libor3m', 'df.Libor3m_lag1', 'df.Libor3m_lag1']. However, when I tried to add [i] to each item, the first 'df.Libor3m_lag1' in the string was replaced twice.
function_text_MD='0.11*(np.maximum(df.Libor3m,0.9)-np.maximum(df.Libor3m_lag1,0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)'
read_var = re.findall(r"df.[\w+][^\W]+", function_text_MD)
for var_name in read_var:
    function_text_MD.find(var_name)
    new_var_name = var_name + '[i]'
    function_text_MD = function_text_MD.replace(var_name, new_var_name, 1)
So I got '0.11*(np.maximum(df.Libor3m[i],0.9)-np.maximum(df.Libor3m_lag1[i][i],0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)'.
df.Libor3m_lag1[i][i] shows that [i] was added twice.
What I want to get:
'0.11*(np.maximum(df.Libor3m[i],0.9)-np.maximum(df.Libor3m_lag1[i],0.9))+0.7*np.maximum(df.Libor3m_lag1[i],0.9)'
Thanks in advance!
Here is the code.
import re

function_text_MD = '0.11*(np.maximum(df.Libor3m,0.9)-np.maximum(df.Libor3m_lag1,0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)'
read_var = re.findall(r"df.[\w+][^\W]+", function_text_MD)
# replace each distinct name exactly once, and only where it is not followed by
# another word character, so df.Libor3m does not also hit df.Libor3m_lag1
for var_name in set(read_var):
    function_text_MD = re.sub(re.escape(var_name) + r'(?!\w)', var_name + '[i]', function_text_MD)
print(function_text_MD)
t = "0.11*(np.maximum(df.Libor3m,0.9)-np.maximum(df.Libor3m_lag1,0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)"
p = re.split("(?<=df\.)[a-zA-Z_0-9]+", t)
s = re.findall("(?<=df\.)[a-zA-Z_0-9]+", t)
s = [x+"[i]" for x in s]
result = "".join([p[0],s[0],p[1],s[1],p[2],s[2]])
use the regular expression to split string first.
use the same regular expression to find the spliters
change the spliters to what you want
put the 2 list together and join.
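If the number of df. variables is not fixed, the same split/findall idea can be written without hard-coded indices (a sketch that only reuses the pattern from the answer above):
import re

t = "0.11*(np.maximum(df.Libor3m,0.9)-np.maximum(df.Libor3m_lag1,0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)"
pattern = r"(?<=df\.)[a-zA-Z_0-9]+"

pieces = re.split(pattern, t)                        # the text between the variable names
names = [x + "[i]" for x in re.findall(pattern, t)]  # the variable names with [i] appended

# interleave pieces and names; pieces always has one more element than names
result = "".join(piece + name for piece, name in zip(pieces, names + [""]))
print(result)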

How to remove substring after a specific character in a list of strings in Python

I have a list of string labels. I want to keep the substring of every element before the second "." and remove all characters after the second ".".
I found a post that shows how to do this with a text string using the split function. However, the list datatype does not have a split function. The actual data type is a pandas.core.indexes.base.Index, which appears to be a list to me.
For the first element in the list, I want to keep L1.Energy and remove everything after the second ".".
current_list = ['L1.Energy.Energy', 'L1.Utility.Energy', 'L1.Technology.Utility', 'L1.Financial.Utility']
desired_list = ['L1.Energy', 'L1.Utility', 'L1.Technology', 'L1.Financial']
Here is a one-liner:
desired_list = [ s[:s.find(".",s.find(".")+1)] for s in current_list]
current_list = ['L1.Energy.Energy', 'L1.Utility.Energy', 'L1.Technology.Utility', 'L1.Financial.Utility']
desired_list = [ '.'.join(x.split('.')[:2]) for x in current_list ]
BTW, this will also work if your labels have more than two dots (like 'L1.Utility.Energy.Electric').
Here, it's ugly, but it works:
bob = ['L1.Energy.Energy', 'L1.Utility.Energy',
       'L1.Technology.Utility', 'L1.Financial.Utility']
result = []
for i in bob:
    temp = i.split(".")
    result.append(temp[0] + "." + temp[1])
print(result)
A solution with regex:
import re
desired_list = [re.sub(r'(\..*)(\..*)', r'\1', s) for s in current_list]
Output:
['L1.Energy', 'L1.Utility', 'L1.Technology', 'L1.Financial']
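Since the question says the data is actually a pandas Index rather than a plain list, the same trim can be done with the Index's .str accessor (a minimal sketch; the labels are the ones from the question):
import pandas as pd

idx = pd.Index(['L1.Energy.Energy', 'L1.Utility.Energy',
                'L1.Technology.Utility', 'L1.Financial.Utility'])

# split on '.', keep the first two pieces of each label, and join them back
trimmed = idx.str.split('.').str[:2].str.join('.')
print(list(trimmed))  # ['L1.Energy', 'L1.Utility', 'L1.Technology', 'L1.Financial']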

Parsing String by regular expression in python

How can I parse this string in python?
Input String:
someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data
to this
Output array:
['someplace','2018:6:18:0','25.0114','95.2818','2.71164','66.8962','Entire grid contents are set to missing data']
I have already tried split(' '), but it is not clear how many spaces there are between the sub-strings, and the last sub-string may itself contain spaces, so this doesn't work.
I need a regular expression.
If you do not provide a sep character, Python's split(sep=None, maxsplit=-1) (see the docs) will treat consecutive whitespace as a single separator and split on it. You can limit the number of splits by providing a maxsplit value:
data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
spl = data.split(None, 6)  # don't give a split char; use 6 splits at most
print(spl)
Output:
['someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164',
'66.8962', 'Entire grid contents are set to missing data']
This will work as long as the first text does not contain any whitespace.
If the first text may contain whitespace, you can use/refine this regex solution:
import re
reg = re.findall(r"([^\d]+?) +?([\d:]+) +?([\d.]+) +?([\d.]+) +?([\d.]+) +?([\d.]+) +(.*)$",data)[0]
print(reg)
Output:
('someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164', '66.8962', 'Entire grid contents are set to missing data')
Use e.g. https://regex101.com to check/prove the regex against your other data (follow the link; it uses the above regex on sample data).
[A-Z]{1}[a-zA-Z ]{15,45}|[\w|:|.]+
You can test it here https://pythex.org/
Modify 15,45 according to your needs.
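As a quick sketch of how that pattern could be applied with re.findall (the sample line is the one from the question):
import re

data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
print(re.findall(r"[A-Z]{1}[a-zA-Z ]{15,45}|[\w|:|.]+", data))
# ['someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164', '66.8962',
#  'Entire grid contents are set to missing data']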
Maxsplit works with re.split(), too:
import re
text = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
re.split(r"\s+", text, maxsplit=6)
Out:
['someplace',
'2018:6:18:0',
'25.0114',
'95.2818',
'2.71164',
'66.8962',
'Entire grid contents are set to missing data']
EDIT:
If the first and last text parts don't contain digits, we don't need maxsplit and don't have to rely on a fixed number of parts:
re.split(r"\s+(?=\d)|(?<=\d)\s+", text)
We cut the string where a space is followed by a digit, or vice versa, using lookahead and lookbehind.
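A self-contained check of that lookaround split (same input line as above):
import re

text = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
print(re.split(r"\s+(?=\d)|(?<=\d)\s+", text))
# ['someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164', '66.8962',
#  'Entire grid contents are set to missing data']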
It is hard to answer your question as the requirements are not very precise. I think I would split the line with the split() function and then join the items whose contents have no numbers. Here is a snippet that works with your single sample:
def containsNumbers(s):
    return any(c.isdigit() for c in s)

data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
lst = data.split()
lst2 = []
i = 0
agg = ''
while i < len(lst):
    if containsNumbers(lst[i]):
        if agg != '':
            lst2.append(agg)
            agg = ''
        lst2.append(lst[i])
    else:
        agg += ' ' + lst[i]
        agg = agg.strip()
        if i == len(lst) - 1:
            lst2.append(agg)
    i += 1
print(lst2)

how to split brackets using python abcd[00451.00]

I have tried the code below to split it, but I am unable to:
import re
s = "abcd[00451.00]"
print str(s).strip('[]')
I need the output to be only the number in decimal format, 00451.00, but instead I get abcd[00451.00.
If you know for sure that there will be one opening and one closing bracket, you can do:
s = "abcd[00451.00]"
print s[s.index("[") + 1:s.rindex("]")]
# 00451.00
str.index is used to get the first index of [ in the string, whereas str.rindex is used to get the last index of ]. Based on those indexes, the string is sliced.
If you want to convert that to a floating-point number, you can use the float function, like this:
print float(s[s.index("[") + 1:s.rindex("]")])
# 451.0
You should use re.search:
import re
s = "abcd[00451.00]"
>>> print re.search(r'\[([^\]]+)\]', s).group(1)
00451.00
You can first split on the '[' and then strip the resulting list of any ']' chars:
[p.strip(']') for p in s.split('[')]
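For the sample string, that comprehension gives (a quick check):
s = "abcd[00451.00]"
parts = [p.strip(']') for p in s.split('[')]
print(parts)      # ['abcd', '00451.00']
print(parts[-1])  # 00451.00 -- the bracketed number is the last element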

Strip in Python

I have a question regarding strip() in Python. I am trying to strip a semicolon from a string. I know how to do this when the semicolon is at the end of the string, but how would I do it if it is not the last character but, say, the second-to-last?
eg:
1;2;3;4;\n
I would like to strip that last semi-colon.
Strip the other characters as well.
>>> '1;2;3;4;\n'.strip('\n;')
'1;2;3;4'
>>> "".join("1;2;3;4;\n".rpartition(";")[::2])
'1;2;3;4\n'
How about replace?
string1='1;2;3;4;\n'
string2=string1.replace(";\n","\n")
>>> string = "1;2;3;4;\n"
>>> string.strip().strip(";")
"1;2;3;4"
This will first strip any leading or trailing white space, and then remove any leading or trailing semicolon.
Try this:
def remove_last(string):
    index = string.rfind(';')
    if index == -1:
        # Semi-colon doesn't exist
        return string
    return string[:index] + string[index+1:]
This should be able to remove the last semicolon of the line, regardless of what characters come after it.
>>> remove_last('Test')
'Test'
>>> remove_last('Test;abc')
'Testabc'
>>> remove_last(';test;abc;foobar;\n')
';test;abc;foobar\n'
>>> remove_last(';asdf;asdf;asdf;asdf')
';asdf;asdf;asdfasdf'
The other answers provided are probably faster since they're tailored to your specific example, but this one is a bit more flexible.
You could split the string on semicolons and then join the non-empty parts back together using ';' as the separator (note that this also drops the trailing newline):
parts = '1;2;3;4;\n'.split(';')
non_empty_parts = []
for s in parts:
    if s.strip() != "": non_empty_parts.append(s.strip())
print ';'.join(non_empty_parts)
If you only want to use the strip function, this is one method:
Using slice notation, you can limit the strip() function's scope to one part of the string and append the "\n" at the end:
# create a var for later
str = "1;2;3;4;\n"
# format and assign to newstr
newstr = str[:8].strip(';') + str[8:]
Using the rfind() method (similar to Micheal0x2a's solution), you can make the statement applicable to many strings:
# create a var for later
str = "1;2;3;4;\n"
# format and assign to newstr
newstr = str[:str.rfind(';') + 1 ].strip(';') + str[str.rfind(';') + 1:]
re.sub(r';(\W*$)', r'\1', '1;2;3;4;\n') -> '1;2;3;4\n'
