Split a string on two patterns using a regex in Python - python

Given two file paths
Z:\home\user\dfolder\NO,AG,GK.jpg
Z:\home\user\dfolder\NI,DG,BJ (1).jpg
The objective is to split each file name and store the parts in a dict.
Currently, I first split each path using os.path.split to get a list of file names, s
s=['NO,AG,GK.jpg','NI,DG,BJ (1).jpg']
and then iteratively split each string as below
all_dic=[]
for ds in s:
    k=ds.split(",")
    kk=k[-1].split('.jpg')[0].split("(")[0] if bool(re.search(r'\(\d+\)', ds)) else k[-1].split('.jpg')[0]
    nval={"f":k[0],"s":k[1],"t":kk}
    all_dic.append(nval)
But I am curious whether there is a regex approach, or any one-liner.

One-liner using a regex plus a dict comprehension:
import re
s = ['NO,AG,GK.jpg', 'NI,DG,BJ (1).jpg']
keys = ['f', 's', 't']
all_dic = [{keys[k]: x for k, x in enumerate(
    re.sub(r"(\s\(\d+\))?(\.jpg)?", "", item).split(','))} for item in s]
print(all_dic)
->
[{'f': 'NO', 's': 'AG', 't': 'GK'}, {'f': 'NI', 's': 'DG', 't': 'BJ'}]

Well, I think this is the easiest way to get the same output without using the split() function.
The regular expression takes only the letters and puts them in a list, so we don't even have to split the string or remove the (1) from it.
import re
s=['NO,AG,GK.jpg','NI,DG,BJ (1).jpg']
all_dic = []
for ds in s:
    regex = '[a-zA-Z]+'
    k = re.findall(regex,ds) # We extract all the matches (as a list)
    nval={'f':k[0],'s':k[1],'t':k[2]} # We create the dictionary
    all_dic.append(nval) # We append the dictionary to the list
print(all_dic)
# Output: [{'f': 'NO', 's': 'AG', 't': 'GK'}, {'f': 'NI', 's': 'DG', 't': 'BJ'}]
Also, you have the file extension in k[3], just in case you need it.
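If you want to start from the full paths rather than from the list s, the two ideas above can be combined. The following is only a sketch: the names NAME_RE and parse_path are made up for illustration, and it assumes that every file name has exactly three comma-separated parts, an optional " (n)" suffix and a .jpg extension. PureWindowsPath is used so that the backslash-separated paths are split the same way on any operating system.

```python
import re
from pathlib import PureWindowsPath

# Hypothetical pattern: three comma-separated parts, an optional " (n)"
# duplicate marker, then the .jpg extension.
NAME_RE = re.compile(
    r'(?P<f>[^,]+),(?P<s>[^,]+),(?P<t>[^,.(]+?)\s*(?:\(\d+\))?\.jpg$'
)

def parse_path(path):
    """Return {'f': ..., 's': ..., 't': ...} for a path, or None if it does not match."""
    name = PureWindowsPath(path).name  # file name without the folders
    match = NAME_RE.match(name)
    return match.groupdict() if match else None

paths = [
    r'Z:\home\user\dfolder\NO,AG,GK.jpg',
    r'Z:\home\user\dfolder\NI,DG,BJ (1).jpg',
]
all_dic = [parse_path(p) for p in paths]
print(all_dic)
# [{'f': 'NO', 's': 'AG', 't': 'GK'}, {'f': 'NI', 's': 'DG', 't': 'BJ'}]
```

Unlike the '[a-zA-Z]+' approach, this keeps digits or other characters inside a part, and returns None for names that do not fit the assumed layout instead of raising an IndexError.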

Related

Text manipulation to form an equation

a=0.77 ,b=0.2 ,c=0.20, d=0.79 ,z=(c+d), e=(z*a) ,output=(z+e)
I have a text file like the one above. I need parser logic that will produce an equation like
output=(0.20+0.79)+((0.20+0.79)*a). What are some efficient ways to do it? Are there any libraries? Thank you!
A primitive method is to work with strings and use replace().
First, use split(',') to convert the string to a list
['a=0.77 ', 'b=0.2 ', 'c=0.20', ' d=0.79 ', 'z=(c+d)', ' e=(z*a) ', 'output=(z+e)']
Next, use .strip() to remove spaces from the beginning and end of each element.
Next, use .split('=') on every element to create nested lists.
[['a', '0.77'], ['b', '0.2'], ['c', '0.20'], ['d', '0.79'], ['z', '(c+d)'], ['e', '(z*a)'], ['output', '(z+e)']]
Next, use dict() to create a dictionary.
{'a': '0.77',
'b': '0.2',
'c': '0.20',
'd': '0.79',
'e': '(z*a)',
'output': '(z+e)',
'z': '(c+d)'}
Now you can take the first item, 'a': '0.77', and run .replace('a', '0.77') on the other items in the dictionary, then repeat this with the other values from the dictionary.
So finally you get the dictionary
{'a': '0.77',
'b': '0.2',
'c': '0.20',
'd': '0.79',
'e': '((0.20+0.79)*0.77)',
'output': '((0.20+0.79)+((0.20+0.79)*0.77))',
'z': '(0.20+0.79)'}
and output holds the string ((0.20+0.79)+((0.20+0.79)*0.77))
import pprint
text = 'a=0.77 ,b=0.2 ,c=0.20, d=0.79 ,z=(c+d), e=(z*a) ,output=(z+e)'
parts = text.split(',') # create list
#print(parts)
parts = [item.strip() for item in parts] # remove spaces
#print(parts)
parts = [item.split('=') for item in parts] # create tuples
#print(parts)
parts = dict(parts) # create dict
#print(parts)
pprint.pprint(parts)
for key1, val1 in parts.items():
    for key2, val2 in parts.items():
        parts[key2] = parts[key2].replace(key1, val1)
pprint.pprint(parts)
print('output:', parts['output'])
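One caveat with plain replace() is that it also hits a name that happens to occur inside a longer name (for example 'a' inside a variable called 'area'). A possible variation, shown here only as a sketch, substitutes whole identifiers with a regex instead. The names NAME_RE and expand are made up for this example, and the code assumes that definitions do not refer to each other in a cycle (otherwise the recursion never ends).

```python
import re

# Matches whole identifiers such as a, z or output, but not digits in numbers.
NAME_RE = re.compile(r'\b[A-Za-z_]\w*\b')

def expand(text):
    """Parse 'name=expr' pairs and substitute every known name recursively."""
    pairs = (item.split('=', 1) for item in text.split(','))
    exprs = {name.strip(): expr.strip() for name, expr in pairs}

    def resolve(expr):
        def substitute(match):
            name = match.group()
            # Unknown names are left untouched.
            return resolve(exprs[name]) if name in exprs else name
        return NAME_RE.sub(substitute, expr)

    return {name: resolve(expr) for name, expr in exprs.items()}

text = 'a=0.77 ,b=0.2 ,c=0.20, d=0.79 ,z=(c+d), e=(z*a) ,output=(z+e)'
print(expand(text)['output'])
# ((0.20+0.79)+((0.20+0.79)*0.77))
```

Because each name is resolved on demand, the result does not depend on the order in which the definitions appear in the file.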

python - check if word is in list full of strings and if there is print the words in an another list

So basically it would be like:
MyList=["Monkey","Phone","Metro","Boom","Feet"]
and let's say the input is m, so Monkey, Metro and Boom would be put in a list like so
output >> ["Monkey","Metro","Boom"]
and if the input were f, then the output would be
output >> ["Feet"]
My question is: how would I put this in a def? This is what I came up with
def Find(word,MyList):
    MyList2=[]
    count=0
    for i in MyList:
        count+=1
        if i[count] == MyList2: ##(at first i did if i[0:10])
            for x in range(1):
                MyList2.append(i)
    print(MyList2)
and then somewhere there should be
word=input("Word, please.")
and then
Find(word,MyList)
thanks in advance!
Try this:
def find_words(input_char, my_list):
    ret_list = []
    for i in my_list:
        if input_char.lower() in i.lower():
            ret_list.append(i)
    return ret_list
MyList=["Monkey","Phone","Metro","Boom","Feet"]
input_char=input("Input a character :").strip() # get a character and strip spaces if any.
print(find_words(input_char, MyList)) # call the function with arguments and print the result
Output for sample input M:
Input a character :M
['Monkey', 'Metro', 'Boom']
(Almost) One liner:
>>> MyList=["Monkey","Phone","Metro","Boom","Feet"]
>>> target = input("Input string: ")
Input string: Ph
>>> print([i for i in MyList if target.lower() in i.lower()])
['Phone']
Generally in Python you don't want to be playing with indexes; iterating directly over the items is the way to go.
The in keyword checks for substrings, so it will work whether you provide only one character or a full string (i.e. if you input Ph you'll get a list containing only Phone).
Depending on how efficient you want your search to be, here is one more approach: build a dictionary like this
from collections import defaultdict
d = defaultdict(set)
for i in MyList:
    chars = set(i)
    for c in chars:
        d[c].add(i)
Now, your dictionary looks like this
defaultdict(set,
{'o': {'Boom', 'Metro', 'Monkey', 'Phone'},
'k': {'Monkey'},
'e': {'Feet', 'Metro', 'Monkey', 'Phone'},
'M': {'Metro', 'Monkey'},
'y': {'Monkey'},
'n': {'Monkey', 'Phone'},
'h': {'Phone'},
'P': {'Phone'},
't': {'Feet', 'Metro'},
'r': {'Metro'},
'm': {'Boom'},
'B': {'Boom'},
'F': {'Feet'}})
Now you can look up a character in your dict in O(1) time on average (note that the keys are case-sensitive, so 'm' and 'M' give different results)
d[your_input_char]
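Since the question asks for a case-insensitive match (m should find Monkey, Metro and Boom), the index can be built on lower-cased characters. This is only a sketch of that variation; build_index and lookup are hypothetical helper names, and it only supports single-character lookups, not substrings such as Ph.

```python
from collections import defaultdict

def build_index(words):
    """Map each lower-cased character to the set of words containing it."""
    index = defaultdict(set)
    for word in words:
        for ch in set(word.lower()):
            index[ch].add(word)
    return index

def lookup(index, char):
    """Return the matching words in sorted order; unknown characters give []."""
    return sorted(index.get(char.lower(), ()))

MyList = ["Monkey", "Phone", "Metro", "Boom", "Feet"]
index = build_index(MyList)
print(lookup(index, "M"))  # ['Boom', 'Metro', 'Monkey']
print(lookup(index, "f"))  # ['Feet']
```

Building the index costs one pass over all characters, so it only pays off when the same list is searched many times.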
Here is how you can use a list comprehension:
def Find(letter, MyList):
    print([word for word in MyList if letter.lower() in word.lower()])
Find('m', ["Monkey","Phone","Metro","Boom","Feet"])
Output:
['Monkey', 'Metro', 'Boom']

count words from list in another list in entry one

Hi,
I want to count how often the phrases given in one list occur in the first entry (position zero) of another list.
list_given_atoms= ['C', 'Cl', 'Br']
list_of_molecules= ['C(B2Br)[Cl{H]Cl}P' ,'NAME']
When Python finds a match, it should be saved in a dictionary like
countdict = [ 'Cl : 2', 'C : 1', 'Br : 1']
I tried
re.findall(r'\w+', list_of_molecules[0])
already, but that results in words like "B2Br", which is definitely not what I want.
Can someone help me?
[a-zA-Z]+ should be used instead of \w+ because \w+ will match both letters and numbers, while you are just looking for letters:
import re
list_given_atoms= ['C', 'Cl', 'Br']
list_of_molecules= ['C(B2Br)[Cl{H]Cl}P' ,'NAME']
molecules = re.findall('[a-zA-Z]+', list_of_molecules[0])
final_data = {i:molecules.count(i) for i in list_given_atoms}
Output:
{'C': 1, 'Br': 1, 'Cl': 2}
You could use something like this:
>>> import re
>>> from collections import Counter
>>> Counter(re.findall('|'.join(sorted(list_given_atoms, key=len, reverse=True)), list_of_molecules[0]))
Counter({'Cl': 2, 'C': 1, 'Br': 1})
You have to sort the elements by their length, so 'Cl' matches before 'C'.
Short re.findall() solution:
import re
list_given_atoms = ['C', 'Cl', 'Br']
list_of_molecules = ['C(B2Br)[Cl{H]Cl}P' ,'NAME']
d = { a: len(re.findall(r'' + a + '(?=[^a-z]|$)', list_of_molecules[0], re.I))
      for a in list_given_atoms }
print(d)
The output:
{'C': 1, 'Cl': 2, 'Br': 1}
I tried your solutions and figured out that there can also be several C atoms one after another. So I came up with this:
counter_dict = {}
for element in re.findall(r'([A-Z])([a-z])?', list_of_molecules[0]):
    if element[1].islower():
        counter = element[0] + element[1]
        if not (counter in counter_dict):
            counter_dict[counter] = 1
        else:
            counter_dict[counter] += 1
In the same way I checked for elements with just one uppercase letter and added them to the dictionary. There is probably a better way.
You can't use \w, as a word character is equivalent to:
[a-zA-Z0-9_]
which clearly includes digits, so "B2Br" is matched.
You also can't just use the regex:
[a-zA-Z]+
as that would produce one atom for something like "CO2", which should produce 2 separate atoms: C and O.
However, the regex I came up with (regex101) just checks for a capital letter followed by an optional lower-case letter (between 0 and 1 occurrences).
Here it is:
[A-Z][a-z]{0,1}
and it will correctly produce the atoms.
So to incorporate this into your original lists of:
list_given_atoms= ['C', 'Cl', 'Br']
list_of_molecules= ['C(B2Br)[Cl{H]Cl}P' ,'NAME']
we want to first find all the atoms in list_of_molecules and then create a dictionary of the counts of the atoms in list_given_atoms.
So to find all the atoms, we can use re.findall on the first element in the molecules list:
atoms = re.findall("[A-Z][a-z]{0,1}", list_of_molecules[0])
which gives a list:
['C', 'B', 'Br', 'Cl', 'H', 'Cl', 'P']
then, to get the counts in a dictionary, we can use a dictionary-comprehension:
counts = {a: atoms.count(a) for a in list_given_atoms}
which gives the desired result of:
{'Cl': 2, 'C': 1, 'Br': 1}
This also works when we have molecules like CO2, etc.
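Putting the steps above together, one possible sketch wraps the atom regex and the counting in a single function. The names ATOM_RE and count_atoms are invented for this example, and collections.Counter is swapped in for the repeated list.count() calls so that the molecule is scanned only once.

```python
import re
from collections import Counter

# One capital letter followed by an optional lower-case letter.
ATOM_RE = re.compile(r"[A-Z][a-z]?")

def count_atoms(molecule, wanted):
    """Count how often each atom in `wanted` occurs in the molecule string."""
    found = Counter(ATOM_RE.findall(molecule))
    # A Counter returns 0 for atoms that never occurred.
    return {atom: found[atom] for atom in wanted}

list_given_atoms = ['C', 'Cl', 'Br']
list_of_molecules = ['C(B2Br)[Cl{H]Cl}P', 'NAME']
print(count_atoms(list_of_molecules[0], list_given_atoms))
# {'C': 1, 'Cl': 2, 'Br': 1}
print(count_atoms('CCO2', ['C', 'O']))
# {'C': 2, 'O': 1}
```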

Python parse issue

I need to do a sort of reverse .format() to a string like
a = "01AA12345AB12345AABBCCDDEE".reverseformat("{id:2d}{type:2s}{a:3d}{b:4s}{c:5d}{d:2s}")
print a
>>>> {'id':1, 'type':'aa', 'a':'123', 'b':'45AB', 'c':'12345', 'd':'AA'}
I found this lib (parse) that does almost what I need; the problem is that it gives me this result
msg = parse.parse("{id:2d}{type:3S}{n:5S}", "01D1dddffffffff")
print msg.named
>>>>{'type': 'D1dddfffffff', 'id': 1, 'n': 'f'}
and not
{'id':1, 'type':'D1d', 'n':'ddfffff'}
Does another lib/method/whatever exist that can "unpack" a string to a dict?
EDIT: Just to clarify, I already tried the w and D format specifications for strings
Is there any reason you can't just slice it like a normal string if your format is always the same?
s = "01D1dddffffffff"
id = s[:2]
type = s[2:5]
n = s[5:]
Which gives id, type, and n as:
01
D1d
ddffffffff
And it's trivial to convert this into a dictionary from there if that's what you need. If your parsing doesn't need to be dynamic (which it doesn't seem to be from your question in its current state), then it's easy enough to wrap the slicing in a function that extracts all of the values.
This also has the advantage that from the slice it's clear how many characters and what position in the string you're extracting, but in the parse formatter the positions are all relative (i.e. finding which characters n extracts means counting how many characters id and type consume).
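As a rough illustration of wrapping the slicing in a function, the sketch below walks over a list of (name, width, converter) tuples. The function name unpack_fixed and the layout of the fields list are assumptions made for this example; the widths and types are taken from the format string in the question.

```python
def unpack_fixed(s, fields):
    """Slice s into consecutive fixed-width fields and convert each one.

    fields is a list of (name, width, converter) tuples.
    """
    result = {}
    pos = 0
    for name, width, convert in fields:
        result[name] = convert(s[pos:pos + width])
        pos += width
    return result

# Mirrors {id:2d}{type:2s}{a:3d}{b:4s}{c:5d}{d:2s} from the question.
fields = [
    ("id", 2, int), ("type", 2, str), ("a", 3, int),
    ("b", 4, str), ("c", 5, int), ("d", 2, str),
]
print(unpack_fixed("01AA12345AB12345AABBCCDDEE", fields))
# {'id': 1, 'type': 'AA', 'a': 123, 'b': '45AB', 'c': 12345, 'd': 'AA'}
```

Any characters after the last field are simply ignored, which matches what the slicing approach does when you only take the parts you need.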
You can use regular expressions to do what you want here.
import re
a = "01AA12345AB12345AABBCCDDEE"
expr = re.compile(r"""
    (?P<id>.{2})    # id:2d
    (?P<type>.{2})  # type:2s
    (?P<a>.{3})     # a:3d
    (?P<b>.{4})     # b:4s
    (?P<c>.{5})     # c:5d
    (?P<d>.{2})     # d:2s""", re.X)
expr.match(a).groupdict()
# {'id': '01', 'b': '45AB', 'c': '12345', 'd': 'AA', 'a': '123', 'type': 'AA'}
You could even make a function that does this.
def unformat(s, formatting_str):
    typingdict = {'s': str, 'f': float, 'd':int} # are there any more?
    name_to_type = {}
    groups = re.findall(r"{([^}]*)}", formatting_str)
    expr_str = ""
    for group in groups:
        name, formatspec = group.split(":")
        length, type_ = formatspec[:-1], typingdict.get(formatspec[-1], str)
        expr_str += "(?P<{name}>.{{{length}}})".format(name=name, length=length)
        name_to_type[name] = type_
    g = re.match(expr_str, s).groupdict()
    for k,v in g.items():
        g[k] = name_to_type[k](v)
    return g
Then calling like...
>>> a
'01AA12345AB12345AABBCCDDEE'
>>> result = unformat(a, "{id:2d}{type:2s}{a:3d}{b:4s}{c:5d}{d:2s}")
>>> result
{'id': 1, 'b': '45AB', 'c': 12345, 'd': 'AA', 'a': 123, 'type': 'AA'}
However I hope you can see how incredibly ugly this is. Don't do this -- just use string slicing.

building a boolean dictionary from 2 lists in python

I'm trying to make a dictionary with values 'True' or 'False' when comparing elements in 2 lists. This is probably a bit basic, but I'm new to coding and I don't understand why it always assigns the 'True' value even though I can see it's not true:
letters = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
randomData = []
f = open('randomData.txt', 'r')
for line in f:
    randomData.append(line.rstrip().split()[0])
f.close()
The 'randomData.txt' file looks like:
A'\t'0003'\t'0025'\t'chr1
B'\t'0011'\t'0021'\t'chr7
D'\t'0043'\t'0068'\t'chr3
F'\t'0101'\t'0119'\t'chr7
The randomData list should now look like:
['A','B','D','F']
I tried:
sameLetters = {}
i=0
while i < len(letters):
    if letters[i] and randomData:
        #append to dictionary
        sameLetters[letters[i]] = 'True'
    else:
        #append to dictionary
        sameLetters[letters[i]] = 'False'
    i=i+1
print(sameLetters)
I expected something like:
{'A': 'True', 'B': 'True', 'C': 'False', 'D': 'True', 'E': 'False', 'F': 'True', 'G': 'False', etc
Instead all values in the dictionary are 'True'. Can anyone see the problem? Or give any pointers or explanations? Any help would be great, many thanks.
Perhaps you meant if letters[i] in randomData
I think you want to do something like:
sameLetters = {l: l in randomData for l in letters}
Your current attempt doesn't work because you check
if letters[i] and randomData:
#             ^ should be in
and Python treats both non-empty strings (letters[i]) and non-empty lists (randomData) as truthy, so the condition is always satisfied.
Also, note that letters is already available in Python:
from string import ascii_uppercase
This is a string, but you can iterate through and index a string just like a list, and in will still work.
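To make the difference between and and in concrete, here is a small sketch. The list contents are assumed to be the four letters shown in the question, and the variable names random_data, always_true and same_letters are chosen for this example only.

```python
from string import ascii_uppercase

random_data = ['A', 'B', 'D', 'F']  # assumed contents, as listed in the question

# `x and y` only tests truthiness: every non-empty string combined with a
# non-empty list passes, whatever the letter is.
always_true = all(bool(letter and random_data) for letter in ascii_uppercase)

# `in` tests membership; converting to a set makes each lookup cheap.
seen = set(random_data)
same_letters = {letter: letter in seen for letter in ascii_uppercase}

print(always_true)        # True
print(same_letters['A'])  # True
print(same_letters['C'])  # False
```

Note that this stores real booleans rather than the strings 'True' and 'False'; use str(letter in seen) if the strings are really needed.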
Seems like you only care about which letter appears in your random data, so why not use a set?
from string import ascii_uppercase
randomData = ['A', 'B', 'D', 'F', 'A']
appeared = set(ascii_uppercase).intersection(set(randomData))
print(appeared)
And later you can use it like this:
char = 'z'
if char in appeared:
    print('yes')
else:
    print('no')
EDIT:
Then how about this:)
from string import ascii_uppercase
randomData = ['A', 'B', 'D', 'F', 'A']
appeared = set(ascii_uppercase).intersection(set(randomData))
d = dict(zip(ascii_uppercase, (False,) * 26))
for key in appeared:
    d[key] = True
print(d)
