Python find text in string - python

I have the following string for which I want to extract data:
text_example = '\nExample text \nTECHNICAL PARTICULARS\nLength oa: ...............189.9m\nLength bp: ........176m\nBreadth moulded: .......26.4m\nDepth moulded to main deck: ....9.2m\n'
Every variable I want to extract starts after a \n.
The value I want to get follows a colon ':' and more than one dot.
When a line doesn't have a colon followed by dots, I don't want to extract that value.
For example my preferred output looks like:
LOA = 189.9
LBP = 176.0
BM = 26.4
DM = 9.2

import re
text_example = '\nExample text \nTECHNICAL PARTICULARS\nLength oa: ...............189.9m\nLength bp: ........176m\nBreadth moulded: .......26.4m\nDepth moulded to main deck: ....9.2m\n'
# capture all the characters BEFORE the ':' character
variables = re.findall(r'(.*?):', text_example)
# matches all floats and integers (does not account for minus signs)
values = re.findall(r'(\d+(?:\.\d+)?)', text_example)
# zip into a dictionary (this assumes both regexes return the same number of results)
result = dict(zip(variables, values))
print(result)
--> {'Length oa': '189.9', 'Breadth moulded': '26.4', 'Length bp': '176', 'Depth moulded to main deck': '9.2'}
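The two separate findall calls above do not enforce the "colon followed by more than one dot" rule from the question. A minimal sketch of a single pattern that does, assuming every wanted line has the shape name, colon, dots, number (the names pattern and result are only illustrative):

```python
import re

text_example = ('\nExample text \nTECHNICAL PARTICULARS\n'
                'Length oa: ...............189.9m\nLength bp: ........176m\n'
                'Breadth moulded: .......26.4m\n'
                'Depth moulded to main deck: ....9.2m\n')

# ^ anchors at each line start (re.MULTILINE); \.{2,} requires at least two dots
pattern = re.compile(r'^([^:\n]+):\s*\.{2,}\s*(\d+(?:\.\d+)?)', re.MULTILINE)

# build {name: float(value)} from the (name, value) tuples
result = {name.strip(): float(value) for name, value in pattern.findall(text_example)}
print(result)
# {'Length oa': 189.9, 'Length bp': 176.0, 'Breadth moulded': 26.4, 'Depth moulded to main deck': 9.2}
```

Lines without a colon, such as 'Example text', never match, so the names and values cannot get out of step the way two independent lists can.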

You can build a single regex that captures the name, the colon-and-dots separator and the value, then pick the match you need:
re.findall(r'(\\n|\n)([A-Za-z\s]*)(?:(\:\s*\.+))(\d*\.*\d*)',text_example)[2]
('\n', 'Breadth moulded', ': .......', '26.4')

Related

Need some help on extracting particular string using string manipulations with/without regex

I have an OCR program (not so accurate though) that outputs a string. I append it to a list. So, my ss list looks like this:
ss = [
'성 벼 | 5 번YAO LIAO거 CHINA P R체류자격 결혼이민F-1)말급일자', # 'YAO LIAO'
'성 별 F 등록번호명 JAO HALJUNGCHINA P R격 결혼이민(F-6)밥급인자', # 'JAO HALJUNG'
'성 별 F명 CHENG HAIJING국 가 CHINA P R 역체 가차격 결혼이민(C-4) 박급인자', # 'CHENG HAIJING'
'KOa MDOVUD TAREEQ SAID HAFIZULLAH TURKIYE움첫;자격 거주(F-2)발급일자', # 'DOVUD TAREEQ SAID HAFIZULLAH'
'KOn 별 MDOVUD TAREEQ SAID- IIAFIZULLAH 감 TURKIYE동체나자격 거주F-2) 발급일자', # 'DOVUD TAREEQ SAID- IIAFIZULLAH'
'등록번호IN" 성 별 M명 TAREEQ SAD IIAFIZULLAH 값 TURKIYE8체주자격 거주-2)발급일자' # 'TAREEQ SAD IIAFIZULLAH'
]
I need to find some way to at least remove country names, or even better solution would be to extract clean full names as shown as comments above.
Here, the ss list stores the worst outputs, so if I can handle all 6 strings here with one universal solution, I hope the rest will be easier.
So far, I could think of looping through each element to extract upper English-only letters and filter out empty strings and any string whose len is less than 2, because I am assuming name consists of at least 2 letters:
new_string_list = []
for s in ss:
    # keep upper-case ASCII letters (A-Z), replace everything else with a space
    eng_parts = ''.join([i if 64 < ord(i) < 91 else ' ' for i in s])
    #print("English-only strings: {}".format(eng_parts))
    new_string = ''
    spaced_string_list = eng_parts.split(" ")
    for spaced_string in spaced_string_list:
        # drop empty strings and single letters
        if len(spaced_string) >= 2:
            new_string += spaced_string + " "
    new_string_list.append(new_string)
where new_string_list is ['YAO LIAO CHINA ', 'JAO HALJUNGCHINA ', 'CHENG HAIJING CHINA ', 'KO MDOVUD TAREEQ SAID HAFIZULLAH TURKIYE ', 'KO MDOVUD TAREEQ SAID IIAFIZULLAH TURKIYE ', 'IN TAREEQ SAD IIAFIZULLAH TURKIYE ']
Could this result be improved further?
EDIT:
The desired name string could be of up to 5 space-separated substrings. Also, a part of the name string is at least two English-only upper letters. In some cases, a name substring could be separated by a - (refer to SAID- case) if it reaches the end of the ID card, where initially the whole string got extracted from.
It is a good idea to assume that a name always consists of at least two upper-case words of Latin characters separated by one or more spaces.
So you can loop through the elements and look for that pattern. The re module is the tool to use =):
import re
for el in ss:
    m = re.search(r'[A-Z]{2,}(\s+[A-Z\-]{2,})+', el)
    if m:
        print(m.group())
YAO LIAO
JAO HALJUNGCHINA
CHENG HAIJING
MDOVUD TAREEQ SAID HAFIZULLAH TURKIYE
MDOVUD TAREEQ SAID- IIAFIZULLAH
TAREEQ SAD IIAFIZULLAH
Let's examine the pattern in detail:
[A-Z]{2,} searches for a run of two or more upper-case Latin characters. The square brackets indicate a character range and the curly brackets a repetition count.
\s+ looks for one or more (+) whitespace characters (\s).
Add special characters to the set of allowed characters if necessary. Note that a dash, for example, needs to be escaped (\-) because it otherwise signifies a range.
Group parts of the pattern to make them repeatable: ( )+
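The output above still carries a trailing country name in two cases. A rough sketch of one way to strip it, assuming you can supply a list of country names that appear on the cards (COUNTRIES and extract_name are hypothetical names, and the list here only covers the two countries in the sample):

```python
import re

# Assumption: the set of country names that can follow a name is known in advance
COUNTRIES = ("CHINA", "TURKIYE")

# Same pattern as above, with a non-capturing group
NAME_RE = re.compile(r'[A-Z]{2,}(?:\s+[A-Z\-]{2,})+')

def extract_name(text):
    """Return the first name-like match with a trailing country removed, or None."""
    m = NAME_RE.search(text)
    if not m:
        return None
    name = m.group()
    for country in COUNTRIES:
        if name.endswith(country):
            name = name[:-len(country)]
    return name.strip()

print(extract_name('F JAO HALJUNGCHINA P R'))                   # JAO HALJUNG
print(extract_name('KOa MDOVUD TAREEQ SAID HAFIZULLAH TURKIYE'))  # MDOVUD TAREEQ SAID HAFIZULLAH
```

This does not remove the stray leading M in MDOVUD; that letter is fused to the name by the OCR and cannot be separated by the pattern alone.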

Replacing characters with some restrictions

I have a list of strings in Python and I want to create a function that perform many replacements. Some of them are:
Replace "o agrícola" with "trabajo agrícola"
Replace "ob agricola" with "trabajo agrícola"
One solution that comes to my mind is:
text = str(text).replace('o agricola','trabajo agrícola')
text = str(text).replace('t agricola','trabajo agrícola')
One of the issues:
text = 'obrero agricola' is transformed into 'obrertrabajo agrícola'
Is there a solution that respects these two conditions?
'obrero agricola' is mapped into 'obrero agrícola'
'o something' is mapped into 'o something' (it fixes 'o' if it's mixed with other words)
Use word boundaries along with re.sub:
import re
text = 'ob agricola and obrero agricola'
text = re.sub(r'\b(?:o agrícola|ob agricola)', 'trabajo agrícola', text)
print(text) # trabajo agrícola and obrero agricola
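As a sketch of how this could be wrapped into a reusable function: the version below assumes the fragments to fix are "o", "ob" and "t" (taken from the question's list and code), accepts "agricola" with or without the accent, and adds a trailing \b as well. PATTERN and fix_text are illustrative names, not part of the original answer.

```python
import re

# \b on both sides so the fragment must be a whole word;
# agr[ií]cola accepts both the accented and unaccented spelling
PATTERN = re.compile(r'\b(?:o|ob|t) agr[ií]cola\b')

def fix_text(text):
    """Replace the abbreviated forms with 'trabajo agrícola'."""
    return PATTERN.sub('trabajo agrícola', str(text))

print(fix_text('o agricola'))       # trabajo agrícola
print(fix_text('obrero agricola'))  # obrero agricola
print(fix_text('o something'))      # o something
```

Because of the leading \b, the final "o" of "obrero" is never treated as a standalone word, which avoids the 'obrertrabajo agrícola' result from the question.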

How to extract set of substrings from a paragraph of string

Say I have a string:
output='[{ "id":"b678792277461" ,"Responses":{"SUCCESS":{"sh xyz":"sh xyz\\n Name Age Height Weight\\n Ana \\u003c15 \\u003e 163 47\\n 43\\n DEB \\u003c23 \\u003e 155 \\n Grey \\u003c53 \\u003e 143 54\\n 63\\n Sch#"},"FAILURE":{},"BLACKLISTED":{}}}]'
This is just an example but I have much longer output which is response from an api call.
I want to extract all names (Ana, DEB, Grey) and put them in a separate list.
How can I do it?
json_data = json.loads(output)
json_data = [{'id': 'b678792277461', 'Responses': {'SUCCESS': {'sh xyz': 'sh xyz\n Name Age Height Weight\n Ana <15 > 163 47\n 43\n DEB <23 > 155 \n Grey <53 > 143 54\n 63\n Sch#'}, 'FAILURE': {}, 'BLACKLISTED': {}}}]
1) I have tried re.findall('\\n(.+)\\u',output)
but this didn't work because it says "incomplete sequence u"
2)
start = output.find('\\n')
end = output.find('\\u', start)
x=output[start:end]
But I couldn't figure out how to run this piece of code in loop to extract names
Thanks
\u is not a character on its own, so it cannot be matched; it is the start of a Unicode escape sequence. The following regex works on the decoded JSON, but it is kind of quirky. It looks at the beginning of each line, except for the first one, and captures the letters up to the first space.
output = json_data[0]['Responses']['SUCCESS']['sh xyz']
pattern = r"\n\s*([a-z]+)\s+"
result = re.findall(pattern, output, re.M | re.I)
#['Name', 'Ana', 'DEB', 'Grey']
Explanation of the pattern:
start at a new line (\n)
skip all spaces, if any (\s*)
collect one or more letters ([a-z]+)
skip at least one space (\s+)
Unfortunately, "Name" is also recognized as a name. If you know that it is always present in the first line, slice the list of the results:
result[1:]
#['Ana', 'DEB', 'Grey']
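A self-contained sketch of the same idea that avoids slicing off the header: it assumes that every data row has a '<' right after the name, as in the sample, which the header row does not. The response string is shortened from the question and the variable names are illustrative.

```python
import json
import re

# Raw string: the backslash escapes are left for the JSON parser to decode
output = r'[{"id":"b678792277461","Responses":{"SUCCESS":{"sh xyz":"sh xyz\n Name Age Height Weight\n Ana \u003c15 \u003e 163 47\n 43\n DEB \u003c23 \u003e 155 \n Grey \u003c53 \u003e 143 54\n 63\n Sch#"},"FAILURE":{},"BLACKLISTED":{}}}]'

# json.loads turns \n into real newlines and \u003c into '<'
table = json.loads(output)[0]["Responses"]["SUCCESS"]["sh xyz"]

# First word of a line, but only when it is followed by '<' (assumption about the row layout)
names = re.findall(r"^\s*([A-Za-z]+)\s+<", table, flags=re.MULTILINE)
print(names)  # ['Ana', 'DEB', 'Grey']
```

Decoding the JSON first is what makes the original \u problem disappear: after json.loads there are no backslash sequences left to match.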
I use regexr.com and play around with the regular expression until I get it right, and then convert that into Python.
https://regexr.com/
I'm assuming the \n is the newline character here, and I'd bet your \u error comes from \u being read as the start of a Unicode escape in the pattern. By default . does not match a newline in Python; to let it match across lines, pass the re.DOTALL flag when you compile (re.MULTILINE only changes how ^ and $ behave).
\n(.*)\n - with re.DOTALL this is greedy and grabs as much as possible (in the example it would grab everything from \nAna through 54\n):
[{ "id":"678792277461" ,"Responses": {Name Age Height Weight\n Ana \u00315 \u003163 47\n 43\n Deb \u00323 \u003155 60 \n Grey \u00353 \u003144 54\n }]
import re
# source is assumed to hold the text above, with real newline characters
a = re.compile(r"\n(.*)\n", re.DOTALL)
m = a.search(source)
if m:
    rows = m.group(1).split("\n")
    # rows[0] should be " Ana \u00315 \u003163 47"
    # the following entries hold the remaining rows (" 43", " Deb ...", etc.)

Python function to find similarity between differently formatted strings

I have 2 excel files with names of items. I want to compare the items but the only remotely similar column is the name column which too has different formatting of the names like
KIDS-Piano as kids piano
Butter Gel 100mg as Butter-Gel-100MG
I know it can't be 100% accurate so I would instead ask the human operating the code to make the final verification but how do I show the closest matching names?
One way of doing this is to write a regular expression.
But the vanilla code below might do the trick as well:
column_a = ["KIDS-Piano", "Butter Gel 100mg"]
column_b = ["kids piano", "Butter-Gel-100MG"]
new_column_a = []
for i in column_a:
    # convert strings into lowercase
    a = i.lower()
    # replace dashes with spaces
    a = a.replace('-', ' ')
    new_column_a.append(a)
# do the same for column b
new_column_b = []
for i in column_b:
    # convert strings into lowercase
    a = i.lower()
    # replace dashes with spaces
    a = a.replace('-', ' ')
    new_column_b.append(a)
as_not_found_in_b = []
for i in new_column_a:
    if i not in new_column_b:
        as_not_found_in_b.append(i)
bs_not_found_in_a = []
for i in new_column_b:
    if i not in new_column_a:
        bs_not_found_in_a.append(i)
# find the problematic ones and manually fix them
print(as_not_found_in_b)
print(bs_not_found_in_a)
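The code above only reports exact matches after normalization. To show the operator the closest candidates, as the question asks, one option is the standard-library difflib module. This is a swapped-in technique rather than part of the answer above; normalize and closest are hypothetical helper names and "Piano stool" is an invented extra item for illustration.

```python
import difflib
import re

def normalize(name):
    """Lowercase and collapse every run of non-alphanumeric characters into one space."""
    return re.sub(r'[^a-z0-9]+', ' ', name.lower()).strip()

column_a = ["KIDS-Piano", "Butter Gel 100mg"]
column_b = ["kids piano", "Butter-Gel-100MG", "Piano stool"]

# map each normalized name back to its original spelling
normalized_b = {normalize(b): b for b in column_b}

def closest(name, n=3, cutoff=0.6):
    """Return up to n names from column_b, best match first."""
    candidates = difflib.get_close_matches(
        normalize(name), list(normalized_b), n=n, cutoff=cutoff)
    return [normalized_b[c] for c in candidates]

for item in column_a:
    print(item, '->', closest(item))
```

Lowering cutoff shows more (and weaker) candidates; the human operator can then confirm the right one.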

Edit content of a list with split and find

I have a dictionary named dicionario1. I need to replace the content of dicionario1[chave][1], which is a list, with the list lista_atributos.
lista_atributos uses the content of dicionario1[chave][1] to build a list where:
All the information is separated by "," except when it finds the characters "(#" and ")". In this case, it should create a list with the content between those characters (also separated by ","). There can be one or more occurrences of '(#' and I need to handle every one of them.
Although this might be easy, I'm stuck with the following code:
dicionario1 = {'#998' : [['IFCPROPERTYSET'],["'0siSrBpkjDAOVD99BESZyg',#41,'Geometric Position',$,(#977,#762,#768,#754,#753,#980,#755,#759,#757)"]],
               '#1000' : [['IFCRELDEFINESBYPROPERTIES'],["'1dEWu40Ab8zuK7fuATUuvp',#41,$,$,(#973,#951),#998"]]}
for chave in dicionario1:
    lista_atributos = []
    ini = 0
    for i in dicionario1[chave][1][0][ini:]:
        if i == '(' and dicionario1[chave][1][0][dicionario1[chave][1][0].index(i) + 1] == '#':
            ini = dicionario1[chave][1][0].index(i) + 1
            fim = dicionario1[chave][1][0].index(')')
            lista_atributos.append(dicionario1[chave][1][0][:ini-2].split(','))
            lista_atributos.append(dicionario1[chave][1][0][ini:fim].split(','))
            lista_atributos.append(dicionario1[chave][1][0][fim+2:].split(','))
    print(lista_atributos)
Result:
[["'1dEWu40Ab8zuK7fuATUuvp'", '#41', '$', '$'], ['#973', '#951'], ['#998']]
[["'0siSrBpkjDAOVD99BESZyg'", '#41', "'Geometric Position'", '$'], ['#977', '#762', '#768', '#754', '#753', '#980', '#755', '#759', '#757'], ['']]
Unfortunately I can't figure out how to iterate over dicionario1[chave][1][0] to get this result:
[["'1dEWu40Ab8zuK7fuATUuvp'"], ['#41'], ['$'], ['$'], ['#973', '#951'], ['#998']]
[["'0siSrBpkjDAOVD99BESZyg'"], ['#41'], ["'Geometric Position'"], ['$'], ['#977', '#762', '#768', '#754', '#753', '#980', '#755', '#759', '#757']]
In other words, I need ["'1dEWu40Ab8zuK7fuATUuvp'", '#41', '$', '$'] in the result to turn into ["'1dEWu40Ab8zuK7fuATUuvp'"], ['#41'], ['$'], ['$'].
Also If I modify "Geometric Position" to "(Geometric Position)" the result becomes:
[["'1dEWu40Ab8zuK7fuATUuvp'", '#41', '$', '$'], ['#973', '#951'], ['#998']]
SOLUTION: (thanks to Rob Watts)
import re
dicionario1 = ["'0siSrBpkjDAOVD99BESZyg',#41,'(Geometric) (Position)',$,(#977,#762,#768,#754,#753,#980,#755,#759,#757)"]
# split into top-level tokens: either a whole (...) group or a run without commas
dicionario1 = re.findall(r'\([^)]*\)|[^,]+', dicionario1[0])
for i in range(len(dicionario1)):
    # only groups that start with '(#' are turned into sub-lists
    if dicionario1[i].startswith('(#'):
        dicionario1[i] = dicionario1[i][1:-1].split(',')
print(dicionario1)
["'0siSrBpkjDAOVD99BESZyg'", '#41', "'(Geometric) (Position)'", '$', ['#977', '#762', '#768', '#754', '#753', '#980', '#755', '#759', '#757']]
One problem I see with your code is the use of index:
ini = dicionario1[chave][1][0].index(i) + 2
fim = dicionario1[chave][1][0].index(')')
index returns the index of the first occurrence of the character. So if you have two ('s in your string, then both times it will give you the index of the first one. That (and your break statement) is why in your example you've got ['2.1', '2.2', '2.3'] correctly but also have '(#5.1', '5.2', '5.3)'.
You can get around this by specifying a starting index to the index method, but I'd suggest a different strategy. If you don't have any commas in the parsed strings, you can use a fairly simple regex to find all your groups:
'\([^)]*\)|[^,]+'
This will find everything inside parentheses and also every run of characters that doesn't contain a comma. For example:
>>> import re
>>> teststr = "'1',$,#41,(#10,#5)"
>>> re.findall(r'\([^)]*\)|[^,]+', teststr)
["'1'", '$', '#41', '(#10,#5)']
This leaves you with everything grouped appropriately. You still have to do a little bit of processing on each entry, but it should be fairly straightforward.
During your processing, the startswith method should be helpful. For example:
>>> '(something)'.startswith('(')
True
>>> '(something)'.startswith('(#')
False
>>> '(#1,#2,#3)'.startswith('(#')
True
This will make it easy for you to distinguish between (...) and (#...). If there are commas in the (...), you could always split on comma after you've used the regex.
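Putting the regex and startswith together, a minimal sketch that produces the nested shape the question asks for, where every plain token becomes its own one-element list (parse_attributes is a hypothetical name, and the sketch assumes plain tokens contain no commas):

```python
import re

# a whole parenthesised group, or a run of characters without commas
TOKEN_RE = re.compile(r'\([^)]*\)|[^,]+')

def parse_attributes(line):
    """Split an attribute string into a list of lists."""
    result = []
    for token in TOKEN_RE.findall(line):
        if token.startswith('(#'):
            # strip the parentheses and split the references inside
            result.append(token[1:-1].split(','))
        else:
            result.append([token])
    return result

print(parse_attributes("'1dEWu40Ab8zuK7fuATUuvp',#41,$,$,(#973,#951),#998"))
# [["'1dEWu40Ab8zuK7fuATUuvp'"], ['#41'], ['$'], ['$'], ['#973', '#951'], ['#998']]
```

To update the dictionary from the question, you could assign parse_attributes(dicionario1[chave][1][0]) to dicionario1[chave][1] for each key.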
