Python Regex: How to find a substring

Python Regex: How to find a substring - python

I have a list of titles that I need to normalize. For example, if a title contains 'CTO', it needs to be changed to 'Chief Technology Officer'. However, I only want to replace 'CTO' if there is no letter directly to the left or right of 'CTO'. For example, 'Director' contains 'cto'. I obviously wouldn't want this to be replaced. However, I do want it to be replaced in situations where the title is 'Founder/CTO' or 'CTO/Founder'.
Is there a way to check if a letter is before 'CXO' using regex? Or what would be the best way to accomplish this task?
EDIT:
My code is as follows...
test = 'Co-Founder/CTO'
test = re.sub("[^a-zA-Z0-9]CTO", 'Chief Technology Officer', test)
The result is 'Co-FounderChief Technology Officer'. The '/' gets replaced for some reason. However, this doesn't happen if test = 'CTO/Co-Founder'.

What you want is a regex that excludes a list of stuff before a point:
"[^a-zA-Z0-9]CTO"
But you actually also need to check for when CTO occurs at the beginning of the line:
"^CTO"
To use the first expression within re.sub, you can add a grouping operator (()s) and then use it in the replacement to pull out the matching character (eg, space or /):
re.sub("([^a-zA-Z0-9])CTO","\\1Chief Technology Officer", "foo/CTO")
Will result in
'foo/Chief Technology Officer'

Answer: "(?<=[^a-zA-Z0-9])CTO|^CTO"
Lookbehinds are perfect for this
cto_re = re.compile("(?<=[^a-zA-Z0-9])CTO")
but unfortunately won't work for the start of lines (due only to the python implementation requiring fixed length).
for eg in "Co-Founder/CTO", "CTO/Bossy", "aCTOrMan":
print(cto_re.sub("Chief Technology Officer", eg))
Co-Founder/Chief Technology Officer
CTO/Bossy
aCTOrMan
You would have to check for that explicitly via |:
cto_re = re.compile("(?<=[^a-zA-Z0-9])CTO|^CTO")
for eg in "Co-Founder/CTO", "CTO/Bossy", "aCTOrMan":
print(cto_re.sub("Chief Technology Officer", eg))
Co-Founder/Chief Technology Officer
Chief Technology Officer/Bossy
aCTOrMan

Related

Is there an R or Python function for separating information in non-delimited strings, where the information varies?

I am currently cleaning up a messy data sheet in which information is given in one excel cell where the different characteristics are not delimited (no comma, spaces are random).
Thus, my problem is to separate the different information without a delimitation I could use in my code (can't use a split command)
I assume that I need to include some characteristics of each part of information, such that the corresponding characteristic is recognized. However, I don't have a clue how to do that since I am quite new to Python and I only worked with R in the framework of regression models and other statistical analysis.
Short data example:
INPUT:
"WMIN CBOND12/05/2022 23554132121"
or
"WalMaInCBND 12/05/2022-23554132121"
or
"WalmartI CorpBond12/05/2022|23554132121"
EXPECTED OUTPUT:
"Walmart Inc.", "Corporate Bond", "12/05/2022", "23554132121"
So each of the "x" should be classified in a new column with the corresponding header (Company, Security, Maturity, Account Number)
As you can see the input varies randomly but I want to have the same output for each of the three inputs given above (I have over 200k data points with different companies, securities etc.)
First Problem is how to separate the information effectively without being able to use a systematic pattern.
Second Problem (lower priority) is how to identify the company without setting up a dictionary with 50 different inputs for 50k companies.
Thanks for your help!

I recommend to first introduce useful seperators where possible and construct a dictionary of replacements for processing with regular expressions.
import re
s = 'WMIN CBOND12/05/2022 23554132121'
# CAREFUL this not a real date regex, this should just
# illustrate the principle of regex
# see https://stackoverflow.com/a/15504877/5665958 for
# a good US date regex
date_re = re.compile('([0-9]{2}/[0-9]{2}/[0-9]{4})')
# prepend a whitespace before the date
# this is achieved by searching the date within the string
# and replacing it with itself with a prepended whitespace
# /1 means "insert the first capture group", which in our
# case is the date
s = re.sub(date_re, r' \1', s)
# split by one or more whitespaces and insert
# a seperator (';') to make working with the string
# easier
s = ';'.join(s.split())
# build a dictionary of replacements
replacements = {
'WMIN': 'Walmart Inc.',
'CBOND': 'Corporate Bond',
}
# for each replacement apply subsitution
# a better, but more replicated solution for
# this is given here:
# https://stackoverflow.com/a/15175239/5665958
for pattern, r in replacements.items():
s = re.sub(pattern, r, s)
# use our custom separator to split the parts
out = s.split(';')
print(out)

Using python and regular expressions:
import re
def make_filter(pattern):
pattern = re.compile(pattern)
def filter(s):
filtered = pattern.match(s)
return filtered.group(1), filtered.group(2), filtered.group(3), filtered.group(4)
return filter
filter = make_filter("^([a-zA-Z]+)\s([a-zA-Z]+)(\d+/\d+/\d+)\s(\d+)$")
filter("WMIN CBOND12/05/2022 23554132121")
The make_filter function is just an utility to allow you to modify the pattern. It returns a function that will filter the output according to that pattern. I use it with the "^([a-zA-Z]+)\s([a-zA-Z]+)(\d+/\d+/\d+)\s(\d+)$" pattern that considers some text, an space, some text, a date, an space, and a number. If you want to kodify this pattern provide more info about it. The output will be ("WMIN", "CBOND", "12/05/2022", "23554132121").

welcome! Yeah, we would definitely need to see more examples and regex seems to be the way to go... but since there seems to be no structure, I think it's better to think of this as seperate steps.
We KNOW there's a date which is (X)X/(X)X/XXXX (ie. one or two digit day, one or two digit month, four digit year, maybe with or without the slashes, right?) and after that there's numbers. So solve that part first, leaving only the first two categories. That's actually the easy part :) but don't lose heart!
if these two categories might not have ANY delimiter (for example WMINCBOND 12/05/202223554132121, or delimiters are not always delimiters for example IMAGINARY COMPANY X CBOND, then you're in deep trouble. :) BUT this is what we can do:
Gather a list of all the codes (hopefully you have that).
use str_detect() on each code and see if you can recognize the exact string in any of the dataset (if you do have the codes lemme know I'll write the code to do this part).
What's left after identifying the code will be the CBOND, whatever that is... so do that part last... what's left of the string will be that. Alternatively, you can use the same str_detect() if you have a list of whatever CBOND stuff is.
ONLY AFTER YOU'VE IDENTIFIED EVERYTHING, you can then replace the codes for what they stand for.
If you have the code-list let me know and I'll post the code.

edit
s = c("WMIN CBOND12/05/2022 23554132121",
"WalMaInCBND 12/05/2022-23554132121",
"WalmartI CorpBond12/05/2022|23554132121")
ID = gsub("([a-zA-Z]+).*","\\1",s)
ID2 = gsub(".* ([a-zA-Z]+).*","\\1",s)
date = gsub("[a-zA-Z ]+(\\d+\\/\\d+\\/\\d+).*","\\1",s)
num = gsub("^.*[^0-9](.*$)","\\1",s)
data.frame(ID=ID,ID2=ID2,date=date,num=num,stringsAsFactors=FALSE)
ID ID2 date num
1 WMIN CBOND 12/05/2022 23554132121
2 WalMaInCBND WalMaInCBND 12/05/2022-23554132121 12/05/2022 23554132121
3 WalmartI CorpBond 12/05/2022 23554132121
Works for cases 1 and 3 but I haven't figured out a logic for the second case, how can we know where to split the string containing the company and security if they are not separated?

Python Regex: Why my pattern does not match?

Here is my pattern:
pattern_1a = re.compile(r"(?:```|\n)Item *1A\.?.{0,50}Risk Factors.*?(?:\n)Item *1B(?!u)", flags = re.I|re.S)
Why it does not match text like the following? What's wrong?
"""
Item 1A.
Risk
Factors
If we
are unable to commercialize
ADVEXIN
therapy in various markets for multiple indications,
particularly for the treatment of recurrent head and neck
cancer, our business will be harmed.
under which we may perform research and development services for
them in the future.
42
Table of Contents
We believe the foregoing transactions with insiders were and are
in our best interests and the best interests of our
stockholders. However, the transactions may cause conflicts of
interest with respect to those insiders.
Item 1B.
"""

Here is one solution that will math with your actual text. Put ( ) around your string it will solve a lot of issue. See the solution below.
pattern_1a = re.compile(r"(?:```|\n)(Item 1A)[.\n]{0,50}(Risk Factors)([\n]|.)*(\nItem 1B.)(?!u)", flags = re.I|re.S)
Match evidence:
https://regexr.com/41ejq

The problem is Risk Factors is spread over two lines. It is actually: Risk\nFactors
Using a general white space \s or a new line \n instead of a space matches the text.

how to restore a splitted word by removing the hyphen "-" because of hyphenation in a paragraph using python

simple example: func-tional --> functional
The story is that I got a Microsoft Word document, which is converted from PDF format, and some words remain hyphenated (such as func-tional, broken because of line break in PDF). I want to recover those broken words while normal ones(i.e., "-" is not for word-break) are kept.
In order to make it more clear, one long example (source text) is added:
After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance.
Could someone give me some suggestions on this problem?

I would use regular expression. This little script searches for words with hyphenated and replaces the hyphenated by nothing.
import re
def replaceHyphenated(s):
matchList = re.findall(r"\w+-\w+",s) # find combination of word-word
sOut = s
for m in matchList:
new = m.replace("-","")
sOut = sOut.replace(m,new)
return sOut
if __name__ == "__main__":
s = """After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance."""
print(replaceHyphenated(s))
output would be:
After the symposium, the Foundation and the FCF steering team
continued their work and created the Functional Check Flight
Compendium. This compendium contains information that can be used to
reduce the risk of functional check flights. The information contained
in the guidance document is generic, and may need to be adjusted to
apply to your specific aircraft. If there are questions on any of the
information in the compendium, contact your manufacturer for further
guidance.
If you are not used to RegExp I recommend this site:
https://regex101.com/

regular expression negative lookahed

I'm trying to extract the names of firms from the text.
I found out that firm's names starts with Capital letters and some of them contains ' and ' or ' de ' or ' & ' or 'of' inside it.
So I wrote the regular expression that catches them
: (?:[A-Z]+[\w'-]*\s?(?:&\s|and\s|de\s|of\s)?)+%?
For example, from the sentence
"The companys largest customer, Wal-Mart Stores, Inc. and its
affiliated companies, accounted for approximately 25% of net sales
during fiscal year 2009 and 24% during fiscal years 2008 and 2007."
This regex matches out
"The", "Wal-Mart Stores", "Inc"
However, I am stuck with two problems.
Problem 1:
I found out that company's segment, product, division, category, sales names are also matched since It also begins with capitals. However, I don't want to extract those names along with companies names.
Problem 2 :
I don't want to get names which starts with S(s)ale(s) of/by/in or sold
For example,
;;;;;In fiscal 2005, the Company derived
approximately 21% ($4,782,852) of its consolidated revenues from
continuing operations from direct transactions with Kmart Corporation.
Sales of Computer products are important for us. However, Computer's Parts and
Display Segment sale has been decreasing.
According to my regex wrote above, it extracts
['In', "Company', 'Kmart Corporation', 'Sales of Computer', "Computer's Parts and Display Segment"]
Since, I don't want to get 'Sales of Computer' and 'Computer's Parts and Display Segment'
I tried to use negative look ahead / look behind
Bellows are what I've been trying so far:
I added negative look ahead ((?![Ss]egments?|[Pp]roducts?|programs?|[Dd]ivisions?|[Cc]ategor(?:y|ies)|[Ss]ales?))
(?:[A-Z]+[\w'-]\s?(?:&\s|and\s|de\s|of\s)?)+(?![Ss]egments?|[Pp]roducts?|programs?|[Dd]ivisions?|[Cc]ategor(?:y|ies)|[Ss]ales?)*
However, It still matches "Computer's Parts and Display Segment"...!
negative look behind is even worse...
I added (? at the beginning of my regex.
However, It seems like negative look behind expression cannot contain grouping or | ...
Whit such a huge frustration, I wrote few more regex for each cases and used set operations to deal with this problem.
However, I wonder is there any single regex that can do exactly what I expect in a one - shot??
Thanks for reading!

lists and sublists

i use this code to split a data to make a list with three sublists.
to split when there is * or -. but it also reads the the \n\n *.. dont know why?
i dont want to read those? can some one tell me what im doing wrong?
this is the data
*Quote of the Day
-Education is the ability to listen to almost anything without losing your temper or your self-confidence - Robert Frost
-Education is what survives when what has been learned has been forgotten - B. F. Skinner
*Fact of the Day
-Fractals, an important part of chaos theory, are very useful in studying a huge amount of areas. They are present throughout nature, and so can be used to help predict many things in nature. They can also help simulate nature, as in graphics design for movies (animating clouds etc), or predict the actions of nature.
-According to a recent survey by Just-Eat, not everyone in The United Kingdom actually knows what the Scottish delicacy, haggis is. Of the 1,623 British people polled:\n\n * 18% of Brits thought haggis was some sort of Scottish animal.\n\n * 15% thought it was a Scottish musical instrument.\n\n * 4% thought it was a character from Harry Potter.\n\n * 41% didn't even know what Scotland's national dish was.\n\nWhile a small number of Scots admitted not knowing what haggis was either, they also discovered that 68% of Scots would like to see Haggis delivered as takeaway.
-With the growing concerns involving Facebook and its ever changing privacy settings, a few software developers have now engineered a website that allows users to trawl through the status updates of anyone who does not have the correct privacy settings to prevent it.\n\nNamed Openbook, the ultimate aim of the site is to further expose the problems with Facebook and its privacy settings to the general public, and show people just how easy it is to access this type of information about complete strangers. The site works as a search engine so it is easy to search terms such as 'don't tell anyone' or 'I hate my boss', and searches can also be narrowed down by gender.
*Pet of the Day
-Scottish Terrier
-Land Shark
-Hamster
-Tse Tse Fly
END
i use this code:
contents = open("data.dat").read()
data = contents.split('*') #split the data at the '*'
newlist = [item.split("-") for item in data if item]
to make that wrong similar to what i have to get list

The "\n\n" is part of the input data, so it's preserved in python. Just add a strip() to remove it:
finallist = [item.strip() for item in newlist]
See the strip() docs: http://docs.python.org/library/stdtypes.html#str.strip
UPDATED FROM COMMENT:
finallist = [item.replace("\\n", "\n").strip() for item in newlist]

open("data.dat").read() - reads all symbols in file, not only those you want.
If you don't need '\n' you can try content.replace("\n",""), or read lines (not whole content), and truncate the last symbol'\n' of each line.

This is going to split any asterisk you have in the text as well.
Better implementation would be to do something like:
lines = []
for line in open("data.dat"):
if line.lstrip.startswith("*"):
lines.append([line.strip()]) # append a list with your line
elif line.lstrip.startswith("-"):
lines[-1].append(line.strip())
For more homework, research what's happening when you use the open() function in this way.

The following solves your problem i believe:
result = [ [subitem.replace(r'\n\n', '\n') for subitem in item.split('\n-')]
for item in open('data.txt').read().split('\n*') ]
# now let's pretty print the result
for i in result:
print '***', i[0], '***'
for j in i[1:]:
print '\t--', j
print
Note I split on new-line + * or -, in this way it won't split on dashes inside the text. Also i replace the textual character sequence \ n \ n (r'\n\n') with a new line character '\n'. And the one-liner expression is list comprehension, a way to construct lists in one gulp, without multiple .append() or +

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.