Regular expression negative lookahead - Python

I'm trying to extract the names of firms from a text.
I found that firm names start with capital letters, and some of them contain ' and ', ' de ', ' & ' or ' of ' inside them.
So I wrote a regular expression that catches them
: (?:[A-Z]+[\w'-]*\s?(?:&\s|and\s|de\s|of\s)?)+%?
For example, from the sentence
"The companys largest customer, Wal-Mart Stores, Inc. and its
affiliated companies, accounted for approximately 25% of net sales
during fiscal year 2009 and 24% during fiscal years 2008 and 2007."
this regex matches
"The", "Wal-Mart Stores", "Inc"
However, I am stuck on two problems.
Problem 1:
I found that the names of a company's segments, products, divisions, categories and sales are also matched, since they also begin with capitals. However, I don't want to extract those names along with the company names.
Problem 2:
I don't want to get names which start with S(s)ale(s) of/by/in or sold.
For example,
In fiscal 2005, the Company derived
approximately 21% ($4,782,852) of its consolidated revenues from
continuing operations from direct transactions with Kmart Corporation.
Sales of Computer products are important for us. However, Computer's Parts and
Display Segment sale has been decreasing.
The regex I wrote above extracts
['In', 'Company', 'Kmart Corporation', 'Sales of Computer', "Computer's Parts and Display Segment"]
Since I don't want to get 'Sales of Computer' and "Computer's Parts and Display Segment",
I tried to use negative lookahead / lookbehind.
Below is what I've tried so far:
I added negative look ahead ((?![Ss]egments?|[Pp]roducts?|programs?|[Dd]ivisions?|[Cc]ategor(?:y|ies)|[Ss]ales?))
(?:[A-Z]+[\w'-]\s?(?:&\s|and\s|de\s|of\s)?)+(?![Ss]egments?|[Pp]roducts?|programs?|[Dd]ivisions?|[Cc]ategor(?:y|ies)|[Ss]ales?)*
However, it still matches "Computer's Parts and Display Segment"!
Negative lookbehind is even worse.
I added (? at the beginning of my regex.
However, it seems that a negative lookbehind expression cannot contain grouping or | ...
Out of frustration, I wrote a few more regexes, one for each case, and used set operations to deal with this problem.
However, I wonder whether there is a single regex that can do exactly what I expect in one shot?
Thanks for reading!
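A note on why the attempt above still matches "Computer's Parts and Display Segment": the lookahead sits after the repeated group, so it is only tested at the position where the match ends, and by then the word Segment has already been consumed as one of the capitalized words. One possible workaround, sketched here rather than offered as a definitive single-regex answer, is to keep the original pattern for finding candidates and reject unwanted ones with a second pattern. The names CANDIDATE, REJECT and firm_candidates are made up for this sketch, and the reject list is simply the word list from the question.

```python
import re

# The original candidate pattern from the question, unchanged.
CANDIDATE = re.compile(r"(?:[A-Z]+[\w'-]*\s?(?:&\s|and\s|de\s|of\s)?)+%?")

# Assumption: a candidate is unwanted if it starts with "Sale(s) of/by/in"
# or "sold", or if its last word is one of the listed category words.
REJECT = re.compile(
    r"^(?:[Ss]ales?\s+(?:of|by|in)\b|[Ss]old\b)"
    r"|\b(?:[Ss]egments?|[Pp]roducts?|[Pp]rograms?|[Dd]ivisions?"
    r"|[Cc]ategor(?:y|ies)|[Ss]ales?)$"
)

def firm_candidates(text):
    """Return stripped candidate names that do not hit the reject pattern."""
    found = (m.group(0).strip() for m in CANDIDATE.finditer(text))
    return [name for name in found if not REJECT.search(name)]

text = (
    "In fiscal 2005, the Company derived approximately 21% ($4,782,852) of its "
    "consolidated revenues from continuing operations from direct transactions "
    "with Kmart Corporation. Sales of Computer products are important for us. "
    "However, Computer's Parts and Display Segment sale has been decreasing."
)
names = firm_candidates(text)
print(names)
```

Generic capitalized words such as In or However still pass this filter; removing them is a separate problem that the sketch does not attempt.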

Related

Python Regex: How to find a substring

I have a list of titles that I need to normalize. For example, if a title contains 'CTO', it needs to be changed to 'Chief Technology Officer'. However, I only want to replace 'CTO' if there is no letter directly to the left or right of 'CTO'. For example, 'Director' contains 'cto'. I obviously wouldn't want this to be replaced. However, I do want it to be replaced in situations where the title is 'Founder/CTO' or 'CTO/Founder'.
Is there a way to check if a letter is before 'CXO' using regex? Or what would be the best way to accomplish this task?
EDIT:
My code is as follows...
test = 'Co-Founder/CTO'
test = re.sub("[^a-zA-Z0-9]CTO", 'Chief Technology Officer', test)
The result is 'Co-FounderChief Technology Officer'. The '/' gets replaced for some reason. However, this doesn't happen if test = 'CTO/Co-Founder'.
What you want is a regex that excludes a list of stuff before a point:
"[^a-zA-Z0-9]CTO"
But you actually also need to check for when CTO occurs at the beginning of the line:
"^CTO"
To use the first expression within re.sub, you can add a grouping operator (()s) and then use it in the replacement to pull out the matching character (eg, space or /):
re.sub("([^a-zA-Z0-9])CTO","\\1Chief Technology Officer", "foo/CTO")
Will result in
'foo/Chief Technology Officer'
Answer: "(?<=[^a-zA-Z0-9])CTO|^CTO"
Lookbehinds are a good fit for this:
cto_re = re.compile("(?<=[^a-zA-Z0-9])CTO")
but unfortunately this won't match at the start of a line: the lookbehind requires a preceding character, and Python's re requires lookbehinds to be fixed-width, so ^ cannot be added as an alternative inside it.
for eg in "Co-Founder/CTO", "CTO/Bossy", "aCTOrMan":
    print(cto_re.sub("Chief Technology Officer", eg))
Co-Founder/Chief Technology Officer
CTO/Bossy
aCTOrMan
You would have to check for that explicitly via |:
cto_re = re.compile("(?<=[^a-zA-Z0-9])CTO|^CTO")
for eg in "Co-Founder/CTO", "CTO/Bossy", "aCTOrMan":
    print(cto_re.sub("Chief Technology Officer", eg))
Co-Founder/Chief Technology Officer
Chief Technology Officer/Bossy
aCTOrMan
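An alternative that avoids the explicit ^CTO branch is a negative lookbehind: it asserts that no letter or digit precedes CTO, which is also true at the start of the string. The sketch below additionally adds a negative lookahead so that a letter directly to the right blocks the replacement too, as the question asks. The name expand_cto is made up for illustration.

```python
import re

# No letter/digit immediately before or after "CTO".
# A negative lookbehind succeeds at the start of the string as well,
# because there is simply no character there to reject.
cto_re = re.compile(r"(?<![a-zA-Z0-9])CTO(?![a-zA-Z0-9])")

def expand_cto(title):
    """Replace a standalone 'CTO' with its long form."""
    return cto_re.sub("Chief Technology Officer", title)

for eg in ("Co-Founder/CTO", "CTO/Bossy", "aCTOrMan", "CTOrMan"):
    print(expand_cto(eg))
```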

How to clean up text data in a pandas dataframe column

I am playing around with some banking information and I have a csv file of all my transactions. I have opened it up as a dataframe and it looks something like this:
banking.csv
As you can see in the 2nd column, there is a lot of text that I do not need; all I am interested in is the store name, which typically comes at the end.
I have managed to get rid of the part that says 'Point of sale - INTERAC RETAIL PURCHASE'
by using
checking['POS'] = checking['POS'].str.replace('Point of Sale - Interac RETAIL PURCHASE', '')
My issue now comes when I try to delete the numbers that come after that, just before the store names. I wanted to do something similar to the above, but the numbers are all unique, so I am not sure how to do this.
Thanks for the help.
You could do the replacement with regular expressions:
import re
checking['POS'] = checking['POS'].apply(lambda x: re.sub(r"Point of Sale - Interac RETAIL PURCHASE \d+", "", x))
\d+ means "one or more digits".
So that will only work if there is nothing but digits between the text and the store name.
Python's built-in str.replace does not accept regular expressions, which is why the re module is used here. (pandas' Series.str.replace can also take a pattern when called with regex=True.)
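As a slightly more tolerant variant, the prefix and the number can be removed in one step with a compiled pattern that ignores case and surrounding whitespace. This is only a sketch: the helper name strip_pos_prefix and the sample descriptions are invented for illustration, since the real contents of banking.csv are not shown.

```python
import re

# Label text taken from the question; the digits are the per-transaction
# number described there. Case-insensitive because the question shows the
# label in two different capitalizations.
POS_PREFIX = re.compile(
    r"^\s*Point of Sale - Interac RETAIL PURCHASE\s*\d+\s*",
    re.IGNORECASE,
)

def strip_pos_prefix(description):
    """Drop the point-of-sale label and the number that follows it."""
    return POS_PREFIX.sub("", description)

# Invented sample rows, not real data from the CSV.
samples = [
    "Point of Sale - Interac RETAIL PURCHASE 123456789012 COFFEE SHOP",
    "Point of sale - INTERAC RETAIL PURCHASE 42 GROCERY MART",
    "ATM WITHDRAWAL",
]
cleaned = [strip_pos_prefix(s) for s in samples]
print(cleaned)
```

With the dataframe from the question, the same helper could be applied as checking['POS'] = checking['POS'].apply(strip_pos_prefix).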

Extracting 25 words to both sides of a word from a text

I have the following text and I am trying to use this pattern to extract 25 words to each side of the matches. The challenge is that the matches overlap, so Python's regex engine returns only one match. I would appreciate any help fixing this.
Text
2015 Outlook The Company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. This outlook does not include the impact of any future acquisitions and transaction-related costs. Revenues - Based on the revenues from the fourth quarter of 2014, the addition of new items at our some facility and the previously opened acquisition of Important Place, the Company expects utilization of the current 100 items to remain in some average
I tried the following pattern
pattern = r'(?<=outlook\s)((\w+.*?){25})'
This creates one match, whereas I need two matches, and it should not matter whether one overlaps the other.
I would not use regex at all: Python's re module does not return overlapping matches by default.
text = """2015 Outlook The Company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. This outlook does not include the impact of any future acquisitions and transaction-related costs. Revenues - Based on the revenues from the fourth quarter of 2014, the addition of new items at our some facility and the previously opened acquisition of Important Place, the Company expects utilization of the current 100 items to remain in some average"""
lookfor = "outlook"
# split text at spaces
splitted = text.lower().split()
# get the position in splitted where the words match (remove .,-?! for comparison)
positions = [i for i,w in enumerate(splitted) if lookfor == w.strip(".,-?!")]
# printing here, you can put those slices in a list for later usage
for p in positions:  # positions is: [1, 8, 21]
    # 25 words before the match, the match itself, and 25 words after
    print(' '.join(splitted[max(0, p-25):p+26]))
    print()
Output:
2015 outlook the company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. this outlook does not include the impact
2015 outlook the company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. this outlook does not include the impact of any future acquisitions and transaction-related costs.
2015 outlook the company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. this outlook does not include the impact of any future acquisitions and transaction-related costs. revenues - based on the revenues from the fourth quarter of 2014, the
By iterating over the split words you get the positions, then slice the list around each one. Make sure the slice start is clamped to 0 when the computed start would be negative, otherwise you won't get the expected output (a negative start such as -4 counts from the end of the list).
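The same idea can be wrapped in a small function that returns the words before and after each hit separately, which makes the "25 on each side" boundary explicit. This is a sketch: the function name context_windows and its parameters are made up, and the short sample string is only there to keep the output readable.

```python
def context_windows(text, keyword, n=25, strip_chars=".,-?!"):
    """Return (before, match, after) for every occurrence of keyword.

    Windows are built independently per occurrence, so they may overlap.
    """
    words = text.split()
    windows = []
    for i, word in enumerate(words):
        if word.lower().strip(strip_chars) == keyword.lower():
            before = words[max(0, i - n):i]   # up to n words to the left
            after = words[i + 1:i + 1 + n]    # up to n words to the right
            windows.append((" ".join(before), word, " ".join(after)))
    return windows

sample = "a outlook b c outlook. d e"
result = context_windows(sample, "outlook", n=2)
for before, match, after in result:
    print(repr(before), repr(match), repr(after))
```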
A non-regex way:
string = "2015 Outlook The Company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. This outlook does not include the impact of any future acquisitions and transaction-related costs. Revenues - Based on the revenues from the fourth quarter of 2014, the addition of new items at our some facility and the previously opened acquisition of Important Place, the Company expects utilization of the current 100 items to remain in some average"
words = string.split()
starting25 = " ".join(words[:25])
ending25 = " ".join(words[-25:])
print(starting25)
print("\n")
print(ending25)

Regex split string to isolate substrings enclosed in square brackets

Here is an example substring from the text I'm trying to parse and a couple of the raw strings I'm trying to split this text with:
>>> test_string = "[shelter and transitional housing during shelter crisis - selection of sites;\nwaiver of certain requirements regarding contracting]\n\nsponsors: acting mayor breed; kim, ronen, sheehy and cohen\nordinance authorizing public works, the department of homelessness and supportive\nhousing, and the department of public health to enter into contracts without adhering to the\nadministrative code or environment code provisions regarding competitive bidding and other\nrequirements for construction work, procurement, and personal services relating to identified\nshelter crisis sites (1601 quesada avenue; 149-6th street; 125 bayshore boulevard; 13th\nstreet and south van ness avenue, southwest corner; 5th street and bryant street, northwest\ncorner; caltrans emergency shelter properties; and existing city navigation centers and\nshelters) that will provide emergency shelter or transitional housing to persons experiencing\nhomelessness; authorizing the director of property to enter into and amend leases or licenses\nfor the shelter crisis sites without adherence to certain provisions of the administrative code;\nauthorizing the director of public works to add sites to the list of shelter crisis sites subject to\nexpedited processing, procurement, and leasing upon written notice to the board of\nsupervisors, and compliance with conditions relating to environmental review and\nneighborhood notice; affirming the planning department’s determination under the californinenvironmental quality act; and making findings of consistency with the general plan, and the eight priority policies of planning code, section 101.1. assigned under 30 day rule to\nrules committee.\n[memorandum of understanding - service employees international union, local\n1021]\n\nsponsor: acting mayor breed"
>>> title = re.compile(r"\[([\s\S]*)\]")
>>> title = re.compile(r"\[.*\]")
What I want is to get a list of all strings enclosed in square brackets: []
>>> title.split(test_string)
['shelter and transitional housing during shelter crisis - selection of sites; waiver of certain requirements regarding contracting', 'memorandum of understanding - service employees international union, local 1021']
However, neither of these raw strings splits properly. It seems to me that re is including the closing ] as part of the matched character set when it should be the character that the string is split on.
I tried modifying the raw string to split on to be like this:
title = re.compile(r"\[([\s\S^\]]*)\]")
but that doesn't work either. I'm interpreting this last string to split on substrings that have [ in them, followed by any number of characters except for ], and followed by ].
How am I misunderstanding this?
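[\s\S^\]] means: whitespace, or non-whitespace, or a literal caret ^, or a literal ] (the backslash only escapes the bracket). A ^ negates a character class only when it is the first character in it, so you cannot combine a negated set with regular ones this way. Because that class matches every character, including ], the greedy * runs on to the last ] in the text. It's enough to use the class "anything but a closing ]": [^]], see the example below.
You can also use re.findall instead of split.
re.findall(r'\[([^]]*)\]', test_string)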
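To get exactly the list shown in the question, where the line breaks inside each title are turned into single spaces, the captured text can be normalized afterwards. A minimal sketch, with a made-up helper name bracketed_titles and a shortened sample string in place of test_string:

```python
import re

def bracketed_titles(text):
    """Return the contents of every [...] pair, with whitespace collapsed."""
    # [^\]]* stops at the first closing bracket, so each pair is matched
    # separately; str.split()/join collapses newlines into single spaces.
    return [" ".join(t.split()) for t in re.findall(r"\[([^\]]*)\]", text)]

sample = (
    "[first title;\nsecond line]\n\nsponsor: someone\n"
    "[another - title, local\n1021]\n\nsponsor: someone else"
)
titles = bracketed_titles(sample)
print(titles)
```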

Regex pattern using multiple groups which may or may not exist with text in between

I am using regex on a list of strings (one string at a time) in order to extract information from each string. I have an almost functioning pattern which works on all the possible events I will pass into it except one. I'm fairly new to regex and am finding it hard to handle, especially as the pattern gets more complicated.
Here are the possible strings, separated by lines. The format is consistent but the content such as the names, scores and additional information are not.
Goal scored Sunderland 4, Cardiff City 0. Connor Wickham (Sunderland) header from the centre of the box to the bottom left corner. Assisted by Emanuele Giaccherini with a cross following a corner.
Booking Sebastian Larsson (Sunderland) is shown the yellow card.
Foul by Jordon Mutch (Cardiff City).
Dismissal Cala (Cardiff City) is shown the red card.
Penalty conceded by Cala (Cardiff City) after a foul in the penalty area.
They all follow the same format except goals, and therefore work with my current pattern. I would like the goal string to work as well, but it does not because of the capitalization of the team names. Ideally I would like to capture the team names and scores into separate groups for the home team and away team, although this is not strictly necessary.
Here is my current regex pattern which, other than for goals, correctly detects the event, player's name, team and any extra information after it. I initially had .* instead of [^A-Z]*, which worked on goals but always cut off players' first names, which I believe is due to the first-name part being optional within the group.
(?P<event>\A\w+)[^A-Z]*(?P<playername>(?:[A-Z]\w+)*\s\w+\s)(?P<team>\(.+\))(?P<extrainfo>[^\Z.]+)*
To break this down, this is what I am currently trying to match:
the first word that appears, which goes in the event group: (?P<event>\A\w+)
any number of characters which are not capitals (the initial reason goals are broken): [^A-Z]*
a player name, which can be of any length (some names are a single word, others have multiple parts, hence the non-capturing group to detect any first names): (?P<playername>(?:[A-Z]\w+)*\s\w+\s)
a team name, which is always enclosed in brackets after the player name: (?P<team>\(.+\))
any extra information about the event, so anything which comes after the team name. I also check that it's not just a . so that the group is None in that case: (?P<extrainfo>[^\Z.]+)*
I am currently trying to find a solution along the lines of [^A-Z.]*(?P<hometeam>\w+[^,.])*(?P<awayteam>\w+[^,.])* but this is not working and I am struggling.
A further task, which is minor but which I would love to add if possible, is removing the brackets from the teamname group, so that instead of teamname (Cardiff City) it becomes teamname Cardiff City.
Thanks for the help.
I would suggest splitting this into two tasks:
Extract the goals scored (r"^(?P<event>goal scored) (?P<hometeam>.*) (?P<homescore>\d), (?P<awayteam>.*) (?P<awayscore>\d). (?P<playername>.*) \((?P<scoringteam>.*)\).*$"); and
Extract the other events (r"^(?P<event>booking|foul|dismissal|penalty conceded) (?:by )?(?P<playername>.*) \((?P<teamname>.*)\).*$").
In your example, the former matches (with case-insensitive matching, since the patterns are written in lowercase):
event [0-11] `Goal scored`
hometeam [12-23] `Sunderland`
homescore [23-24] `4`
awayteam [26-39] `Cardiff City`
awayscore [39-40] `0`
playername [42-56] `Connor Wickham`
scoringteam [58-68] `Sunderland`
And the latter, for example:
event [197-204] `Booking`
playername [205-222] `Sebastian Larsson`
teamname [224-234] `Sunderland`
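In Python, the two-pattern approach could look roughly like the sketch below. It is adapted from the two patterns above with a few adjustments made for this sketch: re.IGNORECASE is passed explicitly, the dot after the away score is escaped, scores may have more than one digit, and the team group is limited to [^)]* so it stops at the closing bracket. The function name parse_event is made up.

```python
import re

GOAL_RE = re.compile(
    r"^(?P<event>goal scored) (?P<hometeam>.*) (?P<homescore>\d+), "
    r"(?P<awayteam>.*) (?P<awayscore>\d+)\. "
    r"(?P<playername>.*?) \((?P<scoringteam>[^)]*)\)",
    re.IGNORECASE,
)
OTHER_RE = re.compile(
    r"^(?P<event>booking|foul|dismissal|penalty conceded) (?:by )?"
    r"(?P<playername>.*?) \((?P<teamname>[^)]*)\)",
    re.IGNORECASE,
)

def parse_event(line):
    """Return the named groups of the first pattern that matches, else None."""
    for pattern in (GOAL_RE, OTHER_RE):
        m = pattern.match(line)
        if m:
            return m.groupdict()
    return None

goal = parse_event(
    "Goal scored Sunderland 4, Cardiff City 0. Connor Wickham (Sunderland) "
    "header from the centre of the box to the bottom left corner. "
    "Assisted by Emanuele Giaccherini with a cross following a corner."
)
foul = parse_event("Foul by Jordon Mutch (Cardiff City).")
print(goal)
print(foul)
```

Because the brackets sit outside the named group, the team name comes back without them, which also covers the request to strip the brackets.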
