I have the following text; each line has two phrases separated by "\t":
RoadTunnel RouteOfTransportation
LaunchPad Infrastructure
CyclingLeague SportsLeague
Territory PopulatedPlace
CurlingLeague SportsLeague
GatedCommunity PopulatedPlace
What I want is to add _ to separate the words; the result should be:
Road_Tunnel Route_Of_Transportation
Launch_Pad Infrastructure
Cycling_League Sports_League
Territory Populated_Place
Curling_League Sports_League
Gated_Community Populated_Place
There are no cases such as "ABTest" or "aBTest", but there are cases with three words together, such as "RouteOfTransportation". I tried several ways but did not succeed.
One of my attempts was:
textProcessed = re.sub(r"([A-Z][a-z]+)(?=([A-Z][a-z]+))", r"\1_", text)
But it produced no result.
Use a regular expression and re.sub.
>>> import re
>>> s = '''LaunchPad Infrastructure
... CyclingLeague SportsLeague
... Territory PopulatedPlace
... CurlingLeague SportsLeague
... GatedCommunity PopulatedPlace'''
>>> subbed = re.sub('([A-Z][a-z]+)([A-Z])', r'\1_\2', s)
>>> print(subbed)
Launch_Pad Infrastructure
Cycling_League Sports_League
Territory Populated_Place
Curling_League Sports_League
Gated_Community Populated_Place
edit: Here's another one, since your test cases don't cover enough to be sure what exactly you want:
>>> re.sub('([a-zA-Z])([A-Z])([a-z])', r'\1_\2\3', 'ABThingThing')
'AB_Thing_Thing'
Combining re.findall and str.join:
>>> "_".join(re.findall(r"[A-Z]{1}[^A-Z]*", text))
Depending on your needs, a slightly different solution can be this:
import re
result = re.sub(r"([a-zA-Z])(?=[A-Z])", r"\1_", s)
It will insert a _ before any upper case letter that follows another letter (whether it is upper or lower case).
"TheRabbit IsBlue" => "The_Rabbit Is_Blue"
"ABThing ThingAB" => "A_B_Thing Thing_A_B"
It does not support special chars.
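Applied to one of the question's input lines, for example:
>>> import re
>>> re.sub(r"([a-zA-Z])(?=[A-Z])", r"\1_", "GatedCommunity PopulatedPlace")
'Gated_Community Populated_Place'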
def translate(string, translations):
    '''
    >>> translations = {'he':'she', 'brother':'sister'}
    >>> translate('he', translations)
    'she'
    >>> translate('HE', translations)
    'SHE'
    >>> translate('He', translations)
    'She'
    >>> translate('brother', translations)
    'sister'
    >>> translate('my', translations)
    'my'
    '''
I have inputs like this. I used translations.get(string) to get 'she' and 'sister', and it worked well. But the thing is that I can't convert the strings to 'She' or 'SHE' (matching the case of the original).
How can I do this in Python?
You are going to need either a bigger, case-sensitive dictionary, or your translate function will have to be modified to:
Detect the case of the original word or phrase: all lower, all upper, sentence case, or title case.
Look up the translation case-insensitively.
Re-case the translated text to match the original.
But with some languages you will still have issues, e.g. in some languages all-caps text sometimes includes lower-case letters, or the second letter is capitalised rather than the first (a prefix such as d' would always stay lower case); and with SI units, UK capitalisation rules say a unit named after a person should always be capitalised, while other countries do this differently.
Just as you have a data structure of translations, we can create a data structure of case tests and corrections:
def iscapitalized(s):
    return s and s[0].isupper() and s[1:].islower()

def translate(string, translations):
    translation = translations.get(string.lower(), string)
    for test, correction in corrections.items():
        if test(string):
            translation = correction(translation)
            break
    return translation

translations = {'he': 'she', 'brother': 'sister'}
corrections = {str.isupper: str.upper, str.islower: str.lower, iscapitalized: str.capitalize}
print(translate('he', translations))
print(translate('HE', translations))
print(translate('He', translations))
print(translate('brother', translations))
print(translate('my', translations))
OUTPUT
> python3 test.py
she
SHE
She
sister
my
>
I know you can use noun extraction to get nouns out of sentences, but how can I use sentence overlays/maps to pull out phrases?
For example:
Sentence Overlay:
"First, #action; Second, Foobar"
Input:
"First, Dance and Code; Second, Foobar"
I want to return:
action = "Dance and Code"
Normal noun extraction won't work because the phrases won't always be nouns.
The way sentences are phrased differs, so it can't be words[x]... because the positioning of the words changes.
You can slightly rewrite your string templates to turn them into regexps, and see which one (or which ones) match.
>>> template = "First, (?P<action>.*); Second, Foobar"
>>> mo = re.search(template, "First, Dance and Code; Second, Foobar")
>>> if mo:
...     print(mo.group("action"))
...
Dance and Code
You can even transform your existing strings into this kind of regexp (after escaping regexp metacharacters like .?*()).
>>> template = "First, #action; (Second, Foobar...)"
>>> re_template = re.sub(r"\\#(\w+)", r"(?P<\g<1>>.*)", re.escape(template))
>>> print(re_template)
First\,\ (?P<action>.*)\;\ \(Second\,\ Foobar\.\.\.\)
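The generated pattern can then be used exactly like the hand-written one; note that the input now has to contain the template's literal punctuation:
>>> mo = re.search(re_template, "First, Dance and Code; (Second, Foobar...)")
>>> mo.group("action")
'Dance and Code'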
I'm using Python to search for some words (also multi-token) in a description (string).
To do that I'm using a regex like this:
result = re.search(word, description, re.IGNORECASE)
if result:
    print("Trovato: " + result.group())
But what I need is to obtain the first 2 words before and after the match. For example, if I have something like this:
Parking here is horrible, this shop sucks.
"here is" is the phrase that I am looking for. So after I match it with my regex, I need the 2 words (if they exist) before and after the match.
In the example:
Parking here is horrible, this
"Parking" and "horrible, this" are the words that I need.
ATTENTION
The description can be very long, and the pattern "here is" can appear multiple times.
How about string operations?
line = 'Parking here is horrible, this shop sucks.'
before, term, after = line.partition('here is')
before = before.rsplit(maxsplit=2)[-2:]
after = after.split(maxsplit=2)[:2]
Result:
>>> before
['Parking']
>>> after
['horrible,', 'this']
Try this regex: ((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})
with re.findall and re.IGNORECASE set
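For example (note that the captured groups keep their trailing whitespace):
>>> import re
>>> line = 'Parking here is horrible, this shop sucks.'
>>> re.findall(r"((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})", line, re.IGNORECASE)
[('Parking ', 'horrible, this ')]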
I would do it like this (edit: added anchors to cover most cases):
(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)
Like this you will always have 4 groups (might have to be trimmed) with the following behavior:
If group 1 is empty, there was no word before (group 2 is empty too)
If group 2 is empty, there was only one word before (group 1)
If group 1 and 2 are not empty, they are the words before in order
If group 3 is empty, there was no word after
If group 4 is empty, there was only one word after
If group 3 and 4 are not empty, they are the words after in order
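A minimal sketch of how the four groups come out in practice (variable names are mine):

import re

line = "Parking here is horrible, this shop sucks."
pattern = r"(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)"
m = re.search(pattern, line)
if m:
    print([g.strip() for g in m.groups()])  # ['Parking', '', 'horrible,', 'this']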
Based on your clarification, this becomes a bit more complicated. The solution below deals with scenarios where the searched pattern may in fact also be in the two preceding or two subsequent words.
line = "Parking here is horrible, here is great here is mediocre here is here is "
print line
pattern = "here is"
r = re.search(pattern, line, re.IGNORECASE)
output = []
if r:
while line:
before, match, line = line.partition(pattern)
if match:
if not output:
before = before.split()[-2:]
else:
before = ' '.join([pattern, before]).split()[-2:]
after = line.split()[:2]
output.append((before, after))
print output
Output from my example would be:
[(['Parking'], ['horrible,', 'here']), (['is', 'horrible,'], ['great', 'here']), (['is', 'great'], ['mediocre', 'here']), (['is', 'mediocre'], ['here', 'is']), (['here', 'is'], [])]
I am a beginner; I have been learning Python for a few months as my very first programming language. I am looking to find a pattern in a text file. My first attempt used regex, which works but has a limitation:
import re
noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
CC_list = ['and', 'or']
noun_list_pattern1 = r'\b\w+\b,\s\b\w+\b,\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b,\sor\s\b\w+\b|\b\w+\b,\s\b\w+\b\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b\sor\s\b\w+\b'
with open('test_sentence.txt', 'r') as input_f:
    read_input = input_f.read()
    word = re.findall(noun_list_pattern1, read_input)
    for w in word:
        print w
So at this point you may be asking why the lists are in this code, since they are not being used. Well, I have been racking my brain, trying all sorts of for loops and if statements in functions, to find a way to replicate the regex pattern using the lists instead.
The limitation with regex is that the \b\w+\b pattern, which appears a number of times in noun_list_pattern1, matches any word at all, not just specific nouns. This could raise false positives. I want to narrow things down by using the elements in the lists above instead.
Since I actually have 4 different regexes in the pattern (4 alternatives separated by |), I will just go with 1 of them here. So I would need to find a pattern such as:
'noun in noun_list' + ', ' + 'noun in noun_list' + ', ' + 'C in CC_list' + ' ' + 'noun in noun_list'
Obviously, the quoted line above is not real Python code, but an expression of my thoughts about the match needed. Where I say noun in noun_list I mean an iteration through noun_list; C in CC_list is an iteration through CC_list; ', ' is a literal match for a comma and whitespace.
Hopefully I have made myself clear!
Here is the content of the test_sentence.txt file that I am using:
I need to buy are bacon, cheese and eggs.
I also need to buy milk, cheese, and bacon.
What's your favorite: milk, cheese or eggs.
What's my favorite: milk, bacon, or eggs.
Break your problem down a little. First, you need a pattern that will match the words from your list, but no other. You can accomplish that with the alternation operator | and the literal words. red|green|blue, for example, will match "red", "green", or "blue", but not "purple". Join the noun list with that character, and add the word boundary metacharacters along with parentheses to group the alternations:
noun_patt = r'\b(' + '|'.join(nouns) + r')\b'
Do the same for your list of conjunctions:
conj_patt = r'\b(' + '|'.join(conjunctions) + r')\b'
The overall match you want to make is "one or more noun_patt match, each optionally followed by a comma, followed by a match for the conj_patt and then one more noun_patt match". Easy enough for a regex:
patt = r'({0},? )+{1} {0}'.format(noun_patt, conj_patt)
You don't really want to use re.findall(), but re.search(), since you're only expecting one match per line:
>>> for line in lines:
...     print re.search(patt, line).group(0)
...
bacon, cheese and eggs
milk, cheese, and bacon
milk, cheese or eggs
milk, bacon, or eggs
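For reference, here are those pieces assembled into a complete, runnable sketch (reading the question's test_sentence.txt, and reusing the question's own list names):

import re

noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
CC_list = ['and', 'or']

noun_patt = r'\b(' + '|'.join(noun_list) + r')\b'
conj_patt = r'\b(' + '|'.join(CC_list) + r')\b'
patt = r'({0},? )+{1} {0}'.format(noun_patt, conj_patt)

with open('test_sentence.txt') as f:
    for line in f:
        m = re.search(patt, line)
        if m:
            print(m.group(0))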
As a note, you're close to, if not rubbing up against, the limits of regular expressions as far as parsing English. Any more complex than this, and you will want to look into actual parsing, perhaps with NLTK.
In actuality, you don't necessarily need regular expressions, as there are a number of ways to do this using just your original lists.
noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']
# This assumes that the file has been read into a list of newline-delimited lines called `rawlines`
for line in rawlines:
    matches = [noun for noun in noun_list if noun in line] + [conj for conj in conjunctions if conj in line]
    if len(matches) == 4:
        for match in matches:
            print match
The reason the match count is 4 is that a matching line contains exactly three nouns plus one conjunction. (Note that repeated nouns or conjunctions could also produce this count.)
EDIT:
This version prints the lines that are matched and the words matched. Also fixed the possible multiple word match problem:
words_matched = []
matching_lines = []
for l in lst:
    matches = [noun for noun in noun_list if noun in l] + [conj for conj in conjunctions if conj in l]
    invalid = True
    valid_count = 0
    for match in matches:
        if matches.count(match) == 1:
            valid_count += 1
    if valid_count == len(matches):
        invalid = False
    if not invalid:
        words_matched.append(matches)
        matching_lines.append(l)

for line, matches in zip(matching_lines, words_matched):
    print line, matches
However, if this doesn't suit you, you can always build the regex as follows (using the itertools module):
import itertools

# The number of permutation choices is 3 (as revealed by your examples).
for nouns, conj in itertools.product(itertools.permutations(noun_list, 3), conjunctions):
    matches = [noun for noun in nouns]
    matches.append(conj)
    # matches[:2] is the sublist containing the first 2 items, -1 is the last element,
    # and matches[2:-1] is the element before the last (if there were more than 3 nouns,
    # this would be the elements between the 2nd and the last).
    regex_string = '\s,\s'.join(matches[:2]) + '\s' + matches[-1] + '\s' + '\s,\s'.join(matches[2:-1])
    print regex_string
    # ... do regex-related matching here
The caveat of this method is that it is pure brute force: it generates all possible combinations (read: permutations) of both lists, which can then be tested to see whether each line matches. Hence it is horrendously slow, but for examples like the ones given (with no comma before the conjunction), it will generate exact matches perfectly.
Adapt as required.
If all I have is a string of 10 or more digits, how can I format this as a phone number?
Some trivial examples:
555-5555
555-555-5555
1-800-555-5555
I know those aren't the only ways to format them, and it's very likely I'll leave things out if I do it myself. Is there a python library or a standard way of formatting phone numbers?
For a library: phonenumbers (on PyPI; source on GitHub).
Python version of Google's common library for parsing, formatting, storing and validating international phone numbers.
The readme is insufficient, but I found the code well documented.
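Usage looks roughly like this (a sketch; the example number is taken from the question, and the output is the library's US national format):
>>> import phonenumbers
>>> x = phonenumbers.parse("+18005555555", None)
>>> phonenumbers.format_number(x, phonenumbers.PhoneNumberFormat.NATIONAL)
'(800) 555-5555'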
It seems like your examples are formatted in three-digit groups except the last, so you can write a simple function that uses the thousands separator and adds the last digit:
>>> def phone_format(n):
...     return format(int(n[:-1]), ",").replace(",", "-") + n[-1]
...
>>> phone_format("5555555")
'555-5555'
>>> phone_format("5555555")
'555-5555'
>>> phone_format("5555555555")
'555-555-5555'
>>> phone_format("18005555555")
'1-800-555-5555'
Here's one adapted from utdemir's solution and this solution that will work with Python 2.6, as the "," formatter is new in Python 2.7.
import re

def phone_format(phone_number):
    clean_phone_number = re.sub('[^0-9]+', '', phone_number)
    formatted_phone_number = re.sub(r"(\d)(?=(\d{3})+(?!\d))", r"\1-", "%d" % int(clean_phone_number[:-1])) + clean_phone_number[-1]
    return formatted_phone_number
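For example, assuming the input may contain punctuation (which the first re.sub strips):
>>> phone_format("1 (800) 555-5555")
'1-800-555-5555'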
You can use the function clean_phone() from the library DataPrep. Install it with pip install dataprep.
>>> from dataprep.clean import clean_phone
>>> import pandas as pd
>>> df = pd.DataFrame({'phone': ['5555555', '5555555555', '18005555555']})
>>> clean_phone(df, 'phone')
Phone Number Cleaning Report:
3 values cleaned (100.0%)
Result contains 3 (100.0%) values in the correct format and 0 null values (0.0%)
phone phone_clean
0 5555555 555-5555
1 5555555555 555-555-5555
2 18005555555 1-800-555-5555
More verbose, one dependency, but guarantees consistent output for most inputs and was fun to write:
import re

def format_tel(tel):
    tel = tel.removeprefix("+")
    tel = tel.removeprefix("1")      # remove leading +1 or 1
    tel = re.sub("[ ()-]", '', tel)  # remove space, (), -
    assert len(tel) == 10
    tel = f"{tel[:3]}-{tel[3:6]}-{tel[6:]}"
    return tel
Output:
>>> format_tel("1-800-628-8737")
'800-628-8737'
>>> format_tel("800-628-8737")
'800-628-8737'
>>> format_tel("18006288737")
'800-628-8737'
>>> format_tel("1800-628-8737")
'800-628-8737'
>>> format_tel("(800) 628-8737")
'800-628-8737'
>>> format_tel("(800) 6288737")
'800-628-8737'
>>> format_tel("(800)6288737")
'800-628-8737'
>>> format_tel("8006288737")
'800-628-8737'
Without magic numbers; ...if you're not into the whole brevity thing:
def format_tel(tel):
    AREA_BOUNDARY = 3      # 800.6288737
    SUBSCRIBER_SPLIT = 6   # 800628.8737
    tel = tel.removeprefix("+")
    tel = tel.removeprefix("1")      # remove leading +1, or 1
    tel = re.sub("[ ()-]", '', tel)  # remove space, (), -
    assert len(tel) == 10
    tel = (f"{tel[:AREA_BOUNDARY]}-"
           f"{tel[AREA_BOUNDARY:SUBSCRIBER_SPLIT]}-{tel[SUBSCRIBER_SPLIT:]}")
    return tel
A simple solution might be to start at the back and insert the hyphen after four digits, then mark off groups of three until the beginning of the string is reached. I am not aware of a built-in function or anything like that.
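A minimal sketch of that idea (the function name and details are my own, not a built-in):

def hyphenate(digits):
    groups = [digits[-4:]]       # the final group of four
    rest = digits[:-4]
    while rest:                  # then groups of three, right to left
        groups.append(rest[-3:])
        rest = rest[:-3]
    return "-".join(reversed(groups))

print(hyphenate("18005555555"))  # 1-800-555-5555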
You might find this helpful:
http://www.diveintopython3.net/regular-expressions.html#phonenumbers
Regular expressions will be useful if you are accepting user input of phone numbers. I would not use the exact approach followed at the above link. Something simpler, like just stripping out digits, is probably easier and just as good.
Also, inserting commas into numbers is an analogous problem that has been solved efficiently elsewhere and could be adapted to this problem.
In my case, I needed to get a phone pattern like "*** *** ***" by country, so I reused the phonenumbers package in our project:
from phonenumbers import country_code_for_region, format_number, PhoneMetadata, PhoneNumberFormat, parse as parse_phone
import re

def get_country_phone_pattern(country_code: str):
    mobile_number_example = PhoneMetadata.metadata_for_region(country_code).mobile.example_number
    formatted_phone = format_number(parse_phone(mobile_number_example, country_code), PhoneNumberFormat.INTERNATIONAL)
    without_country_code = " ".join(formatted_phone.split()[1:])
    return re.sub(r"\d", "*", without_country_code)

get_country_phone_pattern("KG")  # *** *** ***