Extract first word from string in Python

I have Python strings that follow one of two formats:
"#gianvitorossi/ FALL 2012 #highheels ..."
OR:
"#gianvitorossi FALL 2012 #highheels ..."
I want to extract just the #gianvitorossi portion.
I'm trying the following:
...
company = p['edge_media_to_caption']['edges'][0]['node']['text']
company = company.replace('/','')
company = company.replace('\t','')
company = company.replace('\n','')
c = company.split(' ')
company = c[0]
This works for some of the names. However, for others my code returns #gianvitorossi FALL rather than just #gianvitorossi as expected.

You should split with the '/' character
company = "mystring"
c = company.split('/')
company = c[0]

Well, it worked on my machine. For trailing characters such as a slash, you can use rstrip(your_symbols).
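A minimal sketch of that idea, assuming the stray "FALL" comes from a separator that is not a plain ASCII space (a non-breaking space, for instance; the question does not confirm this). Calling split() with no separator splits on any run of whitespace, and rstrip("/") drops a trailing slash. The helper name first_token is hypothetical.

```python
def first_token(caption: str) -> str:
    """Return the first whitespace-delimited token without a trailing slash."""
    # split() with no separator splits on any Unicode whitespace,
    # including tabs, newlines and non-breaking spaces.
    parts = caption.split(maxsplit=1)
    return parts[0].rstrip("/") if parts else ""


print(first_token("#gianvitorossi/ FALL 2012 #highheels ..."))
print(first_token("#gianvitorossi\u00a0FALL 2012 #highheels ..."))
```

Both calls print #gianvitorossi. An empty or whitespace-only caption gives an empty string instead of raising an IndexError.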

You could do that with a regular expression. For example:
import re
text1 = "#gianvitorossi/ FALL 2012 #highheels ..."
text2 = "#gianvitorossi FALL 2012 #highheels ..."
patt = "#[A-Za-z]+"
print(re.findall(patt, text1))
If your text might include numbers, you could modify the pattern as follows:
import re
text1 = "#gianvitorossi/ FALL 2012 #highheels ..."
text2 = "#gianvitorossi FALL 2012 #highheels ..."
patt = "#[A-Za-z0-9]+"
print(re.findall(patt, text1))
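Note that re.findall returns every hashtag in the caption, not just the first one. If only the leading tag is wanted, re.search stops at the first match. A small sketch, assuming a hashtag is "#" followed by word characters (letters, digits, underscore); the helper name first_hashtag is made up for illustration.

```python
import re


def first_hashtag(text):
    """Return the first '#word' in text, or None if there is none."""
    match = re.search(r"#\w+", text)
    return match.group(0) if match else None


print(first_hashtag("#gianvitorossi/ FALL 2012 #highheels ..."))
print(first_hashtag("#gianvitorossi FALL 2012 #highheels ..."))
```

Both calls print #gianvitorossi, because "/" and the space are not word characters and end the match.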

You can get it by using split and replace, which if your requirements above are exhaustive, should be enough:
s.split(' ')[0].replace('/','')
An example:
s = ["#gianvitorossi/ FALL 2012 #highheels ...","#gianvitorossi FALL 2012 #highheels ..."]
for i in s:
    print(i.split(' ')[0].replace('/',''))
#gianvitorossi
#gianvitorossi

If you don't want to use regular expressions, you could use this:
original = "#gianvitorossi/ FALL 2012 #highheels ..."
extract = original.split(' ')[0]
if extract[-1] == "/":
    extract = extract[:-1]

Related

Replace footer page number using regex

I have a footer extracted out using regex from a PDF. The footer example is as below
footer_text = 'company name. (ABC) Q1 2020 Here is some text 01-Jan-2019 1-888-1234567 www.company.com 2 Copyright 2001-2019 some relevant text here'
I want to find this string throughout my text and replace it with a space, since I don't need it in the middle of my extracted text. However, the page number in the middle of the footer changes each time, so it is not a simple find and replace. I am able to find the page number using
result = re.search(r"\s[\d]\s", footer_text)
But I don't know how to make this 2 match any number during my find and replace. Any pointers?
Assuming that the footer text does contain something that matches r'\s\d+\s' (I am allowing for page numbers >= 10), first create a regex by replacing the page number with the regex that matches it. The backslashes in the replacement are doubled because, on Python 3.7+, re.sub rejects a bare \s in a replacement template as a bad escape:
regex = re.sub(r'\\ \d+\\ ', r'\\s\\d+\\s', re.escape(footer_text))
Now you can match any footer regardless of page number. The code then is:
>>> import re
...
... footer_text = 'company name. (ABC) Q1 2020 Here is some text 01-Jan-2019 1-888-1234567 www.company.com 11 Copyright 2001-2019some relevant text here'
...
... regex = re.sub(r'\\ \d+\\ ', r'\\s\\d+\\s', re.escape(footer_text))
... replacement = ' ' # a single space (should this instead be '' for an empty string?)
...
... some_text = "abc" + footer_text + "def"
... print(regex)
... print(some_text)
... print(re.sub(regex, replacement, some_text))
...
company\ name\.\ \(ABC\)\ Q1\s\d+\sHere\ is\ some\ text\ 01\-Jan\-2019\ 1\-888\-1234567\ www\.company\.com\s\d+\sCopyright\ 2001\-2019some\ relevant\ text\ here
abccompany name. (ABC) Q1 2020 Here is some text 01-Jan-2019 1-888-1234567 www.company.com 11 Copyright 2001-2019some relevant text heredef
abc def
For simpler copying:
import re
footer_text = 'company name. (ABC) Q1 2020 Here is some text 01-Jan-2019 1-888-1234567 www.company.com 11 Copyright 2001-2019some relevant text here'
regex = re.sub(r'\\ \d+\\ ', r'\\s\\d+\\s', re.escape(footer_text))
replacement = ' ' # a single space (should this instead be '' for an empty string?)
some_text = "abc" + footer_text + "def"
print(regex)
print(some_text)
print(re.sub(regex, replacement, some_text))
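The same approach can be wrapped in a reusable helper. This is only a sketch: the names footer_regex and strip_footers are made up, and it assumes one sample footer is available to build the pattern from. A function is used as the replacement so that the \s\d+\s text is inserted literally rather than being parsed as a replacement template.

```python
import re


def footer_regex(sample_footer):
    """Build a pattern matching sample_footer with any space-delimited number."""
    escaped = re.escape(sample_footer)
    # In the escaped text each space is '\ ', so a standalone number looks
    # like '\ 123\ '. Swap every such run for a generic \s\d+\s.
    generalized = re.sub(r"\\ \d+\\ ", lambda m: r"\s\d+\s", escaped)
    return re.compile(generalized)


def strip_footers(text, sample_footer, replacement=" "):
    """Replace every occurrence of the footer, whatever its page number."""
    return footer_regex(sample_footer).sub(replacement, text)


sample = ('company name. (ABC) Q1 2020 Here is some text 01-Jan-2019 '
          '1-888-1234567 www.company.com 2 Copyright 2001-2019 some relevant text here')
page_11 = sample.replace("com 2 Copyright", "com 11 Copyright")
print(strip_footers("abc" + page_11 + "def", sample))
```

Be aware that every standalone number in the sample is generalized, so " 2020 " will also match any other number in that position.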

Identifying dates in strings using NLTK

I'm trying to identify whether a date occurs in an arbitrary string. Here's my code:
import nltk
txts = ['Submitted on 1st January',
        'Today is 1/3/15']
def chunk(t):
    w_tokens = nltk.word_tokenize(t)
    pt = nltk.pos_tag(w_tokens)
    ne = nltk.ne_chunk(pt)
    print(ne)
for t in txts:
    print(t)
    chunk(t)
The output I'm getting is
Submitted on 1st January
(S (GPE Submitted/NNP) on/IN 1st/CD January/NNP)
Today is 1/3/15
(S Today/NN is/VBZ 1/3/15/CD)
Clearly the dates are not being tagged. Does anyone know how to have dates tagged?
Thanks
I took the date example 1/1/70 from your comment, but this regex will also find dates that are formatted differently, such as 1970/01/20 or 2-21-79:
import re
x = 'asdfasdf sdf5sdf asd78fsadf 1/1/70 dfsdg fghdfgh 1970/01/20 gfh5fghh sdfgsdg 2-21-79 sdfgsdgf'
print(re.findall(r'\d+\S\d+\S\d+', x))
Output:
['1/1/70', '1970/01/20', '2-21-79']
OR,
y = 'Asdfasdf Ddf5sdf asd78fsadf Jan 3 dfsdg fghdfgh February 10 sdfgsdgf'
print(re.findall(r'[A-Z]\w+\s\d+', y))
Output:
['Jan 3', 'February 10']
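These patterns only look at the shape of the text, so something like 99/99/99 would also match. One way to filter the candidates is to try parsing each with datetime.strptime. The sketch below assumes a fixed, hypothetical list of formats (month first for two-digit years) and the helper name find_dates; adjust both to the data at hand.

```python
import re
from datetime import datetime

# Candidate shape: three groups of digits separated by '/' or '-'.
CANDIDATE = re.compile(r"\d+[/-]\d+[/-]\d+")
# Assumed formats, tried in order; not an exhaustive list.
FORMATS = ("%m/%d/%y", "%Y/%m/%d", "%m-%d-%y")


def find_dates(text):
    """Return a date object for each candidate that parses as a real date."""
    found = []
    for token in CANDIDATE.findall(text):
        for fmt in FORMATS:
            try:
                found.append(datetime.strptime(token, fmt).date())
                break  # stop at the first format that fits
            except ValueError:
                continue
    return found


print(find_dates("a 1/1/70 b 1970/01/20 c 2-21-79 d 99/99/99"))
```

The 99/99/99 candidate is dropped because no format in the list can parse it.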
NLTK will not detect dates by itself, but combined with Stanford's Named Entity Tagger it will. It can be difficult to find a set of instructions that works, so here are a couple of links:
Stanford tagger site - look for downloads: https://nlp.stanford.edu/software/CRF-NER.shtml
Stanford tagger API - http://www.nltk.org/api/nltk.tag.html#nltk.tag.stanford.StanfordTagger
Sorry, the linking wouldn't work for these last two.
Here is the code I used:
from nltk import word_tokenize
from nltk.tag import StanfordNERTagger
stanfordClassifier = '/path/to/classifier/classifiers/english.muc.7class.distsim.crf.ser.gz'
stanfordNerPath = '/path/to/jar/stanford-ner-2017-06-09/stanford-ner.jar'
st = StanfordNERTagger(stanfordClassifier, stanfordNerPath, encoding='utf8')
result = st.tag(word_tokenize("The date is October 13, 2017"))
print(result)
NLTK ne_chunk() does not recognize dates by default. You'll need to use timex.py by first obtaining it from nltk_contrib.

Extracting using a string pattern in Regex- Python

Can't we put a string in the regex? For example, re.compile('((.*)?=<Bangalore>)'). In the code below I have specified <Bangalore>, but it's not displaying.
I want to extract the text before Bangalore.
import re
regex = re.compile('((.*)?=<>)')
line = ("Kathick Kumar, Bangalore who was a great person and lived from 29th March 1980 - 21 Dec 2014")
result = regex.search(line)
print(result)
Desired output: Kathick Kumar, Bangalore
Something like this?
>>> import re
>>> regex = re.compile('(.*Bangalore)')
>>> result = regex.search(line)
>>> print(result.groups())
('Kathick Kumar, Bangalore',)
Use the (.*)(?:Bangalore) pattern:
>>> line = ("Kathick Kumar, Bangalore who was a great person and lived from 29th March 1980 - 21 Dec 2014")
>>> import re
>>> regex = re.compile('(.*)(?:Bangalore)')
>>> result = regex.search(line)
>>> print(result.group(0))
Kathick Kumar, Bangalore
>>>
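Both answers use a greedy .*, which runs up to the last occurrence of the keyword in the line. If the keyword may appear more than once, or may contain regex metacharacters, a non-greedy match combined with re.escape is one option. A minimal sketch; the helper name text_through is made up.

```python
import re


def text_through(keyword, line):
    """Return the text from the start of line through the first keyword."""
    # .*? is non-greedy, so the match ends at the first occurrence;
    # re.escape makes the keyword match literally.
    match = re.search(r".*?" + re.escape(keyword), line)
    return match.group(0) if match else None


line = ("Kathick Kumar, Bangalore who was a great person and lived from "
        "29th March 1980 - 21 Dec 2014")
print(text_through("Bangalore", line))
```

This prints Kathick Kumar, Bangalore, and the function returns None when the keyword is absent.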

How to combine several regex search patterns into one pattern (i.e., use an addition operator) in python

I'm trying to combine input from a user as part of a regex search. Since I only want the user-provided pattern to be searched in particular lines, I want to combine (i.e., concatenate) the pattern provided with a pattern. What's the best way to do this in python?
Here is my code, which at the moment, doesn't work because an addition operator doesn't seem to be supported by re:
import re
q1=re.compile(r'^Organism.*')
q2=re.compile(r'(moth)')
q3=re.compile(r'.*</td>')
s="Organism: moth </td>"
test=re.search(q1+q2+q3,s).group(1)
print("test", test)
As far as I know, a compiled regex object cannot be changed once it's compiled.
Instead, you could delay the compilation until after the user input:
import re
q1 = r'^Organism.*('
q2 = input("Enter organism (e.g., moth)")
q3 = r').*</td>'
s="Organism: moth </td>"
regex = re.compile(q1+q2+q3)
test = re.search(regex,s).group(1)
print("test", test)
Nothing wrong with the obvious way...
import re
q1 = "^Organism.*"
q2 = "(moth)"
q3 = ".*</td>"
rx = re.compile(q1 + q2 + q3)
s = "Organism: moth </td>"
test = rx.search(s).group(1)
In fact there's really no reason to compile a one-off regex, just use it as a string:
import re
q1 = "^Organism.*"
q2 = "(moth)"
q3 = ".*</td>"
s = "Organism: moth </td>"
test = re.search(q1 + q2 + q3, s).group(1)
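One caveat when the middle piece comes from a user: characters such as "." or "(" in the input would be interpreted as regex syntax. If the input is meant to be matched literally, it can be passed through re.escape before concatenation. A sketch under that assumption; the function name find_organism is hypothetical.

```python
import re


def find_organism(user_text, line):
    """Search line for user_text (taken literally) inside an Organism row."""
    pattern = r"^Organism.*(" + re.escape(user_text) + r").*</td>"
    match = re.search(pattern, line)
    return match.group(1) if match else None


s = "Organism: moth </td>"
print(find_organism("moth", s))   # the literal text is found
print(find_organism("m.th", s))   # the dot is literal here, so no match
```

If users are expected to type actual regex syntax, skip re.escape and catch re.error around the search instead.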

What's the best way to format a phone number in Python?

If all I have is a string of 10 or more digits, how can I format this as a phone number?
Some trivial examples:
555-5555
555-555-5555
1-800-555-5555
I know those aren't the only ways to format them, and it's very likely I'll leave things out if I do it myself. Is there a python library or a standard way of formatting phone numbers?
For a library: phonenumbers (available on PyPI).
Python version of Google's common library for parsing, formatting, storing and validating international phone numbers.
The readme is insufficient, but I found the code well documented.
Your examples appear to be formatted in groups of three digits, except for the last group of four. You can write a simple function that uses the thousands separator and then appends the last digit:
>>> def phone_format(n):
...     return format(int(n[:-1]), ",").replace(",", "-") + n[-1]
...
>>> phone_format("5555555")
'555-5555'
>>> phone_format("5555555555")
'555-555-5555'
>>> phone_format("18005555555")
'1-800-555-5555'
Here's one adapted from utdemir's solution and this solution that will work with Python 2.6, as the "," formatter is new in Python 2.7.
import re
def phone_format(phone_number):
    clean_phone_number = re.sub('[^0-9]+', '', phone_number)
    formatted_phone_number = re.sub(r"(\d)(?=(\d{3})+(?!\d))", r"\1-", "%d" % int(clean_phone_number[:-1])) + clean_phone_number[-1]
    return formatted_phone_number
You can use the function clean_phone() from the library DataPrep. Install it with pip install dataprep.
>>> import pandas as pd
>>> from dataprep.clean import clean_phone
>>> df = pd.DataFrame({'phone': ['5555555', '5555555555', '18005555555']})
>>> clean_phone(df, 'phone')
Phone Number Cleaning Report:
3 values cleaned (100.0%)
Result contains 3 (100.0%) values in the correct format and 0 null values (0.0%)
         phone     phone_clean
0      5555555        555-5555
1   5555555555    555-555-5555
2  18005555555  1-800-555-5555
More verbose, one dependency, but guarantees consistent output for most inputs and was fun to write:
import re
def format_tel(tel):
    tel = tel.removeprefix("+")
    tel = tel.removeprefix("1")     # remove leading +1 or 1
    tel = re.sub("[ ()-]", '', tel) # remove space, (), -
    assert(len(tel) == 10)
    tel = f"{tel[:3]}-{tel[3:6]}-{tel[6:]}"
    return tel
Output:
>>> format_tel("1-800-628-8737")
'800-628-8737'
>>> format_tel("800-628-8737")
'800-628-8737'
>>> format_tel("18006288737")
'800-628-8737'
>>> format_tel("1800-628-8737")
'800-628-8737'
>>> format_tel("(800) 628-8737")
'800-628-8737'
>>> format_tel("(800) 6288737")
'800-628-8737'
>>> format_tel("(800)6288737")
'800-628-8737'
>>> format_tel("8006288737")
'800-628-8737'
Without magic numbers; ...if you're not into the whole brevity thing:
def format_tel(tel):
    AREA_BOUNDARY = 3           # 800.6288737
    SUBSCRIBER_SPLIT = 6        # 800628.8737
    tel = tel.removeprefix("+")
    tel = tel.removeprefix("1")     # remove leading +1, or 1
    tel = re.sub("[ ()-]", '', tel) # remove space, (), -
    assert(len(tel) == 10)
    tel = (f"{tel[:AREA_BOUNDARY]}-"
           f"{tel[AREA_BOUNDARY:SUBSCRIBER_SPLIT]}-{tel[SUBSCRIBER_SPLIT:]}")
    return tel
A simple solution might be to start at the back and insert the hyphen after four numbers, then do groups of three until the beginning of the string is reached. I am not aware of a built in function or anything like that.
You might find this helpful:
http://www.diveintopython3.net/regular-expressions.html#phonenumbers
Regular expressions will be useful if you are accepting user input of phone numbers. I would not use the exact approach followed at the above link. Something simpler, like just stripping out everything except the digits, is probably easier and just as good.
Also, inserting commas into numbers is an analogous problem that has been solved efficiently elsewhere and could be adapted to this problem.
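The back-to-front grouping described above could look roughly like this. It is only a sketch: the function name group_from_back is made up, and it assumes the North American style grouping from the question's examples (last four digits, then groups of three).

```python
import re


def group_from_back(number):
    """Hyphenate a phone number: last four digits, then groups of three."""
    digits = re.sub(r"\D", "", number)  # keep digits only
    groups = [digits[-4:]]
    rest = digits[:-4]
    while rest:
        groups.append(rest[-3:])  # peel groups of three off the end
        rest = rest[:-3]
    return "-".join(reversed(groups))


for n in ("5555555", "5555555555", "18005555555", "(800) 555-5555"):
    print(group_from_back(n))
```

Unlike the int()-based versions, this keeps leading zeros, since the digits are never converted to a number.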
In my case, I needed to get a phone pattern like "*** *** ***" by country.
So I reused the phonenumbers package in our project:
from phonenumbers import country_code_for_region, format_number, PhoneMetadata, PhoneNumberFormat, parse as parse_phone
import re
def get_country_phone_pattern(country_code: str):
    mobile_number_example = PhoneMetadata.metadata_for_region(country_code).mobile.example_number
    formatted_phone = format_number(parse_phone(mobile_number_example, country_code), PhoneNumberFormat.INTERNATIONAL)
    without_country_code = " ".join(formatted_phone.split()[1:])
    return re.sub(r"\d", "*", without_country_code)
get_country_phone_pattern("KG") # *** *** ***
