Identifying dates in strings using NLTK - python

I'm trying to identify whether a date occurs in an arbitrary string. Here's my code:
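import nltk
txts = ['Submitted on 1st January',
        'Today is 1/3/15']
def chunk(t):
    w_tokens = nltk.word_tokenize(t)  # split the sentence into word tokens
    pt = nltk.pos_tag(w_tokens)       # part-of-speech tag each token
    ne = nltk.ne_chunk(pt)            # named-entity chunking on the tagged tokens
    print(ne)
for t in txts:
    print(t)
    chunk(t)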
The output I'm getting is
Submitted on 1st January
(S (GPE Submitted/NNP) on/IN 1st/CD January/NNP)
Today is 1/3/15
(S Today/NN is/VBZ 1/3/15/CD)
Clearly the dates are not being tagged. Does anyone know how to have dates tagged?
Thanks

I took the date example 1/1/70 from your comment, but this regex will also find dates that are formatted differently, such as 1970/01/20 or 2-21-79:
import re
x = 'asdfasdf sdf5sdf asd78fsadf 1/1/70 dfsdg fghdfgh 1970/01/20 gfh5fghh sdfgsdg 2-21-79 sdfgsdgf'
print(re.findall(r'\d+\S\d+\S\d+', x))
Output:
['1/1/70', '1970/01/20', '2-21-79']
Or, for a month name followed by a day:
y = 'Asdfasdf Ddf5sdf asd78fsadf Jan 3 dfsdg fghdfgh February 10 sdfgsdgf'
print(re.findall(r'[A-Z]\w+\s\d+', y))
Output:
['Jan 3', 'February 10']
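The pattern \d+\S\d+\S\d+ is deliberately loose, so it would also accept a token such as 12a34b56. A minimal sketch of one way to tighten it: restrict the separators to / and -, then keep only candidates that datetime.strptime can actually parse. The names parse_dates, CANDIDATE and FORMATS are hypothetical, and the format list (month first for two-digit years) is an assumption you would adjust for your data.

```python
import re
from datetime import datetime

# Candidate tokens: digits separated by "/" or "-" (assumed separators).
CANDIDATE = re.compile(r"\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b")

# Assumed formats, tried in order; month-first is a guess, not a rule.
FORMATS = ("%m/%d/%y", "%Y/%m/%d", "%m-%d-%y")

def parse_dates(text):
    """Return (token, date) pairs for every candidate that parses as a date."""
    found = []
    for token in CANDIDATE.findall(text):
        for fmt in FORMATS:
            try:
                found.append((token, datetime.strptime(token, fmt).date()))
                break  # stop at the first format that fits
            except ValueError:
                continue  # not this format, try the next one
    return found

x = 'asdfasdf sdf5sdf asd78fsadf 1/1/70 dfsdg 1970/01/20 gfh5fghh 2-21-79 sdfgsdgf 99/99/99'
for token, parsed in parse_dates(x):
    print(token, parsed)
```

A string like 99/99/99 matches the regex but is dropped because no format accepts it. Ambiguous inputs such as 1/3/15 are read as whatever the first fitting format says.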

NLTK will not detect dates by itself, but combined with Stanford's Named Entity Tagger it will. It can be difficult to find a set of instructions that works, so here are a couple of links:
Stanford tagger site - look for downloads: https://nlp.stanford.edu/software/CRF-NER.shtml
Stanford tagger API - http://www.nltk.org/api/nltk.tag.html#nltk.tag.stanford.StanfordTagger
Sorry, the linking wouldn't work for these last two.
Here is the code I used:
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
# Adjust both paths to wherever the Stanford NER download was unpacked
stanfordClassifier = '/path/to/classifier/classifiers/english.muc.7class.distsim.crf.ser.gz'
stanfordNerPath = '/path/to/jar/stanford-ner-2017-06-09/stanford-ner.jar'
st = StanfordNERTagger(stanfordClassifier, stanfordNerPath, encoding='utf8')
result = st.tag(word_tokenize("The date is October 13, 2017"))
print(result)

NLTK's ne_chunk() does not recognize dates by default. You'll need timex.py, which you can obtain from nltk_contrib.
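If all you need is a yes/no answer to "does this string contain a date?", a few regular expressions may be enough for the two strings in the question. This is only a rough sketch with no external dependencies: contains_date, MONTHS, DAY and PATTERNS are hypothetical names, and the patterns cover only English month names and slash- or dash-separated digits.

```python
import re

# English month names and common abbreviations (an assumption about the input).
MONTHS = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|"
          r"July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|"
          r"Nov(?:ember)?|Dec(?:ember)?)")
DAY = r"\d{1,2}(?:st|nd|rd|th)?"  # 1, 1st, 22nd, ...

PATTERNS = [
    re.compile(r"\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b"),                    # 1/3/15
    re.compile(r"\b" + DAY + r"\s+" + MONTHS + r"\b", re.IGNORECASE),    # 1st January
    re.compile(r"\b" + MONTHS + r"\s+" + DAY + r"\b", re.IGNORECASE),    # October 13
]

def contains_date(text):
    """True if any of the assumed date patterns occurs in the text."""
    return any(p.search(text) for p in PATTERNS)

for t in ['Submitted on 1st January', 'Today is 1/3/15', 'No date here']:
    print(t, '->', contains_date(t))
```

Because the match is case-insensitive, a phrase like "may 5" would count as a date. That is the usual trade-off of a pattern-based check against a trained tagger.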

Related

How to find a date at a specific word using regex

Sentence : "I went to hospital and admitted. Date of admission: 12/08/2019 and surgery of Date of surgery: 15/09/2015. Date of admission: 12/05/2018 is admitted Raju"
keyword: "Date of admission:"
Required solution: 12/08/2019,12/05/2018
Is there any way to get only the dates that follow "Date of admission:"?
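I was unable to reproduce the result in the answer by @Ders. Plus, I think .findall() is more appropriate here anyway, since it returns every match:
import re
s = "I went to hospital and admitted. Date of admission: 12/08/2019 and surgery of Date of surgery: 15/09/2015. Date of admission: 12/05/2018 is admitted Raju"
pattern = re.compile(r"Date of admission: (\d{2}/\d{2}/\d{4})")
print(pattern.findall(s))
# ['12/08/2019', '12/05/2018']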
Use a capturing group. If the regex matches, you can get the contents of the group.
import re
p = re.compile(r"Date of admission: (\d{2}/\d{2}/\d{4})")
# search() scans the whole string; match() only matches at the very start of s
m = p.search(s)
date = m.group(1)
# 12/08/2019
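The two answers hard-code the keyword into the pattern. If the keyword varies, one option is to build the pattern from it, as in this small sketch. dates_after is a hypothetical helper, and the dd/mm/yyyy shape is assumed from the question's sample.

```python
import re

def dates_after(keyword, text):
    """Return every dd/dd/dddd date that directly follows the keyword."""
    # re.escape keeps any regex metacharacters in the keyword literal.
    pattern = re.compile(re.escape(keyword) + r"\s*(\d{2}/\d{2}/\d{4})")
    return pattern.findall(text)

s = ("I went to hospital and admitted. Date of admission: 12/08/2019 and surgery of "
     "Date of surgery: 15/09/2015. Date of admission: 12/05/2018 is admitted Raju")
print(dates_after("Date of admission:", s))  # ['12/08/2019', '12/05/2018']
print(dates_after("Date of surgery:", s))    # ['15/09/2015']
```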

Extract first word from string in Python

I have Python strings that follow one of two formats:
"#gianvitorossi/ FALL 2012 #highheels ..."
OR:
"#gianvitorossi FALL 2012 #highheels ..."
I want to extract just the #gianvitorossi portion.
I'm trying the following:
...
company = p['edge_media_to_caption']['edges'][0]['node']['text']
company = company.replace('/','')
company = company.replace('\t','')
company = company.replace('\n','')
c = company.split(' ')
company = c[0]
This works for some of the names. However, in the example below:
My code is returning #gianvitorossi FALL rather than just #gianvitorossi as expected.
You should split on the '/' character:
company = "mystring"
c = company.split('/')
company = c[0]
Well, it worked on my machine. For trailing characters such as a slash, you can use rstrip(your_symbols).
You could do that using a regular expression. Here is what you could do:
import re
text1 = "#gianvitorossi/ FALL 2012 #highheels ..."
text2 = "#gianvitorossi FALL 2012 #highheels ..."
patt = "#[A-Za-z]+"
print(re.findall(patt, text1))
If your text might include numbers, you could modify the code as follows:
import re
text1 = "#gianvitorossi/ FALL 2012 #highheels ..."
text2 = "#gianvitorossi FALL 2012 #highheels ..."
patt = "#[A-Za-z0-9]+"
print(re.findall(patt, text1))
You can get it by using split and replace, which should be enough if the requirements above are exhaustive:
s.split(' ')[0].replace('/','')
An example:
s = ["#gianvitorossi/ FALL 2012 #highheels ...","#gianvitorossi FALL 2012 #highheels ..."]
for i in s:
    print(i.split(' ')[0].replace('/',''))
Output:
#gianvitorossi
#gianvitorossi
If you don't want to use regular expressions, you could use this:
original = "#gianvitorossi/ FALL 2012 #highheels ..."
extract = original.split(' ')[0]
if extract[-1] == "/":
    extract = extract[:-1]
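The question does not show the caption that fails. One plausible cause, which is only a guess, is that the separator after the name is not a plain space (for example a non-breaking space), so split(' ') never splits there. A minimal sketch that splits on any whitespace instead, with first_token as a hypothetical helper name:

```python
def first_token(caption):
    """Return the first whitespace-delimited token without a trailing slash."""
    parts = caption.split()  # no argument: splits on any Unicode whitespace
    return parts[0].rstrip("/") if parts else ""

print(first_token("#gianvitorossi/ FALL 2012 #highheels ..."))
print(first_token("#gianvitorossi\u00a0FALL 2012 #highheels ..."))  # non-breaking space
print(first_token("#gianvitorossi\tFALL 2012 #highheels ..."))      # tab
```

All three calls print #gianvitorossi. Calling split() with no argument also removes the need for the separate replace('\t', '') and replace('\n', '') steps.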

Cleaning a dataset and removing special characters in python

I am fairly new to all of this so apologies in advance.
I've got a dataset (csv). One column contains strings with whole sentences. These sentences contain misinterpreted UTF-8 characters like â€™ and emojis like ðŸ¥³.
So the dataframe (df) looks kind of like this:
date text
0 Jul 31 2020 itâ€™s crazy. i hope post-covid we can get it doneðŸ¥³
1 Jul 31 2020 just sayinâ€™ ...
2 Jul 31 2020 nba to hold first games in 'bubble' amid pandemic
The goal is to do a sentiment analysis on the texts.
Would it be best to remove ALL special characters like , . ( ) [ ] + | - to do the sentiment analysis?
How do I do that, and how do I also remove the misinterpreted UTF-8 characters like â€™?
I've tried it myself by adapting some code I found to my problem.
This resulted in the following piece of code, which seems to do absolutely nothing. The characters like â€™ are still in the text.
spec_chars = ["â€¦","ðŸ¥³"]
for char in spec_chars:
    df['text'] = df['text'].str.replace(char, ' ')
I'm a bit lost here.
I appreciate any help!
You can repair the character encoding like this, where x is one of the sentences from the original post:
x = 'itâ€™s crazy. i hope post-covid we can get it doneðŸ¥³'
x.encode('windows-1252').decode('utf8')
The result is 'it’s crazy. i hope post-covid we can get it done🥳'
As jsmart stated, use .encode and .decode. Since the column is a Series, you'd use .str to access its values as strings and apply the methods.
As for the text sentiment, look at NLTK and its examples of sentiment analysis.
import pandas as pd
df = pd.DataFrame([['Jul 31 2020','itâ€™s crazy. i hope post-covid we can get it doneðŸ¥³'],
                   ['Jul 31 2020','just sayinâ€™ ...'],
                   ['Jul 31 2020',"nba to hold first games in 'bubble' amid pandemic"]],
                  columns = ['date','text'])
df['text'] = df['text'].str.encode('windows-1252').str.decode('utf8')
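The encode/decode round trip fails on rows that were never damaged and contain characters outside windows-1252, such as a correctly stored emoji. A cautious sketch that leaves such rows untouched is shown below. fix_mojibake is a hypothetical helper, and it assumes the damage came from UTF-8 bytes being read as windows-1252.

```python
def fix_mojibake(text):
    """Undo UTF-8-read-as-windows-1252 damage; return the text unchanged if that fails."""
    try:
        return text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not encodable, or not valid UTF-8 afterwards: assume the text was fine already.
        return text

# Simulate the damage described in the question, then repair it.
broken = "it’s crazy 🥳".encode("utf-8").decode("windows-1252")
print(broken)
print(fix_mojibake(broken))
print(fix_mojibake("done🥳"))  # already correct, returned as-is
```

With pandas this could be applied per row, for example df['text'].apply(fix_mojibake). A row that mixes damaged and undamaged characters would be returned unchanged, so the function does not cover every case.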
Try this. It's been quite helpful for me.
df['clean_text'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word.isalnum()]))
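Note that the isalnum() filter drops every token that contains any punctuation, so a word like "crazy." disappears entirely. If the goal is to remove the characters but keep the words, a regex substitution is one alternative. This is a sketch with a hypothetical strip_punctuation helper; whether punctuation should be removed at all before sentiment analysis depends on the tool you use.

```python
import re

def strip_punctuation(text):
    """Replace anything that is not a word character or whitespace, then tidy the spacing."""
    no_punct = re.sub(r"[^\w\s]", " ", text)
    return " ".join(no_punct.split())

print(strip_punctuation("it's crazy. i hope post-covid we can get it done"))
print(strip_punctuation("nba to hold first games in 'bubble' amid pandemic"))
```

Replacing with a space splits "it's" into "it s" and "post-covid" into "post covid". Replacing with an empty string would join them instead, so pick whichever suits the analysis.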

Replace word between two substrings (keeping other words)

I'm trying to replace a word (e.g. on) if it falls between two substrings (e.g. <temp> & </temp>) however other words are present which need to be kept.
string = "<temp>The sale happened on February 22nd</temp>"
The desired string after the replace would be:
Result = <temp>The sale happened {replace} February 22nd</temp>
I've tried using regex, but I've only been able to figure out how to replace everything lying between the two <temp> tags (because of the .*?).
result = re.sub('<temp>.*?</temp>', '{replace}', string, flags=re.DOTALL)
However, "on" may appear later in the string, outside <temp></temp>, and I wouldn't want to replace that.
re.sub('(<temp>.*?) on (.*?</temp>)', lambda x: x.group(1)+" <replace> "+x.group(2), string, flags=re.DOTALL)
Output:
<temp>The sale happened <replace> February 22nd</temp>
Edit:
Changed the regex based on suggestions by Wiktor and HolyDanna.
P.S: Wiktor's comment on the question provides a better solution.
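Another way to keep the replacement confined to the tagged region is to match each <temp>...</temp> block and run a second, inner substitution in a callback. This is a sketch of that general technique, not necessarily what Wiktor's comment proposed; replace_inside_temp is a hypothetical name.

```python
import re

def replace_inside_temp(text, word="on", replacement="{replace}"):
    """Replace whole-word occurrences of `word`, but only inside <temp>...</temp>."""
    word_re = re.compile(r"\b" + re.escape(word) + r"\b")

    def fix_block(match):
        # match.group(0) is one complete <temp>...</temp> block.
        return word_re.sub(lambda _: replacement, match.group(0))

    return re.sub(r"<temp>.*?</temp>", fix_block, text, flags=re.DOTALL)

string = "Turn on. <temp>The sale happened on February 22nd</temp> Later on."
print(replace_inside_temp(string))
```

Unlike the two-group pattern above, this replaces every "on" inside a block, not just the first, and leaves text outside the tags alone.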
Try lxml:
from lxml import etree
root = etree.fromstring("<temp>The sale happened on February 22nd</temp>")
root.text = root.text.replace(" on ", " {replace} ")
# tostring() returns bytes in Python 3, so decode before printing
print(etree.tostring(root, pretty_print=True).decode())
Output:
<temp>The sale happened {replace} February 22nd</temp>

How to extract text before a specific keyword in python?

import re
col4="""May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"""
b=re.findall(r'\sCiteSeerX',col4)
print(b)
I have to print "May god bless our families studied". I'm using Python regular expressions to extract the file name, but I'm only getting CiteSeerX as output. I'm doing this on a very large dataset, so I would like to use regular expressions, but if there is a more efficient and faster way, please point it out.
I also want the last year, 2004, as output.
I'm new to regular expressions and I know that my implementation above is wrong, but I can't find a correct one. This is a very naive question. I'm sorry, and thank you in advance.
Here is an answer that doesn't use regex.
>>> s = "now is the time for all good men"
>>> s.find("all")
20
>>> s[:20]
'now is the time for '
>>>
If the structure of all your data is similar to the sample you provided, this should get you going:
import re
data = re.findall(r"(.*?) CiteSeerX.*(\d{4})$", col4)
if data:
    # we have a match: unpack the two capturing groups
    title, year = data[0]
    print(title, year)
else:
    print("Unable to parse the string")
# Output: May god bless our families studied. 2004
This snippet extracts everything before CiteSeerX as the title and the last 4 digits as the year (again, assuming that the structure is similar for all the data you have). The parentheses mark the capturing groups for the parts that we are interested in.
Update:
For the case where there is metadata following the year of publication, use the following regular expression:
import re
YEAR = r"\d{4}"
DATE = r"\d\d\d\d-\d\d-\d\d"
def parse_citation(s):
    regex = r"(.*?) CiteSeerX\s+{date} {date} ({year}).*$".format(date=DATE, year=YEAR)
    data = re.findall(regex, s)
    if data:
        # we have a match: return the first (title, year) tuple
        return data[0]
    else:
        return None
c1 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"""
c2 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004 application/pdf text http //citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.1483 http //www.biomedcentral.com/content/pdf/1471-2350-5-20.pdf en Metadata may be used without restrictions as long as the oai identifier remains attached to it."""
print(parse_citation(c1))
print(parse_citation(c2))
# Output:
# ('May god bless our families studied.', '2004')
# ('May god bless our families studied.', '2004')
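Since the question mentions a very large dataset, a plain string-method version in the spirit of the str.find answer may be worth timing against the regex. The sketch below assumes the same layout as the samples (title, then CiteSeerX, then two dates, then the year); parse_citation_plain is a hypothetical name, and no speed claim is made without measuring on your own data.

```python
def parse_citation_plain(s, keyword=" CiteSeerX"):
    """Return (title, year) using only string methods, or None if the layout differs."""
    title, sep, rest = s.partition(keyword)
    if not sep:
        return None  # keyword not found
    fields = rest.split()
    if len(fields) < 3:
        return None  # expected two dates followed by the year
    return title, fields[2]

c1 = "May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"
c2 = c1 + " application/pdf text"
print(parse_citation_plain(c1))
print(parse_citation_plain(c2))
```

Unlike the regex, this does not check that the third field really looks like a year, so add a check such as fields[2].isdigit() if the data is less regular.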
