Extracting string after pattern - python

I have a series of URLs:
www.domain.com/calendar.php?month=may.2019
www.domain.com/calendar.php?month=april.2019
www.domain.com/calendar.php?month=march.2019
www.domain.com/calendar.php?month=feb.2019
...
...
...
www.domain.com/calendar.php?month=feb.2007
I want to extract the year after the month.
What I'm looking for:
2019
2019
...
...
2007
and save them into another column.
Here's what I have:
data["urls"].str.extract('(?<=month=).*$')

Fix your code:
df["urls"].str.extract(r'(?<=month=).*\.(\d{4})$')
If you can trust that all the URLs have the same pattern, then these should work too.
split
df["urls"].str.rsplit('.', n=1).str[-1]
slice
df["urls"].str[-4:]

Here we can also use simpler expressions without look-arounds, such as:
.+month=.+\.([0-9]{4})
or:
month=.+\.([0-9]{4})
or:
.+month=.+\.(.+)
or:
month=.+\.(.+)
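For reference, a runnable sketch of all three approaches on a tiny made-up frame (the column name `urls` is taken from the question):

```python
import pandas as pd

# Hypothetical sample of the URLs from the question.
df = pd.DataFrame({"urls": [
    "www.domain.com/calendar.php?month=may.2019",
    "www.domain.com/calendar.php?month=feb.2007",
]})

# 1) Regex: capture the four digits after the last dot.
df["year_re"] = df["urls"].str.extract(r"(?<=month=).*\.(\d{4})$", expand=False)

# 2) Split on the last dot and take the final piece.
df["year_split"] = df["urls"].str.rsplit(".", n=1).str[-1]

# 3) Slice the last four characters.
df["year_slice"] = df["urls"].str[-4:]

print(df[["year_re", "year_split", "year_slice"]])
```

All three return the year as a string; cast with `.astype(int)` if you need numbers.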

Related

Cleaning a dataset and removing special characters in python

I am fairly new to all of this so apologies in advance.
I've got a dataset (csv). One column contains strings with whole sentences. These sentences contain misinterpreted UTF-8 characters like â€™ and emojis like 🥳.
So the dataframe (df) looks kind of like this:
date text
0 Jul 31 2020 itâ€™s crazy. i hope post-covid we can get it doneðŸ¥³
1 Jul 31 2020 just sayinâ€™ ...
2 Jul 31 2020 nba to hold first games in 'bubble' amid pandemic
The goal is to do a sentiment analysis on the texts.
Would it be best to remove ALL special characters like , . ( ) [ ] + | - to do the sentiment analysis?
How do I do that, and how do I also remove the misinterpreted UTF-8 characters like â€™?
I've tried it myself using some code I found and adapting it to my problem.
This resulted in this piece of code, which seems to do absolutely nothing. The characters like â€™ are still in the text.
spec_chars = ["…","🥳"]
for char in spec_chars:
    df['text'] = df['text'].str.replace(char, ' ')
I'm a bit lost here.
I appreciate any help!
You can change the character encoding like this. x is one of the sentences in the original post.
x = 'itâ€™s crazy. i hope post-covid we can get it doneðŸ¥³'
x.encode('windows-1252').decode('utf8')
The result is 'it’s crazy. i hope post-covid we can get it done🥳'
As jsmart stated, use .encode and .decode. Since the column is a Series, you'd use .str to access the values of the Series as strings and apply the methods.
As far as the text sentiment goes, look at NLTK, and take a look at its examples of sentiment analysis.
import pandas as pd
df = pd.DataFrame([['Jul 31 2020','itâ€™s crazy. i hope post-covid we can get it doneðŸ¥³'],
                   ['Jul 31 2020','just sayinâ€™ ...'],
                   ['Jul 31 2020',"nba to hold first games in 'bubble' amid pandemic"]],
                  columns = ['date','text'])
df['text'] = df['text'].str.encode('windows-1252').str.decode('utf8')
Try this. It's quite helpful for me.
df['clean_text'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word.isalnum()]))
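To see why the encode/decode round trip works, here is a minimal sketch, assuming the text is UTF-8 that was mis-decoded as Windows-1252:

```python
# 'â€™' is how the UTF-8 bytes for a right single quote (U+2019)
# look when mis-decoded as Windows-1252.
broken = 'just sayinâ€™ ...'

# Re-encode with the wrong codec to recover the original raw bytes,
# then decode those bytes correctly as UTF-8.
fixed = broken.encode('windows-1252').decode('utf8')
print(fixed)  # just sayin’ ...
```

This only works if every character in the string survives the round trip; if some rows are already clean UTF-8, encoding them as Windows-1252 can raise a UnicodeEncodeError.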

How to apply regex for multiple phrases on a dataframe column?

Hello, I have a dataframe where I want to remove a specific set of phrases 'fwd', 're', 'RE' from every row that starts with or contains these phrases. The issue I am facing is that I do not know how to apply a regex for each case.
my dataframe looks like this:
summary
0 Fwd: Please look at the attached documents and take action
1 NSN for the ones who care
2 News for all team members
3 Fwd:RE:Re: Please take action on the action needed items
4 Fix all the mistakes please
5 Fwd:Re: Take action on the attachments in this email
6 Fwd:RE: Action is required
I want a result dataframe like this:
summary
0 Please look at the attached documents and take action
1 NSN for the ones who care
2 News for all team members
3 Please take action on the action needed items
4 Fix all the mistakes please
5 Take action on the attachments in this email
6 Action is required
To get rid of 'Fwd' I used df['msg'].str.replace(r'^Fwd: ','')
If they can be anywhere in the string, you could use a repeating pattern:
^(?:(?:Fwd|R[eE]):)+\s*
^ Start of string
(?: Non capturing group
(?:Fwd|R[eE]): match either Fwd, Re or RE
)+ Close non capturing group and repeat 1+ times
\s* Match trailing whitespaces
In the replacement use an empty string.
You could also make the pattern case insensitive using re.IGNORECASE and use (?:fwd|re) if you want to match all possible variations.
For example
str.replace(r'^(?:(?:Fwd|R[eE]):)+\s*', '', regex=True)
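Putting the pattern to work on a small made-up frame (`regex=True` is needed in pandas 2.0+, where plain-string replacement is the default):

```python
import pandas as pd

df = pd.DataFrame({"summary": [
    "Fwd: Please look at the attached documents and take action",
    "Fwd:RE:Re: Please take action on the action needed items",
    "NSN for the ones who care",
]})

# Strip any run of Fwd:/Re:/RE: prefixes at the start of the string,
# plus the whitespace that follows them.
df["clean"] = df["summary"].str.replace(r"^(?:(?:Fwd|R[eE]):)+\s*", "", regex=True)
print(df["clean"].tolist())
```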
The key concept in this case I believe is using the | operator which works as Either or Or for the pattern. It's very useful for these cases.
This is how I would solve the problem:
import pandas as pd

df = pd.DataFrame({'index':[0,1,2,3,4,5,6,7],
                   'summary':['Fwd: Please look at the attached documents and take action ',
                              'NSN for the ones who care',
                              'News for all team members ',
                              'Fwd:RE:Re: Please take action on the action needed items',
                              'Fix all the mistakes please ',
                              'Fwd:Re: Take action on the attachments in this email',
                              'Fwd:RE: Action is required',
                              'Redemption!']})
df['clean'] = df['summary'].str.replace(r'^Fwd:|R[eE]:\s*', '', regex=True)
print(df)
Output:
index ... clean
0 0 ... Please look at the attached documents and tak...
1 1 ... NSN for the ones who care
2 2 ... News for all team members
3 3 ... Please take action on the action needed items
4 4 ... Fix all the mistakes please
5 5 ... Take action on the attachments in this email
6 6 ... Action is required
7 7 ... Redemption!

Regex to extract date and time from email text

I've got a file that has a ton of text in it. Some of it looks like this:
X-DSPAM-Processed: Fri Jan 4 18:10:48 2008
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39771
Author: louis#media.berkeley.edu
Date: 2008-01-04 18:08:50 -0500 (Fri, 04 Jan 2008)
New Revision: 39771
Modified:
bspace/site-manage/sakai_2-4-x/site-manage-tool/tool/src/bundle/sitesetupgeneric.properties
bspace/site-manage/sakai_2-4-x/site-manage-tool/tool/src/java/org/sakaiproject/site/tool/SiteAction.java
Log:
BSP-1415 New (Guest) user Notification
I need to pull out only dates that follow this pattern:
2008-01-04 18:08:50 -0500
Here's what I tried:
import re
text = open('mbox-short.txt')
for line in text:
    dates = re.compile('\d{4}(?P<sep>[-/])\d{2}(?P=sep)\d{2}\s\d{2}:\d{2}:]\d{2}\s[-/]\d{4}')
    print(dates)
text.close()
The return I got was hundreds of:
\d{4}(?P<sep>[-/])\d{2}(?P=sep)\d{2}\s\d{2}:\d{2}:]\d{2}\s[-/]\d{4}
Two things:
First, the regex itself:
regex = re.compile(r'\b\d{4}[-/]\d{2}[-/]\d{2}\s\d{2}:\d{2}:\d{2}\s[-+]\d{4}\b')
Secondly, you need to call regex.findall(file) where file is a string:
>>> regex.findall(file)
['2008-01-04 18:08:50 -0500']
re.compile() produces a compiled regular expression object. findall is one of several methods of this object that let you do the actual searching/matching/finding.
Lastly: you're currently using a named capturing group ((?P<sep>[-/])). From your question, "I need to pull out only dates that follow this pattern," it doesn't seem like you need it. You want to extract the entire expression, not capture the "separators", which is what capturing groups are designed for.
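A quick illustration of the difference, using a line from the sample text:

```python
import re

line = 'Date: 2008-01-04 18:08:50 -0500 (Fri, 04 Jan 2008)'

# With a capturing group, findall returns only the group's contents.
with_group = re.findall(r'\d{4}(?P<sep>[-/])\d{2}(?P=sep)\d{2}', line)
print(with_group)  # ['-']

# Without any groups, findall returns the whole match.
no_group = re.findall(r'\d{4}[-/]\d{2}[-/]\d{2}\s\d{2}:\d{2}:\d{2}\s[-+]\d{4}', line)
print(no_group)  # ['2008-01-04 18:08:50 -0500']
```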
Full code block:
>>> import re
>>> regex = re.compile(r'\b\d{4}[-/]\d{2}[-/]\d{2}\s\d{2}:\d{2}:\d{2}\s[-+]\d{4}\b')
>>> with open('mbox-short.txt') as f:
... print(regex.findall(f.read()))
...
['2008-01-04 18:08:50 -0500']
Here's another solution.
import re
numberExtractRegex = re.compile(r'(\d\d\d\d[-]\d\d[-]\d\d\s\d\d[:]\d\d[:]\d\d\s[-]\d\d\d\d)')
print(numberExtractRegex.findall('Date: 2008-01-04 18:08:50 -0500 (Fri, 04 Jan 2008), Date: 2010-01-04 18:08:50 -0500 (Fri, 04 Jan 2010)'))

Python Zip Code

I am very new to Python and struggling to execute what I need.
I need to extract Zip codes out of the string "concat".
I was researching regex, but I am struggling on the functionality.
import pandas as pd
import re
from pandas import ExcelWriter
I imported the CSV (with an encoding to handle text-upload issues), established the columns in a data frame, and made concat its own series:
Client = pd.read_csv("CLZIPrevamp3.csv",encoding = "ISO-8859-1")
Client = Client[["clnum","concat"]]
clientzip = Client['concat']
CSV Examples
client number client add
40008 All, EdNULLNULLNULLNULLNULL
40009 EC, Inc. 4200 Exec-ParkwayS, MO 63141Attn: John Smith
40010 AWBWA, Inc. 2200 Northhighschool,VA 21801-7824Attn: TerryLongNULL NULL
Example purposes
The zip codes may also be international, 4-digit or 5-digit, and not all fields contain a zip code.
I would then want to rewrite the results back into my Client dataframe as a third column for matching answers
Is the ZIP always a US zip code? 5 digits at the end of a field?
Then slice it off.
>>> 'smithjonllcRichmondVa23220'[-5:]
'23220'
If you have 4 digits, then you might want the regex
>>> import re
>>> re.findall(r'\d{4,5}$', 'smithjonllcRichmondVa3220')[0]
'3220'
For "long zip codes" like 21801-7824 it gets more complex. And when you are handed a CSV file whose columns themselves contain commas (see the example)
AWBWA, Inc. 2200 Northhighschool,VA
you should just ask for a different data format, because good luck parsing that.
As far as pandas is concerned, you can apply() a function over a column.
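For instance, a rough sketch of apply() with a helper that searches each concat value; the pattern and column names are assumptions based on the sample rows, where the code sits just before "Attn":

```python
import pandas as pd
import re

# Hypothetical frame mirroring the sample rows from the question.
client = pd.DataFrame({
    "clnum": [40009, 40010],
    "concat": [
        "EC, Inc. 4200 Exec-ParkwayS, MO 63141Attn: John Smith",
        "AWBWA, Inc. 2200 Northhighschool,VA 21801-7824Attn: TerryLongNULL NULL",
    ],
})

def find_zip(text):
    # Capture a 4-6 digit code with an optional -NNNN suffix just before "Attn".
    m = re.search(r"(\d{4,6}(?:-\d{4})?)Attn", text)
    return m.group(1) if m else None

# apply() runs the helper over every value and yields the third column.
client["zip"] = client["concat"].apply(find_zip)
print(client["zip"].tolist())  # ['63141', '21801-7824']
```

Rows without a match simply get None, which pandas displays as NaN.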
I'll provide 2 examples.
To be honest, if your CSV is consistently formatted in the way you mentioned in your example, you can find the zip codes using a simple, albeit brittle, regex like this (it captures all non-space characters before the string "Attn", which seems to be a constant in your data):
def zipcodes():
    import re
    csv = '''client number client add
40008 All, EdNULLNULLNULLNULLNULL
40009 EC, Inc. 4200 Exec-ParkwayS, MO 63141Attn: John Smith
40010 AWBWA, Inc. 2200 Northhighschool,VA 21801-7824Attn: TerryLongNULL NULL'''
    zips = re.findall(r'([\S]+)Attn', csv)
    print(zips)
OUTPUT:
>>> zipcodes()
['63141', '21801-7824']
Now if you want something slightly better, which discriminates by ignoring numbers that start a new line, you can use a negative lookbehind like so (NOTE: Python's look-around documentation is not the best... sheesh). What the pattern below says is: capture a string of 5 to 6 digits, optionally followed by a dash and any further digits, but only if those digits are not preceded by a newline character.
def zipcodes():
    import re
    csv = '''client number client add
40008 All, EdNULLNULLNULLNULLNULL
40009 EC, Inc. 4200 Exec-ParkwayS, MO 63141Attn: John Smith
40010 AWBWA, Inc. 2200 Northhighschool,VA 21801-7824Attn: TerryLongNULL NULL'''
    zips = re.findall(r'(?<!\n)[\d]{5,6}[\-]?[\d]*', csv)
    print(zips)
OUTPUT:
>>> zipcodes()
['63141', '21801-7824']
Hope this helps.

Extract words between the 2nd and the 3rd comma

I am total newbie to regex, so this question might seem trivial to many of you.
I would like to extract the words between the second and the third comma, like in the sentence:
Chateau d'Arsac, Bordeaux blanc, Cuvee Celine, 2012
I have tried : (?<=,\s)[^,]+(?=,) but this doesn't return what I want...
import re
data = "Chateau d'Arsac, Bordeaux blanc, Cuvee Celine, 2012"
print(re.match(r".*?,.*?,\s*(.*?),.*", data).group(1))
Output
Cuvee Celine
But for this simple task, you can simply split the strings based on , like this
data.split(",")[2].strip()
In this case I see easier to use a simple split by comma.
>>> s = "Chateau d'Arsac, Bordeaux blanc, Cuvee Celine, 2012"
>>> s.split(',')[2]
' Cuvee Celine'
Why not just split the string by commas using str.split()?
data.split(",")[2]
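Two more variants, in case the fields carry uneven whitespace: re.split with a whitespace-tolerant delimiter, or an anchored version of the original look-around attempt that skips the first two commas explicitly:

```python
import re

data = "Chateau d'Arsac, Bordeaux blanc, Cuvee Celine, 2012"

# Split on a comma plus any surrounding whitespace, then index the field.
fields = re.split(r"\s*,\s*", data)
print(fields[2])  # Cuvee Celine

# Or anchor the pattern: skip two "anything-then-comma" groups,
# then capture everything up to the next comma.
m = re.search(r"^(?:[^,]*,){2}\s*([^,]+)", data)
print(m.group(1))  # Cuvee Celine
```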
