Python Zip Code - python

I am very new to Python and struggling to execute what I need.
I need to extract Zip codes out of the string "concat".
I was researching regex, but I am struggling on the functionality.
import pandas as pd
import re
from pandas import ExcelWriter
I imported the CSV, encoded text type of upload issues of string, established columns with data frame and made concat its own df
Client = pd.read_csv("CLZIPrevamp3.csv",encoding = "ISO-8859-1")
Client = Client[["clnum","concat"]]
clientzip = Client['concat']
CSV Examples
client number client add
40008 All, EdNULLNULLNULLNULLNULL
40009 EC, Inc. 4200 Exec-ParkwayS, MO 63141Attn: John Smith
40010 AWBWA, Inc. 2200 Northhighschool,VA 21801-7824Attn: TerryLongNULL NULL
Example purposes
Zip Codes will also match international Zip codes, 4 digit and 5 digit zip codes and all fields do not have zip codes
I would then want to rewrite the results back into my Client dataframe as a third column for matching answers

Is the ZIP always a US zip code? 5 digits at the end of a field?
Then slice it off.
>>> 'smithjonllcRichmondVa23220'[-5:]
'23220'
If you have 4 digits, then you might want the regex
>>> import re
>>> re.findall('\d{4,5}$', 'smithjonllcRichmondVa3220')[0]
'3220'
For "long zip codes" like 21801-7824, it gets more complex, and it is situations when you are handed a CSV file when the columns themselves contain commas (see example)
AWBWA, Inc. 2200 Northhighschool,VA
that you need to just ask for a different data format because good luck parsing that.
As far as pandas is concerned, you can apply() a function over a column.

I'll provide 2 examples.
To be honest, if your CSV is consistently formatted in the way you mentioned in your example you can find the zipcodes using a simple albeit finite regex like this (It captures all non-space characters before the string "Attn" which seems to be a theme in your read string):
>>> def zipcodes():
import re
csv = '''client number client add
40008 All, EdNULLNULLNULLNULLNULL
40009 EC, Inc. 4200 Exec-ParkwayS, MO 63141Attn: John Smith
40010 AWBWA, Inc. 2200 Northhighschool,VA 21801-7824Attn: TerryLongNULL NULL'''
zips = re.findall('([\S]+)Attn', csv)
print(zips)
OUTPUT:
>>> zipcodes()
['63141', '21801-7824']
...
...
Now if you want something slightly better, which discriminates by ignoring numbers that start a new line you can use a lookahead example like so (NOTE: Python's lookahead documentation is not the best... sheesh). What the lookahead below says is 'capture a string of digits in the range of 5 to 6, with 0 or 1 dahses beween them if applicable, potentially followed by any number of digits (in this case 0 or more than 0) but only capture these numbers if they are not preceded by a newline character'
>>> def zipcodes():
import re
csv = '''client number client add
40008 All, EdNULLNULLNULLNULLNULL
40009 EC, Inc. 4200 Exec-ParkwayS, MO 63141Attn: John Smith
40010 AWBWA, Inc. 2200 Northhighschool,VA 21801-7824Attn: TerryLongNULL NULL'''
zips = re.findall('(?<!\n)[\d]{5,6}[\-]?[\d]*', csv)
print(zips)
OUTPUT:
>>> zipcodes()
['63141', '21801-7824']
Hope this helps.

Related

Using RegEx in Python to extract contents

Good evening,
I am very new to Python and RegEx. I have the following sentence:
-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 +100.00 3257 UpAmex Top PM 9:55 +300.00 3257 UpAmex Top PM 9:55 -400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57 (SGD) Amount Details Time here. appear will transactions cashless your All 2022 Feb 15 on made transactions GrabPay points 52 earned points Rewards 475.76 SGD spent Amount 0.24 SGD balance Wallet 2022 Feb 15 Summary statement daily your here
I would like to search for just '-' and the amount after that.
After that, I would like to skip 2 words and extract ALL words if need be in a single group (I will read more about groups but for now i would need in a single group, which i can later use to split and get the words from that string) just before 'Paid'
For instance, I would get
-75.76 ASIA Direct to
-400 PTE AXS to
What would be the regex command? Also, is there a good regex tutorial where I can read up on?
For now I have created one match having 2 groups ie, group1 for the amount and group2 for all the words (that include "to " string also).
Regex:
(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid
You can check the details here: https://regex101.com/r/eUMgdW/1
Python code:
import re
output = re.findall("""(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid""", your_input_string)
for found in output:
print(found)
#('-75.76', 'ASIA DIRECT to ')
#('-400.00', 'PTE AXS to ')
Rather than give you the actual regex, I'll gently nudge you in the right direction. It's more satisfying that way.
"Words" here are seperated by spaces. So what you're searching for is a group of characters (captured), a space, characters again, space, characters, space, then capture everything and end with "PAID". Try to create a regex to do that.
If you'd like to brush up on regex, check out Regex101. It's a web tool to test out regex, along with a debugger and a cheat sheet.

Extract date from a string with a lot of numbers

There seems to be quite a few ways to extract datetimes in various formats from a string. But there seems to be an issue when the string contains many numbers and symbols.
Here is an example:
t = 'Annual Transmission Revenue Requirements and Rates Transmission Owner (Transmission Zone) Annual Transmission Revenue Requirement Network Integration Transmission Service Rate ($/MW-Year) AE (AECO) $136,632,319 $53,775 AEP (AEP) $1,295,660,732 $59,818.14 AP (APS) $128,000,000 $17,895 ATSI (ATSI) $659,094,666 $54,689.39 BC (BGE) $230,595,535 $35,762 ComEd, Rochelle (CE) $702,431,433 $34,515.60 Dayton (DAY) $40,100,000 $13,295.76 Duke (DEOK) $121,250,903 $24,077 Duquesne (DLCO) $139,341,808 $51,954.44 Dominion (DOM) $1,031,382,000 $52,457.21 DPL, ODEC (DPL) $163,224,128 $42,812 East Kentucky Power Cooperative (EKPC) $83,267,903 $24,441 MAIT (METED, PENELEC) $150,858,703 $26,069.39 JCPL $135,000,000 $23,597.27 PE (PECO) $155,439,100 $19,093 PPL, AECoop, UGI (PPL) $435,349,329 $58,865 PEPCO, SMECO (PEPCO) $190,876,083 $31,304.21 PS (PSEG) $1,248,819,352 $130,535.22 Rockland (RECO) $17,724,263 $44,799 TrAILCo $226,652,117.80 n/a Effective June 1, 2018 '
import datefinder
m = datefinder.find_dates(t)
for match in m:
print(match)
Is there a way to smoothly extract the date? I can resort to re for specific formats if no better way exists. From github of datefinder it seems that it was abandoned a year ago.
Although I dont know exactly how your dates are formatted, here's a regex solution that will work with dates separated by '/'. Should work with dates where the months and days are expressed as a single number or if they include a leading zero.
If your dates are separated by hyphens instead, replace the 9th and 18th character of the regex with a hyphen instead of /. (If using the second print statement, replace the 12th and 31st character)
Edit: Added the second print statement with some better regex. That's probably the better way to go.
import re
mystring = r'joasidj9238nlsd93901/01/2021oijweo8939n'
print(re.findall('\d{1,2}\/\d{1,2}\/\d{2,4}', mystring)) # This would probably work in most cases
print(re.findall('[0-1]{0,2}\/[0-3]{0,1}\d{0,1}\/\d{2,4}', mystring)) # This one is probably a better solution. (More protection against weirdness.)
Edit #2: Here's a way to do it with the month name spelled out (in full, or 3-character abbreviation), followed by day, followed by comma, followed by a 2 or 4 digit year.
import re
mystring = r'Jan 1, 2020'
print(re.findall(r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{1,2}\,\s+\d{2,4}',mystring))

regex as a separator to read tables in python (Pandas)

I would like to ask some help to read a text file (Python 2.7, pandas library) that is using "|" as a separator, but you can also find the same character in the records followed by space. The first two rows don't have the problem, but the third one has the separator in between the 6th field TAT Fans | Southern
1. 4_230_0415_99312||||9500|Gedung|||||||||15000|6.11403|102.23061
2. 4_230_0415_99313||||9500|Pakatan|||||||||50450|3.15908|101.71431
3. 4_230_0117_12377||||9990|TAT Fans | Southern||||||||||3.141033333|101.727125
I have been trying to use regex in the separator, but I haven't been able to make it work :
pd.read_table("text_file.txt", sep = "\S+\|\S+")
Can Anyone help me find a solution to my problem?
Many thanks in advance!
You can use "\s?[|]+\s?"
import pandas as pd
pd.read_table("text_file.txt", sep="\s?[|]+\s?") #or "\s?\|+\s?"
Out[18]:
4_230_0415_99312 9500 Gedung 15000 6.11403 102.23061
0 4_230_0415_99313 9500 Pakatan 50450 3.159080 101.714310
1 4_230_0117_12377 9990 TAT Fans Southern 3.141033 101.727125

Using python regex with backreference matches

I have a doubt about regex with backreference.
I need to match strings, I try this regex (\w)\1{1,} to capture repeated values of my string, but this regex only capture consecutive repeated strings; I'm stuck to improve my regex to capture all repeated values, below some examples:
import re
str = 'capitals'
re.search(r'(\w)\1{1,}', str)
Output None
import re
str = 'butterfly'
re.search(r'(\w)\1{1,}', str)
<_sre.SRE_Match object; span=(2, 4), match='tt'>
I would use r'(\w).*\1 so that it allows any repeated character even if there are special characters or spaces in between.
However this wont work for strings with repeated characters overlapping the contents of groups like the string abcdabcd, in which it only recognizes the first group, ignoring the other repeated characters enclosed in the first group (b,c,d)
Check the demo: https://regex101.com/r/m5UfAe/1
So an alternative (and depending on your needs) is to sort the string analyzed:
import re
str = 'abcdabcde'
re.findall(r'(\w).*\1', ''.join(sorted(str)))
returning the array with the repeated characters ['a','b','c','d']
Hope the code below will help you understand the Backreference concept of Python RegEx
There are two sets of information available in the given string str
Employee Basic Info:
starting with #employeename and ends with employeename
eg: #daniel dxc chennai 45000 male daniel
Employee designation
starting with %employeename then designation and ends with employeename%
eg: %daniel python developer daniel%
import re
#sample input
str="""
#daniel dxc chennai 45000 male daniel #henry infosys bengaluru 29000 male hobby-
swimming henry
#raja zoho chennai 37000 male raja #ramu infosys bengaluru 99000 male hobby-badminton
ramu
%daniel python developer daniel% %henry database admin henry%
%raja Testing lead raja% %ramu Manager ramu%
"""
#backreferencing employee name (\w+) <---- \1
#----------------------------------------------
basic_info=re.findall(r'#+(\w+)(.*?)\1',str)
print(basic_info)
#(%) <-- \1 and (\w+) <--- \2
#-------------------------------
designation=re.findall(r'(%)+(\w+)(.*?)\2\1',str)
print(designation)
for i in range(len(designation)):
designation[i]=(designation[i][1],designation[i][2])
print(designation)

Extracting #mentions from tweets using findall python (Giving incorrect results)

I have a csv file something like this
text
RT #CritCareMed: New Article: Male-Predominant Plasma Transfusion Strategy for Preventing Transfusion-Related Acute Lung Injury... htp://…
#CRISPR Inversion of CTCF Sites Alters Genome Topology & Enhancer/Promoter Function in #CellCellPress htp://.co/HrjDwbm7NN
RT #gvwilson: Where's the theory for software engineering? Behind a paywall, that's where. htp://.co/1t3TymiF3M #semat #fail
RT #sciencemagazine: What’s killing off the sea stars? htp://.co/J19FnigwM9 #ecology
RT #MHendr1cks: Eve Marder describes a horror that is familiar to worm connectome gazers. htp://.co/AEqc7NOWoR via #nucAmbiguous htp://…
I want to extract all the mentions (starting with '#') from the tweet text. So far I have done this
import pandas as pd
import re
mydata = pd.read_csv("C:/Users/file.csv")
X = mydata.ix[:,:]
X=X.iloc[:,:1] #I have multiple columns so I'm selecting the first column only that is 'text'
for i in range(X.shape[0]):
result = re.findall("(^|[^#\w])#(\w{1,25})", str(X.iloc[:i,:]))
print(result);
There are two problems here:
First: at str(X.iloc[:1,:]) it gives me ['CritCareMed'] which is not ok as it should give me ['CellCellPress'], and at str(X.iloc[:2,:]) it again gives me ['CritCareMed'] which is of course not fine again. The final result I'm getting is
[(' ', 'CritCareMed'), (' ', 'gvwilson'), (' ', 'sciencemagazine')]
It doesn't include the mentions in 2nd row and both two mentions in last row.
What I want should look something like this:
How can I achieve these results? this is just a sample data my original data has lots of tweets so is the approach ok?
You can use str.findall method to avoid the for loop, use negative look behind to replace (^|[^#\w]) which forms another capture group you don't need in your regex:
df['mention'] = df.text.str.findall(r'(?<![#\w])#(\w{1,25})').apply(','.join)
df
# text mention
#0 RT #CritCareMed: New Article: Male-Predominant... CritCareMed
#1 #CRISPR Inversion of CTCF Sites Alters Genome ... CellCellPress
#2 RT #gvwilson: Where's the theory for software ... gvwilson
#3 RT #sciencemagazine: What’s killing off the se... sciencemagazine
#4 RT #MHendr1cks: Eve Marder describes a horror ... MHendr1cks,nucAmbiguous
Also X.iloc[:i,:] gives back a data frame, so str(X.iloc[:i,:]) gives you the string representation of a data frame, which is very different from the element in the cell, to extract the actual string from the text column, you can use X.text.iloc[0], or a better way to iterate through a column, use iteritems:
import re
for index, s in df.text.iteritems():
result = re.findall("(?<![#\w])#(\w{1,25})", s)
print(','.join(result))
#CritCareMed
#CellCellPress
#gvwilson
#sciencemagazine
#MHendr1cks,nucAmbiguous
While you already have your answer, you could even try to optimize the whole import process like so:
import re, pandas as pd
rx = re.compile(r'#([^:\s]+)')
with open("test.txt") as fp:
dft = ([line, ",".join(rx.findall(line))] for line in fp.readlines())
df = pd.DataFrame(dft, columns = ['text', 'mention'])
print(df)
Which yields:
text mention
0 RT #CritCareMed: New Article: Male-Predominant... CritCareMed
1 #CRISPR Inversion of CTCF Sites Alters Genome ... CellCellPress
2 RT #gvwilson: Where's the theory for software ... gvwilson
3 RT #sciencemagazine: What’s killing off the se... sciencemagazine
4 RT #MHendr1cks: Eve Marder describes a horror ... MHendr1cks,nucAmbiguous
This might be a bit faster as you don't need to change the df once it's already constructed.
mydata['text'].str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')
Same as this: Extract hashtags from columns of a pandas dataframe, but for mentions.
#.*? carries out a non-greedy match for a word starting
with a hashtag
(?=\s|$) look-ahead for the end of the word or end of the sentence
(?:(?<=\s)|(?<=^)) look-behind to ensure there are no false positives if a # is used in the middle of a word
The regex lookbehind asserts that either a space or the start of the sentence must precede a # character.

Categories