finding a regex to match and also trying to avoid using groups

Policy Id NPOH123414699 xyz OH 12605-12345 NASHVILLE TN 37101
Policy Id 9214234451
Policy Id AS12345FD ..... other info
I am trying to grab only the identifier right after "Id" and stop as soon as a space is encountered after it. I don't know how to do it.
My regex: "[P|p]olicy\s*[I|i]d\s*"
However, it is capturing everything after "Policy Id" and giving me this:
NPOH123414699xyzOH1260512345NASHVILLETN37101

This regex uses no groups, only full matches that you need to iterate over: (?<=Policy Id\s)\w+
Here it is: https://regex101.com/r/8GEDs1/1

This simple pattern should be more than enough if you insist on using regex:
a = re.findall(r"Id\s([0-9A-Z]+)", "Policy Id B1231232131 xysa da")[0]

You can use:
(?<=\bPolicy Id\s)(\w+)
Reference:
https://regex101.com/r/LrzDX5/1
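
As a rough illustration of the lookbehind patterns suggested above, the following Python sketch runs one of them over the three sample lines from the question. The variable names are made up for this example. Python's re module only accepts fixed-width lookbehinds, which is why the pattern uses a single \s rather than \s*.

```python
import re

# Sample lines copied from the question.
text = """Policy Id NPOH123414699 xyz OH 12605-12345 NASHVILLE TN 37101
Policy Id 9214234451
Policy Id AS12345FD ..... other info"""

# The lookbehind asserts that "Policy Id" plus one whitespace character
# sits right before the match, without making it part of the match.
policy_id = re.compile(r"(?<=\bPolicy Id\s)\w+")

# \w+ stops at the first non-word character, so each match ends at the space.
ids = policy_id.findall(text)
print(ids)  # ['NPOH123414699', '9214234451', 'AS12345FD']
```

Because the pattern has no capturing group, findall returns the full matches directly.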

Related

Optional Regex Component

I am using a regex function to return four or five new fields: Store name, Details, Reason (optional), Pause time start, and Pause time end. Unlike the other four fields, Reason does not show up in every case. If it does show up, it appears between Store and Details within the text itself.
I am currently using this code to find the four required fields (which works):
parser = re.compile(r"your store, ([^,]+).*Details: ([^\n]*).*Created at: ([^\n]*).*Scheduled end time: ([^\n]*)", flags=re.DOTALL | re.MULTILINE)
df1['STORE'] = ''
df1['DETAILS'] = ''
df1['TIME_PAUSE_CREATED'] = ''
df1['TIME_PAUSE_END'] = ''
for index, i in enumerate(df1.DESCRIPTION):
    # findall returns one tuple of captured groups per match
    txt = parser.findall(i)
    for field in txt:
        # .loc avoids chained assignment; assumes the default integer index
        df1.loc[index, 'STORE'] = field[0]
        df1.loc[index, 'DETAILS'] = field[1]
        df1.loc[index, 'TIME_PAUSE_CREATED'] = field[2]
        df1.loc[index, 'TIME_PAUSE_END'] = field[3]
Is there a way to make an optional regex field and append that (else append 'Null') and continue scraping the other fields? I have tried using the following, but this only returns null values after store name:
parser = re.compile(r"your store, ([^,]+).*(Reason: ([^\n]*))?.*|Details: ([^\n]*).*)Created at: ([^\n]*).*Scheduled end time: ([^\n]*)", flags=re.DOTALL | re.MULTILINE)
Ideally I would be able to add the same respective column for 'Reason' like the other fields, but the regex expression still isn't working for me.
Thank you!

I take it from your example that Reason: is not always supplied? That's OK, just add it as an optional (one or zero occurrences) group. If it's not present, that capture group will be null. Between Store and Details add (?:Reason: (.*?))?. The final question mark says the whole Reason: section can occur zero or one times, making it optional. The whole regex (after a little extra cleanup) should read:
your store, ([^,]+).*?(?:Reason: (.*?))?\sDetails: (.*?)(?:\sDeactivation Time)?\sCreated at: (.*?[AP]M).*Scheduled end time: (.*?[AP]M)
Remember that Reason: will now be in field[1] and the other capture groups will be shifted down one.
I tested this regex against your example string above on the Regex101 website.
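
A minimal sketch of how the optional group behaves, assuming the re.DOTALL flag from the question's code is kept. The sample descriptions and the parse helper are invented for illustration, since the asker's real text is not shown here.

```python
import re

# Pattern from the answer above; re.DOTALL (kept from the question's code)
# lets "." cross line breaks.
parser = re.compile(
    r"your store, ([^,]+).*?(?:Reason: (.*?))?\sDetails: (.*?)"
    r"(?:\sDeactivation Time)?\sCreated at: (.*?[AP]M)"
    r".*Scheduled end time: (.*?[AP]M)",
    flags=re.DOTALL,
)

# Hypothetical descriptions, one with a Reason line and one without.
with_reason = (
    "We have paused your store, Acme Market, as requested.\n"
    "Reason: Power outage\n"
    "Details: Paused by manager\n"
    "Created at: 01/02/2020 10:15 AM\n"
    "Scheduled end time: 01/02/2020 11:15 AM\n"
)
without_reason = with_reason.replace("Reason: Power outage\n", "")

def parse(description):
    """Return the five fields as a dict, or None when nothing matches."""
    match = parser.search(description)
    if match is None:
        return None
    store, reason, details, created, end = match.groups()
    return {
        "STORE": store,
        # An optional group that did not participate is None.
        "REASON": reason if reason is not None else "Null",
        "DETAILS": details,
        "TIME_PAUSE_CREATED": created,
        "TIME_PAUSE_END": end,
    }

print(parse(with_reason)["REASON"])     # Power outage
print(parse(without_reason)["REASON"])  # Null
```

Using search with match.groups() keeps the field positions fixed whether or not Reason is present, so the remaining columns never shift.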

How to extract URL from Pandas DataFrame?

I need to extract URLs from a column of a DataFrame that was created using the following values:
creation_date,tweet_id,tweet_text
2020-06-06 03:01:37,1269102116364324865,#Webinar: Sign up for #SumoLogic's June 16 webinar to learn how to navigate your #Kubernetes environment and unders… https://stackoverflow.com/questions/42237666/extracting-information-from-pandas-dataframe
2020-06-06 01:29:38,1269078966985461767,"In this #webinar replay, #DisneyStreaming's #rothgar chats with #SumoLogic's #BenoitNewton about how #Kubernetes is… https://stackoverflow.com/questions/46928636/pandas-split-list-into-columns-with-regex
The column named tweet_text contains a URL. I am trying the following code:
df["tweet_text"]=df["tweet_text"].astype(str)
pattern = r'https?:\/\/(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&//=]*)'
df['links'] = ''
df['links']= df["tweet_text"].str.extract(pattern, expand=True)
print(df)
I am using the regex from an answer to this question, and it matches the URL in both rows.
But I am getting NaN as the values of the new column df['links']. I have also tried the solution provided in the first answer to this question, which was
df['links']= df["tweet_text"].str.extract(pattern, expand=False).str.strip()
But I am getting the following error:
AttributeError: 'DataFrame' object has no attribute 'str'
Lastly, I created an empty column using df['links'] = '' because I was getting a "ValueError: Wrong number of items passed 2, placement implies 1" error, in case that's relevant.
Can someone help me out here?

The main problem is that your URL pattern contains capturing groups where you need non-capturing ones. You need to replace all ( with (?: in the pattern.
However, it is not enough since str.extract requires a capturing group in the pattern so that it could return any value at all. Thus, you need to wrap the whole pattern with a capturing group.
You may use
pattern = r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9#:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()#:%_+.~#?&/=]*)'
Note that + does not need to be escaped inside a character class. Also, there is no need to use // inside a character class; one / is enough.
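
The effect of the group count can be checked with the standard re module alone. The sketch below is only an illustration: the tweet text is shortened, and the comments about str.extract describe its documented behavior of returning one column per capturing group rather than anything run here.

```python
import re

# Pattern from the question: two capturing groups.
original = r'https?:\/\/(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&//=]*)'
# Pattern from the answer: a single capturing group around the whole URL.
fixed = r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9#:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()#:%_+.~#?&/=]*)'

# Shortened sample text using one of the URLs from the question.
tweet = ("Sign up for the June 16 webinar "
         "https://stackoverflow.com/questions/42237666/"
         "extracting-information-from-pandas-dataframe")

# str.extract returns one column per capturing group, so two groups give a
# two-column result that cannot be assigned to a single column.
print(re.compile(original).groups)  # 2
print(re.compile(fixed).groups)     # 1

# The first group, (www\.)?, does not participate for this URL, so it is
# None, which is where the missing values come from.
print(re.search(original, tweet).groups())

# With one group around everything, group 1 is the complete URL.
url = re.search(fixed, tweet).group(1)
print(url)
```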

FIX market data request issue - Incorrect NumInGroup count for repeating group

I'm trying to get market data via the FIX protocol. This is what I'm sending as a request:
8=FIX.4.4|9=120|35=V|34=2|49=icmarkets.3540639|52=20190917-05:55:39.114|56=CSERVER|262=2|263=1|264=0|265=0|269=0|146=1|55=EUR/USD|267=2|10=173
and this is the response received:
8=FIX.4.4|9=163|35=3|34=2|49=CSERVER|50=TRADE|52=20190917-05:55:39.142|56=icmarkets.3540639|45=2|58=Incorrect NumInGroup count for repeating group, field=267|371=267|372=V|373=16|10=216
So there is obviously an error: Incorrect NumInGroup count for repeating group, field=267,
but I have no idea how to fix this. Any tips?
Thanks!

Your tags are in a completely different order than they should be. Tag 267/NoMDEntryTypes is the count tag for the repeating group and should precede that repeating group.
You send 267=2 at the end of the message. It really should come before tag 269/MDEntryType. 269 is the delimiter tag, which marks the beginning of a new instance of that repeating group. But you only have one tag 269 (i.e. the group count is only 1). That is probably the next problem that your counterparty will report.
Please get the FIX rules of engagement of your counterparty to check what they expect. Maybe they even have some example messages in there.
The order of tags outside of repeating groups is irrelevant. But the count and delimiter tag of a repeating group need to be in order.
Edit: here is the common description of a MarketDataRequest per the FIX 4.4 spec: https://fiximate.fixtrading.org/legacy/en/FIX.4.4/body_505786.html
But your counterparty might have slight differences.
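
To make the ordering rule concrete, here is a small Python sketch that checks whether the count tag 267 is immediately followed by the declared number of 269 entries. The helper names are hypothetical, only the body fields are shown (no header, BodyLength or CheckSum), and requesting both bid (269=0) and offer (269=1) is an assumption about what the asker intended.

```python
def parse_fix(message, sep="|"):
    """Split 'tag=value|tag=value' text into a list of (tag, value) pairs."""
    return [tuple(field.split("=", 1)) for field in message.strip(sep).split(sep)]

def group_count_ok(fields, count_tag="267", delimiter_tag="269"):
    """Check that count_tag is directly followed by as many delimiter_tag
    entries as it declares. Assumes each group instance holds only the
    delimiter tag, which is the case for the 267/269 group."""
    tags = [tag for tag, _ in fields]
    if count_tag not in tags:
        return False
    start = tags.index(count_tag)
    declared = int(fields[start][1])
    found = 0
    for tag in tags[start + 1:]:
        if tag != delimiter_tag:
            break
        found += 1
    return found == declared

# Body fields as sent in the question: 267 comes last, with a single 269.
sent_body = "262=2|263=1|264=0|265=0|269=0|146=1|55=EUR/USD|267=2"

# Reordered body: 267=2 first, then two 269 entries (bid and offer, assumed).
reordered_body = "262=2|263=1|264=0|265=0|267=2|269=0|269=1|146=1|55=EUR/USD"

print(group_count_ok(parse_fix(sent_body)))       # False
print(group_count_ok(parse_fix(reordered_body)))  # True
```

This only validates the one repeating group discussed above; the counterparty's rules of engagement remain the authority on the full message layout.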

Is there an R or Python function for separating information in non-delimited strings, where the information varies?

I am currently cleaning up a messy data sheet in which all the information is given in one Excel cell and the different characteristics are not delimited (no commas, and spaces are random).
Thus, my problem is to separate the different pieces of information without a delimiter I could use in my code (I can't use a split command).
I assume that I need to include some characteristics of each part of information, such that the corresponding characteristic is recognized. However, I don't have a clue how to do that since I am quite new to Python and I only worked with R in the framework of regression models and other statistical analysis.
Short data example:
INPUT:
"WMIN CBOND12/05/2022 23554132121"
or
"WalMaInCBND 12/05/2022-23554132121"
or
"WalmartI CorpBond12/05/2022|23554132121"
EXPECTED OUTPUT:
"Walmart Inc.", "Corporate Bond", "12/05/2022", "23554132121"
So each of the "x" should be classified in a new column with the corresponding header (Company, Security, Maturity, Account Number)
As you can see the input varies randomly but I want to have the same output for each of the three inputs given above (I have over 200k data points with different companies, securities etc.)
First Problem is how to separate the information effectively without being able to use a systematic pattern.
Second Problem (lower priority) is how to identify the company without setting up a dictionary with 50 different inputs for 50k companies.
Thanks for your help!

I recommend first introducing useful separators where possible and constructing a dictionary of replacements for processing with regular expressions.
import re
s = 'WMIN CBOND12/05/2022 23554132121'
# CAREFUL: this is not a real date regex, it just
# illustrates the principle
# see https://stackoverflow.com/a/15504877/5665958 for
# a good US date regex
date_re = re.compile(r'([0-9]{2}/[0-9]{2}/[0-9]{4})')
# prepend a whitespace before the date
# this is achieved by searching for the date within the string
# and replacing it with itself with a prepended whitespace
# \1 means "insert the first capture group", which in our
# case is the date
s = re.sub(date_re, r' \1', s)
# split by one or more whitespaces and insert
# a separator (';') to make working with the string
# easier
s = ';'.join(s.split())
# build a dictionary of replacements
replacements = {
    'WMIN': 'Walmart Inc.',
    'CBOND': 'Corporate Bond',
}
# apply a substitution for each replacement
# a better, but more complicated solution for
# this is given here:
# https://stackoverflow.com/a/15175239/5665958
for pattern, r in replacements.items():
    s = re.sub(pattern, r, s)
# use our custom separator to split the parts
out = s.split(';')
print(out)

Using Python and regular expressions:
import re
def make_filter(pattern):
    pattern = re.compile(pattern)
    def extract(s):
        # match() returns None when the string does not fit the pattern
        matched = pattern.match(s)
        if matched is None:
            return None
        return matched.group(1), matched.group(2), matched.group(3), matched.group(4)
    return extract
extract = make_filter(r"^([a-zA-Z]+)\s([a-zA-Z]+)(\d+/\d+/\d+)\s(\d+)$")
extract("WMIN CBOND12/05/2022 23554132121")
The make_filter function is just a utility that lets you modify the pattern. It returns a function that extracts the fields according to that pattern. I use it with the "^([a-zA-Z]+)\s([a-zA-Z]+)(\d+/\d+/\d+)\s(\d+)$" pattern, which expects some text, a space, some text, a date, a space, and a number. If you want to modify this pattern, provide more info about it. The output will be ("WMIN", "CBOND", "12/05/2022", "23554132121").

Welcome! Yeah, we would definitely need to see more examples, and regex seems to be the way to go... but since there seems to be no structure, I think it's better to think of this as separate steps.
We KNOW there's a date, which is (X)X/(X)X/XXXX (i.e. one or two digit day, one or two digit month, four digit year, maybe with or without the slashes, right?), and after that there are numbers. So solve that part first, leaving only the first two categories. That's actually the easy part :) but don't lose heart!
If these two categories might not have ANY delimiter (for example WMINCBOND 12/05/202223554132121), or delimiters are not always delimiters (for example IMAGINARY COMPANY X CBOND), then you're in deep trouble. :) BUT this is what we can do:
1- Gather a list of all the codes (hopefully you have that).
2- Use str_detect() (from R's stringr package) on each code and see if you can recognize the exact string anywhere in the dataset (if you do have the codes, let me know and I'll write the code to do this part).
3- What's left after identifying the code will be the CBOND part, whatever that is... so do that part last. Alternatively, you can use the same str_detect() if you have a list of whatever the CBOND stuff is.
4- ONLY AFTER YOU'VE IDENTIFIED EVERYTHING can you replace the codes with what they stand for.
If you have the code list, let me know and I'll post the code.
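
The steps above can be sketched in Python roughly as follows. This is only an illustration: the two code dictionaries are hypothetical stand-ins for the real code lists, the date pattern assumes the slashes are always present, and plain substring matching replaces str_detect().

```python
import re

# Hypothetical code lists; the real ones would come from the data owner.
COMPANY_CODES = {
    "WMIN": "Walmart Inc.",
    "WalMaIn": "Walmart Inc.",
    "WalmartI": "Walmart Inc.",
}
SECURITY_CODES = {
    "CBOND": "Corporate Bond",
    "CBND": "Corporate Bond",
    "CorpBond": "Corporate Bond",
}

# Date, then any non-digit separator, then the account number at the end.
DATE_AND_ACCOUNT = re.compile(r"(\d{1,2}/\d{1,2}/\d{4})\D*(\d+)\s*$")

def lookup(text, codes):
    """Return the full name of the first known code found in text."""
    for code, name in codes.items():
        if code in text:
            return name
    return None

def split_record(raw):
    """Peel off date and account number first, then identify the codes."""
    match = DATE_AND_ACCOUNT.search(raw)
    if match is None:
        return None
    head = raw[:match.start()]  # whatever precedes the date
    return (lookup(head, COMPANY_CODES), lookup(head, SECURITY_CODES),
            match.group(1), match.group(2))

for raw in ["WMIN CBOND12/05/2022 23554132121",
            "WalMaInCBND 12/05/2022-23554132121",
            "WalmartI CorpBond12/05/2022|23554132121"]:
    print(split_record(raw))
```

Substring matching is case-sensitive and can misfire when one code is contained in another, so a real code list may need ordering by length or stricter matching.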

edit
s = c("WMIN CBOND12/05/2022 23554132121",
"WalMaInCBND 12/05/2022-23554132121",
"WalmartI CorpBond12/05/2022|23554132121")
ID = gsub("([a-zA-Z]+).*","\\1",s)
ID2 = gsub(".* ([a-zA-Z]+).*","\\1",s)
date = gsub("[a-zA-Z ]+(\\d+\\/\\d+\\/\\d+).*","\\1",s)
num = gsub("^.*[^0-9](.*$)","\\1",s)
data.frame(ID=ID,ID2=ID2,date=date,num=num,stringsAsFactors=FALSE)
           ID                                ID2       date         num
1        WMIN                              CBOND 12/05/2022 23554132121
2 WalMaInCBND WalMaInCBND 12/05/2022-23554132121 12/05/2022 23554132121
3    WalmartI                           CorpBond 12/05/2022 23554132121
This works for cases 1 and 3, but I haven't figured out the logic for the second case: how can we know where to split the string containing the company and security if they are not separated?

Python - how to pull an address from a string or how to get the word before something thats on a different line?

my sample content is below
content ="""
Dear Customer,
Detail of service affected:
Bobs Builders
Retail park
The Aavenue
London
LDN 4DX
Start Time & Date: 04/01/2017 00:05
Completion Time & Date: 04/01/2017 06:00
Details of Work:
....
I'm already pulling out the postcode with
postcodes = re.findall(r"[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2}", content)
I would also like to get the city from this content. Is that even possible? Would I have to provide a list of cities first and then check against that?
Or is there a way of getting the line before the postcode, as the addresses are always sent that way?
Could I use the postcode regex to get the word before the postcode?
Thanks

Here's an example:
import re
postcodes = re.findall(r"(\w+)\s+([A-Z]{3} \d[A-Z]{2})", content)
print(postcodes)
# => [('London', 'LDN 4DX')]
You get 2 groups: the first one is the word right before the postcode (possibly on another line), and the second one is the postcode itself.
The postcode regex has been simplified in order to make the example more readable.
If you want to match any UK code, here is a good reference.
By the way, the regex you mentioned doesn't match LDN 4DX. Adding a ? after [0-9R] would do:
postcodes = re.findall(r"[A-Z]{1,2}[0-9R]?[0-9A-Z]? [0-9][A-Z]{2}", content)
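
Since the question also asks about getting the whole line before the postcode, here is a hedged variation that captures the full preceding line instead of a single word, which would matter for multi-word city names. It assumes the postcode always sits alone on its own line; the second address is invented for illustration.

```python
import re

# Sample from the question, closed off after the lines that matter here.
content = """Dear Customer,
Detail of service affected:
Bobs Builders
Retail park
The Aavenue
London
LDN 4DX
Start Time & Date: 04/01/2017 00:05
"""

# With re.MULTILINE, ^ and $ anchor at line boundaries: group 1 is a whole
# line, group 2 is the next line if it consists of a postcode only.
line_before_postcode = re.compile(
    r"^(.+)\n([A-Z]{1,2}[0-9R]?[0-9A-Z]? [0-9][A-Z]{2})$",
    re.MULTILINE,
)

print(line_before_postcode.findall(content))
# [('London', 'LDN 4DX')]

# Hypothetical address with a two-word city name.
other = "Some Shop\nMilton Keynes\nMK9 2FR\n"
print(line_before_postcode.findall(other))
# [('Milton Keynes', 'MK9 2FR')]
```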

There are multiple ways to approach this problem:
1- Use the Google Geocoding API
If you can extract the address part by pattern matching, you can pass the address to the Google Maps Geocoding API and let it parse the address for you.
2- Regex search
If you are sure that the address is always well-formatted and the postcode is always preceded by the city name, you can use a regex to handle these situations:
(\w*)\s+([A-Z]{3}\s+\d[A-Z]{2})
3- Use a database of city names
If the addresses are not always well-formatted, your best bet would be to use a database of city names such as OpenAddresses.
4- Use an entity extraction API [BEST]
This is a classic application of information extraction in Natural Language Processing. You can implement your own using nltk, or even better, you can use a web service such as AlchemyAPI. Copy and paste your text into their demo and see for yourself how powerful it is.
