Extract values based on a pattern in a list - python

I would like to extract values based on certain pattern in a list.
**Example:**
ticker=['HF (NYSE) (81%);BPO (NEW YORK)]']
**Expected Output:**
Tickercode-HF;BPO
StockCode-NYSE;NEW YORK
Relevancescore-81;0
**My code**:
Tickercode=[x for x in ticker if re.match(r'[\w\.-]+[\w\.-]+', x)]
Stockcode=[x for x in ticker if re.match(r'[\w\.-]+(%)+[\w\.-]+', x)]
Relevancescore=[x for x in ticker if re.match(r'[\w\.-]+(%)+[\w\.-]+', x)]
**My output:**
['HF (NYSE) (81%);BPO (NEW YORK)]']
[]
[]
But I am getting the wrong output. Please help me resolve the issue.
Thanks

First, each item of ticker contains multiple records separated by semicolons, so I recommend normalizing ticker. Then iterate over the strings and extract the info using the
pattern '(\w+) \(([\w ]+)\)( \(([\d]+)%\))?'.
import re
ticker=['HF (NYSE) (81%);BPO (NEW YORK)]']
ticker=[y for x in ticker for y in x.split(';')]
Tickercode=[]
Stockcode=[]
Relevancescore=[]
for s in ticker:
    m = re.search(r'(\w+) \(([\w ]+)\)( \(([\d]+)%\))?', s)
    Tickercode.append(m.group(1))
    Stockcode.append(m.group(2))
    Relevancescore.append(m.group(4))
print(Tickercode)
print(Stockcode)
print(Relevancescore)
Output:
['HF', 'BPO']
['NYSE', 'NEW YORK']
['81', None]
Update:
Use re.search instead of re.match, which matches the pattern only at the start of the string. Your input has a leading white space, which caused it to fail.
You can add this to print which string doesn't match.
if m is None:
    print('%s cannot be matched' % s)
    continue
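For reference, the three lists from the answer above can also be built in one pass with a list comprehension over the split records. This sketch uses a slightly adjusted pattern with a non-capturing group around the optional score, so each match yields exactly three groups:

```python
import re

ticker = ['HF (NYSE) (81%);BPO (NEW YORK)]']
records = [y for x in ticker for y in x.split(';')]

# (?: ... ) makes the optional "(NN%)" part non-capturing,
# so .groups() returns (code, exchange, score) per record
pattern = re.compile(r'(\w+) \(([\w ]+)\)(?: \((\d+)%\))?')
rows = [pattern.search(s).groups() for s in records]

Tickercode = [r[0] for r in rows]
Stockcode = [r[1] for r in rows]
Relevancescore = [r[2] or '0' for r in rows]  # '0' when no score is present
```

The `or '0'` substitutes '0' for a missing relevance score, matching the expected output in the question.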

The problem with your code is that you're building up each of your lists from the input. You're telling it, "make a list of the input if the input matches my regular expression". The re.match() only matches against the beginning of a string, so the only regex that matches is the one that matches against the ticker symbol itself.
I've reorganized your code a bit below to show how it can work.
Use re.compile() so the regex doesn't have to be recompiled each time
Use re.search() so you can find your embedded patterns
Use match.group(1) to get the matching part of the query, not the whole of the input.
Break up your input so you're only handling one group at a time
#!/usr/bin/env python
import re
# Example:
ticker=['HF (NYSE) (81%);BPO (NEW YORK)]']
# **Expected Output:**
# Tickercode-HF;BPO
# StockCode-NYSE;NEW YORK
# Relevancescore-81;0
tickercode=[]
stockcode=[]
relevancescore=[]
ticker_re = re.compile(r'^\s*([A-Z]+)')
stock_re = re.compile(r'\(([\w ]+)\)')
relevance_re = re.compile(r'\((\d+)%\)')
for tick in ticker:
    for stockinfo in tick.split(";"):
        ticker_match = ticker_re.search(stockinfo)
        stock_match = stock_re.search(stockinfo)
        relevance_match = relevance_re.search(stockinfo)
        ticker_code = ticker_match.group(1) if ticker_match else ''
        stock_code = stock_match.group(1) if stock_match else ''
        relevance_score = relevance_match.group(1) if relevance_match else '0'
        tickercode.append(ticker_code)
        stockcode.append(stock_code)
        relevancescore.append(relevance_score)
print('Tickercode-' + ';'.join(tickercode))
print('StockCode-' + ';'.join(stockcode))
print('Relevancescore-' + ';'.join(relevancescore))

Related

Split a string into Name and Time

I want to split a string that contains a combined name and time.
I want to split it as shown in the example below:
Complete string
cDOT_storage01_esx_infra02_07-19-2021_04.45.00.0478
Desired output
cDOT_storage01_esx_infra02 07-19-2021
Efforts performed, not giving desired output
j['name'].split("-")[0], j['name'].split("-")[1][0:10]
Use rsplit. The only two _ you care about are the last two, so you can limit the number of splits rsplit will attempt using _ as the delimiter.
>>> "cDOT_storage01_esx_infra02_07-19-2021_04.45.00.0478".rsplit("_", 2)
['cDOT_storage01_esx_infra02', '07-19-2021', '04.45.00.0478']
You can index the resulting list as necessary to get your final result.
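For instance, unpacking the first two parts and joining them with a space gives the desired output:

```python
s = "cDOT_storage01_esx_infra02_07-19-2021_04.45.00.0478"
# split only at the last two underscores
name, date, _time = s.rsplit("_", 2)
print(name, date)  # cDOT_storage01_esx_infra02 07-19-2021
```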
If all the strings follow the same pattern (separated by an underscore (_)), you can try this.
(Untested)
string = "cDOT_storage01_esx_infra02_07-19-2021_04.45.00.0478"
splitted = string.split('_')  # the parts are already strings
# splitted[-1] will be "04.45.00.0478"
# splitted[-2] will be "07-19-2021"
# Rest of the list will contain the front part
other = splitted.pop()
date = splitted.pop()
name = '_'.join(splitted)
print(name, date)
You can use regex for searching and printing.
import re
txt = "cDOT_storage01_esx_infra02_07-19-2021_04.45.00.0478"
# search for the date in the string
x = re.search(r"\d{2}-\d{2}-\d{4}", txt)
if x:
    print("Matched")
    a = re.split(r"[0-9]{2}-[0-9]{2}-[0-9]{4}", txt)
    y = re.compile(r"\d{2}-\d{2}-\d{4}")
    print(a[0][:-1], " ", y.findall(txt)[0])
else:
    print("No match")
Output:
Matched
cDOT_storage01_esx_infra02 07-19-2021

Parsing String by regular expression in python

How can I parse this string in python?
Input String:
someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data
to this
Output array:
['someplace','2018:6:18:0','25.0114','95.2818','2.71164','66.8962','Entire grid contents are set to missing data']
I have already tried split(' '), but since it is not clear how many spaces separate the sub-strings, and the last sub-string may itself contain spaces, this doesn't work.
I need the regular expression.
If you do not provide a sep character, Python's split(sep=None, maxsplit=-1) (docs) will treat consecutive whitespace as a single separator and split on it. You can limit the number of splits by providing a maxsplit value:
data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
spl = data.split(None, 6)  # don't give a split char; use 6 splits at most
print(spl)
Output:
['someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164',
'66.8962', 'Entire grid contents are set to missing data']
This will work as long as the first text does not contain any whitespace.
If the first text may contain whitespace, you can use/refine this regex solution:
import re
reg = re.findall(r"([^\d]+?) +?([\d:]+) +?([\d.]+) +?([\d.]+) +?([\d.]+) +?([\d.]+) +(.*)$",data)[0]
print(reg)
Output:
('someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164', '66.8962', 'Entire grid contents are set to missing data')
Use e.g. https://regex101.com to check/prove the regex against your other data (follow the link; it applies the above regex to sample data)
[A-Z]{1}[a-zA-Z ]{15,45}|[\w|:|.]+
You can test it here https://pythex.org/
Modify 15,45 according to your needs.
Maxsplit works with re.split(), too:
import re
re.split(r"\s+",text,maxsplit=6)
Out:
['someplace',
'2018:6:18:0',
'25.0114',
'95.2818',
'2.71164',
'66.8962',
'Entire grid contents are set to missing data']
EDIT:
If the first and last text parts don't contain digits, we don't need maxsplit and do not have to rely on number of parts with consecutive spaces:
re.split(r"\s+(?=\d)|(?<=\d)\s+", s)
We cut the string where a space is followed by a digit or vice versa using lookahead and lookbehind.
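For example, with a hypothetical first field that itself contains a space, the lookaround split still recovers all seven parts:

```python
import re

s = ("some place 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 "
     "Entire grid contents are set to missing data")
# split where whitespace is followed by a digit, or preceded by one
parts = re.split(r"\s+(?=\d)|(?<=\d)\s+", s)
print(parts[0])    # some place
print(len(parts))  # 7
```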
It is hard to answer your question as the requirements are not very precise. I think I would split the line with the split() function and then join the items when their content has no numbers. Here is a snippet that works with your single sample:
def containsNumbers(s):
    return any(c.isdigit() for c in s)

data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
lst = data.split()
lst2 = []
i = 0
agg = ''
while i < len(lst):
    if containsNumbers(lst[i]):
        if agg != '':
            lst2.append(agg)
            agg = ''
        lst2.append(lst[i])
    else:
        agg += ' ' + lst[i]
        agg = agg.strip()
        if i == len(lst) - 1:
            lst2.append(agg)
    i += 1
print(lst2)

searching a word in the column pandas dataframe python

I have two text columns and I would like to find whether a word from one column is present in another. I wrote the below code, which works very well, but it detects if a word is present anywhere in the string. For example, it will find "ha" in "ham". I want to use regex expression instead, but I am stuck. I came across this post and looked at the second answer, but I haven't been able to modify it for my purpose. I would like to do something similar.
I would appreciate help and/or any pointers
d = {'emp': ['abc d. efg', 'za', 'sdfadsf '], 'vendor': ['ABCD enterprise', 'za industries', '' ]}
df = pd.DataFrame(data=d)
df['clean_empy_name'] = df["emp"].str.lower().str.replace(r'\W', ' ', regex=True)
def check_subset(vendor, employee):
    s = []
    for n in employee.split():
        # n = " " + n + "[^a-zA-Z\d:]"
        if (str(n) in vendor.lower()) and (len(str(n)) > 1):
            s.append(n)
    return s
check_subset("ABC-xy 54", "54 xy")
df['emp_name_find_in_vendor'] = df.apply(lambda row: check_subset(row['vendor'],row['clean_empy_name']), axis=1)
df
#########update 2
I updated my dataframe as below:
d = {'emp': ['abc d. efg', 'za', 'sdfadsf ','abc','yuma'], 'vendor': ['ABCD enterprise', 'za industries', '','Person Vue\Cisco','U OF M CONTLEARNING' ]}
df = pd.DataFrame(data=d)
df['clean_empy_name'] = df["emp"].str.lower().str.replace(r'\W', ' ', regex=True)
I used the code provided by the first answer and it fails:
in the case of 'Person Vue\Cisco' it throws the error bad escape \c. If I remove the \ in 'Person Vue\Cisco', the code runs fine
in the case of 'U OF M CONTLEARNING' it returns u and m when clearly they are not a match
Yes, you can! It is going to be a little bit messy, so let me construct it in a few steps:
First, let's just create a regular expression for the single case of check_subset("ABC-xy 54", "54 xy"):
We will use re.findall(pattern, string) to find all the occurrences of pattern in string
The regex pattern will basically say "any of the words":
for the "any" we use the | (or) operator
for constructing words we need parentheses to group characters together. However, parentheses (word) create a capturing group that keeps track of the match so we could reuse it later; since we are not interested in that, we can create a non-capturing group by adding ?: as follows: (?:word)
import re
re.findall('(?:54)|(?:xy)', 'ABC-xy 54')
# -> ['xy', '54']
Now, we have to construct the pattern each time:
Split into words
Wrap each word inside a non-capturing group (?:)
Join all of these groups by |
re.findall('|'.join(['(?:'+x+')' for x in '54 xy'.split()]), 'ABC-xy 54')
One minor thing, since the last row's vendor is empty and you seem to want no matches (technically, the empty string matches with everything) we have to add a minor check. So we can rewrite your function to be:
def check_subset_regex(vendor, employee):
    if vendor == '':
        return []
    pattern = '|'.join(['(?:'+x+')' for x in vendor.lower().split(' ')])
    return re.findall(pattern, employee)
And then we can apply the same way:
df['emp_name_find_in_vendor_regex'] = df.apply(lambda row: check_subset_regex(row['vendor'],row['clean_empy_name']), axis=1)
One final comment is that your solution matches partial words, so employee Tom Sawyer would match "Tom" to the vendor "Atomic S.A.". The regex function I provided here will not give this as a match, should you want to do this the regex would become a little more complicated.
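Should whole-word matching be needed, one option is to wrap each alternative in \b word-boundary anchors, so 'tom' no longer matches inside 'atomic' (a sketch with made-up names):

```python
import re

# wrap each search word in \b ... \b so only whole words match
words = 'tom sawyer'.split()
pattern = '|'.join(r'\b(?:' + w + r')\b' for w in words)
print(re.findall(pattern, 'atomic s.a. tom'))  # ['tom']
```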
EDIT: Removing punctuation marks from vendors
You could either add a new column as you did with clean_employee, or simply add the removal to the function, as so (you will need to import string to get the string.punctuation, or just add in there a string with all the symbols you want to substitute):
def check_subset_regex(vendor, employee):
    if vendor == '':
        return []
    clean_vnd = re.sub('[' + string.punctuation + ']', '', vendor)
    pattern = '|'.join(['(?:'+x+')' for x in clean_vnd.lower().split(' ')])
    return re.findall(pattern, employee)
In the spirit of teaching to fish :), in regex the [] denote any of these characters... So [abc] would be the same as a|b|c.
So the re.sub line will substitute any occurrence of the string.punctuation characters (which evaluates to !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~) with '' (removing them).
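Regarding the bad escape \c error from update 2: regex metacharacters such as the backslash in 'Person Vue\Cisco' can be neutralized with re.escape when building the pattern, and filtering out single-letter words also avoids the spurious u/m matches from 'U OF M CONTLEARNING'. A sketch of a variant with both guards:

```python
import re

def check_subset_escaped(vendor, employee):
    if vendor == '':
        return []
    # skip single-letter words such as 'U' and 'M'
    words = [w for w in vendor.lower().split() if len(w) > 1]
    if not words:
        return []  # an empty pattern would match everywhere
    # re.escape neutralizes metacharacters like the backslash in 'Vue\Cisco'
    pattern = '|'.join('(?:' + re.escape(w) + ')' for w in words)
    return re.findall(pattern, employee)

print(check_subset_escaped('Person Vue\\Cisco', 'the person vue\\cisco entry'))
```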
EDIT2: Adding the possibility of a single non-alphanumeric character at the end of each searchword:
def check_subset_regex(vendor, employee):
    if vendor == '':
        return []
    clean_vnd = re.sub('[' + string.punctuation + ']', '', vendor)
    pattern = '|'.join(['(?:'+x+'[^a-zA-Z0-9]?)' for x in clean_vnd.lower().split(' ')])
    return re.findall(pattern, employee)
In this case we are using:
- ^ as the first character inside a [] (called character class), denotes any character except for those specified in the character class, e.g. [^abc] would match anything that is not a or b or c (so d, or a white space, or #)
- and the ?, which means the previous symbol is optional...
So, [^a-zA-Z0-9]? means an optional single non-alphanumeric character.
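A quick illustration of the optional trailing non-alphanumeric character, using a made-up snippet:

```python
import re

# 'za' optionally followed by one non-alphanumeric character
print(re.findall(r'za[^a-zA-Z0-9]?', 'za-industries za'))  # ['za-', 'za']
```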

Splitting a string using re module of python

I have a string
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
#I have to capture only the field 'count_EVENT_GENRE'
field = re.split(r'[(==)(>=)(<=)(in)(like)]', s)[0].strip()
#o/p is 'cou'
# for s = 'sum_EVENT_GENRE in [1,2,3,4,5]' o/p = 'sum_EVENT_GENRE'
which is fine
My doubt is that for any character in (in)(like), it splits the string s at that character and gives me the first slice (after "cou" it finds a matching char, i.e. n). This happens for any string that contains any character from (in)(like).
Ex: 'percentage_AMOUNT' o/p = 'p'
as it finds a matching char, 'e', after p.
So I want some advice on how to treat (in)(like) as words, not as characters, when the splitting occurs.
Please suggest a syntax.
Answering your question, the [(==)(>=)(<=)(in)(like)] is a character class matching single characters you defined inside the class. To match sequences of characters, you need to remove [ and ] and use alternation:
r'==?|>=?|<=?|\b(?:in|like)\b'
or better:
r'[=><]=?|\b(?:in|like)\b'
Your code would look like:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
    field = re.split(r'[=><]=?|\b(?:in|like)\b', s)[0].strip()
    print(field)
However, there might be other (easier, or safer - depending on the actual specifications) ways to get what you want (splitting with space and getting the first item, use re.match with r'\w+' or r'[a-z]+(?:_[A-Z]+)+', etc.)
If your value is at the start of the string and starts with lowercase ASCII letters, and then can have any amount of sequences of _ followed with uppercase ASCII letters, use:
re.match(r'[a-z]+(?:_[A-Z]+)*', s)
Full demo code:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
    fieldObj = re.match(r'[a-z]+(?:_[A-Z]+)*', s)
    if fieldObj:
        print(fieldObj.group())
If you want only the first word of your string, then this should do the job:
import re
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
field = re.split(r'\W', s)[0]
# count_EVENT_GENRE
Is there anything wrong with using split?
>>> s = 'count_EVENT_GENRE in [1,2,3,4,5]'
>>> s.split(' ')[0]
'count_EVENT_GENRE'
>>> s = 'coint_EVENT_GENRE = "ROMANCE"'
>>> s.split(' ')[0]
'coint_EVENT_GENRE'
>>>

Replacing variable length items in a list using regex in python

I am trying to replace variable-length items in a list using regex. For example, the item "HD479659" should be replaced by "HD0000000479659"; I just need to insert 7 0s in between. I have made the following program, but every time I run it I get the error "TypeError: object of type '_sre.SRE_Pattern' has no len()". Can you please help me solve this error?
Thank you very much
Here is the program
import xlrd
import re
import string
wb = xlrd.open_workbook("3_1.xls")
sh = wb.sheet_by_index(0)
outfile=open('out.txt','w')
s_pat=r"HD[1-9]{1}[0-9]{5}"
s_pat1=r"HD[0]{7}[0-9]{6}"
pat = re.compile(s_pat)
pat1 = re.compile(s_pat1)
for rownum1 in range(sh.nrows):
    str1 = str(sh.row_values(rownum1))
    m1 = []
    m1 = pat.findall(str1)
    m1 = list(set(m1))
    for a in m1:
        a = re.sub(pat, pat1, a)
    print >> outfile, m1
I think your solution is too complicated. This one should do the job and is much simpler:
import re

def repl(match):
    return match.group(1) + ("0" * 7) + match.group(2)

print(re.sub(r"(HD)([1-9][0-9]{5})", repl, "HD479659"))
See also: http://docs.python.org/library/re.html#re.sub
Update:
To transform a list of values, you have to iterate over all values. You don't have to search the matching values first:
import re

values_to_transform = [
    'HD479659',
    'HD477899',
    'HD423455',
    'does not match',
    'but does not matter',
]

def repl(match):
    return match.group(1) + ("0" * 7) + match.group(2)

for value in values_to_transform:
    print(re.sub(r"(HD)([1-9][0-9]{5})", repl, value))
The result is:
HD0000000479659
HD0000000477899
HD0000000423455
does not match
but does not matter
What you need to do is extract the variable length portion of the ID explicitly, then pad with 0's based on the desired length - matched length.
If I understand the pattern correctly you want to use the regex
r"HD(?P<zeroes>0*)(?P<num>\d+)"
At that point you can do
results = re.search(...bla...).groupdict()
Which returns the dict {'zeroes': '', 'num':'479659'} in this case. From there you can pad as necessary.
It's 5am at the moment or I'd have a better solution for you, but I hope this helps.
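Building on the groupdict idea above, a hypothetical helper using str.zfill can do the padding: pad the numeric part to 13 digits (the width implied by the expected 'HD0000000479659'), regardless of how many zeroes are already there.

```python
import re

def pad_id(s, width=13):
    # hypothetical helper: pad the digits after 'HD' to `width` characters
    m = re.search(r"HD(?P<zeroes>0*)(?P<num>\d+)", s)
    if m is None:
        return s  # leave non-matching values untouched
    return "HD" + m.group("num").zfill(width)

print(pad_id("HD479659"))  # HD0000000479659
```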
