I want to split a string that combines a name and a timestamp, as shown in the example below:
Complete string
cDOT_storage01_esx_infra02_07-19-2021_04.45.00.0478
Desired output
cDOT_storage01_esx_infra02 07-19-2021
What I tried, which does not give the desired output:
j['name'].split("-")[0], j['name'].split("-")[1][0:10]
Use rsplit. The only two _ you care about are the last two, so you can limit the number of splits rsplit will attempt using _ as the delimiter.
>>> "cDOT_storage01_esx_infra02_07-19-2021_04.45.00.0478".rsplit("_", 2)
['cDOT_storage01_esx_infra02', '07-19-2021', '04.45.00.0478']
You can index the resulting list as necessary to get your final result.
If all the strings follow the same pattern (separated by an underscore(_)), you can try this.
(Untested)
string = "cDOT_storage01_esx_infra02_07-19-2021_04.45.00.0478"
splitted = string.split('_')  # split() already returns a list of str
# splitted[-1] will be "04.45.00.0478"
# splitted[-2] will be "07-19-2021"
# Rest of the list will contain the front part
other = splitted.pop()
date = splitted.pop()
name = '_'.join(splitted)
print(name, date)
You can use a regex to search for the date and print the result.
import re
txt = "cDOT_storage01_esx_infra02_07-19-2021_04.45.00.0478"
# searching the date in the string
x = re.search(r"\d{2}-\d{2}-\d{4}", txt)
if x:
    print("Matched")
    # split around the date; a[0] is the name plus a trailing underscore
    a = re.split(r"[0-9]{2}-[0-9]{2}-[0-9]{4}", txt)
    y = re.compile(r"\d{2}-\d{2}-\d{4}")
    print(a[0][:-1], y.findall(txt)[0])
else:
    print("No match")
Output:
Matched
cDOT_storage01_esx_infra02 07-19-2021
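The same result can probably be had with a single pattern and two capture groups. This is only a sketch, assuming the date always has the MM-DD-YYYY shape and is followed by an underscore; the group names name and date are my own choice.

```python
import re

txt = "cDOT_storage01_esx_infra02_07-19-2021_04.45.00.0478"

# Greedy .+ takes the name, then backtracks until "_<date>_" fits.
m = re.match(r"(?P<name>.+)_(?P<date>\d{2}-\d{2}-\d{4})_", txt)
if m:
    print(m.group("name"), m.group("date"))
else:
    print("No match")
```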
Related
string = 'ID::ID123
PUBLISHED_TWEET::ABC
DEF
GHI
EMPLOYEE_ID::ID234
TWEET::ABC
DEF
GHI
ID::ID345
TWEET::##ABC
DEF
GHI#.[]
USER_IDD::ID456
TWEET::google.com
123456789'
Required output
id = ['ID123', 'ID234', 'ID345', 'ID456'] - I got this output
I am struggling with the tweet text. I need to extract the tweet text using a regex in Python.
tweet-output = ['ABC
DEF
GHI',
'ABC
DEF
GHI',
'##ABC
DEF
GHI#.[]',
'google.com
123456789']
I tried using regex expressions by getting
# pattern_01 = r'PUBLISHED_TWEET::(.*)EMPLOYEE_ID::'
# pattern_02 = r'PUBLISHED_TWEET::(.*)ID::'
# pattern_03 = r'PUBLISHED_TWEET::(.*)USER_IDD::'
# pattern_04 = r'TWEET::(.*)EMPLOYEE_ID::'
# pattern_05 = r'TWEET::(.*)ID::'
# pattern_06 = r'TWEET::(.*)USER_IDD::'
# if(re.findall(pattern = pattern_01, string = str(data))):
# new_tweet_list.append(re.findall(pattern = pattern_01, string = str(data)))
# elif(re.findall(pattern = pattern_02, string = str(data))):
# new_tweet_list.append(re.findall(pattern = pattern_02, string = str(data)))
# elif(re.findall(pattern = pattern_03, string = str(data))):
# new_tweet_list.append(re.findall(pattern = pattern_03, string = str(data)))
# elif(re.findall(pattern = pattern_04, string = str(data))):
# new_tweet_list.append(re.findall(pattern = pattern_04, string = str(data)))
# elif(re.findall(pattern = pattern_05, string = str(data))):
# new_tweet_list.append(re.findall(pattern = pattern_05, string = str(data)))
# elif(re.findall(pattern = pattern_06, string = str(data))):
# new_tweet_list.append(re.findall(pattern = pattern_06, string = str(data)))
but I am not getting any output. The string is empty or it gives the entire string that is passed.
It looks like your matches are pairs of <id, tweet>, whose matching text is separated from the identifying keywords with the double colons ::.
You can use the following regex to retrieve any string that is preceded by the double colons and followed by a newline or the end of the string, without lazy matching.
(?<=::)[^:]+(?=\n|$)
You can use it in python with the following code:
import re
pattern = r'(?<=::)[^:]+(?=\n|$)'
pairs = re.findall(pattern, string)
tweets = {pairs[i]: pairs[i+1] for i in range(0, len(pairs), 2)}
Check the regex demo here and the python demo here.
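For a self-contained run of that pattern, here is a sketch in which the sample text from the question is rebuilt with explicit \n characters. The names text, values, ids and tweets are mine, and it assumes that no tweet contains a colon, since [^:]+ stops at the first one.

```python
import re

text = (
    "ID::ID123\n"
    "PUBLISHED_TWEET::ABC\nDEF\nGHI\n"
    "EMPLOYEE_ID::ID234\n"
    "TWEET::ABC\nDEF\nGHI\n"
    "ID::ID345\n"
    "TWEET::##ABC\nDEF\nGHI#.[]\n"
    "USER_IDD::ID456\n"
    "TWEET::google.com\n123456789"
)

# Each match is the text after "::", cut back to the last newline before the next label.
values = re.findall(r"(?<=::)[^:]+(?=\n|$)", text)

# Values alternate: id, tweet, id, tweet, ...
ids = values[0::2]
tweets = values[1::2]

print(ids)
print(tweets)
```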
Your question is not clear. It's hard to understand what you really want, but I'll try to guess.
I believe in this case you should split the string instead of trying to search for patterns. It's much easier.
Use
pattern = r'([A-Z_]+)::'
split_string = re.split(pattern, string)
Basically, it splits the string at each label, that is, a run of uppercase letters (A to Z) and underscores followed by a double colon. Because the label is inside a capturing group, re.split keeps it in the result.
In this case, the output will be
['',
'ID',
'ID123\n',
'PUBLISHED_TWEET',
'ABC\nDEF\nGHI\n',
'EMPLOYEE_ID',
'ID234\n',
'TWEET',
'ABC\nDEF\nGHI\n',
'ID',
'ID345\n',
'TWEET',
'##ABC\nDEF\nGHI#.[]\n',
'USER_IDD',
'ID456\n',
'TWEET',
'google.com\n123456789']
Here we have all the labels and values split. The first item of the list is an empty string because the string starts with a label, so there is nothing before the first split point. We can simply ignore it. We can also see that every value except the last ends with '\n'. We will correct both of these things.
Let's pair each label with its respective value. That way, I believe you will be able to use these values however you want.
I will separate the labels from the values and zip them together.
labels = split_string[1::2]
values = map(lambda x: x.rstrip('\n'), split_string[2::2])
output = zip(labels, values)
zip returns a lazy iterator that can be consumed only once. You can either use the output that way or materialize it by doing something like
output = tuple(output)
In this case the output will be
(('ID', 'ID123'),
('PUBLISHED_TWEET', 'ABC\nDEF\nGHI'),
('EMPLOYEE_ID', 'ID234'),
('TWEET', 'ABC\nDEF\nGHI'),
('ID', 'ID345'),
('TWEET', '##ABC\nDEF\nGHI#.[]'),
('USER_IDD', 'ID456'),
('TWEET', 'google.com\n123456789'))
You can check the python demo here.
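To get from those pairs to the two lists the question asks for, one possible sketch follows. It assumes that every label ending in TWEET marks tweet text and that every other label marks an id; that rule and the names text, pairs, ids and tweets are my assumptions, not something stated in the question.

```python
import re

text = (
    "ID::ID123\n"
    "PUBLISHED_TWEET::ABC\nDEF\nGHI\n"
    "EMPLOYEE_ID::ID234\n"
    "TWEET::ABC\nDEF\nGHI\n"
    "ID::ID345\n"
    "TWEET::##ABC\nDEF\nGHI#.[]\n"
    "USER_IDD::ID456\n"
    "TWEET::google.com\n123456789"
)

# The capturing group keeps each label in the result of re.split.
parts = re.split(r"([A-Z_]+)::", text)

# parts[0] is the empty string before the first label; labels and values alternate after it.
pairs = list(zip(parts[1::2], (value.rstrip("\n") for value in parts[2::2])))

ids = [value for label, value in pairs if not label.endswith("TWEET")]
tweets = [value for label, value in pairs if label.endswith("TWEET")]

print(ids)
print(tweets)
```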
I have a list of phone numbers like so:
numbers=[
'(080)3453421256',
'(04256)6679345390',
'(022)1135643320']
and I have to get the prefixes of those numbers, which have different lengths.
numbers.split(')', 0) gives the output without the bracket.
How can I include the bracket and get the prefixes?
Try this code :
for i in numbers:
    a = i.split(")")
    s = a[0]
    print(s + ")", end=",")
Or Try this :
for i in numbers:
    a = i.index(")")
    s = i[0:a+1]
    print(s, end=", ")
Output :
(080),(04256),(022),
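If you would rather collect the prefixes in a list than print them, str.partition might be a convenient alternative, since it returns the separator along with the text before it. This is a sketch that assumes every number contains a closing bracket; the name prefixes is just illustrative.

```python
numbers = [
    "(080)3453421256",
    "(04256)6679345390",
    "(022)1135643320",
]

# partition(")") returns (text before, ")", text after); glue the first two back together.
prefixes = [head + sep for head, sep, _rest in (n.partition(")") for n in numbers)]

print(prefixes)
```

If a number had no closing bracket, partition would return the whole string as head and an empty separator, so the "prefix" would be the entire number.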
Using Python's regular expression package might be more straightforward for you:
import re
numbers=[
"(080)3453421256",
"(04256)6679345390",
"(022)1135643320"]
pattern = r"\(\d+\)"  # L_PAREN, at least one digit, R_PAREN
for num in numbers:
    print(re.match(pattern, num).group())  # Your data has only one match in each line.
Output:
(080)
(04256)
(022)
How can I parse this string in python?
Input String:
someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data
to this
Output array:
['someplace','2018:6:18:0','25.0114','95.2818','2.71164','66.8962','Entire grid contents are set to missing data']
I have already tried split(' '), but it doesn't work: it is not clear how many spaces are between the substrings, and the last substring may itself contain spaces.
I need the regular expression.
If you do not provide a separator, Python's str.split(sep=None, maxsplit=-1) (docs) treats runs of consecutive whitespace as a single separator and splits on those. You can limit the number of splits by providing a maxsplit value:
data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
spl = data.split(None, 6)  # no separator given (split on whitespace), at most 6 splits
print(spl)
Output:
['someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164',
'66.8962', 'Entire grid contents are set to missing data']
This will work as long as the first text does not contain any whitespace.
If the first text may contain whitespace, you can use/refine this regex solution:
import re
reg = re.findall(r"([^\d]+?) +?([\d:]+) +?([\d.]+) +?([\d.]+) +?([\d.]+) +?([\d.]+) +(.*)$",data)[0]
print(reg)
Output:
('someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164', '66.8962', 'Entire grid contents are set to missing data')
Use e.g. https://regex101.com to check the regex against your other data (follow the link, it uses the above regex on sample data).
[A-Z]{1}[a-zA-Z ]{15,45}|[\w|:|.]+
You can test it here https://pythex.org/
Modify 15,45 according to your needs.
Maxsplit works with re.split(), too:
import re
re.split(r"\s+",text,maxsplit=6)
Out:
['someplace',
'2018:6:18:0',
'25.0114',
'95.2818',
'2.71164',
'66.8962',
'Entire grid contents are set to missing data']
EDIT:
If the first and last text parts don't contain digits, we don't need maxsplit and do not have to rely on number of parts with consecutive spaces:
re.split(r"\s+(?=\d)|(?<=\d)\s+",s)
We cut the string where a space is followed by a digit or vice versa using lookahead and lookbehind.
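A quick check of that lookaround split, as a sketch: the sample line below is the one from the question, but with the spacing deliberately made irregular (my assumption about what the real data looks like).

```python
import re

s = ("someplace   2018:6:18:0  25.0114 95.2818   2.71164 66.8962    "
     "Entire grid contents are set to missing data")

# Split on whitespace that is either followed by a digit or preceded by one.
# Whitespace between two words (no digit on either side) is left untouched.
fields = re.split(r"\s+(?=\d)|(?<=\d)\s+", s)

print(fields)
```

This would break if the free-text parts contained a word starting or ending with a digit, so it only fits data like the sample.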
It is hard to answer your question because the requirements are not very precise. I think I would split the line with the split() function and then join adjacent items whose contents have no digits. Here is a snippet that works with your single sample:
def containsNumbers(s):
    return any(c.isdigit() for c in s)

data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
lst = data.split()
lst2 = []
i = 0
agg = ''
while i < len(lst):
    if containsNumbers(lst[i]):
        if agg != '':
            lst2.append(agg)
            agg = ''
        lst2.append(lst[i])
    else:
        agg += ' '+lst[i]
        agg = agg.strip()
        if i == len(lst) - 1:
            lst2.append(agg)
    i += 1
print(lst2)
I would like to extract values based on certain pattern in a list.
**Example:**
ticker=['HF (NYSE) (81%);BPO (NEW YORK)]']
**Expected Output:**
Tickercode-HF;BPO
StockCode-NYSE;NEW YORK
Relevancescore-81;0
**My code**:
Tickercode=[x for x in ticker if re.match(r'[\w\.-]+[\w\.-]+', x)]
Stockcode=[x for x in ticker if re.match(r'[\w\.-]+(%)+[\w\.-]+', x)]
Relevancescore=[x for x in ticker if re.match(r'[\w\.-]+(%)+[\w\.-]+', x)]
**My output:**
['HF (NYSE) (81%);BPO (NEW YORK)]']
[]
[]
But i am getting wrong output. Please help me to resolve the issue.
Thanks
First, each item of ticker contains multiple records separated by semicolons, so I recommend normalizing ticker into one record per item. Then iterate over the strings and extract the info using the
pattern '(\w+) \(([\w ]+)\)( \(([\d]+)%\))?'.
import re
ticker=['HF (NYSE) (81%);BPO (NEW YORK)]']
ticker=[y for x in ticker for y in x.split(';')]
Tickercode=[]
Stockcode=[]
Relevancescore=[]
for s in ticker:
    m = re.search(r'(\w+) \(([\w ]+)\)( \(([\d]+)%\))?', s)
    Tickercode.append(m.group(1))
    Stockcode.append(m.group(2))
    Relevancescore.append(m.group(4))
print(Tickercode)
print(Stockcode)
print(Relevancescore)
Output:
['HF', 'BPO']
['NYSE', 'NEW YORK']
['81', None]
Update:
I switched to re.search instead of re.match, because re.match only matches at the start of the string. Your input has leading whitespace, which caused the match to fail.
You can add this inside the loop, right after the re.search call, to print any string that doesn't match.
    if m is None:
        print('%s cannot be matched' % s)
        continue
The problem with your code is that you're building up each of your lists from the input. You're telling it, "make a list of the input if the input matches my regular expression". The re.match() only matches against the beginning of a string, so the only regex that matches is the one that matches against the ticker symbol itself.
I've reorganized your code a bit below to show how it can work.
Use re.compile() so the regex doesn't have to be created each time
Use re.search() so you can find your embedded patterns
Use match.group(1) to get the matching part of the query, not the whole of the input.
Break up your input so you're only handling one group at a time
#!/usr/bin/env python
import re
# Example:
ticker=['HF (NYSE) (81%);BPO (NEW YORK)]']
# **Expected Output:**
# Tickercode-HF;BPO
# StockCode-NYSE;NEW YORK
# Relevancescore-81;0
tickercode=[]
stockcode=[]
relevancescore=[]
ticker_re = re.compile(r'^\s*([A-Z]+)')
stock_re = re.compile(r'\(([\w ]+)\)')
relevance_re = re.compile(r'\((\d+)%\)')
for tick in ticker:
    for stockinfo in tick.split(";"):
        ticker_match = ticker_re.search(stockinfo)
        stock_match = stock_re.search(stockinfo)
        relevance_match = relevance_re.search(stockinfo)
        ticker_code = ticker_match.group(1) if ticker_match else ''
        stock_code = stock_match.group(1) if stock_match else ''
        relevance_score = relevance_match.group(1) if relevance_match else '0'
        tickercode.append(ticker_code)
        stockcode.append(stock_code)
        relevancescore.append(relevance_score)
print('Tickercode-' + ';'.join(tickercode))
print('StockCode-' + ';'.join(stockcode))
print('Relevancescore-' + ';'.join(relevancescore))
I want to execute substitutions using regex, not for all matches but only for specific ones. However, re.sub substitutes for all matches. How can I do this?
Here is an example.
Say, I have a string with the following content:
FOO=foo1
BAR=bar1
FOO=foo2
BAR=bar2
BAR=bar3
What I want to do is this:
re.sub(r'^BAR', '#BAR', s, index=[1,2], flags=re.MULTILINE)
to get the below result.
FOO=foo1
BAR=bar1
FOO=foo2
#BAR=bar2
#BAR=bar3
You could pass a replacement function to re.sub that keeps track of the match count and checks whether the match at the given index should be substituted:
import re
s = '''FOO=foo1
BAR=bar1
FOO=foo2
BAR=bar2
BAR=bar3'''
i = 0
index = {1, 2}
def repl(x):
    global i
    if i in index:
        res = '#' + x.group(0)
    else:
        res = x.group(0)
    i += 1
    return res
print(re.sub(r'^BAR', repl, s, flags=re.MULTILINE))
Output:
FOO=foo1
BAR=bar1
FOO=foo2
#BAR=bar2
#BAR=bar3
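If the global counter is a concern (it has to be reset by hand before every call), the same idea can be wrapped in a function with its own counter. This is a sketch only; sub_at_indexes is a hypothetical helper name and not part of the re module.

```python
import re
from itertools import count


def sub_at_indexes(pattern, prefix, text, indexes, flags=0):
    """Prefix only the matches whose 0-based position is in `indexes`."""
    counter = count()  # fresh counter per call, so no global state

    def repl(match):
        # next(counter) yields 0, 1, 2, ... once per match, in order
        if next(counter) in indexes:
            return prefix + match.group(0)
        return match.group(0)

    return re.sub(pattern, repl, text, flags=flags)


s = "FOO=foo1\nBAR=bar1\nFOO=foo2\nBAR=bar2\nBAR=bar3"
result = sub_at_indexes(r"^BAR", "#", s, {1, 2}, flags=re.MULTILINE)
print(result)
```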
You could
Split your string using s.splitlines()
Iterate over the individual lines in a for loop
Track how many matches you have found so far
Only perform substitutions on those matches in the numerical ranges you want (e.g. matches 1 and 2)
And then join them back into a single string (if need be).
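The steps above can be sketched roughly as follows. The function name comment_out_matches and its parameters are hypothetical, and the sketch assumes the pattern is meant to match at the start of a line.

```python
import re


def comment_out_matches(text, pattern, indexes):
    """Put '#' in front of the matching lines whose 0-based match number is in `indexes`."""
    regex = re.compile(pattern)
    seen = 0  # how many matching lines have been found so far
    out = []
    for line in text.splitlines():
        if regex.match(line):  # match() anchors at the start of the line
            if seen in indexes:
                line = "#" + line
            seen += 1
        out.append(line)
    return "\n".join(out)


s = "FOO=foo1\nBAR=bar1\nFOO=foo2\nBAR=bar2\nBAR=bar3"
result = comment_out_matches(s, r"BAR", {1, 2})
print(result)
```

Note that joining with "\n" drops a trailing newline if the original string had one.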