I am fairly new to Python. An external simulation software I use gives me reports which include data in the following format:
1 29 Jan 2013 07:33:19.273 29 Jan 2013 09:58:10.460 8691.186
I want to split the above data into four strings, namely:
'1', '29 Jan 2013 07:33:19.273', '29 Jan 2013 09:58:10.460', '8691.186'
I cannot use a plain str.split() since it splits each date into multiple strings. There appear to be four whitespace characters between 1 and the first date and between the first and second dates. I don't know whether these are four spaces or tabs.
Using '\t' as a delimiter on split doesn't do much. If I specify '    ' (4 spaces) as a delimiter, I get the first three strings. I also then get an empty string and leading spaces in the final string. There are 10 spaces between the second date and the number.
Any suggestions on how to deal with this would be very helpful!
Thanks!
You can split on more than one space with a simple regular expression:
import re
multispace = re.compile(r'\s{2,}') # 2 or more whitespace characters
fields = multispace.split(inputline)
Demonstration:
>>> import re
>>> multispace = re.compile(r'\s{2,}') # 2 or more whitespace characters
>>> multispace.split('1    29 Jan 2013 07:33:19.273    29 Jan 2013 09:58:10.460          8691.186')
['1', '29 Jan 2013 07:33:19.273', '29 Jan 2013 09:58:10.460', '8691.186']
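If you also need the fields as Python values rather than strings, the split can be followed by a conversion step. This is only a sketch: the function name parse_report_line and the format string are my own assumptions based on the sample line, and %b depends on the locale using English month abbreviations.

```python
import re
from datetime import datetime

MULTISPACE = re.compile(r"\s{2,}")  # two or more whitespace characters
# Assumed timestamp layout, e.g. "29 Jan 2013 07:33:19.273"
STAMP_FORMAT = "%d %b %Y %H:%M:%S.%f"


def parse_report_line(line):
    """Split one report line into four fields and convert each to a Python type."""
    index, start, stop, duration = MULTISPACE.split(line.strip())
    return (
        int(index),
        datetime.strptime(start, STAMP_FORMAT),
        datetime.strptime(stop, STAMP_FORMAT),
        float(duration),
    )


sample = "1    29 Jan 2013 07:33:19.273    29 Jan 2013 09:58:10.460          8691.186"
record = parse_report_line(sample)
print(record)
```

The tuple unpacking raises ValueError if a line does not split into exactly four fields, which is a cheap way to notice malformed lines.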
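If the data is fixed width, you can use character addressing (slicing) on the string:
# 'line' holds one report line; the offsets assume the single-spaced layout shown in the question
n = line[0]       # index
d1 = line[2:26]   # first date
d2 = line[27:51]  # second date
num = line[52:]   # trailing number
However, if Jan 02 is shown as Jan 2, this may not work, as the width of the string may vary.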
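If the width can vary, one option is to match on the shape of each field instead of its position. A rough sketch, assuming every line is an integer, two "DD Mon YYYY HH:MM:SS.fff" timestamps and a number; STAMP, LINE_RE and split_by_shape are names I made up for illustration:

```python
import re

# Hypothetical building block: one "DD Mon YYYY HH:MM:SS.fff" timestamp.
STAMP = r"\d{1,2} [A-Za-z]{3} \d{4} \d{2}:\d{2}:\d{2}\.\d+"
# Index, two timestamps, then a number, separated by any amount of whitespace.
LINE_RE = re.compile(rf"^\s*(\d+)\s+({STAMP})\s+({STAMP})\s+(\d+(?:\.\d+)?)\s*$")


def split_by_shape(line):
    """Return the four fields as strings, or raise if the line has another layout."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unexpected line layout: {line!r}")
    return list(match.groups())


print(split_by_shape("1 2 Jan 2013 07:33:19.273    29 Jan 2013 09:58:10.460  8691.186"))
```

Because the separators are matched with \s+, this does not care whether the columns are divided by one space, several spaces or tabs.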
I have a string (in Python) that contains the date, model, make, and year, shown below:
string = "Mar 17 1997 H569, CAT: 2022"
I want to write a program that will ask the user to enter the string, and the program will automatically do something like:
date: data
model:data
make: data
year: data
The question is: how can I delimit the string, since it contains spaces, a comma, a colon, etc.? If I split by character position, the problem is that not all makes and models have the same number of characters. What I am trying to do is split a string that has several different delimiters mixed together, in Python.
One option is to use a regex:
import re
regex = re.compile(r'(\w+ \d+ \d+) (\w+), (\w+): (\d+)')  # raw string so the backslashes reach the regex engine unchanged
string = "Mar 17 1997 H569, CAT: 2022"
regex.findall(string)
output: [('Mar 17 1997', 'H569', 'CAT', '2022')]
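To get the labelled output asked for in the question, the same pattern can be written with named groups. This is just a sketch; the group names and the describe helper are my own choices, not anything the question specifies.

```python
import re

# Same pattern as above, with each capture given a name.
RECORD_RE = re.compile(
    r"(?P<date>\w+ \d+ \d+) (?P<model>\w+), (?P<make>\w+): (?P<year>\d+)"
)


def describe(text):
    """Return a dict of the four fields, or None if the text does not match."""
    match = RECORD_RE.search(text)
    if match is None:
        return None
    return match.groupdict()


fields = describe("Mar 17 1997 H569, CAT: 2022")
for label, value in fields.items():
    print(f"{label}: {value}")
```

Returning None for a non-matching string lets the caller decide how to handle bad input instead of crashing on it.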
string = input('date :')  # e.g. "Mar 17 1997 H569, CAT: 2022"
# Split on whitespace; this expects exactly six tokens
month, day, dyear, model, make, year = string.split()
print(f'date:{day}/{month}/{dyear}')  # day/month/year of the date itself
print(f'model:{model.strip(",")}')
print(f'make:{make.strip(":")}')
print(f'year:{year}')
I have data which I'm reading in a string format as
>>> 26 24 16 Panelboards 10/05/18 26 26 00i Power Distribution Units – Install 10/05/18
I want to separate '26 24 16', 'Panelboards', '10/05/18' and '26 26 00i', 'Power Distribution Units – Install', '10/05/18' as subsection, name, and date.
Also, a new item can begin after every date. In this case, a new subsection begins after 10/05/18.
I have used a regular expression to pick out the subsection, but it leaves my data unstructured:
re.split(r'\d\d \d\d \d\d',sentence)
Does anyone have a solution to efficiently retrieve these 3 features for the two items?
Also, I can't rely on two spaces in the regex, because the structure of the file can change.
Try:
import re
s = """26 24 16  Panelboards  10/05/18 26 26 00i  Power Distribution Units – Install  10/05/18"""
out = re.split(r"\s{2,}", s)
print(out)
Prints:
['26 24 16', 'Panelboards', '10/05/18 26 26 00i', 'Power Distribution Units – Install', '10/05/18']
EDIT: If you want to split the 2nd item, use str.split() with maxsplit=1:
import re
from itertools import chain
s = """26 24 16  Panelboards  10/05/18 26 26 00i  Power Distribution Units – Install  10/05/18"""
out = re.split(r"\s{2,}", s)
out = list(chain(out[:2], out[2].split(maxsplit=1), out[3:]))
print(out)
Prints:
['26 24 16', 'Panelboards', '10/05/18', '26 26 00i', 'Power Distribution Units – Install', '10/05/18']
You can use
\b(?P<subsection>\d+(?:\s+\d\w*)+)\s+(?P<name>.*?)\s+(?P<date>\d{1,2}/\d{1,2}/\d{2})\b
See the regex demo. Details:
\b - word boundary
(?P<subsection>\d+(?:\s+\d\w*)+) - Group "subsection": one or more digits, then one or more occurrences of one or more whitespace characters followed by a digit and zero or more word chars
\s+ - one or more whitespace characters
(?P<name>.*?) - Group "name": zero or more chars other than line break chars, as few as possible
\s+ - one or more whitespace characters
(?P<date>\d{1,2}/\d{1,2}/\d{2}) - Group "date": one or two digits, /, one or two digits, /, two digits
\b - word boundary
See a Python demo:
import re
pattern = r"\b(?P<subsection>\d+(?:\s+\d\w*)+)\s+(?P<name>.*?)\s+(?P<date>\d{1,2}/\d{1,2}/\d{2})\b"
text = "26 24 16 Panelboards 10/05/18 26 26 00i Power Distribution Units – Install 10/05/18"
print([x.groupdict() for x in re.finditer(pattern, text)])
Output:
[
{'subsection': '26 24 16', 'name': 'Panelboards', 'date': '10/05/18'},
{'subsection': '26 26 00i', 'name': 'Power Distribution Units – Install', 'date': '10/05/18'}
]
Be as specific as you can:
/^(\d\d \d\d \d\d) +(.+?) +(\d\d\/\d\d\/\d\d)$/
Match group 1 for the subsection, 2 for the name and 3 for the date.
If you need to split the string first into each line, you could hook that into the end of the date:
\/\d\d\s
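In Python, this idea might look roughly like the sketch below. Two things here are my own assumptions rather than part of the patterns above: the lookbehind used to split right after a date, and the optional \w? that lets a subsection such as "26 26 00i" through. The names RECORD_BREAK, RECORD_RE and parse_records are made up for illustration.

```python
import re

# Assumption: a new record starts at the whitespace right after a "/dd" date ending.
RECORD_BREAK = re.compile(r"(?<=/\d\d)\s+")
# Specific pattern from above; \w? is an addition so "26 26 00i" is accepted.
RECORD_RE = re.compile(r"^(\d\d \d\d \d\d\w?) +(.+?) +(\d\d/\d\d/\d\d)$")


def parse_records(text):
    """Split the text into records, then pull (subsection, name, date) from each."""
    records = []
    for chunk in RECORD_BREAK.split(text.strip()):
        match = RECORD_RE.match(chunk)
        if match:
            records.append(match.groups())
    return records


text = "26 24 16 Panelboards 10/05/18 26 26 00i Power Distribution Units – Install 10/05/18"
print(parse_records(text))
```

Chunks that do not fit the pattern are silently skipped here; collecting them in a separate list would make it easier to spot unexpected layouts.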
I have the string like this :
str = '4 167213860 Mar 7 2017 10:37:42 +00:00 c7600rsp72043-advipservicesk9-mz-obs_v151_3_s1_RLS10_ES5'
I want to recover only one part of this string (c7600rsp72043-advipservicesk9-mz-obs_v151_3_s1_RLS10_ES5)
I am looking for the regex pattern, but I can't find one. I tried something like this in Python:
import re
str = '4 167213860 Mar 7 2017 10:37:42 +00:00 c7600rsp72043-advipservicesk9-mz-obs_v151_3_s1_RLS10_ES5'
output = re.findall(r'[a-z0-9]rsp[a-zA-Z0-9_-]+$',string)
This returns []
If anyone can help me, I would be very grateful.
Use a regex that gets all adjacent non-whitespace characters at the end of the string: \S+$
import re
string = '4 167213860 Mar 7 2017 10:37:42 +00:00 c7600rsp72043-advipservicesk9-mz-obs_v151_3_s1_RLS10_ES5'
output = re.findall(r'\S+$',string)
Working example: https://regex101.com/r/lXFRNT/1
@Ruzhim's answer is good, but if you want to keep doing it the way you had in mind, you could just replace the "rsp" part with \w+:
output = re.findall(r'[a-z0-9]\w+[a-zA-Z0-9_-]+$', str)
>>>['c7600rsp72043-advipservicesk9-mz-obs_v151_3_s1_RLS10_ES5']
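Since the wanted part is simply the last whitespace-separated field, a plain string method may be enough. A small sketch comparing it with a regex search; the variable names are mine, and the \s*$ in the pattern is an addition to tolerate a trailing newline:

```python
import re

line = "4 167213860 Mar 7 2017 10:37:42 +00:00 c7600rsp72043-advipservicesk9-mz-obs_v151_3_s1_RLS10_ES5"

# Plain string method: split once from the right on any whitespace.
last_field = line.rsplit(maxsplit=1)[-1]

# Regex equivalent; \s*$ tolerates trailing whitespace such as a newline.
match = re.search(r"(\S+)\s*$", line)
image_name = match.group(1) if match else None

print(last_field)
print(image_name)
```

Note that the rsplit version raises IndexError on an empty or all-whitespace line, while the regex version falls back to None.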
I've got a file that has a ton of text in it. Some of it looks like this:
X-DSPAM-Processed: Fri Jan 4 18:10:48 2008
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39771
Author: louis@media.berkeley.edu
Date: 2008-01-04 18:08:50 -0500 (Fri, 04 Jan 2008)
New Revision: 39771
Modified:
bspace/site-manage/sakai_2-4-x/site-manage-tool/tool/src/bundle/sitesetupgeneric.properties
bspace/site-manage/sakai_2-4-x/site-manage-tool/tool/src/java/org/sakaiproject/site/tool/SiteAction.java
Log:
BSP-1415 New (Guest) user Notification
I need to pull out only dates that follow this pattern:
2008-01-04 18:08:50 -0500
Here's what I tried:
import re
text = open('mbox-short.txt')
for line in text:
    dates = re.compile('\d{4}(?P<sep>[-/])\d{2}(?P=sep)\d{2}\s\d{2}:\d{2}:]\d{2}\s[-/]\d{4}')
    print(dates)
text.close()
The return I got was hundreds of:
\d{4}(?P<sep>[-/])\d{2}(?P=sep)\d{2}\s\d{2}:\d{2}:]\d{2}\s[-/]\d{4}
Two things:
First, the regex itself:
regex = re.compile(r'\b\d{4}[-/]\d{2}[-/]\d{2}\s\d{2}:\d{2}:\d{2}\s[-+]\d{4}\b')
Secondly, you need to call regex.findall(file) where file is a string:
>>> regex.findall(file)
['2008-01-04 18:08:50 -0500']
re.compile() produces a compiled regular expression object. findall is one of several methods of this object that let you do the actual searching/matching/finding.
Lastly: you're currently using a named capturing group, (?P<sep>[-/]). From your question, "I need to pull out only dates that follow this pattern," it doesn't seem like you need it. You want to extract the entire expression, not capture the separators, which is what capturing groups are designed for. With a capturing group in the pattern, findall returns only the group's contents rather than the whole match.
Full code block:
>>> import re
>>> regex = re.compile(r'\b\d{4}[-/]\d{2}[-/]\d{2}\s\d{2}:\d{2}:\d{2}\s[-+]\d{4}\b')
>>> with open('mbox-short.txt') as f:
...     print(regex.findall(f.read()))
...
['2008-01-04 18:08:50 -0500']
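If the matched strings need to become real timestamps, or the file is too large to read in one go, something like the following might work. It is a sketch that assumes hyphen-separated dates only (as in the sample), and extract_timestamps is a name I made up:

```python
import re
from datetime import datetime

# Hyphen-separated variant of the pattern above, so it lines up with the strptime format.
STAMP_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s[-+]\d{4}\b")


def extract_timestamps(lines):
    """Yield a timezone-aware datetime for every matching stamp in an iterable of lines."""
    for line in lines:
        for stamp in STAMP_RE.findall(line):
            yield datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S %z")


sample = [
    "X-DSPAM-Confidence: 0.6178\n",
    "Date: 2008-01-04 18:08:50 -0500 (Fri, 04 Jan 2008)\n",
]
stamps = list(extract_timestamps(sample))
print(stamps)
```

An open file object can be passed in place of the list, since iterating over a file yields one line at a time.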
Here's another solution.
import re
numberExtractRegex = re.compile(r'(\d\d\d\d[-]\d\d[-]\d\d\s\d\d[:]\d\d[:]\d\d\s[-]\d\d\d\d)')
print(numberExtractRegex.findall('Date: 2008-01-04 18:08:50 -0500 (Fri, 04 Jan 2008), Date: 2010-01-04 18:08:50 -0500 (Fri, 04 Jan 2010)'))
I am trying to extract immunization records of this form:
Immunization: Tetanus
Other: Booster
Method: Injection
Date Received: 07 Jan 2013
and also of this form:
Immunization: TETANUS DIPTHERIA (TD-ADULT)
Date Received: 07 Dec 2012 # 1155
Location: PORTLAND (OR) VAMC
Reaction:* None Reported
Comments: 1234567
Here is my pattern string:
"Immunization:(.*?)\n[.\n*?]*?Date Received:(.*?)\n"
This identifies the second pattern and extracts the vaccination name and date, but not the first pattern. I thought that [.\n*?]*? would take care of the two possibilities (that there are other fields between vaccination name and vaccination date, or not), but this doesn't seem to be doing the trick. What is wrong with my regex and how can I fix it?
You can use the following, where subject is the text to search:
import re
matches = re.findall(r"Immunization:\s+(.*?)\s+.*?Date Received:\s+(.*?)$", subject, re.IGNORECASE | re.DOTALL | re.MULTILINE)
Tested this on pythex with MULTILINE and DOTALL:
Input
Immunization: Tetanus
Other: Booster
Method: Injection
Date Received: 07 Jan 2013
Immunization: TETANUS DIPTHERIA (TD-ADULT)
Date Received: 07 Dec 2012 # 1155
Location: PORTLAND (OR) VAMC
Reaction:* None Reported
Comments: 1234567
Pattern: Immunization:\s+(\w+).*?Date Received:\s+([^\n]+)
Match 1
Tetanus
07 Jan 2013
Match 2
TETANUS
07 Dec 2012 # 1155
The . in [.\n] is taken as a literal '.', not as the any-character symbol. This is why a date line immediately following the immunization line is accepted, but the pattern cannot skip over any character that is not a newline or a dot.
(.*\n)* comes to mind as the closest fix to what you already have. However, so many nested * quantifiers are unfortunate: they mean a lot of backtracking when parsing the record, and as a human I also find them harder to understand. It may be preferable to start every loop with a literal, so that the decision whether to enter or continue the loop is easier.
If I did not mess it up then
Immunization:(.*?)(\n.*)*\nDate Received:(.*)\n
would do without left recursion and "Date Received" would only be detected at the beginning of the line.
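One caveat worth checking: when several records sit in the same text, a greedy (\n.*)* can run on to the last "Date Received" line. A lazy, non-capturing variant stops at the nearest one instead. This is a sketch under that assumption, with RECORD_RE as my own name and [ \t]* added to trim the space after each colon:

```python
import re

# (?:\n.*)*? is lazy, so each match stops at the nearest "Date Received:" line.
RECORD_RE = re.compile(
    r"Immunization:[ \t]*(.*)(?:\n.*)*?\nDate Received:[ \t]*(.*)"
)

text = """Immunization: Tetanus
Other: Booster
Method: Injection
Date Received: 07 Jan 2013
Immunization: TETANUS DIPTHERIA (TD-ADULT)
Date Received: 07 Dec 2012 # 1155
Location: PORTLAND (OR) VAMC
Reaction:* None Reported
Comments: 1234567
"""

records = RECORD_RE.findall(text)
for name, received in records:
    print(name, "|", received)
```

A record with no "Date Received" line would still borrow the date of the following record, so this only holds if every record has one.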