Check for blanks at specified positions in a string - python

I have the following problem, which I have solved in a rather long-winded way, and I would like to know whether there is a better way to solve it. I have the following string structure:
text = 01 ARA 22 - 02 GAG 23
But due to processing sometimes the spaces are not added properly and it may look like this:
text = 04 GOR23- 02 OER 23
text = 04 ORO 21-02 RRO 24
text = 04 DRE25- 12 RIS21
When they should look as follows:
text = 04 GOR 23 - 02 OER 23
text = 04 ORO 21 - 02 RRO 24
text = 04 DRE 25 - 12 RIS 21
To add the spaces at those specific positions, I currently check whether a space exists at each position of the string and add one if it does not.
Is there a more efficient way to do this in Python?
I appreciate any advice.

You can use a regex to capture each of the components in the text, and then replace any missing spaces with a space:
import re
regex = re.compile(r'(\d{2})\s*([A-Z]{3})\s*(\d{2})\s*-\s*(\d{2})\s*([A-Z]{3})\s*(\d{2})')
text = ['04 GOR23- 02 OER 23',
'04 ORO 21-02 RRO 24',
'04 DRE25- 12 RIS21']
[regex.sub(r'\1 \2 \3 - \4 \5 \6', t) for t in text]
Output:
['04 GOR 23 - 02 OER 23',
'04 ORO 21 - 02 RRO 24',
'04 DRE 25 - 12 RIS 21']
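If the input can also contain lines that do not follow the pattern at all, regex.sub will silently return them unchanged. A minimal sketch of a stricter variant follows; the function name normalize and the choice to raise ValueError are assumptions for illustration, not part of the answer above:

```python
import re

# Same six components as above, but surrounding whitespace is tolerated
# and the whole string has to match.
_PATTERN = re.compile(
    r'\s*(\d{2})\s*([A-Z]{3})\s*(\d{2})\s*-\s*(\d{2})\s*([A-Z]{3})\s*(\d{2})\s*'
)

def normalize(text):
    """Return text with canonical spacing, or raise ValueError if it is malformed."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError("unexpected format: {!r}".format(text))
    return '{} {} {} - {} {} {}'.format(*match.groups())

print(normalize('04 DRE25- 12 RIS21'))
```

Using fullmatch instead of sub means a line with extra or missing components is reported rather than passed through.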

Here is another way to do so: remove all the spaces, then re-insert one before each character index where a space belongs:
data = '04 GOR23- 02 OER 23'
new_data = "".join(char if i not in [2, 5, 7, 8, 10, 13] else f" {char}" for i, char in enumerate(data.replace(" ", "")))
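The hard-coded positions 2, 5, 7, 8, 10 and 13 are the indexes, in the space-free string, of the characters that must be preceded by a space. As a sketch, they can be derived from a correctly formatted sample instead of being counted by hand; the helper names space_positions and respace are hypothetical:

```python
def space_positions(template):
    """Indexes (in the space-free string) of characters that follow a space."""
    positions = set()
    count = 0  # number of non-space characters seen so far
    for ch in template:
        if ch == " ":
            positions.add(count)
        else:
            count += 1
    return positions

def respace(text, template="01 ARA 22 - 02 GAG 23"):
    """Strip all spaces from text, then re-insert them where the template has them."""
    positions = space_positions(template)
    compact = text.replace(" ", "")
    return "".join(" " + ch if i in positions else ch
                   for i, ch in enumerate(compact))

print(sorted(space_positions("01 ARA 22 - 02 GAG 23")))
print(respace("04 DRE25- 12 RIS21"))
```

This only works when every record has the same number of non-space characters as the template.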

Related

How to parse a string which may or may not contain a newline in Python?

I have a text dataset like this, with records separated by commas:
TABLE DATA IS ALPHANUMERIC
TABLE IS UNSORTED
VALUES ARE ( A S Q M N C H )
SEARCH IS LINEAR
,
TABLE DATA IS ALPHANUMERIC
TABLE IS SORTED
VALUES ARE ( 0A M 0B S 0A D 01 ' ' 04 N 05 P 07 T 08 K 09 E )
SEARCH IS NONLINEAR
,
TABLE DATA IS ALPHANUMERIC
TABLE IS SORTED
VALUES ARE ( 02 M 0f S 0A M 0B S 0A D 01 ' ' 0D N 05 P
17 T 08 K 09 E )
SEARCH IS LINEAR
The sample output will be like this: [image: sample output]
I have to parse the data into a pandas DataFrame containing columns such as is_alphanumeric, sorted/unsorted and values.
I have split the file on the comma delimiter and am running a for loop over each item:
value_name=[]
vname = re.search(r'VALUES ARE \( (.*) \)', line)
value_name.append(vname.group(1).replace("' '", "''"))
But this regex only fetches values that are on a single line; I am unable to fetch values that are spread over multiple lines. One data item here has 2 lines of values, and there can be 3 as well. How do I fetch the values in that case, and how do I remove the newline character and the extra spaces?
You can supply flags to the regex methods that influence how matching is done. I supply re.DOTALL so that '.' also matches newlines, and adjust the pattern for your values a bit. The steps:
1. split at ','
2. extract into a dictionary via regex
3. split + strip + join every data line to remove newlines and leading spaces
4. add to the DataFrame
text = """TABLE DATA IS ALPHANUMERIC
TABLE IS UNSORTED
VALUES ARE ( A S Q M N C H )
SEARCH IS LINEAR
,
TABLE DATA IS ALPHANUMERIC
TABLE IS SORTED
VALUES ARE ( 0A M 0B S 0A D 01 ' ' 04 N 05 P 07 T 08 K 09 E )
SEARCH IS NONLINEAR
,
TABLE DATA IS ALPHANUMERIC
TABLE IS SORTED
VALUES ARE ( 02 M 0f S 0A M 0B S 0A D 01 ' ' 0D N 05 P
17 T 08 K 09 E )
SEARCH IS LINEAR"""
Code:
import re
import pandas as pd
# map a regex pattern to the column name in the df that stores its info
patterns = {r"(IS ALPHANUMERIC)": "alphanum",
            r"(IS SORTED)": "sorting",
            r"(IS UNSORTED)": "sorting",
            r"(IS NONLINEAR)": "search",
            r"(IS LINEAR)": "search",
            r"VALUES ARE \((.+?)\)": "values"}
df = pd.DataFrame(columns=["alphanum", "sorting", "search", "values"])
for i, parts in enumerate(text.split(",")):
    found = {"alphanum": None, "sorting": None, "search": None, "values": None}
    for pattern, heading in patterns.items():
        # re.DOTALL lets '.' match newlines, so multi-line VALUES are captured
        match = re.search(pattern, parts, flags=re.DOTALL)
        if match:
            # strip each physical line and join them into a single line
            found[heading] = ' '.join(map(str.strip, match[1].split("\n")))
    df.loc[i] = found
print(df)
Output:
alphanum sorting search values
0 IS ALPHANUMERIC IS UNSORTED IS LINEAR A S Q M N C H
1 IS ALPHANUMERIC IS SORTED IS NONLINEAR 0A M 0B S 0A D 01 ' ' 04 N 05 P 07 T 08 K 09 E
2 IS ALPHANUMERIC IS SORTED IS LINEAR 02 M 0f S 0A M 0B S 0A D 01 ' ' 0D N 05 P 17 T...
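For the VALUES field alone, the same idea can be applied directly to the pattern from the question. A minimal sketch, assuming each record has already been isolated in a string (the variable name chunk and the indentation of the continuation line are assumptions):

```python
import re

chunk = """TABLE DATA IS ALPHANUMERIC
TABLE IS SORTED
VALUES ARE ( 02 M 0f S 0A M 0B S 0A D 01 ' ' 0D N 05 P
             17 T 08 K 09 E )
SEARCH IS LINEAR"""

# re.DOTALL makes '.' match newlines; the lazy '.*?' stops at the first ')'.
match = re.search(r"VALUES ARE \((.*?)\)", chunk, flags=re.DOTALL)

# str.split() with no argument splits on any run of whitespace (spaces and
# newlines alike), so joining the pieces collapses everything to single spaces.
values = " ".join(match.group(1).split())
print(values)
```

Note that this also collapses runs of spaces inside a quoted value, so it is only suitable if quoted values never contain more than one space.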
@PatrickArtner provided the regex solution; I just want to add a solution using string methods:
dataset = """TABLE DATA IS ALPHANUMERIC
TABLE IS UNSORTED
VALUES ARE ( A S Q M N C H )
SEARCH IS LINEAR
,
TABLE DATA IS ALPHANUMERIC
TABLE IS SORTED
VALUES ARE ( 0A M 0B S 0A D 01 ' ' 04 N 05 P 07 T 08 K 09 E )
SEARCH IS NONLINEAR
,
TABLE DATA IS ALPHANUMERIC
TABLE IS SORTED
VALUES ARE ( 02 M 0f S 0A M 0B S 0A D 01 ' ' 0D N 05 P
17 T 08 K 09 E )
SEARCH IS LINEAR"""
def parse_chunk(chunk):
    data = {}
    headers = {'TABLE DATA IS ': 'DATA TYPE', 'TABLE IS ': 'IS_SORTED',
               'SEARCH IS ': 'SEARCH'}
    values_prefix = 'VALUES ARE ( '
    for line in chunk:
        if line.startswith(values_prefix):
            # drop the closing " )" only if it is present, so that a wrapped
            # first line keeps its last value
            data['VALUES'] = [line[len(values_prefix):].rstrip(' )')]
        else:
            for prefix, key in headers.items():
                if line.startswith(prefix):
                    data[key] = line[len(prefix):]
                    break
            else:
                # no header matched: treat the line as a continuation of VALUES
                if 'VALUES' in data and line.strip():
                    data['VALUES'].append(line.strip(' )'))
    data['VALUES'] = ' '.join(data['VALUES']) # combine all values chunks
    return data
import pandas as pd
data = [parse_chunk(chunk.splitlines()) for chunk in dataset.split('\n,\n')]
print(data)
df = pd.DataFrame(data)
print(df)
output
[{'DATA TYPE': 'ALPHANUMERIC', 'IS_SORTED': 'UNSORTED', 'VALUES': 'A S Q M N C H', 'SEARCH': 'LINEAR'}, {'DATA TYPE': 'ALPHANUMERIC', 'IS_SORTED': 'SORTED', 'VALUES': "0A M 0B S 0A D 01 ' ' 04 N 05 P 07 T 08 K 09 E", 'SEARCH': 'NONLINEAR'}, {'DATA TYPE': 'ALPHANUMERIC', 'IS_SORTED': 'SORTED', 'VALUES': "02 M 0f S 0A M 0B S 0A D 01 ' ' 0D N 05 P 17 T 08 K 09 E", 'SEARCH': 'LINEAR'}]
DATA TYPE IS_SORTED VALUES SEARCH
0 ALPHANUMERIC UNSORTED A S Q M N C H LINEAR
1 ALPHANUMERIC SORTED 0A M 0B S 0A D 01 ' ' 04 N 05 P 07 T 08 K 09 E NONLINEAR
2 ALPHANUMERIC SORTED 02 M 0f S 0A M 0B S 0A D 01 ' ' 0D N 05 P 17 T... LINEAR

Regex - Skip subsequent matches (Python)

I am trying to split the following string into table columns. Splitting on "[\s]{2,}" would work, except that the name column has 2 spaces in it. I also can't simply ignore whitespace between letters, since the 1st column ends with a letter and the 2nd column starts with a letter.
I would like to skip any whitespace in between letters after the 1st occurrence.
String:
> TESTID DR5 777777 0 50000 TEST NAME 23.40 600000.00 1000000 20 5 09 05 18 09 07 18 3876.00
TESTID
DR5 777777 0
50000
TEST NAME
23.40
600000.00
1000000 20 5
09 05 18
09 07 18
3876.00
I will try to solve your stated problem (vs the regex thing), because I don't fully understand the regex question.
If I were going to make that string into a list, I would do it like this:
my_str = "TESTID DR5 777777 0 50000 TEST NAME 23.40 600000.00 1000000 20 5 09 05 18 09 07 18 3876.00"
my_list = [section for section in my_str.split(" ") if section != ""]
This uses list comprehension to filter out the blank strings from the split.
You can also use a regular expression as the separator.
import re
my_str = "TESTID DR5 777777 0 50000 TEST NAME 23.40 600000.00 1000000 20 5 09 05 18 09 07 18 3876.00"
# no space inside the braces: r'\s{2, }' would be read as literal text, not as a quantifier
my_list = re.split(r'\s{2,}', my_str)
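To illustrate splitting with a regular-expression separator, here is a small sketch. The spacing in the sample row is an assumption (columns separated by two spaces, single spaces inside a column), since the exact spacing of the original string is not preserved above:

```python
import re

# Hypothetical row: two spaces between columns, single spaces within a column.
row = "TESTID  DR5 777777 0  50000  TEST NAME  23.40"

# Split only on runs of two or more whitespace characters,
# so single spaces inside a column are kept.
columns = re.split(r'\s{2,}', row)
print(columns)
```

If a column can itself contain a run of two spaces, as the question describes for the name, this separator is not sufficient on its own and the column widths would have to be known.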

processing independent values from iterating through a list of dictionaries

Faultysats=[{'G11': '16 01 13 09 43 50.0000000'},
{'G11': '16 01 13 09 43 51.0000000'},
{'G03': '16 01 13 09 43 52.0000000'}]
SATS=['G01', 'G03', 'G04', 'G08', 'G11', 'G19', 'G28', 'G32']
EPOCH='16 01 13 09 43 51.0000000'
I have these three objects: Faultysats, a list of dictionaries mapping a satellite to an epoch time; SATS, a list of satellites; and EPOCH, a time value.
If a satellite from Faultysats (e.g. 'G11') appears in SATS AND its corresponding epoch from Faultysats (e.g. '16 01 13 09 43 50.0000000') equals EPOCH, I want to know which index the satellite is at in the SATS list.
I hope that makes sense. I'm struggling because I don't know how to access the varying values in a list of dictionaries. Is there a certain operator for extracting data from such a list?
To get a list of the indexes in SATS you could use:
Faultysats=[{'G11': '16 01 13 09 43 50.0000000'},
{'G11': '16 01 13 09 43 51.0000000'},
{'G03': '16 01 13 09 43 52.0000000'}]
SATS=['G01', 'G03', 'G04', 'G08', 'G11', 'G19', 'G28', 'G32']
EPOCH='16 01 13 09 43 51.0000000'
indexes = []
for entry in Faultysats:
    for key in entry:
        if (key in SATS) and (entry[key] == EPOCH):
            # store the satellite's position in SATS, not the entry's position in Faultysats
            indexes.append(SATS.index(key))
print(indexes)
How about this: first get the keys of the dictionaries whose corresponding value is equal to EPOCH, then get the index of each key in SATS:
>>> Faultysats=[{'G11': '16 01 13 09 43 50.0000000'},
{'G11': '16 01 13 09 43 51.0000000'},
{'G03': '16 01 13 09 43 52.0000000'}]
>>> SATS=['G01', 'G03', 'G04', 'G08', 'G11', 'G19', 'G28', 'G32']
>>> EPOCH='16 01 13 09 43 51.0000000'
>>> f = [k for d in Faultysats for k,v in d.items() if EPOCH == v]
>>> indx = [SATS.index(x) for x in f]
>>>
>>> indx
[4]
First, let's get a list of all satellites from Faultysats that are in SATS and have the EPOCH timestamp (items() works in both Python 2 and 3; iteritems() is Python 2 only).
sats = [sat for pair in Faultysats for sat, epoch in pair.items()
        if sat in SATS and epoch == EPOCH]
>>> sats
['G11']
Now you can use a dictionary comprehension to provide the index location of each satellite in SATS (assuming there are no duplicates).
>>> {s: SATS.index(s) for s in sats}
{'G11': 4}
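The two steps can be folded into one helper. This is only a sketch: the function name faulty_sat_indexes is hypothetical, and it assumes, as above, that SATS contains no duplicates:

```python
def faulty_sat_indexes(faulty, sats, epoch):
    """Map each faulty satellite seen at `epoch` to its index in `sats`."""
    # Build the lookup once instead of calling sats.index() repeatedly.
    position = {sat: i for i, sat in enumerate(sats)}
    return {
        sat: position[sat]
        for entry in faulty
        for sat, when in entry.items()
        if when == epoch and sat in position
    }

Faultysats = [{'G11': '16 01 13 09 43 50.0000000'},
              {'G11': '16 01 13 09 43 51.0000000'},
              {'G03': '16 01 13 09 43 52.0000000'}]
SATS = ['G01', 'G03', 'G04', 'G08', 'G11', 'G19', 'G28', 'G32']
EPOCH = '16 01 13 09 43 51.0000000'

print(faulty_sat_indexes(Faultysats, SATS, EPOCH))
```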

Python regex is not extracting a substring from my log file

I'm using
date = re.findall(r"^(?:\w{3} ){2}\d{2} (?:[\d]{2}:){2}\d{2} \d{4}$", message)
in Python 2.7 to extract the substrings:
Wed Feb 04 13:29:49 2015
Thu Feb 05 13:45:08 2015
from a log file like this:
1424,Wed Feb 04 13:29:49 2015,51
1424,Thu Feb 05 13:45:08 2015,29
It is not working, and I'm required to use regex for this task, otherwise I would have split() it. What am I doing wrong?
Since your substrings do not start at the beginning of the string (or end at its end), you don't need to assert the position at the start and end of the string, so you can remove ^ and $:
>>> s ="""
1424,Wed Feb 04 13:29:49 2015,51
1424,Thu Feb 05 13:45:08 2015,29"""
>>> date = re.findall(r"(?:\w{3} ){2}\d{2} (?:[\d]{2}:){2}\d{2} \d{4}", s)
>>> date
['Wed Feb 04 13:29:49 2015', 'Thu Feb 05 13:45:08 2015']
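The effect of the anchors can be checked side by side. A minimal sketch (the variable names are illustrative); it also shows that re.MULTILINE does not rescue the anchored pattern here, because each line starts with "1424," rather than with the date:

```python
import re

log = "1424,Wed Feb 04 13:29:49 2015,51\n1424,Thu Feb 05 13:45:08 2015,29"
core = r"(?:\w{3} ){2}\d{2} (?:\d{2}:){2}\d{2} \d{4}"

# ^ and $ require the date to span the whole string (or, with MULTILINE,
# a whole line), which it never does in this log format.
anchored = re.findall("^" + core + "$", log)
anchored_multiline = re.findall("^" + core + "$", log, flags=re.MULTILINE)

# Without anchors the date can be found anywhere inside a line.
unanchored = re.findall(core, log)

print(anchored, anchored_multiline, unanchored)
```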
Also as an alternative proposition you can just use a positive look-behind :
>>> date = re.findall(r"(?<=\d{4},).*", s)
>>> date
['Wed Feb 04 13:29:49 2015,51', 'Thu Feb 05 13:45:08 2015,29']
or, without using regex, you can use str.split() and str.partition() for such tasks (strip the string first, so that the leading newline does not produce an empty entry):
>>> s ="""
1424,Wed Feb 04 13:29:49 2015,51
1424,Thu Feb 05 13:45:08 2015,29"""
>>> [i.partition(',')[-1] for i in s.strip().split('\n')]
['Wed Feb 04 13:29:49 2015,51', 'Thu Feb 05 13:45:08 2015,29']
A simple way to do this is to just match by the commas:
message = '1424,Wed Feb 04 13:29:49 2015,51 1424,Thu Feb 05 13:45:08 2015,29'
date = re.findall(r",(.*?),", message)
print(date)
['Wed Feb 04 13:29:49 2015', 'Thu Feb 05 13:45:08 2015']
You don't need regex; use split.
line = "1424,Wed Feb 04 13:29:49 2015,51"
date = line.split(",")[1]
print(date)
Wed Feb 04 13:29:49 2015

Python: hexadecimal regular expression question

I want to parse the output of a serial monitoring program called Docklight (I highly recommend it).
It outputs 'hexadecimal' strings: a sequence of (two capital hex digits followed by a space). The corresponding regular expression is ([0-9A-F]{2} )+, for example: '05 03 DA 4B 3F '
When the program detects particular sequences of characters, it places comments in the 'hexadecimal' string. For example:
'05 03 04 01 0A The Header 03 08 0B BD AF The PAYLOAD 0D 0A The Footer'
Comments are strings of the following format: ' .+ ' (a sequence of characters preceded by a space and followed by a space).
I want to get rid of the comments. For example, the 'hexadecimal' string above, once filtered, would be:
'05 03 04 01 0A 03 08 0B BD AF 0D 0A '
How do I go about doing this with a regular expression?
You could try re.findall():
>>> a='05 03 04 01 0A The Header 03 08 0B BD AF The PAYLOAD 0D 0A The Footer'
>>> re.findall(r"\b[0-9A-F]{2}\b", a)
['05', '03', '04', '01', '0A', '03', '08', '0B', 'BD', 'AF', '0D', '0A']
The \b in the regular expression matches a "word boundary".
Of course, your input is ambiguous if the serial monitor inserts something like THIS BE THE HEADER.
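The ambiguity is easy to reproduce. In this sketch the second input is a made-up example of such a comment, not real Docklight output:

```python
import re

HEX_BYTE = re.compile(r"\b[0-9A-F]{2}\b")

clean = "05 03 04 01 0A The Header 03 08"
# Hypothetical comment containing a two-letter uppercase word made of hex digits.
ambiguous = "05 03 THIS BE THE HEADER 03 08"

print(HEX_BYTE.findall(clean))
# "BE" is a standalone word consisting only of hex digits, so it is
# indistinguishable from a data byte.
print(HEX_BYTE.findall(ambiguous))
```

Longer words such as HEADER are safe, because \b requires the two hex digits to form a whole word on their own.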
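Using your regex, with a non-capturing group (with a capturing group, findall returns only the last repetition of the group) and a word boundary (so that the AD at the end of PAYLOAD is not picked up):
hexa = r'(?:\b[0-9A-F]{2} )+'
"".join(re.findall(hexa, line))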
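It might be easier to find all the hexadecimal numbers, assuming the inserted strings won't contain a match. In this sample the assumption fails: the "AD " at the end of "PAYLOAD " also matches, which is where the stray AD in the output comes from: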
>>> data = '05 03 04 01 0A The Header 03 08 0B BD AF The PAYLOAD 0D 0A The Footer'
>>> import re
>>> pattern = re.compile("[0-9A-F]{2} ")
>>> "".join(pattern.findall(data))
'05 03 04 01 0A 03 08 0B BD AF AD 0D 0A '
Otherwise you could use the fact that the inserted strings are preceded by two spaces (note the two spaces before each comment in the data, written as " {2}" in the pattern):
>>> data = '05 03 04 01 0A  The Header 03 08 0B BD AF  The PAYLOAD 0D 0A  The Footer'
>>> re.sub(r"( {2}.*?)(?=( [0-9A-F]{2} |$))", "", data)
'05 03 04 01 0A 03 08 0B BD AF 0D 0A'
This uses a lookahead to work out where the inserted string ends. It looks for either a hexadecimal number surrounded by spaces or the end of the source string.
How about a solution that actually uses regex negation? ;)
result = re.sub(r"[ ]+(?:(?!\b[0-9A-F]{2}\b).)+", "", subject)
While you already received two answers that find all the hexadecimal numbers, here's the converse: a direct regular expression that finds all text that does not look like a hexadecimal number (assuming that means two letters/digits, uppercase or lowercase, in the 0-9 and A-F range, followed by a space).
Something like this (sorry, I'm not a pythoneer, but you get the idea):
newstring = re.sub(r"[^ ]+(?<![0-9A-Fa-f ]{2}|^.)", "", yourstring)
It works by "looking back". It finds every consecutive non-space substring, then negatively looks back with (?<!....). It says: "if the previous two characters were not a hex number, then succeed". The little ^. at the end prevents incorrectly matching the first character of the string. Note that Python's re module only accepts fixed-width look-behinds, so the two alternatives here, which differ in length, are rejected by re.compile as written.
Edit
As suggested by Alan Moore, here's the same idea with a positive lookahead expression:
newstring = re.sub(r"(?>\b[0-9A-Fa-f ]{2}\b)", "", yourstring)
Why regexp? To me this is more pythonic (fixed to check for hex digits, not regular digits):
command='05 03 04 01 0A The Header 03 08 0B BD AF The PAYLOAD 0D 0A The Footer'
print(' '.join(com for com in command.split()
               if len(com) == 2 and all(c.upper() in '0123456789ABCDEF' for c in com)))
