Extract multi-line data between two symbols - Regex and Python 3

I have a huge file from which I need data for specific entries. File structure is:
>Entry1.1
#size=1688
704 1 1 1 4
979 2 2 2 0
1220 1 1 1 4
1309 1 1 1 4
1316 1 1 1 4
1372 1 1 1 4
1374 1 1 1 4
1576 1 1 1 4
>Entry2.1
#size=6251
6110 3 1.5 0 2
6129 2 2 2 2
6136 1 1 1 4
6142 3 3 3 2
6143 4 4 4 1
6150 1 1 1 4
6152 1 1 1 4
>Entry3.2
#size=1777
AND SO ON-----------
What I have to achieve: I need to extract all the lines (the complete record) for certain entries. For example, if I need the record for Entry1.1, I can use the entry name '>Entry1.1' and the next '>' as markers in a regex to extract the lines in between. But I do not know how to build such complex regex expressions. Once I have such an expression, I will put it in a for loop:
for entry in entrylist:
    GET record from big_file
    DO some processing
    WRITE to result file
What regex could perform such an extraction of the record for specific entries? Is there a more pythonic way to achieve this? I would appreciate your help on this.
AK

With regex
import re
ss = '''
>Entry1.1
#size=1688
704 1 1 1 4
979 2 2 2 0
1220 1 1 1 4
1309 1 1 1 4
1316 1 1 1 4
1372 1 1 1 4
1374 1 1 1 4
1576 1 1 1 4
>Entry2.1
#size=6251
6110 3 1.5 0 2
6129 2 2 2 2
6136 1 1 1 4
6142 3 3 3 2
6143 4 4 4 1
6150 1 1 1 4
6152 1 1 1 4
>Entry3.2
#size=1777
AND SO ON-----------
'''
patbase = r'(>Entry *%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))'
while True:
    x = input('What entry do you want? : ')
    found = re.findall(patbase % x, ss, re.DOTALL)
    if found:
        print('found ==', found)
        for each_entry in found:
            print('\n%s\n' % each_entry)
    else:
        print('\n ** There is no such an entry **\n')
Explanation of r'(>Entry *%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))':
1)
%s receives the entry reference: 1.1, 2, 2.1, etc.
2)
The portion (?![^\n]+?\d) performs a verification.
(?![^\n]+?\d) is a negative lookahead assertion saying that what follows %s must not match [^\n]+?\d, that is, any characters [^\n]+? before a digit \d.
I write [^\n] to mean "any character except a newline \n".
I must write this instead of simply .+? because the re.DOTALL flag is set, and the pattern portion .+? would otherwise keep matching to the end of the entry.
However, I only want to verify that after the entered reference (represented by %s in the pattern) there are no additional digits, entered by mistake, before the end OF THE LINE.
This matters because if there is an Entry2.1 but no Entry2, and the user enters just 2 wanting Entry2 and nothing else, the regex would otherwise detect Entry2.1 and return it, even though the user really wanted Entry2.
3)
At the end of (>Entry *%s(?![^\n]+?\d).+?), the part .+? captures the complete block of the entry, because the dot matches any character, including a newline \n.
That is the purpose of the re.DOTALL flag: it lets the pattern portion .+? cross newlines until the end of the entry.
4)
I want the matching to stop at the end of the desired entry, not inside the next one, so that the group defined by the parentheses in (>Entry *%s(?![^\n]+?\d).+?) captures exactly what we want.
Hence the positive lookahead assertion (?=>|(?:\s*\Z)) at the end: the non-greedy .+? must stop matching just before either > (the beginning of the next entry) or the end of the string \Z.
Since the end of the last entry may not be exactly the end of the entire string, I add \s*, meaning "possible whitespace before the very end".
So \s*\Z means "there may be whitespace before we hit the end of the string".
Whitespace characters are a blank, \f, \n, \r, \t, \v.

I'm no good with regexes, so I try to look for non-regex solutions whenever I can. In Python, the natural place to store iteration logic is in a generator, and so I'd use something like this (no-itertools-required version):
def group_by_marker(seq, marker):
    group = []
    # advance past non-matching lines at the start
    for line in seq:
        if marker(line):
            group = [line]
            break
    for line in seq:
        # found a new group start; yield what we've got
        # and start over
        if marker(line) and group:
            yield group
            group = []
        group.append(line)
    # might have extra bits left..
    if group:
        yield group
In your example case, we get:
>>> with open("entry0.dat") as fp:
...     marker = lambda line: line.startswith(">Entry")
...     for group in group_by_marker(fp, marker):
...         print(repr(group[0]), len(group))
...
'>Entry1.1\n' 10
'>Entry2.1\n' 9
'>Entry3.2\n' 4
One advantage to this approach is that we never have to keep more than one group in memory, so it's handy for really large files. It's not nearly as fast as a regex, although if the file is 1 GB you're probably I/O bound anyhow.
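If only a handful of entries is needed, the generator pairs naturally with the loop sketched in the question. A minimal standalone sketch (the function is repeated here in condensed form so the snippet runs on its own, and the entry names in wanted are hypothetical):

```python
def group_by_marker(seq, marker):
    """Yield each record as a list of lines, starting at a marker line."""
    group = []
    for line in seq:
        if marker(line):
            if group:
                yield group
            group = [line]
        elif group:
            group.append(line)
    if group:
        yield group

sample = """>Entry1.1
#size=1688
704 1 1 1 4
>Entry2.1
#size=6251
6110 3 1.5 0 2
>Entry3.2
#size=1777
10 1 1 1 4"""

wanted = {'>Entry1.1', '>Entry3.2'}  # hypothetical entries of interest
records = {g[0]: g[1:]
           for g in group_by_marker(sample.splitlines(),
                                    lambda l: l.startswith('>Entry'))
           if g[0] in wanted}
print(sorted(records))  # ['>Entry1.1', '>Entry3.2']
```

Only the selected records are kept; everything else is discarded as it streams past.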

Not entirely sure what you're asking. Does this get you any closer? It will put all your entries as dictionary keys, each mapping to a list of its lines, assuming the file is formatted the way I believe it is. Does it have duplicate entries? Here's what I've got:
entries = {}
key = ''
for entry in open('entries.txt'):
    if entry.startswith('>Entry'):
        key = entry[1:].strip()  # removes > and newline
        entries[key] = []
    else:
        entries[key].append(entry)
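Once the dictionary is built, fetching a record is a plain key lookup. A sketch of the same logic run on an in-memory sample instead of the file:

```python
sample = """>Entry1.1
#size=1688
704 1 1 1 4
>Entry2.1
#size=6251
6110 3 1.5 0 2
"""

entries = {}
key = ''
for line in sample.splitlines(keepends=True):
    if line.startswith('>Entry'):
        key = line[1:].strip()  # removes > and newline
        entries[key] = []
    else:
        entries[key].append(line)

print(entries['Entry1.1'])  # ['#size=1688\n', '704 1 1 1 4\n']
```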

Related

Removing signs and repeating numbers

I want to remove all signs from my dataframe so each value is left in one of two formats: 100-200 or 200.
So the salaries should have a single hyphen between them if a range of salaries is given, otherwise a clean single number.
I have the following data:
import pandas as pd
import re
df = {'salary': ['£26,768 - £30,136/annum Attractive benefits package',
                 '£26,000 - £28,000/annum plus bonus',
                 '£21,000/annum',
                 '£26,768 - £30,136/annum Attractive benefits package',
                 '£33/hour',
                 '£18,500 - £20,500/annum Inc Bonus - Study Support + Bens',
                 '£27,500 - £30,000/annum £27,500 to £30,000 + Study',
                 '£35,000 - £40,000/annum',
                 '£24,000 - £27,000/annum Study Support (ACCA / CIMA)',
                 '£19,000 - £24,000/annum Study Support',
                 '£30,000 - £35,000/annum',
                 '£44,000 - £66,000/annum + 15% Bonus + Excellent Benefits. L',
                 '£75 - £90/day £75-£90 Per Day']}
data = pd.DataFrame(df)
Here's what I have tried to remove some of the signs:
salary = []
for i in data.salary:
    space = re.sub(" ", '', i)
    lower = re.sub("[a-z]", '', space)
    upper = re.sub("[A-Z]", '', lower)
    bracket = re.sub("/", '', upper)
    comma = re.sub(",", '', bracket)
    plus = re.sub(r"\+", '', comma)
    percentage = re.sub(r"\%", '', plus)
    dot = re.sub(r"\.", '', percentage)
    bracket1 = re.sub(r"\(", '', dot)
    bracket2 = re.sub(r"\)", '', bracket1)
    salary.append(bracket2)
Which gives me:
'£26768-£30136',
'£26000-£28000',
'£21000',
'£26768-£30136',
'£33',
'£18500-£20500-',
'£27500-£30000£27500£30000',
'£35000-£40000',
'£24000-£27000',
'£19000-£24000',
'£30000-£35000',
'£44000-£6600015',
'£75-£90£75-£90'
However, I have some repeating numbers, essentially I want anything after the first range of values removed, and any sign besides the hyphen between the two numbers.
Expected output:
'26768-30136',
'26000-28000',
'21000',
'26768-30136',
'33',
'18500-20500',
'27500-30000',
'35000-40000',
'24000-27000',
'19000-24000',
'30000-35000',
'44000-66000',
'75-90'
Another way using pandas.Series.str.partition with replace:
data["salary"].str.partition("/")[0].str.replace(r"[^\d-]+", "", regex=True)
Output:
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90
Name: 0, dtype: object
Explanation:
It assumes that you are only interested in the part up to /; it extracts everything until /, then removes anything but digits and hyphens.
You can use
data['salary'].str.split('/', n=1).str[0].replace(r'[^\d-]+', '', regex=True)
# 0 26768-30136
# 1 26000-28000
# 2 21000
# 3 26768-30136
# 4 33
# 5 18500-20500
# 6 27500-30000
# 7 35000-40000
# 8 24000-27000
# 9 19000-24000
# 10 30000-35000
# 11 44000-66000
# 12 75-90
Here,
.str.split('/', n=1) - splits into two parts at the first / char
.str[0] - gets the first part
.replace(r'[^\d-]+', '', regex=True) - removes all chars other than digits and hyphens.
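Both the partition and split approaches rest on the same plain-Python string behaviour, which can be checked without pandas:

```python
s = '£26,768 - £30,136/annum Attractive benefits package'
print(s.partition('/'))    # 3-tuple: part before '/', the '/' itself, the rest
print(s.split('/', 1)[0])  # first part only: '£26,768 - £30,136'
```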
A more precise solution is to extract the £num(-£num)? pattern and remove all non-digits/hyphens:
data['salary'].str.extract(r'£(\d+(?:,\d+)*(?:\.\d+)?(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)?)')[0].str.replace(r'[^\d-]+', '', regex=True)
Details:
£ - a literal char
\d+(?:,\d+)*(?:\.\d+)? - one or more digits, followed by zero or more occurrences of a comma and one or more digits, and then an optional sequence of a dot and one or more digits
(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)? - an optional occurrence of a hyphen enclosed in zero or more whitespace chars (\s*-\s*), then a £ char, and then the number pattern described above.
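The extraction pattern can be checked outside pandas with plain re — a quick sketch on one of the sample strings:

```python
import re

pat = r'£(\d+(?:,\d+)*(?:\.\d+)?(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)?)'
s = '£26,768 - £30,136/annum Attractive benefits package'
m = re.search(pat, s)
print(m.group(1))                          # 26,768 - £30,136
print(re.sub(r'[^\d-]+', '', m.group(1)))  # 26768-30136
```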
You can do it in only two regex passes: first extract the monetary amounts with a regex, then remove the thousands separators, and finally join the output by group, keeping only the first two occurrences per original row.
The advantage of this solution is that it only extracts monetary amounts, not other numbers that might be present if the input is not clean.
(data['salary'].str.extractall(r'£([,\d]+)')[0]          # extract the £123,456 digits
     .str.replace(r'\D', '', regex=True)                 # remove the separators
     .groupby(level=0).apply(lambda x: '-'.join(x[:2]))  # join the first two occurrences
)
output:
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90
You can use replace with a pattern and optional capture groups to match the data format, and use those groups in the replacement.
import pandas as pd
df = {'salary': ['£26,768 - £30,136/annum Attractive benefits package',
                 '£26,000 - £28,000/annum plus bonus',
                 '£21,000/annum',
                 '£26,768 - £30,136/annum Attractive benefits package',
                 '£33/hour',
                 '£18,500 - £20,500/annum Inc Bonus - Study Support + Bens',
                 '£27,500 - £30,000/annum £27,500 to £30,000 + Study',
                 '£35,000 - £40,000/annum',
                 '£24,000 - £27,000/annum Study Support (ACCA / CIMA)',
                 '£19,000 - £24,000/annum Study Support',
                 '£30,000 - £35,000/annum',
                 '£44,000 - £66,000/annum + 15% Bonus + Excellent Benefits. L',
                 '£75 - £90/day £75-£90 Per Day']}
data = pd.DataFrame(df).salary.replace(
    r"^£(\d+)(?:,(\d+))?(?:\s*(-)\s*£(\d+)(?:,(\d+))?)?/.*",
    r"\1\2\3\4\5", regex=True
)
print(data)
The pattern matches
^ Start of string
£ Match literally
(\d+) Capture 1+ digits in group 1
(?:,(\d+))? Optionally capture 1+ digits in group 2, preceded by a comma, to match the data format
(?: Non capture group to match as a whole
\s*(-)\s*£ capture - between optional whitespace chars in group 3 and match £
(\d+)(?:,(\d+))? The same as previous, now in group 4 and group 5
)? Close non capture group and make it optional
Output
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90

Use regex pattern to replace numbers followed by a substring or numbers followed by a space and then substring

For a column in a pandas dataframe, I want to remove, in its entirety, any number either immediately followed by "gb" or "mb" or with a space in between. I.e. remove strings such as "500 gb" and "500mb".
Column_To_Fix
0 coolblue 100gb
1 connector 500 mb for thing
2 5gb for user
3 load 800 mb
4 1000 add-on
5 20 gb
The function below only works for row 0 and row 2; I'm not sure how to add the space requirement to the pattern:
pat = r'(^|\s)\d+(gb|mb)($|\s)'
df['Column_To_Fix'].str.lower().replace(pat, ' ', regex=True)
Desired Output:
Column_To_Fix
0 coolblue
1 connector for thing
2 for user
3 load
4 1000 add-on
5
Try this pattern:
pat = r'\d+ *(gb|mb)'
df['Column_To_Fix'].str.lower().str.replace(pat, ' ', regex=True)
Out[462]:
0 coolblue
1 connector for thing
2 for user
3 load
4 1000 add-on
5
Name: Column_To_Fix, dtype: object
If you prefer series.replace
df['Column_To_Fix'].str.lower().replace(pat, ' ', regex=True)
I had assumed the text was (no line numbers):
coolblue 100gb
connector 500 mb for thing
5gb for user
load 800 mb
1000 add-on
20 gb
and that the desired result (which maintains proper alignment and spacing) was:
coolblue
connector for thing
for user
load
1000 add-on
with an empty string on the last line. That can be achieved by replacing matches of the following regular expression with empty strings (using re.sub with the re.M flag, so that ^ and $ also match at line boundaries; the first alternative ends with (?: |$) so the final "20 gb" line, which has nothing after it, is also matched).
r'(?:^\d+ ?[gm]b(?: |$)| \d+ ?[gm]b(?= |$))'
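A runnable sketch of that substitution; the pattern's first alternative is written with (?: |$) so the trailing "20 gb" line is caught, and re.M makes ^ and $ work per line:

```python
import re

pat = r'(?:^\d+ ?[gm]b(?: |$)| \d+ ?[gm]b(?= |$))'
text = """coolblue 100gb
connector 500 mb for thing
5gb for user
load 800 mb
1000 add-on
20 gb"""

cleaned = re.sub(pat, '', text, flags=re.M)
print(cleaned)
```

The standalone "1000" in "1000 add-on" survives because it is not followed by gb/mb, while the last line becomes empty.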

Manipulate time-range in a pandas Dataframe

Need to clean up a csv import, which gives me a range of times (in string form). Code is at the bottom; I currently use regular expressions and replace() on the df to convert other chars. I'm just not sure how to:
select the numbers already in 24-hour format and add :00
select the 12-hour format numbers and convert them to 24-hour format.
Input (from csv import):
break_notes
0 15-18
1 18.30-19.00
2 4PM-5PM
3 3-4
4 4-4.10PM
5 15 - 17
6 11 - 13
So far I have got it to look like (remove spaces, AM/PM, replace dot with colon):
break_notes
0 15-18
1 18:30-19:00
2 4-5
3 3-4
4 4-4:10
5 15-17
6 11-13
However, I would like it to look like this ('HH:MM-HH:MM' format):
break_notes
0 15:00-18:00
1 18:30-19:00
2 16:00-17:00
3 15:00-16:00
4 16:00-16:10
5 15:00-17:00
6 11:00-13:00
My code is:
data = pd.read_csv('test.csv')
data.break_notes = data.break_notes.str.replace(r'([P].|[ ])', '', regex=True).str.strip()
data.break_notes = data.break_notes.str.replace(r'([.])', ':', regex=True).str.strip()
Here is the converter function you need, based on your sample input data. convert_entry takes a complete value entry, splits it on a dash, and passes each half to convert_single, since both halves of an entry can be converted individually. After each conversion, it joins them back with a dash.
convert_single uses a regex to find the important parts of the time string.
It starts with some digits \d+ (the hours), then optionally a dot or a colon and more digits [.:]?(\d+)? (the minutes), and after that optionally AM or PM (AM|PM)? (only PM is relevant in this case).
import re

def convert_single(s):
    m = re.search(pattern=r"(\d+)[.:]?(\d+)?(AM|PM)?", string=s)
    hours = m.group(1)
    minutes = m.group(2) or "00"
    if m.group(3) == "PM":
        hours = str(int(hours) + 12)
    return hours.zfill(2) + ":" + minutes.zfill(2)

def convert_entry(value):
    start, end = value.split("-")
    start = convert_single(start)
    end = convert_single(end)
    return "-".join((start, end))
values = ["15-18", "18.30-19.00", "4PM-5PM", "3-4", "4-4.10PM", "15 - 17", "11 - 13"]
for value in values:
    cvalue = convert_entry(value)
    print(cvalue)
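The converter can then be mapped over the pandas column from the question — a sketch, assuming the column is named break_notes as in the question (the two functions are repeated so the snippet runs standalone):

```python
import re
import pandas as pd

def convert_single(s):
    m = re.search(r"(\d+)[.:]?(\d+)?(AM|PM)?", s)
    hours = m.group(1)
    minutes = m.group(2) or "00"
    if m.group(3) == "PM":
        hours = str(int(hours) + 12)
    return hours.zfill(2) + ":" + minutes.zfill(2)

def convert_entry(value):
    start, end = value.split("-")
    return convert_single(start) + "-" + convert_single(end)

df = pd.DataFrame({"break_notes": ["15-18", "4PM-5PM", "15 - 17"]})
df["break_notes"] = df["break_notes"].map(convert_entry)
print(df["break_notes"].tolist())  # ['15:00-18:00', '16:00-17:00', '15:00-17:00']
```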

Python regex to detect string in multiline

I am trying to detect a string; sometimes it appears on one line and sometimes it spans multiple lines.
case 1:
==================== 1 error in 500.14 seconds =============
case 2:
================= 3 tests deselected by "-m 'not regression'" ==================
21 failed, 553 passed, 35 skipped, 3 deselected, 4 error, 51 rerun in 6532.96 seconds
I have tried the following, but it's not working:
==+.*(?i)(?m)(error|failed).*(==+|seconds)
Use the below regex:
==+[\s\S]*?(\d+)\s(error|failed).*(==+|seconds)
[\s\S] instead of . allows line delimiters as well
(\d+) is the first matching group so matches[0] will always contains the number such as 1 or 21
(error|failed) is the second matching group so matches[1] will contain either 'error' or 'failed'
Testing in Python:
import re
pattern = r"==+[\s\S]*?(\d+)\s(error|failed).*(==+|seconds)"
case1 = "==================== 1 error in 500.14 seconds ============="
p = re.compile(pattern)
matches = p.match(case1).groups()
matches[0] + " " + matches[1] # Output: '1 error'
case2 = """================= 3 tests deselected by -m 'not regression' ==================
21 failed, 553 passed, 35 skipped, 3 deselected, 4 error, 51 rerun in 6532.96 seconds"""
matches = p.match(case2).groups()
matches[0] + " " + matches[1] # Output: '21 failed'
Hope this helps!

replacing appointed characters in a string in txt file

Hello all… I want to pick up the texts 'DesignerXXX' from a text file which contains the contents below:
C DesignerTEE edBore 1 1/42006
Cylinder SingleVerticalB DesignerHHJ e 1 1/8Cooling 1
EngineBore 11/16 DesignerTDT 8Length 3Width 3
EngineCy DesignerHEE Inline2008Bore 1
Height 4TheChallen DesignerTET e 1Stroke 1P 305
Height 8C 606Wall15ccG DesignerQBG ccGasEngineJ 142
Height DesignerEQE C 60150ccGas2007
An idea is to use 'Designer' as a key, splitting each line into 2 parts: before the key and after the key.
file_object = open('C:\\file.txt')
lines = file_object.readlines()
for line in lines:
    if 'Designer' in line:
        where = line.find('Designer')
        before = line[0:where]
        after = line[where:len(line)]
file_object.close()
In the ‘before the key’ part, I need to find the LAST space (‘ ’) and replace it with another symbol/character.
In the ‘after the key’ part, I need to find the FIRST space (‘ ’) and replace it with another symbol/character.
Then I can slice the line and pick up what I want according to the new symbols/characters.
Is there a better way to pick up the wanted texts? If not, how can I replace just those particular spaces?
With the string replace function, I can limit the number of replacements but not choose exactly which occurrences are replaced. How can I do that?
thanks
Using regular expressions, it's a trivial task:
>>> s = '''C DesignerTEE edBore 1 1/42006
... Cylinder SingleVerticalB DesignerHHJ e 1 1/8Cooling 1
... EngineBore 11/16 DesignerTDT 8Length 3Width 3
... EngineCy DesignerHEE Inline2008Bore 1
... Height 4TheChallen DesignerTET e 1Stroke 1P 305
... Height 8C 606Wall15ccG DesignerQBG ccGasEngineJ 142
... Height DesignerEQE C 60150ccGas2007'''
>>> import re
>>> exp = 'Designer[A-Z]{3}'
>>> re.findall(exp, s)
['DesignerTEE', 'DesignerHHJ', 'DesignerTDT', 'DesignerHEE', 'DesignerTET', 'DesignerQBG', 'DesignerEQE']
The regular expression is Designer[A-Z]{3}, which means the letters Designer followed by exactly 3 letters from capital A to capital Z.
So it won't match DesignerABCD (4 letters), and it won't match Designer123 (123 are not valid letters).
It also won't match Designerabc (abc are lowercase letters). To make it ignore case, you can pass the optional flag re.I as a third argument; but this will also match designerabc (you have to be very specific with regular expressions).
So, to match Designer followed by exactly 3 upper- or lowercase letters, change the expression to Designer[A-Za-z]{3}.
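A quick sketch contrasting the character classes and the re.I flag (sample string invented for illustration):

```python
import re

s = 'DesignerTEE designerabc DesignerAbC'
print(re.findall(r'Designer[A-Z]{3}', s))        # ['DesignerTEE']
print(re.findall(r'Designer[A-Za-z]{3}', s))     # ['DesignerTEE', 'DesignerAbC']
print(re.findall(r'designer[a-z]{3}', s, re.I))  # all three, case ignored
```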
If you want to search and replace, then you can use re.sub for substituting matches; so if I want to replace all matches with the word 'hello':
>>> x = re.sub(exp, 'hello', s)
>>> print(x)
C hello edBore 1 1/42006
Cylinder SingleVerticalB hello e 1 1/8Cooling 1
EngineBore 11/16 hello 8Length 3Width 3
EngineCy hello Inline2008Bore 1
Height 4TheChallen hello e 1Stroke 1P 305
Height 8C 606Wall15ccG hello ccGasEngineJ 142
Height hello C 60150ccGas2007
And what if there are characters both before and after 'Designer', and the length of the characters is not fixed? I tried
'[Aa-zZ]Designer[Aa-zZ]{0~9}', but it doesn't work...
For these things, there are special characters in regular expressions. Briefly summarized below:
When you want to say "1 or more, at least 1", use +
When you want to say "0 or more, possibly none", use *
When you want to say "0 or 1 (optional)", use ?
You place the modifier after the expression portion you want it to repeat.
For more on this, have a read through the documentation.
Now, your requirement is "there are characters but the length is not fixed"; based on this, we have to use +.
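The three modifiers are easy to compare on a toy string (illustrative patterns, not the file's data):

```python
import re

s = 'color colour colouur'
print(re.findall(r'colou?r', s))  # ['color', 'colour']             ? : zero or one 'u'
print(re.findall(r'colou+r', s))  # ['colour', 'colouur']           + : one or more
print(re.findall(r'colou*r', s))  # ['color', 'colour', 'colouur']  * : zero or more
```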
Try re.sub. The regular expression matches your keyword surrounded by spaces. The second parameter of sub replaces the surrounding spaces with your_special_char (a hyphen in my script):
>>> import re
>>> with open('file.txt') as file_object:
...     your_special_char = '-'
...     for line in file_object:
...         formatted_line = re.sub(r'(\s)(Designer[A-Z]{3})(\s)', r'%s\2%s' % (your_special_char, your_special_char), line)
...         print(formatted_line)
...
C -DesignerTEE-edBore 1 1/42006
Cylinder SingleVerticalB-DesignerHHJ-e 1 1/8Cooling 1
EngineBore 11/16-DesignerTDT-8Length 3Width 3
EngineCy-DesignerHEE-Inline2008Bore 1
Height 4TheChallen-DesignerTET-e 1Stroke 1P 305
Height 8C 606Wall15ccG-DesignerQBG-ccGasEngineJ 142
Height-DesignerEQE-C 60150ccGas2007
Maroun Maroun mentioned "Why not simply split the string?", so I guess one working way is:
file_object = open('C:\\file.txt')
lines = file_object.readlines()
b = []
for line in lines:
    a = line.split()
    for aa in a:
        b.append(aa)
for bb in b:
    if 'Designer' in bb:
        print(bb)
file_object.close()
