I have a string that holds data, and I want everything between ({ and }). For example:
"({Simple Data})"
should return "Simple Data".
Or use a regex:
import re
s = '({Simple Data})'
print(re.search(r'\(\{([^})]+)', s).group(1))
Output:
Simple Data
You could try the following:
^\({(.*)}\)$
Group 1 will contain Simple Data.
See an example on regexr.
If the brackets are always positioned at the beginning and the end of the string, then you can do this:
l = "({Simple Data})"
print(l[2:-2])
Which results in:
Simple Data
In Python you can access single characters of a string with the [] operator, and a range of characters with slice notation. Here l[2:-2] takes the characters from the third one (index 2) up to, but not including, the second-to-last one (index -2).
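If it is not guaranteed that every string actually carries the delimiters, a small guard before slicing avoids silently chopping off real data. This is only a minimal sketch: the helper name strip_delimiters and the choice to return None for non-matching input are assumptions made for illustration.

```python
def strip_delimiters(text, left="({", right="})"):
    """Return the text between the delimiters, or None if they are missing."""
    long_enough = len(text) >= len(left) + len(right)
    if long_enough and text.startswith(left) and text.endswith(right):
        # Slice off exactly as many characters as the delimiters occupy.
        return text[len(left):len(text) - len(right)]
    return None


inner = strip_delimiters("({Simple Data})")
missing = strip_delimiters("Simple Data")
print(inner)    # Simple Data
print(missing)  # None
```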
You could try this regex (?s)\(\{(.*?)\}\)
which simply captures the contents between the delimiters.
Beware though, this doesn't account for nesting.
If nesting is a concern, the best you can do with the standard Python re engine
is to get the innermost nest only, using this regex:
\(\{((?:(?!\(\{|\}\)).)*)\}\)
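To make the difference between the two patterns concrete, here is a short sketch that runs both on a nested input. The sample string is made up for illustration; the second pattern is written with the negative lookahead closed before the dot.

```python
import re

# Made-up sample: one flat block followed by one nested block.
text = "({Simple Data}) ({Parent ({Nested}) Suffix})"

# Lazy match: stops at the first "})", so it cuts nested blocks short.
lazy = re.compile(r"(?s)\(\{(.*?)\}\)")
# Tempered dot: refuses to step over "({" or "})", so only innermost blocks match.
innermost = re.compile(r"\(\{((?:(?!\(\{|\}\)).)*)\}\)")

lazy_hits = lazy.findall(text)
inner_hits = innermost.findall(text)
print(lazy_hits)   # ['Simple Data', 'Parent ({Nested']
print(inner_hits)  # ['Simple Data', 'Nested']
```

Neither pattern recovers the outer "Parent ... Suffix" block as a whole; that needs a real parser or a stack.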
Here is a tokenizer I designed for nested data. The OP may want to try it out.
import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(code):
    token_specification = [
        ('DATA', r'[ \t]*[\w]+[\w \t]*'),
        ('SKIP', r'[ \t\f\v]+'),
        ('NEWLINE', r'\n|\r\n'),
        ('BOUND_L', r'\(\{'),
        ('BOUND_R', r'\}\)'),
        ('MISMATCH', r'.'),
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    lines = code.splitlines()
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        if kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
        elif kind == 'SKIP':
            pass
        else:
            column = mo.start() - line_start
            yield Token(kind, value, line_num, column)

statements = '''
({Simple Data})
({
Parent Data Prefix
({Nested Data (Yes)})
Parent Data Suffix
})
'''

queue = collections.deque()
for token in tokenize(statements):
    if token.typ == 'DATA' or token.typ == 'MISMATCH':
        queue.append(token.value)
    elif token.typ == 'BOUND_L' or token.typ == 'BOUND_R':
        print(''.join(queue))
        queue.clear()
Output of this code (apart from a blank line printed whenever a delimiter is reached while the queue is empty) should be:
Simple Data
Parent Data Prefix
Nested Data (Yes)
Parent Data Suffix
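If the nesting itself should be preserved rather than flattened into lines, a stack is one way to do it. The sketch below is an alternative to the tokenizer, not a rewrite of it: it splits on the two delimiters with re.split and builds nested lists. The function name parse_nested and the list-of-lists result shape are assumptions made for illustration.

```python
import re


def parse_nested(text):
    """Turn '({a ({b}) c})' style text into nested lists of strings."""
    root = []
    stack = [root]  # stack[-1] is the list currently being filled
    # The capturing group keeps the delimiters in the split result.
    for piece in re.split(r'(\(\{|\}\))', text):
        if piece == '({':
            child = []
            stack[-1].append(child)
            stack.append(child)
        elif piece == '})':
            if len(stack) == 1:
                raise ValueError("unbalanced '})'")
            stack.pop()
        elif piece.strip():
            stack[-1].append(piece.strip())
    if len(stack) != 1:
        raise ValueError("unbalanced '({'")
    return root


tree = parse_nested('({Simple Data}) ({Parent Prefix ({Nested (Yes)}) Parent Suffix})')
print(tree)
# [['Simple Data'], ['Parent Prefix', ['Nested (Yes)'], 'Parent Suffix']]
```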
How can I read a csv file without using any external import (e.g. csv or pandas) and turn it into a list of lists? Here's the code I worked out so far:
m = []
for line in myfile:
    m.append(line.split(','))
This for loop works fine in general, but if one of the fields in the csv contains a ',' the line is wrongly split there.
So, for example, if one of the lines I have in the csv is:
12,"This is a single entry, even if there's a coma",0.23
The corresponding element of the list is the following:
['12', '"This is a single entry', 'even if there is a coma"','0.23\n']
While I would like to obtain:
['12', '"This is a single entry, even if there is a coma"','0.23']
I would avoid trying to use a regular expression, but you would need to process the text a character at a time to determine where the quote characters are. Also normally the quote characters are not included as part of a field.
A quick example approach would be the following:
def split_row(row, quote_char='"', delim=','):
    in_quote = False
    fields = []
    field = []
    for c in row:
        if c == quote_char:
            in_quote = not in_quote
        elif c == delim:
            if in_quote:
                field.append(c)
            else:
                fields.append(''.join(field))
                field = []
        else:
            field.append(c)
    if field:
        fields.append(''.join(field))
    return fields

fields = split_row('''12,"This is a single entry, even if there's a coma",0.23''')
print(len(fields), fields)
Which would display:
3 ['12', "This is a single entry, even if there's a coma", '0.23']
The csv library does a far better job of this, though. This script does not handle any special cases beyond your test string (for example escaped quotes inside a field, or a trailing empty field).
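For comparison, this is roughly what the same row looks like when parsed with the standard library's csv module. It does not satisfy the "no import" constraint from the question, so treat it as a reference for the expected result rather than as the requested solution. The io.StringIO wrapper only stands in for an open file.

```python
import csv
import io

# Stand-in for an open file containing the sample row from the question.
fake_file = io.StringIO('12,"This is a single entry, even if there\'s a coma",0.23\n')

rows = list(csv.reader(fake_file))
print(rows)
# [['12', "This is a single entry, even if there's a coma", '0.23']]
```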
Here is my go at it:
line = '12, "This is a single entry, more bits in here ,even if there is a coma",0.23 , 12, "This is a single entry, even if there is a coma", 0.23\n'
line_split = line.replace('\n', '').split(',')
quote_loc = [idx for idx, l in enumerate(line_split) if '"' in l]
quote_loc.reverse()
assert len(quote_loc) % 2 == 0, "value was odd, should be even"
for m, n in zip(quote_loc[::2], quote_loc[1::2]):
    line_split[n] = ','.join(line_split[n:m+1])
    del line_split[n+1:m+1]
print(line_split)
I'm a complete novice at Python so please excuse me for asking something stupid.
From a textfile a dictionary is made to be used as a pass/block filter.
The textfile contains addresses and either a block or allow like "002029568,allow" or "0011*,allow" (without the quotes).
The search-input is a string with a complete code like "001180000".
How can I evaluate if the search-item is in the dictionary and make it match the "0011*,allow" line?
Thank you very much for your effort!
The filter-dictionary is made with:
def loadFilterDict(filename):
    global filterDict
    try:
        with open(filename, "r") as text_file:
            lines = text_file.readlines()
            for s in lines:
                fields = s.split(',')
                if len(fields) == 2:
                    filterDict[fields[0]] = fields[1].strip()
            text_file.close()
    except:
        pass
Check if the code (ccode) is in the dictionary:
if ccode in filterDict:
    if filterDict[ccode] in ['block']:
        continue
else:
    if filterstat in ['block']:
        continue
The filters-file is like:
002029568,allow
000923993,allow
0011*, allow
If you can use re, you don't have to worry about the wildcard but let re.match do the hard work for you:
# Rules input (this could also be read from file)
lines = """002029568,allow
0011*,allow
001180001,block
"""
# Parse rules from string
rules = []
for line in lines.split("\n"):
    line = line.strip()
    if not line:
        continue
    identifier, ruling = line.split(",")
    rules += [(identifier, ruling)]

# Get rulings for specific number
def rule(number):
    from re import match
    rulings = []
    for identifier, ruling in rules:
        # Replace wildcard with regex .*
        identifier = identifier.replace("*", ".*")
        if match(identifier, number):
            rulings += [ruling]
    return rulings

print(rule("001180000"))
print(rule("001180001"))
Which prints:
['allow']
['allow', 'block']
The function will return a list of rulings. Their order is the same order as they appear in your config lines. So you could easily just pick the last or first ruling whichever is the one you're interested in.
Or break the loop prematurely if you can assume that no two rulings will interfere.
Examples:
001180000 is matched by 0011*,allow only, so the only ruling which applies is allow.
001180001 is matched by 0011*,allow at first, so you'll get allow as before. However, it is also matched by 001180001,block, so a block will get added to the rulings, too.
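If a single verdict is wanted rather than a list, one option is to let the last matching rule win. The sketch below swaps re.match for the standard library's fnmatch.fnmatchcase, which understands the * wildcard directly and matches the whole string. The names RULES and decide, the "last rule wins" policy and the default of "block" are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Same rules as above, already parsed into (pattern, ruling) pairs.
RULES = [("002029568", "allow"), ("0011*", "allow"), ("001180001", "block")]


def decide(code, rules=RULES, default="block"):
    """Return the ruling of the last rule whose pattern matches code."""
    verdict = default
    for pattern, ruling in rules:
        # fnmatchcase matches the whole string; '*' stands for any run of characters.
        if fnmatchcase(code, pattern):
            verdict = ruling
    return verdict


print(decide("001180000"))  # allow
print(decide("001180001"))  # block
print(decide("999999999"))  # block (no rule matched, default applies)
```

Note that fnmatch also treats ? and [ as special characters, which is harmless for purely numeric codes.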
If the wildcard entries in the file have a fixed length (for example, you only need to support lines like 0011*,allow and not 00110*,allow or 0*,allow or any other arbitrary number of digits followed by *) you can use a nested dictionary, where the outer keys are the known parts of the wildcarded entries.
d = {'0011': {'001180000': 'value', '001180001': 'value'}}
Then when you parse the file and get to the line 0011*,allow you do not need to do any matching. All you have to do is check if '0011' is present. Crude example:
d = {'0011': {'001180000': 'value', '001180001': 'value'}}
line = '0011*,allow'
prefix = line.split(',')[0][:-1]
if prefix in d:
    # there is a "match", then you can deal with all the entries that match,
    # in this case the items in the inner dictionary
    # {'001180000': 'value', '001180001': 'value'}
    print('match')
else:
    print('no match')
If you do need to support arbitrary lengths of wildcarded entries, you will have to resort to a loop iterating over the dictionary (which defeats the purpose of using a dictionary to begin with):
d = {'001180000': 'value', '001180001': 'value'}
line = '0011*,allow'
prefix = line.split(',')[0][:-1]
for k, v in d.items():
    if k.startswith(prefix):
        # found matching key-value pair
        print(k, v)
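For the direction the question actually asks about (a complete code is given and the dictionary holds the wildcard entries), another option is to keep the filter file's keys as they are and probe the dictionary with progressively shorter prefixes of the code. This is only a sketch: the function name lookup, the "exact entry first, then longest prefix" policy and the None default are assumptions, not something stated in the question.

```python
def lookup(code, filter_dict, default=None):
    """Find the ruling for code: exact entry first, then the longest 'prefix*' entry."""
    if code in filter_dict:
        return filter_dict[code]
    # Try "001180000*", "00118000*", ..., "0*", "*" until one is a key.
    for end in range(len(code), -1, -1):
        key = code[:end] + "*"
        if key in filter_dict:
            return filter_dict[key]
    return default


filter_dict = {"002029568": "allow", "0011*": "allow", "001180001": "block"}
print(lookup("001180000", filter_dict))  # allow
print(lookup("001180001", filter_dict))  # block
print(lookup("555000000", filter_dict))  # None
```

Each lookup costs at most len(code) + 2 dictionary probes, independent of how many rules the file contains.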
I am new to regex. I need to extract the desired text from strings that follow a certain pattern, but I also need to let strings that don't match the regex pass through to my other functions.
Here is what I'm trying to achieve:
I am running through device names from a csv file. If an entry contains only DeviceName, without the string mentioned below, it should simply be returned as is so that the other functions take care of it.
The string where I want to use regex will be like
'Card ADFGTR43567 on "DeviceName"' where I want to extract only the DeviceName from it.
ADFGTR43567 is a serial number of 11 characters, a mix of digits and letters in no fixed positions.
Here DeviceName could be anything; for example, it could be ARIEFRETO002 or ARIERDTOT5968.na.abc.com or even just a plain MAC address like 1234.abcd.5678
So if the string has a pattern like 'Card serialnumber on DeviceName',
I want to extract DeviceName and run it through the other functions in my code. If the device name in my csv does not follow that pattern, I still want to pass it on to the other functions unchanged.
I have written code with my functions, but I am not able to use regex here. This is what I have tried so far, pasting only the necessary parts.
def validnames():
    idx = col[0].find('-')
    if idx > -1:
        user = col[0][idx + 1:idx + 4]
        if user.upper() in d:
            return col[0].split('.')[0]
        else:
            return 'Not Found'
    else:
        return 'Not Found'

def pingable():
    response = subprocess.Popen(['ping.exe', validnames()], stdout=subprocess.PIPE).communicate()[0]
    response = response.decode()
    if 'bytes=32' in response:
        status = 'Up'
        return status
    else:
        status = 'Down'
        return status

with open("Book2.csv", 'r') as lookuplist:
    for col in csv.reader(lookuplist):
        if validnames() == 'Not Found' : continue
        if pingable() == 'Down' : continue
        if validnames().lower() not in data:
            with open('Test.csv', 'a', newline='') as csvoutput:
                output = csv.writer(csvoutput)
                output.writerows([[validnames()]+[pingable()]])
            print("Device: %s" % validnames(), pingable())
validnames() is a function that checks whether the device is eligible for the ping operation (based on the condition). I was thinking of putting the regex in that function, and I got completely lost there! Maybe it belongs in another function, but I don't quite see how to do it with regex.
UPDATE: This is how I have integrated two functions, based on the accepted answer.
def regexfilter():
    try:
        rx = r'\bon\s+(\S+)'
        m = re.search(rx, col[0])
        if m:
            return m.group(1)
        else:
            return col[0]
    except:
        return col[0]

def validnames():
    idx = regexfilter().find('-')
    if idx > -1:
        user = regexfilter()[idx + 1:idx + 4]
        if user.upper() in d:
            return regexfilter().split('.')[0]
        else:
            return 'Not Found'
    else:
        return 'Not Found'
Since you want to extract the run of non-whitespace characters that follows the word on, you may use the following regex (note that if the name is wrapped in double quotation marks, the quotes are captured too):
\bon\s+(\S+)
See the regex demo.
Details
\b - a word boundary
on - on word
\s+ - 1+ whitespaces
(\S+) - Capturing group 1: one or more non-whitespace chars.
See Python demo:
import re
rx = r'\bon\s+(\S+)'
s = "Card ADFGTR43567 on DeviceName"
m = re.search(rx, s)
if m:
    print(m.group(1)) # => DeviceName
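Since the question also needs plain device names to pass through untouched, and shows the name wrapped in double quotes in one example, a stricter variant might anchor on the whole 'Card <serial> on <name>' shape and fall back to the original cell otherwise. This is a sketch under assumptions: the name device_name, the anchored pattern and the optional-quote handling are mine, not part of the answer above.

```python
import re

# Whole-cell pattern: 'Card', a serial number, 'on', then the name with optional quotes.
CARD_RX = re.compile(r'^Card\s+\w+\s+on\s+"?([^"\s]+)"?\s*$')


def device_name(cell):
    """Return the device name from a 'Card ... on ...' cell, or the cell itself."""
    cell = cell.strip()
    m = CARD_RX.match(cell)
    return m.group(1) if m else cell


print(device_name('Card ADFGTR43567 on "DeviceName"'))              # DeviceName
print(device_name('Card ADFGTR43567 on ARIERDTOT5968.na.abc.com'))  # ARIERDTOT5968.na.abc.com
print(device_name('1234.abcd.5678'))                                # 1234.abcd.5678
```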
Seeking advice on how to mine items from multiple text files to build a dictionary.
This text file: https://pastebin.com/Npcp3HCM
Was manually transformed into this required data structure: https://drive.google.com/file/d/0B2AJ7rliSQubV0J2Z0d0eXF3bW8/view
There are thousands of such text files and they may have different section headings as shown in these examples:
https://pastebin.com/wWSPGaLX
https://pastebin.com/9Up4RWHu
I started off by reading the files
from glob import glob
txtPth = '../tr-txt/*.txt'
txtFiles = glob(txtPth)
with open(txtFiles[0],'r') as tf:
    allLines = [line.rstrip() for line in tf]
sectionHeading = ['Corporate Participants',
                  'Conference Call Participants',
                  'Presentation',
                  'Questions and Answers']
for lineNum, line in enumerate(allLines):
    if line in sectionHeading:
        print(lineNum,allLines[lineNum])
My idea was to look for the line numbers where section headings existed and try to extract the content in between those line numbers, then strip out separators like dashes. That didn't work and I got stuck in trying to create a dictionary of this kind so that I can later run various natural language processing algorithms on quarried items.
{file-name1:{
{date-time:[string]},
{corporate-name:[string]},
{corporate-participants:[name1,name2,name3]},
{call-participants:[name4,name5]},
{section-headings:{
{heading1:[
{name1:[speechOrderNum, text-content]},
{name2:[speechOrderNum, text-content]},
{name3:[speechOrderNum, text-content]}],
{heading2:[
{name1:[speechOrderNum, text-content]},
{name2:[speechOrderNum, text-content]},
{name3:[speechOrderNum, text-content]},
{name2:[speechOrderNum, text-content]},
{name1:[speechOrderNum, text-content]},
{name4:[speechOrderNum, text-content]}],
{heading3:[text-content]},
{heading4:[text-content]}
}
}
}
The challenge is that different files may have different headings and a different number of headings. But there will always be a section called "Presentation", and very likely a "Question and Answer" section. Section headings are always separated by a line of equals signs, and the content of different speakers is always separated by a line of dashes. The "speech order" in the Q&A section is indicated by a number in square brackets. The participants are always listed at the beginning of the document with an asterisk before their name, and their title is always on the next line.
Any suggestion on how to parse the text files is appreciated. The ideal help would be to provide guidance on how to produce such a dictionary (or other suitable data structure) for each file that can then be written to a database.
Thanks
--EDIT--
One of the files looks like this: https://pastebin.com/MSvmHb2e
In which the "Question & Answer" section is mislabeled as "Presentation" and there is no other "Question & Answer" section.
And final sample text: https://pastebin.com/jr9WfpV8
The comments in the code should explain everything. Let me know if anything is underspecified and needs more comments.
In short, I use regex to find the '=' delimiter lines and subdivide the entire text into sections, then handle each type of section separately for clarity's sake (so you can tell how I am handling each case).
Side note: I am using the words 'attendee' and 'author' interchangeably.
EDIT: Updated the code to sort based on the '[x]' pattern found right next to the attendee/author in the presentation/QA section. Also changed the pretty print part since pprint does not handle OrderedDict very well.
To strip leading and trailing whitespace (including \n) from a string, simply call str.strip(). If you specifically need to strip only \n, use str.strip('\n').
I have modified the code to strip that whitespace from the talks.
import json
import re
from collections import OrderedDict
from pprint import pprint

# Subdivides a collection of lines based on the delimiting regular expression.
# >>> example_string =' =============================
# asdfasdfasdf
# sdfasdfdfsdfsdf
# =============================
# asdfsdfasdfasd
# =============================
# >>> subdivide(example_string, "^=+")
# >>> ['asdfasdfasdf\nsdfasdfdfsdfsdf\n', 'asdfsdfasdfasd\n']
def subdivide(lines, regex):
    equ_pattern = re.compile(regex, re.MULTILINE)
    sections = equ_pattern.split(lines)
    sections = [section.strip('\n') for section in sections]
    return sections

# for processing sections with dashes in them, returns the heading of the section along with
# a dictionary where each key is the subsection's header, and each value is the text in the subsection.
def process_dashed_sections(section):
    subsections = subdivide(section, "^-+")
    heading = subsections[0] # header of the section.
    d = {key: value for key, value in zip(subsections[1::2], subsections[2::2])}
    index_pattern = re.compile(r"\[(.+)\]", re.MULTILINE)

    # sort the dictionary by first capturing the pattern '[x]' and extracting 'x' number.
    # Then this is passed as a key function to 'sorted' to sort based on 'x'.
    def cmp(d):
        mat = index_pattern.findall(d[0])
        if mat:
            print(mat[0])
            return int(mat[0])
        # There are issues when dealing with subsections containing '-'s but not containing '[x]' pattern.
        # This is just to deal with that small issue.
        else:
            return 0

    o_d = OrderedDict(sorted(d.items(), key=cmp))
    return heading, o_d

# this is to rename the keys of 'd' dictionary to the proper names present in the attendees.
# it searches for the best match for the key in the 'attendees' list, and replaces the corresponding key.
# >>> d = {'mr. man ceo of company [1]' : ' This is talk a' ,
# ... 'ms. woman ceo of company [2]' : ' This is talk b'}
# >>> l = ['mr. man', 'ms. woman']
# >>> new_d = assign_attendee(d, l)
# new_d = {'mr. man': 'This is talk a', 'ms. woman': 'This is talk b'}
def assign_attendee(d, attendees):
    new_d = OrderedDict()
    for key, value in d.items():
        a = [a for a in attendees if a in key]
        if len(a) == 1:
            # strip leading and trailing whitespace from the text, including '\n'.
            new_d[a[0]] = value.strip()
        elif len(a) == 0:
            # strip leading and trailing whitespace from the text, including '\n'.
            new_d[key] = value.strip()
    return new_d

if __name__ == '__main__':
    with open('input.txt', 'r') as input:
        lines = input.read()

    # regex pattern for matching headers of each section
    header_pattern = re.compile("^.*[^\n]", re.MULTILINE)
    # regex pattern for matching the sections that contain
    # the list of attendees (those that start with asterisks)
    ppl_pattern = re.compile(r"^(\s+\*)(.+)(\s.*)", re.MULTILINE)
    # regex pattern for matching sections with subsections in them.
    dash_pattern = re.compile("^-+", re.MULTILINE)

    ppl_d = dict()
    talks_d = dict()

    # Step1. Divide the entire document into sections using the '=' divider
    sections = subdivide(lines, "^=+")
    header = []
    print(sections)

    # Step2. Handle each section like a switch case
    for section in sections:
        # Handle headers
        if len(section.split('\n')) == 1: # likely to match only a header (assuming )
            header = header_pattern.match(section).string
        # Handle attendees/authors
        elif ppl_pattern.match(section):
            ppls = ppl_pattern.findall(section)
            d = {key.strip(): value.strip() for (_, key, value) in ppls}
            ppl_d.update(d)
            # assuming that if the previous section was detected as a header, then this section will relate
            # to that header
            if header:
                talks_d.update({header: ppl_d})
        # Handle subsections
        elif dash_pattern.findall(section):
            heading, d = process_dashed_sections(section)
            talks_d.update({heading: d})
        # Else its just some random text.
        else:
            # assuming that if the previous section was detected as a header, then this section will relate
            # to that header
            if header:
                talks_d.update({header: section})

    #pprint(talks_d)
    # To assign the talks material to the appropriate attendee/author. Still works if no match found.
    for key, value in talks_d.items():
        talks_d[key] = assign_attendee(value, ppl_d.keys())

    # ordered dict does not pretty print using 'pprint'. So a small hack to make use of json output to pretty print.
    print(json.dumps(talks_d, indent=4))
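To see the dash-splitting step in isolation, here is a small self-contained sketch that takes one section, splits it on the dash lines and pairs each speaker line with the text that follows. The sample text is invented (it is not taken from the linked transcripts), and the name split_speeches, the tuple layout and the fallback order of 0 are assumptions made for illustration.

```python
import re

# Invented sample in the layout described in the question: a heading,
# then speaker lines and speech text separated by lines of dashes.
sample = (
    "Presentation\n"
    "-----\n"
    "Jane Doe, Example Corp [1]\n"
    "-----\n"
    "Welcome everyone.\n"
    "-----\n"
    "John Roe, Example Corp [2]\n"
    "-----\n"
    "Thanks, Jane.\n"
)


def split_speeches(section):
    """Return (heading, [(order, speaker, text), ...]) for one section."""
    pieces = [p.strip() for p in re.split(r"^-+$", section, flags=re.MULTILINE)]
    heading, rest = pieces[0], pieces[1:]
    speeches = []
    # After the heading, pieces alternate: speaker line, speech text, ...
    for speaker_line, text in zip(rest[0::2], rest[1::2]):
        m = re.search(r"\[(\d+)\]\s*$", speaker_line)
        order = int(m.group(1)) if m else 0
        name = speaker_line[:m.start()].strip() if m else speaker_line
        speeches.append((order, name, text))
    return heading, sorted(speeches, key=lambda s: s[0])


heading, speeches = split_speeches(sample)
print(heading)   # Presentation
print(speeches)
```

A speech whose own text contains a line made only of dashes would confuse this split, so real transcripts may need a stricter delimiter test.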
Could you please confirm whether you only require the "Presentation" and "Question and Answer" sections?
Also, regarding the output, is it OK to dump CSV in a format similar to what you have "manually transformed"?
Updated solution to work for every sample file you provided.
Output is from Cell "D:H" as per "Parsed-transcript" file shared.
#state = ["other", "head", "present", "qa", "speaker", "data"]
# codes : 0, 1, 2, 3, 4, 5
def writecell(out, data):
    out.write(data)
    out.write(",")

def readfile(fname, outname):
    initstate = 0
    f = open(fname, "r")
    out = open(outname, "w")
    head = ""
    head_written = 0
    quotes = 0
    had_speaker = 0
    for line in f:
        line = line.strip()
        if not line: continue
        if initstate in [0,5] and not any([s for s in line if "=" != s]):
            if initstate == 5:
                out.write('"')
                quotes = 0
                out.write("\n")
            initstate = 1
        elif initstate in [0,5] and not any([s for s in line if "-" != s]):
            if initstate == 5:
                out.write('"')
                quotes = 0
                out.write("\n")
            initstate = 4
        elif initstate == 1 and line == "Presentation":
            initstate = 2
            head = "Presentation"
            head_written = 0
        elif initstate == 1 and line == "Questions and Answers":
            initstate = 3
            head = "Questions and Answers"
            head_written = 0
        elif initstate == 1 and not any([s for s in line if "=" != s]):
            initstate = 0
        elif initstate in [2, 3] and not any([s for s in line if ("=" != s and "-" != s)]):
            initstate = 4
        elif initstate == 4 and '[' in line and ']' in line:
            comma = line.find(',')
            speech_st = line.find('[')
            speech_end = line.find(']')
            if speech_st == -1:
                initstate = 0
                continue
            if comma == -1:
                firm = ""
                speaker = line[:speech_st].strip()
            else:
                speaker = line[:comma].strip()
                firm = line[comma+1:speech_st].strip()
            head_written = 1
            if head_written:
                writecell(out, head)
                head_written = 0
            order = line[speech_st+1:speech_end]
            writecell(out, speaker)
            writecell(out, firm)
            writecell(out, order)
            had_speaker = 1
        elif initstate == 4 and not any([s for s in line if ("=" != s and "-" != s)]):
            if had_speaker:
                initstate = 5
                out.write('"')
                quotes = 1
            had_speaker = 0
        elif initstate == 5:
            line = line.replace('"', '""')
            out.write(line)
        elif initstate == 0:
            continue
        else:
            continue
    f.close()
    if quotes:
        out.write('"')
    out.close()

readfile("Sample1.txt", "out1.csv")
readfile("Sample2.txt", "out2.csv")
readfile("Sample3.txt", "out3.csv")
Details
This solution uses a state machine, which works as follows:
1. detects whether heading is present, if yes, write it
2. detects speakers after heading is written
3. writes notes for that speaker
4. switches to next speaker and so on...
You can later process the csv files as you want.
You can also populate the data in any format you want once basic processing is done.
Edit:
Please replace the function "writecell" with:
def writecell(out, data):
    data = data.replace('"', '""')
    out.write('"')
    out.write(data)
    out.write('"')
    out.write(",")
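The manual quote doubling above is what the standard library's csv.writer does on its own, so one alternative is to collect the cells of a row and hand them to the writer in one call. This is a sketch of that alternative, not a drop-in patch for readfile: the helper name write_row and the sample values are made up for illustration, and io.StringIO stands in for the output file.

```python
import csv
import io


def write_row(out, head, speaker, firm, order, text):
    """Write one transcript row; csv.writer handles quoting and escaping."""
    csv.writer(out).writerow([head, speaker, firm, order, text])


buffer = io.StringIO()  # stand-in for open(outname, "w", newline="")
write_row(buffer, "Presentation", 'Jane "JD" Doe', "Example, Inc.", "1", "Welcome everyone.")
print(buffer.getvalue())
# Presentation,"Jane ""JD"" Doe","Example, Inc.",1,Welcome everyone.
```

When writing to a real file, open it with newline="" so that the writer's own line endings are not translated a second time.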
I have a large list of sub-strings that I want to search through and find if two particular sub-strings can be found in a row. The logic is intended to look for the first sequence, then if it is found, look at the second sub-string and return all the matches (based off the first 15 characters of the 16 character sequence). If the first sequence can not be found, it just looks for the second sequence only, and finally, if that can not be found, defaults to zero. The matches are then appended to a list, which is processed further. The current code used is as follows:
dataA = ['0100101010001000',
         '1001010100010001',
         '0010101000100010',
         '0101010001000110',
         '1010100010001110',
         '0101000100011100',
         '1010001000111010',
         '0100010001110100',
         '1000100011101000',
         '0001000111010000']
A_vein_1 = [0,1,0,0,1,0,1,0,1,0,0,0,1,0,0,0]
joined_A_Search_1 = ''.join(map(str,A_vein_1))
print('search 1', joined_A_Search_1)
A_vein_2 = [1,0,0,1,0,1,0,1,0,0,0,1,0,0,0]
joined_A_Search_2 = ''.join(map(str,A_vein_2))
print('search 2', joined_A_Search_2)
match_A = [] #empty list to append closest match to
#Match search algorithm
for i,text in enumerate(dataA):
    if joined_A_Search_1 == text:
        if joined_A_Search_2 == dataA[i+1][:-1]:
            print('logic stream 1')
            match_A.append(dataA[i+1][-1])
    if joined_A_Search_1 != text:
        if joined_A_Search_2 == text[:-1]:
            print('logic stream 2')
            #print('match', text[:-1])
            match_A.append(text[-1])
print(' A matches', match_A)
try:
    filter_A = max(set(match_A), key=match_A.count)
except ValueError:
    # max() raises ValueError when match_A is empty
    filter_A = 0
    print('no match A')
filter_A = int(filter_A)
print('Output', filter_A)
The issue is that I get a return of both logic stream 1 and logic stream 2, when I actually want it to be a strict one or the other, in this case only logic stream 1. An example of the output looks like this:
search 1 0100101010001000
search 2 100101010001000
logic stream 1
logic stream 2
logic stream 1
logic stream 2
logic stream 2
(Note: The list has been shortened, and the data inputs have been substituted in directly, as well as the print outs for the purposes of this post and error tracking)
Input :
dataA = ['0100101010001000',
'1001010100010001',
'0010101000100010',
'0101010001000110',
'1010100010001110',
'0101000100011100',
'1010001000111010',
'0100010001110100',
'1000100011101000',
'0001000111010000']
A_vein_1 = [0,1,0,0,1,0,1,0,1,0,0,0,1,0,0,0]
A_vein_2 = [1,0,0,1,0,1,0,1,0,0,0,1,0,0,0]
code :
av1_str = "".join(map(str,A_vein_1))
av2_str = "".join(map(str,A_vein_2))
patterns = [av1_str, av2_str]
print([(y, dataA.index(x)) for x in dataA for y in patterns if y in x])
Output :
[('0100101010001000', 0), ('100101010001000', 0), ('100101010001000', 1)]
Your code confuses me. But I think I understand your issue:
#!/usr/bin/env python
dataA = ['0100101010001000',
         '1001010100010001',
         '0010101000100010',
         '0101010001000110',
         '1010100010001110',
         '0101000100011100',
         '1010001000111010',
         '0100010001110100',
         '1000100011101000',
         '0001000111010000']
A_vein_1 = [0,1,0,0,1,0,1,0,1,0,0,0,1,0,0,0]
A_vein_2 = [1,0,0,1,0,1,0,1,0,0,0,1,0,0,0]
av1_str = "".join(map(str,A_vein_1))
av2_str = "".join(map(str,A_vein_2))
for i, d in enumerate(dataA):
    if av1_str in d:
        print(av1_str, 'found in line', i)
    elif av2_str in d:
        print(av2_str, 'found in line', i)
With a plain if in place of the elif, this gives me:
jcg#jcg:~/code/python/stack_overflow$ python find_str.py
0100101010001000 found in line 0
100101010001000 found in line 0
100101010001000 found in line 1
After changing the second if to elif (the version shown above):
jcg#jcg:~/code/python/stack_overflow$ python find_str.py
0100101010001000 found in line 0
100101010001000 found in line 1
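Going back to the strict "one stream or the other" behaviour described in the question, one way to get it is to collect the stream 1 matches over the whole list first, and only fall back to stream 2 when that list is empty. The sketch below is my reading of the requirement, not the asker's code: the function name find_filter, the (stream, value) return shape and the use of collections.Counter for the most frequent last digit are assumptions.

```python
from collections import Counter


def find_filter(data, search_1, search_2):
    """Return (stream, value): stream is 1, 2, or 0 when nothing matched."""
    # Stream 1: search_1 equals a row AND search_2 equals the first 15
    # characters of the row right after it.
    matches = [nxt[-1] for cur, nxt in zip(data, data[1:])
               if cur == search_1 and nxt[:-1] == search_2]
    stream = 1
    if not matches:
        # Stream 2: only tried when stream 1 found nothing at all.
        matches = [row[-1] for row in data if row[:-1] == search_2]
        stream = 2
    if not matches:
        return 0, 0
    # Most frequent last digit among the matches.
    return stream, int(Counter(matches).most_common(1)[0][0])


dataA = ['0100101010001000', '1001010100010001', '0010101000100010',
         '0101010001000110', '1010100010001110', '0101000100011100',
         '1010001000111010', '0100010001110100', '1000100011101000',
         '0001000111010000']
search_1 = '0100101010001000'
search_2 = '100101010001000'

print(find_filter(dataA, search_1, search_2))      # (1, 1)
print(find_filter(dataA[1:], search_1, search_2))  # (2, 1)
print(find_filter(dataA[2:], search_1, search_2))  # (0, 0)
```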