python regex to construct a structured data structure - python

I have some data which looks like:
key abc key
value 1
value 2
value 3
key bcd key
value 2
value 3
value 4
...
...
Based on it, what I want is to construct a data structure like:
{'abc':[1,2,3]}
{'bcd':[2,3,4]}
...
Is a regular expression a good choice for this? If so, how do I write it so that the processing behaves like a for loop (so that inside the loop I can build a data structure from the data I extracted)?
Thanks.

Using regular expressions can be more robust than string slicing for identifying values in a text file. If you are confident in the format of your data, string slicing will be fine.
import re
keyPat = re.compile(r'key (\w+) key')
valuePat = re.compile(r'value (\d+)')
result = {}
with open('data.txt') as f:
    for line in f:
        keyMatch = keyPat.search(line)
        valueMatch = valuePat.search(line)
        if keyMatch:
            # A key line starts a new, empty list of values.
            tempL = []
            result[keyMatch.group(1)] = tempL
        elif valueMatch:
            # A value line is appended to the most recent key's list.
            tempL.append(int(valueMatch.group(1)))
        else:
            print('Did not match:', line)
print(result)

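import re
x = """key abc key
value 1
value 2
value 3
key bcd key
value 2
value 3
value 4"""
# Each match is a (key, block of value lines) pair.
j = re.findall(r"key (.*?) key\n([\s\S]*?)(?=\nkey|$)", x)
d = {}
for i in j:
    # list() is needed on Python 3, where map() returns an iterator.
    k = list(map(int, re.findall(r"value (.*?)(?=\nvalue|$)", i[1])))
    d[i[0]] = k
print(d)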

The following code should work if the data is always in that format, with exactly three values per key.
import re
with open(FILENAME, "r") as f:  # FILENAME is the path to your data file
    text = f.read()
# (\d+) keeps the whole number; (\d)+ would capture only its last digit.
regex = r'key ([^\s]*) key\nvalue (\d+)\nvalue (\d+)\nvalue (\d+)'
matches = re.findall(regex, text)
dic = {}
for match in matches:
    dic[match[0]] = list(map(int, match[1:]))
print(dic)
EDIT: The other answer by meelo is more robust, as it handles cases where there are more or fewer than three values.
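To make the "behaves like a for loop" part of the question concrete, here is a minimal sketch that uses re.finditer on text already held in memory and accepts any number of value lines per key. The names SAMPLE, BLOCK, VALUE and parse_blocks are made up for this illustration, and the sketch assumes keys are single words and values are non-negative integers, as in the sample data.

```python
import re

# Hypothetical sample mirroring the layout shown in the question.
SAMPLE = """key abc key
value 1
value 2
value 3
key bcd key
value 2
value 3
value 4
"""

# One "key ... key" line followed by zero or more "value N" lines.
BLOCK = re.compile(r"^key (\w+) key\n((?:value \d+\n?)*)", re.MULTILINE)
VALUE = re.compile(r"value (\d+)")


def parse_blocks(text):
    """Map each key to the list of integer values that follow it."""
    result = {}
    # finditer yields one match object per block, so this is an ordinary loop.
    for m in BLOCK.finditer(text):
        result[m.group(1)] = [int(v) for v in VALUE.findall(m.group(2))]
    return result


print(parse_blocks(SAMPLE))
```

Inside the loop, m.group(1) is the key and m.group(2) is the raw block of value lines, so any per-block processing can go there.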

Related

How can I clean this data for easier visualizing?

I'm writing a program to read a set of data rows and count matching sets. I have the code below, but I would like to cut or filter out the trailing numbers that prevent entries from being recognized as a match.
import collections
a = "test.txt"  # This can be changed to a = input("What's the filename? ")
line_file = open(a, "r")
print(line_file.readable())  # Readable check.
# print(line_file.read())  # Prints each individual line.
# Code for quantity counter.
counts = collections.Counter()  # Creates a new counter.
with open(a) as infile:
    for line in infile:
        for number in line.split():
            counts.update((number,))
for key, count in counts.items():
    print(f"{key}: x{count}")
line_file.close()
This is what it outputs, however I'd like for it to not read the numbers at the end and pair the matching sets accordingly.
A2-W-FF-DIN-22: x1
A2-FF-DIN: x1
A2-W-FF-DIN-11: x1
B12-H-BB-DD: x2
B12-H-BB-DD-77: x1
C1-GH-KK-LOP: x1
What I'm aiming for is for it to ignore the "-77" here and instead count the total as x3:
B12-H-BB-DD: x2
B12-H-BB-DD-77: x1
Split each element on the dashes and check the last element is a number. If so, remove it, then continue on.
from collections import Counter
def trunc(s):
    parts = s.split('-')
    if parts[-1].isnumeric():
        return '-'.join(parts[:-1])
    return s
with open('data.txt') as f:
    data = [trunc(x.rstrip()) for x in f.readlines()]
counts = Counter(data)
for k, v in counts.items():
    print(k, v)
Output
A2-W-FF-DIN 2
A2-FF-DIN 1
B12-H-BB-DD 3
C1-GH-KK-LOP 1
You could use a regular expression to create a matching group for a digit suffix. If each number is its own string, e.g. "A2-W-FF-DIN-11", then a regular expression like (?P<base>.+?)(?:-(?P<suffix>\d+))?\Z could work.
Here, (?P<base>.+?) is a non-greedy match of any character except for a newline grouped under the name "base", (?:-(?P<suffix>\d+))? matches 0 or 1 occurrences of something like -11 occurring at the end of the "base" group and puts the digits in a group named "suffix", and \Z is the end of the string.
This is what it does in action:
>>> import re
>>> regex = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")
>>> regex.match("A2-W-FF-DIN-11").groupdict()
{'base': 'A2-W-FF-DIN', 'suffix': '11'}
>>> regex.match("A2-W-FF-DIN").groupdict()
{'base': 'A2-W-FF-DIN', 'suffix': None}
So you can see that, in this instance, the base is the same whether or not the string has a digit suffix.
All together, here's a self-contained example of how it might be applied to data like this:
import collections
import re
regex = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")
sample_data = [
    "A2-FF-DIN",
    "A2-W-FF-DIN-11",
    "A2-W-FF-DIN-22",
    "B12-H-BB-DD",
    "B12-H-BB-DD",
    "B12-H-BB-DD-77",
    "C1-GH-KK-LOP"
]
counts = collections.Counter()
# Iterates through the data and updates the counter.
for datum in sample_data:
    # Isolates the base of the number from any digit suffix.
    number = regex.match(datum)["base"]
    counts.update((number,))
# Prints each number and prints how many instances were found.
for key, count in counts.items():
    print(f"{key}: x{count}")
For which the output is
A2-FF-DIN: x1
A2-W-FF-DIN: x2
B12-H-BB-DD: x3
C1-GH-KK-LOP: x1
Or in the example code you provided, it might look like this:
import collections
import re
# Compiles a regular expression to match the base and suffix
# of a number in the file.
regex = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")
a = "test.txt"
line_file = open(a, "r")
print(line_file.readable())  # Readable check.
# Creates a new counter.
counts = collections.Counter()
with open(a) as infile:
    for line in infile:
        for number in line.split():
            # Isolates the base match of the number.
            counts.update((regex.match(number)["base"],))
for key, count in counts.items():
    print(f"{key}: x{count}")
line_file.close()

RegEx for capturing groups using dictionary key

I'm having trouble displaying the right named capture in my dictionary function. My program reads a .txt file and then transforms the text in that file into a dictionary. I already have the right regex formula to capture them.
Here is my File.txt:
file Science/Chemistry/Quantum 444 1
file Marvel/CaptainAmerica 342 0
file DC/JusticeLeague/Superman 300 0
file Math 333 0
file Biology 224 1
Here is the regex link that is able to capture the ones I want:
By looking at the link, the ones I want to display is highlighted in green and orange.
This part of my code works:
rx = re.compile(r'file (?P<path>.*?)( |\/.*?)? (?P<views>\d+).+')
i = rx.match(data)  # 'data' is from the .txt file
x = (i.group(1), i.group(3))
print(x)
But since I'm turning the .txt file into a dictionary, I couldn't figure out how to use .group(1) and .group(3) as keys for my display function. I don't know how to get those groups to show up when I use print("Title: %s | Number: %s" % (key[1], key[3])). I hope someone can help me implement that in my dictionary function.
Here is my dictionary function:
def create_dict(data):
    dictionary = {}
    for line in data:
        line_pattern = re.findall(r'file (?P<path>.*?)( |\/.*?)? (?P<views>\d+).+', line)
        dictionary[line] = line_pattern
        content = dictionary[line]
        print(content)
    return dictionary
I'm trying to make my output look like this from my text file:
Science 444
Marvel 342
DC 300
Math 333
Biology 224
You may create and populate a dictionary with your file data using
def create_dict(data):
    dictionary = {}
    for line in data:
        m = re.search(r'file\s+([^/\s]*)\D*(\d+)', line)
        if m:
            dictionary[m.group(1)] = m.group(2)
    return dictionary
Basically, it does the following:
Defines a dictionary dictionary
Reads data line by line
Searches for a file\s+([^/\s]*)\D*(\d+) match, and if there is a match, the two capturing group values are used to form a dictionary key-value pair.
The regex I suggest is
file\s+([^/\s]*)\D*(\d+)
See the Regulex graph explaining it:
Then, you may use it like
res = {}
with open(filepath, 'r') as f:
    res = create_dict(f)
print(res)
See the Python demo.
You already use named groups in your 'line_pattern'; simply put them into your dictionary. re.findall would not work here. Also, the escape character '\' before '/' is redundant. Thus your dictionary function would be:
def create_dict(data):
    dictionary = {}
    for line in data:
        line_pattern = re.search(r'file (?P<path>.*?)( |/.*?)? (?P<views>\d+).+', line)
        if line_pattern:  # skip lines that do not match
            path = line_pattern.group('path')
            dictionary[path] = line_pattern.group('views')
            print(path, dictionary[path])
    return dictionary
This RegEx might help you to divide your inputs into four groups where group 2 and group 4 are your target groups that can be simply extracted and spaced with a space:
(file\s)([A-Za-z]+(?=\/|\s))(.*)(\d{3})
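As a rough illustration of how the four-group pattern could be used, the sketch below applies it to the sample lines from the question and prints groups 2 and 4 separated by a space. The names PATTERN, SAMPLE_LINES and title_and_views are invented for this example, and it assumes the view count always has exactly three digits, as in the sample file; with longer numbers, (\d{3}) would pick up only the last three digits.

```python
import re

# The four-group pattern suggested in the answer.
PATTERN = re.compile(r"(file\s)([A-Za-z]+(?=\/|\s))(.*)(\d{3})")

# Sample lines copied from the question's File.txt.
SAMPLE_LINES = [
    "file Science/Chemistry/Quantum 444 1",
    "file Marvel/CaptainAmerica 342 0",
    "file DC/JusticeLeague/Superman 300 0",
    "file Math 333 0",
    "file Biology 224 1",
]


def title_and_views(line):
    """Return (group 2, group 4) for a matching line, or None otherwise."""
    m = PATTERN.match(line)
    if m is None:
        return None
    return m.group(2), m.group(4)


for line in SAMPLE_LINES:
    pair = title_and_views(line)
    if pair is not None:
        print(pair[0], pair[1])
```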

Find first line of text according to value in Python

How can I search a "file.txt" list in Python for the first "latitude, longitude" coordinate that matches a value, and get the 3 rows above it and the 3 rows below it?
Value
37.0459
file.txt
37.04278,-95.58895
37.04369,-95.58592
37.04369,-95.58582
37.04376,-95.58557
37.04376,-95.58546
37.04415,-95.58429
37.0443,-95.5839
37.04446,-95.58346
37.04461,-95.58305
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
37.04508,-95.57914
37.04494,-95.57842
37.04483,-95.5771
37.0448,-95.57674
37.04474,-95.57606
37.04467,-95.57534
37.04462,-95.57474
37.04458,-95.57396
37.04454,-95.57274
37.04452,-95.57233
37.04453,-95.5722
37.0445,-95.57164
37.04448,-95.57122
37.04444,-95.57054
37.04432,-95.56845
37.04432,-95.56834
37.04424,-95.5668
37.044,-95.56251
37.04396,-95.5618
Expected Result
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
Additional information
In linux I can get the closest line and do the treatment I need using grep, sed, cut and others, but I'd like in Python.
Any help will be greatly appreciated!
Thank you.
How can I do a search of a value of the first "latitude, longitude"
coordinate in a "file.txt" list in Python and get 3 rows above and 3
rows below?
You can try:
with open("text_filter.txt") as f:
    text = f.readlines()  # read text lines to list
needle = "37.0459"
match = [i for i, x in enumerate(text) if needle in x]  # indices of lines containing the search string
if match:
    i = match[0]  # first matching line
    start = max(i - 3, 0)  # clamp so a match near the top does not wrap around
    print("".join(text[start:i + 4]).strip())  # up to 3 lines before, the match, up to 3 lines after
Output:
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
You can use pandas to import the data in a dataframe and then easily manipulate it. As per your question the value to check is not the exact match and therefore I have converted it to string.
import pandas as pd
data = pd.read_csv("file.txt", header=None, names=["latitude","longitude"])  # imports text file as dataframe
value_to_check = 37.0459  # user defined
for i in range(len(data)):
    if str(value_to_check) == str(data.iloc[i,0])[:len(str(value_to_check))]:
        break
print(data.iloc[i-3:i+4,:])
output
latitude longitude
9 37.04502 -95.58204
10 37.04516 -95.58184
11 37.04572 -95.58139
12 37.04597 -95.58127
13 37.04565 -95.58073
14 37.04546 -95.58033
15 37.04516 -95.57948
A solution with iterators, that only keeps in memory the necessary lines and doesn't load the unnecessary part of the file:
from collections import deque
from itertools import islice
def find_in_file(file, target, before=3, after=3):
    queue = deque(maxlen=before)  # remembers only the last `before` lines
    with open(file) as f:
        for line in f:
            if target in map(float, line.split(',')):
                # islice pulls the next `after` lines from the same file iterator
                out = list(queue) + [line] + list(islice(f, after))
                return out
            queue.append(line)
        else:
            raise ValueError('target not found')
Some tests:
print(find_in_file('test.txt', 37.04597))
# ['37.04502,-95.58204\n', '37.04516,-95.58184\n', '37.04572,-95.58139\n', '37.04597,-95.58127\n',
# '37.04565,-95.58073\n', '37.04546,-95.58033\n', '37.04516,-95.57948\n']
print(find_in_file('test.txt', 37.044)) # Only one line after the match
# ['37.04432,-95.56845\n', '37.04432,-95.56834\n', '37.04424,-95.5668\n', '37.044,-95.56251\n',
# '37.04396,-95.5618\n']
It also works if there are fewer than the expected number of lines before or after the match. We match floats, not strings, as '37.04' would otherwise erroneously match '37.0444'.
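The difference between the two kinds of matching can be seen in a small sketch. The helper names substring_match and float_match are made up for this illustration; the second one repeats the float comparison used in find_in_file.

```python
def substring_match(needle, line):
    """True if the text `needle` occurs anywhere in the line."""
    return needle in line


def float_match(target, line):
    """True if one of the comma-separated fields equals `target` as a float."""
    return target in map(float, line.split(","))


line = "37.0444,-95.57054\n"
print(substring_match("37.04", line))  # substring test on the raw text
print(float_match(37.04, line))        # numeric test on the parsed fields
print(float_match(37.0444, line))      # numeric test with the exact value
```

The substring test reports a match for '37.04' because those characters are a prefix of '37.0444', while the float test only succeeds for the exact value. The flip side is that the float test cannot find partial values such as 37.0459.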
This solution will print the before and after elements even if there are fewer than 3 of them.
I am also using a string, as the question implies that you want partial matches too, i.e. 37.0459 will match 37.04597.
search_term = '37.04462'
with open('file.txt') as f:
    lines = f.readlines()
lines = [line.strip().split(',') for line in lines]  # remove '\n' and split into [lat, lon]
for lat, lon in lines:
    if search_term in lat:
        index = lines.index([lat, lon])
        break
left = 0
right = 0
for k in range(1, 4):  # the end of the range is excluded, so k runs from 1 to 3
    if index - k >= 0:
        left += 1
    if index + k <= (len(lines) - 1):
        right += 1
for i in range(index - left, index + right + 1):  # +1 because the end of the range is excluded
    print(lines[i][0], lines[i][1])
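The bounds bookkeeping in the loops above can also be expressed with a clamped slice. This is only a sketch of that idea, not part of any of the answers; context_window is a hypothetical helper name.

```python
def context_window(lines, index, before=3, after=3):
    """Return up to `before` items ahead of lines[index], the item itself,
    and up to `after` items following it, clamped to the list bounds."""
    start = max(index - before, 0)
    stop = min(index + after + 1, len(lines))
    return lines[start:stop]


rows = ["row%d" % n for n in range(10)]
print(context_window(rows, 5))  # full window around the middle
print(context_window(rows, 1))  # only one row available before the match
print(context_window(rows, 9))  # no rows available after the match
```

Clamping the start matters because a negative start index in a Python slice counts from the end of the list instead of stopping at the beginning.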

How to take a word from a dictionary by its definition

I am writing code that needs to take a string of words and convert it into numbers, so that hi bye hi hello turns into 0 1 0 2. I have used dictionaries to do this, and that is why I am having trouble with the next part. I then need to compress this into a text file, and later decompress it and reconstruct the original string. This is the bit I am stumped on.
The way I would like to do it is to write the indexes of the numbers (the 0 1 0 2 bit) into the text file along with the dictionary contents, so the text file would contain 0 1 0 2 and {hi:0, bye:1, hello:2}.
To decompress, I would like to read this back into the Python program and use the indexes (this is how I will refer to the 0 1 0 2 from now on) to take each word out of the dictionary and reconstruct the sentence. So if a 0 came up, it would look in the dictionary, find what has the value 0, and pull that out to put into the string, so it would find hi and take that.
I hope that this is understandable and that at least one person knows how to do it, because I am sure it is possible, however I have been unable to find anything here or on the internet mentioning this subject.
TheLazyScripter gave a nice workaround solution for the problem, but the runtime characteristics are not good because for each reconstructed word you have to loop through the whole dict.
I would say you chose the wrong dict design: to be efficient, lookup should be done in one step, so you should have the numbers as keys and the words as values.
Since your problem looks like a great computer science homework (I'll consider it for my students ;-) ), I'll just give you a sketch for the solution:
use word in my_dict.values() #(adapt for py2/py3) to test whether the word is already in the dictionary.
If no, insert the next available index as key and the word as value.
you are done.
For reconstructing the sentence, just
loop through your list of numbers
use the number as key in your dict and print(my_dict[key])
Prepare exception handling for the case a key is not in the dict (which should not happen if you are controlling the whole process, but it's good practice).
This solution is much more efficient than your approach (and easier to implement).
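One possible way to flesh out that sketch is shown here, with numbers as keys and words as values. The function names encode and decode are invented for this illustration. Unlike the outline in the answer, it also keeps a temporary word-to-index dict while encoding, which is an addition made here so that the index of a word already seen can be found in one step.

```python
def encode(sentence):
    """Return (list of indexes, dict mapping index -> word)."""
    word_to_index = {}  # helper used only while encoding
    index_to_word = {}  # the design from the answer: numbers as keys
    indexes = []
    for word in sentence.split():
        if word not in word_to_index:
            new_index = len(word_to_index)  # next available index
            word_to_index[word] = new_index
            index_to_word[new_index] = word
        indexes.append(word_to_index[word])
    return indexes, index_to_word


def decode(indexes, index_to_word):
    """Rebuild the sentence; each lookup is a single dict access."""
    try:
        return " ".join(index_to_word[i] for i in indexes)
    except KeyError as err:
        raise ValueError("index %r is not in the dictionary" % err.args[0]) from err


indexes, lookup = encode("hi bye hi hello")
print(indexes)
print(lookup)
print(decode(indexes, lookup))
```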
Yes, you can just use regular dicts and lists to store the data. And use json or pickle to persist the data to disk.
import pickle
s = 'hi hello hi bye'
words = s.split()
d = {}
for word in words:
    if word not in d:
        d[word] = len(d)
data = [d[word] for word in words]
with open('/path/to/file', 'wb') as f:  # pickle needs binary mode
    pickle.dump({'lookup': d, 'data': data}, f)
Then read it back in
with open('/path/to/file', 'rb') as f:
    dic = pickle.load(f)
d = dic['lookup']
reverse_d = {v: k for k, v in d.items()}
data = dic['data']
words = [reverse_d[index] for index in data]
line = ' '.join(words)
print(line)
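Since the answer mentions json as an alternative to pickle, here is a rough sketch of the same round trip with the json module, which produces a human-readable text file. The names compress and decompress are made up for this example. To keep it self-contained it works on a JSON string; writing that string to a file with open(path, 'w') and reading it back is left out.

```python
import json


def compress(sentence):
    """Return a JSON string holding the word lookup and the index list."""
    words = sentence.split()
    lookup = {}
    for word in words:
        # setdefault only assigns a new index the first time a word is seen.
        lookup.setdefault(word, len(lookup))
    data = [lookup[word] for word in words]
    return json.dumps({"lookup": lookup, "data": data})


def decompress(payload):
    """Rebuild the original sentence from the JSON string."""
    stored = json.loads(payload)
    # Invert word -> index into index -> word for one-step lookups.
    reverse = {index: word for word, index in stored["lookup"].items()}
    return " ".join(reverse[index] for index in stored["data"])


payload = compress("hi hello hi bye")
print(payload)
print(decompress(payload))
```

The lookup is stored as word to index because JSON object keys are always strings; storing index to word would turn the integer keys into strings on the way back in.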
Because I don't know exactly how you have created your keymap, the best I could do is guess. Here I have created 2 functions that can be used to write a string to a txt file based on a keymap, and to read a txt file and return a string based on a keymap. I hope this works for you or at least gives you a solid understanding of the process! Good luck!
import os
def pack(out_file, string, conversion_map):
    out_string = ''
    for word in string.split(' '):
        for key, value in conversion_map.items():
            if word.lower() == value.lower():
                out_string += str(key) + ' '
                break
        else:
            # No entry in the map: keep the word as it is.
            out_string += word + ' '
    with open(out_file, 'w') as file:
        file.write(out_string)
    return out_string.rstrip()
def unpack(in_file, conversion_map, on_lookup_error=None):
    if not os.path.exists(in_file):
        return
    with open(in_file) as file:
        contents = file.read()
    out_string = ''
    for word in contents.split(' '):
        for key, value in conversion_map.items():
            if word.lower() == str(key).lower():
                out_string += str(value) + ' '
                break
        else:
            # No key matched: call the error handler or keep the word.
            if on_lookup_error:
                on_lookup_error()
            else:
                out_string += str(word) + ' '
    return out_string.rstrip()
def fail_on_lookup():
    print('We failed to find all words in our key map.')
    raise Exception
string = 'Hello, my first name is thelazyscripter'
word_to_int_map = {0:'first',
                   1:'name',
                   2:'is',
                   3:'TheLazyScripter',
                   4:'my'}
d = pack('data', string, word_to_int_map)  # pack and write the data based on the conversion map
print(d)  # the data that was written to the file
print(unpack('data', word_to_int_map))  # here we unpack the data from the file
print(unpack('data', word_to_int_map, fail_on_lookup))  # raises, because 'Hello,' is not in the map

python how to split text into new list

I have numerous lines of text that I would like to put into a list:
123456 123456 123456 234567 234567 4567890
243564 194563 432423 764575 542354 6564536
I think you get the idea: space-separated values, where each value should be its own item. There are 73 values per line and something like 144 lines. I know how to split based on the column:
d = list(zip(*(e.split() for e in b)))
How do I split based on the row? I want d[0] = '123456,123456,123456,234567,234567,4567890'
not d[0] = '123456,243564'
The above line splits the list up the way I don't want it split up.
EXTRA: Let me add one more thing.
The data in the list are decimal numbers. Is there a way to also round the numbers when I separate out the list?
f = np.round(float([e.split() for e in d]),2)
That only gives me the error 'float() argument must be a string or a number'
Remove the zip(); a list comprehension is enough here:
d = [e.split() for e in b]
If you need integers, you could use:
d = [[int(v) for v in e.split()] for e in b]
If you're insistent on the commas:
with open('data.txt', 'r') as f:
    d = [",".join(var.rstrip().split()) for var in f.readlines()]
print(d[0])
print(d[1])
Output:
123456,123456,123456,234567,234567,4567890
243564,194563,432423,764575,542354,6564536
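For the EXTRA part of the question, the error appears because float() is handed a whole list instead of one string at a time. A minimal sketch of converting and rounding each value individually, using only built-ins, could look like this; split_and_round and the sample rows are hypothetical and stand in for the question's b.

```python
def split_and_round(lines, ndigits=2):
    """Split each line on whitespace and round every value to `ndigits` places."""
    return [[round(float(v), ndigits) for v in line.split()] for line in lines]


# Hypothetical decimal data standing in for the lines of the real file.
rows = ["1.2345 2.3456 3.4567", "4.5678 5.6789"]
rounded = split_and_round(rows)
print(rounded[0])
print(rounded[1])
```

Each inner list corresponds to one row of the file, so rounded[0] holds the rounded values of the first line.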
