Regex line by line over large string

I have a lot of rows like below in a file:
{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}
I first tried importing this as a dictionary with the json module so I could just print the values of the keys. The problem is some of the lines are missing the right curly bracket or have other issues and the fields aren't in the same order per line. That is preventing the import.
So now I am trying to do this with a regex. I have this:
import re

fo = open("c:\\newgoodtestsample.txt", "r")
x = fo.read()
match1 = re.search('first_name"(.*?)"(.*?)"', x)
if match1:
    print(match1.group(2))
That returns the value of just the name. I would like to be able to return other fields as well. This worked in a regex tester but I can't get it to work in my code:
(first_name|last_name|age)"(.*?)"(.*?)"
Lastly, once that is figured out, I need to read each line in the file (not just the first one) and print the requested regex data from each line into a file. I have tried inserting a for loop but I keep getting the first line repeated over and over so I must be inserting it incorrectly. Any assistance is appreciated.

The following seems to do what you want. The regex returns every key together with its value as a pair of matching groups, whether or not the value is quoted.
I also encourage you to use the with context manager, as it closes the file handle automatically once all lines have been read, and a plain for loop is enough to walk the file line by line.
import re

with open("c:\\newgoodtestsample.txt", "r") as fo:
    for line in fo:
        # every "key":value pair on the line, with or without quotes around the value
        result = re.findall(r'"(\w*?)":"?(\w*)"?', line)
        d = {k: v for k, v in result}
        if 'first_name' in d:
            pass  # print first_name into file
        else:
            pass  # print empty first_name field
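To make the last step concrete, here is a minimal sketch of pulling the requested fields out of each line and writing them to an output stream. The helper names extract_fields and convert, the comma-separated output, and the sample lines are assumptions for illustration, not part of the original answer. Note that \w only matches letters, digits and underscores, so values containing spaces or punctuation would be cut short.

```python
import io
import re

# Same pattern as in the answer: a quoted key, a colon, an optionally quoted value.
PAIR = re.compile(r'"(\w*?)":"?(\w*)"?')

def extract_fields(line, wanted=("first_name", "last_name", "age")):
    """Return the wanted values from one line, '' for any that are missing."""
    found = dict(PAIR.findall(line))
    return [found.get(key, "") for key in wanted]

def convert(lines, out):
    """Write one comma-separated row per input line to the writable object out."""
    for line in lines:
        out.write(",".join(extract_fields(line)) + "\n")

# Hypothetical sample: the second line is reordered and lacks its closing brace.
sample = [
    '{"first_name":"John","last_name":"Smith","age":30}\n',
    '{"age":34,"last_name":"Johnson","first_name":"Tim"\n',
]

out = io.StringIO()  # stands in for an output file opened with open(..., "w")
convert(sample, out)
print(out.getvalue())
```

Because the file object is iterated line by line, the same convert function can be handed an open input file directly instead of the sample list.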

Related

delete only 1 instance of a string from a file

I have a file that looks like this:
1234:AnneShirly:anneshirley#seneca.ca:4:5[SRT111,OPS105,OPS110,SPR100,ENG100]
3217:Illyas:illay#seneca.ca:2:4[SRT211,OPS225,SPR200,ENG200]
1127:john Marcus:johnmarcus#seneca.ca:1:4[SRT111,OPS105,SPR100,ENG100]
0001:Amin Malik:amin_malik#seneca.ca:1:3[OPS105,SPR100,ENG100]
I want to ask the user for an input (the student number at the beginning of each line) and then ask which course they want to delete (the course codes are in the list). The program should delete the course from that student's list without deleting other instances of the course, because other students have the same courses.
studentid = input("enter studentid")
course = input("enter the course to delete")
with open("studentDatabase.dat") as file:
    lines = file.readlines()
with open("studentDatabase.dat", "w") as file:
    for line in lines:
        if line.find(course) == -1:
            file.write(line)
This just deletes the whole line but I only want to delete the course

Welcome to the site. You have a little ways to go to make this work. It would be good if you put some additional effort in to this before asking somebody to code this up. Let me suggest a structure for you that perhaps you can work on/augment and then you can re-post if you get stuck by editing your question above and/or commenting back on this answer. Here is a framework that I suggest:
make a section of code to read in your whole .dat file into memory. I would suggest putting the data into a dictionary that looks like this:
data = {1001: (name, email, <whatever the digits stand for>, [SRT111, OPS333, ...]),
        1044: ( ... )}
basically a dictionary with the ID as the key and the rest in a tuple or list. Test that, make sure it works OK by inspecting a few values.
Make a little "control loop" that uses your input statements, and see if you can locate the "record" from your dictionary. Add some "if" logic to do "something" if the ID is not found or if the user enters something like "quit" to exit/break the loop. Test it to make sure it can find the ID's and then test it again to see that it can find the course in the list inside the tuple/list with the data. You probably need another "if" statement in there to "do something" if the course is not in the data element. Test it.
Make a little "helper function" that can re-write a data element with the course removed. A suggested signature would be:
def remove_course(data_element, course):
    # make the new data element (name, ... , [reduced course list])
    return new_data_element
Test it, make sure it works.
Put those pieces together and you should have the ingredients to change the dictionary by using the loop and function to put the new data element into the dictionary, over-writing the old one.
Write a widget to write the new .dat file from the dictionary in its entirety.
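As a rough illustration of the helper function and the dictionary update described in the steps above, here is one possible sketch. It assumes each data element is a 5-tuple of (name, email, some_number, course_count, courses) and that course_count should track the length of the course list; both are assumptions about the file's meaning rather than facts stated in the question.

```python
def remove_course(data_element, course):
    """Return a new data element with the given course removed, if present."""
    name, email, some_number, course_count, courses = data_element
    if course not in courses:
        return data_element  # nothing to remove, hand back the element unchanged
    reduced = [c for c in courses if c != course]
    # Assumption: the count field mirrors the number of courses in the list.
    return (name, email, some_number, len(reduced), reduced)

# Hypothetical in-memory data using the layout described above.
data = {
    1234: ("AnneShirly", "anneshirley#seneca.ca", 4, 5,
           ["SRT111", "OPS105", "OPS110", "SPR100", "ENG100"]),
    1127: ("john Marcus", "johnmarcus#seneca.ca", 1, 4,
           ["SRT111", "OPS105", "SPR100", "ENG100"]),
}

# Only the selected student's record is replaced; other students keep the course.
data[1234] = remove_course(data[1234], "SRT111")
print(data[1234])
print(data[1127])
```

Returning a new tuple instead of mutating the old one keeps the function easy to test on its own before it is wired into the input loop.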
EDIT:
You can make the dictionary from a data file with something like this:
filename = 'student_data.dat'
data = {}   # an empty dictionary to stuff the results in
# use a context manager to handle opening/closing the file...
with open(filename, 'r') as src:
    # loop through the lines
    for line in src:
        # strip any whitespace from the end and tokenize the line by ":"
        tokens = line.strip().split(':')
        # check it... (remove later)
        print(tokens)
        # gather the pieces, make conversions as necessary...
        stu_id = int(tokens[0])
        name = tokens[1]
        email = tokens[2]
        some_number = int(tokens[3])
        # splitting the number from the list of courses is a little complicated
        # you *could* do this more elegantly with regex, but for your level,
        # here is a simple way to find the "chop points" and split this up...
        last_blobs = tokens[4].split('[')
        course_count = int(last_blobs[0])
        course_list = last_blobs[1][:-1]   # everything except the last bracket
        # split up the courses by comma
        courses = course_list.split(',')
        # now stuff that into the dictionary...
        # a little sanity check:
        if data.get(stu_id):
            print(f'duplicate ID found: {stu_id}. OVERWRITING')
        data[stu_id] = (name,
                        email,
                        some_number,
                        course_count,
                        courses)
for key, value in data.items():
    print(key, value)
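The remaining step, writing the dictionary back out, could look roughly like the sketch below. The names record_to_line and write_database are hypothetical. Because the reader above converts the ID with int(), leading zeros such as those in 0001 are lost, so this sketch pads the ID back to four digits, which is an assumption about the file format.

```python
def record_to_line(stu_id, record):
    """Rebuild one line in the id:name:email:number:count[courses] layout."""
    name, email, some_number, course_count, courses = record
    # Assumption: IDs are always four digits wide in the file.
    return f"{stu_id:04d}:{name}:{email}:{some_number}:{course_count}[{','.join(courses)}]"

def write_database(path, data):
    """Overwrite the file at path with one line per dictionary entry."""
    with open(path, "w") as dst:
        for stu_id, record in data.items():
            dst.write(record_to_line(stu_id, record) + "\n")

example = (1, ("Amin Malik", "amin_malik#seneca.ca", 1, 3, ["OPS105", "SPR100", "ENG100"]))
print(record_to_line(*example))
```

Storing the ID as a string instead of an int would avoid the padding question altogether, at the cost of comparing strings when looking up user input.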

I got something for you. What you want to do is find the student first and then delete the course, like this:
studentid = input("enter studentid")
course = input("enter the course to delete")
with open("studentDatabase.dat") as file:
    lines = file.readlines()
with open("studentDatabase.dat", "w") as file:
    for line in lines:
        if line.startswith(studentid + ":"):  # check if it's the right student
            # remove the course together with one adjoining comma
            line = line.replace("," + course, "").replace(course + ",", "").replace(course, "")
        file.write(line)
You check whether the line belongs to the requested student and, if so, write it back without the course code. Every other line is written back unchanged. Hope you find it useful.

My python code is not importing JSON data correctly to CSV

So I am trying to import JSON data from a file and export it to a CSV file. A few tags like "authors" and "title" work fine with this code, but when I try it for "abstract", every word of the abstract ends up in a new column of the CSV. Before I tried split(), the same happened for every character.
Here is my code:
import json
import csv
filename="abc.json"
csv_file= open('my.csv', 'w',encoding="utf-8")
csvwriter = csv.writer(csv_file)
with open(filename, 'r') as f:
    for line in f:
        data = json.loads(line)
        if 'abstract' in data:
            csvwriter.writerow(data['abstract'].split())
        elif 'authors' in data:
            csvwriter.writerow(data['authors'])
        else:
            f="my"
sample json file can be downloaded from here
http://s000.tinyupload.com/?file_id=28925213311182593120

Like Ben said, it would be great to see a sample from the JSON file, but the issue could be with how you're trying to split your abstract data. With what you're doing now, you're asking it to split at every space. Try something like this if you want to split on commas instead:
if 'abstract' in data:
    csvwriter.writerow(data['abstract'].split(","))

The reason this happened in abstract is because the value of abstract is a string (in contrast, the value of authors is a list). writerow receives an iterable, and when you iterate over a string in python you get a letter each time.
So before you used split, python took the string and divided it into letters, thereby giving you one letter per column.
When you used split, you transformed the string into a list of words, so when you iterate over it you get a word each time.
If you want to split abstract by periods, just do the same thing with .split('.')
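A small sketch of the behaviour described above, using an in-memory buffer instead of a real file. The csv_line helper and the sample abstract are made up for illustration. The last call shows one way to keep the whole abstract in a single column: wrap the string in a list so that writerow sees one item.

```python
import csv
import io

def csv_line(row):
    """Return the text csv.writer would produce for one call to writerow(row)."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(row)
    return buffer.getvalue().rstrip("\r\n")

abstract = "Short abstract"  # hypothetical value of data['abstract']

print(csv_line("abc"))             # a bare string is iterated character by character
print(csv_line(abstract.split()))  # a list of words gives one column per word
print(csv_line([abstract]))        # a one-item list gives a single column
```

In the question's code this would correspond to calling csvwriter.writerow([data['abstract']]).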

Using python RE to replace a string in Word Document?

So I am trying to run through a Word document and replace every occurrence of the text 'aaa' (just as an example) with a value from user input. I have been bashing my head against a few answers on Stack Overflow and came across regular expressions, which I have never used before. After following a tutorial for a bit, I still can't get my head round it.
This is all the code I have tried, but I can't get Python to actually change the text in the Word document.
from docx import Document
import re
signature = Document ('test1.docx')
person = raw_input('Name?')
person = person+('.docx')
save = signature.save(person)
name_change = raw_input('Change name?')
line = re.sub('[a]{3}',name_change,signature)
print line
save
for line in signature.paragraphs:
    line = re.sub('[a]{3}',name_change,signature)
for table in signature.tables:
    for cell in table.cells:
        for paragraph in cell.paragraphs:
            if 'aaa' in paragraph.text:
                print paragraph.text
                paragraph.text= replace('aaa',name_change)
save
Thank you in advance for any help.

for line in signature.paragraphs:
    line = re.sub('[a]{3}',name_change,signature)
The above loop has no effect: it rebinds the variable line to the result of re.sub, but that does not update the object the value came from, as shown below:
import re
data = ['aaa', 'baaa']
for item in data:
    item = re.sub('[a]{3}', 't', item)
print(data)
#['aaa', 'baaa']
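Also, you are iterating over signature.paragraphs but calling re.sub on the whole signature document every time, and re.sub only accepts strings. Apply it to each paragraph's text and assign the result back, then save the document:
for paragraph in signature.paragraphs:
    paragraph.text = re.sub('[a]{3}', name_change, paragraph.text)
signature.save(person)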
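The write-back pattern can be tried without a Word file. The sketch below works on any objects that expose a writable text attribute, which python-docx paragraphs do; here SimpleNamespace objects stand in for them, and the function name replace_in_paragraphs is hypothetical. As far as I know, assigning to paragraph.text in python-docx replaces the paragraph's runs with a single run, so character-level formatting inside that paragraph may be lost.

```python
import re
from types import SimpleNamespace

def replace_in_paragraphs(paragraphs, pattern, replacement):
    """Rewrite .text on each paragraph in place and return how many changed."""
    changed = 0
    for paragraph in paragraphs:
        new_text = re.sub(pattern, replacement, paragraph.text)
        if new_text != paragraph.text:
            paragraph.text = new_text  # assign to the attribute, not to the loop variable
            changed += 1
    return changed

# Stand-ins for signature.paragraphs; a real document would come from docx.Document(...).
paragraphs = [SimpleNamespace(text="Dear aaa,"), SimpleNamespace(text="Regards")]
count = replace_in_paragraphs(paragraphs, "a{3}", "Andy")
print(count, [p.text for p in paragraphs])
```

The same function could then be called once for the document's paragraphs and once for the paragraphs of each table cell.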

Storing multiple lines from a file to a variable using a delimiter

I am using Python to make a filter to search through thousands of text files for specific queries. These text files consist of several sections, and they do not all have consistent formatting. I want each of these sections to be checked for specific criteria, so in the section of the text file called "DESCRIPTION OF RECORD", I was doing something like this to store the string to a variable:
with open(some_file, 'r') as r:
    for line in r:
        if "DESCRIPTION OF RECORD" in line:
            record = line
Now this works pretty well for most files, but some files have a line break in the section, so it does not store the whole section to the variable. I was wondering how I could use a delimiter to control how many lines are stored to the variable. I would probably use the title of the next section called "CORRELATION" for the delimiter. Any ideas?
An example structure of the file could look like:
CLINICAL HISTORY: Some information.
MEDICATIONS: Other information
INTRODUCTION: Some more information.
DESCRIPTION OF THE RECORD: Some information here....
another line of information
IMPRESSION: More info
CLINICAL CORRELATION: The last bit of information

You could use the built-in re module like this:
import re
# I assume you have a list of all possible sections
sections = [
    'CLINICAL HISTORY',
    'MEDICATIONS',
    'INTRODUCTION',
    'DESCRIPTION OF THE RECORD',
    'IMPRESSION',
    'CLINICAL CORRELATION'
]
# Build a regexp that will match any of the section names
exp = '|'.join(sections)
with open(some_file, 'r') as r:
    contents_of_file = r.read()
# infos is a list of what's between the section names
infos = list(re.split(exp, contents_of_file))
# let's get rid of colons and whitespace in our infos
infos = [info.strip('\n :') for info in infos]
print(infos)  # you don't have to print it :)
If I use your example text instead of a file, it prints something like this:
['', 'Some information.', 'Other information', 'Some more information.', 'Some information here....\nanother line of information', 'More info', 'The last bit of information']
The first element is empty, but you can get rid of it simply by doing so:
infos = infos[1:]
By the way, merging the lines that build infos into one would probably be cleaner and marginally more efficient, since it skips the intermediate list (though it may be a little less readable):
infos = [info.strip('\n :') for info in re.split(exp, contents_of_file)][1:]
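If you also need to know which section each piece of text belongs to, one variation is to wrap the pattern in a capturing group, because re.split then keeps the matched section names in its result. The function name split_sections is made up for this sketch, and re.escape is added as a precaution in case a section name ever contains regex metacharacters.

```python
import re

SECTIONS = [
    'CLINICAL HISTORY',
    'MEDICATIONS',
    'INTRODUCTION',
    'DESCRIPTION OF THE RECORD',
    'IMPRESSION',
    'CLINICAL CORRELATION',
]

def split_sections(text, sections=SECTIONS):
    """Map each section name found in text to the content that follows it."""
    pattern = '(' + '|'.join(re.escape(name) for name in sections) + ')'
    parts = re.split(pattern, text)
    # parts looks like [preamble, name, body, name, body, ...]
    names = parts[1::2]
    bodies = [body.strip('\n :') for body in parts[2::2]]
    return dict(zip(names, bodies))

text = """CLINICAL HISTORY: Some information.
MEDICATIONS: Other information
INTRODUCTION: Some more information.
DESCRIPTION OF THE RECORD: Some information here....
another line of information
IMPRESSION: More info
CLINICAL CORRELATION: The last bit of information
"""

record = split_sections(text)['DESCRIPTION OF THE RECORD']
print(record)
```

The multi-line description comes back as one string, line break included, which is what the per-line loop in the question could not capture.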

If you do not know the sections you'll find, here's a version which seems to work, as long as the text is formatted as in your example :
import itertools
text = """
CLINICAL HISTORY: Some information.
MEDICATIONS: Other information
INTRODUCTION: Some more information.
DESCRIPTION OF THE RECORD: Some information here....
another line of information
IMPRESSION: More info
CLINICAL CORRELATION: The last bit of information
"""
def method_tuple(s):
    # sp holds strings which finish with the section names.
    sp = s.split(":")
    # This line removes spurious "\n" at both ends of the strings in sp.
    # It then splits them once at "\n" starting from their end, effectively
    # separating the sections and the descriptions.
    # It builds a list of strings alternating section names and information.
    fragments = list(itertools.chain.from_iterable( p.strip("\n").rsplit("\n", 1) for p in sp ))
    # You can now build a list of 2-tuples.
    pairs = [ (fragments[i*2],fragments[i*2+1]) for i in range(len(fragments)//2)]
    # Or you could build a dict
    # pairs = { fragments[i*2]:fragments[i*2+1] for i in range(len(fragments)//2)}
    return pairs
print(method_tuple(text))
The timings are roughly equivalent to those of Ilya's regex version, although building a dictionary seems to start winning over building a list of tuples or using a regexp on the sample text at 1 billion loops...

I found another possible solution for this using the indexes of the line. I first opened the check file, and stored its f.read() contents into a variable called info. I then did this:
with open(check_file, 'r') as r:
    for line in r:
        if "DESCRIPTION" in line:
            record_Index = info.index(line)
            record = info[info.index(line):]
            if "IMPRESSION" in record:
                impression_Index = info.index("IMPRESSION")
                record = info[record_Index:impression_Index]
This method worked as well, although I don't know how efficient it is memory and speed wise. Instead of using with open(...) multiple times, it might be better just to store it all in the variable called info and then do everything with that.
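The idea of working only on the info string can be sketched as below, with str.find locating both markers so no second pass over the file is needed. The function name extract_between and the choice to return None when the start marker is missing are assumptions for this example.

```python
def extract_between(info, start_marker, end_marker):
    """Return the text from start_marker up to (not including) end_marker."""
    start = info.find(start_marker)
    if start == -1:
        return None  # the section is not present in this file
    # Search for the end marker only after the start of the section.
    end = info.find(end_marker, start)
    if end == -1:
        end = len(info)  # no following section: take everything to the end
    return info[start:end].strip()

info = (
    "INTRODUCTION: Some more information.\n"
    "DESCRIPTION OF THE RECORD: Some information here....\n"
    "another line of information\n"
    "IMPRESSION: More info\n"
)

record = extract_between(info, "DESCRIPTION", "IMPRESSION")
print(record)
```

Each file is read once into info, and the slice is taken directly from it.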

Output py2neo recordlist to text file

I am trying to use python (v3.4) to act as a 'sandwich' between Neo4j and a text output file. This code gives me a py2neo RecordList:
from py2neo import Graph
from py2neo.packages.httpstream import http
http.socket_timeout = 9999
graph = Graph('http://localhost:7474/db/data/')
sCypher = 'MATCH (a) RETURN count(a)'
results = graph.cypher.execute(sCypher)
I also have some really simple text file interaction:
f = open('Output.txt','a') #open for append access
f.write ('\n Hello world')
f.close()
What I really want to do is f.write (str(results)) but it really didn't like that at all. Can someone help me to convert my RecordList into a string please? I'm assuming I'll need to loop through the columns to get each column name, then loop through each record and write it individually, but I don't know how to go about this. Where I'm eventually planning to go with this is to change the Cypher every time.
Closest related question I could find is this one: How to convert Neo4j return types to python types. I'm sure there's someone out there who'll read this and say that using the REST API directly is a much better way of spitting out text, but I'm not quite at that level...
Thanks in advance,
Andy

Here is how you can iterate a RecordList and print the columns of the individual Records to a file (e.g. comma separated). If the properties you return are lists you would need some more formatting to get strings for your output file.
# use with to open files, this makes sure that it's properly closed after an exception
with open('output.txt', 'a') as f:
    # iterate over individual Records in RecordList
    for record in results:
        # concatenate all columns of the Record into a string, comma separated
        # list comprehension with str() to handle int and other types
        output_string = ','.join([str(x) for x in record])
        # actually write to file
        print(output_string, file=f)
The format of the output file depends on what you want to do with it of course.
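For the case mentioned above where a returned property is itself a list, the extra formatting might look like the sketch below. It uses plain tuples as stand-ins for the records, relying only on the fact that the answer's code iterates over each record's column values. The function name record_to_line and the semicolon used inside list cells are arbitrary choices for illustration.

```python
def record_to_line(record, sep=",", list_sep=";"):
    """Join the columns of one record, flattening list values into a single cell."""
    cells = []
    for value in record:
        if isinstance(value, (list, tuple)):
            cells.append(list_sep.join(str(item) for item in value))
        else:
            cells.append(str(value))
    return sep.join(cells)

# Stand-in rows; with py2neo these would come from iterating over results.
rows = [(42, "Alice", ["Person", "Admin"]), (7, None, [])]
lines = [record_to_line(row) for row in rows]
print("\n".join(lines))
```

If the values can contain the separator characters themselves, the csv module would be a safer way to write the rows.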
