Tips for reading in a complex file - Python

I have complex, variable text files that I want to read into Python, but I'm not sure what the best strategy would be. I'm not looking for you to code anything for me, just some tips about which modules or approaches would best suit my needs.
The files look something like:
Program
Username: X    Laser: X    Em: X
exp 1
sample 1
Time: X    Notes: X
Read 1    X data
Read 2    X data
# unknown number of reads
sample 2
Time: X    Notes: X
Read 1    X data
...
# Unknown number of samples
exp 2
sample 1
...
# Unknown number of experiments, samples and reads
# The 4 spaces between certain words represent tabs
To analyse this data I need to get the data for each reading and know which sample and experiment it came from. Also, I can change the output file format but I think the way I have written it here is the easiest to read.
To read this file into Python, the best way I can think of would be to read it in row by row and search for keywords with regular expressions. For example, search the row for the "exp" keyword and record the number after it, then search for "sample" in the next line, and so on. However, this would of course break if a keyword happened to appear in the 'notes' section.
So, I'm kind of stumped as to what would best suit my needs (it's hard to use something if you don't know it exists!)
Thanks for your time.

It's a typical task for a syntactic analyzer. In this case, since

- lexical constructs do not cross line boundaries and there's a single construct ("statement") per line (in other words, each line is a single statement),
- the full syntax for a single line can be covered by a set of regexes, and
- the structure of compounds (=entities connecting multiple "statements" into something bigger) is simple and straightforward,

a (relatively) simple scannerless parser based on lines, a DFA, and the aforementioned set of regexes can be applied:

1. Set up the initial parser state (=current position relative to the various entities being tracked) and the parse tree (=a data structure representing the information from the file in a convenient way).
2. For each line:
   - classify it, e.g. by matching against the regexes applicable to the current state;
   - use the matched regex's groups to get the line's statement's meaningful parts;
   - using these parts, update the state and the parse tree.

See "get the path in a file inside {} by python" for an example. There, I do not construct a parse tree (it wasn't needed) but only track the current state.
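A minimal sketch of this scheme applied to the file from the question might look like the following (the exact regexes, the tab-separated fields, and the parse-tree layout are my assumptions based on the sample shown):

import re

# Assumed line formats, based on the sample in the question (fields tab-separated).
EXP_RE    = re.compile(r'^exp (\d+)')
SAMPLE_RE = re.compile(r'^sample (\d+)')
META_RE   = re.compile(r'^Time: (.*?)\tNotes: (.*)')
READ_RE   = re.compile(r'^Read (\d+)\t(.*)')

def parse(path):
    tree = {}            # {exp_no: {sample_no: {"time", "notes", "reads": [...]}}}
    exp = sample = None  # parser state: current experiment and sample
    with open(path) as f:
        for line in f:
            line = line.rstrip('\n')
            m = EXP_RE.match(line)
            if m:
                exp = tree.setdefault(int(m.group(1)), {})
                continue
            m = SAMPLE_RE.match(line)
            if m and exp is not None:
                sample = exp.setdefault(int(m.group(1)), {"reads": []})
                continue
            m = META_RE.match(line)
            if m and sample is not None:
                sample["time"], sample["notes"] = m.groups()
                continue
            m = READ_RE.match(line)
            if m and sample is not None:
                sample["reads"].append(m.group(2))
    return tree

Because every regex is anchored to the start of a line, an "exp" or "Read" that happens to appear inside the Notes text (which sits at the end of the Time line) cannot be mistaken for a new statement, which addresses the concern raised in the question.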

Related

subsetting very large files - python methods for optimal performance

I have one file (index1) with 17,270,877 IDs, and another file (read1) with a subset of these IDs (17,211,741). For both files, the IDs are on every 4th line.
I need a new (index2) file that contains only the IDs in read1. For each of those IDs I also need to grab the next 3 lines from index1. So I'll end up with index2 whose format exactly matches index1 except it only contains IDs from read1.
I am trying to implement the methods I've read here. But I'm stumbling on these two points: 1) I need to check IDs on every 4th line, but I need all of the data in index1 (in order) because I have to write the associated 3 lines following the ID. 2) unlike that post, which is about searching for one string in a large file, I'm searching for a huge number of strings in another huge file.
Can some folks point me in some direction? Maybe none of those 5 methods are ideal for this. I don't know any information theory; we have plenty of RAM so I think holding the data in RAM for searching is the most efficient? I'm really not sure.
Here's a sample of what index1 looks like (IDs start with #M00347):
#M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0
CCTAAGGTTCGG
+
CDDDDFFFFFCB
#M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0
CGCCATGCATCC
+
BBCCBBFFFFFF
#M00347:30:000000000-BCWL3:1:1101:15711:1332 1:N:0:0
TTTGGTTCCCGG
+
CDCDECCFFFCB
read1 looks very similar, but the lines before and after the '+' are different.
If the data of index1 fits in memory, the best approach is to do a single scan of that file and store all the data in a dictionary like this:
{"#M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0":["CCTAAGGTTCGG","+","CDDDDFFFFFCB"],
"#M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0":["CGCCATGCATCC","+","BBCCBBFFFFFF"],
..... }
Values can also be stored as a single formatted string, if you prefer.
After this, you can do a single scan of read1, and when an ID is encountered, do a simple lookup in the dictionary to retrieve the needed data.
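A minimal sketch of that approach, assuming both files use the 4-line record layout shown above (the file names are placeholders):

# Build the lookup: ID line -> the 3 lines that follow it in index1.
index = {}
with open("index1") as f:
    for header in f:
        header = header.rstrip("\n")
        index[header] = [next(f).rstrip("\n") for _ in range(3)]

# Single scan of read1: keep only the index1 records whose ID also appears there.
with open("read1") as src, open("index2", "w") as out:
    for header in src:
        header = header.rstrip("\n")
        for _ in range(3):   # skip read1's own 3 record lines
            next(src)
        if header in index:
            out.write("\n".join([header] + index[header]) + "\n")

With roughly 17 million records the dictionary will need a few gigabytes of RAM, which fits the "we have plenty of RAM" assumption in the question; dictionary lookups are O(1) on average, so both passes are linear in file size.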

Abaqus Python Command to insert keyword to the .inp file

The whole objective behind asking this question stems from trying to multi-process the creation of Linear Constraint Equations (http://abaqus.software.polimi.it/v6.14/books/usb/default.htm?startat=pt08ch35s02aus129.html#usb-cni-pequation) in Abaqus/CAE for applying periodic boundary conditions to a meshed model. Since my model has over a million elements and I need to perform a Monte Carlo simulation of 1000 such models, I would like to parallelize the procedure for which I haven't found a solution due to the licensing and multi-threading restrictions associated with Abaqus/CAE. Some discussions on this here: Python multiprocessing from Abaqus/CAE
I am currently trying to perform the equation definitions outside of Abaqus using the node sets created as I know the syntax of Equations for the input file.
** Constraint: <name>
*Equation
<number of terms>
<set1>, <dof>, <coefficient1>
<set2>, <dof>, <coefficient2>
<set3>, <dof>, <coefficient3>
e.g.
** Constraint: Corner_c1_Constraint-1-pair1
*Equation
3
All-1.c1_Node-1, 1, 1.
All-1.c5_Node-1, 1, -1.
RefPoint-3.SetRefPoint3, 1, -1.
Instead of directly writing these lines into the .inp file, I can also write these commands as a separate file and link it to the .inp file of the model using
*EQUATION, INPUT=file_name
I am looking for the Abaqus Python command to write a keyword such as above to the .inp file instead of specifying the Equation constraints themselves.
The User's Guide linked above says to specify this via the GUI, but I haven't been able to do that in my version, Abaqus/CAE 2018.
Abaqus/CAE Usage:
Interaction module: Create Constraint: Equation: click mouse button 3 while holding the cursor over the data table, and select Read from File.
So I am looking for a command from the scripting reference manual to do this instead. There are commands to parse input files (http://abaqus.software.polimi.it/v6.14/books/ker/pt01ch24.html), but nothing to directly write to the input file. I know I can hard-code this into the input file, but the sheer number of simulations I would like to perform calls for every bit of automation possible. I have already tried optimising the code using appropriate algorithms and numpy arrays, yet the pre-processing alone takes hours for a single model.
p.s. This is my first post on SO - so I'm not sure if this question is phrased in the appropriate format. Would appreciate any answers to the actual question or any other solutions to the intended outcome of parallelizing the pre-processing steps in Abaqus/CAE.
You are looking for the KeywordBlock object. This allows you to write anything to the input file (keywords, comments, etc) as a formatted string. I believe this approach is much less error-prone than alternatives that require you to programmatically write an input file (or update an existing one) on your own.
The KeywordBlock object is created when a Part Instance is created at the Assembly level. It stores everything you do in the CAE with the corresponding keywords.
Note that any changes you make to the KeywordBlock object will be synced to the Job input file when it is written, but will not update the CAE Model database. So for example, if you use the KeywordBlock to store keywords for constraint equations, the constraint definitions will not show up in the CAE model tree. The rest of the Mdb does not know they exist.
As you know, keywords must be written to the appropriate section of an input file. For example, the *equation keyword may be defined at the Part, Part Instance, or Assembly level (see the Keywords Reference Manual). This must also be considered when you store keywords in the KeywordBlock object (it will not auto-magically put your keywords in the right spot, unfortunately!). A side effect is that it is not always safe to write to the KeywordBlock - that is, whenever it may conflict with subsequent changes made via the CAE GUI. I believe that the Abaqus docs recommend that you add your keywords as a last step.
So basically, read the KeywordBlock object, find the right place, and insert your new keyword. Here's an example snippet to get you started:
partname = "Part-1"
model = mdb.models["Model-1"]
modelkwb = model.keywordBlock
assembly = model.rootAssembly
if assembly.isOutOfDate:
    assembly.regenerate()

# Synch edits to modelkwb with those made in the model. We don't need
# access to *nodes and *elements as they would appear in the inp file,
# so set the storeNodesAndElements arg to False.
modelkwb.synchVersions(storeNodesAndElements=False)

# Search the modelkwb for the desired insertion point. In this example,
# we are looking for a line that indicates we are beginning the Part-Level
# block for the specific Part we are interested in. If it is found, we
# break the loop, storing the line number, and then write our keywords
# using the insert method (which actually inserts just below the specified
# line number, fyi).
line_num = 0
for n, line in enumerate(modelkwb.sieBlocks):
    if line.replace(" ", "").lower() == "*part,name={0}".format(partname.lower()):
        line_num = n
        break

if line_num:
    kwds = "your keyword as a string here (may be multiple lines)..."
    modelkwb.insert(position=line_num, text=kwds)
else:
    e = ("Error: Part '{}' was not found".format(partname),
         "in the Model KeywordBlock.")
    raise Exception(" ".join(e))

Using Python & NLP, how can I extract certain text strings & corresponding numbers preceding the strings from Excel column having a lot of free text?

I am relatively new to Python and very new to NLP (and nltk), and I have searched the net for guidance but haven't found a complete solution. Unfortunately the sparse code I have been playing with is on another network, but I am including an example spreadsheet. I would like to get suggested steps in plain English (more detailed than I have below) so I could first try to script it myself in Python 3. Unless it would simply be easier for you to just help with the scripting... in which case, thank you.
Problem: A few columns of an otherwise robust spreadsheet are very unstructured with anywhere from 500-5000 English characters that tell a story. I need to essentially make it a bit more structured by pulling out the quantifiable data. I need to:
1) Search for a string in the user supplied unstructured free text column (The user inputs the column header) (I think I am doing this right)
2) Make that string a NEW column header in Excel (I think I am doing this right)
3) Grab the number before the string (This is where I am getting stuck. And as you will see in the sheet, sometimes there is no space between the number and text and of course, sometimes there are misspellings)
4) Put that number in the NEW column on the same row (Have not gotten to this step yet)
I will have to do this repeatedly for multiple keywords but I can figure that part out, I believe, with a loop or something. Thank you very much for your time and expertise...
If I'm understanding this correctly, first we need to obtain the numbers from the string of text.
cell_val = sheet1wb1.cell(row=rowNum,column=4).value
This will create a list containing every whitespace-separated integer in the string:
new_ = [int(s) for s in cell_val.split() if s.isdigit()]
print(new_)
You can use the list to assign the values to the column.
Then assign the first number in the list to the 5th column (note that the first element is new_[0]):
sheet1wb1.cell(row=rowNum, column=5).value = str(new_[0])
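Since the question notes that the number sometimes abuts the keyword with no space in between, a regex may be more robust than split()/isdigit(). A small sketch (the function name and example keyword are illustrative, not from the original code):

import re

def number_before(text, keyword):
    # Match an integer or decimal immediately before the keyword,
    # with or without a space in between, ignoring case.
    m = re.search(r'(\d+(?:\.\d+)?)\s*' + re.escape(keyword), text, re.IGNORECASE)
    return m.group(1) if m else None

# e.g. number_before("patient received 40mg on admission", "mg") -> "40"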
I think I have found what I am looking for. https://community.esri.com/thread/86096 has 3 or 4 scripts that seem to do the trick. Thank you..!

python parsing file into data structure

So I started looking into it, and I haven't found a good way to parse a file following the format I will show you below. I have taken a data structures course, but it doesn't really help me with what I want to do. Any help will be greatly appreciated!
Goal: Create a tool that can read, create, and manipulate a custom file type
File Format: I'm sure there is a name for this type of format, but I couldn't find it. Anyways, the format is subject to some change since the variable names can be added, removed, or changed. Also, after each variable name the data could be one of several different types. Right now the files do not use sub groups, but I want to be prepared in case they decide to change that. The only things I can think of that will remain constant are the GROUP = groupName, END_GROUP = groupName, and the varName = data.
GROUP = myGroup
name1 = String, datenum, number, list, array
name2 = String, datenum, number, list, array
// . . .
name# = String, datenum, number, list, array
GROUP = mySubGroup
name1 = String, datenum, number, list, array
END_GROUP = mySubGroup
// More names could go here
END_GROUP = myGroup
GROUP = myGroup2
// etc.
END_GROUP = myGroup2
Strings and dates are enclosed in " (e.g. "myString")
Numbers are written as raw ASCII-encoded numbers. They also use the E format if they are large or small (e.g. 5.023E-6)
Lists are comma-separated and enclosed in parentheses (e.g. (1,2,3,4) )
Additional Info:
I want to be able to easily read a file and manipulate it as needed. For example, if I read the file and I want to change an attribute of a specific variable within a group I should be able to do something along the lines of dataStructure.groupName.varName = newData.
It should be easy to create my own file (using a default template that I will make myself or a custom template that has been passed in).
I want it to treat numbers as numbers and not strings. I should be able to add, subtract, multiply, etc., values within the data structure that are numbers.
The big kicker, I'd like to have this written in vanilla python since our systems have only the most basic modules. It is a huge pain for someone to download another module since they have to create their own virtual environment and import the module to it. This tool should be as system independent as possible
Initial Attempt: I was thinking of using a dictionary to organize the data in levels. I do, however, like the idea of using dot structures (like what one would see using MATLAB structures). I wrote a function that will read all the lines of the file and remove the newline characters from each line. From there I want to check for every GROUP = I can find. I would start adding data to that group until I hit an END_GROUP line. Using regular expressions I should be able to parse out the line to determine whether it is a date, number, string, etc.
I am asking this question because I hope to have some insight on things I may be missing. I'd like for this tool to be used long after I've left the dev team which is why I'm trying to do my best to make it as intuitive and easy to use as possible. Thank you all for your help, I really appreciate it! Let me know if you need any more information to help you help me.
EDIT: To clarify what help I need, here are my two main questions I am hoping to answer:
How should I build a data structure to hold grouped data?
Is there an accepted algorithm for parsing data like this?
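Regarding the two EDIT questions: nested dictionaries are a natural fit for grouped data, and a single pass with a stack of open groups is the standard parsing approach for this kind of GROUP/END_GROUP format. A stdlib-only sketch, assuming the layout shown above (value parsing is deliberately simplified):

import ast
import re

LINE_RE = re.compile(r'\s*(\S+)\s*=\s*(.*)')

def parse_value(raw):
    # Simplified: handles "strings"/dates, numbers (incl. 5.023E-6), and (...) lists.
    raw = raw.strip()
    try:
        value = ast.literal_eval(raw)
        return list(value) if isinstance(value, tuple) else value
    except (ValueError, SyntaxError):
        return raw  # fall back to the raw text

def parse_lines(lines):
    root = {}
    stack = [root]  # the innermost open group is stack[-1]
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        name, value = m.groups()
        if name == 'GROUP':
            group = {}
            stack[-1][value.strip()] = group  # attach subgroup to its parent
            stack.append(group)
        elif name == 'END_GROUP':
            stack.pop()  # close the innermost group
        else:
            stack[-1][name] = parse_value(value)
    return root

For the MATLAB-style dot access (dataStructure.groupName.varName), a common vanilla-Python trick is a small dict subclass whose __getattr__ returns self[name]; since all of this uses only the standard library, it keeps the tool as system-independent as required.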

How to apply a loop within the regex function in Python?

I am trying to use a regex function to find keywords in a large text file and then pick a certain value in the text file corresponding to each. While my current script does this, I want to put a loop within the regex function so that I can do it for multiple (>100) keywords. For example: in my text, B443 would be searched for and a number written next to it would be picked. The text looks like this:
*
(BHC443) 2,462,000
1.a.(1)(a) (b) All other loans secured by real estate
(BHC442) 1,033,000
1.a.(1)(b)
*
The output would be BHC443:2,462,000, BHC442:1,033,000, etc. for all the keywords searched. Now, I have many more keywords in the text for which I need to pick the corresponding numbers, and I want to write a dynamic regex function that takes the keywords one by one and generates outputs. I have a fixed list of keywords already sorted out (e.g., B443, B442, CA13323, SQDS73733, etc.). So the problem is searching for all of those in the text and then picking up the numbers, probably by importing the keywords as a list first and then running the regex function over the elements of that list. I don't know how to write a loop for that.
The regex code I wrote for finding the number corresponding to one keyword at a time is written below and it works.
import re

with open(path, 'r') as file:
    for line in file:
        # For each keyword, pick the corresponding amount.
        key_value_name = re.search(
            '(B443)([\\(\\)((\\s)+)|(\\n)?(\\n)])([1234567890,a-zA-Z.\\s]+)', line)
        if key_value_name:
            print(key_value_name.group(1))
            print(key_value_name.group(3))
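One way to extend this to the whole keyword list without re-scanning the file once per keyword is to build a single alternation pattern. A sketch under the assumption that every amount appears as "(KEY) 1,234,000" the way the sample shows (the keyword list here is illustrative):

import re

keywords = ['BHC443', 'BHC442', 'CA13323', 'SQDS73733']  # your fixed list

# One pattern matching any of the keywords in parentheses, then the amount.
pattern = re.compile(
    r'\((' + '|'.join(map(re.escape, keywords)) + r')\)\s+([\d,]+)')

results = {}
with open(path, 'r') as file:
    for line in file:
        for key, amount in pattern.findall(line):
            results[key] = amount

print(results)  # e.g. {'BHC443': '2,462,000', 'BHC442': '1,033,000'}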
