I am trying to use a regex to find keywords in a large text file and then pick a certain value corresponding to each one. My current script does this, but I want to put a loop around the regex so that I can do it for multiple (>100) keywords. For example, in my text, BHC443 would be searched for and the number written next to it would be picked. The text looks like this:
*
(BHC443) 2,462,000
1.a.(1)(a) (b) All other loans secured by real estate
(BHC442) 1,033,000
1.a.(1)(b)
*
The output would be BHC443:2,462,000, BHC442:1,033,000, etc. for all the keywords searched. I have many more keywords in the text for which I need to pick the corresponding numbers, and I want to write a dynamic regex function that takes the keywords one by one and generates outputs. I have a fixed list of keywords already sorted out (e.g., B443, B442, CA13323, SQDS73733, etc.). So the problem is searching for all of those in the text and then picking up the numbers, probably by importing the keywords as a list first and then running the regex function over the elements of that list. I don't know how to write a loop for that.
The regex code I wrote for finding the number corresponding to one keyword at a time is written below and it works.
import re

with open(path, 'r') as file:
    for line in file:
        # For each keyword, pick the corresponding amount
        key_value_name = re.search(r'\((B443)\)\s*([\d,]+)', line)
        if key_value_name:
            print(key_value_name.group(1))  # the keyword
            print(key_value_name.group(2))  # the amount
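One way to loop over the keywords is to join them into a single alternation, so the whole list is matched in one pass instead of re-scanning the file once per keyword. The keyword list and sample text below are made up for illustration, not the original code:

```python
import re

# Hypothetical keyword list; in practice, load the sorted list from a file
keywords = ['BHC443', 'BHC442']

# Join the keywords into one alternation so every keyword is matched in a
# single pass; re.escape guards against metacharacters inside a keyword.
pattern = re.compile(r'\((%s)\)\s*([\d,]+)' % '|'.join(map(re.escape, keywords)))

# Sample text standing in for the large file from the question
text = """(BHC443) 2,462,000
1.a.(1)(a) (b) All other loans secured by real estate
(BHC442) 1,033,000"""

results = {m.group(1): m.group(2) for m in pattern.finditer(text)}
for key, amount in results.items():
    print('%s:%s' % (key, amount))
```

The same compiled pattern can be applied line by line inside the existing `with open(...)` loop; compiling it once outside the loop keeps the per-line cost low even with hundreds of keywords.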
I am relatively new to Python and very new to NLP (and nltk), and I have searched the net for guidance but have not found a complete solution. Unfortunately the sparse code I have been playing with is on another network, but I am including an example spreadsheet. I would like suggested steps in plain English (more detailed than I have below) so I can first try to script it myself in Python 3. Unless it would simply be easier for you to just help with the scripting, in which case, thank you.
Problem: A few columns of an otherwise robust spreadsheet are very unstructured with anywhere from 500-5000 English characters that tell a story. I need to essentially make it a bit more structured by pulling out the quantifiable data. I need to:
1) Search for a string in the user supplied unstructured free text column (The user inputs the column header) (I think I am doing this right)
2) Make that string a NEW column header in Excel (I think I am doing this right)
3) Grab the number before the string (This is where I am getting stuck. And as you will see in the sheet, sometimes there is no space between the number and text and of course, sometimes there are misspellings)
4) Put that number in the NEW column on the same row (Have not gotten to this step yet)
I will have to do this repeatedly for multiple keywords but I can figure that part out, I believe, with a loop or something. Thank you very much for your time and expertise...
If I'm understanding this correctly, first we need to obtain the numbers from the string of text.
cell_val = sheet1wb1.cell(row=rowNum,column=4).value
This will create a list of every whitespace-separated integer in the string:
new_ = [int(s) for s in cell_val.split() if s.isdigit()]
print(new_)
You can use the list to assign the values to the column. For example, to write the first number in the list into the 5th column:
sheet1wb1.cell(row=rowNum, column=5).value = str(new_[0])  # new_[0] is the first number in the list
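Since the question notes that the number sometimes butts right up against the text with no space, a regex sketch like the following may be more robust than splitting on whitespace. The keyword and cell text here are made-up examples, and misspelled keywords are still not handled:

```python
import re

def number_before(keyword, text):
    # A number (digits with optional commas or decimal points) appearing
    # immediately before the keyword, with or without a space in between.
    m = re.search(r'(\d[\d,.]*)\s*' + re.escape(keyword), text, re.IGNORECASE)
    return m.group(1) if m else None

print(number_before('widgets', 'shipped 1,250widgets to the plant'))  # 1,250
```

The returned string can then be written into the new column the same way as above, e.g. `sheet1wb1.cell(row=rowNum, column=5).value = number_before(...)`.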
I think I have found what I am looking for. https://community.esri.com/thread/86096 has 3 or 4 scripts that seem to do the trick. Thank you!
I've got two dataframes which are as follows:
df1 : contains one variable ['search_term'] and 100000 rows
These are words/phrases I want to search for in my files
df2: contains parsed file contents in a column called file_text
There are 20000 rows in this dataframe and two columns ['file_name', 'file_text']
What I need is the index of each appearance of a search term in the file_text.
I cannot figure out an efficient way to perform this search.
I am using str.find() along with groupby, but it takes around 0.25 s per (file_text, search term) pair, which becomes very long with 20k files * 100k search terms.
Any ideas on ways to do this in a fast and efficient way would be lifesavers!
I remember having to do something similar in one of our projects. We had a very large set of keywords and wanted to find all occurrences of them in a large string. Let's call the string we want to search in content. After some benchmarking, the solution I adopted was a two-pass method: first check whether a keyword exists in the content at all using the highly optimized in operator, and then use regular expressions to find all occurrences of it.
import re

keywords = [...]  # list of your keywords
found_keywords = []

# First pass: cheap membership check with the `in` operator
for keyword in keywords:
    if keyword in content:
        found_keywords.append(keyword)

# Second pass: locate every occurrence of the surviving keywords
for keyword in found_keywords:
    for match in re.finditer(re.escape(keyword), content):  # escape regex metacharacters
        print(match.start())
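Applied to the setup in the question, the same two-pass idea might look like this, with plain lists standing in for df1['search_term'] and df2 (the data is made up). str.find with a moving start offset collects every occurrence index:

```python
# Plain lists stand in for df1['search_term'] and df2[['file_name', 'file_text']]
search_terms = ['fox', 'dog', 'cat']
files = [('a.txt', 'the quick brown fox'),
         ('b.txt', 'lazy dog and dog')]

hits = []
for file_name, text in files:
    for term in search_terms:
        if term in text:                                  # cheap pre-filter
            start = 0
            while (idx := text.find(term, start)) != -1:  # every occurrence
                hits.append((file_name, term, idx))
                start = idx + 1

print(hits)  # [('a.txt', 'fox', 16), ('b.txt', 'dog', 5), ('b.txt', 'dog', 13)]
```

The pre-filter is what saves time: most (file, term) pairs never reach the inner loop. For 20k x 100k pairs, a dedicated multi-pattern algorithm (e.g. Aho-Corasick, available in third-party packages) would scale better still.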
So I started looking into it, and I haven't found a good way to parse a file following the format I will show you below. I have taken a data structures course, but it doesn't really help me with what I want to do. Any help will be greatly appreciated!
Goal: Create a tool that can read, create, and manipulate a custom file type
File Format: I'm sure there is a name for this type of format, but I couldn't find it. Anyways, the format is subject to some change since the variable names can be added, removed, or changed. Also, after each variable name the data could be one of several different types. Right now the files do not use sub groups, but I want to be prepared in case they decide to change that. The only things I can think of that will remain constant are the GROUP = groupName, END_GROUP = groupName, and the varName = data.
GROUP = myGroup
name1 = String, datenum, number, list, array
name2 = String, datenum, number, list, array
// . . .
name# = String, datenum, number, list, array
GROUP = mySubGroup
name1 = String, datenum, number, list, array
END_GROUP = mySubGroup
// More names could go here
END_GROUP = myGroup
GROUP = myGroup2
// etc.
END_GROUP = myGroup2
Strings and dates are enclosed in double quotes (e.g. "myString")
Numbers are written as raw ASCII-encoded numbers; they use E notation if they are very large or small (e.g. 5.023E-6)
Lists are comma separated and enclosed in parentheses (e.g. (1,2,3,4) )
Additional Info:
I want to be able to easily read a file and manipulate it as needed. For example, if I read the file and I want to change an attribute of a specific variable within a group I should be able to do something along the lines of dataStructure.groupName.varName = newData.
It should be easy to create my own file (using a default template that I will make myself or a custom template that has been passed in).
I want it to treat numbers as numbers and not strings; I should be able to add, subtract, multiply, etc. values within the data structure that are numbers.
The big kicker: I'd like this written in vanilla Python, since our systems have only the most basic modules. It is a huge pain for someone to download another module, since they have to create their own virtual environment and import the module into it. This tool should be as system independent as possible.
Initial Attempt: I was thinking of using a dictionary to organize the data in levels. I do, however, like the idea of using dot structures (like what one would see using MATLAB structures). I wrote a function that will read all the lines of the file and remove the newline characters from each line. From there I want to check for every GROUP = I can find. I would start adding data to that group until I hit an END_GROUP line. Using regular expressions I should be able to parse out the line to determine whether it is a date, number, string, etc.
I am asking this question because I hope to have some insight on things I may be missing. I'd like for this tool to be used long after I've left the dev team which is why I'm trying to do my best to make it as intuitive and easy to use as possible. Thank you all for your help, I really appreciate it! Let me know if you need any more information to help you help me.
EDIT: To clarify what help I need, here are my two main questions I am hoping to answer:
How should I build a data structure to hold grouped data?
Is there an accepted algorithm for parsing data like this?
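To make the "Initial Attempt" concrete, here is a minimal sketch of the stack-based approach in plain (vanilla) Python. parse_lines and convert are hypothetical names, and the type rules follow the description above (quoted strings, E-notation numbers, parenthesized lists):

```python
def convert(value):
    """Turn the raw text after '=' into a Python value."""
    value = value.strip()
    if value.startswith('"') and value.endswith('"'):
        return value[1:-1]                                   # quoted string or date
    if value.startswith('(') and value.endswith(')'):
        return [convert(v) for v in value[1:-1].split(',')]  # list
    try:
        return int(value)
    except ValueError:
        pass
    try:
        return float(value)                  # handles E notation, e.g. 5.023E-6
    except ValueError:
        return value                         # fall back to the raw text

def parse_lines(lines):
    """Build nested dicts from GROUP/END_GROUP delimited lines."""
    root = {}
    stack = [root]                           # innermost open group is stack[-1]
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith('//'):
            continue                         # skip blanks and comments
        if line.startswith('GROUP ='):
            name = line.split('=', 1)[1].strip()
            group = {}
            stack[-1][name] = group
            stack.append(group)              # descend into the new (sub)group
        elif line.startswith('END_GROUP'):
            stack.pop()                      # climb back out
        elif '=' in line:
            name, value = line.split('=', 1)
            stack[-1][name.strip()] = convert(value)
    return root

sample = ['GROUP = myGroup', 'name1 = "hello"', 'name2 = 5.023E-6',
          'name3 = (1,2,3)', 'GROUP = mySubGroup', 'name1 = 42',
          'END_GROUP = mySubGroup', 'END_GROUP = myGroup']
data = parse_lines(sample)
print(data)
```

The stack is what makes arbitrary nesting of subgroups work without recursion. For dot access (dataStructure.groupName.varName) without third-party modules, the nested dicts could be wrapped in types.SimpleNamespace from the standard library.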
I want to add a function to a program that creates dictionaries of DNA sequences. The function receives a contig (incon = initial contig, a DNA sequence) and extends it to the right by finding overlapping parts stored as keys in dictionaries and concatenating the values with the "+" operator.
I'll give a quick example:
GATTTGAAGC as initial contig
ATTTGAAGC:A is one of many entries in the dictionary
I want the function to search for such an overlapping part (I asked about this here yesterday, and it worked fine by itself with specific values, but not inside the function with variables) that is a key in the dictionary, concatenate the value of that key to the initial sequence (extending the contig to the right), and save the new sequence into incon. Then it should delete that dictionary entry and repeat until there are no entries left (this part I haven't even tried yet).
First I want the function to search for keys of length 9 with values of length 1 (ATTTGAAGC:A), and if there are no overlapping parts, for keys of length 8 with values of length 2 (e.g. ATTTGAAG:TG), and so on.
Additional Info:
The dictionary "suffixDicts" has entries with values of length 1 (key of length 14) up to length 10 (key of length 5).
"reads" is where the list of sequences is stored.
When I try to do the steps one after another, some work (like the search) and some don't, but when I tried to build a function out of it, literally nothing happens. The function is supposed to return the smallest possible extension.
def extendContig(incon, reads, suffixDicts):
    # Note: the original `incon = reads[0]` discarded the incon argument,
    # and suffixDicts['key'] looked up the literal string "key".
    # Try every proper suffix of the contig, longest first, so the
    # smallest possible extension is found.
    for x in range(1, len(incon)):
        suffix = incon[x:]
        if suffix in suffixDicts:
            incon = incon + suffixDicts.pop(suffix)  # extend and delete the entry
            print(incon)
            return incon
    return incon  # unchanged: no overlapping key found
I'm very new to Python, and there are probably very dire mistakes I've made that I would like pointed out. I know I'm in over my head with this, but I now understand most parts of the existing code. I still have problems implementing something into it myself, probably due to incorrect syntax. I know there are programs I could use, but I would like to understand the whole thing behind it.
edit: As asked, I will add the already given functions. Some of them were already written; some parts I wrote based on the given code (basically I copied it with some tweaks). Warning: it is quite a lot:
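For the repeat-until-there-are-no-entries-left part that hasn't been tried yet, a driver along these lines might work. assemble is a made-up name, and the dictionary is consumed with pop as entries are used:

```python
def assemble(incon, suffixDicts):
    # Keep extending to the right until no key overlaps a suffix of the contig.
    extended = True
    while extended and suffixDicts:
        extended = False
        for x in range(1, len(incon)):
            suffix = incon[x:]          # longer suffixes first: smallest extension
            if suffix in suffixDicts:
                incon += suffixDicts.pop(suffix)  # concatenate, delete the entry
                extended = True
                break                   # restart the scan with the extended contig
        # if no suffix matched, extended stays False and the loop ends
    return incon

print(assemble('GATTTGAAGC', {'ATTTGAAGC': 'A'}))  # GATTTGAAGCA
```

Because the scan restarts from the longest suffix after every extension, each step adds the shortest extension available at that moment, matching the priority described in the question.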
Reading the Fasta file:
Additional Info:
The Fasta file contains large amounts of sequences in the form:
> read 1
TTATGAATATTACGCAATGGACGTCCAAGGTACAGCGTATTTGTACGCTA
> read 2
AACTGCTATCTTTCTTGTCCACTCGAAAATCCATAACGTAGCCCATAACG
> read 3
TCAGTTATCCTATATACTGGATCCCGACTTTAATCGGCGTCGGAATTACT
I uploaded the file here: http://s000.tinyupload.com/?file_id=52090273537190816031
edit: edited the large blocks of code out; they don't seem to be necessary.
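For reference, reading a file in that format into the "reads" list could be as simple as the sketch below, assuming each header line starts with '>' and each sequence sits on a single line, as in the excerpt (read_fasta is a made-up name):

```python
def read_fasta(lines):
    reads = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('>'):
            continue  # skip header lines and blanks
        reads.append(line)
    return reads

print(read_fasta(['> read 1', 'TTATG', '> read 2', 'AACTG']))  # ['TTATG', 'AACTG']
```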
I have complex, variable text files that I want to read into Python, but I'm not sure what the best strategy would be. I'm not looking for you to code anything for me, just some tips about what modules would best suit my needs/tips etc.
The files look something like:
Program
Username: X Laser: X Em: X
exp 1
sample 1
Time: X Notes: X
Read 1 X data
Read 2 X data
# unknown number of reads
sample 2
Time: X Notes: X
Read 1 X data
...
# Unknown number of samples
exp 2
sample 1
...
# Unknown number of experiments, samples and reads
# The 4 spaces between certain words represent tabs
To analyse this data I need to get the data for each reading and know which sample and experiment it came from. Also, I can change the output file format but I think the way I have written it here is the easiest to read.
To read this file in to Python the best way I can think of would be to read it in row by row and search for key words with regular expressions. For example, search the row for the "exp" keyword and then record the number after it, then search for sample in the next line and so on. However, of course this would not work if a keyword was used in the 'notes' section.
So, I'm kind of stumped as to what would best suit my needs (it's hard to use something if you don't know it exists!)
Thanks for your time.
It's a typical task for a syntactic analyzer. In this case, since
- lexical constructs do not cross line boundaries and there is a single construct ("statement") per line (in other words, each line is a single statement),
- the full syntax of a single line can be covered by a set of regexes, and
- the structure of compounds (entities connecting multiple "statements" into something bigger) is simple and straightforward,
a (relatively) simple scannerless parser based on lines, a DFA, and the aforementioned set of regexes can be applied:
- set up the initial parser state (the current position relative to the various entities being tracked) and the parse tree (a data structure representing the information from the file in a convenient way)
- for each line:
  - classify it, e.g. by matching it against the regexes applicable to the current state
  - use the matched regex's groups to get the meaningful parts of the line's statement
  - using these parts, update the state and the parse tree
See get the path in a file inside {} by python for an example. There, I do not construct a parse tree (wasn't needed) but only track the current state.
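Applied to the experiment/sample/read file sketched earlier in this thread, that scheme might look like the following. The regexes and the tab-based indentation are assumptions based on the example layout; anchoring classification to the indentation and leading keyword is also what keeps a stray "exp" inside a Notes field from being misread as a new experiment:

```python
import re

# Hypothetical regexes for the experiment/sample/read layout;
# real tabs replace the "4 spaces" of the example.
EXP_RE    = re.compile(r'^exp (\d+)')
SAMPLE_RE = re.compile(r'^\tsample (\d+)')
READ_RE   = re.compile(r'^\t\tRead (\d+)\t(.*)')

def parse(lines):
    tree = {}              # exp number -> sample number -> [(read number, data)]
    exp = sample = None    # the current state: which exp/sample we are inside
    for line in lines:
        if m := EXP_RE.match(line):
            exp = tree.setdefault(int(m.group(1)), {})
        elif m := SAMPLE_RE.match(line):
            sample = exp.setdefault(int(m.group(1)), [])
        elif m := READ_RE.match(line):
            sample.append((int(m.group(1)), m.group(2)))
        # anything else (program header, Time/Notes lines) is ignored here
    return tree

lines = ['exp 1', '\tsample 1', '\t\tTime: 12:00 Notes: ok',
         '\t\tRead 1\tdata1', '\t\tRead 2\tdata2',
         '\tsample 2', '\t\tRead 1\tdata3',
         'exp 2', '\tsample 1', '\t\tRead 1\tdata4']
tree = parse(lines)
print(tree)
```

Here the "parse tree" is just nested dicts and lists, and the state is the pair (exp, sample); the Time/Notes lines could get their own regex and be attached to the current sample in the same way.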