I've got two dataframes which are as follows:
df1: contains one column ['search_term'] and 100000 rows
These are words/phrases I want to search for in my files
df2: contains parsed file contents in a column called file_text
There are 20000 rows in this dataframe and two columns ['file_name', 'file_text']
What I need is the index of each appearance of a search term in the file_text.
I cannot figure out an efficient way to perform this search.
I am using the str.find() function along with groupby, but it's taking around 0.25s per file_text/search-term pair (which becomes really long with 20k files × 100k search terms).
Any ideas on ways to do this in a fast and efficient way would be lifesavers!
I remember having to do something similar in one of our projects. We had a very large set of keywords and we wanted to search for them in a large string and find all occurrences of those keywords. Let's call the string we want to search in content. After some benchmarking, the solution I adopted was a two-pass method: first check whether a keyword exists in the content using the highly optimized in operator, and then use regular expressions to find all occurrences of it.
import re

keywords = [...list of your keywords ...]

# First pass: keep only the keywords that actually occur in the content,
# using the fast `in` operator.
found_keywords = []
for keyword in keywords:
    if keyword in content:
        found_keywords.append(keyword)

# Second pass: locate every occurrence of the surviving keywords;
# re.escape ensures regex metacharacters in a keyword are treated literally.
for keyword in found_keywords:
    for match in re.finditer(re.escape(keyword), content):
        print(match.start())
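If it helps, here is a minimal, untested sketch of how the two-pass idea can be wired to your two dataframes, assuming the column names from your question (df1['search_term'], df2['file_name'], df2['file_text']):

import re
import pandas as pd

results = []  # one row per (file, term, position)
search_terms = df1['search_term'].tolist()

for file_name, content in zip(df2['file_name'], df2['file_text']):
    # first pass: cheap membership test with `in`
    present = [t for t in search_terms if t in content]
    # second pass: exact positions via regex
    for term in present:
        for match in re.finditer(re.escape(term), content):
            results.append((file_name, term, match.start()))

matches = pd.DataFrame(results, columns=['file_name', 'search_term', 'index'])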
I have already found a roundabout solution to the problem, but I am sure there is a simpler way.
I have two data frames with one column each; both DF1 and DF2 contain strings.
I am trying to match them using .str.contains in Python, but with my limited knowledge I am forced to manually enter the substrings I am looking for.
contain_values = df[df['month'].str.contains('Ju|Ma')]
This is how I am currently able to match substrings from DF2 within DF1.
The current scenario forces me to add 100 words by hand, joined with the vertical bar, as in str.contains('Ju|Ma').
Can anyone kindly share some wisdom on how to link the second data frame, whose single column contains the 100+ words?
Here is one way to do it. If you post an MRE, I would be able to test and share the result, but the below should work:
import re

# create a list of words to search for
w = ['ju', 'ma']

# create a single search pattern by joining the words with '|'
s = '|'.join(w)

# search using the regex, ignoring case (flags= must be passed by keyword)
df[df['month'].str.contains(s, flags=re.IGNORECASE)]
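To link the second data frame directly, you can build the pattern from its column instead of typing the words in by hand. A minimal sketch, assuming the second data frame is df2 with its words in a column named 'word' (adjust both names to your data):

import re

# collect the 100+ words from the second data frame and escape any regex metacharacters
pattern = '|'.join(re.escape(word) for word in df2['word'].dropna().astype(str))

# keep rows of the first data frame whose 'month' column contains any of those words
df[df['month'].str.contains(pattern, flags=re.IGNORECASE)]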
I'm working with a dataset which has gene names and gene ids. Basically, ids are uniquely defined, while one name can correspond to multiple ids.
I use a list to contain all ids of a gene name and the dataframe looks like:
| GeneName | GeneID |
| --- | --- |
| Name_1 | [ID_1, ID_2, ID_5] |
| Name_2 | [ID_3, ID_4] |
All names and ids are strings, but some ids are missing and I use NaN to represent missing ones (not sure if this is a good practice either).
After saving the dataframe to a CSV file and loading it back, all lists containing gene ids are regarded as strings. I found a solution using:
pd.read_csv(fpath, converters={'GeneName': pd.eval, 'GeneID': pd.eval})
to load them back as lists, but I encounter
pandas.core.computation.ops.UndefinedVariableError: name 'NaN' is not defined
What is the best solution to deal with situation like this?
Thanks.
From the problem you described in the comments, you can just use empty strings to indicate missing categories.
Then use pd.eval or ast.literal_eval:
import ast
ast.literal_eval('["ID_1", "ID_2", "", "", "ID_5"]')
>> ['ID_1', 'ID_2', '', '', 'ID_5']
Important note:
Use different quote characters for the list string and the element strings (e.g., single quotes around the whole string and double quotes inside), otherwise the string cannot be parsed.
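For completeness, a minimal round-trip sketch under that assumption (missing IDs stored as empty strings inside the lists; the file name genes.csv is just a placeholder):

import ast
import pandas as pd

df = pd.DataFrame({
    'GeneName': ['Name_1', 'Name_2'],
    'GeneID': [['ID_1', 'ID_2', 'ID_5'], ['ID_3', '']],  # '' instead of NaN for a missing ID
})
df.to_csv('genes.csv', index=False)

# the lists come back as their string representation, so parse them with ast.literal_eval
df_loaded = pd.read_csv('genes.csv', converters={'GeneID': ast.literal_eval})
print(df_loaded['GeneID'].iloc[0])  # ['ID_1', 'ID_2', 'ID_5']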
I have a Spark DataFrame (sdf) where each row shows an IP visiting a URL. I want to count distinct IP-URL pairs in this data frame and the most straightforward solution is sdf.groupBy("ip", "url").count(). However, since the data frame has billions of rows, precise counts can take quite a while. I'm not particularly familiar with PySpark -- I tried replacing .count() with .approx_count_distinct(), which was syntactically incorrect.
I searched "how to use .approx_count_distinct() with groupBy()" and found this answer. However, the solution suggested there (something along those lines: sdf.groupby(["ip", "url"]).agg(F.approx_count_distinct(sdf.url).alias("distinct_count"))) doesn't seem to give me the counts that I want. The method .approx_count_distinct() can't take two columns as arguments, so I can't write sdf.agg(F.approx_count_distinct(sdf.ip, sdf.url).alias("distinct_count")), either.
My question is, is there a way to get .approx_count_distinct() to work on multiple columns and count distinct combinations of these columns? If not, is there another function that can do just that and what's an example usage of it?
Thank you so much for your help in advance!
Group, aggregate with expressions, and alias as needed. Let's try:
from pyspark.sql.functions import expr

df.groupBy("ip", "url").agg(
    expr("approx_count_distinct(ip)").alias("ip_count"),
    expr("approx_count_distinct(url)").alias("url_count"),
).show()
Your code sdf.groupby(["ip", "url"]).agg(F.approx_count_distinct(sdf.url).alias("distinct_count")) will give a value of 1 for every group, since you are counting distinct values of one of the grouping columns, url.
If you want to count distinct IP-URL pairs using the approx_count_distinct function, you can combine the two columns into an array and then apply the function. It would be something like this:
sdf.selectExpr("approx_count_distinct(array(ip, url)) as distinct_count")
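For reference, the same idea expressed with the DataFrame API should look roughly like this (an untested sketch, assuming the usual functions import):

from pyspark.sql import functions as F

# approximate number of distinct (ip, url) pairs, combining the two columns into an array
sdf.agg(F.approx_count_distinct(F.array("ip", "url")).alias("distinct_count")).show()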
I have a dataframe of IDs and names (2 x 1.5e6). Separately, I have a long list of vulgar words, kept in a .txt file (needs to be stored in a central location and is constantly updated).
Essentially, I am trying to 'match' the dataframe of names to the vulgar word list. I want to create a new column in the dataframe ('vulgar_flag') and set it to 0 or 1 depending on whether any of the words from the vulgar list (.txt file) are a substring of the name in the dataframe.
Currently, my approach is to read in the vulgar .txt file and create a list of words called vulgar_scrub. I then have the following code to create the flag:
df['vulgar_flag'] = numpy.where(df.FULLNAME.str.contains('|'.join(vulgar_scrub)),1,0)
This seems clunky, and I'm wondering if there are any more efficient alternatives. This post (Pandas lookup, mapping one column in a dataframe to another in a different dataframe) mentions using df.merge, although I'm not sure that would support the substring checks I am looking for.
Mainly just curious to see if there are other solutions, or any dataframe functionality I'm unaware of. Thanks!
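For context, here is a minimal sketch of the approach described above with the .txt read step included (the file name vulgar_words.txt is a placeholder, and re.escape is added in case any word contains regex metacharacters):

import re
import numpy

# read the centrally stored vulgar word list, one word per line
with open('vulgar_words.txt') as f:
    vulgar_scrub = [line.strip() for line in f if line.strip()]

# flag names that contain any vulgar word as a substring
pattern = '|'.join(re.escape(word) for word in vulgar_scrub)
df['vulgar_flag'] = numpy.where(df.FULLNAME.str.contains(pattern, case=False), 1, 0)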
I am trying to use a Regex function to find keywords in a large text file and then pick a certain value in the text file corresponding to it. While my current script does this, I want to put a loop within the Regex function so that I can do it for multiple (>100) keywords. For example: In my text, B443 would be searched and a number written next to it would be picked. The text looks like this :
*
(BHC443) 2,462,000
1.a.(1)(a) (b) All other loans secured by real estate
(BHC442) 1,033,000
1.a.(1)(b)
*
The output would be BHC443:2,462,000, BHC442:1,033,000, etc. for all the keywords searched. Now, I have many more keywords in the text for which I need to pick the corresponding numbers, and I want to write a dynamic regex function that takes the keywords one by one and generates the outputs. I already have a fixed list of keywords sorted out (e.g., B443, B442, CA13323, SQDS73733, etc.). So the problem is searching for all of those in the text and then picking up the numbers, probably by importing the keywords as a list first and then running the regex function over the elements of that list. I don't know how to write a loop for that.
The regex code I wrote for finding the number corresponding to one keyword at a time is written below and it works.
import re

with open(path, 'r') as file:
    for line in file:
        # For each keyword, pick the corresponding amount
        key_value_name = re.search('(B443)([\\(\\)((\\s)+)|(\\n)?(\\n)])([1234567890,a-zA-Z.\\s]+)', line)
        if key_value_name:
            print(key_value_name.group(1))  # the keyword
            print(key_value_name.group(3))  # the amount next to it
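One way to run this over a keyword list is to build the pattern per keyword inside the loop. A rough sketch, reusing the same pattern and `path` from above, with the keyword examples taken from the question (re.escape guards against special characters in a keyword):

import re

# keywords mentioned in the question; they could also be read from a file
keywords = ['B443', 'B442', 'CA13323', 'SQDS73733']

with open(path, 'r') as file:
    for line in file:
        for keyword in keywords:
            # build the same pattern dynamically for each keyword
            pattern = '(' + re.escape(keyword) + ')([\\(\\)((\\s)+)|(\\n)?(\\n)])([1234567890,a-zA-Z.\\s]+)'
            match = re.search(pattern, line)
            if match:
                print(match.group(1) + ':' + match.group(3).strip())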