Having trouble applying Re module on a Series - python

I have trouble applying the following to my series.
Data['Notes']
0 2018-06-07 09:38:14Z -- legal -- As per ...
1 2018-06-05 12:48:26Z -- name -- Holdin...
2 2018-06-05 17:15:48Z -- filing -- Answe...
3 2018-06-11 08:34:53Z -- name -- lvm i...
4 2018-05-11 08:31:26Z -- filed -- summo...
5 2018-06-01 16:07:11Z -- Name Rogers -- sent ...
import re
keywords = {'file', 'filing', 'legal'}
max_words_after = 5
key_re = re.compile(fr"""
(?:{'|'.join([w for w in keywords])}) #keyword options group
\s((?:[\s]?[A-Za-z\']+[\s]?) #capture word. include with line-breaks
{{1,{max_words_after}}}) #1 to max_words_after
""", re.VERBOSE|re.IGNORECASE
)
for f in data['Notes']:
    data['Result'] = key_re.findall(f)
In response, all I get is
"ValueError: Length of values does not match the length of index."
Please tell me how I can get a result for every index position and append it to a new series within the data frame.

Understanding your error
key_re.findall(f) returns a list of varying size (I think 0 or 1 keywords will be found, but depending on your regex it could be more).
You are broadcasting this list to all the rows in your dataframe, which of course doesn't have the same number of items. Hence "Length of values does not match the length of index."
I don't think that's what you want to do anyway. I think you want to create a new column based on another column. See this question for details, but here it is applied to your situation.
Fixing your code
import re
import pandas as pd
Here's what I was looking for regarding your Data variable. Something I can copy and paste and run:
Data = pd.DataFrame([["2018-06-07 09:38:14Z -- legal -- As per ..."],["2018-06-05 12:48:26Z -- name -- Holdin..."]], columns=["Notes"])
Create a function that does the transformation that you want.
def find_key_words(row):
    keywords = {'file', 'filing', 'legal'}
    max_words_after = 5
I'm only including the first line of your re expression because when I tested it, I always got no results when I had your complete expression in there. You can modify this as you need.
    key_re = re.compile(fr"""
        (?:{'|'.join([w for w in keywords])}) #keyword options group
        """, re.VERBOSE|re.IGNORECASE
    )
    return key_re.findall(row['Notes'])
Now apply that function to each row. That way, you'll be broadcasting something that matches the length of what Data['Result'] would expect.
Data['Result'] = Data.apply(lambda row: find_key_words(row),axis=1)
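A vectorized variant (a sketch along the same lines, not part of the original fix): build the keyword pattern once and let Series.str.findall apply re.findall to every row, rather than recompiling the regex inside the function for each row.
keywords = {'file', 'filing', 'legal'}
pattern = f"(?:{'|'.join(keywords)})"  # same keyword-options group used in the function above
Data['Result'] = Data['Notes'].str.findall(pattern, flags=re.IGNORECASE)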


Biopython gives ValueError: Sequences must all be the same length even though sequences are of the same length

I'm trying to create a phylogenetic tree by making a .phy file from my data.
I have a dataframe
ndf=
ESV trunc
1 esv1 TACGTAGGTG...
2 esv2 TACGGAGGGT...
3 esv3 TACGGGGGG...
7 esv7 TACGTAGGGT...
I checked the length of the elements of the column "trunc":
length_checker = np.vectorize(len)
arr_len = length_checker(ndf['trunc'])
The resulting arr_len gives the same length (=253) for all the elements.
I saved this dataframe as .phy file, which looks like this:
23 253
esv1 TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
esv3 TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG
This is similar to the file used in this tutorial.
However, when I run the command
aln = AlignIO.read('msa.phy', 'phylip')
I get "ValueError: Sequences must all be the same length"
I don't know why I'm getting this or how to fix it. Any help is greatly appreciated!
Thanks
Generally, phylip is the fiddliest format in phylogenetics across different programs. There is strict phylip format, relaxed phylip format, etc. It is not easy to know which separator is being used: a space character and/or a carriage return.
I think you have left a space between the name of the taxon (i.e. the sequence label) and the sequence data, viz.
2. esv2
Phylip format watches for the space between the label and the sequence data, so in this example the sequence would be read as only 3 bp long. Using a "." is generally not a great idea either, and the integer doesn't appear to denote a line number.
The other issue is that you could/should try keeping the sequence on the same line as the label, removing the carriage return, viz.
esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
Sometimes a carriage return does work (this could be relaxed phylip format), but the traditional format uses a space character " ". I always maintained a uniform number of spaces to preserve the alignment; I'm not sure if that is needed.
Note that if your taxon name exceeds 10 characters you will need relaxed phylip format, which is generally a good idea in any case.
The final solution, if all else fails, is to convert to fasta, import as fasta, and then convert to phylip. If all this fails, post back; there's more troubleshooting to try.
Fasta format removes the "23 253" header, and then each sequence looks like this,
>esv2
TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
There is always a line break between ">esv2" and the sequence. In addition, ">" always prefixes the label (taxon name) without any space. You can simply convert via regex, or "re" in Python. As a Perl one-liner it would be something like s/^([a-z]+[0-9]+)/>$1/g. I'm pretty sure there will be an online website that will do this too.
You then simply replace "phylip" with "fasta" in your import command. Once imported, you can ask Biopython to convert to whatever format you want and it should not have any problems.
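As a rough sketch of that fallback (assuming the sequences are still available in the ndf dataframe from the question, with 'ESV' holding the labels and 'trunc' the sequences), you could write the fasta file directly from Python and read it back with Biopython:
from Bio import AlignIO

# Write the aligned sequences out in fasta format: ">" + label, sequence on the next line
with open("msa.fasta", "w") as f:
    for label, seq in zip(ndf['ESV'], ndf['trunc']):
        f.write(">" + label + "\n" + seq + "\n")

# Read the alignment back, using "fasta" instead of "phylip"
aln = AlignIO.read("msa.fasta", "fasta")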
First, please read the answer to How to make good reproducible pandas examples. In the future, please provide a minimal reproducible example.
Secondly, Michael G is absolutely correct that phylip is a format that is very peculiar about its syntax.
The code below will allow you to generate a phylogenetic tree from your pandas dataframe.
First some imports and let's recreate your dataframe.
import pandas as pd
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
from Bio import AlignIO
data = {'ESV' : ['esv1', 'esv2', 'esv3'],
'trunc': ['TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG',
'TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG',
'TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG']
}
ndf = pd.DataFrame.from_dict(data)
print(ndf)
Output:
ESV trunc
0 esv1 TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCG...
1 esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTG...
2 esv3 TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCG...
Next, write the phylip file in the correct format.
with open("test.phy", 'w') as f:
f.write("{:10} {}\n".format(ndf.shape[0], ndf.trunc.str.len()[0]))
for row in ndf.iterrows():
f.write("{:10} {}\n".format(*row[1].to_list()))
Output of test.phy:
3 253
esv1 TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
esv3 TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG
Now we can start with the creation of our phylogenetic tree.
# Read the sequences and align
aln = AlignIO.read('test.phy', 'phylip')
print(aln)
Output:
SingleLetterAlphabet() alignment with 3 rows and 253 columns
TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCG...AGG esv1
TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGG...AGG esv2
TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGG...CAG esv3
Calculate the distance matrix:
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(aln)
print(dm)
Output:
esv1 0
esv2 0.3003952569169961 0
esv3 0.6086956521739131 0.6245059288537549 0
Construct the phylogenetic tree using the UPGMA algorithm and draw the tree in ASCII:
constructor = DistanceTreeConstructor()
tree = constructor.upgma(dm)
Phylo.draw_ascii(tree)
Output:
________________________________________________________________________ esv3
_|
| ___________________________________ esv2
|____________________________________|
|___________________________________ esv1
Or make a nice plot of the tree:
Phylo.draw(tree)
Output: (a matplotlib plot of the tree)

How to split strings into new dataframe rows depending on keywords

I want to split a row into a new row whenever an adverb is present. However, if multiple adverbs occur in a row, then I only want to split into a new row after the last adverb.
A sample of my dataframe looks like this:
0 but well that's alright
1 otherwise however we'll have to
2 okay sure
3 what?
With adverbs = ['but', 'well', 'otherwise', 'however'], I want the resulting df to look like this:
0 but well
1 that's alright
2 otherwise however
3 we'll have to
2 okay sure
3 what?
I have a partial solution, maybe it can help.
You could use the TextBlob package.
Using this API, you can assign each word a token. A list of possible tokens is available here.
The issue is that the word tagging isn't perfect, and your definition of an adverb might not match theirs (for instance, but is tagged as a coordinating conjunction by the API, and well is, for some reason, tagged as a verb). But it still works for the most part.
The splitting could be done this way:
from textblob import TextBlob
def adv_split(s):
    annotations = TextBlob(s).tags
    # Extract adverbs (CC for coordinating conjunction or RB for adverbs)
    adv_words = [word for word, tag in annotations
                 if tag.startswith('CC') or tag.startswith('RB')]
    # We have at least one adverb
    if len(adv_words) > 0:
        # Get the last one
        adv_pos = s.index(adv_words[-1]) + len(adv_words[-1])
        return [s[:adv_pos], s[adv_pos:]]
    else:
        return s
Then, you can use the pandas apply() and the new explode() method (pandas>0.25) to split your dataframe:
import pandas as pd
data = pd.Series(["but well that's alright",
                  "otherwise however we'll have to",
                  "okay sure",
                  "what?"])
data.apply(adv_split).explode()
You get:
0 but
0 well that's alright
1 otherwise however
1 we'll have to
2 okay sure
3 what?
It's not exactly right since well's tag is wrong, but you have the idea.
df = df[0].str.split().explode().to_frame()
df[1] = df[0].str.contains('|'.join(adverbs))
df = df.groupby([df.index, 1], sort=False).agg(' '.join).reset_index(drop=True)
print(df)
0
0 but well
1 that's alright
2 otherwise however
3 we'll have to
4 okay sure
5 what?

How to use pandas .replace() with list of regexs while honoring list order?

I have 2 dataframes: one (A) with some whitelist hostnames in regex form (i.e. (.*)microsoft.com, (.*)go.microsoft.com, ...) and another (B) with actual full hostnames of sites. I want to add a new column to this 2nd dataframe containing the regex text from the whitelist (1st) dataframe. However, it appears that pandas' .replace() method doesn't care about the order of the items in its to_replace and value args.
My data looks like this:
In [1] A
Out[1]:
wildcards \
42 (.*)activation.playready.microsoft.com
35 (.*)v10.vortex-win.data.microsoft.com
40 (.*)settings-win.data.microsoft.com
43 (.*)smartscreen.microsoft.com
39 (.*).playready.microsoft.com
38 (.*)go.microsoft.com
240 (.*)i.microsoft.com
238 (.*)microsoft.com
regex
42 re.compile('^(.*)activation.playready.microsof...
35 re.compile('^(.*)v10.vortex-win.data.microsoft...
40 re.compile('^(.*)settings-win.data.microsoft.c...
43 re.compile('^(.*)smartscreen.microsoft.com$')
39 re.compile('^(.*).playready.microsoft.com$')
38 re.compile('^(.*)go.microsoft.com$')
240 re.compile('^(.*)i.microsoft.com$')
238 re.compile('^(.*)microsoft.com$')
In [2] B.head()
Out[2]:
server_hostname
146 mobile.pipe.aria.microsoft.com
205 settings-win.data.microsoft.com
341 nav.smartscreen.microsoft.com
406 v10.vortex-win.data.microsoft.com
667 www.microsoft.com
Notice that A has a column of compiled regexes in similar form to the wildcards column. I want to add a wildcard column to B like this:
B.loc[:,'wildcards'] = B['server_hostname'].replace(A['regex'].tolist(), A['wildcards'].tolist())
But the problem is, all of B's wildcard values become (.*)microsoft.com. This happens no matter the order of A's wildcard values. It appears .replace() applies the to_replace regexes by shortest value first rather than in the order provided.
How can I provide a list of to_replace values so that I ultimately get the most detailed hostname wildcards value associated with B's server_hostname values?
Most answers use apply(), which is known to be slower than built-in vectorized solutions. My hope in using .replace() was that it would be fast since it is such a built-in vectorized function. #vlemaistre's answer was the only one not to use .apply(), as is the case with my solution here, which, instead of compiling each wildcard into a regex, treats it as a right-hand substring and uses the logic "if server_hostname ends with wildcard, then it's a match". So long as I sort my wildcards by length, it works just fine.
My function which does this is:
def match_to_whitelist(accepts_df, whitelist_df):
    """Adds a `wildcards` column to accepts_df showing which (if any) whitelist entry it matches with"""
    accepts_df.loc[:, 'wildcards'] = None
    for wildcard in whitelist_df['wildcards']:
        accepts_df.loc[(accepts_df['wildcards'].isnull()) &
                       (accepts_df['server_hostname'].str.endswith(wildcard)), 'wildcards'] = wildcard
    rows_matched = accepts_df['wildcards'].notnull().sum()
    print(f"matched {rows_matched}")
    return accepts_df
Here, accepts_df is like B from before, and whitelist_df is like A before, but with 2 differences:
no regex column
the wildcards values are no longer in glob/regex format (i.e. "(.*)microsoft.com" becomes "microsoft.com")
To benchmark the answers on my machine, I'll use mine as a baseline: it takes 27 seconds to process 100k accepts_df rows against 400 whitelist_df rows. Using the same dataset, here are the times for the other solutions (I was lazy: if they didn't run out of the gate, I didn't debug much to find out why):
#vlemaistre - List Comprehension with vector functions: 193sec
#user214 - SequenceMatcher: 234sec
#aws_apprentice - Compare RE search result lengths: 24sec
#fpersyn - First match (will be the best match if A is sorted): over 6 minutes, so I quit...
#Andy Hayden - lastgroup: didn't test because I can't (quickly) build a long RE programmatically.
#capelastegui - Series.str.match(): Error: "pandas.core.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects"
Ultimately, none of our answers say how to use .replace() as desired, so for the time being, I'll leave this unanswered for a few weeks in case someone can provide an answer to better use .replace(), or at least some other fast vector-based solution. Until then, I'll keep with what I have, or maybe use aws_apprentice's after I verify results.
EDIT
I improved my matcher by adding a "domain" column to both DFs, which consists of the last 2 parts of each wildcard/server_hostname (i.e. www.microsoft.com becomes "microsoft.com"). I then used groupby('domain') on both DFs, iterated through the whitelist groups of domains, fetched the same domain group from the server_hostname DF (B), and did my matching using just the subset of wildcards/server_hostnames from each group. This cut my matching time in half.
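A rough sketch of that domain-grouping idea (the add_domain helper and exact bookkeeping are illustrative, not necessarily the code actually used; it keeps the same "wildcard is a right-hand substring" logic as the baseline):
def add_domain(hostnames):
    # keep only the last two dot-separated parts, e.g. "www.microsoft.com" -> "microsoft.com"
    return hostnames.str.split('.').str[-2:].str.join('.')

whitelist_df['domain'] = add_domain(whitelist_df['wildcards'])
accepts_df['domain'] = add_domain(accepts_df['server_hostname'])

accepts_df['wildcards'] = None
for domain, wl_group in whitelist_df.groupby('domain'):
    in_domain = accepts_df['domain'] == domain
    # try the longest (most specific) wildcards first, as in the baseline solution
    for wildcard in sorted(wl_group['wildcards'], key=len, reverse=True):
        unmatched = in_domain & accepts_df['wildcards'].isnull()
        hits = unmatched & accepts_df['server_hostname'].str.endswith(wildcard)
        accepts_df.loc[hits, 'wildcards'] = wildcard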
Here is a way to do this using a double list comprehension and the re.sub() function:
import re
A = pd.DataFrame({'wildcards' : ['(.*)activation.playready.microsoft.com',
'(.*)v10.vortex-win.data.microsoft.com',
'(.*)i.microsoft.com', '(.*)microsoft.com'],
'regex' : [re.compile('^(.*)activation.playready.microsoft.com$'),
re.compile('^(.*)v10.vortex-win.data.microsoft.com$'),
re.compile('^(.*)i.microsoft.com$'),
re.compile('^(.*)microsoft.com$')]})
B = pd.DataFrame({'server_hostname' : ['v10.vortex-win.data.microsoft.com',
'www.microsoft.com']})
# For each server_hostname we try each regex and keep the longest matching one
B['wildcards'] = [max([re.sub(to_replace, value, x) for to_replace, value
in A[['regex', 'wildcards']].values
if re.sub(to_replace, value, x)!=x], key=len)
for x in B['server_hostname']]
Output :
server_hostname wildcards
0 v10.vortex-win.data.microsoft.com (.*)v10.vortex-win.data.microsoft.com
1 www.microsoft.com (.*)microsoft.com
One alternate way would be to use SequenceMatcher and re.match.
Taken data from the answer given by #vlemaistre
from difflib import SequenceMatcher
import pandas as pd
import re
A = pd.DataFrame({'wildcards' : ['(.*)activation.playready.microsoft.com',
'(.*)v10.vortex-win.data.microsoft.com',
'(.*)i.microsoft.com', '(.*)microsoft.com'],
'regex' : [re.compile('^(.*)activation.playready.microsoft.com$'),
re.compile('^(.*)v10.vortex-win.data.microsoft.com$'),
re.compile('^(.*)i.microsoft.com$'),
re.compile('^(.*)microsoft.com$')]})
B = pd.DataFrame({'server_hostname' : ['v10.vortex-win.data.microsoft.com',
'www.microsoft.com', 'www.i.microsoft.com']})
def regex_match(x):
    match = None
    ratio = 0
    for w, r in A[['wildcards', 'regex']].to_numpy():
        if re.match(r, x) is not None:
            pct = SequenceMatcher(None, w, x).ratio()
            if ratio < pct:
                ratio = pct
                match = w
    return match
B['wildcards'] = B.server_hostname.apply(regex_match)
# print(B.wildcards)
0 (.*)v10.vortex-win.data.microsoft.com
1 (.*)microsoft.com
2 (.*)i.microsoft.com
Name: server_hostname, dtype: object
Here is another approach using apply. There is no pure pandas way to do this as far as I know. I also borrowed the data that #vlemaistre provided.
A = pd.DataFrame({'wildcards' : ['(.*)activation.playready.microsoft.com',
'(.*)v10.vortex-win.data.microsoft.com',
'(.*)i.microsoft.com', '(.*)microsoft.com'],
'regex' : [re.compile('^(.*)activation.playready.microsoft.com$'),
re.compile('^(.*)v10.vortex-win.data.microsoft.com$'),
re.compile('^(.*)i.microsoft.com$'),
re.compile('^(.*)microsoft.com$')]})
B = pd.DataFrame({'server_hostname' : ['v10.vortex-win.data.microsoft.com',
'www.microsoft.com']})
pats = set(A.regex)
def max_match(hostname):
    d = {}
    for pat in pats:
        maybe_result = pat.search(hostname)
        if maybe_result:
            p = pat.pattern
            d[len(p)] = p
    return d.get(max([*d]))
B['wildcards'] = B['server_hostname'].apply(max_match)
server_hostname wildcards
0 v10.vortex-win.data.microsoft.com ^(.*)v10.vortex-win.data.microsoft.com$
1 www.microsoft.com ^(.*)microsoft.com$
An alternative tack, which unfortunately still needs an apply, is to use lastgroup. This entails compiling a single regex and then looking up the name of the matched group (row):
In [11]: regex = re.compile("|".join([f"(?P<i{i}>{regex})" for i, regex in s["wildcards"].items()]))
In [12]: regex
Out[12]:
re.compile(r'(?P<i42>(.*)activation.playready.microsoft.com)|(?P<i35>(.*)v10.vortex-win.data.microsoft.com)|(?P<i40>(.*)settings-win.data.microsoft.com)|(?P<i43>(.*)smartscreen.microsoft.com)|(?P<i39>(.*).playready.microsoft.com)|(?P<i38>(.*)go.microsoft.com)|(?P<i240>(.*)i.microsoft.com)|(?P<i238>(.*)microsoft.com)',
re.UNICODE)
In [13]: B.server_hostname.apply(lambda s: int(re.match(regex, s).lastgroup[1:]))
Out[13]:
146 238
205 40
341 43
406 35
667 238
Name: server_hostname, dtype: int64
In [14]: B.server_hostname.apply(lambda s: int(re.match(regex, s).lastgroup[1:])).map(s.wildcards)
Out[14]:
146 (.*)microsoft.com
205 (.*)settings-win.data.microsoft.com
341 (.*)smartscreen.microsoft.com
406 (.*)v10.vortex-win.data.microsoft.com
667 (.*)microsoft.com
Name: server_hostname, dtype: object
This attribute isn't exposed by pandas (but it might be possible to do something clever with the internals)...
The purest pandas approach that I could find involves running Series.str.match() on B.server_hostname for each regex, then taking the first match from each column with idxmax().
# Create input data
A = pd.DataFrame({'wildcards' : ['(.*)activation.playready.microsoft.com',
'(.*)v10.vortex-win.data.microsoft.com',
'(.*)i.microsoft.com', '(.*)microsoft.com'],
'regex' : [re.compile('^(.*)activation.playready.microsoft.com$'),
re.compile('^(.*)v10.vortex-win.data.microsoft.com$'),
re.compile('^(.*)i.microsoft.com$'),
re.compile('^(.*)microsoft.com$')]})
B = pd.DataFrame({'server_hostname' : ['v10.vortex-win.data.microsoft.com',
'www.microsoft.com']})
# Ensure B has a unique index
B = B.reset_index(drop=True)
# Check which regexes match each hostname
df_match = A.regex.apply(lambda x: B.server_hostname.str.match(x))
df_match.index= A.wildcards
df_match.columns=B.server_hostname
# Get first match for each hostname
df_first_match = df_match.idxmax().rename('wildcards').reset_index()
Output:
print(df_match)
print(df_first_match)
server_hostname v10.vortex-win.data.microsoft.com www.microsoft.com
wildcards
(.*)activation.playready.microsoft.com False False
(.*)v10.vortex-win.data.microsoft.com True False
(.*)i.microsoft.com False False
(.*)microsoft.com True True
server_hostname wildcards
0 v10.vortex-win.data.microsoft.com (.*)v10.vortex-win.data.microsoft.com
1 www.microsoft.com (.*)microsoft.com
That said, this seems to be a bit slower than other solutions posted earlier.
The pandas documentation describes the .replace() method as:
Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.
This implies that the method will iterate over all cells in the dataframe and replace what it can for each query provided in the to_replace argument. A quick example to demonstrate this:
df = pd.DataFrame({'A':['a','c'],'B':['b','d']})
df.replace(['a','b'],['b','c'])
Output:
A B
0 c c
1 c d
In your example, each regex rule overwrites previous replacements when there is a new match, which is how you end up with a vector of (.*)microsoft.com results.
You could use the .apply() method instead. For instance, with your whitelist (A) sorted descending by length, iterate over each row of your value DataFrame (B) and return the first match:
import pandas as pd
import re
# Using the definitions for A and B from your question,
# where A is sorted descending by length.
def first_match(x):
    for index, row in A.iterrows():
        if bool(re.search(row['wildcards'], x['server_hostname'])) is True:
            return row['wildcards']
B['wildcards'] = B.apply(first_match, axis=1)
B
Output:
server_hostname wildcards
0 mobile.pipe.aria.microsoft.com (.*)microsoft.com
1 settings-win.data.microsoft.com (.*)settings-win.data.microsoft.com
2 nav.smartscreen.microsoft.com (.*)smartscreen.microsoft.com
3 v10.vortex-win.data.microsoft.com (.*)v10.vortex-win.data.microsoft.com
4 www.microsoft.com (.*)microsoft.com
It might also be worth reading up on the split-apply-combine pattern for more advanced strategies. I hope that helps.

How can I find the start and end of a regex match using a python pandas dataframe?

I get DNA or protein sequences from databases. The sequences are aligned, so although I always know one input sequence, it is often truncated and includes gaps in the form of added "-" characters. I first want to find a region in the query string. In this case, a regex search makes perfect sense. I then want to extract the equivalent regions from the other aligned strings (I've named them here "markup" and "hit"). Since the sequences are aligned, the region I want in all strings will have the same start and stop. Is there a simple way to obtain the start and stop of a regex match in a pandas dataframe?
import pandas as pd
import re
q1,q2,q3 = 'MPIMGSSVYITVELAIAVLAILG','MPIMGSSVYITVELAIAVLAILG','MPI-MGSSVYITVELAIAVLAIL'
m1,m2,m3 = '|| || ||||||||||||||||','|| | ||| :|| || |:: |','||: ::|: :||||| |:: '
h1,h2,h3 = 'MPTMGFWVYITVELAIAVLAILG','MP-NSSLVYIGLELVIACLSVAG','MPLETQDALYVALELAIAALSVA'
#create a pandas dataframe to hold the aligned sequences
df = pd.DataFrame({'query':[q1,q2,q3],'markup':[m1,m2,m3],'hit':[h1,h2,h3]})
#create a regex search string to find the appropriate subset in the query sequence,
desired_region_from_query = 'PIMGSS'
regex_desired_region_from_query = '(P-*I-*M-*G-*S-*S-*)'
Pandas has a nice extract function to slice out the matched sequence from the query:
df['query'].str.extract(regex_desired_region_from_query)
However I need the start and end of the match in order to extract the equivalent regions from the markup and hit columns. For a single string, this is done as follows:
match = re.search(regex_desired_region_from_query, df.loc[2,'query'])
sliced_hit = df.loc[2,'hit'][match.start():match.end()]
sliced_hit
Out[3]:'PLETQDA'
My current workaround is as follows. (Edited to include nhahtdh's suggestion and therefore avoid searching twice.)
#define function to obtain regex output (start, stop, etc) as a tuple
def get_regex_output(x):
    m = re.search(regex_desired_region_from_query, x)
    return (m.start(), m.end())
#apply function
df['regex_output_tuple'] = df['query'].apply(get_regex_output)
#convert the tuple into two separate columns
columns_from_regex_output = ['start','end']
for n, col in enumerate(columns_from_regex_output):
    df[col] = df['regex_output_tuple'].apply(lambda x: x[n])
#delete the unnecessary column
df = df.drop('regex_output_tuple', axis=1)
Now I want to use the obtained start and end integers to slice the strings.
This code would be nice:
df.sliced = df.string[df.start:df.end]
But I don't think it currently exists. Instead I have once again used lambda functions:
#create slice functions
fn_slice_hit = lambda x : x['hit'][x['start']:x['end']]
fn_slice_markup = lambda x : x['markup'][x['start']:x['end']]
#apply the slice functions
df['sliced_markup'] = df.apply(fn_slice_markup, axis = 1)
df['sliced_hit'] = df.apply(fn_slice_hit, axis = 1)
print(df)
hit markup query start end sliced_markup sliced_hit
0 MPTMGFWVYITVELAIAVLAILG || || |||||||||||||||| MPIMGSSVYITVELAIAVLAILG 1 7 | || PTMGFW
1 MP-NSSLVYIGLELVIACLSVAG || | ||| :|| || |:: | MPIMGSSVYITVELAIAVLAILG 1 7 | | P-NSSL
2 MPLETQDALYVALELAIAALSVA ||: ::|: :||||| |:: MPI-MGSSVYITVELAIAVLAIL 1 8 |: : PLETQDA
Do pandas .match, .extract, .findall functions have the equivalent of a .start() or .end() attribute? Is there a way to slice more elegantly? Any help would be appreciated!
I don't think this exists in pandas, but would be a great addition. Go to https://github.com/pydata/pandas/issues and add a new Issue. Explain that it's an enhancement that you'd like to see.
For the .start() and .end() methods, those probably make more sense as kwargs to the extract() method. If str.extract(pat, start_index=True), then it would return a Series or DataFrame of start indexes rather than the value of the capture group. Same goes for end_index=True. Those would probably need to be mutually exclusive.
I also like your suggestion of
df.sliced = df.string[df.start:df.end]
Pandas already has a str.slice method
df.sliced = df.string.str.slice(1, -1)
But those have to be ints. Add a separate issue on Github to have the str.slice method take series objects and apply element-wise.
Sorry to not have a better solution than your lambda hack, but it's use-cases like these that help drive Pandas to be better.
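A slightly more compact form of the same apply-based workaround (a sketch only, assuming the df and regex_desired_region_from_query from the question, and that every query string contains a match) is to return both indices as a Series so they arrive as two columns in one pass:
def get_span(seq):
    m = re.search(regex_desired_region_from_query, seq)
    return pd.Series({'start': m.start(), 'end': m.end()})

spans = df['query'].apply(get_span)   # DataFrame with 'start' and 'end' columns
df = df.join(spans)
df['sliced_markup'] = df.apply(lambda row: row['markup'][row['start']:row['end']], axis=1)
df['sliced_hit'] = df.apply(lambda row: row['hit'][row['start']:row['end']], axis=1)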

Output with Python Glob // Cannot find where is error in Python code

I have the following code, which does NOT give an error but it also does not produce an output.
The script is made to do the following:
The script takes an input file of 4 tab-separated columns:
It then counts the unique values in Column 1 and the frequency of corresponding values in Column 4 (which contains 2 different tags: C and D).
The output is 3 tab-separated columns containing the unique values of Column 1 and the corresponding frequencies of the values in Column 4: Column 2 has the frequency of the string in Column 1 that corresponds with Tag C, and Column 3 has the frequency of the string in Column 1 that corresponds with Tag D.
Here is a sample of input:
algorithm-n like-1-resonator-n 8.1848 C
algorithm-n produce-hull-n 7.9104 C
algorithm-n like-1-resonator-n 8.1848 D
algorithm-n produce-hull-n 7.9104 D
anything-n about-1-Zulus-n 7.3731 C
anything-n above-shortage-n 6.0142 C
anything-n above-1-gig-n 5.8967 C
anything-n above-1-magnification-n 7.8973 C
anything-n after-1-memory-n 2.5866 C
and here is a sample of the desired output:
algorithm-n 2 2
anything-n 5 0
The code I am using is the following (which, as one will see, takes into consideration all suggestions from the comments):
from collections import defaultdict, Counter

def sortAndCount(opened_file):
    lemma_sense_freqs = defaultdict(Counter)
    for line in opened_file:
        lemma, _, _, senseCode = line.split()
        lemma_sense_freqs[lemma][senseCode] += 1
    return lemma_sense_freqs

def writeOutCsv(output_file, input_dict):
    with open(output_file, "wb") as outfile:
        for lemma in input_dict.keys():
            for senseCode in input_dict[lemma].keys():
                outstring = "\t".join([lemma, senseCode,
                                       str(input_dict[lemma][senseCode])])
                outfile.write(outstring + "\n")
import os
import glob

folderPath = "Python_Counter" # declare here

for input_file in glob.glob(os.path.join(folderPath, 'out_')):
    with open(input_file, "rb") as opened_file:
        lemma_sense_freqs = sortAndCount(input_file)
    output_file = "count_*.csv"
    writeOutCsv(output_file, lemma_sense_freqs)
My intuition is the problem is coming from the "glob" function.
But, as I said before: the code itself DOES NOT give me an error -- but it doesn't seem to produce an output either.
Can someone help?
I have referred to the documentation here and here, and I cannot seem to understand what I am doing wrong.
Can someone provide insight on how to solve the problem and output the results from glob, as I have a large number of files I need to process?
Regarding your original code, lemma_sense_freqs is not defined because it should be returned by the function sortAndCount(), and you never call that function.
For instance, you have a second function in your code, called writeOutCsv. You define it, and then you actually call it on the last line.
But you never call the function sortAndCount() (which is the one that should return the value of lemma_sense_freqs). Hence the error.
I don't know exactly what you want to achieve with that code, but you definitely need to write at a certain point (try before the last line) something like this:
lemma_sense_freqs = sortAndCount(input_file)
This is how you call the function you need; lemma_sense_freqs will then have a value associated with it and you shouldn't get the error.
I cannot be more specific because it is not clear exactly what you want to achieve with that code. However, you are just experiencing a basic issue at the moment (you defined a function but never used it to retrieve the value lemma_sense_freqs). Try adding the piece of code I suggest and play with it.
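For what it's worth, here is a hedged sketch of how the driver loop could look once the function is called and the glob pattern is widened (the 'out_*' pattern, the per-file output naming, and the C/D output layout are assumptions based on the description of the desired output, reusing sortAndCount() from the question):
import glob
import os

folderPath = "Python_Counter"

for input_file in glob.glob(os.path.join(folderPath, 'out_*')):   # note the added "*" wildcard
    with open(input_file, "r") as opened_file:
        lemma_sense_freqs = sortAndCount(opened_file)              # pass the file object, not the path
    # one output file per input file (hypothetical naming scheme)
    output_file = "count_" + os.path.basename(input_file) + ".csv"
    with open(output_file, "w") as outfile:
        for lemma, counts in lemma_sense_freqs.items():
            # one row per lemma: lemma <tab> frequency of C <tab> frequency of D
            outfile.write("\t".join([lemma, str(counts['C']), str(counts['D'])]) + "\n")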
