I want to split a row into a new row whenever an adverb is present. However, if multiple adverbs occur in a row, then I only want to split into a new row after the last adverb.
A sample of my dataframe looks like this:
0 but well that's alright
1 otherwise however we'll have to
2 okay sure
3 what?
With adverbs = ['but', 'well', 'otherwise', 'however'], I want the resulting df to look like this:
0 but well
1 that's alright
2 otherwise however
3 we'll have to
4 okay sure
5 what?
I have a partial solution; maybe it can help.
You could use the TextBlob package.
Using this API, you can assign each word a part-of-speech tag. A list of possible tags is available here.
The catch is that the tagging isn't perfect, and your definition of adverb might not match theirs (for instance, but is tagged as a coordinating conjunction, and well is, for some reason, tagged as a verb). But it still works for the most part:
The splitting could be done this way:
from textblob import TextBlob

def adv_split(s):
    annotations = TextBlob(s).tags
    # Extract adverbs (CC for coordinating conjunctions, RB for adverbs)
    adv_words = [word for word, tag in annotations
                 if tag.startswith('CC') or tag.startswith('RB')]
    # We have at least one adverb
    if len(adv_words) > 0:
        # Get the last one and split right after it
        adv_pos = s.index(adv_words[-1]) + len(adv_words[-1])
        return [s[:adv_pos], s[adv_pos:]]
    else:
        return s
Then, you can use pandas' apply() together with the new explode() method (pandas >= 0.25) to split your dataframe:
import pandas as pd

data = pd.Series(["but well that's alright",
                  "otherwise however we'll have to",
                  "okay sure",
                  "what?"])

data.apply(adv_split).explode()
You get:
0 but
0 well that's alright
1 otherwise however
1 we'll have to
2 okay sure
3 what?
It's not exactly right, since well's tag is wrong, but you get the idea.
Alternatively, a pure-pandas approach: split each row into words and explode, flag the words that match an adverb, then group by the original row index and that flag to join the adverb prefix and the remainder back together:
df = df[0].str.split().explode().to_frame()
df[1] = df[0].str.contains('|'.join(adverbs))
df = df.groupby([df.index, 1], sort=False).agg(' '.join).reset_index(drop=True)
print(df)
0
0 but well
1 that's alright
2 otherwise however
3 we'll have to
4 okay sure
5 what?
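One caveat worth noting here (my addition, not part of the answer above): str.contains('|'.join(adverbs)) does substring matching, so after the split into single words a token like "butter" would also be flagged because it contains "but". Since each exploded row holds exactly one word, an exact membership test is a safer drop-in replacement:
df[1] = df[0].isin(adverbs)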
I'm a Python beginner, so please forgive me if I'm not using the right lingo and if my code includes blatant errors.
I have text data (i.e., job descriptions from job postings) in one column of my data frame. I want to determine which job ads contain any of the following strings: bachelor, ba/bs, bs/ba.
The function I wrote doesn't work because it produces an empty column (i.e., all zeros). It works fine if I just search for one substring at a time. Here it is:
def requires_bachelor(text):
    if text.find('bachelor|ba/bs|bs/ba') > -1:
        return True
    else:
        return False

df_jobs['bachelor'] = df_jobs['description'].apply(requires_bachelor).map({True: 1, False: 0})
Thanks so much to anyone who is willing to help!
Here's my approach. You were pretty close, but you need to check for each of the items individually: if any of the "bachelor" strings is found, return True. Then, instead of using map({True: 1, False: 0}), you can use map(bool) to make it a bit nicer. Good luck!
import pandas as pd

df_jobs = pd.DataFrame({"name": ["bob", "sally"], "description": ["bachelor", "ms"]})

def requires_bachelor(text):
    return any(text.find(a) > -1 for a in ['bachelor', 'ba/bs', 'bs/ba'])  # find() returns -1 if not found

df_jobs['bachelor'] = df_jobs['description'].apply(requires_bachelor).map(bool)
The | in the search string does not work like an or operator; text.find() treats it as a literal character. You should split the check into three calls, like this:
if text.find('bachelor') > -1 or text.find('ba/bs') > -1 or text.find('bs/ba') > -1:
You could try doing:
bachelors = ["bachelor", "ba/bs", "bs/ba"]
if any(bachelor in text for bachelor in bachelors):
    return True
Instead of writing a custom function that requires .apply (which will be quite slow), you can use str.contains for this. Also, you don't need map to turn booleans into 1 and 0; try using astype(int) instead.
df_jobs = pd.DataFrame({'description': ['job ba/bs', 'job bachelor',
                                        'job bs/ba', 'job ba']})

df_jobs['bachelor'] = df_jobs.description.str.contains(
    'bachelor|ba/bs|bs/ba', regex=True).astype(int)
print(df_jobs)
description bachelor
0 job ba/bs 1
1 job bachelor 1
2 job bs/ba 1
3 job ba 0
# note that the pattern does not match on simply "ba"!
So, you are checking for the literal string bachelor|ba/bs|bs/ba in the text, which I don't believe will ever exist...
What I suggest you do is check for all the possible substrings in the if, joined with or, as follows:
def requires_bachelor(text):
    if text.find('bachelor') > -1 or text.find('ba/bs') > -1 or text.find('bs/ba') > -1:
        return True
    else:
        return False

df_jobs['bachelor'] = df_jobs['description'].apply(requires_bachelor).map({True: 1, False: 0})
It can all be done in one line in pandas:
df_jobs['bachelor'] = df_jobs['description'].str.contains(r'bachelor|bs|ba')
(Be careful, though: this looser pattern also matches plain ba or bs as substrings, e.g. in "job ba", and it returns booleans rather than 1/0.)
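If you want to avoid such accidental substring hits, one option (my addition, a sketch rather than part of the answers above) is to escape the original three search terms and require word boundaries around them:
import re
import pandas as pd

df_jobs = pd.DataFrame({'description': ['job ba/bs', 'jobs listing', 'job bachelor']})

# build \b-delimited alternatives; re.escape keeps characters like '/' literal
terms = ['bachelor', 'ba/bs', 'bs/ba']
pattern = '|'.join(r'\b{}\b'.format(re.escape(t)) for t in terms)

df_jobs['bachelor'] = df_jobs['description'].str.contains(pattern).astype(int)
print(df_jobs)  # rows 0 and 2 are flagged; 'jobs listing' is not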
I'm trying to create a phylogenetic tree by making a .phy file from my data.
I have a dataframe
ndf=
ESV trunc
1 esv1 TACGTAGGTG...
2 esv2 TACGGAGGGT...
3 esv3 TACGGGGGG...
7 esv7 TACGTAGGGT...
I checked the length of the elements of the column "trunc":
import numpy as np

length_checker = np.vectorize(len)
arr_len = length_checker(ndf['trunc'])
The resulting arr_len gives the same length (=253) for all the elements.
I saved this dataframe as a .phy file, which looks like this:
23 253
esv1 TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
esv3 TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG
This is similar to the file used in this tutorial.
However, when I run the command
aln = AlignIO.read('msa.phy', 'phylip')
I get "ValueError: Sequences must all be the same length"
I don't know why I'm getting this or how to fix it. Any help is greatly appreciated!
Thanks
Generally, phylip is the fiddliest format in phylogenetics to move between different programs. There is strict phylip format, relaxed phylip format, etc., and it is not easy to know which separator is being used: a space character and/or a carriage return.
I think you appear to have left the dataframe's row number, plus a space, in front of the taxon name (i.e. the sequence label), viz.
2. esv2
Phylip format watches for the space between the label and the sequence data, so in this example the label would be read as "2." and the sequence as "esv2", just a few bp long. The use of a "." is generally not a great idea either, and the integer doesn't appear to denote a line number phylip would understand.
The other issue is the line break: you could/should try keeping the sequence on the same line as the label, removing the carriage return, viz.
esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
Sometimes a carriage return does work (that could be relaxed phylip format), but the traditional format uses a space character " ". I always maintained a uniform number of spaces to preserve the alignment, though I'm not sure that is needed.
Note that if your taxon name exceeds 10 characters you will need relaxed phylip format, and that format is generally a good idea in any case.
The final solution, if all else fails, is to convert to fasta, import it as fasta, and then convert to phylip. If all this fails, post back; there's more trouble-shooting we can do.
Fasta format removes the "23 253" header, and then each sequence looks like this,
>esv2
TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
There is always a carriage return between ">esv2" and the sequence. In addition, ">" is always present to prefix the label (taxon name), without any space. You can simply convert via regex, using "re" in Python; as a perl one-liner it would be code of the type s/^([a-z]+[0-9]+)/>$1/g. I'm pretty sure there'll be an online website that will do this, too.
You then simply replace "phylip" with "fasta" in your import command. Once imported, you can ask Biopython to convert to whatever format you want, and it should not have any problem.
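For what it's worth (my addition, a sketch with assumed file names, not part of the answer): once you have valid fasta, Biopython can do that conversion in a single call:
from Bio import AlignIO

# read the fasta alignment and write it back out as relaxed phylip
AlignIO.convert("msa.fasta", "fasta", "msa_relaxed.phy", "phylip-relaxed")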
First, please read the answer to How to make good reproducible pandas examples; in the future, please provide a minimal reproducible example.
Secondly, Michael G is absolutely correct that phylip is a format that is very particular about its syntax.
The code below will allow you to generate a phylogenetic tree from your Pandas dataframe.
First some imports and let's recreate your dataframe.
import pandas as pd
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
from Bio import AlignIO
data = {'ESV': ['esv1', 'esv2', 'esv3'],
        'trunc': ['TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG',
                  'TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG',
                  'TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG']
        }

ndf = pd.DataFrame.from_dict(data)
print(ndf)
Output:
ESV trunc
0 esv1 TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCG...
1 esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTG...
2 esv3 TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCG...
Next, write the phylip file in the correct format.
with open("test.phy", 'w') as f:
    f.write("{:10} {}\n".format(ndf.shape[0], ndf.trunc.str.len()[0]))
    for row in ndf.iterrows():
        f.write("{:10} {}\n".format(*row[1].to_list()))
Output of test.phy:
3 253
esv1 TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
esv3 TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG
Now we can start with the creation of our phylogenetic tree.
# Read the sequences and align
aln = AlignIO.read('test.phy', 'phylip')
print(aln)
Output:
SingleLetterAlphabet() alignment with 3 rows and 253 columns
TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCG...AGG esv1
TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGG...AGG esv2
TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGG...CAG esv3
Calculate the distance matrix:
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(aln)
print(dm)
Output:
esv1 0
esv2 0.3003952569169961 0
esv3 0.6086956521739131 0.6245059288537549 0
Construct the phylogenetic tree using the UPGMA algorithm and draw the tree in ASCII:
constructor = DistanceTreeConstructor()
tree = constructor.upgma(dm)
Phylo.draw_ascii(tree)
Output:
  ________________________________________________________________________ esv3
_|
 |                                     ___________________________________ esv2
 |____________________________________|
                                       |___________________________________ esv1
Or make a nice plot of the tree:
Phylo.draw(tree)
Output: (a matplotlib figure of the tree, omitted here)
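If you also want to keep the tree around for other tools (my addition, not part of the original answer; the file name is made up), Bio.Phylo can write it out, e.g. in newick format:
from Bio import Phylo

# assumes the `tree` object from the constructor step above
Phylo.write(tree, "my_tree.nwk", "newick")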
I have trouble applying the following to my series.
Data['Notes']
0 2018-06-07 09:38:14Z -- legal -- As per ...
1 2018-06-05 12:48:26Z -- name -- Holdin...
2 2018-06-05 17:15:48Z -- filing -- Answe...
3 2018-06-11 08:34:53Z -- name -- lvm i...
4 2018-05-11 08:31:26Z -- filed -- summo...
5 2018-06-01 16:07:11Z -- Name Rogers -- sent ...
import re

keywords = {'file', 'filing', 'legal'}
max_words_after = 5

key_re = re.compile(fr"""
    (?:{'|'.join([w for w in keywords])})  # keyword options group
    \s((?:[\s]?[A-Za-z\']+[\s]?)           # capture word, including line breaks
    {{1,{max_words_after}}})               # 1 to max_words_after of them
    """, re.VERBOSE | re.IGNORECASE
)

for f in Data['Notes']:
    Data['Result'] = key_re.findall(f)
In response, all I get is:
"ValueError: Length of values does not match the length of index."
Please tell me how I can get a result for every index position and append it to a new column in the dataframe.
Understanding your error
key_re.findall(f) returns a list of varying size (I think 0 or 1 keywords will be found, but depending on your re expression it could be more).
You are broadcasting this list to all the rows in your dataframe, which of course doesn't have the same number of items. Hence "Length of values does not match the length of index." (Note, too, that the loop assigns the whole Result column on every iteration, so at best you'd only keep the matches from the last row.)
I don't think that's what you want to do anyway. I think you want to create a new column based on another column. See this question for details, but here it is applied to your situation.
Fixing your code
import re
import pandas as pd
Here's what I was looking for regarding your Data variable. Something I can copy and paste and run:
Data = pd.DataFrame([["2018-06-07 09:38:14Z -- legal -- As per ..."],["2018-06-05 12:48:26Z -- name -- Holdin..."]], columns=["Notes"])
Create a function that does the transformation that you want.
def find_key_words(row):
    keywords = {'file', 'filing', 'legal'}
    max_words_after = 5
I'm only including the first line of your re expression, because when I tested it with your complete expression in there, I always got no results. You can modify this as you need; the rest of the function continues below:
    key_re = re.compile(fr"""
        (?:{'|'.join([w for w in keywords])})  # keyword options group
        """, re.VERBOSE | re.IGNORECASE
    )
    return key_re.findall(row['Notes'])
Now apply that function to each row. That way, you'll be broadcasting something that matches the length of what Data['Result'] would expect.
Data['Result'] = Data.apply(lambda row: find_key_words(row),axis=1)
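As a side note (my addition, not part of the answer): pandas also has a vectorized str.findall, which sidesteps the per-row apply entirely. A minimal sketch with the same keywords:
import re

keywords = {'file', 'filing', 'legal'}
Data['Result'] = Data['Notes'].str.findall('|'.join(keywords), flags=re.IGNORECASE)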
I get DNA or protein sequences from databases. The sequences are aligned, so although I always know one input sequence, it is often truncated and includes gaps in the form of added "-" characters. I first want to find a region in the query string. In this case, a regex search makes perfect sense. I then want to extract the equivalent regions from the other aligned strings (I've named them here "markup" and "hit"). Since the sequences are aligned, the region I want in all strings will have the same start and stop. Is there a simple way to obtain the start and stop of a regex match in a pandas dataframe?
import pandas as pd
import re
q1,q2,q3 = 'MPIMGSSVYITVELAIAVLAILG','MPIMGSSVYITVELAIAVLAILG','MPI-MGSSVYITVELAIAVLAIL'
m1,m2,m3 = '|| || ||||||||||||||||','|| | ||| :|| || |:: |','||: ::|: :||||| |:: '
h1,h2,h3 = 'MPTMGFWVYITVELAIAVLAILG','MP-NSSLVYIGLELVIACLSVAG','MPLETQDALYVALELAIAALSVA'
#create a pandas dataframe to hold the aligned sequences
df = pd.DataFrame({'query':[q1,q2,q3],'markup':[m1,m2,m3],'hit':[h1,h2,h3]})
#create a regex search string to find the appropriate subset in the query sequence
desired_region_from_query = 'PIMGSS'
regex_desired_region_from_query = '(P-*I-*M-*G-*S-*S-*)'
Pandas has a nice extract function to slice out the matched sequence from the query:
df['query'].str.extract(regex_desired_region_from_query)
However I need the start and end of the match in order to extract the equivalent regions from the markup and hit columns. For a single string, this is done as follows:
match = re.search(regex_desired_region_from_query, df.loc[2,'query'])
sliced_hit = df.loc[2,'hit'][match.start():match.end()]
sliced_hit
Out[3]:'PLETQDA'
My current workaround is as follows. (Edited to include nhahtdh's suggestion and therefore avoid searching twice.)
#define function to obtain regex output (start, end) as a tuple
def get_regex_output(x):
    m = re.search(regex_desired_region_from_query, x)
    return (m.start(), m.end())

#apply function
df['regex_output_tuple'] = df['query'].apply(get_regex_output)

#convert the tuple into two separate columns
columns_from_regex_output = ['start', 'end']
for n, col in enumerate(columns_from_regex_output):
    df[col] = df['regex_output_tuple'].apply(lambda x: x[n])

#delete the unnecessary column
df = df.drop('regex_output_tuple', axis=1)
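(An aside of mine, not from the original post: the tuple column can also be expanded in a single step, skipping the helper column and the loop.)
df[['start', 'end']] = pd.DataFrame(df['query'].apply(get_regex_output).tolist(), index=df.index)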
Now I want to use the obtained start and end integers to slice the strings.
This code would be nice:
df.sliced = df.string[df.start:df.end]
But I don't think it currently exists. Instead I have once again used lambda functions:
#create slice functions
fn_slice_hit = lambda x: x['hit'][x['start']:x['end']]
fn_slice_markup = lambda x: x['markup'][x['start']:x['end']]

#apply the slice functions
df['sliced_markup'] = df.apply(fn_slice_markup, axis=1)
df['sliced_hit'] = df.apply(fn_slice_hit, axis=1)
print(df)
hit markup query start end sliced_markup sliced_hit
0 MPTMGFWVYITVELAIAVLAILG || || |||||||||||||||| MPIMGSSVYITVELAIAVLAILG 1 7 | || PTMGFW
1 MP-NSSLVYIGLELVIACLSVAG || | ||| :|| || |:: | MPIMGSSVYITVELAIAVLAILG 1 7 | | P-NSSL
2 MPLETQDALYVALELAIAALSVA ||: ::|: :||||| |:: MPI-MGSSVYITVELAIAVLAIL 1 8 |: : PLETQDA
Do pandas .match, .extract, .findall functions have the equivalent of a .start() or .end() attribute? Is there a way to slice more elegantly? Any help would be appreciated!
I don't think this exists in pandas, but it would be a great addition. Go to https://github.com/pydata/pandas/issues and add a new issue, explaining that it's an enhancement you'd like to see.
For the .start() and .end() methods, those would probably make more sense as kwargs to the extract() method: str.extract(pat, start_index=True) would then return a Series or DataFrame of start indexes rather than the value of the capture group, and the same would go for end_index=True. Those would probably need to be mutually exclusive.
I also like your suggestion of
df.sliced = df.string[df.start:df.end]
Pandas already has a str.slice method:
df.sliced = df.string.str.slice(1, -1)
But those arguments have to be ints. Add a separate issue on GitHub to have the str.slice method take Series objects and apply them element-wise.
Sorry I don't have a better solution than your lambda hack, but it's use-cases like these that help drive Pandas to be better.
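For what it's worth (my addition, not from the answers above): if you'd rather avoid df.apply in the meantime, a plain comprehension over zipped columns does the element-wise slicing compactly:
# slice each string by its own row's start/end
df['sliced_hit'] = [h[s:e] for h, s, e in zip(df['hit'], df['start'], df['end'])]
df['sliced_markup'] = [m[s:e] for m, s, e in zip(df['markup'], df['start'], df['end'])]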
I have the following code, which does NOT give an error but it also does not produce an output.
The script is made to do the following:
The script takes an input file of 4 tab-separated columns.
It then counts the unique values in Column 1 and the frequency of the corresponding values in Column 4 (which contains two different tags: C and D).
The output is 3 tab-separated columns containing the unique values of Column 1 and their corresponding frequencies of the values in Column 4: Column 2 has the frequency of the Column 1 string co-occurring with tag C, and Column 3 has the frequency of the Column 1 string co-occurring with tag D.
Here is a sample of input:
algorithm-n like-1-resonator-n 8.1848 C
algorithm-n produce-hull-n 7.9104 C
algorithm-n like-1-resonator-n 8.1848 D
algorithm-n produce-hull-n 7.9104 D
anything-n about-1-Zulus-n 7.3731 C
anything-n above-shortage-n 6.0142 C
anything-n above-1-gig-n 5.8967 C
anything-n above-1-magnification-n 7.8973 C
anything-n after-1-memory-n 2.5866 C
and here is a sample of the desired output:
algorithm-n 2 2
anything-n 5 0
The code I am using is the following (which, as one can see, takes into consideration all suggestions from the comments):
from collections import defaultdict, Counter

def sortAndCount(opened_file):
    lemma_sense_freqs = defaultdict(Counter)
    for line in opened_file:
        lemma, _, _, senseCode = line.split()
        lemma_sense_freqs[lemma][senseCode] += 1
    return lemma_sense_freqs

def writeOutCsv(output_file, input_dict):
    with open(output_file, "wb") as outfile:
        for lemma in input_dict.keys():
            for senseCode in input_dict[lemma].keys():
                outstring = "\t".join([lemma, senseCode,
                                       str(input_dict[lemma][senseCode])])
                outfile.write(outstring + "\n")

import os
import glob

folderPath = "Python_Counter"  # declare here
for input_file in glob.glob(os.path.join(folderPath, 'out_')):
    with open(input_file, "rb") as opened_file:
        lemma_sense_freqs = sortAndCount(input_file)
    output_file = "count_*.csv"
    writeOutCsv(output_file, lemma_sense_freqs)
My intuition is the problem is coming from the "glob" function.
But, as I said before: the code itself DOES NOT give me an error -- but it doesn't seem to produce an output either.
Can someone help?
I have referred to the documentation here and here, and I cannot seem to understand what I am doing wrong.
Can someone provide me insight on how to solve the problem by outputting the results from glob? I have a large number of files I need to process.
In regards to your original code, lemma_sense_freqs is not defined because it should be returned by the function sortAndCount(), and you never call that function.
For instance, you have a second function in your code, writeOutCsv: you define it, and then you actually call it on the last line.
But you never call the function sortAndCount() (which is the one that should return the value of lemma_sense_freqs); hence the error.
I don't know exactly what you want to achieve with that code, but you definitely need to write, at a certain point (try just before the last line), something like this:
lemma_sense_freqs = sortAndCount(input_file)
This is how you call the function you need; lemma_sense_freqs will then have a value associated with it, and you shouldn't get the error.
I cannot be more specific, because it is not clear exactly what you want to achieve with that code. However, you are just experiencing a basic issue at the moment (you defined a function but never used it to obtain the value of lemma_sense_freqs). Try adding the piece of code I suggest and play with it.
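One more observation from my side (not part of the answer above): with the edited code now shown in the question, the likeliest cause of the silent no-output is the glob pattern itself. 'out_' contains no wildcard, so it only matches a file literally named out_; the loop body then never runs, and nothing is written. You are also passing the file name rather than the opened handle to sortAndCount. A minimal sketch, under those assumptions:
import os
import glob

folderPath = "Python_Counter"

# 'out_*' (note the wildcard) matches every file whose name starts with "out_"
for input_file in glob.glob(os.path.join(folderPath, 'out_*')):
    print(input_file)  # sanity check: confirm the files are actually found
    with open(input_file, "r") as opened_file:
        lemma_sense_freqs = sortAndCount(opened_file)  # pass the handle, not the name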