Fairly new to Python (and Stack Overflow!) here. I have a data set with subject line data (text strings) that I am building a bag-of-words model with. I'm creating new variables that flag a 0 or 1 for various possible scenarios, but I'm stuck trying to identify where there is an ellipsis ("...") in the text. Here's where I'm starting from:
Data_Frame['Elipses'] = Data_Frame.Subject_Line.str.match('(\w+)\.{2,}(.+)')
Inputting ('...') doesn't work for obvious reasons, but the above RegEx code was suggested--still not working. Also tried this:
Data_Frame['Elipses'] = Data_Frame.Subject_Line.str.match('.\.\.\')
No dice.
The above code shell works for other variables I've created, but I'm also having trouble creating a 0-1 output instead of True/False (would be an 'as.numeric' argument in R.) Any help here would also be appreciated.
Thanks!
Using search() instead of match() would spot an ellipsis at any point in the text. In pandas, str.contains() supports regular expressions, for example:
import pandas as pd
df = pd.DataFrame({'Text' : ["hello..", "again... this", "is......a test", "Real ellipses… here", "...not here"]})
df['Ellipses'] = df.Text.str.contains(r'\w+(\.{3,})|…')
print(df)
Giving you:
                  Text  Ellipses
0              hello..     False
1        again... this      True
2       is......a test      True
3  Real ellipses… here      True
4          ...not here     False
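To get the 0/1 flags you asked about rather than True/False, cast the boolean result to int (the capture group isn't needed, so it is dropped here to avoid a pandas warning about match groups):

df['Ellipses'] = df.Text.str.contains(r'\w+\.{3,}|…').astype(int)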
Or without pandas:
import re
for test in ["hello..", "again... this", "is......a test", "Real ellipses… here", "...not here"]:
print(int(bool(re.search(r'\w+(\.{3,})|…', test))))
This matches on the middle tests giving:
0
1
1
1
0
Take a look at search-vs-match for a good explanation in the Python docs.
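A quick illustration of the difference:

import re
print(re.match(r'\.{3}', 'abc...'))   # None: match() only tries the start of the string
print(re.search(r'\.{3}', 'abc...'))  # a Match object: search() scans the whole string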
To display the matching words:
import re
for test in ["hello..", "again... this", "is......a test", "...def"]:
ellipses = re.search(r'(\w+)\.{3,}', test)
if ellipses:
print(ellipses.group(1))
Giving you:
again
is
Related
I have a DataFrame in Python Pandas like the one below:
sentence
------------
😎🤘🏾
I like it
+1😍😘
One :-) :)
hah
I need to select only rows containing emoticons or emojis, so as a result I need something like below:
sentence
------------
😎🤘🏾
+1😍😘
One :-) :)
How can I do that in Python?
You can select the unicode emojis with a regex range:
df2 = df[df['sentence'].str.contains(r'[\u263a-\U0001f645]')]
output:
sentence
0 😎🤘🏾
2 +1😍😘
This is however much more ambiguous for the ASCII "emojis", as there is no standard definition and there are probably endless combinations. If you limit it to smiley faces that contain eyes (';' or ':') and a mouth (')' or '('), you could use:
df[df['sentence'].str.contains(r'[\u263a-\U0001f645]|(?:[:;]\S?[\)\(])')]
output:
sentence
0 😎🤘🏾
2 +1😍😘
3 One :-) :)
But you would be missing plenty of potential ASCII possibilities: :O, :P, 8D, etc.
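If you wanted to catch a few more of those, one rough sketch (my own, by no means exhaustive) is to allow more eye and mouth characters, at the cost of some false positives:

# eyes are :, ; or 8; optional nose is - or ^; mouth is ), (, P, p, D, d, O or o
df[df['sentence'].str.contains(r'[\u263a-\U0001f645]|[:;8][-^]?[()PpDdOo]')]

This picks up :O, ;P or 8D as well, but it will also start matching things like a stray parenthesis after a digit, so it trades precision for recall.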
I'm looking for a way in Python Spark to search a text for a composite string made of two separate words, for example "IPhone x" or "Samsung s10".
I want to supply a text file and a composite string such as "Iphone x", and then get the result.
All I can find on the internet is single word counts.
IIUC:
In Spark 2.0, if you are going to read it from a file, for example a .csv file:
df = spark.read.format("csv").option("header", "true").load("pathtoyourcsvfile.csv")
then you can filter it using regex like this:
pattern = "\s+(word1|word2)\s+"
filtered = df.filter(df['<thedesiredcolumnhere>'].rlike(pattern))
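Note that \s+ on both sides means a word at the very start or end of a line won't match; word boundaries are a common alternative (a sketch, using the same placeholder column name as above):

# \b matches at a word boundary, so word1 is found even at line start/end
pattern = r"\b(word1|word2)\b"
filtered = df.filter(df['<thedesiredcolumnhere>'].rlike(pattern))
print(filtered.count())  # number of matching rows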
You can try to write your own UDF combined with wordsegment to segment your words, and you can add new words to the dictionary to help the library segment new words, such as "Iphone x".
For example:
>>> # newer versions of wordsegment also need: from wordsegment import load; load()
>>> from wordsegment import clean, segment
>>> clean('She said, "Python rocks!"')
'shesaidpythonrocks'
>>> segment('She said, "Python rocks!"')
['she', 'said', 'python', 'rocks']
If you don't want to use a library, you can also see Word segmentation using dynamic programming; a minimal sketch follows below.
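For reference, here is a minimal sketch of that dynamic-programming idea with a toy dictionary (the dictionary and the fewest-words scoring are my assumptions, not what wordsegment does internally):

# Dictionary-based word segmentation via dynamic programming
def segment_dp(text, dictionary):
    n = len(text)
    # best[i] holds the best segmentation of text[:i], or None if impossible
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(max(0, i - 20), i):  # cap candidate word length at 20
            word = text[j:i]
            if best[j] is not None and word in dictionary:
                candidate = best[j] + [word]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]

print(segment_dp("iphonex", {"i", "phone", "iphone", "x"}))
# ['iphone', 'x']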
This is the answer:
import re

# give a file
rdd = sc.textFile("/root/PycharmProjects/Spark/file")
# give a composite string
string_ = "Iphone x"
# filter by lines containing the string
new_rdd = rdd.filter(lambda line: string_ in line)
# collect these lines
rt = str(new_rdd.collect())
# apply regex to find all occurrences and count them
count = len(re.findall(string_, rt))
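One caveat (my addition, not part of the original answer): "Iphone x" will only match lines with that exact casing. For a case-insensitive count, re.IGNORECASE, plus re.escape() to neutralise any regex metacharacters in the search string, would be one way:

import re
count = len(re.findall(re.escape(string_), rt, flags=re.IGNORECASE))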
I have a csv file something like this
text
RT #CritCareMed: New Article: Male-Predominant Plasma Transfusion Strategy for Preventing Transfusion-Related Acute Lung Injury... htp://…
#CRISPR Inversion of CTCF Sites Alters Genome Topology & Enhancer/Promoter Function in #CellCellPress htp://.co/HrjDwbm7NN
RT #gvwilson: Where's the theory for software engineering? Behind a paywall, that's where. htp://.co/1t3TymiF3M #semat #fail
RT #sciencemagazine: What’s killing off the sea stars? htp://.co/J19FnigwM9 #ecology
RT #MHendr1cks: Eve Marder describes a horror that is familiar to worm connectome gazers. htp://.co/AEqc7NOWoR via #nucAmbiguous htp://…
I want to extract all the mentions (starting with '#') from the tweet text. So far I have done this
import pandas as pd
import re
mydata = pd.read_csv("C:/Users/file.csv")
X = mydata.ix[:,:]
X=X.iloc[:,:1] #I have multiple columns so I'm selecting the first column only that is 'text'
for i in range(X.shape[0]):
    result = re.findall("(^|[^#\w])#(\w{1,25})", str(X.iloc[:i,:]))
    print(result)
There are two problems here:
First: str(X.iloc[:1,:]) gives me ['CritCareMed'], which is not OK, as it should give me ['CellCellPress'], and str(X.iloc[:2,:]) again gives me ['CritCareMed'], which is of course not fine either. The final result I'm getting is
[(' ', 'CritCareMed'), (' ', 'gvwilson'), (' ', 'sciencemagazine')]
It doesn't include the mentions in 2nd row and both two mentions in last row.
What I want should look something like this:
How can I achieve these results? This is just sample data; my original data has lots of tweets, so is the approach OK?
You can use the str.findall method to avoid the for loop, using a negative lookbehind to replace (^|[^#\w]), which otherwise forms another capture group you don't need in your regex:
df['mention'] = df.text.str.findall(r'(?<![#\w])#(\w{1,25})').apply(','.join)
df
#                                                text                  mention
#0  RT #CritCareMed: New Article: Male-Predominant...              CritCareMed
#1  #CRISPR Inversion of CTCF Sites Alters Genome ...            CellCellPress
#2  RT #gvwilson: Where's the theory for software ...                 gvwilson
#3  RT #sciencemagazine: What’s killing off the se...          sciencemagazine
#4  RT #MHendr1cks: Eve Marder describes a horror ...  MHendr1cks,nucAmbiguous
Also, X.iloc[:i,:] gives back a data frame, so str(X.iloc[:i,:]) gives you the string representation of a data frame, which is very different from the element in the cell. To extract the actual string from the text column, you can use X.text.iloc[0]; or, as a better way to iterate through a column, use iteritems:
import re
for index, s in df.text.iteritems():
    result = re.findall(r"(?<![#\w])#(\w{1,25})", s)
    print(','.join(result))
#CritCareMed
#CellCellPress
#gvwilson
#sciencemagazine
#MHendr1cks,nucAmbiguous
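(Note: Series.iteritems() has since been deprecated and removed in newer pandas; on pandas 2.x the equivalent is Series.items():)

import re
for index, s in df.text.items():
    result = re.findall(r"(?<![#\w])#(\w{1,25})", s)
    print(','.join(result))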
While you already have your answer, you could even try to optimize the whole import process like so:
import re, pandas as pd
rx = re.compile(r'#([^:\s]+)')
with open("test.txt") as fp:
dft = ([line, ",".join(rx.findall(line))] for line in fp.readlines())
df = pd.DataFrame(dft, columns = ['text', 'mention'])
print(df)
Which yields:
                                                text                  mention
0  RT #CritCareMed: New Article: Male-Predominant...              CritCareMed
1  #CRISPR Inversion of CTCF Sites Alters Genome ...            CellCellPress
2  RT #gvwilson: Where's the theory for software ...                 gvwilson
3  RT #sciencemagazine: What’s killing off the se...          sciencemagazine
4  RT #MHendr1cks: Eve Marder describes a horror ...  MHendr1cks,nucAmbiguous
This might be a bit faster as you don't need to change the df once it's already constructed.
mydata['text'].str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')
Same as this: Extract hashtags from columns of a pandas dataframe, but for mentions.
#.*? carries out a non-greedy match for a word starting with a hashtag
(?=\s|$) look-ahead for the end of the word or end of the sentence
(?:(?<=\s)|(?<=^)) look-behind to ensure there are no false positives if a # is used in the middle of a word
The regex lookbehind asserts that either a space or the start of the sentence must precede a # character.
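One thing to be aware of: unlike the findall answers above, this pattern has no capture group, so the leading # (and any trailing punctuation before the next space) stays in the match:

import pandas as pd
s = pd.Series(["RT #CritCareMed: New Article"])
print(s.str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)').iloc[0])
# ['#CritCareMed:']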
I have a large pandas dataframe. A column contains text broken down into sentences, one sentence per row. I need to check the sentences for the presence of terms used in various ontologies. Some of the ontologies are fairly large and have more than 100,000 entries. In addition, some of the ontologies contain molecule names with hyphens, commas, and other characters that may or may not be present in the text to be examined; hence the need for regular expressions.
I came up with the code below, but it's not fast enough to deal with my data. Any suggestions are welcome.
Thank you!
import pandas as pd
import re
sentences = ["""There is no point in driving yourself mad trying to stop
yourself going mad""",
"The ships hung in the sky in much the same way that bricks don’t"]
sentence_number = list(range(0, len(sentences)))
d = {'sentence' : sentences, 'number' : sentence_number}
df = pd.DataFrame(d)
regexes = [r'\bt\w+', r'\bs\w+']
big_regex = '|'.join(regexes)
compiled_regex = re.compile(big_regex, re.I)
df['found_regexes'] = df.sentence.str.findall(compiled_regex)
Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools
I'm a linguist who has recently picked up python and I'm working on a project which hopes to automatically analyze poems, including detecting the form of the poem. I.e. if it found a 10 syllable line with 0101010101 stress pattern, it would declare that it's iambic pentameter. A poem with 5-7-5 syllable pattern would be a haiku.
I'm using the following code, part of a larger script, but I have a number of problems which are listed below the program:
corpus in the script is simply the raw text input of the poem.
import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit
...
def cmuform():
tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
d = cmudict.dict()
text = nltk.Text(tokens)
words = [w.lower() for w in text]
regexp = "[A-Za-z]+"
exp = re.compile(regexp)
def nsyl(word):
lowercase = word.lower()
if lowercase not in d:
return 0
else:
first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
second = ''.join(first)
third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
return third
#return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])
sum1 = 0
for a in words:
if exp.match(a):
print a,nsyl(a),
sum1 = sum1 + len(str(nsyl(a)))
print "\nTotal syllables:",sum1
I guess that the output that I want would be like this:
1101111101
0101111001
1101010111
The first problem is that I lost the line breaks during the tokenization, and I really need the line breaks to be able to identify form. This should not be too hard to deal with though. The bigger problems are that:
I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllabic count of the line will probably decrease.
In addition, the CMU dictionary often says that there is stress on a word ('1') when there is not ('0'), which is why the output looks like this: 1101111101, when it should be the stress of iambic pentameter: 0101010101.
So how would I add some fudging factor so the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good to code a function that identifies lines of 01's when the CMU dictionary is not going to output such a clean result. I suppose I'm asking how to code a 'partial match' algorithm.
Welcome to stack overflow. I'm not that familiar with Python, but I see you have not received many answers yet so I'll try to help you with your queries.
First some advice: You'll find that if you focus your questions your chances of getting answers are greatly improved. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.
Back on topic:
Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:
split corpus into lines
For each line: find the syllable length and stress pattern.
Classify stress patterns.
You'll find that the first step is a single function call in python:
corpus.split("\n");
and can remain in the main function, but the second step would be better placed in its own function, and the third step would need to be split up itself and would probably be better tackled with an object-oriented approach. If you're in academia you might be able to convince the CS faculty to lend you a post-grad for a couple of months to help you, perhaps as part of some workshop requirement.
Now to your other questions:
Not losing line breaks: as #ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
Words not in dictionary/errors: The CMU dictionary home page says:
Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)
There is probably a way to add custom words to the dictionary / change existing ones, look in their site, or contact the dictionary maintainers directly.
You can also ask here in a separate question if you can't figure it out. There's bound to be someone in stackoverflow that knows the answer or can point you to the correct resource.
Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway to improve the dictionary.
Classifying input corpus when it doesn't exactly match the pattern: You might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look for:
Levenshtein distance: gives you a measure of how different two strings are, as the number of changes needed to turn one string into the other. Pros: easy to implement. Cons: not normalized; a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3. (A minimal sketch follows after this list.)
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement, but gives you normalized values (0.0 = completely different, 1.0 = the same) and is suitable for classifying the stress patterns. A CS post-grad or last-year undergrad should not have too much trouble with it (hint hint).
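Here is a minimal sketch of the standard dynamic-programming implementation of Levenshtein distance, applied to two stress patterns from the question:

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("1101111101", "0101010101"))  # 3: the patterns are three edits apart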
I think those were all your questions. Hope this helps a bit.
To preserve newlines, parse line by line before sending each line to the cmu parser.
For dealing with single-syllable words, you probably want to try both 0 and 1 when nltk returns 1 (it looks like nltk already returns 0 for some words that would never get stressed, like "the"). So you'll end up with multiple permutations:
1101111101
0101010101
1101010101
and so forth. Then you have to pick the ones that look like known forms.
For non-dictionary words, I'd fudge it the same way: figure out the number of syllables (the dumbest way would be by counting the vowels), and permute all possible stresses (see the sketch below). Maybe add some more rules like "ea is a single syllable, trailing e is silent"...
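As a small sketch of that permutation idea (using '?' as my own placeholder for an ambiguous syllable):

from itertools import product

def stress_permutations(pattern):
    # each '?' can be either 0 or 1; fixed digits stay as they are
    slots = [('0', '1') if ch == '?' else (ch,) for ch in pattern]
    return [''.join(p) for p in product(*slots)]

print(stress_permutations('?101?111'))
# ['01010111', '01011111', '11010111', '11011111']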
I've never worked with other kinds of fuzzy matching, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.
This is my first post on stackoverflow.
And I'm a python newbie, so please excuse any deficits in code style.
But I too am attempting to extract accurate metre from poems.
The code included in this question helped me, so I'm posting what I came up with, building on that foundation. It is one way to extract the stress as a single string, correct for the cmudict bias with a 'fudging factor', and not lose words that are not in cmudict.
import nltk
from nltk.corpus import cmudict
prondict = cmudict.dict()
#
# parseStressOfLine(line)
# function that takes a line
# parses it for stress
# corrects the cmudict bias toward 1
# and returns two strings
#
# 'stress' in form '0101*,*110110'
# -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
def parseStressOfLine(line):
    stress = ""
    stress_no_punct = ""
    print(line)
    tokens = [words.lower() for words in nltk.word_tokenize(line)]
    for word in tokens:
        word_punct = strip_punctuation_stressed(word.lower())
        word = word_punct['word']
        punct = word_punct['punct']
        #print(word)
        if word not in prondict:
            # if word is not in dictionary
            # add it to the string that includes punctuation
            stress = stress + "*" + word + "*"
        else:
            zero_bool = True
            for s in prondict[word]:
                # oppose the cmudict bias toward 1:
                # search for a zero in the array returned from prondict
                # and if it exists, use it
                #print(strip_letters(s), word)
                if strip_letters(s) == "0":
                    stress = stress + "0"
                    stress_no_punct = stress_no_punct + "0"
                    zero_bool = False
                    break
            if zero_bool:
                stress = stress + strip_letters(prondict[word][0])
                stress_no_punct = stress_no_punct + strip_letters(prondict[word][0])
        if len(punct) > 0:
            stress = stress + "*" + punct + "*"
    return {'stress': stress, 'stress_no_punct': stress_no_punct}
# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
    # define punctuations
    punctuations = '!()-[]{};:"\,<>./?##$%^&*_~'
    my_str = word
    # remove punctuations from the string, collecting them separately
    no_punct = ""
    punct = ""
    for char in my_str:
        if char not in punctuations:
            no_punct = no_punct + char
        else:
            punct = punct + char
    return {'word': no_punct, 'punct': punct}
# CONVERT the cmudict prondict into just numbers
def strip_letters(ls):
    #print("strip_letters")
    nm = ''
    for ws in ls:
        #print("ws", ws)
        for ch in list(ws):
            #print("ch", ch)
            if ch.isdigit():
                nm = nm + ch
                #print("add to nm", nm, type(nm))
    return nm
# TESTING results
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print parseStressOfLine(line)
line = "Apollo play'd the midwife's part;"
print parseStressOfLine(line)
line = "Into the world Corinna fell,"
print parseStressOfLine(line)
"""
OUTPUT
This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}