I need to search for the word 'mas' in a DataFrame. The column with the phrases is Corpo, and the text in this column is split into a list, for example: I like birds ---> split [I, like, birds]. So, I need to search for 'mas' in a Portuguese phrase and keep just the words after 'mas'. The code is taking too long to execute this function.
df.Corpo.update(df.Corpo.str.split())  # tokenize the phrase
df.Corpo = df.Corpo.fillna('')
for i in df.index:
    for j in range(len(df.Corpo[i])):
        lista_aux = []
        if df.Corpo[i][j] == 'mas' or df.Corpo[i][j] == 'porem' or df.Corpo[i][j] == 'contudo' or df.Corpo[i][j] == 'todavia':
            lista_aux = df.Corpo[i]
            df.Corpo[i] = lista_aux[j+1:]
            break
        if df.Corpo[i][j] == 'question':
            df.Corpo[i] = ['question']
            break
When working with pandas dataframes (or numpy arrays) you should always try to use vectorized operations instead of for-loops over individual dataframe elements. Vectorized operations are (nearly always) significantly faster than for-loops.
In your case you could use pandas' built-in vectorized string method str.extract, which extracts the part of a string that matches a regex capture group. The search pattern mas (.+) captures the part of a string that follows 'mas'.
import pandas as pd
# Example dataframe with phrases
df = pd.DataFrame({'Corpo': ['I like birds', 'I mas like birds', 'I like mas birds']})
# Use regex search to extract phrase sections following 'mas'
df2 = df.Corpo.str.extract(r'mas (.+)', expand=False)
# Fill gaps with full original phrase
df2 = df2.fillna(df.Corpo)
This will give the following result:
In [1]: df2
Out[1]:
0    I like birds
1      like birds
2           birds
Name: Corpo, dtype: object
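The same idea extends to the other conjunctions and the 'question' rule from your loop. A rough sketch (the combined pattern, the df3 name, and applying the 'question' rule as a final mask are my assumptions; the original loop checks the two conditions token by token, so the ordering is only approximated here):
# Capture everything after any of the conjunctions, keep the full phrase otherwise
pattern = r'(?:mas|porem|contudo|todavia) (.+)'
df3 = df.Corpo.str.extract(pattern, expand=False).fillna(df.Corpo)
# Rows mentioning 'question' collapse to just 'question', as in the original loop
df3 = df3.mask(df.Corpo.str.contains(r'\bquestion\b'), 'question')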
I am doing some data mining. I have a database that looks like this (pulling out three lines):
100324822$10032482$1$PS$BENICAR$OLMESARTAN MEDOXOMIL$1$Oral$UNK$$$Y$$$$021286$$$TABLET$
1014687010$10146870$2$SS$BENICAR HCT$HYDROCHLOROTHIAZIDE\OLMESARTAN MEDOXOMIL$1$Oral$1/2 OF 40/25MG TABLET$$$Y$$$$$.5$DF$FILM-COATED TABLET$QD
115700162$11570016$5$C$Olmesartan$OLMESARTAN$1$Unknown$UNK$$$U$U$$$$$$$
My code looks like this:
with open('DRUG20Q4.txt') as fileDrug20Q4:
    drugTupleList20Q4 = [tuple(map(str, i.split('$'))) for i in fileDrug20Q4]
drug20Q4 = []
for entryDrugPrimaryID20Q4 in drugTupleList20Q4:
    drug20Q4.append((entryDrugPrimaryID20Q4[0], entryDrugPrimaryID20Q4[3], entryDrugPrimaryID20Q4[5]))
fileDrug20Q4.close()
drugNameDataFrame20Q4 = pd.DataFrame(drug20Q4, columns=['PrimaryID', 'Role', 'Drug Name'])
drugNameDataFrame20Q4 = pd.DataFrame(drugNameDataFrame20Q4.loc[drugNameDataFrame20Q4['Drug Name'] == 'OLMESARTAN'])
Currently the code will pull out only entries with the exact name "OLMESARTAN". How do I capture all the variations, for instance "OLMESARTAN MEDOXOMIL", etc.? I can't simply list all the varieties as there is a practically unlimited number of variations, so I need something that captures anything with the term "OLMESARTAN" in it.
Thanks!
You can use str.contains to get what you are looking for.
Here's an example (using some string I found in the documentation):
import pandas as pd
df = pd.DataFrame()
item = 'Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.'
df['test'] = item.split(' ')
df[df['test'].str.contains('de')]
This outputs:
test
4 Index
22 Index.
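Applied to your dataframe, the exact-match .loc filter can be replaced with str.contains. A sketch using the names from your code (the olmesartanRows20Q4 variable is mine; case=False and na=False are added so mixed-case names and empty cells don't cause trouble):
drugNameDataFrame20Q4 = pd.DataFrame(drug20Q4, columns=['PrimaryID', 'Role', 'Drug Name'])
# Keep every row whose 'Drug Name' mentions OLMESARTAN anywhere in the string
olmesartanRows20Q4 = drugNameDataFrame20Q4[
    drugNameDataFrame20Q4['Drug Name'].str.contains('OLMESARTAN', case=False, na=False)
]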
So I am trying to get a count for specific phrases in Python from a string I created. I have been able to make a list of specific individual words, but never anything involving two-word phrases. I just want to be able to create a list of items where each item involves two words.
import pandas as pd
import numpy as np
import re
import collections
import plotly.express as px
df = pd.read_excel("Datasets/realDonaldTrumprecent2020.xlsx", sep='\t',
                   names=["Tweet_ID", "Date", "Text"])
df = pd.DataFrame(df)
df.head()
tweets = df["Text"]
raw_string = ''.join(tweets)
no_links = re.sub(r'http\S+', '', raw_string)
no_unicode = re.sub(r"\\[a-z][a-z]?[0-9]+", '', no_links)
no_special_characters = re.sub('[^A-Za-z ]+', '', no_unicode)
no_capital_letters = re.sub('[A-Z]+', lambda m: m.group(0).lower(), no_special_characters)
words_list = no_capital_letters.split(" ")
phrases = ['fake news', 'lamestream media', 'sleepy joe', 'radical left', 'rigged election']
I initially was able to get a list of just the individual words but I want to be able to get a list of instances where the phrases show up. Is there a way to do this?
Pandas provides some nice tools for doing these things.
For example, if your DataFrame was as follows:
import pandas as pd
df = pd.DataFrame({'text': [
    'Encyclopedia Britannica is FAKE NEWS!',
    'What does Sleepy Joe read? Webster\'s Dictionary? Fake News!',
    'Sesame Street is lamestream media by radical leftist Big Bird!!!',
    '1788 was a rigged election! Landslide for King George! Fake News',
]})
...you could select tweets containing the phrase 'fake news' like so:
selector = df.text.str.lower().str.contains('fake news')
This produces the following Series of booleans:
0 True
1 True
2 False
3 True
Name: text, dtype: bool
You can count how many are positive with sum:
sum(selector)
And use it to index the data frame to get an array of tweets
df.text[selector].values
If you are trying to count the number of times those phrases appear in the text, count against the cleaned string rather than the single-word list (a list of individual words can never contain a two-word phrase):
for phrase in phrases:
    count = no_capital_letters.count(phrase)
    print(phrase, count)
In terms of "a list of instances where the phrases show up", you should be able to slightly modify the above for loop:
phrase_list = []
for phrase in phrases:
    for tweet in tweets:
        if phrase in tweet.lower():
            phrase_list.append(tweet)
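A vectorized alternative does the same thing without the nested loop. This is a sketch using the df and phrases names defined above (the joined regex pattern and the phrase_hits name are my additions):
import re
# Build one alternation pattern from the phrases and match against lower-cased tweets
pattern = '|'.join(re.escape(p) for p in phrases)
mask = df['Text'].str.lower().str.contains(pattern, na=False)
phrase_hits = df['Text'][mask].tolist()  # tweets containing at least one phrase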
I want to remove ambiguity when I train the data, so I want to clean it well. How can I remove all rows that contain 3 words or fewer in Python?
Hello World! This will be my first contribution ever to SO :-)
Let's create some data:
data = { 'Source':['Hello all Im Happy','Its a lie, dont trust him','Oops','foo','bar']}
df = pd.DataFrame(data, columns=['Source'])
My approach is very straightforward, simple, a little "brute force" and inefficient; however, I ran this on a large dataframe (1,013,952 rows) and the time was fairly acceptable.
Let's find the indices of the dataframe rows that have fewer than n tokens (these are the rows we will drop):
from nltk.tokenize import word_tokenize

def get_indices(df, col, n):
    """
    Get the indices of the dataframe rows that have fewer than n tokens in a specific column.

    Parameters:
    df (pandas dataframe)
    col (string): column name
    n (int): threshold value for the minimum number of words
    """
    tmp = []
    for i in range(len(df)):  # df.iterrows() wasn't working for me
        if len(word_tokenize(df[col][i])) < n:
            tmp.append(i)
    return tmp
Next we just need to call the function and drop the rows at those indices:
tmp = get_indices(df, 'Source', 4)  # drop rows with fewer than 4 tokens
df_clean = df.drop(tmp)
Best!
df = pd.DataFrame({"mycolumn": ["", " ", "test string", "test string 1", "test string 2 2"]})
df = df.loc[df["mycolumn"].str.count(" ") >= 3]  # at least 3 spaces, i.e. at least 4 words
You should never loop over a dataframe; always use vectorized operations.
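A sketch of the same idea that counts actual words rather than spaces, so repeated spaces and empty strings don't slip through (the column name "mycolumn" is the same assumption as above):
# Keep only rows whose text splits into more than 3 words
df = df.loc[df["mycolumn"].str.split().str.len() > 3]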
I've seen questions posted here that are similar to mine, but I'm still getting errors in my code when trying some accepted answers. I have a dataframe with three columns--created_at, text, and words (which is just a tokenized version of text). See below:
Now, I have a list of companies ['Starbucks', 'Nvidia', 'IBM', 'Dell'], and I only want to keep the rows where the text includes any of those words.
I've tried a few things, but with no success:
small_DF.filter(lambda x: any(word in x.text for word in test_list))
Returns: TypeError: condition should be string or Column
I tried creating a function and using foreach():
def filters(line):
    return any(word in line for word in test_list)

df = df.foreach(filters)
That turns df into 'NoneType'.
And the last one I tried:
df = df.filter(col("text").isin(test_list))
This returns an empty dataframe, which is nice as I get no error, but obviously not what I want.
Your .filter call returns an error because it is the SQL filter function on DataFrames (which expects a BooleanType() column), not the filter function on RDDs. If you want to use the RDD one, just add .rdd:
small_DF.rdd.filter(lambda x: any(word in x.text for word in test_list))
You don't have to use a UDF; you can use regular expressions in pyspark with .rlike on your "text" column:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
import pyspark.sql.functions as psf
words = [x.lower() for x in ['starbucks', 'Nvidia', 'IBM', 'Dell']]
data = [['i love Starbucks'],['dell laptops rocks'],['help me I am stuck!']]
df = hc.createDataFrame(data).toDF('text')
df.filter(psf.lower(df.text).rlike('|'.join(words)))
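On Spark 2.x and later the same filter can be written against a SparkSession instead of a HiveContext. A sketch under that assumption, reusing the words and data defined above:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data, ['text'])
# Keep rows whose lower-cased text matches any of the words
df.filter(F.lower(F.col('text')).rlike('|'.join(words))).show()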
I think filter isn't working because it expects a boolean output from the lambda function, and isin only checks whether the whole column value is in the list, not whether it contains one of the words. You are trying to compare a list of words to a list of words. Here is something I tried that can give you some direction:
# prepare some test data ==>
words = [x.lower() for x in ['starbucks', 'Nvidia', 'IBM', 'Dell']]
data = [['i love Starbucks'],['dell laptops rocks'],['help me I am stuck!']]
df = spark.createDataFrame(data).toDF('text')
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf

def intersect(row):
    # convert each word to lowercase
    row = [x.lower() for x in row.split()]
    return True if set(row).intersection(set(words)) else False

filterUDF = udf(intersect, BooleanType())
df.where(filterUDF(df.text)).show()
output :
+------------------+
| text|
+------------------+
| i love Starbucks|
|dell laptops rocks|
+------------------+
I am working in a Jupyter notebook and have a pandas dataframe "data":
Question_ID | Customer_ID | Answer
1           | 234         | Data is very important to use because ...
2           | 234         | We value data since we need it ...
I want to go through the text in column "Answer" and get the three words before and after the word "data".
So in this scenario I would have gotten "is very important"; "We value", "since we need".
Is there a good way to do this within a pandas dataframe? So far I have only found solutions where "Answer" would be its own file run through Python code (without a pandas dataframe). While I realize that I need to use the NLTK library, I haven't used it before, so I don't know what the best approach would be. (This was a great example: Extracting a word and its prior 10 word context to a dataframe in Python)
This may work:
import pandas as pd
import re
df = pd.read_csv('data.csv')
for value in df.Answer.values:
    non_data = re.split('Data|data', value)  # split the text, removing "data"
    terms_list = [term for term in non_data if len(term) > 0]  # skip empty terms
    substrs = [term.split()[0:3] for term in terms_list]  # slice and grab the first three words
    result = [' '.join(term) for term in substrs]  # combine the words back into substrings
    print(result)
output:
['is very important']
['We value', 'since we need']
A solution using a generator expression with the re.findall and itertools.chain.from_iterable functions:
import pandas as pd, re, itertools
data = pd.read_csv('test.csv') # change with your current file path
data_adjacents = ((i for sublist in (list(filter(None, t))
                                     for t in re.findall(r'(\w*?\s*\w*?\s*\w*?\s+)(?=\bdata\b)|(?<=\bdata\b)(\s+\w*\s*\w*\s*\w*)', l, re.I))
                   for i in sublist)
                  for l in data.Answer.tolist())
print(list(itertools.chain.from_iterable(data_adjacents)))
The output:
[' is very important', 'We value ', ' since we need']
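If you would rather keep the context attached to each row of the dataframe, an .apply with one compiled pattern is another option. This is only a sketch of mine (the pattern, the lambda, and the new 'context' column are assumptions, not part of either answer above, and punctuation between the surrounding words is not handled):
import re
import pandas as pd

data = pd.read_csv('test.csv')  # same file as above
# Up to three words before and up to three words after each "data", case-insensitively
pattern = re.compile(r'((?:\w+\s+){0,3})\bdata\b\s*((?:\w+\s*){0,3})', re.I)
data['context'] = data['Answer'].fillna('').apply(
    lambda text: [(before.strip(), after.strip())
                  for before, after in pattern.findall(text)])
For the two example answers this should store [('', 'is very important')] and [('We value', 'since we need')] in the new column.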