I have a dataframe containing home descriptions:
description
0 Beautiful, spacious skylit studio in the heart...
1 Enjoy 500 s.f. top floor in 1899 brownstone, w...
2 The spaceHELLO EVERYONE AND THANKS FOR VISITIN...
3 We welcome you to stay in our lovely 2 br dupl...
4 Please don’t expect the luxury here just a bas...
5 Our best guests are seeking a safe, clean, spa...
6 Beautiful house, gorgeous garden, patio, cozy ...
7 Comfortable studio apartment with super comfor...
8 A charming month-to-month home away from home ...
9 Beautiful peaceful healthy homeThe spaceHome i...
I'm trying to count the number of sentences in each row (using sent_tokenize from nltk.tokenize) and append those values as a new column, sentence_count, to the df. Since this is part of a larger data pipeline, I'm using pandas assign so that I can chain operations.
I can't seem to get it to work, though. I've tried:
df.assign(sentence_count=lambda x: len(sent_tokenize(x['description'])))
and
df.assign(sentence_count=len(sent_tokenize(df['description'])))
but both raise the following error:
TypeError: expected string or bytes-like object
I've confirmed that every value in the column is a str. Perhaps it's because description has dtype('O')?
What am I doing wrong here? Using a pipe with a custom function works fine here, but I prefer using assign.
In your first example, the x['description'] that you pass to sent_tokenize is a pandas.Series. It's not a string; it's a Series (similar to a list) of strings.
So instead you should apply the tokenizer to each element and take the length:
df.assign(sentence_count=df['description'].apply(lambda s: len(sent_tokenize(s))))
Or, equivalently, tokenize first and count the resulting lists with .str.len():
df.assign(sentence_count=df['description'].apply(sent_tokenize).str.len())
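Put together, a minimal runnable version might look like this (a sketch; it assumes nltk is installed and downloads the punkt model that sent_tokenize needs):

import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # sent_tokenize needs the punkt sentence model

df = pd.DataFrame({'description': [
    'Beautiful, spacious skylit studio. In the heart of the city.',
    'Comfortable studio apartment with a super comfortable bed.',
]})

# apply tokenizes each string individually, so len() counts sentences per row
df = df.assign(sentence_count=df['description'].apply(lambda s: len(sent_tokenize(s))))
print(df)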
Related
I'm fairly new to Python. While I know it's possible to deduplicate rows in Pandas with drop_duplicates for identical text results, is there a way to drop similar rows of text?
E.g. for this fictional collection of online article headlines, populated in chronological order
1 "The dog ate my homework" says confused child in Banbury
2 Confused Banbury child says dog ate homework
3 Why are dogs so cute
4 Teacher in disbelief as child says dog ate homework - Banbury Times
5 Dogs don't like eggs, here's why
6 The moment a senior stray is adopted - try not to cry
7 Dog smugglers in Banbury arrested in police sting operation
My ideal outcome would be that only rows 1, 3, 5, 6 and 7 remain, with rows 1, 2 and 4 having been grouped for similarity and only row 1, the oldest/'first' entry, kept.
(How) could I get there? Even advice purely about the grouping approach would be very helpful. I would want to be able to run this on hundreds of rows of text, without having a specific, manually pre-determined article or headline to measure similarity against, just group similar rows.
Thank you so much for your thoughts and time!
You can try to embed your data with doc2vec (example of usage), then cluster your texts by cosine distance with k-medoids or hierarchical algorithms.
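To make that concrete, here is a minimal sketch of that pipeline (assuming gensim and scipy are installed; the 0.4 cosine-distance threshold is a guess you would have to tune, and doc2vec vectors are rough on a corpus this small):

import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from scipy.cluster.hierarchy import linkage, fcluster

headlines = pd.Series([
    '"The dog ate my homework" says confused child in Banbury',
    'Confused Banbury child says dog ate homework',
    'Why are dogs so cute',
    'Teacher in disbelief as child says dog ate homework - Banbury Times',
    "Dogs don't like eggs, here's why",
    'The moment a senior stray is adopted - try not to cry',
    'Dog smugglers in Banbury arrested in police sting operation',
])

# Embed each headline with doc2vec (model.dv is model.docvecs in older gensim)
docs = [TaggedDocument(h.lower().split(), [i]) for i, h in enumerate(headlines)]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=100)
vectors = [model.dv[i] for i in range(len(headlines))]

# Hierarchical clustering on cosine distance, then keep the first
# (chronologically oldest) headline of each cluster
labels = fcluster(linkage(vectors, method='average', metric='cosine'),
                  t=0.4, criterion='distance')
deduped = headlines.groupby(labels).first()
print(deduped)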
I am having some difficulty selecting non-empty fields using regex (findall) within my dataframe, looking for words contained in a text source:
text = "Be careful otherwise police will capture you quickly."
I need to look for words that end with ful in my text string, then look for words that end with ful in my dataset.
Author DF_Text
31 Better the devil you know than the one you don't
53 Beware the door with too many keys.
563 Be careful what you tolerate. You are teaching people how to treat you.
41 Fear the Greeks bearing gifts.
539 NaN
51 The honey is sweet but the bee has a sting.
21 Be careful what you ask for; you may get it.
(from csv/txt file).
I need to extract the words ending with ful in text, then find the DF_Text rows (and thus the Authors) that contain words ending with ful, appending the results to a list.
n = 0
for i in df['DF_Text']:
    print(re.findall(r"\w+ful", i))
    n = n + 1
print(n)
My question is: how can I remove the empty results ([]) and NaN rows from the analysis, and report the author names (e.g. 563, 21) they relate to?
I will be happy to provide further information in case anything is unclear.
Use str.findall instead of looping with re.findall:
df["found"] = df["DF_Text"].str.findall(r"(\w+ful)")
df.loc[df["found"].str.len().eq(0),"found"] = df["Author"]
print (df)
Author DF_Text found
0 31 Better the devil you know than the one you don't 31
1 53 Beware the door with too many keys. 53
2 563 Be careful what you tolerate. You are teaching... [careful]
3 41 Fear the Greeks bearing gifts. 41
4 539 NaN NaN
5 51 The honey is sweet but the bee has a sting. 51
6 21 Be careful what you ask for; you may get it. [careful]
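A variant, if the goal is instead to keep only the rows that actually matched and report each Author together with the found words (my reading of the question, using the same assumed column names):

matches = df.dropna(subset=["DF_Text"])
matches = matches.assign(found=matches["DF_Text"].str.findall(r"\w+ful"))
print(matches.loc[matches["found"].str.len() > 0, ["Author", "found"]])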
I would use the .notna() function of pandas to get rid of that row in your dataframe.
Something like this:
df = df[df['DF_Text'].notna()]
And note that the dataframe appears twice in that expression before being overwritten; this is correct.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html
I have several old books where each page is filled with historical records of immigrants and their families. Most variables were filled in only for the father, usually regarded as the family's chief. So, for example, if the immigrant family is going to live in a city called "small city in the West", only the father would have this information, while the mother and children were supposed to go to the same destination. Additionally, some observations have no information at all, even for the father.
What I want to do is simply fill in the missing values for the relatives within the same family (i.e., the same boss). I have reached a solution, but it's too inefficient, and I'm afraid I'm overcomplicating something that is rather simple. Below I use an example dataset to show my solution.
Example dataset:
m=1
test=pd.DataFrame({'destino_heranca':['A','','','','C','']*m, 'num_familia_raw':[1,1,2,2,3,3]*m}, index=range(6*m))
test
Note that individual 1 has city A as destination, since they are from family 1. On the other hand, family 2 must remain missing in the final dataset, since I don't have information even for the boss.
destino_heranca num_familia_raw
0 A 1
1 1
2 2
3 2
4 C 3
5 3
Then, I create a dictionary called isdest_null where the keys are the family numbers and the values are booleans: True if the family has no information at all (not even for the boss) and False otherwise:
def num_familia_raw_dest(m):
    return list(set(test[test['num_familia_raw']==m].destino_heranca.values))

isdest_null = {k: ('' in num_familia_raw_dest(k)) & (len(num_familia_raw_dest(k))==1)
               for k in test.num_familia_raw.unique()}
In a separate executable file called heritage.py I define the following function:
import numpy as np

def heritage(col, data, empty_map):
    for k in data.num_familia_raw.unique():
        if empty_map[k]:
            data[data.num_familia_raw==k] = data[data.num_familia_raw==k].replace({'{}_heranca'.format(col): {'': 'nao informado'}})
    # information doesn't exist
    condition1 = (data['{}_heranca'.format(col)] == '')
    # same family
    condition2 = (data['num_familia_raw'] == data['num_familia_raw'].shift(1))
    while '' in data.groupby('num_familia_raw').last()['{}_heranca'.format(col)].values:
        data['{}_heranca'.format(col)] = np.where(condition1 & condition2, data['{}_heranca'.format(col)].shift(1), data['{}_heranca'.format(col)])
    return data['{}_heranca'.format(col)]
Running the full code with the appropriate imports yields:
0 A
1 A
2 nao informado
3 nao informado
4 C
5 C
which is exactly what I want. However, this solution is hugely inefficient and my real data has almost 2 million rows.
Measuring performance with timeit
I'm trying to measure the performance of my implementation to compare it with other solutions that I eventually develop, and I would be very grateful if someone helped me understand it better. Here is my code:
import timeit
timeit.timeit("heritage('destino', data=test, empty_map=isdest_null)",number=1000, globals=globals())
output:
23.539601539001524
I'm not sure how to interpret it: does this mean 23 seconds per loop, and what does that mean in my case?
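For reference, per the timeit documentation the returned value is the total time for all number runs, not per run, so the per-call average is the total divided by number:

import timeit

total = timeit.timeit("heritage('destino', data=test, empty_map=isdest_null)",
                      number=1000, globals=globals())
print(total / 1000)  # average seconds per call; roughly 0.024 s here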
If the available destino_heranca always appears first in each num_familia_raw group, then you can do a transform:
test['destino_heranca'] = (test.groupby('num_familia_raw')['destino_heranca']
                               .transform('first')
                               .replace('', 'nao informado')
                          )
Output:
destino_heranca num_familia_raw
0 A 1
1 A 1
2 nao informado 2
3 nao informado 2
4 C 3
5 C 3
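If the filled value is not guaranteed to appear first in each family, a slightly more defensive variant (my assumption, not part of the answer above) masks empty strings to NaN so that the groupby 'first' picks the first non-missing value in each group:

import numpy as np

test['destino_heranca'] = (test['destino_heranca']
    .replace('', np.nan)
    .groupby(test['num_familia_raw'])
    .transform('first')
    .fillna('nao informado')
)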
I want to filter a column containing tweets (3+ million rows) in a pandas dataframe by dropping those tweets that do not contain a keyword (or keywords). To do this, I'm running the following loop (sorry, I'm new to Python):
filter_word_indicators = []
for i in range(1, len(df)):
    if 'filter_word' in str(df.tweets[0:i]):
        indicator = 1
    else:
        indicator = 0
    filter_word_indicators.append(indicator)
The idea is to then drop tweets if the indicator equals 0. The problem is that this loop is taking forever to run. I'm sure there is a better way to drop tweets that do not contain my 'filter_word', but I don't know how to code it up. Any help would be great.
Check out pandas.Series.str.contains, which you can use as follows to keep only the tweets that do contain the keyword:
df[df.tweets.str.contains('filter_word')]
MWE (here ~ inverts the mask, dropping the rows that do contain the substring):
In [0]: df = pd.DataFrame(
            [[1, "abc"],
             [2, "bce"]],
            columns=["number", "string"],
        )
In [1]: df
Out[1]:
number string
0 1 abc
1 2 bce
In [2]: df[~df.string.str.contains("ab")]
Out[2]:
number string
1 2 bce
Timing
Ran a small timing test on the following synthetic DataFrame with three million random strings the size of a tweet
df = pd.DataFrame(
    [
        "".join(random.choices(string.ascii_lowercase, k=280))
        for _ in range(3000000)
    ],
    columns=["strings"],
)
and the keyword abc, comparing the original solution, map + regex and this proposed solution (str.contains). The results are as follows.
original 99s
map + regex 21s
str.contains 2.8s
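For reproducibility, one plausible harness for such a comparison (a sketch; the exact one used above isn't shown):

import timeit

# df as constructed above; time one full pass of the str.contains approach
print(timeit.timeit(lambda: df[df.strings.str.contains("abc")], number=1))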
I create the following example:
df = pd.DataFrame("""Suggested order for Amazon Prime Doctor Who series
Why did pressing the joystick button spit out keypresses?
Why tighten down in a criss-cross pattern?
What exactly is the 'online' in OLAP and OLTP?
How is hair tissue mineral analysis performed?
Understanding the reasoning of the woman who agreed with King Solomon to "cut the baby in half"
Can Ogre clerics use Purify Food and Drink on humanoid characters?
Heavily limited premature compiler translates text into excecutable python code
How many children?
Why are < or > required to use /dev/tcp
Hot coffee brewing solutions for deep woods camping
Minor traveling without parents from USA to Sweden
Non-flat partitions of a set
Are springs compressed by energy, or by momentum?
What is "industrial ethernet"?
What does the hyphen "-" mean in "tar xzf -"?
How long would it take to cross the Channel in 1890's?
Why do all the teams that I have worked with always finish a sprint without completion of all the stories?
Is it illegal to withhold someone's passport and green card in California?
When to remove insignificant variables?
Why does Linux list NVMe drives as /dev/nvme0 instead of /dev/sda?
Cut the gold chain
Why do some professors with PhDs leave their professorships to teach high school?
"How can you guarantee that you won't change/quit job after just couple of months?" How to respond?""".split('\n'), columns = ['Sentence'])
You can just create a simple function with a regular expression (more flexible in case of capital letters, since (?i) makes the match case-insensitive):
import re

def tweetsFilter(s, keyword):
    return bool(re.match('(?i).*(' + keyword + ').*', s))
This function can be called to obtain a boolean series marking the strings which contain the specific keyword. The map can speed up your script (you need to test!):
keyword = 'Why'
sel = df.Sentence.map(lambda x: tweetsFilter(x, keyword))
df[sel]
And we obtained:
Sentence
1 Why did pressing the joystick button spit out ...
2 Why tighten down in a criss-cross pattern?
9 Why are < or > required to use /dev/tcp
17 Why do all the teams that I have worked with a...
20 Why does Linux list NVMe drives as /dev/nvme0 ...
22 Why do some professors with PhDs leave their p...
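Note that the same case-insensitive filter can also be written without a Python-level lambda, since case is a documented parameter of Series.str.contains:

df[df.Sentence.str.contains(keyword, case=False)]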
I have this list:
sentences = ['The number of adults who read at least one novel within the past 12 months fell to 47%.',
'Fiction reading rose from 2002 to 2008.', 'The decline in fiction reading last year occurred mostly among men.',
'Women read more fiction.', '50% more.', 'Though it decreased 10% over the last decade.', 'Men are more likely to read nonfiction.',
'Young adults are more likely to read fiction.', 'Just 54% of Americans cracked open a book of any kind last year.',
'But novels have suffered more than nonfiction.']
And I have another list containing the indexes of all sequences of sentences in the above list that contain a number.
index_groupings = [[0, 1], [4, 5], [8]]
I want to extract specified sentence sequences in the variable "sentences" by using the indexes in the variable "index_groupings" so that I get the following output:
The number of adults who read at least one novel within the past 12 months fell to 47%.Fiction reading rose from 2002 to 2008.
50% more.Though it decreased 10% over the last decade.
Just 54% of Americans cracked open a book of any kind last year.
So I do the following:
for grouping in index_groupings:
    if len(grouping) > 1:
        print sentences[grouping[:]]
    else:
        print sentences[grouping[0]]
When I run that, I get an error message that says
TypeError: list indices must be integers, not list
The line print sentences[grouping[:]] trips it up. Is there a way to loop through those index sequences in the list index_groupings so that it returns the correct output?
print ["".join(map(lambda x:sentences[x],i)) for i in index_groupings]
You can use join and list comprehension. here.
Output:['The number of adults who read at least one novel within the past 12 months fell to 47%. Fiction reading rose from 2002 to 2008.', '50% more. Though it decreased 10% over the last decade.', 'Just 54% of Americans cracked open a book of any kind last year.']
Python's got some batteries for this:
for grouping in index_groupings:
    if len(grouping) > 1:
        print sentences[slice(grouping[0], grouping[-1] + 1)]
    else:
        print sentences[grouping[0]]
The slice constructor is what's used under the hood when you write a[x:y:z] (it gets converted to a[slice(x, y, z)], except specialized byte codes are used to avoid some overhead). Here we just build one explicitly, since we already have the endpoints; note that a slice's end is exclusive, so the last index is bumped by one. This is more convenient than indexing or unpacking and then using slicing syntax.
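If you want each group printed as one concatenated string (as in the desired output above) rather than as a list of sentences, you can also join over the indices directly; the parentheses make this print work in both Python 2 and 3:

for grouping in index_groupings:
    print("".join(sentences[i] for i in grouping))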