Correcting typos in string via pandas.DataFrame - python

I have a huge list of distorted data stored as text that I need to wrangle, but I cannot figure out the best and most efficient method. Another consideration is that the data is quite large: the sample is 1.6 million rows, and production will go up to tens of millions.
In [200]:data=['Bernard 51','Ber%nard Bachelor','BER78NARD$ bsc','BERnard$d B.']
In [201]:test=pd.DataFrame(data,columns=['Names'])
In [202]:test
Out[202]:
Names
0 Bernard 51
1 Ber%nard Bachelor
2 BER78NARD$ bsc
3 BERnard$d B.
My objective is to output
Names
0 bernard
1 bernard ba
2 bernard ba
3 bernard ba
My pseudo code will be something like:
In[222]:test_processed=pd.DataFrame(test.Names.str.lower()) #convert all strings to lower case
In[223]:test_processed
Out[223]:
Names
0 bernard 51
1 ber%nard bachelor
2 ber78nard$ bsc
3 bernard$d b.
In[224]:test_processed2=pd.DataFrame(test_processed.Names.str.replace(r'[^\w\s]','',regex=True))
#removes punctuation/symbol typos
In[225]:test_processed2
Out[225]:
Names
0 bernard 51
1 bernard bachelor
2 ber78nard bsc
3 bernardd b
In[226]:BA=['bachelor','bsc','b.'] #define list to be replaced with ba
In[227]:test_processed.replace(BA,'ba') #replace list defined above with standard term
Out[227]:
Names
0 bernard 51
1 ber%nard bachelor
2 ber78nard$ bsc
3 bernard$d b.
#no change, didn't work
My observation is that replace does not work on parts of strings when a list is applied to a Pandas DataFrame.
The reason I am not using test_processed2.Names.str.replace is that str.replace does not accept a list of values to be replaced.
The reason I am using a list is that I hope to maintain it easily as more and more different variants come in. I would love to hear from you if you have a solution or a better alternative, even if it is not Python or Pandas.

test_processed.replace(BA,'ba') will only replace exact matches, not parts of entries. That is, if one of your entries were exactly 'bachelor', it would be replaced just fine. For parts of strings, you can use the regex option, as per the docs.
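For example, a rough sketch of that regex-based replacement, continuing from the lowercased and de-punctuated test_processed2 above (the word-boundary pattern and the escaping are my own additions, and note that 'b.' has already lost its dot by that point):
import re
# build one alternation like r'\b(?:bachelor|bsc|b)\b', longest variants first
BA = ['bachelor', 'bsc', 'b']
pattern = r'\b(?:' + '|'.join(map(re.escape, sorted(BA, key=len, reverse=True))) + r')\b'
test_processed2['Names'] = test_processed2['Names'].str.replace(pattern, 'ba', regex=True)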
There is also the built-in str.replace, which works on plain Python strings. So, for example, if you have a list data and you want to replace all instances of 'bsc' with 'ba', what you do is this:
data = [d.replace('bsc', 'ba') for d in data]
For the whole list of replacements you can do:
for b in BA:
    data = [d.replace(b, 'ba') for d in data]
Now, while I feel this is exactly what you are asking about, I should mention that it is ultimately not the right way to fix typos. Imagine you have the entry "B.Bernard, msc": you would replace "B." with "ba" even though that should not happen. The algorithm is very basic and therefore error-prone.

Related

How to solve Python Pandas assign error when creating new column

I have a dataframe containing home descriptions:
description
0 Beautiful, spacious skylit studio in the heart...
1 Enjoy 500 s.f. top floor in 1899 brownstone, w...
2 The spaceHELLO EVERYONE AND THANKS FOR VISITIN...
3 We welcome you to stay in our lovely 2 br dupl...
4 Please don’t expect the luxury here just a bas...
5 Our best guests are seeking a safe, clean, spa...
6 Beautiful house, gorgeous garden, patio, cozy ...
7 Comfortable studio apartment with super comfor...
8 A charming month-to-month home away from home ...
9 Beautiful peaceful healthy homeThe spaceHome i...
I'm trying to count the number of sentences on each row (using sent_tokenize from nltk.tokenize) and append those values as a new column, sentence_count, to the df. Since this is part of a larger data pipeline, I'm using pandas assign so that I could chain operations.
I can't seem to get it to work, though. I've tried:
df.assign(sentence_count=lambda x: len(sent_tokenize(x['description'])))
and
df.assign(sentence_count=len(sent_tokenize(df['description'])))
but both raise the following error:
TypeError: expected string or bytes-like object
I've confirmed that each row has a dtype of str. Perhaps it's because description has dtype('O')?
What am I doing wrong here? Using a pipe with a custom function works fine here, but I prefer using assign.
x['description'], when you pass it to sent_tokenize in the first example, is a pandas Series, not a string. It is a Series (similar to a list) of strings.
So instead you should apply the tokenizer row by row and take the length of the result:
df.assign(sentence_count=df['description'].apply(lambda d: len(sent_tokenize(d))))
The lambda is also the place to pass extra parameters to sent_tokenize if you need them, e.g. sent_tokenize(d, language='english').
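For completeness, a small runnable sketch of the whole thing (the example descriptions are made up; this assumes nltk is installed and the punkt tokenizer data has been downloaded):
import pandas as pd
from nltk.tokenize import sent_tokenize  # requires nltk.download('punkt') having been run once

df = pd.DataFrame({'description': [
    'Beautiful, spacious skylit studio. Close to the subway.',
    'Enjoy the top floor. Quiet street. Garden access.',
]})

# apply() feeds each string to sent_tokenize; len() turns the sentence list into a count
df = df.assign(sentence_count=df['description'].apply(lambda d: len(sent_tokenize(d))))
print(df)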

Remove empty rows within a dataframe and check similarity

I am having some difficulty selecting non-empty fields using regex (findall) within my dataframe, looking for words contained in a text source:
text = "Be careful otherwise police will capture you quickly."
I need to look for words ending with ful in my text string, then look for words ending with ful in my dataset.
Author DF_Text
31 Better the devil you know than the one you don't
53 Beware the door with too many keys.
563 Be careful what you tolerate. You are teaching people how to treat you.
41 Fear the Greeks bearing gifts.
539 NaN
51 The honey is sweet but the bee has a sting.
21 Be careful what you ask for; you may get it.
(from csv/txt file).
I need to extract the words ending with ful in text, then find the DF_Text rows (and thus the Authors) that contain words ending with ful, and append the results to a list.
n=0
for i in df['DF_Text']:
    print(re.findall(r"\w+ful", i))
    n=n+1
print(n)
My question is: how can I remove the empty results ([]) coming from the NaN rows from the analysis, and report the Author names (e.g. 563, 21) the matches belong to?
I will be happy to provide further information in case anything is unclear.
Use str.findall instead of looping with re.findall:
df["found"] = df["DF_Text"].str.findall(r"(\w+ful)")
df.loc[df["found"].str.len().eq(0),"found"] = df["Author"]
print (df)
Author DF_Text found
0 31 Better the devil you know than the one you don't 31
1 53 Beware the door with too many keys. 53
2 563 Be careful what you tolerate. You are teaching... [careful]
3 41 Fear the Greeks bearing gifts. 41
4 539 NaN NaN
5 51 The honey is sweet but the bee has a sting. 51
6 21 Be careful what you ask for; you may get it. [careful]
I would use the .notna() function of Pandas to get rid of that row in your dataframe.
Something like this
df = df[df['DF_Text'].notna()]
And please note that the expression references the dataframe twice before overwriting it; this is intended.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html
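Putting the two answers together, a rough sketch that drops the NaN row first, keeps only the rows with at least one ...ful match, and reports the Authors those matches belong to (continuing with the df from the question):
df = df[df['DF_Text'].notna()]                       # drop the NaN row (Author 539)
df['found'] = df['DF_Text'].str.findall(r'\w+ful')   # list of matches per row
matches = df.loc[df['found'].str.len() > 0, ['Author', 'found']]
print(matches)                                       # Authors 563 and 21, each with ['careful']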

How to use a complex conditional to fill a column cell by cell in pandas in an efficient way?

I have several old books where each page is filled with historical records of immigrants and their families. Most variables were filled in only for the father, usually regarded as the family's chief. So, for example, if the immigrant family is going to live in a city called "small city in the West", only the father would carry this information, while the mother and children were assumed to go to the same destination. Additionally, some observations have no information at all, not even for the father.
What I want to do is simply fill in the missing values for the relatives within the same family (i.e., with the same boss). I have reached a solution, but it is too inefficient and I am afraid I am overcomplicating something that is rather simple. Below I use an example dataset to show my solution.
Example dataset:
m=1
test=pd.DataFrame({'destino_heranca':['A','','','','C','']*m, 'num_familia_raw':[1,1,2,2,3,3]*m}, index=range(6*m))
test
Note that individual 1 gets city A as destination, since he is from family 1. On the other hand, family 2 must remain missing in the final dataset, since I have no information even for the boss.
destino_heranca num_familia_raw
0 A 1
1 1
2 2
3 2
4 C 3
5 3
Then, I create a dictionary called isdest_null where the keys are the family numbers and the values are boolean values, True if the family's boss has information and False otherwise:
def num_familia_raw_dest(m):
    return list(set(test[test['num_familia_raw']==m].destino_heranca.values))

isdest_null={k:('' in num_familia_raw_dest(k)) & (len(num_familia_raw_dest(k))==1) for k in test.num_familia_raw.unique()}
In a separate executable file called heritage.py I define the following function:
import numpy as np

def heritage(col, data, empty_map):
    for k in data.num_familia_raw.unique():
        if empty_map[k]:
            data[data.num_familia_raw==k]=data[data.num_familia_raw==k].replace({'{}_heranca'.format(col):{'':'nao informado'}})
    #information doesn't exist
    condition1=(data['{}_heranca'.format(col)]=='')
    #same family
    condition2=(data['num_familia_raw']==data['num_familia_raw'].shift(1))
    while '' in data.groupby('num_familia_raw').last()['{}_heranca'.format(col)].values:
        data['{}_heranca'.format(col)]=np.where(condition1 & condition2,data['{}_heranca'.format(col)].shift(1),data['{}_heranca'.format(col)])
    return data['{}_heranca'.format(col)]
Running the full code with the appropriate imports yields:
0 A
1 A
2 nao informado
3 nao informado
4 C
5 C
which is exactly what I want. However, this solution is hugely inefficient and my real data has almost 2 million rows.
Measuring performance with timeit
I'm trying to measure the performance of my implementation to compare it with other solutions that I eventually develop, and I would be very grateful if someone helped me understand it better. Here is my code:
import timeit
timeit.timeit("heritage('destino', data=test, empty_map=isdest_null)",number=1000, globals=globals())
output:
23.539601539001524
I'm not sure how to interpret it. According to the documentation this means 23 seconds per loop, but what does that mean in my case?
If the available destino_heranca always appears first in each num_familia_raw, then you can do a transform:
test['destino_heranca'] = (test.groupby('num_familia_raw')['destino_heranca']
                               .transform('first')
                               .replace('','nao informado')
                          )
Output:
destino_heranca num_familia_raw
0 A 1
1 A 1
2 nao informado 2
3 nao informado 2
4 C 3
5 C 3
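On the timeit part of the question: timeit.timeit(..., number=1000) returns the total time for 1000 executions, so the 23.5 above is roughly 23.5 ms per call of heritage, not 23 seconds. A rough sketch of timing the transform version the same way (wrapped in a function so it can be passed as a callable; absolute numbers will vary by machine):
import timeit

def fill_destino(df):
    return (df.groupby('num_familia_raw')['destino_heranca']
              .transform('first')
              .replace('', 'nao informado'))

total = timeit.timeit(lambda: fill_destino(test), number=1000)
print(total / 1000, 'seconds per call')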

Speed up a loop filtering a string [duplicate]

This question already has answers here:
Filter pandas DataFrame by substring criteria
(17 answers)
Closed 3 years ago.
I want to filter a column containing tweets (3+million rows) in a pandas dataframe by dropping those tweets that do not contain a keyword/s. To do this, I'm running the following loop (sorry, I'm new to python):
filter_word_indicators = []
for i in range(1, len(df)):
    if 'filter_word' in str(df.tweets[0:i]):
        indicator = 1
    else:
        indicator = 0
    filter_word_indicators.append(indicator)
The idea is to then drop tweets if the indicator equals 0. The problem is that this loop is taking forever to run. I'm sure there is a better way to drop tweets that do not contain my 'filter_word', but I don't know how to code it up. Any help would be great.
Check out pandas.Series.str.contains. To keep only the tweets that contain the keyword (i.e., drop those that do not), use:
df = df[df.tweets.str.contains('filter_word')]
Prefixing the mask with ~ inverts it and drops the matches instead, as in the MWE below.
MWE
In [0]: df = pd.DataFrame(
[[1, "abc"],
[2, "bce"]],
columns=["number", "string"]
)
In [1]: df
Out[1]:
number string
0 1 abc
1 2 bce
In [2]: df[~df.string.str.contains("ab")]
Out[2]:
number string
1 2 bce
Timing
Ran a small timing test on the following synthetic DataFrame with three million random strings the size of a tweet
import random
import string

import pandas as pd

df = pd.DataFrame(
    [
        "".join(random.choices(string.ascii_lowercase, k=280))
        for _ in range(3000000)
    ],
    columns=["strings"],
)
and the keyword abc, comparing the original solution, map + regex and this proposed solution (str.contains). The results are as follows.
original 99s
map + regex 21s
str.contains 2.8s
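If you want to reproduce a comparison like this yourself, one way is to wrap each filter in timeit (continuing with the synthetic df above; absolute numbers depend on the machine):
import timeit

# one pass of the vectorized substring filter over the 3-million-row frame
t = timeit.timeit(lambda: df[df.strings.str.contains("abc")], number=1)
print(f"str.contains: {t:.1f}s")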
I created the following example:
df = pd.DataFrame("""Suggested order for Amazon Prime Doctor Who series
Why did pressing the joystick button spit out keypresses?
Why tighten down in a criss-cross pattern?
What exactly is the 'online' in OLAP and OLTP?
How is hair tissue mineral analysis performed?
Understanding the reasoning of the woman who agreed with King Solomon to "cut the baby in half"
Can Ogre clerics use Purify Food and Drink on humanoid characters?
Heavily limited premature compiler translates text into excecutable python code
How many children?
Why are < or > required to use /dev/tcp
Hot coffee brewing solutions for deep woods camping
Minor traveling without parents from USA to Sweden
Non-flat partitions of a set
Are springs compressed by energy, or by momentum?
What is "industrial ethernet"?
What does the hyphen "-" mean in "tar xzf -"?
How long would it take to cross the Channel in 1890's?
Why do all the teams that I have worked with always finish a sprint without completion of all the stories?
Is it illegal to withhold someone's passport and green card in California?
When to remove insignificant variables?
Why does Linux list NVMe drives as /dev/nvme0 instead of /dev/sda?
Cut the gold chain
Why do some professors with PhDs leave their professorships to teach high school?
"How can you guarantee that you won't change/quit job after just couple of months?" How to respond?""".split('\n'), columns = ['Sentence'])
You can just create a simple function with a regular expression (more flexible if you need to handle capital letters):
import re

def tweetsFilter(s, keyword):
    return bool(re.match('(?i).*(' + keyword + ').*', s))
This function can be called to obtain a boolean series marking the strings which contain the specific keyword. The map can speed up your script (you need to test it!):
keyword = 'Why'
sel = df.Sentence.map(lambda x: tweetsFilter(x, keyword))
df[sel]
And we obtained:
Sentence
1 Why did pressing the joystick button spit out ...
2 Why tighten down in a criss-cross pattern?
9 Why are < or > required to use /dev/tcp
17 Why do all the teams that I have worked with a...
20 Why does Linux list NVMe drives as /dev/nvme0 ...
22 Why do some professors with PhDs leave their p...
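As a side note, the same case-insensitive filter can also be done without a Python-level map by letting str.contains ignore case (a vectorized alternative, shown with the example df above):
df[df.Sentence.str.contains('why', case=False)]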

Iterating over a list of strings for operations in Python

I'm trying to do a rather simple loop in Python 3.6.1 that involves a list of strings. Essentially, I have a dataframe that looks like this:
X_out Y_out Z_out X_in Y_in Z_in
Year
1969 4 3 4 4 3 3
1970 2 0 1 3 2 2
1971 3 1 1 0 1 2
1972 2 0 0 3 1 0
and I'd like to find the net change of X, Y, and Z, making them new columns in this dataframe.
In its simplest form, this could be
df['x_net'] = df['x_in'] - df['x_out']
df['y_net'] = df['y_in'] - df['y_out']
df['z_net'] = df['z_in'] - df['z_out']
but in actuality, there are about fifteen columns that need to be created in this way. Since it'll be a bear, I figure it's best to put in a function, or at least a loop. I made a list of our initial "root" variables, without the suffixes that looks like this:
root_vars = ['x', 'y', 'z']
And I think that my code might(?) look something like:
for i in root_vars:
    df['%s_net'] = df['%s_in'] - df['%s_out'] %(root_vars_[i])
but that's definitely not right. Could someone give me a hand on this one please?
Thank you so much!
You can use the relatively new (Python 3.6) formatted string literals:
for i in root_vars:
    df[f'{i}_net'] = df[f'{i}_in'] - df[f'{i}_out']
The f prefix before each string causes the {i} to be replaced with the value of the variable i. If you want the code to be usable in Python versions before 3.6, you can go with the more usual formatting:
for i in root_vars:
    df['{}_net'.format(i)] = df['{}_in'.format(i)] - df['{}_out'.format(i)]
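If you would rather keep the whole thing as a single chained expression, the loop can also be written as one assign call with a dict comprehension; this assumes the lowercase column names used in the question:
df = df.assign(**{f'{v}_net': df[f'{v}_in'] - df[f'{v}_out'] for v in root_vars})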
