I turned this Twitter corpus into a pandas DataFrame and was trying to find the non-English tweets and delete them from the DataFrame, so I did this:
for j in range(0, 150):
    if not wordnet.synsets(df.i[j]):  # comparing if word is non-English
        df.drop(j)
print(df.shape)
But when I check the shape, no rows were dropped.
Am I using the drop function wrong, or do I need to keep track of the index of the row?
That's because df.drop() returns a copy instead of modifying your original dataframe. Try setting inplace=True:
for j in range(0, 150):
    if not wordnet.synsets(df.i[j]):  # comparing if word is non-English
        df.drop(j, inplace=True)
print(df.shape)
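If you would rather avoid mutating the frame inside a loop, a boolean mask does the same thing in one step. A minimal sketch, assuming the text column really is named i and only the first 150 rows matter, as in your loop:

english = df["i"].iloc[:150].apply(lambda w: bool(wordnet.synsets(w)))
df = df.iloc[:150][english]  # keep only rows whose word has a WordNet synset
print(df.shape)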
This will filter out all the non-English rows in the pandas dataframe:
import nltk
nltk.download('words')
from nltk.corpus import words
import pandas as pd

data1 = pd.read_csv("testdata.csv")

# keep only rows whose text contains at least one English dictionary word
Word = list(set(words.words()))
df_final = data1[data1['column_name'].str.contains('|'.join(Word))]
print(df_final)
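One caveat: '|'.join(Word) also matches substrings ("cat" matches inside "concatenate"), and a huge alternation is slow on large data. If whole-word matching is what you want, a sketch with word boundaries (re.escape is a precaution; the words corpus is alphabetic anyway):

import re

pattern = r'\b(?:{})\b'.format('|'.join(map(re.escape, Word)))
df_final = data1[data1['column_name'].str.contains(pattern)]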
I have a pandas dataframe pd like this:
https://imgur.com/a/6TM3B3o
I want to filter on the df_participants column, and the expected result should be a new pd1 containing only the rows where pd['df_pair'][0] is a subset of df_participants, like this: https://imgur.com/EzCcuh3
I have no idea how to do that. I have tried .isin() and pd1 = pd[pd['df_participants'].str.contains(pd['df_pair'][0])], but neither works. Any ideas?
I think pd is not a good variable name for a DataFrame; better to use df:
# remove NaN rows
df = df.dropna(subset=['df_participants'])

# keep rows where df['df_pair'][0] is a subset of df_participants
df = df[df.df_participants.map(set(df['df_pair'][0]).issubset)]
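A toy example of the pattern (the column contents are invented to match the screenshots):

import pandas as pd

df = pd.DataFrame({
    'df_pair': [['a', 'b'], ['a', 'b'], ['a', 'b']],
    'df_participants': [['a', 'b', 'c'], ['a'], None],
})

df = df.dropna(subset=['df_participants'])
df = df[df.df_participants.map(set(df['df_pair'][0]).issubset)]
print(df)  # keeps only the first row: {'a', 'b'} is a subset of its participants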
I have a pandas df which has 6 columns; the last one is input_text. I want to remove from df all rows that have non-English text in that column. I would like to use langdetect's detect function.
Some template:
from langdetect import detect
import pandas as pd

def filter_nonenglish(df):
    new_df = None  # Do some magical operations here to create the filtered df
    return new_df

df = pd.read_csv('somecsv.csv')
df_new = filter_nonenglish(df)
print('New df is: ', df_new)
Note! It doesn't matter what the other 5 columns are.
Also note: using detect is as simple as:
t = 'I am very cool!'
print(detect(t))
Output is:
en
You can do it as below on your df to get all the rows with English text in the input_text column:
df_new = df[df.input_text.apply(detect).eq('en')]
So basically, just apply the langdetect.detect function to the values in the input_text column and keep all the rows whose text is detected as "en".
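One hedged refinement: detect raises a LangDetectException on strings it cannot profile (empty or, say, purely numeric text), and its output is non-deterministic unless the factory seed is fixed, so a safe wrapper can help:

from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make detect deterministic across runs

def is_english(text):
    # treat undetectable text as non-English instead of crashing the apply
    try:
        return detect(text) == 'en'
    except LangDetectException:
        return False

df_new = df[df.input_text.apply(is_english)]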
I have a tweet dataset (taken from NLTK) which is currently in a pandas dataframe, but I need to stem it. I have tried many different ways and get different errors, such as
AttributeError: 'Series' object has no attribute 'lower'
and
KeyError: 'text'
I don't understand the KeyError, as the column is definitely called 'text'. However, I understand that I need to change the dataframe to a string in order for the stemmer to work (I think).
Here is an example of the data:
from nltk.corpus import twitter_samples
from nltk.stem.snowball import SnowballStemmer
from pandas import DataFrame

stemmer = SnowballStemmer("english")

negative_tweets = twitter_samples.strings('negative_tweets.json')
negtweetsdf = DataFrame(negative_tweets, columns=['text'])
print(stemmer.stem(negtweetsdf['text']))  # raises the AttributeError above
You need to apply the stemming function to the series, as follows:
negtweetsdf['text'].apply(stemmer.stem)
This will create a new series.
Functions that expect a single string value or similar will not simply work on a pandas dataframe or series. They need to be applied to the entire series, which is why .apply is used.
Here is a worked example with lists inside a dataframe column.
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import TweetTokenizer
import pandas as pd

stemmer = SnowballStemmer("english")

df = pd.DataFrame([['some extremely exciting tweet'], ['another']], columns=['tweets'])

# put each row's string into a list
df = pd.DataFrame(df.apply(list, axis=1), columns=['tweets'])

# for each row (apply), for each item in the list, apply the stemmer
# and return a list containing the stems
df['tweets'].apply(lambda x: [stemmer.stem(y) for y in x])
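Since TweetTokenizer is imported but unused above, here is a variant (an assumption about intent, not part of the original answer) that tokenizes each tweet and stems token by token:

tokenizer = TweetTokenizer()

df2 = pd.DataFrame([['some extremely exciting tweet'], ['another']], columns=['tweets'])
df2['stems'] = df2['tweets'].apply(
    lambda s: [stemmer.stem(tok) for tok in tokenizer.tokenize(s)]
)
print(df2['stems'])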
The code below reads a column (named "First") and looks for the string "TOM".
I want to go through all the columns in the file (not just the "First" column). I was thinking of doing something like excelFile[i][j], where i and j are set in a loop, but that does not work. Any ideas?
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import re

excelFile = pd.read_excel("test.xls")

for i in excelFile.index:
    match = re.match(".*TOM.*", excelFile['First'][i])
    if match:
        print(excelFile['First'][i])
        print("check")
excelFile.apply(lambda col: col.astype(str).str.contains('TOM')).any(axis=None) will return a single boolean telling you whether the value was found anywhere in the dataframe.
Documentation for pd.DataFrame.any
To print where the value was found, get the columns from the dataframe and use iterrows:
# Create a list of columns in the dataframe
columns = excelFile.columns.tolist()

# Loop through indices and rows in the dataframe using iterrows
for index, row in excelFile.iterrows():
    # Loop through columns
    for col in columns:
        cell = row[col]
        # If we find it, print it out; str() guards against non-string cells
        if re.match(".*TOM.*", str(cell)):
            print(f'Found at Index: {index} Column: {col}')
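As a sketch of a loop-free alternative (astype(str) is an assumption here, guarding against numeric or NaN cells):

mask = excelFile.apply(lambda col: col.astype(str).str.contains('TOM'))
print(mask.any(axis=None))          # True if "TOM" appears anywhere
print(excelFile[mask.any(axis=1)])  # the rows where any column matches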
Something like this loops through all of the column names looking for a string match:
for column in excelFile:
    if 'tom' in column.lower():
        print(column)
I am doing 2 things:
1) filtering a dataframe in pandas
2) cleaning unicode text in a specific column of the filtered dataframe
import pandas as pd
import probablepeople
from unidecode import unidecode
import re

# read data
df1 = pd.read_csv("H:\\data.csv")

# filter
df1 = df1[(df1.gender == "female")]

# reset index, because otherwise indexes will be as per the original dataframe
df1 = df1.reset_index()
Now I am trying to clean the unicode text in the address column:
# clean unicode text
for i in range(10):
    df1.loc[i][16] = re.sub(r"[^a-zA-Z.,' ]", r' ', df1.address[i])
However, I am unable to do so, and below is the error I am getting:
c:\python27\lib\site-packages\ipykernel\__main__.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I think you can use str.replace:
df1 = df1[df1.gender == "female"]

# reset index with drop=True if you need a new monotonic index (0, 1, 2, ...)
df1 = df1.reset_index(drop=True)

# regex=True is needed in newer pandas, where the default is a literal match
df1.address = df1.address.str.replace(r"[^a-zA-Z.,' ]", ' ', regex=True)
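A quick check on made-up data (column names follow the question; the values are invented):

import pandas as pd

df1 = pd.DataFrame({'gender': ['female', 'male'],
                    'address': ['12 Rüe St.', 'ignored']})
df1 = df1[df1.gender == "female"].reset_index(drop=True)
df1.address = df1.address.str.replace(r"[^a-zA-Z.,' ]", ' ', regex=True)
print(df1.address[0])  # the digits and the non-ASCII letter become spaces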