Splitting words in a column - python

I have a csv with a msg column and it has the following text:
muchloveandhugs
dudeseriously
onemorepersonforthewin
havefreebiewoohoothankgod
thisismybestcategory
yupbabe
didfreebee
heykidforget
hecomplainsaboutit
I know that nltk.corpus.words has a bunch of sensible words. My problem is how to iterate over the df['msg'] column so that I can get words such as
df['msg']
much love and hugs
dude seriously
one more person for the win

From this question about splitting words in strings with no spaces, and not knowing exactly what your data looks like:
import pandas as pd
import wordninja

filename = 'mycsv.csv'  # Put your filename here
df = pd.read_csv(filename)
for wordstring in df['msg']:
    split = wordninja.split(wordstring)
    # Do something with split
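If you prefer to stay with nltk.corpus.words and avoid an extra dependency, the splitting itself can be sketched as a greedy longest-match over a vocabulary set. A toy word set stands in for the full corpus here; note that greedy matching can mis-split ambiguous strings, which is why wordninja uses word frequencies instead:

```python
def split_words(text, vocab):
    """Greedy longest-match segmentation over a known vocabulary.

    Emits a single character when no word matches, so it always terminates.
    """
    words, i = [], 0
    while i < len(text):
        # Try the longest vocabulary word starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # unknown character, keep as-is
            i += 1
    return " ".join(words)

# Tiny hand-picked vocabulary for illustration; in practice build a
# set from nltk.corpus.words for O(1) membership tests.
vocab = {"much", "love", "and", "hugs", "dude", "seriously",
         "one", "more", "person", "for", "the", "win"}
print(split_words("muchloveandhugs", vocab))       # much love and hugs
print(split_words("onemorepersonforthewin", vocab))  # one more person for the win
```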

Related

Python/Pandas - split text into columns by delimiter ; and create a csv file

I have a long text where I have inserted a delimiter ";" exactly where I would like to split the text into different columns.
So far, whenever I try to split the text into 'ID' and 'ADText' I only get the first line. However there should be 1439 lines/rows in two columns.
My text looks like this:
1234; text in written from with multiple sentences going over multiple lines until at some point the next ID is written dwon 2345; then the new Ad-Text begins until the next ID 3456; and so on
I want to use the ; to split my text into two columns, one with the ID and one with the ad text.
# read the text file into python:
jobads = pd.read_csv("jobads.txt", header=None)
print(jobads)
# create dataframe
df = pd.DataFrame(jobads, index=None, columns=None)
type(df)
print(df)
# name the column to target it for the split
df = df.rename(columns={0: "Job"})
print(df)
# split it into two columns. Problem: I only get the first row.
print(pd.DataFrame(df.Job.str.split(';', 1).tolist(),
                   columns=['ID', 'AD']))
Unfortunately that only works for the first entry and then it stops. The output looks like this:
ID AD
0 1234 text in written from with ...
Where am I going wrong? I would appreciate any advice =)
Thank you!
Sample text:
FullName;ISO3;ISO1;molecular_weight
Alanine;Ala;A;89.09
Arginine;Arg;R;174.20
Asparagine;Asn;N;132.12
Aspartic_Acid;Asp;D;133.10
Cysteine;Cys;C;121.16
Create columns based on ";" separator:
import pandas as pd
f = "aminoacids"
df = pd.read_csv(f,sep=";")
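For the original two-column goal (split at the first ";" only), Series.str.split with expand=True is a direct route; the column name Job is taken from the question's rename, and the sample rows here are placeholders:

```python
import pandas as pd

df = pd.DataFrame({"Job": ["1234; first ad text", "2345; second ad text"]})
# n=1 splits only at the first ";"; expand=True returns a DataFrame
# instead of a Series of lists, so it can be assigned to two columns.
df[["ID", "AD"]] = df["Job"].str.split(";", n=1, expand=True)
print(df[["ID", "AD"]])
```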
EDIT: Considering the comment, I assume the text looks more like this:
t = """1234; text in written from with multiple sentences going over multiple lines until at some point the next ID is written dwon 2345; then the new Ad-Text begins until the next ID 3456; and so on1234; text in written from with multiple """
In this case, a regex like this will split your string into IDs and texts, which you can then use to build a pandas dataframe:
import re
r = re.compile("([0-9]+);")
re.split(r,t)
Output:
['',
'1234',
' text in written from with multiple sentences going over multiple lines until at some point the next ID is written dwon ',
'2345',
' then the new Ad-Text begins until the next ID ',
'3456',
' and so on',
'1234',
' text in written from with multiple ']
EDIT 2:
This is a response to the questioner's additional question in the comments: how to convert this string to a pandas dataframe with two columns, IDs and Texts?
import pandas as pd

# a is the output list from the previous part of this answer.
# a[::2] takes every other item, starting with the FIRST one;
# [1:] then drops the leading empty string.
texts = a[::2][1:]
print(texts)
# a[1::2] takes every other item, starting with the SECOND one.
ids = a[1::2]
print(ids)
df = pd.DataFrame({"IDs": ids, "Texts": texts})
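Put together, the two edits above amount to this runnable sketch (sample string shortened):

```python
import re
import pandas as pd

t = "1234; first ad text 2345; second ad text"
# The capture group makes re.split keep the IDs in the output list.
a = re.split(re.compile(r"([0-9]+);"), t)
# a[0] is the empty string before the first ID; IDs and texts then alternate.
ids, texts = a[1::2], a[2::2]
df = pd.DataFrame({"IDs": ids, "Texts": texts})
print(df)
```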

Is there any way in Python to auto-correct spelling mistakes in multiple rows of a single column of an Excel file?

I am working on sentiment analysis for a college project. I have an excel file with a column named "comments" that has 1000 rows. The sentences in these rows contain spelling mistakes, and for the analysis I need them corrected. I don't know how to process this so that I get a column of corrected sentences using Python.
All the methods I found correct the spelling of a single word, not a whole sentence, and not at the column level with hundreds of rows.
You can use pyspellchecker for this:
import pandas as pd
from spellchecker import SpellChecker

spell = SpellChecker()
df = pd.DataFrame(['hooww good mrning playing fotball studyiing hard'], columns=['text'])

def spell_check(x):
    corrected = []
    for word in x.split():
        # Note: recent pyspellchecker versions return None when no
        # correction is found, so `spell.correction(word) or word` is safer.
        corrected.append(spell.correction(word))
    return ' '.join(corrected)

df['spell_corrected_sentence'] = df['text'].apply(spell_check)
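If installing pyspellchecker isn't an option, the standard library's difflib can serve as a rough stand-in. This is only a sketch with a hand-picked vocabulary; real spell checkers use word frequencies and edit distance:

```python
import difflib

# Hand-picked vocabulary for illustration; a real checker would use a
# full dictionary with word frequencies.
VOCAB = ["how", "good", "morning", "playing", "football", "studying", "hard"]

def naive_correct(word):
    # get_close_matches returns the closest vocabulary words by
    # similarity ratio; n=1 keeps only the best candidate.
    match = difflib.get_close_matches(word, VOCAB, n=1, cutoff=0.6)
    return match[0] if match else word

sentence = "hooww good mrning playing fotball studyiing hard"
print(" ".join(naive_correct(w) for w in sentence.split()))
```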

Get each unique word in a csv file tokenized

There are two columns in my CSV table: one is summaries and the other is texts. Both columns were lists before I combined them, converted them to a data frame, and saved it as a CSV file. BTW, the texts in the table have already been cleaned (all punctuation removed and converted to lower case):
I want to loop through each cell in the table, split summaries and texts into words, and tokenize each word. How can I do it?
I tried the python CSV reader and df.apply(word_tokenize). I also tried newList=set(summaries+texts), but then I could not tokenize them.
Any solution is welcome, whether it uses the CSV file, a data frame, or a list. Thanks for your help in advance!
note: The real table has more than 50,000 rows.
===some update===
Here is the code I have tried:
import pandas as pd

data = pd.read_csv('test.csv')
data.head()
newTry = data.apply(lambda x: " ".join(x), axis=1)
type(newTry)
print(newTry)

import nltk
for sentence in newTry:
    new = sentence.split()
    print(new)
    print(set(new))
Please refer to the printed output. There are duplicate words in the lists, and some square brackets. How should I remove them? I tried set, but it only gives the unique words of one sentence at a time.
You can use the built-in csv package to read the CSV file and nltk to tokenize the words:
from nltk.tokenize import word_tokenize
import csv

words = []

def get_data():
    with open("sample_csv.csv", "r") as records:
        for record in csv.reader(records):
            yield record

data = get_data()
next(data)  # skip header
for row in data:
    for sent in row:
        for word in word_tokenize(sent):
            if word not in words:
                words.append(word)

print(words)
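One note on scale: with 50,000+ rows, the `if word not in words` list check is quadratic; collecting into a set keeps membership tests constant-time. A sketch with inline data in place of the CSV, and plain split standing in for word_tokenize:

```python
# Inline stand-in for the CSV rows (summaries column, texts column).
rows = [
    ["a summary about cats", "a longer text about cats and dogs"],
    ["dogs summary", "more text about dogs"],
]

unique_words = set()
for row in rows:
    for cell in row:
        # str.split stands in for nltk's word_tokenize here;
        # set.update deduplicates as it goes.
        unique_words.update(cell.split())

print(sorted(unique_words))
```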

Pandas python replace empty lines with string

I have a csv which at some point becomes like this:
57926,57927,"79961', 'dsfdfdf'",fdfdfdfd,0.40997048,5 x fdfdfdfd,
57927,57928,"fb0ec52878b165aa14ae302e6064aa636f9ca11aa11f5', 'fdfd'",fdfdfd,1.64948454,20 fdfdfdfd,"
US
"
57928,57929,"f55bf599dba600550de724a0bec11166b2c470f98aa06', 'fdfdf'",fdfdfd,0.81300813,10 fdfdfdfd,"
US
"
57929,57930,"82e6b', 'reetrtrt'",trtretrtr,0.79783365,fdfdfdf,"
NL
I want to get rid of these empty lines. So far I have tried the following:
df = pd.read_csv("scedon_etoimo.csv")
df = df.replace(r'\\n',' ', regex=True)
and
df=df.replace(r'\r\r\r\r\n\t\t\t\t\t\t', '',regex=True)
as this is the error I am getting. So far I haven't managed to clean my file and do what I want with it. I am not sure if I am using the correct approach. I am using pandas to process my dataset. Any help?
"
I would first open and preprocess the file's data, and only then pass it to pandas:
import io
import pandas as pd

lines = []
with open('file.csv') as f:
    for line in f:
        if line.strip():
            lines.append(line.strip())

df = pd.read_csv(io.StringIO("\n".join(lines)))
Based on the file snippet you provided, here is how you can replace those empty lines, which Pandas stores as NaNs, with a blank string:
import pandas as pd
import numpy as np

df = pd.read_csv("scedon_etoimo.csv")
df = df.replace(np.nan, "", regex=True)
This will let you do everything on the base Pandas DataFrame without reading through your file(s) more than once. That being said, I would also recommend preprocessing your data before loading it, as that is often a much safer way to handle data in non-uniform layouts.
Try:
df.replace(to_replace=r'[\n\r\t]', value='', regex=True, inplace=True)
This instruction replaces each \n, \r and Tab with an empty string.
Due to the inplace argument, there is no need to assign the result back to df.
Alternative: use to_replace=r'\s' to eliminate spaces as well,
maybe in selected columns only.
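To restrict the replacement to selected columns, apply str.replace to just those columns; a sketch with assumed column names and data:

```python
import pandas as pd

df = pd.DataFrame({"country": ["\nUS\n", "\nNL"], "qty": [5, 10]})
# Strip newlines/tabs only in the text column, leaving other columns untouched.
df["country"] = df["country"].str.replace(r"[\n\r\t]", "", regex=True)
print(df)
```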

Selecting and importing only certain columns from an excel file

I have an excel file which contains many columns with strings, but I want to import only the columns of this excel file containing 'NGUYEN'.
I want to generate a string from columns in my excel which had 'NGUYEN' in them.
import pandas as pd
data = pd.read_excel("my_excel.xlsx", parse_cols='NGUYEN' in col for cols in my_excel.xlsx, skiprows=[0])
data = data.to_string()
print(data)
SyntaxError: invalid syntax
my_excel.xlsx
Function output should be
data = 'NGUYEN VIETNAM HANOIR HAIR PANTS BIKES CYCLING ORANGE GIRL TABLE DARLYN NGUYEN OMG LOL'
I'm pretty sure this is what you are looking for. I tried to make it as simple and compact as possible; if you need help turning it into a more readable multi-line function, let me know!
import pandas as pd
data = pd.read_excel("my_excel.xlsx")
getColumnsByContent = lambda string: ' '.join(
    [' '.join([elem for elem in data[column]])
     for column in data.columns
     if string in data[column].to_numpy()])
print(getColumnsByContent('NGUYEN'))
print(getColumnsByContent('PANTS'))
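For reference, the compact one-liner can also be written as a plain multi-line function with the same behavior; the small sample frame here is a stand-in for the real my_excel.xlsx data:

```python
import pandas as pd

# Stand-in data; in the original answer this comes from pd.read_excel.
data = pd.DataFrame({
    "col1": ["NGUYEN", "HANOI"],
    "col2": ["PANTS", "BIKES"],
})

def get_columns_by_content(string):
    # Collect every cell of each column whose values contain `string`,
    # then space-join them into one result string.
    parts = []
    for column in data.columns:
        if string in data[column].to_numpy():
            parts.extend(str(elem) for elem in data[column])
    return " ".join(parts)

print(get_columns_by_content("NGUYEN"))  # NGUYEN HANOI
print(get_columns_by_content("PANTS"))   # PANTS BIKES
```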
