NLTK gives error expected string or bytes-like object - python

I imported a dataset (.csv) with pandas. The first column contains the tweets; I rename it and convert it to a NumPy array with .values, as usual. Then I start the pre-processing with NLTK. This works pretty much every time, except for this dataset, where it fails with TypeError: expected string or bytes-like object, and I can't figure out why. The text contains some weird stuff, but far from the worst I've seen. Can someone help out?
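import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

data = pd.read_csv("facebook.csv")
text = data["Anonymized Message"].values
X = []
for i in range(0, len(text)):
    # keep letters only; this is the call that raises the TypeError
    tweet = re.sub("[^a-zA-Z]", " ", text[i])
    tweet = tweet.lower()
    tweet = tweet.split()
    ps = PorterStemmer()
    # stem every word that is not an English stop word
    tweet = [ps.stem(word) for word in tweet if not word in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    X.append(tweet)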
This gives me the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-52-a08c1779c787> in <module>()
1 text_train = []
2 for i in range(0, len(text)):
----> 3 tweet = re.sub("[^a-zA-Z]", " ", text[i])
4 tweet = tweet.lower()
5 tweet = tweet.split()
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py in sub(pattern, repl, string, count, flags)
189 a callable, it's passed the match object and must return
190 a replacement string to be used."""
--> 191 return _compile(pattern, flags).sub(repl, string, count)
192
193 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or bytes-like object
Here's the dataset
http://wwbp.org/downloads/public_data/dataset-fb-valence-arousal-anon.csv
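The traceback is consistent with at least one entry of text not being a str. pandas reads empty CSV cells as the float NaN, so a missing message would raise exactly this TypeError inside re.sub. Below is a minimal sketch of a guard, assuming that is the cause here (this has not been verified against the linked dataset). The helper name clean_messages and the sample data are hypothetical, and stemming and stop-word removal are left out to keep the example dependency-free.

```python
import re


def clean_messages(messages):
    """Lowercase each message and strip non-letters, skipping non-string entries."""
    cleaned = []
    for message in messages:
        if not isinstance(message, str):
            # e.g. float('nan') from an empty CSV cell; re.sub would raise TypeError
            continue
        letters_only = re.sub("[^a-zA-Z]", " ", message)
        # split() + join() collapses the runs of spaces left by the substitution
        cleaned.append(" ".join(letters_only.lower().split()))
    return cleaned


# Hypothetical sample: the middle entry mimics a missing cell
sample = ["Hello, World!", float("nan"), "It's 5 o'clock"]
print(clean_messages(sample))
```

With a pandas Series, calling .dropna() on the column before taking .values should have a similar effect, since it removes the missing rows up front.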

Related

I'm trying to extract emails, and I'm getting a TypeError [duplicate]

I'm attempting to take emails from 500 Word documents and use findall to extract them into Excel. This is the code I have so far:
import pandas as pd
from docx.api import Document
import os
import re
os.chdir('C:\\Users\\user1\\test')
path = 'C:\\Users\\user1\\test'
output_path = 'C:\\Users\\user1\\test2'
writer = pd.ExcelWriter('{}/docx_emails.xlsx'.format(output_path),engine='xlsxwriter')
worddocs_list = []
for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    worddocs_list.append(wordDoc)
data = []
for wordDoc in worddocs_list:
    match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+',wordDoc)
    data.append(match)
df = pd.DataFrame(data)
df.to_excel(writer)
writer.save()
print(df)
and I'm getting an error showing:
TypeError Traceback (most recent call last)
Input In [6], in <cell line: 19>()
17 data = []
19 for wordDoc in worddocs_list:
---> 20 match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+',wordDoc)
21 data.append(match)
24 df = pd.DataFrame(data)
File ~\anaconda3\lib\re.py:241, in findall(pattern, string, flags)
233 def findall(pattern, string, flags=0):
234 """Return a list of all non-overlapping matches in the string.
235
236 If one or more capturing groups are present in the pattern, return
(...)
239
240 Empty matches are included in the result."""
--> 241 return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
What am I doing wrong here?
Many thanks.
Your wordDoc variable doesn't contain a string; it contains a Document object. You need to look at the docx.api documentation to see how to get the body of the Word document out of the object as a string.
It looks like you first have to get the Paragraphs with wordDoc.paragraphs and then ask each one for its text, so maybe something like this?
documentText = '\n'.join([p.text for p in wordDoc.paragraphs])
And then use that as the string to match against:
match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', documentText)
If you're going to be using the same regular expression over and over, though, you should probably compile it into a Pattern object first instead of passing it as a string to findall every time:
regex = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    documentText = '\n'.join([p.text for p in wordDoc.paragraphs])
    match = regex.findall(documentText)
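Putting those pieces together, here is a sketch of the extraction step as one function. To keep it runnable without python-docx installed, extract_emails only assumes that its argument has a .paragraphs sequence whose items have a .text attribute, which is how the answer uses the Document object. The function name and the stand-in document are hypothetical.

```python
import re
from types import SimpleNamespace

# Compiled once and reused for every document
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def extract_emails(document):
    """Join the paragraph texts into one string and return all email-like matches."""
    document_text = "\n".join(p.text for p in document.paragraphs)
    return EMAIL_RE.findall(document_text)


# Stand-in for a docx Document, for illustration only
fake_doc = SimpleNamespace(paragraphs=[
    SimpleNamespace(text="Contact: alice@example.com"),
    SimpleNamespace(text="No address here"),
    SimpleNamespace(text="Or bob.smith@mail.example.org for billing"),
])
print(extract_emails(fake_doc))
```

In the real loop you would pass Document(os.path.join(path, filename)) instead of the stand-in. As far as I know, .paragraphs does not include text inside tables, so addresses that sit in table cells may need separate handling.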

Find all website links, group and count from column of dataframe - Python

I have a dataframe with the following columns: Date,Time,Tweet,Client,Client Simplified
The Tweet column sometimes contains a website link.
I am trying to define a function that extracts which links appear in the tweets and how many times each one appears.
I don't want the whole function as an answer. Before I wrap all this into a function, I am struggling with findall:
import pandas as pd
import re
csv_doc = pd.read_csv("/home/datasci/prog_datasci_2/activities/activity_2/data/TrumpTweets.csv")
URL = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', csv_doc)
The error I'm getting is:
TypeError Traceback (most recent call last)
<ipython-input-20-0085f7a99b7a> in <module>
7 # csv_doc.head()
8 tweets = csv_doc.Tweet
----> 9 URL= re.split('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',tweets)
10
11 # URL = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', csv_doc[Tweets])
/usr/lib/python3.8/re.py in split(pattern, string, maxsplit, flags)
229 and the remainder of the string is returned as the final element
230 of the list."""
--> 231 return _compile(pattern, flags).split(string, maxsplit)
232
233 def findall(pattern, string, flags=0):
TypeError: expected string or bytes-like object
Could you please let me know what is wrong?
Thanks.
Try adding r in front of the pattern. That makes it a raw string literal, so backslashes such as \( reach the regex engine unchanged. It is good practice, but it is not what causes this error.
The TypeError occurs because the re functions work on a single string, not on a DataFrame or a Series of strings. You can apply the pattern to each tweet with a simple list comprehension like this:
[re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',x) for x in csv_doc.Tweet]
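For the grouping and counting part of the question, one option is to feed the per-tweet matches into a collections.Counter. This is a sketch under assumptions: URL_RE is a deliberately simplified pattern, not the one from the question, and count_links and the sample tweets are hypothetical.

```python
import re
from collections import Counter

# Simplified pattern (an assumption): a scheme followed by non-whitespace characters
URL_RE = re.compile(r"https?://\S+")


def count_links(tweets):
    """Return a Counter mapping each link to the number of times it appears."""
    counts = Counter()
    for tweet in tweets:
        if isinstance(tweet, str):  # skip missing values such as NaN
            counts.update(URL_RE.findall(tweet))
    return counts


tweets = [
    "Great rally! https://example.com/a",
    "Read this https://example.com/a and http://example.org/b",
    "No link here",
]
print(count_links(tweets).most_common())
```

Passing csv_doc.Tweet instead of the sample list should work the same way, because iterating over a Series yields its values one by one. pandas also offers csv_doc.Tweet.str.findall(pattern) if you prefer to stay vectorized.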

TypeError: a bytes-like object is required, not 'str' with pd.read_csv

I am trying code from this website: https://datanice.wordpress.com/2015/09/09/sentiment-analysis-for-youtube-channels-with-nltk/
The code I am running into error with is:
import nltk
from nltk.probability import *
from nltk.corpus import stopwords
import pandas as pd
all = pd.read_csv("comments.csv")
stop_eng = stopwords.words('english')
customstopwords =[]
tokens = []
sentences = []
tokenizedSentences =[]
for txt in all.text:
    sentences.append(txt.lower())
    tokenized = [t.lower().encode('utf-8').strip(":,.!?") for t in txt.split()]
    tokens.extend(tokenized)
    tokenizedSentences.append(tokenized)
hashtags = [w for w in tokens if w.startswith('#')]
ghashtags = [w for w in tokens if w.startswith('+')]
mentions = [w for w in tokens if w.startswith('@')]
links = [w for w in tokens if w.startswith('http') or w.startswith('www')]
filtered_tokens = [w for w in tokens if not w in stop_eng and not w in customstopwords and w.isalpha() and not len(w)<3 and not w in hashtags and not w in ghashtags and not w in links and not w in mentions]
fd = FreqDist(filtered_tokens)
This gives me the error of:
tokenized = [t.lower().encode('utf-8').strip(":,.!?") for t in txt.split()]
TypeError: a bytes-like object is required, not 'str'
I am getting the csv with this code:
commentDataCsv = pd.DataFrame.from_dict(callFunction).to_csv("comments4.csv", encoding='utf-8')
I have replaced all pd.read_json("comments.csv") with read_csv.
In Python 3, the default string type is Unicode (str). encode converts it to a bytestring (bytes). To apply strip to a bytestring, the characters to strip must be given as bytes too:
In [378]: u'one'.encode('utf-8')
Out[378]: b'one'
In [379]: 'one'.encode('utf-8').strip(':')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-379-98728e474af8> in <module>
----> 1 'one'.encode('utf-8').strip(':')
TypeError: a bytes-like object is required, not 'str'
In [381]: 'one:'.encode('utf-8').strip(b':')
Out[381]: b'one'
If you don't encode first, you can pass ordinary (Unicode) strings:
In [382]: 'one:'.strip(':')
Out[382]: 'one'
I'd suggest going this route; otherwise the rest of your code will need the b prefix on every string literal.
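Applied to the tokenizing line from the question, dropping the encode call might look like the sketch below. The function name tokenize and the sample comment are hypothetical.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation, staying in str."""
    return [t.lower().strip(":,.!?") for t in text.split()]


tokens = tokenize("Great video!!! Thanks, #nltk: http://example.com")
print(tokens)

# The later filters keep working because the tokens are still str
print([w for w in tokens if w.startswith("#")])
```

Staying in str also matters further down in the question's code: with bytes tokens, w.startswith('#') would raise the same kind of TypeError, and membership tests against the str stop-word list would never match.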

Error : expected string or bytes-like object

I am working on pre-processing the "Job Description" column, which contains text data. I have created a dataframe and am trying to apply a function to pre-process the data, but I get the error "expected string or bytes-like object" when applying the function to the column. Please see my code below and help.
####################################################
# Function to pre-process the data
import re

def clean_text(text):
    """
    Applies some pre-processing on the given text.
    Steps :
    - Removing HTML tags
    - Removing punctuation
    - Lowering text
    """
    # remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # remove the characters [\], ['] and ["]
    text = re.sub(r"\\", "", text)
    text = re.sub(r"\'", "", text)
    text = re.sub(r"\"", "", text)
    # convert text to lowercase
    text = text.strip().lower()
    # replace all non-letters (including digits) with spaces
    text = re.sub("[^a-zA-Z]",  # Search for all non-letters
                  " ",          # Replace all non-letters with spaces
                  str(text))
    # replace punctuation characters with spaces
    filters = '!"\'#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
    translate_dict = dict((c, " ") for c in filters)
    translate_map = str.maketrans(translate_dict)
    text = text.translate(translate_map)
    return text
#############################################################
# Apply the clean_text function to the job_description column of the data frame
df['jobnew'] = df['job_description'].apply(clean_text)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-33-c15402ac31ba> in <module>()
----> 1 df['jobnew']=df['job_description'].apply(clean_text)
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
3192 else:
3193 values = self.astype(object).values
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
3195
3196 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-30-5f24dbf9d559> in clean_text(text)
10
11 # remove HTML tags
---> 12 text = re.sub(r'<.*?>', '', text)
13
14 # remove the characters [\], ['] and ["]
~\Anaconda3\lib\re.py in sub(pattern, repl, string, count, flags)
190 a callable, it's passed the Match object and must return
191 a replacement string to be used."""
--> 192 return _compile(pattern, flags).sub(repl, string, count)
193
194 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or bytes-like object
The function re.sub is telling you that you called it with something (the argument text) that is not a string. Since it is invoked by calling apply on the contents of df['job_description'], it is clear that the problem must be in how you created this data frame... and you don't show that part of your code.
Construct your dataframe so that this column only contains strings, and your program will run without error for at least a few more lines.
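To find out which entries are not strings, it can help to summarize the Python types in the column before cleaning it. The sketch below uses a plain list so that it runs without pandas. The helper names type_summary and non_string_positions and the sample column are hypothetical.

```python
from collections import Counter


def type_summary(values):
    """Count how many entries of each Python type the column contains."""
    return Counter(type(v).__name__ for v in values)


def non_string_positions(values):
    """Return the positions of entries that re.sub would reject."""
    return [i for i, v in enumerate(values) if not isinstance(v, str)]


# Hypothetical column: one missing value (NaN) and one stray number
column = ["Data analyst", float("nan"), "Engineer <b>II</b>", 42]
print(type_summary(column))
print(non_string_positions(column))
```

Both helpers should accept df['job_description'] directly, since iterating over a Series yields its values. In pandas itself, df['job_description'].map(type).value_counts() gives a similar summary. A float entry in a text column is usually NaN, that is, a missing value.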

'float' object has no attribute 'lower'

I'm facing this error and I can't find the reason for it.
Can somebody please point out the cause?
for i in tweet_raw.comments:
    mns_proc.append(processComUni(i))
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-416-439073b420d1> in <module>()
1 for i in tweet_raw.comments:
----> 2 tweet_processed.append(processtwt(i))
3
<ipython-input-414-4e1b8a8fb285> in processtwt(tweet)
4 #Convert to lower case
5 #tweet = re.sub('RT[\s]+','',tweet)
----> 6 tweet = tweet.lower()
7 #Convert www.* or https?://* to URL
8 #tweet = re.sub('((www\.[\s]+)|(https?://[^\s]+))','',tweet)
AttributeError: 'float' object has no attribute 'lower'
A second, similar error that I'm facing is this:
for i in tweet_raw.comments:
    tweet_proc.append(processtwt(i))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-423-439073b420d1> in <module>()
1 for i in tweet_raw.comments:
----> 2 tweet_proc.append(processtwt(i))
3
<ipython-input-421-38fab2ef704e> in processComUni(tweet)
11 tweet=re.sub(('[http]+s?://[^\s<>"]+|www\.[^\s<>"]+'),'', tweet)
12 #Convert @username to AT_USER
---> 13 tweet = re.sub('@[^\s]+',' ',tweet)
14 #Remove additional white spaces
15 tweet = re.sub('[\s]+', ' ', tweet)
C:\Users\m1027201\AppData\Local\Continuum\Anaconda\lib\re.pyc in sub(pattern, repl, string, count, flags)
149 a callable, it's passed the match object and must return
150 a replacement string to be used."""
--> 151 return _compile(pattern, flags).sub(repl, string, count)
152
153 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or buffer
Should I check whether or not a particular tweet is a string before passing it to the processtwt() function? For this error I don't even know which line it's failing at.
Just try using this:
tweet = str(tweet).lower()
Lately, I've been facing many of these errors, and converting them to a string before applying lower() always worked for me.
My answer is broader than shalini's. If you want to check whether the object is of type str, I suggest using isinstance() as shown below. This is the more Pythonic way.
tweet = "stackoverflow"
## best way of doing it
if isinstance(tweet, str):
    print(tweet)
## other way of doing it
if type(tweet) is str:
    print(tweet)
## This is one more way to do it
if type(tweet) == str:
    print(tweet)
All of the above check whether the object is a string. isinstance() also accepts subclasses of str, which is why it is usually preferred.
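One thing to keep in mind when choosing between the two answers: str() never fails, but it turns a missing value into the literal text 'nan', which then flows through the rest of the pipeline as if it were a word. A small sketch of the isinstance route with an explicit placeholder follows. The helper name to_lower_text and its missing parameter are hypothetical.

```python
def to_lower_text(value, missing=""):
    """Return value lowercased if it is a str, otherwise the `missing` placeholder."""
    if isinstance(value, str):
        return value.lower()
    return missing


nan = float("nan")

# The blanket str() conversion silently produces the token 'nan'
print(repr(str(nan).lower()))

# The isinstance check lets you decide what a missing tweet becomes
print(repr(to_lower_text(nan)))
print(repr(to_lower_text("Hello World")))
```

Whether an empty string, a skipped row, or an error is the right reaction depends on the data, so the placeholder is left as a parameter here.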
