I'm facing this error and I'm really not able to find the reason for it.
Can somebody please point out what's going wrong?
for i in tweet_raw.comments:
    mns_proc.append(processComUni(i))
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-416-439073b420d1> in <module>()
1 for i in tweet_raw.comments:
----> 2 tweet_processed.append(processtwt(i))
3
<ipython-input-414-4e1b8a8fb285> in processtwt(tweet)
4 #Convert to lower case
5 #tweet = re.sub('RT[\s]+','',tweet)
----> 6 tweet = tweet.lower()
7 #Convert www.* or https?://* to URL
8 #tweet = re.sub('((www\.[\s]+)|(https?://[^\s]+))','',tweet)
AttributeError: 'float' object has no attribute 'lower'
A second, similar error that I'm facing is this:
for i in tweet_raw.comments:
    tweet_proc.append(processtwt(i))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-423-439073b420d1> in <module>()
1 for i in tweet_raw.comments:
----> 2 tweet_proc.append(processtwt(i))
3
<ipython-input-421-38fab2ef704e> in processComUni(tweet)
11 tweet=re.sub(('[http]+s?://[^\s<>"]+|www\.[^\s<>"]+'),'', tweet)
12 #Convert #username to AT_USER
---> 13 tweet = re.sub('#[^\s]+',' ',tweet)
14 #Remove additional white spaces
15 tweet = re.sub('[\s]+', ' ', tweet)
C:\Users\m1027201\AppData\Local\Continuum\Anaconda\lib\re.pyc in sub(pattern, repl, string, count, flags)
149 a callable, it's passed the match object and must return
150 a replacement string to be used."""
--> 151 return _compile(pattern, flags).sub(repl, string, count)
152
153 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or buffer
Shall I check whether or not a particular tweet is a string before passing it to the processtwt() function? For this error I don't even know which line it's failing at.
Just try using this:
tweet = str(tweet).lower()
Lately, I've been facing many of these errors, and converting the value to a string before applying lower() has always worked for me.
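The non-string values in a tweets column are usually NaN cells that pandas inserts for missing data, and NaN is a float, which is why .lower() fails. A minimal self-contained sketch of that str() fix, reusing the column name from the question but with made-up values:

import pandas as pd

# Hypothetical data: one real comment plus a missing cell (NaN is a float in pandas).
tweet_raw = pd.DataFrame({"comments": ["Hello WORLD", float("nan")]})

tweet_proc = []
for i in tweet_raw.comments:
    tweet_proc.append(str(i).lower())  # NaN becomes the literal string "nan"

print(tweet_proc)  # ['hello world', 'nan']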
My answer is broader than shalini's answer. If you want to check whether the object is of type str, I suggest checking its type with isinstance(), as shown below. This is the more Pythonic way.
tweet = "stackoverflow"
## best way of doing it
if isinstance(tweet,(str,)):
print tweet
## other way of doing it
if type(tweet) is str:
print tweet
## This is one more way to do it
if type(tweet) == str:
print tweet
All of the above work fine for checking whether an object is a string.
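If you would rather skip the non-string rows than coerce them, the same isinstance() check works as a filter. A small self-contained sketch with invented values:

raw_comments = ["First tweet", float("nan"), "Second tweet"]  # hypothetical mix of strings and NaN

clean_comments = [c for c in raw_comments if isinstance(c, str)]
print(clean_comments)  # ['First tweet', 'Second tweet']

In the question's loop, the same guard would be: if isinstance(i, str): tweet_proc.append(processtwt(i)).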
I'm attempting to take email addresses from 500 Word documents and use findall to extract them into Excel. This is the code I have so far:
import pandas as pd
from docx.api import Document
import os
import re

os.chdir('C:\\Users\\user1\\test')
path = 'C:\\Users\\user1\\test'
output_path = 'C:\\Users\\user1\\test2'
writer = pd.ExcelWriter('{}/docx_emails.xlsx'.format(output_path), engine='xlsxwriter')

worddocs_list = []
for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    worddocs_list.append(wordDoc)

data = []
for wordDoc in worddocs_list:
    match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', wordDoc)
    data.append(match)

df = pd.DataFrame(data)
df.to_excel(writer)
writer.save()
print(df)
and I'm getting an error showing:
TypeError Traceback (most recent call last)
Input In [6], in <cell line: 19>()
17 data = []
19 for wordDoc in worddocs_list:
---> 20 match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+',wordDoc)
21 data.append(match)
24 df = pd.DataFrame(data)
File ~\anaconda3\lib\re.py:241, in findall(pattern, string, flags)
233 def findall(pattern, string, flags=0):
234 """Return a list of all non-overlapping matches in the string.
235
236 If one or more capturing groups are present in the pattern, return
(...)
239
240 Empty matches are included in the result."""
--> 241 return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
What am I doing wrong here?
Many thanks.
Your wordDoc variable doesn't contain a string; it contains a Document object. You need to look at the docx.api documentation to see how to get the body of the Word document out of that object as a string.
It looks like you first have to get the Paragraphs with wordDoc.paragraphs and then ask each one for its text, so maybe something like this?
documentText = '\n'.join([p.text for p in wordDoc.paragraphs])
And then use that as the string to match against:
match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', documentText)
If you're going to be using the same regular expression over and over, though, you should probably compile it into a Pattern object first instead of passing it as a string to findall every time:
regex = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    documentText = '\n'.join([p.text for p in wordDoc.paragraphs])
    match = regex.findall(documentText)
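Putting that together with the rest of the code from your question, the whole script might look roughly like this. It is only a sketch using the paths and regex from the question; the per-row 'file' column is an addition for illustration, and the result is written with DataFrame.to_excel directly instead of an ExcelWriter:

import os
import re

import pandas as pd
from docx.api import Document

path = 'C:\\Users\\user1\\test'
output_path = 'C:\\Users\\user1\\test2'

regex = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

data = []
for filename in os.listdir(path):
    wordDoc = Document(os.path.join(path, filename))
    # Join every paragraph's text so there is one string per document to search.
    documentText = '\n'.join(p.text for p in wordDoc.paragraphs)
    # findall returns a list of addresses; keep the filename so each row stays traceable.
    for email in regex.findall(documentText):
        data.append({'file': filename, 'email': email})

df = pd.DataFrame(data)
df.to_excel('{}/docx_emails.xlsx'.format(output_path), index=False)
print(df)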
I have a text dataset that pandas can only import with the Latin-1 encoding; when I try other encodings, it raises an error. I would like to clear the special characters from that dataset. However, those special characters appear in hex-escaped form like this:
AKU\n\nKU \xf0\x9f\x98\x84\xf0\x9f\x98\x84\xf0\x9f\x98\x84
Then I saw in another thread that I can get rid of this by decoding to Latin-1 and then encoding to UTF-8, but it resulted in the error shown below.
x = data.iloc[5, 0].decode('iso-8859-1').encode('utf8')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-91-c80119246806> in <module>()
1 print(data.iloc[5, 0])
----> 2 x = data.iloc[5, 0].decode('iso-8859-1').encode('utf8')
3 if True:
4 x = re.sub("[\n\t]", ' ', x)
5 x = re.sub("\d+", ' ', x)
AttributeError: 'str' object has no attribute 'decode'
Basically, how can I convert that to UTF-8 for the next text-processing steps? Or is there any other way to get rid of those characters without the need for conversion? Thank you.
You can use
import codecs
print(codecs.decode(data.iloc[5, 0], 'unicode-escape').encode('latin1').decode('utf-8'))
Here is a standalone Python demo:
import codecs
text = r'AKU\n\nKU \xf0\x9f\x98\x84\xf0\x9f\x98\x84\xf0\x9f\x98\x84'
print(codecs.decode(text, 'unicode-escape').encode('latin1').decode('utf-8'))
# => AKU\n\nKU 😄😄😄
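If the whole column contains such escape sequences, the same conversion can be applied row by row, for example with Series.apply. A sketch with an invented single-column frame standing in for your dataset:

import codecs

import pandas as pd

def unescape(s):
    # Interpret the literal \xNN and \n escapes, then reread the resulting bytes as UTF-8.
    return codecs.decode(s, 'unicode-escape').encode('latin1').decode('utf-8')

data = pd.DataFrame({'text': [r'AKU\n\nKU \xf0\x9f\x98\x84\xf0\x9f\x98\x84\xf0\x9f\x98\x84']})
data['text'] = data['text'].apply(unescape)
print(data['text'][0])  # AKU, a blank line, then KU 😄😄😄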
I have a dataframe with the following columns: Date, Time, Tweet, Client, Client Simplified.
The Tweet column sometimes contains a website link.
I am trying to define a function that extracts how many times such a link appears in a tweet and which link it is.
I don't want the answer to the whole function; right now I am struggling with findall, before I turn all of this into a function:
import pandas as pd
import re
csv_doc = pd.read_csv("/home/datasci/prog_datasci_2/activities/activity_2/data/TrumpTweets.csv")
URL = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', csv_doc)
The error I'm getting is:
TypeError Traceback (most recent call last)
<ipython-input-20-0085f7a99b7a> in <module>
7 # csv_doc.head()
8 tweets = csv_doc.Tweet
----> 9 URL= re.split('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',tweets)
10
11 # URL = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', csv_doc[Tweets])
/usr/lib/python3.8/re.py in split(pattern, string, maxsplit, flags)
229 and the remainder of the string is returned as the final element
230 of the list."""
--> 231 return _compile(pattern, flags).split(string, maxsplit)
232
233 def findall(pattern, string, flags=0):
TypeError: expected string or bytes-like object
Could you please let me know what is wrong?
Thanks.
Try adding r in front of the string; it makes it a raw string literal, so the backslashes in the regex pattern aren't treated as Python escape sequences.
Also, the re module works on a single string, not on a list or Series of strings. You can use a simple list comprehension like this:
[re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', x) for x in csv_doc.Tweet]
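A self-contained version of that comprehension with a made-up Tweet column, plus a per-tweet count, since counting the links is what the function is ultimately meant to do:

import re

import pandas as pd

# Hypothetical data standing in for TrumpTweets.csv.
csv_doc = pd.DataFrame({'Tweet': ['Read this https://example.com now', 'no link here']})

pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
urls = [re.findall(pattern, x) for x in csv_doc.Tweet]

print(urls)                    # [['https://example.com'], []]
print([len(u) for u in urls])  # [1, 0] -- number of links in each tweet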
I am trying to collect all the internal links from the Requests library documentation for Python and filter out all the external links.
I am using a regular expression to do this, but it is throwing a TypeError that I am unable to solve.
My code:
import requests
from bs4 import BeautifulSoup
import re
r = requests.get('https://2.python-requests.org/en/master/')
content = BeautifulSoup(r.text)
[i['href'] for i in content.find_all('a') if not re.match("http", i)]
Error:
TypeError Traceback (most recent call last)
<ipython-input-10-b7d82067fe9c> in <module>
----> 1 [i['href'] for i in content.find_all('a') if not re.match("http", i)]
<ipython-input-10-b7d82067fe9c> in <listcomp>(.0)
----> 1 [i['href'] for i in content.find_all('a') if not re.match("http", i)]
~\Anaconda3\lib\re.py in match(pattern, string, flags)
171 """Try to apply the pattern at the start of the string, returning
172 a Match object, or None if no match was found."""
--> 173 return _compile(pattern, flags).match(string)
174
175 def fullmatch(pattern, string, flags=0):
TypeError: expected string or bytes-like object
You are passing it a BeautifulSoup node object, not a string. Try this:
[i['href'] for i in content.find_all('a') if not re.match("http", i['href'])]
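One caveat: an <a> tag without an href attribute would raise a KeyError in that comprehension, so it can be safer to ask BeautifulSoup for only the anchors that actually carry one. A sketch of the same filter with that guard (and the parser named explicitly):

import re

import requests
from bs4 import BeautifulSoup

r = requests.get('https://2.python-requests.org/en/master/')
content = BeautifulSoup(r.text, 'html.parser')

# href=True keeps only the <a> tags that actually have an href attribute.
internal_links = [a['href'] for a in content.find_all('a', href=True)
                  if not re.match('http', a['href'])]
print(internal_links)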
I imported a dataset (.csv) with pandas. The first column is the one with tweets; I rename it and transform it to a NumPy array as usual with .values. Then I start the pre-processing with NLTK, which works pretty much every time, except for this dataset. It gives me the error TypeError: expected string or bytes-like object and I can't figure out why. The text contains some weird stuff, but far from the worst I've seen. Can someone help out?
data = pd.read_csv("facebook.csv")
text = data["Anonymized Message"].values

X = []
for i in range(0, len(text)):
    tweet = re.sub("[^a-zA-Z]", " ", text[i])
    tweet = tweet.lower()
    tweet = tweet.split()
    ps = PorterStemmer()
    tweet = [ps.stem(word) for word in tweet if not word in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    X.append(tweet)
gives me this error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-52-a08c1779c787> in <module>()
1 text_train = []
2 for i in range(0, len(text)):
----> 3 tweet = re.sub("[^a-zA-Z]", " ", text[i])
4 tweet = tweet.lower()
5 tweet = tweet.split()
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py in sub(pattern, repl, string, count, flags)
189 a callable, it's passed the match object and must return
190 a replacement string to be used."""
--> 191 return _compile(pattern, flags).sub(repl, string, count)
192
193 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or bytes-like object
Here's the dataset
http://wwbp.org/downloads/public_data/dataset-fb-valence-arousal-anon.csv
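As in the earlier tracebacks in this thread, re.sub is most likely hitting a non-string row, typically a NaN cell in the Anonymized Message column. A minimal sketch of the usual guard, not verified against this particular dataset:

import re

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

data = pd.read_csv("facebook.csv")
# astype(str) coerces NaN cells to the string "nan"; alternatively, drop them with .dropna().
text = data["Anonymized Message"].astype(str).values

ps = PorterStemmer()
stop_words = set(stopwords.words("english"))

X = []
for raw in text:
    tweet = re.sub("[^a-zA-Z]", " ", raw).lower().split()
    tweet = [ps.stem(word) for word in tweet if word not in stop_words]
    X.append(" ".join(tweet))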