Error : expected string or bytes-like object - python

I am preprocessing the "Job Description" column, which contains text data. I have created a dataframe and am trying to apply a function to preprocess the data, but I get the error "expected string or bytes-like object" when applying the function to that column. Please see my code below and help.
####################################################
#Function to pre process the data
import re
def clean_text(text):
    """
    Applies some pre-processing on the given text.
    Steps :
    - Removing HTML tags
    - Removing punctuation
    - Lowering text
    """
    # remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # remove the characters [\], ['] and ["]
    text = re.sub(r"\\", "", text)
    text = re.sub(r"\'", "", text)
    text = re.sub(r"\"", "", text)
    # convert text to lowercase
    text = text.strip().lower()
    # replace all non-letters (digits included) with spaces
    text = re.sub("[^a-zA-Z]",  # Search for all non-letters
                  " ",          # Replace all non-letters with spaces
                  str(text))
    # replace punctuation characters with spaces
    filters = '!"\'#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
    translate_dict = dict((c, " ") for c in filters)
    translate_map = str.maketrans(translate_dict)
    text = text.translate(translate_map)
    return text
#############################################################
#To apply "Clean_text" function to job_description column in data frame
df['jobnew']=df['job_description'].apply(clean_text)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-33-c15402ac31ba> in <module>()
----> 1 df['jobnew']=df['job_description'].apply(clean_text)
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
3192 else:
3193 values = self.astype(object).values
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
3195
3196 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-30-5f24dbf9d559> in clean_text(text)
10
11 # remove HTML tags
---> 12 text = re.sub(r'<.*?>', '', text)
13
14 # remove the characters [\], ['] and ["]
~\Anaconda3\lib\re.py in sub(pattern, repl, string, count, flags)
190 a callable, it's passed the Match object and must return
191 a replacement string to be used."""
--> 192 return _compile(pattern, flags).sub(repl, string, count)
193
194 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or bytes-like object

The function re.sub is telling you that you called it with something (the argument text) that is not a string. Since it is invoked by calling apply on the contents of df['job_description'], it is clear that the problem must be in how you created this data frame... and you don't show that part of your code.
Construct your dataframe so that this column only contains strings, and your program will run without error for at least a few more lines.
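To see which values are the problem before changing how the dataframe is built, something like the following may help. It is only a sketch: `find_non_strings`, `to_text` and the sample `column` list are hypothetical names standing in for the contents of df['job_description'], and it assumes the offending values are things like NaN (which pandas uses for empty cells), None, or numbers.

```python
import math

def find_non_strings(values):
    """Return (position, type name, value) for every entry that is not a str."""
    return [(i, type(v).__name__, v)
            for i, v in enumerate(values)
            if not isinstance(v, str)]

def to_text(value):
    """Coerce one cell to a string; treat None and NaN as empty text."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return str(value)

# Hypothetical column contents: one real string and three non-strings.
column = ["<p>Data Analyst</p>", float("nan"), None, 42]

problems = find_non_strings(column)          # which rows would break re.sub
cleaned = [to_text(v) for v in column]       # every entry is now a str
```

With pandas itself, a similar effect can probably be had with df['job_description'].fillna('').astype(str) before calling apply, assuming empty text is an acceptable stand-in for missing descriptions.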

Related

I'm trying to extract emails, and I'm getting a TypeError [duplicate]

I'm attempting to take email addresses from 500 Word documents and use findall to extract them into Excel. This is the code I have so far:
import pandas as pd
from docx.api import Document
import os
import re
os.chdir('C:\\Users\\user1\\test')
path = 'C:\\Users\\user1\\test'
output_path = 'C:\\Users\\user1\\test2'
writer = pd.ExcelWriter('{}/docx_emails.xlsx'.format(output_path),engine='xlsxwriter')
worddocs_list = []
for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    worddocs_list.append(wordDoc)
data = []
for wordDoc in worddocs_list:
    match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+',wordDoc)
    data.append(match)
df = pd.DataFrame(data)
df.to_excel(writer)
writer.save()
print(df)
and I'm getting an error showing:
TypeError Traceback (most recent call last)
Input In [6], in <cell line: 19>()
17 data = []
19 for wordDoc in worddocs_list:
---> 20 match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+',wordDoc)
21 data.append(match)
24 df = pd.DataFrame(data)
File ~\anaconda3\lib\re.py:241, in findall(pattern, string, flags)
233 def findall(pattern, string, flags=0):
234 """Return a list of all non-overlapping matches in the string.
235
236 If one or more capturing groups are present in the pattern, return
(...)
239
240 Empty matches are included in the result."""
--> 241 return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
What am I doing wrong here?
Many thanks.
Your wordDoc variable doesn't contain a string, it contains a Document object. You need to look at the docx.api documentation to see how to get the body of the Word document out of the object as a string.
It looks like you first have to get the Paragraphs with wordDoc.paragraphs and then ask each one for its text, so maybe something like this?
documentText = '\n'.join([p.text for p in wordDoc.paragraphs])
And then use that as the string to match against:
match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', documentText)
If you're going to be using the same regular expression over and over, though, you should probably compile it into a Pattern object first instead of passing it as a string to findall every time:
regex = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    documentText = '\n'.join([p.text for p in wordDoc.paragraphs])
    match = regex.findall(documentText)
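Put together, the idea can be tried without any Word files at all. This is a sketch rather than a tested solution for the real documents: `extract_emails` and `fake_doc` are made-up names, the SimpleNamespace objects only imitate the `.paragraphs` and `.text` attributes used above, and the pattern assumes it is meant to match the `@` in an address.

```python
import re
from types import SimpleNamespace

# Compile once, reuse for every document.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def extract_emails(document):
    """Join the text of every paragraph and return all email-like matches."""
    document_text = '\n'.join(p.text for p in document.paragraphs)
    return EMAIL_RE.findall(document_text)

# Stand-in for a Document object: anything with .paragraphs whose items have .text
fake_doc = SimpleNamespace(paragraphs=[
    SimpleNamespace(text="Contact: jane.doe@example.com"),
    SimpleNamespace(text="No address here."),
    SimpleNamespace(text="Sales: sales@example.org or support@example.org"),
])

emails = extract_emails(fake_doc)
```

As far as I know, wordDoc.paragraphs only covers body paragraphs, so addresses that sit inside tables would probably need the table cells to be read as well.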

Find all website links, group and count from column of dataframe - Python

I have a dataframe with the following columns: Date, Time, Tweet, Client, Client Simplified
The Tweet column sometimes contains a website link.
I am trying to define a function that extracts which links appear in the tweets and how many times each one appears.
I don't want the answer for the whole function. For now I am struggling with findall, before I put all this into a function:
import pandas as pd
import re
csv_doc = pd.read_csv("/home/datasci/prog_datasci_2/activities/activity_2/data/TrumpTweets.csv")
URL = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', csv_doc)
The error I'm getting is:
TypeError Traceback (most recent call last)
<ipython-input-20-0085f7a99b7a> in <module>
7 # csv_doc.head()
8 tweets = csv_doc.Tweet
----> 9 URL= re.split('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',tweets)
10
11 # URL = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', csv_doc[Tweets])
/usr/lib/python3.8/re.py in split(pattern, string, maxsplit, flags)
229 and the remainder of the string is returned as the final element
230 of the list."""
--> 231 return _compile(pattern, flags).split(string, maxsplit)
232
233 def findall(pattern, string, flags=0):
TypeError: expected string or bytes-like object
Could you please let me know what is wrong?
Thanks.
Try adding r in front of the pattern string. That makes it a raw string literal, so backslashes such as \( reach the regex engine unchanged.
Also, the re functions work on a single string, not on a list or Series of strings. You can use a simple list comprehension like this:
[re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',x) for x in csv_doc.Tweet]
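If the per-tweet matches then need to be grouped and counted, one possible way is collections.Counter. This is a rough sketch only: `count_links` and the sample `tweets` list are invented for illustration, and the pattern is a deliberately simplified one (https?:// followed by non-whitespace), not the pattern from the question.

```python
import re
from collections import Counter

# Simplified URL pattern (an assumption, not the question's regex).
URL_RE = re.compile(r'https?://\S+')

def count_links(tweets):
    """Count how often each link appears across an iterable of tweets."""
    counts = Counter()
    for tweet in tweets:
        # Skip anything that is not a string (e.g. NaN from an empty cell).
        if isinstance(tweet, str):
            counts.update(URL_RE.findall(tweet))
    return counts

# Hypothetical stand-in for the values of csv_doc.Tweet
tweets = [
    "Read this https://example.com/a now",
    "Again: https://example.com/a and https://example.com/b",
    float("nan"),
]

link_counts = count_links(tweets)
```

Because findall is called on one string at a time, the TypeError from passing the whole dataframe or Series does not arise here.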

AttributeError: 'float' object has no attribute 'translate' Python

I'm doing some NLP on textual data from doctors. I'm just trying to do some basic text-cleaning preprocessing: removing stop words and punctuation. I have already given the program a list of punctuation marks and stop words.
My text data looks something like this:
"Cyclin-dependent kinases (CDKs) regulate a variety of fundamental cellular processes. CDK10 stands out as one of the last orphan CDKs for which no activating cyclin has been identified and no kinase activity revealed. Previous work has shown that CDK10 silencing increases ETS2 (v-ets erythroblastosis virus E26 oncogene homolog 2)-driven activation of the MAPK pathway, which confers tamoxifen resistance to breast cancer cells"
Then my code looks like:
import string
# Create a function to remove punctuations
def remove_punctuation(sentence: str) -> str:
    return sentence.translate(str.maketrans('', '', string.punctuation))
# Create a function to remove stop words
def remove_stop_words(x):
    x = ' '.join([i for i in x.split(' ') if i not in stop])
    return x
# Create a function to lowercase the words
def to_lower(x):
    return x.lower()
So then I try to apply the functions to the Text column
train['Text'] = train['Text'].apply(remove_punctuation)
train['Text'] = train['Text'].apply(remove_stop_words)
train['Text'] = train['Text'].apply(lower)
And I get an error message like:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
in
----> 1 train['Text'] = train['Text'].apply(remove_punctuation)
2 train['Text'] = train['Text'].apply(remove_stop_words)
3 train['Text'] = train['Text'].apply(lower)
/opt/conda/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
3192 else:
3193 values = self.astype(object).values
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
3195
3196 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
in remove_punctuation(sentence)
3 # Create a function to remove punctuations
4 def remove_punctuation(sentence: str) -> str:
----> 5 return sentence.translate(str.maketrans('', '', string.punctuation))
6
7 # Create a function to remove stop words
AttributeError: 'float' object has no attribute 'translate'
Why am I getting this error? I'm guessing it's because digits appear in the text?
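On the guess above: digits inside a string would not produce this error, because a str that contains digits still has a translate method. The message says that at least one value passed to remove_punctuation is a float. With data loaded by pandas that is often NaN from an empty cell, though that is an assumption here since the loading code isn't shown. A minimal sketch, using the hypothetical helper `remove_punctuation_safe`, that leaves non-strings untouched:

```python
import string

# Build the translation table once instead of on every call.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_safe(sentence):
    """Strip punctuation from strings; return non-strings (e.g. NaN) unchanged."""
    if not isinstance(sentence, str):
        return sentence
    return sentence.translate(PUNCT_TABLE)

# A string with digits is handled without any error.
with_digits = remove_punctuation_safe(
    "CDK10 silencing increases ETS2 (v-ets)-driven activation.")

# A float such as NaN is what actually lacks .translate.
missing = float("nan")
try:
    missing.translate(PUNCT_TABLE)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
```

Whether missing rows should be skipped like this, dropped, or replaced with empty strings depends on what the later steps expect.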

Expected string or buffer error while splitting in python [duplicate]

I am trying to split a document into paragraphs first and then each paragraph into lines, then check the lines and print the paragraph.
Although I am able to achieve that with the code below, an 'expected string or buffer' error shows up when I try to do the same for multiple documents.
import io
import re
with io.open(input_path, mode='r') as f, io.open(write_path, mode='w') as f2:
    data = f.read()
    splat = re.split(r"\n(\s)*\n", data)
    mylist=[]
    for para1 in splat:
        splat2= re.split(r"\n", para1)
        for line1 in splat2:
            pass  # PERFORM SOME OPERATION
Error
<ipython-input-218-18e633df1d46> in custom_section(input_path, write_path)
14 mylist=[]
15 for para1 in splat:
---> 16 splat2= re.split(r"\n", para1)
17 for line1 in splat2:
18 # line1 = line1.decode("utf-8")
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.pyc in split(pattern, string, maxsplit, flags)
169 """Split the source string by the occurrences of the pattern,
170 returning a list containing the resulting substrings."""
--> 171 return _compile(pattern, flags).split(string, maxsplit)
172
173 def findall(pattern, string, flags=0):
TypeError: expected string or buffer
I believe this error is occurring because the list of strings returned as your variable splat contains one or more None objects. If you insist on using re.split() you could remove the None objects with the filter() function, like so: filter(None, splat).
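A likely source of the None entries is the capturing group (\s) in the pattern: re.split also returns the text of capture groups, and a group that took no part in a match comes back as None. Below is a sketch of two ways around it, written for Python 3 and using a made-up `data` string. The first drops the capturing group, the second keeps the original pattern and filters the result.

```python
import re

# Hypothetical document: three paragraphs separated by blank lines,
# one of the blank lines containing only spaces.
data = "first line\nsecond line\n\nnext paragraph\n   \nlast paragraph"

# Option 1: no capturing group, so the result holds only the paragraphs.
paragraphs = re.split(r"\n\s*\n", data)

# Option 2: keep the original pattern, then discard None and
# whitespace-only pieces contributed by the capture group.
kept = [p for p in re.split(r"\n(\s)*\n", data)
        if isinstance(p, str) and p.strip()]

# Each paragraph can then be split into lines as before.
lines_per_paragraph = [p.split("\n") for p in paragraphs]
```

Note that in Python 3 filter(None, splat) returns an iterator rather than a list, and it also drops empty strings, which may or may not be wanted.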

NLTK gives error expected string or bytes-like object

I imported a dataset (.csv) with pandas. The first column is the column with tweets, I rename it and transform it to a numpy array as usual with .values. Then I start the pre-processing with NLTK, it works pretty much every time, except for this dataset. It gives me the error TypeError: expected string or bytes-like object and I can't figure out why. The text contains some weird stuff, but far from the worst I've seen. Can someone help out?
import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
data = pd.read_csv("facebook.csv")
text = data["Anonymized Message"].values
X = []
for i in range(0, len(text)):
    tweet = re.sub("[^a-zA-Z]", " ", text[i])
    tweet = tweet.lower()
    tweet = tweet.split()
    ps = PorterStemmer()
    tweet = [ps.stem(word) for word in tweet if not word in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    X.append(tweet)
gives me this error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-52-a08c1779c787> in <module>()
1 text_train = []
2 for i in range(0, len(text)):
----> 3 tweet = re.sub("[^a-zA-Z]", " ", text[i])
4 tweet = tweet.lower()
5 tweet = tweet.split()
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py in sub(pattern, repl, string, count, flags)
189 a callable, it's passed the match object and must return
190 a replacement string to be used."""
--> 191 return _compile(pattern, flags).sub(repl, string, count)
192
193 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or bytes-like object
Here's the dataset
http://wwbp.org/downloads/public_data/dataset-fb-valence-arousal-anon.csv
