This question already has answers here:
TypeError: expected string or buffer
(5 answers)
Closed 4 years ago.
I am trying to split a document into paragraphs first and then split each paragraph into lines. I then check the lines and print the paragraph.
Although I am able to achieve that with the code below, an 'expected string or buffer' error shows up when I try to do the same for multiple documents.
import io
import re

with io.open(input_path, mode='r') as f, io.open(write_path, mode='w') as f2:
    data = f.read()
    splat = re.split(r"\n(\s)*\n", data)
    mylist = []
    for para1 in splat:
        splat2 = re.split(r"\n", para1)
        for line1 in splat2:
            # PERFORM SOME OPERATION
            pass
Error
<ipython-input-218-18e633df1d46> in custom_section(input_path, write_path)
14 mylist=[]
15 for para1 in splat:
---> 16 splat2= re.split(r"\n", para1)
17 for line1 in splat2:
18 # line1 = line1.decode("utf-8")
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.pyc in split(pattern, string, maxsplit, flags)
169 """Split the source string by the occurrences of the pattern,
170 returning a list containing the resulting substrings."""
--> 171 return _compile(pattern, flags).split(string, maxsplit)
172
173 def findall(pattern, string, flags=0):
TypeError: expected string or buffer
This error occurs because your pattern \n(\s)*\n contains a capturing group, so re.split() also returns what the group matched; whenever the group matches nothing (for example two consecutive newlines with nothing between them), that entry is None, and passing None to re.split() on the next pass raises the TypeError. If you insist on using re.split() you can drop the None objects with the filter() function, like so: filter(None, splat), or use a non-capturing group, r"\n(?:\s)*\n", so they are never produced.
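For illustration, a minimal sketch of both fixes, assuming data holds the document text (the sample string is made up):

import re

data = "first para\n\nsecond para\n \nthird para"

# option 1: keep the capturing group, drop the None/whitespace entries it produces
splat = re.split(r"\n(\s)*\n", data)
paragraphs = [p for p in filter(None, splat) if p.strip()]

# option 2: make the group non-capturing so only the paragraphs come back
paragraphs = re.split(r"\n(?:\s)*\n", data)

for para1 in paragraphs:
    for line1 in para1.split("\n"):
        print(line1)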
This question already has answers here:
How does python startswith work?
(2 answers)
Closed 5 months ago.
I'm learning how to manipulate strings in Python. I'm currently having an issue with the startswith() method. I'm trying to see how many lines start with a specific character, i.e. "0", but I'm not getting any results. Where did I go wrong? The text file only contains randomly generated numbers.
random = open("output-onlinefiletools.txt", "r")
r = random.read()
#print(len(r))
#small = r[60:79]
#print(r[60:79])
#print(len(r[60:79]))
#print(small)
for line in random:
    line = line.rstrip()
    if line.startswith(1):
        print(line)
You are passing 1 as an int, but the lines are strings once read, so startswith() needs a string argument such as "0". I also wouldn't name the variable random: the name isn't protected, but it is generally used for the standard random module and shadows it. Finally, read() leaves the file pointer at the end of the file, so the later for line in random loop never yields anything, which is why you see no output at all; iterate over the lines of r instead.
myFile = open(r"C:\Dev\Docs\output-onlinefiletools.txt", "r")
r = myFile.read()
# return all lines that start with 0
for line in r.splitlines():
    if line.startswith("0"):
        print(line)
Output:
00000
01123
0000
023478
startswith() takes the prefix as a string argument; in your case that is line.startswith("0").
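Since the goal was to count how many lines start with that character, here is a small sketch of one way to do it (the file name is taken from the question):

with open("output-onlinefiletools.txt", "r") as f:
    lines = f.read().splitlines()

# count the lines beginning with "0"
count = sum(1 for line in lines if line.startswith("0"))
print(count)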
This question already has answers here:
How to extract text from an existing docx file using python-docx
(6 answers)
I'm getting a TypeError. How do I fix it?
(2 answers)
Closed 6 months ago.
I'm attempting to take the email addresses from 500 Word documents and use findall to extract them into Excel. This is the code I have so far:
import pandas as pd
from docx.api import Document
import os
import re

os.chdir('C:\\Users\\user1\\test')
path = 'C:\\Users\\user1\\test'
output_path = 'C:\\Users\\user1\\test2'
writer = pd.ExcelWriter('{}/docx_emails.xlsx'.format(output_path), engine='xlsxwriter')

worddocs_list = []
for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    worddocs_list.append(wordDoc)

data = []
for wordDoc in worddocs_list:
    match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', wordDoc)
    data.append(match)

df = pd.DataFrame(data)
df.to_excel(writer)
writer.save()
print(df)
and I'm getting an error showing:
TypeError Traceback (most recent call last)
Input In [6], in <cell line: 19>()
17 data = []
19 for wordDoc in worddocs_list:
---> 20 match = re.findall(r'[\w.+-]+#[\w-]+\.[\w.-]+',wordDoc)
21 data.append(match)
24 df = pd.DataFrame(data)
File ~\anaconda3\lib\re.py:241, in findall(pattern, string, flags)
233 def findall(pattern, string, flags=0):
234 """Return a list of all non-overlapping matches in the string.
235
236 If one or more capturing groups are present in the pattern, return
(...)
239
240 Empty matches are included in the result."""
--> 241 return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
What am I doing wrong here?
Many thanks.
Your wordDoc variable doesn't contain a string, it contains a Document object. You need to look at the python-docx documentation to see how to get the body of the Word document out of that object as a string.
It looks like you first have to get the Paragraphs with wordDoc.paragraphs and then ask each one for its text, so maybe something like this?
documentText = '\n'.join([p.text for p in wordDoc.paragraphs])
And then use that as the string to match against:
match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', documentText)
If you're going to be using the same regular expression over and over, though, you should probably compile it into a Pattern object first instead of passing it as a string to findall every time:
regex = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    documentText = '\n'.join([p.text for p in wordDoc.paragraphs])
    match = regex.findall(documentText)
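To tie that back to the Excel output the question is after, a hedged sketch of the full loop; it reuses the paths and imports from the question's own code and only adds the text-extraction step:

regex = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

data = []
for filename in os.listdir(path):
    wordDoc = Document(os.path.join(path, filename))
    documentText = '\n'.join(p.text for p in wordDoc.paragraphs)
    data.append(regex.findall(documentText))  # one list of addresses per document

df = pd.DataFrame(data)
with pd.ExcelWriter('{}/docx_emails.xlsx'.format(output_path), engine='xlsxwriter') as writer:
    df.to_excel(writer)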
This question already has answers here:
How to convert string representation of list to a list
(19 answers)
Closed 1 year ago.
This seems like a simple question, but I couldn't find it on the Stack community. I have a dataset like the one below in a text file. I would like to read it in as a list with each value as a float. Currently the output is very odd compared to the simple list I need (also below).
data.txt:
[1130.1271455966723, 1363.3947962724474, 784.433380329118, 847.2140341725295, 803.0276763894814,..]
Code attempted:
my_file = open(r"data.txt", "r")
content = my_file.read()
content_list = content.split(",")
my_file.close()
The output is odd: the values are strings, nested inside a list, and have added spaces:
Current result:
['[1130.1271455966723',
' 1363.3947962724474',
' 784.433380329118',
' 847.2140341725295',
' 803.0276763894814',
' 913.7751118925291',
' 1055.3775618432019',...]']
I also tried the approach here (How to convert string representation of list to a list?) with the following code, but it produced an error:
import ast
x = ast.literal_eval(result)
raise ValueError('malformed node or string: ' + repr(node))
ValueError: malformed node or string: ['[1130.1271455966723', '1363.3947962724474', ' 784.433380329118', ' 847.2140341725295', ' 803.0276763894814',...]']
Ideal result:
list = [1130.1271455966723, 1363.3947962724474, 784.433380329118, 847.2140341725295, 803.0276763894814]
Your data is valid JSON, so just use the corresponding module that will take care of all the parsing for you:
import json
with open("data.txt") as f:
data = json.load(f)
print(data)
Output:
[1130.1271455966723, 1363.3947962724474, 784.433380329118, 847.2140341725295, 803.0276763894814]
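The ast.literal_eval approach from the question also works, as long as you pass it the raw file contents rather than the already-split list; a small sketch:

import ast

with open("data.txt") as f:
    content = f.read()

data = ast.literal_eval(content)  # parses the [...] literal into a list of floats
print(data)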
I have a dataframe with the following columns: Date,Time,Tweet,Client,Client Simplified
The Tweet column sometimes contains a website link.
I am trying to define a function that extracts how many times a link appears in a tweet and which link it is.
I don't want the answer for the whole function; I am just struggling with findall before I put all of this into a function:
import pandas as pd
import re
csv_doc = pd.read_csv("/home/datasci/prog_datasci_2/activities/activity_2/data/TrumpTweets.csv")
URL = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', csv_doc)
The error I'm getting is:
TypeError Traceback (most recent call last)
<ipython-input-20-0085f7a99b7a> in <module>
7 # csv_doc.head()
8 tweets = csv_doc.Tweet
----> 9 URL= re.split('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',tweets)
10
11 # URL = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', csv_doc[Tweets])
/usr/lib/python3.8/re.py in split(pattern, string, maxsplit, flags)
229 and the remainder of the string is returned as the final element
230 of the list."""
--> 231 return _compile(pattern, flags).split(string, maxsplit)
232
233 def findall(pattern, string, flags=0):
TypeError: expected string or bytes-like object
Could you please let me know what is wrong?
Thanks.
Try adding r in front of the string; it tells Python that this is a raw string, which is what you want for a regex pattern.
Also, the re functions work on a single string, not on a list or Series of strings, so you can use a simple list comprehension like this:
[re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', x) for x in csv_doc.Tweet]
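If you want both the links and how many of them each tweet contains, a hedged sketch building on the same pattern (the url_list and url_count column names are made up; the CSV path is the one from the question):

import re
import pandas as pd

csv_doc = pd.read_csv("/home/datasci/prog_datasci_2/activities/activity_2/data/TrumpTweets.csv")
url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

# one list of URLs per tweet, plus how many were found
csv_doc['url_list'] = [url_pattern.findall(tweet) for tweet in csv_doc.Tweet.astype(str)]
csv_doc['url_count'] = csv_doc['url_list'].str.len()
print(csv_doc[['url_list', 'url_count']].head())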
I am working on pre-processing the data in the "Job Description" column, which contains text. I have created a dataframe and am trying to apply a function to pre-process the data, but I get the error "expected string or bytes-like object" when applying the function to the column in the data frame. Please refer to my code below and help.
####################################################
# Function to pre-process the data
import re

def clean_text(text):
    """
    Applies some pre-processing on the given text.

    Steps :
    - Removing HTML tags
    - Removing punctuation
    - Lowering text
    """
    # remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # remove the characters [\], ['] and ["]
    text = re.sub(r"\\", "", text)
    text = re.sub(r"\'", "", text)
    text = re.sub(r"\"", "", text)

    # convert text to lowercase
    text = text.strip().lower()

    # replace all numbers with empty spaces
    text = re.sub("[^a-zA-Z]",  # search for all non-letters
                  " ",          # replace all non-letters with spaces
                  str(text))

    # replace punctuation characters with spaces
    filters = '!"\'#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
    translate_dict = dict((c, " ") for c in filters)
    translate_map = str.maketrans(translate_dict)
    text = text.translate(translate_map)

    return text
#############################################################
#To apply "Clean_text" function to job_description column in data frame
df['jobnew']=df['job_description'].apply(clean_text)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-33-c15402ac31ba> in <module>()
----> 1 df['jobnew']=df['job_description'].apply(clean_text)
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
3192 else:
3193 values = self.astype(object).values
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
3195
3196 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-30-5f24dbf9d559> in clean_text(text)
10
11 # remove HTML tags
---> 12 text = re.sub(r'<.*?>', '', text)
13
14 # remove the characters [\], ['] and ["]
~\Anaconda3\lib\re.py in sub(pattern, repl, string, count, flags)
190 a callable, it's passed the Match object and must return
191 a replacement string to be used."""
--> 192 return _compile(pattern, flags).sub(repl, string, count)
193
194 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or bytes-like object
The function re.sub is telling you that you called it with something (the argument text) that is not a string. Since it is invoked by calling apply on the contents of df['job_description'], the problem must lie in how you created this data frame, and you don't show that part of your code; most likely the column contains non-string values such as NaN for missing descriptions.
Construct your dataframe so that this column only contains strings, and your program will run without error for at least a few more lines.
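A minimal sketch of one defensive fix, assuming the offending values are missing entries (NaN); it reuses the df and clean_text from the question:

# replace missing descriptions with empty strings and force everything to str
df['job_description'] = df['job_description'].fillna('').astype(str)
df['jobnew'] = df['job_description'].apply(clean_text)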