I have my 'cost_money' column like this,
0 According to different hospitals, the charging...
1 According to different hospitals, the charging...
2 According to different conditions, different h...
3 According to different hospitals, the charging...
Name: cost_money, dtype: object
Out of which each string has some important data in brackets, which I need to extract.
"According to different hospitals, the charging standard is inconsistent, the city's three hospitals is about (1000-4000 yuan)"
My try for this is,
import regex as re
full_df['cost_money'] = full_df.cost_money.str.extract('\((.*?)\')
full_df
But this gives an error between string and int conversion, I guess. This a whole string and if I print any character it is going to be char type.
Other than that, I don't need 'yuan' word from the brackets so my method to extract the numbers directly was
import regex as re
df['cost_money'].apply(lambda x: re.findall(r"[-+]?\d*\.\d+|\d+", x)).tolist()
full_df['cost_money']
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
c:\Users\Siddhi\HealthcareChatbot\eda.ipynb Cell 11' in <module>
1 import regex as re
----> 2 df['cost_money'].apply(lambda x: re.findall(r"[-+]?\d*\.\d+|\d+", x)).tolist()
3 full_df['cost_money']
File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\pandas\core\series.py:4433, in Series.apply(self, func, convert_dtype, args, **kwargs)
4323 def apply(
4324 self,
4325 func: AggFuncType,
(...)
4328 **kwargs,
4329 ) -> DataFrame | Series:
4330 """
4331 Invoke function on values of Series.
4332
(...)
4431 dtype: float64
4432 """
-> 4433 return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\pandas\core\apply.py:1082, in SeriesApply.apply(self)
1078 if isinstance(self.f, str):
1079 # if we are a string, try to dispatch
1080 return self.apply_str()
-> 1082 return self.apply_standard()
File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\pandas\core\apply.py:1137, in SeriesApply.apply_standard(self)
1131 values = obj.astype(object)._values
1132 # error: Argument 2 to "map_infer" has incompatible type
1133 # "Union[Callable[..., Any], str, List[Union[Callable[..., Any], str]],
1134 # Dict[Hashable, Union[Union[Callable[..., Any], str],
1135 # List[Union[Callable[..., Any], str]]]]]"; expected
1136 # "Callable[[Any], Any]"
-> 1137 mapped = lib.map_infer(
1138 values,
1139 f, # type: ignore[arg-type]
1140 convert=self.convert_dtype,
1141 )
1143 if len(mapped) and isinstance(mapped[0], ABCSeries):
1144 # GH#43986 Need to do list(mapped) in order to get treated as nested
1145 # See also GH#25959 regarding EA support
1146 return obj._constructor_expanddim(list(mapped), index=obj.index)
File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\pandas\_libs\lib.pyx:2870, in pandas._libs.lib.map_infer()
c:\Users\Siddhi\HealthcareChatbot\eda.ipynb Cell 11' in <lambda>(x)
1 import regex as re
----> 2 df['cost_money'].apply(lambda x: re.findall(r"[-+]?\d*\.\d+|\d+", x)).tolist()
3 full_df['cost_money']
File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\regex\regex.py:338, in findall(pattern, string, flags, pos, endpos, overlapped, concurrent, timeout, ignore_unused, **kwargs)
333 """Return a list of all matches in the string. The matches may be overlapped
334 if overlapped is True. If one or more groups are present in the pattern,
335 return a list of groups; this will be a list of tuples if the pattern has
336 more than one group. Empty matches are included in the result."""
337 pat = _compile(pattern, flags, ignore_unused, kwargs, True)
--> 338 return pat.findall(string, pos, endpos, overlapped, concurrent, timeout)
TypeError: expected string or buffer
I tried the same thing using findall but most posts mentioned using extract so I stuck to that.
MY REQUESTED OUTPUT:
[5000, 8000]
[6000, 7990]
..SO ON
Can somebody please help me out? Thanks
I believe your regex was incorrect. Here are alternatives.
Example input:
df = pd.DataFrame({'cost_money': ['random text (123-456 yuans)',
'other example (789 yuans)']})
Option A:
df['cost_money'].str.extract('\((\d+-\d+)', expand=False)
Option B (allow single cost):
df['cost_money'].str.extract('\((\d+(?:-\d+)?)', expand=False)
Option C (all numbers eater the first '(' as list:
df['cost_money'].str.split('[()]').str[1].str.findall('(\d+)')
Output (assigned as new columns):
cost_money A B C
0 random text (123-456 yuans) 123-456 123-456 [123, 456]
1 other example (789 yuans) NaN 789 [789]
You can use (\d*-\d*) to match the number part and then split on -.
df['money'] = df['cost_money'].str.extract('\((\d*-\d*).*\)')
df['money'] = df['money'].str.split('-')
Or use (\d*)[^\d]*(\d*) to match the two number parts seperately
df['money'] = df['cost_money'].str.extract('\((\d*)[^\d]*(\d*).*\)').values.tolist()
Related
I am trying to do sentiment analysis using a list of words to get a count of positive and negative words in a pyspark dataframe column. I can successfully get the counts of positive words using the same method, and there are roughly 2k positive words in that list. The negative list has about double the number of words (~4k words). What could be causing this issue, and how can I fix it?
I don't think it is due to the code since it worked for the positive words, but I am confused as to whether the number of words I'm searching for is too long in the other list, or what I am missing. Here is an example (not the exact list) below:
stories.show()
+--------------------+
| words|
+--------------------+
|tom and jerry went t|
|she was angry when g|
|arnold became sad at|
+--------------------+
neg = ['angry','sad','sorrowful','angry']
#doing some counting manipulation here
df3.show()
Error:
spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py in __call__(self, *args)
1308 answer = self.gateway_client.send_command(command)
1309 return_value = get_return_value(
-> 1310 answer, self.gateway_client, self.target_id, self.name)
1311
1312 for temp_arg in temp_args:
/content/spark-3.2.0-bin-hadoop3.2/python/pyspark/sql/utils.py in deco(*a, **kw)
115 # Hide where the exception came from that shows a non-Pythonic
116 # JVM exception message.
--> 117 raise converted from None
118 else:
119 raise
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "<ipython-input-6-97710da0cedd>", line 17, in countNegatives
File "/usr/lib/python3.7/re.py", line 225, in findall
return _compile(pattern, flags).findall(string)
File "/usr/lib/python3.7/re.py", line 288, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/lib/python3.7/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "/usr/lib/python3.7/sre_parse.py", line 932, in parse
p = _parse_sub(source, pattern, True, 0)
File "/usr/lib/python3.7/sre_parse.py", line 420, in _parse_sub
not nested and not items))
File "/usr/lib/python3.7/sre_parse.py", line 648, in _parse
source.tell() - here + len(this))
re.error: multiple repeat at position 5
Expected output:
+--------------------+--------+
| words|Negative|
+--------------------+--------+
|tom and jerry went t| 45|
|she was angry when g| 12|
|arnold became sad at| 54|
Your neg list contains characters that have special meaning for regular expression patterns and consequently, your pattern becomes an unparsable regex pattern.
You can escape the special characters in the pattern by using the re.escape() function.
I am trying to load a dataframe into into bag of words and CountVectorizer but I get TypeError: 'float' object is not iterable when going loading from mess equal a test sentence to mess equaling the dataframe I need to use.
the example corpus on scikit learn docs and the course online both loaded from just list of sentences instead of data frame.
I tried Removing integers
AttributeError: 'int' object has no attribute 'lower' in TFIDF and CountVectorizer
I get different error
TypeError: list indices must be integers or slices, not str
mess1 = [item for item in mess if not isinstance(item, int)]
this is what works
mess = 'Sample message! Notice: it has punctuation.'
this is the dataframe
i need to use instead.
mess.head()
| bios | artistName
----+---------------------------------------------------------+-------------------
0 | Chris Cosentino Biography Chris Cosentino gre... | Chris Cosentino
----+---------------------------------------------------------+-------------------
1 | Magda Biography The DJ known as Magda was bor... | Magda
----+---------------------------------------------------------+-------------------
2 | Jean-Michel Cousteau Biography Since first be... | jean michel cousteau
----+---------------------------------------------------------+-------------------
3 | Kyle Busch Biography The American stock car r... | Kyle Busch
----+---------------------------------------------------------+-------------------
4 | Naughty by Nature Biography Naughty by Nature... | Naughty by Nature
----+---------------------------------------------------------+-------------------
nopunc = [c for c in mess if c not in string.punctuation]
def text_process(mess):
nopunc = [char for char in mess if char not in string.punctuation]
nopunc = ''.join(nopunc)
return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
mess['bios'].head(5).apply(text_process)
Output
0 [Chris, Cosentino, Biography, Chris, Cosentino...
1 [Magda, Biography, DJ, known, Magda, born, rai...
2 [JeanMichel, Cousteau, Biography, Since, first...
3 [Kyle, Busch, Biography, American, stock, car,...
4 [Naughty, Nature, Biography, Naughty, Nature, ...
Name: bios, dtype: object
mess.dtypes
bios object
artistName object
dtype: object
from sklearn.feature_extraction.text import CountVectorizer
then run either
bow_transformer = CountVectorizer(analyzer=text_process)
bow_transformer.fit(mess['bios'])
print(len(bow_transformer.vocabulary_))
or this
bow_transformer = CountVectorizer(analyzer=text_process).fit(mess['bios'])
print(len(bow_transformer.vocabulary))
I get the error
TypeError: 'float' object is not iterable
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-148-74d381110eec> in <module>
1 bow_transformer = CountVectorizer(analyzer=text_process)
----> 2 bow_transformer.fit(mess['bios'])
3 print(len(bow_transformer.vocabulary_))
~\anaconda3\envs\nlp_course\lib\site-packages\sklearn\feature_extraction\text.py in fit(self, raw_documents, y)
996 self
997 """
--> 998 self.fit_transform(raw_documents)
999 return self
1000
~\anaconda3\envs\nlp_course\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1030
1031 vocabulary, X = self._count_vocab(raw_documents,
-> 1032 self.fixed_vocabulary_)
1033
1034 if self.binary:
~\anaconda3\envs\nlp_course\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
940 for doc in raw_documents:
941 feature_counter = {}
--> 942 for feature in analyze(doc):
943 try:
944 feature_idx = vocabulary[feature]
<ipython-input-134-ad1781692b41> in text_process(mess)
1 def text_process(mess):
2
----> 3 nopunc = [char for char in mess if char not in string.punctuation]
4
5 nopunc = ''.join(nopunc)
TypeError: 'float' object is not iterable
Based Ben Reiniger Comment
I looked for the missing values in the dataframe. Even though it was complete there was thousands of fully blank ones added.
I counted nan
count_nan = len(mess) - mess.count()
count_nan
bios 9682
artistName 9768
dtype: int64
I ran dropna to remove them
mess.dropna(inplace=True)
output is now
bios 0
artistName 0
dtype: int64
Error received when nan successfully dropped
Now I try to run bow_transformer = CountVectorizer(analyzer=text_process)
that fixed my original TypeError: 'float' object is not iterable
however i get a new error but I am one step closer.
bow_transformer = CountVectorizer(analyzer=text_process)
bow_transformer.fit(mess['bios'])
print(len(bow_transformer.vocabulary_))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-62-74d381110eec> in <module>
1 bow_transformer = CountVectorizer(analyzer=text_process)
----> 2 bow_transformer.fit(mess['bios'])
3 print(len(bow_transformer.vocabulary_))
TypeError: list indices must be integers or slices, not str
Im working on doing some NLP with textual data from doctors just trying to do some basic preprocessing text cleaning trying to remove stop words and punctuation. I have already given the program a list of punctuations and stop words.
My text data looks something like this:
"Cyclin-dependent kinases (CDKs) regulate a variety of fundamental cellular processes. CDK10 stands out as one of the last orphan CDKs for which no activating cyclin has been identified and no kinase activity revealed. Previous work has shown that CDK10 silencing increases ETS2 (v-ets erythroblastosis virus E26 oncogene homolog 2)-driven activation of the MAPK pathway, which confers tamoxifen resistance to breast cancer cells"
Then my code looks like:
import string
# Create a function to remove punctuations
def remove_punctuation(sentence: str) -> str:
return sentence.translate(str.maketrans('', '', string.punctuation))
# Create a function to remove stop words
def remove_stop_words(x):
x = ' '.join([i for i in x.split(' ') if i not in stop])
return x
# Create a function to lowercase the words
def to_lower(x):
return x.lower()
So then I try to apply the functions to the Text column
train['Text'] = train['Text'].apply(remove_punctuation)
train['Text'] = train['Text'].apply(remove_stop_words)
train['Text'] = train['Text'].apply(lower)
And I get an error message like:
--------------------------------------------------------------------------- AttributeError Traceback (most recent call
last) in
----> 1 train['Text'] = train['Text'].apply(remove_punctuation)
2 train['Text'] = train['Text'].apply(remove_stop_words)
3 train['Text'] = train['Text'].apply(lower)
/opt/conda/lib/python3.6/site-packages/pandas/core/series.py in
apply(self, func, convert_dtype, args, **kwds) 3192
else: 3193 values = self.astype(object).values
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype) 3195 3196 if len(mapped) and
isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
in remove_punctuation(sentence)
3 # Create a function to remove punctuations
4 def remove_punctuation(sentence: str) -> str:
----> 5 return sentence.translate(str.maketrans('', '', string.punctuation))
6
7 # Create a function to remove stop words
AttributeError: 'float' object has no attribute 'translate'
Why am I getting this error. Im guessing because digits appear in the text?
I am trying to create a program that will identify American Dates through a regular expression, and for some reason I keep picking up ALL dates not just american style dates. Can someone take a look at my code and tell me what I am doing wrong with the regex?
I have thoroughly looked through the re Python docs to craft an expression that will pick up any american style dates styled MM-DD-YYYY.
import shutil, os, re
date_pattern = re.compile(r"""^(.*?)
((0|1)?\d)-
((0|1|2|3)?\d)-
((19|20)\d\d)
(.*?)$
""", re.VERBOSE)
american_date_list = []
file_list = os.listdir('.\date_files')
for file in file_list:
american_date = date_pattern.search(file)
if american_date:
american_date_list.append(file)
The below are my test files:
'02-25-1992 bermuda'
'21-07-1992 Utah'
'25-02-1992 atlanta'
'bahamas 12-15-1992'
My expectation would be that I would only get a match object for first and last listed file names, but I keep getting a match for every file name.
What am I doing wrong in the regular expression?
What am I doing wrong in the regular expression?
Using it.
Seriously. You should use regex only if there is no other reasonable option.
Python has a good standard library for working with dates and times, and that is not for your liking use libraries like arrow.
Instead of breaking your head on Regex do:
In [1]: import datetime
In [2]: datetime.datetime.strptime("1-12-2018", "%m-%d-%Y")
Out[2]: datetime.datetime(2018, 1, 12, 0, 0)
This gets you have a legal date. Now, try an parse a non existing month:
In [20]: datetime.datetime.strptime("13-12-2018", "%m-%d-%Y")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-20-02e1071664f7> in <module>()
----> 1 datetime.datetime.strptime("13-12-2018", "%m-%d-%Y")
/usr/lib64/python3.6/_strptime.py in _strptime_datetime(cls, data_string, format)
563 """Return a class cls instance based on the input string and the
564 format string."""
--> 565 tt, fraction = _strptime(data_string, format)
566 tzname, gmtoff = tt[-2:]
567 args = tt[:6] + (fraction,)
/usr/lib64/python3.6/_strptime.py in _strptime(data_string, format)
360 if not found:
361 raise ValueError("time data %r does not match format %r" %
--> 362 (data_string, format))
363 if len(data_string) != found.end():
364 raise ValueError("unconverted data remains: %s" %
ValueError: time data '13-12-2018' does not match format '%m-%d-%Y'
So you see this will throw an exception you can use in your code if the format isn't legal.
strptime can also handle special dates for you
datetime.datetime.strptime("02-29-2018", "%m-%d-%Y") # throws
ValueError: day is out of range for month
In the second line you have the following matches:
^(.*?) matches '2
((0|1)?\d)- matches 1-
((0|1|2|3)?\d)- matches 07-
((19|20)\d\d) matches 1992
(.*?)$ matches Utah'
Put \b before ((0|1)?\d) to ensure that it starts matching at a word boundary, so it won't match in the middle of a number.
^(.*?)\b((0|1)?\d)-((0|1|2|3)?\d)-((19|20)\d\d)(.*?)$
DEMO
I am working on pre processing the data for "Job Description" column which contains text data format. I have created a dataframe and trying to apply a function to pre process the data, but getting the error as "expected string or bytes-like object" when applying function to the column in data frame. Please refer my code below and help.
####################################################
#Function to pre process the data
def clean_text(text):
"""
Applies some pre-processing on the given text.
Steps :
- Removing HTML tags
- Removing punctuation
- Lowering text
"""
# remove HTML tags
text = re.sub(r'<.*?>', '', text)
# remove the characters [\], ['] and ["]
text = re.sub(r"\\", "", text)
text = re.sub(r"\'", "", text)
text = re.sub(r"\"", "", text)
# convert text to lowercase
text = text.strip().lower()
#replace all numbers with empty spaces
text = re.sub("[^a-zA-Z]", # Search for all non-letters
" ", # Replace all non-letters with spaces
str(text))
# replace punctuation characters with spaces
filters='!"\'#$%&()*+,-./:;<=>?#[\\]^_`{|}~\t\n'
translate_dict = dict((c, " ") for c in filters)
translate_map = str.maketrans(translate_dict)
text = text.translate(translate_map)
return text
#############################################################
#To apply "Clean_text" function to job_description column in data frame
df['jobnew']=df['job_description'].apply(clean_text)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-33-c15402ac31ba> in <module>()
----> 1 df['jobnew']=df['job_description'].apply(clean_text)
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
3192 else:
3193 values = self.astype(object).values
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
3195
3196 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-30-5f24dbf9d559> in clean_text(text)
10
11 # remove HTML tags
---> 12 text = re.sub(r'<.*?>', '', text)
13
14 # remove the characters [\], ['] and ["]
~\Anaconda3\lib\re.py in sub(pattern, repl, string, count, flags)
190 a callable, it's passed the Match object and must return
191 a replacement string to be used."""
--> 192 return _compile(pattern, flags).sub(repl, string, count)
193
194 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or bytes-like object
The function re.sub is telling you that you called it with something (the argument text) that is not a string. Since it is invoked by calling apply on the contents of df['job_description'], it is clear that the problem must be in how you created this data frame... and you don't show that part of your code.
Construct your dataframe so that this column only contains strings, and your program will run without error for at least a few more lines.