Perform stemming on csv file using Pandas - python

I am new to Python and Pandas and want to perform stemming on a CSV file column named 'Body' using Pandas. My code is below:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
answers= pd.read_csv('F:/mtech/Project/answer_sample.csv')
porter_stemmer = PorterStemmer()
#print(answers.head())
#print(answers.loc[0:,"Body"])
df= pd.read_csv('F:/mtech/Project/answer_sample.csv','utf-8')
df['Body'] = df['Body'].str.lower().str.split()
stop = stopwords.words('english')
df['Body']= df['Body'].apply(lambda x: [item for item in x if item not in stop])
df['Body_Tokenized']= df['Body'].apply(lambda x : filter(None,x.split(' ')))
df['Body_Stemmed']= df['Body_Tokenized'].apply(lambda x : [porter_stemmer.stem(y) for y in x])
df.to_csv('F:/mtech/Project/answer_swr_stem.csv')
print("Done..")
I am able to perform stopword removal, but while stemming I get the following error:
Traceback (most recent call last):
File "F:\mtech\DATASET\answer_pd.py", line 10, in <module>
df= pd.read_csv('F:/mtech/Project/answer_sample.csv','utf-8')
File "C:\Users\Ayushi Misra\Anaconda2\lib\site-packages\pandas\io\parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\Ayushi Misra\Anaconda2\lib\site-packages\pandas\io\parsers.py", line 401, in _read
data = parser.read()
File "C:\Users\Ayushi Misra\Anaconda2\lib\site-packages\pandas\io\parsers.py", line 939, in read
ret = self._engine.read(nrows)
File "C:\Users\Ayushi Misra\Anaconda2\lib\site-packages\pandas\io\parsers.py", line 1997, in read
alldata = self._rows_to_cols(content)
File "C:\Users\Ayushi Misra\Anaconda2\lib\site-packages\pandas\io\parsers.py", line 2551, in _rows_to_cols
raise ValueError(msg)
ValueError: Expected 1 fields in line 2853, saw 2. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Need help!
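A likely fix, judging from the traceback: the error comes from the read_csv call, not the stemmer. In pd.read_csv the second positional argument is the separator, so 'utf-8' is being treated as a multi-character delimiter; it belongs to the encoding keyword instead. Also, after str.split() the 'Body' column already holds lists, so the later x.split(' ') would fail once the read succeeds. A minimal sketch along those lines (paths and column names taken from the question):
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()
stop = set(stopwords.words('english'))

# 'utf-8' belongs to the encoding keyword; passed positionally it becomes the delimiter
df = pd.read_csv('F:/mtech/Project/answer_sample.csv', encoding='utf-8')

# lower-case, tokenize on whitespace, drop stop words, then stem each remaining token
df['Body_Tokenized'] = df['Body'].str.lower().str.split()
df['Body_Tokenized'] = df['Body_Tokenized'].apply(lambda toks: [t for t in toks if t not in stop])
df['Body_Stemmed'] = df['Body_Tokenized'].apply(lambda toks: [porter_stemmer.stem(t) for t in toks])

df.to_csv('F:/mtech/Project/answer_swr_stem.csv')
print("Done..")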

Related

Python Read In Google Spreadsheet Using Pandas

I have a file in Google Sheets that I want to read into a Pandas DataFrame, but it gives me an error and I don't know what it is.
This is the code:
import pandas as pd
sheet_id = "1HUbEhsYnLxJP1IisFcSKtHTYlFj_hHe5v21qL9CVyak"
df = pd.read_csv(f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?
gid=556844753&format=csv")
print(df)
And this is the error:
File "c:\Users\blaghzao\Documents\Stage PFA(Laghzaoui Brahim)\google_sheet.py", line 3, in <module>
df = pd.read_csv(f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?gid=556844753&format=csv")
File "C:\Users\blaghzao\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\blaghzao\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\blaghzao\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 581, in _read
return parser.read(nrows)
File "C:\Users\blaghzao\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1255, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:\Users\blaghzao\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 225, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas\_libs\parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas\_libs\parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 89 fields in line 3, saw 243
I found the answer; the problem was just with the access permissions of the file.
Remove the gid parameter from the code:
import pandas as pd
sheet_id = "1HUbEhsYnLxJP1IisFcSKtHTYlFj_hHe5v21qL9CVyak"
df = pd.read_csv(f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv")
print(df)
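In case it helps to confirm the permissions explanation: a sheet that is not shared publicly typically returns an HTML sign-in page from the export URL instead of CSV, which is what makes the parser see a different number of fields. A small diagnostic sketch (assuming the requests package, which is not part of the original answer):
import requests

sheet_id = "1HUbEhsYnLxJP1IisFcSKtHTYlFj_hHe5v21qL9CVyak"
url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"
resp = requests.get(url)
print(resp.headers.get("Content-Type"))  # text/csv is good; text/html suggests a permissions problem
print(resp.text[:200])                   # first characters of whatever actually came back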
As far as I know, this error arises when using a comma delimiter and there are more commas than expected.
Can you try the read_csv() call below to skip those lines?
df = pd.read_csv(f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?gid=556844753&format=csv", on_bad_lines='skip')
This will skip the bad lines so you can identify the problem from which rows were dropped. I believe your CSV export format is not matching what pandas read_csv() expects.

Resource punkt not found. But, it is downloaded and installed

I have the following columns in a dataframe.
Unnamed: 0, title, publication, author, year, month, title.1, content, len_article, gensim_summary, split_words, first_100_words
I am trying to run this small piece of code.
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
# TOKENIZE
df.first_100_words = df.first_100_words.str.lower()
df['tokenized_first_100'] = df.first_100_words.apply(lambda x: word_tokenize(x, language = 'en'))
The last line of code throws an error. This is the error message I'm getting:
Traceback (most recent call last):
File "<ipython-input-129-42381e657774>", line 2, in <module>
df['tokenized_first_100'] = df.first_100_words.apply(lambda x: word_tokenize(x, language = 'en'))
File "C:\Users\ryans\Anaconda3\lib\site-packages\pandas\core\series.py", line 3848, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas\_libs\lib.pyx", line 2329, in pandas._libs.lib.map_infer
File "<ipython-input-129-42381e657774>", line 2, in <lambda>
df['tokenized_first_100'] = df.first_100_words.apply(lambda x: word_tokenize(x, language = 'en'))
File "C:\Users\ryans\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 144, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
File "C:\Users\ryans\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 105, in sent_tokenize
tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
File "C:\Users\ryans\Anaconda3\lib\site-packages\nltk\data.py", line 868, in load
opened_resource = _open(resource_url)
File "C:\Users\ryans\Anaconda3\lib\site-packages\nltk\data.py", line 993, in _open
return find(path_, path + ['']).open()
File "C:\Users\ryans\Anaconda3\lib\site-packages\nltk\data.py", line 701, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
import nltk
nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/en.pickle
Searched in:
- 'C:\\Users\\ryans/nltk_data'
- 'C:\\Users\\ryans\\Anaconda3\\nltk_data'
- 'C:\\Users\\ryans\\Anaconda3\\share\\nltk_data'
- 'C:\\Users\\ryans\\Anaconda3\\lib\\nltk_data'
- 'C:\\Users\\ryans\\AppData\\Roaming\\nltk_data'
- 'C:\\nltk_data'
- 'D:\\nltk_data'
- 'E:\\nltk_data'
- ''
**********************************************************************
I'm pretty new to all the tokenization stuff.
The sample code is from this site: https://github.com/AustinKrause/Mod_5_Text_Summarizer/blob/master/Notebooks/Text_Cleaning_and_KMeans.ipynb
I found this and it helped: https://github.com/b0noI/dialog_converter/issues/7
Just add:
nltk.download('punkt')
SENT_DETECTOR = nltk.data.load('tokenizers/punkt/english.pickle')
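One more thing worth checking, separate from the answer above: the traceback says it attempted to load tokenizers/punkt/en.pickle, which suggests the language argument needs NLTK's full language name rather than the ISO code. A minimal sketch:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

# punkt ships 'english.pickle'; language='en' makes it look for en.pickle, which does not exist
df['tokenized_first_100'] = df.first_100_words.apply(lambda x: word_tokenize(x, language='english'))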

How to explode a dict (or list of dicts) object into multiple columns in dask.dataframe

When I try to convert some XML to a dataframe using xmltodict, a particular column ends up containing all the info I need as a dict or a list of dicts. I'm able to split this column into multiple ones with pandas, but I'm not able to perform the same operation in dask.
It's not possible to use meta because I have no idea of all the possible fields available in the XML, and dask is necessary because the real XML files are bigger than 1 GB each.
example.xml:
<?xml version="1.0" encoding="UTF-8"?>
<itemList>
  <eventItem uid="1">
    <timestamp>2019-07-04T09:57:35.044Z</timestamp>
    <eventType>generic</eventType>
    <details>
      <detail>
        <name>columnA</name>
        <value>AAA</value>
      </detail>
      <detail>
        <name>columnB</name>
        <value>BBB</value>
      </detail>
    </details>
  </eventItem>
  <eventItem uid="2">
    <timestamp>2019-07-04T09:57:52.188Z</timestamp>
    <eventType>generic</eventType>
    <details>
      <detail>
        <name>columnC</name>
        <value>CCC</value>
      </detail>
    </details>
  </eventItem>
</itemList>
Working pandas code:
import xmltodict
import collections
import pandas as pd

def pd_output_dict(details):
    detail = details.get("detail", [])
    ret_value = {}
    if type(detail) in (collections.OrderedDict, dict):
        ret_value[detail["name"]] = detail["value"]
    elif type(detail) == list:
        for i in detail:
            ret_value[i["name"]] = i["value"]
    return pd.Series(ret_value)

with open("example.xml", "r", encoding="utf8") as f:
    df_dict_list = xmltodict.parse(f.read()).get("itemList", {}).get("eventItem", [])

df = pd.DataFrame(df_dict_list)
df = pd.concat([df, df.apply(lambda row: pd_output_dict(row.details), axis=1, result_type="expand")], axis=1)
print(df.head())
Non-working dask code:
import xmltodict
import collections
import dask
import dask.bag as db
import dask.dataframe as dd

def dd_output_dict(row):
    detail = row.get("details", {}).get("detail", [])
    ret_value = {}
    if type(detail) in (collections.OrderedDict, dict):
        row[detail["name"]] = detail["value"]
    elif type(detail) == list:
        for i in detail:
            row[i["name"]] = i["value"]
    return row

with open("example.xml", "r", encoding="utf8") as f:
    df_dict_list = xmltodict.parse(f.read()).get("itemList", {}).get("eventItem", [])

df_bag = db.from_sequence(df_dict_list)
df = df_bag.to_dataframe()
df = df.apply(lambda row: dd_output_dict(row), axis=1)
The idea is to get in dask the same result I have in pandas, but at the moment I'm receiving errors:
>>> df = df.apply(lambda row: output_dict(row), axis=1)
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\dask\dataframe\utils.py", line 169, in raise_on_meta_error
yield
File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 4711, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "C:\Anaconda3\lib\site-packages\dask\utils.py", line 854, in __call__
return getattr(obj, self.method)(*args, **kwargs)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 6487, in apply
return op.get_result()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 151, in get_result
return self.apply_standard()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 257, in apply_standard
self.apply_series_generator()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 286, in apply_series_generator
results[i] = self.f(v)
File "<stdin>", line 1, in <lambda>
File "<stdin>", line 4, in output_dict
AttributeError: ("'str' object has no attribute 'get'", 'occurred at index 0')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 3964, in apply
M.apply, self._meta_nonempty, func, args=args, udf=True, **kwds
File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 4711, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "C:\Anaconda3\lib\contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "C:\Anaconda3\lib\site-packages\dask\dataframe\utils.py", line 190, in raise_on_meta_error
raise ValueError(msg)
ValueError: Metadata inference failed in `apply`.
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
AttributeError("'str' object has no attribute 'get'", 'occurred at index 0')
Traceback:
---------
File "C:\Anaconda3\lib\site-packages\dask\dataframe\utils.py", line 169, in raise_on_meta_error
yield
File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 4711, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "C:\Anaconda3\lib\site-packages\dask\utils.py", line 854, in __call__
return getattr(obj, self.method)(*args, **kwargs)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 6487, in apply
return op.get_result()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 151, in get_result
return self.apply_standard()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 257, in apply_standard
self.apply_series_generator()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 286, in apply_series_generator
results[i] = self.f(v)
File "<stdin>", line 1, in <lambda>
File "<stdin>", line 4, in output_dict
Right, so operations like map_partitions will need to know the column names and data types. As you've mentioned, you can specify this with the meta= keyword.
Perhaps you can run through your data once to compute what these will be, and then construct a proper meta object, and pass that in? This is inefficient, and requires reading through all of your data, but I'm not sure that there is another way.
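Building on that suggestion, here is a rough sketch of the two-pass approach (the flatten_item helper and the column-discovery step are illustrative, not from the original post): flatten each record first, make one pass to collect every column name that occurs, build an all-object meta frame, and only then construct the dask dataframe.
import dask.bag as db
import pandas as pd
import xmltodict

with open("example.xml", "r", encoding="utf8") as f:
    df_dict_list = xmltodict.parse(f.read()).get("itemList", {}).get("eventItem", [])

def flatten_item(item):
    # assumes the same record structure as in the pandas version above
    detail = item.get("details", {}).get("detail", [])
    if isinstance(detail, dict):
        detail = [detail]
    flat = {k: v for k, v in item.items() if k != "details"}
    for d in detail:
        flat[d["name"]] = d["value"]
    return flat

df_bag = db.from_sequence(df_dict_list).map(flatten_item)

# pass 1: discover every column name that appears anywhere in the data
columns = sorted(df_bag.map(lambda d: list(d)).flatten().distinct().compute())

# build a meta frame with one object-typed column per discovered name
meta = pd.DataFrame({c: pd.Series(dtype="object") for c in columns})

# pass 2: build the dask dataframe with explicit meta
df = df_bag.to_dataframe(meta=meta)
print(df.compute().head())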

I am having trouble running my csv file through my code and I don't understand the error message

I am trying to run a csv file through my code and work with the data, but I am receiving an error message that I don't exactly understand.
Here is the csv file
There is a lot more code but I will only include code that is relevant to the problem. Comment below if you need more info.
import pandas as pd
df_playoffs = pd.read_csv('/Users/hannahbeegle/Desktop/playoff_teams.csv.numbers', encoding='latin-1', index_col = 'team')
df_playoffs.fillna('None', inplace=True)
Here is the error message:
Traceback (most recent call last):
File "/Users/hannahbeegle/Desktop/Baseball.py", line 130, in <module>
df_playoffs = pd.read_csv('/Users/hannahbeegle/Desktop/playoff_teams.csv.numbers', encoding='latin-1', index_col = 'team')
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/parsers.py", line 435, in _read
data = parser.read(nrows)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/parsers.py", line 1139, in read
ret = self._engine.read(nrows)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/parsers.py", line 1995, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 955, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2172, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2
It looks as though your csv may be tab-delimited. In the line that reads the csv, edit it to something like:
pd.read_csv('/Users/hannahbeegle/Desktop/playoff_teams.csv.numbers', sep="\t", encoding='latin-1', index_col = 'team')
[edited section after comments]
If the data is "ragged", you could try breaking it up into a dictionary and then using that to build the dataframe. Here's an example I tried with a mocked-up sample dataset:
record_dict = {}
with open("variable_columns.csv", mode="r") as file:
    for line in file.readlines():
        split_line = line.split()
        record_dict[split_line[0]] = split_line[1:]

df_playoffs = pd.DataFrame.from_dict(record_dict, orient='index')
df_playoffs.sample(5)
You might need to look at the line.split() line and pass "\t" as the split parameter (i.e. line.split("\t")), but you can experiment with this.
Also, notice that pandas has forced the data to be rectangular, so some of the columns will contain None for the "short" rows.
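One more hedged observation, not part of the answer above: the path in the question ends in .csv.numbers, which looks like an Apple Numbers document rather than a plain-text CSV; if so, read_csv cannot parse it directly and the sheet would first need to be exported to CSV. A quick way to peek at what the parser is actually being given:
# print the first few raw lines of the file; readable text with a consistent
# delimiter is fine, while binary-looking bytes mean it is not a plain CSV
with open('/Users/hannahbeegle/Desktop/playoff_teams.csv.numbers', 'rb') as fh:
    for _ in range(3):
        print(fh.readline())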

Error when creating vector in python from text file

I want to import data from a text file and build a vector-space representation of the words:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(input="file")
f = open('D:\\test\\17.txt')
bag_of_words = vectorizer.fit(f)
bag_of_words = vectorizer.transform(f)
print(bag_of_words)
But I get this error:
Traceback (most recent call last):
File "D:\test\test.py", line 5, in <module>
bag_of_words = vectorizer.fit(f)
File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 776, in fit
self.fit_transform(raw_documents)
File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
self.fixed_vocabulary_)
File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 739, in _count_vocab
for feature in analyze(doc):
File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 236, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 110, in decode
doc = doc.read()
AttributeError: 'str' object has no attribute 'read'
Any ideas?
The vectorizer.fit method expects an iterable of file or string objects (not a single file object), hence you should have vectorizer.fit([f]).
In addition, you cannot reuse f in the second call to vectorizer.transform (because the file has already been read by then). What you probably want to do is the following:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(input="file")
f = open('D:\\test\\17.txt')
bag_of_words = vectorizer.fit_transform([f])
print(bag_of_words)
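As a side note (not part of the original answer), CountVectorizer also accepts input="filename", in which case it opens the files itself and you pass paths instead of open file handles:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(input="filename")
bag_of_words = vectorizer.fit_transform(['D:\\test\\17.txt'])
print(bag_of_words)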
