Pandas AssertionError: 1 columns passed, passed data had 2 columns - python

I am working on an Azure ML implementation of text analytics with NLTK, and the following execution throws:
AssertionError: 1 columns passed, passed data had 2 columns
Process returned with non-zero exit code 1
Below is the code
# The script MUST include the following function,
# which is the entry point for this module:
# Param<dataframe1>: a pandas.DataFrame
# Param<dataframe2>: a pandas.DataFrame
def azureml_main(dataframe1 = None, dataframe2 = None):
    # import required packages
    import pandas as pd
    import nltk
    import numpy as np
    # tokenize the review text and store the word corpus
    word_dict = {}
    token_list = []
    nltk.download(info_or_id='punkt', download_dir='C:/users/client/nltk_data')
    nltk.download(info_or_id='maxent_treebank_pos_tagger', download_dir='C:/users/client/nltk_data')
    for text in dataframe1["tweet_text"]:
        tokens = nltk.word_tokenize(text.decode('utf8'))
        tagged = nltk.pos_tag(tokens)
    # convert feature vector to dataframe object
    dataframe_output = pd.DataFrame(tagged, columns=['Output'])
    return [dataframe_output]
The error is thrown here:
dataframe_output = pd.DataFrame(tagged, columns=['Output'])
I suspect it is caused by the data type of tagged being passed to the DataFrame. Can someone let me know the right approach to add this to the dataframe?

nltk.pos_tag returns a list of (word, tag) tuples, so every row of tagged carries two values and pd.DataFrame needs two column names. Try this:
dataframe_output = pd.DataFrame(tagged, columns=['Output', 'temp'])
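A minimal sketch outside Azure ML (the sample sentence and column names are just for illustration, and it assumes the punkt and tagger NLTK data are already downloaded) showing why two names are needed:
import nltk
import pandas as pd

tokens = nltk.word_tokenize("NLTK tags every token")
tagged = nltk.pos_tag(tokens)                        # list of (word, tag) tuples, i.e. two values per row
df = pd.DataFrame(tagged, columns=['Word', 'POS'])   # one column name per tuple element
print(df)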

Related

Use Spacy with Pandas

I'm trying to build a multi-class text classifier using spaCy. I have built the model, but I'm facing a problem applying it to my full dataset. The model I have built so far is shown in the screenshot:
Screenshot
Below is the code I used to apply to my full dataset using Pandas:
Messages = pd.read_csv('Messages.csv', encoding='cp1252')
Messages['Body'] = Messages['Body'].astype(str)
Messages['NLP_Result'] = nlp(Messages['Body'])._.cats
But it gives me the error:
ValueError: [E1041] Expected a string, Doc, or bytes as input, but got: <class 'pandas.core.series.Series'>
The reason I wanted to use Pandas in this case is the dataset has 2 columns: ID and Body. I want to apply the NLP model only to the Body column, but I want the final dataset to have 3 columns: ID, Body and the NLP result like in the screenshot above.
Thanks so much
I tried Pandas apply method too, but had no luck. Code used:
Messages['NLP_Result'] = Messages['Body'].apply(nlp)._.cats
The error I got: AttributeError: 'Series' object has no attribute '_'
Expectation is to generate 3 columns as described above
You should provide a callable to the Series.apply call:
Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)
Here, each value in the Body column will be assigned to the x variable.
nlp(x) creates a Doc object that contains the properties you'd like to access, and nlp(x)._.cats then returns the expected value.
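As a quick illustration of what Series.apply does with a callable (a toy sketch unrelated to the spaCy pipeline shown next):
import pandas as pd

s = pd.Series(['a', 'bb', 'ccc'])
print(s.apply(lambda x: len(x)).tolist())  # each value of s is passed to the lambda as x -> [1, 2, 3]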
import spacy
import classy_classification
import csv
import pandas as pd

with open('Deliveries.txt', 'r') as d:
    Deliveries = d.read().splitlines()
with open("Not Spam.txt", "r") as n:
    Not_Spam = n.read().splitlines()

data = {}
data["Deliveries"] = Deliveries
data["Not_Spam"] = Not_Spam

# NLP model
nlp = spacy.blank("en")
nlp.add_pipe("text_categorizer",
             config={
                 "data": data,
                 "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
                 "device": "gpu"
             }
             )
Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)
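After that assignment Messages keeps its original columns and gains NLP_Result, which gives the three columns described in the question (assuming the ID column is literally named 'ID'):
print(Messages[['ID', 'Body', 'NLP_Result']].head())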

spaCy: spacy.tokens.doc.Doc to dataframe

I have a Spacy model for text generation, and I want to create a pandas data frame with all the texts that my Spacy model produces in each iteration. How can I save the spacy.tokens.doc.Doc output into a pandas dataframe?
nlp = spacy.load('en_core_web_sm')
newDataSet = pd.dataframe()
docs = nlp.pipe(df['Text'])
syn_augmenter = augmenty.load('random_synonym_insertion.v1', level=0.1)
for doc in augmenty.docs(docs, augmenter=syn_augmenter, nlp=nlp):
    newDataSet = newDataSet.add(doc)  # this produces an error
You probably want to use the DframCy library to make that happen. It is also recommended by spaCy: https://spacy.io/universe/project/dframcy. A snippet I use is:
import spacy
import pandas as pd
from dframcy import DframCy
from tqdm import tqdm

nlp = spacy.load('en_core_web_trf')
dframcy = DframCy(nlp)
columns = ["id", "text", "start", "end", "pos_", "tag_", "dep_",
           "head", "ent_type_", "lemma_", "lower_", "is_punct", "is_quote", "is_digit"]

def get_features(item):
    doc = dframcy.nlp(item[1]["discourse_text"])
    annotation_dataframe = dframcy.to_dataframe(doc, columns=columns)
    annotation_dataframe['index'] = item[0]
    return annotation_dataframe

results = []
for item in tqdm(df.iterrows(), total=df.shape[0]):
    results.append(get_features(item))
features = pd.concat(results)
features
So the columns list denotes which attributes you want returned. It is passed to DframCy, which extracts the features and returns a nice dataframe per document. If you have a table of strings that you want to tokenize and get features from, you need to iterate over it; tqdm tracks the overall progress of the for-loop. Concatenating the list of per-document dataframes then gives you a complete overview.
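If you only need a few token attributes, a plain-pandas alternative also works (a sketch with a made-up two-row frame; token.text, token.pos_ and token.lemma_ are standard spaCy token attributes):
import pandas as pd
import spacy

nlp = spacy.load('en_core_web_sm')
df = pd.DataFrame({'Text': ['A short example sentence.', 'Another one.']})
rows = []
for doc_id, doc in enumerate(nlp.pipe(df['Text'])):
    for token in doc:
        rows.append({'doc_id': doc_id, 'text': token.text,
                     'pos_': token.pos_, 'lemma_': token.lemma_})
token_df = pd.DataFrame(rows)  # one row per token, with the document index alongside
print(token_df)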

How gather lists and load into dataframe

The following code creates a dataframe, tokenizes, and filters stopwords. However, I am stuck trying to properly gather the results to load back into a column of the dataframe. Trying to put the results back into the dataframe (using the commented code) produces the error ValueError: Length of values does not match length of index. It seems the issue is with how I'm loading the lists back into the df; I think it is treating them one at a time. I'm not clear how to form a list of lists, which is what I think is needed. Neither append() nor extend() seems appropriate, or if they are, I'm not using them properly. Any insight would be greatly appreciated.
Minimal example
# Load libraries
import numpy as np
import pandas as pd
import spacy

# Create dataframe and tokenize
df = pd.DataFrame({'Text': ['This is the first text. It is two sentences',
                            'This is the second text, with one sentence']})
nlp = spacy.load("en_core_web_sm")
df['Tokens'] = ''
doc = df['Text']
doc = doc.apply(lambda x: nlp(x))
df['Tokens'] = doc
# df # check dataframe

# Filter stopwords
df['No Stop'] = ''
def test_loc(df):
    for i in df.index:
        doc = df.loc[i, 'Tokens']
        tokens_no_stop = [token.text for token in doc if not token.is_stop]
        print(tokens_no_stop)
        # df['No Stop'] = tokens_no_stop # THIS PRODUCES AN ERROR
test_loc(df)
Result
['text', '.', 'sentences']
['second', 'text', ',', 'sentence']
As you mentioned, you need a list of lists for the assignment to work.
Another solution is to use pandas apply, as you did at the beginning of your code.
import numpy as np
import pandas as pd
import spacy

df = pd.DataFrame({'Text': ['This is the first text. It is two sentences',
                            'This is the second text, with one sentence']})
nlp = spacy.load("en_core_web_sm")
df['Tokens'] = df['Text'].apply(lambda x: nlp(x))

def remove_stop_words(tokens):
    return [token.text for token in tokens if not token.is_stop]

df['No Stop'] = df['Tokens'].apply(remove_stop_words)
Notice you don't have to create the column before assigning to it.
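With the sample data above (and the same model version), the new column should hold the same lists the question printed:
print(df['No Stop'].tolist())
# [['text', '.', 'sentences'], ['second', 'text', ',', 'sentence']]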

h2o frame from pandas casting

I am using h2o to perform predictive modeling from python.
I have loaded some data from a csv using pandas, specifying some column types:
dtype_dict = {'SIT_SSICCOMP': 'object',
              'SIT_CAPACC': 'object',
              'PTT_SSIRMPOL': 'object',
              'PTT_SPTCLVEI': 'object',
              'cap_pad': 'object',
              'SIT_SADNS_RESP_PERC': 'object',
              'SIT_GEOCODE': 'object',
              'SIT_TIPOFIRMA': 'object',
              'SIT_TPFRODESI': 'object',
              'SIT_CITTAACC': 'object',
              'SIT_INDIRACC': 'object',
              'SIT_NUMCIVACC': 'object'
              }
date_cols = ["SIT_SSIDTSIN", "SIT_SSIDTDEN", "PTT_SPTDTEFF", "PTT_SPTDTSCA", "SIT_DTANTIFRODE", "PTT_DTELABOR"]
columns_to_drop = ['SIT_TPFRODESI', 'SIT_CITTAACC',
                   'SIT_INDIRACC', 'SIT_NUMCIVACC', 'SIT_CAPACC', 'SIT_LONGITACC',
                   'SIT_LATITACC', 'cap_pad', 'SIT_DTANTIFRODE']
comp = 'mycomp'
file_completo = os.path.join(dataDir, "db4modelrisk_" + comp + ".csv")
db4scoring = pd.read_csv(filepath_or_buffer=file_completo, sep=";", encoding='latin1',
                         header=0, infer_datetime_format=True, na_values=[''], keep_default_na=False,
                         parse_dates=date_cols, dtype=dtype_dict, nrows=500e3)
db4scoring.drop(labels=columns_to_drop, axis=1, inplace=True)
Then, after I set up an h2o cluster, I import the frame into h2o using db4scoring_h2o = H2OFrame(db4scoring) and convert the categorical predictors to factors, for example:
db4scoring_h2o["SIT_SADTPROV"] = db4scoring_h2o["SIT_SADTPROV"].asfactor()
db4scoring_h2o["PTT_SPTFRAZ"] = db4scoring_h2o["PTT_SPTFRAZ"].asfactor()
When I check the data types using db4scoring.dtypes, I notice that they are properly set, but when I import the frame into h2o, the H2OFrame performs some unwanted conversions to enum (e.g. from float or from int). I wonder if there is a way to specify the variable format in H2OFrame.
Yes, there is. See the H2OFrame doc here: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html#h2oframe
You just need to use the column_types argument when you cast.
Here's a short example:
# imports
import h2o
import numpy as np
import pandas as pd
# create small random pandas df
df = pd.DataFrame(np.random.randint(0,10,size=(10, 2)),
columns=list('AB'))
print(df)
# A B
#0 5 0
#1 1 3
#2 4 8
#3 3 9
# ...
# start h2o, convert pandas frame to H2OFrame
# use column_types dict to set data types
h2o.init()
h2o_df = h2o.H2OFrame(df, column_types={'A':'numeric', 'B':'enum'})
h2o_df.describe() # you should now see the desired data types
# A B
# type int enum
# ...
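If a column has already come through as enum, one possible workaround (a sketch reusing the example's B column; asnumeric() and ascharacter() are standard H2OFrame cast methods, and for enum columns with numeric-looking levels some h2o versions need the two-step cast) is to convert it back after import:
h2o_df["B"] = h2o_df["B"].ascharacter().asnumeric()  # cast the enum levels back to numbers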

Graphlab and numpy issue

I'm currently doing a course on Coursera (Machine Learning) offered by the University of Washington, and I'm facing a little problem with numpy and graphlab.
The course requires a version of graphlab higher than 1.7.
Mine is higher, as you can see below; however, when I run the script below, I get the following error:
[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started.
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1
    features = ['constant'] + features # this is how you combine two lists
    # the following line will convert the features_SFrame into a numpy matrix:
    feature_matrix = features_sframe.to_numpy()
    # assign the column of data_sframe associated with the output to the SArray output_sarray
    # the following will convert the SArray into a numpy array by first converting it to a list
    output_array = output_sarray.to_numpy()
    return(feature_matrix, output_array)

(example_features, example_output) = get_numpy_data(sales, ['sqft_living'], 'price') # the [] around 'sqft_living' makes it a list
print example_features[0,:] # this accesses the first row of the data the ':' indicates 'all columns'
print example_output[0] # and the corresponding output
----> 8 feature_matrix = features_sframe.to_numpy()
NameError: global name 'features_sframe' is not defined
The script above was written by the course authors, so I believe there is something I'm doing wrong
Any help will be highly appreciated.
You are supposed to complete the function get_numpy_data before running it; that's why you are getting an error. Follow the instructions in the original function, which actually are:
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1 # this is how you add a constant column to an SFrame
    # add the column 'constant' to the front of the features list so that we can extract it along with the others:
    features = ['constant'] + features # this is how you combine two lists
    # select the columns of data_SFrame given by the features list into the SFrame features_sframe (now including constant):
    # the following line will convert the features_SFrame into a numpy matrix:
    feature_matrix = features_sframe.to_numpy()
    # assign the column of data_sframe associated with the output to the SArray output_sarray
    # the following will convert the SArray into a numpy array by first converting it to a list
    output_array = output_sarray.to_numpy()
    return(feature_matrix, output_array)
The graphlab assignment instructions have you convert from graphlab to pandas and then to numpy. You could just skip the graphlab parts and use pandas directly. (This is explicitly allowed in the homework description.)
First, read in the data files.
import pandas as pd
dtype_dict = {'bathrooms': float, 'waterfront': int, 'sqft_above': int, 'sqft_living15': float,
              'grade': int, 'yr_renovated': int, 'price': float, 'bedrooms': float, 'zipcode': str,
              'long': float, 'sqft_lot15': float, 'sqft_living': float, 'floors': str, 'condition': int,
              'lat': float, 'date': str, 'sqft_basement': int, 'yr_built': int, 'id': str,
              'sqft_lot': int, 'view': int}
sales = pd.read_csv('data//kc_house_data.csv', dtype=dtype_dict)
train_data = pd.read_csv('data//kc_house_train_data.csv', dtype=dtype_dict)
test_data = pd.read_csv('data//kc_house_test_data.csv', dtype=dtype_dict)
The convert to numpy function then becomes
def get_numpy_data(df, features, output):
    df['constant'] = 1
    # add the column 'constant' to the front of the features list so that we can extract it along with the others
    features = ['constant'] + features
    # select the columns of data_SFrame given by the features list into the SFrame features_sframe
    features_df = pd.DataFrame(**FILL IN THE BLANK HERE WITH YOUR CODE**)
    # cast the features_df into a numpy matrix
    feature_matrix = features_df.as_matrix()
    etc.
The remaining code should be the same (since you only work with the numpy versions for the rest of the assignment).
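One caveat: DataFrame.as_matrix() was deprecated and later removed in newer pandas releases, so on a current pandas version the equivalent call is:
feature_matrix = features_df.to_numpy()  # replacement for the removed as_matrix()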
