Use Spacy with Pandas - python

I'm trying to build a multi-class text classifier using spaCy. I have built the model, but I'm facing a problem applying it to my full dataset. The model I have built so far is in the screenshot.
Below is the code I used to apply it to my full dataset using Pandas:
Messages = pd.read_csv('Messages.csv', encoding='cp1252')
Messages['Body'] = Messages['Body'].astype(str)
Messages['NLP_Result'] = nlp(Messages['Body'])._.cats
But it gives me the error:
ValueError: [E1041] Expected a string, Doc, or bytes as input, but got: <class 'pandas.core.series.Series'>
The reason I wanted to use Pandas in this case is the dataset has 2 columns: ID and Body. I want to apply the NLP model only to the Body column, but I want the final dataset to have 3 columns: ID, Body and the NLP result like in the screenshot above.
Thanks so much
I tried the Pandas apply method too, but had no luck. Code used:
Messages['NLP_Result'] = Messages['Body'].apply(nlp)._.cats
The error I got: AttributeError: 'Series' object has no attribute '_'
My expectation is to generate the 3 columns described above.

You should pass a callable to the Series.apply call:
Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)
Here, each value in the Body column will be assigned to the x variable.
nlp(x) will create a Doc object that carries the custom attributes you'd like to access; nlp(x)._.cats will then return the expected value.
import spacy
import classy_classification
import csv
import pandas as pd

with open('Deliveries.txt', 'r') as d:
    Deliveries = d.read().splitlines()
with open('Not Spam.txt', 'r') as n:
    Not_Spam = n.read().splitlines()

data = {}
data["Deliveries"] = Deliveries
data["Not_Spam"] = Not_Spam

# NLP model: a blank English pipeline with a few-shot text categorizer
nlp = spacy.blank("en")
nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "device": "gpu"
    }
)

Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)
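Note that ._.cats here holds a dict of category scores per label (with classy-classification, something like {'Deliveries': 0.83, 'Not_Spam': 0.17}). If you also want a single predicted-label column, a small follow-up sketch (the NLP_Label column name is just an example) is:

Messages['NLP_Label'] = Messages['NLP_Result'].apply(lambda cats: max(cats, key=cats.get))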

Related

Google analytics 4 reporting dimensions & metrics are incompatible using python

We have custom dimensions defined in the Google Analytics Data API v1beta for extracting data from a Google Analytics GA4 account.
I am trying to fetch the eventCount metric with respect to date, campaignId, campaignName and eventName using Python. I want to know the eventCount for each eventName in each campaignName. Is there any workaround for how I can fetch this data?
import pandas as pd
import numpy as np
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import DateRange
from google.analytics.data_v1beta.types import Dimension
from google.analytics.data_v1beta.types import Metric
from google.analytics.data_v1beta.types import RunReportRequest

client = BetaAnalyticsDataClient()

## Format Report - run_report method
def format_report(request):
    response = client.run_report(request)
    # Row index
    row_index_names = [header.name for header in response.dimension_headers]
    row_header = []
    for i in range(len(row_index_names)):
        row_header.append([row.dimension_values[i].value for row in response.rows])
    row_index_named = pd.MultiIndex.from_arrays(np.array(row_header), names=np.array(row_index_names))
    # Row flat data
    metric_names = [header.name for header in response.metric_headers]
    data_values = []
    for i in range(len(metric_names)):
        data_values.append([row.metric_values[i].value for row in response.rows])
    output = pd.DataFrame(data=np.transpose(np.array(data_values, dtype='f')),
                          index=row_index_named, columns=metric_names)
    return output

request = RunReportRequest(
    property='properties/' + property_id,
    dimensions=[
        Dimension(name="date"),
        Dimension(name="eventName"),
        Dimension(name="campaignId"),
        Dimension(name="campaignName")
    ],
    metrics=[
        Metric(name="eventCount"),
    ],
    date_ranges=[DateRange(start_date="2023-01-22", end_date="2023-01-25")],
)
Error:
InvalidArgument: 400 Please remove eventCount to make the request compatible. The request's dimensions & metrics are incompatible. To learn more, see https://ga-dev-tools.web.app/ga4/dimensions-metrics-explorer/
As the error message states, not all dimensions and metrics are compatible.
In your case, eventCount is not compatible with campaignId or campaignName. So to make this request work you must either remove eventCount, or remove campaignId and campaignName.
I guess what I am saying is that you can't make that request; the data doesn't exist.
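If you want to check which pairings are valid programmatically, the Data API v1beta also exposes a CheckCompatibility method; a minimal sketch (reusing client and property_id from the question, and assuming the google-analytics-data Python client) might be:

from google.analytics.data_v1beta.types import CheckCompatibilityRequest, Dimension, Metric

compat = client.check_compatibility(CheckCompatibilityRequest(
    property='properties/' + property_id,
    dimensions=[Dimension(name="campaignName")],
    metrics=[Metric(name="eventCount")],
))
# Each requested name is reported as COMPATIBLE or INCOMPATIBLE
print(compat.dimension_compatibilities)
print(compat.metric_compatibilities)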
I am having the same issue: I am creating a custom report to fetch itemListViewEvents and itemViewEvents based on itemId.
Here's my logic:
require 'google/analytics/data'
require 'google/analytics/data/v1beta'
require 'google/analytics/data/v1beta/analytics_data'

class GoogleAnalyticsReportingApiService
  range = Google::Analytics::Data::V1beta::DateRange.new(
    start_date: Time.zone.today.to_date.to_s,
    end_date: 1.week.ago.to_date.to_s
  )
  metrics = [
    Google::Analytics::Data::V1beta::Metric.new(expression: 'itemListViews', name: 'itemListViewEvents'),
    Google::Analytics::Data::V1beta::Metric.new(expression: 'itemViews', name: 'itemViewEvents')
  ]
  dimension = Google::Analytics::Data::V1beta::Dimension.new(name: 'itemId')
  rr = Google::Analytics::Data::V1beta::RunReportRequest.new(
    property: analytics_property,
    dimensions: [dimension],
    metrics: metrics,
    date_ranges: [range],
    order_bys: [{ desc: true, metric: { metric_name: 'itemViewEvents' } }],
    limit: 100000
  )
end
I am getting this error:
Google::Cloud::InvalidArgumentError: 3:Please remove itemId to make the request compatible. The request's dimensions & metrics are incompatible. To learn more, see https://ga-dev-tools.web.app/ga4/dimensions-metrics-explorer/. debug_error_string:{UNKNOWN:Error received from peer
and this is clearly outlined here: https://developers.google.com/analytics/devguides/reporting/data/v1/announcements/20221201-compatibility-changes
Now the issue is, I just cannot drop either the dimension or the metrics, because the API names are not supported the other way around (for example: itemId is not supported as a Metric). Not sure how to deal with this! More info here: https://github.com/googleapis/google-cloud-ruby/issues/20138
[UPDATE Feb/15/23]
After trying my hand at Custom Events, Custom Dimensions and Custom Metrics, I ended up using alternate schema names for the Metrics fields.
The documentation here: https://developers.google.com/analytics/devguides/reporting/data/v1/api-schema helped me a lot, and I was able to figure out that I can use itemViewEvents in place of itemListViewEvents and itemsViewed in place of itemViewEvents.
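For the Python client, the same fix (assuming those alternate schema names, and reusing property_id from the earlier snippet) would look like:

from google.analytics.data_v1beta.types import DateRange, Dimension, Metric, RunReportRequest

request = RunReportRequest(
    property='properties/' + property_id,
    dimensions=[Dimension(name="itemId")],
    metrics=[
        Metric(name="itemViewEvents"),  # stands in for itemListViewEvents
        Metric(name="itemsViewed"),     # stands in for itemViewEvents
    ],
    date_ranges=[DateRange(start_date="2023-01-22", end_date="2023-01-25")],
)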

spaCy: spacy.tokens.doc.Doc to dataframe

I have a Spacy model for text generation, and I want to create a pandas data frame with all the texts that my Spacy model produces in each iteration. How can I save the spacy.tokens.doc.Doc output into a pandas dataframe?
nlp = spacy.load('en_core_web_sm')
newDataSet = pd.DataFrame()
docs = nlp.pipe(df['Text'])
syn_augmenter = augmenty.load('random_synonym_insertion.v1', level=0.1)
for doc in augmenty.docs(docs, augmenter=syn_augmenter, nlp=nlp):
    newDataSet = newDataSet.add(doc)  # this produces an error
So you probably want to use the DframCy library to make that happen. It is also recommended by spaCy: https://spacy.io/universe/project/dframcy. A snippet I use is:
import spacy
import pandas as pd
from dframcy import DframCy
from tqdm import tqdm

nlp = spacy.load('en_core_web_trf')
dframcy = DframCy(nlp)
columns = ["id", "text", "start", "end", "pos_", "tag_", "dep_",
           "head", "ent_type_", "lemma_", "lower_", "is_punct", "is_quote", "is_digit"]

def get_features(item):
    doc = dframcy.nlp(item[1]["discourse_text"])
    annotation_dataframe = dframcy.to_dataframe(doc, columns=columns)
    annotation_dataframe['index'] = item[0]
    return annotation_dataframe

results = []
for item in tqdm(df.iterrows(), total=df.shape[0]):
    results.append(get_features(item))
features = pd.concat(results)
features
So the columns object denotes which token attributes you want returned. It is passed to DframCy, which extracts the features and returns a nice dataframe per document. If you have a table of strings that you want to tokenize and get features from, you need to iterate over it; tqdm tracks the overall progress of your for-loop. Concatenating the list of dataframes (one per doc) then gives you a complete overview.
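If you'd rather not add a dependency, a minimal alternative sketch (plain spaCy plus pandas, reusing df['Text'] from the question; the three attributes shown are just examples) is:

import pandas as pd
import spacy

nlp = spacy.load('en_core_web_sm')
rows = []
for doc in nlp.pipe(df['Text']):
    for token in doc:
        rows.append({'text': token.text, 'lemma': token.lemma_, 'pos': token.pos_})
newDataSet = pd.DataFrame(rows)  # one row per token, one column per attribute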

# getting added to header in csv file

I'm trying to convert a csv file to a numpy array. When I try to specify the required column, it says it doesn't exist in the csv file. So I checked the csv file and found out that a # is getting added to the header name I specified.
Is there any way I can avoid that?
Code below
np.savetxt(path13 + 'bkg_sample.csv', predictions, fmt = '%5.5f' ,delimiter = ',' , header='predict')
Header name - # predict
Error on jupyter - 'Dataframe' object has no attribute 'predict'
predict = pd.read_csv('/home/user1/AAAA/Predictions/Code testingbkg.csv',usecols=[2,3,4,5,6,7,8,9,10])
predictions = model.predict(standardscaler.transform(predict))
np.savetxt(path13+'bkg_sample.csv', predictions, fmt = '%5.5f',delimiter = ',',header='predict')
true = pd.read_csv('/home/user1/AAAA/Predictions/bkg_sample.csv')
true[true.predict>0.67] ##This is where the error occurs
Links for image:
bkgsample : https://imgur.com/a/tzh0o2M
predict.csv : https://imgur.com/a/DhPAzqa
Try listing out columns of your DataFrame:
print(true.columns)
It looks like in your bkg_sample.csv there is no column named predict or even # predict.
Found the answer here: https://stackoverflow.com/a/17361181/16355784
So apparently numpy inserts # because it treats the header line as a comment; if you want to suppress it, just pass comments='' to savetxt.
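A minimal sketch of the corrected call, using the same variables as in the question:

np.savetxt(path13 + 'bkg_sample.csv', predictions, fmt='%5.5f', delimiter=',', header='predict', comments='')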

Loop to retrieve sentiment analysis in pandas.core.series.Series

I have 47 news articles from which I want to extract the sentiment. They are in JSON format (date, title and body of the article). All I want is to obtain a list with the sentiment, using TextBlob. So far I am doing the following:
import json
import pandas
from textblob import TextBlob

appended_data = []
for i in range(1, 47):
    df0 = pandas.DataFrame([json.loads(l) for l in open('News_%d.json' % i)])
    appended_data.append(df0)
appended_data = pandas.concat(appended_data)
doc_set = appended_data.body
docs_TextBlob = TextBlob(doc_set)
for i in docs_TextBlob:
    print(docs_TextBlob.sentiment)
Obviously, I get the following error: TypeError: The text argument passed to __init__(text) must be a string, not <class 'pandas.core.series.Series'>. Any idea on how to create a list with the sentiment measure?
To create a new column in the DataFrame with the sentiment:
appended_data['sentiment'] = appended_data.body.apply(lambda body: TextBlob(body).sentiment)
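If you literally want a list rather than a DataFrame column, a small variation (TextBlob's sentiment is a namedtuple with polarity and subjectivity fields) is:

sentiments = [TextBlob(body).sentiment.polarity for body in appended_data.body]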

Pandas AssertionError: 1 columns passed, passed data had 2 columns

I am working on an Azure ML implementation of text analytics with NLTK, and the following execution is throwing:
AssertionError: 1 columns passed, passed data had 2 columns\r\nProcess returned with non-zero exit code 1
Below is the code
# The script MUST include the following function,
# which is the entry point for this module:
# Param<dataframe1>: a pandas.DataFrame
# Param<dataframe2>: a pandas.DataFrame
def azureml_main(dataframe1 = None, dataframe2 = None):
    # import required packages
    import pandas as pd
    import nltk
    import numpy as np

    # tokenize the review text and store the word corpus
    word_dict = {}
    token_list = []
    nltk.download(info_or_id='punkt', download_dir='C:/users/client/nltk_data')
    nltk.download(info_or_id='maxent_treebank_pos_tagger', download_dir='C:/users/client/nltk_data')
    for text in dataframe1["tweet_text"]:
        tokens = nltk.word_tokenize(text.decode('utf8'))
        tagged = nltk.pos_tag(tokens)

    # convert feature vector to dataframe object
    dataframe_output = pd.DataFrame(tagged, columns=['Output'])
    return [dataframe_output]
The error is thrown here:
dataframe_output = pd.DataFrame(tagged, columns=['Output'])
I suspect the issue is the data type of tagged being passed to the DataFrame. Can someone let me know the right approach for adding this to a dataframe?
Try this:
dataframe_output = pd.DataFrame(tagged, columns=['Output', 'temp'])
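This works because nltk.pos_tag returns a list of (token, tag) tuples, so each row carries two values and the DataFrame needs two column names. A minimal illustration (the column names Token and POS are just examples):

import nltk
import pandas as pd

nltk.download('punkt')  # tokenizer data, needed on first run
tagged = nltk.pos_tag(nltk.word_tokenize("NLTK tags words"))  # a list of (token, tag) tuples
dataframe_output = pd.DataFrame(tagged, columns=['Token', 'POS'])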
