We have custom dimensions defined in the Google Analytics Data API v1beta for extracting data from a Google Analytics GA4 account.
I am trying to fetch the eventCount metric with respect to date, campaignId, campaignName and eventName using Python. I want to know the eventCount for each eventName within each campaignName. Is there any workaround for fetching this data?
import pandas as pd
import numpy as np
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import DateRange
from google.analytics.data_v1beta.types import Dimension
from google.analytics.data_v1beta.types import Metric
from google.analytics.data_v1beta.types import RunReportRequest
client = BetaAnalyticsDataClient()
## Format Report - run_report method
def format_report(request):
    response = client.run_report(request)
    # Row index
    row_index_names = [header.name for header in response.dimension_headers]
    row_header = []
    for i in range(len(row_index_names)):
        row_header.append([row.dimension_values[i].value for row in response.rows])
    row_index_named = pd.MultiIndex.from_arrays(np.array(row_header), names=np.array(row_index_names))
    # Row flat data
    metric_names = [header.name for header in response.metric_headers]
    data_values = []
    for i in range(len(metric_names)):
        data_values.append([row.metric_values[i].value for row in response.rows])
    output = pd.DataFrame(data=np.transpose(np.array(data_values, dtype='f')),
                          index=row_index_named, columns=metric_names)
    return output
request = RunReportRequest(
    property='properties/' + property_id,
    dimensions=[
        Dimension(name="date"),
        Dimension(name="eventName"),
        Dimension(name="campaignId"),
        Dimension(name="campaignName")
    ],
    metrics=[
        Metric(name="eventCount"),
    ],
    date_ranges=[DateRange(start_date="2023-01-22", end_date="2023-01-25")],
)
Error:
InvalidArgument: 400 Please remove eventCount to make the request compatible. The request's dimensions & metrics are incompatible. To learn more, see https://ga-dev-tools.web.app/ga4/dimensions-metrics-explorer/
As the error message states, not all dimensions and metrics are compatible.
In your case, eventCount is not compatible with campaignId or campaignName. So to make this request work you must either remove eventCount, or remove campaignId or campaignName.
I guess what I am saying is that you can't make that request; the data doesn't exist.
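For reference, here is a minimal sketch of the request with the incompatible campaign dimensions dropped so that eventCount can be kept; it reuses the client, property_id and format_report helper from the question:
# Sketch: same report without campaignId/campaignName, so eventCount is allowed.
compatible_request = RunReportRequest(
    property='properties/' + property_id,
    dimensions=[
        Dimension(name="date"),
        Dimension(name="eventName"),
    ],
    metrics=[Metric(name="eventCount")],
    date_ranges=[DateRange(start_date="2023-01-22", end_date="2023-01-25")],
)
df = format_report(compatible_request)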
I am having the same issue where I am creating a custom report to fetch itemListViewEvents and itemViewEvents based on itemId
here's my logic:
require 'google/analytics/data'
require 'google/analytics/data/v1beta'
require 'google/analytics/data/v1beta/analytics_data'
class GoogleAnalyticsReportingApiService
  range = Google::Analytics::Data::V1beta::DateRange.new(
    start_date: Time.zone.today.to_date.to_s,
    end_date: 1.week.ago.to_date.to_s
  )
  metrics = [
    Google::Analytics::Data::V1beta::Metric.new(expression: 'itemListViews', name: 'itemListViewEvents'),
    Google::Analytics::Data::V1beta::Metric.new(expression: 'itemViews', name: 'itemViewEvents')
  ]
  dimension = Google::Analytics::Data::V1beta::Dimension.new(name: 'itemId')
  rr = Google::Analytics::Data::V1beta::RunReportRequest.new(
    property: analytics_property,
    dimensions: [dimension],
    metrics: metrics,
    date_ranges: [range],
    order_bys: [{desc: true, metric: {metric_name: 'itemViewEvents'}}],
    limit: 100000
  )
end
I am getting this error:
Google::Cloud::InvalidArgumentError: 3:Please remove itemId to make the request compatible. The request's dimensions & metrics are incompatible. To learn more, see https://ga-dev-tools.web.app/ga4/dimensions-metrics-explorer/. debug_error_string:{UNKNOWN:Error received from peer
and this is clearly outlined here: https://developers.google.com/analytics/devguides/reporting/data/v1/announcements/20221201-compatibility-changes
Now the issue is that I cannot simply keep only the dimension or only the metrics, because the API names are not supported the other way around (for example, itemId is not supported as a Metric). Not sure how to deal with this! More info here: https://github.com/googleapis/google-cloud-ruby/issues/20138
[UPDATE Feb/15/23]
After trying my hand at Custom Events, Custom Dimensions and Custom Metrics, I ended up using alternate schema names for the Metric fields.
The documentation here: https://developers.google.com/analytics/devguides/reporting/data/v1/api-schema helped me a lot, and I was able to figure out that I can use itemViewEvents in place of itemListViewEvents and itemsViewed in place of itemViewEvents.
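For anyone following along in Python, a rough sketch of the equivalent compatible request, mirroring the earlier Python example (the metric names are the alternate schema names mentioned above; the relative date strings are an assumption):
# Sketch in Python, assuming the alternate metric names above are compatible with itemId.
item_request = RunReportRequest(
    property='properties/' + property_id,
    dimensions=[Dimension(name="itemId")],
    metrics=[
        Metric(name="itemsViewed"),
        Metric(name="itemViewEvents"),
    ],
    date_ranges=[DateRange(start_date="7daysAgo", end_date="today")],
)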
I'm trying to build a multi-class text classifier using spaCy. I have built the model, but I'm facing a problem applying it to my full dataset. The model I have built so far is in the screenshot:
[screenshot of the model]
Below is the code I used to apply to my full dataset using Pandas:
Messages = pd.read_csv('Messages.csv', encoding='cp1252')
Messages['Body'] = Messages['Body'].astype(str)
Messages['NLP_Result'] = nlp(Messages['Body'])._.cats
But it gives me the error:
ValueError: [E1041] Expected a string, Doc, or bytes as input, but got: <class 'pandas.core.series.Series'>
The reason I wanted to use Pandas in this case is the dataset has 2 columns: ID and Body. I want to apply the NLP model only to the Body column, but I want the final dataset to have 3 columns: ID, Body and the NLP result like in the screenshot above.
Thanks so much
I tried Pandas apply method too, but had no luck. Code used:
Messages['NLP_Result'] = Messages['Body'].apply(nlp)._.cats
The error I got: AttributeError: 'Series' object has no attribute '_'
Expectation is to generate 3 columns as described above
You should provide a callable into Series.apply call:
Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)
Here, each value in the Body column will be assigned to the x variable.
nlp(x) will create a Doc object that contains the properties you'd like to access. Then nlp(x)._.cats will return the expected value.
import spacy
import classy_classification
import csv
import pandas as pd

with open('Deliveries.txt', 'r') as d:
    Deliveries = d.read().splitlines()
with open("Not Spam.txt", "r") as n:
    Not_Spam = n.read().splitlines()

data = {}
data["Deliveries"] = Deliveries
data["Not_Spam"] = Not_Spam

# NLP model
nlp = spacy.blank("en")
nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "device": "gpu"
    }
)

Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)
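If that runs, the frame should end up with the three columns you described (assuming the first column is literally named ID, as in your description):
# Quick check: ID and Body are kept, NLP_Result holds the classifier output.
print(Messages[['ID', 'Body', 'NLP_Result']].head())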
I have a list of CSV files in a google bucket organized like gs://bucket/some_dir/{partition_value}/filename. I want to create a pyarrow.Dataset from a list of URIs like this (which is a subset of files in some_dir).
How do I do this and extract partition_value as a column?
So far I have:
import gcsfs
import pyarrow as pa
import pyarrow.csv
import pyarrow.dataset as ds
from pyarrow.fs import FSSpecHandler, PyFileSystem
fs = gcsfs.GCSFileSystem()
schema = pa.schema([("gene_id", pa.string()), ("raw_count", pa.float32()), ("scaled_estimate", pa.float32())])
# these data are publicly accessible, btw
uris = [
    "gs://gdc-tcga-phs000178-open/0b8b258e-1671-4f86-82e7-59b12ad40d9c/unc.edu.4c243ea9-dfe1-42f0-a887-3c901fb38542.2477720.rsem.genes.results",
    "gs://gdc-tcga-phs000178-open/c8ee8367-c529-4dd6-98b4-fde57991134b/unc.edu.a64ae1f5-a189-4173-be13-903bd7637869.2476757.rsem.genes.results",
    "gs://gdc-tcga-phs000178-open/78354f8d-5ce8-4617-bba4-79614f232e97/unc.edu.ac19f7cf-670b-4dcc-a26b-db0f56377231.2509607.rsem.genes.results",
]
dataset = ds.FileSystemDataset.from_paths(
    uris,
    schema,
    format=ds.CsvFileFormat(parse_options=pa.csv.ParseOptions(delimiter="\t")),
    filesystem=PyFileSystem(FSSpecHandler(fs)),
    # partitions=["bucket", "file_gcs_id"],
    # root_partition="gdc-tcga-phs000178-open",
)
dataset.to_table()
This gives me a nice table with the fields in my schema.
But I'd like the partition key to be another field in my dataset. I'm guessing I need:
to add this as a field to my schema, and
to add something when calling FileSystemDataset.from_paths.
I tried fiddling with root_partition, but got an error that the string I provided isn't a pyarrow.Expression (no idea what that is). I also tried specifying partitions, but I get ValueError: The number of files resulting from paths_or_selector must be equal to the number of partitions.
During dataset discovery filename information is used (along with a specified partitioning) to generate "guarantees" which are attached to fragments. For example, when we see the file foo/x=7/bar.parquet and we are using "hive partitioning" we can attach the guarantee x == 7. These guarantees are stored as "expressions" for various reasons we don't need to discuss at the moment.
Two solutions jump to mind. First, you could create the guarantees yourself and attach them to your paths (this is what the partitions argument represents in the from_paths method). The expression should be ds.field("column_name") == value.
Second, you could allow the dataset discovery process to run as normal. This will generate all the fragments you need (and some you don't), with the guarantees already attached. Then you could trim down the list of fragments to your desired list of fragments and create a dataset from that.
(I'm guessing I need) to add this as a field to my schema
Yes. In both of the above approaches you will want to make sure your partitioning column(s) are added to your schema.
Here is a code example showing both approaches:
import shutil
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.fs as fs
shutil.rmtree('my_dataset', ignore_errors=True)
table = pa.Table.from_pydict({
    'x': [1, 2, 3, 4, 5, 6],
    'part': ['a', 'a', 'a', 'b', 'b', 'b']
})
ds.write_dataset(table, 'my_dataset', partitioning=['part'], format='parquet')
print('# Created by dataset factory')
partitioning = ds.partitioning(schema=pa.schema([pa.field('part', pa.string())]))
dataset = ds.dataset('my_dataset',partitioning=partitioning)
print(dataset.to_table())
print()
desired_paths = [
    'my_dataset/a/part-0.parquet'
]
# Note that table.schema used below includes the partitioning
# column so we've added that to the schema.
print('# Created from paths')
filesystem = fs.LocalFileSystem()
dataset_from_paths = ds.FileSystemDataset.from_paths(
    desired_paths,
    table.schema,
    format=ds.ParquetFileFormat(),
    filesystem=filesystem)
print(dataset_from_paths.to_table())
print()
print('# Created from paths with explicit partition information')
dataset_from_paths = ds.FileSystemDataset.from_paths(
    desired_paths,
    table.schema,
    partitions=[
        ds.field('part') == "a"
    ],
    format=ds.ParquetFileFormat(),
    filesystem=filesystem)
print(dataset_from_paths.to_table())
print()
print('# Created from discovery then trimmed')
trimmed_fragments = [frag for frag in dataset.get_fragments() if frag.path in desired_paths]
trimmed_dataset = ds.FileSystemDataset(trimmed_fragments, dataset.schema, dataset.format, filesystem=dataset.filesystem)
print(trimmed_dataset.to_table())
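Applied back to the GCS question above, the first approach might look roughly like this (a sketch, reusing uris, schema and fs from the question, and assuming the partition value is the parent directory component of each URI; partition_value is a hypothetical column name):
# Sketch: add the partition column to the schema and attach one guarantee per file,
# where the partition value is taken from each URI's parent directory.
schema_with_part = schema.append(pa.field("partition_value", pa.string()))
partitions = [ds.field("partition_value") == uri.split("/")[-2] for uri in uris]

dataset = ds.FileSystemDataset.from_paths(
    uris,
    schema_with_part,
    format=ds.CsvFileFormat(parse_options=pa.csv.ParseOptions(delimiter="\t")),
    filesystem=PyFileSystem(FSSpecHandler(fs)),
    partitions=partitions,
)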
I have a 12GB JSON file where every line contains information about a scientific paper. This is how it looks:
[screenshot of a sample record]
I want to parse it and create 3 pandas DataFrames that contain information about venues, authors, and how many times an author has published in a venue. Below you can see the code I have written. My problem is that this code needs many days to run. Is there a way to make it faster?
venues = pd.DataFrame(columns=['id', 'raw', 'type'])
authors = pd.DataFrame(columns=['id', 'name'])
main = pd.DataFrame(columns=['author_id', 'venue_id', 'number_of_times'])
with open(r'C:\Users\dintz\Documents\test.json', encoding='UTF-8') as infile:
    papers = ijson.items(infile, 'item')
    for paper in papers:
        if 'id' not in paper["venue"]:
            if 'type' not in paper["venue"]:
                venues = venues.append({'raw': paper["venue"]["raw"]}, ignore_index=True)
            else:
                venues = venues.append({'raw': paper["venue"]["raw"], 'type': paper["venue"]["type"]}, ignore_index=True)
        else:
            venues = venues.append({'id': paper["venue"]["id"], 'raw': paper["venue"]["raw"], 'type': paper["venue"]["type"]}, ignore_index=True)
        paper_authors = paper["authors"]
        paper_authors_json = json.dumps(paper_authors)
        obj = ijson.items(paper_authors_json, 'item')
        for author in obj:
            authors = authors.append({'id': author["id"], 'name': author["name"]}, ignore_index=True)
            main = main.append({'author_id': author["id"], 'venue_raw': venues.iloc[-1]['raw'], 'number_of_times': 1}, ignore_index=True)
authors = authors.drop_duplicates(subset=None, keep='first', inplace=False)
venues = venues.drop_duplicates(subset=None, keep='first', inplace=False)
main = main.groupby(by=['author_id', 'venue_raw'], axis=0, as_index=False).sum()
Apache Spark allows reading JSON files in multiple chunks in parallel to make this faster:
https://spark.apache.org/docs/latest/sql-data-sources-json.html
For a regular multi-line JSON file, set the multiLine parameter to True.
If you're not familiar with Spark, you can use the pandas-compatible layer on top of Spark called Koalas:
https://koalas.readthedocs.io/en/latest/
Koalas read_json call:
https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_json.html
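A minimal sketch of that, assuming one JSON object per line as described in the question (Koalas has since been folded into pyspark.pandas, so the import may differ depending on your Spark version):
import databricks.koalas as ks

# Read the line-delimited JSON in parallel across the available cores/executors.
papers = ks.read_json(r'C:\Users\dintz\Documents\test.json', lines=True)
print(papers.head())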
You're using the wrong tool to accomplish this task; do not use pandas for this scenario.
Let's look at the last 3 lines of your code: they are simple and clean, but filling that data into a pandas DataFrame is not so easy when you cannot use a pandas input function such as read_json() or read_csv().
I prefer pure Python for this simple task. If your PC has sufficient memory, use a dict to get unique authors and venues, use itertools.groupby for grouping, and use more_itertools.ilen to calculate the count.
authors = {}
venues = {}
for paper in papers:
    # venue id maps to its raw name and type; author id maps to the author name
    venues[paper["venue"]["id"]] = (paper["venue"]["raw"], paper["venue"]["type"])
    for author in paper["authors"]:
        authors[author["id"]] = author["name"]
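Putting it together in a single pass over the file might look roughly like this. It is a sketch: it re-opens the file with ijson (the iterator from the question can only be consumed once) and uses collections.Counter as a simpler stand-in for the itertools.groupby / more_itertools.ilen combination mentioned above.
from collections import Counter
import ijson
import pandas as pd

authors = {}
venues = {}
pair_counts = Counter()  # (author_id, venue_id) -> number_of_times

with open(r'C:\Users\dintz\Documents\test.json', encoding='UTF-8') as infile:
    for paper in ijson.items(infile, 'item'):
        venue = paper["venue"]
        venue_id = venue.get("id")
        venues[venue_id] = (venue.get("raw"), venue.get("type"))
        for author in paper["authors"]:
            authors[author["id"]] = author["name"]
            pair_counts[(author["id"], venue_id)] += 1

# Build the three DataFrames only once, at the end.
venues_df = pd.DataFrame([(i, r, t) for i, (r, t) in venues.items()], columns=['id', 'raw', 'type'])
authors_df = pd.DataFrame(list(authors.items()), columns=['id', 'name'])
main_df = pd.DataFrame([(a, v, n) for (a, v), n in pair_counts.items()],
                       columns=['author_id', 'venue_id', 'number_of_times'])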
I am trying to download historical intraday data for USD/EUR for the last 6 months from Alpha Vantage.
Here is the code I am trying to execute
import pandas as pd
from alpha_vantage.timeseries import TimeSeries
api = "######"
ts = TimeSeries(key=####,output_format = "pandas")
data,metadata = ts.get_intraday(symbol = "USD/CAD",interval= "1min" , outputsize = "full")
print(data)
It is giving an error
ValueError: Invalid API call. Please retry or visit the documentation (https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol=USD/documentation/) for TIME_SERIES_INTRADAY.
What can be the reason for this ?
As per the documentation for TIME_SERIES_INTRADAY and your error message, it is somewhat apparent to me that your API call is invalid. If you look at your command, the API key is actually missing, and as per the documentation it is required.
Try adding your API token/key in the last line (below) and at least the above problem should be solved.
import pandas as pd
from alpha_vantage.timeseries import
api = "######"
ts = TimeSeries(key=####,output_format = "pandas")
data,metadata = ts.get_intraday(function=TIME_SERIES_INTRADAY, symbol = "USD/CAD",interval= "1min" , outputsize = "full", apikey="Please fill your api key here")
Hope it helps.
===============================================================
Edit after going through the source code of alphavantage.
So I went through the code. There is nothing wrong with it as far as the apikey is concerned, since in the line before, where you actually call the API, you have instantiated the TimeSeries class and given the API key at that time. So it is not needed again.
I could replicate your error at my end. After some traversal through the code, I realised that you are perhaps passing the wrong currency. It should not be USD/CAD but just USD. If you rather wish to get USD/CAD, you have to say USDCAD. When you pass symbol = "USD/CAD", most likely the formed API URL is wrong and, due to the "/", it is prematurely terminated.
Below is the edited code. I have also edited your original post and in the second line, after import, I have added TimeSeries. I hope that is right. If not please reject the edit.
import pandas as pd
from alpha_vantage.timeseries import TimeSeries
api = "XXXXXXXXXXXXX"
ts = TimeSeries(key=api,output_format = "pandas")
data,metadata = ts.get_intraday(symbol = "USD",interval= "1min" , outputsize = "full")
print(data)
data,metadata = ts.get_intraday(symbol = "USDCAD",interval= "1min" , outputsize = "full")
print(data)
I hope this helps.
Change your code to:
import pandas as pd
from alpha_vantage.timeseries import TimeSeries
api_key = "XXXX"
ts = TimeSeries(key = api_key,output_format = "pandas")
data, metadata = ts.get_intraday(symbol = "USDCAD",interval= "1min" , outputsize = "full")
print(data)
Here were the edits made to solve this.
You had a few syntax errors in your code sample.
"#" strings don't work as a parameter in the API key. For testing, use something like "XXXX"
The symbol has to be "USDCAD" not "USD/CAD"
Optional, but preferred: you should be using the from alpha_vantage.foreignexchange import ForeignExchange package to get currency pairs, as opposed to the TimeSeries object (a sketch follows below).
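A rough sketch of that approach, assuming the ForeignExchange class exposes a get_currency_exchange_intraday method that takes the two currencies separately:
from alpha_vantage.foreignexchange import ForeignExchange

api_key = "XXXX"
fx = ForeignExchange(key=api_key, output_format="pandas")
# from_symbol / to_symbol are passed separately instead of a combined "USDCAD".
data, metadata = fx.get_currency_exchange_intraday(
    from_symbol="USD", to_symbol="CAD", interval="1min", outputsize="full"
)
print(data)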
TL;DR: I have a PyTable with a float32 Col and get an error when writing a numpy-float32-array into it. (How) can I store a numpy-array (float32) in the Column of a PyTables table?
I'm new to PyTables - following a recommendation of TFtables (a lib to use HDF5 in Tensorflow), I'm using it to store all my HDF5 data (currently distributed in batches in several files with each three datasets) within a table in a single HDF5 file. Datasets are
'data' : (n_elements, 1024, 1024, 4)#float32
'label' : (n_elements, 1024, 1024, 1)#uint8
'weights' : (n_elements, 1024, 1024, 1)#float32
where the n_elements are distributed over several files that I want to merge into one now (to allow unordered access).
So when I build my table, I figured each dataset represents a column. I built everything in a generic way that allows to do this for an arbitrary number of datasets:
# gets dtypes (and shapes) of the dsets (accessed by dset_keys = ['data', 'label', 'weights']
dtypes, shapes = _determine_shape(hdf5_files, dset_keys)
# to dynamically generate a table, I'm using a dict (not a class as in the PyTables tutorials)
# the dict is (conform with the doc): { 'col_name' : Col()-class-descendent }
table_description = {dset_keys[i]: tables.Col.from_dtype(dtypes[i]) for i in range(len(dset_keys))}
# create a file, a group-node and attach a table to it
h5file = tables.open_file(destination_file, mode="w", title="merged")
group = h5file.create_group("/", 'main', 'Node for data table')
table = h5file.create_table(group, 'data_table', table_description, "Collected data with %s" % (str(val_keys)))
The dtypes that I get for each dset (read with h5py) are obviously the ones of the numpy arrays (ndarray) that reading the dset returns: float32 or uint8. So the Col()-types are Float32Col and UInt8Col. I naively assumed that I can now write a float32-array into this col, but filling in data with:
dummy_data = np.zeros([1024,1024,3], float32) # normally data read from other files
sample = table.row
sample['data'] = dummy_data
results in TypeError: invalid type (<class 'numpy.ndarray'>) for column ``data``. So now I feel stupid for assuming I'd be able to write an array in there, BUT there are no "ArrayCol()" types offered, nor are there any hints in the PyTables docs as to whether or how it is possible to write an array into a column. How do I do this?
There are "shape" arguments in the Col() class and it's descendents, so it should be possible, otherwise what are these for?!
I know it's a bit late, but I think the answer to your problem lies in the shape parameter for Float32Col.
Here's how it's used in the documentation:
from tables import *
from numpy import *

# Describe a particle record
class Particle(IsDescription):
    name = StringCol(itemsize=16)           # 16-character string
    lati = Int32Col()                       # integer
    longi = Int32Col()                      # integer
    pressure = Float32Col(shape=(2, 3))     # array of floats (single-precision)
    temperature = Float64Col(shape=(2, 3))  # array of doubles (double-precision)

# Open a file in "w"rite mode
fileh = open_file("tutorial2.h5", mode="w")
# Get the HDF5 root group
root = fileh.root
# Create the groups:
for groupname in ("Particles", "Events"):
    group = fileh.create_group(root, groupname)
# Now, create and fill the tables in Particles group
gparticles = root.Particles
# Create 3 new tables
for tablename in ("TParticle1", "TParticle2", "TParticle3"):
    # Create a table
    table = fileh.create_table("/Particles", tablename, Particle, "Particles: " + tablename)
    # Get the record object associated with the table:
    particle = table.row
    # Fill the table with 257 particles
    for i in xrange(257):
        # First, assign the values to the Particle record
        particle['name'] = 'Particle: %6d' % (i)
        particle['lati'] = i
        particle['longi'] = 10 - i
        ########### Detectable errors start here. Play with them!
        particle['pressure'] = array(i*arange(2*3)).reshape((2, 4))  # Incorrect
        #particle['pressure'] = array(i*arange(2*3)).reshape((2, 3))  # Correct
        ########### End of errors
        particle['temperature'] = (i**2)  # Broadcasting
        # This injects the Record values
        particle.append()
    # Flush the table buffers
    table.flush()
Here's the link to the part of the documentation I'm referring to
https://www.pytables.org/usersguide/tutorials.html
Edit: I just saw that the tables.Col.from_type(type, shape) allows using the precision of a type (float32 instead of float alone). The rest stays the same (takes a string and shape).
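For example, a column for the 'data' arrays from the question could be built directly like this (a sketch; the shape is the per-element shape listed above):
import tables

# One (1024, 1024, 4) float32 array per row, keeping the precision in the type string.
data_col = tables.Col.from_type('float32', shape=(1024, 1024, 4))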
The factory function tables.Col.from_kind(kind, shape) can be used to construct a Col-Type that supports ndarrays. What "kind" is and how to use this isn't documented anywhere I found; however with trial and error I found that allowed "kind"s are strings of basic datatypes. I.e.: 'float', 'uint', ... without the precision (NOT 'float64')
Since I get numpy.dtypes from h5py reading a dataset (dset.dtype), these have to be cast to str and the precision needs to be removed.
In the end the relevant lines look like this:
# get key, dtype and shapes of elements per dataset from the datasource files
val_keys, dtypes, element_shapes = _get_dtypes(datasources, element_axis=element_axis)
# for storing arrays in columns apparently one has to use "kind"
# "kind" cannot be created with dtype but only a string representing
# the dtype w/o precision, e.g. 'float' or 'uint'
dtypes_kind = [''.join(i for i in str(dtype) if not i.isdigit()) for dtype in dtypes]
# create table description as dictionary
description = {val_keys[i]: tables.Col.from_kind(dtypes_kind[i], shape=element_shapes[i]) for i in range(len(val_keys))}
Then writing data into the table finally works as suggested:
sample = table.row
sample[key] = my_array
Since it all felt a bit "hacky" and isn't documented well, I am still wondering whether this is an unintended use of PyTables, and I'll leave this question open for a bit to see if someone knows more about this...