I would like to prepare a market basket analysis in Python based on Google Analytics data, examining the most common paths users go through at the cookie level. I have encountered two problems. First, when I query the data from BigQuery, the hit number is on the session level and not on the cookie level. How can I show the path a user has gone through at the cookie level rather than the session level? Second, I do not know how to prepare the data: in R, a transaction class is needed to feed the apriori algorithm. I know that in Python the solution is to one-hot encode the data; however, with that approach the sequence of page paths is lost.
Could somebody please help me? Thank you!
I think your best bet for aggregating page_paths at a cookie level would be to group by visitor_id. The visitor_id is what GA assigns as the cookie and should persist across visits unless a user goes incognito or clears cookies. If you are using a Custom Dimension to track users logging in to your website, you will see that a single user can have multiple visitor_ids.
Before you aggregate up, you can combine all this information by using visit_id to distinguish between different sessions. You can query all hit-level data for a given user and then roll up from there.
I think this can be done by adjusting how you query the hit level: keep the hit number, but look across all of a visitor's sessions rather than a single session, for example:
SELECT
fullVisitorId,
visitId,
visitNumber,
hits.hitNumber AS hitNumber,
hits.page.pagePath AS pagePath
FROM
TABLE_DATE_RANGE( [bigquery-public-data:google_analytics_sample.ga_sessions_],
TIMESTAMP('2017-07-01'), TIMESTAMP('2017-07-31') )
WHERE
hits.type="PAGE"
ORDER BY
fullVisitorId,
visitId,
visitNumber,
hitNumber
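For the second problem (losing the sequence when one-hot encoding), one option is to build two representations from the same query result: an ordered list of page paths per cookie for path analysis, and a one-hot basket per cookie for apriori. A minimal sketch, assuming the query result is loaded into a pandas DataFrame df with the columns selected above and that the mlxtend package is installed:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# df is assumed to hold the BigQuery result with columns
# fullVisitorId, visitId, visitNumber, hitNumber, pagePath

# Ordered path per cookie: sort by visit and hit number, then collect the pagePaths
paths = (
    df.sort_values(["fullVisitorId", "visitNumber", "hitNumber"])
      .groupby("fullVisitorId")["pagePath"]
      .apply(list)
)

# One-hot encode the same baskets for apriori (order is intentionally dropped here)
te = TransactionEncoder()
onehot = te.fit_transform(paths.tolist())
basket = pd.DataFrame(onehot, columns=te.columns_)

frequent_itemsets = apriori(basket, min_support=0.01, use_colnames=True)

The paths series keeps the sequence for each cookie, so the path analysis can be done on it separately, while apriori only ever sees the unordered baskets.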
I'm trying to analyze (for business intelligence purposes) some Google Analytics data in Python.
All I get after many tutorials is "aggregated" data, like the number of views in a day. What I need instead is something capable of tracking the behavior of a single user: which pages of the website they visited, their bounce rate, whether they used the e-commerce section, and so on.
I have seen many CSVs already prepared for this kind of analysis, but I'm starting from scratch with my website.
You can use the User-ID feature: when you send Analytics an ID and related data from multiple sessions, your reports tell a more unified, holistic story about a user's relationship with your business:
https://support.google.com/analytics/answer/3123662?hl=en
Otherwise, you can examine individual-user behavior at the session level in the User Explorer report, which lets you isolate and examine individual rather than aggregate user behavior. Individual user behavior is associated with either a Client ID or a User ID.
https://support.google.com/analytics/answer/6339208?hl=en
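If you want to pull this per-user (Client ID or User ID) data into Python rather than just viewing it in the User Explorer report, the Reporting API v4 also has a User Activity endpoint that returns session-by-session activity for a single ID. A rough sketch, assuming a service-account key file and view ID (both placeholders here); double-check the exact response fields against the API reference:

from google.oauth2 import service_account
from googleapiclient.discovery import build

# Placeholder credentials file and view ID
credentials = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/analytics.readonly"],
)
analytics = build("analyticsreporting", "v4", credentials=credentials)

# userActivity.search returns the individual sessions for one Client ID / User ID
response = analytics.userActivity().search(
    body={
        "viewId": "123456789",
        "user": {"type": "CLIENT_ID", "userId": "1234567890.1234567890"},
        "dateRange": {"startDate": "2019-01-01", "endDate": "2019-01-31"},
    }
).execute()

for session in response.get("sessions", []):
    for activity in session.get("activities", []):
        # each activity describes one hit (pageview, event, ecommerce, ...)
        print(session.get("sessionId"), activity.get("activityTime"), activity.get("activityType"))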
I have a basic personal project website that I am looking to use to learn some web dev fundamentals as well as database (SQL) fundamentals (if SQL is even the right technology to use?).
I have the basic skeleton up and running but as I am new to this, I want to make sure I am doing it in the most efficient and "correct" way possible.
Currently the site has a main index (landing) page, and from there the user can select one of a few subpages. For the sake of understanding, each of these subpages represents a different surf break, and each displays relevant info about that particular break, i.e. wave height, wind, and tide.
As I have already been able to successfully scrape this data, my main questions revolve around how I would go about inserting this data into a database for future use (historical graphs, trends). How would I ensure data is added to this database in a continuous manner (once per day)? How would I use data that was scraped at an earlier time, say at noon, to be displayed/used at 12:05 PM rather than scraping it again?
Any other tips, guidance, or resources you can point me to are much appreciated.
This kind of data is called time series. There are specialized database engines for time series, but with a not-extreme volume of observations - (timestamp, wave height, wind, tide, which break it is) tuples - a SQL database will be perfectly fine.
Try to model your data as a table in Postgres or MySQL. Start by making a table and manually inserting some fake data in a GUI client for your database. When it looks right, you have your schema. The corresponding CREATE TABLE statement is your DDL. You should be able to write SELECT queries against your table that yield the data you want to show on your webapp. If these queries are awkward, it's a sign that your schema needs revision. Save your DDL. It's (sort of) part of your source code. I imagine two tables: a listing of surf breaks, and a listing of observations. Each row in the listing of observations would reference the listing of surf breaks. If you're on a Mac, Sequel Pro is a decent tool for playing around with a MySQL database, and playing around is probably the best way to learn to use one.
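As a concrete starting point, here is a minimal sketch of that two-table idea, using Postgres via psycopg2; the table and column names are placeholders to adapt:

import psycopg2

# Hypothetical DDL for the two tables described above: surf breaks and observations
DDL = """
CREATE TABLE IF NOT EXISTS surf_breaks (
    id   SERIAL PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);

CREATE TABLE IF NOT EXISTS observations (
    id            SERIAL PRIMARY KEY,
    surf_break_id INTEGER NOT NULL REFERENCES surf_breaks (id),
    observed_at   TIMESTAMPTZ NOT NULL,
    wave_height_m NUMERIC,
    wind_kts      NUMERIC,
    tide_m        NUMERIC,
    UNIQUE (surf_break_id, observed_at)
);
"""

with psycopg2.connect("dbname=surf user=surf") as conn:  # placeholder connection string
    with conn.cursor() as cur:
        cur.execute(DDL)

The UNIQUE (surf_break_id, observed_at) constraint is also what lets the import script further down stay idempotent.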
Next, try to insert data into the table from a Python script. Starting with fake data is fine, but mold your Python script to read from your upstream source (the result of scraping) and insert into the table. What does your scraping code output? Is it a function you can call? A CSV you can read? That'll dictate how this script works.
It'll help if this import script is idempotent: you can run it multiple times and it won't make a mess by inserting duplicate rows. It'll also help if it's incremental: once your dataset grows large, it will be very expensive to recompute the whole thing. Try to deal with importing a specific interval at a time. A command-line tool is fine. You can specify the interval as a command-line argument, or figure it out from the current time.
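A sketch of that import script, assuming the hypothetical schema above and that your scraping code exposes a fetch_observations(day) function returning (break name, timestamp, wave height, wind, tide) tuples:

import sys
from datetime import date

import psycopg2

from scraper import fetch_observations  # hypothetical: wraps your existing scraping code

INSERT_SQL = """
    INSERT INTO observations (surf_break_id, observed_at, wave_height_m, wind_kts, tide_m)
    SELECT sb.id, %s, %s, %s, %s
    FROM surf_breaks sb
    WHERE sb.name = %s
    ON CONFLICT (surf_break_id, observed_at) DO NOTHING  -- reruns stay idempotent
"""

def import_day(day):
    with psycopg2.connect("dbname=surf user=surf") as conn:
        with conn.cursor() as cur:
            for break_name, observed_at, wave, wind, tide in fetch_observations(day):
                cur.execute(INSERT_SQL, (observed_at, wave, wind, tide, break_name))

if __name__ == "__main__":
    # Incremental: one day at a time, given on the command line or defaulting to today
    day = date.fromisoformat(sys.argv[1]) if len(sys.argv) > 1 else date.today()
    import_day(day)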
The general problem here, loading data from one system into another on a regular schedule, is called ETL. You have a very simple case of it, and can use very simple tools, but if you want to read about it, that's what it's called. If instead you could get a continuous stream of observations - say, straight from the sensors - you would have a streaming ingestion problem.
You can use cron (the standard Linux job scheduler) to make this script run on a schedule. You'll want to know whether it ran successfully - this opens a whole other can of worms about monitoring and alerting. There are various open-source systems that will let you emit metrics from your programs, basically a "hey, this happened" tick, see these metrics plotted on graphs, and ask to be emailed/texted/paged if something is happening too frequently or too infrequently. (These systems are, incidentally, one of the main applications of time-series databases). Don't get bogged down with this upfront, but keep it in mind. Statsd, Grafana, and Prometheus are some names to get you started Googling in this direction. You could also simply have your script send an email on success or failure, but people tend to start ignoring such emails.
You'll have written some functions to interact with your database engine. Extract these into a Python module. This forms the basis of your Data Access Layer. Reuse it in your Flask application. This will be easiest if you keep all this stuff in the same Git repository. You can use your chosen database engine's Python client directly, or you can use an abstraction layer like SQLAlchemy. This decision is controversial and people will have opinions, but just pick one. Whatever database API you pick, please learn what a SQL injection attack is and how to use user-supplied data in queries without opening yourself up to SQL injection. Your database API's documentation should cover the latter.
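A sketch of what such a module could look like (names are placeholders), using parameterized queries so user-supplied values are never interpolated into the SQL string:

# dal.py - minimal data access layer sketch over the hypothetical tables above
import psycopg2

DSN = "dbname=surf user=surf"  # placeholder; read this from config or the environment

def list_breaks():
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT id, name FROM surf_breaks ORDER BY name")
        return cur.fetchall()

def observations_for_break(break_id):
    # the %s placeholder lets the driver escape break_id, preventing SQL injection
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT observed_at, wave_height_m, wind_kts, tide_m "
            "FROM observations WHERE surf_break_id = %s ORDER BY observed_at",
            (break_id,),
        )
        return cur.fetchall()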
The / page of your Flask application will be based on a SQL query like SELECT * FROM surf_breaks. Render a link to the break-specific page for each one.
You'll have another page like /breaks/n where n identifies a surf break (an integer that increments as you insert surf break rows is customary). This page will be based on a query like SELECT * FROM observations WHERE surf_break_id = n. In each case, you'll call functions in your Data Access Layer for a list of rows, and then in a template, iterate through those rows and render some HTML. There are various Javascript and Python graphing libraries you can feed this list of rows into and get graphs out of (client side or server side). If you're interested in something like a week-over-week change, you should be able to express that in one SQL query and get that dataset directly from the database engine.
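Tying it together, a rough Flask sketch over that hypothetical dal module (template names and fields are placeholders):

from flask import Flask, render_template

import dal  # the hypothetical data access layer module sketched above

app = Flask(__name__)

@app.route("/")
def index():
    breaks = dal.list_breaks()  # [(id, name), ...]
    return render_template("index.html", breaks=breaks)

@app.route("/breaks/<int:break_id>")
def break_detail(break_id):
    observations = dal.observations_for_break(break_id)
    return render_template("break.html", observations=observations)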
For performance, try not to get into a situation where more than one SQL query happens during a page load. By default, you'll be doing some unnecessary work by going back to the database and recomputing the page every time someone requests it. If this becomes a problem, you can add a reverse proxy cache in front of your Flask app. In your case this is easy, since nothing users do to the app causes its content to change. Simply invalidate the cache when you import new data.
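If you do put a reverse proxy cache (nginx, Varnish) in front of the app, one simple way to cooperate with it is to set cache headers on responses; a sketch along the lines of the Flask app above:

from flask import Flask

app = Flask(__name__)  # or reuse the app from the previous sketch

@app.after_request
def add_cache_headers(response):
    # let the reverse proxy cache pages for an hour; have the import script
    # purge/invalidate the proxy cache after loading new observations
    response.headers["Cache-Control"] = "public, max-age=3600"
    return response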
Let's assume I am developing a service that provides a user with articles. Users can favourite articles and I am using Solr to store these articles for search purposes.
However, when the user adds an article to their favourites list, I would like to be able to figure out which articles the user has added to favourites so that I can highlight the favourite button.
I am thinking of two approaches:
Fetch articles from Solr and then loop through each article to fetch the "favourite-status" of this article for this specific user from MySQL.
Whenever a user favourites an article, add this user's ID to a multi-valued column in Solr and check whether the ID of the current user is in this column or not.
I don't know the capacity of the multivalued column... and I also don't think the second approach would be "good practice" (saving user-related data in the index).
What other options do I have, if any? Is approach 2 a correct approach?
I'd go with a modified version of the first one - for now, it keeps user-specific data that's not going to be used for search out of the index (although if you foresee a case where you want to search within favourited articles, it would probably be an interesting field to have in the index). For just display purposes, like in this case, I'd take all the IDs returned from Solr, fetch them in one SQL statement from the database, and then set the UI values depending on that. It's a fast and easy solution.
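A rough sketch of that flow in Python, using pysolr for the search call and a single parameterized MySQL query; the core URL, table and column names are placeholders, and the documents are assumed to have id and title fields:

import pysolr
import pymysql

solr = pysolr.Solr("http://localhost:8983/solr/articles/")  # placeholder core URL
db = pymysql.connect(host="localhost", user="app", password="secret", database="app")

def search_with_favourites(query, user_id):
    results = solr.search(query, rows=20)
    article_ids = [doc["id"] for doc in results]
    if not article_ids:
        return []

    # one query for all returned articles: which of them has this user favourited?
    placeholders = ", ".join(["%s"] * len(article_ids))
    sql = (
        "SELECT article_id FROM favourites "
        f"WHERE user_id = %s AND article_id IN ({placeholders})"
    )
    with db.cursor() as cur:
        cur.execute(sql, [user_id, *article_ids])
        favourited = {row[0] for row in cur.fetchall()}

    return [
        {"id": doc["id"], "title": doc.get("title"), "favourited": doc["id"] in favourited}
        for doc in results
    ]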
If you foresee that "search only in my fav'd articles" as a use case, I would try to get that information into the index as well (or other filter applications against whether a specific user has added the field as a favourite). I'd try to avoid indexing anything more than the user id that fav'd the article in that case.
Both solutions would however work, although the latter would require more code - and the required response from Solr could grow large if a large number of users fav's an article, so I'd try to avoid having to return a set of userid's if that's the case (many fav's for a single article).
I am having a baby soon and I want to give him a unique/relatively little-known name from my country. I want to get all the names on Facebook for a given country (say India) and then find the 1000 least common names. I am not able to determine whether the Facebook API allows me to do this. Can someone suggest which APIs I should look at?
If it is not possible in FB, is it possible in any other social network?
Thanks.
The Graph API. Although I think the Graph API takes a user as a reference and then searches only within their friends, or, if they have a page, only within their followers. Users who are not connected to that user cannot be accessed. I've never seen a function which can return all users or their user IDs.
Edit:-
OK, I've found that you might need the Open Graph API and the Action Types, but there's no Action Type for country.
This isn't possible. The closest you can do is an FQL query on the user table:
SELECT name FROM user WHERE contains('user763410baby')
Assembla provides a simple way to fetch all commits of an organisation using api.assembla.com/v1/activity.json, which takes to and from parameters, allowing you to get commits for a selected date range from all the spaces (repos) the user is participating in.
Is there any similar way in GitHub?
I found these for GitHub:
/repos/:owner/:repo/commits
Accepts since and until parameters for getting commits in a selected date range. But since I want commits from all repos, I would have to loop over all those repos and fetch commits for each one.
/users/:user/events
This shows a user's events, including their commits. I don't have any problem looping over all the users in the org, but how can I get them for a particular date?
/orgs/:org/events
This shows commits of all users across all repos, but I don't know how to fetch them for a particular date.
The problem with using the /users/:user/events endpoint is that you don't get just the PushEvents; you would have to skip over non-commit events and perform more calls to the API. Assuming you're authenticated, you should be safe in terms of rate limits so long as your users aren't hyperactive.
For /orgs/:org/events I don't think they accept parameters for anything, but I can check with the API designers.
And just in case you aren't familiar, these are all paginated results, so you can go back until the beginning with the Link headers. My library (github3.py) provides iterators to do this for you automatically. You can also tell it how many events you'd like. (Same with commits, etc.) But yeah, I'll come back and edit after talking to the API guys at GitHub.
Edit: Conversation
You might want to check out the GitHub Archive project -- http://www.githubarchive.org/, and the ability to query the archive using Google's BigQuery. Sounds like it would be a perfect tool for the job -- I'm pretty sure you could get exactly what you want with a single query.
The other option is to call the GitHub API -- iterate over all events for the organization and filter out the ones that don't satisfy your date range criteria and event type criteria (commits). But since you can't specify date ranges in the API call, you will probably make a lot of calls to get the events that interest you. Notice that you don't have to iterate over every page starting from 0 to find the page that contains the first result in the date range -- just do a (variation of) binary search over page numbers to find any page that contains a commit in the date range, and then iterate in both directions until you break out of the date range. That should reduce the number of API calls you make.
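For the iterate-and-filter approach, a simple sketch using requests against the /orgs/:org/events endpoint (without the binary-search optimisation; the org name and token are placeholders, and note that the events feed only covers fairly recent activity):

from datetime import datetime, timezone

import requests

TOKEN = "ghp_xxx"   # placeholder personal access token
ORG = "my-org"      # placeholder organisation name
SINCE = datetime(2017, 7, 1, tzinfo=timezone.utc)
UNTIL = datetime(2017, 8, 1, tzinfo=timezone.utc)

url = f"https://api.github.com/orgs/{ORG}/events"
headers = {"Authorization": f"token {TOKEN}"}

commits = []
while url:
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()
    for event in resp.json():
        created = datetime.fromisoformat(event["created_at"].replace("Z", "+00:00"))
        if created < SINCE:
            url = None  # events come newest first, so everything after this is out of range
            break
        if event["type"] == "PushEvent" and created < UNTIL:
            for commit in event["payload"].get("commits", []):
                commits.append((event["repo"]["name"], created, commit["sha"], commit["message"]))
    else:
        # follow pagination via the Link header until there are no more pages
        url = resp.links.get("next", {}).get("url")

print(len(commits), "commits found")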