I'm trying to use the streaming insert_all method to insert data into a table using the google-api-client gem in Ruby.
So I start by creating a new table in BigQuery (read and write privileges are correct)
with the following contents:
+-----+-----------+-------------+
| Row | person_id | person_name |
+-----+-----------+-------------+
| 1   | 1         | ABCD        |
| 2   | 2         | EFGH        |
| 3   | 3         | IJKL        |
+-----+-----------+-------------+
This is my code in Ruby. (I discovered earlier today that tabledata.insert_all is the Ruby equivalent of tabledata.insertAll - the Google docs / examples need to be updated.)
def streaming_insert_data_in_table(table, dataset=DATASET)
  # stream two new rows into the table
  body = {"rows" => [
    {"json" => {"person_id" => 10, "person_name" => "george"}},
    {"json" => {"person_id" => 11, "person_name" => "washington"}}
  ]}
  result = @client.execute(
    :api_method => @bigquery.tabledata.insert_all,
    :parameters => {
      :projectId => @project_id.to_s,
      :datasetId => dataset,
      :tableId => table
    },
    :body_object => body
  )
  puts result.body
end
So I run my code the first time and all appears fine. I see this in the table in BigQuery:
+-----+-----------+-------------+
| Row | person_id | person_name |
+-----+-----------+-------------+
| 1   | 1         | ABCD        |
| 2   | 2         | EFGH        |
| 3   | 3         | IJKL        |
| 4   | 10        | george      |
| 5   | 11        | washington  |
+-----+-----------+-------------+
Then I change the data in the method to:
body = {"rows"=>[
{"json"=> {"person_id"=>5,"person_name"=>"john"}},
{"json"=> {"person_id"=>6,"person_name"=>"kennedy"}}
]}
I run the method and get this in BigQuery:
+-----+-----------+-------------+
| Row | person_id | person_name |
+-----+-----------+-------------+
| 1   | 1         | ABCD        |
| 2   | 2         | EFGH        |
| 3   | 3         | IJKL        |
| 4   | 10        | george      |
| 5   | 6         | kennedy     |
+-----+-----------+-------------+
So, what gives? I've lost data... (ids 11 and 5 have vanished). The responses for the requests do not contain errors either.
Could someone tell me if I'm doing something incorrectly, or why this is happening, please?
Any help is much appreciated.
Thanks and have a great day.
I discovered this appears to be something to do with the UI (the row count doesn't populate for a while, and trying to extract the data in the table results in the error "Unexpected. Please try again."). However, the data is actually stored and can be queried. Thanks for the help, Jordan.
Related
Inside my application I have a dataframe that looks similar to this:
Example:
id | address | code_a | code_b | code_c | more columns
1 | parkdrive 1 | 012ah8 | 012ah8a | 1345wqdwqe | ....
2 | parkdrive 1 | 012ah8 | 012ah8a | dwqd4646 | ....
3 | parkdrive 2 | 852fhz | 852fhza | fewf6465 | ....
4 | parkdrive 3 | 456se1 | 456se1a | 856fewf13 | ....
5 | parkdrive 3 | 456se1 | 456se1a | gth8596s | ....
6 | parkdrive 3 | 456se1 | 456se1a | a48qsgg | ....
7 | parkdrive 4 | tg8596 | tg8596a | 134568a | ....
As you can see, every address can have multiple entries in my dataframe; code_a and code_b follow a certain pattern, and only code_c is unique.
What I'm trying to obtain is a dataframe where the column code_c is ignored or dropped, and the whole dataframe is reduced to only one entry for each address... something like this:
id | address | code_a | code_b | more columns
1 | parkdrive 1 | 012ah8 | 012ah8a | ...
3 | parkdrive 2 | 852fhz | 852fhza | ...
4 | parkdrive 3 | 456se1 | 456se1a | ...
7 | parkdrive 4 | tg8596 | tg8596a | ...
I tried the groupby function, but this didn't seem to work - or is this even the right function?
Thanks for your help and good day to all of you!
You can use drop_duplicates to do this:
df.drop_duplicates(subset=['address'], inplace=True)
This will keep only a single entry per address.
I think what you are looking for is
# this drops rows that are duplicated across all columns except 'code_c'
df.drop_duplicates(subset=df.columns.difference(['code_c']))

# this drops rows that are duplicated based ONLY on the column 'address'
df.drop_duplicates(subset='address')
I notice that in your example data, if you drop code_c then all the entries with address "parkdrive 1", for example, are just duplicates.
You should drop the column code_c:
df.drop('code_c', axis=1, inplace=True)
Then you can drop the duplicates:
df_clean = df.drop_duplicates()
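For reference, here is a minimal runnable sketch of this approach, assuming id is the index of the frame rather than a regular column (if id were a regular column, its unique values would stop the rows from being flagged as duplicates):

import pandas as pd

# a small frame shaped like the example data above
df = pd.DataFrame(
    {
        "address": ["parkdrive 1", "parkdrive 1", "parkdrive 2"],
        "code_a": ["012ah8", "012ah8", "852fhz"],
        "code_b": ["012ah8a", "012ah8a", "852fhza"],
        "code_c": ["1345wqdwqe", "dwqd4646", "fewf6465"],
    },
    index=pd.Index([1, 2, 3], name="id"),
)

# drop code_c, then drop the rows that became duplicates
df_clean = df.drop("code_c", axis=1).drop_duplicates()
print(df_clean)
#         address  code_a   code_b
# id
# 1   parkdrive 1  012ah8  012ah8a
# 3   parkdrive 2  852fhz  852fhza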
I'm trying to find a trend in attendance. I filtered my existing df to this so I can look at one activity at a time.
+---+-----------+-------+----------+-------+---------+
| | Date | Org | Activity | Hours | Weekday |
+---+-----------+-------+----------+-------+---------+
| 0 | 8/3/2020 | Org 1 | Gen Ab | 10.5 | Monday |
| 1 | 8/25/2020 | Org 1 | Gen Ab | 2 | Tuesday |
| 3 | 8/31/2020 | Org 1 | Gen Ab | 8.5 | Monday |
| 7 | 8/10/2020 | Org 2 | Gen Ab | 1 | Monday |
| 8 | 8/14/2020 | Org 3 | Gen Ab | 3.5 | Friday |
+---+-----------+-------+----------+-------+---------+
This code:
gen_ab = att_df.loc[att_df['Activity'] == "Gen Ab"]
sum_gen_ab = gen_ab.groupby(['Date', 'Activity']).sum()
sum_gen_ab.head()
Returns this:
+------------+----------+------------+
| | | Hours |
+------------+----------+------------+
| Date | Activity | |
| 06/01/2020 | Gen Ab | 347.250000 |
| 06/02/2020 | Gen Ab | 286.266667 |
| 06/03/2020 | Gen Ab | 169.583333 |
| 06/04/2020 | Gen Ab | 312.633333 |
| 06/05/2020 | Gen Ab | 317.566667 |
+------------+----------+------------+
How do I make the summed column name 'Hours'? I still get the same result when I do this:
sum_gen_ab['Hours'] = gen_ab.groupby(['Date', 'Activity']).sum()
What I eventually want to do is have a line graph that shows the sum of hours for the activity over time. The time of course would be the dates in my df.
plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()
returns KeyError: Date
Once you've used groupby(['Date', 'Activity']), Date and Activity become index levels and can no longer be referenced as columns with sum_gen_ab['Date'].
To keep them as regular columns, you can use groupby(['Date', 'Activity'], as_index=False) instead, as shown below.
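For example, a minimal sketch assuming gen_ab is the filtered frame from the question (selecting the Hours column before summing avoids trying to sum the non-numeric Org and Weekday columns):

import matplotlib.pyplot as plt

# Date and Activity stay as regular columns thanks to as_index=False
sum_gen_ab = gen_ab.groupby(['Date', 'Activity'], as_index=False)['Hours'].sum()

plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()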
I typically use the pandasql library to manipulate my data frames into different datasets. It allows you to manipulate a pandas data frame with SQL code, and it can be used alongside pandas.
EXAMPLE:
import pandas as pd
import pandasql as psql

df = ...  # will be your dataset

new_dataset = psql.sqldf('''
    SELECT DATE, ACTIVITY, SUM(HOURS) AS SUM_OF_HOURS
    FROM df
    GROUP BY DATE, ACTIVITY''')

new_dataset.head()  # shows the first 5 rows of your dataset
I have a dataframe that looks like this below with Date, Price and Serial.
+----------+--------+--------+
| Date | Price | Serial |
+----------+--------+--------+
| 2/1/1996 | 0.5909 | 1 |
| 2/1/1996 | 0.5711 | 2 |
| 2/1/1996 | 0.5845 | 3 |
| 3/1/1996 | 0.5874 | 1 |
| 3/1/1996 | 0.5695 | 2 |
| 3/1/1996 | 0.584 | 3 |
+----------+--------+--------+
I would like to make it look like this, where the Serial values become the column names and the data sorts itself into the correct Date row and Serial column:
+----------+--------+--------+--------+
| Date | 1 | 2 | 3 |
+----------+--------+--------+--------+
| 2/1/1996 | 0.5909 | 0.5711 | 0.5845 |
| 3/1/1996 | 0.5874 | 0.5695 | 0.584 |
+----------+--------+--------+--------+
I understand I can do this via a loop, but I'm just wondering if there is a more efficient way to do it?
Thanks for your kind help. I'm also curious if there is a better way to paste such tables rather than attaching images in my questions. =x
You can use pandas.pivot_table:
import numpy as np

res = df.pivot_table(index='Date', columns='Serial', values='Price', aggfunc=np.sum)\
        .reset_index()
res.columns.name = ''
Date 1 2 3
0 2/1/1996 0.5909 0.5711 0.5845
1 3/1/1996 0.5874 0.5695 0.5840
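Side note: since each (Date, Serial) pair occurs only once in the sample data, a plain pivot with no aggregation is also worth sketching (it raises an error if duplicate pairs exist, so this assumes they don't):

res = df.pivot(index='Date', columns='Serial', values='Price').reset_index()
res.columns.name = ''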
Conversions
| user_id | tag    | timestamp           |
|---------|--------|---------------------|
| 1       | click1 | 2016-11-01 01:20:39 |
| 2       | click2 | 2016-11-01 09:48:10 |
| 3       | click1 | 2016-11-04 14:27:22 |
| 4       | click4 | 2016-11-05 17:50:14 |
User Sessions
| user_id | utm_campaign | session_start       |
|---------|--------------|---------------------|
| 1       | outbrain_2   | 2016-11-01 00:15:34 |
| 1       | email        | 2016-11-01 01:00:29 |
| 2       | google_1     | 2016-11-01 08:24:39 |
| 3       | google_4     | 2016-11-04 14:25:06 |
| 4       | google_1     | 2016-11-05 17:43:02 |
Given the 2 tables above, I want to map each conversion event to the most recent campaign that brought a particular user to a site (aka last touch/last click attribution).
The desired output is a table of the format:
| user_id | tag    | timestamp           | campaign |
|---------|--------|---------------------|----------|
| 1       | click1 | 2016-11-01 01:20:39 | email    |
| 2       | click2 | 2016-11-01 09:48:10 | google_1 |
| 3       | click1 | 2016-11-04 14:27:22 | google_4 |
| 4       | click4 | 2016-11-05 17:50:14 | google_1 |
Note how user 1 visited the site via the outbrain_2 campaign and then came back to the site via the email campaign. Sometime during the user's second visit, they converted, thus the conversion should be attributed to email and not outbrain_2.
Is there a way to do this in MySQL or Python?
You can do this in Python with Pandas. I assume you can load the data from the MySQL tables into Pandas dataframes conversions and sessions. First, concatenate both tables:
import pandas as pd
import numpy as np

all = pd.concat([conversions, sessions])
Some of the elements in the new frame will be NAs. Create a new column that collects the time stamps from both tables:
all["ts"] = np.where(all["session_start"].isnull(),
all["timestamp"],
all["session_start"])
Sort by this column, forward fill the time values, group by the user ID, and select the last (most recent) row from each group:
groups = all.sort_values("ts").ffill().groupby("user_id",as_index=False).last()
Select the right columns:
result = groups[["user_id", "tag", "timestamp", "utm_campaign"]]
I tried this code with your sample data and got the right answer.
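If the timestamp columns are parsed as datetimes, a merge_asof-based alternative is also worth sketching (this is not part of the answer above; it matches each conversion directly to the most recent earlier session of the same user):

import pandas as pd

# both frames must be sorted by their time keys for merge_asof
conversions = conversions.sort_values("timestamp")
sessions = sessions.sort_values("session_start")

# for each conversion, take the latest session of the same user
# whose session_start is at or before the conversion timestamp
attributed = pd.merge_asof(
    conversions,
    sessions.rename(columns={"utm_campaign": "campaign"}),
    left_on="timestamp",
    right_on="session_start",
    by="user_id",
    direction="backward",
)

result = attributed[["user_id", "tag", "timestamp", "campaign"]]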
Given the following data frame
+-----+----------------+--------+---------+
| | A | B | C |
+-----+----------------+--------+---------+
| 0 | hello#me.com | 2.0 | Hello |
| 1 | you#you.com | 3.0 | World |
| 2 | us#world.com | hi | holiday |
+-----+----------------+--------+---------+
How can I get all the rows where re.compile(r'[Hh](i|ello)') would match in a cell? That is, from the above example, I would like to get the following output:
+-----+----------------+--------+---------+
| | A | B | C |
+-----+----------------+--------+---------+
| 0 | hello#me.com | 2.0 | Hello |
| 2 | us#world.com | hi | holiday |
+-----+----------------+--------+---------+
I am not able to find a solution for this. Any help would be very much appreciated.
Using stack to avoid apply
df.loc[df.stack().str.match(r'[Hh](i|ello)').unstack().any(1)]
Using match generates a FutureWarning. The warning is consistent with what we are doing, so that's fine. However, findall accomplishes the same thing:
df.loc[df.stack().str.findall(r'[Hh](i|ello)').unstack().any(1)]
You can use the findall function, which takes regular expressions:
msk = df.apply(lambda x: x.str.findall(r'[Hh](i|ello)')).any(axis=1)
df[msk]
+-----+----------------+--------+---------+
|     | A              | B      | C       |
+-----+----------------+--------+---------+
| 0   | hello#me.com   | 2      | Hello   |
| 2   | us#world.com   | hi     | holiday |
+-----+----------------+--------+---------+
any(axis=1) will check if any of the columns in a given row are true. So msk is a single column of True/False values indicating whether or not the regular expression was found in that row.
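As a side note (not from the answers above), a similar mask can be built with str.contains after coercing every cell to a string; the non-capturing group avoids the "match groups" warning that contains would otherwise emit:

# coerce all cells to strings, then test the pattern column by column
msk = df.astype(str).apply(lambda col: col.str.contains(r'[Hh](?:i|ello)')).any(axis=1)
df[msk]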