I am facing a serious blocker on a project of mine.
Here is a summary of what I would like to do:
I have a big hourly file (10 GB) with the following extract (no header):
ID_A|segment_1,segment_2
ID_B|segment_2,segment_3,segment_4,segment_5
ID_C|segment_1
ID_D|segment_2,segment_4
Every ID (from A to D) can be linked to one or multiple segments (from 1 to 5).
I would like to process this file in order to get the following result (the result file contains a header):
ID|segment_1|segment_2|segment_3|segment_4|segment_5
ID_A|1|1|0|0|0
ID_B|0|1|1|1|1
ID_C|1|0|0|0|0
ID_D|0|1|0|1|0
1 means that the ID is included in the segment, 0 means that it is not.
I can clearly do this task with a Python script using multiple loops and conditions; however, I need a fast script that can do the same work.
I would like to use BigQuery to perform this operation.
Is it possible to do such a task in BigQuery?
How can it be done?
Thanks to all for your help.
Regards
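For what it's worth, outside BigQuery this reshaping can also be done in pandas without explicit loops (a 10 GB file would need to be read in chunks, though). A minimal sketch, assuming pipe-delimited input and hypothetical file and column names:
import pandas as pd

# Read the pipe-delimited file (no header); the file name and column names are assumptions
df = pd.read_csv("hourly_file.txt", sep="|", header=None, names=["id", "segments"])

# One 0/1 indicator column per distinct segment value found in the file
flags = df["segments"].str.get_dummies(sep=",")
result = pd.concat([df[["id"]], flags], axis=1)
result.to_csv("result.txt", sep="|", index=False)
str.get_dummies creates the indicator columns in sorted order, so segment_1 through segment_5 come out in the order shown in the expected result.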
Let me assume that the file is loaded into a BQ table with an id column and a segments column (which is a string). Then I would recommend storing the result values as an array, but that is not your question.
You can use the following select to create the table:
select id,
       countif(segment = 'segment_1') as segment_1,
       countif(segment = 'segment_2') as segment_2,
       countif(segment = 'segment_3') as segment_3,
       countif(segment = 'segment_4') as segment_4,
       countif(segment = 'segment_5') as segment_5
from staging s cross join
     unnest(split(segments, ',')) as segment
group by id;
Below is for BigQuery Standard SQL
#standardSQL
SELECT id,
IF('segment_1' IN UNNEST(list), 1, 0) AS segment_1,
IF('segment_2' IN UNNEST(list), 1, 0) AS segment_2,
IF('segment_3' IN UNNEST(list), 1, 0) AS segment_3,
IF('segment_4' IN UNNEST(list), 1, 0) AS segment_4,
IF('segment_5' IN UNNEST(list), 1, 0) AS segment_5
FROM `project.dataset.table`,
UNNEST([STRUCT(SPLIT(segments) AS list)])
The above assumes you have your data in a table like in the CTE below:
WITH `project.dataset.table` AS (
SELECT 'ID_A' id, 'segment_1,segment_2' segments UNION ALL
SELECT 'ID_B', 'segment_2,segment_3,segment_4,segment_5' UNION ALL
SELECT 'ID_C', 'segment_1' UNION ALL
SELECT 'ID_D', 'segment_2,segment_4'
)
If you apply the above query to that data, the result will be:
Row id segment_1 segment_2 segment_3 segment_4 segment_5
1 ID_A 1 1 0 0 0
2 ID_B 0 1 1 1 1
3 ID_C 1 0 0 0 0
4 ID_D 0 1 0 1 0
Call_ID  UUID  Intent_Product
A        123   Loan_BankAccount
A        234   StopCheque
A        789   Request_Agent_phone_number
B        900   Loan_BankAccount
B        787   Request_Agent_BankAcc
I have the above table, where "Call_ID" identifies a call that has been made, "UUID" is a unique key for a turn within that call (so call A can have multiple turns, here 123, 234 and 789), and "Intent_Product" is the description of the query.
The expected output is:
Intent_Product  Resolved_Count  Contained_Turns  Contained_Calls
Loan_BankAcc    2               1                0.5
Stop_Cheque     1               0                0
Conditions:
Resolution_Count: the total number of queries that have been resolved (here, for example, "Loan_BankAccount" = 2 and "StopCheque" = 1). Rows where "Intent_Product" is like "Request_Agent" have to be ignored, as those are not resolved.
Contained_Turns: the total number of queries that have been contained, ignoring queries whose successor has an "Intent_Product" like "Request_Agent" (for example, here the containment count for "Loan_BankAccount" = 1 and "StopCheque" = 0).
Contained_Calls: this equals (Contained_Turns) / (Resolution_Count).
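For reference, here is a minimal pandas sketch of these rules (assuming the turns appear in order within each call, and that a turn counts as resolved whenever its own Intent_Product is not a Request_Agent intent):
import pandas as pd

# Hypothetical reconstruction of the turns table above
turns = pd.DataFrame({
    "Call_ID": ["A", "A", "A", "B", "B"],
    "UUID": [123, 234, 789, 900, 787],
    "Intent_Product": ["Loan_BankAccount", "StopCheque",
                       "Request_Agent_phone_number",
                       "Loan_BankAccount", "Request_Agent_BankAcc"],
})

# Successor intent within the same call (row order taken as turn order)
turns["next_intent"] = turns.groupby("Call_ID")["Intent_Product"].shift(-1)

# Resolved turns: intents that are not themselves Request_Agent intents
resolved = turns[~turns["Intent_Product"].str.startswith("Request_Agent")].copy()

# Contained turns: resolved turns whose successor is not a Request_Agent intent
resolved["contained"] = ~resolved["next_intent"].fillna("").str.startswith("Request_Agent")

summary = (resolved.groupby("Intent_Product")
                   .agg(Resolved_Count=("Intent_Product", "size"),
                        Contained_Turns=("contained", "sum"))
                   .reset_index())
summary["Contained_Calls"] = summary["Contained_Turns"] / summary["Resolved_Count"]
print(summary)
On the sample data this reproduces the counts in the expected output (2 / 1 / 0.5 for Loan_BankAccount and 1 / 0 / 0 for StopCheque).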
I'm trying to wrap my head around the pyflink datastream api.
My use case is the following:
The source is a kinesis datastream consisting of the following:
cookie
cluster
dim0
dim1
dim2
time_event
1
1
5
5
5
1min
1
2
1
0
6
30min
2
1
1
2
3
45min
1
1
10
10
15
70min
2
1
5
5
10
120min
I want to create a session window aggregation with a gap of 60 minutes, calculating the mean for each cookie-cluster combination. The window assignment should be based on the cookie, the aggregation based on cookie and cluster.
The result would therefore be like this (each row being forwarded immediately):
cookie  cluster  dim0  dim1  dim2  time_event
1       1        5     5     5     1min
1       2        1     0     6     30min
2       1        1     2     3     45min
1       1        7.5   7.5   10    70min
2       1        5     5     10    120min
Expressed in SQL, for a new record I'd like to perform this aggregation:
INSERT INTO `input` (`cookie`, `cluster`, `dim0`, `dim1`, `dim2`, `time_event`) VALUES
("1", "1", 0, 0, 0, 125)
WITH RECURSIVE by_key AS (
SELECT *,
(time_event - lag(time_event) over (partition by cookie order by time_event)) as "time_passed"
FROM input
WHERE cookie = "1"
),
new_session AS (
SELECT *,
CASE WHEN time_passed > 60 THEN 1 ELSE 0 END as "new_session"
FROM by_key),
by_session AS (
SELECT *, SUM(new_session) OVER(partition by cookie order by time_event) as "session_number"
FROM new_session)
SELECT cookie, cluster, avg(dim0), avg(dim1), avg(dim2), max(time_event)
FROM by_session
WHERE cluster = "1"
GROUP BY session_number
ORDER BY session_number DESC
LIMIT 1
I tried to accomplish this with the Table API, but I need the results to be updated as soon as a new record is added to a cookie-cluster combination. This is my first project with Flink, and the DataStream API is an entirely different beast, especially since a lot of stuff is not included yet for Python.
My current approach looks like this:
Create a table from the Kinesis data stream (the DataStream API has no Kinesis connector).
Convert it to a data stream to perform the aggregation. From what I've read, watermarks are propagated and the resulting Row objects contain the column names, i.e. I can handle them like a Python dictionary. Please correct me if I'm wrong on this.
Key the data stream by the cookie.
Window with a custom SessionWindowAssigner, borrowing from the Table API. I'm working on a separate post on that.
Process the windows by calculating the mean for each cluster
table_env = StreamTableEnvironment.create(stream_env, environment_settings=env_settings)
table_env.execute_sql(
    create_table(input_table_name, input_stream, input_region, stream_initpos)
)
ds = table_env.to_append_stream(input_table_name)
ds.key_by(lambda r: r["cookie"]) \
    .window(SessionWindowAssigner(session_gap=60, is_event_time=True)) \
    .trigger(OnElementTrigger()) \
    .process(MeanWindowProcessFunction())
My basic idea for the ProcessWindowFunction would go like this:
from typing import Dict, Iterable

class MeanWindowProcessFunction(ProcessWindowFunction[Dict, Dict, str, TimeWindow]):

    def process(self,
                key: str,
                context: 'ProcessWindowFunction.Context',
                elements: Iterable[Dict]) -> Iterable[Dict]:
        clusters = {}         # running sums per cluster
        cluster_records = {}  # record counts per cluster
        for element in elements:
            if element["cluster"] not in clusters:
                clusters[element["cluster"]] = {k: v for k, v in element.as_dict().items()}
                cluster_records[element["cluster"]] = 1
            else:
                for dim in range(3):
                    clusters[element["cluster"]][f"dim{dim}"] += element[f"dim{dim}"]
                clusters[element["cluster"]]["time_event"] = element["time_event"]
                cluster_records[element["cluster"]] += 1
        # turn the per-cluster sums into means
        for cluster in clusters.keys():
            for dim in range(3):
                clusters[cluster][f"dim{dim}"] /= cluster_records[cluster]
        return clusters.values()

    def clear(self, context: 'ProcessWindowFunction.Context') -> None:
        pass
Is this the right approach for this problem?
Do I need to consider anything else for the ProcessWindowFunction, like actually implementing the clear method?
I'd be very grateful for any help, or any more elaborate examples of windowed analytics applications in pyflink. Thank you!
I have two tables: one has FROM_SERIAL, TO_SERIAL and TRANSACTION_DATE, and the other has SERIAL_NO and ACTIVATION_DATE. I want to merge the two tables within a particular range.
Example:
First Table
FROM_SERIAL TO_SERIAL TRANSACTION_DATE
10003000100 10003000500 22-APR-19
10003001100 10003001300 25-MAY-19
10005002001 10005002500 30-AUG-19
Second Table
SERIAL_NO ACTIVATION_DATE
10003000150 30-APR-19
10005002300 01-OCT-19
Expecting Table
FROM_SERIAL TO_SERIAL SERIAL_NO ACTIVATION_DATE
10003000100 10003000500 10003000150 30-APR-19
10005002001 10005002500 10005002300 01-OCT-19
I want to merge both tables in the above scenario.
The code may be Oracle or Python, it doesn't matter.
Pandas solution:
# cross join via a constant key, then keep rows where SERIAL_NO falls inside the range
df = df1.assign(a=1).merge(df2.assign(a=1), on='a', how='outer')
df = df[df['SERIAL_NO'].between(df['FROM_SERIAL'], df['TO_SERIAL'])]
df = df.drop(['a', 'TRANSACTION_DATE'], axis=1)
print (df)
FROM_SERIAL TO_SERIAL SERIAL_NO ACTIVATION_DATE
0 10003000100 10003000500 10003000150 30-APR-19
5 10005002001 10005002500 10005002300 01-OCT-19
But for large data an Oracle solution is better.
Consider:
SELECT
t1.from_serial,
t1.to_serial,
t2.serial_no,
t2.activation_date
FROM table1 t1
INNER JOIN table2 t2
ON t2.serial_no >= t1.from_serial AND t2.serial_no < t1.to_serial
You may adjust the inequalities as you wish. Beware that, if a given serial_no in table2 belongs to more than one range in table1, they will all match and you will get duplicated table1 records in the result set.
Join with between.
SQL> with
2 tfirst (from_serial, to_serial) as
3 (select 3000100, 3000500 from dual union all
4 select 3001100, 3001300 from dual union all
5 select 5002001, 5002500 from dual
6 ),
7 tsecond (serial_no, activation_date) as
8 (select 3000150, date '2019-04-30' from dual union all
9 select 5002300, date '2019-10-01' from dual
10 )
11 select a.from_serial,
12 a.to_serial,
13 b.serial_no,
14 b.activation_date
15 from tfirst a join tsecond b on b.serial_no between a.from_serial and a.to_serial;
FROM_SERIAL TO_SERIAL SERIAL_NO ACTIVATION
----------- ---------- ---------- ----------
5002001 5002500 5002300 01.10.2019
3000100 3000500 3000150 30.04.2019
SQL>
You can also do it using NumPy's broadcasting feature, as below. Explanation is in the comments.
import numpy as np
import pandas as pd

df1 = pd.DataFrame([('10003000100', '10003000500', '22-APR-19'), ('10003001100', '10003001300', '25-MAY-19'), ('10005002001', '10005002500', '30-AUG-19')], columns=('FROM_SERIAL', 'TO_SERIAL', 'TRANSACTION_DATE'))
df2 = pd.DataFrame([('10003000150', '30-APR-19'), ('10005002300', '01-OCT-19')], columns=('SERIAL_NO', 'ACTIVATION_DATE'))

## 'df1[["FROM_SERIAL"]].values' is a column vector of size m and 'df2["SERIAL_NO"].values' is a row
## vector of size n, so broadcasting gives an array of shape (m, n) which is
## the result of comparing each pair of m and n
compare = (df1[["FROM_SERIAL"]].values < df2["SERIAL_NO"].values) & (df1[["TO_SERIAL"]].values > df2["SERIAL_NO"].values)
## flat positions of the matching (df1 row, df2 row) pairs
mask = np.arange(len(df1)*len(df2)).reshape(-1, len(df2))[compare]
## df1 row index is mask // len(df2), df2 row index is mask % len(df2)
pd.concat([df1.iloc[mask//len(df2)].reset_index(drop=True), df2.iloc[mask%len(df2)].reset_index(drop=True)], axis=1, sort=False)
Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I am having problems with and what I have tried already.
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators; however, I need to make use of LESS THAN and MORE THAN.
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
SNP DataFrame (snp_df):
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
Gene DataFrame (gene_df):
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
This dataframe shows the locations of all the genes of interest.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
I have tried to tackle this by looking at either chromosomes or genes separately. There are 22 different chromosome values (ints 1-22) as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasql module, while the other approach is to loop through the separate genes.
SQL method
import pandas as pd
import pandasql as psql
pysqldf = lambda q: psql.sqldf(q, globals())

q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""

all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    genic_snps = pysqldf(q)
    all_dfs.append(genic_snps)

all_genic_snps = pd.concat(all_dfs)
Gene iteration method
all_dfs = []
for line in gene_df.iterrows():
    info = line[1]  # Getting the Series object
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
                          (snp_df['BP'] >= info['chr_start']) & (snp_df['BP'] <= info['chr_stop'])]
    if this_snp.shape[0] != 0:
        this_snp = this_snp[['SNP']]
        this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
        all_dfs.append(this_snp)

all_genic_snps = pd.concat(all_dfs)
Can anyone give any suggestions of a more effective way of doing this?
I've just thought of a way to solve this - by combining my two methods:
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This also doesn't have to make use of any SQL queries either. I've also included a section to immediately identify any redundant genes that don't have any SNPs that fall within their range. This makes use of a double for-loop which I normally try to avoid - but in this case it works quite well.
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    # Getting rid of redundant genes
    min_bp = this_chr_snp['BP'].min()
    max_bp = this_chr_snp['BP'].max()
    this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
                                ~(this_genes['chr_stop'] <= min_bp)]
    for line in this_genes.iterrows():
        info = line[1]
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                    (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)

all_genic_snps = pd.concat(all_dfs)
While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.
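One further idea that might speed this up (a sketch, not benchmarked on your data): since the SNP positions within a chromosome can be sorted once, the SNPs falling inside each gene can be located with two binary searches via numpy.searchsorted instead of a boolean mask over all of the chromosome's SNPs:
import numpy as np
import pandas as pd

all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    chr_snps = snp_df.loc[snp_df['chromosome'] == chromosome].sort_values('BP')
    bps = chr_snps['BP'].values
    snps = chr_snps['SNP'].values
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    for start, stop, feature in zip(this_genes['chr_start'], this_genes['chr_stop'],
                                    this_genes['feature_id']):
        # index of the first SNP >= chr_start and one past the last SNP <= chr_stop
        lo = np.searchsorted(bps, start, side='left')
        hi = np.searchsorted(bps, stop, side='right')
        if hi > lo:
            all_dfs.append(pd.DataFrame({'SNP': snps[lo:hi], 'feature_id': feature}))

all_genic_snps = pd.concat(all_dfs, ignore_index=True)
Each gene lookup then costs two binary searches rather than a full comparison against every SNP on the chromosome, while still handling overlapping genes correctly.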
You can use the following to accomplish what you're looking for:
merged_df=snp_df.merge(gene_df,on=['chromosome'],how='inner')
merged_df=merged_df[(merged_df.BP>=merged_df.chr_start) & (merged_df.BP<=merged_df.chr_stop)][['SNP','feature_id']]
Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:
snp_df
Out[193]:
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 30400
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
gene_df
Out[194]:
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
merged_df
Out[195]:
SNP feature_id
8 rs3131972 GeneID:100302278
I'm trying to join two dataframes with dates that don't perfectly match up. For a given group/date in the left dataframe, I want to join the corresponding record from the right dataframe with the date just before that of the left dataframe. Probably easiest to show with an example.
df1:
group date teacher
a 1/10/00 1
a 2/27/00 1
b 1/7/00 1
b 4/5/00 1
c 2/9/00 2
c 9/12/00 2
df2:
teacher date hair length
1 1/1/00 4
1 1/5/00 8
1 1/30/00 20
1 3/20/00 100
2 1/1/00 0
2 8/10/00 50
Gives us:
group date teacher hair length
a 1/10/00 1 8
a 2/27/00 1 20
b 1/7/00 1 8
b 4/5/00 1 100
c 2/9/00 2 0
c 9/12/00 2 50
Edit 1:
Hacked together a way to do this. Basically I iterate through every row in df1 and pick out the most recent corresponding entry in df2. It is insanely slow; surely there must be a better way.
One way to do this is to create a new column in the left data frame, which will (for a given row's date) determine the value that is closest and earlier:
df1['join_date'] = df1.date.map(lambda x: df2.date[df2.date <= x].max())
then a regular join or merge between 'join_date' on the left and 'date' on the right will work. You may need to tweak the function to handle Null values or other corner cases.
This is not very efficient (you are searching the right-hand dates over and over). A more efficient approach is to sort both data frames by the dates, iterate through the left-hand data frame, and consume entries from the right hand data frame just until the date is larger:
# Assuming df1 and df2 are sorted by the dates
df1['hair length'] = 0  # initialize
r_generator = df2.iterrows()
_, cur_r_row = next(r_generator)
cur_hair_length = 0  # assume 0 works when df1 has a date earlier than any in df2
for i, l_row in df1.iterrows():
    # consume right-hand rows until their date passes the current left-hand date
    while cur_r_row['date'] <= l_row['date']:
        cur_hair_length = cur_r_row['hair length']
        try:
            _, cur_r_row = next(r_generator)
        except StopIteration:
            break
    df1.loc[i, 'hair length'] = cur_hair_length
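For completeness, recent pandas versions also provide pd.merge_asof, which does exactly this kind of backward as-of join and can keep matches within a teacher via the by argument. A minimal sketch, assuming the date columns have been parsed as datetimes:
import pandas as pd

# both frames must be sorted by the 'on' key; direction='backward' picks the
# latest right-hand date that is on or before each left-hand date, per teacher
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
result = pd.merge_asof(df1.sort_values('date'), df2.sort_values('date'),
                       on='date', by='teacher', direction='backward')
merge_asof keeps every df1 row and leaves hair length as NaN for any teacher/date with no earlier entry in df2.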
Seems like the quickest way to do this is using sqlite via pysqldf:
import pandas as pd
from pandasql import sqldf

def partial_versioned_join(tablea, tableb, tablea_keys, tableb_keys):
    try:
        tablea_group, tablea_date = tablea_keys
        tableb_group, tableb_date = tableb_keys
    except ValueError as e:
        raise ValueError('Need to pass in both a group and date key for both tables') from e

    # Note: can't actually use group here as a field name due to sqlite
    statement = """SELECT a.group, a.{date_a} AS {temp_date}, b.*
                   FROM (SELECT tablea.group, tablea.{date_a}, tablea.{group_a},
                                MAX(tableb.{date_b}) AS tdate
                         FROM tablea
                         JOIN tableb
                            ON tablea.{group_a}=tableb.{group_b}
                           AND tablea.{date_a}>=tableb.{date_b}
                         GROUP BY tablea.{base_id}, tablea.{date_a}, tablea.{group_a}
                        ) AS a
                   JOIN tableb b
                      ON a.{group_a}=b.{group_b}
                     AND a.tdate=b.{date_b};
                """.format(group_a=tablea_group, date_a=tablea_date,
                           group_b=tableb_group, date_b=tableb_date,
                           temp_date='join_date', base_id=base_id)

    # Note: you lose types here for tableb so you may want to save them
    pre_join_tableb = sqldf(statement, locals())

    return pd.merge(tablea, pre_join_tableb, how='inner',
                    left_on=['group'] + tablea_keys,
                    right_on=['group', tableb_group, 'join_date'])