Merge two tables - Python

I have two tables: one has FROM_SERIAL, TO_SERIAL and TRANSACTION_DATE, and the other has SERIAL_NO and ACTIVATION_DATE. I want to merge the two tables wherever a serial falls within a range.
Example:
First Table
FROM_SERIAL TO_SERIAL TRANSACTION_DATE
10003000100 10003000500 22-APR-19
10003001100 10003001300 25-MAY-19
10005002001 10005002500 30-AUG-19
Second Table
SERIAL_NO ACTIVATION_DATE
10003000150 30-APR-19
10005002300 01-OCT-19
Expecting Table
FROM_SERIAL TO_SERIAL SERIAL_NO ACTIVATION_DATE
10003000100 10003000500 10003000150 30-APR-19
10005002001 10005002500 10005002300 01-OCT-19
I want to merge both tables in the above scenario.
The code may be Oracle or Python; it doesn't matter.

Pandas solution:
df = df1.assign(a=1).merge(df2.assign(a=1), on='a', how='outer')
df = df[df['SERIAL_NO'].between(df['FROM_SERIAL'], df['TO_SERIAL'])]
df = df.drop(['a', 'TRANSACTION_DATE'], axis=1)
print(df)
FROM_SERIAL TO_SERIAL SERIAL_NO ACTIVATION_DATE
0 10003000100 10003000500 10003000150 30-APR-19
5 10005002001 10005002500 10005002300 01-OCT-19
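On pandas 1.2 and newer the helper column is not needed, because merge supports a cross join directly. A minimal sketch of the same idea, assuming df1 and df2 hold the two tables above:
import pandas as pd

# cross join every range row with every serial row, then keep the rows
# whose SERIAL_NO falls inside [FROM_SERIAL, TO_SERIAL]
out = df1.merge(df2, how='cross')
out = out[out['SERIAL_NO'].between(out['FROM_SERIAL'], out['TO_SERIAL'])]
out = out.drop(columns=['TRANSACTION_DATE'])
print(out)
Note that between on strings compares lexicographically; it works here because all serials have the same number of digits, but casting to integers first is safer.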
But for large data it is better to use an Oracle solution.

Consider:
SELECT
t1.from_serial,
t1.to_serial,
t2.serial_no,
t2.activation_date
FROM table1 t1
INNER JOIN table2 t2
ON t2.serial_no >= t1.from_serial AND t2.serial_no < t1.to_serial
You may adjust the inequalities as you wish. Beware that, if a given serial_no in table2 belongs to more than one range in table1, they will all match and you will get duplicated table1 records in the result set.

Join with BETWEEN:
with
  tfirst (from_serial, to_serial) as
    (select 3000100, 3000500 from dual union all
     select 3001100, 3001300 from dual union all
     select 5002001, 5002500 from dual
    ),
  tsecond (serial_no, activation_date) as
    (select 3000150, date '2019-04-30' from dual union all
     select 5002300, date '2019-10-01' from dual
    )
select a.from_serial,
       a.to_serial,
       b.serial_no,
       b.activation_date
from tfirst a join tsecond b on b.serial_no between a.from_serial and a.to_serial;

FROM_SERIAL  TO_SERIAL  SERIAL_NO ACTIVATION
----------- ---------- ---------- ----------
    5002001    5002500    5002300 01.10.2019
    3000100    3000500    3000150 30.04.2019

You can also do it using NumPy's broadcasting feature, as below. The explanation is in the comments.
import numpy as np
import pandas as pd

df1 = pd.DataFrame([('10003000100', '10003000500', '22-APR-19'), ('10003001100', '10003001300', '25-MAY-19'), ('10005002001', '10005002500', '30-AUG-19')], columns=('FROM_SERIAL', 'TO_SERIAL', 'TRANSACTION_DATE'))
df2 = pd.DataFrame([('10003000150', '30-APR-19'), ('10005002300', '01-OCT-19')], columns=('SERIAL_NO', 'ACTIVATION_DATE'))
## 'df1[["FROM_SERIAL"]].values' is a column vector of size m and 'df2["SERIAL_NO"].values' is a row
## vector of size n, so broadcasting yields an array of shape (m, n) holding
## the result of comparing each pair of m and n
compare = (df1[["FROM_SERIAL"]].values < df2["SERIAL_NO"].values) & (df1[["TO_SERIAL"]].values > df2["SERIAL_NO"].values)
mask = np.arange(len(df1) * len(df2)).reshape(-1, len(df2))[compare]
pd.concat([df1.iloc[mask // len(df2)].reset_index(drop=True), df2.iloc[mask % len(df2)].reset_index(drop=True)], axis=1, sort=False)
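If the ranges never overlap, a hedged alternative that avoids building the full m*n cross product is an IntervalIndex lookup. A minimal sketch, assuming df1 and df2 as constructed above and casting the serials to integers:
import pandas as pd

f = df1.astype({'FROM_SERIAL': 'int64', 'TO_SERIAL': 'int64'})
s = df2.astype({'SERIAL_NO': 'int64'})

# one closed interval per range row; get_indexer returns, for each serial,
# the position of the df1 row whose range contains it (or -1 if none does)
idx = pd.IntervalIndex.from_arrays(f['FROM_SERIAL'], f['TO_SERIAL'], closed='both')
pos = idx.get_indexer(s['SERIAL_NO'])

hit = pos != -1
result = pd.concat([f.iloc[pos[hit]].reset_index(drop=True),
                    s[hit].reset_index(drop=True)], axis=1)
print(result.drop(columns=['TRANSACTION_DATE']))
get_indexer raises if the intervals overlap, which is why the non-overlapping assumption matters.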

Related

BigQuery use conditions to create a table from other tables

I am facing a serious blockage on a project of mine.
Here is a summary of what I would like to do:
I have a big hourly file (10 GB) with the following extract (no header):
ID_A|segment_1,segment_2
ID_B|segment_2,segment_3,segment_4,segment_5
ID_C|segment_1
ID_D|segment_2,segment_4
Every ID (from A to D) can be linked to one or multiple segments (from 1 to 5).
I would like to process this file in order to get the following result (the result file contains a header):
ID|segment_1|segment_2|segment_3|segment_4|segment_5
ID_A|1|1|0|0|0
ID_B|0|1|1|1|1
ID_C|1|0|0|0|0
ID_D|0|1|0|1|0
1 means that the ID is included in the segment, 0 means that it is not.
I can clearly do this task with a Python script using multiple loops and conditions; however, I need a fast script that can do the same work.
I would like to use BigQuery to perform this operation.
Is it possible to do such a task in BigQuery?
How can it be done?
Thanks to all for your help.
Regards
Let me assume that the file is loaded into a BQ table with an id column and a segments column (which is a string). Then I would recommend storing the result values as an array, but that is not your question.
You can use the following select to create the table:
select id,
countif(segment = 'segment_1') as segment_1,
countif(segment = 'segment_2') as segment_2,
countif(segment = 'segment_3') as segment_3,
countif(segment = 'segment_4') as segment_4,
countif(segment = 'segment_5') as segment_5
from staging s cross join
unnest(split(segments, ',')) as segment
group by id;
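If you want to prototype the same transformation locally before running it in BigQuery, a hedged pandas sketch (the file name is hypothetical; it assumes the pipe-delimited, headerless layout shown in the question):
import pandas as pd

# hypothetical file name; pipe-delimited, no header, as in the extract
df = pd.read_csv('hourly_file.txt', sep='|', header=None, names=['ID', 'segments'])

# str.get_dummies splits on ',' and produces one 0/1 column per segment value
flags = df['segments'].str.get_dummies(sep=',')
out = pd.concat([df[['ID']], flags], axis=1)
out.to_csv('result.csv', sep='|', index=False)
This is only for checking the logic on a sample; for a 10 GB hourly file the BigQuery query above is the right tool.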
Below is for BigQuery Standard SQL
#standardSQL
SELECT id,
IF('segment_1' IN UNNEST(list), 1, 0) AS segment_1,
IF('segment_2' IN UNNEST(list), 1, 0) AS segment_2,
IF('segment_3' IN UNNEST(list), 1, 0) AS segment_3,
IF('segment_4' IN UNNEST(list), 1, 0) AS segment_4,
IF('segment_5' IN UNNEST(list), 1, 0) AS segment_5
FROM `project.dataset.table`,
UNNEST([STRUCT(SPLIT(segments) AS list)])
The above assumes your data is in a table like in the below CTE:
WITH `project.dataset.table` AS (
SELECT 'ID_A' id, 'segment_1,segment_2' segments UNION ALL
SELECT 'ID_B', 'segment_2,segment_3,segment_4,segment_5' UNION ALL
SELECT 'ID_C', 'segment_1' UNION ALL
SELECT 'ID_D', 'segment_2,segment_4'
)
If you apply the above query to such data, the result will be:
Row id segment_1 segment_2 segment_3 segment_4 segment_5
1 ID_A 1 1 0 0 0
2 ID_B 0 1 1 1 1
3 ID_C 1 0 0 0 0
4 ID_D 0 1 0 1 0

Pyspark: why does the ST_intersects function return duplicated rows?

I am using the ST_Intersects function of geospark to intersect points with polygons.
queryOverlap = """
SELECT p.ID, z.COUNTYNS as zone, p.date, timestamp, p.point
FROM gpsPingTable as p, zoneShapes as z
WHERE ST_Intersects(p.point, z.geometry)
"""
pingsDay = spark.sql(queryOverlap)
pingsDay.show()
Why does it return a duplicate of each row?
+--------------------+--------+----------+-------------------+--------------------+
| ID| zone| date| timestamp| point|
+--------------------+--------+----------+-------------------+--------------------+
|45cdaabc-a804-46b...|01529224|2020-03-17|2020-03-17 12:29:24|POINT (-122.38825...|
|45cdaabc-a804-46b...|01529224|2020-03-17|2020-03-17 12:29:24|POINT (-122.38825...|
|45cdaabc-a804-46b...|01529224|2020-03-18|2020-03-18 11:21:27|POINT (-122.38851...|
|45cdaabc-a804-46b...|01529224|2020-03-18|2020-03-18 11:21:27|POINT (-122.38851...|
|aae0bb4e-4899-4ce...|01531402|2020-03-18|2020-03-18 06:58:03|POINT (-122.23097...|
|aae0bb4e-4899-4ce...|01531402|2020-03-18|2020-03-18 06:58:03|POINT (-122.23097...|
|f9b58c70-0665-4f5...|01531928|2020-03-17|2020-03-17 17:32:46|POINT (-119.43811...|
|f9b58c70-0665-4f5...|01531928|2020-03-17|2020-03-17 17:32:46|POINT (-119.43811...|
|f9b58c70-0665-4f5...|01531928|2020-03-18|2020-03-18 08:21:34|POINT (-119.41080...|
|f9b58c70-0665-4f5...|01531928|2020-03-18|2020-03-18 08:21:34|POINT (-119.41080...|
|f9b58c70-0665-4f5...|01531928|2020-03-19|2020-03-19 00:26:43|POINT (-119.43623...|
|f9b58c70-0665-4f5...|01531928|2020-03-19|2020-03-19 00:26:43|POINT (-119.43623...|
|fb768b89-b92a-4f0...|01531402|2020-03-18|2020-03-18 06:30:43|POINT (-122.22106...|
|fb768b89-b92a-4f0...|01531402|2020-03-18|2020-03-18 06:30:43|POINT (-122.22106...|
|fb768b89-b92a-4f0...|01531402|2020-03-18|2020-03-18 07:57:47|POINT (-122.22102...|
|fb768b89-b92a-4f0...|01531402|2020-03-18|2020-03-18 07:57:47|POINT (-122.22102...|
|a32f727d-566b-4ad...|01529224|2020-03-18|2020-03-18 14:38:13|POINT (-122.59499...|
|a32f727d-566b-4ad...|01529224|2020-03-18|2020-03-18 14:38:13|POINT (-122.59499...|
|ad7e4d7e-f8e5-45b...|01529224|2020-03-18|2020-03-18 07:58:51|POINT (-122.14959...|
|ad7e4d7e-f8e5-45b...|01529224|2020-03-18|2020-03-18 07:58:51|POINT (-122.14959...|
+--------------------+--------+----------+-------------------+--------------------+
The most obvious reason would be that the points or zones in the source tables are not unique. If there are duplicate points or zones, you will obviously get duplicates in the join result.
Check the source tables for uniqueness:
SELECT p.ID, p.date, COUNT(*) c
FROM gpsPingTable as p
GROUP BY ID, date HAVING COUNT(*) > 1
This will report duplicate points. And this will report duplicate zones:
SELECT z.COUNTYNS as zone, COUNT(*) c
FROM zoneShapes as z
GROUP BY z.COUNTYNS HAVING COUNT(*) > 1
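If the duplicates do come from the source data, or you simply want to drop them after the join, a hedged PySpark sketch (assuming gpsPingTable is registered as a temp view, as the SQL above implies):
from pyspark.sql import functions as F

# report pings that occur more than once
(spark.table('gpsPingTable')
      .groupBy('ID', 'timestamp')
      .count()
      .filter(F.col('count') > 1)
      .show())

# or deduplicate the join result directly
pingsDay = spark.sql(queryOverlap).dropDuplicates()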

Replacing values with the first value of a group in a different table

Let's have ratings and books tables.
RATINGS
User-ID ISBN Book-Rating
244662 0373630689 7
19378 0812515595 10
238625 0441892604 9
180315 0140439072 0
242471 3548248950 0
BOOKS
ISBN Book-Title Book-Author Year-Of-Publication Publisher
0393000753 A Reckoning May Sarton 1981 W W Norton
Since many of the books have the same names and authors but different publishers and years of publication, I want to group them by title and replace ISBN in the rating table with the ISBN of the first row in the group.
More concretely, if the grouping looks like this
Book-Name ISBN
Name1 A
B
C
Name2 D
E
Name3 F
G
and the ratings like
User-ID ISBN Book-Rating
X B 3
X E 6
Y D 1
Z F 8
I want ratings to look like
User-ID ISBN Book-Rating
X A 3
X D 6
Y D 1
Z G 8
to save memory needed for pivot_table. The data set can be found here.
My attempt was along the lines of
book_rating_view = ratings.merge(books, how='left', on='ISBN').groupby(['Book-Title'])['ISBN']
ratings['ISBN'].replace(ratings['ISBN'], pd.Series([book_rating_view.get_group(key).min() for key,_ in book_rating_view]))
which doesn't seem to work. Another attempt was to construct the pivot_table directly as
isbn_vector = books.groupby(['Book-Title']).first()
utility = pd.DataFrame(0, index=explicit_ratings['User-ID'], columns=users['User-ID'])
for name, group in explicit_ratings.groupby('User-ID'):
user_vector = pd.DataFrame(0, index=isbn_vector, columns = [name])
for row, index in group:
user_vector[books.groupby(['Book-Title']).get_group(row['ISBN']).first()] = row['Book-Rating']
utility.join(user_vector)
which leads to a MemoryError, even though the reduced table should fit into memory.
Thanks for any advice!
I'd want you to show us the BOOKS dataframe a bit more, and above all the desired output, but how about the below? (Even though I usually don't recommend storing data as lists in a dataframe...)
Say df1 = RATINGS and df2 = BOOKS:
dfm = df2.merge(df1, on='ISBN').groupby('Book-Title').agg(list)
dfm['Book-Rating'] = dfm['Book-Rating'].map(sum)
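For the replacement itself (mapping every ISBN in RATINGS to the first ISBN of its title group), a hedged sketch, assuming the frames are named books and ratings:
# first ISBN of each title group, aligned back onto every book row
first_isbn = books.groupby('Book-Title')['ISBN'].transform('first')
isbn_map = dict(zip(books['ISBN'], first_isbn))

# replace each rating's ISBN; ISBNs missing from BOOKS are left untouched
ratings['ISBN'] = ratings['ISBN'].map(isbn_map).fillna(ratings['ISBN'])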

How to iterate over columns in a df, compare values with the previous column and perform an action in Python

The operation I am trying to perform is similar to this MySQL DELETE statement:
DELETE FROM ABCD WHERE val_2001>val_2000*1.5 OR val_2001>val_1999*POW(1.5,2);
And the column names vary from val_2001 to val_2017.
All the data from the table ABCD is dumped into a csv and loaded into df.
How do I iterate over each column, compare it with the previous column, and perform the delete? (new to Python)
The table data sample :
val_2000 val_2001 val_2002 val_2003
100 112.058663384525 119.070787312921 117.033250060214
100 118.300395256917 124.655238202362 128.723125524235
100 109.333236619151 116.785836024946 117.390803371386
100 120.954175930764 126.099776250454 124.491022271481
100 107.776153227575 105.560100052722 108.07490649383
100 151.596517146962 306.608812920781 124.610273175528
Note: there are columns which need not be iterated as well.
The sample output :
val_2000 val_2001 val_2002 val_2003
100 112.058663384525 119.070787312921 117.033250060214
100 118.300395256917 124.655238202362 128.723125524235
100 109.333236619151 116.785836024946 117.390803371386
100 120.954175930764 126.099776250454 124.491022271481
100 107.776153227575 105.560100052722 108.07490649383
100 NULL NULL 124.610273175528
EDIT : - Currently trying this way:
df = pd.read_csv("singleDataFile.csv")
for values in xrange(2000, 2016):
    val2 = values + 1
    df['val_'+str(val2)] = df['val_'+str(val2)].where((df['val_'+str(val2)]>df['val_'+str(values)]*1.5) | (df['val_'+str(val2)]<df['val_'+str(values)]*0.75))
print(df)
Getting a format error
This code creates a random DataFrame that fairly closely mimics your DataFrame. It seems one of the key components of your question was iterating through multiple columns, which this does (via pandas).
Build DataFrame:
import numpy as np
import pandas as pd

cols = ['val_{}'.format(c) for c in range(2000, 2018)]
d = {}
for c in cols:
    d[c] = np.random.rand(10) * 200 + 100
df = pd.DataFrame(d, columns=cols)
Output:
val_2000 val_2001 val_2002 val_2003 val_2004 val_2005 \
0 138.795742 178.467087 131.461771 151.475698 217.449107 107.680520
1 127.857106 217.484552 248.528498 155.661208 281.914679 211.313490
2 278.366253 137.543827 167.605495 129.869768 272.923010 190.659691
3 221.798435 206.622385 145.636888 236.499951 212.404028 122.954408
4 122.994183 299.793792 171.987895 246.948802 290.938506 127.846811
5 264.400326 203.226235 121.972832 137.858361 161.812761 270.464277
6 156.253907 280.101596 138.100352 164.018757 121.044386 297.869079
7 186.572007 146.406624 110.309996 270.895300 101.975819 229.314098
8 195.470896 286.125937 251.778581 259.112738 207.539354 127.895095
9 168.135585 261.295740 203.234246 279.825177 188.648541 197.145975
Core Code:
df[(df.shift(axis = 1) > df * 1.5) | (df.shift(axis = 1) < df * 0.75)] = 'NULL'
Output:
val_2000 val_2001 val_2002 val_2003 val_2004 val_2005 \
0 138.795742 178.467 131.461771 151.476 NULL 107.681
1 127.857106 NULL 248.528498 155.661 NULL 211.313
2 278.366253 137.544 167.605495 129.87 NULL 190.66
3 221.798435 206.622 145.636888 NULL 212.404 122.954
4 122.994183 NULL 171.987895 NULL 290.939 127.847
5 264.400326 203.226 121.972832 137.858 161.813 NULL
6 156.253907 NULL 138.100352 164.019 121.044 NULL
7 186.572007 146.407 110.309996 NULL 101.976 NULL
8 195.470896 NULL 251.778581 259.113 207.539 127.895
9 168.135585 NULL 203.234246 NULL 188.649 197.146
You want to use the Series.where function on the columns you want to change. For example, the first column can be achieved by:
df['val_2001'] = df['val_2001'].where( df['val_2001']>df['val_2000']*1.5 )
Edit (in response to OP comment): You can add OR using the Python operator |, for example, as follows:
df['val_2001'] = df['val_2001'].where( (df['val_2001']>df['val_2000']*1.5) | (df['val_2001']<df['val_2000']*0.75) )
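Note that where keeps the values for which the condition is True and blanks the rest; if, like the MySQL DELETE, you want to blank the value when the condition holds, a hedged variant uses mask instead (shown here for just the first clause of the condition, comparing against a copy so an already-blanked year does not affect the next comparison):
# blank out a year's value where it jumped more than 1.5x over the previous year
orig = df.copy()
for year in range(2001, 2018):
    prev, cur = 'val_{}'.format(year - 1), 'val_{}'.format(year)
    df[cur] = df[cur].mask(orig[cur] > orig[prev] * 1.5)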

Pandas joining based on date

I'm trying to join two dataframes with dates that don't perfectly match up. For a given group/date in the left dataframe, I want to join the corresponding record from the right dataframe with a date just before that of the left dataframe. It's probably easiest to show with an example.
df1:
group date teacher
a 1/10/00 1
a 2/27/00 1
b 1/7/00 1
b 4/5/00 1
c 2/9/00 2
c 9/12/00 2
df2:
teacher date hair length
1 1/1/00 4
1 1/5/00 8
1 1/30/00 20
1 3/20/00 100
2 1/1/00 0
2 8/10/00 50
Gives us:
group date teacher hair length
a 1/10/00 1 8
a 2/27/00 1 20
b 1/7/00 1 8
b 4/5/00 1 100
c 2/9/00 2 0
c 9/12/00 2 50
Edit 1:
Hacked together a way to do this. Basically I iterate through every row in df1 and pick out the most recent corresponding entry in df2. It is insanely slow; surely there must be a better way.
One way to do this is to create a new column in the left data frame, which will (for a given row's date) determine the value that is closest and earlier:
df1['join_date'] = df1.date.map(lambda x: df2.date[df2.date <= x].max())
then a regular join or merge between 'join_date' on the left and 'date' on the right will work. You may need to tweak the function to handle Null values or other corner cases.
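A hedged sketch of that regular merge, assuming the frame names from the question (note that if you also match on teacher, the join_date lookup above should be restricted to that teacher's dates as well):
# give df2's date the same name as the computed key, then a plain merge
right = df2.rename(columns={'date': 'join_date'})
result = df1.merge(right, on=['teacher', 'join_date'], how='left')
result = result.drop(columns=['join_date'])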
This is not very efficient (you are searching the right-hand dates over and over). A more efficient approach is to sort both data frames by the dates, iterate through the left-hand data frame, and consume entries from the right hand data frame just until the date is larger:
# Assuming df1 and df2 are sorted by the dates
df1['hair length'] = 0  # initialize
r_generator = df2.iterrows()
_, cur_r_row = next(r_generator)
for i, l_row in df1.iterrows():
    cur_hair_length = 0  # Assume 0 works when df1 has a date earlier than df2
    while cur_r_row['date'] <= l_row['date']:
        cur_hair_length = cur_r_row['hair length']
        try:
            _, cur_r_row = next(r_generator)
        except StopIteration:
            break
    df1.loc[i, 'hair length'] = cur_hair_length
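For completeness, a hedged alternative that avoids hand-written iteration: pandas.merge_asof performs exactly this backward-looking join, assuming the date columns are parsed as datetimes and both frames are sorted on them:
import pandas as pd

df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

# for each df1 row, take the latest df2 row with the same teacher and date <= df1.date
result = pd.merge_asof(df1.sort_values('date'), df2.sort_values('date'),
                       on='date', by='teacher', direction='backward')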
Seems like the quickest way to do this is using sqlite via pysqldf:
def partial_versioned_join(tablea, tableb, tablea_keys, tableb_keys):
    try:
        tablea_group, tablea_date = tablea_keys
        tableb_group, tableb_date = tableb_keys
    except ValueError as e:
        raise ValueError('Need to pass in both a group and date key for both tables') from e

    # Note: can't actually use group here as a field name due to sqlite
    statement = """SELECT a.group, a.{date_a} AS {temp_date}, b.*
                   FROM (SELECT tablea.group, tablea.{date_a}, tablea.{group_a},
                                MAX(tableb.{date_b}) AS tdate
                         FROM tablea
                         JOIN tableb
                           ON tablea.{group_a}=tableb.{group_b}
                          AND tablea.{date_a}>=tableb.{date_b}
                         GROUP BY tablea.{base_id}, tablea.{date_a}, tablea.{group_a}
                        ) AS a
                   JOIN tableb b
                     ON a.{group_a}=b.{group_b}
                    AND a.tdate=b.{date_b};
                """.format(group_a=tablea_group, date_a=tablea_date,
                           group_b=tableb_group, date_b=tableb_date,
                           temp_date='join_date', base_id=base_id)

    # Note: you lose types here for tableb so you may want to save them
    pre_join_tableb = sqldf(statement, locals())
    return pd.merge(tablea, pre_join_tableb, how='inner',
                    left_on=['group'] + tablea_keys,
                    right_on=['group', tableb_group, 'join_date'])
