I have a dataframe of weather data that looks like this:
+----+------------+----------+-----------+
| ID | Station_ID | Latitude | Longitude |
+----+------------+----------+-----------+
| 0 | 6010400 | 52.93 | -82.43 |
| 1 | 6010400 | 52.93 | -82.43 |
| 2 | 6010400 | 52.93 | -82.43 |
| 3 | 616I001 | 45.07 | -77.88 |
| 4 | 616I001 | 45.07 | -77.88 |
| 5 | 616I001 | 45.07 | -77.88 |
+----+------------+----------+-----------+
I want to create a new column called postal_code using an API lookup based on the latitude and longitude values. I cannot perform a lookup for each row in the dataframe as that would be inefficient, since there are over 500,000 rows and only 186 unique Station_IDs. It's also unfeasible due to rate limiting on the API I need to use.
I believe I need to perform a groupby transform but can't quite figure out how to get it to work correctly.
Any help with this would be greatly appreciated.
I believe groupby is only for aggregations, which is not what you want here.
First combine 'Latitude' and 'Longitude' into a new column of tuples:
df['coordinates'] = list(zip(df['Latitude'],df['Longitude']))
Then you can use this 'coordinates' column to collect all the unique (Latitude, Longitude) pairs; a set drops the duplicates.
unique_coords = set(df['coordinates'])
Then fetch the postal codes for these coordinates using API calls, as you said, and store them in a dict.
You can then use this dict to populate the postal code for each row.
postal_code_dict = {'key':'value'} #sample dictionary
df['postal_code'] = df['coordinates'].apply(lambda x: postal_code_dict[x])
Hope this helps.
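Putting the steps above together as a minimal sketch (fetch_postal_code is a hypothetical stand-in for whatever geocoding API you use; note that .map is a vectorised dict lookup that returns NaN for missing keys, unlike a raw dict access in apply, which would raise KeyError):

```python
import pandas as pd

def fetch_postal_code(lat, lon):
    """Hypothetical stand-in for the real API call -- replace with
    your geocoding service (and respect its rate limits)."""
    return f"PC-{lat}-{lon}"

df = pd.DataFrame({
    'Station_ID': ['6010400', '6010400', '616I001'],
    'Latitude': [52.93, 52.93, 45.07],
    'Longitude': [-82.43, -82.43, -77.88],
})

# One tuple column, then one API call per unique coordinate pair
df['coordinates'] = list(zip(df['Latitude'], df['Longitude']))
postal_code_dict = {c: fetch_postal_code(*c) for c in set(df['coordinates'])}

# map() looks each tuple up in the dict, row by row, without extra API calls
df['postal_code'] = df['coordinates'].map(postal_code_dict)
```

With 186 unique stations this makes 186 API calls regardless of how many of the 500,000+ rows share each station.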
I am trying to use Pandas to detect changes across two CSVs. I would like it ideally to highlight which UIDs have been changed. I've attached an example of the ideal output here.
CSV 1 (imported as DataFrame):
| UID | Email |
| -------- | --------------- |
| U01 | u01@email.com |
| U02 | u02@email.com |
| U03 | u03@email.com |
| U04 | u04@email.com |
CSV 2 (imported as DataFrame):
| UID | Email |
| -------- | --------------- |
| U01 | u01@email.com |
| U02 | newemail@email.com |
| U03 | u03@email.com |
| U04 | newemail2@email.com |
| U05 | u05@email.com |
| U06 | u06@email.com |
Over the two CSVs, U02 and U04 saw email changes, whereas U05 and U06 were new records entirely.
I have tried using the pandas compare function, and unfortunately it doesn't work because CSV2 has more records than CSV1.
I have since concatenated the UID and Email fields, as below, and then created a new field called "Unique" to flag whether the concatenated value is a duplicate (True or False), but this doesn't show whether a record is entirely new.
df3['Concatenated'] = df3["UID"] +"~"+ df3["Email"]
df3['Unique'] = ~df3['Concatenated'].duplicated(keep=False)
This works to an extent, but it feels clunky, and I was wondering if anyone had a smarter way of doing this - especially when it comes into showing whether the record is new or not.
The strategy here is to merge the two dataframes on UID, then compare the email columns, and finally see if the new UIDs are in the UID list.
df_compare = pd.merge(left=df, right=df_new, how='outer', on='UID')
df_compare['Change Status'] = df_compare.apply(lambda x: 'No Change' if x.Email_x == x.Email_y else 'Change', axis=1)
df_compare.loc[~df_compare.UID.isin(df.UID),'Change Status'] = 'New Record'
df_compare = df_compare.drop(columns=['Email_x']).rename(columns={'Email_y': 'Email'})
gives df_compare as:
UID Email Change Status
0 U01 u01@email.com No Change
1 U02 newemail@email.com Change
2 U03 u03@email.com No Change
3 U04 newemail2@email.com Change
4 U05 u05@email.com New Record
5 U06 u06@email.com New Record
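The row-wise apply works, but on large frames a vectorised version of the same idea may be noticeably faster. A sketch using numpy.select (a row whose left-hand email is missing after the outer merge must be a new record):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'UID': ['U01', 'U02', 'U03', 'U04'],
                   'Email': ['u01@email.com', 'u02@email.com',
                             'u03@email.com', 'u04@email.com']})
df_new = pd.DataFrame({'UID': ['U01', 'U02', 'U03', 'U04', 'U05', 'U06'],
                       'Email': ['u01@email.com', 'newemail@email.com',
                                 'u03@email.com', 'newemail2@email.com',
                                 'u05@email.com', 'u06@email.com']})

df_compare = pd.merge(left=df, right=df_new, how='outer', on='UID')
conditions = [df_compare['Email_x'].isna(),                    # UID only in CSV 2
              df_compare['Email_x'] == df_compare['Email_y']]  # emails agree
df_compare['Change Status'] = np.select(
    conditions, ['New Record', 'No Change'], default='Change')
df_compare = (df_compare.drop(columns='Email_x')
                        .rename(columns={'Email_y': 'Email'}))
```

np.select evaluates all conditions over whole columns at once instead of calling a Python lambda per row.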
I have a GridDB container where I have stored my database. I want to copy the table, but excluding a few columns. The function I need should extract all columns matching a given keyword and then create a new table from them. It must always include the first column *id because it is needed in every table.
For example, in the table given below:
'''
| employee_id | department_id | employee_first_name | employee_last_name | employee_gender |
|-------------|---------------|---------------------|--------------------|-----------------|
| 1           | 1             | John                | Matthew            | M               |
| 2           | 1             | Alexandra           | Philips            | F               |
| 3           | 2             | Hen                 | Lotte              | M               |
'''
Suppose I need to get the first column and every other column starting with "employee". How can I do this through a Python function?
I am using GridDB Python client on my Ubuntu machine and I have already stored the database.csv file in the container. Thanks in advance for your help!
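Assuming the container's rows have already been read into a pandas DataFrame (the GridDB read and write steps are omitted here, and the column names are taken from the example table), the column-selection part of such a function could be sketched as:

```python
import pandas as pd

def columns_by_keyword(df, keyword, id_col='employee_id'):
    """Return a copy of df keeping id_col plus every other column
    whose name starts with the given keyword."""
    keep = [id_col] + [c for c in df.columns
                       if c.startswith(keyword) and c != id_col]
    return df[keep]

# Sample data mirroring the example table
df = pd.DataFrame({
    'employee_id': [1, 2, 3],
    'department_id': [1, 1, 2],
    'employee_first_name': ['John', 'Alexandra', 'Hen'],
    'employee_last_name': ['Matthew', 'Philips', 'Lotte'],
    'employee_gender': ['M', 'F', 'M'],
})

subset = columns_by_keyword(df, 'employee')  # drops department_id
```

The resulting DataFrame can then be written back into a new GridDB container with the client's put APIs.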
I am using this query to insert new entries into my table:
MERGE INTO CLEAN clean USING DUAL ON (clean.id = :id)
WHEN NOT MATCHED THEN INSERT (ID, COUNT) VALUES (:id, :xcount)
WHEN MATCHED THEN UPDATE SET clean.COUNT = clean.count + :xcount
It seems that I do more inserts than updates, is there a way to improve my current performance?
I am using cx_Oracle with Python 3 and OracleDB 19c.
If you were having massive problems with your approach, you would most likely be missing an index on the column clean.id, which is required because the MERGE uses DUAL as a source for each individual row.
This is less probable, as you say the id is a primary key.
So basically you are doing the right thing, and you will see an execution plan similar to the one below:
---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------
| 0 | MERGE STATEMENT | | | | 2 (100)| |
| 1 | MERGE | CLEAN | | | | |
| 2 | VIEW | | | | | |
| 3 | NESTED LOOPS OUTER | | 1 | 40 | 2 (0)| 00:00:01 |
| 4 | TABLE ACCESS FULL | DUAL | 1 | 2 | 2 (0)| 00:00:01 |
| 5 | VIEW | VW_LAT_A18161FF | 1 | 38 | 0 (0)| |
| 6 | TABLE ACCESS BY INDEX ROWID| CLEAN | 1 | 38 | 0 (0)| |
|* 7 | INDEX UNIQUE SCAN | CLEAN_UX1 | 1 | | 0 (0)| |
---------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
7 - access("CLEAN"."ID"=:ID)
So the execution plan is fine and works effectively, but it has one problem.
Remember: whenever you use an index, you will be happy while processing a few rows, but it will not scale.
If you are processing millions of records, you may fall back to two-step processing:
1. insert all rows into a temporary table
2. perform a single MERGE statement using the temporary table
The big advantage is that Oracle can open a hash join and get rid of the index access for each of the million rows.
Here is an example of a test on the clean table, initialised with 1M ids (not shown), performing 1M inserts and 1M updates:
n = 1000000
data2 = [{"id" : i, "xcount" :1} for i in range(2*n)]
sql3 = """
insert into tmp (id,count)
values (:id,:xcount)"""
sql4 = """MERGE into clean USING tmp on (clean.id = tmp.id)
when not matched then insert (id, count) values (tmp.id, tmp.count)
when matched then update set clean.count= clean.count + tmp.count"""
cursor.executemany(sql3, data2)
cursor.execute(sql4)
The test runs in approx. 10 seconds, which is less than half the time of your approach with MERGE using DUAL.
If this is still not enough, you'll have to use the parallel option.
MERGE is quite fast. Inserts being faster than updates, I'd say (usually).
So, if you're asking how to make inserts faster, then it depends.
If you're inserting one row at a time, there shouldn't be any bottleneck.
If you're inserting millions of rows, see whether there are triggers enabled on the table which fire for each row and do something (slowing the process down).
As for updates, is there an index on the clean.id column? If not, adding one would probably help.
Otherwise, see what explain plan says; collect statistics regularly.
I have a large PySpark DataFrame that I would like to manipulate as in the example below. I think it is easier to visualise it than to describe it. Hence, for illustrative purposes, let us take a simple DataFrame df:
df.show()
+----------+-----------+-----------+
| series | timestamp | value |
+----------+-----------+-----------+
| ID1 | t1 | value1_1 |
| ID1 | t2 | value2_1 |
| ID1 | t3 | value3_1 |
| ID2 | t1 | value1_2 |
| ID2 | t2 | value2_2 |
| ID2 | t3 | value3_2 |
| ID3 | t1 | value1_3 |
| ID3 | t2 | value2_3 |
| ID3 | t3 | value3_3 |
+----------+-----------+-----------+
In the above DataFrame, each of the three unique values in column series (i.e. ID1, ID2 and ID3) has corresponding values (under column value) occurring simultaneously at the same timestamps (i.e. the same entries in column timestamp).
From this DataFrame, I would like a transformation which ends up with the following DataFrame, named, say, result. As can be seen, the shape of the DataFrame has changed, and the columns have even been renamed according to entries of the original DataFrame.
result.show()
+-----------+-----------+-----------+-----------+
| timestamp | ID1 | ID2 | ID3 |
+-----------+-----------+-----------+-----------+
| t1 | value1_1 | value1_2 | value1_3 |
| t2 | value2_1 | value2_2 | value2_3 |
| t3 | value3_1 | value3_2 | value3_3 |
+-----------+-----------+-----------+-----------+
The order of the columns in result is arbitrary and should not affect the final answer. This illustrative example only contains three unique values in series (i.e. ID1, ID2 and ID3). Ideally, I would like to write a piece of code which automatically detects the unique values in series and generates a corresponding new column for each. Does anyone know where I can start from? I have tried grouping by timestamp and then collecting a set of distinct series and value pairs using the aggregate function collect_set, but with no luck :(
Many thanks in advance!
Marioanzas
Just a simple pivot:
import pyspark.sql.functions as F
result = df.groupBy('timestamp').pivot('series').agg(F.first('value'))
Make sure that each row in df is distinct; otherwise duplicate entries may be silently deduplicated.
Extending mck's answer, I have found a way of improving the pivot performance. pivot is a very expensive operation; hence, for Spark 2.0 onwards, it is recommended to provide the column data (if known) as an argument to the function, as shown in the code below. This improves performance for DataFrames much larger than the illustrative one posed in this question. Given that the values of series are known beforehand, we can use:
import pyspark.sql.functions as F
series_list = ('ID1', 'ID2', 'ID3')
result = df.groupBy('timestamp').pivot('series', series_list).agg(F.first('value'))
result.show()
+---------+--------+--------+--------+
|timestamp| ID1| ID2| ID3|
+---------+--------+--------+--------+
| t1|value1_1|value1_2|value1_3|
| t2|value2_1|value2_2|value2_3|
| t3|value3_1|value3_2|value3_3|
+---------+--------+--------+--------+
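As an aside for readers working through the small example without a Spark session at hand, the same reshape can be reproduced in plain pandas with pivot (this is only a local illustration, not part of the PySpark answer):

```python
import pandas as pd

df = pd.DataFrame({
    'series': ['ID1'] * 3 + ['ID2'] * 3 + ['ID3'] * 3,
    'timestamp': ['t1', 't2', 't3'] * 3,
    'value': ['value1_1', 'value2_1', 'value3_1',
              'value1_2', 'value2_2', 'value3_2',
              'value1_3', 'value2_3', 'value3_3'],
})

# One row per timestamp, one column per unique value of 'series'
result = (df.pivot(index='timestamp', columns='series', values='value')
            .reset_index())
result.columns.name = None  # drop the leftover 'series' axis label
```

Unlike Spark's pivot, pandas' pivot raises an error on duplicate (index, column) pairs rather than aggregating, which makes accidental duplicates easy to spot here.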
I have a DataFrame as below.
df = pd.DataFrame({
'Country':['A','A','A','A','A','A','B','B','B'],
'City':['C 1','C 1','C 1','B 2','B 2','B 2','C 1','C 1','C 1'],
'Date':['7/1/2020','7/2/2020','7/3/2020','7/1/2020','7/2/2020','7/3/2020','7/1/2020','7/2/2020','7/3/2020'],
'Value':[46,90,23,84,89,98,31,84,41]
})
I need to create 2 averages
Firstly, both Country and City as the criteria
Secondly, Average for only the Country
In order to achieve this, we can easily write below codes
df.groupby(['Country','City']).agg('mean')
+---------+------+-------+
| Country | City | Value |
+---------+------+-------+
| A | B 2 | 90.33 |
| +------+-------+
| | C 1 | 53 |
+---------+------+-------+
| B | C 1 | 52 |
+---------+------+-------+
df.groupby(['Country']).agg('mean')
+---------+-------+
| Country | Value |
+---------+-------+
| A       | 71.67 |
+---------+-------+
| B       | 52    |
+---------+-------+
The only change between the above two snippets is the groupby criterion City; everything else is the same, so there is clear code repetition (especially in more complex scenarios).
Now my question is: is there any way to write one piece of code that covers both scenarios at once? DRY - Don't Repeat Yourself.
what I've in my mind is something like below.
Choice = 'City'  # <<-- Here I type either 'City' or None based on the requirement; e.g. if None, the code below should ignore that criterion.
df.groupby(['Country',Choice]).agg('mean')
Is this possible? or what is the best way to write the above codes efficiently without repetition?
I am not sure exactly what you want to accomplish, but why not just use an if?
columns = ['Country']
if Choice:
    columns.append(Choice)
df.groupby(columns).agg('mean')
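Wrapping that idea in a small helper and checking both branches against the question's data (selecting the Value column explicitly also sidesteps trying to average the string Date column):

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'City': ['C 1', 'C 1', 'C 1', 'B 2', 'B 2', 'B 2', 'C 1', 'C 1', 'C 1'],
    'Value': [46, 90, 23, 84, 89, 98, 31, 84, 41],
})

def mean_by(df, choice=None):
    # Build the groupby key list once; a None choice is simply skipped
    columns = ['Country'] + ([choice] if choice else [])
    return df.groupby(columns)['Value'].mean()

by_city = mean_by(df, 'City')   # Country + City averages
by_country = mean_by(df)        # Country-only averages
```

Both calls share the same code path, so any change to the aggregation only has to be made in one place.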