PySpark find if pattern in one column is present in another column - python
I have two PySpark DataFrames. One contains a FullAddress field (say col1), and the other contains the name of a city/town/suburb in one of its columns (say col2). I want to check whether col2 occurs in col1 and return col2 if there is a match.
Additionally, col2 may hold a comma-separated list of suburb names.
The DataFrame that contains the city/town/suburb data:
+--------+--------+----------------------------------------------------------+
|Postcode|District|City/ Town/ Suburb |
+--------+--------+----------------------------------------------------------+
|2000 |Sydney |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks |
|2001 |Sydney |Sydney |
|2113 |Sydney |North Ryde |
+--------+--------+----------------------------------------------------------+
And the DataFrame that contains the full address:
+-----------------------------------------------------------+
|FullAddress |
+-----------------------------------------------------------+
|BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |
| HAY STREET HAYMARKET 2000, NSW, Australia |
| SMART STREET FAIRFIELD 2165, NSW, Australia |
|CLARENCE STREET SYDNEY 2000, NSW, Australia |
+-----------------------------------------------------------+
I would like to have something like this:
+---------------------------------------------+----------+
|FullAddress                                  |suburb    |
+---------------------------------------------+----------+
|BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |NORTH RYDE|
|HAY STREET HAYMARKET 2000, NSW, Australia    |HAYMARKET |
|SMART STREET FAIRFIELD 2165, NSW, Australia  |NULL      |
|CLARENCE STREET SYDNEY 2000, NSW, Australia  |SYDNEY    |
+---------------------------------------------+----------+
There are two DataFrames -
DataFrame 1: DataFrame containing the complete address.
DataFrame 2: DataFrame containing the base data - Postcode, District & City / Town / Suburb.
The aim is to extract the appropriate suburb for DataFrame 1 from DataFrame 2. Though the OP has not explicitly specified a key on which to join the two DataFrames, Postcode seems to be the only reasonable choice.
# Importing requisite functions
from pyspark.sql.functions import col,regexp_extract,split,udf
from pyspark.sql.types import StringType
Let's create DataFrame 1 as df. From this DataFrame we need to extract the Postcode. In Australia, all postcodes are 4 digits long, so we use regexp_extract() to extract a 4-digit number from the string column.
df = sqlContext.createDataFrame([('BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia ',),
('HAY STREET HAYMARKET 2000, NSW, Australia',),
('SMART STREET FAIRFIELD 2165, NSW, Australia',),
('CLARENCE STREET SYDNEY 2000, NSW, Australia',)],
('FullAddress',))
df = df.withColumn('Postcode', regexp_extract('FullAddress', r"(\d{4})", 1))
df.show(truncate=False)
+---------------------------------------------+--------+
|FullAddress |Postcode|
+---------------------------------------------+--------+
|BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |2113 |
|HAY STREET HAYMARKET 2000, NSW, Australia |2000 |
|SMART STREET FAIRFIELD 2165, NSW, Australia |2165 |
|CLARENCE STREET SYDNEY 2000, NSW, Australia |2000 |
+---------------------------------------------+--------+
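As an aside, if an address could also contain a longer digit run (say, a unit or lot number with more than four digits), anchoring the pattern on word boundaries is safer. A minimal sketch of that variant:
# Hypothetical variant: word boundaries ensure we don't match 4 digits inside a longer number
df = df.withColumn('Postcode', regexp_extract('FullAddress', r"\b(\d{4})\b", 1))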
Now that we have extracted the Postcode, we have a key on which to join the two DataFrames. Let's create DataFrame 2, from which we need to extract the respective suburb.
df_City_Town_Suburb = sqlContext.createDataFrame([(2000,'Sydney','Dawes Point, Haymarket, Millers Point, Sydney, The Rocks'),
(2001,'Sydney','Sydney'),(2113,'Sydney','North Ryde')],
('Postcode','District','City_Town_Suburb'))
df_City_Town_Suburb.show(truncate=False)
+--------+--------+--------------------------------------------------------+
|Postcode|District|City_Town_Suburb |
+--------+--------+--------------------------------------------------------+
|2000 |Sydney |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
|2001 |Sydney |Sydney |
|2113 |Sydney |North Ryde |
+--------+--------+--------------------------------------------------------+
Joining the two DataFrames with a left join -
df = df.join(df_City_Town_Suburb.select('Postcode','City_Town_Suburb'), ['Postcode'],how='left')
df.show(truncate=False)
+--------+---------------------------------------------+--------------------------------------------------------+
|Postcode|FullAddress |City_Town_Suburb |
+--------+---------------------------------------------+--------------------------------------------------------+
|2113 |BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |North Ryde |
|2165 |SMART STREET FAIRFIELD 2165, NSW, Australia |null |
|2000 |HAY STREET HAYMARKET 2000, NSW, Australia |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
|2000 |CLARENCE STREET SYDNEY 2000, NSW, Australia |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
+--------+---------------------------------------------+--------------------------------------------------------+
Splitting the column City_Town_Suburb into an array using the split() function -
df = df.select('Postcode', 'FullAddress', split(col("City_Town_Suburb"), r",\s*").alias("City_Town_Suburb"))
df.show(truncate=False)
+--------+---------------------------------------------+----------------------------------------------------------+
|Postcode|FullAddress |City_Town_Suburb |
+--------+---------------------------------------------+----------------------------------------------------------+
|2113 |BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |[North Ryde] |
|2165 |SMART STREET FAIRFIELD 2165, NSW, Australia |null |
|2000 |HAY STREET HAYMARKET 2000, NSW, Australia |[Dawes Point, Haymarket, Millers Point, Sydney, The Rocks]|
|2000 |CLARENCE STREET SYDNEY 2000, NSW, Australia |[Dawes Point, Haymarket, Millers Point, Sydney, The Rocks]|
+--------+---------------------------------------------+----------------------------------------------------------+
Finally, we create a UDF to check every element of the array City_Town_Suburb against the column FullAddress. If a match is found, it is returned immediately; otherwise None is returned.
def suburb(FullAddress, City_Town_Suburb):
    # Guard against a missing array, otherwise iterating would raise an error
    if City_Town_Suburb is None:
        return None
    # Check each array element against FullAddress;
    # the first match found is returned immediately.
    for sub in City_Town_Suburb:
        if sub.strip().upper() in FullAddress:
            return sub.upper()
    return None

suburb_udf = udf(suburb, StringType())
Applying this UDF -
df = df.withColumn('suburb', suburb_udf(col('FullAddress'),col('City_Town_Suburb'))).drop('City_Town_Suburb')
df.show(truncate=False)
+--------+---------------------------------------------+----------+
|Postcode|FullAddress |suburb |
+--------+---------------------------------------------+----------+
|2113 |BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |NORTH RYDE|
|2165 |SMART STREET FAIRFIELD 2165, NSW, Australia |null |
|2000 |HAY STREET HAYMARKET 2000, NSW, Australia |HAYMARKET |
|2000 |CLARENCE STREET SYDNEY 2000, NSW, Australia |SYDNEY |
+--------+---------------------------------------------+----------+
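As an aside, on Spark 2.4+ the same lookup can be done without a Python UDF, using the filter higher-order function in a SQL expression. A minimal sketch under that assumption, reusing the column names from above (applied before City_Town_Suburb is dropped):
from pyspark.sql.functions import expr, upper

# Keep the first array element whose trimmed, upper-cased value occurs in FullAddress;
# indexing with [0] yields NULL when the filtered array is empty or the array itself is NULL
df = df.withColumn(
    'suburb',
    upper(expr("filter(City_Town_Suburb, x -> instr(FullAddress, upper(trim(x))) > 0)[0]"))
)
Keeping the matching inside the JVM avoids the serialization overhead of a Python UDF.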
Related
how to iterate through column values of pyspark dataframe
I have a PySpark DataFrame and want to check the address column of each row: if it contains the substring "india", add another column set to true, else false. I also wanted to print "yes" when the substring is present in the column value and "no" otherwise, for every row in the DataFrame, something like: if "india" or "karnataka" is in sparkDF["address"]: print("yes") else: print("no"). I'm getting wrong results because this checks each character instead of the substring. How can I achieve this?
You can utilise contains or like for this.
Data preparation -
from io import StringIO

import pandas as pd
from pyspark.sql import functions as F

s = StringIO("""
user,address
rishi,XYZ Bangalore Karnataka
kirthi,ABC Pune India
tushar,ASD Orissa India
""")

df = pd.read_csv(s, delimiter=',')
sparkDF = sql.createDataFrame(df)  # sql is an existing SQLContext/SparkSession
sparkDF.show()
+------+-----------------------+
|user  |address                |
+------+-----------------------+
|rishi |XYZ Bangalore Karnataka|
|kirthi|ABC Pune India         |
|tushar|ASD Orissa India       |
+------+-----------------------+
Contains -
sparkDF = sparkDF.withColumn('result', F.lower(F.col('address')).contains("india"))
sparkDF.show(truncate=False)
+------+-----------------------+------+
|user  |address                |result|
+------+-----------------------+------+
|rishi |XYZ Bangalore Karnataka|false |
|kirthi|ABC Pune India         |true  |
|tushar|ASD Orissa India       |true  |
+------+-----------------------+------+
Like - multiple search patterns -
sparkDF = sparkDF.withColumn('result', F.lower(F.col('address')).like("%india%")
                                       | F.lower(F.col('address')).like("%karnataka%"))
sparkDF.show(truncate=False)
+------+-----------------------+------+
|user  |address                |result|
+------+-----------------------+------+
|rishi |XYZ Bangalore Karnataka|true  |
|kirthi|ABC Pune India         |true  |
|tushar|ASD Orissa India       |true  |
+------+-----------------------+------+
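If the list of search terms grows, a single case-insensitive regex with rlike is a more compact alternative; a sketch assuming the same sparkDF:
# (?i) makes the Java regex case-insensitive, so no lower() is needed
sparkDF = sparkDF.withColumn('result', F.col('address').rlike('(?i)india|karnataka'))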
Changing values in a column based on a match
I have a pandas DataFrame which contains names of Brazilian universities, but sometimes I have these names in a short form or a long form (for example, the Universidade Federal do Rio de Janeiro is sometimes identified as UFRJ). The DataFrame looks like this:
| college                                |
|----------------------------------------|
| Universidade Federal do Rio de Janeiro |
| UFRJ                                   |
| Universidade de Sao Paulo              |
| USP                                    |
| Catholic University of Minas Gerais    |
And I have another one which has, in separate columns, the short name and the long name of SOME (not all) of those universities. It looks like this:
| long_name                              | short_name |
|----------------------------------------|------------|
| Universidade Federal do Rio de Janeiro | UFRJ       |
| Universidade de Sao Paulo              | USP        |
What I want is to substitute all short names with long names, so the college column of the first DataFrame would become:
| college                                |
|----------------------------------------|
| Universidade Federal do Rio de Janeiro |
| Universidade Federal do Rio de Janeiro |
| Universidade de Sao Paulo              |
| Universidade de Sao Paulo              |
| Catholic University of Minas Gerais    |  <-- note: this one has no match, so it stays the same
Is there a way to do that using pandas and numpy (or any other library)?
Use Series.map to substitute values from the second DataFrame; where there is no match you get missing values, so add Series.fillna -
df1['college'] = (df1['college'].map(df2.set_index('short_name')['long_name'])
                                .fillna(df1['college']))
print (df1)
                                  college
0  Universidade Federal do Rio de Janeiro
1  Universidade Federal do Rio de Janeiro
2               Universidade de Sao Paulo
3               Universidade de Sao Paulo
4     Catholic University of Minas Gerais
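An equivalent sketch using Series.replace with a plain dict built from the second DataFrame; values without a match are left unchanged, so no fillna is needed:
# Build a short-name -> long-name lookup and substitute in place
mapping = dict(zip(df2['short_name'], df2['long_name']))
df1['college'] = df1['college'].replace(mapping)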
Pandas: How to map the values of a Dataframe to another Dataframe?
I am totally new to Python and just learning with some use cases I have. I have two DataFrames: one where I need values in the Country column, and another holding the values in a column named Countries, which need to be mapped into the main DataFrame by matching against the column named Data. (Please accept my apology if this question has already been answered.)
Below is the main DataFrame:
Name Data                     | Country
----------------------------- | ---------
Arjun Kumar Reddy las Vegas   |
Divya london Khosla           |
new delhi Pragati Kumari      |
Will London Turner            |
Joseph Mascurenus Bombay      |
Jason New York Bourne         |
New york Vice Roy             |
Joseph Mascurenus new York    |
Peter Parker California       |
Bruce (istanbul) Wayne        |
Below is the reference DataFrame:
Data           | Countries
-------------- | ---------
las Vegas      | US
london         | UK
New Delhi      | IN
London         | UK
bombay         | IN
New York       | US
New york       | US
new York       | US
California     | US
istanbul       | TR
Moscow         | RS
Cape Town      | SA
And the result I want looks like this:
Name Data                     | Country
----------------------------- | ---------
Arjun Kumar Reddy las Vegas   | US
Divya london Khosla           | UK
new delhi Pragati Kumari      | IN
Will London Turner            | UK
Joseph Mascurenus Bombay      | IN
Jason New York Bourne         | US
New york Vice Roy             | US
Joseph Mascurenus new York    | US
Peter Parker California       | US
Bruce (istanbul) Wayne        | TR
Please note, the two DataFrames are not the same size. I thought of using map or fuzzywuzzy but couldn't really achieve the result.
Find the country key that matches in the reference DataFrame and extract it -
import re

regex = '(' + ')|('.join(ref_df['Data']) + ')'
df['key'] = df['Name Data'].str.extract(regex, flags=re.I).bfill(axis=1)[0]
>>> df
                     Name Data        key
0  Arjun Kumar Reddy las Vegas  las Vegas
1       Bruce (istanbul) Wayne   istanbul
2   Joseph Mascurenus new York   new York
>>> ref_df
        Data Country
0  las Vegas      US
1   new York      US
2   istanbul      TR
Merge both the DataFrames on the key extracted -
pd.merge(df, ref_df, left_on='key', right_on='Data')
                     Name Data        key       Data Country
0  Arjun Kumar Reddy las Vegas  las Vegas  las Vegas      US
1       Bruce (istanbul) Wayne   istanbul   istanbul      TR
2   Joseph Mascurenus new York   new York   new York      US
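One caveat worth noting: if any value in ref_df['Data'] contained regex metacharacters (parentheses, dots, and so on), building the pattern from the raw strings would misfire. A hedged variant that escapes each value first:
import re

# Escape each lookup value so characters like '(' or '.' match literally
regex = '(' + ')|('.join(re.escape(s) for s in ref_df['Data']) + ')'
df['key'] = df['Name Data'].str.extract(regex, flags=re.I).bfill(axis=1)[0]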
It looks like everything is sorted, so you can merge on index -
mdf.merge(rdf, left_index=True, right_index=True)
Pandas map (reorder/rename) columns using JSON template
I have a DataFrame like so:
|customer_key|order_id|subtotal|address        |
------------------------------------------------
|12345       |O12356  |123.45  |123 Road Street|
|10986       |945764  |70.00   |634 Road Street|
|32576       |678366  |29.95   |369 Road Street|
|67896       |198266  |837.69  |785 Road Street|
And I would like to reorder/rename the columns based on the following JSON, which contains the current column name and the desired column name:
{
    "customer_key": "cust_id",
    "order_id": "transaction_id",
    "address": "shipping_address",
    "subtotal": "subtotal"
}
to get the resulting DataFrame:
|cust_id|transaction_id|shipping_address|subtotal|
--------------------------------------------------
|12345  |O12356        |123 Road Street |123.45  |
|10986  |945764        |634 Road Street |70.00   |
|32576  |678366        |369 Road Street |29.95   |
|67896  |198266        |785 Road Street |837.69  |
Is this something that's possible? If it makes it easier, the order of the columns isn't critical.
For renaming and ordering you need to reindex after renaming -
df.rename(columns=d).reindex(columns=d.values())
or reindex first, then rename -
df.reindex(columns=d.keys()).rename(columns=d)
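A quick round trip on a one-row frame (hypothetical names taken from the question) shows both orders produce the same layout:
import pandas as pd

df = pd.DataFrame({'customer_key': [12345], 'order_id': ['O12356'],
                   'subtotal': [123.45], 'address': ['123 Road Street']})
d = {'customer_key': 'cust_id', 'order_id': 'transaction_id',
     'address': 'shipping_address', 'subtotal': 'subtotal'}

out = df.rename(columns=d).reindex(columns=list(d.values()))
# out.columns: cust_id, transaction_id, shipping_address, subtotal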
Map US state names to two-letter acronyms given in a separate dictionary
Suppose I have a DataFrame with two columns: State and City. I also have a separate dict with the two-letter acronym for each state. Now I want to add a third column mapping each state name to its two-letter acronym. What should I do in Python/pandas? For instance, the sample data is as follows:
import pandas as pd

a = pd.Series({'State': 'Ohio', 'City': 'Cleveland'})
b = pd.Series({'State': 'Illinois', 'City': 'Chicago'})
c = pd.Series({'State': 'Illinois', 'City': 'Naperville'})
d = pd.Series({'State': 'Ohio', 'City': 'Columbus'})
e = pd.Series({'State': 'Texas', 'City': 'Houston'})
f = pd.Series({'State': 'California', 'City': 'Los Angeles'})
g = pd.Series({'State': 'California', 'City': 'San Diego'})

state_city = pd.DataFrame([a, b, c, d, e, f, g])
state_2 = {'OH': 'Ohio', 'IL': 'Illinois', 'CA': 'California', 'TX': 'Texas'}
Now I have to map the State column in the DataFrame state_city using the dictionary state_2. The mapped DataFrame state_city should contain three columns: state, city, and state_2letter. The original dataset I had has multiple columns covering nearly all major US cities, so it would be inefficient to do this manually. Is there an easy way to do it?
For one, it's probably easier to store the key-value pairs as state name: abbreviation in your dictionary, like this:
state_2 = {'Ohio': 'OH', 'Illinois': 'IL', 'California': 'CA', 'Texas': 'TX'}
You can invert the original dictionary easily (note the swapped unpacking):
state_2 = {state: abbrev for abbrev, state in state_2.items()}
Then, using pandas.Series.map:
>>> state_city['abbrev'] = state_city['State'].map(state_2)
>>> state_city
          City       State abbrev
0    Cleveland        Ohio     OH
1      Chicago    Illinois     IL
2   Naperville    Illinois     IL
3     Columbus        Ohio     OH
4      Houston       Texas     TX
5  Los Angeles  California     CA
6    San Diego  California     CA
I do agree with @blacksite that the state_2 dictionary should map its values like this:
state_2 = {'Ohio': 'OH', 'Illinois': 'IL', 'California': 'CA', 'Texas': 'TX'}
Then, using pandas.Series.replace:
state_city['state_2letter'] = state_city.State.replace(state_2)
state_city
|   | State      | City        | state_2letter |
|---|------------|-------------|---------------|
| 0 | Ohio       | Cleveland   | OH            |
| 1 | Illinois   | Chicago     | IL            |
| 2 | Illinois   | Naperville  | IL            |
| 3 | Ohio       | Columbus    | OH            |
| 4 | Texas      | Houston     | TX            |
| 5 | California | Los Angeles | CA            |
| 6 | California | San Diego   | CA            |