I have a PySpark DataFrame.
I want to check the address column of each row, and if it contains the substring "india"
then I need to add another column and say true,
else false.
I also want to check whether the substring is present in the column value string, printing "yes" if it is and "no" otherwise, iterating over all the rows in the DataFrame.
Like:
if "india" or "karnataka" is in sparkDF["address"]:
    print("yes")
else:
    print("no")
I'm getting the wrong results, as it's checking each character instead of the whole substring. How can I achieve this?
You can use contains or like for this.
Data Preparation
from io import StringIO

import pandas as pd
from pyspark.sql import functions as F

s = StringIO("""
user,address
rishi,XYZ Bangalore Karnataka
kirthi,ABC Pune India
tushar,ASD Orissa India
"""
)
df = pd.read_csv(s, delimiter=',')
sparkDF = sql.createDataFrame(df)  # 'sql' is an existing SQLContext / SparkSession
sparkDF.show()
+------+-----------------------+
|user |address |
+------+-----------------------+
|rishi |XYZ Bangalore Karnataka|
|kirthi|ABC Pune India |
|tushar|ASD Orissa India |
+------+-----------------------+
Contains
sparkDF = sparkDF.withColumn('result',F.lower(F.col('address')).contains("india"))
sparkDF.show(truncate=False)
+------+-----------------------+------+
|user |address |result|
+------+-----------------------+------+
|rishi |XYZ Bangalore Karnataka|false |
|kirthi|ABC Pune India |true |
|tushar|ASD Orissa India |true |
+------+-----------------------+------+
Like - Multiple Search Patterns
sparkDF = sparkDF.withColumn(
    'result',
    F.lower(F.col('address')).like("%india%")
    | F.lower(F.col('address')).like("%karnataka%")
)
sparkDF.show(truncate=False)
+------+-----------------------+------+
|user |address |result|
+------+-----------------------+------+
|rishi |XYZ Bangalore Karnataka|true |
|kirthi|ABC Pune India |true |
|tushar|ASD Orissa India |true |
+------+-----------------------+------+
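Rlike - Regex Alternative
A single rlike with a regex alternation can also cover multiple search patterns in one call; a minimal sketch on the same lower-cased column:
sparkDF = sparkDF.withColumn(
    'result',
    F.lower(F.col('address')).rlike('india|karnataka')
)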
I have data that looks like this:
service | company
--------------------
sequencing| Fischer
RNA tests | Fischer
Cell tests| 23andMe
consulting| UCLA
DNA tests | UCLA
mouse test| UCLA
and I want to collect the services into a list for rows with the same company name, like this:
service_list | company
-------------------------------------------------
['sequencing','RNA tests'] | Fischer
['Cell tests'] | 23andMe
['consulting','DNA tests','mouse test']| UCLA
Not sure how to begin doing this.
Let's try groupby() and aggregate to a list:
df.groupby('company').service.agg(list).reset_index()
   company                              service
0  23andMe                         [Cell tests]
1  Fischer              [sequencing, RNA tests]
2     UCLA  [consulting, DNA tests, mouse test]
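If you also want the column named service_list as in the desired output, named aggregation can rename in the same step (a small sketch, assuming pandas >= 0.25):
out = df.groupby('company', as_index=False).agg(service_list=('service', list))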
I have two PySpark DataFrames. One contains a FullAddress field (say col1), and the other contains the name of the city/town/suburb in one of its columns (say col2). I want to compare col2 with col1 and return col2 if there is a match.
Additionally, the suburb field may contain a list of suburb names.
The first DataFrame below contains the suburb base data; the second contains the full addresses.
+--------+--------+----------------------------------------------------------+
|Postcode|District|City/ Town/ Suburb |
+--------+--------+----------------------------------------------------------+
|2000 |Sydney |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks |
|2001 |Sydney |Sydney |
|2113 |Sydney |North Ryde |
+--------+--------+----------------------------------------------------------+
+-----------------------------------------------------------+
|FullAddress |
+-----------------------------------------------------------+
|BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |
| HAY STREET HAYMARKET 2000, NSW, Australia |
| SMART STREET FAIRFIELD 2165, NSW, Australia |
|CLARENCE STREET SYDNEY 2000, NSW, Australia |
+-----------------------------------------------------------+
I would like to have something like this
+---------------------------------------------+-----------+
|FullAddress                                  |suburb     |
+---------------------------------------------+-----------+
|BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |NORTH RYDE |
|HAY STREET HAYMARKET 2000, NSW, Australia    |HAYMARKET  |
|SMART STREET FAIRFIELD 2165, NSW, Australia  |NULL       |
|CLARENCE STREET SYDNEY 2000, NSW, Australia  |SYDNEY     |
+---------------------------------------------+-----------+
There are two DataFrames -
DataFrame 1: DataFrame containing the complete address.
DataFrame 2: DataFrame containing the base data - Postcode, District & City / Town / Suburb.
The aim of the problem is to extract the appropriate suburb for DataFrame 1 from DataFrame 2. Though the OP has not explicitly specified a key on which to join the two DataFrames, Postcode seems to be the only reasonable choice.
# Importing requisite functions
from pyspark.sql.functions import col,regexp_extract,split,udf
from pyspark.sql.types import StringType
Let's create DataFrame 1 as df. In this DataFrame we need to extract the Postcode. In Australia all postcodes are four digits long, so we use regexp_extract() to pull the 4-digit number out of the string column.
df = sqlContext.createDataFrame([('BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia ',),
('HAY STREET HAYMARKET 2000, NSW, Australia',),
('SMART STREET FAIRFIELD 2165, NSW, Australia',),
('CLARENCE STREET SYDNEY 2000, NSW, Australia',)],
('FullAddress',))
df = df.withColumn('Postcode', regexp_extract('FullAddress', "(\\d{4})" , 1 ))
df.show(truncate=False)
+---------------------------------------------+--------+
|FullAddress |Postcode|
+---------------------------------------------+--------+
|BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |2113 |
|HAY STREET HAYMARKET 2000, NSW, Australia |2000 |
|SMART STREET FAIRFIELD 2165, NSW, Australia |2165 |
|CLARENCE STREET SYDNEY 2000, NSW, Australia |2000 |
+---------------------------------------------+--------+
Now that we have extracted the Postcode, we have the key to join the two DataFrames. Let's create DataFrame 2, from which we need to extract the respective suburb.
df_City_Town_Suburb = sqlContext.createDataFrame([(2000,'Sydney','Dawes Point, Haymarket, Millers Point, Sydney, The Rocks'),
(2001,'Sydney','Sydney'),(2113,'Sydney','North Ryde')],
('Postcode','District','City_Town_Suburb'))
df_City_Town_Suburb.show(truncate=False)
+--------+--------+--------------------------------------------------------+
|Postcode|District|City_Town_Suburb |
+--------+--------+--------------------------------------------------------+
|2000 |Sydney |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
|2001 |Sydney |Sydney |
|2113 |Sydney |North Ryde |
+--------+--------+--------------------------------------------------------+
Joining the two DataFrames with a left join -
df = df.join(df_City_Town_Suburb.select('Postcode','City_Town_Suburb'), ['Postcode'],how='left')
df.show(truncate=False)
+--------+---------------------------------------------+--------------------------------------------------------+
|Postcode|FullAddress |City_Town_Suburb |
+--------+---------------------------------------------+--------------------------------------------------------+
|2113 |BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |North Ryde |
|2165 |SMART STREET FAIRFIELD 2165, NSW, Australia |null |
|2000 |HAY STREET HAYMARKET 2000, NSW, Australia |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
|2000 |CLARENCE STREET SYDNEY 2000, NSW, Australia |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
+--------+---------------------------------------------+--------------------------------------------------------+
Splitting the City_Town_Suburb column into an array using the split() function -
df = df.select('Postcode', 'FullAddress',
               split(col("City_Town_Suburb"), r",\s*").alias("City_Town_Suburb"))
df.show(truncate=False)
+--------+---------------------------------------------+----------------------------------------------------------+
|Postcode|FullAddress |City_Town_Suburb |
+--------+---------------------------------------------+----------------------------------------------------------+
|2113 |BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |[North Ryde] |
|2165 |SMART STREET FAIRFIELD 2165, NSW, Australia |null |
|2000 |HAY STREET HAYMARKET 2000, NSW, Australia |[Dawes Point, Haymarket, Millers Point, Sydney, The Rocks]|
|2000 |CLARENCE STREET SYDNEY 2000, NSW, Australia |[Dawes Point, Haymarket, Millers Point, Sydney, The Rocks]|
+--------+---------------------------------------------+----------------------------------------------------------+
Finally, create a UDF that checks each element of the City_Town_Suburb array against the FullAddress column. If a matching element exists, it is returned immediately; otherwise None is returned.
def suburb(FullAddress, City_Town_Suburb):
    # Guard against rows where the left join produced no array, otherwise we would get an error
    if City_Town_Suburb is None:
        return None
    # Check every array element against FullAddress; the first match is returned immediately
    for sub in City_Town_Suburb:
        if sub.strip().upper() in FullAddress:
            return sub.upper()
    return None

suburb_udf = udf(suburb, StringType())
Applying this UDF -
df = df.withColumn('suburb', suburb_udf(col('FullAddress'),col('City_Town_Suburb'))).drop('City_Town_Suburb')
df.show(truncate=False)
+--------+---------------------------------------------+----------+
|Postcode|FullAddress |suburb |
+--------+---------------------------------------------+----------+
|2113 |BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |NORTH RYDE|
|2165 |SMART STREET FAIRFIELD 2165, NSW, Australia |null |
|2000 |HAY STREET HAYMARKET 2000, NSW, Australia |HAYMARKET |
|2000 |CLARENCE STREET SYDNEY 2000, NSW, Australia |SYDNEY |
+--------+---------------------------------------------+----------+
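As an aside, on Spark 2.4+ the Python UDF can be avoided entirely with a SQL higher-order function. A minimal sketch, assuming the same df with the City_Town_Suburb array column and default (non-ANSI) behavior, where indexing an empty array yields null:
from pyspark.sql import functions as F

# filter() keeps the array elements contained in FullAddress; [0] takes the first (null if none)
df = df.withColumn(
    'suburb',
    F.upper(F.expr(
        "filter(City_Town_Suburb, s -> instr(FullAddress, upper(trim(s))) > 0)[0]"
    ))
)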
Currently, I have imported the following data frame from Excel into pandas, and I want to delete duplicate rows based on the values of two columns.
# Python 3.5.2
# Pandas library version 0.22
import pandas as pd
# Save the Excel workbook in a variable
current_workbook = pd.ExcelFile('C:\\Users\\userX\\Desktop\\cost_values.xlsx')
# convert the workbook to a data frame
current_worksheet = pd.read_excel(current_workbook, index_col = 'vend_number')
# current output
print(current_worksheet)
vend_number  vend_name             quantity    source
CHARLS       Charlie & Associates  $5,700.00   Central
CHARLS       Charlie & Associates  $5,700.00   South
CHARLS       Charlie & Associates  $5,700.00   North
CHARLS       Charlie & Associates  $5,700.00   West
HUGHES       Hughinos              $3,800.00   Central
HUGHES       Hughinos              $3,800.00   South
FERNAS       Fernanda Industries   $3,500.00   South
FERNAS       Fernanda Industries   $3,500.00   North
FERNAS       Fernanda Industries   $3,000.00   West
....
What I want is to remove the duplicate rows based on the quantity and source columns:
Review the quantity and source column values:
1.1. If a vendor has several rows with the same quantity and one of them has source Central, drop the repeated rows for that vendor and keep only the Central row.
1.2. Else, if a vendor has several rows with the same quantity and none of them has source Central, drop the repeated rows and keep just one.
Desired result
vend_number  vend_name             quantity    source
CHARLS       Charlie & Associates  $5,700.00   Central
HUGHES       Hughinos              $3,800.00   Central
FERNAS       Fernanda Industries   $3,500.00   South
FERNAS       Fernanda Industries   $3,000.00   West
....
So far, I have tried the following code but pandas is not even detecting any duplicate rows.
print(current_worksheet.loc[current_worksheet.duplicated()])
print(current_worksheet.duplicated())
I have tried to figure out the solution but I am struggling quite a bit with this problem, so any help is greatly appreciated. Feel free to improve the question.
Here is one way.
df['CentralFlag'] = (df['source'] == 'Central')
df = df.sort_values('CentralFlag', ascending=False)\
       .drop_duplicates(['vend_name', 'quantity'])\
       .drop(columns='CentralFlag')

#   vend_number  vend_name             quantity   source
# 0 CHARLS       Charlie & Associates  $5,700.00  Central
# 4 HUGHES       Hughinos              $3,800.00  Central
# 6 FERNAS       Fernanda Industries   $3,500.00  South
# 8 FERNAS       Fernanda Industries   $3,000.00  West
Explanation
Create a boolean flag column and sort by it descending, so Central rows come first. (Plain duplicated() found nothing because the source column differs across otherwise-identical rows, which is why we deduplicate on a subset of columns instead.)
Drop duplicates on vend_name and quantity, keeping the first occurrence (i.e. the Central row when present), then drop the flag column.
You can do it in two steps:
s = df.loc[df['source'] == 'Central', :]                  # rows whose source is already Central
t = df.loc[~df['vend_number'].isin(s['vend_number']), :]  # vendors without any Central row
pd.concat([s, t.drop_duplicates(['vend_number', 'quantity'], keep='first')])
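On the sample data this should reproduce the desired result (indices carried over from the original frame; spacing approximate):
#   vend_number  vend_name             quantity   source
# 0 CHARLS       Charlie & Associates  $5,700.00  Central
# 4 HUGHES       Hughinos              $3,800.00  Central
# 6 FERNAS       Fernanda Industries   $3,500.00  South
# 8 FERNAS       Fernanda Industries   $3,000.00  West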
Suppose now I have a dataframe with 2 columns: State and City.
I also have a separate dict with the two-letter abbreviation for each state. Now I want to add a third column mapping each state name to its two-letter abbreviation. What should I do in Python/pandas? For instance, the sample data is as follows:
import pandas as pd
a = pd.Series({'State': 'Ohio', 'City':'Cleveland'})
b = pd.Series({'State':'Illinois', 'City':'Chicago'})
c = pd.Series({'State':'Illinois', 'City':'Naperville'})
d = pd.Series({'State': 'Ohio', 'City':'Columbus'})
e = pd.Series({'State': 'Texas', 'City': 'Houston'})
f = pd.Series({'State': 'California', 'City': 'Los Angeles'})
g = pd.Series({'State': 'California', 'City': 'San Diego'})
state_city = pd.DataFrame([a,b,c,d,e,f,g])
state_2 = {'OH': 'Ohio','IL': 'Illinois','CA': 'California','TX': 'Texas'}
Now I have to map the State column in the df state_city using the state_2 dictionary. The mapped df state_city should contain three columns: state, city, and state_2letter.
The original dataset I had covers nearly all major US cities, so doing this manually would be inefficient. Is there an easy way to do it?
For one, it's probably easier to store the key-value pairs like state name: abbreviation in your dictionary, like this:
state_2 = {'Ohio': 'OH', 'Illinois': 'IL', 'California': 'CA', 'Texas': 'TX'}
You can invert your existing dictionary easily:
state_2 = {state: abbrev for abbrev, state in state_2.items()}
Using pandas.Series.map:
>>> state_city['abbrev'] = state_city['State'].map(state_2)
>>> state_city
City State abbrev
0 Cleveland Ohio OH
1 Chicago Illinois IL
2 Naperville Illinois IL
3 Columbus Ohio OH
4 Houston Texas TX
5 Los Angeles California CA
6 San Diego California CA
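One caveat: map returns NaN for any state missing from the dictionary, so on a larger dataset you may want a fallback. A small sketch that keeps the original state name when no abbreviation is known:
state_city['abbrev'] = state_city['State'].map(state_2).fillna(state_city['State'])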
I do agree with @blacksite that the state_2 dictionary should map its values like this:
state_2 = {'Ohio': 'OH','Illinois': 'IL','California': 'CA','Texas': 'TX'}
Then, using pandas.Series.replace:
state_city['state_2letter'] = state_city.State.replace(state_2)
state_city
|   | State      | City        | state_2letter |
|---|------------|-------------|---------------|
| 0 | Ohio       | Cleveland   | OH            |
| 1 | Illinois   | Chicago     | IL            |
| 2 | Illinois   | Naperville  | IL            |
| 3 | Ohio       | Columbus    | OH            |
| 4 | Texas      | Houston     | TX            |
| 5 | California | Los Angeles | CA            |
| 6 | California | San Diego   | CA            |
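A small practical difference between the two answers: replace leaves values without a dictionary entry unchanged, while map turns them into NaN:
state_city['State'].map(state_2)      # unmatched states become NaN
state_city['State'].replace(state_2)  # unmatched states keep their original value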