I have a pyspark dataframe
I want to check each row for the address column and if it contains the substring "india"
then I need to add another column and say true
else false
and also i wanted to check the substring is present in the column value string if yes print yes else no.. this has to iterate for all the rows in dataframe.
like:
if "india" or "karnataka" is in sparkDF["address"]:
print("yes")
else:
print("no")
I'm getting the wrong results as it's checking for each character instead of the substring. How to achieve this?
How to achieve this?
I wasn't able to achieve this
You can utilise contains or like for this
Data Preparation
s = StringIO("""
user,address
rishi,XYZ Bangalore Karnataka
kirthi,ABC Pune India
tushar,ASD Orissa India
"""
)
df = pd.read_csv(s,delimiter=',')
sparkDF = sql.createDataFrame(df)
sparkDF.show()
+------+-----------------------+
|user |address |
+------+-----------------------+
|rishi |XYZ Bangalore Karnataka|
|kirthi|ABC Pune India |
|tushar|ASD Orissa India |
+------+-----------------------+
Contains
sparkDF = sparkDF.withColumn('result',F.lower(F.col('address')).contains("india"))
sparkDF.show(truncate=False)
+------+-----------------------+------+
|user |address |result|
+------+-----------------------+------+
|rishi |XYZ Bangalore Karnataka|false |
|kirthi|ABC Pune India |true |
|tushar|ASD Orissa India |true |
+------+-----------------------+------+
Like - Multiple Search Patterns
sparkDF = sparkDF.withColumn('result',F.lower(F.col('address')).like("%india%")
| F.lower(F.col('address')).like("%karnataka%")
)
sparkDF.show(truncate=False)
+------+-----------------------+------+
|user |address |result|
+------+-----------------------+------+
|rishi |XYZ Bangalore Karnataka|true |
|kirthi|ABC Pune India |true |
|tushar|ASD Orissa India |true |
+------+-----------------------+------+
Related
how i take specific text from one column in python pandas but inconsistent format for example like this
Area | Owners
Bali Island: 4600 | John
Java Island:7200 | Van Hour
Hallo Island : 2400| Petra
and the format would be like this
Area | Owners | Area Number
Bali Island: 4600 | John | 4600
Java Island:7200 | Van Hour | 7200
Hallo Island : 2400| Petra | 2400
You could use str.extract:
df['Area Number'] = df['Area'].str.extract('(\d+)$')
output:
Area Owners Area Number
0 Bali Island: 4600 John 4600
1 Java Island:7200 Van Hour 7200
2 Hallo Island : 2400 Petra 2400
movies
| Movies | Release Date |
| -------- | -------------- |
| Star Wars: Episode VII - The Force Awakens (2015) | December 16, 2015 |
| Avengers: Endgame (2019 | April 24, 2019 |
I am trying to have a new column and use split to have the year.
import pandas as pd
df = pd.DataFrame({'Movies': ['Star Wars: Episode VII - The Force Awakens (2015)', 'Avengers: Endgame (2019'],
'Release Date': ['December 16, 2015', 'April 24, 2019' ]})
movies["year"]=0
movies["year"]= movies["Release Date"].str.split(",")[1]
movies["year"]
TO BE
| Movies | year |
| -------- | -------------- |
| Star Wars: Episode VII - The Force Awakens (2015) | 2015 |
| Avengers: Endgame (2019) | 2019 |
BUT
> ValueError: Length of values does not match length of index
Using str.extract we can target the 4 digit year:
df["year"] = df["Release Date"].str.extract(r'\b(\d{4})\b')
Explanation
movies["Release Date"].str.split(",") returns a series of of the lists returns by split()
movies["Release Date"].str.split(",")[1] return the second element of this series.
This is obviouly not what you want.
Solutions
Keep using pandas.str.split. but then a function that gets the 2nd item of the series rows for example:
movies["Release Date"].str.split(",").map(lambda x: x[1])
Do something different as suggestted by #Tim Bielgeleisen
Background
I have a large dataset in SAS that has 17 variables of which four are numeric and 13 character/string. The original dataset that I am using can be found here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data.
cylinders
condition
drive
paint_color
type
manufacturer
title_status
model
fuel
transmission
description
region
state
price (num)
posting_date (num)
odometer (num)
year (num)
After applying specific filters to the numeric columns, there are no missing values for each numeric variable. However, there are thousands to hundreds of thousands of missing variables for the remaining 14 char/string variables.
Request
Similar to the blog post towards data science as shown here (https://towardsdatascience.com/end-to-end-data-science-project-predicting-used-car-prices-using-regression-1b12386c69c8), specifically under the Feature Engineering section, how can I write the equivalent SAS code where I use regex on the description column to fill missing values of the other string/char columns with categorical values such as cylinders, condition, drive, paint_color, and so on?
Here is the Python code from the blog post.
import re
manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover | lexus \
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
| hennessey)'
condition = '(excellent | good | fair | like new | salvage | new)'
fuel = '(gas | hybrid | diesel |electric)'
title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
transmission = '(automatic | manual)'
drive = '(4x4 | awd | fwd | rwd | 4wd)'
size = '(mid-size | full-size | compact | sub-compact)'
type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'
keys = ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders']
columns = [ manufacturer, condition, fuel, title_status, transmission ,drive, size, type_, paint_color, cylinders]
for i,column in zip(keys,columns):
database[i] = database[i].fillna(
database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()
database.drop('description', axis=1, inplace= True)
What would be the equivalent SAS code for the Python code shown above?
It's basically just doing a word search of sorts.
A simplified example in SAS:
data want;
set have;
array _fuel(*) $ _temporary_ ("gas", "hybrid", "diesel", "electric");
do i=1 to dim(_fuel);
if find(description, _fuel(i), 'it')>0 then fuel = _fuel(i);
*does not deal with multiple finds so the last one found will be kept;
end;
run;
You can expand this by creating an array for each variable and then looping through your lists. I think you can replace the loop with a REGEX command as well in SAS but regex requires too much thinking so someone else will have to provide that answer.
I have data that looks like this:
service | company
--------------------
sequencing| Fischer
RNA tests | Fischer
Cell tests| 23andMe
consulting| UCLA
DNA tests | UCLA
mouse test| UCLA
and I want to concat services together into a list on equal company names like this:
service_list | company
-------------------------------------------------
['sequencing','RNA tests'] | Fischer
['Cell tests'] | 23andMe
['consulting','DNA tests','mouse test']| UCLA
Not sure how to begin doing this.
Lets try groupby(), aggregate to list
df.groupby('company').service.agg(list).reset_index()
company service
0 23andMe [Celltests]
1 Fischer [sequencing, RNAtests]
2 UCLA [consulting, DNAtests, mousetest]
I have a data frame like so:
|customer_key|order_id|subtotal|address |
------------------------------------------------
|12345 |O12356 |123.45 |123 Road Street|
|10986 |945764 |70.00 |634 Road Street|
|32576 |678366 |29.95 |369 Road Street|
|67896 |198266 |837.69 |785 Road Street|
And I would like to reorder/rename the columns based on the following JSON that contains the current column name and the desired column name:
{
"customer_key": "cust_id",
"order_id": "transaction_id",
"address": "shipping_address",
"subtotal": "subtotal"
}
to have the resulting Dataframe:
|cust_id|transaction_id|shipping_address|subtotal|
--------------------------------------------------
|12345 |O12356 |123 Road Street |123.45 |
|10986 |945764 |634 Road Street |70.00 |
|32576 |678366 |369 Road Street |29.95 |
|67896 |198266 |785 Road Street |837.69 |
is this something that's possible? if it makes it easier, the order of the columns isn't critical.
For renaming and ordering you would need to reindex after renaming
df.rename(columns=d).reindex(columns=d.values())
or:
df.reindex(columns=d.keys()).rename(columns=d)