How to use split to create a new column - Python

movies
| Movies | Release Date |
| -------- | -------------- |
| Star Wars: Episode VII - The Force Awakens (2015) | December 16, 2015 |
| Avengers: Endgame (2019) | April 24, 2019 |
I am trying to add a new column and use split to get the year.
import pandas as pd
movies = pd.DataFrame({'Movies': ['Star Wars: Episode VII - The Force Awakens (2015)', 'Avengers: Endgame (2019)'],
                       'Release Date': ['December 16, 2015', 'April 24, 2019']})
movies["year"] = 0
movies["year"] = movies["Release Date"].str.split(",")[1]
movies["year"]
Expected output:
| Movies | year |
| -------- | -------------- |
| Star Wars: Episode VII - The Force Awakens (2015) | 2015 |
| Avengers: Endgame (2019) | 2019 |
But instead I get:
> ValueError: Length of values does not match length of index

Using str.extract we can target the 4-digit year:
movies["year"] = movies["Release Date"].str.extract(r'\b(\d{4})\b')

Explanation
movies["Release Date"].str.split(",") returns a series of of the lists returns by split()
movies["Release Date"].str.split(",")[1] return the second element of this series.
This is obviouly not what you want.
Solutions
Keep using pandas .str.split, but then map a function that picks the second item of each row's list, for example:
movies["Release Date"].str.split(",").map(lambda x: x[1].strip())
Do something different, as suggested by @Tim Bielgeleisen.
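For completeness, a minimal runnable sketch of the split-based fix, using the element-wise .str accessor instead of map (the DataFrame construction repeats the question's data):
import pandas as pd

movies = pd.DataFrame({'Movies': ['Star Wars: Episode VII - The Force Awakens (2015)',
                                  'Avengers: Endgame (2019)'],
                       'Release Date': ['December 16, 2015', 'April 24, 2019']})

# .str[1] picks the second piece of each row's list, element-wise
movies["year"] = movies["Release Date"].str.split(",").str[1].str.strip()
print(movies[["Movies", "year"]])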

Related

How to iterate through column values of a PySpark dataframe

I have a PySpark dataframe. I want to check the address column of each row: if it contains the substring "india", I need to add another column saying true, else false. In other words, for every row I want to check whether the substring is present in the column value and answer yes or no. Something like:
if "india" or "karnataka" is in sparkDF["address"]:
    print("yes")
else:
    print("no")
I'm getting the wrong results, as this checks individual characters instead of the substring. How can I achieve this?
You can utilise contains or like for this.
Data Preparation
from io import StringIO

import pandas as pd
from pyspark.sql import SparkSession, functions as F

sql = SparkSession.builder.getOrCreate()

s = StringIO("""
user,address
rishi,XYZ Bangalore Karnataka
kirthi,ABC Pune India
tushar,ASD Orissa India
"""
)

df = pd.read_csv(s, delimiter=',')
sparkDF = sql.createDataFrame(df)
sparkDF.show(truncate=False)
+------+-----------------------+
|user |address |
+------+-----------------------+
|rishi |XYZ Bangalore Karnataka|
|kirthi|ABC Pune India |
|tushar|ASD Orissa India |
+------+-----------------------+
Contains
sparkDF = sparkDF.withColumn('result',F.lower(F.col('address')).contains("india"))
sparkDF.show(truncate=False)
+------+-----------------------+------+
|user |address |result|
+------+-----------------------+------+
|rishi |XYZ Bangalore Karnataka|false |
|kirthi|ABC Pune India |true |
|tushar|ASD Orissa India |true |
+------+-----------------------+------+
Like - Multiple Search Patterns
sparkDF = sparkDF.withColumn('result', F.lower(F.col('address')).like("%india%")
                                       | F.lower(F.col('address')).like("%karnataka%"))
sparkDF.show(truncate=False)
+------+-----------------------+------+
|user |address |result|
+------+-----------------------+------+
|rishi |XYZ Bangalore Karnataka|true |
|kirthi|ABC Pune India |true |
|tushar|ASD Orissa India |true |
+------+-----------------------+------+
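As a variation that is not in the original answer, a single rlike with a regex alternation can replace the chained like calls; the inline (?i) flag makes the match case-insensitive:
sparkDF = sparkDF.withColumn('result', F.col('address').rlike('(?i)(india|karnataka)'))
sparkDF.show(truncate=False)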

Python pandas to manipulate text & inconsistent data

How do I take specific text from one column in Python pandas when the format is inconsistent? For example:
| Area | Owners |
| -------- | -------- |
| Bali Island: 4600 | John |
| Java Island:7200 | Van Hour |
| Hallo Island : 2400 | Petra |
and the result should look like this:
| Area | Owners | Area Number |
| -------- | -------- | -------------- |
| Bali Island: 4600 | John | 4600 |
| Java Island:7200 | Van Hour | 7200 |
| Hallo Island : 2400 | Petra | 2400 |
You could use str.extract:
df['Area Number'] = df['Area'].str.extract(r'(\d+)$')
output:
Area Owners Area Number
0 Bali Island: 4600 John 4600
1 Java Island:7200 Van Hour 7200
2 Hallo Island : 2400 Petra 2400
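A minimal runnable version for reference (the DataFrame construction is mine, mirroring the sample data); since str.extract returns strings, the cast to int is optional:
import pandas as pd

df = pd.DataFrame({'Area': ['Bali Island: 4600', 'Java Island:7200', 'Hallo Island : 2400'],
                   'Owners': ['John', 'Van Hour', 'Petra']})

# grab the trailing run of digits, then convert to an integer
df['Area Number'] = df['Area'].str.extract(r'(\d+)$', expand=False).astype(int)
print(df)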

SAS Programming: How to replace missing values in multiple columns using one column?

Background
I have a large dataset in SAS that has 17 variables of which four are numeric and 13 character/string. The original dataset that I am using can be found here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data.
cylinders
condition
drive
paint_color
type
manufacturer
title_status
model
fuel
transmission
description
region
state
price (num)
posting_date (num)
odometer (num)
year (num)
After applying specific filters to the numeric columns, there are no missing values for any numeric variable. However, there are thousands to hundreds of thousands of missing values in the remaining 13 char/string variables.
Request
Similar to the blog post towards data science as shown here (https://towardsdatascience.com/end-to-end-data-science-project-predicting-used-car-prices-using-regression-1b12386c69c8), specifically under the Feature Engineering section, how can I write the equivalent SAS code where I use regex on the description column to fill missing values of the other string/char columns with categorical values such as cylinders, condition, drive, paint_color, and so on?
Here is the Python code from the blog post.
import re
manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover | lexus \
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
| hennessey)'
condition = '(excellent | good | fair | like new | salvage | new)'
fuel = '(gas | hybrid | diesel |electric)'
title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
transmission = '(automatic | manual)'
drive = '(4x4 | awd | fwd | rwd | 4wd)'
size = '(mid-size | full-size | compact | sub-compact)'
type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'
keys = ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive', 'size', 'type', 'paint_color', 'cylinders']
columns = [manufacturer, condition, fuel, title_status, transmission, drive, size, type_, paint_color, cylinders]
for i, column in zip(keys, columns):
    database[i] = database[i].fillna(
        database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()
database.drop('description', axis=1, inplace=True)
What would be the equivalent SAS code for the Python code shown above?
It's basically just doing a word search of sorts.
A simplified example in SAS:
data want;
    set have;
    length fuel $10;
    * temporary arrays need an explicit dimension, and $10 fits the longest value;
    array _fuel(4) $10 _temporary_ ('gas', 'hybrid', 'diesel', 'electric');
    do i = 1 to dim(_fuel);
        if find(description, _fuel(i), 'it') > 0 then fuel = _fuel(i);
        * does not deal with multiple finds, so the last one found is kept;
    end;
    drop i;
run;
You can expand this by creating an array for each variable and then looping through your lists. I think you can replace the loop with a regex in SAS as well, but regex requires too much thinking, so someone else will have to provide that answer.

How to clean a string to get value_counts for words of interest by date?

I have the following data generated from a groupby('Datetime') and value_counts()
Datetime 0
01/01/2020 Paul 8
03 2
01/02/2020 Paul 2
10982360967 1
01/03/2020 religion 3
..
02/28/2020 l 18
02/29/2020 Paul 78
march 22
03/01/2020 church 63
l 21
I would like to remove a specific name ('Paul' in this case) and all the numbers (03 and 10982360967 in this example). I do not know why the character 'l' appears, as I had tried to remove stopwords, single letters, and numbers.
Do you know how I could further clean this selection?
Expected output to avoid confusion:
Datetime 0
01/03/2020 religion 3
..
02/29/2020 march 22
03/01/2020 church 63
I removed Paul, 03, 109..., and l.
Raw data:
Datetime Corpus
01/03/2020 Paul: examples of religion
01/03/2020 Paul:shinto is a religion 03
01/03/2020 don't talk to me about religion, Paul 03
...
02/29/2020 march is the third month of the year 10982360967
02/29/2020 during march, there are some cold days.
...
03/01/2020 she is at church right now
...
I cannot put all the raw data as I have more than 100 sentences.
The code I used is:
df.Corpus.groupby('Datetime').value_counts().groupby('Datetime').head(2)
Since I got a Key error, I had to edit the code as follows:
df.set_index('Datetime').Corpus.groupby('Datetime').value_counts().groupby('Datetime').head(2)
To extract the words I used str.extractall
Cleaning strings is a multi-step process
Create dataframe
import pandas as pd
from nltk.corpus import stopwords
import string
# data and dataframe
data = {'Datetime': ['01/03/2020', '01/03/2020', '01/03/2020', '02/29/2020', '02/29/2020', '03/01/2020'],
'Corpus': ['Paul: Examples of religion',
'Paul:shinto is a religion 03',
"don't talk to me about religion, Paul 03",
'march is the third month of the year 10982360967',
'during march, there are some cold days.',
'she is at church right now']}
test = pd.DataFrame(data)
test.Datetime = pd.to_datetime(test.Datetime)
| | Datetime | Corpus |
|---:|:--------------------|:-------------------------------------------------|
| 0 | 2020-01-03 00:00:00 | Paul: Examples of religion |
| 1 | 2020-01-03 00:00:00 | Paul:shinto is a religion 03 |
| 2 | 2020-01-03 00:00:00 | don't talk to me about religion, Paul 03 |
| 3 | 2020-02-29 00:00:00 | march is the third month of the year 10982360967 |
| 4 | 2020-02-29 00:00:00 | during march, there are some cold days. |
| 5 | 2020-03-01 00:00:00 | she is at church right now |
Clean Corpus
Add extra words to the remove_words list
They should be lowercase
Some cleaning steps could be combined, but I do not recommend that
Step-by-step makes it easier to determine if you've made a mistake
This is a small example of text cleaning.
There are entire books on the subject.
There's no context analysis; for example:
example = 'We march to the church in March.'
the value count for 'march' in example.lower() is 2, even though one is a verb and the other is a month.
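A quick illustration of that limitation (my own toy snippet):
import pandas as pd

example = 'We march to the church in March.'
# after lowercasing, the verb and the month both count as 'march'
words = pd.Series(example.lower().replace('.', '').split())
print(words.value_counts()['march'])  # 2
Now the cleaning itself: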
# words to remove
remove_words = list(stopwords.words('english'))
# extra words to remove
additional_remove_words = ['paul', 'shinto', 'examples', 'talk', 'third', 'month', 'year', 'cold', 'days', 'right']
remove_words.extend(additional_remove_words)  # add other words to exclude, in lowercase
# punctuation to remove
punctuation = string.punctuation
punc = r'[{}]'.format(punctuation)
test.dropna(inplace=True)  # drop any na rows
# clean text now
test.Corpus = test.Corpus.str.replace(r'\d+', '', regex=True)  # remove numbers
test.Corpus = test.Corpus.str.replace(punc, ' ', regex=True)  # remove punctuation
test.Corpus = test.Corpus.str.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace
test.Corpus = test.Corpus.str.strip()  # remove whitespace from beginning and end of string
test.Corpus = test.Corpus.str.lower()  # convert all to lowercase
test.Corpus = test.Corpus.apply(lambda x: [word for word in x.split() if word not in remove_words])  # remove words
| | Datetime | Corpus |
|---:|:--------------------|:-------------|
| 0 | 2020-01-03 00:00:00 | ['religion'] |
| 1 | 2020-01-03 00:00:00 | ['religion'] |
| 2 | 2020-01-03 00:00:00 | ['religion'] |
| 3 | 2020-02-29 00:00:00 | ['march'] |
| 4 | 2020-02-29 00:00:00 | ['march'] |
| 5 | 2020-03-01 00:00:00 | ['church'] |
Explode Corpus & groupby
# explode list
test = test.explode('Corpus')
# dropna incase there are empty rows from filtering
test.dropna(inplace=True)
# groupby
test.groupby('Datetime').agg({'Corpus': 'value_counts'}).rename(columns={'Corpus': 'word_count'})
word_count
Datetime Corpus
2020-01-03 religion 3
2020-02-29 march 2
2020-03-01 church 1
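To get back to the original goal of showing only the top two words per date, a short follow-up on the grouped counts (my addition):
# value_counts sorts descending within each date, so head(2) keeps the two most frequent words
counts = test.groupby('Datetime').Corpus.value_counts()
print(counts.groupby(level='Datetime').head(2))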

Phrase similarity from List

Hi, assuming I have 2 lists:
names = ['Daniel', 'Mario', 'Mandy', 'Jolene', 'Fabio']
places = ['on top of the table', 'France', 'valley of the kings']
and a dataframe with some sentences, e.g.:
DataframeOrig
| Index | Sent |
| -------- | -------- |
| 0 | Mandy went to France on the Eiffel Tower |
| 1 | Daniele was dancing on top of the box |
| 2 | I am eating on top of the table |
| 3 | Maria went to the valley of the kings |
I would like to use a distance metric like difflib to scan the sentences and match them against phrases from the lists, with some tolerance for misspellings. Hopefully the result of this would be:
| Index | Sent | Result |
| -------- | -------- | -------- |
| 0 | Mandy went to France on the Eiffel Tower | Mandy |
| 1 | Daniele was dancing on top of the box | Daniel |
| 2 | I am eating on top of the table | on top of the table |
| 3 | Maria went to the valley of the kings | Mario, valley of the kings |
How would you go about it without using loads of loops to get phrase matches?
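This question has no answer in this digest, so here is a minimal sketch of one approach: compare each candidate phrase against the word n-grams of the same length in every sentence with difflib.get_close_matches. The cutoff needs tuning; 0.8 keeps Mario matching Maria but also lets 'on top of the box' match 'on top of the table', and exact hits such as France in row 0 appear even though the asker's expected output omits them.
import difflib
import pandas as pd

names = ['Daniel', 'Mario', 'Mandy', 'Jolene', 'Fabio']
places = ['on top of the table', 'France', 'valley of the kings']
candidates = names + places

df = pd.DataFrame({'Sent': ['Mandy went to France on the Eiffel Tower',
                            'Daniele was dancing on top of the box',
                            'I am eating on top of the table',
                            'Maria went to the valley of the kings']})

def ngrams(words, n):
    # all runs of n consecutive words in the sentence
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

def match_phrases(sent, cutoff=0.8):
    words = sent.split()
    hits = []
    for cand in candidates:
        # compare the candidate against every n-gram of the same word length
        if difflib.get_close_matches(cand, ngrams(words, len(cand.split())), n=1, cutoff=cutoff):
            hits.append(cand)
    return ', '.join(hits)

df['Result'] = df['Sent'].apply(match_phrases)
print(df)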
