Hi assuming I have 2 lists:
names = ['Daniel', 'Mario', 'Mandy', 'Jolene', 'Fabio']
places = ['on top of the table', 'France', 'valley of the kings']
and a dataframe with some sentences
ex:
DataframeOrig
Index | Sent
0 | Mandy went to France on the Eiffel Tower
1 | Daniele was dancing on top of the box
2 | I am eating on top of the table
3 | Maria went to the valley of the kings
I would like to use a distance metric like difflib to scan the sentences and compare phrases to the list having a determined offset. Hopefully the result of this would be:
Index | Sent | Result
0 | Mandy went to France on the Eiffel Tower | Mandy
1 | Daniele was dancing on top of the box | Daniel
2 | I am eating on top of the table | on top of the table
3 | Maria went to the valley of the kings | Mario, valley of the kings
How would you go about it without using loads of loops to get phrase matches?
Related
how i take specific text from one column in python pandas but inconsistent format for example like this
Area | Owners
Bali Island: 4600 | John
Java Island:7200 | Van Hour
Hallo Island : 2400| Petra
and the format would be like this
Area | Owners | Area Number
Bali Island: 4600 | John | 4600
Java Island:7200 | Van Hour | 7200
Hallo Island : 2400| Petra | 2400
You could use str.extract:
df['Area Number'] = df['Area'].str.extract('(\d+)$')
output:
Area Owners Area Number
0 Bali Island: 4600 John 4600
1 Java Island:7200 Van Hour 7200
2 Hallo Island : 2400 Petra 2400
Background
I have a large dataset in SAS that has 17 variables of which four are numeric and 13 character/string. The original dataset that I am using can be found here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data.
cylinders
condition
drive
paint_color
type
manufacturer
title_status
model
fuel
transmission
description
region
state
price (num)
posting_date (num)
odometer (num)
year (num)
After applying specific filters to the numeric columns, there are no missing values for each numeric variable. However, there are thousands to hundreds of thousands of missing variables for the remaining 14 char/string variables.
Request
Similar to the blog post towards data science as shown here (https://towardsdatascience.com/end-to-end-data-science-project-predicting-used-car-prices-using-regression-1b12386c69c8), specifically under the Feature Engineering section, how can I write the equivalent SAS code where I use regex on the description column to fill missing values of the other string/char columns with categorical values such as cylinders, condition, drive, paint_color, and so on?
Here is the Python code from the blog post.
import re
manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover | lexus \
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
| hennessey)'
condition = '(excellent | good | fair | like new | salvage | new)'
fuel = '(gas | hybrid | diesel |electric)'
title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
transmission = '(automatic | manual)'
drive = '(4x4 | awd | fwd | rwd | 4wd)'
size = '(mid-size | full-size | compact | sub-compact)'
type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'
keys = ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders']
columns = [ manufacturer, condition, fuel, title_status, transmission ,drive, size, type_, paint_color, cylinders]
for i,column in zip(keys,columns):
database[i] = database[i].fillna(
database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()
database.drop('description', axis=1, inplace= True)
What would be the equivalent SAS code for the Python code shown above?
It's basically just doing a word search of sorts.
A simplified example in SAS:
data want;
set have;
array _fuel(*) $ _temporary_ ("gas", "hybrid", "diesel", "electric");
do i=1 to dim(_fuel);
if find(description, _fuel(i), 'it')>0 then fuel = _fuel(i);
*does not deal with multiple finds so the last one found will be kept;
end;
run;
You can expand this by creating an array for each variable and then looping through your lists. I think you can replace the loop with a REGEX command as well in SAS but regex requires too much thinking so someone else will have to provide that answer.
I am working on a Python program that aims to take Excel data that is vertical and make it horizontal.
For example, the data is shaped something like this:
County | State | Number | Date
Oakland | MI | 19 | 1/12/10
Oakland | MI | 32 | 1/19/10
Wayne | MI | 9 | 1/12/10
Wayne | MI | 6 | 1/19/10
But I want it like this (purposefully excluding the state):
County | 1/12/10 | 1/19/10
Oakland | 19 | 32
Wayne | 9 | 6
(And for the actual data, it’s quite long).
My logic so far:
Read in the Excel File
Loop through the counties
If county name is the same, place # in Row 1?
Make a new Excel File?
Any ideas of how to write this out? I think I am a little stuck on the syntax here.
I am totally new to Python and just learning with some use cases I have.
I have 2 Data Frames, one is where I need the values in the Country Column, and another is having the values in the column named 'Countries' which needs to be mapped in the main Data Frame referring to the column named 'Data'.
(Please accept my apology if this question has already been answered)
Below is the Main DataFrame:
Name Data | Country
----------------------------- | ---------
Arjun Kumar Reddy las Vegas |
Divya london Khosla |
new delhi Pragati Kumari |
Will London Turner |
Joseph Mascurenus Bombay |
Jason New York Bourne |
New york Vice Roy |
Joseph Mascurenus new York |
Peter Parker California |
Bruce (istanbul) Wayne |
Below is the Referenced DataFrame:
Data | Countries
-------------- | ---------
las Vegas | US
london | UK
New Delhi | IN
London | UK
bombay | IN
New York | US
New york | US
new York | US
California | US
istanbul | TR
Moscow | RS
Cape Town | SA
And what I want in the result will look like below:
Name Data | Country
----------------------------- | ---------
Arjun Kumar Reddy las Vegas | US
Divya london Khosla | UK
new delhi Pragati Kumari | IN
Will London Turner | UK
Joseph Mascurenus Bombay | IN
Jason New York Bourne | US
New york Vice Roy | US
Joseph Mascurenus new York | US
Peter Parker California | US
Bruce (istanbul) Wayne | TR
Please note, Both the dataframes are not same in size.
I though of using map or Fuzzywuzzy method but couldn't really achieved the result.
Find the country key that matches in the reference dataframe and extract it.
regex = '(' + ')|('.join(ref_df['Data']) + ')'
df['key'] = df['Name Data'].str.extract(regex, flags=re.I).bfill(axis=1)[0]
>>> df
Name Data key
0 Arjun Kumar Reddy las Vegas las Vegas
1 Bruce (istanbul) Wayne istanbul
2 Joseph Mascurenus new York new York
>>> ref_df
Data Country
0 las Vegas US
1 new York US
2 istanbul TR
Merge both the dataframes on key extracted.
pd.merge(df, ref_df, left_on='key', right_on='Data')
Name Data key Data Country
0 Arjun Kumar Reddy las Vegas las Vegas las Vegas US
1 Bruce (istanbul) Wayne istanbul istanbul TR
2 Joseph Mascurenus new York new York new York US
It looks like everything is sorted so you can merge on index
mdf.merge(rdf, left_index=True, right_index=True)
I have dataframe with description column, under one row of description there are multiple lines of texts, basically those are set of information for each record.
Example: Regarding information no 1 at 07-01-2019 we got update as the sky is blue and at 05-22-2019 we again got update as Apples are red, that are arranged between two dates. Firstly, I would like to extract text between the date and split the respective details in new columns as date, name, description.
The raw description looks like
info no| Description
--------------------------------------------------------------------------
1 |07-01-2019 12:59:41 - XYZ (Work notes) The sky is blue in color.
| Clouds are looking lovely.
| 05-22-2019 12:00:49 - MNX (Work notes) Apples are red in color.
--------------------------------------------------------------------------
| 02-26-2019 12:53:18 - ABC (Work notes) Task is to separate balls.
2 | 02-25-2019 16:57:57 - lMN (Work notes) He came by train.
| That train was 15 min late.
| He missed the concert.
| 02-25-2019 11:08:01 - sbc (Work notes) She is my grandmother.
Desired output is
info No |DATE | NAME | DESCRIPTION
--------|------------------------------------------------------
1 |07-01-2019 12:59:41 | xyz | The sky is blue in color.
| | | Clouds are looking lovely.
--------|---------------------------------------------------------
1 |05-22-2019 12:00:49 | MNX | Apples are red in color
--------|---------------------------------------------------------
2 | 02-26-2019 12:53:18 | ABC | Task is to separate blue balls.
--------|---------------------------------------------------------
2 | 02-25-2019 16:57:57 | IMN | He came by train
| | | That train was 15 min late.
| | | He missed the concert.
--------|---------------------------------------------------------
| 02-25-2019 11:08:01 | sbc | She is my grandmother.
I tried:
myDf = pd.DataFrame(re.split('(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2} -.*)',Description),columns = ['date'])
myDf['date'] = myDf['date'].replace('(Work notes)','-', regex=True)
newQueue = newQueue.date.str.split(-,n=3)
Having this dataframe
df
Description
Sl No
1 07-01-2019 12:59:41 - XYZ (Work notes) The sky...
2 05-22-2019 12:00:49 - MNX (Work notes) Apples...
3 02-26-2019 12:53:18 - ABC (Work notes) Task is...
4 02-25-2019 16:57:57 - lMN (Work notes) He came...
5 02-25-2019 11:08:01 - sbc (Work notes) She is ...
you can split the strings at the description column by "(Work notes)" and then you can use values.tolist to split it into 2 columns as follows:
x['Description']=x['Description'].apply(lambda x: x.split('(Work notes)'))
x=pd.DataFrame(x['Description'].values.tolist(), index= x.index)
print(x)
0 1
Sl No
1 07-01-2019 12:59:41 - XYZ The sky is blue in color.
2 05-22-2019 12:00:49 - MNX Apples are red in color.
3 02-26-2019 12:53:18 - ABC Task is to separate balls.
4 02-25-2019 16:57:57 - lMN He came by train.
5 02-25-2019 11:08:01 - sbc She is my grandmother.