Multiline regex: How to extract text between dates in pandas dataframe? - python

I have a dataframe with a description column. Each row's description holds multiple lines of text, which together form a set of updates for that record.
Example: for information no 1, at 07-01-2019 we got the update "The sky is blue" and at 05-22-2019 we got another update, "Apples are red"; each update sits between two dates. I would like to extract the text between the dates and split the details into new columns: date, name, description.
The raw description looks like
info no | Description
--------------------------------------------------------------------------
1       | 07-01-2019 12:59:41 - XYZ (Work notes) The sky is blue in color.
        | Clouds are looking lovely.
        | 05-22-2019 12:00:49 - MNX (Work notes) Apples are red in color.
--------------------------------------------------------------------------
2       | 02-26-2019 12:53:18 - ABC (Work notes) Task is to separate balls.
        | 02-25-2019 16:57:57 - lMN (Work notes) He came by train.
        | That train was 15 min late.
        | He missed the concert.
        | 02-25-2019 11:08:01 - sbc (Work notes) She is my grandmother.
Desired output is
info no | DATE                | NAME | DESCRIPTION
--------|---------------------------------------------------------
1       | 07-01-2019 12:59:41 | XYZ  | The sky is blue in color.
        |                     |      | Clouds are looking lovely.
--------|---------------------------------------------------------
1       | 05-22-2019 12:00:49 | MNX  | Apples are red in color.
--------|---------------------------------------------------------
2       | 02-26-2019 12:53:18 | ABC  | Task is to separate balls.
--------|---------------------------------------------------------
2       | 02-25-2019 16:57:57 | lMN  | He came by train.
        |                     |      | That train was 15 min late.
        |                     |      | He missed the concert.
--------|---------------------------------------------------------
2       | 02-25-2019 11:08:01 | sbc  | She is my grandmother.
I tried:
myDf = pd.DataFrame(re.split(r'(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2} -.*)', Description), columns=['date'])
myDf['date'] = myDf['date'].replace(r'\(Work notes\)', '-', regex=True)
newQueue = myDf.date.str.split('-', n=3)

Having this dataframe
df
Description
Sl No
1 07-01-2019 12:59:41 - XYZ (Work notes) The sky...
2 05-22-2019 12:00:49 - MNX (Work notes) Apples...
3 02-26-2019 12:53:18 - ABC (Work notes) Task is...
4 02-25-2019 16:57:57 - lMN (Work notes) He came...
5 02-25-2019 11:08:01 - sbc (Work notes) She is ...
you can split each string in the Description column at "(Work notes)" and then use values.tolist() to spread the pieces into two columns, as follows:
df['Description'] = df['Description'].apply(lambda s: s.split('(Work notes)'))
df = pd.DataFrame(df['Description'].values.tolist(), index=df.index)
print(df)
0 1
Sl No
1 07-01-2019 12:59:41 - XYZ The sky is blue in color.
2 05-22-2019 12:00:49 - MNX Apples are red in color.
3 02-26-2019 12:53:18 - ABC Task is to separate balls.
4 02-25-2019 16:57:57 - lMN He came by train.
5 02-25-2019 11:08:01 - sbc She is my grandmother.
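The split above assumes one note per row. If each record still holds several notes in a single multi-line string, as in the question, a regex with a lookahead for the next timestamp can pull out date, name and description in one pass. This is only a sketch: the sample frame, the column name info_no and the exact note layout ("date - NAME (Work notes) text") are assumptions based on the example in the question.

```python
import re
import pandas as pd

# Hypothetical sample mirroring the question: one multi-line string per record.
raw = pd.DataFrame({
    "info_no": [1, 2],
    "Description": [
        "07-01-2019 12:59:41 - XYZ (Work notes) The sky is blue in color.\n"
        "Clouds are looking lovely.\n"
        "05-22-2019 12:00:49 - MNX (Work notes) Apples are red in color.",
        "02-26-2019 12:53:18 - ABC (Work notes) Task is to separate balls.\n"
        "02-25-2019 16:57:57 - lMN (Work notes) He came by train.\n"
        "That train was 15 min late.\n"
        "He missed the concert.\n"
        "02-25-2019 11:08:01 - sbc (Work notes) She is my grandmother.",
    ],
})

stamp = r"\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}"
pattern = re.compile(
    rf"(?P<DATE>{stamp}) - (?P<NAME>\S+) \(Work notes\) "
    # Lazily take everything (including newlines) up to the next timestamp or the end.
    rf"(?P<DESCRIPTION>.*?)(?=\s*{stamp} - |\s*\Z)",
    flags=re.DOTALL,
)

rows = []
for info_no, text in zip(raw["info_no"], raw["Description"]):
    for match in pattern.finditer(text):
        rows.append({"info_no": info_no, **match.groupdict()})

notes = pd.DataFrame(rows)
print(notes)
```

Continuation lines stay inside DESCRIPTION, separated by newlines; DATE can be converted afterwards with pd.to_datetime(notes["DATE"], format="%m-%d-%Y %H:%M:%S") if real timestamps are needed.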

Related

Python pandas for manipulating text & inconsistent data

How can I take specific text from one column in pandas when the format is inconsistent? For example, like this:
Area | Owners
Bali Island: 4600 | John
Java Island:7200 | Van Hour
Hallo Island : 2400| Petra
and the format would be like this
Area | Owners | Area Number
Bali Island: 4600 | John | 4600
Java Island:7200 | Van Hour | 7200
Hallo Island : 2400| Petra | 2400
You could use str.extract:
df['Area Number'] = df['Area'].str.extract(r'(\d+)$', expand=False)
output:
Area Owners Area Number
0 Bali Island: 4600 John 4600
1 Java Island:7200 Van Hour 7200
2 Hallo Island : 2400 Petra 2400
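A self-contained version of the same idea, as a sketch: the frame below is rebuilt by hand from the question, and the optional trailing whitespace (\s*) and the integer conversion are additions that the original answer does not include.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Area": ["Bali Island: 4600", "Java Island:7200", "Hallo Island : 2400"],
    "Owners": ["John", "Van Hour", "Petra"],
})

# Capture the digits at the end of the string, tolerating trailing spaces,
# then convert the captured text to integers.
df["Area Number"] = df["Area"].str.extract(r"(\d+)\s*$", expand=False).astype(int)
print(df)
```

If some rows might not end in a number, the astype(int) step would fail on the resulting missing values, so in that case it is safer to leave the column as text or use pd.to_numeric.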

SAS Programming: How to replace missing values in multiple columns using one column?

Background
I have a large dataset in SAS that has 17 variables of which four are numeric and 13 character/string. The original dataset that I am using can be found here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data.
cylinders
condition
drive
paint_color
type
manufacturer
title_status
model
fuel
transmission
description
region
state
price (num)
posting_date (num)
odometer (num)
year (num)
After applying specific filters to the numeric columns, none of the numeric variables has missing values. However, each of the remaining 13 char/string variables still has thousands to hundreds of thousands of missing values.
Request
Similar to the blog post towards data science as shown here (https://towardsdatascience.com/end-to-end-data-science-project-predicting-used-car-prices-using-regression-1b12386c69c8), specifically under the Feature Engineering section, how can I write the equivalent SAS code where I use regex on the description column to fill missing values of the other string/char columns with categorical values such as cylinders, condition, drive, paint_color, and so on?
Here is the Python code from the blog post.
import re
manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover | lexus \
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
| hennessey)'
condition = '(excellent | good | fair | like new | salvage | new)'
fuel = '(gas | hybrid | diesel |electric)'
title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
transmission = '(automatic | manual)'
drive = '(4x4 | awd | fwd | rwd | 4wd)'
size = '(mid-size | full-size | compact | sub-compact)'
type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'
keys = ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders']
columns = [ manufacturer, condition, fuel, title_status, transmission ,drive, size, type_, paint_color, cylinders]
for i, column in zip(keys, columns):
    database[i] = database[i].fillna(
        database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()
database.drop('description', axis=1, inplace=True)
What would be the equivalent SAS code for the Python code shown above?
It's basically just doing a word search of sorts.
A simplified example in SAS:
data want;
    set have;
    *temporary arrays need an explicit size;
    array _fuel(4) $ 8 _temporary_ ("gas", "hybrid", "diesel", "electric");
    do i=1 to dim(_fuel);
        if find(description, _fuel(i), 'it')>0 then fuel = _fuel(i);
        *does not deal with multiple finds so the last one found will be kept;
    end;
run;
You can expand this by creating an array for each variable and then looping through your lists. I think you can replace the loop with a REGEX command as well in SAS but regex requires too much thinking so someone else will have to provide that answer.
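To check a SAS port against the Python behaviour, it may help to run the blog's fill step on a tiny frame first. This is a sketch under assumptions: the three-row frame is made up, and the pattern uses word boundaries (\b) instead of the space-padded alternatives from the blog post, so it is not a verbatim copy of that code.

```python
import re
import pandas as pd

# Made-up rows: one missing fuel that can be recovered, one existing value,
# and one missing fuel with nothing useful in the description.
database = pd.DataFrame({
    "description": [
        "Clean title, runs on DIESEL, automatic",
        "hybrid sedan, like new",
        "no details given",
    ],
    "fuel": [None, "gas", None],
})

fuel = r"\b(gas|hybrid|diesel|electric)\b"

# First match per description, or NaN when none of the words occurs.
extracted = database["description"].str.extract(fuel, flags=re.IGNORECASE, expand=False)

# Only missing values are filled; existing values win over the description.
database["fuel"] = database["fuel"].fillna(extracted).str.lower()
print(database)
```

Two differences from the SAS loop above are worth noting: the Python version keeps the first match rather than the last, and it never overwrites a value that is already present.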

Python program that reorganizes Excel formatting?

I am working on a Python program that aims to take Excel data that is vertical and make it horizontal.
For example, the data is shaped something like this:
County | State | Number | Date
Oakland | MI | 19 | 1/12/10
Oakland | MI | 32 | 1/19/10
Wayne | MI | 9 | 1/12/10
Wayne | MI | 6 | 1/19/10
But I want it like this (purposefully excluding the state):
County | 1/12/10 | 1/19/10
Oakland | 19 | 32
Wayne | 9 | 6
(And for the actual data, it’s quite long).
My logic so far:
Read in the Excel File
Loop through the counties
If county name is the same, place # in Row 1?
Make a new Excel File?
Any ideas of how to write this out? I think I am a little stuck on the syntax here.
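One possible way to do this is to let pandas reshape the table instead of looping over counties. The sketch below builds the sample rows by hand; with a real workbook the frame would come from pd.read_excel and the result would go back out with DataFrame.to_excel (the file names are up to you). It assumes each County/Date pair occurs only once.

```python
import pandas as pd

# Sample rows from the question; in practice: df = pd.read_excel(<your file>)
df = pd.DataFrame({
    "County": ["Oakland", "Oakland", "Wayne", "Wayne"],
    "State": ["MI", "MI", "MI", "MI"],
    "Number": [19, 32, 9, 6],
    "Date": ["1/12/10", "1/19/10", "1/12/10", "1/19/10"],
})

# One row per county, one column per date; State is simply left out.
wide = df.pivot(index="County", columns="Date", values="Number").reset_index()
wide.columns.name = None  # drop the leftover "Date" label on the column axis
print(wide)

# To write the result: wide.to_excel(<output file>, index=False)
```

If a county can have several rows for the same date, pivot raises an error; pivot_table with an aggregation such as aggfunc="sum" is the usual alternative in that case.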

How to clean a string to get value_counts for words of interest by date?

I have the following data generated from a groupby('Datetime') and value_counts()
Datetime 0
01/01/2020 Paul 8
03 2
01/02/2020 Paul 2
10982360967 1
01/03/2020 religion 3
..
02/28/2020 l 18
02/29/2020 Paul 78
march 22
03/01/2020 church 63
l 21
I would like to remove a specific name (in this case I would like to remove 'Paul') and all the numbers (03, 10982360967 in this specific example). I do not know why there is a character 'l' as I had tried to remove stopwords including alphabet (and numbers).
Do you know how I could further clean this selection?
Expected output to avoid confusion:
Datetime 0
01/03/2020 religion 3
..
02/29/2020 march 22
03/01/2020 church 63
I removed Paul, 03, 109..., and l.
Raw data:
Datetime Corpus
01/03/2020 Paul: examples of religion
01/03/2020 Paul:shinto is a religion 03
01/03/2020 don't talk to me about religion, Paul 03
...
02/29/2020 march is the third month of the year 10982360967
02/29/2020 during march, there are some cold days.
...
03/01/2020 she is at church right now
...
I cannot put all the raw data as I have more than 100 sentences.
The code I used is:
df.Corpus.groupby('Datetime').value_counts().groupby('Datetime').head(2)
Since I got a Key error, I had to edit the code as follows:
df.set_index('Datetime').Corpus.groupby('Datetime').value_counts().groupby('Datetime').head(2)
To extract the words I used str.extractall
Cleaning strings is a multi-step process
Create dataframe
import pandas as pd
from nltk.corpus import stopwords
import string
# data and dataframe
data = {'Datetime': ['01/03/2020', '01/03/2020', '01/03/2020', '02/29/2020', '02/29/2020', '03/01/2020'],
'Corpus': ['Paul: Examples of religion',
'Paul:shinto is a religion 03',
"don't talk to me about religion, Paul 03",
'march is the third month of the year 10982360967',
'during march, there are some cold days.',
'she is at church right now']}
test = pd.DataFrame(data)
test.Datetime = pd.to_datetime(test.Datetime)
| | Datetime | Corpus |
|---:|:--------------------|:-------------------------------------------------|
| 0 | 2020-01-03 00:00:00 | Paul: Examples of religion |
| 1 | 2020-01-03 00:00:00 | Paul:shinto is a religion 03 |
| 2 | 2020-01-03 00:00:00 | don't talk to me about religion, Paul 03 |
| 3 | 2020-02-29 00:00:00 | march is the third month of the year 10982360967 |
| 4 | 2020-02-29 00:00:00 | during march, there are some cold days. |
| 5 | 2020-03-01 00:00:00 | she is at church right now |
Clean Corpus
Add extra words to the remove_words list
They should be lowercase
Some cleaning steps could be combined, but I do not recommend that
Step-by-step makes it easier to determine if you've made a mistake
This is a small example of text cleaning.
There are entire books on the subject.
There's no context analysis
example = 'We march to the church in March.'
value_count for 'march' in example.lower() is 2
# words to remove
remove_words = list(stopwords.words('english'))
# extra words to remove
additional_remove_words = ['paul', 'shinto', 'examples', 'talk', 'third', 'month', 'year', 'cold', 'days', 'right']
remove_words.extend(additional_remove_words) # add other words to exclude in lowercase
# punctuation to remove
punctuation = string.punctuation
punc = r'[{}]'.format(punctuation)
test.dropna(inplace=True) # drop any na rows
# clean text now
test.Corpus = test.Corpus.str.replace(r'\d+', '', regex=True) # remove numbers
test.Corpus = test.Corpus.str.replace(punc, ' ', regex=True) # remove punctuation
test.Corpus = test.Corpus.str.replace(r'\s+', ' ', regex=True) # collapse runs of whitespace into a single space
test.Corpus = test.Corpus.str.strip() # remove whitespace from beginning and end of string
test.Corpus = test.Corpus.str.lower() # convert all to lowercase
test.Corpus = test.Corpus.apply(lambda x: list(word for word in x.split() if word not in remove_words)) # remove words
| | Datetime | Corpus |
|---:|:--------------------|:-------------|
| 0 | 2020-01-03 00:00:00 | ['religion'] |
| 1 | 2020-01-03 00:00:00 | ['religion'] |
| 2 | 2020-01-03 00:00:00 | ['religion'] |
| 3 | 2020-02-29 00:00:00 | ['march'] |
| 4 | 2020-02-29 00:00:00 | ['march'] |
| 5 | 2020-03-01 00:00:00 | ['church'] |
Explode Corpus & groupby
# explode list
test = test.explode('Corpus')
# dropna in case there are empty rows from filtering
test.dropna(inplace=True)
# groupby
test.groupby('Datetime').agg({'Corpus': 'value_counts'}).rename(columns={'Corpus': 'word_count'})
word_count
Datetime Corpus
2020-01-03 religion 3
2020-02-29 march 2
2020-03-01 church 1
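The counting step can also be tried on its own, without nltk. The sketch below starts from an already cleaned frame (typed in by hand to match the table above) and uses value_counts directly on the grouped column, which is an alternative to the agg call shown above rather than the same code.

```python
import pandas as pd

# Already-cleaned data, typed in to match the table after the cleaning steps.
cleaned = pd.DataFrame({
    "Datetime": pd.to_datetime(
        ["01/03/2020", "01/03/2020", "01/03/2020", "02/29/2020", "02/29/2020", "03/01/2020"]
    ),
    "Corpus": [["religion"], ["religion"], ["religion"], ["march"], ["march"], ["church"]],
})

counts = (
    cleaned.explode("Corpus")        # one row per remaining word
    .dropna(subset=["Corpus"])       # rows whose word list ended up empty
    .groupby("Datetime")["Corpus"]
    .value_counts()                  # count each word within its date
    .rename("word_count")
)
print(counts)
```

The result is a Series indexed by (Datetime, Corpus); counts.groupby("Datetime").head(2) would then keep the two most frequent words per date, as in the question.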

Phrase similarity from List

Hi assuming I have 2 lists:
names = ['Daniel', 'Mario', 'Mandy', 'Jolene', 'Fabio']
places = ['on top of the table', 'France', 'valley of the kings']
and a dataframe with some sentences
ex:
DataframeOrig
Index | Sent
0 | Mandy went to France on the Eiffel Tower
1 | Daniele was dancing on top of the box
2 | I am eating on top of the table
3 | Maria went to the valley of the kings
I would like to use a distance metric like difflib to scan the sentences and compare phrases to the list having a determined offset. Hopefully the result of this would be:
Index | Sent | Result
0 | Mandy went to France on the Eiffel Tower | Mandy
1 | Daniele was dancing on top of the box | Daniel
2 | I am eating on top of the table | on top of the table
3 | Maria went to the valley of the kings | Mario, valley of the kings
How would you go about it without using loads of loops to get phrase matches?
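One possible starting point, as a sketch only: slide a window with as many words as the candidate phrase over each sentence and let difflib.get_close_matches decide whether any window is close enough. The helper name fuzzy_hits and the cutoff of 0.8 are assumptions; the cutoff is a tuning knob, so the output will not necessarily match the desired table exactly. The loops are still there, just hidden inside the helper and applied per row.

```python
import difflib
import pandas as pd

names = ['Daniel', 'Mario', 'Mandy', 'Jolene', 'Fabio']
places = ['on top of the table', 'France', 'valley of the kings']

df = pd.DataFrame({"Sent": [
    "Mandy went to France on the Eiffel Tower",
    "Daniele was dancing on top of the box",
    "I am eating on top of the table",
    "Maria went to the valley of the kings",
]})

def fuzzy_hits(sentence, candidates, cutoff=0.8):
    """Return the candidates that approximately occur in the sentence (hypothetical helper)."""
    words = sentence.split()
    hits = []
    for cand in candidates:
        n = len(cand.split())
        # Every run of n consecutive words in the sentence.
        windows = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
        if difflib.get_close_matches(cand, windows, n=1, cutoff=cutoff):
            hits.append(cand)
    return ", ".join(hits)

df["Result"] = df["Sent"].apply(fuzzy_hits, candidates=names + places)
print(df)
```

Character-level similarity cannot tell a wanted near-match from an unwanted one when the phrases differ by only a word, so stricter or per-list cutoffs (one for names, one for places) may be needed.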
