Pandas - Matching reference number to find earliest date

I'm hoping to pick your brains on optimization. I am still learning Python and use it day to day in my position as an operations analyst. One of my tasks is to take approximately 60k unique record identifiers and look each one up in another dataframe of approximately 120k interaction records, which holds the employee who authored each interaction and the time it happened.
For reference, the two dataframes currently look like this:
main_data = Unique Identifier only
nok_data = Authored By Name, Unique Identifier (known as Case File Identifier), Note Text, Created On.
My current setup sorts through and matches about 2,500 rows per minute, so a full run takes approximately 25-30 minutes. What I am curious about is whether any of the steps I perform are:
Redundant or inefficient, slowing the overall process
A poor use of syntax to work around my lack of knowledge.
Below is my code:
import csv
import pandas as pd
nok_data = pd.read_csv("raw nok data.csv")  # Data set from warehouse
main_data = pd.read_csv("exampledata.csv")  # Data set taken from iTx ids from referral view
row_count = 0
error_count = 0
print(nok_data.columns.values.tolist())
print(main_data.columns.values.tolist())  # Commented out, used to grab header titles if needed.
data_length = len(main_data)  # used for counting how many records left.
earliest_nok = {}
nok_data["Created On"] = pd.to_datetime(nok_data["Created On"])  # convert all dates to datetime at beginning.
for row in main_data["iTx Case ID"]:
    list_data = []
    nok = nok_data["Case File Identifier"] == row
    matching_dates = nok_data[["Created On", "Authored By Name"]][nok == True]  # takes created on date only if nok shows row was true
    if len(matching_dates) > 0:
        try:
            min_dates = matching_dates.min(axis=0)  # column-wise minimum
            earliest_nok[row] = [min_dates["Created On"], min_dates["Authored By Name"]]
        except ValueError:
            error_count += 1
            earliest_nok[row] = None
    row_count += 1
    print("{} out of {} records".format(row_count, data_length))
with open('finaloutput.csv', 'w', newline='') as csv_file:  # text mode for the csv module on Python 3
    writer = csv.writer(csv_file)
    for key, value in earliest_nok.items():
        writer.writerow([key, value])
Looking for any advice or expertise from those who have been writing code like this much longer than I have. I appreciate all of you who even just took the time to read this. Happy Tuesday,
Andy M.
**** EDIT REQUESTED TO SHOW DATA
Sorry for my novice move there not including any data type.
main_data example
ITX Case ID
2017-023597
2017-023594
2017-023592
2017-023590
nok_data aka "raw nok data.csv"
| Authored By: | Case File Identifier: | Note Text: | Authored on |
|:-------------|:----------------------|:-----------|:------------|
|John Doe|2017-023594|Random Text|4/1/2017 13:24:35|
|John Doe|2017-023594|Random Text|4/1/2017 13:11:20|
|Jane Doe|2017-023590|Random Text|4/3/2017 09:32:00|
|Jane Doe|2017-023590|Random Text|4/3/2017 07:43:23|
|Jane Doe|2017-023590|Random Text|4/3/2017 7:41:00|
|John Doe|2017-023592|Random Text|4/5/2017 23:32:35|
|John Doe|2017-023592|Random Text|4/6/2017 00:00:35|

It looks like you want to group on the Case File Identifier and get the minimum date and corresponding author.
# Sort the data by `Case File Identifier:` and `Authored on` date
# so that you can easily get the author corresponding to the min date using `first`.
# This assumes `Authored on` has already been converted with pd.to_datetime,
# as in the question; dates left as strings would not sort chronologically.
nok_data.sort_values(['Case File Identifier:', 'Authored on'], inplace=True)
df = (
    nok_data[nok_data['Case File Identifier:'].isin(main_data['ITX Case ID'])]
    .groupby('Case File Identifier:')[['Authored on', 'Authored By:']].first()
)
# Use .iteritems() instead of .items() on Python 2.
d = {k: [v['Authored on'], v['Authored By:']] for k, v in df.to_dict('index').items()}
>>> d
{'2017-023590': ['4/3/17 7:41', 'Jane Doe'],
'2017-023592': ['4/5/17 23:32', 'John Doe'],
'2017-023594': ['4/1/17 13:11', 'John Doe']}
>>> df
Authored on Authored By:
Case File Identifier:
2017-023590 4/3/17 7:41 Jane Doe
2017-023592 4/5/17 23:32 John Doe
2017-023594 4/1/17 13:11 John Doe
It is probably easier to use df.to_csv(...).
The items from main_data['ITX Case ID'] where there is no matching record have been ignored but could be included if required.
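If the unmatched IDs should be kept, one option is a left merge. The following is a minimal sketch under assumptions, not the asker's actual data: the inline rows mirror the sample in the question, and the column names without trailing colons are hypothetical.

```python
import io
import pandas as pd

# Hypothetical inline data mirroring the samples in the question.
main_data = pd.DataFrame(
    {"ITX Case ID": ["2017-023597", "2017-023594", "2017-023592", "2017-023590"]}
)
nok_csv = io.StringIO(
    "Authored By,Case File Identifier,Note Text,Authored on\n"
    "John Doe,2017-023594,Random Text,4/1/2017 13:24:35\n"
    "John Doe,2017-023594,Random Text,4/1/2017 13:11:20\n"
    "Jane Doe,2017-023590,Random Text,4/3/2017 09:32:00\n"
    "Jane Doe,2017-023590,Random Text,4/3/2017 07:43:23\n"
    "Jane Doe,2017-023590,Random Text,4/3/2017 7:41:00\n"
    "John Doe,2017-023592,Random Text,4/5/2017 23:32:35\n"
    "John Doe,2017-023592,Random Text,4/6/2017 00:00:35\n"
)
nok_data = pd.read_csv(nok_csv)
nok_data["Authored on"] = pd.to_datetime(
    nok_data["Authored on"], format="%m/%d/%Y %H:%M:%S"
)

# idxmin gives the row label of the earliest timestamp within each case,
# so the author comes from the same row as the earliest date.
first_idx = nok_data.groupby("Case File Identifier")["Authored on"].idxmin()
earliest = nok_data.loc[
    first_idx, ["Case File Identifier", "Authored on", "Authored By"]
]

# A left merge keeps every ID from main_data; IDs without notes get NaT/NaN.
result = main_data.merge(
    earliest, how="left", left_on="ITX Case ID", right_on="Case File Identifier"
).drop(columns="Case File Identifier")
print(result)
```

Because the grouping is done once over the whole frame instead of once per ID, this avoids the per-row boolean mask that makes the original loop slow.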

Related

Python - Matching and extracting data from excel with pandas

I am working on a python script that automates some phone calls for me. I have a tool to test with that I can interact with REST API. I need to select a specific carrier based on which country code is entered. So let's say my user enters 12145221414 in my excel document, I want to choose AT&T as the carrier. How would I accept input from the first column of the table and then output what's in the 2nd column?
Obviously this can get a little tricky, since I would need to match up to 3-4 digits on the front of a phone number. My plan is to write a function that then takes the initial number and then plugs the carrier that needs to be used for that country.
Any idea how I could extract this data from the table? How would I make it so that if you entered Barbados (1246), then Lime is selected instead of AT&T?
Here's my code thus far and tables. I'm not sure how I can read one table and then pull data from that table to use for my matching function.
testlist.xlsx
| Number |
|:------------|
|8155555555|
|12465555555|
|12135555555|
|96655555555|
|525555555555|
carriers.xlsx
| countryCode | Carrier |
|:------------|:--------|
|1246|LIME|
|1|AT&T|
|81|Softbank|
|52|Telmex|
|966|Zain|
import pandas as pd
import os
FILE_PATH = "C:/temp/testlist.xlsx"
xl_1 = pd.ExcelFile(FILE_PATH)
num_df = xl_1.parse('Numbers')
FILE_PATH = "C:/temp/carriers.xlsx"
xl_2 = pd.ExcelFile(FILE_PATH)
car_df = xl_2.parse('Carriers')
for index, row in num_df.iterrows():

Any idea how I could extract this data from the table? How would I
make it so that if you entered Barbados (1246), then Lime is selected
instead of AT&T?
carriers.xlsx
| countryCode | Carrier |
|:------------|:--------|
|1246|LIME|
|1|AT&T|
|81|Softbank|
|52|Telmex|
|966|Zain|
script.py
import pandas as pd
FILE_PATH = "./carriers.xlsx"
df = pd.read_excel(FILE_PATH)
rows_list = df.to_dict('records')
code_carrier_map = {}
for row in rows_list:
    code_carrier_map[row["countryCode"]] = row["Carrier"]
print(type(code_carrier_map), code_carrier_map)
print(f"{code_carrier_map.get(1)=}")
print(f"{code_carrier_map.get(1246)=}")
print(f"{code_carrier_map.get(52)=}")
print(f"{code_carrier_map.get(81)=}")
print(f"{code_carrier_map.get(966)=}")
Output
$ python3 script.py
<class 'dict'> {1246: 'LIME', 1: 'AT&T', 81: 'Softbank', 52: 'Telmex', 966: 'Zain'}
code_carrier_map.get(1)='AT&T'
code_carrier_map.get(1246)='LIME'
code_carrier_map.get(52)='Telmex'
code_carrier_map.get(81)='Softbank'
code_carrier_map.get(966)='Zain'
Then if you want to parse phone numbers, don't reinvent the wheel, just use this phonenumbers library.
Code
import phonenumbers
num = "+12145221414"
phone_number = phonenumbers.parse(num)
print(f"{num=}")
print(f"{phone_number.country_code=}")
print(f"{code_carrier_map.get(phone_number.country_code)=}")
Output
num='+12145221414'
phone_number.country_code=1
code_carrier_map.get(phone_number.country_code)='AT&T'

Let's assume the following input:
>>> df1
Number
0 8155555555
1 12465555555
2 12135555555
3 96655555555
4 525555555555
>>> df2
countryCode Carrier
0 1246 LIME
1 1 AT&T
2 81 Softbank
3 52 Telmex
4 966 Zain
First we need to rework df2 a bit: sort countryCode in descending order, convert it to strings, and set it as the index.
The trick for later is the descending sort. It ensures that a longer country code, such as "1246", is matched before a shorter one like "1".
>>> df2 = df2.sort_values(by='countryCode', ascending=False).astype(str).set_index('countryCode')
>>> df2
Carrier
countryCode
1246 LIME
966 Zain
81 Softbank
52 Telmex
1 AT&T
Finally, we use a regex (here '1246|966|81|52|1' using '|'.join(df2.index)) made from the country codes in descending order to extract the longest code, and we map it to the carrier:
(df1.astype(str)['Number']
.str.extract('^(%s)'%'|'.join(df2.index))[0]
.map(df2['Carrier'])
)
output:
0 Softbank
1 LIME
2 AT&T
3 Zain
4 Telmex
Name: 0, dtype: object
NB. to add it to the initial dataframe:
df1['carrier'] = (df1.astype(str)['Number']
                  .str.extract('^(%s)'%'|'.join(df2.index))[0]
                  .map(df2['Carrier'])
                  )
output:
Number carrier
0 8155555555 Softbank
1 12465555555 LIME
2 12135555555 AT&T
3 96655555555 Zain
4 525555555555 Telmex
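For reference, here is a self-contained sketch of the same longest-prefix idea that can be run without the Excel files. The inline frames are assumptions copied from the tables in the question, and the names `numbers`, `carriers`, and `lookup` are hypothetical. It sorts the codes by string length rather than numeric value and escapes them before building the regex.

```python
import re
import pandas as pd

# Hypothetical stand-ins for testlist.xlsx and carriers.xlsx.
numbers = pd.DataFrame(
    {"Number": [8155555555, 12465555555, 12135555555, 96655555555, 525555555555]}
)
carriers = pd.DataFrame(
    {
        "countryCode": [1246, 1, 81, 52, 966],
        "Carrier": ["LIME", "AT&T", "Softbank", "Telmex", "Zain"],
    }
)

# Work with codes as strings so they can be matched as prefixes.
codes = carriers["countryCode"].astype(str)
lookup = pd.Series(carriers["Carrier"].to_numpy(), index=codes)

# Longest codes first, so "1246" is tried before "1" in the alternation.
alternatives = sorted(codes, key=len, reverse=True)
pattern = "^(" + "|".join(re.escape(code) for code in alternatives) + ")"

# Extract the matched prefix, then translate it to a carrier name.
prefix = numbers["Number"].astype(str).str.extract(pattern)[0]
numbers["carrier"] = prefix.map(lookup)
print(numbers)
```

Numbers that match no code end up with NaN in the carrier column, which makes missing entries in the carrier table easy to spot.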

If I understand it correctly, you just want to get the first characters from the input column (Number) and then match this with the second dataframe from carriers.xlsx.
Extract first characters of a Number column. Hint: The nbr_of_chars variable should be based on the maximum character length of the column countryCode in the carriers.xlsx
nbr_of_chars = 4
df.loc[df['Number'].notnull(), 'FirstCharsColumn'] = df['Number'].str[:nbr_of_chars]
Then the matching should be fairly easy with dataframe joins.

I can think only of an inefficient solution.
First, sort the data frame of carriers in the reverse alphabetical order of country codes. That way, longer prefixes will be closer to the beginning.
codes = xl_2.sort_values('countryCode', ascending=False)
Next, define a function that matches a number with each country code in the second data frame and finds the index of the first match, if any (remember, that match is the longest).
def cc2carrier(num):
    # Assumes numpy is imported as np and that both `num` and the country codes are strings.
    matches = codes['countryCode'].apply(lambda x: num.startswith(x))
    if not matches.any():  # Not found
        return np.nan
    return codes.loc[matches.idxmax()]['Carrier']
Now, apply the function to the numbers dataframe:
xl_1['Number'].apply(cc2carrier)
#1 Softbank
#2 LIME
#3 AT&T
#4 Zain
#5 Telmex
#Name: Number, dtype: object

Python categorize data in excel based on key words from another excel sheet

I have two excel sheets, one has four different types of categories with keywords listed. I am using Python to find the keywords in the review data and match them to a category. I have tried using pandas and data frames to compare but I get errors like "DataFrame objects are mutable, thus they cannot be hashed". I'm not sure if there is a better way but I am new to Pandas.
Here is an example:
Category sheet
| Service | Experience |
|:--------|:-----------|
|fast|bad|
|slow|easy|
Data Sheet
| Review # | Location | Review |
|:---------|:---------|:-------|
|1|New York|"The service was fast!"|
|2|Texas|"Overall it was a bad experience for me"|
For the examples above I would expect the following as a result.
I would expect review 1 to match the category Service because of the word "fast" and I would expect review 2 to match category Experience because of the word "bad". I do not expect the review to match every word in the category sheet, and it is fine if one review belongs to more than one category.
Here is my code, note I am using a simple example. In the example below I am trying to find the review data that would match the Customer Service list of keywords.
import pandas as pd
# List of Categories
cat = pd.read_excel("Categories_List.xlsx")
# Data being used
data = pd.read_excel("Data.xlsx")
# Data Frame for review column
reviews = pd.DataFrame(data["reviews"])
# Data Frame for Categories
cs = pd.DataFrame(cat["Customer Service"])
be = pd.DataFrame(cat["Billing Experience"])
net = pd.DataFrame(cat["Network"])
out = pd.DataFrame(cat["Outcome"])
for i in reviews:
    if cs in reviews:
        print("True")

One approach would be to build a regular expression from the cat frame:
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cat])
(?P<Service>fast|slow)|(?P<Experience>bad|easy)
Alternatively replace cat with a list of columns to test:
cols = ['Service']
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cols])
(?P<Service>fast|slow|quick)
Then to get matches use str.extractall and aggregate into summary + join to add back to the reviews frame:
Aggregated into List:
reviews = reviews.join(
reviews['Review'].str.extractall(exp).groupby(level=0).agg(
lambda g: list(g.dropna()))
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! [fast] [easy]
1 2 Texas Overall it was a bad experience for me [] [bad]
Aggregated into String:
reviews = reviews.join(
reviews['Review'].str.extractall(exp).groupby(level=0).agg(
lambda g: ', '.join(g.dropna()))
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! fast easy
1 2 Texas Overall it was a bad experience for me bad
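Alternatively, for an existence test, use any on level=0. The level argument of any was removed in pandas 2.0, so on current versions group on the index level instead:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).notna().groupby(level=0).any()
)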
Review # Location Review Service Experience
0 1 New York The service was fast and easy! True True
1 2 Texas Overall it was a bad experience for me False True
Or iteratively over the columns and with str.contains:
cols = cat.columns
for col in cols:
    reviews[col] = reviews['Review'].str.contains('|'.join(cat[col].dropna()))
Review # Location Review Service Experience
0 1 New York The service was fast and easy! True True
1 2 Texas Overall it was a bad experience for me False True
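A runnable sketch of the str.contains variant follows, using the two sample reviews from the question as hypothetical inline data. Two details here are my own additions rather than part of the answer above: keywords are escaped with re.escape, and word boundaries are added so that a keyword such as "fast" does not match inside "breakfast".

```python
import re
import pandas as pd

# Hypothetical inline copies of the category sheet and the data sheet.
cat = pd.DataFrame({"Service": ["fast", "slow"], "Experience": ["bad", "easy"]})
reviews = pd.DataFrame(
    {
        "Review #": [1, 2],
        "Location": ["New York", "Texas"],
        "Review": [
            "The service was fast!",
            "Overall it was a bad experience for me",
        ],
    }
)

for col in cat.columns:
    # Escape each keyword and wrap the alternation in word boundaries.
    keywords = [re.escape(word) for word in cat[col].dropna()]
    pattern = r"\b(?:" + "|".join(keywords) + r")\b"
    # One boolean column per category; a review may belong to several.
    reviews[col] = reviews["Review"].str.contains(pattern, case=False, regex=True)

print(reviews)
```

With this data, review 1 is flagged for Service only and review 2 for Experience only, which matches the result the question asks for.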

Creating a new column with concatenated values from another column

I am trying to create a new column in this data frame. The data set has multiple records for each PERSON because each record is a different account. The new column values should be a combination of the values for each PERSON in the TYPE column. For example, if John Doe has four accounts, the value next to his name in the new column should be a concatenation of the values in TYPE. An example of the final data frame is below. Thanks in advance.
[image: example of the final data frame]

You can do this in two lines (first code, then explanation):
Code:
in: name_types = df.pivot_table(index='Name', values='AccountType', aggfunc=set)
out:
AccountType
Name
Jane Doe {D}
John Doe {L, W, D}
Larry Wild {L, D}
Patti Shortcake {L, W}
in: df['ClientType'] = df['Name'].apply(lambda x: name_types.loc[x]['AccountType'])
Explanation:
The pivot table gets all the AccountTypes for each individual name, and removes all duplicates using the 'set' aggregate function.
The apply function then iterates through each 'Name' in the main data frame, looks up the AccountType associated with each in name_types, and adds it to the new column ClientType in the main dataframe.
And you're done!
Addendum:
If you need the column to be a string instead of a set, use:
in: def to_string(the_set):
        string = ''
        for item in the_set:
            string += item
        return string
in: df['ClientType'] = df['ClientType'].apply(to_string)
in: df.head()
out:
Name AccountType ClientType
0 Jane Doe D D
1 John Doe D LDW
2 John Doe D LDW
3 John Doe L LDW
4 John Doe D LDW
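As an alternative to the pivot table plus lookup, the same column can probably be built in one step with groupby and transform. This is a sketch on hypothetical data, since the original example was only posted as an image. It reuses the column names from the answer above and sorts the types so the result is deterministic, which iterating over a set does not guarantee.

```python
import pandas as pd

# Hypothetical data; one row per account.
df = pd.DataFrame(
    {
        "Name": ["Jane Doe", "John Doe", "John Doe", "John Doe", "Larry Wild", "Larry Wild"],
        "AccountType": ["D", "D", "L", "W", "L", "D"],
    }
)

# transform broadcasts the per-person string back onto every row of that person.
df["ClientType"] = df.groupby("Name")["AccountType"].transform(
    lambda types: "".join(sorted(types.unique()))
)
print(df)
```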

How do I combine multiple rows of a CSV that share data into one row using Pandas?

I have downloaded the ASCAP database, giving me a CSV that is too large for Excel to handle. I'm able to chunk the CSV to open parts of it, the problem is that the data isn't super helpful in its default format. Each song title has 3+ rows associated with it:
The first row includes the % share that ASCAP has in that song.
The rows after that include a character code (ROLE_TYPE) that indicates if that row contains the writer or performer of that song.
The first column of each row contains a song title.
This structure makes the data confusing because on the rows that list the % share there are blank cells in the NAME column because that row does not have a Writer/Performer associated with it.
What I would like to do is transform this data from having 3+ rows per song to having 1 row per song with all relevant data.
So instead of:
TITLE, ROLE_TYPE, NAME, SHARES, NOTE
I would like to change the data to:
TITLE, WRITER, PERFORMER, SHARES, NOTE
Here is a sample of the data:
TITLE,ROLE_TYPE,NAME,SHARES,NOTE
SCORE MORE,ASCAP,Total Current ASCAP Share,100,
SCORE MORE,W,SMITH ANTONIO RENARD,,
SCORE MORE,P,SMITH SHOW PUBLISHING,,
PEOPLE KNO,ASCAP,Total Current ASCAP Share,100,
PEOPLE KNO,W,SMITH ANTONIO RENARD,,
PEOPLE KNO,P,SMITH SHOW PUBLISHING,,
FEEDBACK,ASCAP,Total Current ASCAP Share,100,
FEEDBACK,W,SMITH ANTONIO RENARD,,
I would like the data to look like:
TITLE, WRITER, PERFORMER, SHARES, NOTE
SCORE MORE, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
PEOPLE KNO, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
FEEDBACK, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
I'm using python/pandas to try and work with the data. I am able to use groupby('TITLE') to group rows with matching titles.
import pandas as pd
data = pd.read_csv("COMMA_ASCAP_TEXT.txt", low_memory=False)
title_grouped = data.groupby('TITLE')
for TITLE, group in title_grouped:
    print(TITLE)
    print(group)
I was able to groupby('TITLE') of each song, and the output I get seems close to what I want:
SCORE MORE
TITLE ROLE_TYPE NAME SHARES NOTE
0 SCORE MORE ASCAP Total Current ASCAP Share 100.0 NaN
1 SCORE MORE W SMITH ANTONIO RENARD NaN NaN
2 SCORE MORE P SMITH SHOW PUBLISHING NaN NaN
What do I need to do to take this group and produce a single row in a CSV file with all the data related to each song?

I would recommend:
Decompose the data by the ROLE_TYPE
Prepare the data for merge (rename columns and drop unnecessary columns)
Merge everything back into one DataFrame
Merge will be automatically performed over the column which has the same name in the DataFrames being merged (TITLE in this case).
Seems to work nicely :)
data = pd.read_csv("data2.csv", sep=",")
# Create 3 individual DataFrames for different roles
data_ascap = data[data["ROLE_TYPE"] == "ASCAP"].copy()
data_writer = data[data["ROLE_TYPE"] == "W"].copy()
data_performer = data[data["ROLE_TYPE"] == "P"].copy()
# Remove unnecessary columns for ASCAP role
data_ascap.drop(["ROLE_TYPE", "NAME"], axis=1, inplace=True)
# Rename columns and remove unnecessary columns for WRITER role
data_writer.rename(index=str, columns={"NAME": "WRITER"}, inplace=True)
data_writer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)
# Rename columns and remove unnecessary columns for PERFORMER role
data_performer.rename(index=str, columns={"NAME": "PERFORMER"}, inplace=True)
data_performer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)
# Merge all together
result = data_ascap.merge(data_writer, how="left")
result = result.merge(data_performer, how="left")
# Print result
print(result)
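One caveat with the merge approach: if a song has several writers or performers, the merge produces several rows for that title. A possible variation, sketched here with the sample rows from the question as inline data, joins multiple names into one cell before merging. The "; " separator and the intermediate names `shares` and `names` are assumptions for illustration.

```python
import io
import pandas as pd

# Sample rows from the question, inlined so the sketch is self-contained.
raw = io.StringIO("""TITLE,ROLE_TYPE,NAME,SHARES,NOTE
SCORE MORE,ASCAP,Total Current ASCAP Share,100,
SCORE MORE,W,SMITH ANTONIO RENARD,,
SCORE MORE,P,SMITH SHOW PUBLISHING,,
PEOPLE KNO,ASCAP,Total Current ASCAP Share,100,
PEOPLE KNO,W,SMITH ANTONIO RENARD,,
PEOPLE KNO,P,SMITH SHOW PUBLISHING,,
FEEDBACK,ASCAP,Total Current ASCAP Share,100,
FEEDBACK,W,SMITH ANTONIO RENARD,,
""")
data = pd.read_csv(raw)

# The ASCAP rows carry the share and note for each title.
shares = data.loc[data["ROLE_TYPE"] == "ASCAP", ["TITLE", "SHARES", "NOTE"]]

# Collapse the writer/performer rows to one cell per title and role.
names = (
    data[data["ROLE_TYPE"].isin(["W", "P"])]
    .groupby(["TITLE", "ROLE_TYPE"])["NAME"]
    .agg("; ".join)
    .unstack("ROLE_TYPE")
    .rename(columns={"W": "WRITER", "P": "PERFORMER"})
    .reset_index()
)

result = shares.merge(names, on="TITLE", how="left")[
    ["TITLE", "WRITER", "PERFORMER", "SHARES", "NOTE"]
]
print(result)
```

Titles without a performer row, such as FEEDBACK in the sample, end up with NaN in the PERFORMER column.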

Python : How to collapse the contents of several rows into one cell when importing csv in pandas

I am trying to import a txt file containing radiology reports from patients. Each row is supposed to be a radiology exam (MRI/CT/etc). The original txt file looks something like this:
Name | MRN | DOB | Type_Imaging | Report_Status | Report_Text
John Doe | 1234 | 01/01/1995 | MRI |Complete | Exam Number: A5678
Report status: final
Type: MRI of brain
-----------
REPORT:
HISTORY: History of meningioma, surveillance
FINDINGS: Again demonstrated is a small left frontal parasaggital meningioma, not interval growth. Evidence of cerebrovascular disease unchanged from prior.
Again demonstrated are post-surgical changes associated with prior craniotomy.
[report_end]
James Smith | 5678 | 05/05/1987 |CT | Complete |Exam Number: A8623
Report status: final
Type: CT of chest
-----------
REPORT:
HISTORY: Admitted patient with new fever, concern for pneumonia
FINDINGS: A CT of the chest demostrates bla bla bla
bla bla bla
[report_end]
When I import into pandas using pd.read_csv('filename', sep='|', header=0), the df I get has only "Exam Number: A5678" for report text in the first row. Then, the next row has "Report status: final" in the first cell and the rest of the row has NaN. The third row starts with "Type: MRI of brain" in the first cell and NaN in the rest. etc etc.
It seems like the import is taking both my defined delimiter ('|') and the tabs in the original txt as separators when reading the txt file. There are no '|' within the text of the report.
Is there a way to import this file in a way that collapses all the information between "Exam Number: A5678" and "[report end]" into one cell (the last cell in each row).
Alternatively, I was considering pre-processing this as a text file in order to extract all the Report texts in an iterative manner and append them onto a list that I will eventually be able to add to a df as a column. Looking online and on SO, I haven't been able to find a way to do this when I need to use unique start ("Exam Number:") and end ("[report end]") delimiters for the string of interest. As well as find a way to have the script continue to read the text where it left off (as opposed to just extracting the first report text).
Any thoughts?
Thanks!
Maya

Please be careful that your [report_end] is consistent. You gave both [report_end] and [report end]. I'm assuming that is a typo.
Assuming your file name is test.txt
txt = open('test.txt').read()
names, txt_ = txt.split('\n', 1)
names = names.split('|')
pd.DataFrame(
[t.strip().split('|') for t in txt_.split('[report_end]') if t.strip()],
columns=names)
Name MRN DOB Type_Imaging Report_Status Report_Text
0 John Doe 1234 01/01/1995 MRI Complete Exam Number: A5678\nReport status: final\nTyp...
1 James Smith 5678 05/05/1987 CT Complete Exam Number: A8623\nReport status: final\nType...
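The same idea as a self-contained sketch, with a shortened version of the sample file inlined as a string (an assumption, so it runs without test.txt). It strips whitespace around each field and limits the number of splits, so a "|" that happened to appear inside a report would stay in the last column.

```python
import pandas as pd

# Shortened, hypothetical copy of the sample file from the question.
raw = (
    "Name | MRN | DOB | Type_Imaging | Report_Status | Report_Text\n"
    "John Doe | 1234 | 01/01/1995 | MRI |Complete | Exam Number: A5678\n"
    "Report status: final\n"
    "Type: MRI of brain\n"
    "[report_end]\n"
    "James Smith | 5678 | 05/05/1987 |CT | Complete |Exam Number: A8623\n"
    "Report status: final\n"
    "Type: CT of chest\n"
    "[report_end]\n"
)

# The first line holds the column names; the rest holds the records.
header, body = raw.split("\n", 1)
columns = [name.strip() for name in header.split("|")]

records = []
for chunk in body.split("[report_end]"):
    chunk = chunk.strip()
    if not chunk:
        continue  # skip the empty piece after the final marker
    # maxsplit keeps everything after the fifth "|" in the report text field.
    fields = [field.strip() for field in chunk.split("|", len(columns) - 1)]
    records.append(fields)

reports = pd.DataFrame(records, columns=columns)
print(reports)
```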

I ended up doing this which worked:
import re
import pandas as pd
f = open("filename.txt", "r")
data = f.read().replace("\n", "")
matches = re.findall(r"\|Exam Number:(.*?)\[report_end\]", data, re.DOTALL)
df = pd.read_csv("filename.txt", sep="|", parse_dates=[5]).dropna(axis=0, how="any")
