Parse raw text data and extract a particular value in Python

One of the columns in my database stores text in the format shown below. The text is not in a standard format; sometimes there is additional text before the "Insurance Date" field, so when I split the text in Python the date may land in a different column. In that case I need to search for the "Insurance Date" value across all columns.
Sample text
"Accumulation Period - period of time insured must incur eligible medical expenses at least equal to the deductible amount in order to establish a benefit period under a major medical expense or comprehensive medical expense policy.\n
Insurance Date 12/17/2018\n
Insurance Number 235845\n
Carrier Name SKGP\n
Coverage $240000"
Expected result
INS_NO Insurance Date Carrier Name
235845 12/17/2018 SKGP
How do we parse raw text like this and extract the value of Insurance Date?
I'm using the logic below, but I don't know how to extract the date into another column:
df = pd.read_sql(query, conn)
df2 = df["NOTES"].str.split("\n", expand=True)

Use regex
If the text follows a pattern (more or less), you could use regex.
See the Python documentation for regular expression operations (the re module).
Example
Below you can find a simplified example.
text = """
Accumulation Period - period of time insured must incur eligible medical expenses at least equal to the deductible amount in order to establish a benefit period under a major medical expense or comprehensive medical expense policy.
Insurance Date 12/17/2018
Insurance Number 235845
Carrier Name SKGP
Coverage $240000
"""
pattern = re.compile(r"Insurance Date (.*)\nInsurance Number (.*)\nCarrier Name (.*)\n")
match = pattern.search(text)
print("Found:")
if match:
for g in match.groups():
print(g)
The output:
Found:
12/17/2018
235845
SKGP
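
Applied back to the asker's DataFrame, the same idea can populate new columns directly with Series.str.extract. Here's a minimal sketch (the NOTES column name comes from the question; the named groups are my own labels, and the sample frame stands in for the database query):

import pandas as pd

notes = ("Accumulation Period - ...\n"
         "Insurance Date 12/17/2018\n"
         "Insurance Number 235845\n"
         "Carrier Name SKGP\n"
         "Coverage $240000")
df = pd.DataFrame({"NOTES": [notes]})

# Named groups become column names; the regex is applied per row, so it
# does not matter which split column the date would have landed in.
pattern = (r"Insurance Date (?P<Insurance_Date>.*)\n"
           r"Insurance Number (?P<INS_NO>.*)\n"
           r"Carrier Name (?P<Carrier_Name>.*)")
print(df["NOTES"].str.extract(pattern))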

If I understand you correctly, this may get you close to what you need:
insurance = """
"Accumulation Period - period of time insured must incur eligible medical expenses at least equal to the deductible amount in order to establish a benefit period under a major medical expense or comprehensive medical expense policy.\n
Insurance Date 12/17/2018\n
Insurance Number 235845\n
Carrier Name SKGP\n
Coverage $240000"
"""
items = insurance.split('\n')
filtered_items = list(filter(lambda x: x != "", items))
del filtered_items[0]
del filtered_items[-1]
row = []
for item in filtered_items:
row.append(item.split(' ')[-1])
columns = ["INS_NO ", "Insurance Date", "Carrier Name"]
df = pd.DataFrame([row],columns=columns)
df
Output:
  Insurance Date  INS_NO Carrier Name
0     12/17/2018  235845         SKGP
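
Note that this depends on fixed line positions (deleting the first and last items), so it breaks as soon as extra text shifts the lines around. A more position-independent variant (my own sketch, not from the answer) keys each line by its label instead:

import pandas as pd

def parse_notes(text):
    # Map each known label to the value that follows it on the same line.
    labels = {"Insurance Date": "Insurance Date",
              "Insurance Number": "INS_NO",
              "Carrier Name": "Carrier Name"}
    record = {}
    for line in text.split('\n'):
        line = line.strip().strip('"')
        for label, column in labels.items():
            if line.startswith(label):
                record[column] = line[len(label):].strip()
    return record

sample = "Insurance Date 12/17/2018\nInsurance Number 235845\nCarrier Name SKGP"
df = pd.DataFrame([parse_notes(sample)])
print(df)  # columns: Insurance Date, INS_NO, Carrier Name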

Related

Pandas - groupby and show aggregate on all "levels"

I am a Pandas newbie and I am trying to automate the processing of ticket data we get from our IT ticketing system. After experimenting I was able to get 80 percent of the way to the result I am looking for.
Currently I pull in the ticket data from a CSV into a "df" dataframe. I then want to summarize the data for the higher ups to review and get high level info like totals and average "age" of tickets (number of days between ticket creation date and current date).
An example of the ticket data for the "df" dataframe is included as raw CSV text at the end of this question.
I then create "df2" dataframe to summarize df using:
df2 = df.groupby(["dept", "group", "assignee", "ticket_type"]).agg(
    task_count=('ticket_type', 'size'),
    mean_age_in_days=('age', 'mean'),
)
Printing out df2 gives me something very close to what I need.
As you can see we look at the count of tickets assigned to each staff member, separated by type (incident, request), and also look at the average "age" of each ticket type (incident, request) for each staff member.
The roadblock that I am hitting now, and have been pulling my hair out about, is that I need to show the aggregates (counts and averages of ages) at all 3 levels (sorry if I am using the wrong jargon). Basically I need to show the count and average age for all tickets associated with a group, then the same thing for tickets at the department ("Division") level, and lastly the grand total and grand average for all tickets in the entire organization (all tickets in all departments and groups).
In the ideal result, each group would show the count of tickets and average age for that group's tickets; each dept/division would show the count and average age for all tickets in all of its groups; and at the bottom would be the grand total and grand average for the entire organization. In the end both df2 (the summary of ticket data) and df will be dumped to an Excel file on separate worksheets in the same workbook.
Please have mercy on me! Can someone show me how I could generate the desired "summary" with counts and average age at all levels (group, dept., and organization)? Thanks in advance for any assistance, I'd really, really appreciate it!
I've added a link to a CSV with sample ticket data on GitHub.
Also, here's raw CSV text for the sample ticket data:
,number,created_on,dept,group,assignee,ticket_type,age
0,14500,2021-02-19 11:48:28,IT_Services_Division,Helpdesk,Jane Doe,Incident,361
1,16890,2021-04-20 10:51:49,IT_Services_Division,Helpdesk,Jane Doe,Incident,120
2,16891,2021-04-20 11:51:00,IT_Services_Division,Helpdesk,Tilly James,Request,120
3,15700,2021-06-09 09:05:28,IT_Services_Division,Systems,Steve Lee,Incident,252
4,16000,2021-08-12 09:32:39,IT_Services_Division,Systems,Linda Nguyen,Request,188
5,16100,2021-08-18 17:43:54,IT_Services_Division,TechSupport,Joseph Wills,Incident,181
6,19000,2021-01-17 15:01:50,IT_Services_Division,TechSupport,Bill Gonzales,Request,30
7,18990,2021-01-10 13:00:01,IT_Services_Division,TechSupport,Bill Gonzales,Request,37
8,18800,2021-12-03 21:13:12,Data_Division,DataGroup,Bob Simpson,Incident,74
9,16880,2021-10-18 11:56:03,Data_Division,DataGroup,Bob Simpson,Request,119
10,18000,2021-11-09 14:28:44,IT_Services_Division,Systems,Veronica Paulson,Incident,98
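To reproduce the answers below, the sample CSV can be read straight from a string (a small sketch; the unnamed first column is the row index):

import io
import pandas as pd

csv_text = """,number,created_on,dept,group,assignee,ticket_type,age
0,14500,2021-02-19 11:48:28,IT_Services_Division,Helpdesk,Jane Doe,Incident,361
1,16890,2021-04-20 10:51:49,IT_Services_Division,Helpdesk,Jane Doe,Incident,120"""
# ...paste the remaining rows from the sample above...

df = pd.read_csv(io.StringIO(csv_text), index_col=0)
df['created_on'] = pd.to_datetime(df['created_on'])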
Here's a different approach which is easier, but results in a different structure:
agg_df = df.copy()

# Add dept-level info to the dept label
gb = agg_df.groupby('dept')
task_counts = gb['ticket_type'].transform('count').astype(str)
mean_ages = gb['age'].transform('mean').round(2).astype(str)
agg_df['dept'] += ' [' + task_counts + ' tasks, avg age= ' + mean_ages + ']'

# Add group-level info to the group label
gb = agg_df.groupby(['dept', 'group'])
task_counts = gb['ticket_type'].transform('count').astype(str)
mean_ages = gb['age'].transform('mean').round(2).astype(str)
agg_df['group'] += ' [' + task_counts + ' tasks, avg age= ' + mean_ages + ']'

# Add org-level info
agg_df['org'] = 'Org [{} tasks, avg age = {}]'.format(len(agg_df), agg_df['age'].mean().round(2))

agg_df = (
    agg_df.groupby(['org', 'dept', 'group', 'assignee', 'ticket_type']).agg(
        task_count=('ticket_type', 'count'),
        mean_ticket_age=('age', 'mean'))
)
agg_df
I couldn't think of a cleaner way to get the structure you want, so I manually loop through the different groupby levels, adding one row at a time:
import pandas as pd

multi_ind = pd.MultiIndex.from_tuples([], names=('dept', 'group', 'assignee', 'ticket_type'))
agg_df = pd.DataFrame(index=multi_ind, columns=['task_count', 'mean_age_in_days'])

data = lambda df: {'task_count': len(df), 'mean_age_in_days': df['age'].mean()}

for dept, dept_g in df.groupby('dept'):
    for group, group_g in dept_g.groupby('group'):
        for assignee, assignee_g in group_g.groupby('assignee'):
            for ticket_type, ticket_g in assignee_g.groupby('ticket_type'):
                # Add ticket totals
                agg_df.loc[(dept, group, assignee, ticket_type)] = data(ticket_g)
        # Add group totals
        agg_df.loc[(dept, group, assignee, 'Group Total/Avg')] = data(group_g)
    # Add dept totals
    agg_df.loc[(dept, group, assignee, 'Dept Total/Avg')] = data(dept_g)

# Add org totals
agg_df.loc[('', '', '', 'Org Total/Avg')] = data(df)
agg_df
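Since the end goal is to dump both the raw data and the summary to separate worksheets in one workbook, here is a minimal follow-up sketch with pandas' ExcelWriter (my addition; the file and sheet names are arbitrary):

import pandas as pd

# Write the raw tickets and the multi-level summary into one workbook.
# Requires an Excel engine such as openpyxl to be installed.
with pd.ExcelWriter("ticket_report.xlsx") as writer:
    df.to_excel(writer, sheet_name="raw_tickets", index=False)
    agg_df.to_excel(writer, sheet_name="summary")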

Text Processing: how can I extract the right fields from a list of strings?

I want to extract six fields from content_list and put them into a dataframe. The fields are: Seq. #, Name, Coding Instructions, Target Value, Selections, and Supporting Definitions. However, the regex I use to build the metadata list does not give me a Seq. # for each item in the list and misses a few other items, so when I go to subset it, I get an index-out-of-range error. I am not sure what I'm doing wrong. Can you help me? Thank you!
import re
import pandas as pd
content_list = ['\nSeq. #:\n2031', 'Name:\nSSN N/A\nThe value on arrival at this facility\nTarget Value:\nSelection Text\nDefinition\nNo\nYes\nSelections:\n(none)\nSupporting Definitions:\nIndicate the number created and automatically inserted by the software that uniquely identifies this patient.\nCoding Instructions:\nOnce assigned to a patient at the participating facility, this number will never be changed or reassigned to a different patient. If the \npatient returns to the same participating facility or for followup, they will receive this same unique patient identifier.\nNote(s):', '\nSeq. #:\n2040', 'Name:\nNCDR Patient ID\nThe value on arrival at this facility\nTarget Value:\n(none)\nSelections:\n(none)\nSupporting Definitions:\nAn optional patient identifier, such as Medical Record Number, that can be associated with the patient.\nCoding Instructions:\nThis element is referenced in The Joint Commission AMI Core Measures, AMI-1 through AMI-5. AMI-7, 7a, 8, 8a and AMI-9.\nNote(s):', '\nSeq. #:\n2045', "Name:\nOther ID\nN/A\nTarget Value:\n(none)\nSelections:\n(none)\nSupporting Definitions:\nIndicate the patient's date of birth.\nCoding Instructions:\nThis element is referenced in The Joint Commission AMI Core Measures, AMI-1 through AMI-5. AMI-7, 7a, 8, 8a and AMI-9.\nNote(s):", '\nSeq. #:\n2050', "Name:\nBirth Date\nThe value on arrival at this facility\nTarget Value:\n(none)\nSelections:\n(none)\nSupporting Definitions:\n© 2007, American College of Cardiology Foundation\n3/31/2014\nPage 2 of 137\nEffective for Patient Discharges January 01, 2015\nCoder's Data Dictionary\nNCDR® ACTION Registry®-GWTGŽ v2.4\nA. Demographics\nIndicate the patient's sex at birth.\nCoding Instructions:\nThis element is referenced in The Joint Commission AMI Core Measures, AMI-1 through AMI-5. AMI-7, 7a, 8, 8a and AMI-9.\nNote(s):", '\nSeq. #:\n2060', 'Name:\nSex\nThe value on arrival at this facility\nTarget Value:\nSelection Text\nDefinition\nMale\nFemale\nSelections:\n(none)\nSupporting Definitions:\nIndicate if the patient is White.\nCoding Instructions:\nIf the patient has multiple race origins, specify them using the other race selections in addition to this one.\nThis element is referenced in The Joint Commission AMI Core Measures, AMI-1 through AMI-5. AMI-7, 7a, 8, 8a and AMI-9.\nNote(s):', '\nSeq. #:\n2070', 'Name:\nRace - White\nThe value on arrival at this facility\nTarget Value:\nSelection Text\nDefinition\nNo\nYes\nSelections:\nWhite (race)\n:\nHaving origins in any of the original peoples of Europe, the Middle East, or North Africa.\nSource:\nU.S. Office of Management and Budget. Classification of Federal Data on Race and Ethnicity\nSupporting Definitions:\nIndicate if the patient is Black or African American.\nCoding Instructions:\nIf the patient has multiple race origins, specify them using the other race selections in addition to this one.\nThis element is referenced in The Joint Commission AMI Core Measures, AMI-1 through AMI-5. AMI-7, 7a, 8, 8a and AMI-9.\nNote(s):', '\nSeq. #:\n1040', 'Name:\nTransmission Number\nN/A\nTarget Value:\n(none)\nSelections:\n(none)\nSupporting Definitions:\nVendor Identification (agreed upon by mutual selection between the vendor and the NCDR) to identify software vendor. Vendors \nmust use consistent name identification across sites. Changes to Vendor Name Identification must be approved by the NCDR.\nCoding Instructions:',
'\nSeq. #:\n1050', "Name:\nVendor Identifier\nN/A\nTarget Value:\n(none)\nSelections:\n(none)\nSupporting Definitions:\nVendor's software product name and version number identifying the software which created this record (assigned by vendor). \nVendor controls the value in this field. Version passing certification/harvest testing will be noted at the NCDR.\nCoding Instructions:", '\nSeq. #:\n1060', "Name:\nVendor Software Version\nN/A\nTarget Value:\n(none)\nSelections:\n(none)\nSupporting Definitions:\n© 2007, American College of Cardiology Foundation\n3/31/2014\nPage 136 of 137\nEffective for Patient Discharges January 01, 2015\nCoder's Data Dictionary\nNCDR® ACTION Registry®-GWTGŽ v2.4\nZ. Administration\nThe NCDR Registry Identifier describes the data registry to which these records apply. It is implemented in the software at the time \nthe data is collected and the records are created. This is entered into the schema automatically by software.\nCoding Instructions:", '\nSeq. #:\n1070', 'Name:\nRegistry Identifier\nN/A\nTarget Value:\n(none)\nSelections:\n(none)\nSupporting Definitions:\nRegistry Version describes the version number of the Data Specifications/Dictionary, to which each record conforms. It identifies \nwhich fields should have data, and what are the valid data for each field. It is the version implemented in the software at the time \nthe data is collected and the records are created. This is entered into the schema automatically by software.\nCoding Instructions:', '\nSeq. #:\n1080', 'Name:\nRegistry Version\nN/A\nTarget Value:\n(none)\nSelections:\n(none)\nSupporting Definitions:\nReserved for future use.\nCoding Instructions:', '\nSeq. #:\n1200', "Name:\nAuxiliary 0\nN/A\nTarget Value:\n(none)\nSelections:\n(none)\nSupporting Definitions:\n© 2007, American College of Cardiology Foundation\n3/31/2014\nPage 137 of 137\nEffective for Patient Discharges January 01, 2015\nCoder's Data Dictionary\nNCDR® ACTION Registry®-GWTGŽ v2.4\nZ. Administration"]
sequence_list = []
metadata = []
for i in content_list:
    metadata = list(filter(None, re.split("\s*(?:Seq. #:|Name:|Coding Instructions:|Target Value:|Selections:|Supporting Definitions:)\s*", i)))
    sequence_list.append([metadata[0], metadata[1], metadata[2], metadata[3], metadata[4], metadata[5]])
df = pd.DataFrame(sequence_list, columns = ['Seq #:','Name','Coding Instructions','Target Value','Supporting Definitions','Selections'])
df['Seq #:'] = df['Seq #:'].astype(int)
df.head()
You can join the items in content_list with a newline, split the resulting string on double newlines to get paragraphs, and then parse each paragraph with a matching regex like
pattern = r'(?s)^Seq\. #:\s*(.*?)\nName:\s*(.*?)\nTarget Value:\s*(.*?)\nSelections:\s*(.*?)\nSupporting Definitions:\s*(.*?)(?:\nCoding Instructions:\s*(.*))?$'
It seems like Coding Instructions can be missing, so it is made optional in the regex.
Python demo:
import re
import pandas as pd

sequence_list = []
pattern = r'^Seq\. #:\s*(.*?)\nName:\s*(.*?)\nTarget Value:\s*(.*?)\nSelections:\s*(.*?)\nSupporting Definitions:\s*(.*?)(?:\nCoding Instructions:\s*(.*))?$'
for i in re.split(r'\n{2,}', '\n'.join(content_list)):
    m = re.match(pattern, i.strip(), re.S)
    if m:
        sequence_list.append(m.groups())
# Columns listed in the same order as the capture groups above.
df = pd.DataFrame(sequence_list, columns=['Seq #:', 'Name', 'Target Value', 'Selections', 'Supporting Definitions', 'Coding Instructions'])
Note that each paragraph is only parsed if the regex matches; if it does, the match's .groups() data is used to populate the dataframe.
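As in the original attempt, the captured sequence number can then be cast to an integer (a small follow-up sketch using the df built above; paragraphs without Coding Instructions end up with None in that column):

df['Seq #:'] = df['Seq #:'].astype(int)
df.head()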

How to create a column in a data frame based on the values of another two columns?

I am pre-formatting some data for a tax filing and I am using Python to automate some of the Excel work. I have a data frame with three columns: Account; Opposite Account; Amount. I only have the names of the opposite accounts and the amounts, but the amounts for a given account / opposite-account pair are exactly the same except for the sign. For example:
Account   Opposite Acc.   Amount
          Cash           -240.56
          Supplies        240.56
          Dentist         -10.45
          Gum              10.45
From that, I can deduce that Cash is the opposite of Supplies and Dentist is the opposite to Gum, so I would like my output to be:
Account   Opposite Acc.   Amount
Supplies  Cash           -240.56
Cash      Supplies        240.56
Gum       Dentist         -10.45
Dentist   Gum              10.45
Right now I am doing this manually using str.contains:
df = df.assign(en_accounts = df['Opposite Acc.'])
df['Account'] = df['Account'].fillna("0")
df.loc[df['Account'].str.contains('Cash'), 'Account'] = 'Supplies'
But there are many accounts, and I wonder if there is a way to automate this process in Python. One strategy could be: if two rows add up to 0, the accounts are a match; so when item A (such as Supplies) appears in "Opposite Acc.", item B (such as Cash) is put in the same row but in "Account".
This is what I have so far:
import numpy as np

df['Amount'] = np.abs(df["Amount"])
c1 = df['Amount']
c2 = df['Opposing Acc.']
for i in range(1, len(c1)-1):
    p = c1[i-1]
    x = c1[i]
    n = c1[i+1]
    if p == x:
        for i in range(1, len(c2)-1):
            a = c2[i-1]
            df.loc[df['en_account']] = a
But I get the following error: "None of [Index[....]\n dtype='object', length=28554)] are in the [index]"
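For what it's worth, here is a minimal sketch of the zero-sum pairing strategy described above (my own illustration, not from the post; it assumes matching rows sit next to each other as +/- pairs, as in the example):

import pandas as pd

df = pd.DataFrame({
    "Opposite Acc.": ["Cash", "Supplies", "Dentist", "Gum"],
    "Amount": [-240.56, 240.56, -10.45, 10.45],
})

# Walk the rows two at a time; if a pair of amounts cancels out,
# each row's Account is the other row's Opposite Acc.
accounts = [""] * len(df)
for i in range(0, len(df) - 1, 2):
    if abs(df["Amount"].iloc[i] + df["Amount"].iloc[i + 1]) < 1e-9:
        accounts[i] = df["Opposite Acc."].iloc[i + 1]
        accounts[i + 1] = df["Opposite Acc."].iloc[i]
df["Account"] = accounts
print(df[["Account", "Opposite Acc.", "Amount"]])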

Natural language processing - extracting data

I need help with processing unstructured data of day-trading/swing-trading/investment recommendations. I have the unstructured data in the form of a CSV.
Following are 3 sample paragraphs from which data needs to be extracted:
Chandan Taparia of Anand Rathi has a buy call on Coal India Ltd. with
an intra-day target price of Rs 338 . The current market
price of Coal India Ltd. is 325.15 . Chandan Taparia recommended to
keep stop loss at Rs 318 .
Kotak Securities Limited has a buy call on Engineers India Ltd. with a
target price of Rs 335 .The current market price of Engineers India Ltd. is Rs 266.05 The analyst gave a year for Engineers
India Ltd. price to reach the defined target. Engineers India enjoys a
healthy market share in the Hydrocarbon consultancy segment. It enjoys
a prolific relationship with few of the major oil & gas companies like
HPCL, BPCL, ONGC and IOC. The company is well poised to benefit from a
recovery in the infrastructure spending in the hydrocarbon sector.
Independent analyst Kunal Bothra has a sell call on Ceat Ltd. with a
target price of Rs 1150 .The current market price of Ceat Ltd. is Rs 1199.6 The time period given by the analyst is 1-3 days
when Ceat Ltd. price can reach the defined target. Kunal Bothra
maintained stop loss at Rs 1240.
It's been a challenge extracting 4 pieces of information from the paragraphs: each recommendation is framed differently but essentially has
Target Price
Stop Loss Price
Current Price
Duration
Not all of the information will be available in every recommendation, but each one will at least have a Target Price.
I was trying to use regular expressions, but was not very successful. Can anyone guide me on how to extract this information, maybe using nltk?
Code I have so far for cleaning the data:
import pandas as pd
import re
#etanalysis_final.csv has 4 columns with
#0th Column having data time
#1st Column having a simple hint like 'Sell Ceat Ltd. target Rs 1150 : Kunal Bothra,Sell Ceat Ltd. at a price target of Rs 1150 and a stoploss at Rs 1240 from entry point', not all the hints are same, I can rely on it for recommender, Buy or Sell, which stock.
#4th column has the detailed recommendation given.
df = pd.read_csv('etanalysis_final.csv',encoding='ISO-8859-1')
df.DATE = pd.to_datetime(df.DATE)
df.dropna(inplace=True)
df['RECBY'] = df['C1'].apply(lambda x: re.split(':|\x96',x)[-1].strip())
df['ACT'] = df['C1'].apply(lambda x: x.split()[0].strip())
df['STK'] = df['C1'].apply(lambda x: re.split('\.|\,|:| target| has| and|Buy|Sell| with',x)[1])
#Getting the target price - not always correct
df['TGT'] = df['C4'].apply(lambda x: re.findall('\d+.', x)[0])
#Getting the stop loss price - not always correct
df['STL'] = df['C4'].apply(lambda x: re.findall('\d+.\d+', x)[-1])
This is a hard question in that there are many different ways each of the 4 pieces of information might be written. Here is a naive approach that might work, although it would require verification. I'll do the example for the target, but you can extend this to the others:
CONTEXT = 6

def is_float(x):
    try:
        float(x)
        return True
    except ValueError:
        return False

def get_target_price(s):
    words = s.split()
    n = words.index('target')
    words_in_range = words[n-CONTEXT:n+CONTEXT]
    return float(list(filter(is_float, words_in_range))[0])  # return the first float in the window
This is a simple approach to get you started but you can put extra checks to make this safer. Things to potentially improve:
Make sure that the word at the index just before the proposed float is Rs.
If no float is found in the context range, expand the context
Add user verification if there are ambiguities i.e. more than one instance of target or more than one float in the context range etc.
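For instance, run against the first sample paragraph (a quick check of the sketch above, assuming get_target_price and is_float as defined there):

text = ("Chandan Taparia of Anand Rathi has a buy call on Coal India Ltd. with "
        "an intra-day target price of Rs 338 . The current market price of "
        "Coal India Ltd. is 325.15 .")
print(get_target_price(text))  # 338.0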
I got the solution:
The code here contains only the solution part of the question asked. It could be greatly improved using the fuzzywuzzy library.
from nltk import word_tokenize

def isfloat(w):
    # Helper assumed by the original answer (same idea as is_float above).
    try:
        float(w)
        return True
    except ValueError:
        return False

periods = ['year', "year's", 'day', 'days', "day's", 'month', "month's", 'week', "week's", 'intra-day', 'intraday']
stop = ['target', 'current', 'stop', 'period', 'stoploss']

def extractinfo(row):
    if 'intra day' in row.lower():
        row = row.lower().replace('intra day', 'intra-day')
    tks = [w for w in word_tokenize(row) if any([w.lower() in stop, isfloat(w)])]
    tgt = ''
    crt = ''
    stp = ''
    prd = ''
    if 'target' in tks:
        if len(tks[tks.index('target'):tks.index('target')+2]) == 2:
            tgt = tks[tks.index('target'):tks.index('target')+2][-1]
    if 'current' in tks:
        if len(tks[tks.index('current'):tks.index('current')+2]) == 2:
            crt = tks[tks.index('current'):tks.index('current')+2][-1]
    if 'stop' in tks:
        if len(tks[tks.index('stop'):tks.index('stop')+2]) == 2:
            stp = tks[tks.index('stop'):tks.index('stop')+2][-1]
    prdd = set(periods).intersection(tks)
    if 'period' in tks:
        pdd = tks[tks.index('period'):tks.index('period')+3]
        prr = set(periods).intersection(pdd)
        if len(prr) > 0:
            if len(pdd) > 2:
                prd = ' '.join(pdd[-2::1])
            elif len(pdd) == 2:
                prd = pdd[-1]
    elif len(prdd) > 0:
        prd = list(prdd)[0]
    return (crt, tgt, stp, prd)
The solution is relatively self-explanatory; otherwise please let me know.
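Applied to the third sample paragraph, for example (my quick check, assuming the function above; it returns a (current, target, stop, period) tuple):

text = ("Independent analyst Kunal Bothra has a sell call on Ceat Ltd. with a "
        "target price of Rs 1150 .The current market price of Ceat Ltd. is Rs 1199.6 "
        "The time period given by the analyst is 1-3 days when Ceat Ltd. price can "
        "reach the defined target. Kunal Bothra maintained stop loss at Rs 1240.")
print(extractinfo(text))  # ('1199.6', '1150', '1240', '') - note the '1-3 days' duration is missed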

Is there a way to automatically get general info of many stocks like P/E ratio, Yield, and so on?

I know some ways to get daily stock prices and volumes in R or Python, but I am wondering whether there is a way (using either R or Python) to get more info about stocks, such as P/E ratio, company website, yield, and so on; preferably not just current values, but also historical ones.
Thanks.
Historical data is going to be difficult. The quantmod package for R has getQuote, which together with yahooQF will be all you need to get current values.
require("quantmod")
getQuote("GS", what = yahooQF(c("Market Capitalization", "Earnings/Share",
"P/E Ratio", "Book Value", "EBITDA", "52-week Range")))
Trade Time Market Capitalization Earnings/Share P/E Ratio Book Value EBITDA 52-week Range
GS 2012-06-21 04:00:00 47.870B 6.764 14.27 134.476 0 84.27 - 139.25
Also, try
getQuote("GS", what=yahooQF())
which will give you a menu of choices for what fields to request.
You can get recent financial statements from Google Finance with getFinancials.
There is also the FinancialInstrument package, which has several update_instruments.* functions to download metadata about instruments (stocks in this case). For example, here's what the yahoo one does:
require("FinancialInstrument")
stock("GS", currency("USD")) # define the stock
#[1] "GS"
update_instruments.yahoo("GS") #update with yahoo
#[1] "GS"
getInstrument("GS")
#primary_id :"GS"
#currency :"USD"
#multiplier :1
#tick_size :0.01
#identifiers : list()
#type :"stock"
#name :"Goldman Sachs Gro"
#exchange :"NYSE"
#market.cap :"47.870B"
#avg.volume :5480530
#EPS :6.76
#EPS.current.year.est:11.4
#EPS.next.year.est :12.9
#book.value :134
#EBITDA :0
#range.52wk :"84.27 - 139.25"
#defined.by :"yahoo"
#updated : POSIXct, format: "2012-06-21 19:31:11"
If you have an InteractiveBrokers account, you can use the outstanding IBrokers package to get lots of info about lots of instruments. Also, if you have an IB account you'll want to look at my twsInstrument package which has a lot of convenience functions.
Just to answer the website part of my question:
str <- paste("http://investing.money.msn.com/investments/company-report?symbol=", ticker, sep = "")
page <- paste(readLines(url(str, open = "rt")), collapse = "\n")
match <- regexpr("Website", page, perl = TRUE)
if (attr(match, "match.length") > 0) {
  site <- substring(page, attr(match, "capture.start"), attr(match, "capture.start") + attr(match, "capture.length") - 1)
  site <- strsplit(site, "/")[[1]][1]
}
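On the Python side, a minimal sketch using the third-party yfinance package (my addition, not from the original answers; the info keys shown are assumptions about yfinance's Yahoo-backed dictionary and can change):

import yfinance as yf  # pip install yfinance

t = yf.Ticker("GS")
info = t.info  # dict of quote/profile fields from Yahoo Finance

# Key names are assumptions; inspect info.keys() to see what is available.
for key in ("trailingPE", "dividendYield", "marketCap", "website"):
    print(key, "=", info.get(key))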
