Oversampling a class in a classification problem - Python

I have nearly 100,000 data points with 15 features, and 'disease' vs. 'no disease' as the target.
My data is imbalanced: 97% of it is 'no disease' and only 3% is 'disease'.
To overcome this, I manually created disease data by making 7 copies of the actual disease rows and merging them with the original data, using this code:
# select the rows where disease is 1
# also create a unique 'patient ID' by adding a dummy letter as a suffix to the original ID
ia = df[df['disease'] == 1]
dup = pd.DataFrame()
for i, j in zip(['a', 'b', 'c', 'd', 'e', 'f'], ['B', 'C', 'E', 'F', 'G', 'H']):
    copy = ia.copy()
    copy['dum'] = j
    copy["patient ID"] = copy["Employee Code"] + copy['dum']
    dup = pd.concat([dup, copy])
# add the copies to the original data
df = pd.concat([dup, df])
Please let me know if this is the correct method for oversampling.
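For reference, duplicating rows like this is essentially random oversampling with a fixed duplication factor. A more common pattern is to resample the minority class with replacement, applied to the training split only so the copies cannot leak into the test set. A minimal sketch (not the method above, and assuming the same df with its 'disease' column):
import pandas as pd
from sklearn.model_selection import train_test_split

# split first, so the oversampling only touches the training data
train, test = train_test_split(df, test_size=0.2, stratify=df['disease'], random_state=42)

minority = train[train['disease'] == 1]
majority = train[train['disease'] == 0]

# sample the minority class with replacement until it matches the majority class size
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)
train_balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)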


How to save results (and recall them when needed) of a simulation in Python?

I started an actuarial project in Python (based on the idea shown in this model, and adding some degree of internal randomness as done here: https://github.com/Saurabh0503/Financial-modelling-and-valuationn/blob/main/Dynamic%20Salary%20Retirement%20Model%20Internal%20Randomness.ipynb) in which I want to simulate, from a set of inputs, how long it will take for an individual to retire, given a certain amount of wealth, a certain annual salary, and a certain annual payment (calculated as the desired cash divided by the years needed to retire). In my variation of the model the user can define his/her own parameters, making the model more flexible and user friendly, and there is a function that calculates the desired retirement cash based on the individual's propensity both to save and to spend.
The problem is that, since I want to summarize the model's output (by taking the mean, max, min and standard deviation of wealth, salary and years to retirement), I have to save the results and recall them when needed, but I have no idea how to accomplish this.
I tried this solution, which consists in saving the simulation's output in a pandas DataFrame. In particular, I wrote this function:
def get_salary_wealth_year_case_df(data):
    all_ytrs = []
    salary = []
    wealth = []
    annual_payments = []
    for i in range(data.n_iter):
        ytr = years_to_retirement(data, print_output=False)
        sal = salary_at_year(data, year, case, print_output=False)
        wlt = wealth_at_year(data, year, prior_wealth, case, print_output=False)
        pmt = annual_pmts_case_df(wealth_at_year, year, case, print_output=False)
        all_ytrs.append(ytr)
        salary.append(sal)
        wealth.append(wlt)
        annual_payments.append(pmt)
    df = pd.DataFrame()
    df['Years to Retirement'] = all_ytrs
    df['Salary'] = salary
    df['Wealth'] = wealth
    df['Annual Payments'] = annual_payments
    return df
I need feedback about what I'm doing. Am I doing it right? If so, are there more efficient ways to do it? If not, what should I do? Thanks in advance!
Given the inputs used for the function, I'm assuming your code (as it is) will do just fine in terms of computation speed.
As suggested, you can add a saving option to your function so that the results being returned are also stored in a .csv file.
def get_salary_wealth_year_case_df(data, path):
    all_ytrs = []
    salary = []
    wealth = []
    annual_payments = []
    for i in range(data.n_iter):
        ytr = years_to_retirement(data, print_output=False)
        sal = salary_at_year(data, year, case, print_output=False)
        wlt = wealth_at_year(data, year, prior_wealth, case, print_output=False)
        pmt = annual_pmts_case_df(wealth_at_year, year, case, print_output=False)
        all_ytrs.append(ytr)
        salary.append(sal)
        wealth.append(wlt)
        annual_payments.append(pmt)
    df = pd.DataFrame()
    df['Years to Retirement'] = all_ytrs
    df['Salary'] = salary
    df['Wealth'] = wealth
    df['Annual Payments'] = annual_payments
    # Save the dataframe to a given path inside your workspace
    df.to_csv(path, header=False)
    return df
After saving, returning the object is optional; it depends on whether you are going to use this dataframe in your code moving forward.
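To recall the saved results later and build the summary the question asks for (mean, max, min and standard deviation), the file can simply be read back with pandas. A rough sketch, assuming a hypothetical path 'results.csv' written by the function above (header=False and the default index, hence the explicit column names):
import pandas as pd

cols = ['Years to Retirement', 'Salary', 'Wealth', 'Annual Payments']
# the file was written without a header and with the default index,
# so supply the names explicitly and use the first column as the index
results = pd.read_csv('results.csv', header=None, names=['run'] + cols, index_col='run')

# summary statistics across all simulated iterations
summary = results.agg(['mean', 'max', 'min', 'std'])
print(summary)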

Need help running a full match between two large dataframes based on the full business name compared to the description of a newsfeed

I have two dataframes:
One with a single column of business names that I call 'bus_names_2' with a column name of 'BUSINESS_NAME'
One with an array of records and fields that was pulled from an RSS feed, which I call 'df_newsfeed'. The important field is the 'Description_2' field, which represents the RSS feed's contents after scrubbing stop words and symbols. The same scrubbing was also applied to the 'bus_names_2' dataframe.
I am trying to look through each record in the 'df_newsfeed' dataframe's 'Description_2' field to see if it contains a business name from the 'bus_names_2' dataframe. This is easily done using the following:
def IdentityResolution_demo(bus_names, df, col='Description_2', upper=True):
    n_rows = df.shape[0]
    description_col = df.columns.get_loc(col)
    df['Company'] = ''
    company_col = df.columns.get_loc('Company')
    if upper:
        df.loc[:, col] = df.loc[:, col].str.upper()
    for ind in range(n_rows):
        businesses = []
        description = df.iloc[ind, description_col]
        for bus_name in bus_names:
            if bus_name in description:
                businesses.append(bus_name)
        if len(businesses) > 0:
            company = '|'.join(businesses)
            df.iloc[ind, company_col] = company
    df = df[['Source', 'RSS', 'Company', 'Title', 'PublishedDate', 'Description', 'Link']].drop_duplicates()
    return df
bus_names_3 = list(set(bus_names_2['BUSINESS_NAME'].tolist()))
test = IdentityResolution_demo(bus_names_3, df_newsfeed.iloc[:10])
test[test['Company']!='']
The issue with this, aside from the length of time it takes, is that it brings back everything in a 'contains' manner. I only want full-word matches. Meaning, if I have a company in my 'bus_names_2' dataframe called 'Bank of A', it should only bring that name into the company category if the full phrase 'Bank of A' exists in the 'Description_2' column of the 'df_newsfeed' dataframe, and not when 'Bank of America' shows up.
Essentially, I need something like this ingrained in my function to produce the proper output for the 'Company' column, but I don't know how to implement it. The below code gets the point across.
Description_2 = 'GUARDFORCE AI CO LIMITED AI GFAIW RIVERSOFT INC PEAKWORK COMPANY GFAIS CONCIERGE GUARDFORCE AI RIVERSOFT ROBOT TRAVEL AGENCY'
bus_name_2 = ['GUARDFORCE AI CO']
for i in bus_name_2:
    bus_name = re.compile(fr'\b{i}\b')
    print(f"{i if bus_name.match(Description_2) else ''}")
This would produce an output of 'GUARDFORCE AI CO' but if I change the bus_name_2 to:
bus_name_2 = ['GUARDFORCE AI C']
It would produce a null output.
This function is written in the way it is because comparing two dataframes turned into a very long query and so optimization required a non-dataframe format.
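One way to fold the word-boundary idea into the matching loop is to precompile one pattern per business name, using re.escape in case a name contains regex metacharacters and re.search rather than re.match so the name can sit anywhere in the description. A minimal sketch, assuming the same inputs as the function above (identity_resolution_wordbound is just an illustrative name):
import re

def identity_resolution_wordbound(bus_names, df, col='Description_2'):
    # precompile a word-bounded pattern per business name
    patterns = [(name, re.compile(rf'\b{re.escape(name)}\b')) for name in bus_names]

    def match_row(description):
        # keep only names that appear as whole words/phrases in the description
        hits = [name for name, pat in patterns if pat.search(description)]
        return '|'.join(hits)

    df = df.copy()
    df['Company'] = df[col].str.upper().apply(match_row)
    return df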

Iterate Pandas Chunks through preprocessing/cleaning before concatenating

Good day Python/Pandas Gurus:
I run into memory issues while performing data analysis on my local machine. I typically deal with data in the shape of (15000000+, 50+), and I usually chunk it with chunksize=1000000 in pd.read_csv(), which always works great for me.
I am wondering how I can iterate each chunk through my entire data cleaning/preprocessing section, so that I do not have to run the entire data frame through that section of code; otherwise I hit system limitations and run out of memory.
I want to read the pandas chunks, run each one through a function (or just a series of steps) that renames columns, filters the data frame, and assigns data types. Once this preprocessing is complete for all chunks, I would like the processed chunks to be concatenated together, creating the completed data frame.
df_chunks = pandas.read_csv("File.path", chunksize=10000)
for chunks in df_chunks:
    # Task 1: Rename Columns
    # Task 2: Filter(s)
    # Task 3: Assign data types to non-object fields
processed_df = pd.concat(df_chunks)
Here is a sample of the code that I run the entire data frame through for preprocessing, but I hit system limitations for the volume of data that I have:
billing_docs_clean.columns = ['BillingDocument', 'BillingDocumentItem', 'BillingDocumentType', 'BillingCategory', 'DocumentCategory',
'DocumentCurrency', 'SalesOrganization', 'DistributionChannel', 'PricingProcedure',
'DocumentConditionNumber', 'ShippingConditions', 'BillingDate', 'CustomerGroup', 'Incoterms',
'PostingStatus', 'PaymentTerms', 'DestinationCountry', 'Region', 'CreatedBy', 'CreationTime',
'SoldtoNumber', 'Curr1', 'Divison', 'Curr2', 'ExchangeRate', 'BilledQuantitySUn', 'SalesUnits',
'Numerator', 'Denominator', 'BilledQuantityBUn', 'BaseUnits', 'RequiredQuantity', 'BUn1', 'ExchangeRate2',
'ItemNetValue', 'Curr3', 'ReferenceDocument', 'ReferenceDocumentItem', 'ReferencyDocumentCategory',
'SalesDocument', 'SalesDocumentItem', 'Material', 'MaterialDescription', 'MaterialGroup',
'SalesDocumentItemCategory', 'SalesProductHierarchy', 'ShippingPoint', 'Plant', 'PlantRegion',
'SalesGroup', 'SalesOffice', 'Returns', 'Cost', 'Curr4', 'GrossValue', 'Curr5', 'NetValue', 'Curr6',
'CashDiscount', 'Curr7', 'FreightCharges', 'Curr8', 'Rebate', 'Curr9', 'OVCFreight', 'Curr10', 'ProfitCenter',
'CreditPrice', 'Curr11', 'SDDocumentCategory']
# Filter data to obtain US, Canada, and Mexico industrial sales for IFS Profit Center
billing_docs_clean = billing_docs_clean[
(billing_docs_clean['DistributionChannel'] == '02') &
(billing_docs_clean['ProfitCenter'].str.startswith('00001', na=False)) &
(billing_docs_clean['ReferenceDocumentItem'].astype(float) < 900000) &
(billing_docs_clean['PostingStatus']=='C') &
(billing_docs_clean['PricingProcedure'] != 'ZEZEFD') &
(billing_docs_clean['SalesDocumentItemCategory'] != 'TANN')]
# Correct Field Formats and data types
Date_Fields_billing_docs_clean = ['BillingDate']
for datefields in Date_Fields_billing_docs_clean:
    billing_docs_clean[datefields] = pd.to_datetime(billing_docs_clean[datefields])

Trim_Zeros_billing_docs_clean = ['BillingDocument', 'BillingDocumentItem', 'ProfitCenter', 'Material', 'ReferenceDocument',
                                 'ReferenceDocumentItem', 'SalesDocument', 'SalesDocumentItem']
for TrimFields in Trim_Zeros_billing_docs_clean:
    billing_docs_clean[TrimFields] = billing_docs_clean[TrimFields].str.lstrip('0')

Numeric_Fields_billing_docs_clean = ['ExchangeRate', 'BilledQuantitySUn', 'Numerator', 'Denominator', 'BilledQuantityBUn',
                                     'RequiredQuantity', 'ExchangeRate2', 'ItemNetValue', 'Cost', 'GrossValue', 'NetValue',
                                     'CashDiscount', 'FreightCharges', 'Rebate', 'OVCFreight', 'CreditPrice']
for NumericFields in Numeric_Fields_billing_docs_clean:
    billing_docs_clean[NumericFields] = billing_docs_clean[NumericFields].astype('str').str.replace(',', '').astype(float)
I am still relatively new to Python coding for data analytics, but eager to learn! So I appreciate any and all explanations or other recommendations for the code in this post.
Thanks!
Task 1: Rename Columns
For this you can harness pandas.read_csv's optional arguments header and names. Consider the following simple example; let file.csv's content be
A,B,C
1,2,3
4,5,6
then
import pandas as pd
df = pd.read_csv("file.csv", header=0,names=["X","Y","Z"])
print(df)
output
X Y Z
0 1 2 3
1 4 5 6
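That covers Task 1. For Tasks 2 and 3, one common pattern is to wrap the cleaning section in a function that takes a chunk and returns the processed chunk, then concatenate the processed chunks at the end. A rough sketch, reusing a couple of the filters and conversions from the question (the file path, chunk size and new_column_names list are placeholders):
import pandas as pd

def preprocess(chunk):
    # Task 2: filter rows (same style of conditions as in the question)
    chunk = chunk[
        (chunk['DistributionChannel'] == '02') &
        (chunk['PostingStatus'] == 'C') &
        (chunk['PricingProcedure'] != 'ZEZEFD')].copy()
    # Task 3: assign data types to non-object fields
    chunk['BillingDate'] = pd.to_datetime(chunk['BillingDate'])
    return chunk

# Task 1 is handled by read_csv itself via header/names, as shown above;
# new_column_names is a placeholder for the renamed column list
chunks = pd.read_csv("File.path", chunksize=1000000, header=0, names=new_column_names)
processed_df = pd.concat(preprocess(c) for c in chunks)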

Assistance with splitting data frame to new columns

I'm having trouble splitting a data frame by _ and creating new columns from it.
The original string:
AMAT_0000006951_10Q_20200726_Item1A_excerpt.txt as section
My current code:
df = pd.DataFrame(myList,columns=['section','text'])
#df['text'] = df['text'].str.replace('•','')
df['section'] = df['section'].str.replace('Item1A', 'Filing Section: Risk Factors')
df['section'] = df['section'].str.replace('Item2_', 'Filing Section: Management Discussion and Analysis')
df['section'] = df['section'].str.replace('excerpt.txt', '').str.replace(r'\d{10}_|\d{8}_', '')
df.to_csv("./SECParse.csv", encoding='utf-8-sig', sep=',',index=False)
Output:
section text
AMAT_10Q_Filing Section: Risk Factors_ The COVID-19 pandemic and global measures taken in response
thereto have adversely impacted, and may continue to adversely
impact, Applied’s operations and financial results.
AMAT_10Q_Filing Section: Risk Factors_ The COVID-19 pandemic and measures taken in response by
governments and businesses worldwide to contain its spread,
AMAT_10Q_Filing Section: Risk Factors_ The degree to which the pandemic ultimately impacts Applied’s
financial condition and results of operations and the global
economy will depend on future developments beyond our control
I would really like to split up 'section' in a way that puts it in new columns based on '_'.
I've tried so many different variations of regex to split 'section', and all of them either gave me headings with no fill or added columns after 'section' and 'text', which isn't useful. I should also add there are going to be around 100,000 observations.
Desired result:
Ticker Filing type Section Text
AMAT 10Q Filing Section: Risk Factors The COVID-19 pandemic and global measures taken in response
Any guidance would be appreciated.
If you always know the number of splits, you can do something like:
import pandas as pd
df = pd.DataFrame({ "a": [ "test_a_b", "test2_c_d" ] })
# Split column by "_"
items = df["a"].str.split("_")
# Get the last item from the split column and place it in "b"
df["b"] = items.apply(list.pop)
# Get the next-to-last item from the split column and place it in "c"
df["c"] = items.apply(list.pop)
# Get the final item from the split column and place it in "d"
df["d"] = items.apply(list.pop)
That way, the dataframe will turn into
a b c d
0 test_a_b b a test
1 test2_c_d d c test2
Since you want the columns to be on a certain order, you can reorder the dataframe's columns as below:
>>> df = df[[ "d", "c", "b", "a" ]]
>>> df
d c b a
0 test a b test_a_b
1 test2 c d test2_c_d
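An alternative, since the desired output has named columns, is to split with expand=True and assign the pieces directly. A sketch, assuming the 'section' values always follow the Ticker_FilingType_Section pattern after the cleanup shown in the question:
# split 'section' into its parts in one pass and keep the first three
parts = df['section'].str.split('_', expand=True)
df['Ticker'] = parts[0]
df['Filing type'] = parts[1]
df['Section'] = parts[2]
df = df[['Ticker', 'Filing type', 'Section', 'text']]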

Python Scikit-Learn PCA: Get Component Score

I am trying to perform a Principal Component Analysis for work. While I have been successful in getting the principal components laid out, I don't really know how to assign the resulting component score to each line item. I am looking for an output sort of like this:
Town PrinComponent 1 PrinComponent 2 PrinComponent 3
Columbia 0.31989 -0.44216 -0.44369
Middletown -0.37101 -0.24531 -0.47020
Harrisburg -0.00974 -0.06105 0.32792
Newport -0.38678 0.40935 -0.62996
The scikit-learn docs are not being helpful in this circumstance. Can anybody explain to me how I can reach this output?
The code I have so far is below.
def perform_PCA(df):
    threshold = 0.1
    pca = decomposition.PCA(n_components=3)
    numpyMatrix = df.as_matrix().astype(float)
    scaled_data = preprocessing.scale(numpyMatrix)
    pca.fit(scaled_data)
    pca.transform(scaled_data)
    pca_components_df = pd.DataFrame(data=pca.components_, columns=df.columns.values)
    #print(pca_components_df)
    #pca_components_df.to_csv('pca_components_df.csv')
    filtered = pca_components_df[abs(pca_components_df) > threshold]
    trans_filtered = filtered.T
    #print(filtered.T)  # Transformed Dataframe
    trans_filtered.to_csv('trans_filtered.csv')
    print(pca.explained_variance_ratio_)
I pumped the transformed array into the data portion of the DataFrame function, and then defined the columns and index by putting them into columns= and index=, respectively.
pd.DataFrame(data=transformed, columns=["PC1", "PC2"], index=df.index)
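Putting this together, a minimal sketch of how the per-row component scores could be built (assuming df is indexed by Town and contains only numeric feature columns; fit_transform returns one row of scores per input row):
import pandas as pd
from sklearn import decomposition, preprocessing

def pca_scores(df, n_components=3):
    # standardize the features, then fit PCA and project every row onto the components
    scaled = preprocessing.scale(df.to_numpy(dtype=float))
    pca = decomposition.PCA(n_components=n_components)
    transformed = pca.fit_transform(scaled)
    # one score per row and per principal component, keyed by the original index (the towns)
    cols = [f'PrinComponent {i + 1}' for i in range(n_components)]
    return pd.DataFrame(transformed, columns=cols, index=df.index)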
