I have a table of "Borrower Personal ID" and "Loan ID".
BwrPersonId LoanId
113225 16330
113225 27073
113225 68842
113253 16341
113269 16348
113285 16354
113289 26768
113297 16360
113299 16361
113319 16369
113418 16403
113418 26854
I want to know which loans belong to the same borrower, so I tried to group the "LoanId" values by "BwrPersonId".
I expect one row per borrower, with that borrower's loans spread across columns (Loan1, Loan2, ...).
Here is my code, but it doesn't work.
grouped = pd.DataFrame()
unique = loan['BwrPersonId'].unique()
grouped['BwrPersonId'] = ''*len(loan['BwrPersonId'].unique())
grouped['Loan1'] = ''
grouped['Loan2'] = ''
grouped['Loan3'] = ''
grouped['Loan4'] = ''
grouped['Loan5'] = ''
grouped.iloc[:,0] = unique
for i in grouped.index:
    idloan = loan.loc[loan['BwrPersonId'] == unique[i], 'LoanId']
    grouped.iloc[i,1:len(idloan)+1] = idloan
    print(i)
How can I do it now? And is there any other way that can simplify the code? Thanks a lot for your help.
Basically, you need a temporary array (temp) with one row per borrower and one column per loan slot. Factorizing the IDs tells you which row each loan belongs in, and a running counter per ID tells you which column.
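import pandas as pd
import numpy as np
from collections import defaultdict
from itertools import count
# `loan` is the original table with columns BwrPersonId and LoanId
counters = defaultdict(count)  # one running counter per borrower
codes, borrowers = pd.factorize(loan['BwrPersonId'])  # row index for each loan
slots = np.array([next(counters[c]) for c in codes])  # column index for each loan
n_rows, n_cols = len(borrowers), slots.max() + 1
temp = np.empty((n_rows, n_cols), dtype=object)  # np.object was removed in NumPy 1.24
temp[codes, slots] = loan['LoanId'].to_numpy()
df1 = pd.DataFrame({'BwrPersonId': borrowers})
# Name the loan columns from the data instead of hard-coding five of them
df2 = pd.DataFrame(temp, columns=[f'Loan{k + 1}' for k in range(n_cols)])
final = df1.join(df2)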
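Since the question also asks for a simpler way, here is a minimal pandas-only sketch of the same reshaping. It is an alternative to the NumPy approach above, not the answerer's method. The small `loan` frame is made-up sample data taken from the question's table, and the `slot` column name is only an illustrative choice.

```python
import pandas as pd

# Sample data copied from the question's table (subset, for illustration)
loan = pd.DataFrame({
    "BwrPersonId": [113225, 113225, 113225, 113253, 113418, 113418],
    "LoanId": [16330, 27073, 68842, 16341, 16403, 26854],
})

# Number each borrower's loans 1, 2, 3, ... in order of appearance
slot = loan.groupby("BwrPersonId").cumcount() + 1

# One row per borrower, one column per loan slot
wide = (
    loan.assign(slot=slot)
        .pivot(index="BwrPersonId", columns="slot", values="LoanId")
        .add_prefix("Loan")
        .reset_index()
)
wide.columns.name = None  # drop the leftover "slot" label on the column axis
print(wide)
```

Borrowers with fewer loans than the maximum get NaN in the unused columns, which also turns the loan columns into floats.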
I want to know how to get the union of several time intervals for each id and name?
import pandas as pd
id = [1,1,1,1,1,1,1,2,2,2]
name = ['A','A','A','A','A','B','B','C','C','C']
Start_time = ['2005-06-27','2005-07-07','2005-07-12','2006-11-15','2008-08-22','2009-03-03','2009-03-06','2007-10-26','2007-10-31','2007-11-06']
Final_time = ['2005-07-07','2005-07-12','2005-09-26','2008-08-22','2009-02-24','2009-03-06','2009-03-12','2007-10-31','2007-11-05','2007-11-09']
dataframe = pd.DataFrame({'id':id,'name':name,'Start_time':Start_time,'Final_time':Final_time})
dataframe['Start_time'] = pd.to_datetime(dataframe['Start_time'])
dataframe['Final_time'] = pd.to_datetime(dataframe['Final_time'])
The result might look like this:
If the time intervals for an id and name can be merged, then the merged rows should share the same result_S and result_F, just like the image shows.
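You can accomplish that with a combination of the mask, forward fill, and backward fill methods in pandas. Example code below:
# Compare each row (df_s) with the row directly above it (df_f)
df_s = dataframe[1:]
df_f = dataframe[:-1]
# True where a row starts exactly when the previous row of the same id and name ends
conds = (df_s.Start_time.values == df_f.Final_time.values) & (df_s.name.values == df_f.name.values) & (df_s.id.values == df_f.id.values)
conds_s = [False] + list(conds)  # rows that continue the previous interval
conds_f = list(conds) + [False]  # rows that are continued by the next interval
# Blank out the interior boundaries, then fill them from the ends of each merged run
dataframe['Result_S'] = dataframe['Start_time'].mask(conds_s).ffill()
dataframe['Result_F'] = dataframe['Final_time'].mask(conds_f).bfill()
Note that the solution above assumes the dataframe is already sorted by id, name, and Start_time, and that two intervals are merged only when one starts exactly where the previous one ends.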
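If the rows are not sorted, or intervals overlap rather than just touch, a different technique may be easier: sort first, track the latest end time seen so far in each group, and start a new block whenever a row begins after that. The following is a rough sketch under those assumptions, using a shortened copy of the question's data. The helper names `prev_end`, `new_block`, and `block` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2],
    "name": ["A", "A", "A", "A", "C", "C", "C"],
    "Start_time": pd.to_datetime([
        "2005-06-27", "2005-07-07", "2005-07-12", "2006-11-15",
        "2007-10-26", "2007-10-31", "2007-11-06"]),
    "Final_time": pd.to_datetime([
        "2005-07-07", "2005-07-12", "2005-09-26", "2008-08-22",
        "2007-10-31", "2007-11-05", "2007-11-09"]),
})

# Sort so that each (id, name) group is contiguous and ordered by start time
df = df.sort_values(["id", "name", "Start_time"]).reset_index(drop=True)

# Latest end time among the earlier rows of the same (id, name); NaT for the first row
prev_end = df.groupby(["id", "name"])["Final_time"].transform(
    lambda s: s.cummax().shift()
)

# A row opens a new block if it is first in its group or starts after everything before it ended
new_block = prev_end.isna() | (df["Start_time"] > prev_end)
block = new_block.cumsum()

# Every row in a block gets the block's overall start and end
df["Result_S"] = df.groupby(block)["Start_time"].transform("min")
df["Result_F"] = df.groupby(block)["Final_time"].transform("max")
print(df)
```

Unlike the equality check above, this treats touching and overlapping intervals alike, so check that this is the union you want before relying on it.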
import numpy as np
#Collect the compound values for each news source
score_table = df.pivot_table(index='User', values="Compound", aggfunc = np.mean)
score_table
from collections import Counter
import pandas as pd
a = dict(Counter(HT_positive))
t = list(a.items())
compound = score_table["Compound"]
df = pd.DataFrame(t, columns=["Hashtags", "Number of Occurence"])
df4 = df.append(compound)
df.to_csv('hashtags.csv', index=False)
df4_saved_file = pd.read_csv('hashtags.csv')
df4_saved_file
I'm getting a KeyError: "Compound". I want to understand how to add the "Compound" column between "Hashtags" and "Number of Occurence".
I think you should check whether "Compound" or "User" is the key you need when looking up the value in score_table.
I'm currently working with pybliometrics (Scopus) and want to write a loop that gets affiliation information for multiple authors.
Basically, this is what the body of my loop should do for one author. How do I do that for many authors?
from pybliometrics.scopus import AuthorRetrieval
import pandas as pd
import numpy as np
au = AuthorRetrieval(authorid)
au.affiliation_history
au.identifier
x = au.identifier
refs2 = au.affiliation_history
len(refs2)
refs2
df = pd.DataFrame(refs2)
df.columns
a_history = df
df['authorid'] = x
#moving authorid to 0
cols = list(df)
cols.insert(0, cols.pop(cols.index('authorid')))
df = df.loc[:, cols]
df.to_excel("af_historyfinal.xlsx")
Turning your code into a loop over multiple author IDs? Nothing easier than that. Let's say AUTHOR_IDS contains 7004212771 and 57209617104:
import pandas as pd
from pybliometrics.scopus import AuthorRetrieval
def retrieve_affiliations(auth_id):
    """Author's affiliation history from Scopus as DataFrame."""
    au = AuthorRetrieval(auth_id)
    df = pd.DataFrame(au.affiliation_history)
    df["auth_id"] = au.identifier
    return df
AUTHOR_IDS = [7004212771, 57209617104]
# Option 1, for few IDs
df = pd.concat([retrieve_affiliations(a) for a in AUTHOR_IDS])
# Option 2, for many IDs
df = pd.DataFrame()
for a in AUTHOR_IDS:
    # DataFrame.append was removed in pandas 2.0, so concatenate instead
    df = pd.concat([df, retrieve_affiliations(a)])
# Have author ID as first column
df = df.set_index("auth_id").reset_index()
df.to_excel("af_historyfinal.xlsx", index=False)
If, say, your IDs are in a comma-separated file called "input.csv", with one column called "authors", then you start with
AUTHOR_IDS = pd.read_csv("input.csv")["authors"].unique()
I know my code isn't close to right, but I'm trying to loop through a list of CSVs, line by line, and create a new CSV in which each line lists all the CSVs that met a condition. The first column in every CSV is "date". For each date, I want to list the names of all CSVs where data["entry"] > 3 on that date, with date still being the first column.
Update: For each CSV, I want to make a list of the dates on which the condition was met, and on those dates append file_name to the matching row(s) of the new CSV.
###create list from dir
listdrs = os.listdir('c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/SentdexTutorial/stock_dfs/')
###append full path to list
string = 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/SentdexTutorial/stock_dfs/'
listdrs_path = [ string + x for x in listdrs]
complete_string = ' is complete'
listdrs_confirmation = [ x + complete_string for x in listdrs]
#print (listdrs_path)
###start loop, for each "file" in listdrs run the 2 functions below and overwrite saved csv.
for file_path in listdrs_path:
    data = pd.read_csv(file_path, index_col=0)
    ########################################
    ####function 1
    def get_price_hist(ticker):
        # Put stock price data in dataframe
        data = pd.read_csv(file_path)
        #listdr = os.listdir('Users\17409\AppData\Local\Programs\Python\Python38\Indicators\Sentdex Tutorial\stock_dfs')
        ##print(listdr)
        # Convert date to timestamp and make index
        data.index = data["date"].apply(lambda x: pd.Timestamp(x))
        data.drop("date", axis=1, inplace=True)
        return data
    ##create new table and append data
    data = data[data.Entry > 3]
    for date in data.date:
        new_table[date].append(file_path)
new_table_data = data.DataFrame([(k, ','.join(new_table[k])) for k in sorted(new_table.keys())], columns=['date', 'table names'])
print(new_table_data)
I would do something like this. You need to modify the following snippet according to your needs.
import pandas as pd
from glob import glob
from collections import defaultdict
# create and save some random data
df1 = pd.DataFrame({'date':[1,2,3], 'entry':[4,3,2]})
df2 = pd.DataFrame({'date':[1,2,3], 'entry':[1,2,4]})
df3 = pd.DataFrame({'date':[1,2,3], 'entry':[3,1,5]})
df1.to_csv('table1.csv')
df2.to_csv('table2.csv')
df3.to_csv('table3.csv')
# read all the csv
tables = glob('*.csv')
new_table = defaultdict(list)
# create new table
for table in tables:
    df = pd.read_csv(table)
    # keep only the rows that meet the condition
    df = df[df.entry > 2]
    for date in df.date:
        new_table[date].append(table)
new_table_df = pd.DataFrame([(k, ','.join(new_table[k])) for k in sorted(new_table.keys())], columns=['date', 'table names'])
print(new_table_df)
   date            table names
0     1  table3.csv,table1.csv
1     2             table1.csv
2     3  table2.csv,table3.csv
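The same result can probably be reached without the explicit dictionary by stacking all tables and grouping by date. This is only a sketch: it builds the three sample tables in memory instead of reading them from disk, and the `tables` dict is a stand-in for whatever `pd.read_csv` returns for each file.

```python
import pandas as pd

# Stand-ins for the CSV files; in practice each value would come from pd.read_csv
tables = {
    "table1.csv": pd.DataFrame({"date": [1, 2, 3], "entry": [4, 3, 2]}),
    "table2.csv": pd.DataFrame({"date": [1, 2, 3], "entry": [1, 2, 4]}),
    "table3.csv": pd.DataFrame({"date": [1, 2, 3], "entry": [3, 1, 5]}),
}

# Stack the tables and keep the file name as a regular column
combined = pd.concat(tables, names=["table"]).reset_index(level="table")

# Keep the rows that meet the condition, then list the file names per date
hits = combined[combined["entry"] > 2]
result = (
    hits.groupby("date")["table"]
        .agg(",".join)
        .reset_index(name="table names")
)
print(result)
```

Here the file names within a date follow the order of the `tables` dict rather than the order returned by `glob`.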
I had some issues with the other code, so here is the final solution I was able to come up with.
if 'Entry' in data:
    ##create new table and append data
    data = data[data.Entry > 3]
    if 'date' in data:
        for date in data.date:
            if date not in new_table:
                new_table[date] = []
            new_table[date].append(
                pd.DataFrame({'FileName': [file_name], 'Entry': [int(data[data.date == date].Entry)]}))
        new_table
    elif 'Date' in data:
        for date in data.Date:
            if date not in new_table:
                new_table[date] = []
            new_table[date].append(
                pd.DataFrame({'FileName': [file_name], 'Entry': [int(data[data.Date == date].Entry)]}))
# sorted(new_table, key=lambda x: x[0])
def find_max(tbl):
    new_table_data = {}
    for date in sorted(tbl.keys()):
        merged_dt = pd.concat(tbl[date])
        max_entry_v = max(list(merged_dt.Entry))
        tbl_names = list(merged_dt[merged_dt.Entry == max_entry_v].FileName)
        new_table_data[date] = tbl_names
    return new_table_data
new_table_data = find_max(tbl=new_table)
#df = pd.DataFrame(new_table, columns =['date', 'tickers'])
#df.to_csv(input_path, index = False, header = True)
# find_max(new_table)
# new_table_data = pd.DataFrame([(k, ','.join(new_table[k])) for k in sorted(new_table.keys())],
# columns=['date', 'table names'])
print(new_table_data)
I have a list of patient IDs with drug names and a list of patient IDs with disease names. I want to find the most indicative drug for each disease.
To find this, I want to run Fisher's exact test to get the p-value for each disease/drug pair. But the loop runs very slowly, taking more than 10 hours. Is there a way to make the loop more efficient, or a better way to solve this association problem?
My loop:
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact
most_indicative_medication = {}
rx_list = list(meps_meds.rxName.unique())
disease_list = list(meps_base_data.columns.values)[8:]
for i in disease_list:
    print(i)
    rx_dict = {}
    for j in rx_list:
        subset = base[['id', i, 'rxName']].drop_duplicates()
        subset[j] = subset['rxName'] == j
        subset = subset.loc[subset[i].isin(['Yes', 'No'])]
        subset = subset[[i, j]]
        tab = pd.crosstab(subset[i], subset[j])
        if len(tab.columns) == 2:
            rx_dict[j] = fisher_exact(tab)[1]
        else:
            rx_dict[j] = np.nan
    most_indicative_medication[i] = min(rx_dict, key=rx_dict.get)
You need multiprocessing/multithreading. I have added the code:
from multiprocessing.dummy import Pool as ThreadPool
most_indicative_medication = {}
rx_list = list(meps_meds.rxName.unique())
disease_list = list(meps_base_data.columns.values)[8:]
def run_pairwise(i):
    print(i)
    rx_dict = {}
    for j in rx_list:
        subset = base[['id', i, 'rxName']].drop_duplicates()
        subset[j] = subset['rxName'] == j
        subset = subset.loc[subset[i].isin(['Yes', 'No'])]
        subset = subset[[i, j]]
        tab = pd.crosstab(subset[i], subset[j])
        if len(tab.columns) == 2:
            rx_dict[j] = fisher_exact(tab)[1]
        else:
            rx_dict[j] = np.nan
    # results are collected in the shared dict, one entry per disease
    most_indicative_medication[i] = min(rx_dict, key=rx_dict.get)
pool = ThreadPool(3)
pairwise_test_results = pool.map(run_pairwise,disease_list)
pool.close()
pool.join()
Notes: http://chriskiehl.com/article/parallelism-in-one-line/
Crunching faster is good, but a better algorithm usually beats it ;-)
Filling in a bit,
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact
# data files can be downloaded at
# https://github.com/Saynah/platform/tree/d7e9f150ef2ff436387585960ca312a301847a46/data
meps_meds = pd.read_csv("meps_meds.csv") # 8 cols * 1,148,347 rows
meps_base_data = pd.read_csv("meps_base_data.csv") # 18 columns * 61,489 rows
# merge to get disease and drug info in same table
merged = pd.merge( # 25 cols * 1,148,347 rows
    meps_base_data, meps_meds,
    how='inner', left_on='id', right_on='id'
)
rx_list = meps_meds.rxName.unique().tolist() # 9218 items
disease_list = meps_base_data.columns.values[8:].tolist() # 10 items
Note that rx_list has a LOT of near-duplicates (e.g. 52 entries for Amoxicillin, if you include misspellings).
Then
most_indicative = {}
for disease in disease_list:
    # get unique (id, disease, prescription)
    subset = merged[['id', disease, 'rxName']].drop_duplicates()
    # keep only Yes/No entries under disease
    subset = subset[subset[disease].isin(['Yes', 'No'])]
    # summarize (replaces the inner loop)
    tab = pd.crosstab(subset.rxName, subset[disease])
    # bind the No/Yes totals for the Fisher exact function
    nf, yf = tab.sum().values
    def p_value(x, nf=nf, yf=yf):
        return fisher_exact([[nf - x.No, x.No], [yf - x.Yes, x.Yes]])[1]
    # OPTIONAL:
    # We can probably assume that the most-indicative drugs are among
    # the most-prescribed; get just the 100 most-prescribed drugs
    # Note: you have to get the nf, yf values before doing this!
    tab = tab.sort_values("Yes", ascending=False)[:100]
    # and apply the function
    tab["P-value"] = tab.apply(p_value, axis=1)
    # find the best match
    best_med = tab.sort_values("P-value").index[0]
    most_indicative[disease] = best_med
This now runs in about 18 minutes on my machine, and you could probably combine it with EM28's answer to speed it up by another factor of 4 or more.
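To see what the crosstab step does on its own, here is a toy sketch with made-up data for a single disease column. The drug names "A", "B", "C" and the column name `disease` are placeholders, not values from the MEPS files.

```python
import pandas as pd
from scipy.stats import fisher_exact

# Made-up (drug, disease status) pairs, one row per patient/drug combination
subset = pd.DataFrame({
    "rxName":  ["A", "A", "A", "B", "B", "B", "B", "C"],
    "disease": ["Yes", "Yes", "No", "No", "No", "No", "Yes", "No"],
})

# One row per drug, with counts of "No" and "Yes" patients as columns
tab = pd.crosstab(subset.rxName, subset.disease)

# Column totals, in the crosstab's column order ("No", then "Yes")
nf, yf = tab.sum().values

# 2x2 table per drug: patients without the drug vs. with the drug, split by disease status
p_values = tab.apply(
    lambda x: fisher_exact([[nf - x.No, x.No], [yf - x.Yes, x.Yes]])[1],
    axis=1,
)
best_med = p_values.idxmin()  # drug with the smallest p-value
print(tab)
print(p_values)
```

A single crosstab yields the counts for every drug at once, which is why the inner loop over rx_list is no longer needed.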