How to Merge Multilevel-Column DataFrames on a Lower-Level Column - python

I have several small datasets from a database listing genes in different biological pathways. My end goal is to find which genes show up in multiple datasets. To that end, I tried to make a multilevel dataframe from each dataset and merge them on a single column, but it is getting me nowhere.
Test samples: https://www.mediafire.com/file/bks9i9unfci0h1f/sample.rar/file
Making multilevel columns:
import pandas as pd
df1 = pd.read_csv("Bacterial invasion of epithelial cells.csv")
df2 = pd.read_csv("C-type lectin receptor signaling pathway.csv")
df3 = pd.read_csv("Endocytosis.csv")
title1 = "Bacterial invasion of epithelial cells"
title2 = "C-type lectin receptor signaling pathway"
title3 = "Endocytosis"
final1 = pd.concat({title1: df1}, axis = 1)
final2 = pd.concat({title2: df2}, axis = 1)
final3 = pd.concat({title3: df3}, axis = 1)
I tried to use pandas.merge() to merge the dataframes on the "User ID" column:
pd.merge(final1, final2, on = "User ID", how = "outer")
But I get an error. I cannot use droplevel(), because I need the title on top so I can see which dataset each sample belongs to.
Any suggestions?

Since you want to see which genes appear in multiple datasets, it sounds like an inner join might be more useful, with User ID as a single row index.
df1 = pd.read_csv("Bacterial invasion of epithelial cells.csv").set_index('User ID')
df2 = pd.read_csv("C-type lectin receptor signaling pathway.csv").set_index('User ID')
df3 = pd.read_csv("Endocytosis.csv").set_index('User ID')
final1 = pd.concat({"Bacterial invasion of epithelial cells": df1}, axis = 1)
final2 = pd.concat({"C-type lectin receptor signaling pathway": df2}, axis = 1)
final3 = pd.concat({"Endocytosis": df3}, axis = 1)
final1.merge(final3, left_index=True, right_index=True)#.merge(final2, left_index=True, right_index=True)
Output:
Bacterial invasion of epithelial cells Endocytosis
Gene Symbol Gene Name Entrez Gene Score Gene Symbol Gene Name Entrez Gene Score
User ID
P51636 CAV2 caveolin 2 858 1.3911 CAV2 caveolin 2 858 1.3911
Q03135 CAV1 caveolin 1 857 1.5935 CAV1 caveolin 1 857 1.5935
(I've commented out the second merge operation with final2 as there aren't any overlapping genes between it and the other two, but you can repeat that process with as many datasets as you like.)
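If you end up with more than two or three pathway files, the same index-on-index merge can be folded over a list of labelled dataframes. A minimal sketch, assuming the CSVs are named after their pathway titles as above (how="inner" keeps only genes present in every dataset; switch to "outer" to keep every gene and get NaN where a dataset lacks it):
from functools import reduce
import pandas as pd

titles = ["Bacterial invasion of epithelial cells",
          "C-type lectin receptor signaling pathway",
          "Endocytosis"]
# Label each dataset with its pathway title as the top column level.
frames = [pd.concat({t: pd.read_csv(t + ".csv").set_index("User ID")}, axis=1)
          for t in titles]
# Fold the pairwise merge over the whole list.
merged = reduce(lambda left, right: left.merge(right, left_index=True, right_index=True, how="inner"),
                frames)
print(merged)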

Related

Assistance with splitting a data frame into new columns

I'm having trouble splitting a data frame column on "_" and creating new columns from it.
The original string is
AMAT_0000006951_10Q_20200726_Item1A_excerpt.txt stored as the section value.
My current code:
df = pd.DataFrame(myList,columns=['section','text'])
#df['text'] = df['text'].str.replace('•','')
df['section'] = df['section'].str.replace('Item1A', 'Filing Section: Risk Factors')
df['section'] = df['section'].str.replace('Item2_', 'Filing Section: Management Discussion and Analysis')
df['section'] = df['section'].str.replace('excerpt.txt', '').str.replace(r'\d{10}_|\d{8}_', '')
df.to_csv("./SECParse.csv", encoding='utf-8-sig', sep=',',index=False)
Output:
section text
AMAT_10Q_Filing Section: Risk Factors_ The COVID-19 pandemic and global measures taken in response
thereto have adversely impacted, and may continue to adversely
impact, Applied’s operations and financial results.
AMAT_10Q_Filing Section: Risk Factors_ The COVID-19 pandemic and measures taken in response by
governments and businesses worldwide to contain its spread,
AMAT_10Q_Filing Section: Risk Factors_ The degree to which the pandemic ultimately impacts Applied’s
financial condition and results of operations and the global
economy will depend on future developments beyond our control
I would really like to split 'section' on '_' into new columns.
I've tried many different regex variations to split 'section', and all of them either gave me headings with no values or added the new columns after section and text, which isn't useful. I should also add that there will be around 100,000 observations.
Desired result:
Ticker Filing type Section Text
AMAT 10Q Filing Section: Risk Factors The COVID-19 pandemic and global measures taken in response
Any guidance would be appreciated.
If you always know the number of splits, you can do something like:
import pandas as pd
df = pd.DataFrame({ "a": [ "test_a_b", "test2_c_d" ] })
# Split column by "_"
items = df["a"].str.split("_")
# Get the last item from the split column and place it in "b"
df["b"] = items.apply(list.pop)
# Get the next item from the end and place it in "c"
df["c"] = items.apply(list.pop)
# Get the final item and place it in "d"
df["d"] = items.apply(list.pop)
That way, the dataframe will turn into
a b c d
0 test_a_b b a test
1 test2_c_d d c test2
Since you want the columns to be on a certain order, you can reorder the dataframe's columns as below:
>>> df = df[[ "d", "c", "b", "a" ]]
>>> df
d c b a
0 test a b test_a_b
1 test2 c d test2_c_d
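If the number of underscore-separated fields is fixed, a single vectorized split gives the same result without the repeated pops. A minimal sketch on the same toy frame (the column names are just illustrative):
import pandas as pd

df = pd.DataFrame({"a": ["test_a_b", "test2_c_d"]})
# Expand each underscore-separated piece into its own column, left to right.
parts = df["a"].str.split("_", expand=True)
parts.columns = ["d", "c", "b"]
# Keep the original column alongside the split pieces.
df = pd.concat([parts, df], axis=1)
print(df)
This avoids the three separate apply passes, which matters at the ~100,000 rows mentioned in the question.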

Match two columns in different dataframes and show match score in python - fast

I have two dataframes df1 and df2.
df1 = pd.DataFrame({'Name': ['Zebra system','Lion healthcare'], 'Type': ['S','A']})
df2 = pd.DataFrame({'AltName': ['Zebra system llc','abra inc. 54','Lions corp health care','Zebra sys co','lions system atl'], 'Adr': ['45 main st','23 zoo ave', '12 zoo blvd.','56 veg st','23 peach st']})
Example above. df2 has about 300k records and df1 has about 10k records. I want to match df1's Name against df2's AltName and get a new dataframe with each df1 row, its potential matches from df2, and the match score. I also want an adjustable score threshold, for instance only keeping matches from df2 that score above 80.
This is what I have right now:
from fuzzywuzzy import fuzz  # or: from rapidfuzz import fuzz

matched = pd.DataFrame({'df1-name': [], 'df1-type': [], 'df2-name': [], 'df2-adr': []})
for row in df1.index:
    first = df1.loc[row, "Name"]
    type1 = df1.loc[row, "Type"]
    for row2 in df2.index:
        second = df2.loc[row2, "AltName"]
        adr1 = df2.loc[row2, "Adr"]
        matched_token = fuzz.partial_ratio(first, second)
        if matched_token > 60:
            matched.loc[row2, "df1-name"] = first
            matched.loc[row2, "df1-type"] = type1
            matched.loc[row2, "df2-name"] = second
            matched.loc[row2, "df2-adr"] = adr1
However, this is very slow; the code has been running for over 16 hours and counting...
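As a sketch of a faster route (not from the original post), the inner Python loop can be replaced with a scorer that runs over all candidates in compiled code. This assumes the rapidfuzz package; the question does not say which library provides fuzz.partial_ratio, but rapidfuzz ships a compatible version, and its process.extract applies the score threshold for you:
import pandas as pd
from rapidfuzz import fuzz, process  # assumption: rapidfuzz is available

choices = df2["AltName"].tolist()
rows = []
for _, r in df1.iterrows():
    # One pass over all ~300k candidates per name; keep matches scoring above 80.
    for _, score, idx in process.extract(r["Name"], choices,
                                         scorer=fuzz.partial_ratio,
                                         score_cutoff=80, limit=None):
        rows.append({"df1-name": r["Name"], "df1-type": r["Type"],
                     "df2-name": df2.iloc[idx]["AltName"],
                     "df2-adr": df2.iloc[idx]["Adr"], "score": score})
matched = pd.DataFrame(rows)
The score_cutoff argument is the adjustable threshold asked for; raising it keeps only closer matches.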

Python ValueError: Can only compare identically-labeled Series objects

The two dataframes I am comparing are of different sizes (they have the same index, though), and I suppose that is why I am getting the error. Can you please suggest a way to get around it? I am looking for the rows in df2 whose user_id matches one in df1. Thanks, and I appreciate your response.
import numpy as np
import pandas as pd

data = np.array([['user_id','comment','label'],
[100,'RT #Dvillain_: #oomf should text me.',0],
[100,'Buy viagra',1],
[101,'#nowplaying M.C. Shan - Juice Crew Law on',0],
[101,'Buy viagra two',1]])
data2 = np.array([['user_id','comment','label'],
[100,'First comment',0],
[100,'Buy viagra',1],
[102,'Buy viagra two',1]])
df1 = pd.DataFrame(data=data[1:,0:],columns = data[0,0:])
df2 = pd.DataFrame(data=data2[1:,0:],columns = data2[0,0:])
df = df2[df2['user_id'] == df1['user_id']]
You are looking for isin
df = df2[df2['user_id'].isin(df1['user_id'])]
df
Out[814]:
user_id comment label
0 100 First comment 0
1 100 Buy viagra 1
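An equivalent formulation, sketched here as an alternative rather than part of the original answer, is an inner merge against the de-duplicated user IDs, which returns the same rows of df2:
df = df2.merge(df1[['user_id']].drop_duplicates(), on='user_id')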

Pandas Dataframe: Accessing via composite index created by groupby operation

I want to calculate a group-specific ratio gathered from two datasets.
The two Dataframes are read from a database with
leases = pd.read_sql_query(sql, connection)
sales = pd.read_sql_query(sql, connection)
one for real estate offered for sale, the other for rented objects.
Then I group both of them by their city and the category I'm interested in:
leasegroups = leases.groupby(['IDconjugate', "city"])
salegroups = sales.groupby(['IDconjugate', "city"])
Now I want to know the ratio between the cheapest rental object per category and city and the most expensively sold object to obtain a lower bound for possible return:
minlease = leasegroups['price'].min()
maxsale = salegroups['price'].max()
ratios = minlease*12/maxsale
I get an output like: Category - City: Ratio
But I cannot access the ratio object by city or category. I tried creating a new dataframe with:
newframe = pd.DataFrame({"Minleases" : minlease,"Maxsales" : maxsale,"Ratios" : ratios})
newframe = newframe.loc[newframe['Ratios'].notnull()]
which gives me the correct rows, and newframe.index returns the groups.
index.names gives ['IDconjugate', 'city'], but indexing results in a KeyError. How can I make an index out of the different groups: ID0+city1, ID0+city2, etc.?
EDIT:
The output looks like this:
Maxsales Minleases Ratios
IDconjugate city
1 argeles gazost 59500 337 0.067966
chelles 129000 519 0.048279
enghien-les-bains 143000 696 0.058406
esbly 117990 495 0.050343
foix 58000 350 0.072414
The goal was to select the top ratios and plot them with bokeh, which, as I understand it, takes a dataframe object and plots a column versus the index:
topselect = ratio.loc[ratio["Ratios"] > ratio["Ratios"].quantile(quant)]
dots = Dot(topselect, values='Ratios', label=topselect.index, tools=[hover,],
title="{}% best minimal Lease/Sale Ratios per City and Group".format(topperc*100), width=600)
I really only needed the index as a list in the original order, so the following worked:
ids = []
cities = []
for l in topselect.index:
    ids.append(str(int(l[0])))
    cities.append(l[1])
newind = [i+"_"+j for i,j in zip(ids, cities)]
topselect.index = newind
Now the plot shows 1_city1 ... 1_city2 ... n_cityX on the x-axis. But I figure there must be some obvious way inside the pandas framework that I'm missing.
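For completeness, a minimal sketch of the pandas-native equivalents, assuming newframe and topselect still carry the ('IDconjugate', 'city') MultiIndex shown in the edit: a single group is addressed with a tuple key, and the whole index can be flattened with map instead of the manual loop.
# Look up one (IDconjugate, city) group directly on the MultiIndex
# (adjust the key dtype if IDconjugate was read as a float).
row = newframe.loc[(1, "chelles")]
# Flatten the MultiIndex into "ID_city" labels without an explicit loop.
topselect.index = topselect.index.map(lambda t: "{}_{}".format(int(t[0]), t[1]))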

Pandas row analysis for consecutive dates

I am following a "chain" of rows from a CSV file and counting the consecutive months it covers.
Currently I am reading a CSV file with 5 columns of interest (based on insurance policies):
CONTRACT_ID START-DATE END-DATE CANCEL_FLAG OLD_CON_ID
123456 2015-05-30 2016-05-30 0 8788
123457 2014-03-20 2015-03-20 0 12000
123458 2009-12-20 2010-12-20 0 NaN
...
I want to count the number of consecutive months a Contract chain goes for.
Example: take the START-DATE from the contract at the "front" of the chain (the oldest contract) and the END-DATE from the end of the chain (the newest contract). The oldest contract is defined as either the one before a cancelled contract in a chain or the one that has no OLD_CON_ID value.
Each row represents a contract, and OLD_CON_ID points to the previous contract's ID. The desired output is how many months the contract chain goes back until there is a gap (i.e., the customer didn't have a contract for a period of time). If there is nothing in that column, then that row is the first contract in the chain.
CANCEL_FLAG should also cut the chain, because a value of 1 means the contract was cancelled.
My current code counts the number of active contracts for each year by filtering the dataframe like so:
df_contract = df_contract[
    (df_contract['START_DATE'] <= pd.to_datetime('2015-05-31')) &
    (df_contract['END_DATE'] >= pd.to_datetime('2015-05-31')) &
    (df_contract['CANCEL_FLAG'] == 0)
]
df_contract = df_contract[df_contract['CANCEL_FLAG'] == 0]
activecount = df_contract.count()
print(activecount['CONTRACT_ID'])
Here are the first 6 lines of code in which I create the dataframes and adjust the datetime values:
file_name = 'EXAMPLENAME.csv'
df = pd.read_csv(file_name)
df_contract = pd.read_csv(file_name)
df_CUSTOMERS = pd.read_csv(file_name)
df_contract['START_DATE'] = pd.to_datetime(df_contract['START_DATE'])
df_contract['END_DATE'] = pd.to_datetime(df_contract['END_DATE'])
Ideal output is something like:
FIRST_CONTRACT_ID CHAIN_LENGTH CON_MONTHS
1234567 5 60
1500001 1 4
800 10 180
Those data points would then be graphed.
EDIT2: CSV file changed, might be easier now. Question updated.
Not sure if I totally understand your requirement, but does something like this work?
import numpy as np

df_contract['TOTAL_YEARS'] = (df_contract['END_DATE'] - df_contract['START_DATE']) / np.timedelta64(1, 'Y')
# Use a single & (not &&) and .loc to avoid chained assignment.
df_contract.loc[(df_contract['CANCEL_FLAG'] == 1) & (df_contract['TOTAL_YEARS'] > 1), 'TOTAL_YEARS'] = 1
After a lot of trial and error I got it working!
This finds the time difference between the first and last contracts in the chain and finds the length of the chain.
Not the cleanest code by far, but it works:
test = 'START_DATE'
df_short = df_policy[['OLD_CON_ID',test,'CONTRACT_ID']]
df_short.rename(columns={'OLD_CON_ID':'PID','CONTRACT_ID':'CID'},
inplace = True)
df_test = df_policy[['CONTRACT_ID','END_DATE']]
df_test.rename(columns={'CONTRACT_ID':'CID','END_DATE': 'PED'}, inplace = True)
df_copy1 = df_short.copy()
df_copy2 = df_short.copy()
df_copy2.rename(columns={'PID':'PPID','CID':'PID'}, inplace = True)
df_merge1 = pd.merge(df_short, df_copy2,
how='left',
on=['PID'])
df_merge1['START_DATE_y'].fillna(df_merge1['START_DATE_x'], inplace = True)
df_merge1.rename(columns={'START_DATE_x':'1_EFF','START_DATE_y':'2_EFF'}, inplace=True)
The copy, merge, fillna, and rename code is repeated for 5 merged dataframes then:
df_merged = pd.merge(df_merge5, df_test,
how='right',
on=['CID'])
df_merged['TOTAL_MONTHS'] = ((df_merged['PED'] - df_merged['6_EFF']
)/np.timedelta64(1,'M'))
df_merged4 = df_merged[
    (df_merged['PED'] >= pd.to_datetime('2015-07-06'))
]
df_merged4['CHAIN_LENGTH'] = df_merged4.drop(
    ['PED', '1_EFF', '2_EFF', '3_EFF', '4_EFF', '5_EFF'], axis=1
).apply(lambda row: len(pd.unique(row)), axis=1) - 3
Hopefully my code is understood and will help someone in the future.
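For comparison, a minimal sketch of walking each chain explicitly instead of through a fixed number of merges. It assumes the underscore column names used in the code above and that OLD_CON_ID values can be looked up as CONTRACT_IDs (they may come back as floats because of the NaNs, so the dtypes may need aligning first):
import numpy as np
import pandas as pd

df = pd.read_csv("EXAMPLENAME.csv", parse_dates=["START_DATE", "END_DATE"])
by_id = df.set_index("CONTRACT_ID")  # O(1) lookups while walking backwards

def chain_stats(contract_id):
    """Walk OLD_CON_ID links back until a gap, a cancelled contract, or the chain start."""
    end = by_id.loc[contract_id, "END_DATE"]
    length, start, current = 0, None, contract_id
    while pd.notna(current) and current in by_id.index:
        row = by_id.loc[current]
        if row["CANCEL_FLAG"] == 1:  # a cancelled contract cuts the chain
            break
        length += 1
        start = row["START_DATE"]
        current = row["OLD_CON_ID"]
    months = (end - start) / np.timedelta64(1, "M") if start is not None else 0
    return length, round(months)
Calling chain_stats on the newest contract of each chain (a CONTRACT_ID that never appears in another row's OLD_CON_ID) gives one (CHAIN_LENGTH, CON_MONTHS) pair per chain, close to the desired output above.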
