How to compare two dataframes in pandas - python

I have two dataframes:

The first one has n rows of names.
The second one has n rows of names.

For each name in the first dataframe, I want to see how many times it appears in the second dataframe. The code looks something like this:

df5 = pd.read_excel(item1, usecols="B", skiprows=6)
df10 = pd.read_excel('SMR4xx_Change_situation.xlsm', sheet_name='LoPN', usecols='D', skiprows=4)

How do I count the number of times a name appears in the second dataframe and output it beside the name in the first dataframe?
Ex: The first name in the first dataframe is John. John appears in the second dataframe 4 times => output: John 4
Either print it in the console or write the first dataframe to a separate Excel file with the number of appearances in a second column.
Anything could help.

Well, you can create a dataframe for the records you are seeking.
First, get the list of unique names in the first dataframe:
uniqueNames = df5['B'].unique()  # assuming column B contains the names
Now you can iterate through each of the unique names and compare against the second dataframe. Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so it is more robust to collect the rows in a list and build the count frame once:
rows = []
for eachName in uniqueNames:
    # assuming you need to compare with column D
    rows.append({'name': eachName, 'count': (df10['D'] == eachName).sum()})
dfCount = pd.DataFrame(rows, columns=['name', 'count'])
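If you only need the counts table, value_counts can build it in one call; a sketch (note this counts every name occurring in df10, so filter against uniqueNames afterwards if you only want names from the first dataframe):
dfCount = (df10['D'].value_counts()
           .rename_axis('name')
           .reset_index(name='count'))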
Or, if you want the counts to be present in the first dataframe: the original np.select line compares two Series of different lengths, so a map over value_counts is used here instead, which aligns the counts by name:
# Count every name in df10 once, then map those counts onto df5's names;
# names that never appear in df10 become NaN, which we fill with 0.
df5['counts'] = df5['B'].map(df10['D'].value_counts()).fillna(0).astype(int)
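A minimal end-to-end sketch of both output options from the question (the small frames stand in for the two Excel reads, and the column labels 'B' and 'D' are assumptions; read_excel actually names columns after the header row, so adjust the labels to your real headers):
import pandas as pd

df5 = pd.DataFrame({'B': ['John', 'Mary', 'Alex']})                   # hypothetical first dataframe
df10 = pd.DataFrame({'D': ['John', 'John', 'Mary', 'John', 'John']})  # hypothetical second dataframe

out = pd.DataFrame({'name': df5['B'],
                    'count': df5['B'].map(df10['D'].value_counts()).fillna(0).astype(int)})
print(out)                                     # prints each name next to its count (John 4, ...)
out.to_excel('name_counts.xlsx', index=False)  # or write it to a separate Excel file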

Related

Pulling columns of dataframe into separate dataframe, then replacing duplicates with mean values

I'm new to the world of Python, so I apologize in advance if this question seems rudimentary. I'm trying to pull columns of one dataframe into a separate dataframe, replacing each set of duplicate columns from the first dataframe with a single column of mean values in the second dataframe. I hope this makes sense!
To provide some background, I am tracking gene expression over certain time points. I have a dataframe that is 17 rows x 33 columns. Every row in this dataframe corresponds to a particular exon, and every column corresponds to a time point (age).
The dataframe looks like this: [screenshot omitted]
Some of these columns share the same name (age), and I'd like to calculate the mean of ONLY the columns with the same name so that, for example, I get one column for "12 pcw" rather than three separate "12 pcw" columns. After that, I hope to pull these averaged values from the first dataframe into a second dataframe.
I'm hoping to use a for loop to loop through each age (column) to get the average expression across the subjects.
I will explain my process so far below:
#1) Get list of UNIQUE string names from age list
unique_ages = set(column_names)
#2) Create an empty dataframe that gives an outline of what I want my averaged data to fit/be put in
mean_df = pd.DataFrame(index=exons, columns=unique_ages)
#3) Now I want to loop through each age to get the average expression across the donors present. This is where I'm trying to utilize a for loop to create a pipeline to process other data frames that I will be working with in the future.
for age in unique_ages:
    print(age)
    age_df = df.loc[:, df.columns == age]  # pull the columns of df that carry this age, as a separate df
    if len(age_df.columns) > 1:  # check if the df has >1 SAME column; if so, take the avg across them
        mean = age_df.mean(axis=1)
        mean_df[age] = mean
    else:
        ## just pull out the values and put them into mean_df
        mean_df[age] = age_df.iloc[:, 0]
#4) Now, with my new averaged array (or the same array if multiple ages are NOT present), I want to place this array into my mean_df under the appropriate column. I understand that I should use the 'age' variable provided by the for loop to get the proper location/name of the column in my dataframe, but I'm not sure how to do this. This has all been quite a steep learning curve and I feel like it's a simple solution, but I can't seem to wrap my head around it. Any help would be greatly appreciated.
There is no need for a for loop (there often isn't with Pandas :)). You can simply use df.groupby(lambda x:x, axis=1).mean(). An example:
data = [[1,2,3],[4,5,6]]
cols = ['col1', 'col2', 'col2']
df = pd.DataFrame(data=data, columns=cols)
# col1 col2 col2
# 0 1 2 3
# 1 4 5 6
df = df.groupby(lambda x:x, axis=1).mean()
# col1 col2
# 0 1.0 2.5
# 1 4.0 5.5
The groupby function takes another function (the lambda): each column name is passed in, and the function returns the group that column belongs to. In our case, we want the column name itself to be the group. So, for the third column, named col2, it says 'this column belongs to the group named col2', which already exists (because the second column was passed earlier). You then provide the aggregation you want, in this case mean().
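One caveat worth noting: groupby(..., axis=1) is deprecated in pandas 2.x. On newer versions the same result can be had by transposing first; a small sketch with the frame from above:
df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['col1', 'col2', 'col2'])
# Transpose, group the rows by their (column-name) label, average, transpose back.
df = df.T.groupby(level=0).mean().T
#    col1  col2
# 0   1.0   2.5
# 1   4.0   5.5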

Subsetting a dataframe in pandas according to column name values

I have a dataframe in pandas that I need to split up. It is much larger than this, but here is an example:
ID A B
a 0 0
b 1 1
c 2 2
and I have a list: keep_list = ['ID','A']
and another list: recode_list = ['ID','B']
I'd like to split the dataframe up by the column headers into two dataframes: one dataframe with the columns and values whose headers match keep_list, and one with the headers and data that match recode_list. Every piece of code I have tried thus far has not worked, as it compares the values to the list rather than the column names.
Thank you so much in advance for your help!
Assuming your DataFrame's name is df, you can simply do df[keep_list] and df[recode_list] to get what you want.
Alternatively, you can do this with Index.intersection, which keeps only the listed columns that actually exist in the frame:
df1 = df[df.columns.intersection(keep_list)]
df2 = df[df.columns.intersection(recode_list)]
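A quick sketch using the example frame above (plain df[keep_list] raises a KeyError if a listed column is missing, which is where the intersection variant helps):
import pandas as pd

df = pd.DataFrame({'ID': ['a', 'b', 'c'], 'A': [0, 1, 2], 'B': [0, 1, 2]})
keep_list = ['ID', 'A']
recode_list = ['ID', 'B']

df_keep = df[keep_list]      # columns ID and A
df_recode = df[recode_list]  # columns ID and B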

convert group of repeated columns to one column each using python

I have a csv file with repeated groups of columns, and I want to convert each repeated group into a single column.
I know that for this kind of problem we can use the melt function in pandas, but only when the repeated columns belong to a single variable.
I already found a simple solution for my problem, but I don't think it's the best: I put the repeated columns of every variable into a list, then all the repeated variables into a bigger list.
Then, when iterating over the list, I use melt on every variable (the list of repeated columns of the same group).
Finally, I concatenate the new dataframes into a single dataframe.
Here is my code:
import pandas as pd

file_name = 'file.xlsx'
df_final = pd.DataFrame()

# Create lists to hold the headers & the other variables
HEADERS = []
A = []
B = []
C = []

# Read the Excel file
df = pd.read_excel(file_name, sheet_name='Sheet1')

# Create a list of all the columns
columns = list(df)

# Split the columns list into headers and the other variables
for col in columns:
    if col.startswith('A'):
        A.append(col)
    elif col.startswith('B'):
        B.append(col)
    elif col.startswith('C'):
        C.append(col)
    else:
        HEADERS.append(col)

# For headers, take into account only the first 17 variables
HEADERS = HEADERS[:17]

# Group the column variables (the loop below avoids shadowing the built-in name list)
All_cols = [A, B, C]

# Create the final DF
for cols in All_cols:
    df_x = pd.melt(df,
                   id_vars=HEADERS,
                   value_vars=cols,
                   var_name=cols[0],
                   value_name=cols[0] + '_Val')
    # Concatenate the DataFrames
    df_final = pd.concat([df_final, df_x], axis=1)

# Delete duplicate columns
df_final = df_final.loc[:, ~df_final.columns.duplicated()]
I want to find a better, more maintainable solution for my problem, and as a result I want a dataframe for every group of columns (same variable).
As a beginner in Python, I can't find a way of doing this.
I'm attaching an image that explains what I want, in case I didn't make it clear enough: [joined image omitted]
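One more maintainable route, assuming the repeated columns follow a stub-plus-numeric-suffix naming scheme (e.g. A1, A2, B1, B2 -- an assumption, since the real headers aren't shown), is pd.wide_to_long, which melts all the groups in a single call:
import pandas as pd

# Hypothetical frame: one id column plus two repetitions of each variable.
df = pd.DataFrame({'id': [1, 2],
                   'A1': [10, 11], 'A2': [12, 13],
                   'B1': [20, 21], 'B2': [22, 23]})

long_df = pd.wide_to_long(df, stubnames=['A', 'B'],
                          i='id',     # column(s) that uniquely identify a row
                          j='rep',    # new column holding the suffix (1, 2, ...)
                          sep='', suffix=r'\d+').reset_index()
# long_df now has the columns id, rep, A, B -- one row per (id, repetition).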

Deleting specific columns pandas

I have the following code:
import glob
import pandas as pd

paths = glob.glob(path + "/*.csv")  # assumes `path` points at the folder of csv files
df = pd.concat([pd.read_csv(p) for p in paths], axis=1, ignore_index=False)
df1 = df.loc[:, ~df.columns.duplicated()]
df1.to_csv("userpath.csv")
The purpose of this code is to take multiple csv files, all taken from the same database, and merge them together next to each other. These files all have the same rows, with different column names but the same codes on the first row. For example, csv file one will have J1_01,J1_02,J2_01,J2_02..... and then the codes repeat in the next merged csv file: J1_01,J1_02,J2_01,J2_02,J3_01.... All the csv files have varying columns. The second row provides the title description of each column's value.

Each csv file has three columns that give the name of the row and its ID number, for example: Id, Id2, Label Name. I want to keep the first instance of these three and delete the remaining duplicates. I used df.loc[:,~df.columns.duplicated()]; however, since the J1_01,J1_02,J2_01,J2_02,J3_01.... columns will eventually duplicate as new csv files are merged, I lose some columns. Is there any way to restrict df.loc[:,~df.columns.duplicated()] to just drop the Id, Id2 and Label Name duplicates after keeping the first three? Thanks!

As a follow-up question, if anyone is willing to help: if I want to replace specific characters present in each column (":", ";" or spaces) with, say, an underscore, is there any way to do this with pandas? Thanks again!
Edit: Here's a screenshot of the merged csv file: [screenshot omitted]
I want to keep the first instance of 'GEO.id','GEO.id2' and 'Geo.displ' and delete anytime these three columns are repeated.
From your image it seems that the columns you want to keep are the columns that begin with GEO. To do this, you can use regex to match the names, then get the indices of these columns, then splice the dataframe based on the column index.
import re
pattern = r'GEO' # or just "id" or whatever pattern best matches your data
# Returns the list of indices that match your pattern
# (case-insensitive, so 'Geo.displ' is caught along with 'GEO.id' and 'GEO.id2')
match_idx = [i for i, e in enumerate(df.columns) if re.search(pattern, e, re.IGNORECASE)]
# Keep the first three matching columns (since you want to keep those) and mark the rest for dropping
drop_cols = match_idx[3:]
# Now choose all columns that don't match the indices of the columns you're dropping
usecols = [idx for idx, e in enumerate(df.columns) if idx not in drop_cols]
# Then select your data
df1 = df.iloc[:, usecols]
Note: if you try to select a single column like df['GEO.id'], it will return all the columns called GEO.id, which is why we have to drop the columns by index and not their name.
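For the follow-up about replacing characters with an underscore, a short sketch using pandas' string methods (the regex class [:;\s] covering ':', ';' and whitespace is an assumption about which characters you meant):
# If you mean the column *names*:
df.columns = df.columns.str.replace(r'[:;\s]', '_', regex=True)

# If you mean the string *values* inside the columns:
df = df.replace(r'[:;\s]', '_', regex=True)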

How to feed new columns every time in a loop to a spark dataframe?

I have a task of reading each column of a Cassandra table into a dataframe to perform some operations. I want to feed the data such that, if a table has 5 columns, I get:
the first column in the first iteration
the first and second columns in the second iteration, into the same dataframe
and so on.
I need generic code. Has anyone tried something similar to this? Please help me out with an example.
This will work (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the pieces are collected and concatenated once instead):
frames = []
for i in range(len(df.columns)):
    frames.append(df.iloc[:, 0:i+1])  # the first i+1 columns
df2 = pd.concat(frames, sort=True)
Since the same column names repeat on every iteration, df2 never gains the same column twice; each iteration simply keeps adding rows (with NaN where a column is absent from a shorter prefix).
You can extract the names from the dataframe's schema and then access those particular columns and use them the way you want to:
names = df.schema.names
columns = []
for name in names:
    columns.append(name)
    # df[columns] -- use it the way you want on each iteration
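Since the question is about Spark, a small PySpark sketch of the same growing-prefix idea (the session setup and the CSV read are hypothetical stand-ins for the Cassandra source):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('table.csv', header=True)  # stand-in for the Cassandra table

# Feed a growing prefix of the columns on every iteration.
for i in range(1, len(df.columns) + 1):
    subset = df.select(df.columns[:i])  # the first i columns
    subset.show()                       # replace with your own processing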
