I have two pandas dataframes, call them A and B. A holds most of my data (it's 110 million rows long), while B contains information I'd like to add (it's a dataframe that lists all the identifiers and counties). In dataframe A, I have a column called identifier. In dataframe B, I have two columns, identifier and county. I want to merge the dataframes so that a new dataframe is created that preserves all of the information in A while adding a new column county, populated using the information in B.
You need to use pd.merge:

import pandas as pd

data_A = {'incident_date': ['ert', 'szd', 'vfd', 'dvb', 'sfa', 'iop'],
          'incident': ['A', 'B', 'A', 'C', 'B', 'F']}
data_B = {'incident': ['A', 'B', 'C', 'D', 'E'],
          'number': [1, 1, 3, 23, 23]}

df_a = pd.DataFrame(data_A)
df_b = pd.DataFrame(data_B)
In order to preserve your df_a, which has millions of rows, use a left merge:

df_ans = df_a.merge(df_b[['number', 'incident']], on='incident', how='left')
The output:

print(df_ans)
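  incident_date incident  number
0           ert        A     1.0
1           szd        B     1.0
2           vfd        A     1.0
3           dvb        C     3.0
4           sfa        B     1.0
5           iop        F     NaN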
Note: the last row has NaN in number, since incident 'F' is not present in the second dataframe.
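Mapped onto the column names in your question (assuming your frames are called df_A and df_B, with identifier in both and county only in df_B), the same pattern would be:

df_merged = df_A.merge(df_B[['identifier', 'county']], on='identifier', how='left')

The left merge keeps all 110 million rows of df_A and fills county with NaN for any identifier that has no match in df_B.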
I'd like to look up several columns from another dataframe (I have their names in a list) and bring them over to my main dataframe, essentially doing a VLOOKUP of ~30 columns using ID as the key or lookup value for all columns.
However, for the columns that are the same between the two dataframes, I don't want to bring over duplicate columns, but instead want those values filled into df1 from df2.
I've tried the below:

df = pd.merge(df, df2[['ID', [look_up_cols]]],
              on='ID',
              how='left',
              # suffixes=(False, False)
              )
but it brings in the shared columns from df2 when I want df2's values filled into the same columns in df1.
I've also tried creating a dictionary with the column pairs from each df and using this for loop to look up each item in the dictionary (lookup_map) in the other df, using ID as the key:
for col in look_up_cols:
    df1[col] = df2['ID'].map(lookup_map)
but this just returns NaNs.
You should be able to do something like the following:
df = pd.merge(df, df2[look_up_cols + ['ID']],
              on='ID',
              how='left')

This just adds the ID column to the look_up_cols list and thereby allows it to be used in the merge function.
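That brings the columns over, but for columns that already exist in df1 the merge will still create suffixed duplicates. If what you want is for df2's values to fill the gaps in df1's existing columns, one option (a sketch, assuming ID is unique in df2 and the shared columns exist in both frames) is to map each column through df2 indexed by ID:

import pandas as pd

# Toy frames standing in for df1/df2; substitute your own.
df1 = pd.DataFrame({'ID': [1, 2, 3],
                    'x': [10.0, None, 30.0]})
df2 = pd.DataFrame({'ID': [1, 2, 3],
                    'x': [11.0, 22.0, 33.0]})
look_up_cols = ['x']

# Assumption: 'ID' is unique in df2, so it can serve as an alignment index.
df2_indexed = df2.set_index('ID')
for col in look_up_cols:
    # Fill df1's missing values in each shared column from df2, matching on ID
    df1[col] = df1[col].fillna(df1['ID'].map(df2_indexed[col]))

print(df1)  # the row with ID 2 now has x = 22.0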
I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT).
The column 'ID' in dataframe DT is a subset of the column 'ID' in dataframe MR; the two frames are only similar (not equal) in this ID column, and the rest of the columns are different, as is the number of rows.
How can I get the rows from dataframe MR whose 'ID' is equal to some 'ID' in dataframe DT, knowing that values in 'ID' can appear several times in the same column?
(DT is 1538 rows and MR is 2060 rows.)
I tried some lines proposed at https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods proposed there (and the goal is a little different).
Thanks!
Take a look at the pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you want to get a new dataframe of combined records for the same ID, you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
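One thing to note, since your IDs repeat: merge() produces one output row per matching pair, so an ID that appears twice in MR and three times in DT yields six rows, whereas the isin() approach keeps MR's rows exactly as they are.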
I have a 23-column dataframe where one of the columns, called column_lei, contains the LEIs of various companies. I also have a list called lei_codes which contains loads of specific LEIs that I need to find in the dataframe.
How can I check each row of the lei column in the dataframe against the list, and, whenever there is a match with any of the values in the list, pick out that entire row and place it into a new dataframe? At the end I want a new dataframe containing all 23 columns for the records where there was a match on the LEI column against the list.
You can use the function isin() on the column "column_lei".
Here's an example with 3 columns:
import pandas as pd
import numpy as np
df = pd.DataFrame({"a": np.repeat("a", 5),
"column_lei": [1,2,3,4,5],
"b": np.repeat("b", 5)})
lei_codes = [1,3,5]
df_new = df[df.column_lei.isin(lei_codes)]
df_new
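which keeps only the rows whose column_lei value appears in the list:

   a  column_lei  b
0  a           1  b
2  a           3  b
4  a           5  b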
I have a csv file with repeated groups of columns, and I want to convert each repeated group into a single column.
I know that for this kind of problem we can use the melt function in python, but only when the repeated columns belong to a single variable.
I already found a simple solution for my problem, but I don't think it's the best. I put the repeated columns of every variable into a list, then all of the repeated variables into a bigger list.
Then, when iterating over the list, I use melt on every variable (the list of repeated columns of the same group).
Finally, I concatenate the new dataframes into a single dataframe.
Here is my code:
import pandas as pd

file_name = 'file.xlsx'
df_final = pd.DataFrame()

# Create lists to hold the header columns and the repeated-column groups
HEADERS = []
A = []
B = []
C = []

# Read the input file
df = pd.read_excel(file_name, sheet_name='Sheet1')

# Create a list of all the columns
columns = list(df)

# Split the columns list into headers and the other variables
for col in columns:
    if col.startswith('A'):
        A.append(col)
    elif col.startswith('B'):
        B.append(col)
    elif col.startswith('C'):
        C.append(col)
    else:
        HEADERS.append(col)

# For headers, take into account only the first 17 variables
HEADERS = HEADERS[:17]

# Group the column variables
All_cols = [A, B, C]

# Create the final DF: melt each group, then concatenate
for group in All_cols:  # 'group' rather than 'list', which shadows the builtin
    df_x = pd.melt(df,
                   id_vars=HEADERS,
                   value_vars=group,
                   var_name=group[0],
                   value_name=group[0] + '_Val')
    # Concatenate the melted DataFrames side by side
    df_final = pd.concat([df_final, df_x], axis=1)  # df_final, not the undefined df_A

# Delete duplicate columns
df_final = df_final.loc[:, ~df_final.columns.duplicated()]
I want to find a better, more maintainable solution for my problem, and I want to end up with a dataframe for every group of columns (same variable) as a result.
As a beginner in python, I can't find a way of doing this.
I'm attaching an image that explains what I want, in case I didn't make it clear enough.
(attached image)
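One way to make the same melt-and-concat idea more maintainable (a sketch, assuming the groups really are distinguished by the first letter of the column name, as in the code above) is to drive everything from a list of prefixes and keep one melted dataframe per group in a dict:

import pandas as pd

df = pd.read_excel('file.xlsx', sheet_name='Sheet1')

prefixes = ['A', 'B', 'C']  # one entry per repeated-column group
headers = [c for c in df.columns
           if not c.startswith(tuple(prefixes))][:17]

# One melted dataframe per group, keyed by its prefix
melted = {}
for p in prefixes:
    group = [c for c in df.columns if c.startswith(p)]
    melted[p] = pd.melt(df,
                        id_vars=headers,
                        value_vars=group,
                        var_name=group[0],
                        value_name=group[0] + '_Val')

# melted['A'], melted['B'], ... are the per-group dataframes you asked for;
# concatenate them and drop duplicated columns if one combined frame is needed
df_final = pd.concat(list(melted.values()), axis=1)
df_final = df_final.loc[:, ~df_final.columns.duplicated()]

Adding a new group of columns then only requires adding its prefix to the list.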
I have two dataframes. Both have the same structure, with the same columns/column names.

A -> dataframe with (v, w, x, y, z) columns (some values)
b -> dataframe with (v, w, x, y, z) columns (all values)

I want to take values from dataframe A and insert them into dataframe b.
Suppose v = 1: I need to fetch the rows from A where v == 1 and insert them into b. Also, I want to insert them as the first rows of b.
I tried the following:

b.insert(loc=1, values=A[A.v == 1])

but I'm getting errors.
Can anybody help in doing this?
Thanks
Just concatenate?
import pandas as pd
b = pd.concat([A[A.v == 1], b])
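The insert call fails because DataFrame.insert adds a column, not rows. Putting A's matching rows first in the concat list places them at the top of b; if you also want the result to get a fresh 0..n-1 index, pass ignore_index=True:

b = pd.concat([A[A.v == 1], b], ignore_index=True)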