Matching rows to a list of values in Python

I have a 23-column dataframe in which one column, column_lei, contains the LEIs of various companies. I also have a list called lei_codes that contains many specific LEIs which I need to find in the dataframe.
How can I check each row of the LEI column against the list and, wherever the value in that column matches any value in the list, copy the entire row into a new dataframe? At the end I want a new dataframe containing all 23 columns for every record whose LEI matched the list.

You can use the isin() method on the column "column_lei".
Here is an example with 3 columns:
import pandas as pd
import numpy as np
df = pd.DataFrame({"a": np.repeat("a", 5),
                   "column_lei": [1,2,3,4,5],
                   "b": np.repeat("b", 5)})
lei_codes = [1,3,5]
df_new = df[df.column_lei.isin(lei_codes)]
df_new
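The same idea with string codes, which is closer to real LEIs. This is only a minimal sketch: the sample rows, the extra column names and the code values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: LEI-like strings plus two extra columns.
df = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma", "Delta"],
    "column_lei": ["LEI001", "LEI002", "LEI003", "LEI004"],
    "country": ["GB", "DE", "FR", "US"],
})
lei_codes = ["LEI002", "LEI004", "LEI999"]  # LEI999 has no match in df

# Boolean mask: True for rows whose LEI appears in the list.
mask = df["column_lei"].isin(lei_codes)

# Keep every column of the matching rows; .copy() makes the result
# an independent dataframe rather than a slice of df.
df_new = df[mask].copy()
print(df_new)
```

Codes in the list that do not occur in the dataframe are simply ignored, and all columns of the matching rows are kept.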

Related

How can I take values from a selection of columns in my Pandas dataframe to create a new column that contains a list of those values?

I have a dataframe in Pandas where I would like to turn the values of a set of columns (more specifically, from column index 3 to the end) into a new single column that contains a list of those values in each row.
Right now, I have code that can print out a list of the values in the columns, but only for a single row. How can I do this for the whole dataframe?
import pandas as pd
orig_df = pd.read_csv('zipcode_price_dataset.csv')
df = orig_df.loc[(orig_df['State'] == "CA")]
row = df.head(1)
print(row[df.columns[3:].values].values[0])
I could iterate through the rows using a for loop, but is there a more concise way to do this?
Something like the following:
df['new'] = df[df.columns[3:]].values.tolist()
Use .iloc:
df.iloc[: , 3:].agg(list, axis=1)
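A small sketch comparing the two suggestions on a toy dataframe. The column names and values here are invented, not taken from zipcode_price_dataset.csv.

```python
import pandas as pd

# Hypothetical stand-in for the zipcode dataset: three leading columns,
# then the value columns that should be collected into a list.
df = pd.DataFrame({
    "Zip": [90001, 94105, 92101],
    "City": ["Los Angeles", "San Francisco", "San Diego"],
    "State": ["CA", "CA", "CA"],
    "2019": [500, 900, 600],
    "2020": [520, 950, 640],
})

# Fix the column selection first, so that adding "new" later
# does not change what "column index 3 to the end" means.
value_cols = df.columns[3:]

# Option 1: go through the underlying array.
df["new"] = df[value_cols].values.tolist()

# Option 2: aggregate each row into a list.
alt = df[value_cols].agg(list, axis=1)

print(df["new"].tolist())
```

Both options give one list per row. The first is usually faster on large frames because it avoids a Python-level call per row.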

How to Create New Column in Pandas From Two Existing Dataframes

I have two pandas dataframes, call them A and B. In A is most of my data (it's 110 million rows long), while B contains information I'd like to add (it's a dataframe that lists all the identifiers and counties). In dataframe A, I have a column called identifier. In dataframe B, I have two columns, identifier and county. I want to be able to merge the dataframes such that a new dataframe is created where I preserve all of the information in A, while also adding a new column county where I use the information provided in B to do so.
You need to use pd.merge
import pandas as pd
data_A = {'incident_date': ['ert','szd','vfd','dvb','sfa','iop'],
          'incident': ['A','B','A','C','B','F']}
data_B = {'incident': ['A','B','C','D','E'],
          'number': [1,1,3,23,23]}
df_a = pd.DataFrame(data_A)
df_b = pd.DataFrame(data_B)
In order to preserve every row of df_a (your large dataframe), use a left merge:
df_ans = df_a.merge(df_b[['number','incident']], on='incident',how='left')
Print the result:
print(df_ans)
Note: the row with incident 'F' has NaN in the number column, because 'F' is not present in the second dataframe.
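Applied to the column names from the question (identifier and county), a left merge could look like the sketch below. The sample values are invented. The validate argument is an optional safety check I added; it raises an error if an identifier appears more than once in B, which would otherwise duplicate rows of A.

```python
import pandas as pd

# Hypothetical small versions of the two dataframes from the question.
A = pd.DataFrame({
    "identifier": [101, 102, 101, 103, 104],
    "value": [1.5, 2.0, 3.5, 4.0, 5.5],
})
B = pd.DataFrame({
    "identifier": [101, 102, 103],
    "county": ["Kent", "Essex", "Surrey"],
})

# how="left" keeps every row of A, in A's order.
# validate="many_to_one" checks that identifiers are unique in B.
merged = A.merge(B, on="identifier", how="left", validate="many_to_one")
print(merged)
```

Identifier 104 has no entry in B, so its county is NaN, while the number of rows stays the same as in A.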

How to modify pandas dataframe format?

There is a problem with my pandas dataframe. df is my original dataframe. First I select specific columns of df:
df1=df[['cod_far','geo_lat','geo_lon']]
Then I set new names for those columns:
df1.columns = ['new_col1', 'cod_far', 'lat', 'lon']
Finally I group df1 by specific columns and convert the result to a new dataframe called "occur":
occur = df1.groupby(['cod_far','lat','lon' ]).size()
occur=pd.DataFrame(occur)
The problem is that I am getting this: a dataframe with only ONE column. Rows are fine, but there should be 3 columns! Is there any way to drop that "0" and convert my dataframe "occur" into a dataframe of 3 columns?
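One possible approach, as a sketch only: groupby(...).size() returns a Series whose index holds the grouping keys, so wrapping it in pd.DataFrame leaves a single column named 0. Calling reset_index(name=...) moves the keys back into ordinary columns. The sample data and the column name "count" are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data using the renamed columns from the question.
df1 = pd.DataFrame({
    "cod_far": [1, 1, 2, 2, 2],
    "lat": [10.0, 10.0, 11.0, 11.0, 11.0],
    "lon": [20.0, 20.0, 21.0, 21.0, 21.0],
})

# size() counts rows per group; reset_index turns the group keys
# into regular columns and names the count column "count".
occur = (
    df1.groupby(["cod_far", "lat", "lon"])
       .size()
       .reset_index(name="count")
)
print(occur)
```

This gives the three key columns plus the count. If only the keys are wanted, the count column can be dropped afterwards with occur.drop(columns="count").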

How can I create multiple dataframes consisting of columns from an existing df with a loop [Pandas, Python]?

I tried to create multiple dataframes based on the columns of an existing dataframe. To keep the code simple and scalable, I used a loop. This is what I tried:
import pandas as pd
for index in range(df.shape[1]):
    df_index = df.iloc[:, [0, index]]
The output of the above code is one dataframe consisting of the first and last column of the dataframe. The desired output is multiple dataframes that consist of the first column and the index in a single iteration.
The dataset I am using consists of 85 columns. The desired output would consist of 85 dataframes.
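Your loop overwrites df_index on every iteration, so only the last dataframe survives. Collect the dataframes in a list instead:
import pandas as pd
dfs = []
for index in range(df.shape[1]):
    # keep each two-column dataframe instead of overwriting it
    dfs.append(df.iloc[:, [0, index]])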
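A variant that may be easier to work with, sketched under assumptions: store the dataframes in a dict keyed by column name, and skip the first column so it is not paired with itself. The sample dataframe and its column names are made up.

```python
import pandas as pd

# Hypothetical small dataframe standing in for the 85-column dataset.
df = pd.DataFrame({
    "date": ["2020-01", "2020-02", "2020-03"],
    "x": [1, 2, 3],
    "y": [4, 5, 6],
})

first = df.columns[0]

# One two-column dataframe per remaining column, keyed by column name.
frames = {col: df[[first, col]] for col in df.columns[1:]}

print(frames["x"])
```

With a dict you can look up a dataframe by the name of its second column instead of remembering its position in a list.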

convert group of repeated columns to one column each using python

I have a CSV file with repeated groups of columns, and I want to convert each repeated group into a single column.
I know that for this kind of problem we can use the melt function in pandas, but only when the repeated columns belong to a single variable.
I already found a simple solution for my problem, but I don't think it's the best. I put the repeated columns of every variable into a list, then put all those lists into a bigger list.
Then, while iterating over the bigger list, I use melt on every variable (the list of repeated columns of the same group).
Finally I concatenate the new dataframes into a single dataframe.
Here is my code:
import pandas as pd
file_name='file.xlsx'
df_final=pd.DataFrame()
#create lists to hold headers & other variables
HEADERS = []
A = []
B=[]
C=[]
#Read CSV File
df = pd.read_excel(file_name, sheet_name='Sheet1')
#create a list of all the columns
columns = list(df)
#split columns list into headers and other variables
for col in columns:
    if col.startswith('A'):
        A.append(col)
    elif col.startswith('B'):
        B.append(col)
    elif col.startswith('C') :
        C.append(col)
    else:
        HEADERS.append(col)
#For headers take into account only the first 17 variables
HEADERS=HEADERS[:17]
#group column variables
All_cols=[]
All_cols.append(A)
All_cols.append(B)
All_cols.append(C)
#Create a final DF
for cols in All_cols:
    df_x = pd.melt(df,
                   id_vars=HEADERS,
                   value_vars=cols,
                   var_name=cols[0],
                   value_name=cols[0]+'_Val')
    #Concatenate DataFrames 1
    df_final= pd.concat([df_final, df_x],axis=1)
#Delete duplicate columns
df_final= df_final.loc[:, ~df_final.columns.duplicated()]
I want to find a more maintainable solution for my problem, and I want to have a dataframe for every group of columns (same variable) as a result.
As a beginner in Python, I can't find a way of doing this.
I'm attaching an image that explains what I want, in case I didn't make it clear enough.
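One way to get a dataframe per group, as a rough sketch: loop over the group prefixes and store each melted result in a dict. The sample dataframe, the single header column "id" and the prefixes "A" and "B" are assumptions for illustration; the real file has more headers and a third group.

```python
import pandas as pd

# Hypothetical wide dataframe: one header column and two repeated groups.
df = pd.DataFrame({
    "id": [1, 2],
    "A1": [10, 20], "A2": [11, 21],
    "B1": [0.1, 0.2], "B2": [0.3, 0.4],
})

headers = ["id"]        # columns to repeat in every melted dataframe
prefixes = ["A", "B"]   # one entry per group of repeated columns

melted = {}
for prefix in prefixes:
    # All columns that belong to this group.
    value_cols = [c for c in df.columns if c.startswith(prefix)]
    # Long format: one row per (header values, original column) pair.
    melted[prefix] = df.melt(
        id_vars=headers,
        value_vars=value_cols,
        var_name=prefix,
        value_name=f"{prefix}_Val",
    )

print(melted["A"])
```

Adding a new group then only means adding its prefix to the list. Note that startswith would also match header columns that happen to begin with the same letter, so the prefixes need to be chosen with that in mind.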
