Deleting specific columns pandas - python

I have the following code:
dfs = glob.glob(path + "/*.csv")
df = pd.concat([pd.read_csv(df) for df in dfs], axis=1, ignore_index=False)
df1 = df.loc[:,~df.columns.duplicated()]
df1.to_csv("userpath.csv")
The purpose of this code is to take multiple csv files, all taken from the same database, and merge them side by side. These files all have the same rows with different column names, but the codes on the first row repeat across files. For example, csv file one will have J1_01,J1_02,J2_01,J2_02..... and the next merged csv file will repeat them: J1_01,J1_02,J2_01,J2_02,J3_01.... The number of columns varies between files. The second row provides the title description of the column's value. Each csv file has three columns that give the name and the ID number of the row, for example: Id, Id2, Label Name. I want to keep the first instance of these three and delete the remaining duplicates. I used the code df.loc[:,~df.columns.duplicated()]; however, since J1_01,J1_02,J2_01,J2_02,J3_01.... will eventually be duplicated as each new csv file is merged, I lose some columns. Is there any way to make the df.loc[:,~df.columns.duplicated()] code drop only the duplicates of Id, Id2 and Label Name, after keeping the first three? Thanks!
As a follow-up question, if anyone is willing to help: if I want to replace specific characters present in each column (":", ";" or spaces) with, say, an underscore, is there any way to do this with pandas? Thanks again!
Edit: Here's a screenshot of the merged csv file.
I want to keep the first instance of 'GEO.id','GEO.id2' and 'Geo.displ' and delete anytime these three columns are repeated.

From your image it seems that the columns you want to keep are the columns that begin with GEO. To do this, you can use a regex to match the names, get the indices of the matching columns, and then slice the dataframe by column index.
import re
pattern = r'GEO' # or just "id" or whatever pattern best matches your data
# Returns list of indices that match your pattern
match_idx = [i for i, e in enumerate(df.columns) if re.search(pattern, e)]
# Select all but the first three matches (you want to keep the first instance of the three ID/label columns)
drop_cols = match_idx[3:]
# Now choose all columns that don't match the indices of the columns you're dropping
usecols = [idx for idx, e in enumerate(df.columns) if idx not in drop_cols]
# Then select your data
df1 = df.iloc[:, usecols]
Note: if you try to select a single column like df['GEO.id'], it will return all the columns called GEO.id, which is why we have to drop the columns by index and not their name.
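If the names of the three ID columns are known up front, a boolean mask avoids the regex and the index bookkeeping. The sketch below is only an illustration: the sample frame is made up, the column names Id, Id2 and Label Name are taken from the question, and it assumes the follow-up about replacing characters refers to the column names.

```python
import pandas as pd

# Hypothetical merged frame: two files side by side, each with the three ID columns
df = pd.DataFrame(
    [[1, 10, "a b", 5, 6, 1, 10, "a b", 7, 8]],
    columns=["Id", "Id2", "Label Name", "J1_01", "J2:01",
             "Id", "Id2", "Label Name", "J1_01", "J3;01"],
)

id_cols = ["Id", "Id2", "Label Name"]

# True only for the second and later occurrences of an ID column;
# repeated data columns such as J1_01 are left alone
repeated_id = df.columns.isin(id_cols) & df.columns.duplicated()
df1 = df.loc[:, ~repeated_id].copy()

# Follow-up: replace ":", ";" and spaces in the column names with "_"
df1.columns = df1.columns.str.replace(r"[:; ]", "_", regex=True)

print(list(df1.columns))
```

Both J1_01 columns survive here, which is the difference from calling df.columns.duplicated() on its own.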

Related

How to compare two dataframes in pandas

I have two dataframes:
The first one has n row of names.
The second one has n row of names.
for each name in the first dataframe:
see how many times it appears in the second dataframe.
The code looks something like this:
df5 = pd.read_excel(item1, usecols="B",skiprows=6)
df10 = pd.read_excel('SMR4xx_Change_situation.xlsm', sheet_name='LoPN',usecols='D', skiprows=4)
How do I count the number of times a name appears in the second dataframe and output it beside the name from the first dataframe?
Ex: The first name in the database is John. John appears in the second dataframe 4 times => output John 4
either print it in the console or write in a separate excel file the first database and on the second column the number of appearances.
Anything could help.
Well, you can create a dataframe for the records you are seeking.
You can first get the list of unique names in the first dataframe like
uniqueNames = df5['B'].unique() # Assuming column B contains the names
Now you can iterate through each of the unique names in the first dataframe and compare against the second dataframe like this:
rows = []
for eachName in uniqueNames:
    # Assuming you need to compare with column D
    rows.append({'name': eachName, 'count': (df10['D'] == eachName).sum()})
# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
dfCount = pd.DataFrame(rows, columns=['name', 'count'])
Or
If you want the counts to be present in the first dataframe, something like this should work
# Look up each name's count in dfCount and store it next to the name
df5['counts'] = df5['B'].map(dfCount.set_index('name')['count'])
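The same result can be had without an explicit loop by counting the second column once with value_counts and mapping those counts onto the first dataframe. This is a minimal sketch: the two small frames are made-up stand-ins for the spreadsheets, and the column labels B and D follow the assumption made in the answer above.

```python
import pandas as pd

# Hypothetical stand-ins for the two spreadsheets
df5 = pd.DataFrame({"B": ["John", "Mary", "Alex"]})
df10 = pd.DataFrame({"D": ["John", "Mary", "John", "John", "Mary", "John"]})

# Count every name in the second dataframe once
counts = df10["D"].value_counts()

# Map the counts onto the first dataframe; names that never appear get 0
result = df5.assign(count=df5["B"].map(counts).fillna(0).astype(int))

print(result.to_string(index=False))
# result.to_excel("counts.xlsx", index=False) would write it to a separate file
```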

Checking panda dataframe column for a match in a list

I have a pandas dataframe with two columns, a file id number and a list of keywords from that file. I essentially want to be able to iterate through each row and see if a chosen keyword is in the list of file key words and if it is print out the file id. Or I could make a new dataframe with all positive matches and print the file id's from there.
After researching it I was wanting to use
df.loc[df['key words'] == key_word, :]
which would give me a new dataframe of all the positive matches. The issue with this was that there were no positive matches, as I forgot my 'key words' column has a list of key words in each row. Would anyone be able to help me find a solution? Much appreciated
EDIT: I'm unable to provide a snippet of my table as the data is sensitive, however this is the general idea of what it's like:
One solution is a pandas inner join. First convert your key_word array to a pandas dataframe. Let's say you have saved the array as "key_words.csv" and given it the column label "my_key":
col_name = ['my_key']
df1 = pd.read_csv("key_words.csv", names = col_name ,skiprows=[0],encoding ='utf-8')
Use skiprows=[0] if the first line is a comment; otherwise leave it out.
Note: it is very important that the keywords on both sides have exactly the same encoding and are strings; otherwise the join won't find any match.
To make sure of that, you can do the following (sometimes it works without convert_dtypes, but sometimes not):
df1[col_name] = df1[col_name].astype(str)
df1 = df1.convert_dtypes()
You need to repeat the same dtype conversion for your df['key words'] column, too.
You can then use an inner join:
df12 = df1.merge(df, how ='inner', left_on = key1, right_on = key)
key1 and key are the labels of the columns you want to compare.
df12 includes only the rows with a common keyword string, and you can save it to a separate file.
I managed to get the code right. I did:
for i in range(len(df['file id'])):
    if keyword in df.loc[i, 'key words']:
        print("https://www.website" + df.loc[i, 'file id'])
A bit easier than I thought. Thanks everyone for your answers though.
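For reference, the same check can be written as a boolean mask, which also gives the new dataframe of positive matches mentioned in the question. This is only a sketch: the sample rows and the keyword are invented, and it assumes each 'key words' cell holds a Python list.

```python
import pandas as pd

# Hypothetical data: one list of keywords per file
df = pd.DataFrame({
    "file id": ["001", "002", "003"],
    "key words": [["tax", "audit"], ["payroll"], ["audit", "legal"]],
})
keyword = "audit"

# True for rows whose keyword list contains the chosen keyword
mask = df["key words"].apply(lambda words: keyword in words)
matches = df.loc[mask]

for file_id in matches["file id"]:
    print("https://www.website" + file_id)
```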

convert group of repeated columns to one column each using python

I have a csv file with repeated group of columns and I want to convert the repeated group of columns to only one column each.
I know that for this kind of problem we can use the pandas function melt, but only when the repeated columns belong to a single variable.
I already found a simple solution for my problem, but I don't think it's the best. I put the repeated columns of every variable into a list, then all of those lists into a bigger list.
Then, when iterating over the list, I use melt on every variable (the list of repeated columns of the same group).
Finally, I concatenate the new dataframes into a single dataframe.
Here is my code:
import pandas as pd
file_name='file.xlsx'
df_final=pd.DataFrame()
#create lists to hold headers & other variables
HEADERS = []
A = []
B=[]
C=[]
#Read the Excel file
df = pd.read_excel(file_name, sheet_name='Sheet1')
#create a list of all the columns
columns = list(df)
#split columns list into headers and other variables
for col in columns:
    if col.startswith('A'):
        A.append(col)
    elif col.startswith('B'):
        B.append(col)
    elif col.startswith('C'):
        C.append(col)
    else:
        HEADERS.append(col)
#For headers take into account only the first 17 variables
HEADERS=HEADERS[:17]
#group column variables
All_cols=[]
All_cols.append(A)
All_cols.append(B)
All_cols.append(C)
#Create a final DF
for group in All_cols:  # 'group' rather than 'list', which shadows the built-in
    df_x = pd.melt(df,
                   id_vars=HEADERS,
                   value_vars=group,
                   var_name=group[0],
                   value_name=group[0]+'_Val')
    #Concatenate DataFrames (df_final starts out empty)
    df_final = pd.concat([df_final, df_x], axis=1)
#Delete duplicate columns
df_final = df_final.loc[:, ~df_final.columns.duplicated()]
I want to find a better, more maintainable solution for my problem, and I want to have a dataframe for every group of columns (same variable) as a result.
As a beginner in Python, I can't find a way of doing this.
I'm joining an image that explains what I want in case I didn't make it clear enough.
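One possible way to get a dataframe per group is to drive the melt from a list of prefixes and collect the results in a dict. This is a rough sketch, not a tested solution for the real file: the sample frame, the single id column and the prefixes A and B are placeholders for the actual headers and variables.

```python
import pandas as pd

# Hypothetical wide frame: one header column plus two repeated groups
df = pd.DataFrame({
    "id": [1, 2],
    "A1": [10, 20], "A2": [11, 21],
    "B1": [30, 40], "B2": [31, 41],
})

id_vars = ["id"]        # stands in for the HEADERS list
prefixes = ["A", "B"]   # one entry per repeated variable

# One melted dataframe per prefix, keyed by the prefix
melted = {}
for prefix in prefixes:
    value_vars = [col for col in df.columns if col.startswith(prefix)]
    melted[prefix] = pd.melt(
        df,
        id_vars=id_vars,
        value_vars=value_vars,
        var_name=prefix,
        value_name=prefix + "_Val",
    )

print(melted["A"])
```

Adding a new variable then only means adding its prefix to the list, assuming no header column starts with the same prefix.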

Use multiple rows as column header for pandas

I have a dataframe that I've imported as follows.
df = pd.read_excel("./Data.xlsx", sheet_name="Customer Care", header=None)
I would like to set the first three rows as column headers but can't figure out how to do this. I gave the following a try:
df.columns = df.iloc[0:3,:]
but that doesn't seem to work.
I saw something similar in this answer. But it only applies if all sub columns are going to be named the same way, which is not necessarily the case.
Any recommendations would be appreciated.
df = pd.read_excel(
"./Data.xlsx",
sheet_name="Customer Care",
header=[0,1,2]
)
This will tell pandas to read the first three rows of the excel file as multiindex column labels.
If you would rather load the rows first and then promote them to column labels:
#set the first three rows as columns
df.columns=pd.MultiIndex.from_arrays(df.iloc[0:3].values)
#delete the first three rows (because they are now the column labels)
df=df.iloc[3:]
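The second approach can be tried without the Excel file. The sketch below uses a small made-up frame in place of the sheet read with header=None; the labels and values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a sheet read with header=None:
# three header rows followed by two data rows
raw = pd.DataFrame([
    ["Region", "Region", "Sales"],
    ["North", "South", "Total"],
    ["Q1", "Q1", "Q1"],
    [1, 2, 3],
    [4, 5, 9],
])

# Build a three-level column index from the first three rows
header = pd.MultiIndex.from_arrays(raw.iloc[0:3].to_numpy().tolist())

# Keep only the data rows and attach the new column labels
data = raw.iloc[3:].reset_index(drop=True)
data.columns = header

print(data)
```

Because the sub-column labels come straight from the rows, they do not need to be the same under every top-level label.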

How to feed new columns every time in a loop to a spark dataframe?

I have a task of reading each columns of Cassandra table into a dataframe to perform some operations. Here I want to feed the data like if 5 columns are there in a table I want:-
first column in the first iteration
first and second column in the second iteration to the same dataframe
and likewise.
I need a generic code. Has anyone tried similar to this? Please help me out with an example.
This will work (note that it uses a pandas dataframe):
df2 = pd.DataFrame()
for i in range(len(df.columns)):
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df2 = pd.concat([df2, df.iloc[:, 0:i+1]], sort=True)
Since the same column names repeat in each iteration, df2 will not get the same column name twice; it keeps adding rows instead.
You can extract the names from the dataframe's schema, then access each column and use it the way you want to.
names = df.schema.names
columns = []
for name in names:
    columns.append(name)
    # df[columns] use it the way you want
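The "one more column per iteration" pattern itself does not depend on Spark, so it can be sketched in plain Python. The helper name cumulative_columns and the sample column names are made up for illustration; with a real Spark dataframe the names would come from df.schema.names as above.

```python
from typing import Iterator, List, Sequence


def cumulative_columns(names: Sequence[str]) -> Iterator[List[str]]:
    """Yield the first column, then the first two, and so on."""
    for end in range(1, len(names) + 1):
        yield list(names[:end])


# Hypothetical column names standing in for df.schema.names
names = ["id", "name", "age", "city", "country"]

for cols in cumulative_columns(names):
    # With a Spark dataframe, the subset for this iteration would be df.select(*cols)
    print(cols)
```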
