I'm going over files in a folder, and I want to merge the datasets based on the variable called key. This is my code so far, along with an example of what the datasets might look like and what I expect the final result to look like:
dfs = []
for f in files:
    for name, sheet in sheets_dict.items():
        if name == "Main":
            data = sheet
            dfs.append(data)
Example of dfs:
df1 = {'key': ["A","B"], 'Answer': ["yes","No"]}
df1 = pd.DataFrame(data=df1)
df2 = {'key': ["A","C"], 'Answer': ["No","c"]}
df2 = pd.DataFrame(data=df2)
Final output:
final = {'A': ["yes","No"], 'B': ["No",""], 'C': ["","c"], 'file': ['df1','df2']}
final = pd.DataFrame(data=final)
This is what I have tried but I can't make it work:
df_key = {'key': ["A","B","C"]}
df_key = pd.DataFrame(data=df_key)
df_final = []
for df in dfs:
    temp = pd.merge(df_key[['key']], df, on=['key'], how='left')
    temp_t = temp.transpose()
    df_final.append(temp_t)
Reshaping and concatenating the dataframes is pretty straightforward. But in order to add the file value you will need to either a) have the names of the dataframes in a list of strings, or b) generate new names as you go along.
Here is the code
dfs = [df1, df2]  # populate dfs as needed
master_df = []
df_key = {'key': ["A", "B", "C"]}
df_key = pd.DataFrame(df_key)  # assuming you already have this dataframe created

# placeholder series so every key in df_key appears as a column,
# even if no dataframe contains it
master_df.append(pd.Series(index=df_key['key'], dtype=object))

for i, df in enumerate(dfs):
    df = df.set_index('key').squeeze()
    df.loc['file'] = f'df{i+1}'
    master_df.append(df)

# or iterate the dfs alongside their file names
# for fname, df in zip(file_names, dfs):
#     df = df.set_index('key').squeeze()
#     df.loc['file'] = fname
#     master_df.append(df)

master_df = pd.concat(master_df, axis=1).T
master_df = master_df.iloc[1:]  # drop the placeholder row

# rearrange columns so 'file' comes last
master_df = master_df[
    master_df.columns.difference(['file']).to_list() + ['file']
]

# fill NaNs with empty string
master_df.fillna('', inplace=True)
Output:

          A   B  C  file
Answer  yes  No     df1
Answer   No      c  df2
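For the same two example frames, a shorter variant (a sketch, not the only way) builds each row with reindex so missing keys become NaN up front, then fills them with '' at the end:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ["A", "B"], 'Answer': ["yes", "No"]})
df2 = pd.DataFrame({'key': ["A", "C"], 'Answer': ["No", "c"]})

keys = ["A", "B", "C"]
rows = []
for name, df in [('df1', df1), ('df2', df2)]:
    s = df.set_index('key')['Answer'].reindex(keys)  # missing keys -> NaN
    s['file'] = name                                 # tag the source frame
    rows.append(s)

final = pd.concat(rows, axis=1).T.fillna('').reset_index(drop=True)
```

This avoids the placeholder series entirely because reindex already guarantees all three keys exist in every row.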
I have to append data to a CSV, but the problem I am facing is that instead of appending I am overwriting the data, so I cannot retain the old rows. Example:
finalDf = pd.DataFrame(columns=['sourcez', 'tergetz', 'TMP'])
df = pd.DataFrame()
df["sourcez"] = ["str(source_Path)"]
df["tergetz"] = ["str(target_path)"]
df["TMP"] = ["total_matching_points"]
finalDf = finalDf.append(df)
finalDf.to_csv('Testing.csv', index=False)
Now if I add a new value
finalDf = pd.DataFrame(columns=['sourcez', 'tergetz', 'TMP'])
df = pd.DataFrame()
df["sourcez"] = ["str(source_Path)_New"]
df["tergetz"] = ["str(target_path)_New"]
df["TMP"] = ["total_matching_points_New"]
finalDf = finalDf.append(df)
finalDf.to_csv('Testing.csv', index=False)
It is keeping only the latest data in the CSV, but I want both records to end up there. Any idea?
I have tried to create a new CSV with a pandas dataframe, and I want to append the values instead of overwriting them. I have tried:
finalDf = pd.DataFrame(columns=['sourcez', 'tergetz', 'TMP'])
df = pd.DataFrame()
df["sourcez"] = ["str(source_Path)"]
df["tergetz"] = ["str(target_path)"]
df["TMP"] = ["total_matching_points"]
finalDf = finalDf.append(df)
finalDf.to_csv('Testing.csv', index=False, mode='a+')
But the problem is that the heading repeats in the CSV:
sourcez,tergetz,TMP
str(source_Path),str(target_path),total_matching_points
sourcez,tergetz,TMP
str(source_Path)_New,str(target_path)_New,total_matching_points_New
How can I avoid repeating the heading sourcez,tergetz,TMP?
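One common fix (a sketch, not from the original thread) is to skip the intermediate finalDf entirely and append each row straight to the file, writing the header only when the file does not exist yet. Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so avoiding it also helps on newer versions. The path and values below are placeholders:

```python
import os
import tempfile
import pandas as pd

def append_row(path, row):
    """Append one row to a CSV, writing the header only when the file is new."""
    pd.DataFrame([row]).to_csv(path, mode='a', index=False,
                               header=not os.path.exists(path))

csv_path = os.path.join(tempfile.gettempdir(), 'Testing.csv')
if os.path.exists(csv_path):
    os.remove(csv_path)  # start clean for the demo

append_row(csv_path, {'sourcez': 'a.txt', 'tergetz': 'b.txt', 'TMP': 1})
append_row(csv_path, {'sourcez': 'c.txt', 'tergetz': 'd.txt', 'TMP': 2})
```

Reading the file back now yields two data rows under a single header line.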
df1 = pd.read_excel('master_textbook_list_for_connections.xlsx')
df1.head(5)
df1.columns
new_df = df1.to_csv("csdatext.csv", index=False, columns=['PREFIX'='CSDA'])
I have created a data frame and I need to save only the part where PREFIX = 'CSDA' to a file called csdatext.csv.
I am assuming PREFIX is a column here:
df1 = pd.read_excel('master_textbook_list_for_connections.xlsx')
df1.head(5)
df1.columns
new_df = df1[df1['PREFIX'] == 'CSDA']
new_df.to_csv("csdatext.csv", index=False)
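To see the boolean-mask filter in isolation, here is a minimal sketch with a tiny made-up frame standing in for the real workbook:

```python
import pandas as pd

# tiny made-up frame standing in for the real Excel data
df1 = pd.DataFrame({'PREFIX': ['CSDA', 'MATH', 'CSDA'],
                    'TITLE': ['a', 'b', 'c']})

new_df = df1[df1['PREFIX'] == 'CSDA']  # boolean mask keeps matching rows only
new_df.to_csv('csdatext.csv', index=False)
```

The mask keeps the first and third rows; everything else, including the MATH row, is dropped before writing.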
I have around 400 tables that I want to merge based on some specific columns (some tables may not have all columns compared to column_list - then there should be NaN)
I am using the code below. It filters the columns of interest like intended but when appending filter_df to final, then final stays empty. Any help much appreciated.
final = pd.DataFrame(columns=column_list)
files = os.listdir(path)
num = len(files)
for idx, file in enumerate(files):
    df = pd.read_csv(os.path.join(path, file), sep=',', index_col=False, header=3)
    df = df.rename(columns=lambda x: x.strip())  # some column names have trailing spaces
    filter_df = df.loc[:, df.columns.isin(column_list)]
    final.append(filter_df, ignore_index=True)
    print('Progress:', round((idx+1)/num, 4)*100, '%')
pd.DataFrame.to_csv(final, base_path + 'Master_File.csv')
append does not modify final in place; it returns a new dataframe, so assign the result back to final, like this:
final = final.append(filter_df, ignore_index=True)
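Two caveats worth noting: append re-copies the whole frame on every iteration, and DataFrame.append was removed in pandas 2.0. A sketch of the concat-based alternative, with tiny stand-in frames in place of the 400 CSV files (the column_list here is made up):

```python
import pandas as pd

column_list = ['a', 'b', 'c']                    # stand-in for the real column_list
frames = [pd.DataFrame({'a': [1], 'b': [2]}),    # stand-ins for the CSV files
          pd.DataFrame({'b': [3], 'c': [4]})]

parts = []
for df in frames:
    df = df.rename(columns=lambda x: x.strip())
    parts.append(df.loc[:, df.columns.isin(column_list)])

# one concat at the end; columns missing from a file become NaN
final = pd.concat(parts, ignore_index=True).reindex(columns=column_list)
```

Collecting the pieces in a list and concatenating once at the end scales much better than appending inside the loop.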
I'm a beginner python user and I am wondering if it is possible to store multiple dataframes that are generated from a loop in a list.
Unfortunately, I do not have a reproducible example. What I am trying to do is to read in a directory of pdf files, make row 0 into the header, drop that row and store it into one dataframe within a list.
master_df = []
for i in range(1, len(pdffiles)):
    df = read_pdf(pdffiles[i])
    df.columns = df.iloc[0,]  # get col names
    df = df.reindex(df.index.drop(0))  # drop first row
    df = df.replace(np.nan, '', regex=True, inplace=True)
    master_df = df
This is the code that I have but I am getting this error at df.columns, reindex and replace.
AttributeError: 'NoneType' object has no attribute 'replace'
Could anyone point me in the right direction?
Update:
May I ask why the following code does not work? I'm trying to continue past files where the dataframe comes back as None.
master_df = []
for i in range(len(pdffiles)):
    df = read_pdf(pdffiles[i])
    if df is not None:
        continue
    df.columns = df.iloc[0,:]  # get col names
    df = df.reindex(df.index.drop(0))  # drop first row
    df = df.fillna('')
    master_df.append(df)
It is possible to store data frames in a list:
master_df = []
for i in range(len(pdffiles)):
    df = read_pdf(pdffiles[i])
    df.columns = df.iloc[0,:]  # get col names
    df = df.reindex(df.index.drop(0))  # drop first row
    df = df.fillna('')
    master_df.append(df)
You can use df.fillna('') to replace NaN values with ''. Note that your original replace(..., inplace=True) call returns None, and assigning that back to df is what caused the AttributeError.
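On the update in the question: the guard is inverted. `if df is not None: continue` skips every file that did parse, so the loop body only ever runs on None and fails. A minimal sketch with a stand-in read_pdf (standing in for the real PDF parser) showing the corrected guard:

```python
import pandas as pd

def read_pdf(path):
    """Stand-in for the real PDF parser: one file 'fails' and returns None."""
    if path == 'bad.pdf':
        return None
    return pd.DataFrame([['col1', 'col2'], ['x', 'y']])

master_df = []
for path in ['good.pdf', 'bad.pdf']:
    df = read_pdf(path)
    if df is None:          # the update had this inverted (`is not None`)
        continue            # skip unparsable files
    df.columns = df.iloc[0, :]                  # promote first row to header
    df = df.reindex(df.index.drop(0)).fillna('')  # drop that row, clean NaNs
    master_df.append(df)
```

Only the successfully parsed file ends up in the list; the None result is skipped instead of crashing the loop.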
I am trying to parse multiple excel sheets with Pandas into separate individual DataFrames.
My code so far is:
sheet_names = ["tab1", "tab2"]
df_names = [1, 2]

def initilize_dataframes(sheet_names):
    for name in sheet_names:
        df = xls_file.parse(name)  # parse the xlsx sheet
        df = df.transpose()  # transpose dates to index
        new_header = df.iloc[0]  # column header names
        df = df[1:]  # drop 1st row
        df.rename(columns=new_header, inplace=True)  # rename the columns
    return df
for i in df_names:
    df_(i) = initilize_dataframes(sheet_names)  # something like this idk
The last two lines I can not wrap my head around. I get that the function will return the df, but I would like it to take the values from the df_names list. And label the DataFrame accordingly.
For example, tab1 in the excel sheet the DataFrame should be named df_1 and looping for tab2 and df_2 respectively.
It is possible with globals:

for i, val in enumerate(df_names):
    globals()['df_' + str(val)] = initilize_dataframes(sheet_names[i])

But it is better to use a dict of DataFrames. enumerate supplies the position i (counting from 0), which selects the matching sheet name, while val supplies the dict key. Note that initilize_dataframes then needs to accept a single sheet name rather than the whole list; as written, its loop only ever returns the last sheet.

dfs = {}
for i, val in enumerate(df_names):
    dfs[val] = initilize_dataframes(sheet_names[i])

print(dfs[1])
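A self-contained version of the dict approach, using a stub in place of the real Excel parsing so it can run standalone (the stub just records which sheet it was given):

```python
import pandas as pd

def initilize_dataframes(sheet_name):
    """Stub standing in for the real per-sheet Excel parsing above."""
    return pd.DataFrame({'sheet': [sheet_name]})

sheet_names = ['tab1', 'tab2']
df_names = [1, 2]

dfs = {}
for i, val in enumerate(df_names):
    dfs[val] = initilize_dataframes(sheet_names[i])  # dfs[1] <- tab1, dfs[2] <- tab2
```

Each parsed sheet is then reachable by its intended label, e.g. dfs[1] for tab1, without touching globals().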