I'm relatively new to Python and pandas and I face the following problem: I have 20+ spreadsheets, each with multiple sheets. I'd like to concatenate the second sheet from each spreadsheet into a single spreadsheet. I'm using the code below, which works to the point that it creates a list of sheets, but it doesn't concatenate them correctly: the combined file contains only the sheet from the first file. Each sheet has the same header row and the same structure.
Any help would be appreciated. The code I'm using is below:
import os
import glob
import pandas as pd

os.chdir(r"C:\Users\Site_Users")
extension = 'xlsx'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

# combine the second sheet of every file in the list
xl_list = []
for f in all_filenames:
    df = pd.read_excel(f, sheet_name=1)
    xl_list.append(df)
combined = pd.concat(xl_list, ignore_index=True)
combined.to_excel("combined.xlsx", index=False)
Working under the assumption that you have a list of DataFrames, try adding axis=0 to your concat.
i.e.
combined = pd.concat(xl_list, axis = 0, ignore_index = True)
Just to close the loop on this: I found the answer. The code was correct, but a number of rows that looked empty actually contained formulas, which the code saw as non-empty cells, so those rows were added to the combined sheet. I missed the added rows because they sat 400 rows below the apparently empty ones.
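For anyone who hits the same issue, a minimal sketch of one way to guard against it, assuming the formula-only rows come through as NaN or empty strings after reading:

# treat empty strings as missing, then drop rows where every cell is missing,
# so formula-only "blank" rows don't survive the concat
combined = combined.replace("", pd.NA).dropna(how="all").reset_index(drop=True)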
There are many similar questions to this. I have looked through them all and I can't figure out how to fix my issue.
I have 11 dataframes that I would like to export to one Excel file, with one sheet per dataframe. I have two lists: one of dataframe objects and one of the names I want for each df. The lists are ordered so that iterating through both at the same time pairs each df with the name I want for it.
Here is my code:
for (df, df_name) in zip(df_list, df_name_list):
    sheetname = "{}".format(df_name)
    df.to_excel(r"myfolder\myfile.xlsx", index=False, sheet_name=sheetname)
It exports to Excel, but it overwrites the file each time through the loop: the final file has a single sheet named after the final dataframe, so the loop ran through both lists but separate sheets weren't saved. Any help would be much appreciated!
UPDATE - ISSUE FIXED - EDITING TO ADD THE CODE THAT WORKED
with pd.ExcelWriter(r"myfolder\myfile.xlsx") as writer:
    for (df, df_name) in zip(df_list, df_name_list):
        sheetname = "{}".format(df_name)
        df.to_excel(writer, sheet_name=sheetname)
I just tried something based on the docs example and it seems to work OK:
import pandas as pd

data = [(f'Sheet {j}', pd.DataFrame({'a': [i for i in range(20)], 'b': [j for i in range(20)]})) for j in range(10)]

with pd.ExcelWriter('output.xlsx') as writer:
    for sheet, df in data:
        df.to_excel(writer, sheet_name=sheet)
So, I have 7200 txt files, each with 25 lines. I would like to create a dataframe from them, with 7200 rows and 25 columns, where each line of a .txt file becomes the value of a column.
For that, I first created a list column_names of length 25 and tested importing a single .txt file.
However, when I try this:
pd.read_csv('Data/fake-meta-information/1-meta.txt', delim_whitespace=True, names=column_names)
I get a 25x25 dataframe, with values only in the first column. How do I read the file so that the lines of the txt end up as values across the columns, instead of everything going into the first column and creating 25 rows?
My next step would be creating a for loop to append each text file as a new row.
Probably something like this:
dir1 = *folder_path*
file_list = os.listdir(dir1)
number_files = len(file_list)
for i in range(number_files):
    title = file_list[i]
    df_temp = pd.read_csv(dir1 + title, delim_whitespace=True, names=column_names)
    df = df.append(df_temp, ignore_index=True)
I hope I have been clear. Thank you all in advance!
read_csv generates a row per line in the source file, but you want those lines to be columns. You could read the rows and pivot them to columns, but since these files have a single value per line, you can just read each one with numpy and use the resulting array as a row in a dataframe.
import numpy as np
import pandas as pd
from pathlib import Path

dir1 = Path(".")
# each np.loadtxt call returns a 25-element array, which becomes one row
df = pd.DataFrame([np.loadtxt(filename) for filename in dir1.glob("*.txt")])
print(df)
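If you also want the question's column_names applied to the result (assuming that list is in scope), add df.columns = column_names after building the dataframe.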
tdelaney's answer is probably "better" than mine, but if you want to keep your code stylistically closer to what you are currently doing, the following is another option.
You are getting your current output (25x25 with data in the first column only) because the data you read is 25x1, but you are forcing the dataframe to have 25 columns with the names=column_names parameter.
To solve, just wait until the end to apply the column names:
Get a 25x1 df (drop the names param and pass header=None so the first line isn't consumed as a header): df_temp = pd.read_csv(dir1 + title, delim_whitespace=True, header=None)
Concatenate the 25x1 dfs column-wise, forming a 25x7200 df: df = pd.concat([df, df_temp], axis=1, ignore_index=True)
Transpose the df, forming the final 7200x25 df: df = df.T
Add column names: df.columns = column_names
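Putting those steps together, a sketch, assuming the dir1 folder path and the column_names list from the question:

import os
import pandas as pd

dir1 = "Data/fake-meta-information/"  # folder path from the question
df = pd.DataFrame()
for title in os.listdir(dir1):
    # each file holds one value per line, so this reads a 25x1 dataframe
    df_temp = pd.read_csv(dir1 + title, delim_whitespace=True, header=None)
    # stack the columns side by side, building up 25 x n_files
    df = pd.concat([df, df_temp], axis=1, ignore_index=True)
df = df.T  # transpose to n_files x 25
df.columns = column_names  # apply the question's column names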
I have a dataframe which I created by merging one column from 7 different Excel files. Below is the code I used:
import pandas as pd
import glob

my_excel_files = glob.glob(r"C:\Users\.........\*.xlsx")
total_dataframe = pd.DataFrame()
for file in my_excel_files:
    df = pd.read_excel(file)  # read the workbook
    new_df = df['Comments']
    total_dataframe = pd.concat([total_dataframe, new_df], axis=1)  # puts together all Comments columns
As you can see, the code grabs the 'Comments' column from each Excel file and puts them together into a new df. The only issue is that I want to add the filename to the column name so I know which column comes from which file; right now they are all just called 'Comments'. Ideally one of the column headers would be 'Comments (first_response.xlsx)'.
Let's use pathlib and pd.concat.
Using a dict comprehension we can grab the .name attribute from each pathlib object, and when the dict is passed to concat the filename is set as the index.

from pathlib import Path
import pandas as pd

dfs = pd.concat({f.name: pd.read_excel(f) for f in Path(r'C:\Users\..').glob('*.xlsx')})

This will create an index with the file name; you can reset_index if you want to place it as a column.
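If you specifically want one Comments column per file, side by side and named as in the question, a variation on the same idea (a sketch; the folder path is the question's placeholder):

from pathlib import Path
import pandas as pd

# read only the 'Comments' column of each file and rename the resulting
# series to include the source filename, e.g. 'Comments (first_response.xlsx)'
cols = [pd.read_excel(f, usecols=['Comments'])['Comments'].rename('Comments ({})'.format(f.name))
        for f in Path(r'C:\Users\..').glob('*.xlsx')]
total_dataframe = pd.concat(cols, axis=1)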
I have been using the following code from another Stack Overflow answer to concatenate data from multiple Excel sheets in the same workbook into one sheet.
This works great when the column names are uniform across all sheets in a workbook. However, I'm running into an issue with one specific workbook where only the first column is named differently (or not named at all, so it is blank), while the rest of the columns are the same.
How do I merge such sheets? Is there a way to rename the first column of each sheet to one name so that I can then use the steps from the answer linked above?
Yes, you can rename all the columns as follows:

# read all sheets into a dict of dataframes
dfs = pd.read_excel('tmp.xlsx', sheet_name=None)

# rename columns
column_names = ['col1', 'col2', ...]
for df in dfs.values():
    df.columns = column_names

# concat
total_df = pd.concat(dfs.values(), ignore_index=True)
Or, you can ignore the header in read_excel so that the columns are labeled 0, 1, 2, ...:

# read, skipping each sheet's header row
dfs = pd.read_excel('tmp.xlsx', sheet_name=None,
                    header=None, skiprows=1)
total_df = pd.concat(dfs.values())

# rename
total_df.columns = column_names
I am working with Python 2.7 and I wrote a script that takes the names of two .xlsx files, uses pandas to convert them into two dataframes, and then concatenates them.
The two files under consideration have the same rows and different columns.
Basically, I have these two Excel files:
I would like to keep the same rows and just unite the columns.
The code is the following:
import pandas as pd
file1 = 'file1.xlsx'
file2 = 'file2.xlsx'
sheet10 = pd.read_excel(file1, sheet_name = 0)
sheet20 = pd.read_excel(file2, sheet_name = 0)
conc1 = pd.concat([sheet10, sheet20], sort = False)
output = pd.ExcelWriter('output.xlsx')
conc1.to_excel(output, 'Sheet 1')
output.save()
Instead of doing what I expected (given the examples I read online), the output becomes something like this:
Does anyone know how I could improve my script?
Thank you very much.
The best answer here really depends on the exact shape of your data. Based on the example you have provided, it looks like the data is indexed identically between the two dataframes, with differing column headers that you want preserved. If this is the case, this would be the best solution:
import pandas as pd

file1 = 'file1.xlsx'
file2 = 'file2.xlsx'
sheet10 = pd.read_excel(file1, sheet_name=0)
sheet20 = pd.read_excel(file2, sheet_name=0)
conc1 = sheet10.merge(sheet20, how="left", left_index=True, right_index=True)

output = pd.ExcelWriter('output.xlsx')
conc1.to_excel(output, sheet_name='Sheet 1', index=False)
output.save()
Since there is a direct match between the number of rows in the two initial dataframes, it doesn't really matter whether a left, right, outer, or inner join is used; in this example I used a left join.
If the rows in the two dataframes do not line up perfectly, though, the join method you select can have a huge impact on your output. I recommend looking at the pandas documentation on merge/join/concatenate before you go any further.
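For instance, a toy illustration with made-up data, showing how misaligned indexes behave under different join types:

import pandas as pd

a = pd.DataFrame({'x': [1, 2, 3]}, index=[0, 1, 2])
b = pd.DataFrame({'y': [10, 20]}, index=[1, 2])

# a left join keeps all of a's rows; y is NaN where b has no match
print(a.merge(b, how='left', left_index=True, right_index=True))
# an inner join keeps only the shared index labels 1 and 2
print(a.merge(b, how='inner', left_index=True, right_index=True))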
To get the expected output using pd.concat, the column names in both dataframes should be the same. Here's how to do it:
# Create a 1:1 mapping of sheet10 and sheet20 columns
cols_mapping = dict(zip(sheet20.columns, sheet10.columns))
# Rename the columns in sheet20 to match with that of sheet10
sheet20_renamed = sheet20.rename(cols_mapping, axis=1)
concatenated = pd.concat([sheet10, sheet20_renamed])