I have 5 Excel files; each file contains 8 sheets, and each sheet contains around 30 rows. So one file has 30 x 8 = 240 rows in total. Is there a quick trick I can use to combine all 5 files (including all their sheets) into one Excel file with 240 x 5 = 1200 rows in total?
This is my current code:
import os
import pandas as pd
files = os.listdir(r'c:\data\KM\Desktop\my_folder')
os.chdir(r'c:\data\KM\Desktop\my_folder')
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel(file))
df.head()
df.to_excel('all_files.xlsx')
Now with this code I have 2 problems:
From each file I only get the 1st sheet, so it merges 5 sheets in total instead of 40 (8 x 5) :(
For every file it also copies the column headers; this should only happen for the 1st file. All the files and sheets have the same column names.
Appreciate your help all.
Use read_excel with sheet_name=None to read all sheets, join the resulting dict of DataFrames per file with concat, and then use concat again to build one big DataFrame:
import glob
import pandas as pd

files = glob.glob(r'c:/data/KM/Desktop/my_folder/*.xlsx')
dfs = (pd.concat(pd.read_excel(fp, sheet_name=None)) for fp in files)
dfbig = pd.concat(dfs, ignore_index=True)
Edit: To drop the last sheet of each file, convert the returned dict of DataFrames to a list and remove the last element by indexing:
files = glob.glob(r'D:/Dropbox/work-joy/so/files/*.xlsx')
dfs = (pd.concat([v for k, v in pd.read_excel(fp, sheet_name=None).items()][:-1])
       for fp in files)
df = pd.concat(dfs, ignore_index=True)
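If the goal is a single Excel file rather than just a DataFrame, the combined frame can then be written back out; a minimal follow-up to the snippet above, assuming the all_files.xlsx output name from the question:
# headers are written only once, because dfbig/df is now a single DataFrame
dfbig.to_excel('all_files.xlsx', index=False)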
Related
Background: Python, pandas, Excel.
There are 100 files of type .xlsx, and every file has 10000 rows and 50 cols;
I want to concat them into one Excel file;
I have tried to concat them with pandas.concat(),
As shown below:
import os
import pandas

org_dir = 'D:/soft/project/excel'
out_filepath = 'D:/soft/project/excel/concat_file.xlsx'

res_df = []
for file in os.listdir(org_dir):
    cur_df = pandas.read_excel(os.path.join(org_dir, file),
                               dtype=str)
    res_df.append(cur_df)

concat_df = pandas.concat(res_df, ignore_index=True)
wr_concat = pandas.ExcelWriter(out_filepath, engine='openpyxl')
concat_df.to_excel(wr_concat, index=False)
wr_concat.close()
but it needs more than 20 minutes;
is there a better solution to speed it up?
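Most of the time is likely spent writing the .xlsx output rather than concatenating; a sketch of one possible speed-up, assuming the xlsxwriter package is installed and openpyxl-specific features are not required:
import os
import pandas as pd

org_dir = 'D:/soft/project/excel'
out_filepath = 'D:/soft/project/excel/concat_file.xlsx'

# read and concatenate as before
frames = [pd.read_excel(os.path.join(org_dir, f), dtype=str)
          for f in os.listdir(org_dir) if f.endswith('.xlsx')]
concat_df = pd.concat(frames, ignore_index=True)

# xlsxwriter is generally faster than openpyxl for writing large sheets
with pd.ExcelWriter(out_filepath, engine='xlsxwriter') as writer:
    concat_df.to_excel(writer, index=False)

# if an .xlsx file is not strictly required, writing CSV is far faster still:
# concat_df.to_csv('D:/soft/project/excel/concat_file.csv', index=False)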
So basically I have a ton of files that change every week, and I want to know if there is a way to tell the Python script to grab only the sheet that contains a specific column name. For example:
For file test.xlsx I have the following structure
sheet 1:
columnA   columnB   columnC   Columnd   ColumnE   ColumnF
dsf       sdfas     asdf      sadf      asfdsd    sdfasf
sdfsd     sadfsd    asdfsd    asdfasd   asdfdsf   sdfadf

Sheet 2:
jira_tt   alignID   issueID   testID
dsf       sdfas     asdf      sadf
As you can see, this Excel file has 2 sheets; however, this is just an example, as some files may have more than 2 sheets or the sheet names will change. As stated above, I want to read, across all the files, every sheet that has the keyword "jira" in its columns. So far I have been able to create a script that reads all the files in the target folder, but I don't have a clue how to specify the sheet I need. Here is part of the code I've created so far:
import glob as glob
import pandas as pd
# using glob to get all the files that have an xlsx extension
ingestor = glob.glob("*.xlsx")
for f in ingestor:
    df = pd.read_excel(f)
    df
Any kind of assistance or guidance will be appreciated.
To include all your files as DataFrames, you can build a list of them and use the merge() method to combine the files into one DataFrame, for example:
from functools import reduce
import glob as glob
import pandas as pd

ingestor = glob.glob("*.xlsx")
df = reduce(lambda left, right: pd.merge(left, right), [pd.read_excel(data) for data in ingestor])
print(df.isin(['jira']))
If you want just the files that contain a specific word (like "jira") in their columns, you need to evaluate a condition with any() on each iteration and then combine the data:
ingestor = glob.glob("*.xlsx")
frames = []
for f in ingestor:
    data = pd.read_excel(f)
    # keep the file only if any column name contains "jira"
    if any("jira" in str(col) for col in data.columns):
        frames.append(data)
df = pd.concat(frames, ignore_index=True)
print(df)
First, note that pd.read_excel(f) returns only the first sheet by default. If you want to get another sheet, or more than one, you should use the sheet_name argument of pandas.read_excel (see the documentation).
For the case that the number of sheets is unknown, specify None to get all worksheets:
pd.read_excel(f, sheet_name=None)
Note that now a dict of DataFrames is returned.
To get the sheets with a column that contains "jira", simply check the column names:
for f in files:
    df_dict = pd.read_excel(f, sheet_name=None)
    for sheet_name, df in df_dict.items():
        for column in df.columns:
            if 'jira' in column:
                # do something with this column or df
                print(df[column])
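To end up with one combined DataFrame instead of only printing, the matching sheets can be collected and concatenated; a minimal sketch building on the loop above (variable names are illustrative):
import glob
import pandas as pd

files = glob.glob("*.xlsx")
matching = []
for f in files:
    for sheet_name, df in pd.read_excel(f, sheet_name=None).items():
        # keep the whole sheet if any of its column names mentions "jira"
        if any('jira' in str(column) for column in df.columns):
            matching.append(df)

result = pd.concat(matching, ignore_index=True)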
I am trying to make a list using pandas before putting all data sets into 2D convolution layers.
I was able to merge all the data from the multiple Excel files into a list.
However, the code only reads one chosen sheet from each of the Excel files.
For example, I have 7 sheets in each Excel file, named 'gpascore1', 'gpascore2', 'gpascore3', 'gpascore4', 'gpascore5', 'gpascore6', 'gpascore7'.
Each sheet has 4 rows and 425 columns.
The code is shown below.
import os
import pandas as pd
path = os.getcwd()
files = os.listdir(path)
files_xls = [f for f in files if f[-3:] == 'xls']

df = pd.DataFrame()
for f in files_xls:
    # Read only one chosen sheet -> gpascore1 is a sheet name.
    data = pd.read_excel(f, 'gpascore1')
    # But there are 6 more sheets and I would like to read data from all of the sheets.
    df = df.append(data)

data_y = df['admit'].values
data_x = []
for i, rows in df.iterrows():
    data_x.append([rows['gre'], rows['gpa'], rows['rank']])

df = df.dropna()
df.count()
Then I got the result shown below.
This is because only the data from the 'gpascore1' sheet in the 3 Excel files was merged.
But I want to read the data from the 6 other sheets in the Excel files as well.
Could anyone help me to find out the answer, please?
Thank you
===============<Updated code & errors>==================================
Thank you for the answers. I revised the read_excel() call from
data = pd.read_excel(f, 'gpascore1') to
data = pd.read_excel(f, sheet_name=None)
But now I get KeyErrors like below.
Could you give me any suggestions for this issue, please?
Thank you
I actually found this question under the tag of 'tensorflow'. That's hilarious. Ok, so you want to merge all Excel sheets into one dataframe?
import os
import pandas as pd
import glob
glob.glob("C:\\your_path\\*.xlsx")
all_data = pd.DataFrame()
for f in glob.glob("C:\\your_path\\*.xlsx"):
df = pd.read_excel(f)
all_data = all_data.append(df,ignore_index=True)
type(all_data)
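Note that pd.read_excel(f) above still reads only the first sheet, and sheet_name=None (as in the updated question) returns a dict of DataFrames, which is likely why df['admit'] raises a KeyError. A sketch that reads every sheet of every file and stacks them first, assuming the same column names ('admit', 'gre', 'gpa', 'rank') appear in all sheets:
import glob
import pandas as pd

all_data = []
for f in glob.glob("C:\\your_path\\*.xls*"):
    # sheet_name=None returns {sheet_name: DataFrame}; concat its values into one frame
    sheets = pd.read_excel(f, sheet_name=None)
    all_data.append(pd.concat(sheets.values(), ignore_index=True))

df = pd.concat(all_data, ignore_index=True).dropna()
data_y = df['admit'].values
data_x = df[['gre', 'gpa', 'rank']].values.tolist()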
I have 20-30 CSV files containing columns like 'id','col1','col2','col3' and 1 big CSV file of 20GB in size that I want to read in chunks and merge with these smaller CSV files.
The bigger CSV file has the columns 'id','name','zipdeails'.
Both have the ID column in the same sequence;
a sample looks like
'id','name','zipdeails'
1,Ravi,2031345
2,Shayam,201344
3,Priya,20134
.........
1000,Pravn,204324
chunk file 1 looks like
'id','col1','col2','col3'
1,Heat,,
2,Goa,Next,
3,,,Delhi
All the smaller CSV files are of the same length (number of rows), except for the last file, which may be shorter; each has a header. The bigger CSV file to which these are to be merged can be broken into chunks whose size equals the length of these smaller files.
so Last chunk looks like
'id','col1','col2','col3'
1000,Jaipur,Week,Trip
Now the output should look like
'id','name','zipdeails','col1','col2','col3'
1,Ravi,2031345,Heat,NAN,NAN
2,Shayam,201344,Goa,Next,NAN
3,Priya,20134,NAN,NAN,Delhi
.........
1000,Pravn,204324,Jaipur,Week,Trip
I think you need to create a list of DataFrames for all the small files, then read the big file into memory and concat everything together on an index created from the id column:
import glob
import pandas as pd

# concat 30 files
files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp, index_col=['id']) for fp in files]

# read the big file into memory and, if necessary, index it by id
df_big = pd.read_csv('big_df_file.csv')
df_big = df_big.set_index('id')

df_fin = pd.concat([df_big] + dfs, axis=1)
A slightly modified solution is possible if the id values are in the same order in all DataFrames, without duplicates, like 1,2,3...N: use the nrows parameter to read only the first rows of the big DataFrame, up to the max length of the smaller DataFrames:
# concat 30 files
files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp, index_col=['id']) for fp in files]

len_max = max([x.index.max() for x in dfs])
df_big = pd.read_csv('big_df_file.csv', index_col=['id'], nrows=len_max)

df_fin = pd.concat([df_big] + dfs, axis=1)
EDIT:
# concat 30 files
files = glob.glob('files/*.csv')
# order of files is important for concatenating values -
# in the first file ids are (1, 100), in the second (101, 200)...
print(files)

# set to the max number of rows per small file
N = 100
# loop over the big file in chunks of size N
for i, x in enumerate(pd.read_csv('files/big.csv', chunksize=N, index_col=['id'])):
    # try/except avoids errors if a non-existent file is selected from the list
    try:
        df = pd.read_csv(files[i], index_col=['id'])
        df1 = pd.concat([x, df], axis=1)
        print(df1)
        # in the first loop create the header in the output
        if i == 0:
            pd.DataFrame(columns=df1.columns).to_csv('files/out.csv')
        # append data to the output file
        df1.to_csv('files/out.csv', mode='a', header=False)
    except IndexError:
        print('no files in list')
I have two data frames with the same structure, each in a CSV file. I want to read both CSVs and merge them into one bigger data frame. In the directory there are only these two files.
The first CSV is called "first":
ad 7 8
as 5 8
ty 9 y
The second CSV is called "second":
ewtw 5 2
as 1 2
ty 4 9
My code is:
import os
import pandas as pd
targetdir = "C:/Documents and Settings/USER01/Mis documentos/experpy"
filelist = os.listdir(targetdir)
for file in filelist :
df_csv=pd.read_csv(file)
big_df = pd.concat(df_csv)
Unfortunately, it didn't work. How can I fix that?
If you are only going to have two CSVs, then you may just want to use pd.merge:
import pandas as pd

first = pd.read_csv('first.csv')    # insert your file path
second = pd.read_csv('second.csv')
big_df = pd.merge(first, second, how='outer')  # union of first and second
concat takes a list or dict of Series/DataFrames: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.tools.merge.concat.html, so what you can do is make a list of the DataFrames and concat them all together to make your big df:
filelist = os.listdir(targetdir)
df_list = []
for file in filelist:
    df_list.append(pd.read_csv(file))
big_df = pd.concat(df_list, ignore_index=True)
Alternatively you can append:
filelist = os.listdir(targetdir)
big_df = pd.DataFrame()
for file in filelist:
    big_df = big_df.append(pd.read_csv(file), ignore_index=True)
I think you should change your path to this:
targetdir = r'C:\Documents and Settings\USER01\Mis documentos\experpy'
The above uses a raw string, which avoids ambiguous handling of backslashes in Windows paths.