I am writing a small program to concatenate a load of measurements from multiple CSV files into one Excel file. I have pretty much the whole program written and working; the only thing I'm struggling with is getting the data from the CSV files to automatically turn into numbers when the DataFrame places them into the Excel file.
The code I have looks like this:
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd
import os
import csv
import glob
os.chdir(r"directoryname")
retval = os.getcwd()
print ("Directory changed to %s" % retval)
files = glob.glob(r"directoryname\datafiles*csv")
print(files)
files.sort(key=lambda x: os.path.getmtime(x))
writer = pd.ExcelWriter('test.xlsx')
df = pd.read_csv("datafile.csv", index_col=False)
df = df.iloc[0:41, 1]
df.to_excel(writer, 'sheetname', startrow =0, startcol=1, index=False)
i = 0
for f in files:
    i += 1
    df = pd.read_csv(f, index_col=False)
    df = df.iloc[0:41, 2]
    df.to_excel(writer, 'sheetname', startrow=0, startcol=1 + i, index=False)
writer.save()
Thanks in advance
Do you mean:
df.loc[:, 'measurements'] = df.loc[:, 'measurements'].astype(float)
When you read the DataFrame you can cast each of your columns like that, for example.
A different solution is to cast the columns while reading your CSV, using the dtype parameter (see the read_csv documentation):
EXAMPLE
df = pd.read_csv(os.path.join(savepath, 'test.csv'), sep=';',
                 dtype={'ID': 'Int64', 'STATUS': 'object'}, encoding='utf-8')
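If you don't know the column names up front, another option (a quick sketch of my own, reusing the datafile.csv name from the question) is to let pandas attempt the conversion on every column with pd.to_numeric:

import pandas as pd

df = pd.read_csv('datafile.csv', index_col=False)
# Try a numeric conversion on each column; with errors='ignore',
# columns that cannot be parsed as numbers are left unchanged.
df = df.apply(pd.to_numeric, errors='ignore')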
I'm trying to merge a series of xlsx files into one, which works fine.
However, when I read a file, columns containing ints are transformed into floats (or dates?) when I merge and output them to CSV; I have tried to visualize this in the picture. I have seen some solutions where dtype is used to "force" specific columns into int format, but I do not always know the index nor the title of the column, so I need a more scalable solution.
Anyone with some thoughts on this?
Thank you in advance
#specify folder with xlsx-files
xlsFolder = "{}/system".format(directory)
dfMaster = pd.DataFrame()
#make a list of all xlsx-files in folder
xlsFolderContent = os.listdir(xlsFolder)
xlsFolderList = []
for file in xlsFolderContent:
    if file[-5:] == ".xlsx":
        xlsFolderList.append(file)
for xlsx in xlsFolderList:
    print(xlsx)
    xl = pd.ExcelFile("{}/{}".format(xlsFolder, xlsx))
    for sheet in xl.sheet_names:
        if "_Errors" in sheet:
            print(sheet)
            dfSheet = xl.parse(sheet)
            dfSheet.fillna(0, inplace=True)
            dfMaster = dfMaster.append(dfSheet)
print("len of dfMaster:", len(dfMaster))
dfMaster.to_csv("{}/dfMaster.csv".format(xlsFolder), sep=";")
Data sample: (screenshot not included)
Try to use dtype='object' as a parameter of pd.read_csv (or ExcelFile.parse) to prevent Pandas from inferring the data type of each column. You can also simplify your code using pathlib:
import pandas as pd
import pathlib

directory = pathlib.Path('your_path_directory')
xlsFolder = directory / 'system'

data = []
for xlsFile in xlsFolder.glob('*.xlsx'):
    sheets = pd.read_excel(xlsFile, sheet_name=None, dtype='object')
    for sheetname, df in sheets.items():
        if '_Errors' in sheetname:
            data.append(df.fillna('0'))

pd.concat(data).to_csv(xlsFolder / 'dfMaster.csv', sep=';')
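As a side note on why the ints became floats: a plain int64 column cannot hold NaN, so pandas upcasts it to float64 as soon as a missing value appears. If you would rather keep a numeric dtype instead of 'object', one alternative (requires pandas >= 0.24) is the nullable Int64 extension dtype; a minimal sketch:

import pandas as pd

s = pd.Series([1, None, 3])
print(s.dtype)           # float64, because NaN forced the upcast

s = s.astype('Int64')    # nullable integer dtype keeps whole numbers
print(s.dtype)           # Int64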
I need to combine all the excel files in my directory into one excel file.
For example, I have 3 Excel files (file 1, file 2, file 3; data samples were shown as screenshots). I need to concatenate them side by side to get a single combined table as output, but instead they were appended one after another.
This is my code:
import pandas as pd
import numpy as np
import glob
all_data = pd.DataFrame()
for f in glob.glob('C:/Users/test-results/FinalResult/05-01-2019/*.xlsx'):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
writer = pd.ExcelWriter('mycollected_data.xlsx', engine='xlsxwriter')
all_data.to_excel(writer, sheet_name='Sheet1')
writer.save()
During my quest, all I found was how to append DataFrames, as shown in my code; I didn't figure out how to use join.
You could use
files = glob.glob('C:/Users/test-results/FinalResult/05-01-2019/*.xlsx')
dfs = (pd.read_excel(f, index_col=0) for f in files)
all_data = pd.concat(dfs, axis=1)
Try this to glue the columns side by side:
all_data = pd.concat([all_data, df], axis=1)
Or, if the files share a key column, merge on it instead:
all_data = all_data.merge(df, on=['first_column_name'], how='left')
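To see the difference between the two on toy data (the column names here are made up for the example):

import pandas as pd

a = pd.DataFrame({'id': [1, 2], 'x': [10, 20]})
b = pd.DataFrame({'id': [1, 2], 'y': [30, 40]})

# concat with axis=1 glues the columns together positionally,
# so both frames keep their own 'id' column
print(pd.concat([a, b], axis=1))

# merge aligns rows on the key and keeps a single 'id' column
print(a.merge(b, on='id', how='left'))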
I'm trying to concat all Excel files and the worksheets in them into one using the below script. It kinda works, but then the Excel file c.xlsx is overwritten for every source file, so only the last Excel file is concatenated. Not sure why?
import pandas as pd
import os
import ntpath
import glob
dir_path = os.path.dirname(os.path.realpath(__file__))
os.chdir(dir_path)
cdf = None
for excel_names in glob.glob('*.xlsx'):
    print(excel_names)
    df = pd.read_excel(excel_names, sheet_name=None, ignore_index=True)
    cdf = pd.concat(df.values())
    cdf.to_excel("c.xlsx", header=False, index=False)
The idea is to create a list of DataFrames in a list comprehension; because sheet_name=None returns an ordered dict of sheets per file, it is necessary to concat each file's sheets first and then concat again into one big final DataFrame:
cdf = [pd.read_excel(excel_names, sheet_name=None).values()
       for excel_names in glob.glob('files/*.xlsx')]

df = pd.concat([pd.concat(x) for x in cdf], ignore_index=True)
#print (df)
df.to_excel("c.xlsx", index=False)
I just tested the code below. It merges data from all Excel files in a folder into one, single, Excel file.
import pandas as pd
import numpy as np
import glob
glob.glob("C:\\your_path\\*.xlsx")
all_data = pd.DataFrame()
for f in glob.glob("C:\\your_path\\*.xlsx"):
df = pd.read_excel(f)
all_data = all_data.append(df,ignore_index=True)
print(all_data)
df = pd.DataFrame(all_data)
df.shape
df.to_excel("C:\\your_path\\final.xlsx", sheet_name='Sheet1')
I got it working using the script below, which builds on @ryguy72's answer but works on all worksheets as well as the header row.
import pandas as pd
import numpy as np
import glob
all_data = pd.DataFrame()
for f in glob.glob("my_path/*.xlsx"):
    sheets = pd.read_excel(f, sheet_name=None)  # dict of {sheet name: DataFrame}
    cdf = pd.concat(sheets.values())
    all_data = all_data.append(cdf, ignore_index=True)
print(all_data)
print(all_data.shape)
all_data.to_excel("my_path/final.xlsx", sheet_name='Sheet1')
Please help me find a solution to a problem with importing data from multiple CSV files into one DataFrame in Python.
Code is:
import pandas as pd
import os
import glob
path = r'my_full_path'
os.chdir(path)
results = pd.DataFrame()
for counter, current_file in enumerate(glob.glob("*.csv")):
    namedf = pd.read_csv(current_file, header=None, sep=",", delim_whitespace=True)
    results = pd.concat([results, namedf], join='outer')
results.to_csv('Result.csv', index=None, header=None, sep=",")
The problem is that some part of data are moving to the rows instead of new columns as required.
What is wrong in my code?
P.S.: I found questions about importing multiple CSV files into a DataFrame, for example here: Import multiple csv files into pandas and concatenate into one DataFrame, but the solution doesn't solve my issue :-(
It was solved by using os.path.join inside pd.read_csv, with the pattern read_csv() -> append to a list of DataFrames -> concat:
def get_merged_files(files_list, **kwargs):
    dataframes = []
    for file in files_list:
        df = pd.read_csv(os.path.join(file), **kwargs)
        dataframes.append(df)
    return pd.concat(dataframes, axis=1)
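For example, it could be called on all CSVs in a folder like this (the glob pattern and kwargs are just an illustration, not part of the original answer):

import glob

merged = get_merged_files(sorted(glob.glob('*.csv')), header=None)
merged.to_csv('Result.csv', index=False, header=False)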
You can try using this:
import pandas as pd
import os
files = [file for file in os.listdir('./Your_Folder')] # Here is where all the files are located.
all_csv_files = pd.DataFrame()
for file in files:
df = pd.read_csv("./Your_Folder/"+file)
all_csv_files = pd.concat([all_csv_files, df])
all_csv_files.to_csv("All_CSV_Files_Concat.csv", index=False)
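A slightly more idiomatic variant of the same idea (folder name kept as the placeholder above) collects the frames in a list and concatenates once at the end:

import os
import pandas as pd

folder = './Your_Folder'  # placeholder path, as above
frames = [pd.read_csv(os.path.join(folder, f)) for f in os.listdir(folder)]
pd.concat(frames, ignore_index=True).to_csv("All_CSV_Files_Concat.csv", index=False)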
I want to import all the structured CSV files in a path and output them as one CSV. My code currently only works with a single path and a manually typed CSV file.
import csv
import pandas as pd
import numpy as np
import glob
cols = ['Date', 'Time', 'Duration', 'IP', 'Request']
pd.DataFrame(columns=cols).to_csv('out9.csv', index=False, sep=';')
for df in pd.read_csv('query.csv', sep='\s', header=None, chunksize=6):
    df.reset_index(drop=True, inplace=True)
    df.fillna('', inplace=True)
    d = pd.DataFrame([df.loc[3, 0], df.loc[3, 1], ' '.join(df.loc[3, 4:8]),
                      ' '.join(df.loc[4, 4:6]), ' '.join(df.loc[5, 4:])])
    d.T.to_csv('out.csv', index=False, header=False, mode='a', sep=';')
I know there are some topics about reading many CSV files, but in my case they unfortunately have not helped.
I would like to read all files from: C:\Desktop\Files\*.csv
Information about the CSV files: all are built the same, i.e. no header and the same structure.
At the start of my code I would like to read in everything in the folder and output it formatted in the same way.
So, changing the code as little as possible, I just want to read several CSVs instead of the single 'query.csv'.
Thanks!
I think you can use glob:
import glob
cols = ['Date', 'Time', 'Duration', 'IP', 'Request']
pd.DataFrame(columns=cols).to_csv('out9.csv', index=False, sep=';')
for file in glob.glob('C:/Desktop/Files/*.csv'):
    for df in pd.read_csv(file, sep='\s', header=None, chunksize=6):
        df.reset_index(drop=True, inplace=True)
        ...
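Filling in the elided body with the per-chunk logic from your original single-file version, the full loop would look roughly like this (untested sketch; your original wrote the header to out9.csv but appended rows to out.csv, which looks like a typo, so this uses out9.csv throughout):

import glob
import pandas as pd

cols = ['Date', 'Time', 'Duration', 'IP', 'Request']
pd.DataFrame(columns=cols).to_csv('out9.csv', index=False, sep=';')

for file in glob.glob('C:/Desktop/Files/*.csv'):
    # same per-chunk extraction as in the single-file version
    for df in pd.read_csv(file, sep=r'\s', header=None, chunksize=6):
        df.reset_index(drop=True, inplace=True)
        df.fillna('', inplace=True)
        d = pd.DataFrame([df.loc[3, 0], df.loc[3, 1],
                          ' '.join(df.loc[3, 4:8]),
                          ' '.join(df.loc[4, 4:6]),
                          ' '.join(df.loc[5, 4:])])
        d.T.to_csv('out9.csv', index=False, header=False, mode='a', sep=';')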