I am running a loop to open and modify a set of files in a directory using pandas. I am testing on a subset of 10 files, and one of them is somehow transposing onto the other and I have no idea why. I have a column for filename and it shows the correct file, but the data is from the other file. It's only this file and I can't figure out why. In the end I get a concatenated dataset where a subset of rows are identical except for the "filename" column. It seems to be happening before line 8 (the df2.to_excel call), because that output file has the incorrect info as well. The source files are indeed different and the names of the files are not the same.
Thank you for any help!
for filename in os.listdir(directory):
    if filename.endswith(".xlsx"):
        df = pd.read_excel(filename, header=None)
        for i, row in df.iterrows():
            if row.notnull().all():
                df2 = df.iloc[(i + 1):].reset_index(drop=True)
                df2.columns = list(df.iloc[i])
        df2.to_excel(filename + "test.xlsx", index=filename)

all_filenames = glob.glob(os.path.join(directory, '*test2.xlsx'))
CAT = pd.concat([pd.read_excel(f) for f in all_filenames], ignore_index=True, sort=False)
CAT.pop("Unnamed: 0")
CAT.to_excel("All_DF.xlsx", index=filename)
CAT.to_csv("All_DF.csv", index=filename)
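One likely cause, sketched under the assumption that the indentation above matches your real script: df2 is only reassigned when a fully non-null row is found, so a file that contains no such row silently reuses the df2 left over from the previous file, and the to_excel call then writes the previous file's data under the new filename. Resetting df2 at the top of each iteration makes that failure explicit:

for filename in os.listdir(directory):
    if filename.endswith(".xlsx"):
        df = pd.read_excel(os.path.join(directory, filename), header=None)
        df2 = None  # reset so a stale frame from the previous file cannot leak through
        for i, row in df.iterrows():
            if row.notnull().all():
                df2 = df.iloc[(i + 1):].reset_index(drop=True)
                df2.columns = list(df.iloc[i])
                break  # assumption: the first fully non-null row is the header row
        if df2 is None:
            print("no header row found in", filename)
            continue
        df2.to_excel(filename + "test.xlsx", index=False)

Note that os.path.join(directory, filename) is also safer than the bare filename, since read_excel otherwise resolves the name against the current working directory rather than against directory.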
I'm trying to combine a bunch of CSVs in a folder into one using Python. Each CSV has 9 columns but no headers. When they are combined, some files' data ends up spread far to the right in the sheet, so it seems they are not combining properly.
Please see code below
## Merge Multiple 1M Rows CSV files
import os
import pandas as pd

# 1. defines path to csv files
path = "C://halfordsCSV//new//Archive1/"
# 2. creates list with files to merge based on name convention
file_list = [path + f for f in os.listdir(path) if f.startswith('greyville_po-')]
# 3. creates empty list to include the content of each file converted to pandas DF
csv_list = []
# 4. reads each (sorted) file in file_list, converts it to pandas DF and appends it to the csv_list
for file in sorted(file_list):
    csv_list.append(pd.read_csv(file).assign(File_Name=os.path.basename(file)))
# 5. merges single pandas DFs into a single DF, index is refreshed
csv_merged = pd.concat(csv_list, ignore_index=True)
# 6. Single DF is saved to the path in CSV format, without index column
csv_merged.to_csv(path + 'halfordsOrders.csv', index=False)
It should be sticking to the same number of columns. Any idea what might be going wrong?
First, please check that the separator is right for your csv files in pandas.read_csv; the default is ',' (delimiter is just an alias for sep, so only one of the two should be passed). You can set it like this, for example:
pd.read_csv("my_file_path", sep=';')
If the separator is already correct for your csv files, try cleaning the dataframes before concatenating them.
Replace:
for file in sorted(file_list):
    csv_list.append(pd.read_csv(file).assign(File_Name=os.path.basename(file)))
with:
nan_value = float("NaN")
for file in sorted(file_list):
    my_df = pd.read_csv(file)
    my_df = my_df.assign(File_Name=os.path.basename(file))  # assign returns a new frame, so keep the result
    my_df.replace("", nan_value, inplace=True)
    my_df.dropna(how='all', axis=1, inplace=True)
    csv_list.append(my_df)
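One more thing worth checking, since the question says the files have 9 columns but no header row (this snippet is only a sketch, with hypothetical column names): by default read_csv promotes the first data row of each file to a header, and when those first rows differ across files, concat ends up with mismatched column names spread sideways. Reading with explicit names avoids that:

col_names = ['col_%i' % i for i in range(9)]  # substitute the real column names
for file in sorted(file_list):
    my_df = pd.read_csv(file, header=None, names=col_names)
    my_df = my_df.assign(File_Name=os.path.basename(file))
    csv_list.append(my_df)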
I have a list of csv files (approx. 100) that I'd like to combine into one single csv file.
The list is found using
PATH_DATA_FOLDER = 'mypath/'
list_files = os.listdir(PATH_DATA_FOLDER)
for f in list_files:
    list_columns = list(pd.read_csv(os.path.join(PATH_DATA_FOLDER, f)).columns)
    df = pd.DataFrame(columns=list_columns)
    print(df)
This finds the files (just a sample, since I have more than 100 files):
['file1.csv', 'name2.csv', 'example.csv', '.DS_Store']
This, unfortunately, includes also hidden files, that I'd like to exclude.
Each file has the same columns:
Columns: [Name, Surname, Country]
I'd like to find a way to create one unique file with all these fields, plus information of the original file (e.g., adding a new column with the file name).
I've tried with
df1 = pd.read_csv(os.path.join(PATH_DATA_FOLDER, f))
df1['File'] = f # file name
df = df.append(df1)
df = df.reset_index(drop=True).drop_duplicates() # I'd like to drop duplicates in both Name and Surname
but it returns a dataframe with the last entry, so I guess the problem is in the for loop.
I hope you can provide some help.
import glob
import pandas as pd

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
# combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
# drop duplicates and reset index (drop_duplicates returns a new frame, so reassign it)
combined_csv = combined_csv.drop_duplicates().reset_index(drop=True)
# save the combined file
combined_csv.to_csv("combined_csv.csv", index=False, encoding='utf-8-sig')
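The question also asks to skip hidden files like .DS_Store, to tag each row with its source file, and to drop duplicates on Name and Surname. As a sketch (the 'File' column name is my own choice), all three fit in a small extension; note the subset= argument, since a plain drop_duplicates() would never match rows across files once the File column differs:

all_filenames = [f for f in glob.glob('*.{}'.format(extension)) if not f.startswith('.')]
combined_csv = pd.concat([pd.read_csv(f).assign(File=f) for f in all_filenames])
combined_csv = combined_csv.drop_duplicates(subset=['Name', 'Surname']).reset_index(drop=True)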
Have you tried using glob?

import glob
import pandas as pd

filenames = glob.glob("mypath/*.csv")  # list of all your csv files
# collect the frames in a list and concatenate once
# (DataFrame.append was removed in pandas 2.0)
df = pd.concat([pd.read_csv(filename) for filename in filenames], ignore_index=True)
df = df.drop_duplicates().reset_index(drop=True)
Another way would be to concatenate the csv files with the cat command after removing the headers, and then read the concatenated file with pd.read_csv.
I'm trying to make a small script to automate something at my work. I have a ton of text files that I need to group into a large dataframe to plot afterwards.
The files have a general structure like this:
5.013130280 4258.0
5.039390845 4198.0
... ...
49.944957015 858.0
49.971217580 833.0
What I want to do is:
1. Keep the first column as the first column of the final dataframe (as these values are the same for all files)
2. Extract the second column of each file, normalize it, and group everything together as the rest of the dataframe
3. Use the file name as the header of each extracted column (from point 2), to use later when plotting the data
Right now I was only able to do step 2; here is the code:
import os
import glob
import pandas as pd

path = "mypath"
extension = 'xy'
os.chdir(path)
files = glob.glob(path + "/*.xy")

li_norm = []
for file in files:
    df = pd.read_csv(file, names=('angle', 'int'), delim_whitespace=True)
    df['int_n'] = df['int'] / df['int'].max()  # normalize by the column maximum
    li_norm.append(df['int_n'])
norm_files = pd.concat(li_norm, axis=1)
So is there any way to solve this in an easy way?
Assuming that all of your files have exactly the same length (# of rows) and the same angle values, you don't really need to make a bunch of dataframes and concatenate them all together.
If I'm understanding correctly, you just want a final dataframe with a new column for each file (named with the filename) holding the 'int' data, normalized with the values from only that specific file.
On the first file, you can create a dataframe to use as your final output, then just add columns to it on each subsequent file:
for idx, file in enumerate(files):
    df = pd.read_csv(file, names=('angle', 'int'), delim_whitespace=True)
    filename = file.split('\\')[-1][:-3]  # get filename by splitting the full path and dropping the last 3 characters (file extension)
    df[filename] = df['int'] / df['int'].max()  # use the filename itself as the new column name
    if idx == 0:  # create the norm_files output dataframe on the first file
        norm_files = df[['angle', filename]].copy()  # copy so later column inserts don't write into a slice of df
    else:  # add a column to norm_files for each subsequent file
        norm_files[filename] = df[filename]
You can add a calculated column quite simply, although I'm not sure if that's what you're asking.

li_norm = []
for file in files:
    df = pd.read_csv(file, names=('angle', 'int'), delim_whitespace=True)
    col = file.split('.')[0]  # filename without the extension, used as the column name
    df[col] = df['int'] / df['int'].max()
    li_norm.append(df[col])
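Putting the three points together, a minimal sketch (assuming, as above, that every file shares the same angle values): keep 'angle' once, and add one normalized, filename-titled column per file:

import glob
import os
import pandas as pd

pieces = {}
for file in sorted(glob.glob(path + "/*.xy")):
    df = pd.read_csv(file, names=('angle', 'int'), delim_whitespace=True)
    name = os.path.splitext(os.path.basename(file))[0]  # filename without extension
    pieces[name] = df['int'] / df['int'].max()          # normalize by this file's maximum

norm_files = pd.concat(pieces, axis=1)
norm_files.insert(0, 'angle', df['angle'])  # angle column from the last file read; assumed identical across files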
I'm trying to get a single dataset by merging several csv files within one folder. So I would like to merge the different files, each of which has 4 columns. I would also like to label the four columns using names=[] in pd.concat.
I'm using this code:
path = r'C:\Users\chiar\Desktop\folder' # defining the path
all_files = glob.glob(path + "/*.csv")
df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True, names=['quat_1', 'quat_2', 'quat_3', 'quat_4'])
The problem is that instead of getting 4 columns I get 25, and I don't get labeling.
Could someone tell me what I'm doing wrong? Thank you very much!
Use the parameter names in read_csv if there is no header in the files (in concat, names labels the levels of the resulting hierarchical index, not the columns, which is why your labels never showed up):
names = ['quat_1', 'quat_2', 'quat_3', 'quat_4']
df = pd.concat((pd.read_csv(f, names=names) for f in all_filenames), ignore_index=True)
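As for the 25 columns, a plausible explanation (this toy snippet uses made-up values): without names, read_csv promotes each file's first data row to its header, so files whose first rows differ contribute disjoint column sets and concat takes their union:

import io
import pandas as pd

f1 = io.StringIO("0.1,0.2,0.3,0.4\n0.5,0.6,0.7,0.8")
f2 = io.StringIO("0.9,0.8,0.7,0.6\n0.2,0.3,0.4,0.5")
bad = pd.concat([pd.read_csv(f) for f in (f1, f2)], ignore_index=True)
print(bad.shape)  # (2, 8): each file's first row became its own header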
I would like to read multiple CSV files (with a different number of columns) from a target directory into a single Python Pandas DataFrame to efficiently search and extract data.
Example file:
Events
1,0.32,0.20,0.67
2,0.94,0.19,0.14,0.21,0.94
3,0.32,0.20,0.64,0.32
4,0.87,0.13,0.61,0.54,0.25,0.43
5,0.62,0.21,0.77,0.44,0.16
Here is what I have so far:
# get a list of all csv files in target directory
my_dir = "C:\\Data\\"
filelist = []
os.chdir(my_dir)
for files in glob.glob("*.csv"):
    filelist.append(files)

# read each csv file into single dataframe and add a filename reference column
# (i.e. file1, file2, file 3) for each file read
df = pd.DataFrame()
columns = range(1, 100)
for c, f in enumerate(filelist):
    key = "file%i" % c
    frame = pd.read_csv((my_dir + f), skiprows=1, index_col=0, names=columns)
    frame['key'] = key
    df = df.append(frame, ignore_index=True)
(the indexing isn't working properly)
Essentially, the script below is exactly what I want (tried and tested), but it needs to be looped over 10 or more csv files:

df1 = pd.DataFrame()
df2 = pd.DataFrame()
columns = range(1, 100)
df1 = pd.read_csv("C:\\Data\\Currambene_001y09h00m_events.csv",
                  skiprows=1, index_col=0, names=columns)
df2 = pd.read_csv("C:\\Data\\Currambene_001y12h00m_events.csv",
                  skiprows=1, index_col=0, names=columns)
keys = [('file1'), ('file2')]
df = pd.concat([df1, df2], keys=keys, names=['fileno'])
I have found many related links, however I am still not able to get this to work:
Reading Multiple CSV Files into Python Pandas Dataframe
Merge of multiple data frames of different number of columns into one big data frame
Import multiple csv files into pandas and concatenate into one DataFrame
You need to decide along which axis you want to append your files. Pandas will always try to do the right thing by:
Assuming that each column from each file is different, and appending digits to columns with similar names across files if necessary, so that they don't get mixed;
Placing items that belong to the same row index across files side by side, under their respective columns.
The trick to appending efficiently is to tip the files sideways, so you get the desired behaviour to match what pandas.concat will be doing. This is my recipe:

import glob
import pandas as pd

files = sorted(glob.glob("*.csv"))  # portable replacement for the IPython !ls trick
d = pd.concat([pd.read_csv(f, index_col=0, header=None).T for f in files], keys=files)

Notice that each file is transposed with .T before concatenating (read_csv itself has no axis argument), so the frames are effectively joined along the column axis while their names are preserved. If you need, you can transpose the resulting DataFrame back with d.T.
EDIT:
For a different number of columns in each source file, you'll need to let the parser accept the longest row. I understand you don't have a header in your source files, so let's create one with a simple function:

def reader(f):
    # names=range(100) pads ragged rows out to a fixed width so read_csv can tokenize them
    # (assumption: no row has more than 100 fields; widen if needed)
    d = pd.read_csv(f, index_col=0, header=None, names=range(100))
    d.columns = range(d.shape[1])
    return d

df = pd.concat([reader(f) for f in files], keys=files)
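For completeness, a sketch of how the asker's own two-file version generalizes directly, reusing their skiprows=1 and names=range(1, 100) padding and the 'fileN' keys from their example (the dropna at the end is my addition, to trim pad columns that stayed empty everywhere):

import glob
import pandas as pd

files = sorted(glob.glob("C:\\Data\\*.csv"))
columns = range(1, 100)  # wide enough to pad the longest row, as in the question
frames = [pd.read_csv(f, skiprows=1, index_col=0, names=columns) for f in files]
keys = ["file%i" % (i + 1) for i in range(len(frames))]
df = pd.concat(frames, keys=keys, names=['fileno'])
df = df.dropna(axis=1, how='all')  # drop padding columns that are empty in every file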