Combining CSVs with Python issue - python

I'm trying to combine a bunch of CSVs in a folder into one using Python. Each CSV has 9 columns but no headers. When they combine, some 'sheets' are spread far to the right in the sheet. So it seems they are not combining properly.
Please see code below
## Merge Multiple 1M Rows CSV files
import os
import pandas as pd
# 1. defines path to csv files
path = "C://halfordsCSV//new//Archive1/"
# 2. creates list with files to merge based on name convention
file_list = [path + f for f in os.listdir(path) if f.startswith('greyville_po-')]
# 3. creates empty list to include the content of each file converted to pandas DF
csv_list = []
# 4. reads each (sorted) file in file_list, converts it to pandas DF and appends it to the csv_list
for file in sorted(file_list):
    csv_list.append(pd.read_csv(file).assign(File_Name = os.path.basename(file)))
# 5. merges single pandas DFs into a single DF, index is refreshed
csv_merged = pd.concat(csv_list, ignore_index=True)
# 6. Single DF is saved to the path in CSV format, without index column
csv_merged.to_csv(path + 'halfordsOrders.csv', index=False)
It should be sticking to the same number of columns. Any idea what might be going wrong?

First, check that the separator is right in pandas.read_csv. The default is ','. Note that sep and delimiter are aliases for the same option, so pass only one of them, for example:
pandas.read_csv("my_file_path", sep=';')
If the separator already matches your CSV files, try cleaning the dataframes before concatenating them.
Replace:
for file in sorted(file_list):
    csv_list.append(pd.read_csv(file).assign(File_Name = os.path.basename(file)))
with:
nan_value = float("NaN")
for file in sorted(file_list):
    my_df = pd.read_csv(file)
    # assign returns a new DataFrame, so the result has to be kept
    my_df = my_df.assign(File_Name = os.path.basename(file))
    my_df.replace("", nan_value, inplace=True)
    my_df.dropna(how='all', axis=1, inplace=True)
    csv_list.append(my_df)
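Since the source files have no header row, another likely cause is that pd.read_csv treats the first data row of each file as column names by default, so every file ends up with different column labels and pd.concat lines them up side by side. A minimal sketch of reading with header=None, assuming nine columns per file; the col_1 ... col_9 names, the helper name and the demo files are made up for illustration:

```python
import os
import tempfile

import pandas as pd


def combine_headerless_csvs(folder, prefix, n_cols=9):
    """Concatenate header-less CSVs in `folder` whose names start with `prefix`."""
    names = [f"col_{i}" for i in range(1, n_cols + 1)]  # placeholder column names
    frames = []
    for name in sorted(os.listdir(folder)):
        if not name.startswith(prefix):
            continue
        # header=None: the first line is data, not column names
        df = pd.read_csv(os.path.join(folder, name), header=None, names=names)
        frames.append(df.assign(File_Name=name))
    return pd.concat(frames, ignore_index=True)


# Demo with two tiny made-up files (2 rows x 9 columns each)
with tempfile.TemporaryDirectory() as tmp:
    for i in (1, 2):
        with open(os.path.join(tmp, f"greyville_po-{i}.csv"), "w") as fh:
            fh.write("a,b,c,d,e,f,g,h,9\n" * 2)
    combined = combine_headerless_csvs(tmp, "greyville_po-")

print(combined.shape)
```

With the same column names on every frame, the concatenated result keeps nine data columns plus File_Name instead of growing to the right.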

Related

How to read multiple csv files with specific name from a folder and merge them?

I am trying to read multiple files with specific names (1.car.csv, 2.car.csv and so on) from a folder, add a new label column at the far right of each dataset on every iteration, and merge all the CSV files into one CSV file. As the ".car.csv" part is constant, I think I can use a for loop with the .format(index) function to run over the CSV files. All of the CSV files have the same attributes.
Kindly help me!
glob is used to get all files in the folder that match the pattern *.csv
pd.read_csv is used to read each file as a DataFrame
index_col=None tells pandas not to use any of the columns as the index and to create a default index for the DataFrame instead.
header=0 tells pandas to use the first row of the CSV file as the header row.
pd.concat is used to merge all the DataFrames into a single DataFrame, merged_df.
axis=0 means that the concatenation happens along the rows (vertically).
ignore_index=True discards the original indices of the individual DataFrames and creates a new default index for the resulting DataFrame.
import glob
import pandas as pd
path = r'<path to folder containing csv files>'
all_files = glob.glob(path + "/*.csv")
lst = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    lst.append(df)
merged_df = pd.concat(lst, axis=0, ignore_index=True)
This can be easily done with a CSV tool like miller:
mlr --csv cat --filename bla1.csv *.car.csv
This will concatenate the files (without repeating the header) and prepend the filename as the first column.
You can use the pandas library this way:
import pandas as pd
import os
# path to folder where the csv files are stored
path = '/path/to/folder'
result = pd.DataFrame()
for i in range(1, n+1):
    filename = "{}.car.csv".format(i)
    file_path = os.path.join(path, filename)
    df = pd.read_csv(file_path)
    df['new_label'] = i
    result = pd.concat([result, df], ignore_index=True)
result.to_csv('final_result.csv', index=False)
The n in the code above should be replaced with the number of csv files you have in the folder.
If you need any explanation of the code (in case you're new to python or dataframes) just comment below.
Using pathlib and pandas, you can use .assign() to add the new column and finally pd.concat() to concatenate all the files into one.
from pathlib import Path
import pandas as pd
input_path = Path("path/to/car/files/").glob("*car.csv")
output_path = "path/to/output"
pd.concat(
    (pd.read_csv(x).assign(new_label="new data") for x in input_path), ignore_index=True
).to_csv(f"{output_path}/final.csv", index=False)
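If the label should differ per file rather than being a constant, one option is to derive it from the file name. A small sketch under that assumption; the helper name merge_with_labels, the column names a and b, and the demo files are hypothetical:

```python
import tempfile
from pathlib import Path

import pandas as pd


def merge_with_labels(folder, pattern="*.car.csv"):
    """Read every matching CSV and tag its rows with the leading part of the file name."""
    frames = []
    for csv_path in sorted(Path(folder).glob(pattern)):
        label = csv_path.name.split(".")[0]  # "1.car.csv" -> "1"
        frames.append(pd.read_csv(csv_path).assign(new_label=label))
    return pd.concat(frames, ignore_index=True)


# Demo with two made-up files that share the same header
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "1.car.csv").write_text("a,b\n1,2\n")
    Path(tmp, "2.car.csv").write_text("a,b\n3,4\n")
    merged = merge_with_labels(tmp)

print(merged)
```

Note that sorted() orders the paths as text, so with ten or more files "10.car.csv" would come before "2.car.csv"; the labels themselves stay correct either way.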

Reading, calculate and group data of several files with pandas

I'm trying to make a small script to automate something at my work. I have a ton of text files that I need to group into a large dataframe to plot after.
The files have this general structure:
5.013130280 4258.0
5.039390845 4198.0
... ...
49.944957015 858.0
49.971217580 833.0
What I want to do is:
1. Keep the first column as the first column of the final dataframe (as these values are the same for all files).
2. Build the rest of the dataframe by extracting the second column of each file, normalizing it and grouping everything together.
3. Use the file name as the header for the extracted column (from point to), to use later when plotting the data.
Right now I was only able to do step 2; here is the code:
import os
import pandas as pd
import glob
path = "mypath"
extension = 'xy'
os.chdir(path)
dir = os.listdir(path)
files = glob.glob(path + "/*.xy")
li_norm = []
for file in files:
    df = pd.read_csv(file, names=('angle','int'), delim_whitespace=True)
    df['int_n'] = df['int'] / df['int'].max()
    li_norm.append(df['int_n'])
norm_files = pd.concat(li_norm, axis = 1)
So is there any way to solve this in an easy way?
Assuming that all of your files have exactly the same length (# of rows) and values for angles, then you don't really need to make a bunch of dataframes and concatenate them all together.
If I'm understanding correctly, you just want a final dataframe with a new column for each file (named with the filename) holding the 'int' data, normalized using only the values from that specific file.
On the first file, you can create a dataframe to use as your final output, then just add a column to it for each subsequent file:
for idx, file in enumerate(files):
    df = pd.read_csv(file, names=('angle','int'), delim_whitespace=True)
    filename = os.path.splitext(os.path.basename(file))[0]  # file name without directory or extension
    df[filename] = df['int'] / df['int'].max()  # use the filename itself as the new column name
    if idx == 0:  # create norm_files output dataframe on first file
        norm_files = df[['angle', filename]].copy()
    else:  # add column to norm_files for all subsequent files
        norm_files[filename] = df[filename]
You can add a calculated column quite simply, although I'm not sure if that's what you're asking.
for file in files:
    df = pd.read_csv(file, names=('angle','int'), delim_whitespace=True)
    col = file.split('.')[0]  # file path without the extension
    df[col] = df['int'] / df['int'].max()
    li_norm.append(df[col])
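The three steps from the question can also be sketched in one helper that collects the normalized columns in a dict keyed by file name. This assumes every file has the same angle values in the same order; the function name normalized_table and the demo files are made up, and sep=r"\s+" is used here in place of delim_whitespace=True:

```python
import tempfile
from pathlib import Path

import pandas as pd


def normalized_table(files):
    """Return the shared 'angle' column plus one max-normalized column per file."""
    angle = None
    columns = {}
    for file in files:
        df = pd.read_csv(file, names=["angle", "int"], sep=r"\s+")
        if angle is None:
            angle = df["angle"]  # taken from the first file only
        columns[Path(file).stem] = df["int"] / df["int"].max()
    return pd.concat([angle, pd.DataFrame(columns)], axis=1)


# Demo with two tiny made-up .xy files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.xy").write_text("1.0 10.0\n2.0 20.0\n")
    Path(tmp, "b.xy").write_text("1.0 5.0\n2.0 2.5\n")
    norm_files = normalized_table(sorted(Path(tmp).glob("*.xy")))

print(norm_files)
```

Each column is named after its file (without the extension), which makes it straightforward to use as a legend label when plotting.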

How to merge more csv files in Python?

I am trying to merge all found csv files in a given directory. The problem is that all csv files have almost the same header, only one column differs. I want to add that column from all csv files to the merged csv file(and also 4 common columns for all csv).
So far, I have this:
import pandas as pd
from glob import glob
interesting_files = glob("C:/Users/iulyd/Downloads/*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list, sort=False)
full_df.to_csv("C:/Users/iulyd/Downloads/merged_pands.csv", index=False)
With this code I managed to merge all csv files, but the problem is that some columns are empty in the first "n" rows, and only after some rows they get their proper values(from the respective csv). How can I make the values begin normally, after the column header?
You probably just need to specify the names of the columns to read:
import pandas as pd
from glob import glob
interesting_files = glob("D:/PYTHON/csv/*.csv")
df_list = []
for filename in sorted(interesting_files):
    print(filename)
    #time,latitude,longitude
    df_list.append(pd.read_csv(filename, usecols=["time", "latitude", "longitude", "altitude"]))
full_df = pd.concat(df_list, sort=False)
print(full_df.head(10))
full_df.to_csv("D:/PYTHON/csv/mege.csv", index=False)
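Note that usecols keeps only the listed columns, so the one column that differs per file is dropped. If the goal is to keep each file's extra column next to the common ones, an alternative is to join the frames on the common columns instead of stacking them. A rough sketch, assuming the four common columns are the ones named above and that they identify a row; the extra columns speed and heading and the sample values are invented:

```python
from functools import reduce

import pandas as pd


def merge_on_common(frames, keys):
    """Outer-join the frames on the shared key columns, one frame at a time."""
    return reduce(lambda left, right: pd.merge(left, right, on=keys, how="outer"), frames)


keys = ["time", "latitude", "longitude", "altitude"]  # assumed common columns

# Two made-up frames standing in for two CSV files
a = pd.DataFrame({"time": [1, 2], "latitude": [10.0, 10.1], "longitude": [20.0, 20.1],
                  "altitude": [5, 6], "speed": [3.0, 4.0]})
b = pd.DataFrame({"time": [1, 2], "latitude": [10.0, 10.1], "longitude": [20.0, 20.1],
                  "altitude": [5, 6], "heading": [90, 95]})

merged = merge_on_common([a, b], keys)
print(merged)
```

Rows whose key values match end up on the same line, so the per-file columns start directly under the header instead of after a block of empty cells.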

writing a header with 125,000+ columns and two rows

Excel cannot handle that many columns (a worksheet is limited to 16,384 columns). I am trying to write 125,000 columns in the following format:
O1
MA1
MI1
C1
V1
...
O125000
MA125000
MI125000
C125000
V125000
import pandas as pd
def formatting(i):
    return tuple(map(lambda x: x+str(i), ("O", "MA", "MI", "C", "V")))
l = []
for i in range(1, 125001):
    l.extend(formatting(i))
f = pd.read_csv('file.csv')
f.columns = l
f.to_csv('new_file.csv')
I tried coding this script, but it's too slow and prone to errors. However, you can get the idea of what I am trying to do from it.
The script I currently use to generate a CSV (one that contains 2 rows and 125,000+ columns) is the following:
import pandas as pd
import glob
allfiles = glob.glob('*.csv')
index = 0
def testing(file):
    #file = file.loc[:,'Open':'Volume']
    file = file.values.reshape(1, -1)
    return file
for _fileT in allfiles:
    nFile = pd.read_csv(_fileT, header=0, usecols=range(1,6))
    fFile = testing(nFile)
    df = pd.DataFrame(fFile)
    new_df = df.iloc[:125279]
    new_df = new_df.shift(1, axis=1)
    new_df.to_csv('HeadCSV/FinalCSV.csv', mode='a', index=False, header=0)
This script reads two CSV files in the directory and aggregates them into one file. However, how can I make sure that it writes the header mentioned above and labels the two rows it prints out?
I'd basically like to combine these two scripts in the most logical way possible.
The idea is to write the header, then get all the data from the files into the dataframe, then do the row indexing as mentioned, and finally write it all to a CSV.
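Those steps could look roughly like the following. This is only a sketch on tiny in-memory stand-ins for the input files; the helper names make_header and flatten_to_row and the row labels file_a and file_b are hypothetical, and it assumes every file flattens to the same number of values:

```python
import pandas as pd


def make_header(n_groups):
    """Return ['O1', 'MA1', 'MI1', 'C1', 'V1', 'O2', ...] for n_groups groups."""
    prefixes = ("O", "MA", "MI", "C", "V")
    return [f"{p}{i}" for i in range(1, n_groups + 1) for p in prefixes]


def flatten_to_row(frame):
    """Flatten a (rows x 5) frame into a single list, row by row."""
    return frame.to_numpy().reshape(-1).tolist()


# Stand-ins for two input files with 2 rows x 5 columns each
frames = {
    "file_a": pd.DataFrame([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]),
    "file_b": pd.DataFrame([[11, 12, 13, 14, 15], [16, 17, 18, 19, 20]]),
}

wide = pd.DataFrame(
    [flatten_to_row(f) for f in frames.values()],
    index=list(frames),        # one labelled row per file
    columns=make_header(2),    # 2 groups -> 10 column names
)
print(wide)
```

Writing the result once with wide.to_csv(...) then puts the header and the row labels in the file in a single step, instead of appending header-less rows file by file.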

Pandas: import multiple csv files into dataframe using a loop and hierarchical indexing

I would like to read multiple CSV files (with a different number of columns) from a target directory into a single Python Pandas DataFrame to efficiently search and extract data.
Example file:
Events
1,0.32,0.20,0.67
2,0.94,0.19,0.14,0.21,0.94
3,0.32,0.20,0.64,0.32
4,0.87,0.13,0.61,0.54,0.25,0.43
5,0.62,0.21,0.77,0.44,0.16
Here is what I have so far:
# get a list of all csv files in target directory
my_dir = "C:\\Data\\"
filelist = []
os.chdir( my_dir )
for files in glob.glob( "*.csv" ) :
    filelist.append(files)
# read each csv file into single dataframe and add a filename reference column
# (i.e. file1, file2, file 3) for each file read
df = pd.DataFrame()
columns = range(1,100)
for c, f in enumerate(filelist) :
    key = "file%i" % c
    frame = pd.read_csv( (my_dir + f), skiprows = 1, index_col=0, names=columns )
    frame['key'] = key
    df = df.append(frame,ignore_index=True)
(the indexing isn't working properly)
Essentially, the script below is exactly what I want (tried and tested) but needs to be looped through 10 or more csv files:
df1 = pd.DataFrame()
df2 = pd.DataFrame()
columns = range(1,100)
df1 = pd.read_csv("C:\\Data\\Currambene_001y09h00m_events.csv",
                  skiprows = 1, index_col=0, names=columns)
df2 = pd.read_csv("C:\\Data\\Currambene_001y12h00m_events.csv",
                  skiprows = 1, index_col=0, names=columns)
keys = [('file1'), ('file2')]
df = pd.concat([df1, df2], keys=keys, names=['fileno'])
I have found many related links, however I am still not able to get this to work:
Reading Multiple CSV Files into Python Pandas Dataframe
Merge of multiple data frames of different number of columns into one big data frame
Import multiple csv files into pandas and concatenate into one DataFrame
You need to decide in what axis you want to append your files. Pandas will always try to do the right thing by:
Assuming that each column from each file is different, and appending digits to columns with similar names across files if necessary, so that they don't get mixed;
Items that belong to the same row index across files are placed side by side, under their respective columns.
The trick to appending efficiently is to tip the files sideways, so you get the desired behaviour to match what pandas.concat will be doing. This is my recipe:
from glob import glob
from pandas import concat, read_csv
files = sorted(glob("*.csv"))
d = concat([read_csv(f, index_col=0, header=None).T for f in files], keys=files)
Notice that each frame is transposed with .T after reading (read_csv itself has no axis argument), so it will be concatenated on the column axis, preserving its names. If you need, you can transpose the resulting DataFrame back with d.T.
EDIT:
For different number of columns in each source file, you'll need to supply a header. I understand you don't have a header in your source files, so let's create one with a simple function:
def reader(f):
    d = read_csv(f, index_col=0, header=None)  # read_csv has no axis argument
    d.columns = range(d.shape[1])
    return d
df = concat([reader(f) for f in files], keys=files)
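For files whose rows have different lengths, as in the question's example, the two-file script from the question can be turned into a loop that builds the list of frames and the list of keys together. A sketch under the question's own assumptions (one title line to skip, fewer than 100 fields per row); the helper name read_events and the sample file contents are made up:

```python
import os
import tempfile

import pandas as pd

columns = list(range(1, 100))  # generous upper bound on fields per row, as in the question


def read_events(path):
    """Read one ragged events file; short rows are padded with NaN."""
    return pd.read_csv(path, skiprows=1, index_col=0, names=columns)


# Demo with two small made-up files that follow the example layout
samples = {
    "a_events.csv": "Events\n1,0.32,0.20,0.67\n2,0.94,0.19,0.14,0.21,0.94\n",
    "b_events.csv": "Events\n1,0.87,0.13,0.61,0.54,0.25,0.43\n2,0.62,0.21\n",
}
with tempfile.TemporaryDirectory() as tmp:
    frames = []
    keys = []
    for c, (name, text) in enumerate(sorted(samples.items()), start=1):
        path = os.path.join(tmp, name)
        with open(path, "w") as fh:
            fh.write(text)
        frames.append(read_events(path))
        keys.append(f"file{c}")

df = pd.concat(frames, keys=keys, names=["fileno"])
df = df.dropna(axis=1, how="all")  # drop the unused padding columns
print(df.loc["file2"])
```

The keys become the outer level of a hierarchical index, so df.loc["file2"] selects all rows that came from the second file.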
