I have two data frames with the same structure, each stored in a CSV file. I want to read both CSVs and merge them into one bigger data frame. The directory contains only these two files.
The first CSV is called "first":
ad 7 8
as 5 8
ty 9 y
The second CSV is called "second":
ewtw 5 2
as 1 2
ty 4 9
My code is:
import os
import pandas as pd
targetdir = "C:/Documents and Settings/USER01/Mis documentos/experpy"
filelist = os.listdir(targetdir)
for file in filelist:
    df_csv = pd.read_csv(file)
big_df = pd.concat(df_csv)
Unfortunately, it didn’t work. How can I fix that?
If you are only going to have two CSVs, you may just want to use pd.merge:
first = pd.read_csv('first.csv')  # insert your file path
second = pd.read_csv('second.csv')
big_df = pd.merge(first, second, how='outer')  # union of first and second
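One caveat worth checking before choosing between the two: pd.merge joins rows on key columns, while pd.concat stacks rows. The sketch below uses made-up column names (name, a, b are assumptions, since the CSVs in the question show no header) and made-up numeric values, just to show how the two results differ in shape.

```python
import pandas as pd

# Hypothetical frames standing in for first.csv and second.csv
first = pd.DataFrame({"name": ["ad", "as", "ty"], "a": [7, 5, 9], "b": [8, 8, 1]})
second = pd.DataFrame({"name": ["ewtw", "as", "ty"], "a": [5, 1, 4], "b": [2, 2, 9]})

# Stacking: every row of both frames is kept, one under the other
stacked = pd.concat([first, second], ignore_index=True)

# Joining on a key: rows with the same name are combined side by side,
# and the overlapping columns get _x / _y suffixes
merged = pd.merge(first, second, on="name", how="outer")

print(stacked.shape)  # (6, 3)
print(merged.shape)   # (4, 5)
```

If the goal is simply "one bigger data frame" with the same columns, stacking with pd.concat is probably the closer fit.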
concat takes a list or dict of Series or DataFrame objects: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.tools.merge.concat.html, so what you can do is make a list of the dataframes and concat them all together to make your big df:
filelist = os.listdir(targetdir)
df_list = []
for file in filelist:
    # os.listdir returns bare file names, so join them with the directory
    df_list.append(pd.read_csv(os.path.join(targetdir, file)))
big_df = pd.concat(df_list, ignore_index=True)
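Alternatively, you can grow the frame one file at a time (DataFrame.append was removed in pandas 2.0, so this uses pd.concat inside the loop instead):
filelist = os.listdir(targetdir)
big_df = pd.DataFrame()
for file in filelist:
    big_df = pd.concat([big_df, pd.read_csv(os.path.join(targetdir, file))], ignore_index=True)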
I think you should change your path to this:
targetdir = r'C:\Documents and Settings\USER01\Mis documentos\experpy'
The above uses a raw string, which keeps the backslashes in a Windows path from being parsed as escape sequences.
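To see why this matters, here is a small sketch with a made-up path (C:\temp\new is only an illustration, not the path from the question). In a normal string literal, \t and \n are read as a tab and a newline, while the raw string keeps the backslashes as written. Forward slashes, as in the original question, are also accepted on Windows.

```python
from pathlib import PureWindowsPath

plain = "C:\temp\new"   # \t becomes a tab, \n becomes a newline
raw = r"C:\temp\new"    # backslashes are kept literally

print(len(plain), len(raw))  # 9 11
print("\t" in plain, "\t" in raw)  # True False

# Forward slashes describe the same Windows path
same = PureWindowsPath("C:/temp/new") == PureWindowsPath(raw)
print(same)  # True
```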
I am trying to read multiple files from a folder that follow a specific naming pattern (1.car.csv, 2.car.csv and so on), add a new label column at the far right of each dataset on every iteration, and merge all the CSV files into one CSV file. Since ".car.csv" is constant, I think I can use a for loop with .format(index) to run over the CSV files. All of the CSV files have the same attributes.
Kindly help me!
glob is used to get all files in the folder that match the pattern *.csv.
pd.read_csv is used to read each file as a DataFrame.
index_col=None tells pandas not to use any of the columns as the index and to create a default index for the DataFrame instead.
header=0 tells pandas to use the first row of the CSV file as the header row.
pd.concat is used to merge all the DataFrames into a single DataFrame, merged_df.
axis=0 means that the concatenation happens along the rows (vertically).
ignore_index=True discards the original indices of the individual DataFrames and creates a new default index for the resulting DataFrame.
import glob
import pandas as pd
path = r'<path to folder containing csv files>'
all_files = glob.glob(path + "/*.csv")
lst = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    lst.append(df)
merged_df = pd.concat(lst, axis=0, ignore_index=True)
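The question also asks for a label per file, which the snippet above does not add. Here is a minimal sketch of one way to do it, assuming the label should be the number before the first dot in the file name. The helper merge_with_label, its label_col parameter, and the sample "speed" column are all made up for illustration; the temporary folder only exists so the example can run on its own.

```python
import glob
import os
import tempfile

import pandas as pd


def merge_with_label(folder, pattern="*.car.csv", label_col="new_label"):
    """Read every matching file, tag its rows with the file's number, stack them."""
    frames = []
    for filename in sorted(glob.glob(os.path.join(folder, pattern))):
        # "3.car.csv" -> "3" (assumes the label is the part before the first dot)
        label = os.path.basename(filename).split(".")[0]
        frames.append(pd.read_csv(filename).assign(**{label_col: label}))
    return pd.concat(frames, axis=0, ignore_index=True)


# Build two tiny sample files so the sketch is self-contained
with tempfile.TemporaryDirectory() as folder:
    for i in (1, 2):
        sample = pd.DataFrame({"speed": [10 * i, 20 * i]})
        sample.to_csv(os.path.join(folder, f"{i}.car.csv"), index=False)
    merged_df = merge_with_label(folder)

print(merged_df)
```

The new column is added last by assign, so it ends up at the far right of the merged frame.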
This can be easily done with a CSV tool like miller:
mlr --csv cat --filename bla1.csv *.car.csv
This will concatenate the files (without repeating the header) and prepend the filename as the first column.
You can use the pandas library this way:
import pandas as pd
import os
# path to folder where the csv files are stored
path = '/path/to/folder'
result = pd.DataFrame()
for i in range(1, n+1):
    filename = "{}.car.csv".format(i)
    file_path = os.path.join(path, filename)
    df = pd.read_csv(file_path)
    df['new_label'] = i
    result = pd.concat([result, df], ignore_index=True)
result.to_csv('final_result.csv', index=False)
The n in the code above should be replaced with the number of csv files you have in the folder.
If you need any explanation of the code (in case you're new to python or dataframes) just comment below.
Using pathlib and pandas, you can use .assign() to add the new column and then pd.concat() to concatenate all the files into one.
from pathlib import Path
import pandas as pd
input_path = Path("path/to/car/files/").glob("*car.csv")
output_path = "path/to/output"
pd.concat(
    (pd.read_csv(x).assign(new_label="new data") for x in input_path),
    ignore_index=True,
).to_csv(f"{output_path}/final.csv", index=False)
I'm trying to combine a bunch of CSVs in a folder into one using Python. Each CSV has 9 columns but no headers. After combining, the rows from some files are shifted far to the right in the resulting sheet, so it seems they are not being combined properly.
Please see code below
## Merge Multiple 1M Rows CSV files
import os
import pandas as pd
# 1. defines path to csv files
path = "C://halfordsCSV//new//Archive1/"
# 2. creates list with files to merge based on name convention
file_list = [path + f for f in os.listdir(path) if f.startswith('greyville_po-')]
# 3. creates empty list to include the content of each file converted to pandas DF
csv_list = []
# 4. reads each (sorted) file in file_list, converts it to pandas DF and appends it to the csv_list
for file in sorted(file_list):
    csv_list.append(pd.read_csv(file).assign(File_Name = os.path.basename(file)))
# 5. merges single pandas DFs into a single DF, index is refreshed
csv_merged = pd.concat(csv_list, ignore_index=True)
# 6. Single DF is saved to the path in CSV format, without index column
csv_merged.to_csv(path + 'halfordsOrders.csv', index=False)
It should be sticking to the same number of columns. Any idea what might be going wrong?
First, please check that the separator passed to pandas.read_csv matches your files. The default for sep is ','; delimiter is only an alias for sep, so pass just one of the two. For example:
pandas.read_csv("my_file_path", sep=';')
If the separator already matches your CSV files, try cleaning the dataframes before concatenating them.
Replace:
for file in sorted(file_list):
    csv_list.append(pd.read_csv(file).assign(File_Name = os.path.basename(file)))
with:
nan_value = float("NaN")
for file in sorted(file_list):
    my_df = pd.read_csv(file)
    # assign returns a new DataFrame, so keep the result
    my_df = my_df.assign(File_Name = os.path.basename(file))
    my_df.replace("", nan_value, inplace=True)
    my_df.dropna(how='all', axis=1, inplace=True)
    csv_list.append(my_df)
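Another possible cause, given that the files have no header row: by default read_csv treats the first data row of each file as the column names, so every file ends up with different column names and pd.concat places them side by side instead of stacking them. The sketch below uses made-up inline data (the real files are not shown in the question) to compare the default with header=None.

```python
import io

import pandas as pd

# Two tiny headerless "files" (made-up values, three columns each)
csv_a = "1,2,3\n4,5,6\n"
csv_b = "7,8,9\n10,11,12\n"

# Default: the first row of each file becomes its header,
# so the two frames share no column names
spread = pd.concat(
    [pd.read_csv(io.StringIO(text)) for text in (csv_a, csv_b)],
    ignore_index=True,
)

# header=None: columns are numbered 0, 1, 2 in every file and line up
stacked = pd.concat(
    [pd.read_csv(io.StringIO(text), header=None) for text in (csv_a, csv_b)],
    ignore_index=True,
)

print(spread.shape)   # (2, 6)
print(stacked.shape)  # (4, 3)
```

If this is what is happening, adding header=None to the read_csv call in the loop may be enough on its own.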
First time poster and fairly new to Python here. I have a collection of +1,7000 csv files with 2 columns each. The number and labels of the rows are the same in every file. The files are named with a specific format. For example:
Species_1_OrderA_1.csv
Species_1_OrderA_2.csv
Species_1_OrderA_3.csv
Species_10_OrderB_1.csv
Species_10_OrderB_2.csv
Each imported dataframe is formatted like so:
TreeID Species_1_OrderA_2
0 Bu2_1201_1992 0
1 Bu3_1201_1998 0
2 Bu4_1201_2000 0
3 Bu5_1201_2002 0
4 Bu6_1201_2004 0
.. ... ...
307 Fi141_16101_2004 0
308 Fi142_16101_2006 0
309 Fi143_16101_2008 0
310 Fi144_16101_2010 0
311 Fi147_16101_2015 0
I would like to join the files that correspond to the same species, based on the first column. So, in the end, I would get the files Species_1_OrderA.csv and Species_10_OrderB.csv. Please note that all the species do not necessarily have the same number of files.
This is what I have tried so far.
import os
import glob
import pandas as pd
# Importing csv files from directory
path = '.'
extension = 'csv'
os.chdir(path)
files = glob.glob('*.{}'.format(extension))
# Create a dictionary to loop through each file to read its contents and create a dataframe
file_dict = {}
for file in files:
    key = file
    df = pd.read_csv(file)
    file_dict[key] = df
# Extract the name of each dataframe, convert to a list and extract the relevant
# information (before the 3rd underscore). Compare each of these values to the next and
# if they are the same, append them to a list. This list (in my head, at least) will help
# me merge them using pandas.concat
keys_list = list(file_dict.keys())
group = ''
for line in keys_list:
    type = "_".join(line.split("_")[:3])
    for i in range(len(type) - 1):
        if type[i] == type[i+1]:
            group.append(line[keys_list])
print(group)
However, the last bit is not even working, and at this point, I am not sure this is the best way to deal with my problem. Any pointers on how to solve this will be really appreciated.
--- EDIT:
This is the expected output for the files per species. Ideally, I would remove the rows that have zeros in them, but that can easily be done with awk.
TreeID,Species_1_OrderA_0,Species_1_OrderA_1,Species_1_OrderA_2
Bu2_1201_1992,0,0,0
Bu3_1201_1998,0,0,0
Bu4_1201_2000,0,0,0
Bu5_1201_2002,0,0,0
Bu6_1201_2004,0,0,0
Bu7_1201_2006,0,0,0
Bu8_1201_2008,0,0,0
Bu9_1201_2010,0,0,0
Bu10_1201_2012,0,0,0
Bu11_1201_2014,0,0,0
Bu14_1201_2016,0,0,0
Bu16_1201_2018,0,0,0
Bu18_3103_1989,0,0,0
Bu22_3103_1999,0,0,0
Bu23_3103_2001,0,0,0
Bu24_3103_2003,0,0,0
...
Fi141_16101_2004,0,0,10
Fi142_16101_2006,0,4,0
Fi143_16101_2008,0,0,0
Fi144_16101_2010,2,0,0
Fi147_16101_2015,0,7,0
Try it like this:
import os
import pandas as pd
path = "C:/Users/username"
files = [file for file in os.listdir(path) if file.endswith(".csv")]
dfs = dict()
for file in files:
    #everything before the final _ is the species name
    species = file.rsplit("_", maxsplit=1)[0]
    #read the csv to a dataframe
    df = pd.read_csv(os.path.join(path, file))
    #if you don't have a df for a species, create a new key
    if species not in dfs:
        dfs[species] = df
    #else, merge current df to existing df on the TreeID
    else:
        dfs[species] = pd.merge(dfs[species], df, on="TreeID", how="outer")
#write all dfs to their own csv files
#index=False keeps the row index out of the file, matching the expected output
for key in dfs:
    dfs[key].to_csv(f"{key}.csv", index=False)
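The merge step and the "remove the rows that have zeros" wish from the edit can also be done in pandas directly. This is only a sketch under assumptions: the two small frames are made up (with TreeID values borrowed from the question), and "rows that have zeros" is read here as rows where every count column is zero.

```python
from functools import reduce

import pandas as pd

# Made-up stand-ins for two files of the same species
frames = [
    pd.DataFrame({"TreeID": ["Bu2_1201_1992", "Fi142_16101_2006"],
                  "Species_1_OrderA_1": [0, 4]}),
    pd.DataFrame({"TreeID": ["Bu2_1201_1992", "Fi142_16101_2006"],
                  "Species_1_OrderA_2": [0, 0]}),
]

# Fold the list into one frame, joining on TreeID each time
merged = reduce(
    lambda left, right: pd.merge(left, right, on="TreeID", how="outer"),
    frames,
)

# Keep only rows where at least one count column is non-zero
counts = merged.drop(columns="TreeID")
non_zero = merged[(counts != 0).any(axis=1)]

print(non_zero)
```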
If your goal is to concatenate all the csv's for each species-order into a consolidated csv, this is one approach. I haven't tested it so there might be a few errors. The idea is to first use glob, as you're doing, to make a dict of file_paths so that all the file_paths of the same species-order are grouped together. Then for each species-order read in all the data into a single table in memory and then write out to a consolidated file.
import pandas as pd
import glob
#Create a dictionary keyed by species_order, valued by a list of files
#i.e. file_paths_by_species_order['Species_10_OrderB'] = ['Species_10_OrderB_1.csv', 'Species_10_OrderB_2.csv']
file_paths_by_species_order = {}
for file_path in glob.glob('*.csv'):
    #join the first three parts back into a string; a list cannot be used as a dict key
    species_order = "_".join(file_path.split("_")[:3])
    if species_order not in file_paths_by_species_order:
        file_paths_by_species_order[species_order] = [file_path]
    else:
        file_paths_by_species_order[species_order].append(file_path)
#For each species_order, concat all files and save the info into a new csv
for species_order,file_paths in file_paths_by_species_order.items():
    df = pd.concat(pd.read_csv(file_path) for file_path in file_paths)
    df.to_csv('consolidated_{}.csv'.format(species_order))
There are definitely improvements that can be made, such as using collections.defaultdict and writing each file out to the consolidated file as it is read, instead of reading them all into memory.
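As a rough sketch of the collections.defaultdict idea, the grouping step could look like this. The helper name group_by_species_order is made up, and the file names are the examples from the question; it assumes the species-order key is always the first three underscore-separated parts of the name.

```python
from collections import defaultdict


def group_by_species_order(file_names):
    """Group file names by their first three underscore-separated parts."""
    groups = defaultdict(list)
    for name in file_names:
        key = "_".join(name.split("_")[:3])
        groups[key].append(name)  # no "if key not in ..." check needed
    return dict(groups)


names = [
    "Species_1_OrderA_1.csv",
    "Species_1_OrderA_2.csv",
    "Species_1_OrderA_3.csv",
    "Species_10_OrderB_1.csv",
    "Species_10_OrderB_2.csv",
]
groups = group_by_species_order(names)
print(groups)
```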
I have around 20 XLSX files ranging from 4-10 mb.
I want to grab a certain sheet in those xlsx files and concat them into one file.
Each xlsx file is named in sequential week order, and the sheet I am trying to parse has no date, so I'm using the file name as the index and will reverse engineer a week date from it.
I am using the following code, which I use quite often to concat multiple files into one df. I am also using basename to add in the name but get the following error.
ValueError: Length mismatch: Expected axis has 461 elements, new values have 457 elements
import pandas as pd
from os.path import basename
import os
import glob
path = os.getcwd()
allFiles = glob.glob(path + "/*.xlsx")
frame = pd.DataFrame()
master_list = []
for file_ in allFiles:
    df = pd.read_excel(file_,sheet_name = "Base data",index_col=None,
                       header=0)
    df.index = [os.path.basename(f)] * len(data)
    master_list.append(df)
frame = pd.concat(master_list)
You can use a list comprehension to build the list of DataFrames, then build a list of file names and pass it as the keys parameter of concat:
dfs = [pd.read_excel(f, sheet_name="Base data",index_col=None,header=0) for f in allFiles]
keys = [os.path.basename(f) for f in allFiles]
frame = pd.concat(dfs, keys=keys)
#if want remove default index values
#frame = pd.concat(dfs, keys=keys).reset_index(level=1, drop=True)
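To show what keys does to the index, here is a small sketch with made-up frames and file names (week_01.xlsx and week_02.xlsx are placeholders, and no Excel files are read). Each frame can have a different length, because concat builds the outer index level itself.

```python
import pandas as pd

# Made-up stand-ins for two sheets of different lengths
dfs = [pd.DataFrame({"value": [1, 2]}), pd.DataFrame({"value": [3]})]
keys = ["week_01.xlsx", "week_02.xlsx"]

# Two-level index: (file name, original row number)
frame = pd.concat(dfs, keys=keys)

# Drop the inner level so only the file name remains as the index
flat = frame.reset_index(level=1, drop=True)

print(frame.index.nlevels)  # 2
print(flat.index.tolist())
```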
I would like to read multiple CSV files (with a different number of columns) from a target directory into a single Python Pandas DataFrame to efficiently search and extract data.
Example file:
Events
1,0.32,0.20,0.67
2,0.94,0.19,0.14,0.21,0.94
3,0.32,0.20,0.64,0.32
4,0.87,0.13,0.61,0.54,0.25,0.43
5,0.62,0.21,0.77,0.44,0.16
Here is what I have so far:
# get a list of all csv files in target directory
my_dir = "C:\\Data\\"
filelist = []
os.chdir( my_dir )
for files in glob.glob( "*.csv" ) :
    filelist.append(files)
# read each csv file into single dataframe and add a filename reference column
# (i.e. file1, file2, file 3) for each file read
df = pd.DataFrame()
columns = range(1,100)
for c, f in enumerate(filelist) :
    key = "file%i" % c
    frame = pd.read_csv( (my_dir + f), skiprows = 1, index_col=0, names=columns )
    frame['key'] = key
    df = df.append(frame,ignore_index=True)
(the indexing isn't working properly)
Essentially, the script below is exactly what I want (tried and tested) but needs to be looped through 10 or more csv files:
df1 = pd.DataFrame()
df2 = pd.DataFrame()
columns = range(1,100)
df1 = pd.read_csv("C:\\Data\\Currambene_001y09h00m_events.csv",
                  skiprows = 1, index_col=0, names=columns)
df2 = pd.read_csv("C:\\Data\\Currambene_001y12h00m_events.csv",
                  skiprows = 1, index_col=0, names=columns)
keys = [('file1'), ('file2')]
df = pd.concat([df1, df2], keys=keys, names=['fileno'])
I have found many related links, however I am still not able to get this to work:
Reading Multiple CSV Files into Python Pandas Dataframe
Merge of multiple data frames of different number of columns into one big data frame
Import multiple csv files into pandas and concatenate into one DataFrame
You need to decide in what axis you want to append your files. Pandas will always try to do the right thing by:
Assuming that each column from each file is different, and appending digits to columns with similar names across files if necessary, so that they don't get mixed;
Items that belong to the same row index across files are placed side by side, under their respective columns.
The trick to appending efficiently is to tip the files sideways, so you get the desired behaviour to match what pandas.concat will be doing. This is my recipe:
from pandas import *
files = !ls *.csv # IPython magic
d = concat([read_csv(f, index_col=0, header=None).T for f in files], keys=files)
Notice that each frame is transposed with .T right after it is read (read_csv itself does not accept an axis argument), so the original row labels become column names and are preserved through the concatenation. If you need, you can transpose the resulting DataFrame back with d.T.
EDIT:
For different number of columns in each source file, you'll need to supply a header. I understand you don't have a header in your source files, so let's create one with a simple function:
def reader(f):
    # read_csv has no axis argument, so it is not passed here
    d = read_csv(f, index_col=0, header=None)
    d.columns = range(d.shape[1])
    return d
df = concat([reader(f) for f in files], keys=files)