I have some code that reads all the CSV files in a certain folder and concatenates them into one Excel file. This code works as long as the CSVs have headers, but I'm wondering if there is a way to alter my code if my CSVs didn't have any headers.
Here is what works:
import glob
import pandas as pd

path = r'C:\Users\Desktop\workspace\folder'
all_files = glob.glob(path + "/*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    df = df[~df['Ran'].isin(['Active'])]
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
frame.drop_duplicates(subset=None, inplace=True)
What this is doing is deleting any row in my CSVs with the word "Active" under the "Ran" column. But if I didn't have a "Ran" header for this column, is there another way to read this and do the same thing?
Thanks in advance!
df = df[~df['Ran'].isin(['Active'])]
Instead of selecting a column by name, select it by integer position. If the 'Ran' column is the third column in the CSV, use:
df = df[~df.iloc[:,2].isin(['Active'])]
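For instance, with a small in-memory sketch (the headerless data below is made up, and the third column plays the role of 'Ran'), positional selection keeps the non-'Active' rows:

```python
import io
import pandas as pd

# Hypothetical headerless data; column 2 (the third column) holds the status.
csv = io.StringIO("1,x,Active\n2,y,Done\n3,z,Active\n")
df = pd.read_csv(csv, header=None)

# Same filter as above, but by position instead of by name.
df = df[~df.iloc[:, 2].isin(['Active'])]
print(df)
#    0  1     2
# 1  2  y  Done
```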
If some of your files have headers and some don't, then you should probably look at the first line of each file before you make a DataFrame from it.
for filename in all_files:
    with open(filename) as f:
        first = next(f).rstrip('\n').split(',')
        if first == ['my', 'list', 'of', 'headers']:
            header = 0
            names = None
        else:
            header = None
            names = ['my', 'list', 'of', 'headers']
        f.seek(0)  # rewind so read_csv sees the file from the start
        df = pd.read_csv(f, index_col=None, header=header, names=names)
    df = df[~df['Ran'].isin(['Active'])]
If I understood your question correctly ...
If the header is missing, yet you know the data format, you can pass the desired column labels as a list, such as: ['id', 'thing1', 'ran', 'other_stuff'] into the names parameter of read_csv.
Per the pandas docs:
names : array-like, optional
List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
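As an illustration of that rule, the same three-column data read both ways ends up identical (the file contents and labels below are made up):

```python
import io
import pandas as pd

col_names = ['id', 'ran', 'value']  # assumed labels, not from the docs

# File WITH a header row: pass header=0 so names= overrides it.
with_header = io.StringIO("a,b,c\n1,Active,x\n2,Done,y\n")
df1 = pd.read_csv(with_header, header=0, names=col_names)

# File WITHOUT a header row: header=None keeps row 0 as data.
no_header = io.StringIO("1,Active,x\n2,Done,y\n")
df2 = pd.read_csv(no_header, header=None, names=col_names)

print(df1.equals(df2))  # True
```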
import pandas as pd
import os
import glob

path = r'C:\Users\avira\Desktop\CC\SAIL\Merging\CISF'
files = glob.glob(os.path.join(path, '*.csv'))

combined_data = pd.DataFrame()
for file in files:
    data = pd.read_csv(file)
    print(data)
    combined_data = pd.concat([combined_data, data], axis=0, ignore_index=True)

combined_data.to_csv(r'C:\Users\avira\Desktop\CC\SAIL\Merging\CISF\data2.csv')
The files are merging diagonally, i.e. the second file begins next to the last cell of the first file. Also, it is taking the first row of each file as column names.
All of my files are without column names. How do I vertically merge my files and provide column names to the merged CSV?
For the header problem while reading the CSV, you can do this:
pd.read_csv(file, header=None)
While writing the result, you can pass a list containing the header names:
df.to_csv(file_name, header=['col1', 'col2'])
You need to read each CSV with no header and concatenate:
data = pd.read_csv(file, header=None)
combined_data = pd.concat([combined_data, data], ignore_index=True)
If you want to give the columns meaningful names:
combined_data.columns = ['name1', 'name2', 'name3']
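The snippets above can be tied together into a minimal, self-contained sketch; in-memory buffers stand in for real CSV files here, and the column names are made up:

```python
import io
import pandas as pd

# Hypothetical stand-ins for real headerless CSV files on disk.
fake_files = [io.StringIO("1,foo\n2,bar\n"), io.StringIO("3,baz\n")]

# Read each file with header=None, then stack them vertically.
frames = [pd.read_csv(f, header=None) for f in fake_files]
combined_data = pd.concat(frames, ignore_index=True)
combined_data.columns = ['id', 'label']  # meaningful names of your choice

print(combined_data)
#    id label
# 0   1   foo
# 1   2   bar
# 2   3   baz
```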
Is there a way, without reading the file twice, to check if a column exists and otherwise use the column names passed in? I have files of the same structure, but some do not contain a header for some reason.
Example with header:
Field1 Field2 Field3
data1 data2 data3
Example without header:
data1 data2 data3
When trying to use the example below, if the file has a header, the header row becomes the first data row instead of being replaced.
pd.read_csv('filename.csv', names=col_names)
When trying to use the below, it will drop the first row of data if there is no header in the file.
pd.read_csv('filename.csv', header=0, names=col_names)
My current work around is to load the file, check if the columns exist or not, then if it doesn't read the file again.
df = pd.read_csv('filename.csv')
if 'Field1' not in df.columns:
    del df
    df = pd.read_csv('filename.csv', names=col_names)
Is there a better way to handle this data set that doesn't involve potentially reading the file twice?
Just modify your logic so the first time through only reads the first row:
# Load first row and set up keyword args if necessary
kw_args = {}
first = pd.read_csv('filename.csv', nrows=1)
if 'Field1' not in first.columns:
    kw_args["names"] = col_names

# Load data
df = pd.read_csv('filename.csv', **kw_args)
You can do this with the seek method of the file object:
with open('filename.csv') as csvfile:
    headers = pd.read_csv(csvfile, nrows=0).columns.tolist()
    csvfile.seek(0)  # return the file pointer to the beginning of the file

    # do stuff here
    if 'Field1' in headers:
        ...
    else:
        ...

    df = pd.read_csv(csvfile, ...)
The file is read only once.
The first part of this question has been asked many times and the best answer I found was here: Import multiple csv files into pandas and concatenate into one DataFrame.
But what I essentially want to do is be able to add another variable to each dataframe that has participant number, such that when the files are all concatenated, I will be able to have participant identifiers.
The files are named like this:
So perhaps I could just add a column with the ucsd1, etc. to identify each participant?
Here's code that I've gotten to work for Excel files:
import glob
import pandas as pd

path = r"/Users/jamesades/desktop/Watch_data_1/Re__Personalized_MH_data_call"
all_files = glob.glob(path + "/*.xlsx")

li = []
for filename in all_files:
    df = pd.read_excel(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
If I understand you correctly, it's simple:
import re  # <-------------- Add this line

path = r"/Users/jamesades/desktop/Watch_data_1/Re__Personalized_MH_data_call"
all_files = glob.glob(path + "/*.xlsx")

li = []
for filename in all_files:
    df = pd.read_excel(filename, index_col=None, header=0)
    participant_number = int(re.search(r'(\d+)', filename).group(1))  # <-------------- Add this line
    df['participant_number'] = participant_number  # <-------------- Add this line
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
That way, each dataframe loaded from an Excel file will have a column called participant_number, and the value of that column in each row will be the number found in the filename that the dataframe was loaded from.
I am scanning a directory of text files and adding them to a Pandas dataframe:
import os
import pandas as pd

text_path = "/home/tdun0002/stash/cloud_scripts/aws_scripts/output_files/memory_stats/text/"
filelist = os.listdir(text_path)

final_df = pd.DataFrame()
for filename in filelist:
    my_file = text_path + filename
    try:
        df = pd.read_csv(my_file, delim_whitespace=True, header=None)
        final_df = final_df.append(df)
    except Exception as exc:
        print(f"Could not read {my_file}: {exc}")

pd.options.display.max_rows
print(f"\n***Full Data Frame: {df}\n***")
Each file in the directory holds the memory of a server:
bastion001-memory.txt
permissions001-memory.txt
haproxy001-memory.txt
The contents of the files look something like this:
cat haproxy001-memory.txt
7706172
On each pass of adding the file, it reports this:
Data Frame: Empty DataFrame
Columns: [7706172]
Index: []
And when I print out the full data frame it only has the last entry:
***Full Data Frame:
Empty DataFrame
Columns: [7706172]
Index: []
***
Why is it reporting that the dataframe is empty? Why is it only showing the last file that was input? I think I may need to append the data.
2 things:
You need to pass header=None to pd.read_csv so the values in the text file are treated as data; by default, pandas assumes the first row is the header.
Since you are reading multiple files, you need to append each dataframe to an accumulator. Currently you are overwriting df on each iteration and printing only the last one.
Code should be like:
text_path = "/home/tdun0002/stash/cloud_scripts/aws_scripts/output_files/memory_stats/text/"
filelist = os.listdir(text_path)

final_df = pd.DataFrame()
for filename in filelist:
    my_file = text_path + filename
    try:
        df = pd.read_csv(my_file, delim_whitespace=True, header=None)
        final_df = final_df.append(df)  # pandas < 2.0; on newer versions use pd.concat([final_df, df])
        print(f"Data Frame: {df}")
    except Exception as exc:
        print(f"Could not read {my_file}: {exc}")

pd.options.display.max_rows = None  # show all rows when printing
print(f"\n***Full Data Frame: {final_df}\n***")
I would like to read multiple CSV files (with a different number of columns) from a target directory into a single Python Pandas DataFrame to efficiently search and extract data.
Example file:
Events
1,0.32,0.20,0.67
2,0.94,0.19,0.14,0.21,0.94
3,0.32,0.20,0.64,0.32
4,0.87,0.13,0.61,0.54,0.25,0.43
5,0.62,0.21,0.77,0.44,0.16
Here is what I have so far:
# get a list of all csv files in target directory
my_dir = "C:\\Data\\"
filelist = []
os.chdir(my_dir)
for files in glob.glob("*.csv"):
    filelist.append(files)

# read each csv file into single dataframe and add a filename reference column
# (i.e. file1, file2, file3) for each file read
df = pd.DataFrame()
columns = range(1, 100)
for c, f in enumerate(filelist):
    key = "file%i" % c
    frame = pd.read_csv((my_dir + f), skiprows=1, index_col=0, names=columns)
    frame['key'] = key
    df = df.append(frame, ignore_index=True)
(the indexing isn't working properly)
Essentially, the script below is exactly what I want (tried and tested) but needs to be looped through 10 or more csv files:
df1 = pd.DataFrame()
df2 = pd.DataFrame()
columns = range(1, 100)
df1 = pd.read_csv("C:\\Data\\Currambene_001y09h00m_events.csv",
                  skiprows=1, index_col=0, names=columns)
df2 = pd.read_csv("C:\\Data\\Currambene_001y12h00m_events.csv",
                  skiprows=1, index_col=0, names=columns)
keys = ['file1', 'file2']
df = pd.concat([df1, df2], keys=keys, names=['fileno'])
I have found many related links; however, I am still not able to get this to work:
Reading Multiple CSV Files into Python Pandas Dataframe
Merge of multiple data frames of different number of columns into one big data frame
Import multiple csv files into pandas and concatenate into one DataFrame
You need to decide along which axis you want to append your files. Pandas will always try to do the right thing by:
- Assuming that each column from each file is different, and appending digits to columns with similar names across files if necessary, so that they don't get mixed;
- Placing items that belong to the same row index across files side by side, under their respective columns.
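A small made-up illustration of that row-stacking behaviour: columns are aligned by name, and columns missing from one frame are filled with NaN:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3], 'y': [4]})

# axis=0 stacks rows; columns are aligned by name, missing ones get NaN.
stacked = pd.concat([a, b], axis=0, ignore_index=True)
print(stacked)
#    x    y
# 0  1  NaN
# 1  2  NaN
# 2  3  4.0
```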
The trick to appending efficiently is to tip the files sideways, so you get the desired behaviour to match what pandas.concat will be doing. This is my recipe:
import glob

from pandas import concat, read_csv

files = glob.glob('*.csv')
d = concat([read_csv(f, index_col=0, header=None).T for f in files], keys=files)
Notice that each frame is transposed with .T right after reading (read_csv itself has no axis parameter), so the frames are concatenated along what was the column axis, preserving its names. If you need, you can transpose the resulting DataFrame back with d.T.
EDIT:
For different number of columns in each source file, you'll need to supply a header. I understand you don't have a header in your source files, so let's create one with a simple function:
def reader(f):
    d = read_csv(f, index_col=0, header=None).T
    d.columns = range(d.shape[1])
    return d

df = concat([reader(f) for f in files], keys=files)