I am scanning a directory of text files and adding them to a Pandas dataframe:
text_path = "/home/tdun0002/stash/cloud_scripts/aws_scripts/output_files/memory_stats/text/"
filelist = os.listdir(text_path)
final_df = pd.DataFrame()
for filename in filelist:
    my_file = text_path + filename
    try:
        df = pd.read_csv(my_file, delim_whitespace=True, header=None)
        final_df = final_df.append(df)
    except Exception as err:  # handler is not shown in the post; this one is an assumption
        print(f"Could not read {my_file}: {err}")
pd.options.display.max_rows
print(f"\n***Full Data Frame: {df}\n***")
Each file in the directory holds the memory of a server:
bastion001-memory.txt
permissions001-memory.txt
haproxy001-memory.txt
The contents of the files look something like this:
cat haproxy001-memory.txt
7706172
On each pass of adding the file, it reports this:
Data Frame: Empty DataFrame
Columns: [7706172]
Index: []
And when I print out the full data frame it only has the last entry:
***Full Data Frame:
Empty DataFrame
Columns: [7706172]
Index: []
***
Why is it reporting that the dataframe is empty? Why is it only showing the last file that was input? I think I may need to append the data.
Two things:
You need to pass header=None to pd.read_csv so that the value in the text file is treated as data. By default, pandas assumes the first row is the header, which is why a one-line file produces an empty DataFrame whose only column is named 7706172.
Since you are reading multiple files, you need to append each DataFrame to a combined one. Currently df is overwritten on each iteration, so it only ever holds the last file.
The code should look like this:
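import os
import pandas as pd

text_path = "/home/tdun0002/stash/cloud_scripts/aws_scripts/output_files/memory_stats/text/"
filelist = os.listdir(text_path)
final_df = pd.DataFrame()
for filename in filelist:
    my_file = text_path + filename
    # header=None keeps the single value as data instead of using it as a column name
    df = pd.read_csv(my_file, sep=r"\s+", header=None)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    final_df = pd.concat([final_df, df], ignore_index=True)
    print(f"Data Frame: {df}")
# Print the accumulated frame, not the per-file df
print(f"\n***Full Data Frame: {final_df}\n***")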
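The two points above can be checked in isolation. The following is a minimal sketch, not the asker's actual setup: it writes two single-value files into a temporary directory (the bastion001 value, the memory column name and the server column are assumptions added for illustration), shows that the default header handling yields an empty frame, and then collects the per-file frames in a list so that pd.concat runs only once.

```python
import os
import tempfile

import pandas as pd

# Throwaway directory with two single-value files (the bastion value is made up).
tmp_dir = tempfile.mkdtemp()
samples = {"bastion001-memory.txt": "4045640", "haproxy001-memory.txt": "7706172"}
for name, value in samples.items():
    with open(os.path.join(tmp_dir, name), "w") as handle:
        handle.write(value + "\n")

# Default header handling: the only line becomes the column name, so no rows remain.
with_header = pd.read_csv(os.path.join(tmp_dir, "haproxy001-memory.txt"), sep=r"\s+")

# header=None keeps the value as data. Collect the frames, then concatenate once.
frames = []
for name in sorted(os.listdir(tmp_dir)):
    df = pd.read_csv(os.path.join(tmp_dir, name), sep=r"\s+", header=None, names=["memory"])
    df["server"] = name.replace("-memory.txt", "")  # hypothetical extra column
    frames.append(df)
final_df = pd.concat(frames, ignore_index=True)

print(with_header.empty)
print(final_df)
```

Building a list and calling pd.concat once avoids copying the growing frame on every iteration, which matters once there are many files.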
Related
Currently my code looks into the CSV files in a folder and replaces strings if the file has a column 'PROD_NAME' in its data. If a file doesn't have the column 'PROD_NAME', I'm trying to delete that file from the folder. With a little debugging I can get my code to print which CSV files do not have the column, but I can't figure out how to actually delete them from the folder they are in. I have tried an if statement that calls os.remove() and still nothing happens. There are no errors or anything; the script just finishes with all the files still in the folder. Here is my code. Any help is appreciated. Thanks!
def worker():
    filenames = glob.glob(dest_dir + '\\*.csv')
    print("Finding all files with column PROD_NAME")
    time.sleep(3)
    print("Changing names of products in these tables...")
    for filename in filenames:
        my_file = Path(os.path.join(dest_dir, filename))
        try:
            with open(filename):
                # read data
                df1 = pd.read_csv(filename, skiprows=1, encoding='ISO-8859-1')  # read column header only - to get the list of columns
                dtypes = {}
                for col in df1.columns:  # make all columns text, to avoid formatting errors
                    dtypes[col] = 'str'
                df1 = pd.read_csv(filename, dtype=dtypes, skiprows=1, encoding='ISO-8859-1')
                if 'PROD_NAME' not in df1.columns:
                    os.remove(filename)
                # Replaces text in files
                if 'PROD_NAME' in df1.columns:
                    df1 = df1.replace("NABVCI", "CLEAR_BV")
                    df1 = df1.replace("NAMVCI", "CLEAR_MV")
                    df1 = df1.replace("NA_NRF", "FA_GUAR")
                    df1 = df1.replace("N_FPFA", "FA_FLEX")
                    df1 = df1.replace("NAMRFT", "FA_SECURE_MVA")
                    df1 = df1.replace("NA_RFT", "FA_SECURE")
                    df1 = df1.replace("NSPFA7", "FA_PREFERRED")
                    df1 = df1.replace("N_ENHA", "FA_ENHANCE")
                    df1 = df1.replace("N_FPRA", "FA_FLEX_RETIRE")
                    df1 = df1.replace("N_SELF", "FA_SELECT")
                    df1 = df1.replace("N_SFAA", "FA_ADVANTAGE")
                    df1 = df1.replace("N_SPD1", "FA_SPD1")
                    df1 = df1.replace("N_SPD2", "FA_SPD2")
                    df1 = df1.replace("N_SPFA", "FA_LIFESTAGES")
                    df1 = df1.replace("N_SPPF", "FA_PLUS")
                    df1 = df1.replace("N__CFA", "FA_CHOICE")
                    df1 = df1.replace("N__OFA", "FA_OPTIMAL")
                    df1 = df1.replace("N_SCNI", "FA_SCNI")
                    df1 = df1.replace("NASCI_", "FA_SCI")
                    df1 = df1.replace("NASSCA", "FA_SSC")
                    df1.to_csv(filename, index=False, quotechar="'")
        except:
            if 'PROD_NAME' in df1.columns:
                print("Could not find string to replace in this file: " + filename)
worker()
Written below is a block of code that reads the raw csv data. It extracts the first row of data (containing the column names) and looks for the column name PROD_NAME. If it finds it, it sets found to True. Else, it sets found to False. To prevent trying to delete the files whilst open, the removal is done outside of the open().
import os

filename = "test.csv"
with open(filename) as f:  # any code in this block runs while the file is open
    # Read only the header line; strip the newline so the last column matches too
    header = f.readline().strip().split(",")
    if "PROD_NAME" in header:  # replace "PROD_NAME" with the column you are looking for
        print("found")
        found = True
    else:
        print("not found")
        found = False
# The file is closed at this point, so it is safe to delete it
if not found:
    os.remove(filename)
else:
    pass  # carry out replacements here / load it in pandas
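The same idea can be applied to a whole folder. The sketch below is one possible arrangement, not the asker's script: the helper name remove_csvs_without_column and the demo file names are hypothetical, and it assumes the header sits on the first line of each file (the question uses skiprows=1, which would need to be passed through). It reads only the header with nrows=0 and deletes the file after pd.read_csv has returned, so no handle is still open. It also avoids a bare except, which can hide whatever error os.remove raises.

```python
import glob
import os
import tempfile

import pandas as pd


def remove_csvs_without_column(folder, column):
    """Delete every CSV in folder whose header lacks column; return the removed paths."""
    removed = []
    for path in glob.glob(os.path.join(folder, "*.csv")):
        # nrows=0 parses only the header row; the file is closed when read_csv returns.
        header = pd.read_csv(path, nrows=0).columns
        if column not in header:
            os.remove(path)
            removed.append(path)
    return removed


# Small demonstration in a temporary folder (file names and contents are made up).
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "keep.csv"), "w") as handle:
    handle.write("ID,PROD_NAME\n1,NABVCI\n")
with open(os.path.join(demo_dir, "drop.csv"), "w") as handle:
    handle.write("ID,OTHER\n1,x\n")

removed = remove_csvs_without_column(demo_dir, "PROD_NAME")
print(sorted(os.listdir(demo_dir)))
```

Returning the list of removed paths makes it easy to log what was deleted before running the replacement step on the remaining files.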
import pandas as pd
import os
import glob

path = r'C:\Users\avira\Desktop\CC\SAIL\Merging\CISF'
files = glob.glob(os.path.join(path, '*.csv'))
combined_data = pd.DataFrame()
for file in files:
    data = pd.read_csv(file)
    print(data)
    combined_data = pd.concat([combined_data, data], axis=0, ignore_index=True)
combined_data.to_csv(r'C:\Users\avira\Desktop\CC\SAIL\Merging\CISF\data2.csv')
The files are merging diagonally, i.e. the second file begins next to the last cell of the first file. Also, the first entry of each file is being taken as the column names.
None of my files have column names. How do I merge my files vertically and provide column names for the merged CSV?
For the header problem when reading the CSV, you can do this:
pd.read_csv(file, header=None)
When writing the result, you can pass a list containing the header names:
df.to_csv(file_name, header=['col1', 'col2'])
You need to read the csv with no headers and concat:
data = pd.read_csv(file, header=None)
combined_data = pd.concat([combined_data, data], ignore_index=True)
If you want to give the columns meaningful names:
combined_data.columns = ['name1', 'name2', 'name3']
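The "diagonal" layout arises because each file's first row was used as its header, so every file ended up with different column labels and pd.concat aligned them by label. As a rough sketch of the fix, the example below simulates two header-less files in memory (the values and the names id and label are made up) and uses the names= parameter of pd.read_csv to label the columns while reading.

```python
import io

import pandas as pd

# Two header-less CSV sources, simulated in memory (values are made up).
sources = [io.StringIO("1,a\n2,b\n"), io.StringIO("3,c\n")]
column_names = ["id", "label"]  # hypothetical column names

# header=None stops the first row being used as a header; names= labels the columns.
frames = [pd.read_csv(src, header=None, names=column_names) for src in sources]
combined_data = pd.concat(frames, ignore_index=True)

# index=False keeps the row index out of the written output.
csv_text = combined_data.to_csv(index=False)
print(csv_text)
```

Because every frame now carries the same labels, the rows stack vertically instead of spreading across new columns.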
I am trying to add data from several files in a folder to a data frame. Each .csv file has varying lengths but has the same number of columns. I am trying to add all of them to one data frame with ignoring the index so that the new data frame is just vertically combined. For some reason every time I try to concatenate the data I am left with ~ 363 columns when there should only be 9. Each csv file has the same number of columns so I am confused.
import os
import pandas as pd
import glob

cwd = os.getcwd()
folder = cwd + '\\downloads\\prepared_csv_files\\prepared_csv_files\\'
all_files = glob.glob(folder + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
I have also tried
final_df = pd.DataFrame(li, columns = ['tool','pressure'])
# and I name all columns not doing it now
Here, final is the name of the combined DataFrame.
I am assuming that tool and pressure are the column names in all of your .csv files.
final = pd.DataFrame(columns=['tool', 'pressure'])
for filename in all_files:
    df = pd.read_csv(filename)  # read_csv already returns a DataFrame
    # join="inner" keeps only the columns that both frames share
    final = pd.concat([final, df], ignore_index=True, join="inner")
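One possible cause of the extra columns, offered here as an assumption rather than a diagnosis, is that the header rows are not identical across files. With the default join="outer", pd.concat keeps the union of all column labels, so any difference in spelling or whitespace creates a new column. The made-up frames below differ only by a trailing space in one label.

```python
import pandas as pd

# Two frames whose headers differ only by a stray space (made-up data).
a = pd.DataFrame({"tool": [1, 2], "pressure": [10.0, 11.0]})
b = pd.DataFrame({"tool ": [3], "pressure": [12.0]})

# Default join="outer": the result holds the union of the column labels.
outer = pd.concat([a, b], ignore_index=True)
print(list(outer.columns))

# Normalising the labels first makes the columns line up again.
b.columns = b.columns.str.strip()
fixed = pd.concat([a, b], ignore_index=True)
print(list(fixed.columns))
```

Printing the column list of each file before concatenating is a quick way to spot which files carry unexpected headers.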
I'm using a for loop to cycle through numerous text files, select a single column (named ppm) from each text file, and append these columns to a new data frame. I'd like the columns in the new data frame to have the name of the text file, but I'm not sure how to do this.
My code is:
all_files = glob.glob(os.path.join(path, "*.txt"))
df1 = pd.DataFrame()
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep=r'\s+', header=0, usecols=['ppm'])
    df1 = pd.concat([df, df1], axis=1)
At the moment every column in the new dataframe is called 'ppm'.
I used to have this code
df1 = pd.DataFrame()
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep=r'\s+', header=0)
    df1[file_name] = df['ppm']
But I ran into the warning 'PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy() df1[file_name] = df['ppm'].copy()' when I tried to run the code for a large number of files (~ 100s).
Assuming the index is the same in every file, add all your data to a dictionary:
all_files = glob.glob(os.path.join(path, "*.txt"))
data_dict = {}
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep=r'\s+', header=0, usecols=['ppm'])
    data_dict[file_name] = df['ppm']
df1 = pd.DataFrame(data_dict)
Rename the ppm column in each DataFrame, append the DataFrames to a list, and call concat once outside the loop:
all_files = glob.glob(os.path.join(path, "*.txt"))
dfs = []
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep=r'\s+', header=0, usecols=['ppm'])
    dfs.append(df.rename(columns={'ppm': file_name}))
df_big = pd.concat(dfs, axis=1)
Use df.rename() to rename the column of each DataFrame before concatenating.
import os
import pandas

df1 = pandas.DataFrame()  # start with an empty frame so the first concat works
for file in all_files:
    file_name = os.path.basename(file)
    print(file_name)
    df = pandas.read_csv(file, index_col=None, sep=',', header=0, usecols=['ppm'])
    df.rename(columns={'ppm': file_name}, inplace=True)
    df1 = pandas.concat([df, df1], axis=1)
Output:
two.txt one.txt
0 9 3
1 0 6
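A variation on the answers above, sketched under the assumption that every file shares the same row index: pd.concat accepts a keys= argument, and when Series are concatenated along axis=1 those keys become the column labels, so no rename step is needed. The file paths and values below are made up and read from memory rather than from disk.

```python
import io
import os

import pandas as pd

# Stand-ins for the text files (paths and values are made up).
fake_files = {
    "/data/one.txt": "time ppm\n0 3\n1 6\n",
    "/data/two.txt": "time ppm\n0 9\n1 0\n",
}

columns = []
names = []
for path, text in fake_files.items():
    df = pd.read_csv(io.StringIO(text), sep=r"\s+", usecols=["ppm"])
    columns.append(df["ppm"])
    names.append(os.path.basename(path))

# keys= supplies the column labels when Series are concatenated along axis=1.
df_big = pd.concat(columns, axis=1, keys=names)
print(df_big)
```

Since the concatenation happens once, this also avoids the fragmentation warning that repeated column insertion triggers.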
Rather than concatenating and appending dataframes as you iterate over your list of files, you could consider building a dictionary of the relevant data then construct your dataframe just once. Like this:
import csv
import pandas as pd
import glob
import os

PATH = ''
COL = 'ppm'
FILENAME = 'filename'
D = {COL: [], FILENAME: []}
for file in glob.glob(os.path.join(PATH, '*.csv')):
    with open(file, newline='') as infile:
        for row in csv.DictReader(infile):
            if COL in row:
                D[COL].append(row[COL])
                D[FILENAME].append(file)
df = pd.DataFrame(D)
print(df)
I have the following code:
import glob
import pandas as pd
import os
import csv

myList = []
path = "/home/reallymemorable/Documents/git/COVID-19/csse_covid_19_data/csse_covid_19_daily_reports_us/*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    row = df.loc[df['Province_State'] == 'Pennsylvania']
    dateFromFilename = os.path.basename(fname).replace('.csv', '')
    fileDate = pd.DataFrame({'Date': [dateFromFilename]})
    myList.append(row.join(fileDate))
concatList = pd.concat(myList, sort=True)
print(concatList)
concatList.to_csv('/home/reallymemorable/Documents/test.csv', index=False, header=True)
It goes through a folder of CSVs and grabs a specific row and puts it all in a CSV. The files themselves have names like 10-10-2020.csv. I have some code in there that gets the filename and removes the file extension, so I am left with the date alone.
I am trying to add another column called "Date" that contains the filename for each file.
The script almost works: it gives me a CSV of all the rows I pulled out of the various CSVs, but the Date column itself is empty.
If I do print(dateFromFilename), the date/filename prints as expected (e.g. 10-10-2020).
What am I doing wrong?
DataFrame.join uses how='left' by default and aligns on the index. Your fileDate DataFrame has index 0, while row keeps its original index from df, so the labels don't match and the Date column comes out empty. Instead, do an assignment:
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    row = df.loc[df['Province_State'] == 'Pennsylvania']
    dateFromFilename = os.path.basename(fname).replace('.csv', '')
    myList.append(row.assign(Date=dateFromFilename))
concatList = pd.concat(myList, sort=True)
Another way is to store the dataframes as a dictionary, then concat:
myList = dict()
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    row = df.loc[df['Province_State'] == 'Pennsylvania']
    dateFromFilename = os.path.basename(fname).replace('.csv', '')
    myList[dateFromFilename] = row
concatList = pd.concat(myList, sort=True)
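The index mismatch can be reproduced on a tiny frame. The sketch below uses made-up numbers in place of a daily report and compares join with assign. It also shows how the dictionary variant can be turned into a regular Date column, since the dictionary keys otherwise end up as an outer index level rather than a column. The level name Date is chosen here for illustration.

```python
import pandas as pd

# A tiny stand-in for one daily report (made-up numbers).
df = pd.DataFrame({"Province_State": ["Ohio", "Pennsylvania"], "Confirmed": [1, 2]})
row = df.loc[df["Province_State"] == "Pennsylvania"]  # keeps its original index label, 1
file_date = pd.DataFrame({"Date": ["10-10-2020"]})    # has index label 0

joined = row.join(file_date)              # labels 1 and 0 do not match, so Date is NaN
assigned = row.assign(Date="10-10-2020")  # the scalar is broadcast to every row

print(joined["Date"].isna().all())
print(assigned["Date"].tolist())

# Passing a dict to pd.concat uses its keys as an outer index level;
# names= labels that level so reset_index can turn it into a column.
by_date = pd.concat({"10-10-2020": row}, names=["Date"])
with_column = by_date.reset_index(level="Date")
print(with_column)
```

With assign the date travels with the row from the start, which is usually the simpler choice when the result is written straight to CSV with index=False.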