Why is my for loop overwriting instead of appending? - python

I have multiple (25k) .csv files that I'm trying to append into an HDFStore file. They all share identical headers. I am using the code below, but for some reason whenever I run it the store doesn't end up holding all of the files; it only contains the last file in the list.
filenames = []  # list of .csv file paths that I've already populated
dtypes = {dict of datatypes}
store = pd.HDFStore('store.h5')
# store one data frame
store.put('df', pd.read_csv(filenames[0], dtype=dtypes, parse_dates=["date"]))
for f in filenames:
    try:
        temp_csv = pd.DataFrame()
        temp_csv = pd.read_csv(f, dtype=dtypes, parse_dates=["trade_date"])
        store.append('df', temp_csv)
    except:
        pass
I've tried using a subset of the filenames list, but I always get the last entry. For some reason, the loop is not appending my file, but rather overwriting it every single time. Any advice would be appreciated as this is driving me bonkers. (Python 3, Windows)

I think the problem is related to:
store.append('df', temp_csv)
If I correctly understand what you're trying to do, 'df' should change every iteration; as written, you're just overwriting it each time.
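A minimal sketch of that idea, assuming you want each file stored under its own key (the 'df_0', 'df_1', ... key pattern is made up for illustration):
store = pd.HDFStore('store.h5')
for i, f in enumerate(filenames):
    temp_csv = pd.read_csv(f, dtype=dtypes, parse_dates=["trade_date"])
    store.put(f'df_{i}', temp_csv)  # a distinct key per file, so nothing is overwritten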

You're creating/storing a new DataFrame with each iteration, like @SeaMonkey said. Your consolidated dataframe should be instantiated outside your loop, something like this.
filenames = []  # list of .csv file paths that I've already populated
dtypes = {dict of datatypes}

df = pd.DataFrame()
for f in filenames:
    df_tmp = pd.read_csv(f, dtype=dtypes, parse_dates=["trade_date"])
    df = df.append(df_tmp)

store = pd.HDFStore('store.h5')
store.put('df', df)
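As a side note, calling df.append in a loop copies all the accumulated rows on every iteration, and DataFrame.append has since been deprecated in newer pandas. A sketch of the usual alternative, collecting the frames in a list and concatenating once:
frames = []
for f in filenames:
    frames.append(pd.read_csv(f, dtype=dtypes, parse_dates=["trade_date"]))
df = pd.concat(frames, ignore_index=True)

store = pd.HDFStore('store.h5')
store.put('df', df)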

Related

How can I save my results in the same file as different columns in case of a 'for-cycle'

def get_df():
    df = pd.DataFrame()
    os.chdir("C:/Users/s/Desktop/P")
    for file in os.listdir():
        if file.endswith('.csv'):
            av_a = np.average(a, axis=0)
            np.savetxt('merged_average.csv', av_a, delimiter=',')
I've tried to save it but it always overwrites with the next file and deletes the previous results
At the moment, your code is a bit hard to read, as you are declaring variables which are not used (df) and using variables which are not declared (a). In the future, try to give a minimal reproducible example of your problematic code.
I'll still try to give you an interpreted answer:
If you want to store multiple columns from different files next to each other, the job becomes simpler if you first acquire all the columns and then afterwards save them to the file in a single action.
Here is an interpretation of your code:
def get_df():
    # create an empty list to collect all results
    average_results = []
    os.chdir("C:/Users/s/Desktop/P")
    for file in os.listdir():
        if file.endswith('.csv'):
            a = something(file)  # unknown to me
            average_results.append(np.average(a, axis=0))
    # convert the results to a 2d numpy matrix,
    # optionally transpose it to get the desired data orientation
    data = np.array(average_results).transpose()
    # save the full dataset
    np.savetxt('merged_average.csv', data, delimiter=',')
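For illustration, a tiny self-contained run of the same pattern; the three input arrays below are made up and stand in for the per-file np.average(a, axis=0) results:
import numpy as np

average_results = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
data = np.array(average_results).transpose()
print(data.shape)  # (2, 3): one column per input file
np.savetxt('merged_average.csv', data, delimiter=',')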

Use for loop to create dataframes from a list

Python/Pandas beginner here. I have a list with names which each represent a csv file on my computer. I would like to create a separate pandas dataframe for each of these csv files and use the same names for the dataframes. I can do this in a very inefficient way by creating a separate line of code for each name in the list and adding/removing these lines manually as the list changes over time; something like this when I have the 3 names Mark, Frank and Peter:
path = 'C:\\Users\\Me\\Desktop\\Names\\'
Mark = pd.read_csv(path + "Mark.csv")
Frank = pd.read_csv(path + "Frank.csv")
Peter = pd.read_csv(path + "Peter.csv")
The problem is that I will usually have a dozen or so names and they change frequently, so this is not very efficient. Instead I figured I would keep a list of the names to update when needed and use a for loop to do the rest:
path = 'C:\\Users\\Me\\Desktop\\Names\\'
names = ['Mark', 'Frank', 'Peter']
for name in names:
    name = pd.read_csv(path + name + '.csv')
This does not produce an error, but instead of creating 3 different dataframes Mark, Frank and Peter, it creates a single dataframe 'name' using only the data from the first entry in the list. How do I make this work so that it creates a separate dataframe for each name in the list and gives each dataframe the same name as the csv file that was read?
it creates a single dataframe 'name' using only the data from the first entry in the list.
It uses the last entry, because each time through the loop, name is replaced with the result of the next read_csv call. (Actually, it's first replaced with one of the values from the list, and then with the read_csv result; to avoid confusion, you should use separate names for your loop variable and your outputs. Especially since name doesn't make any sense as the thing to call your result :) )
How do I make this work
You had a list of input values, and thus you want a list of output values as well. The simplest approach is to use a list comprehension, describing the list you want in terms of the list you start with:
csvs = [
    pd.read_csv(f'{path}{name}.csv')
    for name in names
]
It works the same way as the explicit loop, except it builds a list automatically from the value that's computed each time through. It means what it says, in order: "csvs is a list of these pd.read_csv results, computed once for each of the name values that is in names".
name here is the variable used to iterate over the list. Modifying it won't make any noticeable changes.
path = 'C:\\Users\\Me\\Desktop\\Names\\'
names = ['Mark', 'Frank', 'Peter']

dfs = []
for name in names:
    dfs.append(pd.read_csv(path + name + '.csv'))

# OR

dfs = [
    pd.read_csv(path + name + '.csv')
    for name in names
]
Or, you can use a dict to map each name to its dataframe.
path = 'C:\\Users\\Me\\Desktop\\Names\\'
names = ['Mark', 'Frank', 'Peter']

dfs = {}
for name in names:
    dfs[name] = pd.read_csv(path + name + '.csv')

# OR

dfs = {
    name: pd.read_csv(path + name + '.csv')
    for name in names
}
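A quick usage sketch once the dict is built (assuming the three files above exist):
mark_df = dfs['Mark']  # look up a dataframe by the name it was loaded from
print(mark_df.head())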
Two options:
If you know the names of all your csv files, you can edit your code and only add a list to hold all your dataframes.
Example
path = 'C:\\Users\\Me\\Desktop\\Names\\'
names = ['Mark', 'Frank', 'Peter']

dfs = []
for name in names:
    dfs.append(pd.read_csv(path + name + '.csv'))
Otherwise, you can look for all the files with a .csv extension and open all of them using os.listdir().
import os
import pandas as pd

path = 'C:\\Users\\Me\\Desktop\\Names\\'
files = os.listdir(path)

dfs = []
for file in files:
    if file.endswith('.csv'):
        dfs.append(pd.read_csv(path + file))
If you really want each dataframe bound to its own variable name, as in the question, you can assign into globals(), although a dict as shown above is usually the cleaner choice:
for name in names:
    globals()[name] = pd.read_csv(path + name + '.csv')

Finding a specific value in csv files Python

I have a column of values, which are part of a dataframe df.
Value
6.868061881
6.5903628020000005
6.472865833999999
6.427754219
6.40081742
6.336348032
6.277545389
6.250755132
These values have been put together from several CSV files. Now I'm trying to backtrack and find the original CSV file which contains the values. This is my code. The problem is that each row of the CSV file contains alphanumeric entries and I'm comparing only the numeric ones (like the Values above). So the code isn't working.
for item in df['Value']:
    for file in dirs:
        csv_file = csv.reader(open(file))
        for row in csv_file:
            for column in row:
                if str(column) == str(item):
                    print(file)
Plus, I'm trying to cut down the number of loops. How do I approach this?
Assuming dirs is a list of file paths to CSV files:
csv_dfs = {file: pd.read_csv(file) for file in dirs}
csv_df = pd.concat(csv_dfs)
If you're just looking in the 'Value' column, this is pretty straightforward:
print(csv_df[csv_df['Value'].isin(df['Value'])])
Because we made the dataframe from a dictionary of the files, where the keys are filenames, the printed values will have the original filename in the index.
In a comment, you asked how to get just the filenames. Because of the way we constructed the dataframe's index, the following should work to get a series of the filenames:
csv_df[csv_df['Value'].isin(df['Value'])].reset_index()['level_0']
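Equivalently, since the dict keys ended up as the first index level, a small sketch that pulls the unique filenames directly:
matched = csv_df[csv_df['Value'].isin(df['Value'])]
print(matched.index.get_level_values(0).unique().tolist())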
Note, if you're not sure which column in the CSVs you're matching against, you can loop over them:
for col in csv_df.columns:
    print(csv_df[csv_df[col].isin(df['Value'])])
A few suggestions:
Make sure you're comparing like types, e.g.:
if str(column) == str(item):
Or, you could check types before doing the comparison:
if type(column) == type(item) and column == item:
Or, dump your CSV into a DataFrame. This approach reduces the number of loops since you don't need to iterate the rows/lines in the file, just the columns:
from pandas import read_csv

for item in df['Value']:
    for file in dirs:
        csv_frame = read_csv(file)
        for column in csv_frame.columns:
            if item in csv_frame[column].values:  # .values, since 'in' on a Series checks the index
                print(file)
File I/O will generally take more time than processing data in memory. So, if you want to optimize your code, it is better to loop through the csv files once, instead of once for every item in your dataframe. I suggest the following:
import numpy as np

val_list = df['Value'].values
for file in dirs:
    csv_df = pd.read_csv(file)
    df_contains = csv_df.isin(val_list)
    if np.any(df_contains.values):
        print(file)

How to select unique values from named column in multiple .csv files?

I am trying to create a list of unique IDs from multiple csvs.
I have around 80 csvs containing data, all in the same format and in the same directory. The files contain time series data from around 1500 sites, but not all sites are in all files. The column with the data I need is called 'Site Id'.
I can get unique values from the first csv by creating a dataframe, but I can't see how to loop through all the remaining files.
If it's not obvious by now I am a complete beginner and my tutors are on vacation!
I've tried creating a df for a single file, but I can't figure out the next step.
df = pd.read_csv(r'C:filepathhere.csv')
ids = df['Site Id'].unique().tolist()
You can do something like this. I used the os.listdir function to get all of the files, and then list.extend to merge the site IDs I came across into my siteIDs list. Finally, turning a list into a set, and then back into a list, removes any duplicate entries.
siteIDs = []
directoryToCSVs = r'c:\...'
for filename in os.listdir(directoryToCSVs):
    if filename.lower().endswith('.csv'):
        df = pd.read_csv(os.path.join(directoryToCSVs, filename))
        siteIDs.extend(df['Site Id'].tolist())

# remove duplicate site IDs
siteIDs = list(set(siteIDs))

# siteIDs now contains a list of the unique site IDs across all of your CSV files.
You could do something like this to iterate over all your CSVs and load them into dataframes:
import os
from glob import glob
import pandas as pd

csv_dir = 'Path to CSV dir'
csv_paths = []
for root, dirs, files in os.walk(csv_dir):
    for c in glob(os.path.join(root, '*.csv')):
        csv_paths.append(c)

for file_path in csv_paths:
    df = pd.read_csv(filepath_or_buffer=file_path)
    # do something with df (append, export, etc.)
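For this question specifically, the "do something" step could collect the unique 'Site Id' values as it goes; a minimal sketch, assuming every file has that column:
site_ids = set()
for file_path in csv_paths:
    df = pd.read_csv(file_path)
    site_ids.update(df['Site Id'].unique())
print(sorted(site_ids))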
First you need to gather the files into a list that you will be getting data out of. There are many ways to do this; assuming you know the directory they are all in, see this answer for many options.
from os import walk

f = []
for (dirpath, dirnames, filenames) in walk(mypath):
    f.extend(filenames)
    break
Then within that list you'll need to gather those unique values that you need. Without using Pandas, since it doesn't seem like you actually need your information in a dataframe:
import csv

unique_data = {}
for file in f:
    with open(file, newline='') as infile:
        reader = csv.DictReader(infile)
        for row in reader:
            # add each row's 'Site Id' value as a dictionary key;
            # duplicate keys simply overwrite, so only unique values remain
            unique_data[row['Site Id']] = 0

# unique_data.keys() now holds your unique values; if you want a true list:
unique_data_list = list(unique_data.keys())

Read in multiple csv into separate dataframes in Pandas

I have a long list of csv files that I want to read as dataframes and name them by their file name. For example, I want to read in the file status.csv and assign its dataframe the name status. Is there a way I can efficiently do this using Pandas?
Looking at this, I still have to write the name of each csv in my loop. I want to avoid that.
Looking at this, that allows me to read multiple csv into one dataframe instead of many.
You can list all the csvs under a directory using os.listdir(dirname) and combine that with os.path.basename to parse the file name.
import os
import pandas as pd

# current directory csv files
csvs = [x for x in os.listdir('.') if x.endswith('.csv')]

# status.csv -> status
fns = [os.path.splitext(os.path.basename(x))[0] for x in csvs]

d = {}
for fn, csv_path in zip(fns, csvs):
    d[fn] = pd.read_csv(csv_path)
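A quick usage sketch, assuming a status.csv exists in the directory:
status = d['status']  # the dataframe read from status.csv
print(status.head())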
You could create a dictionary of DataFrames:
d = {}  # dictionary that will hold them
for file_name in list_of_csvs:  # loop over files
    # read csv into a dataframe and add it to the dict with file_name as its key
    d[file_name] = pd.read_csv(file_name)
