Parsing through each folder to pull in information in python - python

I have a directory with a folder for each customer. In each customer folder there is a csv file named surveys.csv. I want to open each customer folder and then pull the data from the csv and concatenate. I also want to create a column with that customer id which is the name of the folder.
import os
rootdir = '../data/customer_data/'
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        csvfiles = glob.glob(os.path.join(mycsvdir, 'surveys.csv'))
        # loop through the files and read them in with pandas
        dataframes = [] # a list to hold all the individual pandas DataFrames
        for csvfile in csvfiles:
            df = pd.read_csv(csvfile)
            df['patient_id'] = os.path.dirname
            dataframes.append(df)
# concatenate them all together
result = pd.concat(dataframes, ignore_index=True)
result.head()
This code is only giving me a dataframe with one customer's data. In the directory : '../data/customer_data/' there should be about 25 folders with customer data. I want to concatenate all the 25 of the surveys.csv files into a dataframe. Please help

Move this line:
dataframes = []
outside the outer for loop.
As written, it resets the list to empty on every iteration, so only the last file read survives.
Other issues:
In the line csvfiles = glob.glob(os.path.join(mycsvdir, 'surveys.csv')), mycsvdir is never defined; use subdir to build the full path of each file.
csvfiles holds at most one file per folder, so the inner loop over it is unnecessary.
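Putting those fixes together, a minimal sketch might look like the following. It assumes every customer folder sits directly under the root directory and holds one surveys.csv. The function name load_customer_surveys and the column name customer_id are illustrative choices, not part of the original code (the question used patient_id). Note also that the original line df['patient_id'] = os.path.dirname assigned the function object itself rather than a folder name.

```python
from pathlib import Path

import pandas as pd


def load_customer_surveys(rootdir):
    """Concatenate every <rootdir>/<customer>/surveys.csv into one DataFrame.

    Hypothetical helper: assumes one surveys.csv per customer folder.
    """
    dataframes = []  # created once, before the loop, so it is never reset
    for csvfile in sorted(Path(rootdir).glob("*/surveys.csv")):
        df = pd.read_csv(csvfile)
        # The parent folder's name is the customer id.
        df["customer_id"] = csvfile.parent.name
        dataframes.append(df)
    if not dataframes:
        raise FileNotFoundError(f"no surveys.csv found under {rootdir}")
    return pd.concat(dataframes, ignore_index=True)
```

It could then be called as result = load_customer_surveys('../data/customer_data/'), after which result['customer_id'].nunique() should report how many customer folders were actually read.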

Related

Iterating over .csv files and naming dataframes respectively

How can I iterate over .csv files in a folder, create a dataframe from each .csv, and name those dataframes after the respective .csv files? Any other naming scheme would also be fine.
My approach doesn't even create a single dataframe.
path = "/user/Home/Data/"
files = os.listdir(path)
os.chdir(path)
for file, j in zip(files, range(len(files))):
    if file.endswith('.csv'):
        files[j] = pd.read_csv(file)
Thanks!
You can use pathlib and a dictionary to do that (as already pointed out by jitusj in the comment).
from pathlib import Path
import pandas as pd
path = Path(".../user/Home/Data/") # complete path needed! Replace "..." with full path
dict_of_df = {}
for file in path.glob('*.csv'):
    dict_of_df[file.stem] = pd.read_csv(file)
Now you have a dictionary of dataframes, with the filenames as keys (without .csv extension).

Import all Excel files from all subfolders in a directory

I'm new to Python and having some trouble looping all the files in my directory.
I am trying to import data from all Excel files from all of the subfolders I have in one single directory. For example, I have a directory named "data" which has five different subfolders and each subfolder contains Excel files from which I want to extract data.
I guess my current code is not working because it just loops all the files in a directory without considering the subfolders. How do I modify my current code to extract data from all the subfolders in my directory?
data_location = "data/"
for file in os.listdir(data_location):
    df_file = pd.read_excel(data_location + file)
    df_file.set_index(df_file.columns[0], inplace=True)
    selected_columns = df_file.loc["Country":"Score", :'Unnamed: 1']
    selected_columns.dropna(inplace=True)
    df_total = pd.concat([selected_columns, df_total], ignore_index=True)
Also, I've been trying to create a new variable using each file name as I import them. For example, if there are 5 files(file1~file5) in a directory, I want to create a new variable called "Source" and each value would be file1, file2, file3, file4, file5. I want python to append this value for the new variable as it imports each file in the loop. Could anyone please help me with this?
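To go through subdirectories recursively, try something like this:
import os
import pandas as pd
data_location = 'C:/path/to/data'
for subdir, dirs, files in os.walk(data_location):
    for file in files:
        # join with subdir (the folder currently being walked), not data_location
        df_file = pd.read_excel(os.path.join(subdir, file))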
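For the second part of the question (a "Source" column holding each file's name), one possible approach is sketched below. The function name read_all_with_source, its pattern and reader parameters, and the choice of the file name without extension as the Source value are all assumptions made for illustration. Reading .xlsx files with pd.read_excel also requires an Excel engine such as openpyxl to be installed.

```python
from pathlib import Path

import pandas as pd


def read_all_with_source(data_location, pattern="*.xlsx", reader=pd.read_excel):
    """Read every matching file under data_location, including subfolders.

    Hypothetical helper: each frame gets a "Source" column with the file
    name (without extension) before the frames are concatenated.
    """
    frames = []
    # rglob descends into all subfolders, unlike os.listdir.
    for path in sorted(Path(data_location).rglob(pattern)):
        df = reader(path)
        df["Source"] = path.stem  # e.g. "file1" for file1.xlsx
        frames.append(df)
    if not frames:
        return pd.DataFrame()  # nothing matched
    return pd.concat(frames, ignore_index=True)
```

Collecting the frames in a list and concatenating once at the end also avoids the problem in the question's loop, where df_total is used before it has been defined.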

How to apply the same process to multiple csv files in pandas and save it in another directory?

I've been trying to create a code, which runs through all the csv files inside the directory and applies the same operation to all of them. Afterwards it should save the new csv files in another directory.
I've got two problems: First the code only saves the last iteration and second how do I save the files with different names?
Here's my code so far:
from pathlib import Path
import pandas as pd
dir = r'C:\my\path\to\file'
csv_files = [f for f in Path(dir).glob('*.csv')] #list all csv
for csv in csv_files: #iterate list
    df = pd.read_csv(csv, encoding = 'ISO-8859-1', engine='python', delimiter = ';') #read csv
    df.drop(df.index[:-1], inplace = True) #drop all but the last row
    df.to_csv("C:\new\path\to\file\variable name") #save the file in a new dir
Rakesh's answer works perfectly for me. Thank you all for your input! :)
In this case the simplest fix is probably to save each new file under the same name as its source, either with a common suffix or in a new directory.
I've got two problems:
First the code only saves the last iteration - This happens because every iteration writes to the same file name, so each one overwrites the previous output and only the last file survives.
and second how do I save the files with different names? - Reuse the source file's name and save it in a new directory, or add a suffix such as mycsv_modified.csv.
Below is an example that saves into a new directory (I tested this code in a Jupyter notebook on a non-Windows environment):
from pathlib import Path
import pandas as pd
dir_b = r'/Users/rakeshkumar/bigquery'
csv_files = [f for f in Path(dir_b).glob('*.csv')] #list all csv
#!mkdir -p processed #I created new directory to save modified file in notebook itself, you can decide yourself about new directory
for csv in csv_files: #iterate list
    df = pd.read_csv(csv, encoding = 'ISO-8859-1', engine='python', delimiter = ';') #read csv
    df.drop(df.index[:-1], inplace = True) #drop all but the last row
    print (df)
    df.to_csv(dir_b + "/processed/" + csv.name) #save the file in a new dir
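The suffix variant mentioned above could be sketched roughly as follows. The function name save_last_rows, its parameters, and the default suffix are hypothetical; the sketch also creates the output directory itself instead of relying on a notebook shell command, and writes the output with the same ';' separator as the input, which is an assumption about the desired format.

```python
from pathlib import Path

import pandas as pd


def save_last_rows(src_dir, dst_dir, suffix="_modified"):
    """Keep only the last row of each CSV in src_dir and save it to dst_dir.

    Hypothetical helper: each output is named <original stem><suffix>.csv,
    so no file overwrites another.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    written = []
    for csv_path in sorted(src_dir.glob("*.csv")):
        df = pd.read_csv(csv_path, sep=";", encoding="ISO-8859-1")
        last_row = df.tail(1)  # same effect as dropping all but the last row
        out_path = dst_dir / f"{csv_path.stem}{suffix}.csv"
        last_row.to_csv(out_path, sep=";", index=False)
        written.append(out_path)
    return written
```

Because the output name is derived from each input file's own name, the loop can run over any number of files without one result replacing another.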

Appending a single row from multiple CSV files to another CSV

I'm using Python 3 and pandas. I have a folder of multiple CSV files, each of which contains stats on a given date for all the regions of a country. I have created a second folder holding one CSV file per region, each named after a region listed in the CSV files in the first folder. I want to append the appropriate row from each file in the first folder to the matching region file in the second folder.
This shows a portion of a CSV file from first folder
This shows the CSV files I created in the second folder
Here is the code I'm running after creating the new set of region named files in the second folder. I don't get any errors, but I don't get the results I'm looking for either, which is a CSV file for each region in the second folder containing the daily stats from each of the files in the first folder.
for csvname in os.listdir("NewTables"):
    if csvname.endswith(".csv"):
        df1 = pd.read_csv("NewTables/"+ csvname)
        name1 = os.path.splitext(filename)[0]
        for file in os.listdir():
            if file.endswith(".csv"):
                df2 = pd.read_csv(file)
                D = df2[df2["denominazione_regione"] == name1 ]
                df1.append(D, ignore_index = True)
        df1.to_csv("NewTables/"+ csvname)
Here are a few lines from a CSV file in the first folder:
data,stato,codice_regione,denominazione_regione,lat,long,ricoverati_con_sintomi,terapia_intensiva,totale_ospedalizzati,isolamento_domiciliare,totale_positivi,variazione_totale_positivi,nuovi_positivi,dimessi_guariti,deceduti,totale_casi,tamponi,note_it,note_en
2020-02-24T18:00:00,ITA,13,Abruzzo,42.35122196,13.39843823,0,0,0,0,0,0,0,0,0,0,5,,
2020-02-24T18:00:00,ITA,17,Basilicata,40.63947052,15.80514834,0,0,0,0,0,0,0,0,0,0,0,,
2020-02-24T18:00:00,ITA,04,P.A. Bolzano,46.49933453,11.35662422,0,0,0,0,0,0,0,0,0,0,1,,
I would not use pandas here because there is little data processing and mainly file processing. So I would stick to the csv module.
I would loop over the csv files in the first directory and process them one at a time. For each row, I would append it to the file with the matching region name in the second folder. I assume that the number of regions is reasonably small, so I would keep the files in the second folder open to avoid the cost of opening and closing a file for every row.
The code could be:
import glob
import os.path
import csv
outfiles = {} # cache the open files and the associated writer in 2nd folder
for csvname in glob.glob('*.csv'): # loop over csv files from 1st folder
    with open(csvname) as fdin:
        rd = csv.DictReader(fdin) # read the file as csv
        for row in rd:
            path = "NewTables/"+row['denominazione_regione']+'.csv'
            newfile = not os.path.exists(path) # a new file?
            if row['denominazione_regione'] not in outfiles:
                fdout = open(path, 'a', newline='') # not in cache: open it
                wr = csv.DictWriter(fdout, rd.fieldnames)
                if newfile:
                    wr.writeheader() # write header line only for new files
                outfiles[row['denominazione_regione']] = (wr, fdout) # cache
            wr = outfiles[row['denominazione_regione']][0]
            wr.writerow(row) # write the row in the relevant file
for file in outfiles.values(): # close every outfile
    file[1].close()
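If you would rather stay in pandas, note that df1.append(D, ignore_index = True) in the question returned a new DataFrame instead of modifying df1, so its result was discarded (the method was later removed in pandas 2.0). A rough pandas alternative is sketched below. The function name split_by_region and its parameters are hypothetical, and unlike the csv-module code above it rewrites each region file from scratch rather than appending to it.

```python
from pathlib import Path

import pandas as pd


def split_by_region(daily_dir, out_dir, column="denominazione_regione"):
    """Combine all daily CSVs in daily_dir and write one CSV per region.

    Hypothetical helper: existing files in out_dir are overwritten.
    """
    daily_dir, out_dir = Path(daily_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # dtype=str keeps codes such as "04" from being parsed as the number 4.
    frames = [pd.read_csv(p, dtype=str) for p in sorted(daily_dir.glob("*.csv"))]
    if not frames:
        return []
    combined = pd.concat(frames, ignore_index=True)
    written = []
    # One group per region; rows keep the order in which the files were read.
    for region, rows in combined.groupby(column):
        out_path = out_dir / f"{region}.csv"
        rows.to_csv(out_path, index=False)
        written.append(out_path)
    return written
```

With the layout from the question this might be called as split_by_region('.', 'NewTables'); the non-recursive glob means files already inside NewTables are not read back in.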

Read multiple csv files starting with a string into separate data frames in python

I have about 500 '.csv' files starting with the letter 'T', e.g. 'T50, T51, T52 ..... T550', and there are some other '.csv' files with other random names in the folder. I want to read all csv files starting with "T" and store them in separate dataframes: 't50, t51, t52... etc.'
The code I have written just reads these files into a dataframe
import glob
import pandas as pd
for file in glob.glob("T*.csv"):
    print (file)
I want to have a different name for each dataframe - preferably, their own file names. How can I achieve this within its 'for loop'?
Totally agree with @Comos
But if you still need individual variable names, I adapted the solution from here!
import pandas as pd
import os
folder = '/path/to/my/inputfolder'
filelist = [file for file in os.listdir(folder) if file.startswith('T')]
for file in filelist:
    exec("%s = pd.read_csv('%s')" % (file.split('.')[0], os.path.join(folder,file)))
In addition to ABotros's answer: to read all the files into separate dataframes, I would recommend storing them in a dictionary, which lets you save dataframes under different names in a loop:
filelist = [file for file in os.listdir(folder) if file.startswith('T')]
database = {}
for file in filelist:
    database[file] = pd.read_csv(os.path.join(folder, file)) # join with folder so the path resolves
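As a compact variant of the dictionary approach, the sketch below uses pathlib to select only the T*.csv files and keys each DataFrame by its lowercased file name without the extension (t50, t51, ...), which matches the names asked for in the question. The function name load_t_frames is hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_t_frames(folder):
    """Return {"t50": DataFrame, ...} for every T*.csv file in folder.

    Hypothetical helper: keys are the lowercased file names without ".csv".
    """
    return {
        path.stem.lower(): pd.read_csv(path)
        for path in sorted(Path(folder).glob("T*.csv"))
    }
```

An individual frame is then available as, for example, frames = load_t_frames('/path/to/my/inputfolder') followed by frames['t50'], without creating variables through exec.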
