Reading multiple files into separate data frames

I found this method on Stack Overflow:
import glob
import pandas as pd
d = {}
for filename in glob.glob('*.xlsx'):
    # '.xlsx' is five characters long, so strip five to drop the extension
    d[filename[:-5]] = pd.read_excel(filename, sheet_name='Bilan')
How do I change that so my dataframes have names more like:
-df1
-df2
-df3
...
-dfN
and so on? Naming them after the filename is nice, but tedious to code with.

You can probably do something like this:
import glob
import pandas as pd
d = {}
base_name = "df{}"
flag = 1  # start at 1 so the keys are df1, df2, ...
for filename in glob.glob('*.xlsx'):
    d[base_name.format(flag)] = pd.read_excel(filename, sheet_name='Bilan')
    flag += 1
Here you create a base_name for the key and a flag that tracks the position of each file, then use those two variables to build the dictionary key (df1, df2, ...) under which each dataframe is stored.
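The same idea can be written with enumerate instead of a hand-maintained counter. This is only a sketch: the function name load_numbered and its reader parameter are made up for illustration, and sorting the file list is an added assumption so that the numbering stays the same between runs.

```python
import glob

import pandas as pd


def load_numbered(pattern, reader=pd.read_excel, **reader_kwargs):
    """Read every file matching `pattern` into a dict keyed df1, df2, ...

    `load_numbered` and `reader` are hypothetical names for this sketch.
    """
    frames = {}
    # sorted() keeps the numbering reproducible; glob promises no particular order
    for number, filename in enumerate(sorted(glob.glob(pattern)), start=1):
        frames[f"df{number}"] = reader(filename, **reader_kwargs)
    return frames
```

For the question above this would be called as d = load_numbered('*.xlsx', sheet_name='Bilan'), after which the first file is available as d['df1'].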

Related

to_csv multiple dataframes from loop with filename

I'm trying to create multiple good/bad files from original .csv files from a directory.
I'm fairly new to Python and have cobbled together the code below, but it isn't saving multiple files, just one "good" and one "bad" file. In the directory I have testfile1 and testfile2, so the output should be testfile1good, testfile1bad, testfile2good and testfile2bad.
Any help would be greatly appreciated.
Thanks
import pandas as pd
from string import ascii_letters
import glob
from pathlib import Path
files = glob.glob('C:\\Users\\nickn\\OneDrive\\Documents\\Well\\*.csv')
for f in files:
    filename = []
    filename = Path(f)
#Can not be null fields
df = pd.read_csv(f)
emptyvals = []
emptyvals = df['First Name'].isnull() | df['Last Name'].isnull()
#Bank Account Number is not 8 digits long
accountnolen = []
ac = []
accountnolen = df['AccNumLen'] = df['Bank Account Number'].astype(str).map(len)
ac = df[(df['AccNumLen'] != 8)]
acd= ac.drop(['AccNumLen'],axis=1)
#Create Exclusions
allexclusions = []
allexclusions = df[emptyvals].append(acd)
allexclusions.to_csv(filename.stem+"bad.csv",header =True,index=False)
#GoodList
#for f in files:
# filename = []
# filename = Path(f)
origlist = df
df = pd.merge(origlist, allexclusions, how='outer', indicator=True)
cl = df[(df['_merge'] == 'left_only')]
cld = cl.drop(['_merge','AccNumLen'],axis=1)
cld['Well ID'] = cld['Well ID'].str.rstrip(ascii_letters)
cld.to_csv(filename.stem+'good.csv',header =True,index=False)

I think you start the loop but then leave it again, and do everything else from line 14 onwards. By that point filename has been set only once (to the last file), so you save your data only once.
What you want is for the rest of the code to run on every iteration of the loop, so the code should look like this:
import pandas as pd
from string import ascii_letters
import glob
from pathlib import Path
files = glob.glob('C:\\Users\\nickn\\OneDrive\\Documents\\Well\\*.csv')
for f in files:
    filename = Path(f)
    #EDIT: we stay in loop and process each file one by one with following lines:
    #Can not be null fields
    df = pd.read_csv(f)
    emptyvals = df['First Name'].isnull() | df['Last Name'].isnull()
    #Bank Account Number is not 8 digits long
    df['AccNumLen'] = df['Bank Account Number'].astype(str).map(len)
    ac = df[(df['AccNumLen'] != 8)]
    acd = ac.drop(['AccNumLen'], axis=1)
    #Create Exclusions
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    allexclusions = pd.concat([df[emptyvals], acd])
    allexclusions.to_csv(filename.stem + "bad.csv", header=True, index=False)
    #GoodList
    origlist = df
    df = pd.merge(origlist, allexclusions, how='outer', indicator=True)
    cl = df[(df['_merge'] == 'left_only')]
    cld = cl.drop(['_merge', 'AccNumLen'], axis=1)
    cld['Well ID'] = cld['Well ID'].str.rstrip(ascii_letters)
    cld.to_csv(filename.stem + 'good.csv', header=True, index=False)
In other words: you iterate over the file names found in the directory and THEN take the last "filename" and process it in one pass. By adding 4 spaces to the rest of the code, we tell the Python interpreter that this part belongs to the loop and should be executed for each file. Hope that makes sense.
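As a possible alternative, the per-file work can be moved into a function that splits the rows with a single boolean mask instead of a merge. This is a sketch under assumptions rather than a drop-in replacement: the name split_good_bad and the out_dir parameter are invented here, the account number is read as text (so leading zeros count towards the 8 characters), and the "Well ID" cleanup from the original is left out for brevity.

```python
from pathlib import Path

import pandas as pd


def split_good_bad(csv_path, out_dir):
    """Write <stem>good.csv and <stem>bad.csv for one input file.

    Hypothetical helper; column names are taken from the question.
    """
    csv_path = Path(csv_path)
    out_dir = Path(out_dir)
    # Assumption: read the account number as text so leading zeros are kept
    df = pd.read_csv(csv_path, dtype={"Bank Account Number": str})

    # A row is bad if a name is missing or the account number is not 8 characters
    missing_name = df["First Name"].isnull() | df["Last Name"].isnull()
    wrong_length = df["Bank Account Number"].fillna("").str.len() != 8
    bad = missing_name | wrong_length

    good_path = out_dir / f"{csv_path.stem}good.csv"
    bad_path = out_dir / f"{csv_path.stem}bad.csv"
    df[~bad].to_csv(good_path, index=False)
    df[bad].to_csv(bad_path, index=False)
    return good_path, bad_path
```

Calling split_good_bad(f, '.') inside the for f in files: loop would then produce one good and one bad file per input, because the function body runs once per iteration.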

Open multiple Excel files to separate Pandas dataframes

Brand new to Python and could use some help importing multiple Excel files to separate Pandas dataframes. I have successfully implemented the following code, but of course it imports everything into one frame. I would like to import them into df1, df2, df3, df4, df5, etc.
Anything helps, thank you!
import pandas as pd
import glob
def get_files():
    directory_path = input('Enter directory path: ')
    filenames = glob.glob(directory_path + '/*.xlsx')
    number_of_files = len(filenames)
    df = pd.DataFrame()
    for f in filenames:
        data = pd.read_excel(f, 'Sheet1')
        df = df.append(data)
    print(df)
    print(number_of_files)
get_files()

The easiest way to do that is to use a list, where each element of the list is a dataframe:
import glob
import pandas as pd
def get_files():
    directory_path = input('Enter directory path: ')
    filenames = glob.glob(directory_path + '/*.xlsx')
    number_of_files = len(filenames)
    df_list = []
    for f in filenames:
        data = pd.read_excel(f, 'Sheet1')
        df_list.append(data)
    print(df_list)
    print(number_of_files)
    return df_list
df_list = get_files()  # keep the returned list so it can be used afterwards
You can then access your dataframes with df_list[0], df_list[1], and so on.

Just as another option, here is Jezrael's answer from https://stackoverflow.com/a/52074347/13160821, modified for your code.
import glob
import pandas as pd
from os.path import basename
def get_files():
    directory_path = input('Enter directory path: ')
    filenames = glob.glob(directory_path + '/*.xlsx')
    number_of_files = len(filenames)
    # dictionary that maps each file name to its dataframe
    dfs = {basename(f): pd.read_excel(f, 'Sheet1') for f in filenames}
    print(number_of_files)
    return dfs
dfs = get_files()
The dataframes can then be accessed by filename, e.g. dfs['file_name1.xlsx'] or dfs['some_file.xlsx']. You can also use something like splitext to remove the .xlsx from the key, or use just part of the filename.
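The suggestion to drop the extension from the key could look roughly like the sketch below, which uses pathlib's stem rather than splitext. The function name frames_by_stem and its parameters are assumptions made for this illustration, not part of the linked answer.

```python
from pathlib import Path

import pandas as pd


def frames_by_stem(directory, pattern="*.xlsx", reader=pd.read_excel, **reader_kwargs):
    """Map each file name without its extension to the dataframe read from it.

    Hypothetical helper; `reader` can be swapped, e.g. for pd.read_csv.
    """
    return {
        path.stem: reader(path, **reader_kwargs)  # 'some_file.xlsx' -> 'some_file'
        for path in sorted(Path(directory).glob(pattern))
    }
```

With this, dfs = frames_by_stem(directory_path, sheet_name='Sheet1') would allow access as dfs['some_file'].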

Loop through multiple CSV files and run a script

I have a script which pulls in data from a csv file, does some manipulations to it and creates an output Excel file. But it's a tedious process, as I need to do it for multiple files.
Question: Is there a way for me to run this script across multiple csv files together and create a separate excel file output for each input file?
I'm not sure what to try out here. I've read that I need to use a module called glob but I'm not sure how to go about it.
This script works for a single file:
# Import libraries
import pandas as pd
import xlsxwriter
# Set system paths
INPUT_PATH = 'SystemPath//Downloads//'
INPUT_FILE = 'rawData.csv'
OUTPUT_PATH = 'SystemPath//Downloads//Output//'
OUTPUT_FILE = 'rawDataOutput.xlsx'
# Get data
df = pd.read_csv(INPUT_PATH + INPUT_FILE)
# Clean data
cleanedData = df[['State','Campaigns','Type','Start date','Impressions','Clicks','Spend(INR)',
'Orders','Sales(INR)','NTB orders','NTB sales']]
cleanedData = cleanedData[cleanedData['Impressions'] != 0].sort_values('Impressions',
ascending= False).reset_index()
cleanedData.loc['Total'] = cleanedData.select_dtypes('number').sum()  # 'number' replaces pd.np.number, which recent pandas no longer provides
cleanedData['CTR(%)'] = (cleanedData['Clicks'] /
cleanedData['Impressions']).astype(float).map("{:.2%}".format)
cleanedData['CPC(INR)'] = (cleanedData['Spend(INR)'] / cleanedData['Clicks'])
cleanedData['ACOS(%)'] = (cleanedData['Spend(INR)'] /
cleanedData['Sales(INR)']).astype(float).map("{:.2%}".format)
cleanedData['% of orders NTB'] = (cleanedData['NTB orders'] /
cleanedData['Orders']).astype(float).map("{:.2%}".format)
cleanedData['% of sales NTB'] = (cleanedData['NTB sales'] /
cleanedData['Sales(INR)']).astype(float).map("{:.2%}".format)
cleanedData = cleanedData[['State','Campaigns','Type','Start date','Impressions','Clicks','CTR(%)',
'Spend(INR)','CPC(INR)','Orders','Sales(INR)','ACOS(%)',
'NTB orders','% of orders NTB','NTB sales','% of sales NTB']]
# Create summary
summaryData = cleanedData.groupby(['Type'])[['Spend(INR)','Sales(INR)']].agg('sum')
summaryData.loc['Overall Snapshot'] = summaryData.select_dtypes('number').sum()  # 'number' replaces pd.np.number
summaryData['ROI'] = summaryData['Sales(INR)'] / summaryData['Spend(INR)']
# Push to excel
writer = pd.ExcelWriter(OUTPUT_PATH + OUTPUT_FILE, engine='xlsxwriter')
summaryData.to_excel(writer, sheet_name='Summary')
cleanedData.to_excel(writer, sheet_name='Overall Report')
writer.close()  # ExcelWriter.save() was removed in pandas 2.0; close() writes the file
I've never tried anything like this before and I would appreciate your help trying to figure this out

You can use Python's glob.glob() to get all of the CSV files from a given folder. For each filename that is returned, you could derive a suitable output filename. The file processing could be moved into a function as follows:
# Import libraries
import pandas as pd
import xlsxwriter
import glob
import os
def process_csv(input_filename, output_filename):
    # Get data
    df = pd.read_csv(input_filename)
    # Clean data
    cleanedData = df[['State','Campaigns','Type','Start date','Impressions','Clicks','Spend(INR)',
                      'Orders','Sales(INR)','NTB orders','NTB sales']]
    cleanedData = cleanedData[cleanedData['Impressions'] != 0].sort_values(
        'Impressions', ascending=False).reset_index()
    # select_dtypes('number') replaces pd.np.number, which recent pandas no longer provides
    cleanedData.loc['Total'] = cleanedData.select_dtypes('number').sum()
    cleanedData['CTR(%)'] = (cleanedData['Clicks'] /
                             cleanedData['Impressions']).astype(float).map("{:.2%}".format)
    cleanedData['CPC(INR)'] = (cleanedData['Spend(INR)'] / cleanedData['Clicks'])
    cleanedData['ACOS(%)'] = (cleanedData['Spend(INR)'] /
                              cleanedData['Sales(INR)']).astype(float).map("{:.2%}".format)
    cleanedData['% of orders NTB'] = (cleanedData['NTB orders'] /
                                      cleanedData['Orders']).astype(float).map("{:.2%}".format)
    cleanedData['% of sales NTB'] = (cleanedData['NTB sales'] /
                                     cleanedData['Sales(INR)']).astype(float).map("{:.2%}".format)
    cleanedData = cleanedData[['State','Campaigns','Type','Start date','Impressions','Clicks','CTR(%)',
                               'Spend(INR)','CPC(INR)','Orders','Sales(INR)','ACOS(%)',
                               'NTB orders','% of orders NTB','NTB sales','% of sales NTB']]
    # Create summary
    summaryData = cleanedData.groupby(['Type'])[['Spend(INR)','Sales(INR)']].agg('sum')
    summaryData.loc['Overall Snapshot'] = summaryData.select_dtypes('number').sum()
    summaryData['ROI'] = summaryData['Sales(INR)'] / summaryData['Spend(INR)']
    # Push to excel; the with-block saves and closes the file (writer.save() was removed in pandas 2.0)
    with pd.ExcelWriter(output_filename, engine='xlsxwriter') as writer:
        summaryData.to_excel(writer, sheet_name='Summary')
        cleanedData.to_excel(writer, sheet_name='Overall Report')
# Set system paths
INPUT_PATH = 'SystemPath//Downloads//'
OUTPUT_PATH = 'SystemPath//Downloads//Output//'
for csv_filename in glob.glob(os.path.join(INPUT_PATH, "*.csv")):
    name, ext = os.path.splitext(os.path.basename(csv_filename))
    # Create an output filename based on the input filename
    output_filename = os.path.join(OUTPUT_PATH, f"{name}Output.xlsx")
    process_csv(csv_filename, output_filename)
os.path.join() can be used as a safer way to join file paths together.
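To make the filename handling easier to check on its own, the step that derives the output name can be pulled into a small helper. This is a sketch; derive_output_path and its suffix parameter are names invented here, chosen to mirror the "rawData.csv" to "rawDataOutput.xlsx" pattern from the question.

```python
import os


def derive_output_path(csv_filename, output_dir, suffix="Output.xlsx"):
    """Build the output path for one input CSV (hypothetical helper)."""
    # basename drops the directory part, splitext drops the '.csv' extension
    name, _ext = os.path.splitext(os.path.basename(csv_filename))
    # os.path.join inserts the right separator, whether or not output_dir ends with one
    return os.path.join(output_dir, f"{name}{suffix}")
```

Because os.path.join supplies the separator itself, the same call should work whether or not OUTPUT_PATH was written with a trailing slash.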

Something like:
import os
import glob
import pandas as pd
os.chdir(r'path\to\folder')  # makes the folder the working directory
filelist = glob.glob('*.csv')  # creates a list of all csv files
for file in filelist:  # loops through the files
    df = pd.read_csv(file)  # pass any further read_csv options here
    # Do something and create a final_df
    final_df = df  # placeholder for your processing
    final_df.to_excel(file[:-4] + '_output.xlsx', index=False)  # Excel file with the same name + _output

You can run this script inside a for loop:
import os
for file in os.listdir(INPUT_PATH):
    if file.endswith('.csv') or file.endswith('.CSV'):
        INPUT_FILE = INPUT_PATH + '/' + file
        OUTPUT_FILE = INPUT_PATH + '/Outputs/' + file[:-4] + '.xlsx'
        # the rest of the script goes here, using INPUT_FILE and OUTPUT_FILE as full paths

Try this:
import glob
import os
import pandas as pd
files = glob.glob(INPUT_PATH + "*.csv")
for file in files:
    # Get data
    df = pd.read_csv(file)
    # Clean data
    # your cleaning code
    # Push to excel
    output_name = os.path.basename(file).replace(".csv", "_OUTPUT.xlsx")
    writer = pd.ExcelWriter(OUTPUT_PATH + output_name, engine='xlsxwriter')

Taking Same Worksheet from a Folder of xlsm Files with Python

I'm new to pandas/Python and I've come up with the following code to extract data from a specific part of a worksheet.
import openpyxl as xl
import pandas as pd
rows_with_data = [34,37,38,39,44,45,46,47,48,49, 50,54,55,57,58,59,60,62,63,64,65,66,70,71,72,76,77, 78,79,80,81,82,83,84,88,89,90,91,92]
path = r'XXX'
xpath = input('XXX')
file = r'**.xlsm'
xfile = input('Change file name, current is ' + file + ' :')
sheetname = r'Summary'
wb = xl.load_workbook(filename = xpath + '\\' +file, data_only = True)
sheet = wb[sheetname]  # get_sheet_by_name() is deprecated in openpyxl
rows = len(rows_with_data)
line_items = []
for i in range(rows) :
    line_items.append(sheet.cell(row = rows_with_data[i], column = 13).value)
period = []
for col in range(17,35):
    period.append(sheet.cell(row = 20, column = col).value)
print(line_items)
vals = []
x = []
for i in range(rows):
    if i != 0:
        vals.append(x)
        x = []
    for col in range(17,35):
        x.append(sheet.cell(row = rows_with_data[i], column = col).value)
vals.append(x)
all_values = {}
all_values['Period'] = period
for i in range(rows):
    print(line_items[i])
    all_values[line_items[i]] = vals[i]
print(all_values)
period_review = input('Enter a period (i.e. 2002): ')
item = input('Enter a period (i.e. XXX): ')
time = period.index(period_review)
display_item = str(all_values[item][time])
print(item + ' for ' + period_review + " is " + display_item)
Summary_Dataframe = pd.DataFrame(all_values)
writer = pd.ExcelWriter(xpath + '\\' + 'values.xlsx')
Summary_Dataframe.to_excel(writer, sheet_name='Sheet1')
writer.close()  # close() also saves; ExcelWriter.save() was removed in pandas 2.0
I have the same worksheet (summary results) across a library of 60 xlsm files, and I'm having a hard time figuring out how to iterate this across the entire folder of files. I also want to change this from extracting specific rows to taking the entire "Summary" worksheet, pasting it into the new file, and naming the worksheet after its source filename (e.g. "Experiment_A") in the new Excel file. Any advice?

I had a hard time reading your code and working out what you ultimately want to do, so this is advice rather than a solution. You can iterate through all the files in the folder using os, read them into one dataframe, and then save that single big dataframe to csv. I usually avoid Excel, but I guess you need the Excel conversion. In the example below I read all the txt files from a directory, put them into a list of dataframes, and then store the combined dataframe as json. You can also store it as Excel/csv.
import os
import pandas as pd
def process_data():
    # input file path in 2 parts in case it is very long
    input_path_1 = r'\\path\to\the\folder'
    input_path_2 = r'\second\part\of\the\path'
    # joining the two parts into the full folder path
    file_path = input_path_1 + input_path_2
    # listing all files in the folder
    file_list = os.listdir(file_path)
    # selecting only the .txt files
    file_list = [file_name for file_name in file_list if file_name.endswith('.txt')]
    # selecting the fields we need (sent_date is included because it is transformed below)
    field_names = ['country', 'ticket_id', 'sent_date']
    # defining a list to collect all the dataframes
    pd_list = []
    inserted_files = []
    # looping over the txt files and collecting them
    for file_name in file_list:
        # creating the file path to read the file
        file_path_ = os.path.join(file_path, file_name)
        df_ = pd.read_csv(file_path_, sep='\t', usecols=field_names)
        # example of a small transformation before writing:
        # truncating the datetime to the first day of its month
        df_['sent_date'] = pd.to_datetime(df_['sent_date'])
        df_['sent_date'] = df_['sent_date'].values.astype('datetime64[M]')
        # adding each dataframe to the list
        pd_list.append(df_)
        # adding file name to the inserted list to print later
        inserted_files.append(file_name)
    print(inserted_files)
    # SQL-like UNION ALL: combine all dataframes into a single data source
    df_ = pd.concat(pd_list)
    output_path_1 = r'\\path\to\output'
    output_path_2 = r'\path\to\output'
    output_path = output_path_1 + output_path_2
    # put the file name
    file_name = 'xyz.json'
    # adding the day the file was processed
    df_['etl_run_time'] = pd.to_datetime('today').strftime('%Y-%m-%d')
    # write file to json
    df_.to_json(os.path.join(output_path, file_name), orient='records')
    print('Data stored as json successfully')
process_data()
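Closer to what the question asks for, the "Summary" sheet of every .xlsm file could be copied into one new workbook, with each sheet named after its source file. The sketch below is untested against real workbooks and rests on several assumptions: the helper names are invented, pandas needs openpyxl installed to read .xlsm, only cell values are copied (no formatting or formulas), and two files whose names match in the first 31 characters would collide.

```python
from pathlib import Path

import pandas as pd

# Characters Excel does not allow in worksheet names
_FORBIDDEN = set('[]:*?/\\')


def sheet_name_from_path(path):
    """Turn 'folder/Experiment_A.xlsm' into a valid sheet name (hypothetical helper)."""
    stem = Path(path).stem
    cleaned = "".join("_" if ch in _FORBIDDEN else ch for ch in stem)
    return cleaned[:31]  # Excel caps sheet names at 31 characters


def collect_sheets(folder, output_file, sheet_name="Summary"):
    """Copy one sheet from every .xlsm file in `folder` into `output_file`."""
    paths = sorted(Path(folder).glob("*.xlsm"))
    if not paths:
        raise FileNotFoundError(f"no .xlsm files found in {folder}")
    with pd.ExcelWriter(output_file) as writer:
        for path in paths:
            # header=None keeps the sheet as-is instead of promoting row 1 to column names
            sheet = pd.read_excel(path, sheet_name=sheet_name, header=None)
            sheet.to_excel(writer, sheet_name=sheet_name_from_path(path),
                           index=False, header=False)
```

A call such as collect_sheets(xpath, 'values.xlsx') would then replace the row-by-row extraction, at the cost of losing the cell formatting of the original sheets.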

Python , get duplicates in 1st column of all csv files in a directory

import pandas as pd
import glob
dataset = pd.read_csv('masterfeedproduction-EURNA_2016-06-27.csv', sep=',')  # select 1 file in the directory
datasets_cols = ['transactionID','gvkey','companyName']
df= dataset.transactionID
df.shape
df.loc[df.duplicated()]
This returns the duplicates in the selected file, displaying the row number and transactionID, so this part is correct.
target_directory = r'C:\Users\nikol\Downloads\fullDailyDeltas\fullDailyDeltas'
file_list = glob.glob(target_directory + "/*.csv")
df_result = df.loc[df.duplicated()]
for file in file_list:
    return(df_result)
Here I am stuck.
target_directory = r'C:\Users\nikol\Downloads\fullDailyDeltas\fullDailyDeltas'
file_list = glob.glob(target_directory + "/*.csv")
for file in file_list:
    dataset = pd.read_csv(file)
    df = dataset.transactionID
    duplicated = df.loc[df.duplicated()]
    if not duplicated.empty:
        print(file)
        print(duplicated)

Have a look at the glob module.
import pandas as pd
import glob
def your_function(file):
    # put your df processing logic here
    return df_result
Step 1 - Create a list of the files in the directory
target_directory = r'Path/to/your/dir'
# Include the slash or it will search in the wrong directory!!
file_list = glob.glob(target_directory + "/*.csv")
Step 2 - Loop through the files in the list
for file in file_list: # Loop files
    df_result = your_function(file) # Put your logic into a separate function
    new_filename = file.replace('.csv', '_processed.csv')
    df_result.to_csv(new_filename, index = False)
Comment
If you had included code showing your own attempts, your question would have been answered within seconds.
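Putting the two steps together for the duplicate check, a self-contained version might look like the sketch below. The function name find_duplicate_ids and the choice to return a dictionary are assumptions made for illustration; the column name transactionID comes from the question.

```python
import glob
import os

import pandas as pd


def find_duplicate_ids(directory, column="transactionID"):
    """Return {file name: Series of repeated values} for every CSV with duplicates.

    Hypothetical helper; only files that contain duplicates appear in the result.
    """
    results = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
        ids = pd.read_csv(path, usecols=[column])[column]
        # duplicated() flags the second and later occurrences of each value
        dupes = ids[ids.duplicated()]
        if not dupes.empty:
            results[os.path.basename(path)] = dupes
    return results
```

The index of each returned Series is the row number within its file, which matches what the single-file check in the question displays.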
