I am trying to read data from text files that are dropped into a network share drive over a VPN. The overall intent is to loop through the files with yesterday's date (either in the file name or by the Modified Date), extract the pipe-delimited ("|") data, and concat it into a pandas df. The issue I am having is actually being able to read files from the network drive. So far I have only been able to figure out how to use os.listdir to identify the file names, but not actually read them. Anyone have any ideas?
This is what I've tried so far that has started to pan out, with os.listdir correctly seeing the network folder and the files inside - but how would I open the actual files (filtered by date or not) to get the loop to work?
import os
import pandas as pd

folder = r'\\fileshare.com\PATH\TO\FTP\FILES'
# os.listdir returns bare file names, so join each one back onto the folder path
files = [os.path.join(folder, name) for name in os.listdir(folder)]

main_dataframe = pd.read_csv(files[0], sep='|')
for i in range(1, len(files)):
    df = pd.read_csv(files[i], sep='|')
    # axis=0 stacks each file's rows underneath the previous ones
    main_dataframe = pd.concat([main_dataframe, df], axis=0, ignore_index=True)
print(main_dataframe)
I'm pretty new at Python and doing things like this, so I apologize if I refer to anything wrong. Any advice would be greatly appreciated!
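For the date filter, here is a minimal sketch of one way to do it, assuming the files should be picked by their Modified Date (the share path is the same placeholder as above):

import os
from datetime import date, timedelta
import pandas as pd

folder = r'\\fileshare.com\PATH\TO\FTP\FILES'  # placeholder path from the question
yesterday = date.today() - timedelta(days=1)

frames = []
for name in os.listdir(folder):
    full_path = os.path.join(folder, name)
    # os.path.getmtime gives a timestamp; compare only its date part
    if date.fromtimestamp(os.path.getmtime(full_path)) == yesterday:
        frames.append(pd.read_csv(full_path, sep='|'))

if frames:
    main_dataframe = pd.concat(frames, ignore_index=True)

If the date is encoded in the file name instead, the same loop works with a string check on name in place of the getmtime comparison.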
I have over 200,000 JSON files stored across 600 directories on my drive, and a copy in Google Cloud Storage.
I need to merge them and transform them into a CSV file,
so I thought of looping the following logic:
open each file with pandas read_json
apply some transformations (drop some columns, add columns with values based on the filename, change the order of columns)
append this df to a master_df
and finally export the master_df to a CSV.
Each file is around 5,000 rows when converted to a df, which would result in a df of around 300,000,000 rows.
The files are grouped into 3 top-level directories (and then, depending on the file, into a couple more, hence 3 dirs at the top level but about 600 overall), so to speed it up a bit I decided to run the script on one of those 3 directories at a time.
I am running this script on my local machine (32 GB RAM, sufficient disk space on an SSD drive), yet it's painfully slow, and the speed keeps decreasing.
At the beginning one df was appended within 1 second; after executing the loop 4,000 times, the time to append grew to 2.7 s.
I am looking for a way to speed it up to something that could reasonably be done within (hopefully) a couple of hours.
So, bottom line: should I try to optimize my code and run this locally, or not even bother, keep the script as is, and run it in e.g. Google Cloud Run?
The JSON files contain keys A, B, C, D, E. I drop A and B as not important, and rename C.
The code I have so far:
import pandas as pd
import os
def listfiles(track, footprint):
    """
    This function lists files in a specified filepath (track) by a footprint
    found in the filename with extension.
    Function returns a list of files.
    """
    locations = []
    for r, d, f in os.walk(track):
        for file in f:
            if footprint in file:
                locations.append(os.path.join(r, file))
    return locations
def create_master_df(track, footprint):
    master_df = pd.DataFrame(columns=['date', 'domain', 'property', 'lang', 'new_name', 'D', 'E'])
    all_json_files = listfiles(track, footprint)
    prop = []  # this indicates the starting directory and which output file master_df should be saved to
    for i, file in enumerate(all_json_files):
        # here starts the logic by which I identify some properties of the file,
        # to use them as values in columns in local_df
        elements = file.split('\\')
        elements[-1] = elements[-1].replace('.json', '')
        elements[-1] = elements[-1].replace('-Copy-Multiplicated', '')
        date = elements[-1][-10:]
        domain = elements[7]
        property = elements[6]
        language = elements[8]

        json_to_rows = pd.read_json(file)
        local_df = pd.DataFrame(json_to_rows.rows.values.tolist())
        local_df = local_df.drop(columns=['A', 'B'])
        local_df = local_df.rename(columns={'C': 'new_name'})

        # here I add the earlier extracted data, to be able to distinguish rows in the master df
        local_df['date'] = date
        local_df['domain'] = domain
        local_df['property'] = property
        local_df['lang'] = language

        local_df = local_df[['date', 'domain', 'property', 'lang', 'new_name', 'D', 'E']]
        # DataFrame.append copies all accumulated rows on every call, which is why each iteration gets slower
        master_df = master_df.append(local_df, ignore_index=True)
        prop = property
        print(f'appended {i} times')
    out_file_path = f'{track}\\{prop}-outfile.csv'
    master_df.to_csv(out_file_path, sep=";", index=False)
# run the script on the first directory
track = 'C:\\xyz\\a'
footprint = '.json'
create_master_df(track, footprint)

# run the script on the second directory
track = 'C:\\xyz\\b'
create_master_df(track, footprint)
Any feedback is welcome!
#Edit:
I tried and timed a couple of ways of approaching it; below is an explanation of what I did:
Clear execution - at first I ran my code as it was written.
1st try - I moved all changes to local_df to just before the final master_df was saved to CSV. This was an attempt to see if working on the files 'as they are', without manipulating them, would be faster. I removed from the for loop the dropping of columns, the renaming of columns, and the reordering; all of this was applied to master_df before exporting to CSV.
2nd try - I brought the dropping of unnecessary columns back into the for loop, as performance was clearly impacted by the additional columns.
3rd try - so far the most efficient - instead of appending local_df to master_df, I transformed it into a local_list and appended that to a master_list. master_list was then combined with pd.concat into master_df and exported to CSV (a sketch of this pattern is shown after the timing notes below).
Here is a comparison of speed - how fast the script iterated the for loop in each version, when tested on ~800 JSON files:
[image: comparison of timed script executions]
It's surprising to see that I actually wrote solid code in the first place - as I don't have much experience coding, that was super nice to see :)
Overall, when the job was run on the test set of files, it finished as follows (in seconds):
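For reference, a minimal sketch of the 3rd-try pattern: collect the per-file data in a plain Python list and build the DataFrame once at the end with pd.concat, instead of growing master_df inside the loop. Names follow the anonymized ones used above, and the per-file column additions are elided:

def create_master_df_fast(track, footprint):
    all_json_files = listfiles(track, footprint)
    frames = []  # per-file DataFrames are collected here instead of being appended to master_df
    for file in all_json_files:
        json_to_rows = pd.read_json(file)
        local_df = pd.DataFrame(json_to_rows.rows.values.tolist())
        local_df = local_df.drop(columns=['A', 'B']).rename(columns={'C': 'new_name'})
        # ... add the date/domain/property/lang columns exactly as in the loop above ...
        frames.append(local_df)
    # a single concat at the end avoids re-copying the accumulated rows on every iteration
    return pd.concat(frames, ignore_index=True)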
First-time poster here! I have perused these forums for a while and I am amazed by how supportive this community is.
My problem involves several Excel files with the same name, column headers, and data types that I am trying to read in with pandas. After reading them in, I want to compare the 'Agreement Date' column across all the dataframes and create a yes/no column indicating whether they match. I then want to export the resulting dataframe.
I am still learning Python and Pandas, so I am struggling with this task. This is my code so far:
import pandas as pd
import glob
xlpath = "/Users/myname/Documents/Python/"
# read .xlsx file into a list
allfiles = glob.glob(xlpath + "*.xls")
# for loop to read in all files
for excelfiles in allfiles:
    raw_excel = pd.read_excel(allfiles)
    # place all the pulled dataframe into a list
    list = [raw_excel]
From here, though, I am quite lost. I do not know how to join all of my files together on my id column and then compare the 'Agreement Date' column. Any help would be greatly appreciated!
THANKS!!
In your loop you need to hand the looped value, not the whole list, to read_excel.
You have to append the list values within the loop, otherwise only the last item will be in the list.
Do not overwrite Python builtins such as list, or you can encounter some difficult-to-debug behavior.
Here's what I would change:
import pandas as pd
import glob
xlpath = "/Users/myname/Documents/Python/"
# get file name list of .xlsx files in the directory
allfiles = glob.glob(xlpath + "*.xls")
# for loop to read in all files & place all the pulled dataframe into a list
dataframes_list = []
for file in allfiles:
    dataframes_list.append(pd.read_excel(file))
You can then append the DataFrames like this:
merged_df = dataframes_list[0]
for df in dataframes_list[1:]:
    # append returns a new DataFrame, so the result must be assigned back
    merged_df = merged_df.append(df, ignore_index=True)
Use ignore_index=True if the indexes are overlapping and causing problems. If they are already distinct and you want to keep them, set it to False. (On newer pandas versions, where DataFrame.append has been removed, pd.concat(dataframes_list, ignore_index=True) does the same in a single call.)
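To get at the comparison part of the question, here is a minimal sketch of one approach, assuming each file has an id column named 'ID' (adjust to your real column name) and that the dataframes are already in dataframes_list:

import pandas as pd

# merge the files pairwise on the id column, suffixing the Agreement Date column from each extra file
merged = dataframes_list[0]
for n, df in enumerate(dataframes_list[1:], start=1):
    merged = merged.merge(
        df[['ID', 'Agreement Date']],  # 'ID' is an assumed column name
        on='ID',
        how='outer',
        suffixes=('', f'_file{n}'),
    )

# yes/no column: do all Agreement Date columns in a row hold the same value?
date_cols = [c for c in merged.columns if c.startswith('Agreement Date')]
merged['dates_match'] = merged[date_cols].nunique(axis=1).eq(1).map({True: 'yes', False: 'no'})

merged.to_excel('agreement_date_check.xlsx', index=False)  # hypothetical output file name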
I have 10 files which I need to work on.
I need to import those files using pd.read_csv to turn them all into dataframes, using usecols since I only need the same two specific columns from each file.
I then need to search the two columns for a specific entry in the rows, like 'abcd', and have Python return a new df which includes all the rows it appeared in, for each file.
Is there a way I could do this using a for loop? So far I've only got a list of all the paths to the 10 files.
What I do for one file, without the for loop, is:
df = pd.read_csv(r'filepath', header=2, usecols=['Column1', 'Column2'])
search_df = df.loc[df['Column1'] == 'abcd']
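A minimal sketch of the looped version, assuming file_paths is your existing list of the 10 paths and keeping the same placeholder column names, header row, and 'abcd' search value as above (this version checks both columns, per the question):

import pandas as pd

file_paths = ['file1.csv', 'file2.csv']  # replace with your existing list of the 10 paths (hypothetical names)

matches = []
for path in file_paths:
    df = pd.read_csv(path, header=2, usecols=['Column1', 'Column2'])
    # keep the rows where either column contains the search term
    matches.append(df[(df['Column1'] == 'abcd') | (df['Column2'] == 'abcd')])

# one DataFrame holding all matching rows from all 10 files
search_df = pd.concat(matches, ignore_index=True)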
I have data exports that are dropped into a particular directory every hour, and I'm hoping to have a script that reads all the files and appends them into one master dataframe in Python. The only issue is, since they populate every hour, I don't want to append already-added CSV files to the master dataframe again.
I'm very new to Python, and so far have only been able to load all the files in the directory and append them all, using the below code:
import pandas as pd
import os
import glob
path = os.environ['HOME'] + "/file_location/"
allFiles = glob.glob(os.path.join(path,"name_of_files*.csv"))
df = pd.concat((pd.read_csv(f) for f in allFiles), sort=False)
With the above code, it looks into file_location and imports any files whose name starts with "name_of_files", using a wildcard since the tail of each file name will be different.
I could continue to do this, but I'm literally going to have hundreds of files and don't want to import and append/concat all of them every single hour. To avoid that, I'd like to keep the master dataframe mentioned above and have only the new CSV files that appear each hour appended to it automatically.
Again, I'm super new to Python, so I'm not even sure what to do next. Any advice would be greatly appreciated!
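A minimal sketch of one way to do the incremental part, assuming the master dataframe is persisted to a CSV and the names of already-processed files are kept in a small text file next to it (master.csv and processed.txt are hypothetical names):

import glob
import os
import pandas as pd

path = os.environ['HOME'] + "/file_location/"
master_path = os.path.join(path, "master.csv")        # hypothetical name for the persisted master df
processed_log = os.path.join(path, "processed.txt")   # hypothetical name for the list of already-read files

# load the names of files that have already been appended
if os.path.exists(processed_log):
    with open(processed_log) as f:
        processed = set(line.strip() for line in f)
else:
    processed = set()

allFiles = glob.glob(os.path.join(path, "name_of_files*.csv"))
new_files = [f for f in allFiles if f not in processed]

if new_files:
    new_df = pd.concat((pd.read_csv(f) for f in new_files), sort=False)
    # append only the new rows to the master CSV; write the header only when the file is first created
    new_df.to_csv(master_path, mode='a', header=not os.path.exists(master_path), index=False)
    with open(processed_log, 'a') as f:
        f.write("\n".join(new_files) + "\n")

Run this on the hourly schedule; each run picks up only the files it has not logged yet.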