Python - Read txt files from Network shared drive

I am trying to read in data from text files that are moved into a network share drive over a VPN. The overall intent is to loop through the files with yesterday's date (either in the file name or by the Modified Date), extract the pipe-delimited data separated by "|", and concat it into a pandas df. The issue I am having is actually being able to read files from the network drive. So far I have only been able to figure out how to use os.listdir to identify the file names, but not actually read them. Anyone have any ideas?
This is what I've tried so far that has actually started to pan out: os.listdir can correctly see the network folder and the files inside it. But how would I call the actual files inside (filtered by date or not) to get the loop to work?
import pandas as pd
#folder = os.listdir(r'\\fileshare.com\PATH\TO\FTP\FILES')
folder = (r'\\fileshare.com\PATH\TO\FTP\FILES')
main_dataframe = pd.DataFrame(pd.read_csv(folder[0]))
for i in range(1, len(folder)):
    data = pd.read_csv(folder[i])
    df = pd.DataFrame(data)
    main_dataframe = pd.concat([main_dataframe, df], axis=1)
print(main_dataframe)
I'm pretty new at Python and doing things like this, so I apologize if I refer to anything wrong. Any advice would be greatly appreciated!
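One way this could be stitched together, as a rough sketch rather than a definitive answer: os.listdir only returns file names, so each name has to be joined back onto the folder path before pandas can open it, and the modification date can then be compared against yesterday. This assumes the files are pipe-delimited text and that stacking rows (rather than columns) is the intent.
import os
import datetime as dt
import pandas as pd

folder = r'\\fileshare.com\PATH\TO\FTP\FILES'
yesterday = dt.date.today() - dt.timedelta(days=1)

frames = []
for name in os.listdir(folder):
    full_path = os.path.join(folder, name)  # os.listdir gives names only, so rebuild the full path
    modified = dt.date.fromtimestamp(os.path.getmtime(full_path))
    if modified == yesterday:  # keep only files modified yesterday
        frames.append(pd.read_csv(full_path, sep='|'))

main_dataframe = pd.concat(frames, ignore_index=True)
print(main_dataframe)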

Related

Why do I get duplicates when trying to append csv files

I wrote a function parse_xml that converts xml to csv. Now I want to convert every file within a folder (100,000 files) and append them all into 1 file. My code for that task is
import glob
import marshal
import os
import pandas as pd

result = pd.DataFrame()
os.chdir('/Users/dp/Dropbox/Ratings/SP')
for file in list(glob.glob('*.xml')):
    data = marshal.dumps(file)
    obj = marshal.loads(data)
    parse_xml(obj)
    df = pd.DataFrame(rows, columns=cols)
    result = pd.concat([result, pd.DataFrame.from_records(df)])
result.to_csv('output.csv')
However the result isn't what I'm looking for. It keeps re-appending the same files over and over again, and 90% of the output's observations are duplicated.
Could somebody please give me a hint on how to resolve this issue? Thank you so much
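If parse_xml appends to a module-level rows list that is never cleared, every iteration re-appends everything parsed so far, which would produce exactly this kind of duplication. A minimal sketch of a pattern that avoids it, assuming parse_xml can be changed to return the rows for a single file (the marshal round-trip only serialises the filename string back and forth, so it can be dropped):
import glob
import os
import pandas as pd

os.chdir('/Users/dp/Dropbox/Ratings/SP')

frames = []
for file in glob.glob('*.xml'):
    rows = parse_xml(file)  # hypothetical change: parse_xml returns only this file's rows
    frames.append(pd.DataFrame(rows, columns=cols))  # cols as in the question

result = pd.concat(frames, ignore_index=True)
result.to_csv('output.csv', index=False)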

Appending lots of csv files in folders within a folder

I am working on a shared network drive where I have 1 folder (main folder) containing many subfolders, one for each date (over 1,700), and within each of them a csv file (results.csv) with a common name at the end (same file format). Each csv contains well over 30k rows.
I wish to read in all the csvs, appending them into one dataframe to perform some minor calculations. I have used the code below. It ran for 3+ days so I quit, though looking at the dataframe it actually got 80% of the way through. It seems inefficient because it takes ages, and when I want to add the latest day's file it will have to re-run again. I also only need a handful of the columns within each csv, so I want to use the usecols=['A', 'B', 'C'] option but am not sure how to incorporate it (see the sketch after this post). Could someone shed some light on a better solution?
import glob
import os
import pandas as pd
file_source = glob.glob(r"//location//main folder//**//*results.csv", recursive=True)
appended_file = []
for i in file_source:
    df = pd.read_csv(i)
    appended_file.append(df)
combined = pd.concat(appended_file, axis=0, ignore_index=True, sort=False)
Thanks.
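usecols is just a keyword argument to read_csv, so it slots straight into the loop. A sketch of one way to combine it with a simple per-folder cache so that later runs only parse the folders that are new; it assumes pyarrow is installed for the parquet files, and the cache directory and naming are hypothetical.
import glob
import os
import pandas as pd

file_source = glob.glob(r"//location//main folder//**//*results.csv", recursive=True)
cache_dir = "daily_cache"  # hypothetical folder holding one cached file per date folder
os.makedirs(cache_dir, exist_ok=True)

frames = []
for path in file_source:
    day = os.path.basename(os.path.dirname(path))  # the date folder name
    cache_file = os.path.join(cache_dir, f"{day}.parquet")
    if os.path.exists(cache_file):
        df = pd.read_parquet(cache_file)  # already processed on a previous run
    else:
        df = pd.read_csv(path, usecols=['A', 'B', 'C'])  # read only the columns you need
        df.to_parquet(cache_file)
    frames.append(df)

combined = pd.concat(frames, axis=0, ignore_index=True, sort=False)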

Handling zip files using the Python library pandas

We have a big [file_name.tar.gz] file here, big in the sense that our machine cannot handle it in one go. It has three types of files inside it, let us say [first_file.unl, second_file.unl, third_file.unl].
Background about the unl extension: pd.read_csv is able to read these files successfully without giving any kind of errors.
I am trying the below steps in order to accomplish the task.
step 1:
all_files = glob.glob(path + "/*.gz")
The above step is able to list all three types of file. Now I am using the below code to process them further.
step 2:
li = []
for filename in all_files:
    df_a = pd.read_csv(filename, index_col=False, header=0, names=header_name,
                       low_memory=False, sep="|")
    li.append(df_a)
step 3:
frame = pd.concat(li, axis=0, ignore_index=True)
All three steps work perfectly if:
we have small data that can fit in our machine's memory
we have only one type of file inside the archive
How do we overcome this problem? Please help.
We are expecting code that has the ability to read a file in chunks for a particular file type and create a data frame for it.
Also, please advise: apart from the pandas library, are there any other approaches or libraries that could handle this more efficiently, considering our data resides on a Linux server?
You can refer to this link:
How do I read a large csv file with pandas?
In general, you can try reading in chunks.
For better performance, I suggest using Dask or PySpark.
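For instance, read_csv accepts a chunksize argument and then returns the file in pieces instead of loading it all at once. A rough sketch with the same options as step 2 (the filter comment is a placeholder; if the chunks are concatenated unreduced, the full file still ends up in memory):
import pandas as pd

chunks = []
for chunk in pd.read_csv(filename, index_col=False, header=0, names=header_name,
                         sep="|", chunksize=100_000):  # 100,000 rows at a time
    # ideally filter or aggregate each chunk here so only the reduced result is kept
    chunks.append(chunk)

frame = pd.concat(chunks, axis=0, ignore_index=True)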
Use tarfile's open, next, and extractfile to get the entries, where extractfile returns a file object with which you can read that entry. You can provide that object to read_csv.
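A sketch along those lines, assuming the archive name from the question, header_name as defined earlier, and that only one file type (here first_file) is wanted per pass:
import tarfile
import pandas as pd

frames = []
with tarfile.open("file_name.tar.gz", "r:gz") as tar:
    for member in tar:  # iterate over the entries inside the archive
        if member.isfile() and "first_file" in member.name:  # pick one file type
            fileobj = tar.extractfile(member)  # file-like object, nothing unpacked to disk
            for chunk in pd.read_csv(fileobj, sep="|", header=0, names=header_name,
                                     chunksize=100_000):
                frames.append(chunk)

frame = pd.concat(frames, axis=0, ignore_index=True)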

Export data from MSSQL to Excel 'template' saving with a new name using Python

I am racking my brain here and have read a lot of tutorials, sites, sample code, etc. Something is not clicking for me.
Here is my desired end state.
Select data from MSSQL - Sorted, not a problem
Open an Excel template (xlsx file) - Sorted, not a problem
Export data to this Excel template and save it with a different name - PROBLEM.
What I have achieved so far: (this works)
I can extract data from DB.
I can write that data to Excel using pandas; my line of code for doing that is: pd.read_sql(script,cnxn).to_excel(filename,sheet_name="Sheet1",startrow=19,encoding="utf-8")
The filename variable is a new file name that I create on every run of the for loop.
What my challenge is:
The data needs to be exported to a predefined template (the template has formatting that must be present in every file).
I can open the file and I can write to it, but I do not know how to save that file with a different name on each iteration of the for loop.
In my for loop I use this code:
# this does not work
df = pd.read_sql(script, cnxn)
writer = pd.ExcelWriter(SourcePath)  # opens the source document
df.to_excel(writer)
writer.save()  # how do I saveas() a different file name?????
Your help would be highly appreciated.
Your method works. The problem is that you don't need to write the data to an Excel file right after you read it from the database. My suggestion is to first read the data into separate data frames.
df1 = pd.read_sql(script, cnxn)
df2 = pd.read_sql(script, cnxn)
df3 = pd.read_sql(script, cnxn)
You can then write all the data frames together to an Excel file. You can refer to this link.
I hope this solution can help you. Have a nice weekend.
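For the template-plus-new-filename part of the question, one possible sketch uses openpyxl directly: load_workbook re-opens the formatted template on each pass, and wb.save writes it out under whatever name you build for that iteration (the loop values and output naming below are hypothetical; script, cnxn and SourcePath are as in the question).
import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

df = pd.read_sql(script, cnxn)

for report_name in ["north", "south"]:  # hypothetical loop; substitute whatever drives your for loop
    wb = load_workbook(SourcePath)  # re-open the formatted template each time
    ws = wb["Sheet1"]
    # write the frame starting at row 20, mirroring startrow=19 in the question
    for r_idx, row in enumerate(dataframe_to_rows(df, index=False, header=True), start=20):
        for c_idx, value in enumerate(row, start=1):
            ws.cell(row=r_idx, column=c_idx, value=value)
    wb.save(f"report_{report_name}.xlsx")  # "save as" a different file name on each iteration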

Appending incoming csv files in Python to a master data frame

I have data exports that land every hour in a particular directory, and I'm hoping to have a script that reads all the files and appends them into one master dataframe in Python. The only issue is that, since they arrive every hour, I don't want to append existing or already-added csv files to the master dataframe.
I'm very new to Python, and so far have only been able to load all the files in the directory and append them all, using the below code:
import pandas as pd
import os
import glob
path = os.environ['HOME'] + "/file_location/"
allFiles = glob.glob(os.path.join(path,"name_of_files*.csv"))
df = pd.concat((pd.read_csv(f) for f in allFiles), sort=False)
With the above code, it looks into file_location and imports any files with the name "name_of_files", using a wildcard because the tail of each file name will be different.
I could continue to do this, but I'm literally going to have hundreds of files and don't want to import and append/concat them all each and every hour. To avoid this, I'd like to have that master data frame mentioned above and just have the new csv files that appear each hour automatically appended to the existing master df.
Again super new to Python, so not even sure what to do next. Any advice would be greatly appreciated!
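One common pattern, sketched under the assumption that a small log file and an on-disk master csv are acceptable (both names below are hypothetical): record which files have already been appended, and on each hourly run read only the ones that are new.
import glob
import os
import pandas as pd

path = os.environ['HOME'] + "/file_location/"
processed_log = os.path.join(path, "processed_files.txt")  # hypothetical log of files already loaded
master_path = os.path.join(path, "master.csv")             # hypothetical on-disk master file

# load the set of files appended on previous runs
processed = set()
if os.path.exists(processed_log):
    with open(processed_log) as f:
        processed = set(f.read().splitlines())

all_files = glob.glob(os.path.join(path, "name_of_files*.csv"))
new_files = [f for f in all_files if f not in processed]

if new_files:
    new_df = pd.concat((pd.read_csv(f) for f in new_files), sort=False)
    # append to the master csv, writing headers only the first time
    new_df.to_csv(master_path, mode="a", header=not os.path.exists(master_path), index=False)
    with open(processed_log, "a") as f:
        f.write("\n".join(new_files) + "\n")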
