Copy data from CSV and PDF into HDF5 using Python

How can I transfer files from a specific folder into an HDF5 file using Python? The files are PDF and CSV.
For example, I have the path /root/Desktop/mal/ex1/ that contains many CSV files and PDF files,
and I want to make one single HDF5 file that contains all of these CSV and PDF files.

You could modify the code below to fit your requirements:
import numpy as np
import h5py
import pandas as pd
import glob

yourpath = '/root/Desktop/mal/ex1'

# gather every CSV in the folder and read each one into a DataFrame
all_files = glob.glob(yourpath + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

# combine the pieces into one DataFrame and write it as a single HDF5 dataset
# (h5py expects a uniform array, so this works best when all columns are numeric)
frame = pd.concat(li, axis=0, ignore_index=True)
hf = h5py.File('data.h5', 'w')
hf.create_dataset('dataset_1', data=frame)
hf.close()
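The code above only handles the CSV files. PDFs are not tabular, so one option is to store each PDF's raw bytes as its own dataset inside the same HDF5 file. A minimal sketch, assuming the same folder as above (the 'pdfs' group name is my own choice, not something required by h5py):

import glob
import os
import numpy as np
import h5py

yourpath = '/root/Desktop/mal/ex1'

with h5py.File('data.h5', 'a') as hf:          # 'a' appends to the file created above
    pdf_group = hf.require_group('pdfs')       # one group holding all PDFs
    for pdf_path in glob.glob(yourpath + "/*.pdf"):
        with open(pdf_path, 'rb') as f:
            raw = f.read()
        # store the raw PDF bytes as a 1-D uint8 dataset named after the file
        pdf_group.create_dataset(os.path.basename(pdf_path),
                                 data=np.frombuffer(raw, dtype='uint8'))

To get a PDF back out, read the dataset into an array and write its bytes to disk again (e.g. dataset[()].tobytes()).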

Related

Open all NC files in a directory and save them as Excel Sheet

I have 720 .nc files in one folder. I am trying to open each file and write all the data into an Excel sheet. The script works perfectly for a single file. Here is my code for a single file:
import xarray as xr
file_name = 'dcbl.slice.11748.nc'
# Loading NetCDF dataset using xarray
data = xr.open_dataset('/Users/ismot/Downloads/LES_Data/u1.shf400.lhf040.512/' + file_name)
# convert the columns to dataframe using xarray
df = data[['x', 'y', 'time', 'C_sum_column_AVIRIS', 'C_sum_column_HyTES']].to_dataframe()
# write the dataframe to an excel file
df.to_excel(file_name + '.xlsx')
Now, I am trying to run the script for all the files in the directory. I have modified the script like this:
# import required module
import os
import xarray as xr
# assign directory
directory = '/Users/ismot/Downloads/LES_Data/u1.shf400.lhf040.512'
# list all files in the directory
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    # checking if it is a file
    if os.path.isfile(f):
        print(f)

# write a function to open .NC files using xarray and convert them in excel sheet
def file_changer(filename):
    data = xr.open_dataset(str(filename))
    df = data[['x', 'y', 'time', 'C_sum_column_AVIRIS', 'C_sum_column_HyTES']].to_dataframe()
    df.to_excel(filename + '.xlsx')

# Run for multiple files
import glob
for file in glob.glob('*.nc'):
    file_changer(file)
The script runs and gives no error, but it only prints the names of the files in the directory. It doesn't go over the 720 files and save them as Excel sheets. How can I fix it?
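The directory loop only prints the paths, and glob.glob('*.nc') searches the current working directory rather than the data directory, so file_changer is never called on the 720 files. A sketch of one way to wire the pieces together, reusing the names from the question (writing the .xlsx files next to the .nc files is an assumption on my part):

import os
import xarray as xr

directory = '/Users/ismot/Downloads/LES_Data/u1.shf400.lhf040.512'

def file_changer(path):
    # open one NetCDF file, pull the columns of interest, and write them to Excel
    data = xr.open_dataset(path)
    df = data[['x', 'y', 'time', 'C_sum_column_AVIRIS', 'C_sum_column_HyTES']].to_dataframe()
    df.to_excel(path + '.xlsx')

for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    # only convert NetCDF files; everything else in the folder is skipped
    if os.path.isfile(f) and f.endswith('.nc'):
        file_changer(f)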

Pandas (Python) -> Export to xlsx with multiple sheets

I'm trying to read some .xlsx files from a directory that was created earlier using the current timestamp (the files are stored there). Now I want to read those .xlsx files and put them into only one .xlsx file with multiple sheets. I tried multiple ways and it didn't work.
The final file is Usage-SvnAnalysis.xlsx.
The script I tried:
import pandas as pd
import numpy as np
from timestampdirectory import createdir
import os
dest = createdir()
dfSvnUsers = pd.read_csv(dest, "SvnUsers.xlsx")
dfSvnGroupMembership = pd.read_csv(dest, "SvnGroupMembership.xlsx")
xlwriter = pd.ExcelWriter("Usage-SvnAnalysis.xlsx")
dfSvnUsers.to_excel(xlwriter, sheet_name='SvnUsers', index = False )
dfSvnGroupMembership.to_excel(xlwriter, sheet_name='SvnGroupMembership', index = False )
xlwriter.close()
The folder is created automatically with the current timestamp and contains the files.
This is one of the files that I want to add as a sheet in the final .xlsx.
This is how I create the directory with the current time and return dest to export the files into.
I changed the script a bit; this is how it looks now, and I'm still getting an error:
File "D:\Py_location_projects\testfi\Usage-SvnAnalysis.py", line 8, in <module>
    with open(file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'SvnGroupMembership.xlsx'
The files exist, but the script can't reach them, because that directory is created in another script using the timestamp and I return its path with dest.
dest = createdir() is the path where the files are. What I need is to access this dest, read the files from there, and export them into only one .xlsx as its sheets, in this case sheet1 and sheet2, because I tried to read only 2 files from that directory.
import pandas as pd
import numpy as np
from timestampdirectory import createdir
import os
dest = createdir()
files = os.listdir(dest)
for file in files:
    with open(file, 'r') as f:
        dfSvnUsers = open(os.path.join(dest, 'SvnUsers.xlsx'))
        dfSvnGroupMembership = open(os.path.join(dest, 'SvnGroupMembership.xlsx'))
        xlwriter = pd.ExcelWriter("Usage-SvnAnalysis.xlsx")
        dfSvnUsers.to_excel(xlwriter, sheet_name='SvnUsers', index=False)
        dfSvnGroupMembership.to_excel(xlwriter, sheet_name='SvnGroupMembership', index=False)
        xlwriter.close()
I think you should try reading the Excel files with pd.read_excel instead of pd.read_csv.
import os
dfSvnUsers = pd.read_excel(os.path.join(dest, "SvnUsers.xlsx"))
dfSvnGroupMembership = pd.read_excel(os.path.join(dest, "SvnGroupMembership.xlsx"))
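Putting that together with the ExcelWriter from the question, a full sketch could look like this (assuming createdir() returns the path of the timestamped folder, as described above):

import os
import pandas as pd
from timestampdirectory import createdir

dest = createdir()

# read each workbook from the timestamped folder
dfSvnUsers = pd.read_excel(os.path.join(dest, "SvnUsers.xlsx"))
dfSvnGroupMembership = pd.read_excel(os.path.join(dest, "SvnGroupMembership.xlsx"))

# write both DataFrames into one workbook, one sheet each
with pd.ExcelWriter("Usage-SvnAnalysis.xlsx") as xlwriter:
    dfSvnUsers.to_excel(xlwriter, sheet_name='SvnUsers', index=False)
    dfSvnGroupMembership.to_excel(xlwriter, sheet_name='SvnGroupMembership', index=False)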

Reading bulk Excel files from a folder with Python (Pandas)

I have 40 .xls files in a folder that I would like to import into a df in Pandas.
Is there a function similar to read_csv() that will allow me to point Python at the folder and open each of these files into the dataframe? All headers are the same in each file.
Try pandas.read_excel to open each file. You can loop over the files using the glob module.
import glob
import pandas as pd
dfs = {}
for f in glob.glob('*.xlsx'):
    dfs[f] = pd.read_excel(f)
df = pd.concat(dfs) # change concatenation axis if needed
You can also load the Excel files and concatenate them one by one:
import os
import pandas as pd
files = os.listdir(<path to folder>)
df_all = pd.DataFrame()
for file in files:
df = pd.read_excel(f"<path to folder>/{file}")
df_all = pd.concat([df_all,df])
import os
import pandas as pd

folder = r'C:\Users\AA\Desktop\Excel_file'
files = os.listdir(folder)

for file in files:
    if file.endswith('.xlsx'):
        df = pd.read_excel(os.path.join(folder, file))
Does this help?

Unable to read excel file from S3 bucket

I have this small piece of code which works on my local system:
import pandas as pd
import glob
import openpyxl
# path of folder
path=r'C:\Users\Preet\Desktop\python_files'
#Display list of files
filenames=glob.glob(path+"\*.xlsx")
print(filenames)
#initializing data frame
finalexcelsheet=pd.DataFrame()
#to iteriate excel
for file in filenames:
    df = pd.concat(pd.read_excel(file, sheet_name=None), ignore_index=True, sort=False)
    #print(df)
    finalexcelsheet = finalexcelsheet.append(df, ignore_index=True)
print(finalexcelsheet)
finalexcelsheet.to_excel('C:\\Users\\preet\\Desktop\\python_files\\final.xlsx', index=False)
However, when I try to read the same .xlsx files from the S3 bucket, it just creates an empty data frame, stops, and says the job succeeded. Below is the code for S3. Please let me know if anything is missing in the code below.
import boto3
import pandas as pd
import glob
import openpyxl
# path of folder
bucketname = "sit-bucket-lake-raw-static-5464"
s3 = boto3.resource('s3')
my_bucket = s3.Bucket(bucketname)
source = "sit-bucket-lake-raw-static-5464/Staging/"
target = "sit-bucket-lake-raw-static-5464/branch/2020/12/"
#Display list of files
filenames=glob.glob(source+"\*.xlsx")
print(filenames)
#initializing data frame
finalexcelsheet=pd.DataFrame()
#to iteriate excel
for file in filenames:
    df = pd.concat(pd.read_excel(file, sheet_name=None), ignore_index=True, sort=False)
    finalexcelsheet = finalexcelsheet.append(df, ignore_index=True)
print(finalexcelsheet)
finalexcelsheet.to_excel('target\final.xlsx',index=False)
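glob only searches the local filesystem, so glob.glob(source + "\*.xlsx") returns an empty list when source is an S3 location, the loop never runs, and the job finishes with an empty data frame. A sketch of one possible fix, listing the objects through boto3 instead (the bucket and prefix are taken from the question; reading each object into memory with obj.get()['Body'] is my own suggestion, not from the original):

import io
import boto3
import pandas as pd

bucketname = "sit-bucket-lake-raw-static-5464"
prefix = "Staging/"

s3 = boto3.resource('s3')
my_bucket = s3.Bucket(bucketname)

frames = []
for obj in my_bucket.objects.filter(Prefix=prefix):
    if obj.key.endswith('.xlsx'):
        body = obj.get()['Body'].read()                       # raw bytes of one workbook
        sheets = pd.read_excel(io.BytesIO(body), sheet_name=None)
        frames.append(pd.concat(sheets, ignore_index=True))

finalexcelsheet = pd.concat(frames, ignore_index=True)
print(finalexcelsheet)

Writing the result back to S3 would also need an explicit upload (for example, writing to a buffer and calling put_object) rather than a plain 'target\final.xlsx' path.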

How to read all csv files from web page in a pandas data frame

I would like to load all the CSV files from the following webpage into a data frame:
https://s3.amazonaws.com/tripdata/index.html
I tried glob, as I would for loading all files from a directory, but without success:
import glob
path ='https://s3.amazonaws.com/tripdata' # use your path
allFiles = glob.glob(path + "/*citibike-tripdata.csv.zip")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    list_.append(df)
frame = pd.concat(list_)
Any suggestions?
The module glob is used for finding pathnames matching patterns on the same system as Python is running in, and there is no way for it to index arbitrary file hosting web servers (which isn't even possible a priori). In your case, since https://s3.amazonaws.com/tripdata/ provides the desired index, you could parse that to get the relevant files:
import re
import requests
import pandas as pd

url = 'https://s3.amazonaws.com/tripdata/'

# fetch the bucket index and pull out the trip-data archive names
t = requests.get(url).text
filenames = re.findall(r'[^>]+citibike-tripdata\.csv\.zip', t)

# read each zipped CSV straight from its URL and concatenate the results
frame = pd.concat(pd.read_csv(url + f) for f in filenames)
