I am trying to convert an .xlsb file to .xlsx using Python, but none of my attempts have worked and I can't figure out what is wrong.
Code:
import pandas as pd
import os
import glob

source='C:\\Users\\JS Developer\\sample.xlsb'
dest= 'C:\\Users\\JS Developer\\Desktop\\New folder'
os.chdir(source)

for file in glob.glob("*.xlb"):
    df.to_csv(dest+file+'.csv', index=False)
    os.remove(file)

for file in glob.glob("*.xlsb"):
    df = pd.read_excel(file)
    df.to_csv(dest+file+'.csv', index=False)
    os.remove(file)
Once you have read the Excel file into a pandas DataFrame, save it with:
df.to_excel(r'Path\name.xlsx')
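Try:
for file in glob.glob("*.xlsb"):
    # Reading .xlsb needs the pyxlsb engine (pandas 1.0 or later).
    df = pd.read_excel(file, engine="pyxlsb")
    # Build "name.xlsx" inside dest instead of gluing the strings together.
    out_path = os.path.join(dest, os.path.splitext(file)[0] + ".xlsx")
    df.to_excel(out_path, index=False, header=True)
    os.remove(file)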
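For reference, here is a rough, self-contained sketch of the same idea. It assumes the pyxlsb package is installed and that only the first sheet of each workbook matters; the names xlsx_target and convert_folder are made up for this example and are not part of pandas.

```python
from pathlib import Path


def xlsx_target(src, dest_dir):
    """Return dest_dir/<stem>.xlsx for a source workbook path."""
    return Path(dest_dir) / (Path(src).stem + ".xlsx")


def convert_folder(source_dir, dest_dir):
    """Convert every .xlsb file in source_dir to .xlsx in dest_dir."""
    import pandas as pd  # imported here so the path helper works without pandas

    written = []
    for src in sorted(Path(source_dir).glob("*.xlsb")):
        # engine="pyxlsb" needs the pyxlsb package; only the first sheet is read
        df = pd.read_excel(src, engine="pyxlsb")
        out = xlsx_target(src, dest_dir)
        df.to_excel(out, index=False)
        written.append(out)
    return written


print(xlsx_target("sample.xlsb", "New folder").name)  # sample.xlsx
```

Note that source_dir has to be a folder, not the .xlsb file itself, and that the originals are left in place here; delete them only after checking the output.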
I have an encrypted Excel file that I need to work with. I know how to read data from it using this method:
import io
import pandas as pd
import msoffcrypto

password = 'something'

def decrypt_excel(path_to_excel, password):
    # Decrypt the workbook into an in-memory buffer and return the buffer.
    decrypted_file = io.BytesIO()
    with open(path_to_excel, "rb") as file:
        excel_file = msoffcrypto.OfficeFile(file)
        excel_file.load_key(password=password)
        excel_file.decrypt(decrypted_file)
    return decrypted_file
(How to read the data: "From password-protected Excel file to pandas DataFrame".)
Now my question is: how do I write back to such files?
I am writing some automated scripts to process Excel files in Python, some of which are in XLS format. Here's a code snippet of my attempt to do so with Pandas:
df = pd.read_excel(contents, engine='xlrd', skiprows=5, names=['some', 'column', 'headers'])
contents is the file contents pulled from an AWS S3 bucket. When this line runs I get [ERROR] ValueError: File is not a recognized excel file.
In troubleshooting this, I have tried to access the spreadsheet using xlrd directly:
book = xlrd.open_workbook(file_contents=contents)
print("Number of worksheets is {}".format(book.nsheets))
print("Worksheet names: {}".format(book.sheet_names()))
This works without errors so xlrd seems to recognize it as an Excel file, just not when asked to do so by Pandas.
Anyone know why Pandas won't read the file with xlrd as the engine? Or can someone help me take the sheet from xlrd and convert it into a Pandas dataframe?
Or can someone help me take the sheet from xlrd and convert it into a
Pandas dataframe?
pd.read_excel can take a book...
import xlrd
book = xlrd.open_workbook(filename='./file_check/file.xls')
df = pd.read_excel(book, skiprows=5)
print(df)
   some   column headers
0     1     some     foo
1     2  strings     bar
2     3     here     yes
3     4      too      no
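If contents is the raw bytes pulled from S3, another thing that may be worth trying is wrapping it in a file-like buffer before handing it to pandas. This is only a sketch: it assumes pandas and xlrd are installed, the helper names as_file_like and read_xls are invented for this example, and it may or may not be the cause of the error above.

```python
import io


def as_file_like(contents):
    """Wrap raw bytes in a seekable buffer; pass other inputs through unchanged."""
    if isinstance(contents, (bytes, bytearray)):
        return io.BytesIO(contents)
    return contents


def read_xls(contents, **kwargs):
    """Read an .xls payload (bytes or file-like) into a DataFrame."""
    import pandas as pd  # needs pandas and xlrd installed

    return pd.read_excel(as_file_like(contents), engine="xlrd", **kwargs)
```

Usage would look like read_xls(contents, skiprows=5, names=['some', 'column', 'headers']).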
I'll include some code below that may help if you want to check or handle Excel file types. Maybe you can adapt it for your needs.
The code loops through a local folder and lists each file with its extension, then uses python-magic to inspect the actual content. It also has a column showing the guess from mimetypes, but that is less reliable because it only looks at the file name. In the resulting frame, some .xls files are not what the extension says, and one .txt file is actually an Excel file.
import pandas as pd
import glob
import mimetypes
import os
# https://pypi.org/project/python-magic/
import magic

path = r'./file_check' # use your path
all_files = glob.glob(path + "/*.*")
data = []
for file in all_files:
    name, extension = os.path.splitext(file)
    data.append([file, extension, magic.from_file(file, mime=True), mimetypes.guess_type(file)[0]])
df = pd.DataFrame(data, columns=['Path', 'Extension', 'magic.from_file(file, mime=True)', 'mimetypes.guess_type'])
# del df['magic.from_file(file, mime=True)']
df
From there you could filter files based on their type:
xlsx_file_format = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
xls_file_format = 'application/vnd.ms-excel'
for file in all_files:
    if magic.from_file(file, mime=True) == xlsx_file_format:
        print('xlsx')
        # DO SOMETHING SPECIAL WITH XLSX FILES
    elif magic.from_file(file, mime=True) == xls_file_format:
        print('xls')
        # DO SOMETHING SPECIAL WITH XLS FILES
    else:
        continue
dfs = []
for file in all_files:
    if (magic.from_file(file, mime=True) == xlsx_file_format) or \
       (magic.from_file(file, mime=True) == xls_file_format):
        # who cares, it all works with this for the demo...
        df = pd.read_excel(file, skiprows=5, names=['some', 'column', 'headers'])
        dfs.append(df)
print('\nHow many frames did we get from seven files? ', len(dfs))
Output:
xlsx
xls
xls
xlsx
How many frames did we get from seven files? 4
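If installing python-magic is not an option, a much cruder check on the first few bytes gets part of the way there. This is a sketch using only the standard library, and the function names are invented here. It only identifies the container: legacy .xls files are OLE2 compound documents and .xlsx files are ZIP archives, but so are old Word files and .docx files respectively, so treat the result as a hint rather than proof.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file (legacy .xls)
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP archive (.xlsx, .xlsb)


def sniff_excel(head):
    """Guess the container type from the first bytes of a file."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"   # .xls, but also old .doc/.ppt
    if head.startswith(ZIP_MAGIC):
        return "zip"    # .xlsx, but also .docx or any other ZIP
    return "unknown"


def sniff_file(path):
    """Read the first 8 bytes of a file on disk and classify them."""
    with open(path, "rb") as fh:
        return sniff_excel(fh.read(8))
```

The same sniff_excel call works on bytes that were downloaded rather than read from disk, for example sniff_excel(contents[:8]).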
I've just started out with Pandas and have managed to convert my xls file into an xlsx file. However, I now want the file to be saved to a different location, such as OneDrive. Could you help me out?
Here is the code I have written for it:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
#Deleting original file
path = (r"C:\Users\MQ\Downloads\Incident Report.xls")
os.remove(path)
print("Original file has been deleted :)")
#Identifying the xls file
excel_file_1 = 'Incident Report.xls'
df_first_shift = pd.read_excel(r'C:\Users\MQ\3D Objects\New Folder\Incident Report.xls')
print(df_first_shift)
#combining data
df_all = pd.concat([df_first_shift])
print(df_all)
#Creating the .xlsx file
df_all.to_excel("Incident_Report.xlsx")
Use pd.ExcelWriter by passing in your destination path!
destination_path = "path\\to\\your\\onedrive\\filename.xlsx"
# The context manager saves and closes the file; writer.save() was removed in pandas 2.0.
with pd.ExcelWriter(destination_path, engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='sheetname')
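If OneDrive is synced to a local folder, you may not need to hard-code the path at all. The sketch below assumes the Windows sync client has set the OneDrive environment variable and falls back to a OneDrive folder in the home directory otherwise; both are assumptions about your machine, and onedrive_target is a made-up helper name.

```python
import os
from pathlib import Path


def onedrive_target(filename, environ=os.environ):
    """Build a path inside the locally synced OneDrive folder."""
    # Assumption: the sync client exposes its folder through the OneDrive
    # environment variable; otherwise fall back to ~/OneDrive.
    root = environ.get("OneDrive") or str(Path.home() / "OneDrive")
    return Path(root) / filename
```

With that in place, something like df_all.to_excel(onedrive_target("Incident_Report.xlsx")) would write straight into the synced folder.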
To write to cloud OneDrive, the following code is suggested. I did not run it, but offer it as a starting point.
Refer to www.lieben.nu's example for uploading a file to OneDrive:
import requests
import io
import pandas as pd

def cloudOneDrive(filename, bytesIO):
    '''
    Reference : https://www.lieben.nu/liebensraum/2019/04/uploading-a-file-to-onedrive-for-business-with-python/
    Write to cloud (bytesIO)
    '''
    data = {'grant_type':"client_credentials",
            'resource':"https://graph.microsoft.com",
            'client_id':'XXXXX',
            'client_secret':'XXXXX'}
    URL = "https://login.windows.net/YOURTENANTDOMAINNAME/oauth2/token?api-version=1.0"
    # Untested assumption: request a token first, then PUT to a Microsoft Graph
    # drive path. YOURONEDRIVEUSERNAME is a placeholder to replace.
    token = requests.post(url=URL, data=data).json()["access_token"]
    headers = {'Authorization': "Bearer " + token}
    URL = "https://graph.microsoft.com/v1.0/users/YOURONEDRIVEUSERNAME/drive/root:"
    r = requests.put(URL+"/"+filename+":/content", data=bytesIO, headers=headers)
    if r.status_code == 200 or r.status_code == 201:
        print("succeeded")
        return True
    else:
        print("Fail", r.status_code)
        return False

fn = 'junk.xlsx'
with io.BytesIO() as bio:
    # Write the workbook into the in-memory buffer, then rewind before uploading.
    with pd.ExcelWriter(bio) as xio:
        df.to_excel(xio, sheet_name='sh1')
    bio.seek(0)
    cloudOneDrive(fn, bio)
I want to take a PDF file as input and produce a CSV file as output, so that all the textual data in the PDF ends up in the CSV. I don't understand how to make this happen; I have tried but couldn't get it to work.
What I've done is use a library called tabula-py, which converts PDF to CSV. It does create a CSV file, but none of the contents of the PDF are copied into it.
Here's the code:
from tabula import convert_into,read_pdf
import tabula
df = tabula.read_pdf("crimestory.pdf", spreadsheet=True,
pages='all',output_format="csv")
df.to_csv('crimestoryy.csv', index=False)
The output should be a CSV file containing the data.
What I am getting is a blank CSV file.
I found an answer to this question on my own.
To tackle this issue, I convert the PDF file into a text file and then convert that text file into a CSV file. Here's my code.
conversion.py
import os.path
import csv
import pdftotext

save_path = "/home/mayureshk/PycharmProjects/NLP/"
completeName_in = os.path.join(save_path, 'crimestory' + '.txt')
completeName_out = os.path.join(save_path, 'crimestoryycsv' + '.csv')

# Load your PDF
with open("crimestory.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Save all text to a txt file.
with open(completeName_in, 'w') as f:
    f.write("\n\n".join(pdf))

# Re-read the text as comma-delimited rows and write them out as CSV.
with open(completeName_in) as file1, open(completeName_out, 'w', newline='') as file2:
    In_text = csv.reader(file1, delimiter=',')
    out_csv = csv.writer(file2)
    out_csv.writerows(In_text)
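One caveat with the approach above is that the text is split on commas, so a line without commas ends up as a single cell. If the extracted text keeps its columns apart with runs of spaces (which depends on the PDF and is not guaranteed), a rough alternative is to split on two or more spaces instead. The sketch below uses only the standard library; the helper names and the sample text are made up for illustration.

```python
import csv
import io
import re


def text_to_rows(text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows


def rows_to_csv(rows):
    """Serialise rows to CSV text; csv.writer quotes cells containing commas."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


sample = "Name      City\nA. Smith  Pune, India\n"
print(rows_to_csv(text_to_rows(sample)))
# Name,City
# A. Smith,"Pune, India"
```

Single spaces inside a cell are kept, and cells that contain a comma are quoted by the csv module, so the file still opens cleanly in a spreadsheet.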
Try this; hopefully it works:
import tabula
# convert PDF into CSV
tabula.convert_into("crimestory.pdf", "crimestory.csv", output_format="csv", pages='all')
or
# Note: newer tabula-py releases name this option lattice=True and return a list of DataFrames.
df = tabula.read_pdf("crimestory.pdf", encoding='utf-8', spreadsheet=True, pages='all')
df.to_csv('crimestory.csv', encoding='utf-8')
or
from tabula import read_pdf
df = read_pdf("crimestory.pdf")
df
# make sure df displays your pdf contents in the output
from tabula import convert_into
convert_into("crimestory.pdf", "crimestory.csv", output_format="csv")
!cat crimestory.csv
Hi, I am new to open-source tech. I am using Anaconda3-5.1.0-Windows-x86_64 and Microsoft Excel 2016. Reading an Excel file with pandas raises a FileNotFoundError for the code below.
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
path= "D:\sample.xlsx"
print(path)
df = pd.read_excel(path, sheet_name = 'Sheet1')
print('Column headings:')
print(df.columns)
The error message is: FileNotFoundError: [Errno 2] No such file or directory: 'D:\\sample.xlsx'
I was trying to read 'D:\sample.xlsx', but the function tries to open the file as 'D:\\sample.xlsx'.
Can anyone please advise on this issue, or let me know if more details are required?
Change
path= "D:\sample.xlsx"
to
path= r"D:\sample.xlsx" or path= "D:\\sample.xlsx"
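To see why the prefix or the doubled backslash can matter, here is a small standard-library sketch. The folder and file names are hypothetical and were chosen so that the backslashes are followed by n and t, which Python treats as escape sequences in a normal string.

```python
from pathlib import PureWindowsPath

raw = r"D:\new\table.xlsx"        # raw string: backslashes kept literally
escaped = "D:\\new\\table.xlsx"   # doubled backslashes: same text as raw
plain = "D:\new\table.xlsx"       # \n and \t become newline and tab

print(raw == escaped)                # True
print("\n" in plain, "\t" in plain)  # True True
print(len(raw), len(plain))          # 17 15
print(str(PureWindowsPath("D:/new/table.xlsx")) == raw)  # True
```

Whether a plain string breaks depends on the letter after each backslash, so using a raw string, doubled backslashes, or forward slashes avoids having to think about it. The doubled backslash shown in an error message is just how Python displays a single backslash in a quoted string.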