Python Save SQL query to excel file - python

I'm using psycopg2 to query a database and return a table of data. I need to then write that data to a .xlsx file.
I had it writing to a .csv really nicely using:
from csv import writer

with open("file_name.csv", "w", newline="") as file:
    csv_writer = writer(file)
    csv_writer.writerow(headers)
    csv_writer.writerows(data)
This works fine; the only issue is that I then have to open the .csv and save it as a new .xlsx, which is a step I want to cut out.
I'm trying to use pandas:
df = pandas.DataFrame(data, columns=headers)
df.to_excel("file_name.xlsx")
But all numbers are being stored as text, so I now need to go back in and refresh the cells for Excel to realise each value is an integer or float.
I also tried openpyxl. This works better, but it still stores the date column as text, so I still need to go in and refresh the cells for Excel to recognise them as dates.
I thought it might be an issue with how psycopg2 pulls the data, but it's not an issue for .csv, so why is it a problem for .xlsx? This is probably just my lack of understanding of the difference between the two file types. Does anyone have a solution for saving as a .xlsx while retaining all the correct formatting?

After creating the DataFrame you can set the column types with astype() before writing to Excel (the "converters" argument belongs to pandas.read_excel and pandas.read_csv, not to the DataFrame constructor):
dtypes = {
    'name': str,
    'ages': int,
    'score': float
}
df = pandas.DataFrame(data, columns=headers).astype(dtypes)
df.to_excel("file_name.xlsx")
Where the keys of the dtypes dict are the column names in the DataFrame.
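For the date column mentioned in the question, a minimal sketch of coercing text columns to numeric and datetime types before writing. The sample rows and the column names (name, age, score, joined) are hypothetical stand-ins for whatever psycopg2 returns, and the sketch assumes the values arrive as strings:

```python
import pandas as pd

# Hypothetical rows standing in for the result of cursor.fetchall().
headers = ["name", "age", "score", "joined"]
data = [
    ("Ann", "34", "91.5", "2021-03-01"),
    ("Bob", "27", "78.0", "2020-11-15"),
]


def typed_frame(rows, columns):
    """Build a DataFrame and convert text columns to real numeric/date types."""
    df = pd.DataFrame(rows, columns=columns)
    df["age"] = pd.to_numeric(df["age"])        # text -> integer
    df["score"] = pd.to_numeric(df["score"])    # text -> float
    df["joined"] = pd.to_datetime(df["joined"])  # text -> datetime64
    return df


df = typed_frame(data, headers)
print(df.dtypes)
```

Calling df.to_excel("file_name.xlsx", index=False) on the result should then write numbers and dates as typed cells rather than text; that call needs an Excel engine such as openpyxl or xlsxwriter installed.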

Related

Convert XLSX to CSV in Python: Keep Values, not Formulas

TLDR: How can I convert the .xlsx over to .csv and preserve the values of formulas, instead of the formulas themselves?
My code combines two .xlsx sheets together to generate emails for new org users.
The first .xlsx contains a formula that concatenates the user's name and our domain, while the other .xlsx contains the queried list of new users. When combined, the newly generated .xlsx, titled 'users.xlsx', includes the desired information, but the emails are still stored as the formula, not as values. If I try to read it as data_only via pandas, it doesn't seem to work at all and no emails are generated in the newly created 'users' xlsx sheet.
This is all fine and works well, but the final step is converting the .xlsx over to .csv
Because the emails are technically generated through the concatenating formula, the conversion doesn't preserve the user's emails.
How can I convert the .xlsx over to .csv and preserve the values of formulas, instead of the formulas themselves? Is this possible? Can I force the third .xlsx to preserve values only and then do the conversion?
Things I've tried (while they all successfully convert into a .csv, the data within formulas is lost):
Legend:
combined_xlsx_2
# The .xlsx product after combining two xlsx (user info + emails)
# This product is 'users.xlsx' - I need it converted to a .csv
Code 1:
# Read and store content
# of an excel file
read_file = pd.read_excel(combined_xlsx_2)
# Write the dataframe object
# into csv file
filedir = combined_xlsx_2.replace("users_2.xlsx","users.csv")
read_file.to_csv(filedir,
                 index=None,
                 header=True,
                 encoding='utf-8')
# read csv file and convert
# into a dataframe object
df = pd.DataFrame(pd.read_csv(filedir))
df
Code 2:
filename = (combined_xlsx_2)
filedir = (filename.replace("/users.xlsx",""))
path_to_excel_files = glob.glob(filedir)
for excel in path_to_excel_files:
    out = excel.split('.')[0]+'.csv'
    df = pd.read_excel(excel)
    df.to_csv(out)
Code 3:
wb = xlrd.open_workbook(combined_xlsx_2)
sh = wb.sheet_by_name('Sheet1')
your_csv_file = open(combined_xlsx_2.replace('.xlsx', '.csv'), 'w')
wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)
for rownum in range(sh.nrows):
    wr.writerow(sh.row_values(rownum))
your_csv_file.close()
Thank you for your time and assistance!
UPDATE 1:
I was able to accomplish this using 'convert-api'
https://www.convertapi.com/xlsx-to-csv#snippet=python
While not what I had in mind, it will at least get me by. Still hoping there's a better solution for this. Just wanted to share this just in case anyone else had a similar question.
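One likely reason the formula column comes out blank: openpyxl (which pandas uses to read .xlsx files) does not evaluate formulas. It only reads the value that Excel cached the last time the workbook was calculated and saved, so a workbook generated by code and never opened in Excel has no cached values. A small sketch that reproduces this; the file path and the "@example.org" domain are hypothetical:

```python
import openpyxl


def formula_and_cached_value(path):
    """Write a formula with openpyxl, then read it back both ways."""
    wb = openpyxl.Workbook()
    ws = wb.active
    ws["A1"] = "ann"
    ws["B1"] = '=CONCATENATE(A1,"@example.org")'
    wb.save(path)

    # Default mode returns the formula text itself.
    formula = openpyxl.load_workbook(path).active["B1"].value
    # data_only=True returns the cached result, which was never computed here.
    cached = openpyxl.load_workbook(path, data_only=True).active["B1"].value
    return formula, cached
```

Opening the combined workbook in Excel and saving it once stores the calculated results, after which a data-only read can pick them up.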

Convert XLSX to CSV without losing values from formulas

I've tried a few methods, including pandas:
df = pd.read_excel('file.xlsx')
df.to_csv('file.csv')
But every time I convert my xlsx file over to csv format, I lose all data within columns that include a formula. I have a formula that concatenates values from two other cells + '#domain' to create user emails, but this entire column returns blank in the csv product.
The formula is basically this:
=CONCATENATE(B2,".",E2,"#domain")
The conversion is part of a larger code workflow, but it won't work if this column is left blank. The only thing I've tried that worked was this API, but I'd rather not pay a subscription if this can be done locally on the machine.
Any ideas? I'll try whatever you throw at me - bear in mind I'm new to this, but I will do my best!
You can try opening the Excel file with the openpyxl library in data-only mode. Instead of the raw formulas, this returns the values Excel last calculated for them, just the way you see them in Excel itself.
import openpyxl
wb = openpyxl.load_workbook(filename, data_only=True)
Watch out: if you work on your original file and save it with openpyxl in data-only mode, all your formulas will be lost. This happened to me once and it was horrible, so I recommend working with a copy of your file.
Once you have your workbook loaded with values only, you can use the built-in csv library to generate a proper CSV file (idea from this post: How to save an Excel worksheet as CSV):
import csv
sheet = wb.active # was .get_active_sheet()
with open('test.csv', 'w', newline="") as f:
    c = csv.writer(f)
    for r in sheet.iter_rows(): # generator; was sh.rows
        c.writerow([cell.value for cell in r])
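The two snippets above can be combined into one helper. This is only a sketch: the function name xlsx_values_to_csv and its parameters are made up for illustration, and it writes whatever values Excel last cached in the file (empty cells come out as empty CSV fields).

```python
import csv

import openpyxl


def xlsx_values_to_csv(xlsx_path, csv_path, sheet_name=None):
    """Write the cached cell values of one worksheet to a CSV file."""
    wb = openpyxl.load_workbook(xlsx_path, data_only=True)
    sheet = wb[sheet_name] if sheet_name else wb.active
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # values_only=True yields plain tuples of values instead of cell objects.
        for row in sheet.iter_rows(values_only=True):
            writer.writerow(row)
    wb.close()
```

Because the workbook is only read and never saved, the formulas in the original file are left untouched.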

How to write a dict of dataframes to one excel file in Pandas? Key is sheet name, Value is the dataframe

I am having much trouble trying to read in a large excel file (.xlsx), and write some of its tabs/sheets to a smaller excel file.
In one class, I return a dict of dataframes. The key is the name of the sheet/tab that the dataframe came from (type string). The value is the actual dataframe, with all of its original columns (type DataFrame). In this class, I extract certain dataframes from the original excel file.
I am able to print out my key:value pairs after extracting the dataframes of my choice, and it all looks fine. However, I believe that my real problem is writing the actual data to 1 excel file. I only get the first dataframe, without the sheet name that it came from (it becomes the default 'Sheet1'), and nothing else.
Here is the code that writes my dict to an excel file:
def frames_to_excel(self, df_dict, path):
    """Write dictionary of dataframes to separate sheets, within
    1 file."""
    writer = pd.ExcelWriter(path, engine='xlsxwriter')
    for tab_name, dframe in df_dict.items():
        dframe.to_excel(writer, sheet_name=tab_name)
    writer.close()  # writer.save() was removed in pandas 2.0
- "path" is the selected output path to write the whole dict to a xlsx file.
- "df_dict" is the dict of dataframes.
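When the output does not look as expected, it can help to confirm which file was actually written and which sheets it contains. A rough sketch; check_written_workbook is a hypothetical helper, not part of pandas:

```python
import os

import pandas as pd


def check_written_workbook(path, expected_sheets):
    """Return the absolute path inspected and any expected sheets it lacks."""
    full_path = os.path.abspath(path)
    with pd.ExcelFile(full_path) as xls:
        found = list(xls.sheet_names)
    missing = [name for name in expected_sheets if name not in found]
    return full_path, missing
```

Calling it with list(df_dict) as expected_sheets right after writing shows the exact location of the file and whether every tab made it into the workbook.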
I am very sorry for the confusion. My bug was not at all in the code I posted, or any of the classes that parse the data from the original excel file. The problem was this line of code:
excel_path = re.sub(r"(?i)original|_original", "_custom", os.path.basename(excel_path))
By calling the basename function from the os library, I was only using the file name instead of the full path:
writer = pd.ExcelWriter(excel_path, engine='xlsxwriter')
Therefore, I was not writing the data to the full path I expected, and I was looking at old output from my program from about 5 days ago. Thanks for everyone's help.
The fix (use the proper full path that you expect):
excel_path = re.sub(r"(?i)original|_original", "_custom", excel_path)
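An alternative that avoids both problems (losing the directory, or accidentally renaming a folder that happens to contain "original") is to split the path, rename only the file name, and join the pieces again. This is a sketch under that assumption; custom_path is a hypothetical helper name:

```python
import os
import re


def custom_path(excel_path):
    """Replace 'original' in the file name only, keeping the directory."""
    directory, filename = os.path.split(excel_path)
    new_name = re.sub(r"(?i)original|_original", "_custom", filename)
    return os.path.join(directory, new_name)


source = os.path.join("data", "reports", "sales_original.xlsx")
print(os.path.basename(source))  # file name only, directory is dropped
print(custom_path(source))       # directory kept, file name rewritten
```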

How to read data from excel from a particular column in python

I have an excel sheet and I am reading the excel sheet using pandas in python.
Now I want to read the excel file based on a column: if the column has some value then do not read that row; if the column is empty then read the row and store the values in a list.
Here is a screenshot
Excel Example
In the above image, when the uniqueidentifier is "yes" the row should not be read, but if it is empty then reading should start from that row.
How can I do that using python, and how do I get the index so that, after I have performed some function, I can write back to that blank unique identifier column to say the row has been read?
This is possible for csv files. There you could do
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=100000)
df = pd.concat([chunk[chunk['UniqueIdentifier'] == 'True'] for chunk in iter_csv])
But pd.read_excel does not offer to return an iterator object. Maybe some other Excel readers can, but I don't know which ones. Nevertheless, you could export your excel file as csv and use the solution for csv files.
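If the sheet fits in memory, another option is to load it whole and filter afterwards. The sketch below assumes a marker column named "UniqueIdentifier" (as in the snippet above) that holds "yes" for processed rows and is empty otherwise; the "value" column and the sample data are hypothetical, and with a real file the frame would come from pd.read_excel instead:

```python
import pandas as pd


def unread_rows(df, flag_column="UniqueIdentifier"):
    """Rows whose marker column is still empty."""
    return df[df[flag_column].isna()]


def mark_as_read(df, index, flag_column="UniqueIdentifier"):
    """Write 'yes' into the marker column for the given row labels."""
    df.loc[index, flag_column] = "yes"
    return df


# Hypothetical stand-in for pd.read_excel('file.xlsx').
df = pd.DataFrame({
    "value": [10, 20, 30],
    "UniqueIdentifier": ["yes", None, None],
})

pending = unread_rows(df)
values = pending["value"].tolist()    # data from the unread rows only
df = mark_as_read(df, pending.index)  # pending.index holds their row labels
```

The filtered frame keeps the original index, which is what makes it possible to write the marker back to the same rows before saving the sheet again.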

why pandas change (large)numbers when it exports data to csv and excel

I have a dataframe with one column number:
df = pd.DataFrame([34032872653290886,57875847776839336],['A','B'],columns=['numbers'])
When I save the dataframe to Excel and to CSV, the saved values are shown in scientific notation and become 34032872653290900 and 57875847776839300.
To convert df I use the following code.
df.to_excel('a1.xlsx')
df.to_csv('a1.csv')
Is it a bug? Or should I change a setting? I checked my code on two systems (Mac and Windows), and my pandas version is '0.20.2'.
It turns out Excel keeps only 15 significant digits of a number, so the remaining digits show up as zeros when the file is opened in Excel; there is nothing wrong with the CSV writer module.
I got the answer in another post: Python CSV writer truncates long numbers
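The effect can be imitated in plain Python. The helper below is only an illustration of the 15-significant-digit limit, not Excel's actual implementation; it rounds an integer to 15 significant digits and reproduces the two values reported in the question:

```python
def to_15_digits(n, digits=15):
    """Round an integer to `digits` significant digits, zero-filling the rest."""
    length = len(str(abs(n)))
    extra = length - digits
    if extra <= 0:
        return n  # short enough to be shown exactly
    scale = 10 ** extra
    rounded = (abs(n) + scale // 2) // scale * scale
    return rounded if n >= 0 else -rounded


for number in (34032872653290886, 57875847776839336):
    print(number, "->", to_15_digits(number))
```

If every digit matters (IDs, account numbers), a common workaround is to store the column as text, for example with df['numbers'].astype(str), so that Excel treats it as a string rather than a number.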
