Save file with a specific name using tk filedialog - python

Is there a way to save a DataFrame to an Excel file with filedialog, but using a specific name such as 'my_file' by default?
I usually use this code:
path_to_save = filedialog.asksaveasfilename(defaultextension='.xlsx')
df.to_excel(path_to_save, index=False)
This opens a window where I can choose the location and name of my file. Now I want the name 'my_file' to be filled in by default so that typing it will not be necessary.
Is there a way of doing it? Many thanks in advance.
The saved Excel file is empty:
a_row['column1'] = df['column1']
new_df = a_row
new_df2 = pd.DataFrame({'column2': [], '': []})
new_df3 = pd.concat([new_df, new_df2])
new_df3['column2'] = 'some value'
new_df3 = new_df3.set_index(['column1', 'column2'])
path_to_save1 = filedialog.asksaveasfilename(defaultextension='.xlsx', initialfile = 'my_file')
new_df3.to_excel(path_to_save1, index=False)
Is there maybe a way to insert a row on top of the column names, like in this image? I couldn't find anything in the pandas docs about this.

Main Question
Try using the parameter initialfile for the asksaveasfilename function, e.g.:
path_to_save = filedialog.asksaveasfilename(defaultextension='.xlsx', initialfile = 'my_file')
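For completeness, a minimal sketch of the whole save flow, assuming a DataFrame named df; note that asksaveasfilename returns an empty string if the dialog is cancelled, so it is worth guarding against that:
from tkinter import filedialog

# 'my_file' is only the suggested default; the user can still change it in the dialog
path_to_save = filedialog.asksaveasfilename(
    defaultextension='.xlsx',
    initialfile='my_file',
    filetypes=[('Excel files', '*.xlsx')],
)
if path_to_save:  # empty string means the dialog was cancelled
    df.to_excel(path_to_save, index=False)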
Comment Question
Regarding the DataFrame being empty: it is because you are not assigning anything to it. To add values, you can use loc:
df = pd.DataFrame({'column1':[],'column2':[]})
df.loc[0] = ['value1','value2']
You can also do that using concat, but make sure your dataframes have the same number of columns for that.
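For example, a small sketch of the concat route with two frames that share the same columns (the values here are made up):
import pandas as pd

df_a = pd.DataFrame({'column1': ['a1'], 'column2': ['a2']})
df_b = pd.DataFrame({'column1': ['b1'], 'column2': ['b2']})
# Both frames have the same columns, so the rows line up cleanly
combined = pd.concat([df_a, df_b], ignore_index=True)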
And to add a row on top, there was this interesting solution by @edyvedy13, found here:
df.loc[-1] = ['value1.0', 'value2.0']  # add the new row with a temporary index of -1
df.index = df.index + 1                # shift every existing index up by one
df.sort_index(inplace=True)            # the new row now sorts to the top

Related

Prevent pandas from changing int to float/date?

I'm trying to merge a series of xlsx files into one, which works fine.
However, when I read a file, columns containing ints are transformed into floats (or dates?) when I merge and output them to CSV. I have tried to visualize this in the picture. I have seen some solutions where dtype is used to "force" specific columns into int format. However, I do not always know the index or the title of the column, so I need a more scalable solution.
Anyone with some thoughts on this?
Thank you in advance
# specify folder with xlsx-files
xlsFolder = "{}/system".format(directory)
dfMaster = pd.DataFrame()
# make a list of all xlsx-files in folder
xlsFolderContent = os.listdir(xlsFolder)
xlsFolderList = []
for file in xlsFolderContent:
    if file[-5:] == ".xlsx":
        xlsFolderList.append(file)
for xlsx in xlsFolderList:
    print(xlsx)
    xl = pd.ExcelFile("{}/{}".format(xlsFolder, xlsx))
    for sheet in xl.sheet_names:
        if "_Errors" in sheet:
            print(sheet)
            dfSheet = xl.parse(sheet)
            dfSheet.fillna(0, inplace=True)
            dfMaster = dfMaster.append(dfSheet)
            print("len of dfMaster:", len(dfMaster))
dfMaster.to_csv("{}/dfMaster.csv".format(xlsFolder), sep=";")
Data sample:
Try using dtype='object' as a parameter of pd.read_csv (or ExcelFile.parse) to prevent pandas from inferring the data type of each column. You can also simplify your code using pathlib:
import pandas as pd
import pathlib

directory = pathlib.Path('your_path_directory')
xlsFolder = directory / 'system'

data = []
for xlsFile in xlsFolder.glob('*.xlsx'):
    sheets = pd.read_excel(xlsFile, sheet_name=None, dtype='object')
    for sheetname, df in sheets.items():
        if '_Errors' in sheetname:
            data.append(df.fillna('0'))

pd.concat(data).to_csv(xlsFolder / 'dfMaster.csv', sep=';')
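Alternatively, if you would rather keep your original loop structure, a minimal sketch of the same idea (the workbook path is a placeholder) only changes the parse call:
import pandas as pd

xl = pd.ExcelFile('some_workbook.xlsx')  # placeholder path
frames = []
for sheet in xl.sheet_names:
    if '_Errors' in sheet:
        # dtype='object' keeps cell values as-is instead of inferring ints/floats/dates
        frames.append(xl.parse(sheet, dtype='object').fillna(0))
dfMaster = pd.concat(frames)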

Defining function to open an Excel file (openpyxl) and save as a DataFrame

I have a typical method that I use to pull data from an Excel file into a DataFrame:
import pandas as pd
import openpyxl as op
path = r'thisisafilepath\filename.xlsx'
book = op.load_workbook(filename=path, data_only=True)
tab = book['sheetname']
data = tab.values
columns = next(data)[0:]
df = pd.DataFrame(data, columns=columns)
I'm trying to define this method as a function to make the code simpler/more readable.
I have tried the following:
def openthis(path, sheet):
    book = op.load_workbook(filename=path, data_only=True)
    tab = book[sheet]
    data = tab.values
    columns = next(data)[0:]
    df = pd.DataFrame(data, columns=columns)
    return df
When I then call openthis() the output is a printed version of the DataFrame in my console, but no variable has actually been created for me to work with.
What am I missing? Also, is there a way to define what the DataFrame variable is called when it is produced?
You didn't show how you actually call it, but I'm guessing that you didn't assign the output to a variable.
Notice the return df at the end of your function.
That statement means that calling openthis() returns a value. Unless you assign that return value to a variable, it's gone forever.
Try this:
df = openthis(some_arguments)
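Putting it together, a minimal usage sketch (the path and sheet name are just the placeholders from the question); this also answers the naming question, since the variable name on the left is entirely up to the caller:
df_mydata = openthis(r'thisisafilepath\filename.xlsx', 'sheetname')
print(df_mydata.head())  # the returned DataFrame is now bound to df_mydata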

How to name dataframes dynamically in Python?

I have an Excel file which contains more than 30 sheets. The operation that I do on each sheet remains more or less the same, but my objective is to create a separate dataframe for each sheet, so that I can refer to them in the future.
This is what I tried, but it throws an error:
xls = pd.ExcelFile('DC_Measurement.xlsx')
sheets = xls.sheet_names
for s in sheets:
    print(s)
    'df '+ s = pd.read_excel(xls, sheet_name=s)
So I want 30 dataframes to be created, each named with the sheet name as a suffix. I tried using the "+" operator, but it didn't help either; it threw the error message shown below:
SyntaxError: can't assign to operator
How can I create dataframes on the fly and name them ?
You could use something like this:
for s in sheets:
    vars()['df' + s] = pd.read_excel(xls, sheet_name=s)
Strictly speaking this is not an answer to your question, but the following will create a dictionary where the key is the sheet name and the value is the dataframe.
workbook = pd.read_excel('DC_Measurement.xlsx', sheet_name = None)
Then you can retrieve the dataframe you need like this.
df = workbook['sheet_name']
I think this is tidier than other solutions.
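For example, a quick sketch of looping over every sheet in that dictionary (the print is just a placeholder for the real per-sheet operation):
import pandas as pd

# sheet_name=None loads all sheets into a {sheet name: DataFrame} dictionary
workbook = pd.read_excel('DC_Measurement.xlsx', sheet_name=None)
for sheet_name, df in workbook.items():
    print(sheet_name, df.shape)  # placeholder for the real per-sheet operation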
Or use locals:
for s in sheets:
    locals()['df' + s] = pd.read_excel(xls, sheet_name=s)
In a function change locals to globals.
The best approach is usually to store the dataframes in a list or dictionary, where you can work with them systematically, like this:
xls = pd.ExcelFile('DC_Measurement.xlsx')
sheets = {}
for s in xls.sheet_names:
    print(s)
    sheets[s] = pd.read_excel(xls, sheet_name=s)
Or just this:
xls = pd.ExcelFile('DC_Measurement.xlsx')
sheets = {
    s: pd.read_excel(xls, sheet_name=s)
    for s in xls.sheet_names
}
This will make it easy to work with the sheets programmatically later (just access sheets[s], where s is a sheet name). Otherwise you will next face the tricky problem of how to access all the dataframes that you've just created as free-floating variables.
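As a quick sketch of what that buys you (dropna() here is only a stand-in for whatever per-sheet processing you actually do):
# The "same operation on every sheet" part of the question becomes a plain loop
processed = {name: df.dropna() for name, df in sheets.items()}
# Individual sheets are still reachable by name
first_name = xls.sheet_names[0]
first_sheet = processed[first_name]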

Allow duplicate columns in Pandas

I'm splitting a large CSV file (containing stock financial data) into smaller chunks. The format of the CSV is unusual, something like an Excel pivot table: the first few rows of the first column contain some headers.
Company name, id, etc. are repeated across the following columns, because a single company has more than one attribute rather than a single column.
After the first few rows, the columns start resembling a typical data frame where headers are in columns instead of rows.
Anyway, what I'm trying to do is make pandas allow duplicate column headers and not add ".1", ".2", ".3", etc. after them. I know pandas does not allow this natively; is there a workaround? I tried setting header=None on read_csv, but it throws a tokenization error, which I think makes sense. I just can't think of an easy way.
import pandas as pd

csv_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4.csv"
# df = pd.read_csv(csv_path, header=1, dtype='unicode', sep=';', low_memory=False, error_bad_lines=False)
df = pd.read_csv(csv_path, header=1, dtype='unicode', sep=';', index_col=False)
print("I read in a dataframe with {} columns and {} rows.".format(
    len(df.columns), len(df)
))
filename = 1
# column increment
x = 30 * 59
for column in df:
    loc = df.columns.get_loc(column)
    if loc == (x * filename) + 1:
        y = filename - 1
        a = (x * y) + 1
        b = (x * filename) + 1
        date_df = df.iloc[:, :1]
        out_df = df.iloc[:, a:b]
        final_df = pd.concat([date_df, out_df], axis=1, join='inner')
        out_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4-part" + str(filename) + ".csv"
        final_df.to_csv(out_path, index=False)
        # out_df.to_csv(out_path)
        filename += 1
# This should be the same as df, but with only the first column.
# Check it with similar code to above.
EDIT:
Following https://github.com/pandas-dev/pandas/issues/19383, I added:
final_df.columns = final_df.iloc[0]
final_df = final_df.reindex(final_df.index.drop(0)).reset_index(drop=True)
So, full code:
import pandas as pd

csv_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4.csv"
# df = pd.read_csv(csv_path, header=1, dtype='unicode', sep=';', low_memory=False, error_bad_lines=False)
df = pd.read_csv(csv_path, header=1, dtype='unicode', sep=';', index_col=False)
print("I read in a dataframe with {} columns and {} rows.".format(
    len(df.columns), len(df)
))
filename = 1
# column increment
x = 30 * 59
for column in df:
    loc = df.columns.get_loc(column)
    if loc == (x * filename) + 1:
        y = filename - 1
        a = (x * y) + 1
        b = (x * filename) + 1
        date_df = df.iloc[:, :1]
        out_df = df.iloc[:, a:b]
        final_df = pd.concat([date_df, out_df], axis=1, join='inner')
        out_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4-part" + str(filename) + ".csv"
        final_df.columns = final_df.iloc[0]
        final_df = final_df.reindex(final_df.index.drop(0)).reset_index(drop=True)
        final_df.to_csv(out_path, index=False)
        # out_df.to_csv(out_path)
        filename += 1
# This should be the same as df, but with only the first column.
# Check it with similar code to above.
Now the entire first row is gone. But the expected output is for the header row to be replaced with the reset index, without the ".1", ".2", etc.
Screenshot:
The SimFin ID row is no longer there.
This is how I did it:
final_df.columns = final_df.columns.str.split('.').str[0]
Reference:
https://pandas.pydata.org/pandas-docs/stable/text.html
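As a quick illustration of what that does to the mangled names (the column names here are made up), note that a genuine period in a name is also truncated, which is the caveat the next answer addresses:
import pandas as pd

final_df = pd.DataFrame(columns=['SimFin ID', 'SimFin ID.1', 'Company.Name'])
final_df.columns = final_df.columns.str.split('.').str[0]
# columns are now ['SimFin ID', 'SimFin ID', 'Company']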
The solution below ensures that other column names containing a period ('.') in the dataframe do not get modified:
import pandas as pd
from csv import DictReader
csv_file_loc = "file.csv"
# Read csv
df = pd.read_csv(csv_file_loc)
# Get column names from csv file using DictReader
col_names = DictReader(open(csv_file_loc, 'r')).fieldnames
# Rename columns
df.columns = col_names
I know I'm pretty late to the draw on this one, but I'm leaving the solution I came up with in case anyone else wanders across this as I have.
Firstly, the linked question has a pretty nice and dynamic solution that seems to work well even for high column counts. I came across that after I made my solution, haha. Check it out here. Another answer on this thread utilizes the csv library to read and use the column names from that, as it doesn't seem to modify duplicates like Pandas does. That should work fine, but I just wanted to avoid using any extra libraries, especially considering I was originally using csv and then upgraded to Pandas for better functionality.
Now here's my solution. I'm sure it could be done more nicely but this does the job for what I needed and is pretty dynamic, from what I can tell. It basically goes through the columns, checks if it can split the string based on the rightmost "." (that's the rpartition), then does a few more checks from there.
It checks:
Is this string in the colMap? The colMap keeps track of all of the column names, duplicate or not. If this comes back true, then that means it's a duplicate of another column that came before it.
Is the string after the rightmost "." a number? All of the columns are strings, so this just makes sure that whatever it is can be converted into a number, to prevent grabbing some other random column that meets the previous criteria but isn't actually a dupe created by Pandas. E.g. "DupeCol" and "DupeCol.Stuff" wouldn't get picked up, but "DupeCol" and "DupeCol.1" would.
Does the number after the rightmost "." match up with the current count of duplicates in the colMap? Since the colMap contains all of the column names, duplicates or not, this ensures that we're not grabbing a user-named column that happens to overlap with the ".number" convention that Pandas uses. E.g. if a user had named two columns "DupeCol" and "DupeCol.6", the latter wouldn't get picked up unless there were six "DupeCol"s preceding it, indicating that it almost had to be Pandas that named it that way rather than the user. This part is definitely a bit overkill, but I felt like being extra thorough.
colMap = []
for col in df.columns:
    if col.rpartition('.')[0]:
        colName = col.rpartition('.')[0]
        inMap = col.rpartition('.')[0] in colMap
        lastIsNum = col.rpartition('.')[-1].isdigit()
        dupeCount = colMap.count(colName)
        if inMap and lastIsNum and (int(col.rpartition('.')[-1]) == dupeCount):
            colMap.append(colName)
            continue
    colMap.append(col)
df.columns = colMap
Hopefully this helps someone! Feel free to comment if you think it could use any improvements. I don't entirely love using "continue" in my code, but I'm not sure if that's because it's actually bad practice or just me reading random people complain about it too much. I think it doesn't make the code too unreadable here and prevents the need for duplicating the "else" statement; but let me know if there's a way to improve that or anything otherwise. I'm always looking to learn!
If you know the types of all the data, you may consider loading the CSV without a header first.
df = pd.read_csv(csv_file, header=None)
df.columns = df.iloc[0] # replace column with first row
df = df.drop(0) # remove the first row
(Note that drop removes the row by its index label; this assumes the index is unique, which may not hold if you use the index_col argument of pd.read_csv.)
Caveat: the above solution causes you to lose dtype information.
One way to fix that is:
# turn each column into numeric
df = df.apply(lambda col: pd.to_numeric(col, errors='ignore'), axis=0)
Otherwise, you may consider reading the CSV twice to get the dtype information and applying the correct conversion.
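A rough sketch of that two-pass idea, assuming a clean numeric CSV (the file name and details are illustrative): the first pass lets pandas infer dtypes, the second pass keeps the duplicated header row intact, and the inferred dtypes are then re-applied by position.
import pandas as pd

csv_file = 'file.csv'  # placeholder
inferred = pd.read_csv(csv_file)         # pass 1: pandas infers dtypes (names get mangled, but only the dtypes are kept)
df = pd.read_csv(csv_file, header=None)  # pass 2: duplicate header values survive as data
df.columns = df.iloc[0]                  # restore the original (possibly duplicated) header row
df = df.drop(0).reset_index(drop=True)
for pos, dtype in enumerate(inferred.dtypes):
    # re-apply dtypes by position, since column names may be duplicated
    df.iloc[:, pos] = df.iloc[:, pos].astype(dtype)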

Writing excel files in Python

I posted part of this question a couple of days ago and got a good answer, but that solved just part of my problem.
So, I have an Excel file that needs some data mining done on it, and afterwards it needs to be written out as another Excel file in the same .xlsx format.
The problem is that I get a strange column after I write the file, which cannot be seen before writing when inspecting it in Anaconda, and that makes it harder to develop a strategy to counter its appearance. Initially I thought I had solved the problem by reducing the column width to 0, but apparently at some point the file needs to be converted to text and then the column reappears.
For more details here is part of my code:
import os
import pandas as pd
import numpy as np
import xlsxwriter
# Retrieve current working directory (`cwd`)
cwd = os.getcwd()
cwd
# Change directory
os.chdir("/Users/s7c/Documents/partsstop")
# Assign spreadsheet filename to `file`
file = 'SC daily inventory retrieval columns for reports.xlsx'
# Load spreadsheet
xl = pd.ExcelFile(file)
# Load a sheet into a DataFrame by name: df
df = xl.parse('Sheet1')
#second file code:
#select just the columns we need and rename them:
df2 = df.iloc[:, [1, 3, 6, 9]]
df2.columns = ['Manufacturer Code', 'Part Number', 'Qty Available', 'List Price']
#then select just the rows we need:
df21 = df2[df2['Manufacturer Code'].str.contains("DRP")]#13837 entries
#select just the DRP, first 3 characters and dropping the ones after:
df21['Manufacturer Code'] = df21['Manufacturer Code'].str[:3]
#add a new column:
#in order to do that we need to convert the next column to numeric:
df21['List Price'] = pd.to_numeric(df21['List Price'], errors='coerce')
df21['Dealer Price'] = df21['List Price'].apply(lambda x: x*0.48) #new column equals half of other column
writer = pd.ExcelWriter('example2.xlsx', engine='xlsxwriter')
# Write your DataFrames to a file
df21.to_excel(writer, 'Sheet1')
The actual view of the problem:
Any constructive idea is appreciated. Thanks!
This column seems to be the index of your DataFrame. You can exclude it by passing index=False to to_excel().
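A minimal sketch of the writing step with the index suppressed (using a context manager so the workbook is saved and closed automatically):
with pd.ExcelWriter('example2.xlsx', engine='xlsxwriter') as writer:
    # index=False keeps the DataFrame index out of the output file
    df21.to_excel(writer, sheet_name='Sheet1', index=False)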
