I'm using python and pandas to query a table in SQL, store it in a DataFrame, then write it to an excel file (.xlsx).
I'm then using a couple of VBA macros to loop through the columns and do some conditional formatting to highlight outliers.
Everything works fine except for the date column, which Excel gets stuck on and raises an error:
"Method 'Average' of object 'WorksheetFunction' failed"
The date is being stored as a string in the format '20-01-2022', which is presumably causing the error, so I need to convert it to an actual date format that Excel will recognise upon opening the file.
Example:
import pandas as pd
df = pd.DataFrame([[1, '21-06-2022'], [2, '19-08-2022'], [3, '06-04-2022']], columns=['id', 'date'])
df.to_excel("output.xlsx")
If you then open "output.xlsx" and try to use conditional formatting on the 'date' column, or try =AVERAGE(C2:C4), either nothing happens or you get an error. If you double-click into a cell, Excel will suddenly recognise it as a date, but that workaround isn't practical with thousands of cells.
How can I convert dates to a format that excel will recognise immediately upon opening the file?
Before saving your df to Excel, you need to parse those date strings into actual dates.
There are several ways to do that.
You can use the pandas.read_sql keyword argument parse_dates to parse specific columns as dates, even specifying the format, which can parse as dates directly.
import pandas as pd
df = pd.read_sql(
    sql,
    con,
    parse_dates={
        "<col1>": {"format": "%y-%m-%d"},
        "<col2>": {"format": "%d/%m/%y"},
    },
)
Same as above but without a format: the columns are parsed as full datetimes, from which the dates can then be extracted.
import pandas as pd
df = pd.read_sql(sql, con, parse_dates=["<col1>", "<col2>"])
# .dt only works on a Series, so extract the dates column by column
df[["<col1>", "<col2>"]] = df[["<col1>", "<col2>"]].apply(lambda s: s.dt.date)
You can load first and then parse manually with pd.to_datetime, again extracting only the dates.
import pandas as pd
df = pd.read_sql(sql, con)
# pd.to_datetime works on one Series at a time, so apply it per column
df[["<col1>", "<col2>"]] = df[["<col1>", "<col2>"]].apply(lambda s: pd.to_datetime(s).dt.date)
Or, if the strings are already in ISO format (YYYY-MM-DD), you could simply parse them with datetime.date.fromisoformat.
import pandas as pd
from datetime import date
df = pd.read_sql(sql, con)
df[["<col1>", "<col2>"]] = df[["<col1>", "<col2>"]].applymap(date.fromisoformat)
NB. No specific ordering was used, but the first method seems to be slightly faster than the others, while also being the most elegant (in my opinion).
Related
I'm currently downloading a CSV from a database (using PgAdmin) and using a Python script to re-format and filter the rows to import somewhere else. However, I'm experiencing a very strange bug.
If I try running the script using the CSV that I downloaded from the database, it transforms all dates in one of the columns into blanks (NaN). However, if I open that same document in Excel beforehand, and 'Save As' into another CSV file, my script transforms all the dates correctly into the format desired (dd/mm/yyyy).
Here's a minimal reproduction case:
import pandas as pd
file_path = r'C:\Users\MiguelTavares\Desktop\from_database.csv'
data = pd.read_csv(file_path)
data['start_date'] = pd.to_datetime(data['start_date'], errors='coerce', format='%d/%m/%Y')
print(data)
The CSV looks something like this:
column1 column2 start_date
test1 test2 26/06/2019
test11 test22 25/07/2019
I believe this all happens because I'm passing errors='coerce'. However, I need to pass it, because without it I get a ValueError, and I need this information as datetimes so I can do calculations with it later on.
ValueError: time data '2019-06-26' does not match format '%d/%m/%Y' (match)
The format (.csv) and encoding (UTF-8) of the CSV files is the same in the file from the database, and the file which I 'Saved As', as well as the content within. So why is my script working perfectly with the duplicate I 'Saved As', but not the one from the database?
Thanks in advance!
Just this should work. If it doesn't, then some value in the start_date column has a different format; in fact your ValueError already shows one: '2019-06-26' is an ISO date, which doesn't match '%d/%m/%Y', so errors='coerce' turns those rows into NaT:
df = pd.read_csv('test.csv', sep='\s+')
df['start_date'] = pd.to_datetime(df['start_date'])
print(df)
column1 column2 start_date
0 test1 test2 2019-06-26
1 test11 test22 2019-07-25
import pandas as pd
file_path = r'C:\Users\MiguelTavares\Desktop\from_database.csv'
# parse dates while reading the csv; dayfirst=True parses the DD/MM/YYYY format
data = pd.read_csv(file_path, parse_dates=['start_date'], dayfirst=True)
print(data)
This should work.
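To see why the database export turned into blanks while the re-saved file worked, here is a small sketch reproducing the coercion (the ISO sample value is taken from the ValueError in the question):

```python
import pandas as pd

# ISO strings, as the database export apparently contains
s = pd.Series(['2019-06-26', '2019-07-25'])

# forcing the dd/mm/yyyy format coerces every ISO value to NaT
coerced = pd.to_datetime(s, errors='coerce', format='%d/%m/%Y')

# the matching format parses the same strings without any loss
parsed = pd.to_datetime(s, format='%Y-%m-%d')
```

Excel's 'Save As' silently rewrote the dates into dd/mm/yyyy, which is why the duplicate file matched the format and parsed cleanly.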
I am using openpyxl and pandas to generate an Excel file, and need to have dates formatted as Date in Excel. The dates in exported file are formatted correctly in dd/mm/yyyy format but when I right-click on a cell and go to 'Format Cells' it shows Custom, is there a way to change to Date? Here is my code where I specify date format.
writer = pd.ExcelWriter(dstfile, engine='openpyxl', date_format='dd/mm/yyyy')
I have also tried to set cell.number_format = 'dd/mm/yyyy' but still getting Custom format in Excel.
The answer can be found in the comments of Converting Data to Date Type When Writing with Openpyxl.
Ensure you are writing a datetime.datetime object to the cell, then:
.number_format = 'mm/dd/yyyy;#' # notice the ';#'
e.g.,
import datetime
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
ws['A1'] = datetime.datetime(2021, 12, 25)
ws['A1'].number_format = 'yyyy-mm-dd;#'
wb.save(r'c:\data\test.xlsx')
n.b. these dates are still a bit 'funny' as they are not auto-magically grouped into months and years in pivot tables (if you like that sort of thing). In the pivot table, you can manually click on them and set the grouping though: https://support.microsoft.com/en-us/office/group-or-ungroup-data-in-a-pivottable-c9d1ddd0-6580-47d1-82bc-c84a5a340725
You might have to convert them to datetime objects in Python if they are saved as strings in the data frame. One approach is to iterate over the cells after using ExcelWriter, writing a real datetime and then setting the format:
from datetime import datetime
cell = ws['A1']  # assumes ws is the active worksheet
cell.value = datetime.strptime('30/12/1999', '%d/%m/%Y')
cell.number_format = 'dd/mm/yyyy'
A better approach is to convert that column in the data frame beforehand. You can use the to_datetime function in pandas for that.
See this answer for converting the whole column in the dataframe.
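As a sketch of that column-wise conversion (the column name and format here are hypothetical):

```python
import pandas as pd

# hypothetical frame whose dates arrived as dd/mm/yyyy strings
df = pd.DataFrame({'when': ['30/12/1999', '01/02/2000']})

# convert the whole column to real datetimes before handing the frame
# to ExcelWriter, so openpyxl writes dates rather than text
df['when'] = pd.to_datetime(df['when'], format='%d/%m/%Y')
```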
I would like to only import a subset of a csv as a dataframe as it is too large to import the whole thing. Is there a way to do this natively in pandas without having to set up a database like structure?
I have tried importing only a chunk and then concatenating, but the result is still too large and causes a memory error. I have hundreds of columns, so manually specifying dtypes could help, but that would likely be a major time commitment.
df_chunk = pd.read_csv("filename.csv", chunksize=1e7)
df = pd.concat(df_chunk,ignore_index=True)
You may use the skiprows and nrows arguments of the read_csv function to load only a subset of rows from your original file. Note that an integer skiprows skips the header line too; to keep the header, pass a range instead (e.g. skiprows=range(1, 5)).
For instance:
import pandas as pd
df = pd.read_csv("test.csv", skiprows = 4, nrows=10)
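If the width rather than the length is the problem, read_csv's usecols argument loads only a subset of columns; a sketch using an in-memory file (the column names are made up):

```python
import io
import pandas as pd

# stand-in for a wide CSV on disk
csv_data = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")

# read only the needed columns, and cap the number of rows as well
df = pd.read_csv(csv_data, usecols=['a', 'c'], nrows=1)
```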
I'm trying to take a dictionary object in python, write it out to a csv file, and then read it back in from that csv file.
But it's not working. When I try to read it back in, it gives me the following error:
EmptyDataError: No columns to parse from file
I don't understand this for two reasons. Firstly, if I used pandas' very own to_csv method, it should be giving me the correct format for a CSV. Secondly, when I print out the header values of the dataframe that I'm trying to save (by doing print(df.columns.values)), it says I do in fact have headers ("one" and "two"). So if the object I was writing out had column names, I don't know why they wouldn't be found when I try to read it back.
import pandas as pd
testing = {"one":1,"two":2 }
df = pd.DataFrame(testing, index=[0])
file = open('testing.csv','w')
df.to_csv(file)
new_df = pd.read_csv("testing.csv")
What am I doing wrong?
Thanks in advance for the help!
pandas.DataFrame.to_csv does accept an open file object, but you never close it, so nothing is flushed to disk before read_csv tries to read the (still empty) file. The simplest fix is to pass the path directly and let pandas manage the file itself; pass index=False to skip the index column.
import pandas as pd
testing = {"one":1,"two":2 }
df = pd.DataFrame(testing, index=[0])
df.to_csv('testing.csv', index = False)
new_df = pd.read_csv("testing.csv")
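A file handle does work too, as long as it is closed (and therefore flushed) before the file is read back; a sketch with a context manager:

```python
import pandas as pd

df = pd.DataFrame({"one": [1], "two": [2]})

# the with-block closes and flushes the handle before read_csv runs
with open('testing.csv', 'w', newline='') as f:
    df.to_csv(f, index=False)

new_df = pd.read_csv('testing.csv')
```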
I am trying to parse this CSV data, which has quotes in an unusual pattern and a semicolon at the end of each row.
I am not able to parse this file correctly using pandas.
Here is the link to the data (pastebin was for some reason not recognising it as text/csv, so please ignore whatever formatting it picked):
https://paste.gnome.org/pr1pmw4w2
I have tried using "," as the delimiter, and a plain pandas DataFrame construction with only the file name as a parameter.
header = ["Organization_Name","Organization_Name_URL","Categories","Headquarters_Location","Description","Estimated_Revenue_Range","Operating_Status","Founded_Date","Founded_Date_Precision","Contact_Email","Phone_Number","Full_Description","Investor_Type","Investment_Stage","Number_of_Investments","Number_of_Portfolio_Organizations","Accelerator_Program_Type","Number_of_Founders_(Alumni)","Number_of_Alumni","Number_of_Funding_Rounds","Funding_Status","Total_Funding_Amount","Total_Funding_Amount_Currency","Total_Funding_Amount_Currency_(in_USD)","Total_Equity_Funding_Amount","Total_Equity_Funding_Amount_Currency","Total_Equity_Funding_Amount_Currency_(in_USD)","Number_of_Lead_Investors","Number_of_Investors","Number_of_Acquisitions","Transaction_Name","Transaction_Name_URL","Acquired_by","Acquired_by_URL","Announced_Date","Announced_Date_Precision","Price","Price_Currency","Price_Currency_(in_USD)","Acquisition_Type","IPO_Status,Number_of_Events","SimilarWeb_-_Monthly_Visits","Number_of_Founders","Founders","Number_of_Employees"]
pd.read_csv("data.csv", sep=",", encoding="utf-8", names=header)
First, read the data in normally; all of the fields will land in the first column. You can then use the pyparsing module to split each row on the top-level commas and assign the pieces back. You just need to do this for every row.
import pyparsing as pp
import pandas as pd
df = pd.read_csv('input.csv')
df.loc[0] = pp.commaSeparatedList.parseString(df['Organization Name'][0]).asList()
Output
df  # (since there are 42 columns, pasting just a snippet)
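To split every row without calling parseString by hand, the same quote-aware splitting can be applied row-wise; a sketch using the standard library's csv module instead of pyparsing (the sample strings here are made up):

```python
import csv
import io
import pandas as pd

# hypothetical frame where each row's fields landed in one string
df = pd.DataFrame({'raw': ['a,"b,c",d', 'e,f,"g,h"']})

# csv.reader honours the quoting, so embedded commas stay in their field
split_rows = df['raw'].apply(lambda s: next(csv.reader(io.StringIO(s))))
```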