I'm currently downloading a CSV from a database (using PgAdmin) and using a Python script to re-format and filter the rows to import somewhere else. However, I'm experiencing a very strange bug.
If I try running the script using the CSV that I downloaded from the database, it transforms all dates in one of the columns into blanks (NaN). However, if I open that same document in Excel beforehand, and 'Save As' into another CSV file, my script transforms all the dates correctly into the format desired (dd/mm/yyyy).
Here's a minimal reproduction case:
import pandas as pd
file_path = r'C:\Users\MiguelTavares\Desktop\from_database.csv'
data = pd.read_csv(file_path)
data['start_date'] = pd.to_datetime(data['start_date'], errors='coerce', format='%d/%m/%Y')
print(data)
The CSV looks something like this:
column1 column2 start_date
test1 test2 26/06/2019
test11 test22 25/07/2019
I believe this all happens because I'm passing errors='coerce'. However, I need to pass it because otherwise I get a ValueError, and I need this information as datetimes so I can do calculations with it later on.
ValueError: time data '2019-06-26' does not match format '%d/%m/%Y' (match)
The format (.csv) and encoding (UTF-8) are the same in the file from the database and in the file I 'Saved As', as is the content within. So why does my script work perfectly with the duplicate I 'Saved As', but not with the one from the database?
Thanks in advance!
Just this should work; if it doesn't, then some value in the start_date column has a different format:
df = pd.read_csv('test.csv', sep=r'\s+')  # the sample shown above is whitespace-separated
df['start_date'] = pd.to_datetime(df['start_date'])
print(df)
column1 column2 start_date
0 test1 test2 2019-06-26
1 test11 test22 2019-07-25
import pandas as pd
file_path = r'C:\Users\MiguelTavares\Desktop\from_database.csv'
# parse dates while reading the csv; dayfirst=True parses the dd/mm/yyyy format
data = pd.read_csv(file_path, parse_dates=['start_date'], dayfirst=True)
print(data)
This should work.
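If the end goal is dd/mm/yyyy text for the downstream import, here is a minimal sketch of the full round trip, using the question's path (reformatted.csv is a hypothetical output name): parse on read, format on write.
import pandas as pd
file_path = r'C:\Users\MiguelTavares\Desktop\from_database.csv'
# parse on read; dayfirst=True also copes with dd/mm/yyyy text
data = pd.read_csv(file_path, parse_dates=['start_date'], dayfirst=True)
# do any datetime arithmetic here, then render the dd/mm/yyyy text on the way out
data['start_date'] = data['start_date'].dt.strftime('%d/%m/%Y')
data.to_csv('reformatted.csv', index=False)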
I'm using python and pandas to query a table in SQL, store it in a DataFrame, then write it to an excel file (.xlsx).
I'm then using a couple of VBA macros to loop through the columns and do some conditional formatting to highlight outliers.
Everything works fine except the date column, which Excel gets stuck on, presenting the error:
"Method 'Average' of object 'WorksheetFunction' failed"
The date is being stored as a string in the format '20-01-2022', which is presumably causing the error, so I need to convert it to an actual datetime format that Excel will recognise upon opening the file.
Example:
import pandas as pd
df = pd.DataFrame([[1, '21-06-2022'], [2, '19-08-2022'], [3, '06-04-2022']], columns=['id', 'date'])
df.to_excel("output.xlsx")
If you then open "output.xlsx" and try to use conditional formatting on the 'date' column, or try =AVERAGE(C2:C4), either nothing happens or you get an error. If you double-click into a cell, Excel suddenly recognises the date, but that workaround isn't practical with thousands of cells.
How can I convert dates to a format that excel will recognise immediately upon opening the file?
Before saving your df to Excel, you need to parse those date strings into real dates.
There are several ways to do that.
You can use the pandas.read_sql keyword argument parse_dates to parse specific columns as dates directly, even specifying the format per column.
import pandas as pd
df = pd.read_sql(
    sql,
    con,
    parse_dates={
        "<col1>": {"format": "%y-%m-%d"},
        "<col2>": {"format": "%d/%m/%y"},
    },
)
Same as above but without a format: the columns are parsed as datetimes, and the dates can then be extracted.
import pandas as pd
df = pd.read_sql(sql, con, parse_dates=["<col1>", "<col2>"])
# .dt only exists on a Series, so extract the date column by column
df[["<col1>", "<col2>"]] = df[["<col1>", "<col2>"]].apply(lambda s: s.dt.date)
You can also load first, then parse manually with pd.to_datetime and again extract only the dates.
import pandas as pd
df = pd.read_sql(sql, con)
# pd.to_datetime expects a Series, so parse each column separately
for col in ["<col1>", "<col2>"]:
    df[col] = pd.to_datetime(df[col]).dt.date
Or, if the strings really are ISO (yyyy-mm-dd), you could just parse with datetime.date.fromisoformat.
import pandas as pd
from datetime import date
df = pd.read_sql(sql, con)
# applymap applies the parser element-wise (renamed to DataFrame.map in pandas 2.1+)
df[["<col1>", "<col2>"]] = df[["<col1>", "<col2>"]].applymap(date.fromisoformat)
NB: the methods are listed in no particular order, but the first seems slightly faster than the others, while also being the most elegant (in my opinion).
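As a quick sanity check with the question's own sample, any of the above leaves you with datetime.date objects, which Excel recognises immediately; a minimal sketch, assuming the dd-mm-yyyy strings from the question:
import pandas as pd
df = pd.DataFrame([[1, '21-06-2022'], [2, '19-08-2022'], [3, '06-04-2022']], columns=['id', 'date'])
# parse the dd-mm-yyyy strings, then keep only the date part
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y').dt.date
df.to_excel("output.xlsx")  # Excel now sees real dates, so AVERAGE works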
So, I am handling text responses from surveys, and it is common to have responses that start with -, for example: -I am sad today.
Excel would interpret it as #NAME?
So when I import the Excel file into pandas using read_excel, it shows NaN.
Is there any method to force Excel to retain the raw strings instead of interpreting them as formulas?
I created a VBA macro that assigns the text format to the entire column and clicks through all the cells, which is slow with ten thousand plus rows.
I was hoping to do it at the Python level instead. Any ideas?
I hope this works for your case: use openpyxl to extract the Excel data and then convert it into a pandas DataFrame.
from openpyxl import load_workbook
import pandas as pd
# grab the active worksheet; openpyxl returns the stored (raw) cell values
ws = load_workbook(filename='./formula_contains_raw.xlsx').active
rows = list(ws.values)  # the first row holds the headers
df = pd.DataFrame(rows[1:], columns=rows[0])
df.head()
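Since openpyxl hands back the stored formula text (e.g. '=-I am sad today'), you can then recover the raw responses by stripping the leading '='; a small follow-up sketch, with Col1 as a hypothetical column name:
# drop the '=' Excel prepended when it treated the text as a formula
df['Col1'] = df['Col1'].str.lstrip('=')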
It works for me using a CSV instead of an Excel file.
In the CSV file (opened in Excel) I need to select Formulas / Show Formulas, then save the file.
pd.read_csv('draft.csv')
Output:
Col1
0 hello
1 =-hello
I am reading ~50 files and adding them to the same table consecutively; there is one for each month over the past few years. After the first year, the date format presented in the CSV files shifted from YYYY-mm-dd to mm/dd/YYYY.
SQL Server is fine with the date format YYYY-mm-dd, which is what it expects, but once the format switched in the CSV my program would crash.
I wrote a piece of code to try and convert the data to the correct format, but it didn't work, as shown here:
if '/' in df['SubmissionDate'].iloc[0]:
    df['SubmissionDate'] = pd.to_datetime(df['SubmissionDate'], format='%m/%d/%Y')
I believe that this would have worked, barring the issue that some of the rows of data have no date, so I need to either find some other way to allow the SQL Insert statement to accept this different date format, or avoid trying to convert the blank items in the Submission Date column.
Any help would be greatly appreciated!
It sounds like you are not using parse_dates= when loading the CSV file into the DataFrame. The date parser seems to be able to handle multiple date formats, even within the same file:
import io
import pandas as pd
csv = io.StringIO(
"""\
id,a_date
1,2001-01-01
2,1/2/2001
3,12/31/2001
4,31/12/2001
"""
)
df = pd.read_csv(csv, parse_dates=["a_date"])
print(df)
"""
id a_date
0 1 2001-01-01
1 2 2001-01-02
2 3 2001-12-31
3 4 2001-12-31
"""
I have an excel file that contains the names of 60 datasets.
I'm trying to write a piece of code that "enters" the Excel file, accesses a specific dataset (whose name is in the Excel file), gathers and analyses some data and finally, creates a new column in the Excel file and inserts the information gathered beforehand.
I can do most of it, except for the part of adding a new column and entering the data.
I was trying to do something like this:
path_data = **the path to the excel file**
recap = pd.read_excel(os.path.join(path_data,'My_Excel.xlsx')) # where I access the Excel file
recap['New information Column'] = Some Value
Is this a correct way of doing this? If not, can someone suggest a better way (one that works, ehehe)?
Thank you a lot!
You can import the excel file into python using pandas.
import pandas as pd
df = pd.read_excel(r'Path\Filename.xlsx')
print(df)
If you have many sheets, then you could do this:
import pandas as pd
df = pd.read_excel(r'Path\Filename.xlsx', sheet_name='sheetname')
print(df)
To add a new column you could do the following:
df['name of the new column'] = 'things to add'
Then when you're ready, you can export it as xlsx:
# openpyxl must be installed for pandas to write .xlsx files
df.to_excel(r'Path\filename.xlsx')
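Putting the pieces together for the question's workflow, a sketch assuming path_data and the value to insert are defined as in the question (both are placeholders here):
import os
import pandas as pd
path_data = r'C:\data'          # placeholder path, as in the question
some_value = 'analysis result'  # placeholder for the gathered information
path = os.path.join(path_data, 'My_Excel.xlsx')
recap = pd.read_excel(path)
recap['New information Column'] = some_value
recap.to_excel(path, index=False)  # write the new column back to the same file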
I am a student and I am learning pandas.
I have created an Excel file named Student_Record.xlsx (using Microsoft Excel).
I wanted to create a new file using pandas:
import pandas as pd
df = pd.read_excel(r"C:\Users\sudarshan\Desktop\Student_Record.xlsx")
df.head()
df.to_excel(r"C:\Users\sudarshan\Desktop\Output.xlsx",index=False)
I opened the file in pandas and saved it back to Excel under a different name (file name = Output).
But when I open the file (Output) in MS Excel, the columns (DOB and YOP) have a timestamp attached to the dates.
Please let me know how to print only the date? (I want the Output file and its contents to look exactly like the original file.)
Hope to get some help/support.
Thank you
Probably your DOB and Year of passing columns are of datetime format before they are saved to Excel. As a result, they get written to Excel in the full datetime representation.
If you want the contents to look exactly like the original file, in dd-mm-YYYY format, you can convert these two columns to strings before saving to Excel:
df['DOB'] = df['DOB'].dt.strftime('%d-%m-%Y')
df['Year of passing'] = df['Year of passing'].dt.strftime('%d-%m-%Y')
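Alternatively, here is a sketch that keeps the columns as real datetimes (so Excel can still do date arithmetic) and only controls how they are displayed, via the date formats on pandas' ExcelWriter (assuming a writer engine that honours them):
import pandas as pd
df = pd.read_excel(r"C:\Users\sudarshan\Desktop\Student_Record.xlsx")
# the writer's date formats set the display format Excel uses for date cells
with pd.ExcelWriter(r"C:\Users\sudarshan\Desktop\Output.xlsx",
                    date_format='DD-MM-YYYY',
                    datetime_format='DD-MM-YYYY') as writer:
    df.to_excel(writer, index=False)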