pd.read_excel automatically parses the column names as dates, and parses them wrongly: the dates are in dd/mm/yy, but pandas reads them as mm/dd/yy.
Code used:
df = pd.read_excel('check.xlsx')
print(df)
The printed df has the dates parsed in the wrong format.
Here's the Excel file: https://docs.google.com/spreadsheets/d/1rgl0Je5EyxpBunk7FWPHcpZxXFdUZUni/edit?usp=drivesdk&ouid=109057655084381529864&rtpof=true&sd=true . The column names are in dd/mm/Y format.
Use strftime with a format string such as '%Y-%m-%d' to format the dates as you wish,
e.g.
import pandas as pd
df = pd.DataFrame({"Date": ["26-12-2007", "27-12-2007", "28-12-2007"]})
# The input strings are day-first, so state the format explicitly when parsing
df["Date"] = pd.to_datetime(df["Date"], format='%d-%m-%Y').dt.strftime('%Y-%m-%d')
print(df)
Output:
         Date
0  2007-12-26
1  2007-12-27
2  2007-12-28
You can also set the column labels to the values in the first row, e.g.
df.columns = df.iloc[0]
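If the header cells in the workbook are real Excel date cells, read_excel typically hands them back as datetime objects rather than strings, so the labels can be rendered in whatever layout you need. A minimal sketch under that assumption; the frame below is made up to stand in for check.xlsx, and format_label is a hypothetical helper, not a pandas function:

```python
from datetime import datetime

import pandas as pd

# Made-up stand-in for the frame read from check.xlsx (assumption: the header
# cells are date cells, so the column labels arrive as datetime objects).
df = pd.DataFrame(
    [[10, 20]],
    columns=[datetime(2021, 1, 12), datetime(2021, 2, 12)],
)


def format_label(label, fmt="%d/%m/%y"):
    """Hypothetical helper: render datetime labels as text, leave others alone."""
    if isinstance(label, datetime):
        return label.strftime(fmt)
    return label


# Replace the datetime labels with dd/mm/yy strings
df.columns = [format_label(c) for c in df.columns]
print(list(df.columns))
```

Formatting only changes how a label is written. If day and month were already swapped when the workbook was created, the cells would have to be corrected at the source.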
Related
I have a chronologically unordered list stored in an Excel sheet called 'Compilado' that I want to sort correctly to prepare for data analysis.
I parsed it into a pandas DataFrame by running:
df = pandas.read_excel(r'C:\Users\KMBGSI\Downloads\Historico de Alertas.xlsx',sheet_name='Compilado', header=None, names= header_list, index_col=None, parse_dates=[0])
Dataframe preview from df.head(): [image]
Output of df.info(): [image]
so the dtypes are okay.
The 'Data' column seems to have been parsed correctly and is shown in the format dd/mm/YYYY.
However, when I run the code below, the values in the 'Data' column seem to change format:
df.sort_values(by=['Data'], inplace=True)
df.head()
[image: dataframe preview after sorting by 'Data' column values]
I know '2021-01-12' is actually a wrongly formatted '2021-12-01', because my dataset begins on 01/09/2021 (2021-09-01).
Why does this happen?
How can I reorder this dataset while keeping the datetime64[ns] values correctly formatted?
Thanks! Kind regards,
Full code for reference:
import os, sys, pandas, numpy, matplotlib, seaborn
header_list = ['Data', 'Hora', 'Status']
df = pandas.read_excel(r'C:\Users\KMBGSI\Downloads\Historico de Alertas.xlsx',sheet_name='Compilado', header=None, names= header_list, index_col=None, parse_dates=[0])
# commented out after checking that the dataframe is okay before proceeding
#df.info()
#df.head()
#df.tail()
df.sort_values(by=['Data'], inplace=True)
df.head()
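df.sort_values() only reorders rows; it does not change the stored values. More likely, ambiguous day/month values were already read month-first when the file was parsed (so 01/12/2021 became 2021-01-12), and sorting simply brings those rows to the top.
If the cells hold text in dd/mm/YYYY, remove parse_dates from read_excel and parse the column with an explicit format before sorting:
df['Data'] = pd.to_datetime(df['Data'], format='%d/%m/%Y')
To change only how the dates are displayed after df.sort_values, use:
df['Data'] = df['Data'].dt.strftime('%d/%m/%Y')
However, the datatype of the column will then change to object (strings).
To convert back to datetime64[ns], pass the same format:
df['Data'] = pd.to_datetime(df['Data'], format='%d/%m/%Y')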
You can check the documentation of df.sort_values() here:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
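A small sketch of parsing day-first text with an explicit format and then sorting. The three date strings are made up for illustration, and it assumes the column holds text in dd/mm/YYYY. Sorting reorders the rows and leaves each timestamp as it was parsed:

```python
import pandas as pd

# Made-up day-first strings; 01/12/2021 means 1 December 2021
raw = pd.DataFrame({"Data": ["01/12/2021", "13/09/2021", "01/09/2021"]})

# An explicit format removes the day/month ambiguity
raw["Data"] = pd.to_datetime(raw["Data"], format="%d/%m/%Y")

# Sorting changes the row order only, not the values
sorted_df = raw.sort_values(by="Data").reset_index(drop=True)
print(sorted_df)
```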
Using Python pandas, how can I change this data frame?
First, how do I copy the column name down into another cell (blue)?
Second, how do I delete the row and the index column (orange)?
Third, how do I modify the date format (green)?
I would appreciate any feedback.
Update
df.iloc[1,1] = df.columns[0]
df = df.iloc[1:].reset_index(drop=True)
df.columns = df.iloc[0]
df = df.drop(df.index[0])
df = df.set_index('Date')
print(df.columns)
Question 1 - How to copy a column name into a column (Edit: rename a column)
To rename columns, either assign a new list to df.columns or use pandas.DataFrame.rename:
df.columns = ['Date','Asia Pacific Equity Fund']
# The list must have 2 entries because the dataframe has 2 columns
# Alternatively, rename using pandas.DataFrame.rename
df.rename(columns = {'Asia Pacific Equity Fund':'Date',"Unnamed: 1":"Asia Pacific Equity Fund"}, inplace = True)
df.columns returns all the column labels of the dataframe, and you can access each name by position.
Please refer to "Rename unnamed column pandas dataframe" for how to change unnamed columns.
Question 2 - Delete a row
# Keep the rows from position 1 onward (this drops the first row)
df = df.iloc[1:].reset_index(drop=True)
# Alternatively, remove specific rows by index label
df = df.drop([0,1]).reset_index(drop=True)
Question 3 - Modify the date format
current_format = '%Y-%m-%d %H:%M:%S'
desired_format = "%Y-%m-%d"
# Let pandas infer the existing format
df['Date'] = pd.to_datetime(df['Date']).dt.strftime(desired_format)
# Or pass the existing format explicitly via format=
df['Date'] = pd.to_datetime(df['Date'], format=current_format).dt.strftime(desired_format)
# To update the date format of the index
df.index = pd.to_datetime(df.index, format=current_format).strftime(desired_format)
Please refer to the pandas.to_datetime documentation for more details.
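The three steps can be put together in one short sketch. The frame below is a guess at the layout in the screenshot (a title row, then a row holding 'Date', then the data); its values and the name raw are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the frame in the screenshot
raw = pd.DataFrame({
    "Asia Pacific Equity Fund": [
        "NAV history",          # leftover title row (assumed)
        "Date",                 # row that holds the real header
        "2020-01-02 00:00:00",
        "2020-01-03 00:00:00",
    ],
    "Unnamed: 1": [None, None, 10.5, 10.7],
})

# Steps 1 and 2: drop the two leading rows and set proper column names
body = raw.iloc[2:].reset_index(drop=True)
body.columns = ["Date", raw.columns[0]]

# Step 3: parse with the existing format, then render as YYYY-MM-DD
body["Date"] = pd.to_datetime(
    body["Date"], format="%Y-%m-%d %H:%M:%S"
).dt.strftime("%Y-%m-%d")

# Use the date as the index, replacing the default integer index
body = body.set_index("Date")
print(body)
```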
I'm not sure I understand your questions. I mean, do you actually want to change the dataframe or how it is printed/displayed?
The index can be changed with the methods .set_index() or .reset_index(), or dropped altogether. If you just want to remove the first digit from each index value (that's what I understood from the orange column), create a list with the new index values and assign it to your dataframe.
Regarding the date format, it depends on what you want the new format to be. Take a look at Python's datetime module.
I would strongly suggest taking a closer look at the pandas features and documentation, and at how to handle a dataframe with this library. There are plenty of great sources a Google search away :)
Delete the first two rows using this.
Rename the second column using this.
Work with datetime format using the datetime package. Read about it here
The pandas function "to_datetime" converts strings into datetime objects. Still, it seems to produce different output when converting data frame rows and column labels.
Let df be a data frame with identical date strings in its row values and column labels:
import pandas as pd
d = {'01-01-2001': ['01-01-2001'], '02-02-2002': ['02-02-2002'],'03-03-2003': ['03-03-2003']}
df = pd.DataFrame(data=d)
Now let us convert these data into datetime:
df.columns = pd.to_datetime(df.columns)
df.loc[0] = pd.to_datetime(df.loc[0])
While the column labels show readable datetime data, the row shows a different format. Why? And what should be done to obtain readable data in the data frame row?
You can use strftime to reformat the dates in the row as strings:
df.loc[0] = pd.to_datetime(df.loc[0]).dt.strftime('%Y-%m-%d')
Output
df
Out[400]:
2001-01-01 2002-02-02 2003-03-03
0 2001-01-01 2002-02-02 2003-03-03
Thanks in advance for any help. This is a multi-part question.
I have zip files that contain pricing info for multiple stocks. The current format is as follows.
The header row is:
ticker,date,open,high,low,close,vol
and first row example is
AAPL,201906030900,176.32,176.32,176.24,176.29,2247
desired format:
header
ticker,date,time,open,high,low,close,vol
and data
AAPL,20190603,09:00,176.32,176.32,176.24,176.29,2247
Here a time column is added and filled with the last 4 digits of the date column, with a colon in the middle, and those 4 digits are removed from the date column.
There are about 400 rows of data for each stock in each file, so every row needs to be converted.
I haven't been able to find an answer, here or elsewhere on the web, that I could understand well enough to accomplish this.
Try the following, using pandas:
data.csv
ticker,date,open,high,low,close,vol
AAPL,201906030900,176.32,176.32,176.24,176.29,2247
ABCD,202002211000,220.97,217.38,221.43,219.82,8544
code
import pandas as pd
df = pd.read_csv('data.csv')
# print(df)
df['time'] = df['date'].apply(lambda x: f'{str(x)[-4:-2]}:{str(x)[-2:]}')
df['date'] = df['date'].apply(lambda x: str(x)[:-4])
cols = df.columns.to_list()
cols = cols[:2] + cols[-1:] + cols[2:-1]
df = df[cols]
# print(df)
df.to_csv('out.csv', index=False)
out.csv
ticker,date,time,open,high,low,close,vol
AAPL,20190603,09:00,176.32,176.32,176.24,176.29,2247
ABCD,20200221,10:00,220.97,217.38,221.43,219.82,8544
You can use the same code to loop over multiple files.
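One way to loop over several files is sketched below. It assumes the CSVs have already been extracted from the zip archives into a folder; split_date_time and convert_folder are hypothetical helper names, and the folder paths are up to you. The date column is read as text so the slicing works on strings:

```python
import io
from pathlib import Path

import pandas as pd


def split_date_time(df):
    """Split a YYYYMMDDHHMM 'date' column into 'date' and 'time' (HH:MM)."""
    stamp = df["date"].astype(str)
    out = df.copy()
    out["date"] = stamp.str[:-4]
    # Insert 'time' right after 'date' (position 2)
    out.insert(2, "time", stamp.str[-4:-2] + ":" + stamp.str[-2:])
    return out


def convert_folder(src_dir, dst_dir):
    """Convert every CSV in src_dir and write the result to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob("*.csv")):
        df = pd.read_csv(path, dtype={"date": str})
        split_date_time(df).to_csv(dst / path.name, index=False)


# Quick check on the sample rows from the question
sample = pd.read_csv(
    io.StringIO(
        "ticker,date,open,high,low,close,vol\n"
        "AAPL,201906030900,176.32,176.32,176.24,176.29,2247\n"
    ),
    dtype={"date": str},
)
result = split_date_time(sample)
print(result.to_csv(index=False))
```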
I have attached a photo of how the data is formatted when I print the df in Jupyter; please check it for reference.
I set the DATE column as the index, checked the data type of the index, and converted the index to a datetime index.
import pandas as pd
df = pd.read_csv('UMTMVS.csv', index_col='DATE', parse_dates=True)
df.index = pd.to_datetime(df.index)
I need to print the percent increase in value from Month/Year to Month/Year and the percent decrease in value from Month/Year to Month/Year.
[image: dataframe format]
The first point concerns how you read your DataFrame.
With index_col='DATE', parse_dates=True already parses the index as dates.
You can also name the column to parse explicitly:
df = pd.read_csv('UMTMVS.csv', index_col='DATE', parse_dates=['DATE'])
Either way, the second instruction (pd.to_datetime on the index) is not needed.
To find the percent change in the UMTMVS column, use: df.UMTMVS.pct_change().
For your data the result is:
DATE
1992-01-01         NaN
1992-02-01    0.110968
1992-03-01    0.073036
1992-04-01   -0.040080
1992-05-01    0.014875
1992-06-01   -0.330455
1992-07-01    0.368293
1992-08-01    0.078386
1992-09-01    0.082884
1992-10-01   -0.030528
1992-11-01   -0.027791
Name: UMTMVS, dtype: float64
pct_change() returns fractions, so multiply the result by 100 to get percentages.
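To report the largest rise and the largest drop as "Month/Year to Month/Year", one possible sketch follows. The four values are made up, since the real UMTMVS data is not shown here, and describe is a hypothetical helper:

```python
import pandas as pd

# Made-up monthly values standing in for the UMTMVS column
s = pd.Series(
    [100.0, 110.0, 99.0, 120.0],
    index=pd.to_datetime(["1992-01-01", "1992-02-01", "1992-03-01", "1992-04-01"]),
    name="UMTMVS",
)

# Month-over-month change, expressed in percent
pct = s.pct_change().mul(100)

# Dates at which the largest rise and the largest drop end (NaN is skipped)
largest_rise = pct.idxmax()
largest_drop = pct.idxmin()


def describe(end):
    """Hypothetical helper: label a change by its start and end month."""
    start = s.index[s.index.get_loc(end) - 1]
    return f"{start:%m/%Y} to {end:%m/%Y}: {pct[end]:.2f}%"


print("Largest increase:", describe(largest_rise))
print("Largest decrease:", describe(largest_drop))
```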