Problems with pandas converting int to float - python

I'm using pandas read_csv to extract data and reformat it. For example, "10/28/2018" from the column "HBE date" will be reformatted to read "eHome 10/2018".
It mostly works, except I am getting reformatted values like "ehome 1.0/2015.0".
eHomeHBEdata['HBE date'] = pd.to_datetime(eHomeHBEdata['Course Completed'])
# extract month and year values
eMonths = []
eYears = []
eHomeDates = eHomeHBEdata['HBE date']
for eDate in eHomeDates:
    eMonth = eDate.month
    eYear = eDate.year
    eMonths.append(eMonth)
    eYears.append(eYear)
At this point, if I print(type(eMonth)) it returns 'int'. And if I print the eYears list, I get values like 2013, 2014, 2015, etc.
But then I assign the lists to columns in the data frame . . .
eHomeHBEdata.insert(0,'workshop Month',eMonths)
eHomeHBEdata.insert(1,'workshop Year',eYears)
. . . after which print(eHomeHBEdata['workshop Month']) returns values like 2013.0, 2014.0, 2015.0. That's type float, right?
When I try to use the following code, I get the misformatted values mentioned above:
eHomeHBEdata['course session'] = "ehome " + eHomeHBEdata['workshop Month'].astype(str) + "/" + eHomeHBEdata['workshop Year'].astype(str)
eHomeHBEdata['start'] = eHomeHBEdata['workshop Month'].astype(str) + "/1/" + eHomeHBEdata['workshop Year'].astype(str) + " 12:00 PM"
Could someone explain what's going on here and help me fix it?

Solution
To convert (reformat) your date columns as MM/YYYY, all you need to do is:
df["Your_Column_Name"].dt.strftime('%m/%Y')
See Section-A and Section-B for two different use-cases.
A. Example
I have created some dummy data for this illustration with a column called Dates. To reformat this column as MM/YYYY I am using df.Dates.dt.strftime('%m/%Y'), which is equivalent to df["Dates"].dt.strftime('%m/%Y').
import pandas as pd
## Dummy Data
dates = pd.date_range(start='2020/07/01', end='2020/07/07', freq='D')
df = pd.DataFrame(dates, columns=['Dates'])
# Solution
df['Reformatted_Dates'] = df.Dates.dt.strftime('%m/%Y')
print(df)
## Output:
#        Dates Reformatted_Dates
# 0 2020-07-01           07/2020
# 1 2020-07-02           07/2020
# 2 2020-07-03           07/2020
# 3 2020-07-04           07/2020
# 4 2020-07-05           07/2020
# 5 2020-07-06           07/2020
# 6 2020-07-07           07/2020
B. If your input data is in the following format
In this case, first convert the column using .astype('datetime64[ns, US/Eastern]'). This lets you apply pandas datetime-specific methods on the column. Try running df.Dates.astype('datetime64[ns, US/Eastern]').dt.to_period(freq='M') afterwards.
## Dummy Data
dates = [
'10/2018',
'11/2018',
'8/2019',
'5/2020',
]
df = pd.DataFrame(dates, columns=['Dates'])
print(df.Dates.dtype)
print(df)
## To convert the column to datetime and reformat
df['Dates'] = df.Dates.astype('datetime64[ns, US/Eastern]') #.dt.strftime('%m/%Y')
print(df.Dates.dtype)
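Note that astype('datetime64[ns, US/Eastern]') may refuse plain strings in recent pandas versions; a minimal sketch of the same idea using pd.to_datetime with an explicit format (the '%m/%Y' format string is an assumption matching the dummy data above):

```python
import pandas as pd

# Same dummy "M/YYYY" strings as in Section-B
dates = ['10/2018', '11/2018', '8/2019', '5/2020']
df = pd.DataFrame(dates, columns=['Dates'])

# Parse with an explicit format, then view each value as a monthly period
df['Dates'] = pd.to_datetime(df['Dates'], format='%m/%Y')
df['Period'] = df['Dates'].dt.to_period(freq='M')
print(df)
```

to_period('M') keeps only month/year resolution, which matches the MM/YYYY intent of the question.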
C. Avoid using the for loop
Try this. You can use the inbuilt vectorization of pandas on a column instead of looping over each row. I have used .dt.month and .dt.year on the column to get the month and year as int.
eHomeHBEdata['HBE date'] = pd.to_datetime(eHomeHBEdata['Course Completed'])
eHomeDates = eHomeHBEdata['HBE date'] # this should be in datetime.datetime format
## This is what I changed
eMonths = eHomeDates.dt.month
eYears = eHomeDates.dt.year
eHomeHBEdata.insert(0,'workshop Month',eMonths)
eHomeHBEdata.insert(1,'workshop Year',eYears)
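As a further sketch (not part of the original answer), the two string columns from the question can be built the same vectorized way with dt.strftime, so no int-to-float round trip through Python lists ever happens; the dummy rows below stand in for the real CSV:

```python
import pandas as pd

# Dummy rows standing in for the real CSV (column names from the question)
eHomeHBEdata = pd.DataFrame({'Course Completed': ['10/28/2018', '3/5/2015']})
eHomeHBEdata['HBE date'] = pd.to_datetime(eHomeHBEdata['Course Completed'])

# strftime works on the whole column at once, and always yields strings
eHomeHBEdata['course session'] = 'eHome ' + eHomeHBEdata['HBE date'].dt.strftime('%m/%Y')
eHomeHBEdata['start'] = eHomeHBEdata['HBE date'].dt.strftime('%m/1/%Y 12:00 PM')
print(eHomeHBEdata)
```

One cosmetic difference: %m zero-pads the month (e.g. "03/2015"), whereas the loop-built values used single-digit months.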

Related

Extracting dates in a pandas dataframe column using regex

I have a data frame with a column Campaign whose values follow the "campaign name (start date - end date)" format. I need to create 3 new columns by extracting the start and end dates:
start_date, end_date, days_between_start_and_end_date.
The issue is that the Campaign column value is not in a fixed format; for the below value my code block works well.
1. Season1 hero (18.02. -24.03.2021)
What I am doing in my code snippet is extracting the start date & end date from the campaign column and as you see, start date doesn't have a year. I am adding the year by checking the month value.
import pandas as pd
import re
import datetime
# read csv file
df = pd.read_csv("report.csv")
# extract start and end dates from the 'Campaign' column
dates = df['Campaign'].str.extract(r'(\d+\.\d+)\.\s*-\s*(\d+\.\d+\.\d+)')
df['start_date'] = dates[0]
df['end_date'] = dates[1]
# convert start and end dates to datetime format
df['start_date'] = pd.to_datetime(df['start_date'], format='%d.%m')
df['end_date'] = pd.to_datetime(df['end_date'], format='%d.%m.%Y')
# Add year to start date
for index, row in df.iterrows():
    if pd.isna(row["start_date"]) or pd.isna(row["end_date"]):
        continue
    start_month = row["start_date"].month
    end_month = row["end_date"].month
    year = row["end_date"].year
    if start_month > end_month:
        year = year - 1
    dates_str = str(row["start_date"].strftime("%d.%m")) + "." + str(year)
    df.at[index, "start_date"] = pd.to_datetime(dates_str, format="%d.%m.%Y")
    dates_str = str(row["end_date"].strftime("%d.%m")) + "." + str(row["end_date"].year)
    df.at[index, "end_date"] = pd.to_datetime(dates_str, format="%d.%m.%Y")
But I have multiple different column values where my regex fails and I receive NaN values, for example:
1. Sales is on (30.12.21-12.01.2022)
2. Sn 2 Fol CAMPAIGN A (24.03-30.03.2023)
3. M SALE (19.04 - 04.05.2022)
4. NEW SALE (29.12.2022-11.01.2023)
5. Year End (18.12. - 12.01.2023)
6. XMAS 1 - THE TRIBE CELEBRATES XMAS (18.11.-08.12.2021) (gifting communities)
In all the above examples, my date format is completely different.
expected output
start date    end date
2021-12-30    2022-01-12
2023-03-24    2023-03-30
2022-04-19    2022-05-04
2022-12-29    2023-01-11
2022-12-18    2023-01-12
2021-11-18    2021-12-08
Can someone please help me here?
Since the datetimes in the data don't have a fixed format (some are dd.mm.yy, some are dd.mm.YYYY), it might be better to apply a custom parser function that uses try-except. We could certainly do two conversions using pd.to_datetime and choose values using np.where etc., but that might not save any time given we would need a lot of string manipulation beforehand.
To append the missing years for some rows, since pandas string methods are not optimized and we would need a few of them (str.count(), str.cat(), etc.), it's probably better to use Python string methods in a loop instead.
Also, iterrows() is incredibly slow, so it's much faster to use a plain Python loop instead.
pd.to_datetime converts each element into datetime.datetime objects anyway, so we can use datetime.strptime from the built-in module to perform the conversions.
from datetime import datetime

def datetime_parser(date, end_date=None):
    # remove space around dates
    date = date.strip()
    # if the start date doesn't have a year, append it from the end date
    dmy = date.split('.')
    if end_date and len(dmy) == 2:
        date = f"{date}.{end_date.rsplit('.', 1)[1]}"
    elif end_date and not dmy[-1]:
        edmy = end_date.split('.')
        if int(dmy[1]) > int(edmy[1]):
            date = f"{date}{int(edmy[-1])-1}"
        else:
            date = f"{date}{edmy[-1]}"
    try:
        # try 'dd.mm.YYYY' format (e.g. 29.12.2022) first
        return datetime.strptime(date, '%d.%m.%Y')
    except ValueError:
        # fall back to 'dd.mm.yy' format (e.g. 30.12.21)
        return datetime.strptime(date, '%d.%m.%y')
# extract dates into 2 columns (tentatively start and end dates)
splits = df['Campaign'].str.extract(r"\((.*?)-(.*?)\)").values.tolist()
# parse the dates
df[['start_date', 'end_date']] = [[datetime_parser(start, end), datetime_parser(end)] for start, end in splits]
# find difference
df['days_between_start_and_end_date'] = df['end_date'] - df['start_date']
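To see the whole pipeline run end to end, here is a condensed, self-contained variant; the helper name parse_date and its compact year-borrowing logic are illustrative rather than the exact code above:

```python
import pandas as pd
from datetime import datetime

def parse_date(date, end_date=None):
    """Compact illustrative parser: borrow a missing year from the end
    date (handling year rollover), then try 4-digit before 2-digit years."""
    date = date.strip()
    parts = date.rstrip('.').split('.')
    if end_date is not None and len(parts) == 2:
        end = parse_date(end_date)
        # a start month later than the end month means the previous year
        year = end.year - 1 if int(parts[1]) > end.month else end.year
        date = f"{parts[0]}.{parts[1]}.{year}"
    for fmt in ('%d.%m.%Y', '%d.%m.%y'):
        try:
            return datetime.strptime(date, fmt)
        except ValueError:
            pass
    raise ValueError(f"unparseable date: {date!r}")

df = pd.DataFrame({'Campaign': [
    'Sales is on (30.12.21-12.01.2022)',
    'Sn 2 Fol CAMPAIGN A (24.03-30.03.2023)',
    'Year End (18.12. - 12.01.2023)',
]})
splits = df['Campaign'].str.extract(r'\((.*?)\s*-\s*(.*?)\)')
df['start_date'] = [parse_date(s, e) for s, e in zip(splits[0], splits[1])]
df['end_date'] = [parse_date(e) for e in splits[1]]
print(df[['start_date', 'end_date']])
```

Note how '18.12.' borrows 2022 (not 2023) because December follows January in calendar order only by rolling back a year.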
I would do a basic regex with extract and then perform slicing:

ser = df["Campaign"].str.extract(r"\((.*)\)", expand=False)

end_date = ser.str.strip().str[-10:]
#or ser.str.strip().str.rsplit("-").str[-1]
start_date = ser.str.strip().str.split(r"\s*-\s*").str[0]

NB: You can assign the Series start_date and end_date to create your two new columns.
Output:

# start_date
0      30.12.21
1         24.03
2         19.04
3    29.12.2022
Name: Campaign, dtype: object

# end_date
0    12.01.2022
1    30.03.2023
2    04.05.2022
3    11.01.2023
Name: Campaign, dtype: object

"ValueError: year is out of range" with IEX Cloud API

I have a CSV (which I converted to a dataframe) consisting of company/stock data:
  Symbol  Quantity  Price  Cost      date
0    DIS         9    NaN    20  20180531
1   SBUX         5    NaN    30  20180228
2   PLOW         4    NaN    40  20180731
3   SBUX         2    NaN    50  20191130
4    DIS        11    NaN    25  20171031
And I am trying to use the IEX Cloud API to pull in the stock Price for a given date. And then ultimately write that to the dataframe. Per the IEX Cloud API documentation, I can use the get_historical_data function, where the 2nd argument is the date: df = get_historical_data("SBUX", "20190617", close_only=True)
Everything works fine so long as I pass in a raw date directly to the function (e.g., 20190617), but if I try using a variable instead, I get ValueError: year 20180531 is out of range. I'm guessing something is wrong with the date format in my original CSV?
Here is my full code:
import os
from iexfinance.stocks import get_historical_data
import pandas as pd
os.environ['IEX_API_VERSION'] = 'iexcloud-sandbox'
os.environ['IEX_TOKEN'] = 'Tsk_5798c0ab124d49639bb1575b322841c4'
input_df = pd.read_csv("all.csv")
for index, row in input_df.iterrows():
    symbol = row['Symbol']
    date = row['date']
    temp_df = get_historical_data(symbol, date, close_only=True, output_format='pandas')
    price = temp_df['close'].values[0]
    print(temp_df)
Note that this is a public sandbox token, so it's okay to use it here.
When you called get_historical_data("SBUX", "20190617", close_only=True), you passed the date as a string. But when you read a DataFrame using read_csv, this column (containing 8-digit strings) is converted to an integer. This difference can be the source of the problem.
Try one of 2 things:
convert this column to string after reading, or
while reading the DataFrame, pass dtype={'date': str}, so that this column will be read as a string.
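A quick sketch of the second option, using an in-memory CSV (the sample rows are hypothetical):

```python
import io
import pandas as pd

# A hypothetical two-row CSV matching the question's layout
csv_text = "Symbol,Quantity,Price,Cost,date\nDIS,9,,20,20180531\nSBUX,5,,30,20180228\n"

# Without dtype, pandas infers the 8-digit dates as integers
df_int = pd.read_csv(io.StringIO(csv_text))
print(df_int['date'].dtype)    # int64

# With dtype={'date': str}, the column keeps its string form
df_str = pd.read_csv(io.StringIO(csv_text), dtype={'date': str})
print(df_str['date'].iloc[0])  # 20180531
```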
You should be fine if you transform your date column into datetime.
import pandas as pd
df = pd.DataFrame(['20180531'])
pd.to_datetime(df.values[:, 0])
Out[43]: DatetimeIndex(['2018-05-31'], dtype='datetime64[ns]', freq=None)
Then, your column will be correctly formatted for use elsewhere. You can insert this line below pd.read_csv():
df['date'] = pd.to_datetime(df['date'])

Pandas Dataframe Time column has float values

I am cleaning my database. In one of the tables, the time column has values like 0.013391204. I am unable to convert this to time [mm:ss] format. Is there a function to convert this to the required [mm:ss] format?
The head for the column
0 20:00
1 0.013391204
2 0.013333333
3 0.012708333
4 0.012280093
Use the below reproducible data:
import pandas as pd
df = pd.DataFrame({"time": ["20:00", "0.013391204", "0.013333333", "0.012708333", "0.012280093"]})
I expect the output to be like the first row of the column values shown above.
What is the correct time interpretation for, say, the entry 0.013391204 — is it 48 seconds?
Because if we use the datetime module, we can convert the float into a time format:
import datetime
datetime.timedelta(days = 0.013391204)
str(datetime.timedelta(days = 0.013391204))
Output: '0:19:17.000026'
Hope this helps :))
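If the goal really is a plain [mm:ss] string rather than a timedelta, a small helper along these lines would do it (the rounding to whole seconds is an assumption):

```python
def days_to_mmss(day_fraction):
    # interpret the float as a fraction of a 24-hour day
    total_seconds = round(day_fraction * 24 * 60 * 60)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes:02d}:{seconds:02d}"

print(days_to_mmss(0.013391204))  # 19:17
```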
First convert the values with to_numeric using errors='coerce' to replace non-floats with missing values, then fill those back in from the original values prefixed with 00: for hours, and last convert with to_timedelta using unit='d':
df = pd.DataFrame({"time": ["20:00", "0.013391204", "0.013333333",
                            "0.012708333", "0.012280093"]})
s = pd.to_numeric(df['time'], errors='coerce').fillna(df['time'].radd('00:'))
df['new'] = pd.to_timedelta(s, unit='d')
print (df)
          time              new
0        20:00         00:20:00
1  0.013391204  00:19:17.000025
2  0.013333333  00:19:11.999971
3  0.012708333  00:18:17.999971
4  0.012280093  00:17:41.000035

Parse timestamp having hour beyond 23 in python

I am learning Python and came across an issue where I am trying to read a timestamp from a CSV file in the below format:
43:32.0
Here 43 is at the hours position, and I want to convert it to datetime format in pandas.
I tried this code:
df['time'] = df['time'].astype(str).str[:-2]
df['time'] = pd.to_datetime(df['time'], errors='coerce')
But this converts all values to NaT.
I need the output to be in the format mm/dd/yyyy hh:mm:ss.
I'm going to assume that this is a date for 11-29-17 (today's date)?
I believe you need to add an extra 0: at the beginning of the string. Basic example:
import pandas as pd
# creating a dataframe of your string
df1 = pd.DataFrame({'A':['43:32.0']})
# adding '0:' to the front
df1['A'] = '0:' + df1['A'].astype(str)
# making new column to show the output
df1['B'] = pd.to_datetime(df1['A'], errors='coerce')
#output
           A                   B
0  0:43:32.0 2017-11-29 00:43:32
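Alternatively, if the 43 really is at the hours position as the question states, no datetime can hold it, but a timedelta can; a sketch assuming an hours:minutes layout with a fractional-minute tail:

```python
import pandas as pd

df = pd.DataFrame({'time': ['43:32.0', '2:05.0']})

# An hour value past 23 can't live in a datetime, but a timedelta has no
# such cap; split on ':' and combine hours and (fractional) minutes
parts = df['time'].str.split(':', expand=True)
df['elapsed'] = (pd.to_timedelta(parts[0].astype(int), unit='h')
                 + pd.to_timedelta(parts[1].astype(float), unit='m'))
print(df['elapsed'].iloc[0])  # 1 days 19:32:00
```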

NaNs when extracting no. of days between two dates in pandas

I have a dataframe that contains the columns company_id, seniority, join_date and quit_date. I am trying to extract the number of days between join date and quit date. However, I get NaNs.
If I drop off all the columns in the dataframe except for quit date and join date and run the same code again, I get what I expect. However with all the columns, I get NaNs.
Here's my code:
df['join_date'] = pd.to_datetime(df['join_date'])
df['quit_date'] = pd.to_datetime(df['quit_date'])
df['days'] = df['quit_date'] - df['join_date']
df['days'] = df['days'].astype(str)
df1 = pd.DataFrame(df.days.str.split(' ').tolist(), columns = ['days', 'unwanted', 'stamp'])
df['numberdays'] = df1['days']
This is what I get:
               days  numberdays
  585 days 00:00:00         NaN
  340 days 00:00:00         NaN
I want 585 from the 'days' column in the 'numberdays' column. Similarly for every such row.
Can someone help me with this?
Thank you!
Instead of converting to string, extract the number of days from the timedelta value using the dt accessor.
import pandas as pd
df = pd.DataFrame({'join_date': ['2014-03-24', '2013-04-29', '2014-10-13'],
                   'quit_date': ['2015-10-30', '2014-04-04', '']})
df['join_date'] = pd.to_datetime(df['join_date'])
df['quit_date'] = pd.to_datetime(df['quit_date'])
df['days'] = df['quit_date'] - df['join_date']
df['number_of_days'] = df['days'].dt.days
# Mohammad Yusuf Ghazi points out that .dt.day (rather than .dt.days) is the accessor to use when working with datetime data rather than timedelta: it returns the day of the month.
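To make that distinction concrete, a small sketch contrasting the two accessors (dates borrowed from the dummy data above):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2015-10-30', '2014-04-04']))
deltas = dates - pd.Timestamp('2014-01-01')

# .dt.day on a datetime column  -> day of the month
# .dt.days on a timedelta column -> whole elapsed days
print(dates.dt.day.tolist())    # [30, 4]
print(deltas.dt.days.tolist())  # [667, 93]
```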
