map with named function vs identical lambda function providing different responses, pandas - python

I am trying to apply a simple function to extract the month from a string column in a pandas dataframe, where the string is of the form m/d/yyyy.
The dataframe is called data, the date column is called transaction date, and my new proposed month column I wish to call transaction month.
The below works just fine:
data['transaction month'] = data['transaction date'].map(lambda x: x[0:x.index('/')])
However, if I try to do the same thing with a named function, it just returns a column where every value is None
def extract_month_from_date(date):
    return date[0:date.index('/')]

data['transaction month 2'] = data['transaction date'].map(extract_month_from_date)
I've stared at the code for long enough that I think I'm going crazy. What's wrong with the named function?

You can extract the month via pd.Series.dt.month:
import pandas as pd
df = pd.DataFrame({'date': ['8/2/2018']})
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].dt.month
#         date  month
# 0 2018-08-02      8
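Applied to the asker's own column names, a minimal sketch (the 'transaction date' column and its m/d/yyyy format come from the question):
import pandas as pd
# 'data' is assumed to hold date strings like '8/2/2018'
data = pd.DataFrame({'transaction date': ['8/2/2018', '12/25/2017']})
data['transaction month'] = pd.to_datetime(
    data['transaction date'], format='%m/%d/%Y'
).dt.month
# transaction month: 8, 12 (integers, unlike the string slices from map)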

Why does pd.to_datetime not take the year into account?

I've searched for 2 hours but can't find an answer for this that works.
I have this dataset I'm working with and I'm trying to find the latest date, but it seems like my code is not taking the year into account. Here are some of the dates that I have in the dataset.
Date
01/09/2023
12/21/2022
12/09/2022
11/19/2022
Here's a snippet from my code
import pandas as pd
df = pd.read_csv('test.csv')
df['Date'] = pd.to_datetime(df['Date'])
st.write(df['Date'].max())
st.write gives me 12/21/2022 as the output instead of 01/09/2023 as it should be. So it seems like the code is not taking the year into account and just looking at the month and date.
I tried changing the format to
df['Date'] = df['Date'].dt.strftime('%Y%m%d').astype(int) but that didn't change anything.
pandas.read_csv lets you designate columns for conversion into dates. Let test.csv's content be
Date
01/09/2023
12/21/2022
12/09/2022
11/19/2022
then
import pandas as pd
df = pd.read_csv('test.csv', parse_dates=["Date"])
print(df['Date'].max())
gives output
2023-01-09 00:00:00
Explanation: I provide a list of the names of the columns holding dates, which read_csv then parses.
(tested in pandas 1.5.2)
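If you prefer to parse after reading, an equivalent sketch with an explicit format (assuming month-first m/d/yyyy strings, as in the sample):
import pandas as pd
df = pd.read_csv('test.csv')
# an explicit format removes any day-first/month-first ambiguity
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
print(df['Date'].max())  # 2023-01-09 00:00:00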

Get month-day pair without a year from pandas date time

I am trying to use this, but I still end up with the same year-month-day format, with the year changed to the default "1900". I want to get only month-day pairs, if possible.
df['date'] = pd.to_datetime(df['date'], format="%m-%d")
If you transform anything to a datetime, you'll always have a year in it, i.e. to_datetime will always yield a datetime with a year.
Without a year, you will need to store it as a string, e.g. by running the inverse of your example:
df['date'] = df['date'].dt.strftime(format="%m-%d")
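A minimal round-trip sketch (the column name 'date' is taken from the question):
import pandas as pd
df = pd.DataFrame({'date': ['03-15', '12-25']})
df['date'] = pd.to_datetime(df['date'], format='%m-%d')  # year defaults to 1900
df['date'] = df['date'].dt.strftime('%m-%d')             # back to plain month-day strings
# date: '03-15', '12-25'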

How to find missing dates in an excel file by python

I'm a beginner in Python. I have an Excel file that shows the rainfall amount between 2016-1-1 and 2020-6-30. It has 2 columns: the first column is the date, the other is the rainfall. Some dates are missing from the file (the rainfall wasn't measured). For example, there isn't a row for 2016-05-05 in my file. This is a sample of my Excel file.
Date rainfall (mm)
1/1/2016 10
1/2/2016 5
.
.
.
12/30/2020 0
I want to find the missing dates but my code doesn't work correctly!
import pandas as pd
from datetime import datetime, timedelta
from matplotlib import dates as mpl_dates
from matplotlib.dates import date2num
df = pd.read_excel('rainfall.xlsx')
a = pd.date_range(start='2016-01-01', end='2020-06-30').difference(df.index)
print(a)
Here's a beginner-friendly way of doing it.
First you need to make sure that the Date in your dataframe is really a date and not a string or object.
Type (or print) df.info().
The date column should show up as datetime64[ns].
If not, df['Date'] = pd.to_datetime(df['Date'], dayfirst=False) fixes that. (Use dayfirst to tell pandas whether the day or the month comes first in your date string, because it can't know. Month-first is the default, so here it would work even if you left it out.)
For the task of finding missing days, there are many ways to solve it. Here's one.
Turn all dates into a series
all_dates = pd.Series(pd.date_range(start = '2016-01-01', end = '2020-06-30' ))
Then print all dates from that series which are not in your dataframe "Date" column. The ~ sign means "not".
print(all_dates[~all_dates.isin(df['Date'])])
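Putting those pieces together, a sketch using the question's file and column names:
import pandas as pd
df = pd.read_excel('rainfall.xlsx')
df['Date'] = pd.to_datetime(df['Date'], dayfirst=False)  # make sure 'Date' is datetime64[ns]
all_dates = pd.Series(pd.date_range(start='2016-01-01', end='2020-06-30'))
print(all_dates[~all_dates.isin(df['Date'])])  # dates with no rainfall row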
Try:
df = pd.read_excel('rainfall.xlsx', usecols=[0])
a = pd.date_range(start = '2016-01-01', end = '2020-06-30').difference([l[0] for l in df.values])
print(a)
And the dates in the file must look like 2016/1/1.
Not a Python solution, but to find the missing dates from a list you can also apply the Conditional Formatting function in Excel: after setting up the rule, click OK > OK, and the positions of the missing dates are highlighted. Note: the last date in the date list will also be highlighted.

Pandas group hourly data into daily sums with date index

I am working on code that takes hourly data for a month and groups it into 24-hour sums. My problem is that I would like the index to show the date and year, but I am just getting an index of 1-30.
The code I am using is
df = df.iloc[:,16:27].groupby([lambda x: x.day]).sum()
example of output I am getting
DateTime data
1 1772.031568
2 19884.42243
3 28696.72159
4 24906.20355
5 9059.120325
example of output I would like
DateTime data
1/1/2017 1772.031568
1/2/2017 19884.42243
1/3/2017 28696.72159
1/4/2017 24906.20355
1/5/2017 9059.120325
This is an old question, but I don't think the accepted solution is the best in this particular case. What you want to accomplish is to downsample time series data, and pandas has built-in functionality for this called resample(). For your example you would do:
df = df.iloc[:,16:27].resample('D').sum()
or if the datetime column is not the index
df = df.iloc[:,16:27].resample('D', on='datetime_column_name').sum()
There are (at least) two benefits to doing it this way as opposed to the accepted answer:
resample() can both upsample and downsample, while groupby() can only downsample
No lambdas, list comprehensions or date formatting functions required.
For more information and examples, see documentation here: resample()
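A tiny self-contained sketch of that downsampling (made-up hourly values):
import pandas as pd
idx = pd.date_range('2017-01-01', periods=48, freq='h')  # two days of hourly stamps
hourly = pd.DataFrame({'data': 1.0}, index=idx)
daily = hourly.resample('D').sum()
# daily.index: 2017-01-01, 2017-01-02; daily['data']: 24.0 each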
If your index is a datetime, you can build a combined groupby clause:
df = df.iloc[:,16:27].groupby([lambda x: "{}/{}/{}".format(x.day, x.month, x.year)]).sum()
or even better:
df = df.iloc[:,16:27].groupby([lambda x: x.strftime("%d%m%Y")]).sum()
Both assume the index is a datetime object; if it isn't, convert it first (e.g. with pd.to_datetime).
import pandas as pd
df = pd.DataFrame({'data': [1772.031568, 19884.42243, 28696.72159, 24906.20355, 9059.120325]},
                  index=[1, 2, 3, 4, 5])
print(df.head())
rng = pd.date_range('1/1/2017', periods=len(df.index), freq='D')
df.set_index(rng, inplace=True)
print(df.head())
will result in
data
1 1772.031568
2 19884.422430
3 28696.721590
4 24906.203550
5 9059.120325
data
2017-01-01 1772.031568
2017-01-02 19884.422430
2017-01-03 28696.721590
2017-01-04 24906.203550
2017-01-05 9059.120325
First you need to create an index on your datetime column to expose functions that break the datetime into smaller pieces efficiently (like the year and month of the datetime).
Next, you need to group by the year, month and day of the index if you want to apply an aggregate method (like sum()) to each day of the year, and retain separate aggregations for each day.
The reset_index() and rename() functions allow us to rename our groupby categories to meaningful names. This "flattens" out our data, making each category an actual column on the resulting dataframe.
import pandas as pd
date_index = pd.DatetimeIndex(df.created_at)
# 'df.created_at' is the datetime column in your dataframe
counted = df.groupby([date_index.year, date_index.month, date_index.day])\
            .agg({'column_to_sum': 'sum'})\
            .reset_index()\
            .rename(columns={'level_1': 'year',
                             'level_2': 'month',
                             'level_3': 'day'})
# Resulting dataframe has columns "column_to_sum", "year", "month", "day" available
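A runnable toy version of the same pattern (hypothetical frame; rename_axis is used here instead of the reset_index/rename step above, to name the group levels directly):
import pandas as pd
df = pd.DataFrame({
    'created_at': pd.to_datetime(['2017-01-01 03:00',
                                  '2017-01-01 17:00',
                                  '2017-01-02 09:00']),
    'column_to_sum': [1.0, 2.0, 3.0],
})
date_index = pd.DatetimeIndex(df.created_at)
counted = (df.groupby([date_index.year, date_index.month, date_index.day])
             .agg({'column_to_sum': 'sum'})
             .rename_axis(['year', 'month', 'day'])
             .reset_index())
# columns: year, month, day, column_to_sum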
You can exploit pandas' DatetimeIndex:
working_df = df.iloc[:, 16:27]
result = working_df.groupby(pd.DatetimeIndex(working_df.DateTime).date).sum()
This works if your DateTime column actually holds datetimes (and be careful with the timezone).
In this way you will have valid datetime in the index, so that you can easily do other manipulations.
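A toy run of that idea (hypothetical 'DateTime' column):
import pandas as pd
working_df = pd.DataFrame({
    'DateTime': pd.to_datetime(['2017-01-01 01:00',
                                '2017-01-01 02:00',
                                '2017-01-02 01:00']),
    'data': [1.0, 2.0, 3.0],
})
result = working_df.groupby(pd.DatetimeIndex(working_df.DateTime).date)['data'].sum()
# index: 2017-01-01, 2017-01-02; values: 3.0, 3.0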

Pandas: select all dates with specific month and day

I have a dataframe full of dates and I would like to select all dates where the month==12 and the day==25 and replace the zero in the xmas column with a 1.
Any way to do this? The second line of my code errors out.
df = DataFrame({'date':[datetime(2013,1,1).date() + timedelta(days=i) for i in range(0,365*2)], 'xmas':np.zeros(365*2)})
df[df['date'].month==12 and df['date'].day==25] = 1
Pandas Series with datetime now behaves differently. See .dt accessor.
This is how it should be done now:
df.loc[(df['date'].dt.day==25) & (df['date'].dt.month==12), 'xmas'] = 1
Basically, what you tried won't work: you need to use & to compare arrays, and you need parentheses due to operator precedence. On top of this, you should use loc to perform the indexing:
df.loc[(df['date'].month==12) & (df['date'].day==25), 'xmas'] = 1
An update was needed in reply to this question. As of today, there's a slight difference in how you extract months from datetime objects in a pd.Series.
So from the very start, in case you have a raw date column, first convert it to datetime objects by using a simple function:
import datetime as dt
def read_as_datetime(str_date):
    # replace %Y-%m-%d with your own date format
    return dt.datetime.strptime(str_date, '%Y-%m-%d')
then apply this function to your dates column and save results in a new column namely datetime:
df['datetime'] = df.dates.apply(read_as_datetime)
finally, in order to select rows by day and month, use the same piece of code that #Shayan RC explained, with the .dt accessor applied to the new datetime column:
df.loc[(df['datetime'].dt.month==12) & (df['datetime'].dt.day==25), 'xmas'] = 1
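For reference, a self-contained sketch of the whole task through the .dt accessor (pd.to_datetime-parsed dates stand in for the manual strptime helper):
import numpy as np
import pandas as pd
df = pd.DataFrame({'date': pd.date_range('2013-01-01', periods=365 * 2, freq='D'),
                   'xmas': np.zeros(365 * 2)})
df.loc[(df['date'].dt.month == 12) & (df['date'].dt.day == 25), 'xmas'] = 1
print(df[df['xmas'] == 1])  # the two Christmas rows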
