How to count nonzero occurrences based on another variable in python? - python

Date Precipitation
20010101 0
20010102 10
20010103 5
20010104 3
20010105 0
...
20011231 0
I have a dataset showing precipitation (in) for each day of the year 2001. The date variable is in YYYYMMDD format. I want to calculate how many times it precipitated each month. In other words, I need the number of days on which the precipitation value is not 0, for each month.
I am a beginner Python learner and don't quite know how to tell the program to output the count for each month without doing each month individually.
The code I have below does not work because I'm not sure how to tell the program that the Date variable is in YYYYMMDD format.
Precip_Count= Date[(Precipitation !=0)]
Is there a way to do this by only using NumPy?

First, convert the Date column to datetime using pd.to_datetime and specify the format of your datetime string (see the datetime format codes; lowercase %m is the month, while uppercase %M is minutes). Then use Series.ne to flag non-zero values, group by month, and take the sum using GroupBy.sum:
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
df['Precipitation'].ne(0).groupby(df.Date.dt.month).sum()
Date
1 3
...
12 0
Name: Precipitation, dtype: int64
Or use Series.dt.to_period:
df['Precipitation'].ne(0).groupby(df.Date.dt.to_period('M')).sum()
Date
2001-01 3
...
2001-12 0
Freq: M, Name: Precipitation, dtype: int64
If you want the index as a DatetimeIndex, set Date as the index and use pd.Grouper:
df.set_index('Date')['Precipitation'].ne(0).groupby(pd.Grouper(freq='M')).sum()
Date
2001-01-31 3
...
2001-12-31 0
Freq: M, Name: Precipitation, dtype: int64
The output is calculated from df mentioned in the question.
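Since the question also asks about a NumPy-only approach, here is a minimal sketch of one. It assumes the dates are stored as YYYYMMDD integers; the array names and the sample values are made up for illustration. The month is extracted with integer arithmetic and np.bincount counts the non-zero days per month.

```python
import numpy as np

# Hypothetical sample data: dates as YYYYMMDD integers, precipitation in inches
date = np.array([20010101, 20010102, 20010103, 20010104,
                 20010105, 20010201, 20011231])
precipitation = np.array([0, 10, 5, 3, 0, 2, 0])

# 20010102 // 100 -> 200101, and 200101 % 100 -> 1 (the month)
month = (date // 100) % 100

# Keep only the months of the days with non-zero precipitation
wet_months = month[precipitation != 0]

# bincount counts occurrences of each integer; minlength=13 guarantees
# slots 0..12, and slot 0 (no such month) is dropped afterwards
wet_days_per_month = np.bincount(wet_months, minlength=13)[1:]

print(wet_days_per_month)
```

The result has twelve entries, one per month from January to December. If the dates are stored as strings instead, they would need to be converted with astype(int) first.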

Related

Date and Time Format Conversion in Pandas, Python

Initially, my dataframe had a Month column containing numbers representing the months.
Month
1
2
3
4
I typed df["Month"] = pd.to_datetime(df["Month"]) and I get this...
Month
1970-01-01 00:00:00.000000001
1970-01-01 00:00:00.000000002
1970-01-01 00:00:00.000000003
1970-01-01 00:00:00.000000004
I would like to retain just the dates and not the time. Any solutions?
Get the date from the column using df['Month'].dt.date
Use format='%m' in to_datetime:
df["Month"] = pd.to_datetime(df["Month"], format='%m')
print (df)
Month
0 1900-01-01
1 1900-02-01
2 1900-03-01
3 1900-04-01
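For context, without a format the integers are read as nanoseconds since 1970-01-01, which explains the output in the question. A small sketch of the format='%m' approach follows; the DataFrame is a made-up stand-in for the asker's data, and the conversion to string before parsing is my own precaution rather than part of the answer.

```python
import pandas as pd

# Hypothetical stand-in for the asker's data
df = pd.DataFrame({"Month": [1, 2, 3, 4]})

# Parse the month numbers explicitly; year and day default to 1900 and 1
parsed = pd.to_datetime(df["Month"].astype(str), format="%m")

# If only the month matters, the name may be more useful than a 1900 date
month_names = parsed.dt.month_name()

print(parsed)
print(month_names)
```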

Grouping and summing time differences from pandas dataframe

I have a dataframe like in example below:
Timestamp ComponentName Utilization
18.10.2020-19:07.10 A Available
19.10.2020-21:07.10 A Available
19.10.2020-19:07.10 A In use
22.10.2020-19:07.10 A In use
25.10.2020-19:07.10 A In use
And desired output should be:
ComponentName Total_Inuse_time Total_Available_time
A 6 days 1 day 2 hours
Basically, I want the total in-use time and the total available time for each component.
I have tried grouping by component names and aggregating with sum on Time differences but could not get the desired result.
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%d.%m.%Y-%H:%M.%S')
df['Timestamp'] = df.groupby(['ComponentName', 'Utilization'])['Timestamp'].diff().fillna(pd.Timedelta(0))
sums = df.groupby(['ComponentName', 'Utilization'])['Timestamp'].sum()
Output:
>>> sums
ComponentName Utilization
A Available 1 days 02:00:00
In use 6 days 00:00:00
Name: Timestamp, dtype: timedelta64[ns]
>>> sums['A']
Utilization
Available 1 days 02:00:00
In use 6 days 00:00:00
Name: Timestamp, dtype: timedelta64[ns]
>>> sums['A']['Available']
Timedelta('1 days 02:00:00')
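To get closer to the wide layout asked for in the question, the grouped sums can be unstacked so that each Utilization value becomes a column. This is only a sketch built on the sample rows from the question; the Elapsed column name and the explicit format string are my own assumptions.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "Timestamp": ["18.10.2020-19:07.10", "19.10.2020-21:07.10",
                  "19.10.2020-19:07.10", "22.10.2020-19:07.10",
                  "25.10.2020-19:07.10"],
    "ComponentName": ["A"] * 5,
    "Utilization": ["Available", "Available", "In use", "In use", "In use"],
})

# Assumed layout of the timestamp strings: day.month.year-hour:minute.second
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%d.%m.%Y-%H:%M.%S")

# Time elapsed since the previous row of the same component and state
keys = ["ComponentName", "Utilization"]
df["Elapsed"] = df.groupby(keys)["Timestamp"].diff().fillna(pd.Timedelta(0))

# One row per component, one column per utilization state
totals = df.groupby(keys)["Elapsed"].sum().unstack("Utilization")

print(totals)
```

Keeping the elapsed time in a separate column leaves the original Timestamp values intact, which makes the result easier to check.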

Convert Date from float64 to Year & Month Format

I have a dataset where the date column (year and month only) is a float64, with the month represented as the fractional part of the year (e.g. June 2012 is displayed as 2012.6).
Can anyone suggest how I can convert this to a month-year format (6-2012, 7-2012, etc.)?
Thanks!
I assume the solution is with to_datetime but so far I haven't been able to convert the dates properly
IIUC, you can do:
pd.to_datetime(pd.Series([2012.6]).astype(str), format='%Y.%m')
Output:
0 2012-06-01
dtype: datetime64[ns]
Try this:
import pandas as pd
dataframe = pd.DataFrame([[2019.1, 2018.2], [2017.3, 2018.4]], columns = ["a", "b"])
a b
0 2019.1 2018.2
1 2017.3 2018.4
dataframe["a"] = dataframe["a"].apply(lambda x: pd.to_datetime(str(x), format="%Y.%m"))
dataframe["a"]
0 2019-01-01
1 2017-03-01
Name: a, dtype: datetime64[ns]
What this is doing is applying the function pd.to_datetime() to every value in the column converted to string type.
Hope it helps.
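If the goal is the "6-2012" style of label from the question, one possible sketch is to parse the values first and then build the string from the month and year. The sample values are made up. Note also that a float cannot distinguish 2012.1 from 2012.10, so January and October may be ambiguous in data stored this way.

```python
import pandas as pd

# Hypothetical sample values in the year.month float layout
values = pd.Series([2012.6, 2012.7])

# Parse as year.month; the day defaults to the first of the month
dates = pd.to_datetime(values.astype(str), format="%Y.%m")

# Build "month-year" labels without a zero-padded month
labels = dates.dt.month.astype(str) + "-" + dates.dt.year.astype(str)

print(labels.tolist())
```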

calculate date difference between today's date and pandas date series

I want to calculate the difference in days between a pandas date series -
0 2013-02-16
1 2013-01-29
2 2013-02-21
3 2013-02-22
4 2013-03-01
5 2013-03-14
6 2013-03-18
7 2013-03-21
and today's date.
I tried but could not come up with a logical solution.
Please help me with the code. I am new to Python and get a lot of syntax errors when applying any function.
You could do something like
# generate time data
data = pd.to_datetime(pd.Series(["2018-09-1", "2019-01-25", "2018-10-10"]))
pd.to_datetime("now") > data
returns:
0 True
1 False
2 True
you could then use that to select the data
data[pd.to_datetime("now") > data]
Hope it helps.
Edit: I misread it but you can easily alter this example to calculate the difference:
data - pd.to_datetime("now")
returns:
0 -122 days +13:10:37.489823
1 24 days 13:10:37.489823
2 -83 days +13:10:37.489823
dtype: timedelta64[ns]
You can try the following:
>>> from datetime import datetime
>>> df
col1
0 2013-02-16
1 2013-01-29
2 2013-02-21
3 2013-02-22
4 2013-03-01
5 2013-03-14
6 2013-03-18
7 2013-03-21
Make sure to convert the column to datetime:
>>> df['col1'] = pd.to_datetime(df['col1'])
Set the current datetime in order to get the difference:
>>> curr_time = pd.to_datetime("now")
Now get the difference as follows:
>>> df['col1'] - curr_time
0 -2145 days +07:48:48.736939
1 -2163 days +07:48:48.736939
2 -2140 days +07:48:48.736939
3 -2139 days +07:48:48.736939
4 -2132 days +07:48:48.736939
5 -2119 days +07:48:48.736939
6 -2115 days +07:48:48.736939
7 -2112 days +07:48:48.736939
Name: col1, dtype: timedelta64[ns]
With NumPy you can solve it as described in the linked post (difference-two-dates-days-weeks-months-years-pandas-python-2). The bottom line:
df['diff_days'] = df['First dates column'] - df['Second Date column']
# use 'D' for days and 'W' for weeks; 'M' (months) and 'Y' (years) are not
# fixed-length units, so dividing by them may raise an error
df['diff_days']=df['diff_days']/np.timedelta64(1,'D')
print(df)
if you want days as int and not as float use
df['diff_days']=df['diff_days']//np.timedelta64(1,'D')
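A small self-contained sketch of the same division, with made-up start and end dates, in case it helps to see the float and integer variants side by side:

```python
import numpy as np
import pandas as pd

# Hypothetical start and end timestamps
start = pd.to_datetime(pd.Series(["2013-02-16 00:00", "2013-01-29 12:00"]))
end = pd.to_datetime(pd.Series(["2013-02-26 12:00", "2013-02-01 12:00"]))

diff = end - start  # timedelta64[ns] column

# True division keeps the fractional part of a day
diff_days_float = diff / np.timedelta64(1, "D")

# Floor division gives whole days as integers
diff_days_int = diff // np.timedelta64(1, "D")

# Any fixed-length unit works the same way, for example hours
diff_hours = diff / np.timedelta64(1, "h")

print(diff_days_float.tolist(), diff_days_int.tolist(), diff_hours.tolist())
```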
From the pandas docs under Converting To Timestamps you will find:
"Converting to Timestamps To convert a Series or list-like object of date-like objects e.g. strings, epochs, or a mixture, you can use the to_datetime function"
I haven't used pandas before, but this suggests that your pandas date series (a list-like object) is iterable and that each of its elements can be passed to the to_datetime function.
Assuming that is correct, the following function takes such a series and returns a list of timedeltas (objects representing the difference between two datetimes).
from datetime import datetime
import pandas as pd

def convert(pandas_series):
    # get the current date and time
    now = datetime.now()
    # use a list comprehension and pd.to_datetime to calculate the timedeltas
    return [now - pd.to_datetime(element) for element in pandas_series]

# assuming 'some_pandas_series' is a list-like pandas series object
list_of_timedeltas = convert(some_pandas_series)
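For whole days rather than full timedeltas, a vectorised sketch along these lines may be enough. The days_since helper and its today parameter are hypothetical names of mine; the parameter exists only so the example gives a reproducible result.

```python
import pandas as pd

# A few of the dates from the question
dates = pd.to_datetime(pd.Series(["2013-02-16", "2013-01-29", "2013-03-21"]))

def days_since(series, today=None):
    """Return whole days between each date in the series and today."""
    if today is None:
        # normalize() drops the time of day so only dates are compared
        today = pd.Timestamp.now().normalize()
    else:
        today = pd.Timestamp(today)
    return (today - series).dt.days

# A fixed reference date is passed here so the output is reproducible;
# omit it to compare against the actual current date
elapsed = days_since(dates, today="2013-04-01")
print(elapsed.tolist())
```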

Conversions of np.timedelta64 to days, weeks, months, etc

When I compute the difference between two pandas datetime64 dates I get np.timedelta64. Is there any easy way to convert these deltas into representations like hours, days, weeks, etc.?
I could not find any methods in np.timedelta64 that facilitate conversions between different units, but it looks like Pandas seems to know how to convert these units to days when printing timedeltas (e.g. I get: 29 days, 23:20:00 in the string representation dataframes). Any way to access this functionality ?
Update:
Strangely, none of the following work:
> df['column_with_times'].days
> df['column_with_times'].apply(lambda x: x.days)
but this one does:
df['column_with_times'][0].days
pandas stores timedelta data in the numpy timedelta64[ns] type, but also provides the Timedelta type to wrap this for more convenience (eg to provide such accessors of the days, hours, .. and other components).
In [41]: timedelta_col = pd.Series(pd.timedelta_range('1 days', periods=5, freq='2 h'))
In [42]: timedelta_col
Out[42]:
0 1 days 00:00:00
1 1 days 02:00:00
2 1 days 04:00:00
3 1 days 06:00:00
4 1 days 08:00:00
dtype: timedelta64[ns]
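To access the different components of a full column (series), you have to use the .dt accessor. In current pandas versions the hours are exposed through .dt.components (older versions also offered .dt.hours). For example:
In [43]: timedelta_col.dt.components.hours
Out[43]:
0 0
1 2
2 4
3 6
4 8
Name: hours, dtype: int64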
With timedelta_col.dt.components you get a frame with all the different components (days to nanoseconds) as different columns.
When accessing one value of the column above, this gives back a Timedelta, and on this you don't need to use the dt accessor, but you can access directly the components:
In [45]: timedelta_col[0]
Out[45]: Timedelta('1 days 00:00:00')
In [46]: timedelta_col[0].days
Out[46]: 1
So the .dt accessor provides access to the attributes of the Timedelta scalar, but on the full column. That is the reason you see that df['column_with_times'][0].days works but df['column_with_times'].days not.
In the pandas version used at the time, df['column_with_times'].apply(lambda x: x.days) did not work because apply was given the raw timedelta64 values (and not the pandas Timedelta type), and these don't have such attributes.
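As a rough summary of the column-level options, the sketch below rebuilds a similar column and converts it in three ways. The values are constructed with pd.to_timedelta purely for illustration.

```python
import pandas as pd

# Same values as the column above: 1 day plus 0, 2, 4, 6 and 8 hours
timedelta_col = pd.Series(pd.to_timedelta([24, 26, 28, 30, 32], unit="h"))

# Whole-day component of each value
days = timedelta_col.dt.days

# Total duration expressed in hours (not just the hours component)
total_hours = timedelta_col.dt.total_seconds() / 3600

# Hours component only, taken from the components frame
hours_part = timedelta_col.dt.components["hours"]

print(days.tolist(), total_hours.tolist(), hours_part.tolist())
```

The difference between total_hours and hours_part is the usual source of confusion: the first measures the whole duration, the second only the leftover hours after whole days are removed.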
