Divide Single to Multiple Columns by Delimiter Python Dataframe - python

I have a dataframe called "dates" with shape (4380, 1) that looks like this -
date
0 2017-01-01 00:00:00
1 2017-01-01 06:00:00
2 2017-01-01 12:00:00
3 2017-01-01 18:00:00
4 2017-01-02 00:00:00
...
4375 2019-12-30 18:00:00
4376 2019-12-31 00:00:00
4377 2019-12-31 06:00:00
4378 2019-12-31 12:00:00
4379 2019-12-31 18:00:00
but I need to split the single date column on the "-" delimiter so that I can group by the month, e.g., 01, 02, ..., 12. So, my final result should be a new dataframe with shape (4380, 4) that looks like:
Year Month Day HHMMSS
0 2017 01 01 00:00:00
1 2017 01 01 06:00:00
...
4379 2019 12 31 18:00:00
I cannot find how to do this transformation from a single column to multiple columns based on a delimiter. Thank you much!

Use Series.dt.strftime and Series.str.split:
new_df = df['date'].dt.strftime('%Y-%m-%d-%H:%M:%S').str.split('-',expand=True)
new_df.columns = ['Year','Month','Day', 'HHMMSS']
print(new_df)
Year Month Day HHMMSS
0 2017 01 01 00:00:00
1 2017 01 01 06:00:00
2 2017 01 01 12:00:00
3 2017 01 01 18:00:00
4 2017 01 02 00:00:00
...
4375 2019 12 30 18:00:00
4376 2019 12 31 00:00:00
4377 2019 12 31 06:00:00
4378 2019 12 31 12:00:00
4379 2019 12 31 18:00:00
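If the end goal is really grouping by month, a dt-accessor alternative (a sketch of my own, not part of the answer above) avoids the string split entirely and keeps the parts numeric:

```python
import pandas as pd

# Small sample of the 6-hourly dates in the question (frequency assumed).
df = pd.DataFrame({'date': pd.date_range('2017-01-01', periods=5, freq='6H')})

# Numeric year/month/day plus a time string, without any text splitting:
parts = pd.DataFrame({
    'Year': df['date'].dt.year,
    'Month': df['date'].dt.month,
    'Day': df['date'].dt.day,
    'HHMMSS': df['date'].dt.strftime('%H:%M:%S'),
})

# Grouping by month then needs no parsing at all:
by_month = df.groupby(df['date'].dt.month).size()
```

The upside of this route is that Year/Month/Day stay integers, so sorting and comparisons behave as expected.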

Related

Add rows for missing hourly data in a pandas dataframe

I have a pandas dataframe with 2 columns: Created (%Y-%m-%d %H) and Count, which is an integer.
It counts the number of "tickets" registered per hour.
The problem is that many hours of the day have no tickets registered.
I would like to add these hours as new rows with a Count of 0. The dataframe looks like this:
Created Count
0 2020-10-26 10 11
1 2020-10-26 09 123
2 2020-10-26 08 36
3 2020-10-26 07 28
4 2020-10-26 06 7
But I would need it to add rows like this:
Created Count
0 2020-10-26 10 11
1 2020-10-26 09 123
2 2020-10-26 08 36
3 2020-10-26 07 28
4 2020-10-26 06 7
5 2020-10-26 05 0
6 2020-10-26 04 0
It also needs to be able to update continuously as new dates are added to the original dataframe.
You can build a full hourly range and reindex against it:
import pandas as pd

df = pd.DataFrame({
'Created': ['2020-10-26 10', '2020-10-26 08', '2020-10-26 09', '2020-10-26 07', '2020-10-26 06'],
'count': [11, 10, 14, 16, 20]})
df['Created'] = pd.to_datetime(df['Created'], format='%Y-%m-%d %H')
df.sort_values(by=['Created'], inplace=True)
df.set_index('Created', inplace=True)
# full hourly range from midnight of the first day through the last timestamp
df_Date = pd.date_range(start=df.index.min().replace(hour=0), end=df.index.max(), freq='H')
df = df.reindex(df_Date, fill_value=0)
df.reset_index(inplace=True)
df.rename(columns={'index': 'Created'}, inplace=True)
print(df)
Result:
Created count
0 2020-10-26 00:00:00 0
1 2020-10-26 01:00:00 0
2 2020-10-26 02:00:00 0
3 2020-10-26 03:00:00 0
4 2020-10-26 04:00:00 0
5 2020-10-26 05:00:00 0
6 2020-10-26 06:00:00 20
7 2020-10-26 07:00:00 16
8 2020-10-26 08:00:00 10
9 2020-10-26 09:00:00 14
10 2020-10-26 10:00:00 11
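An alternative sketch using resample (my own variant, not part of the answer above). Note that unlike the reindex version it only covers the span between the first and last timestamps, not back to midnight:

```python
import pandas as pd

# Same idea with resample; hour 08 is deliberately missing from the sample.
df = pd.DataFrame({
    'Created': pd.to_datetime(
        ['2020-10-26 10', '2020-10-26 09', '2020-10-26 07', '2020-10-26 06'],
        format='%Y-%m-%d %H'),
    'count': [11, 14, 16, 20]})

# asfreq inserts a row for every missing hour between the min and max
# timestamps; fill_value sets the new rows' count to 0.
hourly = df.set_index('Created').sort_index().resample('H').asfreq(fill_value=0)
```

Re-running this as new rows arrive will always regenerate the full hourly grid, which addresses the "update continuously" requirement.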

Changing the date format of an entire Dataframe column when multiple date formats already exist in the column?

bond_df['Maturity']
0 2022-07-15 00:00:00
1 2024-07-18 00:00:00
2 2027-07-16 00:00:00
3 2020-07-28 00:00:00
4 2019-10-09 00:00:00
5 2022-04-08 00:00:00
6 2020-12-15 00:00:00
7 2022-12-15 00:00:00
8 2026-04-08 00:00:00
9 2023-04-11 00:00:00
10 2024-12-15 00:00:00
11 2019
12 2020-10-25 00:00:00
13 2024-04-22 00:00:00
14 2047-12-15 00:00:00
15 2020-07-08 00:00:00
17 2043-04-11 00:00:00
18 2021
19 2022
20 2023
21 2025
22 2026
23 2027
24 2029
25 2021-04-15 00:00:00
26 2044-04-22 00:00:00
27 2043-10-02 00:00:00
28 2039-01-19 00:00:00
29 2040-07-09 00:00:00
30 2029-09-21 00:00:00
31 2040-10-25 00:00:00
32 2019
33 2035-09-04 00:00:00
34 2035-09-28 00:00:00
35 2041-04-15 00:00:00
36 2040-04-02 00:00:00
37 2034-03-27 00:00:00
38 2030
39 2027-04-05 00:00:00
40 2038-04-15 00:00:00
41 2037-08-17 00:00:00
42 2023-10-16 00:00:00
43 -
45 2019-10-09 00:00:00
46 -
47 2021-06-23 00:00:00
48 2021-06-23 00:00:00
49 2023-06-26 00:00:00
50 2025-06-26 00:00:00
51 2028-06-26 00:00:00
52 2038-06-28 00:00:00
53 2020-06-23 00:00:00
54 2020-06-23 00:00:00
55 2048-06-29 00:00:00
56 -
57 -
58 2029-07-08 00:00:00
59 2026-07-08 00:00:00
60 2024-07-08 00:00:00
61 2020-07-31 00:00:00
Name: Maturity, dtype: object
This is a column of maturity dates for various Walmart bonds that I imported from Excel. All I am concerned with is the year portion of these dates. How can I format the entire column to return just the year values?
dt.strftime didn't work.
Thanks in advance!
I wrote this little script for you which outputs the years to a years.txt file, assuming your data is in data.txt and contains only the rows you posted above.
The script also lets you toggle whether to include the dashes and the bare-year rows.
The data.txt I tested with contains exactly the listing from the question above, one index and value per line.
and the script I wrote:
#!/usr/bin/python3
all_years = []
include_dash = False
include_years_on_right = True

with open("data.txt", "r") as f:
    text = f.read()

for line in text.split("\n"):
    line = line.strip()
    if line == "":
        continue
    if "00" in line:
        # full timestamp row: the year is the token before the first dash
        all_years.append(line.split("-")[0].split()[-1])
    else:
        if not include_years_on_right:
            continue
        year = line.split(" ")[-1]
        if year == "-":
            if include_dash:
                all_years.append(year)
        else:
            all_years.append(year)

with open("years.txt", "w") as f:
    for year in all_years:
        f.write(year + "\n")
and the output to the years.txt:
2022
2024
2027
2020
2019
2022
2020
2022
2026
2023
2024
2019
2020
2024
2047
2020
2043
2021
2022
2023
2025
2026
2027
2029
2021
2044
2043
2039
2040
2029
2040
2019
2035
2035
2041
2040
2034
2030
2027
2038
2037
2023
2019
2021
2021
2023
2025
2028
2038
2020
2020
2048
2029
2026
2024
2020
Contact me if you have any issues, and I hope I can help you!
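Staying inside pandas is also possible (a sketch of my own, not from the answer above): since the column mixes full timestamps, bare years, and '-' placeholders, a regex extract of the first four-digit run covers all three cases without a round trip through text files:

```python
import pandas as pd

# A small mixed sample like the Maturity column: full timestamps,
# bare years, and '-' placeholders.
maturity = pd.Series(['2022-07-15 00:00:00', '2019', '-', '2024-04-22 00:00:00'])

# Pull the first 4-digit run out of each value; '-' rows become NaN.
years = maturity.astype(str).str.extract(r'(\d{4})', expand=False)
```

This is likely why dt.strftime failed for the asker: the column is object dtype with mixed types, so the .dt accessor isn't available until everything is converted.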

How to use groupby on day and month in pandas?

I have a timeseries data for a full year for every minute.
timestamp day hour min somedata
2010-01-01 00:00:00 1 0 0 x
2010-01-01 00:01:00 1 0 1 x
2010-01-01 00:02:00 1 0 2 x
2010-01-01 00:03:00 1 0 3 x
2010-01-01 00:04:00 1 0 4 x
... ...
2010-12-31 23:55:00 365 23 55
2010-12-31 23:56:00 365 23 56
2010-12-31 23:57:00 365 23 57
2010-12-31 23:58:00 365 23 58
2010-12-31 23:59:00 365 23 59
I want to group-by the data based on the day, i.e 2010-01-01 data should be one group, 2010-01-02 should be another upto 2010-12-31.
I used daily_groupby = dataframe.groupby(pd.to_datetime(dataframe.index.day, unit='D', origin=pd.Timestamp('2009-12-31'))). This creates groups based only on the day of the month, so day 01 of Jan, Feb, ..., Dec all end up in one group. But I also want to group by month so that Jan, Feb, etc. do not get mixed up.
I am a beginner in pandas.
If timestamp is the index, use DatetimeIndex.date:
df.groupby(pd.to_datetime(df.index).date)
otherwise use Series.dt.date:
df.groupby(pd.to_datetime(df['timestamp']).dt.date)
If you don't want to group by year as well, use:
time_index = pd.to_datetime(df.index)
df.groupby([time_index.month,time_index.day])
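A runnable sketch putting the two snippets together (the sample data is my own, a small stand-in for the full-year minute data):

```python
import pandas as pd

# A few minutes straddling a month boundary.
idx = pd.date_range('2010-01-31 23:58', '2010-02-01 00:02', freq='T')
df = pd.DataFrame({'somedata': range(len(idx))}, index=idx)

# One group per calendar date (year, month and day together):
per_day = df.groupby(df.index.date).size()

# Month/day pairs, ignoring the year:
time_index = pd.to_datetime(df.index)
per_month_day = df.groupby([time_index.month, time_index.day]).size()
```

Grouping by df.index.date keeps 2010-01-01 and 2010-01-02 separate, which is exactly what the asker's day-of-month groupby could not do.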

How to fill rest of the day with temperature_min and temperature_max of that day using pandas?

I have a dataframe which has 4 columns: day, time, tmin and tmax. tmin shows the temperature_min of the day and tmax shows the temperature_max.
What I want is to be able to fill all of the NaN values of one day with tmin and tmax of that day. For example I want to convert this dataframe:
day time tmin tmax
0 01 00:00:00 NaN NaN
1 01 03:00:00 -6.8 NaN
2 01 06:00:00 NaN NaN
3 01 09:00:00 NaN NaN
4 01 12:00:00 NaN NaN
5 01 15:00:00 NaN 1.2
6 01 18:00:00 NaN NaN
7 01 21:00:00 NaN NaN
8 02 00:00:00 NaN NaN
9 02 03:00:00 -7.2 NaN
10 02 06:00:00 NaN NaN
11 02 09:00:00 NaN NaN
12 02 12:00:00 NaN NaN
13 02 15:00:00 NaN 1.8
14 02 18:00:00 NaN NaN
15 02 21:00:00 NaN NaN
to this dataframe:
day time tmin tmax
0 01 00:00:00 -6.8 1.2
1 01 03:00:00 -6.8 1.2
2 01 06:00:00 -6.8 1.2
3 01 09:00:00 -6.8 1.2
4 01 12:00:00 -6.8 1.2
5 01 15:00:00 -6.8 1.2
6 01 18:00:00 -6.8 1.2
7 01 21:00:00 -6.8 1.2
8 02 00:00:00 -7.2 1.8
9 02 03:00:00 -7.2 1.8
10 02 06:00:00 -7.2 1.8
11 02 09:00:00 -7.2 1.8
12 02 12:00:00 -7.2 1.8
13 02 15:00:00 -7.2 1.8
14 02 18:00:00 -7.2 1.8
15 02 21:00:00 -7.2 1.8
Using groupby and transform:
df.assign(**df.groupby('day')[['tmin', 'tmax']].transform('first'))
day time tmin tmax
0 1 00:00:00 -6.8 1.2
1 1 03:00:00 -6.8 1.2
2 1 06:00:00 -6.8 1.2
3 1 09:00:00 -6.8 1.2
4 1 12:00:00 -6.8 1.2
5 1 15:00:00 -6.8 1.2
6 1 18:00:00 -6.8 1.2
7 1 21:00:00 -6.8 1.2
8 2 00:00:00 -7.2 1.8
9 2 03:00:00 -7.2 1.8
10 2 06:00:00 -7.2 1.8
11 2 09:00:00 -7.2 1.8
12 2 12:00:00 -7.2 1.8
13 2 15:00:00 -7.2 1.8
14 2 18:00:00 -7.2 1.8
15 2 21:00:00 -7.2 1.8
Or, if you'd like to modify the original DataFrame instead of returning a copy:
df[['tmin', 'tmax']] = df.groupby('day')[['tmin', 'tmax']].transform('first')
If you want a less tidy alternative to #user3483203's answer:
import pandas as pd

mydata = pd.read_csv('temperature.txt', sep=' ')
for i in mydata['day'].unique():
    row_start = (i - 1) * 8  # assuming 8 data points per day
    row_end = i * 8
    mydata.loc[row_start:row_end - 1, 'tmin'] = mydata['tmin'][row_start:row_end].min(skipna=True)
    mydata.loc[row_start:row_end - 1, 'tmax'] = mydata['tmax'][row_start:row_end].max(skipna=True)
Since you did not post any code, here's a general solution:
Step 1: Create variables that will keep track of the min and max temps
Step 2: Loop through each row in the frame
Step 3: For each row, check whether tmin or tmax is NaN
Step 4: If it is, replace with the value of the min or max variable we created earlier
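The steps above can be sketched as a loop over day groups (sample data and names are my own; the transform answer earlier remains the idiomatic route):

```python
import pandas as pd
import numpy as np

# Sample frame shaped like the question's.
df = pd.DataFrame({
    'day': ['01'] * 4 + ['02'] * 4,
    'tmin': [np.nan, -6.8, np.nan, np.nan, np.nan, -7.2, np.nan, np.nan],
    'tmax': [np.nan, np.nan, 1.2, np.nan, np.nan, np.nan, 1.8, np.nan],
})

# Track each day's min/max, then write them back over every row of that day.
for day, group in df.groupby('day'):
    df.loc[group.index, 'tmin'] = group['tmin'].min(skipna=True)
    df.loc[group.index, 'tmax'] = group['tmax'].max(skipna=True)
```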
Just use fillna with the forward fill and back fill parameters:
df.tmin = df.groupby('day')['tmin'].fillna(method='ffill').fillna(method='bfill')
df.tmax = df.groupby('day')['tmax'].fillna(method='ffill').fillna(method='bfill')

pandas - groupby and filtering for consecutive values

I have this dataframe df:
U,Datetime
01,2015-01-01 20:00:00
01,2015-02-01 20:05:00
01,2015-04-01 21:00:00
01,2015-05-01 22:00:00
01,2015-07-01 22:05:00
02,2015-08-01 20:00:00
02,2015-09-01 21:00:00
02,2014-01-01 23:00:00
02,2014-02-01 22:05:00
02,2015-01-01 20:00:00
02,2014-03-01 21:00:00
03,2015-10-01 20:00:00
03,2015-11-01 21:00:00
03,2015-12-01 23:00:00
03,2015-01-01 22:05:00
03,2015-02-01 20:00:00
03,2015-05-01 21:00:00
03,2014-01-01 20:00:00
03,2014-02-01 21:00:00
made up of U and a Datetime object. What I would like to do is filter for U values having at least three consecutive monthly occurrences within a year. So far I have grouped by U, year and month as:
m = df.groupby(['U',df.index.year,df.index.month]).size()
obtaining:
U
1 2015 1 1
2 1
4 1
5 1
7 1
2 2014 1 1
2 1
3 1
2015 1 1
8 1
9 1
3 2014 1 1
2 1
2015 1 1
2 1
5 1
10 1
11 1
12 1
The third column shows the number of occurrences in each month/year. In this case only the U values 02 and 03 contain at least three consecutive months within a year. Now I can't figure out how to select those users and get them out in a list, for instance, or just keep them in the original dataframe df and discard the others. I also tried:
g = m.groupby(level=[0,1]).diff()
But I can't get any useful information.
Finally I came up with the solution :).
To give you an idea of how the custom function works: it simply subtracts each month value from its predecessor; the result should be 1, and this should happen twice in a row. For example, given the list [5, 6, 7]: 7 - 6 = 1 and 6 - 5 = 1, so 1 appears twice and the condition is fulfilled.
In [80]:
df.reset_index(inplace=True)
In [281]:
df['month'] = df.Datetime.dt.month
df['year'] = df.Datetime.dt.year
df
Out[281]:
Datetime U month year
0 2015-01-01 20:00:00 1 1 2015
1 2015-02-01 20:05:00 1 2 2015
2 2015-04-01 21:00:00 1 4 2015
3 2015-05-01 22:00:00 1 5 2015
4 2015-07-01 22:05:00 1 7 2015
5 2015-08-01 20:00:00 2 8 2015
6 2015-09-01 21:00:00 2 9 2015
7 2014-01-01 23:00:00 2 1 2014
8 2014-02-01 22:05:00 2 2 2014
9 2015-01-01 20:00:00 2 1 2015
10 2014-03-01 21:00:00 2 3 2014
11 2015-10-01 20:00:00 3 10 2015
12 2015-11-01 21:00:00 3 11 2015
13 2015-12-01 23:00:00 3 12 2015
14 2015-01-01 22:05:00 3 1 2015
15 2015-02-01 20:00:00 3 2 2015
16 2015-05-01 21:00:00 3 5 2015
17 2014-01-01 20:00:00 3 1 2014
18 2014-02-01 21:00:00 3 2 2014
In [284]:
g = df.groupby([df['U'] , df.year])
In [86]:
res = g.filter(lambda x : is_at_least_three_consec(x['month'].diff().values.tolist()))
res
Out[86]:
Datetime U month year
7 2014-01-01 23:00:00 2 1 2014
8 2014-02-01 22:05:00 2 2 2014
10 2014-03-01 21:00:00 2 3 2014
11 2015-10-01 20:00:00 3 10 2015
12 2015-11-01 21:00:00 3 11 2015
13 2015-12-01 23:00:00 3 12 2015
14 2015-01-01 22:05:00 3 1 2015
15 2015-02-01 20:00:00 3 2 2015
16 2015-05-01 21:00:00 3 5 2015
If you want to see the result of the custom function per group:
In [84]:
res = g['month'].agg(lambda x : is_at_least_three_consec(x.diff().values.tolist()))
res
Out[84]:
U year
1 2015 False
2 2014 True
2015 False
3 2014 False
2015 True
Name: month, dtype: bool
This is how the custom function is implemented:
In [53]:
def is_at_least_three_consec(month_diff):
    consec_count = 0
    for index, val in enumerate(month_diff):
        if index != 0 and val == 1:
            consec_count += 1
            if consec_count == 2:
                return True
        else:
            consec_count = 0
    return False
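A variant of the same idea (my own sketch): sorting the unique months first means the check doesn't depend on the row order within a group, though, like the original, it ignores December-to-January wraparound:

```python
import pandas as pd

# Hypothetical sample: U=2 has three consecutive months in 2014, U=1 does not.
df = pd.DataFrame({
    'U': [2, 2, 2, 1, 1],
    'year': [2014, 2014, 2014, 2015, 2015],
    'month': [1, 2, 3, 1, 4],
})

def has_three_consecutive(months):
    # Sort the unique months so input order doesn't matter.
    m = sorted(set(months))
    # Three strictly increasing integers spanning exactly 2 are consecutive.
    return any(m[i + 2] - m[i] == 2 for i in range(len(m) - 2))

kept = df.groupby(['U', 'year']).filter(lambda g: has_three_consecutive(g['month']))
```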
