Sorting Columns By Ascending Order - python

Given this example dataframe,
Date 01012019 01022019 02012019 02022019 03012019 03022019
Period
1 45 21 43 23 32 23
2 42 12 43 11 14 65
3 11 43 24 23 21 12
I would like to sort the dates by month (the dates are in ddmmyyyy format). However, the date is a string when I check type(date). I tried to use pd.to_datetime, but it failed with the error "month must be in 1..12".
Any advice? Thank you!

Specify the format of the datetimes in to_datetime and then sort the columns with sort_index:
df.columns = pd.to_datetime(df.columns, format='%d%m%Y')
df = df.sort_index(axis=1)
print (df)
2019-01-01 2019-01-02 2019-01-03 2019-02-01 2019-02-02 2019-02-03
Date
1 45 43 32 21 23 23
2 42 43 14 12 11 65
3 11 24 21 43 23 12
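For reference, here is a minimal self-contained sketch of the same approach, rebuilding the example frame from the question (labels and values copied from above):
import pandas as pd
df = pd.DataFrame(
    [[45, 21, 43, 23, 32, 23],
     [42, 12, 43, 11, 14, 65],
     [11, 43, 24, 23, 21, 12]],
    index=pd.Index([1, 2, 3], name='Period'),
    columns=['01012019', '01022019', '02012019', '02022019', '03012019', '03022019'])
# Parse the ddmmyyyy labels, then put the columns in chronological order
df.columns = pd.to_datetime(df.columns, format='%d%m%Y')
df = df.sort_index(axis=1)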


Use Pandas to update year of a subset of dates in a column

I have a pandas dataframe that consists of a date column of one year and then daily data to go along with it. I want to update the year of just the rows that pertain to January. I can select the January subset of my dataframe just fine, and try to change the year of that subset based on the answer given here, but when I try to update the values of that subset by adding an offset I get an error.
Setup:
import pandas as pd
df = pd.DataFrame({'Date': pd.date_range(start = "01-01-2023", end = "12-31-2023"), 'data': 25})
Select January subset:
df[df['Date'].dt.month == 1]
This works as expected:
Date data
0 2023-01-01 25
1 2023-01-02 25
2 2023-01-03 25
3 2023-01-04 25
4 2023-01-05 25
5 2023-01-06 25
6 2023-01-07 25
7 2023-01-08 25
8 2023-01-09 25
9 2023-01-10 25
10 2023-01-11 25
11 2023-01-12 25
12 2023-01-13 25
13 2023-01-14 25
14 2023-01-15 25
15 2023-01-16 25
16 2023-01-17 25
17 2023-01-18 25
18 2023-01-19 25
19 2023-01-20 25
20 2023-01-21 25
21 2023-01-22 25
22 2023-01-23 25
23 2023-01-24 25
24 2023-01-25 25
25 2023-01-26 25
26 2023-01-27 25
27 2023-01-28 25
28 2023-01-29 25
29 2023-01-30 25
30 2023-01-31 25
Attempt to change:
df[df['Date'].dt.month == 1] = df[df['Date'].dt.month == 1] + pd.offsets.DateOffset(years=1)
TypeError: Concatenation operation is not implemented for NumPy arrays, use np.concatenate() instead. Please do not rely on this error; it may not be given on all Python implementations.
I've tried a few different variations of this but seem to be having issues changing the subset dataframe data.
You have to select the Date column, so the offset is applied only to the datetimes rather than to every column of the subset (solution enhanced by #mozway, thanks):
df.loc[df['Date'].dt.month == 1, 'Date'] += pd.offsets.DateOffset(years=1)
print(df)
# Output
Date data
0 2024-01-01 25
1 2024-01-02 25
2 2024-01-03 25
3 2024-01-04 25
4 2024-01-05 25
.. ... ...
360 2023-12-27 25
361 2023-12-28 25
362 2023-12-29 25
363 2023-12-30 25
364 2023-12-31 25
[365 rows x 2 columns]
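If it helps readability, the same fix can also be written with the boolean mask stored in a variable first - a small variation on the line above, not a different method:
# Same df as in the Setup above
jan = df['Date'].dt.month == 1
df.loc[jan, 'Date'] = df.loc[jan, 'Date'] + pd.offsets.DateOffset(years=1)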

How to split in train and test by month

I have a dataframe structured like this
Time Z X Y
01-01-18 1 20 10
02-01-18 20 4 15
03-01-18 34 16 21
04-01-18 67 38 8
05-01-18 89 10 18
06-01-18 45 40 4
07-01-18 22 10 13
08-01-18 1 46 11
...
24-12-20 56 28 9
25-12-20 6 14 22
26-12-20 9 5 40
27-12-20 56 11 10
28-12-21 78 61 35
29-12-21 33 23 29
30-12-21 2 35 12
31-12-21 0 31 7
I have data for all days and months from 2018 to 2021, with around 50k observations
How can I aggregate all the data for the same month and perform a train-test split for each month, i.e. for all the data of January, February, March, and so on?
try this:
df['month'] = df.Time.apply(lambda x: x.split('-')[1]) #get month
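Building on that month column, a per-month train-test split could then look roughly like this. This is only a sketch: it assumes scikit-learn is available, the 20% test fraction is arbitrary, and the %d-%m-%y format is inferred from the sample data:
import pandas as pd
from sklearn.model_selection import train_test_split
# Parsing the dates is a bit more robust than splitting strings
df['month'] = pd.to_datetime(df['Time'], format='%d-%m-%y').dt.month
train_parts, test_parts = [], []
for month, group in df.groupby('month'):
    tr, te = train_test_split(group, test_size=0.2, random_state=0)
    train_parts.append(tr)
    test_parts.append(te)
train = pd.concat(train_parts)
test = pd.concat(test_parts)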

Interpolate data and then merge 2 dataframes

I am starting off with Python and using Pandas.
I have 2 CSVs, i.e.
CSV1
Date Col1 Col2
2021-01-01 20 15
2021-01-02 22 12
2021-01-03 30 18
.
.
2021-12-31 125 160
so on and so forth...
CSV2
Start_Date End_Date Sunday Monday Tuesday Wednesday Thursday Friday Saturday
2021-01-01 2021-02-25 15 25 35 45 30 40 55
2021-02-26 2021-05-31 25 30 44 35 50 45 66
.
.
2021-09-01 2021-0-25 44 25 65 54 24 67 38
Desired result
Date Col1 Col2 New_Col3 New_Col4
2021-01-01 20 15 Fri 40
2021-01-02 22 12 Sat 55
2021-01-03 30 18 Sun 15
.
.
2021-12-31 125 160 Fri 67
New_Col3 is the weekday abbreviation of Date
New_Col4 is the cell in CSV2 where the Date falls between Start_Date and End_Date row-wise, and from the corresponding weekday column-wise.
# Convert date column to datetime
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Start_Date'] = pd.to_datetime(df2['Start_Date'])
df2['End_Date'] = pd.to_datetime(df2['End_Date'])
# Get abbreviated weekday name
df1['New_Col3'] = df1['Date'].apply(lambda x: x.strftime('%a'))
New_Col4 = []
# Iterate over df1
for i in range(len(df1)):
    # If df1['date'] is in between df2['Start_Date'] and df2['End_Date']
    # Get the value according to df1['date'] weekday name
    for j in range(len(df2)):
        if df2.loc[j, 'Start_Date'] <= df1.loc[i, 'Date'] <= df2.loc[j, 'End_Date']:
            day_name = df1.loc[i, 'Date'].strftime('%A')
            New_Col4.append(df2.loc[j, day_name])
# Assign the result to a new column
df1['New_Col4'] = New_Col4
# print(df1)
Date Col1 Col2 New_Col3 New_Col4
0 2021-01-01 20 15 Fri 40
1 2021-01-02 22 12 Sat 55
2 2021-01-03 30 18 Sun 15
3 2021-03-03 40 18 Wed 35
Keys
Construct datetime and interval indexes to enable pd.IntervalIndex.get_indexer(pd.DatetimeIndex) for efficient row-matching. (reference post)
Apply a value retrieval function from df2 on each row of df1 for New_Col4.
With this approach, an explicit double for-loop search can be avoided in the row-matching. However, a slow .apply() is still required. Maybe there is a fancy way to combine these two steps, but I will stop here for the time being (see the NumPy-indexing sketch after the Result section for one possibility).
Data
Typo in the last End_Date is changed.
import pandas as pd
import io
df1 = pd.read_csv(io.StringIO("""
Date Col1 Col2
2021-01-01 20 15
2021-01-02 22 12
2021-01-03 30 18
2021-12-31 125 160
"""), sep=r"\s+", engine='python')
df2 = pd.read_csv(io.StringIO("""
Start_Date End_Date Sunday Monday Tuesday Wednesday Thursday Friday Saturday
2021-01-01 2021-02-25 15 25 35 45 30 40 55
2021-02-26 2021-05-31 25 30 44 35 50 45 66
2021-09-01 2022-01-25 44 25 65 54 24 67 38
"""), sep=r"\s+", engine='python')
df1["Date"] = pd.to_datetime(df1["Date"])
df2["Start_Date"] = pd.to_datetime(df2["Start_Date"])
df2["End_Date"] = pd.to_datetime(df2["End_Date"])
Solution
# 1. Get weekday name
df1["day_name"] = df1["Date"].dt.day_name()
df1["New_Col3"] = df1["day_name"].str[:3]
# 2-1. find corresponding row in df2
df1.set_index("Date", inplace=True)
idx = pd.IntervalIndex.from_arrays(df2["Start_Date"], df2["End_Date"], closed="both")
df1["df2_row"] = idx.get_indexer(df1.index)
# 2-2. pick out the value from df2
def f(row):
    """Get (#row, day_name) in df2"""
    return df2[row["day_name"]].iloc[row["df2_row"]]
df1["New_Col4"] = df1.apply(f, axis=1)
Result
print(df1.drop(columns=["day_name", "df2_row"]))
Col1 Col2 New_Col3 New_Col4
Date
2021-01-01 20 15 Fri 40
2021-01-02 22 12 Sat 55
2021-01-03 30 18 Sun 15
2021-12-31 125 160 Fri 67
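As one possible way to drop the .apply() step mentioned in the Keys, the row and column positions can be looked up separately and combined with plain NumPy fancy indexing - a sketch reusing df1, df2 and idx exactly as prepared in the Data and Solution sections above:
day_cols = df2.columns[2:]                       # the Sunday..Saturday columns
row_pos = idx.get_indexer(df1.index)             # which df2 row each date falls into
col_pos = day_cols.get_indexer(df1["day_name"])  # which weekday column to read
df1["New_Col4"] = df2[day_cols].to_numpy()[row_pos, col_pos]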

python pandas assign yyyy-mm-dd from multiple years into accumulated week numbers

Given a file with the following columns:
date, userid, amount
where date is in yyyy-mm-dd format. I am trying to use pandas to assign accumulated week numbers to dates spanning multiple years. For example:
2017-01-01 => 1
2017-12-31 => 52
2018-01-01 => 53
df_counts_dates=pd.read_csv("counts.csv")
print (df_counts_dates['date'].unique())
df = pd.to_datetime(df_counts_dates['date'])
print (df.unique())
print (df.dt.week.unique())
since the data contains Aug 2017-Aug 2018 dates, the above returns
[33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 1 2 3 4 5
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
31 32]
I am wondering if there is any easy way to make the first date "week 1", and make the week number accumulate across years instead of becoming 1 at the beginning of each year?
I believe a slightly different approach is needed: subtract the first value of the column from all values, convert the timedeltas to days, floor-divide by 7, and finally add 1 so the numbering does not start at 0:
rng = pd.date_range('2017-08-01', periods=365)
df = pd.DataFrame({'date': rng, 'a': range(365)})
print (df.head())
date a
0 2017-08-01 0
1 2017-08-02 1
2 2017-08-03 2
3 2017-08-04 3
4 2017-08-05 4
w = ((df['date'] - df['date'].iloc[0]).dt.days // 7 + 1).unique()
print (w)
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
49 50 51 52 53]
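Applied to the question's counts.csv, the same idea could look like this - a sketch that assumes the date column parses cleanly and anchors week 1 at the earliest date rather than at the first row:
import pandas as pd
df = pd.read_csv("counts.csv", parse_dates=["date"])
first = df["date"].min()                        # earliest date becomes week 1
df["week"] = (df["date"] - first).dt.days // 7 + 1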

Combine duplicated columns within a DataFrame

If I have a dataframe that has columns that include the same name, is there a way to combine the columns that have the same name with some sort of function (i.e. sum)?
For instance with:
In [186]:
df["NY-WEB01"].head()
Out[186]:
NY-WEB01 NY-WEB01
DateTime
2012-10-18 16:00:00 5.6 2.8
2012-10-18 17:00:00 18.6 12.0
2012-10-18 18:00:00 18.4 12.0
2012-10-18 19:00:00 18.2 12.0
2012-10-18 20:00:00 19.2 12.0
How might I collapse the NY-WEB01 columns (there are a bunch of duplicate columns, not just NY-WEB01) by summing each row where the column name is the same?
I believe this does what you are after:
df.groupby(lambda x:x, axis=1).sum()
Alternatively, between 3% and 15% faster depending on the length of the df:
df.groupby(df.columns, axis=1).sum()
EDIT: To extend this beyond sums, use .agg() (short for .aggregate()):
df.groupby(df.columns, axis=1).agg(numpy.max)
pandas >= 0.20: df.groupby(level=0, axis=1)
You don't need a lambda here, nor do you explicitly have to query df.columns; groupby accepts a level argument you can specify in conjunction with the axis argument. This is cleaner, IMO.
# Setup
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
df
A A B B B
0 44 47 0 3 3
1 39 9 19 21 36
2 23 6 24 24 12
3 1 38 39 23 46
4 24 17 37 25 13
df.groupby(level=0, axis=1).sum()
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
Handling MultiIndex columns
Another case to consider is when dealing with MultiIndex columns. Consider
df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
df
one two
A A B B B
0 44 47 0 3 3
1 39 9 19 21 36
2 23 6 24 24 12
3 1 38 39 23 46
4 24 17 37 25 13
To perform aggregation across the upper levels, use
df.groupby(level=1, axis=1).sum()
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
or, if aggregating per upper level only, use
df.groupby(level=[0, 1], axis=1).sum()
one two
A B B
0 91 0 6
1 48 19 57
2 29 24 36
3 39 39 69
4 41 37 38
Alternate Interpretation: Dropping Duplicate Columns
If you came here looking to find out how to simply drop duplicate columns (without performing any aggregation), use Index.duplicated:
df.loc[:,~df.columns.duplicated()]
A B
0 44 0
1 39 19
2 23 24
3 1 39
4 24 37
Or, to keep the last ones, specify keep='last' (default is 'first'),
df.loc[:,~df.columns.duplicated(keep='last')]
A B
0 47 3
1 9 36
2 6 12
3 38 46
4 17 13
The groupby alternatives for the two solutions above are df.groupby(level=0, axis=1).first(), and ... .last(), respectively.
Here is a possibly simpler solution for common aggregation functions like sum, mean, median, max, min and std: just use the parameter axis=1 for working with columns, together with level:
#coldspeed samples
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
print (df)
print (df.sum(axis=1, level=0))
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
print (df.sum(axis=1, level=1))
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
print (df.sum(axis=1, level=[0,1]))
one two
A B B
0 91 0 6
1 48 19 57
2 29 24 36
3 39 39 69
4 41 37 38
It works similarly for the index; then use axis=0 instead of axis=1:
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('ABCDE'), index=list('aabbc'))
print (df)
A B C D E
a 44 47 0 3 3
a 39 9 19 21 36
b 23 6 24 24 12
b 1 38 39 23 46
c 24 17 37 25 13
print (df.min(axis=0, level=0))
A B C D E
a 39 9 0 3 3
b 1 6 24 23 12
c 24 17 37 25 13
df.index = pd.MultiIndex.from_arrays([['bar']*3 + ['foo']*2, df.index])
print (df.mean(axis=0, level=1))
A B C D E
a 41.5 28.0 9.5 12.0 19.5
b 12.0 22.0 31.5 23.5 29.0
c 24.0 17.0 37.0 25.0 13.0
print (df.max(axis=0, level=[0,1]))
A B C D E
bar a 44 47 19 21 36
b 23 6 24 24 12
foo b 1 38 39 23 46
c 24 17 37 25 13
If you need other functions like first, last, size or count, it is necessary to use coldspeed's answer.
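One extra note: the level= argument of sum/mean/etc. and the axis=1 argument of groupby used above have since been deprecated in newer pandas releases, so on recent versions a transpose-based equivalent of the column-wise grouping may be needed - a sketch using the same AABBB frame as above:
# Transpose, group the duplicate labels on the (now) index, and transpose back
df.T.groupby(level=0).sum().T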
