I am a beginner in Python and I am trying to change column names that currently represent week numbers into something easier to digest. I want them to show the date of the week commencing, but I am having issues with converting the types.
I have a table that looks similar to the following:
import pandas as pd
data = [[0,'John',1,2,3]]
df = pd.DataFrame(data, columns = ['Index','Owner','32.0','33.0','34.0'])
print(df)
I tried to use df.melt to get a column with the week numbers, then convert them to datetime and obtain the week commencing from that, but I have not been successful.
df = df.melt(id_vars=['Owner'])
df['variable'] = pd.to_datetime(df['variable'], format = '%U')
This is as far as I have gotten as I have not been able to obtain the week number as a datetime type to then use it to get the week commencing.
After this, I was going to then transform the dataframe back to its original shape and have the newly obtained week commencing date times as the column headers again.
Can anyone advise me on what I am doing wrong, or alternatively is there a better way to do this?
Any help would be greatly appreciated!
First add the Index column to id_vars in melt so that variable contains only the week values. Then convert those values to floats, then integers, then strings, and append a weekday and year so they can be parsed as week numbers:
data = [[0,'John',1,2,3]]
df = pd.DataFrame(data, columns = ['Index','Owner','32.0','33.0','34.0'])
print(df)
Index Owner 32.0 33.0 34.0
0 0 John 1 2 3
df = df.melt(id_vars=['Index','Owner'])
s = df['variable'].astype(float).astype(int).astype(str) + '-0-2021'
print (s)
0 32-0-2021
1 33-0-2021
2 34-0-2021
Name: variable, dtype: object
#https://stackoverflow.com/a/17087427/2901002
df['variable'] = pd.to_datetime(s, format = '%W-%w-%Y')
print (df)
Index Owner variable value
0 0 John 2021-08-15 1
1 0 John 2021-08-22 2
2 0 John 2021-08-29 3
EDIT:
To get back the original DataFrame shape (with integer week numbers as columns), use DataFrame.pivot:
df1 = (df.pivot(index=['Index','Owner'], columns='variable', values='value')
.rename_axis(None, axis=1))
df1.columns = df1.columns.strftime('%W')
df1 = df1.reset_index()
print (df1)
Index Owner 32 33 34
0 0 John 1 2 3
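If only the headers need to change, melting and pivoting may not be necessary at all. The following is a minimal sketch, not the answer's method: it assumes the weeks belong to 2021 (as in the answer above) and that "week commencing" means the Monday of the week. The helper name week_commencing is hypothetical.

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame([[0, 'John', 1, 2, 3]],
                  columns=['Index', 'Owner', '32.0', '33.0', '34.0'])


def week_commencing(label, year=2021):
    """Hypothetical helper: turn a label such as '32.0' into the Monday of that week."""
    week = int(float(label))
    # %W counts weeks starting on Monday; %w = 1 selects the Monday itself
    monday = datetime.strptime(f'{week}-1-{year}', '%W-%w-%Y')
    return monday.strftime('%Y-%m-%d')


# Assumption: every column other than Index and Owner is a week number
week_cols = [c for c in df.columns if c not in ('Index', 'Owner')]
renamed = df.rename(columns={c: week_commencing(c) for c in week_cols})
print(renamed.columns.tolist())
```

The values stay untouched; only the column labels are replaced, so no reshaping back is needed.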
One solution to convert a week number to a date is to use a timedelta. For example you may have
from datetime import timedelta, datetime
week_number = 5
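# 4 January 2021 is the first Monday of 2021 (3 January is a Sunday)
first_monday_of_the_year = datetime(2021, 1, 4)
# Assumes week 1 starts on the first Monday of the year (the %W convention)
week_date = first_monday_of_the_year + timedelta(weeks=week_number - 1)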
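The first Monday does not have to be hard-coded. Here is a minimal sketch that computes it for any year, assuming week 1 starts on the first Monday of the year (the same convention as the %W format code). The helper name week_start is hypothetical.

```python
from datetime import date, timedelta


def week_start(year, week_number):
    """Hypothetical helper: Monday of the given week, where week 1
    starts on the first Monday of the year (the %W convention)."""
    jan1 = date(year, 1, 1)
    # date.weekday() is 0 for Monday, so this is the gap to the next Monday
    days_to_monday = (7 - jan1.weekday()) % 7
    first_monday = jan1 + timedelta(days=days_to_monday)
    return first_monday + timedelta(weeks=week_number - 1)


print(week_start(2021, 32))
```

Other week-numbering schemes (ISO weeks, for example) would need a different starting point, so the convention is worth checking against the source of the data.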
I have a pandas DataFrame called new in which the YearMonth column holds dates in the format YYYY-MM. I want to drop rows based on a condition: the date is later than "2020-05". I tried using this:
new = new.drop(new[new.YearMonth>'2020-05'].index)
but it's not working; it raises a syntax error, "invalid token".
Here is a sample DataFrame:
>>> new = pd.DataFrame({
'YearMonth': ['2014-09', '2014-10', '2020-09', '2021-09']
})
>>> print(new)
YearMonth
0 2014-09
1 2014-10
2 2020-09
3 2021-09
The expected DataFrame after the drop should be:
YearMonth
0 2014-09
1 2014-10
Just convert to datetime, then convert it to a monthly period and subset it.
import pandas as pd
new['YearMonth']=pd.to_datetime(new['YearMonth']).dt.to_period('M')
new=new[~(new['YearMonth']>'2020-05')]
I think you want boolean indexing, changing > to <= so that the rows to keep are selected. Comparing by month periods works nicely:
new = pd.DataFrame({
'YearMonth': pd.to_datetime(['2014-09', '2014-10', '2020-09', '2021-09']).to_period('M')
})
print (new)
YearMonth
0 2014-09
1 2014-10
2 2020-09
3 2021-09
df = new[new.YearMonth <= pd.Period('2020-05', freq='M')]
print (df)
YearMonth
0 2014-09
1 2014-10
In recent versions of pandas, comparing against a string also works:
df = new[new.YearMonth <= '2020-05']
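Because the YearMonth values are zero-padded in YYYY-MM form, the raw strings already sort chronologically, so a plain string comparison may be enough when no date arithmetic is needed. A minimal sketch under that assumption (kept is an illustrative name):

```python
import pandas as pd

new = pd.DataFrame({'YearMonth': ['2014-09', '2014-10', '2020-09', '2021-09']})

# Lexicographic comparison matches chronological order for zero-padded YYYY-MM
kept = new[new['YearMonth'] <= '2020-05']
print(kept)
```

This would not hold for formats such as M/YYYY, where converting to datetime or period first is the safer route.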
Background: I have logs which are generated during the testing of the devices after manufacture. Each device has a serial number and a corresponding csv log file with all the data. Something like this.
DATE,TESTSTEP,READING,LIMIT,RESULT
01/01/2019 07:37:17.432 AM,1,23,10,FAIL
01/01/2019 07:37:23.661 AM,2,3,3,PASS
So there are many such log files. Each with the test data.
I have the serial numbers of the devices which failed in the field. I want to create a model using these log files, and then use it to predict whether a given device has a chance of failing in the field given its log file.
Till now as a part of learning, I have worked with data like housing price. Every row was complete. Depending on area, number of rooms etc, it was easy to define a model for expected selling price.
Here I wish to find a way to somehow flatten all the logs into a single row. I am thinking of having something like:
DATE_1,TESTSTEP_1,READING_1,LIMIT_1,RESULT_1,DATE_2,TESTSTEP_2,READING_2,LIMIT_2,RESULT_2
1/1/2019 07:37:17.432 AM,1,23,10,FAIL,01/01/2019 07:37:23.661 AM,2,3,3,PASS
Is this the right way to deal with this kind of data?
If so, does pandas have any built-in support for this?
I will be using scikit-learn to create models.
First convert the columns to an ordered CategoricalIndex so the output keeps the same column order. Convert the DATE column with to_datetime, extract the dates with Series.dt.date, and use cumcount as a counter within each date. Then create a MultiIndex with set_index, reshape with unstack, and sort the second level of the column MultiIndex with sort_index. Finally, flatten the column names with a list comprehension and call reset_index:
df['DATE'] = pd.to_datetime(df['DATE'])
dates = df['DATE'].dt.date
df.columns = pd.CategoricalIndex(df.columns,categories=df.columns, ordered=True)
g = df.groupby(dates).cumcount().add(1)
df = df.set_index([dates, g]).unstack().sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index(drop=True)
print (df)
DATE_1 TESTSTEP_1 READING_1 LIMIT_1 RESULT_1 \
0 2019-01-01 07:37:17.432 1 23 10 FAIL
DATE_2 TESTSTEP_2 READING_2 LIMIT_2 RESULT_2
0 2019-01-01 07:37:23.661 2 3 3 PASS
If you also need the dates in a separate first column:
df['DATE'] = pd.to_datetime(df['DATE'])
dates = df['DATE'].dt.date
df.columns = pd.CategoricalIndex(df.columns,categories=df.columns, ordered=True)
g = df.groupby(dates).cumcount().add(1)
df = df.set_index([dates.rename('DAT'), g]).unstack().sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
DAT DATE_1 TESTSTEP_1 READING_1 LIMIT_1 RESULT_1 \
0 2019-01-01 2019-01-01 07:37:17.432 1 23 10 FAIL
DATE_2 TESTSTEP_2 READING_2 LIMIT_2 RESULT_2
0 2019-01-01 07:37:23.661 2 3 3 PASS
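Since the question describes one log file per device, the flattening can also be done per file and keyed by test step. The following is a minimal sketch, not the method above: the helper name flatten_log, the SERIAL column and the serial number 'SN0001' are all hypothetical, and it assumes each TESTSTEP appears once per log.

```python
import io

import pandas as pd

# Sample log copied from the question; in practice each device has its own file
csv_text = """DATE,TESTSTEP,READING,LIMIT,RESULT
01/01/2019 07:37:17.432 AM,1,23,10,FAIL
01/01/2019 07:37:23.661 AM,2,3,3,PASS
"""


def flatten_log(log, serial):
    """Hypothetical helper: collapse one device log into a single-row DataFrame."""
    flat = {'SERIAL': serial}
    for record in log.to_dict('records'):
        step = record['TESTSTEP']
        for column, value in record.items():
            # Suffix every column with its test step, e.g. READING_1
            flat[f'{column}_{step}'] = value
    return pd.DataFrame([flat])


log = pd.read_csv(io.StringIO(csv_text))
row = flatten_log(log, serial='SN0001')  # 'SN0001' is a made-up serial number
print(row.columns.tolist())
```

Rows from many files could then be stacked with pd.concat(rows, ignore_index=True). Text columns such as RESULT would still need to be encoded numerically before being passed to scikit-learn.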
I have two different DataFrames and I concatenated them. All columns are the same; however, the date column has all sorts of different dates in the M/D/YR format.
dataframe dates get shuffled around later in the sequence
Is there a way to keep the whole DataFrame as it is and just sort the rows based on the dates in the date column? I also want to keep the format that the date is in.
so basically
date people
6/8/2015 1
7/10/2018 2
6/5/2015 0
gets converted into:
date people
6/5/2015 0
6/8/2015 1
7/10/2018 2
Thank you!
PS: I've tried the options in the other post on this, but they do not work.
Trying to elaborate on what can be done:
Initialize/merge the DataFrame and convert the column to datetime type:
df= pd.DataFrame({'people':[1,2,0],'date': ['6/8/2015','7/10/2018','6/5/2015',]})
df.date=pd.to_datetime(df.date,format="%m/%d/%Y")
print(df)
Output:
date people
0 2015-06-08 1
1 2018-07-10 2
2 2015-06-05 0
Sort on the basis of date
df=df.sort_values('date')
print(df)
Output:
date people
2 2015-06-05 0
0 2015-06-08 1
1 2018-07-10 2
Convert back to the original month/day/year string format (the month and day come back zero-padded):
df['date']=df['date'].dt.strftime('%m/%d/%Y')
print(df)
Output:
date people
2 06/05/2015 0
0 06/08/2015 1
1 07/10/2018 2
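If the original strings must stay exactly as they are (for example without zero-padding), the conversion can be limited to the sort itself. A minimal sketch, assuming pandas 1.1 or later, where sort_values accepts a key argument:

```python
import pandas as pd

df = pd.DataFrame({'date': ['6/8/2015', '7/10/2018', '6/5/2015'],
                   'people': [1, 2, 0]})

# key= receives the column as a Series and returns the values to sort by;
# the stored strings themselves are left unchanged
df_sorted = df.sort_values(
    'date', key=lambda s: pd.to_datetime(s, format='%m/%d/%Y')
)
print(df_sorted)
```

The date column keeps its object (string) dtype here, so this suits display or export more than further date arithmetic.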
Try changing the 'date' column to pandas Datetime and then sort
import pandas as pd
df= pd.DataFrame({'people':[1,1,1,2],'date':
['4/12/1961','5/5/1961','7/21/1961','8/6/1961']})
df['date'] =pd.to_datetime(df.date)
df = df.sort_values(by='date')  # assign the result; sort_values returns a new DataFrame
Output:
date people
1961-04-12 1
1961-05-05 1
1961-07-21 1
1961-08-06 2
To turn the dates back into strings (here with a two-digit year; use %Y instead of %y for a four-digit year):
df['date']=df['date'].dt.strftime('%m/%d/%y')
Output:
date people
04/12/61 1
05/05/61 1
07/21/61 1
08/06/61 2
Why not simply:
dataset[SortBy["date"]]
Can you provide what you tried, or show how your data is structured?
In case you need to sort in reverse order, do:
dataset[SortBy["date"]][Reverse]
I'm running Python 3.5 on Windows and writing code to study financial econometrics.
I have a multi-index pandas DataFrame where the level=0 index is a series of month-end dates and the level=1 index is a simple integer ID. I want to create a new column of values ('new_var') where, for each month-end date, I look forward one month and take the values from another column ('some_var'); the IDs from the current month need to align with the IDs for the forward month. Here is a simple test case.
import pandas as pd
import numpy as np
# Create some time series data
id = np.arange(0,5)
# pd.Timestamp replaces pd.datetime, which newer pandas no longer provides
date = [pd.Timestamp(2017,1,31)+pd.offsets.MonthEnd(i) for i in [0,1]]
my_data = []
for d in date:
    for i in id:
        my_data.append((d, i, np.random.random()))
df = pd.DataFrame(my_data, columns=['date', 'id', 'some_var'])
df['new_var'] = np.nan
df.set_index(['date', 'id'], inplace=True)
# Drop an observation to reflect my true data
df.drop(('2017-02-28',3), level=None, inplace=True)
df
# The desired output....
list1 = df.loc['2017-01-31'].index.labels[1].tolist()
list2 = df.loc['2017-02-28'].index.labels[1].tolist()
common = list(set(list1) & set(list2))
for i in common:
    df.loc[('2017-01-31', i)]['new_var'] = df.loc[('2017-02-28', i)]['some_var']
df
I feel like there is a better way to get my desired output. Maybe I should just embrace the "for" loop? Maybe a better solution is to reset the index?
Thank you,
F
I would create an integer column representing the date, subtract one from it (to shift it by one month), and then left-merge the values back onto the original DataFrame.
Out[28]:
some_var
date id
2017-01-31 0 0.736003
1 0.248275
2 0.844170
3 0.671364
4 0.034331
2017-02-28 0 0.051586
1 0.894579
2 0.136740
4 0.902409
df = df.reset_index()
df['n_group'] = df.groupby('date').ngroup()
df_shifted = df[['n_group', 'some_var','id']].rename(columns={'some_var':'new_var'})
df_shifted['n_group'] = df_shifted['n_group']-1
df = df.merge(df_shifted, on=['n_group','id'], how='left')
df = df.set_index(['date','id']).drop('n_group', axis=1)
Out[31]:
some_var new_var
date id
2017-01-31 0 0.736003 0.051586
1 0.248275 0.894579
2 0.844170 0.136740
3 0.671364 NaN
4 0.034331 0.902409
2017-02-28 0 0.051586 NaN
1 0.894579 NaN
2 0.136740 NaN
4 0.902409 NaN
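When every id appears in consecutive months with no gaps in between, the same alignment can also be sketched without a merge by shifting within each id. This is an alternative under stated assumptions, not the approach above: the rows must be sorted by date, and the values below are made up so the result is reproducible.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2017-01-31', '2017-02-28'])
# Build the index without (2017-02-28, 3) to mimic the dropped observation
pairs = [(d, i) for d in dates for i in range(5)
         if not (d == dates[1] and i == 3)]
index = pd.MultiIndex.from_tuples(pairs, names=['date', 'id'])
df = pd.DataFrame({'some_var': np.arange(len(pairs), dtype=float)}, index=index)

# Within each id, pull the value from the next row (the following month)
df['new_var'] = df.groupby(level='id')['some_var'].shift(-1)
print(df)
```

If an id skips a month in the middle of the series, this shift would pull the value from a later month, so the merge on a month counter is the safer choice for irregular data.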
I have dates in a DataFrame's column like:
1 06AUG2010
2 07APR2011
I want to convert them to a type where I can count differences between dates in days.
I'm searching the internet for the answer, but can't find it. I'm new to pandas.
You can use to_datetime with custom format:
df = pd.DataFrame({'date':['06AUG2010','07APR2011']}, index=[1,2])
print (df)
date
1 06AUG2010
2 07APR2011
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')
print (df)
date
1 2010-08-06
2 2011-04-07
Then, to get the differences between consecutive rows, use diff:
df['date'] = df['date'].diff()
print (df)
date
1 NaT
2 244 days
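The result of diff is a timedelta column. If plain numbers of days are needed, the .dt.days accessor (or the .days attribute of a single Timedelta) extracts them. A minimal sketch; the column name days_since_previous and the variable gap are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'date': ['06AUG2010', '07APR2011']}, index=[1, 2])
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')

# diff() yields Timedelta values; .dt.days extracts whole days (NaN for the first row)
df['days_since_previous'] = df['date'].diff().dt.days

# Difference between two specific rows as a plain integer
gap = (df.loc[2, 'date'] - df.loc[1, 'date']).days
print(gap)
```

Keeping the original date column and writing the difference to a new column leaves the dates available for later calculations.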