Year range to date time format - python

Currently I have a column of strings in a pandas DataFrame, where each value represents a range of years in "yyyy-yyyy" format; for example, "2004-2005" is a single string value in this column.
I wanted to know if there is any way to convert this from a string to something similar to a datetime format.
The purpose is to calculate the difference, in years, between the values of this column and those of another similar column. For example, something like the following:
col 1      col2       Answer (Total years)
2004-2005  2006-2007  3
Note: One of the ways I thought of doing this was to make a dictionary mapping each year to a unique integer value and then calculate the difference between them.
However, I was wondering if there is a simpler way of doing it.

It looks like you are subtracting the first year in column 1 from the last year in column 2. In that case I'd use str.extract with expand=False, so that it returns a Series, and convert the result to a number:
In [11]: pd.to_numeric(df['col 1'].str.extract(r'(\d{4})', expand=False))
Out[11]:
0 2004
Name: col 1, dtype: int64
In [12]: pd.to_numeric(df['col2'].str.extract(r'-(\d{4})', expand=False)) - pd.to_numeric(df['col 1'].str.extract(r'(\d{4})', expand=False))
Out[12]:
0 3
dtype: int64
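As an alternative to the regular expression, the same difference can be sketched by splitting each string on the hyphen. This is only an illustration: the second row of the sample data and the column name total_years are made up, and it assumes every value has exactly the "yyyy-yyyy" shape.

```python
import pandas as pd

# The first row is the example from the question; the second row is made up.
df = pd.DataFrame({
    "col 1": ["2004-2005", "2010-2011"],
    "col2": ["2006-2007", "2014-2015"],
})

# Split each "yyyy-yyyy" string into two integer columns (0 = start, 1 = end).
col1_years = df["col 1"].str.split("-", expand=True).astype(int)
col2_years = df["col2"].str.split("-", expand=True).astype(int)

# Last year of col2 minus first year of col 1.
df["total_years"] = col2_years[1] - col1_years[0]
print(df)
```

Keeping the start and end years as plain integers avoids datetimes entirely, which seems sufficient when only whole-year differences are needed.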

What do you mean by "something similar to a datetime object"? Datetimes aren't designed to represent date ranges.
If you want to create a pair of datetime objects, you could do something like this:
import datetime
[datetime.datetime.strptime(x, '%Y') for x in '2005-2006'.split('-')]
Alternatively, you could try the pandas date_range function, if that's closer to what you want.
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.date_range.html

If you are trying to find the difference between the lowest year and the highest year, here is one way to do it:
col1 = "2004-2005"
col2 = "2006-2007"
col1 = col1.split("-")  # list of the years in col1: ['2004', '2005']
col2 = col2.split("-")  # list of the years in col2: ['2006', '2007']
biglist = col1 + col2  # concatenate the two lists
biglist.sort()  # sort from lowest to highest year (string sort works for four-digit years)
Answer = int(biglist[-1]) - int(biglist[0])  # difference between the highest and lowest year

Related

Can I change each value's decimal places separately in pandas?

I want each value in df to have a different number of decimal places, like this:
       year   month  day
count  1234   5678   9101
mean   12.12  34.34  2.3456
std    12.12  3.456  7.789
I searched for a way to change the number of decimal places of specific values but couldn't find one. This is what I've got:
       year       month      day
count  1234.0000  5678.0000  9101.0000
mean   12.1200    34.3400    2.3456
std    12.1200    3.4560     7.7890
I know the round() method, but I don't know how to apply it to individual values rather than to a whole row or column.
Is it possible to change values separately?
You can change how floats are displayed:
pd.options.display.float_format = '{:.6f}'.format
# if necessary, convert to floats
df = df.astype(float)
Or convert the values to strings with six decimal places:
df = df.astype(float).applymap('{:.6f}'.format)
The format approach is correct, but I think what you are looking for is this:
Input file data.txt:
       year       month      day
count  1234.0000  5678.0000  9101.0000
mean   12.1200    34.3400    2.3456
std    12.1200    3.4560     7.7890
Formatting (see the format specification mini-language):
import numpy as np
import pandas as pd
file = "/path/to/data.txt"
df = pd.read_csv(file, delim_whitespace=True)
# update all columns with data type number
# use the "n" format
df.update(df.select_dtypes(include=np.number).applymap('{:n}'.format))
print(df)
Output:
       year   month  day
count  1234   5678   9101
mean   12.12  34.34  2.3456
std    12.12  3.456  7.789
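For reference, here is a self-contained sketch of the same idea that builds the frame in memory instead of reading a file. It uses the "g" format type, which (like "n" without locale grouping) drops trailing zeros; the variable name formatted is made up for the example.

```python
import pandas as pd

# Same numbers as in the question, built in memory instead of read from a file.
df = pd.DataFrame(
    {
        "year": [1234.0, 12.12, 12.12],
        "month": [5678.0, 34.34, 3.456],
        "day": [9101.0, 2.3456, 7.789],
    },
    index=["count", "mean", "std"],
)

# Format every value on its own; "g" drops trailing zeros.
# The result holds strings, so it is meant for display only.
formatted = df.apply(lambda col: col.map("{:g}".format))
print(formatted)
```

Note that "g" keeps six significant digits by default and switches to exponent notation for very large or very small numbers, so this probably only suits values in a range like the one shown. The original df keeps its float values.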

Is there a better way to group by a category, and then select values based on different column values in Pandas?

I have an issue where I want to group by a date column, sort by a time column, and grab the resulting values in the values column.
The data looks something like this:
        time      value        date
0  12.850000  19.195359  08-22-2019
1   9.733333  13.519543  09-19-2019
2  14.083333   9.191413  08-26-2019
3  16.616667  18.346598  08-19-2019
...
Where every date can occur multiple times, recording values at different points
during the day.
I wanted to group by date, and extract the minimum and maximum values of those groupings so I did this:
dayMin = df.groupby('date').value.min()
which gives me a Series object that is fairly easy to manipulate. The issue
comes up when I want to group by 'date', sort by 'time', then grab the 'value'.
What I did was:
dayOpen = df.groupby('date').apply(lambda df: df[ df.time == df.time.min() ])['value']
which almost worked, resulting in a DataFrame of:
date
08-19-2019 13344 17.573522
08-20-2019 12798 19.496609
08-21-2019 2009 20.033917
08-22-2019 5231 19.393700
08-23-2019 12848 17.784213
08-26-2019 417 9.717627
08-27-2019 6318 7.630234
I figured out how to clean up those nasty indexes to the left, name the column, and even concat with my dayMin Series to achieve my goal.
Ultimately my question is if there is a nicer way to perform these data manipulations that follow the general pattern of: "Group by column A, perform filtering or sorting operation on column B, grab resulting values from column C" for future applications.
Thank you in advance :)
You can sort the data frame before calling groupby:
first_of_day = df.sort_values('time').groupby('date').head(1)
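The general "group by A, order by B, take values from C" pattern can be sketched as a single named aggregation after sorting. This is a rough illustration rather than the only way to do it: the first four rows come from the question, the last two rows and the output column names (day_open, day_close, day_min, day_max) are made up.

```python
import pandas as pd

# First four rows are from the question; the last two are made up so that
# one date has several readings.
df = pd.DataFrame({
    "time": [12.85, 9.733333, 14.083333, 16.616667, 9.5, 15.0],
    "value": [19.195359, 13.519543, 9.191413, 18.346598, 19.3937, 20.1],
    "date": ["08-22-2019", "09-19-2019", "08-26-2019",
             "08-19-2019", "08-22-2019", "08-22-2019"],
})

# Sorting first means "first"/"last" within each group follow the time order.
by_date = df.sort_values("time").groupby("date")["value"]
summary = by_date.agg(day_open="first", day_close="last",
                      day_min="min", day_max="max")
print(summary)
```

Because groupby keeps the row order inside each group, sorting once up front is enough; the result is a DataFrame indexed by date with one column per statistic, so no index cleanup or concat should be needed.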
This should work for you:
df.sort_values('time').groupby(['date'])['value'].agg([('Min' , 'min'), ('Max', 'max')])
[Images: small example input and the resulting DataFrame]

How can I group a column by date and get the average of another column in Python?

From the dataframe below:
I would like to group the column 'datum' by date (01-01-2019 and so on) and, at the same time, get the average of the column 'PM10_gemiddelde'.
Right now each date, such as 01-01-2019, appears 24 times because the data is hourly; I need those rows combined into one, with the average of 'PM10_gemiddelde'. See the picture for the data.
Besides that, PM10_gemiddelde also contains negative values. How can I easily remove that data in Python?
Thank you!
P.S. I'm new to Python.
What you are trying to do can be achieved with:
data[['datum','PM10_gemiddelde']].loc[data['PM10_gemiddelde'] > 0 ].groupby(['datum']).mean()
You can create a new column with the average of PM10_gemiddelde using groupby along with transform. Try the following:
Assuming your dataframe is called df, start first by removing the negative data:
new_df = df[df['PM10_gemiddelde'] > 0].copy()  # copy so that adding a column later does not raise SettingWithCopyWarning
Then, you can create a new column that contains the average value for every date:
new_df['avg_col'] = new_df.groupby('datum')['PM10_gemiddelde'].transform('mean')
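If 'datum' holds hourly timestamps rather than plain dates, grouping on it directly leaves one group per hour. The sketch below shows one possible way to get a single row per day. The sample values and the "%d-%m-%Y %H:%M" timestamp format are assumptions, since the original data was only shown as a picture, and it keeps zeros by filtering with >= 0.

```python
import pandas as pd

# Made-up hourly sample; the real layout of 'datum' is an assumption.
df = pd.DataFrame({
    "datum": ["01-01-2019 00:00", "01-01-2019 01:00", "01-01-2019 02:00",
              "02-01-2019 00:00", "02-01-2019 01:00"],
    "PM10_gemiddelde": [10.0, -5.0, 20.0, 30.0, 40.0],
})

# Parse the strings into real timestamps (day-month-year is assumed).
df["datum"] = pd.to_datetime(df["datum"], format="%d-%m-%Y %H:%M")

# Drop negative readings, then average per calendar day.
clean = df[df["PM10_gemiddelde"] >= 0]
daily_mean = clean.groupby(clean["datum"].dt.normalize())["PM10_gemiddelde"].mean()
print(daily_mean)
```

Here dt.normalize() sets the time part to midnight, so all readings from the same day share one group key.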

groupby agg using a date offset or similar

Sample of dataset below
I am trying to create a groupby that covers only the number of months that I specify, e.g. the last 12 months, the last 36 months, etc.
My groupby that rolls up my whole dataset for each 'client' is below. rolled_ret is just a custom function that geometrically links whatever performance array it gets; we can pretend it is sum().
df_client_perf = df_perf.groupby(df_perf.CLIENT_NAME)['GROUP_PERFORMANCE'].agg(Client_Return = rolled_ret)
If I put .rolling(12) I can take the most recent entry to get the previous 12 months, but there is obviously a better way to do this.
It is worth saying that the PERIOD column is a monthly period type created with to_period.
Thanks in advance.
PERIOD,CLIENT_NAME,GROUP_PERFORMANCE
2020-03,client1,0.104
2020-04,client1,0.004
2020-05,client1,0.23
2020-06,client1,0.113
2020-03,client2,0.0023
2020-04,client2,0.03
2020-05,client2,0.15
2020-06,client2,0.143
Let's say, for example, that I wanted a groupby to SUM the latest three months of data; my expected output for the above would be:
client1,0.347
client2,0.323
Also, I would like a way to return NaN if the dataset is missing the minimum number of periods, as you can do with the rolling function.
Here is my answer.
I've used a DatetimeIndex because the method last does not work with period. First I sort values based on the PERIOD column, then I set it as Index to keep only the last 3 months (or whatever you provide), then I do the groupby the same way as you.
df['PERIOD'] = pd.to_datetime(df['PERIOD'])
(df.sort_values(by='PERIOD')
   .set_index('PERIOD')
   .last('3M')
   .groupby('CLIENT_NAME')
   .GROUP_PERFORMANCE
   .sum())
# Result
CLIENT_NAME GROUP_PERFORMANCE
client1 0.347
client2 0.323
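The question also asks for NaN when a client has fewer than the required number of periods, and DataFrame.last is deprecated in recent pandas releases. A possible sketch that covers both keeps the PERIOD column as monthly periods and uses the min_count argument of the groupby sum. The helper name sum_last_n_months is made up, and, like the answer above, it measures the window from the latest period in the whole dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "PERIOD": ["2020-03", "2020-04", "2020-05", "2020-06"] * 2,
    "CLIENT_NAME": ["client1"] * 4 + ["client2"] * 4,
    "GROUP_PERFORMANCE": [0.104, 0.004, 0.23, 0.113,
                          0.0023, 0.03, 0.15, 0.143],
})
df["PERIOD"] = pd.to_datetime(df["PERIOD"], format="%Y-%m").dt.to_period("M")


def sum_last_n_months(data, n):
    """Sum GROUP_PERFORMANCE over the latest n monthly periods per client.

    Hypothetical helper: clients with fewer than n values in the window get NaN.
    """
    cutoff = data["PERIOD"].max() - n  # Period minus int shifts by n months
    recent = data[data["PERIOD"] > cutoff]
    totals = recent.groupby("CLIENT_NAME")["GROUP_PERFORMANCE"].sum(min_count=n)
    # Clients with no rows at all in the window would otherwise be missing.
    return totals.reindex(data["CLIENT_NAME"].unique())


last_3 = sum_last_n_months(df, 3)
last_5 = sum_last_n_months(df, 5)  # only four months exist, so every client is NaN
print(last_3)
print(last_5)
```

To use a custom function such as rolled_ret instead of a sum, the min_count check would have to be done by hand, for example by comparing the group sizes against n before aggregating.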

Pandas DataFrame shift columns by date to create lag values

I have a dataframe:
df = pd.DataFrame({'year':[2000,2000,2000,2001,2001,2002,2002,2002],'ID':['a','b','c','a','b','a','b','c'],'values':[1,2,3,4,5,7,8,9]})
I would like to create a column that holds the lag value for each ID-year. For example, ID 'a' has a value of 1 in 2000, so ID 'a' in 2001 should have a pre-value of 1. The key point is that if an ID doesn't have a value in the previous year (so the years are not continuous for that ID), the pre-value should be NaN instead of the value from two years ago. For example, ID 'c' doesn't show up in 2001, so in 2002 ID 'c' should have pre-value = NaN.
Ideally, the final output should look like the following:
I tried the df.groupby(['ID'])['values'].shift(1), but it gives the following:
The problem is that when ID'c' doesn't have a value one year ago, the value two years ago is used. I also tried multiindex shift, which gives me the same result.
df.set_index(['year','ID'], inplace = True)
df.groupby(level=1)['values'].shift(1)
The thing that works is the answer mentioned here. But since my dataframe is fairly large, the merge kills the kernel. So far, I haven't figured out a better way to do it. I hope I explained my problem clearly.
Suppose the year is unique for each ID, i.e., there are no duplicated years for any specific ID, and the rows of each ID are in year order. Then you can shift the values first and replace the shifted values with NaN wherever the difference between the year in the current row and the previous row is not equal to 1:
import pandas as pd
import numpy as np
df['pre_value'] = df.groupby('ID')['values'].shift(1)
df['pre_value'] = df.pre_value.where(df.groupby('ID').year.diff() == 1, np.nan)
df
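Run end to end on the sample frame from the question, the shift-and-mask idea can be sketched as follows. The explicit sort is an added precaution for data that is not already in year order within each ID.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001, 2002, 2002, 2002],
    "ID": ["a", "b", "c", "a", "b", "a", "b", "c"],
    "values": [1, 2, 3, 4, 5, 7, 8, 9],
})

# Sort so that each ID's rows are in year order before shifting.
df = df.sort_values(["ID", "year"])
grouped = df.groupby("ID")

# Keep the shifted value only where the previous row is exactly one year earlier.
df["pre_value"] = grouped["values"].shift(1).where(grouped["year"].diff() == 1)

# Restore the original year/ID ordering for display.
df = df.sort_values(["year", "ID"]).reset_index(drop=True)
print(df)
```

ID 'c' in 2002 ends up with NaN because its previous row is from 2000, two years earlier. No merge is involved, so this should scale better on a large frame.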
A reindex approach:
def reindex_min_max(df):
    mn = df.year.min()
    mx = df.year.max() + 1
    d = df.set_index('year').reindex(pd.RangeIndex(mn, mx, name='year'))
    return pd.concat([d, d['values'].shift().rename('pre_value')], axis=1)
df.groupby('ID')[['year', 'values']].apply(reindex_min_max) \
    .sort_index(level=[1, 0]).dropna(subset=['values']).reset_index()
