Slicing a DataFrame - python

Consider the DataFrame data:
one two three four
Ohio 2013-01-01 1 2 3
Colorado 2014-01-05 5 6 7
Utah 2015-05-06 9 10 11
New York 2016-10-11 13 14 15
I'd like to extract the row using only the criterion that the year is a given year, e.g., something like data['one'][:][0:4] == '2013'. But the command data['one'][:][0:4] returns
Ohio 2013-01-01
Colorado 2014-01-05
Utah 2015-05-06
New York 2016-10-11
Name: one, dtype: object
I thought this was the right thing to do, because the command data['one'][0][0:4] returns
'2013'
Why the difference, and what's the correct way to do this?

Since column 'one' consists of dates, it'd be best to have pandas recognize it as such, instead of recognizing it as strings. You can use pd.to_datetime to do this:
df['one'] = pd.to_datetime(df['one'])
This allows you to filter on date properties without needing to worry about slicing strings. For example, you can check for year using Series.dt.year:
df['one'].dt.year == 2013
Combining this with loc allows you to get all rows where the year is 2013:
df.loc[df['one'].dt.year == 2013, :]
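Here's a minimal runnable sketch of that approach, rebuilding the sample frame from the table in the question:
import pandas as pd

df = pd.DataFrame(
    {'one': ['2013-01-01', '2014-01-05', '2015-05-06', '2016-10-11'],
     'two': [1, 5, 9, 13], 'three': [2, 6, 10, 14], 'four': [3, 7, 11, 15]},
    index=['Ohio', 'Colorado', 'Utah', 'New York'])

df['one'] = pd.to_datetime(df['one'])        # parse the strings as datetimes
print(df.loc[df['one'].dt.year == 2013, :])  # only the Ohio row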

The condition you are looking for is
df['one'].str[0:4] == "2013"
Basically, you need to tell pandas to treat the column as strings, then operate on those strings via the .str accessor.
The way you have it written, df['one'][:] says "give me the column called 'one', then give me all of its rows ([:])."
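Wired into a row filter, that condition looks like this (a sketch, assuming the column still holds strings and each value starts with a four-digit year):
mask = df['one'].str[0:4] == "2013"
df.loc[mask]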

query works well on datetime columns too. Note that comparing the column directly to the bare integer 2013 compares timestamps with an integer and matches nothing; compare the extracted year instead:
In [13]: df.query('one.dt.year == 2013', engine='python')
Out[13]:
one two three four
Ohio 2013-01-01 1 2 3
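Alternatively, a date-string range comparison also works with query, since string literals compared against a datetime column are parsed as timestamps (a sketch):
df.query('one >= "2013-01-01" and one < "2014-01-01"')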

Related

Convert GroupBy object to Dataframe (pandas)

I am working with a large dataset which I've stored in a pandas dataframe. All of the methods I've written to operate on this dataset work on DataFrames, but some of them don't work on GroupBy objects.
I've come to a point in my code where I would like to group all data by author name (which I was able to achieve easily via .groupby()). Unfortunately, this outputs a GroupBy object, which isn't very useful to me when I want to use DataFrame-only methods.
I've searched tons of other posts but haven't found any satisfying answer... how do I convert this GroupBy object back into a DataFrame? (Note: it is much too large for me to manually select groups and concatenate them into a dataframe; I need something automated.)
Not exactly sure I understand, so if this isn't what you are looking for, please comment.
Creating a dataframe:
df = pd.DataFrame({'author':['gatsby', 'king', 'michener', 'michener','king','king', 'tolkein', 'gatsby'], 'b':range(13,21)})
author b
0 gatsby 13
1 king 14
2 michener 15
3 michener 16
4 king 17
5 king 18
6 tolkein 19
7 gatsby 20
#create the groupby object
dfg = df.groupby('author')
In [44]: dfg
Out[44]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002169D24DB20>
#show groupby works using count()
dfg.count()
b
author
gatsby 2
king 3
michener 2
tolkein 1
But I think this is what you want: how to revert dfg back to a DataFrame. You just need to apply some function to it that doesn't change the data. This is one way.
df_reverted = dfg.apply(lambda x: x)
author b
0 gatsby 13
1 king 14
2 michener 15
3 michener 16
4 king 17
5 king 18
6 tolkein 19
7 gatsby 20
This is another way and may be faster; note the dataframe names df and dfg.
df[dfg['b'].transform('count') > 0]
It tests the groupby counts and keeps all groups with a count greater than zero (so everything), returning a boolean Series that is applied as a mask against the original DataFrame, df.
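As a quick sanity check on the mask approach, a self-contained sketch (any reasonably recent pandas):
import pandas as pd

df = pd.DataFrame({'author': ['gatsby', 'king', 'michener', 'michener',
                              'king', 'king', 'tolkein', 'gatsby'],
                   'b': range(13, 21)})
dfg = df.groupby('author')

# Every group has count > 0, so the transform produces an all-True mask
# and the original DataFrame comes back row-for-row.
mask = dfg['b'].transform('count') > 0
print(df[mask].equals(df))  # True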

Add a new column to a dataframe which is the result of a groupby count

I'm trying to get the total number of books each author wrote and put it in a column called Book Number in my dataframe, which has 15 other columns.
I checked online, and people use groupby with count(); however, it doesn't create the column I want: it only gives a column of numbers without a name, and I can't put it together with the original dataframe.
author_count_df = (df_author["Name"]).groupby(df_author["Name"]).count()
print(author_count_df)
Result:
Name
A D 3
A Gill 4
A GOO 3
ALL SHOT 10
AMIT PATEL 5
..
vishal raina 7
walt walter 6
waqas alhafidh 3
yogesh koshal 8
zainab m.jawad 9
Name: Name, Length: 696, dtype: int64
Expected: A dataframe with
Name other 14 columns from author_df Book Number
A D ... 3
A Gill ... 4
A GOO ... 3
ALL SHOT ... 10
AMIT PATEL ... 5
... ..
vishal raina ... 7
walt walter ... 6
waqas alhafidh ... 3
yogesh koshal ... 8
zainab m.jawad ... 9
Use transform with the groupby and assign it back:
df_author['Book Number']=df_author.groupby("Name")['Name'].transform('count')
For a new df, use:
author_count_df = df_author.assign(BookNum=df_author.groupby("Name")['Name']
.transform('count'))
Use reset_index(). Note that calling reset_index() on your original (df_author["Name"]).groupby(df_author["Name"]).count() raises an error, because the Series name and the index name are both "Name"; counting with size and naming the new column avoids that:
author_count_df = df_author.groupby("Name").size().reset_index(name='Book Number')
This moves the Name group labels out of the index and back into a regular column, giving you a DataFrame with a named count column instead of a Series.
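Even then, the counts live in a separate frame; to attach them to the other 14 columns you'd merge back (a sketch, assuming the same df_author frame):
counts = df_author.groupby('Name').size().reset_index(name='Book Number')
author_count_df = df_author.merge(counts, on='Name')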
You have done most of the job; you just need to assign the values you computed back into a new column, which you can achieve elegantly with the DataFrame.assign method.
Straight from the Docs:
Assign new columns to a DataFrame.
Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.

Transform a Column conditional on another Column

I have a dataframe that looks like:
Age Age Type
12 Years
5 Days
13 Hours
20 Months
... ......
I want to have my Age column in years, so depending on Age Type, if it is in Days, Hours, or Months, I will have to perform a scalar operation. I tried to implement a for loop, but I'm not sure I'm going about it the right way. Thanks!
Create a mapping dict of conversion factors, then multiply. Note the hours factor is 1/365/24, and the column name contains a space, so use bracket indexing:
d = {'Years': 1, 'Days': 1/365, 'Hours': 1/365/24, 'Months': 1/12}
df['Age'] * df['Age Type'].map(d)
Out[373]:
0 12.000000
1 0.013699
2 0.001484
3 1.666667
dtype: float64
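If the goal is to overwrite the original columns rather than just compute the Series, a small follow-up sketch (same assumed column names):
df['Age'] = df['Age'] * df['Age Type'].map(d)
df['Age Type'] = 'Years'   # every age is now expressed in years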

Python Pandas: Change value associated with each first day entry in every month

I'd like to change the value associated with the first day in every month for a pandas.Series I have. For example, given something like this:
Date
1984-01-03 0.992701
1984-01-04 1.003614
1984-01-17 0.994647
1984-01-18 1.007440
1984-01-27 1.006097
1984-01-30 0.991546
1984-01-31 1.002928
1984-02-01 1.009894
1984-02-02 0.996608
1984-02-03 0.996595
...
I'd like to change the values associated with 1984-01-03, 1984-02-01 and so on. I've racked my brain for hours on this one and have looked around Stack Overflow a fair bit. Some solutions have come close. For example, using:
[In]: series.groupby((m_ret.index.year, m_ret.index.month)).first()
[Out]:
Date Date
1984 1 0.992701
2 1.009894
3 1.005963
4 0.997899
5 1.000342
6 0.995429
7 0.994620
8 1.019377
9 0.993209
10 1.000992
11 1.009786
12 0.999069
1985 1 0.981220
2 1.011928
3 0.993042
4 1.015153
...
This is almost there, but I'm struggling to proceed further.
What I'd like to do is set the values associated with the first day present in each month of every year to 1.
series[m_ret.index.is_month_start] = 1 comes close, but the problem is that is_month_start only selects rows where the day value is 1. However, in my series this isn't always the case, as you can see; for example, the first day present in January is 1984-01-03.
series.groupby(pd.TimeGrouper('BM')).nth(0) doesn't appear to return the first day either, instead I get the last day:
Date
1984-01-31 0.992701
1984-02-29 1.009894
1984-03-30 1.005963
1984-04-30 0.997899
1984-05-31 1.000342
1984-06-29 0.995429
1984-07-31 0.994620
1984-08-31 1.019377
...
I'm completely stumped. Your help is as always, greatly appreciated! Thank you.
One way would be to use your .groupby((m_ret.index.year, m_ret.index.month)) idea, but use idxmin on the index itself, converted into a Series:
In [74]: s.index.to_series().groupby([s.index.year, s.index.month]).idxmin()
Out[74]:
Date Date
1984 1 1984-01-03
2 1984-02-01
Name: Date, dtype: datetime64[ns]
In [75]: start = s.index.to_series().groupby([s.index.year, s.index.month]).idxmin()
In [76]: s.loc[start] = 999
In [77]: s
Out[77]:
Date
1984-01-03 999.000000
1984-01-04 1.003614
1984-01-17 0.994647
1984-01-18 1.007440
1984-01-27 1.006097
1984-01-30 0.991546
1984-01-31 1.002928
1984-02-01 999.000000
1984-02-02 0.996608
1984-02-03 0.996595
dtype: float64
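Putting it together for the actual goal (setting those first-of-month entries to 1 rather than the 999 used for illustration), a minimal self-contained sketch using a few of the dates above:
import pandas as pd

idx = pd.to_datetime(['1984-01-03', '1984-01-04', '1984-02-01', '1984-02-02'])
s = pd.Series([0.992701, 1.003614, 1.009894, 0.996608], index=idx)

# idxmin on the index-as-Series finds the earliest date in each (year, month).
start = s.index.to_series().groupby([s.index.year, s.index.month]).idxmin()
s.loc[start] = 1.0
print(s)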

How to get the closest single row after a specific datetime index using Python Pandas

DataFrame I have:
A B C
2012-01-01 1 2 3
2012-01-05 4 5 6
2012-01-10 7 8 9
2012-01-15 10 11 12
What I am using now:
date_after = dt.datetime( 2012, 1, 7 )
frame.ix[date_after:].ix[0:1]
Out[1]:
A B C
2012-01-10 7 8 9
Is there any better way of doing this? I do not like that I have to specify .ix[0:1] instead of .ix[0], but if I don't, the output changes to a TimeSeries instead of a single row in a DataFrame. I find a rotated (transposed) TimeSeries harder to work with than a row that lines up with the original DataFrame.
Without .ix[0:1]:
frame.ix[date_after:].ix[0]
Out[1]:
A 7
B 8
C 9
Name: 2012-01-10 00:00:00
Thanks,
John
Couldn't resist answering this, even though the question was asked, and answered, in 2012, by Wes himself, and again in 2015, by ajsp. Yes, besides 'truncate', you can also use get_loc with the option 'backfill' to get the nearest date after the specific date. By the way, if you want the nearest date before the specific date, use 'ffill'. If you just want the nearest date in either direction, use 'nearest'.
df.iloc[df.index.get_loc(datetime.datetime(2016,2,2),method='backfill')]
You might want to go directly to the index:
i = frame.index.searchsorted(date)
frame.ix[frame.index[i]]
A touch verbose, but you could put it in a function. About as good as you'll get (O(log n)).
Couldn't resist answering this, even though the question was asked, and answered, in 2012, by Wes himself. Yes, just use truncate.
df.truncate(before='2012-01-07')
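For readers on current pandas: the .ix indexer used above was removed in pandas 1.0, and Index.get_loc no longer accepts a method argument. A sketch of modern equivalents for the same lookup:
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7, 10], 'B': [2, 5, 8, 11], 'C': [3, 6, 9, 12]},
                  index=pd.to_datetime(['2012-01-01', '2012-01-05',
                                        '2012-01-10', '2012-01-15']))

# searchsorted: position of the first index entry >= the target date.
pos = df.index.searchsorted(pd.Timestamp('2012-01-07'))
print(df.iloc[[pos]])   # one-row DataFrame, like the old .ix[0:1] trick

# get_indexer replaces get_loc(..., method=...); 'backfill', 'ffill'
# and 'nearest' behave as described above.
pos = df.index.get_indexer([pd.Timestamp('2012-01-07')], method='backfill')[0]
print(df.iloc[pos])     # Series for that row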
