How to take one part of a Series in Pandas?

How do I take only the values on the right, e.g. get them as an array/list like [120, 108, 82...]?
d = daily_counts.loc[daily_counts['workingday'] == "yes", 'casual']
d

You can simply use the tolist(), values, or to_numpy() methods. Here is a toy example:
>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 7, 1, 4]})
>>> df
   a
0  1
1  2
2  3
3  4
4  5
5  7
6  1
7  4
>>> df['a'].value_counts()  # generates output similar to yours
4    2
1    2
7    1
5    1
3    1
2    1
Name: a, dtype: int64
>>> df['a'].value_counts().tolist()    # extracting as a list
[2, 2, 1, 1, 1, 1]
>>> df['a'].value_counts().values      # extracting as a numpy array
array([2, 2, 1, 1, 1, 1])
>>> df['a'].value_counts().to_numpy()  # extracting as a numpy array
array([2, 2, 1, 1, 1, 1])
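Applied to the Series d from your question (daily_counts and the exact numbers are your own data, so this is just the same idea sketched on your variable):
d = daily_counts.loc[daily_counts['workingday'] == "yes", 'casual']
d.tolist()      # plain Python list, e.g. [120, 108, 82, ...]
d.to_numpy()    # NumPy array of the same values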

Related

Dataframe Concatenation with Pandas

I'm quite new to pandas and I'm stuck on this dataframe concatenation.
Let's say I have two dataframes:
df_1 = pd.DataFrame({
    "A": [1, 1, 2, 2, 3, 4, 4],
    "B": [1, 2, 1, 2, 1, 1, 3],
    "C": ['a', 'b', 'c', 'd', 'e', 'f', 'g']
})
and
df_2 = pd.DataFrame({
    "A": [1, 3, 4],
    "D": [1, 'm', 7]
})
I would like to concatenate/merge the two dataframes on matching values of 'A' so that the resulting dataframe is:
df_3 = pd.DataFrame({
    "A": [1, 1, 3, 4, 4],
    "B": [1, 2, 1, 1, 3],
    "C": ['a', 'b', 'e', 'f', 'g'],
    "D": [1, 1, 'm', 7, 7]
})
How can I do that?
Thanks in advance.
Just do an inner merge:
df_1.merge(df_2, how="inner", on="A")
Output:
   A  B  C  D
0  1  1  a  1
1  1  2  b  1
2  3  1  e  m
3  4  1  f  7
4  4  3  g  7
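Note that "inner" is merge's default join type, so df_1.merge(df_2, on="A") gives the same result.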
You can also do a left merge and then dropna:
df_3 = df_1.merge(df_2, on=['A'], how='left').dropna(axis=0)
Output:
   A  B  C  D
0  1  1  a  1
1  1  2  b  1
4  3  1  e  m
5  4  1  f  7
6  4  3  g  7
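The only visible difference is that dropna leaves gaps in the index (0, 1, 4, 5, 6 above); if you want it renumbered like the inner-merge output, chain on a reset:
df_3 = df_1.merge(df_2, on=['A'], how='left').dropna(axis=0).reset_index(drop=True)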

dataframe new column based on groupby operations

import pandas
import numpy

df = pandas.DataFrame({'id_1': [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2],
                       'id_2': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                       'v_1': [2, 1, 1, 3, 2, 1, 2, 4, 1, 1, 2],
                       'v_2': [1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2],
                       'v_3': [3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3]})
In [4]: df
Out[4]:
    id_1  id_2  v_1  v_2  v_3
0      1     1    2    1    3
1      2     1    1    1    3
2      1     1    1    1    3
3      1     1    3    1    3
4      1     1    2    2    4
5      1     2    1    2    4
6      1     2    2    2    4
7      2     2    4    1    3
8      2     2    1    1    3
9      2     2    1    2    3
10     2     2    2    2    3
sub = df[(df['id_1'] == 1) & (df['id_2'] == 1)].copy()
sub['v_4'] = numpy.where(sub['v_1'] == sub['v_2'].shift(), 'A',
                         numpy.where(sub['v_1'] == sub['v_3'].shift(), 'B', 'C'))
In [6]: sub
Out[6]:
   id_1  id_2  v_1  v_2  v_3 v_4
0     1     1    2    1    3   C
2     1     1    1    1    3   A
3     1     1    3    1    3   B
4     1     1    2    2    4   C
I have a dataframe as defined above. I would like to perform an operation that categorizes, for each group of (id_1, id_2), whether v_1 equals the previous row's v_2 or v_3.
I have done the operation on a sub-dataframe, and I would like a one-liner that combines the following groupby with the operation I applied to the sub-dataframe.
gbdf = df.groupby(by=['id_1', 'id_2'])
I have tried something like
gbdf['v_4'] = numpy.where(gbdf['v_1'] == gbdf['v_2'].shift(), 'A',
                          numpy.where(gbdf['v_1'] == gbdf['v_3'].shift(), 'B', 'C'))
and the error was
'DataFrameGroupBy' object does not support item assignment
I also tried
df['v_4'] = numpy.where(gbdf['v_1'] == gbdf['v_2'].shift(), 'A',
                        numpy.where(gbdf['v_1'] == gbdf['v_3'].shift(), 'B', 'C'))
but I believe the result was wrong: it does not align the groupby result with the original ordering.
I am wondering whether there is an elegant way to achieve this.
This gets you a list of dataframes, each matching the content of the dataframe sub, but for every group produced by the .groupby():
import numpy
import pandas

source = pandas.DataFrame(
    {'id_1': [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2],
     'id_2': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
     'v_1': [2, 1, 1, 3, 2, 1, 2, 4, 1, 1, 2],
     'v_2': [1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2],
     'v_3': [3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3]})

def add_v4(df):
    df['v_4'] = numpy.where(df['v_1'] == df['v_2'].shift(), 'A',
                            numpy.where(df['v_1'] == df['v_3'].shift(), 'B', 'C'))
    return df

dfs = [add_v4(pandas.DataFrame(slice)) for _, slice in source.groupby(by=['id_1', 'id_2'])]
print(dfs)
About this line:
dfs = [add_v4(pandas.DataFrame(slice)) for _, slice in source.groupby(by=['id_1', 'id_2'])]
It's a list comprehension that gets all the slices from the groupby and turns them into actual new dataframes before passing them to add_v4, which returns the modified dataframe to be added to the list.
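If you would rather end up with a single dataframe in the original row order instead of a list, one possible sketch along the same lines (add_v4_copy is just an illustrative helper that copies each group before assigning):

def add_v4_copy(group):
    # work on a copy so the original frame is left untouched
    group = group.copy()
    group['v_4'] = numpy.where(group['v_1'] == group['v_2'].shift(), 'A',
                               numpy.where(group['v_1'] == group['v_3'].shift(), 'B', 'C'))
    return group

# group_keys=False keeps the original row labels, so sorting the index restores the input order
result = source.groupby(by=['id_1', 'id_2'], group_keys=False).apply(add_v4_copy).sort_index()
print(result)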

Create new column on grouped data frame

I want to create new column that is calculated by groups using multiple columns from current data frame. Basically something like this in R (tidyverse):
require(tidyverse)
data <- data_frame(
  a = c(1, 2, 1, 2, 3, 1, 2),
  b = c(1, 1, 1, 1, 1, 1, 1),
  c = c(1, 0, 1, 1, 0, 0, 1),
)
data %>%
  group_by(a) %>%
  mutate(d = cumsum(b) * c)
In pandas I think I should use groupby and apply to create the new column and then assign it to the original data frame. This is what I've tried so far:
import numpy as np
import pandas as pd

def create_new_column(data):
    return np.cumsum(data['b']) * data['c']

data = pd.DataFrame({
    'a': [1, 2, 1, 2, 3, 1, 2],
    'b': [1, 1, 1, 1, 1, 1, 1],
    'c': [1, 0, 1, 1, 0, 0, 1],
})

# assign - throws error
data['d'] = data.groupby('a').apply(create_new_column)

# assign without index - incorrect order in output
data['d'] = data.groupby('a').apply(create_new_column).values

# assign to sorted data frame
data_sorted = data.sort_values('a')
data_sorted['d'] = data_sorted.groupby('a').apply(create_new_column).values
What is the preferred way (ideally without sorting the data) to achieve this?
Add the parameter group_keys=False to avoid a MultiIndex, so it is possible to assign back to the new column:
data['d'] = data.groupby('a', group_keys=False).apply(create_new_column)
An alternative is to remove the first index level:
data['d'] = data.groupby('a').apply(create_new_column).reset_index(level=0, drop=True)
print (data)
   a  b  c  d
0  1  1  1  1
1  2  1  0  0
2  1  1  1  2
3  2  1  1  2
4  3  1  0  0
5  1  1  0  0
6  2  1  1  3
Detail:
print (data.groupby('a').apply(create_new_column))
a
1  0    1
   2    2
   5    0
2  1    0
   3    2
   6    3
3  4    0
dtype: int64

print (data.groupby('a', group_keys=False).apply(create_new_column))
0    1
2    2
5    0
1    0
3    2
6    3
4    0
dtype: int64
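As a side note, for this particular calculation apply is not needed at all, since cumsum is available directly on the groupby; a sketch of the same computation:
data['d'] = data.groupby('a')['b'].cumsum() * data['c']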
You can now also implement it in Python with datar, in exactly the way you did it in R:
>>> from datar.all import c, f, tibble, cumsum, group_by, mutate
>>>
>>> data = tibble(
...     a = c(1, 2, 1, 2, 3, 1, 2),
...     b = c(1, 1, 1, 1, 1, 1, 1),
...     c = c(1, 0, 1, 1, 0, 0, 1),
... )
>>>
>>> (data >>
...     group_by(f.a) >>
...     mutate(d=cumsum(f.b) * f.c))
   a  b  c  d
0  1  1  1  1
1  2  1  0  0
2  1  1  1  2
3  2  1  1  2
4  3  1  0  0
5  1  1  0  0
6  2  1  1  3
[Groups: ['a'] (n=3)]
I am the author of the package. Feel free to submit issues if you have any questions.

Pandas: group columns of duplicate rows into column of lists

I have a Pandas dataframe that looks something like this:
>>> df
   m  event
0  3      1
1  1      1
2  1      2
3  1      2
4  2      1
5  2      0
6  3      1
7  2      2
8  3      2
9  3      1
I want to group the values of the event column into lists based on the m column so that I would get this:
>>> df
   m        events
0  3  [1, 1, 2, 1]
1  1     [1, 2, 2]
2  2     [1, 0, 2]
There should be one row per unique value of m, with a corresponding list of all events that belong to m.
I tried this:
>>> list(df.groupby('m').event)
[(3, m_id
0    1
6    1
8    2
9    1
Name: event, dtype: int64), (1, m_id
1    1
2    2
3    2
Name: event, dtype: int64), (2, m_id
4    1
5    0
7    2
Name: event, dtype: int64)]
It sort of does what I want in that it groups the events by m. I could massage this back into the dataframe I wanted with some loops, but I feel that I have started down an ugly and unnecessarily complex path. It would also be slow if there are thousands of unique values for m.
Can I perform the conversion I wanted in an elegant manner using Pandas methods?
Bonus points if the events column can contain (numpy) arrays, so that I can do math directly on the events rows, like df[df.m==1].events + 100, but regular lists are also OK.
In [320]: r = df.groupby('m')['event'].apply(np.array).reset_index(name='event')

In [321]: r
Out[321]:
   m         event
0  1     [1, 2, 2]
1  2     [1, 0, 2]
2  3  [1, 1, 2, 1]

Bonus:
In [322]: r.loc[r.m==1, 'event'] + 1
Out[322]:
0    [2, 3, 3]
Name: event, dtype: object
You could do:
In [1163]: df.groupby('m')['event'].apply(list).reset_index(name='events')
Out[1163]:
   m        events
0  1     [1, 2, 2]
1  2     [1, 0, 2]
2  3  [1, 1, 2, 1]
If you don't want m sorted:
In [1164]: df.groupby('m', sort=False).event.apply(list).reset_index(name='events')
Out[1164]:
   m        events
0  3  [1, 1, 2, 1]
1  1     [1, 2, 2]
2  2     [1, 0, 2]
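You can also pass list (or np.array) straight to agg, which reads a little more directly and should give the same result as the apply versions above:
df.groupby('m')['event'].agg(list).reset_index(name='events')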

Change pivot table from Series to DataFrame

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'B': ['X', 'Y', 'Z'] * 3,
                   'C': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
>>> df
   A  B  C
0  1  X  1
1  1  Y  2
2  1  Z  3
3  2  X  1
4  2  Y  2
5  2  Z  3
6  3  X  1
7  3  Y  2
8  3  Z  3
result = df.pivot_table(index=['B'], values='C', aggfunc=sum)
>>> result
B
X    3
Y    6
Z    9
Name: C, dtype: int64
How can I have the column name for C show up above the sums, and how can I sort the result either ascending or descending? The result is a Series, not a DataFrame, and seems non-sortable.
Python: 2.7.11 and Pandas: 0.17.1
You were very close. Note that the brackets around the values coerce the result into a dataframe instead of a series (i.e. values=['C'] instead of values='C').
result = df.pivot_table(index=['B'], values=['C'], aggfunc=sum)
>>> result
   C
B
X  3
Y  6
Z  9
As result is now a dataframe, you can use sort_values on it:
>>> result.sort_values('C', ascending=False)
   C
B
Z  9
Y  6
X  3
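If you prefer groupby over pivot_table, an equivalent sketch that also produces a sortable dataframe:
result = df.groupby('B')[['C']].sum().sort_values('C', ascending=False)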
