Reformatting a dataframe without using for loops - python

I want to convert a dataframe like:
id event_type count
1 "a" 3
1 "b" 5
2 "a" 1
3 "b" 2
into a dataframe like:
id a b a > b
1 3 5 0
2 1 0 1
3 0 2 0
Without using for-loops. What's a proper pythonic (Pandas-tonic?) way of doing this?

Well, not sure if this is exactly what you need or if it has to be more flexible than this. However, this would be one way to do it - assuming missing values can be replaced by 0.
import pandas as pd
from io import StringIO
# Creating and reading the data
data = """
id event_type count
1 "a" 3
1 "b" 5
2 "a" 1
3 "b" 2
"""
df = pd.read_csv(StringIO(data), sep=r'\s+')
# Transforming
df_ = pd.pivot_table(df, index='id', values='count', columns='event_type') \
.fillna(0).astype(int)
df_['a > b'] = (df_['a'] > df_['b']).astype(int)
Where df_ will take the form:
event_type a b a > b
id
1 3 5 0
2 1 0 1
3 0 2 0

This can be split up into two parts:
pivot (see this post)
assign a new column
Solution:
df.set_index(
    ['id', 'event_type']
)['count'].unstack(
    fill_value=0
).assign(**{
    'a > b': lambda d: d.eval('a > b')
})
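Note that eval returns booleans here; for the 0/1 integers shown in the question, one possible follow-up (my own sketch, not part of the answer above, reusing the df built in the first answer) is to cast the comparison explicitly:
out = df.set_index(['id', 'event_type'])['count'].unstack(fill_value=0)
out['a > b'] = (out['a'] > out['b']).astype(int)  # 0/1 instead of False/True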

Related

How to substitute with 0, the first of 2 consecutive values in a pandas column with groupby

I have the following pandas dataframe
import pandas as pd
foo = pd.DataFrame({'id': [1,1,1,1,1,2,2,2,2,2],
'col_a': [0,1,1,0,1,1,1,0,1,1]})
I would like to create a new column (col_a_new) that is the same as col_a, except that the first of two consecutive 1s in col_a is replaced with 0, within each id.
The resulting dataframe looks like this:
foo = pd.DataFrame({'id': [1,1,1,1,1,2,2,2,2,2],
'col_a': [0,1,1,0,1,1,1,0,1,1],
'col_a_new': [0,0,1,0,1,0,1,0,0,1]})
Any ideas?
Other approach: just group by id and define the new values using appropriate conditions.
foo["col_a_new"] = (foo.groupby("id").col_a
    .transform(lambda series: [0 if i < len(series) - 1
                               and series.tolist()[i + 1] == 1
                               else x
                               for i, x in enumerate(series.tolist())]))
import numpy as np

# group by id and non-consecutive clusters of 0/1 in col_a
group = foo.groupby(["id", foo["col_a"].ne(foo["col_a"].shift()).cumsum()])
# get cumcount and count of groups
foo_cumcount = group.cumcount()
foo_count = group.col_a.transform(len)
# set to zero all first ones of groups with two ones, otherwise use original value
foo["col_a_new"] = np.where(foo_cumcount.eq(0)
                            & foo_count.gt(1)
                            & foo.col_a.eq(1),
                            0, foo.col_a)
# result
id col_a col_a_new
0 1 0 0
1 1 1 0
2 1 1 1
3 1 0 0
4 1 1 1
5 2 1 0
6 2 1 1
7 2 0 0
8 2 1 0
9 2 1 1
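For completeness, a shorter sketch of my own (not one of the answers above), assuming the rule is "zero out a 1 whenever the next row within the same id is also 1". It reproduces the example output, though for runs of three or more consecutive 1s it zeroes every 1 that is followed by a 1, not only the first of each run:
import numpy as np
import pandas as pd

foo = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                    'col_a': [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]})
# True where the following row within the same id holds a 1
next_is_one = foo.groupby('id')['col_a'].shift(-1).eq(1)
foo['col_a_new'] = np.where(foo['col_a'].eq(1) & next_is_one, 0, foo['col_a'])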

Creating a pandas column conditional to another columns values

I'm trying to create a class column in a pandas dataframe conditional on another column's values. The new value at row i will be 1 if the other column's value at row i+1 is greater than its value at row i, and 0 otherwise.
For example:
column1 column2
5 1
6 0
3 0
2 1
4 0
How do I create column2 by iterating through column1?
You can use the diff method on the first column with a period of -1, then check if it is less than zero to create the second column.
import pandas as pd
df = pd.DataFrame({'c1': [5,6,3,2,4]})
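# c1.diff(-1) computes c1[i] - c1[i+1]; it is negative exactly when the next value is larger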
df['c2'] = (df.c1.diff(-1) < 0).astype(int)
df
# returns:
c1 c2
0 5 1
1 6 0
2 3 0
3 2 1
4 4 0
You can also use shift. Performance is almost the same as diff, but diff seems to be a little faster.
df = pd.DataFrame({'column1': [5,6,3,2,4]})
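# shift(-1) aligns each row with the next row's value, so the comparison checks whether the next value is larger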
df['column2'] = (df['column1'] < df['column1'].shift(-1)).astype(int)
print(df)
column1 column2
0 5 1
1 6 0
2 3 0
3 2 1
4 4 0

Conditional statement and split in a Dataframe

I am looking for a conditional statement in Python to look for certain information in a specified column and put the results in a new column.
Here is an example of my dataset:
OBJECTID CODE_LITH
1 M4,BO
2 M4,BO
3 M4,BO
4 M1,HP-M7,HP-M1
and what I want as results:
OBJECTID CODE_LITH M4 M1
1 M4,BO 1 0
2 M4,BO 1 0
3 M4,BO 1 0
4 M1,HP-M7,HP-M1 0 1
What I have done so far:
import pandas as pd
import numpy as np
lookup = ['M4']
df.loc[df['CODE_LITH'].str.isin(lookup),'M4'] = 1
df.loc[~df['CODE_LITH'].str.isin(lookup),'M4'] = 0
Since there are multiple variables per row in "CODE_LITH", it seems the script is not able to find only "M4"; it can only match the full string "M4,BO" and put 1 or 0 in the new column.
I have also tried:
if ('M4') in df['CODE_LITH']:
    df['M4'] = 0
else:
    df['M4'] = 1
With the same results.
Thanks for your help.
PS. The dataframe contains about 2.6 million rows and I need to do this operation for 30-50 variables.
I think this is the Pythonic way to do it:
for mn in ['M1', 'M4']:  # Add other "M#" as needed
    df[mn] = df['CODE_LITH'].map(lambda x: mn in x)
Use str.contains accessor:
>>> for key in ('M4', 'M1'):
...     df.loc[:, key] = df['CODE_LITH'].str.contains(key).astype(int)
>>> df
OBJECTID CODE_LITH M4 M1
0 1 M4,BO 1 0
1 2 M4,BO 1 0
2 3 M4,BO 1 0
3 4 M1,HP-M7,HP-M1 0 1
I was able to do:
for index, data in enumerate(df['CODE_LITH']):
    if "I1" in data:
        df['Plut_Felsic'][index] = 1
    else:
        df['Plut_Felsic'][index] = 0
It does work, but takes quite some time to calculate.
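A vectorized alternative (my own sketch, not from the answers above): Series.str.get_dummies can build one indicator column per comma-separated code in a single pass, which avoids the Python loop and also avoids accidental substring matches such as "M1" inside "HP-M1". The column names simply follow the example data:
import pandas as pd

df = pd.DataFrame({'OBJECTID': [1, 2, 3, 4],
                   'CODE_LITH': ['M4,BO', 'M4,BO', 'M4,BO', 'M1,HP-M7,HP-M1']})
# one 0/1 column per comma-separated token; keep only the codes of interest
dummies = df['CODE_LITH'].str.get_dummies(sep=',')
df = df.join(dummies[['M4', 'M1']])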

Transform pandas timeseries into timeseries with non-date index

I'm trying to generate a timeseries from a dataframe, but the solutions I've found here don't really address my specific problem. I have a dataframe which is a series of id's which iterate from 1 to n, then repeat, like this:
key ID Var_1
0 1 1
0 2 1
0 3 2
1 1 3
1 2 2
1 3 1
I want to reshape it into a timeseries in which the index is ID and each value of key becomes a column:
ID Var_1_0 Var_2_0
1 1 3
2 1 2
3 2 1
I have tried the stack() method but it doesn't generate the result I want. Generating an index from ID seems to be the right approach, but ID is not a proper date so I'm not sure how to proceed. Pointers much appreciated.
Try this:
import pandas as pd
df = pd.DataFrame([[0,1,1], [0,2,1], [0,3,2], [1,1,3], [1,2,2], [1,3,1]], columns=('key', 'ID', 'Var_1'))
Use the pivot function:
df2 = df.pivot(index='ID', columns='key', values='Var_1')
You can rename the columns by:
df2.columns = ('Var_1_0', 'Var_2_0')
Result:
Out:
Var_1_0 Var_2_0
ID
1 1 3
2 1 2
3 2 1

Pandas: conditional rolling count

I have a Series that looks the following:
col
0 B
1 B
2 A
3 A
4 A
5 B
It's a time series, therefore the index is ordered by time.
For each row, I'd like to count how many times the value has appeared consecutively, i.e.:
Output:
col count
0 B 1
1 B 2
2 A 1 # Value does not match previous row => reset counter to 1
3 A 2
4 A 3
5 B 1 # Value does not match previous row => reset counter to 1
I found 2 related questions, but I can't figure out how to "write" that information as a new column in the DataFrame, for each row (as above). Using rolling_apply does not work well.
Related:
Counting consecutive events on pandas dataframe by their index
Finding consecutive segments in a pandas data frame
I think there is a nice way to combine the solutions of @chrisb and @CodeShaman (as was pointed out, CodeShaman's solution counts total and not consecutive values).
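# (df['col'] != df['col'].shift(1)).cumsum() labels each consecutive run of equal values; cumcount then numbers the rows within each run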
df['count'] = df.groupby((df['col'] != df['col'].shift(1)).cumsum()).cumcount()+1
col count
0 B 1
1 B 2
2 A 1
3 A 2
4 A 3
5 B 1
One-liner:
df['count'] = df.groupby('col').cumcount()
or
df['count'] = df.groupby('col').cumcount() + 1
if you want the counts to begin at 1.
Based on the second answer you linked, assuming s is your series.
df = pd.DataFrame(s)
df['block'] = (df['col'] != df['col'].shift(1)).astype(int).cumsum()
df['count'] = df.groupby('block')['col'].transform(lambda x: list(range(1, len(x) + 1)))
In [88]: df
Out[88]:
col block count
0 B 1 1
1 B 1 2
2 A 2 1
3 A 2 2
4 A 2 3
5 B 3 1
I like the answer by @chrisb but wanted to share my own solution, since some people might find it more readable and easier to use with similar problems.
1) Create a function that uses static variables
def rolling_count(val):
    if val == rolling_count.previous:
        rolling_count.count += 1
    else:
        rolling_count.previous = val
        rolling_count.count = 1
    return rolling_count.count
rolling_count.count = 0  # static variable
rolling_count.previous = None  # static variable
2) apply it to your Series after converting to dataframe
df = pd.DataFrame(s)
df['count'] = df['col'].apply(rolling_count) #new column in dataframe
output of df
col count
0 B 1
1 B 2
2 A 1
3 A 2
4 A 3
5 B 1
If you wish to do the same thing but filter on two columns, you can use this.
def count_consecutive_items_n_cols(df, col_name_list, output_col):
    cum_sum_list = [
        (df[col_name] != df[col_name].shift(1)).cumsum().tolist()
        for col_name in col_name_list
    ]
    df[output_col] = df.groupby(
        ["_".join(map(str, x)) for x in zip(*cum_sum_list)]
    ).cumcount() + 1
    return df
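For example, a minimal usage sketch (my own; the input frame below is reconstructed from the output shown underneath):
import pandas as pd

foo = pd.DataFrame({'col_a': [1, 1, 1, 2, 2, 2],
                    'col_b': ['B', 'B', 'A', 'A', 'A', 'B']})
foo = count_consecutive_items_n_cols(foo, ['col_a', 'col_b'], 'count')
print(foo)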
col_a col_b count
0 1 B 1
1 1 B 2
2 1 A 1
3 2 A 1
4 2 A 2
5 2 B 1
