count occurrences of combinations - python

I am trying to create a table of counts for every combination of values in two columns. This was the best I could come up with, but I'm not sure how to get it into the correct form. Is there a method or something that would make this easier?
import pandas as pd
from itertools import product

d = {'ballot1': ['a','b','a','a','b','a','a','b'],
     'ballot1_x': ['c','c','d','d','a','a','a','a']}
df1 = pd.DataFrame(d)
# Prints every possible pair of values, but not how often each pair occurs
for i in product(set(df1['ballot1']), set(df1['ballot1_x'])):
    print(str(i[0]) + str(i[1]))
GOAL (as a dataframe):
   a  b
c  1  1
d  2  0
a  2  2

You can use crosstab from pandas (see the pandas documentation for details):
import pandas as pd

d = {'ballot1': ['a','b','a','a','b','a','a','b'],
     'ballot1_x': ['c','c','d','d','a','a','a','a']}
df1 = pd.DataFrame(d)
result_df = pd.crosstab(df1['ballot1_x'], df1['ballot1'])
print(result_df)
Result:
ballot1    a  b
ballot1_x
a          2  2
c          1  1
d          2  0
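
For comparison, here is a minimal sketch of the same counts built with groupby, size and unstack, which is roughly what the crosstab call above computes. The names counts and goal_order are only illustrative, and the final reindex assumes you want the rows in the order shown in the question's goal.

```python
import pandas as pd

df1 = pd.DataFrame({
    'ballot1':   ['a', 'b', 'a', 'a', 'b', 'a', 'a', 'b'],
    'ballot1_x': ['c', 'c', 'd', 'd', 'a', 'a', 'a', 'a'],
})

# Count each (ballot1_x, ballot1) pair, then spread ballot1 into columns.
# Pairs that never occur are filled with 0 instead of NaN.
counts = (
    df1.groupby(['ballot1_x', 'ballot1'])
       .size()
       .unstack(fill_value=0)
)

# Optional: put the rows in the order used in the question's goal table.
goal_order = counts.reindex(['c', 'd', 'a'])
print(goal_order)
```

Either form works; crosstab is shorter, while the groupby version makes it easier to swap size for another aggregation later.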


Split Pandas Dataframe Column According To a Value

I searched and couldn't find a problem like mine. If there is one and I somehow missed it, please let me know so I can delete this post.
I am stuck on splitting a pandas dataframe into several dataframes (df) according to a value.
I have a dataset in a text file, and I store it as a pandas dataframe that has only one column. The dataset contains more than one set of information, and a certain value marks the end of each set. You can see a sample below:
The Sample Input
In [8]: df
Out[8]:
       var1
0         a
1         b
2         c
3         d
4  endValue
5         h
6         f
7         b
8         w
9  endValue
So I want to split this df into separate dataframes. I couldn't find a way to do that, but I'm sure there must be an easy way. The format I show in the sample output may not be the right one, so if you have a better idea I'd love to see it. Thank you for your help.
The sample output I'd like
         var1
{[0         a
  1         b
  2         c
  3         d
  4  endValue]},
{[0         h
  1         f
  2         b
  3         w
  4  endValue]}
You could check where var1 is endValue, take the cumsum, shift it by one row so that each endValue row stays with the block it closes, and use the result as a custom grouper. Then group by it and build a dictionary from the result:
d = dict(tuple(df.groupby(df.var1.eq('endValue').cumsum().shift(fill_value=0))))
Or for a list of dataframes (effectively indexed in the same way):
l = [v for _, v in df.groupby(df.var1.eq('endValue').cumsum().shift(fill_value=0))]
print(l[0])
       var1
0         a
1         b
2         c
3         d
4  endValue
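
To make the grouper easier to follow, here is a small sketch that builds it step by step on the sample data from the question. The intermediate names is_end, block_id and parts are only illustrative, and the reset_index call is an assumption about the wanted output (each piece renumbered from 0, as in the question's sample).

```python
import pandas as pd

df = pd.DataFrame({'var1': ['a', 'b', 'c', 'd', 'endValue',
                            'h', 'f', 'b', 'w', 'endValue']})

# True on the rows that close a block
is_end = df['var1'].eq('endValue')

# cumsum increases on each endValue row; shifting by one row keeps
# that row in the block it terminates instead of starting the next one
block_id = is_end.cumsum().shift(fill_value=0)

# One DataFrame per block, each renumbered from 0
parts = [part.reset_index(drop=True) for _, part in df.groupby(block_id)]

for part in parts:
    print(part, end='\n\n')
```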
Another idea, which relies on unique index values: replace the index values of the non-matching rows with NaN and backfill them, then loop over the groupby object to get a list of DataFrames:
g = df.index.to_series().where(df['var1'].eq('endValue')).bfill()
dfs = [a for i, a in df.groupby(g, sort=False)]
print(dfs)
[       var1
0         a
1         b
2         c
3         d
4  endValue,        var1
5         h
6         f
7         b
8         w
9  endValue]

Add column to DataFrame in a loop

Let's say I have a very simple pandas dataframe, containing a single indexed column with "initial values". I want to read in a loop N other dataframes to fill a single "comparison" column, with matching indices.
For instance, with my initial dataframe as
  Initial
0       a
1       b
2       c
3       d
and the following two dataframes to read in a loop
  Comparison
0          e
1          f
  Comparison
2          g
3          h
4          i    <= note that this index doesn't exist in Initial, so it won't be matched
I would like to produce the following result
  Initial Comparison
0       a          e
1       b          f
2       c          g
3       d          h
Using merge, concat or join, I only ever seem to be able to create a new column for each iteration of the loop, filling the blanks with NaN.
What's the most pandas-pythonic way of achieving this?
Below is an example based on the proposed duplicate's solution:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([['a'],['b'],['c'],['d']]), columns=['Initial'])
print(df1)
df2 = pd.DataFrame(np.array([['e'],['f']]), columns=['Compare'])
print(df2)
df3 = pd.DataFrame(np.array([[2,'g'],[3,'h'],[4,'i']]), columns=['','Compare'])
df3 = df3.set_index('')
print(df3)
print(df1.merge(df2, left_index=True, right_index=True).merge(df3, left_index=True, right_index=True))
>>
  Initial
0       a
1       b
2       c
3       d
  Compare
0       e
1       f
  Compare
2       g
3       h
4       i
Empty DataFrame
Columns: [Initial, Compare_x, Compare_y]
Index: []
Second edit: @W-B, the following seems to work, but surely there is a simpler option using proper pandas methods. It also requires turning off warnings, which might be dangerous...
pd.options.mode.chained_assignment = None
df1["Compare"] = pd.Series()
for ind in df1.index.values:
    if ind in df2.index.values:
        df1["Compare"][ind] = df2.T[ind]["Compare"]
    if ind in df3.index.values:
        df1["Compare"][ind] = df3.T[ind]["Compare"]
print(df1)
>>
  Initial Compare
0       a       e
1       b       f
2       c       g
3       d       h
OK, since the OP needs more info:
Data input
import functools
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([['a'],['b'],['c'],['d']]), columns=['Initial'])
df1['Compare'] = np.nan
df2 = pd.DataFrame(np.array([['e'],['f']]), columns=['Compare'])
df3 = pd.DataFrame(np.array(['g','h','i']), columns=['Compare'], index=[2,3,4])
Solution
# fillna aligns on index and column labels, so each frame only fills the gaps it covers
newdf = functools.reduce(lambda x, y: x.fillna(y), [df1, df2, df3])
newdf
Out[639]:
  Initial Compare
0       a       e
1       b       f
2       c       g
3       d       h
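
An alternative sketch, assuming the frames read in the loop never share an index label: collect them in a list, concatenate them once, and left-join the result onto the initial frame. The names initial, pieces, combined and result are illustrative, and the two hard-coded frames stand in for whatever the loop would actually read.

```python
import pandas as pd

initial = pd.DataFrame({'Initial': ['a', 'b', 'c', 'd']})

# Stand-ins for the N dataframes read in the loop (assumed non-overlapping indices)
pieces = [
    pd.DataFrame({'Comparison': ['e', 'f']}, index=[0, 1]),
    pd.DataFrame({'Comparison': ['g', 'h', 'i']}, index=[2, 3, 4]),
]

# Stack all pieces into one long column, then align it on the index.
# join defaults to a left join, so index 4 (absent from initial) is dropped.
combined = pd.concat(pieces)
result = initial.join(combined)
print(result)
```

If two pieces could carry the same index label, the join would duplicate rows, so in that case the fillna approach above is the safer choice.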

How to transform the result of a Pandas `GROUPBY` function to the original dataframe

Suppose I have a Pandas DataFrame with 6 columns and a custom function that takes counts of the elements in 2 or 3 columns and produces a boolean output. When a groupby object is created from the original dataframe and the custom function is applied df.groupby('col1').apply(myfunc), the result is a series whose length is equal to the number of categories of col1. How do I expand this output to match the length of the original dataframe? I tried transform, but was not able to use the custom function myfunc with it.
EDIT:
Here is some example code:
import pandas as pd

A = pd.DataFrame({'X':['a','b','c','a','c'], 'Y':['at','bt','ct','at','ct'], 'Z':['q','q','r','r','s']})
print(A)

def myfunc(df):
    return (df['Z'].nunique() >= 2) and (df['Y'].nunique() < 2)

A.groupby('X').apply(myfunc)
I would like to expand this output into a new column Result so that, for example, every row with 'a' in column X gets the Result True.
You can map the groupby result back onto the original dataframe:
A['Result'] = A['X'].map(A.groupby('X').apply(myfunc))
The result would look like:
   X   Y  Z  Result
0  a  at  q    True
1  b  bt  q   False
2  c  ct  r    True
3  a  at  r    True
4  c  ct  s    True
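
For this particular myfunc, transform can also be made to work if the condition is split into per-column pieces, since transform operates on one column at a time. This is only a sketch under that assumption; a custom function that truly needs several columns at once would still need the map approach above.

```python
import pandas as pd

A = pd.DataFrame({'X': ['a', 'b', 'c', 'a', 'c'],
                  'Y': ['at', 'bt', 'ct', 'at', 'ct'],
                  'Z': ['q', 'q', 'r', 'r', 's']})

g = A.groupby('X')

# transform('nunique') returns one value per original row, so the two
# conditions can be combined element-wise with & instead of `and`
A['Result'] = (g['Z'].transform('nunique') >= 2) & (g['Y'].transform('nunique') < 2)
print(A)
```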
My solution uses a loop, so it may not be the best one, but I think it works well.
The core idea is that you can traverse all the sub-dataframes (gdf) with for i, gdf in gp, add the result column (c in my example) to each sub-dataframe, and finally concat all the sub-dataframes into one.
Here is an example:
import pandas as pd
df = pd.DataFrame({'a':[1,2,1,2],'b':['a','b','c','d']})
gp = df.groupby('a')  # group
s = gp['a'].sum()     # apply a func: per-group sum of column a
adf = []
# then create a new dataframe
for i, gdf in gp:
    tdf = gdf.copy()
    tdf.loc[:, 'c'] = s.loc[i]
    adf.append(tdf)
pd.concat(adf)
from:
   a  b
0  1  a
1  2  b
2  1  c
3  2  d
to:
   a  b  c
0  1  a  2
2  1  c  2
1  2  b  4
3  2  d  4
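
When the per-group function is a simple reduction on one column, as in the example above, the loop can be avoided. A minimal sketch using transform, which broadcasts the group result back to every row and keeps the original row order:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': ['a', 'b', 'c', 'd']})

# Per-group sum of column a, repeated for every row of its group
df['c'] = df.groupby('a')['a'].transform('sum')
print(df)
```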

reshape data frame in pandas with pivot table

I have a dataframe that looks like this:
baz    qux
one    A
one    B
two    C
three  A
one    B
one    C
I'm trying to reshape it to look like this:
one  two  three
A    C    A
B
B
C
I'm pretty confused about whether this is possible, and if so, how you would do it. I've tried using the pivot_table method as pd.pivot_table(cols='baz', rows='qux') but that threw a TypeError. I think I'm being an idiot and missing something really basic here. Any ideas?
I'm not sure it's the optimal way of doing it, but it does the job:
import io
import pandas as pd

data = u'baz,qux\none,A\none,B\ntwo,C\nthree,A\none,B\none,C'
df = pd.read_csv(io.StringIO(data))
# One column per baz value, holding that group's qux values renumbered from 0
cols = [group['qux'].reset_index(drop=True).rename(key)
        for key, group in df.groupby('baz', sort=False)]
new = pd.concat(cols, axis=1)
print(new.fillna(''))
Which gives back:
  one two three
0   A   C     A
1   B
2   B
3   C
With a pivot table you can get a matrix showing which baz corresponds to which qux:
>>> df['foo'] = 1 # Add aggregation column
>>> df.pivot_table(values='foo', index='qux', columns='baz')
   one  three  two
A    1      1  NaN
B    1    NaN  NaN
C    1    NaN    1
This is not quite what you asked for, but perhaps it suffices:
import numpy as np
import pandas as pd
df = pd.DataFrame({'baz':'one one two three one one'.split(),
'qux': list('ABCABC')})
grouped = df.groupby(['baz', 'qux'])
df2 = grouped.apply(pd.DataFrame.reset_index, drop=True)['qux'].unstack(level=0)
df2.reset_index(drop=True, inplace=True)
df2 = df2.reindex(columns='one two three'.split())
df2 = df2.replace(np.nan, '')
print(df2)
yields
  one two three
0   A         A
1   B
2   B
3   C   C
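
Another way to get the layout asked for in the question is to number the rows within each baz group and pivot on that number. This is a sketch, not taken from the answers above; the helper column name row and the explicit column order are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'baz': 'one one two three one one'.split(),
                   'qux': list('ABCABC')})

wide = (
    df.assign(row=df.groupby('baz').cumcount())    # 0, 1, 2, ... within each baz
      .pivot(index='row', columns='baz', values='qux')
      .reindex(columns=['one', 'two', 'three'])    # order from the question
      .fillna('')                                  # blanks instead of NaN
)
print(wide)
```

Because cumcount gives each (baz, row) pair a unique position, pivot needs no aggregation function here.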

Apply function to pandas DataFrame that can return multiple rows

I am trying to transform DataFrame, such that some of the rows will be replicated a given number of times. For example:
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count':[1,0,2]})
  class  count
0     A      1
1     B      0
2     C      2
should be transformed to:
  class
0     A
1     C
2     C
This is the reverse of aggregation with a count function. Is there an easy way to achieve it in pandas (without using for loops or list comprehensions)?
One possibility might be to allow the DataFrame.applymap function to return multiple rows (akin to the apply method of GroupBy). However, I do not think that is possible in pandas now.
You could use groupby:
def f(group):
    row = group.iloc[0]  # irow was removed from pandas; iloc is its replacement
    return pd.DataFrame({'class': [row['class']] * row['count']})
df.groupby('class', group_keys=False).apply(f)
so you get
In [25]: df.groupby('class', group_keys=False).apply(f)
Out[25]:
  class
0     A
0     C
1     C
You can fix the index of the result however you like.
I know this is an old question, but I was having trouble getting Wes' answer to work for multiple columns in the dataframe, so I made his code a bit more generic. I thought I'd share it in case anyone else stumbles on this question with the same problem.
You basically just specify which column holds the counts and you get an expanded dataframe in return.
import pandas as pd
df = pd.DataFrame({'class 1': ['A','B','C','A'],
                   'class 2': [ 1, 2, 3, 1],
                   'count':   [ 3, 3, 3, 1]})
print(df, "\n")
def f(group, *args):
    row = group.iloc[0]
    Dict = {}
    row_dict = row.to_dict()
    # Repeat every field of the row as many times as the counts column says
    for item in row_dict:
        Dict[item] = [row[item]] * row[args[0]]
    return pd.DataFrame(Dict)
def ExpandRows(df, WeightsColumnName):
    df_expand = df.groupby(df.columns.tolist(), group_keys=False).apply(f, WeightsColumnName).reset_index(drop=True)
    return df_expand
df_expanded = ExpandRows(df, 'count')
print(df_expanded)
Returns:
  class 1  class 2  count
0       A        1      3
1       B        2      3
2       C        3      3
3       A        1      1

  class 1  class 2  count
0       A        1      1
1       A        1      3
2       A        1      3
3       A        1      3
4       B        2      3
5       B        2      3
6       B        2      3
7       C        3      3
8       C        3      3
9       C        3      3
With regard to speed, my base df is 10 columns by ~6k rows, and expanding it to ~100,000 rows takes ~7 seconds. I'm not sure grouping is necessary or wise in this case, since it groups on all the columns, but hey, it's only 7 seconds.
There is an even simpler and significantly more efficient solution.
I had to make a similar modification for a table of about 3.5M rows, and the previously suggested solutions were extremely slow.
A better way is to use numpy's repeat function to generate a new index in which each row position is repeated according to its count, and then use iloc to select rows of the original table according to this index:
import pandas as pd
import numpy as np
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
spread_ixs = np.repeat(range(len(df)), df['count'])
spread_ixs
array([0, 2, 2])
df.iloc[spread_ixs, :].drop(columns='count').reset_index(drop=True)
  class
0     A
1     C
2     C
This question is very old and the answers do not reflect modern pandas capabilities. You can use iterrows to loop over every row and then use the DataFrame constructor to create new DataFrames with the correct number of rows. Finally, use pd.concat to concatenate all of them together.
pd.concat([pd.DataFrame(data=[row], index=range(row['count']))
           for _, row in df.iterrows()], ignore_index=True)
  class  count
0     A      1
1     C      2
2     C      2
This has the benefit of working with any size DataFrame.
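
A compact variant of the repeat idea that stays within pandas is to repeat the index labels and select with loc. This is a sketch that assumes the index labels are unique (as with the default RangeIndex); with duplicate labels, loc would return extra rows.

```python
import pandas as pd

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})

# Repeat each index label `count` times (0 drops the row), then look
# the repeated labels up in the original frame
expanded = (
    df.loc[df.index.repeat(df['count'])]
      .drop(columns='count')
      .reset_index(drop=True)
)
print(expanded)
```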
