python split pd dataframe by column - python

Is there a function that splits a pandas.dataframe object into multiple sub-dataframes, by a specific column value? For example, if I have
A 1
B 2
A 3
B 4
I want the result as follow:
A 1
A 3
and
B 2
B 4
In R, it is the split function. How is it being done in python? I know I can use subset within a forloop. But is there a function does that? Thanks.

You can use groupby() with list-comprehension to extract a list of sub data frames where each of them contains only a single ind value:
import pandas as pd
from StringIO import StringIO
df = pd.read_csv(StringIO("""A 1
B 2
A 3
B 4"""), sep = "\s+", names=['ind', 'value'])
lst = [g for _, g in df.groupby('ind')]
lst[0]
# ind value
#0 A 1
#2 A 3
lst[1]
# ind value
#1 B 2
#3 B 4

Related

Pandas: Get top n columns based on a row values

Having a dataframe with a single row, I need to filter it into a smaller one with filtered columns based on a value in a row.
What's the most effective way?
df = pd.DataFrame({'a':[1], 'b':[10], 'c':[3], 'd':[5]})
a
b
c
d
1
10
3
5
For example top-3 features:
b
c
d
10
3
5
Use sorting per row and select first 3 values:
df1 = df.sort_values(0, axis=1, ascending=False).iloc[:, :3]
print (df1)
b d c
0 10 5 3
Solution with Series.nlargest:
df1 = df.iloc[0].nlargest(3).to_frame().T
print (df1)
b d c
0 10 5 3
You can transpose T, and use nlargest():
new = df.T.nlargest(columns = 0, n = 3).T
print(new)
b d c
0 10 5 3
You can use np.argsort to get the solution. This Numpy method, in the below code, gives the indices of the column values in descending order. Then slicing selects the largest n values' indices.
import pandas as pd
import numpy as np
# Your dataframe
df = pd.DataFrame({'a':[1], 'b':[10], 'c':[3], 'd':[5]})
# Pick the number n to find n largest values
nlargest = 3
# Get the order of the largest value columns by their indices
order = np.argsort(-df.values, axis=1)[:, :nlargest]
# Find the columns with the largest values
top_features = df.columns[order].tolist()[0]
# Filter the dateframe by the columns
top_features_df = df[top_features]
top_features_df
output:
b d c
0 10 5 3

How to get a transition string per row object based on two different columns in python (without using loops)?

I have the following data structure:
The columns s and d are indicating the transition of object in column x. What I want to do is get a transition string per object present in the column x. For e.g. with a new column as follows:
Is there a lean way to do it using pandas, without using too many loops?
This was the code I tried:
obj = df['x'].tolist()
rows = []
for o in obj:
locs = df[df['x'] == o]['s'].tolist()
str_locs = '->'.join(str(l) for l in locs)
print(str_locs)
d = dict()
d['x'] = o
d['new'] = str_locs
rows.append(d)
tmp = pd.DataFrame(rows)
This give the output temp as:
x new
a 1->2->4->8
a 1->2->4->8
a 1->2->4->8
a 1->2->4->8
b 1->2
b 1->2
Example df:
df = pd.DataFrame({"x":["a","a","a","a","b","b"], "s":[1,2,4,8,5,11],"d":[2,4,8,9,11,12]})
print(df)
x s d
0 a 1 2
1 a 2 4
2 a 4 8
3 a 8 9
4 b 5 11
5 b 11 12
Following code will generate a transition string of all objects present in the column x.
groupby with respect to column x and get list of lists of s and d for every object available in x
Merge the list of lists sequentially
Remove consecutive duplicates from the merged list using itertools.groupby
Join the items of merged list with -> to make it a single string.
Finally map the series to column x of input df
from itertools import groupby
grp = df.groupby('x')[['s', 'd']].apply(lambda x: x.values.tolist())
grp = grp.apply(lambda x: [str(item) for tup in x for item in tup])
sr = grp.apply(lambda x: "->".join([i[0] for i in groupby(x)]))
df["new"] = df["x"].map(sr)
print(df)
x s d new
0 a 1 2 1->2->4->8->9
1 a 2 4 1->2->4->8->9
2 a 4 8 1->2->4->8->9
3 a 8 9 1->2->4->8->9
4 b 5 11 5->11->12
5 b 11 12 5->11->12

Split Pandas Dataframe Column According To a Value

I searched and I couldn't find a problem like mine. So if there is and somehow I couldn't find please let me know. So I can delete this post.
I stuck with a problem to split pandas dataframe into different data frames (df) by a value.
I have a dataset inside a text file and I store them as pandas dataframe that has only one column. There are more than one sets of information inside the dataset and a certain value defines the end of that set, you can see a sample below:
The Sample Input
In [8]: df
Out[8]:
var1
0 a
1 b
2 c
3 d
4 endValue
5 h
6 f
7 b
8 w
9 endValue
So I want to split this df into different data frames. I couldn't find a way to do that but I'm sure there must be an easy way. The format I display in sample output can be a wrong format. So, If you have a better idea I'd love to see. Thank you for help.
The sample output I'd like
var1
{[0 a
1 b
2 c
3 d
4 endValue]},
{[0 h
1 f
2 b
3 w
4 endValue]}
You could check where var1 is endValue, take the cumsum, and use the result as a custom grouper. Then Groupby and build a dictionary from the result:
d = dict(tuple(df.groupby(df.var1.eq('endValue').cumsum().shift(fill_value=0.))))
Or for a list of dataframes (effectively indexed in the same way):
l = [v for _,v in df.groupby(df.var1.eq('endValue').cumsum().shift(fill_value=0.))]
print(l[0])
var1
0 a
1 b
2 c
3 d
4 endValue
One idea with unique index values is replace non matched values to NaNs and backfilling them, last loop groupby object for list of DataFrames:
g = df.index.to_series().where(df['var1'].eq('endValue')).bfill()
dfs = [a for i, a in df.groupby(g, sort=False)]
print (dfs)
[ var1
0 a
1 b
2 c
3 d
4 endValue, var1
5 h
6 f
7 b
8 w
9 endValue]

Add column to DataFrame in a loop

Let's say I have a very simple pandas dataframe, containing a single indexed column with "initial values". I want to read in a loop N other dataframes to fill a single "comparison" column, with matching indices.
For instance, with my inital dataframe as
Initial
0 a
1 b
2 c
3 d
and the following two dataframes to read in a loop
Comparison
0 e
1 f
Comparison
2 g
3 h
4 i <= note that this index doesn't exist in Initial so won't be matched
I would like to produce the following result
Initial Comparison
0 a e
1 b f
2 c g
3 d h
Using merge, concat or join, I only ever seem to be able to create a new column for each iteration of the loop, filling the blanks with NaN.
What's the most pandas-pythonic way of achieving this?
Below an example from the proposed duplicate solution:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([['a'],['b'],['c'],['d']]), columns=['Initial'])
print df1
df2 = pd.DataFrame(np.array([['e'],['f']]), columns=['Compare'])
print df2
df3 = pd.DataFrame(np.array([[2,'g'],[3,'h'],[4,'i']]), columns=['','Compare'])
df3 = df3.set_index('')
print df3
print df1.merge(df2,left_index=True,right_index=True).merge(df3,left_index=True,right_index=True)
>>
Initial
0 a
1 b
2 c
3 d
Compare
0 e
1 f
Compare
2 g
3 h
4 i
Empty DataFrame
Columns: [Initial, Compare_x, Compare_y]
Index: []
Second edit: #W-B, the following seems to work, but it can't be the case that there isn't a simpler option using proper pandas methods. It also requires turning off warnings, which might be dangerous...
pd.options.mode.chained_assignment = None
df1["Compare"]=pd.Series()
for ind in df1.index.values:
if ind in df2.index.values:
df1["Compare"][ind]=df2.T[ind]["Compare"]
if ind in df3.index.values:
df1["Compare"][ind]=df3.T[ind]["Compare"]
print df1
>>
Initial Compare
0 a e
1 b f
2 c g
3 d h
Ok , since Op need more info
Data input
import functools
df1 = pd.DataFrame(np.array([['a'],['b'],['c'],['d']]), columns=['Initial'])
df1['Compare']=np.nan
df2 = pd.DataFrame(np.array([['e'],['f']]), columns=['Compare'])
df3 = pd.DataFrame(np.array(['g','h','i']), columns=['Compare'],index=[2,3,4])
Solution
newdf=functools.reduce(lambda x,y: x.fillna(y),[df1,df2,df3])
newdf
Out[639]:
Initial Compare
0 a e
1 b f
2 c g
3 d h

How to transform the result of a Pandas `GROUPBY` function to the original dataframe

Suppose I have a Pandas DataFrame with 6 columns and a custom function that takes counts of the elements in 2 or 3 columns and produces a boolean output. When a groupby object is created from the original dataframe and the custom function is applied df.groupby('col1').apply(myfunc), the result is a series whose length is equal to the number of categories of col1. How do I expand this output to match the length of the original dataframe? I tried transform, but was not able to use the custom function myfunc with it.
EDIT:
Here is an example code:
A = pd.DataFrame({'X':['a','b','c','a','c'], 'Y':['at','bt','ct','at','ct'], 'Z':['q','q','r','r','s']})
print (A)
def myfunc(df):
return ((df['Z'].nunique()>=2) and (df['Y'].nunique()<2))
A.groupby('X').apply(myfunc)
I would like to expand this output as a new column Result such that where there is a in column X, the Result will be True.
You can map the groupby back to the original dataframe
A['Result'] = A['X'].map(A.groupby('X').apply(myfunc))
Result would look like:
X Y Z Result
0 a at q True
1 b bt q False
2 c ct r True
3 a at r True
4 c ct s True
My solution may not be the best one, which uses a loop, but it's pretty good I think.
The core idea is you can traverse all the sub-dataframe (gdf) by for i, gdf in gp. Then add the column result (in my example it is c) for each sub-dataframe. Finally concat all the sub-dataframe into one.
Here is an example:
import pandas as pd
df = pd.DataFrame({'a':[1,2,1,2],'b':['a','b','c','d']})
gp = df.groupby('a') # group
s = gp.apply(sum)['a'] # apply a func
adf = []
# then create a new dataframe
for i, gdf in gp:
tdf = gdf.copy()
tdf.loc[:,'c'] = s.loc[i]
adf.append(tdf)
pd.concat(adf)
from:
a b
0 1 a
1 2 b
2 1 c
3 2 d
to:
a b c
0 1 a 2
2 1 c 2
1 2 b 4
3 2 d 4

Categories