Pandas groupby object back to a data frame [closed] - python

So I'm trying to split a pandas DataFrame into two separate DataFrames by a single binary variable. The groupby function seems like a decent option, except that it returns a groupby object rather than a DataFrame, which isn't nearly as useful to me, and I can't access any values from within the groupby object. I ran a simple df.groupby('Type') statement and would like to partition the data from there, i.e. output the two groups to two new DataFrames. Any help would be sincerely appreciated. The last question I posted was met with ridiculously childish admonitions not to post homework questions. Needless to say, neither that question nor this one is homework, so please spare me such remarks. As always, thanks so much.

If you use groupby, you can iterate through the groups as follows:
g = df.groupby('class')
for k, v in g.groups.items():
    print(k)          # the group key, e.g. 'a'
    print(df.loc[v])  # the dict values are the index labels of each group's rows
    print()
a
class data1 data2
0 a -0.173070 141.437719
2 a -0.087673 200.815709
6 a 1.220608 159.456053
8 a 0.428373 -6.491034
9 a -0.123463 -96.898025
c
class data1 data2
5 c -0.358996 162.715982
7 c -1.339496 23.043417
b
class data1 data2
1 b -1.761652 -12.405066
3 b 1.366879 22.988654
4 b 1.125314 60.489373
Note: the order in which the groups dict is iterated is not guaranteed to match the original row order.
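If the goal is simply to materialize each group as its own DataFrame, a minimal sketch (assuming the question's df.groupby('Type') and that the binary variable takes the values 0 and 1) is to build a dict of sub-DataFrames straight from the groupby object:
# iterating a groupby object yields (key, sub-DataFrame) pairs,
# so dict() turns it into {group key: DataFrame}
groups = dict(tuple(df.groupby('Type')))
df_zero, df_one = groups[0], groups[1]  # assumed key values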

How's this?
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'class': np.random.choice(list('abc'), size=10),
                   'data1': np.random.randn(10),
                   'data2': np.random.randn(10) * 100})

# boolean masks partition the frame into one DataFrame per class
df_a = df[df['class'] == 'a']
df_b = df[df['class'] == 'b']
df_c = df[df['class'] == 'c']

print(df, '\n')
print(df_a)
print(df_b)
print(df_c)
Gives:
class data1 data2
0 a -0.173070 141.437719
1 b -1.761652 -12.405066
2 a -0.087673 200.815709
3 b 1.366879 22.988654
4 b 1.125314 60.489373
5 c -0.358996 162.715982
6 a 1.220608 159.456053
7 c -1.339496 23.043417
8 a 0.428373 -6.491034
9 a -0.123463 -96.898025
class data1 data2
0 a -0.173070 141.437719
2 a -0.087673 200.815709
6 a 1.220608 159.456053
8 a 0.428373 -6.491034
9 a -0.123463 -96.898025
class data1 data2
1 b -1.761652 -12.405066
3 b 1.366879 22.988654
4 b 1.125314 60.489373
class data1 data2
5 c -0.358996 162.715982
7 c -1.339496 23.043417

Related

Concat two rows of a CSV file [closed]

I am trying to clean data in a DataFrame in Python, where I need to concatenate rows in which the data in two columns (name, phone_no) are similar, i.e.:
[screenshot: what I have]
[screenshot: expected result]
P.S. It would be much better if you could provide a sample of the dataset instead of the images. Next time, you can use df.to_clipboard and paste the result as a code snippet in the question for reproducibility.
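For example, a tiny sketch of that workflow (assuming the data fits comfortably on the clipboard):
df.head(10).to_clipboard()  # copies a text rendering of the first rows; paste it into the question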
Now to the answer. You can use pandas groupby and then a custom aggregation.
First I created a dataset for the example:
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a", "b", "c"],
                   "B": list(map(str, range(5))),
                   "C": list(map(str, range(5, 10)))})
It looks as follows:
A B C
0 a 0 5
1 b 1 6
2 a 2 7
3 b 3 8
4 c 4 9
Then you can concatenate rows with similar keys (in your case the keys are name and phone_no):
gdf = df.groupby("A", as_index=False).agg({  # as_index=False keeps the key as a column, matching the output below
    "B": ",".join,
    "C": ",".join,
})
print(gdf)
And the results are as follows:
A B C
0 a 0,2 5,7
1 b 1,3 6,8
2 c 4 9
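Applied to the question's own data, a hedged sketch (the column names besides name and phone_no are unknown, since only screenshots were posted) would be:
# assuming every remaining column holds strings to be merged
merged = df.groupby(["name", "phone_no"], as_index=False).agg(",".join)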

Change multiple column names in pandas dataframe (not all column names) at the same time using index numbers

I have successfully changed a single column name in the dataframe using this:
df.columns = ['new_name' if x == 'old_name' else x for x in df.columns]
However, I have lots of columns to update (but not all 240 of them) and I don't want to have to write it out for each single change if I can help it.
I have tried to follow the advice from @StefanK in this thread:
Changing multiple column names but not all of them - Pandas Python
My code:
df.columns=[[4,18,181,182,187,188,189,190,203,204]]=['Brand','Reason','Chat_helpful','Chat_expertise','Answered_questions','Recommend_chat','Alternate_help','Customer_comments','Agent_category','Agent_outcome']
but i am getting an error message:
File "<ipython-input-17-2808488b712d>", line 3
df.columns=[[4,18,181,182,187,188,189,190,203,204]]=['Brand','Reason','Chat_helpful','Chat_expertise','Answered_questions','Recommend_chat','Alternate_help','Customer_comments','Agent_category','Agent_outcome']
^
SyntaxError: can't assign to literal
Having googled the error and read many more S.O. questions here, it looks to me like it is trying to read the numbers as integers instead of as an index, though I'm not certain.
So how do I fix it so it treats the numbers as the index? The column names I am replacing are at least 10 words long each, so I'm keen not to have to type them all out! My only idea is to use iloc somehow, but I'm going into new territory here.
I'd really appreciate some help, please.
Remove the '=' after df.columns in your code and use this instead:
df.columns.values[[4,18,181,182,187,188,189,190,203,204]]=['Brand','Reason','Chat_helpful','Chat_expertise','Answered_questions','Recommend_chat','Alternate_help','Customer_comments','Agent_category','Agent_outcome']
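A dictionary-based alternative, sketched under the assumption that the positions in the list line up with the current column order, is to build an old-name-to-new-name mapping and pass it to DataFrame.rename, which avoids writing into the Index's underlying array:
idx = [4, 18, 181, 182, 187, 188, 189, 190, 203, 204]
names = ['Brand', 'Reason', 'Chat_helpful', 'Chat_expertise', 'Answered_questions',
         'Recommend_chat', 'Alternate_help', 'Customer_comments', 'Agent_category', 'Agent_outcome']
# map the existing label at each position to its replacement
df = df.rename(columns=dict(zip(df.columns[idx], names)))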
Because a pandas Index does not support mutable operations, convert it to a NumPy array, reassign, and set it back:
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb'),
})

arr = df.columns.to_numpy()   # the labels as a plain, writable NumPy array
arr[[0, 2, 3]] = list('RTG')  # reassign by position
df.columns = arr              # set the modified labels back
print(df)
R B T G E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
So with your data use:
idx = [4,18,181,182,187,188,189,190,203,204]
names = ['Brand','Reason','Chat_helpful','Chat_expertise','Answered_questions','Recommend_chat','Alternate_help','Customer_comments','Agent_category','Agent_outcome']
arr = df.columns.to_numpy()
arr[idx] = names
df.columns = arr

Split Pandas Dataframe Column According To a Value

I searched and I couldn't find a problem like mine, so if there is one and I somehow missed it, please let me know and I can delete this post.
I am stuck on a problem: splitting a pandas dataframe into different dataframes by a value.
I have a dataset inside a text file that I store as a pandas dataframe with only one column. There is more than one set of information inside the dataset, and a certain value marks the end of each set; you can see a sample below:
The Sample Input
In [8]: df
Out[8]:
var1
0 a
1 b
2 c
3 d
4 endValue
5 h
6 f
7 b
8 w
9 endValue
So I want to split this df into different dataframes. I couldn't find a way to do that, but I'm sure there must be an easy one. The format I show in the sample output may be wrong, so if you have a better idea I'd love to see it. Thank you for the help.
The sample output I'd like
var1
{[0 a
1 b
2 c
3 d
4 endValue]},
{[0 h
1 f
2 b
3 w
4 endValue]}
You could check where var1 is endValue, take the cumsum, shift it so each endValue row stays with its own group, and use the result as a custom grouper. Then group by it and build a dictionary from the result:
d = dict(tuple(df.groupby(df.var1.eq('endValue').cumsum().shift(fill_value=0.))))
Or for a list of dataframes (effectively indexed in the same way):
l = [v for _,v in df.groupby(df.var1.eq('endValue').cumsum().shift(fill_value=0.))]
print(l[0])
var1
0 a
1 b
2 c
3 d
4 endValue
Another idea, which assumes unique index values: replace the non-matching index values with NaN, backfill them, and finally loop over the groupby object to build a list of DataFrames:
g = df.index.to_series().where(df['var1'].eq('endValue')).bfill()
dfs = [a for i, a in df.groupby(g, sort=False)]
print(dfs)
[ var1
0 a
1 b
2 c
3 d
4 endValue, var1
5 h
6 f
7 b
8 w
9 endValue]
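Note that both answers keep the original row labels, while the sample output in the question restarts the numbering at 0 inside each chunk; a small sketch of that variant, reusing the first answer's grouper, is:
grouper = df.var1.eq('endValue').cumsum().shift(fill_value=0)
# reset the index inside each chunk so every sub-frame starts at 0
dfs = [g.reset_index(drop=True) for _, g in df.groupby(grouper)]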

Re-shaping pandas data frame using stack or pivot_table (stack each row)

I have an almost embarrassingly simple question, which I cannot figure out for myself.
Here's a toy example to demonstrate what I want to do, suppose I have this simple data frame:
df = pd.DataFrame([[1,2,3,4,5,6],[7,8,9,10,11,12]],index=range(2),columns=list('abcdef'))
a b c d e f
0 1 2 3 4 5 6
1 7 8 9 10 11 12
What I want is to stack it so that it takes the following form, where the column identifiers have been changed (to X and Y) so that they are the same for all re-stacked values:
X Y
0 1 2
3 4
5 6
1 7 8
9 10
11 12
I am pretty sure you can do this with pd.stack() or pd.pivot_table(), but I have read the documentation and cannot figure out how. Instead of appending all columns to the end of the next, I just want to append pairs (or actually triplets) of values from each row.
Just to add some more flesh to the bones of what I want to do:
df = pd.DataFrame(np.random.randn(3,6),index=range(3),columns=list('abcdef'))
a b c d e f
0 -0.168636 -1.878447 -0.985152 -0.101049 1.244617 1.256772
1 0.395110 -0.237559 0.034890 -1.244669 -0.721756 0.473696
2 -0.973043 1.784627 0.601250 -1.718324 0.145479 -0.099530
I want this re-stacked into this form (where the column labels have again been changed, to the same labels for all values):
X Y Z
0 -0.168636 -1.878447 -0.985152
-0.101049 1.244617 1.256772
1 0.395110 -0.237559 0.034890
-1.244669 -0.721756 0.473696
2 -0.973043 1.784627 0.601250
-1.718324 0.145479 -0.099530
Yes, one could just write a for-loop with the following logic operating on each row:
row.reshape(df.shape[1] // 2, 2)
But then you would have to process each row individually, and my actual data has tens of thousands of rows.
So I want to stack each individual row selectively (e.g. by pairs of values or triplets), and then stack that row-stack, for the entire data frame, basically. Preferably done on the entire data frame at once (if possible).
Apologies for such a trivial question.
Use numpy.reshape to reshape the underlying data in the DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3,6),index=range(3),columns=list('abcdef'))
print(df)
# a b c d e f
# 0 -0.889810 1.348811 -1.071198 0.091841 -0.781704 -1.672864
# 1 0.398858 0.004976 1.280942 1.185749 1.260551 0.858973
# 2 1.279742 0.946470 -1.122450 -0.355737 1.457966 0.034319
result = pd.DataFrame(df.values.reshape(-1, 3),
                      index=df.index.repeat(2), columns=list('XYZ'))
print(result)
yields
X Y Z
0 -0.889810 1.348811 -1.071198
0 0.091841 -0.781704 -1.672864
1 0.398858 0.004976 1.280942
1 1.185749 1.260551 0.858973
2 1.279742 0.946470 -1.122450
2 -0.355737 1.457966 0.034319
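The pairs case from the first example works the same way with a different shape; a sketch assuming the six-column frame above:
# three (X, Y) pairs per original row, so each index label repeats 3 times
pairs = pd.DataFrame(df.values.reshape(-1, 2),
                     index=df.index.repeat(3), columns=list('XY'))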

Apply function to pandas DataFrame that can return multiple rows

I am trying to transform a DataFrame such that some of the rows will be replicated a given number of times. For example:
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count':[1,0,2]})
class count
0 A 1
1 B 0
2 C 2
should be transformed to:
class
0 A
1 C
2 C
This is the reverse of aggregation with the count function. Is there an easy way to achieve it in pandas (without using for loops or list comprehensions)?
One possibility might be to allow the DataFrame.applymap function to return multiple rows (akin to the apply method of GroupBy). However, I do not think that is possible in pandas right now.
You could use groupby:
def f(group):
    row = group.iloc[0]  # irow(0) in older pandas; iloc[0] takes the group's first row
    return pd.DataFrame({'class': [row['class']] * row['count']})

df.groupby('class', group_keys=False).apply(f)
so you get
In [25]: df.groupby('class', group_keys=False).apply(f)
Out[25]:
class
0 A
0 C
1 C
You can fix the index of the result however you like.
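For example, to get a clean 0-based index (a one-line sketch using the f defined above):
df.groupby('class', group_keys=False).apply(f).reset_index(drop=True)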
I know this is an old question, but I was having trouble getting Wes' answer to work for multiple columns in the dataframe, so I made his code a bit more generic. Thought I'd share in case anyone else stumbles on this question with the same problem.
You basically just specify which column has the counts in it, and you get an expanded dataframe in return.
import pandas as pd

df = pd.DataFrame({'class 1': ['A', 'B', 'C', 'A'],
                   'class 2': [1, 2, 3, 1],
                   'count': [3, 3, 3, 1]})
print(df, "\n")

def f(group, *args):
    # take the group's first row and replicate each of its values 'count' times
    row = group.iloc[0]  # irow(0) was removed from pandas
    replicated = {item: [value] * row[args[0]] for item, value in row.to_dict().items()}
    return pd.DataFrame(replicated)

def ExpandRows(df, WeightsColumnName):
    grouped = df.groupby(df.columns.tolist(), group_keys=False)
    return grouped.apply(f, WeightsColumnName).reset_index(drop=True)

df_expanded = ExpandRows(df, 'count')
print(df_expanded)
Returns:
class 1 class 2 count
0 A 1 3
1 B 2 3
2 C 3 3
3 A 1 1
class 1 class 2 count
0 A 1 1
1 A 1 3
2 A 1 3
3 A 1 3
4 B 2 3
5 B 2 3
6 B 2 3
7 C 3 3
8 C 3 3
9 C 3 3
With regards to speed, my base df is 10 columns by ~6k rows, and when expanded to ~100,000 rows it takes ~7 seconds. I'm not sure grouping is necessary or wise in this case, since it uses all the columns to form the groups, but hey, it's only 7 seconds.
There is an even simpler and significantly more efficient solution.
I had to make a similar modification for a table of about 3.5M rows, and the previously suggested solutions were extremely slow.
A better way is to use numpy's repeat procedure to generate a new index in which each row index is repeated multiple times according to its given count, and then use iloc to select rows of the original table according to this index:
import pandas as pd
import numpy as np
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
spread_ixs = np.repeat(range(len(df)), df['count'])
spread_ixs
array([0, 2, 2])
df.iloc[spread_ixs, :].drop(columns='count').reset_index(drop=True)
class
0 A
1 C
2 C
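A closely related pandas-native spelling of the same idea, offered here as a sketch rather than part of the original answer, uses Index.repeat instead of building the positional array by hand:
# repeat each row label 'count' times, select by label, then tidy up
out = (df.loc[df.index.repeat(df['count'])]
         .drop(columns='count')
         .reset_index(drop=True))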
This question is very old and the answers do not reflect pandas' modern capabilities. You can use iterrows to loop over every row and then use the DataFrame constructor to create new DataFrames with the correct number of rows. Finally, use pd.concat to concatenate all the rows together.
pd.concat([pd.DataFrame(data=[row], index=range(row['count']))
           for _, row in df.iterrows()], ignore_index=True)
class count
0 A 1
1 C 2
2 C 2
This has the benefit of working with any size DataFrame.
