Suppose I have a Pandas DataFrame called df with columns a and b, and I want the number of distinct values of b for each a. I would do:
distcounts = df.groupby('a')['b'].nunique()
which gives the desired result, but it is a Series object rather than another DataFrame. I'd like a DataFrame instead. In regular SQL, I'd do:
SELECT a, COUNT(DISTINCT b) FROM df GROUP BY a
and I haven't been able to emulate this query in Pandas exactly. How can I do this?
I think you need reset_index:
distcounts = df.groupby('a')['b'].nunique().reset_index()
Sample:
df = pd.DataFrame({'a':[7,8,8],
'b':[4,5,6]})
print (df)
a b
0 7 4
1 8 5
2 8 6
distcounts = df.groupby('a')['b'].nunique().reset_index()
print (distcounts)
a b
0 7 1
1 8 2
Another alternative using Groupby.agg instead:
df.groupby('a', as_index=False).agg({'b': 'nunique'})
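If the count column should get its own name instead of reusing b, named aggregation is one option. This is a minimal sketch on the sample data above; the output column name n_distinct_b is only an illustrative choice, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 8, 8], 'b': [4, 5, 6]})

# Named aggregation: new_column_name=(input_column, aggregation)
distcounts = (df.groupby('a')
                .agg(n_distinct_b=('b', 'nunique'))
                .reset_index())   # turn the group key back into a regular column
print(distcounts)
```

The result is a regular DataFrame with columns a and n_distinct_b, which is close to what the SQL query with an alias would return.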
I have a DataFrame with 100 columns (however I provide only three columns here) and I want to build a new DataFrame with two columns. Here is the DataFrame:
import pandas as pd
df = pd.DataFrame()
df ['id'] = [1,2,3]
df ['c1'] = [1,5,1]
df ['c2'] = [-1,6,5]
df
I want to stack the values of all columns for each id and put them in one column. For example, for id=1 I want to put 1 and -1 in one column. Here is the DataFrame that I want.
Note: df.melt does not solve my question, since I want to keep the ids as well.
Note 2: I already tried stack and reset_index, and it does not help:
df = df.stack().reset_index()
df.columns = ['id','c']
df
You could first set_index with "id"; then stack + reset_index:
out = (df.set_index('id').stack()
.droplevel(1).reset_index(name='c'))
Output:
id c
0 1 1
1 1 -1
2 2 5
3 2 6
4 3 1
5 3 5
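For comparison, melt can also keep the ids if id is passed as id_vars. The following is only a sketch of that alternative on the three-column sample, with a stable sort added so that the rows of each id stay together in column order; it is not the approach from the answer above.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'c1': [1, 5, 1], 'c2': [-1, 6, 5]})

out = (df.melt(id_vars='id', value_name='c')   # columns: id, variable, c
         .drop(columns='variable')             # the source column name is not needed
         .sort_values('id', kind='mergesort')  # stable sort: c1 stays before c2 per id
         .reset_index(drop=True))
print(out)
```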
What does pd.DataFrame do to a dataframe? Please see the code below.
In [1]: import pandas as pd
In [2]: a = pd.DataFrame(dict(a=[1,2,3], b=[4,5,6]))
In [3]: b = pd.DataFrame(a)
In [4]: a['c'] = [7,8,9]
In [5]: a
Out[5]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [6]: b
Out[6]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [7]: a.drop(columns='c', inplace=True)
In [8]: a
Out[8]:
a b
0 1 4
1 2 5
2 3 6
In [9]: b
Out[9]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In In[3], the function pd.DataFrame is applied to a dataframe a. It turns out that the ids of a and b are different. However, when a column is added to a, the same column is added to b, but when we drop a column from a, the column is not dropped from b. So what does pd.DataFrame do? Are a and b the same object or different? What should we do to a so that the column is also dropped from b? Or, how do we prevent a column from being added to b when we add a column to a?
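Whether pd.DataFrame(a) shares anything with a may depend on the pandas version, so the behaviour shown above should not be relied on. A minimal sketch of one way to keep the two frames independent, assuming an explicit copy is acceptable:

```python
import pandas as pd

a = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
b = a.copy()          # deep copy by default: b gets its own data and columns

a['c'] = [7, 8, 9]    # only a gains the new column
print(list(a.columns))
print(list(b.columns))
```

With an explicit copy, later additions to or removals from a leave b untouched.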
I would avoid that pattern altogether. It is better to create the dataframe directly, like this:
df=pd.DataFrame({'a': [0,1,2], 'b': [3,4,5], 'c':[6,7,8]})
The above result is a dataframe, with indices and column names.
You can add a column to df, like this:
df['d'] = [8,9,10]
And remove a column from the dataframe, like this:
df.drop(columns='c',inplace=True)
I would not create a dataframe from a function definition, but use 'append' instead. Append works for dictionaries and dataframes. An example for a dictionary based append:
df = pd.DataFrame(columns=['Col1','Col2','Col3','Col4']) # create empty df with column names.
append_dict = {'Col1':value_1, 'Col2':value_2, 'Col3':value_3,'Col4':value_4}
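df = df.append(append_dict, ignore_index=True) # DataFrame.append was removed in pandas 2.0; there, use pd.concat([df, pd.DataFrame([append_dict])], ignore_index=True)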
The values can be changed in a loop, so it does something with respect to the previous values. For dataframe append, you can check the pandas documentation (just replace the append_dict argument with the dataframe that you like to append)
Is this what you want?
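Since DataFrame.append is no longer available in pandas 2.0 and later, the loop-based pattern can be sketched with a list of dictionaries and pd.concat instead. The values in the loop and the extra frame are hypothetical and only there for illustration.

```python
import pandas as pd

rows = []
for i in range(3):
    # hypothetical values, computed per iteration
    rows.append({'Col1': i, 'Col2': i * 2, 'Col3': i * 3, 'Col4': i * 4})

# Build the frame once from the collected rows
df = pd.DataFrame(rows, columns=['Col1', 'Col2', 'Col3', 'Col4'])

# Appending another DataFrame: pd.concat takes the place of DataFrame.append
extra = pd.DataFrame([{'Col1': 9, 'Col2': 9, 'Col3': 9, 'Col4': 9}])
df = pd.concat([df, extra], ignore_index=True)
print(df)
```

Collecting rows in a list and constructing the frame once also avoids copying the whole frame on every iteration.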
I hope the title speaks for itself; I'd just like to add that it can be assumed that each key has the same number of values.
Searching online for the title yielded the following solution:
Split pandas dataframe based on groupby
It is supposed to solve my problem, but it does not.
I'll give an example:
Input:
pd.DataFrame(data={'a':['foo','foo','foo','bar','bar','bar'],'b':[1,2,3,4,5,6]})
Output:
pd.DataFrame(data={'a':['foo','bar'],'b':[1,4],'c':[2,5],'d':[3,6]})
Intuitively, it would be a groupby function without an aggregation function, or an aggregation function that makes a list out of the keys.
Obviously, it can be done 'manually' using for loops etc., but using for loops with large data sets is very expensive computationally.
Use GroupBy.cumcount to build a counter Series (or column) g, then reshape with DataFrame.set_index + Series.unstack or with DataFrame.pivot, and finally clean up with DataFrame.add_prefix, DataFrame.reset_index and DataFrame.rename_axis:
g = df1.groupby('a').cumcount()
df = (df1.set_index(['a', g])['b']
.unstack()
.add_prefix('new_')
.reset_index()
.rename_axis(None, axis=1))
print (df)
a new_0 new_1 new_2
0 bar 4 5 6
1 foo 1 2 3
Or:
df1['g'] = df1.groupby('a').cumcount()
df = df1.pivot(index='a', columns='g', values='b').add_prefix('new_').reset_index().rename_axis(None, axis=1)
print (df)
a new_0 new_1 new_2
0 bar 4 5 6
1 foo 1 2 3
Here is an alternative approach, using groupby.apply and string.ascii_lowercase if column names are important:
from string import ascii_lowercase
df = pd.DataFrame(data={'a':['foo','foo','foo','bar','bar','bar'],'b':[1,2,3,4,5,6]})
# Groupby 'a'
g = df.groupby('a')['b'].apply(list)
# Construct new DataFrame from g
new_df = pd.DataFrame(g.values.tolist(), index=g.index).reset_index()
# Fix column names
new_df.columns = list(ascii_lowercase[:new_df.shape[1]])
print(new_df)
a b c d
0 bar 4 5 6
1 foo 1 2 3
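Both approaches above return the keys in sorted order (bar before foo), while the requested output lists foo first. A sketch that restores the order of first appearance, assuming that order matters, combines cumcount and pivot with a reindex on the unique keys:

```python
import pandas as pd
from string import ascii_lowercase

df1 = pd.DataFrame({'a': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                    'b': [1, 2, 3, 4, 5, 6]})

order = df1['a'].unique()   # keys in order of first appearance
wide = (df1.assign(g=df1.groupby('a').cumcount())   # position within each group
           .pivot(index='a', columns='g', values='b')
           .reindex(order)                          # undo the sorting of the keys
           .reset_index())
# Name the columns a, b, c, ... as in the requested output
wide.columns = list(ascii_lowercase[:wide.shape[1]])
print(wide)
```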
I just need one column of my dataframe, but in the original order. When I extract it, it is sorted by the values, and I can't understand why. I tried different ways to pick out one column, but every time it was sorted by the values.
This is my code:
import pandas
data = pandas.read_csv('/data.csv', sep=';')
longti = data.iloc[:,4]
To return the first column, your approach should work:
import pandas as pd
df = pd.DataFrame(dict(A=[1,2,3,4,5,6], B=['A','B','C','D','E','F']))
first = df.iloc[:,0]  # keep df itself intact so other columns can still be selected
Out:
0 1
1 2
2 3
3 4
4 5
5 6
If you want to return the second column, you can use the following:
second = df.iloc[:,1]
Out:
0 A
1 B
2 C
3 D
4 E
5 F
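Positional selection with iloc does not reorder rows by itself. The following sketch uses made-up, deliberately unsorted data (not the asker's file) to show that the values come back in stored order and are sorted only when that is requested explicitly:

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 1, 2], 'B': ['c', 'a', 'b']})

col = df.iloc[:, 0]                  # select the first column by position
print(col.tolist())                  # row order exactly as stored in df
print(col.sort_values().tolist())    # sorted only because sort_values was called
```

If a column read from a CSV file looks sorted, it is worth checking whether the file itself is already stored in that order.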
I have a very simple dataframe like so:
In [8]: df
Out[8]:
A B C
0 2 a a
1 3 s 3
2 4 c !
3 1 f 1
My goal is to extract the first row in such a way that looks like this:
A B C
0 2 a a
As you can see the dataframe shape (1x3) is preserved and the first row still has 3 columns.
However when I type the following command df.loc[0] the output result is this:
df.loc[0]
Out[9]:
A 2
B a
C a
Name: 0, dtype: object
As you can see, the row has turned into a column with 3 rows (3x1 instead of 1x3). How is this possible? How can I simply extract the row and preserve its shape as described in my goal? Could you provide a smart and elegant way to do it?
I tried to use the transpose command .T but without success... I know I could create another dataframe whose columns are extracted from the original dataframe, but this is quite tedious and not elegant, I would say (pd.DataFrame({'A':[2], 'B':'a', 'C':'a'})).
Here is the dataframe if you need it:
import pandas as pd
df = pd.DataFrame({'A':[2,3,4,1], 'B':['a','s','c','f'], 'C':['a', 3, '!', 1]})
You need to add [] to get a DataFrame:
#select by index value
print (df.loc[[0]])
A B C
0 2 a a
Or:
print (df.iloc[[0]])
A B C
0 2 a a
If you need to transpose the Series, first convert it to a DataFrame with to_frame:
print (df.loc[0].to_frame())
0
A 2
B a
C a
print (df.loc[0].to_frame().T)
A B C
0 2 a a
Using a range selector will preserve the DataFrame format.
df.iloc[0:1]
Out[221]:
A B C
0 2 a a
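The difference between these selectors can be checked through the shapes they return. A short sketch on the dataframe from the question; the variable names are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 4, 1],
                   'B': ['a', 's', 'c', 'f'],
                   'C': ['a', 3, '!', 1]})

as_series = df.loc[0]      # scalar label -> Series with the columns as its index
as_frame = df.loc[[0]]     # list of labels -> DataFrame with one row
as_slice = df.iloc[0:1]    # positional slice -> DataFrame with one row

print(as_series.shape, as_frame.shape, as_slice.shape)
```

A scalar selector drops one dimension, while a list or a slice keeps the result two-dimensional.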