What does pd.DataFrame do when it is given a DataFrame? Please see the code below.
In [1]: import pandas as pd
In [2]: a = pd.DataFrame(dict(a=[1,2,3], b=[4,5,6]))
In [3]: b = pd.DataFrame(a)
In [4]: a['c'] = [7,8,9]
In [5]: a
Out[5]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [6]: b
Out[6]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [7]: a.drop(columns='c', inplace=True)
In [8]: a
Out[8]:
a b
0 1 4
1 2 5
2 3 6
In [9]: b
Out[9]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In In [3], pd.DataFrame is applied to the dataframe a. It turns out that the ids of a and b are different. However, when a column is added to a, the same column appears in b, yet when we drop a column from a, that column is not dropped from b. So what does pd.DataFrame actually do? Are a and b the same object or different objects? What should we do to a so that the column is dropped from b as well? Or, how do we prevent a column from being added to b when we add one to a?
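For reference, a minimal sketch of how to get a fully independent frame, assuming that is the behaviour wanted here (copy=True on the constructor and DataFrame.copy are standard pandas options):
import pandas as pd
a = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]))
b = pd.DataFrame(a, copy=True)   # copy the data at construction time
# b = a.copy()                   # an explicit copy works the same way
a['c'] = [7, 8, 9]
print(b.columns.tolist())        # expected: ['a', 'b'] -- adding 'c' to a no longer shows up in b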
I would avoid constructing a dataframe from another dataframe like that. It is clearer to build the dataframe directly:
df=pd.DataFrame({'a': [0,1,2], 'b': [3,4,5], 'c':[6,7,8]})
The result is a dataframe with an index and column names.
You can add a column to df, like this:
df['d'] = [8,9,10]
And remove a column from the dataframe, like this:
df.drop(columns='c',inplace=True)
I would not create one dataframe from another via the constructor; if you need to build a dataframe up row by row, use 'append' instead. Append works for dictionaries and dataframes. An example of a dictionary-based append:
df = pd.DataFrame(columns=['Col1','Col2','Col3','Col4']) # create empty df with column names.
append_dict = {'Col1':value_1, 'Col2':value_2, 'Col3':value_3,'Col4':value_4}
df = df.append(append_dict, ignore_index=True)
The values can be changed in a loop, so each appended row can depend on the previous ones; a sketch of that pattern follows below. For a dataframe-based append, see the pandas documentation (just replace the append_dict argument with the dataframe you want to append).
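A minimal sketch of that loop pattern, for reference. Note that DataFrame.append was deprecated and removed in pandas 2.0, so the pd.concat equivalent is shown alongside it; the column names and values here are only placeholders:
import pandas as pd

df = pd.DataFrame(columns=['Col1', 'Col2'])            # empty frame with column names
for i in range(3):
    row = {'Col1': i, 'Col2': i * 10}                  # values computed per iteration
    # older pandas:  df = df.append(row, ignore_index=True)
    df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
print(df)
#    Col1  Col2
# 0     0     0
# 1     1    10
# 2     2    20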
Is this what you want?
I have a Dataframe with multiindex that looks like this:
a 1
2
3
b 2
3
So the outer level has values a and b, and the inner values are 1, 2, 3 for a and 2, 3 for b.
I want to make sure that the inner-level indexes are the same for every outer-level index (in this case, that means creating a new row for b with inner index 1). The values in the columns would all be null for these new rows.
Is there an easy way to do it?
You can re-index with a MultiIndex made from your original dataframe indices:
df.reindex(pd.MultiIndex.from_product(df.index.levels))
Example:
import numpy as np  # needed for the random sample data
idx = pd.MultiIndex.from_arrays([['a','a','a','b','b'],[1,2,3,2,3]])
df = pd.DataFrame(np.random.random(5), index=idx)
>>> df
0
a 1 0.354691
2 0.322138
3 0.195380
b 2 0.731177
3 0.912628
>>> df.reindex(pd.MultiIndex.from_product(df.index.levels))
0
a 1 0.354691
2 0.322138
3 0.195380
b 1 NaN
2 0.731177
3 0.912628
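If you would rather fill the newly created rows with something other than NaN, reindex also takes a fill_value argument (a small sketch building on the example above):
full_idx = pd.MultiIndex.from_product(df.index.levels)
df.reindex(full_idx, fill_value=0)   # the new (b, 1) row holds 0 instead of NaN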
I have a very simple dataframe like so:
In [8]: df
Out[8]:
A B C
0 2 a a
1 3 s 3
2 4 c !
3 1 f 1
My goal is to extract the first row so that it looks like this:
A B C
0 2 a a
As you can see the dataframe shape (1x3) is preserved and the first row still has 3 columns.
However, when I type the command df.loc[0], the output is this:
df.loc[0]
Out[9]:
A 2
B a
C a
Name: 0, dtype: object
As you can see, the row has turned into a column with 3 rows (3x1 instead of 1x3)! How is this possible? How can I simply extract the row and preserve its shape as described in my goal? Could you suggest a clean and elegant way to do it?
I tried the transpose command .T, but without success. I know I could create another dataframe whose columns are extracted from the original one, but that is quite tedious and not elegant, I would say (pd.DataFrame({'A': [2], 'B': 'a', 'C': 'a'})).
Here is the dataframe if you need it:
import pandas as pd
df = pd.DataFrame({'A':[2,3,4,1], 'B':['a','s','c','f'], 'C':['a', 3, '!', 1]})
You need to add [] to get a DataFrame back:
#select by index value
print (df.loc[[0]])
A B C
0 2 a a
Or:
print (df.iloc[[0]])
A B C
0 2 a a
If you need to transpose the Series, first convert it to a DataFrame with to_frame:
print (df.loc[0].to_frame())
0
A 2
B a
C a
print (df.loc[0].to_frame().T)
A B C
0 2 a a
Using a range selector will preserve the DataFrame format.
df.iloc[0:1]
Out[221]:
A B C
0 2 a a
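For the specific case of the first row, df.head(1) also returns a one-row DataFrame (just another option, in case it reads more clearly):
df.head(1)
#    A  B  C
# 0  2  a  a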
Suppose I have a Pandas DataFrame called df with columns a and b and what I want is the number of distinct values of b per each a. I would do:
distcounts = df.groupby('a')['b'].nunique()
which gives the desired result, but as a Series object rather than another DataFrame. I'd like a DataFrame instead. In regular SQL, I'd do:
SELECT a, COUNT(DISTINCT b) FROM df GROUP BY a
and I haven't been able to emulate this query in pandas exactly. How can I do that?
I think you need reset_index:
distcounts = df.groupby('a')['b'].nunique().reset_index()
Sample:
df = pd.DataFrame({'a': [7, 8, 8],
                   'b': [4, 5, 6]})
print (df)
a b
0 7 4
1 8 5
2 8 6
distcounts = df.groupby('a')['b'].nunique().reset_index()
print (distcounts)
a b
0 7 1
1 8 2
Another alternative is to use GroupBy.agg instead:
df.groupby('a', as_index=False).agg({'b': 'nunique'})
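On pandas 0.25 or later (an assumption about your version), named aggregation also lets you give the count a clearer column name in the same step, still getting a DataFrame back:
distcounts = df.groupby('a', as_index=False).agg(distinct_b=('b', 'nunique'))
print(distcounts)
#    a  distinct_b
# 0  7           1
# 1  8           2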
I have a dataframe with values like
A B
1 4
2 6
3 9
I need to add a new column C by adding the values from columns A and B, like this:
A B C
1 4 5
2 6 8
3 9 12
I believe this can be done using a lambda function, but I can't figure out how to do it.
Very simple:
df['C'] = df['A'] + df['B']
Building a little more on Anton's answer, you can add all the columns like this:
df['sum'] = df[list(df.columns)].sum(axis=1)
The simplest way would be to use DeepSpace's answer. However, if you really want to use an anonymous function, you can use apply:
df['C'] = df.apply(lambda row: row['A'] + row['B'], axis=1)
You could use the sum function to achieve that, as @EdChum mentioned in the comments:
df['C'] = df[['A', 'B']].sum(axis=1)
In [245]: df
Out[245]:
A B C
0 1 4 5
1 2 6 8
2 3 9 12
You could do:
df['C'] = df.sum(axis=1)
If you only want to sum the numerical values:
df['C'] = df.sum(axis=1, numeric_only=True)
The axis parameter takes either 0 or 1: axis=0 sums down each column (one result per column), while axis=1 sums across each row (one result per row), which is what we want here.
As of Pandas version 0.16.0 you can use assign as follows:
df = pd.DataFrame({"A": [1,2,3], "B": [4,6,9]})
df.assign(C = df.A + df.B)
# Out[383]:
# A B C
# 0 1 4 5
# 1 2 6 8
# 2 3 9 12
You can add multiple columns this way as follows:
df.assign(C=df.A + df.B,
          Diff=df.B - df.A,
          Mult=df.A * df.B)
# Out[379]:
# A B C Diff Mult
# 0 1 4 5 3 4
# 1 2 6 8 4 12
# 2 3 9 12 6 27
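A related point, assuming a reasonably recent pandas (0.23 or later): if you pass callables to assign, a later column can refer to one created earlier in the same call:
df.assign(C=lambda d: d.A + d.B,
          D=lambda d: d.C * 2)    # D is computed from the freshly created C
#    A  B   C   D
# 0  1  4   5  10
# 1  2  6   8  16
# 2  3  9  12  24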
Concerning n00b's comment: "I get the following warning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead"
I was getting the same error. In my case it was because I was trying to perform the column addition on a dataframe that was created like this:
df_b = df[['colA', 'colB', 'colC']]
instead of:
df_c = pd.DataFrame(df, columns=['colA', 'colB', 'colC'])
df_b is a copy of a slice from df
df_c is a new dataframe. So
df_c['colD'] = df['colA'] + df['colB']+ df['colC']
will add the columns and won't raise any warning. Same if .sum(axis=1) is used.
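Another common way around that warning (just a sketch of the usual idiom) is to take an explicit copy when slicing, so the new frame no longer shares data with df:
df_b = df[['colA', 'colB', 'colC']].copy()
df_b['colD'] = df_b['colA'] + df_b['colB'] + df_b['colC']   # no SettingWithCopyWarning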
I wanted to add a comment responding to the error message n00b was getting but I don't have enough reputation. So my comment is an answer in case it helps anyone...
n00b said:
I get the following warning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
He got this error because whatever manipulations he did to his dataframe prior to creating df['C'] created a view into the dataframe rather than a copy of it. The error didn't arise from the simple calculation df['C'] = df['A'] + df['B'] suggested by DeepSpace.
Have a look at the Returning a view versus a copy docs.
You can do it using loc:
In [37]: df = pd.DataFrame({"A":[1,2,3],"B":[4,6,9]})
In [38]: df
Out[38]:
A B
0 1 4
1 2 6
2 3 9
In [39]: df['C']=df.loc[:,['A','B']].sum(axis=1)
In [40]: df
Out[40]:
A B C
0 1 4 5
1 2 6 8
2 3 9 12
eval lets you sum and create columns right away:
In [8]: df.eval('C = A + B', inplace=True)
In [9]: df
Out[9]:
A B C
0 1 4 5
1 2 6 8
2 3 9 12
Since inplace=True you don't need to assign it back to df.
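If you prefer not to modify df in place, eval works without inplace too; the result just needs to be assigned back (a small sketch):
df = df.eval('C = A + B')   # returns a new frame with the extra column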
You can solve it simply by adding:
df['C'] = df['A'] + df['B']
I have 2 data frames with one column each. The index of the first is [C,B,F,A,Z], not sorted in any way. The index of the second is [C,B,Z], also unsorted.
I use pd.concat([df1,df2],axis=1) and get a data frame with 2 columns and NaN in the second column where there is no appropriate value for the index.
The problem I have is that index automatically becomes sorted in alphabetical order.
I have tried pd.concat([df1, df2], axis=1, names=my_list) where my_list = [C,B,F,A,Z], but that didn't change anything.
How can I keep the index unsorted?
This seems to be by design; the only thing I'd suggest is to call reindex on the concatenated df and pass it the index of df:
In [56]:
import numpy as np  # for the sample data
df = pd.DataFrame(index=['C','B','F','A','Z'], data={'a': np.arange(5)})
df
Out[56]:
a
C 0
B 1
F 2
A 3
Z 4
In [58]:
df1 = pd.DataFrame(index=['C','B','Z'], data={'b':np.random.randn(3)})
df1
Out[58]:
b
C -0.146799
B -0.227027
Z -0.429725
In [67]:
pd.concat([df,df1],axis=1).reindex(df.index)
Out[67]:
a b
C 0 -0.146799
B 1 -0.227027
F 2 NaN
A 3 NaN
Z 4 -0.429725
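On pandas 0.23 and later, pd.concat also accepts a sort argument for the non-concatenation axis; a hedged sketch, since behaviour varies across versions (if it doesn't keep the order you want, the reindex approach above still works):
pd.concat([df, df1], axis=1, sort=False)   # keeps the index in order of appearance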