pandas: problem using diff with groupby when index is non-unique - python

I am using pandas (version 0.20.3) and I want to apply the diff() method with groupby() but instead of a DataFrame, the result is an "underscore".
Here is the code:
import numpy as np
import pandas as pd
# creating the DataFrame
data = np.random.random(18).reshape(6,3)
indexes = ['B']*3 + ['A']*3
columns = ['x', 'y', 'z']
df = pd.DataFrame(data, index=indexes, columns=columns)
df.index.name = 'chain_id'
# Now I want to apply the diff method within each chain_id group
df.groupby('chain_id').diff()
And the result is an underscore!
Note that df.loc['A'].diff() and df.loc['B'].diff() do return the expected results so I don't understand why it wouldn't work with groupby().

IIUC, the error behind this is "cannot reindex from a duplicate axis". Reset the index before grouping, then restore it afterwards:
df.reset_index().groupby('chain_id').diff().set_index(df.index)
Out[859]:
                 x         y         z
chain_id
B              NaN       NaN       NaN
B        -0.468771  0.192558 -0.443570
B         0.323697  0.288441  0.441060
A              NaN       NaN       NaN
A        -0.198785  0.056766  0.081513
A         0.138780  0.563841  0.635097
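To check that this workaround gives the same numbers as diffing each chain separately, here is a minimal sketch. The seeded generator and the variable names result and expected are my own additions for reproducibility, not part of the question:

```python
import numpy as np
import pandas as pd

# Same layout as in the question, but seeded so the run is repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((6, 3)),
                  index=['B'] * 3 + ['A'] * 3,
                  columns=['x', 'y', 'z'])
df.index.name = 'chain_id'

# Move the duplicated labels into a column, diff per group,
# then put the original (non-unique) index back.
result = df.reset_index().groupby('chain_id').diff().set_index(df.index)

# Reference: diff each chain on its own and stack in the original order.
expected = pd.concat([df.loc['B'].diff(), df.loc['A'].diff()])

print(np.allclose(result.to_numpy(), expected.to_numpy(), equal_nan=True))
```

The first row of each chain is NaN in both versions, since there is no previous row within the group to subtract.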

Related

How to insert Pandas dataframe into another Pandas dataframe without wrapping it in a Series?

import pandas as pd
df1 = pd.DataFrame([{'a':1,'b':2}])
df2 = pd.DataFrame([{'x':100}])
df2['y'] = [df1]
# df2.loc[:, 'y'].shape == (1,)
# type(df2.iloc[:, 1][0]) is pandas.core.frame.DataFrame
I want to make a df a column in an existing row. However Pandas wraps this df in a Series object so that I cannot access it with dot notation such as df2.y.a to get the value 1. Is there a way to make this not occur or is there some constraint on object type for df elements such that this is impossible?
The desired output is a df like:
x y
0 100 a b
0 1 2
and type(df2.y) == pd.DataFrame
You can combine two DataFrame objects along the columns axis, which I think achieves what you're trying to do. Let me know if this is what you're looking for.
import pandas as pd
df1 = pd.DataFrame([{'a':1,'b':2}])
df2 = pd.DataFrame([{'x':100}])
combined_df = pd.concat([df1, df2], axis=1)
print(combined_df)
a b x
0 1 2 100
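If the goal is specifically for selecting y to give back a DataFrame, hierarchical columns come closer than storing an object inside a cell. This is only a sketch of that idea, assuming it is acceptable for x to sit under a top-level label as well; the labels 'main' and 'y' are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame([{'a': 1, 'b': 2}])
df2 = pd.DataFrame([{'x': 100}])

# keys= adds an outer column level, so the result has MultiIndex columns:
# ('main', 'x'), ('y', 'a'), ('y', 'b').
# 'main' and 'y' are hypothetical labels chosen for this example.
combined = pd.concat([df2, df1], axis=1, keys=['main', 'y'])

sub = combined['y']                 # a DataFrame with columns a and b
value = combined['y']['a'].iloc[0]  # the scalar stored under y / a

print(combined)
print(type(sub).__name__, value)
```

This keeps everything in one regular frame, so the usual column operations still work on the a and b values.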

Merging two columns in a pandas DataFrame

Given the following DataFrame:
A B
0 -10.0 NaN
1 NaN 20.0
2 -30.0 NaN
I want to merge columns A and B, filling the NaN cells in column A with the values from column B and then drop column B, resulting in a DataFrame like this:
A
0 -10.0
1 20.0
2 -30.0
I have managed to solve this problem by using the iterrows() function.
Complete code example:
import numpy as np
import pandas as pd
example_data = [[-10, np.nan], [np.nan, 20], [-30, np.nan]]
example_df = pd.DataFrame(example_data, columns = ['A', 'B'])
for index, row in example_df.iterrows():
    if pd.isnull(row['A']):
        row['A'] = row['B']
example_df = example_df.drop(columns = ['B'])
example_df
This seems to work fine, but I find this information in the documentation for iterrows():
You should never modify something you are iterating over.
So it seems like I'm doing it wrong.
What would be a better/recommended approach for achieving the same result?
Use Series.fillna with Series.to_frame:
df = df['A'].fillna(df['B']).to_frame()
#alternative
#df = df['A'].combine_first(df['B']).to_frame()
print (df)
A
0 -10.0
1 20.0
2 -30.0
If there are more columns and you need the first non-missing value in each row, back-fill along the columns and then select the first column, using a one-element list so the result stays a one-column DataFrame:
df = df.bfill(axis=1).iloc[:, [0]]
print (df)
A
0 -10.0
1 20.0
2 -30.0
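For comparison, here are the three variants side by side on the data from the question. This is just a sketch; the names via_fillna, via_combine and via_bfill are mine:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [-10, np.nan, -30], 'B': [np.nan, 20, np.nan]})

# Fill the gaps in A with the values of B at the same index labels.
via_fillna = df['A'].fillna(df['B']).to_frame()

# combine_first does the same kind of label-aligned fill.
via_combine = df['A'].combine_first(df['B']).to_frame()

# Back-fill across columns, then keep only the first column as a DataFrame.
via_bfill = df.bfill(axis=1).iloc[:, [0]]

print(via_fillna)
print(via_combine)
print(via_bfill)
```

None of these loops over rows, so the iterrows() caveat from the documentation does not come into play.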

pandas removes rows with nan in df

I have a df that contains nan,
A
nan
nan
nan
nan
2017
2018
I tried to remove all the nan rows in df,
df = df.loc[df['A'].notnull()]
but df still contains those nan values for column 'A' after the above code. The dtype of 'A' is object.
I am wondering how to fix it. The thing is I need to define multiple conditions to filter the df, and df['A'].notnull() is one of them. Don't know why it doesn't work.
Please provide a reproducible example. As such this works:
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan], [np.nan], [2017], [2018]], columns=['A'])
df = df[df['A'].notnull()]
df2 = pd.DataFrame([['nan'], ['nan'], [2017], [2018]], columns=['A'])
df2 = df2.replace('nan', np.nan)
df2 = df2[df2['A'].notnull()]
# output [df or df2]
# A
# 2017.0
# 2018.0
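Since the question mentions that the dtype of 'A' is object, one plausible cause (an assumption on my part, the question does not show the raw data) is that the column holds the string 'nan' rather than a real missing value. A short sketch of how that looks:

```python
import numpy as np
import pandas as pd

# Assumed situation: the "nan" entries are strings, not missing values.
df = pd.DataFrame({'A': ['nan', 'nan', 2017, 2018]})

# notnull() sees the string 'nan' as an ordinary value, so nothing is filtered.
mask_before = df['A'].notnull()
print(mask_before.tolist())

# Turn the strings into real NaN first, then filter.
cleaned = df.replace('nan', np.nan)
filtered = cleaned[cleaned['A'].notnull()]
print(filtered)
```

Once the values are real NaN, df['A'].notnull() can be combined with other boolean conditions using & as usual.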

Include empty series when creating a pandas dataframe with .concat

UPDATE: This is no longer an issue since at least pandas version 0.18.1. Concatenating empty series doesn't drop them anymore so this question is out of date.
I want to create a pandas dataframe from a list of series using .concat. The problem is that when one of the series is empty, it doesn't get included in the resulting dataframe, so the dataframe has the wrong dimensions when I then try to rename its columns with a multi-index.
UPDATE: Here's an example...
import pandas as pd
sers1 = pd.Series()
sers2 = pd.Series(['a', 'b', 'c'])
df1 = pd.concat([sers1, sers2], axis=1)
This produces the following dataframe:
>>> df1
0 a
1 b
2 c
dtype: object
But I want it to produce something like this:
>>> df2
0 1
0 NaN a
1 NaN b
2 NaN c
It does this if I put a single nan value anywhere in sers1, but it seems like this should be possible automatically even if some of my series are totally empty.
Passing an argument for levels will do the trick. Here's an example. First, the wrong way:
import pandas as pd
ser1 = pd.Series()
ser2 = pd.Series([1, 2, 3])
list_of_series = [ser1, ser2, ser1]
df = pd.concat(list_of_series, axis=1)
Which produces this:
>>> df
0
0 1
1 2
2 3
But if we add some labels to the levels argument, it will include all the empty series too:
import pandas as pd
ser1 = pd.Series()
ser2 = pd.Series([1, 2, 3])
list_of_series = [ser1, ser2, ser1]
labels = range(len(list_of_series))
df = pd.concat(list_of_series, levels=labels, axis=1)
Which produces the desired dataframe:
>>> df
0 1 2
0 NaN 1 NaN
1 NaN 2 NaN
2 NaN 3 NaN
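For what it's worth, the update in the question can be checked directly. A small sketch, assuming a reasonably recent pandas; the explicit dtype=float on the empty series is my addition (some versions warn about the default dtype of a bare pd.Series()):

```python
import pandas as pd

empty = pd.Series(dtype=float)        # empty series with an explicit dtype
letters = pd.Series(['a', 'b', 'c'])

# On recent versions the empty series is kept as an all-NaN column.
df = pd.concat([empty, letters], axis=1)

print(df)
print(df.shape)
```

If the shape comes out with two columns, no workaround is needed before renaming the columns.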

Pass a pd.Series to a dataframe?

I tried the following code but the new column consists of only NaN values.
df['new'] = pd.Series(np.repeat(1, len(df)))
Can someone explain to me what the problem is here?
It is possible that the index of the DataFrame df does not match the index of the newly created Series. For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [11, 22, 33, 44, 55]}, index=['r1','r2','r3','r4','r5'])
df['new'] = pd.Series(np.repeat(1, len(df)))
print(df)
and the output will be:
     a  new
r1  11  NaN
r2  22  NaN
r3  33  NaN
r4  44  NaN
r5  55  NaN
since the index of pd.Series(np.repeat(1, len(df))) is Int64Index([0, 1, 2, 3, 4], dtype='int64').
To prevent that, specify the index argument when creating the Series:
df['new'] = pd.Series(np.repeat(1, len(df)), index=df.index)
Alternatively, you can just pass a numpy array if the index is to be ignored:
df['new'] = np.repeat(1, len(df))
without needing to create a Series (in fact, df['new'] = 1 will do for this case). Using a Series is helpful when you need to align the new column with the existing DataFrame using the index.
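To make the alignment behaviour concrete, here is a small sketch that puts the cases next to each other. The column names misaligned, aligned, from_array and by_label are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [11, 22, 33]}, index=['r1', 'r2', 'r3'])

# Default integer index (0, 1, 2) shares no labels with r1..r3: all NaN.
df['misaligned'] = pd.Series(np.repeat(1, len(df)))

# Same index as df: values line up.
df['aligned'] = pd.Series(np.repeat(1, len(df)), index=df.index)

# A plain array is assigned by position, no index involved.
df['from_array'] = np.repeat(1, len(df))

# Where a Series helps: labels are matched even if the order differs.
df['by_label'] = pd.Series([3, 1, 2], index=['r3', 'r1', 'r2'])

print(df)
```

The last column shows the case where a Series is the right tool: each value ends up in the row with the matching label, regardless of the order it was given in.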
