I have a Series and a DataFrame:
s = pd.Series([1,2,3,5])
df = pd.DataFrame()
When I add columns to df like this
df.loc[:, "0-2"] = s.iloc[0:3]
df.loc[:, "1-3"] = s.iloc[1:4]
I get df
0-2 1-3
0 1 NaN
1 2 2.0
2 3 3.0
Why am I getting NaN? I tried creating a new Series with the correct indices, but adding it to df still produces NaN.
What I want is
0-2 1-3
0 1 2
1 2 3
2 3 5
Try either of the following lines.
df.loc[:, "1-3"] = s.iloc[1:4].values
# -OR-
df.loc[:, "1-3"] = s.iloc[1:4].reset_index(drop=True)
Your original code is trying, unsuccessfully, to align the index of the DataFrame df with the index of the subset Series s.iloc[1:4]. When it can't find index 0 in the Series, it places a NaN value in df at that location. You can get around this either by keeping only the values, so no index alignment is attempted, or by resetting the index on the subset Series.
>>> s.iloc[1:4]
1 2
2 3
3 5
dtype: int64
Notice the index values compared to the original, unsubset series, which is the following.
>>> s
0 1
1 2
2 3
3 5
dtype: int64
The index of the first row in df is 0. By dropping the index with the .values call, you bypass the index alignment that is producing the NaN. By resetting the index in the second option, you make the indices match.
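Putting it together, a minimal sketch that reproduces the desired frame (same data as the question, with .values used for the second assignment):
import pandas as pd

s = pd.Series([1, 2, 3, 5])
df = pd.DataFrame()

df.loc[:, "0-2"] = s.iloc[0:3]          # indices 0, 1, 2 line up with df
df.loc[:, "1-3"] = s.iloc[1:4].values   # plain values, so no index alignment

print(df)
#    0-2  1-3
# 0    1    2
# 1    2    3
# 2    3    5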
Related
I have a dataframe, and I would like to select a subset of the dataframe using both index and column values. I can do both of these separately, but cannot figure out the syntax to do them simultaneously. Example:
import pandas as pd
# sample dataframe:
cid=[1,2,3,4,5,6,17,18,91,104]
c1=[1,2,3,1,2,3,3,4,1,3]
c2=[0,0,0,0,1,1,1,1,0,1]
df=pd.DataFrame(list(zip(c1,c2)),columns=['col1','col2'],index=cid)
df
Returns:
col1 col2
1 1 0
2 2 0
3 3 0
4 1 0
5 2 1
6 3 1
17 3 1
18 4 1
91 1 0
104 3 1
Using .loc, I can collect by index:
rel_index=[5,6,17]
relc1=[2,3]
relc2=[1]
df.loc[rel_index]
Returns:
col1 col2
5 2 1
6 3 1
17 3 1
Or I can select by column values:
df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]
Returning:
col1 col2
5 2 1
6 3 1
17 3 1
104 3 1
However, I cannot do both. When I try the following:
df.loc[rel_index,df['col1'].isin(relc1) & df['col2'].isin(relc2)]
Returns:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
I have tried a few other variations (such as "&" instead of the ","), but these return the same or other errors.
Once I collect this slice, I am hoping to reassign values on the main dataframe. I imagine this will be trivial once the above is done, but I note it here in case it is not. My goal is to assign something like df2 in the following:
c3=[1,2,3]
c4=[5,6,7]
df2=pd.DataFrame(list(zip(c3,c4)),columns=['col1','col2'],index=rel_index)
to the slice referenced by index and multiple column conditions (overwriting what was in the original dataframe).
The reason for the IndexingError is that you're calling df.loc with two indexers of different sizes.
rel_index has a length of 3, whereas df['col1'].isin(relc1) has a length of 10.
You need the index condition to also have a length of 10. If you look at the output of df['col1'].isin(relc1), it is a boolean Series with one entry per row of df.
You can build a boolean array of the proper length by replacing df.loc[rel_index] with df.index.isin([5,6,17]),
so you end up with:
df.loc[df.index.isin([5,6,17]) & df['col1'].isin(relc1) & df['col2'].isin(relc2)]
which returns:
col1 col2
5 2 1
6 3 1
17 3 1
That said, I'm not sure why your index would ever look like this. Typically, when slicing by index position you would use df.iloc, and your index would follow the 0, 1, 2, ... format.
Alternatively, you could first search by value - then assign the resulting dataframe to a new variable df2
df2 = df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]
then df2.loc[rel_index] would work without issue.
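A quick sketch of that two-step approach (same variables as above; filter by value first, then select the surviving labels):
df2 = df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]
df2.loc[rel_index]
# col1 col2
# 5 2 1
# 6 3 1
# 17 3 1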
As for your overall goal, you can simply do the following:
c3=[1,2,3]
c4=[5,6,7]
df2=pd.DataFrame(list(zip(c3,c4)),columns=['col1','col2'],index=rel_index)
df.loc[df.index.isin([5,6,17]) & df['col1'].isin(relc1) & df['col2'].isin(relc2)] = df2
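For reference, a small check of the result (assuming the usual index/column alignment when assigning a DataFrame through .loc): the three targeted rows should now hold the values from df2.
df.loc[rel_index]
# col1 col2
# 5 1 5
# 6 2 6
# 17 3 7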
@Rexovas explains it quite well; this is an alternative where you can compute the filters on the index before assigning. It is a bit long and involves a MultiIndex, but once you get your head around MultiIndex it should be intuitive:
(df
# move columns into the index
.set_index(['col1', 'col2'], append = True)
# filter based on the index
.loc(axis = 0)[rel_index, relc1, relc2]
# return cols 1 and 2
.reset_index(level = [-2, -1])
# assign values
.assign(col1 = c3, col2 = c4)
)
col1 col2
5 1 5
6 2 6
17 3 7
I can't figure out how DataFrame.groupby() works.
Specifically, given the following dataframe:
df = pd.DataFrame([['usera',1,100],['usera',5,130],['userc',1,100],['userd',5,100]])
df.columns = ['id','date','sum']
id date sum
0 usera 1 100
1 usera 5 130
2 userc 1 100
3 userd 5 100
Running the code below returns:
df['shift'] = df['date']-df.groupby(['id'])['date'].shift(1)
id date sum shift
0 usera 1 100 NaN
1 usera 5 130 4.0
2 userc 1 100 NaN
3 userd 5 100 NaN
How did Python know that I meant for it to match by the id column?
It doesn't even appear in df['date'].
Let us dissect the command df['shift'] = df['date']-df.groupby(['id'])['date'].shift(1).
Assigning to df['shift'] appends a new column "shift" to the dataframe.
df['date'] returns Series using date column from the dataframe.
0 1
1 5
2 1
3 5
Name: date, dtype: int64
df.groupby(['id'])['date'].shift(1): groupby(['id']) creates a groupby object.
From that groupby object, we select the date column and shift each value down by one position within its group using shift(1). By the way, this is also a Series.
df.groupby(['id'])['date'].shift(1)
0 NaN
1 1.0
2 NaN
3 NaN
Name: date, dtype: float64
The Series obtained in step 3 is subtracted element-wise from the Series obtained in step 2. The two Series align on df's original index, which the groupby shift preserves, so the id column plays no role in the subtraction itself, only in how the shift was computed. The result is assigned to the df['shift'] column.
df['date']-df.groupby(['id'])['date'].shift(1)
0 NaN
1 4.0
2 NaN
3 NaN
Name: date, dtype: float64
I am not exactly sure what you are trying to do, but the groupby() method is useful when a column contains repeated values (like your usera) and you want to compute, for example, the sum(), mean(), max(), etc. of all columns or of one specific column per group.
e.g. df.groupby(['id'])['sum'].sum() groups the rows by id, selects just the sum column, and sums it per group. For usera that is 230; with .mean() it would be 115, and so on. It does the same for every other unique id in your id column. In the example above it outputs one column with just three rows (usera, userc and userd).
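For illustration, a minimal sketch of that aggregation on the question's data (values computed from the rows shown above):
df.groupby(['id'])['sum'].sum()
# id
# usera    230
# userc    100
# userd    100
# Name: sum, dtype: int64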
I have a number of large dataframes in a list. I concatenate all of them to produce a single large dataframe.
df_list # This contains a list of dataframes
result = pd.concat(df_list, axis=0)
result.columns.duplicated().any() # This returns True
My expectation was that pd.concat will not produce duplicate columns.
I want to understand when it could result in duplicate columns so that I can debug the source.
I could not reproduce the problem with a toy dataset.
I have verified that the input data frames have unique columns by running df.columns.duplicated().any().
The pandas version used is 1.0.1.
(Pdb) p result_data[0].columns.duplicated().any()
False
(Pdb) p result_data[1].columns.duplicated().any()
False
(Pdb) p result_data[2].columns.duplicated().any()
False
(Pdb) p result_data[3].columns.duplicated().any()
False
(Pdb) p pd.concat(result_data[0:4]).columns.duplicated().any()
True
Check the below behaviour:
In [452]: df1 = pd.DataFrame({'A':[1,2,3], 'B':[2,3,4]})
In [468]: df2 = pd.DataFrame({'A':[1,2,3], 'B':[2,4,5]})
In [460]: df_list = [df1,df2]
This concats and keeps duplicate columns:
In [463]: pd.concat(df_list, axis=1)
Out[474]:
A B A B
0 1 2 1 2
1 2 3 2 4
2 3 4 3 5
pd.concat always concatenates the dataframes as is. It does not drop duplicate columns at all.
If you concatenate without specifying an axis (axis=0 is the default), it will append one dataframe below another in the same columns.
So you can end up with duplicate rows, but not duplicate columns.
In [477]: pd.concat(df_list)
Out[477]:
A B
0 1 2 ## duplicate row
1 2 3
2 3 4
0 1 2 ## duplicate row
1 2 4
2 3 5
You can remove these duplicate rows by using drop_duplicates():
In [478]: pd.concat(df_list).drop_duplicates()
Out[478]:
A B
0 1 2
1 2 3
2 3 4
1 2 4
2 3 5
Update after OP's comment:
In [507]: df_list[0].columns.duplicated().any()
Out[507]: False
In [508]: df_list[1].columns.duplicated().any()
Out[508]: False
In [510]: pd.concat(df_list[0:2]).columns.duplicated().any()
Out[510]: False
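If duplicate column names do appear in a concatenated result (for example when frames concatenated along axis=1 share column names), a common idiom, shown here only as a sketch, is to keep the first occurrence of each name:
result = pd.concat(df_list, axis=1)
result = result.loc[:, ~result.columns.duplicated()]
result.columns.duplicated().any()  # now False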
I have the same issue when I get data from IEXCloud. I used IEXFinance functions to grab different data sets, which are all supposed to return dataframes. I then use concat to join the dataframes. It looks to have repeated the first column (symbols) into column 97, while the data in columns 96 and 98 were from the second dataframe. There are no duplicate columns in df1 or df2, and I can't see any logical reason for duplicating it there. df2 has 70 columns. I suspect some of what was returned as a 'dataframe' is something else, but this doesn't explain the seemingly random position at which the concat function chooses to duplicate the first column of the first df!
I am testing dataframes for equality.
df_diff=(df1!=df2)
I get df_diff, which is the same shape as df1 and df2, and contains boolean True/False values.
Now I would like to keep only the columns and rows of df1 where there was at least one differing value.
If I simply do
df1=[df_diff.values]
I get all the rows where there was at least one True in df_diff, but it also keeps lots of columns that originally contained only False.
As a second step, I would then like to be able to replace all the values (element-wise in the dataframe) which were equal (where df_diff==False) with NaN.
example:
df1=pd.DataFrame(data=[[1,2,3],[4,5,6],[7,8,9]])
df2=pd.DataFrame(data=[[1,99,3],[4,5,99],[7,8,9]])
I would like to get from df1
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
to
1 2
0 2 NaN
1 NaN 6
I think you need DataFrame.any to check for at least one True per row or per column:
df = df_diff[df_diff.any(axis=1)]
It is possible to filter both of the original dataframes like so:
df11 = df1[df_diff.any(axis=1)]
df22 = df2[df_diff.any(axis=1)]
If you want to filter both the rows and the columns:
df = df_diff.loc[df_diff.any(axis=1), df_diff.any()]
EDIT: Filter df1 and add NaNs using where:
df_diff=(df1!=df2)
m1 = df_diff.any(axis=1)
m2 = df_diff.any()
out = df1.loc[m1, m2].where(df_diff.loc[m1, m2])
print (out)
1 2
0 2.0 NaN
1 NaN 6.0
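If you only want the element-wise replacement from the question's second step, without dropping any rows or columns, where on its own is enough; a minimal sketch:
df1.where(df_diff)
#      0    1    2
# 0  NaN  2.0  NaN
# 1  NaN  NaN  6.0
# 2  NaN  NaN  NaN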
I am trying to do a division of column 0 by columns 1 and 2. From the below, I would like to return a dataframe of 10 rows, 3 columns. The first column should all be 1's. Instead I get a 10x10 dataframe. What am I doing wrong?
data = np.random.randn(10,3)
df = pd.DataFrame(data)
df[0] / df
First you should create a 10 by 3 DataFrame with all columns equal to the first column, and then divide it by your DataFrame.
df[[0, 0, 0]] / df.values
or
df[[0, 0, 0]].values / df
Use the second option if you want to keep the column names.
(I use .values to avoid index alignment, which would fail because of the duplicate column labels.)
You need to align the Series with the rows of the DataFrame: dividing a Series by a DataFrame matches the Series' index against the DataFrame's columns, which is why df[0] / df gives you a 10x10 frame. There are a few ways to handle this, but I like to use transposes.
data = np.random.randn(10,3)
df = pd.DataFrame(data)
(df[0] / df.T).T
0 1 2
0 1 -0.568096 -0.248052
1 1 -0.792876 -3.539075
2 1 -25.452247 1.434969
3 1 -0.685193 -0.540092
4 1 0.451879 -0.217639
5 1 -2.691260 -3.208036
6 1 0.351231 -1.467990
7 1 0.249589 -0.714330
8 1 0.033477 -0.004391
9 1 -0.958395 -1.530424
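As a side note (not from the answers above, just standard pandas), DataFrame.div with axis=0 aligns the Series on the row index and avoids the double transpose:
df.div(df[0], axis=0)
# same 10x3 result; column 0 becomes all 1's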