I have a DataFrame (df1) of dimension 2000 rows x 500 columns (excluding the index), and I want to divide each of its rows by another DataFrame (df2) of dimension 1 row x 500 columns. Both have the same column headers. I tried:
df1.divide(df2) and
df1.divide(df2, axis='index') and multiple other solutions, and I always get a DataFrame with NaN values in every cell. What argument am I missing in df.divide?
In df1.divide(df2, axis='index'), you need to provide the row of df2 to divide by (e.g. df2.iloc[0]) rather than the whole DataFrame, so that pandas aligns on the column labels instead of the row index.
import pandas as pd

data1 = {"a": [1., 3., 5., 2.],
         "b": [4., 8., 3., 7.],
         "c": [5., 45., 67., 34.]}
data2 = {"a": [4.],
         "b": [2.],
         "c": [11.]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df1.div(df2.iloc[0], axis='columns')
Or you can use df1 / df2.values[0, :], which drops to NumPy and divides positionally.
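For reference, with the data above both forms should print something like:
      a    b         c
0  0.25  2.0  0.454545
1  0.75  4.0  4.090909
2  1.25  1.5  6.090909
3  0.50  3.5  3.090909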
You can divide by the Series, i.e. the first row of df2:
In [11]: df = pd.DataFrame([[1., 2.], [3., 4.]], columns=['A', 'B'])
In [12]: df2 = pd.DataFrame([[5., 10.]], columns=['A', 'B'])
In [13]: df.div(df2)
Out[13]:
     A    B
0  0.2  0.2
1  NaN  NaN

In [14]: df.div(df2.iloc[0])
Out[14]:
     A    B
0  0.2  0.2
1  0.6  0.4
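The reason the second form works is that df2.iloc[0] is a Series indexed by the column labels, so the division aligns column-by-column and broadcasts down the rows:

In [15]: df2.iloc[0]
Out[15]:
A     5.0
B    10.0
Name: 0, dtype: float64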
A small clarification, just in case: the reason you got NaN everywhere, while Andy's first example (df.div(df2)) works for the first row, is that div tries to match indexes (and columns). In Andy's example, index 0 is found in both DataFrames, so that row is divided; index 1 is not, so a row of NaN is produced. This behavior becomes even more obvious if you run the following (only the 't' row is divided):
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.random.rand(3, 5), index=['x', 'y', 't'])
df_b = pd.DataFrame(np.random.rand(2, 5), index=['z', 't'])
df_a.div(df_b)
So in your case, the index of the only row of df2 was apparently not present in df1. "Luckily", the column headers are the same in both DataFrames, so when you slice out the first row you get a Series whose index consists of the column headers of df2. This is what allows the division to take place properly.
For a case with partial matching on both index and columns:
df_a = pd.DataFrame(np.random.rand(3, 5), index=['x', 'y', 't'], columns=range(5))
df_b = pd.DataFrame(np.random.rand(2, 5), index=['z', 't'], columns=[1, 2, 3, 4, 5])
df_a.div(df_b)
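To make the alignment concrete without random data, here is a small deterministic sketch (the constant values are my own choice): only the cells whose index label and column label exist in both frames get a result; the rest of the union-shaped output is NaN.

import numpy as np
import pandas as pd

# fixed values instead of np.random.rand, so the effect is reproducible
df_a = pd.DataFrame(np.ones((3, 5)), index=['x', 'y', 't'], columns=range(5))
df_b = pd.DataFrame(np.full((2, 5), 2.0), index=['z', 't'], columns=[1, 2, 3, 4, 5])

# only row 't' and columns 1-4 exist in both frames, so only those
# four cells hold 1.0 / 2.0 = 0.5; everything else is NaN
print(df_a.div(df_b))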
If you want to divide each row of a column by a specific value, you could try:
df['column_name'] = df['column_name'].div(10000)
This divides each value in 'column_name' by 10,000.
To divide a single row (across one or more columns) by a value:
df.loc['index_value'] = df.loc['index_value'].div(10000)
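For instance (the index label 'r1' is just a stand-in for your own index value):

import pandas as pd

df = pd.DataFrame({'a': [10000., 20000.], 'b': [30000., 40000.]},
                  index=['r1', 'r2'])
df.loc['r1'] = df.loc['r1'].div(10000)  # row 'r1' becomes a=1.0, b=3.0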
I have a dataset, df, with some empty values in its second column, col2. So I created a new table with the same column names, whose length equals the number of missing values in col2 of df. I call this new DataFrame df2.
df[df['col2'].isna()] = df2
But this returns NaN for the entire rows where col2 was missing, which means that df[df['col2'].isna()] is now missing everywhere, not only in col2.
Why is that, and how can I fix it?
Assuming that by df2 you really meant a Series, renamed here as s:
df.loc[df['col2'].isna(), 'col2'] = s.values
Example
nan = float('nan')
df = pd.DataFrame({'col1': [1,2,3], 'col2': [nan, 0, nan]})
s = pd.Series([10, 11])
df.loc[df['col2'].isna(), 'col2'] = s.values
>>> df
   col1  col2
0     1  10.0
1     2   0.0
2     3  11.0
Note
I don't like this, because it relies on the number of NaNs in df matching the length of s. It would be better to know how you create the missing values; with that information, we could probably propose a better and more robust solution.
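For example, if the replacement values lived in a DataFrame indexed by the same labels as the rows of df that are missing col2 (an assumption on my part, since we don't know how df2 was built), fillna would align them safely:

import pandas as pd

nan = float('nan')
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [nan, 0, nan]})

# hypothetical replacement frame, indexed by the labels of the
# rows of df whose col2 is missing (here 0 and 2)
df2 = pd.DataFrame({'col2': [10, 11]}, index=[0, 2])

# fillna aligns on the index, so the count and order of the
# replacements no longer have to match by hand
df['col2'] = df['col2'].fillna(df2['col2'])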
Given a pandas DataFrame, how would I delete all rows that lie between two rows that have the same values in two specific columns? In my case I have columns x, y and id. If an x-y pair appears twice in the DataFrame, I would like to delete all rows that lie between those two occurrences.
Example:
import pandas as pd
df1 = pd.DataFrame({'x':  [1, 2, 3, 2, 1, 3, 4],
                    'y':  [1, 2, 3, 4, 3, 3, 4],   # the pair (3, 3) appears at id=3 and id=6
                    'id': [1, 2, 3, 4, 5, 6, 7]})
As you can see, the value pair x=3, y=3 appears twice in the DataFrame, once at id=3 and once at id=6.
How can I spot these rows and drop all the rows in between, so that I get, for example:
df1 = pd.DataFrame({'x':  [1, 2, 3, 4],
                    'y':  [1, 2, 3, 4],
                    'id': [1, 2, 3, 7]})
The DataFrame could also contain more "duplicates", as with the (4, 2) pair in my next example. I want to spot the outermost duplicate pair, so that deleting the rows in between it also eliminates all other pairs that appear twice or more. For example:
df1 = pd.DataFrame({'x':  [1, 2, 3, 4, 1, 4, 3, 4],
                    'y':  [1, 2, 3, 2, 3, 2, 3, 4],
                    'id': [1, 2, 3, 4, 5, 6, 7, 8]})
# duplicated pairs: (3, 3) at id=3 and id=7 (outer), (4, 2) at id=4 and id=6 (inner)
# should become:
df1 = pd.DataFrame({'x':  [1, 2, 3, 4],
                    'y':  [1, 2, 3, 4],
                    'id': [1, 2, 3, 8]})
For my use case this amounts to a kind of loop elimination in the graph that the DataFrame represents. How would I implement that?
One possible solution:
Let's start by creating your DataFrame (I've omitted the required import here; it appears in the full script below):
d = {'id': [1,2,3,4,5,6,7,8], 'x': [1,2,3,4,1,4,3,4], 'y': [1,2,3,2,3,2,3,4]}
df = pd.DataFrame(data=d)
Note that the index values are consecutive numbers (starting from 0), which will be used later.
Then we have to find duplicated rows, marking all instances (keep=False):
dups = df[df.duplicated(subset=['x', 'y'], keep=False)]
These duplicates are then grouped on x and y:
gr = dups.groupby(['x', 'y'])
Then the number of the group to which each row belongs is added to df as a grpNo column (rows outside any duplicate group get NaN):
df['grpNo'] = gr.ngroup()
The next step is to find the first and last index of the rows that fall in the first group (group number 0), and save them in ind1 and ind2:
ind1 = df[df['grpNo'] == 0].index[0]
ind2 = df[df['grpNo'] == 0].index[-1]
Then we find a list of index values to be deleted:
indToDel = df[(df.index > ind1) & (df.index <= ind2)].index
To actually delete the rows, execute:
df.drop(indToDel, inplace=True)
The last step is to delete the grpNo column, which is no longer needed:
df.drop('grpNo', axis=1, inplace=True)
The result is:
   id  x  y
0   1  1  1
1   2  2  2
2   3  3  3
7   8  4  4
So the whole script can be as follows:
import pandas as pd
d = {'id': [1,2,3,4,5,6,7,8], 'x': [1,2,3,4,1,4,3,4], 'y': [1,2,3,2,3,2,3,4]}
df = pd.DataFrame(data=d)
dups = df[df.duplicated(subset=['x', 'y'], keep=False)]
gr = dups.groupby(['x', 'y'])
df['grpNo'] = gr.ngroup()
ind1 = df[df['grpNo'] == 0].index[0]
ind2 = df[df['grpNo'] == 0].index[-1]
indToDel = df[(df.index > ind1) & (df.index <= ind2)].index
df.drop(indToDel, inplace=True)
df.drop('grpNo', axis=1, inplace=True)
print(df)
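Note that the script above only collapses the first duplicated group (grpNo == 0). As a sketch of my own generalization (not part of the answer above), you could repeat the removal until no duplicated x-y pair remains:

import pandas as pd

def drop_between_duplicates(df, cols=('x', 'y')):
    # Repeatedly find the first duplicated pair and drop everything after
    # its first occurrence up to and including its last occurrence.
    df = df.reset_index(drop=True)
    while True:
        dups = df[df.duplicated(subset=list(cols), keep=False)]
        if dups.empty:
            return df
        grp = dups.groupby(list(cols)).ngroup()
        ind1 = grp[grp == 0].index[0]   # first occurrence of group 0
        ind2 = grp[grp == 0].index[-1]  # last occurrence of group 0
        df = df[(df.index <= ind1) | (df.index > ind2)]

d = {'id': [1,2,3,4,5,6,7,8], 'x': [1,2,3,4,1,4,3,4], 'y': [1,2,3,2,3,2,3,4]}
print(drop_between_duplicates(pd.DataFrame(data=d)))  # keeps ids 1, 2, 3, 8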
This works for both of your examples, although I'm not sure it generalizes to all the cases you have in mind:
df1[df1['x'] == df1['y']]
I have 2 dataframes, df1 and df2, and want to do the following, storing results in df3:
for each row in df1:
    for each row in df2:
        create a new row in df3 (called "df1-1, df2-1" or whatever) to store results
        for each cell (column) in df1:
            for the cell in df2 whose column name is the same as for the cell in df1:
                compare the cells (using some comparison function func(a, b)) and,
                depending on the result of the comparison, write the result into
                the appropriate column of the "df1-1, df2-1" row of df3
For example, something like:
df1
    A     B       C    D
  foo   bar  foobar    7
  gee  whiz    herp   10
df2
    A     B       C    D
  zoo   car  foobar    8
df3
  df1-df2  A              B               C                    D
  foo-zoo  func(foo,zoo)  func(bar,car)   func(foobar,foobar)  func(7,8)
  gee-zoo  func(gee,zoo)  func(whiz,car)  func(herp,foobar)    func(10,8)
I've started with this:
for r1 in df1.iterrows():
    for r2 in df2.iterrows():
        for c1 in r1:
            for c2 in r2:
but am not sure what to do with it, and would appreciate some help.
So to continue the discussion in the comments, you can use vectorization, which is one of the selling points of a library like pandas or numpy. Ideally, you shouldn't ever be calling iterrows(). To be a little more explicit with my suggestion:
# with df1 and df2 provided as above, an example
df3 = df1['A'] * 3 + df2['A']
# recall that df2 only has the one row, so pandas will broadcast a NaN there
df3
0    foofoofoozoo
1             NaN
Name: A, dtype: object
# more generally
# we know that df1 and df2 share column names, so we can initialize df3 with those names
df3 = pd.DataFrame(columns=df1.columns)
for colName in df1:
    df3[colName] = func(df1[colName], df2[colName])
Now, you could even have different functions applied to different columns by, say, creating lambda functions and then zipping them with the column names:
# some example functions
colAFunc = lambda x, y: x + y
colBFunc = lambda x, y: x - y
...
columnFunctions = [colAFunc, colBFunc, ...]
# initialize df3 as above
df3 = pd.DataFrame(columns=df1.columns)
for func, colName in zip(columnFunctions, df1.columns):
    df3[colName] = func(df1[colName], df2[colName])
The only "gotcha" that comes to mind is that you need to be sure that your function is applicable to the data in your columns. For instance, if you were to do something like df1['A'] - df2['A'] (with df1, df2 as you have provided), that would raise a ValueError as the subtraction of two strings is undefined. Just something to be aware of.
Edit, re: your comment: that is doable as well. Iterate over whichever dfX.columns is larger, so you don't run into a KeyError, and throw an if statement in there:
# all the other jazz as above
# let's say df1 is [['A', 'B', 'C']] and df2 is [['A', 'B', 'C', 'D']],
# so iterate over df2's columns
for colName in df2:
    if colName not in df1:
        df3[colName] = np.nan  # be sure to import numpy as np
    else:
        df3[colName] = func(df1[colName], df2[colName])
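Putting the pieces together, here is a minimal runnable sketch using the data from the question, with a stand-in func (plain equality, since the real comparison function wasn't specified). I take df2's single row as a scalar per column (df2[colName].iloc[0]) so pandas broadcasts it over all rows of df1 instead of aligning on the row index:

import pandas as pd

df1 = pd.DataFrame({'A': ['foo', 'gee'], 'B': ['bar', 'whiz'],
                    'C': ['foobar', 'herp'], 'D': [7, 10]})
df2 = pd.DataFrame({'A': ['zoo'], 'B': ['car'], 'C': ['foobar'], 'D': [8]})

# stand-in for the comparison function from the question
func = lambda a, b: a == b

df3 = pd.DataFrame(index=df1.index, columns=df1.columns)
for colName in df1:
    # .iloc[0] pulls a scalar, avoiding the NaN broadcast mentioned above
    df3[colName] = func(df1[colName], df2[colName].iloc[0])
print(df3)  # True only where the cells match, e.g. column C, row 0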
I've run into an issue trying to drop a nan column from a table.
Here's the example that works as expected:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                   columns=['A', 'B', 'C'],
                   index=['Foo', 'Bar'])
mapping1 = pd.DataFrame([['a', 'x'], ['b', 'y']],
                        index=['A', 'B'],
                        columns=['Test', 'Control'])
# rename the columns using the mapping file
df1.columns = mapping1.loc[df1.columns, 'Test']
From here we see that the C column in df1 doesn't have an entry in the mapping file, so that header is replaced with NaN.
# drop the nan column
df1.drop(np.nan, axis=1)
In this situation, passing np.nan finds the final header and drops it.
However, in the situation below, df.drop does not work:
# set up table
sample1 = np.random.randint(0, 10, size=3)
sample2 = np.random.randint(0, 5, size=3)
df2 = pd.DataFrame([sample1, sample2],
                   index=['sample1', 'sample2'],
                   columns=range(3))
mapping2 = pd.DataFrame(['foo'] * 2, index=range(2),
                        columns=['test'])
# assign columns using mapping file
df2.columns = mapping2.loc[df2.columns, 'test']
# try and drop the nan column
df2.drop(np.nan, axis=1)
And the nan column remains.
This may be an answer (from https://stackoverflow.com/a/16629125/5717589):
When the index is unique, pandas uses a hashtable to map key to value. When the index is non-unique and sorted, pandas uses binary search; when the index is randomly ordered, pandas needs to check all the keys in the index.

So, if the entries are unique, np.nan gets hashed, I think. In the non-unique case, pandas compares values, but:
np.nan == np.nan
Out[1]: False
Update
I guess it's impossible to access a NaN column by label. But it's doable by index position. Here is a workaround for dropping columns with null labels:
notnull_col_idx = np.arange(len(df.columns))[~pd.isnull(df.columns)]
df = df.iloc[:, notnull_col_idx]
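On reasonably recent pandas versions (where Index has a notna method; worth checking against your version), the same workaround can be spelled more directly:

# keep only the columns whose labels are not NaN
df = df.loc[:, df.columns.notna()]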
Hmmm... this might be considered a bug, but it seems this problem occurs when several of your columns share the same label, in this case foo. If I switch up the labels, the issue disappears:
mapping2 = pd.DataFrame(['foo', 'boo'], index=range(2),
                        columns=['test'])
I also attempted to select the columns by their index positions, and the problem still occurs:
# try and drop the nan column
df2.drop(df2.columns[[2]], axis=1)
Out[176]:
test     foo  foo  nan
sample1    4    4    4
sample2    4    0    1
But after altering the 2nd column label to something other than foo, the problem resolves itself. My best piece of advice is to have unique column labels.
Additional info: this also occurs when there are multiple nan columns...