I want to fill the column of the df2 (~100.000 rows) with the values from the same column of df (~1.000.000 rows). Df often has several times the same row but with wrong data, so I always want to take the first value of my column 'C'.
df = pd.DataFrame([[100, 1, 2], [100, 3, 4], [100, 5, 6], [101, 7, 8], [101, 9, 10]],
columns=['A', 'B', 'C'])
df2=pd.DataFrame([[100,0],[101,0]], columns=['A', 'C'])
for i in range(0,len(df2.index)):
#My Question:
df2[i,'C']=first value of 'C' column of df where the 'A' column is the same of both dataframes. E.g. the first value for 100 would be 2 and then the first value for 101 would be 8
In the end, my output should be a table like this:
df2=pd.DataFrame([[100,2],[101,8]], columns=['A', 'C'])
You can try this:
df2['C'] = df.groupby('A')['C'].first().values
Which will give you:
A C
0 100 2
1 101 8
first() returns the first value of every group.
Then you want to assign the values to df2 column, unfortunately, you cannot assign the result directly like this:
df2['C'] = df.groupby('A')['C'].first() .
Because the above line will result in :
A C
0 100 NaN
1 101 NaN
(You can read about the cause here: Adding new column to pandas DataFrame results in NaN)
Related
I am doing an experiment and want to observe the impact of missing values on the query results. I am doing it using Python Pandas. Consider that I have dataframe df. This dataframe is the complete data. My real data consists of many columns and thousands of rows.
I made a copy of df to df_copy. Then I do an experiment using df_copy and df is the ground truth. I put some NaN values on df_copy randomly.
I have some ideas to fix the missing values on df_copy using a heuristic ways. Currently, I can do easily using row operation in pandas. For instance, if I want to fix any rows on df_copy, I just can get the row by the id from df_copy then drop the row and replace from the df.
My question is, how can I do an operation on a cell-based in pandas? For instance, How can I get the index (x,y) from all missing values and when I want to fix a missing cell, I can just replace the value on that cell from the ground truth by calling the index (x,y)
Example:
df
df = pd.DataFrame(np.array([["x", 2, 3], ["y", 5, 6], ["z", 8, 9]]),
columns=['a', 'b', 'c'])
a b c
0 x 2 3
1 y 5 6
2 z 8 9
df_copy
df_copy = pd.DataFrame(np.array([["x", np.nan, 3], ["y", 5, np.nan], [np.nan, 8, 9]]),
columns=['a', 'b', 'c'])
a b c
0 x nan 3
1 y 5 nan
2 nan 8 9
I want to select the entries of one dataframe, say df2, based on the cross-sectional statistic of another dataframe, say df1:
df1 = pd.DataFrame([[4, 5, 9, 11],
[3, 1, 45, 1],
[88, 314, 2, 313]], columns = ['A', 'B', 'C', 'D'])
df2 = pd.DataFrame([['h','e','l','p'],
['m','y','q','u'],
['e','r','y','.']], columns = ['A', 'B', 'C', 'D'])
For instance, if the cross-sectional statistic on df1 is a max operation, then for the 3 rows in df1 the corresponding columns with the max entries are 'D', 'C', 'B' (corresponding to entries 11, 45, 314).
Selecting only those entries in df2 should give me:
which I can achieve by:
mask_ = pd.DataFrame(False, index=df1.idxmax(1).index, columns=df1.idxmax(1))
for k,i in enumerate(df1.idxmax(1)):
mask_.loc[k, i] = True
df2[mask_]
However, this feels cumbersome; is there an easier way to do this?
Solution working if index and columns names are same in both DataFrames.
Use DataFrame.where with mask for compare maximal values by all values of rows:
df = df2.where(df1.eq(df1.max(axis=1), axis=0))
print (df)
A B C D
0 NaN NaN NaN p
1 NaN NaN q NaN
2 NaN r NaN NaN
I want to filter my df down to only those rows who have a value in column A which appears less frequently than some threshold. I currently am using a trick with two value_counts(). To explain what I mean:
df = pd.DataFrame([[1, 2, 3], [1, 4, 5], [6, 7, 8]], columns=['A', 'B', 'C'])
'''
A B C
0 1 2 3
1 1 4 5
2 6 7 8
'''
I want to remove any row whose value in the A column appears < 2 times in the column A. I currently do this:
df = df[df['A'].isin(df.A.value_counts()[df.A.value_counts() >= 2].index)]
Does Pandas have a method to do this which is cleaner than having to call value_counts() twice?
It's probably easiest to filter by group size, where the groups are done on column A.
df.groupby('A').filter(lambda x: len(x) >=2)
I want to add a row to a pandas dataframe with using df.loc[rowname] = s (where s is a series).
However, I constantly get the Cannot reindex from a duplicate axis ValueError.
I presume that this is due to having duplicate column names in df as well as duplicate index names in s (the index of s is identical to df.columns.
However, when I try to reproduce this error on a small example, I don't get this error. What could the reason for this behavior be?
a = pd.DataFrame(columns=['a', 'b', 'a'], data=[[1, 2, 7], [5, 4, 5], ['', '', '']])
b=pd.DataFrame(columns=a.columns)
b.loc['mean'] = a.replace('',np.nan).mean(skipna=True)
print(b)
a b a
mean 3.0 3.0 6.0
I think duplicated columns names should be avoid, because then should be weird errors.
It seems there are non matched values between index of Series and columns of DataFrame:
a = pd.DataFrame(columns=['a', 'b', 'a'], data=[[1, 2, 7], [5, 4, 5], ['', '', '']])
a.loc['mean'] = pd.Series([2,5,4], index=list('abb'))
print(a)
ValueError: cannot reindex from a duplicate axis
One possible solution for deduplicated columns names with rename columns:
s = a.columns.to_series()
a.columns = s.add(s.groupby(s).cumcount().astype(str).replace('0',''))
print(a)
a b a1
0 1 2 7
1 5 4 5
2
Or drop duplicated columns:
a = a.loc[:, ~a.columns.duplicated()]
print(a)
a b
0 1 2
1 5 4
2
I have seen lots of advice about sorting based on a pandas column name but I am trying to sort based on the column index.
I have included some code to demonstrate what I am trying to do.
import pandas as pd
df = pd.DataFrame({
'col1' : ['A', 'A', 'B', 'D', 'C', 'D'],
'col2' : [2, 1, 9, 8, 7, 4],
'col3': [0, 1, 9, 4, 2, 3],
})
df2 = df.sort_values(by=['col2'])
I want to sort a number of dataframes that all have different names for the second column. It is not practical to sort based on (by=['col2'] but I always want to sort on the second column (i.e. Column index 1). Is this possible?
Select columns name by position and pass to by parameter:
print (df.columns[1])
col2
df2 = df.sort_values(by=df.columns[1])
print (df2)
col1 col2 col3
1 A 1 1
0 A 2 0
5 D 4 3
4 C 7 2
3 D 8 4
2 B 9 9