I have a list of lists as below
[[1, 2], [1, 3]]
The DataFrame is similar to
A B C
0 1 2 4
1 0 1 2
2 1 3 0
I would like to keep a row of the DataFrame if the value in column A equals the first element of any nested list and the value in column B of that same row equals the second element of the same list.
Thus the resulting DataFrame should be
A B C
0 1 2 4
2 1 3 0
The code below does what you need:
import pandas

tmp_filter = pandas.DataFrame(None)  # The DataFrame you want

# Create your list and your DataFrame
tmp_list = [[1, 2], [1, 3]]
tmp_df = pandas.DataFrame([[1, 2, 4], [0, 1, 2], [1, 3, 0]], columns=['A', 'B', 'C'])

# This function walks the DataFrame column by column and
# only keeps the rows whose values match the condition
def pass_true_df(df, cond):
    for i, c in enumerate(cond):
        df = df[df.iloc[:, i] == c]
    return df

# Pass through your list and collect the rows you want to keep
for i in tmp_list:
    tmp_filter = pandas.concat([tmp_filter, pass_true_df(tmp_df, i)])
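As a side note, a hedged one-liner alternative (my own sketch, not part of the original answer) builds (A, B) tuples per row and tests membership directly:

# Sketch: test each row's (A, B) pair for membership in the list of pairs
mask = tmp_df[['A', 'B']].apply(tuple, axis=1).isin([tuple(x) for x in tmp_list])
print(tmp_df[mask])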
import pandas

df = pandas.DataFrame([[1, 2, 4], [0, 1, 2], [1, 3, 0], [0, 2, 5], [1, 4, 0]],
                      columns=['A', 'B', 'C'])
filt = pandas.DataFrame([[1, 2], [1, 3], [0, 2]],
                        columns=['A', 'B'])
accum = []
# Group the data to be filtered by column A
data_g = df.groupby('A')
for k2, v2 in data_g:
    # Keep rows whose B value appears among the filter rows sharing this A
    accum.append(v2[v2.B.isin(filt.B[filt.A == k2])])
print(pandas.concat(accum))
result:
A B C
3 0 2 5
0 1 2 4
2 1 3 0
(I made the data and filter a little more complicated as a test.)
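For comparison, a hedged alternative (my own sketch, not part of either answer) achieves the same filtering with a single inner merge on both key columns:

import pandas

df = pandas.DataFrame([[1, 2, 4], [0, 1, 2], [1, 3, 0]], columns=['A', 'B', 'C'])
pairs = pandas.DataFrame([[1, 2], [1, 3]], columns=['A', 'B'])
# The inner merge keeps exactly the rows whose (A, B) pair appears in `pairs`
print(df.merge(pairs, on=['A', 'B'], how='inner'))

Note that the merge resets the row index, unlike the concat-based approaches above.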
I have a pandas DataFrame where each cell is a set of numbers. I would like to go through the DataFrame and pass each number, along with its row index, to a function. What's the most pandas-esque and efficient way to do this? Here's an example of one way to do it with for-loops, but I'm hopeful that there's a better approach.
import pandas as pd

def my_func(a, b):
    pass

d = {"a": [{1}, {4}], "b": [{1, 2, 3}, {2}]}
df = pd.DataFrame(d)

# Nested loops: for each row, each cell (a set), each element of the set
for index, item in df.iterrows():
    for j in item:
        for a in list(j):
            my_func(index, a)
Instead of iterating we can reshape the values into 1 column using stack then explode into separate rows:
df.stack().explode()

0  a    1
   b    1
   b    2
   b    3
1  a    4
   b    2
dtype: object
We can further droplevel if we don't want the old column labels:
s = df.stack().explode().droplevel(1)
s:
0 1
0 1
0 2
0 3
1 4
1 2
dtype: object
reset_index can be used to create a DataFrame instead of a Series:
new_df = df.stack().explode().droplevel(1).reset_index()
new_df.columns = ['a', 'b'] # Rename columns to whatever
new_df:
a b
0 0 1
1 0 1
2 0 2
3 0 3
4 1 4
5 1 2
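With the reshaped Series in hand, the nested loops from the question collapse to a single pass (a small sketch, reusing `s` and `my_func` from above):

# Each item of s is already a (row index, value) pair
for index, value in s.items():
    my_func(index, value)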
If I fully understood your problem, this might be one way of doing it:
[list(item) for sublist in df.values.tolist() for item in sublist]
The output will look like this:
[[1], [1, 2, 3], [4], [2]]
Since this is a nested list, you can flatten it if your requirement is a single list.
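For completeness, a hedged sketch of that flattening step (my addition, reusing `df` from the question):

nested = [list(item) for sublist in df.values.tolist() for item in sublist]
flat = [x for inner in nested for x in inner]
print(flat)  # [1, 1, 2, 3, 4, 2]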
I'm looking for a fast solution to this Python problem:
For each item in list L, find all of the corresponding items in a DataFrame column (`df['col1']`).
The catch is that both L and df['col1'] may contain duplicate values, and all duplicates should be returned.
For example:
import pandas as pd

L = [1, 4, 1]
d = {'col1': [1, 2, 3, 4, 1, 4, 4], 'col2': ['a', 'b', 'c', 'd', 'e', 'f', 'g']}
df = pd.DataFrame(data=d)
The desired output would be a new DataFrame where df['col1'] contains the values:
[1,1,1,1,4,4,4]
and rows are duplicated accordingly. Note that 1 appears 4 times (twice in L * twice in df)
I have found that the obvious solutions like .isin() don't work because they drop duplicates.
A list comprehension does work, but it is too slow for my real-life problem, where len(df) = 16 million and len(L) = 150000:
idx = [y for x in L for y in df.index[df['col1'].values == x]]
res = df.loc[idx].reset_index(drop=True)
This is basically just a problem of comparing two lists (with a bit of dataframe indexing difficulty tacked on), and a clever and very fast solution by Mad Physicist almost works for this, except that duplicates in L are dropped (it returns [1, 4, 1, 4, 4] in the example above; i.e., it finds the duplicates in df but ignores the duplicates in L).
import numpy as np

train = np.array([...])  # my df['col1']
keep = np.array([...])   # my list L

keep.sort()
ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = train[keep[ind] == train]
I'd be grateful for any ideas.
Initial data:
L = [1,4,1]
df = pd.DataFrame({'col':[1,2,3,4,1,4,4] })
You can create a DataFrame from L:
df2 = pd.DataFrame({'col':L})
and merge it with the initial DataFrame:
result = df.merge(df2, how='inner', on='col')
print(result)
Result:
col
0 1
1 1
2 1
3 1
4 4
5 4
6 4
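Because the merge is many-to-many, each key appears (occurrences in L) × (occurrences in df) times, which is exactly the duplicate-preserving behaviour the question asks for. A hedged sketch (my addition) extending the same idea to the question's two-column frame, so col2 rides along:

df = pd.DataFrame({'col1': [1, 2, 3, 4, 1, 4, 4],
                   'col2': ['a', 'b', 'c', 'd', 'e', 'f', 'g']})
result = df.merge(pd.DataFrame({'col1': [1, 4, 1]}), on='col1', how='inner')
print(result['col1'].value_counts())  # 1 appears four times, 4 three times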
IIUC try:
L = [1,4,1]
pd.concat([df.loc[df['col'].eq(el), 'col'] for el in L], axis=0)
(Not sure how you want the indexes handled; the above returns them in a fairly raw format.)
Output:
0 1
4 1
3 4
5 4
6 4
0 1
4 1
Name: col, dtype: int64
Reindexed:
pd.concat([df.loc[df['col'].eq(el), 'col'] for el in L], axis=0).reset_index(drop=True)
#output:
0 1
1 1
2 4
3 4
4 4
5 1
6 1
Name: col, dtype: int64
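For very large inputs like the 16-million-row case in the question, a hedged sketch (my own variant, not from this answer) precomputes each value's row positions once instead of scanning df once per element of L:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3, 4, 1, 4, 4]})
L = [1, 4, 1]

# GroupBy.indices maps each value to the positional indices of its rows;
# this assumes every element of L occurs in df['col'] (KeyError otherwise)
positions = df.groupby('col').indices
idx = np.concatenate([positions[v] for v in L])
print(df.iloc[idx].reset_index(drop=True))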
I work in Python with pandas.
Let's suppose that I have a dataframe like this (INPUT):
A B C
0 2 8 6
1 5 2 5
2 3 4 9
3 5 1 1
I want to process it to get a new dataframe which looks like this (EXPECTED OUTPUT):
A B C
0 2 7 NaN
1 5 1 1
2 3 3 NaN
3 5 0 NaN
To manage this I do the following:
import numpy as np
import pandas as pd

columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)

df_2 = df_1
df_2['B'] -= 1
df_2['C'] = np.nan
df_2 looks like that for now:
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
Now I want to do a matching/merging between df_1 and df_2, using columns A and B as keys.
I tried with isin() to do this:
df_temp = df_1[df_1[['A', 'B']].isin(df_2[['A', 'B']])]
df_2.iloc[df_temp.index] = df_temp
but it gives me back the same df_2 as before, without matching the common row (5, 1, 1) for A, B, C respectively:
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
How can I do this properly?
By the way, just to be clear, the matching should not be done like
1st row of df1 - 1st row of df2
2nd row of df1 - 2nd row of df2
3rd row of df1 - 3rd row of df2
...
Rather, it has to be done as:
any row of df1 - any row of df2
based on the specified columns as keys.
I think this is why isin() in my code above does not work: it matches in the former, positional way.
On the other hand, .merge() can match in the latter way, but it does not preserve the order of the rows as I want, and fixing that is fairly tricky or inefficient.
Finally, keep in mind that my actual dataframes will use far more than 2 columns (e.g. 15) as keys for the matching, so a concise approach that scales to bigger dataframes is preferable.
P.S.
See my answer below.
Here's my suggestion using a lambda function in apply. It should scale easily to more comparison columns (just adjust cols_to_compare accordingly). By the way, when generating df_2, be sure to copy df_1; otherwise changes to df_2 will carry over to df_1 as well.
So generating the data first:
import numpy as np
import pandas as pd

columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)

df_2 = df_1.copy()  # Be sure to create a copy here
df_2['B'] -= 1
df_2['C'] = np.nan
and now we 'scan' df_1 for the rows of interest:
cols_to_compare = ['A', 'B']
df_2['C'] = df_2.apply(
    lambda x: 1 if any((df_1.loc[:, cols_to_compare].values == x[cols_to_compare].values).all(1))
    else np.nan,
    axis=1)
What it does is check, for each row of df_2, whether the same values appear in the relevant columns of any row of df_1.
The output is:
A B C
0 2 7 NaN
1 5 1 1.0
2 3 3 NaN
3 5 0 NaN
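For larger frames, a hedged vectorized alternative (my own sketch, not part of this answer) avoids the row-wise apply by comparing the key columns through a MultiIndex:

keys = ['A', 'B']
# Rows of df_2 whose (A, B) pair also occurs in df_1
mask = pd.MultiIndex.from_frame(df_2[keys]).isin(pd.MultiIndex.from_frame(df_1[keys]))
df_2['C'] = np.where(mask, 1, np.nan)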
Someone (I do not remember their username) suggested the following, which I think works, and then deleted their post for some reason:
df_2 = df_2.set_index(['A', 'B'])
temp = df_1.set_index(['A', 'B'])
df_2.update(temp)
df_2.reset_index(inplace=True)
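update aligns on the index, so setting A and B as the index turns the positional update into a key-based one. A quick check (my own addition) against the expected output from the question:

print(df_2)
#    A  B    C
# 0  2  7  NaN
# 1  5  1  1.0
# 2  3  3  NaN
# 3  5  0  NaN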
You can accomplish this using two for loops (note this is quadratic in the number of rows):
for row in df_2.iterrows():
    for row2 in df_1.iterrows():
        if [row[1]['A'], row[1]['B']] == [row2[1]['A'], row2[1]['B']]:
            # Use .loc rather than chained indexing, which may silently
            # fail to write back to df_2
            df_2.loc[row[0], 'C'] = row2[1]['C']
Just replace your line:
df_temp = df_1[df_1[['A', 'B']].isin(df_2[['A', 'B']])]
with:
df_1[df_1['A'].isin(df_2['A']) & df_1['B'].isin(df_2['B'])]
This works for the example data, though note that the two isin checks are independent: a row passes when its A value appears anywhere in df_2['A'] and its B value appears anywhere in df_2['B'], not necessarily in the same row of df_2.
I need to replace the values of a certain cell with values from another cell if a certain condition is met.
for r in df:
    if df['col1'] > 1:
        df['col2']
    else:
        ...
I am hoping for every value in column 1 to be replaced with its respective value in column 2 when the value in column 1 is greater than 1.
No need to loop through the entire dataframe.
idx = df['col1'] > 1
df.loc[idx, 'col1'] = df.loc[idx, 'col2']
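An equivalent formulation (my own sketch) uses numpy.where, which evaluates the condition once over the whole column:

import numpy as np

# Keep col1 where the condition fails, take col2 where it holds
df['col1'] = np.where(df['col1'] > 1, df['col2'], df['col1'])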
Using a for loop:
for idx, row in df.iterrows():
    if row['col1'] > 1:
        # Write back through df.loc: assigning to `row` only mutates a copy
        df.loc[idx, 'col1'] = row['col2']
    # elif some_condition:
    #     put assignment here
    # else:
    #     put assignment here
Here is an example
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 4, 6]})
print(df)
print('----------')
# the condition here is A^2 == B
df.loc[df['A'] * df['A'] == df['B'], 'A'] = df['B']
print(df)
output
A B
0 1 4
1 2 4
2 3 6
----------
A B
0 1 4
1 4 4
2 3 6
There is a question on SO with the title "Set value for particular cell in pandas DataFrame", but it assumes the row index is known. How do I change a value, if the row values in the relevant column are unique? Is using set_index as below the easiest option? Returning a list of index values with index.tolist does not seem like a very elegant solution.
Here is my code:
import pandas as pd

## Fill the data frame.
df = pd.DataFrame([[1, 2], [3, 4]], columns=('a', 'b'))
print(df, end='\n\n')

## Just use the index, if we know it.
## (Note: set_value was deprecated in pandas 0.21 and removed in 1.0;
## df.at[index, 'a'] = 5 is the modern equivalent.)
index = 0
df.set_value(index, 'a', 5)
print('set_value', df, sep='\n', end='\n\n')

## Define a new index column.
df.set_index('a', inplace=True)
print('set_index', df, sep='\n', end='\n\n')

## Use the index values of column a, which are known to us.
index = 3
df.set_value(index, 'b', 6)
print('set_value', df, sep='\n', end='\n\n')

## Reset the index.
df.reset_index(inplace=True)
print('reset_index', df, sep='\n')
Here is my output:
   a  b
0  1  2
1  3  4

set_value
   a  b
0  5  2
1  3  4

set_index
   b
a
5  2
3  4

set_value
   b
a
5  2
3  6

reset_index
   a  b
0  5  2
1  3  6
Regardless of the performance, you should be able to do this using loc with boolean indexing:
df = pd.DataFrame([[5, 2], [3, 4]], columns=('a', 'b'))
# modify value in column b where a is 3
df.loc[df.a == 3, 'b'] = 6
df
# a b
#0 5 2
#1 3 6
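As a follow-up (my own note, not from the answer): if the mask is known to match exactly one row, df.at gives a scalar fast path once the label is looked up:

label = df.index[df.a == 3][0]  # assumes exactly one matching row
df.at[label, 'b'] = 6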