Take two data frames
print(df1)
A B
0 a 1
1 a 3
2 a 5
3 b 7
4 b 9
5 c 11
6 c 13
7 c 15
print(df2)
C D
a apple 1
b pear 1
c apple 1
So the values in column df1['A'] are the indexes of df2.
I want to select the rows in df1 where the value in column A corresponds to 'apple' in df2['C'], resulting in:
A B
0 a 1
1 a 3
2 a 5
5 c 11
6 c 13
7 c 15
You first extract the indexes of df2 by filtering the dataframe on the values in C, then filter df1 by those indexes with isin:
indexes = df2[df2['C']=='apple'].index
df1[df1['A'].isin(indexes)]
>>>
A B
0 a 1
1 a 3
2 a 5
5 c 11
6 c 13
7 c 15
UPDATE
If you want to minimize memory allocation, try to avoid saving intermediate results (note: I am not sure this will solve your memory allocation issue, because I don't have full details of the situation and may not even be suited to provide a solution):
df1[df1['A'].isin( df2[df2['C']=='apple'].index)]
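As an alternative sketch (not from the original answer), the same filter can be written without materializing an index object at all, by mapping column A through df2['C'] and comparing directly; this assumes df2's index is unique:
# map each value of A to its label in df2['C'], then keep the 'apple' rows
df1[df1['A'].map(df2['C']).eq('apple')]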
I want to know if there is a way to take the data from a dataframe after a specific condition is met, and keep taking that data until another condition applies.
I have the following dataframe:
column_1 column_2
0 1 a
1 1 a
2 1 b
3 4 b
4 4 c
5 4 c
6 0 d
7 0 d
8 0 e
9 4 e
10 4 f
11 4 f
12 1 g
13 1 g
I want to select from this dataframe only the rows where column_1 changes from 1 -> 4 and stays 4 until it changes to another value, as follows:
column_1 column_2
3 4 b
4 4 c
5 4 c
Is there a way to do this in Pandas without converting to lists?
Another option is to find the cut off points using shift+eq; then use groupby.cummax to create a boolean filter:
df[(df['column_1'].shift().eq(1) & df['column_1'].eq(4)).groupby(df['column_1'].diff().ne(0).cumsum()).cummax()]
Output:
column_1 column_2
3 4 b
4 4 c
5 4 c
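For readability, here is a sketch of the same one-liner broken into intermediate steps; the variable names are illustrative, not part of the answer:
# rows where column_1 switches from 1 to 4
is_start = df['column_1'].shift().eq(1) & df['column_1'].eq(4)
# label consecutive runs of equal values
run_id = df['column_1'].diff().ne(0).cumsum()
# extend the start flag across the rest of each run
mask = is_start.groupby(run_id).cummax()
print(df[mask])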
You can first create a helper column new that labels groups of consecutive duplicated values. Then test where the shifted value is 1 and the actual value is 4, and take the new labels of those rows. Finally, keep the rows whose new label is among those filtered values, which selects all duplicated 4 rows of those runs:
# label consecutive runs of equal values in column_1
df['new'] = df['column_1'].ne(df['column_1'].shift()).cumsum()
# run labels of the rows where the value changes from 1 to 4
s = df.loc[df['column_1'].shift().eq(1) & df['column_1'].eq(4), 'new']
# keep every row that belongs to one of those runs
df = df[df['new'].isin(s)]
print(df)
column_1 column_2 new
3 4 b 2
4 4 c 2
5 4 c 2
I'm currently trying to do an analysis of rolling correlations of a dataset with four compared values, but I only need the output of the rows containing 'a'.
I got my data frame by using the command newdf = df.rolling(3).corr()
Sample input (random numbers)
a b c d
1 a
1 b
1 c
1 d
2 a
2 b
2 c
2 d
3 a
3 b 5 6 3
3 c 4 3 1
3 d 3 4 2
4 a 1 3 5 6
4 b 6 2 4 1
4 c 8 6 6 7
4 d 2 5 4 6
5 a 2 5 4 1
5 b 1 4 6 3
5 c 2 6 3 7
5 d 3 6 3 7
and need the output
a b c d
1 a 1 3 5 6
2 a 2 5 4 1
I've tried filtering it by doing adf = newdf.filter(['a'], axis=0), however that gets rid of everything, and when doing it on the other axis it filters by column. Unfortunately the column containing the rows with values a, b, c, d is unnamed, so I can't filter that column individually. This wouldn't be an issue, however, if it's possible to flip the rows and columns, with the values being listed by index, to get the desired output.
Try using loc. Put the column of abcdabcd ... as index and just use loc
df.loc['a']
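A minimal sketch of that idea, assuming the letter column exists as a regular column named key (a hypothetical name, not from the question):
# set the letter column as the index, then select all 'a' rows
df = df.set_index('key')
print(df.loc['a'])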
The actual source of the problem in your case is that your DataFrame has a MultiIndex.
So when you attempt to execute newdf.filter(['a'], axis=0), you want to keep rows whose index contains only the "a" string.
But since your DataFrame has a MultiIndex, each row with "a" at level 1 also contains some number at level 0.
To get your intended result, run:
newdf.filter(like='a', axis=0)
maybe followed by .dropna().
An alternative solution is:
newdf.xs('a', level=1, drop_level=False)
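To make both suggestions concrete, here is a small self-contained sketch; the frame is made up and only its MultiIndex shape mimics the output of df.rolling(3).corr():
import numpy as np
import pandas as pd

# illustrative two-level index: an outer number and an inner letter per row
idx = pd.MultiIndex.from_product([[1, 2], list('abcd')])
newdf = pd.DataFrame(np.arange(32).reshape(8, 4), index=idx, columns=list('abcd'))

print(newdf.filter(like='a', axis=0))            # substring match on the row labels
print(newdf.xs('a', level=1, drop_level=False))  # exact match on level 1 of the index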
I am having some issues using pandas to manipulate some files into the right format.
I have multiple CSV's that I want to merge into one file.
They all have this sort of structure,
data1, data2 etc
And then with the key values, which are area names (ideally I'd want to rename that column). Any area name that has two words in it contains "%20" instead of a space, and I'd want to remove that %20 ideally.
And lastly, I want each area name to have its respective longitude and latitude from this file here.
If anyone has some pointers on achieving this, that would be amazing. I'm stuck using df.merge and keep receiving errors. So close to getting it complete!
You can use pd.concat (pandas documentation). As for string cleaning, you can apply a lambda function after the fact. Here's an example:
df1 = pd.read_csv('file1.csv')
df1
a b c
0 1 2 aardvark
1 3 4 pangolin
2 5 6 cat%20dog
df2 = pd.read_csv('file2.csv')
df2
a b c
0 7 8 bus
1 9 10 boat
2 11 12 ferry%20plane
# use pd.concat
df = pd.concat([df1, df2]).reset_index(drop=True)
df
a b c
0 1 2 aardvark
1 3 4 pangolin
2 5 6 cat%20dog
3 7 8 bus
4 9 10 boat
5 11 12 ferry%20plane
# apply lambda to clean the %20
f = lambda s: s.replace('%20', ' ')
df['clean_c'] = df['c'].apply(f)
df[['a', 'b', 'clean_c']]
a b clean_c
0 1 2 aardvark
1 3 4 pangolin
2 5 6 cat dog
3 7 8 bus
4 9 10 boat
5 11 12 ferry plane
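If the %20 cleanup is the only transformation needed, a vectorized alternative to the lambda is Series.str.replace; a minimal sketch, assuming column 'c' holds strings and a reasonably recent pandas (for the regex=False flag):
# replace the literal '%20' with a space across the whole column
df['clean_c'] = df['c'].str.replace('%20', ' ', regex=False)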
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
3 c 2 9
4 b 2 10
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
I want to select the last 3 rows of each group (from the above df) like the following, but perform the operation in place. I want to ensure that I am keeping only the new df object in memory after assignment. What would be an efficient way of doing it?
df = df.groupby('Group').tail(3)
The result should look like the following:
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
N.B.: This question is related to Keeping the last N duplicates in pandas.
df = df.groupby('Group').tail(3) is already an efficient way of doing it. Because you are overwriting the df variable, Python will take care of releasing the memory of the old dataframe, and you will only have access to the new one.
Trying way too hard to guess what you want.
NOTE: using Pandas inplace argument where it is available is NO guarantee that a new DataFrame won't be created in memory. In fact, it may very well create a new DataFrame in memory and replace the old one behind the scenes.
from collections import defaultdict

def f(s):
    # walk the column from the bottom, counting occurrences per group value
    c = defaultdict(int)
    for i, x in zip(s.index[::-1], s.values[::-1]):
        c[x] += 1
        if c[x] > 3:  # past the last 3 of this group -> mark the row for dropping
            yield i

df.drop([*f(df.Group)], inplace=True)
df
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
Your answer is already in the post. However, as said earlier in the comments, you are overwriting the existing df, so to avoid that, assign to a new column name like below:
df['new_col'] = df.groupby('Group').tail(3)
However, out of curiosity, if you are not concerned about the groupby and are only looking for the last N lines of the df, you can do it like below:
df[-2:] # last 2 rows
I am using Python 3.4.4 with Pandas 0.18.1 to determine confidence intervals for experimental data. This leads to many calculations with dataframe columns.
From the Pandas docs the .loc[] method is recommended over the chained [] method, but it seems to be impossible to apply. Here is an example with a simplified dataframe:
df1 = pd.DataFrame(np.array([[1,2,3],[4,5,6],[7,8,9]]), index=['a','b','c'], columns=['A','B','C'])
print(df1)
A B C
a 1 2 3
b 4 5 6
c 7 8 9
To calculate column 'A' times 3 in a new column 'E' I try
df1.loc[:,'E'] = df1.loc[:,['A']]*3
print(df1)
A B C E
a 1 2 3 NaN
b 4 5 6 NaN
c 7 8 9 NaN
If I use the un-recommended method I obtain
df1.loc[:,'E'] = df1['A']*3
print(df1)
A B C E
a 1 2 3 3
b 4 5 6 12
c 7 8 9 21
Thus it looks like the second method is the right one, but for my larger dataframe I get:
"SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame"
I spent a lot of time trying to find a satisfying solution, without result.
Many thanks in advance for your help.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([[1,2,3],[4,5,6],[7,8,9]]), index=['a','b','c'], columns=['A','B','C'])
print(df1)
df1['E']= df1['A']*3
print(df1)
output:
A B C
a 1 2 3
b 4 5 6
c 7 8 9
A B C E
a 1 2 3 3
b 4 5 6 12
c 7 8 9 21
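As a side note not part of the original answer: the NaNs in the question come from the right-hand side being a one-column DataFrame (df1.loc[:, ['A']]), which aligns on column labels during assignment. Selecting the column as a Series makes the .loc form work too; a minimal sketch:
# a Series on the right-hand side avoids the column-alignment problem
df1.loc[:, 'E'] = df1.loc[:, 'A'] * 3
print(df1)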