pandas.Index.isin produces a different dataframe than simple slicing - python

I'm really new to pandas and python in general, so I apologize if this is too basic.
I have a list of indices that I must use to take a subset of the rows of a dataframe. First, I simply sliced the dataframe using the indices to produce (df_1). Then I tried to use index.isin just to see if it also works (df_2). Well, it works but it produces a shorter dataframe (and seemingly ignores some of the rows that are supposed to be selected).
df_1 = df.iloc[df_idx]
df_2 = df[df.index.isin(df_idx)]
So my question is, why are they different? How exactly does index.isin work and when is it appropriate to use it?

Synthesising duplicates in the index reproduces the behaviour you note. If your index has duplicates, it's absolutely expected that the two will give different results: df.iloc[df_idx] treats the values in df_idx as positions (so a repeated value selects the same row twice), while df.index.isin(df_idx) builds a boolean mask over index values (so each row is selected at most once). If you want to use these interchangeably, you need to ensure that your index values uniquely identify a row.
n = 6
df = pd.DataFrame({"idx":[i//2 for i in range(n)],"col1":[f"text {i}" for i in range(n)]}).set_index("idx")
df_idx = df.index
print(f"""
{df}
{df.iloc[df_idx]}
{df[df.index.isin(df_idx)]}
""")
Output:
col1
idx
0 text 0
0 text 1
1 text 2
1 text 3
2 text 4
2 text 5
col1
idx
0 text 0
0 text 0
0 text 1
0 text 1
1 text 2
1 text 2
col1
idx
0 text 0
0 text 1
1 text 2
1 text 3
2 text 4
2 text 5
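To see the distinction directly, here is a minimal sketch (assuming a default, duplicate-free RangeIndex): with unique index values, positional and membership-based selection return the same rows.
import pandas as pd

# With a unique RangeIndex, positions and index values coincide,
# so iloc and the isin mask select exactly the same rows.
df_u = pd.DataFrame({"col1": [f"text {i}" for i in range(6)]})
df_idx = [1, 3, 5]
pd.testing.assert_frame_equal(df_u.iloc[df_idx], df_u[df_u.index.isin(df_idx)])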

Related

Filter dataframe based on matching values from two columns

I have a dataframe like the one shown below
cdf = pd.DataFrame({'Id':[1,2,3,4,5],
'Label':[1,2,3,0,0]})
I would like to filter the dataframe based on the below criteria
cdf['Id']==cdf['Label'] # first 3 rows are matching for both columns in cdf
I tried the below
flag = np.where[cdf['Id'].eq(cdf['Label'])==True,1,0]
final_df = cdf[cdf['flag']==1]
but I got the below error
TypeError: 'function' object is not subscriptable
I expect my output to be as shown below
Id Label
0 1 1
1 2 2
2 3 3
I think you're overthinking this. Just compare the columns:
>>> cdf[cdf['Id'] == cdf['Label']]
Id Label
0 1 1
1 2 2
2 3 3
Your particular error though is coming from the fact that you're using square brackets to call np.where, e.g. np.where[...], which is wrong. You should be using np.where(...) instead, but the above solution is bound to be as fast as it gets ;)
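For completeness, a minimal sketch of that fix (note the original snippet also filters on cdf['flag'] before that column exists, so the flag has to be assigned first):
import numpy as np

cdf['flag'] = np.where(cdf['Id'].eq(cdf['Label']), 1, 0)  # parentheses, not brackets
final_df = cdf[cdf['flag'] == 1]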
You can also check query:
cdf.query('Id == Label')
Out[248]:
Id Label
0 1 1
1 2 2
2 3 3

Pandas - How to extract values from a large DF without any 'keys' using another DF's values?

I've got one large matrix as a pandas DF without any 'keys', just plain numbers on top. A smaller version of it, just to demonstrate the problem here, would be this input:
M=pd.DataFrame(np.random.rand(4,5))
What I want to accomplish is to use another given DF as a reference, which has a structure like this
N=pd.DataFrame({'A':[2,2,2],'B':[2,3,4]})
...to extract the values from the large DF, where the values of 'A' correspond to the ROW number and the values of 'B' to the COLUMN number of the large DF, so that the expected output would look like this:
Large DF
0 1 2 3 4
0 0.766275 0.910825 0.378541 0.775416 0.639854
1 0.505877 0.992284 0.720390 0.181921 0.501062
2 0.439243 0.416820 0.285719 0.100537 0.429576
3 0.243298 0.560427 0.162422 0.631224 0.033927
Small DF
A B
0 2 2
1 2 3
2 2 4
Expected Output:
A B extracted values
0 2 2 0.285719
1 2 3 0.100537
2 2 4 0.429576
So far I've tried different version of something like this
N['extracted'] = M.iloc[N['A'].astype(int):,N['B'].astype(int)]
...but it keeps failing with an error saying
TypeError: cannot do positional indexing on RangeIndex with these indexers
[0 2
1 2
2 2
Which approach would be best?
Is this job better accomplished by converting the DFs into numpy arrays?
Thanks for the help!
I think you want to use the apply function. This goes row by row through your data set.
N['extracted'] = N.apply(lambda row: M.iloc[row['A'], row['B']], axis=1)
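Since you asked about numpy: a vectorized sketch that avoids the per-row apply is to index the underlying array with both columns at once (assuming 'A' and 'B' hold valid integer positions):
# Fancy indexing pairs each row number in 'A' with the column number in 'B'.
N['extracted'] = M.to_numpy()[N['A'].to_numpy(), N['B'].to_numpy()]
On large frames this is typically much faster than apply, since it is a single numpy operation rather than one Python call per row.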

Pandas save counts of multiple columns in single dataframe

I have a dataframe with 3 columns, which appears like this
Model IsJapanese IsGerman
BenzC 0 1
BensGla 0 1
HondaAccord 1 0
HondaOdyssey 1 0
ToyotaCamry 1 0
I want to create a new dataframe and have TotalJapanese and TotalGerman as two columns in the same dataframe.
I am able to achieve this by creating 2 different dataframes. But wondering how to get both the counts in a single dataframe.
Please suggest, thank you!
Edit: adding another, similar dataframe to this (sorry, not sure whether that's allowed, but trying).
Second dataset: I am trying to save multiple counts in a single dataframe, based on repetition of data.
Here is my sample dataset
Store Address IsLA IsGA
Albertsons Cross St 1 0
Safeway LeoSt 0 1
Albertsons Main St 0 1
RiteAid Culver St 1 0
My aim is to prepare a new dataset with multiple counts per store
The result should be like this
Store TotalStores TotalLA TotalGA
Alberstons 2 1 1
Safeway 1 0 1
RiteAid 1 1 0
Is it possible to achieve these in a single dataframe?
Thanks!
One way would be to store the sum of Japanese cars and German cars, and manually create a dataframe using them:
j, g = sum(df['IsJapanese']), sum(df['IsGerman'])
total_df = pd.DataFrame({'TotalJapanese': j,
                         'TotalGerman': g}, index=['Totals'])
print(total_df)
TotalJapanese TotalGerman
Totals 3 2
Another way would be to transpose (T) your dataframe, sum(axis=1), and transpose back:
total_df_v2 = pd.DataFrame(df.set_index('Model').T.sum(axis=1)).T
print(total_df_v2)
IsJapanese IsGerman
0 3 2
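A related sketch that avoids the double transpose (same totals under the same sample data): sum the indicator columns directly and transpose once.
total_df_v3 = df[['IsJapanese', 'IsGerman']].sum().to_frame().T
print(total_df_v3)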
To answer your 2nd question, you can use DataFrameGroupBy.agg after grouping on your 'Store' column: 'count' on Address and 'sum' on the other two columns. Then you can rename() your columns if needed:
resulting_df = (df.groupby('Store')
                  .agg({'Address': 'count', 'IsLA': 'sum', 'IsGA': 'sum'})
                  .rename({'Address': 'TotalStores',
                           'IsLA': 'TotalLA',
                           'IsGA': 'TotalGA'}, axis=1))
Prints:
TotalStores TotalLA TotalGA
Store
Albertsons 2 1 1
RiteAid 1 1 0
Safeway 1 0 1

Search for particular substring in pandas

I have the following list of strings:
my_list=["health","nutrition","nature","nutritionist", "nutritionists", "wellness", "food", "drink", "diet"]
I would like to assign a label to all the rows which contain one or more of the above words:
Search_Col
heathen
dietcoke
loveguru
drinkwine
lovefood
Pringles
then
Search_Col Tag
heathen 1
dietcoke 1
loveguru 0
drinkwine 1
lovefood 1
Pringles 0
I first tried to select the rows which contain elements of my_list, as follows
df.Search_Col.str.contains(my_list)
but it does not select rows.
Join the values in the list with | for a regex OR, then convert the boolean mask to 0/1 with Series.view:
df['Tag'] = df.Search_Col.str.contains('|'.join(my_list)).view('i1')
print (df)
Search_Col Tag
0 heathen 0
1 dietcoke 1
2 loveguru 0
3 drinkwine 1
4 lovefood 1
5 Pringles 0
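One caveat worth hedging on: '|'.join builds a regular expression, so if the list could ever contain regex metacharacters (dots, plus signs, parentheses), escape the entries first:
import re

pattern = '|'.join(map(re.escape, my_list))  # escape regex metacharacters
df['Tag'] = df.Search_Col.str.contains(pattern).astype(int)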
This should do what you are looking for. The final answer is different because the first row in your example does not contain any of the entries in the list you provided. (Perhaps you meant "healthen"?)
import pandas as pd
my_list=["health","nutrition","nature","nutritionist", "nutritionists", "wellness", "food", "drink", "diet"]
df = pd.DataFrame(['heathen','dietcoke','loveguru','drinkwine','lovefood','Pringles'], columns = ['Search_Col'] )
df['Tag'] = df.Search_Col.str.contains('|'.join(my_list)).astype('int')
Gives:
Search_Col Tag
0 heathen 0
1 dietcoke 1
2 loveguru 0
3 drinkwine 1
4 lovefood 1
5 Pringles 0

How to add a new column and fill it up with a specific value depending on another column's series?

I'm new to Pandas but thanks to Add column with constant value to pandas dataframe I was able to add different columns at once with
c = {'new1': 'w', 'new2': 'y', 'new3': 'z'}
df.assign(**c)
However I'm trying to figure out what's the path to take when I want to add a new column to a dataframe (currently 1.2 million rows * 23 columns).
Let's simplify the df a bit and try to make it more clear:
Order Orderline Product
1 0 Laptop
1 1 Bag
1 2 Mouse
2 0 Keyboard
3 0 Laptop
3 1 Mouse
I would like to add a new column: if the Order has at least 1 product == Bag, then it should be 1 (for all rows of that specific order), otherwise 0.
Result would become:
Order Orderline Product HasBag
1 0 Laptop 1
1 1 Bag 1
1 2 Mouse 1
2 0 Keyboard 0
3 0 Laptop 0
3 1 Mouse 0
What I could do is find all the unique order numbers, then filter out the subframe, check the Product column for Bag, and if found set 1 in a new column, otherwise 0, then replace the original subframe with the result.
There is likely a far better way to accomplish this, and a more performant one too.
The main reason I'm trying to do this, is to flatten things down later on. Every order should become 1 line with some values of product. I don't need the information for Bag anymore but I want to keep in my dataframe if the original order used to have a Bag (1) or no Bag (0).
Ultimately when the data is cleaned out it can be used as a base for scikit-learn (or that's what I hope).
If I understand you correctly, you want GroupBy.transform with 'any'.
First we create a boolean Series by checking which rows in Product are Bag with Series.eq. Then we group this boolean Series by Order and check if any of the values in each group are True. We use transform to keep the shape of the initial Series so we can assign the values back.
df['ind'] = df['Product'].eq('Bag').groupby(df['Order']).transform('any').astype(int)
Order Orderline Product ind
0 1 0 Laptop 1
1 1 1 Bag 1
2 1 2 Mouse 1
3 2 0 Keyboard 0
4 3 0 Laptop 0
5 3 1 Mouse 0
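An equivalent sketch without transform, for comparison: collect the orders that contain a Bag and test membership with isin (same result on the sample data; the HasBag name comes from the question):
bag_orders = df.loc[df['Product'].eq('Bag'), 'Order']
df['HasBag'] = df['Order'].isin(bag_orders).astype(int)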
