I am having an issue returning the original df index of a row, given a groupby condition, after subselecting part of the df. It's easier to understand through code.
So if we start with a toy dataframe:
import pandas as pd

headers = ['a','b']
nrows = 8
df = pd.DataFrame(columns=headers)
df['a'] = [0]*(nrows//2) + [1]*(nrows//2)
df['b'] = [2]*(nrows//4) + [4]*(nrows//4) + [2]*(nrows//4) + [4]*(nrows//4)
print(df)
then I select the subset of data I am interested in and check that the index is retained:
sub_df = df[df['a']==1] ## selects for only group 1 (indices 4-7)
print(sub_df.index) ## looks good so far
sub_df.index returns
Int64Index([4, 5, 6, 7], dtype='int64')
Which seems great! I would like to group data from that subset and extract the original df index, and that is where the issue occurs:
For example:
g_df = sub_df.groupby('b')
g_df_idx = g_df.indices
print(g_df_idx) ## bad!
when I print(g_df_idx), it returns the positional indices within sub_df rather than the original labels:
{2: array([0, 1]), 4: array([2, 3])}
What I want it to return is:
{2: array([4, 5]), 4: array([6, 7])}
Due to the way I will be using this code, I can't just groupby(['a','b']).
I'm going nuts with this thing. Here are some of the many solutions I have tried:
## 1
e1_idx = sub_df.groupby('b').indices
# print(e1_idx) ## issue persists
## 2
e2 = sub_df.groupby('b', as_index = True) ## also tried as_index = False
e2_idx = e2.indices
# print(e2_idx) ## issue persists
## 3
e3 = sub_df.reset_index()
e3_idx = e3.groupby('b').indices
# print(e3_idx) ## issue persists
I'm sure there must be some simple solution I'm just overlooking. Would be very grateful for any advice.
You can do it like this:
g_df_idx = g_df.apply(lambda x: x.index).to_dict()
print(g_df_idx)
# {2: Int64Index([4, 5], dtype='int64'), 4: Int64Index([6, 7], dtype='int64')}
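Side note (standard pandas behaviour, though not something the answer above relies on): unlike .indices, which is positional, GroupBy.groups already maps each group key to the original index labels, so a dict comprehension gives the same mapping:

g_df_idx = {k: v.to_numpy() for k, v in sub_df.groupby('b').groups.items()}
print(g_df_idx)
# {2: array([4, 5]), 4: array([6, 7])}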
I'm currently new to coding and need to do pairwise comparisons using Pandas, so I have to find a way to code row-by-row comparisons without any repetitions.
The mock data is as follows:

   id_x last_name_x gender  age_x  id_y last_name_y  age_y
0     1        Vyel   Male     66     1        Vyel     66
1     1        Vyel   Male     66     2  Allsebrook     50
2     1        Vyel   Male     66     3     Prinett     30
3     1        Vyel   Male     66     4       Jinda     31
4     2  Allsebrook   Male     50     1        Vyel     66

I'm comparing males and their ages. However, as seen in the table above, index 1 pairs Vyel with Allsebrook, and index 4 contains the same combination with Allsebrook and Vyel swapped.
Ideally, the desired output would be the table above without index 4, since it repeats the index 1 pairing in reverse order.
I have managed to remove rows containing the same person twice, but is there a way I can code it so I can avoid overlapping comparisons? Would appreciate any feedback. Thank you!
Try this:
import pandas as pd
# Create the DF
id_x = [1, 1, 1, 1, 2]
last_name_x = ["Vyel", "Vyel", "Vyel", "Vyel", "Allsebrook"]
gender = ["Male", "Male", "Male", "Male", "Male"]
age_x = [66, 66, 66, 66, 50]
id_y = [1, 2, 3, 4, 1]
last_name_y = ["Vyel", "Allsebrook", "Prinett", "Jinda", "Vyel"]
age_y = [66, 50, 30, 31, 66]
df = pd.DataFrame({
    'id_x': id_x,
    'last_name_x': last_name_x,
    'gender': gender,
    'age_x': age_x,
    'id_y': id_y,
    'last_name_y': last_name_y,
    'age_y': age_y,
})
# Create the keys in order to check duplicates
df["key_x"] = df["id_x"].astype(str) + df["last_name_x"] + df["age_x"].astype(str)
df["key_y"] = df["id_y"].astype(str) + df["last_name_y"] + df["age_y"].astype(str)
df["combined"] = df['key_x'] + "|" +df['key_y']
s = set(df['key_y'] + "|" + df['key_x'])
d = {}
# mark all of the duplicated rows
def check_if_duplicate(row):
    if row in s:  # a reversed version of this pair exists somewhere in the frame
        if row in d:
            return True  # the reversed pair was already kept, so drop this row
        else:
            # remember the reversed pair so its twin gets flagged later
            arr = row.split("|")
            d[arr[1] + "|" + arr[0]] = 1
            return False
df['duplicate'] = df['combined'].apply(check_if_duplicate)
# drop the duplicates and the rows we added in order to check the duplicates
df = df[df['duplicate'] != True]
df.drop(["key_x", "key_y", "combined", "duplicate"], axis=1, inplace=True)
print(df)
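A shorter alternative (a sketch of my own, not part of the answer above, applied to the original frame before the helper columns are added): treat each row's two people as an unordered pair and keep only the first occurrence of every pair.

# build an order-independent key per row, then mask out repeats
pair = df.apply(lambda r: frozenset([(r['id_x'], r['last_name_x'], r['age_x']),
                                     (r['id_y'], r['last_name_y'], r['age_y'])]), axis=1)
df_dedup = df[~pair.duplicated()]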
I have a list
a = [15, 50, 75]
Using the above list, I have to create smaller dataframes by filtering rows on the index from the main dataframe (the number of rows in each is defined by the list).
Let's say my main dataframe is df.
The dataframes I'd like to have are df1 (row index 0-15), df2 (row index 15-65) and df3 (row index 65-140).
Since these are just three, I can easily use something like this:
limit1 = a[0]
limit2 = a[1] + limit1
limit3 = a[2] + limit2
df1 = df.loc[df.index <= limit1]
df2 = df.loc[(df.index > limit1) & (df.index <= limit2)]
df2 = df2.reset_index(drop=True)
df3 = df.loc[(df.index > limit2) & (df.index <= limit3)]
df3 = df3.reset_index(drop=True)
But what if I want to implement this with a long list on the main dataframe df? I am looking for something iterable like the following (which doesn't work):
df1 = df.loc[df.index <= limit1]
for i in range(2,3):
    for j in range(2,3):
        for k in range(2,3):
            df[i] = df.loc[(df.index > limit[j]) & (df.index <= limit[k])]
            df[i] = df[i].reset_index(drop=True)
            print(df[i])
You could modify your code by building the dataframes iteratively, cutting slices off the end of the main dataframe:
dfs = []  # this list contains your partitioned dataframes
a = [15, 50, 75]
for idx in a[::-1]:
    dfs.insert(0, df.iloc[idx:])  # cut the tail slice off
    df = df.iloc[:idx]            # keep everything before the boundary
dfs.insert(0, df)  # add the last remaining dataframe
print(dfs)
Another option is to use a list comprehension, treating the values as slice boundaries:
a = [0, 15, 50, 75]
dfs = [df.iloc[a[i]:a[i+1]] for i in range(len(a)-1)]
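If the list actually holds row counts rather than absolute boundaries (which is how the question's 0-15 / 15-65 split reads), you could first turn the counts into cumulative boundaries; a small sketch:

from itertools import accumulate

a = [15, 50, 75]                    # row counts per chunk
bounds = [0] + list(accumulate(a))  # [0, 15, 65, 140]
dfs = [df.iloc[bounds[i]:bounds[i+1]].reset_index(drop=True)
       for i in range(len(bounds) - 1)]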
This does it. It's better to use a dictionary if you want to store multiple objects and retrieve them later; creating variables dynamically in a loop is bad practice, so always avoid it.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.linspace(1, 75, 75), columns=['a'])
a = [15, 50, 25]
d = {}
b = 0
for n, i in enumerate(a):
    d[f'df{n}'] = df.iloc[b:b+i]  # slice of i rows starting at offset b
    b += i
After the loop, d holds three dataframes: d['df0'] (rows 0-14), d['df1'] (rows 15-64) and d['df2'] (rows 65-74).
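To pull a slice back out later (hypothetical usage, given the keys built above):

print(d['df1'])  # rows 15-64 of the original frame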
I'm trying to find matching values in a pandas dataframe. Once a match is found I want to perform some operations on the row of the dataframe.
Currently I'm using this code:
import pandas as pd

d = {'child_id': [1, 2, 5, 4], 'parent_id': [3, 4, 2, 3], 'content': ["a", "b", "c", "d"]}
df = pd.DataFrame(data=d)
for i in range(len(df)):
    for j in range(len(df)):
        if str(df['child_id'][j]) == str(df['parent_id'][i]):
            print(df.content[i])
        else:
            pass
It works fine, but is rather slow. Since I'm dealing with a dataset with millions of rows, it would take months. Is there a faster way to do this?
Edit: To clarify, what I want to create is a dataframe which contains the content of the matches.
import pandas as pd

d = {'child_id': [1, 2, 5, 4],
     'parent_id': [3, 4, 2, 3],
     'content': ["a", "b", "c", "d"]}
df = pd.DataFrame(data=d)

df2 = pd.DataFrame(columns=("content_child", "content_parent"))
for i in range(len(df)):
    for j in range(len(df)):
        if str(df['child_id'][j]) == str(df['parent_id'][i]):
            content_child = str(df["content"][i])
            content_parent = str(df["content"][j])
            s = pd.Series([content_child, content_parent], index=['content_child', 'content_parent'])
            # NB: DataFrame.append was removed in pandas 2.0; pd.concat is the replacement there
            df2 = df2.append(s, ignore_index=True)
        else:
            pass
print(df2)
The fastest way is to use the features of numpy:
import numpy as np
import pandas as pd

d = {
    'child_id': [1, 2, 5, 4],
    'parent_id': [3, 4, 2, 3],
    'content': ["a", "b", "c", "d"]
}
df = pd.DataFrame(data=d)

# element-wise comparisons: aligned, and against the reversed column
comp1 = df['child_id'].values == df['parent_id'].values
comp2 = df['child_id'].values[::-1] == df['parent_id'].values
comp3 = df['child_id'].values == df['parent_id'].values[::-1]

if comp1.any() and not comp2.any() and not comp3.any():
    comp = np.c_[df['content'].values[comp1]]
elif comp1.any() and comp2.any() and not comp3.any():
    comp = np.c_[df['content'].values[comp1], df['content'].values[comp2]]
elif comp1.any() and comp2.any() and comp3.any():
    comp = np.c_[df['content'].values[comp1], df['content'].values[comp2], df['content'].values[comp3]]
else:
    comp = np.array([])  # no branch matched, as with this sample data

print(comp)
Which outputs:
[]
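For the cross-row matching the question actually describes, though, a self-merge is usually the simplest vectorised route. This is a sketch of my own, not part of the answer above:

# join every parent_id to the matching child_id in one pass
pairs = df.merge(df, left_on='parent_id', right_on='child_id',
                 suffixes=('_child', '_parent'))
df2 = pairs[['content_child', 'content_parent']]
print(df2)
#   content_child content_parent
# 0             b              d
# 1             c              b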
I am using pandas and trying to do an assignment using nested loops. I iterate over a dataframe and then run a distance function if a row meets a certain criterion. I am faced with two problems:
1. SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
2. MemoryError: it doesn't work on large datasets, and I end up having to terminate the process.
How should I change my solution to ensure it can scale with a larger dataset of 60,000 rows?
for i, row in df.iterrows():
    listy = 0
    school = []
    if row['LS_Type'] == 'Primary (1-4)':
        a = row['Northing']
        b = row['Easting']
        LS_ID = row['LS_ID']
        for j, row2 in df.iterrows():
            if row2['LS_Type'] == 'Primary (1-8)':
                dist_km = distance(a, b, df.Northing[j], df.Easting[j])
                if (listy == 0):
                    listy = dist_km
                    school.append([df.LS_Name[j], df.LS_ID[j]])
                else:
                    if dist_km < listy:
                        listy = dist_km
                        school[0] = [df.LS_Name[j], int(df.LS_ID[j])]
        df['dist_up_prim'][i] = listy
        df["closest_up_prim"][i] = school[0]
    else:
        df['dist_up_prim'][i] = 0
The double for loop is what's killing you here. See if you can break it up into two separate apply steps.
Here is a toy example of using df.apply() and partial to do a nested for loop:
import numpy as np
import pandas as pd
from functools import partial

df = pd.DataFrame.from_dict({'A': [1, 2, 3, 4, 5, 6, 7, 8],
                             'B': [1, 2, 3, 4, 5, 6, 7, 8]})

def myOtherFunc(row):
    if row['A'] <= 4:
        return row['B'] * row['A']

def myFunc(the_df, row):
    if row['A'] <= 2:
        other_B = the_df.apply(myOtherFunc, axis=1)
        return other_B.mean()
    return np.nan

apply_myFunc_on_df = partial(myFunc, df)
df.apply(apply_myFunc_on_df, axis=1)
You can rewrite your code in this form, which will be much faster.
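If the real bottleneck is the nearest-neighbour search over 60,000 rows, a spatial index is another option. This is a sketch under assumptions of my own: projected Northing/Easting coordinates in metres, scipy available, and plain Euclidean distance standing in for the question's distance() helper.

import numpy as np
from scipy.spatial import cKDTree

# split the schools by type, then query every nearest neighbour in one call
src = df[df['LS_Type'] == 'Primary (1-4)']
dst = df[df['LS_Type'] == 'Primary (1-8)']

tree = cKDTree(dst[['Northing', 'Easting']].to_numpy())
dist, idx = tree.query(src[['Northing', 'Easting']].to_numpy(), k=1)

# writing through .loc also avoids the SettingWithCopyWarning from chained indexing
df.loc[src.index, 'dist_up_prim'] = dist / 1000.0  # metres -> km
df.loc[src.index, 'closest_up_prim'] = dst['LS_Name'].to_numpy()[idx]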
I have recently asked a question on applying select_dtypes for specific columns of a data frame.
I have this data frame that has different dtypes on its columns (str and int in this case).
import pandas as pd

df = pd.DataFrame([
    [-1, 3, 0],
    [5, 2, 1],
    [-6, 3, 2],
    [7, '<blank>', 3],
    ['<blank>', 2, 4],
    ['<blank>', '<blank>', '<blank>']], columns='A B C'.split())
I want to create different masks for strings and integers. And then I will apply stylings based on these masks.
First, let's define a function that will help me create my mask for different dtypes (thanks to @jpp):
def filter_type(s, num=True):
    s_new = pd.to_numeric(s, errors='coerce')
    if num:
        return s_new.notnull()
    else:
        return s_new.isnull()
Then our first mask will be:
mask1 = filter_type(df['A'], num=False) # working and creating the bool values
The second mask will be based on an interval of integers:
mask2 = df['A'].between(7 , 0 , inclusive=False)
But when I run the mask2 it gives me the error:
TypeError:'>' not supported between instances of 'str' and 'int'
How can I overcome this issue?
Note: the stylings I would like to apply are like below:
def highlight_col(x):
    df = x.copy()
    mask1 = filter_type(df['A'], num=False)
    mask2 = df['A'].between(7, 0, inclusive=False)
    x.loc[mask1, ['A', 'B', 'C']] = 'background-color: ""'
    x.loc[mask2, ['A', 'B', 'C']] = 'background-color: #7fbf7f'
pd.DataFrame.loc is used to set values; you need pd.DataFrame.style to set styles. In addition, you can use try / except to identify when numeric comparisons fail.
Here's a minimal example:
def styler(x):
    res = []
    for i in x:
        try:
            if 0 <= i <= 7:
                res.append('background: red')
            else:
                res.append('')
        except TypeError:
            # non-numeric values such as '<blank>' cannot be compared to ints
            res.append('')
    return res
res = df.style.apply(styler, axis=1)
The result is a styled DataFrame in which every cell whose numeric value lies between 0 and 7 gets a red background, while non-numeric cells are left unstyled.
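In a notebook, res renders directly; elsewhere you can export the styled table, for example with Styler.to_html() (available in recent pandas versions):

print(res.to_html())  # dumps the styled table as an HTML string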