Fast splitting of pandas dataframe by column value - python

I have a pandas dataframe:
0 1
0 john 14
1 jack 2
2 emma 6
3 john 23
4 john 53
5 jack 43
that is really large (1+ GB). I want to split the dataframe by name and execute code on each of the resulting dataframes. This is my code, which works:
df.sort_values(by=[0], inplace=True)  # df.sort(columns=[0]) in older pandas
df.set_index(keys=[0], drop=False, inplace=True)
names = df[0].unique().tolist()
for name in names:
    name_df = df.loc[df[0] == name]
    do_stuff(name_df)
However, it runs really slowly. Is there a faster way to accomplish this task?

Here is a dictionary comprehension example that simply sums each sub-dataframe grouped on name (using the integer column labels from the sample frame):
>>> {k: g[1].sum() for k, g in df.groupby(0)}
{'emma': 6, 'jack': 45, 'john': 90}
For something more complicated, you can create a function and then call it on each group:
def foo(df):
    df += 1
    df *= 2
    df = df.sum()
    return df

{k: foo(g[1]) for k, g in df.groupby(0)}
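Tying this back to the original loop: iterating the groups that groupby hands back skips the sort/set_index bookkeeping and the repeated boolean masks entirely. A self-contained sketch on the question's sample data, with do_stuff standing in for the asker's per-name processing:

```python
import pandas as pd

# The question's sample frame: default integer column labels 0 and 1
df = pd.DataFrame({0: ['john', 'jack', 'emma', 'john', 'john', 'jack'],
                   1: [14, 2, 6, 23, 53, 43]})

def do_stuff(name_df):
    # Placeholder for the asker's per-name processing
    return name_df[1].sum()

# One pass over the groups; each name_df is the sub-frame for one name
results = {name: do_stuff(name_df) for name, name_df in df.groupby(0)}
print(results)
```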

Related

Calculate a difference in times between each pair of values in class using Pandas

I am trying to calculate the difference in the "time" column between each pair of elements having the same value in the "class" column.
This is an example of an input:
class name time
0 A Bob 2022-09-05 07:22:15
1 A Sam 2022-09-04 17:18:29
2 B Bob 2022-09-04 03:29:06
3 B Sue 2022-09-04 01:28:34
4 A Carol 2022-09-04 10:40:23
And this is an output:
class name1 name2 timeDiff
0 A Bob Carol 0 days 20:41:52
1 A Bob Sam 0 days 14:03:46
2 A Carol Sam 0 days 06:38:06
3 B Bob Sue 0 days 02:00:32
I wrote this code to solve this problem:
from itertools import combinations
import numpy as np
import pandas as pd

df2 = pd.DataFrame(columns=['class', 'name1', 'name2', 'timeDiff'])
for c in df['class'].unique():
    df_class = df[df['class'] == c]
    groups = df_class.groupby(['name'])['time']
    if len(df_class) > 1:
        out = (pd
               .concat({f'{k1} {k2}': pd.Series(data=np.abs(np.diff([g2.values[0], g1.values[0]])).astype('timedelta64[s]'),
                                                index=[f'{k1} {k2}'], name='timeDiff')
                        for (k1, g1), (k2, g2) in combinations(groups, 2)},
                       names=['name'])
               .reset_index())
        new = out["name"].str.split(" ", n=-1, expand=True)
        out["name1"] = new[0].astype(str)
        out["name2"] = new[1].astype(str)
        out["class"] = c
        del out['level_1'], out['name']
        df2 = df2.append(out, ignore_index=True)
I didn't come up with a solution without going through all the class values in a loop. However, this is very time-consuming if the input table is large. Does anyone have any solutions without using a loop?
The whole thing is a self cross join within each class plus a time difference:
import pandas as pd

df = pd.DataFrame({
    'class': ['A', 'A', 'B', 'B', 'A'],
    'name': ['Bob', 'Sam', 'Bob', 'Sue', 'Carol'],
    'time': [
        pd.Timestamp('2022-09-05 07:22:15'),
        pd.Timestamp('2022-09-04 17:18:29'),
        pd.Timestamp('2022-09-04 03:29:06'),
        pd.Timestamp('2022-09-04 01:28:34'),
        pd.Timestamp('2022-09-04 10:40:23'),
    ]
})

rs = list()
for n, df_g in df.groupby('class'):
    t_df = df_g.merge(
        df_g, how='cross',
        suffixes=('_1', '_2')
    )
    t_df = t_df[t_df['name_1'] != t_df['name_2']]
    t_df = t_df.drop(['class_2'], axis=1)\
               .rename({'class_1': 'class'}, axis=1).reset_index(drop=True)
    t_df['timeDiff'] = abs(t_df['time_1'] - t_df['time_2'])
    t_df = t_df.drop(['time_1', 'time_2'], axis=1)
    rs.append(t_df)
rs_df = pd.concat(rs).reset_index(drop=True)
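One detail: the cross join keeps both orderings of each pair (Bob/Sam and Sam/Bob), while the expected output lists each unordered pair once. Filtering with name_1 < name_2 keeps a single ordering; a sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'class': ['A', 'A', 'B', 'B', 'A'],
    'name': ['Bob', 'Sam', 'Bob', 'Sue', 'Carol'],
    'time': pd.to_datetime(['2022-09-05 07:22:15', '2022-09-04 17:18:29',
                            '2022-09-04 03:29:06', '2022-09-04 01:28:34',
                            '2022-09-04 10:40:23']),
})

rs = []
for _, g in df.groupby('class'):
    t = g.merge(g, how='cross', suffixes=('_1', '_2'))
    t = t[t['name_1'] < t['name_2']]          # keep each unordered pair once
    t['timeDiff'] = (t['time_1'] - t['time_2']).abs()
    t = t.rename(columns={'class_1': 'class'})
    rs.append(t[['class', 'name_1', 'name_2', 'timeDiff']])
pairs = pd.concat(rs).reset_index(drop=True)
print(pairs)
```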
Check the code below: no outer join, just aggregation and itertools.
from itertools import combinations

# Function to create the list of name pairs
def agg_to_list(value):
    return list(list(i) for i in combinations(list(value), 2))

# Function to build the time pairs & calculate the differences between them
def agg_to_list_time(value):
    return [t[0] - t[1] for t in list(combinations(list(value), 2))]

# Apply aggregate functions
updated_df = df.groupby(['class']).agg({'name': agg_to_list, 'time': agg_to_list_time})
# Explode DataFrame & rename column
updated_df = updated_df.explode(['name', 'time']).rename(columns={'time': 'timediff'})
# Unpack name column into two new columns
updated_df[['name1', 'name2']] = pd.DataFrame(updated_df.name.tolist(), index=updated_df.index)
# Final DataFrame
print(updated_df.reset_index()[['class', 'name1', 'name2', 'timediff']])
Output:
  class name1  name2        timediff
0     A   Bob    Sam 0 days 14:03:46
1     A   Bob  Carol 0 days 20:41:52
2     A   Sam  Carol 0 days 06:38:06
3     B   Bob    Sue 0 days 02:00:32

Pandas - How to update non-numeric values to numeric and sum them up?

I have a DF like this:
Student  Age  Grades
Joe      23   A B C
Mark     22   B B C
Ian      24   A B A
As you can see, grades are in a non-numeric format. I would like to:
map each letter from column "grades" to numeric values (A=1, B=2, C=3)
sum up updated values (e.g. A B C = 1,2,3 = 6)
create a new column which would hold summed data
Example of wanted output:
Student  Age  Grades  Sum
Joe      23   A B C   6
Mark     22   B B C   7
Ian      24   A B A   4
How to do this with Pandas?
thanks
You could define a function in which you map the non-numerical grades to a numerical one.
def grades_to_num(grades):
    grades_dict = {"A": 1, "B": 2, "C": 3}
    numgrades = [grades_dict[grade] for grade in grades.split()]
    return sum(numgrades)
You can then apply the function to your df to create a new column with the numerical sum.
df["Sum"] = df["Grades"].apply(grades_to_num)
You can create a dictionary mapping grades to numbers. Then split the Grades column by space, convert each grade to a number using map and the dictionary, and sum the obtained values.
import numpy as np

grade_num = {'A': 1, 'B': 2, 'C': 3}
df['Sum'] = (df['Grades'].str.split(' ')
             .apply(lambda x: np.sum(list(map(lambda g: grade_num[g], x)))))
df
Student Age Grades Sum
0 Joe 23 A B C 6
1 Mark 22 B B C 7
2 Ian 24 A B A 4
I have added a sample example: I create a dataframe (without column names), make a copy of it, and introduce a new column (3) holding the sum. The example uses just pandas.
import pandas as pd

grades = {'A': 1, 'B': 2, 'C': 3}
df = pd.DataFrame([['Joe', 23, 'A B C'], ['Mark', 22, 'B B C']])
new_df = df.copy()
new_df[3] = 0
for i, r in new_df.iterrows():
    grade_list = r[2].split(" ")
    s = 0
    for g in grade_list:
        s += grades[g]
    new_df.loc[i, 3] = s
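The per-row loops above can also be collapsed into a vectorized pipeline: split, explode to one grade per row, map, and sum back per original row. A sketch, with column names following the question:

```python
import pandas as pd

grade_num = {'A': 1, 'B': 2, 'C': 3}
df = pd.DataFrame({'Student': ['Joe', 'Mark', 'Ian'],
                   'Age': [23, 22, 24],
                   'Grades': ['A B C', 'B B C', 'A B A']})

# explode() repeats the original row index, so grouping on level 0
# sums the mapped grades back to one value per student
df['Sum'] = (df['Grades'].str.split()
             .explode()
             .map(grade_num)
             .groupby(level=0)
             .sum())
print(df)
```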

Return True if a partial match succeeds between two columns

I am looking to check for partial-match success. Below is the code for the dataframe:
import pandas as pd

data = [['tom', 10, 'aaaaa', 'aaa'], ['nick', 15, 'vvvvv', 'vv'], ['juli', 14, 'sssssss', 'kk']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age', 'random', 'partial'])
df
I am expecting the output shown below
Name Age random partial Matches
0 tom 10 aaaaa aaa True
1 nick 15 vvvvv vv True
2 juli 14 sssssss kk False
You can use df.apply in combination with a lambda function that checks whether the partial string is part of the other string by using in.
Then we can assign this to a new column in the dataframe:
>>> import pandas as pd
>>> data = [['tom', 10, 'aaaaa', 'aaa'], ['nick', 15, 'vvvvv', 'vv'], ['juli', 14, 'sssssss', 'kk']]
>>> df = pd.DataFrame(data, columns=['Name', 'Age', 'random', 'partial'])
>>> df['matching'] = df.apply(lambda x: x.partial in x.random, axis=1)
>>> print(df)
Name Age random partial matching
0 tom 10 aaaaa aaa True
1 nick 15 vvvvv vv True
2 juli 14 sssssss kk False
One important thing to be aware of when using df.apply is the axis argument, as it here allows us to access all columns of a given row at once.
df['Matches'] = df.apply(lambda row: row['partial'] in row['random'], axis = 'columns')
df
gives
Name Age random partial Matches
0 tom 10 aaaaa aaa True
1 nick 15 vvvvv vv True
2 juli 14 sssssss kk False
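Since apply with axis=1 runs a Python-level call per row anyway, a plain comprehension over the two columns does the same substring check with less overhead. A sketch on the question's data:

```python
import pandas as pd

data = [['tom', 10, 'aaaaa', 'aaa'], ['nick', 15, 'vvvvv', 'vv'], ['juli', 14, 'sssssss', 'kk']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'random', 'partial'])

# Zip the two columns and test substring membership row by row
df['Matches'] = [p in r for p, r in zip(df['partial'], df['random'])]
print(df)
```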

How to quickly select dataframe rows according to multiple column values in pandas

I want to filter rows by multi-column values.
For example, given the following dataframe,
import pandas as pd

df = pd.DataFrame({"name": ["Amy", "Amy", "Amy", "Bob", "Bob"],
                   "group": [1, 1, 1, 1, 2],
                   "place": ['a', 'a', 'a', 'b', 'b'],
                   "y": [1, 2, 3, 1, 2]})
print(df)
Original dataframe:
name group place y
0 Amy 1 a 1
1 Amy 1 a 2
2 Amy 1 a 3
3 Bob 1 b 1
4 Bob 2 b 2
I want to select the samples that satisfy the columns combination [name, group, place] in selectRow.
selectRow = [["Amy", 1, "a"], ["Amy", 2, "b"]]
Then the expected dataframe is :
name group place y
0 Amy 1 a 1
1 Amy 1 a 2
2 Amy 1 a 3
I have tried, but my method is not efficient and runs for a long time, especially when there are many samples in the original dataframe.
My simple method:
newdf = pd.DataFrame({})
for item in selectRow:
    print(item)
    tmp = df.loc[(df['name'] == item[0]) & (df['group'] == item[1]) & (df['place'] == item[2])]
    newdf = newdf.append(tmp)
newdf = newdf.reset_index(drop=True)
newdf.tail()
print(newdf)
Hope for an efficient method to achieve it.
Try using isin:
print(df[df['name'].isin(list(zip(*selectRow))[0]) & df['group'].isin(list(zip(*selectRow))[1]) & df['place'].isin(list(zip(*selectRow))[2])])
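One caveat: the independent isin checks test each column on its own, so a row such as ('Amy', 2, 'a') would pass even though that exact triple is not in selectRow. For exact tuple matching, an inner merge against the wanted combinations is both exact and fast; a sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Amy", "Amy", "Amy", "Bob", "Bob"],
                   "group": [1, 1, 1, 1, 2],
                   "place": ['a', 'a', 'a', 'b', 'b'],
                   "y": [1, 2, 3, 1, 2]})
selectRow = [["Amy", 1, "a"], ["Amy", 2, "b"]]

# Inner merge keeps only rows whose (name, group, place) triple appears in selectRow
keys = pd.DataFrame(selectRow, columns=['name', 'group', 'place'])
newdf = df.merge(keys, on=['name', 'group', 'place'])
print(newdf)
```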

How to insert rows at specific positions into a dataframe in Python?

Suppose you have a dataframe
df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
                   'Age': [28, 34, 29, 42]})
and another dataframe
df1 = pd.DataFrame({'Name': ['Anna', 'Susie'], 'Age': [20, 50]})
as well as a list with indices
pos = [0, 2]
What is the most pythonic way to create a new dataframe df2 where df1 is integrated into df right before the index positions of df specified in pos?
So, the new dataframe should look like this:
df2 =
Age Name
0 20 Anna
1 28 Tom
2 34 Jack
3 50 Susie
4 29 Steve
5 42 Ricky
Thank you very much.
Best,
Nathan
The behavior you are searching for is implemented by numpy.insert. It will not play very well with pandas.DataFrame objects, but no matter: a pandas.DataFrame carries a numpy.ndarray inside it (sort of; depending on the dtypes it may be multiple arrays, but you can think of it as one array, accessible via the .values attribute).
You will simply have to reconstruct the columns of your data-frame, but otherwise, I suspect this is the easiest and fastest way:
In [1]: import pandas as pd, numpy as np

In [2]: df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
   ...:                    'Age': [28, 34, 29, 42]})

In [3]: df1 = pd.DataFrame({'Name': ['Anna', 'Susie'], 'Age': [20, 50]})

In [4]: np.insert(df.values, (0, 2), df1.values, axis=0)
Out[4]:
array([['Anna', 20],
       ['Tom', 28],
       ['Jack', 34],
       ['Susie', 50],
       ['Steve', 29],
       ['Ricky', 42]], dtype=object)
So this returns an array, but that array is exactly what you need to make a data-frame! And you already have the other element, i.e. the columns, on the original data-frame, so you can just do:
In [5]: pd.DataFrame(np.insert(df.values, (0,2), df1.values, axis=0), columns=df.columns)
Out[5]:
Name Age
0 Anna 20
1 Tom 28
2 Jack 34
3 Susie 50
4 Steve 29
5 Ricky 42
So that single line is all you need.
Tricky solution with float indexes:
df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]})
df1 = pd.DataFrame({'Name': ['Anna', 'Susie'], 'Age': [20, 50]}, index=[-0.5, 1.5])
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
result = pd.concat([df, df1]).sort_index().reset_index(drop=True)
print(result)
Output:
Name Age
0 Anna 20
1 Tom 28
2 Jack 34
3 Susie 50
4 Steve 29
5 Ricky 42
Pay attention to index parameter in df1 creation. You can construct index from pos using simple list comprehension:
[x - 0.5 for x in pos]
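Putting the pieces together with pos (a sketch; pd.concat stands in for DataFrame.append, which was removed in pandas 2.0):

```python
import pandas as pd

pos = [0, 2]
df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
                   'Age': [28, 34, 29, 42]})
# Fractional indexes slot each df1 row just before the target position
df1 = pd.DataFrame({'Name': ['Anna', 'Susie'], 'Age': [20, 50]},
                   index=[x - 0.5 for x in pos])   # [-0.5, 1.5]

df2 = pd.concat([df, df1]).sort_index().reset_index(drop=True)
print(df2)
```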
