Speeding-up pandas column operation based on several rules - python

I have a data frame consisting of 5.1 million rows. For now, consider only a subset of it, obtained with the query
df_queried = df.query("ID1=='a' or ID2=='Y'")
which has the following form:
date    ID1  ID2
201908  a    X
201905  b    Y
201811  a    Y
201807  a    Z
You can assume that the date is sorted and that there are no duplicates in the subset ['ID1', 'ID2'].
Now, the goal is to check whether there are ID2 duplicates that contain more than one ID1 value. If that’s the case, then assign the most recent ID1 value from that list to a new column for each ID1 in that list.
For the special query of my data frame:
Y is a duplicate of ID2 containing different values for ID1, namely ['a', 'b']. Now, we have to find the most recent value from the list and assign it to the new column for all ID1 values that are in the list.
Output:
date    ID1  ID2  New_ID
201908  a    X    a
201905  b    Y    a
201811  a    Y    a
201807  a    Z    a
where New_ID equals the most recent value of ID1 and follows the following rules:
1. Within each ID2 value, New_ID must have the same value, namely the most recent one.
   Example: This obviously holds for ID2=X and ID2=Z. For ID2=Y there are two values for ID1, {a, b}. b must be overwritten with the most recent ID1 value of this segment.
2. If there is more than one ID1 value within an ID2 value, then find all rows for which ID1 equals one of those values and assign the most recent one.
   Example: For ID2=Y, ID1 contains two values, a and b. Now, for each row with ID1==a or ID1==b, the new column New_ID must equal the most recent value of ID1, independent of ID2.
I am able to achieve this:
date    ID1  ID2  New_ID
201908  a    X    b
201905  b    Y    b
201811  a    Y    b
201807  a    Z    b
using the following loop:
df_queried['New_ID'] = df_queried['ID1']
for v2 in df_queried.ID2.unique():
    # Query data frame by ID2 value
    df_query1 = df_queried.query(f'ID2 == {v2!r}')
    # Get most recent value
    most_recent_val = df_query1.iloc[0, 1]
    # Define unique ID1 values within ID2 query
    unique_ID1_vals = df_query1.ID1.unique()
    # If several ID1 values were found, check if one val
    # also occurs in different ID1 position
    if len(unique_ID1_vals) > 1:
        for v1 in unique_ID1_vals:
            # Get id1 query to check existence of multiple id2's
            df_queried.loc[df_queried['ID1'] == v1, 'New_ID'] = most_recent_val
Now, I can join the actual value a to the new column:
mapping = df_queried.drop_duplicates(subset=['New_ID'])[['ID1', 'New_ID']]
pd.merge(df_queried, mapping.rename(columns={'ID1': 'ID_temp'}), how='left')\
    .drop(columns=['New_ID'])\
    .rename(columns={'ID_temp': 'New_ID'})
which yields the desired result.
However, it takes way too long. I was thinking about a smarter approach. One that mainly relies on joins. But I was not able to find one.
Note: Obviously, I want to operate over the whole data frame not only on the queried one. Therefore, the code must be stable and applicable to the whole data frame. I think my code is, but I did not try it out on the whole data (after 6 hours I killed the kernel). I also tried to use numba, but failed to fully implement it.
I hope my problem is clear.
EDIT 1:
df_queried['New_ID'] = df_queried.groupby('ID2')['ID1'].transform('last')
This approach indeed works for this special case. However, if it is applied to a larger subset of my data, for instance:
date    ID1  ID2  New_ID  New_ID_desired
201908  a    X    a       a
201905  b    Y    a       a
201811  a    Y    a       a
201807  a    Z    a       a
202003  c    H    d       c
202001  d    H    d       c
201907  c    I    c       c
201904  d    J    d       c
the method does not hold anymore. It satisfies rule 1, but not rule 2.
However, when you use my approach, you get:
date ID1 ID2 New_ID
0 201906 a X a
1 201903 b Y a
2 201811 a Y a
3 201802 a Z a
4 202003 c H c
5 202001 d H c
6 201907 c I c
7 201904 d J c

If your data is sorted by date, then I believe what you want is simply:
df['New_ID'] = df.groupby('ID2')['ID1'].transform('last')
output:
date ID1 ID2 New_ID
0 201908 a X a
1 201905 b Y a
2 201811 a Y a
3 201807 a Z a
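It may help to check which row 'last' actually picks. A minimal sketch, assuming the frame is sorted newest-first as in the question: 'last' then returns the oldest row of each ID2 group, and 'first' the newest one. The variable names last_in_group and first_in_group are only illustrative.

```python
import pandas as pd

# Sample from the question, sorted newest-first.
df = pd.DataFrame({
    "date": [201908, 201905, 201811, 201807],
    "ID1": ["a", "b", "a", "a"],
    "ID2": ["X", "Y", "Y", "Z"],
})

# transform("last") broadcasts the last row (in row order) of each ID2 group.
last_in_group = df.groupby("ID2")["ID1"].transform("last")
# transform("first") broadcasts the first row of each ID2 group instead.
first_in_group = df.groupby("ID2")["ID1"].transform("first")

print(pd.DataFrame({
    "ID2": df["ID2"],
    "last": last_in_group,
    "first": first_in_group,
}))
```

For ID2=Y the two aggregations differ (a versus b), so the sort direction of the data decides which of them means "most recent within ID2". Neither one looks across ID2 groups, which is what rule 2 in the question asks for.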

Okay, after googling and thinking about an approach, I finally found one using the library networkx. I want to share it in case someone else faces the same problem. Basically, I have a bipartite graph that I want to decompose into connected components. You can define the following functions and get the desired result as follows:
import pandas as pd
import networkx as nx
from itertools import chain

df_sub = pd.DataFrame(
    data=dict(
        date=[201906, 201903, 201811, 201802, 202003, 202001, 201907, 201904],
        ID1=["a", "b", "a", "a", "c", "d", "c", "d"],
        ID2=["X", "Y", "Y", "Z", "H", "H", "I", "J"]
    )
)

def _graph_decomposition(graph_as_df: pd.DataFrame) -> list:
    # Initialize Graph (in my case, bipartite graph)
    G = nx.Graph()
    # Get connections
    G.add_edges_from(graph_as_df.drop_duplicates().to_numpy().tolist())
    # Create list containing connected components
    connected_components = list(nx.connected_components(G))
    return connected_components

def stabilized_ID(graph_as_df: pd.DataFrame) -> pd.DataFrame:
    components: list = _graph_decomposition(graph_as_df)
    # Chain components -> list of list to only one list
    ID1_mapping = list(chain.from_iterable(components))
    ID1_true = []
    for component in components:
        # Convert set to list
        component = list(component)
        # For my case, ID2 starts always with '0' and ID1 always with 'C'
        # and max(['C', '0999999']) = 'C'
        ID1_true += [max(component)] * len(component)
    # Assert length are equal
    assert len(ID1_true) == len(ID1_mapping)
    # Define final mapping
    mapping = pd.DataFrame(data={'ID1': ID1_mapping, 'ID1_true': ID1_true})
    return mapping

mapping = stabilized_ID(df_sub[['ID1', 'ID2']])
pd.merge(df_sub, mapping, on=['ID1'], how='inner')
This approach takes 40 seconds for my whole data frame of 5.1 million rows (the merge operation alone takes 34 seconds). It produces the following data frame:
date ID1 ID2 ID1_true
0 201906 a X b
1 201811 a Y b
2 201802 a Z b
3 201903 b Y b
4 202003 c H d
5 201907 c I d
6 202001 d H d
7 201904 d J d
Since I made the next steps time-independent, I do not need the most recent value anymore. Now it is only important to me that the new ID (ID1_true above) equals one of the ID1 values in the same connected component, not necessarily the most recent one. If needed, one could also map the most recent ID1 value as described in my question.
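Mapping the most recent ID1 value per connected component could look roughly like the sketch below. It is only an illustration under assumptions: it swaps networkx for a small hand-written union-find so that it needs nothing beyond pandas, and the helper find, the "ID1:"/"ID2:" node prefixes and the component column are names made up for this example.

```python
import pandas as pd

df_sub = pd.DataFrame({
    "date": [201906, 201903, 201811, 201802, 202003, 202001, 201907, 201904],
    "ID1": ["a", "b", "a", "a", "c", "d", "c", "d"],
    "ID2": ["X", "Y", "Y", "Z", "H", "H", "I", "J"],
})

def find(parent: dict, node: str) -> str:
    """Return the root of node, shortening the path while walking up."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node

# Build the components: every row is an edge between an ID1 and an ID2 node.
# Prefixes keep an ID1 label from colliding with an identical ID2 label.
parent = {}
for id1, id2 in zip(df_sub["ID1"], df_sub["ID2"]):
    u, v = f"ID1:{id1}", f"ID2:{id2}"
    parent.setdefault(u, u)
    parent.setdefault(v, v)
    parent[find(parent, u)] = find(parent, v)

# Label each row with the root of its component.
df_sub["component"] = [find(parent, f"ID1:{x}") for x in df_sub["ID1"]]

# Within each component, take the ID1 of the row with the latest date.
df_sub["New_ID"] = (
    df_sub.sort_values("date")
    .groupby("component")["ID1"]
    .transform("last")
)
print(df_sub[["date", "ID1", "ID2", "New_ID"]])
```

On the sample data this assigns a to the component containing a and b, and c to the component containing c and d, which matches the New_ID column produced by the loop in the question.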

Related

Saving small sub-dataframes containing all values associated to a specific 'key' string

I need a suggestion on a procedure using pandas. I have a 2-column dataset that looks like this:
A 0.4533
B 0.2323
A 1.2343
A 1.2353
B 4.3521
C 3.2113
C 2.1233
.. ...
where the first column contains strings and the second one floats. I would like to save the minimum value for each group of unique strings, so that I have the minimum associated with A, B and C. Does anybody have any suggestions on that? It would also help me to somehow store, for each string, all the values associated with it.
Many thanks,
James
Input data:
>>> df
0 1
0 A 0.4533
1 B 0.2323
2 A 1.2343
3 A 1.2353
4 B 4.3521
5 C 3.2113
6 C 2.1233
Use groupby before min:
out = df.groupby(0).min()
Output result:
>>> out
1
0
A 0.4533
B 0.2323
C 2.1233
Update:
filter out all the values in the original dataset that are more than 20% different from the minimum
out = df[df[1] <= df.groupby(0)[1].transform('min') * 1.2]
>>> out
0 1
0 A 0.4533
1 B 0.2323
6 C 2.1233
You can simply do it with
min_A = min(df[df["column_1"] == "A"]["value"])
min_B = min(df[df["column_1"] == "B"]["value"])
min_C = min(df[df["column_1"] == "C"]["value"])
where df is the DataFrame, and column_1 and value are the names of its columns.
You can also use the built-in pandas method groupby():
>>> df.groupby(["column_1"]).min()
This gives the same results.
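For the second part of the question (keeping all values per string), one possible sketch is to collect each group into a list next to its minimum. The column names key and value are assumptions made for this example; the question's data has unnamed columns.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["A", "B", "A", "A", "B", "C", "C"],
    "value": [0.4533, 0.2323, 1.2343, 1.2353, 4.3521, 3.2113, 2.1233],
})

grouped = df.groupby("key")["value"]

# Minimum per key, as a Series indexed by key.
minimum = grouped.min()

# All values per key, kept in their original row order.
all_values = grouped.apply(list)

print(minimum)
print(all_values)
```

Both results are indexed by the key, so minimum["A"] and all_values["A"] look up the entries for one string directly.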

Split Pandas Dataframe Column According To a Value

I searched and couldn't find a problem like mine. If there is one and I somehow missed it, please let me know so I can delete this post.
I am stuck on a problem: splitting a pandas dataframe into different data frames (df) by a value.
I have a dataset inside a text file, and I store it as a pandas dataframe that has only one column. There is more than one set of information inside the dataset, and a certain value marks the end of each set. You can see a sample below:
The Sample Input
In [8]: df
Out[8]:
var1
0 a
1 b
2 c
3 d
4 endValue
5 h
6 f
7 b
8 w
9 endValue
So I want to split this df into different data frames. I couldn't find a way to do that, but I'm sure there must be an easy way. The format I show in the sample output may not be the right one, so if you have a better idea I'd love to see it. Thank you for your help.
The sample output I'd like
var1
{[0 a
1 b
2 c
3 d
4 endValue]},
{[0 h
1 f
2 b
3 w
4 endValue]}
You could check where var1 is endValue, take the cumsum, and use the result as a custom grouper. Then group by it and build a dictionary from the result:
d = dict(tuple(df.groupby(df.var1.eq('endValue').cumsum().shift(fill_value=0.))))
Or for a list of dataframes (effectively indexed in the same way):
l = [v for _,v in df.groupby(df.var1.eq('endValue').cumsum().shift(fill_value=0.))]
print(l[0])
var1
0 a
1 b
2 c
3 d
4 endValue
Another idea, which requires unique index values: replace the index values of non-matching rows with NaN and backfill them, then loop over the groupby object to get a list of DataFrames:
g = df.index.to_series().where(df['var1'].eq('endValue')).bfill()
dfs = [a for i, a in df.groupby(g, sort=False)]
print (dfs)
[ var1
0 a
1 b
2 c
3 d
4 endValue, var1
5 h
6 f
7 b
8 w
9 endValue]
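The cumsum-based grouper can be checked on the sample from the question. This is only a sketch: the names is_end, block_id and parts are illustrative, and the reset_index call is an addition so that each piece is numbered from 0 as in the requested output.

```python
import pandas as pd

df = pd.DataFrame({
    "var1": ["a", "b", "c", "d", "endValue", "h", "f", "b", "w", "endValue"]
})

# True on the rows that close a block.
is_end = df["var1"].eq("endValue")

# cumsum gives 0,0,0,0,1,1,1,1,1,2; shifting by one row moves each
# endValue into the block it terminates: 0,0,0,0,0,1,1,1,1,1.
block_id = is_end.cumsum().shift(fill_value=0).rename("block")

# One DataFrame per block, each re-indexed from 0.
parts = [group.reset_index(drop=True) for _, group in df.groupby(block_id)]

for part in parts:
    print(part, end="\n\n")
```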

How to find occurrences of values in a different column in Python?

I have a pandas DataFrame df that portrays edges of a directed acyclic graph, sorted by Target:
Source Target
C A
D A
A B
C B
D B
E B
E C
C D
E D
I would like to add a column Weight based on occurrences of values.
Weight should be the number of appearances of the Target value in Target divided by the number of appearances of the Source value in Target.
In other words, the first row of the example should have a Weight of 2/1 = 2, since A appears twice in Target while C appears only once in Target.
I have first tried
df.apply(pd.Series.value_counts)
but the problem is my actual DataFrame is extremely large, so I am not able to manually search for each occurrence value from the outcome and make a quotient. I have also tried to write two new columns that signify the values I need, then to write a final column that consists of what I want:
df['tfreq'] = df.groupby('Target')['Target'].transform('count')
df['sfreq'] = df.groupby('Source')['Target'].transform('count')
but it seems like my second line of code returns the occurrences of Source values in Source column instead of Target column.
Are there any insights on this problem?
Use value_counts with map. Then divide them:
val_counts = df['Target'].value_counts()
counts1 = df['Target'].map(val_counts)
counts2 = df['Source'].map(val_counts)
df['Weights'] = counts1.div(counts2) # same as counts1 / counts2
Output
Source Target Weights
0 C A 2.0
1 D A 1.0
2 A B 2.0
3 C B 4.0
4 D B 2.0
5 E B NaN
6 E C NaN
7 C D 2.0
8 E D NaN
note: we get NaN because E does not occur in column Target
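The difference between the groupby attempt from the question and the map lookup can be seen side by side in a small sketch. The variable names sfreq_groupby and sfreq_map are made up for this comparison.

```python
import pandas as pd

df = pd.DataFrame({
    "Source": list("CDACDEECE"),
    "Target": list("AABBBBCDD"),
})

# How often each value occurs in the Target column.
target_counts = df["Target"].value_counts()

# Grouping by Source counts the rows of each Source group,
# i.e. how often the value occurs in the Source column.
sfreq_groupby = df.groupby("Source")["Target"].transform("count")

# Mapping Source through target_counts looks up how often the Source
# value occurs in Target; values never seen in Target become NaN.
sfreq_map = df["Source"].map(target_counts)

print(pd.DataFrame({
    "Source": df["Source"],
    "groupby_count": sfreq_groupby,
    "map_count": sfreq_map,
}))
```

For the first row (Source C), the groupby version gives 3, because C appears three times in Source, while the map version gives 1, because C appears once in Target.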

How to split tuple of tuples into columns

I have a pandas dataframe where one column is a tuple with a nested tuple. The nested tuple has two existing ids. I want to explode every element in the total tuple into new appended columns. Here's my df so far:
df
id1 id2 tuple_of_tuple
0 a e ('cat',100,('a','f'))
1 b f ('dog',100,('b','g'))
2 c g ('cow',100,('d','h'))
3 d h ('tree',100,('c','e'))
I was trying to implement the code below on a small subset of data, and it seemed to work. There were new appended columns with each extracted/exploded element where it needed to be.
df[['Link_1', 'Link_2','Link_3','Link_4']] = df['tuple_of_tuple'].apply(pd.Series)
But when I apply it on the entire dataset, I get the error "ValueError: Columns must be same length as key". (I should mention that there are a couple NaN's littered around, as in an entire entry in the row for the tuple_of_tuple column will just be NaN). How can I fix this?
Here's a compact way to do it using Python's * unpacking (the [*i, *j] list display requires Python 3.5 or later):
df2 = pd.DataFrame(
    data=[[*i, *j] for *i, j in df.pop('tuple_of_tuple')],
    columns=['link_1', 'link_2', 'link_3', 'link_4']
)
You can then link df2 with df using pd.concat:
pd.concat([df, df2], axis=1)
id1 id2 link_1 link_2 link_3 link_4
0 a e cat 100 a f
1 b f dog 100 b g
2 c g cow 100 d h
3 d h tree 100 c e
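The question also mentions rows where the whole tuple_of_tuple entry is NaN, and unpacking a float would fail on those. One possible way to handle them, as a sketch only: the helper flatten is hypothetical, and filling such rows with four NaN values is an assumption about the wanted result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id1": ["a", "b", "c"],
    "id2": ["e", "f", "g"],
    "tuple_of_tuple": [
        ("cat", 100, ("a", "f")),
        np.nan,  # a missing entry, as described in the question
        ("cow", 100, ("d", "h")),
    ],
})

cols = ["link_1", "link_2", "link_3", "link_4"]

def flatten(value):
    """Flatten (x, y, (p, q)) to [x, y, p, q]; anything else becomes NaNs."""
    if isinstance(value, tuple):
        *head, tail = value
        return [*head, *tail]
    return [np.nan] * len(cols)

# Build the new columns on the same index so concat lines the rows up.
df2 = pd.DataFrame(
    [flatten(v) for v in df["tuple_of_tuple"]],
    columns=cols,
    index=df.index,
)
out = pd.concat([df.drop(columns="tuple_of_tuple"), df2], axis=1)
print(out)
```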

Subtract subgroup averages from individuals without resorting to for loop

I have a dataframe with a number of columns, two of which are grouping variables.
>>> df2
Groupvar1 Groupvar2 x y z
0 A 1 0.726317 0.574514 0.700475
1 A 2 0.422089 0.798931 0.191157
2 A 3 0.888318 0.658061 0.686496
....
13 B 2 0.978920 0.764266 0.673941
14 B 3 0.759589 0.162488 0.698958
and I want to make a new dataframe which holds the difference between each datapoint in the original df and the mean corresponding to its subgroup.
So to begin with, I make the new df with the grouped averages:
>>> grp_vars = ['Groupvar1','Groupvar2']
>>> df2_grp = df2.groupby(grp_vars)
>>> df2_grp_avg = df2_grp.mean()
>>> df2_grp_avg
x y z
Groupvar1 Groupvar2
A 1 0.364533 0.645237 0.886286
2 0.325533 0.500077 0.246287
3 0.796326 0.496950 0.510085
4 0.774854 0.688732 0.487547
B 1 0.743783 0.452482 0.612006
2 0.575687 0.396902 0.446126
3 0.473152 0.476379 0.508060
4 0.434320 0.406458 0.382187
and in the new dataframe I want to keep the deltas, defined as:
delta = individual value - average value of the subgroup this individual is a member of
Now, it's clear to me how to do this the hard way (for loop), but I suppose there must be a more elegant solution. I'd appreciate any advice on finding that more elegant solution. TIA.
Use the .groupby(...).transform function:
>>> demean = lambda df: df - df.mean()
>>> df2.groupby(['Groupvar1', 'Groupvar2']).transform(demean)
and then pd.concat the result with the original data frame.
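Put together, the demean-and-concat idea might look like the sketch below. The numbers are made up for illustration (they are not the question's data), and the _delta suffix is an arbitrary naming choice.

```python
import pandas as pd

# Small made-up frame with the same layout as in the question.
df2 = pd.DataFrame({
    "Groupvar1": ["A", "A", "A", "A", "B", "B"],
    "Groupvar2": [1, 1, 2, 2, 1, 1],
    "x": [1.0, 3.0, 2.0, 6.0, 10.0, 20.0],
    "y": [0.0, 4.0, 5.0, 5.0, 1.0, 3.0],
})

grp_vars = ["Groupvar1", "Groupvar2"]
value_cols = ["x", "y"]

# transform("mean") returns the subgroup mean aligned to every original row,
# so the subtraction is a plain element-wise operation.
group_means = df2.groupby(grp_vars)[value_cols].transform("mean")
deltas = df2[value_cols] - group_means

# Keep the grouping columns next to the deltas.
result = pd.concat([df2[grp_vars], deltas.add_suffix("_delta")], axis=1)
print(result)
```

Using the built-in "mean" with transform gives the same deltas as the lambda version while avoiding a Python-level function call per group.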
