Eliminate duplicate connections in undirected network - python

I have an undirected network of connections in a dataframe.
Source_ID Target_ID
0 1 5
1 7 2
2 12 6
3 3 9
4 16 11
5 2 7 <------The same as row 1
6 4 8
7 5 1 <------The same as row 0
8 99 81
But since this is an undirected network, row 0 and row 7 are technically the same, as are row 1 and row 5. df.drop_duplicates() doesn't recognize these as duplicates, because it sees them as two distinct rows, at least as far as my attempts have shown.
I also tried what I thought should work, which is using the index of Source_ID and Target_ID and setting Source_ID to be "lower" than target_ID. But that didn't seem to produce the results I needed either.
df.drop(df.loc[df['Target_ID'] < df['Source_ID']]
.index.tolist(), inplace=True)
Therefore, I need to figure out a way to drop the duplicate connections (while keeping the first) such that my fixed dataframe looks like (after an index reset):
Source_ID Target_ID
0 1 5
1 7 2
2 12 6
3 3 9
4 16 11
5 4 8
6 99 81

Certainly not the most efficient, but might do the job:
df.apply(lambda row: pd.Series() if row[::-1].values in df.values \
and row[0] < row[1] else row, axis=1).dropna().reset_index(drop=True)
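A vectorized alternative, offered as a sketch rather than as part of the original answer: sort each (Source_ID, Target_ID) pair so that both directions of a connection look identical, then use duplicated() on the sorted pairs to keep only the first occurrence. The names pairs and deduped are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Source_ID": [1, 7, 12, 3, 16, 2, 4, 5, 99],
    "Target_ID": [5, 2, 6, 9, 11, 7, 8, 1, 81],
})

# Sort each row's pair so that (7, 2) and (2, 7) both become (2, 7).
pairs = pd.DataFrame(
    np.sort(df[["Source_ID", "Target_ID"]].to_numpy(), axis=1),
    index=df.index,
)

# duplicated() flags every repeat after the first occurrence of a pair.
deduped = df[~pairs.duplicated()].reset_index(drop=True)
print(deduped)
```

The original column order within each kept row is preserved, because the sorted copy is used only to build the mask.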

Related

How can I quickly copy values from one dataframe to another dataframe

I would like to create a new column (Col_Val) in my dataframe (Global_Dataset) based on another dataframe (List_Data).
I need faster code because my dataset has 2 million samples and List_Data contains 50,000 samples.
Col_Val must contain the value from the Value column that corresponds to Col_Key.
List_Data:
id Key Value
1 5 0
2 7 1
3 9 2
Global_Dataset:
id Col_Key Col_Val
1 9 2
2 5 0
3 9 2
4 7 1
5 7 1
6 5 0
7 9 2
8 7 1
9 9 2
10 5 0
I have tried this code but it needs a long time to be executed. Is there any other faster way for achieving my goal?
Col_Val = []
for i in range(len(List_Data)):
    for j in range(len(Global_Data)):
        if List_Data.get_value(i, "Key") == Global_Data.get_value(j, 'Col_Key'):
            Col_Val.append(List_Data.get_value(i, 'Value'))
Global_Data['Col_Val'] = Col_Val
PS: I have tried loc and iloc instead of get_value, but they are very slow.
Try this:
data_dict = {key : value for key, value in zip(List_Data['Key'], List_Data['Value'])}
Global_Data['Col_Val'] = pd.Series([data_dict[key] for key in Global_Data['Col_Key']])
I don't know how long it will take on your machine with the amount of data you need to handle, but it should be faster than what you are using now.
You could also generate the dictionary with data_dict = {row['Key'] : row['Value'] for _, row in List_Data.iterrows()}, but on my machine that is slower than what I proposed above.
This works under the assumption that all the keys in Global_Data['Col_Key'] are present in List_Data['Key']; otherwise you will get a KeyError.
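The same dictionary-style lookup can also be expressed with Series.map, which avoids the Python-level list comprehension. This is a minimal sketch on the sample data from the question; the name lookup is illustrative, and it assumes the values in Key are unique.

```python
import pandas as pd

# Sample data from the question.
List_Data = pd.DataFrame({"Key": [5, 7, 9], "Value": [0, 1, 2]})
Global_Data = pd.DataFrame({"Col_Key": [9, 5, 9, 7, 7, 5, 9, 7, 9, 5]})

# Build a Series indexed by Key, then map each Col_Key through it.
lookup = List_Data.set_index("Key")["Value"]
Global_Data["Col_Val"] = Global_Data["Col_Key"].map(lookup)
print(Global_Data)
```

Unlike the dictionary comprehension, map does not raise a KeyError for a key missing from the lookup; it leaves NaN in that row instead.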
There is no reason to loop through anything, either manually or with iterrows. If I understand your problem, this should be a simple merge operation.
df
Key Value
id
1 5 0
2 7 1
3 9 2
global_df
Col_Key
id
1 9
2 5
3 9
4 7
5 7
6 5
7 9
8 7
9 9
10 5
global_df.reset_index()\
.merge(df, left_on='Col_Key', right_on='Key')\
.drop('Key', axis=1)\
.set_index('id')\
.sort_index()
Col_Key Value
id
1 9 2
2 5 0
3 9 2
4 7 1
5 7 1
6 5 0
7 9 2
8 7 1
9 9 2
10 5 0
Note that the essence of this is the global_df.merge(...), but the extra operations are to keep the original indexing and remove unwanted extra columns. I encourage you to try each step individually to see the results.
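For anyone who wants to run the steps, here is a self-contained sketch of the same merge. It differs from the answer in one respect, which is an assumption on my part: how="left" is passed so that every row of global_df is kept in its original order, which makes the final sort_index() unnecessary.

```python
import pandas as pd

# Lookup table and target frame from the example above, both indexed by id.
df = pd.DataFrame(
    {"Key": [5, 7, 9], "Value": [0, 1, 2]},
    index=pd.Index([1, 2, 3], name="id"),
)
global_df = pd.DataFrame(
    {"Col_Key": [9, 5, 9, 7, 7, 5, 9, 7, 9, 5]},
    index=pd.Index(range(1, 11), name="id"),
)

merged = (
    global_df.reset_index()                                   # keep id as a column
    .merge(df, how="left", left_on="Col_Key", right_on="Key")  # look up Value
    .drop(columns="Key")                                      # drop the duplicate key
    .set_index("id")                                          # restore the index
)
print(merged)
```

With a left merge, a Col_Key that has no match in df yields NaN in Value, whereas the default inner merge would drop that row.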

How to modify values which are one row below the values that meet a condition?

Is there an efficient way to change the value of a previous row whenever a condition is met in a subsequent entry? Specifically, I am wondering if there is any way to adapt pandas.where to modify the entry in a row prior or subsequent to the conditional test. Suppose
Data={'Energy':[12,13,14,12,15,16],'Time':[2,3,4,2,5,6]}
DF = pd.DataFrame(Data)
DF
Out[123]:
Energy Time
0 12 2
1 13 3
2 14 4
3 12 2
4 15 5
5 16 6
If I wanted to change the value of Energy to 'X' whenever Time <= 2, I could just do something like:
DF['Energy']=DF['Energy'].where(DF['Time'] >2,'X')
or
DF.loc[DF['Time']<=2,'Energy']='X'
Which would output
Energy Time
0 X 2
1 13 3
2 14 4
3 X 2
4 15 5
5 16 6
But what if I want to change the value of 'Energy' in the row after Time <=2 so that the output would actually be.
Energy Time
0 12 2
1 X 3
2 14 4
3 12 2
4 X 5
5 16 6
Is there an easy modification for a vectorized approach to this?
Shift the values one row down using Series.shift and then compare:
df.loc[df['Time'].shift() <= 2, 'Energy'] = 'X'
df
Energy Time
0 12 2
1 X 3
2 14 4
3 12 2
4 X 5
5 16 6
Side note: I assume 'X' is actually something else here, but FYI, mixing strings and numeric data leads to object type columns which is a known pandas anti-pattern.
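To make the shift step visible and to keep the column numeric, the sketch below builds the mask separately and writes NaN instead of 'X'. Using NaN as the placeholder is an assumption for illustration, not something the question asked for.

```python
import pandas as pd

df = pd.DataFrame({"Energy": [12, 13, 14, 12, 15, 16],
                   "Time": [2, 3, 4, 2, 5, 6]})

# shift() moves Time down one row; row 0 becomes NaN, and NaN <= 2 is False.
mask = df["Time"].shift() <= 2

# Series.mask replaces values where the condition is True (NaN by default).
df["Energy"] = df["Energy"].mask(mask)
print(df)
```

To modify the row before the one that meets the condition instead, shift(-1) can be used in the same way.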

Pandas - Delete only contiguous rows that equal zero

I have a large time series df (2.5 million rows) that contains 0 values in a given column, some of which are legitimate. However, if there are repeated consecutive occurrences of zero values, I would like to remove them from my df.
Example:
Col. A contains [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9] I would like to remove the [0,0,0] and [0,0,0,0] from the middle and leave the remaining 0 to make a new df [1,2,3,0,4,5,1,2,3,0,8,8,9].
The length of zero values before deletion being a parameter that has to be set - in this case > 2.
Is there a clever way to do this in pandas?
It looks like you want to remove the row if it is 0 and either previous or next row in same column is 0. You can use shift to look for previous and next value and compare with current value as below:
result_df = df[~(((df.ColA.shift(-1) == 0) & (df.ColA == 0)) | ((df.ColA.shift(1) == 0) & (df.ColA == 0)))]
print(result_df)
Result:
ColA
0 1
1 2
2 3
3 0
4 4
5 5
9 1
10 2
11 3
12 0
13 8
14 8
19 9
Update for more than 2 consecutive
Following the example in the link below, add a new column that tracks the length of each consecutive run, then filter on it (the threshold of 10 here is only an example; use the run length you need):
# https://stackoverflow.com/a/37934721/5916727
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
df[~((df.consecutive>10) & (df.ColA==0))]
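Applied to the question's sample data, the run-length idea can be sketched as follows. The variable min_run is a hypothetical name for the question's parameter (runs of zeros longer than 2 are removed).

```python
import pandas as pd

df = pd.DataFrame(
    {"ColA": [1, 2, 3, 0, 4, 5, 0, 0, 0, 1, 2, 3, 0, 8, 8, 0, 0, 0, 0, 9]}
)
min_run = 2  # hypothetical parameter: drop zero runs longer than this

# A new run starts whenever the value differs from the previous row.
run_id = (df["ColA"] != df["ColA"].shift()).cumsum()

# Length of the run each row belongs to.
run_len = df.groupby(run_id)["ColA"].transform("size")

# Keep everything except zeros that sit in a run longer than min_run.
result = df[~((df["ColA"] == 0) & (run_len > min_run))]
print(result["ColA"].tolist())
```

Isolated zeros survive because their run length is 1, and runs of non-zero values such as 8, 8 are untouched because the filter also requires the value to be 0.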
We need to build a new grouping column here, then use drop_duplicates:
df['New']=df.A.eq(0).astype(int).diff().ne(0).cumsum()
s=pd.concat([df.loc[df.A.ne(0),:],df.loc[df.A.eq(0),:].drop_duplicates(keep=False)]).sort_index()
s
Out[190]:
A New
0 1 1
1 2 1
2 3 1
3 0 2
4 4 3
5 5 3
9 1 5
10 2 5
11 3 5
12 0 6
13 8 7
14 8 7
19 9 9
Explanation :
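#df.A.eq(0) flags the values equal to 0
#diff().ne(0).cumsum() starts a new group number each time that flag changes, so consecutive zeros share one group; drop_duplicates(keep=False) then removes every zero whose (A, New) pair occurs more than once.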

How to update a value in a dataframe by a specific condition, a specific column and a specific row?

I am trying to update a value in this selector (in a loop):
df.loc[df['wsid']==w,col_name].iloc[int(lag)]
A concrete example (inside the loop) would be:
df.loc[df['wsid']==329,'stp_1'].iloc[0]
I can print the value, but I don't know how to update it:
df.loc[df['wsid']==329,'stp_1'].iloc[0] = 0 ??
This should work:
idx = df.loc[df['wsid']==w].index
df.loc[df.loc[idx, 'wsid'].index[0], 'wsid'] = 0
Explanation
.loc accessor can be used to slice and set parts of a dataframe.
It accepts inputs of the form df.loc[index_labels, column_name]. For more details, see Selection by Label.
The index is extracted only for the subset of data you specify.
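Adapted to the names used in the question (w, col_name and lag), the same idea might look like the sketch below. The sample frame and its values are made up for illustration, and the approach assumes the index labels are unique.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({"wsid": [329, 329, 400, 329],
                   "stp_1": [1.0, 2.0, 3.0, 4.0]})
w, col_name, lag = 329, "stp_1", 1

# Index labels of the rows that satisfy the condition.
matching = df.index[df["wsid"] == w]

# Pick the label at position lag and assign through a single .loc call.
df.loc[matching[int(lag)], col_name] = 0
print(df)
```

Assigning through one .loc call avoids the chained indexing in df.loc[...].iloc[0] = 0, which may write to a temporary copy instead of df.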
It seems like you only want to update a certain cell in a dataframe based on some condition.
Here's the setup -
df = pd.DataFrame({'col' : np.arange(3, 13)})
df
col
0 3
1 4
2 5
3 6
4 7
5 8
6 9
7 10
8 11
9 12
Now, assume you want to find records which are divisible by 3. However, you only want to update the first item that matches this condition. You can use idxmax in this case.
m = df.col.mod(3).eq(0)
df.loc[m.idxmax(), 'col'] = 0
df
col
0 0 # first item matching condition updated
1 4
2 5
3 6
4 7
5 8
6 9
7 10
8 11
9 12
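One caveat worth checking: on a mask that is False everywhere, idxmax still returns the first index label, so the first row would be overwritten even though nothing matched. A guarded version could be sketched like this; zero_first_match is a hypothetical helper name.

```python
import pandas as pd

def zero_first_match(frame, mask, column):
    """Set `column` to 0 in the first row where `mask` is True, if any."""
    if mask.any():  # skip the update entirely when nothing matches
        frame.loc[mask.idxmax(), column] = 0
    return frame

df = pd.DataFrame({"col": [4, 5, 6, 7]})
zero_first_match(df, df["col"].mod(3).eq(0), "col")

no_match = pd.DataFrame({"col": [4, 5, 7]})
zero_first_match(no_match, no_match["col"].mod(3).eq(0), "col")

print(df["col"].tolist(), no_match["col"].tolist())
```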
On the other hand, if it is anything besides the first match, you'll need something a little more involved. For example, to update the fourth row matching the condition (zero-based position i = 3):
i = 3
df.loc[m.mask(~m).dropna().index[i], 'col'] = 0
df
col
0 3
1 4
2 5
3 6
4 7
5 8
6 9
7 10
8 11
9 0 # fourth item matching condition (i = 3) updated
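A slightly shorter way to get the same label, as a sketch: indexing the mask by itself (m[m]) keeps only the True entries, so its index holds exactly the matching labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": np.arange(3, 13)})
m = df["col"].mod(3).eq(0)

i = 3  # zero-based position among the matching rows
# m[m] keeps only the rows where the mask is True.
df.loc[m[m].index[i], "col"] = 0
print(df)
```

If fewer than i + 1 rows match, index[i] raises an IndexError, so the number of matches may need to be checked first.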

Pandas: Merge or join dataframes based on column data?

I am trying to add several columns of data to an existing dataframe. The dataframe itself was built from a number of other dataframes, which I successfully joined on indices, which were identical. For that, I used code like this:
data = p_data.join(r_data)
I actually joined these on a multi-index, so the dataframe looks something like the following, where Name1 and Name2 are indices:
Name1  Name2  present  r     behavior
a      1      1        0     0
       2      1        .5    2
       4      3        .125  1
b      2      1        0     0
       4      5        .25   4
       8      1        0     1
So the Name1 index does not repeat data, but the Name2 index does (I'm using this to keep track of dyads, so that Name1 & Name2 together are only represented once). What I now want to add are 4 columns of data that correspond to Name2 data (information on the second member of the dyad). Unlike the "present" "r" and "behavior" data, these data are per individual, not per dyad. So I don't need to consider Name1 data when merging.
The problem is that while Name2 data are repeated to exhaust the dyad combos, the "Name2" column in the data I would now like to add only has one piece of data per Name2 individual:
Name2 Data1 Data2 Data3
1 80 6 1
2 61 8 3
4 45 7 2
8 30 3 6
What I would like the output to look like:
Name1  Name2  present  r     behavior  Data1  Data2  Data3
a      1      1        0     0         80     6      1
       2      1        .5    2         61     8      3
       4      3        .125  1         45     7      2
b      2      1        0     0         61     8      3
       4      5        .25   4         45     7      2
       8      1        0     1         30     3      6
Despite reading the documentation, I am not clear on whether I can use join() or merge() for the desired outcome. If I try a join to the existing dataframe like the simple one I've used previously, I end up with the new columns but they are full of NaN values. I've also tried various combinations using Name1 and Name2 as either columns or as indices, with either join or merge (not as random as it sounds, but I'm clearly not interpreting the documentation correctly!). Your help would be very much appreciated, as I am presently very much lost.
I'm not sure if this is the best way, but you could use reset_index to temporarily make your original DataFrame indexed by Name2 only. Then you could perform the join as usual. Then use set_index to again make Name1 part of the MultiIndex:
import pandas as pd
df = pd.DataFrame({'Name1':['a','a','a','b','b','b'],
'Name2':[1,2,4,2,4,8],
'present':[1,1,3,1,5,1]})
df.set_index(['Name1','Name2'], inplace=True)
df2 = pd.DataFrame({'Data1':[80,61,45,30],
'Data2':[6,8,7,3]},
index=pd.Series([1,2,4,8], name='Name2'))
result = df.reset_index(level=0).join(df2).set_index('Name1', append=True)
print(result)
# present Data1 Data2
# Name2 Name1
# 1 a 1 80 6
# 2 a 1 61 8
# b 1 61 8
# 4 a 3 45 7
# b 5 45 7
# 8 b 1 30 3
To make the result look even more like your desired DataFrame, you could reorder and sort the index:
print(result.reorder_levels([1,0],axis=0).sort_index())
# present Data1 Data2
# Name1 Name2
# a 1 1 80 6
# 2 1 61 8
# 4 3 45 7
# b 2 1 61 8
# 4 5 45 7
# 8 1 30 3
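An alternative sketch that uses merge instead of join: turn both indices into ordinary columns, merge on Name2, and rebuild the MultiIndex. Passing how="left" is my choice here, so that every dyad is kept in its original order; the sample frames are the ones from the answer above.

```python
import pandas as pd

df = pd.DataFrame({"Name1": ["a", "a", "a", "b", "b", "b"],
                   "Name2": [1, 2, 4, 2, 4, 8],
                   "present": [1, 1, 3, 1, 5, 1]}).set_index(["Name1", "Name2"])
df2 = pd.DataFrame({"Data1": [80, 61, 45, 30],
                    "Data2": [6, 8, 7, 3]},
                   index=pd.Index([1, 2, 4, 8], name="Name2"))

result = (
    df.reset_index()                                   # Name1, Name2 become columns
    .merge(df2.reset_index(), on="Name2", how="left")  # per-individual data by Name2
    .set_index(["Name1", "Name2"])                     # restore the dyad index
)
print(result)
```

Because the per-individual data is looked up by Name2 alone, each individual's row in df2 is repeated for every dyad in which that individual appears.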
