I have a pandas DataFrame like so:
from_user to_user
0 123 456
1 894 135
2 179 890
3 456 123
Where each row contains two IDs indicating that the from_user "follows" the to_user. How can I count the total number of mutual-follow pairs in the DataFrame using pandas?
In the example above, the answer should be 1 (users 123 & 456).
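For anyone who wants to try the answers below, the example frame can be re-created by re-keying the values shown above:

import pandas as pd

df = pd.DataFrame({'from_user': [123, 894, 179, 456],
                   'to_user': [456, 135, 890, 123]})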
One way is to use MultiIndex set operations:
In [11]: i1 = df.set_index(["from_user", "to_user"]).index
In [12]: i2 = df.set_index(["to_user", "from_user"]).index
In [13]: (i1 & i2).levels[0]
Out[13]: Int64Index([123, 456], dtype='int64')
To get the count you have to divide the length of this index by 2:
In [14]: len(i1 & i2) // 2
Out[14]: 1
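On more recent pandas versions the & operator on Index objects no longer means set intersection (it was deprecated in favour of an explicit method call), so the same idea can be written as follows; this is a sketch assuming the df from the question:

i1 = df.set_index(['from_user', 'to_user']).index
i2 = df.set_index(['to_user', 'from_user']).index

mutual = i1.intersection(i2)  # every mutual pair shows up twice (A->B and B->A)
count = len(mutual) // 2      # 1 for the example data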
Another way is to concatenate the two IDs as strings, sort the characters of each concatenation, and then count how many times each value occurs:
# concat the values as string type
df['concat'] = df.from_user.astype(str) + df.to_user.astype(str)
# sort the string values of the concatenation
df['concat'] = df.concat.apply(lambda x: ''.join(sorted(x)))
# count the occurrences of each and subtract 1
count = (df.groupby('concat').size() - 1).sum()
count
Out[64]: 1
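Note that sorting the characters of the concatenated string can, in principle, collide for different pairs; a variant of the same counting idea (my own tweak, not part of the answer above) is to sort the two IDs as a tuple instead:

# build an order-independent key per row, then count duplicates
pairs = df.apply(lambda r: tuple(sorted((r['from_user'], r['to_user']))), axis=1)
count = (pairs.value_counts() - 1).sum()  # 1 for the example data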
Here is another slightly more hacky way to do this:
(df.loc[df.to_user.isin(df.from_user)]
   .assign(hacky=df.from_user * df.to_user)
   .drop_duplicates(subset='hacky', keep='first')
   .drop(columns='hacky'))
from_user to_user
0 123 456
The whole multiplication hack exists to ensure we don't return both 123 --> 456 and 456 --> 123, since both rows satisfy the condition we pass to loc.
Related
I ran into this specific problem where I have a dataframe of ID numbers. Some of these account numbers have dropped leading zeros.
dataframe is df.
ID
345
345
543
000922
000345
000345
000543
What I'm trying to do is create a generalized way to check whether leading zeros have been dropped. My real data set has millions of rows, so I want to use a pandas method that, whenever a bare ID matches a zero-padded ID, puts those rows into another dataframe so I can examine them further.
I do that like this:
new_df = df.loc[df['ID'].isin(df['ID'])]
My reasoning for this is that I want to filter the dataset to find whether any of the bare IDs appear among the zero-padded IDs.
Now I have
ID
345
345
543
000345
000345
000543
I can use a .unique() to get a series of each unique combo.
ID
345
543
000345
000543
This is fine for a small dataset. But for rows of millions, I am wondering how I can make it easier to do this check.
I'm trying to find a way to create a dictionary where the keys are the bare IDs (e.g. the 3-digit ones) and the values are their full, zero-padded IDs, or vice versa.
Any tips on that would be appreciated.
If anyone also has tips on a different approach to checking for dropped zeros, other than the dictionary approach, that would be helpful too.
Note: It is not always 3 digits. Could be 4567 for example, where the real value would be 004567.
One option is to strip leading "0"s:
out = df['ID'].str.lstrip('0').unique()
Output:
array(['345', '543', '922'], dtype=object)
or prepend "0"s:
out = df['ID'].str.zfill(df['ID'].str.len().max()).unique()
Output:
array(['000345', '000543', '000922'], dtype=object)
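If the goal is specifically to pull out the zero-padded IDs whose stripped form also appears un-padded in the data, a sketch building on the same lstrip idea (suspects is just an illustrative name):

stripped = df['ID'].str.lstrip('0')
# zero-padded IDs whose stripped form also occurs as a bare ID
suspects = df[df['ID'].str.startswith('0') & stripped.isin(df['ID'])]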
Use:
print (df)
ID
0 345
1 345
2 540
3 2922
4 002922
5 000344
6 000345
7 000543
# filter IDs starting with 0 into a Series
d = df.loc[df['ID'].str.startswith('0'), 'ID']
# index the Series by the ID with the leading zeros stripped
d.index = d.str.lstrip('0')
print (d)
ID
2922 002922
344 000344
345 000345
543 000543
Name: ID, dtype: object
# dict of all possible values
print (d.to_dict())
{'2922': '002922', '344': '000344', '345': '000345', '543': '000543'}
# keep only the entries whose stripped index exists in the original ID column and create a dict
d = d[d.index.isin(df['ID'])].to_dict()
print (d)
{'2922': '002922', '345': '000345', '543': '000543'}
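As a possible follow-up (not part of the original answer), that dictionary can be used to attach the padded form back onto the bare IDs:

# map bare IDs to their zero-padded counterparts, leaving other IDs unchanged
df['full_ID'] = df['ID'].map(d).fillna(df['ID'])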
Create a dictionary for finding potentially affected records.
# Creates a dummy dataframe.
df = pd.DataFrame(['00456', '0000456', '567', '00567'], columns=['ID'])
df['stripped'] = pd.to_numeric(df['ID'])
df['affected_id'] = df.ID.str.len() == df.stripped.astype(str).str.len()
df
ID stripped affected_id
0 00456 456 False
1 0000456 456 False
2 567 567 True
3 00567 567 False
# Creates a dictionary of potentially affected records.
d = dict()
for i in df[df.affected_id == True].stripped.unique():
    d[i] = df[(df.stripped == i) & (df.ID != str(i))].ID.unique().tolist()
d
{567: ['00567']}
If you want to include the stripped records into the list, then:
for i in df[df.affected_id == True].stripped.unique():
    d[i] = df[df.stripped == i].ID.unique().tolist()
d
{567: ['567', '00567']}
You can convert the column to int and compare it back (as a string) against the original values:
m = df['ID'].ne(df['ID'].astype(int).astype(str))
print(m)
0 False
1 False
2 False
3 True
4 True
5 True
Name: ID, dtype: bool
print(df[m])
ID
3 000345
4 000345
5 000543
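If you then want the dictionary the question asks about, the same mask can feed it directly (a sketch; the int-and-back conversion mirrors the mask above):

# stripped form -> padded form, only for the rows flagged by m
lookup = dict(zip(df.loc[m, 'ID'].astype(int).astype(str), df.loc[m, 'ID']))
# {'345': '000345', '543': '000543'}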
I have a pandas DataFrame and would like to compute one column based on two others from the same DataFrame. I would like to use NumPy vectorisation for this, as the dataset is large.
Here is the dataframe:
Input Dataframe
A B
0 567 345
1 123 456
2 568 354
Output Dataframe
A B C
0 567 345 567.345
1 123 456 123.456
2 568 354 568.354
where column C is a concatenation of A and B with a dot between the two values.
I am using apply():
df['C'] = df.apply(lambda row: str(row['A']) + '.' + str(row['B']), axis=1)
instead of iterating over rows/index etc., but it is still slow.
I know that I could do:
df['C'] = df['A'].values + df['B'].values
which is much faster, but it does not give me the desired result, and at the same time:
df['C'] = str(df['A'].values) + '.' + str(df['B'].values)
will give me something completely different.
The example is just for presentation purposes (the values of A and B could be of any type). The question is more general.
Thank you in advance!
A list comprehension should be faster than apply for such a use case:
df['C'] = [f"{a}.{b}" for a, b in zip(df['A'], df['B'])]
Outputs
A B C
0 567 345 567.345
1 123 456 123.456
2 568 354 568.354
To convert numbers to strings you can use the method astype():
df['A'].astype('str') + '.' + df['B'].astype('str')
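Another vectorised spelling of the same idea (an alternative I would add, not from the answer above) uses Series.str.cat:

df['C'] = df['A'].astype(str).str.cat(df['B'].astype(str), sep='.')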
I have a data frame with duplicate rows ('id').
I want to aggregate the data, but first need to sum unique sessions per id.
id session
123 X
123 X
123 Y
123 Z
234 T
234 T
This code works well, but not when I want to add this new column 'ncount' to my data frame.
df['ncount'] = df.groupby('id')['session'].nunique().reset_index()
I tried using transform and it didn't work.
df['ncount'] = df.groupby('id')['session'].transform('nunique')
This is the result from the transform code (my data has duplicate ids):
id session ncount
123 X 1
123 X 1
123 Y 1
123 Z 1
234 T 1
234 T 1
This is the result I'm interested in:
id session ncount
123 X 3
123 X 3
123 Y 3
123 Z 3
234 T 1
234 T 1
Use the following steps:
1. Group the data and store the result in a separate variable.
2. Then merge it back into the original data frame.
Code:
import pandas as pd
df = pd.DataFrame({"id":[123,123,123,123,234,234],"session":["X","X","Y","Z","T","T"]})
x = df.groupby(["id"])['session'].nunique().reset_index()
res = pd.merge(df,x,how="left",on="id")
print(res)
You can rename the columns if required.
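For example, with pandas' default _x/_y merge suffixes the rename could look like this (illustrative names only):

res = res.rename(columns={'session_x': 'session', 'session_y': 'ncount'})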
Using .count()
Steps:
1. Group the data by "id" and count the session values, then
2. Decrease the count by one and merge the two DataFrames.
import pandas as pd
df = pd.DataFrame({"id":[123,123,123,123,234,234],"session":["X","X","Y","Z","T","T"]})
uniq_df = df.groupby(["id"])["session"].count().reset_index()
uniq_df["session"] = uniq_df["session"] - 1
result = pd.merge(df,uniq_df,how="left",on="id")
print(result)
I have a pandas dataframe that looks something like this:
Item Status
123 B
123 BW
123 W
123 NF
456 W
456 BW
789 W
789 NF
000 NF
And I need to create a new column Value which will be either 1 or 0 depending on the values in the Item and Status columns. The assignment of the value 1 is prioritized by this order: B, BW, W, NF. So, using the sample dataframe above, the result should be:
Item Status Value
123 B 1
123 BW 0
123 W 0
123 NF 0
456 W 0
456 BW 1
789 W 1
789 NF 0
000 NF 1
Using Python 3.7.
Taking your original dataframe as the input df, the following code will produce your desired output:
import numpy as np

# dictionary assigning order of priority to status values
priority_map = {'B': 1, 'BW': 2, 'W': 3, 'NF': 4}
# new temporary column that converts Status values to order-of-priority values
df['rank'] = df['Status'].map(priority_map)
# create dictionary with Item as key and the lowest rank value per Item as value
lowest_val_dict = df.groupby('Item')['rank'].min().to_dict()
# new column that assigns the same Value to all rows per Item
df['Value'] = df['Item'].map(lowest_val_dict)
# replace Values where the rank is different with 0's
df['Value'] = np.where(df['Value'] == df['rank'], 1, 0)
# delete the rank column
del df['rank']
I would prefer an approach where the status is an ordered pd.Categorical, because a) that's what it is and b) it's much more readable: if you have that, you just compare if a value is equal to the max of its group:
df['Status'] = pd.Categorical(df['Status'], categories=['NF', 'W', 'BW', 'B'],
ordered=True)
df['Value'] = df.groupby('Item')['Status'].apply(lambda x: (x == x.max()).astype(int))
# Item Status Value
#0 123 B 1
#1 123 BW 0
#2 123 W 0
#3 123 NF 0
#4 456 W 0
#5 456 BW 1
#6 789 W 1
#7 789 NF 0
#8 0 NF 1
I might be able to help you conceptually, by explaining the steps I would take:
Create the new column Value and fill it with zeros (e.g. np.zeros() or simply df['Value'] = 0).
Group the dataframe by Item with grouped = df.groupby('Item').
Iterate through all the groups found: for name, group in grouped:
Using a simple function with ifs, a custom priority queue, custom sorting criteria, or any other preferred method, determine which entry has the highest priority ("the value 1 is prioritized by this order: B, BW, W, NF") and assign 1 to its Value cell: df.loc[entry, 'Value'] = 1
Let's say we are looking at group '123':
Item Status Value
-------------------------
123 B 0 (before 0, after 1)
123 BW 0
123 W 0
123 NF 0
Because the row [123, 'B', 0] had the highest priority based on your criteria, you change it to [123, 'B', 1]
When finished, build the dataframe back from the groupby object if needed, and you're done. You have a lot of possibilities for doing that; you might check here: Converting a Pandas GroupBy object to DataFrame
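A rough sketch of those steps in code (my reading of the outline above, writing directly into df instead of rebuilding it from the groupby object, and using idxmax on a mapped priority instead of a hand-rolled queue):

import pandas as pd

df = pd.DataFrame({'Item': ['123', '123', '123', '123', '456', '456', '789', '789', '000'],
                   'Status': ['B', 'BW', 'W', 'NF', 'W', 'BW', 'W', 'NF', 'NF']})
priority = {'B': 4, 'BW': 3, 'W': 2, 'NF': 1}

df['Value'] = 0
for name, group in df.groupby('Item'):
    # index label of the highest-priority status in this group
    best = group['Status'].map(priority).idxmax()
    df.loc[best, 'Value'] = 1

Because the loop writes into df directly, there is no need to reassemble anything from the groupby object afterwards.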
If I have a dataframe and want to drop any rows where the value in one column is not an integer how would I do this?
The alternative is to drop rows if the value is not within the range 0-2, but since I am not sure how to do either of them I was hoping someone else might.
Here is what I tried, but it didn't work and I'm not sure why:
df = df[(df['entrytype'] != 0) | (df['entrytype'] !=1) | (df['entrytype'] != 2)].all(1)
There are 2 approaches I propose:
In [212]:
df = pd.DataFrame({'entrytype':[0,1,np.NaN, 'asdas',2]})
df
Out[212]:
entrytype
0 0
1 1
2 NaN
3 asdas
4 2
If the range of values is as restricted as you say then using isin will be the fastest method:
In [216]:
df[df['entrytype'].isin([0,1,2])]
Out[216]:
entrytype
0 0
1 1
4 2
Otherwise we could cast to a str and then call .isdigit()
In [215]:
df[df['entrytype'].apply(lambda x: str(x).isdigit())]
Out[215]:
entrytype
0 0
1 1
4 2
str("-1").isdigit() is False
str("-1").lstrip("-").isdigit() works but is not nice.
df.loc[df['Feature'].str.match(r'^[+-]?\d+$')]
For your question, the reverse set:
df.loc[~df['Feature'].str.match(r'^[+-]?\d+$')]
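Applied to the sample entrytype column from the earlier answer (Feature above is just a placeholder column name), casting to str first so NaN and numeric values don't break the .str accessor:

import numpy as np
import pandas as pd

df = pd.DataFrame({'entrytype': [0, 1, np.nan, 'asdas', 2]})
kept = df[df['entrytype'].astype(str).str.match(r'^[+-]?\d+$')]  # keeps rows 0, 1 and 4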
There are multiple ways to do the same thing, but I found these methods easy and efficient.
Quick Examples
#Using drop() to delete rows based on column value
df.drop(df[df['Fee'] >= 24000].index, inplace = True)
# Keep only rows where Fee >= 24000 (boolean indexing)
df2 = df[df.Fee >= 24000]
# If you have a space in the column name,
# specify the column name within single quotes
df2 = df[df['column name'] >= 24000]
# Using loc
df2 = df.loc[df["Fee"] >= 24000 ]
# Filter rows based on multiple column values
df2 = df[ (df['Fee'] >= 22000) & (df['Discount'] == 2300)]
# Drop rows with None/NaN
df2 = df[df.Discount.notnull()]
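Applied to this question's entrytype column, the same drop pattern would look like this (a sketch reusing the isin idea from the earlier answer):

# drop rows whose entrytype is not 0, 1 or 2
df.drop(df[~df['entrytype'].isin([0, 1, 2])].index, inplace=True)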