How to count the instances of unique values in a Python DataFrame

I have a dataframe like below where I have 2 million rows. The sample data can be found here.
The list of matches in every row can contain any number between 1 and 761. I want to count the occurrences of every number from 1 to 761 across the matches column. For example, the result for the above data will be:
If a particular id is not found, then its count will be 0 in the output. I tried a for-loop approach, but it is quite slow.
import pandas as pd

def readData():
    df = pd.read_excel(file_path)
    pattern_match_count = [0] * 761
    for index, row in df.iterrows():
        matches = row["matches"]
        for pattern_id in range(1, 762):
            if pattern_id in matches:
                pattern_match_count[pattern_id - 1] += 1
    return pattern_match_count
Is there any better approach with pandas to make the implementation faster?

You can use the .explode() method to "explode" the lists into new rows.
def readData():
    df = pd.read_excel(file_path)
    return df.loc[:, "matches"].explode().value_counts()
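Note that value_counts() only lists the ids that actually occur. If you also want zeros for the ids between 1 and 761 that never appear, as the question asks, one possible follow-up (a sketch of my own, using inline sample data in place of the Excel file) is to reindex the result:
import pandas as pd

df = pd.DataFrame({"matches": [[1, 2, 3], [1, 3, 3, 4]]})
counts = df["matches"].explode().value_counts()
# reindex over the full id range so that missing ids get a count of 0
counts = counts.reindex(range(1, 762), fill_value=0)
print(counts.head())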

You can use collections.Counter:
df = pd.DataFrame({"matches": [[1,2,3],[1,3,3,4]]})
#df:
# matches
#0 [1, 2, 3]
#1 [1, 3, 3, 4]
from collections import Counter
C = Counter([i for sl in df.matches for i in sl])
#C:
#Counter({1: 2, 2: 1, 3: 3, 4: 1})
pd.DataFrame(C.items(), columns=["match_id", "counts"])
# match_id counts
#0 1 2
#1 2 1
#2 3 3
#3 4 1
If you want zeros for match_ids that aren't in any of the matches, then you can update the Counter object C:
for i in range(1, 762):
    if i not in C:
        C[i] = 0
pd.DataFrame(C.items(), columns=["match_id", "counts"])
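Since a Counter returns 0 for keys it has never seen, an equivalent shortcut (my own sketch, not part of the original answer) is to build the frame directly over the full id range:
# C[i] is 0 for any match_id that never occurs
pd.DataFrame({"match_id": range(1, 762),
              "counts": [C[i] for i in range(1, 762)]})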

How to create new column and assign values by column group

I have a dataframe,
df
uid
1
2
3
...
I want to assign a new column with values 0 or 1, depending on the uid.
df
uid new
1 0
2 0
3 1
..
You must explain the underlying logic.
That said, there are many possible ways.
Considering an explicit mapping with map:
mapper = {1: 0, 2: 0, 3: 1}
df['new'] = df['uid'].map(mapper)
# or
mapper = {0: [1, 2], 1: [3]}
df['new'] = df['uid'].map({k:v for v,l in mapper.items() for k in l})
Or, using a list of targets for the 1s with isin and conversion to int:
target = [3]
df['new'] = df['uid'].isin(target).astype(int)
Output:
   uid  new
0    1    0
1    2    0
2    3    1
If there is a correlation between uid and new, you can create a function to define the mapping between uid and new:
def mapping(value):
    new_value = value // 2
    return new_value
Then
df["new"] = df["uid"].apply(mapping)
Or directly
df["new"] = df["uid"].apply(lambda value: value // 2)
From the three uids shown, the relation I came up with is: a uid that is divisible by 3 is assigned 1, otherwise 0. (Not sure whether this relation is correct, as you have given only 3 values of uid.)
You can apply np.where(condition, x, y): where the condition is satisfied it assigns value x, otherwise value y.
import pandas as pd
import numpy as np
df = pd.DataFrame({'uid': [1, 2, 3]})
df["new"] = np.where(df["uid"] % 3 == 0, 1, 0)
print(df)
Output:
   uid  new
0    1    0
1    2    0
2    3    1

Python iterating through data and returning deltas

Python newbie here with a challenge I'm working to solve...
My goal is to iterate through a data frame and return what changed line by line. Here's what I have so far:
pseudo code (may not be correct method)
step 1: set row 0 to an initial value
step 2: compare row 1 to row 0, add changes to a list and record row number
step 3: set current row to new initial
step 4: compare row 2 to row 1, add changes to a list and record row number
step 5: iterate through all rows
step 6: return a table with changes and row index where change occurred
d = {
    'col1' : [1, 1, 2, 2, 3],
    'col2' : [1, 2, 2, 2, 2],
    'col3' : [1, 1, 2, 2, 2]
}
df = pd.DataFrame(data=d)

def delta():
    changes = []
    initial = df.loc[0]
    for row in df:
        if row[i] != initial:
            changes.append[i]
delta()
changes I expect to see:
index 1: col2 changed from 1 to 2, 2 should be added to the changes list
index 2: col1 and col3 changed from 1 to 2, both 2s should be added to the changes list
index 4: col1 changed from 2 to 3, 3 should be added to the changes list
You can check where each of the columns has changed using the shift method, and then use a mask to keep only the values that have changed:
df.loc[:, 'col1_changed'] = df['col1'].mask(df['col1'].eq(df['col1'].shift()))
df.loc[:, 'col2_changed'] = df['col2'].mask(df['col2'].eq(df['col2'].shift()))
df.loc[:, 'col3_changed'] = df['col3'].mask(df['col3'].eq(df['col3'].shift()))
Once you have identified the changes, you can aggregate them together:
# We don't consider the first row
df.loc[0, ['col1_changed', 'col2_changed', 'col3_changed']] = [np.nan] * 3
df[['col1_changed', 'col2_changed', 'col3_changed']].astype('str').agg(','.join, axis=1).str.replace('nan', 'no change')
#0 no change,no change,no change
#1 no change,2.0,no change
#2 2.0,no change,2.0
#3 no change,no change,no change
#4 3.0,no change,no change
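If there are many columns, repeating the same line per column gets tedious. A generalized sketch of the same idea (my own variant, not from the original answer) compares the whole frame with its shifted version at once:
import pandas as pd

d = {
    'col1' : [1, 1, 2, 2, 3],
    'col2' : [1, 2, 2, 2, 2],
    'col3' : [1, 1, 2, 2, 2]
}
df = pd.DataFrame(data=d)

changed = df.ne(df.shift())   # True wherever a value differs from the previous row
changed.iloc[0] = False       # the first row has nothing to compare against
print(df.where(changed))      # keep the new values only in the cells that changed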
You can use the pandas function diff() which will already provide the increment compared to the previous row:
import pandas as pd
d = {
    'col1' : [1, 1, 2, 2, 3],
    'col2' : [1, 2, 2, 2, 2],
    'col3' : [1, 1, 2, 2, 2]
}
df = pd.DataFrame(data=d)

def delta(df):
    deltas = df.diff()                  # converts to float because NaNs are needed in the first row
    deltas.iloc[0] = df.iloc[0]         # replace NaNs in the first row with the original data
    deltas = deltas.astype(df.dtypes)   # reset data types according to the input data
    filter = (deltas != 0).any(axis=1)  # keep rows where at least one value is nonzero
    filter.iloc[0] = True               # make sure the first row is included even if it held only zeros
    deltas = deltas.loc[filter]         # actually apply the filter
    return deltas
print( delta(df) )
This prints:
   col1  col2  col3
0     1     1     1
1     0     1     0
2     1     0     1
4     1     0     0
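If, as in the expected output, you want to know which columns changed at each index rather than the size of the increments, one possible follow-up (my own sketch built on top of delta()) is:
deltas = delta(df).iloc[1:]   # skip the initial row
report = deltas.ne(0).apply(lambda row: list(deltas.columns[row]), axis=1)
print(report)
# 1          [col2]
# 2    [col1, col3]
# 4          [col1]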

Populating a dictionary of lists with values from 2 columns of a DataFrame

If I have a DataFrame that stores values from 2 columns (A & B) from a CSV file, how would I populate a dictionary using a for loop to get the values from the DataFrame?
I need to store the rows of A, B pairs like this:
A_&_B_sets = {1:[A1,B1],2:[A2,B2],…}
for i in (1,n+1):
    A_&_B_sets[i] = i * I
I am quite lost. Any help greatly appreciated.
If you have df like this:
   Column A  Column B
0         1         4
1         2         5
2         3         6
Then you can do:
A_and_B_sets = {}
for i, (_, row) in enumerate(df.iterrows(), 1):
    A_and_B_sets[i] = [row["Column A"], row["Column B"]]
print(A_and_B_sets)
to print:
{1: [1, 4], 2: [2, 5], 3: [3, 6]}
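For larger frames, iterrows() can be slow. A vectorized alternative (my own sketch, assuming the same column names) builds the same dictionary without an explicit Python loop over rows:
# each row becomes a plain [A, B] list; keys start at 1
A_and_B_sets = dict(enumerate(df[["Column A", "Column B"]].values.tolist(), 1))
print(A_and_B_sets)   # {1: [1, 4], 2: [2, 5], 3: [3, 6]}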

retain function in python

Recently, I have been converting from SAS to Python pandas. One question I have is: does pandas have something like the retain function in SAS, so that I can dynamically reference the last record? In the following code, I have to manually loop through each line and reference the last record. It seems pretty slow compared to the similar SAS program. Is there any way to make it more efficient in pandas? Thank you.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 1, 1, 1], 'B': [0, 0, 1, 0]})
df['C'] = np.nan
df['lag_C'] = np.nan
for row in df.index:
    if row == df.head(1).index:
        df.loc[row, 'C'] = (df.loc[row, 'A'] == 0) + 0
    else:
        if (df.loc[row, 'B'] == 1):
            df.loc[row, 'C'] = 1
        elif (df.loc[row, 'lag_C'] == 0):
            df.loc[row, 'C'] = 0
        elif (df.loc[row, 'lag_C'] != 0):
            df.loc[row, 'C'] = df.loc[row, 'lag_C'] + 1
    if row != df.tail(1).index:
        df.loc[row + 1, 'lag_C'] = df.loc[row, 'C']
This is a complicated algorithm, but I will try a vectorized approach.
If I understand it correctly, a cumulative sum can be used, as in this question. The last column, lag_C, is just column C shifted by one row.
But my approach cannot be used as-is for the first rows of df, because only those rows are counted from the first value of column A and sometimes column B. So I created a helper column D that marks those rows; its values are later copied to the output column C if the conditions are met.
I changed the input data to test the problematic first rows: all three possibilities for the first rows of column B combined with the first row of column A.
My input conditions are:
Columns A and B contain only 1 or 0. Columns C and lag_C are helper columns containing only NaN.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1,1,1,1,1,0,0,1,1,0,0], 'B': [0,0,1,1,0,0,0,1,0,1,0]})
df1 = pd.DataFrame({'A': [1,1,1,1,1,0,0,1,1,0,0], 'B': [0,0,1,1,0,0,0,1,0,1,0]})
#cumulative sum of column B
df1['C'] = df1['B'].cumsum()
df1['lag_C'] = 1
#first 'group' with min value is problematic, copy to column D for later use
df1.loc[df1['C'] == df1['C'].min() ,'D'] = df1['B']
#cumulative sums of groups to column C
df1['C']= df1.groupby(['C'])['lag_C'].cumsum()
#correct problematic states in column C, use value from D
if (df1['A'].loc[0] == 1):
    df1.loc[df1['D'].notnull(), 'C'] = df1['D']
if ((df1['A'].loc[0] == 1) & (df1['B'].loc[0] == 1)):
    df1.loc[df1['D'].notnull(), 'C'] = 0
del df1['D']
#shifted column lag_C from column C
df1['lag_C'] = df1['C'].shift(1)
print(df1)
# A B C lag_C
#0 1 0 0 NaN
#1 1 0 0 0
#2 1 1 1 0
#3 1 1 1 1
#4 1 0 2 1
#5 0 0 3 2
#6 0 0 4 3
#7 1 1 1 4
#8 1 0 2 1
#9 0 1 1 2
#10 0 0 2 1

Selecting rows from pandas DataFrame using a list

I have a list of lists as below
[[1, 2], [1, 3]]
The DataFrame is similar to
   A  B  C
0  1  2  4
1  0  1  2
2  1  3  0
I would like to keep only the rows where the value in column A is equal to the first element of one of the nested lists, and the value in column B of the same row is equal to the second element of that same nested list.
Thus the resulting DataFrame should be
   A  B  C
0  1  2  4
2  1  3  0
The code below does what you need:
import pandas

tmp_filter = pandas.DataFrame(None)  # the dataframe you want

# Create your list and your dataframe
tmp_list = [[1, 2], [1, 3]]
tmp_df = pandas.DataFrame([[1,2,4],[0,1,2],[1,3,0]], columns = ['A','B','C'])

# This function goes through the df column by column and
# only keeps the rows with the values you want
def pass_true_df(df, cond):
    for i, c in enumerate(cond):
        df = df[df.iloc[:, i] == c]
    return df

# Pass through your list and add the rows you want to keep
for i in tmp_list:
    tmp_filter = pandas.concat([tmp_filter, pass_true_df(tmp_df, i)])
import pandas

df = pandas.DataFrame([[1, 2, 4], [0, 1, 2], [1, 3, 0], [0, 2, 5], [1, 4, 0]],
                      columns=['A', 'B', 'C'])
filt = pandas.DataFrame([[1, 2], [1, 3], [0, 2]],
                        columns=['A', 'B'])
accum = []
# grouped to-filter
data_g = df.groupby('A')
for k2, v2 in data_g:
    accum.append(v2[v2.B.isin(filt.B[filt.A == k2])])

print(pandas.concat(accum))
result:
   A  B  C
3  0  2  5
0  1  2  4
2  1  3  0
(I made the data and filter a little more complicated as a test.)
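Another common way to express this kind of pair-wise filter (my own sketch, not taken from the answers above) is an inner merge on the two key columns:
import pandas as pd

df = pd.DataFrame([[1, 2, 4], [0, 1, 2], [1, 3, 0]], columns=['A', 'B', 'C'])
pairs = pd.DataFrame([[1, 2], [1, 3]], columns=['A', 'B'])

# an inner merge keeps only the rows whose (A, B) pair appears in pairs
print(df.merge(pairs, on=['A', 'B']))
Note that merge resets the row index; if you need to keep the original index, the approaches above are a better fit.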
