Iterate to find the repeat values in Pandas dataframe - python

Windows 10, Python 3.6
I have a dataframe df:
df = pd.DataFrame({'name': ['boo', 'foo', 'too', 'boo', 'roo', 'too'],
                   'zip': ['30004', '02895', '02895', '30750', '02895', '02895']})
I want to find the repeated records that have the same 'name' and 'zip', and record how many times they repeat. The ideal output is:
  name  repeat    zip
0  too       1  02895
Because my dataframe has many more than six rows, I need an iterative method. I appreciate any tips.

I believe you need to group by all columns and use GroupBy.size:
#create DataFrame from an online source
#df = pd.read_csv('someonline.csv')
#df = pd.read_html('someurl')[0]

#or build a list in a loop and create the DataFrame from the constructor
#L = []
#for x in iterator:
#    L.append(x)  # in the loop, add data to the list
#df = pd.DataFrame(L)

df = df.groupby(df.columns.tolist()).size().reset_index(name='repeat')
#if you need to specify columns:
#df = df.groupby(['name','zip']).size().reset_index(name='repeat')
print(df)
  name    zip  repeat
0  boo  30004       1
1  boo  30750       1
2  foo  02895       1
3  roo  02895       1
4  too  02895       2
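If, like the ideal output above, you only want the rows that actually repeat, a small follow-up filter (my addition, not part of the original answer) keeps the groups that occur more than once; note that size counts total occurrences, so 'too' shows 2 rather than 1:
repeats = df[df['repeat'] > 1]
print(repeats)
#   name    zip  repeat
# 4  too  02895       2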

Pandas has a handy .duplicated() method that can help you identify duplicates.
df.duplicated()
By passing the resulting boolean Series into a selection you can get the duplicated records:
df[df.duplicated()]
You can get the number of duplicated records by using .sum():
df.duplicated().sum()
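Put together on the question's sample frame (a quick sketch; by default .duplicated() marks the second and later occurrences as duplicates):
import pandas as pd

df = pd.DataFrame({'name': ['boo', 'foo', 'too', 'boo', 'roo', 'too'],
                   'zip': ['30004', '02895', '02895', '30750', '02895', '02895']})

print(df[df.duplicated()])    # the repeated record
#   name    zip
# 5  too  02895
print(df.duplicated().sum())  # 1 row is a duplicate of an earlier one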

Related

Pandas Series of lists of strings how to count and append to df per row

Summarize the problem
I have a DataFrame with a Series of lists of strings.
I can use a Counter from 'collections' to iterate over each row of the series and count the times each string appeared, easy enough:
from collections import Counter

for list_of_strings in df['info']:
    counter = Counter()
    for string in list_of_strings:
        counter[string] += 1
What I want is to mutate the df so it has as many columns as there are unique strings found across all of the lists in the df['info'] series. The column names would be the string values.
For each row in the df, the new columns' values would be either 0, if the string was not found, or N, where N is the count of how many times the string was found in that row's list_of_strings.
Describe what you’ve tried
I was able to make a df whose number and names of columns match the unique strings found, but I can't figure out how to append the counts per row, and I'm not even sure if crafting and appending to a brand new dataframe is the right way to go about it:
unique_df = pd.Series(copy_df['info'].explode().unique()).to_frame().T
I tried doing something using df.at() for each counter key but it exploded my Jupyter notebook :\
Any help is appreciated, let me know what other info I can provide.
Assuming this input Series:
s = pd.Series([['A','B','A'],['C','B'],['D']])
You can explode, groupby.value_counts and unstack with a fill_value of 0:
(s.explode()
  .groupby(level=0)
  .value_counts()
  .unstack(fill_value=0)
  .reindex(s.index, fill_value=0)
)
Or use crosstab:
s2 = s.explode()
(pd.crosstab(s2.index, s2)
   .rename_axis(index=None, columns=None)
)
Output:
   A  B  C  D
0  2  1  0  0
1  0  1  1  0
2  0  0  0  1
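To get from there to the asker's goal of mutating the original df, one way (a sketch, assuming the column is named 'info' as in the question) is to join the counts back onto the frame:
import pandas as pd

df = pd.DataFrame({'info': [['A', 'B', 'A'], ['C', 'B'], ['D']]})

counts = (df['info'].explode()
            .groupby(level=0)
            .value_counts()
            .unstack(fill_value=0))
df = df.join(counts)  # adds one column per unique string, 0 where absent
print(df)
#         info  A  B  C  D
# 0  [A, B, A]  2  1  0  0
# 1     [C, B]  0  1  1  0
# 2        [D]  0  0  0  1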

Dropping rows of dataframe in a for-loop in Python

I have multiple dataframes with multiple columns, like this:
DF =
    A  B  C  metadata_Colunm
r1  6  3  9  r1
r2  2  1  1  r2
r3  5  7  2  r3
How can I use a for-loop to iterate over each column to make a new dataframe and then remove rows where values are below 5 for each new dataframe?
The result should look like this:
DF_A =
   A  metadata_Colunm
   6  r1
   5  r3
DF_B =
   B  metadata_Colunm
   7  r3
DF_C =
   C  metadata_Colunm
   9  r1
What I have done so far is to make a list of the columns I will use (all excluding metadata) and then go through the columns as new dataframes. Since I also need to preserve the metadata, I add the metadata column as part of the new dataframe:
DF = DF.drop("metadata_Colunm")
ColList = list(DF)
for item in ColList:
    locals()[f"DF_{str(item)}"] = DF[[item, "metadata_Colunm"]]
    locals()[f"DF_{str(item)}"] = locals()[f"DF_{str(item)}"].drop(locals()[f"DF_{str(item)}"][locals()[f"DF_{str(item)}"].item > 0.5].index, inplace=True)
But using this I get "AttributeError: 'DataFrame' object has no attribute 'item'".
Any suggestions for making this work, or any other solutions, would be greatly appreciated!
Thanks in advance!
dfs = {}
for col in df.columns[:-1]:
    df_new = df[[col, 'metadata_Colunm']]
    dfs[col] = df_new[df_new[col] >= 5]
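Each filtered frame is then available by column name (a quick usage note; output per the sample DF above):
print(dfs['A'])
#     A metadata_Colunm
# r1  6              r1
# r3  5              r3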
I would make a dictionary to add your new dataframes to, like this:
dictionary = {}
for col in df.columns[:-1]:  # all columns but the last
    new_df = df.loc[:, [col, 'metadata_Colunm']]  # make slices
    for index, row in new_df.iterrows():
        if new_df.loc[index, col] < 5:  # remove rows below 5
            new_df.drop(index=index, inplace=True)
    dictionary[col] = new_df  # add to the dictionary so you can refer to it later
You can then call each dataframe via e.g. dictionary['A'].
According to this, it's best practice to slice the dataframe using df.loc[] as opposed to df[].
You can apply a filter function to the dataframe(s) instead of writing the loop by hand:
def filter_df(df, threshold=5):
    for column in df.columns[:-1]:  # skip the metadata column
        df = df[df[column] >= threshold]
    return df
Then apply the filter to all your dataframes:
dfs = [df1, df2, df3]  # ...and so on
filtered = [filter_df(df) for df in dfs]
Note that this keeps only the rows where every value column meets the threshold, which differs from the per-column frames the question asks for.

How to iterate over a list of dataframes in pandas?

I have multiple dataframes on which I want to run this function, which drops unnecessary columns from the dataframe and returns a dataframe:
def dropunnamednancols(df):
    """
    Drop any columns starting with 'Unnamed' and NaN columns.
    Args:
        df ([dataframe]): dataframe whose columns are to be dropped
    """
    # first drop NaN columns
    df = df.loc[:, df.columns.notnull()]
    # then search for columns named 'Unnamed'
    df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
    return df
Now I iterate over the list of dataframes [df1, df2, df3]:
dfsublist = [df1, df2, df3]
for index in enumerate(dfsublist):
    dfsublist[index] = dropunnamednancols(dfsublist[index])
Whereas the items of dfsublist have been changed, the original dataframes df1, df2, df3 still retain the unnecessary columns. How could I achieve this?
If I understand correctly, you want to apply a function to multiple dataframes separately.
The underlying issue is that your function returns a new dataframe, and you replace the stored dataframe in the list with that new one instead of modifying the old original one.
If you want to modify the original one, you have to use the inplace=True parameters of the pandas functions. This is possible, but not recommended, as seen here.
Your code could therefore look like this:
def dropunnamednancols(df):
    """
    Drop any columns starting with 'Unnamed' and NaN columns.
    Args:
        df ([dataframe]): dataframe whose columns are to be dropped
    """
    cols = [col for col in df.columns if (col is None) or col.startswith('Unnamed')]
    df.drop(cols, axis=1, inplace=True)
As example on sample data:
import pandas as pd
df_1 = pd.DataFrame({'a':[0,1,2,3], 'Unnamed':[9,8,7,6]})
df_2 = pd.DataFrame({'Unnamed':[9,8,7,6], 'b':[0,1,2,3]})
lst_dfs = [df_1, df_2]
[dropunnamednancols(df) for df in lst_dfs]
# df_1
# Out[55]:
#    a
# 0  0
# 1  1
# 2  2
# 3  3
# df_2
# Out[56]:
#    b
# 0  0
# 1  1
# 2  2
# 3  3
The reason is probably that you are using enumerate wrong. In your case, you just want the index, so what you should do is:
for index in range(len(dfsublist)):
    ...
enumerate returns a tuple of an index and the actual value in your list. So in your code, the loop variable index will actually be assigned:
(0, df1)  # first iteration
(1, df2)  # second iteration
(2, df3)  # third iteration
So either you use enumerate correctly and unpack the tuple:
for index, df in enumerate(dfsublist):
    ...
or you get rid of it altogether because you access the values with the index either way.
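Putting that together with the function from the question, the corrected loop would look like this (note that, as the other answer explains, this rebinds the list entries but leaves the original df1, df2, df3 unchanged):
for index, df in enumerate(dfsublist):
    dfsublist[index] = dropunnamednancols(df)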

Filtering a list of pandas dataframes to only include row indexes ending in a

Hi, I have a list of pd dataframes (1377 of them). I need to split each dataframe into the cases where the row index ends in 'a' and where it ends in 'c'.
I have looked at other Stack Overflow pages where this is suggested:
(df.iloc[all_dfs[0].index.str.endswith('a', na=False)])
However, this transposes my dataframe and then reduces the number of rows (previously columns before transposing).
Here is a short section from my first dataframe if that helps.
You can pass a tuple of test values to str.endswith and filter with boolean indexing:
df = pd.DataFrame({'a': range(5)},
                  index=['_E031a','_E031b','_E031c','_E032a','_E032b'])
df1 = df[df.index.str.endswith(('a', 'c'), na=False)]
print(df1)
        a
_E031a  0
_E031c  2
_E032a  3
Or take the last character of each value with str[-1] and test membership with Index.isin:
df1 = df[df.index.str[-1].isin(['a', 'c'])]
print(df1)
        a
_E031a  0
_E031c  2
_E032a  3
For looping over a list of DataFrames use:
all_dfs = [df[df.index.str.endswith(('a', 'c'), na=False)] for df in all_dfs]
If you want to test only 'a':
all_dfs = [df[df.index.str.endswith('a', na=False)] for df in all_dfs]
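Since the question asks to split each frame into the 'a'-ending and 'c'-ending cases separately, the same pattern can also produce two lists (a small extension of the answer above):
a_dfs = [df[df.index.str.endswith('a', na=False)] for df in all_dfs]
c_dfs = [df[df.index.str.endswith('c', na=False)] for df in all_dfs]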

Retrieve multiple lookup values in large dataset?

I have two dataframes:
import pandas as pd

data = [['138249','Cat'],
        ['103669','Cat'],
        ['191826','Cat'],
        ['196655','Cat'],
        ['103669','Cat'],
        ['116780','Dog'],
        ['184831','Dog'],
        ['196655','Dog'],
        ['114333','Dog'],
        ['123757','Dog']]
df1 = pd.DataFrame(data, columns=['Hash','Name'])
print(df1)

data2 = ['138249',
         '103669',
         '191826',
         '196655',
         '116780',
         '184831',
         '114333',
         '123757']
df2 = pd.DataFrame(data2, columns=['Hash'])
I want to write a code that will take the item in the second dataframe, scan the leftmost values in the first dataframe, then return all matching values from the first dataframe into a single cell in the second dataframe.
Here's the result I am aiming for:
Here's what I have tried:
# attempt one: use groupby to squish up the dataset. No results.
past = df1.groupby('Hash')
print(past)

# attempt two: use merge. Result: empty dataframe.
past1 = pd.merge(df1, df2, right_index=True, left_on='Hash')
print(past1)

# attempt three: use pivot. Result: not the right format.
past2 = df1.pivot(index=None, columns='Hash', values='Name')
print(past2)
I can do this in Excel with the VBA code here, but that code crashes when I apply it to my real dataset (likely because it is too big: approximately 30,000 rows).
IIUC, first aggregate the names per Hash in df1 with agg and ','.join, then reindex using df2:
df1.groupby('Hash')['Name'].agg(','.join).reindex(df2.Hash).reset_index()

     Hash     Name
0  138249      Cat
1  103669  Cat,Cat
2  191826      Cat
3  196655  Cat,Dog
4  116780      Dog
5  184831      Dog
6  114333      Dog
7  123757      Dog
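One caveat worth noting (my addition, not part of the answer): if df2 ever contains a Hash absent from df1, reindex leaves NaN in Name; a fillna afterwards keeps those cells as empty strings:
result = (df1.groupby('Hash')['Name']
             .agg(','.join)
             .reindex(df2.Hash)
             .fillna('')   # assumes empty string is the desired placeholder
             .reset_index())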
