Windows 10, Python 3.6
I have a dataframe df
import pandas as pd

df = pd.DataFrame({'name':['boo', 'foo', 'too', 'boo', 'roo', 'too'],
                   'zip':['30004', '02895', '02895', '30750', '02895', '02895']})
I want to find the repeated records that have the same 'name' and 'zip', and record how many times each repeats. The ideal output is
name repeat zip
0 too 1 02895
Because my dataframe has many more than six rows, I need an iterative method. I appreciate any tips.
I believe you need to group by all columns and use GroupBy.size:
#create DataFrame from online source
#df = pd.read_csv('someonline.csv')
#df = pd.read_html('someurl')[0]
#or collect rows in a loop and build the DataFrame from the list
#L = []
#for x in iterator:
#    L.append(x)  #in the loop, append data to the list
#df = pd.DataFrame(L)  #create DataFrame from the constructor
df = df.groupby(df.columns.tolist()).size().reset_index(name='repeat')
#if need specify columns
#df = df.groupby(['name','zip']).size().reset_index(name='repeat')
print (df)
name zip repeat
0 boo 30004 1
1 boo 30750 1
2 foo 02895 1
3 roo 02895 1
4 too 02895 2
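If you only want the combinations that actually repeat, as in the ideal output above, you can filter the result; subtracting 1 here is an assumption that 'repeat' should count the extra occurrences rather than the total:
out = df[df['repeat'] > 1].copy()
out['repeat'] -= 1  #count occurrences beyond the first (assumption about what 'repeat' means)
print (out)
#  name    zip  repeat
#4  too  02895       1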
Pandas has a handy .duplicated() method that can help you identify duplicates.
df.duplicated()
By passing the resulting boolean Series into a selection, you can get the duplicated records:
df[df.duplicated()]
You can get the count of duplicated records by using .sum():
df.duplicated().sum()
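For the sample df above, a quick sketch of what these calls return (duplicated marks every row after the first occurrence of an identical row):
df.duplicated()
#0    False
#1    False
#2    False
#3    False
#4    False
#5     True
#dtype: bool
df[df.duplicated()]
#  name    zip
#5  too  02895
df.duplicated().sum()
#1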
Summarize the problem
I have a Dataframe with a Series of lists of strings.
I can use a Counter from 'collections' to iterate over each row of the series and count the number of times each string appears, easily enough:
from collections import Counter

for list_of_strings in df['info']:
    counter = Counter()
    for string in list_of_strings:
        counter[string] += 1
What I want is to mutate the df to now have as many columns as there were unique strings found in all of the lists of strings in the df['info'] series. The column names would be the string value.
For each row in the df, the new columns' values would either be 0 if the string was not found, or N, where N is the count of how many times the string was found in that row's list_of_strings.
Describe what you’ve tried
I was able to make a df whose number and names of columns match the unique strings found, but I can't figure out how to append the counts per row, and I'm not even sure whether crafting and appending to a brand new dataframe is the right way to go about it:
unique_df = pd.Series(copy_df['info'].explode().unique()).to_frame().T
I tried doing something using df.at[] for each counter key, but it crashed my Jupyter notebook :\
Any help is appreciated, let me know what other info I can provide.
Assuming this input Series:
s = pd.Series([['A','B','A'],['C','B'],['D']])
You can explode, groupby.value_counts and unstack with a fill_value of 0:
(s.explode()
.groupby(level=0)
.value_counts()
.unstack(fill_value=0)
.reindex(s.index, fill_value=0)
)
Or use crosstab:
s2 = s.explode()
(pd.crosstab(s2.index, s2)
.rename_axis(index=None, columns=None)
)
Output:
A B C D
0 2 1 0 0
1 0 1 1 0
2 0 0 0 1
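To attach these counts back to the original frame, which is what the question asks for, here is a sketch assuming the lists live in df['info'] as described:
counts = (df['info'].explode()
          .groupby(level=0)
          .value_counts()
          .unstack(fill_value=0)
          .reindex(df.index, fill_value=0))
df = df.join(counts)  #one new column per unique string, 0 where a string is absent in that row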
I have multiple dataframes with multiple columns like this:
DF =
A B C metadata_Colunm
r1 6 3 9 r1
r2 2 1 1 r2
r3 5 7 2 r3
How can I use a for-loop to iterate over each column to make a new dataframe and then remove rows where values are below 5 for each new dataframe?
The result should look like this:
DF_A=
A metadata_Colunm
6 r1
5 r3
DF_B=
B metadata_Colunm
7 r3
DF_C=
C metadata_Colunm
9 r1
What I have done so far is to make a list of the columns I will use (all except metadata) and then go through the columns as new dataframes. Since I also need to preserve the metadata, I add the metadata column as part of each new dataframe:
DF = DF.drop("metadata_Colunm")
ColList = list(DF)
for item in ColList:
    locals()[f"DF_{str(item)}"] = DF[[item, "metadata_Colunm"]]
    locals()[f"DF_{str(item)}"] = locals()[f"DF_{str(item)}"].drop(locals()[f"DF_{str(item)}"][locals()[f"DF_{str(item)}"].item > 0.5].index, inplace=True)
But using this I get "AttributeError: 'DataFrame' object has no attribute 'item'".
Any suggestions for making this work, or any other solutions, would be greatly appreciated!
Thanks in advance!
dfs = {}
for col in df.columns[:-1]:
    df_new = df[[col, 'metadata_Colunm']]
    dfs[col] = df_new[df_new[col] >= 5]
I would make a dictionary to add your new dataframes to, like this:
dictionary = {}
for col in df.columns[:-1]:  # all columns but last
    new_df = df.loc[:, [col, 'metadata_Colunm']]  # make slices
    for index, row in new_df.iterrows():
        if new_df.loc[index, col] < 5:  # remove < 5
            new_df.drop(index=index, inplace=True)
    dictionary[col] = new_df  # add to dictionary so you can refer to it later
You can then call each dataframe via e.g. dictionary['A'].
According to this, it's best practice to slice the dataframe using df.loc[] as opposed to df[].
You can apply a filter to the dataframe(s) instead of looping over rows:
def filter(df, threshold=5):
    #keep only rows where every value column meets the threshold
    for column in df.columns[:-1]:  #skip the metadata column
        df = df[df[column] >= threshold]
    return df
Then apply the filter to all your dataframes:
dfs = [df1, df2, df3...]
dfs = [filter(df) for df in dfs]
I have multiple dataframes, on which I want to run this function which mainly drops unnecessary columns from the dataframe and returns a dataframe:
def dropunnamednancols(df):
    """
    Drop any columns starting with 'Unnamed' and NaN columns
    Args:
        df ([dataframe]): dataframe of which columns to be dropped
    """
    #first drop NaN columns
    df = df.loc[:, df.columns.notnull()]
    #then search for columns with 'Unnamed'
    df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
    return df
Now I iterate over the list of dataframes: [df1, df2, df3]
dfsublist = [df1, df2, df3]
for index in enumerate(dfsublist):
    dfsublist[index] = dropunnamednancols(dfsublist[index])
Whereas the items of dfsublist have been changed, the original dataframes df1, df2, df3 still retain the unnecessary columns. How could I achieve this?
If I understand correctly, you want to apply a function to multiple dataframes separately.
The underlying issue is that your function returns a new dataframe, and you replace the stored dataframe in the list with a new one instead of modifying the old original one.
If you want to modify the original one, you have to use the inplace=True parameter of the pandas functions. This is possible, but not recommended, as seen here.
Your code could therefore look like this:
def dropunnamednancols(df):
    """
    Drop any columns starting with 'Unnamed' and NaN columns
    Args:
        df ([dataframe]): dataframe of which columns to be dropped
    """
    cols = [col for col in df.columns if (col is None) or col.startswith('Unnamed')]
    df.drop(cols, axis=1, inplace=True)
As example on sample data:
import pandas as pd
df_1 = pd.DataFrame({'a':[0,1,2,3], 'Unnamed':[9,8,7,6]})
df_2 = pd.DataFrame({'Unnamed':[9,8,7,6], 'b':[0,1,2,3]})
lst_dfs = [df_1, df_2]
[dropunnamednancols(df) for df in lst_dfs]
# df_1
# Out[55]:
# a
# 0 0
# 1 1
# 2 2
# 3 3
# df_2
# Out[56]:
# b
# 0 0
# 1 1
# 2 2
# 3 3
The reason is probably that you are using enumerate wrong. In your case, you just want the index, so what you should do is:
for index in range(len(dfsublist)):
    ...
Enumerate returns a tuple of an index and the actual value in your list. So in your code, the loop variable index will actually be assigned:
(0, df1) # First iteration
(1, df2) # Second iteration
(2, df3) # Third iteration
So either you use enumerate correctly and unpack the tuple:
for index, df in enumerate(dfsublist):
    ...
or you get rid of it altogether because you access the values with the index either way.
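Putting it together, a minimal sketch of the corrected loop, reusing the asker's dropunnamednancols (the version that returns a new DataFrame):
dfsublist = [df1, df2, df3]
for index, df in enumerate(dfsublist):
    dfsublist[index] = dropunnamednancols(df)  #replace each list item with the cleaned frame
#the names df1, df2, df3 still point to the old frames; rebind them if you need them updated
df1, df2, df3 = dfsublist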
Hi, I have a list of pd dataframes (1377 of them). I need to split each dataframe into cases where the row index ends in 'a' and where the row index ends in 'c'.
I have looked at other Stack Overflow pages where this is suggested:
(df.iloc[all_dfs[0].index.str.endswith('a',na=False)])
however, this transposes my dataframe and then reduces the number of rows (which were the columns before transposing).
Here is a short section from my first dataframe if that helps.
You can pass a tuple of test values to str.endswith, with boolean indexing for filtering:
df = pd.DataFrame({'a':range(5)},
index=['_E031a','_E031b','_E031c','_E032a','_E032b'])
df1 = df[df.index.str.endswith(('a', 'c'),na=False)]
print (df1)
a
_E031a 0
_E031c 2
_E032a 3
Or get last values of strings by indexing [-1] and test membership by Index.isin:
df1 = df[df.index.str[-1].isin(['a', 'c'])]
print (df1)
a
_E031a 0
_E031c 2
_E032a 3
For looping in list of DataFrames use:
all_dfs = [df[df.index.str.endswith(('a', 'c'),na=False)] for df in all_dfs]
If want only test a:
all_dfs = [df[df.index.str.endswith('a',na=False)] for df in all_dfs]
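If you actually need two separate frames per DataFrame, one for indices ending in 'a' and one for 'c', here is a sketch using the same call (the list names are just illustrative):
all_dfs_a = [df[df.index.str.endswith('a', na=False)] for df in all_dfs]
all_dfs_c = [df[df.index.str.endswith('c', na=False)] for df in all_dfs]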
I have two dataframes:
import pandas as pd
data = [['138249','Cat']
,['103669','Cat']
,['191826','Cat']
,['196655','Cat']
,['103669','Cat']
,['116780','Dog']
,['184831','Dog']
,['196655','Dog']
,['114333','Dog']
,['123757','Dog']]
df1 = pd.DataFrame(data, columns = ['Hash','Name'])
print(df1)
data2 = [
'138249',
'103669',
'191826',
'196655',
'116780',
'184831',
'114333',
'123757',]
df2 = pd.DataFrame(data2, columns = ['Hash'])
I want to write code that will take each item in the second dataframe, scan the leftmost column of the first dataframe, then return all matching values from the first dataframe into a single cell in the second dataframe.
Here's the result I am aiming for:
Here's what I have tried:
#attempt one: use groupby to squish up the dataset. No results
past = df1.groupby('Hash')
print(past)
#attempt two: use merge. Result: empty dataframe
past1 = pd.merge(df1, df2, right_index=True, left_on='Hash')
print(past1)
#attempt three: use pivot. Result: not the right format.
past2 = df1.pivot(index = None, columns = 'Hash', values = 'Name')
print(past2)
I can do this in Excel with the VBA code here, but that code crashes when I apply it to my real dataset (likely because it is too big, approximately 30,000 rows long).
IIUC, first agg with ','.join on df1, then reindex using df2:
df1.groupby('Hash')['Name'].agg(','.join).reindex(df2.Hash).reset_index()
Hash Name
0 138249 Cat
1 103669 Cat,Cat
2 191826 Cat
3 196655 Cat,Dog
4 116780 Dog
5 184831 Dog
6 114333 Dog
7 123757 Dog
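Broken into steps, the chained one-liner does roughly this (a sketch; any Hash in df2 with no match in df1 would come back as NaN):
agg = df1.groupby('Hash')['Name'].agg(','.join)  #Series of joined names, indexed by Hash
out = agg.reindex(df2.Hash)                      #align to df2's hashes, in df2's order
out = out.reset_index()                          #back to columns Hash and Name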