I am joining different values from a dataset into one column of a pandas DataFrame, but the result contains many duplicates. How can I get rid of them without deleting any rows?
example:
newCol
------
123,456,129,123,123
237,438,365,432,438
Using df.newCol.drop_duplicates() removes entire rows, but I want the result to be:
newCol
------
123,456,129
237,438,365,432
...
thank you
You need to split first, apply set, and then join. Note that set does not preserve the original order of the values:
df.newCol = df.newCol.apply(lambda x: ','.join(set(str(x).split(','))))
print (df)
newCol
0 129,123,456
1 432,365,438,237
If the values are still in separate columns, you can apply set to each row before joining:
print (df)
0 1 2 3 4
0 123 456 129 123 123
1 237 438 365 432 438
df = df.apply(lambda x: ','.join(set(x.astype(str))), axis=1)
print (df)
0 129,123,456
1 432,365,438,237
dtype: object
Or use unique, which keeps the order of first appearance:
df = df.apply(lambda x: ','.join((x.astype(str)).unique()), axis=1)
print (df)
0 123,456,129
1 237,438,365,432
dtype: object
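If the values are already joined into one string per row, as in the question, the order can also be kept without unique. The following is a minimal sketch, assuming each cell is a comma-separated string; the helper name dedupe_keep_order is hypothetical and not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({"newCol": ["123,456,129,123,123", "237,438,365,432,438"]})

def dedupe_keep_order(cell):
    # Hypothetical helper: dict.fromkeys keeps only the first occurrence
    # of each key and preserves insertion order.
    parts = str(cell).split(",")
    return ",".join(dict.fromkeys(parts))

df["newCol"] = df["newCol"].apply(dedupe_keep_order)
print(df)
```

Unlike set, this keeps 123,456,129 in the order the values first appeared.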
Related
df1 = pd.DataFrame({'call_sign': ['ASD','BSD','CDSF','GFDFD','FHHH'],'frn':['123','124','','656','']})
df2 = pd.DataFrame({'call_sign': ['ASD','CDSF','BSD','GFDFD','FHHH'],'frn':['1234','','124','','765']})
I need to get a new df like
df2 = pd.DataFrame({'call_sign': ['ASD','BSD','CDSF','GFDFD','FHHH'],'frn':['123','','124','656','765']})
I need to take frn from df2 if it's missing in df1 and create a new df
Replace empty strings with missing values, then use DataFrame.set_index with DataFrame.fillna. Because the ordering should follow df2.call_sign, add DataFrame.reindex (this needs import numpy as np):
df = (df1.set_index('call_sign').replace('', np.nan)
.fillna(df2.set_index('call_sign').replace('', np.nan))
.reindex(df2['call_sign']).reset_index())
print(df)
call_sign frn
0 ASD 123
1 CDSF NaN
2 BSD 124
3 GFDFD 656
4 FHHH 765
If you want to update df2 you can use boolean indexing:
# is frn empty string?
m = df2['frn'].eq('')
# update those rows from the value in df1
df2.loc[m, 'frn'] = df2.loc[m, 'call_sign'].map(df1.set_index('call_sign')['frn'])
Updated df2:
call_sign frn
0 ASD 1234
1 CDSF
2 BSD 124
3 GFDFD 656
4 FHHH 765
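The same boolean-indexing idea also works in the other direction, filling the blanks of df1 from df2 as the question asks. This is only a sketch using the example frames from the question; the names lookup, filled and blank are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'call_sign': ['ASD', 'BSD', 'CDSF', 'GFDFD', 'FHHH'],
                    'frn': ['123', '124', '', '656', '']})
df2 = pd.DataFrame({'call_sign': ['ASD', 'CDSF', 'BSD', 'GFDFD', 'FHHH'],
                    'frn': ['1234', '', '124', '', '765']})

# Series that maps call_sign -> frn as recorded in df2
lookup = df2.set_index('call_sign')['frn']

filled = df1.copy()                      # leave df1 itself untouched
blank = filled['frn'].eq('')             # rows where df1 has no frn
filled.loc[blank, 'frn'] = filled.loc[blank, 'call_sign'].map(lookup)
print(filled)
```

CDSF stays empty here because it has no frn in either frame.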
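Another option is to merge on call_sign and take frn from df2 only where the value in df1 is empty:
temp = df1.merge(df2,how='left',on='call_sign')
df1['frn']=temp.frn_x.where(temp.frn_x!='',temp.frn_y)
call_sign frn
0 ASD 123
1 BSD 124
2 CDSF
3 GFDFD 656
4 FHHH 765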
This question might be common, but I am new to Python and would like to learn more from the community. I have 2 map files which have data mappings like this:
map1 : A --> B
map2 : B --> C,D,E
I want to create a new map file which will be A --> C
What is the most efficient way to achieve this in Python? A generic approach would be very helpful, as I need to apply the same logic to different files and different columns.
Example:
Map1:
1,100
2,453
3,200
Map2:
100,25,30,
200,300,,
250,190,20,1
My map3 should be:
1,25
2,0
3,300
As 453 is not present in map2, our map3 contains value 0 for key 2.
First create DataFrames:
df1 = pd.read_csv(Map1, header=None)
df2 = pd.read_csv(Map2, header=None)
Then use Series.map on the second column of df1, with a Series built from df2 by setting its first column as the index. Finally, replace the missing values of unmatched keys with 0:
df1[1] = df1[1].map(df2.set_index(0)[1]).fillna(0).astype(int)
print (df1)
0 1
0 1 25
1 2 0
2 3 300
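Since the question asks for a generic approach, the lookup can be wrapped in a small function. The sketch below is an assumption-laden illustration: chain_maps and its parameters are hypothetical names, the two files are replaced by inline strings, and the mapped values are assumed to be integers:

```python
import io
import pandas as pd

# Inline copies of the two example files, so the sketch runs without files on disk.
map1 = io.StringIO("1,100\n2,453\n3,200\n")
map2 = io.StringIO("100,25,30,\n200,300,,\n250,190,20,1\n")

df1 = pd.read_csv(map1, header=None)
df2 = pd.read_csv(map2, header=None)

def chain_maps(left, right, left_val=1, right_key=0, right_val=1, default=0):
    """Hypothetical helper: look up left[left_val] among right[right_key]
    and return right[right_val], using default where there is no match."""
    lookup = right.set_index(right_key)[right_val]
    # astype(int) assumes the mapped values are whole numbers
    return left[left_val].map(lookup).fillna(default).astype(int)

map3 = df1[[0]].copy()          # keep the A keys
map3[1] = chain_maps(df1, df2)  # A --> C
print(map3)
```

Passing other column positions for left_val, right_key and right_val reuses the same logic on different files and columns.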
EDIT: To map multiple columns, use a left join on columns b and c, remove the columns that contain only missing values with DataFrame.dropna, drop the join columns b and c, and finally replace the remaining missing values:
df1.columns=['a','b']
df2.columns=['c','d','e','f']
df = (df1.merge(df2, how='left', left_on='b', right_on='c')
.dropna(how='all', axis=1)
.drop(['b','c'], axis=1)
.fillna(0)
.convert_dtypes())
print (df)
a d e
0 1 25 30
1 2 0 0
2 3 300 0
I have a pandas dataframe (this is an example, actual dataframe is a lot larger):
data = [['345', 1, '2022_Jan'], ['678', 1, '2022_Jan'], ['123', 1, '2022_Feb'], ['123', 1, '2022_Feb'], ['345', 0, '2022_Mar'], ['678', 1, '2022_Mar'], ['901', 0, '2022_Mar'], ['678', 1, '2022_Mar']]
df = pd.DataFrame(data, columns = ['ID', 'Error Count', 'Year_Month'])
The question I want answered is: How many IDs have errors?
I want to get an output that groups by 'Year_Month' and counts 1 for each ID that occurs in each month. In other words, I want to count only 1 for each ID in a single month.
When I group by 'Year_Month' and 'ID' with df.groupby(['Year_Month', 'ID']).count(),
it gives me the following output (current output link below) with the total Error Count for each ID, but I only want to count each ID once. I also want Year_Month to be ordered chronologically; I'm not sure why it isn't, since my original dataframe is ordered by month in the Year_Month column.
My current output
Desired output
Are these actually duplicate records? Are you sure you don't want to record that user 123 had two errors in February?
If so, drop duplicates first, then groupby and sum over Error Count. The .count() method doesn't do what you think it does:
df.drop_duplicates(["ID", "Year_Month"]) \
.groupby(["Year_Month", "ID"])["Error Count"] \
.sum()
Output:
In [3]: counts = df.drop_duplicates(["ID", "Year_Month"]) \
...: .groupby(["Year_Month", "ID"])["Error Count"] \
...: .sum()
In [4]: counts
Out[4]:
Year_Month ID
2022_Feb 123 1
2022_Jan 345 1
678 1
2022_Mar 345 0
678 1
901 0
Name: Error Count, dtype: int64
As far as sorting, you'd want to convert "Year_Month" to a datetime object, because right now they're just being sorted as strings:
In [5]: "2022_Feb" < "2022_Jan"
Out[5]: True
Here's how you could do that:
In [6]: counts.sort_index(level=0, key=lambda ym: pd.to_datetime(ym, format="%Y_%b"))
Out[6]:
Year_Month ID
2022_Jan 345 1
678 1
2022_Feb 123 1
2022_Mar 345 0
678 1
901 0
Name: Error Count, dtype: int64
Here's one way:
(df
.groupby(['Year_Month', 'ID'])['Error Count'] # group by the two columns
.sum() # aggregate the sum over error count
.apply(lambda x: int(bool(x))) # convert to boolean and back to int
.to_frame('Error Count') # add name back to applied column
)
Error Count
Year_Month ID
2022_Feb 123 1
2022_Jan 345 1
678 1
2022_Mar 345 0
678 1
901 0
Here is another way to do that.
Convert the sum to a boolean with astype(bool), which gives False for 0 and True for any non-zero value, and then to 0 or 1 with astype(int):
df.groupby(['Year_Month','ID'])['Error Count'].sum().astype(bool).astype(int)
Year_Month ID
2022_Feb 123 1
2022_Jan 345 1
678 1
2022_Mar 345 0
678 1
901 0
Name: Error Count, dtype: int32
To sort, assign the result to a variable and then apply ddejohn's solution:
counts = df.groupby(['Year_Month','ID'])['Error Count'].sum().astype(bool).astype(int)
counts.sort_index(level=0, key=lambda ym: pd.to_datetime(ym, format="%Y_%b")) # ddejohn: https://stackoverflow.com/a/71927886/3494754 answer above
Year_Month ID
2022_Jan 345 1
678 1
2022_Feb 123 1
2022_Mar 345 0
678 1
901 0
Name: Error Count, dtype: int32
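If the goal is a single number per month (how many distinct IDs had at least one error), one possible reading of the question is sketched below. It assumes an ID counts only when its Error Count is greater than 0; the name ids_with_errors is illustrative:

```python
import pandas as pd

data = [['345', 1, '2022_Jan'], ['678', 1, '2022_Jan'], ['123', 1, '2022_Feb'],
        ['123', 1, '2022_Feb'], ['345', 0, '2022_Mar'], ['678', 1, '2022_Mar'],
        ['901', 0, '2022_Mar'], ['678', 1, '2022_Mar']]
df = pd.DataFrame(data, columns=['ID', 'Error Count', 'Year_Month'])

ids_with_errors = (
    df[df['Error Count'] > 0]        # keep only rows that report an error
    .groupby('Year_Month')['ID']
    .nunique()                       # each ID counts once per month
    # sort by the real date rather than alphabetically by the string
    .sort_index(key=lambda ym: pd.to_datetime(ym, format="%Y_%b"))
)
print(ids_with_errors)
```

The key function only affects the ordering; the index labels stay as the original strings.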
I have dataframe as follows:
df =
DATA TYPE_1 TYPE_2 TYPE_3 EVALUATED
WW A234 456 456 0
AA 456 123A 567 1
BB 456 123A 456 1
df = pd.DataFrame({"DATA":["WW","AA","BB"],"TYPE_1":["A234","456","456"],"TYPE_2":["456","123A","123A"],"TYPE_3":["456","567","456"],"EVALUATED":[0,1,1]})
I am calculating the number of times each TYPE_* value appeared with EVALUATED equal to 0 and 1. This is the code that I am using:
grouped = (df.set_index('EVALUATED')
.filter(like='TYPE_')
.stack()
.to_frame('TYPE')
.reset_index()
.pivot_table(index='TYPE', columns='EVALUATED', aggfunc='size', fill_value=0)
).reset_index()
I need to slightly improve this code by counting only unique occurrences of TYPE_* values per row.
For example, 456 appeared two times in row 1, but it must be counted only once.
The result should be this one:
grouped =
TYPE 0 1
------------
A234 1 0
456 1 2
123A 0 2
567 0 1
My solution is similar to yours. I only added .apply(lambda x: pd.Series(x.unique()), axis=1) to remove duplicates within each row. Another, faster way to get the sizes is groupby with size and unstack:
grouped = (df.set_index('EVALUATED')
.filter(like='TYPE_')
.apply(lambda x: pd.Series(x.unique()), axis=1)
.stack()
.to_frame('TYPE')
.reset_index()
.groupby(['TYPE', 'EVALUATED'])
.size()
.unstack(fill_value=0))
print (grouped)
EVALUATED 0 1
TYPE
123A 0 2
456 1 2
567 0 1
A234 1 0
Your solution:
grouped = (df.set_index('EVALUATED')
.filter(like='TYPE_')
.apply(lambda x: pd.Series(x.unique()), axis=1) #added row for unique rows
.stack()
.to_frame('TYPE')
.reset_index()
.pivot_table(index='TYPE', columns='EVALUATED', aggfunc='size', fill_value=0)
).reset_index()
print (grouped)
EVALUATED TYPE 0 1
0 123A 0 2
1 456 1 2
2 567 0 1
3 A234 1 0
Steps:
1) Subset the columns whose names start with "TYPE_".
2) Construct a new DataFrame from their unique values in each row. Let its index be the contents of the EVALUATED column.
3) Group by the index by specifying level=0. Using apply, flatten each group's values into a one-dimensional array and take the counts of its distinct values with value_counts. Unstack the result with fill_value=0. Finally, transpose the resulting DataFrame.
TYPE_cols = df.filter(like="TYPE_")
d = pd.DataFrame([pd.unique(x) for x in TYPE_cols.values], df['EVALUATED'].values)
result = d.groupby(level=0).apply(lambda x: pd.value_counts(x.values.ravel()))
result.unstack(fill_value=0).T.reset_index().rename(columns={"index":"TYPE"})
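A further variant, offered only as a sketch and not taken from the answers above, reshapes to long format with melt, drops repeated values within the same original row, and counts with pd.crosstab. The helper column names row and long_df are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"DATA": ["WW", "AA", "BB"],
                   "TYPE_1": ["A234", "456", "456"],
                   "TYPE_2": ["456", "123A", "123A"],
                   "TYPE_3": ["456", "567", "456"],
                   "EVALUATED": [0, 1, 1]})

type_cols = list(df.filter(like="TYPE_").columns)

long_df = (
    df.rename_axis("row").reset_index()          # remember the original row
      .melt(id_vars=["row", "EVALUATED"], value_vars=type_cols, value_name="TYPE")
      .drop_duplicates(["row", "TYPE"])          # count each value once per row
)

# Rows: TYPE values, columns: EVALUATED values, cells: number of rows
grouped = pd.crosstab(long_df["TYPE"], long_df["EVALUATED"])
print(grouped)
```

Calling grouped.reset_index() afterwards turns TYPE back into a regular column, as in the question's expected layout.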
I have the following dataframe named ttm:
usersidid clienthostid eventSumTotal LoginDaysSum score
0 12 1 60 3 1728
1 11 1 240 3 1331
3 5 1 5 3 125
4 6 1 16 2 216
2 10 3 270 3 1000
5 8 3 18 2 512
When I do
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
I get what I expected (though I would've wanted the results to be under a new label named 'ratio'):
clienthostid LoginDaysSum
0 1 4
1 3 2
But when I do
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])
I get:
0 1.0
1 1.5
Why did the labels disappear? I still need the grouped 'clienthostid' column, and I also need the results of the apply to be under a label.
Sometimes when I do a groupby, some of the other columns still appear. Why do columns sometimes disappear and sometimes stay? Is there a flag I'm missing that controls this?
In the example that I gave, when I did count, the results showed under the label 'LoginDaysSum'. Is there a way to add a new label for the results instead?
Thank you,
To return a DataFrame after groupby, there are 2 possible solutions:
the parameter as_index=False, which works nicely with the count, sum and mean functions
reset_index, which creates new columns from the levels of the index and is the more general solution
df = ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
print (df)
clienthostid LoginDaysSum
0 1 4
1 3 2
df = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'].count().reset_index()
print (df)
clienthostid LoginDaysSum
0 1 4
1 3 2
For the second case, remove as_index=False and add reset_index instead:
#output is `Series`
a = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
.apply(lambda x: x.iloc[0] / x.iloc[1])
print (a)
clienthostid
1 1.0
3 1.5
Name: LoginDaysSum, dtype: float64
print (type(a))
<class 'pandas.core.series.Series'>
print (a.index)
Int64Index([1, 3], dtype='int64', name='clienthostid')
df1 = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
.apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
print (df1)
clienthostid ratio
0 1 1.0
1 3 1.5
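Both labelling requests from the question can be tried on a self-contained copy of ttm. This is a sketch under assumptions: the frame is rebuilt by hand from the printed table, and login_count is a hypothetical column name. The count is renamed through named aggregation and the apply result through reset_index(name=...):

```python
import pandas as pd

ttm = pd.DataFrame(
    {"usersidid": [12, 11, 5, 6, 10, 8],
     "clienthostid": [1, 1, 1, 1, 3, 3],
     "eventSumTotal": [60, 240, 5, 16, 270, 18],
     "LoginDaysSum": [3, 3, 3, 2, 3, 2],
     "score": [1728, 1331, 125, 216, 1000, 512]},
    index=[0, 1, 3, 4, 2, 5],
)

# Named aggregation: output column name = (input column, function)
counts = (ttm.groupby("clienthostid", sort=False)
             .agg(login_count=("LoginDaysSum", "count"))
             .reset_index())

# apply returns a Series indexed by clienthostid; name= labels its values
ratio = (ttm.groupby("clienthostid", sort=False)["LoginDaysSum"]
            .apply(lambda s: s.iloc[0] / s.iloc[1])
            .reset_index(name="ratio"))

print(counts)
print(ratio)
```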
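Why are some columns gone?
I think the cause may be the automatic exclusion of nuisance columns, i.e. columns the aggregation cannot be applied to (older pandas versions drop them silently; pandas 2.0 and later need numeric_only=True to exclude them):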
#convert column to str
ttm.usersidid = ttm.usersidid.astype(str) + 'aa'
print (ttm)
usersidid clienthostid eventSumTotal LoginDaysSum score
0 12aa 1 60 3 1728
1 11aa 1 240 3 1331
3 5aa 1 5 3 125
4 6aa 1 16 2 216
2 10aa 3 270 3 1000
5 8aa 3 18 2 512
#the str column usersidid is removed
a = ttm.groupby(['clienthostid'], sort=False).sum()
print (a)
eventSumTotal LoginDaysSum score
clienthostid
1 321 11 3400
3 288 5 1512
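To make the exclusion explicit instead of relying on version-specific behaviour, numeric_only=True can be passed to sum. A minimal sketch, assuming the same ttm data rebuilt by hand with usersidid already converted to strings:

```python
import pandas as pd

ttm = pd.DataFrame(
    {"usersidid": ["12aa", "11aa", "5aa", "6aa", "10aa", "8aa"],
     "clienthostid": [1, 1, 1, 1, 3, 3],
     "eventSumTotal": [60, 240, 5, 16, 270, 18],
     "LoginDaysSum": [3, 3, 3, 2, 3, 2],
     "score": [1728, 1331, 125, 216, 1000, 512]},
    index=[0, 1, 3, 4, 2, 5],
)

# numeric_only=True restricts the aggregation to numeric columns,
# so the string column usersidid is left out of the result
a = ttm.groupby("clienthostid", sort=False).sum(numeric_only=True)
print(a)
```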
What is the difference between size and count in pandas?
count is a built-in method of the groupby object, and pandas knows what to do with it. Two other things you specify determine what the output looks like.
# For a built in method, when
# you don't want the group column
# as the index, pandas keeps it in
# as a column.
# |----||||----|
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
clienthostid LoginDaysSum
0 1 4
1 3 2
# For a built in method, when
# you do want the group column
# as the index, then...
# |----||||---|
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].count()
# |-----||||-----|
# the single brackets tells
# pandas to operate on a series
# in this case, count the series
clienthostid
1 4
3 2
Name: LoginDaysSum, dtype: int64
ttm.groupby(['clienthostid'], as_index=True, sort=False)[['LoginDaysSum']].count()
# |------||||------|
# the double brackets tells pandas
# to operate on the dataframe
# specified by these columns and will
# return a dataframe
LoginDaysSum
clienthostid
1 4
3 2
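The bracket rule can be checked directly by inspecting the returned types. A small sketch, assuming a trimmed-down copy of ttm with only the two relevant columns:

```python
import pandas as pd

ttm = pd.DataFrame({"clienthostid": [1, 1, 1, 1, 3, 3],
                    "LoginDaysSum": [3, 3, 3, 2, 3, 2]})

g = ttm.groupby("clienthostid", sort=False)

as_series = g["LoginDaysSum"].count()    # single brackets -> Series
as_frame = g[["LoginDaysSum"]].count()   # double brackets -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
```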
When you use apply with as_index=False, pandas no longer knows what to do with the group column. It has to trust that, if you use apply, you want back exactly what your function returns, so it just throws the group column away. Also, you have single brackets around your column, which tells pandas to operate on a series. Instead, use as_index=True to keep the grouping column information in the index, then follow it up with a reset_index to move it from the index back into the dataframe. At that point it no longer matters that you used single brackets, because after the reset_index you'll have a dataframe again.
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])
0 1.0
1 1.5
dtype: float64
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()
clienthostid LoginDaysSum
0 1 1.0
1 3 1.5
Reading the groupby documentation, I found out that the automatic exclusion of columns after groupby is usually caused by the presence of null values in the excluded columns.
Try filling the nulls with some value,
like this:
df.fillna('')
You simply need this instead:
ttm.groupby(['clienthostid'], as_index=False, sort=False)[['LoginDaysSum']].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()
The double brackets [[ ]] turn the output into a pd.DataFrame instead of a pd.Series.