pandas.DataFrame.groupby loses index and messes up the data - python

I have a pandas.DataFrame (named df) with the following data:
labels texts
0 labelA Some Text 12345678
1 labelA Some Text 12345678
2 labelA Some Text 12345678
3 labelA Some Text 12345678
4 labelB Some Text 12345678
5 labelB Some Text 12345678
6 labelB Some Text 12345678
7 labelC Some Text 12345678
8 labelC Some Text 12345678
9 labelC Some Text 12345678
10 labelC Some Text 12345678
11 labelC Some Text 12345678
12 labelC Some Text 12345678
When I perform a groupby with the following (the goal is to take 2 samples from each label), the index is lost:
grouped = df.groupby('labels')
result = grouped.apply(lambda x: x.sample(n=2))
print(result)
The output becomes:
         labels              texts
labels
labelA 0 labelA  Some Text 12345678
       0 labelA  Some Text 12345678
labelB 0 labelB  Some Text 12345678
       0 labelB  Some Text 12345678
labelC 0 labelC  Some Text 12345678
       0 labelC  Some Text 12345678
I would like the output to be:
labels texts
0 labelA Some Text 12345678
1 labelA Some Text 12345678
2 labelB Some Text 12345678
3 labelB Some Text 12345678
4 labelC Some Text 12345678
5 labelC Some Text 12345678
How should I make the changes?
I tried to use result.droplevel(0).reset_index() according to this answer, but it becomes:
index labels texts
0 0 labelA Some Text 12345678
1 0 labelA Some Text 12345678
2 0 labelB Some Text 12345678
3 0 labelB Some Text 12345678
4 0 labelC Some Text 12345678
5 0 labelC Some Text 12345678

Add the group_keys=False parameter to DataFrame.groupby:
grouped = df.groupby('labels', group_keys=False)
result = grouped.apply(lambda x: x.sample(n=2))
print(result)
labels texts
0 labelA Some Text 12345678
1 labelA Some Text 12345678
4 labelB Some Text 12345678
6 labelB Some Text 12345678
9 labelC Some Text 12345678
8 labelC Some Text 12345678
Another idea is to remove the index entirely and replace it with the default RangeIndex:
grouped = df.groupby('labels')
result = grouped.apply(lambda x: x.sample(n=2)).reset_index(drop=True)
print(result)
labels texts
0 labelA Some Text 12345678
1 labelA Some Text 12345678
2 labelB Some Text 12345678
3 labelB Some Text 12345678
4 labelC Some Text 12345678
5 labelC Some Text 12345678
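The droplevel attempt from the question also works once drop=True is passed, since droplevel(0) removes the added labels level and reset_index(drop=True) discards the surviving original index:
result = grouped.apply(lambda x: x.sample(n=2)).droplevel(0).reset_index(drop=True)
And if your pandas is 1.1 or newer, GroupBy.sample samples per group directly, without apply, which avoids the extra index level altogether (a minimal sketch):
result = df.groupby('labels').sample(n=2).reset_index(drop=True)
print(result)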

Related

How to match between row values

serial  name    match
1       John    5,6,8
2       Steve   1,7
3       Kevin   4
4       Kevin   3
5       John    1,6,8
6       Johnn   1,5,8
7       Steves  2
8       John    1,5,6
I need to check the name of each row against the names of the rows whose serial numbers appear in its match column. Keep the serials whose names match, and remove the rest. If nothing matches, leave it null.
serial  name    match   updated_match
1       John    5,6,8   5,8
2       Steve   1,7
3       Kevin   4       4
4       Kevin   3       3
5       John    1,6,8   1,8
6       Johnn   1,5,8
7       Steves  2
8       John    1,5,6   1,5
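For reference, the example frame can be built directly from the table above:
import pandas as pd

df = pd.DataFrame({
    'serial': [1, 2, 3, 4, 5, 6, 7, 8],
    'name': ['John', 'Steve', 'Kevin', 'Kevin', 'John', 'Johnn', 'Steves', 'John'],
    'match': ['5,6,8', '1,7', '4', '3', '1,6,8', '1,5,8', '2', '1,5,6'],
})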
Convert the values of the serial column to strings, then map aggregated sets of serials per name back to a Series the same size as the original. Split the match column, take the intersection of each match list with the row's name-set (excluding the row's own serial), then sort and join back to strings:
s = df['serial'].astype(str)
# map each name to the set of all serials sharing that name
sets = df['name'].map(s.groupby(df['name']).agg(set))
match = df['match'].str.split(',')
# per row: keep serials from match that belong to the same name, excluding the row's own serial
df['updated_match'] = [','.join(sorted(b.difference([c]).intersection(a)))
                       for a, b, c in zip(match, sets, s)]
print(df)
serial name match updated_match
0 1 John 5,6,8 5,8
1 2 Steve 1,7
2 3 Kevin 4 4
3 4 Kevin 3 3
4 5 John 1,6,8 1,8
5 6 Johnn 1,5,8
6 7 Steves 2
7 8 John 1,5,6 1,5
You can use mappings to determine which matches have the same name:
# ensure we have strings in serial
df = df.astype({'serial': 'str'})
# split and explode the individual serials
s = df['match'].str.split(',').explode()
# make a mapper: serial -> name
mapper = df.set_index('serial')['name']
# name of the originating row, aligned with the exploded serials
s2 = df.loc[s.index, 'serial'].map(mapper)
# keep only matches with matching names,
# then aggregate back to a string
df['updated_match'] = (s[s.map(mapper).eq(s2)]
                        .groupby(level=0).agg(','.join)
                        .reindex(df.index, fill_value='')
                       )
output:
serial name match updated_match
0 1 John 5,6,8 5,8
1 2 Steve 1,7
2 3 Kevin 4 4
3 4 Kevin 3 3
4 5 John 1,6,8 1,8
5 6 Johnn 1,5,8
6 7 Steves 2
7 8 John 1,5,6 1,5
Alternative with a groupby.apply:
df['updated_match'] = (df.groupby('name', group_keys=False)
                         .apply(lambda g: g['match'].str.split(',').explode()
                                           .loc[lambda x: x.isin(g['serial'])]
                                           .groupby(level=0).agg(','.join))
                         .reindex(df.index, fill_value='')
                       )

Pandas - Data transformation of column with no delimiters

I have a pandas dataframe which consists of players names and statistics from a sporting match. The only source of data lists them in the following format:
# PLAYER M FG 3PT FT REB AST STL PTS
34 BLAKE Brad 38 17 5 6 3 0 3 0 24
12 JONES Ben 42 10 2 6 1 0 4 1 12
8 SMITH Todd J. 16 9 1 4 1 0 3 2 18
5 MAY-DOUGLAS James 9 9 0 3 1 0 2 1 6
44 EDLIN Taylor 12 6 0 5 1 0 0 1 8
The players' names are in reverse order: Surname Firstname. I need to transform the names into the conventional order of firstname lastname. So, specifically:
BLAKE Brad -> Brad BLAKE
SMITH Todd J. -> Todd J. SMITH
MAY-DOUGLAS James -> James MAY-DOUGLAS
The case of the letters does not matter; however, I thought it could potentially be used to differentiate the first name and last name. I know all last names will always be in uppercase, even if they include a hyphen. The first name will always be sentence case (first letter uppercase and the rest lowercase). However, some names include a middle name to differentiate players with the same name. I can see how a space character could be used as a delimiter, potentially with a "split" transformation, but it gets difficult with the middle names.
Are there any suggestions for a Pandas function I can use to achieve this?
The desired output is:
# PLAYER M FG 3PT FT REB AST STL PTS
34 Brad BLAKE 38 17 5 6 3 0 3 0 24
12 Ben JONES 42 10 2 6 1 0 4 1 12
8 Todd J. SMITH 16 9 1 4 1 0 3 2 18
5 James MAY-DOUGLAS 9 9 0 3 1 0 2 1 6
44 Taylor EDLIN 12 6 0 5 1 0 0 1 8
Split on the first whitespace, then reverse the resulting list and join its values back with whitespace:
df['PLAYER'] = df['PLAYER'].str.split(' ', n=1).str[::-1].str.join(' ')
To reverse only certain names, you can use isin then boolean indexing
names = ['BLAKE Brad', 'SMITH Todd J.', 'MAY-DOUGLAS James']
mask = df['PLAYER'].isin(names)
df.loc[mask, 'PLAYER'] = df.loc[mask, 'PLAYER'].str.split(' ', n=1).str[::-1].str.join(' ')
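Since the question guarantees the surname is all uppercase (possibly hyphenated) and the first name is sentence case, a regex swap keyed on that casing rule is another option. A sketch, assuming every value starts with the all-caps surname:
# move the leading all-caps (possibly hyphenated) surname behind the rest of the name
df['PLAYER'] = df['PLAYER'].str.replace(r'^([A-Z-]+)\s+(.+)$', r'\2 \1', regex=True)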

How to distribute a list of evenly/unevenly divisible elements across a pandas index

Is there any way to produce the output I describe below?
I have a pd.DataFrame as below:
df1
data
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
and I want to add a column from a list:
lst1 = ['TEXT1','TEXT2']
lst2 = ['text1','text2','text3']
For the lst1 case it works on df1 because the length of the DataFrame is evenly divisible by the length of the list.
Input:
lst1 = ['TEXT1', 'TEXT2']
df1['text'] = np.repeat(lst1, len(df1) // len(lst1))
Output on lst1:
1 TEXT1
2 TEXT1
3 TEXT1
4 TEXT1
5 TEXT2
6 TEXT2
7 TEXT2
8 TEXT2
My desired output for lst2:
1 text1
2 text1
3 text1
4 text2
5 text2
6 text2
7 text3
8 text3
You don't have to group here; you can just repeat enough times (rounding up, hence the int(np.ceil(...)), since np.repeat needs an integer repeat count) and cut off whatever you don't need:
df1['text'] = np.repeat(lst2, int(np.ceil(len(df1) / len(lst2))))[:len(df1)]
df1
data text
0 1 text1
1 2 text1
2 3 text1
3 4 text2
4 5 text2
5 6 text2
6 7 text3
7 8 text3
You can group by index and then use .ngroup() as an index into your list:
import pandas as pd
d = {'data': [1,2,3,4,5,6,7,8]}
lst2 = ['text1','text2','text3']
df = pd.DataFrame(d)
df['txt'] = df.groupby(df.index // len(lst2)).ngroup().map(lst2.__getitem__) # or .apply(lambda x: lst2[x])
print(df)
Prints:
data txt
0 1 text1
1 2 text1
2 3 text1
3 4 text2
4 5 text2
5 6 text2
6 7 text3
7 8 text3
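If you want the "as even as possible" split made explicit, np.array_split is another option: for 8 rows and 3 labels it yields chunk sizes 3, 3, 2, the same distribution as the desired output. A sketch:
import numpy as np

# one chunk of row positions per label, as even as possible
chunks = np.array_split(df1.index, len(lst2))
df1['text'] = np.concatenate([[txt] * len(c) for txt, c in zip(lst2, chunks)])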

Write to file .txt

This is how the output should look in the text file:
Ray Holt
5 5 0 0 100 15
Jessica Jones
12 0 6 6 50 6
Johnny Rose
6 2 0 4 20 10
Gina Linetti
7 4 0 3 300 15
The numbers are the results from the game that I have to create.
My question is: how can I write both the string and the integer results to the text file?
I have tried this:
def write_to_file(filename, player_list):
    output = open(filename, "w")
    for player in player_list:
        output.write(str(player))
but the output is
Ray Holt 5 5 0 0 100 15Jessica Jones 12 0 6 6 50 6Johnny Rose 6 2 0 4 20 10Gina Linetti 7 4 0 3 300 15Khang 0 0 0 0 100 0
They are all on one line.
Please help me!
Thanks a lot guys
Use this:
def write_to_file(filename, player_list):
    output = open(filename, "w")
    for player in player_list:
        output.write(str(player) + '\n')  # newline after each player
    output.close()
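A slightly more idiomatic variant uses a with statement, so the file is closed even if an error occurs while writing:
def write_to_file(filename, player_list):
    # the context manager closes the file automatically
    with open(filename, "w") as output:
        for player in player_list:
            output.write(str(player) + '\n')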

Get top rows from column value count with pandas

Let's say I have this kind of data. It's a set of reviews of some products.
prod_id text rating
AB123 some text 5
AB123 some text 2
AB123 some text 4
AC456 some text 3
AC456 some text 2
AD777 some text 2
AD777 some text 5
AD777 some text 5
AD777 some text 4
AE999 some text 4
AF000 some text 5
AG222 some text 5
AG222 some text 3
AG222 some text 3
I want to know which product has the most reviews (the most rows), so I use the following code to get the top 3 products (I only need the 3 most-reviewed products).
s = df['prod_id'].value_counts().sort_values(ascending=False).head(3)
And then I will get this result.
AD777 4
AB123 3
AG222 3
But what I actually need is the rows with those ids. I need all the rows for AD777, AB123, and AG222, like below:
product_id text rating
AD777 some text 2
AD777 some text 5
AD777 some text 5
AD777 some text 4
AB123 some text 5
AB123 some text 2
AB123 some text 4
AG222 some text 5
AG222 some text 3
AG222 some text 3
How do I do that? I tried print(df.iloc[s]), but of course it doesn't work. As I read in the documentation, value_counts returns a Series, not a DataFrame. Any idea? Thanks
I think you need merge with a left join, joining a DataFrame created from the index of s:
df = pd.DataFrame({'prod_id': s.index}).merge(df, how='left')
print(df)
prod_id text rating
0 AD777 some text 2
1 AD777 some text 5
2 AD777 some text 5
3 AD777 some text 4
4 AB123 some text 5
5 AB123 some text 2
6 AB123 some text 4
7 AG222 some text 5
8 AG222 some text 3
9 AG222 some text 3
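The same selection also works without building a helper frame: with prod_id as the index, .loc on the three labels returns every matching row, in that order (a small sketch reusing s from the question):
print(df.set_index('prod_id').loc[s.index].reset_index())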
Here is a one-liner solution which doesn't use a helper series:
df.assign(rank=df.groupby('prod_id')['prod_id']
                 .transform('size')
                 .rank(method='dense', ascending=False)) \
  .sort_values('rank') \
  .query("rank <= 3") \
  .drop(columns='rank')
prod_id text rating
5 AD777 some text 2
6 AD777 some text 5
7 AD777 some text 5
8 AD777 some text 4
0 AB123 some text 5
1 AB123 some text 2
2 AB123 some text 4
11 AG222 some text 5
12 AG222 some text 3
13 AG222 some text 3
3 AC456 some text 3
4 AC456 some text 2
Note that the dense rank keeps every product whose count is among the top 3 distinct counts, which is why AC456 also appears. But if you already have your s series, then @jezrael's solution looks much more elegant.
Try this:
df[df.prod_id.isin(df.prod_id.value_counts().head(3).index)]
EDIT:
Thanks to @jezrael for pointing out the ordering problem.
df.assign(Forsort=df.prod_id.map(df.prod_id.value_counts().head(3)))\
  .dropna().sort_values('Forsort', ascending=False).drop('Forsort', axis=1)
Out[150]:
prod_id text rating
5 AD777 some 2
6 AD777 some 5
7 AD777 some 5
8 AD777 some 4
0 AB123 some 5
1 AB123 some 2
2 AB123 some 4
11 AG222 some 5
12 AG222 some 3
13 AG222 some 3
This was the easiest solution that worked for me:
df.groupby('prod_id').first()
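For completeness, a sketch of one more approach (not from the answers above), assuming pandas 1.1+ for the key argument of sort_values; it keeps all rows of the 3 most-reviewed products, ordered by review count:
top3 = df['prod_id'].value_counts().head(3).index
order = {p: i for i, p in enumerate(top3)}  # e.g. AD777 -> 0, AB123 -> 1, AG222 -> 2
result = (df[df['prod_id'].isin(top3)]
            .sort_values('prod_id', key=lambda col: col.map(order)))
print(result)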
