Get top rows from column value count with pandas - python

Let's say I have this kind of data. It's a set of reviews of some products.
prod_id text rating
AB123 some text 5
AB123 some text 2
AB123 some text 4
AC456 some text 3
AC456 some text 2
AD777 some text 2
AD777 some text 5
AD777 some text 5
AD777 some text 4
AE999 some text 4
AF000 some text 5
AG222 some text 5
AG222 some text 3
AG222 some text 3
I want to know which product has the most reviews (the most rows), so I use the following code to get the top 3 products (I only need 3 top most reviewed products).
s = df['prod_id'].value_counts().sort_values(ascending=False).head(3)
And then I will get this result.
AD777 4
AB123 3
AG222 3
But what I actually need is the rows with the ids as above. I need the whole rows of all AD777, AB123, and AG222, like below.
prod_id text rating
AD777 some text 2
AD777 some text 5
AD777 some text 5
AD777 some text 4
AB123 some text 5
AB123 some text 2
AB123 some text 4
AG222 some text 5
AG222 some text 3
AG222 some text 3
How do I do that? I tried print(df.iloc[s]), but of course it doesn't work. As I read in the documentation, value_counts returns a Series, not a DataFrame. Any ideas? Thanks

I think you need a left-join merge with a DataFrame created from the index of s:
df = pd.DataFrame({'prod_id':s.index}).merge(df, how='left')
print (df)
prod_id text rating
0 AD777 some text 2
1 AD777 some text 5
2 AD777 some text 5
3 AD777 some text 4
4 AB123 some text 5
5 AB123 some text 2
6 AB123 some text 4
7 AG222 some text 5
8 AG222 some text 3
9 AG222 some text 3
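For reference, here is a self-contained sketch of the whole pipeline, with the sample data taken from the question:
import pandas as pd

df = pd.DataFrame({
    'prod_id': ['AB123', 'AB123', 'AB123', 'AC456', 'AC456',
                'AD777', 'AD777', 'AD777', 'AD777', 'AE999',
                'AF000', 'AG222', 'AG222', 'AG222'],
    'text': ['some text'] * 14,
    'rating': [5, 2, 4, 3, 2, 2, 5, 5, 4, 4, 5, 5, 3, 3],
})

# top 3 most-reviewed products; value_counts already sorts descending,
# so the extra sort_values in the question is not strictly needed
s = df['prod_id'].value_counts().head(3)

# the merge preserves the key order of the left frame, so the result
# comes out grouped in descending review-count order
out = pd.DataFrame({'prod_id': s.index}).merge(df, how='left')
print(out)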

Here is a one-liner solution which doesn't use a helper series:
In [63]: df.assign(rank=df.groupby('prod_id')['prod_id']
...: .transform('size')
...: .rank(method='dense', ascending=False)) \
...: .sort_values('rank') \
...: .query("rank <= 3") \
...: .drop('rank', axis=1)
Out[63]:
prod_id text rating
5 AD777 some text 2
6 AD777 some text 5
7 AD777 some text 5
8 AD777 some text 4
0 AB123 some text 5
1 AB123 some text 2
2 AB123 some text 4
11 AG222 some text 5
12 AG222 some text 3
13 AG222 some text 3
3 AC456 some text 3
4 AC456 some text 2
Note that rank(method='dense') keeps every product whose review count falls within the top 3 distinct counts, which is why AC456 (only 2 reviews) also appears above. But if you already have your s series, then @jezrael's solution looks much more elegant.

Try this:
df[df.prod_id.isin(df.prod_id.value_counts().head(3).index)]
EDIT:
Thanks to @jezrael for pointing out the ordering problem.
(df.assign(Forsort=df.prod_id.map(df.prod_id.value_counts().head(3)))
   .dropna()
   .sort_values('Forsort', ascending=False)
   .drop('Forsort', axis=1))
Out[150]:
prod_id text rating
5 AD777 some text 2
6 AD777 some text 5
7 AD777 some text 5
8 AD777 some text 4
0 AB123 some text 5
1 AB123 some text 2
2 AB123 some text 4
11 AG222 some text 5
12 AG222 some text 3
13 AG222 some text 3

This was the easiest solution that worked for me (note that it keeps only the first review per product, not all of its rows):
df.groupby('prod_id').first()

Related

Adding multiple columns randomly to a dataframe from columns in another dataframe

I've looked everywhere but can't find a solution.
Let's say I have two tables:
Year
1
2
3
4
and
ID Value
1 10
2 50
3 25
4 20
5 40
I need to pick rows randomly from the 2nd table and add both of its columns to the first table - so if ID=3 is picked randomly, I also add Value=25, i.e. I end up with something like:
Year ID Value
1 3 25
2 1 10
3 1 10
4 5 40
5 2 50
IIUC, is this what you want?
df_year[['ID', 'Value']] = df_id.sample(n=len(df_year), replace=True).to_numpy()
Output:
Year ID Value
0 1 4 20
1 2 4 20
2 3 2 50
3 4 3 25
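For reproducibility, here is a minimal sketch of the frames assumed above (the names df_year and df_id come from the answer). Note that .to_numpy() discards the sampled rows' index, so the values assign positionally instead of aligning on index labels:
import pandas as pd

df_year = pd.DataFrame({'Year': [1, 2, 3, 4]})
df_id = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                      'Value': [10, 50, 25, 20, 40]})

# sample len(df_year) rows with replacement and assign positionally
df_year[['ID', 'Value']] = df_id.sample(n=len(df_year), replace=True).to_numpy()
print(df_year)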

How to append a specific string according to each value in a string pandas dataframe column?

Let's take these sample dataframes :
df = pd.DataFrame({'Id':['1','2','3','4','5'], 'Value':[9,8,7,6,5]})
Id Value
0 1 9
1 2 8
2 3 7
3 4 6
4 5 5
df_name = pd.DataFrame({'Id':['1','2','4'], 'Name':['Andrew','Jason','John']})
Id Name
0 1 Andrew
1 2 Jason
2 4 John
I would like to add, in the Id column of df, the Name of the person (available in df_name) in brackets, if it exists. I know how to do this with a for loop over the Id column of df, but it is inefficient with large dataframes. Do you know of a better way to do this?
Expected output :
Id Value
0 1 (Andrew) 9
1 2 (Jason) 8
2 3 7
3 4 (John) 6
4 5 5
Use Series.map to look up matching values, add the parentheses, and replace the non-matched values with the original column using Series.fillna:
df['Id'] = ((df['Id'] + ' (' + df['Id'].map(df_name.set_index('Id')['Name']) + ')')
.fillna(df['Id']))
print (df)
Id Value
0 1 (Andrew) 9
1 2 (Jason) 8
2 3 7
3 4 (John) 6
4 5 5
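An equivalent two-step version of the same logic may be easier to follow (the intermediate names are only for illustration):
import pandas as pd

df = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'], 'Value': [9, 8, 7, 6, 5]})
df_name = pd.DataFrame({'Id': ['1', '2', '4'], 'Name': ['Andrew', 'Jason', 'John']})

name_map = df_name.set_index('Id')['Name']   # Id -> Name lookup Series
labels = '(' + df['Id'].map(name_map) + ')'  # "(Andrew)", NaN where Id has no name
df['Id'] = (df['Id'] + ' ' + labels).fillna(df['Id'])
print(df)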

How to read a text file into a dataframe and split its values?

I have this data set in an Excel file. I want to keep only the rows whose values have length 6, delete the rest, and export the result with each value split into a separate column.
Is there a function to read the file and split the numeric values like this?
From your shared data, it seems the numbers are separated by spaces, so they will already be read in as str. You can try the code below.
Your df looks like this:
a
0 11
1 2
2 3 2 4
3 5
4 1
5 6
6 1 1
7 6
8 6 7 7 7 6 6 8 8 8
9 6 8 7 9 5 2 1 44 6 55
10 6 8 7 9 5 2 1 44 6 55 4 4 4 4
Filter the rows whose string length equals 6:
df = df[df['a'].str.len() == 6]
Then split them using the str.split() method:
df['a'].str.split(' ', expand=True)
output:
0 1 2 3
2 3 2 4
EDIT:
If you run into memory trouble while reading a large file, you can refer to this SO post, or read the file in chunks and append/save the output to a new file:
reader = pd.read_csv(filePath, chunksize=1000000, low_memory=False, header=0)
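A sketch of the chunked loop, assuming the same column name 'a' as above and a hypothetical output file name:
import pandas as pd

filePath = 'input.txt'  # hypothetical path
reader = pd.read_csv(filePath, chunksize=1000000, low_memory=False, header=0)
for i, chunk in enumerate(reader):
    out = chunk[chunk['a'].str.len() == 6]['a'].str.split(' ', expand=True)
    # write the header only for the first chunk, then append
    out.to_csv('output.csv', mode='w' if i == 0 else 'a', header=(i == 0), index=False)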

I want to add a new column based on another column's data in pandas

I have multiple csv files which I merged together. After that, in order to identify each individual csv's data within the merged file, I wish to create a new column in pandas called Serial.
The new column should be numbered based on the data in the Sequence Number column (for example 1,1,1,1,1 then 2,2 then 3,3,3,3 - a new serial for every new csv). I have attached a snapshot of the csv file as well.
Sequence Number
1
2
3
4
5
1
2
1
2
3
4
I want the output to look like this:
Serial Sequence Number
1 1
1 2
1 3
1 4
1 5
2 1
2 2
3 1
3 2
3 3
3 4
Use DataFrame.insert to add the column in the first position, filled with a boolean mask comparing against 1 via Series.eq (==) followed by a cumulative sum via Series.cumsum:
df.insert(0, 'Serial', df['Sequence Number'].eq(1).cumsum())
print (df)
Serial Sequence Number
0 1 1
1 1 2
2 1 3
3 1 4
4 1 5
5 2 1
6 2 2
7 3 1
8 3 2
9 3 3
10 3 4
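If a new file's counter can restart at a value other than 1, a variant that compares each value with its predecessor should work as well (a sketch, not part of the original answer):
import pandas as pd

df = pd.DataFrame({'Sequence Number': [1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4]})

# a new group starts wherever the counter stops increasing
df.insert(0, 'Serial', df['Sequence Number'].diff().le(0).cumsum() + 1)
print(df)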

Python sort CSV File

Hey, I have a CSV file with many rows, and the value in one of the columns constantly repeats. Is it possible to keep only the first occurrence of the name in that column while keeping all the other data? I tried pandas, but pandas asks for an aggregate function such as sum. My data in the CSV file looks like this:
H1 h2 h3 h4
A 1 2 3 4
A 2 3 4 5
A 3 4 5 6
B 1 2 3 4
B 2 3 4 5
B 3 4 5 6
C 1 2 3 4
C 2 3 4 5
C 3 4 5 6
Each of these columns has a header, shown as h1-h4.
My real data is not like this; it contains actual text values.
I want to rearrange the data so it looks like this:
A
1 2 3 4
2 3 4 5
3 4 5 6
B
1 2 3 4
2 3 4 5
3 4 5 6
C
1 2 3 4
2 3 4 5
3 4 5 6
Or
A 1 2 3 4
2 3 4 5
3 4 5 6
B 1 2 3 4
2 3 4 5
3 4 5 6
C 1 2 3 4
2 3 4 5
3 4 5 6
So basically I want to group by the first header, h1. Any help would be appreciated, thanks.
The following should work. It assumes your source data is space-delimited (as you have shown); if it uses commas or tabs, you will need to change the delimiter I have used.
import csv
with open("input.csv", "r") as f_input, open("output.csv", "wb") as f_output:
    csv_input = csv.reader(f_input, delimiter=" ")
    csv_output = csv.writer(f_output)
    headers = next(csv_input)            # read (and keep) the header row
    cur_row = ""
    for cols in csv_input:
        if cur_row != cols[0]:           # first row of a new group
            cur_row = cols[0]
            csv_output.writerow([cur_row])
        csv_output.writerow(cols[1:])    # data row without the group name
Giving you an output CSV file as follows:
A
1,2,3,4
2,3,4,5
3,4,5,6
B
1,2,3,4
2,3,4,5
3,4,5,6
C
1,2,3,4
2,3,4,5
3,4,5,6
Tested using Python 2.7
To add the headers for each group, change the first writerow line as follows:
csv_output.writerows([[cur_row], headers])
Giving the following output:
A
H1,h2,h3,h4
1,2,3,4
2,3,4,5
3,4,5,6
B
H1,h2,h3,h4
1,2,3,4
2,3,4,5
3,4,5,6
C
H1,h2,h3,h4
1,2,3,4
2,3,4,5
3,4,5,6
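Under Python 3, the only change needed is opening the files in text mode with newline='' instead of "wb" (a sketch of the adaptation; the original was only tested on 2.7):
import csv

with open("input.csv", "r", newline="") as f_input, \
     open("output.csv", "w", newline="") as f_output:
    csv_input = csv.reader(f_input, delimiter=" ")
    csv_output = csv.writer(f_output)
    headers = next(csv_input)
    cur_row = ""
    for cols in csv_input:
        if cur_row != cols[0]:
            cur_row = cols[0]
            csv_output.writerow([cur_row])
        csv_output.writerow(cols[1:])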
