Transform dataframe/pandas - python

I have a dataframe with tag as a column and movieId as the index. movieId represents movies, so there may be multiple tags for the same movieId.
I want to transform this dataframe so that there is one column per tag and one row per movieId.
For each movieId, the tag column should contain 1 if that tag was applied to the movie and 0 otherwise.
When I try, the movieIds appear several times instead of once.
Could someone help me? Thank you very much.

We can use the pd.crosstab() function to get the required output.
I created a sample dataframe df:
movieId tag
260 Best movie ever
1240 scifi
2133 Best movie ever
1097 scifi
260 scifi
250 scifi
Using pd.crosstab():
pd.crosstab(df.movieId, df.tag, dropna=False)
The output will be like this:
tag      Best movie ever  scifi
movieId
250                    0      1
260                    1      1
1097                   0      1
1240                   0      1
2133                   1      0
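Note that pd.crosstab() returns occurrence counts, so if the same (movieId, tag) pair appears more than once in the data a cell can exceed 1. A minimal sketch that clips the counts down to a strict 0/1 indicator (the sample data here is just an assumption):
import pandas as pd

df = pd.DataFrame({
    "movieId": [260, 1240, 2133, 1097, 260, 250],
    "tag": ["Best movie ever", "scifi", "Best movie ever", "scifi", "scifi", "scifi"],
})

# crosstab counts occurrences; clip at 1 so every cell is either 0 or 1
indicator = pd.crosstab(df.movieId, df.tag).clip(upper=1)
print(indicator)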

I hope this fixes the problem:
import pandas as pd
import numpy as np

df = pd.DataFrame([[260, "best"], [520, "sci"], [260, "sci"]],
                  columns=['movieId', 'tag'])
print("Dummy DataFrame: \n", df)

movieIds, tags = list(df['movieId'].unique()), list(df['tag'].unique())

# one row per unique movieId, one 0/1 column per unique tag (plus the movieID column)
dfmatrix = pd.DataFrame(np.zeros((len(movieIds), len(tags) + 1), dtype=int),
                        columns=['movieID'] + tags)

for i, movie in enumerate(movieIds):
    listoftags = df.tag[df['movieId'] == movie]
    dfmatrix.loc[i, 'movieID'] = movie      # .loc avoids chained-assignment warnings
    for tag in listoftags:
        dfmatrix.loc[i, tag] = 1            # mark the tag as present for this movie

print("\n \n dfmatrix \n", dfmatrix)
The output is:
Dummy DataFrame:
movieId tag
0 260 best
1 520 sci
2 260 sci
dfmatrix
movieID best sci
0 260 1 1
1 520 0 1
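The same result can also be produced without explicit loops. A minimal sketch of that idea, reusing the dummy dataframe above (the astype(int) at the end is only there to keep the 0/1 integer look of the loop version):
import pandas as pd

df = pd.DataFrame([[260, "best"], [520, "sci"], [260, "sci"]],
                  columns=['movieId', 'tag'])

# one-hot encode the tag column, then collapse duplicate movieIds with max()
# so each movie ends up on exactly one row with a 0/1 flag per tag
dfmatrix = (pd.get_dummies(df.set_index('movieId')['tag'])
            .groupby(level=0).max()
            .astype(int)
            .reset_index())
print(dfmatrix)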

Related

Incorrect output with np.random.choice

I am trying to randomly select records from a 17MM-row dataframe using np.random.choice, as it runs faster than other methods, but I am getting incorrect values in the output against each record. Example below:
data = {
    "calories": [420, 380, 390, 500, 200, 100],
    "duration": [50, 40, 45, 600, 450, 210],
    "id": [1, 1, 2, 3, 2, 3],
    "yr": [2003, 2003, 2009, 2003, 2012, 2003],
    "mth": [3, 6, 9, 12, 3, 6],
}
df = pd.DataFrame(data)
df2 = df.groupby(['id', 'yr'], as_index=False).agg(np.random.choice)
Output:
id    yr    calories  duration  mth
1     2003  420       50        6
2     2009  390       45        9
2     2012  200       450       3
3     2003  500       210       6
The problem in the output is for id 3: alongside calories 500, duration and mth should be 600 and 12 instead of 210 and 6. Can anyone please explain why it is choosing values from a different row?
Expected output:
Values from the same row should be retained after the random selection.
This doesn't work because pandas applies the aggregate to each column independently. Putting a print statement in, e.g.:
def fn(x):
    print(x)
    return np.random.choice(x)

df.groupby(['id', 'yr'], as_index=False).agg(fn)
would let you see when the function was called and what it was called with.
I'm not an expert in Pandas, but using GroupBy.apply seems to be the easiest way I've found of keeping rows together.
Something like the following:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "calories": [420, 380, 390, 500, 200, 100],
    "duration": [50, 40, 45, 600, 450, 210],
    "id": [1, 1, 2, 3, 2, 3],
    "yr": [2003, 2003, 2009, 2003, 2012, 2003],
    "mth": [3, 6, 9, 12, 3, 6],
})

df.groupby(['id', 'yr'], as_index=False).apply(lambda x: x.sample(1))
produces:
calories duration id yr mth
0 1 380 40 1 2003 6
1 2 390 45 2 2009 9
2 4 200 450 2 2012 3
3 5 100 210 3 2003 6
The two numbers at the beginning of each row are there because you end up with a MultiIndex. If you want to know which rows were selected, this contains useful information; otherwise you can discard the index.
Note that the docs warn this might not be very performant, but I don't know the details.
Update: I've just had more of a read of the docs, and noticed that there's a GroupBy.sample method, so you could instead just do:
df.groupby(['id', 'yr']).sample(1)
which would presumably be performant as well as being much shorter!
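A minimal runnable sketch of that GroupBy.sample approach, using the same dummy data; random_state is optional and only assumed here to make the draw reproducible:
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390, 500, 200, 100],
    "duration": [50, 40, 45, 600, 450, 210],
    "id": [1, 1, 2, 3, 2, 3],
    "yr": [2003, 2003, 2009, 2003, 2012, 2003],
    "mth": [3, 6, 9, 12, 3, 6],
})

# one whole row per (id, yr) group, so values from the same row stay together
picked = df.groupby(['id', 'yr']).sample(n=1, random_state=0)
print(picked)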

Pandas: for each primary key keep only the row having the max value in another column

How can I keep for each element in Customer_ID only the Col2_ID and Qta with the maximum value of the Qta column and discard all the rest?
I'm stuck here:
df1 = df.groupby(["Customer_ID", "Col2_ID"]).Qta.sum()
Customer_ID Col2_ID Qta
0 536544 600
536546 1
536550 1
536553 3
536555 1
...
18283 579673 134
580872 142
18287 554065 488
570715 990
573167 108
After grouping I have multiple (Col2_ID, Qta) pairs for each customer, but for each customer I only want the (Col2_ID, Qta) pair with the maximum Qta.
For example, instead of the output given by my program, the output I need would be:
Customer_ID Col2_ID Qta
0 536544 600
...
18283 580872 142
18287 570715 990
I'm new to pandas and I can't find what I need in the documentation.
You can chain it with a max over level 0 of the index, which takes the max for every Customer_ID (older pandas accepted .max(level=0); newer versions spell it .groupby(level=0).max()):
df.groupby(["Customer_ID", "Col2_ID"]).Qta.sum().groupby(level=0).max()
Here the index is Customer_ID only. To get both Customer_ID and Col2_ID back as the index, keep only the rows whose Qta equals the per-customer maximum:
out = df.groupby(["Customer_ID", "Col2_ID"]).Qta.sum().reset_index(level=1)
out = out[out['Qta'] == out.groupby(level=0)['Qta'].transform('max')]
out = out.set_index('Col2_ID', append=True)
Now the index of out is a MultiIndex with Customer_ID and Col2_ID.
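An alternative sketch that avoids indexing by level altogether: sum Qta per (Customer_ID, Col2_ID) and then keep, for every Customer_ID, the row with the largest sum via idxmax. The small dataframe below is hypothetical, reconstructed from the numbers in the question:
import pandas as pd

df = pd.DataFrame({
    "Customer_ID": [0, 0, 18287, 18287, 18287],
    "Col2_ID":     [536544, 536546, 554065, 570715, 573167],
    "Qta":         [600, 1, 488, 990, 108],
})

# sum Qta per (Customer_ID, Col2_ID) pair, keeping them as ordinary columns
sums = df.groupby(["Customer_ID", "Col2_ID"], as_index=False)["Qta"].sum()

# idxmax gives, for each Customer_ID, the row label of the largest Qta
best = sums.loc[sums.groupby("Customer_ID")["Qta"].idxmax()]
print(best)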

Group data by two columns and count it using pandas

I have the following data:
songs
play_event
In songs the data is as below:
song_id total_plays
1 2000
2 4532
3 9999
4 2343
And in play_event the data is as below:
user_id song_id
102 1
103 4
102 1
102 3
104 2
102 1
Each time a song is played there is a new entry, even if the same song is played again.
With this data I want to:
Get the total number of times each user played each song. For example, as per the above data, user_id 102 played song_id 1 three times. I want it grouped by user_id with the total count, something like below:
user_id song_id count
102 1 3
102 3 1
103 4 1
104 2 1
I am thinking of using pandas for this, but I want to know if pandas is the right choice.
If it's not pandas, what should be my way forward?
If pandas is the right choice, then:
The code below lets me get the count grouped by user_id or by song_id, but how do I get the count grouped by user_id and song_id together? See the sample code I tried below:
import pandas as pd

# Load data from csv file
data = pd.read_csv('play_events.csv')

# Gives how many entries per user
data['user_id'].value_counts()

# Gives how many entries per song
data['song_id'].value_counts()
For your first problem, a simple groupby and value_counts does the trick. Note that everything after value_counts() in the code below is just to get it to an actual dataframe in the same format as your desired output.
counts = play_events.groupby('user_id')['song_id'].value_counts().to_frame('count').reset_index()
>>> counts
user_id song_id count
0 102 1 3
1 102 3 1
2 103 4 1
3 104 2 1
Then for your second problem (which you have deleted in your edited post, but I will leave just in case it is useful to you), you can loop through counts, grouping by user_id, and save each as csv:
for user, data in counts.groupby('user_id', as_index=False):
    data.to_csv(str(user) + '_events.csv')
For your example dataframes, this gives you 3 csvs: 102_events.csv, 103_events.csv, and 104_events.csv. The first looks like:
user_id song_id count
0 102 1 3
1 102 3 1
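An equivalent way to get the same counts, if you prefer, is a groupby on both columns followed by size(); a small sketch using the data from the question:
import pandas as pd

play_events = pd.DataFrame({
    "user_id": [102, 103, 102, 102, 104, 102],
    "song_id": [1, 4, 1, 3, 2, 1],
})

# size() counts the rows in each (user_id, song_id) group in one step
counts = (play_events.groupby(['user_id', 'song_id'])
          .size()
          .reset_index(name='count'))
print(counts)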

Simplify query to just 'where' by making new column with pandas?

I have a column with SQL queries. These are run through a function called Select_analysis.
Form:
Select_analysis(input_shapefile, output_name, {where_clause}) # it only accepts up to a where clause
Example:
SELECT * from OT # OT is a dataset
GROUP BY OT.CA # CA is a number that may occur many times, therefore we group by that field.
HAVING ((Count(OT.OBJECTID))>1) # an id that appears more than once.
OT dataset
objectid CA
1 125
2 342
3 263
1 125
We group by CA.
About HAVING: it keeps the groups whose objectid appears more than once, which is objectid 1 in this example.
My idea is to make another column that stores this result so it can be filtered with a simple where clause in the Select_analysis function.
example: OT dataset
objectid CA count_of_objectid_aftergroupby
1 125 2
2 342 1
3 263 1
1 125 2
So then the call can be:
Select_analysis(roads.shp, output.shp, count_of_objectid_aftergroupby > '1')
Note:
It has to be done in such a way that the Select_analysis function is used at the end.
Assuming that you are pulling the data into pandas since it's tagged pandas, here's one possible solution:
df = pd.DataFrame({'objectID': [1, 2, 3, 1], 'CA': [125, 342, 463, 125]}).set_index('objectID')

           CA
objectID
1         125
2         342
3         463
1         125
df['count_of_objectid_aftergroupby'] = [df['CA'].value_counts().loc[x] for x in df['CA']]

           CA  count_of_objectid_aftergroupby
objectID
1         125                               2
2         342                               1
3         463                               1
1         125                               2
The list comp does basically this:
Pull the value counts for each item in df['CA'] as a series.
Use loc to index into the series at each value of 'CA' to find the count of that value.
Put that item into a list.
Append that list as a new column.
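The same column can also be built without the list comprehension by letting groupby broadcast each group's size back onto the rows; a minimal sketch of that alternative:
import pandas as pd

df = pd.DataFrame({'objectID': [1, 2, 3, 1], 'CA': [125, 342, 463, 125]})

# transform('count') returns one value per original row, aligned with the index,
# so the new column drops straight into place
df['count_of_objectid_aftergroupby'] = df.groupby('CA')['objectID'].transform('count')
print(df)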

Counting the number of customers by values in a second series

I have imported a list of customers into Python to run some RFM analysis. This adds a new field to the data for the RFM class, so now my data looks like this:
customer RFMClass
0 0001914f-4655-4148-a1dc-1f25ca6d1f15 343
1 0002e50a-5551-4d9a-8734-76307dfe2131 341
2 00039977-512e-47ad-b929-170f18a1b14a 442
3 000693ff-2c61-425c-97c1-0286c874dd2f 443
4 00095dc2-7f37-48b0-894f-910d90cbbee2 142
5 000b748b-7ea0-48f2-a875-5f6cb95561d9 141
...
I'd like to plot a histogram showing the number of customers in each RFM class. How can I get a count of the number of distinct customer IDs per class?
I tried adding a 1 to every row with summary['number'] = 1, thinking it might be easier to count these rather than the customer IDs (they have already been de-duplicated in my code), but I can't figure out how to sum these per RFM class either.
Any thoughts on how I could do this?
I worked this out by using .groupby and aggregating with a sum; here is the same pattern applied to another of my columns, grouping by Hour and summing Orders:
byhour = df.groupby(['Hour']).agg({'Orders': 'sum'})
print(byhour)
This then produces the desired output:
Orders
Hour
0 902
1 438
2 307
3 162
4 149
5 233
6 721
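Applied directly to the RFM data, the same idea can be written without the helper 'number' column; a small sketch (the frame name summary and the sample values are just assumptions):
import pandas as pd
import matplotlib.pyplot as plt

summary = pd.DataFrame({
    "customer": ["0001914f", "0002e50a", "00039977", "000693ff", "00095dc2", "000b748b"],
    "RFMClass": [343, 341, 442, 443, 142, 141],
})

# count distinct customer IDs per RFM class, then plot the counts as a bar chart
class_counts = summary.groupby('RFMClass')['customer'].nunique()
class_counts.plot(kind='bar')
plt.xlabel('RFM class')
plt.ylabel('Number of customers')
plt.show()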
