I am new to pandas and trying to automatically create categories and group the values.
My dataframe:
df = pd.DataFrame({'Slug': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
'Position': ['0', '1', '2', '3', '4', '0', '1', '2', '3', '0', '1', '2'],
'Brand': ['Mazda', 'BMW', 'Ford', 'Fiat', 'Dodge', 'Mazda', 'BMW', 'Ford', 'Fiat', 'BMW', 'Ford', 'Fiat'],
'Sessions': ['70', '', '', '', '', '60', '', '', '', '50', '', ''],
'Transactions': ['1', '', '', '', '', '2', '', '', '', '3', '', ''],
'Ecommerce': ['1', '', '', '', '', '3', '', '', '', '4', '', ''],
'CTR': ['10', '', '', '', '', '15', '', '', '', '5', '', ''],
'All': ['11', '', '', '', '', '1', '', '', '', '4', '', '']})
I am trying to answer the question: which layout of brands has the best conversion? The Position column declares the order in which the brands are written down on the site:
Example:
0 A #Ford
1 B #BMW
2 C #Fiat
3 D #Dodge
The question is whether having Ford in second place and BMW in first would lead to more conversions.
The first thing I am trying to do is generate a category for each unique group; there are about 10 different brands and roughly 100 different ways they are arranged.
For example:
Group1 could be:
0 A #Ford
1 B #BMW
2 C #Fiat
3 D #Dodge
Group2 could be:
0 B #BMW
1 A #Ford
2 C #Fiat
3 D #Dodge
Then my DataFrame would look like this:
Slug Group Sessions Transactions Ecommerce CTR All
a 1 70 1 1 10 10
b 2 60 2 3 15 11
c 1 60 2 3 15 11
d 3 60 2 3 15 11
e 2 60 2 3 15 11
Groups are determined by the Position and Brand columns.
Slug can be understood as a country: in country a, the layout of group 1 achieved 70 sessions; in country b, the layout of group 2 achieved 60 sessions, and so on.
Then I could compare group performance measured in sessions, transactions and the other column values in my DataFrame.
Metrics like transactions and sessions apply to the entire layout of brands, for example:
0 Ford
1 BMW
2 Fiat
3 Dodge
# this layout achieved 70 sessions and 5 conversions
So my question can be divided into 3 separate parts:
1) How could I generate groups from the Position and Brand columns? (See the sketch at the end of this question.)
2) Maybe some of you have bumped into a similar task and know methods for determining the best layout of brands?
3) I've tried a bit of machine learning; maybe you could suggest which model I could apply to my problem?
Thank you for your suggestions.
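A minimal sketch of what part 1 could look like, assuming each slug's ordered Brand sequence defines its layout (the '|' join and the layout/Group names are placeholders of mine):
import pandas as pd

# df as defined above
df['Position'] = df['Position'].astype(int)

# one layout string per slug, brands joined in position order, e.g. "Mazda|BMW|Ford|Fiat|Dodge"
layout = (df.sort_values(['Slug', 'Position'])
            .groupby('Slug')['Brand']
            .agg('|'.join))

# number each distinct layout and map the group id back onto every row of its slug
group_id = layout.groupby(layout).ngroup().add(1)
df['Group'] = df['Slug'].map(group_id)

# the metrics only sit on the first row of each slug, so compare groups on those rows
metrics = ['Sessions', 'Transactions', 'Ecommerce', 'CTR', 'All']
first_rows = df[df['Sessions'] != ''].copy()
first_rows[metrics] = first_rows[metrics].astype(float)
print(first_rows.groupby('Group')[metrics].mean())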
Related
df1 = [['aa', '21/01/2022', ''], ['aa', '22/01/2022', '22/01/2022'],
       ['aa', '22/01/2022', ''], ['aa', '22/01/2022', ''],
       ['bb', '25/01/2022', '25/01/2022'], ['bb', '26/01/2022', ''],
       ['bb', '26/01/2022', ''], ['cc', '21/01/2022', ''],
       ['cc', '21/01/2022', '22/01/2022'], ['cc', '21/01/2022', '']]
df = pd.DataFrame(df1, columns=['userid', 'Created', 'Signed_up'])
I have the above dataframe, and what I'm looking to do is count the number of plans 'Created' after the user has previously 'Signed up' with another plan.
Meaning, each row in the dataframe is a plan generated by a user, and I want to count the number of plans each user generated after having previously signed up, taking into account that each user can have only one signed-up plan, which simplifies the task a bit.
My assumption would be to use the combination of groupby() and cumsum() or cumcount(), but what I am having trouble with is incorporating the condition of having a previously notna() 'Signed_up' column.
Desired Output:
df2 = [['aa', '21/01/2022', '', ''], ['aa', '22/01/2022', '22/01/2022', ''],
       ['aa', '22/01/2022', '', '1'], ['aa', '22/01/2022', '', '2'],
       ['bb', '25/01/2022', '25/01/2022', ''], ['bb', '26/01/2022', '', '1'],
       ['bb', '26/01/2022', '', '2'], ['cc', '21/01/2022', '', ''],
       ['cc', '21/01/2022', '22/01/2022', ''], ['cc', '21/01/2022', '', '1']]
df_3 = pd.DataFrame(df2, columns=['userid', 'Created', 'Signed_up', 'count'])
Any help and suggestions are appreciated! Thanks in advance for any answers.
Code:
import numpy as np

# treat empty strings as missing values so ffill can work on them
df = df.replace('', np.nan)
# group key: last seen Signed_up date (forward-filled per user) + userid, so cumcount restarts at every sign-up
df['count'] = df.groupby(df.groupby(['userid'])['Signed_up'].ffill() + df['userid']).cumcount()
Updated code:
Here you can handle the NaN group by filling it with a unique value or the index value; below I fill it with an empty string and then concatenate with userid, which also produces your desired output.
df['count'] = df.groupby(df.groupby(['userid'])['Signed_up'].ffill().fillna('') + df['userid']).cumcount()
Output:
userid Created Signed_up count
0 aa 21/01/2022 NaN 0
1 aa 22/01/2022 22/01/2022 0
2 aa 22/01/2022 NaN 1
3 aa 22/01/2022 NaN 2
4 bb 25/01/2022 25/01/2022 0
5 bb 26/01/2022 NaN 1
6 bb 26/01/2022 NaN 2
7 cc 21/01/2022 NaN 0
8 cc 21/01/2022 22/01/2022 0
9 cc 21/01/2022 NaN 1
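If you want blanks instead of zeros, to match your desired output exactly, one way (a sketch building on the code above):
cnt = df.groupby(df.groupby(['userid'])['Signed_up'].ffill().fillna('') + df['userid']).cumcount()
df['count'] = cnt.where(cnt > 0, '')
# the sign-up row itself gets cumcount 0 (hence ''), and the plans created after it get 1, 2, ...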
I have a DataFrame with the columns below:
df = pd.DataFrame({'Name': ['Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy'],
'Lenght': ['10', '12.5', '11', '12.5', '12', '11', '12.5', '10', '5'],
'Try': [0,0,0,1,1,1,2,2,2],
'Batch':[0,0,0,0,0,0,0,0,0]})
In each batch a name gets arbitrarily many tries to achieve the greatest Lenght.
What I want to do is create a column win that has the value 1 for the greatest Lenght in a batch and 0 otherwise, with the following conditions:
If one name holds the greatest Lenght in a batch across multiple tries, only the first try gets the value 1 in win (see Abe in the expected output below).
If two separate names hold an equal greatest Lenght, then both get the value 1 in win.
What I have managed to do so far is:
df.groupby(['Batch', 'Name'])['Lenght'].apply(lambda x: (x == x.max()).map({True: 1, False: 0}))
But it doesn't satisfy all the conditions; any insight would be highly appreciated.
Expected output:
df = pd.DataFrame({'Name': ['Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy'],
'Lenght': ['10', '12.5', '11', '12.5', '12', '11', '12.5', '10', '5'],
'Try': [0,0,0,1,1,1,2,2,2],
'Batch':[0,0,0,0,0,0,0,0,0],
'win':[0,1,0,1,0,0,0,0,0]})
Many thanks!
Use GroupBy.transform to get the maximum value per group, compare it with the Lenght column using Series.eq, and cast the booleans to integers with Series.astype to map True -> 1 and False -> 0:
# changed the first row's data to match the second row, to test duplicate winning rows
df = pd.DataFrame({'Name': ['Karl', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy'],
                   'Lenght': ['12.5', '12.5', '11', '12.5', '12', '11', '12.5', '10', '5'],
                   'Try': [0,0,0,1,1,1,2,2,2],
                   'Batch':[0,0,0,0,0,0,0,0,0]})
df['Lenght'] = df['Lenght'].astype(float)

# m1: rows holding the maximum Lenght of their Batch
m1 = df.groupby('Batch')['Lenght'].transform('max').eq(df['Lenght'])
df1 = df[m1]
# m2: among those rows, names that reached the maximum in only one Try
m2 = df1.groupby('Name')['Try'].transform('nunique').eq(1)
# m3: first occurrence of each Name within a Batch among the maximum rows
m3 = ~df1.duplicated(['Name','Batch'])
df['new'] = ((m2 | m3) & m1).astype(int)
print (df)
Name Lenght Try Batch new
0 Karl 12.5 0 0 1
1 Karl 12.5 0 0 1
2 Billy 11.0 0 0 0
3 Abe 12.5 1 0 1
4 Karl 12.0 1 0 0
5 Billy 11.0 1 0 0
6 Abe 12.5 2 0 0
7 Karl 10.0 2 0 0
8 Billy 5.0 2 0 0
I have my code below. I am trying to create a dictionary from lists extracted from a txt file, but the loop overwrites the previous information:
f = open('data.txt', 'r')
lines = f.readlines()
lines = [line.rstrip('\n') for line in open('data.txt')]
columns = lines.pop(0)
for i in range(len(lines)):
    lines[i] = lines[i].split(',')
dictt = {}
for line in lines:
    dictt[line[0]] = line[1:]
print('\n')
print(lines)
print('\n')
print(dictt)
I know I have to play with:
for line in lines:
    dictt[line[0]] = line[1:]
part, but what can I do? Do I have to use numpy? If so, how?
My lines list is :
[['USS-Enterprise', '6', '6', '6', '6', '6'],
['USS-Voyager', '2', '3', '0', '4', '1'],
['USS-Peres', '10', '4', '0', '0', '5'],
['USS-Pathfinder', '2', '0', '0', '1', '2'],
['USS-Enterprise', '2', '2', '2', '2', '2'],
['USS-Voyager', '2', '1', '0', '1', '1'],
['USS-Peres', '8', '5', '0', '0', '4'],
['USS-Pathfinder', '4', '0', '0', '2', '1']]
My dict becomes:
{'USS-Enterprise': ['2', '2', '2', '2', '2'],
'USS-Voyager': ['2', '1', '0', '1', '1'],
'USS-Peres': ['8', '5', '0', '0', '4'],
'USS-Pathfinder': ['4', '0', '0', '2', '1']}
keeping only the last occurrence of each key; instead, I want to add the values together. I am really confused.
You are trying to append multiple values for the same key. You can use a defaultdict for that (see the sketch below), or modify your code to use the dictionary get method:
for line in lines:
    dictt[line[0]] = dictt.get(line[0], []) + line[1:]
This looks up each key with get: a new key starts from an empty list, and a duplicate key simply has the new values appended onto the previous ones.
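The defaultdict variant mentioned above, as a sketch:
from collections import defaultdict

dictt = defaultdict(list)
for line in lines:
    dictt[line[0]].extend(line[1:])   # a missing key automatically starts as an empty list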
dict_output = {}
for line in list_input:
    if line[0] not in dict_output:
        dict_output[line[0]] = line[1:]
    else:
        dict_output[line[0]] += line[1:]
EDIT: You subsequently clarified in comments that your input has duplicate keys, and you want later rows to overwrite earlier ones.
ORIGINAL ANSWER: The input is not a dictionary, it's a CSV file. Just use pandas.read_csv() to read it:
import pandas as pd
df = pd.read_csv('my.csv', sep=r'\s+', header=None)
df
0 1 2 3 4 5
0 USS-Enterprise 6 6 6 6 6
1 USS-Voyager 2 3 0 4 1
2 USS-Peres 10 4 0 0 5
3 USS-Pathfinder 2 0 0 1 2
4 USS-Enterprise 2 2 2 2 2
5 USS-Voyager 2 1 0 1 1
6 USS-Peres 8 5 0 0 4
7 USS-Pathfinder 4 0 0 2 1
Seems your input didn't have a header row. If your input columns had names, you can add them with df.columns = ['Ship', 'A', 'B', 'C', 'D', 'E'] or whatever.
If you really want to write a dict output (beware of duplicate keys being suppressed), see df.to_dict()
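And if "add the values together" means summing the rows of duplicate ships, a sketch building on the read_csv call above (the column names are made up):
df.columns = ['Ship', 'A', 'B', 'C', 'D', 'E']   # hypothetical names
totals = df.groupby('Ship', sort=False).sum()    # rows of the same ship are added together
print(totals)

# back to a dict of lists, if that shape is still needed
dictt = {ship: row.tolist() for ship, row in totals.iterrows()}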
Here is the data structure below... It is a list of inner lists, each containing two dictionaries.
I want to turn it into a dataframe with these headings: hasPossession, score and spread.
[[{'hasPossession': '0', 'score': '23', 'spread': '-0'},
{'hasPossession': '0', 'score': '34', 'spread': '0.0'}],
[{'hasPossession': '0', 'score': '', 'spread': '-7.5'},
{'hasPossession': '0', 'score': '', 'spread': '7.5'}],
[{'hasPossession': '0', 'score': '', 'spread': '-1'},
{'hasPossession': '0', 'score': '', 'spread': '1.0'}]]
In general, the structure above is a list that contains 3 lists, and each inner list contains 2 dictionaries with the same keys.
How do I transform this into a pandas dataframe?
Flatten the list and use the default DataFrame constructor:
pd.DataFrame([k for item in initial_list for k in item])
hasPossession score spread
0 0 23 -0
1 0 34 0.0
2 0 -7.5
3 0 7.5
4 0 -1
5 0 1.0
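If you also want numeric dtypes (the empty score strings become NaN), a small follow-up sketch:
df = pd.DataFrame([k for item in initial_list for k in item])
df[['score', 'spread']] = df[['score', 'spread']].apply(pd.to_numeric, errors='coerce')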
I have lines like this:
2 20 164 "guid" Some name^7 0 ip.a.dd.res:port -21630 25000
6 30 139 "guid" Other name^7 0 ip.a.dd.res:port 932 25000
I would like to split this, but the problem is that there is a varying number of spaces between these "words"...
How can I do this?
Python's split function doesn't care about the number of spaces:
>>> ' 2 20 164 "guid" Some name^7 0 ip.a.dd.res:port -21630 25000'.split()
['2', '20', '164', '"guid"', 'Some', 'name^7', '0', 'ip.a.dd.res:port', '-21630', '25000']
Have you tried split()? It will "compress" the spaces, so after splitting you will get:
'2', '20', '164', '"guid"' etc.
>>> l = "1 2 4 'ds' 5 66"
>>> l
"1 2 4 'ds' 5 66"
>>> l.split(' ')
['1', '', '', '2', '', '', '4', "'ds'", '5', '', '66']
>>> l.split()
['1', '2', '4', "'ds'", '5', '66']
Just use the split() function with no argument. It splits on any run of whitespace, i.e. the equivalent of the regex delimiter \s+.
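Equivalently with the re module, a sketch (only needed if you want the \s+ form explicitly):
import re

line = ' 2 20 164 "guid" Some name^7 0 ip.a.dd.res:port -21630 25000'
print(re.split(r'\s+', line.strip()))   # same result as line.split()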