Data Matching using pandas cumulative columns - python

I am trying to solve the following problem.
I have two data tables, for example:
names   age  salary  vehicle
jeff    20   100     missing
shinji  24   120     missing
rodger  18   150     missing
eric    25   160     missing
romeo   30   170     missing
and this other data table:
names   age  salary  vehicle     industry
jeff    20   100     car         video games
jeff    20   100     car         cell phone
jeff    20   100     motorcycle  soft drink
jeff    20   100     boat        pharmaceuticals
shinji  24   120     car         robots
shinji  24   120     car         animation
rodger  18   150     car         cars
rodger  18   150     motorcycle  glasses
eric    25   160     boat        video games
eric    25   160     car         arms
romeo   30   70      boat        vaccines
So for my first row, instead of "missing" in the vehicle column I want "CMB" (car, motorcycle, boat), because Jeff has all three. For Shinji I only want "C" because he has a car. For Rodger I want "CM" because he has a car and a motorcycle. For Eric I want "CB" because he has a car and a boat. For Romeo I want "B" because he only has a boat.
In other words, I want to go down the vehicle column of my second table and collect all the vehicles each person has.
But I am not sure of the logic for how to do this. I know I can match the rows by name, age, and salary.

Try this:
tmp = (
    # Find the unique vehicles for each person
    df2[['names', 'vehicle']].drop_duplicates()
    # Take the first letter of each vehicle, in upper case
    .assign(acronym=lambda x: x['vehicle'].str[0].str.upper())
    # For each person, join the acronyms of all their vehicles
    .groupby('names')['acronym'].apply(''.join)
)
result = df1.merge(tmp, left_on='names', right_index=True)
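As a self-contained check (the DataFrame construction below is mine, not from the original post), running that snippet on the sample data gives one acronym string per person. Note that the letters come out in order of first appearance in the second table, so eric gets "BC" rather than "CB"; sort the vehicle column before the groupby if you need a fixed letter order.
import pandas as pd

df1 = pd.DataFrame({
    'names': ['jeff', 'shinji', 'rodger', 'eric', 'romeo'],
    'age': [20, 24, 18, 25, 30],
    'salary': [100, 120, 150, 160, 170],
})
df2 = pd.DataFrame({
    'names': ['jeff', 'jeff', 'jeff', 'jeff', 'shinji', 'shinji',
              'rodger', 'rodger', 'eric', 'eric', 'romeo'],
    'vehicle': ['car', 'car', 'motorcycle', 'boat', 'car', 'car',
                'car', 'motorcycle', 'boat', 'car', 'boat'],
})

tmp = (
    df2[['names', 'vehicle']].drop_duplicates()
    .assign(acronym=lambda x: x['vehicle'].str[0].str.upper())
    .groupby('names')['acronym'].apply(''.join)
)
result = df1.merge(tmp, left_on='names', right_index=True)
print(result)
# jeff -> CMB, shinji -> C, rodger -> CM, eric -> BC, romeo -> B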

Related

How to separate a combined column, but with incongruent data

I'm preparing for a new job where I'll be receiving data submissions of varying quality; oftentimes dates/chars/etc. are combined together nonsensically and must be separated before analysis. I'm thinking ahead about how this might be solved.
Using a fictitious example below, I combined region, rep, and product together.
file['combine'] = file['Region'] + file['Sales Rep'] + file['Product']
Shift Region Sales Rep Product Cost per Units Sold combine
0 3 East Shirlene Pencil 5 71 EastShirlenePencil
1 3 South Anderson Folder 17 69 SouthAndersonFolder
2 3 West Shelli Folder 17 185 WestShelliFolder
3 3 South Damion Binder 30 159 SouthDamionBinder
4 3 West Shirlene Stapler 25 41 WestShirleneStapler
Assuming no other data, the question is, how can the 'combine' column be split up?
Many thanks in advance!
If you want spaces between the strings, you can do:
df["combine"] = df[["Region", "Sales Rep", "Product"]].apply(" ".join, axis=1)
print(df)
Prints:
Shift Region Sales Rep Product Cost per Units Sold combine
0 3 East Shirlene Pencil 5 71 East Shirlene Pencil
1 3 South Anderson Folder 17 69 South Anderson Folder
2 3 West Shelli Folder 17 185 West Shelli Folder
3 3 South Damion Binder 30 159 South Damion Binder
4 3 West Shirlene Stapler 25 41 West Shirlene Stapler
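Once the parts are space-separated like this, splitting them back apart is straightforward (a small addition of mine, not part of the original answer, assuming none of the components themselves contain spaces):
df["separated"] = df["combine"].str.split(" ")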
Or: if you want to split the already combined string:
import re
df["separated"] = df["combine"].apply(lambda x: re.findall(r"[A-Z][^A-Z]*", x))
print(df)
Prints:
Shift Region Sales Rep Product Cost per Units Sold combine separated
0 3 East Shirlene Pencil 5 71 EastShirlenePencil [East, Shirlene, Pencil]
1 3 South Anderson Folder 17 69 SouthAndersonFolder [South, Anderson, Folder]
2 3 West Shelli Folder 17 185 WestShelliFolder [West, Shelli, Folder]
3 3 South Damion Binder 30 159 SouthDamionBinder [South, Damion, Binder]
4 3 West Shirlene Stapler 25 41 WestShirleneStapler [West, Shirlene, Stapler]
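If you'd rather stay inside the pandas string API, Series.str.findall accepts the same pattern and avoids the explicit apply (a small variation of mine, not from the original answer):
df["separated"] = df["combine"].str.findall(r"[A-Z][^A-Z]*")
Either way, the pattern assumes each component starts with exactly one capital letter, so names with internal capitals (e.g. McDonald) would be split into extra pieces.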

How to compare two data rows before concatenating them?

I have 2 datasets (in CSV format) of different sizes, as follows:
df_old:
index category text
0 spam you win much money
1 spam you are the winner of the game
2 not_spam the weather in Chicago is nice
3 not_spam pizza is an Italian food
4 neutral we have a party now
5 neutral they are driving to downtown
df_new:
index category text
0 spam you win much money
14 spam London is the capital of Canada
15 not_spam no more raining in winter
25 not_spam the soccer game plays on HBO
4 neutral we have a party now
31 neutral construction will be done
I am using code that concatenates df_new to df_old so that df_new's rows go on top of df_old's rows within each category.
The code is:
(pd.concat([df_new,df_old], sort=False).sort_values('category', ascending=False, kind='mergesort'))
Now, the problem is that rows that match on index, category, and text all together (like [0, spam, you win much money]) end up duplicated, and I want to avoid this.
The expected output should be:
df_concat:
index category text
14 spam London is the capital of Canada
0 spam you win much money
1 spam you are the winner of the game
15 not_spam no more raining in winter
25 not_spam the soccer game plays on HBO
2 not_spam the weather in Chicago is nice
3 not_spam pizza is an Italian food
31 neutral construction will be done
4 neutral we have a party now
5 neutral they are driving to downtown
I tried this and this, but those remove either the category or the text.
To remove duplicates on specific column(s), use subset in drop_duplicates:
df.drop_duplicates(subset=['index', 'category', 'text'], keep='first')
Try concat + sort_values:
res = pd.concat((new_df, old_df)).drop_duplicates()
res = res.sort_values(by=['category'], key=lambda x: x.map({'spam' : 0, 'not_spam' : 1, 'neutral': 2}))
print(res)
Output
index category text
0 0 spam you win much money
1 14 spam London is the capital of Canada
1 1 spam you are the winner of the game
2 15 not_spam no more raining in winter
3 25 not_spam the soccer game plays on HBO
2 2 not_spam the weather in Chicago is nice
3 3 not_spam pizza is an Italian food
4 31 neutral construction will be done
4 4 neutral we have a party now
5 5 neutral they are driving to downtown
Your code seems right; try adding this to the concat result and it will remove your duplicates:
# These first lines create a new 'index' column and help the rest of the code work correctly
df_new = df_new.reset_index()
df_old = df_old.reset_index()
df_concat = (pd.concat([df_new, df_old], sort=False).sort_values('category', ascending=False, kind='mergesort'))
df_concat = df_concat.drop_duplicates()
If you want to reindex the result (without, of course, changing the 'index' column), you can do:
df_concat = df_concat.drop_duplicates(ignore_index=True)
You can always use combine_first:
out = df_new.combine_first(df_old)
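A minimal sketch of how that plays out on the example, assuming the 'index' column is actually the DataFrame index (e.g. the CSVs were read with index_col=0); combine_first takes the union of the two indexes, preferring df_new's rows wherever an index value appears in both frames:
out = df_new.combine_first(df_old)
out = out.sort_values('category', ascending=False, kind='mergesort')
print(out)  # 10 unique rows across spam / not_spam / neutral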

Can I have a two line caption in pandas dataframe?

Can I have a two line caption in pandas dataframe?
Create dataframe with:
df = pd.DataFrame({'Name' : ['John','Harry','Gary','Richard','Anna','Richard','Gary','Richard'], 'Age' : [25,32,37,43,44,56,37,22], 'Zone' : ['East','West','North','South','East','West','North', 'South']})
df=df.drop_duplicates('Name',keep='first')
df.style.set_caption("Team Members Per Zone")
which outputs:
Team Members Per Zone
Name Age Zone
0 John 25 East
1 Harry 32 West
4 Anna 44 East
6 Gary 37 North
7 Richard 22 South
However I'd like it to look like:
Team Members
Per Zone
Name Age Zone
0 John 25 East
1 Harry 32 West
4 Anna 44 East
6 Gary 37 North
7 Richard 22 South
Using a break works for me in JupyterLab:
df.style.set_caption('This is line one <br> This is line two')
Have you tried with \n? (Sorry, too low reputation to just comment.)
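Applied to the caption in the question, that would look like the following minimal sketch; the <br> only renders as a line break where the Styler's HTML output is displayed (e.g. in a notebook), not in a plain-text print:
df.style.set_caption("Team Members<br>Per Zone")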

Pandas - groupby where each row has multiple values stored in list

I'm working with last.fm listening data and have a DataFrame that looks like this:
Artist Plays Genres
0 John Coltrane 10 [jazz, modal jazz, hard bop]
1 Miles Davis 15 [jazz, cool jazz, modal jazz, hard bop]
2 Charlie Parker 20 [jazz, bebop]
I want to group the data by the genres and then aggregate by the sum of plays for each genre, to get something like this:
Genre Plays
0 jazz 45
1 modal jazz 25
2 hard bop 25
3 bebop 20
4 cool jazz 15
Been trying to figure this out for a while now but can't seem to find the solution. Do I need to change the way that the genre data is stored?
I was able to find this post which addresses a similar question, but that user was only looking to get the count of each list value. This gets me about halfway there, but I couldn't figure out how to use that to aggregate another column in the dataframe.
In general, you should not store lists in a DataFrame, so yes, it's probably best to change how they are stored. With the current format, though, you can use join + str.get_dummies + multiply. Choose a sep that doesn't appear in any of your strings.
sep = '*'
df.Genres.apply(sep.join).str.get_dummies(sep=sep).multiply(df.Plays, axis=0).sum()
Output
bebop 20
cool jazz 15
hard bop 25
jazz 45
modal jazz 25
dtype: int64
An easier form to work with would be if your lists were split across lines as in:
import pandas as pd
df1 = pd.concat([pd.DataFrame(df.Genres.values.tolist()).stack().reset_index(1, drop=True).to_frame('Genres'),
                 df[['Plays', 'Artist']]], axis=1)
Genres Plays Artist
0 jazz 10 John Coltrane
0 modal jazz 10 John Coltrane
0 hard bop 10 John Coltrane
1 jazz 15 Miles Davis
1 cool jazz 15 Miles Davis
1 modal jazz 15 Miles Davis
1 hard bop 15 Miles Davis
2 jazz 20 Charlie Parker
2 bebop 20 Charlie Parker
Making it a simple sum within genres:
df1.groupby('Genres').Plays.sum()
Genres
bebop 20
cool jazz 15
hard bop 25
jazz 45
modal jazz 25
Name: Plays, dtype: int64
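For completeness, on pandas 0.25+ DataFrame.explode gives a more direct route to the same per-genre sums; this is a variation of mine, not part of the original answers:
out = (
    df.explode('Genres')
      .groupby('Genres')['Plays'].sum()
      .sort_values(ascending=False)
)
print(out)
# jazz 45, hard bop 25, modal jazz 25, bebop 20, cool jazz 15 (ties may appear in either order)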

Pandas : Adding new rows to existing dataframe keeping the same distribution across all columns

I am working with a pandas DataFrame. I want to increase the size of my DataFrame, say from 1000 rows to 4432 (i.e. not an exact multiple n of the original size, for n a natural number). I want to make sure that the distribution of values in each column remains the same after increasing the size.
For example, if I have a column named Car with the following distribution in the existing 100 rows:
Maruti 30%
Ford 10%
Tata 40%
Others 10%
I would like to keep these shares the same after increasing the size to 4432.
The columns could be ranges, numeric, or categorical.
One more example would be Age, with a distribution like:
20-30 20%
30-40 40%
40-50 25%
50-60 15%
Again, I would like to keep this distribution the same while increasing the size of the DataFrame.
The following function rounds the target number of rows per unique value, so the distribution stays closer to the desired one than if you simply duplicated the whole DataFrame. In the example below, for multiplier 1.5 the distribution can actually be preserved exactly, even though a simple concat of copies won't give you 1.5x of the original DataFrame.
def increase_df(df, column, multiplier):
    # Round the target count for each unique value to the nearest integer
    new_value_counts = (df[column].value_counts() * multiplier).apply(lambda value: int(round(value)))
    # Repeat each value the rounded number of times
    values = sum(([value] * count for value, count in new_value_counts.to_dict().items()), [])
    return pd.DataFrame(values)
df = pd.DataFrame(["Mumbai"] * 4 + ["Kolkata"] * 2 + ["Chennai"] * 2 + ["Delhi"] * 4, columns=['city'])
print(df)
city
0 Mumbai
1 Mumbai
2 Mumbai
3 Mumbai
4 Kolkata
5 Kolkata
6 Chennai
7 Chennai
8 Delhi
9 Delhi
10 Delhi
11 Delhi
# here the distribution can be preserved exactly
print(increase_df(df, 'city', 1.5))
0
0 Kolkata
1 Kolkata
2 Kolkata
3 Chennai
4 Chennai
5 Chennai
6 Delhi
7 Delhi
8 Delhi
9 Delhi
10 Delhi
11 Delhi
12 Mumbai
13 Mumbai
14 Mumbai
15 Mumbai
16 Mumbai
17 Mumbai
# here it can't, because the target number of rows per value is fractional.
# The function rounds that number to the nearest int, so the distribution is as close to the original one as it can get.
print(increase_df(df, 'city', 1.8))
0
0 Kolkata
1 Kolkata
2 Kolkata
3 Kolkata
4 Chennai
5 Chennai
6 Chennai
7 Chennai
8 Delhi
9 Delhi
10 Delhi
11 Delhi
12 Delhi
13 Delhi
14 Delhi
15 Mumbai
16 Mumbai
17 Mumbai
18 Mumbai
19 Mumbai
20 Mumbai
21 Mumbai
A trivial way would be to duplicate all the rows a certain number of times to reach the required number of observations.
Let's say you have a dataframe df and you want num_reqd observations. All rows duplicated (num_reqd//df.shape[0]) times should give you a little under num_reqd observations.
import pandas as pd
new_df = pd.concat([df] * (num_reqd // df.shape[0]), axis=0)
But if you wanted to mix the data up a bit further, you can use numpy to shuffle the values within each column independently.
import numpy as np
new_df = new_df.apply(np.random.permutation)
You can concat the rows from df if you want to keep the original observations too.
new_df = pd.concat([df, new_df], axis=0)
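If you need to keep all columns while preserving the per-value shares of one key column, a stratified resample is another option. The helper below is a rough sketch of my own (not from the answers above), assuming sampling with replacement is acceptable:
import pandas as pd

def upsample_preserving_distribution(df, column, target_rows, seed=0):
    # Scale each group's size by the same factor, rounding per group,
    # and sample rows (with replacement) from within that group.
    frac = target_rows / len(df)
    return (
        df.groupby(column, group_keys=False)
          .apply(lambda g: g.sample(n=int(round(len(g) * frac)),
                                    replace=True, random_state=seed))
          .reset_index(drop=True)
    )
Because the rounding happens per group, the result can be a few rows off the exact target, but each value keeps roughly its original share.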
