Split dataframe column with second column as delimiter - python

I want to split a column into two columns using the value of a second column from the same row as the split delimiter.
I'm receiving the error TypeError: 'Series' objects are mutable, thus they cannot be hashed, which makes sense: str.split is receiving a Series, not a single value. But I'm unsure how to isolate the single row value of that second column.
Sample data:
                     title_location delimiter
0    Doctor - ABC - Los Angeles, CA   - ABC -
1        Lawyer - ABC - Atlanta, GA   - ABC -
2  Athlete - XYZ - Jacksonville, FL   - XYZ -
Code:
bigdata[['title', 'location']] = bigdata['title_location'].str.split(bigdata['delimiter'], expand=True)
Desired output:
                     title_location delimiter    title          location
0    Doctor - ABC - Los Angeles, CA   - ABC -   Doctor   Los Angeles, CA
1        Lawyer - ABC - Atlanta, GA   - ABC -   Lawyer       Atlanta, GA
2  Athlete - XYZ - Jacksonville, FL   - XYZ -  Athlete  Jacksonville, FL

Try apply:
bigdata[['title', 'location']] = bigdata.apply(
    lambda row: row['title_location'].split(row['delimiter']),
    axis=1, result_type='expand')
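For reference, a self-contained sketch of this approach; the sample frame below assumes the delimiter values include their surrounding spaces, so the split yields clean title and location parts:

import pandas as pd

bigdata = pd.DataFrame({
    'title_location': ['Doctor - ABC - Los Angeles, CA',
                       'Lawyer - ABC - Atlanta, GA',
                       'Athlete - XYZ - Jacksonville, FL'],
    'delimiter': [' - ABC - ', ' - ABC - ', ' - XYZ - '],  # assumed to include the spaces
})

# Split each row's string on that row's own delimiter value
bigdata[['title', 'location']] = bigdata.apply(
    lambda row: row['title_location'].split(row['delimiter']),
    axis=1, result_type='expand')

print(bigdata[['title', 'location']])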

Let us try zip, then join back:
df = df.join(pd.DataFrame([x.split(y) for x, y in zip(df.title_location, df.delimiter)],
                          index=df.index, columns=['Title', 'Location']))
df
Out[200]:
                     title_location delimiter    Title          Location
0    Doctor - ABC - Los Angeles, CA   - ABC -   Doctor   Los Angeles, CA
1        Lawyer - ABC - Atlanta, GA   - ABC -   Lawyer       Atlanta, GA
2  Athlete - XYZ - Jacksonville, FL   - XYZ -  Athlete  Jacksonville, FL

Related

Searching one pandas DataFrame column based on search criteria made up of the content of two columns of another df, separated by a wildcard

I want to look up the value of the lookup_table Code column based on the combination of two different columns from the data table joined by a wildcard. See the example below:
Data:
  VMType     Location
0   DSv3  East Europe
1   ESv3      East US
2   ESv3    East Asia
3   DSv4   Central US
4    Ca2   Central US
lookup_table:
                                 Type    Code
0  Dv3/DSv3 - Gen Purpose East Europe  abc123
1        Dv3/D1 - Gen Purpose West US  abc321
2  Dav4/DSv4 - Gen Purpose Central US  bbb321
3       Eav3/ESv3 - Hi Tech East Asia  def321
4         Eav3/ESv3 - Hi Tech East US  xcd321
5       Csv2/Ca2 - Hi Tech Central US  xcc321
I want to do something like
data['new_column'] = lookup_table['Code'] where lookup_table['Type'] == data['VMType'] + '*' + data['Location']
or, removing the wildcard, it could be evaluated as:
data['new_column'] = lookup_table['Code'] where lookup_table['Type'] contains data['VMType'] AND lookup_table['Type'] contains data['Location']
Resulting in:
Data:
  VMType     Location new_column
0   DSv3  East Europe     abc123
1   ESv3      East US     xcd321
2   ESv3    East Asia     def321
3   DSv4   Central US     bbb321
4    Ca2   Central US     xcc321
Ideally this can be done without iterating through the df.
First, derive VMType and Location columns from the lookup_table's Type column (the token after '/' is the VMType, the last two words are the Location), then merge with your data dataframe:
lookup_table['VMType'] = lookup_table['Type'].str.split(' - ').str[0].str.split('/').str[-1]
lookup_table['Location'] = lookup_table['Type'].str.split().str[-2:].str.join(' ')
lookup_table = lookup_table[['VMType', 'Location', 'Code']]
data.merge(lookup_table, how='left')
Output:
  VMType     Location    Code
0   DSv3  East Europe  abc123
1   ESv3      East US  xcd321
2   ESv3    East Asia  def321
3   DSv4   Central US  bbb321
4    Ca2   Central US  xcc321
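A self-contained sketch of this answer, assuming the question's frames as shown; the key-extraction rules are an assumption about the Type format:

import pandas as pd

data = pd.DataFrame({
    'VMType': ['DSv3', 'ESv3', 'ESv3', 'DSv4', 'Ca2'],
    'Location': ['East Europe', 'East US', 'East Asia', 'Central US', 'Central US'],
})
lookup_table = pd.DataFrame({
    'Type': ['Dv3/DSv3 - Gen Purpose East Europe',
             'Dv3/D1 - Gen Purpose West US',
             'Dav4/DSv4 - Gen Purpose Central US',
             'Eav3/ESv3 - Hi Tech East Asia',
             'Eav3/ESv3 - Hi Tech East US',
             'Csv2/Ca2 - Hi Tech Central US'],
    'Code': ['abc123', 'abc321', 'bbb321', 'def321', 'xcd321', 'xcc321'],
})

# Derive the merge keys from the free-text Type column
lookup_table['VMType'] = lookup_table['Type'].str.split(' - ').str[0].str.split('/').str[-1]
lookup_table['Location'] = lookup_table['Type'].str.split().str[-2:].str.join(' ')

out = data.merge(lookup_table[['VMType', 'Location', 'Code']], how='left')
print(out.rename(columns={'Code': 'new_column'}))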

How to search in row and add those row values in python

                      Percentage
NaN                     1.576020
Redmond                 4.264524
England                 4.975278
England - Street XY     5.346106
Denmark Street x        7.601978
England – Street wy    11.773795
England – Street AU    13.936959
Redmond street COX     50.525340
Baharin                 0
I need to create another dataframe which sums the Percentage of all rows starting with Redmond, all rows starting with England followed by a street name, all rows starting with England only, and so on.
How do I do this in Python?
In the above case the output should be:
                     Percentage
NaN                    1.576020
Redmond               50.525340
England                4.975278
England with street   11.773795
Denmark                7.60
Baharin                0
One way to do this:
df = df.reset_index()                               # the labels become a column named 'index'
m = df['index'].astype(str).str.contains('Street')  # rows whose label mentions a street
street_df = df.loc[m]
# group the street rows by the first word of their label and sum the percentages
street_df = (street_df.groupby(street_df['index'].str.split(' ').str[0])
                      .agg({'Percentage': 'sum'})
                      .reset_index())
street_df['index'] = street_df['index'] + ' with street'
result = pd.concat([df[~m], street_df])
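A runnable sketch of this approach against the sample above. One assumption worth flagging: the sample mixes 'Street' and 'street', so the match here is case-insensitive, unlike the case-sensitive contains('Street') in the answer:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'Percentage': [1.576020, 4.264524, 4.975278, 5.346106, 7.601978,
                    11.773795, 13.936959, 50.525340, 0]},
    index=[np.nan, 'Redmond', 'England', 'England - Street XY',
           'Denmark Street x', 'England – Street wy',
           'England – Street AU', 'Redmond street COX', 'Baharin'])

df = df.reset_index()
m = df['index'].astype(str).str.contains('street', case=False)  # assumption: match either case
street_df = df.loc[m]
street_df = (street_df.groupby(street_df['index'].str.split(' ').str[0])
                      .agg({'Percentage': 'sum'})
                      .reset_index())
street_df['index'] = street_df['index'] + ' with street'
result = pd.concat([df[~m], street_df])
print(result)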

How to get a certain value from a column and add it as a new column in Python / pandas?

I have a dataframe with information data1 and would like to add a column data2 containing only the names from data1:
                                           data1               data2
0       info name: Michael Jackson New York       Michael Jackson
1  info 12 name: Michael Jordan III Los Angeles   Michael Jordan III
Do you know how I can do this?
Without an unambiguous delimiter this isn't trivial, since you have spaces within names, names of varying length (2 words, 3 words), and a trailing city which may also contain spaces.
Splitting the string you can achieve this partial solution:
df['data2'] = df['data1'].str.split(': ').str[-1]
>>> print(df)
                                           data1                           data2
0       info name: Michael Jackson New York            Michael Jackson New York
1  info 12 name: Michael Jordan III Los Angeles  Michael Jordan III Los Angeles
If you had a list of 'cities' you might be able to accomplish the complete solution:
import re

def replace(string, substitutions):
    """Replaces multiple substrings in a string."""
    substrings = sorted(substitutions, key=len, reverse=True)
    regex = re.compile('|'.join(map(re.escape, substrings)))
    return regex.sub(lambda match: substitutions[match.group(0)], string)

# List of cities to remove from strings
cities = ['New York', 'Los Angeles']
# Dictionary matching each city with the empty string
substitutions = {city: '' for city in cities}
# Splitting to create the new column as above
df['data2'] = df['data1'].str.split(': ').str[-1]
# Applying replacements to the new column
df['data2'] = df['data2'].map(lambda x: replace(x, substitutions).strip())
>>> print(df)
                                           data1               data2
0       info name: Michael Jackson New York       Michael Jackson
1  info 12 name: Michael Jordan III Los Angeles   Michael Jordan III
Credit to carlsmith for the replace function.
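For clarity, a quick illustration of the helper on a single string (the values here are just for demonstration):

subs = {'New York': '', 'Los Angeles': ''}
print(replace('Michael Jordan III Los Angeles', subs).strip())
# -> 'Michael Jordan III'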

Merge 2 columns in python

I need to do the same as my existing merge, df_g['Bidfloor'] = df_g[['Sitio', 'Country']].merge(df_seg, how='left').Precio, but matching on only the first two characters (the country code) of the Country column rather than the whole value, because the country names are written in different languages across the two dataframes.
df_g:
Sitio,Country
Los Andes Online,HN - Honduras
Guarda14,US - Estados Unidos
Guarda14,PE - Peru
df_seg:
Sitio,Country,Precio
Los Andes Online,HN - Honduras,0.5
Guarda14,US - United States,2.1
What I need:
Sitio,Country,Bidfloor
Los Andes Online,HN - Honduras,0.5
Guarda14,US - United States,2.1
Guarda14,PE - Peru,NULL
You need an additional key to help the merge; here cumcount distinguishes the repeated values:
(df1.assign(key=df1.groupby('Sitio').cumcount())
    .merge(df2.assign(key=df2.groupby('Sitio').cumcount())
              .drop(columns='Country'),
           how='left',
           on=['Sitio', 'key']))
Out[1491]:
              Sitio              Country  key  Precio
0  Los Andes Online        HN - Honduras    0     0.5
1          Guarda14  US - Estados Unidos    0     2.1
2          Guarda14            PE - Peru    1     NaN
Just add and then drop a merge column and you are done:
df_seg['merge_col'] = df_seg.Country.apply(lambda x: x.split('-')[0])
df_g['merge_col'] = df_g.Country.apply(lambda x: x.split('-')[0])
then do:
df = pd.merge(df_g, df_seg[['merge_col', 'Precio']],
              on='merge_col', how='left').drop(columns='merge_col')
returns
              Sitio              Country  Precio
0  Los Andes Online        HN - Honduras     0.5
1          Guarda14  US - Estados Unidos     2.1
2          Guarda14            PE - Peru     NaN
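Equivalently, the key can be built with the vectorized .str accessor, taking just the two-letter code before the dash as the question asks (a sketch assuming that code uniquely identifies a row in df_seg):

df_g['merge_col'] = df_g['Country'].str[:2]
df_seg['merge_col'] = df_seg['Country'].str[:2]
df = (pd.merge(df_g, df_seg[['merge_col', 'Precio']], on='merge_col', how='left')
        .drop(columns='merge_col'))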

Convert dataframe to dictionary of list of tuples

I have a dataframe that looks like the following
                                       user                             item  rating
0  b80344d063b5ccb3212f76538f3d9e43d87dca9e          The Cove - Jack Johnson       1
1  b80344d063b5ccb3212f76538f3d9e43d87dca9e  Entre Dos Aguas - Paco De Lucia       2
2  b80344d063b5ccb3212f76538f3d9e43d87dca9e            Stronger - Kanye West       1
3  b80344d063b5ccb3212f76538f3d9e43d87dca9e    Constellations - Jack Johnson       1
4  b80344d063b5ccb3212f76538f3d9e43d87dca9e      Learn To Fly - Foo Fighters       1
and would like to achieve the following structure:
dict -> list of tuples
user -> (item, rating)
b80344d063b5ccb3212f76538f3d9e43d87dca9e -> [('The Cove - Jack Johnson', 1), ..., ]
I can do:
item_set = dict((user, set(items)) for user, items in
                data.groupby('user')['item'])
But that only gets me halfway. How do I get the corresponding "rating" value from the groupby?
Set user as the index, convert each row to a tuple using df.apply, group by the index using df.groupby(level=0), aggregate into a list using agg, and convert to a dictionary using to_dict:
In [1417]: df
Out[1417]:
                                       user                             item  rating
0  b80344d063b5ccb3212f76538f3d9e43d87dca9e          The Cove - Jack Johnson       1
1  b80344d063b5ccb3212f76538f3d9e43d87dca9e  Entre Dos Aguas - Paco De Lucia       2
2  b80344d063b5ccb3212f76538f3d9e43d87dca9e            Stronger - Kanye West       2
3  b80344d063b5ccb3212f76538f3d9e43d87dca9e    Constellations - Jack Johnson       2
4  b80344d063b5ccb3212f76538f3d9e43d87dca9e      Learn To Fly - Foo Fighters       2
In [1418]: df.set_index('user').apply(tuple, axis=1)\
               .groupby(level=0).agg(lambda x: list(x.values))\
               .to_dict()
Out[1418]:
{'b80344d063b5ccb3212f76538f3d9e43d87dca9e': [('The Cove - Jack Johnson', 1),
  ('Entre Dos Aguas - Paco De Lucia', 2),
  ('Stronger - Kanye West', 2),
  ('Constellations - Jack Johnson', 2),
  ('Learn To Fly - Foo Fighters', 2)]}
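An equivalent sketch using zip inside the groupby, which builds the same mapping a bit more directly:

result = {user: list(zip(g['item'], g['rating']))
          for user, g in df.groupby('user')}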
