Issues converting wide to long on multiple columns [duplicate] - python

I've got a dataframe df in Pandas that looks like this:
stores product discount
Westminster 102141 T
Westminster 102142 F
City of London 102141 T
City of London 102142 F
City of London 102143 T
And I'd like to end up with a dataset that looks like this:
stores product_1 discount_1 product_2 discount_2 product_3 discount_3
Westminster 102141 T 102142 F
City of London 102141 T 102142 F 102143 T
How do I do this in pandas?
I think this is some kind of pivot on the stores column, but with multiple columns. Or perhaps it's an "unmelt" rather than a "pivot"?
I tried:
df.pivot("stores", ["product", "discount"], ["product", "discount"])
But I get TypeError: MultiIndex.name must be a hashable type.

Use DataFrame.unstack to reshape. You first need a per-store counter, created with GroupBy.cumcount; then sort the columns by the second level and flatten the column MultiIndex with map:
df = (df.set_index(['stores', df.groupby('stores').cumcount().add(1)])
        .unstack()
        .sort_index(axis=1, level=1))
df.columns = df.columns.map('{0[0]}_{0[1]}'.format)
df = df.reset_index()
print(df)
stores discount_1 product_1 discount_2 product_2 discount_3 \
0 City of London T 102141.0 F 102142.0 T
1 Westminster T 102141.0 F 102142.0 NaN
product_3
0 102143.0
1 NaN
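For readers who want to run the approach above end to end, here is a minimal self-contained sketch. The sample frame is rebuilt by hand from the question, and the names counter and wide are only illustrative, not part of the original answer.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "stores": ["Westminster", "Westminster",
               "City of London", "City of London", "City of London"],
    "product": [102141, 102142, 102141, 102142, 102143],
    "discount": ["T", "F", "T", "F", "T"],
})

# Number the rows within each store: 1, 2, 3, ...
counter = df.groupby("stores").cumcount().add(1)

wide = (df.set_index(["stores", counter])
          .unstack()                      # the counter becomes the inner column level
          .sort_index(axis=1, level=1))   # keep columns with the same counter together

# Flatten the (name, counter) MultiIndex into "name_counter" labels
wide.columns = [f"{name}_{n}" for name, n in wide.columns]
wide = wide.reset_index()
print(wide)
```

Stores with fewer rows than the largest group get NaN in the extra columns, which is why the product columns are shown as floats.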

Related

Replace column value in one Panda Dataframe with column in another Panda Dataframe with conditions

I have the following 3 pandas DataFrames. I want to replace the values in the company and division columns with the IDs from the respective company and division DataFrames.
pd_staff:
id name company division
P001 John Sunrise Headquarter
P002 Jane Falcon Digital Research & Development
P003 Joe Ashford Finance
P004 Adam Falcon Digital Sales
P004 Barbara Sunrise Human Resource
pd_company:
id name
1 Sunrise
2 Falcon Digital
3 Ashford
pd_division:
id name
1 Headquarter
2 Research & Development
3 Finance
4 Sales
5 Human Resource
This is the end result that I am trying to produce
id name company division
P001 John 1 1
P002 Jane 2 2
P003 Joe 3 3
P004 Adam 2 4
P004 Barbara 1 5
I have tried to combine Staff and Company using this code
pd_staff.loc[pd_staff['company'].isin(pd_company['name']), 'company'] = pd_company.loc[pd_company['name'].isin(pd_staff['company']), 'id']
which produces
id name company
P001 John 1.0
P002 Jane NaN
P003 Joe NaN
P004 Adam NaN
P004 Barbara NaN
You can do:
pd_staff['company'] = pd_staff['company'].map(pd_company.set_index('name')['id'])
pd_staff['division'] = pd_staff['division'].map(pd_division.set_index('name')['id'])
print(pd_staff)
id name company division
0 P001 John 1 1
1 P002 Jane 2 2
2 P003 Joe 3 3
3 P004 Adam 2 4
4 P004 Barbara 1 5
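As a small runnable illustration of the map approach, the sketch below uses a cut-down copy of the data. The company "Unknown Ltd" is made up here to show what happens to a name that has no entry in the lookup table, and lookup is just an illustrative variable name.

```python
import pandas as pd

pd_staff = pd.DataFrame({
    "id": ["P001", "P002", "P003"],
    "name": ["John", "Jane", "Joe"],
    # "Unknown Ltd" is a made-up value with no match in pd_company
    "company": ["Sunrise", "Falcon Digital", "Unknown Ltd"],
})
pd_company = pd.DataFrame({"id": [1, 2, 3],
                           "name": ["Sunrise", "Falcon Digital", "Ashford"]})

# Series indexed by company name whose values are the company ids
lookup = pd_company.set_index("name")["id"]

# map() looks each name up in that index; names that are not found become NaN
pd_staff["company"] = pd_staff["company"].map(lookup)
print(pd_staff)
```

If every name has a match, the column keeps an integer dtype; a single miss turns it into floats because of the NaN.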
This will achieve the desired result (here df, df2 and df3 stand for pd_staff, pd_company and pd_division):
df_merge = df.merge(df2, how = 'inner', right_on = 'name', left_on = 'company', suffixes=('', '_y'))
df_merge = df_merge.merge(df3, how = 'inner', left_on = 'division', right_on = 'name', suffixes=('', '_z'))
df_merge = df_merge[['id', 'name', 'id_y', 'id_z']]
df_merge.columns = ['id', 'name', 'company', 'division']
df_merge.sort_values('id')
First, let's rename the name column in the company and division dataframes:
df2.rename(columns={'name':'company'},inplace=True)
df3.rename(columns={'name':'division'},inplace=True)
Then
df1=df1.merge(df2,on='company',how='left').merge(df3,on='division',how='left')
df1=df1[['id_x','name','id_y','id']]
df1.rename(columns={'id_x':'id','id_y':'company','id':'division'},inplace=True)
Use apply with a function that replaces the values. The value to look up and its replacement would come from the second Excel file. Here I am replacing Sunrise with 1, because that is its ID in the second file.
import pandas as pd
df = pd.read_excel('teste.xlsx')
df2 = pd.read_excel('ids.xlsx')
def altera(df33, field='Sunrise', new_field='1'):  # defaults are for demonstration only; in practice pass them from the second Excel file
    return df33.replace(field, new_field)
df.loc[:, 'company'] = df['company'].apply(altera)

How to search a substring from one df in another df?

I have read this post and would like to do something similar.
I have 2 dfs:
df1:
file_num city address_line
1 Toronto 123 Fake St
2 Montreal 456 Sample Ave
df2:
DB_Num Address
AB1 Toronto 123 Fake St
AB3 789 Random Drive, Toronto
I want to know which DB_Num in df2 matches the address_line and city in df1, and which file_num the match came from.
My ideal output is:
file_num city address_line DB_Num Address
1 Toronto 123 Fake St AB1 Toronto 123 Fake St
Based on the above linked post, I have made a look ahead regex, and am searching using the insert and str.extract method.
df1['search_field'] = "(?=.*" + df1['city'] + ")(?=.*" + df1['address_line'] + ")"
pat = "|".join(df1['search_field'])
df = df2.insert(0, 'search_field', df2['Address'].str.extract("(" + pat + ')', expand=False))
Since my address in df2 is entered manually, it is sometimes out of order.
Because it is out of order, I am using the look ahead method of regex.
The look ahead method is causing str.extract to not output any value. Although I can still filter out nulls and it will keep only the correct matches.
My main problem is I have no way to join back to df1 to get the file_num.
I can do this problem with a for loop and iterating each record to search, but it takes too long. df1 is actually around 5000 records, and df2 has millions, so it takes over 2 hours to run. Is there a way to leverage vectorization for this problem?
Thanks!
Start by creating a new Series that holds, for each "Address" in df2, the matching "address_line" from df1, if one exists:
r = '({})'.format('|'.join(df1.address_line))
merge_df = df2.Address.str.extract(r, expand=False)
merge_df
#output:
0 123 Fake St
1 NaN
Name: Address, dtype: object
Now we merge our df1 on the "address_line" column, and our df2 on our "merge_df" series:
df1.merge(df2, left_on='address_line', right_on=merge_df)
index file_num City address_line DB_num Address
0 1.0 Toronto 123 Fake St AB1 Toronto 123 Fake St
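A self-contained variant of the extract-then-merge idea is sketched below. It goes beyond the answer above in two ways, both of which are assumptions on my part: each address is passed through re.escape so that characters such as "." or "(" are matched literally, and a final filter checks that the city also appears somewhere in Address, since the question says the parts may be out of order. The names pattern, key and matched are illustrative.

```python
import re
import pandas as pd

df1 = pd.DataFrame({"file_num": [1, 2],
                    "city": ["Toronto", "Montreal"],
                    "address_line": ["123 Fake St", "456 Sample Ave"]})
df2 = pd.DataFrame({"DB_Num": ["AB1", "AB3"],
                    "Address": ["Toronto 123 Fake St", "789 Random Drive, Toronto"]})

# One alternation of all address lines, each escaped to be matched literally
pattern = "(" + "|".join(re.escape(a) for a in df1["address_line"]) + ")"

# For every df2 row: the df1 address_line found inside Address, or NaN
key = df2["Address"].str.extract(pattern, expand=False)

# Attach the key to df2, drop rows without a match, and join back to df1
matched = df1.merge(df2.assign(address_line=key).dropna(subset=["address_line"]),
                    on="address_line")

# Keep only rows whose Address also mentions the city, in any position
has_city = [city in addr for city, addr in zip(matched["city"], matched["Address"])]
matched = matched.loc[has_city]
print(matched)
```

The regex pass over df2 is vectorized; only the final city check loops in Python, and it runs over the matched rows rather than over all of df2.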

Using df1 as a lookup table for df2, df2 has more unique values than df1 in Python

I have a df that maps US citizens to their state, and I would like to use it as a lookup for a df of world citizens.
df1=
[Sam, New York;
Nick, California;
Sarah, Texas]
df2 =
[Sam;
Phillip;
Will;
Sam]
I would like to either df2.replace() with the states or create df3 where my output is:
[New York;
NaN;
NaN;
New York]
I have tried mapping with set_index and dict(zip()) but have had no luck so far.
Thank you.
How about this method:
import pandas as pd
df1 = pd.DataFrame([['Sam', 'New York'], ['Nick', 'California'], ['Sarah', 'Texas']],
                   columns=['name', 'state'])
display(df1)
df2 = pd.DataFrame(['Sam', 'Phillip', 'Will', 'Sam'],
                   columns=['name'])
display(df2)
df2.merge(right=df1, left_on='name', right_on='name', how='left')
resulting in
name state
0 Sam New York
1 Nick California
2 Sarah Texas
name
0 Sam
1 Phillip
2 Will
3 Sam
name state
0 Sam New York
1 Phillip NaN
2 Will NaN
3 Sam New York
You can then select just the state column from the merged dataframe.
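If only the state values are needed, Series.map with a lookup Series is a possible alternative to the merge. This is a sketch that assumes each name appears only once in df1; the variable states is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["Sam", "Nick", "Sarah"],
                    "state": ["New York", "California", "Texas"]})
df2 = pd.DataFrame({"name": ["Sam", "Phillip", "Will", "Sam"]})

# Lookup Series: index is the name, value is the state
# (assumes names in df1 are unique)
states = df2["name"].map(df1.set_index("name")["state"])
print(states)
```

Names that are missing from df1, such as Phillip and Will, come back as NaN, which matches the output requested in the question.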

Merge two data frames and retain unique columns

I have these two data frames:
1st df
#df1 -----
location Ethnic Origins Percent(1)
0 Beaches-East York English 18.9
1 Davenport Portuguese 22.7
2 Eglinton-Lawrence Polish 12.0
2nd df
#df2 -----
location lat lng
0 Beaches—East York, Old Toronto, Toronto, Golde... 43.681470 -79.306021
1 Davenport, Old Toronto, Toronto, Golden Horses... 43.671561 -79.448293
2 Eglinton—Lawrence, North York, Toronto, Golden... 43.719265 -79.429765
Expected Output:
I want to use the location column of #df1 as it is cleaner and retain all other columns. I don't need the city, country info on the location column.
location Ethnic Origins Percent(1) lat lng
0 Beaches-East York English 18.9 43.681470 -79.306021
1 Davenport Portuguese 22.7 43.671561 -79.448293
2 Eglinton-Lawrence Polish 12.0 43.719265 -79.429765
I have tried several ways to merge them but to no avail.
This returns a NaN for all lat and long rows
df3 = pd.merge(df1, df2, on="location", how="left")
This returns a NaN for all Ethnic and Percent rows
df3 = pd.merge(df1, df2, on="location", how="right")
As others have noted, the problem is that the 'location' columns do not share any values. One solution to this is to use a regular expression to get rid of everything starting with the first comma and extending to the end of the string:
df2.location = df2.location.replace(r',.*', '', regex=True)
Using the exact data you provide, this still won't work, because the two data frames use different kinds of dashes. You can solve this in a similar way (no regex needed this time):
df2.location = df2.location.str.replace('—', '-', regex=False)
And then merge as you suggested
df3 = pd.merge(df1, df2, on="location", how="left")
We can use str.findall to create the key:
df2['location']=df2.location.str.findall('|'.join(df1.location)).str[0]
df3 = pd.merge(df1, df2, on="location", how="left")
I'm guessing the problem you're having is that the column you're trying to merge on is not the same, i.e. it doesn't find the corresponding values in df2.location to merge to df1. Try changing those first and it should work:
df2["location"] = df2["location"].apply(lambda x: x.split(",")[0])
df3 = pd.merge(df1, df2, on="location", how="left")
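Putting the clean-up steps together, a minimal runnable sketch might look like the following. The frames are shortened copies of the question's data, with the truncated location strings cut off after "Toronto" and the column name Percent simplified, so treat the values as illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"location": ["Beaches-East York", "Davenport"],
                    "Percent": [18.9, 22.7]})
df2 = pd.DataFrame({"location": ["Beaches—East York, Old Toronto, Toronto",
                                 "Davenport, Old Toronto, Toronto"],
                    "lat": [43.681470, 43.671561],
                    "lng": [-79.306021, -79.448293]})

# Normalise the key so it matches df1: drop everything after the first comma,
# turn the long dash into a plain hyphen, and trim stray whitespace
df2["location"] = (df2["location"]
                   .str.split(",").str[0]
                   .str.replace("—", "-", regex=False)
                   .str.strip())

df3 = pd.merge(df1, df2, on="location", how="left")
print(df3)
```

Note that the dash is replaced with the .str accessor: Series.replace without regex=True only swaps cells whose entire value equals the search string, so it would leave these locations unchanged.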

Manipulations with Lat-Lon and Time Series Pandas

I am trying to do some file merging with Latitude and Longitude.
Input File1.csv
Name,Lat,Lon,timeseries(n)
London,80.5234,121.0452,523
London,80.5234,121.0452,732
London,80.5234,121.0452,848
Paris,90.4414,130.0252,464
Paris,90.4414,130.0252,829
Paris,90.4414,130.0252,98
New York,110.5324,90.0023,572
New York,110.5324,90.0023,689
New York,110.5324,90.0023,794
File2.csv
Name,lat,lon,timeseries1
London,80.5234,121.0452,500
Paris,90.4414,130.0252,400
New York,110.5324,90.0023,700
Now Expected output is
File2.csv
Name,lat,lon,timeseries1,timeseries(n) #timeseries is 24 hrs format 17:45:00
London,80.5234,121.0452,500,2103 #Addition of all three values
Paris,90.4414,130.0252,400,1391
New York,110.5324,90.0023,700,2055
With plain Python, numpy and dictionaries this would be straightforward (key = sum of values), but I want to use Pandas.
Please suggest how to get started, or point me to an example. I have not seen anything like dictionary types in Pandas used with latitude and longitude.
Perform a groupby aggregation on the first df, call sum and then merge this with the other df:
In [12]:
gp = df.groupby('Name')['timeseries(n)'].sum().reset_index()
df1.merge(gp, on='Name')
Out[14]:
Name Lat Lon timeseries1 timeseries(n)
0 London 80.5234 121.0452 500 2103
1 Paris 90.4414 130.0252 400 1391
2 New York 110.5324 90.0023 700 2055
the aggregation looks like this:
In [15]:
gp
Out[15]:
Name timeseries(n)
0 London 2103
1 New York 2055
2 Paris 1391
Your CSV files can be loaded using read_csv, with something like:
df = pd.read_csv('File1.csv')
df1 = pd.read_csv('File2.csv')
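To try the whole pipeline without files on disk, the sketch below feeds a shortened copy of the sample data to read_csv through io.StringIO. The in-memory buffers and the names totals and result are stand-ins of mine, not part of the answer above.

```python
import io
import pandas as pd

# In-memory stand-ins for File1.csv and File2.csv (shortened sample data)
file1 = io.StringIO("""Name,Lat,Lon,timeseries(n)
London,80.5234,121.0452,523
London,80.5234,121.0452,732
London,80.5234,121.0452,848
Paris,90.4414,130.0252,464
Paris,90.4414,130.0252,829
Paris,90.4414,130.0252,98
""")
file2 = io.StringIO("""Name,lat,lon,timeseries1
London,80.5234,121.0452,500
Paris,90.4414,130.0252,400
""")

df = pd.read_csv(file1)
df1 = pd.read_csv(file2)

# Sum timeseries(n) per Name, keeping Name as a regular column
totals = df.groupby("Name", as_index=False)["timeseries(n)"].sum()

# Attach the totals to the one-row-per-city frame
result = df1.merge(totals, on="Name", how="left")
print(result)
```

Using how="left" keeps every row of the second file even if a city has no rows in the first one; its total would then be NaN.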
