Merge dataframes on same row - python

I have Python code that takes links from a dataframe (df1), collects data from a website, and returns the output in a new dataframe.
df1:
id Name link Country Continent
1 Company1 www.link1.com France Europe
2 Company2 www.link2.com France Europe
3 Company3 www.Link3.com France Europe
The output from the code is df2:
link numberOfPPL City
www.link1.com 8 Paris
www.link1.com 9 Paris
www.link2.com 15 Paris
www.link2.com 1 Paris
I want to join these two dataframes into one (dfinal). My code:
dfinal = df1.append(df2, ignore_index=True)
I got dfinal:
link numberOfPPL City id Name Country Continent
www.link1.com 8 Paris
www.link1.com 9 Paris
www.link2.com 15 Paris
www.link2.com 1 Paris
www.link1.com 1 Company1 France Continent
..
..
I want my final dataframe to look like this:
link numberOfPPL City id Name Country Continent
www.link1.com 8 Paris 1 Company1 France Europe
www.link1.com 9 Paris 1 Company1 France Europe
www.link2.com 15 Paris 2 Company2 France Europe
www.link2.com 1 Paris 2 Company2 France Europe
Can anyone help, please?

You can merge the two dataframes on 'link':
outputDF = df2.merge(df1, how='left', on=['link'])
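As a minimal sketch of that merge, the snippet below rebuilds df1 and df2 from the question and joins them on 'link'. The validate="many_to_one" argument is an optional addition not in the answer above; it asks pandas to raise an error if a link appears more than once in df1. Note that the match is exact and case-sensitive, so a scraped link would have to be spelled exactly like 'www.Link3.com' to pick up Company3.

```python
import pandas as pd

# Sample frames rebuilt from the question
df1 = pd.DataFrame({
    "id": [1, 2, 3],
    "Name": ["Company1", "Company2", "Company3"],
    "link": ["www.link1.com", "www.link2.com", "www.Link3.com"],
    "Country": ["France"] * 3,
    "Continent": ["Europe"] * 3,
})
df2 = pd.DataFrame({
    "link": ["www.link1.com", "www.link1.com", "www.link2.com", "www.link2.com"],
    "numberOfPPL": [8, 9, 15, 1],
    "City": ["Paris"] * 4,
})

# Keep every scraped row (left side) and attach the matching company columns.
# validate="many_to_one" is an extra safety check: each link must be unique in df1.
dfinal = df2.merge(df1, how="left", on="link", validate="many_to_one")
print(dfinal)
```

Using how='left' keeps all rows of df2 even when a link has no match in df1; the company columns are then filled with NaN for those rows.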

Related

Fastest way to "unpack" a pandas dataframe

Hope the title is not misleading.
I load an Excel file into a pandas dataframe as usual
df = pd.read_excel('complete.xlsx')
and this is what's inside (it is usually already ordered; this is a really small sample)
df
Out[21]:
Country City First Name Last Name Ref
0 England London John Smith 34
1 England London Bill Owen 332
2 England Brighton Max Crowe 25
3 England Brighton Steve Grant 55
4 France Paris Roland Tomas 44
5 France Paris Anatole Donnet 534
6 France Lyon Paulin Botrel 234
7 Spain Madrid Oriol Abarquero 34
8 Spain Madrid Alberto Olloqui 534
9 Spain Barcelona Ander Moreno 254
10 Spain Barcelona Cesar Aranda 222
What I need to do is automate an export of the data, creating a sqlite db for every country (e.g. 'England.sqlite') that contains a table for every city (e.g. London and Brighton), with each table holding the related personnel info.
The sqlite part is not a problem; I'm only trying to figure out how to "unpack" the dataframe in the fastest and most "pythonic" way.
Thanks
You can loop over the DataFrame.groupby object:
for i, subdf in df.groupby('Country'):
    print(i)
    print(subdf)
    # processing
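The loop above can be extended to the export described in the question. This is only a sketch under stated assumptions: the output folder is a hypothetical temporary directory, the sample frame is a small subset of the question's data, and city names are used directly as table names, which presumes they are unique within a country.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

# Small subset of the data shown in the question
df = pd.DataFrame({
    "Country": ["England", "England", "France", "France"],
    "City": ["London", "Brighton", "Paris", "Lyon"],
    "First Name": ["John", "Max", "Roland", "Paulin"],
    "Last Name": ["Smith", "Crowe", "Tomas", "Botrel"],
    "Ref": [34, 25, 44, 234],
})

out_dir = Path(tempfile.mkdtemp())  # hypothetical output folder

# Outer loop: one sqlite file per country
for country, country_df in df.groupby("Country"):
    conn = sqlite3.connect(out_dir / f"{country}.sqlite")
    try:
        # Inner loop: one table per city, without the two grouping columns
        for city, city_df in country_df.groupby("City"):
            city_df.drop(columns=["Country", "City"]).to_sql(
                city, conn, if_exists="replace", index=False
            )
        conn.commit()
    finally:
        conn.close()

print(sorted(p.name for p in out_dir.iterdir()))
```

Nesting a second groupby on the per-country frame avoids filtering the full dataframe repeatedly, since each group is visited only once.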

(pandas) I want to add count and percent columns to a groupby

I load a CSV file and group it by two of its headers. For each group I want
the count of rows and the percentage (count / total), and I want to add both
to the dataframe.
There is a lot of data in test.csv.
==example==
country city name
KOREA busan Kim
KOREA busan choi
KOREA Seoul park
USA LA Jane
Spain Madrid Torres
(names do not repeat)
==========
csv_file = pd.read_csv("test.csv")
need_group = csv_file.groupby(['category','city names'])
returns
country city names
0 KOREA Seoul, Busan, ...
1 KOREA Daegu, Seoul
2 USA LA, New York...
2 USA LA, ...
What I want
- count is the number of names
country city names count percent
0 KOREA Seoul 2 20%
1 KOREA Daegu 1 10%
2 USA LA 2 20%
3 USA New York 1 10%
4 Spain Madrid 4 40%
I believe you need counts per country and city with GroupBy.size, then a percentage obtained by dividing each count by the total number of rows (the sum of the counts):
print (csv_file)
country city name
0 KOREA busan Kim
1 KOREA busan Dongs
2 KOREA Seoul park
3 USA LA Jane
4 Spain Madrid Torres
df = csv_file.groupby(['country','city']).size().reset_index(name='count')
df['percent'] = df['count'].div(df['count'].sum()).mul(100)
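For a version that runs on its own, the sketch below builds the sample frame inline instead of reading test.csv and applies the same size-then-divide idea. Passing sort=False is an optional choice made here so that groups keep their order of first appearance.

```python
import pandas as pd

# Inline stand-in for pd.read_csv("test.csv"), using the answer's sample rows
csv_file = pd.DataFrame({
    "country": ["KOREA", "KOREA", "KOREA", "USA", "Spain"],
    "city": ["busan", "busan", "Seoul", "LA", "Madrid"],
    "name": ["Kim", "Dongs", "park", "Jane", "Torres"],
})

# Number of rows per (country, city) pair, in order of first appearance
counts = (
    csv_file.groupby(["country", "city"], sort=False)
    .size()
    .reset_index(name="count")
)

# Share of all rows, as a percentage
counts["percent"] = counts["count"].mul(100).div(len(csv_file))
print(counts)
```

Because every row belongs to exactly one group, len(csv_file) equals the sum of the counts, so the percentages add up to 100.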

How can I find unique records in Python by row count?

df:
Country state item
0 Germany Augsburg Car
1 Spain Madrid Bike
2 Italy Milan Steel
3 Paris Lyon Bike
4 Italy Milan Steel
5 Germany Augsburg Car
In the above dataframe, counting how many times each record has appeared so far gives:
Country state item Appeared
0 Germany Augsburg Car 1
1 Spain Madrid Bike 1
2 Italy Milan Steel 1
3 Paris Lyon Bike 1
4 Italy Milan Steel 2
5 Germany Augsburg Car 2
Since rows 4 and 5 appear for the second time, I want to change their item names to differentiate the records. If a record appears more than once in the data, the item should be renamed Item_A for the first appearance and Item_B for the second appearance...
Output:
Country state item Appeared
0 Germany Augsburg Car_A 1
1 Spain Madrid Bike 1
2 Italy Milan Steel_A 1
3 Paris Lyon Bike 1
4 Italy Milan Steel_B 2
5 Germany Augsburg Car_B 2
You can first get the Appeared column with groupby().cumcount(), then add the suffixes:
import numpy as np
# rows that occur more than once (every occurrence is flagged)
duplicates = df.duplicated(keep=False)
# appearance count within each group of identical rows
df['Appeared'] = df.groupby([*df]).cumcount().add(1)
# add the suffixes (A, B, C cover up to three appearances)
suffixes = np.array(list('ABC'))
df.loc[duplicates, 'item'] = df['item'] + '_' + suffixes[df.Appeared-1]
Output:
Country state item Appeared
0 Germany Augsburg Car_A 1
1 Spain Madrid Bike 1
2 Italy Milan Steel_A 1
3 Paris Lyon Bike 1
4 Italy Milan Steel_B 2
5 Germany Augsburg Car_B 2
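The same approach can be written as a self-contained sketch. Two details are choices made for this example rather than part of the answer above: the key columns are listed explicitly (so the result does not depend on whether Appeared already exists), and the suffix alphabet is extended to A..Z, which still assumes no record appears more than 26 times.

```python
import string

import numpy as np
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "Country": ["Germany", "Spain", "Italy", "Paris", "Italy", "Germany"],
    "state": ["Augsburg", "Madrid", "Milan", "Lyon", "Milan", "Augsburg"],
    "item": ["Car", "Bike", "Steel", "Bike", "Steel", "Car"],
})

key_cols = ["Country", "state", "item"]

# True for every row whose key occurs more than once (first occurrence included)
is_repeated = df.duplicated(subset=key_cols, keep=False)

# 1 for the first appearance of a key, 2 for the second, and so on
df["Appeared"] = df.groupby(key_cols).cumcount() + 1

# Map appearance number to a letter: 1 -> A, 2 -> B, ...
letters = np.array(list(string.ascii_uppercase))
suffix = pd.Series(letters[df["Appeared"].to_numpy() - 1], index=df.index)

# Only repeated records get a suffix; unique records keep their item name
df.loc[is_repeated, "item"] = df["item"] + "_" + suffix
print(df)
```

The appearance count is computed before the item column is modified, because the suffixes would otherwise change the grouping keys.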

How to use python to group by two columns, sum them and use one of the columns to sort and get the n highest per group in pandas

I have a dataframe and I'm trying to group by the Name and Destination columns and calculate the sum of the sales for that Destination for the particular Name and then get the top 2 for each name.
data=
Name Origin Destination Sales
John Italy China 2
Dan UK China 3
Dan UK India 2
Sam UK India 5
Sam Italy Malaysia 1
John Italy Malaysia 1
Dan France India 4
Dan Italy China 2
Sam Italy Malaysia 2
John France Malaysia 1
Sam Italy China 2
Dan UK Malaysia 4
Dan France India 2
John France Malaysia 4
John Italy China 4
John UK Malaysia 1
Sam UK China 4
Sam France China 5
I have tried to do this but I keep getting it sorted by the Destination and not the Sales. Below is the code I tried.
data.groupby(['Name', 'Destination'])['Sales'].sum().groupby(level=0).head(2).reset_index(name='Total_Sales')
This code gives me this dataframe:
Name Destination Total_Sales
Dan China 5
Dan India 8
John China 6
John Malaysia 7
Sam China 11
Sam India 5
However, it is sorted on the wrong column (Destination); I would like it sorted by the sum of the sales (Total_Sales).
The result I want to achieve is:
Name Destination Total_Sales
Dan India 8
Dan China 5
John Malaysia 7
John China 6
Sam China 11
Sam India 5
Your code:
grouped_df = data.groupby(['Name', 'Destination'])['Sales'].sum().groupby(level=0).head(2).reset_index(name='Total_Sales')
To sort the result:
sorted_df = grouped_df.sort_values(by=['Name', 'Total_Sales'], ascending=[True, False])
print(sorted_df)
Output:
Name Destination Total_Sales
1 Dan India 8
0 Dan China 5
3 John Malaysia 7
2 John China 6
4 Sam China 11
5 Sam India 5
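One caveat worth checking: head(2) is applied before the sort, so it keeps the first two destinations in alphabetical order for each name, which happen to be the two largest totals in this sample. The sketch below is an alternative ordering of the same steps, sorting first and taking head(2) afterwards, so that the top two by Total_Sales are selected in general. The Origin column is left out because it is not used.

```python
import pandas as pd

# Rows from the question (Name, Destination, Sales); Origin omitted as unused
rows = [
    ("John", "China", 2), ("Dan", "China", 3), ("Dan", "India", 2),
    ("Sam", "India", 5), ("Sam", "Malaysia", 1), ("John", "Malaysia", 1),
    ("Dan", "India", 4), ("Dan", "China", 2), ("Sam", "Malaysia", 2),
    ("John", "Malaysia", 1), ("Sam", "China", 2), ("Dan", "Malaysia", 4),
    ("Dan", "India", 2), ("John", "Malaysia", 4), ("John", "China", 4),
    ("John", "Malaysia", 1), ("Sam", "China", 4), ("Sam", "China", 5),
]
data = pd.DataFrame(rows, columns=["Name", "Destination", "Sales"])

# Total sales per (Name, Destination)
totals = (
    data.groupby(["Name", "Destination"], as_index=False)["Sales"]
    .sum()
    .rename(columns={"Sales": "Total_Sales"})
)

# Sort by name, then by total descending, and only then keep two rows per name
top2 = (
    totals.sort_values(["Name", "Total_Sales"], ascending=[True, False])
    .groupby("Name")
    .head(2)
    .reset_index(drop=True)
)
print(top2)
```

Changing head(2) to head(n) gives the n highest destinations per name.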

Formatting pandas output data

I have a dataframe and want the output to be formatted to save paper for printing.
GameA GameB
Country
London 5 20
London 5 10
London 3 5
London 3 6
London 8
London 40
France 2 20
France 2 22
France 3
France 3
France 3
USA 10
Is there a way to format the dataframe to look like this:
GameA GameB
Country
London 5 London 20
London 5 London 10
London 3 London 5
London 3 London 6
London London 8
London London 40
GameA GameB
France 2 France 20
France 2 France 22
France 3
France 3
France 3
GameA
USA 10
The formatting below is slightly off because of how the text was copied and pasted (due to the missing values), but this should work with your actual data.
countries = df.index.unique()
for country in countries:
    print(df.loc[df.index == country])
    print(' ')
GameA GameB
Country
London 5 20
London 5 10
London 3 5
London 3 6
London 8 NaN
London 40 NaN
GameA GameB
Country
France 2 20
France 2 22
France 3 NaN
France 3 NaN
France 3 NaN
GameA GameB
Country
USA 10 NaN
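A variation on the same idea, sketched with a small made-up frame (the values below are illustrative and not the question's exact data): iterating over groupby on the index yields one block per country, and dropping columns that are entirely empty within a block gets closer to the compact layout asked for, where USA shows only GameA.

```python
import pandas as pd

# Illustrative frame indexed by Country; None marks a missing value
df = pd.DataFrame(
    {"GameA": [5, 3, 2, 10], "GameB": [20, 5, 22, None]},
    index=pd.Index(["London", "London", "France", "USA"], name="Country"),
)

blocks = []
# sort=False keeps the countries in their original order
for country, block in df.groupby(level="Country", sort=False):
    # Drop columns that have no values at all for this country
    blocks.append(block.dropna(axis="columns", how="all").to_string())

# One blank line between country blocks
print("\n\n".join(blocks))
```

This prints the blocks one below the other; placing them side by side as in the question would need additional string handling.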
