(pandas) How do I add count and percent columns to a groupby? - Python

I read a CSV file and group it by two of its columns. For each (country, city) group I want to count the values in the name column, compute the percentage (count divided by the total number of rows), and add both as columns of a DataFrame.
test.csv holds a lot of data; here is a small sample:
==example==
country  city    name
KOREA    busan   Kim
KOREA    busan   choi
KOREA    Seoul   park
USA      LA      Jane
Spain    Madrid  Torres
(names do not repeat)
==========
csv_file = pd.read_csv("test.csv")
need_group = csv_file.groupby(['country', 'city'])
This returns a GroupBy object with one group per (country, city) pair (KOREA with Seoul, busan, ...; USA with LA, ...), but I don't know how to attach the count and percent columns to it.
What I want (count is the number of names in each group, percent is count/total):
  country  city      count  percent
0 KOREA    Seoul     2      20%
1 KOREA    Daegu     1      10%
2 USA      LA        2      20%
3 USA      New York  1      10%
4 Spain    Madrid    4      40%

I believe you need counts per country and city by GroupBy.size, and then the percentage by dividing by the length of the DataFrame:
print (csv_file)
country city name
0 KOREA busan Kim
1 KOREA busan Dongs
2 KOREA Seoul park
3 USA LA Jane
4 Spain Madrid Torres
df = csv_file.groupby(['country','city']).size().reset_index(name='count')
df['percent'] = df['count'].div(df['count'].sum()).mul(100)
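A minimal end-to-end sketch for the five-row sample above, assuming the CSV columns are named country, city and name:

import pandas as pd

csv_file = pd.read_csv("test.csv")   # columns: country, city, name

# one row per (country, city) pair with the number of names in each group
df = csv_file.groupby(['country', 'city']).size().reset_index(name='count')
# share of the overall row count, as a percentage
df['percent'] = df['count'].div(df['count'].sum()).mul(100)
# to render it like the desired output with a % sign:
# df['percent'] = df['percent'].round().astype(int).astype(str) + '%'
print(df)
  country    city  count  percent
0   KOREA   Seoul      1     20.0
1   KOREA   busan      2     40.0
2   Spain  Madrid      1     20.0
3     USA      LA      1     20.0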

Related

Fastest way to "unpack" a pandas dataframe

Hope the title is not misleading.
I load an Excel file into a pandas dataframe as usual:
df = pd.read_excel('complete.xlsx')
and this is what's inside (it is usually already ordered; this is a really small sample):
df
Out[21]:
Country City First Name Last Name Ref
0 England London John Smith 34
1 England London Bill Owen 332
2 England Brighton Max Crowe 25
3 England Brighton Steve Grant 55
4 France Paris Roland Tomas 44
5 France Paris Anatole Donnet 534
6 France Lyon Paulin Botrel 234
7 Spain Madrid Oriol Abarquero 34
8 Spain Madrid Alberto Olloqui 534
9 Spain Barcelona Ander Moreno 254
10 Spain Barcelona Cesar Aranda 222
What I need to do is automate an export of the data, creating a sqlite db for every country (e.g. 'England.sqlite') which will contain a table for every city (e.g. London and Brighton), and every table will have the related personnel info.
The sqlite part is not a problem; I'm only trying to figure out how to "unpack" the dataframe in the most rapid and "pythonic" way.
Thanks
You can loop over the DataFrame.groupby object:
for i, subdf in df.groupby('Country'):
    print(i)
    print(subdf)
    # processing
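Since the goal is one sqlite file per country with a table per city, the loop can be nested. A sketch under those assumptions (file and table names are taken directly from the group keys):

import sqlite3

# one database per country, one table per city inside it
for country, country_df in df.groupby('Country'):
    conn = sqlite3.connect(f'{country}.sqlite')
    for city, city_df in country_df.groupby('City'):
        city_df.to_sql(city, conn, if_exists='replace', index=False)
    conn.close()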

Merge dataframes on same row

I have Python code that gets links from a dataframe (df1), collects data from a website and returns the output in a new dataframe.
df1:
id Name link Country Continent
1 Company1 www.link1.com France Europe
2 Company2 www.link2.com France Europe
3 Company3 www.Link3.com France Europe
The output from the code is df2:
link numberOfPPL City
www.link1.com 8 Paris
www.link1.com 9 Paris
www.link2.com 15 Paris
www.link2.com 1 Paris
I want to join these 2 dataframes in one (dfinal). My code:
dfinal = df1.append(df2, ignore_index=True)
I got dfinal:
link           numberOfPPL  City   id   Name      Country  Continent
www.link1.com  8            Paris  NaN  NaN       NaN      NaN
www.link1.com  9            Paris  NaN  NaN       NaN      NaN
www.link2.com  15           Paris  NaN  NaN       NaN      NaN
www.link2.com  1            Paris  NaN  NaN       NaN      NaN
www.link1.com  NaN          NaN    1    Company1  France   Europe
..
..
I want my final dataframe to be like this:
link           numberOfPPL  City   id  Name      Country  Continent
www.link1.com  8            Paris  1   Company1  France   Europe
www.link1.com  9            Paris  1   Company1  France   Europe
www.link2.com  15           Paris  2   Company2  France   Europe
www.link2.com  1            Paris  2   Company2  France   Europe
Can anyone help, please?
You can merge the two dataframes on 'link':
outputDF = df2.merge(df1, how='left', on=['link'])
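Unlike append, which merely stacks the rows of the two frames below one another, a left merge keeps every row of df2 and copies the matching df1 row onto each repeated link. A minimal reproduction with the frames above:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2],
                    'Name': ['Company1', 'Company2'],
                    'link': ['www.link1.com', 'www.link2.com'],
                    'Country': ['France', 'France'],
                    'Continent': ['Europe', 'Europe']})
df2 = pd.DataFrame({'link': ['www.link1.com', 'www.link1.com',
                             'www.link2.com', 'www.link2.com'],
                    'numberOfPPL': [8, 9, 15, 1],
                    'City': ['Paris', 'Paris', 'Paris', 'Paris']})

# how='left' keeps all df2 rows; on='link' aligns each to its company
outputDF = df2.merge(df1, how='left', on=['link'])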

How can I find unique records in Python by row count?

df:
Country state item
0 Germany Augsburg Car
1 Spain Madrid Bike
2 Italy Milan Steel
3 Paris Lyon Bike
4 Italy Milan Steel
5 Germany Augsburg Car
In the above dataframe, if we count each unique record's appearances:
Country state item Appeared
0 Germany Augsburg Car 1
1 Spain Madrid Bike 1
2 Italy Milan Steel 1
3 Paris Lyon Bike 1
4 Italy Milan Steel 2
5 Germany Augsburg Car 2
Since rows 4 and 5 appear for the second time, I want to change their item names to differentiate the records. If a record appears more than once in the data, the item name should be renamed to Item_A for the first appearance, Item_B for the second appearance, and so on.
Output:
Country state item Appeared
0 Germany Augsburg Car_A 1
1 Spain Madrid Bike 1
2 Italy Milan Steel_A 1
3 Paris Lyon Bike 1
4 Italy Milan Steel_B 2
5 Germany Augsburg Car_B 2
You can first get the Appeared column with GroupBy.cumcount, then add the suffixes (a generalization beyond three appearances is sketched after the output):
import numpy as np

# mark every row that occurs more than once
duplicates = df.duplicated(keep=False)
# appearance count per identical row (1-based)
df['Appeared'] = df.groupby([*df]).cumcount().add(1)
# add the suffixes
suffixes = np.array(list('ABC'))
df.loc[duplicates, 'item'] = df['item'] + '_' + suffixes[df.Appeared-1]
Output:
Country state item Appeared
0 Germany Augsburg Car_A 1
1 Spain Madrid Bike 1
2 Italy Milan Steel_A 1
3 Paris Lyon Bike 1
4 Italy Milan Steel_B 2
5 Germany Augsburg Car_B 2
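Note that suffixes = np.array(list('ABC')) only covers three appearances; a hedged generalization that derives the letter directly from the appearance number (assuming at most 26 repeats) could replace the last two lines above:

# 1 -> A, 2 -> B, ... from the appearance number; assumes <= 26 appearances
suffix = (df['Appeared'] - 1 + ord('A')).map(chr)
df.loc[duplicates, 'item'] = df['item'] + '_' + suffix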

How to use Python to group by two columns, sum the sales and use the summed column to sort and get the n highest per group in pandas

I have a dataframe and I'm trying to group by the Name and Destination columns, calculate the sum of the sales for each Destination under each Name, and then get the top 2 for each name.
data=
Name Origin Destination Sales
John Italy China 2
Dan UK China 3
Dan UK India 2
Sam UK India 5
Sam Italy Malaysia 1
John Italy Malaysia 1
Dan France India 4
Dan Italy China 2
Sam Italy Malaysia 2
John France Malaysia 1
Sam Italy China 2
Dan UK Malaysia 4
Dan France India 2
John France Malaysia 4
John Italy China 4
John UK Malaysia 1
Sam UK China 4
Sam France China 5
I have tried to do this, but I keep getting it sorted by Destination and not by Sales. Below is the code I tried.
data.groupby(['Name', 'Destination'])['Sales'].sum().groupby(level=0).head(2).reset_index(name='Total_Sales')
This code gives me this dataframe:
Name Destination Total_Sales
Dan China 5
Dan India 8
John China 6
John Malaysia 7
Sam China 11
Sam India 5
But it is sorted on the wrong column (Destination); I would like to sort by the sum of the sales (Total_Sales) instead.
The expected result I want to achieve is:
Name Destination Total_Sales
Dan India 8
Dan China 5
John Malaysia 7
John China 6
Sam China 11
Sam India 5
Your code:
grouped_df = data.groupby(['Name', 'Destination'])['Sales'].sum().groupby(level=0).head(2).reset_index(name='Total_Sales')
To sort the result:
sorted_df = grouped_df.sort_values(by=['Name','Total_Sales'], ascending=(True,False))
print(sorted_df)
Output:
Name Destination Total_Sales
1 Dan India 8
0 Dan China 5
3 John Malaysia 7
2 John China 6
4 Sam China 11
5 Sam India 5
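One caveat: head(2) on the unsorted groups keeps the first two destinations per name in group-key order, which only happens to coincide with the two largest for this data. Sorting by the summed sales before taking the head guarantees the top 2; a minimal sketch:

top2 = (data.groupby(['Name', 'Destination'])['Sales'].sum()
            .reset_index(name='Total_Sales')
            .sort_values(['Name', 'Total_Sales'], ascending=[True, False])
            .groupby('Name').head(2))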

How do I create a new column in a dataframe from an existing column using conditions?

I have one column containing all the data which looks something like this (values that need to be separated have a mark like (c)):
UK (c)
London
Wales
Liverpool
US (c)
Chicago
New York
San Francisco
Seattle
Australia (c)
Sydney
Perth
And I want it split into two columns looking like this:
London UK
Wales UK
Liverpool UK
Chicago US
New York US
San Francisco US
Seattle US
Sydney Australia
Perth Australia
Question 2: What if the countries did not have a pattern like (c)?
Step by step with endswith, ffill and str.replace
(str.strip('(c)') would strip the characters (, c and ) from both string ends, which can eat a trailing c from a country name and leaves a trailing space, so a regex replace is safer.)
df['country'] = df.loc[df.city.str.endswith('(c)'), 'city']
df.country = df.country.ffill()
df = df[df.city.ne(df.country)]
df.country = df.country.str.replace(r'\s*\(c\)$', '', regex=True)
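A runnable version of the same steps, assuming the single input column is named city:

import pandas as pd

df = pd.DataFrame({'city': ['UK (c)', 'London', 'Wales', 'Liverpool',
                            'US (c)', 'Chicago', 'New York', 'San Francisco',
                            'Seattle', 'Australia (c)', 'Sydney', 'Perth']})

# mark the country rows, forward-fill them downwards, drop them, clean the marker
df['country'] = df.loc[df.city.str.endswith('(c)'), 'city']
df['country'] = df['country'].ffill()
df = df[df.city.ne(df.country)]
df['country'] = df['country'].str.replace(r'\s*\(c\)$', '', regex=True)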
extract and ffill
Start with extract and ffill, then remove redundant rows.
df['country'] = (
    df['data'].str.extract(r'(.*)\s+\(c\)', expand=False).ffill())
df[~df['data'].str.contains('(c)', regex=False)].reset_index(drop=True)
data country
0 London UK
1 Wales UK
2 Liverpool UK
3 Chicago US
4 New York US
5 San Francisco US
6 Seattle US
7 Sydney Australia
8 Perth Australia
Where,
df['data'].str.extract(r'(.*)\s+\(c\)', expand=False).ffill()
0 UK
1 UK
2 UK
3 UK
4 US
5 US
6 US
7 US
8 US
9 Australia
10 Australia
11 Australia
Name: country, dtype: object
The pattern '(.*)\s+\(c\)' matches strings of the form "country (c)" and extracts the country name. Anything not matching this pattern is replaced with NaN, so you can conveniently forward fill on rows.
split with np.where and ffill
This splits on "(c)".
import numpy as np

u = df['data'].str.split(r'\s+\(c\)')
df['country'] = pd.Series(np.where(u.str.len() == 2, u.str[0], np.nan)).ffill()
df[~df['data'].str.contains('(c)', regex=False)].reset_index(drop=True)
data country
0 London UK
1 Wales UK
2 Liverpool UK
3 Chicago US
4 New York US
5 San Francisco US
6 Seattle US
7 Sydney Australia
8 Perth Australia
You can first use str.extract to locate the cities ending in (c) and extract the country name, and ffill to populate a new country column.
The same extracted matches can be used to locate the rows to be dropped, i.e. the rows where the match is notna:
m = df.city.str.extract(r'^(.*?)\s*(?=\(c\)$)')
ix = m[m.squeeze().notna()].index
df['country'] = m.ffill()
df.drop(ix)
city country
1 London UK
2 Wales UK
3 Liverpool UK
5 Chicago US
6 New York US
7 San Francisco US
8 Seattle US
10 Sydney Australia
11 Perth Australia
You can use np.where with str.contains too:
import numpy as np

mask = df['places'].str.contains('(c)', regex=False)
df['country'] = np.where(mask, df['places'], np.nan)
df['country'] = df['country'].str.replace(r'\s*\(c\)', '', regex=True).ffill()
df = df[~mask]
df
places country
1 London UK
2 Wales UK
3 Liverpool UK
5 Chicago US
6 New York US
7 San Francisco US
8 Seattle US
10 Sydney Australia
11 Perth Australia
str.contains looks for (c) and returns True for the rows that have it. Where the condition is True, the country value is copied into the country column; ffill then propagates it down to the city rows.
You could do the following:
data = ['UK (c)','London','Wales','Liverpool','US (c)','Chicago','New York','San Francisco','Seattle','Australia (c)','Sydney','Perth']
df = pd.DataFrame(data, columns=['city'])
df['country'] = df.city.apply(lambda x: x.replace('(c)', '').strip() if '(c)' in x else None)
df['country'] = df['country'].ffill()   # fillna(method='ffill') is deprecated
df = df[~df['city'].str.contains(r'\(c\)')]
Output
+-----+----------------+-----------+
| | city | country |
+-----+----------------+-----------+
| 1 | London | UK |
| 2 | Wales | UK |
| 3 | Liverpool | UK |
| 5 | Chicago | US |
| 6 | New York | US |
| 7 | San Francisco | US |
| 8 | Seattle | US |
| 10 | Sydney | Australia |
| 11 | Perth | Australia |
+-----+----------------+-----------+
