python pandas pivot table of numerical range from dataframe

Hello, I would like to make a pivot table out of a dataframe that lists companies by their number of uploads to a website. Here is what I have:
df
Company Uploads
Nike 11
Adidas 26
Apple 55
Tesla 3
Amazon 97
Ralph Lauren 54
Tiffany 19
Walmart 77
Target 18
Facebook 48
Google 81
Desired output
Range Company Uploads
<10 Tesla 3
11-50 Adidas 26
Tiffany 19
Target 18
Nike 11
51-100 Amazon 97
Google 81
Walmart 77
Apple 55
Ralph Lauren 54
I'm thinking of adding a 'Range' column to df using np.where, then making a pivot table using pd.pivot_table or .groupby, and then using .sort_values to get the descending upload numbers in the pivot table.
I'm not sure whether this would work. Can anyone help me with this? I appreciate any assistance. Thanks in advance!!
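For reference, here is a minimal sketch of exactly that plan (hedged: it swaps np.where for np.select, which handles more than two bins, and uses an ordered categorical so the range labels sort in bin order rather than alphabetically):
import numpy as np
import pandas as pd

# bin the upload counts: <=10, 11-50, 51-100
df['Range'] = np.select([df['Uploads'] <= 10, df['Uploads'] <= 50],
                        ['<10', '11-50'], default='51-100')
# an ordered categorical sorts '<10' before '11-50' before '51-100'
df['Range'] = pd.Categorical(df['Range'], ['<10', '11-50', '51-100'], ordered=True)
df.sort_values(['Range', 'Uploads'], ascending=[True, False]).set_index(['Range', 'Company'])
The answers below reach the same shape more directly with pd.cut and np.digitize.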

You can use pd.cut(), which bins values into intervals and lets you name each bin with a label.
import pandas as pd
import numpy as np
import io
data = '''
Company Uploads
Nike 11
Adidas 26
Apple 55
Tesla 3
Amazon 97
"Ralph Lauren" 54
Tiffany 19
Walmart 77
Target 18
Facebook 48
Google 81
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
df['category'] = pd.cut(df['Uploads'], [0,10,50,100], labels=['<10','11-50','51-100'])
df.sort_values(['category','Uploads'], ascending=[True, False], inplace=True)
df.set_index(['category','Company'],inplace=True)
df
Uploads
category Company
<10 Tesla 3
11-50 Facebook 48
Adidas 26
Tiffany 19
Target 18
Nike 11
51-100 Amazon 97
Google 81
Walmart 77
Apple 55
Ralph Lauren 54
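Note that pd.cut is right-inclusive by default, so the bins above are (0, 10], (10, 50], and (50, 100]: an upload count of exactly 10 lands in the first bin. A quick check:
pd.cut([10, 11], [0, 10, 50, 100], labels=['<10', '11-50', '51-100'])
# ['<10', '11-50']
# Categories (3, object): ['<10' < '11-50' < '51-100']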

What you want is a MultiIndex instead of a groupby()
First create a column that bins your uploads like you proposed:
df = df.sort_values('Uploads',ascending=False)
df['Range'] = np.digitize(df['Uploads'], [0, 11, 51, 100])  # bins: <=10, 11-50, 51-100
# only handles values up to 100; if there are values above 100 you need to extend that bin list
Now we map the resulting bin numbers to more descriptive strings:
df = df.sort_values('Range', kind='stable')  # stable sort keeps the descending Uploads order within each bin
key_range = np.vectorize(lambda x: {1:'<10',2:'11-50',3:'51-100'}[x])
df['Range'] = key_range(df['Range'])
Create a MultiIndex to get your desired df:
df.set_index(['Range','Company'])
output:
Uploads
Range Company
<10 Tesla 3
11-50 Facebook 48
Adidas 26
Tiffany 19
Target 18
Nike 11
51-100 Amazon 97
Google 81
Walmart 77
Apple 55
Ralph Lauren 54
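As a design note, the np.vectorize wrapper over a dict lookup can be replaced by Series.map, which performs the same integer-to-label mapping in one line:
# equivalent, simpler mapping with Series.map
df['Range'] = df['Range'].map({1: '<10', 2: '11-50', 3: '51-100'})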

Fill blank cells of a pandas dataframe column by matching with another datafame column

I have a pandas dataframe, let's call it df1, that looks like this (the following is just a sample to give an idea of the dataframe):
Ac      Tp     Id        2020  2021  2022
Efecty  FC     IQ_EF     100   200   45
Asset   FC               52    48    15
Debt    P&G    IQ_DEBT   45    58    15
Tax     Other            48    45    78
And I want to fill the blank spaces in the 'Id' column using the following auxiliary dataframe, let's call it df2 (again, this is just a sample):
Ac      Tp     Id
Efecty  FC     IQ_EF
Asset   FC     IQ_AST
Debt    P&G    IQ_DEBT
Tax     Other  IQ_TAX
Income  BAL    IQ_INC
Invest  FC     IQ_INV
So that df1 ends up looking like this:
Ac      Tp     Id        2020  2021  2022
Efecty  FC     IQ_EF     100   200   45
Asset   FC     IQ_AST    52    48    15
Debt    P&G    IQ_DEBT   45    58    15
Tax     Other  IQ_TAX    48    45    78
I tried with this line of code but it did not work:
df1['Id'] = df1['Id'].mask(df1('nan')).fillna(df1['Ac'].map(df2('Ac')['Id']))
Can you guys help me?
Merge the two frames on the Ac and Tp columns and assign the Id column from the result to df1.Id. This works similarly to Excel's VLOOKUP functionality.
ac_tp = ['Ac', 'Tp']
df1['Id'] = df1[ac_tp].merge(df2[[*ac_tp, 'Id']])['Id']
In a similar vein you could also try:
df1['Id'] = (df1.merge(df2, on=['Ac', 'Tp'])
                .pipe(lambda d: d['Id_x'].mask(d['Id_x'].isnull(), d['Id_y'])))
Ac Tp Id 2020 2021 2022
0 Efecty FC IQ_EF 100 200 45
1 Asset FC IQ_AST 52 48 15
2 Debt P&G IQ_DEBT 45 58 15
3 Tax Other IQ_TAX 48 45 78
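A third option, sketched here under the assumption that the ['Ac', 'Tp'] pairs are unique in df2: set that pair as the index on both frames and let fillna align on it, filling only the missing Ids:
# index alignment + fillna; assumes ['Ac', 'Tp'] uniquely identifies rows in df2
df1 = df1.set_index(['Ac', 'Tp'])
df1['Id'] = df1['Id'].fillna(df2.set_index(['Ac', 'Tp'])['Id'])
df1 = df1.reset_index()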

How do i create code for a vlookup in python?

df
Season  Date      Team         Team_Season_Code  TS  L     Opponent      Opponent_Season_Code  OS
2019    20181109  Abilene_Chr  1_2019            94  Home  Arkansas_St   15_2019               73
2019    20181115  Abilene_Chr  1_2019            67  Away  Denver        82_2019               61
2019    20181122  Abilene_Chr  1_2019            72  N     Elon          70_2019               56
2019    20181123  Abilene_Chr  1_2019            73  Away  Pacific       224_2019              71
2019    20181124  Abilene_Chr  1_2019            60  N     UC_Riverside  306_2019              48
Overall_Season_Avg
Team_Season_Code  Team          TS         OS         MOV
15_2019           Arkansas_St   70.909091  65.242424  5.666667
70_2019           Elon          73.636364  71.818182  1.818182
82_2019           Denver        74.031250  72.156250  1.875000
224_2019          Pacific       78.333333  76.466667  1.866667
306_2019          UC_Riverside  79.545455  78.060606  1.484848
I have these two dataframes, and I want to look up each Opponent_Season_Code from df in Overall_Season_Avg's "Team_Season_Code" column and bring back that row's "TS" and "OS" as new columns in df called "OTS" and "OOS".
So row 1 of df should get a column named OOS with value 65.24... and a column named OTS with value 70.90...
In Excel it's a simple VLOOKUP, but I haven't been able to adapt the solutions to the VLOOKUP questions already on Stack Overflow, so I decided to post my own question. I will also mention that the Overall_Season_Avg dataframe was created by Overall_Season_Avg = df.groupby(['Team_Season_Code', 'Team']).agg({'TS': np.mean, 'OS': np.mean, 'MOV': np.mean})
You can use a merge, after reworking Overall_Season_Avg a bit:
df.merge(Overall_Season_Avg
           .set_index(['Team_Season_Code', 'Team'])
           [['OS', 'TS']].add_prefix('O'),
         left_on=['Opponent_Season_Code', 'Opponent'],
         right_index=True, how='left'
         )
Output:
Season Date Team Team_Season_Code TS L Opponent Opponent_Season_Code OS OOS OTS
0 2019 20181109 Abilene_Chr 1_2019 94 Home Arkansas_St 15_2019 73 65.242424 70.909091
1 2019 20181115 Abilene_Chr 1_2019 67 Away Denver 82_2019 61 72.156250 74.031250
2 2019 20181122 Abilene_Chr 1_2019 72 N Elon 70_2019 56 71.818182 73.636364
3 2019 20181123 Abilene_Chr 1_2019 73 Away Pacific 224_2019 71 76.466667 78.333333
4 2019 20181124 Abilene_Chr 1_2019 60 N UC_Riverside 306_2019 48 78.060606 79.545455
merging only on Opponent_Season_Code/Team_Season_Code:
df.merge(Overall_Season_Avg
           .set_index('Team_Season_Code')
           [['OS', 'TS']].add_prefix('O'),
         left_on=['Opponent_Season_Code'],
         right_index=True, how='left'
         )
Output:
Season Date Team Team_Season_Code TS L Opponent Opponent_Season_Code OS OOS OTS
0 2019 20181109 Abilene_Chr 1_2019 94 Home Arkansas_St 15_2019 73 65.242424 70.909091
1 2019 20181115 Abilene_Chr 1_2019 67 Away Denver 82_2019 61 72.156250 74.031250
2 2019 20181122 Abilene_Chr 1_2019 72 N Elon 70_2019 56 71.818182 73.636364
3 2019 20181123 Abilene_Chr 1_2019 73 Away Pacific 224_2019 71 76.466667 78.333333
4 2019 20181124 Abilene_Chr 1_2019 60 N UC_Riverside 306_2019 48 78.060606 79.545455
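If Team_Season_Code alone identifies a team-season, as it does in this sample, a Series.map per column is another minimal sketch of the same lookup (assuming Overall_Season_Avg has had reset_index() applied, so Team_Season_Code is a regular column):
# map each opponent's code to that team's season averages
avg = Overall_Season_Avg.set_index('Team_Season_Code')
df['OOS'] = df['Opponent_Season_Code'].map(avg['OS'])
df['OTS'] = df['Opponent_Season_Code'].map(avg['TS'])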
df.merge(Overall_Season_Avg, on=['Team_Season_Code', 'Team'], how='left')
and rename the resulting columns, or use transform instead of agg when building Overall_Season_Avg.
I haven't included the transform code because you didn't provide a reproducible example. Please make a simple, reproducible example:
https://stackoverflow.com/help/minimal-reproducible-example

Doing vlookup-like things on Python with multiple lookup values

Many of us know that the syntax for a Vlookup function on Excel is as follows:
=vlookup([lookup value], [lookup table/range], [column selected], [approximate/exact match (optional)])
I want to do something on Python with a lookup table (in dataframe form) that looks something like this:
Name Date of Birth ID#
Jack 1/1/2003 0
Ryan 1/8/2003 1
Bob 12/2/2002 2
Jack 3/9/2003 3
...and so on. Note how the two Jacks are assigned different ID numbers because they were born on different dates.
Say I have something like a gradebook (again, in dataframe form) that looks like this:
Name Date of Birth Test 1 Test 2
Jack 1/1/2003 89 91
Ryan 1/8/2003 92 88
Jack 3/9/2003 93 79
Bob 12/2/2002 80 84
...
How do I make it so that the result looks like this?
ID# Name Date of Birth Test 1 Test 2
0 Jack 1/1/2003 89 91
3 Ryan 1/8/2003 92 88
1 Jack 3/9/2003 93 79
2 Bob 12/2/2002 80 84
...
It seems to me that the "lookup value" would involve multiple columns of data ('Name' and 'Date of Birth'). I kind of know how to do this in Excel, but how do I do it in Python?
Turns out that I can just do
pd.merge([lookup value], [lookup table], on=['Name', 'Date of Birth'])
which produces
Name Date of Birth Test 1 Test 2 ID#
Jack 1/1/2003 89 91 0
Ryan 1/8/2003 92 88 3
Jack 3/9/2003 93 79 1
Bob 12/2/2002 80 84 2
...
Then all that's needed is to move the last column to the front.
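A sketch of the whole thing, using hypothetical frame names (gradebook and lookup are stand-ins, not names from the question):
# merge on both key columns, then move ID# to the front
result = pd.merge(gradebook, lookup, on=['Name', 'Date of Birth'])
result = result[['ID#'] + [c for c in result.columns if c != 'ID#']]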

Updating values for a subset of a subset of a pandas dataframe too slow for large data set

Problem Statement: I'm working with transaction data for all of a hospital's visits and I need to remove every bad debt transaction after the first for each patient.
Issue I'm Having: My code works on a small dataset, but the actual data set is about 5GB and 13M rows. The code has been running for several days now and still hasn't finished. For background, my code is in a Jupyter notebook running on a standard work PC.
Sample Data
import pandas as pd
df = pd.DataFrame({"PatientAccountNumber": [113,113,113,113,225,225,225,225,225,225,225],
                   "TransactionCode": ['50','50','77','60','22','77','25','77','25','77','77'],
                   "Bucket": ['Charity','Charity','Bad Debt','3rd Party','Self Pay','Bad Debt',
                              'Charity','Bad Debt','Charity','Bad Debt','Bad Debt']})
print(df)
Sample Dataframe
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
7 225 77 Bad Debt
8 225 25 Charity
9 225 77 Bad Debt
10 225 77 Bad Debt
Solution
for account in df['PatientAccountNumber'].unique():
    mask = (df['PatientAccountNumber'] == account) & (df['Bucket'] == 'Bad Debt')
    df.drop(df[mask].index[1:], inplace=True)
print(df)
Desired Result (Each patient should have a maximum of one Bad Debt transaction)
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
8 225 25 Charity
Alternate Solution
for account in df['PatientAccountNumber'].unique():
    mask = (df['PatientAccountNumber'] == account) & (df['Bucket'] == 'Bad Debt')
    mask = mask & (mask.cumsum() > 1)
    df.loc[mask, 'Bucket'] = 'DELETE'
df = df[df['Bucket'] != 'DELETE']
Attempted using Dask
I thought maybe Dask would be able to help me out, but I got the following error codes:
Using Dask on first solution - "NotImplementedError: Series getitem in only supported for other series objects with matching partition structure"
Using Dask on second solution - "TypeError: '_LocIndexer' object does not support item assignment"
You can solve this using df.duplicated on both PatientAccountNumber and Bucket, and checking whether Bucket is 'Bad Debt'. duplicated marks every repeat of an (account, bucket) pair, so ANDing it with the Bad Debt test flags exactly the second and later Bad Debt rows per account; ~ then inverts the mask to keep everything else:
df[~(df.duplicated(['PatientAccountNumber','Bucket']) & df['Bucket'].eq("Bad Debt"))]
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
8 225 25 Charity
Create a boolean mask without a loop:
mask = df[df['Bucket'].eq('Bad Debt')].duplicated('PatientAccountNumber')  # True for the 2nd+ Bad Debt row per account
df.drop(mask[mask].index, inplace=True)  # mask[mask].index holds the row labels to drop
>>> df
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
8 225 25 Charity
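Another vectorized option, sketched as an alternative to both answers above: count Bad Debt rows per account with a grouped cumulative sum and keep only the first:
# keep rows that are not Bad Debt, or are the first Bad Debt for their account
bad = df['Bucket'].eq('Bad Debt')
keep = ~bad | (bad.groupby(df['PatientAccountNumber']).cumsum() == 1)
df = df[keep]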

How to grab a complete table hidden beyond 'Show all' by web scraping in Python

According to the reply I found to my previous question, I am able to grab the table by web scraping in Python from the URL https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html, but it only grabs part of the table, stopping where the row "Show all" appears.
How can I grab the complete table in Python, including the rows hidden beyond "Show all"?
Here is the code I am using:
import pandas as pd
import requests
from bs4 import BeautifulSoup
vaccineDF = pd.read_html('https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html')[0]
vaccineDF = vaccineDF.reset_index(drop=True)
print(vaccineDF.head(100))
The output only grabs 15 rows (until Show All):
Unnamed: 0_level_0 Doses administered ... Unnamed: 8_level_0 Unnamed: 9_level_0
Unnamed: 0_level_1 Per 100 people ... Unnamed: 8_level_1 Unnamed: 9_level_1
0 World 11 ... NaN NaN
1 Israel 116 ... NaN NaN
2 Seychelles 116 ... NaN NaN
3 U.A.E. 99 ... NaN NaN
4 Chile 69 ... NaN NaN
5 Bahrain 66 ... NaN NaN
6 Bhutan 63 ... NaN NaN
7 U.K. 62 ... NaN NaN
8 United States 61 ... NaN NaN
9 San Marino 60 ... NaN NaN
10 Maldives 59 ... NaN NaN
11 Malta 55 ... NaN NaN
12 Monaco 53 ... NaN NaN
13 Hungary 45 ... NaN NaN
14 Serbia 44 ... NaN NaN
15 Show all Show all ... Show all Show all
Below was a screenshot of the partial table up to "Show all" on the web page (left) and the corresponding inspected elements (right).
You can't get the whole table directly, because the full data only appears after clicking the Show all button. So, in this scenario, we first have to trigger a click() event on the Show all button, and then we can fetch the whole table.
I have used the Selenium library for the click event on the Show all button, with Selenium's Firefox() webdriver to fetch the data from the URL. Kindly refer to the code given below for fetching the whole table from the given COVID dataset URL:
# Import the important libraries
from selenium import webdriver         # fetches the page and performs the on-click event
from pandas.io.html import read_html   # reads tables out of 'html' source, so we can scrape data from them
import pandas as pd                    # converts our data into a 'DataFrame'
# Create a 'Firefox' webdriver object
driver = webdriver.Firefox()
# Get the website
driver.get("https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html")
# Find the 'Show all' button using 'XPath'
# (in Selenium 4 the equivalent call is driver.find_element(By.XPATH, ...))
show_all_button = driver.find_element_by_xpath("/html/body/div[1]/main/article/section/div/div/div[4]/div[1]/div/table/tbody/tr[16]")
# Click the 'Show all' button
show_all_button.click()
# Get the 'HTML' content of the page
html_data = driver.page_source
After fetching the whole page, let's see how many tables there are in our COVID dataset URL:
covid_data_tables = read_html(html_data, attrs = {"class":"g-summary-table svelte-2wimac"}, header = None)
# Print Number of Tables Extracted
print ("\nExtracted {num} COVID Data Table".format(num = len(covid_data_tables)), "\n")
# Output of Above Cell:-
Extracted 1 COVID Data Table
Now, let's fetch the Data Table:-
# Print Table Data
covid_data_tables[0].head(20)
# Output of above cell:-
Unnamed: 0_level_0 Doses administered Pct. of population
Unnamed: 0_level_1 Per 100 people Total Vaccinated Fully vaccinated
0 World 11 877933955 – –
1 Israel 116 10307583 60% 56%
2 Seychelles 116 112194 68% 47%
3 U.A.E. 99 9489684 – –
4 Chile 69 12934282 41% 28%
5 Bahrain 66 1042463 37% 29%
6 Bhutan 63 478219 63% –
7 U.K. 62 41505768 49% 13%
8 United States 61 202282923 38% 24%
9 San Marino 60 20424 35% 25%
10 Maldives 59 303752 53% 5.6%
11 Malta 55 264658 38% 17%
12 Monaco 53 20510 30% 23%
13 Hungary 45 4416581 32% 14%
14 Serbia 44 3041740 26% 17%
15 Qatar 43 1209648 – –
16 Uruguay 38 1310591 30% 8.3%
17 Singapore 30 1667522 20% 9.5%
18 Antigua and Barbuda 28 27032 28% –
19 Iceland 28 98672 20% 8.1%
As you can see, the "Show all" row no longer appears in our data. Now we can convert this data table to a DataFrame. For this task, the data is stored in CSV format and then read back into a DataFrame. The code for that is stated below:
# HTML Table to CSV Format Conversion For COVID Dataset
covid_data_file = 'covid_data.csv'
covid_data_tables[0].to_csv(covid_data_file, sep = ',')
# Read CSV Data From Data Table for Further Analysis
covid_data = pd.read_csv("covid_data.csv")
So, after storing all the data in CSV format, let's convert the data into a DataFrame and print the whole thing:
# Store 'CSV' Data into 'DataFrame' Format
vaccineDF = pd.DataFrame(covid_data)
vaccineDF = vaccineDF.drop(columns=["Unnamed: 0"])  # drop the unnecessary index column from the dataset
# Print Whole Dataset
vaccineDF
# Output of above cell:-
Unnamed: 0_level_0 Doses administered Doses administered.1 Pct. of population Pct. of population.1
0 Unnamed: 0_level_1 Per 100 people Total Vaccinated Fully vaccinated
1 World 11 877933955 – –
2 Israel 116 10307583 60% 56%
3 Seychelles 116 112194 68% 47%
4 U.A.E. 99 9489684 – –
... ... ... ... ... ...
154 Syria <0.1 2500 <0.1% –
155 Papua New Guinea <0.1 1081 <0.1% –
156 South Sudan <0.1 947 <0.1% –
157 Cameroon <0.1 400 <0.1% –
158 Zambia <0.1 106 <0.1% –
159 rows × 5 columns
From the output above we can see that we have successfully fetched the whole data table. Hope this solution helps.
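As a side note, read_html already returns DataFrames, so the CSV round trip above is only there to collapse the two-level header (and, as the output shows, it leaves the second header level behind as data row 0). A sketch that flattens the header directly instead:
# flatten the MultiIndex columns in place, no CSV round trip needed
vaccineDF = covid_data_tables[0].copy()
vaccineDF.columns = [' '.join(str(level) for level in col).strip() for col in vaccineDF.columns]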
OWID provides this data, which effectively comes from JHU. If you want the latest vaccination data by country, it's simple to use the CSV interface:
import io
import pandas as pd
import requests

dfraw = pd.read_csv(io.StringIO(requests.get("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv").text))
dfraw["date"] = pd.to_datetime(dfraw["date"])
# latest row per country: sort by date within iso_code, then take the last
dfraw.sort_values(["iso_code", "date"]).groupby("iso_code", as_index=False).last()
