How to data clean in groups - python

I have an extremely long dataframe with a lot of data which I have to clean so that I can proceed with data visualization. There are several things I have in mind that need to be done, and I can do each of them to a certain extent, but I don't know how, or whether it's even possible, to do them together.
This is what I have to do:
Find the highest arrival count every year and see if the mode of transport is by air, sea or land.
period arv_count Mode of arrival
0 2013-01 984350 Air
1 2013-01 129074 Sea
2 2013-01 178294 Land
3 2013-02 916372 Air
4 2013-02 125634 Sea
5 2013-02 179359 Land
6 2013-03 1026312 Air
7 2013-03 143194 Sea
8 2013-03 199385 Land
... ... ... ...
78 2015-03 940077 Air
79 2015-03 133632 Sea
80 2015-03 127939 Land
81 2015-04 939370 Air
82 2015-04 118120 Sea
83 2015-04 151134 Land
84 2015-05 945080 Air
85 2015-05 123136 Sea
86 2015-05 154620 Land
87 2015-06 930642 Air
88 2015-06 115631 Sea
89 2015-06 138474 Land
This is an example of what the data looks like. I don't know if it's necessary but I have created another column just for year like so:
def year_extract(year):
    return year.split('-')[0].strip()
df1 = pd.DataFrame(df['period'])
df1 = df1.rename(columns={'period':'Year'})
df1 = df1['Year'].apply(year_extract)
df1 = pd.DataFrame(df1)
df = pd.merge(df, df1, left_index= True, right_index= True)
I know how to use groupby and I know how to find a maximum, but I don't know if it is possible to find the maximum within a group, e.g. the highest arrival count in 2013, 2014, 2015, etc.
The data above is the total arrival count for all countries based on the mode of transport and period, but the original data also had hundreds of additional rows in which the region and country are stated, which I dropped because I don't know how to use or clean them. It looks like this:
period region country moa arv_count
2013-01 Total Total Air 984350
2013-01 Total Total Sea 129074
2013-01 Total Total Land 178294
2013-02 Total Total Air 916372
... ... ... ... ...
2015-12 AMERICAS USA Land 2698
2015-12 AMERICAS Canada Land 924
2013-01 ASIA China Air 136643
2013-01 ASIA India Air 55369
2013-01 ASIA Japan Air 51178
I would also like to make use of the region data if possible. I am hoping to create a clustered column chart with the 7 regions as the x axis and arrival count as the y axis, with each region showing the arrival count via land, sea and air, but I feel there is too much excess data that I don't know how to deal with right now.
For example, I don't know how to deal with the period and country columns, because all I need is the total arrival count by land, sea and air based on region and year, regardless of country and month.

I used this dataframe to test the code (the one in your question):
df = pd.DataFrame([['2013-01', 'Total', 'Total', 'Air', 984350],
                   ['2013-01', 'Total', 'Total', 'Sea', 129074],
                   ['2013-01', 'Total', 'Total', 'Land', 178294],
                   ['2013-02', 'Total', 'Total', 'Air', 916372],
                   ['2015-12', 'AMERICAS', 'USA', 'Land', 2698],
                   ['2015-12', 'AMERICAS', 'Canada', 'Land', 924],
                   ['2013-01', 'ASIA', 'China', 'Air', 136643],
                   ['2013-01', 'ASIA', 'India', 'Air', 55369],
                   ['2013-01', 'ASIA', 'Japan', 'Air', 51178]],
                  columns=['period', 'region', 'country', 'moa', 'arv_count'])
Here is the code to get the sum of arrival counts, by year, region and type (sea, land, air):
First add a 'year' column:
df['year'] = pd.to_datetime(df['period']).dt.year
Then group by (year, region, moa) and sum arv_count in each group:
df.groupby(['region', 'year', 'moa']).arv_count.sum()
Here is the output:
region year moa
AMERICAS 2015 Land 3622
ASIA 2013 Air 243190
Total 2013 Air 1900722
Land 178294
Sea 129074
I hope this is what you were looking for!
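The question also asked for the highest arrival count per year together with its mode of arrival; a minimal sketch using groupby + idxmax on the same example frame (assuming the columns shown above):

```python
import pandas as pd

df = pd.DataFrame([['2013-01', 'Total', 'Total', 'Air', 984350],
                   ['2013-01', 'Total', 'Total', 'Sea', 129074],
                   ['2013-01', 'Total', 'Total', 'Land', 178294],
                   ['2013-02', 'Total', 'Total', 'Air', 916372],
                   ['2015-12', 'AMERICAS', 'USA', 'Land', 2698]],
                  columns=['period', 'region', 'country', 'moa', 'arv_count'])

df['year'] = pd.to_datetime(df['period']).dt.year

# Index of the row with the largest arv_count within each year,
# then look those rows up to get both the count and the mode of arrival.
idx = df.groupby('year')['arv_count'].idxmax()
highest = df.loc[idx, ['year', 'moa', 'arv_count']]
```

`idxmax` returns one row label per group, so `highest` has exactly one row per year with the winning mode alongside the count.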

Related

Create multiple new pandas column based on other columns in a loop

Assuming I have the following toy dataframe, df:
Country Population Region HDI
China 100 Asia High
Canada 15 NAmerica V.High
Mexico 25 NAmerica Medium
Ethiopia 30 Africa Low
I would like to create new columns based on the population, region, and HDI of Ethiopia in a loop. I tried the following method, but it is time-consuming when a lot of columns are involved.
df['Population_2'] = df['Population'][df['Country'] == "Ethiopia"]
df['Region_2'] = df['Region'][df['Country'] == "Ethiopia"]
df['Population_2'].fillna(method='ffill')
My final DataFrame df should look like:
Country Population Region HDI Population_2 Region_2 HDI_2
China 100 Asia High 30 Africa Low
Canada 15 NAmerica V.High 30 Africa Low
Mexico 25 NAmerica Medium 30 Africa Low
Ethiopia 30 Africa Low 30 Africa Low
How about this?
for col in ['Population', 'Region', 'HDI']:
    df[col + '_2'] = df.loc[df.Country == 'Ethiopia', col].iat[0]
I don't quite understand the broader point of what you're trying to do, and if Ethiopia could have multiple values the solution might be different. But this works for the problem as you presented it.
You can use:
# select Ethiopia row and add suffix "_2" to the columns (except Country)
s = (df.drop(columns='Country')
       .loc[df['Country'].eq('Ethiopia')]
       .add_suffix('_2')
       .squeeze()
     )
# broadcast as new columns
df[s.index] = s
output:
Country Population Region HDI Population_2 Region_2 HDI_2
0 China 100 Asia High 30 Africa Low
1 Canada 15 NAmerica V.High 30 Africa Low
2 Mexico 25 NAmerica Medium 30 Africa Low
3 Ethiopia 30 Africa Low 30 Africa Low
You can use assign, also assuming that you have only one row corresponding to Ethiopia:
d = dict(zip(df.columns.drop('Country').map('{}_2'.format),
             df.set_index('Country').loc['Ethiopia']))
df = df.assign(**d)
print(df):
Country Population Region HDI Population_2 Region_2 HDI_2
0 China 100 Asia High 30 Africa Low
1 Canada 15 NAmerica V.High 30 Africa Low
2 Mexico 25 NAmerica Medium 30 Africa Low
3 Ethiopia 30 Africa Low 30 Africa Low

How do I print a simple Python statement based on Pandas dataframe?

Date        Train Number  Station 1  Station 2  Equipment Available?
2022-06-16  1111          North      Central    Y
2022-06-20  1111          North      Central    Y
2022-06-01  2222          North      South      Y
2022-06-02  2222          North      South      Y
2022-06-03  2222          North      South      Y
2022-06-04  2222          North      South      Y
2022-06-05  2222          North      South      Y
2022-06-06  2222          North      South      Y
2022-06-07  2222          North      South      Y
2022-06-08  2222          North      South      Y
I have a Pandas dataframe that looks like the one above that is sorted by Train Number and then Date. I would like to print a simple Python statement that says:
"For Train Number 1111 North to Central, we have equipment available on June 16th and June 20th.
For Train Number 2222 North to South, we have equipment available from June 1st to June 8th."
How am I able to do this?
I've made a little function which you can call on whatever df you want.
I find this solution more readable and flexible for further requests.
def equip_avail(df):
    for i in df['Train Number'].unique():
        date_start = df.Date.loc[df['Train Number'] == i].min()
        date_end = df.Date.loc[df['Train Number'] == i].max()
        from_start = df['Station 1'].loc[df['Train Number'] == i].values[0]
        to_end = df['Station 2'].loc[df['Train Number'] == i].values[0]
        print(f'For Train Number {i} {from_start} to {to_end}, we have equipment available from {date_start} to {date_end}.')
Then you call it like this:
equip_avail(df)
Result:
For Train Number 1111 North to Central, we have equipment available from 2022-06-16 to 2022-06-20.
For Train Number 2222 North to South, we have equipment available from 2022-06-01 to 2022-06-08.
You could get the min and max values for each Train's Date with a groupby, dedupe the DataFrame to get the other columns (as they are repeated) and then print the results with some datetime formatting
df.loc[:, 'Date'] = pd.to_datetime(df['Date'])
g = df.groupby(['Train Number']).agg(date_min=pd.NamedAgg(column='Date', aggfunc='min'),
                                     date_max=pd.NamedAgg(column='Date', aggfunc='max'))
df_deduped = df.loc[:, 'Train Number':].drop_duplicates().set_index('Train Number')
g = g.join(df_deduped, how='inner')
for index, values in g.reset_index().iterrows():
    print(f'For Train Number {values["Train Number"]}, {values["Station 1"]} to {values["Station 2"]}, we have equipment available from {values["date_min"].strftime("%b %d")} to {values["date_max"].strftime("%b %d")}')
The output is -
For Train Number 1111, North to Central, we have equipment available from Jun 16 to Jun 20
For Train Number 2222, North to South, we have equipment available from Jun 01 to Jun 08
Here is one way to do it: group by Train Number, Station 1 and Station 2, taking both the min and max of the dates, then print them out from the resulting df.
df2 = df.groupby(['TrainNumber', 'Station1', 'Station2'])['Date'].aggregate(['min', 'max']).reset_index()
for idx, row in df2.iterrows():
    print("For Train Number {0} {1} to {2}, we have equipment available on {3} and {4}".format(
        row[0], row[1], row[2], row[3], row[4]))
For Train Number 1111 North to Central, we have equipment available on 2022-06-16 and 2022-06-20
For Train Number 2222 North to South, we have equipment available on 2022-06-01 and 2022-06-08
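The expected text in the question actually distinguishes a continuous range ("from June 1st to June 8th") from separate dates ("on June 16th and June 20th"). A sketch of one way to pick the phrasing, assuming a "range" means every day between min and max appears in the data (an assumption, not stated in the question), and omitting the ordinal suffixes ("16th") since strftime has no portable code for them:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': (pd.to_datetime(['2022-06-16', '2022-06-20']).tolist()
             + pd.date_range('2022-06-01', '2022-06-08').tolist()),
    'Train Number': [1111, 1111] + [2222] * 8,
    'Station 1': ['North'] * 10,
    'Station 2': ['Central', 'Central'] + ['South'] * 8,
})

def fmt(d):
    # %d is zero-padded, so build "June 1" by hand instead
    return f"{d.strftime('%B')} {d.day}"

lines = []
for train, grp in df.groupby('Train Number'):
    start, end = grp['Date'].min(), grp['Date'].max()
    s1, s2 = grp['Station 1'].iat[0], grp['Station 2'].iat[0]
    if len(grp) == (end - start).days + 1:  # every day covered -> a range
        when = f"from {fmt(start)} to {fmt(end)}"
    else:                                   # gaps -> list the individual dates
        when = "on " + " and ".join(fmt(d) for d in grp['Date'])
    lines.append(f"For Train Number {train} {s1} to {s2}, we have equipment available {when}.")
```

This prints "on June 16 and June 20" for train 1111 and "from June 1 to June 8" for train 2222, matching the shape of the sentences in the question.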

How can I calculate percentage of a groupby column and sort it by descending order?

Question: How can I calculate percentage of a groupby column and sort it by descending order ?
Desired output:
country count percentage
United States 2555 45%
India 923 12%
United Kingdom 397 4%
Japan 226 3%
South Korea 183 2%
I did some research, looked at the Pandas documentation, and looked at other questions here on Stack Overflow, without luck.
I tried the following:
#1 Try:
Df2 = df.groupby('country')['show_id'].count().nlargest()
df3 = df2.groupby(level=0).apply(lambda x: x/x.sum() * 100)
Output:
director
A. L. Vijay 100.0
A. Raajdheep 100.0
A. Salaam 100.0
A.R. Murugadoss 100.0
Aadish Keluskar 100.0
...
Çagan Irmak 100.0
Ísold Uggadóttir 100.0
Óskar Thór Axelsson 100.0
Ömer Faruk Sorak 100.0
Şenol Sönmez 100.0
Name: show_id, Length: 4049, dtype: float64
#2 Try:
df2 = df.groupby('country')['show_id'].count()
df2['percentage'] = df2['show_id']/6000
Output:
KeyError: 'show_id'
Sample of the dataset:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'show_id': ['81145628', '80117401', '70234439'],
    'type': ['Movie', 'Movie', 'TV Show'],
    'title': ['Norm of the North: King Sized Adventure',
              'Jandino: Whatever it Takes',
              'Transformers Prime'],
    'director': ['Richard Finn, Tim Maltby', np.nan, np.nan],
    'cast': ['Alan Marriott, Andrew Toth, Brian Dobson',
             'Jandino Asporaat',
             'Peter Cullen, Sumalee Montano, Frank Welker'],
    'country': ['United States, India, South Korea, China',
                'United Kingdom',
                'United States'],
    'date_added': ['September 9, 2019',
                   'September 9, 2016',
                   'September 8, 2018'],
    'release_year': ['2019', '2016', '2013'],
    'rating': ['TV-PG', 'TV-MA', 'TV-Y7-FV'],
    'duration': ['90 min', '94 min', '1 Season'],
    'listed_in': ['Children & Family Movies, Comedies',
                  'Stand-Up Comedy', 'Kids TV'],
    'description': ['Before planning an awesome wedding for his',
                    'Jandino Asporaat riffs on the challenges of ra',
                    'With the help of three human allies, the Autob']})
This doesn't address rows where there are multiple countries in the "country" field, but the lines below should work for the other parts of the question:
Create initial dataframe:
df = pd.DataFrame({
    'show_id': ['81145628', '80117401', '70234439'],
    'type': ['Movie', 'Movie', 'TV Show'],
    'title': ['Norm of the North: King Sized Adventure',
              'Jandino: Whatever it Takes',
              'Transformers Prime'],
    'director': ['Richard Finn, Tim Maltby', 0, 0],
    'cast': ['Alan Marriott, Andrew Toth, Brian Dobson',
             'Jandino Asporaat',
             'Peter Cullen, Sumalee Montano, Frank Welker'],
    'country': ['United States, India, South Korea, China',
                'United Kingdom',
                'United States'],
    'date_added': ['September 9, 2019',
                   'September 9, 2016',
                   'September 8, 2018'],
    'release_year': ['2019', '2016', '2013'],
    'rating': ['TV-PG', 'TV-MA', 'TV-Y7-FV'],
    'duration': ['90 min', '94 min', '1 Season'],
    'listed_in': ['Children & Family Movies, Comedies',
                  'Stand-Up Comedy', 'Kids TV'],
    'description': ['Before planning an awesome wedding for his',
                    'Jandino Asporaat riffs on the challenges of ra',
                    'With the help of three human allies, the Autob']})
Groupby country:
df2 = df.groupby(by="country", as_index=False)['show_id'].agg('count')
Rename agg column:
df2 = df2.rename(columns={'show_id':'count'})
Create percentage column:
df2['percent'] = (df2['count']*100)/df2['count'].sum()
Sort descending:
df2 = df2.sort_values(by='percent', ascending=False)
Part of the issue in your Attempt #1 may have been that you didn't include the "by" parameter in your groupby function.
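For the rows where several countries share one "country" cell, which the steps above don't handle, a sketch using str.split plus explode so that each country gets its own row before counting; note this changes the denominator, since one title then counts once per country:

```python
import pandas as pd

df = pd.DataFrame({
    'show_id': ['81145628', '80117401', '70234439'],
    'country': ['United States, India, South Korea, China',
                'United Kingdom', 'United States'],
})

# One row per (show, country): split the comma-separated list, then explode.
exploded = df.assign(country=df['country'].str.split(', ')).explode('country')
counts = exploded.groupby('country')['show_id'].count().reset_index(name='count')
counts['percent'] = counts['count'] * 100 / counts['count'].sum()
counts = counts.sort_values('percent', ascending=False)
```

With this sample, "United States" appears in two titles and tops the sorted output; every other country appears once.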
newDF = pd.DataFrame(DF.Country.value_counts())
newDF['percentage'] = round(pd.DataFrame(DF.Country.value_counts(normalize=True).mul(100)), 2)
newDF.columns = ['count', 'percentage']
newDF

create unique identifier in dataframe based on combination of columns, but only for duplicated rows

A corollary of the question here:
create unique identifier in dataframe based on combination of columns
In the foll. dataframe,
id Lat Lon Year Area State
50319 -36.0629 -62.3423 2019 90 Iowa
18873 -36.0629 -62.3423 2017 90 Iowa
18876 -36.0754 -62.327 2017 124 Illinois
18878 -36.0688 -62.3353 2017 138 Kansas
I want to create a new column which assigns a unique identifier based on whether the columns Lat, Lon and Area have the same values. E.g. in this case rows 1 and 2 have the same values in those columns and will be given the same unique identifier 0_Iowa where Iowa comes from the State column. However, if there is no duplicate for a row, then I just want to use the state name. The end result should look like this:
id Lat Lon Year Area State unique_id
50319 -36.0629 -62.3423 2019 90 Iowa 0_Iowa
18873 -36.0629 -62.3423 2017 90 Iowa 0_Iowa
18876 -36.0754 -62.327 2017 124 Illinois Illinois
18878 -36.0688 -62.3353 2017 138 Kansas Kansas
You can use np.where:
df['unique_id'] = np.where(df.duplicated(['Lat', 'Lon'], keep=False),
                           df.groupby(['Lat', 'Lon'], sort=False).ngroup().astype('str') + '_' + df['State'],
                           df['State'])
Or similar idea with pd.Series.where:
df['unique_id'] = (df.groupby(['Lat', 'Lon'], sort=False)
                     .ngroup().astype('str')
                     .add('_' + df['State'])
                     .where(df.duplicated(['Lat', 'Lon'], keep=False),
                            df['State'])
                   )
Output:
id Lat Lon Year Area State unique_id
0 50319 -36.0629 -62.3423 2019 90 Iowa 0_Iowa
1 18873 -36.0629 -62.3423 2017 90 Iowa 0_Iowa
2 18876 -36.0754 -62.3270 2017 124 Illinois Illinois
3 18878 -36.0688 -62.3353 2017 138 Kansas Kansas
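For anyone who wants to run the first variant directly, a self-contained version that rebuilds the example frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [50319, 18873, 18876, 18878],
    'Lat': [-36.0629, -36.0629, -36.0754, -36.0688],
    'Lon': [-62.3423, -62.3423, -62.327, -62.3353],
    'Year': [2019, 2017, 2017, 2017],
    'Area': [90, 90, 124, 138],
    'State': ['Iowa', 'Iowa', 'Illinois', 'Kansas'],
})

# Duplicated (Lat, Lon) rows get "<group number>_<State>"; unique rows keep State.
df['unique_id'] = np.where(
    df.duplicated(['Lat', 'Lon'], keep=False),
    df.groupby(['Lat', 'Lon'], sort=False).ngroup().astype('str') + '_' + df['State'],
    df['State'],
)
```

The `keep=False` flag marks every member of a duplicate group (not just the later ones), which is what makes both Iowa rows receive "0_Iowa".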

Plotting a graph from a CSV file

I'm new to Python. I have to plot graphs from a CSV file that I've created.
a) Monthly sales vs Product Price
b) Geographic Region vs No of customer
The code that I've implemented was
import pandas as pd
import matplotlib.pyplot as plot
import csv
data = pd.read_csv('dataset_books.csv')
data.hist(bins=90)
plot.xlim([0,115054])
plot.title("Data")
x = plot.xlabel("Monthly Sales")
y = plot.ylabel("Product Price")
plot.show()
The output that I'm getting is not what I expected and was not approved.
I need a horizontal histogram with a line plot.
Book ID Product Name Product Price Monthly Sales Shipping Type Geographic Region No Of Customer Who Bought the Product Customer Type
1 The Path to Power 486 2566.08 Free Gatton 4 Old
2 Touching Darkness (Midnighters, #2) 479 1264.56 Paid Hooker Creek 2 New
3 Star Wars: Lost Stars 456 1203.84 Paid Gladstone 2 New
4 Winter in Madrid 454 599.28 Paid Warruwi 1 New
5 Hairy Maclary from Donaldson's Dairy 442 2333.76 Free Mount Gambier 4 Old
6 Stand on Zanzibar 413 3816.12 Free Cessnock 7 Old
7 Marlfox 411 3797.64 Free Edinburgh 7 Old
8 The Matlock Paper 373 3446.52 Free Gladstone 7 Old
9 Tears of a Tiger 361 1906.08 Free Melbourne 4 Old
10 Star Wars: Vision of the Future 355 937.2 Paid Wagga Wagga 2 New
11 Nefes Nefese 344 454.08 Paid Gatton 1 New
this is my CSV file.
Can anyone help me?
Try this to check your column names:
df.columns
>> Index(['Book ID', 'Product Name', 'Product Price', 'Monthly Sales',
'Shipping Type', 'Geographic Region',
'No Of Customer Who Bought the Product', 'Customer Type'],
dtype='object')
Next, to plot a horizontal plot you need 'barh':
df['Product Price'].plot(kind='barh')
Another option to choose a column is 'iloc':
df.iloc[:, 2].plot(kind='barh')
It will generate the same output.
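For the "Geographic Region vs No of customer" chart specifically, a sketch that aggregates customers per region first and then draws the horizontal bars (column names taken from the CSV shown in the question; only a few sample rows are used here):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Geographic Region': ['Gatton', 'Hooker Creek', 'Gladstone',
                          'Gladstone', 'Melbourne'],
    'No Of Customer Who Bought the Product': [4, 2, 2, 7, 4],
})

# Sum customers per region, then plot the totals as horizontal bars.
per_region = (df.groupby('Geographic Region')['No Of Customer Who Bought the Product']
                .sum()
                .sort_values())
ax = per_region.plot(kind='barh')
ax.set_xlabel('No of customers')
ax.set_ylabel('Geographic Region')
plt.tight_layout()
```

In a full script you would load the real file with `pd.read_csv('dataset_books.csv')` instead of the inline frame; the groupby step is what turns per-book rows into one bar per region.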
