import pandas as pd
import numpy as np
import sys
auto = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
    names=['MPG', 'Cylinders', 'Displacement', 'Horse power',
           'Weight', 'Acceleration', 'Model Year', 'Origin', 'Car Name']
)
auto.head()
I need to clean up this data but I keep getting this output and need a bit of help. Beginner here and I can't figure it out.
If you look at the file, the separator is not constant but a varying number of spaces. sep=r'\s+' gives the desired output.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
df = pd.read_csv(url, sep=r'\s+',
                 names=['MPG', 'Cylinders', 'Displacement', 'Horse power', 'Weight',
                        'Acceleration', 'Model Year', 'Origin', 'Car Name'])
df.head()
MPG Cylinders Displacement Horse power Weight Acceleration Model Year Origin Car Name
0 18 8 307 130.0 3504 12.0 70 1 chevrolet chevelle malibu
1 15 8 350 165.0 3693 11.5 70 1 buick skylark 320
2 18 8 318 150.0 3436 11.0 70 1 plymouth satellite
3 16 8 304 150.0 3433 12.0 70 1 amc rebel sst
4 17 8 302 140.0 3449 10.5 70 1 ford torino
Use the delim_whitespace argument:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
cols = ['MPG', 'Cylinders', 'Displacement', 'Horse power', 'Weight',
'Acceleration', 'Model Year', 'Origin', 'Car Name']
auto = pd.read_csv(url, names=cols, delim_whitespace=True)
auto.head()
Out:
MPG Cylinders Displacement Horse power Weight Acceleration \
0 18.0 8 307.0 130.0 3504.0 12.0
1 15.0 8 350.0 165.0 3693.0 11.5
2 18.0 8 318.0 150.0 3436.0 11.0
3 16.0 8 304.0 150.0 3433.0 12.0
4 17.0 8 302.0 140.0 3449.0 10.5
Model Year Origin Car Name
0 70 1 chevrolet chevelle malibu
1 70 1 buick skylark 320
2 70 1 plymouth satellite
3 70 1 amc rebel sst
4 70 1 ford torino
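Note that delim_whitespace is deprecated since pandas 2.2 in favor of sep. An equivalent call, as a sketch reusing the url and cols defined above:
# delim_whitespace=True is deprecated in pandas >= 2.2; sep=r'\s+' is the equivalent
auto = pd.read_csv(url, names=cols, sep=r'\s+')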
The following is the structure of one of my dataframes:
strike coi chgcoi
120 200 20
125 210 15
130 230 12
135 240 9
and the other one is:
strike poi chgpoi
125 210 15
130 230 12
135 240 9
140 225 12
What I want is:
strike coi chgcoi strike poi chgpoi
120 200 20 120 0 0
125 210 15 125 210 15
130 230 12 130 230 12
135 240 9 135 240 9
140 0 0 140 225 12
First, create the two dataframes with pandas:
df1 = pd.DataFrame({'column_1': [val_1, val_2, ..., val_n], 'column_2': [val_1, val_2, ..., val_n]})
df2 = pd.DataFrame({'column_1': [val_1, val_2, ..., val_n], 'column_2': [val_1, val_2, ..., val_n]})
Then you can use an outer join:
df1.merge(df2, on='common_column_name', how='outer')
db1
strike coi chgcoi
0 120 200 20
1 125 210 15
2 130 230 12
3 135 240 9
db2
strike poi chgpoi
0 125 210 15
1 130 230 12
2 135 240 9
3 140 225 12
merge = db1.merge(db2,how="outer",on='strike')
merge
strike coi chgcoi poi chgpoi
0 120 200.0 20.0 NaN NaN
1 125 210.0 15.0 210.0 15.0
2 130 230.0 12.0 230.0 12.0
3 135 240.0 9.0 240.0 9.0
4 140 NaN NaN 225.0 12.0
merge.fillna(0)
strike coi chgcoi poi chgpoi
0 120 200.0 20.0 0.0 0.0
1 125 210.0 15.0 210.0 15.0
2 130 230.0 12.0 230.0 12.0
3 135 240.0 9.0 240.0 9.0
4 140 0.0 0.0 225.0 12.0
This is your expected result, with the only difference being that 'strike' is not repeated.
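If you really do need the key to appear twice, as in your sketch, you could insert a copy of the column after filling. A minimal sketch, assuming the merged frame above ('strike2' is just an illustrative name, since pandas won't insert a duplicate column name by default):
out = merge.fillna(0)
out.insert(3, 'strike2', out['strike'])  # place the copy between chgcoi and poi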
I have a simple task but am struggling with it in Python.
I have a df with a "Freq" column (the total at the beginning); every year some units are removed from it. Could you help me build a for loop that returns the remaining amount for a particular year?
df = pd.DataFrame({'Delivery Year' : [1976,1977,1978,1979], "Freq" : [120,100,80,60],
"1976" : [10,float('nan'),float('nan'),float('nan')],
"1977" : [5,3,float('nan'),float('nan')],
"1978" : [10,float('nan'),8,float('nan')],
"1979" : [13,10,5,14]
})
df
My attempt, however, is not working:
# Remaining in use
for i in df.columns[2:len(df.columns)]:
    df[i] = df[i-1] - df[i]
Desired output:
df = pd.DataFrame({'Delivery Year' : [1976,1977,1978,1979], "Freq" : [120,100,80,60],
"1976" : [110,100,80,60],
"1977" : [105,97,80,60],
"1978" : [95,97,72,60],
"1979" : [82,87,67,46]
})
df
You can calculate the cumulative sum along the columns axis, then subtract it from the Freq column to get the amount still in use for each year:
s = df.iloc[:, 2:].fillna(0).cumsum(1).rsub(df['Freq'], axis=0)
df.assign(**s)
Delivery Year Freq 1976 1977 1978 1979
0 1976 120 110.0 105.0 95.0 82.0
1 1977 100 100.0 97.0 97.0 87.0
2 1978 80 80.0 80.0 72.0 67.0
3 1979 60 60.0 60.0 60.0 46.0
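For readability, the same chain can be unpacked into steps. A sketch with the same logic as above:
yearly = df.iloc[:, 2:].fillna(0)     # yearly removals, treating NaN as 0
removed = yearly.cumsum(axis=1)       # running total removed, per row
s = removed.rsub(df['Freq'], axis=0)  # Freq minus the running total
df.assign(**s)                        # write the results back by column name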
Try (per row: subtract the running total of removals from Freq, carry values forward across NaN years, and fall back to Freq before the first removal):
df[df.columns[2:]] = df[df.columns[1:]].apply(
    lambda x: pd.Series(x['Freq'] - x[1:].cumsum()).ffill().fillna(x['Freq']),
    axis=1
)
Output:
Delivery Year Freq 1976 1977 1978 1979
0 1976 120 110.0 105.0 95.0 82.0
1 1977 100 100.0 97.0 97.0 87.0
2 1978 80 80.0 80.0 72.0 67.0
3 1979 60 60.0 60.0 60.0 46.0
Here is how you would do it in a loop, but as @Shubham Sharma suggested, there is no need for looping when you can use pandas directly:
cols = df.columns[2:len(df.columns)]
for index, col in enumerate(cols):
    # 'Freq' for the first year column, otherwise the previous year's column
    sub_from = df.columns[2 + (index - 1)]
    print('col: ', col, 'Sub From: ', sub_from)  # debug output, safe to remove
    df[col] = (df[sub_from] - df[col]).fillna(df[sub_from])
I am trying to import all 9 columns of the popular MPG dataset from UCI from a URL. The problem is that, instead of the string values showing, Carname (the ninth column) is populated by NaN.
What is going wrong and how can one fix this? The link to the repository shows that the original dataset has 9 columns, so this should work.
From the URL we find that the data looks like
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
with unique string values in the Carname column, but when we import it as
import pandas as pd
# Import raw dataset from URL
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower',
                'Weight', 'Acceleration', 'Model Year', 'Origin', 'Carname']
data = pd.read_csv(url, names=column_names,
                   na_values='?', comment='\t',
                   sep=' ', skipinitialspace=True)
data.head(3)
yielding (with NaN values in Carname)
MPG Cylinders Displacement Horsepower Weight Acceleration Model Year Origin Carname
0 18.0 8 307.0 130.0 3504.0 12.0 70 1 NaN
1 15.0 8 350.0 165.0 3693.0 11.5 70 1 NaN
It's literally in your read_csv call: comment='\t'. The only tabs are before the Carname field, which means the way you read the file explicitly ignores that column.
You can remove the comment parameter and use the more generic separator \s+ instead to split on any whitespace (one or more spaces, a tab, etc.):
>>> pd.read_csv(url, names=column_names, na_values='?', sep=r'\s+')
MPG Cylinders Displacement Horsepower Weight Acceleration Model Year Origin Carname
0 18.0 8 307.0 130.0 3504.0 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693.0 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150.0 3436.0 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150.0 3433.0 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140.0 3449.0 10.5 70 1 ford torino
.. ... ... ... ... ... ... ... ... ...
393 27.0 4 140.0 86.0 2790.0 15.6 82 1 ford mustang gl
394 44.0 4 97.0 52.0 2130.0 24.6 82 2 vw pickup
395 32.0 4 135.0 84.0 2295.0 11.6 82 1 dodge rampage
396 28.0 4 120.0 79.0 2625.0 18.6 82 1 ford ranger
397 31.0 4 119.0 82.0 2720.0 19.4 82 1 chevy s-10
[398 rows x 9 columns]
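Keeping na_values='?' also still matters here: the file marks missing Horsepower values with '?'. A quick sanity check, as a sketch assuming the call above:
df = pd.read_csv(url, names=column_names, na_values='?', sep=r'\s+')
df['Horsepower'].isna().sum()  # the original file has 6 such rows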
The data is sorted in descending order on column 'id' in the following dataframe -
id Name version copies price
6 MSFT 10.0 5 100
6 TSLA 10.0 10 200
6 ORCL 10.0 15 300
5 MSFT 10.0 20 400
5 TSLA 10.0 25 500
5 ORCL 10.0 30 600
4 MSFT 10.0 35 700
4 TSLA 10.0 40 800
4 ORCL 10.0 45 900
3 MSFT 5.0 50 1000
3 TSLA 5.0 55 1100
3 ORCL 5.0 60 1200
2 MSFT 5.0 65 1300
2 TSLA 5.0 70 1400
2 ORCL 5.0 75 1500
1 MSFT 15.0 80 1600
1 TSLA 15.0 85 1700
1 ORCL 15.0 90 1800
...
Based on the input 'n', I would like to filter the above data such that, if the input is '2', the resulting dataframe should look like -
Name version copies price
MSFT 10.0 5 100
TSLA 10.0 10 200
ORCL 10.0 15 300
MSFT 10.0 20 400
TSLA 10.0 25 500
ORCL 10.0 30 600
MSFT 5.0 50 1000
TSLA 5.0 55 1100
ORCL 5.0 60 1200
MSFT 5.0 65 1300
TSLA 5.0 70 1400
ORCL 5.0 75 1500
MSFT 15.0 80 1600
TSLA 15.0 85 1700
ORCL 15.0 90 1800
Basically, only the top 'n' groups of 'id' for a specific version should be present in the resulting dataframe. If a version has fewer than n ids (e.g. version 15.0 has only one group, with id = 1), then all of its id groups should be present.
I tried using groupby and head, but it didn't work for me, and I have no other clue how to get this to work.
I really appreciate any help with this, thank you.
You can use groupby.transform on the version column and factorize the id column to get an incremental value (from 0 upward) for each distinct id per group, then compare it to your n and use loc with this mask to select the wanted rows.
n = 2
print(df.loc[df.groupby('version')['id'].transform(lambda x: pd.factorize(x)[0])<n])
id Name version copies price
0 6 MSFT 10.0 5 100
1 6 TSLA 10.0 10 200
2 6 ORCL 10.0 15 300
3 5 MSFT 10.0 20 400
4 5 TSLA 10.0 25 500
5 5 ORCL 10.0 30 600
9 3 MSFT 5.0 50 1000
10 3 TSLA 5.0 55 1100
11 3 ORCL 5.0 60 1200
12 2 MSFT 5.0 65 1300
13 2 TSLA 5.0 70 1400
14 2 ORCL 5.0 75 1500
15 1 MSFT 15.0 80 1600
16 1 TSLA 15.0 85 1700
17 1 ORCL 15.0 90 1800
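To see why this works: within each version group, factorize assigns each distinct id an incremental code in order of appearance, so comparing the codes to n keeps the first n id groups. A minimal illustration on hypothetical data:
ids = pd.Series([6, 6, 6, 5, 5, 5, 4, 4, 4])  # the ids within version 10.0
pd.factorize(ids)[0]  # array([0, 0, 0, 1, 1, 1, 2, 2, 2]) -> keep rows where code < n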
Another option is to use groupby.head after drop_duplicates to keep the unique version-id couples, then select those version-id pairs with a merge:
df.merge(df[['version','id']].drop_duplicates().groupby('version').head(n))
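The same idea spelled out, as a sketch assuming df and n as above:
top_ids = df[['version', 'id']].drop_duplicates().groupby('version').head(n)
result = df.merge(top_ids)  # inner join on the shared ['version', 'id'] columns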
I have data
customer_id purchase_amount date_of_purchase
0 760 25.0 06-11-2009
1 860 50.0 09-28-2012
2 1200 100.0 10-25-2005
3 1420 50.0 09-07-2009
4 1940 70.0 01-25-2013
5 1960 40.0 10-29-2013
6 2620 30.0 09-03-2006
7 3050 50.0 12-04-2007
8 3120 150.0 08-11-2006
9 3260 45.0 10-20-2010
10 3510 35.0 04-05-2013
11 3970 30.0 07-06-2007
12 4000 20.0 11-25-2005
13 4180 20.0 09-22-2010
14 4390 30.0 04-15-2011
15 4750 60.0 02-12-2013
16 4840 30.0 10-14-2005
17 4910 15.0 12-13-2006
18 4950 50.0 05-19-2010
19 4970 30.0 01-12-2006
20 5250 50.0 12-20-2005
Now I want to subtract each date_of_purchase from 01-01-2016.
I tried the following; I should end up with a new column days_since holding the number of days.
NOW = pd.to_datetime('01/01/2016').strftime('%m-%d-%Y')
gb = customer_purchases_df.groupby('customer_id')
df2 = gb.agg({'date_of_purchase': lambda x: (NOW - x.max()).days})
Any suggestion on how I can achieve this? Thanks in advance.
pd.to_datetime(df['date_of_purchase']).rsub(pd.to_datetime('2016-01-01')).dt.days
0 2395
1 1190
2 3720
3 2307
4 1071
5 794
6 3407
7 2950
8 3430
9 1899
10 1001
11 3101
12 3689
13 1927
14 1722
15 1053
16 3731
17 3306
18 2053
19 3641
20 3664
Name: date_of_purchase, dtype: int64
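To store the result as the days_since column, a sketch (the explicit format just makes the month-day-year parse unambiguous):
df['days_since'] = (
    pd.to_datetime(df['date_of_purchase'], format='%m-%d-%Y')
      .rsub(pd.Timestamp('2016-01-01'))
      .dt.days
)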
I'm assuming the 'date_of_purchase' column already has the datetime dtype.
>>> df
customer_id purchase_amount date_of_purchase
0 760 25.0 2009-06-11
1 860 50.0 2012-09-28
2 1200 100.0 2005-10-25
>>> df['days_since'] = df['date_of_purchase'].sub(pd.to_datetime('01/01/2016')).dt.days.abs()
>>> df
customer_id purchase_amount date_of_purchase days_since
0 760 25.0 2009-06-11 2395
1 860 50.0 2012-09-28 1190
2 1200 100.0 2005-10-25 3720
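Note that .abs() flips the sign of the otherwise-negative differences; equivalently, you can subtract in the other direction so the sign comes out positive without abs. A sketch, assuming the datetime dtype as above:
df['days_since'] = (pd.to_datetime('01/01/2016') - df['date_of_purchase']).dt.days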