I am trying to import all 9 columns of the popular MPG dataset from UCI from a URL. The problem is that, instead of the string values showing, Carname (the ninth column) is populated by NaN.
What is going wrong and how can one fix this? The link to the repository shows that the original dataset has 9 columns, so this should work.
From the URL we find that the data looks like
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
with unique string values in the Carname column, but when we import it as
import pandas as pd
# Import raw dataset from URL
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower',
                'Weight', 'Acceleration', 'Model Year', 'Origin', 'Carname']
data = pd.read_csv(url, names=column_names,
                   na_values='?', comment='\t',
                   sep=' ', skipinitialspace=True)
data.head(3)
yielding (with NaN values in Carname)
MPG Cylinders Displacement Horsepower Weight Acceleration Model Year Origin Carname
0 18.0 8 307.0 130.0 3504.0 12.0 70 1 NaN
1 15.0 8 350.0 165.0 3693.0 11.5 70 1 NaN
It’s literally in your read_csv call: comment='\t'. The only tabs in the file are the ones before the Carname field, which means the way you read the file explicitly ignores that column.
You can remove the comment parameter and use the more generic separator \s+ instead to split on any whitespace (one or more spaces, a tab, etc.). The quoted Carname values stay intact because read_csv still honours the default quotechar:
>>> pd.read_csv(url, names=column_names, na_values='?', sep='\s+')
MPG Cylinders Displacement Horsepower Weight Acceleration Model Year Origin Carname
0 18.0 8 307.0 130.0 3504.0 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693.0 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150.0 3436.0 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150.0 3433.0 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140.0 3449.0 10.5 70 1 ford torino
.. ... ... ... ... ... ... ... ... ...
393 27.0 4 140.0 86.0 2790.0 15.6 82 1 ford mustang gl
394 44.0 4 97.0 52.0 2130.0 24.6 82 2 vw pickup
395 32.0 4 135.0 84.0 2295.0 11.6 82 1 dodge rampage
396 28.0 4 120.0 79.0 2625.0 18.6 82 1 ford ranger
397 31.0 4 119.0 82.0 2720.0 19.4 82 1 chevy s-10
[398 rows x 9 columns]
Related
The small reproducible example below sets up a dataframe that is 100 years in length and contains some randomly generated values. It then inserts three 100-day stretches of missing values. Using this small example, I am attempting to sort out the pandas commands that will fill in the missing days using average values for that day of the year (hence the use of .groupby), with one condition. For example, if April 12th is missing, how can the last line of code be altered such that only the 10 nearest April 12ths are used to fill in the missing value? In other words, a missing April 12th value in 1920 would be filled in using the mean of the April 12th values between 1915 and 1925; a missing April 12th value in 2000 would be filled in with the mean of the April 12th values between 1995 and 2005, and so on. I tried playing around with adding a .rolling() to the lambda function in the last line of the script, but was unsuccessful in my attempt.
Bonus question: The example below extends from 1918 to 2018. If a value is missing on April 12th 1919, for example, it would still be nice if ten April 12ths were used to fill in the missing value, even though the window couldn't be 'centered' on the missing day because of its proximity to the beginning of the time series. Is there a solution to the first question above that is flexible enough to still use a minimum of 10 values when missing values fall close to the beginning or end of the time series?
import pandas as pd
import numpy as np
import random
# create 100 yr time series
dates = pd.date_range(start="1918-01-01", end="2018-12-31").strftime("%Y-%m-%d")
vals = [random.randrange(1, 50, 1) for i in range(len(dates))]
# Create some arbitrary gaps
vals[100:200] = vals[9962:10062] = vals[35895:35995] = [np.nan] * 100
# Create dataframe
df = pd.DataFrame(dict(
    list(
        zip(["Date", "vals"],
            [dates, vals])
    )
))
# confirm missing vals
df.iloc[95:105]
df.iloc[35890:35900]
# set a date index (for use by groupby)
df.index = pd.DatetimeIndex(df['Date'])
df['Date'] = df.index
# Need help restricting the mean to the 10 nearest same-days-of-the-year:
df['vals'] = df.groupby([df.index.month, df.index.day])['vals'].transform(lambda x: x.fillna(x.mean()))
This answers both parts:
- build a DF dfr that holds the calculation you want
- the lambda function returns a dict {year: val, ...}
- make sure the indexes are named in a reasonable way
- expand out the dict with apply(pd.Series)
- reshape by putting the year columns back into the index
- merge() the built DF with the original DF; the vals column contains the NaN and the 0 column is the value to fill with
- finally fillna()
import pandas as pd
import numpy as np
import random

# create 100 yr time series
dates = pd.date_range(start="1918-01-01", end="2018-12-31")
vals = [random.randrange(1, 50, 1) for i in range(len(dates))]
# Create some arbitrary gaps
vals[100:200] = vals[9962:10062] = vals[35895:35995] = [np.nan] * 100
# Create dataframe - simplified from question...
df = pd.DataFrame({"Date": dates, "vals": vals})
df[df.isna().any(axis=1)]
ystart = df.Date.dt.year.min()
# generate rolling means for month/day. bfill for when it's the start of the series
dfr = (df.groupby([df.Date.dt.month, df.Date.dt.day])["vals"]
       .agg(lambda s: {y+ystart: v for y, v in enumerate(s.dropna().rolling(5).mean().bfill())})
       .to_frame().rename_axis(["month", "day"])
      )
# expand dict into columns and reshape to be indexed by month, day, year
dfr = dfr.join(dfr.vals.apply(pd.Series)).drop(columns="vals").rename_axis("year", axis=1).stack().to_frame()
# get df index back, plus vals & the fill value (column 0) can be seen alongside each other
dfm = df.merge(dfr, left_on=[df.Date.dt.month, df.Date.dt.day, df.Date.dt.year], right_index=True)
# finally what we really want to do - fill the NaNs
df.fillna(dfm[0])
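(Note that fillna returns a new frame rather than modifying df in place, so assign the result back, e.g. df = df.fillna(dfm[0]), if you want to keep the filled values.)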
Analysis: taking the NaN for 11-Apr-1918, the filled value is 22 because it is backfilled from 1921, where (12+2+47+47+2)/5 == 22.
dfm.query("key_0==4 & key_1==11").head(7)
      key_0  key_1  key_2                Date  vals     0
100       4     11   1918 1918-04-11 00:00:00   NaN  22.0
465       4     11   1919 1919-04-11 00:00:00  12.0  22.0
831       4     11   1920 1920-04-11 00:00:00   2.0  22.0
1196      4     11   1921 1921-04-11 00:00:00  47.0  27.0
1561      4     11   1922 1922-04-11 00:00:00  47.0  36.0
1926      4     11   1923 1923-04-11 00:00:00   2.0  34.6
2292      4     11   1924 1924-04-11 00:00:00  37.0  29.4
I'm not sure how closely I've captured the intent of your question. The approach I've taken satisfies two requirements:
- an arbitrary number of averages can be used
- those averages are used to fill in the NAs
Simply put, instead of filling in the NAs with the dates before and after, I fill them in with averages extracted from any number of years.
import pandas as pd
import numpy as np
import random
# create 100 yr time series
dates = pd.date_range(start="1918-01-01", end="2018-12-31").strftime("%Y-%m-%d")
vals = [random.randrange(1, 50, 1) for i in range(len(dates))]
# Create some arbitrary gaps
vals[100:200] = vals[9962:10062] = vals[35895:35995] = [np.nan] * 100
# Create dataframe
df = pd.DataFrame(dict(
    list(
        zip(["Date", "vals"],
            [dates, vals])
    )
))
df['Date'] = pd.to_datetime(df['Date'])
df['mm-dd'] = df['Date'].apply(lambda x:'{:02}-{:02}'.format(x.month, x.day))
df['yyyy'] = df['Date'].apply(lambda x:'{:04}'.format(x.year))
df = df.iloc[:,1:].pivot(index='mm-dd', columns='yyyy')
df.columns = df.columns.droplevel(0)
df['nans'] = df.isnull().sum(axis=1)
df['10n_mean'] = df.iloc[:,:-1].sample(n=10, axis=1).mean(axis=1)
df['10n_mean'] = df['10n_mean'].round(1)
df.loc[df['nans'] >= 1]
yyyy 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 ... 2011 2012 2013 2014 2015 2016 2017 2018 nans 10n_mean
mm-dd
02-29 NaN NaN 34.0 NaN NaN NaN 2.0 NaN NaN NaN ... NaN 49.0 NaN NaN NaN 32.0 NaN NaN 76 21.6
04-11 NaN 43.0 12.0 28.0 29.0 28.0 1.0 38.0 11.0 3.0 ... 17.0 35.0 8.0 17.0 34.0 NaN 5.0 33.0 3 29.7
04-12 NaN 19.0 38.0 34.0 48.0 46.0 28.0 29.0 29.0 14.0 ... 41.0 16.0 9.0 39.0 8.0 NaN 1.0 12.0 3 21.3
04-13 NaN 33.0 26.0 47.0 21.0 26.0 20.0 16.0 11.0 7.0 ... 5.0 11.0 34.0 28.0 27.0 NaN 2.0 46.0 3 21.3
04-14 NaN 36.0 19.0 6.0 45.0 41.0 24.0 39.0 1.0 11.0 ... 30.0 47.0 45.0 14.0 48.0 NaN 16.0 8.0 3 24.7
df_mean = df.T.fillna(df['10n_mean'], downcast='infer').T
df_mean.loc[df_mean['nans'] >= 1]
yyyy 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 ... 2011 2012 2013 2014 2015 2016 2017 2018 nans 10n_mean
mm-dd
02-29 21.6 21.6 34.0 21.6 21.6 21.6 2.0 21.6 21.6 21.6 ... 21.6 49.0 21.6 21.6 21.6 32.0 21.6 21.6 76.0 21.6
04-11 29.7 43.0 12.0 28.0 29.0 28.0 1.0 38.0 11.0 3.0 ... 17.0 35.0 8.0 17.0 34.0 29.7 5.0 33.0 3.0 29.7
04-12 21.3 19.0 38.0 34.0 48.0 46.0 28.0 29.0 29.0 14.0 ... 41.0 16.0 9.0 39.0 8.0 21.3 1.0 12.0 3.0 21.3
04-13 21.3 33.0 26.0 47.0 21.0 26.0 20.0 16.0 11.0 7.0 ... 5.0 11.0 34.0 28.0 27.0 21.3 2.0 46.0 3.0 21.3
04-14 24.7 36.0 19.0 6.0 45.0 41.0 24.0 39.0 1.0 11.0 ... 30.0 47.0 45.0 14.0 48.0 24.7 16.0 8.0 3.0 24.7
I want to summarize the data in a dataframe and add the new columns to another dataframe. My data contains apartments with an ID number, and it has surface and volume values for each room in the apartment. What I want is a dataframe that summarizes this and gives me the total surface and volume per apartment. There are two conditions for the original dataframe:
- the dataframe can contain empty cells
- when the values of surface or volume are equal for all of the rows within an ID
  (so all the same values for the same ID), the data (surface, volume) is not summed,
  but one value/row is passed to the new summary column (example: 'ID 4'), as this
  could be a mistake in the original dataframe where the government employee inserted
  the total surface/volume for all the rooms
Initial dataframe 'data':
print(data)
ID Surface Volume
0 2 10.0 25.0
1 2 12.0 30.0
2 2 24.0 60.0
3 2 8.0 20.0
4 4 84.0 200.0
5 4 84.0 200.0
6 4 84.0 200.0
7 52 NaN NaN
8 52 96.0 240.0
9 95 8.0 20.0
10 95 6.0 15.0
11 95 12.0 30.0
12 95 30.0 75.0
13 95 12.0 30.0
Desired output from 'df':
print(df)
ID Surface Volume
0 2 54.0 135.0
1 4 84.0 200.0 #-> as the values are the same for each row of this ID in the original data, the sum is not taken, but only one of the rows is passed (see the second condition)
2 52 96.0 240.0
3 95 68.0 170.0
Tried code:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [2,4,52,95]})
data = pd.DataFrame({"ID": [2,2,2,2,4,4,4,52,52,95,95,95,95,95],
                     "Surface": [10,12,24,8,84,84,84,np.nan,96,8,6,12,30,12],
                     "Volume": [25,30,60,20,200,200,200,np.nan,240,20,15,30,75,30]})
print(data)
#Tried something, but no idea how to do this actually:
df["Surface"] = data.groupby("ID").agg(sum)
df["Volume"] = data.groupby("ID").agg(sum)
print(df)
Two conditions are needed here. The first tests, for each column separately, whether a group contains only a single unique value, using GroupBy.transform with DataFrameGroupBy.nunique and comparing the result to 1 with eq. The second uses DataFrame.duplicated on each column joined with the ID column.
Chain both masks with & for bitwise AND, replace the matched values with NaN using DataFrame.mask, and finally aggregate with sum:
cols = ['Surface','Volume']
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())
df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum().reset_index()
print(df)
ID Surface Volume
0 2 54.0 135.0
1 4 84.0 200.0
2 52 96.0 240.0
3 95 68.0 170.0
If you need new columns filled with the aggregated sum values, use GroupBy.transform:
cols = ['Surface','Volume']
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())
data[cols] = data[cols].mask(m1 & m2).groupby(data["ID"]).transform('sum')
print(data)
ID Surface Volume
0 2 54.0 135.0
1 2 54.0 135.0
2 2 54.0 135.0
3 2 54.0 135.0
4 4 84.0 200.0
5 4 84.0 200.0
6 4 84.0 200.0
7 52 96.0 240.0
8 52 96.0 240.0
9 95 68.0 170.0
10 95 68.0 170.0
11 95 68.0 170.0
12 95 68.0 170.0
13 95 68.0 170.0
I have a dataframe 500 rows long by 4 columns. I need to find the proper Python code that would divide the current row by the row below and then multiply that value by the value in the last row, for every value in each column. Basically, I need to replicate this Excel formula.
It's not clear whether your data is stored in a NumPy array. If it were, with the original data contained in a, you'd write
b = a[-1]*(a[:-1]/a[+1:])
where a[-1] is the last row, a[:-1] is the array without the last row, and a[+1:] is the array without the first row (the one at index zero).
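For instance, a minimal sketch with made-up numbers (the array a below is just an illustration, not your data):
import numpy as np

# made-up 4x2 array standing in for the real data
a = np.array([[10., 20.],
              [ 5., 10.],
              [ 2.,  5.],
              [ 4.,  8.]])

# each row divided by the row below it, scaled by the last row
b = a[-1] * (a[:-1] / a[1:])
print(b)
# [[ 8. 16.]
#  [10. 16.]
#  [ 2.  5.]]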
Assuming you are talking about a pandas DataFrame:
import pandas as pd
import random
# sample DataFrame object
df = pd.DataFrame((float(random.randint(1, 100)),
                   float(random.randint(1, 100)),
                   float(random.randint(1, 100)),
                   float(random.randint(1, 100)))
                  for _ in range(10))

def function(col):
    for i in range(len(col)-1):
        col[i] = (col[i]/col[i+1])*col[len(col)-1]
print(df) # before formula apply
df.apply(function)
print(df) # after formula apply
>>>
0 1 2 3
0 10.0 78.0 27.0 23.0
1 72.0 42.0 77.0 86.0
2 82.0 12.0 58.0 98.0
3 27.0 92.0 19.0 86.0
4 48.0 83.0 14.0 43.0
5 55.0 18.0 58.0 77.0
6 20.0 58.0 20.0 22.0
7 76.0 19.0 63.0 82.0
8 23.0 99.0 58.0 15.0
9 60.0 57.0 89.0 100.0
0 1 2 3
0 8.333333 105.857143 31.207792 26.744186
1 52.682927 199.500000 118.155172 87.755102
2 182.222222 7.434783 271.684211 113.953488
3 33.750000 63.180723 120.785714 200.000000
4 52.363636 262.833333 21.482759 55.844156
5 165.000000 17.689655 258.100000 350.000000
6 15.789474 174.000000 28.253968 26.829268
7 198.260870 10.939394 96.672414 546.666667
8 23.000000 99.000000 58.000000 15.000000
9 60.000000 57.000000 89.000000 100.000000
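If you'd rather avoid the explicit loop, a vectorized sketch of the same formula (assuming, as the loop above effectively does, that the last row is left unchanged because it has no row below it):
# divide each row by the row below it (shift(-1)), then scale column-wise by the last row
result = df.div(df.shift(-1)).mul(df.iloc[-1], axis='columns')
# the final row has nothing below it; keep it as-is, like the loop does
result.iloc[-1] = df.iloc[-1]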
I'm trying to create a data frame where I add duplicates as variants in a column. To further illustrate my question:
I have a pandas dataframe like this:
Case ButtonAsInteger
0 1 130
1 1 133
2 1 42
3 2 165
4 2 158
5 2 157
6 3 158
7 3 159
8 3 157
9 4 130
10 4 133
11 4 43
... ... ...
I have converted it into this form:
grouped = activity2.groupby(['Case'])
values = grouped['ButtonAsInteger'].agg('sum')
id_df = grouped['ButtonAsInteger'].apply(lambda x: pd.Series(x.values)).unstack(level=-1)
0 1 2 3 4 5 6 7 8 9
Case
1 130.0 133.0 42.0 52.0 47.0 47.0 32.0 94.0 NaN NaN
2 165.0 158.0 157.0 141.0 142.0 142.0 142.0 142.0 142.0 147.0
3 158.0 159.0 157.0 147.0 166.0 170.0 169.0 130.0 133.0 133.0
4 130.0 133.0 42.0 52.0 47.0 47.0 32.0 94.0 NaN NaN
And now I want to find duplicates and mark each duplicate as a variant. So in this example, Case 1 and 4 should get variant 1. Like this:
Variants 0 1 2 3 4 5 6 7 8 9
Case
1 1 130.0 133.0 42.0 52.0 47.0 47.0 32.0 94.0 NaN NaN
2 2 165.0 158.0 157.0 141.0 142.0 142.0 142.0 142.0 142.0 147.0
3 3 158.0 159.0 157.0 147.0 166.0 170.0 169.0 130.0 133.0 133.0
4 1 130.0 133.0 42.0 52.0 47.0 47.0 32.0 94.0 NaN NaN
I have already tried this method https://stackoverflow.com/a/44999009. But it doesn't work on my data frame. Unfortunately I don't know why.
It would probably be possible to apply a double for loop: for each row, look whether there is a duplicate in the record. Whether this is efficient on a large record, I don't know.
I have also added my procedure with grouping, because perhaps there is a possibility to already work with duplicates at this point?
This groups by all columns and returns the group index (+ 1 because zero based indexing is the default). I think this should be what you want.
id_df['Variant'] = id_df.groupby(
    id_df.columns.values.tolist()).grouper.group_info[0] + 1
The resulting data frame, given your input data like above:
0 1 2 Variant
Case
1 130 133 42 1
2 165 158 157 3
3 158 159 157 2
4 130 133 42 1
There could be a syntactically nicer way to access the group index, but I didn't find one.
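On more recent pandas versions, GroupBy.ngroup() may be a slightly nicer spelling of the same idea; a sketch, as an alternative to the line above (dropna=False, available from pandas 1.1, keeps rows whose grouping columns contain NaN):
id_df['Variant'] = id_df.groupby(
    id_df.columns.values.tolist(), dropna=False).ngroup() + 1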
import pandas as pd
import numpy as np
import sys
auto = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
    names=['MPG', 'Cylinders', 'Displacement', 'Horse power',
           'Weight', 'Acceleration', 'Model Year', 'Origin', 'Car Name']
)
auto.head()
I need to clean up this data but I keep getting this output and need a bit of help. Beginner here and I can't figure it out.
If you look at the file, the separator is not constant but a varying amount of whitespace. sep='\s+' gives the desired output.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
df = pd.read_csv(url, sep='\s+',
                 names=['MPG', 'Cylinders', 'Displacement', 'Horse power', 'Weight',
                        'Acceleration', 'Model Year', 'Origin', 'Car Name'])
df.head()
MPG Cylinders Displacement Horse power Weight Acceleration Model Year Origin Car Name
0 18 8 307 130.0 3504 12.0 70 1 chevrolet chevelle malibu
1 15 8 350 165.0 3693 11.5 70 1 buick skylark 320
2 18 8 318 150.0 3436 11.0 70 1 plymouth satellite
3 16 8 304 150.0 3433 12.0 70 1 amc rebel sst
4 17 8 302 140.0 3449 10.5 70 1 ford torino
Use the delim_whitespace argument:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
cols = ['MPG', 'Cylinders', 'Displacement', 'Horse power', 'Weight',
        'Acceleration', 'Model Year', 'Origin', 'Car Name']
auto = pd.read_csv(url, names=cols, delim_whitespace=True)
auto.head()
Out:
MPG Cylinders Displacement Horse power Weight Acceleration \
0 18.0 8 307.0 130.0 3504.0 12.0
1 15.0 8 350.0 165.0 3693.0 11.5
2 18.0 8 318.0 150.0 3436.0 11.0
3 16.0 8 304.0 150.0 3433.0 12.0
4 17.0 8 302.0 140.0 3449.0 10.5
Model Year Origin Car Name
0 70 1 chevrolet chevelle malibu
1 70 1 buick skylark 320
2 70 1 plymouth satellite
3 70 1 amc rebel sst
4 70 1 ford torino
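As a side note (worth checking against the pandas version you are running): recent pandas releases deprecate delim_whitespace in favour of the equivalent sep=r'\s+', so the same call can be written as:
auto = pd.read_csv(url, names=cols, sep=r'\s+')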