I have two numpy arrays of identical length (398 rows), with the first 5 values for each as follows:
y_predicted =
[[-0.85908649]
[-1.19176482]
[-0.93658361]
[-0.83557211]
[-0.80681243]]
y_norm =
mpg
0 -0.705551
1 -1.089379
2 -0.705551
3 -0.961437
4 -0.833494
That is, the first prints with square brackets around each value, while the second prints with an index and no brackets.
The data is a normalised version of the first column (MPG) of the Auto-MPG dataset. The y_predicted values are results of a linear regression.
https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
Would anyone know how I might convert these arrays to the same type so I can plot a scatter plot of them?
Both have shape: (398, 1)
Both have type: class 'numpy.ndarray', dtype float64
Data from the link provided
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
16.0 8 304.0 150.0 3433. 12.0 70 1 "amc rebel sst"
17.0 8 302.0 140.0 3449. 10.5 70 1 "ford torino"
15.0 8 429.0 198.0 4341. 10.0 70 1 "ford galaxie 500"
The second of these looks like a pandas Series (or a single-column DataFrame) to me. If so, you can use y_norm.values to get the underlying numpy array.
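As a minimal sketch (my own illustration, assuming y_norm is the pandas object shown above and y_predicted the (398, 1) numpy array), you could flatten both to 1-D and plot them directly:
import matplotlib.pyplot as plt
# .values (or .to_numpy()) returns the underlying numpy array; ravel() flattens
# both (398, 1) arrays to 1-D so the points pair up in the scatter plot
x = y_norm.values.ravel()
y = y_predicted.ravel()
plt.scatter(x, y)
plt.xlabel('y_norm (normalised MPG)')
plt.ylabel('y_predicted')
plt.show()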
Related
I am trying to import all 9 columns of the popular MPG dataset from UCI from a URL. The problem is that instead of showing the string values, Carname (the ninth column) is populated by NaN.
What is going wrong and how can one fix this? The link to the repository shows that the original dataset has 9 columns, so this should work.
From the URL we find that the data looks like
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
with unique string values in the Carname column, but when we import it as
import pandas as pd
# Import raw dataset from URL
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower',
                'Weight', 'Acceleration', 'Model Year', 'Origin', 'Carname']
data = pd.read_csv(url, names=column_names,
                   na_values='?', comment='\t',
                   sep=' ', skipinitialspace=True)
data.head(3)
yielding (with NaN values in Carname)
MPG Cylinders Displacement Horsepower Weight Acceleration Model Year Origin Carname
0 18.0 8 307.0 130.0 3504.0 12.0 70 1 NaN
1 15.0 8 350.0 165.0 3693.0 11.5 70 1 NaN
It’s literally in your read_csv call: comment='\t'. The only tabs in the file appear before the Carname field, which means the way you read the file explicitly ignores that column.
You can remove the comment parameter and use the more generic separator \s+ instead to split on any whitespace (one or more spaces, a tab, etc.):
>>> pd.read_csv(url, names=column_names, na_values='?', sep='\s+')
MPG Cylinders Displacement Horsepower Weight Acceleration Model Year Origin Carname
0 18.0 8 307.0 130.0 3504.0 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693.0 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150.0 3436.0 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150.0 3433.0 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140.0 3449.0 10.5 70 1 ford torino
.. ... ... ... ... ... ... ... ... ...
393 27.0 4 140.0 86.0 2790.0 15.6 82 1 ford mustang gl
394 44.0 4 97.0 52.0 2130.0 24.6 82 2 vw pickup
395 32.0 4 135.0 84.0 2295.0 11.6 82 1 dodge rampage
396 28.0 4 120.0 79.0 2625.0 18.6 82 1 ford ranger
397 31.0 4 119.0 82.0 2720.0 19.4 82 1 chevy s-10
[398 rows x 9 columns]
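One small aside (my addition, not required for the fix): since \s is not a valid escape sequence in a plain Python string, it is slightly cleaner to pass the separator as a raw string, which avoids escape-sequence warnings on newer Python versions:
>>> pd.read_csv(url, names=column_names, na_values='?', sep=r'\s+')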
I want to summarize the data in a dataframe and add the new columns to another dataframe. My data contains apartments with an ID number, and it has surface and volume values for each room in the apartment. What I want is a dataframe that summarizes this and gives me the total surface and volume per apartment. There are two conditions for the original dataframe:
- the dataframe can contain empty cells
- when the values of surface or volume are equal for all of the rows within an ID
  (so all the same values for the same ID), the data (surface, volume) is not
  summed; instead a single value/row is passed to the new summary column (example: 'ID 4'), as
  this could be a mistake in the original dataframe where the total surface/volume was
  inserted for all the rooms by the government employee
Initial dataframe 'data':
print(data)
ID Surface Volume
0 2 10.0 25.0
1 2 12.0 30.0
2 2 24.0 60.0
3 2 8.0 20.0
4 4 84.0 200.0
5 4 84.0 200.0
6 4 84.0 200.0
7 52 NaN NaN
8 52 96.0 240.0
9 95 8.0 20.0
10 95 6.0 15.0
11 95 12.0 30.0
12 95 30.0 75.0
13 95 12.0 30.0
Desired output from 'df':
print(df)
ID Surface Volume
0 2 54.0 135.0
1 4 84.0 200.0 #-> as the values are the same for each row of this ID in the original data, the sum is not taken, but only one of the rows is passed (see the second condition)
2 52 96.0 240.0
3 95 68.0 170.0
Tried code:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [2,4,52,95]})
data = pd.DataFrame({"ID": [2,2,2,2,4,4,4,52,52,95,95,95,95,95],
"Surface": [10,12,24,8,84,84,84,np.nan,96,8,6,12,30,12],
"Volume": [25,30,60,20,200,200,200,np.nan,240,20,15,30,75,30]})
print(data)
#Tried something, but no idea how to do this actually:
df["Surface"] = data.groupby("ID").agg(sum)
df["Volume"] = data.groupby("ID").agg(sum)
print(df)
Two masks are needed, one per condition: first, test whether the values are unique per group for each column separately with GroupBy.transform and DataFrameGroupBy.nunique, comparing the result with 1 via eq; second, mark repeated values with DataFrame.duplicated applied to each column together with the ID column.
Chain both masks with & for bitwise AND, replace the matched values with NaN using DataFrame.mask, and finally aggregate with sum:
cols = ['Surface','Volume']
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())
df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum().reset_index()
print(df)
ID Surface Volume
0 2 54.0 135.0
1 4 84.0 200.0
2 52 96.0 240.0
3 95 68.0 170.0
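For intuition, you can inspect which rows the combined mask flags before summing (a quick check of my own, reusing m1 and m2 from above); only the repeated rows of ID 4 are flagged, so exactly one of their identical values survives into the sum. Note also that the groupby sum skips NaN by default, which is why the empty cell for ID 52 does not affect its total.
print(data[(m1 & m2).any(axis=1)])
   ID  Surface  Volume
5   4     84.0   200.0
6   4     84.0   200.0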
If you need new columns filled with the aggregated sum values, use GroupBy.transform:
cols = ['Surface','Volume']
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())
data[cols] = data[cols].mask(m1 & m2).groupby(data["ID"]).transform('sum')
print(data)
ID Surface Volume
0 2 54.0 135.0
1 2 54.0 135.0
2 2 54.0 135.0
3 2 54.0 135.0
4 4 84.0 200.0
5 4 84.0 200.0
6 4 84.0 200.0
7 52 96.0 240.0
8 52 96.0 240.0
9 95 68.0 170.0
10 95 68.0 170.0
11 95 68.0 170.0
12 95 68.0 170.0
13 95 68.0 170.0
I have a dataframe 500 rows long by 4 columns. I need to find the proper Python code that divides the current row by the row below and then multiplies that value by the value in the last row, for every value in each column. Basically, I need to replicate this Excel formula.
It's not clear whether your data is stored in a NumPy array; if it is, then with the original data contained in a you'd write
b = a[-1] * (a[:-1] / a[1:])
Here a[-1] is the last row, a[:-1] is the array without the last row, and a[1:] is the array without the first row (index zero).
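As a quick sanity check (a made-up 4x2 array of my own, not the asker's data):
import numpy as np
a = np.array([[10., 20.],
              [ 5.,  4.],
              [ 2.,  8.],
              [ 4.,  2.]])
# each row divided by the row below it, scaled by the last row;
# the result has one row fewer than a, since the last row has no row below it
b = a[-1] * (a[:-1] / a[1:])
print(b)
[[ 8. 10.]
 [10.  1.]
 [ 2.  8.]]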
Assuming you are talking about a pandas DataFrame:
import pandas as pd
import random
# sample DataFrame object
df = pd.DataFrame((float(random.randint(1, 100)),
                   float(random.randint(1, 100)),
                   float(random.randint(1, 100)),
                   float(random.randint(1, 100)))
                  for _ in range(10))
def function(col):
    # divide each value by the one below it, then scale by the last value in the column
    for i in range(len(col) - 1):
        col[i] = (col[i] / col[i + 1]) * col[len(col) - 1]
print(df) # before formula apply
df.apply(function)
print(df) # after formula apply
>>>
0 1 2 3
0 10.0 78.0 27.0 23.0
1 72.0 42.0 77.0 86.0
2 82.0 12.0 58.0 98.0
3 27.0 92.0 19.0 86.0
4 48.0 83.0 14.0 43.0
5 55.0 18.0 58.0 77.0
6 20.0 58.0 20.0 22.0
7 76.0 19.0 63.0 82.0
8 23.0 99.0 58.0 15.0
9 60.0 57.0 89.0 100.0
0 1 2 3
0 8.333333 105.857143 31.207792 26.744186
1 52.682927 199.500000 118.155172 87.755102
2 182.222222 7.434783 271.684211 113.953488
3 33.750000 63.180723 120.785714 200.000000
4 52.363636 262.833333 21.482759 55.844156
5 165.000000 17.689655 258.100000 350.000000
6 15.789474 174.000000 28.253968 26.829268
7 198.260870 10.939394 96.672414 546.666667
8 23.000000 99.000000 58.000000 15.000000
9 60.000000 57.000000 89.000000 100.000000
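For what it's worth, the same formula can also be written without an explicit loop (a vectorized sketch of my own, not part of the answer above): shift(-1) lines each row up against the row below it, and the last row is copied through unchanged, matching the loop's behaviour.
# divide each row by the row below it, then scale by the last row
result = df.div(df.shift(-1)).mul(df.iloc[-1], axis=1)
result.iloc[-1] = df.iloc[-1]  # the last row has no row below it, so keep it as-is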
Let's say I have a dataframe like this -
ID Weight Height
1 80.0 180.0
2 60.0 170.0
3 NaN NaN
4 NaN NaN
5 82.0 185.0
I want the dataframe to be transformed to -
ID Weight Height
1 80.0 180.0
2 60.0 170.0
3 71.0 177.5
4 76.5 181.25
5 82.0 185.0
It takes the average of the immediate data available before and after a NaN and updates the missing/NaN value accordingly.
You can use interpolation from the pandas library:
df['Weight'], df['Height'] = df.Weight.interpolate(), df.Height.interpolate()
Check the arguments in the documentation for the interpolation method to tune this to your problem case: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.interpolate.html
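For reference, here is a minimal sketch on the example frame (my own illustration): the default linear method spaces the filled values evenly between the surrounding points, which is close to, but not exactly the same as, averaging the immediate values before and after each NaN, so the method argument may need tuning for the exact behaviour described above.
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Weight': [80.0, 60.0, np.nan, np.nan, 82.0],
                   'Height': [180.0, 170.0, np.nan, np.nan, 185.0]})
df['Weight'], df['Height'] = df.Weight.interpolate(), df.Height.interpolate()
print(df)
   ID     Weight  Height
0   1  80.000000   180.0
1   2  60.000000   170.0
2   3  67.333333   175.0
3   4  74.666667   180.0
4   5  82.000000   185.0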
I have a pandas Series [a1, a2, a3, a4, ...] with length T. Each value corresponds to one day. For each day, I would like to compute the historical median: on the first day, the median of [a1]; on the second day, the median of [a1, a2]; on the nth day, the median of [a1, a2, ..., an]. Finally I would like to get a series with length T as well. Is there an efficient way to do this in pandas? Thanks!
For a Series, ser:
ser = pd.Series(np.random.randint(0, 100, 10))
If your pandas version is 0.18.0 or above, use:
ser.expanding().median()
Out:
0 0.0
1 25.0
2 50.0
3 36.5
4 33.0
5 36.0
6 33.0
7 36.0
8 33.0
9 36.0
dtype: float64
The following is for earlier versions and deprecated:
pd.expanding_median(ser)
C:\Anaconda3\envs\p3\lib\site-packages\spyder\utils\ipython\start_kernel.py:1: FutureWarning: pd.expanding_median is deprecated for Series and will be removed in a future version, replace with
Series.expanding(min_periods=1).median()
# -*- coding: utf-8 -*-
Out:
0 0.0
1 25.0
2 50.0
3 36.5
4 33.0
5 36.0
6 33.0
7 36.0
8 33.0
9 36.0
dtype: float64