I have a dataframe like this:
long lat Place
-6.779 61.9 Aarhus
-6.790 62.0 Aarhus
54.377 24.4 Dhabi
38.834 9.0 Addis
35.698 9.2 Addis
Is it possible to transform the dataframe into a format like below?
Place long + lat
Aarhus [[-6.779, 61.9], [-6.790, 62.0]]
Dhabi [[54.377, 24.4]]
Addis [[38.834, 9.0], [35.698, 9.2]]
I tried different methods but still couldn't work this out. This is
what I tried to get a list for each distinct place value:
df2["index"] = df2.index
df2["long"] = df2.groupby('index')['long'].apply(list)

list1 = []
for value in place_list:
    if df['Place'].any() == value:
        list1.append(df.loc[df['Place'] == value, 'long'])
But this returned a list of Series instead, which is not what I want. Please help. Thank you so much.
df.groupby('Place')[['long', 'lat']].apply(lambda x: x.values.tolist()).\
    reset_index(name='long + lat')
Out[1380]:
Place long + lat
0 Aarhus [[-6.779, 61.9], [-6.79, 62.0]]
1 Addis [[38.834, 9.0], [35.698, 9.2]]
2 Dhabi [[54.376999999999995, 24.4]]
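For reference, a variant that builds the coordinate pairs explicitly with zip can read a little more clearly. This is a sketch of my own, not from the answer above; note that it produces tuples rather than inner lists:
df.assign(coords=list(zip(df['long'], df['lat']))).\
    groupby('Place')['coords'].apply(list).\
    reset_index(name='long + lat')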
I have a data table I am pulling from excel that looks like this:
Raw pH TAPA
8.20 30
8.21 29
8.22 28.5
8.23 28
8.24 27
8.25 26.5
8.26 26
I usually have no problem looking up a number in one column based on another column. For instance, when I write:
df = pd.read_excel('Conversions.xlsx', 'Sheet1')
df2=df.loc[df['Raw pH'] == 8.21, 'TAPA']
print(df2)
I would expect it to return "29". However, it only works when I look up the first pH value, 8.20:
df2=df.loc[df['Raw pH'] == 8.20, 'TAPA']
print(df2)
Returns:
0 30.0
Name: TAPA, dtype: float64
When I try any other number, such as:
df2=df.loc[df['Raw pH'] == 8.23, 'TAPA']
print(df2)
I simply get an empty Series:
Series([], Name: TAPA, dtype: float64)
If I try it with iloc:
df2=df.loc[df['Raw pH'] == 8.23, 'TAPA'].iloc[0]
print(df2)
I get the error IndexError: single positional indexer is out-of-bounds.
If I try the iloc approach with 8.2, it returns 30.0 as expected.
Why am I getting back empty values for everything below the first data row?
I ended up just converting the Excel worksheet to a CSV file instead. It's working fine now. I still don't understand why, since the data type in Excel looks straightforward.
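For the record, the likely culprit is binary floating-point representation: a value that displays as 8.23 in Excel (especially if the column was filled by a formula) can be stored as something like 8.2300000000000004, so an exact == comparison fails. A tolerance-based lookup sidesteps this; a minimal sketch, assuming df is read in as above:
import numpy as np
# compare with a tolerance instead of exact equality
df2 = df.loc[np.isclose(df['Raw pH'], 8.23), 'TAPA']
print(df2)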
I have a DataFrame with columns: City, Wind direction, Temperature. Each City occurs only once and has a single Wind direction and Temperature value. For instance:
0 New York 252.0 22.0
How can I create my own method and use it on a DataFrame? For example, I would like to create my own method "aa" which returns a result (the Temperature in a City minus the mean of the entire "Temperature" column) and use this method when aggregating my DataFrame.
Currently I have defined the method "aa" as you can see below and I use it in the aggregation; nevertheless, the "aa" method returns "0" everywhere. Could you write me appropriate code? Did I make a mistake in def aa(x)?
def aa(x):
    return x - np.mean(x)

file.groupby(["City"]).agg({"Wind direction": [np.mean, aa], "Temperature": ["mean", aa]})
Sample data: (Taken from comments provided by OP)
file = pd.DataFrame({"City":["New York", "Berlin", "London"], "Wind direction":[225.0, 252.0, 310.0], "Temperature":[21.0, 18.5, 22.0]})
You are getting zeros because the input that aa receives is the group, not the full series, and the mean of a single-element group is the element itself, so the difference is zero.
Now, it's a bit weird to use groupby when you know that each group has only a single element, but you can force it through using something like
def aa(x):
    return x - file[x.name].mean()
With your given example:
In [23]: file.groupby(["City"]).agg({"Wind direction":[np.mean, aa], "Temperature":["mean", aa]})
Out[23]:
Wind direction Temperature
mean aa mean aa
City
Berlin 252.0 -10.333333 18.5 -2.0
London 310.0 47.666667 22.0 1.5
New York 225.0 -37.333333 21.0 0.5
Much more straightforward would be to simply operate on the data frame directly, e.g.
In [26]: file['Wind direction aa'] = file['Wind direction'] - file['Wind direction'].mean()
In [27]: file['Temperature aa'] = file['Temperature'] - file['Temperature'].mean()
In [28]: file
Out[28]:
City Wind direction Temperature Wind direction aa Temperature aa
0 New York 225.0 21.0 -37.333333 0.5
1 Berlin 252.0 18.5 -10.333333 -2.0
2 London 310.0 22.0 47.666667 1.5
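If you have several columns to demean and don't want to repeat yourself, a small loop does the same thing; my own sketch, not part of the original answer:
for c in ['Wind direction', 'Temperature']:
    # subtract the column mean from every row of that column
    file[c + ' aa'] = file[c] - file[c].mean()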
I have a dataframe, sega_df:
Month 2016-11-01 2016-12-01
Character
Sonic 12.0 3.0
Shadow 5.0 23.0
I would like to create multiple new columns by applying a formula to each existing column in my dataframe (in short, roughly doubling the number of columns). That formula is (100 - [5*cell]) * 0.2.
For example, for Sonic in November, (100 - [5*12.0]) * 0.2 = 8.0, and for Sonic in December, (100 - [5*3.0]) * 0.2 = 17.0. My ideal output is:
Month 2016-11-01 2016-12-01 Weighted_2016-11-01 Weighted_2016-12-01
Character
Sonic 12.0 3.0 8.0 17.0
Shadow 5.0 23.0 15.0 -3.0
I know how to create a for loop to build one column. This is what I used when only one month was in consideration:
for w in range(1, len(sega_df.index)):
    sega_df['Weighted'] = (100 - 5*sega_df)*0.2
    sega_df[sega_df < 0] = 0
I don't yet have the skills or experience to create multiple columns at once. I've looked at other questions that might cover what I'm doing, but haven't gotten anything to work yet. Thanks in advance.
One vectorised approach is to drop down to NumPy:
A = sega_df.values                 # drop to the underlying NumPy array
A = (100 - 5*A) * 0.2              # apply the formula to every cell at once
res = pd.DataFrame(A, index=sega_df.index, columns='Weighted_' + sega_df.columns)
Then join the result to your original dataframe:
sega_df = sega_df.join(res)
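As a quick check, here is the whole approach run on a minimal reconstruction of the frame from the question:
import numpy as np
import pandas as pd

sega_df = pd.DataFrame(
    {'2016-11-01': [12.0, 5.0], '2016-12-01': [3.0, 23.0]},
    index=pd.Index(['Sonic', 'Shadow'], name='Character'))
sega_df.columns.name = 'Month'

A = sega_df.values
res = pd.DataFrame((100 - 5*A) * 0.2,
                   index=sega_df.index,
                   columns='Weighted_' + sega_df.columns)
print(sega_df.join(res))   # Sonic: 8.0 and 17.0; Shadow: 15.0 and -3.0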
I have a df with lots of rows.
Price Time Place
A 5.1 11.30 germany
B 4.1 08.30 lebannon
...
DY 7.49 01.15 italy
DZ 2.13 02.35 england
How could I filter the df to obtain the rows where column Price holds a NaN value?
So far I tried
df[~df['Price'].str.isnumeric()]
but it didn't work.
The desired output would be something like this:
Price Time Place
AS Nan 11.30 germany
BJ Nan 08.30 lebannon
Use isnull:
df[df.Price.isnull()]
You can also use np.isnan (note this requires a numeric dtype):
df[np.isnan(df.Price.values)]
Try using this (df.ix is deprecated in modern pandas, so use the .loc form, and note the column name is capitalised):
df = df.loc[df['Price'].isnull(), :]
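A minimal demonstration with made-up values shaped like the question's frame (the row labels and numbers here are hypothetical):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Price': [5.1, np.nan, np.nan, 2.13],
                   'Time': ['11.30', '11.30', '08.30', '02.35'],
                   'Place': ['germany', 'germany', 'lebannon', 'england']},
                  index=['A', 'AS', 'BJ', 'DZ'])
print(df[df['Price'].isnull()])   # keeps only the AS and BJ rows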
I'm trying to find a way to iterate a linear regression over many, many columns, all the way up to Z3. Here is a snippet of the dataframe, called df1:
Time A1 A2 A3 B1 B2 B3
1 1.00 6.64 6.82 6.79 6.70 6.95 7.02
2 2.00 6.70 6.86 6.92 NaN NaN NaN
3 3.00 NaN NaN NaN 7.07 7.27 7.40
4 4.00 7.15 7.26 7.26 7.19 NaN NaN
5 5.00 NaN NaN NaN NaN 7.40 7.51
6 5.50 7.44 7.63 7.58 7.54 NaN NaN
7 6.00 7.62 7.86 7.71 NaN NaN NaN
This code returns the slope coefficient of a linear regression for just ONE column and appends the value to a NumPy array called series. Here is what it looks like for extracting the slope of the first column:
import numpy as np
from sklearn.linear_model import LinearRegression

series = np.array([])              # empty array to collect the slopes
df2 = df1[~np.isnan(df1['A1'])]    # drop NaN rows so sklearn can fit
df3 = df2[['Time', 'A1']]
npMatrix = np.matrix(df3)
X, Y = npMatrix[:, 0], npMatrix[:, 1]
slope = LinearRegression().fit(X, Y)
m = slope.coef_[0]
series = np.concatenate((series, m), axis=0)
As it stands now, I am copying this block and replacing "A1" with each new column name all the way up to "Z3", which is extremely inefficient. I know there are easy ways to do this with some modules, but the intermediate NaN values in the time series seem to limit me to this method, or something like it.
I tried using a for loop such as:
for col in df1.columns:
and replacing 'A1' with col in the code, but this does not seem to work.
Is there any way I can do this more efficiently?
Thank you!
One-liner (or three):
time = df[['Time']]
pd.DataFrame(np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
             ['Slope'], df.columns)
Broken down with a bit of explanation:
Using the closed form of OLS, the slope estimate is beta = (X^T X)^(-1) X^T Y.
In this case X is time, where we define time as df[['Time']]. I used the double brackets to preserve the dataframe and its two dimensions. If I'd used single brackets, I'd have gotten a series with one dimension, and the dot products wouldn't be as pretty.
(X^T X)^(-1) X^T is np.linalg.pinv(time.T.dot(time)).dot(time.T).
Y is df.fillna(0). Yes, we could have done one column at a time, but why, when we can do it all together? You have to deal with the NaNs somehow. Only fitting over the times where you have data is one option; filling the gaps with zeroes is the choice I made here.
Finally, I use pd.DataFrame(stuff, ['Slope'], df.columns) to contain all slopes in one place with the original labels.
Note that I calculated the slope of the regression of Time against itself. Why not? It was there. Its value is 1.0, so I probably did it right!
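To make the breakdown concrete, here is the one-liner run on a two-column slice of the question's data (my reconstruction; values transcribed from the post):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Time': [1.0, 2.0, 3.0, 4.0, 5.0, 5.5, 6.0],
    'A1': [6.64, 6.70, np.nan, 7.15, np.nan, 7.44, 7.62],
    'B1': [6.70, np.nan, 7.07, 7.19, np.nan, 7.54, np.nan],
})

time = df[['Time']]
print(pd.DataFrame(
    np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
    ['Slope'], df.columns))   # Time's own slope comes out as 1.0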
Looping is a decent strategy for a modest number of columns (say, fewer than a few thousand). Without seeing your loop implementation I can't say what went wrong, but here's my version, which works:
slopes = []
for c in df1.columns:
    if c == "Time":
        continue                          # skip the predictor column itself
    mask = ~np.isnan(df1[c])              # drop this column's NaN rows
    x = np.atleast_2d(df1.Time[mask].values).T
    y = np.atleast_2d(df1[c][mask].values).T
    reg = LinearRegression().fit(x, y)
    slopes.append(reg.coef_[0])
I've simplified your code a bit to avoid creating so many temporary DataFrame objects, but it should work fine your way too.
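If it helps, the collected coefficients can then be labelled with their column names; a small follow-up sketch of my own:
import pandas as pd

# each entry of slopes is a length-1 array taken from reg.coef_[0]
slope_series = pd.Series([float(s[0]) for s in slopes],
                         index=[c for c in df1.columns if c != 'Time'])
print(slope_series)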