I'm trying to transform some points tabulated in a .csv into a NetCDF file.
This is my .csv file: https://1drv.ms/u/s!AhZf0QH5jEVSjWfnPtJjJgmXf-i0?e=WEpMyU
In my spreadsheet I have the unique location of each point (the coverage is not regular over the whole area, but the points are spaced by 0.1 degree) and one SP value per year, up to 100 years forward.
To work with this data, I need it in the same form as other sources that provide NetCDF data organized as sp(time, lat, lon), so that I can evaluate and visualize the values of this specific region by year (using Panoply or ncview, for example).
For that, I came up with this code:
import pandas as pd
import xarray as xr
import numpy as np
csv_file = 'example.csv'
df = pd.read_csv(csv_file)
# go from wide (one column per year) to long format: one row per (lon, lat, time)
df = pd.melt(df, id_vars=["lon", "lat"], var_name="time", value_name="sp")
df['time'] = pd.to_datetime(df['time'])
df = df.set_index(["time", "lat", "lon"])
df = df.astype('float32')
ds = df.to_xarray()   # renamed from "xr" so it doesn't shadow the xarray import
xc = ds.fillna(0)
xc.to_netcdf(csv_file + '.nc')
And I got a netcdf file like this: https://1drv.ms/u/s!AhZf0QH5jEVSjWfnPtJjJgmXf-i0?e=WEpMyU
At first my code seemed to work and created the NetCDF file without problems. However, I noticed that in some places I am creating some "leakage" of points, with the same values being repeated in some directions (north-south and west-east) where that shouldn't happen.
If you do a simple plot of the xarray dataset (before filling the NaNs) you can see there are three west segments and one south segment:
ds.sp[0].plot()
And this ends up being masked a bit when I fill the NaN with 0 and plot it again:
xc.sp[0].plot()
Checking the netcdf file using panoply I got something similar as well:
So I started to check every step of my code to see if I missed something. My first guess was the melt part, but I'm not 100% sure, because if I plot df I can't see any leaking or extrapolation in the same region:
import seaborn
import contextily

joint_axes = seaborn.jointplot(
    x="lon", y="lat", data=df, s=0.5
)
contextily.add_basemap(
    joint_axes.ax_joint,
    crs="EPSG:4326",
    source=contextily.providers.CartoDB.PositronNoLabels,
);
So anyone have any idea what's happening here?
EDIT:
Now, a solution that would help me at the moment would be to fill the missing coordinates within my domain area with a value of 0, using the minimum and maximum latitudes and longitudes.
My first (and unconventional) idea was to create a 0.1 x 0.1 grid filled with zeros and feed this grid with my existing values.
However, the reindex method would let me do this in a few lines. My doubt is whether I should apply it before or after the df.melt in my code.
I'm in this situation:
csv_file = '/Users/helioguerraneto/Desktop/example.csv'
df = pd.read_csv(csv_file)
lonmin, lonmax = df['lon'].min(), df['lon'].max()
latmin, latmax = df['lat'].min(), df['lat'].max()
df = pd.melt(df, id_vars=["lon", "lat"], var_name="time", value_name="sp")
df['time'] = pd.to_datetime(df['time'])
df = df.set_index(["time", "lat", "lon"])
df = df.astype('float32')
ds = df.to_xarray()
xc = ds.reindex(lat=np.arange(latmin, latmax, 0.1), lon=np.arange(lonmin, lonmax, 0.1), fill_value=0)
xc.to_netcdf(csv_file + '.nc')
Reindex seems like the way to go, but I need to keep the original data. I was expecting some zeros, but not over the whole area.
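For reference, here is a minimal sketch (my own, not from the original post) of how reindex could keep the original values. The assumption is that the mismatch comes from floating-point coordinates that don't land exactly on the 0.1-degree target grid, so rounding both the data coordinates and the grid to one decimal should make reindex align them instead of replacing them with the fill value:
import numpy as np
import pandas as pd

df = pd.read_csv(csv_file)
# round the coordinates so they sit exactly on the 0.1-degree grid
df['lat'] = df['lat'].round(1)
df['lon'] = df['lon'].round(1)
df = pd.melt(df, id_vars=["lon", "lat"], var_name="time", value_name="sp")
df['time'] = pd.to_datetime(df['time'])
ds = df.set_index(["time", "lat", "lon"]).astype('float32').to_xarray()
# build the full regular grid, rounded the same way, and fill the gaps with 0
lats = np.round(np.arange(df['lat'].min(), df['lat'].max() + 0.05, 0.1), 1)
lons = np.round(np.arange(df['lon'].min(), df['lon'].max() + 0.05, 0.1), 1)
xc = ds.reindex(lat=lats, lon=lons, fill_value=0)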
EDIT2:
I think I found something that might help! My goal now could be the same as what's happening here: How to interpolate latitude/longitude and heading in Pandas
But instead of interpolating to the nearest point, I could just match the exact coordinates. Maybe the real problem here is mixing the 100 yearly grids at the end..
Any suggestions?
Related
I have daily crude oil prices downloaded from FRED, about 10k observations; some values are blank (the code cleans them). I believe that I cannot share Excel sheets here, so I will just give you a screenshot of what the data looks like:
I calculate the differences and returns and clean up the data but I am kind of stuck.
Here is what the code looks like to get you started:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv("DCOILWTICO.csv")
nan_value = float("NaN")
data.replace("", nan_value, inplace=True)
data.replace(".", nan_value, inplace=True)
data['Previous'] = data['DCOILWTICO'].shift(1)
data.dropna(subset=['Previous'],inplace=True)
data.replace("", nan_value, inplace=True)
data.replace(".", nan_value, inplace=True)
data['DCOILWTICO'] = data['DCOILWTICO'].astype(float)
data['Previous'] = data['Previous'].astype(float)
data['Diff'] = data['DCOILWTICO'] - data['Previous']
data['Return'] = (data['DCOILWTICO'] - data['Previous'])/data['Previous']
Here comes the question: I am trying to duplicate the graph below (which I believe was generated using Mathematica). The difficult part is creating the bins in the right way. Looking at the graph, it looks like there are around 200 bins. On the x-axis are the returns and on the y-axis are the frequencies (which have been binned).
I think you are asking how to make equally spaced bins in logspace. If so then use the np.geomspace function (geometric space), rather than np.linspace (linear space).
plt.figure()
# note: np.geomspace cannot span zero, so this assumes the binned values are all positive (or all negative)
bins = np.geomspace(data['Return'].min(), data['Return'].max(), 200)
plt.hist(data['Return'], bins=bins)
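A caveat worth adding (not in the original answer): price returns are usually signed, and np.geomspace cannot span zero or negative values, so plain linear binning may be the safer starting point. A minimal sketch with the same number of bins:
plt.figure()
# ~200 equal-width bins over the full range of signed returns
bins = np.linspace(data['Return'].min(), data['Return'].max(), 200)
plt.hist(data['Return'], bins=bins)
plt.show()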
I'm working through the pangeo tutorial gallery and am stuck on the ENSO exercise at the end of the xarray tutorial.
you'll need to download some files:
%%bash
git clone https://github.com/pangeo-data/tutorial-data.git
Then:
import numpy as np
import xarray as xr
import pandas as pd
# load all files
ds_all = xr.open_mfdataset('./tutorial-data/sst/*nc', combine='by_coords')
# slice for enso3.4 region
sst_enso = ds_all.sel(lat=slice(-5,5), lon=slice(-170+360,-120+360))
# subset years to match hint at the bottom
sst_enso = sst_enso.sel(time=sst_enso.time.dt.year>=1982)
# groupby each timepoint and find mean for entire spatial region
avg_enso = sst_enso.sst.groupby('time').mean(dim=['lat','lon'])
avg_enso.plot()
This figure matches the one shown at the bottom of the tutorial. So far so good, but I'd also like to compute and plot the ONI. Warm or cold phases of the Oceanic Niño Index are defined by five consecutive 3-month running means of sea surface temperature (SST) anomalies in the Niño 3.4 region that are above (below) the threshold of +0.5°C (-0.5°C). This is known as the Oceanic Niño Index (ONI).
I run into trouble because the month becomes an index.
Q1. I'm not sure how to make sure that subtracting sst_enso - enso_clim results in the correct math.
Assuming that is correct, I can compute the regional mean anomaly again and then use a rolling window mean.
enso_clim = sst_enso.sst.groupby('time.month').mean('time')
sst_anom = sst_enso - enso_clim
enso_anom = sst_anom.groupby('time').mean(dim=['lat','lon'])
oni = enso_anom.rolling(time = 3).mean()
Now I'd like to plot a bar chart of oni with positive red, negative blue. Something like this:
for example with:
oni.sst.plot.bar(color=(oni.sst < 0).map({True: 'b', False: 'r'}))
Instead oni.sst.plot() gives me:
Resetting the index enso_anom.reset_index('month', drop=True).sst still keeps month as a dimension and gives the same plot. If you drop_dims('month') then the sst data goes away.
I also tried converting to a pandas dataframe with oni.to_dataframe(), but you end up with 5040 rows, which is 12 months x the 420 month-years I subsetted for. According to the docs, "The DataFrame is indexed by the Cartesian product of index coordinates (in the form of a pandas.MultiIndex)", so I guess that makes sense, but it's not useful. Even if you reset_index on oni before converting to a dataframe, you get the same 5040 rows. Q2. Since the dataframe must be repeating itself I can probably figure out where, but is there a cleaner way to do this, with each date not repeated for all 12 months?
Your code results in a DataArray with both time and month as dimensions, because the month-indexed climatology is broadcast against the time dimension when you subtract it. That is why you end up with such a plot.
There is a trick (found here) to calculate the anomalies. Besides this, I would select 1986-2015 as the reference period (see the NOAA definition of the ONI index).
Combining both, I ended up with this short code (without the bar plots):
import xarray as xr
import pandas as pd
import matplotlib.pyplot as plt
# load all files
ds_all = xr.open_mfdataset('./tutorial-data/sst/*nc', combine='by_coords')
# slice for enso3.4 region
sst_enso = ds_all.sel(lat=slice(-5,5), lon=slice(-170+360,-120+360))
avg_enso = sst_enso.sst.groupby('time').mean(dim=['lat','lon'])
avg_enso.plot()
ds = sst_enso.sst.mean(dim=['lat','lon'])
enso_clim = ds.sel(time=slice('1986-01-01', '2016-01-01')).groupby("time.month").mean("time")
# ref: https://origin.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/ONI_change.shtml
enso_anom = ds.groupby("time.month") - enso_clim
# ref: http://xarray.pydata.org/en/stable/examples/weather-data.html#Calculate-monthly-anomalies
enso_anom.plot()
oni = enso_anom.rolling(time = 3).mean()
oni.plot()
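The original answer stops short of the bar chart, so here is a minimal sketch of the red/blue bars the question asks for (my own addition, assuming oni is the 1-D DataArray computed above):
vals = oni.to_series().dropna()
colors = ['b' if v < 0 else 'r' for v in vals]
plt.figure()
plt.bar(vals.index, vals.values, width=20, color=colors)     # width is in days on a datetime axis
plt.axhline(0.5, color='k', linestyle='--', linewidth=0.5)   # warm-phase threshold
plt.axhline(-0.5, color='k', linestyle='--', linewidth=0.5)  # cold-phase threshold
plt.ylabel('ONI (°C)')
plt.show()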
I have a pandas dataframe with continuously changing speed values, but since it is sensor data we often get errors at some points. The moving average does not seem to help either, so what methods can I use to remove these outliers or peak points from the data?
Example:
data points = {0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9}
In this data, the points 4, 4, 5, 6 are clearly outlier values.
Before this I used a rolling mean with a 5-minute window to smooth the values, but I still get a lot of these blip points, which I want to remove. Can anyone suggest a technique to get rid of them?
I have an image that gives a clearer view of the data:
Here you can see how the data shows some outlier points that I have to remove.
Any idea what the best way is to get rid of these points?
I really think a z-score using scipy.stats.zscore() is the way to go here. Have a look at the related issue in this post; there, the focus is on which method to use before removing potential outliers. As I see it, your challenge is a bit simpler, since judging by the data provided, it is pretty straightforward to identify potential outliers without having to transform the data. Below is a code snippet that does just that. Just remember that what does and does not look like an outlier depends entirely on your dataset, and after removing some outliers, points that did not look like outliers before may suddenly do so. Have a look:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats

# your data (as a list)
data = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]

# initial plot
df1 = pd.DataFrame(data = data)
df1.columns = ['data']
df1.plot(style = 'o')

# Function to identify and remove outliers
def outliers(df, level):
    # 1. temporary dataframe (work on a copy of the argument, not the global df1)
    df = df.copy(deep = True)
    # 2. Select a level for a Z-score to identify and remove outliers
    df_Z = df[(np.abs(stats.zscore(df)) < level).all(axis=1)]
    ix_keep = df_Z.index
    # 3. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]
    return(df_keep)
Original data:
Test run 1 : Z-score = 4:
As you can see, no data has been removed because the level was set too high.
Test run 2 : Z-score = 2:
Now we're getting somewhere. Two outliers have been removed, but there is still some dubious data left.
Test run 3 : Z-score = 1.2:
This is looking really good. The remaining data now seems a bit more evenly distributed than before. But now the data point highlighted in the original plot is starting to look a bit like a potential outlier. So where to stop? That's going to be entirely up to you!
EDIT: Here's the whole thing for an easy copy&paste:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats

# your data (as a list)
data = [0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9]

# initial plot
df1 = pd.DataFrame(data = data)
df1.columns = ['data']
df1.plot(style = 'o')

# Function to identify and remove outliers
def outliers(df, level):
    # 1. temporary dataframe (work on a copy of the argument, not the global df1)
    df = df.copy(deep = True)
    # 2. Select a level for a Z-score to identify and remove outliers
    df_Z = df[(np.abs(stats.zscore(df)) < level).all(axis=1)]
    ix_keep = df_Z.index
    # 3. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]
    return(df_keep)

# remove outliers
level = 1.2
print("df_clean = outliers(df = df1, level = " + str(level) + ')')
df_clean = outliers(df = df1, level = level)

# final plot
df_clean.plot(style = 'o')
You might cut values above a certain quantile as follows:
import numpy as np

# convert the list to an array first so the comparison is element-wise
data_points = np.array([0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9])
clean_data = data_points[data_points <= np.percentile(data_points, 95)]
In pandas you would use df.quantile; you can find it here.
Or you may use the Q3 + 1.5*IQR approach to eliminate the outliers, as you would with a boxplot; see the sketch below.
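For completeness, a minimal sketch of that boxplot-style IQR rule (my addition, not part of the original answer):
import numpy as np

data_points = np.array([0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9])
q1, q3 = np.percentile(data_points, [25, 75])
iqr = q3 - q1
# keep only values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
clean_data = data_points[(data_points >= q1 - 1.5 * iqr) & (data_points <= q3 + 1.5 * iqr)]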
I have some skin temperature data (collected at 1Hz) which I intend to analyse.
However, the sensors were not always in contact with the skin. So I have a challenge of removing this non-skin temperature data, whilst preserving the actual skin temperature data. I have about 100 files to analyse, so I need to make this automated.
I'm aware that there is already this similar post, however I've not been able to use that to solve my problem.
My data roughly looks like this:
df =
timeStamp Temp
2018-05-04 10:08:00 28.63
. .
. .
2018-05-04 21:00:00 31.63
The first step I've taken is to simply apply a minimum threshold- this has got rid of the majority of the non-skin data. However, I'm left with the sharp jumps where the sensor was either removed or attached:
To remove these jumps, I was thinking about taking an approach where I use the first order differential of the temp and then use another set of thresholds to get rid of the data I'm not interested in.
e.g.
df_diff = df.diff(60)  # a period of about 60 makes the jumps stick out
filter_index = np.nonzero((df_diff.Temp < -1) | (df_diff.Temp > 0.5))  # when the diff is less than -1 or greater than 0.5, it is most likely a data jump
However, I find myself stuck here. The main problem is that:
1) I don't know how to use this index list to delete the non-skin data in df. What is the best way to do this?
The more minor problem is that
2) I think I will still be left with some residual artefacts from the data jumps near the edges (e.g. where a tighter threshold would start to chuck away good data). Is there either a better filtering strategy or a way to then get rid of these artefacts?
*Edit: as suggested, I've also calculated the second order diff, but to be honest I think the first order diff would allow for tighter thresholds (see below):
*Edit 2: Link to sample data
Try the code below (I used a tangent function to generate data). I used the second order difference idea from Mad Physicist in the comments.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame()
df[0] = np.arange(0,10,0.005)
df[1] = np.tan(df[0])
#the following line calculates the absolute value of a second order finite
#difference (derivative)
df[2] = 0.5*(df[1].diff()+df[1].diff(periods=-1)).abs()
df.loc[df[2] < .05][1].plot() # keep only the regions with a low rate of change (the jumps are filtered out)
df[1].plot() #plot original data
plt.show()
Following is a zoom of the output showing what got filtered. Matplotlib plots a line from beginning to end of the removed data.
Your first question I believe is answered with the .loc selection above.
Your second question will take some experimentation with your dataset. The code above only selects out high-derivative data. You'll also need your threshold selection to remove zeroes or the like. You can experiment with where to make the derivative selection. You can also plot a histogram of the derivative to give you a hint as to what to select out.
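As a quick illustration of that histogram hint (my addition, not from the original answer), using the df built above:
df[2].dropna().plot.hist(bins=100)  # distribution of the absolute finite-difference values
plt.show()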
Also, higher order difference equations are possible to help with smoothing. This should help remove artifacts without having to trim around the cuts.
Edit:
A fourth-order finite difference can be applied using this:
df[2] = (df[1].diff(periods=1)-df[1].diff(periods=-1))*8/12 - \
(df[1].diff(periods=2)-df[1].diff(periods=-2))*1/12
df[2] = df[2].abs()
It's reasonable to think that it may help. The coefficients above can be worked out or derived from the following link for higher orders.
Finite Difference Coefficients Calculator
Note: The above second and fourth order central difference equations are not proper first derivatives. One must divide by the interval length (in this case 0.005) to get the actual derivative.
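To make that note concrete (my addition, not part of the original answer), dividing the fourth-order expression by the sample spacing used above gives an actual derivative estimate:
h = 0.005  # the spacing used in np.arange above
df[3] = ((df[1].diff(periods=1) - df[1].diff(periods=-1))*8/12 -
         (df[1].diff(periods=2) - df[1].diff(periods=-2))*1/12) / h
df[3] = df[3].abs()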
Here's a suggestion that targets your issues regarding
[...]an approach where I use the first order differential of the temp and then use another set of thresholds to get rid of the data I'm not interested in.
[..]I don't know how to now use this index list to delete the non-skin data in df. How is best to do this?
using stats.zscore() and pandas.merge()
As it is, it will still have a minor issue with your concerns regarding
[...]left with some residual artefacts from the data jumps near the edges[...]
But we'll get to that later.
First, here's a snippet to produce a dataframe that shares some of the challenges with your dataset:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats

np.random.seed(22)

# A function for noisy data with a trend element
def sample():
    base = 100
    nsample = 50
    sigma = 10

    # Basic df with trend and sinus seasonality
    trend1 = np.linspace(0,1, nsample)
    y1 = np.sin(trend1)
    dates = pd.date_range(pd.Timestamp(2016, 1, 1).strftime('%Y-%m-%d'), periods=nsample).tolist()
    df = pd.DataFrame({'dates':dates, 'trend1':trend1, 'y1':y1})
    df = df.set_index(['dates'])
    df.index = pd.to_datetime(df.index)

    # Gaussian Noise with amplitude sigma
    df['y2'] = sigma * np.random.normal(size=nsample)
    df['y3'] = df['y2'] + base + (np.sin(trend1))
    df['trend2'] = 1/(np.cos(trend1)/1.05)
    df['y4'] = df['y3'] * df['trend2']
    df = df['y4'].to_frame()
    df.columns = ['Temp']

    # Insert spikes and missing values
    df['Temp'][20:31] = np.nan
    df['Temp'][19] = df['Temp'][39]/4000
    df['Temp'][31] = df['Temp'][15]/4000

    return(df)

# Dataframe with random data
df_raw = sample()
df_raw.plot()
As you can see, there are two distinct spikes with missing numbers between them. And it's really the missing numbers that are causing the problems here if you prefer to isolate values where the differences are large. The first spike is not a problem since you'll find the difference between a very small number and a number that is more similar to the rest of the data:
But for the second spike, you're going to get the (nonexisting) difference between a very small number and a non-existing number, so that the extreme data-point you'll end up removing is the difference between your outlier and the next observation:
This is not a huge problem for one single observation; you could just fill it right back in. But for larger datasets that would not be a very viable solution. Anyway, if you can manage without that particular value, the code below should solve your problem. You will also have a similar problem with your very first observation, but I think it is far more trivial to decide whether or not to keep that one value.
The steps:
# 1. Get some info about the original data:
firstVal = df_raw[:1]
colName = df_raw.columns
# 2. Take the first difference
df_diff = df_raw.diff()
# 3. Remove missing values
df_clean = df_diff.dropna()
# 4. Select a level for a Z-score to identify and remove outliers
level = 3
df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
ix_keep = df_Z.index
# 5. Subset the raw dataframe with the indexes you'd like to keep
df_keep = df_raw.loc[ix_keep]
# 6.
# df_keep will be missing some indexes.
# Do the following if you'd like to keep those indexes
# and, for example, fill missing values with the previous values
df_out = pd.merge(df_keep, df_raw, how='outer', left_index=True, right_index=True)
# 7. Keep only the first column
df_out = df_out.iloc[:, 0].to_frame()   # .ix is gone in current pandas; use .iloc
# 8. Fill missing values
df_complete = df_out.fillna(axis=0, method='ffill')
# 9. Replace first value
df_complete.iloc[0] = firstVal.iloc[0]
# 10. Reset column names
df_complete.columns = colName
# Result
df_complete.plot()
Here's the whole thing for an easy copy-paste:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats

np.random.seed(22)

# A function for noisy data with a trend element
def sample():
    base = 100
    nsample = 50
    sigma = 10

    # Basic df with trend and sinus seasonality
    trend1 = np.linspace(0,1, nsample)
    y1 = np.sin(trend1)
    dates = pd.date_range(pd.Timestamp(2016, 1, 1).strftime('%Y-%m-%d'), periods=nsample).tolist()
    df = pd.DataFrame({'dates':dates, 'trend1':trend1, 'y1':y1})
    df = df.set_index(['dates'])
    df.index = pd.to_datetime(df.index)

    # Gaussian Noise with amplitude sigma
    df['y2'] = sigma * np.random.normal(size=nsample)
    df['y3'] = df['y2'] + base + (np.sin(trend1))
    df['trend2'] = 1/(np.cos(trend1)/1.05)
    df['y4'] = df['y3'] * df['trend2']
    df = df['y4'].to_frame()
    df.columns = ['Temp']

    # Insert spikes and missing values
    df['Temp'][20:31] = np.nan
    df['Temp'][19] = df['Temp'][39]/4000
    df['Temp'][31] = df['Temp'][15]/4000

    return(df)

# A function for removing outliers
def noSpikes(df, level, keepFirst):
    # 1. Get some info about the original data:
    firstVal = df[:1]
    colName = df.columns

    # 2. Take the first difference
    df_diff = df.diff()

    # 3. Remove missing values
    df_clean = df_diff.dropna()

    # 4. Select a level for a Z-score to identify and remove outliers
    df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
    ix_keep = df_Z.index

    # 5. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]

    # 6.
    # df_keep will be missing some indexes.
    # Do the following if you'd like to keep those indexes
    # and, for example, fill missing values with the previous values
    df_out = pd.merge(df_keep, df, how='outer', left_index=True, right_index=True)

    # 7. Keep only the first column
    df_out = df_out.iloc[:, 0].to_frame()

    # 8. Fill missing values
    df_complete = df_out.fillna(axis=0, method='ffill')

    # 9. Reset column names
    df_complete.columns = colName

    # Keep the first value
    if keepFirst:
        df_complete.iloc[0] = firstVal.iloc[0]

    return(df_complete)

# Dataframe with random data
df_raw = sample()
df_raw.plot()

# Remove outliers
df_cleaned = noSpikes(df=df_raw, level = 3, keepFirst = True)
df_cleaned.plot()
This is undoubtedly a bit of a "can't see the wood for the trees" moment. I've been staring at this code for an hour and can't see what I've done wrong. I know it's staring me in the face but I just can't see it!
I'm trying to convert between two geographical co-ordinate systems using Python.
I have longitude (x-axis) and latitude (y-axis) values and want to convert to OSGB 1936. For a single point, I can do the following:
import numpy as np
import pandas as pd
import shapefile
import pyproj
inProj = pyproj.Proj(init='epsg:4326')
outProj = pyproj.Proj(init='epsg:27700')
x1,y1 = (-2.772048, 53.364265)
x2,y2 = pyproj.transform(inProj,outProj,x1,y1)
print(x1,y1)
print(x2,y2)
This produces the following:
-2.772048 53.364265
348721.01039783185 385543.95241055806
Which seems reasonable and suggests that longitude of -2.772048 is converted to a co-ordinate of 348721.0103978.
In fact, I want to do this in a Pandas dataframe. The dataframe contains columns containing longitude and latitude and I want to add two additional columns that contain the converted co-ordinates (called newLong and newLat).
An exemplar dataframe might be:
latitude longitude
0 53.364265 -2.772048
1 53.632481 -2.816242
2 53.644596 -2.970592
And the code I've written is:
import numpy as np
import pandas as pd
import shapefile
import pyproj
inProj = pyproj.Proj(init='epsg:4326')
outProj = pyproj.Proj(init='epsg:27700')
df = pd.DataFrame({'longitude':[-2.772048,-2.816242,-2.970592],'latitude':[53.364265,53.632481,53.644596]})
def convertCoords(row):
    x2,y2 = pyproj.transform(inProj,outProj,row['longitude'],row['latitude'])
    return pd.Series({'newLong':x2,'newLat':y2})
df[['newLong','newLat']] = df.apply(convertCoords,axis=1)
print(df)
Which produces:
latitude longitude newLong newLat
0 53.364265 -2.772048 385543.952411 348721.010398
1 53.632481 -2.816242 415416.003113 346121.990302
2 53.644596 -2.970592 416892.024217 335933.971216
But now it seems that the newLong and newLat values have been mixed up (compared with the results of the single point conversion shown above).
Where have I got my wires crossed to produce this result? (I apologise if it's completely obvious!)
When you do df[['newLong','newLat']] = df.apply(convertCoords,axis=1), you are indexing the columns of the df.apply output. However, the column order is arbitrary because your series was defined using a dictionary (which is inherently unordered).
You can opt to return a Series with a fixed column ordering:
return pd.Series([x2, y2])
Alternatively, if you want to keep the convertCoords output labelled, then you can use .join to combine results instead:
return pd.Series({'newLong':x2,'newLat':y2})
...
df = df.join(df.apply(convertCoords, axis=1))
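Put together, a self-contained sketch of the .join variant might look like this (same setup as the question, pyproj 1.x style; not part of the original answer):
import pandas as pd
import pyproj

inProj = pyproj.Proj(init='epsg:4326')
outProj = pyproj.Proj(init='epsg:27700')
df = pd.DataFrame({'longitude': [-2.772048, -2.816242, -2.970592],
                   'latitude': [53.364265, 53.632481, 53.644596]})

def convertCoords(row):
    x2, y2 = pyproj.transform(inProj, outProj, row['longitude'], row['latitude'])
    return pd.Series({'newLong': x2, 'newLat': y2})

# join matches on the labels, so the column order of the apply output no longer matters
df = df.join(df.apply(convertCoords, axis=1))
print(df)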
Please note that the transform function of pyproj also accepts arrays, which is quite useful with large dataframes and much faster than using a lambda/apply approach:
import pandas as pd
from pyproj import Proj, transform
inProj, outProj = Proj(init='epsg:4326'), Proj(init='epsg:27700')
df['newLon'], df['newLat'] = transform(inProj, outProj, df['longitude'].tolist(), df['latitude'].tolist())
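Not part of the original answer: on pyproj 2.x and later the init= syntax is deprecated, and an equivalent vectorized sketch with the Transformer API would be:
from pyproj import Transformer

transformer = Transformer.from_crs("EPSG:4326", "EPSG:27700", always_xy=True)
# always_xy=True keeps the (lon, lat) argument order used above
df['newLon'], df['newLat'] = transformer.transform(df['longitude'].values, df['latitude'].values)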