I have a DataFrame with the locations of some customers (a Customer_id column plus Lat and Lon columns), and I am trying to interpolate the NaNs separately for each customer.
For example, if I interpolate with the nearest method on this made-up data:
Customer_id Lat Lon
A 1 1
A NaN NaN
A 2 2
B NaN NaN
B 4 4
I would like the NaN for B to be 4 and not 2.
I have tried this
series.groupby('Customer_id').apply(lambda group: group.interpolate(method = 'nearest', limit_direction = 'both'))
and the number of NaNs goes down from 9003 to 94, but I don't understand why it still leaves some missing values.
I checked, and these 94 missing values correspond to records from customers whose other rows were being interpolated. For example,
Customer_id Lat
0 A 1
1 A NaN
2 A NaN
3 A NaN
4 A NaN
it would interpolate correctly up to some row (say it fills rows 1, 2 and 3 correctly) and then leave row 4 as NaN.
I have tried setting a limit in interpolate greater than the maximum number of records per customer, but it still doesn't work out. I don't know where my mistake is; can somebody help out?
(I don't know if it's relevant, but I fabricated the NaNs myself; the code I used is from "Replace some values in a dataframe with NaN's if the index of the row does not exist in another dataframe". I don't think the problem is there, but since I'm very confused as to where the issue actually is, I'll just leave it here.)
When you interpolate with method='nearest', it can only fill missing values that lie between valid values. (You'll also notice that it raises an error when a group has only one non-null value, as in your example.) The remaining nulls are at the "edges" of each group, which can be handled with .ffill().bfill(); this matches the nearest logic, and it is also the appropriate way to fill a group that has only one non-missing value.
def my_interp(x):
    if x.notnull().sum() > 1:
        # Interpolate interior NaNs to the nearest value, then fill the edges
        return x.interpolate(method='nearest').ffill().bfill()
    else:
        # With only one valid value, interpolate would fail; just propagate that value
        return x.ffill().bfill()
df.groupby('Customer_id').transform(my_interp)
# Lat Lon
#0 1.0 1.0
#1 1.0 1.0
#2 2.0 2.0
#3 4.0 4.0
#4 4.0 4.0
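If you want to write the filled values back into the original frame instead of just viewing the result, one option (a sketch, assuming the frame is called df and the value columns are Lat and Lon as above) is:
df[['Lat', 'Lon']] = df.groupby('Customer_id').transform(my_interp)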
Good afternoon all,
Bit stuck with the last stage of a calculation.
I have a DataFrame which looks like this:
LaCode Group Frequency
0 718 NaN 2
1 718 3 1
2 719 1 4
3 719 2 10
I'm struggling with the percentage calculation: for each LaCode, ignore the rows where Group is NaN (just leave the Percentage as NaN or blank there) and calculate each row's percentage of the total frequency over the rows where Group is known.
Should output as such:
Percentage
NaN
100
28.571
71.428
Can anyone help with this? My code doesn't take the change in LaCode into account, and I can't work out the correct syntax to handle it.
Thanks.
Edit: For completeness, I have converted the NaNs to an integer that stands out so I can see them (in this instance 0, as that isn't a valid group in the survey).
The code I'm using for the calculation was provided to me and I tweaked it a little. It works OK when there is just one LaCode:
df['Percentage'] = df[df['Value'] != 0]['Count'].apply(lambda x: x/sum(df[df['Value'] != 0]['Count']))
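One way to get the per-LaCode percentages, shown here as a sketch (not from an answer in this thread) and assuming the columns are named LaCode, Group and Frequency as in the example above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'LaCode': [718, 718, 719, 719],
                   'Group': [np.nan, 3, 1, 2],
                   'Frequency': [2, 1, 4, 10]})

# Keep Frequency only where Group is known; NaN elsewhere
known = df['Frequency'].where(df['Group'].notna())
# Total of the known frequencies within each LaCode
totals = known.groupby(df['LaCode']).transform('sum')
# Percentage of the known total; rows with unknown Group stay NaN
df['Percentage'] = known / totals * 100
This gives NaN, 100, 28.571 and 71.428 for the four example rows.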
I'm setting up a DataFrame by reading a CSV file in pandas. The columns represent one-dimensional positions for different samples, and each row represents a 0.01 s time segment. I want to create a new DataFrame for velocity and acceleration (so basically apply the operation [point(i) - point(i-1)]/0.01 to every cell in the DataFrame).
I'm having trouble using pandas.applymap or other approaches because I don't quite know how to refer to multiple cells of the DataFrame in a single operation, if that makes sense.
import pandas as pd
import numpy as np

data = pd.read_csv("file_name")

def velocity(xf, xi):
    v = (xf - xi)*100
    return v

velocity = data.applymap(velocity)
This is what the first few column and rows of the original data frame look like:
X LFHD Y LFHD Z LFHD X RFHD Y RFHD
0 700.003 -1769.61 1556.05 811.922 -1878.46
1 699.728 -1769.50 1555.99 811.942 -1878.14
2 699.465 -1769.38 1555.99 811.980 -1877.81
3 699.118 -1769.38 1555.83 812.005 -1877.48
4 699.017 -1768.78 1556.19 812.003 -1877.11
For every positional value in each column, I want to calculate the velocity where the initial positional value is the cell above (xi as the input in the velocity function) and the final positional value is the cell in question (xf).
When I try to run the above code, it gives me an error because only one argument is provided to velocity, while it expects two. I don't know how to provide the second argument so that it outputs the proper new DataFrame with the velocity calculated in each cell.
df_velocity = data.diff()*100
df_velocity
Out[6]:
X_LFHD Y_LFHD Z_LFHD X_RFHD Y_RFHD
0 NaN NaN NaN NaN NaN
1 -27.5 11.0 -6.0 2.0 32.0
2 -26.3 12.0 0.0 3.8 33.0
3 -34.7 0.0 -16.0 2.5 33.0
4 -10.1 60.0 36.0 -0.2 37.0
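If you also need acceleration, the same finite-difference idea can be applied once more to the velocity frame (a sketch, not part of the original answer):
df_acceleration = df_velocity.diff()*100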
I have a program that ideally measures the temperature every second. However, in reality this does not happen. Sometimes it skips a second, or it breaks down for 400 seconds and then decides to start recording again. This leaves gaps in my 2-by-n DataFrame, where ideally n = 86400 (the number of seconds in a day). I want to apply some sort of moving/rolling average to it to get a nicer plot, but if I do that to the "raw" data files, the number of data points shrinks. This is visible in my plot (watch the x-axis); I know the "nice data" doesn't look nice yet, I'm just playing with some values.
So I want to implement a data-cleaning step that adds the missing rows to the DataFrame. I thought about it, but don't know how to implement it. My idea is as follows:
If an index has no matching time, then we need to add a value at time = index. If the gap is only one value, the average of the previous and next values will do for me. But if it is bigger, say 100 seconds are missing, then a linear function is needed, which increases or decreases the value steadily.
So I guess an example data set could look like this:
index time temp
0 0 20.10
1 1 20.20
2 2 20.20
3 4 20.10
4 100 22.30
Here, I would like to get a value for index 3, time 3 and the values missing between time = 4 and time = 100. I'm sorry about my formatting skills, I hope it is clear.
How would I go about programming this?
Use merge with a complete time column and then interpolate:
# Create your table
time = np.array([e for e in np.arange(20) if np.random.uniform() > 0.6])
temp = np.random.uniform(20, 25, size=len(time))
temps = pd.DataFrame([time, temp]).T
temps.columns = ['time', 'temperature']
>>> temps
time temperature
0 4.0 21.662352
1 10.0 20.904659
2 15.0 20.345858
3 18.0 24.787389
4 19.0 20.719487
The above is a random table generated with missing time data.
# modify it
filled = pd.Series(np.arange(temps.iloc[0,0], temps.iloc[-1, 0]+1))
filled = filled.to_frame()
filled.columns = ['time'] # Create a fully filled time column
merged = pd.merge(filled, temps, on='time', how='left') # merge it with original, time without temperature will be null
merged.temperature = merged.temperature.interpolate() # fill nulls linearly.
# Alternatively, use reindex, this does the same thing.
final = temps.set_index('time').reindex(np.arange(temps.time.min(),temps.time.max()+1)).reset_index()
final.temperature = final.temperature.interpolate()
>>> merged # or final
time temperature
0 4.0 21.662352
1 5.0 21.536070
2 6.0 21.409788
3 7.0 21.283505
4 8.0 21.157223
5 9.0 21.030941
6 10.0 20.904659
7 11.0 20.792898
8 12.0 20.681138
9 13.0 20.569378
10 14.0 20.457618
11 15.0 20.345858
12 16.0 21.826368
13 17.0 23.306879
14 18.0 24.787389
15 19.0 20.719487
First you can convert the second values to actual datetime values like this:
df.index = pd.to_datetime(df['time'], unit='s')
After which you can use pandas' built-in time series operations to resample and fill in the missing values:
df = df.resample('s').interpolate('time')
Optionally, if you still want to do some smoothing you can use the following operation for that:
df.rolling(5, center=True, win_type='hann').mean()
Which will smooth with a 5 element wide Hanning window. Note: any window-based smoothing will cost you value points at the edges.
Now your dataframe will have datetimes (including date) as index. This is required for the resample method. If you want to lose the date, you can simply use:
df.index = df.index.time
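Putting the pieces together on the small example from the question (a sketch; the column names 'time' and 'temp' are assumptions, and win_type-based rolling requires SciPy to be installed):
import pandas as pd

df = pd.DataFrame({'time': [0, 1, 2, 4, 100],
                   'temp': [20.10, 20.20, 20.20, 20.10, 22.30]})

df.index = pd.to_datetime(df['time'], unit='s')   # seconds -> datetimes
df = df.resample('s').interpolate('time')         # insert the missing seconds and fill them
smoothed = df['temp'].rolling(5, center=True, win_type='hann').mean()  # optional smoothing
df.index = df.index.time                          # drop the date part again if unwanted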
I have an aviation dataset that I am trying to clean. There are some missing values for the NumEngines feature, but in some instances a missing value can be derived from an entry elsewhere in the DataFrame (this is not always the case). Below is a mini example of my dataset to illustrate both cases. Note that the first Cessna entry can be used to fill in the second, but this is not the case for Piper.
df = pd.DataFrame()
df["Make"] = ["Cessna","Piper","Cessna","Boeing"]
df["Model"] = ["Citation","PA32RT","Citation","737-300"]
df["NumEngines"] = [2,None,None,2]
How can I make it so that the resulting DataFrame would be
Make Model NumEngines
0 Cessna Citation 2.0
1 Piper PA32RT NaN
2 Cessna Citation 2.0
3 Boeing 737-300 2.0
I would bet transform('first') can do the trick here:
df.groupby(['Make', 'Model']).transform('first')
Out[179]:
NumEngines
0 2.0
1 NaN
2 2.0
3 2.0
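Note that transform returns only the non-grouping columns; to keep Make and Model and fill just NumEngines in place, one way (a sketch) is to assign the transformed column back:
df['NumEngines'] = df.groupby(['Make', 'Model'])['NumEngines'].transform('first')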
I am attempting to divide one column by another inside of a function:
lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')
As can be seen, I am dividing by a column within the DataFrame, but I am getting a rather strange error:
ValueError: putmask: mask and data must be the same size
I must confess, this is the first time I have seen this error. It seems to suggest that the DF and the column are of different lengths, but clearly (since the column comes from the DataFrame) they are not.
A further twist is that I am using this function to loop a data-management procedure over year-specific sets (the data are from the Quarterly Census of Employment and Wages 'singlefiles' in the beta series). The sets for the 1990-2000 period go off without a hitch, but 2001 throws this error. I have not been able to identify a difference in structure across years, and even if I could, how would it explain the length mismatch?
Any thoughts would be greatly appreciated.
EDIT (2/1/2014): Thanks for taking a look Tom. As requested, the pandas version is 0.13.0, and the data file in question is located here on the BLS FTP site. Just to clarify what I meant by consistent structure, every year has the same variable set and dtype (in addition to a consistent data code structure).
EDIT (2/1/2014): Perhaps it would be useful to share the entire function:
def qcew(f,m_dict):
    '''Function reads in file and captures county level aggregations with government contributions'''
    #Read in file
    cew=pd.read_csv(f)
    #Create string version of area fips
    cew['fips']=cew['area_fips'].astype(str)
    #Generate description variables
    cew['area']=cew['fips'].map(m_dict['area'])
    cew['industry']=cew['industry_code'].map(m_dict['industry'])
    cew['agglvl']=cew['agglvl_code'].map(m_dict['agglvl'])
    cew['own']=cew['own_code'].map(m_dict['ownership'])
    cew['size']=cew['size_code'].map(m_dict['size'])
    #Generate boolean masks
    lagg_mask=cew['agglvl_code']==73
    lsize_mask=cew['size_code']==0
    #Subset data to above specifications
    cew_super=cew[lagg_mask & lsize_mask]
    #Define column subset
    lsub_cols=['year','fips','area','industry_code','industry','own','annual_avg_estabs_count','annual_avg_emplvl',\
               'total_annual_wages','own_code']
    #Subset to desired columns
    cew_sub=cew_super[lsub_cols]
    #Rename columns
    cew_sub.columns=['year','fips','cty','ind_code','industry','own','estabs','emp','tot_wages','own_code']
    #Set index
    cew_sub.set_index(['year','fips','cty'],inplace=True)
    #Capture total wage base and the contributions of Federal, State, and Local
    cew_base=cew_sub['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_fed=cew_sub[cew_sub['own_code']==1]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_st=cew_sub[cew_sub['own_code']==2]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_loc=cew_sub[cew_sub['own_code']==3]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    #Convert to DFs for join
    lbase=DataFrame(cew_base).rename(columns={0:'base'})
    lfed=DataFrame(cew_fed).rename(columns={0:'fed_wage'})
    lstate=DataFrame(cew_st).rename(columns={0:'st_wage'})
    llocal=DataFrame(cew_loc).rename(columns={0:'loc_wage'})
    #Join these series
    lcontrib_lev=pd.concat([lbase,lfed,lstate,llocal],axis='index').fillna(0)
    #Diag prints
    print f
    print lcontrib_lev.head()
    print lcontrib_lev.describe()
    print '*****************************\n'
    #Calculate proportional contributions (failure point)
    lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')
    #Group base data by year, county, and industry
    cew_g=cew_sub.reset_index().groupby(['year','fips','cty','ind_code','industry']).sum().reset_index()
    #Join contributions to joined data
    cew_contr=cew_g.set_index(['year','fips','cty']).join(lcontrib[['fed_wage','st_wage','loc_wage']])
    return cew_contr[[x for x in cew_contr.columns if x != 'own_code']]
Works ok for me (this is on 0.13.1, but IIRC nothing in this particular area changed; it's possible it was a bug that has since been fixed).
In [48]: lcontrib_lev.div(lcontrib_lev['base'],axis='index').head()
Out[48]:
base fed_wage st_wage loc_wage
year fips cty
2001 1000 1000 NaN NaN NaN NaN
1000 NaN NaN NaN NaN
10000 10000 NaN NaN NaN NaN
10000 NaN NaN NaN NaN
10001 10001 NaN NaN NaN NaN
[5 rows x 4 columns]
In [49]: lcontrib_lev.div(lcontrib_lev['base'],axis='index').tail()
Out[49]:
base fed_wage st_wage loc_wage
year fips cty
2001 CS566 CS566 1 0.000000 0.000000 0.000000
US000 US000 1 0.022673 0.027978 0.073828
USCMS USCMS 1 0.000000 0.000000 0.000000
USMSA USMSA 1 0.000000 0.000000 0.000000
USNMS USNMS 1 0.000000 0.000000 0.000000
[5 rows x 4 columns]