Working with missing data - Python

I have the following dataframe:
import pandas as pd
from math import sin, cos, pi

data = pd.read_csv('agosto.csv')
Fecha DirViento MagViento
0 2011/07/01 00:00 N 6.6
1 2011/07/01 00:15 N 5.5
2 2011/07/01 00:30 N 6.6
3 2011/07/01 00:45 N 7.5
4 2011/07/01 01:00 --- 6.0
5 2011/07/01 01:15 --- 7.1
6 2011/07/01 01:30 S 4.7
7 2011/07/01 01:45 SE 3.1
...
The first thing I want to do is convert the wind direction values to numeric values in order to obtain the u and v wind components. But when I perform the operations, the missing data (---) generates conflicts.
direccion = []
for i in data['DirViento']:
    if i == 'SSW':
        dir = 202.5
    if i == 'S':
        dir = 180.0
    if i == 'N':
        dir = 360.0
    if i == 'NNE':
        dir = 22.5
    if i == 'NE':
        dir = 45.0
    if i == 'ENE':
        dir = 67.5
    if i == 'E':
        dir = 90.0
    if i == 'ESE':
        dir = 112.5
    if i == 'SE':
        dir = 135.0
    if i == 'SSE':
        dir = 157.5
    if i == 'SW':
        dir = 225.0
    if i == 'WSW':
        dir = 247.5
    if i == 'W':
        dir = 270.0
    if i == 'WNW':
        dir = 292.5
    if i == 'NW':
        dir = 315.0
    if i == 'NNW':
        dir = 337.5
    direccion.append(dir)
data['DirViento'] = direccion
I get the following:
data['DirViento'].head()
0 67.5
1 67.5
2 67.5
3 67.5
4 67.5
Is it because missing data is assigned the value of the other rows? I get the components with the following code:
Vviento = []
Uviento = []
for i in range(0, len(data['MagViento'])):
    Uviento.append(data['MagViento'][i] * sin((data['DirViento'][i] + 180) * (pi / 180.0)))
    Vviento.append(data['MagViento'][i] * cos((data['DirViento'][i] + 180) * (pi / 180.0)))
data['PromeU'] = Uviento
data['PromeV'] = Vviento
Now I group the data to obtain statistics:
index = data.set_index(['Fecha', 'Hora'], inplace=True)
g = index.groupby(level=0)
but I get an error:
IndexError: index out of range for array
Am I doing something wrong? How can I perform the operations while ignoring the missing data?

I see one flaw in your code. Your conditional statement should be more like this:
if i == 'SSW':
    dir = 202.5
elif i == 'S':
    dir = 180.0
...
else:
    dir = np.nan
Or you can reset the dir variable at the beginning of the loop. Otherwise, dir for a row with missing data will keep the value from the previous iteration.
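A minimal sketch of that second option (assuming numpy is imported as np; all but two compass points elided):
direccion = []
for i in data['DirViento']:
    dir = np.nan  # reset every iteration, so rows with missing data stay NaN
    if i == 'SSW':
        dir = 202.5
    if i == 'S':
        dir = 180.0
    # ... remaining compass points ...
    direccion.append(dir)
data['DirViento'] = direccion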
But I think this code could be improved in a more Pythonic way, for example, something like this:
# test DataFrame
df = pd.DataFrame({'DirViento': ['N', 'N', 'N', 'N', '--', '--', 'S', 'SE']})
DirViento
0 N
1 N
2 N
3 N
4 --
5 --
6 S
7 SE
# create points of compass list
dir_lst = ['NNE','NE','ENE','E','ESE','SE','SSE','S','SSW','SW','WSW','W','WNW','NW','NNW','N']
# create a dictionary from it
dir_dict = {x: (i + 1) * 22.5 for i, x in enumerate(dir_lst)}
# add a new column
df['DirViento2'] = df['DirViento'].apply(lambda x: dir_dict.get(x, None))
  DirViento  DirViento2
0         N       360.0
1         N       360.0
2         N       360.0
3         N       360.0
4        --         NaN
5        --         NaN
6         S       180.0
7        SE       135.0
Update: good suggestion from @DanAllan in the comments; the code becomes even shorter and even more Pythonic:
df['DirViento2'] = df['DirViento'].replace(dir_dict)
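With the direction column numeric and missing entries left as NaN, the u and v components from the original question can be computed in a vectorized way, and the NaNs simply propagate through the arithmetic. A minimal sketch, assuming a MagViento column like the one in the question and numpy imported as np:
rad = np.deg2rad(df['DirViento2'] + 180)
df['PromeU'] = df['MagViento'] * np.sin(rad)
df['PromeV'] = df['MagViento'] * np.cos(rad)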

Related

Create new column with multiple values in Python

I have a dataframe which holds the name of each station and a link to the measured values of that station for 2 days:
Station Link
0 EITZE https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/EITZE/W/measurements.json?start=P2D
1 RETHEM https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/RETHEM/W/measurements.json?start=P2D
.......
685 BORGFELD https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/BORGFELD/W/measurements.json?start=P2D
Getting the data from the JSON isn't a big problem.
But then I realized that the JSON link for each station holds multiple values from different times, and I don't know how to attach the values for each time to its specific station.
I tried to get all the values from the JSON, but I can't tell which values belong to which station, because there are just too many.
Does anyone have a solution for me?
The dataframe I would like to have should look like this:
Station Timestamp Value
0 EITZE 2022-07-31T00:30:00+02:00 15
1 EITZE 2022-07-31T00:45:00+02:00 15
.......
100 RETHEM 2022-07-31T00:30:00+02:00 15
101 RETHEM 2022-07-31T00:45:00+02:00 20
.......
xxxx BORGFELD 2022-08-02T00:32:00+02:00 608
Starting with this example data frame:
Station Link
0 EITZE https://www.pegelonline.wsv.de/webservices/res...
1 RETHEM https://www.pegelonline.wsv.de/webservices/res...
You could leverage apply to populate an accumulation data frame.
import pandas as pd
import requests
import json
Define the function to be used by apply
def get_link(x):
    global accum_df
    r = requests.get(x['Link'])
    if r.status_code == 200:
        ldf = pd.DataFrame(json.loads(r.text))
        ldf['station'] = x['Station']
        accum_df = pd.concat([accum_df, ldf])
    else:
        print(r.status_code)  # handle the error
    return None
Apply it
accum_df = pd.DataFrame()
df.apply(get_link, axis=1)
print(accum_df)
Result
timestamp value station
0 2022-07-31T02:00:00+02:00 220.0 EITZE
1 2022-07-31T02:15:00+02:00 220.0 EITZE
2 2022-07-31T02:30:00+02:00 220.0 EITZE
3 2022-07-31T02:45:00+02:00 220.0 EITZE
4 2022-07-31T03:00:00+02:00 219.0 EITZE
.. ... ... ...
181 2022-08-02T00:00:00+02:00 23.0 RETHEM
182 2022-08-02T00:15:00+02:00 23.0 RETHEM
183 2022-08-02T00:30:00+02:00 23.0 RETHEM
184 2022-08-02T00:45:00+02:00 23.0 RETHEM
185 2022-08-02T01:00:00+02:00 23.0 RETHEM
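As a side note, calling pd.concat inside the applied function copies the accumulated frame on every request. A sketch of an alternative (same assumptions about the JSON payload) that collects the per-station frames in a list and concatenates once, which avoids the global and scales better:
frames = []
for _, row in df.iterrows():
    r = requests.get(row['Link'])
    if r.status_code == 200:
        ldf = pd.DataFrame(r.json())
        ldf['station'] = row['Station']
        frames.append(ldf)
    else:
        print(r.status_code)  # handle the error
accum_df = pd.concat(frames, ignore_index=True)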

Insert values at the same row index from two different loops in pandas [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I started working with pandas very recently, and my issue is the following: I have two loops that each generate 10 values. What I want to do is insert the generated values at the bottom of my data frame, in such a way that the index is the same for both loops.
Here is a mock-up example that is quite close to what I'm trying to do:
import pandas as pd
import random

randint = {'rand': [10, 52, 99, 8], 'rand2': [541, 632, 789, 251], 'rand3': [1, 3, 4, 1]}
df = pd.DataFrame(randint, columns=['rand', 'rand2', 'rand3'])
i = j = len(df)
for x in range(10):
    rand = random.randint(1, 101)
    rand2 = random.randint(1, 1001)
    df.loc[df.index[i], "rand"] = rand
    df.loc[df.index[i], "rand2"] = rand2
    i = i + 1
for y in range(10):
    rand3 = random.randint(1, 11)
    df.loc[df.index[j], "rand3"] = rand3
    j = j + 1
print(df)
So what I would like is to have, for instance at row 5, the first set of rand, rand2, and rand3 on the same row, and so forth (e.g. for x and y = 1, I would have the three values on the same row; for x and y = 2, same thing, etc.). The issue is that I have read that it is not a good idea to iterate with pandas (and indeed, pandas raises the error "index 4 is out of bounds for axis 0 with size 4"), but I really have trouble understanding the pandas syntax and I'm a bit lost on how I am supposed to tackle this issue. Thank you for your help.
Expected output:
At first, my dataframe would look like this:
rand  rand2  rand3
10    541    1
52    632    3
99    789    4
8     251    1
Now, let us imagine that the first time in the first loop (so for x = 1), rand = 8, rand2 = 455, and the first time in the second loop (so for y = 1), rand3 = 7.
So now, I would like to add the values obtained to the last row, in such a way that my dataframe would look like this:
rand  rand2  rand3
10    541    1
52    632    3
99    789    4
8     251    1
8     455    7
The issue is that I don't really know how to tell pandas that I want the same index for the two loops. Let me know if it is still not clear.
When you use df.loc, use the value of i or j itself (not df.index[i/j]); df.index[i] fails because position i does not exist in the index yet:
i = j = len(df)
for x in range(10):
    rand = random.randint(1, 101)
    rand2 = random.randint(1, 1001)
    df.loc[i, "rand"] = rand
    df.loc[i, "rand2"] = rand2
    i = i + 1
for y in range(10):
    rand3 = random.randint(1, 11)
    df.loc[j, "rand3"] = rand3
    j = j + 1
print(df)
rand rand2 rand3
0 10.0 541.0 1.0
1 52.0 632.0 3.0
2 99.0 789.0 4.0
3 8.0 251.0 1.0
4 37.0 902.0 6.0
5 65.0 717.0 11.0
6 95.0 345.0 6.0
7 81.0 218.0 9.0
8 90.0 233.0 10.0
9 15.0 918.0 6.0
10 62.0 775.0 10.0
11 27.0 955.0 4.0
12 43.0 17.0 2.0
13 69.0 41.0 8.0
The pandas way of doing this would be something like:
>>> import numpy as np
>>> pd.DataFrame({
...     'rand': np.random.randint(1, 101, 10),
...     'rand2': np.random.randint(1, 1001, 10),
...     'rand3': np.random.randint(1, 11, 10)})
rand rand2 rand3
0 50 877 5
1 9 929 5
2 23 605 7
3 52 205 4
4 39 341 6
5 17 455 7
6 11 505 7
7 68 647 10
8 66 920 6
9 63 386 9
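If the goal is to append the generated block to the existing frame, the same idea extends: build the new rows as their own DataFrame and concatenate once, which keeps rand, rand2, and rand3 aligned on the same rows without tracking indices by hand. A minimal sketch along those lines:
new_rows = pd.DataFrame({
    'rand': np.random.randint(1, 101, 10),
    'rand2': np.random.randint(1, 1001, 10),
    'rand3': np.random.randint(1, 11, 10)})
df = pd.concat([df, new_rows], ignore_index=True)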

Python PuLP performance issue - taking too much time to solve

I am using PuLP to create an allocator function which packs items into trucks based on weight and volume. It works fine (takes 10-15 seconds) for 10-15 items, but when I double the number of items it takes more than half an hour to solve.
def allocator(item_mass, item_vol, truck_mass, truck_vol, truck_cost, id_series):
    n_items = len(item_vol)
    set_items = range(n_items)
    n_trucks = len(truck_cost)
    set_trucks = range(n_trucks)
    print("working1")
    y = pulp.LpVariable.dicts('truckUsed', set_trucks,
                              lowBound=0, upBound=1, cat=LpInteger)
    x = pulp.LpVariable.dicts('itemInTruck', (set_items, set_trucks),
                              lowBound=0, upBound=1, cat=LpInteger)
    print("working2")
    # Model formulation
    prob = LpProblem("Truck allocation problem", LpMinimize)
    # Objective
    prob += lpSum([truck_cost[i] * y[i] for i in set_trucks])
    print("working3")
    # Constraints
    for j in set_items:
        # Every item must be taken in one truck
        prob += lpSum([x[j][i] for i in set_trucks]) == 1
    for i in set_trucks:
        # Respect the mass constraint of trucks
        prob += lpSum([item_mass[j] * x[j][i] for j in set_items]) <= truck_mass[i] * y[i]
        # Respect the volume constraint of trucks
        prob += lpSum([item_vol[j] * x[j][i] for j in set_items]) <= truck_vol[i] * y[i]
    print("working4")
    # Ensure y variables have to be set to make use of x variables:
    for j in set_items:
        for i in set_trucks:
            x[j][i] <= y[i]
    print("working5")
    s = id_series  # id_series
    prob.solve()
    print("working6")
This is the data I am running it on.
Items:
Name Pid Quantity Length Width Height Volume Weight t_type
0 A 1 1 4.60 4.30 4.3 85.05 1500 Open
1 B 2 1 4.60 4.30 4.3 85.05 1500 Open
2 C 3 1 6.00 5.60 9.0 302.40 10000 Container
3 D 4 1 8.75 5.60 6.6 441.00 1000 Open
4 E 5 1 6.00 5.16 6.6 204.33 3800 Open
5 C 6 1 6.00 5.60 9.0 302.40 10000 All
6 C 7 1 6.00 5.60 9.0 302.40 10000 Container
7 D 8 1 8.75 5.60 6.6 441.00 6000 Open
8 E 9 1 6.00 5.16 6.6 204.33 3800 Open
9 C 10 1 6.00 5.60 9.0 302.40 10000 All
.... times 5
Trucks (this is just the top 5 rows; I have 54 types of trucks in total):
  Category       Name  TruckID  Length(ft)  Breadth(ft)  Height(ft)   Volume  Weight  Price
0      LCV  Tempo 407        0         9.5          5.5         5.5  287.375    1500      1
1      LCV  Tempo 407        1         9.5          5.5         5.5  287.375    2000      1
2      LCV  Tempo 407        2         9.5          5.5         5.5  287.375    2500      2
3      LCV    13 Feet        3        13.0          5.5         7.0  500.500    3500      3
4      LCV    14 Feet        4        14.0          6.0         6.0  504.000    4000      3
where ItemId is this:
data["ItemId"] = data.index + 1
id_series = data["ItemId"].tolist()
PuLP can interface with multiple solvers. See which ones you have with:
pulp.pulpTestAll()
This will give a list like:
Solver pulp.solvers.PULP_CBC_CMD unavailable.
Solver pulp.solvers.CPLEX_DLL unavailable.
Solver pulp.solvers.CPLEX_CMD unavailable.
Solver pulp.solvers.CPLEX_PY unavailable.
Testing zero subtraction
Testing continuous LP solution
Testing maximize continuous LP solution
...
* Solver pulp.solvers.COIN_CMD passed.
Solver pulp.solvers.COINMP_DLL unavailable.
Testing zero subtraction
Testing continuous LP solution
Testing maximize continuous LP solution
...
* Solver pulp.solvers.GLPK_CMD passed.
Solver pulp.solvers.XPRESS unavailable.
Solver pulp.solvers.GUROBI unavailable.
Solver pulp.solvers.GUROBI_CMD unavailable.
Solver pulp.solvers.PYGLPK unavailable.
Solver pulp.solvers.YAPOSIB unavailable.
You can then solve using, e.g.:
lp_prob.solve(pulp.COIN_CMD())
Gurobi and CPLEX are commercial solvers that tend to work quite well. Perhaps you could access them? Gurobi has a good academic license.
Alternatively, you may wish to look into an approximate solution, depending on your quality constraints.
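Another practical lever, independent of which solver you use, is to cap the runtime or accept a small optimality gap instead of waiting for a proven optimum. A sketch with the bundled CBC solver (keyword names follow recent PuLP releases; older versions spelled these maxSeconds and fracGap):
# stop after 60 seconds, or as soon as the solution is within 2% of the best bound
prob.solve(pulp.PULP_CBC_CMD(timeLimit=60, gapRel=0.02))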

Efficient pandas rolling aggregation over date range by group - Python 2.7 Windows - Pandas 0.19.2

I'm trying to find an efficient way to generate rolling counts or sums in pandas given a grouping and a date range. Eventually I want to be able to add conditions, i.e. evaluating a 'type' field, but I'm not there just yet. I've written something that gets the job done, but I feel there could be a more direct way of reaching the desired result.
My pandas data frame currently looks like this, with the desired output being put in the last column 'rolling_sales_180'.
name date amount rolling_sales_180
0 David 2015-01-01 100 100.0
1 David 2015-01-05 500 600.0
2 David 2015-05-30 50 650.0
3 David 2015-07-25 50 100.0
4 Ryan 2014-01-04 100 100.0
5 Ryan 2015-01-19 500 500.0
6 Ryan 2016-03-31 50 50.0
7 Joe 2015-07-01 100 100.0
8 Joe 2015-09-09 500 600.0
9 Joe 2015-10-15 50 650.0
My current solution and environment are below. I've been modeling my solution on this R Q&A on Stack Overflow: Efficient way to perform running total in the last 365 day window.
import pandas as pd
import numpy as np

def trans_date_to_dist_matrix(date_col):  # used to create a distance matrix
    x = date_col.tolist()
    y = date_col.tolist()
    data = []
    for i in x:
        tmp = []
        for j in y:
            tmp.append(abs((i - j).days))
        data.append(tmp)
    del tmp
    return pd.DataFrame(data=data, index=date_col.values, columns=date_col.values)

def lower_tri(x_col, date_col, win):  # x_col = column to take a rolling sum of, date_col = dates, win = time window
    dm = trans_date_to_dist_matrix(date_col=date_col)  # dm = distance matrix
    dm = dm.where(dm <= win)  # keep only elements of the distance matrix within the time window
    lt = dm.where(np.tril(np.ones(dm.shape)).astype(np.bool))  # lt = lower tri of distance matrix, so future dates are excluded
    lt[lt >= 0.0] = 1.0  # cleans up the lower tri so we can also sum events that happen on the day we are evaluating
    lt = lt.fillna(0)  # replaces NaN with 0's for multiplication
    return pd.DataFrame(x_col.values * lt.values).sum(axis=1).tolist()

def flatten(x):
    try:
        n = [v for sl in x for v in sl]
        return [v for sl in n for v in sl]
    except:
        return [v for sl in x for v in sl]

data = [
    ['David', '1/1/2015', 100], ['David', '1/5/2015', 500], ['David', '5/30/2015', 50], ['David', '7/25/2015', 50],
    ['Ryan', '1/4/2014', 100], ['Ryan', '1/19/2015', 500], ['Ryan', '3/31/2016', 50],
    ['Joe', '7/1/2015', 100], ['Joe', '9/9/2015', 500], ['Joe', '10/15/2015', 50]
]
list_of_vals = []
dates_df = pd.DataFrame(data=data, columns=['name', 'date', 'amount'], index=None)
dates_df['date'] = pd.to_datetime(dates_df['date'])
list_of_vals.append(dates_df.groupby('name', as_index=False).apply(
    lambda x: lower_tri(x_col=x.amount, date_col=x.date, win=180)))
new_data = flatten(list_of_vals)
dates_df['rolling_sales_180'] = new_data
print dates_df
Your time and feedback are appreciated.
Pandas has support for time-aware rolling via the rolling method, so you can use that instead of writing your own solution from scratch:
def get_rolling_amount(grp, freq):
    return grp.rolling(freq, on='date')['amount'].sum()

df['rolling_sales_180'] = df.groupby('name', as_index=False, group_keys=False) \
                            .apply(get_rolling_amount, '180D')
The resulting output:
name date amount rolling_sales_180
0 David 2015-01-01 100 100.0
1 David 2015-01-05 500 600.0
2 David 2015-05-30 50 650.0
3 David 2015-07-25 50 100.0
4 Ryan 2014-01-04 100 100.0
5 Ryan 2015-01-19 500 500.0
6 Ryan 2016-03-31 50 50.0
7 Joe 2015-07-01 100 100.0
8 Joe 2015-09-09 500 600.0
9 Joe 2015-10-15 50 650.0
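One caveat: time-aware rolling requires the date column to be monotonically increasing within each group, so if the data is not already ordered, sort it first:
df = df.sort_values(['name', 'date'])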

operations in pandas DataFrame

I have a fairly large (~5000 rows) DataFrame, with a number of variables, say 2 ['max', 'min'], sorted by 4 parameters, ['Hs', 'Tp', 'wd', 'seed']. It looks like this:
>>> data.head()
Hs Tp wd seed max min
0 1 9 165 22 225 18
1 1 9 195 16 190 18
2 2 5 165 43 193 12
3 2 10 180 15 141 22
4 1 6 180 17 219 18
>>> len(data)
4500
I want to keep only the first 2 parameters and get the maximum of the standard deviations over all 'seed's, calculated individually for each 'wd'.
In the end, I'm left with unique (Hs, Tp) pairs with the maximum standard deviation for each variable. Something like:
>>> stdev.head()
Hs Tp max min
0 1 5 43.31321 4.597629
1 1 6 43.20004 4.640795
2 1 7 47.31507 4.569408
3 1 8 41.75081 4.651762
4 1 9 41.35818 4.285991
>>> len(stdev)
30
The following code does what I want, but since I have little understanding of DataFrames, I'm wondering if these nested loops can be done in a different and more DataFramey way =)
import pandas as pd
import numpy as np
#
#data = pd.read_table('data.txt')
#
# don't worry too much about this ugly generator,
# it just emulates the format of my data...
total = 4500
data = pd.DataFrame()
data['Hs'] = np.random.randint(1, 4, size=total)
data['Tp'] = np.random.randint(5, 15, size=total)
data['wd'] = [[165, 180, 195][np.random.randint(0, 3)] for _ in xrange(total)]
data['seed'] = np.random.randint(1, 51, size=total)
data['max'] = np.random.randint(100, 250, size=total)
data['min'] = np.random.randint(10, 25, size=total)
# and here it starts. would the creators of pandas pull their hair out if they see this?
# can this be made better?
stdev = pd.DataFrame(columns=['Hs', 'Tp', 'max', 'min'])
i = 0
for hs in set(data['Hs']):
    data_Hs = data[data['Hs'] == hs]
    for tp in set(data_Hs['Tp']):
        data_tp = data_Hs[data_Hs['Tp'] == tp]
        stdev.loc[i] = [
            hs,
            tp,
            max([np.std(data_tp[data_tp['wd'] == wd]['max']) for wd in set(data_tp['wd'])]),
            max([np.std(data_tp[data_tp['wd'] == wd]['min']) for wd in set(data_tp['wd'])])]
        i += 1
Thanks!
PS: if curious, these are statistics on variables depending on sea waves. Hs is wave height, Tp wave period, wd wave direction; the seeds represent different realizations of an irregular wave train, and min and max are the peaks of my variable during a certain exposure time. After all this, by means of the standard deviation and average, I can fit some distribution to the data, like Gumbel.
This could be a one-liner, if I understood you correctly:
data.groupby(['Hs', 'Tp', 'wd'])[['max', 'min']].std(ddof=0).max(level=[0, 1])
(include reset_index() on the end if you want)
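As a side note, on newer pandas versions (1.3+) max(level=...) is deprecated in favor of an explicit groupby, so the equivalent form would be something like:
(data.groupby(['Hs', 'Tp', 'wd'])[['max', 'min']]
     .std(ddof=0)
     .groupby(level=['Hs', 'Tp']).max()
     .reset_index())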
