I am trying to convert a SQL query into its pandas equivalent.
The SQL query is
select count(*),sum(days) into :_cnt_ML_2R, :_pd_QL_1R
from _gm_Qr_bfr_mnt
where x=1 and y=1 and input(code,8.) in (70001:73599)
So far I have worked out the individual pieces.
For SQL select count(*) from _gm_Qr_bfr_mnt, the pandas equivalent is pd.Series(_gm_Qr_bfr_mnt.shape[0])
and for SQL select sum(days) from _gm_Qr_bfr_mnt, the pandas equivalent is pd.Series(_gm_Qr_bfr_mnt['days'].sum())
but I am unable to convert the INTO clause together with the WHERE condition.
I need guidance on how to convert this to the equivalent pandas code.
For a SQL query like the following
select available_date, count(), sum() from DF
where price > 500
group by available_date
the equivalent will be
df.query('Price > 500').groupby('Available_Since_Date')['Available_Quantity'].agg(['size', 'sum'])
Where clause: df.query('Price > 500')
Group by: group by the date column, then aggregate with 'size' for the count and 'sum' for the sum
  product_name      Price  Final_Price  Available_Quantity Available_Since_Date
0     Keyboard    500.000          5.0                 5.0            11/5/2021
1        Mouse        NaN          NaN                 9.0            4/23/2021
2      Monitor   5000.235         10.0                 6.0           08/21/2021
3          CPU        NaN          NaN                 NaN           09/18/2021
4          CPU  10000.550         20.0                 6.0           09/18/2021
5     Speakers    250.500          8.0                 5.0           01/05/2021
6          NaT        NaN          NaN                 8.0                  NaT
                      size  sum
Available_Since_Date
08/21/2021               1  6.0
09/18/2021               1  6.0
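For anyone who wants to reproduce the output above, here is a minimal sketch that rebuilds the relevant columns of the sample table (the last, all-missing row is left out for brevity) and runs the same query, groupby, and agg chain. The variable name `result` is only an illustrative choice.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table above (only the columns the query needs)
df = pd.DataFrame({
    "product_name": ["Keyboard", "Mouse", "Monitor", "CPU", "CPU", "Speakers"],
    "Price": [500.0, np.nan, 5000.235, np.nan, 10000.55, 250.5],
    "Available_Quantity": [5.0, 9.0, 6.0, np.nan, 6.0, 5.0],
    "Available_Since_Date": ["11/5/2021", "4/23/2021", "08/21/2021",
                             "09/18/2021", "09/18/2021", "01/05/2021"],
})

result = (
    df.query("Price > 500")                     # WHERE price > 500
      .groupby("Available_Since_Date")          # GROUP BY available_date
      ["Available_Quantity"]
      .agg(["size", "sum"])                     # count(*) and sum(...)
)
print(result)
```

Note that 'size' counts rows including missing values, like count(*), whereas 'count' would count only non-missing values.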
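Filter the df first, then compute the two values from the filtered frame. input(code,8.) is SAS syntax that reads code as a number, so the column to test is code, converted with pd.to_numeric:
sub_df = df.loc[(df['x'].eq(1) &
                 df['y'].eq(1) &
                 pd.to_numeric(df['code'], errors='coerce').between(70001, 73599, inclusive="both"))]
_cnt_ML_2R = sub_df.shape[0]
_pd_QL_1R = sub_df['days'].sum()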
I have to read some tables from a SQL Server and merge their values into a single data structure for a machine learning project. I'm using pandas, in particular pd.read_sql_query to read the values and pd.merge to combine them.
One table has at least 80 million rows, and if I try to read it in one go it takes up all my available storage (it's not much, only 20 GB, since I have a small SSD), so I decided to split it into chunks of 100000 rows:
df_bilanci = pd.read_sql_query('SELECT [IdBilancioRiclassificato], [IdBilancio], [Importo] FROM dbo.Bilanci', conn, chunksize=100000)
One single chunk will be like this:
   IdBilancioRiclassificato  IdBilancio   Importo
0                      10.0      7001.0       0.0
1                      11.0      7001.0  502643.0
2                      12.0      7001.0   -4550.0
3                      10.0      7002.0  654654.0
4                      11.0      7002.0       0.0
5                      12.0      7002.0       0.0
I want the values of IdBilancioRiclassificato as columns (there are 97 unique values in this column, so there should be 97 columns), so I used pd.pivot on every chunk and then pd.concat plus merge to put all the data together:
for chunk in df_bilanci:
    chunk.reset_index()
    chunk_pivoted = pd.pivot(data=chunk,
                             index='IdBilancio',
                             columns='IdBilancioRiclassificato',
                             values='Importo'
                             )
    df_aziende_bil = pd.concat([df_aziende_bil, pd.merge(left=df_aziende_anagrafe, right=chunk_pivoted, left_on='ID', right_index=True)])
At this point however, the chunk_pivoted dataframe has some values replaced with NaN values but if I look in the table the values exist.
The result expected is a table like this one:
IdBilancio      10.0      11.0     12.0
7001.0           0.0  502643.0  -4550.0
7002.0      654654.0       0.0      0.0
but I got something like this:
IdBilancio      10.0  11.0     12.0
7001.0           0.0   NaN  -4550.0
7002.0      654654.0   NaN      NaN
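The thread has no confirmed answer, so the following is only a guess at one possible cause: if the rows belonging to a single IdBilancio are split across two chunks, each chunk's pivot can only fill the columns it has seen and leaves the rest as NaN. The sketch below reproduces that situation with two made-up chunks and then collapses the duplicated index entries with groupby(level=0).first(), which keeps the first non-missing value per column. The chunk contents and variable names are illustrative assumptions.

```python
import pandas as pd

# Two toy chunks; the rows for IdBilancio 7001 are split across them
chunk1 = pd.DataFrame({
    "IdBilancioRiclassificato": [10.0, 11.0],
    "IdBilancio": [7001.0, 7001.0],
    "Importo": [0.0, 502643.0],
})
chunk2 = pd.DataFrame({
    "IdBilancioRiclassificato": [12.0, 10.0, 11.0, 12.0],
    "IdBilancio": [7001.0, 7002.0, 7002.0, 7002.0],
    "Importo": [-4550.0, 654654.0, 0.0, 0.0],
})

pivoted = [
    c.pivot(index="IdBilancio", columns="IdBilancioRiclassificato", values="Importo")
    for c in (chunk1, chunk2)
]

# After concat, 7001.0 appears twice, each row with NaN gaps
stacked = pd.concat(pivoted)

# Merge the partial rows: first non-missing value per column within each IdBilancio
combined = stacked.groupby(level=0).first()
print(combined)
```

If this is the cause, doing the merge with df_aziende_anagrafe once, after all chunks have been combined, would avoid carrying the partial rows forward.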
In a dataframe, with some empty(NaN) values in some rows - Example below
s = pd.DataFrame([[39877380,158232151,20], [39877380,332086469,], [39877380,39877381,14], [39877380,39877383,8], [73516838,6439138,1], [73516838,6500551,], [735571896,203559638,], [735571896,282186552,], [736453090,6126187,], [673117474,12196071,], [673117474,12209800,], [673117474,618058747,6]], columns=['start','end','total'])
When I groupby start and end columns
s.groupby(['start', 'end']).total.sum()
the output I get is
start      end
39877380   39877381     14.00
           39877383      8.00
           158232151    20.00
           332086469      nan
73516838   6439138       1.00
           6500551        nan
673117474  12196071       nan
           12209800       nan
           618058747     6.00
735571896  203559638      nan
           282186552      nan
736453090  6126187        nan
I want to exclude every start group in which the total is NaN for all of its end values. Expected output:
start      end
39877380   39877381     14.00
           39877383      8.00
           158232151    20.00
           332086469      nan
73516838   6439138       1.00
           6500551        nan
673117474  12196071       nan
           12209800       nan
           618058747     6.00
I tried dropna(), but it removes all the NaN values rather than only the groups that are entirely NaN.
I am new to Python and pandas. Can someone help me with this? Thank you.
In newer pandas versions it is necessary to pass min_count=1 to sum so that groups containing only missing values stay NaN:
s1 = s.groupby(['start', 'end']).total.sum(min_count=1)
#solution for older pandas versions
#s1 = s.groupby(['start', 'end']).total.sum()
Then it is possible to keep only the first-level groups with at least one non-missing value by using Series.notna with GroupBy.transform and GroupBy.any; the filtering is done by boolean indexing:
s2 = s1[s1.notna().groupby(level=0).transform('any')]
#solution for older pandas versions
#s2 = s1[s1.notnull().groupby(level=0).transform('any')]
print (s2)
start      end
39877380   39877381     14.0
           39877383      8.0
           158232151    20.0
           332086469     NaN
73516838   6439138       1.0
           6500551       NaN
673117474  12196071      NaN
           12209800      NaN
           618058747     6.0
Name: total, dtype: float64
Alternatively, get the unique first-level index values that have a non-missing value with MultiIndex.get_level_values and filter with Series.loc:
idx = s1.index.get_level_values(0)
s2 = s1.loc[idx[s1.notna()].unique()]
#solution for older pandas versions
#s2 = s1.loc[idx[s1.notnull()].unique()]
print (s2)
start      end
39877380   39877381     14.0
           39877383      8.0
           158232151    20.0
           332086469     NaN
73516838   6439138       1.0
           6500551       NaN
673117474  12196071      NaN
           12209800      NaN
           618058747     6.0
Name: total, dtype: float64
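To see why min_count=1 matters, here is a small self-contained sketch on made-up data (much smaller than the frame in the question). With the default sum, a group whose values are all missing becomes 0.0, so the NaN-based filter would have nothing to remove; with min_count=1 it stays NaN and the transform('any') filter drops the all-NaN start group.

```python
import numpy as np
import pandas as pd

# Toy data: start=1 has one real value, start=2 has only missing values
s = pd.DataFrame(
    [[1, 10, 5.0], [1, 11, np.nan], [2, 20, np.nan], [2, 21, np.nan]],
    columns=["start", "end", "total"],
)

# Default sum: all-NaN groups are reported as 0.0
default_sum = s.groupby(["start", "end"])["total"].sum()

# min_count=1: a group needs at least one non-missing value, otherwise NaN
s1 = s.groupby(["start", "end"])["total"].sum(min_count=1)

# Keep every row of a start group that has at least one non-missing total
s2 = s1[s1.notna().groupby(level=0).transform("any")]
print(s2)
```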
I need to combine the dataseries rateScore and rate into one.
This is the current DataFrame I have
rateScore rate
10 NaN 4.5
11 2.5 NaN
12 4.5 NaN
13 NaN 5.0
..
235 NaN 4.7
236 3.8 NaN
This needs to be something like this:
rateScore
10 4.5
11 2.5
12 4.5
13 5.0
..
235 4.7
236 3.8
The rate column needs to be dropped after merging the series, and the index number of each row needs to stay the same.
You can try the following: use fillna(), redefine the rateScore column, and drop rate:
df = df.fillna(0)
df['rateScore'] = df['rateScore'] + df['rate']
df = df.drop(columns='rate')
You could use combine_first to fill NaN values from a second Series:
df['rateScore'] = df['rateScore'].combine_first(df['rate'])
Or use add with fill_value=0:
df['rateScore'] = df['rateScore'].add(df['rate'],fill_value=0)
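As a minimal sketch of the combine_first route, using a few rows copied from the question: rateScore is kept where it is present, rate is used as the fallback, and the original index labels are untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"rateScore": [np.nan, 2.5, 4.5, np.nan], "rate": [4.5, np.nan, np.nan, 5.0]},
    index=[10, 11, 12, 13],
)

# Take rateScore where present, otherwise fall back to rate
df["rateScore"] = df["rateScore"].combine_first(df["rate"])

# The rate column is no longer needed
df = df.drop(columns="rate")
print(df)
```

Unlike the fillna(0) plus addition approach, this does not add the two values together if a row happens to have both columns filled.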
I have a pandas Dataframe containing EOD financial data (OHLC) for analysis.
I'm using the https://github.com/cirla/tulipy library to generate technical indicator values, which take a time period as an option. For example, ADX with timeperiod=5 shows the ADX for the last 5 days.
Because of this time period, the generated array of indicator values is always shorter than the DataFrame, since the prices of the first 5 days are used to generate the ADX for day 6.
pdi14, mdi14 = ti.di(
high=highData, low=lowData, close=closeData, period=14)
df['mdi_14'] = mdi14
df['pdi_14'] = pdi14
>> ValueError: Length of values does not match length of index
Unfortunately, unlike TA-LIB for example, this tulip library does not provide NaN-values for these first couple of empty days...
Is there an easy way to prepend these NaN to the ndarray?
Or insert into df at a certain index & have it create NaN for the rows before it automatically?
Thanks in advance, I've been researching for days!
Maybe do the shift yourself in the code?
period = 14
pdi14, mdi14 = ti.di(
    high=highData, low=lowData, close=closeData, period=period
)
df['mdi_14'] = np.nan
# Assign by position through .iloc to avoid chained assignment
df.iloc[period - 1:, df.columns.get_loc('mdi_14')] = mdi14
I hope they will fill the first values with NAN in the lib in the future. It's dangerous to leave time series data like this without any label.
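If you would rather not hard-code the offset for each indicator, one option is to pad the front of the result with NaN based on the length difference. This is a sketch only: pad_front is a hypothetical helper, not part of tulipy, and the indicator array below is a stand-in for a real tulipy result.

```python
import numpy as np
import pandas as pd

def pad_front(values, target_len):
    """Left-pad a 1-D array with NaN so that it has target_len elements."""
    values = np.asarray(values, dtype=float)
    missing = target_len - len(values)
    if missing < 0:
        raise ValueError("indicator output is longer than the DataFrame")
    return np.concatenate([np.full(missing, np.nan), values])

df = pd.DataFrame({"close": np.arange(10, dtype=float)})

# Stand-in for a tulipy result that is 3 values shorter than the DataFrame
indicator = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])

df["ind"] = pad_front(indicator, len(df))
print(df)
```

This relies on the indicator output lining up with the end of the price series, which matches the behaviour described in the question.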
Full MCVE
df = pd.DataFrame(1, range(10), list('ABC'))
a = np.full((len(df) - 6, df.shape[1]), 2)
b = np.full((6, df.shape[1]), np.nan)
c = np.vstack([b, a])  # np.row_stack is a deprecated alias of np.vstack
d = pd.DataFrame(c, df.index, df.columns)
d
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 2.0 2.0 2.0
7 2.0 2.0 2.0
8 2.0 2.0 2.0
9 2.0 2.0 2.0
The C version of the tulip library includes a start function for each indicator (reference: https://tulipindicators.org/usage) that can be used to determine the output length of an indicator given a set of input options. Unfortunately, it does not appear that the python bindings library, tulipy, includes this functionality. Instead you have to resort to dynamically reassigning your index values to align the output with the original DataFrame.
Here is an example that uses the price series from the tulipy docs:
import numpy as np
import pandas as pd
import tulipy as ti

#Create the dataframe with close prices (a list, not a set, so the order is preserved)
prices = pd.DataFrame(data=[81.59, 81.06, 82.87, 83, 83.61, 83.15, 82.84, 83.99, 84.55,
                            84.36, 85.53, 86.54, 86.89, 87.77, 87.29], columns=['close'])
#Compute the technical indicator using tulipy and save the result in a DataFrame
bbands = pd.DataFrame(data=np.transpose(ti.bbands(real = prices['close'].to_numpy(), period = 5, stddev = 2)))
#Dynamically realign the index; note from the tulip library documentation that the price/volume data is expected to be ordered "oldest to newest (index 0 is oldest)"
bbands.index += prices.index.max() - bbands.index.max()
#Put the indicator values with the original DataFrame
prices[['BBANDS_5_2_low', 'BBANDS_5_2_mid', 'BBANDS_5_2_up']] = bbands
prices.head(15)
close BBANDS_5_2_low BBANDS_5_2_mid BBANDS_5_2_up
0 81.59 NaN NaN NaN
1 81.06 NaN NaN NaN
2 82.87 NaN NaN NaN
3 83.00 NaN NaN NaN
4 83.61 80.530042 82.426 84.321958
5 83.15 81.494061 82.844 84.193939
6 82.84 82.533343 83.094 83.654657
7 83.99 82.471983 83.318 84.164017
8 84.55 82.417750 83.628 84.838250
9 84.36 82.435203 83.778 85.120797
10 85.53 82.511331 84.254 85.996669
11 86.54 83.142618 84.994 86.845382
12 86.89 83.536488 85.574 87.611512
13 87.77 83.870324 86.218 88.565676
14 87.29 85.288871 86.804 88.319129
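The index arithmetic above works for a default integer index. For a date index, a variation is to wrap the shorter array in a Series labelled with the tail of the DataFrame's index and let pandas align it on assignment. The sketch below does not call tulipy: it assumes a plain moving average computed with np.convolve as a stand-in for any indicator that returns fewer values than it receives, and the dates are made up.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2021-01-01", periods=8, freq="D")
prices = pd.DataFrame(
    {"close": [81.59, 81.06, 82.87, 83.0, 83.61, 83.15, 82.84, 83.99]},
    index=dates,
)

# Stand-in indicator: 5-day simple moving average, 4 values shorter than the input
period = 5
sma = np.convolve(prices["close"].to_numpy(), np.ones(period) / period, mode="valid")

# Label the result with the last len(sma) index entries; earlier rows become NaN
prices["sma_5"] = pd.Series(sma, index=prices.index[-len(sma):])
print(prices)
```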
I'm looking to adjust values of one column based on a conditional in another column.
I'm using np.busday_count, but I don't want the weekend values to behave like a Monday (Sat to Tues is given 1 working day, I'd like that to be 2)
dispdf = df[(df.dispatched_at.isnull()==False) & (df.sold_at.isnull()==False)]
dispdf["dispatch_working_days"] = np.busday_count(dispdf.sold_at.tolist(), dispdf.dispatched_at.tolist())
for i in range(len(dispdf)):
    if dispdf.dayofweek.iloc[i] == 5 or dispdf.dayofweek.iloc[i] == 6:
        dispdf.dispatch_working_days.iloc[i] +=1
Sample:
dayofweek dispatch_working_days
43159 1.0 3
48144 3.0 3
45251 6.0 1
49193 3.0 0
42470 3.0 1
47874 6.0 1
44500 3.0 1
43031 6.0 3
43193 0.0 4
43591 6.0 3
Expected Results:
dayofweek dispatch_working_days
43159 1.0 3
48144 3.0 3
45251 6.0 2
49193 3.0 0
42470 3.0 1
47874 6.0 2
44500 3.0 1
43031 6.0 4
43193 0.0 4
43591 6.0 4
At the moment I'm using this for loop to add a working day to Saturday and Sunday values. It's slow!
Can I use vectorization instead to speed this up? I tried using .apply, but to no avail.
Pretty sure this works, but there are more optimized implementations:
def adjust_dispatch(df_line):
    if df_line['dayofweek'] >= 5:
        return df_line['dispatch_working_days'] + 1
    else:
        return df_line['dispatch_working_days']
df['dispatch_working_days'] = df.apply(adjust_dispatch, axis=1)
The for loop in your code could be replaced by this line (note >= 5, so that both Saturday, 5, and Sunday, 6, are included):
dispdf.loc[dispdf.dayofweek>=5,'dispatch_working_days']+=1
or you could use numpy.where
https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html
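A minimal sketch of the numpy.where variant, on a small made-up frame with the same two column names as in the question. The weekend test uses >= 5 so that both Saturday (5) and Sunday (6) get the extra day.

```python
import numpy as np
import pandas as pd

dispdf = pd.DataFrame(
    {"dayofweek": [1.0, 3.0, 6.0, 5.0, 0.0], "dispatch_working_days": [3, 3, 1, 0, 4]}
)

# True for Saturday (5) and Sunday (6)
is_weekend = dispdf["dayofweek"] >= 5

# Add one working day on weekend rows, leave the others unchanged
dispdf["dispatch_working_days"] = np.where(
    is_weekend,
    dispdf["dispatch_working_days"] + 1,
    dispdf["dispatch_working_days"],
)
print(dispdf)
```

Both this and the .loc version operate on whole columns at once, which is what makes them faster than the row-by-row loop.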