Count All Occurrences of a Specific Value in a Dask Dataframe

I have a dask dataframe with thousands of columns and rows as follows:
pprint(daskdf.head())
grid lat lon ... 2014-12-29 2014-12-30 2014-12-31
0 0 48.125 -124.625 ... 0.0 0.0 -17.034216
1 0 48.625 -124.625 ... 0.0 0.0 -19.904214
4 0 42.375 -124.375 ... 0.0 0.0 -8.380443
5 0 42.625 -124.375 ... 0.0 0.0 -8.796803
6 0 42.875 -124.375 ... 0.0 0.0 -7.683688
I want to count all occurrences in the entire dataframe where a certain value appears. In pandas, this can be done as follows:
pddf[pddf==500].count().sum()
I'm aware that not all pandas functions/syntax translate directly to dask, but how would I do this with a dask dataframe? I tried:
daskdf[daskdf==500].count().sum().compute()
but this raised a "Not Implemented" error.

As in many cases where a pandas method is not yet explicitly implemented in dask, you can use map_partitions. In this case it might look like:
daskdf.map_partitions(lambda df: df[df==500].count()).sum().compute()
You can experiment with whether also doing a .sum() inside the lambda helps (it would produce smaller intermediates), and with what the meta= argument to map_partitions should look like.
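A minimal sketch of the full call, assuming the compared columns are numeric; the meta value here is only an assumed description of the per-partition result (a Series of per-column counts):
import pandas as pd

counts = daskdf.map_partitions(
    lambda df: df[df == 500].count(),   # per partition: number of matches in each column
    meta=pd.Series(dtype='int64'),      # empty Series describing the output dtype
)
total = counts.sum().compute()
print(total)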

Related

Retrieve next row in pandas dataframe / multiple list comprehension outputs

I have a Pandas dataframe, wt, with a datetime index and three columns, as well as a dataframe t with the same datetime index and three other columns, shown below:
wt
date 0 1 2
2004-11-19 0.2 0.3 0.5
2004-11-22 0.0 0.0 0.0
2004-11-23 0.0 0.0 0.0
2004-11-24 0.0 0.0 0.0
2004-11-26 0.0 0.0 0.0
2004-11-29 0.0 0.0 0.0
2004-11-30 0.0 0.0 0.0
t
date GLD SPY TLT
2004-11-19 0.009013068949977443 -0.011116725618999457 -0.007980218051028332
2004-11-22 0.0037963376507370583 0.004769204564810003 0.005211874008610895
2004-11-23 -0.00444938820912133 0.0015256823190370472 0.0012398557258792575
2004-11-24 0.006703910614525022 0.0023696682464455776 0.0
2004-11-26 0.005327413984461682 -0.0007598784194529085 -0.00652932567826181
2004-11-29 0.002428792227864962 -0.004562737642585524 -0.010651558073654366
2004-11-30 -0.006167400881057272 0.0006790595025889523 -0.004237773450922022
2004-12-01 0.005762411347517871 0.011366528119433505 -0.0015527950310557648
I'm currently using the Pandas iterrows method to run through each row for processing, and as a first step I check whether the row entries are non-zero, as below:
for dt, row in t.iterrows():
    if sum(wt.loc[dt]) <= 0:
        ...
Based on this, I'd like to assign values to dataframe wt if non-zero values don't currently exist. How can I retrieve the next row for a given dt entry (e.g., '11/22/2004' for dt = '11/19/2004')?
Part 2
As an addendum, I'm setting this up using a for loop for testing but would like to use list comprehension once complete. Processing will return the wt dataframe described above, as well as an intermediate, secondary dataframe again with datetime index and a single column (sample below):
r
date r
2004-11-19 0.030202
2004-11-22 -0.01047
2004-11-23 0.002456
2004-11-24 -0.01274
2004-11-26 0.00928
Is there a way to use list comprehensions to return both the above wt and this r dataframes without simply creating two separate comprehensions?
Edit
I was able to get the desired results by changing my approach, so I'm adding it for clarification (the referenced dataframes are as described above). I wonder if there's any way to apply list comprehensions to this.
import numpy as np
import pandas as pd

r = pd.DataFrame(columns=['ret'], index=wt.index.copy())
dts = wt.reset_index().date
for i, dt in enumerate(dts):
    row = t.loc[dt]
    dt_1 = dts.shift(-1).iloc[i]
    try:
        # Propagate the weights to the next date and record the dot-product return for dt.
        wt.loc[dt_1] = ((wt.loc[dt].tolist() * (1 + row)).transpose()
                        / np.dot(wt.loc[dt].tolist(), (1 + row))).tolist()
        r.loc[dt] = np.dot(wt.loc[dt], row)
    except Exception:
        print(f'Error calculating for date {dt}')
        continue
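For the original question of fetching the row after a given date, one possible sketch (it assumes wt has a sorted, unique DatetimeIndex that contains the date being looked up) goes through the positional index:
pos = wt.index.get_loc(pd.Timestamp('2004-11-19'))   # positional location of dt
if pos + 1 < len(wt.index):
    next_dt = wt.index[pos + 1]                      # 2004-11-22 in the sample above
    next_row = wt.loc[next_dt]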

Loading Pandas Dataframe with skipped sentiment

I have this dataset for sentiment analysis, loading the data with this code:
url = 'https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/amazon_cells_labelled.tsv'
df = pd.read_csv(url, sep='\t', names=["Sentence", "Feeling"])
The issue is that the DataFrame ends up with rows whose Feeling is NaN, even though those rows are just the first part of a longer sentence.
The output right now looks like this:
sentence feeling
I do not like it. NaN
I give it a bad score. 0
The Output should look like:
sentence feeling
I do not like it. I give it a bad score 0
Can you help me to concatenate or load the dataset based on the scores?
Create virtual groups with a cumulative sum, then groupby and aggregate the rows:
grp = df['Feeling'].notna().cumsum().shift(fill_value=0)
out = df.groupby(grp).agg({'Sentence': ' '.join, 'Feeling': 'last'})
print(out)
# Output:
Sentence Feeling
Feeling
0 I try not to adjust the volume setting to avoi... 0.0
1 Good case, Excellent value. 1.0
2 I thought Motorola made reliable products!. Ba... 1.0
3 When I got this item it was larger than I thou... 0.0
4 The mic is great. 1.0
... ... ...
996 But, it was cheap so not worth the expense or ... 0.0
997 Unfortunately, I needed them soon so i had to ... 0.0
998 The only thing that disappoint me is the infra... 0.0
999 No money back on this one. You can not answer ... 0.0
1000 It's rugged. Well this one is perfect, at the ... NaN
[1001 rows x 2 columns]
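To see how the grouping trick works, here is a small illustration on made-up rows mirroring the example above (the demo frame is an assumption for demonstration only):
import pandas as pd

demo = pd.DataFrame({
    'Sentence': ['I do not like it.', 'I give it a bad score.',
                 'Good case,', 'Excellent value.'],
    'Feeling': [float('nan'), 0, float('nan'), 1],
})
grp = demo['Feeling'].notna().cumsum().shift(fill_value=0)
# grp is 0, 0, 1, 1: the row that carries the score closes its group,
# which is why the cumsum is shifted down by one position.
print(demo.groupby(grp).agg({'Sentence': ' '.join, 'Feeling': 'last'}))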

update data frame based on data from another data frame using pandas python

I have two data frames, df1 and df2. Both have a common first column: SKUCode in df1 and SKU in df2.
df1:
df2:
I want to update df1 and set SKUStatus=0 if SKUCode matches SKU in df2.
I want to add a new row to df1 if an SKU from df2 has no match in SKUCode.
So after the operation, df1 looks like the following:
One way I could get this done is via df2.iterrows() and looping through the values, but I think there must be a neater way of doing this.
Thank you
import pandas as pdx
df1=pdx.DataFrame({'SKUCode':['A','B','C','D'],'ListPrice':[1798,2997,1798,999],'SalePrice':[1798,2997,1798,999],'SKUStatus':[1,1,1,0],'CostPrice':[500,773,525,300]})
df2=pdx.DataFrame({'SKUCode':['X','Y','B'],'Status':[0,0,0],'e_date':['31-05-2020','01-06-2020','01-06-2020']})
df1.merge(df2,left_on='SKUCode')
Try this, using an outer merge, which gives both matching and non-matching records.
In [75]: df_m = df1.merge(df2, on="SKUCode", how='outer')
In [76]: mask = df_m['Status'].isnull()
In [77]: df_m.loc[~mask, 'SKUStatus'] = df_m.loc[~mask, 'Status']
In [78]: df_m[['SKUCode', "ListPrice", "SalePrice", "SKUStatus", "CostPrice"]].fillna(0.0)
Output:
SKUCode ListPrice SalePrice SKUStatus CostPrice
0 A 1798.0 1798.0 1.0 500.0
1 B 2997.0 2997.0 0.0 773.0
2 C 1798.0 1798.0 1.0 525.0
3 D 999.0 999.0 0.0 300.0
4 X 0.0 0.0 0.0 0.0
5 Y 0.0 0.0 0.0 0.0
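Putting the steps above together as a plain script, using the sample frames from the question (a sketch; it assumes df2's key column is already named SKUCode, as in the snippet above):
import pandas as pd

df1 = pd.DataFrame({'SKUCode': ['A', 'B', 'C', 'D'],
                    'ListPrice': [1798, 2997, 1798, 999],
                    'SalePrice': [1798, 2997, 1798, 999],
                    'SKUStatus': [1, 1, 1, 0],
                    'CostPrice': [500, 773, 525, 300]})
df2 = pd.DataFrame({'SKUCode': ['X', 'Y', 'B'],
                    'Status': [0, 0, 0],
                    'e_date': ['31-05-2020', '01-06-2020', '01-06-2020']})

# Outer merge keeps the matched SKUs plus the new ones coming from df2.
df_m = df1.merge(df2, on='SKUCode', how='outer')
matched = df_m['Status'].notna()
df_m.loc[matched, 'SKUStatus'] = df_m.loc[matched, 'Status']  # overwrite status where df2 matched
result = df_m[['SKUCode', 'ListPrice', 'SalePrice', 'SKUStatus', 'CostPrice']].fillna(0.0)
print(result)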
I'm not sure exactly if I understood you correctly but I think you can use .loc. something along the lines of:
df1.loc[df2['SKUStatu'] != 0, 'SKUStatus'] = 1
You should have a look at the pd.merge function (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html).
First rename a column so both frames share the same key name (e.g. rename SKU to SKUCode). Then try:
df1.merge(df2, on='SKUCode')
If you provide input data (not screenshots), I can try with the appropriate parameters.
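A short sketch of the rename step described above (the column name 'SKU' in df2 is an assumption taken from the question's prose):
df2 = df2.rename(columns={'SKU': 'SKUCode'})
merged = df1.merge(df2, on='SKUCode', how='outer')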

How to use incremental PCA on dask dataframe?

I am using a dask dataframe which cannot be loaded directly into memory because of its size. I want to perform dimensionality reduction on top of it using incremental PCA.
My dataframe is sparse in nature, so the question is whether I can do this, and if yes, how.
image_features_df.head(3)
feat1 feat2 feat3 ... feat25087 feat25088 fid selling_price
0 0.0 0.0 0.0 ... 0.0 0.0 2 269.00
4 0.3 0.1 0.0 ... 0.0 0.8 26 1720.00
6 0.8 0.0 0.0 ... 0.0 0.1 50 18145.25
The above is a view of my dataframe. I want the output to retain 95% cumulative variance. How can I do that?
My dataframe has 100,000 rows and 25088 columns, so please suggest a solution which is memory efficient.
Have a look at the PCA implementation in dask-ML (https://ml.dask.org/modules/generated/dask_ml.decomposition.PCA.html); it might already work for your case, as it uses the tsqr algorithm (https://arxiv.org/abs/1301.1071).
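A rough sketch of how that might be wired up, assuming image_features_df is the frame shown above and the features can be handled as a dense dask array; the dropped columns and the n_components=500 upper bound are assumptions for illustration, and the 95% cut is then read off the fitted explained variance:
import numpy as np
from dask_ml.decomposition import PCA

# Keep only the feature columns and convert to a dask array with known chunk sizes.
X = image_features_df.drop(columns=['fid', 'selling_price']).to_dask_array(lengths=True)

pca = PCA(n_components=500)   # assumed upper bound on components to inspect
pca.fit(X)

# Smallest number of components whose cumulative explained variance reaches 95%.
cumvar = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumvar, 0.95) + 1)

# Components are ordered by explained variance, so keep the first k projections.
X_reduced = pca.transform(X)[:, :k]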

Aligning 2 python lists according to 2 other lists

I have two arrays, nlxTTL and ttlState. Both consist of a repeating pattern of 0s and 1s indicating an input voltage, which can be HIGH (1) or LOW (0), and both are recorded from the same source, which sends a TTL pulse (HIGH and LOW) with a 1-second pulse width.
But due to a logging mistake, some drops happen in the ttlState list, i.e. it doesn't log a clean alternating sequence of 0s and 1s and ends up dropping values.
The good part is that I also log a timestamp for each TTL input received, for both lists. The inter-event timestamp differences clearly show where a pulse has been missed.
Here is an example of what data looks like:
nlxTTL, ttlState, nlxTime, ttlTime
0,0,1000,1000
1,1,2000,2000
0,1,3000,4000
1,1,4000,6000
0,0,5000,7000
1,1,6000,8000
0,0,7000,9000
1,1,8000,10000
As you can see, nlxTime and ttlTime clearly differ from each other. How can I then use these timestamps to align all four lists?
When dealing with tabular data such as a CSV file, it's a good idea to use a library to make the process easier. I like the pandas dataframe library.
Now for your question: one way to think about this problem is that you really have two datasets, an nlx dataset and a ttl dataset. You want to join those datasets together by timestamp. Pandas makes tasks like this very easy.
import pandas as pd
from io import StringIO
data = """\
nlxTTL, ttlState, nlxTime, ttlTime
0,0,1000,1000
1,1,2000,2000
0,1,3000,4000
1,1,4000,6000
0,0,5000,7000
1,1,6000,8000
0,0,7000,9000
1,1,8000,10000
"""
# Load data into dataframe.
df = pd.read_csv(StringIO(data))
# Remove spaces from column names.
df.columns = [x.strip() for x in df.columns]
# Split the data into an nlx dataframe and a ttl dataframe.
nlx = df[['nlxTTL', 'nlxTime']].reset_index()
ttl = df[['ttlState', 'ttlTime']].reset_index()
# Merge the dataframes back together based on their timestamps.
# Use an outer join so missing data gets filled with NaNs instead
# of just dropping the rows.
merged_df = nlx.merge(ttl, left_on='nlxTime', right_on='ttlTime', how='outer')
# Get back to the original set of columns
merged_df = merged_df[df.columns]
# Print out the results.
print(merged_df)
This produces the following output.
nlxTTL ttlState nlxTime ttlTime
0 0.0 0.0 1000.0 1000.0
1 1.0 1.0 2000.0 2000.0
2 0.0 NaN 3000.0 NaN
3 1.0 1.0 4000.0 4000.0
4 0.0 NaN 5000.0 NaN
5 1.0 1.0 6000.0 6000.0
6 0.0 0.0 7000.0 7000.0
7 1.0 1.0 8000.0 8000.0
8 NaN 0.0 NaN 9000.0
9 NaN 1.0 NaN 10000.0
You'll notice that it fills in the dropped values with NaN values because we are doing an outer join. If this is undesirable, change the how='outer' parameter to how='inner' to perform an inner join. This will only keep records for which you have both an nlx and ttl response at that timestamp.
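If you then need the four aligned lists back rather than a dataframe, they can be pulled straight from the merged frame (NaN marks the dropped samples):
nlx_ttl = merged_df['nlxTTL'].tolist()
ttl_state = merged_df['ttlState'].tolist()
nlx_time = merged_df['nlxTime'].tolist()
ttl_time = merged_df['ttlTime'].tolist()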
