How to round a no index dataframe in Dask? - python

I was trying to merge 2 dataframes with float-type series in Dask (due to memory issues I can't use pure Pandas). From the post, I learned that merging on float-type columns can cause issues, so I followed the answer in that post: multiply the XYZ values by 100 and convert them to int.
x y z R G B
39020.470001199995750 33884.200004600003012 36.445701600000000 25 39 26
39132.740005500003463 33896.049995399996988 30.405698800000000 19 24 18
39221.059997600001225 33787.050003099997411 26.605699500000000 115 145 145
39237.370010400001775 33773.019996599992737 30.205699900000003 28 33 37
39211.370010400001775 33848.270004300000437 32.535697900000002 19 28 25
What I did
import numpy as np

N = 100
df2.x = np.round(df2.x * N).astype(int)
df2.head()
But since this dataframe has no index, it results in an error message:
local variable 'index' referenced before assignment
Expected answer
x y z R G B
3902047 3388420 3644 25 39 26

I was having the same problem and got it to work this way:
df2.x = (df2.x*N).round().astype(int)
If you need to round to a specific number of decimal places:
(df2.x*N).round(2)
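For completeness, here is a minimal end-to-end sketch of that approach, assuming a Dask DataFrame built from a small sample of the data above (the construction of df2 here is illustrative, not from the original post):

import dask.dataframe as dd
import pandas as pd

# Hypothetical sample standing in for the real point cloud.
pdf = pd.DataFrame({
    "x": [39020.4700012, 39132.7400055],
    "y": [33884.2000046, 33896.0499954],
    "z": [36.4457016, 30.4056988],
})
df2 = dd.from_pandas(pdf, npartitions=2)

N = 100
# Scale, round, and cast lazily; Dask only computes when asked.
df2["x"] = (df2["x"] * N).round().astype(int)
print(df2.head())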

Related

What's the name of this operation in Pandas?

What's the name of the operation below in Pandas?
import numpy as np
import pandas as pd
x=np.linspace(10,15,64)
y=np.random.permutation(64)
z=x[y]
ndarray "x" is (I assume) shuffled using ndarray "y" and then result ndarray is assigned to "z".
What is the name of this operation? I can't find it in Pandas documentation.
Thanks,
Pawel
This is called indexing, both in Pandas and NumPy.
This code is basically shuffling an array using an array of indices. Using pandas, you could shuffle a Series containing x with Series.sample, specifying frac=1 so the whole Series is shuffled:
s = pd.Series(x)
s.sample(frac=1)
52 14.126984
1 10.079365
41 13.253968
16 11.269841
29 12.301587
9 10.714286
37 12.936508
19 11.507937
15 11.190476
56 14.444444
0 10.000000
45 13.571429
34 12.698413
12 10.952381
....
If you want to use the existing y, you could index the Series using the iloc indexer:
s.iloc[y]
8 10.634921
53 14.206349
48 13.809524
43 13.412698
51 14.047619
21 11.666667
9 10.714286
29 12.301587
5 10.396825
61 14.841270
56 14.444444
39 13.095238
30 12.380952
...
Here are the docs on indexing with pandas.
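As a quick sanity check, here is a small sketch (reusing the arrays from the question) showing that NumPy fancy indexing and the iloc-based approach give the same values:

import numpy as np
import pandas as pd

x = np.linspace(10, 15, 64)
y = np.random.permutation(64)

s = pd.Series(x)
# Fancy indexing on the ndarray and positional indexing on the Series agree.
assert np.array_equal(x[y], s.iloc[y].to_numpy())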

Result of math.log in Python pandas DataFrame is integer

I have a DataFrame, all values are integer
Millage UsedMonth PowerPS
1 261500 269 101
3 320000 211 125
8 230000 253 101
9 227990 255 125
13 256000 240 125
14 153000 242 150
17 142500 215 101
19 220000 268 125
21 202704 260 101
22 350000 246 101
25 331000 230 125
26 250000 226 125
And I would like to calculate log(Millage)
SO I used code
import copy
import math

x_trans = copy.deepcopy(x)
x_trans = x_trans.reset_index(drop=True)
x_trans.astype(float)
for n in range(0, len(x_trans.Millage)):
    x_trans.Millage[n] = math.log(x_trans.Millage[n])
    x_trans.UsedMonth[n] = math.log(x_trans.UsedMonth[n])
I got all integer values
Millage UsedMonth PowerPS
0 12 5 101
1 12 5 125
2 12 5 101
3 12 5 125
4 12 5 125
5 11 5 150
It's python 3, Jupyter notebook
I tried math.log(100)
And get 4.605170185988092
I think the reason could be DataFrame data type.
How could I get the log() result as float?
Thanks
One solution would be to simply do
x_trans['Millage'] = np.log(x_trans['Millage'])
Conversion with astype(float) is not an in-place operation. Assign the result back to your dataframe and you will find your log series is of type float:
x_trans = x_trans.astype(float)
But, in this case, math.log is inefficient. Instead, you can use vectorised functionality via NumPy:
x_trans['Millage'] = np.log(x_trans['Millage'])
x_trans['UsedMonth'] = np.log(x_trans['UsedMonth'])
With this solution, you do not need to explicitly convert your dataframe to float.
In addition, note that deep copying is native in Pandas, e.g. x_trans = x.copy(deep=True).
First of all, I strongly recommend using the NumPy library for this kind of mathematical operation: it is faster, and its output is easier to use, since pandas is built on top of NumPy.
Now, taking into account how you created your dataframe, it automatically assumed your data type is integer. Try defining it as float when you create the dataframe by adding the parameter dtype=float, or better, if you are using the numpy package (import numpy as np), dtype=np.float64.
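As a rough sketch of that suggestion (the sample values are copied from the question, the rest is illustrative):

import numpy as np
import pandas as pd

data = {"Millage": [261500, 320000], "UsedMonth": [269, 211], "PowerPS": [101, 125]}
# Float columns from the start, so np.log keeps the fractional digits.
x = pd.DataFrame(data, dtype=np.float64)
x["Millage"] = np.log(x["Millage"])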

Calculations with Groupby objects Pandas

I would like to retrieve two groupby series objects and calculate between each other.
Series objects below:
Cost
ID yy
312 13 102429.610000
361 15 170526.000000
373 14 400000.000000
403 13 165000.000000
14 165000.000000
15 183558.720000
16 133763.760980
17 121301.930160
Percentage
ID yy
312 13 21.687500
361 15 33.181818
373 14 12.439024
403 13 22.966667
14 22.966667
15 24.142857
16 23.333333
17 36.666667
cost=df.groupby(['ID', 'yy'])['cost']
percentage=df.groupby(['ID', 'yy'])['percentage']
I essentially want to calculate cost * percentage.
How is this done correctly? The error is: unsupported operand type(s) for *: 'SeriesGroupBy' and 'SeriesGroupBy'.
You are using groupby without any aggregate function, which returns a GroupBy object, NOT a Series.
You need
cost = df.set_index(['ID', 'yy'])['cost']
pct = df.set_index(['ID', 'yy'])['percentage']
cost.mul(pct/100)
ID yy
312 13 22214.421669
361 15 56583.626963
373 14 49756.096000
403 13 37895.000550
14 37895.000550
15 44316.319281
16 31211.543783
17 44477.374796
Is this what you need?
pct.mul(cost)/100
Out[332]:
ID yy
312 13 22214.421669
361 15 56583.626963
373 14 49756.096000
403 13 37895.000550
14 37895.000550
15 44316.319281
16 31211.543783
17 44477.374796
Name: V, dtype: float64
You can directly multiply cost and percentage because here your indices, i.e. ID and yy, are the same for both series.
So
percentage.mul(cost) should work.
You made a mistake in treating two series from the same (grouped) df as two different objects. So just do:
dfg = df.groupby(['ID', 'yy']).first()
dfg['cost'] * dfg['percentage']  # you have to assign or write out the output
You can probably even reduce this to a one-liner; if you post reproducible data, I'll post it. In fact, as @Neo showed, something like:
df.groupby(['ID', 'yy']).percentage.mul(cost)
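For reference, here is a self-contained sketch that rebuilds a few rows from the question's series and applies the set_index approach from the first answer (the DataFrame construction is illustrative):

import pandas as pd

df = pd.DataFrame({
    "ID": [312, 361, 373, 403, 403],
    "yy": [13, 15, 14, 13, 14],
    "cost": [102429.61, 170526.0, 400000.0, 165000.0, 165000.0],
    "percentage": [21.6875, 33.181818, 12.439024, 22.966667, 22.966667],
})

indexed = df.set_index(["ID", "yy"])
# Element-wise product of the two aligned Series.
result = indexed["cost"].mul(indexed["percentage"] / 100)
print(result)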

python pandas: How to drop items in a dataframe

I have a huge number of points in my dataframe, so I would like to drop some of them (ideally keeping the mean values).
e.g. currently I have
date calltime
0 1491928756414930 4643
1 1491928756419607 166
2 1491928756419790 120
3 1491928756419927 142
4 1491928756420083 121
5 1491928756420217 109
6 1491928756420409 52
7 1491928756420476 105
8 1491928756420605 35
9 1491928756420654 120
10 1491928756420787 105
11 1491928756420907 93
12 1491928756421013 37
13 1491928756421062 112
14 1491928756421187 41
Is there any way to drop some of the items, based on sampling?
To give more details: my problem is the number of values at very close intervals, e.g. 1491928756421062 and 1491928756421187.
So I have a chart like the one shown (image omitted), and instead I wanted to somehow have a mean value for those close intervals. But maybe grouped by a second...
I would use sample(), but as you said it selects randomly. If you want to sample according to some logic, for instance only keeping rows whose value satisfies mean * 0.9 < value < mean * 1.1, you can try the following code. It all depends on your sampling strategy.
As an example, something like this could be done.
test.csv:
1491928756414930,4643
1491928756419607,166
1491928756419790,120
1491928756419927,142
1491928756420083,121
1491928756420217,109
1491928756420409,52
1491928756420476,105
1491928756420605,35
1491928756420654,120
1491928756420787,105
1491928756420907,93
1491928756421013,37
1491928756421062,112
1491928756421187,41
sampling:
import pandas as pd

df = pd.read_csv("test.csv", sep=",", header=None)
mean = df[1].mean()
my_sample = df[(mean * 0.90 < df[1]) & (df[1] < mean * 1.10)]
You're looking for resample. Note that the timestamps look like microseconds since the epoch, hence unit='us':
df.set_index(pd.to_datetime(df.date, unit='us')).calltime.resample('s').mean()
This is a more complete example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

tidx = pd.date_range('2000-01-01', periods=10000, freq='10ms')
df = pd.DataFrame(dict(calltime=np.random.randint(200, size=len(tidx))), tidx)
fig, axes = plt.subplots(2, figsize=(25, 10))
df.plot(ax=axes[0])
df.resample('s').mean().plot(ax=axes[1])
fig.tight_layout()
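If you would rather stay with the integer timestamps, and assuming they are microseconds since the epoch, you can get one mean per second with plain floor division. This is just a sketch of that idea using a few rows from the question:

import pandas as pd

df = pd.DataFrame({
    "date": [1491928756414930, 1491928756419607, 1491928756419790],
    "calltime": [4643, 166, 120],
})
# Floor each microsecond timestamp to a whole second and average within each second.
per_second = df.groupby(df["date"] // 1_000_000)["calltime"].mean()
print(per_second)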

Quickly sampling large number of rows from large dataframes in python

I have a very large dataframe (about 1.1M rows) and I am trying to sample it.
I have a list of indexes (about 70,000 indexes) that I want to select from the entire dataframe.
This is what I've tried so far, but all these methods are taking way too much time:
Method 1 - Using pandas :
sample = pandas.read_csv("data.csv", index_col = 0).reset_index()
sample = sample[sample['Id'].isin(sample_index_array)]
Method 2:
I tried to write all the sampled lines to another csv.
f = open("data.csv", 'r')
out = open("sampled_date.csv", 'w')
out.write(f.readline())  # copy the header line

total = 0
while 1:
    total += 1
    line = f.readline().strip()
    if line == '':
        break
    arr = line.split(",")
    if int(arr[0]) in sample_index_array:
        out.write(line + "\n")  # write the whole row back out
Can anyone please suggest a better method? Or how I can modify this to make it faster?
Thanks
We don't have your data, so here is an example with two options:
after reading: use a pandas Index object to select a subset via the .iloc selection method
while reading: a predicate with the skiprows parameter
Given
A collection of indices and a (large) sample DataFrame written to test.csv:
import pandas as pd
import numpy as np
indices = [1, 2, 3, 10, 20, 30, 67, 78, 900, 2176, 78776]
df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list("ABCD"))
df.to_csv("test.csv", header=False)
df.info()
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 4 columns):
A 1000000 non-null int32
B 1000000 non-null int32
C 1000000 non-null int32
D 1000000 non-null int32
dtypes: int32(4)
memory usage: 15.3 MB
Code
Option 1 - after reading
Convert a sample list of indices to an Index object and slice the loaded DataFrame:
idxs = pd.Index(indices)
subset = df.iloc[idxs, :]
print(subset)
The .iat and .at methods are even faster, but require scalar indices.
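For example (a small illustrative sketch; the row position is arbitrary):

# Scalar lookups on the frame built above: .iat by position, .at by label.
value_by_position = df.iat[900, 2]   # row 900, third column ("C")
value_by_label = df.at[900, "C"]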
Option 2 - while reading (Recommended)
We can write a predicate that keeps selected indices as the file is being read (more efficient):
pred = lambda x: x not in indices  # skip every line number that is not in our sample
data = pd.read_csv("test.csv", skiprows=pred, index_col=0, names=list("ABCD"))
print(data)
See also the issue that led to extending skiprows.
Results
Both options produce the same output:
A B C D
1 74 95 28 4
2 87 3 49 94
3 53 54 34 97
10 58 41 48 15
20 86 20 92 11
30 36 59 22 5
67 49 23 86 63
78 98 63 60 75
900 26 11 71 85
2176 12 73 58 91
78776 42 30 97 96
