Quickly sampling a large number of rows from a large DataFrame in Python

I have a very large dataframe (about 1.1M rows) and I am trying to sample it.
I have a list of indexes (about 70,000 indexes) that I want to select from the entire dataframe.
This is what I've tried so far, but both methods take far too long:
Method 1 - using pandas:
sample = pandas.read_csv("data.csv", index_col = 0).reset_index()
sample = sample[sample['Id'].isin(sample_index_array)]
Method 2 :
I tried to write all the sampled lines to another csv.
f = open("data.csv", 'r')
out = open("sampled_data.csv", 'w')
out.write(f.readline())  # copy the header line
while 1:
    line = f.readline().strip()
    if line == '':
        break
    arr = line.split(",")
    if int(arr[0]) in sample_index_array:
        out.write(line + "\n")  # write the whole line, not one comma per character
f.close()
out.close()
Can anyone please suggest a better method? Or how I can modify this to make it faster?
Thanks

We don't have your data, so here is an example with two options:
after reading: use a pandas Index object to select a subset via the .iloc selection method
while reading: a predicate with the skiprows parameter
Given
A collection of indices and a (large) sample DataFrame written to test.csv:
import pandas as pd
import numpy as np
indices = [1, 2, 3, 10, 20, 30, 67, 78, 900, 2176, 78776]
df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list("ABCD"))
df.to_csv("test.csv", header=False)
df.info()
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 4 columns):
A 1000000 non-null int32
B 1000000 non-null int32
C 1000000 non-null int32
D 1000000 non-null int32
dtypes: int32(4)
memory usage: 15.3 MB
Code
Option 1 - after reading
Convert a sample list of indices to an Index object and slice the loaded DataFrame:
idxs = pd.Index(indices)
subset = df.iloc[idxs, :]
print(subset)
The .iat and .at methods are even faster, but require scalar indices.
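As a minimal sketch of what scalar access looks like (using a smaller stand-in DataFrame):

```python
import numpy as np
import pandas as pd

# a small stand-in for the large df above
df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 4)), columns=list("ABCD"))

# .iat takes a single (row, column) pair of integer positions
value = df.iat[10, 0]
# .at is the label-based equivalent; with a default RangeIndex
# the row label coincides with the position
same = df.at[10, "A"]
```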
Option 2 - while reading (Recommended)
We can write a predicate that keeps selected indices as the file is being read (more efficient):
pred = lambda x: x not in indices
data = pd.read_csv("test.csv", skiprows=pred, index_col=0, names=list("ABCD"))
print(data)
See also the issue that led to extending skiprows.
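Two caveats worth noting: the skiprows callable receives the zero-based file row number (not the DataFrame index label), which lines up here only because the CSV was written without a header from a default RangeIndex; and membership tests against a large list are slow, so a set is preferable. A sketch under the same setup:

```python
import numpy as np
import pandas as pd

# recreate a sample file like the answer's test.csv
df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 4)), columns=list("ABCD"))
df.to_csv("test.csv", header=False)

# a set gives O(1) membership tests, unlike a list
indices = {1, 2, 3, 10, 20, 30, 67, 78, 900, 2176, 78776}

# skiprows passes the zero-based file row number to the predicate;
# it matches the index here only because the file has no header
# and was written from a default RangeIndex
data = pd.read_csv("test.csv", skiprows=lambda row: row not in indices,
                   index_col=0, names=list("ABCD"))
```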
Results
Both options produce the same output:
        A   B   C   D
1      74  95  28   4
2      87   3  49  94
3      53  54  34  97
10     58  41  48  15
20     86  20  92  11
30     36  59  22   5
67     49  23  86  63
78     98  63  60  75
900    26  11  71  85
2176   12  73  58  91
78776  42  30  97  96

Related

How to round a no index dataframe in Dask?

I was trying to merge 2 DataFrames with float-type columns in Dask (due to memory issues I can't use pure pandas). From a related post, I learned that merging on float-type columns can cause problems, so I followed the answer there: multiply the XYZ values by 100 and convert them to int.
x y z R G B
39020.470001199995750 33884.200004600003012 36.445701600000000 25 39 26
39132.740005500003463 33896.049995399996988 30.405698800000000 19 24 18
39221.059997600001225 33787.050003099997411 26.605699500000000 115 145 145
39237.370010400001775 33773.019996599992737 30.205699900000003 28 33 37
39211.370010400001775 33848.270004300000437 32.535697900000002 19 28 25
What I did
N = 100
df2.x = np.round(df2.x*N).astype(int)
df2.head()
But since this DataFrame has no index, it results in an error message:
local variable 'index' referenced before assignment
Expected answer
x y z R G B
3902047 3388420 3644 25 39 26
I was having the same problem and got it to work this way:
df2.x = (df2.x*N).round().astype(int)
If you need to round to a specific decimal:
(df2.x*N).round(2)
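The difference is only the order of operations: scale, round, then cast. A small self-contained sketch with values like the question's x column:

```python
import pandas as pd

# a few sample coordinates, similar to the question's x column
x = pd.Series([39020.470001, 39132.740005, 39221.059998])
N = 100

# scale first, then round, then cast to int
scaled = (x * N).round().astype(int)
```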

What's the name of this operation in Pandas?

What's the name of the operation below in pandas?
import numpy as np
import pandas as pd
x=np.linspace(10,15,64)
y=np.random.permutation(64)
z=x[y]
The ndarray x is (I assume) shuffled using the ndarray y, and the resulting ndarray is assigned to z.
What is the name of this operation? I can't find it in the pandas documentation.
Thanks,
Pawel
This is called indexing (specifically, integer-array or "fancy" indexing), both in pandas and NumPy.
This code is basically shuffling an array using an array of indices. Using pandas, you could shuffle a Series containing x with Series.sample, specifying frac=1 so the whole Series is shuffled:
s = pd.Series(x)
s.sample(frac=1)
52 14.126984
1 10.079365
41 13.253968
16 11.269841
29 12.301587
9 10.714286
37 12.936508
19 11.507937
15 11.190476
56 14.444444
0 10.000000
45 13.571429
34 12.698413
12 10.952381
....
If you want to use the existing y, you could index the Series using the iloc indexer:
s.iloc[y]
8 10.634921
53 14.206349
48 13.809524
43 13.412698
51 14.047619
21 11.666667
9 10.714286
29 12.301587
5 10.396825
61 14.841270
56 14.444444
39 13.095238
30 12.380952
...
Here are the docs on indexing with pandas.
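To see that the NumPy and pandas forms agree, here is a quick check reusing the question's x and y:

```python
import numpy as np
import pandas as pd

x = np.linspace(10, 15, 64)
y = np.random.permutation(64)

z = x[y]                   # NumPy fancy (integer-array) indexing
s = pd.Series(x).iloc[y]   # the same permutation through pandas .iloc
```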

Group pandas dataframe by quantile of single column

Sorry if this is a duplicate post - I can't find a related post, though.
from random import seed
import numpy as np
import pandas as pd

seed(100)
P = pd.DataFrame(np.random.randint(0, 100, size=(1000, 2)), columns=list('AB'))
What I'd like is to group P by the quartiles/quantiles/deciles/etc. of column A and then calculate an aggregate statistic (such as the mean) for each group. I can define deciles of the column as
P['A'].quantile(np.arange(10) / 10)
I'm not sure how to groupby the deciles of A. Thanks in advance!
If you want to group P e.g. by quartiles, run:
gr = P.groupby(pd.qcut(P.A, 4, labels=False))
Then you can perform any operations on these groups.
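For example, a per-quartile mean could look like this (a sketch; note it seeds NumPy's generator for reproducibility, whereas the question seeds the stdlib random module, which does not affect np.random):

```python
import numpy as np
import pandas as pd

np.random.seed(100)  # seed NumPy's generator, not the stdlib random module
P = pd.DataFrame(np.random.randint(0, 100, size=(1000, 2)), columns=list("AB"))

gr = P.groupby(pd.qcut(P.A, 4, labels=False))
quartile_means = gr.mean()   # mean of A and B within each quartile of A
quartile_sizes = gr.size()   # number of rows per quartile
```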
For presentation, below you have just a printout of P limited to
20 rows:
for key, grp in gr:
    print(f'\nGroup: {key}\n{grp}')
which gives:
Group: 0
A B
0 8 24
3 10 94
10 9 93
15 4 91
17 7 49
Group: 1
A B
7 34 24
8 15 60
12 27 4
13 31 1
14 13 83
Group: 2
A B
4 52 98
5 53 66
9 58 16
16 59 67
18 47 65
Group: 3
A B
1 67 87
2 79 48
6 98 14
11 86 2
19 61 14
As you can see, each group (quartile) has 5 members, so the grouping is
correct.
As a supplement
If you are interested in borders of each quartile, run:
pd.qcut(P.A, 4, labels=False, retbins=True)[1]
Then qcut returns a tuple of 2 results. The first element (index 0) is
the result returned before; this time we are interested in the
second element (index 1): the bin borders.
For your data they are:
array([ 4.  , 12.25, 40.5 , 59.5 , 98.  ])
So e.g. the first quartile is between 4 and 12.25.
You can use the quantile Series to make another column marking each row with its quantile label, and then group by that column. NumPy's searchsorted is very useful for this:
import numpy as np
import pandas as pd
from random import seed
seed(100)
P = pd.DataFrame(np.random.randint(0, 100, size=(1000, 2)), columns=list('AB'))
q = P['A'].quantile(np.arange(10) / 10)
P['G'] = P['A'].apply(lambda x : q.index[np.searchsorted(q, x, side='right')-1])
Since the quantile Series stores the lower edge of each quantile interval, be sure to pass side='right' to np.searchsorted; otherwise the minimum of A would map to position 0, and subtracting 1 would wrap it around to the last label.
Now you can compute your statistics, for example:
P.groupby('G').agg(['sum', 'mean'])  # add any other statistics you wish to the list
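As a possible refinement, the per-row apply can be replaced by a single vectorized np.searchsorted call over the whole column - a sketch under the same setup:

```python
import numpy as np
import pandas as pd

np.random.seed(100)  # NumPy seed for reproducibility
P = pd.DataFrame(np.random.randint(0, 100, size=(1000, 2)), columns=list("AB"))

q = P["A"].quantile(np.arange(10) / 10)  # decile lower edges, indexed 0.0 .. 0.9

# one vectorized searchsorted call instead of a Python-level apply per row;
# side="right" keeps the column minimum in the first bin
pos = np.searchsorted(q.to_numpy(), P["A"].to_numpy(), side="right") - 1
P["G"] = q.index.to_numpy()[pos]

decile_means = P.groupby("G").B.mean()
```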

Seaborn plots kdeplot but not distplot

I want to plot data from a pandas DataFrame column built from CouchDB. This is the code and its output:
print df4.Patient_Age
Doc_ID
000103f8-7f48-4afd-b532-8e6c1028d965 99
00021ec5-9945-47f7-bfda-59cf8918f10b 92
0002510f-fb89-11e3-a6eb-742f68319ca7 32
00025550-9a97-44a4-84d9-1f6f7741f973 73
0002d1b8-b576-4db7-af55-b3f26f7ca63d 49
0002d40f-2b45-11e3-8f66-742f68319ca7 42
000307eb-18a6-47cd-bb03-33e484fad029 18
00033d3d-1345-4739-9522-b41b8db3ee23 42
00036d2e-0a51-4cfb-93d1-3e137a026f19 42
0003b054-5f3b-4553-8104-f71d7a940d84 10
Name: Patient_Age, dtype: object
If I execute this code:
sns.kdeplot(df4.Patient_Age)
the plot is generated as expected. However, when I run this:
sns.distplot(df4.Patient_Age)
I get the following error with distplot:
TypeError: unsupported operand type(s) for /: 'unicode' and 'long'
To correct the error, I use:
df4.Patient_Age = [int(i) for i in df4.Patient_Age]
all(isinstance(item,int) for item in df4.Patient_Age)
The output is:
False
What I would like to understand are:
Why was the kdeplot generated earlier but not the distplot?
When I changed the datatype to int, why do I still get False? And if the data is not int (as indicated by False), why does distplot work after the transformation?
The problem is that your values are not numeric - the column has object dtype and holds strings. If you force them to integers or floats, it will work.
from io import StringIO
import pandas
import seaborn
seaborn.set(style='ticks')
data = StringIO("""\
Doc_ID Age
000103f8-7f48-4afd-b532-8e6c1028d965 99
00021ec5-9945-47f7-bfda-59cf8918f10b 92
0002510f-fb89-11e3-a6eb-742f68319ca7 32
00025550-9a97-44a4-84d9-1f6f7741f973 73
0002d1b8-b576-4db7-af55-b3f26f7ca63d 49
0002d40f-2b45-11e3-8f66-742f68319ca7 42
000307eb-18a6-47cd-bb03-33e484fad029 18
00033d3d-1345-4739-9522-b41b8db3ee23 42
00036d2e-0a51-4cfb-93d1-3e137a026f19 42
0003b054-5f3b-4553-8104-f71d7a940d84 10
""")
df = pandas.read_table(data, sep='\s+')
df['Age'] = df['Age'].astype(float)
df.info()
# prints
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 2 columns):
Doc_ID 10 non-null object
Age 10 non-null float64
dtypes: float64(1), object(1)
memory usage: 240.0+ bytes
So then:
seaborn.distplot(df['Age'])
gives me the expected distribution plot.
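As an aside, pd.to_numeric converts a whole object-dtype Series in one call, and errors='coerce' turns non-parsable entries into NaN instead of raising (a sketch, not from the original answer):

```python
import pandas as pd

# a few Age values as strings, mimicking the object-dtype column in the question
ages = pd.Series(["99", "92", "32", "73"], name="Age")

numeric = pd.to_numeric(ages)                 # converts to an integer dtype
safe = pd.to_numeric(ages, errors="coerce")   # non-parsable values become NaN
```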

Pandas error "Can only use .str accessor with string values"

I have the following input file:
"Name",97.7,0A,0A,65M,0A,100M,5M,75M,100M,90M,90M,99M,90M,0#,0N#,
And I am reading it in with:
#!/usr/bin/env python
import pandas as pd
import sys
import numpy as np
filename = sys.argv[1]
df = pd.read_csv(filename,header=None)
for col in df.columns[2:]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
print df
However, I get the error
df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2241, in __getattr__
return object.__getattribute__(self, name)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 188, in __get__
return self.construct_accessor(instance)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 528, in _make_str_accessor
raise AttributeError("Can only use .str accessor with string "
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
This worked OK in pandas 0.14 but does not work in pandas 0.17.0.
It's happening because your last column is empty (the line ends with a trailing comma), so it is read as NaN:
In [417]:
t="""'Name',97.7,0A,0A,65M,0A,100M,5M,75M,100M,90M,90M,99M,90M,0#,0N#,"""
df = pd.read_csv(io.StringIO(t), header=None)
df
Out[417]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 \
0 'Name' 97.7 0A 0A 65M 0A 100M 5M 75M 100M 90M 90M 99M 90M 0#
15 16
0 0N# NaN
If you slice your range up to the last row then it works:
In [421]:
for col in df.columns[2:-1]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
df
Out[421]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 'Name' 97.7 0 0 65 0 100 5 75 100 90 90 99 90 0 0 NaN
Alternatively, you can select just the columns that have object dtype and run the code (skipping the first column, as this is the 'Name' entry):
In [428]:
for col in df.select_dtypes([np.object]).columns[1:]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
df
Out[428]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 'Name' 97.7 0 0 65 0 100 5 75 100 90 90 99 90 0 0 NaN
I got this error while working in Eclipse. It turned out that the project interpreter had somehow been reset to Python 2.7 (after an update, I believe). Setting it back to Python 3.6 resolved the issue, though it involved several crashes, restarts and warnings along the way; after several minutes of trouble, it seems fixed now.
While I know this is not a solution to the problem posed here, I thought it might be useful for others, as I came to this page after searching for this error.
In this case we have to use the str.replace() method on that Series, but first we have to convert it to str type. For example, with a Patient column holding values like 's125', 's45', 's588', 's244':
df1 = pd.read_csv("C:\\Users\\Gangwar\\Desktop\\competitions\\cancer prediction\\kaggle_to_students.csv")
df1['Patient'] = df1['Patient'].astype(str)
df1['Patient'] = df1['Patient'].str.replace('s', '').astype(int)
