I have a dataset of longitudes/latitudes as follows:
id,spp,lon,lat
1a,sp1,1,9
1b,sp1,3,11
1c,sp1,6,12
2a,sp2,1,9
2b,sp2,1,10
2c,sp2,3,10
2d,sp2,4,11
2e,sp2,5,12
2f,sp2,6,12
3a,sp3,4,13
3b,sp3,5,11
3c,sp3,8,8
4a,sp4,4,12
4b,sp4,6,11
4c,sp4,7,8
5a,sp5,8,8
5b,sp5,7,6
5c,sp5,8,2
6a,sp6,8,8
6b,sp6,7,5
6c,sp6,8,3
From such data, I want to generate a grid like this:
0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 1 1 2 0 0 0 0
0 0 0 0 0 1 1 1 1 0 0 0 0
0 0 0 1 0 1 0 0 0 0 0 0 0
0 0 0 2 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 3 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
which gives the number of data records in each cell of the grid, using variable "spp" as a categorical (grouping) factor.
From this grid, I then want to create a heat map, superimposed on a geographical map, so that I end up with something like the figure below.
I can see how to plot a heatmap with Matplotlib/Basemap, but I could not figure out how to generate the grid from the point data. It is also important that I can choose the grid size, so that several different resolutions can be evaluated. I suppose this could be achieved with either NumPy's meshgrid or SciPy's griddata, but I could not make further progress in understanding how to use them.
Any hints, ideas, or suggestions will be much appreciated.
If you're willing to use pandas, you could do something like this:
dims = max(df[['lat', 'lon']].max())  # the largest coordinate sets the square grid size
grid = (df.groupby(['lat', 'lon'])['lat']
          .count()
          .unstack()                      # lon values become columns
          .reindex(range(1, dims + 1))    # add any missing lat rows
          .T
          .reindex(range(1, dims + 1))    # add any missing lon rows
          .fillna(0)
          .T)
resulting in a square dataframe
lon 1 2 3 4 5 6 7 8 9 10 11 12 13
lat
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 0.0 0.0 0.0 0.0 0.0
9 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
11 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12 0.0 0.0 0.0 1.0 1.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
13 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
You can always convert to a NumPy array with df.values.
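Since the question stresses being able to choose the grid resolution, here is a NumPy-only sketch using np.histogram2d, with the coordinates copied from the sample data above; the bin count and range are the knobs you would vary (filter by spp first if you want one grid per species):

```python
import numpy as np
import pandas as pd

# Sample coordinates copied from the question's CSV
df = pd.DataFrame({
    'lon': [1, 3, 6, 1, 1, 3, 4, 5, 6, 4, 5, 8, 4, 6, 7, 8, 7, 8, 8, 7, 8],
    'lat': [9, 11, 12, 9, 10, 10, 11, 12, 12, 13, 11, 8, 12, 11, 8, 8, 6, 2, 8, 5, 3],
})

n_cells = 13  # grid resolution -- change this to evaluate other resolutions
grid, lon_edges, lat_edges = np.histogram2d(
    df['lon'], df['lat'],
    bins=n_cells,
    range=[[0.5, 13.5], [0.5, 13.5]],  # one cell per integer coordinate here
)
# grid[i, j] is the number of records falling in lon-bin i, lat-bin j
```

The bin edges returned by histogram2d are exactly what a Matplotlib pcolormesh call expects, so the same arrays feed straight into the heat map step.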
I have two medium-sized datasets which look like:
books_df.head()
ISBN Book-Title Book-Author
0 0195153448 Classical Mythology Mark P. O. Morford
1 0002005018 Clara Callan Richard Bruce Wright
2 0060973129 Decision in Normandy Carlo D'Este
3 0374157065 Flu: The Story of the Great Influenza Pandemic... Gina Bari Kolata
4 0393045218 The Mummies of Urumchi E. J. W. Barber
and
ratings_df.head()
User-ID ISBN Book-Rating
0 276725 034545104X 0
1 276726 0155061224 5
2 276727 0446520802 0
3 276729 052165615X 3
4 276729 0521795028 6
And I want to get a pivot table like this:
ISBN 1 2 3 4 5 6 7 8 9 10 ... 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952
User-ID
1 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I've tried:
R_df = ratings_df.pivot(index='User-ID', columns='ISBN', values='Book-Rating').fillna(0)
which failed with:
MemoryError:
and this:
R_df = ratings_df.groupby(['User-ID', 'ISBN'])['Book-Rating'].mean().unstack()
which failed the same way.
I want to use it for singular value decomposition and matrix factorization.
Any ideas?
The dataset I'm working with is: http://www2.informatik.uni-freiburg.de/~cziegler/BX/
One option is to use pandas Sparse functionality, since your data here is (very) sparse:
In [11]: df
Out[11]:
User-ID ISBN Book-Rating
0 276725 034545104X 0
1 276726 0155061224 5
2 276727 0446520802 0
3 276729 052165615X 3
4 276729 0521795028 6
In [12]: res = df.groupby(['User-ID', 'ISBN'])['Book-Rating'].mean().astype('Sparse[int]')
In [13]: res.unstack(fill_value=0)
Out[13]:
ISBN 0155061224 034545104X 0446520802 052165615X 0521795028
User-ID
276725 0 0 0 0 0
276726 5 0 0 0 0
276727 0 0 0 0 0
276729 0 0 0 3 6
In [14]: _.dtypes
Out[14]:
ISBN
0155061224 Sparse[int64, 0]
034545104X Sparse[int64, 0]
0446520802 Sparse[int64, 0]
052165615X Sparse[int64, 0]
0521795028 Sparse[int64, 0]
dtype: object
My understanding is that you can then use this with scipy e.g. for SVD:
In [15]: res.unstack(fill_value=0).sparse.to_coo()
Out[15]:
<4x5 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in COOrdinate format>
Let's say I have this DataFrame:
df = pd.DataFrame({'age':[10,11,10,20,25,10],'field':['cat','cat','cat','dog','cow','cat']})
>>> df
age field
0 10 cat
1 11 cat
2 10 cat
3 20 dog
4 25 cow
5 10 cat
My goal is to groupby('field'), use it as the index, have age columns from 1 to 90, and get the percentage distribution for each field, like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 ...
field
cat 0 0 0 0 0 0 0 0 0 75 25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
dog 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100 0 0 0 0 0 ...
cow 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100 ...
Please help me... thanks for your support!
I believe what you're looking for is pivot_table:
df = pd.DataFrame({'age':[10,11,10,20,25,10],'field':['cat','cat','cat','dog','cow','cat']})
pivot = \
(df
.assign(vals=1)
.pivot_table(values='vals', index='field', columns='age', aggfunc='sum')
.fillna(0)
)
row_totals = pivot.sum(axis=1)
percentages = pivot.div(row_totals, axis=0) * 100
final = percentages.reindex(range(1,91), axis=1, fill_value=0.0)
pivot_table calculates the frequency of each age/field combination.
row_totals sums the occurrences per row, which are then used to calculate the row-wise percentages.
Finally, we reindex to add the empty columns 1-90.
age 1 2 3 4 5 6 7 8 9 10 ... 81 82 83 \
field ...
cat 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 75.0 ... 0.0 0.0 0.0
cow 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
dog 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
age 84 85 86 87 88 89 90
field
cat 0.0 0.0 0.0 0.0 0.0 0.0 0.0
cow 0.0 0.0 0.0 0.0 0.0 0.0 0.0
dog 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I have the following dataframe
0 0 0
1 0 0
1 1 0
1 1 1
1 1 1
0 0 0
0 1 0
0 1 0
0 0 0
how do you get a dataframe which looks like this
0 0 0
4 0 0
4 3 0
4 3 2
4 3 2
0 0 0
0 2 0
0 2 0
0 0 0
Thank you for your help.
You may need a for loop here, with transform: use cumsum to create the run key, then assign the run counts back to your original df.
for x in df.columns:
df.loc[df[x]!=0,x]=df[x].groupby(df[x].eq(0).cumsum()[df[x]!=0]).transform('count')
df
Out[229]:
1 2 3
0 0.0 0.0 0.0
1 4.0 0.0 0.0
2 4.0 3.0 0.0
3 4.0 3.0 2.0
4 4.0 3.0 2.0
5 0.0 0.0 0.0
6 0.0 2.0 0.0
7 0.0 2.0 0.0
8 0.0 0.0 0.0
Or without a for loop:
s=df.stack().sort_index(level=1)
s2=s.groupby([s.index.get_level_values(1),s.eq(0).cumsum()]).transform('count').sub(1).unstack()
df=df.mask(df!=0).combine_first(s2)
df
Out[255]:
1 2 3
0 0.0 0.0 0.0
1 4.0 0.0 0.0
2 4.0 3.0 0.0
3 4.0 3.0 2.0
4 4.0 3.0 2.0
5 0.0 0.0 0.0
6 0.0 2.0 0.0
7 0.0 2.0 0.0
8 0.0 0.0 0.0
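The same idea can be written a little more explicitly. This sketch assumes, as in the example, that the frame contains only 0s and 1s, so the sum of a run equals its length:

```python
import pandas as pd

df = pd.DataFrame({1: [0, 1, 1, 1, 1, 0, 0, 0, 0],
                   2: [0, 0, 1, 1, 1, 0, 1, 1, 0],
                   3: [0, 0, 0, 1, 1, 0, 0, 0, 0]})

out = df.copy()
for col in df.columns:
    runs = df[col].eq(0).cumsum()                     # run id: increments at every zero
    counts = df[col].groupby(runs).transform('sum')   # number of 1s in each run
    out[col] = counts.where(df[col] != 0, 0)          # keep counts only on the 1s
```

Each zero starts a new group, so every maximal run of 1s (plus the zero just before it) shares one group id, and transform broadcasts that run's length back onto its members.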
I have a dataframe like this
value
msno features days
B num_50 1 0
C num_100 3 1
A num_100 400 2
I used
df = df.unstack(level=-1,fill_value = '0')
df = df.unstack(level=-1,fill_value = '0')
df = df.stack()
then df looks like :
value
days 1 3 400
msno features
B num_50 0 0 0
num_100 0 0 0
C num_50 0 0 0
num_100 0 1 0
A num_50 0 0 0
num_100 0 0 2
Now I want to fill this df with 0 but still keep the original data, like this:
value
days 1 2 3 4 ... 400
msno features
B num_50 0 0 0 0 ... 0
num_100 0 0 0 0 ... 0
C num_50 0 0 0 0 ... 0
num_100 0 0 1 0 ... 0
A num_50 0 0 0 0 ... 0
num_100 0 0 0 0 ... 2
I want to add all the columns from 1 to 400 and fill the new ones with 0.
Could someone tell me how to do that?
By using reindex:
df.columns = df.columns.droplevel()
df.reindex(columns=list(range(1, 20))).fillna(0)  # 1-19 shown here for display; use range(1, 401) for the full case
Out[414]:
days 1 2 3 4 5 6 7 8 9 10 11 12 13 \
msno features
A num_100 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B num_100 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
C num_100 0 0.0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
days 14 15 16 17 18 19
msno features
A num_100 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0.0 0.0 0.0 0.0 0.0 0.0
B num_100 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0.0 0.0 0.0 0.0 0.0 0.0
C num_100 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0.0 0.0 0.0 0.0 0.0 0.0
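For the full 1-400 case, reindex can also take fill_value directly, which avoids the separate fillna and keeps the integer dtype. A minimal reconstruction of the question's frame (the values here are hypothetical):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('B', 'num_50', 1), ('C', 'num_100', 3), ('A', 'num_100', 400)],
    names=['msno', 'features', 'days'])
df = pd.DataFrame({'value': [0, 1, 2]}, index=idx)

wide = (df['value']
        .unstack('days', fill_value=0)                  # days become columns
        .reindex(columns=range(1, 401), fill_value=0))  # add the missing day columns
```

unstack's own fill_value handles the holes created by pivoting, and reindex's fill_value handles the columns that never appeared in the data at all.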
I'm getting the error:
TypeError: 'method' object is not subscriptable
When I try to join two Pandas dataframes...
Can't see what is wrong with them!
For info, doing the Kaggle titanic problem:
titanic_df.head()
Out[102]:
Survived Pclass SibSp Parch Fare has_cabin C Q title Person
0 0 3 1 0 7.2500 1 0.0 0.0 Mr male
1 1 1 1 0 71.2833 0 1.0 0.0 Mrs female
2 1 3 0 0 7.9250 1 0.0 0.0 Miss female
3 1 1 1 0 53.1000 0 0.0 0.0 Mrs female
4 0 3 0 0 8.0500 1 0.0 0.0 Mr male
In [103]:
sns.barplot(x=titanic_df["Survived"],y=titanic_df["title"])
Out[103]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b5edb00>
In [125]:
title_dummies=pd.get_dummies(titanic_df["title"])
title_dummies=title_dummies.drop([" Don"," Rev"," Dr"," Col"," Capt"," Jonkheer"," Major"," Mr"],axis=1)
title_dummies.head()
Out[125]:
Lady Master Miss Mlle Mme Mrs Ms Sir the Countess
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
2 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In [126]:
titanic_df=title_dummies.join[titanic_df]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-126-a0e0fe306754> in <module>()
----> 1 titanic_df=title_dummies.join[titanic_df]
TypeError: 'method' object is not subscriptable
You need to change [] to () in the DataFrame.join call:
titanic_df=title_dummies.join(titanic_df)
print (titanic_df)
Lady Master Miss Mlle Mme Mrs Ms Sir the Countess Survived \
0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
1 1 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1
2 2 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1
3 3 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1
4 4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
Pclass SibSp Parch Fare has_cabin C Q title Person
0 3 1 0 7.2500 1 0.0 0.0 Mr male
1 1 1 0 71.2833 0 1.0 0.0 Mrs female
2 3 0 0 7.9250 1 0.0 0.0 Miss female
3 1 1 0 53.1000 0 0.0 0.0 Mrs female
4 3 0 0 8.0500 1 0.0 0.0 Mr male
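To see the difference on a toy pair of frames (column names here are illustrative): join is a method, so it must be called with parentheses, and pd.concat along axis=1 gives an equivalent column-wise combination when the indexes align:

```python
import pandas as pd

left = pd.DataFrame({'Mrs': [0, 1]})
right = pd.DataFrame({'Survived': [1, 0]})

joined = left.join(right)                # method call: (), not []
also = pd.concat([left, right], axis=1)  # column-wise concat does the same here
```

Writing left.join[right] instead asks Python to subscript the bound method object itself, which is exactly what the TypeError above complains about.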