I have a dataset with ratings that users gave to products. There are only 5,000 products and 10,000 users, but the IDs are not contiguous. I would like to transform my dataframe into a sparse COO matrix, coo_matrix((data, (row, col)), shape=shape), but with row and col sized by the actual number of users and products, not by the raw IDs. Is there any way to do that? Below is an illustration:
Data frame:

User ID   Product ID   Rating
1         14           0.1
1         15           0.2
2         14           0.3
2         16           0.3
5         19           0.4
and expect to get a matrix (in sparse COO form) like:

          ProductID
UserID     14    15    16    19
1         0.1   0.2   0     0
2         0.3   0     0.3   0
5         0     0     0     0.4
because normally coo_matrix would build a much larger matrix, indexed by the raw IDs (0, 1, ..., 19 for products and 0, ..., 5 for users).
Please help me; this is for my thesis due in 3 days and I just found this error. I am coding in Python.
Thank you very much!
Hi, hope this helps, and good luck with your thesis:
import pandas as pd
from scipy.sparse import coo_matrix

dataframe = pd.DataFrame(data={'User ID': [1, 1, 2, 2, 5],
                               'Product ID': [14, 15, 14, 16, 19],
                               'Rating': [0.1, 0.2, 0.3, 0.3, 0.4]})
row = dataframe['User ID']
col = dataframe['Product ID']
data = dataframe['Rating']
dense = coo_matrix((data, (row, col))).toarray()
new_dataframe = pd.DataFrame(dense)
# Drop non-existing Product IDs (all-zero columns) -- optional, delete if not intended
new_dataframe = new_dataframe.loc[:, (new_dataframe != 0).any(axis=0)]
# Drop non-existing User IDs (all-zero rows) -- optional, delete if not intended
new_dataframe = new_dataframe.loc[(new_dataframe != 0).any(axis=1)]
print(new_dataframe)
Output:
14 15 16 19
1 0.1 0.2 0.0 0.0
2 0.3 0.0 0.3 0.0
5 0.0 0.0 0.0 0.4
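If you want the sparse matrix itself to be compact (here 3 users by 4 products) rather than densifying and then dropping empty rows and columns, one option is to map the raw IDs to contiguous indices first with pd.factorize, which assigns each distinct ID a code 0, 1, 2, ... and returns the mapping. A sketch of that approach:

import pandas as pd
from scipy.sparse import coo_matrix

dataframe = pd.DataFrame(data={'User ID': [1, 1, 2, 2, 5],
                               'Product ID': [14, 15, 14, 16, 19],
                               'Rating': [0.1, 0.2, 0.3, 0.3, 0.4]})
# factorize returns (codes, uniques): codes are contiguous 0-based indices
row, user_ids = pd.factorize(dataframe['User ID'])
col, product_ids = pd.factorize(dataframe['Product ID'])
coo = coo_matrix((dataframe['Rating'], (row, col)),
                 shape=(len(user_ids), len(product_ids)))
# user_ids / product_ids translate matrix indices back to the raw IDs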
I have a data frame with columns of values for different countries, and I would like a function that shifts each country's rows independently, leaving the dates alone. For each country I have a "profile shifter" that controls the shift.
If the profile shifter for a country is -3, that country's column is shifted downwards 3 times, and the last 3 values wrap around to become the first 3 values. If the profile shifter is +3, the column is shifted upwards, and the values pushed off the top wrap around to the bottom.
After the rows have been shifted, instead of the default NaN appearing in the emptied cells, the wrapped-around values should fill them. The function should return a data frame.
Sample Dataset:
Datetime ARG AUS BRA
1/1/2050 0.00 0.1 2.1 3.1
1/1/2050 1.00 0.2 2.2 3.2
1/1/2050 2.00 0.3 2.3 3.3
1/1/2050 3.00 0.4 2.4 3.4
1/1/2050 4.00 0.5 2.5 3.5
1/1/2050 5.00 0.6 2.6 3.6
Country Profile Shifters:
Country ARG AUS BRA
UTC -3 -2 4
Desired Output:
Datetime ARG AUS BRA
1/1/2050 0.00 0.3 2.4 3.4
1/1/2050 1.00 0.4 2.5 3.5
1/1/2050 2.00 0.5 2.1 3.1
1/1/2050 3.00 0.1 2.2 3.2
1/1/2050 4.00 0.2 2.3 3.3
This is what I have been trying for days now, but it's not working:
cols = df1.columns
for i in cols:
    if i == 'ARG':
        x = df1.iat[0:3, 0]
        df1['ARG'] = df1.ARG.shift(periods=-3)
        df1['ARG'].replace(to_replace=np.nan, x)
    elif i == 'AUS':
        df1['AUS'] = df1.AUS.shift(periods=2)
    elif i == 'BRA':
        df1['BRA'] = df1.BRA.shift(periods=1)
    else:
        pass
This works but is far from being 'good pandas'. I hope that someone will come along and give a nicer, cleaner 'more pandas' answer.
Imports used:
import pandas as pd
import datetime
Offset data setup:
offsets = pd.DataFrame({"Country" : ["ARG", "AUS", "BRA"], "UTC Offset" : [-3, -2, 4]})
Produces:
Country UTC Offset
0 ARG -3
1 AUS -2
2 BRA 4
Note that the timezone offset data I've used here is in a slightly different structure from the example data (country codes by rows, rather than columns). Also worth pointing out that Australia and Brazil have several time zones, so there is no one single UTC offset which applies to those whole countries (only one in Argentina though).
Sample data setup:
# DataFrame.append was removed in pandas 2.0, so build from a list of dicts
rows = []
for i in range(6):
    rows.append({'Datetime': datetime.datetime(2050, 1, 1, i),
                 'ARG': i / 10,
                 'AUS': (i + 10) / 10,
                 'BRA': (i + 20) / 10})
sampleDf = pd.DataFrame(rows)
Produces:
Datetime ARG AUS BRA
0 2050-01-01 00:00:00 0.0 1.0 2.0
1 2050-01-01 01:00:00 0.1 1.1 2.1
2 2050-01-01 02:00:00 0.2 1.2 2.2
3 2050-01-01 03:00:00 0.3 1.3 2.3
4 2050-01-01 04:00:00 0.4 1.4 2.4
5 2050-01-01 05:00:00 0.5 1.5 2.5
Code to shift cells:
for idx, offsetData in offsets.iterrows():  # See note 1
    countryCode = offsetData["Country"]
    utcOffset = offsetData["UTC Offset"]
    dfRowCount = sampleDf.shape[0]
    wrappedOffset = (dfRowCount + utcOffset) if utcOffset < 0 else \
                    (-dfRowCount + utcOffset)  # See note 2
    countryData = sampleDf[countryCode]
    sampleDf[countryCode] = pd.concat([countryData.shift(utcOffset).dropna(),
                                       countryData.shift(wrappedOffset).dropna()]).sort_index()  # See note 3
Produces:
Datetime ARG AUS BRA
0 2050-01-01 00:00:00 0.0 1.4 2.4
1 2050-01-01 01:00:00 0.1 1.5 2.5
2 2050-01-01 02:00:00 0.2 1.0 2.0
3 2050-01-01 03:00:00 0.3 1.1 2.1
4 2050-01-01 04:00:00 0.4 1.2 2.2
5 2050-01-01 05:00:00 0.5 1.3 2.3
Notes
1. Iterating over rows in pandas like this (to me) indicates "you've run out of pandas skill, and are kind of going against the design of pandas". What I have here works, but it won't benefit from many of the efficiencies of pandas and would not be appropriate for a large dataset. Using itertuples rather than iterrows is supposed to be quicker, but I think neither is great, so I went with what seemed most readable for this case.
2. This solution does two shifts: one of the data shifted by the timezone offset, then a second shift of everything else to fill in what would otherwise be NaN holes left by the first shift. This line calculates the size of that second shift.
3. Finally, the results of the two shifts are concatenated (after dropping NaN values from both) and assigned back to the original (unshifted) column. sort_index puts the rows back in order based on the index, rather than leaving the two shifted parts one after another.
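For what it's worth, the two-shift-and-concat trick amounts to a single circular shift, which numpy provides directly as np.roll (a positive shift moves values toward higher indices and wraps the overflow back to the top, matching the wrap direction above). A more compact sketch using the same offsets frame:

import numpy as np

for _, offsetData in offsets.iterrows():
    country = offsetData["Country"]
    # np.roll does the wraparound in one call, same sign convention as shift()
    sampleDf[country] = np.roll(sampleDf[country].to_numpy(),
                                offsetData["UTC Offset"])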
I have a dataframe, something like
name perc score
a 0.2 40
b 0.4 89
c 0.3 90
I want to have a total row where 'perc' has a mean aggregation and 'score' has a sum aggregation. The output should be like
name perc score
a 0.2 40
b 0.4 89
c 0.3 90
total 0.3 219
I want the output as a dataframe, as I need to build plots from it. For now, I tried
df.loc['total'] = df.sum()
but this sums the percentage column as well, whereas I want an average there. How can I do this in pandas?
Try this (if 'name' is still a regular column, set it as the index first so the total row lands in the right place, as in the output below):
df = df.set_index('name')
df.loc['total'] = [df['perc'].mean(), df['score'].sum()]
Output:
perc score
name
a 0.20 40.0
b 0.40 89.0
c 0.30 90.0
total 0.30 219.0
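If there are more columns, spelling out each aggregation in a dict and letting DataFrame.agg compute them scales better; a minimal sketch, under the same assumption that 'name' is the index:

df.loc['total'] = df.agg({'perc': 'mean', 'score': 'sum'})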
I have the following data frame in pandas:
I want to add an Avg Price column to the original data frame, computed after grouping by (Date, Issuer) and taking the dot product of Weights and Price, so that it looks something like:
Is there a way to do it without using merge or join? What would be the simplest way to do it?
One way using pandas.DataFrame.prod:
df["Avg Price"] = df[["Weights", "Price"]].prod(1)
df["Avg Price"] = df.groupby(["Date", "Issuer"])["Avg Price"].transform("sum")
print(df)
Output:
Date Issuer Weights Price Avg Price
0 2019-11-12 A 0.4 100 120.0
1 2019-15-12 B 0.5 100 100.0
2 2019-11-12 A 0.2 200 120.0
3 2019-15-12 B 0.3 100 100.0
4 2019-11-12 A 0.4 100 120.0
5 2019-15-12 B 0.2 100 100.0
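For reference, an input frame consistent with the printed output can be reconstructed like this (values inferred from the result above; dates kept as strings):

import pandas as pd

df = pd.DataFrame({
    "Date": ["2019-11-12", "2019-15-12", "2019-11-12",
             "2019-15-12", "2019-11-12", "2019-15-12"],
    "Issuer": ["A", "B", "A", "B", "A", "B"],
    "Weights": [0.4, 0.5, 0.2, 0.3, 0.4, 0.2],
    "Price": [100, 100, 200, 100, 100, 100],
})
# Group A: 0.4*100 + 0.2*200 + 0.4*100 = 120; group B: (0.5+0.3+0.2)*100 = 100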
I have many dataframes (timeseries) that are of different lengths ranging between 28 and 179. I need to make them all of length 104. (upsampling those below 104 and downsampling those above 104)
For upsampling, the linear method can be sufficient to my needs. For downsampling, the mean of the values should be good.
To get all files to be the same length, I thought that I need to make all dataframes start and end at the same dates.
I was able to downsample all of them to the size of the smallest dataframe (i.e. 28) using the lines of code below:
df.set_index(pd.date_range(start='1/1/1991', periods=len(df), end='1/1/2000'), inplace=True)
resampled = df.resample('120D').mean()
However, this will not give me good results when I feed them into the model I need them for as it shrinks the longer files so much thus distorting the data.
This is what I tried so far:
df.set_index(pd.date_range(start='1/1/1991', periods=len(df), end='1/1/2000'), inplace=True)
if df.shape[0] > 100:
    resampled = df.resample('D').mean()
elif df.shape[0] < 100:
    resampled = df.astype(float).resample('33D').interpolate(axis=0, method='linear')
else:
    break
Now, in the above lines of code, I am getting the files to be the same length (length 100). The downsampling part works fine too.
What's not working is the interpolation in the upsampling part. It just returns dataframes of length 100 with the first value of every column copied over to all the rows.
What I need is to make them all size 104 (the average size). This means any df of length > 104 needs to be downsampled and any df of length < 104 needs to be upsampled.
As an example, please consider the two dfs as follows:
>>df1
index
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 11 -7 -2
>>df2
index
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 6 -3 -2
4 4 0 -4
5 8 2 -6
6 10 4 -8
7 12 6 -10
Suppose the avg length is 6, the expected output would be:
df1 upsampled to length 6 using interpolation, e.g. resample(rule).interpolate().
And df2 downsampled to length 6 using resample(rule).mean().
Update:
If I could get all the files to be upsampled to 179, that would be fine as well.
I assume the problem is that when you resample in the up-sampling case, the other values are not kept. With your example df1, you can see it by using asfreq on one column:
print(df1.set_index(pd.date_range(start='1/1/1991', periods=len(df1), end='1/1/2000'))[1]
      .resample('33D').asfreq().isna().sum(axis=0))
# 99 rows are NaN in the 100-row resampled dataframe
So when you use interpolate instead of asfreq, only that first value is non-NaN, and the interpolation just repeats it over all the rows.
To get the result you want, apply mean before interpolating, even in the up-sampling case:
print(df1.set_index(pd.date_range(start='1/1/1991', periods=len(df1), end='1/1/2000'))[1]
      .resample('33D').mean().interpolate().head())
1991-01-01 3.000000
1991-02-03 3.060606
1991-03-08 3.121212
1991-04-10 3.181818
1991-05-13 3.242424
Freq: 33D, Name: 1, dtype: float64
and you will get the values you want.
To conclude, I think in both the up-sampling and down-sampling cases you can use the same command:
resampled = (df.set_index(pd.date_range(start='1/1/1991', periods=len(df), end='1/1/2000'))
             .resample('33D').mean().interpolate())
because the interpolate does not affect the result in the down-sampling case.
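If you need an exact target length (104 rows) rather than whatever a date-frequency rule happens to produce, a hypothetical helper based on np.interp can hit the length exactly. Note that it interpolates in both directions, so downsampling here samples points rather than averaging bins the way resample().mean() does:

import numpy as np
import pandas as pd

def resample_to_length(df, target_len=104):
    # Map old and new row positions onto [0, 1] and interpolate each column
    old = np.linspace(0.0, 1.0, len(df))
    new = np.linspace(0.0, 1.0, target_len)
    return pd.DataFrame({c: np.interp(new, old, df[c].to_numpy())
                         for c in df.columns})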
Here is my version, using the skimage.transform.resize() function:
df1 = pd.DataFrame({
'a': [3,5,9,11],
'b': [-1,-3,-5,-7],
'c': [0,2,0,-2]
})
df1
a b c
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 11 -7 -2
import pandas as pd
import numpy as np
from skimage.transform import resize
def df_resample(df1, num=1):
    df2 = pd.DataFrame()
    for key, value in df1.items():  # iteritems() was removed in pandas 2.0
        temp = value.to_numpy() / value.abs().max()  # normalize
        resampled = resize(temp, (num, 1), mode='edge') * value.abs().max()  # de-normalize
        df2[key] = resampled.flatten().round(2)
    return df2
df2 = df_resample(df1, 20)  # resample every column to 20 rows
df2
a b c
0 3.0 -1.0 0.0
1 3.0 -1.0 0.0
2 3.0 -1.0 0.0
3 3.4 -1.4 0.4
4 3.8 -1.8 0.8
5 4.2 -2.2 1.2
6 4.6 -2.6 1.6
7 5.0 -3.0 2.0
8 5.8 -3.4 1.6
9 6.6 -3.8 1.2
10 7.4 -4.2 0.8
11 8.2 -4.6 0.4
12 9.0 -5.0 0.0
13 9.4 -5.4 -0.4
14 9.8 -5.8 -0.8
15 10.2 -6.2 -1.2
16 10.6 -6.6 -1.6
17 11.0 -7.0 -2.0
18 11.0 -7.0 -2.0
19 11.0 -7.0 -2.0
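For numeric timeseries, scipy.signal.resample is another one-call option (Fourier-based, so it can overshoot on non-periodic data); a small sketch assuming the df1 from above:

from scipy.signal import resample

# Resamples along axis 0 by default: (4, 3) input -> (20, 3) output
df2 = pd.DataFrame(resample(df1.to_numpy(), 20), columns=df1.columns).round(2)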
I have a dataframe, and I would like to subtract one block of its rows from another.
What I tried so far (the dataframe: http://pastebin.com/PydRHxcz):
index = pd.MultiIndex.from_tuples([key for key in dfdict], names=['a', 'b', 'c', 'd'])
dfl = pd.DataFrame([dfdict[key] for key in dfdict], index=index)
dfl.columns = ['data']
dfl.sort_index(inplace=True)  # DataFrame.sort() was removed; sort_index is the modern equivalent
d = dfl.unstack(['a', 'b'])
I can do:
d[0:5] - d[0:5]
And I get zeros for all values.
But If I do:
d[0:5] - d[5:]
I get NaNs for all values. Any ideas how I can perform such an operation?
EDIT:
What works is
dfl.unstack(['a','b'])['data'][5:] - dfl.unstack(['a','b'])['data'][0:5].values
But it feels a bit clumsy
You can use loc to select all rows that correspond to one label in the first level like this:
In [8]: d.loc[0]
Out[8]:
data ...
a 0.17 1.00
b 0 5 10 500 0 5
d
0.0 11.098909 9.223784 8.003650 10.014445 13.231898 10.372040
0.3 14.349606 11.420565 9.053073 10.252542 26.342501 25.219403
0.5 1.336937 2.522929 3.875139 11.161803 3.168935 6.287555
0.7 0.379158 1.061104 2.053024 12.358577 0.678352 2.133887
1.0 0.210244 0.631631 1.457333 15.117805 0.292904 1.053916
So doing the subtraction looks like:
In [11]: d.loc[0] - d.loc[1000]
Out[11]:
data ...
a 0.17 1.00
b 0 5 10 500 0 5
d
0.0 -3.870946 -3.239915 -3.504068 -0.722377 -2.335147 -2.460035
0.3 -65.611418 -42.225811 -25.712668 -1.028758 -65.106473 -44.067692
0.5 -84.494748 -55.186368 -34.184425 -1.619957 -89.356417 -69.008567
0.7 -92.681688 -61.636548 -37.386604 -4.227343 -110.501219 -78.925078
1.0 -101.071683 -61.758741 -37.080222 -3.081782 -103.779698 -80.337487
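The NaNs in the slice-based attempts come from pandas' index alignment: arithmetic matches rows by label, not by position, and d[0:5] and d[5:] share no labels. A tiny illustration with made-up series (not the data above):

import pandas as pd

s = pd.Series([1.0, 2.0], index=[0, 1])
t = pd.Series([3.0, 4.0], index=[2, 3])
print(s - t)         # all NaN: no labels in common, so nothing aligns
print(s - t.values)  # positional subtraction, like the .values trick in the question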