I have 3 dataframes in Pandas:
1) user_interests:
With 'user' as an id and 'interest' as an interest.
2) similarity_score:
With 'user' as a unique id matching the ids in user_interests.
3) similarity_total:
With 'interest' being a list of all the unique interests in user_interests.
What I need to do:
Step 1: Look up each interest from similarity_total in user_interests
Step 2: Take the corresponding user from user_interests and match it to the user in similarity_score
Step 3: Take the corresponding similarity_score from similarity_score and add it to the corresponding interest in similarity_total
The ultimate objective is to total the similarity scores of all the users interested in each subject in similarity_total. A diagram may help:
I know this can be done in Pandas in one line, however I am not there yet. If anyone can point me in the right direction, that would be amazing. Thanks!
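For concreteness, here is a minimal sketch of the three dataframes as described (the values are made up, and 'similarity_score' is assumed to be the name of the score column in the second frame):

import pandas as pd

user_interests = pd.DataFrame({'user': [0, 0, 1, 2],
                               'interest': ['Big Data', 'Hadoop', 'NoSQL', 'Big Data']})
similarity_score = pd.DataFrame({'user': [0, 1, 2],
                                 'similarity_score': [0.15, 0.34, 1.00]})
similarity_total = pd.DataFrame({'interest': user_interests['interest'].unique()})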
IIUC, I think you need:
user_interests['similarity_score'] = user_interests['user'].map(similarity_score.set_index('user')['similarity_score'])
similarity_total = user_interests.groupby('interest', as_index=False)['similarity_score'].sum()
Output:
interest similarity_score
0 Big Data 1.000000
1 Cassandra 1.338062
2 HBase 0.338062
3 Hbase 1.000000
4 Java 1.154303
5 MongoDB 0.338062
6 NoSQL 0.338062
7 Postgres 0.338062
8 Python 0.154303
9 R 0.154303
10 Spark 1.000000
11 Storm 1.000000
12 decision tree 0.000000
13 libsvm 0.000000
14 machine learning 0.000000
15 numpy 0.000000
16 pandas 0.000000
17 probability 0.000000
18 regression 0.000000
19 scikit-learn 0.000000
20 scipy 0.000000
21 statistics 0.000000
22 statsmodels 0.000000
I'm not sure what code you have already written but have you tried something similar to this for the merging? It's not one line though.
# Merge user_interest with similarity_total dataframe
ui_st_df = user_interests.merge(similarity_total, on='interest',how='left').copy()
# Merge ui_st_df with similarity_score dataframe
ui_ss_df = ui_st_df.merge(similarity_score, on='user',how='left').copy()
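From there, the per-interest totals the question asks for are one groupby away (continuing with the ui_ss_df name above):

# Sum the similarity scores of every user who lists each interest
similarity_total = ui_ss_df.groupby('interest', as_index=False)['similarity_score'].sum()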
I found a behavior in pandas that I'm not able to explain to myself.
I am studying a database of audio features with N+2 columns: an ID, the time t, and N audio features related to time t. For various reasons, I would like to put in every row also the features of the next T time steps (yes, the same data will be repeated up to T times). I have therefore written a function that creates additional feature columns containing data from the successive time steps. I have implemented it in three ways, as you can see in the attached code, and one of them does not work, which is surprising to me since it works if the underlying data structures are numpy arrays. Can anybody explain to me why?
def create_datapoints_for_dnn(df, T):
    """
    Here we take the data frame with chroma features at time t and create all features at times t+1, t+2, ..., t+T-1.
    :param df: initial data frame of chroma features
    :param T: number of time steps to keep
    :return: expanded data frame of chroma features
    """
    res = df.copy()
    original_labels = df.columns.values
    n_steps = df.shape[0]  # the number of time steps in this song
    nans = pd.Series(np.full(n_steps, np.NaN)).values  # a column of nans of the correct length
    for n in range(1, T):
        new_labels = [ol + '+' + str(n) for ol in original_labels[2:]]
        for nl, ol in zip(new_labels, original_labels[2:]):
            # df.assign would use the name "nl" instead of what nl contains, so we build and unpack a dictionary
            res = res.assign(**{nl: nans})  # create a new column
            # CORRECT BUT EXTREMELY SLOW
            # for i in range(n_steps - (T - 1)):
            #     res.iloc[i, res.columns.get_loc(nl)] = df.iloc[n+i, df.columns.get_loc(ol)]
            # CORRECT AND FAST
            res.iloc[:-n, res.columns.get_loc(nl)] = df.iloc[:, df.columns.get_loc(ol)].shift(-n)
            # NOT WORKING
            # res.iloc[:-n, res.columns.get_loc(nl)] = df.iloc[n:, df.columns.get_loc(ol)]
    return res[: - (T - 1)]  # drop the last T-1 rows because time t+T-1 is not defined for them
Data example (put it in a csv):
songID,time,A_t,A#_t
CrossEra-0850,0.0,0.0,0.0
CrossEra-0850,0.1,0.0,0.0
CrossEra-0850,0.2,0.0,0.0
CrossEra-0850,0.3,0.31621,0.760299
CrossEra-0850,0.4,0.0,0.00107539
CrossEra-0850,0.5,0.0,0.142832
CrossEra-0850,0.6,0.8506459999999999,0.12481600000000001
CrossEra-0850,0.7,0.0,0.21206399999999997
CrossEra-0850,0.8,0.0796207,0.28227399999999997
CrossEra-0850,0.9,2.55144,0.169434
CrossEra-0850,1.0,3.4581699999999995,0.08014550000000001
CrossEra-0850,1.1,3.1061400000000003,0.030419599999999998
Code to run it
import pandas as pd
import numpy as np
T = 4 # how many successive steps we want to put in a single row
df = pd.read_csv('path_to_csv')
res = create_datapoints_for_dnn(df, T)
res.to_csv('path_to_output', index=False)
Results:
Use pd.DataFrame.shift and concat
The f-string requires Python 3.6+. Otherwise use '+{}'.format(i).
cols = ['songID', 'time']
d = df.drop(['songID', 'time'], axis=1)

df[cols].join(
    pd.concat(
        [d.shift(-i).add_suffix(f'+{i}') for i in range(4)],
        axis=1
    )
)
songID time A_t+0 A#_t+0 A_t+1 A#_t+1 A_t+2 A#_t+2 A_t+3 A#_t+3
0 CrossEra-0850 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.316210 0.760299
1 CrossEra-0850 0.1 0.000000 0.000000 0.000000 0.000000 0.316210 0.760299 0.000000 0.001075
2 CrossEra-0850 0.2 0.000000 0.000000 0.316210 0.760299 0.000000 0.001075 0.000000 0.142832
3 CrossEra-0850 0.3 0.316210 0.760299 0.000000 0.001075 0.000000 0.142832 0.850646 0.124816
4 CrossEra-0850 0.4 0.000000 0.001075 0.000000 0.142832 0.850646 0.124816 0.000000 0.212064
5 CrossEra-0850 0.5 0.000000 0.142832 0.850646 0.124816 0.000000 0.212064 0.079621 0.282274
6 CrossEra-0850 0.6 0.850646 0.124816 0.000000 0.212064 0.079621 0.282274 2.551440 0.169434
7 CrossEra-0850 0.7 0.000000 0.212064 0.079621 0.282274 2.551440 0.169434 3.458170 0.080146
8 CrossEra-0850 0.8 0.079621 0.282274 2.551440 0.169434 3.458170 0.080146 3.106140 0.030420
9 CrossEra-0850 0.9 2.551440 0.169434 3.458170 0.080146 3.106140 0.030420 NaN NaN
10 CrossEra-0850 1.0 3.458170 0.080146 3.106140 0.030420 NaN NaN NaN NaN
11 CrossEra-0850 1.1 3.106140 0.030420 NaN NaN NaN NaN NaN NaN
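As for why the commented-out NOT WORKING line misbehaves (my reading, not something stated above): when the right-hand side of an indexed assignment is a pandas Series, the values are aligned on the Series' index labels rather than written by position, and df.iloc[n:, ...] keeps its original labels n, n+1, ..., so the data lands un-shifted, with NaN where no label matches. .shift(-n) keeps labels 0, 1, ... and therefore lines up, and a bare numpy array has no index at all, which is why the numpy version works. A tiny sketch of the effect, assuming a pandas version that aligns here:

import numpy as np
import pandas as pd

s = pd.DataFrame({'x': [10.0, 20.0, 30.0, 40.0]})

t = s.copy()
t.iloc[:-1, 0] = s.iloc[1:, 0]         # Series RHS: aligned by label, not shifted
print(t['x'].tolist())                 # typically [nan, 20.0, 30.0, 40.0] -- values stay put

t = s.copy()
t.iloc[:-1, 0] = s.iloc[1:, 0].values  # numpy RHS: written by position, as intended
print(t['x'].tolist())                 # [20.0, 30.0, 40.0, 40.0]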
Here is my Python question:
I am asked to generate an output table which contains the number of NaN in each variable (there are more than 10 variables in the data), plus the min, max, mean, std, 25%, 50%, and 75%. I used the describe function in pandas to create the describe table, which gives me everything I want except the number of NaN in each variable. I am thinking about adding the number of NaN as a new row to the output generated by describe.
Can anyone help with this?
output = input_data.describe(include=[np.number]) # this gives the table output
count_nan = input_data.isnull().sum(axis=0) # this counts the number of Nan of each variable
How can I add the second as a row into the first table?
You could use .append to append a new row to a DataFrame:
In [21]: output.append(pd.Series(count_nan, name='nans'))
Out[21]:
0 1 2 3 4
count 4.000000 4.000000 4.000000 4.000000 4.000000
mean 0.583707 0.578610 0.566523 0.480307 0.540259
std 0.142930 0.358793 0.309701 0.097326 0.277490
min 0.450488 0.123328 0.151346 0.381263 0.226411
25% 0.519591 0.406628 0.478343 0.406436 0.429003
50% 0.549012 0.610845 0.607350 0.478787 0.516508
75% 0.613127 0.782827 0.695530 0.552658 0.627764
max 0.786316 0.969421 0.900046 0.582391 0.901610
nans 0.000000 0.000000 0.000000 0.000000 0.000000
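A caveat for newer pandas: DataFrame.append was deprecated in 1.4 and removed in 2.0, so on current versions the same row has to be added another way. Two equivalent sketches, reusing the output and count_nan names from the question:

# Option 1: label-based assignment of a new row (columns are aligned by name)
output.loc['nans'] = count_nan

# Option 2: the concat spelling closest to the original .append call
output = pd.concat([output, count_nan.rename('nans').to_frame().T])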
I have a pandas dataframe df
Red Green Yellow Purple
Basket1 1 2 0 10
Basket2 4 5 0 0
Basket3 9 10 11 12
I want to iterate through this dataframe and divide each element by the total of its column. For example, the first element would be 1/14. I know many pieces of code but I am unable to put them together. For iterating I use
for idx, row in df.iterrows():
and for the column totals I use df.sum(axis=0)
Please help me out with the intermediate code.
This ought to do the trick:
>>> df/df.sum()
Red Green Yellow Purple
Basket1 0.071429 0.117647 0.0 0.454545
Basket2 0.285714 0.294118 0.0 0.000000
Basket3 0.642857 0.588235 1.0 0.545455
As for your serial "iterate through the dataframe and do an operation on every element" approach, just know that, while a for loop is sometimes the easiest and most intuitive way to get the job done, pandas is built for vectorization (i.e. doing things really quickly). When you have lots of data, a built-in pandas method is usually the best tool for the job.
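For reference, the explicit spelling of the same division, which makes the broadcast direction visible (equivalent to df/df.sum()):

col_totals = df.sum(axis=0)                      # one total per column
normalized = df.div(col_totals, axis='columns')  # divide every element by its column's total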
I have 4188006 rows of data. I want to group my data by its Code column value, and set the Code value as the key and the corresponding data as the value in a dict.
_a_stock_basic_data is my data:
Code date_time open high low close \
0 000001.SZ 2007-03-01 19.000000 19.000000 18.100000 18.100000
1 000002.SZ 2007-03-01 14.770000 14.800000 13.860000 14.010000
2 000004.SZ 2007-03-01 6.000000 6.040000 5.810000 6.040000
3 000005.SZ 2007-03-01 4.200000 4.280000 4.000000 4.040000
4 000006.SZ 2007-03-01 13.050000 13.470000 12.910000 13.110000
... ... ... ... ... ... ...
88002 603989.SH 2015-06-30 44.950001 50.250000 41.520000 49.160000
88003 603993.SH 2015-06-30 10.930000 12.500000 10.540000 12.360000
88004 603997.SH 2015-06-30 21.400000 24.959999 20.549999 24.790001
88005 603998.SH 2015-06-30 65.110001 65.110001 65.110001 65.110001
amt volume
0 418404992 22927500
1 659624000 46246800
2 23085800 3853070
3 131162000 31942000
4 251946000 19093500
.... ....
88002 314528000 6933840
88003 532364992 46215300
88004 169784992 7503370
88005 0 0
[4188006 rows x 8 columns]
And my code is:
_a_stock_basic_data = pandas.concat(dfs)
_all_universe = set(all_universe.values.tolist())
for _code in _all_universe:
_temp_data = _a_stock_basic_data[_a_stock_basic_data['Code']==_code]
data[_code] = _temp_data[_temp_data.notnull()]
_all_universe contains the values of _a_stock_basic_data['Code']. The length of _all_universe is about 2816, so the for loop runs 2816 times, and it costs a lot of time to complete the process.
So I just wonder how to group these data with a high-performance method. I think multiprocessing is an option, but shared memory is its problem. And as the data get larger and larger, the performance of the code needs to be taken into consideration, otherwise it will cost a lot. Thank you for your help.
I'll show an example which I think will solve your problem. Below I make a dataframe with random elements, where the column Code will have duplicate values:
a = pd.DataFrame({'a':np.arange(20), 'b':np.random.random(20), 'Code':np.random.random_integers(0, 10, 20)})
To group by the column Code, set it as index:
a.index = a['Code']
You can now use the index to access the data by the value of Code:
In : a.ix[8]
Out:
a b Code
Code
8 1 0.589938 8
8 3 0.030435 8
8 13 0.228775 8
8 14 0.329637 8
8 17 0.915402 8
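A caveat if you run this on current pandas/NumPy: .ix has been removed and np.random.random_integers is deprecated. The same example with the surviving APIs would look roughly like this:

a = pd.DataFrame({'a': np.arange(20),
                  'b': np.random.random(20),
                  'Code': np.random.randint(0, 11, 20)})  # randint's upper bound is exclusive
a.index = a['Code']
a.loc[8]   # .loc replaces the removed .ix for label-based access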
Did you try the pd.concat function? With it you can append arrays along an axis of your choice.
pd.concat([data,_temp_data],axis=1)
- dict(_a_stock_basic_data.groupby(['Code']).size())  ## number of occurrences per Code
- dict(_a_stock_basic_data.groupby(['Code'])['Column_you_want_to_Aggregate'].sum())  ## if you want to do an aggregation on a certain column
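If the goal is really a dict mapping each Code to its own sub-DataFrame, a single groupby pass avoids the ~2816 boolean scans of the original loop entirely (a sketch reusing the names from the question):

# One pass over the data instead of one full-column comparison per code
data = {code: group for code, group in _a_stock_basic_data.groupby('Code')}
# equivalently: data = dict(tuple(_a_stock_basic_data.groupby('Code')))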
I am attempting to divide one column by another inside of a function:
lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')
As can be seen, I am dividing by a column within the DataFrame, but I am getting a rather strange error:
ValueError: putmask: mask and data must be the same size
I must confess, this is the first time I have seen this error. It seems to suggest that the DF and the column are of different lengths, but clearly (since the column comes from the DataFrame) they are not.
A further twist is that I am using this function to loop a data-management procedure over year-specific sets (the data are from the Quarterly Census of Employment and Wages 'singlefiles' in the beta series). The sets associated with the 1990-2000 time period go off without a hitch, but 2001 throws this error. I am afraid I have not been able to identify a difference in structure across years, and even if I could, how would it explain the length mismatch?
Any thoughts would be greatly appreciated.
EDIT (2/1/2014): Thanks for taking a look Tom. As requested, the pandas version is 0.13.0, and the data file in question is located here on the BLS FTP site. Just to clarify what I meant by consistent structure, every year has the same variable set and dtype (in addition to a consistent data code structure).
EDIT (2/1/2014): Perhaps it would be useful to share the entire function:
def qcew(f,m_dict):
    '''Function reads in file and captures county level aggregations with government contributions'''
    #Read in file
    cew=pd.read_csv(f)
    #Create string version of area fips
    cew['fips']=cew['area_fips'].astype(str)
    #Generate description variables
    cew['area']=cew['fips'].map(m_dict['area'])
    cew['industry']=cew['industry_code'].map(m_dict['industry'])
    cew['agglvl']=cew['agglvl_code'].map(m_dict['agglvl'])
    cew['own']=cew['own_code'].map(m_dict['ownership'])
    cew['size']=cew['size_code'].map(m_dict['size'])
    #Generate boolean masks
    lagg_mask=cew['agglvl_code']==73
    lsize_mask=cew['size_code']==0
    #Subset data to above specifications
    cew_super=cew[lagg_mask & lsize_mask]
    #Define column subset
    lsub_cols=['year','fips','area','industry_code','industry','own','annual_avg_estabs_count','annual_avg_emplvl',\
               'total_annual_wages','own_code']
    #Subset to desired columns
    cew_sub=cew_super[lsub_cols]
    #Rename columns
    cew_sub.columns=['year','fips','cty','ind_code','industry','own','estabs','emp','tot_wages','own_code']
    #Set index
    cew_sub.set_index(['year','fips','cty'],inplace=True)
    #Capture total wage base and the contributions of Federal, State, and Local
    cew_base=cew_sub['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_fed=cew_sub[cew_sub['own_code']==1]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_st=cew_sub[cew_sub['own_code']==2]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_loc=cew_sub[cew_sub['own_code']==3]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    #Convert to DFs for join
    lbase=DataFrame(cew_base).rename(columns={0:'base'})
    lfed=DataFrame(cew_fed).rename(columns={0:'fed_wage'})
    lstate=DataFrame(cew_st).rename(columns={0:'st_wage'})
    llocal=DataFrame(cew_loc).rename(columns={0:'loc_wage'})
    #Join these series
    lcontrib_lev=pd.concat([lbase,lfed,lstate,llocal],axis='index').fillna(0)
    #Diag prints
    print f
    print lcontrib_lev.head()
    print lcontrib_lev.describe()
    print '*****************************\n'
    #Calculate proportional contributions (failure point)
    lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')
    #Group base data by year, county, and industry
    cew_g=cew_sub.reset_index().groupby(['year','fips','cty','ind_code','industry']).sum().reset_index()
    #Join contributions to joined data
    cew_contr=cew_g.set_index(['year','fips','cty']).join(lcontrib[['fed_wage','st_wage','loc_wage']])
    return cew_contr[[x for x in cew_contr.columns if x != 'own_code']]
Works ok for me (this is on 0.13.1, but IIRC I don't think anything in this particular area changed; it's possible it was a bug that has since been fixed).
In [48]: lcontrib_lev.div(lcontrib_lev['base'],axis='index').head()
Out[48]:
base fed_wage st_wage loc_wage
year fips cty
2001 1000 1000 NaN NaN NaN NaN
1000 NaN NaN NaN NaN
10000 10000 NaN NaN NaN NaN
10000 NaN NaN NaN NaN
10001 10001 NaN NaN NaN NaN
[5 rows x 4 columns]
In [49]: lcontrib_lev.div(lcontrib_lev['base'],axis='index').tail()
Out[49]:
base fed_wage st_wage loc_wage
year fips cty
2001 CS566 CS566 1 0.000000 0.000000 0.000000
US000 US000 1 0.022673 0.027978 0.073828
USCMS USCMS 1 0.000000 0.000000 0.000000
USMSA USMSA 1 0.000000 0.000000 0.000000
USNMS USNMS 1 0.000000 0.000000 0.000000
[5 rows x 4 columns]
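One thing that may be worth double-checking (my observation, not something the answer above states): pd.concat([lbase,lfed,lstate,llocal],axis='index') stacks the four single-column frames on top of each other, which is consistent with the duplicated index entries and all-NaN rows visible in the .head() output. If the intent was one row per (year, fips, cty) with base, fed_wage, st_wage and loc_wage side by side, concatenating along the columns is probably what was meant; a hypothetical sketch:

# Assumes the goal is one row per (year, fips, cty) with the four wage series as columns;
# axis='index' stacks them vertically instead.
lcontrib_lev = pd.concat([lbase, lfed, lstate, llocal], axis=1).fillna(0)
lcontrib = lcontrib_lev.div(lcontrib_lev['base'], axis='index')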