I have 2 dataframes: players (which only has PlayerID) and dates (which only has Date). I want a new dataframe that contains every date for each player. In my case, the players df has about 2600 rows and the dates df has 1100 rows. I used 2 for loops to do this, but it is really slow. Is there a way to do it faster with some function? Thanks.
my loop:
player_elo = pd.DataFrame(columns=['PlayerID', 'Date'])
for row in players.itertuples():
    idx = row.Index
    pl = players.at[idx, 'PlayerID']
    for i in dates.itertuples():
        idd = i.Index
        dt = dates.at[idd, 0]
        new = {'PlayerID': [pl], 'Date': [dt]}
        new = pd.DataFrame(new)
        player_elo = player_elo.append(new)
If you add a key column that holds the same value in every row of each df, you can produce the cartesian product you are looking for using pd.merge().
import pandas as pd
players = pd.DataFrame([['A'], ['B'], ['C']], columns=['PlayerID'])
dates = pd.DataFrame([['12/12/2012'],['12/13/2012'],['12/14/2012']], columns=['Date'])
dates['Date'] = pd.to_datetime(dates['Date'])
players['key'] = 1
dates['key'] = 1
print(pd.merge(players, dates, on='key')[['PlayerID', 'Date']])
Output
PlayerID Date
0 A 2012-12-12
1 A 2012-12-13
2 A 2012-12-14
3 B 2012-12-12
4 B 2012-12-13
5 B 2012-12-14
6 C 2012-12-12
7 C 2012-12-13
8 C 2012-12-14
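As a side note, on pandas 1.2 or newer you can skip the dummy key entirely and ask merge for a cross join; a minimal sketch with the same toy frames:

import pandas as pd

players = pd.DataFrame([['A'], ['B'], ['C']], columns=['PlayerID'])
dates = pd.DataFrame({'Date': pd.to_datetime(['12/12/2012', '12/13/2012', '12/14/2012'])})

# how='cross' builds the cartesian product directly, no helper column needed
player_elo = pd.merge(players, dates, how='cross')
print(player_elo)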
Related
I have two dataframes:
one:
[A]
1
2
3
two:
[B]
7
6
9
How can I join two columns of different dataframes into another dataframe?
Like this:
[A][B]
1 7
2 6
3 9
I already tried that:
result = A
result = result.rename(columns={'employee_id': 'A'})
result['B'] = pd.Series(B['employee_id'])
and
B_column = B["employee_id"]
result = pd.concat([result,B_column], axis = 1)
result
but I still couldn't get it to work.
import pandas as pd
df1 = pd.DataFrame(data = {"A" : range(1, 4)})
df2 = pd.DataFrame(data = {"B" : range(7, 10)})
df = df1.join(df2)
Gives

   A  B
0  1  7
1  2  8
2  3  9
While there are various ways to accomplish this, one way would be to just merge them on the index.
Something like this:
dfResult = dfA.merge(dfB, left_on=dfA.index, right_on=dfB.index, how='inner')
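For instance, a minimal sketch assuming both frames carry the same default RangeIndex; left_index/right_index is the more common spelling of merging on the index:

import pandas as pd

dfA = pd.DataFrame({'A': [1, 2, 3]})
dfB = pd.DataFrame({'B': [7, 6, 9]})

# align the two frames row by row on their shared index
dfResult = dfA.merge(dfB, left_index=True, right_index=True, how='inner')
print(dfResult)
#    A  B
# 0  1  7
# 1  2  6
# 2  3  9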
Let's say that I have a simple Dataframe.
import pandas as pd
data1 = [12,34,'fsdf',678,'','','dfs','','']
df1 = pd.DataFrame(data1, columns=['Data'])
print(df1)
Data
0 12
1 34
2 fsdf
3 678
4
5
6 dfs
7
8
I want to clear all the data except the last non-empty value found in the column, and I want to keep that value in the first row. The column can have thousands of rows. So I would like this result:
Data
0 dfs
1
2
3
4
5
6
7
8
And I have to keep the shape of this dataframe, so rows must not be removed.
What are the simplest functions to do that efficiently?
Thank you
Get the index of the last non-empty string value and assign that value to the first row of the column:
s = df1.loc[df1['Data'].iloc[::-1].ne('').idxmax(), 'Data']
print (s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print (df1)
Data
0 dfs
1
2
3
4
5
6
7
8
If empty strings are missing values:
import numpy as np

data1 = [12,34,'fsdf',678,np.nan,np.nan,'dfs',np.nan,np.nan]
df1 = pd.DataFrame(data1, columns=['Data'])
print(df1)
Data
0 12
1 34
2 fsdf
3 678
4 NaN
5 NaN
6 dfs
7 NaN
8 NaN
s = df1.loc[df1['Data'].iloc[::-1].notna().idxmax(), 'Data']
print (s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print (df1)
Data
0 dfs
1
2
3
4
5
6
7
8
A simple pandas condition check like this can help:
df1['Data'] = [df1.loc[df1['Data'].ne(""), "Data"].iloc[-1]] + [''] * (len(df1) - 1)
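Unpacked a little, the same idea in two steps (a sketch using the df1 from the question):

last_val = df1.loc[df1['Data'].ne(''), 'Data'].iloc[-1]   # last non-empty value -> 'dfs'
df1['Data'] = [last_val] + [''] * (len(df1) - 1)          # that value first, blanks after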
You can replace '' with NaN using df.replace, then use df.last_valid_index:
import numpy as np

val = df1.loc[df1.replace('', np.nan).last_valid_index(), 'Data']
# Below two lines taken from #jezrael's answer
df1.loc[0, 'Data'] = val
df1.loc[1:, 'Data'] = ''
Or
You can use np.full with fill_value set to np.nan here.
val = df1.loc[df1.replace("", np.nan).last_valid_index(), "Data"]
df1 = pd.DataFrame(np.full(df1.shape, np.nan),
                   index=df1.index,
                   columns=df1.columns)
df1.loc[0, "Data"] = val
I am working with Python on a pandas dataframe where I have to do some calculations.
As you can see in those images, I have a lot of data with different ids. What I need to do is run different calculations for each id, so what I am doing right now is this:
array_id_ad_hs = df['column_id'].unique()
for id in array_id_ad_hs:
    df_history = df[df['column_id'] == id]
    df_history['new_column'] = 1000 - df_history['temporary_sum'].cumsum()
Is there a better/faster way to do these operations?
You can use groupby:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': np.repeat(['A', 'B', 'C'], 5), 'x': np.random.normal(300, 100, 15)})
df['y'] = 1000 + df.groupby('id')['x'].cumsum()
print(df)
outputs
id x y
0 A 265.331439 1265.331439
1 A 392.658450 1657.989889
2 A 223.808512 1881.798401
3 A 209.223416 2091.021817
4 A 253.292921 2344.314738
5 B 425.387435 1425.387435
6 B 171.922392 1597.309827
7 B 198.998873 1796.308699
8 B 168.298701 1964.607401
9 B 347.075096 2311.682497
10 C 374.944209 1374.944209
11 C 310.802718 1685.746927
12 C 361.621695 2047.368623
13 C 250.134388 2297.503011
14 C 294.190045 2591.693056
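To mirror the question's own operation (subtracting the running sum from 1000 per id), the same groupby idea applies directly; a sketch assuming the frame has the 'column_id' and 'temporary_sum' columns named in the question:

# one vectorised line replaces the whole per-id loop
df['new_column'] = 1000 - df.groupby('column_id')['temporary_sum'].cumsum()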
I am using xlwings to replace my VB code with Python but since I am not an experienced programmer I was wondering - which data structure to use?
The data is in an .xls file in 2 columns and has the following form; in VB I lift this into a basic two-dimensional array arrCampaignsAmounts(i, j):
Col 1: 'market_channel_campaign_product'; Col 2: '2334.43 $'
Then I concatenate words from 4 columns on another sheet into a similar string, stored in another 2-dimensional array arrStrings(i, j):
'Austria_Facebook_Winter_Active vacation'; 'rowNumber'
Finally, I search for the strings from the first array within the strings from the second array; if found, I write the amounts into the rowNumber from arrStrings(i, 2).
Would I use 4 lists for this task?
Two dictionaries?
Something else?
Definitely use pandas DataFrames. Here are references and some very simple DataFrame examples.
#reference: http://pandas.pydata.org/pandas-docs/stable/10min.html
#reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
import numpy as np
import pandas as pd
def df_dupes(df_in):
    '''
    Returns [object, count] pairs for each unique item in the dataframe.
    '''
    # accept a plain list/tuple as well as a DataFrame
    if isinstance(df_in, list) or isinstance(df_in, tuple):
        import pandas as pd
        df_in = pd.DataFrame(df_in)
    return df_in.groupby(df_in.columns.tolist(), as_index=False).size()
def df_filter_example(df):
    '''
    In [96]: df
    Out[96]:
       A  B  C  D
    0  1  4  9  1
    1  4  5  0  2
    2  5  5  1  0
    3  1  3  9  6
    '''
    import pandas as pd
    df = pd.DataFrame([[1,4,9,1],[4,5,0,2],[5,5,1,0],[1,3,9,6]], columns=['A','B','C','D'])
    return df[(df.A == 1) & (df.D == 6)]
def df_compare(df1, df2, compare_col_list, join_type):
    '''
    df_compare compares 2 dataframes.
    Returns a left, right, inner or outer join.
    df1 is the first/left dataframe
    df2 is the second/right dataframe
    compare_col_list is a list of column names that must match between df1 and df2
    join_type = 'inner', 'left', 'right' or 'outer'
    '''
    import pandas as pd
    return pd.merge(df1, df2, how=join_type, on=compare_col_list)
def df_compare_examples():
    import numpy as np
    import pandas as pd
    df1 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['c1', 'c2', 'c3'])
    '''    c1  c2  c3
    0       1   2   3
    1       4   5   6
    2       7   8   9 '''
    df2 = pd.DataFrame([[4,5,6],[7,8,9],[10,11,12]], columns=['c1', 'c2', 'c3'])
    '''    c1  c2  c3
    0       4   5   6
    1       7   8   9
    2      10  11  12 '''
    # One can see that df1 contains 1 row ([1,2,3]) not in df2 and
    # df2 contains 1 row ([10,11,12]) not in df1.
    # Assume c1 is not relevant to the comparison, so we merge on cols c2 and c3.
    df_merge = pd.merge(df1, df2, how='outer', on=['c2','c3'])
    print(df_merge)
    '''   c1_x  c2  c3  c1_y
    0       1    2   3   NaN
    1       4    5   6     4
    2       7    8   9     7
    3     NaN   11  12    10 '''
    ''' One can see that columns c2 and c3 are returned. We also received
    columns c1_x and c1_y, where c1_x is the value of column c1
    in the first dataframe and c1_y is the value of c1 in the second
    dataframe. As such,
    any row that contains c1_y = NaN is a row from df1 not in df2 &
    any row that contains c1_x = NaN is a row from df2 not in df1. '''
    df1_unique = pd.merge(df1, df2, how='left', on=['c2','c3'])
    df1_unique = df1_unique[df1_unique['c1_y'].isnull()]
    print(df1_unique)
    df2_unique = pd.merge(df1, df2, how='right', on=['c2','c3'])
    print(df2_unique)
    df_common = pd.merge(df1, df2, how='inner', on=['c2','c3'])
    print(df_common)
def delete_column_example():
    print('create df')
    import pandas as pd
    df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['a','b','c'])
    print('drop (delete/remove) column')
    col_name = 'b'
    df.drop(col_name, axis=1, inplace=True)  # or df = df.drop(col_name, axis=1)
def delete_rows_example():
    print('\n\ncreate df')
    import pandas as pd
    df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['col_1','col_2','col_3'])
    print(df)
    print('\n\nappend rows')
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, pd.DataFrame([[11,22,33]], columns=['col_1','col_2','col_3'])])
    print(df)
    print('\n\ndelete rows where (based on) column value')
    df = df[df.col_1 == 4]
    print(df)
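Since the question is specifically about xlwings, here is a rough sketch of pulling the two worksheet columns straight into a DataFrame instead of a VB-style 2-D array. The workbook name, sheet name and starting cell are assumptions; adjust them to your file:

import pandas as pd
import xlwings as xw

wb = xw.Book('campaigns.xls')   # assumption: your workbook
sht = wb.sheets['Sheet1']       # assumption: your sheet name

# read the contiguous block starting at A1 into a DataFrame
df = sht.range('A1').options(pd.DataFrame, expand='table', index=False).value
print(df.head())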
I am trying to use a loop function to create a matrix of whether a product was seen in a particular week.
Each row in the df (representing a product) has a close_date (the date the product closed) and a week_diff (the number of weeks the product was listed).
import pandas
mydata = [{'subid': 'A', 'Close_date_wk': 25, 'week_diff': 3},
          {'subid': 'B', 'Close_date_wk': 26, 'week_diff': 2},
          {'subid': 'C', 'Close_date_wk': 27, 'week_diff': 2}]
df = pandas.DataFrame(mydata)
My goal is to see how many alternative products were listed for each product in each date_range
I have set up the following loop:
for index, row in df.iterrows():
    i = 0
    max_range = row['Close_date_wk']
    min_range = int(row['Close_date_wk'] - row['week_diff'])
    for i in range(min_range, max_range):
        col_head = 'job_week_' + str(i)
        row[col_head] = 1
Can you please help explain why the row[col_head] = 1 line is neither adding a column nor adding a value to that column for that row?
For example, if:
row A has date range 1,2,3
row B has date range 2,3
row C has date range 3,4,5
then ideally I would like to end up with:
row A has 0 alternative products in week 1,
          1 alternative product in week 2,
          2 alternative products in week 3
row B has 1 alternative product in week 2,
          2 alternative products in week 3
etc.
You can't mutate the df using row here to add a new column; you have to refer to the original df, for example using .loc or .at. Example:
In [29]:
df = pd.DataFrame(columns=list('abc'), data = np.random.randn(5,3))
df
Out[29]:
a b c
0 -1.525011 0.778190 -1.010391
1 0.619824 0.790439 -0.692568
2 1.272323 1.620728 0.192169
3 0.193523 0.070921 1.067544
4 0.057110 -1.007442 1.706704
In [30]:
for index, row in df.iterrows():
    df.loc[index, 'd'] = np.random.randint(0, 10)
df
Out[30]:
a b c d
0 -1.525011 0.778190 -1.010391 9
1 0.619824 0.790439 -0.692568 9
2 1.272323 1.620728 0.192169 1
3 0.193523 0.070921 1.067544 0
4 0.057110 -1.007442 1.706704 9
You can modify existing rows:
In [31]:
# reset the df by slicing
df = df[list('abc')]
for index, row in df.iterrows():
    row['b'] = np.random.randint(0, 10)
df
Out[31]:
a b c
0 -1.525011 8 -1.010391
1 0.619824 2 -0.692568
2 1.272323 8 0.192169
3 0.193523 2 1.067544
4 0.057110 3 1.706704
But adding a new column using row won't work:
In [35]:
df = df[list('abc')]
for index, row in df.iterrows():
    row['d'] = np.random.randint(0, 10)
df
Out[35]:
a b c
0 -1.525011 8 -1.010391
1 0.619824 2 -0.692568
2 1.272323 8 0.192169
3 0.193523 2 1.067544
4 0.057110 3 1.706704
Instead of row[col_head] = 1, please try the line below:
df.at[index, col_head] = 1
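Putting that together with the question's data, a minimal sketch of the loop writing into the original frame (using .loc, as in the example above; the df.at form should behave the same way):

import pandas as pd

mydata = [{'subid': 'A', 'Close_date_wk': 25, 'week_diff': 3},
          {'subid': 'B', 'Close_date_wk': 26, 'week_diff': 2},
          {'subid': 'C', 'Close_date_wk': 27, 'week_diff': 2}]
df = pd.DataFrame(mydata)

for index, row in df.iterrows():
    max_range = row['Close_date_wk']
    min_range = int(row['Close_date_wk'] - row['week_diff'])
    for i in range(min_range, max_range):
        # writing through df (not row) makes the new job_week_* columns stick
        df.loc[index, 'job_week_' + str(i)] = 1
print(df)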