If I have the following, how do I make pd.DataFrame() turn this array into a dataframe with two columns? What's the most efficient way? My current approach involves copying each position into its own Series and making dataframes out of them.
From this:
([[u'294 (24%) L', u'294 (26%) R'],
[u'981 (71%) L', u'981 (82%) R'],])
to
x y
294 294
981 981
rather than
x
[u'294 (24%) L', u'294 (26%) R']
My current approach (I'm looking for something more efficient):
numL = pd.Series(numlist).map(lambda x: x[0])
numR = pd.Series(numlist).map(lambda x: x[1])
nL = pd.DataFrame(numL, columns=['left_num'])
nR = pd.DataFrame(numR, columns=['right_num'])
nLR = nL.join(nR)
nLR
UPDATE:
I noticed that my error simply comes down to what pd.DataFrame() does with a list versus a Series. When you create a dataframe out of a Series of lists, each list's items stay merged in the same column; when you create it out of a plain list of lists, each inner list becomes a row split across columns. That solved my problem in the most efficient way.
data = [[u'294 (24%) L', u'294 (26%) R'], [u'981 (71%) L', u'981 (82%) R'],]
clean_data = [[int(item.split()[0]) for item in row] for row in data]
# clean_data: [[294, 294], [981, 981]]
pd.DataFrame(clean_data, columns=list('xy'))
# x y
# 0 294 294
# 1 981 981
#
# [2 rows x 2 columns]
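For comparison, the parsing can also stay inside pandas: build the frame from the raw data first, then use the vectorized .str accessor on each column. A minimal sketch on the same data as above:

raw = pd.DataFrame(data, columns=list('xy'))
# split each cell on whitespace and keep the leading number
clean = raw.apply(lambda col: col.str.split().str[0].astype(int))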
I have been playing around with a dataset about football, and need to group my ['position'] column by its values and assign each group to a new variable.
First, here is my dataframe:
df = player_stats[['id', 'player', 'date', 'team_name', 'fixture_name', 'position', 'shots',
                   'shots_on_target', 'xg', 'xa', 'attacking_pen_area_touches',
                   'penalty_area_entry_passes', 'carries_total_distance', 'total_distance_passed',
                   'aerial_sucess_perc', 'passes_attempted', 'passes_completed',
                   'short_pass_accuracy_perc', 'medium_pass_accuracy_perc',
                   'long_pass_accuracy_perc', 'final_third_entry_passes', 'ball_recoveries',
                   'dribbles_completed', 'dribbles_attempted', 'touches',
                   'tackles_won', 'tackles_attempted']]
I have split my ['position'] column, as it had multiple string values, and added the result to a new column called ['position_new']. Its value counts are:
position_new
AM 277
CB 938
CM 534
DF 7
DM 604
FW 766
GK 389
LB 296
LM 149
LW 284
MF 5
RB 300
RM 160
RW 323
WB 275
What I need is basically to have 3 different variables that all have the same columns, but are separated by the value in position_new.
So: my variable att needs to have all the columns of df, but only the rows where position_new is equal to FW, LW, or RW.
I know how to hardcode it, but cannot get my head around how to transform it into a for loop.
Here is my loop:
for col in df[29:30]:
    if df.loc[df['position_new'] == 'FW', 'LW', 'RW']:
        att = df
    elif df.loc[df['position_new'] == 'AM', 'CM', 'DM', 'LM', 'RM']:
        mid = df
    else:
        defender = df
Thank you!
I'm not sure exactly what you are trying to do, but it looks like you want to split all positions into attackers, midfielders, and defenders, based on their two-letter abbreviation, with each group in its own variable.
What you are doing is not optimal, because it won't work on any generic dataframe with this type of info.
But if you want to do it for just this case, you are simply missing the right comparison in your loop: == can only test a single value, so you need isin, and the filtered frame should be assigned rather than used as an if condition. Try:
att = df.loc[df['position_new'].isin(['FW', 'LW', 'RW'])]
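If you want all three variables without hardcoding each branch, a dict of position codes keeps it to one pass over df. A sketch, where the split of codes into the three groups is my assumption based on the counts above:

groups = {'att': ['FW', 'LW', 'RW'],
          'mid': ['AM', 'CM', 'DM', 'LM', 'RM'],
          'defender': ['CB', 'DF', 'LB', 'RB', 'WB']}  # assumed grouping
# one subset of df per group, each keeping all of df's columns
subsets = {name: df.loc[df['position_new'].isin(codes)]
           for name, codes in groups.items()}
att, mid, defender = subsets['att'], subsets['mid'], subsets['defender']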
I created 2 DataFrames with shapes [6,2] and [3,2]. I want to multiply the 2 DataFrames to get a [6,3] matrix. I am using the loop below, but it is giving me a return self._getitem_column(key) error. Below is an example.
df1 = pd.DataFrame({0: [1, 2, 3, 4, 5, 6],
                    1: [23, 24, 25, 26, 27, 28]})
df2 = pd.DataFrame({0: [1, 2, 3],
                    1: [12, 13, 14]})
for j in range(len(df2)):
    for i in range(len(df1)):
        df3 = (df1[i, 2] * df2[j, 2])
# expected result:
#        1    2    3
# 1    276  299  322
# 2    288  312  336
# 3    300  325  350
# 4    312  338  364
# 5    324  351  378
# 6    336  364  392
I am trying to replicate what I did in an Excel sheet.
It might be easier to leave it out of dataframes altogether, unless you already have the information in dataframes (in which case, write back and I'll show you how to do that).
For now, this might be easier:
list1 = list(range(23, 29))  # note that you have to go one higher to include 28
list2 = list(range(12, 15))  # same deal
outputlist = []
for i in list1:
    for j in list2:
        outputlist.append(i * j)
import numpy as np
outputlist = np.array(outputlist).reshape(len(list1), len(list2))
import pandas as pd
df3 = pd.DataFrame(outputlist)
EDIT: Ok, this might get you where you need to go, then:
list3 = []
for i in range(len(df1)):
    for j in range(len(df2)):
        # rows are 0-based with the default index; column 1 holds the values
        list3.append(df1.loc[i, 1] * df2.loc[j, 1])
import numpy as np
list3 = np.array(list3).reshape(len(df1), len(df2))
df3 = pd.DataFrame(list3)
EDIT AGAIN: Try this! Just make sure you replace "thenameofthecolumnindf1" with the actual name of the column in df1 that you're interested in, etc.
import numpy as np
list3 = []
for i in df1[thenameofthecolumnindf1]:
    for j in df2[thenameofthecolumnindf2]:
        list3.append(i * j)
list3 = np.array(list3).reshape(len(df1), len(df2))
df3 = pd.DataFrame(list3)
The math for this simply won't work as matrix multiplication: the number of columns in the first matrix (2) would have to equal the number of rows in the second matrix (3), and here it doesn't. The key indexing error most likely comes from that mismatched row/column lookup.
What you actually want is the outer product of the two value columns, which pairs every row of df1 with every row of df2, so both lengths have to be accounted for when building the result, not just treated as a plain two-matrix multiplication.
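That outer product is available directly in numpy, with no loops. A sketch, assuming df1 and df2 are built as in the question with ids in column 0 and values in column 1:

import numpy as np
import pandas as pd

# np.outer multiplies every entry of the first vector by every entry of the second
vals = np.outer(df1[1], df2[1])  # shape (6, 3)
df3 = pd.DataFrame(vals, index=df1[0], columns=df2[0])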
I have a pandas dataframe with a format exactly like the one in this question and I'm trying to achieve the same result. In my case, I am calculating the fuzz ratio between each row's index and its corresponding column name.
If I try this code (based on the answer to the linked question)
def get_similarities(x):
    return x.index + x.name

test_df = test_df.apply(get_similarities)
the concatenation of the row index and col name happens cell-wise, just as intended. Running type(test_df) returns pandas.core.frame.DataFrame, as expected.
However, if I adapt the code to my scenario like so
def get_similarities(x):
    return fuzz.partial_ratio(x.index, x.name)

test_df = test_df.apply(get_similarities)
it doesn't work: instead of a dataframe, I get back a Series (the return type of that function is an int).
I don't understand why the two samples behave differently, nor how to fix my code so it returns a dataframe with the fuzz ratio, for each cell, between that cell's row index and its column name.
What about the following approach?
Assuming that we have two lists of strings:
In [245]: set1
Out[245]: ['car', 'bike', 'sidewalk', 'eatery']
In [246]: set2
Out[246]: ['walking', 'caring', 'biking', 'eating']
Solution:
In [247]: from itertools import product
In [248]: res = np.array([fuzz.partial_ratio(*tup) for tup in product(set1, set2)])
In [249]: res = pd.DataFrame(res.reshape(len(set1), -1), index=set1, columns=set2)
In [250]: res
Out[250]:
walking caring biking eating
car 33 100 0 33
bike 25 25 75 25
sidewalk 73 20 22 36
eatery 17 33 0 50
There is a way to accomplish this via DataFrame.apply with some row manipulations.
Assuming the `test_df` is as follows:
In [73]: test_df
Out[73]:
walking caring biking eating
car carwalking carcaring carbiking careating
bike bikewalking bikecaring bikebiking bikeeating
sidewalk sidewalkwalking sidewalkcaring sidewalkbiking sidewalkeating
eatery eaterywalking eaterycaring eaterybiking eateryeating
In [74]: def get_ratio(row):
...: return row.index.to_series().apply(lambda x: fuzz.partial_ratio(x,
...: row.name))
...:
In [75]: test_df.apply(get_ratio)
Out[75]:
walking caring biking eating
car 33 100 0 33
bike 25 25 75 25
sidewalk 73 20 22 36
eatery 17 33 0 50
It took some digging, but I figured it out. The problem comes from the fact that DataFrame.apply is either applied column-wise or row-wise, not cell by cell. So your get_similarities function is actually getting access to an entire row or column of data at a time! By default it gets the entire column -- so to solve your problem, you just have to make a get_similarities function that returns a list where you manually call fuzz.partial_ratio on each element, like this:
import pandas as pd
from fuzzywuzzy import fuzz

def get_similarities(x):
    l = []
    for rname in x.index:
        print("Getting ratio for %s and %s" % (rname, x.name))
        score = fuzz.partial_ratio(rname, x.name)
        print("Score %s" % score)
        l.append(score)
    print(len(l))
    print()
    return l

a = pd.DataFrame([[1, 2], [3, 4]], index=['apple', 'banana'], columns=['aple', 'banada'])
c = a.apply(get_similarities, axis=0)
print(c)
print(type(c))
I left my print statements in there so you can see for yourself what the DataFrame.apply call is doing -- that's when it clicked for me.
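Once that clicks, the debug prints can go. A condensed sketch of the same idea as a list comprehension, using the asker's test_df and fuzz from fuzzywuzzy:

def get_similarities(x):
    # one partial-ratio score per row label, against this column's name
    return [fuzz.partial_ratio(rname, x.name) for rname in x.index]

test_df = test_df.apply(get_similarities)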
I have a dataframe that looks like this:
dic = {'A': ['PINCO', 'PALLO', 'CAPPO', 'ALLOP'],
       'B': ['KILO', 'KULO', 'FIGA', 'GAGO'],
       'C': [['CAL', 'GOL', 'TOA', 'PIA', 'STO'],
             ['LOL', 'DAL', 'ERS', 'BUS', 'TIS'],
             ['PIS', 'IPS', 'ZSP', 'YAS', 'TUS'],
             []]}
df1 = pd.DataFrame(dic)
My goal is to insert, for each row, the element of A as the first item of the list contained in column C, and at the same time to set the element of B as the last item of that list.
I was able to achieve my goal by using the following lines of code:
for index, row in df1.iterrows():
    try:
        row['C'].insert(0, row['A'])
        row['C'].append(row['B'])
    except:
        pass
Is there a more elegant and efficient way to achieve my goal, maybe using some Pandas function? I would like to avoid for loops if possible.
Inspired by Ted's solution but without modifying columns A and B:
def tolist(value):
    return [value]

df1.C = df1.A.map(tolist) + df1.C + df1.B.map(tolist)
Using apply, you would not write an explicit loop:
def modify(row):
    row['C'][:] = [row['A']] + row['C'] + [row['B']]

df1.apply(modify, axis=1)
A good general rule is to avoid using apply with axis=1 if at all possible, as iterating over the rows is expensive.
You can convert each element in columns A and B to a list with map and then sum across the rows, selecting the columns in A, C, B order so the lists concatenate front-to-back correctly:
df1['A'] = df1.A.map(lambda x: [x])
df1['B'] = df1.B.map(lambda x: [x])
df1[['A', 'C', 'B']].sum(1)
CPU times: user 3.07 s, sys: 207 ms, total: 3.27 s
The alternative is to use apply with axis=1, which ran 15 times slower on my computer on 1 million rows:
df1.apply(lambda x: [x['A']] + x['C'] + [x['B']], 1)
CPU times: user 48.5 s, sys: 119 ms, total: 48.6 s
Use a list comprehension with df1.values.tolist()
pd.Series([[r[0]] + r[2] + [r[1]] for r in df1.values.tolist()], df1.index)
0 [PINCO, CAL, GOL, TOA, PIA, STO, KILO]
1 [PALLO, LOL, DAL, ERS, BUS, TIS, KULO]
2 [CAPPO, PIS, IPS, ZSP, YAS, TUS, FIGA]
3 [ALLOP, GAGO]
dtype: object
I have a dataset with weights for each observation and I want to prepare weighted summaries using groupby, but am rusty as to how best to do this. I think it calls for a custom aggregation function. My issue is how to properly deal with group-level quantities rather than item-level ones. Perhaps it is best to do this in steps rather than in one go.
In pseudo-code, I am looking for:
# first, calculate the weighted value
for each row:
    weighted jobs = weight * jobs

# then, for each city, sum the weighted jobs and divide by the sum of weights
for each city:
    sum(weighted jobs) / sum(weight)
I am not sure how to work the "for each city" part into a custom aggregate function and get access to group-level summaries.
Mock data:
import pandas as pd
import numpy as np
np.random.seed(43)
## prep mock data
N = 100
industry = ['utilities','sales','real estate','finance']
city = ['sf','san mateo','oakland']
weight = np.random.randint(low=5,high=40,size=N)
jobs = np.random.randint(low=1,high=20,size=N)
ind = np.random.choice(industry, N)
cty = np.random.choice(city, N)
df_city = pd.DataFrame({'industry': ind, 'city': cty, 'weight': weight, 'jobs': jobs})
Simply multiply the two columns:
In [11]: df_city['weighted_jobs'] = df_city['weight'] * df_city['jobs']
Now you can groupby the city (and take the sum):
In [12]: df_city_sums = df_city.groupby('city').sum()
In [13]: df_city_sums
Out[13]:
jobs weight weighted_jobs
city
oakland 362 690 7958
san mateo 367 1017 9026
sf 253 638 6209
[3 rows x 3 columns]
Now you can divide the two sums to get the desired result (the sum of weighted jobs over the sum of weights, as in the pseudo-code):
In [14]: df_city_sums['weighted_jobs'] / df_city_sums['weight']
Out[14]:
city
oakland      11.533333
san mateo     8.875123
sf            9.731975
dtype: float64
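The same weighted average is available in one step with numpy's np.average, which accepts a weights keyword; a sketch using the mock data's column names:

In [15]: df_city.groupby('city').apply(lambda g: np.average(g['jobs'], weights=g['weight']))

which reproduces the Series above without building the intermediate weighted_jobs column.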