I am working with the IPL dataset from Kaggle (https://www.kaggle.com/manasgarg/ipl).
I want to sum up the runs scored by each pair of batsmen, and I have prepared my data for that.
When I run a GROUPBY on the dataframe columns (batsman and non_striker), it produces two combinations for the same pair,
like (a, b) and (b, a), whereas I want both to be treated as the same pair.
I can't drop any more rows at this point.
import pandas as pd
df = pd.read_csv("C:\\Users\\Yash\\AppData\\Local\\Programs\\Python\\Python36-32\\Machine Learning\\IPL\\deliveries.csv")
df = df[(df["is_super_over"] != 1)]
df["pri_key"] = df["match_id"].astype(str) + "-" + df["inning"].astype(str)
openners = df[(df["over"] == 1) & (df["ball"] == 1)]
openners = openners[["pri_key", "batsman", "non_striker"]]
openners = openners.rename(columns = {"batsman":"batter1", "non_striker":"batter2"})
df = pd.merge(df, openners, on="pri_key")
df = df[["batsman", "non_striker", "batter1", "batter2", "batsman_runs"]]
df = df[((df["batsman"] == df["batter1"]) | (df["batsman"] == df["batter2"]))
& ((df["non_striker"] == df["batter1"]) | (df["non_striker"] == df["batter2"]))]
df1 = df.groupby(["batsman" , "non_striker"], group_keys = False)["batsman_runs"].agg("sum")
df1.nlargest(10)
Result:
batsman non_striker
DA Warner S Dhawan 1294
S Dhawan DA Warner 823
RV Uthappa G Gambhir 781
DR Smith BB McCullum 684
CH Gayle V Kohli 674
MEK Hussey M Vijay 666
M Vijay MEK Hussey 629
G Gambhir RV Uthappa 611
BB McCullum DR Smith 593
CH Gayle TM Dilshan 537
I want each pair to be counted only once.
For those who don't follow cricket, here is a simplified example.
I have a dataframe:
batsman non_striker runs
a b 2
a b 3
b a 1
c d 6
d c 1
d c 4
b a 3
e f 1
f e 2
f e 6
df1 = df.groupby(["batsman", "non_striker"], group_keys=False)["runs"].agg("sum")
df1.nlargest(30)
output:
batsman non_striker runs
a b 5
b a 4
c d 6
d c 5
e f 1
f e 8
expected output:
batsman non_striker runs
a b 9
c d 11
e f 9
What should I do? Please advise.
You can sort the batsman and non_striker values within each row and then group the data:
df[['batsman', 'non_striker']] = df[['batsman', 'non_striker']].apply(sorted, axis=1)
df.groupby(['batsman', 'non_striker']).batsman_runs.sum().nlargest(10)
Edit: You can also use numpy to sort the columns, which will be faster than applying Python's sorted through pandas:
df[['batsman', 'non_striker']] = np.sort(df[['batsman', 'non_striker']],1)
df.groupby(['batsman', 'non_striker'], sort = False).batsman_runs.sum().nlargest(10).sort_index()
Either way, you will get:
batsman non_striker
CH Gayle V Kohli 2650
DA Warner S Dhawan 2242
AB de Villiers V Kohli 2135
G Gambhir RV Uthappa 1795
M Vijay MEK Hussey 1302
BB McCullum DR Smith 1277
KA Pollard RG Sharma 1220
MEK Hussey SK Raina 1129
AT Rayudu RG Sharma 1121
AM Rahane SR Watson 1118
Create a new DataFrame using np.sort, then groupby and sum.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.sort(df[['batsman', 'non_striker']].values, 1),
                   index=df.index,
                   columns=['player_1', 'player_2']).assign(runs=df.runs)
df1.groupby(['player_1', 'player_2']).runs.sum()
Output:
player_1 player_2
a b 9
c d 11
e f 9
Name: runs, dtype: int64
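If you prefer the result as a flat three-column table, like the expected output in the question, a small optional follow-up:
# optional: turn the summed Series back into a three-column frame
pair_totals = df1.groupby(['player_1', 'player_2']).runs.sum().reset_index()
print(pair_totals)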
I hope I understand you correctly.
What you can do is always put the smaller value in one column and the greater value in another.
import pandas as pd
import numpy as np
# generate example
values = ['a', 'b' , 'c', 'd', 'e', 'f', 'g']
df = pd.DataFrame()
df['batsman'] = np.random.choice(values, size=10)
df['non_striker'] = np.random.choice(values, size=10)
# column evaluation
df['smaller'] = df['batsman'].where(df['batsman'] < df['non_striker'], df['non_striker'])
df['greater'] = df['batsman'].where(df['batsman'] > df['non_striker'], df['non_striker'])
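From there you can aggregate on the ordered pair, since (a, b) and (b, a) now produce the same key. A minimal sketch, where the batsman_runs column is invented here purely for the demo:
# invented demo column, purely for illustration
df['batsman_runs'] = np.random.randint(0, 7, size=10)
# (smaller, greater) is identical for (a, b) and (b, a)
print(df.groupby(['smaller', 'greater'])['batsman_runs'].sum().nlargest(10))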
I'm a fairly new Python/pandas user, and I have been tasked with cleaning up a roughly 5,000-row CSV of records and then migrating the records into a SQL database.
The contents are individual people's personal information (which prevents me from posting it for reference) and their 'seat' occupation information, but the file has been... mismanaged... over the years, and has ended up looking like this:
#Sect1 Sect2 Sect3 Seat#
L/L/L/L 320/320/319/321 D/C/D/C 1-2/1-2/1-2/1-2
V 602 - 1-6
T 101 F 1&3
R 158 - 3* 4
U 818 4 Ds9R
That individual's personal information sits in four additional columns to the left, not shown here.
In reality, even just the top row from the selection above should actually be:
#Sect1 Sect2 Sect3 Seat#
L 320 D 1
L 320 D 2
L 320 C 1
L 320 C 2
L 319 D 1
L 319 D 2
L 321 C 1
L 321 C 2
with the '-' meaning 'through', not 'and' (for example, the second row in my original example would be Seat# 1 through Seat# 6, not Seat# 1 and 6).
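In Python terms, I mean something like this for a single range string:
# e.g. "1-6" should expand to seats 1 through 6
start, end = (int(n) for n in "1-6".split("-"))
seats = [str(n) for n in range(start, end + 1)]   # ['1', '2', '3', '4', '5', '6']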
I should also note that there's no unique ID/Index for these individuals, and it's purely based on First/Last name.
I've been attempting to break some of this up, and have had limited success with
df1 = df1.drop('Sect2', axis=1).join(df1['Sect2'].str.split('/', expand=True).stack().reset_index(level=1, drop=True).rename('Sect2'))
but this eventually ends up creating erroneous records such as
#Sect1 Sect2 Sect3 Seat#
L 319 C 1
In the end, my question is: is using a script to clean this data even possible? I'm rapidly running out of ideas and really don't want to do this manually, but I also don't want to waste any more time trying to script this if it's a pointless endeavor.
The code below should address the two issues described in your post. It should be selective enough to avoid misinterpreting rows, but some manual curation will likely still be necessary.
The basic concept is to iterate row by row, processing as much as possible before proceeding. The first priority is to split rows containing the character "/"; if none is found, a range value written with "-" is interpreted. The while loop permits gradual refinement: for example, the code converts 1-3 into 1/2/3, then re-reads the same row and splits it into 3 separate rows.
import re
import pandas as pd

# build demo dataframe
d = {"Sect1": ["L/L/L/L", "V", "T", "R", "U"],
     "Sect2": ["320/320/319/321", "602", "101", "158", "818"],
     "Sect3": ["D/C/D/C", "-", "F", "-", "4"],
     "Seat#": ["1-2/1-2/1-2/1-2", "1-6", "1&3", "3* 4", "Ds9R"]}
df = pd.DataFrame(data=d)

index = 0
while index < len(df):
    len_df = len(df)
    row_li = [df.iloc[index][x] for x in df.head()]
    # extract separated values
    sep_li = [x.split("/") for x in row_li]
    sep_min, sep_max = len(min(sep_li, key=lambda x: len(x))), len(max(sep_li, key=lambda x: len(x)))
    # extract range values
    num_range_li = [re.findall(r"^\d+\-\d+$|$", x)[0].split("-") for x in row_li]
    num_range_max = len(max(num_range_li, key=lambda x: len(x)))
    # create temporary dictionary representing current row
    r = {}
    for i, head in enumerate(df.head()):
        r[head] = row_li[i]
    # separated values treatment -> split into distinct rows
    if sep_min > 1 and sep_min == sep_max:
        for i, head in enumerate(df.head()):
            r[head] = sep_li[i]
        row_df = pd.DataFrame(data=r)
        df = df.append(row_df, ignore_index=True)
    # range values treatment -> convert into separated values
    elif num_range_max > 1:
        for part in (1, 2):
            for idx, header in enumerate(df.head()):
                if len(num_range_li[idx]) > 1:
                    split_li = [str(x) for x in range(int(num_range_li[idx][0]), int(num_range_li[idx][1])+1)]
                    # convert range values to separated values
                    if part == 1:
                        r[header] = "/".join(split_li)
                    # multiply other values
                    else:
                        for i, head in enumerate(df.head()):
                            if i != idx:
                                r[head] = "/".join([r[head] for x in range(len(split_li))])
        row_df = pd.DataFrame(data=r, index=[0])
        df = df.append(row_df, ignore_index=True)
    # if no new rows are added, increment
    if len(df) == len_df:
        index += 1
    # if rows are added, drop current row
    else:
        df = df.iloc[:index].append(df.iloc[index+1:])
print(df)
Output
Sect1 Sect2 Sect3 Seat#
0 T 101 F 1&3
1 R 158 - 3* 4
2 U 818 4 Ds9R
4 V 602 - 1
5 V 602 - 2
6 V 602 - 3
7 V 602 - 4
8 V 602 - 5
9 V 602 - 6
10 L 320 D 1
11 L 320 D 2
12 L 320 C 1
13 L 320 C 2
14 L 319 D 1
15 L 319 D 2
16 L 321 C 1
17 L 321 C 2
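If the gaps left in the row index bother you (note that 3 is missing above), an optional final reset renumbers the rows:
# optional: renumber rows once all splitting is done
df = df.reset_index(drop=True)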
I have a dataframe with rows that describe a movement of value between nodes in a system. This dataframe looks like this:
index from_node to_node value invoice_number
0 A E 10 a
1 B F 20 a
2 C G 40 c
3 D H 60 d
4 E I 35 c
5 X D 43 d
6 Y F 50 d
7 E H 70 a
8 B A 55 b
9 X B 33 a
I am looking to find "swaps" in the invoice history. A swap is defined as a node that both receives a value from and sends a value to another node within the same invoice number. In the above dataset there are two swaps in invoice "a" and one swap in invoice "d" (the "sent to" and "received from" nodes could be the same node in a given row):
index node sent_to sent_value received_from received_value invoice_number
0 B F 20 X 33 a
1 E H 70 A 10 a
2 D H 60 X 43 d
I have solved this problem by iterating over all of the unique invoice numbers in the dataset and then iterating over each row within that invoice number to find pairs:
import pandas as pd
df = pd.DataFrame({
    'from_node':['A','B','C','D','E','X','Y','E','B','X'],
    'to_node':['E','F','G','H','I','D','F','H','A','B'],
    'value':[10,20,40,60,35,43,50,70,55,33],
    'invoice_number':['a','a','c','d','c','d','d','a','b','a'],
})
invoices = set(df.invoice_number)
list_df_swap = []
for invoice in invoices:
    df_inv = df[df.invoice_number.isin([invoice])]
    for r in df_inv.itertuples():
        df_is_swap = df_inv[df_inv.to_node.isin([r.from_node])]
        if len(df_is_swap.index) == 1:
            swap = {'node': r.from_node,
                    'sent_to': r.to_node,
                    'sent_value': r.value,
                    'received_from': df_is_swap.iloc[0]['from_node'],
                    'received_value': df_is_swap.iloc[0]['value'],
                    'invoice_number': r.invoice_number}
            list_df_swap.append(pd.DataFrame(swap, index=[0]))
df_swap = pd.concat(list_df_swap, ignore_index=True)
The total dataset consists of several hundred million rows, and this approach is not very efficient. Is there a way to solve this problem using some kind of vectorised solution, or another method that would speed up the execution time?
Calculate all possible swaps, regardless of the invoice number:
swaps = df.merge(df, left_on='from_node', right_on='to_node')
Then select those that have the same invoice number:
columns = ['from_node_x', 'to_node_x', 'value_x', 'from_node_y', 'value_y',
'invoice_number_x']
swaps[swaps.invoice_number_x == swaps.invoice_number_y][columns]
# from_node_x to_node_x value_x from_node_y value_y invoice_number_x
#1 B F 20 X 33 a
#3 D H 60 X 43 d
#5 E H 70 A 10 a
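A possible refinement (not shown above): make the invoice number part of the join keys, so that only same-invoice pairs are produced by the merge in the first place. A sketch, assuming the df from the question:
# join on node and invoice number together, so cross-invoice pairs are never built
swaps = df.merge(df,
                 left_on=['from_node', 'invoice_number'],
                 right_on=['to_node', 'invoice_number'])
print(swaps[['from_node_x', 'to_node_x', 'value_x',
             'from_node_y', 'value_y', 'invoice_number']])
This avoids materialising cross-invoice pairs, which matters when the frame has hundreds of millions of rows.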
Sorry, I should have deleted the old question and created a new one.
I have a dataframe with two columns. The df looks as follows:
Word Tag
0 Asam O
1 instruksi O
2 - O
3 instruksi X
4 bahasa Y
5 Instruksi P
6 - O
7 instruksi O
8 sebuah Q
9 satuan K
10 - L
11 satuan O
12 meja W
13 Tiap Q
14 - O
15 tiap O
16 karakter P
17 - O
18 ke O
19 - O
20 karakter O
I'd like to merge the rows containing a dash (-) into a single row, so the output should be the following:
Word Tag
0 Asam O
1 instruksi-instruksi O
2 bahasa Y
3 Instruksi-instruksi P
4 sebuah Q
5 satuan-satuan K
6 meja W
7 Tiap-tiap Q
8 karakter-ke-karakter P
Any ideas? Thanks in advance. I have tried the answer from Jacob K; it works, but I then found that my dataset can contain more than one - row in between. I have updated the expected output accordingly, see index number 8.
Solution from Jacob K:
# Import packages
import pandas as pd
import numpy as np
# Get 'Word' and 'Tag' columns as numpy arrays (for easy indexing)
words = df.Word.to_numpy()
tags = df.Tag.to_numpy()
# Create empty lists for new colums in output dataframe
newWords = []
newTags = []
# Use while (rather than for loop) since index i can change dynamically
i = 0 # To not cause any issues with i-1 index
while (i < words.shape[0] - 1):
    if (words[i] == "-"):
        # Concatenate the strings above and below the "-"
        newWords.append(words[i-1] + "-" + words[i+1])
        newTags.append(tags[i-1])
        i += 2  # Don't repeat any concatenated values
    else:
        if (words[i+1] != "-"):
            # If there is no "-" next, append the regular word and tag values
            newWords.append(words[i])
            newTags.append(tags[i])
        i += 1  # Increment normally
# Create output dataframe output_df
d2 = {'Word': newWords, 'Tag': newTags}
output_df = pd.DataFrame(data=d2)
My approach with GroupBy.agg:
#df['Word'] = df['Word'].str.replace(' ', '') #if necessary
blocks = df['Word'].shift().ne('-').mul(df['Word'].ne('-')).cumsum()
new_df = df.groupby(blocks, as_index=False).agg({'Word' : ''.join, 'Tag' : 'first'})
print(new_df)
Output
Word Tag
0 Asam O
1 instruksi-instruksi O
2 bahasa Y
3 Instruksi-instruksi P
4 sebuah Q
5 satuan-satuan K
6 meja W
7 Tiap-tiap Q
8 karakter-ke-karakter P
Blocks (Detail)
print(blocks)
0 1
1 2
2 2
3 2
4 3
5 4
6 4
7 4
8 5
9 6
10 6
11 6
12 7
13 8
14 8
15 8
16 9
17 9
18 9
19 9
20 9
Name: Word, dtype: int64
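As a side note (an expanded view, not part of the original one-liner), the blocks expression is just two boolean masks combined and accumulated: a new group starts only where neither the current word nor the previous one is '-'.
# equivalent, spelled out step by step
prev_not_dash = df['Word'].shift().ne('-')   # previous row is not '-'
curr_not_dash = df['Word'].ne('-')           # current row is not '-'
blocks = (prev_not_dash & curr_not_dash).cumsum()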
This is a loop version:
import pandas as pd
# import data
DF = pd.read_csv("table.csv")
# creates a new DF
newDF = pd.DataFrame()
# iterate through rows
for i in range(len(DF)-1):
    # prepare the previous row index (handles the special case of the first row)
    prev = i-1
    if (prev < 0):
        prev = 0
    # copy the row if neither it, the previous row, nor the next row is '-'
    if (DF.loc[i+1, 'Word'] != '-'):
        if (DF.loc[i, 'Word'] != '-' and DF.loc[prev, 'Word'] != '-'):
            newDF = newDF.append(DF.loc[i, :])
    # unite the three rows if the middle one is '-'
    else:
        row = {'Tag': [DF.loc[i, 'Tag']], 'Word': [DF.loc[i, 'Word']+DF.loc[i+1, 'Word']+DF.loc[i+2, 'Word']]}
        newDF = newDF.append(pd.DataFrame(row))
I am wondering about the best way to slice a MultiIndex DataFrame using another index, where the other index is a subset of the main MultiIndex.
import numpy as np
import pandas as pd

np.random.seed(1)
dict_data_russian = {'alpha':[1,2,3,4,5,6,7,8,9],'beta':['a','b','c','d','e','f','g','h','i'],'gamma':['r','s','t','u','v','w','x','y','z'],'value_r': np.random.rand(9)}
dict_data_doll = {'beta':['d','e','f'],'gamma':['u','v','w'],'dont_care': list('PQR')}
df_russian = pd.DataFrame(data=dict_data_russian)
df_russian.set_index(['alpha','beta','gamma'],inplace=True)
df_doll = pd.DataFrame(data=dict_data_doll)
df_doll.set_index(['beta','gamma'],inplace=True)
print(df_russian)
print(df_doll.head())
Which yields:
value_r
alpha beta gamma
1 a r 0.4170
2 b s 0.7203
3 c t 0.0001
4 d u 0.3023
5 e v 0.1468
6 f w 0.0923
7 g x 0.1863
8 h y 0.3456
9 i z 0.3968
dont_care
beta gamma
d u P
e v Q
f w R
How best can I use the index of df_doll to slice df_russian on levels beta and gamma, in order to obtain the following output?
value_r
alpha beta gamma
4 d u 0.3023
5 e v 0.1468
6 f w 0.0923
You can do
In [1131]: df_russian[df_russian.reset_index(0).index.isin(df_doll.index)]
Out[1131]:
alpha beta gamma value_r
4 d u 0.302333
5 e v 0.146756
6 f w 0.092339
This uses a boolean mask derived by resetting the outer level of the main index and checking, for each row, whether the remaining levels appear in the index of df_doll.
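The same idea can be written without resetting the index, by dropping the alpha level from the index itself; an equivalent sketch (not from the answer above):
# mask rows whose (beta, gamma) appears in df_doll's index
mask = df_russian.index.droplevel('alpha').isin(df_doll.index)
df_russian[mask]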
You could strip off the index, join the frames, then add back the index:
result = df_doll.reset_index().merge(df_russian.reset_index(), on=['beta', 'gamma'], how='inner')
result.set_index(['alpha', 'beta', 'gamma'], inplace=True)
result.drop(columns='dont_care')
I have a 2-d dictionary in the following format:
myDict = {('a','b'):10, ('a','c'):20, ('a','d'):30, ('b','c'):40, ('b','d'):50,('c','d'):60}
How can I write this into a tab-delimited file so that it contains the following? While filling, a tuple (x, y) should fill two locations: (x, y) and (y, x), and (x, x) is always 0.
The output would be:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
PS: If the dictionary can somehow be converted into a dataframe (using pandas), then it can easily be written to a file with a pandas function.
You can do this with the lesser-known align method and a little unstack magic:
In [122]: s = Series(myDict, index=MultiIndex.from_tuples(myDict))
In [123]: df = s.unstack()
In [124]: lhs, rhs = df.align(df.T)
In [125]: res = lhs.add(rhs, fill_value=0).fillna(0)
In [126]: res
Out[126]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
Finally, to write this to a CSV file, use the to_csv method:
In [128]: res.to_csv('res.csv', sep='\t')
In [129]: !cat res.csv
a b c d
a 0.0 10.0 20.0 30.0
b 10.0 0.0 40.0 50.0
c 20.0 40.0 0.0 60.0
d 30.0 50.0 60.0 0.0
If you want to keep things as integers, cast using DataFrame.astype(), like so:
In [137]: res.astype(int).to_csv('res.csv', sep='\t')
In [138]: !cat res.csv
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
(It was cast to float because of the intermediate step of filling in NaN values where indices from one frame were missing from the other.)
@Dan Allan's answer using combine_first is nice:
In [130]: df.combine_first(df.T).fillna(0)
Out[130]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
Timings:
In [134]: timeit df.combine_first(df.T).fillna(0)
100 loops, best of 3: 2.01 ms per loop
In [135]: timeit lhs, rhs = df.align(df.T); res = lhs.add(rhs, fill_value=0).fillna(0)
1000 loops, best of 3: 1.27 ms per loop
Those timings are probably a bit polluted by construction costs, so what do things look like with some really huge frames?
In [143]: df = DataFrame({i: randn(1e7) for i in range(1, 11)})
In [144]: df2 = DataFrame({i: randn(1e7) for i in range(10)})
In [145]: timeit lhs, rhs = df.align(df2); res = lhs.add(rhs, fill_value=0).fillna(0)
1 loops, best of 3: 4.41 s per loop
In [146]: timeit df.combine_first(df2).fillna(0)
1 loops, best of 3: 2.95 s per loop
DataFrame.combine_first() is faster for larger frames.
In [49]: data = map(list, zip(*myDict.keys())) + [myDict.values()]
In [50]: df = DataFrame(zip(*data)).set_index([0, 1])[2].unstack()
In [52]: df.combine_first(df.T).fillna(0)
Out[52]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
For posterity: If you are just tuning in, check out Phillip Cloud's answer below for a neater way to construct df.
Not as elegant as I'd like (and not using pandas) but until you find something better:
adj = dict()
for ((u, v), w) in myDict.items():
    if u not in adj: adj[u] = dict()
    if v not in adj: adj[v] = dict()
    adj[u][v] = adj[v][u] = w
keys = adj.keys()
print('\t' + '\t'.join(keys))
for u in keys:
    def f(v):
        try:
            return str(adj[u][v])
        except KeyError:
            return "0"
    print(u + '\t' + '\t'.join(f(v) for v in keys))
or equivalently (if you don't want to construct the adjacency matrix):
k = dict()
for ((u, v), w) in myDict.items():
    k[u] = k[v] = True
keys = k.keys()
print('\t' + '\t'.join(keys))
for u in keys:
    def f(v):
        if (u, v) in myDict:
            return str(myDict[(u, v)])
        elif (v, u) in myDict:
            return str(myDict[(v, u)])
        else:
            return "0"
    print(u + '\t' + '\t'.join(f(v) for v in keys))
Got it working using the pandas package.
from pandas import DataFrame

#Find all column names
z = []
[z.extend(x) for x in myDict.keys()]
colnames = sorted(set(z))

#Create an empty DataFrame using pandas
myDF = DataFrame(index=colnames, columns=colnames)
myDF = myDF.fillna(0)  #Initialize with zeros

#Fill each item one by one
for val in myDict:
    myDF[val[0]][val[1]] = myDict[val]
    myDF[val[1]][val[0]] = myDict[val]
#Write to a file
outfilename = "matrixCooccurence.txt"
myDF.to_csv(outfilename, sep="\t", index=True, header=True, index_label = "features" )