Comparing/Mapping different series in different Dataframes - python

I have two DataFrames. DataFrame "A", the main one, has 3 columns: "Number", "donation" and "Var1". DataFrame "B" has 2 columns: "Number" and "location". The "Number" column in DataFrame B is a subset of "Number" in A. What I would like to do is add a new column "NEW" to DataFrame A that matches the "Number" values between the two frames: if a number is present in DataFrame B the value should be 1, otherwise 0.
>>>DFA
Number donation Var1
243 4 45
677 56 34
909 34 22
565 78 24
568 90 21
784 33 88
787 22 66
>>>DFB
Number location
909 PB
565 WB
784 AU
These are the two dataframes; I want DFA with a new column that looks something like this.
>>>DFA
Number donation Var1 NEW
243 4 45 0
677 56 34 0
909 34 22 1
565 78 24 1
568 90 21 0
784 33 88 1
787 22 66 0
The new column has value 1 if the Number is present in DFB, and 0 if it is absent.

You could use the isin method:
DFA['NEW'] = (DFA['Number'].isin(DFB['Number'])).astype(int)
For example,
import pandas as pd
DFA = pd.DataFrame({'Number': [243, 677, 909, 565, 568, 784, 787],
                    'Var1': [45, 34, 22, 24, 21, 88, 66],
                    'donation': [4, 56, 34, 78, 90, 33, 22]})
DFB = pd.DataFrame({'Number': [909, 565, 784], 'location': ['PB', 'WB', 'AU']})
DFA['NEW'] = (DFA['Number'].isin(DFB['Number'])).astype(int)
print(DFA)
yields
Number Var1 donation NEW
0 243 45 4 0
1 677 34 56 0
2 909 22 34 1
3 565 24 78 1
4 568 21 90 0
5 784 88 33 1
6 787 66 22 0
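As a side note, if you later want the actual location values rather than a 0/1 flag, a similar one-liner with map works (a sketch, assuming Number is unique in DFB):

# look up each Number in DFB and pull its location; non-matches become NaN
DFA['location'] = DFA['Number'].map(DFB.set_index('Number')['location'])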

How to group columns that respect certain conditions

I have been trying since this afternoon to group columns that respect certain conditions. Here is an easy example; I have 3 columns like this:
ID1_column_A ID2_column_B ID2_column_C
234 100 10
334 130 11
34 250 40
34 200 25
My aim is to group the columns that start with the same ID, so here I will have only 2 columns in the output:
ID1_column_A Fusion_B_C
234 110
334 141
34 290
34 225
Thanks for reading.
IIUC, you can try this:
df.groupby(df.columns.str.split('_').str[0], axis=1).sum()
Output:
ID1 ID2
0 234 110
1 334 141
2 34 290
3 34 225
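For reference, a self-contained version of that one-liner; note that axis=1 in groupby is deprecated in recent pandas releases, so transposing first is the forward-compatible spelling:

import pandas as pd

df = pd.DataFrame({'ID1_column_A': [234, 334, 34, 34],
                   'ID2_column_B': [100, 130, 250, 200],
                   'ID2_column_C': [10, 11, 40, 25]})

# group the column labels by their prefix before the first underscore and sum
out = df.T.groupby(df.columns.str.split('_').str[0]).sum().T
print(out)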
import pandas as pd
data = pd.DataFrame({'ID1_column_A': [234, 334, 34, 34],
                     'ID2_column_B': [100, 130, 250, 200],
                     'ID2_column_C': [10, 11, 40, 25]})
data['Fusion_B_C'] = data.loc[:, 'ID2_column_B':'ID2_column_C'].sum(axis=1)
data.drop(columns=['ID2_column_B', 'ID2_column_C'], inplace=True)  # delete the merged columns
print(data)
Output
ID1_column_A Fusion_B_C
0 234 110
1 334 141
2 34 290
3 34 225

Compare each row of Pandas df1 with every row within df2 and return string value from closest matching row

I have two data frames.
df1 includes 4 men and 4 women with their weight and height (inches).
#df1
John, 236, 76
Jack, 204, 74
Jim, 156, 71
Jared, 182, 72
Suzy, 119, 60
Sally, 149, 66
Sharon, 169, 65
Sammy, 182, 75
df2 includes 4 men and 4 women with their weight and height (inches).
#df2
Aaron, 285, 77
Abe, 236, 75
Alex, 178, 72
Adam, 195, 71
Mary, 148, 66
Maylee, 155, 66
Marilyn, 199, 65
Madison, 160, 73
What I am trying to do is compare the men from df1 to the men from df2 to see who they are most like based on height and weight: subtract weight from weight and height from height, take the absolute values, and sum them for each man in df2. More specifically, return the name of the most similar man.
So in this case John's closest match is Abe so in a new column
df1['doppelganger'] = "Abe".
I'm a beginner hobbyist so even pointing me in the right direction would be helpful. I've been looking through stack overflow for about five hours trying to figure out how to go about something like this.
First it is necessary to distinguish men from women, so a helper column g is added that repeats m and f 4 times each. Then DataFrame.merge with an outer join on that column creates all combinations within each group, and new columns are added for the absolute differences, the last being their sum. The frame is then sorted by 3 columns with DataFrame.sort_values, so the first row per group of A and g can be kept with DataFrame.drop_duplicates:
df = (df1.assign(g=['m']*4 + ['f']*4)
         .merge(df2.assign(g=['m']*4 + ['f']*4), on='g', how='outer', suffixes=('', '_'))
         .assign(dif1=lambda x: x['B'].sub(x['B_']).abs(),
                 dif2=lambda x: x['C'].sub(x['C_']).abs(),
                 sumdiff=lambda x: x['dif1'] + x['dif2'])
         .sort_values(['A', 'g', 'sumdiff'])
         .drop_duplicates(['A', 'g'])
         .sort_index()
         .rename(columns={'A_': 'doppelganger'}))
print(df)
A B C g doppelganger B_ C_ dif1 dif2 sumdiff
1 John 236 76 m Abe 236 75 0 1 1
7 Jack 204 74 m Adam 195 71 9 3 12
10 Jim 156 71 m Alex 178 72 22 1 23
14 Jared 182 72 m Alex 178 72 4 0 4
16 Suzy 119 60 f Mary 148 66 29 6 35
20 Sally 149 66 f Mary 148 66 1 0 1
25 Sharon 169 65 f Maylee 155 66 14 1 15
31 Sammy 182 75 f Madison 160 73 22 2 24
Input DataFrames:
print (df1)
A B C
0 John 236 76
1 Jack 204 74
2 Jim 156 71
3 Jared 182 72
4 Suzy 119 60
5 Sally 149 66
6 Sharon 169 65
7 Sammy 182 75
print (df2)
A B C
0 Aaron 285 77
1 Abe 236 75
2 Alex 178 72
3 Adam 195 71
4 Mary 148 66
5 Maylee 155 66
6 Marilyn 199 65
7 Madison 160 73
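Since the asker is a beginner, here is a more explicit sketch of the same idea (it assumes the A/B/C column names shown above and that the first four rows of each frame are the men):

import pandas as pd

def closest(row, candidates):
    # absolute weight difference plus absolute height difference,
    # then the name of the candidate with the smallest total
    dist = (candidates['B'] - row['B']).abs() + (candidates['C'] - row['C']).abs()
    return candidates.loc[dist.idxmin(), 'A']

men, women = df1.iloc[:4], df1.iloc[4:]
df1['doppelganger'] = pd.concat([
    men.apply(closest, axis=1, candidates=df2.iloc[:4]),
    women.apply(closest, axis=1, candidates=df2.iloc[4:]),
])
print(df1)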

VLookup (then replace) in Pandas with Dictionary?

I want to replace values in a Pandas DataFrame using a dictionary.
DataFrame = games-u-q-s.csv:
blue1 blue2 blue3 blue4 blue5 red1 red2 red3 red4 red5 winner
8 432 96 11 112 104 498 122 238 412 0
119 39 76 10 35 54 25 120 157 92 0
57 63 29 61 36 90 19 412 92 22 0
Columns 1-10 contain champId values, with the winner column as the label.
Dictionary = champNum.csv
champId champNum
266 1
103 2
84 3
12 4
32 5
34 6
1 7
. .
. .
143 138
I want to convert each champId into its champNum, save the result as dataset_feature_champion_number.csv, and get an expected output like this:
blue1 blue2 blue3 blue4 blue5 red1 red2 red3 red4 red5 winner
125 11 59 70 124 36 129 20 135 111 0
23 40 77 53 95 67 73 37 132 91 0
69 13 116 81 22 68 127 111 91 8 0
This is the code:
import csv
import os
import numpy as np
import pandas as pd

def createDictionary(csvfile):
    with open(csvfile, mode='r') as data:
        reader = csv.reader(data)
        dict = {int(rows[0]):int(rows[1]) for rows in reader}
        return dict

def convertDataframeToChampNum(csvfile, dictionary):
    df = pd.read_csv(csvfile)
    temp1 = df.iloc[:,1:11]
    temp2 = df['winner']
    temp3 = temp1.applymap(dictionary.get)
    champNum = temp3.join(temp2)
    return champNum

def saveAsCSV(dataframe):
    dataframe.to_csv("dataset_feature_champion_number.csv")

def main():
    diction = createDictionary("champNum.csv")
    dataset = convertDataframeToChampNum("games-u-q-s.csv", diction)
    saveAsCSV(dataset)

if __name__ =='__main__':
    main()
And I got this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-f86679fc49f9> in <module>()
27
28 if __name__ =='__main__':
---> 29 main()
<ipython-input-19-f86679fc49f9> in main()
22
23 def main():
---> 24 diction = createDictionary("champNum.csv")
25 dataset = convertDataframeToChampNum("games-u-q-s.csv",diction)
26 saveAsCSV(dataset)
<ipython-input-19-f86679fc49f9> in createDictionary(csvfile)
7 with open(csvfile, mode='r') as data:
8 reader = csv.reader(data)
----> 9 dict = {int(rows[0]):int(rows[1]) for rows in reader}
10 return dict
11
<ipython-input-19-f86679fc49f9> in <dictcomp>(.0)
7 with open(csvfile, mode='r') as data:
8 reader = csv.reader(data)
----> 9 dict = {int(rows[0]):int(rows[1]) for rows in reader}
10 return dict
11
ValueError: invalid literal for int() with base 10: 'champNum'
I think you're looking for pandas.DataFrame.transform:
>>> a = pd.DataFrame([[1,2,3,4,5],[6,7,8,9,10]])
>>> a
0 1 2 3 4
0 1 2 3 4 5
1 6 7 8 9 10
>>> a.transform(lambda x: -x)
0 1 2 3 4
0 -1 -2 -3 -4 -5
1 -6 -7 -8 -9 -10
or, applied to your problem
df = pd.DataFrame({'blue1': [8, 119, 57],
                   'blue2': [432, 39, 63],
                   'blue3': [96, 76, 29],
                   'blue4': [11, 10, 61],
                   'blue5': [112, 35, 36],
                   'red1': [104, 54, 90],
                   'red2': [498, 25, 19],
                   'red3': [122, 120, 412],
                   'red4': [238, 157, 92],
                   'red5': [412, 92, 22],
                   'winner': [0, 0, 0]})
transform_dict = {266: 1, 103: 2, ...}
df.transform(lambda col: col.map(transform_dict))
Each column is passed to the function as a Series, so Series.map does the per-element dictionary lookup; values missing from the dictionary (for example the winner labels) become NaN, so you may want to apply this to the ten champion columns only.
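Separately, the ValueError in the traceback comes from the header row of champNum.csv ('champId,champNum') being passed to int(). A minimal sketch of the dictionary builder with the header skipped (same file layout assumed):

import csv

def createDictionary(csvfile):
    with open(csvfile, mode='r', newline='') as data:
        reader = csv.reader(data)
        next(reader)  # skip the 'champId,champNum' header row
        return {int(row[0]): int(row[1]) for row in reader}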

Set columns of DataFrame to sum of columns of another in pandas

I have a DataFrame, call it "values" (constructed below).
I would like to create another, call it "sums", in which each column contains the sum of the columns of "values" from that column to the end.
I would like to create this without looping through the entire DataFrame, data point by data point. I have been trying with .apply() as seen below, but I keep getting the error: unsupported operand type(s) for +: 'int' and 'datetime.date'
In [26]: values = pandas.DataFrame({0:[96,54,27,28],
    ...:     1:[55,75,32,37], 2:[54,99,36,46], 3:[35,77,0,10], 4:[62,25,0,25],
    ...:     5:[0,66,0,89], 6:[0,66,0,89], 7:[0,0,0,0], 8:[0,0,0,0]})
In [28]: sums = values.copy()
In [29]: sums.iloc[:,:] = ''
In [31]: for column in sums:
...: sums[column].apply(sum(values.loc[:,column:]))
...:
Traceback (most recent call last):
File "<ipython-input-31-030442e5005e>", line 2, in <module>
sums[column].apply(sum(values.loc[:,column:]))
File "C:\WinPython64bit\python-3.5.2.amd64\lib\site-packages\pandas\core\series.py", line 2220, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas\src\inference.pyx", line 1088, in pandas.lib.map_infer (pandas\lib.c:63043)
TypeError: 'numpy.int64' object is not callable
In [32]: for column in sums:
...: sums[column] = sum(values.loc[:,column:])
In [33]: sums
Out[33]:
0 1 2 3 4 5 6 7 8
0 36 36 35 33 30 26 21 15 8
1 36 36 35 33 30 26 21 15 8
2 36 36 35 33 30 26 21 15 8
3 36 36 35 33 30 26 21 15 8
Is there a way to do this without looping each point individually?
Without looping, you can reverse your dataframe, take the cumulative sum along each row, and then re-reverse it:
>>> values.iloc[:,::-1].cumsum(axis=1).iloc[:,::-1]
0 1 2 3 4 5 6 7 8
0 302 206 151 97 62 0 0 0 0
1 462 408 333 234 157 132 66 0 0
2 95 68 36 0 0 0 0 0 0
3 324 296 259 213 203 178 89 0 0
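As an aside, the reason the loop in the question produced rows of 36 36 35 ... is that the built-in sum iterates over a DataFrame's column labels, not its values:

# sum() iterates over column labels, so this adds up the labels 3..8, not the data
sum(values.loc[:, 3:])  # 3 + 4 + 5 + 6 + 7 + 8 == 33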
You can use the .cumsum() method to get the cumulative sum. The problem is that it operates from left to right, whereas you need it from right to left.
So we will reverse your data frame, use cumsum(), then put the columns back into the proper order.
import pandas as pd
values = pd.DataFrame({0: [96, 54, 27, 28],
                       1: [55, 75, 32, 37], 2: [54, 99, 36, 46], 3: [35, 77, 0, 10],
                       4: [62, 25, 0, 25], 5: [0, 66, 0, 89], 6: [0, 66, 0, 89],
                       7: [0, 0, 0, 0], 8: [0, 0, 0, 0]})
values[values.columns[::-1]].cumsum(axis=1).reindex(columns=values.columns)
# returns:
0 1 2 3 4 5 6 7 8
0 302 206 151 97 62 0 0 0 0
1 462 408 333 234 157 132 66 0 0
2 95 68 36 0 0 0 0 0 0
3 324 296 259 213 203 178 89 0 0

value counts of a Database chunk by chunk using pandas

I have a big DataFrame df and I want to count the occurrences of each value. I can't do:
df = pandas.read_csv('my_big_data.csv')
values_df = df.apply(pd.Series.value_counts)
because the file is very big.
I think it must be possible to do it chunk by chunk with chunksize, but I can't see how.
In [9]: pd.set_option('max_rows',10)
Construct a sample frame
In [10]: df = pd.DataFrame(np.random.randint(0,100,size=100000).reshape(-1,1))
In [11]: df
Out[11]:
0
0 50
1 35
2 20
3 66
4 8
... ..
99995 51
99996 33
99997 43
99998 41
99999 56
[100000 rows x 1 columns]
In [12]: df.to_csv('test.csv')
Read it back in chunks and construct the .value_counts for each chunk.
Concatenate all of these results (so you have a frame that is indexed by the values being counted, and whose values are the counts).
In [13]: result = pd.concat([ chunk.apply(pd.Series.value_counts) for chunk in pd.read_csv('test.csv',index_col=0,chunksize=10000) ] )
In [14]: result
Out[14]:
0
18 121
75 116
39 116
55 115
60 114
.. ...
88 83
8 83
56 82
76 76
18 73
[1000 rows x 1 columns]
Then group by the index, which puts all of the duplicate index entries in the same group. Summing gives the total of the individual value_counts.
In [15]: result.groupby(result.index).sum()
Out[15]:
0
0 1017
1 1015
2 992
3 1051
4 973
.. ...
95 1014
96 949
97 1011
98 999
99 981
[100 rows x 1 columns]
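An equivalent approach that avoids building the intermediate concatenated frame is to keep a running total with Series.add (a sketch, assuming the same single-column test.csv written above):

import pandas as pd

# keep a running total of counts instead of concatenating per-chunk results
total = None
for chunk in pd.read_csv('test.csv', index_col=0, chunksize=10000):
    counts = chunk.iloc[:, 0].value_counts()  # counts for this chunk's single column
    total = counts if total is None else total.add(counts, fill_value=0)

print(total.sort_index().astype(int))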
