Am I using groupby.sum() correctly? - python

I have the following code, and a problem with the new_df["SUM"] line:
import pandas as pd
df = pd.read_excel(r"D:\Tesina\Proteoma Humano\Tablas\uno - copia.xlsx")
#df = pd.DataFrame({'ID': ['C9JLR9','O95391', 'P05114',"P14866"], 'SEQ': ['1..100,182..250,329..417,490..583', '1..100,206..254,493..586', '1..100', "1..100,284..378" ]})
df2 = pd.DataFrame()
df["SEQ"] = df["SEQ"].replace(r"\.\.", " ", regex=True)
new_df = df.assign(SEQ=df.SEQ.str.split(',')).explode('SEQ')
for index, row in df.iterrows():
    new_df['delta'] = new_df['SEQ'].map(lambda x: (int(x.split()[1])+1)-int(x.split()[0]) if x.split()[0] != '1' else (int(x.split()[1])+1))
new_df["SUM"] = new_df.groupby(["ID"]).sum().reset_index(drop=True)  # Here's the error, even though I can't see where
df2 = new_df.groupby(["ID","SUM"], sort=False)["SEQ"].apply(lambda x: ','.join(x.astype(str))).reset_index(name="SEQ")
To give some context, here is what it does: it grabs every line with the same ID, splits the numbers on the ",", does some math with those numbers (that's where the "delta" line comes in; I know it isn't really a delta), and finally sums up all the "delta" values for each ID, grouping them by their original ID so I keep the same number of rows.
When I use a sample of the data (the one that's commented out at the beginning), it works perfectly, giving me the output that I want:
ID SUM SEQ
0 C9JLR9 353 1 100,182 250,329 417,490 583
1 O95391 244 1 100,206 254,493 586
2 P05114 101 1 100
3 P14866 196 1 100,284 378
But when I apply it to my Excel file (which has 10471 rows), the groupby.sum() line doesn't work as it's supposed to (I've already checked everything else; I know the error is in that line).
This is the output that I receive:
ID SUM SEQ
0 C9JLR9 39 1 100,182 250,329 417,490 583
1 O95391 20 1 100,206 254,493 586
2 P05114 33 1 100
4 P98177 21 1 100,176 246
You can clearly see that the SUM values differ (and are not correct at all). I also haven't been able to figure out where those numbers come from; it's really weird.

If anyone is interested, the solution was provided in the comments: I had to replace the line with the following:
new_df["SUM"] = new_df.groupby("ID")["delta"].transform("sum")

Related

How can I fetch only rows from the left dataset but also a certain column from the right dataset during an inner join?

I am trying to implement a logic where I have two dataframes, say A (left) and B (right).
I need to find the matching rows of A in B (I understand this can be done via an "inner" join). But my use case says I only need all rows from dataframe A plus one column from the matched record in B, the ID column (I understand this can be done via select). Here the problem arises: after the inner join, the returned dataframe also has rows from dataframe B, which I don't need, but if I take a leftsemi join, I won't be able to fetch the ID column from the B dataframe.
For example:
def fetch_match_condition():
    return ["EMAIL", "PHONE1"]

match_condition = fetch_match_condition()
from pyspark.sql.functions import sha2, concat_ws
schema_col = ["FNAME", "LNAME", "EMAIL", "PHONE1", "ADD1"]
schema_col_master = schema_col.copy()
schema_col_master.append("ID")
data_member = [["ALAN", "AARDVARK", "lbrennan.kei#malchikzer.gq", "7346281938", "4176 BARRINGTON CT"],
               ["DAVID", "DINGO", "ocoucoumamai#hatberkshire.com", "7362537492", "2622 MONROE AVE"],
               ["MICHAL", "MARESOVA", "annasimkova#chello.cz", "435261789", "FRANTISKA CERNEHO 61 623 99 UTERY"],
               ["JAKUB", "FILIP", "qcermak#email.czanek", "8653827956", "KOHOUTOVYCH 93 602 42 MLADA BOLESLAV"]]
master_df = spark.createDataFrame(data_member,schema_col)
master_df = master_df.withColumn("ID", sha2(concat_ws("||", *master_df.columns), 256))
test_member = [["ALAN", "AARDVARK", "lbren.kei#malchik.gq", "7346281938", "4176 BARRINGTON CT"],
               ["DAVID", "DINGO", "ocoucoumamai#hatberkshire.com", "99997492", "2622 MONROE AVE"],
               ["MICHAL", "MARESOVA", "annasimkova#chello.cz", "435261789", "FRANTISKA CERNEHO 61 623 99 UTERY"],
               ["JAKUB", "FILIP", "qcermak#email.czanek", "87463829", "KOHOUTOVYCH 93 602 42 MLADA BOLESLAV"]]
test_member_1 = [["DAVID", "DINGO", "ocoucoumamai#hatberkshire.com", "7362537492", "2622 MONROE AVE"],
                 ["MICHAL", "MARESOVA", "annasimkova#chello.cz", "435261789", "FRANTISKA CERNEHO 61 623 99 UTERY"]]
test_df = spark.createDataFrame(test_member,schema_col)
test_df_1 = spark.createDataFrame(test_member_1,schema_col)
matched_df = test_df.join(master_df, match_condition, "inner").select(test_df["*"], master_df["ID"])
master_df = master_df.union(matched_df.select(schema_col_master))
# Here only the second-to-last record will match and get added back to the master, along with the same ID it had in master_df
matched_df = test_df_1.join(master_df, match_condition, "inner").select(test_df_1["*"], master_df["ID"])
# Here the problem arises: I need the match count to be 2, since test_df_1 has only two records, but the match will be 3, since master_df (the DAVID record) will also be selected.
PS: Is there a way I can take a leftsemi join and then use withColumn and a UDF that fetches this ID based on the row values of the leftsemi dataframe from the B dataframe?
Can anyone propose a solution for this?
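For what it's worth, one possible approach (a sketch against the question's dataframes, not a tested answer; lookup is just an illustrative name): project B down to the match keys plus ID, de-duplicate on those keys so a key that occurs twice in master_df (e.g. after the union above) can only match once, and left-join that onto A. This keeps every row of A and adds B's ID, null where nothing matched:
# keep only the match keys and ID from B, one row per key combination
lookup = master_df.select(match_condition + ["ID"]).dropDuplicates(match_condition)
# a left join keeps all rows of A; use "inner" instead to keep matched rows only
matched_df = test_df_1.join(lookup, match_condition, "left")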

Multiply multi-dimensional matrix to get new dataframe with new column names

I created 2 DataFrames with shapes [6,2] and [3,2]. I want to multiply the 2 DataFrames to get a [6,3] matrix. I am using the loop below, but it is giving me a return self._getitem_column(key) error. Below is an example.
df1 = pd.DataFrame([23, 24, 25, 26, 27, 28], index=[1, 2, 3, 4, 5, 6])
df2 = pd.DataFrame([12, 13, 14], index=[1, 2, 3])
for j in range(len(df2)):
    for i in range(len(df1)):
        df3 = (df1[i, 2] * df2[j, 2])
#expected result
df3 =
      1    2    3
1   276  299  322
2   288  312  336
3   300  325  350
4   312  338  364
5   324  351  378
6   336  364  392
I am trying to replicate what I did in an Excel sheet.
It might be easier to leave it out of dataframes altogether, unless you already have the information in dataframes (in which case, write back and I'll show you how to do that).
For now, this might be easier:
list1 = list(range(23, 29))  # note that you have to go one higher to include 28
list2 = list(range(12, 15))  # same deal
outputlist = []
for i in list1:
    for j in list2:
        outputlist.append(i * j)
import numpy as np
outputlist = np.array(outputlist).reshape(len(df1), len(df2))
import pandas as pd
df3 = pd.DataFrame(outputlist)
EDIT: Ok, this might get you where you need to go, then:
list3 = []
for i in range(len(df1)):
    for j in range(len(df2)):
        list3.append(df1.loc[i + 1, 0] * df2.loc[j + 1, 0])
import numpy as np
list3 = np.array(list3).reshape(len(df1), len(df2))
df3 = pd.DataFrame(list3)
EDIT AGAIN: Try this! Just make sure you replace "thenameofthecolumnindf1" with the actual name of the column in df1 that you're interested in, etc.
import numpy as np
list3 = []
for i in df1[thenameofthecolumnindf1]:
    for j in df2[thenameofthecolumnindf2]:
        list3.append(i * j)
list3 = np.array(list3).reshape(len(df1), len(df2))
df3 = pd.DataFrame(list3)
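As an aside, the two nested loops above compute an outer product, so NumPy can build the whole table in one call; a sketch using the same placeholder column names as above:
import numpy as np
import pandas as pd
# np.outer multiplies every element of the first vector by every element of
# the second, giving the len(df1) x len(df2) grid the loops build row by row
df3 = pd.DataFrame(np.outer(df1[thenameofthecolumnindf1], df2[thenameofthecolumnindf2]))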
The math for this simply won't work as matrix multiplication: the number of columns in the first matrix (2) would have to equal the number of rows in the second matrix (3). You're likely getting a key indexing error because of the mismatched row/column lookup.
You'd have to transpose the second matrix ([3,2] to [2,3]) so the inner dimensions line up in order to multiply them properly, rather than indexing pairs of cells as is done above.

Updating a pandas dataframe column doesn't work the first time

I have a dataframe concatenated from some other dataframes, and I need to update some values in one column. I found that I have to do the same update twice. To find out what happened, I saved the dataframe to disk and reloaded it, then did the update; this time it worked on the first try.
Is this a bug in pandas, or did I do something wrong?
I am using pandas 0.22.0 from conda 4.5.0.
import re
import pandas as pd
sum_trade = pd.read_csv('somefile.csv')
df = pd.concat(
    [
        sum_trade.loc[sum_trade.mon == 201806].groupby(['trade'])['cnt'].sum(),
        sum_trade.loc[sum_trade.mon == 201706].groupby(['trade'])['cnt'].sum(),
        sum_trade.loc[sum_trade.mon > 201800].groupby(['trade'])['cnt'].sum(),
        sum_trade.loc[sum_trade.mon < 201800].groupby(['trade'])['cnt'].sum()
    ],
    axis=1
).reset_index()
df.columns = ['trade_code', 'cnt201806', 'cnt201706', 'cnt20181-6', 'cnt20171-6']
# substitute ["1.blabla", "(1)foofoo", "其中:barbar"] with ["blabla", "foofoo", "barbar"]
pattern = re.compile(r'^\(?\d?\.?\)?(其中:)?')
df.to_csv('temp.csv')
# The following line does not succeed
df.trade_code = df.trade_code.map(lambda x: pattern.sub('', x.strip()))
display(df[df.trade_code.map(lambda x: '1' in x)])
# doing the same update again seems to work
df.trade_code = df.trade_code.map(lambda x: pattern.sub('', x.strip()))
display(df[df.trade_code.map(lambda x: '1' in x)])
# if the data is loaded from the file, the first update succeeds
df = pd.read_csv('temp.csv')
display(df[df.trade_code.map(lambda x: '1' in x)])
df.trade_code = df.trade_code.map(lambda x: pattern.sub('', x.strip()))
display(df[df.trade_code.map(lambda x: '1' in x)])
Here is some sample data from somefile.csv, which has about 2500 lines; the concatenated df has about 200 lines (the names and numbers are fake):
city mon trade cnt
0 达纳苏斯 201701 1.农业 23458.0
1 达纳苏斯 201701 1.农副食品加工业 12345684.0
2 达纳苏斯 201701 1.房屋建筑业 22109.0
3 达纳苏斯 201701 1.电信、广播电视和卫星传输服务 338.0
4 达纳苏斯 201701 1.电力、热力生产和供应业 133333.0
Below are the 2 outputs of the above code, which show that some substitutions were successful while others were not. I ran the code several times, and it was always the following 4 lines that were not updated the first time. But if the data or the pattern had a problem, the second update should not work either.
trade cnt201806 cnt201706 cnt20181-6 cnt20171-6
33 1.化学纤维制造业 0.0 123451.0 0.0 5432185.0
34 1.印刷和记录媒介复制业 5678913.0 7890153.0 5555504.0 112233185.0
63 1.金属制品业 98765804.0 4321563.0 34567919.0 22222256.0
82 1.金属制品、机械和设备修理业 8765493.0 3214929.0 3322113331.0 556677155.0
====================================================================
trade cnt201806 cnt201706 cnt20181-6 cnt20171-6
I checked the data and found that some trades are:
11.化学纤维制造业
11.印刷和记录媒介复制业
...
After the first substitution, they become:
1.化学纤维制造业
1.印刷和记录媒介复制业
...
That's why I have to substitute twice. I changed my pattern from '^\(?\d?\.?\)?(其中:)?' to '^\(?\d*\.?\)?(其中:)?' and everything is OK now.
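A quick demonstration of the difference between the two patterns (using the corrected patterns above, with the opening parenthesis escaped):
import re
s = '11.化学纤维制造业'
# \d? consumes at most one leading digit, so one pass leaves '1.化学纤维制造业'
print(re.sub(r'^\(?\d?\.?\)?(其中:)?', '', s))
# \d* consumes all leading digits, so a single pass is enough
print(re.sub(r'^\(?\d*\.?\)?(其中:)?', '', s))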
Thanks to all replies and comments.

Applying function to every cell in a Dataframe based on index and col

I have a pandas dataframe with a format exactly like the one in this question, and I'm trying to achieve the same result. In my case, I am calculating the fuzz ratio between each row's index and its corresponding column.
If I try this code (based on the answer to the linked question)
def get_similarities(x):
    return x.index + x.name

test_df = test_df.apply(get_similarities)
the concatenation of the row index and column name happens cell-wise, just as intended. Running type(test_df) returns pandas.core.frame.DataFrame, as expected.
However, if I adapt the code to my scenario like so
def get_similarities(x):
    return fuzz.partial_ratio(x.index, x.name)

test_df = test_df.apply(get_similarities)
it doesn't work. Instead of a dataframe, I get back a series (the return type of that function is an int).
I don't understand why the two samples don't behave the same, nor how to fix my code so that it returns a dataframe with the fuzz ratio for each cell between that cell's row index and column name.
What about the following approach?
Assuming that we have two lists of strings:
In [245]: set1
Out[245]: ['car', 'bike', 'sidewalk', 'eatery']
In [246]: set2
Out[246]: ['walking', 'caring', 'biking', 'eating']
Solution:
In [247]: from itertools import product
In [248]: res = np.array([fuzz.partial_ratio(*tup) for tup in product(set1, set2)])
In [249]: res = pd.DataFrame(res.reshape(len(set1), -1), index=set1, columns=set2)
In [250]: res
Out[250]:
walking caring biking eating
car 33 100 0 33
bike 25 25 75 25
sidewalk 73 20 22 36
eatery 17 33 0 50
There is a way to accomplish this via DataFrame.apply with some row manipulations.
Assuming that test_df is as follows:
In [73]: test_df
Out[73]:
walking caring biking eating
car carwalking carcaring carbiking careating
bike bikewalking bikecaring bikebiking bikeeating
sidewalk sidewalkwalking sidewalkcaring sidewalkbiking sidewalkeating
eatery eaterywalking eaterycaring eaterybiking eateryeating
In [74]: def get_ratio(row):
    ...:     return row.index.to_series().apply(lambda x: fuzz.partial_ratio(x,
    ...:                                                                     row.name))
    ...:
In [75]: test_df.apply(get_ratio)
Out[75]:
walking caring biking eating
car 33 100 0 33
bike 25 25 75 25
sidewalk 73 20 22 36
eatery 17 33 0 50
It took some digging, but I figured it out. The problem comes from the fact that DataFrame.apply is applied either column-wise or row-wise, not cell by cell. So your get_similarities function is actually getting access to an entire row or column of data at a time! By default it gets the entire column -- so to solve your problem, you just have to make a get_similarities function that returns a list, where you manually call fuzz.partial_ratio on each element, like this:
import pandas as pd
from fuzzywuzzy import fuzz

def get_similarities(x):
    l = []
    for rname in x.index:
        print("Getting ratio for %s and %s" % (rname, x.name))
        score = fuzz.partial_ratio(rname, x.name)
        print("Score %s" % score)
        l.append(score)
    print(len(l))
    print()
    return l

a = pd.DataFrame([[1, 2], [3, 4]], index=['apple', 'banana'], columns=['aple', 'banada'])
c = a.apply(get_similarities, axis=0)
print(c)
print(type(c))
I left my print statements in there so you can see what the DataFrame.apply call is doing for yourself -- that's when it clicked for me.
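As an aside, the same cell-by-cell table can be built without apply at all; a sketch using np.vectorize and broadcasting (assuming the same test_df layout as in the answers above):
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz

ratio = np.vectorize(fuzz.partial_ratio)
# index labels down the rows, column names across the columns; broadcasting
# pairs every (index, column) combination, mirroring the apply-based answers
result = pd.DataFrame(ratio(test_df.index.values[:, None], test_df.columns.values),
                      index=test_df.index, columns=test_df.columns)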

Using df.query() to extract rows from a DataFrame

I have a DataFrame df which contains three columns: ['mid','2014_amt','2015_amt']
I want to extract the rows of a particular merchant. For example, suppose my data is:
df['mid'] = ['as', 'fsd', 'qww', 'fd']
df['2014_amt'] = [144, 232, 45, 121]
df['2015_amt'] = [676, 455, 455, 335]
I want to extract the whole rows corresponding to mid = ['fsd','qww']. How is this best done? I tried the below code:
df.query('mid == "fsd"')
If I want to run a loop, how can I use the above code to extract rows for specified values of mid?
for val in mid:
    print(df.query('mid == "val"'))
This is giving an error, as val is not specified.
Option 1
df.query('mid in ["fsd", "qww"]')
mid 2014_amt 2015_amt
1 fsd 232 455
2 qww 45 455
Option 2
df[df['mid'].isin(['fsd', 'qww'])]
mid 2014_amt 2015_amt
1 fsd 232 455
2 qww 45 455
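As for the loop in the question: inside a query string a bare name is treated as a column, and "val" in quotes is just a literal string, which is why neither form finds the loop variable. query supports referencing Python variables with an @ prefix:
for val in ['fsd', 'qww']:
    print(df.query('mid == @val'))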
