Using df.query() to extract rows from a DataFrame - python

I have a DataFrame df which contains three columns: ['mid', '2014_amt', '2015_amt'].
I want to extract the rows for a particular merchant. For example, consider my data:
df['mid'] = ['as', 'fsd', 'qww', 'fd']
df['2014_amt'] = [144, 232, 45, 121]
df['2015_amt'] = [676, 455, 455, 335]
I want to extract the whole rows corresponding to mid = ['fsd', 'qww']. How is this best done? I tried the code below:
df.query('mid == "fsd"')
If I want to run a loop, how can I use the above code to extract rows for specified values of mid?
for val in mid:
    print(df.query('mid == val'))
This gives an error, as val is not defined inside the query string.

Option 1
df.query('mid in ["fsd", "qww"]')
   mid  2014_amt  2015_amt
1  fsd       232       455
2  qww        45       455
Option 2
df[df['mid'].isin(['fsd', 'qww'])]
   mid  2014_amt  2015_amt
1  fsd       232       455
2  qww        45       455
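For the loop in the question, query() can reference a local Python variable by prefixing it with @, so no string formatting is needed. A short sketch (mid_values is an assumed name for the list of merchant IDs to look up):
mid_values = ['fsd', 'qww']
for val in mid_values:
    # @val substitutes the local Python variable into the query expression
    print(df.query('mid == @val'))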

Related

pandas.core.frame.DataFrame rename index problems

In an existing table I computed a summary with:
df.groupby('bin_fare')['fare'].agg(['count', 'sum', 'mean'])
The result is the table below; bin_fare is the name of the index:
bin_fare  count         sum        mean
1           491   3717.1413    7.570553
2           474   9000.3078   18.987991
3           259  14565.0003  14565.0003
4            84  16268.0375  16268.0375
I tried to rename the indexes by adding this code:
fare_rate_names = ['cheapest','avarage','above average','expensive']
df.groupby('bin_fare')['fare'].agg(['count','sum','mean']).rename(index=pd.Series(data=fare_rate_names))
But it renames only the first 3 rows!
bin_fare         count         sum        mean
avarage            491   3717.1413    7.570553
above average      474   9000.3078   18.987991
expensive          259  14565.0003  14565.0003
4                   84  16268.0375  16268.0375
How can I fix it without adding an element at the beginning of fare_rate_names?
You can just set a new index:
df.index = pd.Series(fare_rate_names)
Or, the more pythonic ("pandastic"?):
df.set_index(pd.Series(fare_rate_names), inplace=True)
Also, you could create a dummy name for the 0th index:
fare_rate_names = ['foo', 'cheapest','avarage','above average','expensive']
df.groupby('bin_fare')['fare'].agg(['count','sum','mean']).rename(index=pd.Series(data=fare_rate_names))
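Alternatively (a sketch, assuming the bin_fare labels are the integers 1 through 4), build the rename mapping so it starts at 1 instead of 0; then no dummy entry is needed:
name_map = dict(enumerate(fare_rate_names, start=1))  # {1: 'cheapest', 2: 'avarage', ...}
df.groupby('bin_fare')['fare'].agg(['count','sum','mean']).rename(index=name_map)
This works because rename matches index labels, not positions, so the label 4 now has an entry in the mapping.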

How can I fetch only rows from the left dataset but also a certain column from the right dataset, during an inner join

I am trying to implement a logic where I have two DataFrames, say A (left) and B (right).
I need to find the matching rows of A in B (I understand this can be done via an "inner" join). But my use case says I only need all rows from DataFrame A plus one column from the matched record in B, the ID column (I understand this can be done via select). Here the problem arises: after the inner join, the returned DataFrame also has rows from DataFrame B, which I don't need, but if I use leftsemi, then I won't be able to fetch the ID column from B.
For example:
def fetch_match_condition():
    return ["EMAIL", "PHONE1"]
match_condition = fetch_match_condition()
from pyspark.sql.functions import sha2, concat_ws
schema_col = ["FNAME","LNAME","EMAIL","PHONE1","ADD1"]
schema_col_master = schema_col.copy()
schema_col_master.append("ID")
data_member = [["ALAN","AARDVARK","lbrennan.kei#malchikzer.gq","7346281938","4176 BARRINGTON CT"],
["DAVID","DINGO","ocoucoumamai#hatberkshire.com","7362537492","2622 MONROE AVE"],
["MICHAL","MARESOVA","annasimkova#chello.cz","435261789","FRANTISKA CERNEHO 61 623 99 UTERY"],
["JAKUB","FILIP","qcermak#email.czanek","8653827956","KOHOUTOVYCH 93 602 42 MLADA BOLESLAV"]]
master_df = spark.createDataFrame(data_member,schema_col)
master_df = master_df.withColumn("ID", sha2(concat_ws("||", *master_df.columns), 256))
test_member = [["ALAN","AARDVARK","lbren.kei#malchik.gq","7346281938","4176 BARRINGTON CT"],
["DAVID","DINGO","ocoucoumamai#hatberkshire.com","99997492","2622 MONROE AVE"],
["MICHAL","MARESOVA","annasimkova#chello.cz","435261789","FRANTISKA CERNEHO 61 623 99 UTERY"],
["JAKUB","FILIP","qcermak#email.czanek","87463829","KOHOUTOVYCH 93 602 42 MLADA BOLESLAV"]]
test_member_1 = [["DAVID","DINGO","ocoucoumamai#hatberkshire.com","7362537492","2622 MONROE AVE"],
["MICHAL","MARESOVA","annasimkova#chello.cz","435261789","FRANTISKA CERNEHO 61 623 99 UTERY"]]
test_df = spark.createDataFrame(test_member,schema_col)
test_df_1 = spark.createDataFrame(test_member_1,schema_col)
matched_df = test_df.join(master_df, match_condition, "inner").select(test_df["*"], master_df["ID"])
master_df = master_df.union(matched_df.select(schema_col_master))
# Here only the second-to-last record will match and get added back to the master, with the same ID it had in master_df
matched_df = test_df_1.join(master_df, match_condition, "inner").select(test_df_1["*"], master_df["ID"])
# Here the problem arises: I need only 2 matches, since test_df_1 has only two records, but I get 3 because the David record added to master_df above is also selected.
PS: Is there a way I can use leftsemi and then use withColumn with a UDF that fetches the ID from DataFrame B based on the row values of the leftsemi result?
Can anyone propose a solution for this?
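One possible approach (a sketch, not a verified answer): keep the inner join, select only the left-hand columns plus master_df["ID"], and drop duplicate join keys from the master first so each test row matches at most one master row:
# De-duplicate the master on the join keys so each left row matches at most once
dedup_master = master_df.dropDuplicates(match_condition)

# Inner join, keeping only the left DataFrame's columns plus the ID from the right
matched_df = (
    test_df_1.join(dedup_master, match_condition, "inner")
             .select(test_df_1["*"], dedup_master["ID"])
)
This preserves the left side's rows (for those that match) while taking just the one extra column from the right, which is the usual way to get "leftsemi plus one column" behavior.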

python: pandas .describe - how to put results into a variable?

Pretty new to Python and pandas in general. I have a DataFrame that has 2 columns I want to analyze:
df2 = df1.query('debt == 1').groupby(['family_status'])['family_status'].describe()
This gives me the result of:
count unique top freq
family_status
civil partnership 388 1 civil partnership 388
divorced 85 1 divorced 85
married 931 1 married 931
unmarried 274 1 unmarried 274
widow / widower 63 1 widow / widower 63
which is a lot of information that I wanted to know - however, to do some additional analysis I'd like to be able to put these results by 'family_status' into variables - so, civil partnership, divorced, married, etc. - or feed them into an ad-hoc function.
Edit for clarity - I want to have civil_partnership = the count (in this case 388), etc.
Unsure how to proceed.
Thanks for your time in advance,
Jared
.describe() returns a normal DataFrame, so you can assign it to any variable and apply any DataFrame methods to it.
To select by index label, use .loc (the older .ix accessor is deprecated).
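For example (assuming df2 still has family_status as its index, i.e. before any reset_index):
# Pull a single cell out of the describe result by index label and column name
married_count = df2.loc['married', 'count']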
It would be easy to create a dictionary and look up each family_status as a key in that dict. Since you've assigned your describe DataFrame to df2:
df2.reset_index(inplace=True, drop=False)
d = {}
for status, count in zip(df2['family_status'], df2['count']):
    d[status] = count
This should result in something like:
d = {
    'civil partnership': 388,
    'divorced': 85,
    'married': 931,
    'unmarried': 274,
    'widow / widower': 63,
}
Edit:
df2.count invokes the count method, so the syntax above uses df2['count'] to get the column, not the method.
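A more compact equivalent (a sketch using the same df2 after reset_index):
# Build the same mapping in one line via Series.to_dict()
d = df2.set_index('family_status')['count'].to_dict()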

Am I using groupby.sum() correctly?

I have the following code, and a problem in the new_df["SUM"] line:
import pandas as pd
df = pd.read_excel(r"D:\Tesina\Proteoma Humano\Tablas\uno - copia.xlsx")
#df = pd.DataFrame({'ID': ['C9JLR9','O95391', 'P05114',"P14866"], 'SEQ': ['1..100,182..250,329..417,490..583', '1..100,206..254,493..586', '1..100', "1..100,284..378" ]})
df2 = pd.DataFrame
df["SEQ"] = df["SEQ"].replace("\.\."," ", regex =True)
new_df = df.assign(SEQ=df.SEQ.str.split(',')).explode('SEQ')
for index, row in df.iterrows():
    new_df['delta'] = new_df['SEQ'].map(lambda x: (int(x.split()[1])+1)-int(x.split()[0]) if x.split()[0] != '1' else (int(x.split()[1])+1))
new_df["SUM"] = new_df.groupby(["ID"]).sum().reset_index(drop=True) #Here's the error, even though I can't see where
df2 = new_df.groupby(["ID","SUM"], sort=False)["SEQ"].apply((lambda x: ','.join(x.astype(str)))).reset_index(name="SEQ")
To give some context, what it does is the following: it grabs every line with the same ID, separates the numbers with a "," in between, does some math with those numbers (that's where the "delta" line gets involved - which I know isn't really a delta), and finally sums up all the "delta" values for each ID, grouping them by their original ID so I maintain the same number of rows.
And when I use a sample of the data (the one that's commented at the beginning), it works perfectly, giving me the output that I want:
ID SUM SEQ
0 C9JLR9 353 1 100,182 250,329 417,490 583
1 O95391 244 1 100,206 254,493 586
2 P05114 101 1 100
3 P14866 196 1 100,284 378
But when I apply it to my Excel file (which has 10471 rows), the groupby.sum() line doesn't work as it's supposed to (I've already checked everything else; I know the error is within that line).
This is the output that I receive:
ID SUM SEQ
0 C9JLR9 39 1 100,182 250,329 417,490 583
1 O95391 20 1 100,206 254,493 586
2 P05114 33 1 100
4 P98177 21 1 100,176 246
You can clearly see that the SUM values differ (and are not correct at all). I haven't been able to figure out where those numbers come from, either. It's really weird.
If anyone is interested, the solution was provided in the comments: I had to change the line to the following:
new_df["SUM"] = new_df.groupby("ID")["delta"].transform("sum")

Pandas: search multiple columns and return column with found value

I'm trying to do some auditing of our purchase orders, and I created this DataFrame (here's a CSV sample of it):
ProductName,Qty,LineCost,BuyQty1,BuyQty1Cost,BuyQty2,BuyQty2Cost,BuyQty3,BuyQty3Cost
SIGN2WH,48,40.63,5,43.64,48,40.63,72,39.11
SIGN2BK,144,39.11,5,43.64,48,40.63,72,39.11
In my data source, some products get different price breaks depending on the quantity purchased - hence the columns BuyQty1 and BuyQty1Cost. Qty and LineCost are the values I need to audit. So, what I'm trying to do is:
Check which quantity break corresponds to the value in the Qty column. For example, a Qty of 48 implies that the break is BuyQty2, and the corresponding price should be BuyQty2Cost.
Then add a column with the ratio of LineCost to the matched break cost - BuyQty2Cost in that example, or BuyQty3Cost in the case of SIGN2BK (2nd line).
How should I tackle this?
import pandas as pd

def calculate_break_level(row):
    if row.Qty >= row.BuyQty3:
        return row.BuyQty3Cost
    elif row.Qty >= row.BuyQty2:
        return row.BuyQty2Cost
    else:
        return row.BuyQty1Cost

# apply the function row-by-row by specifying axis=1;
# the newly produced Line_Cost is the last column
df['Line_Cost'] = df.apply(calculate_break_level, axis=1)
Out[58]:
ProductName Qty LineCost BuyQty1 BuyQty1Cost BuyQty2 BuyQty2Cost BuyQty3 BuyQty3Cost Line_Cost
0 SIGN2WH 48 40.63 5 43.64 48 40.63 72 39.11 40.63
1 SIGN2BK 144 39.11 5 43.64 48 40.63 72 39.11 39.11
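The question also asks for the ratio of LineCost to the matched break cost; with Line_Cost in place that is one more line (Ratio is an assumed column name):
# Ratio of the audited line cost to the correct break cost (1.0 means the PO matches the break)
df['Ratio'] = df['LineCost'] / df['Line_Cost']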
