I am trying to implement some logic involving two DataFrames, say A (left) and B (right).
I need to find the matching rows of A in B (I understand this can be done via an "inner" join). But my use case says I only need all rows from DataFrame A plus one column from the matched record in B, the ID column (I understand this can be done via a select). Here the problem arises: after the inner join, the returned DataFrame also contains the columns from DataFrame B, which I don't need; but if I take a leftsemi join, I won't be able to fetch the ID column from DataFrame B.
For example:
def fetch_match_condition():
    return ["EMAIL", "PHONE1"]
match_condition = fetch_match_condition()
from pyspark.sql.functions import sha2, concat_ws
schema_col = ["FNAME","LNAME","EMAIL","PHONE1","ADD1"]
schema_col_master = schema_col.copy()
schema_col_master.append("ID")
data_member = [["ALAN","AARDVARK","lbrennan.kei#malchikzer.gq","7346281938","4176 BARRINGTON CT"],
["DAVID","DINGO","ocoucoumamai#hatberkshire.com","7362537492","2622 MONROE AVE"],
["MICHAL","MARESOVA","annasimkova#chello.cz","435261789","FRANTISKA CERNEHO 61 623 99 UTERY"],
["JAKUB","FILIP","qcermak#email.czanek","8653827956","KOHOUTOVYCH 93 602 42 MLADA BOLESLAV"]]
master_df = spark.createDataFrame(data_member,schema_col)
master_df = master_df.withColumn("ID", sha2(concat_ws("||", *master_df.columns), 256))  # ID = SHA-256 hash of all columns
test_member = [["ALAN","AARDVARK","lbren.kei#malchik.gq","7346281938","4176 BARRINGTON CT"],
["DAVID","DINGO","ocoucoumamai#hatberkshire.com","99997492","2622 MONROE AVE"],
["MICHAL","MARESOVA","annasimkova#chello.cz","435261789","FRANTISKA CERNEHO 61 623 99 UTERY"],
["JAKUB","FILIP","qcermak#email.czanek","87463829","KOHOUTOVYCH 93 602 42 MLADA BOLESLAV"]]
test_member_1 = [["DAVID","DINGO","ocoucoumamai#hatberkshire.com","7362537492","2622 MONROE AVE"],
["MICHAL","MARESOVA","annasimkova#chello.cz","435261789","FRANTISKA CERNEHO 61 623 99 UTERY"]]
test_df = spark.createDataFrame(test_member,schema_col)
test_df_1 = spark.createDataFrame(test_member_1,schema_col)
matched_df = test_df.join(master_df, match_condition, "inner").select(test_df["*"], master_df["ID"])
master_df = master_df.union(matched_df.select(schema_col_master))
# Here only the second-last record (the MICHAL row) will match and is added back to the master with the same ID it had in master_df
matched_df = test_df_1.join(master_df, match_condition, "inner").select(test_df_1["*"], master_df["ID"])
# Here the problem arises: I need the match count to be only 2, since test_df_1 has only two records, but the match will be 3, since the master_df DAVID record will also be selected.
PS: Is there a way I can take a leftsemi join and then use withColumn with a UDF that fetches this ID from DataFrame B based on the row values of the leftsemi result?
Can anyone propose a solution for this?
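To make the PS concrete, something along these lines is what I have in mind (untested sketch; it assumes master_df is small enough to collect into a driver-side lookup keyed by the match columns):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
# Driver-side lookup from the match columns to the master ID
# (only feasible if master_df is reasonably small)
id_lookup = {(r["EMAIL"], r["PHONE1"]): r["ID"]
             for r in master_df.select("EMAIL", "PHONE1", "ID").collect()}
bc_lookup = spark.sparkContext.broadcast(id_lookup)
@udf(StringType())
def fetch_id(email, phone):
    return bc_lookup.value.get((email, phone))
# leftsemi keeps only test_df's columns; the UDF then fills in the matching ID
semi_df = test_df.join(master_df, match_condition, "leftsemi")
matched_df = semi_df.withColumn("ID", fetch_id(col("EMAIL"), col("PHONE1")))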
My initial column looks like this:
spread%
0 0.002631183029370956687450895171
1 0.002624478865422741694443794361
2 0.002503969912244045131633932303
3 0.002634517528902797001731827513
(I have 95000 rows in total)
What I wanted to do was to divide these spreads into 100 bins. This is what I did:
spread_range = np.linspace(0.000001, 0.0001, num=300)
dfspread = pd.DataFrame(spread_range, columns=['spread%'])  # placeholder values
sorted_array = np.sort(df['spread%'])  # sort all 95000 spreads
dfspread['spread%'] = np.array_split(sorted_array, 300)  # one chunk of the sorted spreads per row
dfspread['spread%'] = dfspread['spread%'].str[1]  # keep the second element of each chunk
I had to first create a dataframe with random values (spread_range) and then replace these values with the good values (last line). I did not know how to do it in one step...
This is my output:
spread%
295 0.006396490507889923995723419182
296 0.006601856970328614032555077092
297 0.006874901899230889970177366191
298 0.007286400912994813194530809917
299 0.008012436834225554885192314445
but I do not find my maximum value, which is 0.02828190624663463264290952354.
Any idea why?
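For reference, the one-step version I was aiming for would look roughly like this (untested sketch, mirroring the .str[1] step above):
import numpy as np
import pandas as pd
chunks = np.array_split(np.sort(df['spread%']), 300)  # 300 roughly equal chunks of the sorted spreads
dfspread = pd.DataFrame({'spread%': [c[1] for c in chunks]})  # second element of each chunk, as before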
I have a CSV file created like this:
keep_same;get_max;get_min;get_avg
1213;176;901;517
1213;198;009;219
1213;898;201;532
Now I want a fourth row to get appended to the existing CSV file as follows:
First column: remains the same: 1213
Second column: gets the max value: 898
Third column: gets the min value: 009
Fourth column: gets the avg value: 422.6
So the final CSV file should be:
keep_same;get_max;get_min;get_avg
1213;176;901;517
1213;198;009;219
1213;898;201;532
1213;898;009;422.6
Please help me achieve this. It's not mandatory to use Pandas.
Thanks in advance!
df.agg(...) accepts a dict whose keys are column names and whose values are strings naming the aggregation you want:
df_agg = df.agg({'keep_same': 'mode', 'get_max': 'max',
                 'get_min': 'min', 'get_avg': 'mean'})[df.columns]
Produces:
keep_same get_max get_min get_avg
0 1213 898 9 422.666667
Then you just append df_agg to df:
df = df.append(df_agg, ignore_index=False)
Result:
keep_same get_max get_min get_avg
0 1213 176 901 517.000000
1 1213 198 9 219.000000
2 1213 898 201 532.000000
0 1213 898 9 422.666667
Notice that the index of the appended row is 0. You can pass ignore_index=True to append if you desire.
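For example, with ignore_index=True the appended row simply gets the next integer label instead of keeping 0:
df = df.append(df_agg, ignore_index=True)  # appended row gets index 3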
Also note that if you plan to do this append operation a lot, it will be very slow. Other approaches exist in that case, but for a one-off or just a few appends, append is fine.
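For instance, one common alternative is to collect the new rows in a plain list and do a single concat at the end (a sketch with made-up rows, not specific to this question):
import pandas as pd
new_rows = []
for _ in range(3):  # hypothetical loop that produces several summary rows
    new_rows.append({'keep_same': 1213, 'get_max': 898,
                     'get_min': 9, 'get_avg': 422.666667})
df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)  # concatenate once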
Assuming you do not care about the index, you can use loc[-1] to add the row:
df = pd.read_csv('file.csv', sep=';', dtype={'get_min': 'object'})  # read the csv; keep get_min as object to preserve the leading zero
row = [df['keep_same'].values[0], df['get_max'].max(), df['get_min'].min(), df['get_avg'].mean()]  # build the new row
df.loc[-1] = row  # add the row under a new index
df['get_avg'] = df['get_avg'].round(1)  # round to 1 decimal place
df['get_avg'] = df['get_avg'].apply(lambda x: '%g' % x)  # strip the trailing .0 from the other records
df.to_csv('file1.csv', index=False, sep=';')  # write out without the index
out:
keep_same;get_max;get_min;get_avg
1213;176;901;517
1213;198;009;219
1213;898;201;532
1213;898;009;422.7
I have a DataFrame df which contains three columns: ['mid','2014_amt','2015_amt']
I want to extract rows of a particular merchant. For example, consider my data is:
df['mid'] = ['as','fsd','qww','fd']
df['2014_amt'] = [144,232,45,121]
df['2015_amt'] = [676,455,455,335]
I want to extract the whole rows corresponding to mid = ['fsd','qww']. How is this best done? I tried the code below:
df.query('mid== "fsd"')
If I want to run a loop, how can I use the above code to extract rows for specified values of mid?
for val in mid:
    print df.query('mid == "val"')
This is giving an error, as val is not specified.
Option 1
df.query('mid in ["fsd", "qww"]')
mid 2014_amt 2015_amt
1 fsd 232 455
2 qww 45 455
Option 2
df[df['mid'].isin(['fsd', 'qww'])]
mid 2014_amt 2015_amt
1 fsd 232 455
2 qww 45 455
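If you still want the loop form from the question, note that query can reference a local Python variable with @:
for val in ['fsd', 'qww']:
    print(df.query('mid == @val'))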
I want to add subtotals to my dataframe: group by some index level, then append the resulting dataframe to the main one. For some unknown reason temp["Узел"] = "Итого" does nothing. It doesn't add a new column, though temp["Узел2"] = "Итого" adds one. I think it's because my pvt dataframe already has a "Узел" index level, but what impact does that have on the new temp dataframe?
temp = pvt.groupby(level=["Принадл"]).sum()
temp["Узел"] = "Итого"
print(temp)
print(temp["Узел"])
Россия \
ОКТ КЛГ МСК ГОР СЕВ СКВ ЮВС ПРВ КБШ СВР
Принадл
ИП 783 14 172 398 248 1178 460 235 314 644
ПС 93900 5049 89815 36197 85619 55213 91681 26764 33869 154280
... \
СНГ и др. Итого
Принадл
ИП 46 9342
ПС 51529 1299784
[2 rows x 21 columns]
Empty DataFrame
Columns: []
Index: [ИП, ПС]
pandas 0.16.1, numpy 1.9.2
UPD: that's because of the manually added MultiIndex level "Узел", or the multi-level columns... or both. I'm not sure yet.
UPD2: I've been able to avoid the problem by temporarily switching to one-level column names before adding the new columns:
columns, temp.columns = temp.columns, [None]*len(temp.columns)  # stash the MultiIndex columns and flatten them
temp[...] = ...
<...>
temp.columns = columns  # restore the original columns
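Another option that might avoid the switch altogether (not verified on pandas 0.16.1) is to assign with a full tuple key, since the columns here have two levels:
temp[("Узел", "")] = "Итого"  # hypothetical: address both column levels explicitly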