Count the lines in an RDD based on the lines' content, pyspark - python

I am currently trying to understand how RDDs work. For example, I want to count the lines in some RDD object based on their content. I have some experience with DataFrames, and my code for a DF which has, for example, columns A, B and probably some other columns, looks like:
df = sqlContext.read.json("filepath")
df2 = df.groupBy(['A', 'B']).count()
The logic of this code is clear to me - I do a groupBy operation over the column names in the DF. In an RDD I don't have column names, just similar lines, which could be tuples or Row objects... How can I count similar tuples and attach the count as an integer to the unique line? For example, my first code is:
df = sqlContext.read.json("filepath")
rddob = df.rdd.map(lambda line:(line.A, line.B))
I do the map operation and create a tuple of the values from the keys A and B. The unique line doesn't have any keys anymore (this is the most important difference from a DataFrame, which has column names).
Now I can produce something like this, but it only calculates the total number of distinct lines in the RDD.
rddcalc = rddob.distinct().count()
What I want for my output is just:
((a1, b1), 2)
((a2, b2), 3)
((a2, b3), 1)
...
PS
I have found my own solution to this question. Here, rdd is the initial RDD, rddlist is a list of (line, count) pairs, and rddmod is the final modified RDD and consequently the solution.
rddlist = rdd.map(lambda line:(line.A, line.B)).map(lambda line: (line, 1)).countByKey().items()
rddmod = sc.parallelize(rddlist)
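A roughly equivalent sketch of the same idea uses RDD.countByValue, which counts each distinct element directly (the result is a plain dict on the driver):
# countByValue returns {(A, B): count} on the driver;
# parallelize it again if an RDD is needed downstream
pair_counts = rdd.map(lambda line: (line.A, line.B)).countByValue()
rddmod = sc.parallelize(pair_counts.items())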

I believe what you are looking for here is reduceByKey. This will give you a count of how many times each distinct (a, b) pair appears.
It would look like this:
rddob = df.rdd.map(lambda line: ((line.A, line.B), 1))
counts_by_key = rddob.reduceByKey(lambda a, b: a + b)
You will now have key, value pairs of the form:
((a,b), count-of-times-pair-appears)
Please note that this works because the (A, B) key is a tuple of hashable values. If A or B are lists, you have to build a hashable "primary key" type of object to perform the reduce on; you can't perform a reduceByKey where the key is an unhashable object.
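If A or B do hold lists, one possible workaround (a sketch, not part of the original answer) is to convert them to tuples first, since tuples of hashable values are valid RDD keys:
# lists are unhashable; tuples of hashable values work fine as keys
def to_key(value):
    return tuple(value) if isinstance(value, list) else value

pairs = df.rdd.map(lambda line: ((to_key(line.A), to_key(line.B)), 1))
counts_by_key = pairs.reduceByKey(lambda a, b: a + b)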

Related

How to separate tuple into independent pandas columns?

I am working with matching two separate dataframes on first name using HMNI's fuzzymerge.
On output each row returns a key like: (May, 0.9905315373004635)
I am trying to separate the Name and Score into their own columns. I tried the below code but don't quite get the right output - every row ends up with the same exact name/score in the new columns.
for i, v in enumerate(matched.key):
matched['MatchedNameFinal'] = (matched.key[i][0][0])
matched['MatchedNameScore'] = (matched.key[i][0][1])
matched[['consumer_name_first', 'key','MatchedNameFinal', 'MatchedNameScore']]
First, when going over rows in pandas it is better to use apply:
matched['MatchedNameFinal'] = matched.key.apply(lambda x: x[0][0])
matched['MatchedNameScore'] = matched.key.apply(lambda x: x[0][1])
And in your case, I think you are missing a tab (indentation) in the for loop:
for i, v in enumerate(matched.key):
    matched['MatchedNameFinal'] = (matched.key[i][0][0])
    matched['MatchedNameScore'] = (matched.key[i][0][1])
Generally, you want to avoid using enumerate for pandas because pandas functions are vectorized and much faster to execute.
So this solution won't iterate using enumerate.
First you turn the list into a single tuple per row:
matched.key.explode()
Then use zip to split the tuples into 2 columns:
matched['col1'], matched['col2'] = zip(*tuples)  # where tuples is the exploded series
Or do it all in one line:
matched['MatchedNameFinal'], matched['MatchedNameScore'] = zip(*matched.key.explode())
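A minimal self-contained sketch of this pattern (the data below is made up; each 'key' cell is assumed to be a one-element list holding a (name, score) tuple, as in the question):
import pandas as pd

matched = pd.DataFrame({'key': [[('May', 0.9905315373004635)],
                                [('John', 0.87)]]})

# explode() turns each one-element list into its tuple, zip(*) splits the tuples
matched['MatchedNameFinal'], matched['MatchedNameScore'] = zip(*matched.key.explode())
print(matched[['MatchedNameFinal', 'MatchedNameScore']])
# MatchedNameFinal holds the name, MatchedNameScore holds the score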

How do I pull the index(es) and column(s) of a specific value from a dataframe?

Hello, everyone! New student of Python's pandas here.
I have a dataframe I artificially constructed here: https://i.stack.imgur.com/cWgiB.png. Below is a text reconstruction.
import pandas as pd

df_dict = {
    'header0' : [55,12,13,14,15],
    'header1' : [21,22,23,24,25],
    'header2' : [31,32,55,34,35],
    'header3' : [41,42,43,44,45],
    'header4' : [51,52,53,54,33]
}
index_list = {
    0:'index0',
    1:'index1',
    2:'index2',
    3:'index3',
    4:'index4'
}
df = pd.DataFrame(df_dict).rename(index = index_list)
GOAL:
I want to pull the index row(s) and column header(s) of any ARBITRARY value(s) (int, float, str, etc.). So, for example, if I want to find the value 55, this code should return: header0, index0, header2, index2 in some format. The output could be a list, a tuple, printed output, etc.
CLARIFICATIONS:
Imagine the dataframe is of a large enough size that I cannot "just find it manually"
I do not know how large this value is in comparison to other values (so a "simple .idxmax()" probably won't cut it)
I do not know where this value is column or index wise (so "just .loc,.iloc where the value is" won't help either)
I do not know whether this value has duplicates or not, but if it does, return all its column/indexes.
WHAT I'VE TRIED SO FAR:
I've played around with .columns, .index and .loc, but just can't seem to get the answer. The farthest I've gotten is creating a boolean dataframe with df.values == 55 or df == 55, but I cannot seem to do anything with it.
Another "farthest" I've gotten is using df.unstack().idxmax(), which returns a tuple of the column and index, but it has 2 major problems:
Only returns the max/min as per the .idxmax(), .idxmin() functions
Only returns the FIRST column/index matching my value, which doesn't help if there are duplicates
I know I could do a for loop to iterate through the entire dataframe, tracking which column and index I am on in temporary variables. Once I hit the value I am looking for, I'll break and return the current column and index. Was just hoping there was a less brute-force-y method out there, since I'd like a "high-speed calculation" method that would work on any dataframe of any size.
Thanks.
EDIT: Added text database, clarified questions.
Use np.where:
r, c = np.where(df == 55)
list(zip(df.index[r], df.columns[c]))
Output:
[('index0', 'header0'), ('index2', 'header2')]
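If the lookup is needed more than once, the same pattern can be wrapped in a small helper (a sketch; find_value is a hypothetical name, not a pandas function):
import numpy as np

def find_value(df, value):
    # every (index label, column label) pair where df equals value
    r, c = np.where(df == value)
    return list(zip(df.index[r], df.columns[c]))

# find_value(df, 55) -> [('index0', 'header0'), ('index2', 'header2')]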
There is a function in pandas that gives duplicate rows.
duplicate = df[df.duplicated()]
print(duplicate)
Use DataFrame.unstack to get a Series with a MultiIndex, and then filter duplicates with Series.duplicated with keep=False:
s = df.unstack()
out = s[s.duplicated(keep=False)].index.tolist()
If you also need the duplicates together with their values:
df1 = (s[s.duplicated(keep=False)]
         .sort_values()
         .rename_axis(['cols', 'idx'])
         .reset_index(name='val'))
If you need a specific value instead, change the mask to Series.eq (==):
s = df.unstack()
out = s[s.eq(55)].index.tolist()
So, in the code below there is iteration; however, it doesn't iterate over the whole DataFrame, it just iterates over the columns and then uses .any() to check whether the desired value is present. Then, using loc, it locates the value and finally returns the index.
wanted_value = 55
for col in list(df.columns):
    if df[col].eq(wanted_value).any():
        print("row:", *list(df.loc[df[col].eq(wanted_value)].index), ' col', col)

Remove columns that have substring similar to other columns Python

I have a dataframe where the column names all have the same format, date_sensor, where the date is in the format yymmdd. Here is a subset of it:
Considering the last date (180722), I would like to keep the column according to a pre-defined sensor priority. For example, I would like to define that SN1 is more important than SK3. So the desired result would be the same dataframe, only without the column 180722_SK3. The number of columns with the same last date can be more than two.
This is the solution I implemented:
sensorsImportance = ['SN1', 'SK3']  # list of importance, first item is the most important
sensorsOrdering = {word: i for i, word in enumerate(sensorsImportance)}

def remove_duplicate_last_date(df, sensorsOrdering):
    s = []
    lastDate = df.columns.tolist()[-1].split('_')[0]
    for i in df.columns.tolist():
        if lastDate in i:
            s.append(i.split('_')[1])
    if len(s) > 1:
        keepCol = lastDate + '_' + sorted(s, key=sensorsOrdering.get)[0]
        dropCols = [lastDate + '_' + i for i in sorted(s, key=sensorsOrdering.get)[1:]]
        df.drop(dropCols, axis=1, inplace=True)
    return df
It works fine; however, I feel that this is too cumbersome. Is there a better way?
It can be done: split the column names, then apply argsort with the priority list, then reorder your dataframe, and join the column names back after a groupby takes the first value per date:
# split the 'date_sensor' column names into a (date, sensor) MultiIndex
df.columns = df.columns.str.split('_').map(tuple)
sensorsImportance = ['SN1', 'SK3']
# order the columns by sensor priority
idx = df.columns.get_level_values(1).map(dict(zip(sensorsImportance, range(len(sensorsImportance))))).argsort()
# keep only the first (highest-priority) column per date
df = df.iloc[:, idx].T.groupby(level=0).head(1).T
# join the (date, sensor) tuples back into 'date_sensor' names
df.columns = df.columns.map('_'.join)
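Applied to a small made-up frame (the column names below are hypothetical, chosen only to match the yymmdd_sensor pattern), the approach would look like this end to end:
import pandas as pd

df = pd.DataFrame({'180721_SN1': [1, 2],
                   '180722_SK3': [3, 4],
                   '180722_SN1': [5, 6]})
sensorsImportance = ['SN1', 'SK3']

df.columns = df.columns.str.split('_').map(tuple)
idx = df.columns.get_level_values(1).map(dict(zip(sensorsImportance, range(len(sensorsImportance))))).argsort()
df = df.iloc[:, idx].T.groupby(level=0).head(1).T
df.columns = df.columns.map('_'.join)

print(df.columns.tolist())  # 180722_SK3 is dropped; only the SN1 columns remain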

How to iterate over numpy.ndarray which consists of objects? Also apply various functions on them

I have a dataframe of 100000+ rows in which I have a column named 'type' which has unique values like:
['healer' 'terminator' 'kill-la-kill' 'demonic' 'healer-fpp' 'terminator-fpp' 'kill-la-kill-fpp' 'demonic-fpp']
What I want is to count the number of each type in the dataframe. What I am doing now to count the rows is:
len(df.loc[df['type'] == "healer"])
But in this case I have to write it manually as many times as there are unique values in that column.
Is there any other simpler way to do that?
Also, I want to use these conditions to filter other columns as well, e.g. the 'terminator' type killed 78 in the 'kills' column and had 0 heals.
use value_counts?
df['type'].value_counts()
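For the follow-up about also looking at other columns per type, a hedged groupby sketch (this assumes the frame really has 'kills' and 'heals' columns, which the question only mentions in passing):
# per-type row count plus total kills and heals (column names are assumptions)
stats = df.groupby('type').agg(count=('type', 'size'),
                               total_kills=('kills', 'sum'),
                               total_heals=('heals', 'sum'))
print(stats.loc['terminator'])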
Numpy is great, and usually already has a one-liner that covers most requirements like this - I think what you might want is:
np.unique(yourArray, return_counts=True)
This will return the unique values and the number of times each one appears in your array.
try:
import numpy as np
np.unique(df['type'].values, return_counts=True)
Or, roll it up in a dict, so you can extract the counts keyed by value:
count_dict = dict(zip(*np.unique(df['type'].values, return_counts=True)))
count_dict["healer"]
>> 132
Then you can plug that into a format string and (assuming you make a similar dictionary called heals_dict) do something like:
for k in count_dict.keys():
    print("the {k} killed {kills} in the 'kills' and had {heals} heals".format(k=k, kills=count_dict[k], heals=heals_dict[k]))
You can iterate over unique values directly by using df["type"].unique()
for val in df["type"].unique():
print(val, len(df[df["type"] == val]))

Conditional iteration of key,value in DataFrameGroupBy

I have a pandas (v 0.12) dataframe data in python (2.7). I groupby() with respect to the A and B columns in data to form the groups object, which is of type <class 'pandas.core.groupby.DataFrameGroupBy'>.
I want to loop through and apply a function to the dataframes within groups that have more than one row in them. My code is below; here each dataframe is the value in the key, value pair:
import pandas as pd
groups = data.groupby(['A','B'])
len(groups)
>> 196320 # too large - will be slow to iterate through all
for key, value in groups:
    if len(value) > 1:
        print(value)
Since I am only interested in applying the function to values where len(value) > 1, is it possible to save time by embedding this condition, so that I filter and loop through only the key-value pairs that satisfy it? I can do something like below to ascertain the size of each value, but I am not sure how to marry this aggregation with the original groups object.
size_values = data.groupby(['A','B']).agg({'C' : [np.size]})
I am hoping the question is clear, please let me know if any clarification is needed.
You could assign the length of each group back to a column and filter by its value:
data['count'] = data.groupby(['A','B'],as_index=False)['A'].transform(np.size)
After that you could:
data[data['count'] > 1].groupby(['A','B']).apply(your_function)
Or just skip the assignment if it is a one-time operation:
data[data.groupby(['A','B'],as_index=False)['A'].transform(np.size) > 1].groupby(['A','B']).apply(your_function)
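For comparison, pandas also provides DataFrameGroupBy.filter, which expresses the same group-size condition directly (a sketch; your_function stands in for whatever is being applied):
# keep only groups with more than one row, then apply the function per group
filtered = data.groupby(['A', 'B']).filter(lambda g: len(g) > 1)
result = filtered.groupby(['A', 'B']).apply(your_function)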
