Pandas - column median applied on lambda function - python

Given the dataset:
import pandas as pd

matrix = [(222, 34, 23),
          (333, 31, 11),
          (444, 16, 21),
          (555, 32, 22),
          (666, 33, 27),
          (777, 35, 11)]
dfObj = pd.DataFrame(matrix, columns=list('abc'))
I want to apply the formula (value - column median) ^ 2 to every cell, where value is the cell's value. I am trying to do this with a lambda function, but I am not succeeding; the problem is getting the column median. How could I apply that function?
Edit: my attempt so far (it squares and divides, but does not use the median):
import math
dfObj['d'] = dfObj['c'].apply(lambda x: math.pow(x, 2) / 10)

Is this what you need?
dfObj.div(dfObj.median())**2
Out[116]:
a b c
0 0.197531 1.094438 1.144402
1 0.444444 0.909822 0.261763
2 0.790123 0.242367 0.954029
3 1.234568 0.969467 1.047052
4 1.777778 1.031006 1.577069
5 2.419753 1.159763 0.261763
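If the goal is literally (value - column median) ** 2 rather than a ratio, a minimal sketch using .sub() (which broadcasts the column medians across the rows) would be:
# subtract each column's median from every cell, then square the result
dfObj.sub(dfObj.median()) ** 2
# equivalent to (dfObj - dfObj.median()) ** 2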

Related

How can I replace pd intervals with integers in python

import pandas as pd
df = pd.DataFrame()
df['age'] = [43, 76, 27, 8, 57, 32, 12, 22]
age_band = [0,10,20,30,40,50,60,70,80,90]
df['age_bands']= pd.cut(df['age'], bins=age_band, ordered=True)
output:
age age_bands
0 43 (40, 50]
1 76 (70, 80]
2 27 (20, 30]
3 8 (0, 10]
4 57 (50, 60]
5 32 (30, 40]
6 12 (10, 20]
7 22 (20, 30]
Now I want to add another column that replaces the bands with a single number (an int), but I could not get it to work.
For example, this did not work:
df['age_code']= df['age_bands'].replace({'(40, 50]':4})
how can I get a column looks like this?
age_bands age_code
0 (40, 50] 4
1 (70, 80] 7
2 (20, 30] 2
3 (0, 10] 0
4 (50, 60] 5
5 (30, 40] 3
6 (10, 20] 1
7 (20, 30] 2
Assuming you want the first digit from every interval, you can use .apply to achieve what you want as follows:
df["age_code"] = df["age_bands"].apply(lambda band: str(band)[1])
However, note this may not be very efficient for a large dataframe.
To convert the column values to an int dtype, you can use pd.to_numeric:
df["age_code"] = pd.to_numeric(df['age_code'])
As the column contains pd.Interval objects, use their left property:
df['age_code'] = df['age_bands'].apply(lambda interval: interval.left // 10)
You can do that by simply adding a second pd.cut and defining the labels argument.
import pandas as pd
df = pd.DataFrame()
df['age'] = [43, 76, 27, 8, 57, 32, 12, 22]
age_band = [0,10,20,30,40,50,60,70,80,90]
df['age_bands']= pd.cut(df['age'], bins=age_band, ordered=True)
#This is the part of code you need to add
age_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8]
df['age_code']= pd.cut(df['age'], bins=age_band, labels=age_labels, ordered=True)
>>> print(df)
You can create a dictionary of bins and map it to the age_bands column:
bins_sorted = sorted(pd.cut(df['age'], bins=age_band, ordered=True).unique())
bins_dict = {key: idx for idx, key in enumerate(bins_sorted)}
df['age_code'] = df.age_bands.map(bins_dict).astype(int)
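One more option, assuming age_bands still has the ordered categorical dtype that pd.cut produces: each category's integer code already matches the desired numbering, so it can be read off directly.
# (0, 10] -> 0, (10, 20] -> 1, ..., (70, 80] -> 7 (missing values would become -1)
df['age_code'] = df['age_bands'].cat.codes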

Convert a numpy int64 into a list

I want to transform a numpy int64 array
Zed 49
Kassadin 39
Cassiopeia 34
RekSai 33
Nidalee 30
Name: value, dtype: int64
into a list like this:
[(Zed, 49), (Kassadin, 39), (Cassiopeia, 34), (RekSai, 33), (Nidalee, 30)]
Till now I've tried:
l = l.tolist()
l.T
and
[row for row in l.T]
but the output looks like this:
[49, 39, 34, 33, 30]
One possible solution is a list comprehension:
L = [(k, v) for k, v in series.items()]
Or convert the values to a DataFrame and then to a list of tuples:
L = list(map(tuple, series.reset_index().values.tolist()))
Or to MultiIndex:
L = series.to_frame('a').set_index('a', append=True).index.tolist()
print(L)
[('Zed', 49), ('Kassadin', 39), ('Cassiopeia', 34), ('RekSai', 33), ('Nidalee', 30)]
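As a minor variant of the first option (same result, assuming the data is the Series shown in the question), Series.items() can be passed straight to list():
L = list(series.items())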

How to update matrix based on multiple maximum value per row?

I am a newbie to Python. I have an NxN matrix and I want to know the maximum value in each row. Next, I want to nullify (update to zero) all other values except this maximum value. If a row contains multiple maximum values, all of those maximum values should be preserved.
Using a DataFrame, I tried to get the maximum of each row. Then I tried to get the indices of these max values. The code is given below.
import pandas as pd

matrix = [(22, 16, 23),
          (12, 6, 43),
          (24, 67, 11),
          (87, 9, 11),
          (66, 36, 66)]
dfObj = pd.DataFrame(matrix, index=list('abcde'), columns=list('xyz'))
maxValuesObj = dfObj.max(axis=1)
maxValueIndexObj = dfObj.idxmax(axis=1)
The above code doesn't consider multiple maximum values. Only the first occurrence is returned.
Also, I am stuck with how to update the matrix accordingly. My expected output is:
matrix = [(0, 0, 23),
          (0, 0, 43),
          (0, 67, 0),
          (87, 0, 0),
          (66, 0, 66)]
Can you please help me to sort out this?
Using df.where():
dfObj.where(dfObj.eq(dfObj.max(1),axis=0),0)
x y z
a 0 0 23
b 0 0 43
c 0 67 0
d 87 0 0
e 66 0 66
For a NumPy array instead of a DataFrame, call .values after the above code:
dfObj.where(dfObj.eq(dfObj.max(1),axis=0),0).values
Or, better, use to_numpy():
dfObj.where(dfObj.eq(dfObj.max(1),axis=0),0).to_numpy()
Or np.where:
np.where(dfObj.eq(dfObj.max(1),axis=0),dfObj,0)
array([[ 0, 0, 23],
[ 0, 0, 43],
[ 0, 67, 0],
[87, 0, 0],
[66, 0, 66]], dtype=int64)
I'll show how to do it with Python built-ins instead of Pandas, since you're new to Python and should know how to do it outside of Pandas (and the Pandas syntax isn't as clean).
matrix = [(22, 16, 23),
          (12, 6, 43),
          (24, 67, 11),
          (87, 9, 11),
          (66, 36, 66)]

new_matrix = []
for row in matrix:
    row_max = max(row)
    new_row = tuple(element if element == row_max else 0 for element in row)
    new_matrix.append(new_row)
You can do this with a short for loop pretty easily:
import numpy as np

matrix = np.array([(22, 16, 23), (12, 6, 43), (24, 67, 11), (87, 9, 11), (66, 36, 66)])
for i in range(len(matrix)):
    matrix[i] = [x if x == max(matrix[i]) else 0 for x in matrix[i]]
print(matrix)
output:
[[ 0 0 23]
[ 0 0 43]
[ 0 67 0]
[87 0 0]
[66 0 66]]
I would also use numpy for matrices, not pandas.
This isn't the most performant solution, but you can write a function for the row operation and then apply it to each row:
def max_row(row):
    row.loc[row != row.max()] = 0
    return row

dfObj.apply(max_row, axis=1)
Out[17]:
x y z
a 0 0 23
b 0 0 43
c 0 67 0
d 87 0 0
e 66 0 66
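If the data is already a NumPy array and performance matters, a vectorized sketch (assuming the array is named arr) keeps every per-row maximum and zeroes the rest without a Python-level loop:
import numpy as np

arr = np.array(matrix)
# keepdims=True gives the row maxima shape (n, 1), so they broadcast against arr
arr = np.where(arr == arr.max(axis=1, keepdims=True), arr, 0)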

Pythonic way to loop two lists

I have two arrays:
Array_a = [20, 30, 50, 20]
Array_b = [1 ,2 ,3 , 4]
would like to have the following output:
(20, '(1,Days Learning)')
(30, '(2,Days Learning)')
(50, '(3,Days Learning)')
(20, '(4,Days Learning)')
My code looks like the following:
for i, j in zip(Array_a, Array_b):
    msg = (i, "(" + str(j) + ",Days Learning)")
    print(msg)
but I would like to write it in a simpler way, something along the lines of:
for a, b in []
Try this one:
msg = [(a, '({}, Days Learning)'.format(b)) for a, b in zip(Array_a, Array_b)]
print(msg)
Will output:
[(20, '(1, Days Learning)'), (30, '(2, Days Learning)'), (50, '(3, Days Learning)'), (20, '(4, Days Learning)')]
NOTE:
To print the elements line by line you can use print with join and a generator expression:
print('\n'.join(str(m) for m in msg))
How about this? It looks like a more Pythonic way to me:
Array_a = [20, 30, 50, 20]
Array_b = [1 ,2 ,3 , 4]
sample = tuple(zip(Array_a,zip(Array_b,["Days Learning" for i in range(len(Array_b))])))
print(sample)
it will give you this result:
((20, (1, 'Days Learning')), (30, (2, 'Days Learning')), (50, (3, 'Days Learning')), (20, (4, 'Days Learning')))
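A small variant of the first answer, using an f-string, reproduces the exact output format from the question:
for a, b in zip(Array_a, Array_b):
    print((a, f"({b},Days Learning)"))
# (20, '(1,Days Learning)')
# (30, '(2,Days Learning)')
# ...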

Spark Python - how to use reduce by key to get minmum/maximum values

I have a sample data of maximum and minimum temperatures of some cities in csv format.
Mumbai,19,30
Delhi,5,41
Kolkata,20,40
Mumbai,18,35
Delhi,4,42
Delhi,10,44
Kolkata,19,39
I want to find out all time lowest temperature recorded for each city using a spark script in Python.
Here is my script
cityTemp = sc.textFile("weather.txt").map(lambda x: x.split(','))
# convert it to pair RDD for performing reduce by Key
cityTemp = cityTemp.map(lambda x: (x[0], tuple(x[1:])))
cityTempMin = cityTemp.reduceByKey(lambda x, y: min(x[0],y[0]))
cityTempMin.collect()
My expected output is as follows
Delhi, 4
Mumbai, 18
Kolkata, 19
However, the script produces the following output:
[(u'Kolkata', u'19'), (u'Mumbai', u'18'), (u'Delhi', u'1')]
How do I get the desired output?
Try the solution below if you have to use the reduceByKey function:
SCALA:
val df = sc.parallelize(Seq(("Mumbai", 19, 30),
                            ("Delhi", 5, 41),
                            ("Kolkata", 20, 40),
                            ("Mumbai", 18, 35),
                            ("Delhi", 4, 42),
                            ("Delhi", 10, 44),
                            ("Kolkata", 19, 39))).map(x => (x._1, x._2)).keyBy(_._1)

df.reduceByKey((accum, n) => if (accum._2 > n._2) n else accum).map(_._2).collect().foreach(println)
PYTHON:
rdd = sc.parallelize([("Mumbai", 19, 30),
                      ("Delhi", 5, 41),
                      ("Kolkata", 20, 40),
                      ("Mumbai", 18, 35),
                      ("Delhi", 4, 42),
                      ("Delhi", 10, 44),
                      ("Kolkata", 19, 39)])

def reduceFunc(accum, n):
    print(accum, n)
    if accum[1] > n[1]:
        return n
    else:
        return accum

def mapFunc(lines):
    return (lines[0], lines[1])

rdd.map(mapFunc).keyBy(lambda x: x[0]).reduceByKey(reduceFunc).map(lambda x: x[1]).collect()
Output:
(Kolkata,19)
(Delhi,4)
(Mumbai,18)
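Alternatively, staying closer to the original script: the temperatures are still strings (so they compare lexicographically), and the lambda returns a bare string instead of a tuple, which then gets indexed again on the next merge. Casting the minimum temperature to int and reducing with the built-in min should be enough (a sketch, using the weather.txt file from the question):
cityTemp = sc.textFile("weather.txt").map(lambda x: x.split(','))
# keep only (city, minTemp) and cast the temperature to int before reducing
cityTempMin = cityTemp.map(lambda x: (x[0], int(x[1]))).reduceByKey(min)
cityTempMin.collect()
# e.g. [('Delhi', 4), ('Mumbai', 18), ('Kolkata', 19)] (order may vary)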
If you don't want to do a reduceByKey, just a groupBy followed by the min function will give you the desired result.
val df = sc.parallelize(Seq(("Mumbai", 19, 30),
                            ("Delhi", 5, 41),
                            ("Kolkata", 20, 40),
                            ("Mumbai", 18, 35),
                            ("Delhi", 4, 42),
                            ("Delhi", 10, 44),
                            ("Kolkata", 19, 39))).toDF("city", "minTemp", "maxTemp")
df.groupBy("city").agg(min("minTemp")).show
Output:
+-------+------------+
| city|min(minTemp)|
+-------+------------+
| Mumbai| 18|
|Kolkata| 19|
| Delhi| 4|
+-------+------------+
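Since the question is about Python, a rough PySpark equivalent of that groupBy approach (assuming a SparkSession named spark is available) might look like this:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("Mumbai", 19, 30), ("Delhi", 5, 41), ("Kolkata", 20, 40),
     ("Mumbai", 18, 35), ("Delhi", 4, 42), ("Delhi", 10, 44),
     ("Kolkata", 19, 39)],
    ["city", "minTemp", "maxTemp"])

df.groupBy("city").agg(F.min("minTemp")).show()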
