How to use Numpy vectorize to calculate columns in Pandas - python

I have a pandas DataFrame and would like to calculate one column based on two others from the same DataFrame. I would like to use NumPy vectorisation for this, as the dataset is large.
Here is the dataframe:
Input Dataframe
A B
0 567 345
1 123 456
2 568 354
Output Dataframe
A B C
0 567 345 567.345
1 123 456 123.456
2 568 354 568.354
where column C is a concatenation of A and B with a dot between the two values.
I am using apply():
df['C'] = df.apply(lambda row: str(row['A']) + '.' + str(row['B']), axis=1)
instead of iterating over rows/index etc., but it is still slow.
I know that I could do:
df['C'] = df['A'].values + df['B'].values
which is extremely fast, but it does not give me the desired result, and at the same time:
df['C'] = str(df['A'].values) + '.' + str(df['B'].values)
will give me something completely different.
The example is just for presentation purposes (the values of A and B could be of any type). The question is more general.
Thank you in advance!

A list comprehension should be faster than apply for such a use case:
df['C'] = [f"{a}.{b}" for a,b in zip(df['A'],df['B'])]
Outputs
A B C
0 567 345 567.345
1 123 456 123.456
2 568 354 568.354
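If you want to sanity-check the speed difference on your own data, here is a rough sketch using the standard library's timeit (the frame size and repeat count are arbitrary; timings will vary by machine):
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 1000, (100_000, 2)), columns=['A', 'B'])

# Run each approach a few times and compare wall-clock time.
t_apply = timeit.timeit(
    lambda: df.apply(lambda row: str(row['A']) + '.' + str(row['B']), axis=1),
    number=3)
t_listcomp = timeit.timeit(
    lambda: [f"{a}.{b}" for a, b in zip(df['A'], df['B'])],
    number=3)
print(t_apply, t_listcomp)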

To convert the numbers to strings, you can use the astype() method:
df['A'].astype('str') + '.' + df['B'].astype('str')
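Putting it together with the example data from the question (a quick sketch; the output shown is for the toy frame above):
import pandas as pd

df = pd.DataFrame({'A': [567, 123, 568], 'B': [345, 456, 354]})
df['C'] = df['A'].astype(str) + '.' + df['B'].astype(str)
print(df)
#      A    B        C
# 0  567  345  567.345
# 1  123  456  123.456
# 2  568  354  568.354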

Related

How do I convert bytes to utf-8 without turning regular strings into NaNs?

I have a process that runs on multiple pandas dataframes. Sometimes the data comes in the form of bytes, such as:
>>> pd.DataFrame[['x']]
['x']
b'123'
b'111'
b'110'
And other times it comes in the form of regular integers
>>> pd.DataFrame[['x']]
['x']
80
123
491
I want to convert the bytes to utf-8 and leave the regular integers untouched. Right now, I tried df['x'].str.decode('utf-8') and it works when the dataframe comes in the form of bytes, but it turns all the values to NaN when the dataframe comes in the form of integers.
I want the solution to be vectorized because speed is important. I can't use list comprehension, for example.
You can define a function to first check before decoding. Something like:
import pandas as pd
# Define the decode_if_bytes function
def decode_if_bytes(input_str):
    if isinstance(input_str, bytes):
        return input_str.decode('utf-8')
    return input_str
Decode df
# Apply the function to the dataframe
df = pd.DataFrame({'x':[b'80',123,491]})
df['x'] = df['x'].apply(decode_if_bytes)
print(df)
Output:
x
0 80
1 123
2 491
Decode another df
df = pd.DataFrame({'x':[b'123',b'111',b'110']})
df['x'] = df['x'].apply(decode_if_bytes)
print(df)
Output:
x
0 123
1 111
2 110
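If apply turns out to be too slow, here is a sketch of the same check written with a boolean mask instead (my own variation, not part of the answer above):
import pandas as pd

df = pd.DataFrame({'x': [b'80', 123, 491]})
is_bytes = df['x'].map(lambda v: isinstance(v, bytes))  # True where the value is bytes
df.loc[is_bytes, 'x'] = df.loc[is_bytes, 'x'].str.decode('utf-8')
print(df)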
One way to do what you've asked is to infer the dtype for the column and only attempt to convert it from bytes if it's non-numeric:
if not pd.api.types.is_numeric_dtype(df['x'].infer_objects().dtypes):
    df['x'] = df['x'].str.decode('utf-8')
Test code:
import pandas as pd
df = pd.DataFrame({'x':[b'123',b'111',b'110']})
print('','before',df,sep='\n')
if not pd.api.types.is_numeric_dtype(df['x'].infer_objects().dtypes):
    df['x'] = df['x'].str.decode('utf-8')
print('','after',df,sep='\n')
df = pd.DataFrame({'x':[80,123,491]})
print('','before',df,sep='\n')
if not pd.api.types.is_numeric_dtype(df['x'].infer_objects().dtypes):
    df['x'] = df['x'].str.decode('utf-8')
print('','after',df,sep='\n')
Output:
before
x
0 b'123'
1 b'111'
2 b'110'
after
x
0 123
1 111
2 110
before
x
0 80
1 123
2 491
after
x
0 80
1 123
2 491
UPDATE: If the column is only partially bytes, e.g. x containing b'80' and 123, this will work:
import pandas as pd
import numpy as np
df = pd.DataFrame({'x':[b'80',123,491]})
print('','before',df,sep='\n')
df.x = np.where(df.x.astype(np.int64) == df.x, df.x.astype(str).str.encode('utf-8'), df.x)
df.x = df.x.str.decode('utf-8')
print('','after',df,sep='\n')
Output:
before
x
0 b'80'
1 123
2 491
after
x
0 80
1 123
2 491

Count mutual followers in a relation table using pandas

I have a pandas DataFrame like so:
from_user to_user
0 123 456
1 894 135
2 179 890
3 456 123
Where each row contains two IDs that reflect whether the from_user "follows" the to_user. How can I count the total number of mutual followers in the DataFrame using pandas?
In the example above, the answer should be 1 (users 123 & 456).
One way is to use MultiIndex set operations:
In [11]: i1 = df.set_index(["from_user", "to_user"]).index
In [12]: i2 = df.set_index(["to_user", "from_user"]).index
In [13]: (i1 & i2).levels[0]
Out[13]: Int64Index([123, 456], dtype='int64')
To get the count you have to divide the length of this index by 2:
In [14]: len(i1 & i2) // 2
Out[14]: 1
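As a cross-check, the same count can be reproduced with a self-merge (my own sketch, not one of the approaches given here): each mutual pair matches twice, once in each direction.
import pandas as pd

df = pd.DataFrame({'from_user': [123, 894, 179, 456],
                   'to_user':   [456, 135, 890, 123]})

# Swap the column names and merge with the original; only mutual pairs survive.
swapped = df.rename(columns={'from_user': 'to_user', 'to_user': 'from_user'})
mutual = df.merge(swapped, on=['from_user', 'to_user'])
print(len(mutual) // 2)  # 1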
Another way to do it is to concat the values and sort them as strings,
then count how many times each value occurs:
# concat the values as string type
df['concat'] = df.from_user.astype(str) + df.to_user.astype(str)
# sort the string values of the concatenation
df['concat'] = df.concat.apply(lambda x: ''.join(sorted(x)))
# count the occurrences of each and subtract 1
count = (df.groupby('concat').size() -1).sum()
Out[64]: 1
Here is another slightly more hacky way to do this:
(df.loc[df.to_user.isin(df.from_user)]
   .assign(hacky=df.from_user * df.to_user)
   .drop_duplicates(subset='hacky', keep='first')
   .drop('hacky', 1))
from_user to_user
0 123 456
The whole multiplication hack exists to ensure we don't return both 123 --> 456 and 456 --> 123, since both are valid given the conditional we provide to .loc.

How to get the number of unique combinations of two columns that occur in a python pandas dataframe

Let's say I have this dataframe in pandas
a b
1 203 487
2 876 111
3 203 487
4 876 487
(There are more columns, not shown, that I don't care about.)
I know len(df.a.unique()) will return 2 to indicate there are two unique values of a, as will len(df.b.unique()). I want something similar to this, but returns the number of unique combinations of a AND b that occur. So in this example, I would want it to return 3.
Any guidance on how I can go about doing this is appreciated.
Use drop_duplicates:
print (df.drop_duplicates(['a','b']))
a b
1 203 487
2 876 111
4 876 487
a = len(df.drop_duplicates(['a','b']).index)
Or duplicated with inverting condition:
a = (~df.duplicated(['a','b'])).sum()
a = len(df.index) - df.duplicated(['a','b']).sum()
Or convert columns to strings and join together, then get nunique:
a = (df.a.astype(str) + '_' + df.b.astype(str)).nunique()
print (a)
3
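Another standard option (not shown in the answer above) is to count the groups directly:
import pandas as pd

df = pd.DataFrame({'a': [203, 876, 203, 876], 'b': [487, 111, 487, 487]})
print(df.groupby(['a', 'b']).ngroups)  # 3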
Do you count cases like below as two different combinations or one?
1) 'a' is 203 and 'b' is 487
2) 'a' is 487 and 'b' is 203
If you want to count them as two, just use drop_duplicates as jezrael said. If you want them to count as one unique combination, I would create a new column so it is always the smaller number_the bigger number, and do the drop_duplicates on this column.
import numpy as np
df['c']=np.where(df['a']<df['b'], \
df['a'].astype('str')+"_"+df['b'].astype('str'), \
df['b'].astype('str')+"_"+df['a'].astype('str'))
print(len(df.drop_duplicates('c')))

Pandas: replace values in dataframe from pivot_table

I have a DataFrame and a pivot table, and I need to replace some values in the DataFrame with values from the pivot table's columns.
Dataframe:
access_code ID cat1 cat2 cat3
g1gw8bzwelo83mhb 0433a3d29339a4b295b486e85874ec66 1 2
g0dgzfg4wpo3jytg 04467d3ae60fed134077a26ae33e0eae 1 2
g1gwui6r2ep471ht 06e3395c0b64a3168fbeab6a50cd8f18 1 2
g05ooypre5l87jkd 089c81ebeff5184e6563c90115186325 1
g0ifck11dix7avgu 0d254a81dca0ff716753b67a50c41fd7 1 2 3
Pivot Table:
type 1 2 \
access_code ID member_id
g1gw8bzwelo83mhb 0433a3d29339a4b295b486e85874ec66 1045794 1023 923 1 122
g05ooypre5l87jkd 089c81ebeff5184e6563c90115186325 768656 203 243 1 169
g1gwui6r2ep471ht 06e3395c0b64a3168fbeab6a50cd8f18 604095 392 919 1 35
g06q0itlmkqmz5cv f4a3b3f2fca77c443cd4286a4c91eedc 1457307 243 1
g074qx58cmuc1a2f 13f2674f6d5abc888d416ea6049b57b9 5637836 1
g0dgzfg4wpo3jytg 04467d3ae60fed134077a26ae33e0eae 5732738 111 2343 1
Desired output:
access_code ID cat1 cat2 cat3
g1gw8bzwelo83mhb 0433a3d29339a4b295b486e85874ec66 1023 923
g0dgzfg4wpo3jytg 04467d3ae60fed134077a26ae33e0eae 111 2343
g1gwui6r2ep471ht 06e3395c0b64a3168fbeab6a50cd8f18 392 919
g05ooypre5l87jkd 089c81ebeff5184e6563c90115186325 1
g0ifck11dix7avgu 0d254a81dca0ff716753b67a50c41fd7 1 2 3
If I use
df.ix[df.cat1 == 1] = pivot_table['1']
It returns error ValueError: cannot set using a list-like indexer with a different length than the value
As long as your dataframe is not exceedingly large, you can make it happen in some really ugly ways. I am sure someone else will provide you with a more elegant solution, but in the meantime this duct tape might point you in the right direction.
Keep in mind that in this case I did this with 2 dataframes instead of 1 dataframe and 1 pivot table, as I already had enough trouble formatting the dataframes from the textual data.
As there are empty fields in your data and my dataframes did not like this, first convert the empty fields to zeros.
df = df.replace(r'\s+', 0, regex=True)
Now ensure that your data actually consists of floats, else the comparisons will fail:
df[['cat1', 'cat2', 'cat3']] = df[['cat1', 'cat2', 'cat3']].astype(float)
And for the fizzly fireworks:
df.cat1.loc[df.cat1 == 1] = piv['1'].loc[df.loc[df.cat1 == 1].index].dropna()
df.cat1 = df.cat1.fillna(1)
df.cat2.loc[df.cat2 == 2] = piv['2'].loc[df.loc[df.cat2 == 2].index].dropna()
df.cat2 = df.cat2.fillna(2)
df = df.replace(0, ' ')
The fillna is just to recreate your intended output, in which you clearly did not process some lines yet. I guess this column-by-column NaN-filling will not happen in your actual use.
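As a side note on the ValueError from the question: it comes from assigning an entire pivot-table column to a filtered selection of the DataFrame, so the lengths differ. If both sides are restricted with the same mask, pandas aligns them on the index. A minimal sketch with toy data (the index labels here are made up; the column names 'cat1' and '1' follow the question):
import pandas as pd

df = pd.DataFrame({'cat1': [1, 1, 1]}, index=['k1', 'k2', 'k3'])
piv = pd.DataFrame({'1': [1023, 392, 203]}, index=['k1', 'k2', 'k3'])

mask = df['cat1'] == 1
df.loc[mask, 'cat1'] = piv.loc[mask, '1']  # both sides share the same index labels
print(df)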

Combine arbitrary number of columns into one in pandas

This question is a general version of a specific case asked about here.
I have a pandas dataframe with columns that contain integers. I'd like to concatenate all of those integers into a string in one column.
Given this answer, for particular columns, this works:
(dl['ungrd_dum'].map(str) +
dl['mba_dum'].map(str) +
dl['jd_dum'].map(str) +
dl['ma_phd_dum'].map(str))
But suppose I have many (hundreds) of such columns, whose names are in a list dummies. I'm certain there's some cool pythonic way of doing this with one magical line that will do it all. I've tried using map with dummies, but haven't yet been able to figure it out.
IIUC you should be able to do
df[dummies].astype(str).apply(lambda x: ''.join(x), axis=1)
Example:
In [12]:
df = pd.DataFrame({'a':np.random.randint(0,100, 5), 'b':np.arange(5), 'c':np.random.randint(0,10,5)})
df
Out[12]:
a b c
0 5 0 2
1 46 1 3
2 86 2 4
3 85 3 9
4 60 4 4
In [15]:
cols=['a','c']
df[cols].astype(str).apply(''.join, axis=1)
Out[15]:
0 52
1 463
2 864
3 859
4 604
dtype: object
EDIT
As @JohnE has pointed out, you could call sum instead, which will be faster:
df[cols].astype(str).sum(axis=1)
However, that will implicitly convert the dtype to float64 so you'd have to cast back to str again and slice the decimal point off if necessary:
df[cols].astype(str).sum(axis=1).astype(str).str[:-2]
Another option is to reduce the columns with string concatenation (in Python 3, reduce lives in functools):
from functools import reduce
from operator import add

reduce(add, (df[c].astype(str) for c in cols), "")
For example:
df = pd.DataFrame({'a':np.random.randint(0,100, 5),
'b':np.arange(5),
'c':np.random.randint(0,10,5)})
cols = ['a', 'c']
In [19]: df
Out[19]:
a b c
0 6 0 4
1 59 1 9
2 13 2 5
3 44 3 1
4 79 4 4
In [20]: reduce(add, (df[c].astype(str) for c in cols), "")
Out[20]:
0 64
1 599
2 135
3 441
4 794
dtype: object
The first thing you need to do is to convert your DataFrame of numbers into a DataFrame of strings, as efficiently as possible:
dl = dl.astype(str)
Then, you're in the same situation as this other question, and can use the same Series.str accessor techniques as in this answer:
.str.cat()
Using str.cat() you could do:
dl['result'] = dl[dl.columns[0]].str.cat([dl[c] for c in dl.columns[1:]], sep=' ')
str.join()
To use .str.join() you need a series of iterables, say tuples.
df['result'] = df[df.columns[1:]].apply(tuple, axis=1).str.join(' ')
Don't try the above with list instead of tuple, or the apply() method will return a DataFrame, and DataFrames don't have the .str accessor like Series do.
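For an arbitrary list of columns, DataFrame.agg gives the same result in one line (a sketch with made-up data, same idea as the join-based approaches above):
import pandas as pd

df = pd.DataFrame({'a': [5, 46], 'b': [0, 1], 'c': [2, 3]})
cols = ['a', 'c']
print(df[cols].astype(str).agg(''.join, axis=1))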
