Splitting columns in pandas - python

I have a pandas dataframe with the following structure:
df1 = pd.DataFrame({'id': 1, 'coords':{0: [(-43.21,-22.15),(-43.22,-22.22)]}})
How can I separate the values from the coords column so that the first item in each list forms the column called latitude and the second the column called longitude, as below?
id | latitude         | longitude
1  | (-43.21, -43.22) | (-22.15, -22.22)

Using join with a DataFrame built from the coords lists:
df1 = df1.join(pd.DataFrame(df1.coords.tolist(), index=df1.index, columns=['latitude', 'longitude']))
Out[138]:
id coords latitude longitude
0 1 [(-43.21, -22.15), (-43.22, -22.22)] (-43.21, -22.15) (-43.22, -22.22)

apply is a straightforward way:
df1['latitude'] = df1.coords.apply(lambda x: x[0])
df1['longitude'] = df1.coords.apply(lambda x: x[1])
Output:
id coords latitude longitude
0 1 [(-43.21, -22.15), (-43.22, -22.22)] (-43.21, -22.15) (-43.22, -22.22)

Simply using the .str accessor
df1['latitude'] = df1['coords'].str[0]
df1['longitude'] = df1['coords'].str[1]
Time difference:
df1['latitude'] = df1['coords'].str[0]
# 539 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df1['latitude'] = df1.coords.apply(lambda x: x[0])
# 624 µs ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Take the tuple for latitude:
lat = [(x[0][0], x[1][0]) for x in df1['coords'].values]
df1['latitude'] = lat
Same for longitude:
longt = [(x[0][1], x[1][1]) for x in df1['coords'].values]
df1['longitude'] = longt
Then drop the coords column:
df1 = df1.drop(columns='coords')
Hope this helps!
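For completeness, a minimal sketch of a zip-based variant that builds both tuple columns in one pass, assuming every list in coords holds (lat, lon) pairs:
# transpose each list of (lat, lon) pairs into (latitudes, longitudes)
lat_lon = pd.DataFrame(
    [tuple(zip(*coords)) for coords in df1['coords']],
    index=df1.index,
    columns=['latitude', 'longitude'],
)
df1 = df1.drop(columns='coords').join(lat_lon)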

Related

pandas split string and extract up to n index position

I have input data like the below stored in a dataframe column:
active_days_revenue
active_days_rate
total_revenue
gap_days_rate
I would like to do the below:
a) split the string using the _ delimiter
b) extract the first n elements from the split
So, I tried the below:
df['text'].str.split('_').str[:1] # but this doesn't do what I want
df['text'].str.split('_').str[0] # this works but returns only the 1st element
I expect my output like the below. Instead of just getting the item at index 0, I would like to get the items from index 0 up to index 1:
active_days
active_days
total_revenue
gap_days
You can use str.extract with a dynamic regex (fastest):
N = 2
df['out'] = df['text'].str.extract(fr'([^_]+(?:_[^_]+){{,{N-1}}})', expand=False)
Or slicing and agg:
df['out'] = df['text'].str.split('_').str[:2].agg('_'.join)
Or str.extractall and groupby.agg:
df['out'] = df['text'].str.extractall('([^_]+)')[0].groupby(level=0).agg(lambda x: '_'.join(x.head(2)))
Output:
text out
0 active_days_revenue active_days
1 active_days_rate active_days
2 total_revenue total_revenue
3 gap_days_rate gap_days
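If you want the slice variant to follow the same N instead of a hardcoded 2, str.join on the split lists is one option; a small sketch equivalent to the agg('_'.join) call above:
df['out'] = df['text'].str.split('_').str[:N].str.join('_')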
timings
On 4k rows:
# extract
2.17 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# split/slice/agg
3.56 ms ± 811 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# extractall
361 ms ± 30.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using it as a grouper for columns:
import re
N = 2
df1.groupby(lambda x: m.group() if (m:=re.match(fr'([^_]+(?:_[^_]+){{,{N-1}}})', x)) else x, axis=1, sort=False)
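A GroupBy object on its own does nothing until you aggregate. For example, to sum together columns whose names share the same truncated prefix, a sketch reusing re and N from above (note that axis=1 in groupby is deprecated in newer pandas versions):
summed = df1.groupby(
    lambda x: m.group() if (m := re.match(fr'([^_]+(?:_[^_]+){{,{N-1}}})', x)) else x,
    axis=1, sort=False,
).sum()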

How to apply changes made to a subset dataframe to the source dataframe

I'm trying to determine and flag duplicate 'Sample' values in a dataframe using groupby with lambda:
rdtRows["DuplicateSample"] = False
rdtRowsSampleGrouped = rdtRows.groupby(['Sample']).filter(lambda x: len(x) > 1)
rdtRowsSampleGrouped["DuplicateSample"] = True
# How to get flag changes made on rdtRowsSampleGrouped to apply to rdtRows??
How do I make the "DuplicateSample" changes on rdtRowsSampleGrouped apply to the source rdtRows data? I'm stumped :(
Use GroupBy.transform with GroupBy.size:
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
Or use Series.duplicated with keep=False if need faster solution:
df['DuplicateSample'] = df['Sample'].duplicated(keep=False)
Performance on some sample data (real data will differ, depending on the number of rows and the number of duplicated values):
import numpy as np
import pandas as pd

np.random.seed(2020)
N = 100000
df = pd.DataFrame({'Sample': np.random.randint(100000, size=N)})
In [51]: %timeit df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
17 ms ± 50 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit df['DuplicateSample1'] = df['Sample'].duplicated(keep=False)
3.73 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# @Stef's solution below is unfortunately about 2734 times slower than the duplicated solution
In [53]: %timeit df['DuplicateSample2'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
10.2 s ± 517 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use transform:
import pandas as pd
df = pd.DataFrame({'Sample': [1,2,2,3,4,4]})
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
Result:
Sample DuplicateSample
0 1 False
1 2 True
2 2 True
3 3 False
4 4 True
5 4 True
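To answer the literal question of propagating the flag from the filtered subset back to the source frame, one option is to assign through the subset's index; a minimal sketch using the names from the question:
rdtRows["DuplicateSample"] = False
# filter() keeps the original index labels, so they can be used to locate the same rows in rdtRows
dup_index = rdtRows.groupby('Sample').filter(lambda x: len(x) > 1).index
rdtRows.loc[dup_index, "DuplicateSample"] = True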

Replacing cell values with column header value

I've got a dataframe:
a-line abstract ... zippered
0 0 ... 0
0 1 ... 0
0 0 ... 1
Where the value of the cell is 1 I need to replace it with the header name.
df.dtypes returns Length: 1000, dtype: object
I have tried df.apply(lambda x: x.astype(object).replace(1, x.name))
but get TypeError: Invalid "to_replace" type: 'int'
Other attempts:
df.apply(lambda x: x.astype(object).replace(str(1), x.name)) # TypeError: Invalid "to_replace" type: 'str'
df.apply(lambda x: x.astype(str).replace(str(1), x.name)) # Invalid "to_replace" type: 'str'
The key idea in all three solutions below is to loop through the columns. The first method uses replace:
for col in df:
    df[col] = df[col].replace(1, df[col].name)
Alternatively, per your attempt to apply a lambda:
for col in df_new:
    df_new[col] = df_new[col].astype(str).apply(lambda x: x.replace('1', df_new[col].name))
Finally, this is with np.where:
for col in df_new:
    df_new[col] = np.where(df_new[col] == 1, df_new[col].name, df_new[col])
Output for all three:
a-line abstract ... zippered
0 0 0 ... 0
1 0 abstract ... 0
2 0 0 ... zippered
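If you would rather stay close to the apply attempt from the question, Series.mask sidesteps the replace type error; a minimal sketch (assuming the cells hold numeric 0/1; compare with '1' instead if they are strings):
# replace each cell equal to 1 with its own column name, column by column
df = df.apply(lambda s: s.mask(s.eq(1), s.name))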
You might consider building on this idea:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]],
                  columns=["a", "b", "c"])
df = pd.DataFrame(np.where(df == 1, df.columns, df),
                  columns=df.columns)
UPDATE: Timing
@David Erickson's solution is perfectly fine, but you can avoid the loop, in particular if you have many columns.
Generate data
import pandas as pd
import numpy as np
n = 1_000
columns = ["{:04d}".format(i) for i in range(n)]
df = pd.DataFrame(np.random.randint(0, high=2, size=(4, n)),
                  columns=columns)
# we test against the same dataframe
df_bk = df.copy()
David's solution #1
%%timeit -n10
for col in df:
    df[col] = df[col].replace(1, df[col].name)
1.01 s ± 35.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
David's solution #2
%%timeit -n10
df = df_bk.copy()
for col in df:
    df[col] = df[col].astype(str).apply(lambda x: x.replace('1', df[col].name))
890 ms ± 24.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
David's solution #3
%%timeit -n10
for col in df:
    df[col] = np.where(df[col] == 1, df[col].name, df[col])
886 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Avoiding loops
%%timeit -n10
df = df_bk.copy()
df = pd.DataFrame(np.where(df == 1, df.columns, df),
                  columns=df.columns)
455 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

How to improve searching an index in a dataframe

Given a pandas dataframe with a sorted timestamp index.
I have a label and I need to find the closest index to that label.
The match also needs to be at a smaller timestamp, so the search should only consider timestamps earlier than the label.
Here is my code:
import pandas as pd
import datetime
data = [i for i in range(100)]
dates = pd.date_range(start="01-01-2018", freq="min", periods=100)
dataframe = pd.DataFrame(data, dates)
label = "01-01-2018 00:10:01"
method = "pad"
tol = datetime.timedelta(seconds=60)
idx = dataframe.index.get_loc(key=label, method="pad", tolerance=tol)
print("Closest idx:"+str(idx))
print("Closest date:"+str(dataframe.index[idx]))
The search is too slow. Is there a way to improve it?
To improve performance, I recommend a transformation of what you're searching. Instead of using get_loc, you can convert your DatetimeIndex to Unix time and use np.searchsorted on the underlying numpy array (as the name implies, this requires a sorted index).
get_loc:
(Your current approach)
label = "01-01-2018 00:10:01"
tol = datetime.timedelta(seconds=60)
idx = dataframe.index.get_loc(key=label, method="pad", tolerance=tol)
print(dataframe.iloc[idx])
0 10
Name: 2018-01-01 00:10:00, dtype: int64
And its timings:
%timeit dataframe.index.get_loc(key=label, method="pad", tolerance=tol)
2.03 ms ± 81.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.searchsorted:
arr = dataframe.index.astype(int) // 10**9
l = pd.to_datetime(label).timestamp()
idx = np.maximum(np.searchsorted(arr, l, side='left') - 1, 0)
print(dataframe.iloc[idx])
0 10
Name: 2018-01-01 00:10:00, dtype: int64
And the timings:
%timeit np.maximum(np.searchsorted(arr, l, side='left') - 1, 0)
56.6 µs ± 979 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
(I didn't include the setup costs, because the initial array creation should be something you do once, then use for every single query, but even if I did include the setup costs, this method is faster):
%%timeit
arr = dataframe.index.astype(int) // 10**9
l = pd.to_datetime(label).timestamp()
np.maximum(np.searchsorted(arr, l, side='left') - 1, 0)
394 µs ± 3.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The above method does not enforce a tolerance of 60s, although this is trivial to check:
>>> np.abs(arr[idx]-l)<60
True
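Note that in newer pandas versions get_loc no longer accepts the method and tolerance keywords; a roughly equivalent sketch using Index.get_indexer, reusing label and tol from above, would be:
# get_indexer returns -1 when no earlier timestamp lies within the tolerance
idx = dataframe.index.get_indexer([pd.Timestamp(label)], method='pad', tolerance=tol)[0]
print(dataframe.iloc[idx])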

Pandas: Replace a string with 'other' if it is not present in a list of strings

I have the following data frame, df, with column 'Class'
Class
0 Individual
1 Group
2 A
3 B
4 C
5 D
6 Group
I would like to replace everything apart from Group and Individual with 'Other', so the final data frame is
Class
0 Individual
1 Group
2 Other
3 Other
4 Other
5 Other
6 Group
The dataframe is huge, with over 600K rows. What is the best way to look for values other than 'Group' and 'Individual' and replace them with 'Other'?
I have seen examples for replace, such as:
df['Class'] = df['Class'].replace({'A':'Other', 'B':'Other'})
but I have far too many unique values to list them individually. I would rather just exclude 'Group' and 'Individual' and replace everything else.
I think you need:
df['Class'] = np.where(df['Class'].isin(['Individual','Group']), df['Class'], 'Other')
print (df)
Class
0 Individual
1 Group
2 Other
3 Other
4 Other
5 Other
6 Group
Another solution (slower):
m = (df['Class'] == 'Individual') | (df['Class'] == 'Group')
df['Class'] = np.where(m, df['Class'], 'Other')
Another solution:
df['Class'] = df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
Performance (on real data it depends on the number of replacements):
#[700000 rows x 1 columns]
df = pd.concat([df] * 100000, ignore_index=True)
#print (df)
In [208]: %timeit df['Class1'] = np.where(df['Class'].isin(['Individual','Group']), df['Class'], 'Other')
25.9 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [209]: %timeit df['Class2'] = np.where((df['Class'] == 'Individual') | (df['Class'] == 'Group'), df['Class'], 'Other')
120 ms ± 6.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [210]: %timeit df['Class3'] = df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
95.7 ms ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [211]: %timeit df.loc[~df['Class'].isin(['Individual', 'Group']), 'Class'] = 'Other'
97.8 ms ± 6.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Another approach could be:
df.loc[~df['Class'].isin(['Individual', 'Group']), 'Class'] = 'Other'
You can do it this way, for example:
# get the list of unique items
others = df['Class'].unique().tolist()
# remove your known classes
others.remove('Individual')
others.remove('Group')
# then select all the remaining rows and replace their class value
df.loc[df['Class'].isin(others), 'Class'] = 'Other'
The principle is the same: exclude the known classes and replace everything else.
You can use pd.Series.where:
df['Class'].where(df['Class'].isin(['Individual', 'Group']), 'Other', inplace=True)
print(df)
Class
0 Individual
1 Group
2 Other
3 Other
4 Other
5 Other
6 Group
This should be efficient versus map + fillna:
df = pd.concat([df] * 100000, ignore_index=True)
%timeit df['Class'].where(df['Class'].isin(['Individual', 'Group']), 'Other')
# 60.3 ms per loop
%timeit df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
# 133 ms per loop
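If you prefer to avoid inplace=True, the same where call can be written as a plain assignment; a minimal equivalent:
df['Class'] = df['Class'].where(df['Class'].isin(['Individual', 'Group']), 'Other')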
Another way, using apply:
df['Class'] = df['Class'].apply(lambda cl: cl if cl in ["Individual", "Group"] else "Other")
