Split / extract a Pandas Column of nested tuples into multiple columns - python

I am currently working with an .xml file that I have converted into a data frame that looks like so:
I want to split the Coordinates column into 4 separate columns with the following layout:
to_longitude, to_latitude, from_longitude, from_latitude
I am attempting to do this with the code below:
pd.concat([df[[0]], df[1].str.split(',', expand=True)], axis=1)
However, this gives me the following error:
KeyError: "None of [Int64Index([0], dtype='int64')] are in the [columns]"
My question is what am I doing wrong and how can I correct my code to make it work as intended?

Consider using Pandas apply function -
def my_func(record):
    record['to_longitude'] = record['Coordinates'][0][0]
    record['to_latitude'] = record['Coordinates'][0][1]
    record['from_longitude'] = record['Coordinates'][1][0]
    record['from_latitude'] = record['Coordinates'][1][1]
    return record
new_df = df.apply(my_func, axis=1)
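If you don't want to keep the original Coordinates column (my addition, not part of the answer above), you could drop it after the apply:
new_df = df.apply(my_func, axis=1).drop(columns=['Coordinates'])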

You can use the string accessor str[] to get the values of nested tuples to set up the 4 columns, as follows:
df['to_longitude'] = df['Coordinates'].str[0].str[0]
df['to_latitude'] = df['Coordinates'].str[0].str[1]
df['from_longitude'] = df['Coordinates'].str[1].str[0]
df['from_latitude'] = df['Coordinates'].str[1].str[1]
Demo
data = {'Link': ['abd', 'abe'],
'Coordinates': [((-4.21, 55.85), (-4.22, 55.86)), ((-4.25, 55.82), (-4.26, 55.83))]}
df = pd.DataFrame(data)
Link Coordinates
0 abd ((-4.21, 55.85), (-4.22, 55.86))
1 abe ((-4.25, 55.82), (-4.26, 55.83))
df['to_longitude'] = df['Coordinates'].str[0].str[0]
df['to_latitude'] = df['Coordinates'].str[0].str[1]
df['from_longitude'] = df['Coordinates'].str[1].str[0]
df['from_latitude'] = df['Coordinates'].str[1].str[1]
Link Coordinates to_longitude to_latitude from_longitude from_latitude
0 abd ((-4.21, 55.85), (-4.22, 55.86)) -4.21 55.85 -4.22 55.86
1 abe ((-4.25, 55.82), (-4.26, 55.83)) -4.25 55.82 -4.26 55.83
Execution time comparison:
Test data of 40,000 rows
df2 = pd.concat([df] * 20000, ignore_index=True)
Solution 1: Tom Ron's solution
def my_func(record):
    record['to_longitude'] = record['Coordinates'][0][0]
    record['to_latitude'] = record['Coordinates'][0][1]
    record['from_longitude'] = record['Coordinates'][1][0]
    record['from_latitude'] = record['Coordinates'][1][1]
    return record
%timeit new_df = df2.apply(my_func, axis=1)
Result:
1min 16s ± 2.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 2: SeaBean's solution
%%timeit
df2['to_longitude'] = df2['Coordinates'].str[0].str[0]
df2['to_latitude'] = df2['Coordinates'].str[0].str[1]
df2['from_longitude'] = df2['Coordinates'].str[1].str[0]
df2['from_latitude'] = df2['Coordinates'].str[1].str[1]
Result:
165 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Solution 3: Anurag Dabas' solution
%%timeit
cols=['to_longitude','to_latitude','from_longitude','from_latitude']
out=pd.DataFrame(np.hstack(df2['Coordinates'].values),columns=cols)
#OR
#out=pd.DataFrame(np.concatenate(df['Coordinates'].values,axis=1),columns=cols)
Result:
Couldn't benchmark this solution, since both options raise an error:
ValueError: Shape of passed values is (2, 80000), indices imply (2, 4)
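A possible fix (my own sketch, not part of the original answer and not benchmarked): np.hstack stacks the 2x2 coordinate pairs side by side into a (2, 80000) array, so building a per-row array and reshaping it gives the expected (40000, 4) shape instead:
import numpy as np

cols = ['to_longitude', 'to_latitude', 'from_longitude', 'from_latitude']
# each row's Coordinates is a ((lon, lat), (lon, lat)) pair -> array of shape (n_rows, 2, 2)
coords = np.array(df2['Coordinates'].tolist())
# flatten the last two axes so each row becomes [to_lon, to_lat, from_lon, from_lat]
out = pd.DataFrame(coords.reshape(len(df2), -1), columns=cols, index=df2.index)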
Summary
Solution 1: Tom Ron's solution
1min 16s ± 2.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 2: SeaBean's solution
165 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Solution 3: Anurag Dabas' solution
Couldn't benchmark on the large dataset since both options raise an error
For the first two solutions with benchmarking results, SeaBean's solution is about 460 times faster than Tom Ron's solution (165 ms vs 1 min 16 s) for 40,000 rows of data.
The speed-up comes from using vectorized Pandas operations (implemented in optimized C/Cython code) instead of apply() on axis=1, which under the hood is a slow Python-level loop.

Related

Concat strings from dataframe columns in a loop (Python 3.8)

Suppose I have a DataFrame "DS_df" containing strings and numbers. The three columns "LAultimateparentcountry", "borrowerultimateparentcountry" and "tot" form a relationship.
How can I create a dictionary out of those three columns (for the entire dataset, while order matters)? I would need to access the two countries as one variable, and tot as another. I've tried the code below so far, but this merely yields me a list with separate items. For some reason, I am also not able to get .join to work, as the df is quite big (+900k rows).
new_list = []
for i, row in DS_df.iterrows():
    new_list.append(row["LAultimateparentcountry"])
    new_list.append(row["borrowerultimateparentcountry"])
    new_list.append(row["tot"])
Preferred outcome would be a dictionary, where I could access "Germany_Switzerland": 56708 for example. Any help or advice is much appreciated.
Cheers
You can use a dict this way:
countries_map = {}
for index, row in DS_df.iterrows():
    curr_rel = f'{row["LAultimateparentcountry"]}_{row["borrowerultimateparentcountry"]}'
    countries_map[curr_rel] = row["tot"]
If you don't want to overwrite the values of existing keys
(i.e. keep their first appearance):
countries_map = {}
for index, row in DS_df.iterrows():
    curr_rel = f'{row["LAultimateparentcountry"]}_{row["borrowerultimateparentcountry"]}'
    if curr_rel not in countries_map:
        countries_map[curr_rel] = row["tot"]
When performing operations on a dataframe, it's usually better to think of a solution column-wise rather than row-wise.
If your dataframe has 900k+ rows, applying vectorized operations is a good option.
Below are two solutions:
Using pd.Series + to_dict():
pd.Series(DS_df.tot.values, index=DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_")).to_dict()
Using zip() + dict():
dict(zip(DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_"), DS_df.tot))
Test Dataframe:
DS_df = pd.DataFrame({
    'LAultimateparentcountry': ['India', 'Germany', 'India'],
    'borrowerultimateparentcountry': ['France', 'Ireland', 'France'],
    'tot': [56708, 87902, 91211]
})
DS_df
LAultimateparentcountry borrowerultimateparentcountry tot
0 India France 56708
1 Germany Ireland 87902
2 India France 91211
Output of both solutions:
{'India_France': 91211, 'Germany_Ireland': 87902}
If the formed key has duplicates then the value will be updated.
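If you instead want to keep the first occurrence of each key (as the second loop in the earlier answer does), one option (my own sketch, not benchmarked here) is to drop duplicate country pairs before building the dict:
first_only = DS_df.drop_duplicates(
    subset=['LAultimateparentcountry', 'borrowerultimateparentcountry'], keep='first')
keys = first_only.LAultimateparentcountry.str.cat(first_only.borrowerultimateparentcountry, sep='_')
countries_map = dict(zip(keys, first_only.tot))
# with the test dataframe above this gives {'India_France': 56708, 'Germany_Ireland': 87902}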
Which solution is more performant?
short answer -
zip() + dict() # if the rows are approx. below 1000000
pd.Series + to_dict() # if the rows are approx. above 1000000
Long answer - Below are the tests:
Test with 30 rows and 3 columns
zip() + dict()
%timeit dict(zip(DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_"), DS_df.tot))
297 µs ± 21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
pd.Series + to_dict():
%timeit pd.Series(DS_df.tot.values, index=DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_")).to_dict()
506 µs ± 35.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Test with 6291456 rows and 3 columns
pd.Series + to_dict()
%timeit pd.Series(DS_df.tot.values, index=DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_")).to_dict()
3.92 s ± 77.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
zip() + dict()
%timeit dict(zip(DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_"), DS_df.tot))
3.97 s ± 226 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Create a new dataframe based on two columns of value in pandas dataframe

I have a dataframe df1 in python as below:
Type Category
a 1
b 2
c 3
d 4
Expected output:
Type
a/1
b/2
c/3
d/4
The actual dataframe is way larger than this, so I can't type out every cell for the new dataframe.
How can I extract the columns and output them to another dataframe, with the values separated by '/'? Maybe using some for loop?
Using str.cat
The right pandas-y way to proceed is by using str.cat
df['Type'] = df.Type.str.cat(others=df.Category.astype(str), sep='/')
others contains the pd.Series to concatenate, and sep the separator to use.
Result
Type
0 a/1
1 b/2
2 c/3
3 d/4
Performance comparison
%%timeit
df.Type.str.cat(others=df.Category.astype(str), sep='/')
>> 286 µs ± 449 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df['Type'] + '/' + df['Category'].astype(str)
>> 348 µs ± 5.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Both solutions give the same result, but str.cat is roughly 20% faster.
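For completeness, the same output can also be produced by casting both columns to string and joining them row-wise with agg (my own sketch, not part of the benchmark above); this is typically slower than str.cat because it runs a Python-level join per row:
df['Type'] = df[['Type', 'Category']].astype(str).agg('/'.join, axis=1)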

Efficiently aggregate results into a Python Data Structure

I often find myself looping over some long INPUT list (or dataframe, or dictionary). Per iteration I do some calculations on the input data, I then push the results into some OUTPUT data structure. Often the final output is a dataframe (since it is convenient to deal with).
Below are two methods that loop over a long list, and aggregate some dummy results into a dataframe. Approach 1 is very slow (~3 seconds per run), whereas Approach 2 is very fast (~18 ms per run). Approach 1 is not good, because it is slow. Approach 2 is faster, but it is not ideal either, because it effectively "caches" data in a local file (and then relies on pandas to read that file back in very quickly). Ideally, we do everything in memory.
What approaches can people suggest to efficiently aggregate results? Bonus: And what if we don't know the exact size/length of our output structure (e.g. the actual output size may exceed the initial size estimate)? Any ideas appreciated.
import time
import pandas as pd
def run1(long_list):
    my_df = pd.DataFrame(columns=['A', 'B', 'C'])
    for el in long_list:
        my_df.loc[len(my_df)] = [el, el + 1, 1 / el]  # Dummy calculations
    return my_df

def run2(long_list):
    with open('my_file.csv', 'w') as f:
        f.write('A,B,C\n')
        for el in long_list:
            f.write(f'{el},{el + 1},{1 / el}\n')  # Dummy calculations
    return pd.read_csv('my_file.csv')
long_list = range(1, 2000)
%timeit df1 = run1(long_list) # 3 s ± 349 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2 = run2(long_list) # 18 ms ± 697 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can do this by creating and then dropping a dummy input column and doing all of the calculations directly in pandas:
def func(long_list):
    my_df = pd.DataFrame(long_list, columns=['input'])
    my_df = my_df.assign(
        A=my_df.input,
        B=my_df.input + 1,
        C=1 / my_df.input)
    return my_df.drop('input', axis=1)
Comparing the times:
%timeit df1 = run1(long_list)
%timeit df2 = run2(long_list)
%timeit df3 = func(long_list)
3.81 s ± 6.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.54 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.19 ms ± 3.95 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pros:
All in memory
Really fast
Easy to read
Cons:
Probably not as fast as vectorized Numpy operations
You can directly build a DataFrame from a list of lists:
def run3(long_list):
    return pd.DataFrame([[el, el + 1, 1 / el] for el in long_list],
                        columns=['A', 'B', 'C'])
It should be much faster than the first one, and still faster than the second one, because it does not use disk I/O.
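Regarding the bonus question (when the final size is not known up front), a common pattern (my own sketch; the name run4 is just for illustration) is to accumulate plain Python dicts in a list, which grows dynamically, and build the DataFrame once at the end:
def run4(long_list):
    rows = []  # a plain list grows as needed, so no size estimate is required
    for el in long_list:
        rows.append({'A': el, 'B': el + 1, 'C': 1 / el})  # dummy calculations
    return pd.DataFrame(rows)  # single DataFrame construction at the end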

how to flatten array in pandas dataframe

Assuming I have a pandas dataframe such as
df_p = pd.DataFrame(
    {'name_array':
        [[20130101, 320903902, 239032902],
         [20130101, 3253453, 239032902],
         [65756, 4342452, 32425432523]],
     'name': ['a', 'a', 'c']})
I want to extract a series that contains the flattened arrays from each row, whilst preserving the order.
The expected result is a pandas.core.series.Series
This question is not a duplicate because my expected output is a pandas Series, and not a dataframe.
The solutions using melt are slower than OP's original method, which they shared in the answer here, especially after the speedup from my comment on that answer.
I created a larger dataframe to test on:
df = pd.DataFrame({'name_array': np.random.rand(1000, 3).tolist()})
And timing the two solutions using melt on this dataframe yield:
In [16]: %timeit pd.melt(df.name_array.apply(pd.Series).reset_index(), id_vars=['index'],value_name='name_array').drop('variable', axis=1).sort_values('index')
173 ms ± 5.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [17]: %timeit df['name_array'].apply(lambda x: pd.Series([i for i in x])).melt().drop('variable', axis=1)['value']
175 ms ± 4.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The OP's method with the speedup I suggested in the comments:
In [18]: %timeit pd.Series(np.concatenate(df['name_array']))
18 ms ± 887 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And finally, the fastest solution as provided here but modified to provide a series instead of dataframe output:
In [14]: from itertools import chain
In [15]: %timeit pd.Series(list(chain.from_iterable(df['name_array'])))
402 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This last method is roughly 400x faster than the melt() solutions and roughly 45x faster than np.concatenate().
This is the solution I've figured out. Don't know if there are more efficient ways.
df_p = pd.DataFrame(
    {'name_array':
        [[20130101, 320903902, 239032902],
         [20130101, 3253453, 239032902],
         [65756, 4342452, 32425432523]],
     'name': ['a', 'a', 'c']})
data = pd.DataFrame({'column': np.concatenate(df_p['name_array'].values)})['column']
output:
0 20130101
1 320903902
2 239032902
3 20130101
4 3253453
5 239032902
6 65756
7 4342452
8 32425432523
Name: column, dtype: int64
You can use pd.melt:
pd.melt(df_p.name_array.apply(pd.Series).reset_index(),
        id_vars=['index'],
        value_name='name_array') \
  .drop('variable', axis=1) \
  .sort_values('index')
OUTPUT:
index name_array
0 20130101
0 320903902
0 239032902
1 20130101
1 3253453
1 239032902
2 65756
2 4342452
2 32425432523
You can flatten the column's list of lists and create a Series from that, in this way:
pd.Series([element for row in df_p.name_array for element in row])
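As a side note (my own addition, not part of the benchmarks above): on pandas 0.25+ you can also use explode, which keeps the original index so the flattened values stay aligned with the name column, though it is typically slower than the concatenate/chain approaches:
df_p.explode('name_array')['name_array']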

Vectorized way for applying a function to a dataframe to create lists

I have seen a few questions like these:
Vectorized alternative to iterrows,
Faster alternative to iterrows, Pandas: Alternative to iterrow loops,
for loop using iterrows in pandas, python: using .iterrows() to create columns, Iterrows performance. But it seems like each one is a unique case rather than a generalized approach.
My question is also about .iterrows.
I am trying to pass the first and second value of each row to a function and create a list out of it.
What I have:
I have a pandas DataFrame with two columns that look like this.
I.D Score
1 11 26
3 12 26
5 13 26
6 14 25
What I did:
where Points is a function I defined earlier.
my_points = [Points(int(row[0]),row[1]) for index, row in score.iterrows()]
What I am trying to do:
The faster and vectorized form of the above.
Try list comprehension:
score = pd.concat([score] * 1000, ignore_index=True)
def Points(a, b):
    return (a, b)
In [147]: %timeit [Points(int(a),b) for a, b in zip(score['I.D'],score['Score'])]
1.3 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [148]: %timeit [Points(int(row[0]),row[1]) for index, row in score.iterrows()]
259 ms ± 5.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [149]: %timeit [Points(int(row[0]),row[1]) for row in score.itertuples(index=False)]
3.64 ms ± 80.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Have you ever tried the method .itertuples()?
my_points = [Points(int(row[0]), row[1]) for row in score.itertuples(index=False)]
It is a faster way to iterate over a pandas dataframe.
I hope it helps.
The question is actually not about how you iterate through a DataFrame and return a list, but rather how you can apply a function to values in a DataFrame by column.
You can use pandas.DataFrame.apply with axis set to 1:
df.apply(func, axis=1)
To put the results in a list, it depends on what your function returns, but you could:
df.apply(Points, axis=1).tolist()
If you want to apply on only some columns:
df[['Score', 'I.D']].apply(Points, axis=1)
If you want to apply a function that takes multiple args, you can use numpy.vectorize for speed:
np.vectorize(Points)(df['Score'], df['I.D'])
Or a lambda:
df.apply(lambda x: Points(x['Score'], x['I.D']), axis=1).tolist()
