How to flatten an array in a pandas dataframe - python

Assuming I have a pandas dataframe such as
df_p = pd.DataFrame(
    {'name_array':
         [[20130101, 320903902, 239032902],
          [20130101, 3253453, 239032902],
          [65756, 4342452, 32425432523]],
     'name': ['a', 'a', 'c']})
I want to extract a Series that contains the flattened arrays from each row, preserving the order.
The expected result is a pandas.core.series.Series
This question is not a duplicate because my expected output is a pandas Series, and not a dataframe.

The solutions using melt are slower than the OP's original method, shared in their own answer below, especially after the speedup I suggested in a comment on that answer.
I created a larger dataframe to test on:
df = pd.DataFrame({'name_array': np.random.rand(1000, 3).tolist()})
Timing the two melt-based solutions on this dataframe yields:
In [16]: %timeit pd.melt(df.name_array.apply(pd.Series).reset_index(), id_vars=['index'], value_name='name_array').drop('variable', axis=1).sort_values('index')
173 ms ± 5.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [17]: %timeit df['name_array'].apply(lambda x: pd.Series([i for i in x])).melt().drop('variable', axis=1)['value']
175 ms ± 4.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The OP's method with the speedup I suggested in the comments:
In [18]: %timeit pd.Series(np.concatenate(df['name_array']))
18 ms ± 887 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And finally, the fastest solution as provided here, modified to produce a Series instead of a DataFrame:
In [14]: from itertools import chain
In [15]: %timeit pd.Series(list(chain.from_iterable(df['name_array'])))
402 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This last method is roughly 430x faster than either melt() variant (402 µs vs ~174 ms) and about 45x faster than np.concatenate() (402 µs vs 18 ms).
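For completeness, newer pandas versions also offer Series.explode, which flattens straight to a Series while preserving order. A minimal sketch, assuming pandas >= 1.1 for the ignore_index argument; note that explode returns object dtype, so cast afterwards if you need numeric values, and benchmark before assuming it matches the chain approach above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'name_array': np.random.rand(1000, 3).tolist()})

# Each row's list becomes its own set of rows; ignore_index rebuilds a clean RangeIndex.
flat = df['name_array'].explode(ignore_index=True).astype('float64')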

This is the solution I've figured out. Don't know if there are more efficient ways.
df_p = pd.DataFrame(
    {'name_array':
         [[20130101, 320903902, 239032902],
          [20130101, 3253453, 239032902],
          [65756, 4342452, 32425432523]],
     'name': ['a', 'a', 'c']})
data = pd.DataFrame({'column': np.concatenate(df_p['name_array'].values)})['column']
output:
0       20130101
1      320903902
2      239032902
3       20130101
4        3253453
5      239032902
6          65756
7        4342452
8    32425432523
Name: column, dtype: int64
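As the benchmarks above show, the intermediate DataFrame is unnecessary; building the Series directly from the concatenated arrays gives the same values:

# Equivalent, without the round-trip through a DataFrame (the Series just has no name).
data = pd.Series(np.concatenate(df_p['name_array'].values))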

You can use pd.melt:
pd.melt(df_p.name_array.apply(pd.Series).reset_index(),
        id_vars=['index'],
        value_name='name_array') \
  .drop('variable', axis=1) \
  .sort_values('index')
OUTPUT:
   index   name_array
       0     20130101
       0    320903902
       0    239032902
       1     20130101
       1      3253453
       1    239032902
       2        65756
       2      4342452
       2  32425432523

You can flatten the column's list of lists with a comprehension and build a Series from it, like this:
pd.Series([element for row in df_p.name_array for element in row])

Related

Split / extract a Pandas Column of nested tuples into multiple columns

I am currently working with an .xml file that I have converted into a data frame with a Link column and a Coordinates column of nested tuples (see the demo data further down for its structure).
I want to split the Coordinates column into 4 separate columns with the following layout:
to_longitude, to_latitude, from_longitude, from_latitude
I am attempting to do this with the code below:
pd.concat([df[[0]], df[1].str.split(',', expand=True)], axis=1)
However, this gives me the following error:
KeyError: "None of [Int64Index([0], dtype='int64')] are in the [columns]"
My question is what am I doing wrong and how can I correct my code to make it work as intended?
Consider using the Pandas apply function:
def my_func(record):
    record['to_longitude'] = record['Coordinates'][0][0]
    record['to_latitude'] = record['Coordinates'][0][1]
    record['from_longitude'] = record['Coordinates'][1][0]
    record['from_latitude'] = record['Coordinates'][1][1]
    return record

new_df = df.apply(my_func, axis=1)
You can use the string accessor str[] to get the values of nested tuples to set up the 4 columns, as follows:
df['to_longitude'] = df['Coordinates'].str[0].str[0]
df['to_latitude'] = df['Coordinates'].str[0].str[1]
df['from_longitude'] = df['Coordinates'].str[1].str[0]
df['from_latitude'] = df['Coordinates'].str[1].str[1]
Demo
data = {'Link': ['abd', 'abe'],
        'Coordinates': [((-4.21, 55.85), (-4.22, 55.86)), ((-4.25, 55.82), (-4.26, 55.83))]}
df = pd.DataFrame(data)
Link Coordinates
0 abd ((-4.21, 55.85), (-4.22, 55.86))
1 abe ((-4.25, 55.82), (-4.26, 55.83))
df['to_longitude'] = df['Coordinates'].str[0].str[0]
df['to_latitude'] = df['Coordinates'].str[0].str[1]
df['from_longitude'] = df['Coordinates'].str[1].str[0]
df['from_latitude'] = df['Coordinates'].str[1].str[1]
Link Coordinates to_longitude to_latitude from_longitude from_latitude
0 abd ((-4.21, 55.85), (-4.22, 55.86)) -4.21 55.85 -4.22 55.86
1 abe ((-4.25, 55.82), (-4.26, 55.83)) -4.25 55.82 -4.26 55.83
Execution time comparison:
Test data of 40,000 rows
df2 = pd.concat([df] * 20000, ignore_index=True)
Solution 1: Tom Ron's solution
def my_func(record):
    record['to_longitude'] = record['Coordinates'][0][0]
    record['to_latitude'] = record['Coordinates'][0][1]
    record['from_longitude'] = record['Coordinates'][1][0]
    record['from_latitude'] = record['Coordinates'][1][1]
    return record
%timeit new_df = df2.apply(my_func, axis=1)
Result:
1min 16s ± 2.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 2: SeaBean's solution
%%timeit
df2['to_longitude'] = df2['Coordinates'].str[0].str[0]
df2['to_latitude'] = df2['Coordinates'].str[0].str[1]
df2['from_longitude'] = df2['Coordinates'].str[1].str[0]
df2['from_latitude'] = df2['Coordinates'].str[1].str[1]
Result:
165 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Solution 3: Anurag Dabas' solution
%%timeit
cols = ['to_longitude', 'to_latitude', 'from_longitude', 'from_latitude']
out = pd.DataFrame(np.hstack(df2['Coordinates'].values), columns=cols)
# OR
# out = pd.DataFrame(np.concatenate(df['Coordinates'].values, axis=1), columns=cols)
Result:
Couldn't benchmark this one; both options raised an error:
ValueError: Shape of passed values is (2, 80000), indices imply (2, 4)
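The error is consistent with np.hstack stacking the 2x2 coordinate arrays side by side into shape (2, 2N). A possible repair, my sketch rather than part of the original answer: build an (N, 2, 2) array from the tuples and reshape it to (N, 4):

import numpy as np
import pandas as pd

cols = ['to_longitude', 'to_latitude', 'from_longitude', 'from_latitude']

# Each Coordinates value is ((to_lon, to_lat), (from_lon, from_lat)), so the
# list converts to an (N, 2, 2) array; reshaping flattens each row to 4 values.
coords = np.array(df2['Coordinates'].tolist()).reshape(len(df2), 4)
out = pd.DataFrame(coords, columns=cols)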
Summary
Solution 1: Tom Ron's solution
1min 16s ± 2.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 2: SeaBean's solution
165 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Solution 3: Anurag Dabas' solution
Couldn't benchmark; both options raised an error on the large dataset.
For the first two solutions with benchmarking results, SeaBean's solution is about 460x faster than Tom Ron's solution (165 ms vs 1 min 16 s) on 40,000 rows of data.
The faster execution time comes from using vectorized Pandas operations (implemented in fast C/Cython code) instead of apply() along axis=1, which under the hood is slow Python-level looping.

Get a DataFrame of string-indexed values from a DataFrame, for all columns

I have a Pandas DataFrame:
>>> df
a b c
foo john george micheal
bar sean david sam
Now I want a DataFrame that has only the first two characters of every value in all columns, produced by processing the above.
So after some statement, df should become:
>>> df
a b c
foo jo ge mi
bar se da sa
I have tried options like df['a'].str[:2]; this works, but only for one column. If I try multiple columns, like df[df.columns].str[:2] or df[:].str[:2], it throws an error.
So how can I achieve that?
You could use apply:
print(df.apply(lambda x: x.str[:2]))
       a   b   c
foo   jo  ge  mi
bar   se  da  sa
I was looking for a quick vectorized solution, and I found one that seems faster than the others: create a new DataFrame from the old DataFrame's values cast to NumPy's fixed-width string dtype, reusing the old column names:
>>> pd.DataFrame(df.values.astype('<U2'),columns=df.columns)
As I had a DataFrame with a huge number of columns, I ran timeit on a dummy df with the same number of columns:
#ScootCork's Answer:
>>> %t -n10 df.apply(lambda x: x.str[:2])
3.23 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#anky's Comment:
>>> %t -n10 df.applymap(lambda x: x[:2])
2.1 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#Shubham Sharma's Comment:
>>> %t -n10 df.transform(lambda s: s.str[:2])
2.56 ms ± 393 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
My Solution:
>>> %t -n10 pd.DataFrame(df.values.astype('<U2'),columns=df.columns)
600 µs ± 95.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
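One caveat on the values-based approach, as a hedged note rather than part of the original answer: pd.DataFrame(...) builds a fresh RangeIndex, and astype('<U2') assumes every cell is already a string. To keep the foo/bar row labels, pass the old index explicitly:

# Preserve both the original column names and the original index labels.
out = pd.DataFrame(df.values.astype('<U2'), columns=df.columns, index=df.index)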

Create a new dataframe based on two columns of value in pandas dataframe

I have a dataframe df1 in python as below:
Type Category
a 1
b 2
c 3
d 4
Expected output:
Type
a/1
b/2
c/3
d/4
The actual dataframe is way larger than this, so I can't type out every cell for the new dataframe.
How can I extract the columns and output them to another dataframe, separated by '/'? Maybe using some for loop?
Using str.cat
The right pandas-y way to proceed is by using str.cat:
df['Type'] = df.Type.str.cat(others=df.Category.astype(str), sep='/')
others contains the pd.Series to concatenate, and sep the separator to use.
Result
Type
0 a/1
1 b/2
2 c/3
3 d/4
Performance comparison
%%timeit
df.Type.str.cat(others=df.Category.astype(str), sep='/')
>> 286 µs ± 449 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df['Type'] + '/' + df['Category'].astype(str)
>> 348 µs ± 5.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Both solutions give the same result, but str.cat is about 20% faster.
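If more than two columns ever need joining, one generalization (my sketch; likely slower than str.cat, since the join runs row by row in Python) is agg over string-cast columns:

# Join an arbitrary list of columns with '/', one row at a time.
df['Type'] = df[['Type', 'Category']].astype(str).agg('/'.join, axis=1)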

avoid repeating the dataframe name when operating on pandas columns

Very much a beginner question, sorry: is there a way to avoid repeating the dataframe name when operating on pandas columns?
In R, data.table allows to operate on a column without repeating the dataframe name like this
very_long_dt_name = data.table::data.table(col1=c(1,2,3),col2=c(3,3,1))
# operate on the columns without repeating the dt name:
very_long_dt_name[,ratio:=round(col1/col2,2)]
I couldn't figure out how to do it with pandas in Python so I keep repeating the df name:
data = {'col1': [1,2,3], 'col2': [3, 3, 1]}
very_long_df_name = pd.DataFrame(data)
# operate on the columns requires repeating the df name
very_long_df_name['ratio'] = np.round(very_long_df_name['col1']/very_long_df_name['col2'],2)
I'm sure there's a way to avoid it but I can't find anything on Google. Any hint please? Thank you.
Try assign:
very_long_df_name.assign(ratio=lambda x: np.round(x.col1/x.col2,2))
Output:
col1 col2 ratio
0 1 3 0.33
1 2 3 0.67
2 3 1 3.00
Edit: to reflect the comments, here are tests on 1 million rows:
%%timeit
very_long_df_name.assign(ratio = lambda x:x.col1/x.col2)
# 18.6 ms ± 506 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
and
%%timeit
very_long_df_name['ratio'] = very_long_df_name['col1']/very_long_df_name['col2']
# 13.3 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And with np.round, assign
%%timeit
very_long_df_name.assign(ratio = lambda x: np.round(x.col1/x.col2,2))
# 64.8 ms ± 958 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
and not-assign:
%%timeit
very_long_df_name['ratio'] = np.round(very_long_df_name['col1']/very_long_df_name['col2'],2)
# 55.8 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So it appears that assign is vectorized as well, just with some extra overhead.
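DataFrame.eval is another way to reference columns by bare name. A sketch; eval's expression language covers the arithmetic but not np.round, so the rounding happens in a second step:

data = {'col1': [1, 2, 3], 'col2': [3, 3, 1]}
very_long_df_name = pd.DataFrame(data)

# Column names can be used directly inside the eval expression string.
out = very_long_df_name.eval('ratio = col1 / col2')
out['ratio'] = out['ratio'].round(2)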

obtaining last value of dataframe column without index

Suppose I have a DataFrame such as:
df = pd.DataFrame(np.random.randn(10,5), columns = ['a','b','c','d','e'])
and I would like to retrieve the last value in column e. I could do:
df['e'].tail(1)
but this would return a Series, which still carries index 9 with it. Ideally, I just want to obtain the value as a number that I can work with directly. I could also do:
np.array(df['e'].tail(1))
but this would then require me to access the 0th element of it before I can really work with it. Is there a more direct/easy way to do this?
You could try the iloc method of the dataframe:
In [26]: df
Out[26]:
a b c d e
0 -1.079547 -0.722903 0.457495 -0.687271 -0.787058
1 1.326133 1.359255 -0.964076 -1.280502 1.460792
2 0.479599 -1.465210 -0.058247 -0.984733 -0.348068
3 -0.608238 -1.238068 -0.126889 0.572662 -1.489641
4 -1.533707 -0.218298 -0.877619 0.679370 0.485987
5 -0.864651 -0.180165 -0.528939 0.270885 1.313946
6 0.747612 -1.206509 0.616815 -1.758354 -0.158203
7 -2.309582 -0.739730 -0.004303 0.125640 -0.973230
8 1.735822 -0.750698 1.225104 0.431583 -1.483274
9 -0.374557 -1.132354 0.875028 0.032615 -1.131971
In [27]: df['e'].iloc[-1]
Out[27]: -1.1319705662711321
Or if you want just a scalar, you could use iat, which is faster. From the docs:
If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures
In [28]: df.e.iat[-1]
Out[28]: -1.1319705662711321
Benchmarking:
In [31]: %timeit df.e.iat[-1]
100000 loops, best of 3: 18 µs per loop
In [32]: %timeit df.e.iloc[-1]
10000 loops, best of 3: 24 µs per loop
Try
df['e'].iloc[[-1]]
Sometimes,
df['e'].iloc[-1]
doesn't work.
We can also access it by indexing into df.index and using at:
df.at[df.index[-1], 'e']
It's faster than iloc, but slower than accessing without the extra index lookup.
If we decide to assign a value to the last element in column "e", the above method is much faster than the other two options (9-11 times faster):
>>> %timeit df.at[df.index[-1], 'e'] = 1
11.5 µs ± 355 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit df['e'].iat[-1] = 1
107 µs ± 4.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit df['e'].iloc[-1] = 1
127 µs ± 7.13 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
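A closing note on the assignment benchmarks, as a hedged aside: df['e'].iat[-1] = 1 and df['e'].iloc[-1] = 1 are chained assignments, which pandas discourages because the write can land on a temporary copy. Positional access on the frame itself avoids that:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])

# Write through the frame, not through an intermediate column Series.
df.iat[-1, df.columns.get_loc('e')] = 1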
