Efficient way to unnest pandas dataframe - python

I'm accessing a fairly large series of JSON files and storing them in a pandas Series that is part of a larger dataframe. There are several fields I want from the JSON, some of which are nested, and I've been extracting them with json_normalize. The goal is to then merge these new fields with my original dataframe.
My problem is that when I do so, instead of getting a dataframe with J rows and K columns, I get a Series of length J whose elements are each a 1xK dataframe. I'm wondering whether there is an efficient, vectorized way to turn this nested Series into a regular dataframe, or a way to get a regular dataframe from the start.
I've used map/lambda to create my nested series. Right now I'm unnesting with iteritems/append, but there has to be a more efficient way.
url_base = 'http://foo.bar='
df['http'] = df['id'].map(lambda x: url_base + x)
df['json'] = df['http'].map(lambda x: nf.get_json(x))
nest_ser = df['json'].map(lambda x: json_normalize(x))
df = pd.DataFrame()
for index, item in nest_ser.iteritems():
    df = df.append(item)
json_normalize produces:
pd.Series([pd.DataFrame([col1, col2, ...]), pd.DataFrame([col1, col2, ...]), pd.DataFrame([col1, col2, ...])])
instead of
pd.DataFrame([col1, col2, ...])

Suppose the output Series from json_normalize is named sr:
pd.concat(sr.tolist())
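For reference, here is a minimal sketch of that concat-based unnesting using the names from the question (nest_ser is the Series of one-row frames); the re-alignment onto df's index is my assumption about how the rows should be merged back:
flat = pd.concat(nest_ser.tolist(), ignore_index=True)  # J rows x K columns
flat.index = df.index                                    # line the new rows up with the originals
df = df.join(flat)                                       # merge the new fields into the original dataframe
Alternatively, json_normalize can be called once on the whole column, e.g. json_normalize(df['json'].tolist()), which avoids building J one-row dataframes in the first place.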


How to extract inside of column to several columns

I have an Excel file that I import into a dataframe. I want to split the contents of one column into several columns.
Here is the original:
After importing it into pandas, I get this data with '\n':
So, I want to extract the contents of that column. Could you share an idea or some code?
My expected columns are....
Don't worry, no one is born knowing everything about SO. Considering the data you gave, especially that 'Vector:...' is not separated by '\n', the following works:
import pandas as pd
import numpy as np

data = pd.read_excel("the_data.xlsx")

ok = []
l = len(data['Details'])
for n in range(l):
    x = data['Details'][n].split()
    x[2] = x[2].lstrip('Vector:')                         # remove the leading 'Vector:' label (lstrip strips characters, not a prefix)
    x = [v for v in x if v not in ['Type:', 'Mission:']]  # drop the label tokens
    ok += x

values = np.array(ok).reshape(l, 3)
df = pd.DataFrame(values, columns=['Type', 'Vector', 'Mission'])
data.drop('Details', axis=1, inplace=True)
final = pd.concat([data, df], axis=1)
The process goes like this:
First, you split every element of the Details column into a list of strings. Second, you handle the 'Vector:....' special case and filter out the label names. Third, you store all the values in a list, which is in turn converted to a numpy array with shape (length, 3). Finally, you drop the old 'Details' column and concatenate with the dataframe created from the split strings.
You may want to transform your data more efficiently at read time by applying these ideas inside pd.read_excel via its converters argument.
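As a rough sketch of that converters idea (the parse_details helper name and the assumption that every Details cell follows the same 'Type: ... Vector:... Mission: ...' layout are mine, not part of the original answer):
import pandas as pd

def parse_details(cell):
    # split one raw 'Details' cell into [type, vector, mission],
    # mirroring the loop body from the answer above
    parts = cell.split()
    parts[2] = parts[2].lstrip('Vector:')
    return [p for p in parts if p not in ['Type:', 'Mission:']]

data = pd.read_excel("the_data.xlsx", converters={'Details': parse_details})

# expand the parsed lists into three columns and drop the original
data[['Type', 'Vector', 'Mission']] = pd.DataFrame(data['Details'].tolist(),
                                                   index=data.index)
data = data.drop(columns='Details')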

column filter and multiplication in dask dataframe

I am trying to replicate the following operation on a dask dataframe, where I have to filter the dataframe on a column value and scale another column accordingly.
Here is the pandas equivalent -
import dask.dataframe as dd

df['adjusted_revenue'] = 0
df.loc[df.tracked == 1, 'adjusted_revenue'] = 0.7 * df['gross_revenue']
df.loc[df.tracked == 0, 'adjusted_revenue'] = 0.3 * df['gross_revenue']
I am trying to do this on a dask dataframe but it doesn't support assignment.
TypeError: '_LocIndexer' object does not support item assignment
This is working for me -
df['adjusted_revenue'] = 0

df1 = df.loc[df['tracked'] == 1]
df1['adjusted_revenue'] = 0.7 * df1['gross_revenue']

df2 = df.loc[df['tracked'] == 0]
df2['adjusted_revenue'] = 0.3 * df2['gross_revenue']

df = dd.concat([df1, df2])
However, I was hoping if there is any simpler way to do this.
Thanks!
You should use .apply, which is probably the right thing to do with pandas too; or perhaps where. However, to keep things as similar as possible to your original, here it is with map_partitions, in which you act on each piece of the dataframe independently, and those pieces really are pandas dataframes.
def make_col(df):
    df['adjusted_revenue'] = 0
    df.loc[df.tracked == 1, 'adjusted_revenue'] = 0.7 * df['gross_revenue']
    df.loc[df.tracked == 0, 'adjusted_revenue'] = 0.3 * df['gross_revenue']
    return df

new_df = df.map_partitions(make_col)
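For completeness, a minimal sketch of the where-based form mentioned above; it assumes tracked only ever takes the values 0 and 1, as in the question, and works the same way on a pandas or dask dataframe:
df['adjusted_revenue'] = (0.7 * df['gross_revenue']).where(df['tracked'] == 1,
                                                           0.3 * df['gross_revenue'])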

How to get the difference of 2 lists in a Pandas DataFrame?

I'm new to Python pandas. I'm having trouble finding the difference between two list columns within a pandas DataFrame.
Example Input with ; separator:
ColA; ColB
A,B,C,D; B,C,D
A,C,E,F; A,C,F
Expected Output:
ColA; ColB; ColC
A,B,C,D; B,C,D; A
A,C,E,F; A,C,F; E
What I want to do is similar to:
df['ColC'] = np.setdiff1d(df['ColA'].str.split(','), df['ColB'].str.split(','))
But it returns an error:
raise ValueError('Length of values does not match length of index',data,index,len(data),len(index))
Kindly advise
You can apply a lambda function on the DataFrame to find the difference like this:
import pandas as pd
# creating DataFrame (can also be loaded from a file)
df = pd.DataFrame([[['A','B','C','D'], ['B','C']]], columns=['ColA','ColB'])
# apply a lambda function to get the difference
df['ColC'] = df[['ColA','ColB']].apply(lambda x: [i for i in x[0] if i not in x[1]], axis=1)
Please note: this will find the asymmetric difference ColA - ColB.
Result: for the sample row above, ColC contains ['A', 'D'].
A much faster way to do this would be a simple set subtraction:
import pandas as pd
#Creating a dataframe
df = pd.DataFrame([[['A','B','C','D'], ['B','C']]], columns=['ColA','ColB'])
#Finding the difference
df['ColC']= df['ColA'].map(set)-df['ColB'].map(set)
As the dataframe grows in row count, any row-by-row operation becomes computationally expensive.
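Note that in the question ColA and ColB hold comma-joined strings rather than Python lists; here is a small sketch of the same set subtraction under that assumption (element order in ColC is not preserved, since sets are unordered):
df['ColC'] = df['ColA'].str.split(',').map(set) - df['ColB'].str.split(',').map(set)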

apply a function to each row of the dataframe

What is a more elegant way of implementing the code below?
I want to apply a function: my_function to a dataframe where each row of the dataframe contains the parameters of the function. Then I want to write the output of the function back to the dataframe row.
results = pd.DataFrame()
for row in input_panel.iterrows():
    (index, row_contents) = row
    row_contents['target'] = my_function(*list(row_contents))
    results = pd.concat([results, row_contents])
We'll iterate through the values and build a DataFrame at the end.
results = pd.DataFrame([my_function(*x) for x in input_panel.values.tolist()])
The less recommended method is using DataFrame.apply:
results = input_panel.apply(lambda x: my_function(*x), axis=1)
The only advantage of apply is less typing.
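A minimal, self-contained sketch of both versions; my_function here is a hypothetical two-argument placeholder, and the assumption is that input_panel has one column per parameter of the real function:
import pandas as pd

def my_function(a, b):
    return a + b  # stand-in for the real computation

input_panel = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# list-comprehension version: build the results in one pass
results = pd.DataFrame([my_function(*x) for x in input_panel.values.tolist()],
                       index=input_panel.index, columns=['target'])

# apply version: axis=1 passes each row to the function
results_apply = input_panel.apply(lambda x: my_function(*x), axis=1)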

fuzzywuzzy to normalize string in pandas column

I have a dataframe like this
Now I want to normalize the strings in the 'comments' column to the word 'election'. I tried using fuzzywuzzy but wasn't able to apply it to the pandas dataframe to partially match the word 'election'. The output dataframe should have the word 'election' in the 'comments' column, like this:
Assume that I have around 100k rows, and the possible variations of the word 'election' can be many.
Kindly guide me on this part.
With the answer you gave, you can use the pandas apply, stack and groupby functions to speed up your code. Say you have input such as:
import pandas as pd
from fuzzywuzzy import fuzz

df = pd.DataFrame({'Merchant details': ['Alpha co', 'Bravo co'],
                   'Comments': ['electionsss are around',
                                'vote in eelecttions']})
For the 'Comments' column, you can create a temporary MultiIndex DataFrame containing one word per row by splitting and using the stack function:
df_temp = pd.DataFrame(
    {'split_comments': df['Comments'].str.split(' ', expand=True).stack()})
Then you create the column with the corrected words (following your idea), using apply and a fuzz.ratio comparison:
df_temp['corrected_comments'] = df_temp['split_comments'].apply(
    lambda wd: 'election' if fuzz.ratio(wd, 'election') > 75 else wd)
Finally, you write the corrected data back into the Comments column of df using the groupby and join functions:
df['Comments'] = df_temp.reset_index().groupby('level_0').apply(
    lambda wd: ' '.join(wd['corrected_comments']))
Don't operate on the dataframe directly. The overhead will kill you. Turn the column into a list, iterate over that, and finally assign the list back to the column.
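A minimal sketch of that list-based approach, reusing the fuzz.ratio threshold of 75 from the other answer; the column name 'Comments' follows the example dataframe above, and the single target word 'election' is taken from the question:
comments = df['Comments'].tolist()
fixed = []
for text in comments:
    words = text.split()
    words = ['election' if fuzz.ratio(w, 'election') > 75 else w for w in words]
    fixed.append(' '.join(words))
df['Comments'] = fixed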
OK, I tried this myself and came up with this code -
# 'word' is the list of target words to normalize against, e.g. ['election']
for i in range(len(df)):
    a = df.comments[i].split()
    for j in word:
        for k in range(len(a)):
            if fuzz.ratio(j, a[k]) > 75:
                a[k] = j
    df.comments[i] = ' '.join(a)
But this approach seems slow for a large dataframe.
Can someone suggest a more Pythonic way of implementing this?
