How to build a Python function to build Pandas DataFrames dynamically?

I have a pandas dataframe that I need to use to create other dataframes.
The dataframe contains NAICS codes along with related data. I am trying to create one new dataframe per code and am getting stuck on an error.
fdf is a dataframe of 2-digit numbers, e.g. 10, 11, 12, 13.
I want to loop through this dataframe to query df and build the others. Here is what I have so far:
for x in fdf:
    'Sdf' + str(x) = df[df['naics'].astype(str).str[2:4]==str(x)]
If I run this by itself:
df[df['naics'].astype(str).str[2:4]==str(57)]
it returns the dataframe I want, but I am not sure how to build this into a function.
'SyntaxError: can't assign to function call' is the error I get. I think the issue is how I am trying to build the dataframe name dynamically.
Any help is greatly appreciated.

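Use a dictionary keyed by the code instead of trying to build variable names dynamically. The left-hand side of an assignment has to be a name or an item such as a dictionary entry, not a string expression.
df_list = {}
for x in fdf:
    df_list[str(x)] = df[df['naics'].astype(str).str[2:4]==str(x)]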
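Since the question asks how to turn this into a function, here is a minimal sketch of the dictionary approach wrapped in one. The sample data, the `codes` list (a stand-in for the values held in `fdf`), and the helper name `split_by_code` are assumptions for illustration, not part of the original post.

```python
import pandas as pd

# Hypothetical sample data; the real `df` comes from the question.
df = pd.DataFrame({
    "naics": [115711, 225712, 335811, 445722],
    "value": [1, 2, 3, 4],
})
codes = [57, 58]  # stand-in for the 2-digit codes stored in `fdf`

def split_by_code(frame, codes, column="naics"):
    """Return a dict mapping each code (as a string) to its sub-DataFrame."""
    # Characters 2 and 3 of the code, as in the question's str[2:4] slice
    key = frame[column].astype(str).str[2:4]
    return {str(code): frame[key == str(code)] for code in codes}

frames = split_by_code(df, codes)
```

Each result is then available as `frames["57"]`, `frames["58"]`, and so on. Note that iterating directly over a DataFrame yields its column labels, so if the codes sit in a column of `fdf`, iterate over that column rather than over `fdf` itself.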

Related

create dynamic column names in pandas

I am trying to create multiple dataframes inside a for loop using the below code:
for i in range(len(columns)):
    f'df_v{i+1}' = df.pivot(index="no", columns=list1[i], values=list2[i])
But I get the error "Cannot assign to literal". Not sure whether there is a way to create the dataframes dynamically in pandas?
This syntax
f'df_v{i+1}' = df.pivot(index="no", columns=list1[i], values=list2[i])
means that you are trying to assign DataFrames to a string, which is not possible. You might try using a dictionary, instead:
my_dfs = {}
for i in range(len(columns)):
    my_dfs[f'df_v{i+1}'] = df.pivot(index="no", columns=list1[i], values=list2[i])
A dictionary allows the use of named keys, which seems to be what you want. This way you can access your dataframes using my_dfs['df_v1'], for example.
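A self-contained sketch of the same idea follows. The data and the contents of `list1` and `list2` are made up for illustration; only the names mirror the question.

```python
import pandas as pd

# Hypothetical data; `list1` holds the columns to spread, `list2` the value columns.
df = pd.DataFrame({
    "no": [1, 1, 2, 2],
    "kind": ["a", "b", "a", "b"],
    "size": ["s", "l", "s", "l"],
    "x": [10, 20, 30, 40],
    "y": [1.0, 2.0, 3.0, 4.0],
})
list1 = ["kind", "size"]
list2 = ["x", "y"]

# One pivoted DataFrame per (columns, values) pair, stored under a named key
my_dfs = {
    f"df_v{i}": df.pivot(index="no", columns=cols, values=vals)
    for i, (cols, vals) in enumerate(zip(list1, list2), start=1)
}
```

Pairing the two lists with `zip` avoids indexing by position, and `enumerate(..., start=1)` produces the `df_v1`, `df_v2` keys directly.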

how do I create a new column out of a dictionary's sub string on a pandas dataframe

I have the following repo for the files: https://github.com/Glarez/learning.git
dataframe
I need to create a column from the bold part of this string in the params column: "ufield_18":"ONLY". I don't see how I can get that, since I'm learning to code from scratch. A solution would be nice, but what I would really appreciate is being pointed in the right direction so I can find the answer myself. Thanks!
Since you do not want the exact answer, here is one way to approach it:
1) Extract the params column into a dictionary variable.
2) Loop over the keys of the dictionary.
3) Add each key as a new column on your pandas df (df[key] = np.nan). If your df already has rows, fill in the values as you add the column, or just use np.nan.
Note that np is the numpy library, which needs to be imported.
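For readers who do want to see the steps in code, here is a rough sketch. It assumes that each `params` cell holds JSON text; the repository's actual format is not shown in the post, so treat the sample rows as hypothetical.

```python
import json

import numpy as np
import pandas as pd

# Hypothetical rows; assumes `params` stores JSON text such as {"ufield_18": "ONLY"}
df = pd.DataFrame({
    "id": [1, 2],
    "params": ['{"ufield_18": "ONLY", "other": "x"}', '{"other": "y"}'],
})

# Step 1: turn each params string into a dictionary
parsed = df["params"].apply(json.loads)

# Steps 2 and 3: pull one key out into a new column, NaN where the key is absent
df["ufield_18"] = parsed.apply(lambda d: d.get("ufield_18", np.nan))
```

If the strings are not valid JSON, the parsing step would need to change, but the pattern of parsing first and then reading a key stays the same.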

Python pandas clarity on syntax for groupby

I run into this problem frequently and it isn't clear to me why the below python code will run
groups = session['time'].dt.total_seconds().groupby(session['user'])
but this python code will not run
groups = session['time'].dt.total_seconds().groupby(session[['user','date']])
or
groups = session['time'].dt.total_seconds().groupby(session['user','date'])
Why can't I tack on another column to groupby in this way? How can I write this statement better?
Thank you for guidance, I'm a newbie with Python
session['time'] is a Series, so calling groupby on it produces a SeriesGroupBy object, but you seem to want to group by other columns of the dataframe.
The more common syntax is grouped = df.groupby(columns_to_group_by)[columns_to_keep], where columns_to_group_by can be a list such as ['user', 'date']. I wouldn't name the variable groups, because that is also the name of a property of the GroupBy object.
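As a small illustration of both forms, the sketch below groups by two columns. The `session` data and the helper column `seconds` are invented for the example.

```python
import pandas as pd

# Hypothetical session log
session = pd.DataFrame({
    "user": ["a", "a", "b"],
    "date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "time": pd.to_timedelta([30, 45, 60], unit="s"),
})
session["seconds"] = session["time"].dt.total_seconds()

# Group the DataFrame by a list of column names, then select the column to keep
grouped = session.groupby(["user", "date"])["seconds"]
totals = grouped.sum()

# Starting from the Series also works if you pass a list of Series as the keys
alt = session["seconds"].groupby([session["user"], session["date"]]).sum()
```

The original attempts fail because `session[['user','date']]` is a DataFrame, which is not a valid single grouping key, and `session['user','date']` looks up one column labelled with the tuple `('user', 'date')`.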

Replicating Excel's VLOOKUP in Python Pandas

Would really appreciate some help with the following problem. I'm intending on using Pandas library to solve this problem, so would appreciate if you could explain how this can be done using Pandas if possible.
I want to take the following excel file:
Before
and:
1) Convert the 'before' file into a pandas data frame.
2) Look for the text in the 'Site' column. Where this text appears within the string in the 'Domain' column, return the value in the 'Owner' column under 'Output'.
3) The result should look like the 'After' file. I would like to convert this back into CSV format.
After
So essentially this is similar to an Excel VLOOKUP exercise, except it's not an exact match we're looking for between the 'Site' and 'Domain' columns.
I have already attempted this in Excel, but I'm looking at over 100,000 rows and comparing them against over 1,000 sites, which crashes Excel.
I have attempted to store the lookup list in the same file as the list of domains we want to classify with the 'Owner'. If there's a much better way to do this, e.g. storing the lookup list in a separate data frame altogether, then that's fine.
Thanks in advance for any help, I really appreciate it.
Colin
I think the OP's question differs somewhat from the solutions linked in the comments which either deal with exact lookups (map) or lookups between dataframes. Here there is a single dataframe and a partial match to find.
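import pandas as pd
import numpy as np

# Read the first sheet and treat every cell as text
df = pd.read_excel('data.xlsx', sheet_name=0)
df = df.astype(str)
# True where the row's Site text occurs inside the same row's Domain text
df['Test'] = df.apply(lambda x: x['Site'] in x['Domain'], axis=1)
# Copy Owner into Output only where the test passed
df['Output'] = np.where(df['Test'], df['Owner'], '')
df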
The lambda applies the in test to each row (axis=1) and returns a boolean in Test. This then acts as the rule for looking up Owner and placing it in Output.
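The code above compares Site and Domain within the same row. If the lookup list is instead kept in a separate data frame, as the question suggests might be acceptable, a sketch along the following lines could work. The two frames, their contents, and the helper `find_owner` are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical inputs: domains to classify, and a separate lookup table
domains = pd.DataFrame({
    "Domain": ["news.example.com", "shop.sample.org", "unknown.net"],
})
lookup = pd.DataFrame({
    "Site": ["example", "sample"],
    "Owner": ["Alice", "Bob"],
})

def find_owner(domain, lookup):
    """Return the Owner of the first Site contained in `domain`, else ''."""
    for site, owner in zip(lookup["Site"], lookup["Owner"]):
        if site in domain:
            return owner
    return ""

# Partial-match lookup of every domain against every site
domains["Output"] = domains["Domain"].apply(lambda d: find_owner(d, lookup))
```

The result can be written back out with `domains.to_csv(...)`. This is a plain nested loop, so with 100,000 domains and 1,000 sites it does up to 100 million substring checks; that is slow but should not crash the way the spreadsheet did.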

find column name in dataframe

When using IPython for interactive manipulation, the autocomplete feature helps expand column names quickly.
But given the column object, I'd like to get its name, and I haven't found a simple way to do it. Is there one?
I'm trying to avoid typing the full "ALongVariableName"
x = "ALongVariableName"
relevantColumn = df[x]
Instead, I type "df.AL<\Tab>" to get my series. So I have:
relevantColumn = df.ALongVariableName #now how can I get x?
But that series object doesn't carry its name or index in the dataframe. Did I miss it?
Thanks!
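For what it is worth, a Series selected from a DataFrame does carry its column label in the `name` attribute. A short sketch with made-up data:

```python
import pandas as pd

# Hypothetical frame using the column name from the question
df = pd.DataFrame({"ALongVariableName": [1, 2, 3], "other": [4, 5, 6]})

relevantColumn = df.ALongVariableName
x = relevantColumn.name               # the column label as a string
position = df.columns.get_loc(x)      # its integer position among the columns
```

Here `x` is the string "ALongVariableName", and `position` gives the column's place in `df.columns` if that is also needed.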
