Pandas function to add field to dataframe does not work


I have some code which I want to use in a dynamic python function. This code adds a field to an existing dataframe and does some adjustments to it. However, I got the error "TypeError: string indices must be integers". What am I doing incorrectly?
See below the function plus the code for calling the function.
import pandas as pd
#function
def create_new_date_column_in_df_based_on_other_date_string_column(df, df_field_existing_str, df_field_new):
    df[df_field_new] = df[df_field_existing_str]
    df[df_field_new] = df[df_field_existing_str].str.replace('12:00:00 AM', '')
    df[df_field_new] = df[df_field_existing_str].str.strip()
    df[df_field_new] = pd.to_datetime(df[df_field_existing_str]).dt.strftime('%m-%d-%Y')
    return df[df_field_new]
#calling the function
create_new_date_column_in_df_based_on_other_date_string_column(df='my_df1', df_field_existing_str='existingfieldname', df_field_new='newfieldname')

The df argument you are passing to the function is a str ('my_df1'), not a DataFrame, and df_field_existing_str is a str as well.
So df[df_field_existing_str] tries to index a string with another string using [] (that is, the .__getitem__() method), which is impossible: string indices must be integers or slices.
You are not using a DataFrame here, only strings, which is why you get this TypeError.
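A minimal sketch of a working version, assuming the intent is to strip the midnight time suffix and reformat the date. The shortened function name and the sample data are made up for illustration. It also chains the cleanup steps, because in the question each line reads from the original column again and so throws away the previous step's result.

```python
import pandas as pd

def add_date_column(df, existing_col, new_col):
    """Add new_col to df: existing_col parsed as a date and formatted as MM-DD-YYYY."""
    cleaned = (
        df[existing_col]
        .str.replace("12:00:00 AM", "", regex=False)  # drop the literal time suffix
        .str.strip()                                   # remove leftover whitespace
    )
    df[new_col] = pd.to_datetime(cleaned).dt.strftime("%m-%d-%Y")
    return df

# Hypothetical sample data standing in for the real my_df1
my_df1 = pd.DataFrame(
    {"existingfieldname": ["01/15/2021 12:00:00 AM", "02/03/2021 12:00:00 AM"]}
)

# Pass the DataFrame object itself, not the string 'my_df1'
my_df1 = add_date_column(my_df1, "existingfieldname", "newfieldname")
print(my_df1)
```

The only change that matters for the TypeError is the call: df=my_df1 (the object) instead of df='my_df1' (a string).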

Related

Why is "numpy.int32" not able to be printed here? (Using geopandas + python 3.9.5)

Here is the relevant code:
import geopandas as gpd
#A shape file (.shp) is imported here, contents do not matter, since the "size()" function gets the size of the contents
shapefile = 'Data/Code_Specific/ne_50m_admin_1_states_provinces/ne_50m_admin_1_states_provinces.shp'
gdf = gpd.read_file(shapefile)[['admin', 'adm0_a3', 'postal', 'geometry']]
#size
#Return an int representing the number of elements in this object.
print(gdf.size())
I am getting an error for the last line of code,
TypeError: 'numpy.int32' object is not callable
The main purpose of this is that I am trying to integrate gdf.size() into a for loop:
for index in range(gdf.size()):
    print("test", index)
    #if Australia, remove
    if gdf.get('adm0_a3')[index] == "AUS":
        gdf = gdf.drop(gdf.index[index])
I have absolutely no clue what to do here; this is my first post on this site ever. I hope I don't get gilded with a badge of honor for how stupid or simple this is. I'm stumped.

gpd.read_file returns either a GeoDataFrame or a DataFrame object, both of which have a size attribute that holds an integer. The attribute is accessed as gdf.size; adding parentheses tries to call that integer, which causes your error.
size is also the wrong attribute to use here, because for a table it returns the number of rows times the number of columns. At first glance the following should work
for index in gdf.index:
    ...
but you would be modifying the length of an iterable while iterating over it. This can throw everything out of sync and cause a KeyError if you drop an index before you try to access it. Since all you want to do is filter some rows, simply use
gdf = gdf[gdf['adm0_a3'] != 'AUS']
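To make the difference concrete, here is a small sketch using a plain pandas DataFrame in place of the GeoDataFrame (the sample rows are invented; the same attribute and boolean filtering exist on a GeoDataFrame, since it subclasses DataFrame).

```python
import pandas as pd

# Hypothetical stand-in for the shapefile contents
df = pd.DataFrame(
    {
        "admin": ["Australia", "Austria", "Australia"],
        "adm0_a3": ["AUS", "AUT", "AUS"],
        "postal": ["NSW", "W", "VIC"],
    }
)

n_cells = df.size   # attribute, no parentheses: rows * columns
n_rows = len(df)    # number of rows only

# Boolean mask keeps every row whose code is not "AUS"; no loop, no drop
filtered = df[df["adm0_a3"] != "AUS"]

print(n_cells, n_rows)
print(filtered)
```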

I think what you are looking for is the number of rows, which you can get with
gdf.shape[0]
or
len(gdf.index)
I think the first option is more readable, but the second one is faster.

Is there a way to find a substring in a DataFrame?

Well, I got this problem:
I have a pandas DataFrame and I'm trying to find the values that start with "THRL-" and delete that exact prefix. I've tried converting it to a string with result = df.to_string() and then calling replace as follows (where result is a DataFrame):
a = result.replace('THRL-', '')
But it doesn't work: I still see the same THRL- prefix in the string that I'm returning.
Is there a better way to do it? I also tried with a dictionary, but it didn't seem to work because apparently the .to_dict() method returns a list instead of a dictionary.
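One possible approach, offered as a sketch rather than a definitive answer: when result is a DataFrame, result.replace('THRL-', '') only replaces cells whose entire value equals 'THRL-', so a prefix inside a longer string is left alone. Passing regex=True makes it a substring (pattern) replacement. The column names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical data; only some values carry the prefix
result = pd.DataFrame(
    {"code": ["THRL-001", "ABC-002", "THRL-003"], "qty": [1, 2, 3]}
)

# ^ anchors the pattern to the start of each string, so only a leading
# "THRL-" is removed; non-string columns are left untouched
cleaned = result.replace(r"^THRL-", "", regex=True)

print(cleaned)
```

To restrict the change to a single column, result["code"].str.replace(r"^THRL-", "", regex=True) does the same thing for that column only.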

Python Type-error: string indices must be integers, creating a new column using existing columns in a data frame

I am trying to create an additional custom column from existing columns of a dataframe, but the function I am using throws a TypeError during execution. I am very new to Python; can someone please help?
The dataframe used is as below
match_all = match[['country_id','league_id','season','stage','date',
                   'home_team_api_id','away_team_api_id','home_team_goal','away_team_goal']]
And the function I am using is as below
def goal_diff(matches):
    for i in matches:
        i['home_team_goal']-i['away_team_goal']
goal_diff(match_all)

Your function did not work because matches is a dataframe, and iterating over a dataframe yields its column names. When you do:
for i in matches:
    print(i)
you will see the column names of your df printed. This is how a for loop operates on a df. So in your function, when you use i in your subtraction:
i['home_team_goal'] -i['away_team_goal']
it is like doing
'country_id'['home_team_goal'] - 'country_id'['away_team_goal']
'league_id'['home_team_goal'] - 'league_id'['away_team_goal']
...
Indexing a string with another string does not make sense, which is why Python raises the TypeError. What you actually want is to select the columns from the df itself, using the name of the df with the column:
matches['home_team_goal'] - matches['away_team_goal']
Remember, matches is your function's input df. Lastly, in your for loop you are neither returning nor storing any value; you are just evaluating a subtraction and discarding the result. In an interactive session you might see something printed to the screen, but you will probably want to use these values in the next step of your code. So in a function, we use a return statement to have the function actually give us a value when we call it.
In your case, if I wrote the function below without the return statement and then called it on my dataframe, the operation would complete but no value would be "returned" to me; it would just be produced and disappear.
Pre-edit answer.
You do not need to create a loop for this, pandas will do it for you:
def goal_diff(matches):
    return matches['home_team_goal'] - matches['away_team_goal']
match_all['home_away_goal_diff'] = goal_diff(match_all)
This function takes an input df and uses the columns 'home_team_goal' and 'away_team_goal' to calculate the difference. You also don't need a function for this. If you wanted to create a new column in your existing match_all df you could do this:
match_all['home_away_goal_diff'] = match_all['home_team_goal'] - match_all['away_team_goal']
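A small self-contained sketch of both points, using an invented two-column stand-in for match_all: iterating over a dataframe yields column labels (strings), while subtracting two columns works row by row without a loop.

```python
import pandas as pd

# Hypothetical stand-in for match_all with just the two goal columns
match_all = pd.DataFrame(
    {"home_team_goal": [2, 0, 3], "away_team_goal": [1, 0, 4]}
)

# Iterating over a DataFrame gives its column labels, not its rows
column_names = [name for name in match_all]

# Vectorized subtraction: one result per row
match_all["home_away_goal_diff"] = (
    match_all["home_team_goal"] - match_all["away_team_goal"]
)

print(column_names)
print(match_all)
```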

What is the Python equivalent for the R function names( )?

The function names() in R gets or sets the names of an object. What is the Python equivalent to this function, including import?
Usage:
names(x)
names(x) <- value
Arguments:
(x) an R object.
(value) a character vector of up to the same length as x, or NULL.
Details:
names() is a generic accessor function, and names<- is a generic replacement function. The default methods get and set the "names" attribute of a vector (including a list) or pairlist.
Continue R Documentation on Names( )

In Python (pandas) a DataFrame has the .columns attribute, which plays the role of R's names() for a data frame:
Ex:
# Import pandas package
import pandas as pd
# making data frame
data = pd.read_csv("Filename.csv")
# Extract column names
list(data.columns)

I'm not sure there is anything directly equivalent, especially for getting names. Some objects, like dicts, provide a .keys() method that lets you get the keys out.
Somewhat relevant are the getattr and setattr built-ins, but it's pretty rare to use these in production code.
I was going to talk about Pandas, but I see user2357112 has just pointed that out already!

There is no equivalent. The concept does not exist in Python. Some specific types have roughly analogous concepts, like the index of a Pandas Series, but arbitrary Python sequence types don't have names for their elements.
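For the rough analogues mentioned in these answers, a short sketch (all sample values are made up). None of these is a general replacement for names(); each is specific to its own type.

```python
import pandas as pd

# pandas Series: the index holds the element labels; it can be read and reassigned
s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
series_names = list(s.index)     # roughly names(x)
s.index = ["x", "y", "z"]        # roughly names(x) <- value

# pandas DataFrame: columns holds the column labels
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
column_names = list(df.columns)
df.columns = ["first", "second"]

# Plain dict: keys() gives the names; there is no bulk "set names" operation
d = {"a": 1, "b": 2}
dict_names = list(d.keys())

print(series_names, list(s.index))
print(column_names, list(df.columns))
print(dict_names)
```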

passing a column as parameter in python function

I have been trying to store the unique values of a column from a pandas data frame, using the following code, for further use in the function. Code snippet:
def mergeFields(data,field):
    oldlist = pd.unique(data[field])
data_rcvd_from_satish.tags = mergeFields("data_rcvd_from_satish","tags")
Error:
TypeError: string indices must be integers, not list
I know the error I have been getting is similar to those in many other questions, but I am still not able to resolve it. Please do not treat this as a duplicate, and please answer.

data is a string. Please review the arguments passed to mergeFields. You basically wrote:
"data_rcvd_from_satish"["tags"]
which is invalid.
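A possible corrected version, as a sketch with invented sample data: pass the DataFrame object instead of its name in quotes, and return the result, since the function in the question computes oldlist but never returns it. The snake_case function name and the unique_tags variable are illustrative choices, not from the question.

```python
import pandas as pd

def merge_fields(data, field):
    """Return the unique values of column `field` in DataFrame `data`."""
    return pd.unique(data[field])

# Hypothetical stand-in for the real data_rcvd_from_satish
data_rcvd_from_satish = pd.DataFrame({"tags": ["a", "b", "a", "c"]})

# No quotes around the DataFrame: this passes the object, not a string
unique_tags = merge_fields(data_rcvd_from_satish, "tags")

print(unique_tags)
```

The result is stored in a separate variable here because the unique values are usually fewer than the rows, so they cannot be assigned back to the tags column directly.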
