Is there a way to find a substring in a DataFrame?

Well, I got this problem:
I have a pandas DataFrame and I'm trying to find the values that start with "THRL-" and delete that prefix. I've tried converting it to a string with result = df.toString() and then replacing the prefix as follows (where result is a DataFrame):
a = result.replace('THRL-', '')
But it doesn't work, I still see the same THRL- prefix in the string that I'm returning.
Is there a better way to do it? I also tried a dictionary, but that didn't seem to work because .to_dict() apparently returns a list instead of a dictionary.
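One likely explanation, sketched here under assumptions: DataFrame.replace compares whole cell values unless regex=True is passed, so a bare 'THRL-' only matches cells that are exactly that string. The column names and values below are made up for illustration, since the question does not show the real data.

```python
import pandas as pd

# Hypothetical data; the real column names are not given in the question.
df = pd.DataFrame({"code": ["THRL-001", "ABC-002", "THRL-003"], "qty": [1, 2, 3]})

# Without regex=True, replace() only matches cells equal to 'THRL-' exactly,
# so nothing changes here.
unchanged = df.replace("THRL-", "")

# With regex=True the pattern is applied inside each string cell.
# The ^ anchor limits the match to the start of the value.
cleaned = df.replace(r"^THRL-", "", regex=True)

# The same idea for a single column, using the vectorized string accessor.
df["code_clean"] = df["code"].str.replace(r"^THRL-", "", regex=True)

print(cleaned)
```

Both calls return new objects, so the result has to be assigned to something, as the question already does with a.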

Related

Remove spaces from strings in pandas DataFrame not working

Trying to remove spaces from a column of strings in pandas dataframe. Successfully did it using this method in other section of code.
for index, row in summ.iterrows():
    row['TeamName'] = row['TeamName'].replace(" ", "")
summ.head() shows no change to the column of strings after this operation, and no error is raised either.
I have no idea why this happens, considering I used this exact same method later in the code and it worked.

Why not use str.replace on the column and assign the result back:
summ["TeamName"] = summ["TeamName"].str.replace(' ', '', regex=False)

I may be proven wrong here, but I wonder if it's because you are iterating over the DataFrame and working on a copy, so the original data never changes. This is what I found in the pandas.DataFrame.iterrows documentation:
"You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect."
just a thought... hth
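The two answers above can be combined into a small sketch. The sample rows and the Wins column are hypothetical; only TeamName comes from the question. If a loop is unavoidable, one option is to iterate over the index and write through the DataFrame itself with .at, so no row copy is involved.

```python
import pandas as pd

# Hypothetical sample; only the TeamName column comes from the question.
summ = pd.DataFrame({"TeamName": ["Red Sox", "Blue Jays"], "Wins": [10, 12]})

# Vectorized replacement: the result must be assigned back to the column.
summ["TeamName"] = summ["TeamName"].str.replace(" ", "", regex=False)
no_spaces = summ["TeamName"].tolist()

# Loop variant: iterate over index labels and write with .at,
# instead of assigning to the row object that iterrows() yields.
for index in summ.index:
    summ.at[index, "TeamName"] = summ.at[index, "TeamName"].upper()

print(summ)
```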

Python / Pyspark Indexing and Slicing issue on Databricks

I'm not entirely sure if I need to index or slice to retrieve elements from an output in Python.
For example, the variable "Ancestor" produces the following output.
Out[30]: {'ancestorPath': '/mnt/lake/RAW/Internal/origination/dbo/xpd_opportunitystatushistory/1/Year=2022/Month=11/Day=29/Time=05-11',
'dfConfig': '{"sparkConfig":{"header":"true"}}',
'fileFormat': 'SQL'}
The element "xpd_opportunitystatushistory" is a table and I would like to retrieve "xpd_opportunitystatushistory" from the output.
I was thinking of something like:
table = Ancestor[:6]
But it fails.
Any thoughts?
I have been working on this while waiting for help.
The following
Ancestor['ancestorPath']
gives me
Out[17]: '/mnt/lake/RAW/Internal/origination/dbo/xpd_opportunitystatushistory/1/Year=2022/Month=11/Day=29/Time=05-11'
If someone could help with the remaining code to pull out 'xpd_opportunitystatushistory' that would be most helpful
ta

Ancestor is a dictionary (key-value pairs), so it has to be accessed with a key, which in this case is ancestorPath. Slicing such as Ancestor[:6] fails because dictionaries do not support slices.
I assigned a value similar to yours and was able to retrieve ancestorPath, as you have figured out.
To get xpd_opportunitystatushistory you can use the following code. Since the value of Ancestor['ancestorPath'] is a string, you can split it and then extract the required value from the resulting list:
req_array = Ancestor['ancestorPath'].split("/")
print(req_array)
print(req_array[7])
If you want to retrieve complete path until xpd_opportunitystatushistory, then you can use the following instead:
req_array = Ancestor['ancestorPath'].split("/")
print(req_array)
print('/'.join(req_array[:8]))
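The fixed index 7 depends on the exact folder depth. As a slightly less position-dependent sketch, the lookup can be anchored on the "dbo" segment instead. This assumes, without confirmation from the question, that the table name always follows a "dbo" segment in the path.

```python
# Value copied from the question's output.
ancestor_path = (
    "/mnt/lake/RAW/Internal/origination/dbo/"
    "xpd_opportunitystatushistory/1/Year=2022/Month=11/Day=29/Time=05-11"
)

parts = ancestor_path.split("/")

# Assumption: the table name is the segment right after "dbo".
schema_pos = parts.index("dbo")
table = parts[schema_pos + 1]

# Path up to and including the table name.
table_path = "/".join(parts[: schema_pos + 2])

print(table)
print(table_path)
```

parts.index raises ValueError if no "dbo" segment exists, which makes an unexpected path layout visible instead of silently returning the wrong segment.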

when converting XML to SEVERAL dataframes, how to name these dfs in a dynamic way?

My code is at the bottom.
The parse_XML function converts an XML file to a df; for example, df = parse_XML("example.xml", lst_level2_tags) works.
But I want to save several dfs, so I want names like df_first_level_tag, etc.
When I run the code below, I get this error:
f'df_{first_level_tag}'=parse_XML("example.xml", lst_level2_tags)
^
SyntaxError: can't assign to literal
I also tried the .format method instead of an f-string, but that didn't work either.
There are at least 30 dfs to save and I don't want to do it one by one. I have always succeeded with f-strings in Python outside pandas, though.
Is the problem with the f-string/format method, or does my code have some other logic problem?
If it helps, the parse_XML function is taken directly from this link.
the function definition
for first_level_tag in first_level_tags:
    lst_level2_tags = []
    for subchild in root[0]:
        lst_level2_tags.append(subchild.tag)
    f'df_{first_level_tag}'=parse_XML("example.xml", lst_level2_tags)

This seems like a situation where you'd be best served by putting them into a dictionary:
dfs = {}
for first_level_tag in first_level_tags:
    lst_level2_tags = []
    for subchild in root[0]:
        lst_level2_tags.append(subchild.tag)
    dfs[first_level_tag] = parse_XML("example.xml", lst_level2_tags)
There's nothing structurally wrong with your f-string, but an f-string evaluates to a string value, and a value can't be the target of an assignment. You generally can't get dynamic variable names in Python without doing ugly things. Storing the values in a dictionary ends up being a much cleaner solution when you want something like that.
One advantage of working with them this way is that you can then just iterate over the dictionary later on if you want to do something to each of them. For example, if you wanted to write each of them to disk as a CSV with a name matching the tag, you could do something like:
for key, df in dfs.items():
    df.to_csv(f'{key}.csv')
You can also just refer to them individually (so if there was a tag named a, you could refer to dfs['a'] to access it in your code later).
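The dictionary pattern can be sketched end to end with the standard library. The XML text and the parse_section helper below are hypothetical stand-ins, since neither the real file nor the linked parse_XML function is shown; the helper returns a list of row dictionaries where the real function would return a DataFrame.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML; the real file layout is not shown in the question.
XML_TEXT = """
<root>
  <orders>
    <order><id>1</id><total>9.50</total></order>
    <order><id>2</id><total>3.25</total></order>
  </orders>
  <customers>
    <customer><id>7</id><name>Ann</name></customer>
  </customers>
</root>
"""


def parse_section(section):
    """Hypothetical stand-in for parse_XML: one dict per record element."""
    return [{field.tag: field.text for field in record} for record in section]


root = ET.fromstring(XML_TEXT)

# One entry per first-level tag, keyed by the tag name,
# instead of one dynamically named variable per tag.
tables = {section.tag: parse_section(section) for section in root}

for name, rows in tables.items():
    print(name, len(rows))
```

Each list of rows could be passed to pd.DataFrame if a DataFrame per tag is wanted.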

Python Type-error: string indices must be integers, creating a new column using existing columns in a data frame

I am trying to create an additional custom column from existing columns of a dataframe, but the function I am using throws a TypeError when it runs. I am very new to Python; can someone please help?
The dataframe used is as below
match_all = match[['country_id','league_id','season','stage','date',
'home_team_api_id','away_team_api_id','home_team_goal','away_team_goal']]
And the function I am using is as below
def goal_diff(matches):
    for i in matches:
        i['home_team_goal']-i['away_team_goal']
goal_diff(match_all)

Your function did not work because matches in your function is a dataframe. When you do:
for i in matches:
    print(i)
you will see that the column names of your df are returned. This is how a for loop operates on a df. So in your function, when you use i in your subtraction:
i['home_team_goal'] -i['away_team_goal']
it is like doing
'country_id'['home_team_goal'] - 'country_id'['away_team_goal']
'league_id'['home_team_goal'] - 'league_id'['away_team_goal']
...
Each i is a column name, which is a string, and indexing a string with another string is what raises "string indices must be integers". What you actually want is to select the columns from the df itself:
matches['home_team_goal'] - matches['away_team_goal']
Remember, matches is your function's input df. Lastly, in your for loop you are neither returning nor storing any value; you are just subtracting two columns and discarding the result. In an interactive session you might see something printed to the screen, but you will probably want to use these values in the next step of your code. So in a function, we use a return statement so that the function actually gives us values when we call it.
In your case, if I wrote my function below without the return statement and then called it on my dataframe, the operation would complete, but no value would be "returned" to me; it would just be produced and disappear.
Pre-edit answer.
You do not need to create a loop for this, pandas will do it for you:
def goal_diff(matches):
    return matches['home_team_goal'] - matches['away_team_goal']
match_all['home_away_goal_diff'] = goal_diff(match_all)
This function takes an input df and uses the columns 'home_team_goal' and 'away_team_goal' to calculate the difference. You also don't need a function for this. If you wanted to create a new column in your existing match_all df you could do this:
match_all['home_away_goal_diff'] = match_all['home_team_goal'] - match_all['away_team_goal']
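A minimal sketch of both points, using made-up scores (only the column names come from the question): iterating over a DataFrame yields column labels, indexing such a label with a string raises the TypeError, and column arithmetic needs no loop.

```python
import pandas as pd

# Hypothetical scores; only the column names come from the question.
match_all = pd.DataFrame({"home_team_goal": [2, 0, 3], "away_team_goal": [1, 1, 3]})

# Iterating over a DataFrame yields its column labels, not its rows.
labels = [label for label in match_all]

# Indexing a label (a plain string) with another string reproduces the error.
try:
    labels[0]["home_team_goal"]
except TypeError as exc:
    message = str(exc)

# Column arithmetic is element-wise, so no loop is needed.
match_all["home_away_goal_diff"] = (
    match_all["home_team_goal"] - match_all["away_team_goal"]
)

print(labels)
print(message)
print(match_all)
```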

Pythonic way not working with list

I have a seemingly simple problem, but the code that I believe should solve it is not behaving as expected, while less elegant code that I find functionally equivalent does. Can you help me understand?
The task: create a list, drop a specific value.
The specific usecase is that I am dropping a specific list of columns of pd.df, but that is not the part I want to focus on. It's that I seem to be unable to do it in a nice, pythonic single-line operation.
What I think should work:
result = list(df.columns).remove(x)
This results in object of type 'NoneType'
However, the following works fine:
result = list(df.columns)
result.remove(x)
These look functionally equivalent to me -- but the top approach is clearer and preferred, but it does not work. Why?

The reason is that remove changes the list, and does not return a new one, so you can't chain it.
What about the following way?
result = [item for item in df.columns if item != x]
Please note that this code is not exactly equivalent to the one you provided, as it will remove all occurrences of x, not just the first one as with remove(x).

Those are definitely not functionally equivalent.
The first piece of code puts the result of the last called method into result, so whatever remove returns. remove always returns None, since it modifies the list in place and returns nothing.
The second piece of code puts the list into result, then removes the item from the list (which is already stored in result). You are discarding the return value of remove, as you should. The equivalent (and wrong) thing to do would be:
result = list(df.columns)
result = result.remove(x)

The two pieces of code are not really equivalent. In the second one, the variable result holds your list. You then call remove on that list, and the element is removed. So far so good.
In the first piece of code you try to assign the return value of remove() to result, so this would be the same as:
result = list(df.columns)
result = result.remove(x)
And since remove has no return value, result will be None.
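The three variants discussed in these answers can be compared side by side. The columns list is a hypothetical stand-in for list(df.columns), with a repeated value so the difference between remove and the list comprehension is visible.

```python
# Hypothetical stand-in for list(df.columns); "b" appears twice on purpose.
columns = ["a", "b", "c", "b"]
x = "b"

# Chained call: list.remove mutates in place and returns None,
# so the new list is lost and None is what gets assigned.
chained = list(columns).remove(x)

# Single-line alternative: builds a new list without any occurrence of x.
filtered = [item for item in columns if item != x]

# Two-step version: removes only the first occurrence of x.
result = list(columns)
result.remove(x)

print(chained, filtered, result)
```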
