I am completely new to Python, and to pandas of course. I am trying to run a function, "get_real", which gets the complete/expanded URL from a short URL. I have a dataframe in Python that contains all the short URLs. I am trying to do this in the following ways. One is to use a "for" loop which applies the function to all the elements and creates another series of expanded URLs, but I can't get it to work and I don't know why. I tried to write it like:
for i in df2:
    expanded(i) = get_real(df2[[i]])
    print(expanded)
    df2.[i,'expanded']
    next()
I also want to pass a function which will resume with the next item on error, but I am not sure how to do that.
The second solution I tried was passing the whole array to the applymap function:
df4 = df3.applymap(get_real)
but this also doesn't work for me.
Thanks for all the help!
If the short URLs are a column in the pandas DataFrame, you can use the apply function (though I am not sure if it would resume on error; most probably not).
Syntax -
df['<newcolumn>'] = df['<columnname>'].apply(<functionname>)
I am hoping all the short URLs are different rows in a single column.
If you want to use a for loop, then you can do something like -
for idx in df.index:
    try:
        df.loc[idx, '<newcolumn>'] = <functionname>(df.loc[idx, '<columnname>'])
    except <TheError you want to catch, or leave empty if you do not know>:
        <Do logic for handling the error>
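For example, here is a minimal sketch that combines the two ideas, so that apply resumes with the next row on error. The get_real name and the column names mirror the question; the toy data and the wrapper are assumptions:

import pandas as pd

df = pd.DataFrame({'short_url': ['http://bit.ly/abc', 'not-a-url']})

def get_real(short_url):
    # Stand-in for the asker's expander; a real one might follow
    # redirects, e.g. requests.head(url, allow_redirects=True).url
    if not short_url.startswith('http'):
        raise ValueError(f'cannot expand: {short_url}')
    return short_url + '/expanded'

def safe_get_real(short_url):
    # "Resume on error": a failing row becomes NA instead of
    # aborting the whole apply
    try:
        return get_real(short_url)
    except Exception:
        return pd.NA

df['expanded'] = df['short_url'].apply(safe_get_real)
print(df)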
I think the problem you are having is treating the dataframe just like a dictionary that has keys and values.
I think all you need to do is use
new_df = df2['expanded']
But you should show us what df2 looks like.
I have dataframes that follow the name syntax 'df#', and I would like to be able to loop through these dataframes in a function. In the code below, if the function "testing" is removed, the loop works as expected. When I add the function, it gets stuck on the "test" variable with KeyError: 'iris1'.
import statistics
import seaborn as sns
iris1 = sns.load_dataset('iris')
iris2 = sns.load_dataset('iris')
def testing():
    rows = []
    for i in range(2):
        test = vars()['iris' + str(i + 1)]
        rows.append([
            statistics.mean(test['sepal_length']),
            statistics.mean(test['sepal_width'])
        ])

testing()
The reason this will be valuable is because I am subsetting my dataframe df multiple times to create quick visualizations. So in Jupyter, I have one cell where I create visualizations off of df1,df2,df3. In the next cell, I overwrite df1,df2,df3 based on different subsetting rules. This is advantageous because I can quickly do this by calling a function each time, so the code stays quite uniform.
Store the datasets in a dictionary and pass that to the function.
import statistics
import seaborn as sns
datasets = {'iris1': sns.load_dataset('iris'), 'iris2': sns.load_dataset('iris')}
def testing(data):
    rows = []
    for i in range(1, 3):
        test = data[f'iris{i}']
        rows.append([
            statistics.mean(test['sepal_length']),
            statistics.mean(test['sepal_width'])
        ])
    return rows

rows = testing(datasets)
No...
You should NEVER have to write a sentence like "I have dataframes that follow the name syntax 'df#'".
Then you have a list of dataframes, or a dict of dataframes, depending on how you want to index them...
Here I would say a list
Then you can forget about vars(), trust me you don't need it... :)
EDIT:
And use list comprehensions; your code could fit in three lines:
import statistics
import seaborn as sns

list_iris = [sns.load_dataset('iris'), sns.load_dataset('iris')]
rows = [
    (statistics.mean(test['sepal_length']), statistics.mean(test['sepal_width']))
    for test in list_iris
]
Storing them as a list or dictionary allowed me to create the function. There is still the problem that the number of dataframes in the list varies. It would be nice to be able to just input an argument n specifying how many objects are in the list (I guess I could just add a bunch of if statements to define the list based on such an argument). EDIT: Changing my code so that I don't use the df# syntax, instead just putting the dataframes directly into a list.
The problem I was experiencing is still perplexing. I can't for the life of me figure out why the "test" variable performs as expected outside of a function but fails inside of one. I'm going to go the route of creating a list of dataframes, but I am still curious to understand why it fails inside the function.
I agree with @Icarwiz that it might not be the best way to go about it, but you can make it work with:
test = eval('iris' + str(i + 1))
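As for why vars() works at the top level but fails inside the function: with no argument, vars() returns the local symbol table, and inside testing() the irises live in the module's globals rather than in the function's locals. A quick demonstration (the names are just for illustration):

iris1 = 'some object'

def inside():
    # the locals of inside() do not contain module-level names
    return 'iris1' in vars()

print('iris1' in vars())  # True: at module level, vars() is globals()
print(inside())           # False: hence the KeyError inside testing()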
My code is at the bottom.
The parse_XML function converts an XML file to a df; for example, df = parse_XML("example.xml", lst_level2_tags) works.
But since I want to save several dfs, I want to have names like df_first_level_tag, etc.
When I run the code at the bottom, I get this error:
f'df_{first_level_tag}'=parse_XML("example.xml", lst_level2_tags)
^
SyntaxError: can't assign to literal
I also tried the .format method instead of an f-string, but it didn't work either.
There are at least 30 dfs to save and I don't want to do it one by one. f-strings have always worked for me in Python outside pandas, though.
Is the problem here the f-string/format method, or does my code have some other logic problem?
If you need it, the parse_XML function definition is taken directly from this link.
My code:
for first_level_tag in first_level_tags:
    lst_level2_tags = []
    for subchild in root[0]:
        lst_level2_tags.append(subchild.tag)
    f'df_{first_level_tag}' = parse_XML("example.xml", lst_level2_tags)
This seems like a situation where you'd be best served by putting them into a dictionary:
dfs = {}
for first_level_tag in first_level_tags:
    lst_level2_tags = []
    for subchild in root[0]:
        lst_level2_tags.append(subchild.tag)
    dfs[first_level_tag] = parse_XML("example.xml", lst_level2_tags)
There's nothing structurally wrong with your f-string, but you generally can't get dynamic variable names in Python without doing ugly things. In general, storing the values in a dictionary ends up being a much cleaner solution when you want something like that.
One advantage of working with them this way is that you can then just iterate over the dictionary later on if you want to do something to each of them. For example, if you wanted to write each of them to disk as a CSV with a name matching the tag, you could do something like:
for key, df in dfs.items():
    df.to_csv(f'{key}.csv')
You can also just refer to them individually (so if there was a tag named a, you could refer to dfs['a'] to access it in your code later).
I am trying to create an additional custom column using an existing column of a dataframe; however, the function I am using throws a TypeError during execution. I am very new to Python, can someone please help?
The dataframe used is as below:
match_all = match[['country_id','league_id','season','stage','date',
'home_team_api_id','away_team_api_id','home_team_goal','away_team_goal']]
And the function I am using is as below:
def goal_diff(matches):
    for i in matches:
        i['home_team_goal'] - i['away_team_goal']

goal_diff(match_all)
The reason your function did not work is that matches in your function is a dataframe. When you do:
for i in matches:
    print(i)
you would see that the column names of your current df are returned. This is how a for loop operates on a df. So in your function, when you are using i in your subtraction call:
i['home_team_goal'] -i['away_team_goal']
it is like doing
['country_id']['home_team_goal'] - ['country_id']['away_team_goal']
['league_id']['home_team_goal'] - ['league_id']['away_team_goal']
...
This operation in pandas doesn't make any sense. So what you actually want to do when you are calling specific dataframe columns is to use the name of the df with the column:
matches['home_team_goal'] - matches['away_team_goal']
Remember, matches is your function's input df. Lastly, in your for loop you are neither returning nor storing any value; you are just calling a subtraction on 2 columns. In your text editor or IDE you might see something print to screen, but in the future you will probably want to use these values for the next step in your code. So in a function, we use the return call to have the function actually give us values when we call it on something.
In your case, if I wrote my function below without the return call and then called the function on my dataframe, the operation would complete, but no value would be "returned" to me; it would just be produced and disappear.
Pre-edit answer.
You do not need to create a loop for this, pandas will do it for you:
def goal_diff(matches):
    return matches['home_team_goal'] - matches['away_team_goal']

match_all['home_away_goal_diff'] = goal_diff(match_all)
This function takes an input df and uses the columns 'home_team_goal' and 'away_team_goal' to calculate the difference. You also don't need a function for this. If you wanted to create a new column in your existing match_all df you could do this:
match_all['home_away_goal_diff'] = match_all['home_team_goal'] - match_all['away_team_goal']
I used the below code to split a dataframe using dask:
result=dd.from_pandas(df, chunksize=75)
I use the below code to create a custom JSON file:
for z in result:
    createjson(z)
It just didn't work! How can I access each chunk?
There may be a more native way (it feels like there should be), but you can do:
for i in range(result.npartitions):
    partition = result.get_partition(i)
    # your code here
We do not know what your createjson function does, but perhaps it is covered by to_json().
Alternatively, if you really want to do something custom to each of your partitions (and this is not unique to JSON), then you will want the method map_partitions().
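For illustration, a minimal sketch of the map_partitions() route; the createjson body and the file-naming scheme are assumptions, since we do not know what the real function does:

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({'x': range(300)})
result = dd.from_pandas(df, chunksize=75)

def createjson(part):
    # part arrives here as an ordinary pandas DataFrame (one chunk)
    part.to_json(f'chunk_{part.index[0]}.json', orient='records')
    return part

# meta describes the output schema; an empty frame with the right
# columns and dtypes is enough
result.map_partitions(createjson, meta=df.head(0)).compute()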
I just started learning Python. I am trying to change a column's data type from object to float to take out the mean. I have tried changing [] to () and even the "". I don't know whether it makes a difference or not. Please help me figure out what the issue is. Thanks!!
My code:
df["normalized-losses"]=df["normalized-losses"].astype(float)
The error which I see is attached as an image.
Use:
df['normalized-losses'] = df['normalized-losses'][~(df['normalized-losses'] == '?' )].astype(float)
Using df.normalized-losses leads to the interpreter evaluating df.normalized, which doesn't exist. The statement you have written executes (df.normalized) - (losses.astype(float)). There also appears to be a question mark in your data which can't be converted to float. The above statement converts to float only those rows which don't contain a question mark and drops the rest. If you don't want to drop those rows, you can replace the question marks with 0 using:
df['normalized-losses'] = df['normalized-losses'].replace('?', 0.0)
df['normalized-losses'] = df['normalized-losses'].astype(float)
Welcome to Stack Overflow, and good luck on your Python journey! An important part of coding is learning how to interpret error messages. In this case, the traceback is quite helpful - it is telling you that you cannot call normalized after df, since a dataframe does not have a method of this name.
Of course you weren't trying to call something called normalized, but rather the normalized-losses column. The way to do this is as you already did once - df["normalized-losses"].
As to your main problem - if even one of your values can't be converted to a float, the column-wide operation will fail. This is very common. You need to first eliminate all of the non-numerical items in the column; one way to find them is with df[~df['normalized-losses'].str.isnumeric()].
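If it helps, here is a short sketch of another common route, pd.to_numeric with errors='coerce', which turns anything non-numeric (such as the '?') into NaN so that mean() simply skips it. The toy data is made up:

import pandas as pd

df = pd.DataFrame({'normalized-losses': ['164', '?', '122']})

# '?' cannot be parsed, so errors='coerce' replaces it with NaN
df['normalized-losses'] = pd.to_numeric(df['normalized-losses'], errors='coerce')

print(df['normalized-losses'].mean())  # NaN excluded: (164 + 122) / 2 = 143.0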
The "df.normalized-losses" does not signify anything to python in this case. you can replace it with df["normalized-losses"]. Usually, if you try
df["normalized-losses"]=df["normalized-losses"].astype(float)
this should work. What this does is take the normalized-losses column from the dataframe, convert it to float, and reassign it to the same column in the dataframe. But sometimes the data might need some processing before the above statement works.
You can't use - in an attribute or variable name. Perhaps you mean normalized_losses?