How to keep looping Pandas Dataframe - python

I have a function that is very repetitive. I would like to keep looping instead of writing all this code out.

You can generate the columns in a loop, adjusting the column names on each pass:
for i in range(2, 6):
    prev = 'finalvalue{}'.format(i - 1)
    df['finalvalue{}'.format(i)] = df.iloc[::-1, :].groupby([df.id, df[prev].diff().lt(0).cumsum()])[prev].cumsum()
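A runnable sketch with toy data (the column names `id` and `finalvalue1`, and the toy values, are assumptions based on the snippet above):

```python
import pandas as pd

# Toy data; the real column names come from the question's snippet.
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'finalvalue1': [1, 2, 3, 1, 2],
})

for i in range(2, 4):
    prev = f'finalvalue{i - 1}'
    # Group by id and by runs where the previous column is non-decreasing,
    # then take a reversed cumulative sum within each run; the result
    # aligns back to df by index on assignment.
    df[f'finalvalue{i}'] = (
        df.iloc[::-1, :]
          .groupby([df['id'], df[prev].diff().lt(0).cumsum()])[prev]
          .cumsum()
    )

print(df)
```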

Related

print index and do logic in list comprehension

I have a really large list and while processing it I want to know which index I am on during the process.
A simple example:
l = ['a','b','c']
[ print(i), char.capitalize() for i,char in enumerate(l)]
But here i is not defined. Is it possible to print and run some logic in a list comprehension?
Update: Seems normal for loop is the way. fyi, my motivation is to use asyncio.gather, for which I've only seen examples in list comprehension, e.g.
async def gather():
    await asyncio.gather(*[slowtask() for _ in range(10)])
All of the comments are good, but keep in mind that print is an I/O operation at the operating-system level and is slow; if your list is very large, calling it on every element will make your code much slower.
I suspect it is the long runtime that makes you want to see the index of your list in the first place.
If that is the case, I'd advise using the pandas package: convert your list to a pandas Series and use pandas methods for your computation. Pandas is very fast and will optimize your computation.
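For reference, the plain-loop version the update settles on can be sketched as follows, together with a comprehension variant that exploits the fact that print() returns None:

```python
l = ['a', 'b', 'c']

# Plain loop: the index is available and the side effect is explicit.
result = []
for i, char in enumerate(l):
    print(i)
    result.append(char.capitalize())

# Comprehension trick: print(i) returns None, so `or` falls
# through to the capitalized value.
result2 = [print(i) or char.capitalize() for i, char in enumerate(l)]
```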

variable dataframe name - loop works by itself, but not inside of function

I have dataframes that follow a name syntax of 'df#', and I would like to be able to loop through these dataframes in a function. In the code below, if the function "testing" is removed, the loop works as expected. When I add the function, it fails on the "test" variable with KeyError: 'iris1'.
import statistics
import seaborn as sns

iris1 = sns.load_dataset('iris')
iris2 = sns.load_dataset('iris')

def testing():
    rows = []
    for i in range(2):
        test = vars()['iris' + str(i + 1)]   # KeyError: 'iris1' inside the function
        rows.append([
            statistics.mean(test['sepal_length']),
            statistics.mean(test['sepal_width'])
        ])

testing()
The reason this will be valuable is because I am subsetting my dataframe df multiple times to create quick visualizations. So in Jupyter, I have one cell where I create visualizations off of df1,df2,df3. In the next cell, I overwrite df1,df2,df3 based on different subsetting rules. This is advantageous because I can quickly do this by calling a function each time, so the code stays quite uniform.
Store the datasets in a dictionary and pass that to the function.
import statistics
import seaborn as sns

datasets = {'iris1': sns.load_dataset('iris'), 'iris2': sns.load_dataset('iris')}

def testing(data):
    rows = []
    for i in range(1, 3):
        test = data[f'iris{i}']
        rows.append([
            statistics.mean(test['sepal_length']),
            statistics.mean(test['sepal_width'])
        ])

testing(datasets)
No...
You should never find yourself saying "I have dataframes that follow the name syntax 'df#'".
What you actually have is a list of dataframes, or a dict of dataframes, depending on how you want to index them...
Here I would say a list.
Then you can forget about vars(); trust me, you don't need it... :)
EDIT:
And use list comprehensions; your code could fit in three lines:
import statistics
import seaborn as sns

list_iris = [sns.load_dataset('iris'), sns.load_dataset('iris')]
rows = [
    (statistics.mean(test['sepal_length']), statistics.mean(test['sepal_width']))
    for test in list_iris
]
Storing the dataframes in a list or dictionary allowed me to create the function. There is still the problem that the number of dataframes in the list varies. It would be nice to be able to just pass an argument n specifying how many objects are in the list (I guess I could add a bunch of if statements to define the list based on such an argument). **EDIT: I'm changing my code so that I don't use the df# syntax, and instead put the dataframes directly into a list.
The problem I was experiencing is still perplexing. I can't for the life of me figure out why the "test" variable performs as expected outside of a function, but inside of a function it fails. I'm going to go the route of creating a list of dataframes, but am still curious to understand why it fails inside of the function.
I agree with #Icarwiz that it might not be the best way to go about it, but you can make it work with:
test = eval('iris' + str(i + 1))
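As for why the original works at the top level but fails inside the function: at module level vars() is the same dictionary as globals(), but called with no arguments inside a function it returns the local namespace, which does not contain the module-level name iris1. A minimal demonstration (a plain dataframe stands in for the seaborn dataset):

```python
import pandas as pd

# Stand-in for sns.load_dataset('iris'), defined at module level.
iris1 = pd.DataFrame({'sepal_length': [5.1], 'sepal_width': [3.5]})

def lookup():
    # vars() with no arguments is locals() here, so the module-level
    # name 'iris1' is missing -> vars()['iris1'] raises KeyError.
    return 'iris1' in vars(), 'iris1' in globals()

print(lookup())
```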

Applying for loops on dataframe?

I am applying a for loop to a column in Python, but I am not able to execute it; it produces an error. I want the square of a column. Please see where I am making a mistake. I know I can do this with a lambda, but I want to do it the traditional way.
import pandas as pd
output = []
for i in pd.read_csv("infy.csv"):
    output.append(i['Close']**2)
print(output)
The whole point of pandas is not to loop:
output = pd.read_csv("infy.csv")['Close']**2
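The original loop fails because iterating a DataFrame yields its column names, not its rows. A sketch with toy data standing in for infy.csv:

```python
import pandas as pd

# Toy stand-in for pd.read_csv("infy.csv").
df = pd.DataFrame({'Close': [10.0, 12.5, 11.0]})

# Iterating a DataFrame yields column *names* (strings),
# which is why i['Close'] blows up in the question's loop.
assert list(df) == ['Close']

# Traditional loop, iterating the column's values instead:
output = []
for value in df['Close']:
    output.append(value ** 2)

# Vectorized one-liner from the answer:
squared = df['Close'] ** 2
```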

Memory error when using Pandas built-in divide, but looping works?

I have two DataFrames, each having 100,000 rows. I am trying to do the following:
new = dataframeA['mykey']/dataframeB['mykey']
and I get an 'Out of Memory' error. I get the same error if I try:
new = dataframeA['mykey'].divide(dataframeB['mykey'])
But if I loop through each element, like this, it works:
result = []
for idx in range(0, dataframeA.shape[0]):
    result.append(dataframeA.ix[idx, 'mykey'] / dataframeB.ix[idx, 'mykey'])
What's going on here? I'd think that the built-in Pandas functions would be much more memory efficient.
#ayhan got it right off the bat.
My two dataframes were not using the same indices. Resetting them worked.
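To see why the indices matter: Series division aligns on index labels, so mismatched indices produce NaNs (and duplicated labels can blow the result up in size, which is where the memory error comes from). A small sketch of the reset_index fix, with toy data in place of the real frames:

```python
import pandas as pd

# Same length, deliberately different indices.
dataframeA = pd.DataFrame({'mykey': [10, 20, 30]}, index=[0, 1, 2])
dataframeB = pd.DataFrame({'mykey': [2, 4, 5]}, index=[5, 6, 7])

# Division aligns on index labels; none match, so every result is NaN.
misaligned = dataframeA['mykey'] / dataframeB['mykey']
assert misaligned.isna().all()

# Resetting both indices makes the rows pair up positionally.
new = (dataframeA['mykey'].reset_index(drop=True)
       / dataframeB['mykey'].reset_index(drop=True))
```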

Call a function over elements of a list in python

I am completely new to Python and pandas, of course. I am trying to run a function get_real, which gets the complete/extended URL from a short URL. I have a dataframe in Python containing all the short URLs. One approach is a for loop that applies the function to each element and builds another series of extended URLs, but I am not able to get it working; I don't know why. I tried to write it like:
for i in df2:
    expanded(i) = get_real(df2[[i]])
    print(expanded)
    df2.[i,'expanded']
    next()
I also want to pass a function that will resume on the next element on error, but I'm not sure how to do that.
The second solution I tried was passing the whole frame to applymap:
df4 = df3.applymap(get_real)
but this doesn't work for me either.
Thanks for all the help!
If the short URLs are a column in the pandas DataFrame, you can use the apply function (though it will not resume on error by itself).
Syntax -
df['<newcolumn>'] = df['<columnname>'].apply(<functionname>)
I am assuming all the short URLs are separate rows in a single column.
If you want to use a for loop, then you can do something like -
for idx in df.index:
    try:
        df.loc[idx, '<newcolumn>'] = <functionname>(df.loc[idx, '<columnname>'])
    except <the error you want to catch; if you do not know, leave it bare>:
        <logic for handling the error>
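One way to get resume-on-error behavior while keeping apply is to wrap the function so each failure is caught per row. A sketch with a dummy get_real (the real function is not shown in the question, so this stand-in is an assumption):

```python
import pandas as pd

# Dummy stand-in for the question's get_real.
def get_real(short_url):
    if 'bad' in short_url:
        raise ValueError('cannot expand')
    return short_url + '/expanded'

def safe_get_real(url):
    # Catch per-row failures so apply keeps going past bad URLs.
    try:
        return get_real(url)
    except ValueError:
        return None

df2 = pd.DataFrame({'short': ['u1', 'bad2', 'u3']})
df2['expanded'] = df2['short'].apply(safe_get_real)
```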
I think the problem you are having is treating the dataframe just like a dictionary with keys and values.
I think all you need to do is use:
new_df = df2['expanded']
But you should show us what df2 looks like.
