Applying a for loop to a DataFrame column? - python

I am applying a for loop to a column in Python, but I am not able to execute it; it produces an error. I want the square of a column. Please point out where I am making a mistake. I know I can do this with a lambda, but I want to do it the traditional way.
import pandas as pd
output = []
for i in pd.read_csv("infy.csv"):
    output.append(i['Close']**2)
print(output)

The whole point of pandas is not to loop. Iterating over a DataFrame yields its column names (strings), which is why i['Close'] raises an error. The vectorized version:
output = pd.read_csv("infy.csv")['Close']**2
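If you really want the traditional loop, iterate over the column's values rather than over the DataFrame itself. A minimal sketch of that fix:
import pandas as pd

df = pd.read_csv("infy.csv")
output = []
for value in df['Close']:  # iterating df itself would yield column names, not values
    output.append(value ** 2)
print(output)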

Related

Snowpark-Python Dynamic Join

I have searched through a large amount of documentation to try to find an example of what I'm trying to do. I admit that the bigger issue may be my lack of Python expertise, so I'm reaching out here in hopes that someone can point me in the right direction. I am trying to create a Python function that dynamically queries tables based on its parameters. Here is an example of what I'm trying to do:
def validateData(_ses, table_name, sel_col, join_col, data_state, validation_state):
    sdf_t1 = _ses.table(table_name).select(sel_col).filter(col('state') == data_state)
    sdf_t2 = _ses.table(table_name).select(sel_col).filter(col('state') == validation_state)
    df_join = sdf_t1.join(sdf_t2, [sdf_t1[i] == sdf_t2[i] for i in join_col], 'full')
    return df_join.to_pandas()
This would be called like this:
df = validateData(ses,'table_name',[col('c1'),col('c2')],[col('c2'),col('c3')],'AZ','TX')
The issue I'm having is with the join line of the function:
df_join = sdf_t1.join(sdf_t2, [col(sdf_t1[i]) == col(sdf_t2[i]) for i in join_col],'full')
I know that code is incorrect, but I hope it explains what I'm trying to do. If anyone has any advice on whether this is possible, or how, I would greatly appreciate it.
Instead of joining the DataFrames in Snowpark, I think it's easier to use direct SQL, pull the data into a Snowpark DataFrame, and convert that to a pandas DataFrame.
from snowflake.snowpark import Session
import pandas as pd

# Snowpark DataFrame creation using SQL
data = session.sql("select t1.col1, t2.col2, t2.col2 from mytable t1 full outer join mytable2 t2 on t1.id=t2.id where t1.col3='something'")

# Convert the Snowpark DataFrame to a pandas DataFrame
data = pd.DataFrame(data.collect())
Essentially, what you need is to build a Python expression from two lists of variables. I don't have a better idea than using eval.
Maybe try eval(" & ".join(["(col(sdf_t1[i]) == col(sdf_t2[i]))" for i in join_col])). Be mindful that I have not fully tested this; it is just an idea to toss out.
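A sketch of an eval-free alternative, assuming join_col can be passed as plain column-name strings instead of Column objects (untested against a live Snowflake session): Snowpark lets you index a DataFrame by column name, and Column expressions can be combined with &.
from functools import reduce

from snowflake.snowpark.functions import col

def validate_data(ses, table_name, sel_col, join_col, data_state, validation_state):
    # Filter before selecting so 'state' does not have to appear in sel_col
    sdf_t1 = ses.table(table_name).filter(col('state') == data_state).select(sel_col)
    sdf_t2 = ses.table(table_name).filter(col('state') == validation_state).select(sel_col)
    # AND together one equality expression per join column
    condition = reduce(lambda a, b: a & b,
                       [sdf_t1[c] == sdf_t2[c] for c in join_col])
    return sdf_t1.join(sdf_t2, condition, 'full').to_pandas()

# e.g. df = validate_data(ses, 'table_name', ['c1', 'c2', 'c3'], ['c2', 'c3'], 'AZ', 'TX')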

PySpark - how to convert pandas code to PySpark - nested when

How are you?
So, I need to convert some code written in pandas to PySpark, but so far I'm still having some problems. I've tried using PySpark's when function, but the result is wrong. Can you help me? This is very simple and I'm just starting with PySpark.
df = df.assign(new_col=np.where(df.col1.isnull(),
                                np.where(df.col2.isnull(), df.col3,
                                         np.where(df.col3.isnull(), df.col2,
                                                  (df.col2 + df.col3 / 2))),
                                df.col1))
The main goal here is to understand how to use nested when.
Thank you in advance!
Without sample input and output it is hard to know what to expect, especially given that your code does not run in pandas either. Multiple when clauses should replace the nested np.where.
Code below:
from pyspark.sql.functions import col, when

(df.withColumn('newcol1', when(col('col1').isNull() & col('col2').isNull(), col('col3')))
   .withColumn('newcol1', when(col('col3').isNull() & col('col1').isNotNull(), col('col2')).otherwise(col('col3')))
   .withColumn('newcol1', when(col('col3').isNull() & col('col2').isNull(), col('col1')).otherwise(col('newcol1')))
   .show())
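Since the stated goal is understanding nested when, here is a sketch that mirrors the nested np.where structure directly. It assumes the intended final fallback is the average of col2 and col3, which the original parenthesization leaves ambiguous:
from pyspark.sql import functions as F

df = df.withColumn(
    'new_col',
    F.when(
        F.col('col1').isNull(),
        # the inner np.where chain becomes chained when clauses
        F.when(F.col('col2').isNull(), F.col('col3'))
         .when(F.col('col3').isNull(), F.col('col2'))
         .otherwise((F.col('col2') + F.col('col3')) / 2)
    ).otherwise(F.col('col1'))
)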

How to keep looping over a Pandas DataFrame

I have a function that is very repetitive. I would like to use a loop instead of writing out all this code.
You can use this syntax, with appropriate modification:
for i in range(2, 6):
    prev = 'finalvalue{}'.format(i - 1)
    df['finalvalue{}'.format(i)] = (
        df.iloc[::-1, :]
          .groupby([df.id, df[prev].diff().lt(0).cumsum()])[prev]
          .cumsum()
    )
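A self-contained illustration of the pattern on made-up data (the id and finalvalue column names come from the answer; the values are hypothetical), showing how each iteration derives a new column from the previous one:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'finalvalue1': [3, 2, 1, 2, 1]})

for i in range(2, 6):
    prev = 'finalvalue{}'.format(i - 1)
    # start a new group whenever the previous column decreases
    keys = [df.id, df[prev].diff().lt(0).cumsum()]
    # reversed cumulative sum within each (id, segment) group
    df['finalvalue{}'.format(i)] = df.iloc[::-1, :].groupby(keys)[prev].cumsum()

print(df)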

While True loop: pandas DataFrame won't show?

Using Jupyter Notebook, if I put in the following code:
import pandas as pd
df = pd.read_csv('path/to/csv')
while True:
    df
The dataframe won't show. Can anyone tell me why this is the case? I'm guessing it's because the constant looping is preventing the dataframe from loading fully. Is that what's happening here?
I need code that gets a user's input. If they type in a name, for example, I'll extract that person's info from the DataFrame and display it, and then the program asks for another name. This continues until they type in "quit". I figured a while loop would be best for that, but it looks like there's just something about while loops and pandas that doesn't mix. Does anyone have any suggestions on what I can do instead?
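The cause is a Jupyter display rule rather than pandas: a cell auto-renders only its last bare expression, so a bare df inside a loop body is never displayed, and while True never lets the cell finish anyway. A minimal sketch of the lookup loop described above, assuming the CSV has a 'name' column:
import pandas as pd
from IPython.display import display  # explicit rendering inside loops

df = pd.read_csv('path/to/csv')

while True:
    name = input('Enter a name (or "quit" to exit): ')
    if name.lower() == 'quit':
        break
    # a bare expression is ignored inside the loop, so render explicitly
    display(df[df['name'] == name])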

Memory error when using Pandas built-in divide, but looping works?

I have two DataFrames, each having 100,000 rows. I am trying to do the following:
new = dataframeA['mykey']/dataframeB['mykey']
and I get an 'Out of Memory' error. I get the same error if I try:
new = dataframeA['mykey'].divide(dataframeB['mykey'])
But if I loop through each element, like this, it works:
result = []
for idx in range(0, dataframeA.shape[0]):
    # note: .ix was removed in modern pandas; .loc[idx, 'mykey'] is the equivalent here
    result.append(dataframeA.ix[idx, 'mykey'] / dataframeB.ix[idx, 'mykey'])
What's going on here? I'd think that the built-in Pandas functions would be much more memory efficient.
@ayhan got it right off the bat: my two DataFrames were not using the same indices. Resetting them worked.
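A minimal sketch of that fix, using the 'mykey' column from the question. Element-wise operations in pandas align both operands on their index labels before computing, and mismatched or duplicated labels can make that alignment enormous; the explicit loop is purely positional, so it never triggers alignment:
# Give both frames a fresh 0..n-1 index so division aligns row by row
dataframeA = dataframeA.reset_index(drop=True)
dataframeB = dataframeB.reset_index(drop=True)

new = dataframeA['mykey'] / dataframeB['mykey']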
