I am trying to run the code below. It works fine on small data, but on larger data it takes almost a day.
Can anyone help me optimise the code or suggest a better approach? Could apply with a lambda solve the issue?
for index in df.index:
    for i in df.index:
        if ((df.loc[index,"cityId"]==df.loc[i,"cityId"]) & (df.loc[index,"landingPagePath"]==df.loc[i,"landingPagePath"]) &
            (df.loc[index,"exitPagePath"]==df.loc[i,"exitPagePath"]) &
            (df.loc[index,"campaign"]==df.loc[i,"campaign"]) &
            (df.loc[index,"pagePath"]==df.loc[i,"previousPagePath"]) &
            ((df.loc[index,"dateHourMinute"]+timedelta(minutes=math.floor(df.loc[index,"timeOnPage"]/60))==df.loc[i,"dateHourMinute"]) |
             (df.loc[index,"dateHourMinute"]==df.loc[i,"dateHourMinute"]) |
             ((df.loc[index,"dateHourMinute"]+timedelta(minutes=math.floor(df.loc[index,"timeOnPage"]/60))+timedelta(minutes=1))==df.loc[i,"dateHourMinute"]))
           ):
            if(df.loc[i,"sess"]==0):
                df.loc[i,'sess']=df.loc[index,'sess']
            elif(df.loc[index,"sess"]>df.loc[i,"sess"] ):
                df.loc[index,'sess']=df.loc[i,'sess']
            elif(df.loc[index,"sess"]==0):
                df.loc[index,'sess']=df.loc[i,'sess']
            elif(df.loc[index,"sess"]<df.loc[i,"sess"] ):
                x=df.loc[i,"sess"]
                for q in df.index:
                    if(df.loc[q,"sess"]==x):
                        df.loc[q,"sess"]=df.loc[index,'sess']
        else:
            if (df.loc[index,"sess"]==0):
                df.loc[index,'sess'] = max(df["sess"])+1
It looks like you're trying to do a database "join" manually. Pandas exposes this functionality as merge, and using it would go a long way towards solving your issue.
I'm having trouble following all your branches, but you should be able to get most of the way there if you use a merge and then do some post-processing / filtering to get the final answer.
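As a rough sketch of that idea (it covers only the pair-matching part of the nested loop, not the session-numbering branches), the match condition could be expressed as a self-merge followed by a time filter. The helper name linked_pairs and the toy data are hypothetical; the column names come from the question, and dateHourMinute is assumed to be a datetime column.

```python
import pandas as pd

# Columns that must be equal on both rows, as in the question's loop.
KEYS = ["cityId", "landingPagePath", "exitPagePath", "campaign"]


def linked_pairs(df):
    """Return index pairs (idx_a, idx_b) that satisfy the loop's match condition."""
    rows = df.rename_axis("idx").reset_index()
    # Left side: the page a visitor is on, when they arrived, and how long they stayed.
    a = rows[["idx"] + KEYS + ["pagePath", "dateHourMinute", "timeOnPage"]].rename(
        columns={"idx": "idx_a", "dateHourMinute": "t_a"}
    )
    # Right side: rows whose previousPagePath must equal the left row's pagePath.
    b = rows[["idx"] + KEYS + ["previousPagePath", "dateHourMinute"]].rename(
        columns={"idx": "idx_b", "previousPagePath": "pagePath", "dateHourMinute": "t_b"}
    )
    pairs = a.merge(b, on=KEYS + ["pagePath"])
    # Arrival time plus whole minutes spent on the page.
    expected = pairs["t_a"] + pd.to_timedelta(pairs["timeOnPage"] // 60, unit="min")
    match = (
        (pairs["t_b"] == expected)
        | (pairs["t_b"] == pairs["t_a"])
        | (pairs["t_b"] == expected + pd.Timedelta(minutes=1))
    )
    return pairs.loc[match, ["idx_a", "idx_b"]].reset_index(drop=True)


# Toy data (hypothetical): row 0 leads to row 1 two minutes later; row 2 is too late.
demo = pd.DataFrame({
    "cityId": [1, 1, 1],
    "landingPagePath": ["/", "/", "/"],
    "exitPagePath": ["/c", "/c", "/c"],
    "campaign": ["x", "x", "x"],
    "previousPagePath": ["/", "/a", "/a"],
    "pagePath": ["/a", "/b", "/c"],
    "dateHourMinute": pd.to_datetime(
        ["2020-01-01 10:00", "2020-01-01 10:02", "2020-01-01 10:30"]
    ),
    "timeOnPage": [120, 30, 0],
})
print(linked_pairs(demo))
```

The merge only compares rows that already share the key columns, so it avoids visiting every pair of rows. The resulting pairs would still need post-processing to assign session numbers; that step is not shown here.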
I have searched through a large amount of documentation for an example of what I'm trying to do. I admit that the bigger issue may be my lack of Python expertise, so I'm reaching out here in the hope that someone can point me in the right direction. I am trying to create a Python function that dynamically queries tables based on the function's parameters. Here is an example of what I'm trying to do:
def validateData(_ses, table_name,sel_col,join_col, data_state, validation_state):
    sdf_t1 = _ses.table(table_name).select(sel_col).filter(col('state') == data_state)
    sdf_t2 = _ses.table(table_name).select(sel_col).filter(col('state') == validation_state)
    df_join = sdf_t1.join(sdf_t2, [sdf_t1[i] == sdf_t2[i] for i in join_col],'full')
    return df_join.to_pandas()
This would be called like this:
df = validateData(ses,'table_name',[col('c1'),col('c2')],[col('c2'),col('c3')],'AZ','TX')
The issue I'm having is with the join line of the function:
df_join = sdf_t1.join(sdf_t2, [col(sdf_t1[i]) == col(sdf_t2[i]) for i in join_col],'full')
I know that code is incorrect, but I'm hoping it explains what I'm trying to do. If anyone has advice on whether this is possible, or how to do it, I would greatly appreciate it.
Instead of joining DataFrames, I think it's easier to run the SQL directly, pull the result into a Snowpark DataFrame, and convert it to a pandas DataFrame.
from snowflake.snowpark import Session
import pandas as pd
# Create a Snowpark DataFrame from SQL (assumes `session` is an existing Session)
data = session.sql("select t1.col1, t2.col2, t2.col2 from mytable t1 full outer join mytable2 t2 on t1.id=t2.id where t1.col3='something'")
# Convert the Snowpark DataFrame to a pandas DataFrame, which you can then use as usual
data = pd.DataFrame(data.collect())
Essentially, what you need is to build a Python expression from two lists of variables. I don't have a better idea than using eval.
Maybe try eval(" & ".join(["(col(sdf_t1[i]) == col(sdf_t2[i]))" for i in join_col])). Be mindful that I have not fully tested this; I'm just tossing out an idea.
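One possible alternative to eval, offered as a sketch rather than a tested Snowpark solution, is to AND the per-column equality checks together with functools.reduce. The helper name build_join_condition is hypothetical, and it assumes join_col is passed as plain column-name strings rather than col(...) objects. pandas frames stand in for the Snowpark ones below so the snippet runs without a Snowflake session.

```python
from functools import reduce
from operator import and_

import pandas as pd


def build_join_condition(left, right, names):
    """AND together left[name] == right[name] for every name in names."""
    return reduce(and_, (left[name] == right[name] for name in names))


# Stand-in frames (hypothetical data); with Snowpark these would be sdf_t1 and sdf_t2.
t1 = pd.DataFrame({"c2": [1, 2], "c3": ["a", "b"]})
t2 = pd.DataFrame({"c2": [1, 2], "c3": ["a", "z"]})

condition = build_join_condition(t1, t2, ["c2", "c3"])
print(condition.tolist())
```

With Snowpark frames the call would presumably look like sdf_t1.join(sdf_t2, build_join_condition(sdf_t1, sdf_t2, ["c2", "c3"]), "full"). That call has not been run against Snowflake here, and joining a table to itself may additionally require disambiguating the column references.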
I have a very large dataset comprising data for several dozen samples, with several hundred subsamples within each sample. I need to get the mean, standard deviation, confidence intervals, etc. However, I'm running into a (suspected) massive performance problem that causes the code to never finish executing. I'll begin by explaining what my actual code does (I'm not sure how much of the actual code I can share, as it is part of an active research project; I hope to open-source it, but that will depend on the IP rules in the agreement), and then I'll share some code that replicates the problem and should hopefully allow somebody more well-versed in Vaex to tell me what I'm doing wrong!
My code currently calls the unique() method on my large Vaex dataframe to get a list of samples, and loops through that list with a for loop. On each iteration, it uses the sample number to make an expression representing that sample (so: df[df["sample"] == i]) and calls unique() on that subset to get a list of subsamples. Then it uses another for loop to repeat the process, creating an expression for the subsample and getting the statistical results for that subsample. This isn't the exact code but, in concept, it works like the code block below:
means = {}
list_of_samples = df["sample"].unique()
for sample_number in list_of_samples:
    sample = df[ df["sample"] == sample_number ]
    list_of_subsamples = sample["subsample"].unique()
    means[sample_number] = {}
    for subsample_number in list_of_subsamples:
        subsample = sample[ sample["subsample"] == subsample_number ]
        means[sample_number][subsample_number] = subsample["value"].mean()
If I try to run this code, it hangs on the line means[sample_number][subsample_number] = subsample["value"].mean() and never completes (not within around an hour, at least), so something is clearly wrong there. To try to diagnose the issue, I have tested the mean function by itself, and in expressions without the looping and other stuff. If I run:
mean = df["value"].mean()
it successfully gives me the mean for the entire "value" column within about 45 seconds. However, if I instead run:
sample = df[ df["sample"] == 1 ]
subsample = sample[ sample["subsample"] == 1 ]
mean = subsample["value"].mean()
The program just hangs. I've left it for an hour and still not gotten a result!
How can I fix this, and what am I doing wrong, so I can avoid this mistake in the future? If my reading of some discussions about Vaex is correct, I think I might be able to fix this using Vaex "selections", but I've tried to read the documentation on those and can't wrap my head around how I would properly use them here. Any help from a more experienced Vaex user would be greatly appreciated!
Edit: In case anyone finds this in the future, I was able to fix it by using the groupby method. I'm still really curious what was going wrong here, but I'll have to wait until I have more time to investigate it.
Looping can be slow, especially if you have many groups. It's more efficient to rely on built-in grouping:
import vaex
df = vaex.example()
df.groupby(by='id', agg="mean")
# for more complex cases, can use by=['sample', 'sub_sample']
Look at the variations of code I tried here
I'm trying to use Pandas to filter rows with multiple conditions and create a new csv file with only those rows. I've tried several different ways and then commented out each of those attempts (sometimes I only tried one condition for simplicity but it still didn't work). When the csv file is created, the filters weren't applied.
This is my updated code
I got it to work for condition #1, but I'm not sure how to add/apply condition #2. I tried a lot of different combinations. I know the code I put in the linked image wouldn't work for applying the 2nd condition because all I did was assign the variable, but it seemed too cumbersome to try to show all the ways I tried to do it. Any hints on that part?
df = pd.read_csv(excel_file_path)
#condition #1
is_report_period = (df["Report Period"]=="2015-2016") | \
(df["Report Period"]=="2016-2017") | \
(df["Report Period"]=="2017-2018") | \
(df["Report Period"]=="2018-2019")
#condition #2
is_zip_code = (df["Zip Code"]<"14800")
new_df = df[is_report_period]
You can easily achieve this by combining the two conditions with '&':
new_df = df[is_report_period & is_zip_code]
You can also make your code more readable, and the filter easier to change, by using isin:
Periods = ["2015-2016","2016-2017","2017-2018","2018-2019"]
is_report_period = df["Report Period"].isin(Periods)
This way you can easily alter the filter when needed, and it's easier to maintain.
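Putting the two suggestions together, a minimal runnable sketch might look like the following. The sample rows are made up for illustration, and the zip codes are assumed to be stored as strings, as the comparison in the question implies; if read_csv parsed them as integers, comparing against the string "14800" would raise a TypeError.

```python
import pandas as pd

# Made-up sample data using the column names from the question.
df = pd.DataFrame({
    "Report Period": ["2015-2016", "2014-2015", "2017-2018", "2018-2019"],
    "Zip Code": ["14701", "14750", "14850", "14620"],
})

periods = ["2015-2016", "2016-2017", "2017-2018", "2018-2019"]

# Condition 1: the report period is one of the wanted periods.
is_report_period = df["Report Period"].isin(periods)
# Condition 2: string comparison, as in the question.
is_zip_code = df["Zip Code"] < "14800"

# '&' keeps only the rows where both conditions hold.
new_df = df[is_report_period & is_zip_code]
print(new_df)
```

From here, new_df.to_csv(...) with a path of your choice writes only the filtered rows.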
I'm currently trying to normalize a DataFrame (~600k rows) with prices (pricevalue) in different currencies (pricecurrency) so that every row has its price in EUR.
I'd like to convert them with the daily rate taken from a column date.
My current "solution" (using the CurrencyConverter package found on PyPI) looks like this:
from currency_converter import CurrencyConverter
c = CurrencyConverter(fallback_on_missing_rate=True,fallback_on_missing_rate_method="last_known")
def convert_currency(row):
    return c.convert(row["pricevalue"], row["pricecurrency"],row["date"])
df["converted_eur"] = df.apply(lambda x: convert_currency(x),axis=1)
However, this solution is taking forever to run.
Is there a faster way to accomplish that? Any help is appreciated :)
It sounds strange to say this, but unfortunately you're not doing anything wrong!
The currency interpolation code is doing what you need it to do, and not much else. In your code, you're doing everything right, which means there is nothing you can quickly fix to gain performance. You wrap convert_currency in a lambda where passing the function directly is enough, but that won't make much of a difference:
i.e.
df["converted_eur"] = df.apply(lambda x: convert_currency(x),axis=1)
should be
df["converted_eur"] = df.apply(convert_currency, axis=1)
The first thing to do is to understand how long this processing will actually take by adding a progress bar:
from tqdm import tqdm
tqdm.pandas()  # registers progress_apply on pandas objects
df["converted_eur"] = df.progress_apply(convert_currency, axis=1)
Once you know how long the job will actually take, try these, in order:
1. Live with it.
2. Single-instance parallelization, with something like pandarallel.
3. Multi-instance parallelization, with something like Dask.
I am trying to filter a Spark DataFrame (v. 1.5.0) and am seeing a peculiar result. First, the results:
df.groupby('behavior').count().show()
+---------+-------+
| behavior| count|
+---------+-------+
| Bought|1582345|
| Looked|2154789|
+---------+-------+
This is consistent with the number of rows in my data frame. Also, we see that there are only two "behaviors" in the column (Bought and Looked). Here's where things get weird.
df[df['behavior']=='Bought'].count()
1025879
df[df['behavior']=='Looked'].count()
698742
What happened? Where did the other rows of the data frame go? Finally, to make things even stranger, using the isin() method provides the correct results.
df[df['behavior'].isin(['Bought'])].count()
1582345
df[df['behavior'].isin(['Looked'])].count()
2154789
I have no clue what's going on here. I would have expected these two filters to at least return consistent results (both wrong or both right). Any help would be greatly appreciated!
Edit
Ran the following filter operations as suggested below, all results were consistent with the incorrect answer.
df.filter(df['behavior']=='Bought').count()
1025879
df.filter(df.behavior=='Bought').count()
1025879
df.filter(F.col('behavior')=='Bought').count()
1025879
So it seems like the equality check is what's wrong. What's interesting though is that the isin() functionality still seems to be working. I would have thought isin() used an equality check behind the scenes but if it is I don't know why it returns different results.
Instead of doing
df[df['behavior']=='Bought'].count()
try
df.filter(df.behavior == 'Bought').count()
Do the same for the rest of the queries.
df[df['behavior']=='Bought'].count() works well with pandas.
To compare a column against a value in PySpark, you can try one of the options below.
df.filter(df.behavior == 'Bought').count()
df.filter(df["behavior"] == 'Bought').count()
from pyspark.sql.functions import col
df.filter(col("behavior") == 'Bought').count()
Hope this Helps.
Regards,
Neeraj