I am trying to filter a Spark DataFrame (v. 1.5.0) and am seeing a peculiar result. First, the results:
df.groupby('behavior').count().show()
+---------+-------+
| behavior| count|
+---------+-------+
| Bought|1582345|
| Looked|2154789|
+---------+-------+
This is consistent with the number of rows in my data frame. Also, we see that there are only two "behaviors" in the column (Bought and Looked). Here's where things get weird.
df[df['behavior']=='Bought'].count()
1025879
df[df['behavior']=='Looked'].count()
698742
What happened? Where did the other rows of the data frame go? Finally, to make things even stranger, using the isin() method provides the correct results.
df[df['behavior'].isin(['Bought'])].count()
1582345
df[df['behavior'].isin(['Looked'])].count()
2154789
I have no clue what's going on here. I would have expected these two filters to at least return consistent results (both wrong or both right). Any help would be greatly appreciated!
Edit
I ran the following filter operations as suggested below; all results were consistent with the incorrect answer.
df.filter(df['behavior']=='Bought').count()
1025879
df.filter(df.behavior=='Bought').count()
1025879
df.filter(F.col('behavior')=='Bought').count()
1025879
So it seems like the equality check is what's wrong. What's interesting, though, is that the isin() functionality still seems to work. I would have thought isin() used an equality check behind the scenes, but if it does I don't know why it returns different results.
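For reference, on clean data the two approaches should agree. Here is a minimal, self-contained sketch (assuming a Spark 1.x SQLContext named sqlContext; the example rows are made up):
rows = [('Bought',), ('Looked',), ('Looked',)]
small_df = sqlContext.createDataFrame(rows, ['behavior'])
small_df.filter(small_df['behavior'] == 'Bought').count()        # expected: 1
small_df.filter(small_df['behavior'].isin(['Bought'])).count()   # expected: 1
If those counts agree but your real data does not, the difference comes from the data or the Spark version rather than from the filter syntax itself.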
Instead of doing
df[df['behavior']=='Bought'].count()
try
df.filter(df.behavior == 'Bought').count()
Do the same for the rest of the queries.
df[df['behavior']=='Bought'].count() works well with Pandas.
To compare a column for equality in PySpark, try one of the options below.
df.filter(df.behavior == 'Bought').count()
df.filter(df["behavior"] == 'Bought').count()
from pyspark.sql.functions import col
df.filter(col("behavior") == 'Bought').count()
Hope this helps.
Related
I have a very large dataset comprising data for several dozen samples, and several hundred subsamples within each sample. I need to get the mean, standard deviation, confidence intervals, etc. However, I'm running into a (suspected) massive performance problem that causes the code to never finish executing. I'll begin by explaining what my actual code does (I'm not sure how much of the actual code I can share, as it is part of an active research project; I hope to open-source it, but that will depend on the IP rules in the agreement), and then I'll share some code that replicates the problem and should hopefully allow somebody a bit more well-versed in Vaex to tell me what I'm doing wrong!
My code currently calls the unique() method on my large Vaex dataframe to get a list of samples, and loops through that list of unique samples. On each iteration, it uses the sample number to make an expression representing that sample (so: df[df["sample"] == i] ) and calls unique() on that subset to get a list of subsamples. Then it uses another for-loop to repeat the process, creating an expression for the subsample and getting the statistical results for that subsample. This isn't the exact code but, in concept, it works like the code block below:
means = {}
list_of_samples = df["sample"].unique()
for sample_number in list_of_samples:
    sample = df[ df["sample"] == sample_number ]
    list_of_subsamples = sample["subsample"].unique()
    means[sample_number] = {}
    for subsample_number in list_of_subsamples:
        subsample = sample[ sample["subsample"] == subsample_number ]
        means[sample_number][subsample_number] = subsample["value"].mean()
If I try to run this code, it hangs on the line means[sample_number][subsample_number] = subsample["value"].mean() and never completes (not within around an hour, at least), so something is clearly wrong there. To try to diagnose the issue, I have tested the mean function by itself, and on expressions without the looping and other machinery. If I run:
mean = df["value"].mean()
it successfully gives me the mean for the entire "value" column within about 45 seconds. However, if instead I run:
sample = df[ df["sample"] == 1 ]
subsample = sample[ sample["subsample"] == 1 ]
mean = subsample["value"].mean()
The program just hangs. I've left it for an hour and still not gotten a result!
How can I fix this, and what am I doing wrong, so that I can avoid this mistake in the future? If my reading of some discussions regarding Vaex is correct, I think I might be able to fix this using Vaex "selections", but I've tried to read the documentation on those and can't wrap my head around how I would properly use them here. Any help from a more experienced Vaex user would be greatly appreciated!
Edit: In case anyone finds this in the future, I was able to fix it by using the groupby method. I'm still really curious what was going wrong here, but I'll have to wait until I have more time to investigate it.
Looping can be slow, especially if you have many groups; it's more efficient to rely on the built-in grouping:
import vaex
df = vaex.example()
df.groupby(by='id', agg="mean")
# for more complex cases, can use by=['sample', 'sub_sample']
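Applied to the question's columns, a sketch along these lines should replace both loops with a single grouped aggregation (this assumes columns named "sample", "subsample", and "value", as described in the post, and uses vaex.agg.mean):
import vaex
# One pass over the data instead of one filtered pass per subsample.
means_df = df.groupby(by=['sample', 'subsample'],
                      agg={'value_mean': vaex.agg.mean('value')})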
Look at the variations of code I tried here
I'm trying to use Pandas to filter rows with multiple conditions and create a new csv file with only those rows. I've tried several different ways and then commented out each of those attempts (sometimes I only tried one condition for simplicity, but it still didn't work). When the csv file is created, the filters aren't applied.
This is my updated code
I got it to work for condition #1, but I'm not sure how to add/apply condition #2. I tried a lot of different combinations. I know the code I put in the linked image wouldn't work for applying the 2nd condition because all I did was assign the variable, but it seemed too cumbersome to try to show all the ways I tried to do it. Any hints on that part?
df = pd.read_csv(excel_file_path)
#condition #1
is_report_period = (df["Report Period"]=="2015-2016") | \
(df["Report Period"]=="2016-2017") | \
(df["Report Period"]=="2017-2018") | \
(df["Report Period"]=="2018-2019")
#condition #2
is_zip_code = (df["Zip Code"]<"14800")
new_df = df[is_report_period]
You can easily achieve this by using the '&' operator:
new_df = df[is_report_period & is_zip_code]
Also, you can make your code more readable, and easier to change when the filtering needs to be adjusted, by using this method:
Periods = ["2015-2016","2016-2017","2017-2018","2018-2019"]
is_report_period = df["Report Period"].isin(Periods)
This way you can easily alter your filter when needed, and it's easier to maintain.
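Putting it together with the goal of writing only the filtered rows to a new csv, the whole flow might look like this (excel_file_path comes from the question; the output file name is a placeholder):
import pandas as pd
df = pd.read_csv(excel_file_path)
# Condition #1: keep only the listed report periods.
periods = ["2015-2016", "2016-2017", "2017-2018", "2018-2019"]
is_report_period = df["Report Period"].isin(periods)
# Condition #2: string comparison on the zip code, as in the question.
is_zip_code = df["Zip Code"] < "14800"
new_df = df[is_report_period & is_zip_code]
new_df.to_csv("filtered_output.csv", index=False)  # placeholder output path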
I'm trying to learn pandas in Python, so I created a simple spreadsheet containing several films and imported it into Python. How do I select films that are either action or comedy?
So far I have tried
df2=df[df['Genre']=='Action' or 'Comedy']
and
df2=df[(df['Genre']=='Action') or (df['Genre']=='Comedy')]
However, this works
df2 = df[df['Genre']=='Action']
df2 = df2.append(df[df['Genre']=='Comedy'])
but I believe this is an unorthodox way of doing it.
Is there a simpler or cleaner way of doing this?
You can do df[((df['Genre']=='Action') | (df['Genre']=='Comedy')) & (df['year'] <= '2000')]. Note the parentheses around the | expression: & binds more tightly than |, so without them the year condition would only apply to the 'Comedy' rows.
For this kind of situation, I find the .isin operator easier to understand and more compact:
df[df['Genre'].isin(['Action','Comedy'])]
This way, if you have additional criteria, you don't need to repeat the line that many times. For example:
df[df['Genre'].isin(['Action','Comedy','Drama','Romance','Kids'])]
Is much better than:
df[(df['Genre']=='Action') | (df['Genre']=='Comedy') | (df['Genre']=='Drama') |
(df['Genre']=='Romance') | (df['Genre']=='Kids')]
Try this; it gives you more freedom to add filter items (genres), instead of chaining more 'or' clauses.
required_genres = ['Action', 'Comedy']
df[df['Genre'].isin(required_genres)]
Obviously you can chain more conditions with this. To answer your second question:
df[(df['Genre'].isin(required_genres)) & (df['year'] <='2000')]
I am trying to add a column from one dataframe to another,
df.head()
street_map2[["PRE_DIR","ST_NAME","ST_TYPE","STREET_ID"]].head()
The PRE_DIR is just the prefix of the street name. What I want to do is add the STREET_ID column for the associated street to df. I have tried a few approaches, but my inexperience with pandas and with comparing strings is getting in the way,
street_map2['STREET'] = df["STREET"]
street_map2['STREET'] = np.where(street_map2['STREET'] == street_map2["ST_NAME"])
The above code raises "ValueError: Length of values does not match length of index". I've also tried using street_map2['STREET'].str in street_map2["ST_NAME"].str. Can anyone think of a good way to do this? (Note: it doesn't need to be 100% accurate, just get most of them, and it can be a completely different approach from the one tried above.)
EDIT: Thank you to all who have tried so far; I have not resolved the issue yet. Here is some more data,
street_map2["ST_NAME"]
I have tried this approach as suggested but still have some indexing problems,
def get_street_id(street_name):
    return street_map2[street_map2['ST_NAME'].isin(df["STREET"])].iloc[0].ST_NAME
df["STREET_ID"] = df["STREET"].map(get_street_id)
df["STREET_ID"]
This throws an error. If it helps, the data frames are not the same length. Any more ideas, or a way to fix the above, would be greatly appreciated.
For you to do this, you need to merge these dataframes. One way to do it is:
df.merge(street_map2, left_on='STREET', right_on='ST_NAME')
What this does is look for equal values in the ST_NAME and STREET columns and fill each row with the values from the other columns of both dataframes.
Check this link for more information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
Also, the strings in the columns you merge on have to match exactly (case included).
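If the only mismatch is letter case, one workaround (a sketch; the helper column names here are made up) is to merge on lower-cased copies of the key columns:
# Normalize case into temporary key columns, then merge on those.
df["street_key"] = df["STREET"].str.lower()
street_map2["name_key"] = street_map2["ST_NAME"].str.lower()
merged = df.merge(street_map2, left_on="street_key", right_on="name_key")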
You can do something like this, with a map function:
df["STREET_ID"] = df["STREET"].map(get_street_id)
Where get_street_id is defined as a function that, given a value from df["STREET"], will return a value to insert into the new column:
(disclaimer: currently untested)
def get_street_id(street_name):
    return street_map2[street_map2["ST_NAME"] == street_name].iloc[0].STREET_ID
We get a dataframe of street_map2 filtered to where the ST_NAME column equals the given street name:
street_map2[street_map2["ST_NAME"] == street_name]
Then we take the first element of that with iloc[0], and return its STREET_ID value.
We can then add the error tolerance you mentioned in your question by updating the indexing operation:
...
street_map2[street_map2["ST_NAME"].str.contains(street_name)]
...
or perhaps,
...
street_map2[street_map2["ST_NAME"].str.startswith(street_name)]
...
Or, more flexibly:
...
street_map2[
street_map2["ST_NAME"].str.lower().replace("street", "st").startswith(street_name.lower().replace("street", "st"))
]
...
...which will lowercase both values, convert, for example, "street" to "st" (so the mapping is more likely to overlap), and then check whether the normalized column value starts with the normalized search name.
If this is still not working for you, you may unfortunately need to come up with a more accurate mapping dataset between your street names! It is very possible that the street names are just too different to easily match with string comparisons.
(If you're able to provide some examples of street names and where they should overlap, we may be able to help you better develop a "fuzzy" match!)
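As a starting point for such a fuzzy match, here is a minimal sketch using Python's standard-library difflib (the function name and the 0.8 cutoff are illustrative choices, not part of the original answer):
from difflib import get_close_matches

def get_street_id_fuzzy(street_name):
    # Find the ST_NAME most similar to the given street name, if any is close enough.
    names = street_map2["ST_NAME"].str.lower()
    candidates = get_close_matches(street_name.lower(), names, n=1, cutoff=0.8)
    if not candidates:
        return None
    match = street_map2[names == candidates[0]]
    return match.iloc[0].STREET_ID

df["STREET_ID"] = df["STREET"].map(get_street_id_fuzzy)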
Alright, I managed to figure it out, but the solution probably won't be too helpful if you aren't in the exact same situation with the same data. Bernardo Alencar's answer was essentially correct, except I was unable to apply an operation to the strings while doing the merge (I am still not sure if there is a way to do it). I found another dataset that had the street names formatted similarly to the first. I then merged the first with the third, new data frame. After this I had the first and second both with ["STREET_ID"] columns. Then I finally managed to merge the second one with the combined one by using,
temp = combined["STREET_ID"]
CrimesToMapDF = street_maps.merge(temp, left_on='STREET_ID', right_on='STREET_ID')
Thus I got the desired final data frame with the associated street IDs.
I am wondering if anyone can assist me with this warning I get in my code. The code DOES score items correctly, but this warning is bugging me and I can't seem to find a good fix, given that I need to string a few boolean conditions together.
Background: Imagine that I have a magical fruit identifier and I have a csv file that lists what fruit was identified and in which area (1, 2, etc.). I read in the csv file with columns of "FruitID" and "Area." An identification of "APPLE" or "apple" in Zone 1 is scored as correct/true (other identified fruits are incorrect/false). I apply similar logic for other areas, but I won't get into that.
Any ideas for how to correct this? Should I use .loc? I'm not sure that will work with multiple booleans, though. Thanks!
My code snippet that triggers the SettingWithCopyWarning:
Area1_ID_df['Area 1, Score']=(Area1_ID_df['FruitID']=='APPLE')|(Area1_ID_df['FruitID']=='apple')
Stacktrace:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Pandas finds what you are trying to do ambiguous. Certain operations return a view of the dataset, whereas others return a copy. The confusion is whether you want to modify a copy of the dataset, modify the original dataset, or create something new.
https://www.dataquest.io/blog/settingwithcopywarning/ is a great link to learn more about the problem you are having.
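A common way to make the intent explicit is to take an explicit copy when slicing, so the later column assignment unambiguously targets the new frame. A sketch, assuming Area1_ID_df was itself produced by slicing a larger DataFrame with an "Area" column (which the question implies):
# Explicit copy: later assignments modify this frame, not a view of df.
Area1_ID_df = df[df["Area"] == 1].copy()
Area1_ID_df["Area 1, Score"] = Area1_ID_df["FruitID"].isin(["APPLE", "apple"])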
If the line that's causing this warning is truly s = t | u, where t and u are Boolean series indexed consistently, you should not worry about SettingWithCopyWarning.
This is a warning rather than an error. An error indicates there is a problem; a warning indicates there may be a problem. In this case, Pandas guesses you may be working with a copy rather than a view.
If the result is as you expect, you can safely ignore the warning.
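If you have verified the result and simply want to silence the warning, pandas also exposes a global option (use with care, since it hides the warning everywhere):
import pandas as pd
# Default is 'warn'; set to None to suppress SettingWithCopyWarning globally.
pd.options.mode.chained_assignment = None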