I can't drop null values with PySpark - python

[Screenshot of the PySpark code and output: https://i.stack.imgur.com/aSDrk.png]
As you can see in the picture, I'm trying to remove some rows from the data with PySpark, but after the na.drop call the rows are not removed. What is the problem? Do you have any idea?
I checked the column names and the na.drop parameters, but nothing changed.
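For reference, a minimal sketch of typical na.drop usage: Spark DataFrames are immutable, so na.drop returns a new DataFrame and the result has to be assigned back (the file and column names below are placeholders, since the actual data only appears in the screenshot).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)  # hypothetical input file

# na.drop returns a new DataFrame; nothing changes unless you assign the result back
df = df.na.drop(how="any", subset=["some_column"])  # "some_column" is a placeholder name
df.show()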

Related

Showing cells with a particular symbol in a pandas dataframe

I have not seen such a question before, so if you happen to know the answer or have seen the same question, please let me know.
I have a pandas dataframe with 4 columns and 5k rows; one of the columns is "price" and I need to do some manipulation with it. But the data was parsed from a web page and is not clean, so I cannot convert this column to an integer type even after getting rid of the dollar signs and commas. I found out that it also contains data in the format 3500/mo, so I need to filter the cells containing /mo and decide whether I can drop them, based on how many of those I have and what the prices are.
So far I have managed to count those cells using
df["price"].str.contains("/").sum()
but when I want to see those cells, I cannot: when I create another variable to extract the slash-containing cells and use "contains" or something similar, I get a series of True/False values showing whether each cell does or does not contain the slash, while I actually need to see the cells themselves. Any ideas?
You need to use the boolean mask returned by df["price"].str.contains("/") as index to get the respective rows, i.e., df[df["price"].str.contains("/")] (cf. the pandas docs on indexing).
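A short sketch of that filtering, assuming "price" is a string column (the toy data below is made up; na=False just guards against missing values):
import pandas as pd

df = pd.DataFrame({"price": ["$1,200", "3500/mo", "$980", "4000/mo"]})  # made-up example data

mask = df["price"].str.contains("/", na=False)  # True where the cell contains a slash
slash_rows = df[mask]                           # the cells themselves, not just True/False
print(slash_rows)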

Values of the columns are null and swapped in pyspark dataframe

I am using pyspark==2.3.1. I performed data preprocessing on the data using pandas, and now I want to convert my preprocessing function from pandas to pyspark. But while reading the CSV file with pyspark, a lot of values in a column that actually has values become null. If I try to perform any operation on this dataframe, it swaps the values of that column with other columns. I have also tried different versions of pyspark. Please let me know what I am doing wrong. Thanks
Result from pyspark:
The values of the column "property_type" show up as null, but the actual dataframe has values there instead of null.
CSV file:
PySpark does, however, work fine with small datasets.
In our case we faced a similar issue. Things you need to check:
Check whether your data contains " [double quotes]; PySpark treats these as quote characters, and the rows can come out looking malformed.
Check whether your CSV data is multiline.
We handled this situation with the following configuration:
spark.read.options(header=True, inferSchema=True, escape='"').option("multiline", 'true').csv(schema_file_location)
Are you limited to the CSV file format?
Try Parquet. Just save your DataFrame in pandas with .to_parquet() instead of .to_csv(). Spark works really well with this format.
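A minimal sketch of that Parquet round trip, assuming a Parquet engine (pyarrow or fastparquet) is installed on the pandas side; the file names are placeholders:
import pandas as pd
from pyspark.sql import SparkSession

pdf = pd.read_csv("data.csv")              # hypothetical source file
pdf.to_parquet("data.parquet")             # instead of .to_csv()

spark = SparkSession.builder.getOrCreate()
sdf = spark.read.parquet("data.parquet")   # column types survive intact, no quoting issues
sdf.show(5)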

Unable to update new column values in rows which were derived from an existing column having multiple values separated by ','?

Original dataframe:
Converted dataframe using stack and split:
Adding a new column to the converted dataframe:
What I am trying to do is add a new column using np.select(condition, values), but it is not updating the two additional rows derived from H1; it returns 0 or NaN. Can someone please help me here?
Please note I have already reset the index, but it is still not helping.
I think using numpy in this situation is unnecessary.
You can use something like the following code (it has to be a single .loc assignment, not chained indexing, for the update to stick):
df.loc[df.State == 'CT', 'H3'] = 4400000
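For illustration, a small sketch of that assignment with a toy dataframe (the real columns aren't shown in the question); chained indexing like df[df.State == 'CT']['H3'] = ... only writes to a temporary copy, whereas .loc with a boolean mask updates the original frame:
import pandas as pd

df = pd.DataFrame({"State": ["CT", "NY", "CT"], "H3": [0, 0, 0]})  # toy data

df.loc[df.State == "CT", "H3"] = 4400000  # updates df in place
print(df)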

Grouping issue in case of an empty data frame - python

Oncall_cntrl = Oncall_cntrl.groupby(['EMP NAME', 'SoW', 'Phase Name', 'EMPRemarks_Oncall', 'Location'], as_index=False)['HourInDecimal'].sum()
df_concat_Oncall = pd.concat([df_concat_OT, Oncall_cntrl], ignore_index=True)
This works fine when Oncall_cntrl has observations, but sometimes, when there is no on-call data, the dataframe is empty, and then the groupby returns an empty dataframe without the column structure, which breaks the concat in the next step.
Could you please let me know how to keep the column structure when the Oncall_cntrl df is empty, so that I can concatenate in the next step? I'm stuck at this point. Kindly help.
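One possible workaround, sketched under the assumption that the expected columns are the groupby keys plus the summed column, and reusing the Oncall_cntrl and df_concat_OT frames from the question: guard the empty case so the frame still carries its column structure before the concat.
import pandas as pd

expected_cols = ['EMP NAME', 'SoW', 'Phase Name', 'EMPRemarks_Oncall', 'Location', 'HourInDecimal']

if Oncall_cntrl.empty:
    # an empty frame that still has the column structure, so the concat keeps working
    Oncall_cntrl = pd.DataFrame(columns=expected_cols)
else:
    Oncall_cntrl = Oncall_cntrl.groupby(
        ['EMP NAME', 'SoW', 'Phase Name', 'EMPRemarks_Oncall', 'Location'],
        as_index=False)['HourInDecimal'].sum()

df_concat_Oncall = pd.concat([df_concat_OT, Oncall_cntrl], ignore_index=True)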

Why is `df.columns` an empty list while I can see the column names if I print out the dataframe? Python Pandas

import pandas as pd
DATA = pd.read_csv(url)
DATA.head()
I have a large dataset with dozens of columns. After loading it like above into Colab, I can see the name of each column. But running DATA.columns just returns Index([], dtype='object'). What's happening here?
Now I find it impossible to pick out a few columns without column names. One way is to specify names=[...] when I load it, but I'm reluctant to do that since there are too many columns. So I'm looking for a way to index columns by integers, like in R, where df[, c(1, 2, 3)] would simply give me the first three columns of a dataframe. Somehow pandas seems to focus on column names and makes integer indexing very inconvenient, though.
So what I'm asking is: (1) What did I do wrong? Can I obtain those column names as well when I load the dataframe? (2) If not, how can I pick out the [0, 1, 10]th columns by a list of integers?
It seems that the problem was in the loading, as DATA.shape returned (10000, 0). I reran the loading code a few times and all of a sudden things went back to normal. Maybe Colab was taking a nap or something?
You can do that with df.iloc[:, [1, 2, 3]] (positional indexing, unlike label-based df.loc), but I would suggest you use the names, because if the columns ever change order or you insert new columns, the code can break.
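A minimal sketch of that positional selection once the load works (url is the same variable as in the question; the positions [0, 1, 10] are the ones asked about):
import pandas as pd

DATA = pd.read_csv(url)             # url as defined in the question
subset = DATA.iloc[:, [0, 1, 10]]   # select columns by integer position
print(subset.columns)               # original column names are preserved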
