I have searched many places but still can't come up neither with my own logic neither find on the internet ...
problem
I have students performance dataset while performing EDA , i came up with a small problem
like ,why students having zero 'absences' have zeroes in their final grades ..
that is practically impossible for a student to be present the whole year and still get a zero in their finals
So I decided to filter out all the rows with zeroes in those two columns using
dataset[(dataset['G3']==0)&(dataset['absences']==0)]
but this returned a dataframe
So i tried
dataset.loc[(dataset['G3']==0)&(dataset['absences']==0),['G3','absences']]
which returned me two columns with the condition satisfied , what i wanted is to replace 'G3' column zeroes and 'absences' column zeroes to be replaced with their respective means and not disturb the dataframe too
i tried to replace them by
dataset.loc[(dataset['G3']==0)&(dataset['absences']==0),['G3','absences']].replace(0,np.mean[dataset[['G3','absences']]])
which threw me error
function object cannot be subscriptable
I don't know what to do
I have tried many things but still can't get through this problem any solution may help
thanks in advance
In case you want to replace with the mean of subset of values != 0, the you can use
dataset = pd.DataFrame({'G3': np.random.randint(0,3,100),
'absences' : np.random.randint(0,3,100)})
dataset.loc[(dataset['G3']==0)&(dataset['absences']==0),['G3', 'absences']] = [dataset.loc[(dataset['G3']!=0)]['G3'].mean(), dataset.loc[(dataset['absences']!=0)]['absences'].mean()]
Related
I've got multiple excels and I need a specific value but in each excel, the cell with the value changes position slightly. However, this value is always preceded by a generic description of it which remains constant in all excels.
I was wondering if there was a way to ask Python to grab the value to the right of the element containing the string "xxx".
try iterating over the excel files (I guess you loaded each as a separate pandas object?)
somehting like for df in [dataframe1, dataframe2...dataframeN].
Then you could pick the column you need (if the column stays constant), e.g. - df['columnX'] and find which index it has:
df.index[df['columnX']=="xxx"]. Maybe will make sense to add .tolist() at the end, so that if "xxx" is a value that repeats more than once, you get all occurances in alist.
The last step would be too take the index+1 to get the value you want.
Hope it was helpful.
In general I would highly suggest to be more specific in your questions and provide code / examples.
I'm trying to add a new column to a dataframe using the following code.
labels = df1['labels']
df2['labels'] = labels
However, in the later part of my program, I found that there might be something wrong with the assignment. So, I checked it using
labels.equals(other=df2['labels'])
and I got a False. (I added this line instantly after assignment)
I also tried to
print out part of labels and df2, and it turns out that there are indeed some lines that are different.
check max and min values of both series, and they are different
check number of unique values in both series using len(set(labels)) and len(set(df2['labels'])), and they differs a lot
test with a smaller amount of data, but this works totally fine.
My dataframe is rather large (40 million+ lines), so I cannot print them all out and check the values. Does anyone have any idea about what might lead to this kind of problem? Or is there any suggestions for further tests?
I have a Spark DataFrame (sdf) where each row shows an IP visiting a URL. I want to count distinct IP-URL pairs in this data frame and the most straightforward solution is sdf.groupBy("ip", "url").count(). However, since the data frame has billions of rows, precise counts can take quite a while. I'm not particularly familiar with PySpark -- I tried replacing .count() with .approx_count_distinct(), which was syntactically incorrect.
I searched "how to use .approx_count_distinct() with groupBy()" and found this answer. However, the solution suggested there (something along those lines: sdf.groupby(["ip", "url"]).agg(F.approx_count_distinct(sdf.url).alias("distinct_count"))) doesn't seem to give me the counts that I want. The method .approx_count_distinct() can't take two columns as arguments, so I can't write sdf.agg(F.approx_count_distinct(sdf.ip, sdf.url).alias("distinct_count")), either.
My question is, is there a way to get .approx_count_distinct() to work on multiple columns and count distinct combinations of these columns? If not, is there another function that can do just that and what's an example usage of it?
Thank you so much for your help in advance!
Group with expressions and alias as needed. Lets try:
df.groupBy("ip", "url").agg(expr("approx_count_distinct(ip)").alias('ip_count'),expr("approx_count_distinct(url)").alias('url_count')).show()
Your code sdf.groupby(["ip", "url"]).agg(F.approx_count_distinct(sdf.url).alias("distinct_count")) will give a value of 1 to every group since you are counting the value of one of the grouping column; url.
If you want to count distinct of IP-URL pairs using approx_count_distinct function, you can compound them in an array then apply the function. It would be something like this
sdf.selectExpr("approx_count_distinct(array(ip, url)) as distinct_count")
I am trying to add a column from one dataframe to another,
df.head()
street_map2[["PRE_DIR","ST_NAME","ST_TYPE","STREET_ID"]].head()
The PRE_DIR is just the prefix of the street name. What I want to do is add the column STREET_ID at the associated street to df. I have tried a few approaches but my inexperience with pandas and the comparison of strings is getting in the way,
street_map2['STREET'] = df["STREET"]
street_map2['STREET'] = np.where(street_map2['STREET'] == street_map2["ST_NAME"])
The above code shows an "ValueError: Length of values does not match length of index". I've also tried using street_map2['STREET'].str in street_map2["ST_NAME"].str. Can anyone think of a good way to do this? (note it doesn't need to be 100% accurate just get most and it can be completely different from the approach tried above)
EDIT Thank you to all who have tried so far I have not resolved the issues yet. Here is some more data,
street_map2["ST_NAME"]
I have tried this approach as suggested but still have some indexing problems,
def get_street_id(street_name):
return street_map2[street_map2['ST_NAME'].isin(df["STREET"])].iloc[0].ST_NAME
df["STREET_ID"] = df["STREET"].map(get_street_id)
df["STREET_ID"]
This throws this error,
If it helps the data frames are not the same length. Any more ideas or a way to fix the above would be greatly appreciated.
For you to do this, you need to merge these dataframes. One way to do it is:
df.merge(street_map2, left_on='STREET', right_on='ST_NAME')
What this will do is: it will look for equal values in ST_NAME and STREET columns and fill the rows with values from the other columns from both dataframes.
Check this link for more information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
Also, the strings on the columns you try to merge on have to match perfectly (case included).
You can do something like this, with a map function:
df["STREET_ID"] = df["STREET"].map(get_street_id)
Where get_street_id is defined as a function that, given a value from df["STREET"]. will return a value to insert into the new column:
(disclaimer; currently untested)
def get_street_id(street_name):
return street_map2[street_map2["ST_NAME"] == street_name].iloc[0].ST_NAME
We get a dataframe of street_map2 filtered by where the st-name column is the same as the street-name:
street_map2[street_map2["ST_NAME"] == street_name]
Then we take the first element of that with iloc[0], and return the ST_NAME value.
We can then add that error-tolerance that you've addressed in your question by updating the indexing operation:
...
street_map2[street_map2["ST_NAME"].str.contains(street_name)]
...
or perhaps,
...
street_map2[street_map2["ST_NAME"].str.startswith(street_name)]
...
Or, more flexibly:
...
street_map2[
street_map2["ST_NAME"].str.lower().replace("street", "st").startswith(street_name.lower().replace("street", "st"))
]
...
...which will lowercase both values, convert, for example, "street" to "st" (so the mapping is more likely to overlap) and then check for equality.
If this is still not working for you, you may unfortunately need to come up with a more accurate mapping dataset between your street names! It is very possible that the street names are just too different to easily match with string comparisons.
(If you're able to provide some examples of street names and where they should overlap, we may be able to help you better develop a "fuzzy" match!)
Alright, I managed to figure it out but the solution probably won't be too helpful if you aren't in the exact same situation with the same data. Bernardo Alencar's answer was essential correct except I was unable to apply an operation on the strings while doing the merge (I still am not sure if there is a way to do it). I found another dataset that had the street names formatted similar to the first. I then merged the first with the third new data frame. After this I had the first and second both with columns ["STREET_ID"]. Then I finally managed to merge the second one with the combined one by using,
temp = combined["STREET_ID"]
CrimesToMapDF = street_maps.merge(temp, left_on='STREET_ID', right_on='STREET_ID')
Thus getting the desired final data frame with associated street ID's
My issue is that I need to identify the patient "ID" if anything critical (high conc. XT or increase in Crea) is observed in their blood sample.
Ideally, the sick patients "ID" should be categorized into one of the three groups which could be called Bad_30, Bad_40, and Bad_40. If the patients don't make it into one of the "Bad" groups, then they are non-criticals
See answer
This might be the way:
critical = df[(df['hour36_XT']>=2.0) | (df['hour42_XT']>=1.5) | (df['hour48_XT']>=0.5)]
not_critical = df[~df.index.isin(critical.index)]
Before using this, you will have to convert the data type of all values to float. You can do that by using dtype=np.float32 while defining the data frame
You can put multiple conditions within one df.loc bracket. I tried this on your dataset and it worked as expected:
newDf = df.loc[(df['hour36_XT'] >= 2.0) & (df['hour42_XT'] >= 1.0) & (df['hour48_XT'] >= 0.5)])
print(newDF['ID'])
Explanation: I'm creating a new dataframe using your conditions and then printing out the IDs of the resulting dataframe.
Words of advice: You should avoid iterating over Pandas dataframe rows, and once you learn to utilize Pandas you'll be surprised how rarely you need to do this. This should be the first lesson when starting to use Pandas, but it's so ingrained in us programmers that we tend to skip over the powerful abilities of the Pandas package and immediately turn to using row iterations. If you rely on row iteration when working with Pandas you'll likely find an annoying slowness when you start working with larger datasets and/or more complex operations. I recommend reading up on this, I'm a beginner myself and have found this article to be a good reference point.