I want to clear the contents of the location cells for the first two duplicates of each last name.
For example, I want to clear out the first two location occurrences for Balchuinas and only keep the third one. The same goes for London and Fleck. I ONLY want to clear out the location cells, not the complete rows.
Any help?
I tried .drop_duplicates(keep='last'), but that removes the whole row. I only want to clear the contents of the cells (or change them to NaN if that's possible).
PS: This is my first time asking a question, so I'm not sure how to paste the image without a link. Please help!
Rather than removing the duplicate rows, I would suggest finding the duplicate values and replacing them with NaN while keeping the last occurrence.
Something like this:
df[df.duplicated(keep='last')] = float('nan')
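If you only want to blank out the location cells for the duplicated last names (and leave everything else alone), something along these lines should do it. This is a rough sketch: I'm assuming your columns are literally named 'Last Name' and 'Location', so adjust to your real column names:

import numpy as np

# Assumed column names -- replace 'Last Name' and 'Location' with your real ones.
# keep='last' marks every occurrence of a last name except the last one.
mask = df.duplicated(subset=['Last Name'], keep='last')
df.loc[mask, 'Location'] = np.nan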
I've got multiple Excel files and I need a specific value from each, but in each file the cell with the value changes position slightly. However, this value is always preceded by a generic description of it, which remains constant across all the files.
I was wondering if there was a way to ask Python to grab the value to the right of the element containing the string "xxx".
Try iterating over the Excel files (I guess you loaded each one as a separate pandas DataFrame?),
something like for df in [dataframe1, dataframe2, ..., dataframeN].
Then you could pick the column you need (if the column stays constant), e.g. df['columnX'], and find which index the value has:
df.index[df['columnX'] == "xxx"]. It may make sense to add .tolist() at the end, so that if "xxx" is a value that repeats more than once, you get all occurrences in a list.
The last step would be to take the index + 1 to get the value you want.
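Putting it together, a rough sketch (the dataframe names and 'columnX' are placeholders, and I'm assuming a default 0..N-1 index):

for df in [dataframe1, dataframe2, dataframeN]:
    # indexes of every row whose description cell equals "xxx"
    matches = df.index[df['columnX'] == "xxx"].tolist()
    for idx in matches:
        value_below = df.loc[idx + 1, 'columnX']   # the entry one row below the description
        print(value_below)
        # if the value actually sits in the column to the right rather than the row below,
        # something like df.loc[idx, 'columnY'] would be needed instead ('columnY' is made up)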
Hope it was helpful.
In general, I would highly suggest being more specific in your questions and providing code/examples.
I'm working with Python on Excel files. Until now I was using openpyxl. I need to iterate over the rows and delete some of them if they do not meet specific criteria. Let's say I was using something like:
current_row = 1
while current_row <= data_ws.max_row:
    if 'something' in data_ws[f'L{current_row}'].value:
        data_ws.delete_rows(current_row)
        continue  # don't advance: the next row has just shifted up into current_row
    current_row += 1
Everything was alright until I encountered a problem with max_row. In a new Excel file which I received to process, max_row was returning more rows than actually exist. After some googling I found out why it happens.
Here's a great explanation of the problem which I found in a comment section on Stack Overflow:
However, ws.max_row will not check if last rows are empty or not. If cell's content at the end of the worksheet is deleted using Del key or by removing duplicates, remaining empty rows at the end of your data will still count as a used row. If you do not want to keep these empty rows, you will have to delete those entire rows by selecting rows number on the left of your spreadsheet and deleting them (right click on selected row number(s) -> Delete) –
V. Brunelle
Thanks to V. Brunelle for the very good explanation of the cause of the problem.
In my case it is because some of the rows were deleted by removing duplicates. For example, there are 400 rows in my file listed one after another (without any gaps), but max_row returns 500.
For now I'm using a quick fix:
while current_row <= len([row for row in data_ws.iter_rows(min_row=min_row) if not all(cell.value is None for cell in row)]):
But I know that it is very inefficient, which is the reason I'm asking this question. I'm looking for a possible solution.
From what I've found here on Stack Overflow, I can:
Create a copy of the worksheet, iterate over that copy, and call delete_rows on the original worksheet, so I would need to apply my fix only once
Iterate backwards with a for loop so I won't have to deal with max_row, since for loops work fine in that case (they read the proper dimensions). This method seems promising to me (a rough sketch follows this list), but I always have 4 rows at the top of the workbook which I'm not touching at all, and any potential debugging would need to be done backwards as well, which might not be very enjoyable :D.
Use another Python library to process Excel files, but I don't know which one would be better, because keeping the workbook styles (and changing them if needed) is very important to me. I've read some promising things about the pywin32 library (win32com.client), but it seems to lack documentation, it might be hard to work with, and I don't know how it performs. I was also considering pandas, but to put it kindly it messes up the styles (in reality it deletes all styles in the worksheet).
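For reference, a rough, untested sketch of what option 2 (iterating backwards) might look like, assuming column L still holds the text I filter on, as in my snippet above:

# Go from the bottom up so deleting a row never shifts the rows still to be checked.
# Rows 1-4 are the header rows I never touch.
for row_num in range(data_ws.max_row, 4, -1):
    value = data_ws[f'L{row_num}'].value
    if value is not None and 'something' in value:
        data_ws.delete_rows(row_num)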
I'm stuck now, because I really don't know which route I should choose.
I would appreciate any advice or opinions on the topic, and if possible I would like to have a small discussion here.
Best regards!
If max_row doesn't report what you expect, you'll need to sort the issue out as best you can. That might be by manually deleting the leftover rows: "delete those entire rows by selecting rows number on the left of your spreadsheet and deleting them (right click on selected row number(s) -> Delete)". Or you could make some other determination in your code as to what the last row really is, then perhaps programmatically delete all the rows from there to max_row, so that at least it reports correctly on the next run.
You could also incorporate your fix code into your example code for deleting rows that meet specific criteria.
For example: a test sheet has 9 rows of data, but cell B15 is an empty string, so max_row returns 15 rather than 9.
The example code checks each used cell in the row for a None value and only processes the 9 rows with data.
from openpyxl import load_workbook

filename = "foo.xlsx"
wb = load_workbook(filename)
data_ws = wb['Sheet1']

print(f"Max Rows Reports {data_ws.max_row}")
for row in data_ws:
    print(f"Checking row {row[0].row}")
    if all(cell.value is not None for cell in row):
        if 'something' in data_ws[f'L{row[0].row}'].value:
            data_ws.delete_rows(row[0].row)
    else:
        print(f"Actual Max Rows is {row[0].row}")
        break

wb.save('out_' + filename)
Output
Max Rows Reports 15
Checking row 1
Checking row 2
Checking row 3
Checking row 4
Checking row 5
Checking row 6
Checking row 7
Checking row 8
Checking row 9
Actual Max Rows is 9
Of course this is not perfect: if any of the 9 rows with data had a single None cell value, the loop would stop at that point. However, if you know that's not going to be the case, it may be all you need.
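If that assumption doesn't hold, a variant (an untested sketch) could first work out the last row that really holds data, stopping only at a completely empty row, and then delete the matching rows from the bottom up:

# Find the last row that actually contains any data
last_data_row = 0
for row in data_ws:
    if all(cell.value is None for cell in row):
        break
    last_data_row = row[0].row

# Delete bottom-up so indexes don't shift under the loop
for row_num in range(last_data_row, 0, -1):
    value = data_ws[f'L{row_num}'].value
    if value is not None and 'something' in value:
        data_ws.delete_rows(row_num)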
I am currently working on an NLP project on my own, and am having some trouble even after reading through the documentation. I scraped Reddit posts and wanted to find out which posts are duplicated in the 'selftext' and 'title' columns. The 3 snippets shown below are what I ran, and the results are shown in the picture.
May I ask why codes 2 and 3 return posts that are not flagged as duplicates by code 1?
(1)investing_data[['selftext','title']][investing_data.duplicated(subset=['selftext','title'])]
(2)investing_data[['selftext', 'title']][investing_data.duplicated(subset=['selftext'])]
(3)investing_data[['selftext', 'title']][investing_data.duplicated(subset=['title'])]
screenshot of the 3 codes above
You are in fact checking different data for duplicates in each line:
What you see in all three cases are the duplicates, i.e. the second, third and so forth occurrences of a row.
In line (1): you check if both selftext and title are the same.
In line (2): you check for entries that have a duplicated selftext.
In line (3): you check for entries that have a duplicated title.
With your way of displaying them you only see the duplicates themselves, but not all entries that are actually involved in the duplication (i.e. including the first occurrence).
For that you can use something like the following:
investing_data[investing_data['selftext'].isin(investing_data[investing_data.duplicated(subset=['selftext'])]['selftext'])][['selftext', 'title']]
What this does is take the selftext values that are duplicated, look up exactly those entries in the original dataframe, and display their selftext and title.
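Alternatively, duplicated() accepts keep=False, which flags every occurrence (the first one included), so the same result can be had more directly:

# keep=False marks all rows whose selftext appears more than once, first occurrence included
investing_data[investing_data.duplicated(subset=['selftext'], keep=False)][['selftext', 'title']]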
So this is an assignment from my Python class. We have a dataframe that contains information about a machine. For every minute the machine stands still (as in: no product is being produced, for whatever reason), a new row containing the reason is added to the dataframe.
This is what the first standstill looks like in the DF (the text looks weird because I had to translate it from German):
We see that Reason 1 for the standstill started at 2020-03-02 at 14:04 and was rectified at 14:07.
The assignment now is to create a new dataframe to consolidate this information, so that it would look somewhat like this:
I had the idea to use .shift() in order to check for the beginning and the end of a new Standstillreason. The column "SAP?" (Same as previous?) checks whether the Standstillreason in the current row is the same as in the previous row. The same goes for "SAN?" (Same as next?) regarding the next row.
df["SAP?"] = df.Standstillreason.eq(df.Standstillreason.shift())
df["SAN?"] = df.Standstillreason.eq(df.Standstillreason.shift(periods=-1))
Every time "SAP?" is False, the row contains the start of a new Standstillreason.
If "SAN?" is False, the row contains the end of the current reason.
What I'm trying to do now is extract the needed information and put it into the consolidated df. I thought of something like this:
for index, row in df.iterrows():
    if row["SAP?"] == False:
        df_consolidated["Standstillreason"] = row["Standstillreason"]
But so far this hasn't worked. I'm not even sure whether I have the right approach to the problem.
If the reasons are all different, then simply
df.groupby("Standstillreason").agg({"Date":["first","last"], "Time":["first", "last"]})
Should work.
If not, it is a bit more complicated. You want to create two dataframes: one with the rows where "SAP?" is False (the start of each reason) and the other with the rows where "SAN?" is False (the end of each reason), take the start date/time from the first and the end date/time from the second, then concat them together.
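A rough sketch of that second approach, reusing the "SAP?"/"SAN?" columns from your question (the "Date" and "Time" column names are taken from the groupby above and may need adjusting):

import pandas as pd

starts = df[~df["SAP?"]].reset_index(drop=True)   # rows that open a new reason
ends = df[~df["SAN?"]].reset_index(drop=True)     # rows that close it

# Each run of a reason has exactly one start and one end, in the same order,
# so after reset_index the two frames line up row for row.
df_consolidated = pd.DataFrame({
    "Standstillreason": starts["Standstillreason"],
    "Start date": starts["Date"],
    "Start time": starts["Time"],
    "End date": ends["Date"],
    "End time": ends["Time"],
})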
I am trying to add a column from one dataframe to another,
df.head()
street_map2[["PRE_DIR","ST_NAME","ST_TYPE","STREET_ID"]].head()
PRE_DIR is just the prefix of the street name. What I want to do is add the STREET_ID column for the associated street to df. I have tried a few approaches, but my inexperience with pandas and with comparing strings is getting in the way,
street_map2['STREET'] = df["STREET"]
street_map2['STREET'] = np.where(street_map2['STREET'] == street_map2["ST_NAME"])
The above code raises "ValueError: Length of values does not match length of index". I've also tried using street_map2['STREET'].str in street_map2["ST_NAME"].str. Can anyone think of a good way to do this? (Note it doesn't need to be 100% accurate, just match most of them, and it can be completely different from the approach tried above.)
EDIT: Thank you to all who have tried so far; I have not resolved the issue yet. Here is some more data,
street_map2["ST_NAME"]
I have tried this approach as suggested but still have some indexing problems,
def get_street_id(street_name):
    return street_map2[street_map2['ST_NAME'].isin(df["STREET"])].iloc[0].ST_NAME

df["STREET_ID"] = df["STREET"].map(get_street_id)
df["STREET_ID"]
This throws this error,
If it helps, the data frames are not the same length. Any more ideas, or a way to fix the above, would be greatly appreciated.
For you to do this, you need to merge these dataframes. One way to do it is:
df.merge(street_map2, left_on='STREET', right_on='ST_NAME')
What this will do is look for equal values in the STREET and ST_NAME columns and fill each row with the values from the other columns of both dataframes.
Check this link for more information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
Also, the strings on the columns you try to merge on have to match perfectly (case included).
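If the names only differ by case or stray whitespace, you could normalise both key columns into a helper column first and merge on that. A rough sketch (the 'street_key' column name is made up):

# Normalise both sides so e.g. 'MAIN ST ' and 'Main St' still match
df['street_key'] = df['STREET'].str.strip().str.lower()
street_map2['street_key'] = street_map2['ST_NAME'].str.strip().str.lower()

merged = df.merge(street_map2[['street_key', 'STREET_ID']], on='street_key', how='left')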
You can do something like this, with a map function:
df["STREET_ID"] = df["STREET"].map(get_street_id)
Where get_street_id is defined as a function that, given a value from df["STREET"], will return a value to insert into the new column:
(disclaimer: currently untested)
def get_street_id(street_name):
    return street_map2[street_map2["ST_NAME"] == street_name].iloc[0].ST_NAME
We get a dataframe of street_map2 filtered to where the ST_NAME column is the same as the street name:
street_map2[street_map2["ST_NAME"] == street_name]
Then we take the first element of that with iloc[0], and return the ST_NAME value.
We can then add the error tolerance that you mentioned in your question by updating the indexing operation:
...
street_map2[street_map2["ST_NAME"].str.contains(street_name)]
...
or perhaps,
...
street_map2[street_map2["ST_NAME"].str.startswith(street_name)]
...
Or, more flexibly:
...
street_map2[
    street_map2["ST_NAME"].str.lower().str.replace("street", "st").str.startswith(street_name.lower().replace("street", "st"))
]
...
...which will lowercase both values, convert, for example, "street" to "st" (so the mappings are more likely to overlap), and then check whether the map's name starts with the given street name.
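One more caveat: whichever filter you pick, .iloc[0] will raise an IndexError whenever nothing matches, so a guarded version (again untested) might be safer:

def get_street_id(street_name):
    matches = street_map2[street_map2["ST_NAME"] == street_name]
    if matches.empty:
        return None   # no match found -- the new column gets NaN for this street
    # swap ST_NAME for STREET_ID here if the ID is what you want in the new column
    return matches.iloc[0].ST_NAME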
If this is still not working for you, you may unfortunately need to come up with a more accurate mapping dataset between your street names! It is very possible that the street names are just too different to easily match with string comparisons.
(If you're able to provide some examples of street names and where they should overlap, we may be able to help you better develop a "fuzzy" match!)
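For what it's worth, the standard library's difflib can already do a rough fuzzy lookup; something along these lines might be a starting point (untested, and the cutoff would need tuning):

import difflib

def get_street_id(street_name):
    # pick the single closest ST_NAME, or return None if nothing is close enough
    candidates = difflib.get_close_matches(street_name, street_map2["ST_NAME"].tolist(), n=1, cutoff=0.8)
    if not candidates:
        return None
    return street_map2[street_map2["ST_NAME"] == candidates[0]].iloc[0].STREET_ID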
Alright, I managed to figure it out, but the solution probably won't be too helpful if you aren't in the exact same situation with the same data. Bernardo Alencar's answer was essentially correct, except I was unable to apply an operation to the strings while doing the merge (I'm still not sure if there is a way to do it). I found another dataset that had the street names formatted similarly to the first. I then merged the first with the third, new data frame. After this, the first and second both had a STREET_ID column. Then I finally managed to merge the second one with the combined one by using,
temp = combined["STREET_ID"]
CrimesToMapDF = street_maps.merge(temp, left_on='STREET_ID', right_on='STREET_ID')
Thus I got the desired final data frame with the associated street IDs.