pandas df.loc returns blank df - python

I'm sure I'm making an obviously mistake, but can't see it.
I have a df that looks like this:
id year plan grade prior_grade
21 2017 text A B
56 2015 text B B
43 2016 text A C
and want to create a new df with only those rows where prior_grade = c. I'm using this to do so:
prior_c = (df.loc[(df['prior_grade']=='C')])
which returns an empty df (column names print but no rows when calling prior_c.head())
Again, I'm sure I'm making an obvious mistake, but just can't see it.
edit: also tried with less parens and got the same result:
prior_c = df.loc[df['prior_grade']=='C']

This should work, although I believe you should make a copy of the dataframe. This serves to explicitly make the new dataframe a copy (rather than a view of the original) so as to avoid unintentionally changing your original df. I would recommend the following:
prior_c = df.loc[df['prior_grade']=='C'].copy()

Related

Dropping Rows that Contain a Specific String wrapped in square brackets?

I'm trying to drop rows which contain strings that are wrapped in a column. I want to drop all values that contain the strings '[removed]', '[deleted]'.
My df looks like this:
Comments
1 The main thing is the price appreciation of the token (this determines the gains or losses more
than anything). Followed by the ecosystem for the liquid staking asset, the more opportunities
and protocols that accept the asset as collateral, the better. Finally, the yield for staking
comes into play.
2 [deleted]
3 [removed]
4 I could be totally wrong, but sounds like destroying an asset and claiming a loss, which I
believe is fraudulent. Like someone else said, get a tax guy - for this year anyway and then
you'll know for sure. Peace of mind has value too.
I have tried df[df["Comments"].str.contains("removed")==False]
But when i try to save the dataframe, it is still not removed.
EDIT:
My full code
import pandas as pd
sol2020 = pd.read_csv("Solana_2020_Comments_Time_Adjusted.csv")
sol2021 = pd.read_csv("Solana_2021_Comments_Time_Adjusted.csv")
df = pd.concat([sol2021, sol2020], ignore_index=True, sort=False)
df[df["Comments"].str.contains("deleted")==False]
df[df["Comments"].str.contains("removed")==False]
Try this
I have created a data frame for comments column and used my own comments but it should work for you
import pandas as pd
sample_data = { 'Comments': ['first comment whatever','[deleted]','[removed]','last comments whatever']}
df = pd.DataFrame(sample_data)
data = df[df["Comments"].str.contains("deleted|removed")==False]
print(data)
output I got
Comments
0 first comment whatever
3 last comments whatever
You can do it like this:
new_df = df[~(df['Comments'].str.startswith('[') & df['Comments'].str.endswith(']'))].reset_index(drop=True)
Output:
>>> new_df
Comments
0 The main thing is the price appreciation of th...
3 I could be totally wrong, but sounds like dest...
That will remove all rows where the value of the Comments column for that row starts with [ and ends with ].

How to modify multiple values in one column, but skip others in pandas python

Going on two months in python and I am focusing hard on Pandas right now. In my current position I use VBA on data frames, so learning this to slowly replace it and further my career.
As of now I believe my true problem is the lack of understanding a key concept(s). Any help would be greatly appreciated.
That said here is my problem:
Where could I go to learn more on how to do stuff like this for more precise filtering. I'm very close but there is one key aspect I need.
Goal(s)
Main goal I need to skip certain values in my ID column.
The below code takes out the Dashes "-" and only reads up to 9 digits. Yet, I need to skip certain IDs because they are unique.
After that I'll start to work on comparing multiple sheets.
Main data frame IDs is formatted as 000-000-000-000
The other data frames that I will compare it to have it with no
dashes "-" as 000000000 and three less 000's totaling nine digits.
The unique IDs that I need skipped are the same in both data frames, but are formatted completely different ranging from 000-000-000_#12, 000-000-000_35, or 000-000-000_z.
My code that I will use on each ID except the unique ones:
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
but I want to use an if statement like (This does not work)
lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs", ]
if ~dfSS["ID"].isin(lst ).any()
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
pass
For more clarification my input DataFrame is this:
ID Street # Street Name
0 004-330-002-000 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001-243-313-000 2235 Narnia
3 002-730-032-000 2149 Narnia
4 000-000-000_a 1234 Narnia
And I am looking to do this as the output:
ID Street # Street Name
0 004330002 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001243313000 2235 Narnia
3 002730032000 2149 Narnia
4 000-000-000_a 1234 Narnia
Notes:
dfSS is my Dataframe variable name aka the excel I am using. "ID" is
my column heading. Will make this an index after the fact
My Data frame on this job is small with # of (rows, columns) as (2500, 125)
I do not get an error message so I am guessing maybe I need a loop of some kind. Starting to test for loops with this as well. no luck there... yet.
Here is where I have been to research this:
Comparison of a Dataframe column values with a list
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
if statement with ~isin() in pandas
recordlinkage module-I didn't think this was going to work
Regular expression operations - Having a hard time fully understanding this at the moment
There are a number of ways to do this. The first way here doesn't involve writing a function.
# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing
The second way is to write a function that conditionally converts the IDs, and it's not as fast as the first method.
def transform_ID(ID_val):
if ID_val not in lst:
return ID_val.replace("-", "")[:9]
dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
This is based on #xyzxyzjayne answers but I have two issues I can not figure out.
First issue
is I get this warning: (see Edit)
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Documentation for this warning
You'll see in the code below I tried to put in .loc but I can't seem to find out how to eliminate this warning by using .loc correctly. Still learning it. NO, I will not just ignore it even though it works. This is a learning opportunity I say.
Second issue
is that I do not under stand this part of the code. I know the left side is supposed to be rows, and the right side is columns. That said why does this work? ID is a column not a row when this code is rune. I make the ID :
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
Area I don't understand yet, is the left side of the comma(,) on this part:
df.loc[~df["ID "].isin(uniqueID), "ID "]
That said here is the final result, basically as I Said its XZY's help that got me here but I'm adding more .locs and playing with the documentation till I can eliminate the warning.
uniqueID = [ and whole list of IDs i had to manually enter 1000+ entries that
will go in the below code. These ids get skipped. example: "032-234-987_#4256"]
# gets the columns i need to make the DateFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
'Number of Vehicles Removed', 'County']]
#Place holder will make our new column with this filter
df.loc[:, "Place Holder"] = df.loc[:,"ID "].str.replace("-", "").str[:9]
#the next code is the filter that goes through the list and skips them. Work in progress to fully understand.
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
#Makes the ID our index
df = df.set_index("ID ")
#just here to add the date to our file name. Must import time for this to work
todaysDate = time.strftime("%m-%d-%y")
#make it an excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")
Will edit this once i get rid of the warning and figure out the left side so I can explain to for everyone who needs/sees this post.
Edit: SettingWithCopyWarning:
Fixed this chained index problem by making a copy of the orginal data base before filter and making everthing .loc as XYZ has helped me with. Before we start to filter use DataFrame.copy() where DataFrame is the name of your own dataframe.

DF Column Values Replace fxn

I am trying to replace multiple names throughout my entire DF to match a certain output. For example how can I make it where the DF will replace all "Ronald Acuna" with "Ronald Acuna Jr." and "Corbin Burns" to "Corbin B"
lineups.replace(to_replace = ['Corbin Burnes'], value ='Corbin B')
This works, but then when I make another line for Ronald Acuna, Corbin B goes back to his full name. Im sure there is a way to somehow loop it all together, but I can't find it.
Thanks
Most likely you will need to reassign the new replaced dataframe back to the dataframe
lineups = lineups.replace(to_replace = ['Corbin Burnes'], value ='Corbin B')
lineups = lineups.replace(to_replace = ['Ronald Acuna'], value ='Ronald Acuna Jr')
And so on.

How do I extract variables that repeat from an Excel Column using Python?

I'm a beginner at Python and I have a school proyect where I need to analyze an excel document with information. It has aproximately 7 columns and more than 1000 rows.
Theres a column named "Materials" that starts at B13. It contains a code that we use to identify some materials. The material code looks like this -> 3A8356. There are different material codes in the same column they repeat a lot. I want to identify them and make a list with only one code, no repeating. Is there a way I can analyze the column and extract the codes that repeat so I can take them and make a new column with only one of each material codes?
An example would be:
12 Materials
13 3A8356
14 3A8376
15 3A8356
16 3A8356
17 3A8346
18 3A8346
and transform it toosomething like this:
1 Materials
2 3A8346
3 3A8356
4 3A8376
Yes.
If df is your dataframe, you only have to do df = df.drop_duplicates(subset=['Materials',], keep=False)
To load the dataframe from an excel file, just do:
import pandas as pd
df = pd.read_excel(path_to_file)
the subset argument indicates which column headings you want to look at.
Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
For the docs, the new data frame with the duplicates dropped is returned so you can assign it to any variable you want. If you want to re_index the first column, take a look at:
new_data_frame = new_data_frame.reset_index(drop=True)
Or simply
new_data_frame.reset_index(drop=True, inplace=True)

SettingwithCopy when creating new column and when dropping NaN rows [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 4 years ago.
I've been searching around reading the pandas docs here and trying different lines of code from questions posted around here and here and I can't seem to get away from the setting with copy warning. I'd prefer to learn to code it the "right" way as opposed to just ignoring the warnings.
The following lines of code are inside a for loop and I don't want to generate this warning a lot of times because it could slow things down.
I'm trying to make a new column with name: 'E'+vs where vs is a string in a list in the for loop
But for each one of them, I still get the following warning, even with the last 3 lines:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Here are the troublesome lines I've tried so far:
#based on research, the first two seem to be the "wrong" way
df_out['E'+vs] = df_out[kvs].rolling(v).mean().copy()
df_out['E'+vs] = df_out[kvs].rolling(v).mean()
df_out.loc[:,'E'+vs] = df_out[kvs].rolling(v).mean().copy()
df_out.loc[:,'E'+vs] = df_out[kvs].rolling(v).mean()
df_out.loc[:,'E'+vs] = df_out.loc[:,kvs].rolling(v).mean()
The other one that gives a SettingWithCopyWarning is this:
df_out.dropna(inplace=True,axis=0)
This one also gave a warning (but I figured this one would)
df_out = df_out.dropna(inplace=True,axis=0)
How do I do both of these operations correctly?
EDIT: Here is the code that produced the original df_out
df_out= pd.concat([vol.Date[1:-1], ret.Return_Time[:-2], vol.Freq_Time[:-2],
vol.Freq_Time[:-1].shift(-1), vol.Freq_Time[:].shift(-2)],
axis=1).dropna().set_index('Date')
This is a confusing topic. It's not the code you've posted that is the problem. It's the code you haven't posted. It's the code that generated the df_out
Consider this example and note the last line that generates the warning.
df_other = pd.DataFrame(dict(A=[1], B=[2]))
df_out = df_other[:]
df_out['E'] = 5
//anaconda/envs/3.5/lib/python3.5/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Now we'll try an equivalent thing that won't produce the warning
df_other = pd.DataFrame(dict(A=[1], B=[2]))
df_out = df_other.loc[:]
df_out['E'] = 5
Then
print `df_out`
A B E
0 1 2 5
It boils down to pandas deciding to attach an is_copy attribute to a dataframe when it's constructed based on lots of criteria.
Notice the
df_other[:].is_copy
<weakref at 0x103323458; to 'DataFrame' at 0x116a684e0>
When
df_other.loc[:].is_copy
Returns None
So what types of construction trigger the copy? I still don't know everything, and not even the things I know all make sense to me.
Like why does this not trigger it?
df_other[['A', 'B', 'E']].is_copy
First off, I am not sure this is either efficient or the best approach. However, I had the same issue when I was adding a new column to the exist dataframe and I decided to use reset_index method.
Here I first drop Nan rows from EMPLOYEES column and assign this manipulated data frame to new data frame df1 then I add COMPANY_SIZE column to the df1 as follows:
df1 = all_merged_years.dropna(subset=['EMPLOYEES']).reset_index()
column = df1['EMPLOYEES']
Size =[]
df1['COMPANY_SIZE'] = ' '
for number in column:
if number <=999:
Size.append('Small')
elif 999<number<=9999:
Size.append('Medium')
elif 9999<number:
Size.append('Large')
else:
Size.append('UNKNOWN')
df1['COMPANY_SIZE'] = Size
This way I did NOT get a warning as such. Hope that helps.

Categories