I'm looking to create/update a column, 'dept', based on whether the text in column A contains a given string. It works without a for loop, but when I try to iterate, the column ends up with the default instead of the detected value.
Surely I shouldn't have to add the same line manually 171 times. I've scoured the internet and SO for hints or solutions and can't seem to locate any good info.
Working Code:
df['dept'] = np.where(df.a.str.contains("PHYS"), "PHYS", "Unknown")
But when I try:
depts = ['PHYS', 'PSYCH']
for dept in depts:
    df['dept'] = np.where(df.a.str.contains(dept), dept, "Unknown")
    print(dept)
I get all "Unknown" values, even though each dept prints out correctly. I've also tried making sure dept is fed in as a string by explicitly stating dept = str(dept), to no avail.
Thanks in advance for any and all help. I feel like this is a simple issue that should be easily sorted but I'm experiencing a block.
We usually do
df['dept'] = df.a.str.findall('|'.join(depts)).str[0]
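Note that findall leaves NaN where nothing matches, so if you want the "Unknown" default from the question, a fillna can be chained on (a small tweak to the line above, using the same depts list):
df['dept'] = df.a.str.findall('|'.join(depts)).str[0].fillna("Unknown")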
I prefer str.extract:
df['dept'] = df['a'].str.extract(f"({'|'.join(depts)})").fillna("Unknown")
Or:
df['dept'] = df['a'].str.extract('(' + '|'.join(depts) + ')').fillna("Unknown")
Both versions output:
>>> df
           a     dept
0  ewfefPHYS     PHYS
1  QWQiPSYCH    PSYCH
2      fwfew  Unknown
>>>
#U-12-Forward has a great solution if there is only supposed to be one new column named literally 'dept', rather than one column per value of the dept variable in the loop.
If the intent is to create a new column for each dept in depts, then remove the quotes around "dept" in the column indexer:
for dept in depts:
    df[dept] = np.where(df.a.str.contains(dept), dept, "Unknown")
The example is confusing because, given the variable name, it is not clear whether there is supposed to be a new column for each dept (i.e., PHYS, PSYCH).
This excerpt will not "work" because it would overwrite df['dept'] on the second assignment with something that is only a combination of 'PSYCH' and 'Unknown' (there would be no 'PHYS').
df['dept'] = np.where(df.a.str.contains("PHYS"), "PHYS", "Unknown")
df['dept'] = np.where(df.a.str.contains("PSYCH"), "PSYCH", "Unknown")
What you are describing would certainly happen if no string in column a contains the final element in depts: the condition in the last np.where would be all False, so it would return a full Series of 'Unknown'.
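If you really want to keep the loop, one option (a minimal sketch, using the names from the question) is to start with the default column and only overwrite the rows that match on each pass:
df['dept'] = "Unknown"
for dept in depts:
    # only touch rows whose text contains this dept, leaving earlier matches alone
    df.loc[df.a.str.contains(dept), 'dept'] = dept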
Related
This seems like an elementary question with many online examples, but for some reason it does not work for me.
I am trying to replace any cells in column 'A' that have the value "Facility-based testing-OH" with the value "Facility based testing-OH". If you note, the only difference between the two is a single '-'; however, for my purposes I do not want to use the split function on a delimiter. I simply want to locate the values that need replacement.
I have tried the following code, but none have worked.
1st Method:
df = df.str.replace('Facility-based testing-OH','Facility based testing-OH')
2nd Method:
df['A'] = df['A'].str.replace(['Facility-based testing-OH'], "Facility based testing-OH"), inplace=True
3rd Method:
df.loc[df['A'].isin(['Facility-based testing-OH'])] = 'Facility based testing-OH'
Try:
df["A"] = df["A"].str.replace(
"Facility-based testing-OH", "Facility based testing-OH", regex=False
)
print(df)
Prints:
                           A
0  Facility based testing-OH
1  Facility based testing-OH
df used:
                           A
0  Facility-based testing-OH
1  Facility based testing-OH
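As a side note, the third method in the question fails because df.loc[mask] = value assigns to every column of the matching rows, not just A. Restricting the assignment to the column works for exact matches (a sketch, not needed if you use str.replace above):
df.loc[df["A"] == "Facility-based testing-OH", "A"] = "Facility based testing-OH"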
Going on two months in Python, and I am focusing hard on Pandas right now. In my current position I use VBA on data frames, so I'm learning this to slowly replace it and further my career.
As of now I believe my true problem is a lack of understanding of a key concept (or concepts). Any help would be greatly appreciated.
That said, here is my problem:
Where could I go to learn more about how to do this kind of more precise filtering? I'm very close, but there is one key aspect I'm missing.
Goal(s)
Main goal: I need to skip certain values in my ID column.
The code below takes out the dashes "-" and only reads up to 9 digits. However, I need to skip certain IDs because they are unique.
After that I'll start to work on comparing multiple sheets.
Main data frame IDs are formatted as 000-000-000-000.
The other data frames that I will compare it to have the ID with no dashes "-", as 000000000, i.e. three fewer digits, totaling nine.
The unique IDs that I need skipped are the same in both data frames, but are formatted completely differently, e.g. 000-000-000_#12, 000-000-000_35, or 000-000-000_z.
My code that I will use on each ID except the unique ones:
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
but I want to use an if statement like this (which does not work):
lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs", ]
if ~dfSS["ID"].isin(lst ).any()
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
pass
For more clarification my input DataFrame is this:
                    ID  Street #  Street Name
0      004-330-002-000      2272       Narnia
1  021-521-410-000_128      2311       Narnia
2      001-243-313-000      2235       Narnia
3      002-730-032-000      2149       Narnia
4        000-000-000_a      1234       Narnia
And I am looking to do this as the output:
                    ID  Street #  Street Name
0            004330002      2272       Narnia
1  021-521-410-000_128      2311       Narnia
2         001243313000      2235       Narnia
3         002730032000      2149       Narnia
4        000-000-000_a      1234       Narnia
Notes:
dfSS is my DataFrame variable name, i.e. the Excel file I am using. "ID" is
my column heading; I will make this an index after the fact.
My DataFrame for this job is small, with (rows, columns) = (2500, 125).
I do not get an error message, so I am guessing maybe I need a loop of some kind. I'm starting to test for loops with this as well. No luck there... yet.
Here is where I have been to research this:
Comparison of a Dataframe column values with a list
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
if statement with ~isin() in pandas
recordlinkage module-I didn't think this was going to work
Regular expression operations - Having a hard time fully understanding this at the moment
There are a number of ways to do this. The first way here doesn't involve writing a function.
# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing
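The same thing can also be written without the placeholder column by computing the mask once (a small variant of the snippet above, not a separate method):
mask = ~dfSS["ID"].isin(lst)                                    # True for rows whose ID should be transformed
dfSS.loc[mask, "ID"] = dfSS.loc[mask, "ID"].str.replace("-", "").str[:9]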
The second way is to write a function that conditionally converts the IDs, and it's not as fast as the first method.
def transform_ID(ID_val):
    # only transform IDs that are not in the skip list; leave the unique ones untouched
    if ID_val not in lst:
        return ID_val.replace("-", "")[:9]
    return ID_val

dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
This is based on #xyzxyzjayne's answer, but I have two issues I cannot figure out.
First issue
I get this warning (see Edit):
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Documentation for this warning
You'll see in the code below I tried to put in .loc but I can't seem to find out how to eliminate this warning by using .loc correctly. Still learning it. NO, I will not just ignore it even though it works. This is a learning opportunity I say.
Second issue
I do not understand this part of the code. I know the left side is supposed to be rows, and the right side is columns. That said, why does this work? ID is a column, not a row, when this code is run (I make ID the index later):
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
The area I don't understand yet is the left side of the comma (,) in this part:
df.loc[~df["ID "].isin(uniqueID), "ID "]
That said, here is the final result. Basically, as I said, it's XYZ's help that got me here, but I'm adding more .locs and playing with the documentation until I can eliminate the warning.
# uniqueID is a whole list of IDs I had to enter manually (1000+ entries) that is used in
# the code below. These IDs get skipped. Example: "032-234-987_#4256"
uniqueID = ["032-234-987_#4256", ...]
# gets the columns i need to make the DateFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
         'Number of Vehicles Removed', 'County']]
#Place holder will make our new column with this filter
df.loc[:, "Place Holder"] = df.loc[:,"ID "].str.replace("-", "").str[:9]
#the next code is the filter that goes through the list and skips them. Work in progress to fully understand.
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
#Makes the ID our index
df = df.set_index("ID ")
#just here to add the date to our file name. Must import time for this to work
todaysDate = time.strftime("%m-%d-%y")
#make it an excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")
Will edit this once I get rid of the warning and figure out the left side, so I can explain it for everyone who needs/sees this post.
Edit: SettingWithCopyWarning:
Fixed this chained-indexing problem by making a copy of the original DataFrame before filtering, and by making everything use .loc, as XYZ helped me with. Before you start to filter, use DataFrame.copy(), where DataFrame is the name of your own dataframe.
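In other words, a minimal sketch of the fix described above, applied to the column selection at the top of the snippet:
# making an explicit copy means the later .loc assignments modify this frame, not a view of the original
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
         'Number of Vehicles Removed', 'County']].copy()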
I am trying to replace multiple names throughout my entire DF to match a certain output. For example, how can I make it so the DF will replace all "Ronald Acuna" with "Ronald Acuna Jr." and "Corbin Burnes" with "Corbin B"?
lineups.replace(to_replace = ['Corbin Burnes'], value ='Corbin B')
This works, but when I add another line for Ronald Acuna, Corbin B goes back to his full name. I'm sure there is a way to somehow loop it all together, but I can't find it.
Thanks
Most likely you need to reassign the replaced dataframe back to the original variable:
lineups = lineups.replace(to_replace = ['Corbin Burnes'], value ='Corbin B')
lineups = lineups.replace(to_replace = ['Ronald Acuna'], value ='Ronald Acuna Jr')
And so on.
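Alternatively, all of the replacements can go into a single call by passing a dict (a small sketch with the two names from the question):
lineups = lineups.replace({'Corbin Burnes': 'Corbin B', 'Ronald Acuna': 'Ronald Acuna Jr'})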
I've seen a large number of similar questions but nothing quite answers what I am looking to do.
I have two dataframes
Conn_df, which contains names and company details entered manually (e.g. Conn_df["Name", "Company_name", "Company_Address"])
Cleanse_df, which contains cleaned-up company names (e.g. Cleanse_df["Original_Company_Name", "Cleanse_Company_Name"])
The data for both is held in csv files that are imported into the script.
I want to change the company details in Conn_df.Company_Name using the values in Cleanse_df, where the Conn_df.Company_Name equals the Cleanse_df.Original_Company_Name and is replaced by Cleanse_df.Cleanse_Company_Name.
I have tried:
Conn_df["Company"] = Conn_df["Company"].replace(Conn_df["Company"], Cleanse_df["Cleansed"]) but got
replace() takes no keyword arguments
I also tried:
Conn_df["Company"] = Conn_df["Company"].map(Cleanse_df.set_index("Original")["Cleansed"]) but got
Reindexing only valid with uniquely valued Index objects
Any suggestions on how to get the values replaced? I would note that both dataframes run to many tens of thousands of rows, so creating a manual list is not possible.
I think you want something along the lines of this:
conn_df = pd.DataFrame({'Name': ['Mac', 'K', 'Hutt'],
                        'Company_name': ['McD', 'KFC', 'PH'],
                        'Company_adress': ['street1', 'street2', 'street4']})
cleanse_df = pd.DataFrame({'Original_Company_Name': ['McD'],
                           'Cleanse_Company_Name': ['MacDonalds']})
cleanse_df = cleanse_df.rename(columns={'Original_Company_Name':'Company_name'})
merged_df = conn_df.merge(cleanse_df,on='Company_name',how='left')
merged_df['Cleanse_Company_Name'].fillna(merged_df['Company_name'],inplace=True)
final_df = merged_df[['Name', 'Company_adress', 'Cleanse_Company_Name']]\
    .rename(columns={'Cleanse_Company_Name': 'Company_name'})
This would return:
   Name Company_adress Company_name
0   Mac        street1   MacDonalds
1     K        street2          KFC
2  Hutt        street4           PH
You merge the two dataframes and then keep the replaced new value; if there is no replacement value for a name, the name just stays the same, which is what the fillna call handles.
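As a side note, the map approach from the question should also work once duplicate original names are dropped, which is what causes the "uniquely valued Index" error (a sketch, assuming the column names from the question):
mapping = Cleanse_df.drop_duplicates("Original_Company_Name").set_index("Original_Company_Name")["Cleanse_Company_Name"]
Conn_df["Company_name"] = Conn_df["Company_name"].map(mapping).fillna(Conn_df["Company_name"])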
Let's say my Pandas DataFrame contains the following:
Row  Firstname  Middlename  Lastname
…
10   Roy        G.          Biv
11   Cnyder     M.          Uk
12   Pan        T.          One
…
If a cell in ["Lastname"] contains "Biv", I would like to set the Python variable first_name = to "Roy".
I've been away from Pandas for a while and am unsure of the right way to accomplish this. I've looked at ways to combine df.iloc/df.loc etc. with df.at/df.where and str.contains, but have to admit I'm kind of lost on the proper way to put the conditional together with setting the variable.
In (poor, incorrect) pseudo-code, I'm looking for:
first_name = df["Firstname"][np.where([df["Lastname"].str.contains("Biv") ...
If you want just the scalar value then you can access the first value in the result array and assign this:
firstname = df.loc[df['Lastname'].str.contains('Biv'), 'Firstname'].iloc[0]
Note that really you should check if it's not empty though:
if len(df.loc[df['Lastname'].str.contains('Biv'), 'Firstname']) > 0:
    firstname = df.loc[df['Lastname'].str.contains('Biv'), 'Firstname'].iloc[0]
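Equivalently, you can grab the matches once and guard on them (with the sample frame above this should give "Roy"):
matches = df.loc[df['Lastname'].str.contains('Biv'), 'Firstname']
first_name = matches.iloc[0] if not matches.empty else None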