I have a dataframe. I want to slice it by checking whether a column's values contain a string. For example, this code works:
data_df[data_df['column1'].str.contains('test')]
But I want to convert column1 to all lowercase first. So being the n00b that I am, I tried:
data_df[data_df['column1'].lower().str.contains('test')]
Of course the Python gods showed me no mercy and gave me an AttributeError. Any tips on how I can slice a dataframe based on a substring, but make everything lowercase first?
I feel like the following post is very close to my answer but I can't get it to work exactly how I described up there:
Python pandas dataframe slicing, with if condition
Thanks Python pros!!!
Try using apply()
data_df[data_df['column1'].apply(str.lower).str.contains('test')]
You can drop the apply:
data_df[data_df['column1'].str.lower().str.contains('test')]
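Side note: if the goal is just a case-insensitive match rather than actually lowercasing the column, str.contains also accepts a case argument:
data_df[data_df['column1'].str.contains('test', case=False)]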
Related
Trying to remove spaces from a column of strings in a pandas dataframe. I successfully did it using this method in another section of code.
for index, row in summ.iterrows():
    row['TeamName'] = row['TeamName'].replace(" ", "")
summ.head() shows no change to the column of strings after this operation, but no error either.
I have no idea why this issue is happening considering I used this exact same method later in the code and accomplished the task successfully.
Why not use str.replace:
df["TeamName"] = df["TeamName"].str.replace(r' ', '', regex=False)
I may be proven wrong here, but I wonder if it's because you are iterating over it, and perhaps working on a copy, so the underlying data isn't changing. From the pandas.DataFrame.iterrows documentation, this is what I found there:
"You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect."
just a thought... hth
I downloaded a .csv file to do some practice. A column named "year_month" is a string with the format "YYYY-MM".
By doing:
df = pd.read_csv('C:/..../migration_flows.csv',parse_dates=["year_month"])
"year_month" is Dtype=object. So far so good.
By doing:
df["year_month"] = pd.to_datetime(df["year_month"],format='%Y-%m-%d')
it is converted to datetime64[ns]. So far so good.
I try to filter certain dates by doing:
filtered_df = df.loc[(df["year_month"]>= pd.Timestamp(2018-1-1))]
The program returns the whole column as if nothing happened. For instance, it starts displaying rows from the date "2001-01-01".
Any thoughts on how to filter properly? Many thanks
How about this:
df.loc[(df["year_month"]>= pd.to_datetime('2018-01-01'))]
or
df.loc[(df["year_month"]>= pd.Timestamp('2018-01-01'))]
I am trying to add a column from one dataframe to another,
df.head()
street_map2[["PRE_DIR","ST_NAME","ST_TYPE","STREET_ID"]].head()
The PRE_DIR is just the prefix of the street name. What I want to do is add the STREET_ID column to df, matched on the associated street. I have tried a few approaches, but my inexperience with pandas and with comparing strings is getting in the way,
street_map2['STREET'] = df["STREET"]
street_map2['STREET'] = np.where(street_map2['STREET'] == street_map2["ST_NAME"])
The above code raises a "ValueError: Length of values does not match length of index". I've also tried using street_map2['STREET'].str in street_map2["ST_NAME"].str. Can anyone think of a good way to do this? (Note it doesn't need to be 100% accurate, just get most of them, and it can be a completely different approach from the one tried above.)
EDIT: Thank you to all who have tried so far; I have not resolved the issue yet. Here is some more data,
street_map2["ST_NAME"]
I have tried this approach as suggested but still have some indexing problems,
def get_street_id(street_name):
    return street_map2[street_map2['ST_NAME'].isin(df["STREET"])].iloc[0].ST_NAME
df["STREET_ID"] = df["STREET"].map(get_street_id)
df["STREET_ID"]
This throws an error.
If it helps, the dataframes are not the same length. Any more ideas or a way to fix the above would be greatly appreciated.
To do this, you need to merge these dataframes. One way to do it is:
df.merge(street_map2, left_on='STREET', right_on='ST_NAME')
What this will do is look for equal values in the ST_NAME and STREET columns and fill each row with the values from the other columns of both dataframes.
Check this link for more information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
Also, the strings on the columns you try to merge on have to match perfectly (case included).
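If the casing (or stray whitespace) differs between the two frames, one workaround is to merge on a normalized helper column; a sketch, assuming both columns hold plain strings (the _street_key name is just illustrative):
df["_street_key"] = df["STREET"].str.lower().str.strip()
street_map2["_street_key"] = street_map2["ST_NAME"].str.lower().str.strip()
df = df.merge(street_map2[["_street_key", "STREET_ID"]], on="_street_key", how="left")
df = df.drop(columns="_street_key")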
You can do something like this, with a map function:
df["STREET_ID"] = df["STREET"].map(get_street_id)
Where get_street_id is defined as a function that, given a value from df["STREET"], will return a value to insert into the new column:
(disclaimer: currently untested)
def get_street_id(street_name):
    return street_map2[street_map2["ST_NAME"] == street_name].iloc[0].STREET_ID
We get a dataframe of street_map2 filtered to where the ST_NAME column is the same as the street name:
street_map2[street_map2["ST_NAME"] == street_name]
Then we take the first element of that with iloc[0], and return its STREET_ID value.
We can then add that error-tolerance that you've addressed in your question by updating the indexing operation:
...
street_map2[street_map2["ST_NAME"].str.contains(street_name)]
...
or perhaps,
...
street_map2[street_map2["ST_NAME"].str.startswith(street_name)]
...
Or, more flexibly:
...
street_map2[
    street_map2["ST_NAME"].str.lower().str.replace("street", "st", regex=False)
    == street_name.lower().replace("street", "st")
]
...
...which will lowercase both values, convert, for example, "street" to "st" (so the mapping is more likely to overlap) and then check for equality.
If this is still not working for you, you may unfortunately need to come up with a more accurate mapping dataset between your street names! It is very possible that the street names are just too different to easily match with string comparisons.
(If you're able to provide some examples of street names and where they should overlap, we may be able to help you better develop a "fuzzy" match!)
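In case it helps, here is a minimal fuzzy-match sketch using Python's standard-library difflib; the function name and the 0.8 cutoff are just illustrative starting points to tune for your data:
from difflib import get_close_matches

def get_street_id_fuzzy(street_name):
    # find the single closest ST_NAME; give up if nothing clears the cutoff
    matches = get_close_matches(street_name.lower(),
                                street_map2["ST_NAME"].str.lower(),
                                n=1, cutoff=0.8)
    if not matches:
        return None
    row = street_map2[street_map2["ST_NAME"].str.lower() == matches[0]]
    return row.iloc[0].STREET_ID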
Alright, I managed to figure it out, but the solution probably won't be too helpful if you aren't in the exact same situation with the same data. Bernardo Alencar's answer was essentially correct, except I was unable to apply an operation on the strings while doing the merge (I still am not sure if there is a way to do it). I found another dataset that had the street names formatted similarly to the first. I then merged the first with the third, new dataframe. After this, the first and second both had a ["STREET_ID"] column. Then I finally managed to merge the second one with the combined one by using,
temp = combined["STREET_ID"]
CrimesToMapDF = street_maps.merge(temp, left_on='STREET_ID', right_on='STREET_ID')
Thus getting the desired final dataframe with the associated street IDs.
Just started learning Python. I'm trying to change a column's data type from object to float to compute the mean. I have tried changing [] to () and even the "". I don't know whether it makes a difference or not. Please help me figure out what the issue is. Thanks!!
My code:
df["normalized-losses"]=df["normalized-losses"].astype(float)
The error I see is attached as an image.
Use:
df['normalized-losses'] = df['normalized-losses'][~(df['normalized-losses'] == '?')].astype(float)
Using df.normalized-losses leads to the interpreter evaluating df.normalized, which doesn't exist. The statement you have written executes (df.normalized) - (losses.astype(float)). There also appears to be a question mark in your data which can't be converted to float. The above statement converts to float only those rows which don't contain a question mark and drops the rest. If you don't want to drop those rows, you can replace the question marks with 0 using:
df['normalized-losses'] = df['normalized-losses'].replace('?', 0.0)
df['normalized-losses'] = df['normalized-losses'].astype(float)
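Another common approach, sketched here, is pd.to_numeric with errors='coerce', which turns anything unparseable (including the question marks) into NaN; mean() then skips NaN by default:
df["normalized-losses"] = pd.to_numeric(df["normalized-losses"], errors="coerce")
mean_loss = df["normalized-losses"].mean()  # NaN entries are excluded automatically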
Welcome to Stack Overflow, and good luck on your Python journey! An important part of coding is learning how to interpret error messages. In this case, the traceback is quite helpful - it is telling you that you cannot call normalized after df, since a dataframe does not have a method of this name.
Of course you weren't trying to call something called normalized, but rather the normalized-losses column. The way to do this is as you already did once - df["normalized-losses"].
As to your main problem: if even one of your values can't be converted to a float, the column-wide operation will fail. This is very common. You need to first eliminate all of the non-numerical items in the column; one way to find them is with df[~df['normalized-losses'].str.isnumeric()].
The "df.normalized-losses" does not signify anything to python in this case. you can replace it with df["normalized-losses"]. Usually, if you try
df["normalized-losses"]=df["normalized-losses"].astype(float)
This should work. What this does is take the normalized-losses column from the dataframe, convert it to float, and reassign it to the same column. But sometimes the data might need some processing before the above statement will work.
You can't use - in an attribute or variable name. Perhaps you mean normalized_losses?
I am trying to get a subset of the dataframe where the value of a certain column starts with the string 'HOUS'. How should I do this?
df.loc[df.id.startswith('HOUS')]
I should have searched more.
Here is the solution.
df[df.id.str.startswith('HOUS')]
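One caveat worth adding: if the column has missing values, str.startswith returns NaN for them and the boolean mask will fail; str.startswith takes an na argument for exactly this case:
df[df.id.str.startswith('HOUS', na=False)]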