I'm learning how to use Pandas and I've downloaded some data from Kaggle about car prices.
I'm trying to create a new dataframe by subsetting out all the cars that have the model "Golf".
golfs = df[df.model == "Golf"]
It does return a new dataframe, but when I call it, it's just empty besides the column names.
Trying this:
others = df[df.model != "Golf"]
creates a new dataframe, but it has everything in it. The datatype of the column is object. So I tried to create subsets by transmission, which is also an object.
man_trans = df[df.transmission == "Manual"]
creates a new dataframe with solely Manual transmissions... I have no idea where it's going wrong. I've tried subsetting all the other columns, but it's just the first one that won't behave. I've even tried copying and pasting the cell value directly into the code.
I've even tried adding in:
df.reset_index()
to add a new index, as I thought that might be the problem.
The code looks correct to me. If the golfs dataframe is empty, is it possible you don't have any rows where df['model'] == 'Golf'? Maybe it's == "golf" instead?
# If this doesn't work...
# golfs = df[df.model == "Golf"]
# ...maybe try this (or something like this):
golfs = df[df.model == "golf"]
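If the casing varies from row to row, a case-insensitive comparison is a quick way to check (a sketch, assuming the model column holds strings):
golfs = df[df.model.str.lower() == "golf"]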
So David was totally correct about the space, but it wasn't at the end, it was at the beginning. I checked what David was saying by applying:
len(df.model.iloc[798])
to find out how many characters were in the cell I was looking at. "Golf" only has 4 characters, but 5 was being returned.
df.model = df.model.str.lstrip()
len(df.model.iloc[798])
4
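For anyone who hits the same thing, a quick sketch for spotting hidden whitespace across the whole column (assuming the values are all strings):
# flag values whose raw length differs from their stripped length
mask = df.model.str.len() != df.model.str.strip().str.len()
print(df.loc[mask, "model"])
# strip leading and trailing whitespace in one go
df.model = df.model.str.strip()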
Thanks to the people who responded and helped me find an answer to this stupidly simple problem.
I was trying to get a data frame of spam messages so I can analyze them. This is what the original CSV file looks like.
I want it to be like
This is what I had tried:
### import the original CSV (it's a simplified sample with only two columns: sender, text)
import pandas as pd
df = pd.read_csv("spam.csv")
### if any of these keywords is in the text column, I'll put that row in the new dataframe
keyword = ["prize", "bit.ly", "shorturl"]
### putting rows that contain a keyword into a new dataframe
spam_list = df[df['text'].str.contains('|'.join(keyword))]
### creating a new column 'detected word' and trying to show which keyword was detected
spam_list['detected word'] = keyword
spam_list
However, "detected word" just repeats the keywords in the order of the list rather than showing which keyword was actually detected in each row.
I know it's because I put the list into the new column, but I couldn't think of or find a better way to do this. Should I have used a for loop as the solution? Or am I approaching it in a totally wrong way?
You can define a function that gets the result for each row:
def detect_keyword(row):
    for key in keyword:
        if key in row['text']:
            return key
then get it done for all rows with pandas.apply() and save results as a new column:
df['detected_word'] = df.apply(detect_keyword, axis=1)
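Putting it together with the filtering step from the question, a minimal sketch (assuming the spam.csv layout described above and the detect_keyword function from this answer):
import pandas as pd

keyword = ["prize", "bit.ly", "shorturl"]
df = pd.read_csv("spam.csv")

# keep only rows whose text contains at least one keyword;
# .copy() avoids a SettingWithCopyWarning on the assignment below
spam_list = df[df['text'].str.contains('|'.join(keyword))].copy()

# record which keyword was matched in each row
spam_list['detected word'] = spam_list.apply(detect_keyword, axis=1)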
Been working on this project all day and it's destroying me. I've finished the web scraping and have a final .csv which contains the elements of a pandas dataframe. I'm working with this dataframe in a new file, and currently have the following:
df = pd.read_csv('active_homes.csv')
for i in range(len(df)):
    add = df['Address'][i]
    price = df['Price'][i]
    if (price<100000) == True:
        print(price)
'active_homes.csv' looks like this:
Address,Status,Price,Meta
"387 8th St, Burlington, CO 80807",For Sale,169500,"4bed2bath1,560sqft"
and the resulting df's shape is (1764, 4).
This should, in theory, print the price for each row where price<100000.
In practice, it prints this:
I have confirmed that at each iteration of the above for loop it is collecting the correct 'Price' and 'Address' information, and I have also confirmed that at each iteration the logic (price<100000) is working correctly. However, it still prints the above. I was originally trying to just drop the rows of the dataframe that were <100000, but that wasn't doing anything. I also tried reassigning the data to a new dataframe, and it would either return an empty dataframe or return a dataframe with duplicate data for this one house (with the 'Price' of 58900).
So far, from all of that, I believe the program is recognizing the correct number of houses < 100000, but for some reason the assignment is sticking for the one address. It also does the same thing without assignment, as in:
for i in range(len(df)):
    if (df['Price'][i]<100000) == True:
        print(df['Price'][i])
Any help in identifying the error would be much appreciated.
With pandas you should avoid iterating rows the traditional Python way. Instead, you can achieve the desired result using boolean indexing:
df = pd.read_csv('active_homes.csv')
temp_df = df[df["Price"]<100000] # initiating a new df isn't required, just force of habit
print(temp_df["Price"]) # displaying the series of houses that are below 100K; imo a prettier print
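And if you'd rather drop those rows from df itself, as you originally tried, the same kind of mask works in reverse (a sketch, keeping only houses priced at 100K or above):
df = df[df["Price"] >= 100000].reset_index(drop=True)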
Look at the variations of code I tried here
I'm trying to use Pandas to filter rows with multiple conditions and create a new csv file containing only those rows. I've tried several different ways and commented out each of those attempts (sometimes I only tried one condition for simplicity, but it still didn't work). When the csv file is created, the filters aren't applied.
This is my updated code
I got it to work for condition #1, but I'm not sure how to add/apply condition #2. I tried a lot of different combinations. I know the code in the linked image wouldn't work for applying the 2nd condition, because all I did was assign the variable, but it seemed too cumbersome to show every way I tried. Any hints on that part?
df = pd.read_csv(excel_file_path)
#condition #1
is_report_period = (df["Report Period"]=="2015-2016") | \
(df["Report Period"]=="2016-2017") | \
(df["Report Period"]=="2017-2018") | \
(df["Report Period"]=="2018-2019")
#condition #2
is_zip_code = (df["Zip Code"]<"14800")
new_df = df[is_report_period]
You can easily achieve this by using '&':
new_df = df[is_report_period & is_zip_code]
Also, you can make your code more readable and easier to change by building the filter with isin():
Periods = ["2015-2016","2016-2017","2017-2018","2018-2019"]
is_report_period = df["Report Period"].isin(Periods)
This way you can easily alter the filter when needed, and it's easier to maintain.
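Putting both conditions together with the csv output you mentioned, a minimal sketch (the output file name is an assumption, not from your code):
periods = ["2015-2016", "2016-2017", "2017-2018", "2018-2019"]
is_report_period = df["Report Period"].isin(periods)
is_zip_code = df["Zip Code"] < "14800"

new_df = df[is_report_period & is_zip_code]
new_df.to_csv("filtered.csv", index=False)  # assumed file name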
So this is an assignment from my Python class. We have a dataframe that contains information about a machine. Every minute the machine stands still (as in: no product is being produced, for whatever reason), it creates a new row in the dataframe containing the reason.
This is what the first standstill looks like in the df (the text looks weird because I had to translate it from German):
We see that Reason 1 for the standstill started on 2020-03-02 at 14:04 and was rectified at 14:07.
The assignment now is to create a new dataframe to consolidate this information, so that it would look somewhat like this:
I had the idea to use .shift() to check for the beginning and the end of a new Standstillreason. The column "SAP?" (Same as previous?) checks whether the Standstillreason in the current row is the same as in the previous one. The same goes for "SAN?" (Same as next?) regarding the next row.
df["SAP?"] = df.Standstillreason.eq(df.Standstillreason.shift())
df["SAN?"] = df.Standstillreason.eq(df.Standstillreason.shift(periods=-1))
Every time SAP? is False, it means the row contains the start of a new Standstillreason.
If SAN? is False, the row contains the end of the current reason.
What I'm trying to do now is extract the needed information and put it into the consolidated df. I thought of something like this:
for index, row in df.iterrows():
    if row["SAP?"] == False:
        df_consolidated["Standstillreason"] = row["Standstillreason"]
But so far this hasn't worked. I'm not even sure whether I'm taking the right approach to the problem.
If the reasons are all different, then simply
df.groupby("Standstillreason").agg({"Date":["first","last"], "Time":["first", "last"]})
should work.
If not, it is a bit more complicated. You want to create two dataframes, one holding the rows where a reason starts and the other the rows where it ends, take the start information from the first and the end information from the second, then concat them together.
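A minimal sketch of that idea, building on the SAP?/SAN? columns from the question (the "Date" and "Time" column names are taken from the groupby above):
import pandas as pd

# rows where SAP? is False start a new reason; rows where SAN? is False end one
starts = df[~df["SAP?"]].reset_index(drop=True)
ends = df[~df["SAN?"]].reset_index(drop=True)

# starts and ends alternate, so the i-th start pairs with the i-th end
df_consolidated = pd.DataFrame({
    "Standstillreason": starts["Standstillreason"],
    "Start date": starts["Date"],
    "Start time": starts["Time"],
    "End date": ends["Date"],
    "End time": ends["Time"],
})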
I am trying to add a column from one dataframe to another:
df.head()
street_map2[["PRE_DIR","ST_NAME","ST_TYPE","STREET_ID"]].head()
The PRE_DIR is just the prefix of the street name. What I want to do is add the STREET_ID column for the associated street to df. I have tried a few approaches, but my inexperience with pandas and with comparing strings is getting in the way:
street_map2['STREET'] = df["STREET"]
street_map2['STREET'] = np.where(street_map2['STREET'] == street_map2["ST_NAME"])
The above code raises "ValueError: Length of values does not match length of index". I've also tried using street_map2['STREET'].str in street_map2["ST_NAME"].str. Can anyone think of a good way to do this? (Note it doesn't need to be 100% accurate, it just needs to get most of them, and the approach can be completely different from the one above.)
EDIT: Thank you to all who have tried so far; I have not resolved the issue yet. Here is some more data:
street_map2["ST_NAME"]
I have tried this approach as suggested, but I still have some indexing problems:
def get_street_id(street_name):
    return street_map2[street_map2['ST_NAME'].isin(df["STREET"])].iloc[0].ST_NAME
df["STREET_ID"] = df["STREET"].map(get_street_id)
df["STREET_ID"]
This throws this error:
If it helps, the data frames are not the same length. Any more ideas, or a way to fix the above, would be greatly appreciated.
To do this, you need to merge these dataframes. One way to do it is:
df.merge(street_map2, left_on='STREET', right_on='ST_NAME')
What this will do is look for equal values in the ST_NAME and STREET columns and fill the rows with values from the other columns of both dataframes.
Check this link for more information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
Also, the strings on the columns you try to merge on have to match perfectly (case included).
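For example, to bring over just the STREET_ID column while keeping every row of df (a sketch; how='left' is my assumption about the join you want):
df = df.merge(
    street_map2[["ST_NAME", "STREET_ID"]],
    left_on="STREET",
    right_on="ST_NAME",
    how="left",
)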
You can do something like this, with a map function:
df["STREET_ID"] = df["STREET"].map(get_street_id)
Where get_street_id is defined as a function that, given a value from df["STREET"], will return a value to insert into the new column:
(disclaimer; currently untested)
def get_street_id(street_name):
    return street_map2[street_map2["ST_NAME"] == street_name].iloc[0].STREET_ID
We get a dataframe of street_map2 filtered to where the ST_NAME column is the same as the street name:
street_map2[street_map2["ST_NAME"] == street_name]
Then we take the first element of that with iloc[0] and return its STREET_ID value.
We can then add that error-tolerance that you've addressed in your question by updating the indexing operation:
...
street_map2[street_map2["ST_NAME"].str.contains(street_name)]
...
or perhaps,
...
street_map2[street_map2["ST_NAME"].str.startswith(street_name)]
...
Or, more flexibly:
...
street_map2[
    street_map2["ST_NAME"].str.lower().str.replace("street", "st")
    == street_name.lower().replace("street", "st")
]
...
...which will lowercase both values, convert, for example, "street" to "st" (so the mapping is more likely to overlap) and then check for equality.
If this is still not working for you, you may unfortunately need to come up with a more accurate mapping dataset between your street names! It is very possible that the street names are just too different to easily match with string comparisons.
(If you're able to provide some examples of street names and where they should overlap, we may be able to help you better develop a "fuzzy" match!)
Alright, I managed to figure it out, but the solution probably won't be too helpful if you aren't in the exact same situation with the same data. Bernardo Alencar's answer was essentially correct, except I was unable to apply an operation to the strings while doing the merge (I'm still not sure if there is a way to do it). I found another dataset that had the street names formatted similarly to the first. I then merged the first with the new third data frame. After this, the first and second both had a ["STREET_ID"] column. Then I finally managed to merge the second one with the combined one by using:
temp = combined["STREET_ID"]
CrimesToMapDF = street_maps.merge(temp, left_on='STREET_ID', right_on='STREET_ID')
Thus getting the desired final data frame with the associated street IDs.