This question already has answers here:
Binning a column with pandas
(4 answers)
Closed 3 years ago.
I have a dataframe of cars. It has a price column, and I want to create a new column carsrange that holds values like 'high', 'low', etc. according to the car's price. For example:
if the price is between 0 and 9000, carsrange should be 'low' for those cars; similarly, if the price is between 9000 and 30,000, carsrange should be 'medium', and so on. I tried doing it, but my code keeps replacing one value with the other. Any help please?
I ran a for loop over the price column and used if-else branches to define the column values.
for i in cars_data['price']:
    if (i>0 and i<9000):
        cars_data['carsrange'] = 'Low'
    elif (i<9000 and i<18000):
        cars_data['carsrange'] = 'Medium-Low'
    elif (i<18000 and i>27000):
        cars_data['carsrange'] = 'Medium'
    elif (i>27000 and i<36000):
        cars_data['carsrange'] = 'High-Medium'
    else:
        cars_data['carsrange'] = 'High'
Now, when I run the unique function on carsrange, it shows only 'High'.
cars_data['carsrange'].unique()
This is the Output:
In[74]:cars_data['carsrange'].unique()
Out[74]: array(['High'], dtype=object)
I believe I have applied the wrong concept here. Any ideas as to what I should do now?
You can use a list:
resultList = []
for i in cars_data['price']:
    if (i > 0 and i < 9000):
        resultList.append("Low")
    else:
        resultList.append("High")
    # write other conditions here
cars_data["carsrange"] = resultList
Then find the unique values from cars_data["carsrange"].
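For reference, the same binning can be done without a loop at all using pd.cut. A minimal sketch, with sample prices standing in for the real data and the bin edges taken from the question's if/elif chain:

```python
import pandas as pd

# sample data standing in for cars_data
cars_data = pd.DataFrame({'price': [5000, 12000, 20000, 30000, 40000]})

# bin edges and labels from the question's if/elif chain
bins = [0, 9000, 18000, 27000, 36000, float('inf')]
labels = ['Low', 'Medium-Low', 'Medium', 'High-Medium', 'High']

# pd.cut assigns each price to its bin in one vectorized step
cars_data['carsrange'] = pd.cut(cars_data['price'], bins=bins, labels=labels)
print(cars_data['carsrange'].tolist())
# → ['Low', 'Medium-Low', 'Medium', 'High-Medium', 'High']
```

This avoids the whole-column-assignment bug entirely, since each row gets its own label.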
This question already has answers here:
Why do I get an IndexError (or TypeError, or just wrong results) from "ar[i]" inside "for i in ar"?
(4 answers)
How to iterate over rows in a DataFrame in Pandas
(31 answers)
Closed 3 months ago.
Can anyone tell me why df.loc can't seem to work in a loop like so
example_data = {
    'ID': [1, 2, 3, 4, 5, 6],
    'score': [10, 20, 30, 40, 50, 60]
}
example_data_df = pd.DataFrame(example_data)

for row in example_data_df:
    print(example_data_df.loc[row, 'ID'])
and is raising the error "KeyError: 'ID'"?
Outside of the loop, this works fine:
row = 1
print(example_data_df.loc[row, 'ID'])
I have been trying different versions of this, such as example_data_df['ID'].loc[row], and tried to see whether the problem is with the type of object in the columns, but nothing worked.
Thank you in advance!
EDIT: If it plays a role, here is why I think I need the loop: I have two dataframes A and B, and need to append certain columns from B to A, but only for those rows where A and B have a matching value in a particular column. B is longer than A, and not all rows in A are contained in B. I don't know how this would be possible without looping; that would be another question I might ask separately.
If you check 'row' at each step, you'll notice that iterating directly over a DataFrame yields the column names.
You want:
for idx, row in example_data_df.iterrows():
    print(example_data_df.loc[idx, 'ID'])
Or, better:
for idx, row in example_data_df.iterrows():
    print(row['ID'])
Now, I don't know why you want to iterate manually over the rows, but know that this should be limited to small datasets, as it's the least efficient way of working with a DataFrame.
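As for the use case described in the edit, that is exactly what a merge does, with no loop at all. A sketch with invented column names standing in for the real ones, using a left join so every row of A is kept:

```python
import pandas as pd

# hypothetical stand-ins for the A and B frames from the edit;
# 'key' is the column with the matching values
A = pd.DataFrame({'key': [1, 2, 3], 'a_col': ['x', 'y', 'z']})
B = pd.DataFrame({'key': [1, 2, 3, 4, 5], 'b_col': [10, 20, 30, 40, 50]})

# left join: keep every row of A, pull the matching columns from B
merged = A.merge(B, on='key', how='left')
print(merged['b_col'].tolist())  # → [10, 20, 30]
```

Rows of B with no match in A are simply dropped; rows of A with no match in B would get NaN in the appended columns.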
This question already has answers here:
Splitting dataframe into multiple dataframes
(13 answers)
Closed 1 year ago.
I have a dataset (see here) with data for multiple countries over a period whose starting year is unknown (the starting point for each country is different), but we know the last year is 2016. I need to split this dataset into multiple datasets based on the "year" column, so that I get one dataset per year with data for all countries.
I have tried this:
efyear = dict(tuple(eef.groupby('year')))
y = 2016
for y in eef['year']:
    try:
        exec(f'ef{y} = efyear{y}')
        y -= 1
    except:
        print('Not Available')
but it doesn't work and ends up printing 'Not Available' many times. I need to produce a different name for each dataset (or for the variable that holds it), which is why I used string formatting.
Thank you in advance.
You can see the dataset here.
Try:
out = {}
for year, g in df.groupby("year"):
    out["ef{}".format(year)] = g
print(out)
This will create a dictionary with keys ef2013, ef2014, etc., whose values are the dataframes for each year.
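A self-contained sketch of that approach, with toy data standing in for the linked dataset:

```python
import pandas as pd

# toy data standing in for the country/year dataset
df = pd.DataFrame({'country': ['A', 'B', 'A', 'B'],
                   'year': [2015, 2015, 2016, 2016],
                   'value': [1, 2, 3, 4]})

# one dataframe per year, keyed by a generated name
out = {}
for year, g in df.groupby('year'):
    out['ef{}'.format(year)] = g

print(out['ef2016']['country'].tolist())  # → ['A', 'B']
```

A dictionary of frames is generally preferable to exec-generated variable names: the keys are data, so you can iterate over them, and nothing is injected into the global namespace.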
I found my answer :))
efyear = dict(tuple(eef.groupby('year')))
y = 2016
for y in eef['year']:
    exec(f'ef{y} = efyear[{y}]')
    y -= 1
:))
This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 3 years ago.
I am trying to return all the matching data when a condition is met. Specifically, I would like to return all the relevant records where the home team has scored X goals.
data = pd.read_csv("epl_data_v2.csv")
def highest_home_score():
    data.loc[data['HG']==1]
The console is returning the value None. I'm not sure why this happens. I know the column name 'HG' is correct.
def highest_home_score():
    print(data.loc[data['HG']==1])

highest_home_score()
The code above produces what I was expecting - a small set of results that feature 1 as the HG value.
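The reason the first version showed None is that the function computes the filtered frame but never returns it, so the call evaluates to None. Rather than printing inside the function, you can return the result; a sketch with toy data standing in for epl_data_v2.csv:

```python
import pandas as pd

# toy data standing in for pd.read_csv("epl_data_v2.csv")
data = pd.DataFrame({'HG': [0, 1, 2, 1], 'AG': [1, 1, 0, 3]})

def highest_home_score():
    # return the filtered rows instead of just computing (or printing) them
    return data.loc[data['HG'] == 1]

result = highest_home_score()
print(len(result))  # → 2
```

Returning the frame lets the caller decide whether to print it, save it, or filter it further.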
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
So here's my daily challenge:
I have an Excel file containing a list of streets, and some of those streets will be doubled (or tripled) based on their road type. For instance:
In another Excel file, I have the street names (without duplicates) and their mean distances between features, such as this:
Both Excel files have been converted to pandas dataframes as so :
duplicates_df = pd.DataFrame()
duplicates_df['Street_names'] = street_names
dist_df=pd.DataFrame()
dist_df['Street_names'] = names_dist_values
dist_df['Mean_Dist'] = dist_values
dist_df['STD'] = std_values
I would like to find a way to append the mean distance and STD values multiple times in duplicates_df whenever a street has more than one occurrence, but I am struggling with the proper syntax. This is probably an easy fix, but I've never done this before.
The desired output would be:
Any help would be greatly appreciated!
Thanks again!
pd.merge(duplicates_df, dist_df, on="Street_names")
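To see why that one-liner is enough: a merge on Street_names automatically repeats the Mean_Dist and STD values for every occurrence of a duplicated street. A minimal sketch with invented sample values:

```python
import pandas as pd

# invented sample data: 'Main St' appears twice in duplicates_df
duplicates_df = pd.DataFrame({'Street_names': ['Main St', 'Main St', 'Oak Ave']})
dist_df = pd.DataFrame({'Street_names': ['Main St', 'Oak Ave'],
                        'Mean_Dist': [12.5, 8.0],
                        'STD': [1.2, 0.7]})

# many-to-one merge: each duplicate row picks up its street's stats
merged = pd.merge(duplicates_df, dist_df, on="Street_names")
print(merged['Mean_Dist'].tolist())  # → [12.5, 12.5, 8.0]
```

Each row of duplicates_df is matched against the unique entry in dist_df, so duplicated streets get the distance and STD repeated, with no explicit loop needed.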
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have a dictionary of pandas dataframes, each frame contains timestamps and market caps corresponding to the timestamps, the keys of which are:
coins = ['dashcoin','litecoin','dogecoin','nxt']
I would like to create a new key in the dictionary, 'merged', and use the pd.merge method to merge the 4 existing dataframes according to their timestamp (I want completed rows, so the 'inner' join method will be appropriate).
Sample of one of the data frames:
data2['nxt'].head()
Out[214]:
timestamp nxt_cap
0 2013-12-04 15091900
1 2013-12-05 14936300
2 2013-12-06 11237100
3 2013-12-07 7031430
4 2013-12-08 6292640
I'm currently getting a result using this code:
data2['merged'] = data2['dogecoin']
for coin in coins:
    data2['merged'] = pd.merge(left=data2['merged'], right=data2[coin], left_on='timestamp', right_on='timestamp')
but this repeats 'dogecoin' in 'merged'. However, if data2['merged'] is not initialized to data2['dogecoin'] (or some similar data), then the merge won't work, as there are no existing values in 'merged' to merge against.
EDIT: my desired result is create one merged dataframe seen in a new element in dictionary 'data2' (data2['merged']), containing the merged data frames from the other elements in data2
Try replacing the generalized pd.merge() with the DataFrame's own merge method, seeding the result with the first dataframe:
data2['merged'] = data2['dashcoin']

# LEAVE OUT FIRST ELEMENT
for coin in coins[1:]:
    data2['merged'] = data2['merged'].merge(data2[coin], on='timestamp')
Since you've already made coins a list, why not just something like
data2['merged'] = data2[coins[0]]
for coin in coins[1:]:
    data2['merged'] = pd.merge(....
Unless I'm misunderstanding, this question isn't specific to dataframes, it's just about how to write a loop when the first element has to be treated differently to the rest.
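For what it's worth, the special-casing of the first element can also be avoided entirely with functools.reduce, which folds the merge across the whole list. A sketch with invented toy frames standing in for the real per-coin data:

```python
from functools import reduce
import pandas as pd

# toy stand-ins for the per-coin frames in data2
data2 = {
    'dashcoin': pd.DataFrame({'timestamp': ['2013-12-04', '2013-12-05'],
                              'dash_cap': [1, 2]}),
    'litecoin': pd.DataFrame({'timestamp': ['2013-12-05', '2013-12-06'],
                              'ltc_cap': [3, 4]}),
}
coins = ['dashcoin', 'litecoin']

# inner-merge every frame on 'timestamp' in one pass;
# reduce uses the first frame as the starting value automatically
data2['merged'] = reduce(
    lambda left, right: pd.merge(left, right, on='timestamp'),
    (data2[c] for c in coins),
)
print(data2['merged']['timestamp'].tolist())  # → ['2013-12-05']
```

Only the timestamps present in every frame survive the chain of inner joins, which matches the "completed rows" requirement from the question.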