This question already has answers here:
Pandas Vectorized lookup of Dictionary
(2 answers)
Closed 2 years ago.
I have a dataframe with a Country column containing rows for around 15 countries. I want to add a Continent column using a mapping dictionary, ContinentDict, that maps country names to continent names.
I see that these two work
df['Population'] = df['Energy Supply'] / df['Energy Supply per Capita']
df['Continent'] = df.apply(lambda x: ContinentDict[x['Country']], axis='columns')
but this does not
df['Continent'] = ContinentDict[df['Country']]
It looks like the issue is that df['Country'] is a Series object, so statement 3 is not treated the same way as statement 2.
Questions
Would love to understand why statement 1 works but statement 3 does not. Is it because dividing two Series objects is defined as an element-wise divide?
Is there any way to change statement 3 to request an element-wise operation, without having to go the apply route?
From your statement "a mapping dictionary, ContinentDict", it looks like ContinentDict is a plain Python dictionary. In this case,
ContinentDict[some_key]
is a pure Python lookup, regardless of what object some_key is. That's why the 3rd statement fails: the Series df['Country'] is not a key in the dictionary (and it never can be, since a Series is mutable and therefore unhashable, and dictionary keys must be hashable).
A plain dictionary only allows indexing by an exact key and throws an error when the key is not in the dictionary; it does not broadcast over a Series.
Pandas does provide a tool for you to replace/map the values:
df['Continent'] = df['Country'].map(ContinentDict)
In case 1, you are dealing with two pandas Series, so pandas knows how to combine them element-wise.
In case 3, you have a Python dictionary and a pandas Series; a plain dictionary does not know how to handle a Series (df['Country'] is a pandas Series, not a key in the dictionary).
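As a minimal sketch (with made-up country data standing in for the real frame), Series.map gives the element-wise dictionary lookup that statement 3 was reaching for:

```python
import pandas as pd

# Hypothetical sample data; the real frame has around 15 countries.
df = pd.DataFrame({"Country": ["France", "Japan", "Brazil"]})
ContinentDict = {"France": "Europe", "Japan": "Asia", "Brazil": "South America"}

# Series.map looks each element up in the dictionary,
# without a Python-level loop over rows.
df["Continent"] = df["Country"].map(ContinentDict)
```

Countries missing from the dictionary come back as NaN rather than raising a KeyError, which is another difference from the apply/lambda version.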
This question already has answers here:
Python pandas insert list into a cell
(9 answers)
Closed 1 year ago.
It seems to me that this question has probably been asked before, if not on SO then elsewhere. I have not been able to find it. Apologies if this is a duplicate...
It turns out that there is indeed a very related question: Python pandas insert list into a cell. It is not a duplicate, though: indeed, the error there is different from the error here. However, the solution there can be applied here.
It has also helpfully been given below: https://stackoverflow.com/a/66852480/5065462.
I am trying to store a list inside a pandas DataFrame. In short, each row corresponds to some information on a group of people, and I have a people_ids column which is a list of their ids.
However, I am running into the following error.
ValueError: cannot set using a multi-index selection indexer with a different length than the value
Here is a minimal non-working example.
df = pd.DataFrame({"a": [[1], [1,2]], "b": range(1,3)})
print(df)
df.loc[0, "a"] = [10,2]
print(df)
Running the df.loc command gives the above error. I get the same error if I replace the lists by tuples and/or use lists/tuples each of the same length in the different rows.
If I remove column b, then it works fine with both lists and tuples.
Also, if I replace [10,2] with [10], then it runs, but the first entry in column a is not a list with the single entry 10, but is a single integer. Interestingly, this happens even with (10,) in place of [10].
My guess is that it thinks the right-hand side is some kind of index, hence the "multi-index error".
Use df.at: it accesses exactly one cell at a time, so pandas does not try to interpret the right-hand side as an indexer and you can assign a list directly.
df.at[0, "a"] = [10,2]
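Putting it together with the minimal example from the question:

```python
import pandas as pd

df = pd.DataFrame({"a": [[1], [1, 2]], "b": range(1, 3)})

# .at targets a single cell, so the list is stored as-is
# instead of being unpacked as a multi-item indexer.
df.at[0, "a"] = [10, 2]
```

This also stores `[10]` as an actual one-element list, avoiding the scalar-collapsing behaviour seen with `.loc`.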
This question already has answers here:
Split pandas dataframe based on groupby
(4 answers)
Closed 11 months ago.
I have a dataframe that is being output to a spreadsheet called 'All Data'. Let's say this data contains business addresses (columns for street, city, zip, state). However, I also want to create a worksheet for each unique state containing the exact same columns.
My basic idea was to iterate over every row using df.iterrows() and split the dataframe by appending rows to new dataframes, but that seems extremely inefficient. Is there a better way to do this?
I found this answer but that is just a boolean index.
The groupby answers on the other question will work for you too. In your case, something like:
df_list = [d for _, d in df.groupby('state')]
This uses a list comprehension to return a list of dataframes, with one dataframe for each state.
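If you would rather have the frames keyed by state (and, ultimately, one worksheet per state), a dict comprehension over the same groupby is a small sketch of the idea, using made-up address data:

```python
import pandas as pd

# Hypothetical data standing in for the 'All Data' sheet.
df = pd.DataFrame({
    "street": ["1 Main St", "2 Oak Ave", "3 Pine Rd"],
    "city": ["Austin", "Dallas", "Denver"],
    "state": ["TX", "TX", "CO"],
})

# One dataframe per state, keyed by the state value.
frames_by_state = {state: d for state, d in df.groupby("state")}

# Each per-state frame could then go to its own worksheet, e.g.:
# with pd.ExcelWriter("all_data.xlsx") as writer:
#     df.to_excel(writer, sheet_name="All Data", index=False)
#     for state, d in frames_by_state.items():
#         d.to_excel(writer, sheet_name=state, index=False)
```

The file name and sheet layout above are illustrative; writing Excel files also requires an engine such as openpyxl to be installed.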
A simple way to do it would be to get the unique states, then filter on each one and save the result as an individual CSV (or do any other per-state operation).
Here's an example:
# df[column].unique() returns the unique values in that particular column
for state in df['state'].unique():
    # Filter the dataframe on that state value and write it to its own file
    df[df['state'] == state].to_csv(f"{state}.csv")
df_collection = dict(tuple(data.groupby('LOCATION').head(2)))
I want to group the df by geographic information and create, for each country, its own df inside a dict, so that I can assign different names. Moreover, I want only the first two years of each country assigned to the new dataframe, so I used head(2), but I receive the error message:
dictionary update sequence element #0 has length 8; 2 is required
df_collection[('AUS')].head(2)
this works if I build the dict without head(2), but what is the difference?
The problem is where head(2) is applied.
data.groupby('LOCATION') iterates as (name, subframe) pairs, so dict(tuple(data.groupby('LOCATION'))) works. But .head(2) collapses the groupby back into a single DataFrame, and iterating a DataFrame yields its column names. dict() then tries to unpack each column name string as a (key, value) pair, hence "sequence element #0 has length 8; 2 is required" ('LOCATION' has 8 characters).
This should work for you:
df_collection = {name: group.head(2) for name, group in data.groupby('LOCATION')}
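A minimal sketch with made-up country data shows the per-country dict this produces:

```python
import pandas as pd

# Hypothetical data with a LOCATION column and one row per country-year.
data = pd.DataFrame({
    "LOCATION": ["AUS", "AUS", "AUS", "DEU", "DEU", "DEU"],
    "TIME": [2000, 2001, 2002, 2000, 2001, 2002],
    "Value": [1.0, 1.1, 1.2, 2.0, 2.1, 2.2],
})

# One dataframe per country, keeping only its first two rows.
df_collection = {name: group.head(2) for name, group in data.groupby("LOCATION")}
```

Each value is a regular DataFrame, so `df_collection['AUS']` can be used on its own afterwards.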
This question already has answers here:
Pandas DataFrame to List of Dictionaries
(5 answers)
Closed 4 years ago.
I was wondering how to write cleaner code, so I started to pay attention to some of my daily code routines. I frequently have to iterate over a dataframe to update a list of dicts:
foo = []
for index, row in df.iterrows():
    bar = {}
    bar['foobar0'] = row['foobar0']
    bar['foobar1'] = row['foobar1']
    foo.append(bar)
I think it is hard to maintain, because if the df keys change, the loop will break. Besides that, writing the same key for both data structures is a kind of code duplication.
The context is, I frequently make api calls to a specific endpoint that receives a list of dicts.
I'm looking for improvements to that routine: how can I replace the explicit key assignments with some map/lambda trick, to avoid errors caused by key changes in a given dataframe (which frequently results from some query in a database)?
In other words, if a column name in the database changes, the dataframe keys will change too. So I'd like to create each dict on the fly with the same keys as the dataframe and fill each entry with the corresponding dataframe values.
How can I do that?
The simple way to do this is to_dict, which takes an orient argument that you can use to specify how you want the result structured.
In particular, orient='records' gives you a list of records, each one a dict in {col1name: col1value, col2name: col2value, ...} format.
(Your question is a bit confusing. At the very end, you say, "I'd like to create a dict on the fly with same keys of a given dataframe and fill each dict entry with dataframe corresponding values." This makes it sound like you want a dict of lists (that's to_dict(orient='list')) or maybe a dict of dicts (that's to_dict(orient='dict'), which is also the default), not a list of dicts.)
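A small sketch, with a made-up frame standing in for the database query result:

```python
import pandas as pd

# Hypothetical stand-in for the dataframe returned by the database query.
df = pd.DataFrame({"foobar0": [1, 2], "foobar1": ["a", "b"]})

# One dict per row, keyed by whatever the column names happen to be.
records = df.to_dict(orient="records")
# → [{'foobar0': 1, 'foobar1': 'a'}, {'foobar0': 2, 'foobar1': 'b'}]
```

If a column is later renamed in the database, the dict keys follow automatically, which is exactly the maintenance property you were after.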
If you want to know how to do this manually (which you don't want to actually do, but it's worth understanding): a DataFrame acts like a dict, with the column names as the keys and the Series as the values. So you can get a list of the column names the same way you do with a normal dict:
keys = list(df)
Then:
foo = []
for index, row in df.iterrows():
    bar = {}
    for key in keys:
        bar[key] = row[key]
    foo.append(bar)
Or, more compactly:
foo = [{key: row[key] for key in keys} for _, row in df.iterrows()]
This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 4 years ago.
I have a dataframe which is similar to:
grades = pd.DataFrame(columns=["person", "course_code", "grade"],
                      data=[[1, 101, 2.0], [2, 102, 1.0], [3, 103, 3.0],
                            [2, 104, 4.0], [1, 102, 5.0], [3, 104, 2.5],
                            [2, 101, 1.0]])
On each row is the grade of a certain student in certain subject.
And want to convert it to another that looks like this:
students = pd.DataFrame(columns=[101, 102, 103, 104],
                        data=[[2.0, 5.0, np.nan, np.nan],
                              [1.0, 1.0, np.nan, 4.0],
                              [np.nan, np.nan, 3.0, 2.5]])
On each row is a student (the row index is the student code) with the grades obtained in every subject (every column is a different subject).
I have tried doing this:
for subj in grades["COURSE_CODE"].unique():
    grades_subj = grades[grades["COURSE_CODE"] == subj]
    grades_subj = grades_subj.set_index("EXPEDIENT_CODE", drop=True)
    for st in grades["EXPEDIENT_CODE"].unique():
        grade_num = grades_subj.loc[st]["GRADE"]
        students.loc[st][subj] = grade_num
But I get:
KeyError: 'the label [304208] is not in the [index]'
I have tried other ways too and get always errors...
Can someone help me, please?
try:
grades.pivot_table(index='person', columns='course_code', values='grade')
The values argument lets you choose which column to aggregate.
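With the grades frame from the question, this one line produces the students table directly:

```python
import pandas as pd

grades = pd.DataFrame(columns=["person", "course_code", "grade"],
                      data=[[1, 101, 2.0], [2, 102, 1.0], [3, 103, 3.0],
                            [2, 104, 4.0], [1, 102, 5.0], [3, 104, 2.5],
                            [2, 101, 1.0]])

# One row per student, one column per course; missing grades become NaN.
students = grades.pivot_table(index="person", columns="course_code", values="grade")
```

No manual loop over courses and students is needed; pivot_table aligns everything by index and column labels.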
In order to answer your comment below: you can always add different levels when indexing. This is simply done by passing a list rather than a single string to index. Note you can do the same with columns. So, based on the example you provided:
grades.pivot_table(index=['person','school'], columns='course_code', values ='grade')
After this I usually recommend reset_index(), unless you are fluent in slicing and indexing with a MultiIndex.
Also, if the correspondence is 1-to-1, you could merge both dataframes using the appropriate join.
Here you have all the information about Reshaping and Pivot Tables in Pandas.