Dataframe iteration better practices for values assignment [duplicate]

This question already has answers here:
Pandas DataFrame to List of Dictionaries
(5 answers)
Closed 4 years ago.
I was wondering how to write cleaner code, so I started paying attention to some of my daily coding routines. I frequently have to iterate over a dataframe to build a list of dicts:
foo = []
for index, row in df.iterrows():
    bar = {}
    bar['foobar0'] = row['foobar0']
    bar['foobar1'] = row['foobar1']
    foo.append(bar)
I think this is hard to maintain, because if the df keys change, the loop will break. Besides that, writing the same key for two data structures is a kind of code duplication.
The context is that I frequently make API calls to a specific endpoint that receives a list of dicts.
I'm looking for improvements to that routine: how can I replace the explicit key assignments with some map and lambda tricks, so that key changes in a given dataframe (frequently resulting from some database query) don't cause errors?
In other words, if a column name in the database is changed, the dataframe keys will change too. So I'd like to create a dict on the fly with the same keys as a given dataframe and fill each dict entry with the dataframe's corresponding values.
How can I do that?

The simple way to do this is to_dict, which takes an orient argument that you can use to specify how you want the result structured.
In particular, orient='records' gives you a list of records, each one a dict in {col1name: col1value, col2name: col2value, ...} format.
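For example, a minimal sketch (the column names are placeholders taken from the question, not real data):
import pandas as pd

df = pd.DataFrame({'foobar0': [1, 2], 'foobar1': ['a', 'b']})

# Each row becomes one {column name: cell value} dict
records = df.to_dict(orient='records')
print(records)  # [{'foobar0': 1, 'foobar1': 'a'}, {'foobar0': 2, 'foobar1': 'b'}]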
(Your question is a bit confusing. At the very end, you say, "I'd like to create a dict on the fly with same keys of a given dataframe and fill each dict entry with dataframe corresponding values." This makes it sound like you want a dict of lists (that's to_dict(orient='list')) or maybe a dict of dicts (that's to_dict(orient='dict'), or just to_dict(), because that's the default), not a list of dicts.)
If you want to know how to do this manually (which you don't want to actually do, but it's worth understanding): a DataFrame acts like a dict, with the column names as the keys and the Series as the values. So you can get a list of the column names the same way you do with a normal dict:
columns = list(df)
Then:
foo = []
for index, row in df.iterrows():
    bar = {}
    for key in columns:
        bar[key] = row[key]
    foo.append(bar)
Or, more compactly:
foo = [{key: row[key] for key in columns} for _, row in df.iterrows()]
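As a quick sanity check, here is a minimal sketch (with made-up column names) showing that the comprehension and to_dict(orient='records') agree:
import pandas as pd

df = pd.DataFrame({'foobar0': [1, 2], 'foobar1': ['a', 'b']})
columns = list(df)

# Build the list of dicts by hand, then compare with the built-in
manual = [{key: row[key] for key in columns} for _, row in df.iterrows()]
assert manual == df.to_dict(orient='records')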

Related

How to generate multiple dataframes from a dictionary?

I have a dictionary like the following
d = {"df_text": df1, "df_logo": df2, "df_person": df3}  # renamed from 'dict', which shadows the builtin
Each of the values in the dictionary is a dataframe.
Yet my actual dictionary is larger, so I want to make a loop that generates multiple dataframes from all of the components of this dict, in such a way that each key becomes the name of a dataframe and the corresponding value becomes that dataframe's contents.
ex.
df_text = pd.DataFrame(df1)
How can I do this?
You can add the contents of your dict as variables to vars():
for k, v in d.items():
    vars()[k] = v
After that you can access them simply as df_text, df_logo etc.
(as you wrote in your question, the values of your dict are already dataframes, so I assume you don't want to wrap them once more in a dataframe)
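Here is a minimal, self-contained sketch of the idea (df1/df2/df3 are stand-in frames; note that writing into vars() is only reliable at module scope, where it returns the same dict as globals()):
import pandas as pd

df1 = pd.DataFrame({'a': [1]})
df2 = pd.DataFrame({'b': [2]})
df3 = pd.DataFrame({'c': [3]})

d = {"df_text": df1, "df_logo": df2, "df_person": df3}

# Inject each dataframe into the module namespace under its key
for k, v in d.items():
    globals()[k] = v

print(df_text)  # the same object as d["df_text"]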

Creating a new dataframe column as a function of other columns [duplicate]

This question already has answers here:
Pandas Vectorized lookup of Dictionary
(2 answers)
Closed 2 years ago.
I have a dataframe with a Country column. It has rows for around 15 countries. I want to add a Continent column using a mapping dictionary, ContinentDict, that maps country names to continent names.
I see that these two work
df['Population'] = df['Energy Supply'] / df['Energy Supply per Capita']
df['Continent'] = df.apply(lambda x: ContinentDict[x['Country']], axis='columns')
but this does not
df['Continent'] = ContinentDict[df['Country']]
Looks like the issue is that df['Country'] is a Series object, so the third statement is not treated the same way as the second.
Questions:
Would love to understand why statement 1 works but not 3. Is it because dividing two Series objects is defined as an element-wise divide?
Any way to change statement 3 to say I want an element-wise operation, without having to go the apply route?
From your statement "a mapping dictionary, ContinentDict", it looks like ContinentDict is a Python dictionary. In this case,
ContinentDict[some_key]
is a plain Python dict lookup, regardless of what object some_key is. That's why the 3rd statement fails: df['Country'] is a Series, not a key of the dictionary (and it never can be, since mutable objects such as a Series cannot be dictionary keys).
A plain dict lookup only matches an exact key and raises a KeyError when the key is not in the dictionary.
Pandas does provide a tool for you to replace/map the values:
df['Continent'] = df['Country'].map(ContinentDict)
df['Continent'] = df['Country'].map(ContinentDict)
In case 1, you are dealing with two pandas Series, so pandas knows how to combine them element-wise.
In case 3, you have a Python dictionary and a pandas Series, and a plain dictionary doesn't know how to deal with a Series (df['Country'] is a pandas Series, not a key in the dictionary).
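A minimal sketch of the map approach (the country/continent pairs are invented for illustration):
import pandas as pd

df = pd.DataFrame({'Country': ['France', 'China', 'Brazil']})
ContinentDict = {'France': 'Europe', 'China': 'Asia', 'Brazil': 'South America'}

# Series.map looks each value up in the dict, element-wise
df['Continent'] = df['Country'].map(ContinentDict)
print(df)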

How to convert dictionary of DataFrames into individual DataFrames (Python, Pandas)

I have an original dataframe with 4 columns (for the example let's call them product_id, year_month, week, order_amount) and > 50,000 rows. There are 240 individual product_id values and each one of them behaves differently in the data, so I wanted to create individual dataframes from the original one based on individual product_id. I was able to do this by utilizing:
dict_of_productid = {k: v for k, v in df.groupby('product_id')}
This created a dictionary with the key being the product_id and the value being a dataframe with the columns product_id, year_month, week, order_amount. Each item in the dictionary also maintained the index from the original df. For example: if product_id dvvd56 was on row 4035 of the original df, it appears in the dataframe created for product_id dvvd56 with its index still being 4035.
What I'm stuck with now is a dictionary with dataframes as values, but I can't find a way to convert these values into individual dataframes I can use and manipulate. If there is a way to do this please let me know! I'll be very grateful. Thank you.
I found a way to go about this. I don't know if it is the most appropriate way, but it might help further answers by clarifying what I want to do.
First step was to convert the unique values into a list and then sorting them in order:
product_id_list = df['product_id'].value_counts().index.to_list()
product_id_list = sorted(product_id_list)
After this was done I created a function and then iterated over it with the individual values of product_id_list:
def get_df(key):
    for k in key:
        df_productid = dict_of_productid[k]
    return df_productid

for c, i in enumerate(product_id_list):
    globals()[f'df_{c}'] = get_df([f'{i}'])
This now lets me separate all the values of the dictionary into individual dataframes that I can call without explicitly stating the product id. I can just do df_1 and get the dataframe.
(I don't know if this is the most efficient way to go about this.)
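For comparison, the sub-frames can also be used straight out of the dictionary, without creating new global names. A minimal sketch with made-up data:
import pandas as pd

df = pd.DataFrame({
    'product_id': ['a1', 'a1', 'b2'],
    'order_amount': [10, 20, 30],
})

# One sub-dataframe per product_id, original indices preserved
dict_of_productid = {k: v for k, v in df.groupby('product_id')}

print(dict_of_productid['a1'])  # the dataframe for product_id 'a1'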

Updating excel rows with data in the form of python dict_items

I have a list of dictionaries whose values need to be written to an Excel sheet under the corresponding column headers:
new = [{"slno": "1", "region": "2", "customer": "3"}]
I am not sure about data types in Python, as I am a beginner.
All I want to do is update an Excel sheet with the data from the above dict using a for loop, but I always end up with unordered data.
In the Excel file there are column headers named exactly like the keys of the dict, so I was hoping to insert each value in the matching Excel column.
Note: I was able to write to Excel using a for loop, but the dict yielded its keys in random order, so the values were misplaced when written to the sheet.
xfile = openpyxl.load_workbook('D:\\LoginLibrary\\test.xlsx')
sheet = xfile.get_sheet_by_name('OE')
charcounter = "A"
i = i
for key in g:
    sheet[charcounter + str(i)] = key
    charcounter = chr(ord(charcounter[0]) + 1)
xfile.save('D:\\LoginLibrary\\test.xlsx')
One of the difficulties of dictionaries is that when you iterate over one in a loop, the keys can come out in any order. However, something you can do is get the whole list of keys and then sort that list. For example:
xfile = openpyxl.load_workbook('D:\\LoginLibrary\\test.xlsx')
sheet = xfile.get_sheet_by_name('OE')
charcounter = "A"
i = 1  # row to write into; the original 'i = i' left this undefined
new = {"slno": "1", "region": "2", "customer": "3"}  # the outer brackets made it a list, unneeded
print(sorted(new.keys()))  # prints all the keys in alphabetical order
list_of_sorted_keys = sorted(new.keys())
for key in list_of_sorted_keys:
    sheet[charcounter + str(i)] = key
    charcounter = chr(ord(charcounter[0]) + 1)
xfile.save('D:\\LoginLibrary\\test.xlsx')
Note: I don't know much about writing to Excel, so I'm assuming you have that part right. My additions just put the dictionary's keys in a defined order.
If alphabetical order for the keys doesn't do the job, you can order by the values as well, although it's harder to get keys from their values, because dictionaries aren't designed to work that way.
Another way could be to make the original data set a list of tuples, like so:
new = [("slno", "1"), ("region", "2"), ("customer", "3")]
That will keep all your data in the order you put it in the list, because lists are accessed by integer indices.
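For instance, a minimal sketch of writing such a list of tuples across a sheet (the file path and sheet name follow the question; headers in row 1 and values in row 2 are assumptions):
import openpyxl

xfile = openpyxl.load_workbook('D:\\LoginLibrary\\test.xlsx')
sheet = xfile['OE']  # modern openpyxl indexing; get_sheet_by_name is deprecated

new = [("slno", "1"), ("region", "2"), ("customer", "3")]

# enumerate yields column numbers 1, 2, 3, ... in the order the tuples were listed
for col, (header, value) in enumerate(new, start=1):
    sheet.cell(row=1, column=col, value=header)
    sheet.cell(row=2, column=col, value=value)

xfile.save('D:\\LoginLibrary\\test.xlsx')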
I hope one of these ideas meets your needs!

Filtering pandas DataFrame

I'm reading in a .csv file using pandas, and then I want to filter out the rows where a specified column's value is not in a dictionary. So something like this:
df = pd.read_csv('mycsv.csv', sep='\t', encoding='utf-8', index_col=0,
names=['col1', 'col2','col3','col4'])
c = df.col4.value_counts(normalize=True).head(20)
values = dict(zip(c.index.tolist()[1::2], c.tolist()[1::2])) # Get odd and create dict
df_filtered = filter out all rows where col4 not in values
After searching around a bit I tried using the following to filter it:
df_filtered = df[df.col4 in values]
but that unfortunately didn't work.
I've done the following to make it work for what I want to do, but it's incredibly slow for a large .csv file, so I thought there must be a built-in pandas way to do it:
t = [(list(df.col1) + list(df.col2) + list(df.col3)) for i in range(len(df.col4)) if list(df.col4)[i] in values]
If you want to check against the dictionary values:
df_filtered = df[df.col4.isin(values.values())]
If you want to check against the dictionary keys:
df_filtered = df[df.col4.isin(values.keys())]
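A quick sketch of isin with toy data (the dict contents are invented for illustration):
import pandas as pd

df = pd.DataFrame({'col4': ['x', 'y', 'z']})
values = {'x': 0.4, 'z': 0.1}  # {col4 value: relative frequency}

print(df[df.col4.isin(values.keys())])    # keeps the 'x' and 'z' rows
print(df[df.col4.isin(values.values())])  # keeps nothing here, since col4 holds the keys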
As A.Kot mentioned, you could use the values method of the dict to search. But values returns a list in Python 2 and a view object in Python 3.
If your only reason for creating that dict is membership testing, and you only ever look at the values of the dict, then you are using the wrong data structure.
A set will improve your lookup performance, and the filter is still the isin call:
df_filtered = df[df.col4.isin(values)]
(Note that df.col4 in values would not work: in against a set tries to hash the whole Series.) If you use values elsewhere, and you want to check against the keys, then you're fine, because membership testing against dict keys is efficient.
