I want to read and prepare data from an Excel spreadsheet that contains several sheets of data.
I first read the data from the Excel file using pd.read_excel with sheet_name=None so that all the sheets are loaded into the price_data object.
price_data = pd.read_excel('price_data.xlsx', sheet_name=None)
This gives me an OrderedDict object with 5 dataframes.
Afterwards I need to extract the individual DataFrames that make up price_data. I thought of using a for loop for this, which also lets me perform other per-sheet operations such as setting the index of each DataFrame.
This is the approach I tried:
for key, df in price_data.items():
    df.set_index('DeliveryStart', inplace=True)
    key = df
With this code I expected each DataFrame to be written into an object named after the key, so that at the end I would have as many DataFrames as there are in my original price_data object. However, I end up with two identical DataFrames, one named key and one named value.
Suggestions?
Reason for current behaviour:
In your example, the variables key and df are created (if they don't already exist) and overwritten on each iteration of the loop. Inside the loop you rebind key to the object that df refers to (df still points at that object too, since Python allows several names to refer to the same object). On the next iteration, key is overwritten again and bound to the new value of df. When the loop finishes, both variables simply hold whatever they were assigned last.
To illustrate:
from collections import OrderedDict
od = OrderedDict()
od["first"] = "foo"
od["second"] = "bar"
# I've added an extra layer of `enumerate` just to display the loop progress.
# This isn't required in your actual code.
for loop, (key, val) in enumerate(od.items()):
    print("Iteration: {}".format(loop))
    print(key, val)
    key = val
    print(key, val)
print("Final output:", key, val)
Output:
Iteration: 0
first foo
foo foo
Iteration: 1
second bar
bar bar
Final output: bar bar
Solution:
It looks like you want to dynamically create variables named after the value of key, which isn't considered a good idea (even though it can be done). See Dynamically set local variable for more discussion.
It sounds like a dict (or OrderedDict) is actually a good structure for storing the DataFrames alongside the names of the sheets they came from. Essentially, you have a container with the named items you want to use, and you can iterate over those items to do work such as concatenation, filtering or similar.
If there's a different reason you wanted the DataFrames to be in standalone objects, leave a comment and I will try and make a follow-up suggestion.
If you are happy to set the index of the DataFrames in place, you could try this:
for key in price_data:
    price_data[key].set_index('DeliveryStart', inplace=True)
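If you keep the dict, the downstream work can be done directly on it. A minimal sketch (the sheet name used in the lookup is made up, and it assumes every sheet has a DeliveryStart column):
import pandas as pd

# Reload the workbook as a dict of DataFrames keyed by sheet name.
price_data = pd.read_excel('price_data.xlsx', sheet_name=None)
for key in price_data:
    price_data[key].set_index('DeliveryStart', inplace=True)

# Look up a single sheet by its name (hypothetical sheet name).
df_2019 = price_data['prices_2019']

# Or combine every sheet into one frame, keeping the sheet name
# as the outer index level next to DeliveryStart.
combined = pd.concat(price_data, names=['sheet'])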
Related
I have a CSV file that includes one column of data that is not user friendly. I need to translate that data into something that makes sense. A simple find/replace seems bulky since there are dozens, if not hundreds, of different combinations I want to translate.
For instance: BLK = Black or MNT TP = Mountain Top
There are dozens, if not hundreds, of possible translations - I already have lots of them in a CSV table. The problem is how to use that dictionary to change the values in another CSV table. It is also important to note that this will (eventually) need to run on its own every few minutes - not just as a one-time translation.
It would be nice if you could describe the data you're working with in more detail. I'll take my best guess, though.
Let's say you have a CSV file, you use pandas to read it into a DataFrame named df, and the "not user friendly" column is named col.
To replace all the values in column col, you first need a dictionary containing all the keys (original texts) and values (new texts):
my_dict = {"BLK": "Black", "MNT TP": "Mountain Top", ...}
Then, map the dictionary to the column:
df["col"] = df["col"].map(lambda x: my_dict.get(x, x))
If a value appears as a key in the dictionary, it is replaced by the corresponding new value; otherwise, the original value is kept.
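Since the translations already live in a CSV table, here is a minimal sketch of the full round trip; the file names and column names ('translations.csv', 'code', 'meaning', 'data.csv') are assumptions for illustration:
import pandas as pd

# Hypothetical translation table with columns 'code' and 'meaning'.
translations = pd.read_csv('translations.csv')
my_dict = dict(zip(translations['code'], translations['meaning']))

# The file whose column needs translating.
df = pd.read_csv('data.csv')
# Replace known codes; anything not in the dictionary is left unchanged.
df['col'] = df['col'].map(lambda x: my_dict.get(x, x))

df.to_csv('data_translated.csv', index=False)
Wrapped in a script, this can be scheduled to run every few minutes with cron, Windows Task Scheduler or similar.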
I found it to be extremely slow to initialize a pandas Series object from a list of DataFrames. For example, the following code:
import pandas as pd
import numpy as np
# creating a large (~8GB) list of DataFrames.
l = [pd.DataFrame(np.zeros((1000, 1000))) for i in range(1000)]
# This line executes extremely slowly and takes an extra ~10GB of memory. Why?
# It is even much, much slower than the original list `l` construction.
s = pd.Series(l)
Initially I thought the Series initialization accidentally deep-copied the DataFrames, which would make it slow, but it turned out that it just copies by reference, as the usual = in Python does.
On the other hand, if I just create a Series and manually copy the elements over (in a for loop), it is fast:
# This for loop is faster. Why?
s1 = pd.Series(data=None, index=range(1000), dtype=object)
for i in range(1000):
    s1[i] = l[i]
What is happening here?
Real-life usage: I have a table loader that reads something from disk and returns a pandas DataFrame (a table). To speed up the reading, I use a parallel tool (from this answer) to execute multiple reads (each read is for one date, for example), and it returns a list (of tables). Now I want to turn this list into a pandas Series object with a proper index (e.g. the date or file location used in each read), but the Series construction takes a ridiculous amount of time (as the sample code above shows). I can of course write it as a for loop to solve the issue, but that would be ugly. Besides, I want to know what is really taking the time here. Any insights?
This is not a direct answer to the OP's question (what's causing the slow-down when constructing a series from a list of dataframes):
I might be missing an important advantage of using pd.Series to store a list of DataFrames; however, if that's not critical for downstream processes, a better option might be either to store this as a dictionary of DataFrames or to concatenate them into a single DataFrame.
For the dictionary of dataframes, one could use something like:
d = {n: df for n, df in enumerate(l)}
# can change the key to something more useful in downstream processes
For concatenation:
w = pd.concat(l, axis=1)
# Note: with the snippet in this question the column names will be
# duplicated (because every frame has the same names), but if your
# actual list of DataFrames has unique column names, the concatenated
# result will behave like a normal DataFrame with unique columns.
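If the index you ultimately want is the date each table was read for, the dictionary option maps onto that naturally. A small sketch (the dates and the tiny frames are placeholders):
import pandas as pd
import numpy as np

# A handful of small frames stands in for the real list of tables.
l = [pd.DataFrame(np.zeros((3, 3))) for i in range(3)]

# Hypothetical keys: the date (or file path) each parallel read corresponds to.
dates = pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03'])
d = {date: df for date, df in zip(dates, l)}

# Look up a single table by its date.
table = d[pd.Timestamp('2021-01-02')]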
I tried my best to find a similar answer, but didn't seem to find the one I needed.
I have a dictionary of DataFrame objects, where the key is the DataFrame's name and the value is the actual DataFrame:
table_names_dict = {'name_1': dataframe_1, 'name_2': dataframe_2}
I am trying to loop over the dictionary and dynamically create separate dataframes, using the keys as their names:
name_1 = dataframe_1
name_2 = dataframe_2
I tried something along these lines:
for key, value in table_names_dict.items():
    key = value
This simply created a single DataFrame named value.
I've also tried:
locals().update(table_names_dict)
which did create the necessary variables, but they are not accessible in Spyder's Variable Explorer, and from what I've read, the use of locals() is frowned upon.
What am I doing wrong?
You can use globals() for this:
for i in table_names_dict:
    globals()[i] = table_names_dict[i]
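As a self-contained illustration of what the loop does (the stand-in frames below are made up):
import pandas as pd

# Stand-in frames so the snippet runs on its own.
dataframe_1 = pd.DataFrame({'a': [1, 2]})
dataframe_2 = pd.DataFrame({'b': [3, 4]})
table_names_dict = {'name_1': dataframe_1, 'name_2': dataframe_2}

for i in table_names_dict:
    globals()[i] = table_names_dict[i]

# name_1 and name_2 now exist as module-level variables.
assert name_1 is dataframe_1
assert name_2 is dataframe_2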
I originally have some time series data, which looks like this, and have to do the following:
1. First import it as a DataFrame
2. Set the date column as the datetime index
3. Add some indicators, such as moving averages, as new columns
4. Do some rounding (on the values of whole columns)
5. Shift a column one row up or down (just to manipulate the data)
6. Then convert the df to a list (because I need to loop over it based on some conditions, and that is a lot faster than looping over a df; I need speed)
7. But now I want to convert the df to a dict instead of a list, because I want to keep the column names; it's more convenient
But now I have found that converting to a dict takes a lot longer than converting to a list, even when I do it manually instead of using the built-in method.
My question is: is there a better way to do this? Maybe not import it as a DataFrame in the first place, and still be able to do points 2 to 5? At the end I need a dict that lets me do the loop and keeps the column names as keys. Thanks.
P.S. The dict should look something like this; the format is similar to the df, where each row is basically a date with the corresponding data.
On item #7: If you want to convert to a dictionary, you can use df.to_dict()
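For the row-per-date shape described in the question, a minimal sketch using to_dict's orient argument (the column names and values below are made up):
import pandas as pd

# Made-up example: two dates, a price column and a moving-average column.
df = pd.DataFrame(
    {'close': [10.0, 10.5], 'ma_3': [9.8, 10.1]},
    index=pd.to_datetime(['2021-01-01', '2021-01-02']),
)

# One dict entry per row, keyed by the date; each value maps column name -> value.
d = df.to_dict(orient='index')
# {Timestamp('2021-01-01 00:00:00'): {'close': 10.0, 'ma_3': 9.8}, ...}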
On item #6: You don't need to convert the df to a list or loop over it: Here are better options. Look for the second answer (it says DON'T)
I have a batch of identifiers, each paired with two values, which behave in the following manner within an iteration.
For example,
print(indexIDs[i], (coordinate_x, coordinate_y))
Sample output looks like
I would like to add these data to a DataFrame, using indexIDs[i] as the row label and appending each incoming pair of values with the same identifier into the next consecutive column.
I attempted the following code, which didn't work.
spatio_location = pd.DataFrame()
spatio_location.loc[indexIDs[i], column_counter] = (coordinate_x, coordinate_y)
This was a promising start for associating indexIDs[i] with a row, but I could not work out how to take in new data without overwriting the existing DataFrame. I am aware the problem is the second line: the "=" assignment keeps overwriting the previous result over and over. I am looking for an appropriate way to change that line so that new incoming data is inserted into the existing DataFrame instead of overwriting it.
Appreciate your time and effort, thanks.
I'm a bit confused about the nature of coordinate_x (is it a list, or what?); anyway, maybe try to use append.
You could define an empty df with three columns:
df = pd.DataFrame([], columns=['a', 'b', 'c'])
Then populate it with a loop over your lists:
for i in range(TOFILL):
    df = df.append({'a': indexIDs[i], 'b': coordinate_x[i], 'c': coordinate_y[i]}, ignore_index=True)
Finally, set a column as the index:
df = df.set_index('a')
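As a side note, df.append is deprecated in recent pandas versions. A sketch of an alternative that collects the rows first and builds the frame once (it assumes, like the code above, that indexIDs, coordinate_x and coordinate_y are parallel lists; the values below are placeholders):
import pandas as pd

# Placeholder parallel lists standing in for the real incoming data.
indexIDs = ['id_1', 'id_2', 'id_3']
coordinate_x = [1.0, 2.0, 3.0]
coordinate_y = [4.0, 5.0, 6.0]

rows = [
    {'a': indexIDs[i], 'b': coordinate_x[i], 'c': coordinate_y[i]}
    for i in range(len(indexIDs))
]
# Building the frame once avoids the repeated copying that append does inside a loop.
df = pd.DataFrame(rows).set_index('a')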
Hope it helps.