create dataframe from dictionary - python

In order to iterate a list through a function, I used the following code:
tot = {}
for i in list:
    tot["tot{0}".format(i)] = stateagg(i)  # previously defined function
The output of this is a plain dictionary; I was wondering if there is a way to output it to a dataframe, or a way to convert it back to a dataframe.
I have tried
pd.DataFrame.from_dict(tot, orient='index')
which results in the following error:
ValueError: If using all scalar values, you must pass an index
Any help much appreciated.
Edit:
apologies I should've been clearer, the function pulls values out of a dataframe to create the dictionary, the data used isn't in list format. The list is used to pull the values out and aggregate data based on the list.
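For what it's worth, that ValueError typically appears when every value in the dict is a scalar. A minimal sketch of two common workarounds (the contents of tot are assumed here):
import pandas as pd

tot = {"totA": 1.5, "totB": 2.5}  # hypothetical scalar aggregates

# a Series accepts scalar values directly; the dict keys become the index
df = pd.Series(tot).to_frame(name="value")

# alternatively, pass an explicit index to the DataFrame constructor
df = pd.DataFrame(tot, index=[0])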

To create a dataframe from a dict, you can try something as follows
df = pd.DataFrame(my_dict.items(), columns=["A", "B"])
For example
my_dict = {"Math": 25, "Diana": 22, "Jhon": 30}
df = pd.DataFrame(my_dict.items(), columns=["Name", "Age"])
df
    Name  Age
0   Math   25
1  Diana   22
2   Jhon   30

Related

Apply function to multiple dataframes and create multiple dataframes from that

I have a list of multiple data frames on cryptocurrency. So I want to apply a function to all of these data frames, which should convert all the data frames so that I am only left with data from 2021.
The function looks like this:
dataframe_list = [bitcoin, aave, binance, cardano, chainlink, cosmos, crypto_com, dogecoin, eos, ethereum, iota, litecoin, monero, nem, polkadot, solana, stellar, tether, uniswap, usdcoin, wrapped, xrp]
def date_func(i):
    i['Date'] = pd.to_datetime(i['Date'])
    i = i.set_index(i['Date'])
    i = i.sort_index()
    i = i['2021-01-01':]
    return i
for dataframe in dataframe_list:
    dataframe = date_func(dataframe)
However, I am only left with one data frame called 'dataframe', which only contains values of the xrp dataframe.
I would like to have a new dataframe from each dataframe, called aave21, bitcoin21 .... which only contains values from 2021 onwards.
What am I doing wrong?
Best regards and thanks in advance.
You are overwriting dataframe when iterating over dataframe_list, i.e. you only keep the latest dataframe.
You can either try:
dataframe = pd.DataFrame()
for df in dataframe_list:
    dataframe = dataframe.append(date_func(df))
Or shorter:
dataframe = pd.concat([date_func(df) for df in dataframe_list])
You are overwriting the dataframe variable in your for loop when iterating over dataframe_list. You need to keep appending the results to a separate variable.
final_df = pd.DataFrame()
for dataframe in dataframe_list:
    final_df = final_df.append(date_func(dataframe))
print(final_df)
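If the goal is a separately named 2021-only frame per coin (aave21, bitcoin21, ...), a dict keyed by name avoids both the overwrite and a pile of variables. A minimal sketch, assuming a list of names that parallels dataframe_list:
names = ['bitcoin', 'aave', 'binance']  # ...one name per frame, assumed to match dataframe_list
frames_2021 = {name + '21': date_func(df) for name, df in zip(names, dataframe_list)}
frames_2021['bitcoin21'].head()  # access each result by name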

Indexing column in Pandas Dataframe returns NaN

I am running into a problem with trying to index my dataframe. As shown in the attached picture, I have a column in the dataframe called 'Identifiers' that contains a lot of redundant information ({'print_isbn_canonical': '). I only want the ISBN that comes after.
#Option 1 I tried
testdf2 = testdf2[testdf2['identifiers'].str[26:39]]
#Option 2 I tried
testdf2['identifiers_test'] = testdf2['identifiers'].str.replace("{'print_isbn_canonical': '","")
Unfortunately both of these options turn the dataframe column into a column containing only NaN values.
Please help out! I cannot seem to find the solution and have tried several things. Thank you all in advance!
Example image of the dataframe
If the contents of your column identifiers is a real dict / json type, you can use the string accessor str[] to access the dict value by key, as follows:
testdf2['identifiers_test'] = testdf2['identifiers'].str['print_isbn_canonical']
Demo
data = {'identifiers': [{'print_isbn_canonical': '9780721682167', 'eis': '1234'}]}
df = pd.DataFrame(data)
df['isbn'] = df['identifiers'].str['print_isbn_canonical']
print(df)
                                                 identifiers           isbn
0  {'print_isbn_canonical': '9780721682167', 'eis': '1234'}  9780721682167
Try this out:
testdf2['new_column'] = testdf2.apply(lambda r: r.identifiers[26:39], axis=1)
Here I assume that the identifiers column is of string type.
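If the identifiers column actually holds the dict as a literal string (which would also explain the NaNs from the .str[] lookup above), one option is to parse it first. A sketch, assuming the strings are valid Python literals:
import ast

testdf2['identifiers'] = testdf2['identifiers'].apply(ast.literal_eval)
testdf2['identifiers_test'] = testdf2['identifiers'].str['print_isbn_canonical']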

How to store a DataFrame inside a DataFrame

I have a DataFrame with various text and numeric columns which I use like a database. Since a column can be of dtype object, I can also store more complex objects inside a single cell, like a numpy array.
How could I store another DataFrame inside a cell?
df1=pd.DataFrame([1,'a'])
df2=pd.DataFrame([2,'b'])
This assignment fails:
df1.loc[0,0] = df2
ValueError: Incompatible indexer with DataFrame
PS. It is not a duplicate question as suggested below since I do not want to concatenate the "sub"-DataFrames
You can use set_value:
df1.set_value(0, 0, df2)
or, since .set_value has been deprecated since version 0.21.0:
df1.iat[0, 0] = df2
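As a side note, this kind of assignment only holds when the target column has object dtype, since a numeric column cannot store a DataFrame. A minimal sketch (behaviour assumed from pandas versions where iat accepts arbitrary objects):
import pandas as pd

df1 = pd.DataFrame([1, 'a'])  # mixed values give column 0 object dtype
df2 = pd.DataFrame([2, 'b'])
df1.iat[0, 0] = df2           # the cell now holds a full DataFrame
print(type(df1.iat[0, 0]))    # <class 'pandas.core.frame.DataFrame'>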
Convert your df2 to a dict by using to_dict:
df1.loc[0, 0] = [df2.to_dict()]
df1
Out[862]:
                       0
0  [{0: {0: 2, 1: 'b'}}]
1                      a
If you need to convert it back to a dataframe, you can use the dataframe constructor:
pd.DataFrame(df1.loc[0,0][0])
Out[864]:
   0
0  2
1  b

JSON string within CSV data read by pandas [duplicate]

I am working with CSV files where several of the columns have a simple json object (several key value pairs) while other columns are normal. Here is an example:
name,dob,stats
john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}"
dave jones,2/2/1981,"{""eye_color"": ""blue"", ""height"": 170, ""weight"": 85}"
bob roberts,3/3/1982,"{""eye_color"": ""green"", ""height"": 180, ""weight"": 94}"
After using df = pandas.read_csv('file.csv'), what's the most efficient way to parse and split the stats column into additional columns?
After about an hour, the only thing I could come up with was:
import json
stdf = df['stats'].apply(json.loads)
stlst = list(stdf)
stjson = json.dumps(stlst)
df.join(pandas.read_json(stjson))
This seems like I'm doing it wrong, and it's quite a bit of work considering I'll need to do this on three columns regularly.
Desired output is the dataframe object below. I added the following lines of code to get there in my (crappy) way:
df = df.join(pandas.read_json(stjson))
del(df['stats'])
In [14]: df
Out[14]:
          name       dob eye_color  height  weight
0   john smith  1/1/1980     brown     160      76
1   dave jones  2/2/1981      blue     170      85
2  bob roberts  3/3/1982     green     180      94
I think applying json.loads is a good idea, but from there you can simply convert it directly to dataframe columns instead of writing/loading it again:
stdf = df['stats'].apply(json.loads)
pd.DataFrame(stdf.tolist()) # or stdf.apply(pd.Series)
or alternatively in one step:
df.join(df['stats'].apply(json.loads).apply(pd.Series))
There is a slightly easier way, but ultimately you'll have to call json.loads. There is a notion of a converter in pandas.read_csv:
converters : dict, optional
Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
So first define your custom parser. In this case the below should work:
def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1
In your case you'll have something like:
df = pandas.read_csv(f1, converters={'stats': CustomParser}, header=0)
We are telling read_csv to read the data in the standard way, but to use our custom parser for the stats column. This will make each entry in the stats column a dict.
From here, we can use a little hack to directly append these columns in one step with the appropriate column names. This will only work for regular data (every json object needs to have the same three keys, or missing values need to be handled in our CustomParser).
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)
On the Left Hand Side, we get the new column names from the keys of the element of the stats column. Each element in the stats column is a dictionary. So we are doing a bulk assign. On the Right Hand Side, we break up the 'stats' column using apply to make a data frame out of each key/value pair.
Option 1
If you dumped the column with json.dumps before you wrote it to csv, you can read it back in with:
import json
import pandas as pd
df = pd.read_csv('data/file.csv', converters={'json_column_name': json.loads})
Option 2
If you didn't, then you might need to use this:
import json
import pandas as pd
df = pd.read_csv('data/file.csv', converters={'json_column_name': eval})
Note that eval executes arbitrary expressions, so ast.literal_eval is a safer choice for untrusted data.
Option 3
For more complicated situations you can write a custom converter like this:
import json
import pandas as pd
def parse_column(data):
    try:
        return json.loads(data)
    except Exception as e:
        print(e)
        return None
df = pd.read_csv('data/file.csv', converters={'json_column_name': parse_column})
Paul's original answer was very nice but not correct in general, because there is no assurance that the ordering of columns is the same on the left-hand side and the right-hand side of the last line. (In fact, it does not seem to work on the test data in the question, instead erroneously switching the height and weight columns.)
We can fix this by ensuring that the list of dict keys on the LHS is sorted. This works because the apply on the RHS automatically sorts by the index, which in this case is the list of column names.
def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1

df = pandas.read_csv(f1, converters={'stats': CustomParser}, header=0)
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)
The json_normalize function in the pandas.io.json package helps to do this without using a custom function.
(assuming you are loading the data from a file)
import json
from pandas.io.json import json_normalize

df = pd.read_csv(file_path)
stats_df = json_normalize(df['stats'].apply(json.loads).tolist())
stats_df.set_index(df.index, inplace=True)
df = df.join(stats_df)
df.drop(columns=['stats'], inplace=True)
If you have datetime values in your .csv file, df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series) will mess up the datetime values.
This link has some tips on how to read a csv file with json strings into a dataframe.
You could do the following to read a csv file with a json string column and convert the json string into columns.
Read your csv into a dataframe (read_df):
read_df = pd.read_csv('yourFile.csv', converters={'state': json.loads}, header=0, quotechar="'")
Convert the json string column to a new dataframe:
state_df = read_df['state'].apply(pd.Series)
Merge the two dataframes on the index:
df = pd.merge(read_df, state_df, left_index=True, right_index=True)
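On recent pandas versions (1.0+), json_normalize is also exposed as pd.json_normalize, so the whole task collapses to a couple of lines. A sketch against the CSV from the question:
import json
import pandas as pd

df = pd.read_csv('file.csv', converters={'stats': json.loads})
df = df.join(pd.json_normalize(df['stats'].tolist())).drop(columns=['stats'])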

Pandas append data frames, add a field, and then flood the field with a default value?

I have several data frames that contain all of the same column names. I want to append them into a master data frame. I also want to create a column that denotes the original data frame and then flood it with that data frame's name. I have some code that works.
df_combine = df_breakfast.copy()
df_combine['X_ORIG_DF'] = 'Breakfast'
df_combine = df_combine.append(df_lunch, ignore_index=True)
df_combine['X_ORIG_DF'] = df_combine['X_ORIG_DF'].fillna('Lunch')
# Rinse and repeat
However, it seems inelegant. I was hoping someone could point me to a more elegant solution. Thank you in advance for your time!
Note: Edited to reflect comment!
I would definitely consider restructuring your data in a way that the names can be accessed neatly, rather than as variable names (if they must be separate to begin with).
For example a dictionary:
d = {'breakfast': df_breakfast, 'lunch': df_lunch}
Create a function to give each DataFrame a new column:
def add_col(df, col_name, col_entry):
    df = df.copy()  # so as not to change df_lunch etc.
    df[col_name] = col_entry
    return df
and combine the list of DataFrame each with the appended column ('X_ORIG_DF'):
In [3]: df_combine = pd.DataFrame().append(list(add_col(v, 'X_ORIG_DF', k)
                                                for k, v in d.items()))

Out[3]:
   0  1  X_ORIG_DF
0  1  2      lunch
1  3  4      lunch
0  1  2  breakfast
1  3  4  breakfast
In this example: df_lunch = df_breakfast = pd.DataFrame([[1, 2], [3, 4]]).
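As an aside, pd.concat accepts the dict directly and turns its keys into an outer index level, which can replace the helper function entirely. A sketch using the same d as above:
df_combine = pd.concat(d)                     # dict keys become the outer index level
df_combine.index.names = ['X_ORIG_DF', None]  # name that level so it resets cleanly
df_combine = df_combine.reset_index(level='X_ORIG_DF')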
I've encountered a similar problem as you when trying to combine multiple files together for the purpose of analysis in a master dataframe. Here is one method for creating that master dataframe by loading each dataframe independently, giving them each an identifier in a column called 'ID' and combining them. If your data is a list of files in a directory called datadir I would do the following:
import os
import pandas as pd
data_list = os.listdir(datadir)
df_dict = {}
for data_file in data_list:
    df = pd.read_table(os.path.join(datadir, data_file))
    # add an ID column based on the file name.
    # you could use some other naming scheme of course
    df['ID'] = data_file
    df_dict[data_file] = df
#the concat function is great for combining lots of dfs.
#it takes a list of dfs as an argument.
combined_df_with_named_column = pd.concat(df_dict.values())
