Convert dictionaries to dataframe - python

I am trying to convert this dictionary:
data = ({"Jan 2018":1000},{"Feb 2018":1100},{"Mar 2018":1400},{"Apr 2018":700},{"May 2018":800})
data
to dataframe like:
date balance
Jan 2018 1000
Feb 2018 1100
Mar 2018 1400
Apr 2018 700
May 2018 800
I used DataFrame.from_dict to convert it, but it didn't give the format above. How can I do it? Thank you!
pd.DataFrame.from_dict(data_c, orient='columns')

Here is my solution:
import pandas as pd
data = ({"Jan 2018":1000},{"Feb 2018":1100},{"Mar 2018":1400},{"Apr 2018":700},{"May 2018":800})
arr = [list(*d.items()) for d in data]
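df = pd.DataFrame(arr, columns=['date', 'balance'])
You need to build a proper list of [key, value] rows from the tuple of dictionaries before passing it to DataFrame. Note that list(*d.items()) only works when each dictionary holds exactly one key-value pair.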

Try this
df = pd.DataFrame.from_dict({k: v for d in data for k, v in d.items()},
                            orient='index',
                            columns=['balance']).rename_axis('date').reset_index()
Out[477]:
date balance
0 Jan 2018 1000
1 Feb 2018 1100
2 Mar 2018 1400
3 Apr 2018 700
4 May 2018 800

From the documentation of from_dict
orient : {‘columns’, ‘index’}, default ‘columns’
The “orientation” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
Since you want your keys to indicate rows, changing orient to 'index' will give the result you want. However, you first need to put your data into a single dictionary. This code will give you the result you want.
import pandas as pd
data = ({"Jan 2018":1000},{"Feb 2018":1100},{"Mar 2018":1400},{"Apr 2018":700},{"May 2018":800})
# Merge the single-entry dictionaries into one dictionary.
d = {}
for i in data:
    for k in i.keys():
        d[k] = i[k]
# The keys become the row index; columns= names the value column.
df = pd.DataFrame.from_dict(d, orient='index', columns=['balance'])

What you have there is a tuple of single-element dictionaries. This is unidiomatic, and poor design. If all the dictionaries correspond to the same columns, then a list of tuples would do just fine.
Solutions
I believe the currently accepted answer relies on there being only one key:value pair in each dictionary. That’s unfortunate, since it automatically excludes most situations where this design makes any sense.
If, hypothetically, the "tuple of 1-element dicts" couldn't be changed, here is how I would suggest doing things:
import pandas as pd
import itertools as itt
raw_data = ({"Jan 2018": 1000}, {"Feb 2018": 1100}, {"Mar 2018": 1400}, {"Apr 2018": 700}, {"May 2018": 800})
data = itt.chain.from_iterable(curr.items() for curr in raw_data)
df = pd.DataFrame(data, columns=['date', 'balance'])
Here is the sensible alternative to all this.
import pandas as pd
data = [("Jan 2018", 1000), ("Feb 2018", 1100), ("Mar 2018", 1400), ("Apr 2018", 700), ("May 2018", 800)]
df = pd.DataFrame(data, columns=['date', 'balance'])
df:
date balance
0 Jan 2018 1000
1 Feb 2018 1100
2 Mar 2018 1400
3 Apr 2018 700
4 May 2018 800
It would probably be even better if those dates were actual date types, not strings. I will change that later.
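As a rough sketch of that last point, and assuming the strings always follow the abbreviated "Mon YYYY" pattern shown above (with English month names), pd.to_datetime with an explicit format can turn the column into real timestamps. Each date lands on the first day of its month, which is an artifact of the conversion rather than something present in the data.

```python
import pandas as pd

data = [("Jan 2018", 1000), ("Feb 2018", 1100), ("Mar 2018", 1400),
        ("Apr 2018", 700), ("May 2018", 800)]
df = pd.DataFrame(data, columns=['date', 'balance'])

# '%b %Y' = abbreviated month name followed by a four-digit year.
df['date'] = pd.to_datetime(df['date'], format='%b %Y')
```

With a datetime column, sorting and resampling by month work as expected, which is not the case for the raw strings.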

Related

How to drop duplicates in a data frame and keep first with two exceptions?

I have a data frame that looks like this.
import pandas as pd
# Initialise the data as a dict of lists.
data = {'ID':[101762, 101762, 102842, 102842, 106755, 106755, 108615, 108615, 113402, 113402, 114711, 114711],
'Year':[2019, 2020, 2019, 2020, 2019, 2020, 2019, 2020, 2019, 2020, 2019, 2020],
'Amount':[2091.06, 3330.00, 846.19, 846.19, 16185.60, 800, 281496.00, 1363730.00, 19815.00, 9585.00, 64332.70, 5400.00]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df
Here is an image with some logic of what I am trying to do.
Need to drop anything where Amount = 0, or Year = 2012
df = df[df['Amount'] != 0]
df = df[df['Year'] != '2021']
Ok, so far. Now, I need to keep the max Amount if an ID & Year shows up more than once.
Here is the code that I am running.
df = df.sort_values(['ID','Year']).drop_duplicates(['ID','Year'], keep='first')
At this point, things are still fine, but I'm stuck on the next steps. How can I do the following?
If ID is repeated and 2020 Amount > 2019 Amount, sum these two Amounts together
If ID is repeated and 2020 Amount = 2019 Amount, keep only 2020
If ID is repeated and 2019 Amount > 2020 Amount, keep only 2019
How can I achieve these three objectives?
A little bit of logic with sort_values:
out = (
    df.sort_values('Year', ascending=False)
      .drop_duplicates(['ID', 'Amount'])
      .sort_values('Amount')
      .groupby('ID')
      .agg({'Year': 'last', 'Amount': 'sum'})
      .reset_index()
)
ID Year Amount
0 101762 2020 5421.06
1 102842 2020 846.19
2 106755 2019 16985.60
3 108615 2020 1645226.00
4 113402 2019 29400.00
5 114711 2019 69732.70
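Read literally, the third rule in the question asks to keep only the 2019 amount when it is the larger one, whereas a plain sum adds both years (16985.60 for ID 106755 in the output above). If the literal reading is what is wanted, one possible sketch is to pivot to one row per ID and apply the three rules with np.select. This assumes every ID has exactly one row for 2019 and one for 2020, and that the frame has already been de-duplicated as described in the question.

```python
import numpy as np
import pandas as pd

data = {
    'ID': [101762, 101762, 102842, 102842, 106755, 106755,
           108615, 108615, 113402, 113402, 114711, 114711],
    'Year': [2019, 2020] * 6,
    'Amount': [2091.06, 3330.00, 846.19, 846.19, 16185.60, 800,
               281496.00, 1363730.00, 19815.00, 9585.00, 64332.70, 5400.00],
}
df = pd.DataFrame(data)

# One row per ID, one column per year (requires unique ID/Year pairs).
wide = df.pivot(index='ID', columns='Year', values='Amount')
a19 = wide[2019].to_numpy()
a20 = wide[2020].to_numpy()

out = pd.DataFrame({
    # 2019 is kept only when its amount is strictly larger.
    'Year': np.where(a19 > a20, 2019, 2020),
    # Rule 1: 2020 > 2019 -> sum; rule 2: equal -> 2020; otherwise -> 2019.
    'Amount': np.select([a20 > a19, a20 == a19], [a19 + a20, a20], default=a19),
}, index=wide.index).reset_index()
```

IDs that are missing one of the two years would end up with NaN here and would need separate handling.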

Create a dataframe from multiple list of dictionary values

I have a code as below,
safety_df = {}
for key3, safety in analy_df.items():
    safety = pd.DataFrame({"Year": safety['index'],
                           '{}'.format(key3) + "_CR": safety['CURRENT'],
                           '{}'.format(key3) + "_ICR": safety['ICR'],
                           '{}'.format(key3) + "_D/E": safety['D/E'],
                           '{}'.format(key3) + "_D/A": safety['D/A']})
    safety_df[key3] = safety
In this code I'm extracting values from another dictionary. It loops through the various companies, which is why I build the column names from the key using format. The output contains the above 5 columns for each company (Year, CR, ICR, D/E, D/A).
The output that gets printed contains plenty of NA values.
I want Year as a common column for all companies, followed by the columns C1_CR, C2_CR, C3_CR, C1_ICR, C2_ICR, C3_ICR, ... C3_D/A.
I tried to extract using following code,
pd.concat(safety_df.values())
Sample output of this..
It extracts the values for each company, but NA values are printed. Is that because of the for loop?
I also tried groupby, but it did not work.
How can I set Year as a common column and print the other values side by side?
Thanks
Use axis=1 to concatenate along the columns:
import numpy as np
import pandas as pd
years = np.arange(2010, 2021)
n = len(years)
c1 = np.random.rand(n)
c2 = np.random.rand(n)
c3 = np.random.rand(n)
frames = {
    'a': pd.DataFrame({'year': years, 'c1': c1}),
    'b': pd.DataFrame({'year': years, 'c2': c2}),
    'c': pd.DataFrame({'year': years[1:], 'c3': c3[1:]}),
}
# Use the year as the index so that rows are aligned by year.
for key in frames:
    frames[key].set_index('year', inplace=True)
df = pd.concat(frames.values(), axis=1)
print(df)
which results in
c1 c2 c3
year
2010 0.956494 0.667499 NaN
2011 0.945344 0.578535 0.780039
2012 0.262117 0.080678 0.084415
2013 0.458592 0.390832 0.310181
2014 0.094028 0.843971 0.886331
2015 0.774905 0.192438 0.883722
2016 0.254918 0.095353 0.774190
2017 0.724667 0.397913 0.650906
2018 0.277498 0.531180 0.091791
2019 0.238076 0.917023 0.387511
2020 0.677015 0.159720 0.063264
Note that I have explicitly set the index to be the 'year' column, and in my example, I have removed the first year from the 'c' column. This is to show how the indices of the different dataframes are matched when concatenating. Had the index been left to its standard value, you would have gotten the years out of sync and a NaN value at the bottom of column 'c' instead.
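Applied to the layout in the question, a possible sketch looks like the following. The analy_df contents here are made-up stand-ins (two companies, two of the four metrics), and the metrics mapping is an assumption used only to name the output columns. The idea is to concatenate on a Year index and then order the columns metric by metric, so that C1_CR and C2_CR come before C1_ICR and C2_ICR.

```python
import pandas as pd

# Hypothetical stand-in for the question's analy_df.
analy_df = {
    'C1': pd.DataFrame({'index': [2018, 2019], 'CURRENT': [1.2, 1.3], 'ICR': [3.0, 3.5]}),
    'C2': pd.DataFrame({'index': [2018, 2019], 'CURRENT': [0.9, 1.1], 'ICR': [2.0, 2.2]}),
}
# Source column -> suffix used in the output column name (assumed naming).
metrics = {'CURRENT': 'CR', 'ICR': 'ICR'}

frames = {
    company: (frame.rename(columns={'index': 'Year'})
                   .set_index('Year')[list(metrics)]
                   .rename(columns=metrics))
    for company, frame in analy_df.items()
}

# Passing a dict makes the company names the outer column level.
wide = pd.concat(frames, axis=1)

# Reorder so that all companies are listed for one metric before the next.
ordered = [(company, metric) for metric in metrics.values() for company in frames]
wide = wide[ordered]
wide.columns = [f'{company}_{metric}' for company, metric in wide.columns]
wide = wide.reset_index()
```

Because the frames share the Year index, a company that lacks a given year gets NaN only in its own columns instead of producing extra rows.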

Merging converted dataframes from multiple series

I receive some data in 11 different pandas series. I need to combine the whole data into one pandas dataframe to carry out further analysis and reporting.
The format in which the data is received is as under:
Series1:
Sales
Item Series Year
A Sal 2018 100
2019 200
B Sal 2018 300
2019 400
Series2:
Purchases
Item Series Year
A Pur 2018 50
2019 100
B Pur 2018 150
2019 200
Series3:
Expenses
Product Series Year
A Exp 2019 100
B Exp 2019 200
I have a number of such series, so I created a loop in which the following code merges the series two at a time until all of them are merged. I have tried to consolidate all the series into one dataframe using this code:
df = pd.merge(df,series1,left_on=['Product','Year'],right_on=['Product','Year']).reset_index()
Even if we write a separate line for each pair in this example, it becomes:
df = pd.merge(series1,series2,left_on=['Product','Year'],right_on=['Product','Year']).reset_index()
df = pd.merge(df,series3,left_on=['Product','Year'],right_on=['Product','Year']).reset_index()
However, the issues with this are:
It only allows merging two series at a time.
When I merge the third series in this example, which has no data for 2018, it does not put NULL there. Instead it removes the 2018 rows, including the series 1 and series 2 data already in the dataframe. So I am only left with merged data from all three series for 2019.
I considered converting each series to a list, converting those lists to a dictionary, and then converting that into a dataframe. That works, but it takes a lot of effort and requires a code change whenever the number of series changes. So this doesn't work for me.
Any other way to do this?
Did you try using the to_frame method?
For example, you could use
s = pd.Series(["a", "b", "c"])
df = s.to_frame()
to convert a series to a dataframe.
Try using this method on each of your series.
Here it is in the docs:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_frame.html
Try pd.concat():
import pandas as pd
s1 = pd.Series([100, 200, 300, 400], index = pd.MultiIndex.from_arrays([['A','A','B','B'],['1','1','2','2'], [2018, 2019, 2018, 2019]]))
s2 = pd.Series([50, 100, 150, 200], index = pd.MultiIndex.from_arrays([['A','A','B','B'],['3','3','4','4'], [2018, 2019, 2018, 2019]]))
s3 = pd.Series([100, 200], index = pd.MultiIndex.from_arrays([['A','B'],['5','6'], [2019, 2019]]))
df = pd.concat([s.droplevel(1) for s in [s1, s2, s3]], axis = 1)
0 1 2
A 2018 100 50 NaN
2019 200 100 100.0
B 2018 300 150 NaN
2019 400 200 200.0
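If the merge-based loop from the question is preferred, the 2018 rows most likely disappear because pd.merge defaults to an inner join. A minimal sketch with how='outer' and functools.reduce is shown below. The series are simplified stand-ins for the question's data: the Series level is dropped, and the index level names Product and Year are assumed to be the same in every series.

```python
from functools import reduce

import pandas as pd

full = pd.MultiIndex.from_product([['A', 'B'], [2018, 2019]], names=['Product', 'Year'])
only_2019 = pd.MultiIndex.from_product([['A', 'B'], [2019]], names=['Product', 'Year'])

sales = pd.Series([100, 200, 300, 400], index=full, name='Sales')
purchases = pd.Series([50, 100, 150, 200], index=full, name='Purchases')
expenses = pd.Series([100, 200], index=only_2019, name='Expenses')

# reset_index turns each named series into a frame with Product, Year and a value column.
frames = [s.reset_index() for s in (sales, purchases, expenses)]

# An outer join keeps keys that are missing on one side and fills them with NaN.
merged = reduce(
    lambda left, right: pd.merge(left, right, on=['Product', 'Year'], how='outer'),
    frames,
)
```

The list of frames can hold any number of series, so nothing needs to change when an eleventh or twelfth series shows up.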

Pandas dataframe.set_index() deletes previous index and column

I just came across a strange phenomenon with Pandas DataFrames: when setting the index using DataFrame.set_index('some_index'), the old column that was also an index is deleted! Here is an example:
import pandas as pd
df = pd.DataFrame({'month': [1, 4, 7, 10],'year': [2012, 2014, 2013, 2014],'sale':[55, 40, 84, 31]})
df_mn=df.set_index('month')
>>> df_mn
sale year
month
1 55 2012
4 40 2014
7 84 2013
10 31 2014
Now I change the index to year:
df_mn.set_index('year')
sale
year
2012 55
2014 40
2013 84
2014 31
.. and the month column was removed along with the index. This is very irritating because I just wanted to swap the DataFrame index.
Is there a way to keep the column that was previously the index from being deleted? Maybe through something like: DataFrame.set_index('new_index',delete_previous_index=False)
Thanks for any advice
You can do the following
>>> df_mn.reset_index().set_index('year')
month sale
year
2012 1 55
2014 4 40
2013 7 84
2014 10 31
The solution I found to retain the previous column is to set drop=False:
dataframe.set_index('some_column',drop=False). This is not the perfect answer, but it works!
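One caveat worth spelling out: drop=False keeps the column that is being promoted to the index. It does not rescue an index that is being replaced. So for the month column to survive the second set_index call, drop=False would have to be used on the first call already. A small sketch using the data from the question:

```python
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})

# month becomes the index and also stays available as a regular column.
df_mn = df.set_index('month', drop=False)

# Switching the index to year now leaves the month column in place.
df_yr = df_mn.set_index('year', drop=False)
```

After this, df_yr is indexed by year while month, year and sale are all still present as columns.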
No, in such cases you have to save your previous index as a column, as shown below:
import pandas as pd
df = pd.DataFrame({'month': [1, 4, 7, 10],'year': [2012, 2014, 2013, 2014],'sale':[55, 40, 84, 31]})
df_mn = df.set_index('month')
df_mn['month'] = df_mn.index  # Save the index as another column, then call set_index with the year column.
df_mn.set_index('year')
Besides, df_mn is a separate dataframe, so the original dataframe df remains unchanged and you can use it again.
Also, unless you set the inplace argument of set_index to True, df_mn won't have changed even after you call set_index() on it.
And, as the other answer shows, you can always use reset_index().

grouping and reordering by partial identifiers in python

I have data from a csv that produces a dataframe that looks like the following:
d = {"clf_2007": [20],
"e_2007": [25],
"ue_2007": [17],
"clf_2008": [300],
"e_2008": [20],
"ue_2008": [10]}
df = pd.DataFrame(d)
which produces a data frame (forgive me for not knowing how to properly code that into stackoverflow)
clf_2007 clf_2008 e_2007 e_2008 ue_2007 ue_2008
0 20 300 25 20 17 10
I want to manipulate that data to produce something that looks like this:
clf e ue
2007 20 25 17
2008 300 20 10
2007 and 2008 in the original column names represent dates, but they don't need to be datetime now. I need to merge them with another dataframe that has the same "dates" eventually, but I can figure that out later.
Thus far, I've tried groupbys and I've tried them by string indexes (like str[ :8]) and such, and, outside of it not working, I don't even think groupby is the right tool. I've also tried pd.PeriodIndex, but, again, that doesn't seem like the right tool to me.
Is there a standardized way to do something like this? Or is the brute force way (get it into an excel spreadsheet and just move the data around manually), the only way to get what I'm looking for here?
I think this will be a lot easier if you pre-process your data to have three columns: key, year and value. Something like:
rows = []
for k, v in d.items():  # use d.iteritems() on Python 2
    key, year = k.split("_")
    for val in v:
        rows.append({'key': key, 'year': year, 'value': val})
Put those rows into a dataframe, call it dfA. I'm assuming you might have more than one value for each (key, year) pair and you want to aggregate them somehow. I'll assume you do that and end up with a dataframe called df, whose columns are still key, year, and value. At that point, you just need to pivot:
pd.pivot_table(df,index=['year'], columns=['key'])
You end up with multi-indexed rows/columns that you'll want to clean up, but I'll leave that to you.
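For reference, here is a sketch of that approach from start to finish on the question's data. Passing values='value' avoids the extra column level, and aggfunc='sum' is only an assumption about how repeated (key, year) pairs should be combined. Converting the year to int is likewise a choice made here, not something the question requires.

```python
import pandas as pd

d = {"clf_2007": [20], "e_2007": [25], "ue_2007": [17],
     "clf_2008": [300], "e_2008": [20], "ue_2008": [10]}

# Long format: one row per (key, year, value).
rows = []
for name, values in d.items():
    key, year = name.split("_")
    for val in values:
        rows.append({'key': key, 'year': int(year), 'value': val})
long_df = pd.DataFrame(rows)

# Wide format: one row per year, one column per key.
wide = long_df.pivot_table(index='year', columns='key', values='value', aggfunc='sum')
wide.columns.name = None  # drop the leftover 'key' label above the columns
```

The year stays as the index, which is convenient for the later merge with another dataframe keyed on the same years.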
You can generate a column multiindex:
df.columns = pd.MultiIndex.from_tuples([col.split("_") for col in df])
print(df.columns)
# clf e ue
# 2007 2008 2007 2008 2007 2008
And then stack the table:
df = df.stack()
print(df)
# clf e ue
#0 2007 20 25 17
# 2008 300 20 10
You can optionally flatten the index, too:
df.index = df.index.get_level_values(1)
print(df)
# clf e ue
#2007 20 25 17
#2008 300 20 10
