Merging converted dataframes from multiple series - python

I receive some data in 11 different pandas series. I need to combine the whole data into one pandas dataframe to carry out further analysis and reporting.
The format in which the data is received is as under:
Series1:
Sales
Item Series Year
A Sal 2018 100
2019 200
B Sal 2018 300
2019 400
Series2:
Purchases
Item Series Year
A Pur 2018 50
2019 100
B Pur 2018 150
2019 200
Series3:
Expenses
Product Series Year
A Exp 2019 100
B Exp 2019 200
I have a number of series parameter. So, I created a loop where the following code merges two of the total series till the all series are merged. I have tried to consolidate all such series into one dataframe using this code:
df = pd.merge(df,series1,left_on=['Product','Year'],right_on=['Product','Year']).reset_index()
But even if we write separate lines for each two pairs for our example here, it will be:
df = pd.merge(series1,series2,left_on=['Product','Year'],right_on=['Product','Year']).reset_index()
df = pd.merge(df,series3,left_on=['Product','Year'],right_on=['Product','Year']).reset_index()
However the issue with this is:
It only allows to merge two series at a time.
When I merge the third series in this example, as it doesn't have data for 2018, instead of putting NULL there, it remove the 2018 rows for even the series 1 and series 2 data in the dataframe. So, I am only left with merged data from all three series for 2019.
I considered converting all the series to list individually and then converting those lists to a dictionary, which then is converted into a dataframe. That works, but requires a lot of effort and requires code change if number of series changes. So, this doesn't work for me.
Any other way to do this?

Did you try using the to_frame method?
For example, you could use
df = pd.Series["a", "b", "c"]
df.to_frame()
to convert.
Try using this method in your data frame.
Here's it in the docs.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_frame.html

Try pd.concat():
import pandas as pd
import pandas as pd
s1 = pd.Series([100, 200, 300, 400], index = pd.MultiIndex.from_arrays([['A','A','B','B'],['1','1','2','2'], [2018, 2019, 2018, 2019]]))
s2 = pd.Series([50, 100, 150, 200], index = pd.MultiIndex.from_arrays([['A','A','B','B'],['3','3','4','4'], [2018, 2019, 2018, 2019]]))
s3 = pd.Series([100, 200], index = pd.MultiIndex.from_arrays([['A','B'],['5','6'], [2019, 2019]]))
df = pd.concat([s.droplevel(1) for s in [s1, s2, s3]], axis = 1)
0 1 2
A 2018 100 50 NaN
2019 200 100 100.0
B 2018 300 150 NaN
2019 400 200 200.0

Related

Create a dataframe from multiple list of dictionary values

I have a code as below,
safety_df ={}
for key3,safety in analy_df.items():
safety = pd.DataFrame({"Year":safety['index'],
'{}'.format(key3)+"_CR":safety['CURRENT'],
'{}'.format(key3)+"_ICR":safety['ICR'],
'{}'.format(key3)+"_D/E":safety['D/E'],
'{}'.format(key3)+"_D/A":safety['D/A']})
safety_df[key3] = safety
Here in this code I'm extracting values from another dictionary. It will looping through the various companies that why I named using format in the key. The output contains above 5 columns for each company(Year,CR, ICR,D/E,D/A).
Output which is printing out is with plenty of NA values where after
Here I want common column which is year for all companies and print following columns which is C1_CR, C2_CR, C3_CR, C1_ICR, C2_ICR, C3_ICR,...C3_D/A ..
I tried to extract using following code,
pd.concat(safety_df.values())
Sample output of this..
Here it extracts values for each list, but NA values are getting printed out because of for loops?
I also tried with groupby and it was not worked?..
How to set Year as common column, and print other values side by side.
Thanks
Use axis=1 to concate along the columns:
import numpy as np
import pandas as pd
years = np.arange(2010, 2021)
n = len(years)
c1 = np.random.rand(n)
c2 = np.random.rand(n)
c3 = np.random.rand(n)
frames = {
'a': pd.DataFrame({'year': years, 'c1': c1}),
'b': pd.DataFrame({'year': years, 'c2': c2}),
'c': pd.DataFrame({'year': years[1:], 'c3': c3[1:]}),
}
for key in frames:
frames[key].set_index('year', inplace=True)
df = pd.concat(frames.values(), axis=1)
print(df)
which results in
c1 c2 c3
year
2010 0.956494 0.667499 NaN
2011 0.945344 0.578535 0.780039
2012 0.262117 0.080678 0.084415
2013 0.458592 0.390832 0.310181
2014 0.094028 0.843971 0.886331
2015 0.774905 0.192438 0.883722
2016 0.254918 0.095353 0.774190
2017 0.724667 0.397913 0.650906
2018 0.277498 0.531180 0.091791
2019 0.238076 0.917023 0.387511
2020 0.677015 0.159720 0.063264
Note that I have explicitly set the index to be the 'year' column, and in my example, I have removed the first year from the 'c' column. This is to show how the indices of the different dataframes are matched when concatenating. Had the index been left to its standard value, you would have gotten the years out of sync and a NaN value at the bottom of column 'c' instead.

Convert dictionaries to dataframe

I am trying to convert this dictionary:
data = ({"Jan 2018":1000},{"Feb 2018":1100},{"Mar 2018":1400},{"Apr 2018":700},{"May 2018":800})
data
to dataframe like:
date balance
Jan 2018 1000
Feb 2018 1100
Mar 2018 1400
Apr 2018 700
May 2018 800
I used the dataframe to convert, but it didn't give the format as above, how can i do it? Thank you!
pd.DataFrame.from_dict(data_c, orient='columns')
Here is my solution:
import pandas as pd
data = ({"Jan 2018":1000},{"Feb 2018":1100},{"Mar 2018":1400},{"Apr 2018":700},{"May 2018":800})
arr = [list(*d.items()) for d in data]
df = pd.DataFrame(arr, columns=['data', 'balance'])
you need get proper array from the tuple of dictionary before pass it to DataFrame.
Try this
df = pd.DataFrame.from_dict({k: v for d in data for k, v in d.items()},
orient='index',
columns=['balance']).rename_axis('date').reset_index()
Out[477]:
date balance
0 Jan 2018 1000
1 Feb 2018 1100
2 Mar 2018 1400
3 Apr 2018 700
4 May 2018 800
From the documentation of from_dict
orient : {‘columns’, ‘index’}, default ‘columns’
The “orientation” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
Since you want your keys to indicate rows, changing the orient to index will give the result your want. However first you need to put your data in a single dictionary. This code will give you the result you want.
data = ({"Jan 2018":1000},{"Feb 2018":1100},{"Mar 2018":1400},{"Apr 2018":700},{"May 2018":800})
d = {}
for i in data:
for k in i.keys():
d[k] = i[k]
df = pd.DataFrame.from_dict(d, orient='index')
What you have there is a tuple of single-element dictionaries. This is unidiomatic, and poor design. If all the dictionaries correspond to the same columns, then a list of tuples would do just fine.
Solutions
I believe the currently accepted answer relies on there being only one key:value pair in each dictionary. That’s unfortunate, since it automatically excludes most situations where this design makes any sense.
If, hypothetically, the "tuple of 1-element dicts" couldn't be changed, here is how I would suggest doing things:
import pandas as pd
import itertools as itt
raw_data = ({"Jan 2018": 1000}, {"Feb 2018": 1100}, {"Mar 2018": 1400}, {"Apr 2018": 700}, {"May 2018": 800})
data = itt.chain.from_iterable(curr.items() for curr in raw_data)
df = pd.DataFrame(data, columns=['date', 'balance'])
Here is the sensible alternative to all this.
import pandas as pd
data = [("Jan 2018", 1000), ("Feb 2018", 1100), ("Mar 2018", 1400), ("Apr 2018", 700), ("May 2018", 800)]
df = pd.DataFrame(data, columns=['date', 'balance'])
df:
date balance
0 Jan 2018 1000
1 Feb 2018 1100
2 Mar 2018 1400
3 Apr 2018 700
4 May 2018 800
It would probably be even better if those dates were actual date types, not strings. I will change that later.

Data frame formation

I need to create a data frame for 100 customer_id along with their expenses for each day starting from 1st June 2019 to 31st August 2019. I have customer id already in a list and dates as well in a list. How to make a data frame in the format shown.
CustomerID TrxnDate
1 1-Jun-19
1 2-Jun-19
1 3-Jun-19
1 Upto....
1 31-Aug-19
2 1-Jun-19
2 2-Jun-19
2 3-Jun-19
2 Upto....
2 31-Aug-19
and so on for other 100 customer id
I already have customer_id data frame using pandas function now i need to map each customer_id with the date ie assume we have customer id as 1 then 1 should have all dates from 1st June 2019 to 31 aug 2019 and then customerId 2 should have the same dates. Please see the data frame required.
# import module
import pandas as pd
# list of dates
lst = ['1-Jun-19', '2-Jun-19', ' 3-Jun-19']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
Repeat the operations for Customer ID and store in df2 or something and then
frames = [df, df2]
result = pd.concat(frames)
There are simpler methods , but this will give you a idea how it is carried out.
I see you want specific dataframes, so first creat the dataframes according to customer ID 1. then repeat same for Customer ID 2, and then concat those dataframes.

grouping and reordering by partial identifiers in python

I have data from a csv that produces a dataframe that looks like the following:
d = {"clf_2007": [20],
"e_2007": [25],
"ue_2007": [17],
"clf_2008": [300],
"e_2008": [20],
"ue_2008": [10]}
df = pd.DataFrame(d)
which produces a data frame (forgive me for not knowing how to properly code that into stackoverflow)
clf_2007 clf_2008 e_2007 e_2008 ue_2007 ue_2008
0 20 300 25 20 17 10
I want to manipulate that data to produce something that looks like this:
clf e ue
2007 20 25 17
2008 300 20 10
2007 and 2008 in the original column names represent dates, but they don't need to be datetime now. I need to merge them with another dataframe that has the same "dates" eventually, but I can figure that out later.
Thus far, I've tried groupbys and I've tried them by string indexes (like str[ :8]) and such, and, outside of it not working, I don't even think groupby is the right tool. I've also tried pd.PeriodIndex, but, again, that doesn't seem like the right tool to me.
Is there a standardized way to do something like this? Or is the brute force way (get it into an excel spreadsheet and just move the data around manually), the only way to get what I'm looking for here?
I think this will be a lot easier if you pre-process your data to have three columns: key, year and value. Something like:
rows = []
for k, v in d.iteritems():
key, year = k.split("_")
for val in v:
rows.append({'key': key, 'year': year, 'value': val})
Put those rows into a dataframe, call it dfA. I'm assuming you might have more than one value for each (key, year) pair and you want to aggregate them somehow. I'll assume you do that and end up with a dataframe called df, whose columns are still key, year, and value. At that point, you just need to pivot:
pd.pivot_table(df,index=['year'], columns=['key'])
You end up with multi-indexed rows/columns that you'll want to clean up, but I'll leave that to you.
You can generate a column multiindex:
df.columns = pd.MultiIndex.from_tuples([col.split("_") for col in df])
print(df.columns)
# clf e ue
# 2007 2008 2007 2008 2007 2008
And then stack the table:
df = df.stack()
print(df)
# clf e ue
#0 2007 20 25 17
# 2008 300 20 10
You can optionally flatten the index, too:
df.index = df.index.get_level_values(1)
print(df)
# clf e ue
#2007 20 25 17
#2008 300 20 10

Reindexing only valid with uniquely valued Index objects: Pandas DataFrame Panel

I am trying to average each cell of a bunch of .csv files to export as a single averaged .csv file using Pandas.
I have no problems, creating the dataframe itself, but when I try to turn it into a Panel (i.e. panel=pd.Panel(dataFrame)), I get the error: InvalidIndexError: Reindexing only valid with uniquely valued Index objects pandas pd.panel
An example of what each csv file looks like:
Year, Month, Day, Latitude, Longitude, Value1, Value 2
2010, 06, 01, 23, 97, 1, 3.5
2010, 06, 01, 24, 97, 5, 8.2
2010, 06, 01, 25, 97, 6, 4.6
2010, 06, 01, 26, 97, 4, 2.0
Each .csv file is from gridded data so they have the same number of rows and columns, as well as some no data values (given a value of -999.9), which my code snippet below addresses.
The code that I have so far to do this is:
june=[]
for csv1 in glob.glob(path+'\\'+'*.csv'):
if csv1[-10:-8] == '06':
june.append(csv1)
dfs={i: pd.DataFrame.from_csv(i) for i in june}
panel=pd.Panel(dfs)
panels=panel.replace(-999.9,np.NaN)
dfs_mean=panels.mean(axis=0)
I have seen questions where the user is getting the same error, but the solutions for those questions doesn't seem to work with my issue. Any help fixing this, or ideas for a better approach would be greatly appreciated.
pd.Panel has been deprecated
Use pd.concat with a dictionary comprehension and take the mean over level 1.
df1 = pd.concat({f: pd.read_csv(f) for f in glob('meansample[0-9].csv')})
df1.mean(level=1)
Year Month Day Latitude Longitude Value1 Value 2
0 2010 6 1 23 97 1 3.5
1 2010 6 1 24 97 5 8.2
2 2010 6 1 25 97 6 4.6
3 2010 6 1 26 97 4 2.0
I have a suggestion to change the approach a bit. Instead of converting each DF into panel, just concat them into one big DF but for each one give a unique ID. After you can just do groupby by the ID and use mean() to get the result.
It would look similar to this:
import Pandas as pd
df = pd.DataFrame()
for csv1 in glob.glob(path+'\\'+'*.csv'):
if csv1[-10:-8] == '06':
temp_df = pd.read_csv(csv1)
temp_df['df_id'] = csv1
df = pd.concat([df, temp_df])
df.replace(-999.9, np.nan)
df = df.groupby("df_id").mean()
I hope it helps somehow, if you still have any issues with that let me know.

Categories