Problem creating a Pandas DataFrame from a dictionary - python

I have created a DataFrame from a dictionary like this:
import pandas as pd

data = {'Name': 'Ford Motor', 'AssetType': 'Common Stock', 'Exchange': 'NYSE'}
records = []
for key, value in data.items():
    statement = {}
    statement[key] = value
    records.append(statement)
df = pd.DataFrame(records)
If I do it this way, the output looks like this:
Name AssetType Exchange
0 Ford Motor NaN NaN
1 NaN Common Stock NaN
2 NaN NaN NYSE
I want the values on the first row, so the result looks like this:
Name AssetType Exchange
0 Ford Motor Common Stock NYSE

Just put data inside a list [] when creating the DataFrame:
import pandas as pd
data = {'Name': 'Ford Motor', 'AssetType': 'Common Stock', 'Exchange': 'NYSE'}
df = pd.DataFrame([data])
print(df)
Prints:
Name AssetType Exchange
0 Ford Motor Common Stock NYSE

There are a lot of ways you might want to turn data (dict, list, nested list, etc.) into a DataFrame, and pandas includes many creation methods, some of which overlap, making it hard to remember how to create DataFrames from data. Here are a few ways you could do this for your data:
df = pd.DataFrame([data])
df = pd.Series(data).to_frame().T
pd.DataFrame.from_dict(data, orient="index").T
pd.DataFrame.from_records(data, index=[0])
imo, from_dict is the least intuitive (I never get the arguments right on the first try). I find focusing on one construction method more memorable than using a different one each time; I use pd.DataFrame(...) and from_records(...) the most.
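For what it's worth, here is a quick self-contained sketch (not a definitive test of every pandas version, but the idea holds) confirming the first three variants give the same one-row frame:

import pandas as pd

data = {'Name': 'Ford Motor', 'AssetType': 'Common Stock', 'Exchange': 'NYSE'}
a = pd.DataFrame([data])
b = pd.Series(data).to_frame().T
c = pd.DataFrame.from_dict(data, orient='index').T
# DataFrame.equals compares shape and values, so all three match
print(a.equals(b), a.equals(c))  # True True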

Related

group dataframe based on columns

I am new to data science, so your help is appreciated. My question is about grouping a DataFrame based on columns, so that a bar chart can be plotted showing the pass/fail status for each subject.
My CSV file looks something like this:
Name,Maths,Science,English,sports
S1,Pass,Fail,Pass,Pass
S2,Pass,Pass,NA,Pass
S3,Pass,Fail,Pass,Pass
S4,Pass,Pass,Pass,NA
S5,Pass,Fail,Pass,NA
Expected output:
Subject,Status,Count
Maths,Pass,5
Science,Pass,2
Science,Fail,3
English,Pass,4
English,NA,1
Sports,Pass,3
Sports,NA,2
You can do this with pandas, not exactly in the same output format as in the question, but definitely with the same information:
import pandas as pd
# reading csv
df = pd.read_csv("input.csv")
# turning columns into rows
melt_df = pd.melt(df, id_vars=['Name'], value_vars=['Maths', 'Science', "English", "sports"], var_name="Subject", value_name="Status")
# filling NaN values, otherwise the below groupby will ignore them.
melt_df = melt_df.fillna("Unknown")
# counting per group of subject and status.
result_df = melt_df.groupby(["Subject", "Status"]).size().reset_index(name="Count")
Then you get the following result:
Subject Status Count
0 English Pass 4
1 English Unknown 1
2 Maths Pass 5
3 Science Fail 3
4 Science Pass 2
5 sports Pass 3
6 sports Unknown 2
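And since the goal is a bar chart per subject and status, a minimal plotting sketch on top of result_df (assuming matplotlib is installed):

import matplotlib.pyplot as plt

# one group of bars per subject, one bar per status
result_df.pivot(index='Subject', columns='Status', values='Count').plot.bar()
plt.show()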
PS: Going forward, always paste the code you have tried so far.
To match your output exactly, this is what you could do:
import pandas as pd

df = pd.read_csv('c:/temp/data.csv')  # or wherever your csv file is
subjects = ['Maths', 'Science', 'English', 'sports']  # or take df.columns and drop 'Name'
grouped_rows = []
for eachsub in subjects:
    rows = df.groupby(eachsub)['Name'].count()
    idx = list(rows.index)
    if 'Pass' in idx:
        grouped_rows.append([eachsub, 'Pass', rows['Pass']])
    if 'Fail' in idx:
        grouped_rows.append([eachsub, 'Fail', rows['Fail']])
new_df = pd.DataFrame(grouped_rows, columns=['Subject', 'Grade', 'Count'])
print(new_df)
I must suggest though that I would avoid the for loop. My approach would be just these two lines:
subjects = ['Maths', 'Science', 'English', 'sports']
grouped_rows = {sub: df.groupby(sub)['Name'].count() for sub in subjects}
Depending on your application, you may already have all the data you need in grouped_rows (a dict mapping each subject to its per-status counts); see the sketch below for turning it into the expected table.
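If you then want the exact Subject/Status/Count table from the question, one way to assemble it from that dict (a sketch, assuming grouped_rows is the dict built above):

# stack the per-subject counts into one long Subject/Status/Count table
counts = pd.concat(grouped_rows, names=['Subject', 'Status'])
print(counts.reset_index(name='Count'))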

How to get values from a dict into a new column, based on values in column

I have a dictionary that contains all of the information for company ticker : sector. For example 'AAPL':'Technology'.
I have a CSV file that looks like this:
ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000
A,ARQ,2000-03-31,2000-06-12,2000-04-30,2020-09-01,-4000000,7321000000,,5057000000,2264000000,,10.27,-95000000,978000000,978000000,1261000000,166000000,2.313,0.577,98000000,98000000,0,98000000,329000000,103000000,0,0.0,0.0,256000000,359000000,0.144,359000000,256000000,256000000,0.37,0.36,0.37,4642000000,,4642000000,28969949822,,,-133000000,-0.294,1.0,1224000000,0.493,0,0,4255000000,,1622000000,0,0,0,2679000000,2186000000,493000000,29849949822,-390000000,-326000000,2000000,-13000000,0,-11000000,-341000000,95000000,-38000000,0,166000000,166000000,166000000,0,0,0.067,1010000000,214000000,572000000,0.0,6.43,,,1453000000,0,66.0,,,1826000000,297000000,2485000000,2485000000,296000000,,,,,0,714000000,1.0,452271967,452000000,457000000,5.498,7321000000,0,90000000,192000000,16.197,2871000000
A,ARQ,2000-06-30,2000-09-01,2000-07-31,2020-09-01,-6000000,7827000000,,5344000000,2483000000,,10.821,-222000000,703000000,703000000,1369000000,155000000,2.129,0.597,129000000,129000000,0,129000000,361000000,146000000,0,0.0,0.0,238000000,384000000,0.144,384000000,238000000,238000000,0.34,0.34,0.34,4902000000,,4902000000,27458542149,30,19.97,-153000000,-0.338,1.0,1301000000,0.487,0,0,4743000000,,1762000000,0,0,0,2925000000,2510000000,415000000,28032542149,-275000000,-181000000,42000000,31000000,0,73000000,-417000000,-15000000,69000000,0,155000000,155000000,155000000,0,0,0.058,1091000000,210000000,783000000,0.0,5.719,46.877,44.2,1581000000,0,61.88,2.846,2.846,2167000000,452000000,2670000000,2670000000,318000000,,,,,0,773000000,1.0,453014579,453000000,461000000,5.894,7827000000,0,83000000,238000000,17.278,2834000000
I would like to have my dictionary match up with all the tickers in the CSV file and then write the corresponding values to a column in the CSV called sector.
Code:
for ticker in company_dic:
    sf1['sector'] = sf1['ticker'].apply(company_dic[ticker])
The code is giving me problems.
For example, the first sector is healthcare, I get this error:
ValueError: Healthcare is an unknown string function
Would appreciate some help. I'm sure there's a pretty simple solution for this. Maybe using iterrows()?
Use .map, not .apply, to select values from a dict by using a column value as a key, because .map is the method specifically implemented for this operation.
.map will return NaN if the ticker is not in the dict.
.apply can be used, but .map should be preferred:
df['sector'] = df.ticker.apply(lambda x: company_dict.get(x))
.get will return None if the ticker isn't in the dict.
import pandas as pd
# test dataframe for this example
df = pd.DataFrame({'ticker': ['AAPL', 'AAPL', 'AAPL'], 'dimension': ['ARQ', 'ARQ', 'ARQ'], 'calendardate': ['1999-12-31', '2000-03-31', '2000-06-30'], 'datekey': ['2000-03-15', '2000-06-12', '2000-09-01']})
# in your case, load the data from the file instead:
# df = pd.read_csv('file.csv')
# display(df)
ticker dimension calendardate datekey
0 AAPL ARQ 1999-12-31 2000-03-15
1 AAPL ARQ 2000-03-31 2000-06-12
2 AAPL ARQ 2000-06-30 2000-09-01
# dict of sectors
company_dict = {'AAPL': 'tech'}
# insert the sector column using map, into a specific column index
df.insert(loc=1, column='sector', value=df['ticker'].map(company_dict))
# display(df)
ticker sector dimension calendardate datekey
0 AAPL tech ARQ 1999-12-31 2000-03-15
1 AAPL tech ARQ 2000-03-31 2000-06-12
2 AAPL tech ARQ 2000-06-30 2000-09-01
# write the updated data back to the csv file
df.to_csv('file.csv', index=False)
temp = sf1.ticker.map(lambda x: company_dic[str(x)])  # faster than a for loop
sf1['sector'] = temp
You can pass na_action='ignore' if you have NaNs in the ticker column.
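For example, a one-line sketch reusing the sf1 and company_dic names from above; NaN tickers are skipped instead of raising a KeyError:

# na_action='ignore' propagates NaN without calling the lambda on it
sf1['sector'] = sf1['ticker'].map(lambda x: company_dic[str(x)], na_action='ignore')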

Pandas DataFrame: Adding a new column with the average price sold per author

I have this DataFrame data with roughly 10,000 records of sold items for 201 authors.
I want to add a column to this DataFrame with the average price for each author.
First I create this new column average_price, and then I create another DataFrame df
with 201 rows of authors and their average price (at least I think this is the right way to do this):
data["average_price"] = 0
df = data.groupby('Author Name', as_index=False)['price'].mean()
df looks like this
Author Name price
0 Agnes Cleve 107444.444444
1 Akseli Gallen-Kallela 32100.384615
2 Albert Edelfelt 207859.302326
3 Albert Johansson 30012.000000
4 Albin Amelin 44400.000000
... ... ...
196 Waldemar Lorentzon 152730.000000
197 Wilhelm von Gegerfelt 25808.510638
198 Yrjö Edelmann 53268.928571
199 Åke Göransson 87333.333333
200 Öyvind Fahlström 351345.454545
Now I want to use this df to populate the average_price column in the larger DataFrame data.
I could not come up with how to do this, so I tried a for loop, which is not working. (And I know you should avoid for loops when working with DataFrames.)
for index, row in data.iterrows():
    for ind, r in df.iterrows():
        if row["Author Name"] == r["Author Name"]:
            row["average_price"] = r["price"]
So I wonder: how should this be done?
You can use transform and groupby to add a new column:
data['average price'] = data.groupby('Author Name')['price'].transform('mean')
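A quick self-contained check of that one-liner, with made-up numbers:

import pandas as pd

# transform('mean') returns one value per row, aligned with the original index
data = pd.DataFrame({'Author Name': ['A', 'A', 'B'], 'price': [100, 200, 50]})
data['average price'] = data.groupby('Author Name')['price'].transform('mean')
print(data)
#   Author Name  price  average price
# 0           A    100          150.0
# 1           A    200          150.0
# 2           B     50           50.0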
I think, based on what you described, you should use the .join method on a pandas DataFrame. You don't need to create the 'average_price' column manually. Note that join aligns on the index of the frame you pass it, so keep 'Author Name' as the index of the grouped result. This should simply work for your case:
df = data.groupby('Author Name')['price'].mean().rename('average_price')
data = data.join(df, on='Author Name')
Now you can get the average price from the data['average_price'] column.
Hope this helps!
I think the easiest way to do that would be using a join (aka pandas.merge):
df_data = pd.DataFrame([...])  # your data here
df_agg_data = df_data.groupby('Author Name', as_index=False)['price'].mean().rename(columns={'price': 'average_price'})
df_data = df_data.merge(df_agg_data, on='Author Name')
print(df_data)

Identifying Column Values and then Assigning New Values to Each

I have a pandas DataFrame with cities and a separate list with multipliers for each city. I want to update the TaxAmount in the first df with the corresponding multiplier for each city, from the list.
My current code runs fine, but it applies the same multiplier to all cities instead of a different one per city, so all the cities' tax rates come out the same when they should differ. Any suggestions on how to get this to work?
import pandas as pd

df = pd.DataFrame({
    'City': ['BELLEAIR BEACH', 'BELLEAIR BEACH', 'CLEARWATER', 'CLEARWATER'],
    'TaxAnnualAmount': [5672, 4781, 2193.34, 2199.14]
})
flag = (df['City'] == 'Belleair Bluffs')
if flag.any():
    df.loc['TaxAnnualAmount'] = ((df['CurrentPrice'] / 1000) * 19.9818)
flag = (df['City'] == 'Belleair')
if flag.any():
    df.loc['TaxAnnualAmount'] = ((df['CurrentPrice'] / 1000) * 21.1318)
flag = (df['City'] == 'Belleair Shore')
if flag.any():
    df.loc['TaxAnnualAmount'] = ((df['CurrentPrice'] / 1000) * 14.4641)
As per your comment, whenever you need to update all rows (or most of them), each with a different factor, you can create a second dataframe with those values and merge it with your original:
# sample data
df = pd.DataFrame({
    'City': ['BELLEAIR BEACH', 'BELLEAIR BEACH', 'CLEARWATER', 'Belleair'],
    'TaxAnnualAmount': [5672, 4781, 2193.34, 500]
})
mults = pd.DataFrame([
    ['Belleair Bluffs', 19.9818],
    ['Belleair', 21.1318],
    ['Belleair Shore', 14.4641]
], columns=['City', 'factor'])
df = df.merge(mults, on='City', how='left')
df['NewTaxAmount'] = df['TaxAnnualAmount'].div(1000).mul(df['factor'])
print(df)
Output
City TaxAnnualAmount factor NewTaxAmount
0 BELLEAIR BEACH 5672.00 NaN NaN
1 BELLEAIR BEACH 4781.00 NaN NaN
2 CLEARWATER 2193.34 NaN NaN
3 Belleair 500.00 21.1318 10.5659
Notice two things:
The how='left' parameter tells pandas to include all rows from the main dataframe and fill NaN on the rows that don't have a match.
Be careful whenever overwriting columns on a dataframe: make sure you don't have assignments like that inside a loop (as you would with your previous method).
For more on merging you can look at the documentation and this excellent answer by cs95.
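If you would rather update TaxAnnualAmount in place instead of adding new columns, here is a map-based sketch (reusing the df and mults names from above; just one option, not the only one):

# build a City -> factor mapping, then update only the matched rows
factor = df['City'].map(mults.set_index('City')['factor'])
mask = factor.notna()
df.loc[mask, 'TaxAnnualAmount'] = df.loc[mask, 'TaxAnnualAmount'].div(1000).mul(factor[mask])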

Pandas: Replacing column values with ones retrieved from another dataframe

I have stumbled upon a trivial problem in pandas. I have two dataframes. The first one, df_1, is as follows:
vendor_name date company_name state
PERTH is june 2019 Abc enterprise Kentucky
Megan Ent 25-april-2019 Xyz Fincorp Texas
The second one, df_2, contains the correct values for each column in df_1:
df_2
Field wrong value correct value
vendor_name PERTH Perth Enterprise
date is 15 ## this means that 'is' should be read as '15'
company_name Abc enterprise ABC International Enterprise Inc.
In order to replace the values with the correct ones in df_1 (except the date field), I am using the pandas .loc method. Below is the code snippet:
vend = df_1['vendor_name'].tolist()
comp = df_1['company_name'].tolist()
state = df_1['state'].tolist()
for i in vend:
    if df_2['wrong value'].str.contains(i):
        crct = df_2.loc[df_2['wrong value'] == i, 'correct value'].tolist()
Similarly, for company and state I have followed the same approach.
However, crct is returning a blank series. Ideally it should return
['Perth Enterprise','Abc International Enterprise Inc']
The next step would be to replace the respective field values with the above list.
With the above, I have three questions:
Why is the above code generating a blank list? What am I missing here?
How can I replace the respective fields using the df_1.replace method?
What would be a correct approach to replace the portion of the date in df_1 with the correct one from df_2?
Edit: when the data has looping replacements (i.e. overlapping keys and values), replacement on the whole dataframe will fail. In this case, do it column by column and concat them together. Finally, use join to add back any missing columns from df1:
df_replace = pd.concat([df1[k].replace(val, regex=True) for k, val in d.items()], axis=1).join(df1.state)
Original:
I tried your code in my interactive session and it gives the error ValueError: The truth value of a Series is ambiguous on df_2['wrong value'].str.contains(i).
Assuming you have multiple vendor names, the simple way is to construct a dictionary from a groupby of df2 and use it with df.replace on df1:
d = {k: gp.set_index('wrong value')['correct value'].to_dict()
     for k, gp in df2.groupby('Field')}
Out[64]:
{'company_name': {'Abc enterprise': 'ABC International Enterprise Inc. '},
'date': {'is': '15'},
'vendor_name': {'PERTH': 'Perth Enterprise'}}
df_replace = df1.replace(d, regex=True)
print(df_replace)
Out[68]:
vendor_name date company_name \
0 Perth Enterprise 15 june 2019 ABC International Enterprise Inc.
1 Megan Ent 25-april-2019 Xyz Fincorp
state
0 Kentucky
1 Texas
Note: your sample df2 only has a value for vendor PERTH, so only the first row is replaced. When you have all the vendor_names in df2, it will replace them all in df1.
A simple way to do that is to iterate over the first dataframe and then replace the wrong values:
Result = pd.DataFrame()
for i in range(len(df1)):
    vendor_name = df1.iloc[i]['vendor_name']
    date = df1.iloc[i]['date']
    company_name = df1.iloc[i]['company_name']
    if vendor_name in df2['wrong value'].values:
        vendor_name = df2.loc[df2['wrong value'] == vendor_name]['correct value'].values[0]
    if company_name in df2['wrong value'].values:
        company_name = df2.loc[df2['wrong value'] == company_name]['correct value'].values[0]
    new_row = {'vendor_name': [vendor_name], 'date': [date], 'company_name': [company_name]}
    new_row = pd.DataFrame(new_row, columns=['vendor_name', 'date', 'company_name'])
    # DataFrame.append was removed in pandas 2.0; concat does the same job
    Result = pd.concat([Result, new_row], ignore_index=True)
Result:
Define the following replace function:
import re

def repl(row):
    fld = row.Field
    v1 = row['wrong value']
    v2 = row['correct value']
    updInd = df_1[df_1[fld].str.contains(v1)].index
    df_1.loc[updInd, fld] = df_1.loc[updInd, fld] \
        .str.replace(re.escape(v1), v2, regex=True)
Then call it for each row in df_2:
for _, row in df_2.iterrows():
    repl(row)
Note that str.replace alone does not require importing re (pandas
imports it under the hood).
But the above function calls re.escape explicitly, from our own code,
hence the import re is required.
