I have created a dictionary with this piece of code:
dat[r["author_name"]] = (r["num_deletions"], r["num_insertions"],
r["num_lines_changed"], r["num_files_changed"], r["author_date"])
I then want to take this dictionary and create a pandas DataFrame with the columns
author_name | num_deletions | num_insertions | num_lines_changed | num_files_changed | author_date
I tried this:
df = pd.DataFrame(list(dat.iteritems()),
                  columns=["author_name", "num_deletions", "num_insertions",
                           "num_lines_changed", "num_files_changed", "author_date"])
But it does not work, since it reads the key and the tuple of the dictionary as only two columns instead of six. So how can I take each of the five entries in the tuple and split them into their own columns?
You need the key and value at the same nesting level:
df = pd.DataFrame([(key,) + val for key, val in dat.items()],
                  columns=["author_name", "num_deletions",
                           "num_insertions", "num_lines_changed",
                           "num_files_changed", "author_date"])
You could also use
df = pd.DataFrame.from_dict(dat, orient='index').reset_index()
df.columns = ["author_name", "num_deletions",
              "num_insertions", "num_lines_changed",
              "num_files_changed", "author_date"]
This seems to be a bit faster once you have roughly 10,000 rows or more.
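If you want to verify that speed claim on your own data, here is a minimal timing sketch (it assumes pandas is imported as pd and dat is already populated; the variable names are just for illustration):
import timeit

# Build the frame from a list of (key,) + tuple rows
t_rows = timeit.timeit(lambda: pd.DataFrame([(k,) + v for k, v in dat.items()]), number=100)

# Build it via from_dict plus reset_index
t_dict = timeit.timeit(lambda: pd.DataFrame.from_dict(dat, orient='index').reset_index(), number=100)

print(t_rows, t_dict)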
This should work.
import pandas as pd
df = pd.DataFrame(columns=['author_name', 'num_deletions', 'num_insertions', 'num_lines_changed',
                           'num_files_changed', 'author_date'])
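Note that this only creates an empty DataFrame with the desired columns; you would still have to fill it from the dictionary. A minimal sketch of one way to do that (row by row, which is slower than building all rows up front):
for author, vals in dat.items():
    df.loc[len(df)] = (author,) + vals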
I have a dataframe like this:
dataframe name: df_test
ID       | Data
---------|-----
test-001 | {"B":{"1":{"_seconds":1663207410,"_nanoseconds":466000000}},"C":{"1":{"_seconds":1663207409,"_nanoseconds":978000000}},"D":{"1":{"_seconds":1663207417,"_nanoseconds":231000000}}}
test-002 | {"B":{"1":{"_seconds":1663202431,"_nanoseconds":134000000}},"C":{"1":{"_seconds":1663208245,"_nanoseconds":412000000}},"D":{"1":{"_seconds":1663203482,"_nanoseconds":682000000}}}
I want it to be unnested like this:
ID       | B_1_seconds | B_1_nanoseconds | C_1_seconds | C_1_nanoseconds | D_1_seconds | D_1_nanoseconds
---------|-------------|-----------------|-------------|-----------------|-------------|----------------
test-001 | 1663207410  | 466000000       | 1663207409  | 978000000       | 1663207417  | 231000000
test-002 | 1663202431  | 134000000       | 1663208245  | 412000000       | 1663203482  | 682000000
I tried df_test.explode, but it doesn't work for this.
I used Dataiku to unnest the data and it worked perfectly. Now I want to unnest the data within my Python notebook; what should I do?
Edit:
I tried
df_list=df_test["data"].tolist()
then
pd.json_normalize(df_list)
It returned an empty dataframe with only an index but no values in it.
Since pd.json_normalize returns an empty dataframe, I'd guess that df["Data"] contains strings? If that's the case, you could try
import json
df_data = pd.json_normalize(json.loads("[" + ",".join(df["Data"]) + "]"), sep="_")
res = pd.concat([df[["ID"]], df_data], axis=1).rename(lambda c: c.replace("__", "_"), axis=1)
or
df_data = pd.json_normalize(df["Data"].map(eval), sep="_")
res = pd.concat([df[["ID"]], df_data], axis=1).rename(lambda c: c.replace("__", "_"), axis=1)
Result for both alternatives is:
ID B_1_seconds B_1_nanoseconds C_1_seconds C_1_nanoseconds \
0 test-001 1663207410 466000000 1663207409 978000000
1 test-002 1663202431 134000000 1663208245 412000000
D_1_seconds D_1_nanoseconds
0 1663207417 231000000
1 1663203482 682000000
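If the cells turn out to be Python dict literals rather than strict JSON, ast.literal_eval is a safer substitute for eval; a minimal sketch of that variant (same column handling as above, and the assumption about the cell contents is untested):
import ast

df_data = pd.json_normalize(df["Data"].map(ast.literal_eval), sep="_")
res = pd.concat([df[["ID"]], df_data], axis=1).rename(lambda c: c.replace("__", "_"), axis=1)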
I'm trying to loop through a series of tickers, cleaning the associated dataframes, and then combining the individual ticker dataframes into one large dataframe with columns named for each ticker. The following code enables me to loop through unique tickers and name the columns of each ticker's dataframe after the specific ticker:
import pandas as pd
def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f1.index = f1['Date']
    keep = ['Col1', 'Col2']
    f2 = f1[keep]
    f2.columns = [tkr + 'Col1', tkr + 'Col2']
    return f2

tkrs = ['tkr1', 'tkr2', 'tkr3']

for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
    df2 = clean_func(tkr, df1)
However, I don't know how to create a master dataframe where I add each new ticker to the master dataframe. With that in mind, I'd like to align each new ticker's data using the datetime index. So, if tkr1 has data for 6/25/22, 6/26/22, 6/27/22, and tkr2 has data for 6/26/22, and 6/27/22, the combined dataframe would show all three dates but would produce a NaN for ticker 2 on 6/25/22 since there is no data for that ticker on that date.
When not in a loop, the following code does what I'd like, appending each successive ticker to a larger dataframe. But it doesn't work when looping and adding new ticker data on each successive iteration (or I don't know how to make it work within the confines of a loop).
combined = pd.concat((df1, df2, df3,...,dfn), axis=1)
Many thanks in advance.
You should only create the master DataFrame after the loop. Appending to the master DataFrame in each iteration via pandas.concat is slow since you are creating a new DataFrame every time.
Instead, read each ticker DataFrame, clean it, and append it to a list that stores every ticker's DataFrame. After the loop, create the master DataFrame from all those DataFrames using pandas.concat:
import pandas as pd
def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f1.index = f1['Date']
    keep = ['Col1', 'Col2']
    f2 = f1[keep]
    f2.columns = [tkr + 'Col1', tkr + 'Col2']
    return f2

tkrs = ['tkr1', 'tkr2', 'tkr3']
dfs_list = []

for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
    df2 = clean_func(tkr, df1)
    dfs_list.append(df2)

master_df = pd.concat(dfs_list, axis=1)
As a suggestion, here is a cleaner way of defining your clean_func using DataFrame.set_index and DataFrame.add_prefix:
def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f2 = f1.set_index('Date')[['Col1', 'Col2']].add_prefix(tkr)
    return f2
Or, if you want, you can parse the Date column as datetime and set it as the index directly in the pd.read_csv call by specifying the index_col and parse_dates parameters (the two work together, so the index comes back already parsed as datetime):
import pandas as pd
def clean_func(tkr, f1):
    f2 = f1[['Col1', 'Col2']].add_prefix(tkr)
    return f2

tkrs = ['tkr1', 'tkr2', 'tkr3']
dfs_list = []

for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv', index_col='Date', parse_dates=['Date'])
    df2 = clean_func(tkr, df1)
    dfs_list.append(df2)

master_df = pd.concat(dfs_list, axis=1)
Before the loop create an empty df with:
combined = pd.DataFrame()
Then within the loop (after loading df1 - see code above):
combined = pd.concat((combined, clean_func(tkr, df1)), axis=1)
If you get:
TypeError: concat() got multiple values for argument 'axis'
Make sure your parentheses are correct per above.
With the code above, you can skip the original step:
df2 = clean_func(tkr,df1)
since it is embedded in the concat call. Alternatively, you could keep the df2 step and use:
combined = pd.concat((combined,df2), axis=1)
Just make sure the dataframes are wrapped together in parentheses (a single tuple) within the concat call.
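To illustrate the parentheses point, a small sketch (it assumes combined and df2 already exist; the commented-out line is the failure mode):
# Raises TypeError: concat() got multiple values for argument 'axis',
# because df2 is taken as the second positional argument (which is axis)
# combined = pd.concat(combined, df2, axis=1)

# Works: the frames are passed together as a single tuple
combined = pd.concat((combined, df2), axis=1)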
Same answer as GC123, but here is a full example which mimics reading from separate files and concatenating them:
import pandas as pd
import io
fake_file_1 = io.StringIO("""
fruit,store,quantity,unit_price
apple,fancy-grocers,2,9.25
pear,fancy-grocers,3,100
banana,fancy-grocers,1,256
""")
fake_file_2 = io.StringIO("""
fruit,store,quantity,unit_price
banana,bargain-grocers,667,0.01
apple,bargain-grocers,170,0.15
pear,bargain-grocers,281,0.45
""")
fake_files = [fake_file_1,fake_file_2]
combined = pd.DataFrame()
for fake_file in fake_files:
    df = pd.read_csv(fake_file)
    df = df.set_index('fruit')
    combined = pd.concat((combined, df), axis=1)
print(combined)
Output (same as the table shown below).
This method is slightly more efficient:
combined = []
for fake_file in fake_files:
    combined.append(pd.read_csv(fake_file).set_index('fruit'))
combined = pd.concat(combined, axis=1)
print(combined)
Output:
store quantity unit_price store quantity unit_price
fruit
apple fancy-grocers 2 9.25 bargain-grocers 170 0.15
pear fancy-grocers 3 100.00 bargain-grocers 281 0.45
banana fancy-grocers 1 256.00 bargain-grocers 667 0.01
Say I have a table that looks something like:
+----------------------------------+-------------------------------------+----------------------------------+
| ExperienceModifier|ApplicationId | ExperienceModifier|RatingModifierId | ExperienceModifier|ActionResults |
+----------------------------------+-------------------------------------+----------------------------------+
| | | |
+----------------------------------+-------------------------------------+----------------------------------+
I would like to grab all of the columns whose names start with 'ExperienceModifier' and put them into their own dataframe. How would I accomplish this with pandas?
You can try pandas.DataFrame.filter
df.filter(like='ExperienceModifier')
If you only want columns that contain ExperienceModifier at the beginning:
df.filter(regex='^ExperienceModifier')
Ynjxsjmh's answer will get all columns that contain "ExperienceModifier". If you literally want columns that start with that string, rather than merely contain it, you can do
new_df = df[[col for col in df.columns if col[:18] == 'ExperienceModifier']]
If all of the desired columns have | after "ExperienceModifier", you could also do
new_df = df[[col for col in df.columns if col.split('|')[0] == 'ExperienceModifier']]
All of these will create a view of the dataframe. If you want a completely separate dataframe, you should copy it, like this:
new_df = df[[col for col in df.columns if col.split('|')[0] == 'ExperienceModifier']].copy()
You also might want to create a multi-index by splitting the column names on | rather than creating a separate dataframe.
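A minimal sketch of that multi-index idea (assuming pandas is imported as pd and every column name has the Group|Field shape shown in the question):
# Split each column name on "|" so the prefix becomes the outer level
df.columns = pd.MultiIndex.from_tuples([tuple(col.split('|')) for col in df.columns])

# All "ExperienceModifier" columns can then be selected via the outer level
experience_df = df['ExperienceModifier']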
The accepted answer does the job easily, but I still attach my "hand-made" version, which works:
import pandas as pd
import numpy as np
import re
lst = [[1, 2, 3, 4],[1, 2, 3, 4],[1, 2, 3, 4]]
column_names = [['ExperienceModifier|ApplicationId', 'ExperienceModifier|RatingModifierId', 'ExperienceModifier|ActionResults','OtherName|ActionResults']]
data = pd.DataFrame(lst, columns = column_names)
data
old_and_dropped_dataframes = []
new_dataframes = []

for i in np.arange(0, len(column_names[0])):
    # Split the column name into its parts, e.g. 'ExperienceModifier|ApplicationId'
    splits = re.findall(r"[\w']+", column_names[0][i])
    if "ExperienceModifier" in splits:
        new_dataframe = data.iloc[:, [i]]
        new_dataframes.append(new_dataframe)
    else:
        old_and_dropped_dataframe = data.iloc[:, [i]]
        old_and_dropped_dataframes.append(old_and_dropped_dataframe)

ExperienceModifier_dataframe = pd.concat(new_dataframes, axis=1)
ExperienceModifier_dataframe

OtherNames_dataframe = pd.concat(old_and_dropped_dataframes, axis=1)
OtherNames_dataframe
This script creates two new dataframes starting from the initial dataframe: one that contains the columns whose names start with ExperienceModifier, and another one that contains the columns that do not.
I have a first data frame looking like this:
item_id | options
------------------------------------------
item_1_id | [option_1_id, option_2_id]
And a second like this:
option_id | option_name
---------------------------
option_1_id | option_1_name
And I'd like to transform my first data set to:
item_id | options
----------------------------------------------
item_1_id | [option_1_name, option_2_name]
What is an elegant way to do so using Pandas' data frames?
You can use apply.
For the record, storing lists in DataFrames is typically unnecessary and not very "pandonic". Also, if you only have one column, you can do this with a Series (though this solution also works for DataFrames).
Setup
Build the Series with the lists of options.
index = list('abcde')
s = pd.Series([['opt1'], ['opt1', 'opt2'], ['opt0'], ['opt1', 'opt4'], ['opt3']], index=index)
Build the Series with the names.
index_opts = ['opt%s' % i for i in range(5)]
vals_opts = ['name%s' % i for i in range(5)]
s_opts = pd.Series(vals_opts, index=index_opts)
Solution
Map options to names using apply. The lambda function looks up each option in the Series mapping options to names. It is applied to each element of the Series.
s.apply(lambda l: [s_opts[opt] for opt in l])
outputs
a [name1]
b [name1, name2]
c [name0]
d [name1, name4]
e [name3]
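To tie this back to the two data frames from the question, a minimal sketch (assuming they are named items and options_df, with the columns shown above; those names are just placeholders):
# Build a lookup Series (option_id -> option_name) from the second frame,
# then map each list of ids in the first frame to a list of names
name_by_id = options_df.set_index('option_id')['option_name']
items['options'] = items['options'].apply(lambda ids: [name_by_id[i] for i in ids])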
I have just discovered pandas and am impressed by its capabilities.
I am having difficulties understanding how to work with DataFrame with MultiIndex.
I have two questions:
(1) Exporting the DataFrame
Here is my problem:
This dataset
import pandas as pd
import StringIO
d1 = StringIO.StringIO(
"""Gender,Employed,Region,Degree
m,yes,east,ba
m,yes,north,ba
f,yes,south,ba
f,no,east,ba
f,no,east,bsc
m,no,north,bsc
m,yes,south,ma
f,yes,west,phd
m,no,west,phd
m,yes,west,phd """
)
df = pd.read_csv(d1)
# Frequencies tables
tab1 = pd.crosstab(df.Gender, df.Region)
tab2 = pd.crosstab(df.Gender, [df.Region, df.Degree])
tab3 = pd.crosstab([df.Gender, df.Employed], [df.Region, df.Degree])
# Now we export the datasets
tab1.to_excel('H:/test_tab1.xlsx') # OK
tab2.to_excel('H:/test_tab2.xlsx') # fails
tab3.to_excel('H:/test_tab3.xlsx') # fails
One work-around I could think of is to change the columns (the way R does):
def NewColums(DFwithMultiIndex):
    NewCol = []
    for item in DFwithMultiIndex.columns:
        NewCol.append('-'.join(item))
    return NewCol
# New Columns
tab2.columns = NewColums(tab2)
tab3.columns = NewColums(tab3)
# New export
tab2.to_excel('H:/test_tab2.xlsx') # OK
tab3.to_excel('H:/test_tab3.xlsx') # OK
My question is: Is there a more efficient way to do this in pandas that I missed in the documentation?
(2) Selecting columns
This new structure does not allow selecting columns for a given variable (the advantage of hierarchical indexing in the first place). How can I select columns containing a given string (e.g. '-ba')?
P.S.: I have seen this question, which is related, but have not understood the proposed reply.
This looks like a bug in to_excel; for the moment, as a workaround, I would recommend using to_csv (which does not seem to show this issue).
I added this as an issue on github.
To answer the second question, if you really need to use to_excel...
You can use filter to select only those columns which include '-ba':
In [21]: filter(lambda x: '-ba' in x, tab2.columns)
Out[21]: ['east-ba', 'north-ba', 'south-ba']
In [22]: tab2[filter(lambda x: '-ba' in x, tab2.columns)]
Out[22]:
east-ba north-ba south-ba
Gender
f 1 0 1
m 1 1 0
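For completeness, if you keep the original MultiIndex columns instead of flattening them, the same selection can be done with a cross-section; a minimal sketch, assuming tab2 still has its original (Region, Degree) column MultiIndex from pd.crosstab:
# Select every column whose Degree level is 'ba'
tab2.xs('ba', axis=1, level='Degree')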