The DataFrame MultiIndex is kicking my butt. After struggling for quite a while, I was able to create a MultiIndex DataFrame with this code:
columns = pd.MultiIndex.from_tuples([
    ('Zip', ''),
    ('All Properties', 'Avg List Price'), ('All Properties', 'Median List Price'),
    ('3 Bedroom', 'Avg List Price'), ('3 Bedroom', 'Median List Price'),
    ('2 Bedroom', 'Avg List Price'), ('2 Bedroom', 'Median List Price'),
    ('1 Bedroom', 'Avg List Price'), ('1 Bedroom', 'Median List Price'),
])
data = [['11111', 'Val1', 'Val2', 'Val3', 'Val4', 'Val5', 'Val6', 'Val7', 'Val8']]
df = pd.DataFrame(data, columns=columns)
Everything looks fine until I try to write it to an Excel file:
writer = pd.ExcelWriter('testData.xlsx', engine='openpyxl')
df.to_excel(writer, 'Sheet1')
writer.save()
When I open the Excel file, this is what I get.
If I unmerge the columns in Excel all the data is there.
Here's an image of what I'm trying to create
I'm guessing the problem has something to do with the way I'm creating the MultiIndex columns, but I can't figure out what it is.
I'm running python 2.7 on a Mac.
Thanks for any input.
This is a bug that will be fixed in version 0.17.1; in the meantime you can use engine='xlsxwriter'.
https://github.com/pydata/pandas/pull/11328
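For example, the question's write step with only the engine swapped (a minimal sketch):
# same frame as above; xlsxwriter does not hit the openpyxl merge bug
writer = pd.ExcelWriter('testData.xlsx', engine='xlsxwriter')
df.to_excel(writer, 'Sheet1')
writer.save()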
This is a great use for itertools.product. Try this instead in your multiindex creation:
from itertools import product

cols = product(
    ['All Properties', '3 Bedroom', '2 Bedroom', '1 Bedroom'],
    ['Avg List Price', 'Median List Price']
)
columns = pd.MultiIndex.from_tuples(list(cols))
ind = pd.Index(['11111'], name='zip')
vals = ['Val1', 'Val2', 'Val3', 'Val4', 'Val5', 'Val6', 'Val7', 'Val8']
df = pd.DataFrame(
    [vals], index=ind, columns=columns  # wrap vals in a list so it forms one row
)
The issue is that you included 'Zip' (which should name your index) in the MultiIndex you built for your columns (tragically, nothing called MultiColumns exists to clear up that confusion). You need to create your index (a single-level, normal pandas.Index) and your columns (a two-level pandas.MultiIndex) separately, as above, and you should get the expected behavior when you write to Excel.
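A quick sanity check that the two structures come out as intended:
print(df.index.nlevels)    # 1 -- a plain Index named 'zip'
print(df.columns.nlevels)  # 2 -- a two-level MultiIndex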
I have a pandas dataframe (sample):
data = [['ABC', 'John', '123', 'Yes', '2022_Jan'],
        ['BCD', 'Amy', '456', 'Yes', '2022_Jan'],
        ['ABC', 'Michelle', '123', 'No', '2022_Feb'],
        ['CDE', 'John', '789', 'No', '2022_Feb'],
        ['ABC', 'Michelle', '012', 'Yes', '2022_Mar'],
        ['BCD', 'Amy', '123', 'No', '2022_Mar'],
        ['CDE', 'Jill', '789', 'No', '2022_Mar'],
        ['CDE', 'Jack', '789', 'No', '2022_Mar']]
tmp2 = pd.DataFrame(data, columns=['Responsibility', 'Name', 'ID', 'Has Error', 'Year_Month'])
tmp3 = tmp2[['Responsibility', 'Name', 'ID', 'Has Error']]
The actual dataframe is a lot larger with more columns, but the above are the only fields I need right now. I already have the following code that generates a year-to-date table that groups by 'Responsibility' and 'Name' and gives me the number & % of unique 'ID's that have errors and don't have errors, and exports the table to a single Excel sheet:
result = pd.pivot_table(tmp3, index =['Responsibility', 'Name'], columns = ['Has Error'], aggfunc=len)
#cleanup
result.fillna(0, inplace=True)
result.columns = [s1 + "_" + str(s2) for (s1,s2) in result.columns.tolist()]
result = result.rename(columns={'ID_No': 'Does NOT Have Error (count)', 'ID_Yes': 'Has Error (count)'})
result = result.astype(int)
#create fields for %s and totals
result['Has Error (%)'] = round(result['Has Error (count)'] / (result['Has Error (count)'] + result['Does NOT Have Error (count)']) *100, 2).astype(str)+'%'
result['Does NOT Have Error (%)'] = round(result['Does NOT Have Error (count)'] / (result['Has Error (count)'] + result['Does NOT Have Error (count)']) *100, 2).astype(str) + '%'
result['Total Count'] = result['Has Error (count)'] + result['Does NOT Have Error (count)']
result = result.reindex(columns=['Has Error (%)', 'Does NOT Have Error (%)', 'Has Error (count)', 'Does NOT Have Error (count)', 'Total Count'])
#save to excel
Excelwriter = pd.ExcelWriter('./output/final.xlsx', engine='xlsxwriter')
workbook = Excelwriter.book
result.to_excel(Excelwriter, sheet_name='YTD Summary', startrow=0, startcol=0)
Now I want to keep this YTD summary sheet, and also generate the 'result' table for each month's data (from the 'Year_Month' field in the original dataset tmp2), exporting the table for each month to a separate sheet in the same output file. I will be generating this output file on a recurring basis, so I want the code to automatically identify each month available in a newly read dataframe, build a separate table for each month using the code I've already written above, and export each table to its own tab in the Excel file. I'm a beginner at Python and this is proving harder than I originally thought; what I've tried so far is not working. I know one way to do this would be a for loop or matrix functions, but I can't figure out how to make the code work. Any help would be greatly appreciated!
Assuming you don't care about the year, you can split the month from the last column and then iterate over groupby.
Split the month:
df['month'] = df['Year_Month'].str.split('_').str[1]
Iterate with groupby:
for month, df_month in df.groupby('month'):
    # your processing stuff here
    # 'df_month' is the sub-dataframe for one month
    df_month_processed.to_excel(Excelwriter, sheet_name=month, ...)
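Putting that together with the code from the question, a sketch along these lines should produce one workbook with the YTD sheet plus one sheet per month (build_summary is a hypothetical helper wrapping the pivot/cleanup steps already shown):
def build_summary(frame):
    # pivot/cleanup steps from the question, unchanged
    result = pd.pivot_table(frame, index=['Responsibility', 'Name'],
                            columns=['Has Error'], aggfunc=len)
    result.fillna(0, inplace=True)
    result.columns = [s1 + "_" + str(s2) for (s1, s2) in result.columns.tolist()]
    # ... remaining renames, percentages and totals as above ...
    return result

tmp2['month'] = tmp2['Year_Month'].str.split('_').str[1]
with pd.ExcelWriter('./output/final.xlsx', engine='xlsxwriter') as Excelwriter:
    # year-to-date summary over all rows
    build_summary(tmp3).to_excel(Excelwriter, sheet_name='YTD Summary')
    # one sheet per month, months discovered automatically from the data
    for month, df_month in tmp2.groupby('month'):
        build_summary(df_month[['Responsibility', 'Name', 'ID', 'Has Error']]) \
            .to_excel(Excelwriter, sheet_name=month)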
I have the following task.
I have this data:
import pandas
import numpy as np
data = {'name': ['Todd', 'Chris', 'Jackie', 'Ben', 'Richard', 'Susan', 'Joe', 'Rick'],
        'phone': [912341.0, np.nan, 912343.0, np.nan, 912345.0, 912345.0, 912347.0, np.nan],
        ' email': ['todd@gmail.com', 'chris@gmail.com', np.nan, 'ben@gmail.com', np.nan, np.nan, 'joe@gmail.com', 'rick@gmail.com'],
        'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', np.nan, 'Tokyo', 'Beijing', 'Tokyo', 'Heathrow'],
        'most_visited_place': ['Turkey', 'Spain', np.nan, 'Germany', 'Germany', 'Spain', np.nan, 'Spain']
        }
df = pandas.DataFrame(data)
What I have to do is, for every feature column (most_visited_airport etc.) and each of its values (Heathrow, Beijing, Tokyo), generate the personal information and output it to a file.
E.g. if we look at most_visited_airport and Heathrow, I need to output three files containing the names, emails and phones of the people who visited that airport the most.
Currently, I have this code to do the operation for both columns and all the values:
columns_to_iterate = [x for x in df.columns if 'most' in x]
for each in df[columns_to_iterate]:
    values = df[each].dropna().unique()
    for i in values:
        df1 = df.loc[df[each] == i, 'name']
        df2 = df.loc[df[each] == i, ' email']
        df3 = df.loc[df[each] == i, 'phone']
        df1.to_csv(f'{each}_{i}_{df1.name}.csv')
        df2.to_csv(f'{each}_{i}_{df2.name}.csv')
        df3.to_csv(f'{each}_{i}_{df3.name}.csv')
Is it possible to do this in a more elegant and maybe faster way? Currently I have a small dataset, but I'm not sure this code will perform well with big data. My particular concern is the nested loops.
Thank you in advance!
You could replace the call to unique with a groupby, which would not only get the unique values, but split up the dataframe for you:
for column in df.filter(regex='^most'):
    for key, group in df.groupby(column):
        for attr in ('name', 'phone', ' email'):  # note the leading space in ' email'
            group[attr].dropna().to_csv(f'{column}_{key}_{attr.strip()}.csv')
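As a design note: groupby skips rows whose key is NaN by default, so the dropna()/unique() bookkeeping from the original loop is handled for you; only the per-column dropna for the values being written is still needed.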
You can do it this way.
cols = df.filter(regex='most').columns.values

def func_current_cols_to_csv(most_col):
    place = df[most_col].dropna().unique().tolist()
    csv_cols = ['name', 'phone', ' email']
    return [df[df[most_col] == i][j].dropna().to_csv(f'{most_col}_{i}_{j}.csv', index=False)
            for i in place for j in csv_cols]

[func_current_cols_to_csv(i) for i in cols]
You can also keep the index when writing to CSV; just do not forget to reset it before writing.
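For instance, a minimal sketch of that variant (keep the index in the CSV but renumber it first; same most_col, i, j as above):
df[df[most_col] == i][j].dropna().reset_index(drop=True).to_csv(f'{most_col}_{i}_{j}.csv')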
I have a dataframe that is dynamically created.
I create my first set of rows as:
df['tourist_spots'] = pd.Series(<A list of tourist spots in a city>)
To this df I add:
df['city'] = <City Name>
So far so good. A bunch of rows are created with the same city name for multiple tourist spots.
I want to add a new city. So I do:
df['tourist_spots'].append(pd.Series(<new data>))
Now, when I append a new city with:
df['city'].append('new city')
the previously updated city data is gone. It is as if the rows are replaced each time rather than appended.
Here's an example of what I want:
Step 1:
df['tourist_spot'] = pd.Series('Golden State Bridge' + a bunch of other spots)
For all the rows created by the above data I want:
df['city'] = 'San Francisco'
Step 2:
df['tourist_spot'].append(pd.Series('Times Square' + a bunch of other spots))
For all the rows created by the above data, I want:
df['city'] = 'New York'
How can I achieve this?
Use a dictionary to add rows to your DataFrame; it is a faster method.
Here is an example.
STEP 1
Create a list of dictionaries:
dict_df = [{'tourist_spots': 'Jones LLC', 'City': 'Boston'},
           {'tourist_spots': 'Alpha Co', 'City': 'Boston'},
           {'tourist_spots': 'Blue Inc', 'City': 'Singapore'}]
STEP 2
Convert the list of dictionaries to a DataFrame:
df = pd.DataFrame(dict_df)
STEP 3
Add new entries to the DataFrame in dictionary format:
df = df.append({'tourist_spots': 'New_Blue', 'City': 'Singapore'}, ignore_index=True)
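Applied to the example in the question, the same pattern might look like this (the spot lists are placeholders):
rows = [{'tourist_spots': s, 'city': 'San Francisco'}
        for s in ['Golden State Bridge', '<other SF spots>']]
rows += [{'tourist_spots': s, 'city': 'New York'}
         for s in ['Times Square', '<other NY spots>']]
df = pd.DataFrame(rows)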
References:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_dict.html
I've got an output from an API call as a list:
out = client.phrase_this(phrase='ciao', database='it')
out
[{'Keyword': 'ciao',
'Search Volume': '673000',
'CPC': '0.05',
'Competition': '0',
'Number of Results': '205000000'}]
type(out)
list
I'd like to create a dataframe and loop-append a new row to it, starting from the API output for multiple keywords.
index = ['ciao', 'google', 'microsoft']
columns = ['Keyword', 'Search Volume', 'CPC', 'Competition', 'Number of Results']
df = pd.DataFrame(index=index, columns=columns)
The for loop that is not working:
for keyword in index:
    df.loc[keyword] = client.phrase_this(phrase=index, database='it')
Thanks!
The reason this is not working is because you are trying to assign a dictionary inside of a list to the data frame row, rather than just a list.
You are receiving a list containing a dictionary. If you only want to use the first entry of this list the following solution should work:
for keyword in index:
    df.loc[keyword] = client.phrase_this(phrase=keyword, database='it')[0].values()
[0] gets the first entry of the list.
values() returns a view of all the values in the dictionary. https://www.tutorialspoint.com/python/dictionary_values.htm
for keyword in index:
    df.loc[keyword] = client.phrase_this(phrase=keyword, database='it')
This passes the keyword to the phrase_this function, instead of the entire index list.
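Putting both fixes together, and indexing the returned dict by the column names so the value order is explicit (a sketch, assuming each response contains all five keys):
for keyword in index:
    record = client.phrase_this(phrase=keyword, database='it')[0]
    df.loc[keyword] = [record[col] for col in columns]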
Thanks for the answers, I found a workaround:
index = ['ciao', 'google', 'microsoft']
columns = ['Keyword', 'Search Volume', 'CPC', 'Competition', 'Number of Results']
out = []
for query in index:
    out.append(client.phrase_this(phrase=query, database='it')[0].values())
out
[dict_values(['ciao', '673000', '0.05', '0', '205000000']),
dict_values(['google', '24900000', '0.66', '0', '13020000000']),
dict_values(['microsoft', '110000', '0.12', '0.06', '77'])]
df = pd.DataFrame(out, columns=columns).set_index('Keyword')
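A slightly more defensive variant of the same workaround builds plain lists in explicit column order before constructing the frame:
out = []
for query in index:
    record = client.phrase_this(phrase=query, database='it')[0]
    out.append([record[col] for col in columns])
df = pd.DataFrame(out, columns=columns).set_index('Keyword')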
I have a pandas df containing 'features' for stocks, which looks like this:
I am now trying to create a dictionary with each unique sector as key, and a python list of the tickers for that sector as value, so I end up with something that looks like this:
{'consumer_discretionary': ['AAP',
'AMZN',
'AN',
'AZO',
'BBBY',
'BBY',
'BWA',
'KMX',
'CCL',
'CBS',
'CHTR',
'CMG',
etc.
I could iterate over the pandas df rows to create the dictionary, but I prefer a more pythonic solution. Thus far, this code is a partial solution:
df.set_index('sector')['ticker'].to_dict()
Any feedback is appreciated.
UPDATE:
The solution by @wrwrwr
df.set_index('ticker').groupby('sector').groups
partially works, but it returns a pandas Index as the value instead of a python list. Any ideas on how to transform it into a python list in the same line, without having to iterate over the dictionary?
Wouldn't f.set_index('ticker').groupby('sector').groups be what you want?
For example:
f = pd.DataFrame({
    'ticker': ('t1', 't2', 't3'),
    'sector': ('sa', 'sb', 'sb'),
    'name': ('n1', 'n2', 'n3')})
groups = f.set_index('ticker').groupby('sector').groups
# {'sa': Index(['t1']), 'sb': Index(['t2', 't3'])}
To ensure that they have the type you want:
{k: list(v) for k, v in f.set_index('ticker').groupby('sector').groups.items()}
or:
f.set_index('ticker').groupby('sector').apply(lambda g: list(g.index)).to_dict()
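For the example frame above, either form yields plain lists (a quick check):
f.set_index('ticker').groupby('sector').apply(lambda g: list(g.index)).to_dict()
# {'sa': ['t1'], 'sb': ['t2', 't3']}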