state_codes = pd.read_csv('name-abbr.csv', header=None)
state_codes.columns = ['State', 'Code']
codes = state_codes['Code']
states = pd.Series(state_codes['State'], index=state_codes['Code'])
name-abbr.csv is a two-column CSV file with US state names in the first column and postal codes in the second: "Alabama" and "AL" in the first row, "Alaska" and "AK" in the second, and so forth.
The above code correctly sets the index, but the Series is all NaN. If I don't set the index, the state names correctly show. But I want both.
I also tried this line:
states = pd.Series(state_codes.iloc[:,0], index=state_codes.iloc[:,1])
Same result. How do I get this to work?
The reason is called alignment: pandas tries to match the existing index of state_codes['State'] against the new index built from state_codes['Code'], and because they differ, you get missing values in the output. To prevent this, convert the Series to a NumPy array first:
states = pd.Series(state_codes['State'].to_numpy(), index=state_codes['Code'])
Or you can use DataFrame.set_index:
states = state_codes.set_index('Code')['State']
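A minimal end-to-end sketch of the pitfall and both fixes, using a small inline frame in place of name-abbr.csv (the two rows below just mirror the question's description):

```python
import pandas as pd

# Inline stand-in for name-abbr.csv (two sample rows).
state_codes = pd.DataFrame({'State': ['Alabama', 'Alaska'],
                            'Code': ['AL', 'AK']})

# Passing the column directly: pandas aligns the Series' old index
# (0, 1) with the new index ('AL', 'AK'), so every value is NaN.
aligned = pd.Series(state_codes['State'], index=state_codes['Code'])

# Fix 1: hand pandas a plain array, which has no index to align.
fixed = pd.Series(state_codes['State'].to_numpy(),
                  index=state_codes['Code'])

# Fix 2: let the DataFrame build the index.
via_set_index = state_codes.set_index('Code')['State']

print(aligned.isna().all())   # True
print(fixed['AK'])            # Alaska
print(via_set_index['AK'])    # Alaska
```

Either fix gives a Series keyed by postal code with the state names intact.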
I have a multiIndex dataframe created with pandas similar to this one:
nest = {'A1': dfx[['aa', 'bb', 'cc']],
        'B1': dfx[['dd']],
        'C1': dfx[['ee', 'ff']]}
reform = {(outerKey, innerKey): values
          for outerKey, innerDict in nest.items()
          for innerKey, values in innerDict.items()}
dfzx = pd.DataFrame(reform)
What I am trying to achieve is to add a new row at the end of the dataframe that contains a summary of the total for the three categories represented by the new index (A1, B1, C1).
I have tried df.loc (what I would normally use in this case) but I get an error. Similarly with iloc.
a1sum = dfzx['A1'].sum().to_list()
a1sum = sum(a1sum)
b1sum = dfzx['B1'].sum().to_list()
b1sum = sum(b1sum)
c1sum = dfzx['C1'].sum().to_list()
c1sum = sum(c1sum)
totalcat = a1sum, b1sum, c1sum
newrow = ['Total', totalcat]
newrow
dfzx.loc[len(dfzx)] = newrow
ValueError: cannot set a row with mismatched columns
#Alternatively
newrow2 = ['Total', a1sum, b1sum, c1sum]
newrow2
dfzx.loc[len(dfzx)] = newrow2
ValueError: cannot set a row with mismatched columns
How can I fix the mistake? Or else is there any other function that would allow me to proceed?
Note: the DF is destined to be moved on an Excel file (I use ExcelWriter).
The type of result I want to achieve in the end is a table with a summary row at the bottom (the gray "SUM" row in the screenshot).
I came up with a sort of solution on my own.
I created a separate DataFrame in Pandas that contains the summary.
I used ExcelWriter to have both dataframes on the same excel worksheet.
Technically it would then be possible to style and format the data in Excel (xlsxwriter and StyleFrame seem to be popular modules for this). Alternatively, one can do the formatting manually.
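For completeness, a hedged sketch of one way to keep the totals inside pandas itself, rather than in a separate summary frame. The stand-in dfzx below uses made-up numbers but the same two-level columns as the question; the idea is to compute one total per top-level category and place it under that category's first sub-column, so the appended row has the full column width:

```python
import pandas as pd

# Stand-in for dfzx: two-level columns as in the question;
# the numbers are invented.
dfzx = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6],
     [7, 8, 9, 10, 11, 12]],
    columns=pd.MultiIndex.from_tuples(
        [('A1', 'aa'), ('A1', 'bb'), ('A1', 'cc'),
         ('B1', 'dd'), ('C1', 'ee'), ('C1', 'ff')]))

# One grand total per top-level category (A1, B1, C1).
totals = dfzx.sum().groupby(level=0).sum()

# Build a full-width row: each category's total goes under its
# first sub-column, the remaining cells stay empty.
row = pd.Series(index=dfzx.columns, dtype='object')
for cat, val in totals.items():
    row[(cat, dfzx[cat].columns[0])] = val

# Assigning a Series aligns on the column labels, so the widths match.
dfzx.loc['Total'] = row
print(dfzx.loc['Total', ('A1', 'aa')])  # 30
```

The original "cannot set a row with mismatched columns" error came from assigning a 2- or 4-element list to a 6-column frame; aligning a Series on the MultiIndex columns avoids that.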
Here is my data
threats = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-18/threats.csv', index_col = 0)
And here is my code -
df = (threats
      .query('threatened > 0')
      .groupby(['continent', 'threat_type'])
      .agg({'threatened': 'size'}))
However, df.columns shows only Index(['threatened'], dtype='object'). That is, only the threatened column is present, not the columns I actually grouped by (continent and threat_type), although they exist in my data frame.
I would like to perform operations on the continent column, but it is not showing up as one of the columns. For example, continents = df.continent.unique() gives me a KeyError saying continent is not found.
After a groupby, pandas puts the grouping columns into the index. Reset the index after doing a groupby, and don't pass drop=True (that would discard the grouping columns instead of restoring them).
After your code.
df = df.reset_index()
And then you will get required columns.
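A runnable sketch of the same sequence with a tiny stand-in for the threats data (the rows below are invented), showing the columns before and after reset_index:

```python
import pandas as pd

# Invented stand-in rows for the threats dataset.
threats = pd.DataFrame({
    'continent': ['Africa', 'Africa', 'Asia'],
    'threat_type': ['Fire', 'Fire', 'Flood'],
    'threatened': [1, 2, 3]})

df = (threats
      .query('threatened > 0')
      .groupby(['continent', 'threat_type'])
      .agg({'threatened': 'size'}))

# The group keys live in the index, not the columns.
print(df.columns.tolist())       # ['threatened']

df = df.reset_index()            # promote the group keys back to columns
print(df.columns.tolist())       # ['continent', 'threat_type', 'threatened']
print(df['continent'].unique())  # ['Africa' 'Asia']
```

Passing as_index=False to groupby achieves the same thing without a separate reset_index call.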
I have a Pandas Dataframe created from a dictionary with the following code:
import pandas as pd
pd.set_option('max_colwidth', 150)
df = pd.DataFrame.from_dict(data, orient= 'index', columns = ['text'])
df
The output is as follows:
text
./form/2003Q4/0001041379_2003-12-15.html \n10-K\n1\ng86024e10vk.htm\nAFC ENTERPRISES\n\n\n\nAFC ENTERPRISES\n\n\n\nTable of Contents\n\n\n\n\n\n\n\nUNITED STATES SECURITIES AND EXCHANGE\n...
./form/2007Q2/0001303804_2007-04-17.html \n10-K\n1\na07-6053_210k.htm\nANNUAL REPORT PURSUANT TO SECTION 13 AND 15(D)\n\n\n\n\n\n\n \nUNITED\nSTATES\nSECURITIES AND EXCHANGE\nCOMMISSION...
./form/2007Q2/0001349848_2007-04-02.html \n10-K\n1\nff060310k.txt\n\n UNITED STATES\n SECURITIES AND EXCHANGE COMMISSION\n ...
./form/2014Q1/0001141807_2014-03-31.html \n10-K\n1\nf32414010k.htm\nFOR THE FISCAL YEAR ENDED DECEMBER 31, 2013\n\n\n\nf32414010k.htm\n\n\n\n\n\n\n\n\n\n\nUNITED STATES\nSECURITIES AND EX...
./form/2007Q2/0001341853_2007-04-02.html \n10-K\n1\na07-9697_110k.htm\n10-K\n\n\n\n\n\n\n \n \nUNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C. 20549\n \nFORM 10-K\n ...
I need to split the first column (the index) into three separate columns: Year & Qtr, CIK, and Filing Date. So the values in these columns from the first row would be: 2003Q4, 0001041379, 2003-12-15.
I think that if this was in a proper column that I could do this using code similar to Example #2 found here:
https://www.geeksforgeeks.org/python-pandas-split-strings-into-two-list-columns-using-str-split/
However I am thrown by the fact that it is the index that I need to split, and not a named column.
Is there a way to separate the index or do I need to somehow save this as another column, and is this possible?
I'd appreciate any help. I am a newbie, so I don't always understand the more difficult solutions. Thanks in advance.
The fact that the column is the index makes no difference when extracting components from it but you need to be careful when assigning those components back to the original dataframe.
# Extract the components from the index.
# pandas lets us name the output columns via named capture groups.
pattern = r'(?P<Quarter>\d{4}Q\d)\/(?P<CIK>\d+)_(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})'
tmp = df.index.str.extract(pattern) \
        .assign(Date=lambda x: pd.to_datetime(x[['Year', 'Month', 'Day']]))

# Since `df` and `tmp` are both DataFrames, assignment between them
# aligns on row labels. We want them to align by position (i.e.
# row 1 to row 1), so we convert the right-hand side to a NumPy array.
cols = ['Quarter', 'CIK', 'Date']
df[cols] = tmp[cols].values
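A self-contained sketch of those steps on a two-row toy index shaped like the question's paths (the 'text' values are placeholders):

```python
import pandas as pd

# Toy frame whose index mimics the question's file paths.
df = pd.DataFrame(
    {'text': ['first filing', 'second filing']},
    index=['./form/2003Q4/0001041379_2003-12-15.html',
           './form/2007Q2/0001303804_2007-04-17.html'])

pattern = (r'(?P<Quarter>\d{4}Q\d)\/(?P<CIK>\d+)_'
           r'(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})')

# Index.str.extract works just like Series.str.extract; it returns
# a DataFrame with a fresh RangeIndex.
tmp = (df.index.str.extract(pattern)
         .assign(Date=lambda x: pd.to_datetime(x[['Year', 'Month', 'Day']])))

# .values strips tmp's RangeIndex so assignment is by position,
# not by label.
cols = ['Quarter', 'CIK', 'Date']
df[cols] = tmp[cols].values

print(df['Quarter'].tolist())  # ['2003Q4', '2007Q2']
print(df['CIK'].tolist())      # ['0001041379', '0001303804']
```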
I'm trying to work out the correct method for cycling through a number of pandas dataframes using a 'for loop'. All of them contain 'year' columns from 1960 to 2016, and from each df I want to remove the columns '1960' to '1995'.
I created a list of dfs and also a list of str values for the years.
dflist = [apass, rtrack, gdp, pop]

dfnewlist = []
for i in range(1960, 1996):
    dfnewlist.append(str(i))

for df in dflist:
    df = df.drop(dfnewlist, axis=1)
My for loop runs without error, but it does not remove the columns.
Edit - Just to add, when I do this manually without the for loop, such as below, it works fine:
gdp = gdp.drop(dfnewlist, axis = 1)
This is a common issue with for loops. When you write

for df in dflist:

and then reassign df, the change does not happen to the actual object in the list, only to the name df.

Use enumerate to fix it:

for i, df in enumerate(dflist):
    dflist[i] = df.drop(dfnewlist, axis=1)
To make this more robust, you can pass the errors='ignore' flag so that if one of the columns doesn't exist, the drop won't error out.

However, your real problem is that when you loop, df starts out referring to the item in the list, but then you rebind the name df by assigning it the result of df.drop(dfnewlist, axis=1). That does not replace the dataframe in your list as you'd hoped; the name df simply no longer points to the item in the list.

Instead, you can use the inplace=True flag:

drop_these = [*map(str, range(1960, 1996))]

for df in dflist:
    df.drop(drop_these, axis=1, errors='ignore', inplace=True)
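A small demonstration of the in-place fix with two stand-in frames (gdp and pop here are invented, with only three year columns):

```python
import pandas as pd

# Stand-ins for the question's frames.
gdp = pd.DataFrame([[1, 2, 3]], columns=['1960', '1995', '2016'])
pop = pd.DataFrame([[4, 5, 6]], columns=['1960', '1995', '2016'])
dflist = [gdp, pop]

drop_these = [str(y) for y in range(1960, 1996)]

# inplace=True mutates the objects the list elements point at, so
# rebinding the loop variable never comes into play.
for df in dflist:
    df.drop(columns=drop_these, errors='ignore', inplace=True)

print(gdp.columns.tolist())  # ['2016']
print(pop.columns.tolist())  # ['2016']
```

Without errors='ignore' the drop would raise a KeyError here, since these toy frames lack most of the 1960-1995 columns.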
I have 2 dataframes of numerical data. Given a value from one of the columns in the second df, I would like to look up the index for the value in the first df. More specifically, I would like to create a third df, which contains only index labels - using values from the second to look up its coordinates from the first.
listso = [[21,101],[22,110],[25,113],[24,112],[21,109],[28,108],[30,102],[26,106],[25,111],[24,110]]
data = pd.DataFrame(listso,index=list('abcdefghij'), columns=list('AB'))
rollmax = pd.DataFrame(data.rolling(center=False,window=5).max())
So for the third df, I hope to use the values from rollmax and figure out which row they showed up in data. We can call this third df indexlookup.
For example, rollmax.loc['j','A'] = 30, so indexlookup.loc['j','A'] = 'g'.
Thanks!
You can build a Series with the indexing the other way around:
mapA = pd.Series(data.index, index=data.A)
Then mapA[rollmax.loc['j','A']] gives 'g'.
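Putting it together with the question's own data, a runnable sketch:

```python
import pandas as pd

listso = [[21, 101], [22, 110], [25, 113], [24, 112], [21, 109],
          [28, 108], [30, 102], [26, 106], [25, 111], [24, 110]]
data = pd.DataFrame(listso, index=list('abcdefghij'), columns=list('AB'))
rollmax = data.rolling(window=5).max()

# Reverse lookup: value in column A -> row label in data.
mapA = pd.Series(data.index, index=data.A)

# The rolling max over rows f..j of column A is 30, which occurred
# at row 'g'.
print(mapA[rollmax.loc['j', 'A']])  # g
```

One caveat: column A contains repeated values (21, 24, 25 each appear twice), so mapA of such a value returns all matching labels rather than a single one; the lookup is only single-valued where the value is unique, as it is for 30.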