Split the index of a Pandas Dataframe into separate columns - python

I have a Pandas Dataframe created from a dictionary with the following code:
import pandas as pd
pd.set_option('max_colwidth', 150)
df = pd.DataFrame.from_dict(data, orient='index', columns=['text'])
df
The output is as follows:
text
./form/2003Q4/0001041379_2003-12-15.html \n10-K\n1\ng86024e10vk.htm\nAFC ENTERPRISES\n\n\n\nAFC ENTERPRISES\n\n\n\nTable of Contents\n\n\n\n\n\n\n\nUNITED STATES SECURITIES AND EXCHANGE\n...
./form/2007Q2/0001303804_2007-04-17.html \n10-K\n1\na07-6053_210k.htm\nANNUAL REPORT PURSUANT TO SECTION 13 AND 15(D)\n\n\n\n\n\n\n \nUNITED\nSTATES\nSECURITIES AND EXCHANGE\nCOMMISSION...
./form/2007Q2/0001349848_2007-04-02.html \n10-K\n1\nff060310k.txt\n\n UNITED STATES\n SECURITIES AND EXCHANGE COMMISSION\n ...
./form/2014Q1/0001141807_2014-03-31.html \n10-K\n1\nf32414010k.htm\nFOR THE FISCAL YEAR ENDED DECEMBER 31, 2013\n\n\n\nf32414010k.htm\n\n\n\n\n\n\n\n\n\n\nUNITED STATES\nSECURITIES AND EX...
./form/2007Q2/0001341853_2007-04-02.html \n10-K\n1\na07-9697_110k.htm\n10-K\n\n\n\n\n\n\n \n \nUNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C. 20549\n \nFORM 10-K\n ...
I need to split the first column (the index) into three separate columns: Year & Qtr, CIK, and Filing Date. So the values in these columns from the first row would be: 2003Q4, 0001041379, 2003-12-15.
I think that if this were in a proper column, I could do this using code similar to Example #2 found here:
https://www.geeksforgeeks.org/python-pandas-split-strings-into-two-list-columns-using-str-split/
However, I am thrown by the fact that it is the index I need to split, not a named column.
Is there a way to split the index directly, or do I need to somehow save it as another column first, and is that possible?
I'd appreciate any help. I am a newbie, so I don't always understand the more difficult solutions. Thanks in advance.

The fact that the column is the index makes no difference when extracting components from it, but you need to be careful when assigning those components back to the original dataframe.
# Extract the components from the index.
# pandas lets us name the output columns via named capture groups.
pattern = r'(?P<Quarter>\d{4}Q\d)/(?P<CIK>\d+)_(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})'
tmp = (df.index.str.extract(pattern)
         .assign(Date=lambda x: pd.to_datetime(x[['Year', 'Month', 'Day']])))
# Since `df` and `tmp` are both DataFrames, assignment between them
# aligns on row labels. We want them to align by position (i.e.
# row 1 to row 1), so we convert the right-hand side to a NumPy array.
cols = ['Quarter', 'CIK', 'Date']
df[cols] = tmp[cols].values
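If you'd rather work on a regular column, you can also move the index into one with reset_index() and split from there. A minimal sketch, assuming the index strings all follow the ./form/YYYYQn/CIK_YYYY-MM-DD.html pattern shown in the question (the toy frame below is an assumption, not your real data):
import pandas as pd

# Toy frame mimicking the question's index (single row, text shortened)
df = pd.DataFrame({'text': ['\n10-K\n1\n...']},
                  index=['./form/2003Q4/0001041379_2003-12-15.html'])

# Turn the index into an ordinary column named 'index', then extract from it
tmp = df.reset_index()
tmp[['Quarter', 'CIK', 'Date']] = tmp['index'].str.extract(
    r'(?P<Quarter>\d{4}Q\d)/(?P<CIK>\d+)_(?P<Date>\d{4}-\d{2}-\d{2})')
print(tmp[['Quarter', 'CIK', 'Date']])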


python pandas: how to modify column header name and modify the date format

Using Python pandas, how can we change the data frame?
First, how to copy the column name down to the other cell (blue)?
Second, how to delete the row and index column (orange)?
Third, how to modify the date format (green)?
I would appreciate any feedback~~
Update
df.iloc[1,1] = df.columns[0]
df = df.iloc[1:].reset_index(drop=True)
df.columns = df.iloc[0]
df = df.drop(df.index[0])
df = df.set_index('Date')
print(df.columns)
Question 1 - How to copy column name to a column (Edit: Rename column)
To rename a column, use pandas.DataFrame.rename:
df.columns = ['Date', 'Asia Pacific Equity Fund']
# Here the list size should be 2 because you have 2 columns
# Or rename using pandas.DataFrame.rename
df.rename(columns={'Asia Pacific Equity Fund': 'Date', 'Unnamed: 1': 'Asia Pacific Equity Fund'}, inplace=True)
df.columns returns all the columns of the dataframe, and you can access each column name by its index.
Please refer to Rename unnamed column pandas dataframe to change unnamed columns.
Question 2 - Delete a row
# Keep the rows from the second one onward
df = df.iloc[1:].reset_index(drop=True)
# To remove specific rows by label
df = df.drop([0, 1]).reset_index(drop=True)
Question 3 - Modify the date format
current_format = '%Y-%m-%d %H:%M:%S'
desired_format = '%Y-%m-%d'
# Let pandas parse the dates, then reformat them
df['Date'] = pd.to_datetime(df['Date']).dt.strftime(desired_format)
# Or pass the existing format explicitly
df['Date'] = pd.to_datetime(df['Date'], format=current_format).dt.strftime(desired_format)
# To update the date format of the index
df.index = pd.to_datetime(df.index, format=current_format).strftime(desired_format)
Please refer to pandas.to_datetime for more details.
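As a quick self-contained check of the reformatting step (the sample timestamp below is assumed):
import pandas as pd

s = pd.Series(['2020-01-31 00:00:00'])
# Parse with the known input format, then render the desired one
print(pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S').dt.strftime('%Y-%m-%d'))
# 0    2020-01-31
# dtype: object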
I'm not sure I understand your questions. I mean, do you actually want to change the dataframe, or just how it is printed/displayed?
Indexes can be changed using the methods .set_index() or .reset_index(), or can be dropped altogether. If you just want to remove the first digit from each index (that's what I understood from the orange column), you should then create a list with the new indexes and pass it as a column to your dataframe.
Regarding the date format, it depends on what you want the changed format to become. Take a look into Python's datetime.
I would strongly suggest you take a closer look at pandas' features and documentation, and at how to handle a dataframe with this library. There are plenty of great sources a Google search away :)
Delete the first two rows using this.
Rename the second column using this.
Work with datetime format using the datetime package. Read about it here
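For the plain-Python route, a minimal datetime round trip might look like this (the format strings are assumed from the screenshot):
from datetime import datetime

# Parse the original timestamp, then re-render it as a bare date
d = datetime.strptime('2020-01-31 00:00:00', '%Y-%m-%d %H:%M:%S')
print(d.strftime('%Y-%m-%d'))  # 2020-01-31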

Pandas colnames not found after grouping and aggregating

Here is my data
threats = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-18/threats.csv', index_col = 0)
And here is my code -
df = (threats
      .query('threatened > 0')
      .groupby(['continent', 'threat_type'])
      .agg({'threatened': 'size'}))
However, df.columns returns only Index(['threatened'], dtype='object'). That is, only the threatened column shows up, not the columns I actually grouped by, i.e. continent and threat_type, although they are present in my data frame.
I would like to perform operations on the continent column of my data frame, but it is not showing as one of the columns. For example, continents = df.continent.unique() gives me a KeyError saying continent is not found.
After a groupby, pandas puts the groupby columns into the index. Always reset the index after doing a groupby in pandas, and don't pass drop=True.
After your code:
df = df.reset_index()
And then you will get the required columns.
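Alternatively, passing as_index=False to groupby keeps the grouping keys as regular columns from the start, so no reset is needed. A sketch reusing the threats frame from the question:
# Named aggregation keeps 'threatened' as a column name;
# as_index=False keeps 'continent' and 'threat_type' as columns
df = (threats
      .query('threatened > 0')
      .groupby(['continent', 'threat_type'], as_index=False)
      .agg(threatened=('threatened', 'size')))
continents = df['continent'].unique()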

Pandas Dataframes - Combine two Dataframes but leave out entry with same column

I'm trying to create a DataFrame out of two existing ones. I read the titles of some articles on the web; the first column is the title and the ones after are timestamps.
I want to concat both data frames but leave out the rows with the same title (column one).
I tried
df = pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
but because the other columns may not be exactly the same every time, I need to leave out every row that has the same first column. How would I do this?
btw sorry for not knowing all the right terms for my problem
You should first remove the duplicate rows from df2 and then concat it with df1:
df = pd.concat([df1, df2[~df2.title.isin(df1.title)]]).reset_index(drop=True)
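An equivalent sketch, still assuming the column is literally named title as above, lets drop_duplicates do the filtering after the concat:
# Rows from df1 come first, so keep='first' makes df1 win any title clash
df = (pd.concat([df1, df2])
        .drop_duplicates(subset='title', keep='first')
        .reset_index(drop=True))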
This probably solves your problem:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(2 * 5).reshape(2, 5))
df2 = pd.DataFrame(np.arange(2 * 5).reshape(2, 5))
df.columns = ['blah1', 'blah2', 'blah3', 'blah4', 'blah']
df2.columns = ['blah5', 'blah6', 'blah7', 'blah8', 'blah']

# Drop from df2 any column whose name also appears in df
for i in range(len(df.columns)):
    for j in range(len(df2.columns)):
        if df.columns[i] == df2.columns[j]:
            df2 = df2.drop(df2.columns[j], axis=1)
            break  # the column is gone, so stop scanning df2 for this name
print(pd.concat([df, df2], axis=1))
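The same column de-duplication can be sketched without the nested loops, reusing the df and df2 frames above:
# Drop from df2 every column name that df already has, then concat
overlap = df2.columns.intersection(df.columns)
print(pd.concat([df, df2.drop(columns=overlap)], axis=1))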

Pandas Series from two-columned DataFrame produces a Series of NaN's

state_codes = pd.read_csv('name-abbr.csv', header=None)
state_codes.columns = ['State', 'Code']
codes = state_codes['Code']
states = pd.Series(state_codes['State'], index=state_codes['Code'])
name-abbr.csv is a two-columned CSV file of US state names in the first column and postal codes in the second: "Alabama" and "AL" in the first row, "Alaska" and "AK" in the second, and so forth.
The above code correctly sets the index, but the Series is all NaN. If I don't set the index, the state names correctly show. But I want both.
I also tried this line:
states = pd.Series(state_codes.iloc[:,0], index=state_codes.iloc[:,1])
Same result. How do I get this to work?
The reason is called alignment: pandas tries to match the existing index of state_codes['State'] with the new index built from state_codes['Code'], and because they differ you get missing values in the output. To prevent this, it is necessary to convert the Series to a NumPy array:
states = pd.Series(state_codes['State'].to_numpy(), index=state_codes['Code'])
Or you can use DataFrame.set_index:
states = state_codes.set_index('Code')['State']
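To see the pitfall and the fix side by side on toy data (a two-row stand-in for name-abbr.csv):
import pandas as pd

state_codes = pd.DataFrame({'State': ['Alabama', 'Alaska'],
                            'Code': ['AL', 'AK']})
bad = pd.Series(state_codes['State'], index=state_codes['Code'])  # all NaN
good = pd.Series(state_codes['State'].to_numpy(), index=state_codes['Code'])
print(good)
# Code
# AL    Alabama
# AK     Alaska
# dtype: object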

pandas, python, excel, search for substring in column of df 1 to write string to column in df2

I am using the package pandas in Python to read from and write to Excel spreadsheets. I have created 2 different dataframes (df1 and df2) whose cells are all of the data type string. df1 has over 50,000 rows. Many cells in each column of df1 were NaN, and I have converted them to a string that says "Empty". df2 has over 9,000 rows. Every row in WHSE_Nbr and WHSE_Desc_HR contains an accurate string value. Only some rows have values other than the string "Empty" in the last 2 columns of df2. The Warehouse column in df1 has many cells containing names with only words. The rows of the Warehouse column in df1 that I am interested in identifying are the ones that contain any of the warehouse numbers found in the WHSE_Nbr column of df2.
Example of dataframe1 - df1
Job    Warehouse          GeneralDescription      Purpose
Empty  AP                 Accounts Payable        Accounting
Empty  Empty              Empty                   Empty
Empty  Cyber Security GA  Security & Compliance   Data Security
Empty  Merch|04-1854      Empty                   Empty
Empty  WH -1925           Empty                   Empty
Empty  Montreal-10        Empty                   Empty
Empty  canada| 05-4325    Empty                   Empty
Example of dataframe2 - df2
WHSE_Nbr  WHSE_Desc_HR         WHSE_Desc_AD     WHSE_Abrv
1         Technology                            Tech
2         Finance
...       ...
10        Recruiting           Campus Outreach
1854      Community Relations
...       ...
1925      HumanResources
4325      Global People
9237      International                         Tech
So I want to iterate through all rows of the Warehouse column of df1 to search for WHSE numbers that appear in the WHSE_Nbr column of df2. In this example, I would want my code to find 1854 in the Warehouse column of df1, map that number to the associated cell in the WHSE_Desc_HR column of df2, and write "Community Relations" in the GeneralDescription column of df1 (in the same row that contains the substring "1854" in the Warehouse column). It would also write "HumanResources" to the GeneralDescription column in the same row the substring "1925" appears in the Warehouse column. When the iteration reaches "Montreal-10", I would want my code to write "Campus Outreach" to the GeneralDescription column of df1, since if there is a value in WHSE_Desc_AD of df2, it serves as an override to what is in the WHSE_Desc_HR column of df2.
I have become familiar enough with pandas to read Excel files (.xlsx), make the data frames, change data types within the data frame for iteration purposes, and view the data frames, but I can't figure out the most effective and efficient way to structure this code to accomplish this goal.
I had to edit this question just now because I realized I left out something very important. Any time a number appears in the Warehouse column, the number I want to match always follows a hyphen or dash (-). So in df1, the Warehouse row that says "canada| 05-4325" should recognize 4325, match it with df2, and write "Global People" to the GeneralDescription column in df1. Sorry guys. Help is so much appreciated, and the two answers below make a very good start. Thanks
import pandas as pd
import numpy as np

excel_file = '/Users/cbri/anaconda3/WHSE_gen.xlsx'
df1 = pd.read_excel(excel_file, usecols=[1, 5, 6, 7])
excel_file = '/Users/cbri/PycharmProjects/True_Dept/HR_excel.xlsx'
df2 = pd.read_excel(excel_file)
df1 = df1.replace(np.nan, 'Empty', regex=True)
df2 = df2.replace(np.nan, 'Empty', regex=True)
df1 = pd.DataFrame(df1, dtype='str')
df2 = pd.DataFrame(df2, dtype='str')
# yeah I need a push in the right direction, guess I should use iteritems()?
# for column in df1:
#     if (df1['Warehouse'])
# so I got as far as returning all records that contain the substring "1854",
# but obviously that's without the for and if statement above
df1[df1['Warehouse'].str.contains('1854', na=False)]
What I would do is write a regex to extract the numbers from your column, join the tables, and maybe do the rest in Excel... (the column updates)
df1 = pd.DataFrame({'Department': ['Merch - 1854', '1925 - WH', 'Montreal 10'],
                    'TrueDepartment': ['Empty', 'Empty', 'Empty']})
df2 = pd.DataFrame({'Dept_Nbr': [1854, 1925, 10],
                    'Dept_Desc_HR': ['Community Relations', 'Human Resources', 'Recruiting']})
Then here you can try what the function does:
import re

line = 'Merch - 1854 '
match = re.search(r'[0-9]+', line)
if match is None:
    print(0)
else:
    print(int(match[0]))
If you need the match after a character, as specified in your comment, use this one:
line = '12125 15151 Merch -1854 '
match = re.search(r'(?<=-)[0-9]+', line)
if match is None:
    print(0)
else:
    print(int(match[0]))
Note that if you have spaces or other characters after the "-", you need to add them to the regex for it to work!
Important: this assumes you only have one number in your text; if there is none, it returns 0. You can change that as you wish; the point is that at least it doesn't fail.
Write the function:
def extract_number(field):
    match = re.search(r'(?<=-)[0-9]+', field)
    if match is None:
        return 0
    else:
        return int(match[0])
Apply to dataframe:
df1['num_col'] = df1['Department'].apply(extract_number)
Lastly, do the join:
df1.merge(df2, left_on='num_col', right_on='Dept_Nbr')
From here you can figure out which columns you need, whether here in Python or in Excel.
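If your real df2 also carries the override column, one sketch of the "AD wins over HR" rule after the merge, reusing df1 with num_col from above (Dept_Desc_AD here is hypothetical, standing in for WHSE_Desc_AD from the question):
import numpy as np
import pandas as pd

# Toy df2 extended with a hypothetical override column
df2 = pd.DataFrame({'Dept_Nbr': [1854, 1925, 10],
                    'Dept_Desc_HR': ['Community Relations', 'Human Resources', 'Recruiting'],
                    'Dept_Desc_AD': ['', '', 'Campus Outreach']})
merged = df1.merge(df2, left_on='num_col', right_on='Dept_Nbr')
# Use the AD description where it is non-empty, otherwise fall back to HR
merged['TrueDepartment'] = np.where(merged['Dept_Desc_AD'].ne(''),
                                    merged['Dept_Desc_AD'],
                                    merged['Dept_Desc_HR'])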
Try this:
numbers = df2['Dept_Nbr'].tolist()
df2['Dept_Nbr'] = [int(i) for i in df2['Dept_Nbr']]
df2 = df2.set_index('Dept_Nbr')
for n in numbers:
    for i in df1.index:
        if str(n) in df1.at[i, 'Department']:
            if df2.at[int(n), 'Dept_Desc_AD']:  # if a value exists
                df1.at[i, 'TrueDepartment'] = df2.at[int(n), 'Dept_Desc_AD']
            else:
                df1.at[i, 'TrueDepartment'] = df2.at[int(n), 'Dept_Desc_HR']
