Remove rows from a dataframe until the actual column names are found - python

I am reading tabular data from an email into a pandas dataframe.
There is no guarantee that the column names will be in the first row. Sometimes the data is in the following format.
The column names that will always be there are [ID, Name, Year]. Sometimes there can be additional columns such as "Age":
dummy1 dummy2 dummy3 dummy4
test_column1 test_column2 test_column3 test_column4
ID Name Year Age
1 John Sophomore 20
2 Lisa Junior 21
3 Ed Senior 22
Sometimes the column names come in the first row as expected.
ID Name Year
1 John Sophomore
2 Lisa Junior
3 Ed Senior
Once I read the HTML table from the email, how can I remove the initial rows that don't contain the column names ["ID", "Name", "Year"]?
So in the first case I would need to remove the first 2 rows in the dataframe (including the column row), and in the second case I wouldn't have to remove anything.
Also, the column names can be in any sequence, and they can be variable. But these 3 columns will always be there: ["ID", "Name", "Year"].
If I do the following, it only works if the dataframe contains only the 3 columns ["ID", "Name", "Year"]:
col_index = df.index[(df == ["ID","Name","Year"]).all(1)].item() # get columns index
df.columns = df.iloc[col_index].to_numpy() # set valid columns
df = df.iloc[col_index + 1 :]
I should be able to fetch the index of the header row as long as that row contains any of these 3 columns ["ID", "Name", "Year"].
How can I achieve this?
I tried
col_index = df.index[(["ID","Name","Year"] in df).any(1)].item()
But I am getting an error.

You could stack the dataframe and use isin to find the header row.
IIUC, a small function could work. (Personally, I'd change this to pass in your file I/O read method and return a dataframe starting at that header row.)
# make sure your read method uses header=None, e.g. pd.read_csv(..., header=None)
def find_columns(dataframe, cols) -> int:
    # stack all cells into a single Series, then take the row label
    # of the first cell whose value is one of the expected column names
    stack_df = dataframe.stack()
    header_row = stack_df[stack_df.isin(cols)].index.get_level_values(0)[0]
    return header_row

header_row = find_columns(df, ["Age", "Year", "ID", "Name"])
new_df = pd.read_csv(file, skiprows=header_row)
ID Name Year Age
0 1 John Sophomore 20
1 2 Lisa Junior 21
2 3 Ed Senior 22
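If you'd rather not re-read the file, a minimal sketch that reuses the found index to promote the header row in place (assuming the frame was read with header=None, so its index is the default RangeIndex):
header_row = find_columns(df, ["Age", "Year", "ID", "Name"])
df.columns = df.iloc[header_row]  # promote that row to column names
df = df.iloc[header_row + 1:].reset_index(drop=True)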

Related

create a scoring value based on filtering in pandas

Imagine we have 2 dataframes. df1 has 1000 rows and 12 columns and holds a 12-month projection for each student. The other dataframe (1000 by 15) has the credentials of all of our students. What I want is to implement the following logic:
1. group the students by id and country
2. if in df2 the column "test" has the value "not given", then give a score of 0 for all 12 months in df1
3. if in df2 the column "test" has the value "given", then I should count the "given" per id and country (the group by in step 1)
I have done the following:
# setting the filters
given = df2['test'] == 'not given'      # note: this mask selects the 'not given' rows
not_given = df2['test'] != 'not given'
# give the score 0 in all columns of df1 based on the filtering of df2
df1.loc[given] = 0
# try to find the number of students that have the same id and country based on the groupby in step 1
temp_df2 = df2[not_given]
groups = pd.DataFrame(temp_df2.groupby(["name", "country"]).size())
When I do this I create a dataframe that has "name" and "country" as a two-level index. Now I do the following
groups.reset_index(level=0, inplace=True)
groups.reset_index(level=0, inplace=True)
groups.rename(columns = {0:'counts'}, inplace = True)
Now I have a dataframe of the following form
groups
name country counts
Alex Japan 2
George Italy 1
Now I want to find the rows in df2 that have the same name and country, and in the corresponding rows of df1 set 10 divided by the value of "counts" from the groups dataframe. For example:
I have to look in df2 for the rows that have the name Alex and the country Japan, and
then in df1 I should do 10/2, thus inserting 5 in all 12 columns for the rows that I found in step 1.
However, I am not quite sure how to do this matching so I can make the division.
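A sketch of one way to do the matching and the division, assuming df1 and df2 are row-aligned (the same student occupies the same row in both frames, which df1.loc[given] = 0 above already assumes) and that df1's 12 columns are all numeric:
import pandas as pd

# count the "given" rows per (name, country) group
counts = (
    df2.loc[df2['test'] != 'not given']
       .groupby(['name', 'country'])
       .size()
       .rename('counts')
       .reset_index()
)

# attach each row's group count to df2; rows whose group has no
# "given" entries get NaN here
df2m = df2.merge(counts, on=['name', 'country'], how='left')

# per-row score: 10 / count for "given" rows, 0 for "not given" rows
scores = (10 / df2m['counts']).where(df2m['test'] != 'not given', 0)

# broadcast each row's score into all 12 month columns of df1
df1 = df1.mul(0).add(scores.set_axis(df1.index), axis=0)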

List of Python Dataframe column values meeting criteria in another dataframe?

I have two dataframes, df1 and df2, which have a common column heading, Name. The Name values are unique within df1 and df2. df1's Name values are a subset of those in df2; df2 has more rows -- about 17,300 -- than df1 -- about 6,900 -- but each Name value in df1 is in df2. I would like to create a list of Name values in df1 that meet certain criteria in other columns of the corresponding rows in df2.
Example:
df1:
   Name  Age    Hair
0   Jim   25   black
1  Mary   58   brown
3   Sue   15  purple
df2:
    Name Country  phoneOS
0  Shari      GB  Android
1    Jim      US  Android
2  Alain      TZ      iOS
3    Sue      PE      iOS
4   Mary      US  Android
I would like a list of only those Name values in df1 that have df2 Country and phoneOS values of US and Android. The example result would be [Jim, Mary].
I have successfully selected rows within one dataframe that meet multiple criteria in order to copy those rows to a new dataframe. In that case pandas/Python does the iteration over rows internally. I guess I could write a "manual" iteration over the rows of df1 and access df2 on each iteration. I was hoping for a more efficient solution whereby the iteration was handled internally as in the single-dataframe case. But my searches for such a solution have been fruitless.
try:
df_1.loc[df_1.Name.isin(
    df_2.loc[df_2.Country.eq('US') & df_2.phoneOS.eq('Android'), 'Name']
), 'Name']
Result:
0 Jim
1 Mary
Name: Name, dtype: object
If you want the result as a list, just add .to_list() at the end.
Alternatively, merge the two frames on Name and filter:
data = df1.merge(df2, on='Name')
data.loc[(data.phoneOS == 'Android') & (data.Country == 'US'), 'Name'].values.tolist()

Python Merging data frames

In python, I have a df that looks like this
Name ID
Anna 1
Polly 1
Sarah 2
Max 3
Kate 3
Ally 3
Steve 3
And a df that looks like this
Name ID
Dan 1
Hallie 2
Cam 2
Lacy 2
Ryan 3
Colt 4
Tia 4
How can I merge the df’s so that the ID column looks like this
Name ID
Anna 1
Polly 1
Sarah 2
Max 3
Kate 3
Ally 3
Steve 3
Dan 4
Hallie 5
Cam 5
Lacy 5
Ryan 6
Colt 7
Tia 7
This is just a minimal reproducible example. My actual data set has thousands of values. I'm basically merging data frames and want the IDs in numerical order (a continuation of the previous data frame) instead of restarting from one each time. I know that I can reset the index if ID is a unique identifier. But in this case, more than one person can have the same ID. So how can I account for that?
From the example that you have provided above, you can observe that we can obtain the final dataframe by adding the maximum value of ID in the first df to the second df's IDs and then concatenating them. To explain this better:
Name  df2  final_df
Dan     1         4
This value in final_df is obtained by doing 1 + (max value of ID from df1, i.e. 3), and this pattern is followed for all entries of the dataframe.
Code:
import pandas as pd

df = pd.DataFrame({'Name': ['Anna','Polly','Sarah','Max','Kate','Ally','Steve'], 'ID': [1,1,2,3,3,3,3]})
df1 = pd.DataFrame({'Name': ['Dan','Hallie','Cam','Lacy','Ryan','Colt','Tia'], 'ID': [1,2,2,2,3,4,4]})

max_df = df['ID'].max()          # highest ID in the first frame (3 here)
df1['ID'] = df1['ID'] + max_df   # shift the second frame's IDs past it
final_df = pd.concat([df, df1])
print(final_df)
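Note that pd.concat keeps each frame's own row index, so the combined frame's index here runs 0-6 twice. If you want a fresh 0-13 index instead, pass ignore_index=True:
final_df = pd.concat([df, df1], ignore_index=True)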

how to split the data in a column based on multiple delimiters, into multiple columns, in pandas

I have a dataframe with only one column, named 'ALL_category'. There are multiple names in a row, ranging between 1 and 3 and separated by the delimiters '|', '||' or '|||', which can be at the beginning, in between, or at the end of the words in every row. I want to split the column into multiple columns such that the new columns contain the names. How can I do it?
Below is the code to generate the dataframe:
x = {'ALL Categories': ['Rakesh||Ramesh|','||Rajesh|','HARPRIT|||','Tushar||manmit|']}
df = pd.DataFrame(x)
When I used the code below to modify the above dataframe, it didn't give me any result.
data = data.ALL_HOLDS.str.split(r'w', expand = True)
I believe you need Series.str.extractall if you want each word in a separate column:
df1 = df['ALL Categories'].str.extractall(r'(\w+)')[0].unstack()
print (df1)
match 0 1
0 Rakesh Ramesh
1 Rajesh NaN
2 HARPRIT NaN
3 Tushar manmit
Or a slightly changed version of @Chris A's code from the comments, with Series.str.strip and Series.str.split by one or more |:
df1 = df['ALL Categories'].str.strip('|').str.split(r'\|+', expand=True)
print (df1)
0 1
0 Rakesh Ramesh
1 Rajesh None
2 HARPRIT None
3 Tushar manmit
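As a side note, the original attempt data.ALL_HOLDS.str.split(r'w', expand=True) splits on the literal character 'w' (the backslash is missing from the regex), and even r'\w' would split on word characters rather than on the '|' delimiters, which is why it gave no useful result.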

new pandas dataframe row based on existing data frame?

Is it possible to create a row based on the values of a previous row?
Let's say
Name Location Amount
1 xyz london 23423
is a row in a DF, and I want to scan the DF; if Amount > 2000 and Location == london, I want to append another row that keeps the location and amount of row 1 but changes the name to EEE.
As per my note, I would like the output to be the same DF but this:
Name Location Amount
1 xyz london 23423
2 EEE london 23423
You can slice the dataframe based on the conditions, then change the name.
df2 = df[df.Location.eq('london') & df.Amount.gt(2000)].reset_index(drop=True)
df2.Name = 'EEE'
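If you then want those new rows appended to the original frame, as the question describes, a short sketch (using pd.concat, since DataFrame.append is deprecated in recent pandas):
df = pd.concat([df, df2], ignore_index=True)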
