new pandas dataframe row based on existing data frame? - python

Is it possible to create a row based on values of a previous row?
Let's say
Name Location Amount
1 xyz london 23423
is a row in a DF, and I want to scan the DF: if Amount > 2000 and Location == london, I want to append another row that keeps the Location and Amount of row 1 but changes the Name to EEE.
As per my note, I would like the output to be the same DF, but like this:
Name Location Amount
1 xyz london 23423
2 EEE london 23424

You can slice the dataframe based on the conditions, then change the name.
df2 = df[df.Location.eq('london') & df.Amount.gt(2000)].reset_index(drop=True)
df2.Name = 'EEE'
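To produce the combined frame shown in the question, you can then append the slice back onto the original, e.g. with pd.concat (a minimal, self-contained sketch using the sample row):
import pandas as pd

df = pd.DataFrame({'Name': ['xyz'], 'Location': ['london'], 'Amount': [23423]})

# slice the qualifying rows and rename them
df2 = df[df.Location.eq('london') & df.Amount.gt(2000)].reset_index(drop=True)
df2.Name = 'EEE'

# append the new rows to the original frame
out = pd.concat([df, df2], ignore_index=True)
print(out)
#   Name Location  Amount
# 0  xyz   london   23423
# 1  EEE   london   23423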

Related

create a scoring value based on filtering in pandas

Imagine we have 2 dataframes. df1 has 1000 rows and 12 columns and holds a 12-month projection for each student. The other dataframe (1000 by 15) has the credentials of all of our students. What I want is to implement the following logic:
1. Group the students by id and country.
2. If in df2 the column "test" has the value "not given", give a score of 0 for all 12 months in df1.
3. If in df2 the column "test" has the value "given", count the "given" rows per id and country (the group by in step 1).
I have done the following:
# set up the filters (note: a variable name cannot contain a space)
not_given = df2['test'] == 'not given'
given = df2['test'] != 'not given'
# give the score 0 in all columns of df1 based on the filtering of df2
df1.loc[not_given] = 0
# find the number of students that share an id and country (the groupby from step 1)
temp_df2 = df2[given]
groups = pd.DataFrame(temp_df2.groupby(["name", "country"]).size())
When I do this I create a dataframe whose index is the combination of "name" and "country" (a MultiIndex). Now I do the following:
groups.reset_index(inplace=True)  # move both index levels back into columns
groups.rename(columns={0: 'counts'}, inplace=True)
Now I have a dataframe of the following form
groups
name country counts
Alex Japan 2
George Italy 1
Now I want to find the rows in df2 that have the same name and country and, in the corresponding rows of df1, set 10 divided by the value of "counts" from the groups dataframe. For example:
I have to look in df2 for the rows that have the name Alex and the country Japan, and
then in df1 I should compute 10/2 and insert 5 in all 12 columns for the rows found above.
However, I am not quite sure how to do this matching so I can make the division.
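One possible approach (a sketch, not a verified answer): merge the counts back onto df2 so every row carries its group size, then broadcast 10 / counts into the month columns of df1. This assumes df1 and df2 are row-aligned with a default integer index, and that the 12 projection columns have the hypothetical names month_1 … month_12:
import numpy as np

# hypothetical names for the 12 projection columns of df1
month_cols = [f'month_{i}' for i in range(1, 13)]

# attach each row's group size by merging the counts back onto df2
counts = df2.merge(groups, on=['name', 'country'], how='left')['counts']

# broadcast 10 / counts into all 12 month columns for the 'given' rows
# (assumes df1 rows line up with df2 rows)
scores = np.tile((10 / counts[given]).to_numpy()[:, None], (1, len(month_cols)))
df1.loc[given, month_cols] = scores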

List of Python Dataframe column values meeting criteria in another dataframe?

I have two dataframes, df1 and df2, which share a common column heading, Name. The Name values are unique within df1 and within df2. df1's Name values are a subset of those in df2: df2 has more rows (about 17,300) than df1 (about 6,900), but every Name value in df1 is in df2. I would like to create a list of the Name values in df1 that meet certain criteria in other columns of the corresponding rows in df2.
Example:
df1:
   Name  Age  Hair
0  Jim   25   black
1  Mary  58   brown
3  Sue   15   purple
df2:
   Name   Country  phoneOS
0  Shari  GB       Android
1  Jim    US       Android
2  Alain  TZ       iOS
3  Sue    PE       iOS
4  Mary   US       Android
I would like a list of only those Name values in df1 that have df2 Country and phoneOS values of US and Android. The example result would be [Jim, Mary].
I have successfully selected rows within one dataframe that meet multiple criteria in order to copy those rows to a new dataframe. In that case pandas/Python does the iteration over rows internally. I guess I could write a "manual" iteration over the rows of df1 and access df2 on each iteration. I was hoping for a more efficient solution whereby the iteration was handled internally as in the single-dataframe case. But my searches for such a solution have been fruitless.
Try:
df1.loc[df1.Name.isin(df2.loc[df2.Country.eq('US') &
                              df2.phoneOS.eq('Android'), 'Name']), 'Name']
Result:
0 Jim
1 Mary
Name: Name, dtype: object
If you want the result as a list, just add .to_list() at the end.
Alternatively, merge the two frames on Name and then filter:
data = df1.merge(df2, on='Name')
data.loc[(data.phoneOS == 'Android') & (data.Country == 'US'), 'Name'].values.tolist()
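A self-contained version of the isin approach on the sample data (a minimal sketch):
import pandas as pd

df1 = pd.DataFrame({'Name': ['Jim', 'Mary', 'Sue'],
                    'Age': [25, 58, 15],
                    'Hair': ['black', 'brown', 'purple']})
df2 = pd.DataFrame({'Name': ['Shari', 'Jim', 'Alain', 'Sue', 'Mary'],
                    'Country': ['GB', 'US', 'TZ', 'PE', 'US'],
                    'phoneOS': ['Android', 'Android', 'iOS', 'iOS', 'Android']})

# names in df2 matching both criteria, then filter df1 by membership
wanted = df2.loc[df2.Country.eq('US') & df2.phoneOS.eq('Android'), 'Name']
print(df1.loc[df1.Name.isin(wanted), 'Name'].to_list())  # ['Jim', 'Mary']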

Column value from first df to another df based on condition

I have an original df with a column "average", which holds the average value computed per country. Now I have new_df, where I want to add these average values from df, matched on country.
df
id country value average
1 USA 3 2
2 UK 5 5
3 France 2 2
4 USA 1 2
new_df
country average
USA 2
Italy Nan
I had a solution that worked, but there is a problem when new_df contains a country for which I have not computed the average yet. In that case I want to fill in just NaN.
Can you please recommend a solution?
Thanks
If you need to add the average column to new_df, use DataFrame.merge with DataFrame.drop_duplicates:
new_df.merge(df.drop_duplicates('country')[['country','average']], on='country', how='left')
If you need to aggregate the mean instead:
new_df.join(df.groupby('country')['average'].mean(), on='country')
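For example, with the sample data (a minimal sketch), the left merge keeps every row of new_df, so Italy gets NaN:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'country': ['USA', 'UK', 'France', 'USA'],
                   'value': [3, 5, 2, 1],
                   'average': [2, 5, 2, 2]})
new_df = pd.DataFrame({'country': ['USA', 'Italy']})

out = new_df.merge(df.drop_duplicates('country')[['country', 'average']],
                   on='country', how='left')
print(out)
#   country  average
# 0     USA      2.0
# 1   Italy      NaN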

Remove the rows from dataframe till the actual column names are found

I am reading tabular data from an email into a pandas dataframe.
There is no guarantee that the column names will be in the first row. Sometimes the data is in the following format.
The column names that will always be there are [ID, Name, Year]. Sometimes there can be additional columns such as "Age".
dummy1 dummy2 dummy3 dummy4
test_column1 test_column2 test_column3 test_column4
ID Name Year Age
1 John Sophomore 20
2 Lisa Junior 21
3 Ed Senior 22
Sometimes the column names come in the first row as expected.
ID Name Year
1 John Sophomore
2 Lisa Junior
3 Ed Senior
Once I read the HTML table from the email, how can I remove the initial rows that don't contain the column names ["ID","Name","Year"]?
So in the first case I would need to remove the first 2 rows of the dataframe (including the column row), and in the second case I wouldn't have to remove anything.
Also, the column names can be in any sequence, and they can vary. But these 3 columns will always be there: ["ID","Name","Year"].
If I do the following, it only works if the dataframe contains only the 3 columns ["ID","Name","Year"]:
col_index = df.index[(df == ["ID","Name","Year"]).all(1)].item() # get columns index
df.columns = df.iloc[col_index].to_numpy() # set valid columns
df = df.iloc[col_index + 1 :]
I should be able to fetch the corresponding column index as long as the row contains any of these 3 columns ["ID","Name","Year"]
How can I achieve this?
I tried
col_index = df.index[(["ID","Name","Year"] in df).any(1)].item()
But I am getting an error.
You could stack the dataframe and use isin to find the header row.
IIUC, a small function could work. (Personally, I'd change this to accept your file I/O read method and return a dataframe starting at that header row.)
# make sure your read call uses header=None, e.g. pd.read_csv(file, header=None)
def find_columns(dataframe, cols):
    # stack to one value per (row, column) pair, then take the first row label
    # whose value matches one of the expected header names
    stack_df = dataframe.stack()
    header_row = stack_df[stack_df.isin(cols)].index.get_level_values(0)[0]
    return header_row

header_row = find_columns(df, ["Age", "Year", "ID", "Name"])
new_df = pd.read_csv(file, skiprows=header_row)
ID Name Year Age
0 1 John Sophomore 20
1 2 Lisa Junior 21
2 3 Ed Senior 22
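If you'd rather not re-read the file, a sketch of the in-memory equivalent (it mirrors the slicing approach from the question and assumes df was read with header=None):
header_row = find_columns(df, ["ID", "Name", "Year"])
df.columns = df.iloc[header_row].to_numpy()    # promote the detected row to the header
df = df.iloc[header_row + 1:].reset_index(drop=True)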

How to delete rows from a csv file?

I was able to pull the rows that I would like to delete from a CSV file, but I can't make the drop() function work.
data = pd.read_csv(next(iglob('*.csv')))
data_top = data.head()
data_top = data_top.drop(axis=0)
What needs to be added?
Example of a CSV file. It should delete everything until it reaches the Employee column.
creation date Unnamed: 1 Unnamed: 2
0 NaN type of client NaN
1 age NaN NaN
2 NaN birth date NaN
3 NaN NaN days off
4 Employee Salary External
5 Dan 130e yes
6 Abraham 10e no
7 Richmond 201e third-party
If it is just the top 5 rows you want to delete, then you can do it as follows:
data = pd.read_csv(next(iglob('*.csv')))
data.drop([0,1,2,3,4], axis=0, inplace=True)
Along with axis, you should pass the labels to drop: either a single label or a list (row indexes for axis=0, column names for axis=1).
There are, of course, many other ways to achieve this too, especially if the index of the rows you want to delete is not just the top 5.
Edit: inplace added as pointed out in the comments.
Considering the comments and further explanations, assuming you know the name of the column and that you have a positional index, you can try the following:
data = pd.read_csv(next(iglob('*.csv')))
row = data[data['creation date'] == 'Employee']
n = row.index[0]
data.drop(labels=list(range(n)), inplace=True)
The main goal is to find the index of the row that contains the value 'Employee'. To achieve that, assuming no other rows contain that word, you can filter the dataframe to match the value in question in the specific column.
After that, you extract the index value, which you will use to create the list of labels (given a positional index) to drop from the dataframe, as @MAK7 stated in his answer.
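If, after the drop, you also want the remaining 'Employee' row to become the header (an assumption; the question only asks for the deletion), a short follow-on sketch:
data.columns = data.iloc[0]                  # promote the 'Employee' row to the header
data = data.iloc[1:].reset_index(drop=True)  # keep only the actual data rows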
