How to delete rows from a csv file? - python

I was able to pull the rows that I would like to delete from a CSV file, but I can't get the drop() function to work.
data = pd.read_csv(next(iglob('*.csv')))
data_top = data.head()
data_top = data_top.drop(axis=0)
What needs to be added?
Example of a CSV file. It should delete everything until it reaches the Employee column.
creation date Unnamed: 1 Unnamed: 2
0 NaN type of client NaN
1 age NaN NaN
2 NaN birth date NaN
3 NaN NaN days off
4 Employee Salary External
5 Dan 130e yes
6 Abraham 10e no
7 Richmond 201e third-party

If it is just the top 5 rows you want to delete, then you can do it as follows:
data = pd.read_csv(next(iglob('*.csv')))
data.drop([0,1,2,3,4], axis=0, inplace=True)
Along with axis, you also have to pass the labels to drop: either a single label or a list (of column names or row indexes).
There are, of course, many other ways to achieve this too. Especially if the case is that the index of rows you want to delete is not just the top 5.
edit: inplace added as pointed out in comments.
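One of the "other ways" mentioned above is to skip the unwanted lines while reading the file. This is only a sketch: it assumes the number of leading lines is fixed and known in advance, and the CSV text below is a made-up stand-in for the file in the question.

```python
import io
import pandas as pd

# Hypothetical CSV content mirroring the example in the question.
csv_text = (
    "creation date,,\n"
    ",type of client,\n"
    "age,,\n"
    ",birth date,\n"
    ",,days off\n"
    "Employee,Salary,External\n"
    "Dan,130e,yes\n"
    "Abraham,10e,no\n"
    "Richmond,201e,third-party\n"
)

# skiprows counts lines of the file, so the original header line
# ("creation date,,") is one of the five skipped lines. The first line
# that is left ("Employee,Salary,External") becomes the header.
data = pd.read_csv(io.StringIO(csv_text), skiprows=5)
print(data)
```

With this approach nothing has to be dropped afterwards, but it breaks as soon as the number of leading lines changes from file to file.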

Considering the comments and further explanations, assuming you know the name of the column, and that you have a positional index, you can try the following:
data = pd.read_csv(next(iglob('*.csv')))
row = data[data['creation date'] == 'Employee']
n = row.index[0]
data.drop(labels=list(range(n)), inplace=True)
The main goal is to find the index of the row that contains the value 'Employee'. To achieve that, assuming there are no other rows that contain that word, you can filter the dataframe to match the value in question in the specific column.
After that, you extract the index value, which you will use to create a list of labels (given a positional index) that you will drop from the dataframe, as @MAK7 stated in his answer.
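The approach described above can be tried end to end with the sketch below. The CSV text is a made-up stand-in for the file in the question, and the last step (promoting the 'Employee' row to column names) is an assumption about what the asker ultimately wants, not something stated in the question.

```python
import io
import pandas as pd

# Hypothetical CSV content mirroring the example in the question.
csv_text = (
    "creation date,,\n"
    ",type of client,\n"
    "age,,\n"
    ",birth date,\n"
    ",,days off\n"
    "Employee,Salary,External\n"
    "Dan,130e,yes\n"
    "Abraham,10e,no\n"
    "Richmond,201e,third-party\n"
)
data = pd.read_csv(io.StringIO(csv_text))

# Find the label of the first row whose first column holds 'Employee'.
matches = data.index[data["creation date"] == "Employee"]
if len(matches) == 0:
    raise ValueError("no row containing 'Employee' was found")
n = matches[0]

# Drop every row before it. This relies on the default RangeIndex,
# where row labels and row positions are the same numbers.
trimmed = data.drop(labels=list(range(n)))

# Assumed follow-up: use the 'Employee' row as the column names.
cleaned = trimmed.iloc[1:].reset_index(drop=True)
cleaned.columns = trimmed.iloc[0].tolist()
print(cleaned)
```

The explicit check on `matches` avoids an IndexError when the marker value is missing from the file.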

Related

Columns getting appended to wrong row in pandas

So I have a dataframe like this:-
0 1 2 ...
0 Index Something Something2 ...
1 1 5 8 ...
2 2 6 9 ...
3 3 7 10 ...
Now, I want to append some columns in between those "Something" column names, for which I have used this code:-
j = 1
for i in range(2, 51):
if i % 2 != 0 and i != 4:
df.insert(i, f"% Difference {j}", " ")
j += 1
where df is the dataframe. Now what happens is that the columns do get inserted but like this:-
0 1 Difference 1 2 ...
0 Index Something NaN Something2 ...
1 1 5 NaN 8 ...
2 2 6 NaN 9 ...
3 3 7 NaN 10 ...
whereas what I wanted was this:-
0 1 2 3 ...
0 Index Something Difference 1 Something2 ...
1 1 5 NaN 8 ...
2 2 6 NaN 9 ...
3 3 7 NaN 10 ...
Edit 1 Using jezrael's logic:-
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop = True)
print(df)
The output of that is still this:-
0 1 2 ...
0 Index Something Something2 ...
1 1 5 8 ...
2 2 6 9 ...
3 3 7 10 ...
Any ideas or suggestions as to where or how I am going wrong?
If your dataframe looks like what you've shown in your first code block, your column names aren't Index, Something, etc. - they're actually 0, 1, etc.
Pandas is seeing Index, Something, etc. as data in row 0, NOT as column names (which exist above row 0). So when you add a column with the name Difference 1, you're adding a column above row 0, which is where the range of integers is located.
A couple potential solutions to this:
If you'd like the actual column names to be Index, Something, etc. then the best solution is to import the data with that row as the headers. What is the source of your data? If it's a csv, make sure to NOT use the header = None option. If it's from somewhere else, there is likely an option to pass in a list of the column names to use. I can't think of any reason why you'd want to have a range of integer values as your column names rather than the more descriptive names that you have listed.
Alternatively, you can do what @jezrael suggested and convert your first row of data to column names, then delete that data row. I'm not sure why their solution isn't working for you, since the code seems to work fine in my testing. Here's what it's doing:
df.columns = df.iloc[0].tolist()
df.columns tells pandas what to (re)name the columns of the dataframe. df.iloc[0].tolist() creates a list out of the first row of data, which in your case is the column names that you actually want.
df = df.iloc[1:].reset_index(drop = True)
This grabs the 2nd through last rows of data to recreate the dataframe. So you have new column names based on the first row, then you recreate the dataframe starting at the second row. The .reset_index(drop = True) isn't totally necessary to include. That just restarts your actual data rows with an index value of 0 rather than 1.
If for some reason you want to keep the column names as they currently exist (as integers rather than labels), you could do something like the following under the if statement in your for loop:
df.insert(i, i, np.nan, allow_duplicates = True)
df.iat[0, i] = f"%Difference {j}"
df.columns = np.arange(len(df.columns))
The first line inserts a column with an integer label, filled with NaN values to start with (assuming you have numpy imported). You need to allow duplicates, otherwise you'll get an error, since the integer value will already be the name of a pre-existing column.
The second line changes the value in the 1st row of the newly-created column to what you want.
The third line resets the column names to be a range of integers like you had to start with.
As @jezrael suggested, it seems like you might be a little unclear about the difference between column names, indices, and data rows and columns. An index is its own thing, so it's not usually necessary to have a column named Index like you have in your dataframe, especially since that column has the same values in it as the actual index. Clarifying those sorts of things at import can help prevent a lot of hassle later on, so I'd recommend taking a good look at your data source to see if you can create a clearer dataframe to start with!
I want to append some columns in between those "Something" column names
No, there are no columns named Something. For that, you need to set the first row of data as the column names:
print (df.columns)
Int64Index([0, 1, 2], dtype='int64')
print (df.iloc[0].tolist())
['Index', 'Something', 'Something2']
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop=True)
print (df)
Index Something Something2
0 1 5 8
1 2 6 9
2 3 7 10
print (df.columns)
Index(['Index', 'Something', 'Something2'], dtype='object')
After that, your solution creates the Difference columns, but the output is different from what you expected: there are no columns named 0, 1, 2, 3.
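Putting the two steps together, here is a minimal sketch of promoting the first row to column names and then inserting a new column between the existing ones. The frame is a small made-up version of the one in the question, and a single insert position is used instead of the asker's loop.

```python
import numpy as np
import pandas as pd

# Small stand-in for the frame in the question: the real headers sit
# in data row 0 and the column labels are the integers 0, 1, 2.
df = pd.DataFrame([
    ["Index", "Something", "Something2"],
    [1, 5, 8],
    [2, 6, 9],
    [3, 7, 10],
])

# Promote the first data row to column names, then drop that row.
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop=True)

# Insert the new column at position 2, between the two "Something" columns.
df.insert(2, "% Difference 1", np.nan)
print(df)
```

Because the headers are now real column names, the inserted name lines up with them instead of landing in the row of integer labels.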

Returning a Dataframe of each group with a condition of max value of a particular column

I have one dataframe (this one is a demo for explaining the problem), in which I have to group the rows that share the same value in one particular column, let's say Company name.
We have 3 As and 4 Bs, so there are 2 different entities. In the output I need 2 rows, one for A and one for B, and each should be the row of that company whose MPG value is the maximum.
So the output should look like:
A A2 2020 Auto 2.0 67.3
B B4 2021 Manual 1.5 83.1
I have tried all the possible things with groupby method but couldn't get the result.
I have tried:
new_df = pd.groupby(["Company"])["MPG"].max()
And also,
new_df = pd.groupby(["Company"]).max(["MPG"])
But I only get a Series in return, not a DataFrame.
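One common way to keep the whole row for each group's maximum is to look up the row labels with idxmax and select them with loc. The following is only a sketch: the Company and MPG columns come from the question, while the Model column and all values except the two expected MPG figures are made up for illustration.

```python
import pandas as pd

# Hypothetical demo data; only the Company and MPG column names
# are taken from the question.
df = pd.DataFrame({
    "Company": ["A", "A", "A", "B", "B", "B", "B"],
    "Model": ["A1", "A2", "A3", "B1", "B2", "B3", "B4"],
    "MPG": [55.4, 67.3, 60.1, 70.2, 65.0, 80.5, 83.1],
})

# groupby must be called on the DataFrame itself (df.groupby).
# idxmax returns, for each company, the row label of the highest MPG;
# df.loc then selects those complete rows.
best = df.loc[df.groupby("Company")["MPG"].idxmax()].reset_index(drop=True)
print(best)
```

If two rows of a company tie for the maximum, idxmax keeps only the first one.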

join rows with same index and keep other rows unchanged

I have this data frame
df=
ID join Chapter ParaIndex text
0 NaN 1 0 I am test
1 NaN 2 1 it is easy
2 1 3 2 but not so
3 1 3 3 much easy
I want to get this
(merge the column "text" with the same index in column "join" and reindex "ID" and "ParaIndex", rest without change)
dfEdited=
ID join Chapter ParaIndex text
0 NaN 1 0 I am test
1 NaN 2 1 it is easy
2 1 3 2 but not so much easy
I used this command
dfedited=df.groupby(['join'])['text'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
It only merges the rows that have a numerical value in column join and excludes the rows where join is NaN,
so I changed to this
dfedited=df.groupby(['join'],dropna=False)['text'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
Here it merges all rows based on column join, but it treats the rows where join is NaN as one group and therefore joins them as well. However, I do not want to join them. Any idea? Many thanks.
I also used this
dfedited=df.groupby(['join', "ParaIndex", "Chapter"],dropna=False )['text'].apply(lambda x: ' '.join(x.astype(str) )).reset_index()
it looks better as it has all columns, but no changes!!
I hope you can give an example of data and code. And do it step by step rather than just code it in one line without testing. It's hard to help you with this one-line code.
But the main idea is to use merge(..., on='join')
I solved it like this:
dfEdited = df.assign(key=df['join'].ne(df['join'].shift()).cumsum()).groupby('key').agg({ "ParaIndex": 'first', "Chapter":'first','text':' '.join}).reset_index()
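To see why this works, the sketch below rebuilds the example frame from the question and prints the helper key. Since NaN never compares equal to NaN, every NaN row starts a new group, while consecutive rows with the same join value share one. The ID column is assumed to be the plain row index and is left out.

```python
import numpy as np
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({
    "join": [np.nan, np.nan, 1, 1],
    "Chapter": [1, 2, 3, 3],
    "ParaIndex": [0, 1, 2, 3],
    "text": ["I am test", "it is easy", "but not so", "much easy"],
})

# A new group starts whenever 'join' differs from the previous row.
# NaN != NaN is True, so each NaN row gets its own key.
key = df["join"].ne(df["join"].shift()).cumsum().rename("key")
print(key.tolist())

merged = (
    df.groupby(key)
      .agg({"ParaIndex": "first", "Chapter": "first", "text": " ".join})
      .reset_index(drop=True)
)
print(merged)
```

Note that this groups consecutive rows only: two blocks with the same join value that are separated by other rows would stay separate.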

Conditionally copy values from dataframe column to another dataframe with different length

I would like to copy values from a dataframe column to another dataframe if the values in two other columns are the same.
example df1:
identifier price
1 nan
1 nan
3 nan
3 nan
and so on. There are several rows for every identifier.
In my df2, there is only one value for each identifier in "price"
example df2:
Identifier price
1 3
3 5
I would just like to copy the "price" values in df2 to "price" in df1. It does not matter to me if the values are copied to every row where the identifiers match or just to the first one, since I will alter all but the first entry for each identifier in df1["price"] anyway.
Expected output would be still df1 because there are other columns I still need:
identifier price
1 3
1 nan
3 5
3 nan
OR:
identifier price
1 3
1 3
3 5
3 5
I could work with both.
I tried np.where but the different length of the dataframes causes problems. I also tried using loc, but I got stuck when defining the value that should be inserted in the cell if the condition holds.
Any help is much appreciated, thank you in advance!
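A lookup with Series.map is one possible way to do this without running into the length mismatch. The sketch below uses the example frames from the question and assumes each identifier occurs only once in df2.

```python
import numpy as np
import pandas as pd

# Example frames from the question.
df1 = pd.DataFrame({"identifier": [1, 1, 3, 3], "price": np.nan})
df2 = pd.DataFrame({"Identifier": [1, 3], "price": [3, 5]})

# Build an identifier -> price lookup, then map it onto df1.
# Identifiers missing from df2 would come out as NaN.
lookup = df2.set_index("Identifier")["price"]
filled = df1["identifier"].map(lookup)

# Variant: keep the price only on the first row of each identifier.
first_only = filled.where(~df1["identifier"].duplicated())

df1["price"] = filled
print(df1)
print(first_only)
```

The result has the same length and index as df1, so the other columns of df1 stay untouched.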

Drop rows of NaN with a slice of columns in Pandas

I have hundreds of columns in a DataFrame and would like to drop rows where multiple columns are NaN. Meaning entire row is NaN for those columns.
I have tried to slice columns but the code is taking forever to run.
df = df.drop(df[(df.loc[:,'col1':'col100'].isna()) & (df.loc[:,'col120':'col220'].isna())].index)
Appreciate any help.
Part of your original question reads: "... would like to drop rows where multiple columns are NaN. Meaning entire row is NaN for those columns. "
Can I interpret this as: you want to delete a row when the entire row consists of NaNs? If that is true, you should be able to achieve this with:
df.dropna(axis = 'rows', how = 'all', inplace = True)
If that is not the case then I misunderstood your question.
You should try to use the dropna() function with the subset parameter set to the columns you want to check for NaN. Here is a short example taken from the pandas documentation:
df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
"toy": [np.nan, 'Batmobile', 'Bullwhip'],
"born": [pd.NaT, pd.Timestamp("1940-04-25"),
pd.NaT]})
df
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
df.dropna(subset=['name', 'born'])
This gives you the following:
name toy born
1 Batman Batmobile 1940-04-25
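For the case in the question, where a row should go only if all columns in two label ranges are NaN, subset can be combined with how='all'. This is a sketch with a tiny made-up frame; the column names are placeholders for the asker's 'col1':'col100' and 'col120':'col220' slices.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for the hundreds of columns.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "col1": [np.nan, 1.0, np.nan],
    "col2": [np.nan, np.nan, np.nan],
    "col3": [np.nan, 2.0, 5.0],
    "col4": [np.nan, np.nan, np.nan],
})

# Collect the column names from both label slices.
cols = (
    df.loc[:, "col1":"col2"].columns.tolist()
    + df.loc[:, "col3":"col4"].columns.tolist()
)

# Drop a row only when every one of those columns is NaN.
result = df.dropna(subset=cols, how="all")
print(result)
```

Columns outside the subset, such as id here, play no part in the decision.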
