how to group data in a column based on indices - python

I am a newbie, slowly learning... I have a unique dataframe as shown below:
time
index
1 8:51 am
1 8:51 am
1 8:51 am
2 8:52 am
2 8:52 am
3 8:53 am
3 8:53 am
3 8:53 am
I want to combine the rows so that each index value appears in one row only, as shown below:
time
index
1 8:51 am
2 8:52 am
3 8:53 am

Try with
df = df.groupby(level=0).head(1)
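As a minimal sketch of that suggestion, assuming the dataframe has a repeated index named "index" and a single "time" column as in the question:

```python
import pandas as pd

# Rebuild the example from the question: the index labels repeat.
df = pd.DataFrame(
    {"time": ["8:51 am"] * 3 + ["8:52 am"] * 2 + ["8:53 am"] * 3},
    index=pd.Index([1, 1, 1, 2, 2, 3, 3, 3], name="index"),
)

# Group by the index (level 0) and keep the first row of each group.
first_rows = df.groupby(level=0).head(1)
print(first_rows)
```

Because head(1) keeps the first row of every group, the original row order and the index labels are preserved.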

Nothing looks unique there; those just seem to be whole duplicate rows (unless timestamps can differ for the same index number).
The df.drop_duplicates function is what you're looking for.
You can use it even if the timestamps can differ: run it over a selected column only through the subset argument (here the index, after turning it into a column with reset_index(), since drop_duplicates only looks at columns), and keep="first" or keep="last" will keep the first or last of those timestamps.
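A small sketch of the case where timestamps differ within one index value. The data here is hypothetical (it is not from the question), and the df.index.duplicated variant is an alternative I am adding, not something the answer mentions:

```python
import pandas as pd

# Hypothetical data: two different timestamps per index value.
df = pd.DataFrame(
    {"time": ["8:51 am", "8:52 am", "8:53 am", "8:54 am"]},
    index=pd.Index([1, 1, 2, 2], name="index"),
)

# drop_duplicates only compares columns, so move the index into a column first.
flat = df.reset_index()
first = flat.drop_duplicates(subset="index", keep="first").set_index("index")
last = flat.drop_duplicates(subset="index", keep="last").set_index("index")

# Alternative without resetting the index: mask out repeated index labels.
first_alt = df[~df.index.duplicated(keep="first")]
```

Here first holds the earliest timestamp per index value and last holds the latest one.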

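data = data.drop_duplicates(subset="time", keep="first")
This returns the dataframe with one row for each distinct value in the subset column. Note that keep=False would instead drop every row whose value occurs more than once, which would leave the example dataframe empty.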

Related

join rows with same index and keep other rows unchanged

I have this data frame
df=
ID  join  Chapter  ParaIndex  text
0   NaN   1        0          I am test
1   NaN   2        1          it is easy
2   1     3        2          but not so
3   1     3        3          much easy
I want to get this
(merge the column "text" with the same index in column "join" and reindex "ID" and "ParaIndex", rest without change)
dfEdited=
ID  join  Chapter  ParaIndex  text
0   NaN   1        0          I am test
1   NaN   2        1          it is easy
2   1     3        2          but not so much easy
I used this command
dfedited=df.groupby(['join'])['text'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
It only merges the rows that have a numeric value in column join and excludes the rows where join is NaN,
so I changed it to this:
dfedited=df.groupby(['join'],dropna=False)['text'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
Here it merges all rows based on join, but it treats the rows with NaN as one group and therefore joins them too! However, I do not want to join them. Any idea? Many thanks.
I also used this
dfedited=df.groupby(['join', "ParaIndex", "Chapter"],dropna=False )['text'].apply(lambda x: ' '.join(x.astype(str) )).reset_index()
it looks better as it has all columns, but no changes!!
I hope you can give an example of data and code. And do it step by step rather than just code it in one line without testing. It's hard to help you with this one-line code.
But the main idea is to use merge(..., on='join')
I solved it like this:
dfEdited = df.assign(key=df['join'].ne(df['join'].shift()).cumsum()).groupby('key').agg({ "ParaIndex": 'first', "Chapter":'first','text':' '.join}).reset_index()
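To see why this works, here is a step-by-step sketch of the same idea on the question's data. The variable name runs and the final renumbering of ParaIndex are my own assumptions about the desired output, not part of the original one-liner:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "join": [np.nan, np.nan, 1, 1],
    "Chapter": [1, 2, 3, 3],
    "ParaIndex": [0, 1, 2, 3],
    "text": ["I am test", "it is easy", "but not so", "much easy"],
})

# A row starts a new run when its "join" value differs from the row above.
# NaN never equals NaN, so every NaN row becomes a run of its own.
runs = df["join"].ne(df["join"].shift()).cumsum().rename("run")

merged = (
    df.groupby(runs)
      .agg({"join": "first", "Chapter": "first", "text": " ".join})
      .reset_index(drop=True)
)
# Assumption: ParaIndex should simply be renumbered after merging.
merged["ParaIndex"] = range(len(merged))
print(merged)
```

The key point is that the grouping key is built from consecutive runs rather than from the join values themselves, so the NaN rows are never lumped together.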

Pandas dataframe on python

I feel like this may be a really easy question, but I can't figure it out. I have a data frame that looks like this:
one two three
1 2 3
2 3 3
3 4 4
The third column has duplicates. If I want to keep the first row but drop the second row, because row two repeats a value in the third column, how would I do this?
Pandas DataFrame objects have a method for this; assuming df is your dataframe, df.drop_duplicates(subset='name_of_third_column') returns the dataframe with any rows containing duplicate values in the third column removed (by default the first occurrence of each value is kept).
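A quick sketch on the question's data, assuming the third column is literally named "three":

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [2, 3, 4], "three": [3, 3, 4]})

# Keep the first row for each distinct value in column "three".
deduped = df.drop_duplicates(subset="three")
print(deduped)
```

The second row is dropped because its value in "three" already appeared in the first row; the original row labels of the surviving rows are kept.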

Conditionally copy values from dataframe column to another dataframe with different length

I would like to copy values from a dataframe column to another dataframe if the values in two other columns are the same.
example df1:
identifier price
1 nan
1 nan
3 nan
3 nan
and so on. There are several rows for every identifier.
In my df2, there is only one value for each identifier in "price"
example df2:
Identifier price
1 3
3 5
I would just like to copy the "price" values in df2 to "price" in df1. It does not matter to me whether the values are copied to every row where the identifiers match or just to the first one, since I will alter all but the first entry for each identifier in df1["price"] anyway.
Expected output would be still df1 because there are other columns I still need:
identifier price
1 3
1 nan
3 5
3 nan
OR:
identifier price
1 3
1 3
3 5
3 5
I could work with both.
I tried np.where but the different length of the dataframes causes problems. I also tried using loc, but I got stuck when defining the value that should be inserted in the cell if the condition holds.
Any help is much appreciated, thank you in advance!
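One possible approach, offered as a sketch rather than a definitive answer: turn df2 into a lookup Series and map the identifiers of df1 through it. The variable names lookup and first_only are hypothetical, and the column names are taken from the example above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"identifier": [1, 1, 3, 3], "price": np.nan})
df2 = pd.DataFrame({"Identifier": [1, 3], "price": [3, 5]})

# Lookup Series: identifier -> price (df2 has one price per identifier).
lookup = df2.set_index("Identifier")["price"]

# Second expected output: every matching row gets the price.
df1["price"] = df1["identifier"].map(lookup).astype(float)

# First expected output: keep the price only on the first row per identifier.
first_only = df1.assign(
    price=df1["price"].where(~df1["identifier"].duplicated())
)
```

Since map works row by row on df1, the different lengths of the two dataframes do not matter, and all other columns of df1 stay untouched.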

Create a multiindex DataFrame from existing delimited column names

I have a pandas DataFrame that looks like the following
            A_value  A_avg  B_value  B_avg
date
2020-01-01        1      2        3      4
2020-02-01        5      6        7      8
and my goal is to create a multiindex Dataframe that looks like that:
               A          B
           value avg  value avg
date
2020-01-01     1   2      3   4
2020-02-01     5   6      7   8
So the part of the column name before the '_' should be the first level of the column index and the part afterwards the second level. The first part is unstructured, the second is always the same (4 endings).
I tried to solve it with pd.wide_to_long() but I think that is the wrong path, as I don't want to change the df itself. The real df is much larger, so creating it manually is not an option. I'm stuck here and did not find a solution.
You can split the column names by the delimiter and expand the result to create a MultiIndex:
df.columns=df.columns.str.split("_",expand=True)
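A runnable sketch of this on the question's data, assuming "date" is the index and every column name contains exactly one underscore:

```python
import pandas as pd

df = pd.DataFrame(
    {"A_value": [1, 5], "A_avg": [2, 6], "B_value": [3, 7], "B_avg": [4, 8]},
    index=pd.Index(["2020-01-01", "2020-02-01"], name="date"),
)

# Splitting with expand=True turns the flat column Index into a MultiIndex.
df.columns = df.columns.str.split("_", expand=True)

# The first level can now be selected on its own.
print(df["A"])
```

If the unstructured first part can itself contain underscores, splitting from the right with df.columns.str.rsplit("_", n=1, expand=True) may be the safer choice, since it only cuts at the last underscore.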

How to delete rows from a csv file?

I was able to pull the rows that I would like to delete from a CSV file but I can't make that drop() function to work.
data = pd.read_csv(next(iglob('*.csv')))
data_top = data.head()
data_top = data_top.drop(axis=0)
What needs to be added?
Example of a CSV file. It should delete everything until it reaches the Employee column.
   creation date  Unnamed: 1      Unnamed: 2
0  NaN            type of client  NaN
1  age            NaN             NaN
2  NaN            birth date      NaN
3  NaN            NaN             days off
4  Employee       Salary          External
5  Dan            130e            yes
6  Abraham        10e             no
7  Richmond       201e            third-party
If it is just the top 5 rows you want to delete, then you can do it as follows:
data = pd.read_csv(next(iglob('*.csv')))
data.drop([0,1,2,3,4], axis=0, inplace=True)
With axis, you should also pass either a single label or list (of column names, or row indexes).
There are, of course, many other ways to achieve this too. Especially if the case is that the index of rows you want to delete is not just the top 5.
edit: inplace added as pointed out in comments.
Considering the comments and further explanations, assuming you know the name of the column, and that you have a positional index, you can try the following:
data = pd.read_csv(next(iglob('*.csv')))
row = data[data['creation date'] == 'Employee']
n = row.index[0]
data.drop(labels=list(range(n)), inplace=True)
The main goal is to find the index of the row that contains the value 'Employee'. To achieve that, assuming there are no other rows that contain that word, you can filter the dataframe to match the value in question in the specific column.
After that, you extract the index value, which you use to build the list of labels (given a positional index) to drop from the dataframe, as #MAK7 stated in his answer.
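The steps above can be sketched end to end. To keep the example self-contained it reads the CSV from an in-memory string instead of a file, and the last step (promoting the 'Employee' row to the header) is my assumption about what is wanted next, not part of the answer:

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file shown in the question.
raw = io.StringIO(
    "creation date,Unnamed: 1,Unnamed: 2\n"
    ",type of client,\n"
    "age,,\n"
    ",birth date,\n"
    ",,days off\n"
    "Employee,Salary,External\n"
    "Dan,130e,yes\n"
    "Abraham,10e,no\n"
    "Richmond,201e,third-party\n"
)
data = pd.read_csv(raw)

# Find the label of the first row whose first column says 'Employee'.
n = data.index[data["creation date"] == "Employee"][0]

# Drop every row above it (labels 0 .. n-1 with the default RangeIndex).
trimmed = data.drop(labels=list(range(n)))

# Assumption: use the 'Employee' row as the header of the remaining table.
table = trimmed.iloc[1:].reset_index(drop=True)
table.columns = trimmed.iloc[0].tolist()
print(table)
```

When the number of junk rows is known in advance, passing skiprows to pd.read_csv avoids the drop step altogether.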
