Subsetting a pandas DataFrame using position values from lists - python

I have a dataframe with raw data and I would like to select different range of rows for each column, using two different lists: one containing the first row position to select and the other the last.
INPUT
| Index | Column A | Column B |
|:--------:|:--------:|:--------:|
| 1 | 2 | 8 |
| 2 | 4 | 9 |
| 3 | 1 | 7 |
first_position=[1,2]
last_position=[2,3]
EXPECTED OUTPUT
| Index | Column A | Column B |
|:--------:|:--------:|:--------:|
| 1 | 2 | 9 |
| 2 | 4 | 7 |
Which function can I use?
Thanks!
I tried df.filter but I think it does not accept lists as input.

Basically, as far as I can see, you have two meaningful columns in your DataFrame.
Thus, I would suggest using the "Index" column as the actual index:
df.set_index(df.columns[0], inplace=True)
That way you can use .loc:
df_out = pd.concat(
    [
        df.loc[first_position, "Column A"].reset_index(drop=True),
        df.loc[last_position, "Column B"].reset_index(drop=True)
    ],
    axis=1
)
However, with the positions stored in separate lists, you have to keep them consistent yourself, which may not be convenient.
Instead, I would reorganize it with slicing:
df_out = pd.concat(
    [
        df[["Column A"]][:-1].reset_index(drop=True),
        df[["Column B"]][1:].reset_index(drop=True)
    ],
    axis=1
)
In either case the original index is destroyed. If that matters, you would need the variant without .reset_index(drop=True).
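A runnable sketch of the .loc approach on the sample data from the question (the frame is rebuilt here since the original is only shown as a table):

```python
import pandas as pd

# Sample data from the question; "Index" holds the row labels 1..3.
df = pd.DataFrame({"Index": [1, 2, 3],
                   "Column A": [2, 4, 1],
                   "Column B": [8, 9, 7]}).set_index("Index")

first_position = [1, 2]
last_position = [2, 3]

# Pick rows by label from each column, then align them side by side.
df_out = pd.concat(
    [
        df.loc[first_position, "Column A"].reset_index(drop=True),
        df.loc[last_position, "Column B"].reset_index(drop=True),
    ],
    axis=1,
)
print(df_out)
#    Column A  Column B
# 0         2         9
# 1         4         7
```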

Related

Merge 2 data frame based on values on dataframe 1 and index and column from dataframe 2

I have 2 DataFrames as follows
DataFrame 1
DataFrame 2
I wanted to merge these 2 DataFrames, based on the values of each row in DataFrame 2, matched with the combination of index and column in DataFrame 1.
So I want to append another column in DataFrame 2, name it "weight", and store the merged value there.
For example,
|   | col1 | col2 | relationship | weight |
|:-:|:----:|:----:|:------------:|:------:|
| 0 | Andy | Claude | 0 | 1 |
| 1 | Andy | Frida | 20 | 1 |
and so on. How to do this?
Use DataFrame.join with DataFrame.stack for MultiIndex Series:
df2 = df2.join(df1.stack().rename('weight'), on=['col1','col2'])
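A minimal sketch of this join, using hypothetical stand-ins for the two frames (the originals were only shown as images, so the names and values here are assumed from the example output):

```python
import pandas as pd

# df1: weights in a name-by-name matrix; df2: pairs with a relationship score.
df1 = pd.DataFrame({"Claude": [1], "Frida": [1]}, index=["Andy"])
df2 = pd.DataFrame({"col1": ["Andy", "Andy"],
                    "col2": ["Claude", "Frida"],
                    "relationship": [0, 20]})

# stack() turns the matrix into a Series keyed by (row label, column label),
# which join can match against the (col1, col2) pairs.
df2 = df2.join(df1.stack().rename("weight"), on=["col1", "col2"])
print(df2)
#    col1    col2  relationship  weight
# 0  Andy  Claude             0       1
# 1  Andy   Frida            20       1
```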

Select value from other dataframe where index is equal

I have two dataframes with the same index. I would like to add a column to one of those dataframes based on an equation for which I need the value from a row of another dataframe where the index is the same.
Using
df2['B'].loc[df2['Date'] == df1['Date']]
I get the 'Can only compare identically-labeled Series objects' error.
df1
+-------------+
| Index A |
+-------------+
| 3-2-20 3 |
| 4-2-20 1 |
| 5-2-20 3 |
+-------------+
df2
+----------------+
| Index A |
+----------------+
| 1-2-20 2 |
| 2-2-20 4 |
| 3-2-20 3 |
| 4-2-20 1 |
| 5-2-20 3 |
+----------------+
df1['B'] = 1 + df2['A'].loc[df2['Date'] == df1['Date']] fails the same way. The index is a date, but in my real df I also have a column called Date with the same values.
df1 desired
+----------------+
| Index A B |
+----------------+
| 3-2-20 3 4 |
| 4-2-20 1 2 |
| 5-2-20 3 4 |
+----------------+
This should work. If not, adjust the column names: they are the same in both tables, so merge renames them automatically. A_y is the df2['A'] column (auto-renamed because of the collision):
df1['B']=df1.merge(df2, left_index=True, right_index=True)['A_y']+1
I guess for now I will have to settle for cutting a copy of df2 down to the indexes of df1:
dfc = df2.copy()
t = list(df1['Date'])
dfc = dfc.loc[dfc['Date'].isin(t)]
df1['B'] = 1 + dfc['A']
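Since both frames share the same date index, plain index alignment avoids the intermediate copy entirely; a minimal sketch on the sample data:

```python
import pandas as pd

# Sample frames indexed by date strings, as in the question.
df1 = pd.DataFrame({"A": [3, 1, 3]}, index=["3-2-20", "4-2-20", "5-2-20"])
df2 = pd.DataFrame({"A": [2, 4, 3, 1, 3]},
                   index=["1-2-20", "2-2-20", "3-2-20", "4-2-20", "5-2-20"])

# .loc with df1's index pulls exactly the matching rows from df2,
# so no boolean comparison between differently sized Series is needed.
df1["B"] = 1 + df2.loc[df1.index, "A"]
print(df1)
#         A  B
# 3-2-20  3  4
# 4-2-20  1  2
# 5-2-20  3  4
```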

How would I iterate over a Pandas Series and compare it to single float?

I'm wondering if it's possible to check and see if the result of a calculation between two tables can be used with comparison operators.
Let's say I have two dataframes.
DF
| user_id | col1| col2| col3| col4| check |
|---------|-----|-----|-----|-----|-------|
| 100 | 1 | 2 | 1 | 0 | 5 |
| 200 | 2 | 4 | 0 | 2 | 5 |
DF2
| user_id | col1| col2| col3| col4| check |
| 300 | 3 | 6 | 2 | 0 | 5 |
| 400 | 4 | 8 | 0 | 4 | 5 |
For each user in df, I loop through each user in df2. I then want to add their col1 values and see if they are greater than the number 5. If so, 'greater than 5' should be returned. If not, 'less than 5' should be returned.
This is how I would imagine the syntax to look, but it doesn't work.
for a in df.user_id:
    for b in df2.user_id:
        if df.col1 + df.col2 > df.check:
            print('Greater than 5')
        else:
            print('Less than 5')
I get a ValueError:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What's the logic behind this, and how would comparing iterated items to single value work?
Also, aside from being a static value vs an iterable Series, is there a difference between using the df.check column value and an int 5 in the for loop? What kind of effect does this have?
Thanks!
You should maybe switch to a more relational approach. The way I would proceed is:
df_result = (pd.concat([df, df2], axis=0)
             .assign(greater_than_check=lambda d: (d.col1 + d.col2) > d.check))
Users with col1 + col2 greater than check:
df_result.loc[lambda d: d.greater_than_check, "user_id"]
If you do want a loop, iterate over the index labels rather than the DataFrames themselves (iterating a DataFrame yields its column names):
for a in df.index:
    for b in df2.index:
        if df.loc[a, "col1"] + df2.loc[b, "col1"] > df.loc[a, "check"]:
            print('Greater than 5')
        else:
            print('Less than 5')
Before, you were adding and comparing an entire column to another entire column. Instead, look at the row index of one column using .loc.
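The relational approach above can be sketched end to end with np.where, which maps each row's boolean result to a label (sample data assumed from the question's tables):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"user_id": [100, 200],
                   "col1": [1, 2], "col2": [2, 4], "check": [5, 5]})
df2 = pd.DataFrame({"user_id": [300, 400],
                    "col1": [3, 4], "col2": [6, 8], "check": [5, 5]})

# Compare whole columns at once; no loop and no ambiguous Series truth value.
both = pd.concat([df, df2], ignore_index=True)
both["label"] = np.where(both["col1"] + both["col2"] > both["check"],
                         "Greater than 5", "Less than 5")
print(both[["user_id", "label"]])
```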

What is the smartest way to get the rest of a pandas.DataFrame?

Here is a pandas.DataFrame df.
| Foo | Bar |
|-----|-----|
| 0 | A |
| 1 | B |
| 2 | C |
| 3 | D |
| 4 | E |
I selected some rows and defined a new dataframe, by df1 = df.iloc[[1,3],:].
| Foo | Bar |
|-----|-----|
| 1 | B |
| 3 | D |
What is the best way to get the rest of df, like the following.
| Foo | Bar |
|-----|-----|
| 0 | A |
| 2 | C |
| 4 | E |
Fast set-based diffing.
df2 = df.loc[df.index.difference(df1.index)]
df2
Foo Bar
0 0 A
2 2 C
4 4 E
Works as long as your index values are unique.
If I'm understanding correctly, you want to take a dataframe, select some rows from it and store those in a variable df2, and then select rows in df that are not in df2.
If that's the case, you can do df[~df.isin(df2)].dropna().
df[ x ] subsets the dataframe df based on the condition x
~df.isin(df2) is the negation of df.isin(df2), which evaluates to True for entries of df that also appear in df2.
.dropna() drops rows with a NaN value. In this case the rows we don't want were coerced to NaN in the filtering expression above, so we get rid of those.
I assume that Foo can be treated as a unique index.
First select Foo values from df1:
idx = df1['Foo'].values
Then filter your original dataframe:
df2 = df[~df['Foo'].isin(idx)]
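Both set-difference styles can be tried on the sample frame from the question; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"Foo": [0, 1, 2, 3, 4], "Bar": list("ABCDE")})
df1 = df.iloc[[1, 3], :]

# Index-based set difference (relies on unique index values).
df2 = df.loc[df.index.difference(df1.index)]
print(df2)
#    Foo Bar
# 0    0   A
# 2    2   C
# 4    4   E

# Column-based variant: keep rows whose Foo is not among df1's Foo values.
df2_alt = df[~df["Foo"].isin(df1["Foo"])]
```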

How to compare two pandas dataframes and remove duplicates on one file without appending data from other file [duplicate]

This question already has answers here:
set difference for pandas
(12 answers)
Closed 4 years ago.
I am trying to compare two csv files using pandas dataframes. One is a master sheet that is going to have data appended to it daily (test_master.csv). The second is a daily report (test_daily.csv) that contains the data I want to append to the test_master.csv.
I am creating two pandas dataframes from these files:
import pandas as pd
dfmaster = pd.read_csv('test_master.csv')
dfdaily = pd.read_csv('test_daily.csv')
I want the daily list to get compared to the master list to see if there are any duplicate rows on the daily list that are already in the master list. If so, I want to remove those duplicates from dfdaily. I then want to write this non-duplicate data to dfmaster.
The duplicate data will always be an entire row. My plan was to iterate through the sheets row by row to make the comparison.
I realize I could append my daily data to the dfmaster dataframe and use drop_duplicates to remove the duplicates. I cannot figure out how to remove the duplicates in the dfdaily dataframe, though. And I need to be able to write the dfdaily data back to test_daily.csv (or another new file) without the duplicate data.
Here is an example of what the dataframes could look like.
test_master.csv
+-------------+-------------+-------------+
| column 1 | column 2 | column 3 |
+-------------+-------------+-------------+
| 1 | 2 | 3 |
| 4 | 5 | 6 |
| 7 | 8 | 9 |
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
+-------------+-------------+-------------+
test_daily.csv
+-------------+-------------+-------------+
| column 1 | column 2 | column 3 |
+-------------+-------------+-------------+
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
| 10 | 11 | 12 |
| 13 | 14 | 15 |
+-------------+-------------+-------------+
Desired output is:
test_master.csv
+-------------+-------------+-------------+
| column 1 | column 2 | column 3 |
+-------------+-------------+-------------+
| 1 | 2 | 3 |
| 4 | 5 | 6 |
| 7 | 8 | 9 |
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
| 10 | 11 | 12 |
| 13 | 14 | 15 |
+-------------+-------------+-------------+
test_daily.csv
+----------+----------+----------+
| column 1 | column 2 | column 3 |
+----------+----------+----------+
| 10 | 11 | 12 |
| 13 | 14 | 15 |
+----------+----------+----------+
Any help would be greatly appreciated!
EDIT
I incorrectly thought solutions from the set difference question solved my problem. I ran into certain cases where those solutions did not work. I believe it had something to do with index labels, as mentioned in a comment by Troy D below. Troy D's solution is the one I am now using.
Try this:
I create 2 DataFrames, and then set rows 2-4 of the daily one to be duplicates of the master:
import numpy as np
import pandas as pd
test_master = pd.DataFrame(np.random.rand(3, 3), columns=['A', 'B', 'C'])
test_daily = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])
test_daily.iloc[1:4] = test_master[:3].values
print(test_master)
print(test_daily)
output:
A B C
0 0.009322 0.330057 0.082956
1 0.197500 0.010593 0.356774
2 0.147410 0.697779 0.421207
A B C
0 0.643062 0.335643 0.215443
1 0.009322 0.330057 0.082956
2 0.197500 0.010593 0.356774
3 0.147410 0.697779 0.421207
4 0.973867 0.873358 0.502973
Then, add a multiindex level to identify which data is from which dataframe:
test_master['master'] = 'master'
test_master.set_index('master', append=True, inplace=True)
test_daily['daily'] = 'daily'
test_daily.set_index('daily', append=True, inplace=True)
Now merge as you suggested and drop duplicates:
merged = pd.concat([test_master, test_daily])
merged = merged.drop_duplicates().sort_index()
print(merged)
output:
A B C
master
0 daily 0.643062 0.335643 0.215443
master 0.009322 0.330057 0.082956
1 master 0.197500 0.010593 0.356774
2 master 0.147410 0.697779 0.421207
4 daily 0.973867 0.873358 0.502973
There you see the combined dataframe with the origin of the data labeled in the index. Now just slice for the daily data:
idx = pd.IndexSlice
print(merged.loc[idx[:, 'daily'], :])
output:
A B C
master
0 daily 0.643062 0.335643 0.215443
4 daily 0.973867 0.873358 0.502973
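As an alternative to the MultiIndex labeling, merge with indicator=True tags each daily row by whether it also exists in the master; a sketch on the question's sample rows (values stored as strings here purely for illustration):

```python
import pandas as pd

dfmaster = pd.DataFrame({"column 1": ["1", "4", "7", "duplicate 1", "duplicate 2"],
                         "column 2": ["2", "5", "8", "duplicate 1", "duplicate 2"],
                         "column 3": ["3", "6", "9", "duplicate 1", "duplicate 2"]})
dfdaily = pd.DataFrame({"column 1": ["duplicate 1", "duplicate 2", "10", "13"],
                        "column 2": ["duplicate 1", "duplicate 2", "11", "14"],
                        "column 3": ["duplicate 1", "duplicate 2", "12", "15"]})

# indicator=True adds a _merge column; "left_only" marks rows unique to dfdaily.
merged = dfdaily.merge(dfmaster, how="left", indicator=True)
dfdaily_new = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Append only the genuinely new rows to the master.
dfmaster_new = pd.concat([dfmaster, dfdaily_new], ignore_index=True)
print(dfdaily_new)
```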
