Pandas keeping certain rows based on strings in other rows - python

I have the following dataframe
+-------+------------+
| index | keep       |
+-------+------------+
| 0     | not useful |
| 1     | start_1    |
| 2     | useful     |
| 3     | end_1      |
| 4     | not useful |
| 5     | start_2    |
| 6     | useful     |
| 7     | useful     |
| 8     | end_2      |
+-------+------------+
There are two pairs of marker strings (start_1/end_1 and start_2/end_2) that indicate that the rows between them are the only relevant ones in the data. Hence, for this dataframe, the output dataframe would be composed only of the rows at index 2, 6 and 7 (since 2 is between start_1 and end_1, and 6 and 7 are between start_2 and end_2).
d = {'keep': ["not useful", "start_1", "useful", "end_1", "not useful", "start_2", "useful", "useful", "end_2"]}
df = pd.DataFrame(data=d)
What is the most Pythonic/Pandas approach to this problem?
Thanks

Here's one way to do that (in a couple of steps, for clarity). There might be others:
df["sections"] = 0
df.loc[df.keep.str.startswith("start"), "sections"] = 1
df.loc[df.keep.str.startswith("end"), "sections"] = -1
df["in_section"] = df.sections.cumsum()
res = df[(df.in_section == 1) & ~df.keep.str.startswith("start")]
Output:
     keep  sections  in_section
2  useful         0           1
6  useful         0           1
7  useful         0           1
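A more compact variant of the same idea (just a sketch, not part of the original answer): a row is inside a section when more "start" markers than "end" markers have been seen so far, and the row itself is not a "start" marker.
import pandas as pd

df = pd.DataFrame({"keep": ["not useful", "start_1", "useful", "end_1",
                            "not useful", "start_2", "useful", "useful", "end_2"]})

# running count of opened sections minus closed sections
inside = (df.keep.str.startswith("start").cumsum()
          - df.keep.str.startswith("end").cumsum()) == 1

# keep rows that are inside a section but are not the "start" marker itself
res = df[inside & ~df.keep.str.startswith("start")]
print(res)  # rows at index 2, 6 and 7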

Python. Dataframes. Select rows and apply condition on the selection

Python newbie here. I have written code that solves the issue, but there should be a much better way of doing it.
I have two Series that come from the same table, but due to some earlier processing I get them as separate sets. (They could be joined into a single dataframe again, since the entries belong to the same records.)
Ser1           Ser2
| id |         | section |
|----|         |---------|
| 1  |         | A       |
| 2  |         | B       |
| 2  |         | C       |
| 3  |         | D       |
df2
| id | section |
| ---|---------|
| 1 | A |
| 2 | B |
| 2 | Z |
| 2 | Y |
| 4 | X |
First, I would like to find those entries in Ser1 which match an id in df2. Then, check whether the corresponding values in Ser2 can NOT be found in the section column of df2.
My expected results:
| id | section | result |
| ---|-------- |---------|
| 1 | A | False | # Both id(1) and section(A) are also in df2
| 2 | B | False | # Both id(2) and section(B) are also in df2
| 2 | C | True | # id(2) is in df2 but section(C) is not
| 3 | D | False | # id(3) is not in df2, in that case the result should also be False
My code:
for k, v in Ser2.items():
    rslt_df = df2[df2['id'] == Ser1[k]]
    if rslt_df.empty:
        print(False)
    elif v not in rslt_df['section'].tolist():
        print(True)
    else:
        print(False)
I know the code is not very good, but after reading about merging and list comprehensions I am getting confused about what the best way to improve it would be.
You can concat the series and compute the "result" with boolean arithmetic (XOR):
out = (
pd.concat([ser1, ser2], axis=1)
.assign(result=ser1.isin(df2['id'])!=ser2.isin(df2['section']))
)
Output:
id section result
0 1 A False
1 2 B False
2 2 C True
3 3 D False
Intermediates:
m1 = ser1.isin(df2['id'])
m2 = ser2.isin(df2['section'])
      m1     m2  m1!=m2
0   True   True   False
1   True   True   False
2   True  False    True
3  False  False   False
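For reference, here is a self-contained setup that reproduces the answer above; it is only a sketch and assumes the two series are named 'id' and 'section', matching the question's tables.
import pandas as pd

ser1 = pd.Series([1, 2, 2, 3], name='id')
ser2 = pd.Series(['A', 'B', 'C', 'D'], name='section')
df2 = pd.DataFrame({'id': [1, 2, 2, 2, 4],
                    'section': ['A', 'B', 'Z', 'Y', 'X']})

out = (
    pd.concat([ser1, ser2], axis=1)
      .assign(result=ser1.isin(df2['id']) != ser2.isin(df2['section']))
)
print(out)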

How to efficiently zero pad datasets with different lengths

My aim is to zero pad my data to have an equal length for all the subset datasets. I have data as follows:
|server| users | power | Throughput range | time |
|:----:|:--------------:|:--------------:|:--------------------:|:-----:|
| 0 | [5, 3,4,1] | -4.2974843 | [5.23243, 5.2974843]| 0 |
| 1 | [8, 6,2,7] | -6.4528433 | [6.2343, 7.0974845] | 1 |
| 2 | [9,12,10,11] | -3.5322451 | [4.31240, 4.9073840]| 2 |
| 3 | [14,13,16,17]| -5.9752843 | [5.2243, 5.2974843] | 3 |
| 0 | [22,18,19,21]| -1.2974652 | [3.12843, 4.2474643]| 4 |
| 1 | [22,23,24,25]| -9.884843 | [8.00843, 8.0974843]| 5 |
| 2 | [27,26,28,29]| -2.3984843 | [7.23843, 8.2094845]| 6 |
| 3 | [30,32,31,33]| -4.5654566 | [3.1233, 4.2474643] | 7 |
| 1 | [36,34,37,35]| -1.2974652 | [3.12843, 4.2474643]| 8 |
| 2 | [40,41,38,39]| -3.5322451 | [4.31240, 4.9073840]| 9 |
| 1 | [42,43,45,44]| -5.9752843 | [6.31240, 6.9073840]| 10 |
The aim is to analyze each individual server using its respective data, which was done with the code below:
c0 = grp['server'].values == 0
c0_new = grp[c0]
server0 = pd.DataFrame(c0_new)
c1 = grp['server'].values == 1
c1_new = grp[c1]
server1 = pd.DataFrame(c1_new)
c2 = grp['server'].values == 2
c2_new = grp[c2]
server2 = pd.DataFrame(c2_new)
c3 = grp['server'].values == 3
c3_new = grp[c3]
server3 = pd.DataFrame(c3_new)
The results of this code provide the different servers and their respective data features. For example, the server0 output becomes:
| server | users | power | Throughput range | time |
|:------:|:--------------:|:--------------:|:--------------------:|:-----:|
| 0 | [5, 3,4,1] | -4.2974843 | [5.23243, 5.2974843]| 0 |
| 0 | [22,18,19,21]| -1.2974652 | [3.12843, 4.2474643]| 1 |
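For what it's worth, the same per-server split can be written more compactly with groupby. This is only a sketch, not part of the original post; it assumes grp is the full frame shown above and that the server ids are exactly 0 to 3.
# one DataFrame per distinct server value
servers = {k: g.copy() for k, g in grp.groupby('server')}
server0, server1, server2, server3 = (servers[i] for i in range(4))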
The results obtained for individual servers have different lengths so I tried padding using the code below:
from keras.preprocessing.sequence import pad_sequences
man = [server0, server1, server2, server3]
new = pad_sequences(man)
The results in this case show that the padding has been done, with all the servers having equal length, but the problem is that the output no longer contains the column names. I want the final data to keep the columns. Any suggestions?
The aim is to apply machine learning to the data, and I would like to have them concatenated. This is what I later did, and it worked for the application I wanted it for.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

man = [server0, server1, server2, server3]
for cel in man:
    cel.set_index('time', inplace=True)
    cel.drop(['users'], axis=1, inplace=True)
    scl = MinMaxScaler()

# note: the reshape assumes a single feature column remains in each frame
vals = [cel.values.reshape(cel.shape[0], 1) for cel in man]
I then applied the pad sequence and it worked as follows:
from keras.preprocessing.sequence import pad_sequences
new = pad_sequences(vals)
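If keeping the column names is the main concern, one alternative is to pad with pandas alone: grow each per-server frame to the length of the longest one and fill the new rows with zeros, so the columns survive. This is only a sketch, not from the original post, and it discards the time index in favour of a plain positional index.
frames = [server0, server1, server2, server3]
max_len = max(len(f) for f in frames)

padded = [
    f.reset_index(drop=True)   # positional index 0..n-1
     .reindex(range(max_len))  # grow to the longest length (new rows are NaN)
     .fillna(0)                # zero-pad the new rows
    for f in frames
]
# each element of padded is a DataFrame that keeps the original column names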

Pyspark: Reorder only a subset of rows among themselves

my data frame:
+-----+--------+-------+
| val | id | reRnk |
+-----+--------+-------+
| 2 | a | yes |
| 1 | b | no |
| 3 | c | no |
| 8 | d | yes |
| 7 | e | yes |
| 9 | f | no |
+-----+--------+-------+
In my desired output I want to re-rank only the rows where reRnk == yes; the ranking will be done based on "val".
I don't want to move the rows where reRnk == no; for example, at id = b we have reRnk = no, and I want to keep that row at row no. 2.
my desired output will look like this:
+-----+--------+-------+
| val | id | reRnk |
+-----+--------+-------+
| 8 | d | yes |
| 1 | b | no |
| 3 | c | no |
| 7 | e | yes |
| 2 | a | yes |
| 9 | f | no |
+-----+--------+-------+
From what I'm reading, pyspark DataFrames do not have an index by default, so you might need to add one.
I do not know the exact syntax for pyspark; however, since it has many similarities with pandas, this might point you in the right direction:
mask = df.reRnk == 'yes'
df.loc[mask, ['val', 'id']] = (
    df.loc[mask, ['val', 'id']]
      .sort_values('val', ascending=False)
      .set_index(df.loc[mask, ['val', 'id']].index)
)
Basically what we do here is isolate the rows with reRnk == 'yes' and sort their values, but reset the index back to the original index. Then we assign these new values to the original rows in the df.
For .loc, https://spark.apache.org/docs/3.2.0/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.loc.html might be worth a try.
For .sort_values, see: https://sparkbyexamples.com/pyspark/pyspark-orderby-and-sort-explained/
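Since the suggestion above is phrased in pandas terms, here is a runnable pandas sketch of that exact idea on the question's data (an illustration only; translating it to pyspark, e.g. via the pyspark.pandas API linked above, is left to the reader).
import pandas as pd

df = pd.DataFrame({
    "val": [2, 1, 3, 8, 7, 9],
    "id": ["a", "b", "c", "d", "e", "f"],
    "reRnk": ["yes", "no", "no", "yes", "yes", "no"],
})

mask = df.reRnk == "yes"

# sort only the "yes" rows by val, then put them back on their original positions
reranked = df.loc[mask, ["val", "id"]].sort_values("val", ascending=False)
reranked.index = df.loc[mask].index
df.loc[mask, ["val", "id"]] = reranked

print(df)  # rows in the order 8/d, 1/b, 3/c, 7/e, 2/a, 9/f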

Best way to compare 2 dfs, get the name of different col & before + after vals?

What is the best way to compare 2 dataframes with the same column names, row by row, and, if a cell is different, get the Before and After values and which cell is different in that dataframe?
I know this question has been asked a lot, but none of the existing approaches fit my use case. Speed is important. There is a package called datacompy, but it is not good if I have to compare 5,000 dataframes in a loop (I'm only comparing 2 at a time, but around 10,000 total, and 5,000 times).
I don't want to join the dataframes on a column. I want to compare them row by row: row 1 with row 1, etc. If a column in row 1 is different, I only need to know the column name, the before, and the after. Perhaps if it is numeric I could also add a column with the absolute value of the difference.
The problem is that there is sometimes an edge case where rows are out of order (only by 1 entry), and I don't want these to come up as false positives.
Example:
These dataframes would be created when I pass in race # (there are 5,000 race numbers)
df1
+-----+-------+------+----------+-------------+
| Id  | Speed | Name | Distance | Location    |
+-----+-------+------+----------+-------------+
| 181 | 10.3  | Joe  | 2        | New York    |
| 192 | 9.1   | Rob  | 1        | Chicago     |
| 910 | 1.0   | Fred | 5        | Los Angeles |
| 97  | 1.8   | Bob  | 8        | New York    |
| 88  | 1.2   | Ken  | 7        | Miami       |
| 99  | 1.1   | Mark | 6        | Austin      |
+-----+-------+------+----------+-------------+
df2:
+-----+-------+------+----------+-------------+
| Id  | Speed | Name | Distance | Location    |
+-----+-------+------+----------+-------------+
| 181 | 10.3  | Joe  | 2        | New York    |
| 192 | 9.4   | Rob  | 1        | Chicago     |
| 910 | 1.0   | Fred | 5        | Los Angeles |
| 97  | 1.5   | Bob  | 8        | New York    |
| 99  | 1.1   | Mark | 6        | Austin      |
| 88  | 1.2   | Ken  | 7        | Miami       |
+-----+-------+------+----------+-------------+
diff:
+-------+----------+--------+-------+
| Race# | Diff_col | Before | After |
+-------+----------+--------+-------+
| 123   | Speed    | 9.1    | 9.4   |
| 123   | Speed    | 1.8    | 1.5   |
+-------+----------+--------+-------+
An example of a false positive is with the last 2 rows, Ken + Mark.
I could summarize the differences in one line per race, but if the dataframe has 3,000 records and there are 1,000 differences (unlikely, but possible) then I will have tons of columns. I figured this way was easier, as I could export to Excel and then sort by race # to see all the differences, or by diff_col to see which columns are different.
import numpy as np  # needed for np.where

def DiffCol2(df1, df2, race_num):
    is_diff = False
    diff_cols_list = []
    row_coords, col_coords = np.where(df1 != df2)
    diffDf = []
    alldiffDf = []
    for y in set(col_coords):
        col_df1 = df1.iloc[:, y].name
        col_df2 = df2.iloc[:, y].name
        for index, row in df1.iterrows():
            if df1.loc[index, col_df1] != df2.loc[index, col_df2]:
                col_name = col_df1
                if col_df1 != col_df2:
                    col_name = (col_df1, col_df2)
                diffDf.append({'Race #': race_num, 'Column Name': col_name,
                               'Before': df2.loc[index, col_df2],
                               'After': df1.loc[index, col_df1]})
                try:
                    check_edge_case = df1.loc[index, col_df1] == df2.loc[index + 1, col_df1]
                except Exception:
                    check_edge_case = False
                try:
                    check_edge_case_two = df1.loc[index, col_df1] == df2.loc[index - 1, col_df1]
                except Exception:
                    check_edge_case_two = False
                if not (check_edge_case or check_edge_case_two):
                    col_name = col_df1
                    if col_df1 != col_df2:
                        # if for some reason the column names aren't the same (should never happen), keep both
                        col_name = (col_df1, col_df2)
                    is_diff = True
                    diffDf.append({'Race #': race_num, 'Column Name': col_name,
                                   'Before': df2.loc[index, col_df2],
                                   'After': df1.loc[index, col_df1]})
    return diffDf, alldiffDf, is_diff
[apologies in advance for weirdly formatted tables, i did my best given how annoying pasting tables into s/o is]
The code below works if the dataframes have the same column names and the same numbers of columns and rows, so it compares only the values in the tables. (I'm not sure where you want to get Race# from.)
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df2 = df1.copy(deep=True)
df2.loc[5, 'B'] = 100  # creating a difference
df2.loc[6, 'C'] = 100  # creating a difference

dif = []
for col in df1.columns:
    for bef, aft in zip(df1[col], df2[col]):
        if bef != aft:
            dif.append([col, bef, aft])
print(dif)
Results below
Alternative solution without loops
df = df1.melt()
df.columns = ['Column', 'Before']
df.insert(2, 'After', df2.melt().value)
df[df.Before != df.After]
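To get closer to the Race# / Diff_col / Before / After shape from the question, the melt-based diff can be reshaped as below. This is only a sketch; race_num is a hypothetical variable holding whatever race number you pass in, and, like the loop above, it compares strictly row by row and does not handle the swapped-rows edge case.
race_num = 123  # hypothetical: supplied by whoever builds df1/df2 for a race

diff = df1.melt()  # columns: variable, value
diff.columns = ['Diff_col', 'Before']
diff['After'] = df2.melt()['value'].values
diff = diff[diff['Before'] != diff['After']]
diff.insert(0, 'Race#', race_num)
print(diff)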

Creating new complex index with pandas Dataframe columns

I'm trying to concatenate the columns 'A' and 'C' in a Dataframe like the following to use it as a new Index:
     A |  B  | C | ...
  ---------------------
0    5 | djn | 0 | ...
1    5 | vlv | 1 | ...
2    5 | bla | 2 | ...
3    5 | ses | 3 | ...
4    5 | dug | 4 | ...
The desired result would be a Dataframe which is similar to the following result:
         A |  B  | C | ...
      ---------------------
05000    5 | djn | 0 | ...
05001    5 | vlv | 1 | ...
05002    5 | bla | 2 | ...
05003    5 | ses | 3 | ...
05004    5 | dug | 4 | ...
I've searched my eyes off; does someone know how to manipulate a dataframe to get such a result?
import pandas as pd

# dummying up a dataframe
cf = pd.DataFrame()
cf['A'] = 5 * [5]
cf['C'] = range(5)
cf['B'] = list('qwert')

# putting together two columns into a new one -- EDITED so string formatting is OK
cf['D'] = [str(x).zfill(5) for x in 1000 * cf.A + cf.C]

# use it as the index
cf.index = cf.D

# we don't need it as a column
cf.drop('D', axis=1, inplace=True)

print(cf.to_csv())
D,A,C,B
05000,5,0,q
05001,5,1,w
05002,5,2,e
05003,5,3,r
05004,5,4,t
That said, I suspect you'd be safer with multi-indexing (what if the values in C go above 999?), or with sorting or grouping on multiple columns.
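A minimal sketch of that multi-index alternative, reusing the cf built above: keep A and C together as a MultiIndex instead of gluing them into one string, so there is no collision risk if C ever grows past 999.
cf_multi = cf.set_index(['A', 'C'])  # A and C together form the index; B stays a column
print(cf_multi)
print(cf_multi.loc[(5, 3)])          # select the row where A == 5 and C == 3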
