I'm trying to concatenate the columns 'A' and 'C' in a DataFrame like the following, to use the result as a new index:
A | B | C | ...
---------------------------
0 5 | djn | 0 | ...
1 5 | vlv | 1 | ...
2 5 | bla | 2 | ...
3 5 | ses | 3 | ...
4 5 | dug | 4 | ...
The desired result would be a DataFrame similar to the following:
A | B | C | ...
-------------------------------
05000 5 | djn | 0 | ...
05001 5 | vlv | 1 | ...
05002 5 | bla | 2 | ...
05003 5 | ses | 3 | ...
05004 5 | dug | 4 | ...
I've searched my eyes off; does someone know how to manipulate a dataframe to get such a result?
import pandas as pd

# dummying up a dataframe
cf = pd.DataFrame()
cf['A'] = 5 * [5]
cf['C'] = range(5)
cf['B'] = list('qwert')

# putting together two columns into a new, zero-padded string column
cf['D'] = (1000 * cf.A + cf.C).astype(str).str.zfill(5)

# use it as the index
cf.index = cf.D

# we don't need it as a column
cf.drop('D', axis=1, inplace=True)
print(cf.to_csv())
D,A,C,B
05000,5,0,q
05001,5,1,w
05002,5,2,e
05003,5,3,r
05004,5,4,t
That said, I suspect you'd be safer with multi-indexing (what if the values in C go above 999...), or with sorting or grouping on multiple columns.
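For reference, a minimal MultiIndex sketch along those lines, reusing the dummy cf from above (the slicing example is just an assumption about how you might query it):

# keep A and C as a two-level index instead of packing them into one string
cf2 = cf.reset_index(drop=True).set_index(['A', 'C'])

# rows can then be selected by both levels, e.g. A == 5 and C == 3
print(cf2.loc[(5, 3)])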
Python newbie here. I have written code that solves the issue; however, there should be a much better way of doing it.
I have two Series that come from the same table, but due to some earlier processing I get them as separate sets. (They could be joined into a single dataframe again, since the entries belong to the same record.)
Ser1
| id |
|----|
| 1  |
| 2  |
| 2  |
| 3  |

Ser2
| section |
|---------|
| A       |
| B       |
| C       |
| D       |
df2
| id | section |
| ---|---------|
| 1 | A |
| 2 | B |
| 2 | Z |
| 2 | Y |
| 4 | X |
First, I would like to find those entries in Ser1 which match the same id in df2. Then, check whether the values in Ser2 can NOT be found in the section column of df2.
My expected results:
| id | section | result |
| ---|-------- |---------|
| 1 | A | False | # Both id(1) and section(A) are also in df2
| 2 | B | False | # Both id(2) and section(B) are also in df2
| 2 | C | True | # id(2) is in df2 but section(C) is not
| 3 | D | False | # id(3) is not in df2, in that case the result should also be False
My code:
for k, v in Ser2.items():
    rslt_df = df2[df2['id'] == Ser1[k]]
    if rslt_df.empty:
        print(False)
    elif v not in rslt_df['section'].tolist():
        print(True)
    else:
        print(False)
I know the code is not very good. But after reading about merging and list comprehensions, I am getting confused about what the best way to improve it would be.
You can concat the series and compute the "result" with boolean arithmetic (XOR):
out = (
pd.concat([ser1, ser2], axis=1)
.assign(result=ser1.isin(df2['id'])!=ser2.isin(df2['section']))
)
Output:
id section result
0 1 A False
1 2 B False
2 2 C True
3 3 D False
Intermediates:
m1 = ser1.isin(df2['id'])
m2 = ser2.isin(df2['section'])
m1 m2 m1!=m2
0 True True False
1 True True False
2 True False True
3 False False False
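For reference, a minimal reproduction of the inputs used above (the variable names ser1, ser2 and df2 are assumed from the question):

import pandas as pd

ser1 = pd.Series([1, 2, 2, 3], name='id')
ser2 = pd.Series(['A', 'B', 'C', 'D'], name='section')
df2 = pd.DataFrame({'id': [1, 2, 2, 2, 4],
                    'section': ['A', 'B', 'Z', 'Y', 'X']})

out = (
    pd.concat([ser1, ser2], axis=1)
      .assign(result=ser1.isin(df2['id']) != ser2.isin(df2['section']))
)
print(out)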
my data frame:
+-----+--------+-------+
| val | id | reRnk |
+-----+--------+-------+
| 2 | a | yes |
| 1 | b | no |
| 3 | c | no |
| 8 | d | yes |
| 7 | e | yes |
| 9 | f | no |
+-----+--------+-------+
In my desired output I will re-rank only the rows where reRnk == yes; ranking will be done based on "val".
I don't want to move the rows where reRnk == no; for example, at id=b we have reRnk=no, and I want to keep that row at row no. 2.
my desired output will look like this:
+-----+--------+-------+
| val | id | reRnk |
+-----+--------+-------+
| 8 | d | yes |
| 1 | b | no |
| 3 | c | no |
| 7 | e | yes |
| 2 | a | yes |
| 9 | f | no |
+-----+--------+-------+
From what I'm reading, pyspark DataFrames do not have an index by default. You might need to add one.
I do not know the exact syntax for pyspark; however, since it has many similarities with pandas, this might point you in the right direction:
mask = df.reRnk == 'yes'
df.loc[mask, ['val', 'id']] = (
    df.loc[mask, ['val', 'id']]
      .sort_values('val', ascending=False)
      .set_index(df.loc[mask, ['val', 'id']].index)
)
Basically, we isolate the rows with reRnk == 'yes' and sort them by val, but reset the index back to the original index; then we assign these new values to the original rows of the df.
for .loc, https://spark.apache.org/docs/3.2.0/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.loc.html might be worth a try.
for .sort_values see: https://sparkbyexamples.com/pyspark/pyspark-orderby-and-sort-explained/
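For reference, here is that idea as a runnable pandas sketch on the question's data; the pyspark.pandas API may differ in the details:

import pandas as pd

df = pd.DataFrame({'val': [2, 1, 3, 8, 7, 9],
                   'id': list('abcdef'),
                   'reRnk': ['yes', 'no', 'no', 'yes', 'yes', 'no']})

# isolate the reRnk == 'yes' rows, sort them by val, and write them back
# onto their original positions
mask = df.reRnk == 'yes'
df.loc[mask, ['val', 'id']] = (
    df.loc[mask, ['val', 'id']]
      .sort_values('val', ascending=False)
      .set_index(df.loc[mask, ['val', 'id']].index)
)
print(df)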
Inside my application I have a dataframe that looks similar to this:
Example:
id | address | code_a | code_b | code_c | more columns
1 | parkdrive 1 | 012ah8 | 012ah8a | 1345wqdwqe | ....
2 | parkdrive 1 | 012ah8 | 012ah8a | dwqd4646 | ....
3 | parkdrive 2 | 852fhz | 852fhza | fewf6465 | ....
4 | parkdrive 3 | 456se1 | 456se1a | 856fewf13 | ....
5 | parkdrive 3 | 456se1 | 456se1a | gth8596s | ....
6 | parkdrive 3 | 456se1 | 456se1a | a48qsgg | ....
7 | parkdrive 4 | tg8596 | tg8596a | 134568a | ....
As you may see, every address can have multiple entries inside my dataframe; code_a and code_b follow a certain pattern, and only code_c is unique.
What I'm trying to obtain is a dataframe where the column code_c is ignored, dropped or whatever, and the whole dataframe is reduced to only one entry for each address, something like this:
id | address | code_a | code_b | more columns
1 | parkdrive 1 | 012ah8 | 012ah8a | ...
3 | parkdrive 2 | 852fhz | 852fhza | ...
4 | parkdrive 3 | 456se1 | 456se1a | ...
7 | parkdrive 4 | tg8596 | tg8596a | ...
I tried the groupby function, but this didn't seem to work. Or is this even the right function?
Thanks for your help and good day to all of you!
You can use drop_duplicates to do this:
df.drop_duplicates(subset=['address'], inplace=True)
This will keep only a single entry per address.
I think what you are looking for is
# in this way you are looking for duplicate rows across all columns except 'code_c'
df.drop_duplicates(subset=df.columns.difference(['code_c']))

# in this way you are looking for duplicate rows based ONLY on the column 'address'
df.drop_duplicates(subset='address')
I notice in your example data that if you drop code_c, then all the entries with address "parkdrive 1", for example, are just duplicates.
You should drop the column code_c:
df.drop('code_c',axis=1,inplace=True)
Then you can drop the duplicates:
df_clean = df.drop_duplicates()
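Since groupby was mentioned in the question, that route can also work. A minimal sketch, assuming you want to keep the first row per address (column names as in the question):

# drop code_c, then keep the first remaining row of each address
result = (
    df.drop(columns='code_c')
      .groupby('address', as_index=False)
      .first()
)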
I have the following dataframe
+-------+------------+
| index | keep       |
+-------+------------+
| 0     | not useful |
| 1     | start_1    |
| 2     | useful     |
| 3     | end_1      |
| 4     | not useful |
| 5     | start_2    |
| 6     | useful     |
| 7     | useful     |
| 8     | end_2      |
+-------+------------+
There are two pairs of strings (start_1, end_1 and start_2, end_2) that indicate that the rows between those strings are the only ones relevant in the data. Hence, in the dataframe below, the output dataframe would be composed only of the rows at index 2, 6 and 7 (since 2 is between start_1 and end_1, and 6 and 7 are between start_2 and end_2).
d = {'keep': ["not useful", "start_1", "useful", "end_1", "not useful", "start_2", "useful", "useful", "end_2"]}
df = pd.DataFrame(data=d)
What is the most Pythonic/Pandas approach to this problem?
Thanks
Here's one way to do that (in a couple of steps, for clarity). There might be others:
df["sections"] = 0
df.loc[df.keep.str.startswith("start"), "sections"] = 1
df.loc[df.keep.str.startswith("end"), "sections"] = -1
df["in_section"] = df.sections.cumsum()
res = df[(df.in_section == 1) & ~df.keep.str.startswith("start")]
Output:
     keep  sections  in_section
2  useful         0           1
6  useful         0           1
7  useful         0           1
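The same idea can also be written without the helper columns; a hedged variant of the above:

# +1 at each start marker, -1 at each end marker
marks = (df.keep.str.startswith("start").astype(int)
         - df.keep.str.startswith("end").astype(int))

# a row is inside a section when the running sum is 1, excluding the start rows themselves
res = df[marks.cumsum().eq(1) & ~df.keep.str.startswith("start")]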
I need help with creating a conditional column using values from multiple other columns with pandas.
Column1|Column2|Column3|Column4
1 | 2 | 5 | A
2 | 3 | 4 | B
3 | 4 | 3 | C
4 | 5 | 2 | B
5 | 1 | 1 | C
And what I want is to create a new column such that if Column4 is equal to A then the new column will be equal to the value in Column1 (likewise B maps to Column2 and C to Column3), so the final dataframe would look like this:
Column1|Column2|Column3|Column4|column5
1 | 2 | 5 | A | 1
2 | 3 | 4 | B | 3
3 | 4 | 3 | C | 3
4 | 5 | 2 | B | 5
5 | 1 | 1 | C | 1
Here is what I have tried so far, but I keep getting the error: data.column1(x) object is not callable
def column5(x):
    if x['column4'] == 'A':
        return data.column1(x)
    elif x['column4'] == 'B':
        return data.column2(x)
    elif x['column4'] == 'C':
        return data.column3(x)
You got the error because data.column1 is a pandas.Series; you cannot call it like a function with data.column1(x).
Also, your desired value is different for each row based on the value of Column4, so you will need either a loop or, better, pandas' apply() function.
Try this:
# map value to column
val_to_col = {
'A': 'Column1',
'B': 'Column2',
'C': 'Column3'
}
# get data from col, based on row[col4]
df['column5'] = df.apply(lambda row: row[val_to_col.get(row['Column4'])], axis=1)
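As a hedged alternative, if the mapping stays this small, numpy's select gives a vectorized version without the row-wise apply (column names as in the question):

import numpy as np

# pick from Column1/2/3 depending on the value in Column4
conditions = [df['Column4'].eq('A'), df['Column4'].eq('B'), df['Column4'].eq('C')]
choices = [df['Column1'], df['Column2'], df['Column3']]
df['column5'] = np.select(conditions, choices)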