Group by in pandas and combine multiple strings - python

I currently have a dataframe which resembles this:
Source | Destination | Type
A | B | Insert
A | B | Delete
B | C | Insert
What I want to achieve is something like this:
Source | Destination | Type
A | B | Insert, Delete
B | C | Insert
I tried grouping by Source and Destination, but I'm a little unsure how to append the Type values together. Any ideas?

Check the code below; it gives the unique Type values per group as an array:
df.groupby(['Source','Destination']).agg({'Type':'unique'})

Figured out something:
df = df.groupby(['Source','Destination'])['Type'].apply(','.join).reset_index()

Make a string separated by ", ":
df.groupby(['Source', 'Destination']).agg(', '.join).reset_index()
Make a list:
df.groupby(['Source','Destination']).agg(list).reset_index()
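To see both on the question's data, a quick runnable sketch:
import pandas as pd

# the question's frame
df = pd.DataFrame({'Source': ['A', 'A', 'B'],
                   'Destination': ['B', 'B', 'C'],
                   'Type': ['Insert', 'Delete', 'Insert']})

joined = df.groupby(['Source', 'Destination']).agg(', '.join).reset_index()
print(joined)
#   Source Destination            Type
# 0      A           B  Insert, Delete
# 1      B           C          Insert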

Joining with ',' to build a string works, but it is not very convenient if you later want to perform other operations on this column, such as iterating over the elements.
This creates a set of values instead:
from typing import Any, Dict

column_map: Dict[str, Any] = {}
column_map["Type"] = lambda x: set(x)
df.groupby(["Source", "Destination"]).agg(column_map)
Source | Destination | Type
A | B | {Insert, Delete}
B | C | {Insert}
If you instead want a list and don't want to eliminate duplicates, just replace set(x) with list(x).

Related

How can I copy values from one dataframe column to another based on the difference between the values

I have two CSV mirror files generated by two different servers. Both files have the same number of lines and should have the exact same unix timestamp column. However, due to some clock issues, some records in one file might have a small difference of a nanosecond from their counterpart records in the other CSV file; see the example below, where the difference is always 1:
dataframe_A:
|   | ts_ns               |
| - | ------------------- |
| 1 | 1661773636777407794 |
| 2 | 1661773636786474677 |
| 3 | 1661773636787956823 |
| 4 | 1661773636794333099 |
dataframe_B:
|   | ts_ns               |
| - | ------------------- |
| 1 | 1661773636777407793 |
| 2 | 1661773636786474677 |
| 3 | 1661773636787956823 |
| 4 | 1661773636794333100 |
Since these are huge files with millions of lines, I use pandas and dask to process them, but before processing I need to ensure they have the same timestamp column.
I need to check the difference between column ts_ns in A and B, and where there is a difference of 1 or -1, replace the value in B with the corresponding ts_ns value from A, so that both files end up with the same ts_ns value for corresponding records.
How can I do this in a decent way using pandas/dask?
If you're sure that the timestamps should be identical, why don't you simply use the timestamp column from dataframe A and overwrite the timestamp column in dataframe B with it?
Why even check whether the difference is there or not?
You can use the pandas merge_asof function for this, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html . The tolerance parameter accepts an int or timedelta; for your example it should be set to 1, with direction='nearest'.
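A minimal sketch of that idea on the question's sample data (assuming both frames are sorted by ts_ns, which merge_asof requires; the rename just keeps B's original values visible):
import pandas as pd

df_a = pd.DataFrame({'ts_ns': [1661773636777407794, 1661773636786474677,
                               1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({'ts_ns': [1661773636777407793, 1661773636786474677,
                               1661773636787956823, 1661773636794333100]})

# Match every B timestamp to the nearest A timestamp within +/-1 ns.
matched = pd.merge_asof(
    df_b.rename(columns={'ts_ns': 'ts_ns_b'}),
    df_a,
    left_on='ts_ns_b', right_on='ts_ns',
    direction='nearest',
    tolerance=1,
)
df_b['ts_ns'] = matched['ts_ns'].astype('int64')  # every row matches here
Note that rows with no match within the tolerance come back as NaN, which forces the column to float64 and loses nanosecond precision, so guard against that on real data.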
Assuming your files are identical except for the ts_ns column, you can perform a .merge on the indices:
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'ts_ns': [1661773636777407794, 1661773636786474677, 1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({'ts_ns': [1661773636777407793, 1661773636786474677, 1661773636787956823, 1661773636794333100]})

df_b = (df_b
    .merge(df_a, how='left', left_index=True, right_index=True, suffixes=('', '_a'))
    .assign(
        # take A's value wherever the two differ by at most 1
        ts_ns=lambda df_: np.where(abs(df_.ts_ns - df_.ts_ns_a) <= 1, df_.ts_ns_a, df_.ts_ns)
    )
    .loc[:, ['ts_ns']]
)
But I agree with @ManEngel: just overwrite all the values if you know they are identical.

Filter rows of strings after a starting row that starts with certain characters

I'm using Python.
I have extracted text from a PDF, so I have a dataframe full of strings, with just one column and no column name.
I need to filter rows from a starting row until the end. The starting row is identified because it starts with certain characters. Consider the following example:
+----------------+
| aaaaaaa |
| bbbbbb |
| ccccccc |
| hellodddd |
| eeeeeeeee |
| fffffffffff |
| gggggggg |
| hhhhhhhh |
+----------------+
I need to filter rows from the starting row, which is hellodddd, until the end. As you can see, the starting row is identified because it starts with the characters hello.
So, the expected output is:
+----------------+
| hellodddd |
| eeeeeeeee |
| fffffffffff |
| gggggggg |
| hhhhhhhh |
+----------------+
I think this example can be reproduced with the following code:
import pandas as pd

mylist = ['aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd', 'eeeeeeeee', 'fffffffffff', 'gggggggg', 'hhhhhhhh']
df = pd.DataFrame(mylist)
I think I need to use the startswith() function first to identify the starting row. But then, what can I do to select the wanted rows (the ones that follow the starting row until the end)?
.startswith() is a method on a string, returning whether or not a string starts with some substring, it won't help you select rows in a dataframe (unless you're looking for the first row with a value that starts with that string).
You're looking for something like:
import pandas as pd
mylist = ['aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd', 'eeeeeeeee', 'fffffffffff', 'hellodddd', 'hhhhhhhh']
df = pd.DataFrame(mylist)
print(df[(df[0].values == 'hellodddd').argmax():])
Result:
0
3 hellodddd
4 eeeeeeeee
5 fffffffffff
6 hellodddd
7 hhhhhhhh
Note that I replaced a later value with 'hellodddd' as well, to show that it will include all rows from the first match onwards.
Edit: in response to the comment:
import pandas as pd
mylist = ['aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd', 'eeeeeeeee', 'fffffffffff', 'hellodddd', 'hhhhhhhh']
df = pd.DataFrame(mylist)
print(df[(df[0].str.startswith('hello')).argmax():])
Result is identical.
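As a variant not shown above: argmax returns 0 when nothing matches (argmax of all-False is 0), which would return the whole frame. A boolean cummax mask achieves the same selection and returns an empty frame instead when there is no match:
import pandas as pd

mylist = ['aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd',
          'eeeeeeeee', 'fffffffffff', 'hellodddd', 'hhhhhhhh']
df = pd.DataFrame(mylist)

# True from the first row starting with 'hello' until the end.
mask = df[0].str.startswith('hello').cummax()
print(df[mask])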
I don't know much about pandas, but I know that itertools can solve this problem:
import itertools
mylist = [
'aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd', 'eeeeeeeee',
'fffffffffff', 'gggggggg', 'hhhhhhhh'
]
result = list(itertools.dropwhile(
lambda element: not element.startswith("hello"),
mylist,
))
The dropwhile function discards elements as long as the predicate is true (here: as long as the element does not start with "hello"); after that, it returns the first matching element and everything that follows.
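If you need a dataframe again afterwards, you can simply wrap the remaining list back up, e.g.:
import itertools
import pandas as pd

mylist = ['aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd',
          'eeeeeeeee', 'fffffffffff', 'gggggggg', 'hhhhhhhh']

# drop everything before the first element starting with "hello"
result = list(itertools.dropwhile(
    lambda element: not element.startswith("hello"), mylist))
df = pd.DataFrame(result)
print(df)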

cleaning my dataframe (similar lines and \xc3\x28 in the field)

I am working on a dataframe with Python.
In my first dataframe df1 I have:
| ID  | PUBLICATION TITLE | DATE         | JOURNAL         |
|-----|-------------------|--------------|-----------------|
| 1   | "a"               | "01/10/2000" | "book1"         |
| 2   | "b"               | "09/03/2005" | NaN             |
| NaN | "b"               | "09/03/2005" | "book2"         |
| 5   | "z"               | "21/08/1995" | "book4"         |
| 6   | "n"               | "15/04/1993" | "book9\xc3\x28" |
Here I would like to clean my dataframe, but I don't know how to do it in this case.
Two points block me.
The first is that rows 2 and 3 seem to be the same record, because the publication title is the same, and I think a publication title is unique to a journal.
The second concerns the \xc3\x28 in the last line.
How can I clean my dataframe in a smart way, so that the code can be reused for other dataframes if possible?
First, you should remove the row with ID = NaN. This can be done by:
df1 = df1[df1['ID'].notna()]
Then update the JOURNAL of the 2nd row:
df1.iloc[1, df1.columns.get_loc('JOURNAL')] = 'book2'
Finally, for the 'book9\xc3\x28' entry (now at position 3, since one row was dropped), you can update it with:
df1.iloc[3, df1.columns.get_loc('JOURNAL')] = 'book9'
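For something more reusable, here is a sketch that performs the same steps generically, under the assumptions that the columns are named exactly ID, PUBLICATION TITLE, DATE and JOURNAL, and that the mojibake shows up as literal \xHH escape text inside the cells:
import pandas as pd

def clean_publications(df):
    df = df.copy()
    key = ['PUBLICATION TITLE', 'DATE']
    # within rows sharing the same title and date, fill missing values
    # from the duplicates before dropping them
    df['JOURNAL'] = df.groupby(key)['JOURNAL'].transform(lambda s: s.ffill().bfill())
    df = df[df['ID'].notna()]                      # drop rows without an ID
    df = df.drop_duplicates(subset=key, keep='first')
    # strip literal byte-escape debris such as \xc3\x28 from journal names
    df['JOURNAL'] = df['JOURNAL'].str.replace(r'\\x[0-9a-fA-F]{2}', '', regex=True)
    return df

df1 = clean_publications(df1)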
What type of encoding are you using? I recommend using "utf8" encoding for this purpose.
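If the data comes from a file, the encoding can be forced at read time; a small sketch, assuming a hypothetical pubs.csv (encoding_errors requires pandas >= 1.3):
import pandas as pd

# 'replace' substitutes undecodable bytes instead of raising
df1 = pd.read_csv('pubs.csv', encoding='utf-8', encoding_errors='replace')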

Apply function with string and integer from multiple columns not working

I want to create a combined string based on two columns, one an integer and the other a string; I need to combine them to create a single string.
I've already tried using the solution from this answer (Apply function to create string with multiple columns as argument), but it doesn't give the required output.
I have two columns: prod_no, which is an integer, and PROD, which is a string. So something like:
| prod_no | PROD  | out           |
|---------|-------|---------------|
| 1       | PRODA | #Item=1=PRODA |
| 2       | PRODB | #Item=2=PRODB |
| 3       | PRODC | #Item=3=PRODC |
To get the last column, I used the following code:
prod_list['out'] = prod_list.apply(lambda x: "#ITEM={}=={}"
    .format(prod_list.prod_no.astype(str), prod_list.PROD), axis=1)
I'm trying to produce the column "out", but the result of that code is weird: the output is #Item=0 1 22 3... — very odd. I'm specifically trying to implement this using apply and lambda. However, I'm biased toward efficient implementations, since I'm trying to learn how to write optimized code. Please help :)
This works. With axis=1 the lambda receives a single row x, so you format x["prod_no"] and x["PROD"]; your version passed the whole prod_list.prod_no Series into format, which is why the entire column was stringified into each cell.
import pandas as pd
df= pd.DataFrame({"prod_no": [1,2,3], "PROD": [ "PRODA", "PRODB", "PRODC" ]})
df["out"] = df.apply(lambda x: "#ITEM={}=={}".format(x["prod_no"], x["PROD"]), axis=1)
print(df)
Output:
PROD prod_no out
0 PRODA 1 #ITEM=1==PRODA
1 PRODB 2 #ITEM=2==PRODB
2 PRODC 3 #ITEM=3==PRODC
You can also try with zip:
df = df.assign(out=['#ITEM={}=={}'.format(a, b) for a, b in zip(df.prod_no, df.PROD)])
# or directly: df.assign(out='#ITEM=' + df.prod_no.astype(str) + '==' + df.PROD)
prod_no PROD out
0 1 PRODA #ITEM=1==PRODA
1 2 PRODB #ITEM=2==PRODB
2 3 PRODC #ITEM=3==PRODC

Replacing string value in a pandas dataframe column inside a list in Python

I have a column in my dataframe like this:
| columnn               |
|-----------------------|
| [happiness#sad]       |
| [happy ness#moderate] |
| [happie ness#sad]     |
and I want to replace "happy ness", "happiness", and "happie ness" with 'happyness'. I am currently using the method below, but nothing changes.
The strings must match exactly:
happy ness  ===> happyness
happiness   ===> happyness
happie ness ===> happyness
I tried the two approaches below.
1st approach:
df.column = df.column.replace({"happiness": "happyness", "happy ness": "happyness", "happie ness": "happyness"})
2nd approach:
df['column'] = df['column'].str.replace("happiness", "happyness").str.replace("happy ness", "happyness").str.replace("happie ness", "happyness")
Desired Output:
| columnn              |
|----------------------|
| [happyness,sad]      |
| [happyness,moderate] |
| [happyness,sad]      |
This is one approach using replace with regex=True.
Ex:
import pandas as pd
df = pd.DataFrame({"columnn": [["happiness#sad"], ["happy ness#moderate"], ["happie ness#sad"]]})
data = {"happiness": "happyness", "happy ness": "happyness", "happie ness": "happyness"}
df["columnn"] = df["columnn"].apply(lambda x: pd.Series(x).replace(data, regex=True).tolist())
print(df)
Output:
                 columnn
0        [happyness#sad]
1   [happyness#moderate]
2        [happyness#sad]
Try this approach; I think it will work for you:
df['new_col'] = df['column'].replace(
    to_replace=['happy ness', 'happiness', 'happie ness'],
    value=['happyness', 'happyness', 'happyness'])
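Neither snippet touches the '#' separator, but the desired output also swaps it for a comma. The same regex-replace dict can handle that too; a sketch reusing the frame from the first answer:
import pandas as pd

df = pd.DataFrame({"columnn": [["happiness#sad"],
                               ["happy ness#moderate"],
                               ["happie ness#sad"]]})

# word fixes plus the separator swap, applied inside each list cell
data = {"happiness": "happyness", "happy ness": "happyness",
        "happie ness": "happyness", "#": ","}
df["columnn"] = df["columnn"].apply(
    lambda cell: pd.Series(cell).replace(data, regex=True).tolist())
print(df)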
