search for string in pandas row - python

How can I search an entire row of a pandas DataFrame for a phrase and, if it exists, create a new column that says 'Yes' and lists the columns in that row where it was found? I would also like to ignore case.

You could use Pandas apply function, which allows you to traverse rows or columns and apply your own function to them.
For example, given a dataframe
+--------------------------------------+------------+---+
| deviceid | devicetype | 1 |
+--------------------------------------+------------+---+
| b569dcb7-4498-4cb4-81be-333a7f89e65f | Google | 1 |
| 04d3b752-f7a1-42ae-8e8a-9322cda4fd7f | Android | 2 |
| cf7391c5-a82f-4889-8d9e-0a423f132026 | Android | 3 |
+--------------------------------------+------------+---+
Define a function
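import pandas as pd
def pr(array, value):
    # Labels of the columns in this row that contain `value`, ignoring case
    condition = array[array.str.contains(value, case=False, na=False)].index.tolist()
    # Series.append was removed in pandas 2.0, so concatenate the extra cell instead
    flag = 'Yes' if condition else 'No'
    return pd.concat([array, pd.Series({"condition": [flag] + condition})])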
Use it
df.apply(pr, axis=1, args=('Google',))
+---+--------------------------------------+------------+---+-------------------+
| | deviceid | devicetype | 1 | condition |
+---+--------------------------------------+------------+---+-------------------+
| 0 | b569dcb7-4498-4cb4-81be-333a7f89e65f | Google | 1 | [Yes, devicetype] |
| 1 | 04d3b752-f7a1-42ae-8e8a-9322cda4fd7f | Android | 2 | [No] |
| 2 | cf7391c5-a82f-4889-8d9e-0a423f132026 | Android | 3 | [No] |
+---+--------------------------------------+------------+---+-------------------+
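As a variation on the answer above, here is a minimal self-contained sketch that keeps the Yes/No flag and the list of matching columns in two separate columns. The helper name `find_phrase` and the column names `found` and `found_in` are made up for illustration, and every cell is cast to text before matching, which is an assumption about how non-string columns should be treated.

```python
import pandas as pd

# Sample data mirroring the table in the answer
df = pd.DataFrame({
    "deviceid": [
        "b569dcb7-4498-4cb4-81be-333a7f89e65f",
        "04d3b752-f7a1-42ae-8e8a-9322cda4fd7f",
        "cf7391c5-a82f-4889-8d9e-0a423f132026",
    ],
    "devicetype": ["Google", "Android", "Android"],
    "1": [1, 2, 3],
})

def find_phrase(frame, phrase):
    """Hypothetical helper: flag rows containing `phrase` and list the matching columns."""
    # Cast every cell to text, then test each column for a case-insensitive literal match
    hits = frame.astype(str).apply(
        lambda col: col.str.contains(phrase, case=False, regex=False)
    )
    out = frame.copy()
    out["found"] = hits.any(axis=1).map({True: "Yes", False: "No"})
    # For each row, keep the labels of the columns where the phrase was found
    out["found_in"] = [list(hits.columns[row]) for row in hits.to_numpy()]
    return out

result = find_phrase(df, "google")
print(result[["devicetype", "found", "found_in"]])
```

Passing `regex=False` treats the phrase as literal text, so characters such as `.` or `(` in the search phrase are not interpreted as a pattern.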

Related

Using a regular expression to create a new DataFrame column

I have the following Python DataFrame:
| ColumnA | File |
| -------- | -------------- |
| First | aasdkh.xls |
| Second | sadkhZ.xls |
| Third | asdasdPH.xls |
| Fourth | adsjklahsd.xls |
and so on.
I'm trying to get the following DataFrame:
| ColumnA | File | Category|
| -------- | ---------------- | ------- |
| First | aasdkh.xls | N |
| Second | sadkhZ.xls | Z |
| Third | asdasdPH.xls | PH |
| Fourth | adsjklahsdPH.xls | PH |
I'm trying to use regular expressions, but I'm not sure how to use them. I need a new column that "extracts" the category of the file: N if it is a "normal" file (no category), Z if the file name contains a "Z" just before the extension, and PH if it contains "PH" just before the extension.
I defined the following regular expressions that I think I could use, but I don't know how to apply them:
regex_Z = re.compile('Z.xls$')
regex_PH = re.compile('PH.xls$')
PS: Could you recommend any website for learning how to use regular expressions?
Let's try
df['Category'] = df['File'].str.extract(r'(Z|PH)\.xls$', expand=False).fillna('N')
print(df)
  ColumnA            File Category
0   First      aasdkh.xls        N
1  Second      sadkhZ.xls        Z
2   Third    asdasdPH.xls       PH
3  Fourth  adsjklahsd.xls        N
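For anyone who wants to run the answer end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from the input table in the question; the variable name `pattern` is only for illustration.

```python
import pandas as pd

# Input table from the question
df = pd.DataFrame({
    "ColumnA": ["First", "Second", "Third", "Fourth"],
    "File": ["aasdkh.xls", "sadkhZ.xls", "asdasdPH.xls", "adsjklahsd.xls"],
})

# (Z|PH) captures the category, \. matches a literal dot,
# and $ anchors the match at the end of the file name
pattern = r"(Z|PH)\.xls$"

# expand=False returns a Series; rows without a match are NaN and become "N"
df["Category"] = df["File"].str.extract(pattern, expand=False).fillna("N")
print(df)
```

The raw-string prefix `r` keeps `\.` intact as a regex escape, so Python does not try to interpret it as a string escape sequence.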

Update a column data w.r.t values in other columns regex match in dataframes

I have a data frame with more than 1,000,000 rows and 15 columns.
I have to create new columns and assign their values based on string values in other columns, matching either by regex or by exact character match.
For example, suppose there is a column called File path. I have to create a feature column whose values are assigned by taking a folder path (full or partial) as input, matching it against the file path, and updating the feature column accordingly.
I thought about iterating with a for loop, but that is very time-consuming, and I expect it to take even longer in pandas if the number of components to loop over grows in the future.
Is there an efficient way to do this type of operation in pandas?
Please help me with this.
Example:
I have a df as:
| ID | File |
| -------- | -------------- |
| 1 | SWE_Toot |
| 2 | SWE_Thun |
| 3 | IDH_Toet |
| 4 | SDF_Then |
| 5 | SWE_Toot |
| 6 | SWE_Thun |
| 7 | SEH_Toot |
| 8 | SFD_Thun |
I will get the components in other tables, such as:
| ID | File |
| -------- | -------------- |
| Software | */SWE_Toot/*.h |
| |*/IDH_Toet/*.c |
| |*/SFD_Toto/*.c |
and a second one:
| ID | File |
| -------- | -------------- |
| Wire | */SDF_Then/*.h |
| |*/SFD_Thun/*.c |
| |*/SFD_Toto/*.c |
and so on. There will be around 1,000,000 files, and 278 components are received.
The output I want is:
| ID | File |Component|
| -------- | -------------- |---------|
| 1 | SWE_Toot |Software |
| 2 | SWE_Thun |Other |
| 3 | IDH_Toet |Software |
| 4 | SDF_Then |Wire |
| 5 | SWE_Toto |Various |
| 6 | SWE_Thun |Other |
| 7 | SEH_Toto |Various |
| 8 | SFD_Thun |Wire |
Other - filled in last, once all the fields and regexes have been checked and the file does not belong to any component.
Various - the file may belong to more than one component (alternatively, we can give a list of the components it belongs to).
I was able to read the component tables and create a regex, but to create the component column I have to write for loops for all the 278 columns, and loop over the same table for each component.
Is there an easier way to do this with pandas?
The data will be very large.
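One possible direction for the question above, sketched under assumptions rather than offered as a definitive answer: flatten all component tables into a single lookup table, reduce it to one label per folder name, and map that onto the file column in one vectorized step. The sketch assumes that each pattern has the form `*/<folder>/*.<ext>` and that a file matches when its name equals the folder part exactly; the names `components`, `lookup`, and `label` are hypothetical.

```python
import pandas as pd

# File table from the question
files = pd.DataFrame({
    "ID": range(1, 9),
    "File": ["SWE_Toot", "SWE_Thun", "IDH_Toet", "SDF_Then",
             "SWE_Toot", "SWE_Thun", "SEH_Toot", "SFD_Thun"],
})

# Component patterns as listed in the question (assumed layout: */<folder>/*.<ext>)
components = {
    "Software": ["*/SWE_Toot/*.h", "*/IDH_Toet/*.c", "*/SFD_Toto/*.c"],
    "Wire": ["*/SDF_Then/*.h", "*/SFD_Thun/*.c", "*/SFD_Toto/*.c"],
}

# Long lookup table: one row per (component, folder name) pair
lookup = pd.DataFrame(
    [(comp, pat.split("/")[1]) for comp, pats in components.items() for pat in pats],
    columns=["Component", "File"],
)

def label(comps):
    # A folder listed under exactly one component keeps that name; otherwise "Various"
    unique = sorted(set(comps))
    return unique[0] if len(unique) == 1 else "Various"

per_file = lookup.groupby("File")["Component"].agg(label)

# Vectorized lookup; files that match no component become "Other"
files["Component"] = files["File"].map(per_file).fillna("Other")
print(files)
```

With the input table as given, IDs 5 and 7 come out as Software and Other, because the expected table in the question lists different file names for those rows. A folder such as SFD_Toto, which appears under both components, is labeled Various in `per_file`. Partial or regex matching would need a different lookup step than the exact `map` used here.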

Pyspark: Reorder only a subset of rows among themselves

my data frame:
+-----+--------+-------+
| val | id | reRnk |
+-----+--------+-------+
| 2 | a | yes |
| 1 | b | no |
| 3 | c | no |
| 8 | d | yes |
| 7 | e | yes |
| 9 | f | no |
+-----+--------+-------+
In my desired output, I want to re-rank only the rows where reRnk == yes; the ranking is based on "val".
I don't want to change the rows where reRnk = no. For example, at id=b we have reRnk=no, so I want to keep that row at row no. 2.
my desired output will look like this:
+-----+--------+-------+
| val | id | reRnk |
+-----+--------+-------+
| 8 | d | yes |
| 1 | b | no |
| 3 | c | no |
| 7 | e | yes |
| 2 | a | yes |
| 9 | f | no |
+-----+--------+-------+
From what I'm reading, PySpark DataFrames do not have an index by default, so you might need to add one.
I do not know the exact PySpark syntax; however, since it has many similarities with pandas, this might point you in the right direction:
df.loc[df.reRnk == 'yes', ['val','id']] = df.loc[df.reRnk == 'yes', ['val','id']].sort_values('val', ascending=False).set_index(df.loc[df.reRnk == 'yes', ['val','id']].index)
Basically, what we do here is isolate the rows with reRnk == 'yes' and sort them by val, then relabel the sorted rows with the original index of those rows. Finally, we assign the new values back to the original rows in the df.
For .loc, https://spark.apache.org/docs/3.2.0/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.loc.html might be worth a try.
For .sort_values, see: https://sparkbyexamples.com/pyspark/pyspark-orderby-and-sort-explained/
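The idea in this answer can be sketched step by step in plain pandas (not PySpark), assuming the data fits in memory. The variable names `mask`, `cols`, and `ranked` are only for illustration.

```python
import pandas as pd

# Data frame from the question
df = pd.DataFrame({
    "val": [2, 1, 3, 8, 7, 9],
    "id": list("abcdef"),
    "reRnk": ["yes", "no", "no", "yes", "yes", "no"],
})

mask = df["reRnk"] == "yes"
cols = ["val", "id"]

# Sort only the flagged rows by val, highest first
ranked = df.loc[mask, cols].sort_values("val", ascending=False)

# Relabel the sorted rows with the original positions of the flagged rows,
# so that index-aligned assignment drops them back into those slots
ranked.index = df.index[mask]
df.loc[mask, cols] = ranked
print(df)
```

Rows with reRnk == 'no' are never touched, so they stay at their original positions.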

Pandas - Create new column w/values from another column based on str contains

I have two DataFrames, one with multiple columns and the other with just one. I need to join them based on a partial string match in one column. Example:
df1
| Name | Classification |
| -------- | -------------------------- |
| A | Transport/Bicycle/Mountain |
| B | Transport/City/Bus |
| C | Transport/Taxi/City |
| D | Transport/City/Uber |
| E | Transport/Mountain/Jeep |
df2
| Category |
| -------- |
| Mountain |
| City |
As you can see, the order within the Classification column is not well defined.
Desired output
| Name | Classification | Category |
| -------- | -------------------------- |-----------|
| A | Transport/Bicycle/Mountain | Mountain |
| B | Transport/City/Bus | City |
| C | Transport/Taxi/City | City |
| D | Transport/City/Uber | City |
| E | Transport/Mountain/Jeep | Mountain |
I'm stuck on this. Any ideas?
Many thanks in advance.
This implementation does the trick:
def get_cat(val):
    # Return the first category from df2 that occurs in the classification string
    for cat in df2['Category']:
        if cat in val:
            return cat
    return None
df1['Category'] = df1['Classification'].apply(get_cat)
Note: as @Justin Ezequiel pointed out in the comments, you haven't specified what to do when both Mountain and City appear in the Classification. The current implementation uses the first Category that matches.
You can try this:
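dff = {"ne": []}
for x in df1["Classification"]:
    # Take the first category that occurs in this classification string
    for a in df2["Category"]:
        if a in x:
            dff["ne"].append(a)
            break
    else:
        # No category matched; keep the list aligned with df1
        dff["ne"].append(None)
df1["Category"] = dff["ne"]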
df1 will then look like your desired output.
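A vectorized alternative to the loops above, as a sketch: build a single regex alternation from df2 and let `str.extract` pull the category out of each string. This swaps in a different technique from the answers, and the variable name `pattern` is only for illustration.

```python
import re
import pandas as pd

# Tables from the question
df1 = pd.DataFrame({
    "Name": list("ABCDE"),
    "Classification": [
        "Transport/Bicycle/Mountain",
        "Transport/City/Bus",
        "Transport/Taxi/City",
        "Transport/City/Uber",
        "Transport/Mountain/Jeep",
    ],
})
df2 = pd.DataFrame({"Category": ["Mountain", "City"]})

# Build one alternation such as (Mountain|City);
# re.escape guards against special characters in category names
pattern = "(" + "|".join(map(re.escape, df2["Category"])) + ")"

# str.extract returns the leftmost match in each string, or NaN if none is found
df1["Category"] = df1["Classification"].str.extract(pattern, expand=False)
print(df1)
```

One difference in behavior: when a string contains several categories, this picks the one that appears first in the string, whereas the loop picks the one listed first in df2.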

Python - Pandas - Converting column with specific subsets into rows

I have a dataframe that looks like this below with Date, Price and Serial.
+----------+--------+--------+
| Date | Price | Serial |
+----------+--------+--------+
| 2/1/1996 | 0.5909 | 1 |
| 2/1/1996 | 0.5711 | 2 |
| 2/1/1996 | 0.5845 | 3 |
| 3/1/1996 | 0.5874 | 1 |
| 3/1/1996 | 0.5695 | 2 |
| 3/1/1996 | 0.584 | 3 |
+----------+--------+--------+
I would like to make it look like this, where each Serial value becomes a column name and each price falls into the row for its date and the column for its serial.
+----------+--------+--------+--------+
| Date | 1 | 2 | 3 |
+----------+--------+--------+--------+
| 2/1/1996 | 0.5909 | 0.5711 | 0.5845 |
| 3/1/1996 | 0.5874 | 0.5695 | 0.584 |
+----------+--------+--------+--------+
I understand I can do this with a loop, but I'm wondering if there is a more efficient way to do it.
Thanks for your kind help. I'm also curious whether there is a better way to paste such tables than attaching images to my questions.
You can use pandas.pivot_table:
res = (df.pivot_table(index='Date', columns='Serial', values='Price', aggfunc='sum')
         .reset_index())
res.columns.name = ''
       Date       1       2       3
0  2/1/1996  0.5909  0.5711  0.5845
1  3/1/1996  0.5874  0.5695  0.5840
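When every (Date, Serial) pair occurs only once, as in the sample data, `DataFrame.pivot` is an alternative that needs no aggregation function. A minimal self-contained sketch, with the DataFrame rebuilt from the table in the question:

```python
import pandas as pd

# Long-format table from the question
df = pd.DataFrame({
    "Date": ["2/1/1996"] * 3 + ["3/1/1996"] * 3,
    "Price": [0.5909, 0.5711, 0.5845, 0.5874, 0.5695, 0.584],
    "Serial": [1, 2, 3, 1, 2, 3],
})

# pivot raises a ValueError on duplicate (Date, Serial) pairs instead of aggregating them
res = df.pivot(index="Date", columns="Serial", values="Price").reset_index()

# Drop the leftover "Serial" label on the column axis
res.columns.name = None
print(res)
```

If duplicate pairs are possible, `pivot_table` with an explicit `aggfunc` is the safer choice, since it states how the duplicates are combined.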
