I have the following pandas DataFrame:
| ColumnA | File |
| -------- | -------------- |
| First | aasdkh.xls |
| Second | sadkhZ.xls |
| Third | asdasdPH.xls |
| Fourth | adsjklahsd.xls |
and so on.
I'm trying to get the following DataFrame:
| ColumnA | File | Category|
| -------- | ---------------- | ------- |
| First | aasdkh.xls | N |
| Second | sadkhZ.xls | Z |
| Third | asdasdPH.xls | PH |
| Fourth | adsjklahsd.xls | N |
I'm trying to use regular expressions, but I'm not sure how to apply them. I need a new column that "extracts" the category of the file: N if it is a "normal" file (no category), Z if the file has a "Z" just before the extension, and PH if it has a "PH" just before the extension.
I defined the following regex patterns, which I think I could use, but I don't know how:
regex_Z = re.compile(r'Z\.xls$')
regex_PH = re.compile(r'PH\.xls$')
PS: Could you recommend any website to learn how to use regular expressions?
Let's try str.extract:
df['Category'] = df['File'].str.extract(r'(Z|PH)\.xls$', expand=False).fillna('N')
print(df)
ColumnA File Category
0 First aasdkh.xls N
1 Second sadkhZ.xls Z
2 Third asdasdPH.xls PH
3 Fourth adsjklahsd.xls N
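If you want to use the compiled patterns you already defined, one option is numpy.select (a sketch; note the escaped dots, and that str.contains also accepts a compiled pattern):
import re
import numpy as np

regex_Z = re.compile(r'Z\.xls$')    # escaped dot: match a literal '.'
regex_PH = re.compile(r'PH\.xls$')

# check PH first (the longer suffix), then Z; everything else is 'N'
df['Category'] = np.select(
    [df['File'].str.contains(regex_PH), df['File'].str.contains(regex_Z)],
    ['PH', 'Z'],
    default='N',
)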
Related
I have a data frame with more than 1,000,000 rows and 15 columns.
I have to create new columns and assign them values based on the string values in the other columns, matching either with regex or with an exact character match.
For example, there is a column called File path. I have to build a feature column that is assigned values by taking a folder path (full or partial) as input, matching it against the file path, and updating the feature column.
I thought about iterating with a for loop, but that is very time-consuming, and I think iterating in pandas will cost even more time if the number of looping components grows in the future.
Is there an efficient way to do this type of operation in pandas?
Please help me with this.
Example:
I have a df as:
| ID | File |
| -------- | -------------- |
| 1 | SWE_Toot |
| 2 | SWE_Thun |
| 3 | IDH_Toet |
| 4 | SDF_Then |
| 5 | SWE_Toot |
| 6 | SWE_Thun |
| 7 | SEH_Toot |
| 8 | SFD_Thun |
I will get the components in other tables, for example:
| ID | File |
| -------- | -------------- |
| Software | */SWE_Toot/*.h |
| |*/IDH_Toet/*.c |
| |*/SFD_Toto/*.c |
and a second one as:
| ID | File |
| -------- | -------------- |
| Wire | */SDF_Then/*.h |
| |*/SFD_Thun/*.c |
| |*/SFD_Toto/*.c |
etc. In total, around 1,000,000 files and 278 components will be received.
I want the result as:
| ID | File |Component|
| -------- | -------------- |---------|
| 1 | SWE_Toot |Software |
| 2 | SWE_Thun |Other |
| 3 | IDH_Toet |Software |
| 4 | SDF_Then |Wire |
| 5 | SWE_Toto |Various |
| 6 | SWE_Thun |Other |
| 7 | SEH_Toto |Various |
| 8 | SFD_Thun |Wire |
Other - filled in last, once all the fields and regexes have been checked and the file does not belong to any component.
Various - it may belong to more than one component (or we can give the list of components it belongs to).
I was able to read the component tables and create a regex, but to create the component column I would have to write for loops over all 278 components and loop over the same table once per component.
Is there an easier way to do this with pandas? The data will be very large.
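Here is a sketch of one vectorised approach, under the assumption that the component tables can be read into a dict mapping each component to its glob patterns (the components dict and the globs_to_regex helper below are hypothetical, built from the example tables): collapse each component's globs into a single regex alternation, run one vectorised str.contains per component, and classify rows by how many components matched.
import re
import numpy as np
import pandas as pd

# hypothetical input, built from the example component tables above
components = {
    "Software": ["*/SWE_Toot/*.h", "*/IDH_Toet/*.c", "*/SFD_Toto/*.c"],
    "Wire":     ["*/SDF_Then/*.h", "*/SFD_Thun/*.c", "*/SFD_Toto/*.c"],
}

df = pd.DataFrame({"ID": range(1, 9),
                   "File": ["SWE_Toot", "SWE_Thun", "IDH_Toet", "SDF_Then",
                            "SWE_Toot", "SWE_Thun", "SEH_Toot", "SFD_Thun"]})

def globs_to_regex(patterns):
    # pull the folder token out of each glob (the part between the
    # slashes) and join the tokens into one alternation,
    # e.g. 'SWE_Toot|IDH_Toet|SFD_Toto'
    tokens = [re.escape(p.strip("*/").split("/")[0]) for p in patterns]
    return "|".join(tokens)

# one vectorised str.contains per component: 278 passes over the
# column instead of a Python-level loop over 1,000,000 rows
hits = pd.DataFrame({name: df["File"].str.contains(globs_to_regex(pats))
                     for name, pats in components.items()})

n = hits.sum(axis=1)
df["Component"] = np.select([n == 0, n > 1], ["Other", "Various"], default="")
df.loc[n == 1, "Component"] = hits[n == 1].idxmax(axis=1)  # the single matching component
This replaces the row-level Python loop with a fixed number of column-level passes, which pandas handles far faster.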
I have two DataFrames. One with multiple columns and other with just one. So what I need is to join based on partial str of a column. Example:
df1
| Name | Classification |
| -------- | -------------------------- |
| A | Transport/Bicycle/Mountain |
| B | Transport/City/Bus |
| C | Transport/Taxi/City |
| D | Transport/City/Uber |
| E | Transport/Mountain/Jeep |
df2
| Category |
| -------- |
| Mountain |
| City |
As you can see, the order within the Classification column is not well defined.
Desired output:
| Name | Classification | Category |
| -------- | -------------------------- |-----------|
| A | Transport/Bicycle/Mountain | Mountain |
| B | Transport/City/Bus | City |
| C | Transport/Taxi/City | City |
| D | Transport/City/Uber | City |
| E | Transport/Mountain/Jeep | Mountain |
I'm stuck on this. Any ideas?
Many thanks in advance.
This implementation does the trick:
def get_cat(val):
    # return the first category from df2 that appears in the string
    for cat in df2['Category']:
        if cat in val:
            return cat
    return None

df1['Category'] = df1['Classification'].apply(get_cat)
Note: as @Justin Ezequiel pointed out in the comments, you haven't specified what to do when both Mountain and City exist in the Classification. The current implementation uses the first Category that matches.
You can try this:
dff = {"ne": []}
for x in df1["Classification"]:
    for a in df2["Category"]:
        if a in x:  # first category found in the string wins
            dff["ne"].append(a)
            break
df1["Category"] = dff["ne"]
df1 will look like your desired output.
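A vectorised alternative (a sketch, assuming the categories in df2 are plain substrings with no regex metacharacters): join them into one alternation and let str.extract pull the first match out of each Classification string.
import re

# builds '(Mountain|City)' from df2
pattern = '(' + '|'.join(map(re.escape, df2['Category'])) + ')'
df1['Category'] = df1['Classification'].str.extract(pattern, expand=False)
Note that ties are broken by position within the Classification string rather than by order in df2, and rows with no match get NaN instead of None.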
I have a data frame as below:
member_id | loan_amnt | Age | Marital_status
AK219 | 49539.09 | 34 | Married
AK314 | 1022454.00 | 37 | NA
BN204 | 75422.00 | 34 | Single
I want to create an output file in the below format
Columns | Null Values | Duplicate |
member_id | N | N |
loan_amnt | N | N |
Age | N | Y |
Marital Status| Y | N |
I know about a Python package called pandas-profiling, but I want to build this myself in the above manner so that I can enhance the code for my data sets.
Use something like:
m = df.apply(lambda x: x.duplicated())  # True where a value repeats an earlier one in its column
n = df.isna()                           # True where a value is missing
df_new = (pd.concat([pd.Series(n.any(), name='Null_Values'),
                     pd.Series(m.any(), name='Duplicates')], axis=1)
          .replace({True: 'Y', False: 'N'}))
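Run against the example frame above (assuming the NA in Marital_status is a real NaN), this produces the requested layout:
print(df_new)
#                Null_Values Duplicates
# member_id                N          N
# loan_amnt                N          N
# Age                      N          Y
# Marital_status           Y          N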
Here is a pandas one-liner:
pd.concat([df.isnull().any(), df.apply(lambda x: x.count() != x.nunique())], axis=1).replace({True: "Y", False: "N"})
Actually, pandas-profiling itself gives you multiple options for figuring out whether there are repeated values.
Given the following data frame
+-----+----------------+--------+---------+
| | A | B | C |
+-----+----------------+--------+---------+
| 0 | hello#me.com | 2.0 | Hello |
| 1 | you#you.com | 3.0 | World |
| 2 | us#world.com | hi | holiday |
+-----+----------------+--------+---------+
How can I get all the rows where re.compile(r'[Hh](i|ello)') would match in a cell? That is, from the above example, I would like to get the following output:
+-----+----------------+--------+---------+
| | A | B | C |
+-----+----------------+--------+---------+
| 0 | hello#me.com | 2.0 | Hello |
| 2 | us#world.com | hi | holiday |
+-----+----------------+--------+---------+
I am not able to get a solution for this. Any help would be very much appreciated.
Using stack to avoid apply:
df.loc[df.stack().str.match(r'[Hh](i|ello)').unstack().any(axis=1)]
Using match generates a FutureWarning. The warning is consistent with what we are doing, so that's fine. However, findall accomplishes the same thing:
df.loc[df.stack().str.findall(r'[Hh](i|ello)').unstack().any(axis=1)]
You can use the findall function, which takes regular expressions.
msk = df.apply(lambda x: x.str.findall(r'[Hh](i|ello)')).any(axis=1)
df[msk]
+-----+----------------+--------+---------+
|     | A              | B      | C       |
+-----+----------------+--------+---------+
| 0   | hello#me.com   | 2.0    | Hello   |
| 2   | us#world.com   | hi     | holiday |
+-----+----------------+--------+---------+
any(axis=1) will check if any of the columns in a given row are true. So msk is a single column of True/False values indicating whether or not the regular expression was found in that row.
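If the frame mixes numeric and string cells, as here, another option is to cast everything to str first so that contains returns plain booleans with no NaNs to worry about (a sketch; the non-capturing group (?:...) also avoids pandas' warning about match groups):
mask = (df.astype(str)
          .apply(lambda col: col.str.contains(r'[Hh](?:i|ello)'))
          .any(axis=1))
df[mask]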
How can I search through an entire row in a pandas dataframe for a phrase and, if it exists, create a new column that says 'Yes' and lists which columns of that row it was found in? I would like to be able to ignore case as well.
You could use Pandas apply function, which allows you to traverse rows or columns and apply your own function to them.
For example, given a dataframe
+--------------------------------------+------------+---+
| deviceid | devicetype | 1 |
+--------------------------------------+------------+---+
| b569dcb7-4498-4cb4-81be-333a7f89e65f | Google | 1 |
| 04d3b752-f7a1-42ae-8e8a-9322cda4fd7f | Android | 2 |
| cf7391c5-a82f-4889-8d9e-0a423f132026 | Android | 3 |
+--------------------------------------+------------+---+
Define a function:
def pr(array, value):
    # columns of this row whose value contains the search term;
    # case=False ignores case, and fillna(False) handles non-string
    # cells, for which str.contains returns NaN
    condition = array[array.str.contains(value, case=False).fillna(False)].index.tolist()
    # pd.concat replaces Series.append, which was removed in pandas 2.0
    if condition:
        ret = pd.concat([array, pd.Series({"condition": ['Yes'] + condition})])
    else:
        ret = pd.concat([array, pd.Series({"condition": ['No'] + condition})])
    return ret
Use it
df.apply(pr, axis=1, args=('Google',))
+---+--------------------------------------+------------+---+-------------------+
| | deviceid | devicetype | 1 | condition |
+---+--------------------------------------+------------+---+-------------------+
| 0 | b569dcb7-4498-4cb4-81be-333a7f89e65f | Google | 1 | [Yes, devicetype] |
| 1 | 04d3b752-f7a1-42ae-8e8a-9322cda4fd7f | Android | 2 | [No] |
| 2 | cf7391c5-a82f-4889-8d9e-0a423f132026 | Android | 3 | [No] |
+---+--------------------------------------+------------+---+-------------------+