Data Profiling using Python

I have a data frame as below:

| member_id | loan_amnt  | Age | Marital_status |
| --------- | ---------- | --- | -------------- |
| AK219     | 49539.09   | 34  | Married        |
| AK314     | 1022454.00 | 37  | NA             |
| BN204     | 75422.00   | 34  | Single         |

I want to create an output file in the below format:

| Columns        | Null Values | Duplicate |
| -------------- | ----------- | --------- |
| member_id      | N           | N         |
| loan_amnt      | N           | N         |
| Age            | N           | Y         |
| Marital_status | Y           | N         |
I know about a Python package called pandas-profiling, but I want to build this myself in the above manner so that I can enhance the code for my data sets.

Use something like:
import pandas as pd

# True wherever a cell repeats an earlier value in its column
m = df.apply(lambda x: x.duplicated())
# True wherever a cell is missing
n = df.isna()

df_new = (pd.concat([pd.Series(n.any(), name='Null_Values'),
                     pd.Series(m.any(), name='Duplicates')], axis=1)
            .replace({True: 'Y', False: 'N'}))

Here is a Python one-liner (a column has duplicates exactly when its non-null count exceeds its number of unique values):
pd.concat([df.isnull().any(),
           df.apply(lambda x: x.count() != x.nunique())],
          axis=1).replace({True: 'Y', False: 'N'})

Actually, pandas-profiling gives you multiple options to figure out whether there are repeated values.
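Putting the pieces together, here is a minimal end-to-end sketch, assuming the sample frame from the question; the profile.csv output path is illustrative:
import numpy as np
import pandas as pd

# Sample frame mirroring the question; NA is entered as a real missing value.
df = pd.DataFrame({
    'member_id': ['AK219', 'AK314', 'BN204'],
    'loan_amnt': [49539.09, 1022454.00, 75422.00],
    'Age': [34, 37, 34],
    'Marital_status': ['Married', np.nan, 'Single'],
})

profile = pd.concat(
    [df.isna().any().rename('Null Values'),
     df.apply(lambda x: x.duplicated().any()).rename('Duplicate')],
    axis=1,
).replace({True: 'Y', False: 'N'})
profile.index.name = 'Columns'

profile.to_csv('profile.csv')  # hypothetical output path
print(profile)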

Using a regex expression to create a new DataFrame column

I have the following Python DataFrame:
| ColumnA | File |
| -------- | -------------- |
| First | aasdkh.xls |
| Second | sadkhZ.xls |
| Third | asdasdPH.xls |
| Fourth | adsjklahsd.xls |
and so on.
I'm trying to get the following DataFrame:
| ColumnA | File | Category|
| -------- | ---------------- | ------- |
| First | aasdkh.xls | N |
| Second | sadkhZ.xls | Z |
| Third | asdasdPH.xls | PH |
| Fourth   | adsjklahsd.xls   | N       |
I'm trying to use regex expressions, but I'm not sure how to use them. I need to get a new column that "extracts" the category of the file: N if it is a "normal" file (no category), Z if the file contains a "Z" just before the extension, and PH if the file contains a "PH" before the extension.
I defined the following regex expressions that I think I could use, but I don't know how to apply them:
regex_Z = re.compile(r'Z\.xls$')
regex_PH = re.compile(r'PH\.xls$')
P.S.: Could you recommend any website to learn how to use regex expressions?
Let's try:
df['Category'] = df['File'].str.extract(r'(Z|PH)\.xls$', expand=False).fillna('N')
print(df)
ColumnA File Category
0 First aasdkh.xls N
1 Second sadkhZ.xls Z
2 Third asdasdPH.xls PH
3 Fourth adsjklahsd.xls N
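If you would rather use the compiled patterns from the question directly, here is a hedged sketch with numpy.select (the frame is rebuilt here so the snippet runs on its own):
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ColumnA': ['First', 'Second', 'Third', 'Fourth'],
    'File': ['aasdkh.xls', 'sadkhZ.xls', 'asdasdPH.xls', 'adsjklahsd.xls'],
})

regex_Z = re.compile(r'Z\.xls$')
regex_PH = re.compile(r'PH\.xls$')

# np.select picks the first condition that matches, so the order of the
# condition list sets the precedence between categories.
df['Category'] = np.select(
    [df['File'].str.contains(regex_PH), df['File'].str.contains(regex_Z)],
    ['PH', 'Z'],
    default='N',
)
print(df)
As for learning regex, the Regular Expression HOWTO in the Python docs (https://docs.python.org/3/howto/regex.html) is a good starting point.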

Pyspark: Reorder only a subset of rows among themselves

My data frame:
+-----+--------+-------+
| val | id | reRnk |
+-----+--------+-------+
| 2 | a | yes |
| 1 | b | no |
| 3 | c | no |
| 8 | d | yes |
| 7 | e | yes |
| 9 | f | no |
+-----+--------+-------+
In my desired output I will re-rank only the rows where reRnk == yes; the ranking will be done based on "val".
I don't want to move the rows where reRnk = no; for example, at id = b we have reRnk = no, and I want to keep that row at row no. 2.
my desired output will look like this:
+-----+--------+-------+
| val | id | reRnk |
+-----+--------+-------+
| 8 | d | yes |
| 1 | b | no |
| 3 | c | no |
| 7 | e | yes |
| 2 | a | yes |
| 9 | f | no |
+-----+--------+-------+
From what I'm reading, PySpark DataFrames do not have an index by default, so you might need to add one.
I do not know the exact syntax for PySpark; however, since it has many similarities with pandas, this might lead you in a certain direction:
m = df.reRnk == 'yes'
df.loc[m, ['val', 'id']] = (df.loc[m, ['val', 'id']]
                            .sort_values('val', ascending=False)
                            .set_index(df.loc[m, ['val', 'id']].index))
Basically, what we do here is isolate the rows with reRnk == 'yes' and sort their values, but reset the index back to the original index. Then we assign these new values to the original rows in the df.
For .loc, https://spark.apache.org/docs/3.2.0/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.loc.html might be worth a try.
For .sort_values see: https://sparkbyexamples.com/pyspark/pyspark-orderby-and-sort-explained/
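A native PySpark sketch of the same idea, assuming the exact frame from the question: pin down the original row order explicitly (Spark rows have no implicit index), then pair the i-th largest 'yes' val with the i-th earliest 'yes' position. Note that monotonically_increasing_id reflects the insertion order here only because the frame is small and freshly created; with real data you would carry a proper ordering column.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(2, "a", "yes"), (1, "b", "no"), (3, "c", "no"),
     (8, "d", "yes"), (7, "e", "yes"), (9, "f", "no")],
    ["val", "id", "reRnk"],
)

# Materialize the current row order as an explicit column.
df = df.withColumn(
    "pos", F.row_number().over(Window.orderBy(F.monotonically_increasing_id()))
)

yes = df.filter(F.col("reRnk") == "yes")
# Rank the 'yes' rows two ways: by descending val, and by original position.
by_val = (yes.withColumn("rnk", F.row_number().over(Window.orderBy(F.desc("val"))))
             .select("val", "id", "reRnk", "rnk"))
by_pos = (yes.withColumn("rnk", F.row_number().over(Window.orderBy("pos")))
             .select("pos", "rnk"))
# Joining on rnk drops the largest val into the earliest 'yes' slot, and so on.
reranked = by_val.join(by_pos, "rnk").drop("rnk")

result = (df.filter(F.col("reRnk") == "no")
            .unionByName(reranked)
            .orderBy("pos")
            .drop("pos"))
result.show()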

Pandas - Create new column w/values from another column based on str contains

I have two DataFrames, one with multiple columns and the other with just one. I need to join them based on a partial string match in a column. Example:
df1
| Name | Classification |
| -------- | -------------------------- |
| A | Transport/Bicycle/Mountain |
| B | Transport/City/Bus |
| C | Transport/Taxi/City |
| D | Transport/City/Uber |
| E | Transport/Mountain/Jeep |
df2
| Category |
| -------- |
| Mountain |
| City |
As you can see, the order within the Classification column is not well defined.
Desirable output:
| Name | Classification | Category |
| -------- | -------------------------- |-----------|
| A | Transport/Bicycle/Mountain | Mountain |
| B | Transport/City/Bus | City |
| C | Transport/Taxi/City | City |
| D | Transport/City/Uber | City |
| E | Transport/Mountain/Jeep | Mountain |
I'm stuck on this. Any ideas?
Many thanks in advance.
This implementation does the trick:
def get_cat(val):
    # return the first category that appears in the classification string
    for cat in df2['Category']:
        if cat in val:
            return cat
    return None

df1['Category'] = df1['Classification'].apply(get_cat)
Note: as @Justin Ezequiel pointed out in the comments, you haven't specified what to do when both Mountain and City exist in the Classification. The current implementation uses the first category that matches.
You can try this:
dff = {"ne": []}
for x in df1["Classification"]:
    for a in df2["Category"]:
        if a in x:
            dff["ne"].append(a)
            break
    else:
        dff["ne"].append(None)  # no category matched
df1["Category"] = dff["ne"]
df1 will look like your desirable output.
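A vectorized alternative, sketched under the assumption that every Classification contains exactly one of the categories: build a single alternation pattern from df2 and let str.extract pull out the first match.
import re

import pandas as pd

df1 = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E'],
    'Classification': ['Transport/Bicycle/Mountain', 'Transport/City/Bus',
                       'Transport/Taxi/City', 'Transport/City/Uber',
                       'Transport/Mountain/Jeep'],
})
df2 = pd.DataFrame({'Category': ['Mountain', 'City']})

# One pattern such as '(Mountain|City)'; re.escape guards against
# categories that contain regex metacharacters.
pattern = '(' + '|'.join(map(re.escape, df2['Category'])) + ')'
df1['Category'] = df1['Classification'].str.extract(pattern, expand=False)
print(df1)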

pandas dataframe get rows based on matched strings in cells

Given the following data frame
+-----+----------------+--------+---------+
| | A | B | C |
+-----+----------------+--------+---------+
| 0 | hello#me.com | 2.0 | Hello |
| 1 | you#you.com | 3.0 | World |
| 2 | us#world.com | hi | holiday |
+-----+----------------+--------+---------+
How can I get all the rows where re.compile(r'[Hh](i|ello)') would match in a cell? That is, from the above example, I would like to get the following output:
+-----+----------------+--------+---------+
| | A | B | C |
+-----+----------------+--------+---------+
| 0 | hello#me.com | 2.0 | Hello |
| 2 | us#world.com | hi | holiday |
+-----+----------------+--------+---------+
I have not been able to find a solution for this. Any help would be very much appreciated.
Using stack to avoid apply:
df.loc[df.stack().str.match(r'[Hh](i|ello)').unstack().any(axis=1)]
Using match generates a FutureWarning. The warning is consistent with what we are doing, so that's fine. However, findall accomplishes the same thing:
df.loc[df.stack().str.findall(r'[Hh](i|ello)').unstack().any(axis=1)]
You can use the findall function, which takes regular expressions:
msk = df.apply(lambda x: x.str.findall(r'[Hh](i|ello)')).any(axis=1)
df[msk]
+-----+----------------+--------+---------+
|     | A              | B      | C       |
+-----+----------------+--------+---------+
| 0   | hello#me.com   | 2.0    | Hello   |
| 2   | us#world.com   | hi     | holiday |
+-----+----------------+--------+---------+
any(axis=1) will check if any of the columns in a given row are true. So msk is a single column of True/False values indicating whether or not the regular expression was found in that row.
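A warning-free variant of the same mask, sketched with str.contains and a non-capturing group, assuming the frame from the question:
import pandas as pd

df = pd.DataFrame({
    'A': ['hello#me.com', 'you#you.com', 'us#world.com'],
    'B': [2.0, 3.0, 'hi'],
    'C': ['Hello', 'World', 'holiday'],
})

# Cast every column to string so non-string cells can't break the .str
# accessor; the (?:...) group avoids the capture-group warning.
msk = df.apply(
    lambda col: col.astype(str).str.contains(r'[Hh](?:i|ello)', na=False)
).any(axis=1)
print(df[msk])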

Replicating GROUP_CONCAT for pandas.DataFrame

I have a pandas DataFrame df:
+------+---------+
| team | user |
+------+---------+
| A | elmer |
| A | daffy |
| A | bugs |
| B | dawg |
| A | foghorn |
| B | speedy |
| A | goofy |
| A | marvin |
| B | pepe |
| C | petunia |
| C | porky |
+------+---------+
I want to find or write a function to return a DataFrame that I would return in MySQL using the following:
SELECT
    team,
    GROUP_CONCAT(user)
FROM
    df
GROUP BY
    team
for the following result:
+------+---------------------------------------+
| team | group_concat(user) |
+------+---------------------------------------+
| A | elmer,daffy,bugs,foghorn,goofy,marvin |
| B | dawg,speedy,pepe |
| C | petunia,porky |
+------+---------------------------------------+
I can think of nasty ways to do this by iterating over rows and adding to a dictionary, but there's got to be a better way.
Do the following:
df.groupby('team').apply(lambda x: ','.join(x.user))
to get a Series of strings or
df.groupby('team').apply(lambda x: list(x.user))
to get a Series of lists of strings.
Here's what the results look like:
In [33]: df.groupby('team').apply(lambda x: ', '.join(x.user))
Out[33]:
team
A    elmer, daffy, bugs, foghorn, goofy, marvin
B    dawg, speedy, pepe
C    petunia, porky
dtype: object
In [34]: df.groupby('team').apply(lambda x: list(x.user))
Out[34]:
team
A    [elmer, daffy, bugs, foghorn, goofy, marvin]
B    [dawg, speedy, pepe]
C    [petunia, porky]
dtype: object
Note that any further operations on these kinds of Series will generally be slow and are discouraged. If there's another way to aggregate without putting a list inside a Series, you should consider that approach instead.
A more general solution if you want to use agg:
df.groupby('team').agg({'user' : lambda x: ', '.join(x)})
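For what it's worth, here is a more direct spelling of the same aggregation, with sample data mirroring the question and an output column named like MySQL's:
import pandas as pd

df = pd.DataFrame({
    'team': ['A', 'A', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'C', 'C'],
    'user': ['elmer', 'daffy', 'bugs', 'dawg', 'foghorn', 'speedy',
             'goofy', 'marvin', 'pepe', 'petunia', 'porky'],
})

# Group on a single column, then join each group's users with commas.
out = (df.groupby('team')['user']
         .agg(','.join)
         .reset_index(name='group_concat(user)'))
print(out)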
