How to check if a column value is valid by regex? - python

How to check whether a column's values match a regex in Pandas. I have tried this way:

r = df['ADDRESS'].str.findall('^((?:0[1-9]|1[0-2]))[.\/\\\\-]*([0-9]{2})[.\/\\\\]([1|2][0-9]{3})$')
if not r:
    df['ADDRESS'] = "Bird"

It does not work for me.
My goal is to move values from columns into the specific columns they belong to, based on their content.
Example:
ID  NAME        EMAIL
1   @GMAIL.COM  Inigma
@   Mary        4

As a result, I should arrange the values under the correct column names:

ID  NAME    EMAIL
1   Inigma  @GMAIL.COM
2   Mary    @
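A minimal sketch of one way to do that routing, assuming every email contains '@', every ID is all digits, and everything else is a name (the data below is hypothetical, not the asker's frame):

```python
import pandas as pd

# Hypothetical scrambled frame: values sit in the wrong columns.
df = pd.DataFrame({
    'ID': ['1', '@GMAIL.COM'],
    'NAME': ['Inigma', '3'],
    'EMAIL': ['@WEB.DE', 'Mary'],
})

def classify(value):
    """Decide which column a cell belongs to, by content."""
    s = str(value)
    if '@' in s:
        return 'EMAIL'
    if s.isdigit():
        return 'ID'
    return 'NAME'

# Rebuild each row as a dict keyed by the detected column name.
fixed = df.apply(
    lambda row: pd.Series({classify(v): v for v in row}), axis=1
)
print(fixed[['ID', 'NAME', 'EMAIL']])
```

This assumes each row has at most one value per category; rows with two emails or two numbers would need a tie-breaking rule.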

Related

Convert Multiple Columns to a List inside a single column in pandas

I am using Azure Databricks, reading different Excel forms stored in a blob. I need to keep 3 columns as they are and group the other response columns (multiple, and different for each form) into a list.
My main goal here is to transform those different columns into a single one holding an object whose keys are the question titles and whose values are the responses.
I have the following dataframe:
id  name  email           question_1   question_2   question_3
1   mark  mark@email.com  response_11  response_21  response_31
3   elon  elon@email.com  response_12  response_22  response_32
I would like to have the following output:

id  name  email           responses
1   mark  mark@email.com  {'question_1': 'response_11', 'question_2': 'response_21', 'question_3': 'response_31'}
2   elon  elon@email.com  {'question_1': 'response_12', 'question_2': 'response_22', 'question_3': 'response_32'}
3   zion  zion@email.com  {'question_1': 'response_13', 'question_2': 'response_23', 'question_3': 'response_33'}
How could I get that using pandas? I already did the following:

baseCols = ['id', 'name', 'email']

def getFormsColumnsName(df):
    df_response_columns = df.columns.values.tolist()
    for deleted_column in baseCols:
        df_response_columns.remove(deleted_column)
    return df_response_columns

formColumns = getFormsColumnsName(df)
df = df.astype(str)
df['responses'] = df[formColumns].values.tolist()
display(df)
But this gives me a strange list of responses:

id  name  email           responses
1   mark  mark@email.com  ['response_11', 'response_21', 'response_31']

I don't know what I should do to get what I expected.
Thank you in advance!
You can get your responses column by using pd.DataFrame.to_dict("records"):

questions = df.filter(like="question")
responses = questions.to_dict("records")
out = df.drop(columns=questions.columns).assign(responses=responses)
output:
id name email responses
0 1 mark mark#email.com {'question_1': 'response_11', 'question_2': 'response_21', 'question_3': 'response_31'}
1 3 elon elon#email.com {'question_1': 'response_12', 'question_2': 'response_22', 'question_3': 'response_32'}
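A self-contained version of that approach, with illustrative data shaped like the question's frame:

```python
import pandas as pd

# Illustrative data matching the question's layout.
df = pd.DataFrame({
    'id': [1, 3],
    'name': ['mark', 'elon'],
    'email': ['mark@email.com', 'elon@email.com'],
    'question_1': ['response_11', 'response_12'],
    'question_2': ['response_21', 'response_22'],
})

questions = df.filter(like='question')    # only the question_* columns
responses = questions.to_dict('records')  # one {question: response} dict per row
out = df.drop(columns=questions.columns).assign(responses=responses)
print(out)
```

filter(like='question') picks the response columns by name, so the base columns never need to be listed explicitly.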

Creating a new column with concatenated values from another column

I am trying to create a new column in this data frame. The data set has multiple records for each PERSON because each record is a different account. The new column values should be a combination of the values for each PERSON in the TYPE column. For example, if John Doe has four accounts, the value next to his name in the new column should be a concatenation of the values in TYPE. An example of the final data frame is below. Thanks in advance.
You can do this in two lines (first code, then explanation):
Code:

in: name_types = df.pivot_table(index='Name', values='AccountType', aggfunc=set)
out:
                AccountType
Name
Jane Doe                {D}
John Doe          {L, W, D}
Larry Wild           {L, D}
Patti Shortcake      {L, W}

in: df['ClientType'] = df['Name'].apply(lambda x: name_types.loc[x]['AccountType'])
Explanation:
The pivot table gets all the AccountTypes for each individual name, and removes all duplicates using the 'set' aggregate function.
The apply function then iterates through each 'Name' in the main data frame, looks up the AccountType associated with each in name_types, and adds it to the new column ClientType in the main dataframe.
And you're done!
Addendum:
If you need the column to be a string instead of a set, use:
in: def to_string(the_set):
        string = ''
        for item in the_set:
            string += item
        return string
in: df['ClientType'] = df['ClientType'].apply(to_string)
in: df.head()
out:
Name AccountType ClientType
0 Jane Doe D D
1 John Doe D LDW
2 John Doe D LDW
3 John Doe L LDW
4 John Doe D LDW
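The same result can also be sketched in a single pass with groupby + transform, which skips the pivot table and the set-to-string helper (illustrative data, not the asker's frame; sorted() makes the concatenation deterministic, which a raw set would not be):

```python
import pandas as pd

# Illustrative frame shaped like the question's: one row per account.
df = pd.DataFrame({
    'Name': ['John Doe', 'John Doe', 'Jane Doe'],
    'AccountType': ['L', 'D', 'D'],
})

# For each Name, join the distinct account types into one string and
# broadcast it back to every row of that person.
df['ClientType'] = df.groupby('Name')['AccountType'].transform(
    lambda s: ''.join(sorted(set(s)))
)
print(df)
```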

How can I delete the lines that contain a part (search word) in python?

I have a dataframe with 3 columns, and I want to delete all rows which contain a part of a string (search keys).
my dataframe:
user_name  user_first_name  user_email
Max        Mustermann       max.musterman@gmail.com
Tom        Hans             tom.musterman@web.de
Tom1       Hans1            tom.musterman@test.de

my search keywords are: @gmail.com, @web.de

df = df[~df['user_email'].isin(['*@gmail.com'])]

It doesn't work, because isin only matches the exact email address, not a suffix.
Use str.endswith:

df = df[~df['user_email'].str.endswith('@gmail.com')]

  user_name user_first_name             user_email
1       Tom            Hans   tom.musterman@web.de
2      Tom1           Hans1  tom.musterman@test.de

Or str.contains, which supports regex:

df = df[~df['user_email'].str.contains(r'@gmail\.com$')]

  user_name user_first_name             user_email
1       Tom            Hans   tom.musterman@web.de
2      Tom1           Hans1  tom.musterman@test.de

Your list implies that you might want to pass multiple conditions, so str.contains would probably be the best bet, using | to delimit the alternatives:

df = df[~df['user_email'].str.contains(r'@gmail\.com$|@web\.de$')]

  user_name user_first_name             user_email
2      Tom1           Hans1  tom.musterman@test.de
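When the keywords live in a list, the pattern can be built programmatically; re.escape neutralises regex metacharacters such as the dot, so the keywords match literally (data copied from the question):

```python
import re
import pandas as pd

df = pd.DataFrame({
    'user_name': ['Max', 'Tom', 'Tom1'],
    'user_email': ['max.musterman@gmail.com',
                   'tom.musterman@web.de',
                   'tom.musterman@test.de'],
})

keywords = ['@gmail.com', '@web.de']
# Escape each keyword and anchor it to the end of the string.
pattern = '|'.join(re.escape(k) + '$' for k in keywords)
df = df[~df['user_email'].str.contains(pattern)]
print(df)
```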

How do I extract values from different columns after a groupby in pandas?

I have the following input file in csv:
INPUT
ID,GroupID,Person,Parent
ID_001,A001,John Doe,Yes
ID_002,A001,Mary Jane,No
ID_003,A001,James Smith;John Doe,Yes
ID_004,B003,Nathan Drake,Yes
ID_005,B003,Troy Baker,No
The desired output is the following:
DESIRED OUTPUT
ID,GroupID,Person
ID_001,A001,John Doe;Mary Jane;James Smith
ID_003,A001,John Doe;Mary Jane;James Smith
ID_004,B003,Nathan Drake;Troy Baker
Basically, I want to group by the same GroupID and then concatenate all the values present in Person column that belong to that group. Then, in my output, for each group I want to return the ID(s) of those rows where the Parent column is "Yes", the GroupID, and the concatenated person values for each group.
I am able to concatenate all person values for a particular group and remove any duplicate values from the person column in my output. Here is what I have so far:
import pandas as pd

inputcsv = '...'   # path to the input csv file
outputcsv = '...'  # path to the output csv file
colnames = ['ID', 'GroupID', 'Person', 'Parent']
df1 = pd.read_csv(inputcsv, names=colnames, header=None, skiprows=1)

# First I do a groupby on GroupID, concatenate the values in the Person
# column, and finally remove the duplicate person values from the output
# before saving the df to a csv.
df2 = df1.groupby('GroupID')['Person'].apply(';'.join).str.split(';').apply(set).apply(';'.join).reset_index()
df2.to_csv(outputcsv, sep=',', index=False)
This yields the following output:
GroupID,Person
A001,John Doe;Mary Jane;James Smith
B003,Nathan Drake;Troy Baker
I can't figure out how to include the ID column and include all rows in a group where the Parent is "Yes" (as shown in the desired output above).
IIUC

df.Person = df.Person.str.split(';')  # 1st split the string into a list
# then transform: this writes the same joined result to every row of the
# group (apply vs transform: https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object)
df['Person'] = df.groupby(['GroupID']).Person.transform(lambda x: ';'.join(set(sum(x, []))))
df = df.loc[df.Parent.eq('Yes')]  # then filter using Parent
df
Out[239]:
ID GroupID Person Parent
0 ID_001 A001 James Smith;John Doe;Mary Jane Yes
2 ID_003 A001 James Smith;John Doe;Mary Jane Yes
3 ID_004 B003 Troy Baker;Nathan Drake Yes
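A self-contained variant of that answer, using a set union instead of sum(x, []) and sorted() so the joined order is deterministic (data copied from the question):

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    'ID': ['ID_001', 'ID_002', 'ID_003', 'ID_004', 'ID_005'],
    'GroupID': ['A001', 'A001', 'A001', 'B003', 'B003'],
    'Person': ['John Doe', 'Mary Jane', 'James Smith;John Doe',
               'Nathan Drake', 'Troy Baker'],
    'Parent': ['Yes', 'No', 'Yes', 'Yes', 'No'],
})

# Split each cell into a list, collect the unique names per group,
# broadcast the joined string back with transform, then keep only
# the Parent == 'Yes' rows.
people = df['Person'].str.split(';')
df['Person'] = people.groupby(df['GroupID']).transform(
    lambda lists: ';'.join(sorted(set().union(*lists)))
)
out = df.loc[df['Parent'].eq('Yes'), ['ID', 'GroupID', 'Person']]
print(out)
```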

How to add a header name next to its cell value in python

I have this table as input, and I would like to prepend the header name to its corresponding cells before converting it to a dataframe.
I am generating association rules after converting the table to a dataframe, and it is not clear which antecedent/consequent each rule belongs to.
Example for the first column of my desired table:

Age
Age = 45
Age = 30
Age = 45
Age = 80
...

and so on for the rest of the columns. What is the best way to access each column and rewrite it? And is there a better solution for referencing my values after generating association rules, other than adding the header name to each cell?
Here is one way to add the column names to all cells:

df = pd.DataFrame({'age': [1, 2], 'sex': ['M', 'F']})
df = df.applymap(str)
for c in df.columns:
    df[c] = df[c].apply(lambda s: "{} = {}".format(c, s))
This yields:
       age      sex
0  age = 1  sex = M
1  age = 2  sex = F
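Since each column carries its own name, the loop can also be folded into a single apply over the columns; a sketch with the same toy frame:

```python
import pandas as pd

df = pd.DataFrame({'age': [1, 2], 'sex': ['M', 'F']})
# apply iterates column by column; col.name is the header, and string
# concatenation with a str-typed Series is vectorised.
labelled = df.astype(str).apply(lambda col: col.name + ' = ' + col)
print(labelled)
```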
