How to convert value data to another string data using pandas dataframe? - python

I want to ask how to convert data values into string data, going from table 1 to table 2. Here is my explanation.
This is my Python function:
def freq_action_requirement_measurement(keyword=stopwords_ucd1_topic.action, requirement=stopwords_freq_topic.requirement):
    disambiguation_df = []
    for angka in range(0, len(requirement)):
        a = [cosine_similarity(requirement[angka], keyword[num]) for num in range(0, len(keyword))]
        disambiguation_df.append(a)
    hasil_disambiguation = pd.DataFrame(disambiguation_df, index=requirement, columns=keyword)
    return hasil_disambiguation
I called it like this:
list_action_requirement3 = freq_action_requirement_measurement()
and got data like this:
enter image description here
From table 1 I can produce table 2, which holds the use case and the threshold value for each match. Here is my code:
data_d1 = []
df_stop1 = stopwords_ucd1_topic
for idx, angka in enumerate(list_action_requirement3):
    for jdx, num in enumerate(list_action_requirement3.iloc[:, idx]):
        if num >= 0.1:
            d1 = df_stop1[df_stop1.action == angka].iloc[0].usecase
            # d1 = len(df_stop1[df_stop1.action == angka])
            data_d1.append([d1, num])
df_b = pd.DataFrame(data_d1, columns=['usecase', 'threshold'])
print(tabulate(df_b, headers='keys', tablefmt='psql'))
and the result looks like this:
enter image description here
Based on that threshold value, I want to change the value data into string data, as in the usecase column, like this:
enter image description here
Please help, thank you.
Here is also a related Q&A from another forum.
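The nested loop in the question can also be written without explicit loops. The following is only a minimal sketch: the similarity matrix `sim`, the lookup table `lookup`, and all values in them are hypothetical stand-ins for `list_action_requirement3` and `stopwords_ucd1_topic`, which are not shown in the question. It reshapes the matrix to long form, keeps the scores at or above 0.1, and attaches the use case of each action.

```python
import pandas as pd

# Hypothetical similarity matrix: rows are requirements, columns are actions.
sim = pd.DataFrame(
    [[0.05, 0.40], [0.30, 0.00]],
    index=["req a", "req b"],
    columns=["login", "search"],
)
# Hypothetical lookup table mapping each action to its use case.
lookup = pd.DataFrame({"action": ["login", "search"], "usecase": ["UC1", "UC2"]})

# Reshape to one row per (requirement, action) pair, column by column,
# which is the same order the nested loop visits the cells in.
long = (
    sim.rename_axis("requirement")
       .reset_index()
       .melt(id_vars="requirement", var_name="action", value_name="threshold")
)

# Keep only the scores that reach the threshold used in the question.
kept = long[long["threshold"] >= 0.1]

# Attach the first use case found for each action (mirrors .iloc[0] in the loop).
result = kept.merge(lookup.drop_duplicates("action"), on="action", how="left")
result = result[["usecase", "threshold"]]
print(result)
```

If the goal is to show the threshold as text, `result["threshold"].astype(str)` would convert the column, but that reading of the question is an assumption.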

Related

How to split one column data into multiple columns using sql query/Python?

My data is in the form given below
Number
ABCD0001
ABCD0002
ABCD0003
GHIJ768O
GHIJ7681
GHIJ7682
SEDFTH1
SEDFTH2
SEDFTH3
I want to split this data into multiple columns using PostgreSQL or a Python script.
The output data should look like this:
Number1 Number2 Number3
ABCD0001 GHIJ7680 SEDFTH1
ABCD0002 GHIJ7681 SEDFTH2
Can I do this using a PostgreSQL query or via a Python script?
This is just a quick solution to your problem. I'm still learning Python myself, so this snippet could probably be optimized a lot, but it solves your problem.
import pandas as pd
number = ['ABCD0001','ABCD0002','ABCD0003','GHIJ768O','GHIJ7681','GHIJ7682','SEDFTH1','SEDFTH2','SEDFTH3']
def find_letters(list_of_str):
    abc_arr = []
    ghi_arr = []
    sed_arr = []
    # Iterate over the argument itself rather than the global list
    for text in list_of_str:
        # Sort each value into a bucket by the prefix it contains
        if 'ABC' in text:
            abc_arr.append(text)
        if 'GHI' in text:
            ghi_arr.append(text)
        if 'SED' in text:
            sed_arr.append(text)
    df = pd.DataFrame({'ABC': abc_arr, 'GHI': ghi_arr, 'SED': sed_arr})
    return df
print(find_letters(number))
This code gives this output:
Screenshot Of Output
Edit:
I just realized the first output you showed is probably a dataframe as well. The code below shows how you would handle it if your data comes from a df and not a list.
import pandas as pd
data = {'Numbers': ['ABCD0001','ABCD0002','ABCD0003','GHIJ768O','GHIJ7681','GHIJ7682','SEDFTH1','SEDFTH2','SEDFTH3']}
df = pd.DataFrame(data)
print(df)
def find_letters(list_of_str):
    abc_arr = []
    ghi_arr = []
    sed_arr = []
    # Convert the DataFrame into a list of rows, each row being a one-element list
    list_of_str = list_of_str.values.tolist()
    for i in range(len(list_of_str)):
        text = list_of_str[i][0]
        if 'ABC' in text:
            abc_arr.append(text)
        if 'GHI' in text:
            ghi_arr.append(text)
        if 'SED' in text:
            sed_arr.append(text)
    df = pd.DataFrame({'ABC': abc_arr, 'GHI': ghi_arr, 'SED': sed_arr})
    return df
find_letters(df)
Which gives this output.
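The answer above hard-codes the three prefixes. As a rough alternative sketch, the same reshaping can be done for any number of groups with `groupby(...).cumcount()` and `pivot`. It assumes that values belonging to the same output column share their first three characters; the helper columns `prefix` and `position` are names invented for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"Numbers": ["ABCD0001", "ABCD0002", "ABCD0003",
                               "GHIJ768O", "GHIJ7681", "GHIJ7682",
                               "SEDFTH1", "SEDFTH2", "SEDFTH3"]})

# Assumption: the first three characters identify the target column.
tmp = df.assign(prefix=df["Numbers"].str[:3])

# Number the values inside each prefix group: 0, 1, 2, ...
tmp["position"] = tmp.groupby("prefix").cumcount()

# One row per position, one column per prefix.
wide = tmp.pivot(index="position", columns="prefix", values="Numbers")
print(wide)
```

Unlike building a DataFrame from three lists, this does not fail when the groups have different sizes; the shorter columns are padded with NaN.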

In Python, How do you use a loop to create a dataframe?

I recently pulled data from the YouTube API, and I'm trying to create a data frame from that information.
When I loop through each item with the "print" function, I get 25 rows of output for each variable (which is what I want in the data frame I create).
How can I create a new data frame that contains 25 rows from this information, instead of just 1 row?
When I loop through each item like this:
df = pd.DataFrame(columns=['video_title', 'video_id', 'date_created'])
# For loop to create columns for DataFrame
x = 0
while x < len(response['items']):
    video_title = response['items'][x]['snippet']['title']
    video_id = response['items'][x]['id']['videoId']
    date_created = response['items'][x]['snippet']['publishedAt']
    x = x + 1
    # print(video_title, video_id)
df = df.append({'video_title': video_title, 'video_id': video_id,
                'date_created': date_created}, ignore_index=True)
=========ANSWER UPDATE==========
THANK YOU TO EVERYONE THAT GAVE INPUT !!!
The code that created the Dataframe was:
import pandas as pd
x = 0
video_title = []
video_id = []
date_created = []
while x < len(response['items']):
    # Collect one value per item in each list
    video_title.append(response['items'][x]['snippet']['title'])
    video_id.append(response['items'][x]['id']['videoId'])
    date_created.append(response['items'][x]['snippet']['publishedAt'])
    x = x + 1
    # print(video_title, video_id)
# Build the DataFrame once, after the loop, from the three lists
df = pd.DataFrame({'video_title': video_title, 'video_id': video_id,
                   'date_created': date_created})
Based on what I know about the objects the YouTube API returns, the values of 'title', 'videoId' and 'publishedAt' are strings.
One strategy for making a single df from these strings is:
1. Store these strings in lists, so you will have three lists.
2. Convert the lists into a df.
You will get a df with x rows, based on the x values that are retrieved.
Example:
import pandas as pd
x = 0
video_title = []
video_id = []
date_created = []
while x < len(response['items']):
    video_title.append(response['items'][x]['snippet']['title'])
    video_id.append(response['items'][x]['id']['videoId'])
    date_created.append(response['items'][x]['snippet']['publishedAt'])
    x = x + 1
    # print(video_title, video_id)
df = pd.DataFrame({'video_title': video_title, 'video_id': video_id,
                   'date_created': date_created})
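The same idea can be written a little more compactly by building one dictionary per item and passing the list of dictionaries to `pd.DataFrame`. This is only a sketch: the `response` dictionary below is a hypothetical stand-in with made-up values, containing just the keys used in the question.

```python
import pandas as pd

# Hypothetical stand-in for the API response (made-up values).
response = {
    "items": [
        {"id": {"videoId": "abc123"},
         "snippet": {"title": "First video", "publishedAt": "2020-01-01T00:00:00Z"}},
        {"id": {"videoId": "def456"},
         "snippet": {"title": "Second video", "publishedAt": "2020-02-01T00:00:00Z"}},
    ]
}

# One dictionary (one future row) per item in the response.
records = [
    {
        "video_title": item["snippet"]["title"],
        "video_id": item["id"]["videoId"],
        "date_created": item["snippet"]["publishedAt"],
    }
    for item in response["items"]
]

# A list of dictionaries becomes one row per dictionary.
df = pd.DataFrame(records, columns=["video_title", "video_id", "date_created"])
print(df)
```

Building the DataFrame once from a list is also generally faster than growing it row by row inside the loop.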

Python Panda-Specific format of text with hyphen

I need to fix multiple pandas DataFrame columns whose values do not follow a specific format such as Name-ID-Date, so that they all follow the same format. I have attached the input and the corrected output format as images.
I have written some code that looks at all the columns in the DataFrame. If a value follows the format, the code separates it into 3 different columns, but if a value does not follow the Name-ID-Date format, the code cannot proceed. Any help will be highly appreciated.
dff = df[['PPS_REQ','Candidate1', 'Candidate2',
'Candidate3', 'Candidate4', 'Candidate5', 'Candidate6', 'Candidate7',
'Candidate8', 'Candidate9','Candidate10', 'Candidate11', 'Candidate12',
'Candidate13', 'Candidate14', 'Candidate15', 'Candidate16',
'Candidate17', 'Candidate18', 'Candidate19', 'Candidate20','Candidate21',
'Candidate22','Candidate23','Candidate24','Candidate25','Candidate26','Candidate27','Candidate28']]
all_candiadates = ['Candidate1', 'Candidate2',
'Candidate3', 'Candidate4', 'Candidate5', 'Candidate6', 'Candidate7',
'Candidate8', 'Candidate9','Candidate10', 'Candidate11', 'Candidate12',
'Candidate13', 'Candidate14', 'Candidate15', 'Candidate16',
'Candidate17', 'Candidate18', 'Candidate19', 'Candidate20','Candidate21',
'Candidate22','Candidate23','Candidate24','Candidate25','Candidate26','Candidate27','Candidate28']#,'Candidate29','Candidate30','Candidate31','Candidate32','Candidate33','Candidate34','Candidate35','Candidate36','Candidate37','Candidate38']
blank = pd.DataFrame()
for index, row in dff.iterrows():
    for c in all_candiadates:
        print('the value of c :', c)
        candidate = dff[['PPS_REQ', c]]
        candidate[['Name', 'Id', 'Sdate']] = candidate[c].str.split('-', n=-1, expand=True)
        blank = blank.append(candidate)
Thank you
I have done a workaround in the code, something like below, but I am facing a problem with this part of the code:
candidate['Sdate'] = candidate[c].str.extract('(../..)', expand=True)
Here, if the date is 11/18 it works fine, but if the date is 11/8 it returns NaN.
for index, row in dff.iterrows():
    for c in all_candiadates:
        print('the value of c :', c)
        candidate = dff[['PPS_REQ', c]]
        candidate['Sdate'] = candidate[c].str.extract('(../..)', expand=True)
        candidate['Id'] = candidate[c].str.extract('(\d\d\d\d\d\d\d)', expand=True)
        candidate['Name'] = candidate[c].str.extract('([a-zA-Z ]*)\d*.*', expand=False)
        # candidate[['Name','Id','Sdate']] = candidate[c].str.split('-',n=-1,expand=True)
        blank = blank.append(candidate)
Finally this is fixed. I am adding the solution here in case it is useful for someone else.
blank = pd.DataFrame()
# for index, row in dff.iterrows():
for c in all_candiadates:
    # print('the value of c :', c)
    try:
        candidate = dff[['PPS_REQ', c]]
        # Skip status-only cells and empty cells
        candidate = candidate[candidate[c].str.contains('FILL|Reopen|Fill|REOPEN|Duplicate|reopen|FILED|fill') != True]
        candidate = candidate.loc[(candidate[c] != "")].copy()
        # \d+/\d+ accepts one or more digits on each side of the slash, so 11/8 matches too
        candidate['Sdate'] = candidate[c].str.extract(r'(\d+/\d+)', expand=True)
        candidate['Id'] = candidate[c].str.extract(r'(\d\d\d\d\d\d\d)', expand=True)
        candidate['Name'] = candidate[c].str.extract(r'([a-zA-Z ]*)\d*.*', expand=False)
        # candidate[['Name','Id','Sdate']] = candidate[c].str.split('-',n=-1,expand=True)
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        blank = pd.concat([blank, candidate])
    except Exception:
        pass
blank = blank[['PPS_REQ', 'Name', 'Id', 'Sdate']]
bb = blank.drop_duplicates()
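The pattern '(../..)' fails on 11/8 because each `.` matches exactly one character, so two characters are required after the slash. The three separate extractions can also be combined into a single pattern with named groups. The sketch below uses made-up sample values, since the real input was only shown as an image, and it assumes a 7-digit ID and a month/day date as in the code above.

```python
import pandas as pd

# Hypothetical sample values; the last one is a status-only cell.
raw = pd.Series(
    ["John Doe-1234567-11/8", "Jane Roe 7654321 11/18", "FILLED"],
    name="Candidate1",
)

# The original date pattern needs two characters after the slash, so "11/8" is missed.
old_dates = raw.str.extract(r"(../..)", expand=False)

# Name: letters and spaces; Id: exactly 7 digits; Sdate: 1-2 digits / 1-2 digits.
# Any separator (hyphen, space, or nothing) is tolerated between the parts.
pattern = r"(?P<Name>[A-Za-z ]+?)\W*(?P<Id>\d{7})\D*(?P<Sdate>\d{1,2}/\d{1,2})"
parsed = raw.str.extract(pattern)
parsed["Name"] = parsed["Name"].str.strip()
print(parsed)
```

Rows that do not contain all three parts come back as NaN in every column, so they can be dropped afterwards with `parsed.dropna()` instead of being filtered out beforehand.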

Change Column values in pandas applying another function

I have a data frame in pandas, one of the columns contains time intervals presented as strings like 'P1Y4M1D'.
The example of the whole CSV:
oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...
I created a parsing function that takes a string like 'P1Y4M1D' and returns an integer.
How can I change all the column values to parsed values using that function?
def do_process_citation_data(f_path):
    global my_ocan
    my_ocan = pd.read_csv("citations.csv",
                          names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:]  # remove the first row; iloc selects data by row numbers
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    return my_ocan
def parse():
    mydict = dict()
    mydict2 = dict()
    i = 1
    r = 1
    for x in my_ocan['oci']:
        mydict[x] = str(my_ocan['timespan'][i])
        i += 1
    print(mydict)
    for key, value in mydict.items():
        is_negative = value.startswith('-')
        if is_negative:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value[1:])
        else:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
        year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0,0,0]
        daystotal = (year * 365) + (month * 30) + day
        if not is_negative:
            #mydict2[key] = daystotal
            return daystotal
        else:
            #mydict2[key] = -daystotal
            return -daystotal
    #print(mydict2)
    #return mydict2
I probably do not even need to replace the whole column with the parsed values. The final goal is to write a new function that returns the average ['timespan'] of docs created in a particular year. Since I need parsed values, I thought it would be easier to change the whole column and work with a new data frame.
Also, I am curious how to apply the parsing function to each ['timespan'] row without modifying the data frame. I can only assume it could be something like this, but I don't fully understand how to do it:
for x in my_ocan['timespan']:
    x = parse(str(my_ocan['timespan']))
How can I get a column with new values? Thank you! Peace :)
df['timespan'].apply(parse) (as mentioned by @Dan) should work. You only need to modify the parse function so that it receives the string as an argument and returns the parsed value at the end. Something like this:
import pandas as pd
def parse_postal_code(postal_code):
    # Splitting postal code and getting first letters
    letters = postal_code.split('_')[0]
    return letters
# Example dataframe with three columns and three rows
df = pd.DataFrame({'Age': [20, 21, 22], 'Name': ['John', 'Joe', 'Carla'], 'Postal Code': ['FF_222', 'AA_555', 'BB_111']})
# This returns a new pd.Series
print(df['Postal Code'].apply(parse_postal_code))
# Can also be assigned to another column
df['Postal Code Letter'] = df['Postal Code'].apply(parse_postal_code)
print(df['Postal Code Letter'])
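Applied to the timespan column from the question, the same approach might look like the sketch below. It reuses the question's regular expression and its approximation of 365 days per year and 30 days per month. The function name `parse_timespan`, the column `timespan_days`, and the sample rows are made up for illustration, and the `creation` values are assumed to all be in year-month form.

```python
import re
import pandas as pd

# Same pattern as in the question, with an optional leading minus sign.
TIMESPAN_RE = re.compile(r"(-)?P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?")

def parse_timespan(value):
    """Convert a string like 'P1Y4M1D' into an approximate number of days."""
    match = TIMESPAN_RE.fullmatch(str(value))
    if match is None:
        # Mirrors the question's fallback of [0, 0, 0] for unparseable values.
        return 0
    sign, years, months, days = match.groups()
    total = int(years or 0) * 365 + int(months or 0) * 30 + int(days or 0)
    return -total if sign else total

# Hypothetical sample rows (only the two relevant columns).
df = pd.DataFrame({
    "creation": ["1985-04", "1985-06", "1990-01"],
    "timespan": ["P2Y", "P1Y", "P1Y4M1D"],
})

# apply() calls parse_timespan once per value and returns a new Series.
df["timespan_days"] = df["timespan"].apply(parse_timespan)

# Average timespan (in days) of documents created in each year.
df["year"] = pd.to_datetime(df["creation"], format="%Y-%m").dt.year
avg_by_year = df.groupby("year")["timespan_days"].mean()
print(avg_by_year)
```

Keeping the parsed values in a new column leaves the original `timespan` strings untouched, which covers the case where the data frame should not be modified in place.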

Filter data through multiple columns and print rows?

Kind of a follow up on my last question. So I got this data in .csv file that looks like:
id,first_name,last_name,email,gender,ip_address,birthday
1,Ced,Begwell,cbegwell0@google.ca,Male,134.107.135.233,17/10/1978
2,Nataline,Cheatle,ncheatle1@msn.com,Female,189.106.181.194,26/06/1989
3,Laverna,Hamlen,lhamlen2@dot.gov,Female,52.165.62.174,24/04/1990
4,Gawen,Gillfillan,ggillfillan3@hp.com,Male,83.249.190.232,31/10/1984
5,Syd,Gilfether,sgilfether4@china.com.cn,Male,180.153.199.106,11/07/1995
What I want is for the Python program to ask the user which keywords to search for when it runs. It then takes all the keywords entered (maybe they are stored in a list?) and prints out all rows that contain all the keywords, no matter which column each keyword is in.
I've been playing around with csv and pandas, and have been googling for hours, but just can't get it to work the way I want. I'm still kind of new to Python 3. Please help.
Edit to show what I've got so far:
import csv
# Asks for search criteria from user
search_parts = input("Enter search criteria:\n").split(",")
# Opens csv data file
file = csv.reader(open("MOCK_DATA.csv"))
# Go over each row and print it if it contains user input.
for row in file:
    if all([x in row for x in search_parts]):
        print(row)
This works great when searching by only one keyword, but I want the choice of filtering by one or multiple keywords.
Here you go. This uses try and except because comparing a column whose datatype does not match your keyword could raise an error.
import pandas as pd
def fun(data, keyword):
    ans = pd.DataFrame()
    for i in data.columns:
        try:
            ans = pd.concat((data[data[i] == keyword], ans))
        except:
            pass
    ans.drop_duplicates(inplace=True)
    return ans
Try the following code for AND search with the keywords:
import numpy as np
def AND_serach(df, list_of_keywords):
    # None marks "no keyword processed yet"; an empty array cannot be used for that,
    # because an empty intersection is a valid intermediate result
    index_arr = None
    for keyword in list_of_keywords:
        # drop the rows that are entirely NaN and get the remaining rows' indexes
        index = df[df == keyword].dropna(how='all').index.values
        # first keyword: take its index; afterwards: intersect with what we have so far
        index_arr = index if index_arr is None else np.intersect1d(index_arr, index)
    if index_arr is None:
        # no keywords given: return an empty frame
        return df.iloc[0:0]
    # get back the df by filtering on the index
    return df.loc[index_arr.astype(int)]
Try the following code for OR search with the keywords:
def OR_serach(df, list_of_keywords):
    index_arr = np.array([])
    for keyword in list_of_keywords:
        index = df[df == keyword].dropna(how='all').index.values
        # get all the unique indexes
        index_arr = np.unique(np.concatenate((index_arr, index), 0))
    return df.loc[index_arr.astype(int)]
OUTPUT
d = {'A': [1,2,3], 'B': [10,1,5]}
df = pd.DataFrame(data=d)
print(df)
   A   B
0  1  10
1  2   1
2  3   5
keywords = [1,5]
AND_serach(df,keywords) # returns nothing
Out[]:
A B
OR_serach(df,keywords)
Out[]:
   A   B
0  1  10
1  2   1
2  3   5
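For the CSV in the question, where the keywords arrive as text typed by the user, an AND search can also be sketched with a boolean mask. The example below is an illustration only: the function name `search_all_keywords` is made up, the sample frame contains just three of the question's rows and four of its columns, and every cell is compared as a string. Keywords are stripped of surrounding whitespace, which may be why an input such as "Male, Ced" finds nothing in the original csv loop.

```python
import pandas as pd

def search_all_keywords(df, keywords):
    """Return the rows that contain every keyword in at least one column."""
    # Compare everything as text so numeric columns can match typed input.
    text = df.astype(str)
    mask = pd.Series(True, index=df.index)
    for keyword in keywords:
        # True for rows where any cell equals the keyword exactly.
        mask &= text.eq(keyword.strip()).any(axis=1)
    return df[mask]

# A few rows from the question's data (subset of columns).
df = pd.DataFrame({
    "id": [1, 2, 3],
    "first_name": ["Ced", "Nataline", "Laverna"],
    "last_name": ["Begwell", "Cheatle", "Hamlen"],
    "gender": ["Male", "Female", "Female"],
})

# Simulates the user typing "Male, Ced" and splitting on commas.
result = search_all_keywords(df, "Male, Ced".split(","))
print(result)
```

With real input, `keywords` would come from `input(...).split(",")` and `df` from `pd.read_csv("MOCK_DATA.csv")`, as in the question.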
