Combine two IDs into a new table? - Python

I had a task about text processing and I don't know how to combine some columns from separate tables into one table.
Here is the case:
I have a table named list with id_doc and titles columns.
Then I create a new table named term_list, which contains the terms that result from text processing of the titles in list.
The term_list table has id_term, term, df, and idf columns. Lastly, I want to have a table named term_freq with columns id, id_term, id_doc, tf, and normalized_tf.
example :
table list is like this:
id_doc titles
11 information retrieval system
12 operating system
13 business information
table term_list is below this:
id_term term df idf
21 information 2 --
22 retrieval 1 --
23 system 2 --
24 operating 1 --
25 business 1 --
I want to ask how to create the table term_freq so that it looks like this:
id id_term id_doc tf normalized_tf
31 21 11 1 --
32 22 11 1 --
33 23 11 1 --
34 24 12 1 --
35 23 12 1 --
36 25 13 1 --
37 21 13 1 --
The main problem is that I have to join id_term and id_doc into one table where one id_doc relates to several id_term, but I don't know how to correlate them because list and term_list don't have any column in common.
Please help :(

You can iterate over the rows in term_list:
SELECT id_term, term FROM term_list
For each term, run:
SELECT id_doc FROM list WHERE titles LIKE '%term%'
and save the (id_term, id_doc) pairs in the term_freq table.
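A minimal pandas sketch of that loop, using the sample tables from the question (tf here is a raw word count; df, idf, and normalized_tf are left out because the question doesn't define how they are computed):

```python
import pandas as pd

# The two tables from the question
doc_list = pd.DataFrame({"id_doc": [11, 12, 13],
                         "titles": ["information retrieval system",
                                    "operating system",
                                    "business information"]})
term_list = pd.DataFrame({"id_term": [21, 22, 23, 24, 25],
                          "term": ["information", "retrieval", "system",
                                   "operating", "business"]})

rows = []
for _, t in term_list.iterrows():
    # every document whose title contains the term (the LIKE '%term%' step)
    matches = doc_list[doc_list["titles"].str.contains(t["term"])]
    for _, d in matches.iterrows():
        rows.append({"id_term": t["id_term"],
                     "id_doc": d["id_doc"],
                     "tf": d["titles"].split().count(t["term"])})

term_freq = pd.DataFrame(rows)
term_freq.insert(0, "id", range(31, 31 + len(term_freq)))
print(term_freq)
```

This yields the seven (id_term, id_doc) pairs from the question, ordered by term rather than by document.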

(Python) How to create a table from an already defined dataset

I am looking for how to turn this into a table.
It is a fairly simple case.
row_mean1 = pd.DataFrame(round(df1.mean(axis = 0), 3))
This is how I defined the dataset (1 row x 26 columns).
I would like to skip the first value, which is useless, and create a table with
Index = ['ME1', 'ME2', 'ME3', 'ME4', 'ME5'], Columns = ['BM1', 'BM2', 'BM3', 'BM4', 'BM5'];
there are 25 values and I want every 5 values to be on one row.
For instance,
let's say I have data
1, 2, ..., 25
I want to make a table of
1 2 3 4 5
6 7 8 9 10 . .
21 22 23 24 25.
How could I do this?
This is how the dataset looks.
I tried to solve this for 2 hours, but I keep failing.
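A sketch of one way to do this, assuming the 26 values sit in a one-row DataFrame as described (the stand-in numbers 0-25 replace the real means):

```python
import pandas as pd

# Stand-in for round(df1.mean(axis=0), 3): one row of 26 values, 0..25
row_mean1 = pd.DataFrame([range(26)])

values = row_mean1.iloc[0, 1:].to_numpy()      # skip the useless first value -> 25 values
table = pd.DataFrame(values.reshape(5, 5),     # every 5 values become one row
                     index=['ME1', 'ME2', 'ME3', 'ME4', 'ME5'],
                     columns=['BM1', 'BM2', 'BM3', 'BM4', 'BM5'])
print(table)
```

reshape(5, 5) fills row by row, so values 1-5 land in the ME1 row, 6-10 in ME2, and so on.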

Average for similar looking data in a column using Pandas

I'm working on a large dataset with more than 60K rows.
I have a continuous measurement of current in a column. A code is measured for a second, during which the equipment takes 14/15/16/17 readings depending on the equipment speed; then the measurement moves to the next code and again takes 14/15/16/17 readings, and so forth.
Every time the measurement moves from one code to another, there is a jump of more than 0.15 in the current measurement.
The top 48 rows of the data are as follows:
Index  Curr(mA)
0      1.362476
1      1.341721
2      1.362477
3      1.362477
4      1.355560
5      1.348642
6      1.327886
7      1.341721
8      1.334804
9      1.334804
10     1.348641
11     1.362474
12     1.348644
13     1.355558
14     1.334805
15     1.362477
16     1.556172
17     1.542336
18     1.549252
19     1.528503
20     1.549254
21     1.528501
22     1.556173
23     1.556172
24     1.542334
25     1.556172
26     1.542336
27     1.542334
28     1.556170
29     1.535415
30     1.542334
31     1.729109
32     1.749863
33     1.749861
34     1.749861
35     1.736024
36     1.770619
37     1.742946
38     1.763699
39     1.749861
40     1.749861
41     1.763703
42     1.756781
43     1.742946
44     1.736026
45     1.756781
46     1.964308
47     1.957395
I want to write a script where the similar readings (the 14/15/16/17 per code) are averaged into a separate column, one value per code measurement. I have been thinking of doing this with pandas.
I want the data to look like:
Index  Curr(mA)
0      1.34907
1      1.54556
2      1.74986
I need some help to get this done. Please help.
First get the index of every row where there's a jump. Use pandas' DataFrame.diff() to get the difference between the value in each row and the previous row, then check whether it's greater than 0.15 with >. Use that to filter the dataframe index, and save the resulting indices (three, in the case of your sample data) in a variable.
indices = df.index[df['Curr(mA)'].diff() > 0.15]
The next steps depend on whether there are more columns in the source dataframe that you want in the output, or if it's really just Curr(mA) and the index. In the latter case, you can use NumPy's np.split() to cut the dataframe into a list of dataframes at the indices you just pulled, then average each piece in a list comprehension.
[part['Curr(mA)'].mean() for part in np.split(df, indices)]
> [1.3490729374999997, 1.5455638666666667, 1.7498627333333332, 1.9608515]
To get it to match your desired output above (the same values as a Series rather than a list), convert the list to a pd.Series and reset_index(drop=True).
pd.Series(
    [part['Curr(mA)'].mean() for part in np.split(df, indices)]
).reset_index(drop=True)
0    1.349073
1    1.545564
2    1.749863
3    1.960851
dtype: float64
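If depending on np.split for dataframes feels fragile, the same averages can be had by turning the jump mask into running group labels with cumsum() and grouping on them; a sketch on invented plateau data:

```python
import pandas as pd

# Invented data: three plateaus separated by jumps larger than 0.15
df = pd.DataFrame({'Curr(mA)': [1.36, 1.34, 1.35, 1.55, 1.54, 1.56, 1.75, 1.74]})

# Each jump bumps the running group number, so rows between jumps share a label
group = (df['Curr(mA)'].diff() > 0.15).cumsum()
means = df.groupby(group)['Curr(mA)'].mean().reset_index(drop=True)
print(means)
```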

Is there a way to join 2 dataframes using another reference table in python

I have 2 dataframes created from CSV files, and there is another dataframe that serves as a reference for these tables. For example:
1 Employee demographic (Emp_id, dept_id)
2 Employee detail (Emp_id, RM_ID)
I have a 3rd dataframe (dept_manager) which has only 2 columns (dept_id, RM_ID). Now I need to join tables 1 and 2 using the 3rd dataframe as the reference.
I'm trying this out in pandas (Python); any help here would be much appreciated. Thanks in advance.
Table1
Empdemogr
Empid dept_id
1 10
2 20
1 30
Table2
Empdetail
Empid RM_id
1 E120
2 E140
3 E130
Table3
dept_manager
dept_id RM_id
10 E110
10 E120
10 E121
10 E122
10 E123
20 E140
20 E141
20 E142
30 E130
30 E131
30 E132
Output:
Emp_id dept_id RM_id
1 10 E120
2 20 E140
1 30 E130
So I am trying to reproduce this SQL in Python:
select a.Emp_id, a.dept_id, b.RM_id
from Empdemogr a, Empdetail b, dept_manager d
where
a.emp_id = b.emp_id
and a.dept_id = d.dept_id
and b.RM_id = d.RM_id
I'm trying to figure out whether you have a typo or a misunderstanding: your SQL would not output the result you are looking for from the provided data. I do not think you will see dept_id '30' in there.
But going by your SQL query, here is how you can write the same thing with pandas dataframes:
Preparing DataFrames (I will leave it up to you how you load the dataframes):
import pandas as pd
EmployeeDemo = pd.read_csv(r"YourEmployeeDemoFile.txt")
EmpDetail = pd.read_csv(r"YourEmpDetailFile.txt")
Dept_Manager = pd.read_csv(r"YourDept_Manager.txt")
Code to Join the DataFrames:
joined_dataframe = pd.merge(pd.merge(EmployeeDemo, EmpDetail, on="Empid"), Dept_Manager, on=["dept_id", "RM_id"])
print(joined_dataframe)
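For reference, here is the same double merge run on the question's tables typed out inline; it also illustrates the note above about dept_id 30: employee 1's RM_id E120 is not listed as a manager of dept 30, so that row drops out.

```python
import pandas as pd

# Tables from the question, built inline instead of read from CSV
Empdemogr = pd.DataFrame({'Empid': [1, 2, 1], 'dept_id': [10, 20, 30]})
Empdetail = pd.DataFrame({'Empid': [1, 2, 3], 'RM_id': ['E120', 'E140', 'E130']})
Dept_Manager = pd.DataFrame({'dept_id': [10, 10, 10, 10, 10, 20, 20, 20, 30, 30, 30],
                             'RM_id': ['E110', 'E120', 'E121', 'E122', 'E123',
                                       'E140', 'E141', 'E142', 'E130', 'E131', 'E132']})

# First pair each employee with their RM, then keep only the
# (dept_id, RM_id) pairs that exist in the manager table
joined = pd.merge(pd.merge(Empdemogr, Empdetail, on='Empid'),
                  Dept_Manager, on=['dept_id', 'RM_id'])
print(joined)
```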

Cx_Oracle equivalent of on duplicate key update

I have a list of tuples named Data:
[
('171115090000',
Timestamp('2017-11-15 09:00:00'),
'PAIR1',
156.0)
]
I want to insert this list into an Oracle DB; my code is
cur.executemany(
'''INSERT INTO A
("SID","DATE","ATT","VALUE")
VALUES(:1,:2,:3,:4)''',Data)
And it works well. However, if I want to add or replace records in this table, I have to create a table B to hold the new records and then merge A and B.
Is there anything like ON DUPLICATE KEY UPDATE that would let me finish the job without creating a new table?
I know I could select all the records from A, convert them to a DataFrame, and merge the DataFrames in Python, but is that a good solution?
Is there anything like on duplicate key update
In Oracle, it is called MERGE; have a look at the following example:
Table contents at the beginning:
SQL> select * From dept;
DEPTNO DNAME LOC
---------- -------------- -------------
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON
MERGE statement:
SQL> merge into dept d
2 using (select deptno, dname, loc
3 from (select 10 deptno, 'ACC' dname, 'NY' loc from dual --> already exists, should be updated
4 union all
5 select 99 , 'NEW DEPT' , 'LONDON' from dual --> doesn't exist, should be inserted
6 )
7 ) x
8 on (d.deptno = x.deptno)
9 when matched then update set
10 d.dname = x.dname,
11 d.loc = x.loc
12 when not matched then insert (d.deptno, d.dname, d.loc)
13 values (x.deptno, x.dname, x.loc);
2 rows merged.
The result: as you can see, values for existing DEPTNO = 10 were updated, while the new DEPTNO = 99 was inserted into the table.
SQL> select * From dept;
DEPTNO DNAME LOC
---------- -------------- -------------
10 ACC NY
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON
99 NEW DEPT LONDON
SQL>
I don't speak Python, so I can't compose the code you might use, but I hope you'll manage to do it yourself.
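To close the loop for the asker, here is an untested sketch of driving that MERGE through cx_Oracle with the original executemany() bind variables (table and column names are taken from the question; the connection lines are placeholders and commented out, since they need a live database):

```python
from datetime import datetime

# The asker's rows: (SID, DATE, ATT, VALUE)
Data = [('171115090000', datetime(2017, 11, 15, 9, 0), 'PAIR1', 156.0)]

# One MERGE per row: match on SID, update the row if it exists, insert it if not
merge_sql = '''MERGE INTO A a
    USING (SELECT :1 "SID", :2 "DATE", :3 "ATT", :4 "VALUE" FROM dual) src
    ON (a."SID" = src."SID")
    WHEN MATCHED THEN UPDATE SET
        a."DATE" = src."DATE", a."ATT" = src."ATT", a."VALUE" = src."VALUE"
    WHEN NOT MATCHED THEN INSERT ("SID", "DATE", "ATT", "VALUE")
        VALUES (src."SID", src."DATE", src."ATT", src."VALUE")'''

# Requires a live Oracle connection:
# import cx_Oracle
# conn = cx_Oracle.connect(user, password, dsn)
# cur = conn.cursor()
# cur.executemany(merge_sql, Data)
# conn.commit()
print(merge_sql)
```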

Automated combinatorial DataFrame generation in Python/pandas

I'm quite new to pandas and Python, coming from a background in biochemistry and drug discovery. One frequent task I'd like to automate is converting a list of combinations of drug treatments and proteins into a format that contains all such combinations.
For instance, given a DataFrame containing a set of combinations:
https://github.com/colinhiggins/dillydally/blob/master/input.csv, I'd like to turn it into https://github.com/colinhiggins/dillydally/blob/master/output.csv, such that each protein (1, 2, and 3) is copied n times into an output DataFrame, where the number of rows, n, is the number of drug/drug-concentration combinations plus one for a no-drug row per protein.
Ideally, the degree of combination would be dictated by some other table that indicates the relationships, for example if proteins 1 and 3 are to be treated with drugs 1, 2, and 3 but protein 2 isn't treated with any drugs.
I'm thinking some kind of nested for loop is going to be required, but I can't wrap my head around quite how to start it.
Consider the following solution
from itertools import product
import pandas
protein = ['protein1' , 'protein2' , 'protein3' ]
drug = ['drug1' , 'drug2', 'drug3']
drug_concentration = [100,30,10]
df = pandas.DataFrame.from_records(list(product(protein, drug, drug_concentration)), columns=['protein', 'drug', 'drug_concentration'])
>>> df
protein drug drug_concentration
0 protein1 drug1 100
1 protein1 drug1 30
2 protein1 drug1 10
3 protein1 drug2 100
4 protein1 drug2 30
5 protein1 drug2 10
6 protein1 drug3 100
7 protein1 drug3 30
8 protein1 drug3 10
9 protein2 drug1 100
10 protein2 drug1 30
11 protein2 drug1 10
12 protein2 drug2 100
13 protein2 drug2 30
14 protein2 drug2 10
15 protein2 drug3 100
16 protein2 drug3 30
17 protein2 drug3 10
18 protein3 drug1 100
19 protein3 drug1 30
20 protein3 drug1 10
21 protein3 drug2 100
22 protein3 drug2 30
23 protein3 drug2 10
24 protein3 drug3 100
25 protein3 drug3 30
26 protein3 drug3 10
This is basically a Cartesian product you're after, which is the functionality of the product function in the itertools module. I'm admittedly confused about why you want the extra rows that just list the proteins with NaNs in the other columns; I'm not sure whether that was intentional or accidental. If the datatypes were uniform and numeric, this would be similar to what's known as a meshgrid.
I've worked through part of this with the help of "add one row in a pandas.DataFrame", using the method recommended by ShikharDua of creating a list of dicts, each dict corresponding to a row in the eventual DataFrame.
The code is:
data = pandas.read_csv('input.csv')
dict1 = {"protein": "", "drug": "", "drug_concentration": ""}  # should be able to get this automatically from the dataframe columns, I think
rows_list = []
for unique_protein in data.protein.unique():
    dict1 = {"protein": unique_protein, "drug": "", "drug_concentration": ""}
    rows_list.append(dict1)
    for unique_drug in data.drug.unique():
        for unique_drug_conc in data.drug_concentration.unique():
            dict1 = {"protein": unique_protein, "drug": unique_drug, "drug_concentration": unique_drug_conc}
            rows_list.append(dict1)
df = pandas.DataFrame(rows_list)
df
It isn't as flexible as I was hoping, since the extra row for each protein with no drugs is hard-coded into the nested for loops, but at least it's a start. I guess I can add some if statements within each for loop.
I've improved upon the earlier version:
- enclosed it in a function
- added a check for proteins that won't be treated with drugs, from another input CSV file that contains the same proteins in column A and either true or false in column B, labeled "treat with drugs"
- skipped null values; I noticed that my example input.csv had equal-length columns, and the function started going a little nuts with NaN rows when they had unequal lengths
- initial dictionary keys are now set from the columns of the initial input CSV instead of being hard-coded
I tested this with some real data (hence the change from input.csv to realinput.csv), and it works quite nicely.
Code for a fully functional Python file follows:
import pandas
import os

os.chdir("path_to_directory_containing_realinput_and_boolean_file")
realinput = pandas.read_csv('realinput.csv')
rows_list = []
dict1 = dict.fromkeys(realinput.columns, "")
prot_drug_bool = pandas.read_csv('protein_drug_bool.csv')
prot_drug_bool.index = prot_drug_bool.protein
prot_drug_bool = prot_drug_bool.drop("protein", axis=1)

def null_check(value):
    return pandas.isnull(value)

def combinator(input_table):
    for unique_protein in input_table.protein.unique():
        dict1 = dict.fromkeys(realinput.columns, "")
        dict1['protein'] = unique_protein
        rows_list.append(dict1)
        # .ix is deprecated; look the flag up by row label and the "treat with drugs" column
        if prot_drug_bool.loc[unique_protein, "treat with drugs"]:
            for unique_drug in input_table.drug.unique():
                if not null_check(unique_drug):
                    for unique_drug_conc in input_table.drug_concentration.unique():
                        if not null_check(unique_drug_conc):
                            dict1 = dict.fromkeys(realinput.columns, "")
                            dict1['protein'] = unique_protein
                            dict1['drug'] = unique_drug
                            dict1['drug_concentration'] = unique_drug_conc
                            rows_list.append(dict1)
    df = pandas.DataFrame(rows_list)
    return df

df2 = combinator(realinput)
df2.to_csv('realoutput.csv')
I'd still like to make it more versatile by getting away from hard-coding any dictionary keys and letting the user-defined input.csv column headers dictate the output. Also, I'd like to move away from the defined three-column setup to handle any number of columns.
