Auto-increment column value is larger than I expected - Python

When I put data into the DB with Python,
I ran into a problem: the auto-increment column value is larger than I expected.
Assume that I call the following function multiple times to put data into the DB.
'db_engine' is a DB engine whose database contains the tables 'tbl_student' and 'tbl_score'.
To track the total number of students, 'tbl_student' has an auto-increment column named 'index'.
def save_in_db(db_engine, dataframe):
    # tbl_student
    student_dataframe = pd.DataFrame({
        "ID": dataframe['ID'],
        "NAME": dataframe['NAME'],
        "GRADE": dataframe['GRADE'],
    })
    student_dataframe.to_sql(name="tbl_student", con=db_engine, if_exists='append', index=False)
    # tbl_score
    score_dataframe = pd.DataFrame({
        "SCORE_MATH": dataframe['SCORE_MATH'],
        "SCORE_SCIENCE": dataframe['SCORE_SCIENCE'],
        "SCORE_HISTORY": dataframe['SCORE_HISTORY'],
    })
    score_dataframe.to_sql(name="tbl_score", con=db_engine, if_exists='append', index=False)
'tbl_student' after some inputs is as follows:
index ID      NAME   GRADE
0     2023001 Amy    1
1     2023002 Brady  1
2     2023003 Caley  4
6     2023004 Dee    2
7     2023005 Emma   2
8     2023006 Favian 3
12    2023007 Grace  3
13    2023008 Harry  3
14    2023009 Ian    3
Please take a look at the 'index' column.
When I insert data several times, 'index' ends up with larger values than I expected (0, 1, 2, 6, 7, 8, ... instead of 0 through 8).
What should I try to solve this problem?

You could try:
student_dataframe.reset_index()

Actually, the root of the problem was that the 'index' column is connected to another table as a FOREIGN KEY.
Every time I added data, an error occurred because there was no matching key (the auto-increment values are not continuous!).
I solved this by checking the index once before putting data into the DB and setting it as the key myself.
The following code is what I tried:
index_no = get_index(db_engine)
dataframe.index = dataframe.index + index_no + 1 - len(dataframe)
dataframe.reset_index(inplace=True)
If anyone has the same problem, it may be better to try another approach rather than trying to make the auto-increment key sequential.
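For reference, 'get_index' above is my own helper; a minimal sketch of what it could look like (assuming SQLAlchemy and MySQL-style quoting here - the real implementation may differ):

from sqlalchemy import text

def get_index(db_engine):
    # Hypothetical sketch: fetch the current maximum value of the
    # auto-increment 'index' column; -1 means the table is empty.
    with db_engine.connect() as conn:
        max_index = conn.execute(text('SELECT MAX(`index`) FROM tbl_student')).scalar()
    return max_index if max_index is not None else -1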

Related

How to use one dataframe's index to reindex another one in pandas

I'm sorry, I truly don't know what title I should use. But here is my question:
Stocks_Open
d-1 d-2 d-3 d-4
000001.HR 1817.670960 1808.937405 1796.928768 1804.570628
000002.ZH 4867.910878 4652.713598 4652.713598 4634.904168
000004.HD 92.046474 92.209029 89.526880 96.435445
000005.SS 28.822245 28.636893 28.358865 28.729569
000006.SH 192.362963 189.174626 185.986290 187.403328
000007.SH 79.190528 80.515892 81.509916 78.693516
Stocks_Volume
d-1 d-2 d-3 d-4
000001.HR 324234 345345 657546 234234
000002.ZH 4867343 465234 4652598 4634168
000004.HD 9246474 929029 826880 965445
000005.SS 2822245 2836893 2858865 2829569
000006.SH 19262963 1897466 1886290 183328
000007.SH 7190528 803892 809916 7693516
Above is the data I parsed from a database. What I want to do is obtain the correlation of open price and volume over 4 days for each stock (the first column consists of the codes of different stocks). In other words, I am trying to calculate the correlation of corresponding rows of each DataFrame. (This is only a simplified example; the real data extends to more than 1000 different stocks.)
My attempt was to create a dataframe and run a loop, assigning the results to that dataframe. But there is a problem: the index of the created dataframe is not exactly what I want. When I tried to append the correlation column, the bug occurred. (Please ignore the correlation values, which I concocted here just to give an example.)
r = pd.DataFrame(index=range(6), columns=['c'])
for i in range(6):
    r.iloc[i-1, :] = Stocks_Open.iloc[i-1].corr(Stocks_Volume.iloc[i-1])
Correlation_in_4days = pd.concat([Stocks_Open, Stocks_Volume], axis=1)
Correlation_in_4days['corr'] = r['c']
for i in range(6):
    Correlation_in_4days.iloc[i-1, 8] = r.iloc[i-1, :]
r:
        c
1   0.654
2  -0.454
3  0.3321
4  0.2166
5 -0.8772
6  0.3256
The bug occurred:
"ValueError: Incompatible indexer with Series"
I realized that my correlation dataframe's index is an integer range rather than the stock codes, but I don't know how to fix it. Is there any help?
My ideal result is:
corr
000001.HR 0.654
000002.ZH -0.454
000004.HD 0.3321
000005.SS 0.2166
000006.SH -0.8772
000007.SH 0.3256
Try assigning the index back:
r.index = Stocks_Open.index
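As a side note, the loop can be avoided entirely. Here is a sketch using pandas' row-wise corrwith (assuming both frames share the same stock-code index):

# Correlate each row of Stocks_Open with the matching row of
# Stocks_Volume; the result is a Series indexed by stock code.
corr = Stocks_Open.corrwith(Stocks_Volume, axis=1)
Correlation_in_4days = corr.to_frame(name='corr')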

How to create columns in pandas df with .apply and user defined function

I'm trying to create several columns in a pandas DataFrame at once, where each column name is a key in a dictionary and the function returns 1 if any of the values corresponding to that key are present.
My DataFrame has 3 columns, jp_ref, jp_title, and jp_description. Essentially, I'm searching the jp_descriptions for relevant words assigned to that key and populating the column assigned to that key with 1s and 0s based on whether any of the values are found in the jp_description.
jp_title = ['software developer', 'operations analyst', 'it project manager']
jp_ref = ['j01', 'j02', 'j03']
jp_description = ['software developer with java and sql experience', 'operations analyst with ms in operations research, statistics or related field. sql experience desired.', 'it project manager with javascript working knowledge']
myDict = {'jp_title': jp_title, 'jp_ref': jp_ref, 'jp_description': jp_description}
data = pd.DataFrame(myDict)
technologies = {'java': ['java','jdbc','jms','jconsole','jprobe','jax','jax-rs','kotlin','jdk'],
                'javascript': ['javascript','js','node','node.js','mustache.js','handlebar.js','express','angular',
                               'angular.js','react.js','angularjs','jquery','backbone.js','d3'],
                'sql': ['sql','mysql','sqlite','t-sql','postgre','postgresql','db','etl']}
def term_search(doc, tech):
    for term in technologies[tech]:
        if term in doc:
            return 1
        else:
            return 0

for tech in technologies:
    data[tech] = data.apply(term_search(data['jp_description'], tech))
I received the following error but don't understand it:
TypeError: ("'int' object is not callable", 'occurred at index jp_ref')
Your logic is wrong: you are traversing the list in a loop, and after the first iteration it returns 0 or 1, so the jp_description value is never compared against the complete list. (The TypeError itself occurs because term_search(...) is evaluated immediately and returns an int, which data.apply then tries to call as a function.)
Instead, split the jp_description and check for common elements with the technologies dict; if common elements exist, a term was found, so return 1, else 0:
def term_search(doc, tech):
    doc = doc.split(" ")
    common_elem = list(set(doc).intersection(technologies[tech]))
    if len(common_elem) > 0:
        return 1
    return 0

for tech in technologies:
    data[tech] = data['jp_description'].apply(lambda x: term_search(x, tech))
jp_title jp_ref jp_description java javascript sql
0 software developer j01 software developer.... 1 0 1
1 operations analyst j02 operations analyst .. 0 0 1
2 it project manager j03 it project manager... 0 1 0
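A vectorized variant of the same idea is sketched below, under the assumption that whole-word matches are wanted (terms such as 'node.js' are regex-escaped):

import re

# One indicator column per technology: 1 if any term from the list
# appears as a whole word in jp_description, else 0.
for tech, terms in technologies.items():
    pattern = r'\b(?:' + '|'.join(map(re.escape, terms)) + r')\b'
    data[tech] = data['jp_description'].str.contains(pattern, regex=True).astype(int)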

How to fill rows automatically in pandas, from the content found in a column?

In Python3 and pandas I have a dataframe with dozens of columns and rows about food characteristics. Below is a summary:
alimentos = pd.read_csv("alimentos.csv",sep=',',encoding = 'utf-8')
alimentos.reset_index()
index alimento calorias
0 0 iogurte 40
1 1 sardinha 30
2 2 manteiga 50
3 3 maçã 10
4 4 milho 10
The column "alimento" (food) has the lines "iogurte", "sardinha", "manteiga", "maçã" and "milho", which are food names.
I need to create a new column in this dataframe, which will tell what kind of food is. I gave the name "classificacao"
alimentos['classificacao'] = ""
alimentos.reset_index()
index alimento calorias classificacao
0 0 iogurte 40
1 1 sardinha 30
2 2 manteiga 50
3 3 maçã 10
4 4 milho 10
Depending on the content found in the "alimento" column, I want to automatically fill the rows of the "classificacao" column.
For example, when finding "iogurte", fill in "laticinio"; when finding "sardinha" -> "peixe"; when finding "manteiga" -> "gordura animal"; when finding "maçã" -> "fruta"; and when finding "milho" -> "cereal".
Please, is there a way to automatically fill the rows when these strings are found?
If you have a mapping of all the possible values in the "alimento" column, you can just create a dictionary and use .map(d), as shown below:
df = pd.DataFrame({'alimento': ['iogurte', 'sardinha', 'manteiga', 'maçã', 'milho'],
                   'calorias': range(10, 60, 10)})
d = {"iogurte": "laticinio", "sardinha": "peixe", "manteiga": "gordura animal", "maçã": "fruta", "milho": "cereal"}
df['classificacao'] = df['alimento'].map(d)
However, in real life we often can't map everything in a dict (because of outliers that occur once in a blue moon, faulty inputs, etc.), in which case the above would return NaN in the "classificacao" column. This could cause some issues, so think about setting a default value, like "Other" or "Unknown". To do that, just append .fillna("Other") after map(d).
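For example, the same mapping with a default:

df['classificacao'] = df['alimento'].map(d).fillna("Other")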

How to update values in a DataFrame based on values in another DataFrame?

Suppose I have the following DataFrames:
Containers:
Key ContainerCode Quantity
1 P-A1-2097-05-B01 0
2 P-A1-1073-13-B04 0
3 P-A1-2024-09-H05 0
5 P-A1-2018-08-C05 0
6 P-A1-2089-03-C08 0
7 P-A1-3033-16-H07 0
8 P-A1-3035-18-C02 0
9 P-A1-4008-09-G01 0
Inventory:
Key SKU ContainerCode Quantity
1 22-3-1 P-A1-4008-09-G01 1
2 2132-12 P-A1-3033-16-H07 55
3 222-12 P-A1-4008-09-G01 3
4 4561-3 P-A1-3083-12-H01 126
How do I update the Quantity values in Containers to reflect the number of units in each container based on the information in Inventory? Note that multiple SKUs can reside in a single ContainerCode, so we need to add to the quantity, rather than just replace it, and there may be multiple entries in Containers for a particular ContainerCode.
What are the possible ways to accomplish this, and what are their relative pros and cons?
EDIT
The following code seems to serve as a good test case:
import itertools
import pandas as pd
import numpy as np
inventory = pd.DataFrame({'Container Code': ['A1','A2','A2','A4'],
                          'Quantity': [10,87,2,44],
                          'SKU': ['123-456','234-567','345-678','456-567']})
containers = pd.DataFrame({'Container Code': ['A1','A2','A3','A4'],
                           'Quantity': [2,0,8,4],
                           'Path Order': [1,2,3,4]})
summedInventory = inventory.groupby('Container Code')['Quantity'].sum()
print('Containers Data Frame')
print(containers)
print('\nInventory Data Frame')
print(inventory)
print('\nSummed Inventory List')
print(summedInventory)
print('\n')
newContainers = containers.drop('Quantity', axis=1). \
    join(inventory.groupby('Container Code').sum(), on='Container Code')
print(newContainers)
This seems to produce the desired output.
I also tried using a regular merge:
pd.merge(containers.drop('Quantity', axis=1), \
         summedInventory, how='inner', left_on='Container Code', right_index=True)
But that produces an 'IndexError: list index out of range'
Any ideas?
I hope I got your scenario correctly. I think you can use:
containers.drop('Quantity', axis=1). \
    join(inventory.groupby('ContainerCode').sum(), \
         on='ContainerCode')
I'm first dropping Quantity from containers because you don't need it - we'll recreate it from inventory.
Then we group inventory by the container code, to sum the quantity for each container.
Finally we perform the join between the two, and each ContainerCode present in containers receives the summed quantity from inventory.
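On the merge attempt from the question: pd.merge has historically required DataFrames on both sides, so one sketch that sidesteps the error (an assumption about the cause; behavior varies across pandas versions) is to turn the summed Series back into a DataFrame first:

merged = pd.merge(containers.drop('Quantity', axis=1),
                  summedInventory.reset_index(),
                  on='Container Code', how='left')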

How to divide a dbf table into two or more dbf tables using Python

I have a dbf table. I want to automatically divide this table into two or more tables using Python. The main problem is that this table consists of several groups of rows, and each group is separated from the previous one by an empty row. So I need to save each group to a new dbf table. I think this could be solved using some function from the Arcpy package together with FOR and WHILE loops, but my brain can't work it out. My source dbf table is more complex, but I attach a simple example for better understanding. Sorry for my poor English.
Source dbf table:
ID NAME TEAM
1 A 1
2 B 2
3 C 1
4
5 D 2
6 E 3
I want get dbf1:
ID NAME TEAM
1 A 1
2 B 2
3 C 1
I want get dbf2:
ID NAME TEAM
1 D 2
2 E 3
Using my dbf package it could look something like this (untested):
import dbf

source_dbf = '/path/to/big/dbf_file.dbf'
base_name = '/path/to/smaller/dbf_%03d'

sdbf = dbf.Table(source_dbf)
i = 1
ddbf = sdbf.new(base_name % i)
sdbf.open()
ddbf.open()
for record in sdbf:
    if not record.name:  # assuming if 'name' is empty, all fields are empty
        ddbf.close()
        i += 1
        ddbf = sdbf.new(base_name % i)
        ddbf.open()  # open the new destination table before appending
        continue
    ddbf.append(record)
ddbf.close()
sdbf.close()
