Assign several DataFrame columns to match SQL table - python

I have several DataFrames that need to be turned into SQL tables. The SQL tables all share one schema, but the DataFrames do not. I need to be able to easily match/rename the df columns to the SQL table columns. Everything I have seen on here manipulates 1 or 2 fields using df.to_sql; I need to be able to manipulate at least 10 fields as easily as I do with lists. Below are example tables:
list1
+-------+-------+-------+-------+
| name |hobby1 |hobby2 |hobby3 |
+-------+-------+-------+-------+
| kris | ball | swim | dance |
| james | eat | sing | sleep |
| amy | swim | eat | watch |
+-------+-------+-------+-------+
df2
+---------+-----------+-----------+-----------+
| df2name | df2hobby1 | df2hobby2 | df2hobby3 |
+---------+-----------+-----------+-----------+
| kris    | ball      | swim      | dance     |
| james   | eat       | sing      | sleep     |
| amy     | swim      | eat       | watch     |
+---------+-----------+-----------+-----------+
sql1
+----------+------------+------------+------------+
| sql_name | sql_hobby1 | sql_hobby2 | sql_hobby3 |
+----------+------------+------------+------------+
| kris     | ball       | swim       | dance      |
| james    | eat        | sing       | sleep      |
| amy      | swim       | eat        | watch      |
+----------+------------+------------+------------+
Sometimes I receive the data as a Python dict; I can easily transfer it using a kwargs function, and that works great. My function is below:
def transfer_dict(**kwargs):
    transfer = {'sqlname': ' ',
                'sqlhobby1': ' ',
                'sqlhobby2': ' ',
                'sqlhobby3': ' '}
    transfer.update(kwargs)
    return transfer
I transfer easily by doing:
new_list.append(transfer_dict(sqlname=name, sqlhobby1=hobby1, sqlhobby2=hobby2, sqlhobby3=hobby3))
Can I use my same kwargs transfer function to apply on DataFrame transfers to SQL? Or is there a better way?

The pandas.DataFrame.rename() method will accept a dict-like set of column names and names to rename them with. In many cases, the fastest solution to the problem you are describing (if I'm understanding you correctly) is to use a combination of rename() and drop() to change the source DataFrame so that it matches the SQL target, and then use to_sql() as you have described doing (but now, critically, all the column names match their intended targets). For example:
sql_mappings = {'df2name': 'sql_name', 'df2hobby1': 'sql_hobby1', 'df2hobby2': 'sql_hobby2', 'df2hobby3': 'sql_hobby3'}
sql_columns = list(sql_mappings.values())
df2 = df2.rename(columns=sql_mappings)
df2 = df2.drop(columns=[col for col in df2 if col not in sql_columns])
If you want to set things like the sql table name and execute to_sql dynamically, I can imagine a fairly straightforward wrapper function that does both tasks using this approach.
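For what it's worth, a rough sketch of such a wrapper might look like the following; push_to_sql, the engine argument, and if_exists='append' are illustrative choices rather than anything from the question:
def push_to_sql(df, mapping, table_name, engine):
    # Rename the source columns to the shared SQL schema, keep only those columns,
    # then write the frame out with to_sql.
    df = df.rename(columns=mapping)
    df = df[list(mapping.values())]
    df.to_sql(table_name, engine, if_exists='append', index=False)

# e.g. push_to_sql(df2, sql_mappings, 'sql1', engine)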

Related

How to create a table from another table with GridDB?

I have a GridDB container where I have stored my database. I want to copy the table but this would exclude a few columns. The function I need should extract all columns matching a given keyword and then create a new table from that. It must always include the first column *id because it is needed on every table.
For example, in the table given below:
'''
-- | employee_id | department_id | employee_first_name | employee_last_name | employee_gender |
-- |-------------|---------------|---------------------|---------------------|-----------------|
-- | 1 | 1 | John | Matthew | M |
-- | 2 | 1 | Alexandra | Philips | F |
-- | 3 | 2 | Hen | Lotte | M |
'''
Suppose I need to get the first column and every other column starting with "employee". How can I do this through a Python function?
I am using GridDB Python client on my Ubuntu machine and I have already stored the database.csv file in the container. Thanks in advance for your help!
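For the column-selection part alone, a minimal pandas sketch could look like the one below. It assumes the table has already been read into a DataFrame and deliberately leaves the GridDB container/query API out of the picture; subset_by_keyword and id_col are illustrative names:
import pandas as pd

def subset_by_keyword(df, keyword, id_col='employee_id'):
    # Always keep the id column, plus every other column whose name starts with the keyword.
    cols = [id_col] + [c for c in df.columns if c.startswith(keyword) and c != id_col]
    return df[cols]

# subset_by_keyword(employees_df, 'employee') keeps employee_id plus all employee_* columns.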

split JSON/list of dictionaries in the column in dataframe to new rows in python

I am quite new to Python. I tried to find an answer but nothing I tried seems to work, and most of the answers I found assume the whole dataset is already in JSON format.
Through pyodbc I use the following code to retrieve data:
formula = """select id, type, custbody_attachment_1 from transaction """
lineitem = pd.read_sql_query(formula, cnxn)
It gives me something like the following
+-------------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Internal_ID | Type | Formula_Text |
+-------------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 2895531 | Bill | |
+-------------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 3492009 | Bill | [{"FL":"https://.app.netsuite.com/core/media/media.nl?id=someLinkToTheFile0","NM":"someFileName0"}] |
+-------------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 3529162 | Bill | [{"FL":"5https://.app.netsuite.com/core/media/media.nl?id=someLinkToTheFile1","NM":"someFileName1"},{"FL":"https://.app.netsuite.com/core/media/media.nl?id=someLinkToTheFile2","NM":"someFileName2"}] |
+-------------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I need the output like this. (There might be more than 2 links in the cell.)
+-------------+------+---------------------------------------------------------------------+---------------+
| Internal_ID | Type | FL | NM |
+-------------+------+---------------------------------------------------------------------+---------------+
| 2895531 | Bill | | |
+-------------+------+---------------------------------------------------------------------+---------------+
| 3492009 | Bill | https://.app.netsuite.com/core/media/media.nl?id=someLinkToTheFile0 | someFileName0 |
+-------------+------+---------------------------------------------------------------------+---------------+
| 3529162 | Bill | https://.app.netsuite.com/core/media/media.nl?id=someLinkToTheFile1 | someFileName1 |
+-------------+------+---------------------------------------------------------------------+---------------+
| 3529162 | Bill | https://.app.netsuite.com/core/media/media.nl?id=someLinkToTheFile2 | someFileName2 |
+-------------+------+---------------------------------------------------------------------+---------------+
I tried to play with JSON but there was one problem after another (because it looked like JSON data to me). In the end I ran
print(lineitem['custbody_attachment_1'])
and got the following in Python console
999 [{"FL":"https://4811553.app.netsuite.com/core/...
Name: custbody_attachment_1, Length: 1000, dtype: object
So, I have no idea how to transform this so that I can create new rows.
df = df.explode('Formula_Text')  # one row per dict in the list
df = pd.concat([df.drop(['Formula_Text'], axis=1),
                df['Formula_Text'].apply(pd.Series)], axis=1)  # expand each dict into FL/NM columns
print(df)
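Note that explode() only helps once the column actually holds Python lists; if, as the pyodbc output suggests, custbody_attachment_1 comes back as raw JSON strings, a parsing step is needed first. A rough, untested sketch using the original column names (and assuming pandas is imported as pd, as in the question):
import json

# Turn each JSON string into a list of dicts; empty cells become a single empty
# dict so they survive as blank rows after the explode.
lineitem['custbody_attachment_1'] = lineitem['custbody_attachment_1'].apply(
    lambda s: json.loads(s) if isinstance(s, str) and s.strip() else [{}]
)

exploded = lineitem.explode('custbody_attachment_1')       # one row per attachment dict
result = pd.concat(
    [exploded.drop(columns=['custbody_attachment_1']),
     exploded['custbody_attachment_1'].apply(pd.Series)],  # expand dicts into FL/NM columns
    axis=1,
)
print(result)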

Efficient way to write Pandas groupby codes by eliminating repetition

I have a DataFrame as below.
df = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'City': ['C 1', 'C 1', 'C 1', 'B 2', 'B 2', 'B 2', 'C 1', 'C 1', 'C 1'],
    'Date': ['7/1/2020', '7/2/2020', '7/3/2020', '7/1/2020', '7/2/2020', '7/3/2020', '7/1/2020', '7/2/2020', '7/3/2020'],
    'Value': [46, 90, 23, 84, 89, 98, 31, 84, 41]
})
I need to create two averages:
Firstly, one with both Country and City as the grouping criteria
Secondly, an average for the Country only
To achieve this, we can easily write the code below:
df.groupby(['Country','City']).agg('mean')
+---------+------+-------+
| Country | City | Value |
+---------+------+-------+
| A | B 2 | 90.33 |
| +------+-------+
| | C 1 | 53 |
+---------+------+-------+
| B | C 1 | 52 |
+---------+------+-------+
df.groupby(['Country']).agg('mean')
+---------+-------+
| Country | Value |
+---------+-------+
| A | 71.67 |
+---------+-------+
| B | 52 |
+---------+-------+
The only change between the two snippets above is the groupby criterion City; apart from that everything is the same, so there is a clear repetition/duplication of code (especially when it comes to complex scenarios).
Now my question is: is there any way we could write one piece of code that covers both scenarios at once? DRY - Don't Repeat Yourself.
What I have in mind is something like below.
Choice = 'City'  # <<-- Here I type either 'City' or None based on the requirement. E.g. if None, the code below should ignore that criterion.
df.groupby(['Country', Choice]).agg('mean')
Is this possible? Or what is the best way to write the above code efficiently without repetition?
I am not sure exactly what you want to accomplish, but... why not just use an if?
columns = ['Country']
if Choice:
    columns.append(Choice)
df.groupby(columns).agg('mean')
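One way to package that up is a small helper that builds the grouping keys and aggregates in one place. This is just a rough sketch; the name grouped_mean and the hard-coded 'Country'/'Value' columns are illustrative, not from the question:
def grouped_mean(df, extra_key=None):
    # extra_key is an optional second grouping column, e.g. 'City'; None means Country only.
    keys = ['Country'] + ([extra_key] if extra_key else [])
    return df.groupby(keys)['Value'].mean()

grouped_mean(df, 'City')   # average per Country and City
grouped_mean(df)           # average per Country only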

Filter SQL elements with adjacent ID

I don't really know how to properly state this question in the title.
Suppose I have a table Word like the following:
| id | text |
| --- | --- |
| 0 | Hello |
| 1 | Adam |
| 2 | Hello |
| 3 | Max |
| 4 | foo |
| 5 | bar |
Is it possible to query this table based on text and receive the objects whose primary key (id) is exactly one off?
So, if I do
Word.objects.filter(text='Hello')
I get a QuerySet containing the rows
| id | text |
| --- | --- |
| 0 | Hello |
| 2 | Hello |
but I want the rows
| id | text |
| --- | --- |
| 1 | Adam |
| 3 | Max |
I guess I could do
word_ids = Word.objects.filter(text='Hello').values_list('id', flat=True)
word_ids = [w_id + 1 for w_id in word_ids] # or use a numpy array for this
Word.objects.filter(id__in=word_ids)
but that doesn't seem overly efficient. Is there a straight SQL way to do this in one call? Preferably directly using Django's QuerySets?
EDIT: The idea is that in fact I want to filter those words that are in the second QuerySet. Something like:
Word.objects.filter(text__of__previous__word='Hello', text='Max')
In plain Postgres you could use the lag window function (https://www.postgresql.org/docs/current/static/functions-window.html)
SELECT
    id,
    name
FROM (
    SELECT
        *,
        lag(name) OVER (ORDER BY id) AS prev_name
    FROM test
) s
WHERE prev_name = 'Hello'
The lag function adds a column with the text of the previous row. So you can filter by this text in a subquery.
demo:db<>fiddle
I am not really into Django, but according to the documentation, window function support was added in version 2.0.
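For reference, a rough ORM sketch of the same lag idea; this is untested and assumes a recent Django (window expressions need 2.0+, and filtering on a window annotation in SQL is only allowed in newer releases, so this version filters in Python instead):
from django.db.models import F, Window
from django.db.models.functions import Lag

# Annotate every word with the text of the previous row, ordered by id.
words_with_prev = Word.objects.annotate(
    prev_text=Window(expression=Lag('text'), order_by=F('id').asc())
)

# Filter in Python to stay compatible with versions that reject filtering on
# window annotations; on Django 4.2+ a .filter(prev_text='Hello') should work.
after_hello = [w for w in words_with_prev if w.prev_text == 'Hello']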
If by "1 off" you mean that the difference is exactly 1, then you can do:
select w.*
from words w
where w.id in (select w2.id + 1 from words w2 where w2.text = 'Hello');
lag() is also a very reasonable solution. This seems like a direct interpretation of your question. If you have gaps (and the intention is + 1), then lag() is a bit trickier.
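Translating the id + 1 subquery back into Django's ORM, one possible (untested) sketch uses a correlated Exists subquery; it assumes the ids really are consecutive and that you are on Django 3.0+, where Exists() can be passed straight to filter():
from django.db.models import Exists, OuterRef

# Keep words whose predecessor row (id - 1) has text 'Hello'.
prev_is_hello = Word.objects.filter(id=OuterRef('id') - 1, text='Hello')
after_hello = Word.objects.filter(Exists(prev_is_hello))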

How to efficiently generate a special co-author network in python pandas?

I'm trying to generate a network graph of individual authors given a table of articles. The table I start with is of articles with a single column for the "lead author" and a single column for "co-author". Since each article can have up to 5 authors, article rows may repeat as such:
| paper_ID | project_name | lead_id | co_lead_id | published |
|----------+--------------+---------+------------+-----------|
| 1234 | "fubar" | 999 | 555 | yes |
| 1234 | "fubar" | 999 | 234 | yes |
| 1234 | "fubar" | 999 | 115 | yes |
| 2513 | "fubar2" | 765 | 369 | no |
| 2513 | "fubar2" | 765 | 372 | no |
| 5198 | "fubar3" | 369 | 325 | yes |
My end goal is to have a nodes table, where each row is a unique author, and an edge table, where each row contains source and target author_id columns. The edges table is trivial, as I can merely create a dataframe using the requisite columns of the article table.
For example, for the above table I would have the following node table:
| author_id | is_published |
|-----------+--------------|
| 999 | yes |
| 555 | yes |
| 234 | yes |
| 115 | yes |
| 765 | no |
| 369 | yes |
| 372 | no |
| 325 | yes |
Notice how the "is_published" shows if the author was ever a lead or co-author on at least one published paper. This is where I'm running into trouble creating a nodes table efficiently. Currently I iterate through every row in the article table and run checks on if an author exists yet in the nodes table and whether to turn on the "is_published" flag. See the following code snippet as an example:
import pandas as pd

articles = pd.read_excel('excel_file_with_articles_table')
nodes = pd.DataFrame(columns=['is_published'])
nodes.index.name = 'author_id'
for row in articles.itertuples():
    if row.lead_id not in nodes.index:
        # add the lead author with the published flag initially off
        author = pd.Series([False], index=['is_published'], name=row.lead_id)
        nodes = nodes.append(author)
    if row.co_lead_id not in nodes.index:
        # add the co-author with the published flag initially off
        investigator = pd.Series([False], index=['is_published'], name=row.co_lead_id)
        nodes = nodes.append(investigator)
    if row.published == 'yes':
        nodes.at[row.lead_id, 'is_published'] = True
        nodes.at[row.co_lead_id, 'is_published'] = True
For my data set (with tens of thousands of rows), this is somewhat slow, and I understand that loops should be avoided when possible when using pandas dataframes. I feel like the pandas apply function may be able to do what I need, but I'm at a loss as to how to implement it.
With df as your first DataFrame, you should be able to:
nodes = pd.concat([
    df.loc[:, ['lead_id', 'is_published']].rename(columns={'lead_id': 'author_id'}),
    df.loc[:, ['co_lead_id', 'is_published']].rename(columns={'co_lead_id': 'author_id'}),
]).drop_duplicates()
for a unique list of author_id and co_author_id with their respective is_published information.
To keep only the is_published=True row when an author appears with both a True and a False entry:
nodes = nodes.sort_values('is_published', ascending=False).drop_duplicates(subset=['author_id'])
.sort_values() will sort True (==1) before False, and .drop_duplicates() by default keeps the first occurrence (see docs). With this addition I guess you don't really need the first .drop_duplicates() anymore.
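An alternative (untested) sketch skips the sort/drop-duplicates trick and lets groupby().any() decide the flag directly; it assumes the source frame still carries the original yes/no published column from the question rather than a boolean is_published:
# Stack lead and co-lead ids into one long column, carry a boolean publish flag,
# and mark an author as published if any of their papers is published.
published = df['published'].eq('yes')
stacked = pd.concat([
    pd.DataFrame({'author_id': df['lead_id'], 'is_published': published}),
    pd.DataFrame({'author_id': df['co_lead_id'], 'is_published': published}),
])
nodes = stacked.groupby('author_id', as_index=False)['is_published'].any()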
