pandas merge dataframes with same columns and reference its id - python

I have a problem with python pandas. I have serveral different dataframes which I want to split up into an SQLite Database. My first Dataframe to_country:
to_country = df[['country']]
to_country = to_country.rename(columns={'country': 'Name'})
to_country = to_country.drop_duplicates()
#Add Index to Country Dataframe
to_country.insert(0, 'Id', range(1, 1 + len(to_country)))
to_country = to_country.reindex(columns=['Id', 'Name'])
to_country = to_country.set_index('Id')
to_country.to_sql('Country', con=con, if_exists='append', index=True)
This part works fine.
Now I have another dataframe, to_state, which looks like this:
to_state = df[['state','country']]
to_state = to_state.rename(columns={'state': 'Name'})
to_state = to_state.drop_duplicates()
to_state.insert(0, 'Id', range(1, 1 + len(to_state)))
to_state = to_state.reindex(columns=['Id', 'Name', 'country'])
to_state = to_state.set_index('Id')
Now I want to replace the country "USA" with the Id from the previous dataframe. The result should look like this, where CountryId is the Id attribute from the dataframe to_country:

Id   Name   CountryId
1    CA     1

I tried the following statement, but it did not give me that result:
to_state = pd.merge(to_state, to_country, left_on='country', right_on="Name")
I really don't know how to solve this. What is even more irritating: I don't understand why the Id columns from both dataframes disappear.

As I don't have your example dataframe, test this:
import pandas as pd

to_country = pd.DataFrame({"id": [20, 30],
                           "country": ['USA', 'CHINA']})
to_state = pd.DataFrame({"id": [90, 80],
                         "state": ['CA', 'AB'],
                         "country": ['USA', 'CHINA']})
print(f'__________ORIGINAL DATAFRAMES__________\n##STATE##\n{to_state}\n\n###COUNTRY###\n{to_country}')

def func(line):
    # Look up the row in to_country that matches this state's country.
    # Note the mask must be built from to_country, not to_state.
    t = to_country.loc[to_country['country'] == line['country']]
    return t['id'].values[0]

print('\n_________FINAL DATAFRAME__________\n')
to_state['ID_NEW_country'] = to_state.apply(func, axis=1)
print(f'\n{to_state}')
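A vectorized alternative to the row-by-row apply above is to build a country-to-id lookup Series once and map it onto the state rows. This is a sketch using the same toy dataframes; it is usually faster on larger data:

```python
import pandas as pd

to_country = pd.DataFrame({"id": [20, 30],
                           "country": ['USA', 'CHINA']})
to_state = pd.DataFrame({"id": [90, 80],
                         "state": ['CA', 'AB'],
                         "country": ['USA', 'CHINA']})

# Build a country -> id lookup once, then map it onto every state row
country_ids = to_country.set_index('country')['id']
to_state['ID_NEW_country'] = to_state['country'].map(country_ids)
print(to_state)
```

map with a Series also leaves unmatched countries as NaN instead of raising, which makes missing foreign keys easy to spot.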

I solved it like this in the end:
#Add Countries to Database
to_country = df[['country']]
to_country = to_country.rename(columns={'country': 'Name'})
to_country = to_country.drop_duplicates()
#Add Index to Country Dataframe
to_country = to_country.reset_index()
to_country = to_country.rename(columns={"index":"ID"})
to_country['ID'] = to_country.index + 1
to_country.set_index('ID').to_sql('Country', con=con, if_exists='append', index=True)
#Add States to Database
to_state = df[['state','country']]
to_state = to_state.drop_duplicates()
#Add Index to State Dataframe
to_state = to_state.reset_index()
to_state = to_state.rename(columns={"index":"ID", 'state': 'Name'})
to_state['ID'] = to_state.index + 1
to_state = to_state.merge(to_country, how='left', left_on='country', right_on='Name').drop(['country', 'Name_y'], axis=1)
print(to_state)
to_state = to_state.rename(columns={'ID_x': 'ID', 'Name_x': 'Name', 'ID_y': 'Country_ID'})
print(to_state)
to_state.set_index('ID').to_sql('State', con=con, if_exists='append', index=True)

Extract data from array - Python

I am using the unleashed_py library to extract Unleashed data.
A sample of the output is below; a single invoice can contain several line items:
[{
    'OrderNumber': 'SO-00000742',
    'QuoteNumber': None,
    'InvoiceDate': '/Date(1658496322067)/',
    'InvoiceLines': [{'LineNumber': 1,
                      'LineType': None},
                     {'LineNumber': 2,
                      'LineType': None}],
    'Guid': '8f6b89da-1e6e-42288a24-902a-038041e04f06',
    'LastModifiedOn': '/Date(1658496322221)/'}]
I need to get a df:
If I run the script below, the invoice lines just get appended, and the common fields such as OrderNumber, QuoteNumber, InvoiceDate, Guid, and LastModifiedOn do not get repeated.
order_number = []
quote_number = []
invoice_date = []
invoicelines = []
invoice_line_number = []
invoice_line_type = []
guid = []
last_modified = []
for item in df:
    order_number.append(item.get('OrderNumber'))
    quote_number.append(item.get('QuoteNumber'))
    invoice_date.append(item.get('InvoiceDate'))
    guid.append(item.get('Guid'))
    last_modified.append(item.get('LastModifiedOn'))
    lines = item.get('InvoiceLines')
    for item_sub_2 in lines:
        invoice_line_number.append(item_sub_2.get('LineNumber'))
        invoice_line_type.append(item_sub_2.get('LineType'))
df_order_number = pd.DataFrame(order_number)
df_quote_number = pd.DataFrame(quote_number)
df_invoice_date = pd.DataFrame(invoice_date)
df_invoice_line_number = pd.DataFrame(invoice_line_number)
df_invoice_line_type = pd.DataFrame(invoice_line_type)
df_guid = pd.DataFrame(guid)
df_last_modified = pd.DataFrame(last_modified)
df_row = pd.concat([
    df_order_number,
    df_quote_number,
    df_invoice_date,
    df_invoice_line_number,
    df_invoice_line_type,
    df_guid,
    df_last_modified
], axis=1)
What am I doing wrong?
You don't need to iterate. Create the dataframe directly from the list of dictionaries you have, then explode the InvoiceLines column, apply pd.Series to it, and join the result back to the original dataframe:
data = [{
    'OrderNumber': 'SO-00000742',
    'QuoteNumber': None,
    'InvoiceDate': '/Date(1658496322067)/',
    'InvoiceLines': [{'LineNumber': 1,
                      'LineType': None},
                     {'LineNumber': 2,
                      'LineType': None}],
    'Guid': '8f6b89da-1e6e-42288a24-902a-038041e04f06',
    'LastModifiedOn': '/Date(1658496322221)/'}]
df = pd.DataFrame(data).explode('InvoiceLines')
out = pd.concat([df['InvoiceLines'].apply(pd.Series),
                 df.drop(columns=['InvoiceLines'])],
                axis=1)
OUTPUT:
#out
LineNumber LineType OrderNumber QuoteNumber InvoiceDate \
0 1.0 NaN SO-00000742 None /Date(1658496322067)/
0 2.0 NaN SO-00000742 None /Date(1658496322067)/
Guid LastModifiedOn
0 8f6b89da-1e6e-42288a24-902a-038041e04f06 /Date(1658496322221)/
0 8f6b89da-1e6e-42288a24-902a-038041e04f06 /Date(1658496322221)/
I'm leaving the date conversion and the column renames to you, since I believe you can do those yourself.
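Alternatively, pandas has a helper built for exactly this kind of nested record. A sketch using the same sample data, with pd.json_normalize and its record_path/meta parameters (the meta fields are repeated on every exploded row):

```python
import pandas as pd

data = [{'OrderNumber': 'SO-00000742',
         'QuoteNumber': None,
         'InvoiceDate': '/Date(1658496322067)/',
         'InvoiceLines': [{'LineNumber': 1, 'LineType': None},
                          {'LineNumber': 2, 'LineType': None}],
         'Guid': '8f6b89da-1e6e-42288a24-902a-038041e04f06',
         'LastModifiedOn': '/Date(1658496322221)/'}]

# One row per invoice line; the meta columns repeat on each row
out = pd.json_normalize(data,
                        record_path='InvoiceLines',
                        meta=['OrderNumber', 'QuoteNumber', 'InvoiceDate',
                              'Guid', 'LastModifiedOn'])
print(out)
```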

Pandas how to search one df for a certain date and return that data

I have two dataframes. For each row in the User.csv file I am trying to find the matching date in the Raven.csv file, then return the Price from df1 together with the date and amount from df2.
This works, but my Price comes back as a nested value like [[0.11465]]. Is there a way to remove these brackets, or a better way to do this altogether?
import pandas as pd

df1 = pd.read_csv('Raven.csv')
df2 = pd.read_csv('User.csv')
df1 = df1.reset_index(drop=False)
df1.columns = ['index', 'Date', 'Price']
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'], format="%Y-%m-%d %H:%M:%S").dt.date
df1['Date'] = pd.to_datetime(df1['Date'], format="%Y-%m-%d").dt.date
Looper = 0
Date = []
Price = []
amount = []
total_value = []
for x in df2['Timestamp']:
    search = df2['Timestamp'].values[Looper]
    Date.append(search)
    price = df1.loc[df1['Date'] == search, ['index']]
    value = df1['Price'].values[price]
    Price.append(value)
    payout = df2['Amount'].values[Looper]
    amount.append(payout)
    payout_value = value * payout
    total_value.append(payout_value)
    Looper = Looper + 1
dict = {'Date': Date, 'Price': Price, 'Payout': amount, "Total Value": total_value}
df = pd.DataFrame(dict)
df.to_csv('out.csv')
The value is a nested list, so you can index into it twice to get the scalar:
value = [[0.11465]][0][0]
print(value)
You get:
0.11465
I hope this is what you need.

Add new column to DataFrame with same default value

I would like to add a name column based on the 'lNames' list, but my code overwrites the whole column in the last iteration:
import pandas as pd

def consulta_bc(codigo_bcb):
    url = 'http://api.bcb.gov.br/dados/serie/bcdata.sgs.{}/dados?formato=json'.format(codigo_bcb)
    df = pd.read_json(url)
    df['data'] = pd.to_datetime(df['data'], dayfirst=True)
    df.set_index('data', inplace=True)
    return df

lCodigos = [12, 11, 1, 21619, 21623, 12466]
lNames = ['CDI', 'SELIC', 'USD', 'EUR', 'GPB', 'IMAB']
iter_len = len(lCodigos)
saida = pd.DataFrame()
for i in range(iter_len):
    saida = saida.append(consulta_bc(lCodigos[i]))
    saida['nome'] = lNames[i]
saida.to_csv('Indice', sep=';', index=True)
saida
Any help will be greatly appreciated.
Change the for loop like this, so the name is assigned to each chunk before it is appended (assigning after the append stamps the latest name onto every row accumulated so far):
for i in range(iter_len):
    df = consulta_bc(lCodigos[i])
    df['nome'] = lNames[i]
    saida = saida.append(df)
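Note that DataFrame.append was deprecated and later removed in pandas 2.0; an equivalent pd.concat version collects the chunks in a list and concatenates once. This sketch uses a dummy fetch function in place of the real consulta_bc so it runs offline:

```python
import pandas as pd

def consulta_bc(codigo_bcb):
    # Dummy stand-in for the real API call, just for illustration
    return pd.DataFrame({'valor': [codigo_bcb * 1.0, codigo_bcb * 2.0]})

lCodigos = [12, 11]
lNames = ['CDI', 'SELIC']

frames = []
for code, name in zip(lCodigos, lNames):
    df = consulta_bc(code)
    df['nome'] = name          # label each chunk before collecting it
    frames.append(df)
saida = pd.concat(frames)      # single concatenation at the end
print(saida)
```

Concatenating once at the end is also faster than repeated appends, which copy the accumulated frame on every iteration.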

Resampling and regrouping using pivot table

I'm trying to group all the values of the dataframe essaie['night_cons'] by day (and by year), but the result just gives me NaN.
colss = {'Date_Time': ['2017-11-10', '2017-11-11', '2017-11-12', '2017-11-13', '2017-11-14',
                       '2017-11-15', '2017-11-16', '2017-11-17', '2017-11-18', '2017-11-19'],
         'Night_Cons(+)': [4470.76, 25465.72, 25465.72, 25465.72, 21480.59,
                           20024.53, 19613.29, 28015.18, 28394.20, 29615.69]}
dataframe = pd.DataFrame(colss, columns=['Date_Time', 'Night_Cons(+)'])
#print(dataframe)
dataframe['Date_Time'] = pd.to_datetime(dataframe['Date_Time'], errors='coerce')
# Create new columns
dataframe['Day'] = dataframe['Date_Time'].dt.day
dataframe['Month'] = dataframe['Date_Time'].dt.month
dataframe['Year'] = dataframe['Date_Time'].dt.year
# Set index
#essaie = essaie.set_index('Date_Time')
dataframe = dataframe[['Night_Cons(+)', 'Day', 'Month', 'Year']]
#dataframe
#daily_data = pd.pivot_table(essaie, values="Night_Cons(+)", columns=["Month"], index="Day")
daily_data = pd.pivot_table(dataframe, values="Night_Cons(+)", columns=["Year"], index="Day")
daily_data = daily_data.reindex(index=['Montag', 'Dienstag', 'Mittwoch', 'Donnerstag', 'Freitag', 'Samstag', 'Sonntag'])
daily_data
DataFrame and results: please see the image below.
Sample:
colss = {'Date_Time': ['2017-11-10', '2017-11-11', '2017-11-12', '2017-11-13', '2017-11-14',
                       '2017-11-15', '2017-11-16', '2017-11-17', '2017-11-18', '2017-11-19'],
         'Night_Cons(+)': [4470.76, 25465.72, 25465.72, 25465.72, 21480.59,
                           20024.53, 19613.29, 28015.18, 28394.20, 29615.69]}
dataframe = pd.DataFrame(colss, columns=['Date_Time', 'Night_Cons(+)'])
First convert the Date_Time column to weekday numbers with Series.dt.dayofweek, then pivot, and finally rename the index values:
dataframe['Date_Time'] = pd.to_datetime(dataframe['Date_Time'], errors='coerce')
dataframe['Year'] = dataframe['Date_Time'].dt.year
dataframe['Date'] = dataframe['Date_Time'].dt.dayofweek
daily_data = dataframe.pivot_table(values="Night_Cons(+)",
                                   columns="Year",
                                   index="Date")
days = ['Montag', 'Dienstag', 'Mittwoch', 'Donnerstag', 'Freitag', 'Samstag', 'Sonntag']
daily_data = daily_data.rename(dict(enumerate(days)))
print(daily_data)
Year 2017
Date
Montag 25465.720
Dienstag 21480.590
Mittwoch 20024.530
Donnerstag 19613.290
Freitag 16242.970
Samstag 26929.960
Sonntag 27540.705
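If German weekday labels aren't required, dt.day_name() gives readable names directly and removes the manual enumerate/rename step. A sketch on the same sample data:

```python
import pandas as pd

colss = {'Date_Time': ['2017-11-10', '2017-11-11', '2017-11-12', '2017-11-13', '2017-11-14',
                       '2017-11-15', '2017-11-16', '2017-11-17', '2017-11-18', '2017-11-19'],
         'Night_Cons(+)': [4470.76, 25465.72, 25465.72, 25465.72, 21480.59,
                           20024.53, 19613.29, 28015.18, 28394.20, 29615.69]}
dataframe = pd.DataFrame(colss)
dataframe['Date_Time'] = pd.to_datetime(dataframe['Date_Time'])
dataframe['Year'] = dataframe['Date_Time'].dt.year

# English weekday names straight from the datetime accessor
dataframe['Date'] = dataframe['Date_Time'].dt.day_name()
daily_data = dataframe.pivot_table(values="Night_Cons(+)", columns="Year", index="Date")
print(daily_data)
```

One caveat: the index then sorts alphabetically, so a reindex with the desired weekday order is still needed if the Monday-to-Sunday ordering matters.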

How to Fix 'Level None not found' Error in pandas?

I am facing the following error when running the code below:
Level None not found
pt = df.pivot_table(index='User Name', values=['Threat Score', 'Score'],
                    aggfunc={
                        'Threat Score': np.mean,
                        'Score': [np.mean, lambda x: len(x.dropna())]
                    },
                    margins=True)
pt = pt.sort_values('Score', ascending=False)
I want to take the average of Threat Score and Score plus a count per user name, then sort by Threat Score from high to low.
It's a bug in pandas (there is a GitHub issue for it). The error occurs when you combine multiple aggregations per column with margins=True; it does not occur with margins=False. You can compute the margins yourself afterwards if you need them. This should work:
pt = df.pivot_table(index='User Name', values=['Threat Score', 'Score'],
                    aggfunc={
                        'Threat Score': np.mean,
                        'Score': [np.mean, lambda x: len(x.dropna())]
                    },
                    margins=False)
pt = pt.sort_values('Score', ascending=False)
Let me know if this variant works for you as well:
pt = df.pivot_table(index='User Agent', values=['Threat Score', 'Score', 'Source IP'],
                    aggfunc={'Source IP': 'count',
                             'Threat Score': np.mean,
                             'Score': np.mean})
pt = pt.sort_values('Threat Score', ascending=False)
new_cols = ['Avg_Score', 'Count', 'Avg_ThreatScore']
pt.columns = new_cols
pt.to_csv(Path3 + '\\AllUserAgent.csv')
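If the overall totals are still wanted after switching to margins=False, one way to add them back is to append an 'All' row manually. This is only a sketch with made-up data, since the original df is not shown in the question:

```python
import pandas as pd

# Made-up stand-in for the original df
df = pd.DataFrame({'User Name': ['a', 'a', 'b'],
                   'Threat Score': [10.0, 20.0, 30.0],
                   'Score': [1.0, 2.0, 3.0]})

pt = df.pivot_table(index='User Name',
                    values=['Threat Score', 'Score'],
                    aggfunc='mean',
                    margins=False)

# Append the overall means as an 'All' row, like margins=True would produce
pt.loc['All'] = df[['Score', 'Threat Score']].mean()
print(pt)
```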