I have been working with PySimpleGUI, trying to update an initially empty table. In figure 1 you can see how the initial table at the bottom is empty and has three columns.
Figure 1: initial window with the empty three-column table at the bottom.
Here I use a script where you input your databases and click -GENERATE- to apply some statistics, present them as an image on the right side, and show a table with the statistical data at the bottom.
Here is the script (irrelevant parts removed):
import numpy as np
import pandas as pd
import PySimpleGUI as sg

# TABLE DATA
data = {
    ("Count", "-", "-"),
    ("Average", "-", "-"),
    ("Median", "-", "-"),
    ("Min", "-", "-"),
    ("Max", "-", "-"),
    ("StdDev", "-", "-"),
    ("Q1", "-", "-"),
    ("Q3", "-", "-"),
}
headings = ["STAT", "DATABASE A", "DATABASE B"]  # list
# Table generation:
list_variables = ["Count", "Average", "Median", "Min", "Max", "StdDev", "Q1", "Q3"]
dicts = {}
final = {}  # statistics per database, filled by tablegen()
def tablegen(imp_dict):  # enter dictionary from -FOLDERS-
    for k in imp_dict.items():
        del k[1]["survey"]
        v = [k[1].shape[0], np.average(k[1].iloc[:, 0]), np.median(k[1].iloc[:, 0]),
             min(k[1].iloc[:, 0]), max(k[1].iloc[:, 0]), np.std(k[1].iloc[:, 0]),
             np.quantile(k[1].iloc[:, 0], 0.25), np.quantile(k[1].iloc[:, 0], 0.75)]
        final[k[0]] = v
# LAYOUT
layout = [
    [sg.Button('GENERATE'), sg.Button('REMOVE')],
    [sg.Text('Generated table:')],
    [sg.Table(values=data, headings=headings, max_col_width=25,
              auto_size_columns=True,
              display_row_numbers=False,
              justification='center',
              num_rows=5,
              alternating_row_color='lightblue',
              key='-TABLE-',
              selected_row_colors='red on yellow',
              enable_events=True,
              expand_x=False,
              expand_y=True,
              vertical_scroll_only=True,
              tooltip='This is a table')]
]
window = sg.Window('Tool', layout)
# ------ Event loop ------
while True:
    event, values = window.read()
    if event == 'GENERATE':  # problems
        selection(file_list)  # some functions; this generates a csv file called "minimum"
        # minimum.csv is the file produced after clicking -GENERATE-; it is used to build the desired table.
        file_loc2 = (real_path + "/minimum.csv")
        try:
            archive = pd.read_csv(file_loc2, sep=",")
            df_names = pd.unique(archive["survey"])  # column names
            for name in df_names:  # iterate over the initial archive
                dicts[name] = pd.DataFrame(data=(archive.loc[archive["survey"] == name]),
                                           columns=("Wavelength_(nm)", "survey"))
            tablegen(dicts)  # this generates the statistical values for the table
            final_df = pd.DataFrame(data=final, index=list_variables, columns=df_names)
            final_df = final_df.round(decimals=1)
            final_lists = final_df.values.tolist()
            # I tried a DataFrame (final_df), which produced the table in figure 2, and a list of
            # lists (final_lists, as recommended on the PySimpleGUI page), which produced figure 3.
            window["-TABLE-"].update(final_df)  # or .update(final_lists)
        except Exception as E:
            print(f'** Error {E} **')
            pass  # if something weird happened making the full filename
window.close()
The issue is this:
The second and third figures show how the script uses the information from the folders (databases) selected in the left panel, generates the image on the right, and is supposed to present the DataFrame shown below.
GOAL TABLE TO PRESENT:
final_df:
13 MAR 2018 14 FEB 2018 16 FEB 2018 17 FEB 2018
Count 84.0 25.0 31.0 31.0
Average 2201.5 2202.1 2203.1 2202.9
Median 2201.0 2202.0 2204.0 2204.0
Min 2194.0 2197.0 2198.0 2198.0
Max 2208.0 2207.0 2209.0 2209.0
StdDev 4.0 3.0 3.5 3.5
Q1 2198.0 2199.0 2199.5 2199.5
Q3 2205.0 2205.0 2206.0 2206.0
Figure 2: this is using a DataFrame as input in the -GENERATE- loop.
Figure 3: this is using a list of lists as input in the -GENERATE- loop.
As you can see, "-TABLE-" is not working the way I intend. If I use a DataFrame it only picks up the column names, and if I use a list of lists it ignores the index and column names from the intended goal table.
Also, in both cases the table does not gain more columns, even though there should be 5 including the index. And how can I change the column names from the ones initially provided?
I cannot find anything to solve this in the PySimpleGUI demos and call reference, and I have also searched the web and older StackOverflow posts, but honestly I cannot find a similar case.
I'd be really grateful if somebody can help me find what I am doing wrong!
Btw, sorry for my English; Colombian here.
Does the number of columns grow with each record? Most of the time it is the number of rows of a Table element that grows per record. Maybe you should take ['Date'] + list_variables as the headings of the Table element, and add each folder (database) as one new row of the Table element.
import pandas as pd
import PySimpleGUI as sg
file = r'd:/Qtable.csv'
"""
Date,Count,Average,Median,Min,Max,StdDev,Q1,Q3
13-Mar-18,84,2201.5,2201,2194,2208,4,2198,2205
14-Mar-18,25,2202.1,2202,2197,2207,3,2199,2205
16-Mar-18,31,2203.1,2204,2198,2209,3.5,2199.5,2206
17-Mar-18,31,2202.9,2204,2198,2209,3.5,2199.5,2206
"""
df = pd.read_csv(file, header=None)
values = df.values.tolist()
headings = values[0]
data = values[1:]
layout = [[sg.Table(data, headings=headings, key='-TABLE-')]]
sg.Window('Table', layout).read(close=True)
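If the statistics should stay as rows like in the goal table, another rough sketch (building on the question's script, with final_df as computed in the GENERATE branch) is to rebuild the window, since as far as I can tell Table.update() accepts new values but not new headings:
# Rough sketch, assuming final_df and sg from the question's script; the Table
# headings are fixed at creation, so the window is recreated with the real names.
new_headings = ["STAT"] + list(final_df.columns)
new_rows = [[stat] + row for stat, row in zip(final_df.index.tolist(),
                                              final_df.values.tolist())]
window.close()
layout2 = [
    [sg.Button('GENERATE'), sg.Button('REMOVE')],
    [sg.Text('Generated table:')],
    [sg.Table(values=new_rows, headings=new_headings, key='-TABLE-',
              justification='center', num_rows=8)],
]
window = sg.Window('Tool', layout2)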
I'm trying to create a web scraping tool that will update a DataFrame with data from multiple tables.
The page I'm working on has a base table in which every row has a link that directs you to a new URL with a secondary table holding the data I'm looking for.
My objective is to create a single DataFrame containing all the data present on all secondary tables of the site.
The problem is that every secondary table can have a different set of columns from the previous one, depending on whether that secondary table has a value for that specific column or not, and I cannot know all the possible column types.
I tried multiple solutions. What I'm working on at the moment is a for loop that creates a new DataFrame from each new table and merges it with the previous one.
But I'm stuck on trying to merge the two DataFrames on all the columns they have in common.
Please forgive me if I made amateur mistakes; I've only been using Python for a week.
#create the main DataFrame
link1 = links[0]
url_linked = url_l + link1
page_linked = requests.get(url_linked)
soup_linked = BeautifulSoup(page_linked.text, 'lxml')
table_linked = soup_linked.find('table', class_="XXXXX")
headers_link = []
headers_unique = []
for i in table_linked.find_all('th'):
    title_link = i.text
    title_link = map(str, title_link)
    headers_link.append(title_link)
headers_unique = headers_link
mydata_link = pd.DataFrame(columns=headers_link)
count = 1
for link in links:
    url_linked = url_l + link
    page_linked = requests.get(url_linked)
    soup_linked = BeautifulSoup(page_linked.text, 'lxml')
    table_linked = soup_linked.find('table', class_="table table-directory-responsive")
    row2 = []
    n_columns = len(table_linked.find_all('th'))
    #populating the main dataframe
    if count == 1:
        for j in table_linked.find_all('tr'):
            row_data = j.find_all('td')
            row = [i.text for i in row_data]
            row2.append(row)
        lenght_link = len(mydata_link)
        row2.remove([''])  #To get rid of empty rows that have no th
        mydata_link.loc[lenght_link] = row2
        print(mydata_link)
        print('Completed link ' + str(count))
        count = count + 1
    #creating the secondary Dataframe
    else:
        headers_test = []
        for i in table_linked.find_all('th'):
            title_test = i.text
            title_test = map(str, title_test)
            headers_test.append(title_test)
        mydata_temp = pd.DataFrame(columns=headers_test)
        for j in table_linked.find_all('tr'):
            row_data = j.find_all('td')
            row = [i.text for i in row_data]
            row2.append(row)
        lenght_link = len(mydata_link)
        row2.remove([''])  #To get rid of empty rows that have no th
        mydata_temp.loc[lenght_link] = row2
        print(mydata_temp)
        #merge the two DataFrames based on the unique set of columns they both have
        headers_unique = set(headers_unique).intersection(headers_test)
        mydata_link = mydata_link.merge(mydata_temp, on=[headers_unique], how='outer')
        print(mydata_link)
        print('Completed link ' + str(count))
        count = count + 1
What I need is basically a function that, given these sample DataFrames:
A  B  C
1  2  3

C  A  D  E
4  5  6  7
Will return the following DataFrame:
A    B    C  D    E
1    2    3  NaN  NaN
5    NaN  4  6    7
Just concatenating all the secondary tables should do - build a list of all the secondary DataFrames, and then pd.concat(dfList).
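For example, with the two sample frames from the question, pd.concat lines up the shared columns and fills the missing ones with NaN:
import pandas as pd

df1 = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]})
df2 = pd.DataFrame({'C': [4], 'A': [5], 'D': [6], 'E': [7]})

print(pd.concat([df1, df2], ignore_index=True))
#    A    B  C    D    E
# 0  1  2.0  3  NaN  NaN
# 1  5  NaN  4  6.0  7.0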
Btw, have you considered just using .read_html instead of looping through the cells?
import pandas as pd
import requests
from bs4 import BeautifulSoup

# create the main DataFrame (links and url_l are assumed to be defined as in the question)
link1 = links[0]
url_linked = url_l + link1
page_linked = requests.get(url_linked)
soup_linked = BeautifulSoup(page_linked.text, 'lxml')
table_linked = soup_linked.find('table', class_="XXXXX")
if table_linked:
    primaryDf = pd.read_html(table_linked.prettify())[0]
    headers_link = [h.get_text(' ').strip() for h in table_linked.find_all('th')]
    dfList = [pd.DataFrame(columns=headers_link if headers_link else primaryDf.columns)]
else:
    primaryDf, dfList = None, []

count = 0
for link in links:
    count += 1
    url_linked = url_l + link
    page_linked = requests.get(url_linked)
    soup_linked = BeautifulSoup(page_linked.text, 'lxml')
    table_linked = soup_linked.find('table', class_="table table-directory-responsive")
    if not table_linked:
        ## to see if any response errors or redirects
        print(f'[{page_linked.status_code} {page_linked.reason} from {page_linked.url}]')
        ## print error message and move to next link
        print(f'Found no tables with required class at link#{count}', url_linked)
        continue
    tempDf = pd.read_html(table_linked.prettify())[0]  ## read table as df [if found]
    ## get rid of empty rows and empty columns
    tempDf = tempDf.dropna(axis='rows', how='all').dropna(axis='columns', how='all')
    dfList.append(tempDf.loc[:])  ## .loc[:] to append a copy, not original (just in case)
    print(f'Completed link#{count} with {len(tempDf)} rows from {url_linked}')

combinedDF = pd.concat(dfList)
I want to print data from an Excel document that I imported. Each row comprises a Description and an Urgence level. What I want is to print in red each row that has the "Urgent" statement in it; I made a function (red_text) that works for that.
I can print all the rows in red, but I can't find how to only print those with the mention "Urgent" in their Urgence column. Here is my code:
Import of the file
import pandas as pd
from dash import Dash, html, dcc, dash_table

# Importing the file
file = r"test.xlsx"
try:
    df = pd.read_excel(file)
except OSError:
    print("Impossible to read :", file)

# Function to add red color to a text
def red_text(textarea):
    return html.P(
        [
            html.Span(textarea, style={"color": "red"}),
        ]
    )
Iterating through each row to put it into a test[] list and then applying a style to each row
# Loop for each row: if there is an urgent statement, put the row in red
test = []
for index, row in df.iterrows():
    x = row['Description'] + ' | ' + row['Urgence']
    test.append(x)
    # **HERE** -> Statement to apply a color with the function red_text
    if row['Urgence'] == "Urgent":
        red_text(test)
This last statement prints the full table in red, but I only want to apply the red_text function to the rows with the "Urgent" mention in them (from the "Urgence" column).
Edit: the Excel file is a basic two-column file:
Thank you
Given that I can't verify the output because of the lack of a reproducible example, I think you want to do something like:
df = pd.DataFrame({'Description': ['urgent stuff', 'asdasd', 'other urgent'],
                   'Urgence': ['Urgent', 'sadasd', 'Urgent']})
print(df)
urgent_stuff = df.loc[df['Urgence'] == "Urgent"]
print('------------')
print(urgent_stuff)
print('++++++++++++')
for row in urgent_stuff.iterrows():
    red_html = red_text(row)  # I am not sure what textarea is supposed to be, nor what html.P is
    print(red_html)
the output is:
Description Urgence
0 urgent stuff Urgent
1 asdasd sadasd
2 other urgent Urgent
------------
Description Urgence
0 urgent stuff Urgent
2 other urgent Urgent
++++++++++++
NameError: name 'html' is not defined
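If the goal is to render only the "Urgent" rows in red on the Dash page, a minimal self-contained sketch (mirroring the question's red_text and a small sample df) could look like this:
import pandas as pd
from dash import Dash, html

def red_text(textarea):
    return html.P([html.Span(textarea, style={"color": "red"})])

df = pd.DataFrame({'Description': ['urgent stuff', 'asdasd', 'other urgent'],
                   'Urgence': ['Urgent', 'sadasd', 'Urgent']})

app = Dash(__name__)
rows = []
for _, row in df.iterrows():
    text = row['Description'] + ' | ' + row['Urgence']
    if row['Urgence'] == "Urgent":
        rows.append(red_text(text))   # red paragraph for urgent rows
    else:
        rows.append(html.P(text))     # plain paragraph otherwise
app.layout = html.Div(rows)
# app.run_server(debug=True)  # uncomment to serve the page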
I'm having trouble splitting a DataFrame column on '_' and creating new columns from it.
The original string (stored in the section column):
AMAT_0000006951_10Q_20200726_Item1A_excerpt.txt
My current code:
df = pd.DataFrame(myList,columns=['section','text'])
#df['text'] = df['text'].str.replace('•','')
df['section'] = df['section'].str.replace('Item1A', 'Filing Section: Risk Factors')
df['section'] = df['section'].str.replace('Item2_', 'Filing Section: Management Discussion and Analysis')
df['section'] = df['section'].str.replace('excerpt.txt', '').str.replace(r'\d{10}_|\d{8}_', '')
df.to_csv("./SECParse.csv", encoding='utf-8-sig', sep=',',index=False)
Output:
section text
AMAT_10Q_Filing Section: Risk Factors_ The COVID-19 pandemic and global measures taken in response
thereto have adversely impacted, and may continue to adversely
impact, Applied’s operations and financial results.
AMAT_10Q_Filing Section: Risk Factors_ The COVID-19 pandemic and measures taken in response by
governments and businesses worldwide to contain its spread,
AMAT_10Q_Filing Section: Risk Factors_ The degree to which the pandemic ultimately impacts Applied’s
financial condition and results of operations and the global
economy will depend on future developments beyond our control
I would really like to split up 'section' in a way that puts it into new columns based on '_'.
I've tried many different variations of regex to split 'section', and all of them either gave me headings with no fill or added columns after section and text, which isn't useful. I should also add that there will be around 100,000 observations.
Desired result:
Ticker Filing type Section Text
AMAT 10Q Filing Section: Risk Factors The COVID-19 pandemic and global measures taken in response
Any guidance would be appreciated.
If you always know the number of splits, you can do something like:
import pandas as pd
df = pd.DataFrame({ "a": [ "test_a_b", "test2_c_d" ] })
# Split column by "_"
items = df["a"].str.split("_")
# Get last item from splitted column and place it on "b"
df["b"] = items.apply(list.pop)
# Get next last item from splitted column and place it on "c"
df["c"] = items.apply(list.pop)
# Get final item from splitted column and place it on "d"
df["d"] = items.apply(list.pop)
That way, the dataframe will turn into
a b c d
0 test_a_b b a test
1 test2_c_d d c test2
Since you want the columns to be in a certain order, you can reorder the DataFrame's columns as below:
>>> df = df[[ "d", "c", "b", "a" ]]
>>> df
d c b a
0 test a b test_a_b
1 test2 c d test2_c_d
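Another option, if the goal is new columns rather than popping items, is str.split with expand=True. A short sketch using one of the question's (already-replaced) section strings; the final column names are taken from the desired result:
import pandas as pd

df = pd.DataFrame({'section': ['AMAT_10Q_Filing Section: Risk Factors_'],
                   'text': ['The COVID-19 pandemic and global measures taken in response ...']})

# expand=True returns one column per piece; the trailing '_' just yields an empty last piece
parts = df['section'].str.split('_', expand=True)
df['Ticker'] = parts[0]
df['Filing type'] = parts[1]
df['Section'] = parts[2]
df = df[['Ticker', 'Filing type', 'Section', 'text']]
print(df)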
So there are 2 CSV files I'm working with:
file 1:
City KWR1 KWR2 KWR3
Killeen
Killeen
Houston
Whatever
file2:
location link reviews
Killeen www.example.com 300
Killeen www.differentexample.com 200
Killeen www.example3.com 100
Killeen www.extraexample.com 20
Here's what I'm trying to make this code do:
Look at the 'City' in file 1, take the top 3 links in file 2 (you can go ahead and assume the cities won't get mixed up), and then put these top 3 into the KWR1, KWR2 and KWR3 columns for all the same 'City' values.
So it gets the top 3 and then just copies them to the right of all the same 'City' values.
Even asking this question correctly is difficult for me; I hope I've provided enough information.
I know how to read the files in with pandas and all that, I just can't code this exact situation...
It is a slightly unusual requirement, but I think you need three steps:
1. Keep only the first three values you actually need.
df = df.sort_values(by='reviews',ascending=False).groupby('location').head(3).reset_index()
Hopefully this keeps only the first three from every city.
Then you somehow need to label your data. There might be better ways to do this, but here is one: assign a new column with numbers and create a user-defined function.
import numpy as np
df['nums'] = np.arange(len(df))
Now you have a column full of numbers (kind of like line numbers)
Then you create the function that will label your data...
def my_func(index):
    if index % 3 == 0:
        x = 'KWR' + str(1)
    elif index % 3 == 1:
        x = 'KWR' + str(2)
    elif index % 3 == 2:
        x = 'KWR' + str(3)
    return x
You can then create the labels you need:
df['labels'] = df.nums.apply(my_func)
Then you can do:
my_df = pd.pivot_table(df, values='reviews', index=['location'], columns='labels', aggfunc='max').reset_index()
This literally pulls out the labels (pivots) and puts the values into the right places.
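Putting those steps together on file2-style sample data (column names taken from the question; groupby().cumcount() is used here as a compact stand-in for the nums/my_func labelling, and values='link' with aggfunc='first' pulls the links themselves into the KWR columns):
import pandas as pd

# file2-style sample data from the question
file2 = pd.DataFrame({
    'location': ['Killeen', 'Killeen', 'Killeen', 'Killeen'],
    'link': ['www.example.com', 'www.differentexample.com',
             'www.example3.com', 'www.extraexample.com'],
    'reviews': [300, 200, 100, 20],
})

# keep the three most-reviewed links per city
top3 = (file2.sort_values(by='reviews', ascending=False)
             .groupby('location').head(3).reset_index(drop=True))

# label the rows KWR1/KWR2/KWR3 within each city
top3['labels'] = 'KWR' + (top3.groupby('location').cumcount() + 1).astype(str)

# pivot so each city becomes one row with its top three links side by side
wide = pd.pivot_table(top3, values='link', index='location',
                      columns='labels', aggfunc='first').reset_index()
print(wide)

# file1 could then be filled in by merging on the city name, e.g.:
# file1 = file1.merge(wide, left_on='City', right_on='location', how='left')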
I'm trying to match data to data in a DataFrame. The way I'm currently attempting it is not working. After some research, I believe I'm only choosing either/or, not and. I have transactions for which I want to match the opening and closing and disregard the rest. The results still show unclosed transactions.
Code:
# import important stuffs
import pandas as pd

# open file and sort through options only and pair opens to closes
with open('TastyTrades.csv'):
    trade_reader = pd.read_csv('TastyTrades.csv')  # create reader
    options_frame = trade_reader.loc[(trade_reader['Instrument Type'] == 'Equity Option')]  # sort for options only
    BTO = options_frame[options_frame['Action'].isin(['BUY_TO_OPEN', 'SELL_TO_CLOSE'])]  # look for BTO/STC
    STO = options_frame[options_frame['Action'].isin(['SELL_TO_OPEN', 'BUY_TO_CLOSE'])]  # look for STO/BTC
    paired_frame = [BTO, STO]  # combine
    results = pd.concat(paired_frame)  # concat
    results_sorted = results.sort_values(by=['Symbol', 'Call or Put', 'Date'], ascending=True)  # sort by symbol
    results_sorted.to_csv('new_taste.csv')  # write new list
Results:
310,2019-12-19T15:47:24-0500,Trade,SELL_TO_OPEN,APA 200117P00020000,Equity Option,Sold 1 APA 01/17/20 Put 20.00 # 0.33,33,1,33.0,-1.0,-0.15,100.0,APA,1/17/2020,20.0,PUT
296,2019-12-31T09:30:07-0500,Trade,BUY_TO_CLOSE,APA 200117P00020000,Equity Option,Bought 1 APA 01/17/20 Put 20.00 # 0.08,-8,1,-8.0,0.0,-0.14,100.0,APA,1/17/2020,20.0,PUT
8,2020-02-14T12:19:30-0500,Trade,BUY_TO_OPEN,AXAS 200918C00002500,Equity Option,Bought 2 AXAS 09/18/20 Call 2.50 # 0.05,-10,2,-5.0,-2.0,-0.28,100.0,AXAS,9/18/2020,2.5,CALL
172,2020-01-28T10:05:14-0500,Trade,SELL_TO_OPEN,BAC 200320C00033000
As you can see here, I have one full transaction: APA, one half of a transaction: AXAS, and the first half of a full transaction: BAC. I don't want to see AXAS in there. AXAS and the others keep popping up no matter how many times I try to get rid of them.
Right now you're just selecting for all opens and all closes, and then stacking them; there's no actual pairing going on. If I'm understanding you correctly, you only want to include transactions that have both an Open and a Close in the dataset? If that's the case, I'd suggest finding the set intersection of the transaction IDs, and using that to select the paired transactions. It'd look something like the code below, assuming that the fifth column in your data (e.g. "APA 200117P00020000") is the TransactionID.
import pandas as pd

trade_reader = pd.read_csv('TastyTrades.csv')
options_frame = trade_reader.loc[
    (trade_reader['Instrument Type'] == 'Equity Option')
]  # sort for options only
opens = options_frame[
    options_frame['Action'].isin(['BUY_TO_OPEN', 'SELL_TO_OPEN'])
]  # look for opens
closes = options_frame[
    options_frame['Action'].isin(['BUY_TO_CLOSE', 'SELL_TO_CLOSE'])
]  # look for closes

# Then create the set intersection of the open and close transaction IDs
paired_ids = set(opens['TransactionID']) & set(closes['TransactionID'])
paired_transactions = options_frame[
    options_frame['TransactionID'].isin(paired_ids)
]  # And use those to select the paired items

results = paired_transactions.sort_values(
    by=['Symbol', 'Call or Put', 'Date'],
    ascending=True
)  # sort by symbol
results.to_csv('NewTastyTransactions.csv')
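As a quick sanity check of the pairing idea, here it is on rows shaped like the question's results (the column names below are assumptions for illustration only):
import pandas as pd

options_frame = pd.DataFrame({
    'Action': ['SELL_TO_OPEN', 'BUY_TO_CLOSE', 'BUY_TO_OPEN', 'SELL_TO_OPEN'],
    'TransactionID': ['APA 200117P00020000', 'APA 200117P00020000',
                      'AXAS 200918C00002500', 'BAC 200320C00033000'],
})
opens = options_frame[options_frame['Action'].isin(['BUY_TO_OPEN', 'SELL_TO_OPEN'])]
closes = options_frame[options_frame['Action'].isin(['BUY_TO_CLOSE', 'SELL_TO_CLOSE'])]
paired_ids = set(opens['TransactionID']) & set(closes['TransactionID'])
print(paired_ids)  # {'APA 200117P00020000'} - the unpaired AXAS and BAC rows drop out
print(options_frame[options_frame['TransactionID'].isin(paired_ids)])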