I am reading multiple PDF files which have certain highlighted portions (presume that these are tables).
After pushing the highlights to a list, I am saving them to a dataframe.
Here's the logic for this:
import glob

import fitz  # PyMuPDF
import pandas as pd

try:
    filepath = [file for file in glob.glob("Folder/*.pdf")]
    for file in filepath:
        doc = fitz.open(file)
        print(file)
        highlights = []
        for page in doc:
            highlights += handle_page(page)
        #print(highlights)
        highlights_alt = highlights[0].split(',')
        df = pd.DataFrame(highlights_alt, columns=['Security Name'])
        #print(df.columns.tolist())
        df[['Security Name', 'Weights']] = df['Security Name'].str.rsplit(n=1, expand=True)
        df.drop_duplicates(keep='first', inplace=True)
        print(df.head())
        print(df.shape)
except IndexError:
    print('file {} is not highlighted'.format(file))
Using this logic I get the dataframes; however, if the folder has 5 PDFs, it creates 5 different dataframes, something like this:
Folder\z.pdf
Security Name Weights
0 UTILITIES (5.96
1 %*) None
(2, 2)
Folder\y.pdf
Security Name Weights
0 Quantity/ Market Value % of Net Investments Cu... 1.125
1 % 01
2 /07 None
3 /2027 None
4 EUR 230
(192, 2)
Folder\x.pdf
Security Name Weights
0 Holding £740
1 000 None
2 Leeds Building Society 3.75
3 % variable 25
4 /4 None
(526, 2)
However, I want a single dataframe with all of the above records in it, making its shape (720, 2), something like:
Security Name Weights
0 Holding £740
1 000 None
2 Leeds Building Society 3.75
3 % variable 25
4 /4 None
.
.
720 xyz 3.33
(720, 2)
I tried using pandas' concat and append but have been unsuccessful so far. Please let me know an efficient way of doing this, since in future there will be more than 1000 PDFs.
Please help!
A quick way is to use pd.concat:
big_df = pd.concat(list_of_dfs, axis=0)
If this creates an error it would be helpful to know what the error is.
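For example, a rough sketch of that idea, reusing the handle_page helper and folder layout from your question (untested against your files): build each per-file dataframe inside the loop, collect them in a list, and concatenate once after the loop.
import glob

import fitz  # PyMuPDF
import pandas as pd

dfs = []  # one dataframe per PDF
for file in glob.glob("Folder/*.pdf"):
    doc = fitz.open(file)
    highlights = []
    for page in doc:
        highlights += handle_page(page)  # handle_page as defined in your code
    if not highlights:  # nothing highlighted in this file
        print('file {} is not highlighted'.format(file))
        continue
    df = pd.DataFrame(highlights[0].split(','), columns=['Security Name'])
    df[['Security Name', 'Weights']] = df['Security Name'].str.rsplit(n=1, expand=True)
    df.drop_duplicates(keep='first', inplace=True)
    dfs.append(df)

big_df = pd.concat(dfs, axis=0, ignore_index=True)
print(big_df.shape)  # e.g. (720, 2) for the three example files
Here ignore_index=True gives the combined frame a fresh 0..N-1 index instead of repeating each file's own index.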
I have many different tables that all have different column names, and each refers to an outcome, like glucose, insulin, leptin, etc. (keep in mind that the tables are all gigantic and messy, with tons of other columns in them as well).
I am trying to generate a report that starts empty but then adds columns based on functions applied to each of the glucose, insulin, and leptin tables.
I have included a very simple example - ignore that the function makes little sense. The code below works, but instead of copy-pasting final_report["outcome"] = ... over and over again, I would like to just run the find_result function over each of glucose, insulin, and leptin and add the "glucose_result", "insulin_result" and "leptin_result" columns to final_report in one or a few lines.
Thanks in advance.
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
glucose = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
insulin = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
leptin = pd.DataFrame({'id': ids,
                       'timepoint': timepoint,
                       'outcome': outcome})
ids = [1,2,3,4]
start = [1,1,1,1]
end = [6,6,6,6]
final_report = pd.DataFrame({'id': ids,
                             'start': start,
                             'end': end})
def find_result(subject, start, end, df):
    # count the distinct timepoints recorded for this subject between start and end
    df = df.loc[(df["id"] == subject) & (df["timepoint"] >= start) & (df["timepoint"] <= end)].sort_values(by="timepoint")
    return df["timepoint"].nunique()
final_report['glucose_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], glucose), axis=1)
final_report['insulin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], insulin), axis=1)
final_report['leptin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], leptin), axis=1)
If you have to use this code structure, you can create a simple dictionary with your dataframes and their names and loop through them, creating new columns with programmatically assigned names:
input_dfs = {"glucose": glucose, "insulin": insulin, "leptin": leptin}
for name, df in input_dfs.items():
    final_report[f"{name}_result"] = final_report.apply(
        lambda x: find_result(x['id'], x['start'], x['end'], df),
        axis=1
    )
Output:
id start end glucose_result insulin_result leptin_result
0 1 1 6 6 6 6
1 2 1 6 6 6 6
2 3 1 6 3 3 3
3 4 1 6 6 6 6
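Adding another outcome table later should then only require one more dictionary entry; for example (cortisol is a hypothetical table here):
input_dfs["cortisol"] = cortisol  # the same loop then produces a 'cortisol_result' column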
I have a set of job vacancy data in a pandas dataframe which I have 'tagged' by the strings it contains. E.g., if C# is mentioned in the title or description, C# is added to that row under the languages column.
I now want to summarise the count of each skill in the entire dataframe. Original dataframe:
languages frameworks platforms databases other
0 [SQL, C] [] [AWS] [] []
1 [SQL] [] [] [SQL Server] []
2 [SQL, C#] [ASP.NET, ASP] [] [SQL Server] []
3 [JavaScript, HTML, CSS,] [] [] [] []
4 [JavaScript, Python, Java] [React] [] [] []
...
Desired:
skill_category skill count
0 languages SQL 3
1 languages C 1
2 languages C# 1
3 languages JavaScript 2
...
9 frameworks ASP.NET 1
10 frameworks ASP 1
...
12 platforms AWS 1
...
14 databases SQL Server 2
...
15 other Hadoop 1
etc.
I have tried:
Outputting the relevant parts of the df into a python list of dictionaries and using for loops and counters to count the skills in each category. I can then create a new dataframe from this, but this is a lot of code for something I feel should be possible with pandas.
Looked into pandas' .pivot() method, though I can't work out a way to have the column names (languages, frameworks etc.) become rows for each skill.
Used pandas' .explode() and .value_counts() methods to count the skills in each column, e.g.:
In[12]: df['languages'].explode().value_counts()
Out[12]:
JavaScript 39
SQL 32
C# 28
HTML 24
Java 24
...
But this only works column-by-column. I need a dataframe with a category column for creating faceted visualisations in Plotly.
Please help?
You are close to the answer! But stack all columns into one tall series before counting:
counts = df.stack().explode() \
           .reset_index().groupby(["level_1", 0]).count()
counts = counts.reset_index()
counts.columns = ["skill_category", "skill", "count"]
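If you prefer named columns over the stacked index, a melt-based version of the same idea should also work (a sketch, not tested against your full data):
long = df.melt(var_name="skill_category", value_name="skill").explode("skill")
counts = (long.dropna()                              # drop rows produced by empty lists
              .groupby(["skill_category", "skill"])
              .size()
              .reset_index(name="count"))
Either way, the key step is turning the wide per-category columns into one long category/skill pair per row before counting.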
Can you store data in a pandas HDFStore and then open it / perform I/O using PyTables? The reason this question comes up is because I am currently storing data like this:
store = pd.HDFStore('Filename', mode='a')
store.append('data', data)
However, as I understand it, pandas doesn't really support updating records. I have a use case where I have to update 5% of the data daily. Would pd.io.pytables work? If so, I found no documentation on this. PyTables has a lot of documentation, but I am not sure if I can open the file and update it using PyTables when I didn't use PyTables to save the file initially.
Here is a demonstration of @flyingmeatball's answer:
Let's generate a test DF:
In [56]: df = pd.DataFrame(np.random.rand(15, 3), columns=list('abc'))
In [57]: df
Out[57]:
a b c
0 0.022079 0.901965 0.282529
1 0.596452 0.096204 0.197186
2 0.034127 0.992500 0.523114
3 0.659184 0.447355 0.246932
4 0.441517 0.853434 0.119602
5 0.779707 0.429574 0.744452
6 0.105255 0.934440 0.545421
7 0.216278 0.217386 0.282171
8 0.690729 0.052097 0.146705
9 0.828667 0.439608 0.091007
10 0.988435 0.326589 0.536904
11 0.687250 0.661912 0.318209
12 0.829129 0.758737 0.519068
13 0.500462 0.723528 0.026962
14 0.464162 0.364536 0.843899
and save it to the HDFStore (NOTE: don't forget to use data_columns=True (or data_columns=[list_of_columns_to_index]) in order to index all columns that we want to use in the where clause):
In [58]: store = pd.HDFStore(r'd:/temp/test_removal.h5')
In [59]: store.append('test', df, format='t', data_columns=True)
In [60]: store.close()
Solution:
In [61]: store = pd.HDFStore(r'd:/temp/test_removal.h5')
The .remove() method should return the # of removed rows:
In [62]: store.remove('test', where="a > 0.5")
Out[62]: 9
Let's append the changed (multiplied by 100) rows:
In [63]: store.append('test', df.loc[df.a > 0.5] * 100, format='t', data_columns=True)
Test:
In [64]: store.select('test')
Out[64]:
a b c
0 0.022079 0.901965 0.282529
2 0.034127 0.992500 0.523114
4 0.441517 0.853434 0.119602
6 0.105255 0.934440 0.545421
7 0.216278 0.217386 0.282171
14 0.464162 0.364536 0.843899
1 59.645151 9.620415 19.718557
3 65.918421 44.735482 24.693160
5 77.970749 42.957446 74.445185
8 69.072948 5.209725 14.670545
9 82.866731 43.960848 9.100682
10 98.843540 32.658931 53.690360
11 68.725002 66.191215 31.820942
12 82.912937 75.873689 51.906795
13 50.046189 72.352794 2.696243
Finally, close the store:
In [65]: store.close()
Here are the docs I think you're after:
http://pandas.pydata.org/pandas-docs/version/0.19.0/api.html?highlight=pytables
See this thread as well:
Update pandas DataFrame in stored in a Pytable with another pandas DataFrame
Looks like you can load the 5% of records into memory, remove them from the store, then append the updated ones back, rather than replacing the whole table:
store.remove(key, where = ...)
store.append(.....)
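As a minimal sketch of that workflow (the key 'test', the column 'a' and the condition are just placeholders borrowed from the demonstration above; substitute your own key and filter):
store = pd.HDFStore('test_removal.h5', mode='a')

changed = store.select('test', where="a > 0.5")   # 1. load the ~5% you need to update
changed['b'] = changed['b'] * 2                   # 2. apply your daily changes

store.remove('test', where="a > 0.5")             # 3. drop the stale rows
store.append('test', changed, format='t',
             data_columns=True)                   # 4. write the updated rows back
store.close()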
You can also do this outside of pandas - see the tutorial here on removal:
http://www.pytables.org/usersguide/tutorials.html
I have a dbf table. I want to automatically divide this table into two or more tables using Python. The main problem is that this table consists of several groups of lines, and each group is separated from the previous group by an empty line. I need to save each of these groups to a new dbf table. I think this could be solved with some function from the Arcpy package and FOR and WHILE loops, but my brain can't solve it :D :/ My source dbf table is more complex, but I have attached a simple example for better understanding. Sorry for my poor English.
Source dbf table:
ID NAME TEAM
1 A 1
2 B 2
3 C 1
4
5 D 2
6 E 3
I want get dbf1:
ID NAME TEAM
1 A 1
2 B 2
3 C 1
I want get dbf2:
ID NAME TEAM
1 D 2
2 E 3
Using my dbf package it could look something like this (untested):
import dbf

source_dbf = '/path/to/big/dbf_file.dbf'
base_name = '/path/to/smaller/dbf_%03d'

sdbf = dbf.Table(source_dbf)
i = 1
ddbf = sdbf.new(base_name % i)

sdbf.open()
ddbf.open()

for record in sdbf:
    if not record.name:   # assuming if 'name' is empty, all are empty
        ddbf.close()
        i += 1
        ddbf = sdbf.new(base_name % i)
        ddbf.open()       # open the next destination table before appending to it
        continue
    ddbf.append(record)

ddbf.close()
sdbf.close()