I've been struggling with this for two full days now. After trying almost every Stack Overflow solution and other fixes I could find, I sadly still have no luck.
I'm using tabula-py to import tables from PDFs. After reading, the table already looks "perfect" in what seems to be a DataFrame. The part of the code used for this is:
import tabula

tables = tabula.read_pdf(file, pages=18, lattice=True, multiple_tables=False)  # file is the path to the PDF
print(tables)
[Output after printing the table: https://i.stack.imgur.com/82Qpa.png]
However, it seems to be a list object, which blocks me from doing anything with it besides printing. Even indexing with integers and renaming columns doesn't work; the errors all lead back to "Cannot XX because it's a list object". I was under the impression tabula-py returns a pandas DataFrame directly.
Now when I try to add the following code to rename the columns as desired:
tables.columns = ['HS_Code', 'Product', 'PreviousMonth', 'CurrentMonth', 'LastYear']
I get the error :
AttributeError: 'list' object has no attribute 'columns'
I've tried many forms of renaming and different output formats such as JSON. Still no luck; it's still a "list object".
Does anyone have experience with this? How can I ensure the table I have is an actual DataFrame instead of a list object?
Any tips would be highly appreciated.
I am not familiar with tabula-py objects, but considering this post you can do the following:
1. use pandas.read_clipboard() after copying the PDF content by hand, or
2. save the tabula-py output as CSV and use pandas.read_csv() to get the DataFrame.
Afterwards you are able to manipulate the data (e.g. change column names) using pandas.
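A minimal sketch of option 2, assuming tabula-py's convert_into helper and a placeholder output path:

import tabula
import pandas as pd

# Export the PDF table to a CSV file first ("output.csv" is a placeholder
# path for this sketch), then load it with pandas. `file` is the PDF path
# from the question.
tabula.convert_into(file, "output.csv", output_format="csv", pages=18, lattice=True)
df = pd.read_csv("output.csv")
df.columns = ['HS_Code', 'Product', 'PreviousMonth', 'CurrentMonth', 'LastYear']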
Related
I have loaded some data from CSV files into two dataframes, X and Y, that I intend to perform some analysis on. After cleaning them up a bit I can see that the indexes of my dataframes appear to match (they're just sequential numbers), except one has an index of type object and the other has an index of type int64. Please see the attached image for a clearer idea of what I'm talking about.
I have tried manually altering this using X.index.astype('int64') and also X.reindex(Y.index), but neither seems to do anything here. Could anyone suggest anything?
Edit: Adding some additional info in case it is helpful. X was imported as row data from the csv file and transposed, whereas Y was imported directly with the index set from the first column of the csv file.
So I've realised what I've done and it was pretty dumb. I should have written
X.index = X.index.astype('int64')
instead of just
X.index.astype('int64')
astype returns a new Index rather than modifying the existing one in place, so the result has to be assigned back. Oh well, the more you know.
I'm currently working on a project that takes a CSV list of student names who attended a meeting and converts it into a Python list (later to be compared to the full student roster, but one thing at a time). I've been looking for answers for hours, but I still feel stuck. I've tried both pandas and the csv module. I'd like to stick with pandas, but if it's easier with the csv module, that works too. CSV file example and code below.
The file is autogenerated by our video call software, so the formatting is a little weird.
Attendance.csv
see sample as image, I can't insert images yet
Code:
import pandas

data = pandas.read_csv("2A Attendance Report.csv", header=3)
AttendanceList = data['A'].to_list()
print(str(AttendanceList))
However, this is raising KeyError: 'A'
Any help is really appreciated, thank you!!!
As seen in the sample image, your column headers are in the first row itself, so you need to remove header=3 from your read_csv call. Either replace it with header=0 or don't pass an explicit header value at all.
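For example, a corrected version of the call (a sketch; the column name 'A' is taken from the question, and the real header may differ per the sample image):

import pandas as pd

# Headers are on the first row, so rely on the default header=0.
data = pd.read_csv("2A Attendance Report.csv")
attendance_list = data['A'].to_list()  # 'A' assumed from the question; check data.columns
print(attendance_list)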
I have a csv file containing numerical values such as 1524.449677. There are always exactly 6 decimal places.
When I import the csv file (along with other columns) via pandas read_csv, the column automatically gets the datatype object. My issue is that the values are shown as 2470.6911370000003 when they should be 2470.691137. Or the value 2484.30691 is shown as 2484.3069100000002.
This seems to be a datatype issue in some way. I tried to explicitly provide the data type when importing via read_csv by giving the dtype argument as {'columnname': np.float64}. Still the issue did not go away.
How can I get the values imported and shown exactly as they are in the source csv file?
Pandas uses a dedicated decimal-to-binary converter that sacrifices accuracy for speed.
Passing float_precision='round_trip' to read_csv fixes this.
Check out this page for more detail on this.
After processing your data, if you want to save it back to a csv file, you can pass float_format="%.nf" to the corresponding method, where n is the number of decimal places to keep.
A full example:
import pandas as pd

# source_file and target_file are placeholder paths
df_in = pd.read_csv(source_file, float_precision='round_trip')
df_out = ...  # some processing of df_in
df_out.to_csv(target_file, float_format="%.3f")  # keep 3 decimal places
I realise this is an old question, but maybe this will help someone else:
I had a similar problem, but couldn't quite use the same solution. Unfortunately the float_precision option only exists when using the C engine, not with the python engine. So if you have to use the python engine for some other reason (for example because the C engine can't deal with regular expressions as delimiters), this little "trick" worked for me:
In the pd.read_csv arguments, define dtype='str' and then convert your dataframe to whatever dtype you want, e.g. df = df.astype('float64').
Bit of a hack, but it seems to work. If anyone has any suggestions on how to solve this in a better way, let me know.
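A minimal sketch of that workaround, assuming a hypothetical file and a regex separator that forces the python engine:

import pandas as pd

# A regex separator requires engine='python', where float_precision is unavailable,
# so read everything as strings and convert explicitly afterwards.
# "data.csv" and the column name "value" are placeholders for this sketch.
df = pd.read_csv("data.csv", sep=r";\s*", engine="python", dtype=str)
df["value"] = df["value"].astype("float64")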
I want to get the discord.user_id. I am VERY new to Python and just need help getting this data.
I have tried everything and there is no clear answer online.
Currently, this works to get a data point in the attributes section:
pledge.relationship('patron').attribute('first_name')
You should try this :
import pandas as pd
df = pd.read_json("path_to_your/file.json")
The output will be a DataFrame, a table-like structure in which the JSON attributes become the column names. You will have to manipulate it afterwards, which is preferable, since operations on DataFrames are optimized for processing time.
Here is the official documentation; take a look.
Assuming the whole object is called myObject, you can obtain the discord user_id via myObject.json_data.attributes.social_connections.discord.user_id.
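If the response has instead been parsed into plain Python dicts, the equivalent lookup is a chain of key accesses. A minimal sketch, assuming a hypothetical file holding the JSON; the exact nesting depends on the payload and mirrors the path in the answer above:

import json

# "pledge.json" is a placeholder file holding the API response.
with open("pledge.json") as f:
    data = json.load(f)

user_id = data["attributes"]["social_connections"]["discord"]["user_id"]
print(user_id)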
I am trying to store some extra information directly in a DataFrame, such as parameters describing the stored data.
I added this information just as extra attributes to the DataFrame:
df.data_origin = 'my_origin'
print(df.data_origin)
But when it is saved and loaded, those extra attributes are lost:
df.to_pickle('pickle_test.pkl')
df2 = pd.read_pickle('pickle_test.pkl')
print(len(df2))
print(df2.data_origin)
...
465387
>>> AttributeError: 'DataFrame' object has no attribute 'data_origin'
The workaround I have found is to save the __dict__ of the DataFrame and then assign it to the __dict__ of an empty DataFrame:
import pickle

with open('modified_dataframe.pkl', "wb") as pkl_out:
    pickle.dump(df.__dict__, pkl_out)

df2 = pd.DataFrame()
with open('modified_dataframe.pkl', "rb") as pkl_in:
    df2.__dict__ = pickle.load(pkl_in)
print(len(df2))
print(df2.data_origin)
...
465387
my_origin
It seems to work, but:
1. Is there a better way to do it?
2. Am I losing information? (apparently, all the data is there)
Here a different solution is discussed, but I would like to know whether saving the __dict__ of a class is a valid way to preserve its entire state.
EDIT: OK, I found the big drawback. This works fine for saving single DataFrames in separate files, but it will not work if I have dictionaries, lists or similar structures containing DataFrames.
I suggest making a new child class that inherits from pandas.DataFrame and adding your wanted attributes there. This may seem a bit spooky, but it lets you carry the attributes around safely when the DataFrame is used in different places; a sketch follows below. Other approaches might be more useful for specific cases, though.
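A minimal sketch of that subclassing approach, assuming a recent pandas version where attributes listed in _metadata are carried through pickling (an assumption worth verifying against your pandas version):

import pandas as pd

class MetaDataFrame(pd.DataFrame):
    # Attributes named in _metadata are propagated through pandas operations
    # and, in recent pandas versions, included in the pickled state.
    _metadata = ["data_origin"]

    @property
    def _constructor(self):
        # Keep this subclass when operations return new frames.
        return MetaDataFrame

df = MetaDataFrame({"a": [1, 2, 3]})
df.data_origin = "my_origin"

df.to_pickle("pickle_test.pkl")
df2 = pd.read_pickle("pickle_test.pkl")
print(type(df2).__name__, df2.data_origin)  # MetaDataFrame my_origin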