The task is a very simple data analysis: I download a report through an API and it arrives as a CSV file. I have been trying to convert it correctly to a DataFrame using the following code:
@staticmethod
def convert_csv_to_data_frame(csv_buffer_file):
    data = StringIO(csv_buffer_file)
    dataframe = DataFrame.from_csv(path=data, index_col=0)
    return dataframe
However, since the CSV doesn't contain an index column, the first column of the data I need is being dropped from the DataFrame's columns because it is treated as the index.
I wanted to know if there is a way to make the dataframe insert an index column automatically.
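Your error here was to assume that index_col=0 meant pandas would not treat your CSV as having an index column. It means the opposite: column 0 becomes the index. What you want is index_col=None. Note that DataFrame.from_csv defaults to index_col=0, so simply dropping the argument there does not help. pandas.read_csv, on the other hand, defaults to index_col=None, and DataFrame.from_csv was deprecated and later removed from pandas, so read_csv is the safer choice:
@staticmethod
def convert_csv_to_data_frame(csv_buffer_file):
    data = StringIO(csv_buffer_file)
    dataframe = pd.read_csv(data)  # index_col defaults to None, so a fresh 0..n-1 index is created
    return dataframe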
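A minimal sketch of the difference, using pandas.read_csv on a small made-up CSV string (the column names and values are hypothetical, chosen only for illustration):

```python
from io import StringIO

import pandas as pd

# Hypothetical report content: no index column in the file itself.
csv_text = "name,score\nalice,1\nbob,2\n"

# index_col=None (the read_csv default): all columns are kept as data
# and pandas generates a 0..n-1 integer index.
with_default = pd.read_csv(StringIO(csv_text))

# index_col=0: the first column is consumed as the index.
with_first_col_index = pd.read_csv(StringIO(csv_text), index_col=0)

print(with_default)
print(with_first_col_index)
```

With the default, "name" stays available as an ordinary column; with index_col=0 it is only reachable through the index.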
For more info consult the docs
[ 10-07-2022 - For anyone stopping by with the same issue. After much searching, I have yet to find a way, that isn't convoluted and complicated, to accurately pull mixed type data from excel using Pandas/Python. My solution is to convert the files using unoconv on the command line, which preserves the formatting, then read into pandas from there. ]
I have to concatenate thousands of individual Excel workbooks, each with a single sheet, into one master sheet. I use a for loop to read them into a data frame, then concatenate the data frame to a master data frame. There is one column in each that could represent currency, percentages, or just contain notes. Sometimes it has been filled out with explicit indicators in the cell, e.g., '$'. Other times, someone has used cell formatting to indicate currency while leaving just a decimal in the cell. I've been using a formatting routine to catch some of this but have run into some edge cases.
Consider a case like the following:
In the actual spreadsheet, you see: $0.96
When read_excel siphons this in, it will be represented as 0.96. Because of the mixed-type nature of the column, there is no sure way to know whether this is 96% or $0.96
Is there a way to read excel files into a data frame for analysis and record what is visually represented in the cell, regardless of whether cell formatting was used or not?
I've tried using dtype="str", dtype="object" and have tried using both the default and openpyxl engines.
UPDATE
Taking the comments below into consideration, I'm rewriting with openpyxl.
from pathlib import Path

import openpyxl
import pandas as pd

def excel_concat(df_source):
    df_master = pd.DataFrame()
    for index, row in df_source.iterrows():
        excel_file = Path(row['Test Path']) / Path(row['Original Filename'])
        wb = openpyxl.load_workbook(filename=excel_file)
        ws = wb.active
        df_data = pd.DataFrame(ws.values)
        df_master = pd.concat([df_master, df_data], ignore_index=True)
    return df_master

df_master1 = excel_concat(df_excel_files)
This appears to be nothing more than a "longcut" to just calling the openpyxl engine with pandas. What am I missing in order to capture the visible values in the excel files?
Looking at https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html, I noticed the following:
dtype: Type name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use object to preserve data as stored in Excel and not interpret dtype. **If converters are specified, they will be applied INSTEAD of dtype conversion.**
converters: dict, default None
Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.
Do you think that might work for you?
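One caveat: a converter only receives the stored cell content (0.96), not the cell's display format, so it cannot tell $0.96 from 96% on its own. A possible alternative, sketched here under the assumption that the distinction lives in the cell's number format, is to read openpyxl's cell.number_format next to cell.value. The helper name read_values_with_formats and the demo workbook are hypothetical:

```python
from io import BytesIO

from openpyxl import Workbook, load_workbook


def read_values_with_formats(source):
    """Return (value, number_format) pairs for each cell of the active sheet."""
    wb = load_workbook(source)
    ws = wb.active
    return [[(cell.value, cell.number_format) for cell in row]
            for row in ws.iter_rows()]


# Build a tiny in-memory workbook: same stored value, two display formats.
demo = Workbook()
sheet = demo.active
sheet["A1"] = 0.96
sheet["A1"].number_format = '"$"#,##0.00'   # displayed as $0.96
sheet["A2"] = 0.96
sheet["A2"].number_format = "0%"            # displayed as 96%

buffer = BytesIO()
demo.save(buffer)
buffer.seek(0)

cells = read_values_with_formats(buffer)
for row in cells:
    print(row)
```

The format string can then drive the interpretation, for example treating a format that contains "%" as a percentage and one that contains "$" as currency.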
Is there a way to format the output of a particular column of a pandas DataFrame (e.g., as currency, with ${:,.2f}, or percentage, with {:,.2%}) without changing the data itself?
In this post I see that map can be used, but it changes the data to strings.
I also see that I can use .style.format (see here) to print a data frame with some formatting, but it returns a Styler object.
I would just like to change the default printout of the data frame itself, so that it always prints formatted as specified. (I suppose this means changing __repr__ or _repr_html_.) I'd assume that there is a simple way of doing this in pandas, but I could not find it.
Any help would be greatly appreciated!
EDIT (for clarification): Suppose I have a data frame df:
df = pd.DataFrame({"Price": [1234.5, 3456.789], "Increase": [0.01234, 0.23456]})
I want the column Price to be formatted with "${:,.2f}" and column Increase to be formatted with "{:,.2%}" whenever I print df in a Jupyter notebook (with print or just running a cell ending in df).
I can use
df.style.format({"Price": "${:,.2f}", "Increase": "{:,.2%}"})
but I do not want to type that every time I print df.
I could also do
df["Price"] = df["Price"].map("${:,.2f}".format)
df["Increase"] = df["Increase"].map("{:,.2%}".format)
which does always print as I want (with print(df)), but this changes the columns from float64 to object, so I cannot manipulate the data frame anymore.
It is a natural feature, but pandas cannot guess what your format is, and each time you create a styler it has to be informed of such decisions, since it is a separate object. It does not dynamically update if you change your DataFrame.
The best you can do is to create a generic print via styler.
def p(df):
    styler = df.style
    styler.format({"Price": "${:,.2f}", "Increase": "{:,.2%}"})
    return styler

df = pd.DataFrame([[1,2],[3,4]], columns=["Price", "Increase"])
p(df)
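For plain print output rather than a Styler, one option is DataFrame.to_string with its formatters argument, which formats the text without touching the stored floats. This is a sketch only; the helper name show and the FORMATTERS dict are made up for illustration:

```python
import pandas as pd

# Per-column format functions; the keys must match the column names.
FORMATTERS = {
    "Price": "${:,.2f}".format,
    "Increase": "{:,.2%}".format,
}


def show(df, formatters=FORMATTERS):
    """Return a formatted text rendering; the DataFrame itself is unchanged."""
    return df.to_string(formatters=formatters)


df = pd.DataFrame({"Price": [1234.5, 3456.789], "Increase": [0.01234, 0.23456]})
text = show(df)
print(text)

# The underlying data is still numeric and can be manipulated as usual.
total = df["Price"].sum()
```

This still requires calling the helper explicitly, but the columns stay float64.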
I have OHLC data in a .csv file with the stock name repeated in the header row, like this:
M6A=F, M6A=F,M6A=F, M6A=F, M6A=F
Open, High, Low, Close, Volume
I am using pandas read_csv to get it, and parse all (and only) the 'M6A=F' columns to FastAPI. So far nothing I do will get all the columns. I either get the first column if I filter with "usecols=" or the last column if I filter with "names=".
I don't want to load the entire .csv file then dump unwanted data due to speed of use, so need to filter before extracting the data.
Here is my code example:
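import json

import pandas as pd
from fastapi import FastAPI

app = FastAPI()  # assumed: the app object the route below is registered on
symbol = ['M6A=F']
df = pd.read_csv('myOHCLVdata.csv', skipinitialspace=True, usecols=lambda x: x in symbol)

def parse_csv(df):
    res = df.to_json(orient="records")
    parsed = json.loads(res)
    return parsed

@app.get("/test")
def historic():
    return parse_csv(df)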
What I have done so far:
I checked the documentation for pandas.read_csv and it says "names=" will not allow duplicates.
I use lambdas in the above code to prevent the symbol hanging FastAPI if it does not match a column.
My understanding from other Stack Overflow questions on this is that mangle_dupe_cols=True should rename the duplicates to M6A=F.1, M6A=F.2, M6A=F.3, etc. when pandas reads the file into a dataframe, but that isn't happening. I tried setting it to False, but it says that is not implemented yet.
The answers I found in this Stack Overflow solution don't seem to tally with what is happening in my code, since I am only getting the first column returned, or the last column with the others overwritten. (I included the FastAPI code here as it might be related to the issue or a workaround.)
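One possible workaround, sketched under the assumption that the file has exactly two header rows (symbol, then field name) as in the sample above: read only those two rows first, work out the positions of the wanted symbol, then load just those positions with usecols. The function name read_symbol_columns, the OTHER column, and the sample values are hypothetical:

```python
import os
import tempfile

import pandas as pd


def read_symbol_columns(path, symbol):
    """Load only the columns whose first header row equals `symbol`."""
    # Pass 1: read just the two header rows, without treating them as headers.
    header = pd.read_csv(path, header=None, nrows=2, skipinitialspace=True)
    positions = [i for i, name in enumerate(header.iloc[0])
                 if str(name).strip() == symbol]
    fields = [str(header.iloc[1, i]).strip() for i in positions]

    # Pass 2: skip the header rows and select the columns by position,
    # which sidesteps the duplicate column names entirely.
    data = pd.read_csv(path, header=None, skiprows=2,
                       usecols=positions, skipinitialspace=True)
    data.columns = fields
    return data


# Made-up sample in the layout described in the question.
sample = (
    "M6A=F, M6A=F,M6A=F, M6A=F, M6A=F, OTHER\n"
    "Open, High, Low, Close, Volume, Open\n"
    "1.0, 2.0, 0.5, 1.5, 100, 9.0\n"
    "1.5, 2.5, 1.0, 2.0, 200, 9.5\n"
)
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    with open(csv_path, "w") as handle:
        handle.write(sample)
    df_symbol = read_symbol_columns(csv_path, "M6A=F")

print(df_symbol)
```

The first pass touches only two rows, so unwanted columns are never loaded in full.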
I'm new to Python and just trying to redo my first project from MATLAB. I've written code in VS Code to import an Excel file using pandas:
filename=r'C:\Users\user\Desktop\data.xlsx'
sheet=['data']
with pd.ExcelFile(filename) as xls:
    Dateee=pd.read_excel(xls, sheet,index_col=0)
Then I want to access data in a row and column.
I tried to print data using code below:
for key in dateee.keys():
    print(dateee.keys())
but this returns nothing.
Is there any way to access the data (as a list)?
You can iterate on each column, making the contents of each a list:
for c in df:
    print(df[c].to_list())
df is what the dataframe was assigned as. (OP had inconsistent syntax & so I didn't use that.)
Look into df.iterrows() or df.itertuples() if you want to iterate by row. Example:
for row in df.itertuples():
    print(row)
Look into df.iloc and df.loc for row and column selection of individual values, see Pandas iloc and loc – quickly select rows and columns in DataFrames.
Or df.iat or df.at for getting or setting single values, see here, here, and here.
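A short sketch of those accessors on a made-up frame (the column names and dates are hypothetical). Note also that pd.read_excel returns a dict of DataFrames when it is given a list of sheet names, so in that case the frame has to be pulled out of the dict by sheet name first:

```python
import pandas as pd

# Hypothetical data standing in for one sheet read from Excel.
df = pd.DataFrame(
    {"open": [1.0, 1.5], "close": [1.2, 1.4]},
    index=["2021-01-01", "2021-01-02"],
)

column_as_list = df["close"].tolist()        # one column, as a list
row_by_label = df.loc["2021-01-01"].tolist() # one row, selected by index label
row_by_position = df.iloc[1].tolist()        # one row, selected by position
value_by_label = df.at["2021-01-02", "open"] # single value, label-based
value_by_position = df.iat[0, 1]             # single value, position-based

print(column_as_list, row_by_label, row_by_position)
print(value_by_label, value_by_position)
```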
I am still learning python, kindly excuse if the question looks trivial to some.
I have a csv file with following format and I want to extract a small segment of it and write to another csv file:
So, this is what I want to do:
Just extract the entries under actor_list2 and the corresponding id column and write it to a csv file in following format.
Since the format is not regular column headers followed by values, I am not sure how to select a starting point based on a cell value in a particular column. E.g., even if we consider actor_list2, it may have any number of entries under it. Please help me understand whether this can be done with pandas DataFrame processing.
Update: The reason why I would like to automate it is because there can be thousands of such files and it would be impractical to manually get that info to create the final csv file which will essentially have a row for each file.
As Nour-Allah has pointed out, the formatting here is not very regular, to say the least. If your data comes out like this every time, the best you can do is skip some rows of the file:
import pandas as pd
df = pd.read_csv('blabla.csv', skiprows=list(range(17)), nrows=8)
df_res = df.loc[:, ['actor_list2', 'ID']]
This should get you the result but given how erratic formatting is, this is no way to automate. What if next time there's another actor? Or one fewer? Even Nour-Allah's solution would not help there.
Honestly, you should just get better data.
The CSV file you have is not regular, so it has a lot of empty positions, which pandas reads as NaN values. When the file is read without a header row, the columns are labeled by integer position.
I will use pandas to read
import pandas as pd
df = pd.read_csv("not_regular_format.csv", header=None)
Then initialize an empty dictionary to store the results in. It is used to build an output DataFrame, whose content is finally written to a CSV file:
target={}
Now find actor_list2 in the second column, which is the column with index 1. If it exists, store the names and scores from the following rows (columns 1 and 2) in the dictionary target:
rows_index = df[df[1] == 'actor_list2'].index
if len(rows_index) > 0:
    i = rows_index[0]
    while i + 1 < len(df):  # stop at the end of the file if no empty row follows
        i += 1
        name = df.iloc[i, 1]
        score = df.iloc[i, 2]
        if pd.isna(name):  # the names sequence is finished and a NaN value is found
            break
        target[name] = [score]
and finally, construct DataFrame and write the new output.csv file
df_output=pd.DataFrame(target)
df_output.to_csv('output.csv')
Now, you can go anywhere with the given example above.
Good Luck
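Since the update mentions thousands of files, the scanning steps above could be wrapped in a function and applied per file. This is only a sketch: the function name extract_block and the sample layout are made up, because the real file layout is not shown here, and it assumes the block ends at the first empty cell in the name column.

```python
import io

import pandas as pd


def extract_block(df, marker="actor_list2", key_col=1, value_col=2):
    """Collect name -> value pairs listed under `marker` until an empty cell."""
    positions = (df[key_col] == marker).to_numpy().nonzero()[0]
    result = {}
    if len(positions) == 0:
        return result  # marker not present in this file
    for pos in range(positions[0] + 1, len(df)):
        name = df.iat[pos, key_col]
        if pd.isna(name):  # empty cell: the block is finished
            break
        result[name] = df.iat[pos, value_col]
    return result


# Hypothetical irregular file: two blocks separated by empty rows.
sample = io.StringIO(
    ",actor_list1,ID\n"
    ",Alice,1\n"
    ",,\n"
    ",actor_list2,ID\n"
    ",Bob,7\n"
    ",Carol,9\n"
    ",,\n"
    ",footer,\n"
)
df = pd.read_csv(sample, header=None)
block = extract_block(df)

# One output row per input file: collect one dict per file, then build a frame.
df_output = pd.DataFrame([block])
print(df_output)
```

Because the column mixes text and numbers, the extracted values come back as strings and may need converting before further analysis.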