I have the following dataframe output that I would like to convert to JSON, but the conversion picks up a leading zero (the row number), which ends up in the JSON. How do I remove it? Pandas numbers each row by default.
id version ... token type_id
0 10927076529 0 ... 56599bb6-3b56-425b-8688-8fc0c73fbedc 3
{"0":{"id":10927076529,"version":0,"token":"56599bb6-3b56-425b-8688-8fc0c73fbedc","type_id":3}}
df = df.rename(columns={'id': 'version', 'token': 'type_id' })
df2 = df.to_json(orient="index")
print(df2)
Pandas has that 0 value as the row index for your single DataFrame entry. You can't remove it in the actual DataFrame as far as I know.
This is showing up in your JSON specifically because you're using the "index" option for the "orient" parameter.
If you want each row in your final dataframe to be a separate entry, you can try the "records" option instead of "index".
df2 = df.to_json(orient="records")
The pandas to_json documentation has a good illustration of the different orient options.
Another option you have is to set one of your columns as the index, such as id or version. This keeps a meaningful key while avoiding the default integer index provided by Pandas.
df = df.set_index('version')
df2 = df.to_json(orient="index")
Related
I have a rather messy dataframe in which I need to assign the first 3 rows as multilevel column names.
This is my dataframe and I need index 3, 4 and 5 to be my multiindex column names.
For example, 'MINERAL TOTAL' should be the level 0 until next item; 'TRATAMIENTO (ts)' should be level 1 until 'LEY Cu(%)' comes up.
What I actually need is to emulate what pandas.read_excel does when 'header' is specified with multiple rows.
Please help!
I am trying this, but no luck at all:
pd.DataFrame(data=df.iloc[3:, :].to_numpy(), columns=tuple(df.iloc[:3, :].to_numpy(dtype='str')))
You can pass a list of row indexes to the header argument and pandas will combine them into a MultiIndex.
import pandas as pd
df = pd.read_excel('ExcelFile.xlsx', header=[0,1,2])
By default, pandas will read in the top row as the sole header row. You can pass the header argument into pandas.read_excel() to indicate which row(s) to use as the header. This can be either an int or a list of ints. See the pandas.read_excel() documentation for more information.
As you mentioned you are unable to use pandas.read_excel(). However, if you do already have a DataFrame of the data you need, you can use pandas.MultiIndex.from_arrays(). First you would need to specify an array of the header rows which in your case would look something like:
array = [df.iloc[0].values, df.iloc[1].values, df.iloc[2].values]
df.columns = pd.MultiIndex.from_arrays(array)
The only issue here is this includes the "NaN" values in the new MultiIndex header. To get around this, you could create some function to clean and forward fill the lists that make up the array.
Although not the prettiest, nor the most efficient, this could look something like the following (off the top of my head):
def forward_fill(iterable):
    return pd.Series(iterable).ffill().to_list()

zero = forward_fill(df.iloc[0].to_list())
one = forward_fill(df.iloc[1].to_list())
two = forward_fill(df.iloc[2].to_list())

array = [zero, one, two]
df.columns = pd.MultiIndex.from_arrays(array)
You may also wish to drop the header rows (in this case rows 0, 1 and 2) and reindex the DataFrame.
df.drop(index=[0,1,2], inplace=True)
df.reset_index(drop=True, inplace=True)
Since columns are also indices, you can just transpose, set index levels, and transpose back.
df.T.fillna(method='ffill').set_index([3, 4, 5]).T
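A toy sketch of that idea, with made-up values standing in for the messy frame (as in the question, rows 3-5 hold the header labels and everything above them is junk):

import pandas as pd
import numpy as np

# Hypothetical stand-in data: rows 0-2 are junk, rows 3-5 are header labels, rows 6+ are values.
raw = pd.DataFrame([
    ['junk', 'junk'],
    ['junk', 'junk'],
    ['junk', 'junk'],
    ['MINERAL TOTAL', np.nan],
    ['TRATAMIENTO (ts)', np.nan],
    ['LEY Cu(%)', 'LEY Mo(%)'],
    [1.2, 0.3],
    [1.5, 0.4],
])

fixed = raw.T.fillna(method='ffill').set_index([3, 4, 5]).T
print(fixed.columns.tolist())
# [('MINERAL TOTAL', 'TRATAMIENTO (ts)', 'LEY Cu(%)'),
#  ('MINERAL TOTAL', 'TRATAMIENTO (ts)', 'LEY Mo(%)')]

The junk rows (0-2) remain in fixed and can be dropped afterwards, e.g. fixed.drop(index=[0, 1, 2]).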
I have a pandas dataframe read in from Excel, as below. The column labels are dates.
Given today's date is 2020-04-13, I want to retrieve the C row values for the next 7 days.
Currently, I set the index and retrieve the values of C; when I print rows I get the output for all the dates and their values of C.
I know that I should use date.today(). Can someone let me know how to capture the column of today's date (2020-04-13) for the C row? I am a beginner to python/pandas and am just learning the concepts of dataframes.
input_path = "./data/input/data.xlsx"
pd_xls_obj = pd.ExcelFile(input_path)
data = pd.read_excel(pd_xls_obj,sheet_name="Sheet1",index_col='Names')
rows = data.loc["C"]
An easy way to do it is to load the data from the workbook with headers and then do (in words) something like: show me data[column: where the date is today][row: where data['Names'] equals 'C'].
I would not go for the version where you use a column (which anyway only has unique values) as the index.
Code example below; I needed to use a try/except, because one of your headers is a string and calling .date() on it would throw an error.
import pandas as pd
import datetime
INPUT_PATH = "C:/temp/data.xlsx"
pd_xls_obj = pd.ExcelFile(INPUT_PATH)
data = pd.read_excel(pd_xls_obj, sheet_name="Sheet1")
column_headers = data.columns
# loop through headers and check if today's date is equal to the column header
for column_header in column_headers:
    # python would throw an AttributeError for all headers which are not
    # datetimes. In order to avoid that, we use a try - except
    try:
        # if the dates are equal, print an output
        if column_header.date() == datetime.date.today():
            # you can read the statement which puts together your result as
            # follows:
            # data[header_i_am_interested_in][where: data['Names'] equals 'C']
            result = data[column_header][data['Names'] == 'C']
            print(result)
    except AttributeError:
        pass
It's unorthodox in pandas to use dates as column labels instead of as the row index. Pandas dtypes go by column, not by row, so pandas won't detect the column labels as datetimes (they stay as strings/objects), and comparison and arithmetic operators on them won't work properly. You'd end up doing lots of unnecessary, avoidable manual conversion to and from datetime. Instead:
You should transpose the dataframe immediately at read-time:
data = pd.read_excel(...).T
Now your dates will all be on one axis (the index) with a single dtype, and you can convert them with pd.to_datetime().
Then, make sure the dtypes are correct, i.e. the index's dtype should be 'datetime', not 'object', 'string' etc. (Please post your dataset or URL in the question to make this reproducible).
Now 'C' will be a column instead of a row.
You can access your entire 'C' column with:
rows = data.loc[:, 'C']
... and similarly you can write an expression to select the subset of rows for your desired dates. Waiting for your data snippet to show the exact code.
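In the meantime, here is a rough sketch of what that could look like, assuming every column label in your sheet parses as a date (the path, sheet name and index_col come from your snippet; the 7-day window is my assumption):

import pandas as pd

data = pd.read_excel("./data/input/data.xlsx", sheet_name="Sheet1", index_col="Names").T
data.index = pd.to_datetime(data.index)   # the former column labels become a DatetimeIndex
today = pd.Timestamp.today().normalize()
next_week = data.loc[today:today + pd.Timedelta(days=6), "C"]   # 'C' values for today plus the next 6 days
print(next_week)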
I am trying to reference a list of expired orders from one spreadsheet (df name = data2), and vlookup them on the new orders spreadsheet (df name = data) to delete all the rows that contain expired orders, then return a new spreadsheet (df name = results).
I am having trouble trying to mimic what I do in Excel with vlookup/sort/delete in pandas. Please view the pseudo code/steps:
1. Import simple.xls as a dataframe called 'data'.
2. Import wo.xlsm, sheet name "T", as a dataframe called 'data2'.
3. Do a vlookup using Column "A" in 'data' as the values to be matched against the same values in Column "A" of 'data2' (they're both just Order IDs).
4. For all values that exist in Column A of 'data2' and also in Column "A" of 'data', group (if necessary) and delete the entire row (there are 26 columns) for each matched Order ID found in Column A of both datasets. To reiterate, delete the entire row for the matches found in the 'data' file. Save the smaller dataset as 'results'.
import pandas as pd
data = pd.read_excel("ors_simple.xlsx", encoding = "ISO-8859-1",
dtype=object)
data2 = pd.read_excel("wos.xlsm", sheet_name = "T")
results = data.merge(data2,on='Work_Order')
writer = pd.ExcelWriter('vlookuped.xlsx', engine='xlsxwriter')
results.to_excel(writer, sheet_name='Sheet1')
writer.save()
I re-read your question and think I understand it correctly. You want to find out if any order in new_orders (you call it data) has expired, using expired_orders (you call it data2).
If you rephrase your question what you want to do is: 1) find out if a value in a column in a DataFrame is in a column in another DataFrame and then 2) drop the rows where the value exists in both.
Using pd.merge is one way to do this. But since you want to use expired_orders to filter new_orders, pd.merge seems a bit overkill.
Pandas actually has a method for doing this sort of thing and it's called isin() so let's use that! This method allows you to check if a value in one column exists in another column.
df_1['column_name'].isin(df_2['column_name'])
isin() returns a Series of True/False values that you can apply to filter your DataFrame by using it as a mask: df[bool_mask].
So how do you use this in your situation?
is_expired = new_orders['order_column'].isin(expired_orders['order_column'])
results = new_orders[~is_expired].copy() # Use copy to avoid SettingWithCopyWarning.
~ is equal to not, so ~is_expired means that the order wasn't expired.
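Put together with the file names from your snippet (the Work_Order column name comes from your merge call; adjust it if the order-id column is named differently), a sketch might look like:

import pandas as pd

data = pd.read_excel("ors_simple.xlsx", dtype=object)     # new orders
data2 = pd.read_excel("wos.xlsm", sheet_name="T")         # expired orders

is_expired = data["Work_Order"].isin(data2["Work_Order"])
results = data[~is_expired].copy()

results.to_excel("vlookuped.xlsx", sheet_name="Sheet1", index=False)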
There are two DataFrames that I want to merge:
DataFrame A columns: index, userid, locale (2000 rows)
DataFrame B columns: index, userid, age (300 rows)
When I perform the following:
pd.merge(A, B, on='userid', how='outer')
I got a DataFrame with the following columns:
index, Unnamed:0, userid, locale, age
The index column and the Unnamed:0 column are identical. I guess the Unnamed:0 column is the index column of DataFrame B.
My question is: is there a way to avoid this Unnamed column when merging two DFs?
I can drop the Unnamed column afterwards, but just wondering if there is a better way to do it.
In summary, what you're doing is saving the index to file and when you're reading back from the file, the column previously saved as index is loaded as a regular column.
There are a few ways to deal with this:
Method 1
When saving a pandas.DataFrame to disk, use index=False like this:
df.to_csv(path, index=False)
Method 2
When reading from file, you can define the column that is to be used as index, like this:
df = pd.read_csv(path, index_col='index')
Method 3
If method #2 does not suit you for some reason, you can always set the column to be used as index later on, like this:
df.set_index('index', inplace=True)
After this point, your dataframe should look like this:
      userid locale  age
index
0      A1092  EN-US   31
1      B9032  SV-SE   23
I hope this helps.
Either don't write the index when saving the DataFrame to a CSV file (df.to_csv('...', index=False)), or, if you have to deal with CSV files you can't change/edit, use the usecols parameter:
A = pd.read_csv('/path/to/fileA.csv', usecols=['userid','locale'])
in order to get rid of the Unnamed:0 column ...
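For example, reading both files with only the columns you need keeps the stray index column out of the merge entirely (fileB.csv here is a hypothetical counterpart to fileA.csv):

import pandas as pd

A = pd.read_csv('/path/to/fileA.csv', usecols=['userid', 'locale'])
B = pd.read_csv('/path/to/fileB.csv', usecols=['userid', 'age'])   # hypothetical path

merged = pd.merge(A, B, on='userid', how='outer')
print(merged.columns.tolist())   # ['userid', 'locale', 'age'] - no Unnamed: 0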
I have a new DataFrame df, which was created using:
df= pd.DataFrame()
I have a date value called 'day', which is in the format dd-mm-yyyy, and a cost value called 'cost'.
How can I append the date and cost values to the df and assign the date as the index?
So for example if I have the following values
day = 01-01-2001
cost = 123.12
the resulting df would look like
date cost
01-01-2001 123.12
I will eventually be adding paired values for multiple days, so the df will eventually look something like:
date cost
01-01-2001 123.12
02-01-2001 23.25
03-01-2001 124.23
: :
01-07-2016 2.214
I have tried to append the paired values to the data frame but am unsure of the syntax. I've tried various things including the below, but without success.
df.append([day,cost], columns='date,cost',index_col=[0])
There are a few things here. First, making a column the index goes like this, though you can also do it when you load the dataframe from a file (see below):
df.set_index('date', inplace=True)
To add new rows, you should write them out to file first. Pandas isn't great at adding rows dynamically, and this way you can just read the data in when you need it for analysis.
new_row = ...  # a row of new data in string format, with values
               # separated by commas and ending with \n
with open(path, 'a') as f:
    f.write(new_row)
You can do this in a loop, or singly, as many times as you need. Then when you're ready to work with it, you use:
df = pd.read_csv(path, index_col=0, parse_dates=True)
index_col=0 uses the first column on disk as the index (you can also pass the column name as a string). Passing parse_dates=True will turn the datetime strings that you declared as the index into datetime objects.
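Putting the two pieces together, a small end-to-end sketch (prices.csv and the sample values are made up for illustration; dayfirst=True handles the dd-mm-yyyy format):

import pandas as pd

path = 'prices.csv'                    # hypothetical file
with open(path, 'w') as f:
    f.write('date,cost\n')             # header row
    f.write('01-01-2001,123.12\n')     # one "day,cost" pair per line
    f.write('02-01-2001,23.25\n')

df = pd.read_csv(path, index_col=0, parse_dates=True, dayfirst=True)
print(df)
#               cost
# date
# 2001-01-01  123.12
# 2001-01-02   23.25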
Try this:
dfapp = pd.DataFrame({'cost': [cost]}, index=[day])
df = df.append(dfapp)  # append returns a new DataFrame (on newer pandas use pd.concat([df, dfapp]))