How can I replace the industryName label with tradeDate and remove that blank row in my pivot table? The screenshot below was created with IPython-Dashboard.
You can reset_index to convert your index to a regular column.
data_pivot.reset_index()
tradeDate is the name of your index. You can remove it via:
data_pivot.index.name = None
industryName is the name of your columns. You can change it to tradeDate via:
data_pivot.columns.name = 'tradeDate'
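Putting the three steps together on a small stand-in pivot table (the industry names and values here are made up for illustration; only the tradeDate/industryName labels come from the question):

```python
import pandas as pd

# Stand-in for the pivot table in the question: rows indexed by
# tradeDate, columns labelled industryName.
pivot = pd.DataFrame(
    {"Bank": [1.0, 2.0], "Steel": [3.0, 4.0]},
    index=pd.Index(["2016-01-04", "2016-01-05"], name="tradeDate"),
)
pivot.columns.name = "industryName"

# reset_index turns the tradeDate index into a regular column.
flat = pivot.reset_index()
print(flat)

# Alternatively, relabel the axes in place.
pivot.columns.name = "tradeDate"  # replace 'industryName' with 'tradeDate'
pivot.index.name = None           # remove the index name (the blank row label)
print(pivot)
```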
Related
I have the following dataframe output that I would like to convert to JSON, but Pandas numbers each row by default, and that leading zero gets added to the JSON. How do I remove it?
id version ... token type_id
0 10927076529 0 ... 56599bb6-3b56-425b-8688-8fc0c73fbedc 3
{"0":{"id":10927076529,"version":0,"token":"56599bb6-3b56-425b-8688-8fc0c73fbedc","type_id":3}}
df = df.rename(columns={'id': 'version', 'token': 'type_id' })
df2 = df.to_json(orient="index")
print(df2)
Pandas has that 0 value as the row index for your single DataFrame entry. You can't remove it in the actual DataFrame as far as I know.
This is showing up in your JSON specifically because you're using the "index" option for the "orient" parameter.
If you want each row in your final dataframe to be a separate entry, you can try the "records" option instead of "index".
df2 = df.to_json(orient="records")
This hyperlink has a good illustration of the different options.
Another option is to set one of your columns as the index, such as id or version. This keeps a meaningful key without using the default integer index provided by Pandas.
df = df.set_index('version')
df2 = df.to_json(orient="index")
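The difference between the two orientations can be seen with a one-row frame built from the values in the post:

```python
import pandas as pd

df = pd.DataFrame([{
    "id": 10927076529,
    "version": 0,
    "token": "56599bb6-3b56-425b-8688-8fc0c73fbedc",
    "type_id": 3,
}])

# orient="index": the row index becomes the JSON key, e.g. {"0": {...}}
print(df.to_json(orient="index"))

# orient="records": a JSON array with no row index, e.g. [{...}]
print(df.to_json(orient="records"))

# Or use one of your own columns as the key instead of the default index.
print(df.set_index("id").to_json(orient="index"))
```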
I am trying to remove a column and special characters from the dataframe shown below.
The code below used to create the dataframe is as follows:
dt = pd.read_csv(StringIO(response.text), delimiter="|", encoding='utf-8-sig')
The above produces the following output:
I need help with a regex to remove the stray characters and delete the first column.
As regards regex, I have tried the following:
dt.withColumn('COUNTRY ID', regexp_replace('COUNTRY ID', #"[^0-9a-zA-Z_]+"_ ""))
However, I'm getting a syntax error.
Any help much appreciated.
If the position of the incoming column is fixed, you can use a regex to remove the extra characters from the column name, like below:
import re

colname = pdf.columns[0]
colt = re.sub(r"[^0-9a-zA-Z_\s]+", "", colname)
print(colname, colt)
pdf.rename(columns={colname: colt}, inplace=True)
And for dropping the index column you can refer to this stack answer.
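A self-contained sketch of the whole flow. The CSV text here is a made-up stand-in for the question's data, with a UTF-8 BOM prepended to simulate the junk characters in the header:

```python
import re
from io import StringIO

import pandas as pd

# Simulated input: "\ufeff" stands in for the stray bytes in the real file.
csv_text = "\ufeffCOUNTRY ID|COUNTRY NAME\n1|France\n2|Japan\n"
pdf = pd.read_csv(StringIO(csv_text), delimiter="|")

# Strip everything that is not alphanumeric, underscore, or whitespace
# from the first column name.
colname = pdf.columns[0]
colt = re.sub(r"[^0-9a-zA-Z_\s]+", "", colname)
pdf.rename(columns={colname: colt}, inplace=True)

print(pdf.columns.tolist())  # ['COUNTRY ID', 'COUNTRY NAME']
```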
You have read the data in as a pandas dataframe, but from what I see you want a Spark dataframe. Convert from pandas to Spark and rename the columns; that will drop the pandas default index column, which in your case you refer to as the first column. Code below:
df = spark.createDataFrame(df).toDF('COUNTRY', 'COUNTRY NAME')
df.show()
So I have a file in Excel where some rows have dates and others don't, as can be seen.
I read this excel file into a pandas dataframe, rename the column and get the following:
My question is: how would I fill every empty date in the dataframe with the last previous date encountered? All of the blanks between 04/03/2021 and 05/03/2021 get replaced with 04/03/2021, so every row in my dataframe has a date associated with it.
Thanks!
After reading the data into a dataframe, you can fill missing values using fillna with method='ffill' for a forward fill (newer pandas versions prefer the equivalent .ffill()).
Just using the inbuilt pandas method:
duplicate_df['StartDate'] = duplicate_df['StartDate'].fillna(method = 'ffill')
This replaces each NaN in the column with the most recent non-missing value above it.
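A minimal sketch with made-up dates matching the question (the column name StartDate is taken from the answer above):

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the question: some rows lack a date.
df = pd.DataFrame({
    "StartDate": ["04/03/2021", np.nan, np.nan, "05/03/2021", np.nan],
    "Value": [1, 2, 3, 4, 5],
})

# Forward-fill: each NaN takes the last non-missing date above it.
df["StartDate"] = df["StartDate"].ffill()  # same as fillna(method='ffill')
print(df["StartDate"].tolist())
# ['04/03/2021', '04/03/2021', '04/03/2021', '05/03/2021', '05/03/2021']
```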
I have a data frame with three columns, among them 'A' and 'Amount'. I have done a groupby using 'A'. Now I want to insert these values into a new data frame; how can I achieve this?
top_plt=pd.DataFrame(top_plt.groupby('A')['Amount'].sum())
The resulting dataframe contains only the Amount column but the groupby 'A' column is missing.
The DataFrame constructor is not necessary; it is better to add as_index=False to the groupby:
top_plt= top_plt.groupby('A', as_index=False)['Amount'].sum()
Or add DataFrame.reset_index:
top_plt= top_plt.groupby('A')['Amount'].sum().reset_index()
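Both variants produce the same flat result, shown here on toy data (the values are assumptions; the column names come from the question):

```python
import pandas as pd

# Toy data standing in for the question's frame.
top_plt = pd.DataFrame({"A": ["x", "x", "y"], "Amount": [10, 20, 5]})

# as_index=False keeps 'A' as a regular column...
out1 = top_plt.groupby("A", as_index=False)["Amount"].sum()

# ...and reset_index achieves the same thing after the fact.
out2 = top_plt.groupby("A")["Amount"].sum().reset_index()

print(out1)
```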
I have a pivot table with a multi-index in the names of the columns, like this:
The data itself is correct and I want to keep it, but I want to give each column a single name that combines all the index levels, to get something like this:
You can flatten a multi-index by converting it to a dataframe with text columns and joining them:
df.columns = df.columns.to_frame().astype(str).apply(''.join, axis=1)
The result should not be far from what you want. But as you have not given any reproducible example, I could not test against your data...
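For instance, on a hypothetical frame with two column levels (the level names here are invented, since the question gave no reproducible data):

```python
import pandas as pd

# Hypothetical pivot-like frame with a two-level column index.
df = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=pd.MultiIndex.from_tuples([("Amount", "mean"), ("Amount", "sum")]),
)

# Convert the MultiIndex to a frame of strings and join each row's
# levels into one flat column name.
df.columns = df.columns.to_frame().astype(str).apply("".join, axis=1)
print(df.columns.tolist())  # ['Amountmean', 'Amountsum']
```

Joining with a separator such as "_".join instead of "".join gives more readable names like 'Amount_mean'.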