read json object and create csv string from it in python

I have an object array in string format:
'[{"date":"2014-10-05T01:12:00.000Z","count":56.4691}, {"date":"2014-10-05T01:14:00.000Z","count":23.4691}, ...]'
I want to transform the string into CSV format like this:
"","date","count"
"1",2014-09-25 14:01:00,182.478
"2",2014-09-25 14:01:00,182.478
To do this, I first read the string with the read_json function from the pandas library, but it sorted the columns, so the count column comes before the date column. How can I get this transformation in Python?

Use the columns parameter in df.to_csv.
Ex:
import pandas as pd

s = '[{"date":"2014-10-05T01:12:00.000Z","count":56.4691}, {"date":"2014-10-05T01:14:00.000Z","count":23.4691}]'
df = pd.read_json(s)
# columns controls the output order, restoring date before count
df.to_csv(r"PATH\B.csv", columns=["date", "count"])
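Since the question actually asks for a CSV string rather than a file, note that to_csv returns a string when no path is given. A minimal sketch (wrapping the input in StringIO, which newer pandas versions expect for literal JSON strings):
import io
import pandas as pd

s = '[{"date":"2014-10-05T01:12:00.000Z","count":56.4691}, {"date":"2014-10-05T01:14:00.000Z","count":23.4691}]'
# Newer pandas versions want a file-like object rather than a raw JSON string
df = pd.read_json(io.StringIO(s))
# With no path argument, to_csv returns the CSV text as a string
csv_string = df.to_csv(columns=["date", "count"])
print(csv_string)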

Related

Function to take a list of Spark dataframes and convert each to pandas then csv

import pyspark

dfs = [df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12, df13, df14, df15]
for x in dfs:
    y = x.toPandas()
    y.to_csv("D:/data")
This is what I wrote, but I actually want a function that takes this list, converts every df into a pandas df, then converts each to CSV and saves it to a particular directory, in the same order as it appears in the dfs list. Is there a possible way to write such a function?
PS: D:/data is just an imaginary path, used for explanation.
When you convert each dataframe to a CSV file, you still need to give df.to_csv a distinct file path. So, try:
for x in dfs:
    y = x.toPandas()
    y.to_csv(f"D:/data/df{dfs.index(x) + 1}.csv")
I set it as df{dfs.index(x) + 1} so that the file names will be df1, df2, ... etc.
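Note that dfs.index(x) rescans the list on every iteration and returns the position of the first match, so it would misnumber files if the same dataframe appears twice in dfs. A sketch of the same loop using enumerate, which avoids both issues:
for i, x in enumerate(dfs, start=1):
    # enumerate yields each position directly, so duplicates are numbered correctly
    x.toPandas().to_csv(f"D:/data/df{i}.csv")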

How do I read only specific columns from a JSON dataframe?

I have a JSON dataframe with 12 columns; however, I only want to read columns 2 and 5, which are named "name" and "score".
Currently, the code I have is:
df = pd.read_json("path",orient='columns', lines=True)
print(df.head())
That displays every column, as would be expected.
After reading through the documentation here:
https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
I can't find any real way to parse only certain columns from JSON, compared to CSV, where you can select columns using usecols=[].
Pass a list of columns for indexing:
df[["name","score"]]

How to store tuples in a pandas dataframe cell?

I have a CSV import of data stored in the following fashion:
username;groups
alice;(admin,user)
bob;(user)
I want to do some data analysis on it, so I import it into a pandas dataframe such that the first column is stored as a string and the second as a tuple.
I tried mydataframe = pd.read_csv('file.csv', sep=';') and then converting the groups column with the astype method, mydataframe['groups'].astype('tuple'), but it won't work.
How do you store objects other than strings/ints/floats in dataframes?
Thanks.
Untested, but try:
mydataframe['groups'] = mydataframe['groups'].apply(lambda text: tuple(text[1:-1].split(',')))
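A self-contained sketch of the whole pipeline, using io.StringIO to stand in for file.csv:
import io
import pandas as pd

raw = "username;groups\nalice;(admin,user)\nbob;(user)\n"  # sample data from the question
mydataframe = pd.read_csv(io.StringIO(raw), sep=';')
# Strip the surrounding parentheses, split on commas, and build a tuple per row
mydataframe['groups'] = mydataframe['groups'].apply(lambda text: tuple(text[1:-1].split(',')))
print(mydataframe['groups'].iloc[0])  # ('admin', 'user')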

Is there a possible way to merge Excel rows for duplicate cells in a column with Python?

I am still new to Python; could you please help me with this?
I have this Excel sheet (screenshot omitted)
and I want it to look like this (screenshot omitted)
You can read the CSV data into a pandas dataframe like this:
import pandas as pd

df = pd.read_csv("Input.csv")
Then do the data manipulation:
df = df.groupby(['Name'])['Training'].apply(', '.join).reset_index()
Finally, create an output CSV file:
df.to_csv('Output.csv', sep='\t')
You could use pandas to create a DataFrame for manipulating the Excel sheet's information. First, load the file using the read_excel function (this creates a DataFrame), and then use groupby and apply to concatenate the strings.
import pandas as pd

# Read the Excel file
df = pd.read_excel('tmp.xlsx')
# Group by the column(s) that you need, then use apply to join the strings
df = df.groupby(['Name'])['Training'].apply(','.join).reset_index()
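As a quick illustration of the groupby/join step, a sketch with made-up data standing in for the sheet (the Name/Training columns are the only assumption carried over from the answers above):
import pandas as pd

df = pd.DataFrame({
    'Name': ['alice', 'alice', 'bob'],
    'Training': ['python', 'sql', 'excel'],
})
merged = df.groupby(['Name'])['Training'].apply(', '.join).reset_index()
print(merged)
#     Name     Training
# 0  alice  python, sql
# 1    bob        excel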

Python Pandas - read date column as string

I have some data in an Excel file and I read it using pandas' read_excel method.
However, I want to read the entire data in all columns as strings, including the date column.
The problem is that I want to leave the date column in its original format as a string. For example, I have '31.01.2017' in Excel, formatted as a date, and I want '31.01.2017' in my data frame.
I thought using the dtype parameter of read_excel with dtype=str was the correct approach. But pandas then reads the date column as datetime and converts it to string, so in the end I always have '2017-01-31 00:00:00' in my data frame.
Is there any way to do this?
The behavior of pandas makes sense:
If the Excel format of your date column is text, pandas will read the dates as strings by default.
If the Excel format of your date column is date, pandas will read the dates as dates.
However, you point out that in the Excel file the date column is formatted as a date. If that's the case, there is no string in your Excel file to begin with. The underlying data of the date column is stored as a float; the string you are seeing is not the actual data. You can't read something as a raw string if it isn't a string.
Some more info: https://xlrd.readthedocs.io/en/latest/formatting.html
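To see why the cell is "really" a float, here is a hedged illustration of how an Excel serial number (1900 date system) maps to a date; the serial value below is hypothetical:
from datetime import datetime, timedelta

serial = 42766  # hypothetical raw cell value
# Excel's epoch is effectively 1899-12-30 (accounting for the 1900 leap-year bug)
print(datetime(1899, 12, 30) + timedelta(days=serial))  # 2017-01-31 00:00:00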
But let's say, for some reason, you want Python to display the same format as Excel, but in string form, without looking in the Excel file.
First you'd have to find the format:
from openpyxl import load_workbook
wb = load_workbook('data.xlsx')
ws = wb.worksheets[0]
print(ws.cell(1,5).number_format) # look at the cell you are interested in
> '[$]dd/mm/yyyy;#'
and then convert it to something the strftime function understands:
https://www.programiz.com/python-programming/datetime/strftime#format-code
form = ws.cell(1,5).number_format  # e.g. '[$]dd/mm/yyyy;#' from above
form = form[3:-2]  # strip the '[$]' prefix and ';#' suffix
form = form.replace('dd','%d')
form = form.replace('mm','%m')
form = form.replace('yyyy','%Y')
print(form)
> '%d/%m/%Y'
And apply it:
df.loc[:,"date_field"].apply(lambda x: x.strftime(form))
> 0 01/02/2018
1 02/02/2018
2 03/02/2018
3 04/02/2018
4 05/02/2018
However, if you're working with multiple Excel date formats, you'd have to build a strftime mapping for each of them.
There are probably more practical ways of doing this, like receiving the data in CSV format or just keeping the dates in Excel's text format in the first place.
As you are trying to keep the date column in its initial type, the following code may help you. The first line collects into the variable cols all the columns except the date column, and the next two lines change the type of those remaining columns:
cols = [i for i in df.columns if i not in ["Date_column"]]
for col in cols:
    df[col] = df[col].astype('category')
Hope it helps! :-)
df['date_column'] = df['date_column'].dt.strftime('%d.%m.%Y')
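For completeness, a minimal sketch of that one-liner, assuming the column was parsed as datetime and the target format is the question's day.month.year style:
import pandas as pd

# Hypothetical frame standing in for the parsed Excel data
df = pd.DataFrame({'date_column': pd.to_datetime(['2017-01-31', '2017-02-01'])})
# Render the datetimes back into '31.01.2017'-style strings
df['date_column'] = df['date_column'].dt.strftime('%d.%m.%Y')
print(df['date_column'].tolist())  # ['31.01.2017', '01.02.2017']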
