Python Pandas: print the csv data in order with columns - python

Hi, I am new to Python. I am using pandas to read a csv file and print its data. The code is as follows:
import numpy as np
import pandas as pd
import codecs
from pandas import Series, DataFrame
dframe = pd.read_csv("/home/vagrant/geonlp_japan_station.csv", sep=',',
                     encoding="Shift-JIS")
print (dframe.head(2))
but the data is printed like the following (this is just an example to show it):
However, I want the data to be ordered and aligned by column, like this:
I don't know how to make the printed output line up clearly. Thanks in advance!

You can check the unicode formatting options in pandas and set:
pd.set_option('display.unicode.east_asian_width', True)
I tested it with a UTF-8 version of the csv:
dframe = pd.read_csv("test/geonlp_japan_station/geonlp_japan_station_20130912_u.csv")
and the alignment of the output seems better:
pd.set_option('display.unicode.east_asian_width', True)
print(dframe)
pd.set_option('display.unicode.east_asian_width', False)
print(dframe)
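As a quick way to see what the option does, here is a minimal sketch with made-up Japanese data (not the original csv):
import pandas as pd

# Hypothetical frame containing east-asian wide characters
df = pd.DataFrame({"駅名": ["東京", "新宿"], "code": [1, 2]})

pd.set_option('display.unicode.east_asian_width', True)
print(df)   # wide characters count as two cells, so the columns line up

pd.set_option('display.unicode.east_asian_width', False)
print(df)   # default width; columns with wide characters drift out of line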

Related

how to insert data from list into excel in python

For example, I exported this data from a log file:
data= ["101","am1","123450","2015-01-01 11:19:00","test1 test1".....]
["102","am2","123451","2015-01-01 11:20:00","test2 test3".....]
["103","am3","123452","2015-01-01 11:21:00","test3 test3".....]
Output result (screenshot): https://i.stack.imgur.com/7uTOE.png
The module pandas has a DataFrame.to_excel() function that would do that.
import pandas as pd
data= [["101","am1","123450","2015-01-01 11:19:00","test1 test1"],
["102","am2","123451","2015-01-01 11:20:00","test2 test3"],
["103","am3","123452","2015-01-01 11:21:00","test3 test3"]]
df = pd.DataFrame(data)
df.to_excel('my_data.xlsx')
That should do it.
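If you also want header names in the spreadsheet instead of the default 0, 1, 2... labels, pass them when building the frame. A minimal sketch with made-up column names (rename them to whatever your log fields mean); note that writing .xlsx requires openpyxl:
import pandas as pd

data = [["101", "am1", "123450", "2015-01-01 11:19:00", "test1 test1"],
        ["102", "am2", "123451", "2015-01-01 11:20:00", "test2 test3"],
        ["103", "am3", "123452", "2015-01-01 11:21:00", "test3 test3"]]

# The header names below are hypothetical placeholders
df = pd.DataFrame(data, columns=["id", "code", "number", "timestamp", "text"])
df.to_excel('my_data.xlsx', index=False)  # index=False drops the row index column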

Using pandas, how do I turn one csv file column into a list and then filter a different csv with the created list?

Basically I have one csv file called 'Leads.csv' and it contains all the sales leads we already have. I want to turn its 'Leads' column into a list and then check a 'Report' csv to see whether any of those leads are already in there, and filter them out.
Here's what I have tried:
import pandas as pd
df_leads = pd.read_csv('Leads.csv')
leads_list = df_leads['Leads'].values.tolist()
df = pd.read_csv('Report.csv')
df = df.loc[(~df['Leads'].isin(leads_list))]
df.to_csv('Filtered Report.csv', index=False)
Any help is much appreciated!
You can try:
import pandas as pd
df_leads = pd.read_csv('Leads.csv')
df = pd.read_csv('Report.csv')
set_filtered = set(df['Leads']) - set(df_leads['Leads'])
df_filtered = df[df['Leads'].isin(set_filtered)]
Note: sets are significantly faster than lists for membership checks like this.
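To round it off, the filtered frame can be written back out the same way as in the question (the file name is taken from there):
# Write the filtered report, dropping the pandas row index
df_filtered.to_csv('Filtered Report.csv', index=False)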

How to delete the \xa0 symbol from a whole dataframe in pandas?

I have a data frame with information from the 2011 census (Chile). Some column names and some variable values contain the \xa0 symbol, which makes it troublesome to reference parts of the data frame. My code is the following:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
#READ AND SAVE EACH DATA SET IN A DATA FRAME
data_2011 = pd.read_csv("2011.csv")
#SHOW FIRST 5 ROWS
print(data_2011.head())
Doing this, I got the following output:
So far, everything is right, but when I want to see the column names using:
print("ATRIBUTOS DATOS CENSO 2011",list(data_2011))
I got the following output:
How can I fix this? Thanks in advance!
Try the following code:
import re
# Strip the \xa0 (non-breaking space) characters from the column names
data_2011.columns = [re.sub(r'\xa0', '', col) for col in data_2011.columns]
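Since the question says the symbol also shows up in variable values, here is a pandas-native sketch that cleans both the column labels and the string cells, assuming data_2011 as above:
# Remove \xa0 from the column labels, then from every string value
data_2011.columns = data_2011.columns.str.replace('\xa0', '', regex=False)
data_2011 = data_2011.replace('\xa0', '', regex=True)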

How to optimize a python script into a pyspark function

I am writing a pyspark program that takes a txt file and then adds a few columns to the left (beginning) of the columns in the file.
My text file looks like this:
ID,Name,Age
1233,James,15
After I run the program I want it to add two columns named creation_DT and created_By to the left of the table. I am trying to get it to look like this:
Creation_DT,Created_By,ID,Name,Age
"current timestamp", Sean,1233,James,15
The code below gets my required output, but I was wondering if there is an easier way to optimize my script using pyspark.
import pandas as pd

df = pd.read_csv("/home/path/Sample Text Files/sample5.txt", delimiter=",")
df.insert(loc=0, column='Creation_DT', value=pd.to_datetime('today'))
df.insert(loc=1, column='Create_BY', value="Sean")
df.to_csv("/home/path/new/new_file.txt", index=False)
Any ideas or suggestions?
Yes, it is relatively easy to convert this to pyspark code:
from pyspark.sql import functions as sf
import datetime

# read in using the dataframe reader
# if you store your csv locally, use file:///
# or use hdfs:/// if you store your csv in a cluster/HDFS
spdf = (spark.read.format("csv").option("header", "true")
        .load("file:///home/path/Sample Text Files/sample5.txt"))

spdf2 = (
    spdf
    .withColumn("Creation_DT", sf.lit(datetime.date.today().strftime("%Y-%m-%d")))
    .withColumn("Create_BY", sf.lit("Sean"))
)

spdf2.write.csv("file:///home/path/new/new_file.txt")
This code assumes you are filling creation_dt and create_by with the same value for every row.
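If you want a real timestamp rather than today's date rendered as a string, Spark has a built-in for that. A small sketch reusing the same spdf as above:
from pyspark.sql import functions as sf

# current_timestamp() is evaluated by Spark at query time
spdf2 = (spdf
         .withColumn("Creation_DT", sf.current_timestamp())
         .withColumn("Create_BY", sf.lit("Sean")))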
I don't see you using any pyspark in your code, so I'll just stick with pandas and do it this way:
cols = df.columns
df['Creation_DT'] = pd.to_datetime('today')
df['Create_BY'] = "Sean"
cols = cols.insert(0, 'Create_BY')
cols = cols.insert(0, 'Creation_DT')
df = df[cols]  # reorder; assigning to df.columns would only rename the columns
df.to_csv("/home/path/new/new_file.txt", index=False)

Overwrite specific columns after modification pandas python

I have a csv file where I made some modifications in two columns. My question is the following: how can I print my csv file with the updated columns? My code is the following:
import pandas as pd
import csv
data = pd.read_csv("testdataset.csv")
data = data.join(pd.get_dummies(data["ship_from"]))
data = data.drop("ship_from", axis=1)
data['market_name'] = data['market_name'].map(lambda x: str(x)[39:-1])
data = data.join(pd.get_dummies(data["market_name"]))
data = data.drop("market_name", axis=1)
Thank you in advance!
You can write to a file with pandas.DataFrame.to_csv:
data.to_csv('your_file.csv')
Alternatively, you can view it without writing a file:
print(data.to_csv())
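If the written file should mirror the original layout, it usually helps to drop the pandas row index when writing (the file name here is just a hypothetical):
# index=False stops pandas from writing the row index as an extra first column
data.to_csv('testdataset_updated.csv', index=False)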
