Overwrite specific columns after modification pandas python - python

I have a CSV file in which I modified two columns. My question is the following: how can I print my CSV file with the updated columns? My code is the following:
import pandas as pd
import csv
data = pd.read_csv("testdataset.csv")
data = data.join(pd.get_dummies(data["ship_from"]))
data = data.drop("ship_from", axis=1)
data['market_name'] = data['market_name'].map(lambda x: str(x)[39:-1])
data = data.join(pd.get_dummies(data["market_name"]))
data = data.drop("market_name", axis=1)
Thank you in advance!

You can write to a file with pandas.DataFrame.to_csv:
data.to_csv('your_file.csv')
To view the result without writing a file, call to_csv without a path; it then returns the CSV text:
print(data.to_csv())
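As a minimal sketch of the whole round trip, the example below uses a made-up two-row frame standing in for testdataset.csv (the column names other than ship_from and all values are assumptions). Passing index=False keeps the row index out of the output, which is usually what you want when overwriting a CSV:

```python
import pandas as pd

# Hypothetical stand-in for testdataset.csv; the values are made up.
data = pd.DataFrame({"ship_from": ["US", "UK"], "price": [10, 20]})

# Same pattern as in the question: one-hot encode, then drop the source column.
data = data.join(pd.get_dummies(data["ship_from"]))
data = data.drop("ship_from", axis=1)

# With no path, to_csv returns the CSV text instead of writing a file.
# index=False leaves the row index out of the output.
csv_text = data.to_csv(index=False)
print(csv_text)
```

To save instead of print, pass a path as the first argument, for example data.to_csv('your_file.csv', index=False).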

Related

csv file to excel, resulting in messy table

I want to convert my CSV file to Excel, but the first line of the CSV gets read as the header.
I first created a CSV from the lists below, then I used pandas to convert it to Excel:
import csv
import pandas as pd
id=["id",1,2,3,4,5]
name=["name","Salma","Ahmad","Manar","Mustapha","Zainab"]
age=["age",14,12,15,13,10]
#this is how i created the csv file
Csv='path/csvfile.csv'
open_csv=open(Csv, 'w')
outfile=csv.writer(open_csv)
outfile.writerows([id]+[name]+[age])
open_csv.close()
#Excel file
Excel='path/Excelfile.xlsx'
Excel_open=open(Excel, 'w')
csv_file=pd.read_csv(Csv)
csv_file.to_excel(Excel)
This is what I get from this code
"Results"
I want the Id title to be in the same column as name and age
I would suggest this instead:
import pandas as pd
df = pd.DataFrame({
"id": [1,2,3,4,5],
"name":["Salma","Ahmad","Manar","Mustapha","Zainab"],
"age":[14,12,15,13,10]
})
df.to_excel("excel_file.xlsx", index=False)
This way the dataframe is easier to create and easier to understand.
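If the CSV step has to stay, the messy table most likely comes from writerows treating each list as one row, so "id,1,2,3,4,5" ends up as the header line. A minimal sketch of one way around that, assuming the same three lists: transpose them with zip so that each CSV row holds one id, name, and age. The in-memory buffer is only a stand-in for the real file.

```python
import csv
import io

import pandas as pd

ids = ["id", 1, 2, 3, 4, 5]
names = ["name", "Salma", "Ahmad", "Manar", "Mustapha", "Zainab"]
ages = ["age", 14, 12, 15, 13, 10]

# zip(...) transposes the three lists, so each CSV row is (id, name, age)
# and the first row becomes the header.
buffer = io.StringIO()  # in-memory stand-in for the CSV file
csv.writer(buffer).writerows(zip(ids, names, ages))

# Read it back the same way the question does with pd.read_csv.
buffer.seek(0)
df = pd.read_csv(buffer)
print(df)
```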

how to insert data from list into excel in python

For example, I exported this data from a log file:
data= ["101","am1","123450","2015-01-01 11:19:00","test1 test1".....]
["102","am2","123451","2015-01-01 11:20:00","test2 test3".....]
["103","am3","123452","2015-01-01 11:21:00","test3 test3".....]
Output result: https://i.stack.imgur.com/7uTOE.png
The pandas module has a DataFrame.to_excel() method that will do that:
import pandas as pd
data= [["101","am1","123450","2015-01-01 11:19:00","test1 test1"],
["102","am2","123451","2015-01-01 11:20:00","test2 test3"],
["103","am3","123452","2015-01-01 11:21:00","test3 test3"]]
df = pd.DataFrame(data)
df.to_excel('my_data.xlsx')
That should do it.

Using pandas, how do I turn one csv file column into list and then filter a different csv with the created list?

Basically I have one csv file called 'Leads.csv' and it contains all the sales leads we already have. I want to turn this csv column 'Leads' into a list and then check a 'Report' csv to see if any of the leads are already in there and then filter it out.
Here's what I have tried:
import pandas as pd
df_leads = pd.read_csv('Leads.csv')
leads_list = df_leads['Leads'].values.tolist()
df = pd.read_csv('Report.csv')
df = df.loc[(~df['Leads'].isin(leads_list))]
df.to_csv('Filtered Report.csv', index=False)
Any help is much appreciated!
You can try:
import pandas as pd
df_leads = pd.read_csv('Leads.csv')
df = pd.read_csv('Report.csv')
set_filtered = set(df['Leads'])-(set(df_leads['Leads']))
df_filtered = df[df['Leads'].isin(set_filtered)]
Note: sets are generally much faster than lists for membership tests like this one.
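For comparison, here is a minimal sketch of the isin approach from the question, run on made-up in-memory frames standing in for Leads.csv and Report.csv (the company names and the Score column are assumptions). On data like this it keeps the same rows as the set-difference version:

```python
import pandas as pd

# Made-up stand-ins for Leads.csv and Report.csv.
df_leads = pd.DataFrame({"Leads": ["Acme", "Globex"]})
df = pd.DataFrame({"Leads": ["Acme", "Initech", "Umbrella"], "Score": [1, 2, 3]})

# Keep only the report rows whose lead is not already in the leads file.
df_filtered = df[~df["Leads"].isin(df_leads["Leads"])]
print(df_filtered)
```

The filtered frame can then be saved as in the question, with df_filtered.to_csv('Filtered Report.csv', index=False).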

How to optimize python script to pyspark def function

I am writing a pyspark program that takes a txt file and then adds a few columns to the left (beginning) of the columns in the file.
My text file looks like this:
ID,Name,Age
1233,James,15
After I run the program I want it to add two columns named creation_DT and created_By to the left of the table. I am trying to get it to look like this:
Creation_DT,Created_By,ID,Name,Age
"current timestamp", Sean,1233,James,15
The code below gets my required output, but I was wondering if there is an easier way to do this, and to optimize my script, using pyspark.
import pandas as pd
import numpy as np
df = pd.read_csv("/home/path/Sample Text Files/sample5.txt", delimiter = ",")
df.insert(loc=0, column='Creation_DT', value=pd.to_datetime('today'))
df.insert(loc=1, column='Create_BY',value="Sean")
# a pandas DataFrame has no write() method; to_csv writes the file
df.to_csv("/home/path/new/new_file.txt", index=False)
Any ideas or suggestions?
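Yes, it is relatively easy to convert this to pyspark code:
from pyspark.sql import SparkSession, functions as sf
import datetime
# reuse the active session, or create one if none exists
spark = SparkSession.builder.getOrCreate()
# read in using dataframe reader
# path here if you store your csv in local, should use file:///
# or use hdfs:/// if you store your csv in a cluster/HDFS.
spdf = (spark.read.format("csv").option("header","true")
    .load("file:///home/path/Sample Text Files/sample5.txt"))
spdf2 = (
    spdf
    .withColumn("Creation_DT", sf.lit(datetime.date.today().strftime("%Y-%m-%d")))
    .withColumn("Create_BY", sf.lit("Sean"))
    # withColumn appends on the right; select moves the new columns to the left
    .select("Creation_DT", "Create_BY", *spdf.columns)
)
# Spark writes a directory of part files at this path
spdf2.write.csv("file:///home/path/new/new_file.txt", header=True)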
This code assumes every row gets the same Creation_DT and Create_BY value.
I don't see any pyspark in your code, so I'll just use pandas this way:
cols = df.columns
df['Creation_DT'] =pd.to_datetime('today')
df['Create_BY']="Sean"
cols = cols.insert(0, 'Create_BY')
cols = cols.insert(0, 'Creation_DT')
# reorder the columns; assigning to df.columns would only relabel them
df = df[cols]
df.to_csv("/home/path/new/new_file.txt", index=False)
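To check the column order without touching the file system, here is a minimal pandas sketch on the two-line sample from the question. The in-memory buffer stands in for sample5.txt, and the column name Created_By follows the desired output shown in the question rather than the Create_BY spelling used in the code:

```python
import io

import pandas as pd

# In-memory stand-in for sample5.txt, using the sample rows from the question.
sample = io.StringIO("ID,Name,Age\n1233,James,15\n")
df = pd.read_csv(sample)

# insert() places each new column at the given position; a scalar value
# is repeated for every row.
df.insert(loc=0, column="Creation_DT", value=pd.Timestamp.now())
df.insert(loc=1, column="Created_By", value="Sean")

# Render as CSV text; pass a path instead to write a file.
out = df.to_csv(index=False)
print(out)
```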

Read file into pandas dataframe (using soh to split data)

Question:
I have seen some websites about how to read files into a dataframe, but I can't find one that teaches me how to read a file that uses SOH to split the data.
The files I get don't have an extension, but they look like .txt files.
For now I read the files row by row to create dataframes, and it takes a lot of time. Is there any way to make it faster?
Code:
from pandas import DataFrame
openfile = open('filename','r')
column1 = []
column2 = []
for line in openfile:
    line = line.strip().split('\x01') #soh equals to '\x01'
    column1.append(line[0])
    column2.append(line[1])
data = {'column1':column1, 'column2':column2}
table = DataFrame(data,columns = ['column1','column2'])
If your data doesn't have headers, this should do it:
import pandas as pd
table = pd.read_table('filename', sep='\x01', header=None, names=['column1','column2'])
You can read more about reading files in the pandas documentation.
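A minimal sketch of the same idea on in-memory data, so it can be tried without a file: the two-row sample below is made up, and pd.read_csv is used with the same sep, header, and names arguments as the read_table call above.

```python
import io

import pandas as pd

# Made-up sample: two rows, two fields, separated by SOH ('\x01').
raw = io.StringIO("a\x011\nb\x012\n")

# header=None says the first line is data; names supplies the column labels.
table = pd.read_csv(raw, sep="\x01", header=None, names=["column1", "column2"])
print(table)
```

Letting pandas parse the whole file in one call avoids the per-line Python loop, which is usually where the time goes.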
