Python - Write data from list into specific Excel column

I have a list of data which I want to write into a specific column in Excel, starting at cell B2.
Input example:
mydata = [12, 13, 14, 15]
Desired output in Excel (values written down column B, starting at cell B2; column A left empty):
 A |  B
   | 12
   | 13
   | 14
   | 15
I have tried using openpyxl to access the specific sheet (which works fine) and the specific cell (B2), but it throws an error when I try to write the list to the cell. It works fine if I assign a single value, as in the code extract below:
my_wb['sheet2']['B2'] = 4
Can anyone point me in the right direction?

Iterate over the list and paste each value into the desired row in column B:
for i, n in enumerate(mydata):
    my_wb["sheet2"].cell(row=i + 2, column=2).value = n
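For completeness, a minimal end-to-end sketch (the data.xlsx file name is just an example; it assumes the workbook already contains a sheet named sheet2):
from openpyxl import load_workbook

mydata = [12, 13, 14, 15]

my_wb = load_workbook("data.xlsx")   # hypothetical file name
ws = my_wb["sheet2"]

# write the list down column B, starting at B2
for i, n in enumerate(mydata):
    ws.cell(row=i + 2, column=2, value=n)

my_wb.save("data.xlsx")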

Related

Databricks pyspark, Difference in result of Dataframe.count() and Display(Dataframe) while using header='false'

I am reading a CSV file (stored on Azure Data Lake Store) into a dataframe with the following code:
df = spark.read.load(filepath, format="csv", schema = mySchema, header="false", mode="DROPMALFORMED");
The file at filepath contains 100 data rows plus a header row. I want to ignore the header while reading, so I set header="false" (sometimes the files come with a header and sometimes they don't).
After reading, when I display the dataframe with display(df) I get all the data, 100 rows, which is correct. But when I check the count with df.count() it reports 101 rows. Does the dataframe count include the header, or am I missing something?
mySchema and filepath are already defined in separate cells.
You are reading the CSV file with mode="DROPMALFORMED".
When there are malformed records, Spark drops them from the output of df.show() but still includes them in df.count().
In your case, since header is "false" and a schema is specified, Spark parses the data according to your specified types; rows that do not match (here, the header line) are treated as malformed and are not shown.
Example:
#sample data
#cat employee.csv
#id,name,salary,deptid
#1,a,1000,101
#2,b,2000,201
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

ss = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("salary", StringType()),
    StructField("deptid", StringType()),
])
df = spark.read.load("employee.csv", format="csv", schema=ss, header="false", mode="DROPMALFORMED")
df.show()
#+---+----+------+------+
#| id|name|salary|deptid|
#+---+----+------+------+
#|  1|   a|  1000|   101|
#|  2|   b|  2000|   201|
#+---+----+------+------+

#issue is in df.count()
df.count()
#3  <-- should be 2: the header row is dropped from show() but still counted
To fix, filter out the rows where a required column is null (here id, which fails to parse for the header line):
from pyspark.sql.functions import col
df = spark.read.load("employee.csv", format="csv", schema=ss, header="false", mode="DROPMALFORMED").filter(col("id").isNotNull())
df.show()
#+---+----+------+------+
#| id|name|salary|deptid|
#+---+----+------+------+
#|  1|   a|  1000|   101|
#|  2|   b|  2000|   201|
#+---+----+------+------+
#fixed count
df.count()
#2
To view the malformed data, remove the mode option:
spark.read.load("employee.csv", format="csv", schema=ss, header="false").show(100, False)
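Another way to inspect the malformed rows (a sketch, assuming Spark 2.2+; the _corrupt_record column name and the ss_debug schema are illustrative) is to read in PERMISSIVE mode and capture unparseable lines in an extra string column:
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# same schema as above, plus a string column that receives the raw text of any row that fails to parse
ss_debug = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("salary", StringType()),
    StructField("deptid", StringType()),
    StructField("_corrupt_record", StringType()),
])

df_debug = (spark.read
    .schema(ss_debug)
    .option("header", "false")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .csv("employee.csv"))

# show only the rows that could not be parsed (here, the header line)
df_debug.filter(col("_corrupt_record").isNotNull()).show(truncate=False)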
As per the pyspark documentation,
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader
header – uses the first line as names of columns. If None is set, it uses the default value, false.
You might need to standardize the way you get your data, either with headers or without headers, and then set the flag.
If you set header=False, then the spark engine will simply read the first row as a data row.
To answer your question, Dataframe count does not count header.
I would recommend reading the data first and then dropping the headers for debugging purposes.
Also, display(df) is a Python operation provided by IPython; for debugging purposes I would use dataframe.show(), which is a Spark-provided utility.

Create one dataframe from multi csv files with different headers in Spark

In Spark, with PySpark, I want to create one dataframe from a path that is actually a folder in S3 containing multiple CSV files that share some columns but also have different ones.
Put more simply, I want a single dataframe built from multiple CSV files with different headers.
For example, one file can have the header "raw_id, title, civility", and another file the header "raw_id, first_name, civility".
This is my code in Python 3:
df = spark.read.load(
    s3_bucket + 'data/contacts/normalized' + '/*/*/*/*',
    format='csv',
    delimiter='|',
    encoding='utf-8',
    header='true',
    quote=''
)
This is an example of file_1.csv :
|raw_id|title|civility|
|1 |M |male |
And an example of file2.csv :
|raw_id|first_name|civility|
|2 |Tom |male |
The result I expect in my dataframe is:
|raw_id|first_name|title|civility|
|1 | |M |male |
|2 |Tom | |male |
But what happens is that I get the union of all the columns, yet the data is not in the right place for any file after the first one.
Do you know how to do this?
Thank you very much in advance.
You need to load each file into its own dataframe and join them together on the raw_id column.
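A minimal sketch of that approach (the two file paths are illustrative; it joins on the columns both example files share, so each raw_id ends up on a single row):
read_opts = {"format": "csv", "delimiter": "|", "encoding": "utf-8", "header": "true"}

df1 = spark.read.load(s3_bucket + "data/contacts/normalized/file_1.csv", **read_opts)
df2 = spark.read.load(s3_bucket + "data/contacts/normalized/file2.csv", **read_opts)

# a full outer join keeps rows that appear in only one of the files;
# joining on both shared columns avoids duplicated raw_id/civility columns
merged = df1.join(df2, on=["raw_id", "civility"], how="outer")
merged.select("raw_id", "first_name", "title", "civility").show()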

Pandas: Why are my headers being inserted into the first row of my dataframe?

I have a script that collates sets of tags from other dataframes, converts them into comma-separated strings, and adds all of this to a new dataframe. If I use pd.read_csv to generate the dataframe, the first entry is what I expect it to be. However, if I use the df_empty function (below), then I get a copy of the headers in that first row instead of the data I want. The only difference is that I am generating a new dataframe instead of loading one.
The resultData = pd.read_csv(...) call reads a .csv file with the following headers and no additional content:
Sheet, Cause, Initiator, Group, Effects
The df_empty script is as follows:
def df_empty(columns, dtypes, index=None):
    assert len(columns) == len(dtypes)
    df = pd.DataFrame(index=index)
    for c, d in zip(columns, dtypes):
        df[c] = pd.Series(dtype=d)
    return df

# https://stackoverflow.com/a/48374031
# Usage: df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
My script contains the following line to create the dataframe:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],[np.str,np.int64,np.str,np.str,np.str])
I've also used the following with no differences:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],['object','int64','object','object','object'])
My script to collate the data and add it to my dataframe is as follows:
data = {'Sheet': sheetNum, 'Cause': causeNum, 'Initiator': initTag, 'Group': grp, 'Effects': effectStr}
count = len(resultData)
resultData.at[count,:] = data
When I run display(data), I get the following in Jupyter:
{'Sheet': '0001',
'Cause': 1,
'Initiator': 'Tag_I1',
'Group': 'DIG',
'Effects': 'Tag_O1, Tag_O2,...'}
What I want to see with both options / what I get when reading the csv:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects            |
+-------+-------+-----------+-------+--------------------+
| 0001  | 1     | Tag_I1    | DIG   | Tag_O1, Tag_O2,... |
| 0001  | 2     | Tag_I2    | DIG   | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
What I see when generating a dataframe with df_empty:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects            |
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects            |
| 0001  | 2     | Tag_I2    | DIG   | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
Any ideas on what might be causing the generated dataframe to copy my headers into the first row, and whether it is possible for me to avoid having to read an otherwise empty csv?
Thanks!
Why? Because you have inserted the header row as data. The magic behaviour of treating the first row as the header lives in read_csv(); if you create your dataframe without read_csv, the first row of your source data is not treated specially.
Solution? Skip the first (header) row of your source data when inserting into the dataframe generated by df_empty.
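A small sketch of the append side (assuming the df_empty helper above; the sample values are illustrative). Rows appended this way are ordinary data rows, so any header row in your collated source must be skipped before this point:
import numpy as np
import pandas as pd

resultData = df_empty(['Sheet', 'Cause', 'Initiator', 'Group', 'Effects'],
                      ['object', 'int64', 'object', 'object', 'object'])

rows = [
    # if the first collated row is actually a header, drop it before appending
    {'Sheet': '0001', 'Cause': 1, 'Initiator': 'Tag_I1', 'Group': 'DIG', 'Effects': 'Tag_O1, Tag_O2'},
    {'Sheet': '0001', 'Cause': 2, 'Initiator': 'Tag_I2', 'Group': 'DIG', 'Effects': 'Tag_O2, Tag_04'},
]

for data in rows:
    resultData.loc[len(resultData)] = [data[c] for c in resultData.columns]

print(resultData)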

Filtering Spark Dataframe

I've created a dataframe as:
ratings = imdb_data.sort('imdbRating').select('imdbRating').filter('imdbRating is NOT NULL')
Upon doing ratings.show() as shown below, I can see that the imdbRating field contains a mix of data: random strings, movie titles, movie URLs, and actual ratings. The dirty data looks like this:
+--------------------+
| imdbRating|
+--------------------+
|Mary (TV Episode...|
| Paranormal Activ...|
| Sons (TV Episode...|
| Spion (2011)|
| Winter... und Fr...|
| and Gays (TV Epi...|
| grAs - Die Serie...|
| hat die Wahl (2000)|
| 1.0|
| 1.3|
| 1.4|
| 1.5|
| 1.5|
| 1.5|
| 1.6|
| 1.6|
| 1.7|
| 1.9|
| 1.9|
| 1.9|
+--------------------+
only showing top 20 rows
Is there any way I can filter out the unwanted strings and just get the ratings? I tried using a UDF:
ratings_udf = udf(lambda imdbRating: imdbRating if isinstance(imdbRating, float) else None)
and tried calling it as:
ratings = imdb_data.sort('imdbRating').select('imdbRating')
filtered = ratings.withColumn('imdbRating', ratings_udf(ratings.imdbRating))
The problem with the above is that, since the UDF is called on each row, each row of the dataframe is mapped to a Row type and hence None is returned for all the values.
Is there any straightforward way to filter out that data?
Any help will be much appreciated. Thank you.
Finally, I was able to resolve it. The problem was that some of the data was corrupt, with not all fields present. First, I tried pandas, reading the csv file as:
pd_frame = pd.read_csv('imdb.csv', error_bad_lines=False)
This skipped/dropped the corrupt rows that had fewer columns than expected. I then tried to load the pandas dataframe, pd_frame, into Spark using:
imdb_data= spark.createDataFrame(pd_frame)
but got an error because of a mismatch while inferring the schema. It turns out Spark's CSV reader has a similar option that drops the corrupt rows:
imdb_data = spark.read.csv('imdb.csv', header='true', mode='DROPMALFORMED')
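If the file parses cleanly but you still need to discard the non-numeric ratings, a UDF-free sketch is to cast the column to float and drop the nulls (cast returns null for strings that are not valid numbers):
from pyspark.sql.functions import col

ratings = (imdb_data
    .withColumn("imdbRating", col("imdbRating").cast("float"))
    .filter(col("imdbRating").isNotNull())
    .sort("imdbRating")
    .select("imdbRating"))
ratings.show()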

Remove column names from spark dataframe while storing it as textfile

My dataframe output is as below,
DF.show(2)
+--------------+
|col1|col2|col3|
+--------------+
|  10|  20|  30|
|  11|  21|  31|
+--------------+
After saving it as a textfile with DF.rdd.saveAsTextFile("path"), the output looks like this:
Row(col1=u'10', col2=u'20', col3=u'30')
Row(col1=u'11', col2=u'21', col3=u'31')
The dataframe has millions of rows and 20 columns. How can I save it as a textfile like the below, i.e., without column names and the Python unicode markers?
10|20|30
11|21|31
While creating the initial RDD I used the code below to remove the unicode markers, though I am still getting them:
data = sc.textFile("file.txt")
trans = data.map(lambda x: x.encode("ascii", "ignore").split("|"))
Thanks in advance!
I think you can just do:
DF.rdd.map(lambda l: l[0] + '|' + l[1] + '|' + l[2]).saveAsTextFile(...)
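Since the real dataframe has 20 columns, a more general sketch (the output path is illustrative, and it assumes the values are plain ASCII) joins every column of each row:
(DF.rdd
   .map(lambda row: "|".join(str(c) for c in row))
   .saveAsTextFile("output/path"))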
In spark 2.0 you can write dataframes out directly to csv, which is all I think you need here. See: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
So in your case, you could just do something like:
df.write.option("sep", "|").option("header", "false").csv("some/path/")
There is a databricks plugin that provides this functionality in spark 1.x
https://github.com/databricks/spark-csv
As far as converting your unicode strings to ascii, see this question: Convert a Unicode string to a string in Python (containing extra symbols)
