Pandas: faster string operations in dataframes - python

I am working on a Python script that reads data from a database and saves this data into a .csv file.
In order to save it correctly I need to escape different characters such as \r\n or \n.
Here is how I am currently doing it:
Firstly, I use the read_sql pandas function in order to read the data from the database.
import pandas as pd
df = pd.read_sql(
    sql='SELECT * FROM exampleTable',
    con=SQLAlchemyConnection
)
The table I get has different types of values.
Then, the script updates the obtained dataframe, changing every string value to its raw-string representation.
To achieve that, I use two nested for loops to operate on every single value.
def update_df(df):
    for rowIndex, row in df.iterrows():
        for colIndex, value in row.items():
            if isinstance(df.at[rowIndex, colIndex], str):
                df.at[rowIndex, colIndex] = repr(df.at[rowIndex, colIndex])
    return df
However, the amount of data I need to process is large (more than 1 million rows with more than 100 columns) and it takes hours.
What I need is a way to create the csv file faster.
Thank you in advance.

It should be faster to use applymap if you really have mixed types:
df = df.applymap(lambda x: repr(x) if isinstance(x, str) else x)
However, if you can identify the string columns, then you can slice them (maybe in combination with re.escape?):
import re
str_cols = ['col1', 'col2']
df[str_cols] = df[str_cols].applymap(re.escape)
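If the string columns are not known ahead of time, a small sketch for detecting them automatically (an assumption here: the string values live in object-dtype columns, and 'output.csv' is just a placeholder file name):
# df is the frame from read_sql above; object-dtype columns are assumed to hold the strings
str_cols = df.select_dtypes(include='object').columns
df[str_cols] = df[str_cols].applymap(lambda x: repr(x) if isinstance(x, str) else x)
df.to_csv('output.csv', index=False)  # 'output.csv' is a placeholder name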

Related

Is there a better way to convert json within a pandas dataframe into additional dataframes?

I have a dataframe arguments with the columns RecordID (int) and additional_arguments (a JSON-formatted string). I am trying to convert each of the JSONs into a dataframe and then concatenate them all into one dataframe. Currently, I am doing this with a for loop:
arguments_output = pd.DataFrame([])
for i in range(0, len(arguments)-1):
    df = pd.DataFrame(arguments['additional_arguments'][i])
    df['RecordID'] = arguments['RecordID'][i]
    arguments_output = pd.concat([arguments_output, df])
This takes quite a bit of time, as there are over 55000 records total. Is there a better way to achieve this? Thank you
Without seeing an example input it's hard to say, but you can un-nest the JSON with:
pandas.json_normalize()
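As a rough sketch of that idea, assuming each additional_arguments value parses into a flat dict (adjust if the entries are lists of dicts):
import json
import pandas as pd
# parse the JSON strings once, then flatten them all in a single call
parsed = arguments['additional_arguments'].apply(json.loads)
arguments_output = pd.json_normalize(parsed.tolist())
arguments_output['RecordID'] = arguments['RecordID'].values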

Process single data set with different JSON schema rows using Pyspark

I am using PySpark and I need to process log files that are appended into a single data frame. Most of the columns look normal, but one of the columns has a JSON string in {}. Basically, each row is an individual event, and for the JSON string I can apply an individual schema. But I don't know the best way to process the data here.
Example:
This table will later help to aggregate the events in the way I need.
I tried to use the withColumn function with from_json. It worked successfully for a single column:
from pyspark.sql.types import *
import pyspark.sql.functions as F
df = (df
      .withColumn("nested_json",
                  F.when(F.col("event_name") == "EventStart",
                         F.from_json("json_string", "Name String, Version Int, Id Int"))))
It did what I want for my 1st row when I query nested_json. But it applied the schema to the whole column, and I would like to process each row depending on the event_name.
I was naive and tried to do this:
from pyspark.sql.types import *
import pyspark.sql.functions as F
df = (df
      .withColumn("nested_json",
                  F.when(F.col("event_name") == "EventStart",
                         F.from_json("json_string", "Name String, Version Int, Id Int"))
                  F.when(F.col("event_name") == "Action1",
                         F.from_json("json_string", "Name String, Version Int, UserName String, PosX int, PosY int"))
)
And this failed to run with: when() can only be applied on a Column previously generated by when() function
I assume my 1st withColumn applied the schema to the whole column.
What other options do I have to apply JSON schema based on event_name value and flattened values?
What if you chain your when statements?
For example,
df.withColumn("nested_json", F.when(F.col("event_name") =="EventStart",F.from_json(...)).when(F.col("event_name") == "Action1", F. from_json(...)))

How to update/apply validation to pandas columns

I am working on automating a process with python using pandas. Previously I would use Excel PowerQuery to combine files and manipulate data but PowerQuery is not as versatile as I need so I am now using pandas. I have the process working up to a point where I can loop through files, select the columns that I need in the correct order, dependent on each workbook, and insert that into a dataframe. Once each dataframe is created, I then concatenate them into a single dataframe and write to csv. Before writing, I need to apply some validation to certain columns.
For example, I have a Stock Number column that will always need to be exactly 11 characters long. Sometimes, depending on the workbook, the data will be missing the leading zeros or will have more than 11 characters (but those extra characters should be removed). I know that what I need to do is something along the lines of:
STOCK_NUM.zfill(13)[:13]
but I'm not sure how to actually modify the existing dataframe values. Do I actually need to loop through the dataframe or is there a way to apply formatting to an entire column?
e.g.
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018']]
df = pd.DataFrame(dataset, columns=['STOCK_NUM', 'Date'])
for x in df["STOCK_NUM"]:
    print(x.zfill(13)[:13])
I would like to know the best way to apply that format to the existing values, and only if those values are present (i.e. not touching them if there are null values).
Also, I need to ensure that the date columns are truly date values. Sometimes the dates are formatted as MM-DD-YYYY or sometimes MM/DD/YY, etc., and any of those are fine, but what is not fine is if the actual value in the date column is an Excel serial number that Excel can format as a date. Is there some way to apply validation logic to an entire dataframe column to ensure that there is a valid date instead of a serial number?
I honestly have no idea how to approach this date issue.
Any and all advice, insight would be greatly appreciated!
Not an expert, but from things I could gather here and there you could try:
df['STOCK_NUM']=df['STOCK_NUM'].str.zfill(13)
followed by:
df['STOCK_NUM'] = df['STOCK_NUM'].str.slice(0,13)
For the first part.
For dates you can do a try-except on:
df['Date'] = pd.to_datetime(df['Date'])
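A minimal sketch of that try/except (the exception types here are an assumption; adjust them to whatever pd.to_datetime raises for your data):
try:
    df['Date'] = pd.to_datetime(df['Date'])
except (ValueError, TypeError):
    # conversion failed, so the column likely holds Excel serial numbers or other invalid values
    pass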
For your STOCK_NUM question, you could potentially apply a function to the column, but the way I approach this is using list comprehensions. The first thing I would do is replace all the NAs in your STOCK_NUM column with a unique string and then apply the list comprehension, as you can see in the code below:
import pandas as pd
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018'], [None,42139]]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
#replace NAs with a string
df.STOCK_NUM.fillna('IS_NA',inplace=True)
#use list comprehension to reformat the STOCK_NUM column
df['STOCK_NUM'] = [None if i=='IS_NA' else i.zfill(13)[:13] for i in df.STOCK_NUM]
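A variation that skips the sentinel string and checks for missing values directly (a sketch assuming pd.isna covers the null cases, applied to the original column without the fillna step):
df['STOCK_NUM'] = [i if pd.isna(i) else i.zfill(13)[:13] for i in df.STOCK_NUM]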
Then for your question relating to converting excel serial number to a date, I looked at an already answered question. I am assuming that the serial number in your dataframe is an integer type:
import datetime
def xldate_to_datetime(xldate):
    temp = datetime.datetime(1900, 1, 1)
    delta = datetime.timedelta(days=xldate) - datetime.timedelta(days=2)
    return pd.to_datetime(temp + delta)
df['Date'] = [xldate_to_datetime(i) if type(i) == int else pd.to_datetime(i) for i in df.Date]
Hopefully this works for you! Accept this answer if it does, otherwise reply with whatever remains an issue.

UpperCasing CSV Columns by reading the header

I have a function that reads the csv, splits it, uppercases only ONE or ALL columns (by index) and joins it again.
I want to be able to uppercase multiple columns but I have no idea how.
This is my code.
def specific_upper(line, c):
    split = line.split(",")
    split[c] = split[c].upper()
    split = ','.join(split)
    return split
EDIT: I wanted to do this only with Python (no Spark, if possible).
EDIT 2: This is for NiFi, so it's Jython and not 100% Python.
You can do that easily with read_csv from pandas. The default behaviour is that the first row of the csv contains the column names.
import pandas as pd
df = pd.read_csv('<filename>')
df.columns = [x.upper() for x in df.columns]
This will upper case all your column names. You can add some conditions in order to upper case only the columns you want.
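For example, a small sketch that upper-cases only selected headers (to_upper is a hypothetical list of the column names you care about):
to_upper = ['col1', 'col2']  # hypothetical list of headers to upper case
df.columns = [x.upper() if x in to_upper else x for x in df.columns]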

I have multiple columns in csv. How do I match row values to a reference column using python?

I have a csv file with 367 columns. The first column has 15 unique values, and each subsequent column has some subset of those 15 values. No unique value is ever found more than once in a column. Each column is sorted. How do I get the rows to line up? My end goal is to make a presence/absence heat map, but I need to get the data matrix in the right format first, which I am struggling with.
Here is a small example of the type of data I have:
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,
I need the rows to match the reference but stay in the same column like so:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
My thought was to use the pandas library, but I could not figure out how to approach this problem, as I am very new to using python. I am using python2.7.
So your problem is definitely solvable via pandas:
Code:
# Create the sample data into a data frame
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
Results:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
How does it work?:
Pandas can be a little daunting when first approached, even for someone who knows Python well, so I will try to walk through this. And I encourage you to do what you need to get over the learning curve, because pandas is ridiculously powerful for this sort of data manipulation.
Get the data into a frame:
This first bit of code does nothing but get your sample data into a pandas.DataFrame. Your data format was not specified, so I will assume you can get it into a frame; if you cannot, ask another question here on SO about getting the data into a frame.
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
Build a result frame:
Start with a result frame that is just the index
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
For each column in the source data, check which index values are present in that column
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
That's it, with three lines of code, we have calculated our results.
Output the results:
Now iterate through each row, which contains booleans, and output the values in the desired format
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
This outputs the index value first, and then for each True value outputs the index again, and for False values outputs an empty string.
Postscript:
There are quite a few people here on SO who are way better at pandas than I am, but since you did not tag your question with the pandas keyword, they likely did not notice this question. But that allows me to take my cut at answering before they notice. The pandas tag is very well covered for well-formed questions, so I am pretty sure that if this answer is not optimal, someone else will come by and improve it. So in the future, be sure to tag your question with pandas to get the best response.
Also, you mentioned that you were new to Python, so I will just put in a plug to make sure that you are using a good IDE. I use PyCharm, and it and other good IDEs can make working in Python even more powerful, so I highly recommend them.
