Using Pandas to Manipulate Multiple Columns - python

I have a 30+ million row data set that I need to apply a whole host of data transformation rules to. For this task, I am trying to explore Pandas as a possible solution because my current solution isn't very fast.
Currently, I am performing a row by row manipulation of the data set, and then exporting it to a new table (CSV file) on disk.
There are 5 functions users can perform on the data within a given column:
remove white space
capitalize all text
format date
replace letter/number
replace word
My first thought was to use the dataframe's apply or applymap, but this can only be used on a single column.
Is there a way to use apply or applymap to many columns instead of just one?
Is there a better workflow I should consider, since I could be applying manipulations to anywhere from 1 to n columns in my dataset (currently a maximum of around 30 columns)?
Thank you

You can use a list comprehension with concat if you need to apply a function that only works with a Series:
import pandas as pd
data = pd.DataFrame({'A': [' ff ', '2', '3'],
                     'B': [' 77', 's gg', 'd'],
                     'C': ['s', ' 44', 'f']})
print(data)
      A     B    C
0   ff     77    s
1     2  s gg   44
2     3     d    f
print(pd.concat([data[col].str.strip().str.capitalize() for col in data], axis=1))
    A     B   C
0  Ff    77   S
1   2  S gg  44
2   3     D   F
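If only some of the ~30 columns need a given rule, you can also run the same Series-based cleaning on a chosen subset and assign it back. A minimal sketch (the column list is illustrative):
import pandas as pd

data = pd.DataFrame({'A': [' ff ', '2', '3'],
                     'B': [' 77', 's gg', 'd'],
                     'C': ['s', ' 44', 'f']})

cols = ['A', 'B']  # illustrative: whichever columns the rule should touch

# DataFrame.apply calls the function once per column (each column is a Series),
# so the same Series string methods run on every selected column.
data[cols] = data[cols].apply(lambda s: s.str.strip().str.capitalize())
print(data)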

Related

Create dataframe column using another column for source variable suffix

Difficult to title, so apologies for that...
Here is some example data:
region  FC_EA  FC_EM  FC_GL  FC_XX  FC_YY  ...
GL          4      2      8      6      1  ...
YY          9      7      2      1      3  ...
There are many columns with a suffix, hence the ...
[edit] And there are many other columns. I want to keep all columns.
The aim is to create a column called FC that is the value according to the region column value.
So, for this data the resultant column would be:
FC
8
3
I have a couple of ways to achieve this at present - one way is minimal code (perhaps fine for a small dataset):
df['FC'] = df.apply(lambda x: x['FC_'+x.region], axis=1)
Another way is a stacked np.where query, which I am advised is faster for large datasets:
df['FC'] = np.where(df.region=='EA', df.FC_EA,
           np.where(df.region=='EM', df.FC_EM,
           np.where(df.region=='GL', df.FC_GL, ...
I am wondering if anyone out there can suggest the best way to do this, if there is something better than these options?
That would be great.
Thanks!
You could use melt:
(df.melt(id_vars='region', value_name='FC')
   .loc[lambda d: d['region'].eq(d['variable'].str[3:]), ['region', 'FC']]
)
or using apply (probably quite a bit slower):
df['FC'] = (df.set_index('region')
              .apply(lambda r: r.loc[f'FC_{r.name}'], axis=1)
              .values
            )
output:
   region  FC
4      GL   8
9      YY   3
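For completeness, a fully vectorised lookup is another option worth considering on large frames. This is my own sketch rather than part of the answers above; it follows the pattern the pandas docs suggest as a replacement for the deprecated DataFrame.lookup:
import numpy as np
import pandas as pd

# Factorize the target column names built from 'region', then pick each
# row's value by position in a single NumPy indexing step.
idx, cols = pd.factorize('FC_' + df['region'])
df['FC'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]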

Iterate through two dataframes and create a dictionary where one dataframe holds substrings (keys) of the strings found in the second dataframe (values)

I have two dataframes. One is very large and has over 4 million rows of data while the other has about 26k. I'm trying to create a dictionary where the keys are the strings of the smaller data frame. This dataframe (df1) contains substrings or incomplete names, while the larger dataframe (df2) contains full names/strings. I want to check whether each substring from df1 appears in the strings of df2 and then create my dict.
No matter what I try, my code takes too long, and I keep looking for faster ways to iterate through the df's.
org_dict = {}
for rowi in df1.itertuples():
    part = rowi.part_name
    full_list = []
    for rowj in df2.itertuples():
        if part in rowj.full_name:
            full_list.append(rowj.full_name)
    org_dict[part] = full_list
Am I missing a break or is there a faster way to iterate through really large dataframes of way over 1 million rows?
Sample data:
df1
part_name
0 aaa
1 bb
2 856
3 cool
4 man
5 a0
df2
full_name
0 aaa35688d
1 coolbbd
2 8564578
3 coolaaa
4 man4857684
5 a03567
expected output:
{'aaa':['aaa35688d','coolaaa'],
'bb':['coolbbd'],
'856':['8564578']
...}
etc
The issue here is that nested for loops perform very badly time-wise as the data grows larger. Luckily, pandas allows us to perform vectorised operations across rows/columns.
I can't properly test without having access to a sample of your data, but I believe this does the trick and performs much faster:
org_dict = {substr: df2.full_name[df2.full_name.str.contains(substr)].tolist() for substr in df1.part_name}
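One caveat to add (not part of the original answer): str.contains treats its pattern as a regular expression by default, so substrings containing characters like '.' or '+' can match unexpectedly. Passing regex=False forces literal matching and is typically faster:
org_dict = {substr: df2.full_name[df2.full_name.str.contains(substr, regex=False)].tolist()
            for substr in df1.part_name}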

Fast datetime parsing with multiple columns, read_csv

I am reading in a large csv file (10GB+). The raw data loaded from the csv looks like:
SYMBOL DATE TIME PRICE CORR COND
0 BA 20090501 9:29:46 40.24 0 F
1 BA 20090501 9:29:59 40.38 0 F
2 BA 20090501 9:30:01 40.31 0 O
3 BA 20090501 9:30:01 40.31 0 Q
4 BA 20090501 9:30:08 40.38 0 F
My goal is to combine the DATE and TIME columns into a single DATE_TIME column when reading in the data via the read_csv function.
Loading the data first and doing it manually is not an option due to memory constraints.
Currently, I am using
data = pd.read_csv('200905.csv',
                   parse_dates=[['DATE', 'TIME']],
                   infer_datetime_format=True,
                   )
However, using the default dateutil.parser.parser as above increases the loading time by 4x as opposed to just loading the raw csv.
A promising approach could be the lookup technique described in Pandas: slow date conversion, since my dataset has a lot of repeated dates.
However, my issue is: how do I best exploit the repeated structure of the DATE column while combining it with TIME into a DATE_TIME column (which is likely to have very few repeated entries)?
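Since no answer is given above, here is one possible way to combine the two ideas. This is only a sketch under a couple of assumptions: it relies on read_csv's date_parser hook (deprecated in recent pandas releases in favour of date_format), and on pandas passing the DATE and TIME columns to it as two arrays, which is the first calling convention it tries.
import pandas as pd

def parse_date_time(dates, times):
    # Parse each distinct DATE string only once, then broadcast the result back.
    dates = pd.Series(dates).astype(str)
    uniq = dates.unique()
    date_lookup = pd.Series(pd.to_datetime(uniq, format='%Y%m%d'), index=uniq)
    # TIME strings such as '9:29:46' parse directly as timedeltas.
    return (dates.map(date_lookup) + pd.to_timedelta(pd.Series(times).astype(str))).to_numpy()

data = pd.read_csv('200905.csv',
                   parse_dates=[['DATE', 'TIME']],
                   date_parser=parse_date_time)
The combined column should come back named DATE_TIME, as with the default parser.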

Iterating on Pandas DataFrame to pass data into API

I am creating a script that reads a GoogleSheet, transforms the data and passes it into my ERP API to automate the creation of Purchase Orders.
I have got as far as outputting the data in a dataframe but I need help on how I can iterate through this and pass it in the correct format to the API.
DataFrame Example (dfRow):
productID vatrateID amount price
0 46771 2 1 1.25
1 46771 2 1 2.25
2 46771 2 2 5.00
Formatting of the API data:
vatrateID1=dfRow.vatrateID[0],
amount1=dfRow.amount[0],
price1=dfRow.price[0],
productID1=dfRow.productID[0],
vatrateID2=dfRow.vatrateID[1],
amount2=dfRow.amount[1],
price2=dfRow.price[1],
productID2=dfRow.productID[1],
vatrateID3=dfRow.vatrateID[2],
amount3=dfRow.amount[2],
price3=dfRow.price[2],
productID3=dfRow.productID[2],
I would like to create a function that would iterate through the DataFrame and return the data in the correct format to pass to the API.
I'm new at Python and struggle most with iterating / loops so any help is much appreciated!
First, you can always loop over the rows of a dataframe using df.iterrows(). Each step through this iterator yields a tuple containing the row index and the row contents as a pandas Series object. So, for example, this would do the trick:
for ix, row in df.iterrows():
    for column in row.index:
        print(f"{column}{ix}={row[column]}")
You can also do it without resorting to loops. This is great if you need performance, but if performance isn't a concern then it is really just a matter of taste.
# first, "melt" the data, which puts all of the variables on their own row
x = df.reset_index().melt(id_vars='index')
# now join the columns together to produce the rows that we want
s = x['variable'] + x['index'].map(str) + '=' + x['value'].map(str)
print(s)
0 productID0=46771.0
1 productID1=46771.0
2 productID2=46771.0
3 vatrateID0=2.0
...
10 price1=2.25
11 price2=5.0
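If the API ultimately wants the fields numbered from 1 (vatrateID1, amount1, ...), the same iterrows idea can assemble a single keyword dict. A sketch: build_payload and the api.create_purchase_order call are hypothetical names, only the field-naming pattern comes from the question.
def build_payload(df):
    # Flat dict such as {'productID1': 46771, 'vatrateID1': 2, 'amount1': 1, 'price1': 1.25, ...}
    return {f"{column}{ix + 1}": row[column]
            for ix, (_, row) in enumerate(df.iterrows())
            for column in df.columns}

payload = build_payload(dfRow)
# api.create_purchase_order(**payload)   # hypothetical API call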

Force Pandas to keep multiple columns with the same name

I'm building a program that collects data and adds it to an ongoing excel sheet weekly (read_excel() and concat() with the new data). The issue I'm having is that I need the columns to have the same name for presentation (it doesn't look great with x.1, x.2, ...).
I only need this on the final output. Is there any way to accomplish this? Would it be too time consuming to modify pandas?
You can create a list of custom headers that will be written to Excel:
newColNames = ['x', 'x', 'x', ...]
df.to_excel(path, header=newColNames)
You can add spaces to the end of the column names. They will appear the same in Excel, but pandas can tell the columns apart.
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['x', 'x ', 'x  '])
df
   x  x   x
0  1  2   3
1  4  5   6
2  7  8   9
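Another option, which is my own suggestion rather than part of the answers above: keep distinct column names while you work, and assign the duplicate names only on the copy you export. As far as I know, pandas only mangles duplicate headers when reading a file back in; assigning duplicate labels and writing them out is allowed. A sketch (file and column names are illustrative):
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['x_1', 'x_2', 'x_3'])

# Rename only the exported copy; duplicate labels are legal on assignment
# and to_excel writes them out exactly as given.
out = df.copy()
out.columns = ['x'] * len(out.columns)
out.to_excel('weekly_report.xlsx', index=False)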
