I am trying to iterate through the rows of a pandas DataFrame, read data from one column of each row, and use that data to add new columns. The code is listed below, but it is VERY slow. Is there any way to do this without iterating through the individual rows of the DataFrame?
ctqparam = []
wwy = []
ww = []
for index, row in df.iterrows():
    date = str(row['Event_Start_Time'])
    day = int(date[8] + date[9])
    month = int(date[5] + date[6])
    total = 0
    for i in range(0, month-1):
        total += months[i]
    total += day
    out = total // 7
    ww += [out]
    wwy += [str(date[0] + date[1] + date[2] + date[3])]
    val = str(row['TPRev'])
    out = ""
    for letter in val:
        if letter != '.':
            out += letter
    df.replace(to_replace=row['TPRev'], value=str(out), inplace = True)
    val = str(row['Subtest'])
    if val in ctqparam_dict.keys():
        ctqparam += [ctqparam_dict[val]]
# add WWY column, WW column, and correct data format of Test_Tape column
df.insert(0, column='Work_Week_Year', value = wwy)
df.insert(3, column='Work_Week', value = ww)
df.insert(4, column='ctqparam', value = ctqparam)
It's hard to say exactly what you're trying to do. However, if you're looping through rows, chances are that there is a better way to do it.
For example, given a csv file that looks like this..
Event_Start_Time,TPRev,Subtest
4/12/19 06:00,"this. string. has dots.. in it.",{'A_Dict':'maybe?'}
6/10/19 04:27,"another stri.ng wi.th d.ots.",{'A_Dict':'aVal'}
You may want to:
Format Event_Start_Time as datetime.
Get the week number from Event_Start_Time.
Remove all the dots (.) from the strings in column TPRev.
Expand a dictionary contained in Subtest to its own column.
Rather than looping through the rows, consider working column by column: you write the operation once, as if for the first 'cell' of the column, and pandas applies it all the way down.
Code:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
Event_Start_Time TPRev Subtest
0 4/12/19 06:00 this. string. has dots.. in it. {'A_Dict':'maybe?'}
1 6/10/19 04:27 another stri.ng wi.th d.ots. {'A_Dict':'aVal'}
# format 'Event_Start_Time' as datetime
df['Event_Start_Time'] = pd.to_datetime(df['Event_Start_Time'], format='%d/%m/%y %H:%M')
# get the week number from 'Event_Start_Time'
df['Week_Number'] = df['Event_Start_Time'].dt.isocalendar().week
# replace all '.' (periods) in the 'TPRev' column
df['TPRev'] = df['TPRev'].str.replace('.', '', regex=False)
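# parse the dictionary string in column 'Subtest' and expand it into new columns
# (ast.literal_eval only evaluates literals, so it is safer than eval on file contents)
import ast
df = pd.concat([df.drop(['Subtest'], axis=1), df['Subtest'].map(ast.literal_eval).apply(pd.Series)], axis=1)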
print(df)
Event_Start_Time TPRev Week_Number A_Dict
0 2019-12-04 06:00:00 this string has dots in it 49 maybe?
1 2019-10-06 04:27:00 another string with dots 40 aVal
print(df.info())
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Event_Start_Time 2 non-null datetime64[ns]
1 TPRev 2 non-null object
2 Week_Number 2 non-null UInt32
3 A_Dict 2 non-null object
dtypes: UInt32(1), datetime64[ns](1), object(2)
So you end up with a dataframe like this...
Event_Start_Time TPRev Week_Number A_Dict
0 2019-12-04 06:00:00 this string has dots in it 49 maybe?
1 2019-10-06 04:27:00 another string with dots 40 aVal
Obviously you'll probably want to do other things. Look at your data. Make a list of what you want to do to each column or what new columns you need. Don't worry about how right now, as chances are it's possible and has been done before; you just need to find the existing method.
You might write down something like "get the difference in days between the current row and the row beneath", etc. Finally, search for how to do the formatting or calculation you require. Break the problem down.
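As a rough sketch of how the loop in the question could be expressed column by column: the sample rows and the contents of ctqparam_dict below are made up for illustration, and I'm assuming months in the question holds the number of days in each month, so that total is the day of the year. Note that dt.dayofyear accounts for leap years, and .map leaves NaN where a Subtest has no entry in the dictionary (the loop in the question silently skips those rows instead).

```python
import pandas as pd

# Hypothetical sample data and lookup table, for illustration only
df = pd.DataFrame({
    "Event_Start_Time": ["2019-04-12 06:00:00", "2019-06-10 04:27:00"],
    "TPRev": ["1.2.3", "4.5"],
    "Subtest": ["sub_a", "sub_b"],
})
ctqparam_dict = {"sub_a": "param_a", "sub_b": "param_b"}

# Parse the timestamps once, then derive the new columns from the result
start = pd.to_datetime(df["Event_Start_Time"])
df["Work_Week_Year"] = start.dt.year.astype(str)
# Day of the year integer-divided by 7, as in the question's loop
df["Work_Week"] = start.dt.dayofyear // 7
# Strip the periods from TPRev for the whole column at once
df["TPRev"] = df["TPRev"].astype(str).str.replace(".", "", regex=False)
# Look up each Subtest in the dictionary; unknown keys become NaN
df["ctqparam"] = df["Subtest"].astype(str).map(ctqparam_dict)

print(df)
```

If the new columns need to sit at specific positions, df.insert can still be used with these Series as the value.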
I need some help:
How could I update the column of the file file_csv_reference.csv dataFrame using Pandas and Python?
file_csv_reference.csv:
cod_example
123456
123456
123456
789101
789101
121314
121314
There are lines with repeated information. I would like to replace all of them with the respective updated code from the file below:
file_with_updated_cod.csv
old_cod;updated_cod
123456;1234567
789101;7891011
121314;1213141
So far I have been thinking along these lines (but I can't get it to run correctly):
import pandas as pd
file01 = pd.read_csv("file_csv_reference.csv", encoding = "utf-8", delimiter = ";", header = 0)
file02 = pd.read_csv("file_with_updated_cod.csv", encoding = "utf-8", delimiter = ";", header = 0)
for oldcod in file01['cod_example']:
    for cod in file02['old_cod']:
        if oldcod == cod:
            # in this part I would like to replace the data in the file01 column cod_example
            # with file02['updated_cod'] in the respective column
            pass
Could you please help me solve this? Thanks!
You can use .map:
df1 = pd.read_csv("file_csv_reference.csv")
df2 = pd.read_csv("file_with_updated_cod.csv", sep=";")
df1["cod_example"] = df1["cod_example"].map(
    df2.set_index("old_cod")["updated_cod"]
)
print(df1)
Prints:
cod_example
0 1234567
1 1234567
2 1234567
3 7891011
4 7891011
5 1213141
6 1213141
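One thing to watch with .map: any code that has no entry in the update file comes back as NaN. A minimal sketch of keeping the original value in that case is below; the in-memory CSV text stands in for the two files, and the unmatched code 999999 is made up for illustration.

```python
import io

import pandas as pd

# In-memory stand-ins for the two files; 999999 is a made-up code with no update
reference_csv = io.StringIO("cod_example\n123456\n789101\n999999\n")
updates_csv = io.StringIO("old_cod;updated_cod\n123456;1234567\n789101;7891011\n")

df1 = pd.read_csv(reference_csv)
df2 = pd.read_csv(updates_csv, sep=";")

# Series indexed by old code, holding the updated code
lookup = df2.set_index("old_cod")["updated_cod"]

# Map the codes, then fall back to the original value where no update exists
mapped = df1["cod_example"].map(lookup)
df1["cod_example"] = mapped.fillna(df1["cod_example"]).astype("int64")

print(df1)
```

The astype call is only there because the intermediate NaN turns the column into floats.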
I have a csv containing details of documents located on Google Drive. I am trying to make it easier to read and deal with as this example has over 400 columns.
Each row in the csv represents a file on Google drive. There are multiple columns to denote who owns the file and who it is shared with.
Every time a file has been shared, the details of the person it has been shared with are appended as a new column to that row.
I have loaded the data into Pandas data frame and I'm struggling to move the contents of certain columns to a new row.
Below is an example
Input:
owner | id | title | permissions.0.name | permissions.0.email | permissions.1.name | permissions.1.email
value | 1  | doc1  | Tommy              | tommy#office.com    | Timmy              | timmy#office.com
value | 2  | doc2  | Tommy              | tommy#office.com    |                    |
value | 3  | doc3  | Timmy              | timmy#office.com    |                    |
value | 4  | doc4  | Tammy              | tammy#office.com    | Tommy              | tommy#office.com
Output:
owner | id | title | permissions.0.name | permissions.0.email
value | 1  | doc1  | Tommy              | tommy#office.com
value | 2  | doc2  | Tommy              | tommy#office.com
value | 3  | doc3  | Timmy              | timmy#office.com
value | 4  | doc4  | Tammy              | tammy#office.com
value | 5  | doc1  | Timmy              | timmy#office.com
value | 6  | doc4  | Tommy              | tommy#office.com
I began by creating a list of the column names and finding the maximum number in the column headings (it is 46 in the full data). I then loop from 1 to 46, building the column name to look at, with the aim of moving the contents of that column to a different column on a new row. But I had no idea how to move the contents...
import pandas
df = pandas.read_csv('input.csv')
cols = list(df)  # list of column names
maxcol = []
for c in cols:
    if '.' in c:
        n = c.split('.')[1]
        maxcol.append(int(n))
maxval = max(maxcol)
for i in range(1, maxval + 1):
    colname = 'permissions.' + str(i) + '.name'
    # move contents from this column to permissions.0.name in new row somehow
There are many more columns (over 400), and they do not appear in an organised structure; columns are created when required. So we have columns like this:
permissions.5.email | permissions.1.withPhoto | permissions.6.name | permissions.6.email
You can use this code. I think there are easier ways to do this, but it seems to be giving the required result:
cols = list(df)
# Get the variable column names
per_cols = [c for c in cols if '.' in c]
# Get the constant column names, i.e. title
main_cols = [c for c in cols if '.' not in c]
# Get the set of numbers present in columns
col_no = set([c.split('.')[1] for c in per_cols])
parts = []
for col in col_no:
    # Get the list of columns with the current number
    part = [c for c in per_cols if c.split('.')[1] == col]
    # Add the constant columns to get a complete df
    part = main_cols + part
    temp = df[part].copy()
    # Rename the number in the column names to '0' for a unified result
    temp.columns = [c.replace('.' + col + '.', '.0.') for c in temp.columns]
    # Drop the NaN rows (preferably use a subset you are certain wouldn't be null)
    temp = temp.dropna()
    # Collect the chunk of df
    parts.append(temp)
# Concatenate the chunks (DataFrame.append was removed in pandas 2.0)
# If a chunk has a new column, it will be added to the result df
result = pd.concat(parts)
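For reference, here is a small self-contained run of the same idea on the sample rows from the question (the owner column is left out for brevity). It is only a sketch: it sorts the numbers so the chunks come out in a predictable order, and it follows the "use a subset" suggestion by dropping a row only when all of its permission columns are empty.

```python
import pandas as pd

# Sample rows from the question (owner column omitted for brevity)
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": ["doc1", "doc2", "doc3", "doc4"],
    "permissions.0.name": ["Tommy", "Tommy", "Timmy", "Tammy"],
    "permissions.0.email": ["tommy#office.com", "tommy#office.com",
                            "timmy#office.com", "tammy#office.com"],
    "permissions.1.name": ["Timmy", None, None, "Tommy"],
    "permissions.1.email": ["timmy#office.com", None, None, "tommy#office.com"],
})

main_cols = [c for c in df.columns if "." not in c]
per_cols = [c for c in df.columns if "." in c]
# Permission numbers present in the headers, in numeric order
numbers = sorted({c.split(".")[1] for c in per_cols}, key=int)

parts = []
for n in numbers:
    cols = [c for c in per_cols if c.split(".")[1] == n]
    part = df[main_cols + cols].copy()
    # Rename e.g. permissions.1.name to permissions.0.name
    renamed = [c.replace(f".{n}.", ".0.") for c in cols]
    part.columns = main_cols + renamed
    # Drop rows where every permission column is empty
    part = part.dropna(subset=renamed, how="all")
    parts.append(part)

result = pd.concat(parts, ignore_index=True)
print(result)
```

This gives six rows: the four documents with their first permission, followed by doc1 and doc4 with their second one.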
I am trying to re-arrange a file to match a BACS bank format. In order for it to work the columns in the csv below need to be of a specific length. I have figured out the abcdabcd column as it's a repeating pattern (as are a couple more in the file), but several columns have random numbers that I cannot easily target.
Is there a way for me to target either (ideally) a specific column based on its header, or alternatively target everything up to a comma to butcher something that could work?
In my example file below, you'll see three columns where the value changes. If targeting everything up to a specific character is the solution, I was thinking of using .ljust to fill the column up to the specified length (and then sorting it out manually in excel).
Original File
a,b,c,d,e,f,g,h,i,j,k
12345,1234567,0,11,123456,12345678,1234567,abcdabcd,A ABCD
123456,12345678,0,11,123456,12345678,12345678,abcdabcd,A ABCD
123456,1234567,0,11,123456,12345678,12345,abcdabcd,A ABCD
12345,1234567,0,11,123456,12345678,1234567,abcdabcd,A ABCD
123456,12345678,0,11,123456,12345678,123456789,abcdabcd,A ABCD
Ideal output
a,b,c,d,e,f,g,h,i,j,k
123450,12345670,0,11,123456,12345678,123456700,abcdabcd,A ABCD
123456,12345678,0,11,123456,12345678,123456780,abcdabcd,A ABCD
123456,12345670,0,11,123456,12345678,123450000,abcdabcd,A ABCD
123450,12345670,0,11,123456,12345678,123456700,abcdabcd,A ABCD
123456,12345678,0,11,123456,12345678,123456789,abcdabcd,A ABCD
Code
with open('file.txt', 'r') as file:
    filedata = file.read()
filedata = filedata.replace('12345', '12345'.ljust(6, '0'))
with open('file.txt', 'w') as file:
    file.write(filedata)
EDIT:
Something similar to this Python - How to add zeros to and integer/string? but while either targeting a specific column per header, or at least the first one.
EDIT2:
I am using the below to rearrange my columns, could this be modified to work with string lengths?
import pandas as pd
## Read csv / tab-delimited in this example
df = pd.read_csv('test.txt', sep='\t')
## Reorder columns
df = df[['h','i','c','g','a','b','e','d','f','j','k']]
## Write csv / tab-delimited
df.to_csv('test', sep='\t')
Using pandas, you can convert the column to str and then use .str.pad. You can make a dict with the requested lengths:
lengths = {
"a": 6,
"b": 8,
"c": 3,
"d": 6,
"e": 8,
}
and use it like this:
result = pd.DataFrame(
    {
        column_name: column.str.pad(
            lengths.get(column_name, 0), side="right", fillchar="0"
        )
        for column_name, column in df.astype(str).items()
    }
)
If the fillchar is different per column, you can get that from a dict as well
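A minimal sketch of that per-column variant is below. The sample frame, the widths and the fillchars dict are all made up for illustration; any column missing from fillchars falls back to "0".

```python
import pandas as pd

# Made-up sample data and settings, for illustration only
df = pd.DataFrame({
    "a": [12345, 123456],
    "b": [1234567, 12345678],
    "h": ["abcd", "abcdabcd"],
})
lengths = {"a": 6, "b": 8, "h": 8}
fillchars = {"h": " "}  # columns not listed here are padded with "0"

result = pd.DataFrame(
    {
        name: column.str.pad(
            lengths.get(name, 0),
            side="right",
            fillchar=fillchars.get(name, "0"),
        )
        for name, column in df.astype(str).items()
    }
)
print(result)
```

Because the columns are looked up by header name, this targets specific columns without relying on their position in the file.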
>>> '{:0>5}'.format(4)
'00004'
>>> '{:0<5}'.format(4)
'40000'
>>> '{:0^5}'.format(4)
'00400'
Example:
#--------------DEFs------------------
def number_zero_right(number,len_number):
    return ('{:0<'+str(len_number)+'}').format(number)
#--------------MAIN------------------
a = 12345
b = 1234567
c = 0
d = 11
e = 123456
f = 12345678
g = 1234567
h = 'abcdabcd'
i = 'A'
j = 'ABCD'
print(a,b,c,d,e,f,g,h,i,j)
# > 12345 1234567 0 11 123456 12345678 1234567 abcdabcd A ABCD
a = number_zero_right(a,6)
b = number_zero_right(b,8)
c = number_zero_right(c,1)
d = number_zero_right(d,2)
e = number_zero_right(e,6)
f = number_zero_right(f,8)
g = number_zero_right(g,9)
print(a,b,c,d,e,f,g,h,i,j)
#> 123450 12345670 0 11 123456 12345678 123456700 abcdabcd A ABCD
Managed to get there, so thought I'd post in case someone has a similar issue. This only works on one column, but that's enough for me now.
#import pandas
import pandas as pd
#open file and convert data to str
data = pd.read_csv('Test.CSV', dtype = str)
# width of output string
width = 6
# fillchar
char ="_"
#Change the contents of column named ColumnID
data["ColumnID"]= data["ColumnID"].str.ljust(width, char)
#print output
print(data)
I'm trying to find the correlation between the open and close prices of 150 cryptocurrencies using pandas.
Each cryptocurrency data is stored in its own CSV file and looks something like this:
|---------------------|------------------|------------------|
| Date | Open | Close |
|---------------------|------------------|------------------|
| 2019-02-01 00:00:00 | 0.00001115 | 0.00001119 |
|---------------------|------------------|------------------|
| 2019-02-01 00:05:00 | 0.00001116 | 0.00001119 |
|---------------------|------------------|------------------|
| . | . | . |
I would like to find the correlation between the Close and Open column of every cryptocurrency.
As of right now, my code looks like this:
temporary_dataframe = pandas.DataFrame()
for csv_path, coin in zip(all_csv_paths, coin_name):
    data_file = pandas.read_csv(csv_path)
    temporary_dataframe[f"Open_{coin}"] = data_file["Open"]
    temporary_dataframe[f"Close_{coin}"] = data_file["Close"]
# Create all_open based on temporary_dataframe data.
corr_file = all_open.corr()
print(corr_file.unstack().sort_values().drop_duplicates())
Here is a part of the output (the output has a shape of (43661,)):
Open_QKC_BTC Close_QKC_BTC 0.996229
Open_TNT_BTC Close_TNT_BTC 0.996312
Open_ETC_BTC Close_ETC_BTC 0.996423
The problem is that I don't want to see the following correlations:
between columns starting with Close_ and Close_ (e.g. Close_USD_BTC and Close_ETH_BTC)
between columns starting with Open_ and Open_ (e.g. Open_USD_BTC and Open_ETH_BTC)
between the same coin (e.g. Open_USD_BTC and Close_USD_BTC).
In short, the perfect output would look like this:
Open_TNT_BTC Close_QKC_BTC 0.996229
Open_ETH_BTC Close_TNT_BTC 0.996312
Open_ADA_BTC Close_ETC_BTC 0.996423
(PS: I'm pretty sure this is not the most elegant way to do what I'm doing. If anyone has any suggestions on how to make this script better, I would be more than happy to hear them.)
Thank you very much in advance for your help!
This is quite messy, but it at least shows you an option.
Here I am generating some random data, with suffixes (coin names) that are simpler than in your case.
import string
import numpy as np
import pandas as pd
#Generate random data
prefix = ['Open_','Close_']
suffix = string.ascii_uppercase #All uppercase letter to simulate coin-names
var1 = [None] * 100
var2 = [None] * 100
for i in range(len(var1)):
    var1[i] = prefix[np.random.randint(0,len(prefix))] + suffix[np.random.randint(0,len(suffix))]
    var2[i] = prefix[np.random.randint(0,len(prefix))] + suffix[np.random.randint(0,len(suffix))]
df = pd.DataFrame(data = {'var1': var1, 'var2':var2 })
df['DropScenario_1'] = False
df['DropScenario_2'] = False
df['DropScenario_3'] = False
df['DropScenario_Final'] = False
df['DropScenario_1'] = df.apply(lambda row: bool(prefix[0] in row.var1) and (prefix[0] in row.var2), axis=1) #Both are Open_
df['DropScenario_2'] = df.apply(lambda row: bool(prefix[1] in row.var1) and (prefix[1] in row.var2), axis=1) #Both are Close_
df['DropScenario_3'] = df.apply(lambda row: bool(row.var1[len(row.var1)-1] == row.var2[len(row.var2)-1]), axis=1) #Both suffixes are the same
#Combine all scenarios
df['DropScenario_Final'] = df['DropScenario_1'] | df['DropScenario_2'] | df['DropScenario_3']
#Keep only the part of the df that we want
df = df[df['DropScenario_Final'] == False]
#Drop our messy columns
df = df.drop(['DropScenario_1','DropScenario_2','DropScenario_3','DropScenario_Final'], axis = 1)
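The same three rules can also be applied to the correlation matrix itself, without the helper columns. This is only a sketch under some assumptions: the prices are random numbers, the three coin names are taken from the output in the question, and all_prices stands in for the combined Open_/Close_ frame.

```python
import numpy as np
import pandas as pd

# Random stand-in data; all_prices plays the role of the combined frame
rng = np.random.default_rng(0)
coins = ["QKC_BTC", "TNT_BTC", "ETC_BTC"]
data = {}
for coin in coins:
    data[f"Open_{coin}"] = rng.random(50)
    data[f"Close_{coin}"] = rng.random(50)
all_prices = pd.DataFrame(data)

corr = all_prices.corr()
open_cols = [c for c in corr.columns if c.startswith("Open_")]
close_cols = [c for c in corr.columns if c.startswith("Close_")]

# Open_ rows against Close_ columns: no Open/Open or Close/Close pairs remain
cross = corr.loc[open_cols, close_cols]

# Flatten to a Series indexed by (Close_ column, Open_ row)
pairs = cross.unstack()
close_coin = pairs.index.get_level_values(0).str[len("Close_"):]
open_coin = pairs.index.get_level_values(1).str[len("Open_"):]

# Drop the pairs where both columns belong to the same coin
pairs = pairs[np.asarray(close_coin != open_coin)].sort_values()
print(pairs)
```

Selecting the Open_ rows and Close_ columns up front also means each pair appears only once, so drop_duplicates is not needed.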
Hope this helps
P.S. If you find the secret key to trading bitcoins without ending up on r/wallstreetbets, I'll take 5% ;)