So I am venturing into Python scripting. I am a beginner, but I have been tasked with converting an Excel formula to Python code. I have a tab-delimited text file with 3+ million rows, each with three columns.
All values come in as strings, and the first two columns are fine. The problem is the third column: when the data is downloaded, numerical values get padded with leading zeros to make 18 characters.
The same column also has values that contain a space in the middle, like 00 5372, and values that are really text, identified by letters or other characters, like ABC3400, 00-ab-fu-99 or 00-33-44-66.
A1 B1 Values Output
AA00 dd 000000000000056484 56484
AB00 dd 00 564842 00 564842
AC00 dd 00-563554-f 00-563554-f
AD00 dd STO45642 STO45642
AE00 dd 45632 45632
I need to clean these codes so that the output is clean, while:
Leaving the spaces in the middle of a value,
Removing leading and trailing spaces,
Removing the leading zeros when a value is padded with them.
For a limited number of rows I do this in Excel with the following formula.
=TRIM(IFERROR(IF((FIND(" ";A2))>=1;A2);TEXT(A2;0)))
*Semicolon due to regional language setting.
For the large file, I use the following Power Query step.
= Table.ReplaceValue(#"Trimmed Text", each [Values], each if Text.Contains([Values]," ") then [Values] else if Number.From([Values]) is number then Text.TrimStart([Values],"0") else [Values],Replacer.ReplaceValue,{"Values"})
First trim, then replace values. This does the job in Power Query very well. Now I would like to do it with a Python script, but as a beginner I am stuck at the very beginning. Can anyone help me with the library and code?
My end target is to get the data saved in txt/csv with cleaned values.
*Edited to correct point 1 (leaving, not removing, the inner spaces) and to add further clarification with sample data.
Try the code below (replace column1, column2, column3 with the respective column names and assign the file path to the variable file_name; if the Python script and the data file are saved in the same location, the file name alone is enough):
import pandas as pd

# the source is a tab-delimited text file, so read_csv (not read_excel) is the right reader;
# dtype=str keeps the values as strings so leading zeros are not lost on read
df = pd.read_csv(file_name, sep='\t', dtype=str, skipinitialspace=True)
df['column1'] = df['column1'].str.replace(' ', '')
df['column2'] = df['column2'].str.replace(' ', '')
df['column3'] = df['column3'].str.replace(' ', '')
df.to_csv('output.csv', index=False)
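That removes every space, though. If the goal is to mirror the Power Query logic exactly (trim, keep values with an inner space, strip leading zeros only when the value is purely numeric), a minimal sketch along those lines; the input path 'input.txt' and the column name Values are assumptions based on the sample data:

import pandas as pd

def clean_value(v):
    if not isinstance(v, str):         # leave blanks/NaN untouched
        return v
    v = v.strip()                      # remove leading/trailing whitespace
    if ' ' in v:                       # keep values with inner spaces as-is
        return v
    if v.isdigit():                    # purely numeric -> drop the zero padding
        return v.lstrip('0') or '0'
    return v                           # mixed text (ABC3400, 00-ab-fu-99, ...) untouched

# dtype=str keeps the leading zeros intact while reading
df = pd.read_csv('input.txt', sep='\t', dtype=str)
df['Values'] = df['Values'].map(clean_value)
df.to_csv('cleaned.csv', sep='\t', index=False)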
I have the following string in a column within a row in a pandas dataframe. You could just treat it as a string.
;2;613;12;1;Ajc hw EEE;13;.387639;1;EXP;13;2;128;12;1;NNN XX Ajc;13;.208966;1;SGX;13;..
It goes on like that.
I want to convert it into a table using the semicolon ; symbol as a delimiter. The problem is that there is no newline delimiter, and I have to estimate it to be every 10 items.
So, it should look something like this.
;2;613;12;1;Ajc hw EEE;13;.387639;1;EXP;13;
2;128;12;1;NNN XX Ajc;13;.208966;1;SGX;13;..
How do I convert that string into a new dataframe in pandas? After every 10 semicolon-delimited items, a new row should be created.
I have no idea how to do this, any help would be greatly appreciated in terms of tools or ideas.
This should work
import pandas as pd

# removing first value as it's a semicolon
data = ';2;613;12;1;Ajc hw EEE;13;.387639;1;EXP;13;2;128;12;1;NNN XX Ajc;13;.208966;1;SGX;13;'[1:]
data = data.split(';')
row_count = len(data) // 10
data = [data[x*10:(x+1)*10] for x in range(row_count)]
pd.DataFrame(data)
I used floor division (//) so the row count is an integer, which range() requires; the trailing semicolon also leaves an empty item at the end of the split, so the length is not exactly divisible by 10 and the floor division simply drops that leftover.
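If this needs to be reused, the same idea can be wrapped in a small helper; a minimal sketch, where the field count of 10 comes from the question and everything else (the function name, stripping the outer semicolons) is an assumption:

import pandas as pd

def string_to_frame(raw, fields_per_row=10):
    # strip the leading/trailing semicolons, then split into individual items
    items = raw.strip(';').split(';')
    # group the flat list into rows of fields_per_row items each
    rows = [items[i:i + fields_per_row] for i in range(0, len(items), fields_per_row)]
    return pd.DataFrame(rows)

df = string_to_frame(';2;613;12;1;Ajc hw EEE;13;.387639;1;EXP;13;'
                     '2;128;12;1;NNN XX Ajc;13;.208966;1;SGX;13;')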
I'm very new to Python (Python 3.6 to be exact), and I'm having difficulty with str.split().
I have a .dat file with 11 columns of numbers in several rows (in string format for now) but I need to plot a graph with the data.
The first column of numbers is "x", and the 10 other columns are "Y". I've split files into 2 columns before, but not 11. One of the requirements is that it needs to work for any number of columns, and that's what I can't figure out.
So far I have:
#Make Columns Data_X and _Y
data_X=[]
data_Y=[]
#Open file.dat in Python and split columns
file = open('file.data', 'r')
for line in file.readlines():
    data_x, data_y, data_y2, data_y3, ..., data_y10 = line.split()
Then after this;
#Convert string to float
data_X = numpy.array(x, dtype=float)
data_Y = numpy.array(y, dtype=float)
This can make the 11 columns, and I can convert them to floats afterwards for my plot, but I know this isn't infinitely repeatable (a y12 column will break it)... and I'm not so sure about the data_X/data_Y=[] part either.
How do I split the strings into columns in a way that works for any number of columns? A big stipulation is that I can't use pandas either (on top of that, I don't know what it does).
Thank you, and I'm sorry if this has been asked a lot, but the closest solution I found to my problem was this, which only brought up one row:
for line in file.readlines():
    data_X, data_Y = line.split(' ', 1)
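For reference, one common way to do this without pandas is to collect the split rows and transpose them with zip, so nothing in the code depends on the number of columns; a minimal sketch, assuming whitespace-delimited columns where the first column is x and every remaining column is a Y series (the file name is an assumption based on the question):

import numpy as np

rows = []
with open('file.dat') as f:
    for line in f:
        if line.strip():              # skip blank lines
            rows.append(line.split())

columns = list(zip(*rows))            # transpose: one tuple per column
data_X = np.array(columns[0], dtype=float)
data_Y = [np.array(col, dtype=float) for col in columns[1:]]
# data_Y now holds one float array per Y column, however many there are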
I am using the code below to remove all non-English characters:
df.text.replace({r'[^\x00-\x7F]+':''}, regex=True, inplace=True)
where df has a column called text with text in it like below:
text
hi what are you saying?
okay let me know
sounds great, mikey
ok.
right
ご承知のとおり、残念ながら悪質な詐欺が増加しているようですのでお気を付けください。\n
¡Hola miguel! Lamento mucho la confusión cau
expected output:
text
hi what are you saying?
okay let me know
sounds great, mikey
ok.
right
For the rows where my code removes characters,
I want to delete those rows from the df completely: if the code replaces any non-English characters in a row, that row should be dropped, to avoid keeping rows with either zero characters or a few meaningless characters after they have been altered by the code above.
You can use
df[~df['text'].str.contains(r'[^\x00-\x7F]')]
Pandas test:
import pandas as pd
df = pd.DataFrame({'text': ['hi what are you saying?', 'ご承知のとおり、残念ながら悪質な詐欺が増加しているようですのでお気を付けください。'], 'another_col':['demo 1', 'demo 2']})
df[~df['text'].str.contains(r'[^\x00-\x7F]')]
# text another_col
# 0 hi what are you saying? demo 1
Notes:
df['text'].str.contains(r'[^\x00-\x7F]') finds all values in the text column that contain a non-ASCII character (this is our "mask")
df[~...] only keeps those rows that did not match the regex.
str.contains() returns a Series of booleans that we can use to index our frame
patternDel = "[^\x00-\x7F]"
filter = df['Event Name'].str.contains(patternDel)
I tend to keep the things we want as opposed to deleting rows. Since filter represents the rows we want to delete, we use ~ to get all the rows that don't match and keep them:
df = df[~filter]
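If you do want to follow the original idea literally (run the replacement first, then drop every row that was actually altered), a sketch along those lines, assuming df is the frame from the question with its text column:

cleaned = df['text'].str.replace(r'[^\x00-\x7F]+', '', regex=True)
changed = cleaned != df['text']          # True where the replacement removed something
df = df[~changed].reset_index(drop=True)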
I have a set of columns that contain dates (imported from Excel file), and I need to process them as follows:
If a cell in one of those columns is blank, set another column to 1, else that column is 0. This allows me to sum all the 1's and show that those items are missing.
This is how I am doing that at present:
df_combined['CDR_Form_notfound'] = np.where(df_combined['CDR-Form'].mask(df_combined['CDR-Form'].str.len()==0).isnull(),1,0)
A problem I am having is that I have to format those columns so that A) dates are trimmed to show only the day/month/year and B) some of the columns have a value of "see notes" in them instead of a date or a blank. That "see notes" is essential to accounting properly for missing items; it has to be there to keep the cell from flagging as empty and the item counting as missing (adding to the 'blank cells' count). The actual problem is that if I run this code before the .isnull code above, every blank becomes a NaN or a nan or a NaT, and then NOTHING flags as null/missing.
This is the code I am using to trim the date strings and change the "see notes" to a string...because otherwise it just ends up blank in the output.
for c in df_combined[dateColumns]:
    df_combined[c] = df_combined[c].astype(str)  # uncomment this if columns change from dtype=str
    df_combined[c] = np.where(df_combined[c].str.contains("20"), df_combined[c].str[:10], df_combined[c])
    df_combined[c] = np.where(df_combined[c].str.contains("see notes"), df_combined[c].str, df_combined[c])
I think my problem might have something to do with the dtypes of the columns. When I run print(df.dtypes), every column shows as 'object', except for one I specifically set to int using this:
df_combined['Num'] = df_combined['Num'].apply(lambda x: int(x) if x == x else "")
Are you trying to count NaNs?
If yes, you can just do:
df.isnull().sum()
I see that you mention "blank" because it is coming from Excel, so what you could do is transform these blanks into NaN before running the command above, using:
df['CDR-Form'].replace('', np.nan, inplace=True)
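Putting it together for the flag column described in the question, a sketch under the assumption that blanks and whitespace-only cells should count as missing, while real dates and "see notes" count as present:

import numpy as np

# normalise empty / whitespace-only cells to NaN first
df_combined['CDR-Form'] = df_combined['CDR-Form'].replace(r'^\s*$', np.nan, regex=True)

# 1 = missing, 0 = present (a trimmed date or "see notes" both count as present)
df_combined['CDR_Form_notfound'] = np.where(df_combined['CDR-Form'].isnull(), 1, 0)

missing_total = df_combined['CDR_Form_notfound'].sum()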
Today is my first time using OpenOffice and I need some help to get started.
My boss gave me two Excel files containing data I have to complete.
The 1st sheet contains in column 'A' a supplier code and a bunch of other columns with data, like the product dimensions, that I need to keep.
The 2nd contains in column 'A' the same supplier code, though not in the same order and with many duplicates, and it has a store code in column 'B'.
I need to add a column to the 1st sheet containing the store code when the code in column 'A' matches in both sheets.
It's not really my job but since everyone is gone for the summer he charged me to do it.
My problem is that both sheets are over 12,000 lines long, and I know that only 700 to 800 articles will have a match. And since it's my first time using Excel/OpenOffice (I know, I know...) I was wondering if there is a way to automate this work, either within OpenOffice or with a script. I've found a lot of similar posts but none of them is quite what I need.
Any help is welcome.
Thx !
I would first try to let it run disregarding memory concerns and only fix that problem if you encounter it. 12,000 seems like a lot, but you might be surprised at what you can throw at python and have it "just work". I almost exclusively use csv files in programming when I encounter excel or the like...
import csv
# B.csv:
# store# part#
# xx xx
# xx xx
# xx xx
# ...
partNums = {}
with open('B.csv') as bfile:
    breader = csv.DictReader(bfile)
    for row in breader:
        partNums[row['part#']] = row['store#']
# A.csv
# part# tag1 tag2 tag3 ...
# xx xx xx xx ...
# xx xx xx xx ...
# xx xx xx xx ...
# ...
with open('outfile.csv', 'w', newline='') as outfile:
    with open('A.csv', 'r', newline='') as afile:
        areader = csv.reader(afile)
        outwriter = csv.writer(outfile)
        headers = next(areader)
        headers.append('StoreCode')
        outwriter.writerow(headers)
        for row in areader:
            if row[0] in partNums:  # assuming the first column is part#
                row.append(partNums[row[0]])
            else:  # value if no store number exists
                row.append('')
            outwriter.writerow(row)
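For comparison, the same lookup can be written as a left merge in pandas; a minimal sketch, assuming both sheets have been exported to CSV with the column names used above (part# and store#):

import pandas as pd

a = pd.read_csv('A.csv', dtype=str)                            # part# plus the other columns to keep
b = pd.read_csv('B.csv', dtype=str).drop_duplicates('part#')   # keep one store# per part#

merged = a.merge(b[['part#', 'store#']], on='part#', how='left')
merged = merged.rename(columns={'store#': 'StoreCode'})        # match the header used above
merged.to_csv('outfile.csv', index=False)                      # unmatched rows end up blank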
You might also try a VLOOKUP() formula. It might work in Calc with the .xls files, but I would save them both as .ods first just to be safe. You would write the formula at the top of an empty column in File1, something like:
=VLOOKUP(A1;'file:///C:/FolderName/File2.ods'#$Sheet1.$A$1:$B$12000;2;0)
The syntax for VLOOKUP with examples is explained in this tutorial: VLOOKUP questions and answers.
Make sure to include the $ symbols in front of the column letter and row numbers - this tells the spreadsheet NOT to adjust those letters and numbers when you copy/paste the formula. The first argument (A1 here), which has no $ signs, WILL be adjusted to A2, A3, etc. when copy/pasted, so the formula matches the appropriate store code on each row.
Once you have that first cell working, copy that cell with the VLOOKUP formula. Then click on the name box (it's to the left of the formula bar) and type in the range you want to paste that formula into, something like H2:H12000 and press enter. The range you typed should be highlighted. Now paste your formula.
All the matching codes should appear and the non-matching ones will show #N/A. If you want blanks instead of #N/A you could do something like
=IF(ISERROR(VLOOKUP(A1;'file:///C:/FolderName/File2.ods'#$Sheet1.$A$1:$B$12000;2;0));"";VLOOKUP(A1;'file:///C:/FolderName/File2.ods'#$Sheet1.$A$1:$B$12000;2;0))
Basically that says "if this formula returns an error; then show a blank; otherwise show the result of this formula"