So, I'm relatively new to Python/pandas. I do have a couple of years of programming under my belt, but mainly in Java/C++, so nothing like a scripting language such as Python.
My new job has me doing some scripting work, and it's been pretty basic so far, so I decided to take on more and hopefully show my bosses that I am driven and willing to work hard and move up the ladder. With that in mind, I wanted to make one of our data analysis tasks more efficient by using pandas to remove redundancies from an Excel sheet. However, the redundancy I'm trying to "parse" for is a substring within a "description" Excel column.
import pandas as pd
xlsx = pd.ExcelFile('Filename.xlsx')
sheet1 = xlsx.parse(0)
So I read the Excel file and parsed it into a data frame. I realized it might have been easier to just use read_csv instead, but by the time I thought of it I was already committed to following through with Excel. (Unless the transition isn't difficult; I'm just confused about how I could export as comma-delimited when the original file is space-delimited.)
Here is how the data is kind of laid out:
ID Count#1 Count#2 Count#3 Description
1A42H4 1 0 2 Blahblah JIG GN=TRAC Blah Blah
242JB4 0 0 2 Blahblah JIG GN=SMOOTH Blah Blah
3MIVJ2 2 0 2 Blahblah JIG GN=TRAC Blah Blah
4JAXI3 1 0 3 BlahBlah JIG GN=TRAC Blah Blah
So I want to parse this datasheet, look for any redundant GN=TRAC (really any matching GN=something), and then organize them all together into a separate datasheet. So I made an array of just the description column:
`array = dataframe.description`
Then I decided to split the strings on "JIG", because I didn't need it and it was constant across all rows:
`splits = array.str.split('JIG')`
That left me with
`splits[0] = ['Blahblah ', ' GN=TRAC Blah Blah']`
Now I wanted to isolate the `GN=TRAC` part, so I pulled the second piece of every split into another array:
`array2[n] = splits[n][1]`
and then did another split, `splits2 = array2.str.split(' ')`, so that `GN=TRAC` ends up in the first position, isolated by itself. I realize I could have just split the original description on spaces, but the rows have different numbers of words, so I wouldn't be able to parse or compare them, since the position of `GN=TRAC` varies from row to row.
Now, to iterate through and compare them all, I came up with this little function:
counter = 0
temp = counter + 1
print(sheet1.iloc[counter])
while counter <= len(sheet1):
    if splits2[counter][0] == splits2[temp][0]:
        print(sheet1.iloc[temp])
        temp += 1
    if splits2[counter][0] != splits2[temp][0]:
        temp += 1
counter += 1
But I can't get past this point. I'm able to iterate through and find all of the rows that share the first row's GN=TRAC value, but the counter isn't advancing, so the next row never becomes the comparison value. I've tried a couple of variations, but I was hoping a fresh pair of eyes would help. Based on the table above, it should then go to the second row, find all the rows that match GN=SMOOTH, and so on until the counter reaches the final row.
Lastly, I was hoping I could get some help on the best way to group the rows together by their GN= value in an output.xlsx. I realize there is ExcelWriter and to_excel, but I'm just not sure how I would use them from here. I've read through the documentation as much as I can, and there doesn't seem to be a single function that does this for me, which is why it feels pretty complicated (do let me know how to make it more efficient and scriptable, though; I can generalize it later).
P.S. Is there also a way to write to the Excel file in descending order of Count#1?
You could try
sheet1['GN'] = sheet1.Description.apply(lambda x: x.split('JIG')[1].split()[0])
which should insert a new column with the name GN into your DataFrame with the appropriate GN=* values.
To sort the DataFrame by any particular column you can use sheet1.sort_values('GN').
To save the DataFrame to an excel file you can use sheet1.to_excel('filename'). You can chain this with the above sort function to write a file sorted by a particular column.
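Putting those pieces together, a minimal sketch (assuming the column headers are literally Description and Count#1 as shown in the question, and that Filename.xlsx is the input) might look like:
import pandas as pd

# Read the first sheet of the workbook from the question
sheet1 = pd.ExcelFile('Filename.xlsx').parse(0)

# Pull the GN=* token that follows "JIG" out of each Description
sheet1['GN'] = sheet1.Description.apply(lambda x: x.split('JIG')[1].split()[0])

# Group the redundant rows together by GN, with Count#1 descending inside each group
sheet1 = sheet1.sort_values(['GN', 'Count#1'], ascending=[True, False])

# Write the organized result to a new workbook
sheet1.to_excel('output.xlsx', index=False)
If you instead want each GN group on its own sheet, you could loop over sheet1.groupby('GN') and write each group through a shared pd.ExcelWriter.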
Related
Going on two months in Python, and I am focusing hard on pandas right now. In my current position I use VBA on data frames, so I'm learning this to slowly replace it and further my career.
As of now I believe my real problem is a lack of understanding of a key concept (or several). Any help would be greatly appreciated.
That said, here is my problem:
Where could I go to learn more about how to do this kind of more precise filtering? I'm very close, but there is one key piece I'm missing.
Goal(s)
My main goal: I need to skip certain values in my ID column.
The code below strips out the dashes "-" and reads only the first 9 digits, yet I need to skip certain IDs because they are unique.
After that I'll start to work on comparing multiple sheets.
The main data frame's IDs are formatted as 000-000-000-000.
The other data frames that I will compare it to have IDs with no dashes "-", as 000000000, and with three fewer 000's, totaling nine digits.
The unique IDs that I need skipped are the same in both data frames, but are formatted completely differently, ranging from 000-000-000_#12 and 000-000-000_35 to 000-000-000_z.
My code that I will use on each ID except the unique ones:
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
but I want to use an if statement, something like this (which does not work):
lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs"]
if ~dfSS["ID"].isin(lst).any():
    dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
    pass
For more clarification my input DataFrame is this:
ID Street # Street Name
0 004-330-002-000 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001-243-313-000 2235 Narnia
3 002-730-032-000 2149 Narnia
4 000-000-000_a 1234 Narnia
And I am looking to do this as the output:
ID Street # Street Name
0 004330002 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001243313000 2235 Narnia
3 002730032000 2149 Narnia
4 000-000-000_a 1234 Narnia
Notes:
dfSS is my DataFrame variable name, i.e. the Excel sheet I am using. "ID" is my column heading; I will make it the index after the fact.
My data frame for this job is small, with (rows, columns) of (2500, 125).
I do not get an error message, so I am guessing maybe I need a loop of some kind. I'm starting to test for loops with this as well... no luck there yet.
Here is where I have been to research this:
Comparison of a Dataframe column values with a list
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
if statement with ~isin() in pandas
recordlinkage module - I didn't think this was going to work
Regular expression operations - Having a hard time fully understanding this at the moment
There are a number of ways to do this. The first way here doesn't involve writing a function.
# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing
The second way is to write a function that conditionally converts the IDs, and it's not as fast as the first method.
def transform_ID(ID_val):
    # Only transform IDs that are not in the skip list
    if ID_val not in lst:
        return ID_val.replace("-", "")[:9]
    return ID_val  # leave the unique IDs untouched

dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
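For reference, here is a small self-contained sketch of the first (conditional indexing) approach on made-up sample data loosely following the question:
import pandas as pd

# Made-up sample data loosely following the question
dfSS = pd.DataFrame({
    "ID": ["004-330-002-000", "021-521-410-000_128", "001-243-313-000", "000-000-000_a"],
    "Street Name": ["Narnia"] * 4,
})
lst = ["021-521-410-000_128", "000-000-000_a"]  # the unique IDs to skip

# Transform every ID into a placeholder column, then copy the transformed value
# back only for the rows whose original ID is NOT in the skip list
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
mask = ~dfSS["ID"].isin(lst)
dfSS.loc[mask, "ID"] = dfSS.loc[mask, "ID_trans"]

print(dfSS[["ID", "Street Name"]])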
This is based on #xyzxyzjayne answers but I have two issues I can not figure out.
First issue: I get this warning (see the Edit below):
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Documentation for this warning
You'll see in the code below that I tried to put in .loc, but I can't seem to figure out how to eliminate this warning by using .loc correctly. Still learning it. No, I will not just ignore it even though the code works; this is a learning opportunity, I say.
Second issue: I do not understand this part of the code. I know the left side of the comma is supposed to select rows, and the right side columns. That said, why does this work? ID is a column, not a row, when this code is run:
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
The area I don't understand yet is the left side of the comma (,) in this part:
df.loc[~df["ID "].isin(uniqueID), "ID "]
That said, here is the final result. Basically, as I said, it's xyzxyzjayne's help that got me here, but I'm adding more .locs and playing with the documentation until I can eliminate the warning.
# A whole list of IDs I had to enter manually (1000+ entries) that will be used
# in the code below; these IDs get skipped. Example entry: "032-234-987_#4256"
uniqueID = ["032-234-987_#4256", ...]
import time  # needed for the date stamp in the file name below

# Keep only the columns I need, to make the DataFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
         'Number of Vehicles Removed', 'County']]

# "Place Holder" will be our new column, built with this filter
df.loc[:, "Place Holder"] = df.loc[:, "ID "].str.replace("-", "").str[:9]

# The next line is the filter that goes through the list and skips those IDs.
# Work in progress to fully understand.
df.loc[~df["ID "].isin(uniqueID), "ID "] = df.loc[~df["ID "].isin(uniqueID), "Place Holder"]

# Make the ID our index
df = df.set_index("ID ")

# Add today's date to our file name
todaysDate = time.strftime("%m-%d-%y")

# Write it out as an Excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")
I will edit this once I get rid of the warning and figure out the left side, so I can explain it for everyone who needs/sees this post.
Edit: SettingWithCopyWarning:
Fixed this chained-indexing problem by making a copy of the original data frame before filtering and making everything .loc, as xyzxyzjayne has helped me with. Before you start to filter, use DataFrame.copy(), where DataFrame is the name of your own dataframe.
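For reference, a minimal sketch of that fix (the input file name here is assumed, the skip list is truncated to one example entry, and the column names follow the post):
import pandas as pd

# Skip list from the post (truncated to one example entry here)
uniqueID = ["032-234-987_#4256"]

df = pd.read_excel("source.xlsx")  # assumed input file name

# .copy() gives an independent DataFrame, so the .loc assignments below write
# into this new frame rather than a view of the original, and the warning goes away
df = df[['ID ', 'Street #', 'Street Name']].copy()

df.loc[:, "Place Holder"] = df.loc[:, "ID "].str.replace("-", "").str[:9]

# Boolean row mask: True for rows whose ID is not in the skip list
mask = ~df["ID "].isin(uniqueID)
df.loc[mask, "ID "] = df.loc[mask, "Place Holder"]
As for the second issue: the boolean Series on the left of the comma is the row selector, so .loc[mask, "ID "] means "in the rows where mask is True, the column ID".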
I'm fairly new to Python and still learning the ropes, so I need help with a step-by-step program without using any functions. I understand how to count through an unknown column range and output the quantity. However, for this program, I'm trying to loop through a column, pick out the unique numbers, and count their frequency.
So I have an Excel file with random numbers down column A. I only put in 20 numbers, but let's pretend the range is unknown. How would I go about extracting the unique numbers and putting them into a separate column, along with how many times each appeared in the list?
I'm not really sure how to go about this. :/
unique = 1
while xw.Range((unique, 1)).value != None:
    frequency = 0
    if unique != unique: break
    quantity += 1
"end"
I presume, since you can't use functions, this may be homework... so, at a high level:
You could first go through the column and put all the values in a list.
Secondly, take the first value from the list and go through the rest of the list: is it in there? If so, then it is not unique. Remove the value where you found the duplicate from the list and keep going; if you find another, remove that too.
Then take the second value, and so on.
You would just need a list comprehension, some loops and perhaps .pop(). A sketch of the idea follows.
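A rough sketch of that idea with plain loops (the sample list is made up and stands in for the values you would read out of column A):
# Values as they might come out of column A (made-up sample data)
values = [3, 7, 3, 1, 7, 7, 2]

counts = []                # will hold (value, frequency) pairs
remaining = list(values)

while remaining:
    current = remaining.pop(0)      # take the first value still in the list
    frequency = 1
    # remove every later duplicate of it, counting as we go
    while current in remaining:
        remaining.remove(current)
        frequency += 1
    counts.append((current, frequency))

print(counts)  # [(3, 2), (7, 3), (1, 1), (2, 1)]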
Using the pandas library would be the easiest way to do this. I created a sample Excel sheet having only one column called "Random_num".
import pandas
data = pandas.read_excel("sample.xlsx", sheet_name = "Sheet1")
print(data.head()) # This would give you a sneak peek of your data
print(data['Random_num'].value_counts()) # This would solve the problem you asked for
# Make sure to pass your column name within the quotation marks
#eg: data['your_column'].value_counts()
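If you then want those counts placed in a separate column of a workbook, as the question asks, one possible follow-up is below (the output file name and the Unique_num/Frequency column names are just chosen for this sketch):
# Turn the counts into a two-column DataFrame and write it next to the original data
counts = data['Random_num'].value_counts().rename_axis('Unique_num').reset_index(name='Frequency')
result = pandas.concat([data, counts], axis=1)
result.to_excel("sample_with_counts.xlsx", index=False)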
Thanks
I have an Excel sheet which has a remarks column. A cell in that column contains data in a format like:
1. There is a book.
2. There is also a pen along with the book.
3. So, I decided to study for a while.
When I convert that Excel file into a pandas data frame, the data frame only captures the 1st point, up to the newline. It won't capture point no. 2. So, how can I get all the points in an Excel cell into one cell of the data frame?
The data which I get looks like:
1. There is a book.
The data which I want should look like:
1. There is a book. 2. There is also a pen along with the book. 3. So, I decided to study for a while.
I created an excel file with a column named remarks which looks like below:
remarks
0 1. There is a book.
2. There is also a pen along with the book.
3. So, I decided to study for a while.
Here, I have entered all the text mentioned in your question into a single cell.
import pandas as pd
df = pd.read_excel('remarks.xlsx')
Now when I try to print the column remarks it gives:
df['remarks']
0 1. There is a book.\n2. There is also a pen al...
Name: remarks, dtype: object
To solve your problem try:
df['remarks_without_linebreak'] = df['remarks'].replace('\n',' ', regex=True)
If you print the rows in the column 'remarks_without_linebreak' you will get the result you want.
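Put together, a compact end-to-end sketch of the same idea (remarks.xlsx as above; the output file name is made up, and .str.replace is an equivalent alternative):
import pandas as pd

df = pd.read_excel('remarks.xlsx')

# .str.replace works just as well for stripping the embedded newlines
df['remarks_without_linebreak'] = df['remarks'].str.replace('\n', ' ')

# Write the cleaned-up column back out
df.to_excel('remarks_clean.xlsx', index=False)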
I am building a shallow array in pandas containing pairs of values (concept - document):
doc1 doc2
concept1 1 0
concept2 0 1
concept3 1 0
I parse an XML file and get (concept - doc) pairs; every time a new pair comes in, I add it to the pandas DataFrame.
Since an incoming pair might or might not contain values already present in the rows and/or columns (a new concept or a new document), I use the following code:
onp=np.arange(1,21,1).reshape(4,5)
oindex=['concept1','concept2','concept3','concept4',]
ohead=['doc1','doc2','doc3','doc5','doc6']
data=onp
mydf=pd.DataFrame(data,index=oindex, columns=ohead)
#... loop ...
mydf.loc['conceptXX','ep8']=1
It works well, except that the value stored in the data frame is 1.0 and not 1 (or a boolean), and when a new row and/or column is added the rest of its values are NaN. How can I avoid that? All the values added should be 0 or 1. Note: the intention is to also have some columns for calculations, so I can not just transform the whole dataframe into a boolean type, for instance with:
mydf=mydf.astype(object)
Thanks.
SECOND EDIT AFTER ALollz COMMENT
More explanation of the real problem.
I have an XML file that gives me the data in the following way:
<names>
  <name>michael
    <documents>
      <document>doc1</document>
      <document>doc2</document>
    </documents>
  </name>
  <name>mathieu
    <documents>
      <document>doc1</document>
      <document>docN</document>
    </documents>
  </name>
</names>
...
I want to pass this data to a dataframe to make calculations; basically, there are names that appear in different documents. I parse the XML with:
import xml.etree.ElementTree as ET

tree = ET.parse(myinputFile)
root = tree.getroot()
and go on adding the new values into the dataframe one by one.
Sometimes when adding, the name is already present in the dataframe but a new doc has to be added, and vice versa.
I hope this clarifies it a bit.
I was about to write this as the solution:
mydf.fillna(0, inplace=True)
mydf=mydf.astype(int)
This changes all the NaN values to 0 and then converts them to int to avoid floats.
That has a negative side, though, because I might want to have some columns with text data; in that case an error occurs.
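One way around that limitation, sketched below with the setup from the question, is to restrict the fillna/astype cleanup to the numeric columns, so any text columns would simply be skipped:
import numpy as np
import pandas as pd

# Same setup as in the question
onp = np.arange(1, 21, 1).reshape(4, 5)
oindex = ['concept1', 'concept2', 'concept3', 'concept4']
ohead = ['doc1', 'doc2', 'doc3', 'doc5', 'doc6']
mydf = pd.DataFrame(onp, index=oindex, columns=ohead)

# Adding a new row/column introduces NaN elsewhere and upcasts to float
mydf.loc['conceptXX', 'ep8'] = 1

# Clean up only the numeric columns, so text columns (if any) are left untouched
num_cols = mydf.select_dtypes(include='number').columns
mydf[num_cols] = mydf[num_cols].fillna(0).astype(int)
print(mydf)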
I am going to start off by stating that I am very much new at working in Python. I have a very rudimentary knowledge of SQL, but this is my first go-round with Python. I have a CSV file of customer-related data and I need to output the records of customers who have spent more than $1000. I was also given this starter code:
import csv
import re

data = []
with open('customerData.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        data.append(row)

print(data[0])
print(data[1]["Name"])
print(data[2]["Spent Past 30 Days"])
I am not looking for anyone to give me the answer, but maybe nudge me in the right direction. I know that it has opened the file for reading, created a list (data), and is ready to output values from the first few rows. I am stuck trying to figure out how to reference a column's values without limiting it to a specific row number. Do I need to make another list for columns? Do I need to create a loop to output each record that meets the > $1000 criteria? Any advice would be much appreciated.
To get a particular column you could use a for loop. I'm not sure exactly what you're wanting to do with it, but this might be a good place to start.
for i in range(len(data)):
    print(data[i]['Name'])
len(data) equals the number of rows, so this iterates through the entire column.
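Building on that, a possible sketch of the filtering step (assuming the amount lives in the "Spent Past 30 Days" column from the starter code and parses as a number):
# Keep only the customers who spent more than $1000
big_spenders = []
for row in data:
    # DictReader returns strings, so convert the amount before comparing
    if float(row["Spent Past 30 Days"]) > 1000:
        big_spenders.append(row)

for row in big_spenders:
    print(row["Name"], row["Spent Past 30 Days"])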
The sample code does not give away the secret of the data structure. It looks like maybe a list of dicts, which does not make much sense to me, so I'll guess how data is organized. Assuming data is a list of lists, you can get at a column with a list comprehension:
data = [['Name','Spent Past 30 Days'],['Ernie',890],['Bert',1200]]
spent_column = [row[1] for row in data]
print(spent_column) # prints: ['Spent Past 30 Days', 890, 1200]
But you will probably want to know who is a big spender so maybe you should return the names:
data = [['Name','Spent Past 30 Days'],['Ernie',890],['Bert',1200]]
spent_names = [row[0] for row in data[1:] if int(row[1])>1000]
print(spent_names) # prints: ['Bert']
If the examples are unclear I suggest you read up on list comprehensions; they are awesome :)
You can do all of the above with regular for-loops as well.
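For example, a for-loop version of that last comprehension (same made-up data as above):
data = [['Name', 'Spent Past 30 Days'], ['Ernie', 890], ['Bert', 1200]]

spent_names = []
for row in data[1:]:          # skip the header row
    if int(row[1]) > 1000:
        spent_names.append(row[0])

print(spent_names)  # prints: ['Bert']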