I have a CSV file that I load into a pandas DataFrame. I am practicing the loc method. The CSV file contains a list of James Bond movies, and I am passing letters to loc, but I could not interpret the results shown.
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)
bond.loc["A": "I"]
The result of the above code is a DataFrame of every film whose title starts with a letter from A to H (plus a film titled exactly "I", if one existed).
bond.loc["a": "i"]
And the result of the above code is an empty DataFrame.
What is happening here? I could not understand it. Could someone please help me understand this behaviour of pandas?
Following is the file: [contents of jamesbond.csv not reproduced]
Your dataframe uses the first column ("Film") as its index when it is imported (because of the option index_col="Film"). That column contains the name of each film stored as a string, and they all start with a capital letter. bond.loc["A":"I"] returns all films whose index is greater than or equal to "A" and less than or equal to "I" (pandas label slices are upper-bound inclusive). By the rules of string comparison in Python, that includes every film beginning with "A" through "H", and it would also include a film called exactly "I" if there were one. If you evaluate "A" <= "b" <= "I" at the Python prompt, you will see that lower-case letters are not within the range, because ord("b") > ord("I"); that is why bond.loc["a":"i"] returns nothing.
If you wrote bond.index = bond.index.str.lower(), that would change the index to lower case, and you could then select films with e.g. bond.loc["a":"i"] (but bond.loc["A":"I"] would no longer return any films).
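To see this concretely, here is a minimal sketch with a few real film titles standing in for jamesbond.csv:

import pandas as pd

bond = pd.DataFrame(
    {"Year": [1962, 1964, 1973, 1977]},
    index=pd.Index(["Dr. No", "Goldfinger", "Live and Let Die",
                    "The Spy Who Loved Me"], name="Film"),
).sort_index()

print(bond.loc["A":"I"])  # Dr. No and Goldfinger: the only titles that sort between "A" and "I"
print(bond.loc["a":"i"])  # empty: every title starts with a capital, and capitals sort before "a"

bond.index = bond.index.str.lower()
print(bond.loc["a":"i"])  # dr. no and goldfinger: the lower-cased index now matches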
DataFrame.loc["A":"I"] returns the rows that start with the letter in that range - from what I can see and tried to reproduce. Might you attach the data?
I'm a less-than-a-week beginner in Python and data science, so please forgive me if these questions seem obvious.
I've scraped data from a website, but the result is unfortunately not well formatted and I can't use it without transforming it.
My Data
I have a string column which contains a lot of features that I would like to convert into dummy variables.
Example of string : "8 équipements & optionsextérieur et châssisjantes aluintérieurBluetoothfermeture électrique5 placessécuritékit téléphone main libre bluetoothABSautreAPPUI TETE ARclimatisation"
What I would like to do
I would like to create a dummy column "Bluetooth" which would be equal to one if the pattern "bluetooth" is contained in the string, and zero if not.
I would like to create another dummy column "Climatisation" which would be equal to one if the pattern "climatisation" is contained in the string, and zero if not.
...etc
And do it for 5 or 6 patterns which interest me.
What I have tried
I wanted to use a match test with regular expressions and combine it with the pd.get_dummies method.
import re
import pandas as pd

def match(My_pattern, My_strng):
    m = re.search(My_pattern, My_strng)
    if m:
        return True
    else:
        return False

pd.get_dummies(df["My messy strings colum"], ...)
I haven't succeeded in finding how to set the pd.get_dummies arguments to apply the test I would like to the column.
I was even wondering whether this is the best strategy, and whether it wouldn't be easier to create other parallel columns and apply a match.group() to my messy strings to populate them.
I'm not sure I would know how to program that anyway.
Thanks for your help
I think one way to do this would be:
df.loc[df['My messy strings colum'].str.contains("bluetooth", na=False),'Bluetooth'] = 1
df.loc[~(df['My messy strings colum'].str.contains("bluetooth", na=False)),'Bluetooth'] = 0
df.loc[df['My messy strings colum'].str.contains("climatisation", na=False),'Climatisation'] = 1
df.loc[~(df['My messy strings colum'].str.contains("climatisation", na=False)),'Climatisation'] = 0
The tilde (~) represents not, so the condition is reversed in this case to string does not contain.
na=False means that if your messy column contains any null values, these will not cause an error; they will simply be treated as not meeting the condition.
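A slightly more compact equivalent is to cast the boolean mask from str.contains straight to 0/1. This is a sketch; case=False is an added assumption so that both "Bluetooth" and "bluetooth" in the example string are caught:

import pandas as pd

df = pd.DataFrame({"My messy strings colum": [
    "jantes aluintérieurBluetooth...climatisation",  # shortened sample from the question
    None,  # a null value, handled by na=False
]})

for pattern, col in [("bluetooth", "Bluetooth"), ("climatisation", "Climatisation")]:
    df[col] = (df["My messy strings colum"]
               .str.contains(pattern, case=False, na=False)
               .astype(int))
print(df)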
I need really serious help with some code.
I have a dataframe in which I want to find duplicates across two columns, Material Part Number and Manufacturer. The columns have null values. The way I need to find duplicates is as follows.
I first check the Part Number column for rows with no null values, as I do not want the null values to be treated as duplicates.
If two part numbers match, I then check the Manufacturer column for duplicates.
In case both the manufacturer and the part number are the same for two or more rows, I output the result into a new column called Level of Duplicacy. The output is 'High' for all rows in which the part numbers and the manufacturers match exactly.
However, if the part numbers match and the manufacturers don't, the output in the column is 'Moderate'.
If the part number itself doesn't match, the Level of Duplicacy is 'Not duplicate'.
Also, for rows where the part number is NA, set the Level of Duplicacy to 'Not duplicate'; for rows where the manufacturer is NA (but the part number matches), set it to 'Moderate'.
This is my input table
[screenshot of the input table]
The code I have written for this is:
for i in range(len(df)):
    if pd.isnull(df.loc[i, 'Material Part Number']) == False:
        if (df['Material Part Number'].duplicated(keep=False))[i] == True:
            if pd.isnull(df.loc[i, 'Manufacturer']) == False:
                if (df['Manufacturer'].duplicated(keep=False))[i] == True:
                    df.loc[i, 'Level of Duplicacy'] = 'High'
                else:
                    df.loc[i, 'Level of Duplicacy'] = 'Moderate'
            else:
                df.loc[i, 'Level of Duplicacy'] = 'Moderate'
        else:
            df.loc[i, 'Level of Duplicacy'] = 'Not duplicate'
    else:
        df.loc[i, 'Level of Duplicacy'] = 'Not duplicate'
The output I need is
[screenshot of the expected output]
The output I'm getting is
[screenshot of the actual output]
As you can see in the rows highlighted in yellow, my code isn't comparing manufacturers within one particular part number; it's comparing them across all part numbers, and I don't want that. I know that .duplicated() compares across the entire column, but what if I want it to compare within each unique part number and then find a match? More of a groupby with duplicated? Can one of you help me modify the code I have written?
Thanks a lot.
Running a loop through the whole data frame would require you to do element-wise comparisons of each item. I would suggest using vectorised boolean-mask operations within each group of duplicated part numbers instead. Have a look below; this might be helpful.
df["Level of Duplicacy"] = "Not Duplicate"
Partdups = df.loc[df["Material Part Number"].duplicated(),"Material Part Number"].unique()
for dup in Partdups:
Nums = df.loc[df["Material Part Number"] == dup,:]
dupNums = Nums.loc[Nums["Manufacturer"].duplicated(),"Manufacturer"].unique()
for num in dupNums:
Nums.loc[Nums["Manufacturer"] == num,"Level of Duplicacy"] = "High"
Nums.loc[Nums["Manufacturer"] != num,"Level of Duplicacy"] = "Moderate"
df.iloc[Nums["Material Part Number"].index,:] = Nums
df.loc[pd.isnull(df["Material Part Number"]),"Level of Duplicacy"] = "Not Duplicate"
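Since the question explicitly asks for "more of a groupby with duplicated", an alternative is to classify every row at once with duplicated on column pairs. This is a sketch, with column names taken from the question and a hypothetical sample frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Material Part Number": ["P1", "P1", "P1", "P2", None],
    "Manufacturer":         ["M1", "M1", "M2", "M3", "M4"],
})

# rows sharing a non-null part number with at least one other row
part_dup = (df["Material Part Number"].notna()
            & df.duplicated("Material Part Number", keep=False))
# rows that additionally share the manufacturer within that part number
pair_dup = (df["Manufacturer"].notna()
            & df.duplicated(["Material Part Number", "Manufacturer"], keep=False))

df["Level of Duplicacy"] = np.select(
    [part_dup & pair_dup, part_dup],
    ["High", "Moderate"],
    default="Not duplicate",
)
print(df)  # P1/M1 rows -> High, P1/M2 -> Moderate, P2 and the null row -> Not duplicate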
A minor problem is doing my head in. I have a dataframe similar to the following:
Number Title
12345678 A
34567890-S B
11111111 C
22222222-L D
This is read from an Excel file using pandas in Python, then the index is set to the first column:
db = db.set_index(['Number'])
I then lookup Title based on Number:
lookup = "12345678"
title = str(db.loc[lookup, 'Title'])
However... whilst anything postfixed with "-Something" works, anything without it doesn't find a location (e.g. 12345678 will not find anything, while 34567890-S will). My only hunch is that it's to do with looking up as either strings or ints, but I've tried a few things (converting the table to all strings, changing loc to iloc, ix, etc.) and so far no luck.
Any ideas? Thanks :)
UPDATE: Trying this from scratch doesn't exhibit the same behaviour (creating a test db presumably just sets everything as strings); however, importing from CSV results in the above, and...
Searching for "12345678" (as a string) doesn't find it, but 12345678 as an int will, and vice versa for the others. So the dataframe is matching the pure numbers in the index only as ints, and anything else as strings.
Also, I can't just leave the postfix out of the search, as I have multiple rows with differing postfixes, e.g. 34567890-S, 34567890-L, 34567890-X.
If you want to cast all entries to one particular type, you can use pandas.Series.astype:
db["Number"] = df["Number"].astype(str)
db = db.set_index(['Number'])
lookup = "12345678"
title = db.loc[lookup, 'Title']
Interestingly this is actually slower than using pandas.Index.map:
import numpy as np
import pandas as pd

# timing inputs: Series and Index objects of increasing size
x1 = [pd.Series(np.arange(n)) for n in np.logspace(1, 4, dtype=int)]
x2 = [pd.Index(np.arange(n)) for n in np.logspace(1, 4, dtype=int)]

def series_astype(x1):
    return x1.astype(str)

def index_map(x2):
    return x2.map(str)
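Applied to the question's lookup problem, the faster Index.map variant might look like this (a sketch; the two-row frame is hypothetical):

import pandas as pd

db = pd.DataFrame({"Title": ["A", "B"]},
                  index=pd.Index([12345678, "34567890-S"], name="Number"))

db.index = db.index.map(str)        # every label becomes a string
print(db.loc["12345678", "Title"])  # the string lookup now finds "A"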
Consider treating all the indices as strings, since at least some of them are not numbers. If you want to look up a specific item that could have a postfix, you can match it by comparing the start of the strings with .str.startswith:
lookup = db.index.str.startswith("34567890")
title = db.loc[lookup, "Title"]
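For example, with data like the question's (a sketch; the titles for the -L and -X variants are made up):

import pandas as pd

db = pd.DataFrame(
    {"Number": ["12345678", "34567890-S", "34567890-L", "34567890-X"],
     "Title": ["A", "B", "B2", "B3"]}
).set_index("Number")

print(db.loc[db.index.str.startswith("34567890"), "Title"])  # all three postfixed variants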
I am iterating over some data in a pandas dataframe, searching for specific keywords, but the regex search results in a KeyError: 19.
I've tried to pull the data out of the specific cell, place it in a string object and search through that, but every time I attempt to point anything at the data in that column, I get a KeyError: 19.
To preface my code example: I have pulled specific chunks out of the dataframe and placed them in a list of lists. (In these chunks, I have kept all of the columns that were in the original dataframe.)
Here is an example of the iteration I am attempting:
for eachGroup in mainList:
    for lineItem in eachGroup:
        if re.search(r'( keyword )', lineItem[19], re.I):
            dostuff
As you might have guessed, the data I am searching for keywords in is column 19, which has data formatted like this:
3/23/2019 11:32:0 3/23/2019 11:32:0 3/23/2019 14:3:0 CSG CHG H6 27 1464D Random Random Random 81
Every other attempt at searching for keywords in different columns executes fine without any errors. Why would this case alone return a KeyError?
To add some more clarity, even the following code produces the same KeyError:
for eachGroup in mainList:
    for lineItem in eachGroup:
        text = lineItem[19]
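For reference, this exact error can be reproduced if each lineItem is still a pandas Series (a dataframe row) rather than a plain list, because plain [] indexing on a Series with integer labels is label-based, not positional. This is an assumption, since the code that builds mainList isn't shown:

import pandas as pd

# a Series whose integer labels do not include 19
row = pd.Series(["a", "b", "c"], index=[100, 101, 102])
# row[19] would raise KeyError: 19 (label-based lookup)
print(row.iloc[2])  # positional lookup always works: "c"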
Here's a WTF moment...
Instead of using Python's for loop, I decided to be more granular and loop through with a while loop. Needless to say, it worked.
The code below fixes the issue, though why it does I have no clue:
bigCount = len(mainList)
count = 0
while count < bigCount:
    smallCount = 0
    while smallCount < len(mainList[count]):
        if re.search(r'( keyword )', mainList[count][smallCount][19], re.I):
            dostuff
        smallCount += 1
    count += 1
Try changing re.search(r'( keyword )', lineItem[19], re.I) to re.match('(.*)keyword(.*)', lineItem[19]). Both re.search and re.match return a match object (or None, which is falsy), so either can be used directly in an if statement; the difference is that re.match only matches from the start of the string, which is why the (.*) prefix and suffix are needed to ignore any other characters to the left or right. Hope it helps.
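A quick demonstration of the difference between the two calls:

import re

s = "contains keyword somewhere"
print(bool(re.search(r'keyword', s)))          # True: search scans the whole string
print(bool(re.match(r'keyword', s)))           # False: match is anchored at the start
print(bool(re.match(r'(.*)keyword(.*)', s)))   # True: the (.*) prefix absorbs the leading text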
name,played,wins,loses
Leeroy,19,7,12
Jenkins,19,8,11
Tyler,19,0,19
Napoleon Wilson,19,7,12
Big Boss,19,7,12
Game Dude,19,5,14
Macho Man,19,3,16
Space Pirate,19,6,13
Billy Casper,19,7,12
Otacon,19,7,12
Big Brother,19,7,12
Ingsoc,19,5,14
Ripley,19,5,14
M'lady,19,4,15
Einstein100,19,8,11
Dennis,19,5,14
Esports,19,8,11
RNGesus,19,7,12
Kes,19,9,10
Magnitude,19,6,13
Basically, this is a file called firesideResults, which I open in my code and have to check through. If the wins column contains a 0, I do not print the row; if it contains a number other than zero, I display it on the screen. However, I have multiple columns of numbers to deal with, and I can't find how to deal with only one column of them.
My code was going to be:
if option == ("C") or option == ("c"):
answer = False
file_3 = open("firesideResults.txt")
for column in file_3:
if ("0" not in column):
print(column)
But unfortunately, one of the other columns can also contain a 0, so I cannot do that. Thank you for your help, and if possible please list any questions that I could check for help, as I have been searching for so long.
Since you have comma-separated fields, the best way would be to use the csv module!
import csv

with open("firesideResults.txt") as file_3:
    cr = csv.reader(file_3)
    for row in cr:
        if row[2] != "0":
            print(row)
If the third column of a row (the wins column) is not "0", the row is printed. There is no substring issue, because it checks the exact field, and it checks only the wins column, not the other ones.
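Since the rest of this page uses pandas, here is an equivalent sketch with pandas, assuming the header row shown above is the first line of the file:

import pandas as pd

scores = pd.read_csv("firesideResults.txt")
print(scores[scores["wins"] != 0])  # keep only rows whose wins column is non-zero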