I have a CSV file. It looks something like this:
name,id,
AAA,1111,
BBB,2222,
CCC,3333,
DDD,2222,
I would like to extract the data in the id column and place it inside a data structure. For this, I used Python pandas. Here is the code for doing this:
import pandas as pd
csv_file = 'C:/test.csv'
df = pd.read_csv(csv_file)
column_items = df['id']
I would like to check whether there is a duplicate among the data items in the id column. The data items are stored in column_items. In this case, there is a duplicate.
I am using Python 2.7 and the pandas library.
To find out whether there are duplicate IDs in that whole column, do:
df['id'].duplicated().any()
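On the sample data above, that check might look like this (the DataFrame is built inline so the snippet runs without the CSV file):

```python
import pandas as pd

# Inline stand-in for C:/test.csv from the question
df = pd.DataFrame({'name': ['AAA', 'BBB', 'CCC', 'DDD'],
                   'id': [1111, 2222, 3333, 2222]})

# True if any id occurs more than once
has_dupes = df['id'].duplicated().any()
print(has_dupes)  # True, because 2222 appears twice

# To see which values are the duplicates:
dupes = df.loc[df['id'].duplicated(), 'id']
```

Note that duplicated() marks every occurrence after the first; passing keep=False instead marks all occurrences of a duplicated value.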
Total newbie with Python here, and I'm trying to learn "in the field".
So basically I managed to open a csv file, pick only the rows that have certain values in specific columns, and then print the rows.
What I'd love to do after this is basically get a random selection of one of the found rows.
I thought to do that by first creating a new CSV file, which at this point will only contain the filtered rows, and then randomly selecting from it.
Any ideas on the simplest way to do that?
Here's the portion of the code so far:
import csv
with open("top2018.csv") as f:
    reader = csv.reader(f)
    for row in reader:
        if (row[4] >= "0.8") and (row[6] <= "-4") and (row[12] >= "0.8"):
            print(row[2] + " -", row[1])
It will find 2 rows (I checked).
And then, for creating a new csv file:
import pandas as pd
artist = [row[2]]
name = [row[1]]
dict = {'artist': artist, 'name': name}
df = pd.DataFrame(dict)
df.to_csv('test.csv')
But I don't know why with this method, the new csv file has only 1 entry, while I'd want to have all of the found rows in it.
Hope something I wrote makes sense!
Thanks guys!
You are mixing up columns and rows; maybe you should rename the variable row to record so you can see better what is happening. Unfortunately, I have to guess what the data file might look like...
The dict variable (try not to use this name; it is actually a built-in type, and you don't want to overwrite it) is creating two columns, "artist" and "name", which seem to have values like [1.2]. So, dict (try to print it) could look like {"artist": [2.0], "name": [3.1]}, which is a single-row, two-column entity:
artist name
2.0 3.1
Try to get into pandas: use df = pd.read_csv() and the df[df.something > 0.3] style notation to filter tables. The csv package is better suited for truly tricky data wrangling.
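As a sketch of that pandas style (the file contents and column names here are made up for illustration; in the real top2018.csv the thresholds apply to columns 4, 6 and 12):

```python
import io
import pandas as pd

# Made-up stand-in for top2018.csv
csv_text = """id,name,artists,danceability,loudness
a1,SongA,Artist1,0.9,-5.0
b2,SongB,Artist2,0.5,-2.0
"""
df = pd.read_csv(io.StringIO(csv_text))

# Numeric comparisons, avoiding the string-comparison pitfall of row[4] >= "0.8"
filtered = df[(df['danceability'] >= 0.8) & (df['loudness'] <= -4)]

# Randomly pick one of the matching rows
pick = filtered.sample(1)
```

With sample(1) there is no need for the intermediate CSV file at all.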
I am trying to convert this string into a pandas dataframe where each value before the colon is a header and the next value is put in the column. This is what the data structure looks like:
{"data":[{"id":47179,"date":"2019-01-30T00:00:00.000Z","home_team":{"id":2,"abbreviation":"BOS","city":"Boston","conference":"East","division":"Atlantic","full_name":"Boston Celtics","name":"Celtics"},"home_team_score":126,"period":4,"postseason":false,"season":2018,"status":"Final","time":" ","visitor_team":{"id":4,"abbreviation":"CHA","city":"Charlotte","conference":"East","division":"Southeast","full_name":"Charlotte Hornets","name":"Hornets"},"visitor_team_score":94},{"id":48751,"date":"2019-02-09T00:00:00.000Z","home_team":{"id":2,"abbreviation":"BOS","city":"Boston","conference":"East","division":"Atlantic","full_name":"Boston Celtics","name":"Celtics"}}
This string I am converting has hundreds of games, but I stopped pasting after two. What is this data structure called, and how can I quickly move it to a DataFrame?
It looks like JSON, but it's not structured in a very useful way; everything is under the "data" key. To get it into a DataFrame, use a combination of Python's json module and pandas.json_normalize().
import json
import pandas as pd
data_string = """{"data":[{"id":47179,"date":"2019-01-30T00:00:00.000Z",
"home_team":{"id":2,"abbreviation":"BOS","city":"Boston",
"conference":"East","division":"Atlantic","full_name":"Boston Celtics",
"name":"Celtics"},"home_team_score":126,"period":4,"postseason":false,
"season":2018,"status":"Final","time":" ","visitor_team":{"id":4,
"abbreviation":"CHA","city":"Charlotte","conference":"East",
"division":"Southeast","full_name":"Charlotte Hornets","name":"Hornets"},
"visitor_team_score":94},{"id":48751,"date":"2019-02-09T00:00:00.000Z",
"home_team":{"id":2,"abbreviation":"BOS","city":"Boston",
"conference":"East","division":"Atlantic","full_name":"Boston Celtics",
"name":"Celtics"}}]}"""
raw_data = json.loads(data_string)
df = pd.json_normalize(raw_data['data'])
There is a lot of redundant information about the teams. Look at "home_team" for the two games; the information is the same. You could pull all the team data into a separate DataFrame and use the team "id" or "abbreviation" in the DataFrame with all the games.
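As a sketch of that idea (the data here is a made-up subset, using the dotted column names that pd.json_normalize produces):

```python
import pandas as pd

# Minimal stand-in for the normalized games table
games = pd.DataFrame({
    'id': [47179, 48751],
    'home_team.id': [2, 2],
    'home_team.abbreviation': ['BOS', 'BOS'],
    'home_team.full_name': ['Boston Celtics', 'Boston Celtics'],
})

# One row per distinct team, with the 'home_team.' prefix stripped
teams = (games[['home_team.id', 'home_team.abbreviation', 'home_team.full_name']]
         .drop_duplicates()
         .rename(columns=lambda c: c.replace('home_team.', '')))

# The games table then only needs the team key
games_slim = games[['id', 'home_team.id']]
```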
I am still learning python, kindly excuse if the question looks trivial to some.
I have a csv file with following format and I want to extract a small segment of it and write to another csv file:
So, this is what I want to do:
Just extract the entries under actor_list2 and the corresponding id column and write it to a csv file in following format.
Since the format is not regular column headers followed by some values, I am not sure how to select a starting point based on a cell value in a particular column. E.g., even if we consider actor_list2, it may have any number of entries under it. Please help me understand whether it can be done using pandas' DataFrame processing capability.
Update: The reason why I would like to automate it is because there can be thousands of such files and it would be impractical to manually get that info to create the final csv file which will essentially have a row for each file.
As Nour-Allah has pointed out, the formatting here is not very regular, to say the least. If your data comes out like this every time, the best you can do is skip some rows of the file:
import pandas as pd
df = pd.read_csv('blabla.csv', skiprows=list(range(17)), nrows=8)
df_res = df.loc[:, ['actor_list2', 'ID']]
This should get you the result, but given how erratic the formatting is, there is no way to automate this. What if next time there's another actor? Or one fewer? Even Nour-Allah's solution would not help there.
Honestly, you should just get better data.
Since the CSV file you have is not regular, it contains a lot of empty positions, which become 'nan' objects. Meanwhile, the columns will be indexed by number.
I will use pandas to read it:
import pandas as pd
df = pd.read_csv("not_regular_format.csv", header=None)
Then, initialize an empty dictionary to store the results in; it will be used to build an output DataFrame, whose content is finally written to a CSV file:
target = {}
Now you need to find actor_list2 in the second column (the column with index 1), and if it exists, start storing the names and scores from the following rows (columns 1 and 2) in the dictionary target:
rows_index = df[df[1] == 'actor_list2'].index
if len(rows_index) > 0:
    i = rows_index[0]
    while True:
        i += 1
        name = df.iloc[i, 1]
        score = df.iloc[i, 2]
        if pd.isna(name):  # the names sequence is finished; a 'nan' object was reached
            break
        target[name] = [score]
Finally, construct the DataFrame and write the new output.csv file:
df_output = pd.DataFrame(target)
df_output.to_csv('output.csv')
Now, you can go anywhere with the given example above. Good luck!
My Excel file has many columns and rows of data, but I want to import only specific columns and rows.
My code:
L_pos_org = pd.read_excel('EXCELFILE.xlsx',sheet_name='Sheet1',na_values=['NA'],usecols = "M:U")
The above code extracts the columns that I want, but it also extracts all rows.
From the above Excel file, I am trying to extract the data in columns M:U and rows 106:114.
How to extract this?
Looking at the documentation here, it seems that with a recent enough version of pandas you can extract a specific block of rows using the parameters skiprows and nrows. I think the command would look something like:
pd.read_excel('EXCELFILE.xlsx', sheet_name='Sheet1', header=None, na_values=['NA'], usecols="M:U", skiprows=range(105), nrows=9)
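The same skiprows/nrows pattern can be checked without an .xlsx file by using read_csv on made-up data (read_excel accepts the same two parameters):

```python
import io
import pandas as pd

# 20 made-up rows standing in for the spreadsheet
csv_text = "\n".join(f"row{i},{i}" for i in range(20))

# Skip the first 5 rows, then read the next 3 (rows 5, 6 and 7)
df = pd.read_csv(io.StringIO(csv_text), header=None, skiprows=range(5), nrows=3)
```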
I have an existing Excel file that looks like this:
and I have another Excel file that has around 40,000 rows and around 300 columns. A shortened version looks like this:
I would like to append values to my existing Excel file from the second one, but only values that match the values in col4 of my existing file. So I would get something like this:
Hope you guys get the picture of what I am trying to do.
Yes, that is possible in pandas, and it is way faster than anything in Excel:
df_result = pd.merge(FirstTable, SecondTable, how='left', on='col4')
This will look for the column "col4" in both tables, so it needs to be named this way in both.
Also be aware that if you have multiple rows in the second table matching a single value in the first table, the result will contain as many lines as there are matches in the second table.
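That duplication behaviour can be seen in a small sketch with made-up tables:

```python
import pandas as pd

# Hypothetical tables sharing the key column 'col4'
FirstTable = pd.DataFrame({'col4': ['a', 'b'], 'val1': [1, 2]})
SecondTable = pd.DataFrame({'col4': ['a', 'a', 'c'], 'val2': [10, 20, 30]})

df_result = pd.merge(FirstTable, SecondTable, how='left', on='col4')
# 'a' matches two rows in SecondTable, so it appears twice in the result;
# 'b' has no match, so its val2 is NaN
```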
To read the Excel files you can use:
import pandas as pd
xl = pd.ExcelFile('MyFile.xlsx')
FirstTable = pd.read_excel(xl, 'sheet_name_FIRST_TABLE')
For a more detailed description, see the documentation.