I have a Pandas dataframe that has duplicate names with different values, and I want to remove the duplicate names but keep the rows. A snippet of my dataframe looks like this:
And my desired output would look like this:
I've tried using the builtin pandas function .drop_duplicates(), but I end up deleting all duplicates and their respective rows. My current code looks like this:
df = pd.read_csv("merged_db.csv", encoding = "unicode_escape", chunksize=50000)
df = pd.concat(df, ignore_index=True)
df2 = df.drop_duplicates(subset=['auth_given_name', 'auth_surname'])
And this is the output I am currently getting:
Basically, I want to return all the values of the coauthor but remove all duplicate data of the original author. What is the best way to achieve the output that I want? I tried using the subset parameter, but I don't believe I'm using it correctly. I also found a similar post, but I couldn't really apply it to Python. Thank you for your time!
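You may consider this code, which blanks the author columns on repeated rows instead of dropping the rows:
import numpy as np
import pandas as pd
# read_csv with chunksize returns an iterator of chunks, so concatenate them first
df = pd.concat(pd.read_csv("merged_db.csv", encoding="unicode_escape", chunksize=50000), ignore_index=True)
first_author = df.columns[:24]
# np.nan rather than np.empty, which would fill the cells with arbitrary values
df.loc[df.duplicated(first_author), first_author] = np.nan
print(df)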
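To make the idea concrete, here is a minimal sketch on toy data. The coauthor column name and all the values are made up for illustration; only auth_given_name and auth_surname come from the question.

```python
import numpy as np
import pandas as pd

# Toy data; the "coauthor" column and every value here are hypothetical.
df = pd.DataFrame({
    "auth_given_name": ["Ann", "Ann", "Bob"],
    "auth_surname": ["Lee", "Lee", "Ray"],
    "coauthor": ["C1", "C2", "C3"],
})
author_cols = ["auth_given_name", "auth_surname"]

# duplicated() marks every repeat after the first occurrence of an author.
repeat = df.duplicated(subset=author_cols)

# Blank only the author columns on repeated rows; coauthor data is untouched.
df.loc[repeat, author_cols] = np.nan
print(df)
```

All three rows survive, and only the second row loses its author name, which seems to be the shape of output the question asks for.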
I am trying to remove a column and special characters from the dataframe shown below.
The code used to create the dataframe is as follows:
dt = pd.read_csv(StringIO(response.text), delimiter="|", encoding='utf-8-sig')
The above produces the following output:
I need help with regex to remove the characters  and delete the first column.
As regards regex, I have tried the following:
dt.withColumn('COUNTRY ID', regexp_replace('COUNTRY ID', #"[^0-9a-zA-Z_]+"_ ""))
However, I'm getting a syntax error.
Any help much appreciated.
If the position of the incoming column is fixed, you can use a regex to remove the extra characters from the column name, like below:
import re
colname = pdf.columns[0]
colt = re.sub(r"[^0-9a-zA-Z_\s]+", "", colname)  # raw string so \s is not treated as a string escape
print(colname,colt)
pdf.rename(columns={colname:colt}, inplace = True)
And for dropping index column you can refer to this stack answer
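If the stray characters could show up in more than one column name, the same regex can be applied to every name at once. This is only a sketch: the frame below is hypothetical, and the leading "\ufeff" (a byte order mark) is an assumption about what the unwanted characters are.

```python
import re
import pandas as pd

# Hypothetical frame; the "\ufeff" prefix stands in for the unwanted characters.
pdf = pd.DataFrame({"\ufeffCOUNTRY ID": [1, 2], "COUNTRY NAME": ["A", "B"]})

# Keep only letters, digits, underscores and whitespace in every column name.
pdf.columns = [re.sub(r"[^0-9a-zA-Z_\s]+", "", name) for name in pdf.columns]
print(list(pdf.columns))
```

Assigning to pdf.columns renames everything in one step, so it does not depend on the affected column being first.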
You have read the data in as a pandas dataframe. From what I see, you want a Spark dataframe. Convert from pandas to Spark: that drops the pandas default index, which is what you refer to as the first column. You can then rename the columns. Code below:
# show() returns None, so keep the dataframe and call show() separately
df = spark.createDataFrame(df).toDF('COUNTRY', 'COUNTRY NAME')
df.show()
I have a dataframe to which I'd like to add a column "exists" based on whether the item exists in another dataframe.
Using the isin function only returns 1 match based on that other dataframe. The same happens with a loc filter when I set the column I want to filter on as the index.
It just doesn't work as expected when I use a reference to a list or column of another DF like this:
table.loc[table.index.isin(tableOther['column']), : ]
In this case it only returns 1 item.
import pandas as pd
import numpy as np
# Source that i like to enrich with additional column
table = pd.read_csv('keywordsDataSource.csv', encoding='utf-8', delimiter=';', index_col='Keyword')
# Source to compare keywords against
tableSubject = pd.read_csv('subjectDataSource.csv', encoding='utf-8', names=["subjects"])
### This column based check only returns 1 - seemingly random - match ###
table.loc[table.index.isin(tableSubject['subjects']), : ]
--------------
######## also tried it like this:
# Source that i like to enrich with additional column
table = pd.read_csv('keywordsDataSource.csv', encoding='utf-8', delimiter=';')
# Source to compare keywords against
tableSubject = pd.read_csv('subjectDataSource.csv', encoding='utf-8', names=["subjects"])
mask = table['Keyword'].isin(tableSubject.subjects)
table[mask]
I've also tried using .query and turning the subject column into a list, which ends with the same single-match result as above.
As the output is the same in all attempts, I suspect the problem is in the data source.
Thank you for your thoughts!
The answer turned out to be as simple as the capitalization of words. The two data sources were not both in lower case: one list had Capitalized Words Like This and the other was inconsistent.
Learning: isin and the other matching options all look for exact matches, so make sure the values in both columns are formatted identically.
This can be done as follows:
table['Keyword'] = table['Keyword'].str.lower()
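Since the mismatch can be on either side, it may be safer to normalise both columns before calling isin. A minimal sketch, using small made-up frames in place of the two CSV files (stripping whitespace is an extra precaution not mentioned above):

```python
import pandas as pd

# Hypothetical stand-ins for keywordsDataSource.csv and subjectDataSource.csv.
table = pd.DataFrame({"Keyword": ["Data Science", "python", "Cooking"]})
tableSubject = pd.DataFrame({"subjects": ["data science", "Python "]})

# Normalise case and surrounding whitespace on both sides before matching.
keywords = table["Keyword"].str.strip().str.lower()
subjects = tableSubject["subjects"].str.strip().str.lower()

# isin still does exact matching, but now on the normalised values.
table["exists"] = keywords.isin(subjects)
print(table)
```

The original Keyword column keeps its capitalization here; only the comparison uses the lower-cased copies.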
Also found a great answer here in case you don't need exact match:
How to test if a string contains one of the substrings in a list, in pandas?
I have a list of columns in a Pandas DataFrame and I am looking to create a list of certain columns without manual entry.
My issue is that I am still learning and not knowledgeable enough yet.
I have tried searching around the internet but nothing was quite my case. I apologize if there is a duplicate.
The list I am trying to cut from looks like this:
['model',
'displ',
'cyl',
'trans',
'drive',
'fuel',
'veh_class',
'air_pollution_score',
'city_mpg',
'hwy_mpg',
'cmb_mpg',
'greenhouse_gas_score',
'smartway']
Here is the code that I wrote on my own: dataframe.columns.tolist()[:6,8:10,11]
In this scenario I am trying to select everything but 'air_pollution_score' and 'greenhouse_gas_score'.
My ultimate goal is to understand the syntax and how to select pieces of a list.
You could do that, or you could just use drop to remove the columns you don't want:
dataframe.drop(['air_pollution_score', 'greenhouse_gas_score'], axis=1).columns
Note that you need to specify axis=1 so that pandas knows you want to remove columns, not rows.
Even if you wanted to use list syntax, I would say that it's better to use a list comprehension instead; something like this:
exclude_columns = ['air_pollution_score', 'greenhouse_gas_score']
[col for col in dataframe.columns if col not in exclude_columns]
This gets all the columns in the dataframe unless they are present in exclude_columns.
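On the slicing syntax itself: a Python list accepts only one slice per pair of brackets, so several pieces have to be sliced separately and joined with +. A small sketch using the column list from the question:

```python
columns = ['model', 'displ', 'cyl', 'trans', 'drive', 'fuel', 'veh_class',
           'air_pollution_score', 'city_mpg', 'hwy_mpg', 'cmb_mpg',
           'greenhouse_gas_score', 'smartway']

# Indices 7 and 11 hold the two unwanted names, so take the pieces around them.
# A slice a:b includes index a and stops just before index b.
selected = columns[:7] + columns[8:11] + columns[12:]
print(selected)
```

This works, but it breaks silently if the column order changes, which is one reason the name-based approaches above are usually preferable.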
Let's say df is your dataframe. You can also use filter with a lambda, though it quickly becomes too long. I present this as a "one-liner" alternative to the answer of @gmds.
df[
    list(filter(
        lambda x: ('air_pollution_score' not in x) and ('greenhouse_gas_score' not in x),
        df.columns.values
    ))
]
What's happening here:
filter applies a function to a list and keeps only the elements for which that function returns True.
We defined that function using lambda; it returns True only when neither 'air_pollution_score' nor 'greenhouse_gas_score' appears in the column name.
We're filtering on the df.columns.values list, so the resulting list retains only the elements that weren't the ones we mentioned.
We're using the df[['column1', 'column2']] syntax, which means "make a new dataframe containing only the 2 columns I define."
Simple solution with pandas
import pandas as pd
data = pd.read_csv('path to your csv file')
df = data[['column1', 'column2', 'column3']]  # double brackets: pass a list of the column names you want
Note: data is the source you have already loaded using pandas; the selected columns are stored in a new data frame, df.
I have a csv file with 367 columns. The first column has 15 unique values, and each subsequent column has some subset of those 15 values. No unique value is ever found more than once in a column. Each column is sorted. How do I get the rows to line up? My end goal is to make a presence/absence heat map, but I need to get the data matrix in the right format first, which I am struggling with.
Here is a small example of the type of data I have:
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,
I need the rows to match the reference but stay in the same column like so:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
My thought was to use the pandas library, but I could not figure out how to approach this problem, as I am very new to using python. I am using python2.7.
So your problem is definitely solvable via pandas:
Code:
# Create the sample data into a data frame
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
Results:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
How does it work?:
Pandas can be a little daunting when first approached, even for someone who knows Python well, so I will try to walk through this. And I encourage you to do what you need to get over the learning curve, because pandas is ridiculously powerful for this sort of data manipulation.
Get the data into a frame:
This first bit of code does nothing but get your sample data into a pandas.DataFrame. Your data format was not specified, so I will assume that you can get it into a frame or, if you cannot, that you will ask another question here on SO about getting the data into a frame.
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
Build a result frame:
Start with a result frame that is just the index
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
For each column in the source data, see if the value is in the index
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
That's it, with three lines of code, we have calculated our results.
Output the results:
Now iterate through each row, which contains booleans, and output the values in the desired format (as ints)
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
This outputs the index value first, and then for each True value outputs the index again, and for False values outputs an empty string.
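Since the stated end goal is a presence/absence heat map, the same membership test can also produce a 0/1 matrix directly. This is a sketch under the assumption that the first column is the reference and every other column holds a subset of it; the variable names are illustrative, and it is written for Python 3.

```python
import pandas as pd
from io import StringIO

# Same sample data as above, first column is the reference.
raw = "1,2,1,2\n2,3,2,5\n3,4,3,\n4,,5,\n5,,,\n"
df = pd.read_csv(StringIO(raw), header=None)
reference = df[0]

# One 0/1 column per data column: 1 where the reference value occurs in it.
presence = pd.DataFrame(
    {col: reference.isin(df[col].dropna()).astype(int) for col in df.columns[1:]}
)
presence.index = reference
print(presence)
```

A numeric frame like this can be handed to most plotting libraries as the heat map input without the string formatting step.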
Postscript:
There are quite a few people here on SO who are way better at pandas than I am, but since you did not tag your question with the pandas keyword, they likely did not notice it. That allows me to take my cut at answering before they do. The pandas keyword is very well covered for well-formed questions, so I am pretty sure that if this answer is not optimal, someone else will come by and improve it. So in the future, be sure to tag your question with pandas to get the best response.
Also, you mentioned that you are new to Python, so I will just put in a plug to make sure that you are using a good IDE. I use PyCharm, and it and other good IDEs can make working in Python even more powerful, so I highly recommend them.