I have a given dataframe
new_df :
ID
summary
text_len
1
xxx
45
2
aaa
34
I am performing some df manipulation by concatenating keywords from different df, like that:
keywords = df["keyword"].to_list()
for key in keywords:
new_df[key] = new_df["summary"].str.lower().str.count(key)
new_df
from here I need two separate dataframes to perform few actions (to each of them add some columns, do some calculations etc).
I need a dataframe with occurrences as per given piece of code and a binary dataframe.
WHAT I DID:
assign dataframe for occurrences:
df_freq = new_df (because it is already calculated an done)
I created another dataframe - binary one - on the top of new_df:
#select only numeric columns to change them to binary
numeric_cols = new_df.select_dtypes("number", exclude='float64').columns.tolist()
new_df_binary = new_df
new_df_binary['text_length'] = new_df_binary['text_length'].astype(int)
new_df_binary[numeric_cols] = (new_df_binary[numeric_cols] > 0).astype(int)
Everything works fine - I perform the math I need, but when I want to come back to df_freq - it is no longer dataframe with occurrences.. looks like it changed along with binary code
I need separate tables and perform separate math on them. Do you know how I can avoid this hmm overwriting issue?
You may use pandas' copy method with the deep argument set to True:
df_freq = new_df.copy(deep=True)
Setting deep=True (which is the default parameter) ensures that modifications to the data or indices of the copy do not impact the original dataframe.
I wanted to create a 2D dataframe about coronavirus such that it contains a column containing countries and another one containing number of deaths. the csv file that I am using is date oriented so for some days the number of deaths is 0 so I decided to group them by Country and sum them up. yet it returned a dataframe with 1 column only. but when I write it to a csv file it creates 2 columns.
here is my code:
#import matplotlib.pyplot as plt
from pandas.core.frame import DataFrame
covid_data = pd.read_csv('countries-aggregated.csv')
bar_data = pd.DataFrame(covid_data.groupby('Country')['Deaths'].sum())
Difficult to give you a perfect answer without the dataset, however, groupby will set your key as index, thus returning a Series. You can pass as_index=False:
bar_data = covid_data.groupby('Country', as_index=False)['Deaths'].sum()
Or, if you have only one column in the DataFrame to aggregate:
bar_data = covid_data.groupby('Country', as_index=False).sum()
How to implement the Excel 'COUNTIF()' using python
Please see the below image for the reference,
I have a column named 'Title' and it contains some text (CD,PDF). And I need to find the count of the string in the column as given below.
No.of CD : 4
No.of PDF: 1
By using Excel I could find the same by using the below formula
=COUNTIF($A$5:$A$9,"CD")
How can I do the same using python.
For a simple summary of list item counts, try .value_counts() on a Pandas data frame column:
my_list = ['CD','CD','CD','PDF','CD']
df['my_column'] = pd.DataFrame(my_list) # create data frame column from list
df['my_column'].value_counts()
... or on a Pandas series:
pd.Series(my_list).value_counts()
Having a column of counts can be especially useful for scrutinizing issues in larger datasets. Use the .count() method to create a column with corresponding counts (similar to using COUNTIF() to populate a column in Excel):
df['countif'] = [my_list.count(i) for i in my_list] # count list item occurrences and assign to new column
display(df[['my_column','countif']]) # view results
I guess you can do map to compare with "CD" then sum all the values
Example:
Create "title" data:
df = pd.DataFrame({"Title":["CD","CD","CD","PDF","CD"]})
The countif using map then sum
df["Title"].map(lambda x: int(x=="CD")).sum()
I have a csv file with repeated group of columns and I want to convert the repeated group of columns to only one column each.
I know for this kind of problem we can use the function melt in python but only when having repeated columns of only one variable .
I already found a simple solution for my problem , but I don't think it's the best.I put the repeated columns of every variable into a list,then all repeated variables into bigger list.
Then when iterating the list , I use melt on every variable(list of repeated columns of same group).
Finally I concatenate the new dataframes to only one dataframe.
Here is my code:
import pandas as pd
file_name='file.xlsx'
df_final=pd.DataFrame()
#create lists to hold headers & other variables
HEADERS = []
A = []
B=[]
C=[]
#Read CSV File
df = pd.read_excel(file_name, sheet_name='Sheet1')
#create a list of all the columns
columns = list(df)
#split columns list into headers and other variables
for col in columns:
if col.startswith('A'):
A.append(col)
elif col.startswith('B'):
B.append(col)
elif col.startswith('C') :
C.append(col)
else:
HEADERS.append(col)
#For headers take into account only the first 17 variables
HEADERS=HEADERS[:17]
#group column variables
All_cols=[]
All_cols.append(A)
All_cols.append(B)
All_cols.append(C)
#Create a final DF
for list in All_cols:
df_x = pd.melt(df,
id_vars=HEADERS,
value_vars=list,
var_name=list[0],
value_name=list[0]+'_Val')
#Concatenate DataFrames 1
df_final= pd.concat([df_A, df_x],axis=1)
#Delete duplicate columns
df_final= df_final.loc[:, ~df_final.columns.duplicated()]
I want to find a better maintenable solution for my problem and I want to have a dataframe for every group of columns (same variable) as a result.
As a beginner in python , I can't find a way of doing this.
I'm joining an image that explains what I want in case I didn't make it clear enough.
joined image
I have a csv file with 367 columns. The first column has 15 unique values, and each subsequent column has some subset of those 15 values. No unique value is ever found more than once in a column. Each column is sorted. How do I get the rows to line up? My end goal is to make a presence/absence heat map, but I need to get the data matrix in the right format first, which I am struggling with.
Here is a small example of the type of data I have:
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,
I need the rows to match the reference but stay in the same column like so:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
My thought was to use the pandas library, but I could not figure out how to approach this problem, as I am very new to using python. I am using python2.7.
So your problem is definitely solvable via pandas:
Code:
# Create the sample data into a data frame
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
# add each column to the datafarme indicating if the desired value is present
for col in df.columns:
results[col] = df.index.isin(df[col])
# output the dataframe in the desired format
for idx, row in results.iterrows():
result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
for x in row.values))
print(result)
Results:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
How does it work?:
Pandas can be little daunting when first approached, even for someone who knows python well, so I will try to walk through this. And I encourage you to do what you need to get over the learning curve, because pandas is ridiculously powerful for this sort of data manipulation.
Get the data into a frame:
This first bit of code does nothing but get your sample data into a pandas.DataFrame. Your data format was not specified so I will assume, that you can get it into a frame, or if you can not get it into a frame, will ask another question here on SO about getting the data into a frame.
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
Build a result frame:
Start with a result frame that is just the index
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
For each column in the source data, see if the value is in the index
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
results[col] = df.index.isin(df[col])
That's it, with three lines of code, we have calculated our results.
Output the results:
Now iterate through each row, which contains booleans, and output the values in the desired format (as ints)
# output the dataframe in the desired format
for idx, row in results.iterrows():
result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
for x in row.values))
print(result)
This outputs the index value first, and then for each True value outputs the index again, and for False values outputs an empty string.
Postscript:
There are quite a few people here on SO who are way better at pandas than I am, but since you did not tag your question, with the pandas keyword, they likely did not notice this question. But that allows me to take my cut at answering before they notice. The pandas keyword is very well covered for well formed questions, so I am pretty sure that if this answer is not optimum, someone else will come by and improve it. So in the future, be sure to tag your question with pandas to get the best response.
Also, you mentioned that you were new python, so I will just put in a plug to make sure that you are using a good IDE. I use PyCharm, and it and other good IDEs can make working in python even more powerful, so I highly recommend them.