Python Pandas 'Unnamed' column keeps appearing

I am running into an issue where each time I run my program (which reads the dataframe from a .csv file) a new column shows up called 'Unnamed'.
Sample output columns after running 3 times:
Unnamed: 0 Unnamed: 0.1 Subreddit Appearances
Here is my code. With each run, the number of 'Unnamed' columns simply increases by 1.
df = pd.read_csv(Location)

while counter < 50:
    # gets just the subreddit name
    e = str(elem[counter].get_attribute("href"))
    e = e.replace("https://www.reddit.com/r/", "")
    e = e[:-1]
    if e in df['Subreddit'].values:
        # adds 1 to Appearances if the subreddit is already in the DF
        df.loc[df['Subreddit'] == e, 'Appearances'] += 1
    else:
        # adds a new row with the subreddit name and sets Appearances to 1
        df = df.append({'Subreddit': e, 'Appearances': 1}, ignore_index=True)
        df.reset_index(inplace=True, drop=True)
    print(e)
    counter = counter + 2

# (doesn't work) df.drop(df.columns[df.columns.str.contains('Unnamed', case=False)], axis=1)
The first time I run it, with a clean .csv file, it works perfectly, but each time after, another 'Unnamed' column shows up.
I just wanted the 'Subreddit' and 'Appearances' columns to show each time.

Another solution is to read your csv with the attribute index_col=0, so that the index column is not treated as data: df = pd.read_csv(Location, index_col=0).
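For context on why the column appears at all: to_csv writes the index as an unnamed first column by default, and read_csv then loads it back as 'Unnamed: 0' on the next run. A minimal sketch of the round trip (the file name is just for illustration):

import pandas as pd

df = pd.DataFrame({'Subreddit': ['python'], 'Appearances': [1]})
df.to_csv('out.csv')  # the index is written as an unnamed first column
print(pd.read_csv('out.csv').columns)  # includes 'Unnamed: 0'
print(pd.read_csv('out.csv', index_col=0).columns)  # index restored, no extra column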

each time I run my program (...) a new column shows up called 'Unnamed'.
I suppose that's due to reset_index, or maybe you have a to_csv somewhere in your code, as @jpp suggested. To fix the to_csv, be sure to use index=False:
df.to_csv(path, index=False)
just wanted the 'Subreddit' and 'Appearances' columns
In general, here's how I would approach your task.
What this does is count all appearances first (keyed by e), and from these counts create a new dataframe to merge with the one you already have (how='outer' adds rows that don't exist yet). This avoids resetting the index for each element, which should fix the problem and is also more performant.
Here's the code with these thoughts included:
from collections import Counter

base_df = pd.read_csv(location)
appearances = Counter()
counter = 0

while counter < 50:
    # gets just the subreddit name
    e = str(elem[counter].get_attribute("href"))
    e = e.replace("https://www.reddit.com/r/", "")
    e = e[:-1]
    appearances[e] += 1
    counter = counter + 2

# build a dataframe from the counts, using the same column names as the existing file
appearances_df = pd.DataFrame([{'Subreddit': e, 'Appearances': c}
                               for e, c in appearances.items()])
df = base_df.merge(appearances_df, how='outer', on='Subreddit')
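One caveat with this merge: if base_df already has an Appearances column from a previous run, the merge will produce Appearances_x and Appearances_y (pandas' default suffixes) rather than a single column. A sketch of how you could combine them afterwards:

df['Appearances'] = df.pop('Appearances_x').fillna(0) + df.pop('Appearances_y').fillna(0)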

Related

Extract several values from a row when a certain value is found using a list

I have a .csv file with 29 columns and 1692 rows.
The columns D_INT_1 and D_INT_2 are just dates.
For these 2 columns, I want to check whether there are dates between "2022-03-01" and "2024-12-31" (inclusive).
And, if a value is found, I want to display the date found plus the value of the "NAME" column located on the same row as the found value.
This is what I have right now, but it only grabs the dates and not the adjacent value ('NAME').
# importing module
import pandas as pd
# reading CSV file
df = pd.read_csv("scratch_2.csv")
# converting column data to list
D_INT_1 = df['D_INT_1'].tolist()
D_INT_2 = df['D_INT_2'].tolist()
ext = []
ext = [i for i in D_INT_1 + D_INT_2 if i >= "2022-03-01" and i <= "2024-12-31"]
print(*ext, sep="\n")
This is what I would like to get:
Example of DF:
NAME, ADDRESS, D_INT_1, D_INT_2
Mark, H4N1V8, 2023-01-02, 2019-01-01
Expected output:
MARK, 2023-01-02
Often the compact [for in] syntax can be used efficiently for simple code, but in this case I wouldn't recommend it. I suggest you use a normal for loop. Here's an example:
# importing module
import pandas as pd

# reading CSV file
df = pd.read_csv("scratch_2.csv")

# converting column data to lists
D_INT_1 = df['D_INT_1'].tolist()
D_INT_2 = df['D_INT_2'].tolist()
NAMES = df['NAME'].tolist()

# loop over every row in the data
# (i starts at 0 and increases by 1 every iteration)
for i in range(0, len(D_INT_1)):
    if D_INT_1[i] >= "2022-03-01" and D_INT_1[i] <= "2024-12-31":
        print(NAMES[i], D_INT_1[i])
    if D_INT_2[i] >= "2022-03-01" and D_INT_2[i] <= "2024-12-31":
        print(NAMES[i], D_INT_2[i])
First, for performance, don't use loops; there are vectorized alternatives. Unpivot with DataFrame.melt, then filter with Series.between and DataFrame.loc:
df = df.melt(id_vars='NAME', value_vars=['D_INT_1','D_INT_2'], value_name='Date')
df1 = df.loc[df['Date'].between("2022-03-01","2024-12-31"), ['NAME','Date']]
print(df1)
NAME Date
0 Mark 2023-01-02
Or filter the original DataFrame twice and then join with concat:
df1 = df.loc[df['D_INT_1'].between("2022-03-01","2024-12-31"), ['NAME','D_INT_1']]
df2 = df.loc[df['D_INT_2'].between("2022-03-01","2024-12-31"), ['NAME','D_INT_2']]
df = pd.concat([df1.rename(columns={'D_INT_1':'date'}),
                df2.rename(columns={'D_INT_2':'date'})])
print(df)
NAME date
0 Mark 2023-01-02
Finally, if you need to loop over the result and print:
for i in df.itertuples():
    print(i.NAME, i.Date)
Mark 2023-01-02 00:00:00
Mark 2019-01-01 00:00:00
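Note that Series.between compares values in whatever dtype the columns hold; with a plain read_csv the comparison is lexicographic on strings, which happens to work for ISO-formatted dates. To get true datetime comparison (the output above shows timestamps), parse the date columns on read, for example (a sketch reusing the question's file name):

df = pd.read_csv("scratch_2.csv", parse_dates=['D_INT_1', 'D_INT_2'])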
There are a few things to note here:
In this case, you are probably better off with a normal for loop, since the logic is a bit more complicated.
To do what you want, first:
Load the columns as lists:
D_INT_1 = df['D_INT_1'].tolist()
D_INT_2 = df['D_INT_2'].tolist()
NAMES = df['NAME'].tolist()
Use enumerate, since we know all the lists are aligned. Keep in mind that enumerate yields both the index and the value, but I am indexing the lists manually just for cleaner (and clearer) code:
ext = []
for i, _ in enumerate(D_INT_1):
    if D_INT_1[i] >= "2022-03-01" and D_INT_1[i] <= "2024-12-31":
        ext.append((D_INT_1[i], NAMES[i]))
    if D_INT_2[i] >= "2022-03-01" and D_INT_2[i] <= "2024-12-31":
        ext.append((D_INT_2[i], NAMES[i]))
Of course, you can use a list comprehension (or in this case, two), but this form should be easier to understand for this answer.
To do so, you still need to load the names as in the first step, then use enumerate in the list comprehension and add the name after the date in a tuple. Since enumerate(D_INT_1 + D_INT_2) produces indexes past the end of NAMES for the second list, wrap the index around, perhaps something like this:
ext = [(d, NAMES[ind % len(NAMES)]) for ind, d in enumerate(D_INT_1 + D_INT_2) if "2022-03-01" <= d <= "2024-12-31"]
Keep in mind that I didn't test the above code since I have no access to the original csv, but it should be a good starting point.

Loop through cell range (Every 3 cells) and add ranking to it

The problem is that I am trying to make a ranking for every 3 cells in that column using pandas.
For example:
This is the outcome I want
I have no idea how to make it.
I tried something like this:
for i in range(df.iloc[1:], df.iloc[,:], 3):
    counter = 0
    i['item'] += counter + 1
The code is completely wrong, but I need help with the range and with what to put inside the df.iloc brackets.
Does this match the requirements?
import pandas as pd

df = pd.DataFrame()
df['Item'] = ['shoes','shoes','shoes','shirts','shirts','shirts']

df2 = pd.DataFrame()
for i, item in enumerate(df['Item'].unique(), 1):
    df2.loc[i-1, 'rank'] = i
    df2.loc[i-1, 'Item'] = item
df2['rank'] = df2['rank'].astype('int')

print(df)
print("\n")
print(df2)

df = df.merge(df2, on='Item', how='inner')
print("\n")
print(df)
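For what it's worth, a vectorized alternative (a sketch of mine, not part of the answer above) is pd.factorize, which assigns consecutive integer codes to items in order of first appearance:

import pandas as pd

df = pd.DataFrame({'Item': ['shoes','shoes','shoes','shirts','shirts','shirts']})
df['rank'] = pd.factorize(df['Item'])[0] + 1  # shoes -> 1, shirts -> 2
print(df)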

How to parse multi index values and create a CSV file while parsing JSON data in Python

I have a few static key columns, EmployeeId and type, and a few columns coming from the first FOR loop.
In the second FOR loop, values should be appended to the existing dataframe columns only if a specific key is present; otherwise, the columns fetched in the first loop should remain unchanged.
First For Loop Output:
"EmployeeId","type","KeyColumn","Start","End","Country","Target","CountryId","TargetId"
"Emp1","Metal","1212121212","2000-06-17","9999-12-31","","","",""
After the second FOR loop I have the below output:
"EmployeeId","type","KeyColumn","Start","End","Country","Target","CountryId","TargetId"
"Emp1","Metal","1212121212","2000-06-17","9999-12-31","","AMAZON","1",""
"Emp1","Metal","1212121212","2000-06-17","9999-12-31","","FLIPKART","2",""
As per the code, if the Employee tag is available I get the above 2 records, but I may have some JSON files without an Employee tag, and then the output should remain the same as the first loop's output.
But I am getting 0 records with my code. Please help me if my way of coding is wrong.
Really sorry if the way of asking the question is not clear, as I am new to Python. Please find the code below:
for i in range(len(json_file['enty'])):
    temp = {}
    temp['EmployeeId'] = json_file['enty'][i]['id']
    temp['type'] = json_file['enty'][i]['type']
    for key in json_file['enty'][i]['data']['attributes'].keys():
        try:
            temp[key] = json_file['enty'][i]['data']['attributes'][key]['values'][0]['value']
        except:
            temp[key] = None
    for key in json_file['enty'][i]['data']['attributes'].keys():
        if key == 'Employee':
            for j in range(len(json_file['enty'][i]['data']['attributes']['Employee']['group'])):
                for key in json_file['enty'][i]['data']['attributes']['Employee']['group'][j].keys():
                    try:
                        temp[key] = json_file['enty'][i]['data']['attributes']['Employee']['group'][j][key]['values'][0]['value']
                    except:
                        temp[key] = None
                temp_df = pd.DataFrame([temp])
                df = pd.concat([df, temp_df], sort=True)

# Rearranging columns
df = df[['EmployeeId', 'type'] + [col for col in df.columns if col not in ['EmployeeId', 'type']]]

# Writing the dataset
df[columns_list].to_csv("Test22.csv", index=False, quotechar='"', quoting=1)
If the Employee tag is not available I am getting 0 records as output, but I am expecting 1 record, as per the output of the first FOR loop. If the Employee tag is available, then I am expecting 2 records along with my static columns "EmployeeId", "type", "KeyColumn", "Start", "End"; if the tag is not available, all the static columns should appear and the remaining columns should be blank.
A long solution modifying your code: adding one more loop, changing the indexing, and modifying the range parameters:
df = pd.DataFrame()
num = max([len(v) for k, v in json_file['data'][0]['data1'].items()])
for i in range(num):
    temp = {}
    temp['Empid'] = json_file['data'][0]['Empid']
    temp['Empname'] = json_file['data'][0]['Empname']
    for key in json_file['data'][0]['data1'].keys():
        if key not in temp:
            temp[key] = []
        try:
            for j in range(len(json_file['data'][0]['data1'][key])):
                temp[key].append(json_file['data'][0]['data1'][key][j]['relative']['id'])
        except:
            temp[key] = None
    temp_df = pd.DataFrame([temp])
    df = pd.concat([df, temp_df], ignore_index=True)

for i in json_file['data'][0]['data1'].keys():
    df[i] = pd.Series([x for y in df[i].tolist() for x in y]).drop_duplicates()
And now:
print(df)
gives:
Empid Empname XXXX YYYYY
0 1234 ABC Naveen Kumar
1 1234 ABC NaN Rajesh

Filter data through multiple columns and print rows?

Kind of a follow-up on my last question. So I have this data in a .csv file that looks like:
id,first_name,last_name,email,gender,ip_address,birthday
1,Ced,Begwell,cbegwell0#google.ca,Male,134.107.135.233,17/10/1978
2,Nataline,Cheatle,ncheatle1#msn.com,Female,189.106.181.194,26/06/1989
3,Laverna,Hamlen,lhamlen2#dot.gov,Female,52.165.62.174,24/04/1990
4,Gawen,Gillfillan,ggillfillan3#hp.com,Male,83.249.190.232,31/10/1984
5,Syd,Gilfether,sgilfether4#china.com.cn,Male,180.153.199.106,11/07/1995
What I want is for the program, when it runs, to ask the user what keywords to search for. It then takes all the keywords entered (maybe they are stored in a list?), and prints out all rows that contain all of the keywords, no matter which column each keyword is in.
I've been playing around with csv and pandas, and have been googling for hours, but just can't seem to get it to work like I want it to. I'm still kind of new to Python 3. Please help.
Edit to show what I've got so far:
import csv

# Ask for search criteria from the user
search_parts = input("Enter search criteria:\n").split(",")

# Open the csv data file
file = csv.reader(open("MOCK_DATA.csv"))

# Go over each row and print it if it contains the user input.
for row in file:
    if all([x in row for x in search_parts]):
        print(row)
It works great if searching by only one keyword, but I want the choice of filtering by one or multiple keywords.
Here you go, using try and except, because if a column's datatype does not match your keyword the comparison would raise an error:
import pandas as pd

def fun(data, keyword):
    ans = pd.DataFrame()
    for i in data.columns:
        try:
            ans = pd.concat((data[data[i] == keyword], ans))
        except:
            pass
    ans.drop_duplicates(inplace=True)
    return ans
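A possible usage sketch (the file name comes from the question; the keyword is just an example). Note that fun matches one keyword across all columns, so it behaves like an OR search over columns:

df = pd.read_csv("MOCK_DATA.csv")
print(fun(df, "Male"))  # all rows where any column equals "Male"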
Try the following code for AND search with the keywords:
import numpy as np

def AND_serach(df, list_of_keywords):
    # init a numpy array to store the indexes
    index_arr = np.array([])
    for keyword in list_of_keywords:
        # drop rows that are entirely NaN and get the remaining rows' indexes
        index = df[df == keyword].dropna(how='all').index.values
        # if index_arr is empty then assign to it; otherwise update it to the intersection of the two arrays
        index_arr = index if index_arr.size == 0 else np.intersect1d(index_arr, index)
    # get back the df by filtering on the indexes
    return df.loc[index_arr.astype(int)]
Try the following code for OR search with the keywords:
def OR_serach(df, list_of_keywords):
    index_arr = np.array([])
    for keyword in list_of_keywords:
        index = df[df == keyword].dropna(how='all').index.values
        # keep all the unique indexes
        index_arr = np.unique(np.concatenate((index_arr, index), 0))
    return df.loc[index_arr.astype(int)]
OUTPUT
d = {'A': [1,2,3], 'B': [10,1,5]}
df = pd.DataFrame(data=d)
print(df)
A B
0 1 10
1 2 1
2 3 5
keywords = [1,5]
AND_serach(df,keywords) # return nothing
Out[]:
A B
OR_serach(df,keywords)
Out[]:
A B
0 1 10
1 2 1
2 3 5

Python - How to improve the dataframe performance?

There are 2 CSV files. Each file has 700,000 rows.
I should read one file line by line and find the same row from the other file.
After that, I make the two files' data into one file.
But it takes about 1 minute just per 1,000 rows!
I don't know how to improve the performance.
Here is my code :
import pandas as pd

fail_count = 0
match_count = 0
count = 0

file1_df = pd.read_csv("Data1.csv", sep='\t')
file2_df = pd.read_csv("Data2.csv", sep='\t')

columns = ['Name', 'Age', 'Value_file1', 'Value_file2']
result_df = pd.DataFrame(columns=columns)

for row in file1_df.iterrows():
    name = row[1][2]
    age = row[1][3]
    selected = file2_df[(file2_df['Name'] == name) & (file2_df['Age'] == age)]
    if selected.empty:
        fail_count += 1
        continue
    value_file1 = row[1][4]
    value_file2 = selected['Value'].values[0]
    result_df.loc[len(result_df)] = [name, age, value_file1, value_file2]
    match_count += 1

print('match : ' + str(match_count))
print('fail : ' + str(fail_count))

result_df.to_csv('result.csv', index=False, encoding='utf-8')
Which line can be changed?
Is there any other way to do this process?
This might be too simplistic, but have you tried using pandas.merge() functionality?
See here for syntax.
For your tables:
result_df = pd.merge(left=file1_df, right=file2_df, on=['Name', 'Age'], how='inner')
That will do an "inner" join, only keeping rows with Names & Ages that match in both tables.
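If you also want the match and fail counts from your original loop, a left join with indicator=True is one way to recover them (a sketch using the same frames):

merged = file1_df.merge(file2_df, on=['Name', 'Age'], how='left', indicator=True)
match_count = (merged['_merge'] == 'both').sum()
fail_count = (merged['_merge'] == 'left_only').sum()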
