Issue with columns in csv using pandas groupby - python

I have the columns below in my CSV. Usually all of these columns have values, like the example below, and the code works smoothly:
dec list_namme list device Service Gate
12 food cookie 200.56.57.58 Shop 123
Now I have encountered an issue: I got one CSV file that has all these columns but no content for them. Here is what it looks like:
dec list_namme list device Service Gate
Once the code runs over it, it creates a new CSV with unexpected columns. I get a new column named index, and instead of the 3 expected columns (device, Service, Gate) I wrongly get only these 2:
index Gate
For CSVs that have content I did not face any issue; the columns come out correctly.
The code is:
if os.path.isfile(client_csv_file):
    df = pd.read_csv(csv_file)  # read the CSV
    df['Gate'] = df.Gate.astype(str)
    df = df.groupby(['device', 'Service'])['Gate'].apply(lambda x: ', '.join(set(x))).reset_index()
    df.to_csv(client_out_file, index=False)
Please help me fix this.

Performing a groupby on an empty DataFrame results in a DataFrame without the groupby-key columns.
One solution is to test if your dataframe is empty before performing manipulations:
if os.path.isfile(client_csv_file):
    df = pd.read_csv(csv_file)
    if df.empty:
        df = df[['device', 'Service', 'Gate']]
    else:
        df['Gate'] = df.Gate.astype(str)
        df = (df.groupby(['device', 'Service'])['Gate']
                .apply(lambda x: ', '.join(set(x)))
                .reset_index())
    df.to_csv(client_out_file, index=False)
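To see why the guard is needed, here is a minimal, self-contained sketch. The column names mirror the question, and an empty DataFrame stands in for the empty CSV file:

```python
import pandas as pd

# Stand-in for the problematic CSV: headers present, no rows
df = pd.DataFrame(columns=['dec', 'list_namme', 'list', 'device', 'Service', 'Gate'])

if df.empty:
    # Skip the groupby entirely and keep only the expected output columns
    out = df[['device', 'Service', 'Gate']]
else:
    df['Gate'] = df['Gate'].astype(str)
    out = (df.groupby(['device', 'Service'])['Gate']
             .apply(lambda x: ', '.join(set(x)))
             .reset_index())

print(out.columns.tolist())  # the three expected columns, even with no rows
```

With the guard in place, the output CSV keeps the device, Service, and Gate headers whether or not the input had any rows.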

Related

How to write to a csv file in a next column in python

I currently have a CSV file which has four columns. The next time I write to the file, I want to write from column E1. I've searched for solutions, but none seems to work.
with open(file_location, "w") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerows(list_of_parameters)
where list_of_parameters is a zip of all four columns:
list_of_parameters = zip(timestamp_list,request_count_list,error_rate_list,response_time_list)
Does anyone have an idea how to implement this? I appreciate your help.
The Python library Pandas is very good for these sorts of things. Here is how you could do this in Pandas:
import pandas as pd

# Read the file as a DataFrame
df = pd.read_csv(file_location)

# Create a second DataFrame with your new data;
# the dict keys become the column names
new_data = pd.DataFrame({
    'Timestamp': timestamp_list,
    'Key Request': request_count_list,
    'Failure Rate': error_rate_list,
    'Response': response_time_list
})

# Concatenate the existing and new data
# along the column axis (adding from E1)
df = pd.concat([df, new_data], axis=1)

# Save the combined data
df.to_csv(file_location, index=False)
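As a quick sanity check of the axis=1 behaviour, here is a minimal sketch with made-up stand-in columns (the names A through F are placeholders, not from the question):

```python
import pandas as pd

# Existing four columns (stand-ins for the real data)
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6], 'D': [7, 8]})

# New data that should land to the right of the existing columns ("E1" onward)
new_data = pd.DataFrame({'E': [9, 10], 'F': [11, 12]})

# axis=1 concatenates column-wise instead of appending rows
combined = pd.concat([df, new_data], axis=1)
print(combined.columns.tolist())
```

Note that both frames must share the same row index; otherwise concat aligns on the index and fills the gaps with NaN.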

Exporting dataframe to csv not showing first column

I'm trying to export my df to a .csv file. The df has just two columns of data: the image name (.jpg) and the value_counts of how many times that .jpg name occurs in the concat_Xenos.csv file, e.g.:
M116_13331848_13109013329679.jpg 19
M116_13331848_13109013316679.jpg 14
M116_13331848_13109013350679.jpg 12
M116_13331848_13109013332679.jpg 11
etc. etc. etc....
However, whenever I export the df, the .csv file only displays the 'value_counts' column. How do I fix this?
My code is as follows:
concat_Xenos = r'C:\file_path\concat_Xenos.csv'
df = pd.read_csv(concat_Xenos, header=None, index_col=False)[0]
counts = df.value_counts()
export_csv = counts.to_csv(r'C:\file_path\concat_Xenos_valuecounts.csv', index=None, header=False)
Thanks! If any clarification is needed please ask :)
This is because the .jpg names are the index of the value_counts Series, and index=None suppresses them. Use index=True:
export_csv = counts.to_csv(r'C:\file_path\concat_Xenos_valuecounts.csv', index=True, header=False)
or you can reset the index before exporting. Note that reset_index on a Series produces a DataFrame, so inplace=True raises a TypeError here; assign the result instead:
counts = counts.reset_index()
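A minimal sketch of what is going on (the short Series stands in for the column read from concat_Xenos.csv):

```python
import pandas as pd

# value_counts() returns a Series whose *index* holds the .jpg names
s = pd.Series(['a.jpg', 'a.jpg', 'b.jpg'])
counts = s.value_counts()

# With index=True the names are written as the first CSV column;
# passing no path makes to_csv return the text instead of writing a file
csv_text = counts.to_csv(index=True, header=False)
print(csv_text)
```

The printed text contains one name,count pair per line, which is exactly the two-column layout the question asks for.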

Why is Pandas' whitespace delimiter skipping one of my values?

I'm currently trying to use Python read a text file into Sqlite3 using Pandas. Here are a few entries from the text file:
1 Michael 462085 2.2506 Jessica 302962 1.5436
2 Christopher 361250 1.7595 Ashley 301702 1.5372
3 Matthew 351477 1.7119 Emily 237133 1.2082
The data consists of popular baby names, and I have to separate male names and female names into their own tables and perform queries on them. My method consists of first placing all the data into both tables, then dropping the unneeded columns afterwards. My issue is that when I try to add names to the columns, I get a value error: The expected axis has 6 elements, but 7 values. I'm assuming it's because Pandas possibly isn't reading the last values of each line, but I can't figure out how to fix it. My current delimiter is a whitespace delimiter that you can see below.
Here is my code:
import sqlite3
import pandas as pd
import csv

con = sqlite3.connect("C:\\****\\****\\****\\****\\****\\baby_names.db")
c = con.cursor()

# Please note that most of these functions will be commented out, because they will only be run once.
def create_and_insert():
    # Load the data from the text file
    df = pd.read_csv('babynames.txt', index_col=0, header=None, sep=r'\s+', engine='python')
    # Add column names
    df.columns = ['Rank', 'BoyName', 'Boynumber', 'Boypercent', 'Girlname', 'Girlnumber', 'Girlpercent']
    df.columns = df.columns.str.strip()
    con = sqlite3.connect("*************\\baby_names.db")
    # Drop the data into the database
    df.to_sql("Combined", con)
    df.to_sql("Boys", con)
    df.to_sql("Girls", con)
    con.commit()
    con.close()

create_and_insert()

def test():
    c.execute("SELECT * FROM Boys WHERE Rank = 1")
    print(c.fetchall())

test()
con.commit()
con.close()
I've tried adding multiple delimiters, but it didn't seem to do anything. Using just regular space as the delimiter seems to just create 'blank' column names. From reading the Pandas docs, it says that multiple delimiters are possible, but I can't quite figure it out. Any help would be greatly appreciated!
Note that:
your input file contains 7 columns,
but the initial column is set as the index (you passed index_col=0),
so your DataFrame contains only 6 regular columns.
Print df to confirm it.
Now, when you run df.columns = ['Rank', ...], you attempt to assign the 7 passed names to the existing 6 data columns.
Probably you should:
read the DataFrame without setting the index (for now),
assign all 7 column names,
set Rank column as the index.
The code to do it is:
df = pd.read_csv('babynames.txt', header=None, sep=r'\s+', engine='python')
df.columns = ['Rank', 'BoyName', 'Boynumber', 'Boypercent',
              'Girlname', 'Girlnumber', 'Girlpercent']
df.set_index('Rank', inplace=True)
Or even shorter (all in one):
df = pd.read_csv('babynames.txt', sep=r'\s+', engine='python',
                 names=['Rank', 'BoyName', 'Boynumber', 'Boypercent',
                        'Girlname', 'Girlnumber', 'Girlpercent'],
                 index_col='Rank')
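Since the question mentions splitting the data into separate male and female tables afterwards, here is a rough sketch of that step under the same column names. The inline sample stands in for babynames.txt:

```python
import pandas as pd
from io import StringIO

# Inline stand-in for babynames.txt
data = """1 Michael 462085 2.2506 Jessica 302962 1.5436
2 Christopher 361250 1.7595 Ashley 301702 1.5372"""

cols = ['Rank', 'BoyName', 'Boynumber', 'Boypercent',
        'Girlname', 'Girlnumber', 'Girlpercent']
df = pd.read_csv(StringIO(data), sep=r'\s+', names=cols, index_col='Rank')

# Drop the unneeded columns for each table before writing it with to_sql
boys = df[['BoyName', 'Boynumber', 'Boypercent']]
girls = df[['Girlname', 'Girlnumber', 'Girlpercent']]
print(boys)
```

Selecting columns this way avoids loading everything into both tables and dropping columns later.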

Not able to select any other column except the first one in pandas dataframe

I searched a lot on the internet to understand this issue and tried most of the suggestions, but in vain. I am reading a TSV file, which is tab-delimited.
import pandas as pd
df = pd.read_csv('abc.tsv',delimiter="\t", engine="python", encoding="UTF-8")
When I print the columns, I am getting this:
Index(['date', '​time', '​user_id', '​url', '​IP'], dtype='object')
When trying to access the dataframe, I am able to select only the first column by name, while the rest give a KeyError:
print(df.loc[:, "time"])
KeyError: 'the label [time] is not in the [columns]'
I also upgraded pandas:
Successfully installed numpy-1.14.0 pandas-0.22.0 python-dateutil-2.6.1 pytz-2017.3 six-1.11.0
Any help would be highly appreciated
EDIT:
I can access all the columns with iloc
print(df.iloc[:, 1])
Comments converted to an answer:
If
print(df.columns.tolist())
returns
['date', '\u200btime', '\u200buser_id', '\u200burl', '\u200bIP']
then the column names are prefixed with zero-width spaces (U+200B). A plain df.columns.str.strip() only removes real whitespace, and U+200B is a Unicode format character, so in this particular case it had to be stripped explicitly:
df.columns = df.columns.str.strip("\u200b")
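A minimal sketch of why the targeted strip is needed (column names copied from the question):

```python
import pandas as pd

df = pd.DataFrame(columns=['date', '\u200btime', '\u200buser_id'])

# U+200B is a zero-width *format* character, not whitespace,
# so a plain strip() leaves it in place
plain = df.columns.str.strip().tolist()
print(plain)

# Stripping it explicitly repairs the labels
df.columns = df.columns.str.strip('\u200b')
print(df.columns.tolist())
```

After the explicit strip, df['time'] and the other columns are selectable by name again.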

Join two csv files with pandas/python without duplicates

I would like to concatenate 2 csv files. Each CSV file has the following structure:
File 1
id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208431
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
File 2
id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208432
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
5748729449,Duomo Di Milano,52e81612bc,45.463898,9.192034
I want to get a final CSV that looks like this:
Final file
id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208431
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208432
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
5748729449,Duomo Di Milano,52e81612bc,45.463898,9.192034
So I have done this:
import pandas as pd
df1=pd.read_csv("file1.csv")
df2=pd.read_csv("file2.csv")
full_df = pd.concat([df1, df2])
full_df = full_df.groupby(['id', 'category-id', 'lat', 'lng']).count()
full_df2 = full_df[['id', 'category-id']].groupby('id').agg('count')
full_df2.to_csv("final.csv", index=False)
I tried to group by id, category-id, lat, and lng, since the name could change.
After the first groupby I want to group again, but now only by id and category-id, because as shown in my example the first row changed in lng; that is probably because file2 is an update of file1.
I don't understand groupby well, because when I tried to print the result I got just the count values.
One way to solve this problem is to just use df.drop_duplicates() after you have concatenated the two DataFrames. Additionally, drop_duplicates has an argument "keep", which allows you to specify that you want to keep the last occurrence of the duplicates.
full_df = pd.concat([df1,df2])
unique_df = full_df.drop_duplicates(keep='last')
Check the documentation for drop_duplicates if you need further help.
I was able to resolve this problem with the following code:
import pandas as pd

df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")

df_final = (pd.concat([df1, df2])
              .drop_duplicates(subset=['id', 'category-id', 'lat', 'lng'])
              .reset_index(drop=True))
print(df_final.shape)

df_final2 = df_final.drop_duplicates(subset=['id', 'category-id']).reset_index(drop=True)
df_final2.to_csv('final', index=False)
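Here is a self-contained sketch of the keep='last' approach on the rows from the question; the file contents are inlined via StringIO instead of reading file1.csv and file2.csv:

```python
import pandas as pd
from io import StringIO

file1 = """id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208431"""
file2 = """id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208432
5748729449,Duomo Di Milano,52e81612bc,45.463898,9.192034"""

df1 = pd.read_csv(StringIO(file1))
df2 = pd.read_csv(StringIO(file2))

# Deduplicate on id + category-id only, keeping the newer (file2) row
result = (pd.concat([df1, df2])
            .drop_duplicates(subset=['id', 'category-id'], keep='last')
            .reset_index(drop=True))
print(result)
```

Because keep='last' is used, the Area51 row that survives is the one from file2 with the updated lng value.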
