Pandas Dataframe to String with separator - python

I want to turn a dataframe into a string.
This topic, How to turn a pandas dataframe row into a comma separated string, is close to what I want. The only problem with that solution: I have a column 'Country' whose strings contain the separator (for example, with that solution the dataframe is converted into a string, but 'United States' becomes 'United,States').
So currently I just have the following code:
df = df.to_string(index=False).split('\n')
df = [','.join(ele.split()) for ele in df]
df = '\r\n'.join(df)
df = df.encode('utf8')
but for a dataframe like this one:
data = [['United States', 10, 12], ['United Kingdom', 15, 25], ['France', 14, 18]]
df = pd.DataFrame(data, columns = ['Country', 'Number1', 'Number2'])
I will have
b'Country,Number1,Number2\r\nUnited,States,10,12\r\nUnited,Kingdom,15,25\r\nFrance,14,18'
Instead of:
b'Country,Number1,Number2\r\nUnited States,10,12\r\nUnited Kingdom,15,25\r\nFrance,14,18'
Currently I have worked around the problem with replacements like:
df = df.replace('United,States', 'United States')
But that is not a really good solution, because each time a new country with a space comes up I have to update the script.
(The final goal is to convert the dataframe into a utf-8 string so that I can compute its md5, without using df.to_csv() and computing the md5 of the created file. If you have a better way than this trick, that would also help me.)
Thanks!

data = [['United States', 10, 12], ['United Kingdom', 15, 25], ['France', 14, 18]]
df = pd.DataFrame(data, columns = ['Country', 'Number1', 'Number2'])
df = df.to_csv(index=False).strip('\n').split('\n') # <= csv text (header included) split into rows
df_string = '\r\n'.join(df) # <= this is the string that you can use with md5
df_bytes = df_string.encode('utf8') # <= this is bytes object to write the file
print(df_bytes)
Use df_string for the md5 (encode it first, since md5 works on bytes) and df_bytes to write the file.
df_bytes contains this:
b'Country,Number1,Number2\r\nUnited States,10,12\r\nUnited Kingdom,15,25\r\nFrance,14,18'
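Since the end goal is an md5 of the content, here is a minimal sketch of that last step, using the standard library hashlib on df_bytes (the already-encoded content from above):
import hashlib

md5_hex = hashlib.md5(df_bytes).hexdigest() # hash of the exact bytes you would have written to the file
print(md5_hex)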

Variant without sending it to csv:
import pandas as pd
data = [['United States', 10, 12], ['United Kingdom', 15, 25], ['France', 14, 18]]
df = pd.DataFrame(data, columns = ['Country', 'Number1', 'Number2'])
df['Country'] = df['Country'].str.replace(' ', '_') # protect the spaces inside values
df = df.to_string(index=False).split('\n')
df = [','.join(ele.split()) for ele in df] # collapse the column whitespace into commas
df = [element.replace('_', ' ') for element in df] # restore the original spaces
df = '\r\n'.join(df)
df = df.encode('utf8')
df

Related

How to check file is as per format in python

So I have an Excel sheet of data which has 20-something columns. The customer has a requirement: they want to know if any column is missing from the Excel file. I'm using pandas to convert the data into dataframes. I used if statements for a few columns, but as that is a rigid solution they want something better.
Any suggestions? Are there any libraries for this?
Thanks
I want to check that the file has all the required columns and report an error if something is missing.
Here I created a dataframe, but you would be using df = pd.read_excel('myfile.xlsx')
My dataframe has only the three following columns
data = {'Name': ['Tom', 'Nick', 'Sarah', 'Jack'],
        'Age': [20, 21, 19, 18],
        'Sex': ['M', 'M', 'F', 'M']}
df = pd.DataFrame(data)
I'll then make a list of the required columns:
REQUIRED_COLUMNS = [
    'Name',
    'Age',
    'Occupation',
    'Sex'
]
# I'll make the columns a set to avoid O(n^2) looping.
dfColumns = set(df.columns)
for col in REQUIRED_COLUMNS:
    if col not in dfColumns:
        print(f"Column '{col}' is missing.")
Et voilà
>>> Column 'Occupation' is missing.
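If you want all the missing names at once, a small variant of the same check with a set difference (same REQUIRED_COLUMNS and df as above):
missing = set(REQUIRED_COLUMNS) - set(df.columns)
if missing:
    print(f"Missing columns: {', '.join(sorted(missing))}")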

Faster way to iterate over columns in pandas

I have the following task.
I have this data:
import pandas
import numpy as np
data = {'name': ['Todd', 'Chris', 'Jackie', 'Ben', 'Richard', 'Susan', 'Joe', 'Rick'],
'phone': [912341.0, np.nan , 912343.0, np.nan, 912345.0, 912345.0, 912347.0, np.nan],
' email': ['todd#gmail.com', 'chris#gmail.com', np.nan, 'ben#gmail.com', np.nan ,np.nan , 'joe#gmail.com', 'rick#gmail.com'],
'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', np.nan, 'Tokyo', 'Beijing', 'Tokyo', 'Heathrow'],
'most_visited_place': ['Turkey', 'Spain',np.nan , 'Germany', 'Germany', 'Spain',np.nan , 'Spain']
}
df = pandas.DataFrame(data)
For every feature column (most_visited_airport etc.) and each of its values (Heathrow, Beijing, Tokyo) I have to generate personal information and output it to a file.
E.g. if we look at most_visited_airport and Heathrow,
I need to output three files containing the names, emails and phones of the people who visited that airport the most.
Currently, I have this code to do the operation for both columns and all the values:
columns_to_iterate = [x for x in df.columns if 'most' in x]
for each in df[columns_to_iterate]:
    values = df[each].dropna().unique()
    for i in values:
        df1 = df.loc[df[each] == i, 'name']
        df2 = df.loc[df[each] == i, ' email']
        df3 = df.loc[df[each] == i, 'phone']
        df1.to_csv(f'{each}_{i}_{df1.name}.csv')
        df2.to_csv(f'{each}_{i}_{df2.name}.csv')
        df3.to_csv(f'{each}_{i}_{df3.name}.csv')
Is it possible to do this in a more elegant and maybe faster way? Currently I have a small dataset, but I am not sure this code will perform well with big data. My particular concern is the nested loops.
Thank you in advance!
You could replace the call to unique with a groupby, which would not only get the unique values, but split up the dataframe for you:
for column in df.filter(regex='^most'):
    for key, group in df.groupby(column):
        for attr in ('name', 'phone', ' email'):
            group[attr].dropna().to_csv(f'{column}_{key}_{attr}.csv')
You can do it this way.
cols = df.filter(regex='most').columns.values

def func_current_cols_to_csv(most_col):
    place = [i for i in df[most_col].dropna().unique().tolist()]
    csv_cols = ['name', 'phone', ' email']
    result = [df[df[most_col] == i][j].dropna().to_csv(f'{most_col}_{i}_{j}.csv', index=False)
              for i in place for j in csv_cols]
    return result

[func_current_cols_to_csv(i) for i in cols]
Also, in the to_csv options you can keep the index, but do not forget to reset it before writing.
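A minimal sketch of that reset-index suggestion, reusing the df from the question and one hard-coded column/value pair taken from the sample data:
subset = df.loc[df['most_visited_airport'] == 'Heathrow', 'name'].dropna().reset_index(drop=True)
subset.to_csv('most_visited_airport_Heathrow_name.csv') # the index is kept, but renumbered from 0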

Using iterrows() to fill text into column

I want to use iterrows() to fill the predetermined substrings (Name and Age) in the column 'Unique Code' with the values coming from the other two columns, 'Name' and 'Age'. However, while the loop prints the correct output, the 'Unique Code' values do not update.
lst = [['tom', 25, 'EVT-PS-Name-Age'], ['krish', 30, 'EVT-PS-Name-Age'],
       ['nick', 26, 'EVT-PS-Name-Age'], ['juli', 22, 'EVT-PS-Name-Age']]
df = pd.DataFrame(lst, columns=['Name', 'Age', 'Unique Code'])
for index, row in df.iterrows():
    row['Unique Code'] = str(row['Unique Code'])
    row['Age'] = str(row['Age'])
    row['Unique Code'] = row['Unique Code'].replace('Name', row['Name'])
    row['Unique Code'] = row['Unique Code'].replace('Age', row['Age'])
    print(row['Unique Code'])
df.head()
This is my intended outcome - thanks!
lst = [['tom', 25, 'EVT-PS-tom-25' ], ['krish', 30, 'EVT-PS-krish-30'],
['nick', 26, 'EVT-PS-nick-26'], ['juli', 22, 'EVT-PS-juli-22']]
df = pd.DataFrame(lst, columns =['Name', 'Age', 'Unique Code'])
If you want to keep the loop/iterrows in your code, you can assign back into the dataframe
by adding this line at the end of your for loop: df.loc[index, 'Unique Code'] = row['Unique Code']
As for why this does not work: the row variable yielded by iterrows() is a temporary copy, so modifying it does not affect the dataframe's rows.
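If you do not need the loop at all, the same replacement logic can be written row-wise with apply; this is a sketch equivalent to the loop above, starting from the original df with the 'EVT-PS-Name-Age' template:
df['Unique Code'] = df.apply(
    lambda r: r['Unique Code'].replace('Name', r['Name']).replace('Age', str(r['Age'])),
    axis=1
)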

Finding the min of a column across multiple lists in python

I need to find the minimum and maximum of a given column from a csv file. Currently the value is a string, but I need it to be an integer. Right now, after I have split all the lines into lists, my output looks like this:
['FRA', 'Europe', 'France', '14/06/2020', '390', '10\n']
['FRA', 'Europe', 'France', '11/06/2020', '364', '27\n']
['FRA', 'Europe', 'France', '12/06/2020', '802', '28\n']
['FRA', 'Europe', 'France', '13/06/2020', '497', '24\n']
From that line, along with its many others, I want to find the minimum of the
5th column, and currently when I do
min(column[4])
it just gives the min within each individual row, which is just the number in that column, rather than grouping them all up and getting the overall minimum.
P.S.: I am very new to Python and coding in general. I also have to do this without importing any modules.
For you, Azro:
def main(csvfile, country, analysis):
    infile = csvfile
    datafile = open(infile, "r")
    country = country.capitalize()
    if analysis == "statistics":
        for line in datafile.readlines():
            column = line.split(",")
            if column[2] == country:
You may use pandas, which lets you read a csv file and manipulate it as a DataFrame; then it is very easy to retrieve the min/max of a column.
import pandas as pd
df = pd.read_csv("test.txt", sep=',')
mini = df['colName'].min()
maxi = df['colName'].max()
print(mini, maxi)
Then, if you have already read your data into a list of lists, you may use the builtin min and max:
# use rstrip() when reading each line, to remove the trailing \n
values = [
['FRA', 'Europe', 'France', '14/06/2020', '390', '10'],
['FRA', 'Europe', 'France', '14/06/2020', '395', '10']
]
mini = min(values, key=lambda x: int(x[4]))[4]
maxi = max(values, key=lambda x: int(x[4]))[4]
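If you want the min/max directly as integers rather than the original strings, a generator expression over the same values list works too:
mini = min(int(row[4]) for row in values)
maxi = max(int(row[4]) for row in values)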
Take a look at the library pandas and especially the DataFrame class. This is probably the go-to method for handling .csv files and tabular data in general.
Essentially, your code would be something like this:
import pandas as pd
df = pd.read_csv('my_file.csv') # Construct a DataFrame from a csv file
print(df.columns) # check to see which column names the dataframe has
print(df['My Column'].min())
print(df['My Column'].max())
There are shorter ways to do this. But this example goes step by step:
# After you read a CSV file, you'll have a bunch of rows.
rows = [
['A', '390', '...'],
['B', '750', '...'],
['C', '207', '...'],
]
# Grab a column that you want.
col = [row[1] for row in rows]
# Convert strings to integers.
vals = [int(s) for s in col]
# Print max.
print(max(vals))

Python pandas: appending information from a dictionary to rows while looping through dataframe

I would like to know a better way to append information to a dataframe while in a loop, specifically, to add COLUMNS of information to a dataframe from a dictionary. The code below technically works, but in subsequent analyses I would like to preserve the data classifications of numpy/pandas to be able to efficiently classify missing data or odd values as np.nan or null. Any tips would be great.
raw_data = {'first_name': ['John', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 17, 16, 24, '']}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age'])
headers = df.columns.values
count = 0
adults = {'John': True, 'Molly': False}
for index, row in df.iterrows():
    count += 1
    if str(row['first_name']) in adults:
        adult = adults[str(row['first_name'])]
    else:
        adult = 'null'
    headers = np.append(headers, 'ADULT')
    vals = np.append(row.values, adult)
    if count == 1:
        print(','.join(headers.tolist()))
        print(str(vals.tolist()).replace('[', '').replace(']', '').replace("'", ""))
    else:
        print(str(vals.tolist()).replace('[', '').replace(']', '').replace("'", ""))
Output:
first_name,last_name,age,ADULT
John, Miller, 42, True
Molly, Jacobson, 20, True
Tina, Ali, 16, NA
Jake, Milner, 24, NA
Amy, Cooze, , NA
Instead of a loop, I think you can simply use apply with a lambda containing an if/else condition:
df['ADULT'] = df['first_name'].apply(lambda v: adults[v] if v in adults else np.nan)
print(df.to_csv(index=False, na_rep='NA'))
# Output is:
# first_name,last_name,age,ADULT
# John,Miller,42,True
# Molly,Jacobson,17,False
# Tina,Ali,16,NA
# Jake,Milner,24,NA
# Amy,Cooze,,NA
In the above, adults[v] if v in adults else np.nan simply checks whether v (i.e. the first_name of each row) is in the dictionary; if it is, that value is kept for the new column, else np.nan.
You can use to_csv to print in the above format; without specifying a filename it returns a comma-separated string, and na_rep specifies the string to use for missing values.
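A slightly shorter variant (not in the original answer) uses Series.map, which returns NaN automatically for keys that are missing from the dictionary:
df['ADULT'] = df['first_name'].map(adults)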
