I am trying to read an Excel file that has some blank rows as well as blank columns. The process becomes more complicated because there are some junk values before the header as well.
Currently, I am hardcoding a column name to locate the header and extract the table. This has two drawbacks: what if that column is not present in the table, and what if the column name also appears as a value in the data? Is there a way to write a program that dynamically detects the table header and reads the table?
snippet of the code:
import pandas as pd

raw_data = pd.read_excel('test_data1.xlsx', 'Sheet8', header=None)
data_duplicate = pd.DataFrame()
for row in range(raw_data.shape[0]):
    for col in range(raw_data.shape[1]):
        if raw_data.iloc[row, col] == 'Currency':
            # treat the row containing 'Currency' as the header
            data_duplicate = raw_data.iloc[(row + 1):].reset_index(drop=True)
            data_duplicate.columns = list(raw_data.iloc[row])
            break
data_duplicate.dropna(axis=1, how='all', inplace=True)
data_duplicate
Also, the number of blank rows plus garbage rows before the header is not fixed.
Here's my way: you can drop all rows and all columns that consist entirely of NaN:
data = pd.read_excel('test.xlsx')
data = data.dropna(how='all', axis=1)
data = data.dropna(how='all', axis=0)
data = data.reset_index(drop=True)
It's better to put it into a function if you need to open multiple DataFrames in the same script:
data = pd.read_excel('test.xlsx')

def remove_nans(df):
    x = df.dropna(how='all', axis=1)
    x = x.dropna(how='all', axis=0)
    x = x.reset_index(drop=True)
    return x

df = remove_nans(data)
print(df)
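That removes fully empty rows and columns, but it does not by itself find a header buried under junk rows. One heuristic for detecting the header dynamically is to take the first row with the most non-empty cells as the header; a minimal sketch, assuming the header row is at least as densely filled as anything above it (file and sheet names taken from the question):
import pandas as pd

raw = pd.read_excel('test_data1.xlsx', 'Sheet8', header=None)
# index of the first row with the maximum number of non-null cells
header_row = raw.notna().sum(axis=1).idxmax()
data = raw.iloc[header_row + 1:].reset_index(drop=True)
data.columns = raw.iloc[header_row]
# drop leftover all-NaN rows and columns
data = data.dropna(axis=1, how='all').dropna(axis=0, how='all')
This avoids hardcoding a column name such as 'Currency', though it can misfire if a garbage row happens to be as wide as the header.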
To process data from Synchro into a readable format in Excel, there are extra columns that need to be dropped.
Raw Data from a txt file:
Lane Group WBT WBR NBL NBT NBR SBL SBT SBR Ø3 Ø7
Lane Configurations <1> 0 1 2> 0 1 2 1
Reading this txt file as a CSV puts every line into a single string.
My goal is to:
1.) read as a csv with correctly delimited rows/columns (using \t as a separator)
2.) Drop any columns and data after the 'SBR' column
Code I'm using:
AMtxt = pd.read_csv('AM.txt', sep='\t+', header=None, error_bad_lines=False,
                    warn_bad_lines=False, quoting=3, skiprows=1, engine='python')
AMtxt.drop(columns=AMtxt.columns[-2:], inplace=True)
AMtxt.to_csv('AM.csv')
When I try to use this, it removes the "Lane Group" row for some of the entries in the read_csv stage.
What the CSV should look like: (screenshot omitted)
What the CSV looks like for problematic entries (ones where there is data in the columns I'm removing): (screenshot omitted)
Solution that worked for me:
# read the raw lines and split on single tabs, so empty fields are preserved
AMtxt = pd.DataFrame(open('AM.txt', 'r').readlines())
AMtxt = AMtxt[0].str.split('\t', expand=True)
# strip whitespace (including the trailing newline) from every cell
for column in AMtxt:
    AMtxt[column] = AMtxt[column].str.strip()
# drop every column after the first 14 (the data past the 'SBR' column)
AMtxt.drop(columns=AMtxt.columns[-(len(AMtxt.columns) - 14):], inplace=True)
AMtxt.to_csv('AM.csv')
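For what it's worth, the misalignment in the read_csv attempt likely comes from sep='\t+': as a regex it collapses runs of consecutive tabs into one separator, so empty fields disappear and later values shift left. A minimal sketch of a plain-tab read that preserves empty fields (same 'AM.txt' and skiprows as above; an alternative, not the accepted fix):
import pandas as pd

# a single '\t' separator keeps empty fields as NaN instead of collapsing them
AMtxt = pd.read_csv('AM.txt', sep='\t', header=None, skiprows=1, quoting=3)
AMtxt.to_csv('AM.csv')
Ragged lines may still need handling, e.g. on_bad_lines='skip' in pandas 1.3+.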
I have some big Excel files like this (screenshot omitted; other variables are left out for brevity):
and would need to build a corresponding Pandas DataFrame with the following structure (also shown in a screenshot).
I am trying to develop Pandas code that, at least, parses the first column and transposes the id and the full name of each user. Could you help with this?
The way that I would tackle it (and I am assuming there are likely to be more efficient ways) is to import the Excel file into a DataFrame and then iterate through it to grab the details you need for each line. Store that information in a dictionary, and append each formed line to a list. This list of dictionaries can then be used to create the final DataFrame.
Please note, I made the following assumptions:
Your excel file is named 'data.xlsx' and in the current working directory
The index next to each person increments by one EVERY time
All people have a position described in brackets next to the name
I made up the column names, as none were provided
import pandas as pd

# import the excel file into a dataframe (df)
filename = 'data.xlsx'
df = pd.read_excel(filename, names=['col1', 'col2'])

# remove blank rows
df.dropna(inplace=True)

# reset the index of df
df.reset_index(drop=True, inplace=True)

# initialise the variables
counter = 1
name_pos = ''
name = ''
pos = ''
line_dict = {}
list_of_lines = []

# iterate through the dataframe
for i in range(len(df)):
    if df['col1'][i] == counter:
        # a person row: "Name (Position)"
        name_pos = df['col2'][i].split(' (')
        name = name_pos[0]
        pos = name_pos[1].rstrip(name_pos[1][-1])  # strip the trailing ')'
        p_index = counter
        counter += 1
    else:
        # a data row belonging to the current person
        date = df['col1'][i].strftime('%d/%m/%Y')
        amount = df['col2'][i]
        line_dict = {'p_index': p_index, 'name': name, 'position': pos,
                     'date': date, 'amount': amount}
        list_of_lines.append(line_dict)

final_df = pd.DataFrame(list_of_lines)
OUTPUT: (screenshot of the resulting DataFrame omitted)
I have a raw dataframe that looks like this (screenshot omitted).
I am trying to import this data as a CSV, do some calculations on it, and then export it. Before doing this, however, I need to remove the three lines of "header information", but keep the data, as I will need to add it back to the dataframe prior to exporting. I have done this using the following lines of code:
import pandas as pd

data = pd.read_csv(r"test.csv", header=None)
info = data.iloc[0:3]         # the three "header information" rows
data = data.iloc[3:]          # everything below them
data.columns = data.iloc[0]   # promote the first remaining row to the header
data = data[1:]
data = data.reset_index(drop=True)
The problem I am having is: how do I add the rows stored in "info" back to the top of the dataframe so the format matches the CSV I imported?
Thank you
You can just use the append() function of pandas to merge the two data frames. Please check by printing final_data.
import pandas as pd

data = pd.read_csv(r"test.csv", header=None)
info = data.iloc[0:3]
data = data.iloc[3:]
data.columns = data.iloc[0]
data = data[1:]
data = data.reset_index(drop=True)

# The first row of data became the column header, so convert it back into a row
data = data.columns.to_frame().T.append(data, ignore_index=True)
data.columns = range(len(data.columns))

final_data = info.append(data)
final_data = final_data.reset_index(drop=True)
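Note that DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0, so on newer pandas the same merge can be written with pd.concat. A sketch of the equivalent steps, assuming the same info and data frames as above:
import pandas as pd

# turn the header back into a regular row, as in the append() version
header_as_row = data.columns.to_frame().T
data = pd.concat([header_as_row, data], ignore_index=True)
data.columns = range(len(data.columns))

final_data = pd.concat([info, data]).reset_index(drop=True)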
I have Python code that pulls data from a 3rd-party API. Below is the code:
import json
import time

import pandas as pd
import requests
from pandas import json_normalize

# sublocation_ids, filter_text, api_endpoint, empty_list and writer are
# defined earlier in the script
for sub in sublocation_ids:
    city_num_int = sub['id']
    city_num_str = str(city_num_int)
    city_name = sub['name']
    filter_text_new = filter_text.format(city_num_str)
    data = json.dumps({"filters": [filter_text_new], "sort_by": "fb_tw_and_li",
                       "size": 200, "from": 1580491663000, "to": 1588184960000,
                       "content_type": "stories"})
    r = requests.post(url=api_endpoint, data=data).json()
    if r['articles'] != empty_list:
        articles_list = r["articles"]
        time.sleep(5)
        articles_list_normalized = json_normalize(articles_list)
        df = articles_list_normalized
        df['publication_timestamp'] = pd.to_datetime(df['publication_timestamp'])
        df['publication_timestamp'] = df['publication_timestamp'].apply(lambda x: x.now().strftime('%Y-%m-%d'))
        df['citystate'] = city_name
        df = df.drop('has_video', 1)
        df.to_excel(writer, sheet_name=city_name)
writer.save()
Now city_num_int = sub['id'] is a unique ID for each city. The API returns a "videos" column for some cities and not for others, and I want to get rid of that column before the data gets written to the Excel file.
I was able to drop the "has_video" column using df.drop, as that column is present in every city's data pull. But how do I conditionally drop the "videos" column, since it is only present for some cities?
You can ignore the errors raised by DataFrame.drop:
df = df.drop(['videos'], axis=1, errors='ignore')
Another way is to first check whether the column is present in the DataFrame, and only then delete it (see the sketch after the reference link).
Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
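A minimal sketch of that check-first approach (column name taken from the question):
# drop 'videos' only when this city's pull actually contains it
if 'videos' in df.columns:
    df = df.drop(columns=['videos'])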
You can use list comprehension on the column names to achieve what you want:
cols_to_keep = [c for c in df.columns if c != "videos"]
df = df[cols_to_keep]
I have a dataframe with 6 columns. I want to first add one extra column to it, then use a for loop to fill in that column's values.
import json

import pandas as pd

df = pd.read_csv("/home/Datasets/custom_data/lab_data.csv")  # 31 rows x 6 columns

for i in range(len(df['filename'])):
    if df['region_count'][i] != 0:
        filename = df['filename'][i]
        # region_attributes holds a dict serialized with single quotes
        json_acceptable_string = df['region_attributes'][i].replace("'", "\"")
        node_features_dict = json.loads(json_acceptable_string)
        # center of the bounding box: (x + width/2, y + height/2)
        center = (node_features_dict['x'] + node_features_dict['width'] / 2,
                  node_features_dict['y'] + node_features_dict['height'] / 2)
So I want to add one extra column, Center, to df and fill it with the value of center computed for each row.
I tried:
data = [{'Center': center}]
Node_label = pd.DataFrame(data)
with open('/home/Datasets/custom_data/' + 'lab_data.csv', 'a+') as csvFile:  # save in .csv format
    Node_label.to_csv(csvFile, header=csvFile.tell() == 0)
But it appends the values to arbitrary columns.
You should be able to append the Center column using the concat method of pandas (see the pandas documentation for pandas.concat).
Something similar to:
data = [{'Center': center}]
Node_label = pd.DataFrame(data)
dfWithCenter = pd.concat([df, Node_label], axis=1)  # concat the DataFrame, not the raw list
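Since center is computed once per row inside your loop, a fuller sketch (assuming the same loop and column names as the question; the output filename is made up to avoid overwriting the input) is to collect one center per row in a list and concat it as a column:
import json

import pandas as pd

df = pd.read_csv('/home/Datasets/custom_data/lab_data.csv')

centers = []
for i in range(len(df)):
    if df['region_count'][i] != 0:
        attrs = json.loads(df['region_attributes'][i].replace("'", '"'))
        centers.append((attrs['x'] + attrs['width'] / 2,
                        attrs['y'] + attrs['height'] / 2))
    else:
        centers.append(None)  # keep the list aligned with the rows

dfWithCenter = pd.concat([df, pd.DataFrame({'Center': centers})], axis=1)
dfWithCenter.to_csv('/home/Datasets/custom_data/lab_data_with_center.csv', index=False)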