I want to read the following .txt file and convert it to a pandas DataFrame object, but I cannot separate the columns. I have tried pd.read_fwf('housing.txt', delimiters=','), and it didn't work:
df = pd.read_fwf('housing.txt',delimiter = ',',names=['a','b','c'])
The output:
a b c
0 2104,3,399900 NaN NaN
1 1600,3,329900 NaN NaN
2 2400,3,369000 NaN NaN
3 1416,2,232000 NaN NaN
4 3000,4,539900 NaN NaN
5 1985,4,299900 NaN NaN
6 1534,3,314900 NaN NaN
Here is the housing.txt file
2104,3,399900
1600,3,329900
2400,3,369000
1416,2,232000
3000,4,539900
1985,4,299900
1534,3,314900
1427,3,198999
1380,3,212000
You can use read_csv, as this is a comma-separated values (CSV) file:
df = pd.read_csv("housing.txt", names=['a','b','c'])
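For the sample housing.txt above, this parses the three columns as expected (a quick check; the names a/b/c are just the placeholders from your attempt):

import pandas as pd

df = pd.read_csv("housing.txt", names=['a', 'b', 'c'])
print(df.head(3))
#       a  b       c
# 0  2104  3  399900
# 1  1600  3  329900
# 2  2400  3  369000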
So I have an irregular DataFrame with unnamed columns, which looks something like this:
Unnamed:0 Unnamed:1 Unnamed:2 Unnamed:3 Unnamed:4
nan nan nan 2022-01-01 nan
nan nan nan nan nan
nan nan String Name Currency
nan nan nan nan nan
nan nan nan nan nan
nan nan String nan nan
nan nan xx A CAD
nan nan yy B USD
nan nan nan nan nan
Basically, what I want to do is find which row and column the 'String' label is in and start the DataFrame from there, creating:
String Name Currency
String nan nan
xx A CAD
yy B USD
nan nan nan
My initial thought was to use
locate_row = df.apply(lambda row: row.astype(str).str.contains('String').any(), axis=1)
combined with
locate_col = df.apply(lambda column: column.astype(str).str.contains('String').any(), axis=0)
These give me Boolean Series marking the rows and the columns that contain the string. My main problem is getting from there to the desired DataFrame without hardcoding positions, e.g. iloc[6:, 2:]. Any help is greatly appreciated.
In your example you can drop the columns that are entirely null, then drop rows with any null values. The result is the slice you are looking for. You can then promote the first row to headers.
df = df.dropna(axis=1,how='all').dropna().reset_index(drop=True)
df = df.rename(columns=df.iloc[0]).drop(df.index[0])
Output
String Name Currency
1 xx A CAD
2 yy B USD
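If you would rather stay close to the locate_row/locate_col masks from the question, a minimal sketch (assuming df is the raw frame shown above) can turn the first True in each mask into labels and slice from there:

# idxmax returns the label of the first True in a Boolean Series
first_row = locate_row.idxmax()
first_col = locate_col.idxmax()

out = df.loc[first_row:, first_col:]
out.columns = out.iloc[0]    # promote the anchor row to headers
out = out.iloc[1:]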
You would need to iterate over the DataFrame and then search for equality in the strings. Iteration is described in How to iterate over rows in a DataFrame in Pandas, and with this you can check for equality (a full sketch follows below):
if s1 == s2:
print('s1 and s2 are equal.')
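A minimal sketch of that idea (assuming df is the raw frame from the question), walking the cells until the first match and slicing from there:

target = 'String'
position = None
for i, row in enumerate(df.itertuples(index=False, name=None)):
    for j, value in enumerate(row):
        if value == target:    # the equality check from above
            position = (i, j)
            break
    if position:
        break

if position:
    i, j = position
    result = df.iloc[i:, j:]   # everything from the first match onwards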
I have a df with a column containing IDs for companies. How can I split these IDs into columns?
The values in this column range from 0 (NaN) to more than 5 IDs. How do I divide each of them into a separate column?
Here is an example of the column:
0 4773300
1 NaN
2 6201501,6319400,6202300
3 8230001
4 NaN
5 4742300,4744004,4744003,7319002,4729699,475470
The division would be at each comma. I imagine an output like this:
columnA  columnB  columnC
4773300  NaN      NaN
NaN      NaN      NaN
6201501  6319400  6202300
8230001  NaN      NaN
And so on, depending on the number of IDs.
You can use the .str.split method to perform this type of transformation quite readily. The trick is to pass the expand=True parameter so your results are put into a DataFrame instead of a Series containing list objects.
>>> df
ID
0 4773300
1 NaN
2 6201501,6319400,6202300
3 8230001
4 NaN
5 4742300,4744004,4744003,7319002,4729699,475470
>>> df['ID'].str.split(',', expand=True)
0 1 2 3 4 5
0 4773300 None None None None None
1 NaN NaN NaN NaN NaN NaN
2 6201501 6319400 6202300 None None None
3 8230001 None None None None None
4 NaN NaN NaN NaN NaN NaN
5 4742300 4744004 4744003 7319002 4729699 475470
You can also clean up the output a little for better aesthetics:
replace None with NaN
use alphabetic column names (though I would opt not to do this, as you'll hit errors if a given entry in the ID column has more than 26 IDs in it)
join back to the original DataFrame
>>> import pandas as pd
>>> from string import ascii_uppercase
>>> (
df['ID'].str.split(',', expand=True)
.replace({None: float('nan')})
.pipe(lambda d:
d.set_axis(
pd.Series(list(ascii_uppercase))[d.columns],
axis=1
)
)
.add_prefix("column")
.join(df)
)
columnA columnB columnC columnD columnE columnF ID
0 4773300 NaN NaN NaN NaN NaN 4773300
1 NaN NaN NaN NaN NaN NaN NaN
2 6201501 6319400 6202300 NaN NaN NaN 6201501,6319400,6202300
3 8230001 NaN NaN NaN NaN NaN 8230001
4 NaN NaN NaN NaN NaN NaN NaN
5 4742300 4744004 4744003 7319002 4729699 475470 4742300,4744004,4744003,7319002,4729699,475470
Consider each entry as a string, and parse the string to get the individual values:

from ast import literal_eval
import pandas as pd

# parse each cell of the 'company' column while reading
df = pd.read_csv('sample.csv', converters={'company': literal_eval})

# flatten the parsed values into a single list
words = []
for items in df['company']:
    for word in items:
        words.append(word)

FYI, this is a good starting point; a more defensive converter is sketched below. I do not know what output format is needed as of now, since your question is kind of incomplete.
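A slightly more defensive converter (a sketch; parse_ids is a name I made up) that also copes with single IDs and empty cells:

from ast import literal_eval
import pandas as pd

def parse_ids(cell):
    # read_csv converters receive empty fields as ''
    if not cell:
        return ()
    parsed = literal_eval(cell)
    # a lone ID parses to a plain int; wrap it so every entry is iterable
    return parsed if isinstance(parsed, tuple) else (parsed,)

df = pd.read_csv('sample.csv', converters={'company': parse_ids})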
I have a strange one (for me at least, because I'm just a beginner). Anyway, I have a pandas DataFrame made from an Excel file:
df = pd.read_excel(excel_file_path_from_db, engine='openpyxl', sheet_name='Sheet1', skiprows=1)
Straightforward, and it works. I then do some number crunching and add a couple of columns to the Excel file, updating it using openpyxl in this case. After the number crunching I save the Excel file using openpyxl:
wb.save(excel_file_path_from_db)
All the updated values are saved in the file. Perfect, it's going well so far. Now I want to make a new DataFrame from the last 12 columns I have added to the Excel file, so I make a DataFrame by reading the file again:
df_from_updated_excel = pd.read_excel(excel_file_path_from_db, engine='openpyxl', sheet_name='Sheet1', skiprows=1)
Now I select the last 12 columns as my new DataFrame:
df_last_12 = df_from_updated_excel.iloc[:, -12:]
I then try to print the "hello" column in my df_last_12:
print(df_last_12['hello'])
The problem is that there was a "hello" column in my original DataFrame, and I put a new "hello" column into the file, so I am getting hello.1 and hello.2 when I thought I should be getting just "hello".
The funny thing is, if I print df_last_12, I expected there would just be a "hello" column, but it seems to have these weird iterations. Any ideas how I can set it up so that I don't get these iterations of hello?
Your logic appears correct, so it must be that the column names are not what you expect. What does df_from_updated_excel.columns return? It must include hello.1 and hello.2. Here is a demonstration that selecting the last 12 columns with .iloc works as expected:
df = pd.DataFrame(columns=list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")+["hello"], index=range(5))
df["hello"] = df.index
df = df.iloc[:,-12:]
print(df, "\n", df["hello"])
output
P Q R S T U V W X Y Z hello
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 4
0 0
1 1
2 2
3 3
4 4
Name: hello, dtype: int64
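Since you already see hello.1 and hello.2, note that pandas deduplicates repeated headers when reading a file, so a second hello column in the sheet comes back renamed. A quick check (sketch):

# list the labels to see whether pandas renamed duplicates on read
print(df_from_updated_excel.columns.tolist())
# a duplicated 'hello' header typically comes back as 'hello.1'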
I'm analyzing Excel files generated by an organization that publishes yearly reports in Excel files. Each year, the column names (Year, A1, B1, C1, etc.) remain identical, but each year those column names start at different row and column numbers.
Each year I manually search for the starting row and column, but it's tedious work given the number of years of reports to wade through.
So I'd like something like this:
...
df = pd.read_excel('test.xlsx')
start_row,start_col = df.find_columns('Year','A1','B1')
...
Thanks.
Let's say you have three .xlsx files on your desktop prefixed with Yearly_Report that, when read and combined into one DataFrame with something like df = pd.concat([pd.read_excel(f, header=None) for f in yearly_files]), look like this:
0 1 2 3 4 5 6 7 8 9 10
0 A B C NaN NaN NaN NaN NaN NaN NaN NaN
1 1 2 3 NaN NaN NaN NaN NaN NaN NaN NaN
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN A B C NaN NaN NaN NaN NaN NaN
4 NaN NaN 4 5 6 NaN NaN NaN NaN NaN NaN
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN A B C
2 NaN NaN NaN NaN NaN NaN NaN NaN 4 5 6
As you can see, the columns and values are scattered across various columns and rows. The following steps get you the desired result. First, pd.concat the files and .dropna the rows that are entirely empty. Then, transpose the DataFrame with .T before removing all NaN cells. Next, revert the DataFrame back with another transpose .T. Finally, simply name the columns and drop rows that are equal to the column headers.
import glob
import pandas as pd

main_folder = 'Desktop/'
yearly_files = glob.glob(f'{main_folder}Yearly_Report*.xlsx')

df = pd.concat([pd.read_excel(f, header=None) for f in yearly_files]) \
       .dropna(how='all').T \
       .apply(lambda x: pd.Series(x.dropna().values)).T
df.columns = ['A','B','C']
df = df[df['A'] != 'A']
df
output:
A B C
1 1 2 3
4 4 5 6
2 4 5 6
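Closer to the find_columns idea from the question, a sketch that locates the header cell directly (assuming the header name 'Year' appears exactly once per sheet):

import numpy as np
import pandas as pd

raw = pd.read_excel('test.xlsx', header=None)

# position of the cell holding the first header name
start_row, start_col = np.argwhere((raw == 'Year').to_numpy())[0]

# slice from there and promote the header row
block = raw.iloc[start_row:, start_col:]
block.columns = block.iloc[0]
block = block.iloc[1:].reset_index(drop=True)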
Something like this? Not totally sure what you are looking for:
df = pd.read_excel('test.xlsx')
for i in df.index:
print(df.loc[i,'Year'])
print(df.loc[i, 'A1'])
print(df.loc[i, "B1"])
Say there is a csv file as follows:
# data.csv
0,1,2,3,4
a,3.0,3.0,3.0,3.0,3.0
b,3.0,3.0,3.0,3.0,3.0
c,3.0,3.0,3.0,3.0,3.0
d,3.0,3.0,3.0,3.0,3.0
Now I create two dataframes: one from the csv file, another using DataFrame().
I expect both DataFrames to be equal.
# Read the csv file into a pandas.DataFrame
A = pandas.read_csv('data.csv')
# Create (same?) dataframe by hand
B = pandas.DataFrame(3*numpy.ones((4,5)), index=['a', 'b', 'c', 'd'])
However, if I subtract them, I obtain:
print(A-B)
0 1 2 3 4 0 1 2 3 4
a NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Any idea(s) why?
The DataFrames are not equal because the column names in A are strings, while in B they are integers.
So you need to convert A's string column names to integers:
A = pandas.read_csv('data.csv').rename(columns=int)
Or convert B's column names to strings:
B = pandas.DataFrame(3*numpy.ones((4,5)), index=['a', 'b', 'c', 'd']).rename(columns=str)
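A quick check that either fix lines up the labels (a sketch, using the data.csv sample above):

import numpy
import pandas

A = pandas.read_csv('data.csv').rename(columns=int)
B = pandas.DataFrame(3*numpy.ones((4,5)), index=['a', 'b', 'c', 'd'])
print(A - B)   # all zeros once the column labels match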