I am trying to do some file merging with Latitude and Longitude.
Input File1.csv
Name,Lat,Lon,timeseries(n)
London,80.5234,121.0452,523
London,80.5234,121.0452,732
London,80.5234,121.0452,848
Paris,90.4414,130.0252,464
Paris,90.4414,130.0252,829
Paris,90.4414,130.0252,98
New York,110.5324,90.0023,572
New York,110.5324,90.0023,689
New York,110.5324,90.0023,794
File2.csv
Name,lat,lon,timeseries1
London,80.5234,121.0452,500
Paris,90.4414,130.0252,400
New York,110.5324,90.0023,700
Now the expected output is:
File2.csv
Name,lat,lon,timeseries1,timeseries(n) #timeseries is 24 hrs format 17:45:00
London,80.5234,121.0452,500,2103 #Addition of all three values
Paris,90.4414,130.0252,400,1391
New York,110.5324,90.0023,700,2055
With Python, NumPy and dictionaries this would be straightforward (key = sum of values), but I want to use pandas.
Please suggest how to start, or maybe point me to an example. I have not seen anything like dictionary types in pandas keyed on latitude and longitude.
Perform a groupby aggregation on the first df, call sum and then merge this with the other df:
In [12]:
gp = df.groupby('Name')['timeseries(n)'].sum().reset_index()
df1.merge(gp, on='Name')
Out[14]:
Name Lat Lon timeseries1 timeseries(n)
0 London 80.5234 121.0452 500 2103
1 Paris 90.4414 130.0252 400 1391
2 New York 110.5324 90.0023 700 2055
The aggregation looks like this:
In [15]:
gp
Out[15]:
Name timeseries(n)
0 London 2103
1 New York 2055
2 Paris 1391
Your csv files can be loaded using read_csv, so something like:
df = pd.read_csv('File1.csv')
df1 = pd.read_csv('File2.csv')
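Putting the pieces together, a minimal end-to-end sketch, assuming both files sit in the working directory and 'Name' uniquely identifies each location ('File2_merged.csv' is a hypothetical output name):
import pandas as pd

df = pd.read_csv('File1.csv')   # per-location, per-hour rows
df1 = pd.read_csv('File2.csv')  # one row per location

# sum the repeated timeseries values for each location
gp = df.groupby('Name')['timeseries(n)'].sum().reset_index()

# attach the summed column to File2's rows and write the result out
out = df1.merge(gp, on='Name')
out.to_csv('File2_merged.csv', index=False)  # hypothetical output filename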
I have a csv with one single column, and I want to get a multi-column dataframe in Python to work on it.
My test.csv is the following (in one single column):
ID;location;level;temperature;season
001;leeds;63;11;autumn
002;sydney;36;11;spring
003;lyon;250;11;autumn
004;edmonton;645;8;autumn
I want to get the information in a Data Frame like this:
ID Location Level Temperature Season
001 Leeds 63 11 Autumn
002 Sydney 36 11 Spring
003 Lyon 250 11 Autumn
004 Edmonton 645 8 Autumn
I've tried:
df = pd.read_csv(r'test.csv', header=0, index_col=0)
df= df.str.split(';', expand=True)
But I got this error: 'DataFrame' object has no attribute 'str'. That was an attempt to use it as with a Series. Is there a similar way to do it with dataframes?
I would like to know if there is a pythonic way to do it, or whether I should iterate over the rows.
Is str.split deprecated? I found that it exists for Series, but it seems to be deprecated.
Any guidance is much appreciated.
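A minimal sketch of one direct fix, assuming the file really is semicolon-delimited throughout: pass the separator to read_csv instead of splitting afterwards. The dtype for 'ID' is an extra assumption, there only to preserve the leading zeros:
import pandas as pd

# parse the semicolon-separated fields directly
df = pd.read_csv('test.csv', sep=';', dtype={'ID': str})

# optional: title-case the text columns to match the desired output
df['location'] = df['location'].str.title()
df['season'] = df['season'].str.title()
print(df)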
I would like to read multiple data sets and combine them into a single Pandas dataframe with a year column.
My sample data sets include newyork2000.txt, newyork2001.txt, newyork2002.txt.
Each data set contains 'address' and 'price'.
Below is newyork2000.txt:
253 XXX st, 150000
2567 YYY st, 200000
...
3896 ZZZ rd, 350000
My final single dataframe should look like this:
year address price
2000 253 XXX st 150000
2000 2567 YYY st 200000
...
2000 3896 ZZZ rd 350000
...
2002 789 XYZ ave 450000
So, I need to combine all data sets, create the year column, and name the columns.
Here is my code to create a single dataframe:
years=[2000,2001,2002]
df=[]
for i in years:
    df.append(pd.read_csv("newyork" + str(i) + ".txt", header=None))
dfs=pd.concat(df)
But I could not create the year column or name the columns. Please help me solve this problem.
It is preferable to extract the year from the filename programmatically, rather than manually creating a list of years.
Use pathlib with .glob to find the files, use the .stem attribute to extract the filename, and then slice the year from the stem with [-4:], provided the filenames are consistent, with the year as the last 4 characters.
The .stem attribute extracts the final path component (e.g. 'newyork2000') without its suffix (e.g. '.txt').
Use pandas.DataFrame.insert to add the 'year' column at a specific location in the dataframe. This method inserts the column in place, so do not use x = x.insert(...).
import pandas as pd
from pathlib import Path
# set the file path
file_path = Path('e:/PythonProjects/stack_overflow/data/example')
# find your files
files = file_path.glob('newyork*.txt')
# create a list of dataframes
df_list = list()
for f in files:
    # extract the year from the filename by slicing the last four characters off the stem
    year = f.stem[-4:]
    # read the file and add column names
    x = pd.read_csv(f, header=None, names=['address', 'price'])
    # add a year column at index 0; use int(year) if the year should be an int, otherwise use year as-is
    x.insert(0, 'year', int(year))
    # append to the list
    df_list.append(x)
# create one dataframe from the list of dataframes
df = pd.concat(df_list).reset_index(drop=True)
Result
year address price
2000 253 XXX st 150000
2000 2567 YYY st 200000
2000 3896 ZZZ rd 350000
2001 456 XYZ ave 650000
2002 789 XYZ ave 450000
Sample data files
'newyork2000.txt'
253 XXX st, 150000
2567 YYY st, 200000
3896 ZZZ rd, 350000
'newyork2001.txt'
456 XYZ ave, 650000
'newyork2002.txt'
789 XYZ ave, 450000
You can read the files first, then insert the columns corresponding to years, and then concatenate them:
import pandas as pd
years = [2000,2001,2002]
# Read all CSV files
dfs = [pd.read_csv(f"newyork{year}.txt", header=None) for year in years]
# Insert column in the beginning
for i, df in enumerate(dfs):
    df.insert(0, 'year', years[i])
# Concatenate all
df = pd.concat(dfs, ignore_index=True)  # ignore_index avoids duplicate row labels
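For comparison, the same result fits in a single expression with assign; a sketch, assuming the same filenames as above:
import pandas as pd

years = [2000, 2001, 2002]
df = pd.concat(
    [pd.read_csv(f"newyork{year}.txt", header=None, names=['address', 'price'])
       .assign(year=year)
     for year in years],
    ignore_index=True,
)
# reorder so 'year' comes first, matching the expected output
df = df[['year', 'address', 'price']]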
I'm having some difficulty converting the values from object to float.
I saw some examples but couldn't get any of them to work.
I would like to have a for loop that converts the values in all columns.
I don't have a script yet because I saw several different ways to do it.
Terawatt-hours Total Asia Pacific Total CIS Total Europe
2000 0.428429 0 0.134473
2001 0.608465 0 0.170166
2002 0.829254 0 0.276783
2003 1.11654 0 0.468726
2004 1.46406 0 0.751126
2005 1.85281 0 1.48641
2006 2.29128 0 2.52412
2007 2.74858 0 3.81573
2008 3.3306 0 7.5011
2009 4.3835 7.375e-06 14.1928
2010 6.73875 0.000240125 23.2634
2011 12.1544 0.00182275 46.7135
I tried this:
df = pd.read_excel(r'bp-stats-review-2019-all-data.xls')
columns = list(df.head(0))
for i in range(len(columns)):
    df[columns[i]].astype(float)
Your question is not clear as to which column you are trying to convert, so I am sharing an example for the first column in your screenshot:
df['Terawatt-hours'] = df['Terawatt-hours'].astype(float)
The same works for any other column; bracket notation is needed here because the column name contains a hyphen, so attribute access (df.Terawatt-hours) would be parsed as a subtraction.
EDIT
To loop over the dataframe and change all of the columns, you can do the following.
Generate a dummy dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list('abcd'))
Check the type of each column in the dataframe:
for column in df.columns:
    print(df[column].dtype)
Change the type of all the columns to float:
for column in df.columns:
    df[column] = df[column].astype(float)
Your question is not clear: which columns are you trying to convert to float? Also, post what you have done.
EDIT:
What you tried is right up until the last line of your code, where you failed to reassign the columns:
df[columns[i]] = df[columns[i]].astype(float)
Also try using df.columns to get the column names instead of list(df.head(0)).
See the pandas docs on astype for how to cast a pandas object to a specified dtype.
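As a shortcut, both loops can be replaced by a single call; a sketch, assuming every column is genuinely numeric (the to_numeric variant is one option when some values are not):
# convert every column at once
df = df.astype(float)

# or, if some columns may contain non-numeric strings:
df = df.apply(pd.to_numeric, errors='coerce')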
I have a dataset that is similar in this format:
CITY - YEAR - ATTRIBUTE - VALUE
## example:
dallas-2002-crime-100
dallas-2003-crime-101
dallas-2002-population-4000
houston-2002-population-4100
etc....
I'm trying to reshape this from long to wide format so that each city+year combination is a row and the distinct attributes are the column names.
Thus this new dataframe would look like:
###
city - year - population - crime - median_income- etc....
I've looked at the pivot function, but it doesn't seem to support a multi-index for reshaping. Can someone let me know how to work around this? Additionally, I tried to look at
pd.pivot_table, but it seems this typically only works with numerical data aggregated with sums, means, etc. Most of my VALUE attributes are actually strings, so I don't seem to be able to use it.
### doesn't work - can't use a multi-index
df.pivot(index=['city','year'], columns = 'attribute', values='value')
Thank you for your help!
Is this what you are looking for:
import pandas as pd
from io import StringIO
data = """city-year-attribute-value
dallas-2002-crime-100
dallas-2003-crime-101
dallas-2002-population-4000
houston-2002-population-4100"""
df = pd.read_csv(StringIO(data), sep="-")
pivoted = df.pivot_table(
    index=["city", "year"],
    columns=["attribute"],
    values=["value"]
)
print(pivoted.reset_index())
Result:
city year value
attribute crime population
0 dallas 2002 100.0 4000.0
1 dallas 2003 101.0 NaN
2 houston 2002 NaN 4100.0
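One caveat: pivot_table aggregates with mean by default, which cannot handle string values. Since most of your VALUE attributes are strings, pass a non-numeric aggregator instead; a sketch, assuming each city/year/attribute combination occurs at most once:
pivoted = df.pivot_table(
    index=["city", "year"],
    columns="attribute",
    values="value",
    aggfunc="first"  # keep the single (possibly string) value instead of averaging
)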
I've got a dataframe df in Pandas that looks like this:
stores product discount
Westminster 102141 T
Westminster 102142 F
City of London 102141 T
City of London 102142 F
City of London 102143 T
And I'd like to end up with a dataset that looks like this:
stores product_1 discount_1 product_2 discount_2 product_3 discount_3
Westminster 102141 T 102142 F
City of London 102141 T 102142 F 102143 T
How do I do this in pandas?
I think this is some kind of pivot on the stores column, but with multiple value columns. Or perhaps it's an "unmelt" rather than a "pivot"?
I tried:
df.pivot("stores", ["product", "discount"], ["product", "discount"])
But I get TypeError: MultiIndex.name must be a hashable type.
Use DataFrame.unstack to reshape. It is only necessary to first create a counter per store with GroupBy.cumcount, then change the ordering of the second level with sort_index, and finally flatten the MultiIndex columns with map:
df = (df.set_index(['stores', df.groupby('stores').cumcount().add(1)])
        .unstack()
        .sort_index(axis=1, level=1))
df.columns = df.columns.map('{0[0]}_{0[1]}'.format)
df = df.reset_index()
print(df)
stores discount_1 product_1 discount_2 product_2 discount_3 \
0 City of London T 102141.0 F 102142.0 T
1 Westminster T 102141.0 F 102142.0 NaN
product_3
0 102143.0
1 NaN
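For comparison, a sketch of the same reshape via DataFrame.pivot, assuming a pandas version recent enough for pivot to accept a list of values; the counter column 'n' is a name introduced here just for illustration, and the columns come out grouped by value name rather than interleaved:
out = (df.assign(n=df.groupby('stores').cumcount().add(1))
         .pivot(index='stores', columns='n', values=['product', 'discount']))
# flatten the ('product', 1) style MultiIndex into 'product_1', etc.
out.columns = [f'{col}_{n}' for col, n in out.columns]
out = out.reset_index()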