Exemple de comment sélectionner une ou des colonnes de données dans une DataFrame avec pandas ?
Créer une DataFrame avec Pandas
Soit par exemple le fichier csv suivant train.csv (que l'on peut télécharger sur kaggle). Pour lire le fichier il existe la fonction pandas read_csv():
>>> import pandas as pd>>> df = pd.read_csv('train.csv')>>> df.shape(1460, 81)
Obtenir le noms des colonnes
Première étape: obtenir le nom des colonnes avec la commande pandas df.columns:
>>> df.columnsIndex(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street','Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig','LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType','HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd','RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType','MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual','BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1','BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating','HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF','LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath','HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual','TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType','GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual','GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF','EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC','Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType','SaleCondition', 'SalePrice'],dtype='object')
Afficher le nom des colonnes et l'indice associé:
>>> for idx,column in enumerate(df.columns):... print(idx,column)...0 Id1 MSSubClass2 MSZoning3 LotFrontage4 LotArea5 Street6 Alley7 LotShape8 LandContour9 Utilities10 LotConfig11 LandSlope12 Neighborhood13 Condition114 Condition215 BldgType16 HouseStyle17 OverallQual18 OverallCond19 YearBuilt20 YearRemodAdd21 RoofStyle22 RoofMatl23 Exterior1st24 Exterior2nd25 MasVnrType26 MasVnrArea27 ExterQual28 ExterCond29 Foundation30 BsmtQual31 BsmtCond32 BsmtExposure33 BsmtFinType134 BsmtFinSF135 BsmtFinType236 BsmtFinSF237 BsmtUnfSF38 TotalBsmtSF39 Heating40 HeatingQC41 CentralAir42 Electrical43 1stFlrSF44 2ndFlrSF45 LowQualFinSF46 GrLivArea47 BsmtFullBath48 BsmtHalfBath49 FullBath50 HalfBath51 BedroomAbvGr52 KitchenAbvGr53 KitchenQual54 TotRmsAbvGrd55 Functional56 Fireplaces57 FireplaceQu58 GarageType59 GarageYrBlt60 GarageFinish61 GarageCars62 GarageArea63 GarageQual64 GarageCond65 PavedDrive66 WoodDeckSF67 OpenPorchSF68 EnclosedPorch69 3SsnPorch70 ScreenPorch71 PoolArea72 PoolQC73 Fence74 MiscFeature75 MiscVal76 MoSold77 YrSold78 SaleType79 SaleCondition80 SalePrice
Sélectionner une colonne de données
Exemple 1 sur comment extraire les données de la colonne "SalePrice":
>>> df['SalePrice']0 2085001 1815002 2235003 1400004 2500005 1430006 3070007 2000008 1299009 11800010 12950011 34500012 14400013 27950014 15700015 13200016 14900017 9000018 15900019 13900020 32530021 13940022 23000023 12990024 15400025 25630026 13480027 30600028 20750029 68500...1430 1921401431 1437501432 645001433 1865001434 1600001435 1740001436 1205001437 3946171438 1497001439 1970001440 1910001441 1493001442 3100001443 1210001444 1796001445 1290001446 1579001447 2400001448 1120001449 920001450 1360001451 2870901452 1450001453 845001454 1850001455 1750001456 2100001457 2665001458 1421251459 147500Name: SalePrice, Length: 1460, dtype: int64
Exemple 2: en utilisant la fonction pandas loc:
>>> df.loc[:,'SalePrice']0 2085001 1815002 2235003 1400004 2500005 1430006 3070007 2000008 1299009 11800010 12950011 34500012 14400013 27950014 15700015 13200016 14900017 9000018 15900019 13900020 32530021 13940022 23000023 12990024 15400025 25630026 13480027 30600028 20750029 68500...1430 1921401431 1437501432 645001433 1865001434 1600001435 1740001436 1205001437 3946171438 1497001439 1970001440 1910001441 1493001442 3100001443 1210001444 1796001445 1290001446 1579001447 2400001448 1120001449 920001450 1360001451 2870901452 1450001453 845001454 1850001455 1750001456 2100001457 2665001458 1421251459 147500Name: SalePrice, dtype: int64
Exemple 3: en utilisant la fonction pandas iloc (Note: l'indice de 'SalePrice' est ici 80 voir ci-dessus):
>>> df.iloc[:,80]0 2085001 1815002 2235003 1400004 2500005 1430006 3070007 2000008 1299009 11800010 12950011 34500012 14400013 27950014 15700015 13200016 14900017 9000018 15900019 13900020 32530021 13940022 23000023 12990024 15400025 25630026 13480027 30600028 20750029 68500...1430 1921401431 1437501432 645001433 1865001434 1600001435 1740001436 1205001437 3946171438 1497001439 1970001440 1910001441 1493001442 3100001443 1210001444 1796001445 1290001446 1579001447 2400001448 1120001449 920001450 1360001451 2870901452 1450001453 845001454 1850001455 1750001456 2100001457 2665001458 1421251459 147500Name: SalePrice, dtype: int64
Sélectionner plusieurs colonnes de données
>>> mycolumns = ['GrLivArea','SalePrice']>>> df[mycolumns]GrLivArea SalePrice0 1710 2085001 1262 1815002 1786 2235003 1717 1400004 2198 2500005 1362 1430006 1694 3070007 2090 2000008 1774 1299009 1077 11800010 1040 12950011 2324 34500012 912 14400013 1494 27950014 1253 15700015 854 13200016 1004 14900017 1296 9000018 1114 15900019 1339 13900020 2376 32530021 1108 13940022 1795 23000023 1060 12990024 1060 15400025 1600 25630026 900 13480027 1704 30600028 1600 20750029 520 68500... ... ...1430 1838 1921401431 958 1437501432 968 645001433 1792 1865001434 1126 1600001435 1537 1740001436 864 1205001437 1932 3946171438 1236 1497001439 1725 1970001440 2555 1910001441 848 1493001442 2007 3100001443 952 1210001444 1422 1796001445 913 1290001446 1188 1579001447 2090 2400001448 1346 1120001449 630 920001450 1792 1360001451 1578 2870901452 1072 1450001453 1140 845001454 1221 1850001455 1647 1750001456 2073 2100001457 2340 2665001458 1078 1421251459 1256 147500[1460 rows x 2 columns]
Exemple 2 en utilisant la fonction pandas loc:
>>> df.loc[:,mycolumns]GrLivArea SalePrice0 1710 2085001 1262 1815002 1786 2235003 1717 1400004 2198 2500005 1362 1430006 1694 3070007 2090 2000008 1774 1299009 1077 11800010 1040 12950011 2324 34500012 912 14400013 1494 27950014 1253 15700015 854 13200016 1004 14900017 1296 9000018 1114 15900019 1339 13900020 2376 32530021 1108 13940022 1795 23000023 1060 12990024 1060 15400025 1600 25630026 900 13480027 1704 30600028 1600 20750029 520 68500... ... ...1430 1838 1921401431 958 1437501432 968 645001433 1792 1865001434 1126 1600001435 1537 1740001436 864 1205001437 1932 3946171438 1236 1497001439 1725 1970001440 2555 1910001441 848 1493001442 2007 3100001443 952 1210001444 1422 1796001445 913 1290001446 1188 1579001447 2090 2400001448 1346 1120001449 630 920001450 1792 1360001451 1578 2870901452 1072 1450001453 1140 845001454 1221 1850001455 1647 1750001456 2073 2100001457 2340 2665001458 1078 1421251459 1256 147500[1460 rows x 2 columns]
Exemple 3 en utilisant iloc:
>>> df.iloc[:,[46,80]]GrLivArea SalePrice0 1710 2085001 1262 1815002 1786 2235003 1717 1400004 2198 2500005 1362 1430006 1694 3070007 2090 2000008 1774 1299009 1077 11800010 1040 12950011 2324 34500012 912 14400013 1494 27950014 1253 15700015 854 13200016 1004 14900017 1296 9000018 1114 15900019 1339 13900020 2376 32530021 1108 13940022 1795 23000023 1060 12990024 1060 15400025 1600 25630026 900 13480027 1704 30600028 1600 20750029 520 68500... ... ...1430 1838 1921401431 958 1437501432 968 645001433 1792 1865001434 1126 1600001435 1537 1740001436 864 1205001437 1932 3946171438 1236 1497001439 1725 1970001440 2555 1910001441 848 1493001442 2007 3100001443 952 1210001444 1422 1796001445 913 1290001446 1188 1579001447 2090 2400001448 1346 1120001449 630 920001450 1792 1360001451 1578 2870901452 1072 1450001453 1140 845001454 1221 1850001455 1647 1750001456 2073 2100001457 2340 2665001458 1078 1421251459 1256 147500[1460 rows x 2 columns]
L'avantage de iloc est de pouvoir facilement sélectionner plusieurs colonnes consécutives:
>>> df.iloc[:,76:80]MoSold YrSold SaleType SaleCondition0 2 2008 WD Normal1 5 2007 WD Normal2 9 2008 WD Normal3 2 2006 WD Abnorml4 12 2008 WD Normal5 10 2009 WD Normal6 8 2007 WD Normal7 11 2009 WD Normal8 4 2008 WD Abnorml9 1 2008 WD Normal10 2 2008 WD Normal11 7 2006 New Partial12 9 2008 WD Normal13 8 2007 New Partial14 5 2008 WD Normal15 7 2007 WD Normal16 3 2010 WD Normal17 10 2006 WD Normal18 6 2008 WD Normal19 5 2009 COD Abnorml20 11 2006 New Partial21 6 2007 WD Normal22 9 2008 WD Normal23 6 2007 WD Normal24 5 2010 WD Normal25 7 2009 WD Normal26 5 2010 WD Normal27 5 2010 WD Normal28 12 2006 WD Normal29 5 2008 WD Normal... ... ... ... ...1430 7 2006 WD Normal1431 10 2009 WD Normal1432 8 2007 WD Normal1433 5 2008 WD Normal1434 5 2006 WD Normal1435 7 2008 COD Abnorml1436 5 2007 WD Normal1437 11 2008 New Partial1438 4 2010 WD Normal1439 11 2007 WD Normal1440 9 2008 WD Normal1441 5 2008 WD Normal1442 4 2009 WD Normal1443 5 2009 WD Normal1444 11 2007 WD Normal1445 5 2007 WD Normal1446 4 2010 WD Normal1447 12 2007 WD Normal1448 5 2007 WD Normal1449 8 2006 WD Abnorml1450 9 2009 WD Normal1451 5 2009 New Partial1452 5 2006 WD Normal1453 7 2006 WD Abnorml1454 10 2009 WD Normal1455 8 2007 WD Normal1456 2 2010 WD Normal1457 5 2010 WD Normal1458 4 2010 WD Normal1459 6 2008 WD Normal[1460 rows x 4 columns]
Références
| Liens | Site |
|---|---|
| Selecting Subsets of Data in Pandas: Part 1 | medium.com |
| Select Rows & Columns by Name or Index in DataFrame using loc & iloc Python Pandas | thispointer.com |
| pandas.DataFrame.loc | pandas doc |
| pandas.DataFrame.iloc | pandas doc |
