Introduction
This article shows how to compute, in Python, the usual descriptive statistics for a single variable (mean, median, and so on) and how to draw the standard graphical representations. The running example comes from a video course introducing descriptive statistics (see descriptive statistics).
- Download the data file: [attachment:203]
- Download the Python code: [attachment:204]
- Run the code: python DescriptiveStatistics_01.py
Summary statistics
Mean
np.mean(Taille)
Median
np.median(Taille)
Mode
stats.mode(Taille,axis=0)
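The value returned by `stats.mode` has changed shape across SciPy releases, so the printed result may not look the same on every installation. As an alternative, here is a minimal sketch that finds the mode with NumPy alone. The `sample` array is made-up illustrative data, not the article's data file, and the variable names are only suggestions.

```python
import numpy as np

# Hypothetical sample (not the article's data file)
sample = np.array([164.0, 170.0, 164.0, 158.0, 164.0, 170.0])

# Distinct values and the number of times each one occurs
values, counts = np.unique(sample, return_counts=True)

# The mode is the value with the highest count
# (np.argmax returns the first one if several values tie)
mode_value = values[np.argmax(counts)]
mode_count = counts.max()

print(mode_value, mode_count)
```

Because `np.unique` returns the values sorted, ties resolve to the smallest of the tied values.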
Maximum and minimum
max(Taille), min(Taille)
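The range reported in the example output is simply the difference between these two values. A small sketch, using made-up values rather than the article's data file:

```python
import numpy as np

# Hypothetical values (not the article's data file)
sample = np.array([150.0, 164.0, 167.5, 175.5, 190.0])

# Range = largest value minus smallest value
value_range = np.max(sample) - np.min(sample)

print(value_range)
```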
Standard deviation (and variance)
np.std(Taille)           # population standard deviation (divides by n)
np.std(Taille, ddof=1)   # sample standard deviation (divides by n - 1)
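The heading mentions the variance, but the snippet only computes the standard deviation. The sketch below, on made-up values, spells out both definitions by hand and compares them with `np.var` and `np.std`. The only difference between the two variants is the divisor, `n` or `n - 1`, selected through `ddof`.

```python
import numpy as np

# Hypothetical values (not the article's data file)
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = sample.size
squared_deviations = (sample - sample.mean()) ** 2

# Population variance: divide the sum of squared deviations by n
var_population = squared_deviations.sum() / n
# Sample variance: divide by n - 1 (this is what ddof=1 selects)
var_sample = squared_deviations.sum() / (n - 1)

# NumPy equivalents; the standard deviation is the square root of the variance
print(var_population, np.var(sample))
print(var_sample, np.var(sample, ddof=1))
print(np.sqrt(var_population), np.std(sample))
print(np.sqrt(var_sample), np.std(sample, ddof=1))
```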
Quartiles
print('First quartile: ', stats.scoreatpercentile(Taille, 25))
print('Second quartile: ', stats.scoreatpercentile(Taille, 50))
print('Third quartile: ', stats.scoreatpercentile(Taille, 75))
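The same quartiles can also be obtained with NumPy alone. The sketch below uses `np.percentile`, which interpolates linearly between neighbouring values by default and should therefore agree with the calls above under default settings. The data are made up for illustration.

```python
import numpy as np

# Hypothetical values (not the article's data file)
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# The three quartiles in a single call (linear interpolation by default)
q1, q2, q3 = np.percentile(sample, [25, 50, 75])

# Interquartile range: spread of the middle half of the data
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

The second quartile is the median, so `q2` equals `np.median(sample)`.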
Example
- Mean: 169.7
- Standard deviation: 9.95540054443
- Unbiased standard deviation (ddof=1): 10.2140254449
- Median: 167.5
- Maximum and minimum: 190.0, 150.0
- Range: 40.0
- Mode: (array([ 164.]), array([ 3.]))
- First quartile: 163.75
- Second quartile: 167.5
- Third quartile: 175.5
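As a consistency check on these figures: the two standard deviations differ only by the factor sqrt(n / (n - 1)), so their ratio gives the sample size. The sketch below uses nothing but the two values printed in the example; the variable names are illustrative.

```python
# Values copied from the example output
std_population = 9.95540054443   # np.std(Taille)
std_sample = 10.2140254449       # np.std(Taille, ddof=1)

# std_sample / std_population = sqrt(n / (n - 1))
# With r = ratio squared, solving r = n / (n - 1) gives n = r / (r - 1)
r = (std_sample / std_population) ** 2
n = r / (r - 1)

print(round(n))
```

With these two values the result rounds to 20, which suggests that the data file holds 20 measurements.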
Graphical representations
Histogram
fig = plt.figure()
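# Tick labels are the class boundaries (LabelList in the full listing)
plt.xticks(x_pos, LabelList, rotation=45)
plt.ylabel(r'Absolute Frequency $n_i$')
# align='edge' makes each bar start at its left boundary tick
bar1 = plt.bar(X, AbsoluteFrequency, width=1.0, bottom=0, align='edge', color='Green', alpha=0.65, label='Legend')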
plt.savefig('Histogram.png', bbox_inches='tight')
plt.show()
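The heights of the bars are the class counts, which the full listing builds with an explicit loop. As a hedged alternative, `np.histogram` produces the same kind of equal-width classes between the minimum and the maximum. The data below are made up, and `nb_class` mirrors the `NbClass = 4` setting of the listing.

```python
import numpy as np

# Hypothetical heights in cm (not the article's data file)
sample = np.array([150.0, 155.0, 161.0, 164.0, 164.0,
                   168.0, 172.0, 179.0, 183.0, 190.0])
nb_class = 4

# Equal-width classes from min to max; every class is half-open [a, b)
# except the last one, which also includes the maximum
absolute_frequency, class_edges = np.histogram(sample, bins=nb_class)

print(absolute_frequency)
print(class_edges)
```

`class_edges` has one more entry than `absolute_frequency`, which matches the `NbClass + 1` tick labels used on the x axis.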
Cumulative distribution function
fig = plt.figure()
for i in np.arange(NbClass):
    # Horizontal level reached at the end of class i
    plt.plot([CumulativeFrequency_xStart[i], CumulativeFrequency_xEnd[i]],
             [CumulativeFrequency[i], CumulativeFrequency[i]], 'k--')
    if i < NbClass - 1:
        # Empty circle at the right end of every class except the last
        plt.scatter(CumulativeFrequency_xEnd[i], CumulativeFrequency[i],
                    s=80, facecolors='none', edgecolors='r')
    # Segment rising from the previous cumulative level (0 for the first class)
    if i == 0:
        plt.plot([CumulativeFrequency_xStart[i], CumulativeFrequency_xEnd[i]],
                 [0, CumulativeFrequency[i]], 'r--')
    else:
        plt.plot([CumulativeFrequency_xStart[i], CumulativeFrequency_xEnd[i]],
                 [CumulativeFrequency[i-1], CumulativeFrequency[i]], 'r--')
plt.xlim(0,NbClass)
plt.ylim(0,1)
plt.xticks(x_pos, LabelList,rotation=45)
plt.title("Cumulative Distribution Function")
plt.savefig('CumulativeDistributionFunction.png', bbox_inches='tight')
plt.show()
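The cumulative frequencies plotted here are running totals of the relative frequencies. A minimal sketch of that computation with `np.cumsum`, on made-up class counts:

```python
import numpy as np

# Hypothetical class counts (not taken from the article's data file)
absolute_frequency = np.array([2, 4, 2, 2])

# Relative frequency of each class: count divided by the total
relative_frequency = absolute_frequency / absolute_frequency.sum()

# Cumulative frequency: running sum, ending at 1
cumulative_frequency = np.cumsum(relative_frequency)

print(cumulative_frequency)
```

This replaces the explicit accumulation loop of the full listing with a single vectorised call.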
Box plot
fig = plt.figure()
plt.boxplot(Taille)
# The single box is drawn at x = 1; label it after plotting so the tick is not overwritten
plt.xticks([1], ['Taille'])
plt.savefig('BoxPlot.png', bbox_inches='tight')
plt.show()
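To read the box plot, it helps to know which numbers it draws. The sketch below computes them by hand on made-up data, assuming Matplotlib's default whisker setting of 1.5 times the interquartile range. The names `lower_fence` and `upper_fence` are illustrative.

```python
import numpy as np

# Hypothetical values with one unusually large point
sample = np.array([150.0, 160.0, 164.0, 166.0, 168.0,
                   170.0, 175.0, 178.0, 210.0])

# The box spans Q1 to Q3, with a line at the median
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Limits located 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside those limits
whisker_low = sample[sample >= lower_fence].min()
whisker_high = sample[sample <= upper_fence].max()

# Points beyond the limits are drawn individually
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```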
Python code
Source code:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
import math
# '---------- Read Data ----------'
Taille, Poids = np.loadtxt("data.txt", unpack=True, skiprows=1)
Taille = np.sort(Taille)
# '---------- Print Descriptive statistics: Continuous Case ----------'
print(Taille)
print('Taille Dim: ', Taille.shape)
print('mean', np.mean(Taille))
print('std', np.std(Taille))
print('std (unbiased): ', np.std(Taille, ddof=1))
print('Median: ', np.median(Taille))
print('Max and Min Value: ', max(Taille), min(Taille))
print('Range: ', max(Taille) - min(Taille))
print('Mode: ', stats.mode(Taille, axis=0))
print('First quartile: ', stats.scoreatpercentile(Taille, 25))
print('Second quartile: ', stats.scoreatpercentile(Taille, 50))
print('Third quartile: ', stats.scoreatpercentile(Taille, 75))
# '---------- Discrete Case ----------'
NbData = Taille.shape[0]
NbClass = 4 #int( math.log(NbData,2) ) + 1
Range = max(Taille) - min(Taille)
ClassRange = float( Range ) / NbClass
print('NbData: ', NbData)
print('NbClass: ', NbClass)
print('ClassRange: ', ClassRange)
X = np.arange(NbClass)
# Count how many values fall in each class
AbsoluteFrequency = np.zeros(NbClass)
for i in np.arange(NbData):
    # Class index of the i-th value; the maximum is assigned to the last class
    c = min(int((Taille[i]-min(Taille))/ClassRange), NbClass-1)
    AbsoluteFrequency[c] = AbsoluteFrequency[c] + 1
# Class boundaries, used as x-axis tick labels
ClassLabel = []
j = round(min(Taille),2)
for i in np.arange(NbClass+1):
    ClassLabel.append(j)
    j = round(j + ClassRange,2)
LabelList = (ClassLabel)
x_pos = np.arange(len(LabelList))
# '---------- Plot Absolute Frequency Histogram ----------'
fig = plt.figure()
plt.xticks(x_pos, LabelList,rotation=45)
plt.ylabel(r'Absolute Frequency $n_i$')
# align='edge' makes each bar start at its left boundary tick
bar1 = plt.bar(X, AbsoluteFrequency, width=1.0, bottom=0, align='edge',
               color='Green', alpha=0.65, label='Legend')
plt.savefig('Histogram.png', bbox_inches='tight')
plt.show()
# Relative frequency of each class: count divided by the number of values
RelativeFrequency = AbsoluteFrequency / NbData
# '---------- Plot Cumulative distribution function ----------'
CumulativeFrequency = np.zeros(NbClass)
CumulativeFrequency_xStart = np.zeros(NbClass)
CumulativeFrequency_xEnd = np.zeros(NbClass)
j = 0
k = 0
for i in np.arange(NbClass):
    # j accumulates the relative frequencies; k is the left edge of class i
    CumulativeFrequency[i] = j + RelativeFrequency[i]
    j = j + RelativeFrequency[i]
    CumulativeFrequency_xStart[i] = k
    CumulativeFrequency_xEnd[i] = k + 1
    k += 1
fig = plt.figure()
for i in np.arange(NbClass):
    # Horizontal level reached at the end of class i
    plt.plot([CumulativeFrequency_xStart[i], CumulativeFrequency_xEnd[i]],
             [CumulativeFrequency[i], CumulativeFrequency[i]], 'k--')
    if i < NbClass - 1:
        # Empty circle at the right end of every class except the last
        plt.scatter(CumulativeFrequency_xEnd[i], CumulativeFrequency[i],
                    s=80, facecolors='none', edgecolors='r')
    # Segment rising from the previous cumulative level (0 for the first class)
    if i == 0:
        plt.plot([CumulativeFrequency_xStart[i], CumulativeFrequency_xEnd[i]],
                 [0, CumulativeFrequency[i]], 'r--')
    else:
        plt.plot([CumulativeFrequency_xStart[i], CumulativeFrequency_xEnd[i]],
                 [CumulativeFrequency[i-1], CumulativeFrequency[i]], 'r--')
plt.xlim(0,NbClass)
plt.ylim(0,1)
plt.xticks(x_pos, LabelList,rotation=45)
plt.title("Cumulative Distribution Function")
plt.savefig('CumulativeDistributionFunction.png', bbox_inches='tight')
plt.show()
# '---------- Plot Box Plot ----------'
fig = plt.figure()
plt.boxplot(Taille)
# The single box is drawn at x = 1; label it after plotting so the tick is not overwritten
plt.xticks([1], ['Taille'])
plt.savefig('BoxPlot.png', bbox_inches='tight')
plt.show()
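The listing fixes `NbClass = 4` and leaves the expression `int( math.log(NbData,2) ) + 1` commented out. That expression resembles Sturges' rule, which picks the number of classes from the sample size. Here is a minimal sketch of it; the function name is hypothetical, and `math.log2` is used instead of `math.log(x, 2)` because it is usually more accurate.

```python
import math

def sturges_class_count(nb_data):
    """Number of classes computed as int(log2(n)) + 1."""
    return int(math.log2(nb_data)) + 1

# Class counts suggested for a few sample sizes
for nb_data in (8, 20, 100):
    print(nb_data, sturges_class_count(nb_data))
```

For small samples the rule suggests only a handful of classes, so a hand-picked value such as 4 stays in the same range.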