Scaling and fitting to a log-normal distribution in Python

Question

Jan 25, 2016, 09:20 PM

I have a log-normally distributed set of samples. I can visualize the samples using a histogram with either a linear or a logarithmic x-axis. I can fit the samples to obtain the PDF (probability density function) and then scale it to the histogram in the plot with the linear x-axis; see also this previously posted question.

However, I am not able to plot the PDF properly in the plot with the logarithmic x-axis.

Unfortunately, it is not only a problem of scaling the area of the PDF to the histogram: the PDF is also shifted to the left, as you can see in the following plot.

My question is now: what am I doing wrong here? Using the CDF to plot the expected histogram, as suggested in this answer, works. I would just like to know what I am doing wrong in this code, since as far as I can tell it should work as well.
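For reference, here is a minimal sketch of what the CDF-based approach mentioned above could look like. It is not taken from the linked answer; the names `n_edges`, `bin_edges`, and `expected_counts` are made up for illustration, and the sample data simply mirror the example in this question. The idea is that the expected number of samples in a bin is the total sample count times the CDF difference between the bin's two edges.

```python
import numpy as np
import scipy.stats

# Stand-in data with the same parameters as the example in this question.
rng = np.random.RandomState(42)
samples = rng.lognormal(mean=1, sigma=0.4, size=10000)

# Fit a log-normal distribution with the location fixed at zero.
shape, loc, scale = scipy.stats.lognorm.fit(samples, floc=0)

# Log-spaced bin edges (n_edges edges give n_edges - 1 bins).
n_edges = 50
bin_edges = np.logspace(np.log10(samples.min()), np.log10(samples.max()), n_edges)
counts, _ = np.histogram(samples, bins=bin_edges)

# Expected count per bin = N * (CDF(right edge) - CDF(left edge)).
cdf_at_edges = scipy.stats.lognorm.cdf(bin_edges, shape, loc=loc, scale=scale)
expected_counts = samples.size * np.diff(cdf_at_edges)
```

`expected_counts` has one entry per bin and can be drawn against the bin centers on either a linear or a logarithmic x-axis, because it already accounts for the width of each bin.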

This is the Python code (I am sorry that it is quite long, but I wanted to post a complete standalone version):

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats

# generate log-normal distributed set of samples
np.random.seed(42)
samples   = np.random.lognormal( mean=1, sigma=.4, size=10000 )

# make a fit to the samples
shape, loc, scale = scipy.stats.lognorm.fit( samples, floc=0 )
x_fit       = np.linspace( samples.min(), samples.max(), 100 )
samples_fit = scipy.stats.lognorm.pdf( x_fit, shape, loc=loc, scale=scale )

# plot a histogram with linear x-axis
plt.subplot( 1, 2, 1 )
N_bins = 50
counts, bin_edges, ignored = plt.hist( samples, N_bins, histtype='stepfilled', label='histogram' )
# calculate area of histogram (area under PDF should be 1)
area_hist = .0
for ii in range( counts.size):
    area_hist += (bin_edges[ii+1]-bin_edges[ii]) * counts[ii]
# overplot the fit on the histogram
plt.plot( x_fit, samples_fit*area_hist, label='fitted and area-scaled PDF', linewidth=2)
plt.legend()

# make a histogram with a log10 x-axis
plt.subplot( 1, 2, 2 )
# equally sized bins (in log10-scale)
bins_log10 = np.logspace( np.log10( samples.min()  ), np.log10( samples.max() ), N_bins )
counts, bin_edges, ignored = plt.hist( samples, bins_log10, histtype='stepfilled', label='histogram' )
# calculate area of histogram
area_hist_log = .0
for ii in range( counts.size):
    area_hist_log += (bin_edges[ii+1]-bin_edges[ii]) * counts[ii]
# get pdf-values for log10 - spaced intervals
x_fit_log       = np.logspace( np.log10( samples.min()), np.log10( samples.max()), 100 )
samples_fit_log = scipy.stats.lognorm.pdf( x_fit_log, shape, loc=loc, scale=scale )
# overplot the fit on the histogram
plt.plot( x_fit_log, samples_fit_log*area_hist_log, label='fitted and area-scaled PDF', linewidth=2 )

plt.xscale( 'log' )
plt.xlim( bin_edges.min(), bin_edges.max() )
plt.legend()
plt.show()

Update 1:

I forgot to mention the versions I am using:

python      2.7.6
numpy       1.8.2
matplotlib  1.3.1
scipy       0.13.3

Update 2:

As pointed out by @Christoph and @zaxliu (thanks to both), the problem lies in the scaling of the PDF. It works when I use the same bins as for the histogram, as in @zaxliu's solution, but I still have some problems when I use a higher resolution for the PDF (as in my example above). This is shown in the following figure:

The code for the figure on the right-hand side is (I have omitted the imports and the generation of the sample data, both of which you can find in the example above):

# equally sized bins in log10-scale
bins_log10 = np.logspace( np.log10( samples.min()  ), np.log10( samples.max() ), N_bins )
counts, bin_edges, ignored = plt.hist( samples, bins_log10, histtype='stepfilled', label='histogram' )

# calculate length of each bin (required for scaling PDF to histogram)
bins_log_len = np.zeros( bins_log10.size )
for ii in range( counts.size):
    bins_log_len[ii] = bin_edges[ii+1]-bin_edges[ii]

# get pdf-values for same intervals as histogram
samples_fit_log = scipy.stats.lognorm.pdf( bins_log10, shape, loc=loc, scale=scale )

# overplot the fitted and scaled PDF on the histogram
plt.plot( bins_log10, np.multiply(samples_fit_log,bins_log_len)*sum(counts), label='PDF using histogram bins', linewidth=2 )

# make another pdf with a finer resolution
x_fit_log       = np.logspace( np.log10( samples.min()), np.log10( samples.max()), 100 )
samples_fit_log = scipy.stats.lognorm.pdf( x_fit_log, shape, loc=loc, scale=scale )
# calculate length of each bin (required for scaling PDF to histogram)
# in addition, estimate middle point for more accuracy (should in principle also be done for the other PDF)
bins_log_len       = np.diff( x_fit_log )
samples_log_center = np.zeros( x_fit_log.size-1 )
for ii in range( x_fit_log.size-1 ):
    samples_log_center[ii] = .5*(samples_fit_log[ii] + samples_fit_log[ii+1] )

# scale PDF to histogram
# NOTE: THIS IS NOT WORKING PROPERLY (SEE FIGURE)
pdf_scaled2hist = np.multiply(samples_log_center,bins_log_len)*sum(counts)

# overplot the fit on the histogram
plt.plot( .5*(x_fit_log[:-1]+x_fit_log[1:]), pdf_scaled2hist, label='PDF using own bins', linewidth=2 )

plt.xscale( 'log' )
plt.xlim( bin_edges.min(), bin_edges.max() )
plt.legend(loc=3)
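
One possible reading of the discrepancy, offered as a sketch rather than a confirmed diagnosis: the height of a histogram bar is a count per bin, so it depends on the width of the histogram's bins. A curve evaluated on a finer grid would then still have to be multiplied by the width a histogram bin would have at that x, not by the spacing of the fine grid itself. For log-spaced edges with a constant ratio r between neighbours, a bin geometrically centered at x spans x/sqrt(r) to x*sqrt(r). The names `ratio`, `x_fine`, `local_bin_width`, and `curve` below are made up for illustration.

```python
import numpy as np
import scipy.stats

# Stand-in data with the same parameters as the example in this question.
rng = np.random.RandomState(42)
samples = rng.lognormal(mean=1, sigma=0.4, size=10000)
shape, loc, scale = scipy.stats.lognorm.fit(samples, floc=0)

# Histogram with log-spaced edges (n_edges edges give n_edges - 1 bins).
n_edges = 50
bin_edges = np.logspace(np.log10(samples.min()), np.log10(samples.max()), n_edges)
counts, _ = np.histogram(samples, bins=bin_edges)

# Log-spaced edges have a constant ratio between neighbouring edges.
ratio = bin_edges[1] / bin_edges[0]

# Finer grid on which the fitted PDF is evaluated.
x_fine = np.logspace(np.log10(samples.min()), np.log10(samples.max()), 200)
pdf_fine = scipy.stats.lognorm.pdf(x_fine, shape, loc=loc, scale=scale)

# Width of a *histogram* bin geometrically centered at each x:
# it spans x / sqrt(ratio) to x * sqrt(ratio).
local_bin_width = x_fine * (np.sqrt(ratio) - 1.0 / np.sqrt(ratio))

# Approximate expected count per histogram bin: N * pdf(x) * bin width at x.
curve = counts.sum() * pdf_fine * local_bin_width
```

Plotting `curve` against `x_fine` with `plt.xscale('log')` should then lie on the same vertical scale as the histogram regardless of how many points `x_fine` contains, because the scaling uses the histogram's bin width rather than the grid spacing.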