Irohabook

Finding the maximum, minimum, mean, variance, and standard deviation of a column with pandas in Python (watch out for the thousands option of read_csv)

The conclusions first.

  • A pandas Series provides standard statistical methods such as max and std
  • When the numeric data uses commas as thousands separators, pass the thousands option to read_csv

Extracting one-dimensional data from a pandas DataFrame and finding its maximum

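Let's select a single column from a pandas DataFrame and compute its maximum, variance, and other statistics. As before, we use the population data for the municipalities of Tokyo.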

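import pandas as pd

# thousands=',' makes pandas parse values such as "63,635" as integers
df = pd.read_csv('population.csv', thousands=',', index_col=0)
total = df['総数']  # the 総数 (total population) column as a Series

print(total)

# Names chosen so that the built-in max() and min() are not shadowed
max_value = total.max()
min_value = total.min()
mean_value = total.mean()
variance = total.var()
std_dev = total.std()

# Labels: 最大値 = maximum, 最小値 = minimum, 平均 = mean,
# 分散 = variance, 標準偏差 = standard deviation
print('最大値 {0}'.format(max_value))
print('最小値 {0}'.format(min_value))
print('平均 {0}'.format(mean_value))
print('分散 {0}'.format(variance))
print('標準偏差 {0}'.format(std_dev))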

The result is as follows.

市区町村
千代田区      63635
中央区      162502
港 区      257426
新宿区      346162
文京区      221489
台東区      199292
墨田区      271859
江東区      518479
品川区      394700
目黒区      279342
大田区      729534
世田谷区     908907
渋谷区      226594
中野区      331658
杉並区      569132
豊島区      289508
北 区      351976
荒川区      215966
板橋区      566890
練馬区      732433
足立区      688512
葛飾区      462591
江戸川区     698031
八王子市     562460
立川市      183822
武蔵野市     146399
三鷹市      187199
青梅市      134086
府中市      260011
昭島市      113215
          ...  
小金井市     121443
小平市      193596
日野市      185393
東村山市     150789
国分寺市     123689
国立市       76038
福生市       58243
狛江市       82481
東大和市      85565
清瀬市       74737
東久留米市    116896
武蔵村山市     72546
多摩市      148745
稲城市       90585
羽村市       55607
あきる野市     80851
西東京市     202817
瑞穂町       33213
日の出町      16732
檜原村        2217
奥多摩町       5179
大島町        7716
利島村         323
新島村        2722
神津島村       1898
三宅村        2481
御蔵島村        317
八丈町        7465
青ヶ島村        159
小笠原村       2625
Name: 総数, Length: 62, dtype: int64

最大値 908907
最小値 159
平均 221624.70967741936
分散 48636512171.81597
標準偏差 220536.87259008634
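
One detail worth knowing when reading the variance and standard deviation: the var and std methods of a pandas Series default to ddof=1, that is, the unbiased sample variance (dividing by n - 1). The minimal sketch below uses a small made-up Series, not the population data, to show how ddof=0 gives the population variance (dividing by n) instead.

```python
import pandas as pd

# Small illustrative values (an assumption, not taken from population.csv)
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_var = values.var()            # ddof=1 by default: divides by n - 1
population_var = values.var(ddof=0)  # divides by n
population_std = values.std(ddof=0)  # square root of the population variance

print(sample_var, population_var, population_std)
```

Which of the two is appropriate depends on whether the data is treated as a sample or as the whole population.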

Here, population.csv contains the following data.

市区町村,世帯数,総数,男,女,人口密度
千代田区,"35,830","63,635","31,935","31,700","5,458"
中央区,"91,852","162,502","77,241","85,261","15,916"
港 区,"145,865","257,426","121,326","136,100","12,638"
新宿区,"219,639","346,162","173,743","172,419","18,999"
文京区,"121,128","221,489","105,462","116,027","19,618"
台東区,"118,858","199,292","101,917","97,375","19,712"
墨田区,"150,855","271,859","134,678","137,181","19,743"
江東区,"267,262","518,479","256,116","262,363","12,910"
品川区,"220,678","394,700","193,644","201,056","17,281"
目黒区,"156,583","279,342","132,206","147,136","19,042"
大田区,"391,146","729,534","362,653","366,881","11,993"
世田谷区,"479,792","908,907","431,026","477,881","15,657"
渋谷区,"137,582","226,594","108,768","117,826","14,996"
中野区,"204,613","331,658","167,378","164,280","21,274"
杉並区,"321,531","569,132","273,057","296,075","16,710"
豊島区 ,"179,880","289,508","145,334","144,174","22,253"
北 区,"196,580","351,976","174,910","177,066","17,078"
荒川区,"115,944","215,966","107,283","108,683","21,256"
板橋区,"309,133","566,890","278,662","288,228","17,594"
練馬区,"370,567","732,433","356,279","376,154","15,234"
足立区,"346,739","688,512","345,291","343,221","12,930"
葛飾区,"233,158","462,591","231,272","231,319","13,293"
江戸川区,"342,016","698,031","351,914","346,117","13,989"
八王子市,"267,736","562,460","281,506","280,954","3,018"
立川市,"91,270","183,822","91,460","92,362","7,546"
武蔵野市,"76,765","146,399","70,120","76,279","13,333"
三鷹市,"93,665","187,199","91,624","95,575","11,401"
青梅市,"63,142","134,086","67,393","66,693","1,298"
府中市,"125,060","260,011","130,582","129,429","8,835"
昭島市,"53,827","113,215","56,384","56,831","6,529"
調布市,"118,804","235,169","114,909","120,260","10,898"
町田市,"195,643","428,685","209,971","218,714","5,991"
小金井市,"60,367","121,443","59,955","61,488","10,747"
小平市,"91,602","193,596","95,312","98,284","9,439"
日野市,"88,402","185,393","92,983","92,410","6,729"
東村山市,"72,676","150,789","73,621","77,168","8,797"
国分寺市,"60,111","123,689","60,901","62,788","10,793"
国立市,"37,728","76,038","37,161","38,877","9,330"
福生市,"30,506","58,243","29,132","29,111","5,733"
狛江市,"42,157","82,481","40,005","42,476","12,908"
東大和市,"38,852","85,565","42,208","43,357","6,376"
清瀬市,"35,454","74,737","36,092","38,645","7,306"
東久留米市,"54,257","116,896","57,066","59,830","9,076"
武蔵村山市,"31,640","72,546","36,177","36,369","4,735"
多摩市,"71,851","148,745","72,927","75,818","7,080"
稲城市,"39,991","90,585","45,589","44,996","5,041"
羽村市,"25,718","55,607","28,251","27,356","5,617"
あきる野市,"35,519","80,851","40,304","40,547","1,100"
西東京市,"97,350","202,817","98,839","103,978","12,877"
瑞穂町,"14,912","33,213","16,922","16,291","1,971"
日の出町,"7,383","16,732","8,224","8,508",596
檜原村,"1,181","2,217","1,100","1,117",21
奥多摩町,"2,685","5,179","2,601","2,578",23
大島町,"4,635","7,716","3,971","3,745",85
利島村,174,323,175,148,78
新島村,"1,381","2,722","1,325","1,397",99
神津島村,917,"1,898",975,923,102
三宅村,"1,620","2,481","1,356","1,125",45
御蔵島村,170,317,167,150,15
八丈町,"4,365","7,465","3,720","3,745",103
青ヶ島村,109,159,92,67,27
小笠原村,"1,492","2,625","1,451","1,174",25

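Source: 住民基本台帳による東京都の世帯と人口(町丁別・年齢別) (Households and Population of Tokyo Based on the Basic Resident Register, by town block and by age)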

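The code above performs the following steps in order.

  1. Read the file into a DataFrame with pandas' read_csv
  2. Select 総数 from the DataFrame to extract one-dimensional data (a Series)
  3. Compute the maximum and other statistics of the Series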
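
The three steps can be tried without the file. The sketch below is only an illustration: it copies three rows of population.csv into a string and reads them through io.StringIO, and the variable names csv_text and total are chosen here for the example.

```python
import io
import pandas as pd

# Three rows copied from population.csv, kept in memory so the sketch
# runs without the file
csv_text = '''市区町村,世帯数,総数,男,女,人口密度
千代田区,"35,830","63,635","31,935","31,700","5,458"
世田谷区,"479,792","908,907","431,026","477,881","15,657"
青ヶ島村,109,159,92,67,27
'''

# 1. Read the CSV into a DataFrame, treating "," as a thousands separator
df = pd.read_csv(io.StringIO(csv_text), thousands=',', index_col=0)

# 2. Select the 総数 column as a Series
total = df['総数']

# 3. Compute statistics on the Series
print(total.dtype)                     # an integer dtype
print(total.max(), total.min())        # largest and smallest values
print(total.idxmax(), total.idxmin())  # index labels of those values
```

idxmax and idxmin return the index label, here the municipality name, which is convenient because the first column was used as the index.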

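Computing the maximum and similar values with pandas is easy, but there are a few pitfalls. Look at the original table data. If you download the Tokyo data as is, the numbers are written with commas as thousands separators.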

Looking at the code above again, read_csv is given an argument named thousands. That is the key point this time. What happens if we remove it?

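import pandas as pd

# thousands is omitted here, so 総数 is read as strings such as "63,635"
df = pd.read_csv('population.csv', index_col=0)
total = df['総数']

max_value = total.max()
min_value = total.min()
mean_value = total.mean()  # raises an error: the values are strings
variance = total.var()
std_dev = total.std()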

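This code raises an error. Without thousands, the comma-separated values are read as strings, so the mean and variance cannot be computed. What about the maximum and minimum?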
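
The cause can be checked by inspecting the dtype of the column. The following is a minimal sketch on a trimmed, in-memory CSV with only two columns (an assumption made to keep the example short); the exact exception type and message may differ between pandas versions, so both TypeError and ValueError are caught.

```python
import io
import pandas as pd

# Trimmed stand-in for population.csv (only the index and 総数 columns)
csv_text = '''市区町村,総数
千代田区,"63,635"
神津島村,"1,898"
青ヶ島村,159
'''

df = pd.read_csv(io.StringIO(csv_text), index_col=0)  # no thousands=','
total = df['総数']

print(total.dtype)  # a string/object dtype, not a numeric dtype

try:
    total.mean()
    mean_failed = False
except (TypeError, ValueError) as exc:
    mean_failed = True
    print('mean() failed:', type(exc).__name__)
```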

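import pandas as pd

# thousands is omitted again, so 総数 holds strings
df = pd.read_csv('population.csv', index_col=0)
total = df['総数']

print(total)

max_value = total.max()
min_value = total.min()

# Labels: 最大値 = maximum, 最小値 = minimum
print('最大値 {0}'.format(max_value))
print('最小値 {0}'.format(min_value))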

In fact, this code does not raise an error.

最大値 908,907
最小値 1,898

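However, the minimum is reported as 1,898. This is wrong: the correct answer is 159, the population of 青ヶ島村. Because the values are strings, they are compared character by character rather than numerically, and "1,898" sorts before "159". In short, before using the Series methods, comma-separated values must be converted to numbers properly, for example by passing thousands=',' to read_csv.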
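
If the data has already been loaded without thousands, the column can also be repaired afterwards. The sketch below is one possible approach, not the only one: it rebuilds three of the string values by hand (the Series raw is constructed here for illustration), shows the string comparison, then strips the commas and converts to integers.

```python
import pandas as pd

# Strings as they appear when the column is read without thousands=','
raw = pd.Series(['908,907', '1,898', '159'],
                index=['世田谷区', '神津島村', '青ヶ島村'])

# Strings are compared character by character: ',' sorts before '5'
print('1,898' < '159')  # True
print(raw.min())        # 1,898

# Repair after the fact: remove the commas, then convert to integers
fixed = raw.str.replace(',', '', regex=False).astype(int)
print(fixed.min(), fixed.idxmin())
```

Passing thousands=',' at read time remains the simpler option when the file can be read again.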
