Science and technology
Computer vision
Eye robot
Poor eyesight remains one of the main obstacles to letting robots loose among humans. But it is improving, in part by aping natural vision.
ROBOTS are getting smarter and more agile all the time. They disarm bombs, fly combat missions, put together complicated machines, even play football. Why, then, one might ask, are they nowhere to be seen, beyond war zones, factories and technology fairs? One reason is that they themselves cannot see very well. And people are understandably wary of purblind contraptions bumping into them willy-nilly in the street or at home.
All that a camera-equipped computer sees is lots of picture elements, or pixels. A pixel is merely a number reflecting how much light has hit a particular part of a sensor. The challenge has been to devise algorithms that can interpret such numbers as scenes composed of different objects in space. This comes naturally to people and, barring certain optical illusions, takes no time at all as well as precious little conscious effort. Yet emulating this feat in computers has proved tough.
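The raw input described above can be made concrete with a toy example. To the machine, an image is nothing but an array of brightness numbers; all the pixel values below are illustrative.

```python
# A toy 4x4 greyscale "image": each pixel is merely a number recording
# how much light hit that part of the sensor (0 = black, 255 = white).
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]

# A person instantly sees a vertical edge between columns 1 and 2; the
# machine sees only sixteen numbers, and must infer the edge from jumps
# in brightness between neighbouring pixels.
jump = image[0][2] - image[0][1]
print(jump)  # 255: a large jump, i.e. an edge
```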
In natural vision, after an image is formed in the retina it is sent to an area at the back of the brain, called the visual cortex, for processing. The first nerve cells it passes through react only to simple stimuli, such as edges slanting at particular angles. They fire up other cells, further into the visual cortex, which react to simple combinations of edges, such as corners. Cells in each subsequent area discern ever more complex features, with those at the top of the hierarchy responding to general categories like animals and faces, and to entire scenes comprising assorted objects. All this takes less than a tenth of a second.
The outline of this process has been known for years, and in the late 1980s Yann LeCun, now at New York University, pioneered an approach to computer vision that tries to mimic the hierarchical way the visual cortex is wired. He has been tweaking his convolutional neural networks ever since.
Seeing is believing
A ConvNet begins by swiping a number of software filters, each several pixels across, over the image, pixel by pixel. Like the brain's primary visual cortex, these filters look for simple features such as edges. The upshot is a set of feature maps, one for each filter, showing which patches of the original image contain the sought-after element. A series of transformations is then performed on each map in order to enhance it and improve the contrast. Next, the maps are swiped again, but this time rather than stopping at each pixel, the filter takes a snapshot every few pixels. That produces a new set of maps of lower resolution. These highlight the salient features while reining in computing power.
The whole process is then repeated, with several hundred filters probing for more elaborate shapes rather than just a few scouring for simple ones. The resulting array of feature maps is run through one final set of filters. These classify objects into general categories, such as pedestrians or cars.
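The swipe-and-subsample steps just described can be sketched in a few lines of Python. The 3x3 edge filter and the toy image below are illustrative stand-ins, not the filters a trained ConvNet would actually use.

```python
def convolve(image, kernel):
    """Swipe a small filter over the image pixel by pixel; the result is a
    feature map whose values are high wherever the filter's pattern occurs."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]

def subsample(fmap, step=2):
    """Swipe again, but take a snapshot only every few pixels, producing a
    lower-resolution map that keeps the salient features while saving work."""
    return [row[::step] for row in fmap[::step]]

# An edge-seeking filter, like the cells of the primary visual cortex: it
# responds where brightness rises from left to right. (Values illustrative.)
edge_filter = [[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]]

# A toy image: dark on the left, bright on the right, i.e. one vertical edge.
image = [[0, 0, 0, 9, 9, 9] for _ in range(6)]

feature_map = convolve(image, edge_filter)   # high values along the edge
low_res = subsample(feature_map)             # same features, fewer pixels
print(low_res)  # [[0, 27], [0, 27]]: the edge survives the subsampling
```

A real ConvNet stacks many such convolution and subsampling layers, with hundreds of learned filters per layer rather than one hand-written one.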
Many state-of-the-art computer-vision systems work along similar lines. The uniqueness of ConvNets lies in where they get their filters. Traditionally, these were simply plugged in one by one, in a laborious manual process that required an expert human eye to tell the machine what features to look for, in future, at each level. That made systems which relied on them good at spotting narrow classes of objects but inept at discerning anything else. Dr LeCun's artificial visual cortex, by contrast, lights on the appropriate filters automatically as it is taught to distinguish the different types of object.
When an image is fed into the unprimed system and processed, the chances are it will not, at first, be assigned to the right category. But, shown the correct answer, the system can work its way back, modifying its own parameters so that the next time it sees a similar image it will respond appropriately. After enough trial runs, typically 10,000 or more, it makes a decent fist of recognising that class of objects in unlabelled images.
This still requires human input, though. The next stage is unsupervised learning, in which instruction is entirely absent. Instead, the system is shown lots of pictures without being told what they depict. It knows it is on to a promising filter when the output image resembles the input. In a computing sense, resemblance is gauged by the extent to which the input image can be recreated from the lower-resolution output. When it can, the filters the system had used to get there are retained.
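The reconstruction criterion above can be sketched as follows. Here simple 2x2 block averaging stands in for a candidate filter; the images and the decision rule are illustrative, not Dr LeCun's actual method.

```python
def downsample(image):
    """A stand-in filter: average each 2x2 block into one pixel,
    producing a lower-resolution output."""
    return [[(image[i][j] + image[i][j + 1]
              + image[i + 1][j] + image[i + 1][j + 1]) / 4
             for j in range(0, len(image[0]), 2)]
            for i in range(0, len(image), 2)]

def upsample(small):
    """Recreate a full-size image from the low-res output by repeating
    each of its pixels."""
    return [[small[i // 2][j // 2] for j in range(2 * len(small[0]))]
            for i in range(2 * len(small))]

def reconstruction_error(image):
    """How far the recreated image is from the input. A small error means
    the low-resolution representation kept the information, so the filters
    that produced it are worth retaining."""
    recreated = upsample(downsample(image))
    return sum(abs(a - b)
               for orig_row, rec_row in zip(image, recreated)
               for a, b in zip(orig_row, rec_row))

smooth = [[5, 5, 3, 3],   # broad patches: survives downsampling intact
          [5, 5, 3, 3],
          [7, 7, 1, 1],
          [7, 7, 1, 1]]
spiky = [[9, 0, 0, 0],    # fine detail: smeared away by averaging
         [0, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 9]]
print(reconstruction_error(smooth))  # 0.0: fully recoverable, filter retained
print(reconstruction_error(spiky))   # large: information lost, filter rejected
```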
In a tribute to nature's nous, the lowest-level filters arrived at in this unaided process are edge-seeking ones, just as in the brain. The top-level filters are sensitive to all manner of complex shapes. Caltech-101, a database routinely used for vision research, consists of some 10,000 standardised images of 101 types of just such complex shapes, including faces, cars and watches. When a ConvNet with unsupervised pre-training is shown the images from this database it can learn to recognise the categories more than 70% of the time. This is just below what top-scoring hand-engineered systems are capable of—and those tend to be much slower.
This approach, which Geoffrey Hinton of the University of Toronto, a doyen of the field, has dubbed deep learning, need not be confined to computer vision. In theory, it ought to work for any hierarchical system: language processing, for example. In that case individual sounds would be low-level features akin to edges, whereas the meanings of conversations would correspond to elaborate scenes.
For now, though, ConvNet has proved its mettle in the visual domain. Google has been using it to blot out faces and licence plates in its Streetview application. It has also come to the attention of DARPA, the research arm of America's Defence Department. This agency provided Dr LeCun and his team with a small roving robot which, equipped with their system, learned to detect large obstacles from afar and correct its path accordingly—a problem that lesser machines often, as it were, trip over.
The scooter-sized robot was also rather good at not running into the researchers. In a selfless act of scientific bravery, they strode confidently in front of it as it rode towards them at a brisk walking pace, only to see it stop in its tracks and reverse. Such machines may not quite yet be ready to walk the streets alongside people, but the day they can is surely not far off.