lí kám ū thiann-tio̍h in ê giân

編輯歷史

時間	作者	版本
2017-07-07 16:42 – 16:42	(unknown)	r0 – r1
顯示 diff + lí kám ū thiann-tio̍h in ê giân + + Github page: + https://github.com/a-tsioh/kam-u-thiann-tioh + + 目的 + 讓人瀏覽在立法院附近（包括PTT，IRC,網路）人人所說的言 + Provide a webpage to search and browse public speeches made around the 立法院 + + + + User story + 使用者可能無法當時一直聽或看文博，或許想看global picture + 加上幾個月後，想記起現在在發生什麼 + 從一個PO，可以看到上下一個跟類似的（同一個話題） + + A typical user may not have followed the whole event or missed part of it. He wants to have a clearer picture of the though shared on the different stages around the legislative yuan. + + 需要什麼 + understand and select what source of data may be relevant and shall be indexed + 選重要資料的來源， + Structure the data so that the meta-data can be used (timestamps, location, kind of intervention/speaker) + Create a semantic space to find similar messages + Define a list of relevant keyword that should be spotted by the system to relate messages. + word cloud or sth like that for what people are saying now + a timeline for quickly understanding what have been said + a map for quickly identifying where the message is delivered + analyze semantic to recognize stand point of specific message + is this possible? (can you explain more ?give an example) + e.g., "KMT的奧步啊" "深藍是超想當中國人的" have similar stand pointa + sounds hard if we don't define the topics on which the standpoints are given beforehand + FYI, Gene Hong (黑貘) did a related project: http://gene.speaking.tw/2014/03/tvbs-tvbs-10-1129.html + + The Data + See 319 Event Data Collection + (minimalist fetch and json from irc included in this project, will use 319 Event Data Collection when available) + + Transcripts typically include: + Timestamp + Place + person + text + may have an English translation + From the text, we may extract some keywords (like names of politic figures, event, places...) + + dealing with videos + Just a question, if we precise location and time of the beginning of each recording, does it make sense to link it to the text feeds ? + + + 用的技術 + Django on the serverside ( I need some python libs) + elasticsearch for the index + a-tsioh: just curious, how do you plan to host the server? or do we have a hosted elaticsearch server ready? (cuz one server should be able to serve multiple projects, if the loading is not heavy) + got a server in Europe, if the link is not fast enough, we can see how to host it in Taiwan, installing elasticsearch is extremely simple. + Livescript on the clientside, we may need some d3js (just because I like it) + select a UI toolkit (no preference for now + + ---- + 先把FB　的 messages 放在這 + + _________ + + 大家好！ + 我身爲國語有限制的外國人。 + 在現在發生的事件，資料已經多得難以處理。 + 很多有意思的話，我注意不到。 + text 和 video 的 livestream 事實上我大腦無法處理。（你們會有這個問題嗎？） + 看非中文的 livestream 就有可能無法處理 XD + 找到的資料大部分都是 chronological. + 現在有沒有計劃作一個讓人look back at what have been said + just like a search engine over the transcript/videos + 或許有而有只是沒發現。 + 沒有的話，我願意來負責 + + Kirby Wu 我有 logbot 跟 bbs 的 crawler, 文字直播跟鄉民的消息都可以做.. (logbot 其實比較好的是拿 dump database or api endpoint ) + video 則需要有逐字稿或至少自動翻譯.. facebook 可以建立一個專頁, 請大家把相關訊息轉入, 再用程式自動備份...其他資料來源就得 case by case? + PTT 八卦板上有很多其他資料的 link, 至少廣為流傳的都不會漏掉 + + Pierre Magistry I was thinking about starting with the live transcription archive (and 文播記錄) that are already timestamped, this may help to align Mandarin and English and maybe even video if possible + Pierre Magistry one usecase would also be to allow people that were not following to catch up (including foreign press) + Pierre Magistry 有些無關的事我得先處理。晚一點會回來。 + 有興趣的人請舉手。計劃會需要比我多瞭解情況和資料的人(算.txt租吧)還有UI-designer 我個人會看一下怎麼用elasticsearch 作　search and classification，也可以pre-process data。 + 如果你在臺大附近，今晚就可以見面，不然網路上也可以 + + Ymow Wu 我前天做好一個，有點像是包在kirby你說的架構底下，我原本是希望報名上台發表言論的人，能夠用這系統報名，直接接到文字轉播系統，但是目前可能會暫緩，先用新聞瀏覽器的模式擴大傳播效應，自動翻譯我原本要把文字轉播丟到google翻譯再接回來直接放在Androdi app，但是 + @Sunny Chien 說： + google翻譯不可行, 那成品還不夠精準，都要中英日三語直播，我們可以幫忙的譯者會挺你們到底 + 所以我覺得能用logbot解決當然很好，現在可能先以HumanAPI為主，當然如果logbot翻譯水準很高，又另當別論 + * + *FYI: http://share.inside.com.tw/posts/4292