Telling Chinese from English in Python: how to determine the encoding of Chinese characters

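① Python: checking for digits, English letters, and Chinese characters

uchar = u''
Here uchar stands for any single Unicode character.
1. Check whether it is a Chinese character
if u'\u4e00' <= uchar <= u'\u9fa5':
    print("is a Chinese character")
else:
    print("is not a Chinese character")
2. Check whether a unicode character is an English letter
if (uchar >= u'\u0041' and uchar <= u'\u005a') or (uchar >= u'\u0061' and uchar <= u'\u007a'):
    print("is an English letter")
else:
    print("is not an English letter")
3. Check whether it is something other than a Chinese character, digit, or English letter (is_chinese, is_number, and is_alphabet are the helper functions defined in section ③)
def is_other(uchar):
    if not (is_chinese(uchar) or is_number(uchar) or is_alphabet(uchar)):
        return True
    else:
        return False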
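
The checks above look at one character at a time. As a minimal sketch for whole strings, assuming Python 3 (where every str is Unicode), the hypothetical helpers char_kind and char_kinds below report which kinds of character a string contains. The function names and category labels are illustrative and not part of the original answer.

```python
def char_kind(ch):
    """Classify one character using the same ranges as the checks above."""
    if '\u4e00' <= ch <= '\u9fa5':
        return 'chinese'
    if '0' <= ch <= '9':
        return 'digit'
    if 'a' <= ch <= 'z' or 'A' <= ch <= 'Z':
        return 'letter'
    return 'other'


def char_kinds(text):
    """Return the set of character kinds that occur anywhere in text."""
    return {char_kind(ch) for ch in text}


kinds = char_kinds('Python3 很好!')
print(sorted(kinds))
```

Returning a set keeps the three questions (digits? letters? Chinese?) answerable from one pass over the string, for example with 'chinese' in char_kinds(text).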

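② How to tell in Python whether a character is Chinese or English

Check each character's ASCII code with ord():

a - z : 97 - 122

A - Z : 65 - 90

def is_english_char(ch):
    # False unless ord(ch) lies inside one of the two letter ranges
    if not (97 <= ord(ch) <= 122) and not (65 <= ord(ch) <= 90):
        return False
    return True

The function above tells whether a character is an English letter.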

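③ How to determine the encoding of Chinese characters in Python

#!/usr/bin/env python
# -*- coding:GBK -*-

"""Tools for handling Chinese characters:
Decide whether a unicode character is a Chinese character, a digit, an English letter, or something else.
Convert full-width symbols to half-width symbols."""

__author__="internetsweeper <[email protected]>"
__date__="2007-08-04"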

def is_chinese(uchar):
    """Return True if the unicode character is a Chinese character"""
    return u'\u4e00' <= uchar <= u'\u9fa5'

def is_number(uchar):
    """Return True if the unicode character is a digit"""
    return u'\u0030' <= uchar <= u'\u0039'

def is_alphabet(uchar):
    """Return True if the unicode character is an English letter"""
    return (u'\u0041' <= uchar <= u'\u005a') or (u'\u0061' <= uchar <= u'\u007a')

def is_other(uchar):
    """Return True if the character is not a Chinese character, digit, or English letter"""
    return not (is_chinese(uchar) or is_number(uchar) or is_alphabet(uchar))

def B2Q(uchar):
    """Convert a half-width character to full-width"""
    inside_code = ord(uchar)
    if inside_code < 0x0020 or inside_code > 0x7e:  # not a half-width character: return it unchanged
        return uchar
    if inside_code == 0x0020:  # the space is special; for everything else, half-width = full-width - 0xfee0
        inside_code = 0x3000
    else:
        inside_code += 0xfee0
    return unichr(inside_code)

def Q2B(uchar):
    """Convert a full-width character to half-width"""
    inside_code = ord(uchar)
    if inside_code == 0x3000:
        inside_code = 0x0020
    else:
        inside_code -= 0xfee0
    if inside_code < 0x0020 or inside_code > 0x7e:  # result is not a half-width character: return the original
        return uchar
    return unichr(inside_code)

def stringQ2B(ustring):
    """Convert every full-width character in a string to half-width"""
    return "".join([Q2B(uchar) for uchar in ustring])

def uniform(ustring):
    """Normalize a string: full-width to half-width, then upper case to lower case"""
    return stringQ2B(ustring).lower()

def string2List(ustring):
    """Split ustring into runs of Chinese characters, letters, and digits, dropping everything else"""
    retList = []
    utmp = []
    for uchar in ustring:
        if is_other(uchar):
            if len(utmp) == 0:
                continue
            else:
                retList.append("".join(utmp))
                utmp = []
        else:
            utmp.append(uchar)
    if len(utmp) != 0:
        retList.append("".join(utmp))
    return retList

if __name__ == "__main__":
    # test Q2B and B2Q
    for i in range(0x0020, 0x007F):
        print Q2B(B2Q(unichr(i))), B2Q(unichr(i))

    # test uniform
    ustring = u'中國人名a高頻A'
    ustring = uniform(ustring)
    ret = string2List(ustring)
    print ret

The above is reposted from http://hi..com/fenghua1893/item/d1a71d5ac47ffdcfd3e10cd1
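
The module above targets Python 2 (unichr, the print statement). As a minimal Python 3 sketch of the same full-width to half-width mapping, the block below uses str.translate with a lookup table instead of a per-character function. The names FULL_TO_HALF and to_halfwidth are illustrative assumptions, not part of the reposted module.

```python
# Full-width forms U+FF01..U+FF5E sit exactly 0xFEE0 above ASCII 0x21..0x7E;
# the ideographic space U+3000 maps to the ordinary space U+0020.
FULL_TO_HALF = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}
FULL_TO_HALF[0x3000] = 0x0020


def to_halfwidth(text):
    """Replace full-width ASCII variants with their half-width forms."""
    return text.translate(FULL_TO_HALF)


def uniform(text):
    """Half-width conversion followed by lower-casing, like uniform() above."""
    return to_halfwidth(text).lower()


print(uniform('ＡＢＣ　１２３，中文'))
```

Characters outside the table, Chinese characters included, pass through translate unchanged, which matches the "return the original character" branches of Q2B.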

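This problem was solved while writing a MkIV preprocessor: a string that mixes Chinese and English has to be split into English and Chinese substrings. For example, "我的 English 學的不好" is split into the three substrings "我的", " English " and "學的不好".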
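1. A uniform encoding for mixed Chinese-English strings
The least laborious way to process a mixed Chinese-English string is to convert it to Unicode, so that a Chinese character and an English letter each occupy one unit of the same width in memory. Python suits this job well because it uses Unicode internally: to support operations on Unicode strings, Python 2 has a built-in unicode type that reimplements all the methods of the string object, and the codecs module handles decoding and encoding strings in the various encodings.
For example, the following Python 2 code converts a UTF-8 encoded mixed Chinese-English string to Unicode:
# -*- coding:utf-8 -*-
a = "我的 English 學的不好"
print type(a), len(a), a
b = unicode(a, "utf-8")
print type(b), len(b), b
The string a is UTF-8 encoded; the built-in unicode type converts it into the Unicode string b. Running the code gives the output below. Comparing the lengths of a and b, len(b) is clearly the sensible result: a is counted in bytes (6 Chinese characters × 3 bytes + 9 ASCII bytes = 27), b in characters (6 + 9 = 15).
<type 'str'> 27 我的 English 學的不好
<type 'unicode'> 15 我的 English 學的不好
One thing to note is that, although Unicode is billed as a "unified code", it exists in two forms:
UCS-2: a 16-bit code with 2^16 = 65536 code points;
UCS-4: a 32-bit code. The current rule is that the first bit of the first byte is 0, so it has 2^31 = 2147483648 code points, but only those between 0x00000000 and 0x0010FFFF are in use, 1114112 in total.
The value of the variable maxunicode in Python's sys module tells whether the current Python build uses UCS-2 or UCS-4:
import sys
print sys.maxunicode
If sys.maxunicode is 1114111, the build is UCS-4; if it is 65535, the build is UCS-2.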
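
In Python 3 the same distinction is drawn between bytes and str rather than between str and unicode. A minimal sketch, assuming Python 3.3 or later (where sys.maxunicode is always 1114111); the variable names text and raw are illustrative:

```python
import sys

text = "我的 English 學的不好"   # str: a sequence of Unicode code points
raw = text.encode("utf-8")       # bytes: the UTF-8 encoded form

print(type(raw).__name__, len(raw))    # length counted in bytes
print(type(text).__name__, len(text))  # length counted in characters

# Decoding the bytes gives back the original text.
assert raw.decode("utf-8") == text

print(sys.maxunicode)
```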

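2. Splitting a mixed Chinese-English string
Once the Chinese and English text share one encoding, splitting them is simple. First prepare one collector each for the Chinese and the English text; two empty string objects will do, say zh_gather and en_gather. Then prepare a list object that stores the values of zh_gather and en_gather in the order they are split off. The following Python function takes a mixed Chinese-English Unicode string and returns a list holding the Chinese and English substrings.
def split_zh_en(zh_en_str):

    zh_en_group = []
    zh_gather = ""
    en_gather = ""
    zh_status = False

    for c in zh_en_str:
        if not zh_status and is_zh(c):
            # switched from English to Chinese: flush the English run
            zh_status = True
            if en_gather != "":
                zh_en_group.append([mark["en"], en_gather])
                en_gather = ""
        elif not is_zh(c) and zh_status:
            # switched from Chinese to English: flush the Chinese run
            zh_status = False
            if zh_gather != "":
                zh_en_group.append([mark["zh"], zh_gather])
        if zh_status:
            zh_gather += c
        else:
            en_gather += c
            zh_gather = ""

    # flush whatever is left after the last character
    if en_gather != "":
        zh_en_group.append([mark["en"], en_gather])
    elif zh_gather != "":
        zh_en_group.append([mark["zh"], zh_gather])

    return zh_en_group
The code works as follows: while iterating over the mixed string zh_en_str it identifies the characters one by one. A Chinese character is appended to zh_gather, an English character to en_gather. zh_status records which of the two is currently being collected; whenever its value flips, the Chinese or English substring collected so far is appended to zh_en_group.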
The conditions that test whether a character of zh_en_str is Chinese call a function is_zh(), implemented as follows:
def is_zh(c):
    x = ord(c)
    # Punct & Radicals
    if x >= 0x2e80 and x <= 0x33ff:
        return True

    # Fullwidth Latin Characters
    elif x >= 0xff00 and x <= 0xffef:
        return True

    # CJK Unified Ideographs
    # (Extension A, U+3400-U+4DBF, is not covered by this range)
    elif x >= 0x4e00 and x <= 0x9fbb:
        return True
    # CJK Compatibility Ideographs
    elif x >= 0xf900 and x <= 0xfad9:
        return True

    # CJK Unified Ideographs Extension B
    elif x >= 0x20000 and x <= 0x2a6d6:
        return True

    # CJK Compatibility Supplement
    elif x >= 0x2f800 and x <= 0x2fa1d:
        return True

    else:
        return False
This code comes from the XeTeX preprocessor written by jjgod.
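To make the separated Chinese and English substrings easier to use, I tag each of them when storing it in the zh_en_group list, with mark["zh"] or mark["en"] respectively. mark is a dict object defined as follows:
mark = {"en": 1, "zh": 2}
When the English or Chinese substrings in zh_en_group are processed later, the tag makes it quick to decide whether a substring is Chinese or English, for example:
for item in zh_en_group:
    if item[0] == mark["en"]:
        pass  # do something with the English substring item[1]
    else:
        pass  # do something with the Chinese substring item[1]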
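
As a compact alternative to the hand-written state machine, here is a minimal Python 3 sketch that gets the same grouping from itertools.groupby. It is not the author's method: the function name split_runs and the string tags "zh" and "en" are assumptions, and ZH_RANGES simply repeats the code-point ranges used by is_zh() above.

```python
from itertools import groupby

# Code-point ranges treated as Chinese, copied from is_zh() above.
ZH_RANGES = (
    (0x2e80, 0x33ff),
    (0xff00, 0xffef),
    (0x4e00, 0x9fbb),
    (0xf900, 0xfad9),
    (0x20000, 0x2a6d6),
    (0x2f800, 0x2fa1d),
)


def is_zh(c):
    """True if the code point of c falls in one of ZH_RANGES."""
    x = ord(c)
    return any(lo <= x <= hi for lo, hi in ZH_RANGES)


def split_runs(text):
    """Split text into [tag, substring] pairs of consecutive zh / non-zh characters."""
    return [["zh" if zh else "en", "".join(run)]
            for zh, run in groupby(text, key=is_zh)]


for tag, part in split_runs("我的 English 學的不好"):
    print(tag, repr(part))
```

groupby starts a new group each time the key changes, which is the same event that flips zh_status in the loop version.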

④ Python: checking whether text is Chinese

Method 1 (Python 2):

isinstance(s, str) checks whether s is an ordinary byte string

isinstance(s, unicode) checks whether s is a unicode string

or

if not isinstance(s, unicode):
    s = unicode(s, "utf-8")
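
In Python 3 the two types are bytes and str, and the name unicode no longer exists. A minimal sketch of the same normalization, assuming the incoming bytes are UTF-8 encoded; the helper name to_text is hypothetical.

```python
def to_text(s, encoding="utf-8"):
    """Return s as str, decoding it first if it is bytes."""
    if isinstance(s, bytes):
        return s.decode(encoding)
    return s


print(to_text(b"abc"))
print(to_text("中文".encode("utf-8")))
print(to_text("中文"))
```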

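Method 2:

Detecting character encodings with Python chardet
chardet makes it easy to detect the encoding of a string or a file. This matters most for Chinese web pages, some of which use GBK/GB2312 and some UTF-8. If you need to crawl pages, knowing the page encoding is important, and although an HTML page carries a charset tag, that tag is sometimes wrong. This is where chardet is a big help.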

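chardet example
>>> import urllib
>>> rawdata = urllib.urlopen('http://www.google.cn/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'confidence': 0.98999999999999999, 'encoding': 'GB2312'}
>>>
chardet's detect function can be called directly to detect the encoding of the data passed in. It returns a dictionary with two entries: the confidence of the detection and the detected encoding.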

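Installing chardet
After downloading chardet, unpack the archive and put the chardet folder directly in the application directory; import chardet then works straight away.

Alternatively, use the setup.py installer to copy chardet into Python's system directory, so that every Python program can simply use import chardet.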

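⑤ Python: checking whether a string contains English

Use the isalpha() method. Python's isalpha() method checks whether a string consists only of letters: it returns True if the string has at least one character and all characters are letters, and False otherwise.

Python's isalpha() takes no arguments; it is called on the string to be tested. The C library function isalpha(), by contrast, takes the character to be tested as its argument, which may be a valid character (converted to int) or EOF (an invalid character).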
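
One caveat when using isalpha() to look for English: on a Unicode string it returns True for any Unicode letter, Chinese characters included, so it cannot separate the two on its own. A minimal Python 3 sketch that restricts the test to the 52 ASCII letters via string.ascii_letters; the helper name contains_english is an assumption.

```python
import string

ASCII_LETTERS = set(string.ascii_letters)  # 'a'..'z' and 'A'..'Z'


def contains_english(text):
    """True if text has at least one ASCII letter."""
    return any(ch in ASCII_LETTERS for ch in text)


print("中文".isalpha())             # True: Chinese characters are Unicode letters
print(contains_english("中文"))      # False
print(contains_english("中文 abc"))  # True
```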

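(5) Telling Chinese from English in Python: further reading

It is commonly assumed that only "abc...xyzABC...XYZ" are letters, but that is not right. The set of letters is not fixed: different languages and cultures may include different letters. In a "Simplified Chinese" environment, for example, Cyrillic БГЁ and Greek ΣΩΔΨΦ (Greek letters are common in mathematical and physical formulas) all count as letters.

The setlocale() function changes a program's locale so that the program uses a different character set and thereby supports a different language and culture. It is often said that a letter is either lowercase or uppercase, and that every lowercase letter has exactly one corresponding uppercase letter and vice versa. That holds for the default locale ("C"), but not necessarily for other locales.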

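⑥ How to tell in Python whether a character is Chinese or an English letter

The check works as follows:

1. Check each character's ASCII code with ord(): a - z : 97 - 122, A - Z : 65 - 90.

2. Wrap the range test in a function:
def is_english_char(ch):
    if not (97 <= ord(ch) <= 122) and not (65 <= ord(ch) <= 90):
        return False
    return True

Python's design sticks to a clear, uniform style, which has made it an easy-to-read, easy-to-maintain, general-purpose language that is popular with a large number of users.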

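(6) Telling Chinese from English in Python: further reading

Common Python statements:

1. The if statement runs a block when its condition holds. It is often combined with else and elif (equivalent to else if).

2. The for statement iterates over lists, strings, dictionaries, sets, and other iterables, handling each element in turn.

3. The while statement runs a block repeatedly as long as its condition is true.

4. The try statement, combined with except and finally, handles exceptions raised while the program runs.

5. The class statement defines a type.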

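⑦ Python: checking whether a string contains Chinese characters

First, strings in Python are represented in Unicode, so Unicode usually serves as the intermediate encoding when converting between encodings.
decode converts a string in some other encoding to Unicode; for example, a.decode('utf-8') converts a UTF-8 encoded string to Unicode.
encode converts a Unicode string to a string in some other encoding; for example, b.encode('utf-8') converts a Unicode string to a UTF-8 encoded string.

Checking whether a string contains Chinese characters:
With that background the problem is easy to solve. Here is the code: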

#-*- coding:utf-8 -*-

import sys
reload(sys)
sys.setdefaultencoding('utf8')  # Python 2 only: lets .decode() below also accept unicode input

def check_contain_chinese(check_str):
    for ch in check_str.decode('utf-8'):
        if u'\u4e00' <= ch <= u'\u9fff':
            return True
    return False

if __name__ == "__main__":
    print check_contain_chinese('中國')
    print check_contain_chinese('xxx')
    print check_contain_chinese('xx中國')

Result:
True
False
True
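
Under Python 3, where str is already Unicode, the decode step and the default-encoding setup are unnecessary. A minimal sketch of the same test written as a regular expression over the same U+4E00 to U+9FFF range; the names CHINESE_CHAR and contains_chinese are assumptions.

```python
import re

# Same range as the loop above: the CJK Unified Ideographs block.
CHINESE_CHAR = re.compile(r'[\u4e00-\u9fff]')


def contains_chinese(text):
    """True if at least one character of text is in U+4E00..U+9FFF."""
    return CHINESE_CHAR.search(text) is not None


print(contains_chinese('中國'))
print(contains_chinese('xxx'))
print(contains_chinese('xx中國'))
```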

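⑧ Python: how to use string indexing to split a string that contains both Chinese and English, such as "府綢 poplin 府綢縐 crepe poplin......"

Python 3. The text is actually a scanned Chinese-English dictionary, so each Chinese term has to be matched with its English counterpart; ideally the result can be saved to Excel.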
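
One possible approach to the question above, sketched under assumptions: each entry is a run of Chinese characters followed by its English gloss, and a CSV file (which Excel can open) is an acceptable output. The function names, the regular expression, and the column headers are illustrative.

```python
import csv
import re

# A run of Chinese characters, then everything up to the next Chinese character.
ENTRY = re.compile(r'([\u4e00-\u9fff]+)\s*([^\u4e00-\u9fff]+)')


def split_entries(text):
    """Return (chinese, english) pairs from text like '府綢 poplin 府綢縐 crepe poplin'."""
    return [(zh, en.strip()) for zh, en in ENTRY.findall(text)]


def save_csv(pairs, path):
    """Write the pairs to a CSV file, one dictionary entry per row."""
    # utf-8-sig adds a BOM so that Excel recognizes the file as UTF-8.
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(['chinese', 'english'])
        writer.writerows(pairs)


pairs = split_entries('府綢 poplin 府綢縐 crepe poplin')
print(pairs)
```

Scanned text usually contains OCR noise, so real input would probably need cleaning before a pattern this simple pairs the entries correctly.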

⑨ Python: checking whether a string is a piece of Chinese text
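
A minimal Python 3 sketch for this question, assuming "a piece of Chinese text" means that every character lies in the CJK Unified Ideographs block U+4E00 to U+9FFF, so punctuation and spaces make the check fail. The names ALL_CHINESE and is_all_chinese are assumptions.

```python
import re

ALL_CHINESE = re.compile(r'[\u4e00-\u9fff]+')


def is_all_chinese(text):
    """True if text is non-empty and consists only of characters in U+4E00..U+9FFF."""
    return ALL_CHINESE.fullmatch(text) is not None


print(is_all_chinese('中文'))
print(is_all_chinese('中文abc'))
print(is_all_chinese(''))
```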

